Plans for Onasys in S6 and Beyond

Background

Onasys is essentially a smart cron daemon that knows about the LIGO segment and frame file databases, understands pipeline data requirements, and is able to combine these things to perform an online analysis task by repeatedly constructing and submitting Condor DAGs.  Onasys acts as glue code between Condor, the segment database, and the user's analysis pipeline.
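The loop described above can be sketched roughly as follows.  This is only an illustration of the "discover data, build a DAG, submit it, repeat" idea, not Onasys' actual code:  the three callables (find_new_segments, build_dag, submit_dag) and the parameter names are hypothetical placeholders for the segment database query, the user's pipeline generator, and the Condor submission step.

```python
import time


def run_online_loop(find_new_segments, build_dag, submit_dag,
                    period_s, max_iterations=None, sleep=time.sleep):
    """Repeatedly discover new data, build a DAG for it, and submit it.

    All three callables are hypothetical stand-ins supplied by the caller.
    Returns the list of whatever submit_dag() returned for each submission.
    """
    submitted = []
    iteration = 0
    while max_iterations is None or iteration < max_iterations:
        segments = find_new_segments()
        if segments:
            # Only build and submit a DAG when there is something to analyze.
            dag = build_dag(segments)
            submitted.append(submit_dag(dag))
        iteration += 1
        sleep(period_s)
    return submitted
```

In a real daemon max_iterations would be left as None so the loop runs until the process is stopped.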

The first prototype of Onasys was tested at Hanford during E11, and the first application of the system to produce useful output was during S4 when Onasys was used as the supervisory tool kit for the excess power, BNS inspiral, and kleineWelle pipelines.  It has since been used as the supervisory tool kit for h(t) production and for a number of other gravitational wave search pipelines.

I myself used Onasys as the supervisory tool kit for the excess power pipeline nearly continuously from the start of S4, through the post-S4 astrowatch, and for the first year or so of S5, a period of nearly 18 months of continuous online analysis.

What I think has been learned

Things I think Onasys has done well:

Things I think Onasys is weak in:

Things Onasys is missing altogether:

Things users are missing altogether:

Thoughts On Where to Go

No Increase of Sloppiness

Onasys has always been intended for production data analysis, not for the quickie production of toy triggers.  This means that Onasys goes to great lengths to ensure that only the correct data is analyzed, that all possible data is analyzed, that the same data is not analyzed twice, and so on:  all the things that should be ensured in a real search for gravitational waves.
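The "all possible data, but never the same data twice" requirement amounts to subtracting the already-analyzed segments from the available ones.  A minimal sketch, assuming segments are plain half-open (start, end) tuples of GPS seconds rather than objects from any particular segment library; the function names are made up for this illustration:

```python
def coalesce(segments):
    """Merge overlapping or touching [start, end) segments."""
    merged = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        elif end > start:
            merged.append((start, end))
    return merged


def unanalyzed(available, analyzed):
    """Return the parts of `available` that `analyzed` does not cover."""
    result = []
    done = coalesce(analyzed)
    for start, end in coalesce(available):
        cursor = start
        for d_start, d_end in done:
            if d_end <= cursor:
                continue  # this analyzed segment lies before the cursor
            if d_start >= end:
                break  # no later analyzed segment can overlap
            if d_start > cursor:
                result.append((cursor, d_start))
            cursor = max(cursor, d_end)
            if cursor >= end:
                break
        if cursor < end:
            result.append((cursor, end))
    return result
```

For example, with available = [(0, 100), (200, 300)] and analyzed = [(0, 50), (250, 260)], the function returns [(50, 100), (200, 250), (260, 300)].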

Ensuring these things requires Onasys to do things that would not be needed otherwise.  For example, Onasys needs to ensure that its internal state is preserved across invocations so that if it is stopped and restarted the new daemon process will not re-analyze old data, and this means maintaining an on-disk state file or checkpoint image.  Things like this add additional points of failure for the daemon, and are perceived to be unnecessary baggage by users who are not looking to use Onasys for production data analysis.
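To illustrate what preserving state across invocations involves, here is a minimal sketch of an on-disk state file written atomically, so that a daemon killed mid-write does not leave a corrupt file behind.  The JSON format, the function names, and the contents of the state dictionary are assumptions made for this example, not a description of Onasys' actual checkpoint format.

```python
import json
import os
import tempfile


def save_state(path, state):
    """Atomically write `state` (a JSON-serializable dict) to `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory, then rename over
    # the target so readers see either the old file or the new one.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)
    finally:
        # Clean up the temporary file if the rename never happened.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)


def load_state(path, default=None):
    """Read the state file, or return `default` ({} if None) when absent."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return {} if default is None else default
```

A restarted daemon would call load_state() first and resume from the segments recorded there instead of starting over.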

I do not at this time intend to make these features optional.  My belief is that we should be able to do production data analysis online, and I want other people to perceive that possibility to exist as well.  Part of that is having a tool like Onasys available that people can already trust will correctly supervise a final production analysis.

I believe the following are the reasons we can't do production data analysis online today:

I am convinced that "the lack of a correct pipeline planning tool" is not on that list.

No Latency Reduction

Onasys is not a realtime system, and will never evolve into one.  Onasys has always been meant as the means by which an existing offline analysis pipeline can be transitioned into an online analysis.  This is accomplished by running what is essentially an off-the-shelf offline analysis pipeline on short pieces of data in a loop.  The loop iteration period, which is user-selectable, sets the latency that Onasys introduces into the analysis.  Reducing the loop iteration period lowers the latency but at the cost of increased resource pressure:  more DAGs are run per unit time, which leads to increased disk usage, increased file system demands (large numbers of files), and more pain when a failure occurs.

Onasys has demonstrated the ability to introduce latencies as small as 10 minutes into an online analysis without any difficulties, and most online analyses are run with loop iteration periods close to this.  It's likely that an iteration period as small as 5 minutes is still practical, but if a user wishes to push the latency any lower than that then they will need to switch to some other technology like a stream-based data analysis pipeline.
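The resource cost of a shorter iteration period is simple arithmetic, assuming one DAG is submitted per loop iteration (an assumption of this sketch; iterations with no new data would submit none):

```python
def dags_per_day(period_minutes):
    """Number of DAGs submitted per day at one DAG per loop iteration."""
    return 24 * 60 / period_minutes


print(dags_per_day(10))  # 144.0
print(dags_per_day(5))   # 288.0
```

Halving the period from 10 to 5 minutes doubles the number of DAGs, and with it the number of files and log directories, that accumulate each day.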

Work on Job Database and Other State Information

The job database is both Onasys' strength and weakness.  The job database's ability to quickly, reliably, and remotely report "all green", and the assistance it provides in diagnosing problems are essential tools for online data analysis and the current job database performs both of these tasks very well.  The job database, however, is easily the greatest source of problems for end users.

Since Onasys development began, Condor has introduced its own job tracking database called Quill.  The Quill database in principle provides all of the information currently available in the Onasys job database and more.  The Quill database is a more reliable source of information than the dagdbUpdator daemons used by Onasys for job status monitoring, and the code behind it is maintained by other people (less work for us).  For all of these reasons, I would like future versions of Onasys to rely on the Quill database for job progress information instead of the dagdbUpdator daemons.

The Quill database is probably not suitable for use as a back-end to a web interface like Onasys' current summary status pages.  I believe that several relatively complex queries on enormous tables are required in order to produce the sort of summary information presented by the web pages.  For this reason, I imagine there will still be an Onasys job database but it will be populated from the Quill database instead of from the individual DAG log files as is done currently.

My current thinking is that we should move away from a central job database to local databases (e.g., SQLite databases) maintained individually in association with each online analysis.  I imagine each Onasys daemon populating a private SQLite job database file from queries to the Quill database as part of the daemon's loop.  This approach decouples the Quill database from web servers and the like.  Besides job status the private database file could be used to store all of the daemon's internal state information, including the daemon metadata currently exposed to the web through .pid files and the like.  Even the daemon's segment lists could be stored in the file.
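A minimal sketch of what such a private database file might look like, using Python's standard sqlite3 module.  The table and column names are invented for this illustration, and the rows passed to record_jobs() are assumed to have been obtained from the Quill database by some query not shown here, since the details of that query depend on the Quill schema.

```python
import sqlite3

# Hypothetical schema: daemon metadata, analyzed segments, and job status
# all live in one file per online analysis.
SCHEMA = """
CREATE TABLE IF NOT EXISTS daemon_meta (
    key TEXT PRIMARY KEY,
    value TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS analyzed_segments (
    start_time INTEGER NOT NULL,
    end_time INTEGER NOT NULL
);
CREATE TABLE IF NOT EXISTS jobs (
    job_id TEXT PRIMARY KEY,
    dag_name TEXT NOT NULL,
    status TEXT NOT NULL
);
"""


def open_state_db(path):
    """Open (creating if necessary) the daemon's private database file."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_jobs(conn, rows):
    """Insert or update (job_id, dag_name, status) rows in one transaction."""
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO jobs (job_id, dag_name, status) "
            "VALUES (?, ?, ?)",
            rows,
        )


def status_summary(conn):
    """Return a {status: count} dictionary for a summary page."""
    return dict(conn.execute(
        "SELECT status, COUNT(*) FROM jobs GROUP BY status"))
```

A summary web page then only needs read access to the file and a call like status_summary(), without touching the Quill database itself.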

Exposing online analysis status information to the web becomes more challenging in this configuration, but the total number of components is greatly reduced, the reliability is increased, and new features become possible (like charts on the web showing which segments have been analyzed, and so on).  This sort of configuration also helps users who just want to use Onasys for toy trigger production:  just delete the database file and your daemon is completely reset, including clearing up all the old job state information (get rid of embarrassing red lights).

Onasys-Related Things

There are a number of things that are closely related to Onasys, but for which the responsibility lies elsewhere.

Summary

Essentially, my intention is for Onasys to evolve, not be re-engineered.  I believe Onasys accomplishes the task it is intended to accomplish, but there are some loose ends that need cleaning up.  The issue requiring the most immediate attention is the reliability of the job monitoring mechanism, which I hope to address by migrating from dagdbUpdator daemons to Quill + per-daemon SQLite database files.  Following that is the issue of multi-instrument analyses, which can be addressed by designing and implementing a new pair of data discovery and pipeline construction plug-ins.
