Online Analysis Web Monitoring
Plans for Onasys in S6 and Beyond
Background
Onasys is essentially a smart cron daemon that knows about the LIGO segment and frame file databases, understands pipeline data requirements, and is able to combine these things to perform an online analysis task by repeatedly constructing and submitting Condor DAGs. Onasys acts as glue code between Condor, the segment database, and the user's analysis pipeline.
The first prototype of Onasys was tested at Hanford during E11, and the first application of the system to produce useful output was during S4 when Onasys was used as the supervisory tool kit for the excess power, bns inspiral, and kleineWelle pipelines. It has since been used as the supervisory tool kit for h(t) production, and a number of other gravitational wave search pipelines.
I myself used Onasys as the supervisory tool kit for the excess power pipeline nearly continuously from the start of S4, through the post-S4 astrowatch, and for the first year or so of S5: a period of nearly 18 months of continuous online analysis.
What I think has been learned
Things I think Onasys has done well:
- Segment book-keeping, i.e. what's been analyzed, what's not been analyzed, what can and can't be analyzed, etc. This appears to be well in hand. The job planning algorithms appear to be bug-free, they are highly configurable, and have been able to accommodate all searches so far.
- Fault tolerance and intolerance, i.e. detecting errors and responding to fatal and non-fatal errors correctly. Getting this right has been an iterative process, and it will never really be correct, but I believe that at this time the Onasys tools are quite good at working around non-fatal errors like network timeouts, but exiting upon fatal errors.
- Not becoming a fork bomb. Again, this has been an iterative process. Run-away resource consumption problems have been encountered at times, and they have been corrected as they are identified.
- DAG failure diagnosis. An online analysis with Onasys means constructing and submitting a Condor DAG every 10 minutes or so. In general it can be a time-consuming task to diagnose the failure of a Condor DAG, but this becomes a new kind of pleasure when a person can come in Monday morning and discover 300 failed DAGs that require diagnosing. It's important to be able to retrieve log files easily, identify patterns in the failures, and so on. I believe the DAG tracking tools in Onasys have proven invaluable in assisting in this task.
- High reliability. The rate at which an online analysis requires maintenance can be very low. Online h(t) production during S5 sometimes went many months without any human intervention being required.
- A green light means a green light. Onasys tracks the status of the DAGs it has launched via a job database, which can be browsed remotely through a web interface. A user can see a quick summary of their DAGs for the past n days, with a green light beside a DAG indicating a success. I am not aware of Onasys ever placing a green light beside a DAG that had not in fact completed successfully. This is extremely important, because it means a user can trust that when Onasys reports all green across the board, their analysis is running correctly, and the user can do something else with the rest of their day.
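The segment book-keeping in the first item above amounts to arithmetic on lists of time intervals. The following is a minimal sketch of that idea, not Onasys' actual planning code: the function names, the `(start, end)` tuple representation, and the `min_length` parameter are all assumptions made for illustration.

```python
def normalize(segments):
    """Sort segments and merge any that overlap or touch; drop empty ones."""
    merged = []
    for start, end in sorted(segments):
        if end <= start:
            continue
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def subtract(available, analyzed):
    """Return the parts of `available` that are not covered by `analyzed`."""
    analyzed = normalize(analyzed)
    remaining = []
    for start, end in normalize(available):
        cursor = start
        for a_start, a_end in analyzed:
            if a_end <= cursor:
                continue          # entirely before the region still of interest
            if a_start >= end:
                break             # entirely after this available segment
            if a_start > cursor:
                remaining.append((cursor, a_start))
            cursor = max(cursor, a_end)
            if cursor >= end:
                break
        if cursor < end:
            remaining.append((cursor, end))
    return remaining


def analyzable(segments, min_length):
    """Keep only segments long enough for the (hypothetical) pipeline to use."""
    return [(s, e) for s, e in segments if e - s >= min_length]


# Illustrative numbers only (seconds).
available = [(0, 1000), (1500, 2000)]
analyzed = [(0, 600), (900, 950)]
todo = subtract(available, analyzed)
ready = analyzable(todo, min_length=256)
```

Here `todo` is what has not been analyzed yet, and `ready` is the subset that can be analyzed now given the assumed minimum segment length.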
Things I think Onasys is weak in:
- Job database reliability. Condor itself provides poor facilities for debugging the failure of a DAG, so Onasys tracks the status of analysis jobs with its own job database. Sometimes the programs that update this database can themselves fail, which leaves the database's contents not an accurate representation of the actual state of the compute cluster. Onasys' typical response when this happens is to assume the worst and stop submitting new jobs. This response is necessary to avoid run-away resource consumption but users find it inconvenient because they are required to manually correct the database in order to allow jobs to begin running again.
- Tool complexity. Related to the previous point, there is a steep learning curve when it becomes necessary to fix a problem. When things run correctly, the system presents a very low maintenance burden, but the skill-set required to use Onasys increases sharply when a problem is encountered.
- Miscellanea (things that could be fixed if I just had a little free time).
  - Onasys implements its own check-pointing by writing its internal state to a data file. Because this is done frequently, the disk backup systems at the observatories do not back up the state file, and there have been occasions when the state of a user's search has been lost due to file system failures.
  - Onasys' job database is hosted by the same MySQL server that hosts LDR, which means that excess database load from Onasys can interfere with file replication. This is addressed in Onasys with a shared memory semaphore to limit the total number of connections to the database, but this mechanism is delicate and has proven to be a regular point of failure.
  - Onasys' source tree still contains C++ code. C++ is very problematic, and the code in question can only be compiled with g++ 3.3.x. I would like to get rid of it all, and happily this code is no longer required in order to run Onasys, but there are useful job database maintenance tools that need to be replaced before this code can be removed altogether.
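The connection-limiting idea mentioned in the miscellanea above can be illustrated with a counting semaphore. This is only an in-process stand-in using threads: the real mechanism is a shared memory semaphore spanning separate processes, which this sketch does not attempt to reproduce. The class name, the timeout, and the limit of two connections are assumptions.

```python
import threading
from contextlib import contextmanager


class ConnectionLimiter:
    """Allow at most `max_connections` holders of a database slot at once."""

    def __init__(self, max_connections, timeout):
        self._slots = threading.BoundedSemaphore(max_connections)
        self._timeout = timeout

    @contextmanager
    def slot(self):
        # Give up rather than wait forever if every slot is taken.
        if not self._slots.acquire(timeout=self._timeout):
            raise TimeoutError("no free database connection slot")
        try:
            yield
        finally:
            # Always hand the slot back, even if the database work failed.
            self._slots.release()


limiter = ConnectionLimiter(max_connections=2, timeout=0.01)

third_was_refused = False
with limiter.slot():
    with limiter.slot():
        try:
            with limiter.slot():
                pass
        except TimeoutError:
            third_was_refused = True

# Both slots have been released, so a new holder can get in again.
with limiter.slot():
    slot_available_again = True
```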
Things Onasys is missing altogether:
- Multi-instrument pipeline support. Onasys' plug-in mechanism allows the end user to provide complex data discovery and pipeline construction plug-ins, but the standard plug-ins that come with the source distribution are for single-instrument data discovery and pipelines only. Onasys has been used for an online multi-instrument stochastic analysis for which the user developed custom data discovery and pipeline construction plug-ins, so multi-instrument pipelines can certainly be supported, but at present there are no standard plug-ins available for this.
Things users are missing altogether:
- Running an online analysis is a time commitment. I think if I summarize all the complaints I have heard about Onasys and look for a common theme, it's that "it burns up my time". Having run the excess power pipeline online, continuously, for nearly 18 months, I estimate I spent a bit less than 1 day a week babysitting it on average. In the end I stopped the analysis because I too felt it was taking up too much time, but I can't imagine it having taken any less. I saw a DAG failure rate well below 1 in 1000, but I was running 1000 DAGs a week so there was always something to do. I think users are frustrated by this discovery; I think some come into it expecting that once they get things installed and configured it will run forever without assistance. Too many pieces have to work for that to be true, and this is independent of Onasys: there will always be a DNS outage, or NFS hang, or expired certificate, or missing frame file, or something else that trips up an online analysis.
Thoughts On Where to Go
No Increase of Sloppiness
Onasys has always been intended for production data analysis, not for the quickie production of toy triggers. This means that Onasys goes to great lengths to ensure that only the correct data is analyzed, that all possible data is analyzed, that the same data is not analyzed twice, and so on: all the things that should be ensured in a real search for gravitational waves.
Ensuring these things requires Onasys to do things that would not be needed otherwise. For example, Onasys needs to ensure that its internal state is preserved across invocations so that if it is stopped and restarted the new daemon process will not re-analyze old data, and this means maintaining an on-disk state file or checkpoint image. Things like this add additional points of failure for the daemon, and are perceived to be unnecessary baggage by users who are not looking to use Onasys for production data analysis.
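To make the state-preservation requirement concrete, here is a minimal sketch of one common way to keep an on-disk state file consistent across restarts: write to a temporary file and rename it into place. This is not a description of how Onasys writes its checkpoint; the JSON format, the function names, and the file name are assumptions.

```python
import json
import os
import tempfile


def save_state(path, state):
    """Write `state` so that `path` always holds a complete old or new copy."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".state-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        # Rename within one directory replaces the old file in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_state(path, default):
    """Return the saved state, or `default` if no state file exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


# A first start finds no state; a restart sees what the previous run recorded.
with tempfile.TemporaryDirectory() as workdir:
    state_path = os.path.join(workdir, "daemon.state")
    first_run = load_state(state_path, default={"analyzed": []})
    save_state(state_path, {"analyzed": [[0, 600]]})
    restarted = load_state(state_path, default={"analyzed": []})
```

A restarted daemon that reads `restarted` knows the interval from 0 to 600 has already been handled and does not plan it again.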
I do not at this time intend to make these features optional. My belief is that we should be able to do production data analysis online, and I want other people to perceive that possibility to exist as well. Part of that is having a tool like Onasys available that people can already trust will correctly supervise a final production analysis.
I believe the following are the reasons we can't do production data analysis online today:
- the immaturity of analysis pipelines (people don't really know up front what it is they wish to run, how to tune it, etc.),
- the lack of a live final calibration,
- and the incompleteness of live data quality information.
No Latency Reduction
Onasys is not a realtime system, and will never evolve into one. Onasys has always been meant as the means by which an existing offline analysis pipeline can be transitioned into an online analysis. This is accomplished by running what is essentially an off-the-shelf offline analysis pipeline on short pieces of data in a loop. The loop iteration period sets a latency that Onasys introduces into the analysis, but is user-selectable. Reducing the loop iteration period lowers the latency but at the cost of increased resource pressure: more DAGs are run per unit time, which leads to increased disk usage, increased file system demands (large numbers of files), and more pain when a failure occurs.
Onasys has demonstrated the ability to introduce latencies as small as 10 minutes into an online analysis without any difficulties, and most online analyses are run with loop iteration periods close to this. It's likely that an iteration period as small as 5 minutes is still practical, but if a user wishes to push the latency any lower than that then they will need to switch to some other technology like a stream-based data analysis pipeline.
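The trade-off between iteration period and resource pressure is simple arithmetic, sketched below. The function names are mine, and the failure rate of 1 in 1000 is used only as an illustrative upper bound taken from the figure quoted earlier, not as a measured value.

```python
SECONDS_PER_WEEK = 7 * 24 * 3600


def dags_per_week(period_seconds):
    """Number of DAGs submitted in a week at one DAG per loop iteration."""
    return SECONDS_PER_WEEK // period_seconds


def expected_failures_per_week(period_seconds, failure_rate):
    """Average number of failed DAGs per week at the given per-DAG failure rate."""
    return dags_per_week(period_seconds) * failure_rate


ten_minute_dags = dags_per_week(600)    # 1008
five_minute_dags = dags_per_week(300)   # 2016
ten_minute_failures = expected_failures_per_week(600, 1 / 1000)
```

Halving the iteration period from 10 minutes to 5 doubles the number of DAGs, and with it the number of files, log directories, and failures to diagnose per week.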
Work on Job Database and Other State Information
The job database is both Onasys' strength and weakness. The job database's ability to quickly, reliably, and remotely report "all green", and the assistance it provides in diagnosing problems are essential tools for online data analysis and the current job database performs both of these tasks very well. The job database, however, is easily the greatest source of problems for end users.
Since Onasys development began, Condor has introduced its own job tracking database called Quill. The Quill database in principle provides all of the information currently available in the Onasys job database and more. The Quill database is a more reliable source of information than the dagdbUpdator daemons used by Onasys for job status monitoring, and the code behind it is maintained by other people (less work for us). For all of these reasons, I would like future versions of Onasys to rely on the Quill database for job progress information instead of the dagdbUpdator daemons.
The Quill database is probably not suitable for use as a back-end to a web interface like Onasys' current summary status pages. I believe that several relatively complex queries on enormous tables are required in order to produce the sort of summary information presented by the web pages. For this reason, I imagine there will still be an Onasys job database but it will be populated from the Quill database instead of from the individual DAG log files as is done currently.
My current thinking is that we should move away from a central job database to local databases (e.g., SQLite databases) maintained individually in association with each online analysis. I imagine each Onasys daemon populating a private SQLite job database file from queries to the Quill database as part of the daemon's loop. This approach decouples the Quill database from web servers and the like. Besides job status the private database file could be used to store all of the daemon's internal state information, including the daemon metadata currently exposed to the web through .pid files and the like. Even the daemon's segment lists could be stored in the file.
Exposing online analysis status information to the web becomes more challenging in this configuration, but the total number of components is greatly reduced, the reliability is increased, and new features become possible (like charts on the web showing which segments have been analyzed, and so on). This sort of configuration also helps users who just want to use Onasys for toy trigger production: just delete the database file and your daemon is completely reset, including clearing up all the old job state information (get rid of embarrassing red lights).
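As a rough picture of what a per-daemon database file might look like, here is a minimal sketch using Python's `sqlite3` module. The table layout, column names, and status strings are assumptions, and the step that would query Quill is not shown: `record_dag` simply stands in for whatever the daemon would write after such a query.

```python
import sqlite3

# Hypothetical schema: one table for DAG status, one for daemon metadata.
SCHEMA = """
CREATE TABLE IF NOT EXISTS dag (
    directory TEXT PRIMARY KEY,
    gps_start INTEGER NOT NULL,
    gps_end   INTEGER NOT NULL,
    status    TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS daemon_state (
    key   TEXT PRIMARY KEY,
    value TEXT NOT NULL
);
"""


def open_job_db(path):
    """Open (creating if needed) the private job database for one daemon."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_dag(conn, directory, gps_start, gps_end, status):
    """Insert or update the status of one DAG in a single transaction."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO dag VALUES (?, ?, ?, ?)",
            (directory, gps_start, gps_end, status),
        )


def summary(conn):
    """Count DAGs by status, as a summary page might display them."""
    rows = conn.execute("SELECT status, COUNT(*) FROM dag GROUP BY status")
    return dict(rows.fetchall())


def all_green(conn):
    """True only if every recorded DAG has succeeded."""
    (not_done,) = conn.execute(
        "SELECT COUNT(*) FROM dag WHERE status != 'succeeded'"
    ).fetchone()
    return not_done == 0


# An in-memory database keeps the example self-contained.
conn = open_job_db(":memory:")
record_dag(conn, "dag-0001", 0, 600, "succeeded")
record_dag(conn, "dag-0002", 600, 1200, "failed")
record_dag(conn, "dag-0003", 1200, 1800, "succeeded")
counts = summary(conn)
green = all_green(conn)
```

With a real file path in place of `:memory:`, deleting that one file would reset the job history and any daemon state stored alongside it.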
Onasys-Related Things
There are a number of things that are closely related to Onasys, but for which the responsibility lies elsewhere.
- Failed DAG retries. I have been repeatedly asked for scripts to automatically re-run failed DAGs, to clean up failed DAGs, etc. That it is difficult to re-run a failed DAG is a Condor issue, and these requests should be directed to the Condor developers. Onasys already provides a command line tool to produce lists of the directories containing failed DAGs, and this tool can be incorporated into scripts to automatically process the failures, but how to do so is outside of Onasys' domain.
- Inter-analysis dependencies. It is quite reasonable to want one online analysis to use as input the data products of some other analysis. For example, a burst or inspiral pipeline might want to use burst triggers produced by another pipeline that is analyzing environmental channels as the basis for a veto. For one online analysis to share its data products with another requires two components: there needs to be a mechanism for transferring the data from one to the other, and there needs to be a discovery mechanism for that data (a way for the downstream search to check what upstream data is available). The LSC already possesses a widely-deployed mechanism for these things, namely LDR. But there is resistance to using LDR for this so it's possible some other thing is required.
- Related to the last point is the question of authentication mechanisms. Currently Onasys' design has taken the approach that Onasys is being used in a trusted environment. For example, the job database at each observatory is open to all users logged into the clusters, and it is assumed that the users will not corrupt each other's job status information. If online analysis pipelines begin to play a larger role as sources of data rather than sinks, then people might begin to want stronger assurance that the pipelines are not being interfered with, that data products have really been produced by the source they claim to have been produced by, and so on.
Summary
Essentially, my intention is for Onasys to evolve, not be re-engineered. I believe Onasys accomplishes the task it is intended to accomplish, but there are some loose ends that need cleaning up. The issue requiring the most immediate attention is the reliability of the job monitoring mechanism, which I hope to address by migrating from dagdbUpdator daemons to Quill + per-daemon SQLite database files. Following that is the issue of multi-instrument analyses, which can be addressed by designing and implementing a new pair of data discovery and pipeline construction plug-ins.