LSC Segment List Format Specification
Introduction
One thing learned from analyzing the data from the S1 run was that we needed a standard way of keeping track of data quality information. The "S2 Segment Data Quality Repository" was created to fill this need, and has been continued for subsequent science runs. This "repository" provides lists of segments (time intervals) in a simple ASCII file format which can be parsed fairly easily by data analysis programs and scripts. A number of software tools have been written to generate, parse, and manipulate segment lists in this format, including the LIGOtools 'segments' package (which includes a Tcl library and the 'segwizard' graphical user interface) and the Python 'segment' class in glue.
Although the segment list files that we have been working with have a reasonably self-evident format, there has not been a written format specification up to now (September 2005). The existing parsing codes differ somewhat in the details of what formats they are able to parse successfully. This document is an attempt to establish a baseline format specification that all parsing codes should be able to handle. The intent is to capture the common capabilities of the existing codes (to the extent that they are sufficient for our needs) rather than to require the development of additional code. Any given implementation may be able to handle more general cases.
Basic concepts
A segment is a time interval, possibly with some associated information such as a numerical index, a data quality flag string, etc. The end time of a segment is required to be greater than or equal to the start time of the segment.
A segment list is simply a list of segments. The list has an order to it (it is a list, not a set), but the segments do not have to appear in chronological order in the list. The segments in a list may represent overlapping time intervals.
It is useful to define a few additional concepts even though they have no bearing on the format of a segment list file:
- A segment is said to be bare if it has no associated information, or annotated if it has one or more items of associated information.
- Two segments are said to overlap if there is some time interval which they both contain, i.e. the start time and/or the end time of one lies between the start and end times of the other.
- Two segments are said to touch if the end time of one is equal to the start time of the other. Note that they do not overlap in this case.
- A segment list is said to be sorted if the start times of the segments increase monotonically as one steps through the list, i.e. each segment's start time is greater than or equal to the previous segment's start time. This does not guarantee that the segment end times increase monotonically as well.
- A segment list is said to be disjoint if it is sorted and the segments do not overlap, i.e. each segment's start time is greater than or equal to the previous segment's end time.
- A segment list is said to be coalesced if it is sorted and the segments do not overlap or touch, i.e. each segment's start time is strictly greater than the previous segment's end time.
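The definitions above can be expressed as short predicates. The following Python sketch is illustrative only and is not taken from any of the existing tools; the `Segment` alias and all function names are assumptions made for this example. It also assumes that a zero-length segment does not overlap anything, a case the definitions above leave open.

```python
from typing import List, Tuple

# Hypothetical representation: (start, end) in GPS seconds, with start <= end.
Segment = Tuple[float, float]


def overlaps(a: Segment, b: Segment) -> bool:
    """True if a and b share a time interval of nonzero length."""
    return max(a[0], b[0]) < min(a[1], b[1])


def touches(a: Segment, b: Segment) -> bool:
    """True if the end time of one equals the start time of the other."""
    return a[1] == b[0] or b[1] == a[0]


def is_sorted(segs: List[Segment]) -> bool:
    """Each start time is >= the previous segment's start time."""
    return all(prev[0] <= cur[0] for prev, cur in zip(segs, segs[1:]))


def is_disjoint(segs: List[Segment]) -> bool:
    """Sorted, and each start time is >= the previous segment's end time."""
    return is_sorted(segs) and all(
        prev[1] <= cur[0] for prev, cur in zip(segs, segs[1:])
    )


def is_coalesced(segs: List[Segment]) -> bool:
    """Sorted, and each start time is strictly > the previous end time."""
    return is_sorted(segs) and all(
        prev[1] < cur[0] for prev, cur in zip(segs, segs[1:])
    )
```

Because a segment's end time is never less than its start time, checking adjacent pairs is enough: in a sorted list where each segment starts at or after the previous segment's end, no two segments can overlap.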
File format specification
A segment list file is an ASCII file which is to be parsed line by line. The hash character ('#') begins a comment, so the parsing code should ignore this character and all subsequent characters on a line. Blank lines - that is, lines consisting only of whitespace (spaces and/or tabs) after removal of the comment, if any - are to be ignored. Each non-blank line contains the information for one segment.
A line with segment information consists of at least two fields separated by whitespace. The line may also have leading and/or trailing whitespace, which is ignored. Two of the fields - either the first and second, or the second and third (see below) - represent the start time and end time, respectively, expressed in GPS seconds. Each GPS time is formatted either as an integer or as a decimal floating-point number. Considering the time span of observations with gravitational wave interferometers, each GPS time should have an integer part which is a positive 9- or 10-digit decimal number.
Optionally, the first field on the line can be an integer "index". It should have no more than eight digits, so that it is distinguishable from a GPS time. Other than this, there is no restriction on the value of the index; in particular, the index is not required to be equal to the ordinal number of the segment in this list. If an index is present, then the start time and end time of the segment follow it on the line.
A segment may have one or more fields of associated information following the GPS end time (in addition to the index which may optionally appear at the beginning of the line, as described above). An item of associated information may be a number or a string (with no internal whitespace) and is application-specific. The parsing code should ignore the additional-information field(s) if not relevant for the application.
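The parsing rules in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used by the LIGOtools 'segments' package or by glue: the `Segment` tuple and the function names `parse_segment_line` and `parse_segment_list` are hypothetical, the index is recognized purely by being an all-digit field of at most eight digits, and associated-information fields are kept as uninterpreted strings.

```python
from decimal import Decimal, InvalidOperation
from typing import List, NamedTuple, Optional, Tuple


class Segment(NamedTuple):
    """Hypothetical container for one parsed line."""
    start: Decimal                # GPS start time in seconds
    end: Decimal                  # GPS end time in seconds
    index: Optional[int] = None   # optional leading index, if present
    info: Tuple[str, ...] = ()    # associated-information fields, unparsed


def parse_segment_line(line: str) -> Optional[Segment]:
    """Parse one line; return None for blank or comment-only lines."""
    # Drop the comment (if any), then split on spaces and tabs.
    fields = line.split("#", 1)[0].split()
    if not fields:
        return None

    index = None
    # Assumption: an all-digit first field of at most eight digits is an index.
    if fields[0].isdigit() and len(fields[0]) <= 8:
        index = int(fields[0])
        fields = fields[1:]

    if len(fields) < 2:
        raise ValueError(f"expected start and end times: {line!r}")
    try:
        start, end = Decimal(fields[0]), Decimal(fields[1])
    except InvalidOperation:
        raise ValueError(f"malformed GPS time: {line!r}") from None
    if end < start:
        raise ValueError(f"end time precedes start time: {line!r}")

    return Segment(start, end, index, tuple(fields[2:]))


def parse_segment_list(text: str) -> List[Segment]:
    """Parse a whole file's contents, preserving the order of the lines."""
    parsed = (parse_segment_line(line) for line in text.splitlines())
    return [seg for seg in parsed if seg is not None]
```

The sketch stores times as `Decimal` rather than `float` so that a value with many fractional digits, such as 724038223.598746221, is kept exactly as written; an application that only needs integer-second precision could convert to `int` or `float` instead.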
Example segment list file
# This is a segment list file. It may contain comments anywhere.
    # There can be whitespace before the hash character.
723892545 723892560 #-- This segment happens to have integer times
723904200 723905200
    723904205   723905205  #-- Extra whitespace is OK, and overlapping is OK
723905303.542 724038223.598746221
103878332 103878544 #-- Remember that GPS times change to 10 digits in 2011 !
# Blank lines and comments in the middle of the file are OK
5 804323335 804323504 #-- This line has an 'index' at the beginning
12 804350000 804350000
# Now, some segments with associated information:
792331300 792331400 BAD_TIMING
792331500 792331600 BAD_TIMING 5 2 ex
23346 792331300.25 792331400.4 HighNoise # Has an index AND associated info
Revision history
- Sept. 7, 2005: Format specification first written down (P. Shawhan)
- Sept. 25, 2005: Added definitions of the concepts "overlap", "touch", and "disjoint" (P. Shawhan)