The first comment is that the Numerical Recipes routine realft() is not as efficient as possible. In order to produce a
production version of the GRASP code, we suggest replacing this function
with a more-optimal version. For example, on the Intel Paragon, the
CLASSPACK library provides optimized real-FFT functions. To replace
the realft() routine, we provide a replacement routine by the
same name, which calls the CLASSPACK library. This routine may be
found in the src/optimization/paragon directory of GRASP. By including
the object file for this routine in the linking path, before the Numerical
Recipes library, it replaces the realft() routine.
(Note: GRASP currently contains optimized replacement routines for
the FFT on SGI/Cray, Sun, Paragon, DEC and Intel Linux machines; see the
src/optimization/* directories of GRASP,described in
Section
).
The second comment is related to inspiral-chirp template generation. The binary inspiral chirps may be saved in the multifilter program, but one is then limited by the available memory space, as well as incurring the overhead of frequent disk accesses if that memory space is swapped onto and off the disk. To avoid this, it is attractive to generate templates ``on the fly", then dispose of them after each segment of data is analyzed. This corresponds to setting STORE_TEMPLATES to 0 in multifilter. In this instance, the computational cost of computing binary chirp templates may become quite high, relative to the cost of the remaining computation (FFT's, orthogonalization, searching for the maximum SNR).
To cite a specific example, on the Intel Paragon, we found that the
template generation was almost a factor of ten more time-consuming than
the rest of the searching procedure. Some profiling revealed that the
two culprits were the cube-root operation and the calculations of sines
and cosines. Because the floating point hardware on the Paragon only
does add, subtract and multiply, these operations required expensive
library calls. In both cases, a small amount of work serves to eliminate
most of this computation time. In the case of the cube root function,
we have provided (through an ifdef INLINE_CUBEROOT in the code)
an inline computation of cuberoot in 15 FLOPS, which only uses add,
subtract and multiply. This routine shifts
into the range from
, then uses a fifth-order Chebyshev approximation of
then make one pass of Newton-Raphson to clean up to float
precision, and returns
. In the case of the trig
functions we have provided (through an ifdef INLINE_TRIGS in the
code) inline routines to calculate the sine and cosine as well.After
reducing the range of the argument to
, these use
a 6th order Chebyshev polynomial to approximate the sine and cosine.
These techniques speed up the template generation to the point where it
is approximately as expensive as the remaining computations. While there
is some small loss of computational accuracy, we have not found it to
be significant. Shown in Figure
is a timing
diagram illustrating the relative computational costs of these operations.