The first comment is that the *Numerical Recipes* routine `realft()` is not as efficient as possible. To produce a
production version of the GRASP code, we suggest replacing this function
with a more highly optimized version. For example, on the Intel Paragon, the
CLASSPACK library provides optimized real-FFT functions. To take
advantage of it, we provide a replacement routine of the same name,
which calls the CLASSPACK library. This routine may be
found in the `src/optimization/paragon` directory of GRASP. Including
the object file for this routine in the linking path, before the *Numerical
Recipes* library, replaces the `realft()` routine.
(Note: GRASP currently contains optimized replacement FFT routines for
SGI/Cray, Sun, Paragon, DEC, and Intel Linux machines; see the
`src/optimization/*` directories of GRASP, described in
Section ).

The second comment is related to inspiral-chirp template generation. The
binary inspiral chirps may be saved in the `multifilter` program, but
one is then limited by the available memory, and incurs
the overhead of frequent disk accesses if that memory is swapped
to and from disk. To avoid this, it is attractive to generate
templates ``on the fly'', then dispose of them after each segment of
data is analyzed. This corresponds to setting `STORE_TEMPLATES`
to 0 in `multifilter`. In this case, the computational cost of
computing binary chirp templates may become quite high relative to the
cost of the remaining computation (FFTs, orthogonalization, searching
for the maximum SNR).

To cite a specific example, on the Intel Paragon, we found that the
template generation was almost a factor of ten more time-consuming than
the rest of the searching procedure. Some profiling revealed that the
two culprits were the cube-root operation and the calculations of sines
and cosines. Because the floating point hardware on the Paragon only
does add, subtract and multiply, these operations required expensive
library calls. In both cases, a small amount of work serves to eliminate
most of this computation time. In the case of the cube root function,
we have provided (through an `ifdef INLINE_CUBEROOT` in the code)
an inline computation of the cube root in 15 FLOPs, which uses only add,
subtract, and multiply. This routine shifts the argument into a fixed
range, applies a fifth-order Chebyshev approximation there, makes one
pass of Newton-Raphson to clean up to float precision, and returns the
appropriately rescaled result. In the case of the trig
functions we have provided (through an `ifdef INLINE_TRIGS` in the
code) inline routines to calculate the sine and cosine as well. After
reducing the range of the argument, these use
a sixth-order Chebyshev polynomial to approximate the sine and cosine.
These techniques speed up the template generation to the point where it
is approximately as expensive as the remaining computations. While there
is some small loss of computational accuracy, we have not found it to
be significant. Shown in Figure is a timing
diagram illustrating the relative computational costs of these operations.