The first comment is that the Numerical Recipes routine realft() is not as efficient as possible. To produce a production version of the GRASP code, we suggest replacing this function with a more highly optimized version. For example, on the Intel Paragon, the CLASSPACK library provides optimized real-FFT functions. To replace the realft() routine, we provide a routine of the same name, which calls the CLASSPACK library. This routine may be found in the src/optimization/paragon directory of GRASP. If the object file for this routine is placed ahead of the Numerical Recipes library in the link order, the linker resolves realft() to the replacement rather than to the Numerical Recipes version. (Note: GRASP currently contains optimized replacement routines for the FFT on SGI/Cray, Sun, Paragon, DEC and Intel Linux machines; see the src/optimization/* directories of GRASP, described in Section ).
The second comment is related to inspiral-chirp template generation. The binary inspiral chirps may be stored in memory by the multifilter program, but one is then limited by the available memory, and incurs the overhead of frequent disk accesses if that memory is swapped to and from disk. To avoid this, it is attractive to generate templates "on the fly", then dispose of them after each segment of data is analyzed. This corresponds to setting STORE_TEMPLATES to 0 in multifilter. In this case, the cost of computing binary chirp templates may become quite high relative to the cost of the remaining computation (FFTs, orthogonalization, searching for the maximum SNR).
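The trade-off between storing templates and regenerating them can be illustrated with a minimal sketch. Everything here except the name STORE_TEMPLATES is an assumption made for illustration: generate_template(), filter_segment(), run_search() and the array sizes are hypothetical stand-ins, not the multifilter source. The counter shows that with STORE_TEMPLATES set to 0 each template is recomputed once per data segment, which is why fast template generation matters in that mode.

```c
#include <stdlib.h>

#define STORE_TEMPLATES 0   /* 1: keep every template in memory; 0: regenerate per segment */
#define NTEMPLATES 4        /* illustrative sizes only, not GRASP defaults */
#define NPOINT 8
#define NSEGMENTS 3

static long n_generated = 0;   /* counts template computations */

/* Hypothetical stand-in for chirp generation: fills out[0..n-1]. */
static void generate_template(int index, float *out, int n)
{
    int i;
    for (i = 0; i < n; i++)
        out[i] = (float)(index + 1) * (float)i;
    n_generated++;
}

/* Hypothetical stand-in for filtering one data segment with one template. */
static void filter_segment(const float *tmpl, int n)
{
    (void)tmpl;
    (void)n;
}

/* Runs the search loop and returns how many templates were computed. */
static long run_search(void)
{
    int seg, t;
#if STORE_TEMPLATES
    /* Memory cost: NTEMPLATES * NPOINT floats held for the whole run. */
    float *bank = malloc(sizeof(float) * NTEMPLATES * NPOINT);
    if (bank == NULL)
        return -1;
    for (t = 0; t < NTEMPLATES; t++)
        generate_template(t, bank + t * NPOINT, NPOINT);
    for (seg = 0; seg < NSEGMENTS; seg++)
        for (t = 0; t < NTEMPLATES; t++)
            filter_segment(bank + t * NPOINT, NPOINT);
    free(bank);
#else
    /* Memory cost: one work buffer; compute cost: one generation per use. */
    float work[NPOINT];
    for (seg = 0; seg < NSEGMENTS; seg++)
        for (t = 0; t < NTEMPLATES; t++) {
            generate_template(t, work, NPOINT);
            filter_segment(work, NPOINT);
        }
#endif
    return n_generated;
}
```

In this sketch the stored variant computes NTEMPLATES templates in total, while the on-the-fly variant computes NTEMPLATES times NSEGMENTS of them but needs only a single work buffer.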
To cite a specific example, on the Intel Paragon, we found that template generation was almost a factor of ten more time-consuming than the rest of the searching procedure. Some profiling revealed that the two culprits were the cube-root operation and the calculation of sines and cosines. Because the floating-point hardware on the Paragon only does add, subtract and multiply, these operations required expensive library calls. In both cases, a small amount of work serves to eliminate most of this computation time. In the case of the cube-root function, we have provided (through an ifdef INLINE_CUBEROOT in the code) an inline computation of the cube root in 15 floating-point operations, which uses only add, subtract and multiply. This routine shifts its argument into a fixed range, then uses a fifth-order Chebyshev approximation, then makes one pass of Newton-Raphson to clean up to float precision, and returns the cube root. In the case of the trig functions, we have provided (through an ifdef INLINE_TRIGS in the code) inline routines to calculate the sine and cosine as well. After reducing the range of the argument, these use a 6th-order Chebyshev polynomial to approximate the sine and cosine. These techniques speed up the template generation to the point where it is approximately as expensive as the remaining computations. While there is some small loss of computational accuracy, we have not found it to be significant. Shown in Figure is a timing diagram illustrating the relative computational costs of these operations.
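The general idea of replacing library calls with range reduction followed by multiply-add arithmetic can be sketched as follows. This is not the INLINE_CUBEROOT or INLINE_TRIGS code from GRASP: the function names are hypothetical, a crude linear seed with three Newton-Raphson passes stands in for the Chebyshev fit plus single pass described above, and a Taylor polynomial stands in for the Chebyshev polynomial used for the trig functions. The standard frexp() and ldexp() calls are used only to split and restore the exponent; the constant divisions are folded at compile time.

```c
#include <math.h>   /* frexp, ldexp: used only for the exponent split */

/* Hypothetical sketch: cube root with only +, -, * in the core. */
static double sketch_cuberoot(double x)
{
    /* 2^(0/3), 2^(1/3), 2^(2/3): corrections for the exponent remainder */
    static const double scale[3] =
        { 1.0, 1.2599210498948732, 1.5874010519681994 };
    const double third = 1.0 / 3.0;
    double m, z;
    int e, q, r, i;

    if (x == 0.0)
        return 0.0;
    if (x < 0.0)
        return -sketch_cuberoot(-x);

    m = 2.0 * frexp(x, &e);   /* x = m * 2^(e-1), with m in [1,2) */
    e -= 1;
    r = e % 3;
    if (r < 0)
        r += 3;               /* remainder in {0,1,2}, also for negative e */
    q = (e - r) / 3;          /* so that e = 3q + r */

    /* Linear seed for z ~ m^(-1/3) on [1,2), exact at both endpoints. */
    z = 1.0 - 0.2062994740159 * (m - 1.0);

    /* Division-free Newton-Raphson for z = m^(-1/3). */
    for (i = 0; i < 3; i++)
        z = z * (4.0 - m * z * z * z) * third;

    /* m^(1/3) = m * z^2; then restore the exponent. */
    return ldexp(m * z * z * scale[r], q);
}

/* Hypothetical sketch: sine from range reduction plus a polynomial. */
static double sketch_sin(double x)
{
    const double pi = 3.14159265358979323846;
    const double inv_two_pi = 0.15915494309189533577;
    double k, r, r2;

    /* Reduce to [-pi, pi] by subtracting the nearest multiple of 2 pi. */
    k = x * inv_two_pi;
    k = (double)(long)(k + (k >= 0.0 ? 0.5 : -0.5));
    r = x - k * (2.0 * pi);

    /* Fold into [-pi/2, pi/2] using sin(pi - r) = sin(r). */
    if (r > 0.5 * pi)
        r = pi - r;
    else if (r < -0.5 * pi)
        r = -pi - r;

    /* Odd Taylor polynomial through r^11, evaluated in Horner form. */
    r2 = r * r;
    return r * (1.0 + r2 * (-1.0 / 6.0 + r2 * (1.0 / 120.0
         + r2 * (-1.0 / 5040.0 + r2 * (1.0 / 362880.0
         + r2 * (-1.0 / 39916800.0))))));
}

/* Cosine via the identity cos(x) = sin(x + pi/2). */
static double sketch_cos(double x)
{
    return sketch_sin(x + 1.57079632679489661923);
}
```

The cube root iterates on the inverse cube root because that Newton-Raphson update needs no division, and the sine is folded into a quarter period so that a short polynomial is adequate. The simple range reduction in sketch_sin() loses accuracy for arguments of very large magnitude, so this sketch is suitable only for moderate phases.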