next up previous contents
Next: Template Placement Up: GRASP Routines: Template Bank Previous: Example: multifilter program   Contents

Optimization and computation-speed considerations

0 The previous subsection describes the multifilter program, which filters data through a bank of templates. We have experimented with the optimization of this code on several platforms, and here recount some of that experience.

The first comment is that the Numerical Recipes routine realft() is not as efficient as possible. In order to produce a production version of the GRASP code, we suggest replacing this function with a more-optimal version. For example, on the Intel Paragon, the CLASSPACK library provides optimized real-FFT functions. To replace the realft() routine, we provide a replacement routine by the same name, which calls the CLASSPACK library. This routine may be found in the src/optimization/paragon directory of GRASP. By including the object file for this routine in the linking path, before the Numerical Recipes library, it replaces the realft() routine. (Note: GRASP currently contains optimized replacement routines for the FFT on SGI/Cray, Sun, Paragon, DEC and Intel Linux machines; see the src/optimization/* directories of GRASP,described in Section [*]).

The second comment is related to inspiral-chirp template generation. The binary inspiral chirps may be saved in the multifilter program, but one is then limited by the available memory space, as well as incurring the overhead of frequent disk accesses if that memory space is swapped onto and off the disk. To avoid this, it is attractive to generate templates ``on the fly", then dispose of them after each segment of data is analyzed. This corresponds to setting STORE_TEMPLATES to 0 in multifilter. In this instance, the computational cost of computing binary chirp templates may become quite high, relative to the cost of the remaining computation (FFT's, orthogonalization, searching for the maximum SNR).

To cite a specific example, on the Intel Paragon, we found that the template generation was almost a factor of ten more time-consuming than the rest of the searching procedure. Some profiling revealed that the two culprits were the cube-root operation and the calculations of sines and cosines. Because the floating point hardware on the Paragon only does add, subtract and multiply, these operations required expensive library calls. In both cases, a small amount of work serves to eliminate most of this computation time. In the case of the cube root function, we have provided (through an ifdef INLINE_CUBEROOT in the code) an inline computation of cuberoot in 15 FLOPS, which only uses add, subtract and multiply. This routine shifts $x$ into the range from $1\rightarrow 2$, then uses a fifth-order Chebyshev approximation of $x^{-2/3}$ then make one pass of Newton-Raphson to clean up to float precision, and returns $x^{1/3} = x^{-2/3} x$. In the case of the trig functions we have provided (through an ifdef INLINE_TRIGS in the code) inline routines to calculate the sine and cosine as well.After reducing the range of the argument to $x\in [-\pi,\pi]$, these use a 6th order Chebyshev polynomial to approximate the sine and cosine. These techniques speed up the template generation to the point where it is approximately as expensive as the remaining computations. While there is some small loss of computational accuracy, we have not found it to be significant. Shown in Figure [*] is a timing diagram illustrating the relative computational costs of these operations.

Figure: This shows the performance of an ``on the fly" template search on the Intel Paragon, with different levels of optimization. The top diagram uses the Numerical Recipes FFT routine realft(), and takes about 4.2 seconds to process 6 seconds of data. The middle diagram shows identical code using the CLASSPACK optimized FFT routine, and takes about 2.1 seconds. Note that the template generation process is now becoming expensive. The bottom diagram shows identical code which includes inline functions for cube-root and sine/cosine functions to speed up the template generation process. The template generation takes about 325 msec, and the entire search procedure (including template generation) takes 780 msec per template per processor per 6-second stretch of data. Relative to the top diagram, this represents a speed-up factor of more than 5. Running on 256 nodes, it is possible to filter 5 hours of data through 66 templates (representing the mass range from 1.2 to 1.6 solar masses) in 5x3600x66x(0.780)/(256x6) seconds = 10.1 minutes.
\begin{figure}\vskip -0.9in
\vskip -1.0in

next up previous contents
Next: Template Placement Up: GRASP Routines: Template Bank Previous: Example: multifilter program   Contents
Bruce Allen 2000-11-19