The CDF Wavelet in SA-C


Introduction

Wavelets are commonly used for multi-scale analysis in computer vision, as well as for image compression. Honeywell has defined a set of benchmarks for reconfigurable computing systems, including a wavelet-based image compression algorithm. This algorithm is based on the Cohen-Daubechies-Feauveau (CDF) wavelet (1992). We have translated the CDF wavelet program into SA-C and generalized it to operate on images of any size.

In general, wavelets separate the high-frequency components of images from their low-frequency components. The low-frequency components can then be resampled into a smaller image, while the high-frequency components resemble derivatives in dx, dy and d(xy). In the SA-C program below, the input is an 8-bit grey-scale image. There are four outputs: a reduced image (low frequencies) and three images of high-frequency components.

The output values do not fit in 8 bits, however: output pixels can be positive or negative and require ten bits of magnitude information. SA-C takes advantage of the FPGA's flexibility with regard to bit precision to produce 11-bit signed output images. Below, we show a source image and the four corresponding output images. Note that the output images were rescaled to 0-255 for viewing; the actual ranges were -2 to 272 (output image 1), -87 to 93 (output image 2), -160 to 110 (output image 3) and -113 to 96 (output image 4).
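The bit-width reasoning can be checked for the first (vertical) filter step with a short C sketch. It is an illustration rather than part of the original program: the hypothetical helper filter_step mirrors the SA-C function Valsuint8 in the listing, computing the detail d0 = 2x1 - x0 - x2 and the smoothed value x2 + (d0 + d1)/8, with the division truncated toward zero. Both outputs are monotone in each input pixel, so their extremes over 8-bit inputs occur when every pixel is 0 or 255, and checking those 32 corner cases is enough.

```c
/* Hypothetical helpers, written for illustration only. */

/* Divide by 8, rounding toward zero, the way the SA-C code does it:
   shift the non-negative magnitude right by 3, then restore the sign. */
static int div8_trunc(int v)
{
    int mag = v < 0 ? -v : v;
    int q = mag >> 3;
    return v < 0 ? -q : q;
}

/* One 1-D filter step on five samples, mirroring Valsuint8:
   detail = d0 = -x0 + 2*x1 - x2
   smooth = x2 + (d0 + d1)/8, where d1 = -x2 + 2*x3 - x4 */
static void filter_step(const int x[5], int *smooth, int *detail)
{
    int d0 = -x[0] + 2 * x[1] - x[2];
    int d1 = -x[2] + 2 * x[3] - x[4];
    *smooth = x[2] + div8_trunc(d0 + d1);
    *detail = d0;
}

/* Evaluate the step on all 32 inputs whose samples are 0 or 255 and
   record the extreme outputs. Because both outputs are monotone in each
   sample, these corners bound the outputs for every 8-bit input. */
static void stage1_ranges(int *min_s, int *max_s, int *min_d, int *max_d)
{
    *min_s = *min_d = 1 << 30;
    *max_s = *max_d = -(1 << 30);
    for (int mask = 0; mask < 32; mask++) {
        int x[5], s, d;
        for (int i = 0; i < 5; i++)
            x[i] = ((mask >> i) & 1) ? 255 : 0;
        filter_step(x, &s, &d);
        if (s < *min_s) *min_s = s;
        if (s > *max_s) *max_s = s;
        if (d < *min_d) *min_d = d;
        if (d > *max_d) *max_d = d;
    }
}
```

With these definitions, stage1_ranges yields a smoothed range of -63 to 318 and a detail range of -510 to 510. Neither fits in 8 bits, but both fit the signed 10-bit range (-512 to 511) of the int10 values that Valsuint8 returns.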

Aerial image of Fort Hood, TX (used as input for the CDF wavelet)
The four output images from the CDF wavelet

Source Code

export main;

			
// One 1-D filter step on a column of five 8-bit pixels: returns the
// smoothed value at col[2] and the detail value d0 (centred on col[1]).
int10, int10 Valsuint8(uint8 col[5])
{
   int10 mask[3] = {-1,2,-1};

   int11 d0 = for p in col[0:2] dot m in mask return(sum((int11)p*m));
   int11 d1 = for p in col[2:4] dot m in mask return(sum((int11)p*m));
   int11 d01 = d0 + d1;

   // What I need is (d0+d1)/8, truncated toward zero. The next five lines
   // compute it: take the magnitude, shift it right by 3, restore the sign.
   int11 ud01 = if (d01 < 0) return(-d01) else return(d01);
   bits11 b01 = ud01;
   bits11 bdiv8 = b01 >> 3;
   int11 udiv8 = bdiv8;
   int11 adj  = if (d01 < 0) return(-udiv8) else return(udiv8);
} return(col[2]+adj, d0);

// The same filter step for 10-bit signed inputs (used for the horizontal pass).
int11, int11 Valsint10(int10 col[5])
{
   int11 mask[3] = {-1,2,-1};

   int12 d0 = for p in col[0:2] dot m in mask return(sum((int12)p*m));
   int12 d1 = for p in col[2:4] dot m in mask return(sum((int12)p*m));
   int12 d01 = d0 + d1;

   // What I need is (d0+d1)/8, truncated toward zero. The next five lines
   // compute it: take the magnitude, shift it right by 3, restore the sign.
   int12 ud01 = if (d01 < 0) return(-d01) else return(d01);
   bits12 b01 = ud01;
   bits12 bdiv8 = b01 >> 3;
   int12 udiv8 = bdiv8;
   int12 adj  = if (d01 < 0) return(-udiv8) else return(udiv8);
} return(col[2]+adj, d0);

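// Helper (not called by main): maps a signed 11-bit value to 8 bits
// by taking its magnitude and saturating at 255.
uint8 clip8(int11 src)
{
   uint10 upix = if (src < 0) return(-src) else return(src);
   uint8 pix = if (upix > 255) return(255) else return(upix);
} return(pix);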

int11[:,:], int11[:,:], int11[:,:], int11[:,:] main(uint8 src[:,:])
{
   int11 s[:,:], int11 dx2[:,:], int11 dy2[:,:], int11 dxy[:,:] =
      // PRAGMA (nextify_cse,part_unroll(8,1))
      for window w[5,5] in src step(2,2)
      {
         // Vertical pass: filter each of the five columns of the window.
         int10 sy[5], int10 dy[5] =
            for uint3 colnum in [0~4]
            {
               int10 sval, int10 dval = Valsuint8(w[:,colnum]);
            } return(array(sval), array(dval));

         // Horizontal pass over the smoothed and detail rows.
         int11 s, int11 dx2 = Valsint10(sy);
         int11 dy2, int11 dxy = Valsint10(dy);
      } return(array(s), array(dx2), array(dy2), array(dxy));
} return(s, dx2, dy2, dxy);
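For readers who want to check the SA-C program against a conventional implementation, the following C sketch mirrors main: for every 5x5 window taken with a step of 2, it filters each window column vertically (as Valsuint8 does), then filters the resulting smoothed and detail rows horizontally (as Valsint10 does). It assumes that the SA-C window loop visits only windows lying entirely inside the image, so an N-pixel axis yields (N-5)/2+1 outputs. The names cdf_level, step5 and out_len are hypothetical, and plain int stands in for SA-C's 10- and 11-bit types.

```c
#include <stdint.h>

/* Divide by 8, truncating toward zero (magnitude shift, then sign). */
static int div8_trunc(int v)
{
    int mag = v < 0 ? -v : v;
    int q = mag >> 3;
    return v < 0 ? -q : q;
}

/* 1-D step mirroring Valsuint8/Valsint10: returns the smoothed value
   x2 + (d0 + d1)/8 and stores the detail d0 = -x0 + 2*x1 - x2. */
static int step5(int x0, int x1, int x2, int x3, int x4, int *detail)
{
    int d0 = -x0 + 2 * x1 - x2;
    int d1 = -x2 + 2 * x3 - x4;
    *detail = d0;
    return x2 + div8_trunc(d0 + d1);
}

/* Outputs along one axis, ASSUMING only windows that fit entirely
   inside the image are visited (5-pixel window, step 2). */
static int out_len(int n) { return n < 5 ? 0 : (n - 5) / 2 + 1; }

/* src is rows x cols, row-major. Each output array receives
   out_len(rows) x out_len(cols) values, row-major. */
static void cdf_level(const uint8_t *src, int rows, int cols,
                      int *s, int *dx2, int *dy2, int *dxy)
{
    int orows = out_len(rows), ocols = out_len(cols);
    for (int r = 0; r < orows; r++) {
        for (int c = 0; c < ocols; c++) {
            /* Top-left corner of the 5x5 window for this output. */
            const uint8_t *w = src + (2 * r) * cols + 2 * c;
            int sy[5], dy[5];
            /* Vertical pass: filter each of the five window columns. */
            for (int k = 0; k < 5; k++)
                sy[k] = step5(w[0 * cols + k], w[1 * cols + k], w[2 * cols + k],
                              w[3 * cols + k], w[4 * cols + k], &dy[k]);
            /* Horizontal pass over the vertical smooth and detail rows. */
            int i = r * ocols + c;
            s[i]   = step5(sy[0], sy[1], sy[2], sy[3], sy[4], &dx2[i]);
            dy2[i] = step5(dy[0], dy[1], dy[2], dy[3], dy[4], &dxy[i]);
        }
    }
}
```

Under the window-placement assumption above, a 512x512 input produces 254x254 values in each of the four outputs. A flat image yields a smoothed output equal to the input and three all-zero detail outputs, which is a quick sanity check for either implementation.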

Performance Results

Performance was measured by compiling the SA-C CDF wavelet routine with the May 13, 2001 version of the SA-C compiler and executing it on an Annapolis Microsystems StarFire board with a Xilinx XV-1000 FPGA. The test image was the 8-bit 512x512 image of Fort Hood, TX shown above. Compiler optimizations included temporal CSE, loop unrolling, pipelining (6 stages), stripmining (16x), and the use of multiple output memories to avoid memory contention.

Seconds (800 MHz P3)  Seconds (WildStar)  LUTs  Flip-Flops  Slices  Frequency
0.075                 0.002               54%   69%         99%     35.1 MHz

For comparison purposes, we ran Honeywell's original wavelet code (written in C) on an 800 MHz Pentium III running Windows 2000; the code was compiled with VC++ version 6, optimized for speed.
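Dividing the two execution times in the table gives the speedup of the FPGA run over the Pentium III run. Because the table values are rounded, the result is approximate:

```latex
\text{speedup} \approx \frac{0.075\ \text{s}}{0.002\ \text{s}} = 37.5
```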
