The Canny Edge Operator in SA-C


The Canny (1986) Operator is probably the most widely used edge detector. It consists of four steps: 1) edge smoothing, 2) edge magnitude and orientation computation, 3) directional non-maximal suppression, and 4) hysteresis edge labeling (a.k.a. connected components). For a clear and simple explanation of Canny and the reasoning behind it, see Trucco & Verri (1998), Chapter 4.

For this experiment, we implemented the first three steps of Canny in SA-C. (Although connected components can be written in SA-C, at the moment the compiler compiles it as host code, and not to the FPGA. Hopefully this will change in the future!) The result is an image in which every pixel is labeled as a string edge, weak edge or not an edge. The figure below shows the result of applying Canny to an aerial image of Fort Hood, TX.

Aerial Image of Fort Hood, TX
Result of Canny Edge Detection

SA-C Source Code

export main;
#include <>
// This file is a SA-C implementation of the Canny edge 
// detection algorithm.  The "true" canny algorithm (as
// defined by Trucco & Verri, page 76-79) takes three
// parameters: sigma, low-thresh & high-thresh. Since
// the sigma argument dictates the size of the smoothing
// mask, we will have to know that at compile time. The 
// other two arguments remain run-time arguments.

// Also, the "true" canny does hysteresis edge smoothing, which
// is a form of connected components and requires a while loop. 
// This step will not be added until while loops go to hardware.
// Since sigma must be known at compile time, this version
// assumes sigma = 1.

// fixed point constants (to avoid divisions in loops)
// constants for derivative:
#define WEIGHT8 ((fix17.8)0.6666667)
#define WEIGHT1 ((fix17.8)0.0833333)

// constants for orientation
#define COS_PI_8 ((ufix16.8)0.9238795)
#define COS_PI3_8 ((ufix16.8)0.3826834)

uint8[:,:] GaussianSmooth(uint8 source[:,:])
   ufix20.12 mask_col1[5]={2.0/571, 7.0/571, 12.0/571, 7.0/571, 2.0/571};
   ufix20.12 mask_col2[5]={7.0/571, 31.0/571, 52.0/571, 31.0/571, 2.0/571};
   ufix20.12 mask_col3[5]={12.0/571, 52.0/571, 127.0/571, 52.0/571, 12.0/571};
   uint8 smooth[:,:] = 
      // PRAGMA (nextify_cse)
      for window W[5,5] in source {
        ufix20.12 c1 = for w in W[:,0] dot m in mask_col1 return(sum(w*m));
        ufix20.12 c2 = for w in W[:,1] dot m in mask_col2 return(sum(w*m));
        ufix20.12 c3 = for w in W[:,2] dot m in mask_col3 return(sum(w*m));
        ufix20.12 c4 = for w in W[:,3] dot m in mask_col2 return(sum(w*m));
        ufix20.12 c5 = for w in W[:,4] dot m in mask_col1 return(sum(w*m));
 uint8 pixel = c1 + c2 + c3 + c4 + c5;
      } return(array(pixel));
} return(smooth);

// According to Trucco & Verri, for Canny (sigma 1) we need 5x1
// derivative masks (computed according to Appendix A2). Then we
// compute edge magnitude, and a coarsely quantized direction (0-3)
// on the half-circle, with a bucket every 45 degrees.
uint2 Orientation(int9 x, int9 y, uint8 mag)
   int9 xpos, int9 ypos = if (x < 0) return(-x, -y) else return(x,y);
   uint2 orient = 
      if (xpos > (mag*COS_PI_8)) return(0)
      else return(if (xpos < (mag*COS_PI3_8)) return(2)
                     else return(if (ypos > 0) return(1) else return(3)));
} return(orient);

bits10[:,:] EdgeMagnitudeOrientation(uint8 source[:,:])
   bits10 magorient[:,:] =
      for window W[5,5] in source {
         int9 x =WEIGHT8*W[2,1]+WEIGHT1*W[2,4]- (WEIGHT8*W[2,3]+WEIGHT1*W[2,0]);
         int9 y =WEIGHT8*W[1,2]+WEIGHT1*W[4,2]-(WEIGHT8*W[3,2]+WEIGHT1*W[0,2]);
         uint8 mag = SC_sqrt18((uint18)( (int17)x*(int17)x+(int17)y*(int17)y) );
         uint2 orient = Orientation(x,y,mag);
         // everything in here is to concatenate magnitude & orientation
         // into a single value, so the loops can fuse
         bits10 mag8 = (bits10)mag;
         bits10 mag10 = mag8 << 2;
         bits10 magorient = mag10 | (bits2)orient;
         // end of concatenation
      } return(array(magorient));
} return(magorient);

uint8[:,:] NonMaximalSuppression(bits10 source[:,:], 
                                 uint8 lowthresh, 
                                 uint8 highthresh)
   uint8 suparray[:,:] =
      for window W[3,3] in source {
         uint8 mag11 = (uint8)(W[1,1] >> 2);
         uint2 orient = (uint2)(W[1,1]);
         // the following variables are needed because the SA-C compiler
         // won't let me access W[c,c] inside an if node 
         uint8 w00 = (uint8)(W[0,0]>>2);
         uint8 w01 = (uint8)(W[0,1]>>2);
         uint8 w02 = (uint8)(W[0,2]>>2);
         uint8 w10 = (uint8)(W[1,0]>>2);
         uint8 w12 = (uint8)(W[1,2]>>2);
         uint8 w20 = (uint8)(W[2,0]>>2);
         uint8 w21 = (uint8)(W[2,1]>>2);
         uint8 w22 = (uint8)(W[2,2]>>2);
         // end of access variables
         uint8 n1, uint8 n2 = 
            if (orient == 0) return(w10,w12)
            else return(
               if (orient == 1) return(w02,w20)
               else return(
                 if (orient == 2) return(w01,w21)
                 else return(w00,w22)));
         uint8 supvalue = 
            if ((lowthresh > mag11) || (n1 > mag11) || (n2 > mag11))
            else return(if (highthresh > mag11) return(127) else return(255));
      } return(array(supvalue));
} return(suparray);
uint8[:,:] main(uint8 source[:,:], uint8 low_thresh, uint8 high_thresh)
   uint8 smooth[:,:] = GaussianSmooth(source);
   bits10 magorient[:,:] = EdgeMagnitudeOrientation(smooth);
   uint8 suppress[:,:] = NonMaximalSuppression(magorient, 
} return(suppress);

Performance Results

Performance was measured by compiling SA-C routines with the Sept., 2001 version of the SA-C compiler and executing them on an Annapolis Microsystems WildStar using a single Xilinx XV-2000E FPGA. The compiler did 5-fold vertical stripmining, temporal CSE and loop fusion, among other optimizations. The test images were 8-bit 512x512 images.

Seconds (WildStar)
Cycles (MClocks)
Logic Blocks

How does this compare to a Pentium? Hard to say. We wrote a version of the same program in C, using Intel's IPL whenever possible. The resulting program took 0.85 seconds on an 800MHz Pentium 3. On the other hand, Intel's OpenCV library has a hand-optimized assembly-coded version of Canny that includes the fourth (connected components) step. We set the high and low thresholds to be the same (so that connected components would not iterate) and timed the OpenCV routine at 0.135 seconds. So SA-C + FPGA was 20 times faster than assembly code and 120 times faster than a C implementation.

To access older results, click here.