Performance of SA-C IPL Routines

Introduction

To test the performance of SA-C programs on simple image processing procedures, we implemented 32 routines that exactly match the API of routines from Intel's Image Processing Library (IPL). These routines were verified by comparing their results to the output of Intel's routines; the results match exactly, including the use of saturating arithmetic and (for some routines) rounding remainders of 0.5 up or down depending on the column number.
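To make "saturating arithmetic" concrete, the following is a minimal sketch of an 8-bit saturating add, with a wrapping add shown for contrast. It is not the SA-C or IPL code; the function names `saturate_u8`, `add_saturating`, and `add_wrapping` are hypothetical and chosen only for illustration. The column-dependent rounding rule is not modeled, because the text does not specify it.

```python
def saturate_u8(value):
    """Clamp an integer to the unsigned 8-bit range [0, 255]."""
    return max(0, min(255, value))


def add_saturating(a, b):
    """Pixel-wise saturating sum of two equal-length rows of 8-bit pixels."""
    return [saturate_u8(x + y) for x, y in zip(a, b)]


def add_wrapping(a, b):
    """The same sum with modulo-256 wraparound, shown only for contrast."""
    return [(x + y) % 256 for x, y in zip(a, b)]


row_a = [10, 200, 255]
row_b = [20, 100, 1]
print(add_saturating(row_a, row_b))  # [30, 255, 255]
print(add_wrapping(row_a, row_b))    # [30, 44, 0]
```

With saturation, an overflowing bright pixel stays at 255 instead of wrapping around to a dark value, which is why an exact match against Intel's output depends on reproducing this behavior.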

The comparisons were made by compiling the SA-C routines with the November 2000 version of the SA-C compiler and executing them on an Annapolis Microsystems StarFire with a Xilinx XV-1000 FPGA. The Intel IPL routines were executed under Windows NT on a 450 MHz Pentium II. (We believe the XV-1000 and the 450 MHz Pentium II are of approximately the same age.) The test images were 8-bit 512x512 images.

Execution times are reported in seconds, as are the data download and upload times for the reconfigurable system (RCS). FPGA clock frequencies are reported in MHz.


Performance Results

Routine         Pentium Exec. (s)  RCS Exec. (s)  RCS data download (s)  RCS data upload (s)  Frequency (MHz)
AddS            0.081531           0.008355       0.02221                0.03292              39.5
And             0.003179           0.008492       0.04418                0.03298              38.9
AndS            0.001865           0.008331       0.02222                0.03275              39.6
Close           0.018069           0.012800       0.02337                0.03308              25.0
Convolve2D      0.006548           0.006624       0.02341                0.03385              25.1
Dilate          0.011578           0.028910       0.02376                0.03385              25.0
Erode           0.016764           0.028910       0.02375                0.03385              25.0
Gaussian3x3     0.005670           0.006637       0.03386                0.03390              25.1
Greater         0.000109           0.008461       0.04430                0.00413              39.0
GreaterS        0.011431           0.009563       0.02220                0.02238              34.5
LShiftS         0.001469           0.008537       0.02239                0.02256              38.7
Less            0.000074           0.008438       0.04434                0.00413              39.1
LessS           0.011567           0.008179       0.02176                0.00409              40.4
MaxFilter       0.005189           0.021259       0.02304                0.03341              28.2
MinFilter       0.005328           0.021755       0.02304                0.03342              27.5
Multiply        0.003541           0.009055       0.04322                0.03306              36.4
MultiplyS       0.039078           0.008707       0.02228                0.03291              37.9
MultiplySScale  0.002057           0.009470       0.02224                0.03293              34.9
MultiplyScale   0.003659           0.011854       0.04510                0.03336              27.8
NormC           0.001053           0.010593       -                      0.00008              31.0
NormL1          0.002023           0.009099       -                      0.00008              36.0
Not             0.001883           0.008530       0.02227                0.03291              38.7
Open            0.017859           0.034167       0.02386                0.03380              25.0
Or              0.003093           0.008492       0.04490                0.03289              38.6
OrS             0.001976           0.008331       0.02220                0.03272              39.6
RShiftS         0.001359           0.008537       0.02233                0.03302              38.7
Square          0.045160           0.008530       0.02613                0.03297              38.7
Subtract        0.030182           0.007858       0.04330                0.03248              42.0
SubtractS       0.001854           0.008546       0.02226                0.03293              38.6
Threshold       0.001251           0.007806       0.02223                0.03703              42.0
Xor             0.002554           0.009390       0.04451                0.03302              35.1
XorS            0.001367           0.008331       0.02183                0.03272              39.6

Analysis

As you would expect, the Pentium II outperforms the reconfigurable system on most simple image processing operators, although the table shows a few routines (AddS, MultiplyS, Square, and Subtract, among others) where the RCS execution time is lower. These tasks are I/O bound, and the I/O paths on the FPGAs are no wider than those on the Pentium while operating at a slower clock speed. As a result, the FPGA is largely unable to exploit its advantage in parallelism. (See ARAGTAP for an example of the reconfigurable system outperforming the Pentium on more complex tasks.)
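One way to read the two execution-time columns row by row is to compute the ratio of Pentium time to RCS time. The sketch below does this for a handful of rows copied from the table; the helper name `exec_ratio` and the choice of rows are illustrative assumptions, and transfer times are deliberately left out.

```python
# (Pentium exec. seconds, RCS exec. seconds), copied from the results table.
timings = {
    "AddS": (0.081531, 0.008355),
    "Convolve2D": (0.006548, 0.006624),
    "Dilate": (0.011578, 0.028910),
    "Greater": (0.000109, 0.008461),
    "Square": (0.045160, 0.008530),
}


def exec_ratio(pentium_s, rcs_s):
    """Pentium time divided by RCS time; values above 1 favor the RCS."""
    return pentium_s / rcs_s


for name, (pentium_s, rcs_s) in timings.items():
    ratio = exec_ratio(pentium_s, rcs_s)
    faster = "RCS" if ratio > 1 else "Pentium"
    print(f"{name:12s} ratio={ratio:7.3f}  faster: {faster}")
```

A ratio below 1 means the Pentium finished first. Note how far the RCS times cluster around 0.008 seconds for the pixel-wise operators regardless of the operation, which is consistent with those routines being I/O bound.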

Readers should also note the other numbers provided here: the time required to download the source image(s) to the RCS, and the time required to upload the output back to the host. These times are artifacts of the FPGAs being on a separate co-processor board, while the Pentium is the main processor. Future reconfigurable systems may have the FPGAs and RISC processors on the same chip, in which case these transfer times would go away. In the meantime, it takes about 0.02 seconds to download a 512x512 8-bit image across our PCI bus; operators that take two images as arguments have twice that download time. Upload times depend on whether the result is an image or a single value, and, if it is an image, on whether it is binary, 8-bit, or wider.
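The transfer figures above can be turned into a rough effective bandwidth and an end-to-end time. The sketch below is back-of-the-envelope arithmetic, not a measurement: it assumes that download, execution, and upload happen strictly one after another with no overlap, and the helper names `effective_bandwidth` and `end_to_end` are hypothetical.

```python
IMAGE_BYTES = 512 * 512   # one 8-bit 512x512 image, one byte per pixel
DOWNLOAD_S = 0.02         # approximate per-image download time quoted in the text


def effective_bandwidth(bytes_moved, seconds):
    """Average transfer rate in bytes per second."""
    return bytes_moved / seconds


def end_to_end(download_s, exec_s, upload_s):
    """Total RCS time, assuming the three phases do not overlap."""
    return download_s + exec_s + upload_s


bandwidth = effective_bandwidth(IMAGE_BYTES, DOWNLOAD_S)
print(f"effective download rate: {bandwidth / 1e6:.1f} MB/s")  # 13.1 MB/s

# 'And' row from the results table: two source images, one 8-bit result image.
and_total = end_to_end(0.04418, 0.008492, 0.03298)
print(f"And, RCS end to end: {and_total:.6f} s (Pentium: 0.003179 s)")
```

Under these assumptions the transfers dominate: for a pixel-wise operator such as And, moving the data takes several times longer than the FPGA computation itself.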