ARAGTAP pre-screener in SA-C

Introduction

ARAGTAP is an automatic target recognition (ATR) program for synthetic aperture radar (SAR) images, developed at Wright Labs for the U.S. Air Force. The ARAGTAP pre-screener is the initial focus-of-attention module; its job is to extract regions of interest (ROIs) around potential targets for further evaluation. The ARAGTAP pre-screener was reimplemented at Khoral Research, Inc. (KRI) as a Khoros workspace. The workspace contains eleven glyphs and control logic. The eleven glyphs are: downsample, erode (cross), erode (box), dilate (cross), dilate (box), bitwise and, positive difference, majority threshold, prune, label, and display. The figure below shows the ARAGTAP workspace in Cantata.

We reimplemented eight of the eleven glyphs in SA-C, while keeping the ARAGTAP pre-screener as a Khoros workspace. (More information on creating and executing SA-C programs in Khoros is available on a separate page.) We left out the labeling glyph for consistency's sake: the labeling algorithm in ARAGTAP is highly sequential. A parallel version is possible, but it would behave slightly differently, and the current sequential algorithm runs more quickly on a host than on the more parallel (but slower-clocked) FPGA. The pruning algorithm had similar constraints. Finally, it did not make sense to implement the display glyph on an FPGA, since the FPGA has no access to the monitor.

Most of the computation in the ARAGTAP pre-screener occurs in the dilate and erode glyphs. For each image, there are eight consecutive dilation operations, alternating between the cross and box kernels. Later on, there is a matching sequence of eight erosions. Taking advantage of SA-C's loop fusion capabilities, we compiled a single SA-C glyph that computes four alternating dilations, and another glyph that computes four alternating erosions. Two calls to each fused glyph replace the eight calls to the corresponding original dilation or erosion glyphs.
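As a rough illustration of what the dilation and erosion sequences compute, here is a minimal Python sketch. It is not the SA-C or KRI code. It assumes gray-scale morphology (maximum for dilation, minimum for erosion), 3x3 cross and box kernels, a sequence that starts with the cross kernel, and border pixels that simply ignore out-of-image neighbors; none of these details are stated on this page, and the names `CROSS`, `BOX`, `dilate4` and `erode4` are hypothetical.

```python
# Neighbor offsets (dy, dx) for the two assumed 3x3 kernels.
CROSS = [(-1, 0), (0, -1), (0, 0), (0, 1), (1, 0)]
BOX = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def morph(image, offsets, combine):
    """Apply `combine` (max or min) over each pixel's neighborhood.

    Assumption: neighbors that fall outside the image are ignored.
    """
    height, width = len(image), len(image[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            values = [
                image[y + dy][x + dx]
                for dy, dx in offsets
                if 0 <= y + dy < height and 0 <= x + dx < width
            ]
            row.append(combine(values))
        out.append(row)
    return out


def dilate(image, offsets):
    return morph(image, offsets, max)


def erode(image, offsets):
    return morph(image, offsets, min)


def dilate4(image):
    """Four alternating dilations (assumed order: cross, box, cross, box)."""
    for offsets in (CROSS, BOX, CROSS, BOX):
        image = dilate(image, offsets)
    return image


def erode4(image):
    """Four alternating erosions, mirroring dilate4."""
    for offsets in (CROSS, BOX, CROSS, BOX):
        image = erode(image, offsets)
    return image
```

Calling `dilate4` twice on an image corresponds to the eight consecutive dilations described above. In this sketch every pass walks the whole image again; loop fusion removes those intermediate images by computing each final pixel directly from the input neighborhood.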

Performance Results

Our newest performance results cover the sequences of dilations and erosions, which take the bulk of the time in this application. Using the May 13, 2001 version of the SA-C compiler, we fused all eight dilations (and separately all eight erosions) into a single FPGA configuration. Since this did not fill up the FPGA, we had the compiler apply vertical stripmining to increase parallelism and decrease I/O traffic. The resulting run-times on an AMS WildStar using a single Virtex 2000E FPGA are shown below, in comparison to an 800MHz Pentium III. The test image was a 420x305 8-bit image.

The table above gives the performance numbers for the best optimization settings, which in this case involved vertically stripmining 14 times, so that on each cycle the system computes a column of fourteen final pixels. (Remember that it has fused all eight iterations, so each "final pixel" is the result of eight dilations. See optimizations for more on stripmining and other optimizations.)
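To see why vertical stripmining can reduce I/O traffic, consider a simplified model, sketched here in Python. It is not a description of the SA-C compiler's actual memory scheme. It assumes 3x3 kernels, so that eight fused passes read at most a 17x17 input neighborhood, and it assumes a window that slides one column per step and fetches only the new column. The function name `reads_per_output` is hypothetical.

```python
def reads_per_output(strip, window_height):
    """Pixels fetched per output pixel in the simplified sliding-window model.

    Each step fetches one new column of (strip + window_height - 1) pixels
    and produces `strip` vertically adjacent output pixels.
    """
    return (strip + window_height - 1) / strip


# Assumption: each 3x3 pass widens the footprint by one pixel per side,
# so eight fused passes see a window at most 1 + 2 * 8 = 17 pixels tall.
FUSED_HEIGHT = 1 + 2 * 8

if __name__ == "__main__":
    for strip in (1, 2, 4, 8, 14):
        print(strip, round(reads_per_output(strip, FUSED_HEIGHT), 2))
```

Under these assumptions the number of reads per output pixel falls quickly at first and then flattens out, since the fixed overhead of `window_height - 1` extra rows is shared by more and more outputs. The model ignores clock frequency, which also varies with the amount of stripmining.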

The table below shows how performance improves as the amount of stripmining increases, although a law of diminishing returns sets in near the end.

Stripmining #	Seconds	Frequency (MHz)	Logic Blocks
1	0.0135639	38.12	23%
2	0.0074069	37.08	26%
3	0.0051551	37.62	28%
4	0.0044855	34.25	30%
5	0.0039192	32.70	32%
6	0.0032890	34.47	35%
7	0.0027625	36.79	37%
8	0.0025715	36.33	39%
9	0.0024387	35.59	43%
10	0.0022889	34.66	44%
11	0.0022264	34.45	47%
12	0.0023786	30.96	49%
13	0.0024266	28.92	53%
14	0.0019025	34.84	52%
15	0.0019991	32.63	57%
16	0.0022539	28.38	57%
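The stripmining table can be summarized with a few lines of Python. The sketch below copies the seconds and frequency columns from the table, then derives the speedup relative to a stripmining factor of 1 and an approximate cycle count. The cycle count is simply seconds times clock frequency, which assumes the circuit is clocked for the entire measured interval; the names `ROWS` and `summarize` are hypothetical.

```python
# (stripmining factor, seconds, clock frequency in MHz), copied from the table.
ROWS = [
    (1, 0.0135639, 38.12), (2, 0.0074069, 37.08), (3, 0.0051551, 37.62),
    (4, 0.0044855, 34.25), (5, 0.0039192, 32.70), (6, 0.0032890, 34.47),
    (7, 0.0027625, 36.79), (8, 0.0025715, 36.33), (9, 0.0024387, 35.59),
    (10, 0.0022889, 34.66), (11, 0.0022264, 34.45), (12, 0.0023786, 30.96),
    (13, 0.0024266, 28.92), (14, 0.0019025, 34.84), (15, 0.0019991, 32.63),
    (16, 0.0022539, 28.38),
]


def summarize(rows):
    """Return (factor, speedup vs. the first row, estimated cycles) per row."""
    baseline = rows[0][1]
    return [
        (factor, baseline / seconds, seconds * mhz * 1e6)
        for factor, seconds, mhz in rows
    ]


if __name__ == "__main__":
    for factor, speedup, cycles in summarize(ROWS):
        print(f"{factor:2d}  speedup {speedup:5.2f}  ~{cycles:,.0f} cycles")
    best = min(ROWS, key=lambda row: row[1])
    print("fastest stripmining factor:", best[0])
```

Looking at run time and clock frequency together shows where the diminishing returns come from: at factors 12, 13 and 16 the achievable clock frequency drops noticeably, which offsets the extra parallelism.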

Performance Results (older)

We compared the run-times of the SA-C programs on a reconfigurable processor with the original C programs written by KRI. The SA-C programs were compiled using the November, 2000 version of the SA-C compiler and run on an Annapolis Microsystems StarFire with Xilinx XV-1000 FPGAs. The C programs were run on a 450MHz Pentium II. (The Xilinx XV-1000 is roughly contemporaneous with the 450MHz Pentium II.) The run-times are given in the table below. The erode and dilate times are for a single operation. (Times for the sequence of four operations are given later.)

Routine	C Exec.	SA-C Exec.	RCS Data Download	RCS Config. Download	Frequency (MHz)	RCS Data Upload
Downsample	0.05	0.009761	0.033738	0.161508	29.963	0.012556
Erode (cross)	0.08	0.006102	0.008709	0.161871	28.279	0.012576
Erode (box)	0.08	0.006332	0.008712	0.161397	27.254	0.012562
Dilate (cross)	0.08	0.005499	0.008551	0.161516	31.383	0.012392
Dilate (box)	0.08	0.006198	0.008709	0.161615	27.842	0.012584
Bitwise AND	0.03	0.003136	0.016633	0.161510	38.877	0.012414
Pos. Diff.	0.03	0.003170	0.016405	0.161532	38.445	0.012370
Maj. Thresh	0.03	0.005396	0.008589	0.161505	31.977	0.012370

In general, execution times on the reconfigurable system are about ten times faster. The downside, however, is the cost of downloading the circuit configuration onto the FPGA, and of transporting the data back and forth between the RCS system and the host processor. The time it takes to download a configuration swamps all other costs. As a result, it is only feasible to accelerate one operator per FPGA, since FPGAs cannot be quickly reconfigured and must therefore be pre-loaded. Given this, the gain in speed of the reconfigurable processor justifies the cost of downloading and uploading the image data for every routine except downsample (which gets the largest images), although the total improvement is reduced to a factor of four.

We get an even greater improvement in performance if we fuse sequences of four dilations or four erosions into one operator. In this case we get the table below, which shows a twenty-five-fold speedup using SA-C and the reconfigurable processor. Even if data download and upload times are included, there is still a fifteen-fold improvement.

Algorithm	Seconds (AMS Starfire)	Cycles (MClocks)	Logic Blocks
Dilate (8)	0.0019025210	66,283	94%

Routine	C Exec.	SA-C Exec.	RCS Data Download	RCS Config. Download	Frequency (MHz)	RCS Data Upload
Erode (4)	0.32	0.012962	0.009162	0.161483	25.0	0.012512
Dilate (4)	0.32	0.012962	0.009164	0.161513	25.0	0.012726
