The ARAGTAP Pre-screener in SA-C


Introduction

ARAGTAP is an automatic target recognition (ATR) program for synthetic aperture radar (SAR) images developed at Wright Labs for the U.S. Air Force. The ARAGTAP pre-screener is the initial focus-of-attention module; its job is to extract regions of interest (ROIs) around potential targets for further evaluation. The ARAGTAP pre-screener was reimplemented at Khoral Research, Inc. (KRI) as a Khoros workspace. The workspace contains eleven glyphs and control logic. The eleven glyphs are: downsample, erode (cross), erode (box), dilate (cross), dilate (box), bitwise and, positive difference, majority threshold, prune, label, and display. The figure below shows the ARAGTAP workspace in Cantata.

We reimplemented eight of the eleven glyphs in SA-C, while keeping the ARAGTAP pre-screener as a Khoros workspace. (For more information on creating and executing SA-C programs in Khoros, click here.) We did not implement the labeling algorithm for consistency's sake: the labeling algorithm in ARAGTAP is highly sequential. A parallel version is possible, but would behave slightly differently. The current sequential algorithm runs more quickly on a host than on the more parallel (but slower clock speed) FPGA. The pruning algorithm had similar constraints, and it did not make sense to implement the display glyph on an FPGA since it doesn't have access to the monitor.

Most of the computation in the ARAGTAP pre-screener occurs in the dilate and erode glyphs. For each image, there are eight consecutive dilation operations, alternating between the cross and box kernels. Later on, there is a matching sequence of eight erosions. Taking advantage of SA-C's loop fusion capabilities, we compiled a single SA-C glyph that computes four alternating dilations, and another glyph that computes four alternating erosions. Two calls to the fused dilation glyph replace the eight calls to the original dilation glyphs, and two calls to the fused erosion glyph replace the eight calls to the original erosion glyphs.
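As a rough illustration of what the fused glyphs compute, the sketch below applies four alternating dilations (or erosions) in plain Python. It shows only the result of the sequence, not the loop fusion itself: the passes here run one after another, whereas the SA-C compiler merges them into one pass. The 3x3 kernel size, the choice to start with the cross kernel, the handling of image borders, and all function names are assumptions made for this example, not details taken from ARAGTAP.

```python
# Kernel offsets as (row, col) pairs. The 3x3 size is an assumption.
CROSS = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]
BOX = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]


def morph(image, kernel, pick):
    """Apply pick (max or min) over each pixel's kernel neighbourhood.

    Neighbours that fall outside the image are skipped (an assumed
    border policy).
    """
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            row.append(pick(
                image[r + dr][c + dc]
                for dr, dc in kernel
                if 0 <= r + dr < rows and 0 <= c + dc < cols
            ))
        out.append(row)
    return out


def dilate(image, kernel):
    return morph(image, kernel, max)


def erode(image, kernel):
    return morph(image, kernel, min)


def alternate(image, op, count=4):
    """Run op count times, alternating cross and box (cross first, assumed)."""
    for i in range(count):
        image = op(image, CROSS if i % 2 == 0 else BOX)
    return image


def dilate4(image):
    return alternate(image, dilate)


def erode4(image):
    return alternate(image, erode)
```

Calling dilate4 twice on an image then corresponds to the eight consecutive dilations described above, and likewise for erode4.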

The ARAGTAP pre-screener Khoros workspace, with SA-C glyphs. SA-C glyphs are differentiated from C glyphs by the red FPGA chip in the upper right corner of the glyph.


Performance Results

Our newest performance results cover the sequences of dilations and erosions, which take the bulk of the time in this application. Using the May 13, 2001 version of the SA-C compiler, we fused all eight dilations (and separately all eight erosions) into a single FPGA configuration. Since this did not fill up the FPGA, we had the compiler apply vertical stripmining to increase parallelism and decrease I/O traffic. The resulting run-times on an AMS WildStar using a single Virtex 2000E FPGA are shown below, in comparison to an 800MHz Pentium III. The test image was a 420x305 8-bit image.

Algorithm     Seconds (AMS Starfire)     Cycles (MClocks)     Logic Blocks
Dilate (8)    0.0019025210               66,283               94%

The table above gives the performance numbers for the best optimization settings, which in this case involved vertically stripmining 14 times, so that on each cycle the system computes a column of fourteen final pixels. (Remember that it has fused all eight iterations, so each "final pixel" is the result of eight dilations. See optimizations for more on stripmining and other optimizations.)
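One way to see why computing a column of pixels per cycle can reduce I/O traffic is to count input rows per output pixel for a sliding window. The sketch below is a back-of-the-envelope model, not a description of the hardware the SA-C compiler generates: it assumes a window of height k, so that a column of n vertically adjacent outputs needs k + n - 1 input rows because neighbouring windows overlap. The function names are hypothetical, and the kernel height of 17 used in the usage line is an assumption (eight fused 3x3 stages, each adding two rows).

```python
def window_rows(kernel_height, strip):
    """Input rows needed to produce `strip` vertically adjacent outputs."""
    return kernel_height + strip - 1


def rows_read_per_output(kernel_height, strip):
    """Input rows fetched per output pixel in one column step."""
    return window_rows(kernel_height, strip) / strip


# Assumed kernel height of 17 (eight fused 3x3 stages); illustrative only.
for strip in (1, 2, 4, 8, 14):
    print(strip, round(rows_read_per_output(17, strip), 2))
```

Under this model the rows read per output shrink as the strip grows, but each extra row of the strip saves less than the one before it.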

The table below shows how performance improves as the amount of stripmining increases, although a law of diminishing returns sets in near the end.

Stripmining #    Seconds      Frequency (MHz)    Logic Blocks
1                0.0135639    38.12              23%
2                0.0074069    37.08              26%
3                0.0051551    37.62              28%
4                0.0044855    34.25              30%
5                0.0039192    32.70              32%
6                0.0032890    34.47              35%
7                0.0027625    36.79              37%
8                0.0025715    36.33              39%
9                0.0024387    35.59              43%
10               0.0022889    34.66              44%
11               0.0022264    34.45              47%
12               0.0023786    30.96              49%
13               0.0024266    28.92              53%
14               0.0019025    34.84              52%
15               0.0019991    32.63              57%
16               0.0022539    28.38              57%
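
The run-times above combine two effects: the number of clock cycles and the clock frequency the circuit achieves. The sketch below separates them, assuming that cycles can be estimated as seconds times frequency (an assumption, though at a stripmining factor of 14 it reproduces the 66,283 cycles reported for Dilate (8)). The data are copied from the stripmining table; the names are hypothetical.

```python
# (stripmining factor, seconds, frequency in MHz), copied from the table.
ROWS = [
    (1, 0.0135639, 38.12), (2, 0.0074069, 37.08), (3, 0.0051551, 37.62),
    (4, 0.0044855, 34.25), (5, 0.0039192, 32.70), (6, 0.0032890, 34.47),
    (7, 0.0027625, 36.79), (8, 0.0025715, 36.33), (9, 0.0024387, 35.59),
    (10, 0.0022889, 34.66), (11, 0.0022264, 34.45), (12, 0.0023786, 30.96),
    (13, 0.0024266, 28.92), (14, 0.0019025, 34.84), (15, 0.0019991, 32.63),
    (16, 0.0022539, 28.38),
]


def estimated_cycles(seconds, mhz):
    """Estimate clock cycles as run-time multiplied by clock frequency."""
    return seconds * mhz * 1e6


def speedup_over_baseline(seconds, baseline=ROWS[0][1]):
    """Speedup relative to the run-time with a stripmining factor of 1."""
    return baseline / seconds


CYCLES = {n: estimated_cycles(s, f) for n, s, f in ROWS}
SPEEDUPS = {n: speedup_over_baseline(s) for n, s, f in ROWS}

for n, s, f in ROWS:
    print(f"{n:2d}  {CYCLES[n]:9.0f} cycles  {SPEEDUPS[n]:5.2f}x")
```

Under this estimate the cycle count falls at every step of the table, so the rows where the run-time gets worse (12, 13 and 16) coincide with a drop in achieved clock frequency rather than with extra cycles.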


Performance Results (older)

We compared the run-times of the SA-C programs on a reconfigurable processor with the original C programs written by KRI. The SA-C programs were compiled using the November 2000 version of the SA-C compiler and run on an Annapolis Microsystems StarFire with Xilinx XV-1000 FPGAs. The C programs were run on a 450MHz Pentium II. (The Xilinx XV-1000 is roughly contemporaneous with the 450MHz Pentium II.) The run-times are given in the table below. The erode and dilate times are for a single operation. (Times for the sequence of four operations are given later.)

Routine           C Exec.    SA-C Exec.    RCS Data Download    RCS Config. Download    Frequency (MHz)    RCS Data Upload
Downsample        0.05       0.009761      0.033738             0.161508                29.963             0.012556
Erode (cross)     0.08       0.006102      0.008709             0.161871                28.279             0.012576
Erode (box)       0.08       0.006332      0.008712             0.161397                27.254             0.012562
Dilate (cross)    0.08       0.005499      0.008551             0.161516                31.383             0.012392
Dilate (box)      0.08       0.006198      0.008709             0.161615                27.842             0.012584
Bitwise AND       0.03       0.003136      0.016633             0.161510                38.877             0.012414
Pos. Diff.        0.03       0.003170      0.016405             0.161532                38.445             0.012370
Maj. Thresh       0.03       0.005396      0.008589             0.161505                31.977             0.012370

In general, execution times on the reconfigurable system are about ten times faster. The downside, however, is the cost of downloading the circuit configuration onto the FPGA, and of transporting the data back and forth between the RCS system and the host processor. The time it takes to download a configuration swamps all other costs. As a result, it is only feasible to accelerate one operator per FPGA, since they cannot be quickly reconfigured and must therefore be pre-loaded. Given this, the gain in speed of the reconfigurable processor justifies the cost of downloading and uploading the image data for every routine except downsample (which gets the largest images), although the total improvement is reduced to a factor of four.

We get an even greater improvement in performance if we fuse sequences of four dilations or four erosions into one operator. In this case we get the table below, which shows a twenty-five-fold speedup using SA-C and the reconfigurable processor. Even if data download and upload times are included, there is still a fifteen-fold improvement.

Routine       C Exec.    SA-C Exec.    RCS Data Download    RCS Config. Download    Frequency (MHz)    RCS Data Upload
Erode (4)     0.32       0.012962      0.009162             0.161483                25.0               0.012512
Dilate (4)    0.32       0.012962      0.009164             0.161513                25.0               0.012726
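
The speedups discussed in this section are ratios of the C execution time to the time spent on the reconfigurable system. A minimal sketch of that calculation follows, using the Erode (4) row as input. The helper name and the idea of passing transfer times as optional arguments are choices made for this example; which transfer columns to charge against the FPGA is left to the caller.

```python
def speedup(c_exec, fpga_exec, *transfer_times):
    """C time divided by FPGA time plus any transfer times passed in."""
    return c_exec / (fpga_exec + sum(transfer_times))


# Values copied from the Erode (4) row of the table above.
C_EXEC = 0.32
SA_C_EXEC = 0.012962
DATA_DOWNLOAD = 0.009162
DATA_UPLOAD = 0.012512

# Execution time only, with the configuration assumed to be pre-loaded.
exec_only = speedup(C_EXEC, SA_C_EXEC)
print(f"execution-only speedup: {exec_only:.1f}x")
```

With execution time alone the ratio comes out just under twenty-five, in line with the figure quoted above. Passing DATA_DOWNLOAD, DATA_UPLOAD, or a configuration download time as extra arguments shows how each overhead erodes the gain.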