| |
AXIAL TOMOGRAHIC RECONSTRUCTION
Computed axial tomographic reconstruction using back-projection
is presented as a challenging real-world application illustrating
the performance obtainable with Alacron's FastSeries architecture
compared to standard cluster architectures. We will calculate the
performance of the latest microprocessors from Analog Devices, Intel,
Philips Semiconductors, and Texas Instruments.The important properties
of the selected processors are presented
in the table below. Cluster mode designs reach bus saturation very
quickly to the point where adding additional processors will not
improve the rate at which calculations can be performed. The table
assumes that the processors are isolated from each other when they
are performing the calculations. This assumption is valid because
most multiprocessor co-processor boards have isolated or local memory
designs due to scalability issues. The only non-local memory design
is the "native" PIII-450 design, added for comparison purposes.
| SPECIFICATION
|
Intel PIII-450
|
Philips
TM1300 |
TI
C6701 |
ADI
21160 |
| Architecture
|
CISC
|
VLIW
|
VLIW
|
VLIW
|
| FPU
|
Yes
|
Yes
|
Yes
|
Yes
|
| MFLOPs
(Peak) |
250
|
720
|
1000
|
600
|
| MIPS
(Peak) |
450
|
900
|
1336
|
100
|
| MOPS
(Peak) |
900
|
4860
|
1336
|
800
|
| Memory
Bus Bandwidth (MB/s) |
800 |
570 |
332 |
400 |
| 1K
FP cfft (µsec) |
300 |
106 |
108 |
90 |
| 1K
16 bit cfft (µsec) |
191 |
63 |
108 |
90 |
| 1F
FP dot product (µsec) |
7.38 |
2.84 |
3.07 |
5.12 |
| 16
x 16 MACs (MMAC/s) |
2.32 |
1.42 |
3.07 |
5.32 |
| 8
x 8 MACs (MMAC/s) |
9.34 |
6.55 |
7.11 |
11.80 |
| 512x512x8
bit Conv 3x3 (msec) |
5.34 |
2.62 |
7.11 |
11.80 |
| 512x512x512x
8 bit Erosion/Dilation (msec) |
7.34 |
1.42 |
3.62 |
3.93 |
| "Glue"
Logic Cost ($/CPU) |
$150 |
$3 |
$65 |
$39 |
| CPU
Price ($) |
$300 |
$100 |
$150 |
$100 |
Description of the Application
The reconstruction region is a square array of 1024
x 1024 pixels centered on the axis of rotation of the x-ray head
(Figure 1). Data is collected from the x-ray head as radial scans,
with 4 channels of 4096 values for each of 360 angles. The data
collected occupies 4*4096*360*2 = 11,796,480 bytes. The data is
moved into memory at an average rate of 5 megabytes per second (Figure
2).

Figure 1: Relation of the region of reconstruction
to the x-ray source and detector array.

Figure 2: Data collection.
The following steps reconstruct the pixel densities:
- Calculation of the density estimate from the output of the four
detectors at the end of each beam path.
- Back-projection. Conversion of the fan shaped polar form of
the data to the square array of pixels in the reconstruction region
is accomplished by summing the density along each ray into each
pixel the ray encounters. Linear interpolation is used when the
ray passes between two pixels (Figure 3).

Figure 3: Back Projection.
Four channels of data are used to estimate the actual
density along the beam path to compensate for the normal beam hardening
that occurs as x-rays pass through an object.
The value summed into each pixel is:

where Dbeam is the density measured for the current
beam being processed and alpha is determined from the geometry of
the beam.
- Application of a 2D FFT to the image array.
- Application of a weighting factor of the distance of each pixel
from the axis of rotation. This compensates for the greater number
of rays passing through pixels closer to the axis of rotation.
- Application of an inverse 2D FFT to the whole array
Steps 2 through 5 effectively apply a convolution to the data
to correct for the weighting error generated by the beam geometry,
giving a good approximation to the density distribution of the
object positioned in the reconstruction area 1.
Algorithmic Complexity
The number of operations performed in each step can
be counted as follows. These are summarized in Table 1.
Density estimation from detector data: A fifth
order polynomial is applied to the output of each the four detectors
at the end of each beam path, and the result of these four polynomials
are summed to give the density estimate. This step requires 4*(5
Multiples + 5 Adds) plus three adds per beam path, or 81,920 MACs
(multiply accumulates) and 12,288 adds. For 360 angles, the total
number of MACs is 29,491,200, and the total number of adds is 4,423,680.
Additionally, each detector has an array of 24 coefficients and
4 values that must be read from memory, followed by a write of the
density estimate. This results in 41,287,680 reads from memory and
1,474,560 writes to memory to process all 360 angles.
Back-projection: For each angle, one third of
the beam path pass into the top and out from the bottom of the reconstruction
region, encountering 2 pixels for each of the 1024 lines it crosses.
The weights for each beam path is pre-computed, and so each of the
2048 pixels on these 1366 beams requires one MAC, or a total of
2,797,568 MACs. The other two thirds of the beam paths access from
3 to 2048 pixels depending on how far up the left or right side
of the reconstruction region the ray path lies. In total, these
beam paths access 2,300,323 pixels, requiring 2,300,323 MACs. For
all beams then, 5,097,891 MACs are needed. Each MAC requires a pixel
value to be read from memory, a table value to be read from memory,
and a final write of the new pixel value back to memory. For all
of the 360 angles, the back-projection step required 1,835,240,760
MACs, 3,670,481,520 reads from memory, and 1,835,240,760 writes
to memory.
2D FFT: A 1024 x 1024 2D real FFT is composed
of 2048 real 1024 point FFTs, 1024 along the rows and 1024 along
the columns. Two real signals can be processed concurrently with
one complex 1024 point FFT plus 2,044 additional additions. Two
real FFTs can be performed by this method in 460 microseconds on
a single 40 MHz SHARC 2106x processor, averaging to 230 microseconds
for a 1024 point real FFT. The whole 2D FFT thus requires 0.470
seconds plus, plus 0.013 seconds required to read and write the
whole array twice from memory, or 0.483 seconds on a single processor.
Pixel weighting by radial position: A pre-computed
table of radius values is read with the image data, multiplied and
written back to memory. This step requires 2,097,152 reads from
memory, 1,048,576 multiplies, and 1,048,576 writes to memory.
Inverse 2D FFT: The 2D inverse FFT requires the
same number of operations as required by the 2D FFT in step 3.
| Step
|
MACs
|
Adds
|
Reads
|
Writes
|
| 1
|
29,491,200
|
4,423,000
|
41,287,000
|
1,474,000
|
| 2
|
1,835,240,000
|
0
|
3,670,481,000
|
1,835,240,000
|
| 3
|
28,800,000*
|
|
2,097,000
|
2,097,000
|
| 4
|
1,048,000
|
|
2,097,000
|
1,048,000
|
| 5
|
28,800,000*
|
|
2,097,000
|
2,097,000
|
| Total
|
1,923,380,000
|
4,423,000
|
3,718,060,000
|
1,841,958,000
|
Table 1: Operations required.
*Based on averaged 230 µs FFT time.
Parallelization of the Algorithm
In order to improve performance, the computation
will be divided between processors by splitting the reconstruction
region into horizontal stripes. The first processor will get the
first N rows (N = 1024 / number of processors), and each processor
will get succeeding rows. The algorithm is inherently linear allowing
this natural division between processors.
Memory Requirements
This algorithm uses several pre-computed tables.
Each detector has a table of 24 calibration coefficients for a total
of 98,304 four-byte floating point numbers. The pixel radius value
table (distance from the rotation axis) requires 131,072 floats,
taking advantage of symmetries. Again taking advantage of symmetries
the ray table requires 57,351,273 floats. The reconstructed image
requires 1,048,576 floats, and the input data requires 11,796,480
bytes. The total memory requirement is therefore under 256 megabytes.
Total (Elapsed) is the time the routine takes to
run on one processor with overlapping I/O and compute.
| Processor
|
Total
|
Time
(Mac) |
Time
(Add) |
Time
(read) |
Time (write) |
| TM1300-180
|
38.88
|
2.67
|
0.006
|
26.00
|
12.88
|
| TI-C6701
|
85.18
|
5.79
|
0.013
|
57.27
|
27.91
|
| ADSP21160
|
55.60
|
9.62
|
0.022
|
37.18
|
18.42
|
| PIII-450-MMX
|
27.78
|
8.55
|
0.009
|
18.58
|
9.20
|
Processor Cost vs. Reconstruction Time
The illustration below shows the cost/performance analysis for the
four processors.
Click to Register
to Download the application
note in pdf format
|