The Future of High Performance Machine Vision
Introduction
Recently, several of Alacron’s customers in the inspection and
semiconductor industries have asked us to integrate multiple fast,
large CCD arrays or the new high-frame-rate CMOS sensors, with data
rates in the 500 to 1000 MB/sec range, into real-time systems. This
trend is opening a two-tiered approach to machine vision. One avenue
is “native” computing, i.e. Pentium-based processing with a “basic”,
i.e. non-accelerated, frame grabber.
The other avenue is to accelerate processing before the data is
transferred to the PC, as represented by accelerated frame grabbers
or cameras.
The native approach is usually preferred by customers
because it provides an environment that is:
1. Easy to program, with optimized native libraries from multiple vendors
2. Fast enough for real-time processing
3. Cheaper to deploy
As this author noted in the supplement to the May 2002
Vision System Design Magazine, this approach is feasible
for the lower end of the frame grabber market, which will probably
migrate to USB 2.0 or IEEE 1394, i.e. FireWire, because of adequate
performance, widespread availability, and low- or no-cost motherboard
options. These interfaces therefore satisfy reasons 1 to 3 above for
data rates in the 0-40 MB/sec range, which is adequate for a
large portion of the machine vision market.
This range is also within the realistic throughput of a single-
or dual-processor Intel Pentium system.
What is the alternative if a customer would like to
do more intensive processing or use significantly higher
sensor data rates, which for newer multiple CCD and CMOS sensors
are approaching or exceeding 1 GB/sec? The
“native” approach to this problem is to buy an SMP Pentium box
with a bus of adequate throughput, i.e. a 32- or 64-bit PCI bus, and
an adequate basic frame grabber.
While this may seem to be the direct solution, it may not
be the cheapest, the fastest, or the easiest to deploy in the high
performance machine vision environment.
Scalability of Native Processing
To examine the feasibility of scaling native processing,
we need to look at the two memory schemes that are commercially
available: cluster and shared memory. In the cluster model (the
shared private distributed memory, or SPDM, approach), every
processor has its own local memory in which to operate. A stack of
PCs linked by 100 Mbit or Gigabit Ethernet exemplifies this
model; an embedded example is Coreco’s Mamba Series.
Both performance and cost generally scale linearly with the number
of units for a reasonable number of units, i.e. fewer than 10.
The other scheme, shared memory, is inherent in commercially
available server and workstation units, which come with a support
chip that shares memory among four to eight processors.
The cost of these units is often not linear but super-linear in the
number of processors, while their performance is sub-linear.
To see this effect, Alacron used the Intel Graphics Suite
to benchmark the scalability of two- and four-processor shared-memory
architectures. The extrapolation to eight processors is straightforward,
since an eight-processor solution is no faster than a cluster
of two four-processor shared-memory units.
The Intel Fusion Chipset supplies this architecture. For the shared-memory
two-processor model we obtained a performance factor of 1.6 to 1.7, i.e.
running two threads of the Intel library delivered about 1.7 times the
throughput of the uni-processor model.
For the four-processor configuration running four threads,
the result was about 5/8 of four times the uni-processor throughput.
This leads to a scalability factor of approximately 85% for the
two-processor model, i.e. 2 processors have the throughput of 1.7,
and approximately 65% for the four-processor model, i.e. 4 processors
have the throughput of 2.6 processors.
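The scalability factor can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes the factor is simply measured throughput divided by processor count, takes the 1.7 and 2.6 throughput figures quoted above as inputs, and uses a hypothetical helper name. Under that assumption it gives the 85% and 65% figures used at the start of the next section.

```python
def scaling_efficiency(throughput: float, processors: int) -> float:
    """Fraction of ideal linear speedup achieved by a multi-processor box.

    throughput is expressed in units of one processor's throughput.
    """
    if processors <= 0:
        raise ValueError("processors must be positive")
    return throughput / processors


# Throughput figures quoted in the text, relative to a single processor.
measured = {2: 1.7, 4: 2.6}

for count, throughput in measured.items():
    efficiency = scaling_efficiency(throughput, count)
    print(f"{count} processors: {efficiency:.0%} of linear scaling")
```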
Microprocessor and FPGA Comparisons
To understand which approach, native or otherwise,
has the advantage, we studied the relative performance of near-future
microprocessor and FPGA offerings, the cost of implementation,
and the power consumption relative to throughput.
From this data we can then examine the implications for
the future of high performance machine vision.
In the section above we established that the scalability
of the shared-memory (SMP) native solution is approximately 85% for
two processors and 65% for four.
The cluster, or shared private memory, approach usually scales
linearly or nearly so with the number of processing units because:
- There is no inter-processor contention for a common
  piece of hardware.
- Splitting the I/O streams does not unduly
  burden a processor with I/O that it does not need. For example,
  a 1 GByte/sec I/O stream split over 8 processors results in a
  125 MB/sec load on each processor of a cluster or SPDM machine,
  whereas every processor sees the full 1 GByte/sec load in
  the SMP solution.
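The I/O example above can be written as a small calculation. This is a minimal sketch with hypothetical names, and it assumes the stream divides evenly across the cluster nodes and that 1 GByte/sec means 1000 MB/sec, as the 125 MB/sec figure implies.

```python
def per_processor_load_mb(stream_mb_per_s: float, processors: int,
                          shared_memory: bool) -> float:
    """I/O load seen by each processor, in MB/sec.

    In a cluster (SPDM) each node receives only its share of the stream;
    in a shared-memory (SMP) box every processor sees the whole stream.
    """
    if processors <= 0:
        raise ValueError("processors must be positive")
    if shared_memory:
        return stream_mb_per_s
    return stream_mb_per_s / processors


stream = 1000.0  # 1 GByte/sec sensor stream, as in the example above
print("cluster:", per_processor_load_mb(stream, 8, shared_memory=False), "MB/sec")
print("SMP:    ", per_processor_load_mb(stream, 8, shared_memory=True), "MB/sec")
```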
Solution Comparisons
To measure relative performance for imaging,
we selected a suite of routines and computed a performance ratio
for each. We are comparing
near-future state-of-the-art processors from Intel, Philips,
and Motorola with FPGAs (field programmable gate arrays) from Xilinx,
which represent the various solutions vendors are using to handle
high data rate or compute-intensive applications. The table below
consists of examples of processors and FPGAs that are new or
will soon be introduced, and
their imaging performance, cost, and power relative to the benchmark
Intel P4. The table entries indicate speed relative to the P4;
a larger number means faster.
| Parameter           | TriMedia1500 (300 MHz) | MPC8540 (1000 MHz) | MPC7455 (1300 MHz) | Xilinx V2 (3000) | P4 (3000 MHz) |
|---------------------|------------------------|--------------------|--------------------|------------------|---------------|
| ImageAdd            | 0.26                   | 0.90               | 0.77               | 29.00            | 1.00          |
| Sobel8              | 0.28                   | 0.43               | 2.01               | 1.30             | 1.00          |
| 3x3 Conv            | 0.51                   | 0.96               | 2.30               | 1.30             | 1.00          |
| 11x11 conv          | 1.37                   | 2.31               | 12.18              | 41.83            | 1.00          |
| 3x3 erode gray      | 0.90                   | 0.58               | 1.22               | 0.69             | 1.00          |
| Hist8 (32)          | 2.10                   | 2.24               | 1.89               | 1.08             | 1.00          |
| 2D FFT              | 1.19                   | 0.72               | 2.61               | 21.29            | 1.00          |
| Lut8                | 0.86                   | 1.11               | 0.95               | 0.53             | 1.00          |
| Cost                | $60                    | $100               | $300               | $600             | $350          |
| Power               | 3W                     | 7W                 | 20W                | 4W               | 60W           |
| Average Performance | 0.94                   | 1.16               | 2.99               | 8.50             | 1.00          |
| Relative Cost       | $64                    | $86                | $100               | $70              | $350          |
| Relative Power      | 3.2W                   | 6W                 | 6.7W               | 0.5W             | 60W           |
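The Relative Cost and Relative Power rows appear to be the raw Cost and Power figures divided by Average Performance, i.e. the cost and power of one P4's worth of imaging throughput. The sketch below is a check under that assumption, not the author's stated method; the function and variable names are hypothetical, and the inputs are copied from the Cost, Power, and Average Performance rows. It reproduces the table's relative figures to within rounding.

```python
# (cost in USD, power in W, average performance relative to the P4),
# copied from the Cost, Power, and Average Performance rows of the table.
DEVICES = {
    "TriMedia1500": (60.0, 3.0, 0.94),
    "MPC8540": (100.0, 7.0, 1.16),
    "MPC7455": (300.0, 20.0, 2.99),
    "Xilinx V2": (600.0, 4.0, 8.50),
    "P4": (350.0, 60.0, 1.00),
}


def normalize(value: float, average_performance: float) -> float:
    """Scale a raw cost or power figure to one P4's worth of throughput."""
    if average_performance <= 0:
        raise ValueError("average_performance must be positive")
    return value / average_performance


def relative_figures(devices: dict) -> dict:
    """Return {device: (relative cost, relative power)} for every device."""
    return {
        name: (normalize(cost, perf), normalize(power, perf))
        for name, (cost, power, perf) in devices.items()
    }


for name, (rel_cost, rel_power) in relative_figures(DEVICES).items():
    print(f"{name:13s} ${rel_cost:6.1f}  {rel_power:5.2f} W per P4-equivalent")
```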
This table demonstrates the relative advantages and
disadvantages of each approach to solving
machine vision problems. The
conclusions are:
· Performance. As one can see, the FPGA
  approach has a significant performance advantage.
· Implementation Cost. This table shows that an embedded or
  FPGA solution is the most cost effective if more than one or two
  P4s are needed to handle the data flow.
It should also be noted that sensor or frame grabber
I/O is limited by the latest PC bus architecture, PCI-X,
at 64 bits x 133 MHz, or approximately 1.06 GB/sec.
In addition, most native solutions
simply cannot sustain these data
rates at the high end, even in a cluster or SMP native environment,
because of the prohibitive cost of real-time processing.
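The bus ceiling is simple arithmetic: bus width in bytes times clock rate. The sketch below uses a hypothetical helper and computes a theoretical peak only, on the assumption of one transfer per clock and no protocol overhead. With the 64-bit, 133 MHz figures quoted above the product comes to 1064 MB/sec, so a sensor stream near 1 GB/sec would consume almost the entire bus.

```python
def peak_bus_bandwidth_mb(width_bits: int, clock_mhz: float) -> float:
    """Theoretical peak of a parallel bus in MB/sec.

    Assumes one transfer per clock and ignores protocol overhead,
    so sustained rates will be lower.
    """
    if width_bits <= 0 or clock_mhz <= 0:
        raise ValueError("width_bits and clock_mhz must be positive")
    return width_bits / 8 * clock_mhz


pci_x = peak_bus_bandwidth_mb(64, 133)  # PCI-X figures from the text
for sensor_rate in (500.0, 1000.0):     # sensor data rates from the introduction
    share = sensor_rate / pci_x
    print(f"{sensor_rate:.0f} MB/sec sensor uses {share:.0%} of a {pci_x:.0f} MB/sec bus")
```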
The power consumed by processors is
also important, especially if one considers placement in
a camera to be the optimal solution for high-speed imaging.
The camera environment provides a significant advantage
for processing, since the choice of frame grabber and processing
platform is already settled inside the camera. In addition,
significant data reduction can take place
before data is downloaded from the camera, which significantly
decreases integration and frame grabber costs.
Conclusion
The native solution is both feasible and desirable for data rates and
camera applications that can be handled by one or two P4s, which
is cost effective for data rates in the 40 MB/sec or somewhat greater
range. However, when data
rates or real-time processing become intensive, i.e. 80 MB/sec or
greater, an embedded or FPGA solution is more cost effective and
efficient, and no more difficult to develop, than the native solution.
If one wants to produce smart or embedded-processor cameras,
then the only feasible choice is the embedded processor, with the FPGA
solution being optimal: most image preprocessing requires
only a limited repertoire of fixed-point operations, which gives the
FPGA solution a distinct advantage.
Thus the FPGA approach:
With the recent introduction of hybrid FPGAs with embedded processors,
e.g. the Xilinx Virtex-II Pro, this combination may significantly
simplify the choices for a manufacturer, who can modify
the mix of cells and processors as needed in either the camera or
the frame grabber to meet the needs of the customer's particular
application.
Thus individual customization using the hybrid solution
may be the preferred approach in the not too distant future, when
the hybrid FPGA/processor solution becomes more widely available
and cost effective.