Programming a Pipelined Image Processor




COMPUTER VISION AND IMAGE UNDERSTANDING

Vol. 64, No. 3, November 1996, pp. 351–367, Article No. 0064

THOMAS J. OLSON,* Personal Systems Laboratory, Texas Instruments, Inc., P.O. Box 655303, Dallas, Texas 75265

JOHN R. TAYLOR, FORE Systems, Pittsburgh, Pennsylvania 15237

AND

ROBERT J. LOCKWOOD, Digital Systems Design, Tektronix, Inc., 13975 SW Karl Braun Drive, Beaverton, Oregon 97077

Received September 2, 1994; revised June 22, 1995

Real-time computer vision systems often make use of dedicated image processing hardware to perform the pixel-oriented operations typical of early vision. This type of hardware is notoriously difficult to program, limiting the types of experiments that can be performed and posing a serious obstacle to research progress. This paper describes a pair of programming tools that we have developed to simplify the task of building real-time early vision systems using special-purpose hardware. The system allows users to describe computations in terms of coarse-grained dataflow graphs constructed using an interactive graphical tool. At initialization time it compiles these graphs into efficient executable programs for the underlying hardware. The system has been implemented on a popular commercial pipelined image processor. We describe the computational model that the system supports, the facilities it provides for building real-time vision applications, and the algorithms used to generate effective execution schedules for the target machine. © 1996 Academic Press, Inc.

1. INTRODUCTION

Vision systems for robots and autonomous vehicles must meet severe throughput and latency requirements, and they place extreme demands on current-generation computing hardware. Pipelined image processors have proven to be a cost-effective way to meet these demands, at least for early visual processing, and have become very popular among researchers working on real-time vision problems.

* Correspondence should be addressed to: Thomas J. Olson, Texas Instruments, Inc., P.O. Box 655303, m/s 8374, Dallas, TX 75265. E-mail: [email protected].

Unfortunately, as these machines have become more powerful and sophisticated they have also become more difficult to program. The difficulty of writing software for these machines limits the kinds of experiments that can be performed and has become a serious obstacle to progress. This paper describes a software system that we have developed to simplify the task of programming the Datacube MV20 [17], a pipelined image processor that is widely used for research in robot vision. The system presents an abstract view of the device's capabilities, allowing the programmer to focus on the computation to be performed rather than the manipulations needed to map the computation onto the hardware. Because it is based on an abstract model of the machine, the system could be supported on other architectures as well.

The core of the programming system is VEIL (Virginia's Extensible Imaging Library), a C++ library that provides a dataflow abstraction for programming the underlying machine. VEIL represents computations as directed graphs whose nodes are standard image processing operators (add, subtract, convolve, etc.) and whose arcs represent communications channels. User programs can construct VEIL graphs procedurally by calling functions that instantiate nodes and link their inputs and outputs as needed. VEIL automatically maps the nodes and arcs of the graph onto the underlying hardware, breaking the graph into subgraphs if the computation is too complex to be performed in one step. User programs can control or interact with running VEIL graphs in various ways, making it easy to synchronize vision tasks with robot control computations. Extension mechanisms allow VEIL to support the development of special-purpose routines or the incorporation of new hardware.



Although programming in C++ using VEIL is straightforward, the compile-edit-debug cycle makes exploratory programming tedious. To simplify program development, we have written a graphical user interface called MERLIN that runs on top of VEIL and supports rapid prototyping of early vision algorithms. MERLIN allows users to create, run, and modify VEIL graphs by drawing them on the workstation screen. Graphs constructed using MERLIN can be loaded into VEIL applications and controlled in the same manner as procedurally constructed graphs.

The remainder of this paper is organized as follows: the next section describes the context of our work and presents the design goals of the VEIL/MERLIN system. An overview of the MV20 architecture is given in Section 3. Sections 4 and 5 describe VEIL and MERLIN from the user's point of view. Section 6 describes VEIL's internals, with particular attention to the scheduler. Section 7 presents conclusions and future plans.

2. BACKGROUND

Our primary motivation for writing VEIL has been to support research by ourselves and others in robot vision. We are particularly concerned with problems in active vision [3, 8, 9, 44]. The goal of active vision is to allow autonomous agents (vehicles or robots) to use visual information to accomplish tasks in the real world. Active vision work is characterized by an emphasis on interactions between sensing and behavior. Visual information is gathered incrementally and is used to guide real-time decisions about how sensor parameters (such as viewpoint) should be controlled and how incoming images should be processed.

Active vision applications make extreme demands on the underlying hardware and software systems. Typical computations require fast on-line processing of large amounts of image data. Latencies must be kept small to allow the agent to respond quickly to changes in the environment. Image computations are tightly interleaved with control and planning activities, so the environment used for image processing must also be able to support more general computations and I/O activities. The type of image processing to be performed may change rapidly in response to shifts in the agent's goals and priorities.

The extraordinary processing requirements of robot vision applications exceed the capabilities of current generation computers and make the use of dedicated image processing hardware necessary. The nature of research, on the other hand, demands systems that support rapid prototyping and exploratory programming; but dedicated image processing systems are notoriously difficult to program. This conflict motivated us to consider ways of making programming easier. Our previous experience in building coarse-grained dataflow systems [34] suggested that a dataflow paradigm would provide adequate flexibility for the programmer and would also map well to most dedicated image processors. This line of research led to the development of VEIL and MERLIN.

2.1. Related Work

Software environments for computer vision and image processing have been under development for many years and take a variety of forms. They generally perform one or more of the following functions:

• Encapsulating primitives. Image processing algorithms tend to rely heavily on repeated applications of a small number of well-understood primitive operations: convolution, pixelwise arithmetic, point transforms, and so on. Most image processing environments provide encapsulated implementations of these primitives. These may take the form of subroutine libraries [27], stand-alone executable programs [28, 37], or primitive commands in a textual or graphic shell [50].

• Interactive composition of primitives. Most modern systems allow users to invoke sequences of primitive operations interactively and to observe the results. Often the primitives are made available as stand-alone executables operating on files or pipes, as in HIPS [28], Vista [37], and the Khoros base system [38]. Other systems provide invocation via menus [4, 50]. Some systems allow primitives to be both defined and invoked using an interactive interpreted language such as LISP [29] or tcl [10, 35].

• Analysis and display. All systems provide some facility for examining the results of an image processing computation, e.g., displaying images on a monitor. Some also include sophisticated data analysis and plotting tools.

• Software engineering. Some image processing systems (see Lawton and McConnell [29] for a survey) are expressly designed to support the construction of large, complex applications, and include facilities for version control, image and code database management, and automatic interface generation.

In recent years coarse-grained dataflow [6] has emerged as the paradigm of choice for interactive construction of image processing programs. In this paradigm the computation is expressed as a directed graph, usually constructed using a graphical user interface. The nodes of the graph are the primitive operations provided by the system and have named input and output ports corresponding to the arguments and outputs of the operations. Conceptually, the source nodes of the graph generate potentially infinite sequences of images. Images flow through the arcs of the graph, triggering execution of a primitive function whenever they arrive at a node. This paradigm was first applied to image processing in the HI-VISUAL language [22] and has also been applied to scientific visualization [48, 42], signal processing [30], simulation [12], and data acquisition [24]. A number of visual dataflow languages have been built to explore issues such as user interface design [46] and implementation efficiency [34]. The paradigm has become well known via the widely distributed Khoros/Cantata system [38] and is now being retrofitted to older image processing and vision systems such as KBVision [4].

The concerns of programming systems for dedicated image processing hardware differ somewhat from those of the more general-purpose systems described above. Processing speed is the raison d'être of exotic architectures, so their programming interfaces tend to favor efficient use of the machine over ease of programming. They often mirror the structure of the hardware in the type of interface they present. The programming environment for the Aspex PIPE [26], for example, allows users to construct programs by manipulating graphical representations of the machine's registers and datapath elements. WitFlow [16] and Zebra/ZED [47] provide a similar level of access to Datacube products. Software tools of this type can make programming these architectures much more pleasant than it would otherwise be. However, they require the programmer to specify in detail how the computation should be mapped onto the hardware. The resulting programs are closely tied to the target architecture and, hence, are nonportable and hard to modify.

There have been some attempts to devise more abstract languages for programming image processing and vision hardware. The text-based languages APPLY [20] (for neighborhood operators) and ADAPT [49] (for split/merge computations) allow restricted classes of image processing operators to be expressed in an Ada-like language with data-parallel extensions. They were originally designed to support vision applications on the Warp [5] systolic array but have been ported to a number of other architectures as well. INSIGHT [40] is a text-based dataflow language for programming reconfigurable networks such as the PROTEUS vision machine [21]. It is one of the few systems designed to handle intermediate- and high-level vision computations as well as image processing. The price of this generality is that programmers are required to abandon imperative programming constructs and code in a definitional dataflow language [1].

Other researchers have described coarse-grained dataflow visual programming environments for image processing hardware. The Khoros system [38] can be instructed to distribute processes across a local area network. Stewart and Dyer [43] proposed a coarse-grain dataflow environment for the PIPE [26], and they described heuristic algorithms for mapping dataflow graphs to the hardware. Zeltner et al. [52] describe a similar programming system for the proposed SYDAMA-2 vision engine.

VEIL and MERLIN are outgrowths of our earlier attempt [45] to develop a visual programming language for the MV20. They owe their general appearance and interaction style to earlier dataflow visual languages such as AVS [48] and MAVIS [34] and their underlying computational model to signal processing systems such as Gabriel [30, 32]. VEIL differs from earlier systems primarily in its support for building embedded applications that combine image processing with higher level vision and control computations. In particular, it provides a novel and elegant mechanism for communication between dataflow programs running on an accelerator and conventional programs running on the host. VEIL produces highly efficient code and targets a type of architecture that has not been addressed in previous work. VEIL and MERLIN have been fully implemented and are in daily use at a number of sites.

2.2. Design Goals

The design of VEIL has been driven by our desire to conduct experimental research in active vision using real-time image processing hardware. We encountered many frustrations in trying to do this using existing tools. These experiences led us to establish the following requirements for the design:

• Ease of use. The most fundamental requirement on the system is that it be easy to learn and easy to use. In particular, programmers with expertise in vision and AI should be able to write useful programs within a day of being introduced to the system.

• Rapid prototyping. The ease of use of the system should extend to the development cycle. Instead of developing their algorithms on a workstation and then porting them to the real-time environment, users should be able to develop and debug applications directly on the special-purpose hardware.

• Efficiency. Ease of use must be achieved without too great a sacrifice in terms of speed. Some loss of efficiency is probably inevitable, but it should not be inherent in the design of the system. When application-specific optimizations are necessary, they should be accomplished by extending the base system rather than by abandoning it and recoding in some lower level language.

• Flexibility. Active vision computations take place in a behavioral context and are often tightly interleaved with planning and motor control activities. For this reason, the system may not place undue restrictions on the form that user programs can take. It must allow image processing operations to be embedded in arbitrary C programs and must allow the program to start and stop image processing computations or modify their parameters at any time.

• Extensibility/portability. The system should be extensible at several levels. First, it should support incremental performance enhancement. Programmers should be able to develop custom image processing routines that are optimized for particular applications and to use them freely in combination with the standard primitives provided by the system. Second, the system should support incorporation of new hardware as it becomes available and should be based on a model that is general enough to allow porting to other architectures in the future.

One goal that we did not adopt for VEIL was that of providing direct support for high-level vision computations. The immediate reason for this decision was that the target architecture is optimized for low-level pixel processing and provides no support for more abstract image descriptions. More generally, we felt that the dataflow paradigm encourages a functional, value-oriented programming style that is well suited to image processing but poorly suited to high-level vision. For the latter, an object-oriented style such as that provided by the Image Understanding Environment [27] is more natural. Accordingly, VEIL programs are expected to use dataflow to express their pixel-oriented operations and to perform higher level analysis in conventional, sequential C++ code. What VEIL does provide (and many image processing systems do not) is a natural, flexible way to combine dataflow image processing and higher-level sequential code in a single program.

3. THE HARDWARE

FIG. 1. Block diagram of the MV20. The principal components are the switch, the memories, and the four processing devices (AS, AG, AP, and AU).

VEIL is designed to be independent of the underlying hardware, but its design has been strongly affected by the capabilities of the particular device chosen as the initial target architecture: the Datacube MV20. In this section we describe the MV20 architecture and programming model.

The Datacube MaxVideo 20 [17] is a real-time image processing system that is widely used for active vision research. Figure 1 shows a block diagram of the device. The MV20 architecture can be broken down into three main sections: a programmable switch, a set of "smart" image memories, and a set of deeply pipelined processing elements.

The switch is the heart of the device and is responsible for its considerable flexibility. It allows any 10 of its 32 inputs to be connected to any of its 32 outputs. Each connection allows eight-bit data to be transmitted at 20 MHz, providing a maximum bandwidth of 640 Mbytes per second.

Each of the six memories has a capacity of 1 Mbyte (expanded to 4 or 16 Mbytes in recent versions of the device). Each memory also contains a pair of address generators that allow it to function as a two-dimensional vector register. The address generators can transfer rectangular arrays of eight-bit pixels in raster-scan order between the switch and the memory at a rate of 20 MHz. Each memory has one read port, one write port, and one read/write port supporting VME access. All three ports can be active simultaneously.

The processing elements perform computations on raster arrays received from the switch or generated internally. Each device is itself a complex network of multiplexors, switches, ALUs, etc. The basic services are provided by four devices:

• AS device: analog-to-digital conversion of video signals in a variety of formats.
• AG device: analog video output with overlay.
• AU device: ALU operations, multiplication, weighted sums, and statistics.
• AP device: 8 × 8 or paired 4 × 8 convolution, 8 × 8 and 16 × 16 lookup tables, and histogramming.

All of the devices are designed to operate on pixel arrays presented in raster-scan order at 20 MHz.

3.1. Programming Model

The MV20 does not have a program counter or instruction set and, hence, does not execute programs in the usual sense. Instead, the host computer (typically a UNIX workstation) controls the device's operation by storing values into appropriate registers. In order to perform a computation, the host configures the switches, multiplexors, and processing elements to form a network that performs the desired operations on any data that pass through it. It then commands the device to transmit rectangular arrays of pixels from one or more data sources (the AS device or memories) to the network inputs and to store the data appearing at the network outputs in one or more data sinks (the AG device, memories, or the statistics or histogramming sections of the processing elements). For computations that are too complex to be performed in a single pass, the host can repeat the process using the stored results of the previous computation as inputs to the next pass.

The standard programming interface to the MV20 is a vendor-supplied software package called ImageFlow [18].


ImageFlow consists of a sophisticated device driver and a library of functions that can be called from user programs written in C. The library conceals the necessary low-level register manipulations from the programmer, allowing him or her to describe the computation by referring to the processing elements and switch ports by name.

A typical ImageFlow program begins by constructing one or more processing networks (called pipes in ImageFlow terminology). It does this by calling ImageFlow functions that declare pixel arrays in memory, establish connections between the appropriate pixel arrays, switch ports, and processing elements, and set the processing attributes (e.g., convolution kernels and filter parameters) as desired. ImageFlow automatically configures the MV20's memory address generators so that the timing of the raster arrays reflects the desired geometric relationship between them. For example, if two images are to be added, ImageFlow ensures that corresponding pixels arrive at the ALU at the same time.

Once the pipes have been constructed, the application program has a number of options for controlling their execution. It can request that they be fired (executed) either once or continuously. Firing is nonblocking, so the application can fire a pipe, perform other work, and then perform a blocking wait for the event associated with the pipe. A series of pipes can be packaged into a pipe altering thread (PAT) and uploaded to the MV20 device driver, which will execute the pipes in sequence as rapidly as possible, either once or repeatedly. This feature allows user programs to perform other tasks while the MV20 works on a complex (multipipe) computation.

3.2. The Programming Problem

The ImageFlow library insulates the programmer from many of the idiosyncrasies of the hardware and provides a powerful and flexible way of controlling MV20 computations. However, programming the MV20 in ImageFlow remains painful even for relatively simple computations. The most obvious problem is simply that the device is extremely complex. For example, for the AU device alone the hardware reference manual identifies 109 distinct functional elements and provides six pages of block diagrams showing their interconnections. The result is that even trivial programs require dozens of function calls.

A second, subtler source of difficulty in programming the MV20 arises from the fact that ImageFlow programs describe computations by referring explicitly to the physical resources (memories and computational elements) required to produce them. Programmers cannot simply state that two image streams are to be added; they must specify which of several ALUs is to be used, where in memory the images are stored, and what path the images should take to reach the ALU. The programmer is responsible for ensuring that each operand resides in a distinct memory bank, that the right lookup tables, convolution kernels, etc. are loaded into the right places, and that no datapath or computational element is used twice in the same pipe.

The result of this "explicit reference" problem is to make code reuse extremely difficult. Every ImageFlow program contains dozens of hard-coded assumptions about how data are laid out in memory and how they flow through the device. Any two ImageFlow programs invariably have incompatible memory maps and make conflicting demands on computational resources. The net result is that modifying an existing ImageFlow application (e.g., by merging it with another) is almost as difficult as rewriting it from scratch.

4. VEIL

VEIL is a programming library for real-time image processing on the Datacube MV20. It provides C++ class definitions that allow programmers to describe image processing computations in terms of coarse-grained dataflow graphs. It translates these graphs to appropriate ImageFlow structures during program initialization.

The computational model that VEIL supports is a form of homogeneous synchronous data flow (HSDF) [32]. In this paradigm, each node of the dataflow graph produces exactly one token per arc per invocation. VEIL adds the restriction that all data sources are synchronous and fire simultaneously. Under this restriction the flow of tokens through the graph depends only on the topology of the graph. This implies that a valid execution schedule can be computed statically, minimizing run-time overhead.

A pure HSDF paradigm does not allow the structure or execution order of a graph to change once the graph has been created. In order to provide greater flexibility, VEIL allows applications to create multiple graphs and to start, stop, or modify them during execution. This "escape" mechanism provides the flexibility needed for active vision applications. In particular, it allows user code to implement conditional or iterative control structures when the application requires them.

VEIL addresses most of the programming problems identified in the previous section. It handles the complexity problem by abstracting the functionality of the MV20 and presenting it as a set of reusable image processing primitives. It addresses the problem of explicit reference by concealing the physical resources of the MV20 from the programmer, providing instead an unbounded number of virtual resources. The fact that there is only one physical convolver does not prevent the program from building a graph with three convolution nodes in it; the VEIL scheduler simply detects the conflict and schedules the convolutions sequentially, allocating memory buffers as needed to store the intermediate results. This allows the VEIL programmer to describe the computation in terms of the data streams and their transformations, rather than in terms of what each physical device should be doing at each point in time.

4.1. Programming with VEIL

The fundamental activity that VEIL supports is construction and execution of dataflow graphs. It provides the C++ programmer with two new object classes (data types): graphs and operators. Graphs correspond to single HSDF computations, each consisting of a set of operators (graph nodes) corresponding to image processing primitives. Each operator has some number of input and/or output ports, which are connected by links to form a graph. Operators also have attributes that specify details of their computation. Some attributes are common to all operators. For example, every input port on an operator has an attribute that specifies the size and relative position of the pixel array that the port processes. Most attributes are specific to individual operator types, however. These include such things as the masks for convolution or morphological operations, the contents of lookup tables, and the output scaling for arithmetic operations.

Most programs that use VEIL begin by creating a graph object and some number of operators. Next, the operators are inserted into the graph, their attributes are set, and their inputs and outputs are connected to complete the graph. The program may proceed to create other graphs or perform any other initialization it requires. The program then invokes the VEIL scheduler, which verifies topological correctness and builds an execution schedule for each graph. The scheduler arranges for the graphs to share physical resources such as memory buffers, lookup table banks, and convolution kernels whenever possible. Finally, the application executes the graph(s) by invoking VEIL's execution control methods.
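To give a concrete feel for this flow, the sketch below builds the camera/threshold/monitor graph of Fig. 2 (discussed next) procedurally. The operator types, the setThreshold() attribute method, and Cycle() are taken from the text; the header name, the constructors, and the Insert(), Connect(), and Schedule() methods are our assumptions about plausible VEIL conventions, not the library's documented interface.

    // Hypothetical sketch of procedural VEIL graph construction:
    // CAMERA -> THRESHOLD -> MONITOR (cf. Fig. 2). Names not taken
    // from the text are assumptions.
    #include "veil.h"        // assumed header name

    int main() {
        Graph graph;
        Camera    camera;    // MV20 A/D converter, RS-170 input
        Threshold threshold; // point transform via lookup table
        Monitor   monitor;   // VGA display output

        // Insert the operators into the graph and link their ports.
        graph.Insert(camera);
        graph.Insert(threshold);
        graph.Insert(monitor);
        graph.Connect(camera.Output(), threshold.Input());
        graph.Connect(threshold.Output(), monitor.Input());

        threshold.setThreshold(128); // set attributes before scheduling

        graph.Schedule(); // verify topology, build execution schedule
        graph.Cycle();    // run continuously (nonblocking)
        // ... application work proceeds in parallel ...
    }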

Figure 2 presents an example of a simple MV20 computation as expressed in VEIL and in ImageFlow. At upper left is an abstract, graphical representation of the computation as the programmer might think of it during program design. The computation graph consists of a CAMERA (i.e., the MV20 A/D converter configured to capture RS-170 video data) connected to a THRESHOLD operator, which is in turn connected to a MONITOR for display on the VGA monitor. The VEIL declarations and code needed to build and execute the graph procedurally are shown at lower left. At right is a hand-optimized program that performs the same computation using raw ImageFlow.

The difference in apparent complexity between the ImageFlow and VEIL programs is obvious. A less obvious but equally important difference is that the ImageFlow program commits to a particular strategy for performing the computation: the incoming image is sent from the A/D converter directly to the lookup table for thresholding and is then buffered in memory bank zero. The VEIL program makes no such commitment and, hence, can be modified or extended without fear of introducing conflicts in resource or switch usage. However, the VEIL program does pay a price for its greater flexibility. Although both programs achieve the maximum possible throughput (30 frames per second), the VEIL program buffers the camera output in memory, introducing an additional latency of half a video frame time.

VEIL provides a number of ways for application programs to control the execution of dataflow graphs. The Cycle() method used in the above example instructs the ImageFlow device driver to begin the next iteration of the graph immediately upon completion of the previous iteration. (For graphs whose input comes from one or more video cameras, the next iteration of the graph is deferred until the start of the next full video frame.) Cycle() is nonblocking. The Run() method causes the graph to execute once and then halt, blocking the host application until the graph computation completes. A nonblocking version of Run() is also provided, so that the application can work in parallel with the MV20 computation if desired and then synchronize with the graph via a blocking Wait() call. The application can halt a graph at any time to replace it with another graph whose computation is better suited to the current sensing situation.

VEIL applications can modify the computation performed by a graph (as opposed to halting it and starting another graph) by changing the attributes of the graph operators. For example, after calling graph.Cycle() the VEIL program in Fig. 2 could change the threshold of the running graph by calling threshold.setThreshold(value). Similar calls allow modification of convolution kernels, constant multipliers, and other computational parameters. Unfortunately, ImageFlow does not allow on-the-fly changes to attributes that affect the MV20's internal pipeline delays. Thus it is not possible, for example, to change the size or position of the pixel array being sent to a given operator without halting the graph and rescheduling it.

Programs that use multiple graphs may need to use data produced by one graph as input to another graph. VEIL supports this through a pair of operators called SOURCE and SINK. These operators correspond to pixel arrays in the MV20's memories; the SOURCE operator injects its pixels into the graph as input, while the SINK operator stores the data it receives in MV20 memory. The scheduler can be instructed to treat a SINK/SOURCE pair as aliases for the same physical memory buffer. The result is that pixels sent to the SINK operator by one graph will be fed into the other graph by the SOURCE operator when that graph is executed.
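The following fragment sketches how these control methods might be combined at run time. Cycle(), Run(), Wait(), and setThreshold() are named in the text; the Halt() method name and the nonblocking Run() variant's spelling are assumptions.

    // Hedged sketch of run-time control of scheduled VEIL graphs.
    graph.Cycle();               // nonblocking, continuous execution
    threshold.setThreshold(96);  // modify the running computation

    graph.Halt();                // assumed name for stopping a graph
    other.RunNonBlocking();      // assumed nonblocking Run() variant
    DoPlanningStep();            // host works while the MV20 computes
    other.Wait();                // block until the graph completes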


FIG. 2. Three formulations of an MV20 computation that takes pixel data from the camera, thresholds it, and sends it to the monitor for display. At upper left is a conceptual representation; below it is the equivalent VEIL code. At right is a hand-optimized implementation using ImageFlow.


VEIL provides several mechanisms for transferring information between image processing graphs and the host application. A number of operators are specifically designed to gather information for the host and do not have output ports. Instead these operators provide methods that allow the host to retrieve the current contents of their buffers. For example, the HISTOGRAM, STATISTICS, and SSD (sum of squared difference) operators accumulate information about the grey levels in one or two images by summing over the image array and allow the host to read a vector containing their results. The IMPORT operator receives an array of pixels from the graph and allows the host to read any specified subarray into user memory. The EXPORT operator serves as a companion to the IMPORT operator, allowing the host to copy pixels from user memory to an MV20 memory buffer that serves as a source in a graph. This operator is useful, for example, for injecting test images into a graph that is under development.

The data transfer methods for these operators may be called at any time. If they are called while the graph is running or cycling, they will block until the end of the execution cycle in order to reduce the risk of read/write conflicts.

Using the VEIL information-gathering operators as described above requires the programmer to keep track of the graph state explicitly. In order to support a more event-driven style of programming, VEIL also allows callback functions to be associated with these operators. When a callback is added to an operator, the VEIL scheduler instructs the ImageFlow device driver to generate an event upon completion of the ImageFlow pipe that terminates in that operator. If the program calls graph.Update() while the graph is cycling, VEIL will wait for the events in the order in which they were scheduled and will call the appropriate callback for each event as it arrives.
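An event-driven host loop might then look like the sketch below. The HISTOGRAM operator, per-operator callbacks, and graph.Update() are described above; the registration call, the callback signature, and the Read() accessor are assumptions.

    // Hedged sketch of callback-based host/graph communication.
    void OnHistogram(Histogram& h) {
        std::vector<long> bins = h.Read();  // assumed accessor
        // ... analyze grey-level distribution, adjust parameters ...
    }

    histogram.AddCallback(OnHistogram);     // assumed registration call
    graph.Cycle();
    while (running) {
        graph.Update();  // waits for scheduled events, runs callbacks
        // ... interleave planning and motor control here ...
    }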

5. MERLIN

MERLIN is a graphical user interface that allows users to build and execute VEIL graphs interactively. It is written in C++ using the standard VEIL library interface and the SUIT user interface toolkit [36]. The functionality it provides is similar to that of Cantata (Khoros) [38] and the other dataflow environments discussed in Section 2. The user creates a graph by selecting VEIL operators from a bank of menus, placing them in the workspace, and connecting their inputs and outputs with the mouse. Once the graph has been built it can be executed by pressing the "Cycle" button, which invokes the VEIL scheduler and cycles the graph.

MERLIN supports graph editing by standard direct manipulation techniques: operators and links can be repositioned, created, or destroyed using the mouse. Selecting an operator with the mouse causes that operator's attributes to be displayed on a scrollable palette of appropriately typed widgets. The user can use these widgets to change the value of an attribute at any time. If possible, MERLIN makes the requested changes immediately. As mentioned in Section 4, some attribute changes require that the graph be halted and rescheduled. Since rescheduling may take several seconds, attribute changes of this type are deferred until the user halts the graph or explicitly requests rescheduling. Visual cues are provided to inform the user that deferred attribute changes are pending.

Both MERLIN and VEIL allow users to save graphs to files and load them in again. This allows graphs to be developed in the interactive MERLIN environment and then loaded into embedded, noninteractive VEIL applications. The application can gain access to the operators by looking them up in a name table. Names consist of ASCII strings and appear on the MERLIN icon representing the operator instance. Thus, instead of constructing a graph procedurally as in Fig. 2, an application might begin by loading it from a file and calling graph.Find() to obtain a handle for the THRESHOLD operator. Experience has shown that most users prefer to build applications this way, because it allows them to change the image processing their program does without recompiling. They typically keep the MERLIN window open all the time, turning off interactive execution to test their applications and turning it back on to modify and save the graph.

A sample MERLIN screen is shown in Fig. 3. The graph displayed in the work area performs a set of image processing operations used by Horswill and Brooks for an experiment in reactive robot control [23]. The two nodes at the top of the graph digitize the camera input and subsample it by a factor of four in each dimension, producing a 128 × 128 image. The graph then splits into two separate streams. The right-hand stream approximates a temporal derivative by backward differencing. The left-hand stream performs logarithmic scaling using a look-up table, estimates the gradient using a Sobel operator, and computes the sum of the absolute values of the gradient components as an approximation to the gradient magnitude. The result is passed through a threshold, which can be altered interactively using the dial widget at left. Both the time derivative and thresholded gradient magnitude are sent to the host for display in a pair of X windows. The status line at lower left reports that VEIL's scheduler broke the computation into three sequentially executed subgraphs (pipes) and achieved a data rate of 30 frames per second. The data rate is limited by the input video frame rate; for this particular graph the MV20 is idle 45% of the time.
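In outline, an embedded application using a MERLIN-built graph might look like this. graph.Find() and setThreshold() appear in the text; Load(), Schedule(), the cast, and the file and operator names are assumptions.

    // Hedged sketch of loading a MERLIN-authored graph file.
    Graph graph;
    graph.Load("tracker.veil");  // assumed method and file name
    Threshold* th =
        (Threshold*) graph.Find("edge-threshold"); // look up by name
    graph.Schedule();            // assumed; invokes the VEIL scheduler
    graph.Cycle();
    th->setThreshold(64);        // tune without recompiling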

FIG. 3. A typical MERLIN screen. The illustrated graph subsamples the image and computes the temporal derivative and the filtered and thresholded gradient magnitude, as described in [23].

6. VEIL INTERNALS

The internal architecture of VEIL is divided into four logical components. The graph manager consists of the data structures and functions that allow user programs to construct graphs and set their attributes. The operator set is the set of available image processing primitives. The scheduler is responsible for mapping the computation described by a VEIL graph onto the hardware and building an execution schedule. Finally, the executive communicates with the ImageFlow device driver and allows user programs to control the computation as desired.

6.1. The Graph Manager

The role of the graph manager is relatively simple: it defines the basic graph class and provides the methods needed to create or modify computation graphs. The basic operations it supports are creating or destroying a graph, inserting or deleting an operator, and creating or destroying a link between operators. The graph manager also serves as a repository for information about each graph object, such as whether it has been scheduled or not, whether it is running, and what physical resources it uses. Finally, it serves as a launch point for the executive. Since the graph is the smallest executable object in VEIL, execution control commands are naturally expressed as methods of the graph class.

6.2. The Operator Set

The design of the operator set is critical to the operation of the other components of the system. The various operator types are organized into a class hierarchy, and all are ultimately subclasses of the generic class Operator. Every operator instance provides the following information:

• Port descriptors. For each input or output port, there is a descriptor that identifies any links that are attached to the port and (for input ports) the size and alignment of the incoming pixel array. This information is normally set by the application programmer. The descriptor also specifies the name of the physical switch port on the MV20 at which the port produces or accepts data. This information is derived from the class definition and is accessible only to the scheduler.

• Attribute descriptors. For each attribute there is a descriptor that specifies its name (as an ASCII string), its type, and its current value.

• Resource list. Every operator class contains a list of the physical MV20 resources it uses. These include the switch ports mentioned in the port descriptors, plus any multiplexors or computational elements that it requires in order to perform its function. The scheduler uses this list to determine whether a set of operators can run in parallel.

• Build method. Every operator defines a procedure called its build method. When invoked, the build method constructs a partial ImageFlow pipe that begins and ends at the physical switch ports identified in the port descriptors. In between, the pipe passes through the computational elements listed in the resource list. The build method reads the current values of the attributes and configures the computational elements so that they transform the data passing through them appropriately.

The graph manager, scheduler, and executive manipulate VEIL operators at the superclass level, using the information and methods described above. Dynamic inheritance allows them to do their jobs without any hard-coded knowledge of the individual operator classes. This makes it relatively straightforward for programmers familiar with basic ImageFlow programming to add new operator types to VEIL: they simply declare a new subclass of class Operator and supply appropriate descriptors and methods. The process is not painless, but it is much easier than writing a complete ImageFlow pipe. Once it is done, the new operator automatically inherits schedulability from the Operator class and can be used without modification in any VEIL application. A hedged sketch of the process appears below.
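In outline, a new operator class might look like the following. The four kinds of information it supplies (ports, attributes, resources, and a build method) follow the description above; every concrete name, signature, and the pixel operation itself are our assumptions, not VEIL's actual interface.

    // Hedged sketch of a user-defined VEIL operator. All names are
    // illustrative; only the overall structure follows the text.
    class Invert : public Operator {
    public:
        Invert() {
            AddInputPort("in", "AU.alu.a");     // physical switch port
            AddOutputPort("out", "AU.alu.out"); //   names are invented
            AddAttribute("scale", VEIL_INT, 1); // example attribute
            AddResource("AU.alu");              // for compatibility checks
        }
        // Build a partial ImageFlow pipe between the declared switch
        // ports, configuring the ALU from current attribute values.
        void Build(Pipe& pipe) override {
            // ... ImageFlow calls omitted; they would configure the
            // ALU to compute 255 - x for each pixel passing through ...
        }
    };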

6.3. The Scheduler

The job of the VEIL scheduler is to compile the data structure describing a VEIL graph into a form that the executive and the ImageFlow device driver can use to perform the computation. In this section we first present informal examples of the scheduling problem and then define the problem formally and describe the VEIL scheduling algorithm. Finally, we discuss the scheduler's performance and limitations.

In simple cases the scheduler may be able to embed the input VEIL graph directly in the MV20 hardware. Figure 4 shows an example of such a situation. The graph at left takes an array of pixels supplied by the host (via an EXPORT operator), convolves it with a mask, and returns the absolute value of the result to the host via an IMPORT operator. The CONVOLVE and ABS (absolute value) operators have disjoint resource lists, so they can be run as part of the same computational pipeline, as shown at right. Note that since there is no buffering between the two computational operators, the presence of the second operator has essentially no impact on the running time of the graph; the execution time is given by the number of pixels in the rectangle divided by the pipeline data rate (20 MHz).

FIG. 4. (a) A simple VEIL graph. (b) The same graph mapped to a single pipelined computation on the MV20 (cf. Fig. 1).

In order to build an ImageFlow pipe for this computation, the scheduler calls the build methods for each of the operators. The build methods for the CONVOLVE and ABS operators construct partial pipes, as described above. The build methods for the IMPORT and EXPORT operators invoke a resource allocator to reserve space in memory for their pixels and to update the switch port addresses contained in their port descriptors. The scheduler then makes connections through the switch to the ports identified in the port descriptors, completing the pipe.

In more complex cases it may not be possible to schedule the entire VEIL computation as a single ImageFlow pipe. For example, the graph may contain operators whose resource needs are in conflict, or may require more connections than the switch can support. The VEIL scheduler handles these more difficult cases by partitioning the graph into a set of subgraphs, each of which is small enough to be run as a single pipe. Outgoing dataflow arcs that cross a partition boundary are connected to temporary buffers in MV20 memory. These buffers then become inputs to the subgraphs whose input arcs cross the partition boundary. The subgraphs are executed sequentially inside an ImageFlow PAT, in such a manner that the order of execution of the operators is a topological sort of the input graph. This guarantees that the introduction of memory buffers does not change the result of the computation.

The graph of Fig. 3 illustrates some of the problems posed by more complex graphs. The LUT1 and THRESHOLD operators cannot execute simultaneously, because they each require exclusive access to the MV20's single-input look-up table. The SHRINK, DELAY, and XWINDOW operators must be handled specially because they contain internal memory buffers. The SHRINK operator, which subsamples the image by a constant factor, is implemented by commanding the vector address generator of a memory write port to omit some of the rows and columns sent to it. Thus a SHRINK operation can only be performed in the course of storing an image into memory. The DELAY operator uses a memory buffer to store pixels received during the current iteration and releases them into the graph during the next iteration. The XWINDOW operator uses the VEIL callback mechanism to transfer pixels from the MV20 to the host and display them on the workstation screen.

Figure 5 shows how the VEIL scheduler might partition the graph of Fig. 3. The first partition simply copies the CAMERA buffer to the memory buffer associated with the SHRINK operator. No further work can be done, because the SHRINK operator must terminate a pipe and no other node is ready to execute. The second partition subtracts the result of the SHRINK operation from the current contents of the DELAY buffer and sends it to the XWINDOW1 output buffer. At the same time, it sends the current contents of the SHRINK buffer through the first lookup table LUT1, the derivative operator DXDY, and the two-input lookup table LUT2. At this point the scheduler closes the pipe and deposits the LUT2 result in temporary buffer TEMP1. It cannot add the THRESHOLD operator to the current pipe, because its resource needs conflict with those of LUT1, and it cannot copy the SHRINK buffer to the DELAY buffer because the latter is in use. In the third and last partition, the contents of TEMP1 are sent to the XWINDOW2 buffer via the THRESHOLD operator, and the SHRINK buffer is copied to the DELAY buffer.

At 20 MHz, the first partition (which works on 512 × 512 arrays) takes about half a video frame time. The remaining partitions work on 128 × 128 arrays and, hence, take only about 1/32 of a frame time each. The entire schedule thus runs at 30 Hz, limited by the video frame rate.

FIG. 5. A VEIL schedule for the graph of Fig. 3. Shaded operators are implemented as MV20 memory buffers. The computation has been broken into three subgraphs, each of which executes as a single ImageFlow pipe.


The Scheduling Problem

The scheduling problem can be formalized as follows. We are given a directed acyclic graph G whose vertices are taken from a set O representing the available operators. There is a compatibility relation on O; a given pair of vertices is compatible if the resource lists of the corresponding VEIL operators are disjoint. Each vertex i has associated with it an integer cost c_i that specifies the number of pixels in the largest rectangle flowing through the node. The graph is to be scheduled for a system with M memory banks and a switch that can support S simultaneous connections. For the MV20, M and S have values six and ten, respectively.

A valid schedule for a graph G is a partition of the graph into a sequence of disjoint subgraphs g_1, ..., g_k satisfying the following constraints:

(1) All ancestors of vertices in g_j are found in subgraphs g_1, ..., g_j. This guarantees that no operator can execute before its input data are available.

(2) All vertices in each subgraph are mutually compatible. This ensures that there are no resource conflicts between vertices in the same subgraph and, hence, that all operators within a partition can execute simultaneously as part of the same ImageFlow pipe.

(3) No subgraph contains more than S arcs, including arcs that cross a partition boundary. This guarantees that there is enough switch bandwidth to route data between the computational elements and memory buffers that implement the subgraph computation.

(4) Every arc that crosses a partition boundary is assigned an integer in the range [1 ... M], and no two input or output arcs for a single subgraph are assigned the same integer. This restriction captures the constraint that each MV20 memory bank has one read port and one write port and, hence, can support at most one read and one write per pipe.

Since the subgraphs that make up a schedule are executed sequentially, the execution time of the entire schedule is the sum of the execution times for each subgraph. Pipeline delay and setup time are negligible for images of reasonable size, so the execution time for a subgraph can be approximated by the time required to send the largest pixel array in the subgraph through the switch. Thus an optimal-time schedule for a VEIL graph is one that obeys the above constraints and also minimizes the cost function

$$\sum_{i=1}^{k} \max_{j \in g_i} (c_j).$$
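As a purely illustrative instance of this cost model (the numbers are hypothetical, not drawn from Table 1): a two-partition schedule whose largest rectangles are 512 × 512 and 128 × 128 pixels has cost (512 × 512) + (128 × 128) = 278,528 pixel times, or about 13.9 ms at the 20 MHz pipeline rate.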

The problem of finding optimal-time schedules for VEIL graphs is NP-complete, like many other multiprocessor scheduling problems [39]. Simply finding a partition into k compatible subgraphs is equivalent to k-coloring the complement of the graph defined by the compatibility relation. Assigning integers to arcs that cross partition boundaries also involves a graph-coloring problem. There is a substantial body of literature on approximation algorithms for problems of this type [25, 31, 39, 41, 51], and our original intent was to adapt one of these algorithms to the VEIL scheduling problem. However, experiments (see below) have shown that a naive heuristic algorithm produces reasonably good schedules for the sort of problems that arise in our work. Thus we have not yet had to resort to these more sophisticated strategies.

The Scheduling Algorithm

The current VEIL scheduler constructs an execution schedule by traversing the graph in topologically sorted order, starting at its sources. At each stage it attempts to pack as many of the remaining nodes as possible into the current pipe, subject to resource and topology constraints. When no more nodes can be added to the current pipe, it connects the outputs of the pipe to temporary memory buffers and creates a new empty pipe. When all nodes have been scheduled, the temporary buffers are bound to physical memory locations using a register coloring algorithm [2] and the partitions are used to generate a PAT. The algorithm can be expressed concisely as follows:

    mark all sources 'ready'
    REPEAT
        create a new, empty current partition
        WHILE there exists a 'ready' operator that is compatible
              with the operators in the current partition DO
            add it to the current partition
            check operators that use its outputs and mark them
              'ready' if appropriate
        FOR all output arcs from the current partition DO
            connect the arc to a temporary memory buffer
    UNTIL all operators have been scheduled

The algorithm is simple and fast, but it ignores a great deal of potentially useful information. In particular, it adds operators to the current partition in the order in which it encounters them. In some cases better schedules would be produced by an algorithm that considered alternate orders, or that used heuristic criteria to decide which operator to add next. A concrete sketch of the greedy strategy is given below.
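The following self-contained sketch makes the greedy strategy concrete. It implements only the readiness and compatibility rules described above, on a toy five-operator graph; the operator and resource names are invented, and the switch-bandwidth (S) and memory-bank (M) constraints of the formal problem are omitted for brevity.

    // Minimal sketch of the greedy partitioning algorithm of
    // Section 6.3, under the simplifying assumptions stated above.
    #include <iostream>
    #include <set>
    #include <string>
    #include <vector>

    struct Op {
        std::string name;
        std::set<std::string> resources; // physical elements it needs
        std::vector<int> preds;          // indices of predecessor ops
    };

    // Operators are compatible when their resource lists are disjoint.
    bool compatible(const Op& a, const Op& b) {
        for (const std::string& r : a.resources)
            if (b.resources.count(r)) return false;
        return true;
    }

    int main() {
        // Toy graph (cf. Fig. 4): EXPORT -> CONVOLVE -> ABS -> IMPORT,
        // plus a second CONVOLVE needing the same physical convolver.
        std::vector<Op> ops = {
            {"EXPORT",    {"MEM0.read"},    {}},
            {"CONVOLVE1", {"AP.convolver"}, {0}},
            {"ABS",       {"AU.alu"},       {1}},
            {"CONVOLVE2", {"AP.convolver"}, {2}},
            {"IMPORT",    {"MEM1.write"},   {3}},
        };

        std::vector<int> partitionOf(ops.size(), -1); // -1 = unscheduled
        int nScheduled = 0, current = 0;

        while (nScheduled < (int)ops.size()) {   // REPEAT: new partition
            bool grew = true;
            while (grew) {                       // pack greedily
                grew = false;
                for (int i = 0; i < (int)ops.size(); ++i) {
                    if (partitionOf[i] != -1) continue;
                    // Ready: every predecessor already placed, either in
                    // an earlier partition or in the current pipe (data
                    // then streams through without buffering).
                    bool ready = true;
                    for (int p : ops[i].preds)
                        if (partitionOf[p] == -1) ready = false;
                    if (!ready) continue;
                    // Compatible with everything already in this pipe.
                    bool ok = true;
                    for (int j = 0; j < (int)ops.size(); ++j)
                        if (partitionOf[j] == current &&
                            !compatible(ops[i], ops[j]))
                            ok = false;
                    if (!ok) continue;
                    partitionOf[i] = current;    // add to current pipe
                    ++nScheduled;
                    grew = true;
                }
            }
            // In VEIL, arcs leaving the partition would now be bound
            // to temporary memory buffers; here we just open a new pipe.
            ++current;
        }

        // Expected output: EXPORT, CONVOLVE1, ABS land in pipe 0;
        // CONVOLVE2 conflicts with CONVOLVE1, so it and IMPORT go
        // to pipe 1.
        for (int i = 0; i < (int)ops.size(); ++i)
            std::cout << ops[i].name << " -> pipe "
                      << partitionOf[i] << "\n";
    }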

Scheduler Performance

We evaluated the VEIL scheduler by comparing its output to that of an experimental branch-and-bound scheduler that produces provably optimal schedules. We collected a set of 13 VEIL graphs developed in support of various projects around our lab, ranging in size from 9 to 39 operators.

FIG. 6. Phase-based horizontal motion estimator from the VEIL test set. The convolvers extract sine-phase and cosine-phase components of the image at every point. The arctangent of the ratio of appropriate sums of products of the components for the current and previous images is proportional to the horizontal rate of change of phase (see Burt et al. [14]).

Space does not permit a detailed description of all of the graphs, but Fig. 6 shows a typical example, a graph that uses phase differencing to estimate the horizontal component of the motion field [14].

The VEIL scheduler adds nodes to the current pipe in the order in which it encounters them. Thus its output depends on the order in which nodes are created during graph construction. In order to take this into account, we tested the scheduler on 1000 instances of each test graph, randomly permuting the order of node creation each time. For each instance we recorded the estimated running time of the resulting schedule. Although the number of possible node orderings is large, the number of topologically distinct schedules that the scheduler can find is invariably small, and there are usually several with identical running times. Thus the sequence of running times observed in 1000 trials contains only a small number of distinct values. Dividing the number of occurrences of each value by one thousand gives the relative frequency of that value, which is an estimate of the probability of observing that value on a random trial.

The entries above the final entry in Table 1 summarize the results of the experiment for the thirteen test graphs. For the heuristic scheduler the table gives the sample distribution over 1000 trials, expressed as the relative frequency of each running time observed during the trials. The table also reports the sample mean running time for the heuristic schedules, the running time for the optimal schedule, and the percentage by which the sample mean running time exceeds the optimal running time. For nine of the test graphs the heuristic schedules had degenerate sample distributions; the running times for all 1000 trials were identical and were equal to the running time of the optimal schedule. For the remaining graphs, the heuristic running times were always within 26% of optimal and were within 10% of optimal on average. Execution times for the heuristic scheduler were less than one second in all cases, compared with minutes or hours for the branch-and-bound scheduler.

It is somewhat surprising that the current scheduler works as well as it does on the test graphs. If there is no constraint on the compatibility relation, it is theoretically possible to construct graphs for which the running time of the heuristic schedule exceeds that of the optimal schedule by an arbitrary amount. Figure 7 shows the construction for one such graph. In practice the finite bandwidth of the MV20 puts an upper bound on the degree of parallelism that an optimal schedule can exhibit. The worst graph we have been able to produce by following the construction shown in Fig. 7 is one for which the heuristic scheduler's worst-case running time exceeds that of the optimal schedule by 150%. The last entry in Table 1 summarizes the results of 1000 trials on this graph. Fortunately the conditions that produce this behavior seem to arise rarely in practice.

We speculate that the relatively good performance of the current VEIL scheduler is due to the fact that the topological and resource constraints of the MV20 place tight limits on the space of valid schedules. The graph induced by the compatibility relation is fairly sparse, limiting the opportunity for parallel execution. The fact that many nodes (e.g., SHRINK) force termination of the current pipe also limits parallelism. An architecture similar to the MV20 but offering more switch bandwidth and more potential parallelism would probably increase the benefit to be obtained by using a more intelligent scheduler. Improved scheduling heuristics are being investigated as part of our current work on VEIL.

Scheduler Limitations

The current VEIL scheduler finds reasonably good schedules for the graphs in the test set. However, it is constrained to solve the problem as formalized above, that is, to break the input graph into a series of partitions to be executed in sequence. Modifying the definition of a valid schedule would result in better performance in some cases. One approach would be to relax the requirement that the graph partitions be executed in strict sequence. For example, it might be possible to execute a series of pipes that operate on small images in parallel with a single pipe that operates on larger images. Another option would be to optimize the schedule for repeated execution via software pipelining [7], as is done in the Gabriel scheduler [31], for example. In a software-pipelined schedule, multiple iterations of a loop are executed in parallel.


TABLE 1
Scheduler Test Results

                                Heuristic schedule distribution
                                -------------------------------   Mean      Optimal   Mean vs. optimal
Graph name              Nodes   Frequency   Running time          running   running   (% of optimal)
                                                                  time      time
Tracker                   9     1.0         14.75                 14.75     14.75       0%
Derivatives              10     1.0         29.49                 29.49     29.49       0%
Horswill                 10     1.0         19.66                 19.66     19.66       0%
Temporal filter          12     1.0         39.32                 39.32     39.32       0%
Tracker 2                12     1.0         17.41                 17.41     17.41       0%
Change detect            12     1.0         13.52                 13.52     13.52       0%
Gaussian Pyr             13     1.0         22.39                 22.39     22.39       0%
Phase                    14     1.0         32.77                 32.77     32.77       0%
Motion                   14     1.0         52.43                 52.43     52.43       0%
Laplacian Pyr            16     0.219       26.74                 30.33     26.74      13%
                                0.250       26.93
                                0.271       33.07
                                0.260       33.84
Disparity filter         24     0.490       42.60                 44.3      42.60       4%
                                0.510       45.87
2D phase                 26     0.352       39.32                 41.9      39.32       6.6%
                                0.492       42.60
                                0.156       45.87
Motion filter            39     0.250       31.10                 34.30     31.10      10.3%
                                0.246       31.28
                                0.249       37.28
                                0.255       37.43
Ad hoc scheduler breaker 14     0.125       26.21                 45.43     26.21      73%
                                0.394       39.32
                                0.370       52.43
                                0.111       65.54

Note. Running times in milliseconds. The first 13 entries are for the test set. The final entry is for a graph designed specifically to provoke worst-case behavior.

FIG. 7. A class of graphs for which the VEIL heuristic scheduler can produce arbitrarily bad schedules. Suppose that nodes in the graph at left are incompatible if and only if they have the same label. Clearly the optimal schedule contains exactly two partitions, as shown at center. If the VEIL scheduler always examines "ready" nodes in left-to-right order, however, it will produce the schedule shown at right, in which the number of partitions (and hence the running time) is proportional to the number of nodes in the graph.

In the example of Fig. 5, for instance, the nth invocation of the first partition could be executed in parallel with the (n − 1)th invocation of the third partition. The computation performed by the graph would be unchanged, as would its latency, but its iteration period would be reduced. Note, however, that software pipelining requires prologue and epilogue code to handle the initial and final iterations, and that in the general case it can increase latency.

We have not done formal experiments to determine how much could be gained by redefining the scheduling problem to allow parallel pipes or "wrapped" execution order. Hand inspection of the graphs showed that allowing parallel pipes would reduce the running time of the disparity filter graph by 7.7% but would not affect any of the other graphs. Permitting wrapped execution order saves 5% to 10% in most cases, because the last pipe of a schedule can often be executed in parallel with the first. Permitting both parallel pipes and wrapped execution is significantly more useful, yielding savings of 20% to 40% on four of the 13 test graphs. We caution that these numbers have not been verified by implementing the schedules in question and, hence, are not as reliable as the simulation results presented in Table 1. We do believe that they are qualitatively correct. Their overall implication is that modifying the scheduling paradigm would cut the running time of some schedules nearly in half but would produce only modest improvements in most cases.

The ultimate test of the VEIL scheduler would be to compare its schedules with hand-coded ImageFlow. The computational elements of the MV20 are deeply pipelined and contain complex internal datapaths that allow data to be routed between elements without going over the switch. VEIL cannot exploit this possibility because its operators are defined to accept input and produce output at switch ports. ImageFlow programmers are under no such restriction, so they should be able to improve on the VEIL scheduler in many cases. Unfortunately, reimplementing the entire test set in ImageFlow would take months of full-time effort and, hence, is not practical. We do have hand-coded implementations of the Gaussian and Laplacian pyramid computations and have carefully inspected the other graphs to identify opportunities for optimization. For most graphs, including the pyramid computations, we have been unable to improve beyond what is available by permitting parallel pipes and wrapped schedules. For three graphs, including the "phase" graph of Fig. 6, it appears possible to reduce the running times by 50% to 60%. These savings would be obtained by embedding sum-of-product subgraphs in the internal datapaths of the AU device. Much of the speedup could be obtained within VEIL by creating a special-purpose operator that implements the appropriate algebraic operation.
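To illustrate what such a special-purpose operator might look like at the graph-construction level, the sketch below contrasts a sum-of-products built from generic nodes with a single fused node. All identifiers here (Graph, Node, add(), and the operator names) are hypothetical stand-ins invented for illustration; VEIL's actual construction functions are not reproduced in this paper.

#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-ins for a VEIL-style graph-building interface.
struct Node {
    const char* op;
    std::vector<Node*> inputs;
};

struct Graph {
    std::vector<std::unique_ptr<Node>> nodes;
    Node* add(const char* op, std::vector<Node*> in) {
        nodes.push_back(std::make_unique<Node>(Node{op, std::move(in)}));
        return nodes.back().get();
    }
};

int main() {
    Graph g;
    Node* x = g.add("SOURCE", {});
    Node* y = g.add("SOURCE", {});

    // Generic form of a*x + b*y: three switch-connected nodes, so the
    // intermediate products must travel through the crossbar switch.
    Node* ax = g.add("MULTIPLY_CONST", {x});
    Node* by = g.add("MULTIPLY_CONST", {y});
    g.add("ADD", {ax, by});

    // Fused form: one special-purpose node computing the whole
    // expression inside a single element's internal datapath, so the
    // intermediate products never appear at switch ports.
    g.add("SUM_OF_PRODUCTS", {x, y});
}

The fused node mirrors what a hand-coded ImageFlow program achieves with the AU's internal datapaths, which is where the 50% to 60% savings estimated above would come from.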

6.4. The Executive

The job of the VEIL executive is to allow programs to control and communicate with graph computations. It is implemented as a set of methods of the graph class. Most of the control operations (e.g., running or halting a graph computation) map directly to calls to the ImageFlow device driver. Communication and synchronization methods use the ImageFlow event mechanism, which provides semaphore-like variables that can be set, cleared, or waited for by either the application or the running PAT. User-defined callbacks are handled by storing a pointer to the callback function in the operator object and instructing the ImageFlow device driver to issue an event on completion of the pipe containing the operator. When the application invokes a blocking execution command or the asynchronous Update() command, the executive waits for the pipe events in order of issue and calls the appropriate callbacks after each pipe. Operations that set run-time modifiable attributes or retrieve data from the MV20 are executed immediately if the graph is halted. If the graph is currently executing, they are deferred until the next occurrence of the appropriate event.
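A typical application's interaction with the executive might look like the following sketch. Only Run(), Update(), and the existence of per-operator callbacks are taken from the description above; the class layout and the remaining names are invented for illustration, and the bodies are stubs rather than the real driver calls.

#include <cstdio>

// Hypothetical stand-ins for the graph and operator classes.
struct Operator {
    void (*callback)(Operator&) = nullptr;  // fired when the pipe
                                            // containing this operator
                                            // completes
};

struct Graph {
    Operator motionEstimate;                // some operator of interest
    void Run() {                            // blocking: one iteration,
        if (motionEstimate.callback)        // invoking callbacks as each
            motionEstimate.callback(motionEstimate);  // pipe event arrives
    }
    void Update() { Run(); }                // stub; the real call returns
                                            // immediately (asynchronous)
};

// User callback: runs on the host after the relevant pipe completes.
static void onMotion(Operator&) { std::printf("motion field ready\n"); }

int main() {
    Graph g;                        // in practice, built node by node
    g.motionEstimate.callback = onMotion;
    g.Run();                        // blocking execution of one iteration
    for (int i = 0; i < 5; ++i)
        g.Update();                 // asynchronous in the real system:
                                    // robot control code could run here
}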


Executive Limitations

The current version of VEIL does not halt the MV20 during data transfers or execution of user callback functions. This makes the running time of a graph deterministic, which is an advantage in many real-time vision applications. However, it also makes it possible for applications that communicate with a running graph to receive corrupt or inconsistent data, or for attribute changes to take effect in the middle of a computation rather than between iterations. This can occur if the host's UNIX scheduler fails to run the user process in a timely fashion, or if the user code fails to complete its task quickly. Whenever possible we prefer to build robust applications that can tolerate errors of this kind. For nonrobust applications the blocking Run() execution command can be used to enforce mutual exclusion.

The Run() command allows programs to synchronize reliably at graph boundaries. In some cases, however, this is inadequate. Consider a program that contains two or more STATISTICS operators. Because there is only a single statistics unit on the MV20, each operator will execute in a separate pipe. To obtain valid data from the first operator, the application must read the data after the operator runs but before the hardware statistics buffer is cleared in preparation for executing the second operator. This level of service cannot be guaranteed by a user process under UNIX. At present the only solution is to put each statistics operator into a separate graph and call Run() repeatedly, which is awkward and unintuitive. We are currently testing an experimental version of the executive that uses events to guarantee exclusive access to data-gathering operators of this type.
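The workaround reads roughly as follows. The pattern of one graph per STATISTICS operator, with a blocking Run() before each read, comes from the discussion above; buildStatsGraph() and readStatistics() are hypothetical helpers, and the bodies are stubs.

#include <cstdio>

// Hypothetical stand-ins: each graph contains a single STATISTICS
// operator, so its result can be read before the next graph runs and
// clears the hardware statistics buffer.
struct Graph {
    double result = 0.0;
    void Run() { result = 1.0; }  // stub for blocking execution
};

static Graph buildStatsGraph(int region) {
    (void)region;  // stub: would build a graph with one STATISTICS op
    return Graph{};
}

static double readStatistics(const Graph& g) { return g.result; }

int main() {
    Graph g1 = buildStatsGraph(0);
    Graph g2 = buildStatsGraph(1);
    g1.Run();                         // first pipe completes; buffer valid
    double s1 = readStatistics(g1);   // read before g2 clears the buffer
    g2.Run();
    double s2 = readStatistics(g2);
    std::printf("%f %f\n", s1, s2);
}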

7. CONCLUSION

VEIL and MERLIN provide a powerful environment for developing real-time vision systems. VEIL's coarse-grained dataflow model allows the programmer to concentrate on the image processing task at hand rather than the details of resource management, scheduling, and synchronization. It also provides extensive facilities for interaction and synchronization between the image processor and programs running on the host, making it easy to embed VEIL computations in robot or autonomous vehicle control programs. The VEIL scheduler produces efficient execution schedules that compare well with hand-coded ImageFlow. The MERLIN interface allows VEIL graphs to be constructed and modified interactively, supporting exploratory programming.

VEIL has a number of limitations, some of which are correctable and others of which are fundamental consequences of the design. As we observed in the previous section, the scheduler ignores potentially useful information and considers only a restricted class of schedules.


The underlying computational model is sometimes a problem as well: computations that cannot be expressed as HSDF graphs can be implemented by "escaping" to C++ code, but the mechanism is awkward and makes programs difficult to maintain. By far the most serious problem we encounter in using VEIL stems from the fact that VEIL applications run as user processes under UNIX. This places them at the mercy of the UNIX scheduler and makes it impossible to write programs that rely on meeting hard real-time constraints.

One of the secondary design criteria for VEIL was that it should have the potential to be ported to other image processing architectures. The portions of VEIL that are directly visible to application programs (including MERLIN) are largely device-independent and could be adapted to another architecture fairly easily. The code that implements the individual VEIL operators, on the other hand, is intimately tied to the MV20 architecture. Fortunately these machine dependencies are well localized and could be rewritten for a new architecture with little effect on the rest of the system. The least portable parts of the system are the scheduler and executive, which would need to be completely redesigned to work with a different architecture. It would not be surprising if the operations provided by the VEIL executive turned out to be inappropriate for some architectures, particularly those that have their own CPUs and rely less on the host for general-purpose computation.

Current work on VEIL is directed toward improving its usability, particularly for users outside our laboratory. VEIL can now be used with the MV20's successor, the MV200. It can also be used in configurations containing multiple MV200 boards, although the user must specify the board on which each operator should execute. MERLIN is being rewritten using Tcl/Tk [35] to increase its speed and make it easier for remote users to compile. We continue to work on the scheduler and to add new operator types as the need arises.

We are currently using VEIL on a daily basis in support of our robotics work and have distributed it to more than 40 remote sites. We believe that it has more than met the goals we established for it when we began the project. New users can indeed become productive within hours, and experienced users feel little motivation to bypass it. It has largely removed development time as a consideration in deciding whether to implement a new algorithm on the MV20 or on a conventional workstation. Instead, the decision tends to be driven (as it should be) by the intrinsic strengths and weaknesses of the machine, e.g., its orientation toward pixel processing and its lack of floating-point arithmetic.

Current versions of VEIL and MERLIN are available by anonymous ftp from ftp.cs.virginia.edu in directory pub/veil, or via the World Wide Web at http://www.cs.virginia.edu/pvision.

We hope that VEIL will be of use to the vision research community and that it will help to establish a higher standard for software tools for the next generation of high-speed image processors.

ACKNOWLEDGMENTS

The authors acknowledge the contributions of present and past members of the UVA Computer Vision group, particularly Worthy Martin, Frank Brill, Shawn Carnell, Scott Gietler, Glenn Wasson, and Jennifer Wong. We are also grateful to VEIL users around the world for providing valuable feedback and bug fixes, and to the technical support engineers at Datacube for helping us understand ImageFlow. MaxVideo, ImageFlow, and MV20 are trademarks of Datacube, Inc. This work was supported in part by the Defense Advanced Research Projects Agency under Grant N00014-94-1-0841 and in part by the National Science Foundation under Grant CDA-8922545-01.

REFERENCES

1. W. Ackerman, Data flow languages, IEEE Comput. 15, No. 2, 1982, 15–25.
2. A. V. Aho, R. Sethi, and J. D. Ullman, Compilers: Principles, Techniques, and Tools, Addison–Wesley, Reading, MA, 1986.
3. J. Aloimonos, I. Weiss, and A. Bandyopadhyay, Active vision, in Proceedings, 1st International Conference on Computer Vision, London, 1987, pp. 35–54.
4. Amerinex Artificial Intelligence, Inc., KBVision Programmers Manual, Amerinex Artificial Intelligence, Amherst, MA, 1987.
5. M. Annaratone, E. Arnould, T. Gross, H. T. Kung, M. Lam, O. Menzilcioglu, and J. A. Webb, The Warp computer: Architecture, implementation, and performance, IEEE Trans. Comput. C-36, No. 12, 1987, 1523–1538.
6. R. G. Babb II, Parallel processing with large-grain data flow techniques, IEEE Comput. 17, No. 7, 1984, 55–61.
7. D. F. Bacon, S. L. Graham, and O. J. Sharp, Compiler transformations for high-performance computing, ACM Comput. Surveys 26, No. 4, 1994, 345–420.
8. R. Bajcsy, Active perception, Proc. IEEE 76, No. 8, 1988, 996–1005.
9. D. H. Ballard, Reference frames for animate vision, in Proceedings, Eleventh International Joint Conference on Artificial Intelligence, Detroit, August 1989, pp. 1635–1641.
10. A. Biancardi and A. Rubini, PACCO—A new approach for an effective I.P. environment, in Proceedings, 12th IAPR International Conference on Pattern Recognition, Jerusalem, October 1994, pp. 395–398.
11. R. Blanford and S. Tanimoto, The PyramidCalc system for research in pyramid machine algorithms, in Proceedings, Second International Workshop on Visual Languages, Dallas, June 1986, pp. 138–142.
12. J. Buck, S. Ha, E. A. Lee, and D. G. Messerschmitt, Ptolemy: A framework for simulating and prototyping heterogeneous systems, Int. J. Comput. Simul. 4, 1994, 155–182.
13. P. Burt and E. Adelson, The Laplacian pyramid as a compact image code, IEEE Trans. Commun. 31, No. 4, 1983, 532–540.
14. P. Burt, J. Bergen, R. Hingorani, R. Kolczynski, W. Lee, A. Leung, J. Lubin, and H. Shvaytser, Object tracking with a moving camera, in Proceedings, IEEE Workshop on Visual Motion, Irvine, March 1988, pp. 2–12.
15. D. Coombs, I. Horswill, and P. von Kaenel, Disparity filtering: Proximity detection and segmentation, in Proceedings, SPIE Intelligent Robots and Computer Vision XI: Algorithms, Techniques, and Active Vision, Boston, Nov. 1992, pp. 195–206.
16. Datacube, Inc., WitFlow graphical programming tool, High Performance Imaging 6, No. 1, Datacube, Danvers, MA, 1994.
17. Datacube, Inc., MaxVideo 20 Hardware Reference Manual, Datacube, Danvers, MA, 1991.
18. Datacube, Inc., ImageFlow Reference Manual, Datacube, Danvers, MA, 1991.
19. M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, Freeman, New York, 1979.
20. L. G. C. Hamey, J. L. Webb, and I-C. Wu, Low-level vision on WARP and the APPLY programming model, in Parallel Computation and Computers for Artificial Intelligence (J. S. Kowalik, Ed.), pp. 185–199, Kluwer Academic, Boston, 1988.
21. R. M. Haralick, Y.-H. Yao, L. G. Shapiro, I. T. Phillips, A. K. Somani, J.-N. Hwang, M. Harrington, C. Wittenbrink, C.-H. Chen, X. Liu, and S. Chen, Proteus: Control and management system, in Proceedings, CAMP '93 Workshop on Computer Architectures for Machine Perception, New Orleans, December 1993, pp. 101–108.
22. M. Hirakawa, S. Iwata, I. Yoshimoto, M. Tanaka, and T. Ichikawa, HI-VISUAL iconic programming, in Proceedings, IEEE Computer Society Workshop on Visual Languages, Linkoping, Sweden, 1987.
23. I. D. Horswill and R. A. Brooks, Situated vision in a dynamic world: Chasing objects, in Proceedings, AAAI-88, St. Paul, 1988, pp. 796–800.
24. J. M. Jagadeesh and Y. Wang, LabView, Computer 26, No. 2, 1993, 100–103.
25. H. Kasahara and S. Narita, Practical multiprocessor scheduling algorithms for efficient parallel processing, IEEE Trans. Comput. C-33, No. 11, 1984, 1023–1029.
26. E. W. Kent, M. O. Shneier, and R. Lumia, PIPE—Pipelined image processing engine, J. Parallel Distrib. Comput. 2, 1985, 50–78.
27. C. Kohl and J. Mundy, The development of the Image Understanding Environment, in Proceedings, IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Seattle, 1994, pp. 443–447.
28. M. S. Landy, Y. Cohen, and G. S. Sperling, HIPS: A Unix-based image processing system, Comput. Vision Graphics Image Process. 25, No. 3, 1984, 331–347.
29. D. T. Lawton and C. C. McConnell, Image understanding environments, Proc. IEEE 76, No. 8, 1988, 1036–1050.
30. E. A. Lee, W.-H. Ho, E. Goei, J. Bier, and S. Bhattacharyya, Gabriel: A design environment for DSP, IEEE Trans. Acoust. Speech Signal Process., Nov. 1989.
31. E. A. Lee and D. G. Messerschmitt, Static scheduling of synchronous data flow programs for digital signal processing, IEEE Trans. Comput. C-36, No. 1, 1987, 24–35.
32. E. A. Lee and D. G. Messerschmitt, Synchronous data flow, Proc. IEEE 75, No. 9, 1987, 1235–1245.
33. R. J. Lockwood, Heuristic Scheduling in VEIL, M.S. thesis, Department of Computer Science, University of Virginia, Charlottesville, May 1994.
34. T. J. Olson, N. G. Klop, M. R. Hyett, and S. M. Carnell, MAVIS: A visual environment for active computer vision, in Proceedings, IEEE Workshop on Visual Languages (VL'92), Seattle, September 1992, pp. 170–176.
35. J. K. Ousterhout, Tcl and the Tk Toolkit, Addison–Wesley, Reading, MA, 1994.
36. R. Pausch, N. Young III, and R. DeLine, SUIT, the Pascal of user interface toolkits, in Proceedings, UIST: The Annual ACM SIGGRAPH Symposium on User Interface Software and Technology, November 1991.
37. A. R. Pope and D. G. Lowe, Vista: A software environment for computer vision research, in Proceedings, IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Seattle, 1994, pp. 768–772.
38. J. R. Rasure and C. S. Williams, An integrated data flow visual language and software development environment, J. Visual Lang. Comput. 2, 1991, 217–246.
39. V. Sarkar, Partitioning and Scheduling Parallel Programs for Multiprocessors, MIT Press, Cambridge, MA, 1989.
40. L. G. Shapiro, R. M. Haralick, and M. J. Goulish, INSIGHT: A dataflow language for programming vision algorithms in a reconfigurable computational network, Int. J. Pattern Recognit. Artif. Intell. 1, No. 3, 1987, 335–350.
41. G. C. Sih and E. A. Lee, A compile-time scheduling heuristic for interconnection-constrained heterogeneous processor architectures, IEEE Trans. Parallel Distrib. Systems 4, No. 2, 1993, 175–187.
42. Silicon Graphics Computer Systems, Inc., IRIS Explorer User's Guide, Document 007-1371-010, Silicon Graphics Computer Systems, Mountain View, CA, 1992.
43. C. V. Stewart and C. R. Dyer, Heuristic scheduling algorithms for PIPE, in Proceedings, IEEE Workshop on Computer Architecture for Pattern Analysis and Machine Intelligence, October 1987.
44. M. J. Swain and M. Stricker (Eds.), Promising directions in active vision (written by the attendees of the NSF Active Vision Workshop, August 1991), University of Chicago Technical Report CS 91-27, November 1991.
45. J. R. Taylor, R. J. Lockwood, T. J. Olson, and S. A. Gietler, A visual programming system for pipelined image processors, in Proceedings, SPIE Machine Vision Applications, Architectures, and Systems Integration, Boston, Nov. 1992, pp. 92–101.
46. S. L. Tanimoto, VIVA: A visual language for image processing, J. Visual Lang. Comput. 1, No. 2, 1990, 127–139.
47. D. G. Tilley, ZEBRA for MaxVideo: An application of object-oriented microprogramming to register level devices, Technical Report 315, Department of Computer Science, University of Rochester, 1990.
48. C. Upson, T. Faulhaber, D. Kamins, D. Laidlaw, D. Schlegel, J. Vroom, R. Gurwitz, and A. van Dam, The Application Visualization System: A computational environment for scientific visualization, IEEE Comput. Graphics Appl., July 1989, 30–42.
49. J. Webb, Steps toward architecture-independent image processing, IEEE Comput. 25, No. 2, 1992, 21–31.
50. G. Wolberg, IMPROC: An interactive image processing software package, Technical Report CUCS-330-88, Department of Computer Science, Columbia University, 1988.
51. T. Yang and A. Gerasoulis, DSC: Scheduling parallel tasks on an unbounded number of processors, IEEE Trans. Parallel Distrib. Systems 5, No. 9, 1994, 951–967.
52. M. Zeltner, B. Schneuwly, and W. Guggenbühl, How to program and configure a heterogeneous multiprocessor, in Proceedings, 1991 Workshop on Computer Architectures for Machine Perception, Paris, December 1991, pp. 111–122.
