
Vertical Processing Systems: A Survey

Work with associative memories and associative-array processing has culminated in the development of fine-grained single-instruction, multiple-data computing systems, called vertical processing systems, that employ bit-slice sequential processing. After reviewing the engineering characteristics of various commercial and research models such as the DAP, MPP, and CM, this survey proposes a combined architecture that links a VPS with a set of highly specialized, homogeneous coprocessors.

Yakov I. Fet
Russian Academy of Science, Siberian Division

Fine-grained single-instruction, multiple-data (SIMD) architectures, which originated from associative-array processors, have played a particularly important role in the evolution of parallel computing systems. Invented in 1956,¹ the associative, or content-addressed, memory (CAM) offers an ideal means for performing its dedicated basic operation, the equality search. When the initial data array loads into corresponding memory elements and a comparand feeds to the digit buses, the CAM's 2D distributed structure produces the search results immediately after the transition processes terminate. The processing commonly found in digital techniques is absent. Rather, the CAM's specialized logical net directly models the search algorithm.

CAM applications initially focused on nonnumeric processing tasks such as data retrieval. Parallel implementation designs that use arbitrary algorithms soon emerged in CAMs, or more exactly, in associative-array processors.² Interest in CAMs and associative-array processors has never waned. Despite remarkable advances in VLSI technology, manufacturing distributed CAMs in sizes sufficient for practical application remains a challenge.³ However, associative processing has gained wide commercial success in quasi-associative, or vertical, processors. The vertical processing system (VPS) is a particular kind of SIMD machine based on bit-slice sequential processing principles.


The VPS represents a major trend in supercomputer design. This survey traces that course through several well-known implementations. It concludes with a look at an approach that combines a VPS with a set of coprocessors to capitalize on the advantages of each.

Background

Shooman first suggested separating the vertical operational unit (OU) from the argument array memory.⁴ He also introduced the terms vertical processing, vertical processor, and orthogonal processor. In the middle 1960s, Sanders Associates designed a family of orthogonal computers, the OMEN-60,⁵ based on Shooman's suggestion. These computers provided both horizontal and vertical processing. Slotnick⁶ explored another line in vertical processing design that involved building quasi-associative processors based on rotating head-per-track memories. This idea applies the principles of associative processing to inexpensive mass storage devices. In 1970, Codd proposed a relational data model⁷ that proved, due to its inherent homogeneity and parallelism, quite suitable for hardware implementation in associative processors. As Ozkarahan⁸ shows, Slotnick's and Codd's work encouraged the design of a number of database processors based on quasi-associative principles. Nearly all dedicated database machines use parallel processing of data-array bit slices, or vertical processing.


In 1972, Goodyear Aerospace launched the Staran system, a supercomputer possessing all the features of an associative-array processor. Staran uses the principles of vertical, rather than associative, processing.

The classical CAM represents a superposition of a 2D logical network (a specific OU) onto a 2D memory unit (Figure 1a) that implements the processing in space. By separating the distributed OU from the memory and compressing it vertically, as Figure 1b shows, we get a conventional horizontal (word-sequential) processing scheme with horizontal memory access. If the OU were compressed horizontally, we would get the vertical (bit-slice sequential) processing scheme with vertical memory access (Figure 1c). Hennie⁹ introduced space-time transformations of this kind. With compression, the OU's 2D logical network changes into a 1D network. Rather than distributed processing in space, we get concentrated processing in time. Instead of a global, continuous process involving the whole 2D argument array, we get sequential processing of separate fragments (slices) of the array.

Thus, we need to consider three processing variants for large data arrays: distributed processing in 2D structures, concentrated processing in 1D horizontal structures (von Neumann architecture), and concentrated processing in 1D vertical structures. The third variant forms the basis of vertical systems. Vertical processing represents a happy compromise: Vertical OUs are well within the reach of present technology, yet they ensure parallel processing of sufficiently large argument-array fragments. For example, suppose that an array of 2¹⁴ 32-bit words must be processed and that the parallelism of the vertical OU equals 2¹⁴. The entire processing would require only 32 time steps, one per bit position, making it 2¹⁴/32 = 2⁹ times faster than the horizontal approach, which handles one word per step.
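
The step accounting generalizes to arrays larger than the OU: with W words of m bits each and a vertical OU of parallelism P, vertical processing needs m x ceil(W/P) steps, versus W steps for a word-at-a-time horizontal OU. A few lines of Python (an idealized count that ignores I/O and control overhead) make the bookkeeping explicit:

```python
def vertical_steps(words: int, bits: int, parallelism: int) -> int:
    """Bit-serial steps: one pass of `bits` cycles per batch of PEs."""
    batches = -(-words // parallelism)        # ceiling division
    return bits * batches

W, m, P = 2**14, 32, 2**14                    # the example from the text
horizontal = W                                # one word per step
vertical = vertical_steps(W, m, P)            # 32 steps
print(vertical, horizontal // vertical)       # -> 32 512  (a 2**9 speedup)
```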

Vertical processing systems

Figure 1. Space-time transformations: distributed (a), horizontal (b), and vertical (c) operational units.

To better understand the vertical computer, let's take a closer look at the VPS.¹⁰ To this category belong fine-grained SIMD systems characterized by a bit-slice sequential operation mode and a massive number (tens and hundreds of thousands) of simple, 1-bit processing elements (PEs), each connected by means of a 1-bit-wide bus with its own local memory. Well-known representatives of VPSs include the Staran, DAP, MPP, and CM supercomputers. Table 1 shows the main characteristics of several contemporary models. (This survey excludes the MasPar fine-grained SIMD computers because they have a rather complicated PE structure that does not fit our definition of a VPS.)

Table 1. Characteristics of vertical processing systems.

System model  Manufacturer              Number of PEs     Interconnection network    Local memory (Kbits)  Memory (Mbytes)
Blitzen       MCNC                      16,384 (128x128)  X-grid (eight directions)  1                     2
CM-1          Thinking Machines         65,536            Hypercube                  4                     32
CM-2          Thinking Machines         65,536            Hypercube and NEWS         64                    512
DAP-510       Active Memory Technology  1,024 (32x32)     NEWS and broadcast lines   32                    4
DAP-610       Active Memory Technology  4,096 (64x64)     NEWS and broadcast lines   64                    32
GAPP          Martin-Marietta           10,000 to 82,944  NEWS                       128 bit               1,296 Kbit
MPP           Loral Defense Systems     16,384 (128x128)  NEWS                       1                     2
Staran        Goodyear Aerospace        up to 8,192       Flip                       9                     up to 10

Recent publications describe various VPS models;¹¹⁻¹³ Potter and Meilander¹⁴ analyzes them.

General structure. The main VPS components are host computers, controllers, OU (PE matrix), main memory, interconnection network, data structure transformation unit, mass memory, and coprocessors (accelerators). Vazhenin et al.¹⁵ presents a detailed morphological analysis of existing VPSs. This survey reviews the construction and functioning of the various subsystems of several well-known VPS models. Coprocessors, or accelerators, are especially important in VPS development, so we will take a closer look at them.

As Figure 2 shows, a collection of PEs with their attached local memories forms the core of each VPS. This figure depicts a square matrix of PEs, the most common case. Associated with each PE is a bit-addressable local memory. The size of the local memory, usually based on commercial RAM chips, varies between 1 and 64 Kbits. We commonly picture the local memory as a single-bit stack placed perpendicularly to the PE matrix. The aggregate of all local memories then forms a memory cube whose total capacity may reach hundreds of megabytes.

Each PE contains at least four single-bit registers, denoted in Figure 2 as Q (accumulator), A (activity control), C (carry bit), and D (data buffer). Besides these registers, the PE also includes a full binary adder (or a more sophisticated single-bit arithmetic logic unit), several bit flags, and multiplexers to ensure connections with the local memory and other PEs. We sometimes call the aggregate of the Q (A, C, D) registers of all PEs the Q- (A-, C-, D-) plane. These are the processing planes. The aggregate of corresponding (having equal addresses) bits of all local memories forms the memory plane.

A simple sequential bit-processing technique forms the basis of the vertical architecture. Consider the addition of two integer matrices. The coefficients of the initial matrices A and B reside vertically in the memory cube: the local memory of the ijth PE stores elements A_ij and B_ij, as well as the sum S_ij = A_ij + B_ij computed sequentially, bit by bit. This processing takes m cycles, where m is the word length of the coefficients. In each kth cycle, the ijth PE reads from its local memory the kth bit of A_ij, then the kth bit of B_ij. Next, it adds these bits together with the carry bit from the (k-1)st cycle stored temporarily in register C. The PE stores the kth bit of the sum S_ij in the kth position of the local memory's sum field, while the carry bit produced for the (k+1)st cycle replaces the previous contents of register C.

The vertical architecture easily adapts to various data formats simply by changing the cycle counter. The VPS thus provides greater functional flexibility, as the OU serves equally well for arbitrary types of data. For instance, the VPS permits parallel high-precision computations with dynamic control of operand length.¹⁶
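
The following sketch models this bit-serial addition cycle in Python (a software illustration of the technique, not any vendor's microcode; the array sizes and the NumPy representation are my own choices). Each bit plane plays the role of one memory plane, and C plays the role of the carry-register plane:

```python
import numpy as np

# Software model of bit-serial SIMD addition on an N x N PE matrix.
# Bit plane k of the "memory cube" holds bit k of every coefficient;
# all PEs execute the same cycle in lockstep.
N, m = 4, 8                                    # 4x4 PEs, 8-bit operands
rng = np.random.default_rng(0)
A = rng.integers(0, 2**(m - 1), size=(N, N))
B = rng.integers(0, 2**(m - 1), size=(N, N))

A_planes = [(A >> k) & 1 for k in range(m)]    # vertical storage of A
B_planes = [(B >> k) & 1 for k in range(m)]    # vertical storage of B
S_planes = []                                  # sum field, filled bit by bit
C = np.zeros((N, N), dtype=int)                # carry-register plane

for k in range(m):                             # one cycle per bit position
    a, b = A_planes[k], B_planes[k]            # read bit k from local memory
    S_planes.append(a ^ b ^ C)                 # kth sum bit for every PE
    C = (a & b) | (a & C) | (b & C)            # carry into cycle k+1

S = sum(plane << k for k, plane in enumerate(S_planes)) + (C << m)
assert (S == A + B).all()                      # agrees with word-parallel add
```

Note that every PE executes the identical plane operations in each cycle; only the memory plane address (the bit index k) changes, which is why the same microprogram handles any operand length.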


Figure 2. Processing core of a VPS.

An important VPS architectural feature is its variously complex means of interprocessor communication. The simplest case involves nearest-neighbor connection, or a NEWS (north, east, west, south) grid. The classic associative-array and quasi-associative (head-per-track) processors did not provide specific means for interconnecting different cells or processing channels. The interconnection network of a modern VPS decisively influences the system's performance and the flexibility of its applications.

Staran. In the early 1970s, Goodyear Aerospace, under the direction of chief architect Kenneth Batcher, designed the Staran system. This was the first commercially successful VPS. Figure 3 shows a simplified block diagram of Staran. A typical Staran system configuration consists of four array modules (with possible extension up to 32 modules) and three control units: a parallel processing control unit for the array modules, a sequential control unit (a DEC PDP-11 was used), and a parallel I/O control unit. Each array module has 256 PEs with a corresponding local memory capacity of up to 9,216 bits. A Staran system's PEs (1,024 PEs in this configuration) form a linear 1D array instead of the 2D square array shown in Figure 2. The system can connect with different host computers via the I/O channel or by direct memory access.



Figure 3. Staran block diagram.

Note that Staran is sometimes called an associative-array processor¹⁸ and the memory of its array modules an associative memory. These terms do not reflect the authentic features of Staran hardware, but rather denote this system's historical appearance as the first real implementation of the associative-processing concept. Actually, Staran uses quasi-associative bit-slice-sequential processing, that is, vertical processing, making it a typical VPS.

Though standard RAM chips form the Staran memory, it exhibits the unusual characteristics of a multidimensional access memory. Logically, the memory of each array module consists of a set of square 256x256-bit matrices. During read/write operations, a dedicated hardware unit generates particular chip numbers and corresponding addresses to ensure up to 256 patterns of conflict-free mosaic accesses, including horizontal (word) and vertical (bit-slice) access. Read-cycle time is 150 ns; write-cycle time is 250 ns.

Communications between the PEs in each array module flow through a flexible original network, Flip. The Flip is a dedicated multistage interconnection network that implements a broad set of frequently used permutations, as well as batch shifts. This network has 256 inputs and 256 outputs. It exploits a simple control algorithm that allows fast setting of up to 256 butterflies. The parallel I/O control unit provides high-bandwidth input and output, as well as transfer of 256-bit data blocks between the array modules.
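
One classic way to obtain conflict-free access to both words and bit slices is to skew the storage so that word i places its bit j in memory chip (i + j) mod N. The sketch below illustrates that scheme in Python; it is an idealization of multidimensional access, not a description of Staran's actual address-generation hardware:

```python
import numpy as np

N = 8                                    # 8x8 bit matrix (Staran: 256x256)
rng = np.random.default_rng(1)
words = rng.integers(0, 2, size=(N, N))  # words[i][j] = bit j of word i

# Skewed storage: chip (i + j) % N, at address i, holds bit j of word i.
chips = np.zeros((N, N), dtype=int)      # chips[c][addr]
for i in range(N):
    for j in range(N):
        chips[(i + j) % N][i] = words[i][j]

def read_word(i):                        # word access: one bit per chip
    return np.array([chips[(i + j) % N][i] for j in range(N)])

def read_bitslice(j):                    # slice access: also one bit per chip
    return np.array([chips[(i + j) % N][i] for i in range(N)])

assert (read_word(3) == words[3]).all()
assert (read_bitslice(5) == words[:, 5]).all()
```

Either access pattern touches every chip exactly once, so a full word and a full bit slice each complete in a single memory cycle.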


Programming the Staran system involves the Associative Processor Programming Language (Apple), the assembly language developed for it. Commonly used Apple instructions execute parallel loads, stores, moves, associative searches, and arithmetic computations. At the end of the 1970s, an airborne version of Staran, named Aspro, emerged. Having the same computing power as a typical large-scale Staran configuration, this processor occupies only 0.01 m³ and weighs less than 15 kg.

EGPA. In 1961, Wolfgang Haendler proposed organizing vertical processing in a standard general-purpose, von Neumann computer.¹⁹ In essence, this design places the data in a conventional memory that is "turned" 90 degrees. Due to this turn, each memory access, say, with a standard 32-bit format, delivers 32 bits (of the same positions) of 32 different words, instead of one 32-bit word. If one makes corresponding changes in the microprograms controlling the standard horizontal OU, the 32-bit parallel ALU will behave as 32 single-bit ALUs. The classic general-purpose computer thus becomes an orthogonal horizontal-vertical (H-V) computer that can work either in the usual horizontal mode or in a quasi-associative vertical mode. This approach stores programs in the common main memory horizontally. With the computer turned to its horizontal mode, the usual horizontal data can occupy some of the memory fields for processing by conventional programs. This flexibility is especially valuable for solving problems with mixed horizontal and vertical parts. The Erlangen General-Purpose Array (EGPA) research project at the University of Erlangen-Nuremberg²⁰ uses Haendler's H-V approach.

The H-V computer has an inherent limitation in the degree of its parallelism: When a 32- or 64-bit computer is used, the parallelism will be 32 or 64. However, this limitation disappears in multiprocessor systems. Haendler²¹ describes an elegant 2D-array architecture of vertical PEs. In it, 32 general-purpose processors connected in line (by means of multiport memory switches) form the first dimension of the array. In accordance with the H-V principle, the 32 single-bit ALUs of each processor form the second dimension. The result is a 2D VPS having 1,024 (32x32) PEs, easily assembled from commercially available processors.

LUCAS. The Lund University Content-Addressable System is an early experimental VPS project developed at Sweden's University of Lund by Fernstrom et al.²² It comprises a linear array of 128 one-bit PEs with 4-Kbit local memories, built from ordinary RAM chips. In describing the architecture of LUCAS and the Pascal/L high-level programming language designed for it, Fernstrom²² discusses the results of implementing several important problems on LUCAS. These include matrix multiplication, signal processing, graph theory, and relational database processing.

DAP. As Reddaway²³ discusses, the Distributed-Array Processor was conceived in 1972 at the Research and Advanced Development Center of International Computers Limited (ICL, Stevenage, UK).

Flanders et al.²⁴ describes the pilot DAP, which has been working since 1976. It has a 32x32-PE matrix with 1-Kbit local memories. That article also contains preliminary information on system software (DAP Fortran), as well as the results of studies concerning typical problems (matrix multiplication and inversion, fast Fourier transform, and convolution). ICL manufactured the first-generation commercial series of ICL DAP systems based on the pilot DAP. This 64x64-PE version has 4-Kbit local memories and a machine cycle time of 160 ns. It uses an ICL 2900 mainframe computer as host. Beginning in 1985, ICL delivered 32x32 Mini-DAP systems.

Figure 4 shows the DAP schematic diagram. This design organizes the ICL DAP store (the aggregate of local memories) as standard modules of the host computer's main memory, ensuring very efficient interaction of both parts. This feature makes the DAP an active memory that realizes high-speed processing of its contents. From this comes the name of the company, Active Memory Technology (AMT) of Irvine, California, which produces DAP-500 and DAP-600 series systems (see Table 1). These operate with Sun and VAX host workstations. In the naming convention for AMT DAP systems, the first digit corresponds to the PE matrix size (5 for 2⁵, 6 for 2⁶), and the second and third digits represent the clock frequency (10 MHz). Recently, AMT announced a DAP modification using 8-bit arithmetic accelerators. DAP software includes a high-level language, Fortran-Plus (an extended version of Fortran 77), and an array processor assembly language, APAL.

MPP. Goodyear Aerospace, now Loral Defense Systems, built the Massively Parallel Processor¹² under a US NASA contract, delivering it to Goddard Space Flight Center in May 1983. Kenneth Batcher made the decisive design contribution. The main use envisioned for MPP was high-speed processing of 2D satellite images. The MPP processing matrix represents a square of 128x128 PEs, with 1-Kbit local memories. See Figure 5 for its general structure. In this configuration, a DEC PDP-11/34A computer served as the program and data management unit (PDMU) and a VAX 11/780 as host computer.

The MPP's PE contains six single-bit registers (A, B, C, G, P, S), a full binary adder (BA), a logic unit (LU) performing any Boolean function of two input variables, and a variable-length shift register (Shr). Shr length adjusts (under program control) to 2, 4, 10, 14, 18, 22, 26, or 30 bits. This register stores intermediate results inside the PE, which is important, for instance, in speeding up multiplication. The PE clock frequency is 10 MHz.

The S registers serve for data I/O. The S plane (the aggregate of all 16,384 S registers) has two 128-bit ports. At each cycle, the next 128-bit data column pushes through the input port in the leftmost (western) column of the S plane.


Figure 4. DAP block diagram.


Figure 5. MPP block diagram.

Concurrently, the contents of the S plane shift horizontally, resulting in the output of the rightmost (eastern) column through the output port. In this way, the contents of the S plane completely update after 128 cycles, and may store (using one additional cycle) in one of the 1,024 memory planes (see Figure 2).

An important MPP architectural feature is its staging memory, a dedicated unit intended for transforming data structures. These transformations are needed mainly because the input data arrays come sequentially in elements (words, pixels), whereas they should transfer into the array memory sequentially in bit planes, making inverse transformations at the output necessary.
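
The core of such a transformation is a corner turn: a block of word-organized data is transposed so that bit k of every word lands in one bit plane. A minimal Python illustration of the idea follows (my own sketch of the transformation itself, not of the staging memory's pipelined implementation):

```python
import numpy as np

# Corner turn: 128 incoming 8-bit pixels (word-sequential order) become
# 8 bit planes of 128 bits each (bit-plane-sequential order).
n_words, m = 128, 8
pixels = np.arange(n_words) % 256                  # sample input stream

bit_planes = np.array([(pixels >> k) & 1 for k in range(m)])
assert bit_planes.shape == (m, n_words)            # plane k = bit k of all words

# Inverse transformation (needed on output): reassemble the words.
restored = sum(bit_planes[k] << k for k in range(m))
assert (restored == pixels).all()
```

In the MPP, this reformatting happens on the fly inside the staging memory, so the PE array always sees data in its natural bit-plane form.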


The staging memory performs all transformations on the fly, without introducing any data-processing delay. Configurable in up to 32 banks, each with 256K 64-bit words (total capacity up to 64 Mbytes), the MPP staging memory has a transfer rate of up to 160 Mbytes/s. Hence, in addition to its special functions, this unit can also serve as an additional memory unit.

A NEWS grid in the MPP provides interprocessor communications. Programmable reconfiguration of connections between the edges of the PE matrix is possible. In this way, different topologies are available: an isolated square of 128x128 PEs; two cylinders with different axes if the North and South (or East and West) edges are closed; and several tori and spirals.

MPP software consists of the usual supervisory programs, several assemblers for various computer subsystems, and microcode libraries. A high-level language, Parallel Pascal (an extension of standard Pascal), ensures manipulation with parallel variables, which are 128x128 matrices of variable-length integers or 32-bit floating-point numbers.

Blitzen. As Blevins et al.²⁵ reported, the Microelectronics Center of North Carolina (MCNC) is developing a new VPS, an improved version of MPP called Blitzen. This project aims at ensuring miniaturization and improving performance. The basic Blitzen building block is a custom chip designed by MCNC that contains 128 PEs, each with 1-Kbit on-chip local memory. A 16,000-PE system requires 128 chips, compared to the MPP with only eight PEs per chip and the Connection Machine with 16 PEs, both without memories. The Blitzen PE design uses 1.25-µm CMOS technology, contains 1.1 million transistors, and operates at 20 MHz.

Besides the on-chip incorporation of local memory, the Blitzen system includes other architectural improvements compared to the MPP: use of bidirectional shift registers in the PE, local modifications of global memory addressing, local conditional control of PE operations, and application of a richer interconnection scheme called the X-grid, a 2D network that provides eight neighbors in compass directions N, NE, E, SE, S, SW, W, NW. The Blitz assembly-level language allows the design of microcode library routines. Also included is a high-level, object-oriented language based on C++.²⁵ The expected performance of a 16,000-PE Blitzen system is 450 Mflops (32-bit IEEE standard).

Connection Machine. The advent of the CM, designed and manufactured by Thinking Machines Corporation (TMC), was an important landmark in VPS history. This event was closely associated with W. Daniel Hillis, chief architect of the CM and cofounder of TMC. Initially, Hillis conceived this machine at the Massachusetts Institute of Technology Artificial Intelligence Laboratory as a cellular automaton with elaborated intercell connections that could efficiently support manipulation with semantic networks.²⁶


After undergoing engineering design at the AI Laboratory, the CM proceeded to TMC, which was founded in June 1983. By the end of 1984, TMC built the first 16,000-PE prototype of the CM-1 system, and it successfully demonstrated a full 64K-PE system at the end of 1985. In April 1986, TMC introduced the CM-1 commercially. Success with the first model led its designers to build an improved version, CM-2, featuring increased performance and reliability and a radically revised mass memory system. Thinking Machines introduced the CM-2 in April 1987. The new company quickly became the world leader in supercomputer design and marketing. See the box for a closer look at Thinking Machines' Connection Machine.

Evolution of vertical processing systems

Two clear trends mark the development of the VPS: increasing arithmetic processing performance and extending the allowable programming styles. The first involves using various means of hardware support for floating-point arithmetic operations (8-bit coprocessors in the DAP, Weitek chips in the CM-2). The second involves shifting away from pure vertical architectures, and marks the appearance of mixed architectures. At the end of 1991, TMC began manufacturing a very powerful system, the CM-5,²⁸ which should reach the teraflop performance range. The CM-5 is no longer a VPS.

The architecture of CM-1 and CM-2, their high level of parallelism, and the advantages afforded by the data-parallel programming style proved that VPSs provide great potential for efficient implementation of a wide range of problems. These include such applications as computational physics, geophysics, computer graphics, document retrieval, free text processing, and computer vision.²⁷,²⁹ However, real application problems are usually nonuniform. Different problems, and various fragments of the same problem, can successfully exploit the high performance characteristic of SIMD, MIMD, vector, and special architectures. Consequently, architects try to construct systems with mixed features that give users the flexibility to vary their programming style according to the peculiarities of the problem at hand.

The designers of CM-5 abandoned TMC's traditional approach by giving the CM-5 all the properties of a MIMD system. In the CM-5, its designers implemented the PEs on the basis of a standard 32-bit RISC microprocessor (Sparc), interconnecting them using a very powerful network called a fat tree.³⁰ Each PE can independently execute its own program stored in its local instruction memory. The design also provides hardware support for global (barrier) synchronization, which allows the CM-5 to retain all the advantages of the data-parallel style.

Coprocessors

The coordinated operation of a very large number of PEs, each PE performing comparatively slowly (bit-wise sequentially), determines the super-high performance of a VPS. This approach lessens the demands for speed from each separate PE, allowing the use of simpler, cheaper electronic components, which gives the VPS its good cost/performance factor. Supercomputers using a vertical architecture surpassed the expectations of their designers. This architecture, which arose from nonnumerical associative-processing concepts, also proved quite efficient for solving important, large-scale numerical problems.

At the same time, we note the imbalance between separate subsystems of the VPS. In multiprocessors of other architectures, the basic OUs, the node microprocessors, usually set the fashion, and all other subsystems must catch up. In the VPS, the OU lags behind other subsystems. Typically, a VPS possesses a main memory of very large capacity, a huge mass memory array supported by a high-bandwidth I/O control subsystem, and efficient interconnection networks. The OU consists of the simplest single-bit ALUs, though having a high degree of parallelism. The bit-sequential processing technique each PE uses especially affects the implementation of floating-point arithmetic.

One common method for overcoming these limitations is to attach specialized processors or accelerators. The tendency toward using high-performance coprocessors surfaces in the development of many recent computing systems, including the VPS. However, most such designs employ only vector-pipeline arithmetic accelerators to speed up floating-point operations. An important source of increased VPS performance is the broad application of various problem-oriented coprocessors. Especially noteworthy are homogeneous devices based on parallel-processing principles. Examples include systolic arrays,³³ classical cellular arrays,³⁴ and distributed functional structures.³⁵ Two factors motivate the particular interest in homogeneous structures:

1. Homogeneous structures permit the highest level of parallelism because all the bits of argument data arrays are processed concurrently.
2. Modern VLSI circuits are gate arrays consisting of a large number of blanks (transistor cells) regularly placed on the chip, which is ideal for implementing various homogeneous computing structures.


Recently, many companies have begun exploring possible uses for homogeneous structures. Digital Equipment Corporation's programmable active memory (PAM) is a rectangular mesh of identical cells called programmable active bits (PABs). Each PAB connects to four neighboring PABs and implements logical functions defined by a truth table loaded into its control register.

DEC's Paris Research Laboratory has built a universal reconfigurable homogeneous coprocessor called Perle-0,³⁶ which represents a PAM of 3,200 (40x80) PABs realized on the basis of logic cell arrays from Xilinx. The reconfiguration of Perle-0 occurs when the host computer downloads the corresponding program into the PAM's control registers. Use of Perle-0 has produced an order-of-magnitude speedup in solving several complicated problems such as high-accuracy multiplication, data compression, and image processing. Other organizations have used a similar approach.

Most of these research models are universal. They can realize arbitrary functions and algorithms, and the synthesis of necessary logical structures proceeds using classical automata theory techniques. Unfortunately, most specific functions will incur a time and hardware redundancy when implemented this way.

Specialized homogeneous structures, which immediately map algorithms into circuits, represent an alternative to universal ones. Signals propagating through a specialized logical net simulate the given algorithm. One example is the distributed associative memory with its special basic operation, the equality search. Other specialized structures realizing other basic operations have emerged as well. These cellular arrays are distributed functional structures (DF-structures).³⁵ Many of these devices proved to be perfect accelerators. A VPS needs such accelerators, so the specialized homogeneous processors and the basic VPS make a good team.
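
As a concrete picture of the PAB concept, the sketch below models a mesh of four-input truth-table cells in Python. The cell behavior and update rule are a simplified abstraction of a programmable-active-memory fabric, not DEC's actual PAB circuit or its Xilinx realization:

```python
import numpy as np

R, C = 8, 8                                   # small PAB mesh (Perle-0: 40x80)
rng = np.random.default_rng(2)
state = rng.integers(0, 2, size=(R, C))       # one bit per cell

# "Control register": a 16-entry truth table over the four NEWS neighbors.
# This table computes neighbor parity, but any 4-input Boolean function
# can be loaded the same way, which is what makes the mesh universal.
truth_table = np.array([bin(i).count("1") & 1 for i in range(16)])

def step(s):
    n = np.roll(s, 1, axis=0)                 # north neighbor of each cell
    e = np.roll(s, -1, axis=1)                # east
    w = np.roll(s, 1, axis=1)                 # west
    so = np.roll(s, -1, axis=0)               # south
    index = (n << 3) | (e << 2) | (w << 1) | so
    return truth_table[index]                 # look up every cell at once

state = step(state)                           # one synchronous mesh update
```

Loading a different 16-entry table reprograms every cell's logical function, which is exactly the reconfiguration step the host performs for Perle-0.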

Combined architectures

In the multiprocessing computing complex called a combined architecture,³² solving any problem involves an interaction of several processes, such that the specialized subsystem most efficient in implementing a particular process executes that process. The architecture controls its subsystems to ensure their balanced operation and to best exploit their special complementary features. The design selects a structure for each subsystem that best corresponds to the function it should perform.

An immediate application of these considerations is to directly join a standard VPS with separately produced coprocessors. One such architecture is the MASC (mixed associative-systolic computer) system.³² Intended for solving labor-intensive matrix problems, this system combines a Staran-like VPS with a set of systolic/wavefront processing arrays. The extremely high-speed matrix computations these processors provide justifies using the systolic architecture in the processing subsystem.

As is well known, smooth systolic/wavefront processor operation requires data I/O in specific formats and at proper times. For example, the coefficients of incoming matrices in systolic processing should go (by row or column) to the inputs of certain PEs of a respective array in the form of a polygon of a given shape.


Sometimes, the sequences of coefficients must be interleaved by zeros. Using conventional memory devices and I/O hardware would lead to serious difficulties in realizing these elaborate data manipulations. Inevitably, the expected performance of the processing array would suffer. The combined MASC approach achieves the necessary balance, because the preprocessing subsystem is a VPS that possesses such inherent properties as flexible, highly parallel memory access, fast manipulation of data structures, and large I/O channel bandwidth.

The combined architecture thus pertains to both VPS development trends. Diverse problem-oriented processors provide the multiarchitecture environment, while the strong specialization of these processors provides computational power. The VPS itself retains all its properties and advantages, so problems still fit easily into the VPS structure and programming style.

The possibility of a closer combination, even a mutual penetration of the vertical and cellular structures, also exists. Indeed, the topological similarity of the VPS operational plane (the PE matrix) to the cellular arrays suggests the possibility of embedding homogeneous coprocessors into the VPS's PE matrix. Such embedding requires simple modifications of the PE's logic circuit, but can substantially enrich the system's functional potential. Currently, all VPS data processing uses classical, universal techniques. Sequential microprograms realize the arithmetic and logic functions. These microprograms apply concurrently in all PEs to corresponding data elements (words) stored in the third dimension of local memories. If the operational plane contains features of some specialized structure, in addition to its standard possibilities, the system would acquire all the functional possibilities of the embedded structure. Not only universal bit-sequential algorithms but also such methods as systolic processing and the quasi-analog processing characteristic of DF-structures will apply to an array written in the local memories. Special microprograms called at various problem-solving stages can represent the corresponding procedures.
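
To illustrate the kind of data formatting a systolic array demands, the following Python sketch skews a matrix into the staggered, zero-interleaved "polygon" feed commonly used for systolic matrix multiplication. It is a generic illustration of systolic input formatting, not MASC's actual preprocessing procedure:

```python
import numpy as np

# Skew a matrix for systolic input: column j of A is delayed by j time
# steps, so the array's input edge sees one anti-diagonal per step and
# the gaps are filled with zeros (the interleaving mentioned above).
A = np.arange(1, 10).reshape(3, 3)           # 3x3 sample matrix
n = A.shape[0]
steps = 2 * n - 1                            # skewed feed lasts 2n-1 steps

feed = np.zeros((steps, n), dtype=A.dtype)
for j in range(n):
    feed[j:j + n, j] = A[:, j]               # column j enters j steps late

print(feed)                                  # staggered "polygon" of inputs
```

Producing such a stream at full rate is exactly the task the VPS preprocessing subsystem takes on in MASC, using its bit-slice memory access to reshape data on the fly.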

VERTICAL PROCESSING SYSTEMS, which evolved from quasi-associative (bit-slice-sequential) processing, represent one main direction for modern supercomputer design and manufacture. Kenneth Batcher, Stewart F. Reddaway, Dennis Parkinson, and W. Daniel Hillis were especially instrumental in creating contemporary VPS models.

Specialized coprocessors are an important factor in increasing VPS performance. In some cases, it may be possible to exploit immediate mapping of algorithms into the structure of a coprocessor. This approach ensures a substantial improvement compared to implementing the same algorithms by conventional bit-sequential methods in the basic VPS.


One possible application of such coprocessors is the use of distributed functional structures, which have resulted from distributed associative-processing developments.

Joining a VPS with a set of coprocessors into a structure called a combined architecture best capitalizes on the inherent advantages of both components, provided one can successfully organize their balanced interaction for definite classes of algorithms. The combined approach might extend the life of existing VPS models, making them attractive to a larger range of users.

References

1. A.E. Slade and H.C. McMahon, "A Cryotron Catalog Memory," Proc. Eastern Joint Computer Conf., AIEE, New York, Vol. 10, 1956, pp. 115-120.
2. R.H. Fuller and G. Estrin, "Some Applications for Content-Addressable Memories," Proc. Eastern Joint Computer Conf., AIEE, Vol. 24, 1963, pp. 495-508.
3. K.E. Grosspietsch, "Associative Processors and Memories: A Survey," IEEE Micro, Vol. 12, No. 3, June 1992, pp. 12-19.
4. W. Shooman, "Parallel Computing with Vertical Data," Proc. Eastern Joint Computer Conf., AIEE, Vol. 18, 1960, pp. 111-115.
5. L.C. Higbie, "The OMEN Computers: Associative Array Processors," Proc. Sixth Ann. IEEE Computer Soc. Int'l Conf., 1972, pp. 287-290.
6. D.L. Slotnick, "Logic-per-Track Devices," Advances in Computers, Vol. 10, Academic Press, New York, 1970, pp. 291-296.
7. E.F. Codd, "A Relational Model of Data for Large Shared Data Banks," Comm. ACM, Vol. 13, No. 6, 1970, pp. 377-387.
8. E. Ozkarahan, Database Machines and Database Management, Prentice-Hall, Englewood Cliffs, N.J., 1986.
9. F.C. Hennie, Finite-State Models for Logical Machines, John Wiley & Sons, New York, 1968.
10. W. Haendler and Y.I. Fet, "Vertical Processing in Parallel Computing Systems," Proc. Int'l Conf. Parallel Computing Technologies, N.N. Mirenkov, ed., World Scientific, Singapore, 1991, pp. 56-75.
11. D. Parkinson and J. Litt, eds., Massively Parallel Computing with the DAP, MIT Press, Cambridge, Mass., 1990.
12. R.M. Hord, Parallel Supercomputing in SIMD Architectures, CRC Press, Boca Raton, Fla., 1990.
13. A. Trew and G. Wilson, Past, Present, Parallel: A Survey of Available Parallel Computer Systems, Springer-Verlag, Berlin, 1991.
14. J.L. Potter and W.C. Meilander, "Array Processor Supercomputers," Proc. IEEE, Vol. 77, No. 12, 1989, pp. 1896-1914.
15. A.P. Vazhenin, A.E. Vartasaryan, and Y.I. Fet, "Morphological Approach to the Analysis of Vertical Processing Systems" (in Russian), Preprint 982, Computing Center of Russian Academy of Science, Novosibirsk, Russia, 1993.
16. A.P. Vazhenin, "Programming System of High Accuracy Computation for Associative Array Processor," Proc. Int'l Conf. Vector and Parallel Processors in Computational Science (CONPAR'90-VAPP IV), Swiss Institute of Technology, Zurich, 1990, pp. C-69-C-77.
17. K.E. Batcher, "Staran Parallel Processor System Hardware," Proc. Am. Fed. Information Processing Societies (AFIPS) Conf., Vol. 43, AFIPS Press, Arlington, Va., 1974, pp. 405-410.
18. C.C. Foster, Content-Addressable Parallel Processors, Van Nostrand Reinhold, New York, 1976.
19. W. Haendler, "An Arithmetic Unit of a Digital Computer," German Patent No. 1157009, 1961.
20. W. Haendler, "Innovative Computer Architecture-How to Increase Parallelism But Not Complexity," Parallel Processing Systems, D. Evans, ed., Cambridge Press, London, 1982, pp. 1-41.
21. W. Haendler, "Multiprocessor Working as a Fault-Tolerant Cellular Automaton," Computing, Vol. 48, 1992, pp. 5-20.
22. C. Fernstrom, I. Kruzela, and B. Svensson, LUCAS Associative Array Processor: Design, Programming, and Application Studies, Springer-Verlag, Berlin, 1986.
23. S.F. Reddaway, "DAP-A Distributed Array Processor," Proc. 1st Ann. Symp. Computer Architecture, IEEE Computer Society Press, Los Alamitos, Calif., 1973, pp. 61-65.
24. P.M. Flanders, D.J. Hunt, and S.F. Reddaway, "Efficient High-Speed Computing with the Distributed Array Processor," High Speed Computer and Algorithm Organization, Univ. of Illinois Press, Chicago, 1977, pp. 113-128.
25. D.W. Blevins et al., "Blitzen: A Highly Integrated, Massively Parallel Machine," J. Parallel and Distributed Computing, Vol. 8, No. 2, 1990, pp. 150-160.
26. W.D. Hillis, "The Connection Machine: A Computer Architecture Based on Cellular Automata," Physica D, Vol. 10, Nos. 1/2, Jan. 1984, pp. 213-228.
27. W.D. Hillis, The Connection Machine, MIT Press, Cambridge, Mass., 1985.
28. W.D. Hillis and L.W. Tucker, "The CM-5 Connection Machine: A Scalable Supercomputer," Comm. ACM, Vol. 36, No. 11, 1993, pp. 31-40.
29. B.M. Boghosian, "Computational Physics on the Connection Machine," Computers in Physics, Vol. 4, No. 1, 1990, pp. 14-33.
30. C.E. Leiserson, "Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing," IEEE Trans. Computers, Vol. C-34, No. 10, 1985, pp. 892-901.
31. A.A. Khokhar et al., "Heterogeneous Computing: Challenges and Opportunities," Computer, Vol. 26, No. 6, June 1993, pp. 18-27.
32. A.P. Vazhenin, S.G. Sedukhin, and Y.I. Fet, "High-Performance Computing Systems of Combined Architecture," Proc. Int'l Conf. Parallel Computing Technologies, World Scientific, Singapore, 1991, pp. 246-257.
33. H.T. Kung and C.E. Leiserson, "Algorithms for VLSI Processor Arrays," Introduction to VLSI Systems, C.A. Mead and L.A. Conway, eds., Addison-Wesley, Reading, Mass., 1980, pp. 263-332.
34. T. Toffoli and N. Margolus, Cellular Automata Machines: A New Environment for Modeling, MIT Press, Cambridge, Mass., 1987.
35. Y.I. Fet, Parallel Processing in Cellular Arrays, Research Studies Press, Taunton, Somerset, UK, 1994.
36. P. Bertin, D. Roncin, and J. Vuillemin, "Introduction to Programmable Active Memories," Report No. 3, DEC Paris Research Laboratory, Paris, June 1989.

Yakov I. Fet is a senior researcher at the Computing Center of the Russian Academy of Sciences, Siberian Division. His research interests include massively parallel architectures, distributed computing in cellular arrays, and nonnumerical processing. Fet received his PhD from the Institute of Mathematics in Novosibirsk, Russia, and doctoral degree from the Power Engineering Institute in Moscow. He is a member of the Russian Association of Artificial Intelligence.

Direct questions about this article to Yakov I. Fet at the Computing Center, Siberian Division of the Russian Academy of Science, 6 Lavrentiev Avenue, Novosibirsk, Russia 630090; [email protected].

