Very High Speed Vectorial Processors Using Serial Multiport Memory as Data Memory


A. MZOUGHI, M. LALAM and D. LITAIZE
I.R.I.T. / Université Paul Sabatier, Toulouse, France
e-mail: [email protected]

Abstract. Complex scientific problems involving large volumes of data demand huge computing power, and memory bandwidth remains a key issue in high-performance systems. This paper presents an original method of memory organization based on serial multiport memory components, which allows simultaneous access to all ports without conflict between them and without suspension. The resulting information-exchange mechanism gives a cost-effective realization of a data memory for vector processors. The memory bandwidth can be increased considerably and in a modular manner, without practical implementation constraints.

Keywords: serial multiport memory, memory bandwidth, realignment network, vector processor.

1 Introduction

Performance improvements in scientific processing may be achieved in two different ways:
- by increasing the number of processors, connected to a common memory [1, 2] or organized in networks [3]. Nowadays, program parallelism in these computers is exploited more or less automatically, but with varying results;
- by using a fast processor specialized in vector calculation [4, 5]. This solution is well adapted to sequential programming [4, 6], since the vectorization of calculations is carried out automatically in a satisfactory way.
Combining these two solutions is, of course, possible, but involves a significant volume of data. When performance is to be increased, the main obstacle becomes the memory-processor interconnection network [7]. Existing solutions to this problem increase bandwidth to the detriment of complexity, reliability and, paradoxically, speed. A new approach involves the use of serial multiport memories and an original network. This study presents a vector processor architecture based on serial multiport memory, called VEC-SM2 (VEC for VECtor and SM2 for Serial Multiport Memory).

2 Definition of a Serial Multiport Memory Component

Fig. 1 shows the structure of a Serial Multiport Memory Component (SMMC). It consists of a conventional RAM of word size n, connected in parallel to P shift registers called Memory Shift Registers (MSRs), functioning at a frequency of f MHz.

Example: consider a RAM cycle time of 80 ns, a shift frequency of 200 MHz, and an MSR size of 72 bits = 64 + 8 bits (parity, Hamming code).

VEC-SM2: this project is supported by the CNRS, the Regional Council of Midi-Pyrénées and the French Industry and Research Ministry.

Then the MSR emptying time (tf) is: tf = 72 × 1/(200×10^6) = 360 ns. Adding some residual times, we obtain roughly 400 ns. Using these values, the RAM can continuously feed 5 MSRs within the emptying time. More generally, if P is defined as the number of MSRs in the SMMC, then the optimal value of P is given by: P = tf / RAM cycle time.

[figure: a RAM with address and R/W inputs feeds, through MSR load logic, P shift registers MSR0 … MSRp-1, each with its own SCLK and serial I/O]

Fig. 1. Serial Multiport Memory Component
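The figures in this example can be checked with a few lines of arithmetic; a minimal sketch in Python (the variable names are ours, not the paper's):

```python
# Emptying time of one 72-bit MSR shifted out at 200 MHz.
msr_bits = 72                     # 64 data bits + 8 check bits
shift_freq_hz = 200e6             # 200 MHz serial clock
ram_cycle_s = 80e-9               # 80 ns RAM cycle time

tf = msr_bits / shift_freq_hz     # 360 ns raw emptying time
tf_eff = 400e-9                   # roughly 400 ns once residual times are added
p_optimal = tf_eff / ram_cycle_s  # number of MSRs the RAM can feed

print(round(tf * 1e9), round(p_optimal))   # 360 5
```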

3 Organization of a Serial Multiport Memory

A simple association of SMMCs gives the organization presented in fig. 2.

[figure: M SMMCs (RAM 0 … RAM M-1); each of the P rows of MSRs is connected through a set of M serial links (1st, 2nd, … pth set) to a set of M buffers (DSR0 … DSRM-1), Buffers Set 0 … Buffers Set p-1]

Fig. 2. Organization of a Serial Multiport Memory

A row of MSRs is linked to M (the number of SMMCs) Destination Shift Registers (DSRs) by M Serial Links (SLs). These M DSRs act as buffers. The memory organization thus has P sets of M SLs and P sets of buffers of M words each. We will not discuss these sets of buffers further, but simply note that the memory is able to provide P sets of M words in each emptying time interval (tf) of an MSR. The memory bandwidth obtained is: Memory Bandwidth = MB = P×M/tf.

Example: using the same values as in § 2 with M = 100, MB = 1.25 Gigawords/s. We will now discuss the potential benefit of serial multiport memory in comparison with classical memory.
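As a quick numerical check of the bandwidth formula MB = P×M/tf with the values above (our arithmetic, not from the paper):

```python
P = 5                 # MSRs per component
M = 100               # number of SMMCs
tf = 400e-9           # MSR emptying time in seconds

mb_words_per_s = P * M / tf
print(mb_words_per_s / 1e9)   # 1.25 Gigawords/s
```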

4 Memory Organization of the Cray-2

The Cray-2 memory is divided into four quadrants [8]. Each quadrant consists of 32 memory banks, each of them divided into 8 memory planes. The vector unit has 8 registers of 64 words, i.e. 512 words of 64 bits. Each quadrant moves 128 words to the vector registers in 128 clock cycles; with a clock cycle of about 4 ns (4.1 ns nominally), roughly 512 ns are needed. The quadrant is organized so that it is requested every clock cycle and each memory bank every 4 clock cycles; in other words, the quadrant functions at a frequency of 250 MHz. This complex interleaving avoids access conflicts on a bank, but the 256-Megaword size requires 128×8×64 = 65536 components of 256 Kbits. As each of the 8 memory planes in a bank has a data bus of 64 wires, a total of 64×8×32×4 = 65536 wires is needed. This explains the difficulty of scaling this organization.
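The component count quoted above can be verified directly from the stated capacity and part size (a small check of our own):

```python
# 256 Megawords of 64 bits built from 256-Kbit components.
capacity_bits = 256 * 2**20 * 64   # total memory size in bits
component_bits = 256 * 2**10       # one 256-Kbit component

components = capacity_bits // component_bits
print(components)                  # 65536, i.e. 128 banks x 8 planes x 64 parts
```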

5 Comparison of Cray-2 Memory and the SMM

The memory quadrant is synchronous and uses complex interleaving because of the large number of bus wires, supplying 128 words in 512 ns. Once the first word has been obtained from the first bank, new data becomes available every 4 clock cycles. Figure 3 illustrates the general timing.

Fig. 3. Chronogram of memory access to one quadrant of Cray-2
[figure: the 32 banks are accessed in staggered fashion (access 1, 2, 3, …, 32, 33, …, 65), one access every 4 cycles; the output data bus delivers 64 words every 64 cycles]

Fig. 4. Chronogram of memory access to the serial multiport memory
[figure: the M components (comp. 1 … comp. M) are accessed in parallel; each memory cycle delivers M words simultaneously on the output data bus]

Let us now reconsider the memory organization of fig. 2 with a shift frequency of 200 MHz and a memory access time of 80 ns, thus P = 5 MSRs. Owing to the much reduced number of wires on the bus, simple interleaving is used. With M = 100 SMMCs we obtain 5 sets of buffers of 100 words, and a memory bandwidth of 1.25 Gigawords/s, against 1 Gigaword/s for the Cray-2 (fig. 4). The SMMC is considered here from a logical point of view. Its realization, using the same size and components as the Cray-2 memory, would necessitate the same number of such components, but with a much simpler interface: a total number of wires M×P = 500. We can now compare both organizations in the following table:

Cray-2                                          Serial Multiport Memory
Complex interleaving                            Simple interleaving
Quadrant has a transfer frequency of 250 MHz    MSR has a transfer frequency of 200 MHz
Access is synchronous                           Access is synchronous or asynchronous
128 words require 512 ns per quadrant,          P×M words in 400 ns; with M = 200, P = 5:
so 512 words in 512 ns                          1000 words in 400 ns
Incrementation is complex and limited           Incrementation is easy and large
4 buses of 64 bits                              P×M serial links (1 serial link = 1 bus)
4 parallel quadrants                            P groups of pipeline registers (1 group = M words)
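The peak rates in the comparison can be checked from the table's own figures (our arithmetic; note the table uses M = 200, whereas the earlier example used M = 100):

```python
cray2_bw = 512 / 512e-9          # 512 words every 512 ns, all four quadrants
smm_bw = (5 * 200) / 400e-9      # P = 5, M = 200: 1000 words every 400 ns

print(cray2_bw / 1e9, smm_bw / 1e9)   # 1.0 vs 2.5 Gigawords/s
```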

6 Connection of a Pipeline Operator to a SMM

In the very general case of a classical dyadic operation, the pipeline operator of a vector processor has two input pipelines feeding it (the operands of the operator) and an output pipeline receiving the results. Let us leave aside the data alignment network for the moment. The architecture proposed in fig. 5 allows the operator to be fed continuously, since each group of DSRs corresponding to a row of MSRs is organized as a set of buffers and the data is correctly aligned. It shows the interconnection of M SMMCs to a pipeline operator, each SMMC being equipped with 3 MSRs.

[figure: M SMMCs (RAM 0 … RAM M-1) feed buffer pairs (A0, B0) and (A1, B1) through the Data Alignment Network to the OPERATOR, with delayed result buffers (R0, R1); supported storage modes: vector storage (stride = 1), skewed storage (δ1 = 1, δ2 = 1), scrambled storage]

Fig. 5. Connection of a simple Operator Pipeline and Data Alignment Network to SMM

In addition to the operator, fig. 5 shows two sets of buffers, (A0, B0) and (A1, B1), working in flip-flop mode to feed the operator continuously, and one set (R0, R1) at the output of the operator to receive the results. They work as follows: (Ai, Bi) → Ri, the index i being taken modulo 2. This means that while the operands contained in (Ai, Bi) are being processed, (Ai+1, Bi+1) are being filled, and vice versa. The association of a large number of SMMCs allows the number of buses to be increased without requiring a large number of wires. As a result we can envisage the simultaneous exchange of N×N words, a considerable advantage for matrix calculation and representation. If M is a power of 2, address calculation is simpler and faster than for a prime number of classical memory banks [9]. In order to enlarge the access capability of the SMMC, we attach an address generator/controller to each of them, whose role is to compute addresses and control access to the component; the components can then be addressed individually, in groups, or globally (using the same or different strides). A memory access yields M words simultaneously. These M words are then transferred through the SLs to the corresponding set of buffers, in an order which is not necessarily the expected one. An original circuit for data realignment has been designed to retrieve the original order (fig. 5); it does not permute the operands during transfer, but deduces the required order from a pre-established one before data is driven to the operator.
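The flip-flop buffer discipline described above can be sketched as follows; the function and names are illustrative, and this is a sequential simulation of the role-swapping, not of the true hardware overlap:

```python
def flip_flop_feed(chunks_a, chunks_b, op):
    """Alternate two buffer pairs (A0,B0)/(A1,B1): while one pair feeds
    the operator, the other is refilled from memory (simulated in turn)."""
    pairs = [None, None]            # the two (Ai, Bi) buffer pairs
    results = []                    # what the (R0, R1) result buffers collect
    i = 0
    for a, b in zip(chunks_a, chunks_b):
        pairs[i] = (a, b)                        # fill pair i from memory
        av, bv = pairs[i]                        # pair i drives the operator
        results.append([op(x, y) for x, y in zip(av, bv)])
        i = (i + 1) % 2                          # swap the roles of the pairs
    return results

out = flip_flop_feed([[1, 2], [3, 4]], [[10, 20], [30, 40]], lambda x, y: x + y)
print(out)   # [[11, 22], [33, 44]]
```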

7 Data Alignment Network, Originality of the Approach

A set of buffers is composed of M serial inputs and one parallel 64-bit output (fig. 6). This approach to data realignment overcomes most of the problems posed by practical implementations of data alignment networks. The solution is very simple since we use serial links. Data is not rearranged during the transfer, but processed in an alignment system which takes into account the operands present in the buffer and imposes the required order (skewing and periodic access [10], scrambled storage [11]). There is no permutation during data transmission. The operand selection for the operator is performed by a shift register, a logical entity that is easy to implement. The network is modular and easily extensible: cascading shift registers is simple and does not require multiple stages. The system can provide periodic and scrambled permutations, and can be used for different data organizations. Two realignment networks (one at the input and one at the output of the operator) are unnecessary; a single network is sufficient. The network shown in fig. 6 is made up of two shift registers, SR-A and SR-B, of M bits each, and of a FIFO queue organized in words of M bits. The signals S0, S1, …, SM-1 give the output order of the operand buffers. At a given time only one signal Si is active (i.e. only one bit is at level 1, for example at the output of SR-B).

Fig. 6. Data Alignment Network Connected to a Buffer
[figure: M serial links feed M shift registers (the set of buffers); SR-A and SR-B generate the selection signals S0 … SM-1, which drive 3-state commands onto the 64-bit output bus; a FIFO queue handles scrambled orders]

- The alignment of operands stored in memory following vector storage (the vector element of order i is associated with the component of address i mod M) is done by SR-B, whose single "1" bit at the output is shifted at each clock cycle.
- The alignment of operands stored in memory following skewed storage (δ1 = 1, δ2 = 1, Lawrie's principle [10]) is done through the two shift registers SR-A and SR-B, which together simulate the working of two nested loops.
- The alignment of operands stored in memory following scrambled storage is realized by the FIFO queue. If the operator is four times as fast as the access to the FIFO, then four FIFO queues have to be put in parallel, with consecutive orders arranged horizontally across them; this orthogonal arrangement of the orders, known as memory interleaving, speeds up a single-port FIFO system.
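The first two alignment modes above can be illustrated in a few lines. This is our own simulation, not the paper's circuit: the one-hot rotation models SR-B for vector storage, and the bank function is the standard Lawrie skewing formula assumed (not quoted) from [10]:

```python
def srb_order(M, steps):
    """Vector storage (stride 1): SR-B rotates a single 1 bit, so buffer
    i mod M drives the output bus at clock i."""
    srb = [1] + [0] * (M - 1)          # one-hot register, bit 0 active first
    order = []
    for _ in range(steps):
        order.append(srb.index(1))     # which buffer is selected this clock
        srb = [srb[-1]] + srb[:-1]     # circular shift of the one-hot bit
    return order

def skewed_bank(i, j, M, d1=1, d2=1):
    """Skewed storage (d1 = d2 = 1): element (i, j) goes to bank
    (i + j) mod M, so rows and columns each touch every bank once."""
    return (d1 * i + d2 * j) % M

print(srb_order(4, 6))                              # [0, 1, 2, 3, 0, 1]
row = [skewed_bank(2, j, 5) for j in range(5)]      # banks hit by one row
col = [skewed_bank(i, 3, 5) for i in range(5)]      # banks hit by one column
print(sorted(row) == list(range(5)), sorted(col) == list(range(5)))
```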

8 Conclusion

Serial multiport memories are used in an MIMD shared-memory architecture project [12], which was evaluated in our laboratory and for which a prototype is already under way. Using the same principles, we plan to use serial multiport memory in a vector processor context. The proposed VEC-SM2 architecture is representative of two architecture families, those of Cray and Cyber. VEC-SM2 feeds its pipeline operators from registers and thus uses the same philosophy as the Cray, except that in VEC-SM2 the vector registers are managed by hardware and are transparent to the user, feeding being done from memory; in this way it also follows the philosophy of the Cyber. The large number of simultaneous accesses to the memory, the short memory busy time, and the use of buffers in flip-flop mode guarantee a continuous data flow to the pipeline operator, consequently allowing it to compute at its maximum rate. The transmission speeds used in the previous examples can easily be realized with current technology; faster speeds can be obtained with more elaborate technology.

References

1. Kenneth E. Batcher, "Design of a Massively Parallel Processor", pp. 104-108, in Tutorial: Supercomputers, Design and Applications, Kai Hwang (ed.), Computer Society Press, 1984.
2. Burton J. Smith, "Architecture and Applications of the HEP Multiprocessor Computer System", pp. 231-238, in Tutorial: Supercomputers, Design and Applications, Kai Hwang (ed.), Computer Society Press, 1984.
3. Tse-Yun Feng, "A Survey of Interconnection Networks", pp. 109-124, in Tutorial: Supercomputers, Design and Applications, Kai Hwang (ed.), Computer Society Press, 1984.
4. Kai Hwang and Fayé A. Briggs, "Computer Architecture and Parallel Processing", pp. 145-320, McGraw-Hill Book Company, New York, 1984.
5. "Supercomputers, Class VI Systems, Hardware and Software", pp. 1-168, Elsevier Science Publishers B.V., North-Holland, 1986.
6. Clifford N. Arnold, "Vector Optimisation on the Cyber 205", pp. 179-185, in Tutorial: Supercomputers, Design and Applications, Kai Hwang (ed.), Computer Society Press, 1984.
7. Howard Jay Siegel, "Interconnection Networks for Large Scale Parallel Processing: Theories and Case Studies", pp. 35-173, D.C. Heath and Company, Massachusetts, 1985.
8. R.W. Hockney and C.R. Jesshope, "Parallel Computers: Architecture, Programming and Algorithms", 2nd ed., pp. 82-205, Adam Hilger, Bristol, Great Britain, 1988.
9. Duncan H. Lawrie and Chandra R. Vora, "The Prime Memory System for Array Access", pp. 435-442, IEEE Transactions on Computers, Vol. C-31, No. 5, May 1982.
10. Duncan H. Lawrie, "Access and Alignment of Data in an Array Processor", pp. 1145-1154, IEEE Transactions on Computers, Vol. C-24, No. 12, December 1975.
11. De-lei Lee, "Scrambled Storage for Parallel Memory Systems", pp. 232-239, Computer Architecture News, Vol. 16, No. 2, May 1988.
12. D. Litaize, A. Mzoughi, C. Rochange, P. Sainrat, "Towards a Shared-Memory Massively Parallel Multiprocessor", pp. 70-79, The 19th ISCA, Vol. 20, No. 2, May 1992.
