NnSP: Embedded Neural Networks Stream Processor




Hadi Esmaeilzadeh, Farhang Farzan, Neda Shahidi, S. M. Fakhraie, Caro Lucas
Department of Electrical and Computer Engineering, University of Tehran, Tehran, Iran
[email protected], [email protected], [email protected], [email protected], [email protected]

M. Tehranipoor
Department of Computer Science and Electrical Engineering, University of Maryland Baltimore County, Baltimore, USA
[email protected]

Abstract—Exploiting neural networks' native parallelism and interaction locality, dedicated parallel hardware implementation of neural networks is essential for their effective use in time-critical applications. The architecture proposed in this paper is a parallel stream processor called Neural Networks Stream Processor, or NnSP, which can be programmed to realize different neural-network topologies and architectures. NnSP is a collection of programmable processing engines organized in a custom FIFO-based cache architecture and busing system. Streams of synaptic data flow through the parallel processing elements, and computations are performed based on the instructions embedded in the preambles of the data streams. The command and configuration words embedded in the preamble of a stream program each processing element to perform a desired computation on the upcoming data. The packetized nature of the stream architecture brings a high degree of flexibility and scalability to NnSP. The stream processor is synthesized targeting an ASIC standard-cell library for SoC implementation and is also realized on Xilinx Virtex-II Pro SoPC beds. A neural network employed for mobile robot navigation control is implemented on the realized SoPC hardware, and the resulting speedup achievements are presented.


Keywords—Neural networks; stream processors; parallel processing; SoC implementation

I. INTRODUCTION

Neural networks are employed in various areas, but their effective use in real-world applications requires efficient hardware implementations. Compared to analog implementations, digital realizations of neural networks can provide advantageous features such as dynamic range, accuracy, modularity, scalability, and programmability. From the architecture-design view, digital implementations of neural networks can be classified into three general categories: custom implementations [1][2], systolic-based implementations, and SIMD/MIMD-based implementations [3][4]. Custom and systolic implementations benefit from high performance but suffer from inflexibility. Programmable implementations such as MIMD/SIMD-based implementations offer more flexibility, but they cannot achieve the performance of a well-designed custom realization. The architecture proposed in this work is a programmable stream processor for neural networks. The central idea of stream processing is to organize the computations of an application into streams of data, an idea recently employed in multimedia applications [5], [6].

Streams contain a set of data elements of the same type. Exploiting neural networks' inherent parallelism and interaction locality, NnSP, the neural networks stream processor, makes the main idea of stream processing feasible for their realization. The main challenge of stream processing is mapping the application onto streams of data, which is resolved in this work with the proposed architecture and data structure for neural computations. NnSP is a programmable stream processor designed to be flexible enough to realize various neural networks. At the same time, while preserving this flexibility, it has high computational power due to its parallel processing architecture. Since computations are mapped to data streams, a small NnSP processor can perform the computations of a large neural network. NnSP contains a set of parallel processing engines connected by an O(n) bus architecture, which is likewise controlled by data streams; it is therefore scalable in terms of processing elements. It is also possible to pipeline a number of NnSP processors to handle larger neural networks. NnSP is designed with online reconfigurability, enabling it to realize cooperating neural networks in a single framework. A custom FIFO-based caching architecture, in conjunction with a pre-fetching mechanism implemented in the busing system, fills the speed gap between the memory and the processing elements and enhances parallelism in NnSP. In addition to the pre-fetching mechanism, fetching is performed in bursts so that the memory bandwidth is utilized at its highest performance. From another point of view, a small NnSP is a complete neural-network processor, and therefore its architecture is appropriate for embedded applications that require intelligent processing.


Section II discusses the details of the NnSP architecture and the procedure for streaming neural-network computations. The use of NnSP for mobile robot navigation control and its SoC and SoPC implementations are explained in Section III. Finally, the paper is concluded in Section IV.


Figure 1. Overall NnSP architecture, its PUs and PEs. [Figure: block diagram of SDRAM banks and a bus arbiter connected through input and output caches to the processing units; each processing unit contains a FIFO-based cache and several PEs, and each PE comprises a PE controller, a computation core, and a configuration unit.]


II. ARCHITECTURE, STREAMING, AND RECONFIGURABILITY

Neurons of a neural network are connected through synaptic links, each of which carries the outgoing value of one neuron to another. Each synaptic link is associated with parameters indicating the importance of the value flowing through that link. Neurons of a network generally compute their output values by performing a weighted-sum operation on their inputs. One important issue from the implementation and realization point of view is that the number and values of the synaptic weights of each neuron are what distinguish that neuron from the others.

NnSP is a collection of processing elements, each of which can be programmed to carry out the computations of different neurons; PEs are not exclusively committed to a specific neuron. Indeed, after performing the computations of one neuron, a PE is reprogrammed to accomplish the computations of another. The programmability of NnSP implies that it can be used to realize various neural networks with different topologies and different neuron counts. Unlike implementations in which each neuron has its own dedicated processing unit [3], NnSP's design scheme enhances the scalability and reusability of the implementation.

Synaptic weights of each neuron form a stream of data that is stored in memory and flows through a processing element. These streams are called synaptic data streams, and each of them has a unique target processing element onto which the neuron is mapped. A synaptic data stream has a preamble that programs the target PE to act as the mapped neuron. The payload of the stream is a set of synaptic data words, the synaptic weights of that neuron. Synaptic data words are fetched from memory and delivered to the neuron's allotted processing element. Since external memories are slower than processing units, caches are employed to alleviate the speed mismatch between memories and processors.

To accomplish each synaptic computation, the associated input value must be delivered as well. It is obtained from a neighboring neuron, which might have been mapped onto another processing element. A special local communication policy and busing architecture are designed for NnSP to carry the local data exchanged between processing elements representing virtual neurons.
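As a concrete point of reference for what each virtual neuron computes as its stream flows through a PE, the following C++ sketch shows the weighted-sum operation; it is illustrative only, and the activation function is left to the caller since NnSP supports several neuron functionalities.

```cpp
#include <cstddef>
#include <vector>

// Illustrative only: the weighted-sum operation a virtual neuron performs
// when its synaptic data stream flows through a PE. Weights arrive from
// memory as a stream; inputs arrive over the local bus from other PEs.
double neuron_output(const std::vector<double>& weights,
                     const std::vector<double>& inputs,
                     double (*activation)(double)) {
    double sum = 0.0;
    for (std::size_t i = 0; i < weights.size(); ++i)
        sum += weights[i] * inputs[i];   // one synaptic computation per weight
    return activation(sum);              // e.g. a sigmoid for an MLP neuron
}
```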

A. Streaming and Data Flow Mechanism

Synaptic weights of a neuron are fetched from memory as synaptic data packets, while the synaptic input values are exchanged between PEs. Each synaptic data packet has a header section that identifies the target processing element to which the packet must be delivered. Put simply, the packet header includes the ID number of the target PE, which uniquely identifies that processing element within the stream processor.


Prior to the synaptic data packets, a PE configuration packet is fetched from memory. The PE configuration packet contains information about the neuron that is mapped to the PE: the functionality of the neuron, the number of synaptic packets the PE will receive (the number of the neuron's inputs), and the number of read operations that will occur on its output value. In effect, the PE configuration packet programs the PE for the upcoming flow of synaptic packets. Like the synaptic data packets, a PE configuration packet contains a header that identifies its target PE.
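The paper does not give the exact packet encoding, so the following C++ structs are a hypothetical layout that captures only the fields described above (the routing header with the target PE ID, and the configuration fields for functionality, input count, and read count); the field widths are assumptions.

```cpp
#include <cstdint>

// Hypothetical packet layouts for illustration; the roles of the fields
// come from the text, but their widths and encoding are assumptions.
struct SynapticDataPacket {
    uint8_t  target_pe_id;    // header: which PE the packet is routed to
    int32_t  weight;          // payload: one fixed-point synaptic weight
};

struct PEConfigPacket {
    uint8_t  target_pe_id;    // header: same routing scheme as data packets
    uint8_t  neuron_function; // e.g. MAC for MLP, RBF kernel, CSF, ...
    uint16_t num_synapses;    // how many synaptic packets will follow
    uint16_t num_reads;       // how many times the output will be read
};
```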


The PE configuration packet and its associated synaptic data packets build up a synaptic data stream that flows through a processing element and realizes a neuron on that processing element.

B. PE Arrangement, Caching, and Local Communications

As depicted in Figure 1, a number of PEs associated with a linear FIFO-based cache construct a processing unit (PU). A memory and a memory interface unit are associated with each processing unit.


Synaptic data streams of each PU are stored in its memory, and the memory interface unit fetches the streams consecutively, starting from a specific location of the memory. This sequential memory-access scheme leads to burst data transfers from the memory to the stream processor and utilizes the memory bandwidth optimally.

NnSP has a parallel architecture, and effective utilization of its parallel processing elements requires a high input data throughput that cannot be reached using a single memory bank. Using a number of smaller memory banks instead of one large memory is more effective, provided the data can be distributed over several memory banks. Synaptic data streams are independent and can be stored in different memory banks; this characteristic is inherited from the native parallelism of neural networks. Considering these two facts, a separate memory bank is allotted to each PU, which enhances NnSP's degree of parallelism.

1) NnSP's Caching Architecture: Each PU has a linear FIFO-based cache that is employed with two objectives: first, filling the speed gap between the processing units and off-chip memories, and second, providing a routing mechanism for incoming data packets.

As shown in Figure 1, starting from a pre-specified location, the memory interface unit of each PU fetches data streams from its memory and fills the cache consecutively and without header decoding. Using this caching mechanism, data packets are pre-fetched so that they are ready for further processing. This scheme exploits the spatial locality of synaptic data streams to fill the speed gap between the processor and the memory; the spatial locality is inherited from the locality of interactions in neural networks. When a data packet reaches the head of the cache, its header, which contains the target PE ID, is decoded, and based on the header information the packet is sent to the appropriate PE in the processing unit.

Since the flow of synaptic data packets is always unidirectional, from memory to PEs, and a PE may be busy when the cache wants to send a packet to it, a small FIFO is implemented inside each PE into which the cache writes packets. This scheme reduces the blocking probability of the flow of data streams from the memory to the processing elements and improves the performance of the stream processor. In addition to the PU caches, two other caches are employed for primary inputs and outputs; they are directly connected to the internal local bus of NnSP. Inputs fetched from memory are filled into the input cache and sent to the PEs through the internal bus when necessary. Conversely, when a PE computes a primary output, it sends it to the output cache, from which it is written back to a pre-specified location in memory.
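A minimal software model of this caching behavior, with simple container types standing in for the actual RTL FIFOs, might look as follows; the names and interfaces are illustrative, not the hardware ports.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Sketch of the per-PU FIFO cache: packets are burst-filled without header
// decoding, and a header is decoded only when its packet reaches the head.
struct Packet { uint8_t target_pe_id; int32_t payload; };

class PUCache {
public:
    explicit PUCache(std::size_t capacity) : capacity_(capacity) {}

    // Memory-interface side: burst-fill from a consecutive memory region.
    void burst_fill(const std::vector<Packet>& burst) {
        for (const Packet& p : burst)
            if (fifo_.size() < capacity_) fifo_.push_back(p);
    }

    // PE side: decode the head packet's header and route it. Returns false
    // if the target PE's input FIFO is full (the stream blocks, not drops).
    template <typename RouteFn>
    bool dispatch_head(RouteFn route_to_pe) {
        if (fifo_.empty()) return false;
        if (!route_to_pe(fifo_.front().target_pe_id, fifo_.front())) return false;
        fifo_.pop_front();
        return true;
    }

private:
    std::size_t capacity_;
    std::deque<Packet> fifo_;
};
```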

2) Local Communications and Busing Architecture: For local communications between PEs, a common-bus architecture is employed. Each PE can write/read data to/from the bus but cannot initiate a write/read operation. There are two small FIFO buffers inside each PE for interfacing with the bus: one for outputs, and the other for incoming inputs from the bus. After completing all synaptic data computations for a neuron, the PE writes the output to its bus output FIFO buffer. When an output value reaches the head of the output buffer, the bus arbiter is notified that the output is ready to be sent over the bus. Moreover, prior to the synaptic data stream, a bus access configuration packet is sent to the bus arbiter informing it which PEs a specific PE must send its output to. The bus configuration packet contains a PE ID that specifies the PE that must write its output on the bus, and a write bit pattern specifying the group of PEs to whose bus input FIFOs the data must be written. Therefore, when a PE declares its output ready, the bus arbiter initiates a write operation for it with the write pattern specified in the bus configuration packet. The write pattern is a sequence of 0s and 1s that is applied to the write-enable signals of the PEs' input buffers.

There is a FIFO inside the bus arbiter for bus configuration packets. When the memory interface of a processing unit or the input cache detects a bus configuration packet, it writes the packet to the bus arbiter FIFO. The bus arbiter FIFO, together with the input and output FIFOs inside the PEs, reduces the blocking probability of writes over the bus and enhances the performance of the NnSP stream processor. The employed arbitration approach reduces the number of connections between the PEs from O(n²) to O(n), improving the scalability of the system.
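The arbitration flow above, including the configuration FIFO inside the arbiter and the write-pattern broadcast, can be sketched in C++ as follows; the container choices, the 32-PE ceiling, and the callback interface are assumptions made for illustration.

```cpp
#include <cstdint>
#include <queue>

// Sketch of the bus write mechanism: a bus configuration packet names the
// source PE and a bit pattern selecting destination PEs; when the source
// declares its output ready, the arbiter drives one write with that pattern.
struct BusConfig {
    uint8_t  source_pe_id;   // PE whose output goes on the bus
    uint32_t write_pattern;  // bit i set => PE i latches the value
};

struct BusArbiter {
    std::queue<BusConfig> config_fifo;  // FIFO inside the bus arbiter

    // Called when the head of a PE's output FIFO is declared ready.
    template <typename WriteFn>
    void on_output_ready(uint8_t pe_id, int32_t value, WriteFn write_to_pe) {
        if (config_fifo.empty() || config_fifo.front().source_pe_id != pe_id)
            return;  // no matching bus transaction scheduled yet
        const uint32_t pattern = config_fifo.front().write_pattern;
        config_fifo.pop();
        for (int i = 0; i < 32; ++i)
            if (pattern & (1u << i))
                write_to_pe(i, value);  // acts as PE i's input-FIFO write enable
    }
};
```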

C. Design Reconfigurability and NnSP Builder

The code-level (compile-time) reconfigurable NnSP stream processor is implemented in Verilog HDL, and the implementation is parameterized in terms of:

• Number of processing units
• Length of the processing unit cache
• Number of PEs per processing unit
• PE cache interface FIFO length
• PE bus input interface FIFO length
• PE bus output interface FIFO length
• Input cache length
• Output cache length
• Bus arbiter FIFO length
• Bit width of arithmetic operations
• Precision of arithmetic operations

The processing elements are also implemented such that their arithmetic core can be replaced with another arithmetic core with the same interface. Several soft cores are designed to cover the requirements of different neural networks, including an MLP core (MAC), an RBFN [7] core, a CSFN [8] core, etc. The arithmetic operations are fixed-point operations whose width and precision can be configured through the available parameters. This design reconfigurability and its controllability from the system level, in conjunction with NnSP's programmable nature, enable it to be reused in different applications with various demands and restrictions.
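The swappable-core idea can be pictured as a common interface that every soft core implements; the C++ interface below merely stands in for the RTL port list, which the paper does not detail, and only the MAC (MLP) core is shown.

```cpp
#include <cstdint>

// Illustrative "same interface, swappable core" contract: every arithmetic
// core consumes one (weight, input) pair per step and exposes the neuron's
// pre-activation value on demand.
class ArithmeticCore {
public:
    virtual ~ArithmeticCore() = default;
    virtual void reset() = 0;
    virtual void step(int32_t weight, int32_t input) = 0;  // one synaptic packet
    virtual int32_t result() const = 0;                    // pre-activation value
};

// MAC core used for MLP neurons: accumulates weight * input products.
class MacCore : public ArithmeticCore {
public:
    void reset() override { acc_ = 0; }
    void step(int32_t weight, int32_t input) override {
        acc_ += static_cast<int64_t>(weight) * input;  // fixed-point MAC
    }
    int32_t result() const override {
        // Rescale: a QN.F x QN.F product carries 2*F fractional bits.
        return static_cast<int32_t>(acc_ >> kFracBits);
    }
private:
    static constexpr int kFracBits = 6;  // fractional width chosen in Sec. III.B
    int64_t acc_ = 0;
};
```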


To facilitate the code-level reconfigurability of the NnSP stream processor, an NnSP Builder application is developed which takes the intended parameters of the NnSP and builds the corresponding synthesizable Verilog code, ready for realization on SoC or SoPC beds.
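The paper does not describe the Builder's input or output format, but conceptually it maps a parameter set like the list in Section II.C onto Verilog parameter definitions. A minimal sketch of that mapping follows; the defaults are hypothetical (chosen to match the 4-PE instance of Section III), and the 12-bit data width assumes a sign bit plus the 5 integer and 6 fractional bits selected there.

```cpp
#include <fstream>

// Conceptual sketch of the NnSP Builder: a parameter set is turned into
// Verilog `parameter` definitions that generic RTL could include. The real
// Builder's interface is not described in the paper.
struct NnspParams {
    int num_pus = 2, pes_per_pu = 2;              // 2x2 => the 4-PE NnSP
    int pu_cache_len = 64, pe_cache_fifo_len = 4;
    int pe_in_fifo_len = 4, pe_out_fifo_len = 4;
    int in_cache_len = 32, out_cache_len = 32;
    int arbiter_fifo_len = 8;
    int data_width = 12, frac_bits = 6;           // fixed-point format
};

void emit_verilog_header(const NnspParams& p, const char* path) {
    std::ofstream v(path);
    v << "// generated NnSP configuration (subset of the parameter list)\n"
      << "parameter NUM_PUS          = " << p.num_pus          << ";\n"
      << "parameter PES_PER_PU       = " << p.pes_per_pu       << ";\n"
      << "parameter PU_CACHE_LEN     = " << p.pu_cache_len     << ";\n"
      << "parameter ARBITER_FIFO_LEN = " << p.arbiter_fifo_len << ";\n"
      << "parameter DATA_WIDTH       = " << p.data_width       << ";\n"
      << "parameter FRAC_BITS        = " << p.frac_bits        << ";\n";
}
```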


III. NNSP AT WORK

A. Mobile Robot Neural Controller

An MLP neural network employed for miniature mobile robot navigation control is realized on the NnSP stream processor. The Khepera [9] mobile robot is used, and the neural controller is implemented for collision-free navigation through a maze. The Khepera robot has 8 peripheral infrared sensors that estimate the distance to obstacles and measure ambient light. The motion of the robot is controlled via two DC motors, one driving each wheel; the robot can turn in any direction by applying different speeds to each wheel. Given the definition of this navigation application, the numbers of inputs and outputs are fixed, but the number of hidden neurons is determined through simulation. The Khepera simulator package [10] is used for training the neural controller. First, a deterministic if-then-else-based controller is developed; a log of its inputs and outputs at work is then generated and used for training the neural controller. The minimum number of neurons in the hidden layer that enables the network to control the robot is 12. The training algorithm employed in the simulation is error back-propagation. Appropriate collision-free navigation is achieved after training for 1000 epochs over 6710 training samples.
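For reference, a forward pass of such a controller can be sketched as below, assuming the natural topology of 8 sensor inputs, the 12 hidden neurons found in simulation, and 2 outputs (one per wheel motor); the sigmoid activation is an assumption, since the paper does not name the activation function.

```cpp
#include <array>
#include <cmath>

// One dense layer with bias; the weight matrix carries the bias in the
// last column (w[j][In]). Sigmoid activation is assumed for illustration.
template <int In, int Out>
std::array<double, Out> layer(const std::array<double, In>& x,
                              const std::array<std::array<double, In + 1>, Out>& w) {
    std::array<double, Out> y{};
    for (int j = 0; j < Out; ++j) {
        double s = w[j][In];                       // bias term
        for (int i = 0; i < In; ++i) s += w[j][i] * x[i];
        y[j] = 1.0 / (1.0 + std::exp(-s));         // sigmoid activation
    }
    return y;
}

// 8 infrared sensor readings in, 2 wheel-speed commands out.
std::array<double, 2> wheel_speeds(const std::array<double, 8>& sensors,
                                   const std::array<std::array<double, 9>, 12>& w_hidden,
                                   const std::array<std::array<double, 13>, 2>& w_out) {
    return layer<12, 2>(layer<8, 12>(sensors, w_hidden), w_out);
}
```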

B. System Level Simulation

System-level simulation is intended to account for hardware limitations that might influence the functionality of the system. The employed network is trained using C++ code in which all parameters and variables of the neural network are defined in double-precision format. Moving toward SoC or SoPC realization, the problem is how to represent the neural-network parameters in the corresponding hardware. The parameters must be converted to a fixed-point format, but the precision of the format must be adjusted so that the functionality of the system remains intact. For fixed-point precision adjustment, the neural network is implemented at the algorithmic level with fixed-point variables and operations, and the outputs of the fixed-point network are compared to those of the double-precision network. Setting the acceptable average error below 5%, the integer-part and fractional-part bit widths are chosen as 5 bits and 6 bits, respectively.
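The precision-adjustment experiment can be reproduced in outline with a quantizer for the chosen 5-bit integer / 6-bit fractional format and an average-error comparison against the double-precision reference; the saturation and rounding behavior shown here are assumptions.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Quantize trained double-precision parameters to the fixed-point format
// settled on in the text: 5 integer bits and 6 fractional bits.
constexpr int kIntBits = 5, kFracBits = 6;

int32_t to_fixed(double x) {
    const double max_val = (1 << kIntBits) - std::pow(2.0, -kFracBits);
    if (x >  max_val) x =  max_val;   // saturate on overflow (assumed policy)
    if (x < -max_val) x = -max_val;
    return static_cast<int32_t>(std::lround(x * (1 << kFracBits)));
}

double from_fixed(int32_t q) { return static_cast<double>(q) / (1 << kFracBits); }

// Average relative error between reference and fixed-point outputs;
// the paper requires this to stay below 5%.
double avg_error(const std::vector<double>& ref, const std::vector<double>& fx) {
    double e = 0.0;
    for (std::size_t i = 0; i < ref.size(); ++i)
        e += std::fabs(ref[i] - fx[i]) / (std::fabs(ref[i]) + 1e-12);
    return e / ref.size();
}
```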

C. SoC and SoPC Implementations

The results obtained from the system-level simulations, the architecture parameters (two processing units and two PEs per processing unit), and the FIFO and cache lengths are given to the NnSP Builder, and the synthesizable Verilog code of the 4-PE NnSP is obtained. The HDL description of NnSP is synthesized using a 0.5 µm standard-cell library. The overall area of the 4-PE NnSP is 126476 gates, and its clock frequency is 113.2 MHz.

In addition to the ASIC implementation, the 4-PE NnSP is synthesized for the Xilinx Virtex-II Pro 2VP30FF1152. The 4-PE NnSP occupies 3.72% of the logic cells, and its clock frequency is 62.5 MHz.

IV. CONCLUSION AND FUTURE WORK

NnSP, a programmable stream processor for the realization of neural networks with various topologies, was presented. The stream-flow mechanism of the processor was discussed, and the local communication approach of the architecture was explained. A neural network trained for miniature mobile robot navigation control was implemented using the proposed stream-processing architecture. System-level simulations were performed to account for the limitations of digital implementation. Finally, the SoC and SoPC realization results were presented, showing that NnSP can perform the computations of a neural network without restrictions on its topology. The caching architecture and memory-access mechanism employed in NnSP enhance its performance, and the system-level controllability and programmability of NnSP were evident in the implemented application. Employing the proposed architecture in a more dynamic application, such as autonomous robot control with online growth and learning, is our next step.

REFERENCES

[1] D. Roggen, S. Hofmann, Y. Thoma, and D. Floreano, "Hardware spiking neural network with run-time reconfigurable connectivity in an autonomous robot," in Proc. NASA/DoD Conference on Evolvable Hardware, 2003.
[2] A. Perez-Uribe, "Structure-adaptable digital neural networks," Ph.D. dissertation, Swiss Federal Institute of Technology-Lausanne, Lausanne, 1999.
[3] K. W. Przytula and V. K. Prasanna, Parallel Digital Implementations of Neural Networks. Englewood Cliffs, New Jersey: Prentice-Hall, 1993.
[4] S. M. Fakhraie and K. C. Smith, VLSI-Compatible Implementations for Artificial Neural Networks. Norwell, Massachusetts: Kluwer Academic Publishers, 1997.
[5] B. Khailany, W. Dally, U. Kapasi, P. Mattson, J. Namkoong, J. Owens, B. Towles, A. Chang, and S. Rixner, "Imagine: Media processing with streams," IEEE Micro, vol. 21, no. 2, pp. 35-46, March 2001.
[6] U. Kapasi, S. Rixner, W. Dally, B. Khailany, J. Ahn, P. Mattson, and J. Owens, "Programmable stream processors," IEEE Computer, vol. 36, no. 8, pp. 54-62, August 2003.
[7] J. C. Principe, N. R. Euliano, and W. C. Lefebvre, Neural and Adaptive Systems: Fundamentals through Simulation. New York, NY: John Wiley & Sons, 2000.
[8] G. Dorffner, "Unified framework for MLPs and RBFNs: Introducing conic section function networks," Cybernetics and Systems: An International Journal, vol. 25, pp. 511-554, 1994.
[9] K-Team Khepera mobile robot, 2005. [Online]. Available: www.k-team.com
[10] O. Michel, Khepera Simulator Package version 2.0: freeware mobile robot simulator written at the University of Nice Sophia-Antipolis, 2005. [Online]. Available: http://wwwi3s.unice.fr/~om/khep-sim.html


