Ultra-Low Power Nanomagnet-Based Computing: A System-Level Perspective



IEEE TRANSACTIONS ON NANOTECHNOLOGY, VOL. 10, NO. 4, JULY 2011

Charles Augustine, Xuanyao Fong, Behtash Behin-Aein, and Kaushik Roy, Fellow, IEEE

Abstract—MOSFET scaling is facing overwhelming challenges with increased parameter variations, exponentially higher leakage current, and higher power density. Thus, researchers have started looking at alternative switching devices and spintronics-based computing paradigms. Nanomagnet-based computing is one such paradigm, with intrinsic switching energy close to thermal limits and scalability down to 5 nm. In this paper, we explore the possibility of nanomagnet-based design using nonmajority gates. The design approach can offer significant area, delay, and energy advantages compared to majority-gate-based designs. Moreover, new clock technologies and architectures are developed to improve the computation robustness and power dissipation of nanomagnet systems. We also developed a comprehensive device/circuit/system compatible simulation framework to evaluate the functionality and architecture of a nanomagnet system, and conducted a feasibility/comparison study to determine the effectiveness of the technology compared to standard digital electronics. Performance results from a nanomagnet-based 16-point discrete cosine transform (DCT) with enhanced clock architecture, narrow gap cladding of nanomagnets, or embedding nanomagnets in a solenoid with steel core, together with near neighbor system architecture, show up to 10× improvement over a subthreshold 15 nm CMOS (Vdd = 90 mV) design, using the “energy-delay^0.5-area product (ED^0.5A)” as the comparison metric. Finally, we explored the scalability of nanomagnets and the effectiveness of field-based switching.

Index Terms—Low power, nanomagnet, spintronics, systolic array architectures.

I. INTRODUCTION

MOSFET scaling, which aided the exponential growth of the semiconductor industry, is approaching its fundamental physical limits. Hence, research has started in earnest to explore next-generation “switches,” either in the charge domain or in the noncharge-based domain. Moreover, MOSFETs are associated with parametric variations and temporal reliability degradation, which have significant impact on circuit and system performance (delay, power, and area). Thus, alternative switching devices based on electron spin are being investigated to overcome the limitations associated with Si-MOSFET devices. Spin-based logic and its variants have been theoretically studied and experimentally demonstrated over the past 30 years as

Manuscript received November 23, 2009; revised June 19, 2010; accepted July 12, 2010. Date of publication September 23, 2010; date of current version July 8, 2011. This work was supported by the Nanoelectronics Research Initiative. The review of this paper was arranged by Associate Editor D. Litvinov. The authors are with the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907 USA (e-mail: [email protected]; [email protected]; [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNANO.2010.2079941

Fig. 1. (a) Energy profile of nanomagnet with zero H. (b) Energy profiles of nanomagnet with finite H.

a viable replacement for MOSFET [1]–[3]. Nanomagnet-based logic is one such computing paradigm, the viability of which has been shown both theoretically and experimentally [3]. The energy barrier of a nanomagnet is given by Ku2V, where Ku2 is the uniaxial anisotropy of the magnet and V is the volume of the magnet, as shown in Fig. 1(a). Thus, the energy barrier can be decreased by decreasing V. However, for a finite retention time of ten years, the magnet requires a minimum energy barrier of 40 KBT between the 0° and 180° states, where KB is the Boltzmann constant and T is the temperature [5]. Interestingly, nanomagnets within single-domain limits are characterized by collective switching behavior, where the entire magnet behaves as one giant spin [4]. For example, in a magnet with N electrons and an energy barrier of Ku2V, each electron contributes Ku2V/N to the energy barrier, and thus each individual electron needs only Ku2V/N of energy to go from one stable state to the other. Thus, the switching energy of the entire nanomagnet is 40 KBT, the same as that of one electron spin switching over a 40 KBT energy barrier. On the other hand, in charge-based devices, such as MOSFETs, switching energy is proportional to the total number of electrons (N) contributing to charging/discharging the capacitance. Thus, the total energy is N × 40 KBT with an isoenergy barrier of 40 KBT between the logic zero and one states [4]. The intrinsic energy required for switching one nanomagnet at room temperature can therefore theoretically be as small as hundreds of zeptojoules (40 KBT ≈ 160 × 10^−21 J at T = 300 K), and research is in progress to achieve this energy limit in practical nanomagnet circuits by reducing clock power dissipation [8]. Compared with the ITRS projections for energy dissipation in a double-gate MOSFET transistor at the 15-nm technology node, the nanomagnet offers an improvement of two orders of magnitude [7].
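As a quick check of the figures above, the following sketch evaluates a 40 KBT barrier at 300 K and contrasts it with the N × 40 KBT scaling of a charge-based switch. The function names and the electron count in the example are illustrative assumptions, not values taken from the paper.

```python
# Boltzmann constant in J/K (exact SI value).
K_B = 1.380649e-23


def barrier_energy_j(n_kbt: float = 40.0, temperature_k: float = 300.0) -> float:
    """Energy of an n_kbt * K_B * T barrier, in joules."""
    return n_kbt * K_B * temperature_k


def charge_switch_energy_j(n_electrons: int, n_kbt: float = 40.0,
                           temperature_k: float = 300.0) -> float:
    """Charge-based switch: each of the N electrons pays the full barrier."""
    return n_electrons * barrier_energy_j(n_kbt, temperature_k)


if __name__ == "__main__":
    e_magnet = barrier_energy_j()
    print(f"40 K_B T at 300 K = {e_magnet * 1e21:.0f} zJ")
    # Hypothetical electron count, chosen only to illustrate the scaling.
    ratio = charge_switch_energy_j(100) / e_magnet
    print(f"charge-based / nanomagnet energy for N = 100: {ratio:.0f}x")
```

The barrier evaluates to roughly 166 zJ, in line with the approximate 160 × 10^−21 J quoted in the text.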
Moreover, the nonvolatility of nanomagnets facilitates designs with zero leakage power. It has been shown experimentally that nanomagnets can be scaled down to 5 nm [6]. The aforementioned properties

1536-125X/$26.00 © 2010 IEEE

AUGUSTINE et al.: ULTRA-LOW POWER NANOMAGNET-BASED COMPUTING

of nanomagnets can enable the design of spin-based, ultra-low power, and scalable systems, and we explore this possibility in this paper.

Each magnet in nanomagnet logic is engineered to have an energy profile as shown in Fig. 1(a) under zero external magnetic field (H). This energy profile is symmetric, with up spin (0°) and down spin (180°) as the lowest energy states and neutral spin (90°) as the highest energy state. The two stable spin polarization (θ) states, 0° and 180°, are termed easy axes. The other state, 90°, has maximum energy and is called the hard axis. However, under the influence of a nonzero H from the surroundings, the nanomagnet energy profile transforms, as shown in Fig. 1(b). In the first case, 180° is energetically more favorable than 0°, as shown by the “red curve with square markers,” and a transition from 0° to 180° is more probable than a transition from 180° to 0°. In the second case, an energy profile favoring 0° is created by an H of opposite polarity, as shown by the “green curve with star markers.” These properties can be achieved by engineering the uniaxial anisotropy (Ku2) and volume (V) of the nanomagnet [8]. As mentioned earlier, Ku2V gives the energy barrier between the 0° and 180° states, and this barrier energy is directly related to the noise immunity and retention time of magnets [7].

Although basic nanomagnet logic implementations have been experimentally demonstrated, their adoption in commercial products has been hindered by the high energy dissipated by clock circuits (almost 99% of total system energy [8]) and by the lack of system design methodologies, system-level simulators, and efficient and reliable manufacturing techniques. However, with the current expertise in fabrication techniques for nanomagnets, there is a need to explore nanomagnet circuits and architectures and compare them to their MOSFET counterparts (CMOS).
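The energy profiles of Fig. 1 described above can be sketched with a simple single-domain model. This is a minimal illustration under stated assumptions: the uniaxial term is taken as (barrier) × sin²θ and the external field is added as a Zeeman term proportional to cos θ, both expressed in units of KBT. The model, the function names, and the field strength used in the example are assumptions for illustration and are not taken from the paper.

```python
import math


def energy_kbt(theta_deg: float, barrier_kbt: float = 40.0,
               zeeman_kbt: float = 0.0) -> float:
    """Energy (in units of K_B*T) of a single-domain magnet at angle theta.

    Assumed model: barrier * sin^2(theta) - zeeman * cos(theta).
    A positive zeeman_kbt favors 0 deg, a negative one favors 180 deg.
    """
    t = math.radians(theta_deg)
    return barrier_kbt * math.sin(t) ** 2 - zeeman_kbt * math.cos(t)


def barrier_seen_from(start_deg: float, barrier_kbt: float = 40.0,
                      zeeman_kbt: float = 0.0) -> float:
    """Barrier height seen from an easy-axis state (peak energy minus start)."""
    peak = max(energy_kbt(th, barrier_kbt, zeeman_kbt) for th in range(0, 181))
    return peak - energy_kbt(start_deg, barrier_kbt, zeeman_kbt)


if __name__ == "__main__":
    # Zero field: symmetric profile, same barrier from both easy-axis states.
    print(barrier_seen_from(0), barrier_seen_from(180))
    # Hypothetical field favoring 180 deg: the barrier out of 0 deg is lower.
    print(barrier_seen_from(0, zeeman_kbt=-5.0),
          barrier_seen_from(180, zeeman_kbt=-5.0))
```

With a field that favors 180°, the barrier seen from 0° is lower than the barrier seen from 180°, which is why the 0° to 180° transition becomes the more probable one.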
In order to evaluate the power-performance limits of nanomagnets, an integrated device/circuit/system simulation framework is needed, and we developed such a framework. We also present novel devices, circuits, and architectures, as well as new clock technologies, to overcome the performance limitations. The rest of the paper is organized as follows. Section II presents the design of a nonmajority logic gate using nanomagnets, which has energy, delay, and area advantages compared to a majority gate. Section III discusses in depth the techniques for lowering the energy dissipation associated with the external magnetic field clock HCLOCK, which dominates the total system energy. Section IV describes clock architectures for logic isolation, noise immunity, and lowering the delay penalty for both logic gates and interconnects in nanomagnet-based circuits. An integrated device/circuit/architecture simulation framework for nanomagnet system design is presented in Section V. Two design examples, ring oscillator (RO) circuits and a more complex 16-point discrete cosine transform (DCT), are implemented using the proposed design methodology. The performance of these nanomagnet circuits and a comparison with standard CMOS-based designs in 15 nm predictive technology (ultimate CMOS) are discussed in Section VI. Section VI also considers the scalability of nanomagnets under external field driven switching. Finally, Section VII concludes this paper.


Fig. 2. (a) Two-input majority gate. (b) Symmetric energy profile of nanomagnet with and without HCLOCK.

II. NONMAJORITY LOGIC GATE

The majority logic gate was proposed in [3] to realize logic functions using nanomagnets. As shown in Fig. 2(a), a simple two-input AND or OR majority gate needs, along with the two inputs, an extra input to switch the output magnet to the correct logic state. However, the extra input presents the following challenges for nanomagnet logic gates. First, accidental switching of the extra input during circuit operation results in erroneous computations, which then require complex online built-in self-test techniques [10] to identify and correct the errors. These techniques are costly in terms of performance, power, and area. Second, the extra input causes signal routing congestion by blocking the signal fan-out space, as shown in Fig. 2(a), resulting in higher area, power, and delay. Hence, in this paper, we propose a nonmajority gate that avoids the extra input requirement of the majority gate, as explained next.

Instead of an extra input bias, the nonmajority gate has an in-built bias in the OUT magnet, toward either 0° or 180°, which can be achieved by engineering the hard axis angle [11]. In conventional nanomagnet design, all nanomagnets have their easy and hard axes oriented in identical directions [e.g., easy axes (0°, 180°) and hard axis (90°)]. The symmetrical energy landscape of a conventional nanomagnet is shown in Fig. 2(b), before and after application of HCLOCK along 90° [8]. The magnitude of the HCLOCK field required in this case is approximately 75 Oe, and a detailed explanation of schemes to generate this field is provided in Section III. In the case of the nonmajority gate, only the OUT magnet hard axis is misaligned (δ ≠ 0°). For example, as shown in Fig. 3(b), the hard axis is misaligned to 76° instead of 90° (δ = −14°). This misalignment δ produces HBIAS, which can be approximated by

$$H_{\mathrm{BIAS}} = \pm \frac{K_{u2}}{M_S}\left[1 - \sin^2(90^\circ - \delta)\right] \tag{1}$$

where Ku2 is the uniaxial anisotropy and MS is the saturation magnetization of the nanomagnet. HBIAS is positive when δ > 0° and negative when δ < 0°. In the case where δ > 0°, when HCLOCK is applied, the magnet energy profile attains the shape indicated in Fig. 3(b). Next, let us analyze the operation of this logic gate under different logic inputs:


TABLE I DESIGN PARAMETERS FOR NANOMAGNET DESIGN

Fig. 3. (a) Two-input nonmajority gate. (b) Asymmetric energy profiles of nanomagnet after misalignment with and without HCLOCK.

Fig. 5. Design space for nanomagnet (with HDIP, HCLOCK, and retention time design constraints).

Fig. 4. Algorithm to compute the misalignment (δ) for nonmajority designs.

1) A = 0 (180°) and B = 0 (180°): in this case, due to the nonzero HDIP from the inputs (magnets A and B) in addition to the in-built bias (HBIAS), the OUT magnet goes to the logic 0 (180°) state.
2) A = 1 (0°) and B = 1 (0°): in this case, due to a net HDIP, the OUT magnet has sufficient energy to overcome the barrier (EBARRIER = E(90°) − E(90° + δ)), as shown in Fig. 3(b), and the OUT magnet settles to logic 1 (0°).
3) A = 1 (0°) and B = 0 (180°) or vice versa: in this case, HDIP is zero; however, due to HBIAS, the OUT magnet goes to logic 0 (180°).

From the truth table, we can identify this structure as an AND gate. Similarly, an OR gate can be realized using a δ of opposite polarity, and hence an HBIAS of opposite polarity. The earlier discussion clearly shows that the misalignment (δ) plays an important role in realizing logic functions such as AND or OR. The value of δ required for the nonmajority gate depends on factors such as Ku2, MS, and volume (V). We have developed a robust algorithm for determining the angle δ for nonmajority gates, as shown in Fig. 4. The input parameters that we have used for designing the nonmajority gate are shown in Table I. In a circuit, the gates need to be cascaded to perform a useful logic computation. Thus, concatenability of nonmajority gates is an important requirement:

Concatenability of nonmajority gates: The output state of a nonmajority gate is shifted from the equilibrium states (0° and 180°). For a δ shift of 14°, logic 1 is −14° and logic 0 is 166°. Due to this shift, the HDIP of the OUT magnet on the following magnet is decreased by 3% (1 − cos(14°)). However, our simulations indicate that this minor degradation of the field on neighboring magnets has negligible impact on the performance (delay and energy) of gates. Hence, nonmajority gates can be cascaded with other nonmajority and majority gates with negligible energy and delay overhead.

As mentioned earlier, the performance of nanomagnet logic depends on its Ku2, MS, and V. Hence, it is important to identify suitable materials that can be used for designing nonmajority/majority logic gates. Fig. 5 shows the combinations of Ku2 and MS that satisfy the following design constraints:
1) HCLOCK > 75 Oe and HCLOCK < 1000 Oe (the lower limit ensures that HCLOCK is larger than HDIP; the upper limit is due to power dissipation constraints, since a higher HCLOCK requires more power).
2) HDIP > 4 Oe and HDIP < 75 Oe (the lower limit is due to noise from stray fields; the upper limit ensures that HDIP cannot switch the OUT magnet without assistance from HCLOCK).
3) Ku2V > 40 KBT (thermal stability requirement).

On the graph, we have also plotted a list of ferromagnetic materials that satisfy (inside the blue region) or do not satisfy (inside the white region) these design constraints. So far, we have discussed the basics of the nonmajority gate and explained the roles played by HCLOCK and HDIP in logic design.
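The gate behavior and the design constraints discussed in this section can be summarized in a small sketch. It is a toy model, not the algorithm of Fig. 4: the OUT state is taken as the sign of a linear sum of equal dipolar contributions from A and B plus the in-built bias, and the sign convention, the bias magnitude (0.5 in units of the per-input HDIP), and all function names are assumptions made for illustration.

```python
import math


def out_state(a: int, b: int, h_bias: float, h_dip: float = 1.0) -> int:
    """Toy model of the OUT magnet: sign of the net field at clock release.

    Assumed convention: an input at logic 1 (0 deg) contributes +h_dip,
    an input at logic 0 (180 deg) contributes -h_dip, and a negative
    h_bias pulls OUT toward logic 0 (180 deg).
    """
    net = sum(h_dip if bit else -h_dip for bit in (a, b)) + h_bias
    return 1 if net > 0 else 0


def truth_table(h_bias: float) -> dict:
    """Truth table of the two-input gate for a given in-built bias."""
    return {(a, b): out_state(a, b, h_bias) for a in (0, 1) for b in (0, 1)}


def dipole_reduction(delta_deg: float) -> float:
    """Fractional loss of H_DIP on the next magnet: 1 - cos(delta)."""
    return 1.0 - math.cos(math.radians(delta_deg))


def in_design_space(h_clock_oe: float, h_dip_oe: float,
                    barrier_kbt: float) -> bool:
    """Check the three design constraints listed in the text."""
    return (75.0 < h_clock_oe < 1000.0
            and 4.0 < h_dip_oe < 75.0
            and barrier_kbt > 40.0)


if __name__ == "__main__":
    print("bias toward logic 0:", truth_table(-0.5))  # behaves as AND
    print("bias toward logic 1:", truth_table(+0.5))  # behaves as OR
    print(f"H_DIP loss at 14 deg: {dipole_reduction(14.0):.1%}")
```

In this model the bias only decides the tie cases (A ≠ B), so it has to be nonzero but weaker than the combined field of two agreeing inputs.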


Fig. 7. MTJs as HCLOCK generator in nanomagnet systems. (a) Parallel configuration: finite field. (b) Antiparallel case: zero field.

Fig. 6. Magnetic field (HCLOCK) produced by a finite length wire. In [13], the clock wire is shared among multiple nanomagnets.

TABLE II PARAMETERS FOR CLOCK WIRE AND CLADDING LAYERS

In the next section, we present a set of technologies that can be used to generate HCLOCK with lower energy.

III. CLOCK TECHNOLOGIES

In this section, we discuss four technologies to generate magnetic fields to be used as HCLOCK: 1) parallel wires; 2) spin-transfer torque magnetic tunneling junctions (STT-MTJs); 3) clock wire with narrow gap cladding (NGC); and 4) nanocoil or solenoid. Further, we discuss in detail the efficiency of each approach and identify the techniques that can be used for low-power nanomagnet designs.

A. Parallel Wires

Niemier et al. showed magnetic field generation using current (ICLOCK) through parallel resistive clock wires that are shared between multiple nanomagnets [12], as illustrated in Fig. 6. With this approach, nanomagnets between adjacent clock wires might be incorrectly clocked due to insufficient magnetic field and coupled magnetic noise from nearby stray nanomagnets, possibly resulting in propagation of wrong data. Furthermore, a large current is required to generate sufficient magnetic field strength to switch all the magnets shared by a clock wire. Results showed that the clock energy using this scheme can be 100× larger than the nanomagnet switching energy [8]. Hence, more power-efficient clock technologies are required to improve the performance of nanomagnet circuits.

B. Spin-Transfer Torque Magnetic Tunneling Junctions

By placing individual nanomagnets above the tunneling layer of STT-MTJs, the magnetic field from the MTJ can be used to clock the nanomagnets. Fig. 7 illustrates how the parallel or antiparallel arrangement of the hard and free layers of the MTJ is able to generate a magnetic field in the region above the tunneling layer. The STT-MTJ is switched by passing the critical switching current (IC), as given in [13]. With a given set of technology and material, 100 μA of current is needed to switch an MTJ with tox = 1.1 nm and a 10 × 15 nm² cross-section in 2 ns. A current source would use 60 fJ of energy to deliver this current. However, the spacing between the nanomagnet and the MTJ needs to be very small (∼1 nm) for the magnetic field to be large enough to clock the nanomagnets.

Fig. 8. Clock wire with NGC. Cladding focuses the magnetic field on nanomagnet and decreases the clock energy.

C. Clock Wire With NGC

As shown in Fig. 6, a wire carrying current (ICLOCK) creates the necessary HCLOCK, which is given by

$$H_{\mathrm{CLOCK}} = \frac{I_{\mathrm{CLOCK}}\,(\cos\theta + \cos\phi)}{4\pi\upsilon} \tag{2}$$

where ICLOCK is the current in the wire, and υ is the shortest distance between the wire and a point “d.” Note that θ and ϕ are the angles of inclination that the observation point “d” makes with respect to the left end and right end of the finite length wire. For an infinite length wire, both θ and ϕ are 0°. The dimensions of the clock wire used in the simulation are presented in Table II. In addition, we have modified the clock wire layout to include a ferromagnetic cladding region positioned on four sides of the wire, as shown in Fig. 8. The cladding region has high magnetic permeability for enhancement and concentration of the magnetic field on the nanomagnet [14], [17]. Due to the NGC field enhancement, we can obtain a higher HCLOCK, which is given by

$$H_{\mathrm{CLOCK}} = 2\,\frac{W}{g}\,\frac{I_{\mathrm{CLOCK}}\,(\cos\theta + \cos\phi)}{4\pi\upsilon} \tag{3}$$

where W is the width of the wire and “g” is the width of the trench


TABLE III RELATIVE PERMEABILITY OF DIFFERENT CORE MATERIALS AND ITS IMPACT ON ENERGY REQUIRED TO GENERATE HCLOCK

Fig. 9. Magnetic field produced by NGC clocking scheme with different “g” values.

Fig. 11. Interaction of nanomagnet with HPM material lowers its energy barrier, thus reducing retention time.

Fig. 10. Nanomagnet encompassed by a solenoid, which is the HCLOCK generator. HPM in nanocoil decreases the HCLOCK energy requirement.

region. Thus, for a fixed clock wire width, HCLOCK can be improved by decreasing the trench width “g.” However, HCLOCK cannot be enhanced indefinitely by decreasing “g.” As “g” approaches 9 nm, HCLOCK reaches a maximum value and then decreases as “g” reaches 0 nm. Fig. 9 shows a plot of the normalized magnetic field (normalized w.r.t. g = 0 nm) for different “g” values. The dimensions of the cladding regions and the trench are also given in Table II. The entire clock layout is implemented in a Maxwell simulator [15], and results indicate that the static magnetic field on the switching magnets has increased, as expected, to 75 Oe, which is sufficient to switch the magnet. Stray fields on nearby nonswitching magnets are much smaller than the required HCLOCK, in this particular case 19× lower than 75 Oe (considering 40 nm spacing between the magnets). This stray field does not affect the state of neighboring magnets, which makes this scheme immune to clock noise.

D. Nanocoils and Solenoids

Alternatively, the nanomagnet can be wrapped inside nanocoils or solenoids for clocking. Fig. 10 illustrates how a nanocoil can be used to clock nanomagnets. The magnetic field generated by the nanocoil is given by

$$H_{\mathrm{CLOCK}} = \frac{N\,I_{\mathrm{CLOCK}}}{2r} \tag{4}$$

where N is the number of turns and r is the distance from the center of the coil to its periphery. Comparing (4) with (2), the nanocoil generates a higher field than the parallel wire when N/(2r) is larger than (cosθ + cosϕ)/(4πυ). Moreover, replacing the core material with a high-permeability material (HPM) can increase the strength of the magnetic field [16]. Table III lists the permeability of suitable core materials and its impact on the nanocoil energy requirement. Fig. 11 shows the arrangement of magnet, HPM, and nanocoil. After turning on the current in the nanocoil, both the nanomagnet and the cladding material are magnetized along the hard axis (90°) of the magnet. The HDIP coupling (ferromagnetic coupling) between the HPM and the nanomagnet lowers the energy barrier of the nanomagnet, which results in faster switching. However, after turning off the current in the nanocoil, the HDIP between the HPM and the nanomagnet remains, which in turn results in a nonzero magnetic field along the hard axis. Thus, the energy barrier of the nanomagnet is no longer 40 KBT as designed, but less than 40 KBT, resulting in a lower retention time. However, it is certainly feasible to increase the original energy barrier (say, 20% higher), thus ensuring a 40 KBT barrier even under the influence of the HPM. It is important to note that the volume of the cladding material is small, of the order of 100 nm³, and thus results in a lower HDIP, since HDIP is proportional to the volume. We have performed magnetostatic simulations to illustrate this point, and the result is presented in Fig. 11. As expected, the magnet energy barrier has decreased from 40 KBT to 35 KBT. However, Fig. 12 shows how overdesigning the magnet with a 46 KBT energy barrier can ensure a 40 KBT barrier even in the presence of HDIP from the HPM. One way to increase this energy barrier is to increase the volume of the nanomagnet. This results in a larger nanomagnet area and thus a larger design area (14% higher in this case), which is the primary drawback of the nanocoil-based clocking scheme. However, the larger magnet volume increases HDIP between magnets, and our simulations indicate that the delay of each nanomagnet switching has decreased by approximately 1%. Another drawback of nanocoil-based clocks is the inductive coupling between them, which can generate voltages in nonswitching coils, resulting in a stray magnetic field on nonswitching nanomagnets.
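The three field expressions (2), (3), and (4) can be compared numerically with a short sketch. It evaluates the formulas in SI units (A/m) and converts to oersted. The geometry in the example (current, distance, wire width, gap) is hypothetical and is not the Table II data, and the function names are illustrative. Note also that (3) is only meaningful above the gap at which the text reports the field enhancement to saturate.

```python
import math

# Unit conversion: 1 Oe corresponds to 1000 / (4 * pi) A/m.
A_PER_M_PER_OE = 1000.0 / (4.0 * math.pi)


def wire_field(i_clock, dist, theta_deg=0.0, phi_deg=0.0):
    """Eq. (2): field in A/m at shortest distance dist (m) from the wire."""
    angles = math.cos(math.radians(theta_deg)) + math.cos(math.radians(phi_deg))
    return i_clock * angles / (4.0 * math.pi * dist)


def ngc_field(i_clock, dist, width, gap, theta_deg=0.0, phi_deg=0.0):
    """Eq. (3): wire field scaled by 2 * W / g (narrow gap cladding)."""
    return 2.0 * (width / gap) * wire_field(i_clock, dist, theta_deg, phi_deg)


def coil_field(n_turns, i_clock, radius):
    """Eq. (4): field in A/m of an N-turn nanocoil of radius r (m)."""
    return n_turns * i_clock / (2.0 * radius)


def coil_beats_wire(n_turns, radius, dist, theta_deg=0.0, phi_deg=0.0):
    """True when eq. (4) exceeds eq. (2) for the same current."""
    return coil_field(n_turns, 1.0, radius) > wire_field(1.0, dist, theta_deg, phi_deg)


if __name__ == "__main__":
    # Hypothetical geometry, for illustration only (not the Table II values).
    current, distance = 1e-3, 50e-9
    bare = wire_field(current, distance) / A_PER_M_PER_OE
    clad = ngc_field(current, distance, width=100e-9, gap=20e-9) / A_PER_M_PER_OE
    print(f"bare wire: {bare:.1f} Oe, with cladding: {clad:.1f} Oe")
```

For θ = ϕ = 0° the bare-wire expression reduces to the familiar infinite-wire result I/(2πυ), which is a convenient sanity check on the implementation.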
However, the coupling inductance in our design is of the order of femtohenries (fH), and it can generate only a


Fig. 12. Overdesigning nanomagnet with higher energy barrier ensures retention time even after interaction with HPM material.

TABLE IV COMPARISON OF DIFFERENT HCLOCK GENERATION TECHNOLOGIES

Fig. 13. Nanomagnet logic pipeline.

Fig. 14. (a) Four-phase HCLOCK waveforms. (b) Precession of magnets with four-phase clocking.

few microvolts of noise on a nearby nanocoil. The stray magnetic field generated by this noise voltage is negligible. We have estimated the energy dissipated by the proposed clock generation techniques for generating an HCLOCK of 75 Oe. Table IV shows the comparison of energies for the previously proposed techniques. We chose the NGC-based clock generation technique for nanomagnet simulations in the rest of the paper, since it can generate HCLOCK with the lowest energy. Moreover, this method was shown experimentally by researchers in [17] for a system with larger dimensions.

IV. CLOCK ARCHITECTURES FOR LOGIC GATES AND INTERCONNECTS

In this section, we discuss various clock architectures to improve the performance of nanomagnet designs. To achieve maximum performance, separate clock architectures are proposed for logic gates and interconnects, as explained shortly.

A. Clock Architectures for Logic Gate

Nanomagnet circuits can be made to work in a pipelined fashion, where computations on different nanomagnets are synchronized by clock signals (HCLOCK[1–3]). The basic idea for clocking a nanomagnet logic pipeline is shown in Fig. 13 and can be described as follows: 1) stage 1 is clocked in preparation for computation; 2) stage 2 is clocked to receive the data from stage 1; 3) stage 3 is clocked while the stage 2 value is being computed; and 4) computation is completed in stages 1–3, and stage 4 is prepared to receive input from stage 3. Thus, multiphase clocks are needed for high-throughput, robust nanomagnet systems, and the following section describes the proposed clock architectures for logic gates.

1) Four-Phase Overlapping Clocks: Fig. 14(a) and (b) shows the operation of a five-stage nanomagnet inverter chain using a four-phase overlapping clock scheme. This strategy requires 50% overlap between the clocks of consecutive stages. The overlap time ensures that, while a magnet value is being computed, the magnet in the next stage is fully aligned along the hard axis, which prevents backward data propagation. In this particular example, we have chosen a clock time period (TCLOCK) of 3.6 ns (with 50% duty cycle). The estimated delay for propagating data from magnet 0 to magnet 4 is 3.9 ns and the corresponding energy is 12.5 aJ, using the NGC clock technology.

2) Three-Phase Overlapping Clocks: A robust alternative to four-phase clocking is a three-phase overlapping clock scheme, which is shown in Fig. 15(a). This clocking strategy requires 33% overlap between the clocks of consecutive stages. Compared to the four-phase scheme, the clock frequency needs to be lowered to maintain the required clock overlap time. The evolution of magnet logic states as the clock propagates through magnets 0 to 4 is shown in Fig. 15(b). For the three-phase clocking scheme, a TCLOCK of 4 ns (50% duty cycle) is chosen, which is larger than the TCLOCK for the four-phase clock. Using this clock frequency, we estimated the delay for propagating data from magnet 0 to magnet 4 to be 5.1 ns and the corresponding energy to be 13.8 aJ, using the NGC clock technology. Thus, for nanomagnet circuits, the power, speed, and robustness of the circuit can be traded off


Fig. 16. HDIP on magnets M1 through M10 after GIC HCLOCK switching. Stray fields and thermal noise prevent propagation of data beyond magnet M6.

Fig. 15. (a) Three-phase HCLOCK waveforms. (b) Precession of magnets with three-phase clocking.

by choosing between the three- and four-phase clocking schemes. In this paper, we have opted for the more robust three-phase clocking approach for designing nanomagnet circuits.

B. Clock Architectures for Interconnects

In [6], the same three-phase clocking approach with 33% overlap is used for both logic gates and interconnects. However, this clocking approach for a 10-μm-long interconnect results in an interconnect delay of 2.2 μs, which is ∼400× larger than the switching delay of an AND gate. Interconnects, however, have a unique clocking property that can be exploited to decrease the interconnect delay, as explained shortly.

1) Grouped Interconnect Clocking (GIC): In interconnects along an axis parallel to the hard axis (say, horizontal), values are propagated by successively inverting every interconnect magnet (antiferromagnetic coupling) along the axis. On the other hand, in interconnects along an axis parallel to the easy axis (say, vertical), magnets successively buffer (ferromagnetic coupling) the value stored in the first magnet. Since there is no logic computation other than inversion or buffering, we can employ a GIC scheme with 100% clock overlap to decrease the interconnect delay. A horizontal interconnect example with ten magnets (M1–M10) is shown in the inset of Fig. 16, and it can be used to understand the clock grouping technique. M1 is fixed (say, in the up direction, 0°) and the information in M1 needs to be propagated to M10. First, magnets M2 through M10 are put in the hard-axis direction (90°) using an external clock. When the clock releases, magnets M2 through M10 start relaxing from the 90° direction. Since M2 is closest to M1, M2 feels a sufficiently strong input bias field to switch toward the down (180°) direction. It is important to note that the magnets on the right side of M2, which are M3 through M10, do not impact the switching of M2.
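The clocking schemes discussed above can be illustrated with a small sketch. It assumes that the phases of an n-phase clock are evenly spaced by TCLOCK/n and that each phase is active for the stated duty cycle. Under that assumption, the overlap between consecutive stages comes out to 50% for four phases and about 33% for three phases, matching the figures quoted in the text. The helper for the interconnect returns only the ideal final states of a chain and ignores noise. All function names are illustrative.

```python
def overlap_fraction(n_phases: int, duty: float = 0.5) -> float:
    """Share of a stage's active clock window that overlaps the next stage.

    Assumes consecutive phases are shifted by T_CLOCK / n_phases.
    """
    on_time = duty            # active time, in units of T_CLOCK
    shift = 1.0 / n_phases    # assumed spacing between consecutive phases
    return max(0.0, on_time - shift) / on_time


def clock_windows(n_phases: int, n_stages: int, t_clock: float,
                  duty: float = 0.5) -> list:
    """First active window (start, end) of each stage, in the units of t_clock."""
    step = t_clock / n_phases
    return [(k * step, k * step + duty * t_clock) for k in range(n_stages)]


def ideal_chain_states(first_deg: int, n_magnets: int,
                       inverting: bool = True) -> list:
    """Noise-free final states along an interconnect.

    inverting=True models antiferromagnetic coupling (each magnet flips),
    inverting=False models ferromagnetic coupling (each magnet buffers).
    """
    states = [first_deg]
    for _ in range(n_magnets - 1):
        prev = states[-1]
        states.append((prev + 180) % 360 if inverting else prev)
    return states


if __name__ == "__main__":
    print("four-phase overlap:", overlap_fraction(4))
    print("three-phase overlap:", round(overlap_fraction(3), 3))
    print("M1..M10 ideal states:", ideal_chain_states(0, 10))
```

In the ideal inverting chain with M1 at 0°, every odd-numbered magnet ends at 0° and every even-numbered magnet at 180°.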
This is because magnets M3 through M10 are biased toward the hard axis and remain along the hard axis (because of their nanoseconds-long response time), thus producing a dipolar field in the hard-axis direction. The hard-axis field is much smaller (tens of Oe) than the internal field HC (hundreds of Oe) and has negligible impact on the dynamics of M2. Hence, in the GIC scheme we can safely ignore the influence of the right-side magnets. The net field on all magnets is computed and is shown in Fig. 16. In order to estimate whether data have propagated reliably through the magnet ensemble, we need to estimate the noise field (due to nearby interconnects and random thermal fields) acting on each magnet. Using the parameters presented in Table I, we have estimated the standard deviation σ of the thermal fluctuation using the fluctuation dissipation theorem [5]

$$\sigma^2 = \frac{2 K_B T\,\alpha}{(1+\alpha^2)\,|\gamma|\,M_S V} \tag{5}$$

to be 0.5 mOe. For calculating the stray field from surrounding magnets, we have assumed that the nearby interconnect is at a distance of 600 nm, which is 2.5 times the distance between nearby coupled magnets. Hence, the maximum stray field on any magnet in the chain is 0.5 Oe (the 0.5 mOe due to thermal fluctuation is negligible in this case). Given these assumptions, the net HDIP magnetic field on any given magnet needs to be greater than 0.5 Oe. Using this noise information and the results in Fig. 16, we can estimate the set of magnets that can be included in the same clock cycle by estimating the magnitude of HDIP and its polarity. As shown in Fig. 16, M7 has a smaller bias field and switches toward the wrong direction (180° instead of 0°) due to nearby interconnect lines (worst-case scenario). This situation can happen only in specific circumstances, depending on the net field from the coupled interconnects. However, for predictable, correct operation of GIC, a restriction on the number of grouped magnets is necessary, and it is five (M2 to M6) in the given example. In Fig. 16, we can also see that M9 feels a higher HDIP field than M8, which is nonintuitive. This is because M8 starts switching toward the wrong direction due to the coupled noise field, which in turn exerts a strong field on M9 due to their close proximity. Thus, while employing the GIC scheme, we need to choose the correct number of magnets to avoid computation errors.

So far, in this paper, we have discussed innovations at the device/circuit level using the nonmajority gate and interconnect optimizations. We have also discussed technologies to lower clock power, which dominates the total power dissipation. In the next


section, we propose an integrated device/circuit/system simulation framework incorporating the proposed device/circuit level solutions, to design high-performance nanomagnet-based computing systems.

V. DEVICE/CIRCUIT/SYSTEM COMPATIBLE SIMULATION FRAMEWORK

The dynamics of the nanomagnets presented in Section II can be described by a set of equations that captures the interactions between HCLOCK, HDIP, and the nanomagnets. The Landau-Lifshitz-Gilbert (LLG) micromagnetic equation [18] describes the dynamics of a magnet and can be modified to include the dipolar interactions between magnets, HCLOCK, and HANI. After these incorporations, the LLG equation is given by

$$(1+\alpha^2)\,\frac{d\vec{M}}{dt} = |\gamma|\left(\vec{M}\times\vec{H}_{\mathrm{TOT}}\right) - \frac{\alpha|\gamma|}{|\vec{M}|}\,\vec{M}\times\left(\vec{M}\times\vec{H}_{\mathrm{TOT}}\right) \tag{6}$$

where HTOT is given by

$$\vec{H}_{\mathrm{TOT}} = \vec{H}_{\mathrm{CLOCK}} + \vec{H}_{\mathrm{ANI}} + \vec{H}_{\mathrm{DIP}} \tag{7}$$

Fig. 17. Impact of τr on delay and energy dissipation in nanomagnet designs and optimum τr design choice.

TABLE V PERFORMANCE OF STANDARD CELLS IN NANOMAGNET LIBRARY

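where $\vec{H}_{ANI}$ is given by

$$\vec{H}_{ANI} = H_{CRIT}\, m_z \hat{z} = \frac{2K_{u2}}{M_S}\, m_z \hat{z} \tag{8}$$

where $H_{CRIT}$ is the critical magnetic field of the nanomagnet, $\hat{z}$ is the magnet easy axis, and $\vec{H}_{DIP}$ is given by

$$\vec{H}_{DIP} = \sum_{n=1,\, n\neq j}^{N} \frac{3(\vec{M}_n\cdot\vec{r}_{nj})\,\vec{r}_{nj} - \vec{M}_n\, r_{nj}^2}{r_{nj}^5} \tag{9}$$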

where $\vec{M}_n$ is the magnetization vector of magnet $n$ and $\vec{r}_{nj}$ is the separation between magnets $n$ and $j$. A complete nanomagnet system with multiple magnets can be simulated by initializing all the magnets to an initial state and numerically solving (6)–(9) for each nanomagnet in the time domain to compute the system performance. The power dissipated is computed using (10), as discussed in [19].

$$P_{diss} = \frac{\alpha|\gamma|}{(1+\alpha^2)\,|\vec{M}|}\left(|\vec{M}\times\vec{H}_{TOT}|\right)^2 \tag{10}$$

where $\alpha$ is the damping factor and $\gamma$ is the gyromagnetic ratio. The simulation framework also supports a finite slope for ramping up ($\tau_r$) and ramping down ($\tau_f$) $H_{CLOCK}$, as shown in Fig. 17. Using this approach, it is possible to reduce energy dissipation, as reported in [21]. Increasing the clock rise time ($\tau_r$) decreases the energy dissipation in nanomagnet systems. However, for a fixed clock period (say, $T_{CLOCK}$ = 4 ns), increasing $\tau_r$ decreases energy dissipation only until $\tau_r$ reaches $\tau_C$, which is given by

$$\tau_C = \frac{1+\alpha^2}{2\alpha\,|\gamma H_C|} \tag{11}$$

and dissipation increases after $\tau_C$, as shown in Fig. 17. Moreover, the delay of the nanomagnet also decreases with higher $\tau_r$ and then increases after $\tau_C$. Thus, to obtain optimum nanomagnet performance, $\tau_r$ has to be fixed at $\tau_C$. It is important

to mention that the device/circuit/system compatible simulation framework we have developed for nanomagnet systems has been benchmarked against the framework presented in [21]. The proposed framework has been utilized to design a library of standard cells. The delay, area, and intrinsic energy dissipated by some representative nanomagnet gates in the library are presented in Table V. With the assistance of the synthesized cell library, we can design optimized nanomagnet circuits/systems, as discussed in the following section.

VI. DESIGN EXAMPLES USING NANOMAGNETS

The effectiveness of the nonmajority gate-based design approach is demonstrated in this section with two design examples: RO and 16-point DCT. The RO is chosen to illustrate the feasibility of nanomagnet designs with feedback. On the other hand, the DCT implementation shows the relevance of nanomagnet designs in specific applications such as signal processing, using a systolic array architecture.

A. RO Design

We have designed a three-stage, two-input NAND gate-based RO, the schematic of which is shown in Fig. 18(a). The RO is a common benchmark circuit used to estimate the performance of novel switching devices [20]. Simulation results show that the nonmajority gate RO frequency is 36 MHz. The corresponding frequency for the majority gate-based RO is 28.7 MHz. Moreover, the nonmajority gate-based implementation also shows a 33% improvement in area; the performance results are summarized in Table VI. In order to estimate the true merit of nanomagnet-based designs, we have designed an RO using 15 nm CMOS [9] ($V_{dd}$ = 0.7 V), and the corresponding results are


Fig. 18. Three-stage RO circuit. (a) RO schematic. (b) Layout of RO with nonmajority gates.

TABLE VI
RESULTS FROM RO EXPERIMENT

presented in Table VI as well. We have also defined a performance metric, millions of operations per second per microwatt of power (P) per square micrometer of area (A), written MOPS/P/A, to compare various switching devices. Results show that the proposed nonmajority RO has a 13× performance improvement compared to the 15 nm CMOS design. It is important to note that, without $H_{CLOCK}$ enhancement using NGC, the performance degrades by a factor of 22× compared to CMOS. Hence, to ensure the competitiveness of nanomagnet technology, a clock power reduction technology like NGC is an absolute necessity.

B. 16-Point DCT Design

A nanomagnet-based implementation is inherently a near-neighbor communication system, and information propagation over long distances is inefficient [8]. For this reason, a general-purpose computation system with multiple memory accesses cannot be efficiently implemented using nanomagnets. This limitation can be addressed by using a systolic array architecture, in particular for signal processing applications. We have chosen the DCT [21], which is suitable for systolic implementation, as a design example to demonstrate the effectiveness of our proposed device-circuit-architecture design scheme. The DCT algorithm is frequently used in image and video processing systems and is given by

$$Y_k = \sum_{n=0}^{N-1} x_n \cos\left[\frac{\pi}{N}\left(n+\frac{1}{2}\right)k\right], \quad \text{where } k = 0, \ldots, N-1. \tag{12}$$

The design flow graph (DFG) of a 16-point DCT is shown in Fig. 19. The system consists of 16 identical processing elements (PEs), where each PE has one multiplier and one adder. The input (X) is multiplied by the coefficient (C) inside each PE and added to the result from the previous stage. The output is then transferred to the next PE for similar computations. This process is repeated in every PE sequentially, and the outputs

Fig. 19. DFG of 16-point DCT system.

Fig. 20. Performance figures for 16-point DCT system.

from the last column of PEs are the final result (Y) of the 16-point DCT. The power, delay, and area results for the DCT design are computed using the simulation framework discussed in Section V and are presented in Fig. 20, along with the results from a majority gate-based design. The optimized DCT design shows an improvement of 46% in power and 36% in delay compared to the majority gate-based design. However, the area benefit of the nonmajority design over the majority design is smaller than in the RO case, since the DCT design is dominated by interconnects within each PE. In a core-logic-dominated design, the area benefit of the nonmajority design will be larger, as in the RO case.

For an isofrequency comparison between the nanomagnet design and the CMOS design, the supply voltage ($V_{dd}$) of 15 nm CMOS has been scaled to 90 mV. Lowering $V_{dd}$ (deep subthreshold operation) is an effective technique for reducing power consumption in CMOS-based systems where the frequency requirement is on the order of MHz [22]. Finally, we have introduced a metric, the energy-delay$^{0.5}$-area product ($ED^{0.5}A$), which gives more importance to the energy dissipation of the system than to delay. Using this metric, the nanomagnet design achieves a 10× improvement compared to ultimate 15 nm CMOS. However, the nanomagnet design performance degrades drastically (28×) when an unoptimized clock (without NGC) is used for the design, as shown in Fig. 20. Thus, to make nanomagnet technology competitive with CMOS technology, we need to exploit clock energy reduction technologies such as a solenoid with a steel core and/or NGC.
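The PE data flow described in this section, and the $ED^{0.5}A$ metric, can be illustrated with a short Python sketch. This is a functional model only, not the nanomagnet implementation: the function names `pe`, `dct_pe_chain`, and `ed05a` are hypothetical, and the assignment of one cosine coefficient per PE per output is an assumption about the schedule, which Fig. 19 may organize differently.

```python
import math

def dct_direct(x):
    """Evaluate (12) directly: Y_k = sum_n x_n * cos[(pi/N)(n + 1/2) k]."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi / N * (n + 0.5) * k) for n in range(N))
            for k in range(N)]

def pe(partial_sum, x_n, coeff):
    """One processing element: a single multiply followed by a single add."""
    return partial_sum + x_n * coeff

def dct_pe_chain(x):
    """Accumulate each output by passing a partial sum through N PEs in turn.

    Assumption: while output k is being accumulated, PE n holds the
    coefficient cos[(pi/N)(n + 1/2) k]. The real schedule may differ.
    """
    N = len(x)
    y = []
    for k in range(N):
        acc = 0.0
        for n in range(N):  # the partial result moves from PE n to PE n + 1
            coeff = math.cos(math.pi / N * (n + 0.5) * k)
            acc = pe(acc, x[n], coeff)
        y.append(acc)
    return y

def ed05a(energy, delay, area):
    """Energy-delay^0.5-area product; a lower value is better."""
    return energy * math.sqrt(delay) * area
```

Calling `dct_pe_chain` on a 16-sample input should reproduce `dct_direct` on the same input, since both evaluate (12) with only near-neighbor hand-offs of the partial sum in the chained version. Because delay enters `ed05a` under a square root, halving energy improves the metric more than halving delay, which is the weighting the text describes.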


Fig. 21. Delay and power projections for DCT system with respect to nanomagnet volume.

1) Scalability of Nanomagnets: Let us study the scalability of nanomagnets considering external field-based switching, as described in this paper. Nanomagnet thermal stability in terms of retention time ($\tau_{RT}$) is given by

$$\tau_{RT} = \frac{e^{K_{u2}V/K_BT}}{f_0} \tag{13}$$

where $f_0$ is the attempt frequency, which is on the order of 1 GHz for storage purposes [23]. For a $K_{u2}V$ of $40K_BT$, a $\tau_{RT}$ of approximately ten years can be achieved. However, if the volume (V) is scaled down, $\tau_{RT}$ decreases exponentially. Thus, $K_{u2}$ has to be increased proportionally to ensure the necessary thermal stability, which in turn increases the critical magnetic field ($H_{CRIT}$) required for $H_{CLOCK}$, as given by

$$|H_{CRIT}| = \frac{2K_{u2}}{M_S}. \tag{14}$$

A nanomagnet with higher $H_{CRIT}$ requires a higher $H_{CLOCK}$, which increases the required current ($I_{CLOCK}$), as discussed in Section III. This in turn increases the power consumption of the clock circuitry. Moreover, the resistance of the clock wires also increases, since they have to be scaled as well to concentrate the magnetic field on individual nanomagnets, which further increases the power dissipation. To understand the impact of nanomagnet scaling on power dissipation, consider scaling the 16-point DCT (described in this section) by a factor of two (area decreases by 2×). An area scaling ratio of two from one technology generation to the next is used in today’s CMOS designs, as given by Moore’s law [24]. Due to scaling, the $I_{CLOCK}$ magnitude has to be increased by a factor of $\sqrt{2}$ to accommodate the 2× increase in $H_{CRIT}$ and the $\sqrt{2}$× decrease in spacing between nanomagnet and clock wire (3). Moreover, the clock wire resistance goes up by a factor of $\sqrt{2}$ due to scaling of wire dimensions. As a result, the total power consumption of the DCT would increase to 7.7 nW compared to the initial value of 2.7 nW. Thus, decreasing the area by 2× comes with an overhead of a 2.8× increase in power dissipation, consistent with clock power scaling as $I_{CLOCK}^2 R$, i.e., $(\sqrt{2})^2 \times \sqrt{2} \approx 2.8$. Fig. 21 shows the projection of delay and power of the 16-point DCT with respect to scaling of nanomagnet volume for multiple technology generations. Results clearly show that the area benefit of scaling comes at a much higher cost of increased power dissipation. This is one of the major constraints for external field-based nanomagnet designs.

VII. CONCLUSION

In this paper, we have presented a nanomagnet-based logic device with built-in bias, which can be used to build nonmajority logic gates. Circuits based on the proposed logic can offer more than 46% power saving compared to majority gate circuits, in addition to improvements in delay and area. We have also proposed a new clock architecture based on GIC for interconnects to achieve a significant reduction in delay. After the incorporation of device/circuit level improvements using a comprehensive design methodology, it is possible to achieve a large improvement in performance compared to 15 nm CMOS. Moreover, the performance of nanomagnet circuits can be further improved using a near-neighbor architecture. In comparison to a subthreshold 15 nm CMOS-based DCT, the nanomagnet system demonstrates a 10× improvement in the $ED^{0.5}A$ metric (under optimistic assumptions with three-phase clocking using NGC). However, scalability of such designs can be an issue under external field-based switching of nanomagnets.
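The scalability concern can be made concrete with the numbers quoted in the scalability discussion of Section VI. The Python sketch below is a back-of-the-envelope check under stated assumptions, not part of the simulation framework: the function names are hypothetical, the magnet thickness is assumed fixed so that volume scales with area, and clock power is assumed to scale as $I_{CLOCK}^2 R$.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600

def retention_time_s(barrier_in_kT, f0_hz=1e9):
    """Retention time from (13): tau_RT = exp(Ku2*V / (K_B*T)) / f0."""
    return math.exp(barrier_in_kT) / f0_hz

def clock_power_factor(area_scale=2.0):
    """Relative clock power after shrinking the area by `area_scale`.

    Assumptions (thickness fixed, so volume shrinks with area):
      - Ku2, and hence H_CRIT from (14), grows by `area_scale`;
      - magnet-to-wire spacing shrinks by sqrt(area_scale);
      - I_CLOCK scales as H_CLOCK times spacing;
      - wire resistance grows by sqrt(area_scale);
      - clock power scales as I_CLOCK**2 * R.
    """
    linear = math.sqrt(area_scale)
    current = area_scale / linear   # field up, spacing down
    resistance = linear
    return current ** 2 * resistance

years_at_40kT = retention_time_s(40.0) / SECONDS_PER_YEAR
power_after_scaling_nw = 2.7 * clock_power_factor(2.0)
```

With a 1 GHz attempt frequency, a barrier of $40K_BT$ gives a retention time of several years, the same order as the ten-year figure quoted in the text, while a barrier of $20K_BT$ (volume halved at fixed $K_{u2}$) gives less than a second. The power factor for one 2× area step evaluates to $2\sqrt{2}$, in line with the 2.8× overhead reported for the DCT.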

REFERENCES

[1] G. I. Bourianoff, P. A. Garginia, and D. E. Nikonov, “Research directions in beyond CMOS computing,” Solid-State Electron., vol. 51, pp. 1426–1431, 2007.
[2] S. S. P. Parkin, K. P. Roche, M. G. Samant, P. M. Rice, R. B. Beyers, R. E. Scheuerlein, E. J. O’Sullivan, S. L. Brown, J. Bucchigano, D. W. Abraham, Y. Lu, M. Rooks, P. L. Trouilloud, R. A. Wanner, and W. J. Gallagher, “Exchange-biased magnetic tunnel junctions and application to nonvolatile magnetic random access memory,” J. Appl. Phys., vol. 85, no. 8, pp. 5828–5833, 1999.
[3] A. Imre, G. Csaba, L. Ji, A. Orlov, G. H. Bernstein, and W. Porod, “Majority logic gate for magnetic quantum-dot cellular automata,” Science, vol. 311, pp. 205–208, 2006.
[4] S. Salahuddin and S. Datta, “Interacting systems for self-correcting low power switching,” Appl. Phys. Lett.: Device Phys., vol. 90, no. 9, pp. 093503-1–093503-3, Feb. 2007.
[5] W. F. Brown, “Thermal fluctuations of a single-domain particle,” Phys. Rev., vol. 130, pp. 1677–1686, 1963.
[6] S. Sun, C. B. Murray, D. Weller, L. Folks, and A. Moser, “Monodisperse FePt nanoparticles and ferromagnetic FePt nanocrystal superlattices,” Science, vol. 287, pp. 1989–1992, 2000.
[7] ITRS. (2007). [Online]. Available: http://www.itrs.net
[8] C. Augustine, B. Behin-Aein, X. Fong, and K. Roy, “A design methodology and device/circuit/architecture compatible simulation framework for low-power magnetic quantum cellular automata systems,” in Proc. ASPDAC, 2009, pp. 847–852.
[9] A. Khakifirooz and D. A. Antoniadis, “MOSFET performance scaling—Part I: Historical trends,” IEEE Trans. Electron. Device, vol. 55, no. 6, pp. 1391–1400, Jun. 2008.
[10] H. Al-Asaad and M. Shringi, “On-line built-in self-test for operational faults,” in Proc. SRT Conf., 2000, pp. 168–174.
[11] A. Anguelouch, B. D. Schrag, G. Xiao, Y. Lu, P. L. Trouilloud, R. A. Wanner, W. J. Gallagher, and S. S. P. Parkin, “Two-dimensional magnetic switching of micron-size films in magnetic tunnel junctions,” Appl. Phys. Lett., vol. 76, pp. 622-1–622-3, 2000.
[12] M. Niemier, M. T. Alam, X. S. Hu, G. Bernstein, W. Porod, M. Putney, and J. Deangelis, “Clocking structures and power analysis for nanomagnet-based logic devices,” in Proc. ISLPED, Aug. 2007, pp. 26–31.
[13] R. H. Koch, J. A. Katine, and J. Z. Sun, “Time-resolved reversal of spin-transfer switching in a nanomagnet,” Phys. Rev. Lett., vol. 92, pp. 088302-1–088302-4, 2004.
[14] M. Durlam, P. Naji, A. Omair, M. DeHerrera, J. Calder, J. M. Slaughter, B. Engel, N. Rizzo, G. Grynkewich, B. Butcher, C. Tracy, K. Smith, K. Kyler, J. J. Ren, J. Molla, B. Feil, R. Williams, and S. Tehrani, “A 1-Mbit MRAM based on 1T1MTJ bit cell integrated with copper interconnects,” in Proc. IEEE 2002 Symp. VLSI Circuits, Jun. 2002, pp. 158–161.
[15] ANSYS, ANSYS Reference Manual. Houston, PA: Swanson Analysis Systems, Inc., Dec. 1992, Version 5.0.
[16] P. Leroy, C. Coillot, A. Roux, and G. Chanteur, “High magnetic field amplification for improving the sensitivity of hall sensors,” IEEE Sens. J., vol. 6, no. 3, pp. 707–713, Jun. 2006.
[17] (2003). [Online]. Available: http://www.patentstorm.us/patents/6559511/description.html
[18] L. Landau and E. Lifshitz, “On the theory of the dispersion of magnetic permeability in ferromagnetic bodies,” Phys. Z. Sowjetunion, vol. 8, pp. 153–169, 1935.
[19] B. Behin-Aein, S. Salahuddin, and S. Datta, “Switching energy of ferromagnetic logic bits,” IEEE Trans. Nanotech., vol. 8, no. 4, pp. 505–514, Jul. 2009.
[20] L. Leem and J. S. Harris, “Magnetic coupled spin-torque devices and magnetic ring oscillator,” in Proc. IEDM, 2008, pp. 1–4.
[21] L. W. Chang and M. C. Wu, “A unified systolic array for discrete cosine and sine transforms,” IEEE Trans. Signal Process., vol. 39, no. 1, pp. 192–194, Jan. 1991.
[22] H. Soeleman, K. Roy, and B. C. Paul, “Robust sub-threshold logic for ultra-low power operation,” IEEE Trans. VLSI Syst., vol. 9, no. 1, pp. 90–99, Feb. 2001.
[23] L. Sun, Y. Hao, C.-L. Chien, and P. C. Searson, “Tuning the properties of magnetic nanowires,” IBM J. Res. Dev., vol. 49, pp. 79–102, 2005.
[24] G. E. Moore, “Cramming more components onto integrated circuits,” Electron. Mag., vol. 38, no. 8, pp. 114–117, Apr. 19, 1965.

Charles Augustine received the Bachelor’s degree in electronics from Birla Institute of Technology and Science, Pilani, India, in 2004. He is currently working toward the Ph.D. degree at Purdue University, West Lafayette, IN. During his Ph.D. studies, he was engaged in research on spin-based logic and memory technologies. He was with Intel, Texas Instruments, ST Microelectronics, Philips Semiconductors, and Freescale Semiconductor, where he was engaged in research on CMOS digital integrated circuits and memories for computing, including spin-torque-based novel memory and logic structures. Mr. Augustine received the “Best Paper in Session Award” at the Semiconductor Research Corporation Techcon, 2009, and was nominated for the “Best Paper Award” at the International Symposium on Quality Electronic Design, 2009.

Xuanyao Fong received the B.S. degree in electrical engineering from Purdue University, West Lafayette, IN, in 2006, where he is currently working toward the Ph.D. degree in electrical and computer engineering. During January to August 2007, he was an Intern Engineer with Advanced Micro Devices, Inc., in the Boston Design Center, Boxboro, MA. He is currently a Research Assistant to Professor Kaushik Roy in the Nanoelectronics Research Laboratory, Purdue University. His research interests include device-circuit-architecture codesign for Si and non-Si nanoelectronics and VLSI logic and memory systems using spintronic devices, circuits, and architectures. Prof. Fong received the Best Paper Award at the 2006 International Symposium on Low Power Electronics and Design.

Behtash Behin-Aein received the B.Sc. and Ph.D. degrees in electrical and computer engineering from Purdue University, West Lafayette, IN, in 2004 and 2010, respectively. He is currently with the School of Electrical and Computer Engineering, Purdue University. His research interests include electronic transport in nanostructures, currently focusing on spintronic devices for logic, memory, and oscillator applications. This includes design, modeling, and performance evaluation of spin devices that incorporate spin-torque and magneto-dynamic phenomena.

Kaushik Roy (F’02) received the B.Tech. degree in electronics and electrical communications engineering from the Indian Institute of Technology, Kharagpur, India, and the Ph.D. degree from the Electrical and Computer Engineering Department of the University of Illinois at Urbana-Champaign, in 1990. He was with the Semiconductor Process and Design Center of Texas Instruments, Dallas, where he was engaged in research on field-programmable gate array architecture development and low-power circuit design. In 1993, he joined the Electrical and Computer Engineering Faculty, Purdue University, West Lafayette, IN, where he is currently a Professor and holds the Roscoe H. George Chair of Electrical and Computer Engineering. His research interests include spintronics, VLSI design/CAD for nanoscale silicon and nonsilicon technologies, low-power electronics for portable computing and wireless communications, VLSI testing and verification, and reconfigurable computing. He is the author or coauthor of more than 500 papers published in refereed journals and conference proceedings. He holds 15 patents, graduated 50 Ph.D. students, and is a coauthor of two books on Low Power CMOS VLSI Design (John Wiley and McGraw Hill). Dr. Roy is the Purdue University Faculty Scholar. He was a Research Visionary Board Member of Motorola Labs in 2002. He has been in the editorial board of the IEEE DESIGN AND TEST, the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS, and the IEEE TRANSACTIONS ON VLSI SYSTEMS. He was a Guest Editor for Special Issue on Low-Power VLSI in the IEEE DESIGN AND TEST (1994), the IEEE TRANSACTIONS ON VLSI SYSTEMS (June 2000), IEE Proceedings—Computers and Digital Techniques (July 2002), IEEE SENSORS JOURNAL. 
He received the National Science Foundation Career Development Award in 1995, the IBM Faculty Partnership Award, the ATT/Lucent Foundation Award, the 2005 Semiconductor Research Corporation (SRC) Technical Excellence Award, the SRC Inventors Award, the Purdue College of Engineering Research Excellence Award, the Humboldt Research Award in 2010, and the best paper awards at the 1997 International Test Conference, the IEEE 2000 International Symposium on Quality of IC Design, the 2003 IEEE Latin American Test Workshop, the 2003 IEEE Nano, the 2004 IEEE International Conference on Computer Design, the 2006 IEEE/ACM International Symposium on Low Power Electronics and Design, the 2005 IEEE Circuits and System Society Outstanding Young Author Award (Chris Kim), and the 2006 IEEE Transactions on VLSI Systems Best Paper Award.
