Parallelizing Power System Contingency Analysis Using D Programming Language

Siddhartha Kumar Khaitan and James D. McCalley, Fellow, IEEE
Department of Electrical and Computer Engineering, Iowa State University, Ames, Iowa, USA
{skhaitan, jdm}@iastate.edu

Abstract—To ensure security, a large number of contingencies must be analyzed, which requires the use of parallel computing resources. In this paper, we present an approach for parallelization and load balancing of contingency analysis (CA) in power systems using the D programming language. We parallelize CA on a multicore processor and propose and employ efficient work-stealing-based scheduling to achieve load balancing. We evaluate the features of D that are important for parallelizing CA and obtaining large performance gains. Our approach promotes legacy code reuse and hence is suitable for modern control centers, which cannot afford to port their legacy code to other high-performance computing (HPC) platforms. We have conducted time domain simulation of a large 13029-bus test system with hundreds of contingencies and parallelized CA over 2, 4, 8, 12 and 16 cores. The results confirm that our approach outperforms a conventional scheduling technique and offers large computational savings over serial execution.

Index Terms—D programming language (dlang), high performance computing, parallel programming, power system analysis computing, power system dynamics, power system simulation.

I. INTRODUCTION

Due to increasing electricity demand from critical infrastructures, modern power systems operate at power levels that may exceed those for which they were designed. This has significantly increased the probability of disturbances and catastrophic events such as blackouts [1]. Such blackouts have serious economic and social impacts. For example, the cost of one hour of downtime is $6,450,000 in brokerage operations, $2,600,000 in credit card authorization companies, $225,000 at eBay and $180,000 at Amazon [2]. Moreover, a disturbance in a power system today is likely to affect a much larger area than ever before, and owing to the complex operation of power systems, the margin of time available for taking corrective action has also shrunk. To address this challenge, efficient security analysis of power systems is highly important.

With increasing power system size and the emphasis on analyzing N − k contingencies, the number of contingencies to be analyzed is on the rise. Serial execution platforms are proving insufficient to meet the demands of CA. This necessitates parallelizing the CA task, which in turn requires load balancing to maximize resource usage efficiency. Thus,

This work was supported in part by DOE OE subcontract B601014

the state of the art in power systems calls for novel approaches that let CA benefit from HPC resources.

In this paper, we present an approach for parallelization of contingency analysis in power systems utilizing the features of the D programming language. We conduct time domain simulation of a power system using a simulator [3] and parallelize CA over a multicore processor using the D language [4, 5, 6], which has recently been used in many commercial applications [7]. We examine the features of D that are crucial for parallelizing CA and achieving large performance gains. To achieve load balancing, we propose the use of efficient work-stealing-based scheduling [8] and compare it to a conventional scheduling technique, namely master-slave scheduling. We present the algorithmic procedure and implementation details of both the work-stealing and master-slave scheduling algorithms. Unlike other HPC resources such as GPUs, which require significant rewriting of code and thus incur significant porting overhead, our approach promotes legacy code reuse and hence is suitable for modern control centers that cannot afford to port their legacy code to other HPC platforms.

To evaluate our approach, we have used a 13029-bus test system and analyzed hundreds of contingencies, parallelizing CA over 2, 4, 8, 12 and 16 cores. The results show that our approach offers large computational savings over serial execution and also outperforms master-slave scheduling. For example, using 16 cores, our approach offers a speedup of up to 13.59×: a task that requires one day on a serial platform can be completed in less than two hours.

This paper is organized as follows. Section II discusses related work on HPC in power systems and presents a brief background on D. Section III discusses our approach and the load-balancing algorithms. The simulation platform and results are discussed in Section IV.
Finally, Section V concludes this paper.

II. RELATED WORK AND BACKGROUND

A. HPC in Power Systems

Several concurrent programming languages and frameworks have been introduced, such as Java [9], Cilk [10], Charm++ [11], MPI [12] and CUDA [13]. Each of them offers specific features that are useful for different classes of problems.

Fig. 1: Flow Diagram of Our Approach. (Software: the system model is initialized from dynamic and steady-state data; N−k contingency selection is based on topology processing; the DAE kernel comprises an integrator, a nonlinear solver and a linear solver; time domain simulation feeds stability analysis and failure detection. Hardware: contingencies C1, C2, C3, …, Cn are parallelized and load-balanced using D on a multicore processor.)

In this paper, we use the D language; its relevant features are discussed in the next subsection. HPC is a promising approach for accelerating performance-critical tasks and has been used in several domains to obtain large performance gains [14, 15, 16, 17, 18, 19]. In the context of power systems, several researchers have used HPC resources to parallelize contingency analysis [20, 21, 22, 23, 24]. Load-balancing techniques have also been used [25] to maximize resource usage efficiency.

B. Relevant Features and Functions of D

D is a multi-paradigm language that supports object-oriented and concurrent programming. It is a high-level language, but it also provides the ability to do pointer manipulation and write assembly language code, which is useful for device drivers, high-performance system applications and embedded systems. D is statically typed and enables writing cross-platform code. It provides features such as operator overloading, a built-in documentation generator, exception handling, assertions, template programming, nested functions and nested classes, function overloading, bounds checking on array indices, class and function templates, and Unicode support.

D offers programming constructs for multithreading. In D, synchronization can be done at either the method or the object level: synchronized functions allow only one thread at a time to execute a function and place a mutex around a block of statements to control access to it. To start a thread, the programmer uses the function spawn. Spawned threads execute concurrently with the calling routine and, by default, do not share any data with it; in hardware, they are scheduled on OS (operating system) threads. To share data across threads, D provides the shared qualifier, which marks a variable as shared by multiple threads. Threads are identified by a thread id (Tid), which is used as a handle for communication; communication between threads is done using the send and receive functions. Time measurement is done using the StopWatch class in the datetime module, and the simulator binary is run as an external program using the shell command from the process module.

III. SYSTEM ARCHITECTURE AND DESIGN

Figure 1 shows the overall work-flow of our approach. In the following sections, we discuss each component in detail.

A. Power System Simulation

Simulation plays a very important role in modeling proposed design innovations, experimenting with them and gaining insight into their working [26]. For contingency analysis, we use a high-speed power system simulator [3] that has been verified against commercial simulators. In the simulator, the dynamic and steady-state data are used to initialize the system model. Contingencies are selected using topological processing. The simulator solves the power system differential and algebraic equations (DAEs) using an integration solver, a linear solver and a nonlinear solver. Of the various numerical solvers provided in the simulator, we use the IDAS integrator, the KLU linear solver and the VDHN (very dishonest Newton method) nonlinear solver. The results from the DAE kernel are used for stability analysis and failure detection.

B. Parallelization and Load Balancing Approach

With the increasing size of power systems and the emphasis on N − k contingency analysis, parallelization has become essential. As an example, consider a system with N = 15,000 and k = 3. If analyzing each contingency takes merely 10 seconds, then analyzing all possible N − k contingencies on a serial platform will take 1.78 × 10^5 years. Clearly, parallelization is inevitable for security analysis in future power grids.

In our approach, the simulator is parallelized on a multicore computer using the D programming language, and load balancing is achieved using the work-stealing scheduling algorithm. For comparison purposes, we have also implemented the master-slave scheduling algorithm. A naïve method of parallelization is to spawn as many threads as there are tasks. However, since thread creation and destruction incur overhead, such an approach is likely to be inefficient. To avoid this overhead, the master-slave and work-stealing scheduling methods spawn a fixed, limited number of threads. Each scheduling method uses an algorithmic procedure to distribute tasks to workers such that communication overhead is kept to a minimum while load balancing is achieved. We now discuss these algorithms in detail.

Algorithm 1: Master-slave scheduling algorithm
Input: A task list T
Output: Scheduling with load balancing
 1  /* Initialization */
 2  Spawn all slaves
 3  Assign a single task to each slave
 4  /* Code for the master */
 5  while T is not empty do
 6      Wait till a task request arrives from a slave
 7      if a slave s requested a task then
 8          Remove a task t from T
 9          Assign t to s
10      end
11  end
12  /* All tasks finished. Mark termination. */
13  foreach slave s do
14      Send NULL task
15  end
16  /* Code for a slave node s */
17  while there is a task to be done do
18      Complete the task
19      Request a task from the master
20      if NULL task arrived then
21          Break
22      end
23  end
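The back-of-envelope growth estimate above can be reproduced with a few lines of arithmetic; the short script below (Python used purely for illustration) computes the number of N − k contingency combinations and the resulting serial analysis time:

```python
from math import comb

N, k = 15_000, 3
seconds_per_contingency = 10

# Number of distinct N-k contingency combinations: C(15000, 3)
n_contingencies = comb(N, k)
total_seconds = n_contingencies * seconds_per_contingency
total_years = total_seconds / (365.25 * 24 * 3600)

print(f"{n_contingencies:.3e} contingencies")          # about 5.6e11
print(f"{total_years:.2e} years of serial computation")  # about 1.78e5 years
```

This reproduces the 1.78 × 10^5 years figure quoted above, under the same 10-seconds-per-contingency assumption.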


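Algorithms 1 and 2 can be rendered as runnable sketches. The version below uses Python threads and queues for brevity; in the paper's actual implementation, D's spawn, send and receive would play the analogous roles, and run_task would launch the simulator binary for one contingency as an external process. Task contents and worker counts here are illustrative.

```python
import queue
import threading
from collections import deque

def master_slave(tasks, n_workers, run_task):
    """Algorithm 1: a dedicated master answers task requests from
    slaves; a None ("NULL") task signals termination."""
    requests = queue.Queue()                               # slave -> master
    mailboxes = [queue.Queue() for _ in range(n_workers)]  # master -> slave

    def slave(wid):
        while True:
            t = mailboxes[wid].get()
            if t is None:                      # NULL task: terminate
                return
            run_task(t)
            requests.put(wid)                  # request another task

    threads = [threading.Thread(target=slave, args=(w,)) for w in range(n_workers)]
    for th in threads:
        th.start()
    pending, active = deque(tasks), 0
    for w in range(n_workers):                 # one initial task per slave
        if pending:
            mailboxes[w].put(pending.popleft())
            active += 1
        else:
            mailboxes[w].put(None)
    while active:                              # master's scheduling loop
        wid = requests.get()
        if pending:
            mailboxes[wid].put(pending.popleft())
        else:
            mailboxes[wid].put(None)
            active -= 1
    for th in threads:
        th.join()

def work_stealing(tasks, n_workers, run_task):
    """Algorithm 2: tasks are pre-distributed; an idle worker (thief)
    scans the other workers circularly and steals one task at a time."""
    deques = [deque() for _ in range(n_workers)]
    locks = [threading.Lock() for _ in range(n_workers)]
    for i, t in enumerate(tasks):              # initial task distribution
        deques[i % n_workers].append(t)

    def worker(wid):
        while True:
            with locks[wid]:
                t = deques[wid].popleft() if deques[wid] else None
            if t is None:                      # own queue empty: try to steal
                for step in range(1, n_workers):
                    v = (wid + step) % n_workers   # circular victim search
                    with locks[v]:
                        if deques[v]:
                            t = deques[v].pop()    # steal a single task
                            break
            if t is None:                      # nothing left anywhere: done
                return
            run_task(t)

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(n_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()                              # barrier for all workers
```

Since contingencies are independent, no ordering of tasks needs to be preserved in either scheme; any callable that analyzes one contingency can serve as run_task.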

The master-slave scheduling algorithm, shown in Algorithm 1, works by initially spawning all slaves and assigning a single task to each of them. Each slave then works on its task and, on completion, requests a new task from the master. The master takes care of scheduling tasks on the slaves. When all tasks are finished, the master signals completion by sending a NULL task to each slave. This completes the algorithm.

The work-stealing [8] scheduling algorithm works by allocating

Algorithm 2: Work-stealing scheduling algorithm
Input: A task list T
Output: Scheduling with load balancing
 1  /* Initialization */
 2  Spawn all workers
 3  Distribute available tasks among workers
 4  /* Code for worker p */
 5  while p has unfinished tasks do
 6      foreach unfinished task t assigned to p do
 7          Complete the task t
 8      end
 9      foreach worker p′ such that p′ ≠ p do
10          Try stealing a task from p′
11          if stealing was successful then
12              Assign stolen task to p
13              Break
14          end
15      end
16  end
17  Wait on a barrier for all workers


all the tasks across the worker threads and then allowing a worker with no remaining tasks (called a thief) to steal a task from a worker with queued tasks (called a victim). As shown in Algorithm 2, the workers are first spawned and all available tasks are distributed among them. Each worker then works through its own tasks. When the tasks available at a worker are finished, it tries to steal tasks from the remaining workers. Stealing is done in a circular manner: each thief starts searching for a victim at its right-side neighbor (assuming a circular arrangement), and in each stealing attempt at most a single task is stolen. When a worker can no longer steal a task, its execution is complete. The main program waits on a barrier till all the workers have completed their tasks. This completes the algorithm.

C. Salient Features of Our Approach

Choice of Implementation Platform: Researchers have also used platforms such as GPUs [27], MPI [28] and OpenMPI [29]. A limitation of GPUs is their limited memory, which is not shared with CPU memory; hence, using a GPU incurs the overhead of transferring data between CPU and GPU. Further, GPUs are well-suited to throughput-oriented computing where fine-grain parallelism is available, whereas contingency analysis is a latency-oriented task that does not offer fine-grain parallelism. A limitation of MPI is the requirement of reserving all nodes from the beginning to the end of a run, which leads to over-provisioning and wasted resources. Moreover, using several of these HPC platforms requires significant rewriting and remodeling of code, while most control centers run large legacy codes whose porting would incur very high overhead. In contrast, our approach

allows legacy code reuse, since the scheduler does not change the simulation code; instead, it runs the simulation binary as an external program. Further, our approach uses widely available multicore processors, avoiding the need to acquire special-purpose hardware (e.g. a GPU).

Choice of Scheduling Algorithm: The work-stealing scheduling algorithm used in our approach is a dynamic scheduling algorithm. Unlike static scheduling, which requires a priori knowledge of the completion times of different tasks to generate load-balanced schedules, work-stealing generates schedules in an online manner. Compared to the master-slave method, work-stealing avoids congestion at a single master; moreover, all available cores can act as workers, which reduces the overhead of scheduling. The work-stealing algorithm has also been shown to be efficient in both time and space [8]. In the context of this paper, contingencies can be analyzed independently of each other; our scheduler takes advantage of this fact to simplify the scheduling method, since no ordering of tasks needs to be maintained during task distribution and stealing.

Leveraging Features of D: Unlike C/C++, but similar to Java, D provides fully automatic garbage collection, which significantly simplifies the design, reduces memory-management bugs and shortens development time. Also unlike C/C++ but similar to Java, D provides multithreading primitives as part of the language itself, which promotes portable multithreaded implementations. D allows calling C functions, maintains link compatibility with the C calling conventions and supports all C types. Compared to C/C++, it provides improved array and string handling, although it does not provide multiple inheritance or namespaces. In D, source files have a one-to-one correspondence with modules and hence can be directly imported.
Thus, compared to C/C++, it reduces the burden of compiling a large number of included files. Java offers portability: once written, a Java program can run on a wide variety of platforms using the Java virtual machine (JVM); however, this reduces execution speed. Unlike Java, but similar to C/C++, D is designed for writing efficient native system applications and does not use a virtual machine for execution. D thus has the potential to offer higher speed by virtue of direct native code generation.

IV. EXPERIMENTATION PLATFORM AND RESULTS

We test our approach using a large power system with 13029 buses, 12488 branches, 431 generators and 5950 loads. We simulate 128, 256 and 512 contingencies, using 1 (serial execution), 2, 4, 8, 12 and 16 cores, with both the master-slave method and our approach. The contingencies simulate different combinations of fault events, such as bus faults, branch faults and branch tripping. The simulation times are shown in Table I. To study how the simulation time scales with the number of

TABLE I: Simulation times for master-slave and our approach (seconds)

      |        Master-slave        |        Our approach
  N   |  C=128    C=256    C=512   |  C=128    C=256    C=512
  1   |   4001     8000    15217   |   4001     8000    15217
  2   |   4098     8056    14868   |   1960     4001     7423
  4   |   1334     2674     5458   |    983     1968     3873
  8   |    602     1173     2560   |    517     1041     2048
 12   |    407      791     1513   |    380      718     1395
 16   |    323      604     1222   |    316      588     1119

cores, we use the speed-up metric S(C, N), defined as

    S(C, N) = Time(C, 1) / Time(C, N)        (1)
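Applying the speedup definition in (1) to the entries of Table I gives values close to those reported below; a quick check (times in seconds, taken from Table I for our approach):

```python
# Serial baselines Time(C, 1) and our-approach times Time(C, N), from Table I.
serial = {128: 4001, 256: 8000, 512: 15217}
ours = {
    128: {2: 1960, 4: 983, 8: 517, 12: 380, 16: 316},
    256: {2: 4001, 4: 1968, 8: 1041, 12: 718, 16: 588},
    512: {2: 7423, 4: 3873, 8: 2048, 12: 1395, 16: 1119},
}

# S(C, N) = Time(C, 1) / Time(C, N), maximized over C for each core count N.
for n in (2, 4, 8, 12, 16):
    best = max(serial[c] / ours[c][n] for c in (128, 256, 512))
    print(f"N={n:2d}: max speedup {best:.2f}x")
# For N=16, 15217/1119 is approximately 13.60, consistent with the
# reported 13.59x (which truncates rather than rounds).
```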

Here, C is the number of contingencies analyzed and N is the number of cores used. Time(C, N) is the time taken to analyze C contingencies using N cores with either method (master-slave or work-stealing). Time(C, 1) is the time taken using a single core (serial execution) and represents the baseline. The resulting speedup values are shown in Figure 2.

The maximum speedups offered by our approach for 2, 4, 8, 12 and 16 cores are 2.04×, 4.06×, 7.73×, 11.14× and 13.59×, respectively. Clearly, our approach offers large computational gains: with a speedup of nearly 13×, a task that requires one day on a serial platform can be completed in less than two hours. For 2 and 4 cores, the speedup values are slightly greater than the number of cores, which can be attributed to random variation in the load on the host during runtime. For larger numbers of cores (e.g. N = 16), hardware resource contention and OS (operating system) limitations keep the speedup from approaching the number of cores.

With the master-slave approach, the maximum speedups achieved for 2, 4, 8, 12 and 16 cores are 1.02×, 2.99×, 6.81×, 10.10× and 13.23×, respectively. In the master-slave approach, one core is occupied with scheduling and performs no useful work; moreover, congestion at the master core increases the scheduling overhead. The results presented in this section confirm that our approach is effective in offering computational savings and also outperforms the conventional scheduling technique.

V. CONCLUSION AND FUTURE WORK

Due to the large size of power systems, HPC is important for analyzing a large number of contingencies; moreover, HPC platforms should provide ways to reuse legacy code. In this paper, we presented an approach for accelerating time domain simulation for CA using concurrent programming in the D language. We evaluated the features of D and qualitatively compared them with those of existing concurrent languages to gain insight into its utility for concurrent programming. The experimental results show that our approach is effective in achieving load balancing and offers large computational gains. Our approach enables reuse of legacy code and thus can be easily integrated into modern control centers to provide timely situational awareness to the operator.

Fig. 2: Speedup Values for master-slave and our approach. (Three panels, for 128, 256 and 512 contingencies; x-axis: number of cores (2, 4, 8, 12, 16); y-axis: speedup, 0–14.)

We are currently working to scale our approach to larger numbers of cores to achieve larger speedup gains.

REFERENCES
[1] G. Andersson et al., "Causes of the 2003 major grid blackouts in North America and Europe, and recommended means to improve system dynamic performance," IEEE TPWRS, vol. 20, no. 4, 2005.
[2] W. Feng, "Making a case for efficient supercomputing," Queue, vol. 1, no. 7, p. 54, 2003.
[3] S. K. Khaitan and J. D. McCalley, "High performance computing for power system dynamic simulation," in High Performance Computing in Power and Energy Systems, ser. Power Systems, 2012, pp. 43–69.
[4] "The D Programming Language," http://dlang.org/, 2012.
[5] A. Alexandrescu, The D Programming Language. Addison-Wesley Professional, 2010.
[6] C. Allison, "D: a programming language for our time," Journal of Computing Sciences in Colleges, vol. 26, no. 2, pp. 113–119, 2010.
[7] "Stack Overflow," http://stackoverflow.com/questions/56315/d-programming-language-in-the-real-world, 2012.
[8] R. Blumofe and C. Leiserson, "Scheduling multithreaded computations by work stealing," in Annual Symposium on Foundations of Computer Science, 1994, pp. 356–368.
[9] K. Arnold, J. Gosling, and D. Holmes, The Java Programming Language. Addison-Wesley, Reading, MA, 2000, vol. 2.
[10] R. D. Blumofe et al., Cilk: An Efficient Multithreaded Runtime System. ACM, 1995, vol. 30, no. 8.
[11] L. Kale and S. Krishnan, CHARM++: A Portable Concurrent Object Oriented System Based on C++. ACM, 1993, vol. 28, no. 10.
[12] W. Gropp, E. Lusk, and A. Skjellum, Using MPI: Portable Parallel Programming with the Message Passing Interface. MIT Press, 1999.
[13] NVIDIA, "Compute Unified Device Architecture (CUDA) programming guide," www.nvidia.com, 2012.
[14] S. Mittal and Z. Zhang, "Integrating sampling approach with full system simulation: Bringing together the best of both," in IEEE International Conference on Electro/Information Technology (EIT), USA, 2012.
[15] A. Pande and J. Zambreno, "A reconfigurable architecture for secure multimedia delivery," in IEEE VLSID, 2010.
[16] S. Mittal, A. Pande, L. Wang, and P. Verma, "Design exploration and implementation of simplex algorithm over reconfigurable computing platforms," in IEEE International Conference on Digital Convergence, 2011, pp. 204–209.
[17] A. Pande and J. Zambreno, "Polymorphic wavelet architectures using reconfigurable hardware," in International Conference on Field Programmable Logic and Applications, 2008, pp. 471–474.
[18] S. Gupta, S. Mittal, and S. Dasgupta, "Guaranteed QoS with MIMO systems for scalable low motion video streaming over scarce resource wireless channels," in International Conference on Information Processing, 2008.
[19] S. Mittal, S. Gupta, and S. Dasgupta, "FPGA: An efficient and promising platform for real-time image processing applications," in National Conference on Research and Development in Hardware Systems (CSI-RDHS), 2008.
[20] A. Mittal, J. Hazra, N. Jain, V. Goyal, D. Seetharam, and Y. Sabharwal, "Real time contingency analysis for power grids," Euro-Par 2011 Parallel Processing, pp. 303–315, 2011.
[21] S. Khaitan and J. McCalley, "Dynamic load balancing and scheduling for parallel power system dynamic contingency analysis," High Performance Computing in Power and Energy Systems, pp. 189–209, 2012.
[22] I. Gorton et al., "A high-performance hybrid computing approach to massive contingency analysis in the power grid," in Fifth IEEE International Conference on e-Science, 2009, pp. 277–283.
[23] R. Green, L. Wang, and M. Alam, "High performance computing for electric power systems: Applications and trends," in IEEE PES-GM, 2011, pp. 1–8.
[24] V. Jalili-Marandi, Z. Zhou, and V. Dinavahi, "Large-scale transient stability simulation of electrical power systems on parallel GPUs," IEEE TPDS, 2011.
[25] Z. Huang, Y. Chen, and J. Nieplocha, "Massive contingency analysis with high performance computing," in IEEE PES-GM, July 2009.
[26] S. Mittal, "OPNET: An integrated design paradigm for simulations," Software Engineering: An International Journal (SEIJ), vol. 2, no. 2, pp. 57–67, September 2012.
[27] Y. Zhang et al., "Accelerating pairwise statistical significance estimation for local alignment by harvesting GPU's power," BMC Bioinformatics, vol. 13, no. Suppl 5, p. S3, 2012.
[28] A. Agrawal, S. Misra, D. Honbo, and A. Choudhary, "MPIPairwiseStatSig: Parallel pairwise statistical significance estimation of local sequence alignment," in 19th ACM International Symposium on High Performance Distributed Computing, 2010, pp. 470–476.
[29] Y. Zhang et al., "Accelerating pairwise statistical significance estimation using NUMA machine," Journal of Computational Information Systems, vol. 8, no. 9, pp. 3887–3894, 2012.
