An Approach to Execute Conditional Branches onto SIMD Multi-Context Reconfigurable Architectures


An Approach to Execute Conditional Branches onto SIMD Multi-Context Reconfigurable Architectures F. Rivera, M. Sanchez-Elez, M. Fernandez Depto. de Arquitectura de Computadores y Automática, Universidad Complutense, Madrid, 28040, SPAIN [email protected]

N. Bagherzadeh Dept. of Electrical and Computing Engineering, University of California, Irvine, CA 92697, USA

ABSTRACT

Reconfigurable architectures have become very relevant in recent years. In this paper we propose a methodology to analyze interactive applications in order to execute them on a SIMD reconfigurable architecture, taking power / performance trade-offs into account. The methodology starts from a kernel-level description of the interactive application. Kernels are executed conditionally, depending on dynamic conditions such as the user's manipulation of input data. The volume of data involved in this kind of application, combined with user actions occurring at unexpected times, strongly impacts performance. We define an execution model that deals with conditional branches, accompanied by a data prefetch scheme that avoids reconfigurable processing unit stalls due to operand unavailability. Experimental results satisfy the time constraints of interactive applications and show a power-effective solution for them.

1. INTRODUCTION

Reconfigurable architectures are driving a revolution in general-purpose processing due to their performance, flexibility and cost. FPGAs [1] are the most common fine-grained devices used for reconfigurable computing. Dynamic reconfiguration [2] has emerged as a particularly attractive technique for minimizing the reconfiguration time, which has a negative effect on FPGA performance. In the case of coarse-grained systems, multi-context architectures support dynamic reconfiguration: they store a set of different configurations (contexts) in an internal memory. When a new configuration is needed, it is loaded from this internal memory, which is faster than reconfiguring from the external memory. Many coarse-grain reconfigurable systems have been proposed, e.g. MorphoSys [3], REMARC [4], MATRIX [5] and CHAMELEON/MONTIUM [6], all of them combining relevant aspects of microprocessors, ASICs and DSPs. The target applications of these architectures are multimedia consumer electronics (wireless communications, voice and video processing, etc.), which exhibit a high degree of inherent parallelism. Coarse-grained architectures exploit this fact by taking advantage of their abundant parallel computation resources to improve performance. The applications usually implemented on these systems have deterministic behavior, e.g. MPEG2 [7]. Nowadays, interactive applications are becoming feasible on such architectures. These applications have highly dynamic and non-deterministic behavior as well as high-performance requirements, so novel techniques have to be applied in order to implement them efficiently on reconfigurable hardware.

The SIMD computational model prevails over others in reconfigurable systems. Concurrent processing in SIMD fashion leads to a problem when a compare-and-branch instruction appears: it is highly probable that different branch targets arise during parallel processing. Mapping applications on SIMD reconfigurable systems can still be achieved, though some time penalty is inevitable. A further problem concerns data access management. Interactive applications need to handle large sets of data in a short time in order to meet real-time constraints. The next data to process should be immediately available in the on-chip memory to avoid processor stalls, but the unpredictability of interactive applications does not allow achieving this every time. Therefore, some prefetch policy must be applied to provide the proper data to the processing units. Besides the special attention that execution time deserves in interactive applications, power consumption is another aspect that must be addressed. Considering the random characteristics of interactive applications' data accesses, profiling methodologies can help define a data prefetch scheme that takes advantage of memory access regularities. In [8] the authors develop a profiling technique to find an application's regular memory behavior that can easily be extended to reconfigurable systems.

A study of the execution of interactive multimedia applications on coarse-grain reconfigurable architectures is required in order to tune the prefetch policy at a conditional branch. In [9] the authors focus on application-specific algorithm optimizations to achieve interactive results, but they do not deal with any prefetch model. An in-depth analysis of coherent mapping solutions for 3D graphics applications on reconfigurable architectures was done in [10]. That paper only finds an optimum way to handle and execute different configurations on a SIMD reconfigurable system; it does not deal with the problem of context and

Proceedings of the 2005 8th Euromicro conference on Digital System Design (DSD'05) 0-7695-2433-8/05 $20.00 © 2005 IEEE

data prefetch in a branch. A study of minimizing data and context transfers in coarse-grain reconfigurable systems is presented in [11], but it does not schedule data dynamically, as interactive multimedia applications demand. A hybrid design-time/run-time scheduling approach for FPGAs was developed in [12], but it deals with minimizing the reconfiguration overhead rather than the data transfer overhead. In this paper we propose a methodology to analyze interactive applications with power / performance trade-offs in order to execute them on a SIMD reconfigurable architecture. Based on profiling results, we define a data prefetch scheme that avoids reconfigurable processing stalls due to the unpredictable behavior of dynamic applications. This paper is organized as follows. Section 2 describes the target architecture. Section 3 presents the problem overview focused on interactive applications, and Section 4 describes the execution model of dynamic applications on SIMD reconfigurable architectures. Section 5 deals with the dynamic scheduling of data transfers. The proposed methodology is summarized in Section 6. Experiments are presented in Section 7, and Section 8 concludes the paper.

2. TARGET ARCHITECTURE

Reconfigurable fabrics provide massive parallelism and high computational capability, and their behavior can be configured dynamically. The core of a coarse-grain architecture is the Reconfigurable Array, a set of Processing Elements (PEs) connected in a 2D array, complemented by a high-speed memory interface and a main processor that controls the overall operation [13]. Our target architecture, MG, is the implementation of MorphoSys for 3D graphics [14]; it has the same features as any coarse-grain architecture. In MorphoSys (Figure 1), the Reconfigurable Array consists of an 8 × 8 array of PEs. The MorphoSys high-bandwidth memory interface is implemented through the Frame Buffer, the SPT Buffer, the Context Memory and the DMA controller. The Frame Buffer sits between the Reconfigurable Array and the main memory and is analogous to a data cache. This buffer is organized in two sets and makes memory access transparent to the PE Array by overlapping data transfers with computation. The SPT Buffer is organized in eight banks, one per row or column. The Context Memory broadcasts context words to the Reconfigurable Array; context words are loaded into the context register of each PE to configure it. All eight PEs in a row or column share the same context and perform the same operations; therefore MorphoSys supports only SIMD operations. The Context Memory can be updated concurrently with Reconfigurable Array operation. The DMA controller enables fast transfers between the main memory and the Frame Buffer or the Context Memory. Simultaneous transfers of data and contexts are not possible.

Figure 1. MorphoSys architecture (MG)

3. INTERACTIVE APPLICATIONS

Static multimedia applications have behavior that can be known at compilation time. This kind of application can be subjected to static task, data and configuration scheduling in order to obtain optimal execution code: in the final code, the maximum number of data and context transfers is overlapped with the previous task's execution, minimizing the time penalty due to data and context transfers. On the other hand, some applications operate in a dynamically changing environment and must be capable of reacting to it. For example, a mobile device has to deal with unpredictable user actions or must handle different protocols and standards. A static schedule would have to take into account all possible user actions, which implies studying a huge number of completely different cases. Moreover, if a single solution must apply to all cases, it is very probable that for some of them it is far from optimal. In this paper we propose a technique that, based on profiling results, decides which data or configuration to preload. In order to study this kind of application we describe it as a sequence of macro-tasks (kernels). Each kernel is characterized by its own sets of data and contexts. The kernels constituting an application can be executed conditionally a number of times depending on the user's manipulation of input data. This number of times is completely unpredictable, because small changes in the input data can produce totally different execution patterns. Figure 2 shows examples of interactive applications from a kernel point of view. Figure 2(a) illustrates an application in which one or several kernels can be repeatedly executed depending on the result of a condition test (kernels K1 and K5, in this case).
Figure 2(b) shows an application in which there is no repetitive execution of kernels, but the possibility of different targets after a conditional block (kernels K2 and K3, for instance). Based on the application's kernel description, an intrinsic condition establishes that each


kernel processes a fixed amount of data during its execution, although the particular data and the number of passes are only known at execution time.

Figure 2. Interactive applications from a kernel level

3.1. Overhead due to conditional branches in a SIMD reconfigurable architecture

Our goal is to exploit the parallel capability of the Reconfigurable Array: each PE processes a subset of the input data and the entire array operates concurrently. In the ideal case, a synchronized evolution of the application (from the kernel point of view) is obtained when:

- Each PE processes its corresponding data set without any stall.
- The entire Reconfigurable Array processes the same kernel at the same time.

This case represents an execution model perfectly suited to the SIMD style. Even so, there are Reconfigurable Array processing stalls due to conditional tests: after a conditional block is evaluated, the array must wait until the contexts and data for the kernel at the branch target address have been transferred to the on-chip memory. Far from this ideal case, and considering more realistic applications, it is highly probable that several PEs inside the Reconfigurable Array have different branch targets after a conditional test. For example, in Figure 2a, after kernel K1 has executed over the entire PE Array, some PEs may have to start executing kernel K2 while the others must execute kernel K1 again. The SIMD computational model, which most reconfigurable systems follow, does not allow the contexts required by K1 and K2 to execute on the PE Array at the same time. Therefore, a methodology defining how to proceed at a decision block (conditional branch) must be developed.

3.2. Overhead due to data transfers associated with a conditional branch

Many reconfigurable systems support overlapping data transfers with computation in order to reduce execution time. However, in dynamic applications, the data required by the kernels following a compare-and-branch block are only known after the branch condition has been tested. This stalls the reconfigurable processing elements until the required data are transferred to the on-chip memory, with the consequent increase in execution time. In Figure 2a, for example, after kernel K1's execution, the next data to process could be either kernel K2's input data or the next data required by kernel K1, depending on the branch condition test. As the volume of data involved in this kind of application is considerable, it is very improbable that both K2's input data and K1's next data can be loaded from the external memory to the on-chip memory overlapped with K1's execution, due to memory bandwidth and time constraints. To solve this problem, some prefetch technique should be applied. Application profiling can indicate which data sets are usually processed by the different kernels; speculatively loading these data sets during the previous kernel's execution can help reduce the stall time. Depending on the data transfer time window and the probable input data size, sometimes only a fraction of a data set can be preloaded. Therefore, a heuristic should be used to choose the proper data subset to prefetch.
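The subset-selection heuristic is not specified in the paper; a plausible greedy sketch, assuming profiled per-block access probabilities and a word budget for the transfer time window (both hypothetical inputs of our own), could look like this:

```python
def select_prefetch_subset(blocks, budget_words):
    """Greedy heuristic: prefer the data blocks with the best profiled
    access probability per transferred word that fit the time window.

    blocks: list of (block_id, size_in_words, access_probability)
    budget_words: words transferable during the previous kernel's execution
    """
    # Rank candidates by probability of use per word of transfer cost.
    ranked = sorted(blocks, key=lambda b: b[2] / b[1], reverse=True)
    chosen, used = [], 0
    for block_id, size, prob in ranked:
        if used + size <= budget_words:   # block fits the remaining budget
            chosen.append(block_id)
            used += size
    return chosen

# Example: three candidate blocks, a budget of 100 words.
subset = select_prefetch_subset(
    [("D1", 60, 0.9), ("D2", 50, 0.3), ("D3", 40, 0.5)], 100)
```

Any density-based ranking is only one possible heuristic; profiling data would determine the probabilities in practice.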

4. SIMD EXECUTION OF AN INTERACTIVE APPLICATION

The majority of coarse-grain reconfigurable architectures have a SIMD execution model in order to efficiently exploit the parallelism of DSP applications. To exploit the parallel capability of the Reconfigurable Array, it must be fed with the proper data without time penalties. However, given the dynamic evolution of an interactive application, several PEs can require different data sets widely scattered in memory. This lack of spatial locality can produce a significant number of processing stalls. While common in some applications, this is not the case for many multimedia applications, which exhibit an important degree of data coherence. Data coherence means that several processing elements present the same memory access pattern while the application is being executed: a coherent group of PEs requires the same data sets and accesses the same memory addresses. Data coherence is relevant because it impacts both the number of memory accesses and the complexity of the prefetch technique. For instance, 3D interactive graphics applications have an important degree of data coherence: nearby pixels are obtained by processing similar data because they cover a small image region. For a typical N × N reconfigurable array there are several possible coherence schemes [10]. The current work assumes target applications with 1 × N coherence.


Therefore, all PEs in the same row (or column) are assumed to be coherent: each group of N PEs is data coherent, and the N groups of N PEs are allowed to process different data. Some applications have a coarse control granularity. In this case branch decisions are taken globally: the entire Reconfigurable Array either continues executing K1 or begins K2's execution after the conditional branch CB1 (see Figure 2a), which is well suited to the SIMD execution model. However, most applications have a fine control granularity: branch condition tests are performed locally (inside each PE) and can yield different target addresses. The SIMD computational model does not support this, because it forces the entire Reconfigurable Array to execute the same context. Therefore, interactive applications can be mapped onto such architectures, but some time penalty is inevitable. Mapping solutions for this kind of application must define when to switch contexts from one kernel to the next, following the SIMD computational model, with the lowest time penalty and the best use of data coherence. Several possible mapping solutions are exposed in [10]. Generalizing this methodology to our target applications, the execution model can be described as follows:

- When a decision block (conditional branch) appears after one kernel has executed over the entire Reconfigurable Array, the branch target addresses (i.e. pointers to the subsequent kernels) are pushed into their corresponding stacks. There are N FIFO stacks to store target addresses, one per coherent group, i.e. one per row/column of PEs.
- In order to define the next kernel to execute, one target address is pulled from each row/column stack:
  - If all target addresses are the same, the kernel defined by that target address is executed in the next control step.
  - In the case of two different target addresses:
    - If one of them corresponds to the previously executed kernel (i.e. the previous kernel needs to be executed again), it prevails over the others: the next kernel to execute is the previous one whenever its address appears among the candidate target addresses.
    - If the previous kernel's address does not appear among the target addresses pulled from the stacks (the previous kernel need not be executed again), the kernel represented by the predominant target address is executed in the next control step.
    In both cases, the postponed target addresses are pushed back into their corresponding stacks.
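As a minimal sketch, the target-address arbitration above can be expressed as follows. The function and data-structure names are our own, and the paper does not define how the "predominant" target is chosen, so a most-frequent-target rule is assumed here:

```python
from collections import deque

def next_kernel(queues, previous):
    """Decide the next kernel for the whole array from the per-row
    target-address queues; postponed targets are pushed back.

    queues:   list of N FIFO queues, one per coherent row/column of PEs
    previous: kernel executed in the last control step
    """
    heads = [q.popleft() for q in queues]   # one candidate per group
    targets = set(heads)
    if len(targets) == 1:                   # all groups agree
        return heads[0]
    # Disagreement: the previously executed kernel prevails if present;
    # otherwise the predominant (assumed: most frequent) target wins.
    winner = previous if previous in targets else max(targets, key=heads.count)
    # Postponed targets return to the front of their queues for later.
    for q, t in zip(queues, heads):
        if t != winner:
            q.appendleft(t)
    return winner

# Example: four coherent rows; three vote K1, one votes K2.
rows = [deque(["K1"]), deque(["K2"]), deque(["K1"]), deque(["K1"])]
chosen = next_kernel(rows, previous="K1")   # previous kernel prevails
```

In the example, K2 is postponed: it stays at the head of its row's queue and will be selected in a later control step.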

a: Kernel execution in the Reconfigurable Array (Ki: kernel i)
b: Data transfer time (Di: kernel i input data)
c: Context transfer time (Ci: kernel i contexts)

Figure 3. Time diagram for different preload techniques

5. DYNAMIC SCHEDULING OF DATA TRANSFERS

As stated above, applications for reconfigurable systems can be described as a set of cooperating kernels. A kernel's functionality is defined by a set of contexts, and it operates on a fixed-size volume of data. In our target applications, kernels process only a subset of their possible input data; moreover, in most cases this data subset is unknown in advance because of decision blocks. The simplest way to provide data to a kernel is to transfer the proper data only when kernel execution requires it, implementing an on-demand data transfer policy. From the point of view of power, this is the best solution because only demanded data are transferred. But in terms of execution time, the on-demand policy takes the longest to complete an application, because there is a processing stall for each data set required by a kernel. In Figure 3 a processing stall can be observed after one kernel's execution and prior to the next one's, while the required data are transferred. The only way to avoid reconfigurable unit stalls due to data unavailability during kernel execution would be to preload all of a kernel's possible input data, supposing that memory bandwidth allows it. This option corresponds to a complete preload policy, which shows the best results in terms of execution time: kernels have all their data available just in time, and there are no data stalls, as can be observed in Figure 3. Processing stalls occur only when the next kernel differs from the previous one (due to context transfers) or when data transfers take more time than the concurrent kernel execution. Complete preload shows the worst results in power consumption: a great volume of data is transferred to the on-chip memory and only a fraction of it is finally processed by the kernel. In Figure 3 the darkest boxes represent preloaded data not processed by the kernel.

In order to find an intermediate solution between the on-demand transfer and complete preload policies we have


used data transfer profiling. Profiling methodologies help us identify the data subsets usually processed by kernels, looking for a balance between the power consumed by data transfers and execution time. Our purpose is to perform speculative data preload based on the data sets commonly processed by kernels, seeking interactive results at a reasonable power cost. In Figure 3, selective preload illustrates this intermediate solution. During a kernel's execution, some data are preloaded. When the next kernel executes and uses the previously loaded data (a hit), there is no processing stall; but if it does not use the previously loaded data (a miss), processing stalls until the required data are transferred. Depending on the possible input data size of a kernel, a certain fraction of it needs to be preloaded in order to satisfy the power / performance trade-off. Preloading a smaller subset of the possible input data yields a lower hit rate, i.e. a lower probability of coincidence between the data required by a kernel and the data preloaded, producing a processing unit stall on each miss. Preloading a bigger subset yields a higher hit rate at the cost of greater memory bandwidth and power overhead.
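As an illustration of the three policies, the following simplified cost model (our own construction, not the paper's experimental framework) counts processing stalls and transferred words for on-demand, complete and selective preload:

```python
def evaluate_policies(trace, possible, preload):
    """Count stalls and words transferred under the three data-transfer
    policies described above (illustrative model only).

    trace:    data blocks actually requested by the kernel, e.g. ["D1", "D3"]
    possible: {block: size_in_words} - all possible input data of the kernel
    preload:  set of blocks chosen for selective preload
    """
    results = {
        # On-demand: every request stalls, but only needed words move.
        "on_demand": {"stalls": len(trace),
                      "words": sum(possible[b] for b in trace)},
        # Complete preload: no data stalls, every possible word moves.
        "complete": {"stalls": 0, "words": sum(possible.values())},
    }
    # Selective preload: stall only on a miss; preloaded words always move,
    # and missed blocks are then transferred on demand.
    misses = [b for b in trace if b not in preload]
    results["selective"] = {
        "stalls": len(misses),
        "words": sum(possible[b] for b in preload)
               + sum(possible[b] for b in misses),
    }
    return results

policies = evaluate_policies(trace=["D1", "D3"],
                             possible={"D1": 60, "D2": 50, "D3": 40},
                             preload={"D1", "D3"})
```

With a perfect preload choice, the selective policy here matches on-demand in words moved and complete preload in stalls; a miss would add both a stall and the missed block's transfer.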

6. METHODOLOGY IMPLEMENTATION

Efficient execution of applications on reconfigurable architectures demands a detailed analysis of their characteristics combined with the architecture's qualities and limitations. Our methodology is shown in Figure 4. The first step describes the application at the kernel level, as the sequence of kernels forming the application. The input data sizes for each kernel, as well as its sets of contexts, are completely determined in this step by a kernel information extractor. This information, combined with the time constraints, is used to determine the time windows available for overlapping computation with data transfers. Next, dynamic applications are subjected to a profiling study. This profiling helps to find memory access regularities, the possible input data size for each kernel (of which the actual kernel input data is a subset), and the most common target addresses after compare-and-branch blocks. All of this information is used to define a first approach to the prefetch technique that guarantees interactive results. Our goal is to provide data to each kernel prior to its execution in order to avoid processing stalls. Different fractions of each kernel's possible input data need to be tested in order to obtain a power / performance plot (see Figure 6). Given a power / performance trade-off for the application (how much power cost due to data transfers is allowed, and the maximum number of clock cycles suited for interactive results), and depending on the possible input data size of each kernel, the computation time windows, the memory bandwidth, the power constraints and the profiling results, the fraction of possible input data to preload is determined. Then contexts and data transfers are scheduled and the application is executed following the SIMD execution model described in the previous section. If necessary, a new profiling step can be performed using the SIMD behaviour as input in order to tune the power / performance trade-off.

Figure 4. Structure of the methodology
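The selection of a preload fraction can be sketched as a simple constrained search. The profile tuples below are hypothetical values loosely inspired by Figure 6, not measured data, and the function name is our own:

```python
def choose_fraction(profile, cycle_budget, power_budget):
    """Pick the smallest preload fraction whose estimated execution time
    meets the interactivity budget without exceeding the power budget.

    profile: list of (fraction, est_cycles, est_power) from profiling runs
    """
    feasible = [p for p in profile
                if p[1] <= cycle_budget and p[2] <= power_budget]
    if not feasible:
        return None                              # trade-off must be relaxed
    return min(feasible, key=lambda p: p[0])     # least data preloaded

# Hypothetical profiling results: preload fraction, cycles, normalized power.
prof = [(0.09, 9500, 130000), (0.18, 8700, 160000),
        (0.27, 8100, 190000), (0.45, 7100, 235000)]
```

Picking the smallest feasible fraction minimizes the power spent on speculative transfers while still meeting the cycle budget for interactivity.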

7. EXPERIMENTAL RESULTS

We have implemented synthetic experiments and an interactive ray tracing algorithm [9] for the MG architecture. The experimental framework executes applications on the target architecture and reports the execution time and the power consumed by data transfers to the on-chip memory. The four experiments are summarized in Table 1. Synthetic Experiment 1 (SE1) and Synthetic Experiment 2 (SE2) have the kernel structures illustrated in Figures 2a and 2b, respectively. The ray tracing application was used in the simplified form shown in Figure 5. In Table 1 only kernels that are possible targets of conditional blocks are considered. IDS-Ki / PIDS-Ki shows the ratio between the input data size and the possible input data size of kernel Ki. T/F Rate CBi is the ratio between true and false branch outcomes for the conditional block CBi. These values are obtained through profiling. The two ray tracing experiments differ in the octree depth parameter [14]. Figure 6 shows the power / performance trade-off delivered by different preloaded data sizes.


Figure 5. Simplified ray tracing from kernel level

Table 1. Experiments summary

Table 2 shows the experimental results for the synthetic applications SE1 and SE2 and the ray tracing experiments RT1 and RT2 subjected to different preloaded data sizes. The on-demand transfer technique was used as the reference to determine time and power improvements: it is the best solution in terms of power, since only required data are transferred, but it takes the longest to complete an application, because there is a processing stall for each data set a kernel requires. The results show that interactivity can be achieved by preloading a fraction of the applications' possible input data, with important power savings. One step further would be to implement a history-based predictor following the iteration-by-iteration behaviour: based on the data subsets processed in the previous iteration, a more accurate prediction can be made for the current one, keeping the reduced power consumption delivered by the selective preload policies.
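The history-based predictor suggested above could be sketched, under our own assumptions, as a one-iteration-deep table mapping each kernel to the data subset it processed last:

```python
class HistoryPredictor:
    """Illustrative one-step history predictor: preload for the current
    iteration the data subset the kernel processed in the previous one."""

    def __init__(self):
        self.last = {}              # kernel -> data subset seen last time

    def predict(self, kernel):
        # No history yet: nothing to preload speculatively.
        return self.last.get(kernel, set())

    def update(self, kernel, used):
        # Record what the kernel actually processed this iteration.
        self.last[kernel] = set(used)

predictor = HistoryPredictor()
predictor.update("K1", {"D1", "D3"})     # observed in iteration i
prediction = predictor.predict("K1")     # preload candidates for i + 1
```

A deeper history (several past iterations, weighted by recency) would be a natural refinement, at the cost of extra bookkeeping.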

8. CONCLUSIONS

We have proposed a methodology to analyze interactive applications in order to execute them on a SIMD reconfigurable architecture taking power / performance trade-offs into account. The methodology starts from a kernel-level description of the interactive application, whose kernels are executed conditionally depending on dynamic conditions such as the user's manipulation of input data.

Table 2. Power and time improvements

Figure 6 was obtained by varying the fraction of possible input data preloaded in the ray tracing experiments. The normalized power illustrated in Figure 6 is calculated as the ratio of the power consumed by data transfers to the power cost of transferring one 32-bit word. These results indicate how the power / performance behavior evolves as the size of the preloaded data subset increases. It is necessary to know how much power consumption due to data transfers is admissible in order to choose a proper preloaded data size, always aiming for interactive results. The best solution in terms of execution time corresponds to the biggest preloaded data subset and the highest power overhead; from that starting point, preloading smaller data subsets yields a rise in execution time and a lower power cost.

Figure 6. Power / performance trade-off (preload fractions of 9%, 18%, 27% and 45%; execution time falls from roughly 9500 to 7000 clock cycles as normalized power grows from roughly 120000 to 240000)

The volume of data involved in this kind of application, combined with user actions occurring at unexpected times, strongly impacts performance, as the experimental results demonstrate. Based on profiling results, we have defined a data prefetch scheme that avoids reconfigurable processing stalls due to the unpredictable behavior of dynamic applications. Through tuning via profiling, a trade-off between power and performance can be found that meets interactivity requirements. Experimental results satisfy the time constraints of interactive applications and show a power-effective solution for them.

9. REFERENCES

[1] S. Brown and J. Rose, "FPGA and CPLD Architectures: A Tutorial," IEEE Design and Test of Computers, Vol. 13, No. 2, pp. 42-57, 1996.


[2] E. Tau, D. Chen, I. Eslick et al., "A First Generation DPGA Implementation," Proc. Canadian Workshop on Field-Programmable Devices, pp. 138-143, May 1995.
[3] H. Singh, M. Lee, G. Lu et al., "MorphoSys: An Integrated Reconfigurable System for Data-Parallel and Computation-Intensive Applications," IEEE Transactions on Computers, Vol. 49, No. 5, pp. 465-480, May 2000.
[4] T. Miyamori and K. Olukotun, "A Quantitative Analysis of Reconfigurable Coprocessors for Multimedia Applications," Proc. IEEE Symp. on FPGAs for Custom Computing Machines, pp. 2-11, Apr. 1998.
[5] E. Mirsky and A. DeHon, "MATRIX: A Reconfigurable Computing Architecture with Configurable Instruction Distribution and Deployable Resources," Proc. IEEE Symp. on FPGAs for Custom Computing Machines, pp. 157-166, Apr. 1996.
[6] G. Smit, P. Havinga, L. Smit et al., "Dynamic Reconfiguration in Mobile Systems," Intl. Conf. on Field Programmable Logic and Applications (FPL), pp. 171-181, Sep. 2002.
[7] D. Moolenaar, L. Nachtergaele, F. Catthoor and H. De Man, "System-level Power Exploration for MPEG-2 Decoder on Embedded Cores: a Systematic Approach," IEEE Workshop on Signal Processing Systems, pp. 395-404, Nov. 1997.
[8] Q. Wu, A. Pyatakov, A. Spiridonov et al., "Exposing Memory Access Regularities Using Object-Relative Memory Profiling," Intl. Symp. on Code Generation and Optimization (CGO), pp. 315-323, Mar. 2004.

[9] M. Sánchez-Elez, H. Du, N. Tabrizi et al., "Algorithm Optimizations and Mapping Schemes for Interactive Ray Tracing on a Reconfigurable Architecture," Computers & Graphics, Vol. 27, pp. 701-713, Elsevier, 2003.
[10] F. Rivera, M. Sanchez-Elez, M. Fernandez et al., "Efficient Mapping of Hierarchical Trees on Coarse-Grain Reconfigurable Architectures," Intl. Conf. on Hardware/Software Codesign and System Synthesis (CODES+ISSS), pp. 30-35, Sep. 2004.
[11] M. Sanchez-Elez, M. Fernandez, N. Bagherzadeh and R. Hermida, "A Low Energy Data Management for Multi-Context Reconfigurable Architectures," in New Algorithms, Architectures and Applications for Reconfigurable Computing, P. Lysaght and W. Rosenstiel (Eds.), Kluwer, 2004.
[12] J. Resano, D. Verkest, D. Mozos, S. Vernalde and F. Catthoor, "A Run Time Scheduling Flow to Minimize the Reconfiguration Overhead of FPGAs," Microprocessors and Microsystems, Vol. 28/5-6, pp. 291-301, Elsevier, Aug. 2004.
[13] J. Lee, K. Choi et al., "Compilation Approach for Coarse-Grain Reconfigurable Architectures," IEEE Design and Test of Computers, Vol. 20, No. 1, pp. 26-33, Jan.-Feb. 2003.
[14] H. Du, M. Sanchez-Elez, N. Tabrizi et al., "Interactive Ray Tracing on Reconfigurable SIMD MorphoSys," Proc. ASP-DAC, Jan. 2003.
[15] M. Kamble and K. Ghose, "Analytical Energy Dissipation Models for Low Power Caches," ACM/IEEE Intl. Symp. on Microarchitecture, pp. 184-193, Dec. 1997.

