Integrating Software and Hardware Verification
Stream processing is a computer programming paradigm, equivalent to dataflow programming, event stream processing, and reactive programming, that allows some applications to more easily exploit a limited form of parallel processing.
Such applications can use multiple computational units, such as the floating point unit on a graphics processing unit or field-programmable gate arrays (FPGAs), without explicitly managing allocation, synchronization, or communication among those units.
The stream processing paradigm simplifies parallel software and hardware by restricting the parallel computation that can be performed. Given a sequence of data (a stream), a series of operations (kernel functions) is applied to each element in the stream. Kernel functions are usually pipelined, and optimal local on-chip memory reuse is attempted, in order to minimize the loss in bandwidth associated with external memory interaction.
Uniform streaming, where one kernel function is applied to all elements in the stream, is typical. Since the kernel and stream abstractions expose data dependencies, compiler tools can fully automate and optimize on-chip management tasks. Stream processing hardware can use scoreboarding, for example, to initiate a direct memory access (DMA) when dependencies become known.
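The idea of applying a series of kernel functions to each element of a stream can be sketched in plain C. This is only an illustration under stated assumptions: `kernel_fn`, `scale_by_two`, `add_one` and `run_stream` are hypothetical names, and a local variable stands in for on-chip storage of intermediate values.

```c
#include <stddef.h>

/* A kernel maps one input element to one output element.
   It only reads its argument and returns a result. */
typedef float (*kernel_fn)(float);

/* Two hypothetical kernels, used only for illustration. */
float scale_by_two(float x) { return 2.0f * x; }
float add_one(float x)      { return x + 1.0f; }

/* Apply a chain of kernels to every element of the input stream.
   The intermediate value stays in a local variable (standing in for
   on-chip storage); only the final result is written to the output. */
void run_stream(const float *in, float *out, size_t n,
                const kernel_fn *kernels, size_t num_kernels)
{
    for (size_t i = 0; i < n; ++i) {
        float value = in[i];
        for (size_t k = 0; k < num_kernels; ++k)
            value = kernels[k](value);
        out[i] = value;
    }
}
```

Because every element is processed independently and the input is never written, a compiler or runtime would be free to reorder or parallelize the outer loop; the sequential loop here is just the simplest way to express it.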
Historically, stream processing was explored within dataflow programming. Stream processing is essentially a compromise, driven by a data-centric model that works very well for traditional DSP or GPU-type applications (such as image, video and digital signal processing) but less so for general purpose processing with more randomized data access (such as databases).
By sacrificing some flexibility in the model, the implications allow easier, faster and more efficient execution. Depending on the context, processor design may be tuned for maximum efficiency or a trade-off for flexibility.
Stream processing is especially suitable for applications that exhibit three application characteristics. For each record we can only read from the input, perform operations on it, and write to the output. It is permissible to have multiple inputs and multiple outputs, but never a piece of memory that is both readable and writable.
Basic computers started from a sequential execution paradigm.
As the computing needs of the world evolved, the amount of data to be managed increased very quickly. It was obvious that the sequential programming model could not cope with the increased need for processing power. Various efforts have been spent on finding alternative ways to perform massive amounts of computations but the only solution was to exploit some level of parallel execution.
The result of those efforts was SIMD, a programming paradigm which allowed applying one instruction to multiple instances of (different) data. By using more complicated structures, one could also have MIMD parallelism. Although those two paradigms were efficient, real-world implementations were plagued with limitations, from memory alignment problems to synchronization issues and limited parallelism.
Consider a simple program adding up two arrays containing 4-component vectors. In the sequential paradigm, which is the most familiar, the program loops over the components one at a time. Variations do exist (such as inner loops, structures and such), but they ultimately boil down to that construct.
The SIMD formulation of the same program is usually presented in an oversimplified form. Although it matches what happens with instruction intrinsics, much information is not taken into account, such as the number of vector components and their data format.
This is done for clarity. The number of jump instructions is also decreased, as the loop is run fewer times.
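The difference between the sequential loop and the packed-SIMD loop can be sketched as follows. This is a portable illustration, not real intrinsics code: `vec4` and `vec4_add` are hypothetical stand-ins for a packed register and a single 4-wide add instruction, and the function names are assumptions made for this example.

```c
#include <stddef.h>

#define COMPONENTS 4  /* components per vector, as in the text */

/* Sequential paradigm: one addition (and one loop test) per component. */
void add_sequential(const float *a, const float *b, float *out,
                    size_t num_vectors)
{
    for (size_t i = 0; i < num_vectors * COMPONENTS; ++i)
        out[i] = a[i] + b[i];
}

/* Stand-in for a packed SIMD register holding four floats. */
typedef struct { float lane[COMPONENTS]; } vec4;

/* Stand-in for one 4-wide SIMD add. On real hardware the four lane
   additions would be issued as a single instruction. */
vec4 vec4_add(vec4 x, vec4 y)
{
    vec4 r;
    for (int l = 0; l < COMPONENTS; ++l)
        r.lane[l] = x.lane[l] + y.lane[l];
    return r;
}

/* Packed paradigm: one (emulated) vector add per iteration, so the loop
   runs num_vectors times instead of num_vectors * COMPONENTS times. */
void add_packed(const vec4 *a, const vec4 *b, vec4 *out, size_t num_vectors)
{
    for (size_t i = 0; i < num_vectors; ++i)
        out[i] = vec4_add(a[i], b[i]);
}
```

Both functions compute the same sums; only the number of loop iterations, and therefore the number of jumps, differs.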
These gains result from the parallel execution of the four mathematical operations. What happened however is that the packed SIMD register holds a certain amount of data so it's not possible to get more parallelism.
The speedup is somewhat limited by the assumption we made of performing four parallel operations (please note this is common for both AltiVec and SSE).
In the stream paradigm, the whole dataset is defined, rather than each component block being defined separately. The description of the set of data is assumed to come first.
After that, the result is inferred from the sources and kernel. For simplicity, there is a 1:1 mapping between input and output data, but this does not need to be the case. Applied kernels can also be much more complex.
An implementation of this paradigm can "unroll" a loop internally. This allows throughput to scale with chip complexity, easily utilizing hundreds of ALUs.
Although SIMD implementations can often work in a "streaming" manner, their performance is not comparable. It has been noted that when applied on generic processors such as a standard CPU, only a modest speedup can be reached.
By contrast, ad-hoc stream processors easily reach over 10x performance, mainly attributed to the more efficient memory access and higher levels of parallel processing. Although there are various degrees of flexibility allowed by the model, stream processors usually impose some limitations on the kernel or stream size.
For example, consumer hardware often lacks the ability to perform high-precision math, lacks complex indirection chains or presents lower limits on the number of instructions which can be executed. The most immediate challenge in the realm of parallel processing does not lie as much in the type of hardware architecture used, but in how easy it will be to program the system in question in a real-world environment with acceptable performance.
Machines like Imagine use a straightforward single-threaded model with automated dependencies, memory allocation and DMA scheduling. This in itself is a result of the research at MIT and Stanford in finding an optimal layering of tasks between programmer, tools and hardware.
Programmers beat tools in mapping algorithms to parallel hardware, and tools beat programmers in figuring out the smartest memory allocation schemes, etc.
Of particular concern are MIMD designs such as Cell, for which the programmer needs to deal with application partitioning across multiple cores as well as process synchronization and load balancing. Efficient multi-core programming tools are severely lacking today.
Programmers often wanted to build data structures with a 'real' meaning, for example a structure grouping a position (x, y, z), a color and a size.
What happened is that those structures were then assembled in arrays to keep things nicely organized. This is array of structures (AoS). When the structure is laid out in memory, the compiler will produce interleaved data, in the sense that all the structures will be contiguous but there will be a constant offset between, say, the "size" attribute of a structure instance and the same element of the following instance.
The offset depends on the structure definition and possibly other things not considered here such as compiler's policies.
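A minimal sketch of the array-of-structures layout, assuming a hypothetical `particle_t` whose fields follow the attributes mentioned in the text (position, color, size). The helper names are illustrative only, and the exact offsets depend on the compiler's padding rules.

```c
#include <stddef.h>

/* Hypothetical structure with a 'real' meaning. */
typedef struct {
    float x, y, z;           /* position */
    unsigned char color[3];  /* three color components */
    float size;
} particle_t;

/* In an AoS array the instances are contiguous, so the same attribute
   of two consecutive instances is sizeof(particle_t) bytes apart. */
size_t attribute_stride(void) { return sizeof(particle_t); }

/* Byte offset of "size" inside one instance; the compiler chooses it
   (it may insert padding after the three color bytes). */
size_t size_offset(void) { return offsetof(particle_t, size); }

/* Summing one attribute walks memory with a constant stride, skipping
   over the other attributes of every instance. */
float total_size(const particle_t *p, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += p[i].size;
    return sum;
}
```

Reading only `size` therefore touches one float out of every `sizeof(particle_t)` bytes, which is the interleaving described above.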
There are also other problems. For example, the three position variables cannot be SIMD-ized that way, because it is not guaranteed that they will be allocated in contiguous memory. To make sure SIMD operations can work on them, they should be grouped in a 'packed memory location' or at least in an array. Another problem lies in both "color" and "xyz" being defined as three-component vector quantities.
SIMD processors usually have support for 4-component operations only (with some exceptions, however). The proposed solution, structure of arrays (SoA), instead keeps one array per attribute. In C, each attribute becomes a pointer to the first element of an array, which is to be allocated later. For Java programmers, this is roughly equivalent to an array-typed field.
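A structure-of-arrays layout could look like the sketch below. The type `particles_soa_t` and the helpers `soa_alloc`, `soa_free` and `soa_set_red` are hypothetical names chosen for this example; the attribute names follow those used in the text.

```c
#include <stdlib.h>

/* Structure of arrays (SoA): one pointer per attribute, each pointing
   to the first element of an array that is allocated later. */
typedef struct {
    float *x, *y, *z;
    unsigned char *red, *green, *blue;
    float *size;
    size_t count;
} particles_soa_t;

/* Release every attribute array; safe on partially allocated structs. */
void soa_free(particles_soa_t *p)
{
    free(p->x); free(p->y); free(p->z);
    free(p->red); free(p->green); free(p->blue);
    free(p->size);
    p->x = p->y = p->z = p->size = NULL;
    p->red = p->green = p->blue = NULL;
    p->count = 0;
}

/* Allocate zero-initialized arrays for `count` elements.
   Returns 0 on success, -1 if any allocation fails. */
int soa_alloc(particles_soa_t *p, size_t count)
{
    p->x = calloc(count, sizeof *p->x);
    p->y = calloc(count, sizeof *p->y);
    p->z = calloc(count, sizeof *p->z);
    p->red = calloc(count, sizeof *p->red);
    p->green = calloc(count, sizeof *p->green);
    p->blue = calloc(count, sizeof *p->blue);
    p->size = calloc(count, sizeof *p->size);
    p->count = count;
    if (!p->x || !p->y || !p->z || !p->red || !p->green || !p->blue || !p->size) {
        soa_free(p);
        return -1;
    }
    return 0;
}

/* Channel-wise update: touches a single contiguous array, so all the
   "reds" are processed before any "green" or "blue" is visited. */
void soa_set_red(particles_soa_t *p, unsigned char value)
{
    for (size_t i = 0; i < p->count; ++i)
        p->red[i] = value;
}
```

Each attribute array is contiguous, so a loop over one attribute can be vectorized without stepping over unrelated fields.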
The drawback here is that the various attributes could be spread in memory. To make sure this does not cause cache misses, we'll have to update all the various "reds", then all the "greens" and "blues". For stream processors, the usage of structures is encouraged.
From an application point of view, all the attributes can be defined with some flexibility. Taking GPUs as reference, there is a set of attributes (at least 16) available. For each attribute, the application can state the number of components and the format of the components (but only primitive data types are supported for now). The various attributes are then attached to a memory block, possibly defining a stride between 'consecutive' elements of the same attribute, effectively allowing interleaved data.
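The attribute description above (component count, memory block, stride) can be sketched as a small descriptor. This is not any real graphics API: `attribute_t`, `attribute_at` and `gather` are hypothetical names, and the sketch assumes every component is a `float` and that the base address and stride keep elements suitably aligned.

```c
#include <stddef.h>

/* Hypothetical attribute binding: where the attribute lives in a memory
   block and how far apart consecutive elements are. */
typedef struct {
    const unsigned char *base;  /* first byte of element 0's attribute */
    size_t components;          /* number of float components */
    size_t stride;              /* bytes between consecutive elements */
} attribute_t;

/* Address of the attribute data belonging to element i. */
const float *attribute_at(const attribute_t *attr, size_t i)
{
    return (const float *)(const void *)(attr->base + i * attr->stride);
}

/* Collect all attributes of element i into one contiguous parameter
   block, as a kernel would see them. Returns the number of floats written. */
size_t gather(const attribute_t *attrs, size_t num_attrs, size_t i,
              float *params)
{
    size_t written = 0;
    for (size_t a = 0; a < num_attrs; ++a) {
        const float *src = attribute_at(&attrs[a], i);
        for (size_t c = 0; c < attrs[a].components; ++c)
            params[written++] = src[c];
    }
    return written;
}
```

With a stride equal to the size of the whole interleaved record, two descriptors can point into the same buffer at different offsets, which is how interleaved data is expressed in this sketch.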
When the GPU begins the stream processing, it will gather all the various attributes in a single set of parameters (usually this looks like a structure or a "magic global variable"), perform the operations, and scatter the results to some memory area for later processing or retrieval. More modern stream processing frameworks provide a FIFO-like interface to structure data as a literal stream.
Apart from specifying streaming applications in a high-level language, models of computation (MoCs) such as dataflow models and process-based models have also been widely used.
Historically, CPUs began implementing various tiers of memory access optimizations because of their ever-increasing performance when compared to the relatively slow-growing external memory bandwidth.
As this gap widened, large amounts of die area were dedicated to hiding memory latencies. A similar architecture exists on stream processors, but thanks to the new programming model, the number of transistors dedicated to management is actually very small. Beginning from a whole-system point of view, stream processors usually exist in a controlled environment.
GPUs do exist on an add-in board (this seems to also apply to Imagine). CPUs do the dirty job of managing system resources, running applications and such. The stream processor is usually equipped with a fast, efficient, proprietary memory bus (crossbar switches are now common; multi-buses have been employed in the past). The exact number of memory lanes depends on the market range. As this is written, comparatively narrow interconnections are still found on entry-level parts.
By contrast, standard processors from Intel Pentium to some Athlon 64 have only a single data bus.
Memory access patterns are much more predictable. While arrays do exist, their dimension is fixed at kernel invocation. The thing which most closely matches a multiple pointer indirection is an indirection chain, which is however guaranteed to finally read or write from a specific memory area (inside a stream).
This also allows for efficient memory bus negotiations. This is where knowing the kernel temporaries and dependencies pays off.
Internally, a stream processor features some clever communication and management circuits, but what is interesting is the Stream Register File (SRF). This is conceptually a large cache in which stream data is stored to be transferred to external memory in bulk. The key concept and innovation here, done with the Imagine chip, is that the compiler is able to automate and allocate memory in an optimal way, fully transparent to the programmer.
The dependencies between kernel functions and data are known through the programming model, which enables the compiler to perform flow analysis and optimally pack the SRFs. Commonly, this cache and DMA management can take up the majority of a project's schedule, something the stream processor (or at least Imagine) totally automates. Tests done at Stanford showed that the compiler did as good a job or better at scheduling memory than if you hand-tuned the thing with much effort.
There is proof; there can be a lot of clusters because inter-cluster communication is assumed to be rare. Internally, however, each cluster can efficiently exploit a much lower number of ALUs because intra-cluster communication is common and thus needs to be highly efficient.
This three-tiered data access pattern makes it easy to keep temporary data away from slow memories, thus making the silicon implementation highly efficient and power-saving. Although an order of magnitude speedup can be reasonably expected (even from mainstream GPUs when computing in a streaming manner), not all applications benefit from this.