Computer Architecture and Systems Laboratory






























Heterogeneous Computing
Sponsors: IBM, Intel, LogicBlox, NSF, and NVIDIA.

Current trends have led to the development of chip-scale and rack-scale heterogeneous many-core platforms: large-scale systems composed of homogeneous general-purpose cores intermingled with customized heterogeneous cores and using diverse memory and cache hierarchies. These systems have had a disruptive impact on the software infrastructure and present numerous architecture and system challenges. Our efforts are anchored in three systems projects: the Ocelot dynamic execution infrastructure, the Harmony runtime, and Red Fox. Ocelot and Harmony are the basis for Red Fox, a joint effort between LogicBlox Inc. and CASL to develop a compilation environment for multi-GPU enterprise applications written in a domain-specific declarative language. CASL is an active participant in the NVIDIA Center of Excellence established at Georgia Tech.

Ocelot

Ocelot is a dynamic compilation framework designed to map the explicitly data-parallel execution model used by NVIDIA CUDA applications onto diverse multithreaded platforms. Ocelot includes a dynamic binary translator from the Parallel Thread eXecution ISA (PTX) to many-core processors; the translator leverages the open source Low Level Virtual Machine (LLVM) code generator to target x86 and other ISAs. The dynamic compiler can execute existing CUDA binaries without recompilation from source and supports switching between execution on an NVIDIA GPU and a many-core x86 CPU at runtime. It has been validated against over 130 applications taken from the CUDA SDK, the UIUC Parboil benchmarks, the Virginia Rodinia benchmarks, the GPU-VSIPL signal and image processing library, the Thrust library, and several domain-specific applications.

Ocelot is an open source project intended to provide a set of binary translation tools from PTX to diverse many-core architectures. It currently includes an internal representation for PTX, a PTX parser and assembly emitter, a set of PTX-to-PTX transformation passes, a PTX emulator, a dynamic compiler to many-core CPUs, a dynamic compiler to NVIDIA GPUs, and a full re-implementation of the CUDA runtime that permits kernels within a single application to be executed on different back-ends. The emulator, many-core code generator, and GPU code generator support the full PTX 1.4 specification, with support for Fermi planned to be available by the end of 2010.

A variety of projects are in progress at GT and elsewhere (NEU, for example) to add further back-ends. In addition, a number of correctness and debugging tools have been, and continue to be, developed around the emulator. The project continues to benefit from the participation of many researchers at other institutions in the form of ideas, feedback, and functionality.
Source Code
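To make the idea of mapping a data-parallel kernel onto a multithreaded CPU concrete, the following is a minimal Python sketch. It is not taken from the Ocelot source and does not involve PTX or LLVM; the names launch_on_cpu, run_block and vector_add_kernel are hypothetical. It assumes a kernel with no barrier synchronization, so the threads of one block can simply be serialized in a loop while the blocks are distributed over CPU worker threads.

```python
from concurrent.futures import ThreadPoolExecutor


def vector_add_kernel(block_idx, thread_idx, block_dim, a, b, out):
    """Body of a CUDA-style kernel: one call handles one logical thread."""
    i = block_idx * block_dim + thread_idx  # global thread index
    if i < len(out):  # guard for the partially filled last block
        out[i] = a[i] + b[i]


def run_block(kernel, block_idx, block_dim, args):
    """Serialize all logical threads of one block onto the calling CPU thread.

    Only valid for kernels without barriers (an assumption of this sketch).
    """
    for thread_idx in range(block_dim):
        kernel(block_idx, thread_idx, block_dim, *args)


def launch_on_cpu(kernel, grid_dim, block_dim, *args, workers=4):
    """Hypothetical launcher: distribute thread blocks over a CPU thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(run_block, kernel, block_idx, block_dim, args)
            for block_idx in range(grid_dim)
        ]
        for future in futures:
            future.result()  # re-raise any exception from a worker


n = 1000
a = list(range(n))
b = [2 * x for x in a]
out = [0] * n
block_dim = 128
grid_dim = (n + block_dim - 1) // block_dim  # round up to cover all elements
launch_on_cpu(vector_add_kernel, grid_dim, block_dim, a, b, out)
```

The kernel body is unchanged by the choice of launcher, which is the property that lets a runtime pick a back-end per kernel. Kernels that use barriers would need the block loop to be split at each barrier, which this sketch does not attempt.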

Harmony

Harmony is a runtime-supported programming and execution model that provides: (1) semantics for intuitively managing parallelism, (2) dynamic mappings from compute-intensive kernels to heterogeneous processor resources, and (3) online monitoring and performance optimization services for heterogeneous many-core systems. The programming model is based on the identification of compute kernels, predicated kernel execution, and a managed shared address space. The execution model is based on dynamic detection and tracking of dependencies between compute kernels (enabled by the programming model), and on decoupling kernel invocation by the application from kernel scheduling and execution on a core. The approach is inspired by solutions to instruction scheduling and management in out-of-order (OOO) superscalar processors, adapted here to schedule kernels on diverse cores. When integrated with Ocelot, the result is portable execution across a range of system configurations.

Scalable performance is maintained via a two-step solution: producer/consumer dependencies are first inferred for a window of compute kernels that have yet to execute, and are then used as constraints for a scheduler that attempts to minimize the execution time of the application while satisfying all dependencies. Optimizations implemented in Harmony include kernel-level speculative execution for performance, and online construction and application of performance models to drive scheduling decisions. Harmony has been demonstrated on single-node multi-GPU systems and is now being targeted towards high-node-count systems with a large number of GPUs (on the order of hundreds). The first target will be the Keeneland system. Most recently, Harmony has defined a kernel-level intermediate representation to facilitate the integration of multiple front-ends. The first application will be in Red Fox (below).
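The two-step approach described above (infer dependencies for a window of pending kernels, then schedule under those constraints) can be illustrated with a small Python sketch. This is not Harmony's implementation: the Kernel record, the per-device cost table, and the greedy earliest-finish heuristic are all assumptions made for illustration. The sketch also conservatively treats write-after-read and write-after-write conflicts as dependencies, in addition to the producer/consumer (read-after-write) case.

```python
from dataclasses import dataclass


@dataclass
class Kernel:
    name: str
    reads: set   # names of buffers the kernel reads
    writes: set  # names of buffers the kernel writes
    cost: dict   # device name -> estimated run time (hypothetical model)


def infer_dependencies(window):
    """Step 1: map each kernel index to the earlier indices it must wait for."""
    deps = {j: set() for j in range(len(window))}
    for j, later in enumerate(window):
        for i in range(j):
            earlier = window[i]
            raw = earlier.writes & later.reads   # producer/consumer
            war = earlier.reads & later.writes   # conservative assumption
            waw = earlier.writes & later.writes  # conservative assumption
            if raw or war or waw:
                deps[j].add(i)
    return deps


def schedule(window, deps, devices):
    """Step 2: greedy heuristic that places each kernel, in window order,
    on the device giving it the earliest finish time."""
    device_free = {d: 0.0 for d in devices}
    finish = {}
    plan = []
    for j, kernel in enumerate(window):
        # A kernel may start only after all kernels it depends on finish.
        ready = max((finish[i] for i in deps[j]), default=0.0)
        best = min(
            (d for d in devices if d in kernel.cost),
            key=lambda d: max(ready, device_free[d]) + kernel.cost[d],
        )
        start = max(ready, device_free[best])
        finish[j] = start + kernel.cost[best]
        device_free[best] = finish[j]
        plan.append((kernel.name, best, start, finish[j]))
    return plan


window = [
    Kernel("load_a", set(), {"A"}, {"cpu": 2.0, "gpu": 3.0}),
    Kernel("load_b", set(), {"B"}, {"cpu": 2.0, "gpu": 3.0}),
    Kernel("multiply", {"A", "B"}, {"C"}, {"cpu": 10.0, "gpu": 1.0}),
    Kernel("reduce", {"C"}, {"D"}, {"cpu": 4.0, "gpu": 2.0}),
]
deps = infer_dependencies(window)
plan = schedule(window, deps, ["cpu", "gpu"])
```

In this example the two independent loads run concurrently on different devices, while multiply waits for both of them. A greedy pass of this kind is not guaranteed to find the minimum execution time; it only shows how inferred dependencies act as constraints on placement.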

Red Fox

The goal of this project is to harness the cost and performance advantages of GPUs for data-intensive computations in enterprise applications. Towards this end we are working with LogicBlox Inc. (LB), a company that specializes in enterprise-class applications for decision automation, analytics, and planning. This joint project is developing a compilation and execution environment that integrates the front-end from LB with the Harmony and Ocelot execution environment. The LB front-end is based on Datalog, a declarative language originally developed as a query language for deductive databases. The LB toolset is applied to data-intensive applications and currently executes on commodity clusters. The major components of Red Fox are the LB Datalog front-end, the Harmony runtime, and the Ocelot dynamic compiler. The integration has driven the development of a kernel intermediate representation that could lay the foundation for the integration of other front-ends, while the compilation chain will incorporate domain-specific compiler and runtime optimizations. The first instantiation is focused on the implementation and optimization of Relational Algebra operators.
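As a rough illustration of why Relational Algebra operators are a natural first target, the Python sketch below evaluates one Datalog rule as a join followed by a projection. The rule is written in textbook Datalog syntax, which may differ from the LB dialect; the functions select, project and join are hypothetical sequential stand-ins, not Red Fox kernels or GPU code.

```python
def select(relation, predicate):
    """Keep only the tuples that satisfy the predicate."""
    return {t for t in relation if predicate(t)}


def project(relation, columns):
    """Keep only the listed column positions; duplicates collapse in the set."""
    return {tuple(t[c] for c in columns) for t in relation}


def join(left, right, left_col, right_col):
    """Hash join: build an index on `right`, then probe it with `left`."""
    index = {}
    for rt in right:
        index.setdefault(rt[right_col], []).append(rt)
    return {lt + rt for lt in left for rt in index.get(lt[left_col], [])}


parent = {("ann", "bob"), ("bob", "cid"), ("bob", "dee"), ("eve", "ann")}

# grandparent(X, Z) :- parent(X, Y), parent(Y, Z).
joined = join(parent, parent, 1, 0)     # tuples of the form (X, Y, Y, Z)
grandparent = project(joined, (0, 3))   # keep X and Z

# A selection on the derived relation: grandchildren of "ann".
anns = select(grandparent, lambda t: t[0] == "ann")
```

Each operator processes tuples independently or through a simple build-and-probe pattern, which is the kind of regular, data-intensive work that can be expressed as compute kernels.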

Publications