While memory consistency models have long been an active topic in the computer architecture community, the programming languages and compilers community has paid little attention to correctness and ease of programming with respect to the underlying memory consistency model. The influence of the memory consistency model on the programming model is becoming more important because of the widespread use of SMPs and the increasing need for multi-threaded languages and programming interfaces, such as Java, OpenMP, and Pthreads.
Unfortunately, the usability of memory consistency models, their impact on performance, and the compiler technology needed to optimize parallel programs are poorly understood. A concrete example is the current Java Memory Model, which is a relaxed memory consistency model. Java is the first widespread concurrent programming language that explicitly specifies a memory model at the language level. However, the Java memory model is very difficult for programmers to understand, and it can be interpreted in several ways. In addition, as in most programming languages that follow the shared memory parallel programming model, non-deterministic behavior due to data races can occur in concurrent Java programs. The combination of the non-intuitive Java memory model and this non-deterministic behavior makes it difficult for a programmer to write correct and efficient concurrent Java programs. Moreover, relaxed memory consistency models make programming and porting more difficult, even though most current shared memory multiprocessors have a relaxed memory consistency model and provide a wide variety of hardware-level optimizations. Programmers expect memory to behave as it does on a uniprocessor that executes several tasks concurrently. From this point of view, sequential consistency is arguably the most intuitive and natural memory consistency model for programmers.
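The gap between programmer intuition and a relaxed model can be illustrated with the classic "store buffering" test. The following is a minimal sketch, not code from the project: the class, field, and method names are hypothetical, and it assumes the volatile semantics of Java 5 and later, under which the two volatile variables make the program free of data races and therefore sequentially consistent. Under sequential consistency at least one thread must observe the other's write, so the outcome r1 == 0 and r2 == 0 is impossible.

```java
// Hypothetical litmus test; class and field names are illustrative only.
class StoreBufferingLitmus {
    // Assumption: volatile as defined in Java 5 and later. Accesses to
    // volatile variables are totally ordered, so this program has no data race.
    static volatile int x, y;
    // Each result is written by one thread and read only after join().
    static int r1, r2;

    /** Runs the two-thread test `rounds` times and counts r1 == 0 && r2 == 0. */
    static int countBothZero(int rounds) {
        int bothZero = 0;
        try {
            for (int i = 0; i < rounds; i++) {
                x = 0;
                y = 0;
                Thread t1 = new Thread(() -> { x = 1; r1 = y; });
                Thread t2 = new Thread(() -> { y = 1; r2 = x; });
                t1.start();
                t2.start();
                t1.join();
                t2.join();
                // Forbidden under sequential consistency: both threads
                // would have to read before either write became visible.
                if (r1 == 0 && r2 == 0) {
                    bothZero++;
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return bothZero;
    }
}
```

If the volatile modifier is removed, the accesses to x and y race, and the language no longer rules out the outcome in which both reads return 0. Whether that outcome is actually observed depends on the compiler and the hardware.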
This research focuses on building an optimizing compiler for explicitly parallel shared memory programs that hides the underlying relaxed memory consistency model. The compiler presents an intuitive and natural memory consistency model (e.g., sequential consistency) to the programmer. It shifts the burden of reasoning about the underlying machine architecture from the programmer to the compiler, which makes programming and debugging easier. Moreover, it provides correct compiler-level optimizations that conventional compilers do not consider. The compiler uses Shasha and Snir's delay set analysis and the Concurrent Static Single Assignment (CSSA) program representation to distinguish the effects of different memory consistency models on compiler optimization and analysis techniques. In addition, the compiler will serve as a testbed for prototyping new memory consistency models at the language level and for measuring the effects of different memory models on program performance. Because of its widespread deployment and its use in general-purpose, commercially important applications programming, the project targets Java or a Java-like language.
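To give a feel for delay set analysis, the sketch below builds a graph whose nodes are shared-memory accesses, with program-order edges inside each thread and conflict edges between accesses from different threads to the same variable when at least one is a write. It then reports every program-order pair that lies on a cycle through that graph. This is a deliberately simplified, conservative approximation: Shasha and Snir's analysis considers only minimal (critical) cycles, so this sketch may report more delays than necessary. All class and method names are hypothetical and are not taken from the project's compiler.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical, simplified sketch; not the project's implementation.
class DelaySetSketch {
    /** One shared-memory access at position `index` in thread `thread`. */
    static final class Access {
        final int thread;
        final int index;
        final String var;
        final boolean isWrite;

        Access(int thread, int index, String var, boolean isWrite) {
            this.thread = thread;
            this.index = index;
            this.var = var;
            this.isWrite = isWrite;
        }
    }

    // Program-order edge: same thread, a precedes b.
    static boolean programOrder(Access a, Access b) {
        return a.thread == b.thread && a.index < b.index;
    }

    // Conflict edge: different threads, same variable, at least one write.
    static boolean conflict(Access a, Access b) {
        return a.thread != b.thread && a.var.equals(b.var)
                && (a.isWrite || b.isWrite);
    }

    /** Searches for a path from `from` to `to` over both kinds of edges. */
    static boolean reaches(List<Access> all, Access from, Access to) {
        Deque<Access> work = new ArrayDeque<>();
        List<Access> seen = new ArrayList<>();
        work.push(from);
        seen.add(from);
        while (!work.isEmpty()) {
            Access cur = work.pop();
            if (cur == to) {
                return true;
            }
            for (Access next : all) {
                if (!seen.contains(next)
                        && (programOrder(cur, next) || conflict(cur, next))) {
                    seen.add(next);
                    work.push(next);
                }
            }
        }
        return false;
    }

    /** Program-order pairs "u->v" that close a cycle and so must stay ordered. */
    static List<String> delays(List<Access> all) {
        List<String> result = new ArrayList<>();
        for (Access u : all) {
            for (Access v : all) {
                // Any path from v back to u must use a conflict edge,
                // because program order only runs forward within a thread.
                if (programOrder(u, v) && reaches(all, v, u)) {
                    result.add(label(u) + "->" + label(v));
                }
            }
        }
        return result;
    }

    static String label(Access a) {
        return "T" + a.thread + "." + a.index;
    }

    /** Example input: T0 runs x = 1; r1 = y and T1 runs y = 1; r2 = x. */
    static List<Access> storeBuffering() {
        List<Access> p = new ArrayList<>();
        p.add(new Access(0, 0, "x", true));
        p.add(new Access(0, 1, "y", false));
        p.add(new Access(1, 0, "y", true));
        p.add(new Access(1, 1, "x", false));
        return p;
    }
}
```

For the store buffering input, both program-order pairs are reported, which means that neither thread's write may be reordered with its following read if sequential consistency is to be preserved. Pairs that are not reported are candidates for reordering by the compiler or the hardware.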
Processor In Memory (PIM) technology integrates processor logic and DRAM on the same chip. One interesting use of these chips is to replace the main memory chips in a workstation or a server. In this case, PIM chips act as co-processors in memory that execute code when signaled by the host (main) processor. The target intelligent memory architecture has a heterogeneous mix of processors: host processors and memory processors. Typically, a host processor is a wide-issue superscalar with a deep cache hierarchy and a long memory access latency, while a memory processor is a simple, narrow-issue superscalar with only a small cache and a low memory access latency. Cache coherence of the system is guaranteed by the compiler.
Other compiler approaches for processor-in-memory architectures often differ little from compiling for systems in which the processors in the memory chips are the main processors. Our approach instead automatically partitions the code into sections and schedules each section on the processor that suits it best. We exploit the heterogeneity of the system in addition to the parallelism provided by the architecture. To do so, we use adaptive execution techniques with dynamic feedback in addition to performance prediction. When the intelligent memory architecture has multiple host processors and/or memory processors, we expect the communication overhead between processors to be large and compiler-managed cache coherence and memory consistency to become more complicated. In this case, the compiler inserts code that makes the program execute adaptively in order to reduce these overheads. To fully exploit the parallelism provided by the architecture, we are exploring speculative execution techniques on the intelligent memory architecture.
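The combination of performance prediction and dynamic feedback might look roughly like the following sketch. It is an illustration under stated assumptions, not the project's mechanism: the class AdaptiveScheduler, its methods, and the policy of sampling each processor once before comparing running averages are all hypothetical. A code section starts on the processor chosen by a static prediction, each processor is then tried at least once, and later invocations go to whichever processor has the lower average measured cost.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of feedback-driven section placement.
class AdaptiveScheduler {
    enum Processor { HOST, MEMORY }

    // Per section: running mean cost and sample count, indexed by ordinal().
    private final Map<String, double[]> meanCost = new HashMap<>();
    private final Map<String, int[]> samples = new HashMap<>();

    /** Records one measured execution of `section` on processor `p`. */
    void report(String section, Processor p, double nanos) {
        double[] mean = meanCost.computeIfAbsent(section, k -> new double[2]);
        int[] count = samples.computeIfAbsent(section, k -> new int[2]);
        int i = p.ordinal();
        count[i]++;
        mean[i] += (nanos - mean[i]) / count[i]; // incremental mean
    }

    /** Picks a processor for the next execution of `section`. */
    Processor choose(String section, Processor predicted) {
        int[] count = samples.get(section);
        if (count == null) {
            return predicted; // no feedback yet: trust the static prediction
        }
        // Sample any processor that has not been measured yet.
        if (count[Processor.HOST.ordinal()] == 0) {
            return Processor.HOST;
        }
        if (count[Processor.MEMORY.ordinal()] == 0) {
            return Processor.MEMORY;
        }
        double[] mean = meanCost.get(section);
        return mean[Processor.HOST.ordinal()] <= mean[Processor.MEMORY.ordinal()]
                ? Processor.HOST
                : Processor.MEMORY;
    }
}
```

In a real system the reported costs would come from timers or hardware counters in the code that the compiler inserts, and the policy would also have to account for the cost of migrating data between host and memory processors, which this sketch ignores.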
The target applications include floating point, integer, multimedia, and object-oriented applications. By building a compiler, we measure how effectively the heterogeneity of the intelligent memory architecture can be exploited in terms of performance and low power consumption, and how effective adaptive and speculative execution techniques are on the intelligent memory architecture.