Utilizing Memory Parallelism for High Performance Data Processing


While advances in microprocessor design continue to increase computing speed, improvements in data access speed lag far behind. At the same time, data-intensive large-scale applications, such as information retrieval, computer animation, and big data analytics, are emerging. Data access delay has become the primary performance bottleneck of modern high performance computing (HPC). Memory concurrency exists at each layer of modern memory hierarchies; however, conventional computing systems are designed primarily to improve CPU utilization and have inherent limitations in addressing the critical issue of data movement in HPC. To address the data movement bottleneck, this project extracts the general principles of parallel memory systems by building on the new Concurrent-AMAT metric, a set of fresh theoretical results in data access concurrency, a series of recent successes in parallel I/O optimization, and new technology opportunities.

Customized Parallel Memory (CuPM) architecture

Current HPC systems do not fully recognize or exploit the concurrency that modern memory systems provide. The key new idea of CuPM is to establish a systematic way of exploring, enhancing, utilizing, and customizing memory concurrency to build effective memory systems. It will be developed with two goals: to understand and reveal memory concurrency and its properties, and to explore and utilize the memory concurrency that current systems already offer. Figure 1 shows the system design of CuPM.

Figure 1. The system design of CuPM

Recently proposed theory and techniques

Memory Sluice Gate Theory. This theory takes an architectural perspective on mitigating the “memory wall” problem. Its focus is not hardware peak performance but the memory stall time actually incurred. Under Sluice Gate Theory, a memory system is built to transfer data and to mask the performance gap between the CPU and memory devices during the transfer. Sluice gates dynamically control data transfer at each memory layer (i.e., sluice stage), and a global control algorithm matches data transfer request and supply at each layer, thereby matching the overall performance of the CPU and the memory system. The theory considers data access concurrency as well as traditional data access locality. In place of the conventional miss event, it introduces the “pure miss” to better measure the performance of a parallel memory system. The correctness of the theory is verified with rigorous mathematical proofs and supported by the associated C-AMAT (Concurrent Average Memory Access Time) model.
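
To make the role of concurrency concrete, the following minimal Python sketch contrasts the conventional AMAT with C-AMAT as defined in the C-AMAT publications listed below: AMAT = H + MR × AMP, and C-AMAT = H/C_H + pMR × pAMP/C_M, where C_H is hit concurrency, C_M is pure miss concurrency, and pMR and pAMP are the pure miss rate and pure average miss penalty. The function names and all numeric inputs are hypothetical and chosen only for illustration; they are not measurements from this project.

```python
def amat(hit_time, miss_rate, avg_miss_penalty):
    """Conventional AMAT in cycles per access: H + MR * AMP."""
    return hit_time + miss_rate * avg_miss_penalty


def c_amat(hit_time, hit_concurrency, pure_miss_rate,
           pure_avg_miss_penalty, pure_miss_concurrency):
    """C-AMAT in cycles per access: H / C_H + pMR * pAMP / C_M."""
    if hit_concurrency <= 0 or pure_miss_concurrency <= 0:
        raise ValueError("concurrency values must be positive")
    hit_term = hit_time / hit_concurrency
    miss_term = pure_miss_rate * pure_avg_miss_penalty / pure_miss_concurrency
    return hit_term + miss_term


if __name__ == "__main__":
    # Hypothetical inputs (assumptions for illustration only).
    sequential = amat(hit_time=2.0, miss_rate=0.1, avg_miss_penalty=100.0)
    concurrent = c_amat(hit_time=2.0, hit_concurrency=2.0,
                        pure_miss_rate=0.05, pure_avg_miss_penalty=100.0,
                        pure_miss_concurrency=4.0)
    print(f"AMAT   = {sequential:.2f} cycles per access")
    print(f"C-AMAT = {concurrent:.2f} cycles per access")
```

When both concurrency values are 1 and every miss is a pure miss, the C-AMAT expression reduces to the conventional AMAT, which is a quick sanity check on any implementation.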

LPM: Concurrency-driven Layered Performance Matching. The rationale of LPM is that the performance of each layer of a memory hierarchy should, and can, be optimized to closely match the requests of the layer directly above it. The LPM model considers data access concurrency and locality simultaneously. It shows that increasing the effective overlap between hits and misses at the higher layer alleviates the performance impact of the lower layer. Figure 2 shows the details of LPM.

Figure 2. LPM (Layered Performance Matching)
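
The layer-by-layer matching idea can be illustrated with a small Python sketch. It is only an illustration under stated assumptions, not the formulation from the LPM paper: each layer's supply rate is taken as the reciprocal of its C-AMAT (accesses per cycle), the matching ratio is taken as request rate divided by supply rate, and a layer is flagged when that ratio exceeds a chosen threshold. The `Layer` type, the function names, the threshold, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    request_rate: float  # accesses per cycle demanded by the layer above (assumed)
    c_amat: float        # cycles per access delivered by this layer (assumed)


def matching_ratio(layer):
    """Request rate divided by supply rate, with supply rate = 1 / C-AMAT."""
    if layer.c_amat <= 0:
        raise ValueError("C-AMAT must be positive")
    supply_rate = 1.0 / layer.c_amat
    return layer.request_rate / supply_rate


def mismatched_layers(layers, threshold):
    """Names of layers whose demand exceeds supply by more than the threshold."""
    return [layer.name for layer in layers if matching_ratio(layer) > threshold]


if __name__ == "__main__":
    # Hypothetical hierarchy: requests reaching L2 are the misses of L1.
    hierarchy = [
        Layer(name="L1", request_rate=0.4, c_amat=2.0),
        Layer(name="L2", request_rate=0.04, c_amat=40.0),
    ]
    for layer in hierarchy:
        print(layer.name, round(matching_ratio(layer), 3))
    print("needs optimization:", mismatched_layers(hierarchy, threshold=1.0))
```

In this sketch a ratio above the threshold marks a layer that cannot keep up with the layer above it. Lowering that layer's C-AMAT, for example by raising its hit or pure miss concurrency, brings the ratio back down.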


  • X.-H. Sun and Y.-H. Liu, "Utilizing Concurrency Data Access: A New Theory," in Proc. of the 29th International Workshop on Languages and Compilers for Parallel Computing (LCPC2016) (a position paper), New York, USA, Sept. 2016.

  • Yu-Hang Liu and Xian-He Sun, "Reevaluating Data Stall Time with the Consideration of Data Access Concurrency," Journal of Computer Science and Technology, vol. 30, no. 2, pp. 227-245, Mar. 2015.

  • Yu-Hang Liu and Xian-He Sun, "C^2-bound: A Capacity and Concurrency driven Analytical Model for Manycore Design," in Proc. of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis 2015 (SC'15), Austin, Texas, USA, Nov. 2015.

  • Yu-Hang Liu and Xian-He Sun, "LPM: Concurrency-driven Layered Performance Matching," in Proc. of the 44th International Conference on Parallel Processing (ICPP'15), Beijing, China, Sept. 2015.

  • Dawei Wang and Xian-He Sun, "APC: A Novel Memory Metric and Measurement Methodology for Modern Memory System," IEEE Transactions on Computers, vol. 63, no. 7, pp. 1626-1639, July 2014.

  • Xian-He Sun, "C-AMAT: a data access model for the Big Data era," Communication of CCF, vol. 10, no. 6, pp. 19-22, June 2014.

  • X.-H. Sun and D. Wang, "Concurrent Average Memory Access Time," IEEE Computer, vol. 47, no. 5, pp. 74-80, May 2014.

  • Xian-He Sun, "Concurrent-AMAT: a mathematical model for Big Data access," HPC Today, May 2014.

Contact:

    Xian-He Sun
    Department of Computer Science
    Illinois Institute of Technology
    Chicago, IL 60616
