publications
A more frequently updated list of publications can be found on Google Scholar.
The following publications are listed by category in reverse chronological order.
2023
- Chaosity: Understanding Contemporary NUMA-architectures (2023)
Modern hardware is increasingly complex, requiring ever more effort to understand in order to carefully engineer systems for optimal performance and effective utilization. Moreover, established design principles and assumptions are not portable to modern hardware because: 1) Non-Uniform Memory Access (NUMA) architectures are becoming increasingly complex and diverse across CPU vendors; chiplet-based architectures provide hierarchical rather than flat NUMA topologies, while heterogeneous compute cores (e.g., Apple Silicon) and on-chip accelerators (e.g., Intel Sapphire Rapids) are becoming the norm, materializing the vision of workload- and requirement-specific compute scheduling. 2) Increasing IO bandwidth (e.g., arrays of NVMe drives approaching memory bandwidth) is a double-edged sword: high-bandwidth IO can interfere with concurrent memory accesses because the IO target is itself memory, and hence IO consumes memory bandwidth. 3) Interference modeling is becoming more complex in modern hierarchical NUMA and on-chip heterogeneous architectures due to topology obliviousness. Therefore, system designs need to be hardware-topology-aware, which requires understanding the bottlenecks and data-flow characteristics, and then adapting scheduling to the given hardware topology. Modern hardware promises performance through powerful and complex yet non-intuitive computing models, which require tuning specifically for the target hardware or risk under-utilizing it. Therefore, system designers need to understand, carefully engineer, and adapt to the target hardware to avoid unnecessarily hitting bottlenecks in the hardware topology. In this paper, we propose the Chaosity framework, which enables system designers to systematically analyze, benchmark, and understand complex system topologies, their bandwidth characteristics, and the interference effects of data access paths, including memory and PCIe-based IO. Chaosity aims to provide critical insights for system designs and workload schedulers on modern NUMA hierarchies.
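To make the bandwidth measurements concrete, below is a minimal C++ sketch of the kind of microbenchmark such a framework automates: a set of threads stream over a large buffer and the achieved read bandwidth is reported. Thread pinning and NUMA-local allocation (e.g., via libnuma) are omitted, and the buffer size and thread count are illustrative assumptions, not Chaosity's actual configuration.

```cpp
// Minimal read-bandwidth sketch; the real framework would pin threads to
// specific cores and allocate the buffer on specific NUMA nodes.
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    constexpr std::size_t kBytes = std::size_t{1} << 30;  // 1 GiB working set (assumed)
    constexpr int kThreads = 8;                           // assumed reader count
    std::vector<std::uint64_t> buf(kBytes / sizeof(std::uint64_t), 1);

    std::vector<std::uint64_t> sums(kThreads, 0);
    auto start = std::chrono::steady_clock::now();
    {
        std::vector<std::thread> workers;
        std::size_t chunk = buf.size() / kThreads;
        for (int t = 0; t < kThreads; ++t) {
            workers.emplace_back([&, t] {
                // Each worker streams over its chunk; summing the values
                // prevents the compiler from eliding the loads.
                auto begin = buf.begin() + t * chunk;
                sums[t] = std::accumulate(begin, begin + chunk, std::uint64_t{0});
            });
        }
        for (auto& w : workers) w.join();
    }
    auto secs = std::chrono::duration<double>(
                    std::chrono::steady_clock::now() - start).count();
    std::cout << "read bandwidth: " << (kBytes / secs) / 1e9 << " GB/s"
              << " (checksum " << std::accumulate(sums.begin(), sums.end(),
                                                  std::uint64_t{0}) << ")\n";
}
```

Running the same loop with the buffer placed on different nodes, or concurrently with NVMe traffic, is how topology and interference effects of the kind the paper studies become visible.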
- One-shot Garbage Collection for In-memory OLTP through Temporality-aware Version Storage (2023)
Most modern in-memory online transaction processing (OLTP) engines rely on multi-version concurrency control (MVCC) to provide data consistency guarantees in the presence of conflicting data accesses. MVCC improves concurrency by generating a new version of a record on every write, thus increasing the storage requirements. Existing approaches rely on garbage collection and chain consolidation to reduce the length of version chains and reclaim space by freeing unreachable versions. However, finding unreachable versions requires traversing long version chains, which incurs random memory accesses on the critical path of transaction execution and hence limits scalability. This paper introduces OneShotGC, a new multi-version storage design that eliminates version traversal during garbage collection, with minimal discovery and memory management overheads. OneShotGC leverages the temporal correlations across versions to opportunistically cluster them into contiguous memory blocks that can be released in one shot. We implement OneShotGC in Proteus and use YCSB and TPC-C to experimentally evaluate its performance with respect to the state of the art, where we observe an improvement of up to 2x in transactional throughput.
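The one-shot idea can be illustrated with a small, simplified sketch: versions committed close in time are appended to the same contiguous block, each block records the largest commit timestamp it holds, and garbage collection drops whole blocks that no active transaction can still read, without traversing individual version chains. All names and the block size below are assumptions; OneShotGC's actual clustering and memory management in Proteus are more involved (for instance, it must also ensure that a block's versions have been superseded before release).

```cpp
// Simplified illustration of temporally clustered version storage with
// one-shot block reclamation; not OneShotGC's actual implementation.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

struct Version {
    std::uint64_t commit_ts;          // commit timestamp of this version
    std::vector<std::byte> payload;   // record image
};

struct Block {
    std::uint64_t max_commit_ts = 0;  // newest version held by the block
    std::vector<Version> versions;    // contiguous, append-only
};

class OneShotStore {
public:
    // Versions arrive roughly in commit order, so they cluster naturally
    // into the current block.
    void append(Version v) {
        if (blocks_.empty() || blocks_.back().versions.size() >= kBlockCap)
            blocks_.emplace_back();
        Block& b = blocks_.back();
        b.max_commit_ts = std::max(b.max_commit_ts, v.commit_ts);
        b.versions.push_back(std::move(v));
    }

    // Reclaim every block whose versions predate the oldest active
    // transaction (assuming they have all been superseded), in one shot.
    void collect(std::uint64_t oldest_active_ts) {
        while (!blocks_.empty() &&
               blocks_.front().max_commit_ts < oldest_active_ts)
            blocks_.pop_front();
    }

private:
    static constexpr std::size_t kBlockCap = 1024;  // illustrative block size
    std::deque<Block> blocks_;
};
```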
- HetCache: Synergising NVMe Storage and GPU acceleration for Memory-Efficient Analytics (2023)
Accessing input data is a critical operation in data analytics: i) slow data access significantly degrades performance, and ii) storing everything in the fastest medium, i.e., memory, incurs high operational and hardware costs. Further, while GPUs offer increased analytical performance, equipping them with correspondingly fast memory requires memory technologies even more expensive than DRAM, making memory resources even more precious. Existing GPU-accelerated engines rely on CPU memory for bigger working sets, albeit at the expense of slower execution. Such a combination of both memory and compute disaggregation, however, invalidates the assumptions of existing caching mechanisms: i) the processing tier is highly heterogeneous, ii) data access bandwidth depends on the access method and compute unit, iii) with NVMe arrays, persistent storage can approach in-memory bandwidth, and iv) all these relative quantities depend on the current query and data placement. Thus, existing caching approaches waste interconnect bandwidth, cache inefficiently, and overall result in suboptimal execution times. This work proposes HetCache, a storage engine for analytical workloads that optimizes the data access paths and tunes data placement by co-optimizing for the combinations of different memories, compute devices, and queries. Specifically, we present how the increasingly complex storage hierarchy impacts analytical query processing in GPU-NVMe-accelerated servers. HetCache accelerates analytics on CPU-GPU servers for larger-than-memory datasets through proportional and access-path-aware data placement. Our prototype implementation of HetCache demonstrates a 1.14x-1.78x speedup of GPU-only execution on NVMe-resident data and achieves near in-system-memory performance for hybrid CPU-GPU execution, while substantially improving memory efficiency. Overall, HetCache turns the multi-memory-node nature of such heterogeneous servers from a burden into a performance booster.
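As a rough illustration of access-path-aware placement, the sketch below estimates a column's scan cost over each candidate tier (GPU memory, host DRAM over the interconnect, NVMe) and picks the fastest tier that still has room. The bandwidth and capacity numbers, and the greedy policy itself, are illustrative assumptions rather than HetCache's measured values or actual algorithm.

```cpp
// Toy access-path-aware placement: choose the fastest tier with capacity.
#include <iostream>
#include <string>
#include <vector>

struct AccessPath {
    std::string name;
    double bandwidth_gbps;   // effective GB/s for this device and path (assumed)
    double capacity_gb;      // remaining capacity of the tier (assumed)
};

// Pick the fastest path that still has room for the column.
const AccessPath* choose_placement(double column_gb,
                                   const std::vector<AccessPath>& paths) {
    const AccessPath* best = nullptr;
    for (const auto& p : paths)
        if (p.capacity_gb >= column_gb &&
            (!best || p.bandwidth_gbps > best->bandwidth_gbps))
            best = &p;
    return best;
}

int main() {
    // Assumed numbers for a GPU-side scan: HBM, DRAM over PCIe, NVMe array.
    std::vector<AccessPath> paths = {
        {"GPU HBM", 900.0, 8.0},
        {"Host DRAM via PCIe", 12.0, 64.0},
        {"NVMe array", 10.0, 512.0},
    };
    double column_gb = 16.0;  // larger than the spare HBM, so it spills
    if (const AccessPath* p = choose_placement(column_gb, paths))
        std::cout << "place column on: " << p->name << " ("
                  << column_gb / p->bandwidth_gbps << " s per scan)\n";
}
```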
2020
- Adaptive HTAP through Elastic Resource Scheduling (2020)
Modern Hybrid Transactional/Analytical Processing (HTAP) systems use an integrated data processing engine that performs analytics on fresh data, which are ingested from a transactional engine. HTAP systems typically consider data freshness at design time, and are optimized for a fixed range of freshness requirements, addressed at a performance cost for either OLTP or OLAP. The data freshness and the performance requirements of both engines, however, may vary with the workload. We approach HTAP as a scheduling problem, addressed at runtime through elastic resource management. We model an HTAP system as a set of three individual engines: an OLTP engine, an OLAP engine, and a Resource and Data Exchange (RDE) engine. We devise a scheduling algorithm which traverses the HTAP design spectrum through elastic resource management to meet the workload's data freshness requirements. We propose an in-memory system design which is non-intrusive to the current state-of-the-art OLTP and OLAP engines, and we use it to evaluate the performance of our approach. Our evaluation shows that the performance benefit of our system for OLAP queries increases over time, reaching up to 50% compared to static schedules for 100 query sequences, while maintaining a small and controlled drop in the OLTP throughput.
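A minimal sketch of such an elastic scheduling loop, under assumed names, is given below: cores are shifted between the OLTP and OLAP engines, and a data shipment (the RDE engine's role) is triggered whenever the observed staleness exceeds the workload's freshness target. The thresholds and the control loop are simplified illustrations, not the paper's scheduling algorithm.

```cpp
// Toy elastic HTAP scheduler: rebalance cores based on observed staleness.
#include <iostream>

struct HtapState {
    int oltp_cores;
    int olap_cores;
    double staleness_s;      // age of the freshest data visible to OLAP
    double freshness_sla_s;  // staleness the workload tolerates
};

void schedule_step(HtapState& s) {
    if (s.staleness_s > s.freshness_sla_s) {
        // Fresh data is needed: shift a core toward ingestion/exchange and
        // trigger a data shipment from OLTP to OLAP.
        if (s.olap_cores > 1) { --s.olap_cores; ++s.oltp_cores; }
        s.staleness_s = 0.0;  // model the shipment as resetting staleness
    } else if (s.oltp_cores > 1) {
        // Freshness is satisfied: give capacity back to analytics.
        --s.oltp_cores; ++s.olap_cores;
    }
}

int main() {
    HtapState s{/*oltp*/ 8, /*olap*/ 8, /*staleness*/ 5.0, /*sla*/ 1.0};
    for (int step = 0; step < 3; ++step) {
        schedule_step(s);
        std::cout << "step " << step << ": OLTP=" << s.oltp_cores
                  << " OLAP=" << s.olap_cores
                  << " staleness=" << s.staleness_s << "s\n";
        s.staleness_s += 0.5;  // transactions keep arriving between steps
    }
}
```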
- GPU-accelerated data management under the test of time. Aunn Raza, Periklis Chrysogelos, Panagiotis Sioulas, Vladimir Indjic, Angelos-Christos G. Anadiotis, and 1 more author (2020)
GPUs are becoming increasingly popular in large-scale data center installations due to their strong, embarrassingly parallel processing capabilities. Data management systems are riding the wave by using GPUs to accelerate query execution, mainly for analytical workloads. However, this acceleration comes at the price of a slow interconnect, which imposes strong restrictions on bandwidth and latency when bringing data from main memory to the GPU for processing. The related research in data management systems mostly relies on late materialization and data sharing to mitigate the overheads introduced by slow interconnects, even in the standard CPU processing case. Meanwhile, workload trends move beyond analytical to fresh data processing, typically referred to as Hybrid Transactional and Analytical Processing (HTAP). Therefore, we experience an evolution along three different axes: interconnect technology, GPU architecture, and workload characteristics. In this paper, we break the evolution of the technological landscape into steps and study the applicability and performance of late materialization and data sharing in each one of them. We demonstrate that the standard PCIe interconnect substantially limits the performance of state-of-the-art GPUs, and we propose a hybrid materialization approach which combines eager with lazy data transfers. Further, we show that the wide gap between GPU and PCIe throughput can be bridged through efficient data sharing techniques. Finally, we provide an H2TAP system design which removes software-level interference, and we show that the interference on the memory bus is minimal, allowing data transfer optimizations as in OLAP workloads.
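The hybrid materialization decision can be sketched as a simple per-column cost comparison: transfer a column eagerly (one streaming copy over PCIe) when most of its rows will be consumed, and lazily (gathering only the rows that survive earlier predicates) when selectivity is low. The crossover condition below is an assumed, simplified cost model, not the paper's exact policy.

```cpp
// Toy eager-vs-lazy transfer decision based on predicate selectivity.
#include <initializer_list>
#include <iostream>

enum class Transfer { Eager, Lazy };

// selectivity: fraction of rows still alive when this column is needed.
// lazy_overhead: relative cost of gathering scattered rows vs. a streaming copy.
Transfer choose_transfer(double selectivity, double lazy_overhead = 2.0) {
    double eager_cost = 1.0;                         // copy the full column
    double lazy_cost = selectivity * lazy_overhead;  // gather surviving rows
    return lazy_cost < eager_cost ? Transfer::Lazy : Transfer::Eager;
}

int main() {
    for (double sel : {0.05, 0.4, 0.9})
        std::cout << "selectivity " << sel << " -> "
                  << (choose_transfer(sel) == Transfer::Lazy ? "lazy" : "eager")
                  << " transfer\n";
}
```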
2019
- Unsupervised Machine Learning for Networking: Techniques, Applications and Research Challenges. Muhammad Usama, Junaid Qadir, Aunn Raza, Hunain Arif, Kok-Lim Alvin Yau, and 3 more authors (2019)
2017
- Don’t cry over spilled records: Memory elasticity of data-parallel applications and its application to cluster scheduling. Calin Iorgulescu, Florin Dinu, Aunn Raza, Wajih Ul Hassan, and Willy Zwaenepoel (2017)