Adaptive Multi-Tier Intelligent Data Manager for Exascale EU project
The growing need to process and access extremely large volumes of heterogeneous data is one of the main drivers for building exascale HPC systems. The advent of data-intensive applications, combined with the steep growth of data sets, is beginning to call into question the traditional compute-centric view of HPC. The flat storage hierarchies found in classic HPC architectures no longer satisfy the performance requirements of the growing share of data-processing applications. Uncoordinated file accesses in combination with limited bandwidth make the centralised back-end parallel file system a serious bottleneck in traditional systems. At the same time, this shift towards data-centric computing is accompanied by a disruptive change in the underlying storage technology that has the potential to remove this bottleneck. Emerging multi-tier storage hierarchies with fast non-volatile memory can significantly lower the pressure on the back-end file system. But maximising performance still requires careful control to avoid congestion and to balance compute and storage performance. Unfortunately, appropriate interfaces and policies for managing such an enhanced I/O stack are still lacking.
The main objective of the ADMIRE project is to establish this control by creating an active I/O stack that dynamically adjusts compute and storage requirements through intelligent global coordination, elasticity of computation and I/O, and the scheduling of storage resources along all levels of the storage hierarchy. To achieve this, we will develop a software-defined framework based on the principles of scalable monitoring and control, separated control and data paths, and the orchestration of key system components and applications through embedded control points. The framework will consist of three new active main components: (1) an ad-hoc parallel storage system will reduce the pressure on the back-end parallel file system and improve checkpointing performance, (2) malleability management will cost-effectively balance I/O and compute performance via dynamic scaling of application resources, and finally, (3) an I/O scheduler will offer end-to-end quality-of-service guarantees for the whole storage stack and will reduce data movement. To orchestrate the entire system, global monitoring and performance profiling will feed intelligent controllers that coordinate storage allocation and access through control points installed in these three new active components as well as in the job scheduler and the applications. Our software-only solution will offer quality-of-service (QoS), energy efficiency, and resilience. I/O interference will be reduced via globally coordinated minimisation of data transfers between storage tiers, while conveying and enforcing end-to-end QoS needs.
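The coordination idea described above, in which monitoring data feeds a controller that steers components through embedded control points, can be illustrated with a minimal sketch. It does not represent ADMIRE's design or API: the names ControlPoint and Controller, the bandwidth figures, and the policy (guarantee each QoS floor, then share spare back-end bandwidth in proportion to unmet demand) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ControlPoint:
    """Hypothetical hook embedded in a component (storage, scheduler, application)."""
    name: str
    floor: float         # guaranteed bandwidth (QoS floor), illustrative GB/s
    demand: float = 0.0  # bandwidth the component currently asks for
    limit: float = 0.0   # bandwidth limit last set by the controller

    def report(self) -> float:
        # Monitoring path: expose the current demand to the controller.
        return self.demand

    def apply(self, limit: float) -> None:
        # Control path: accept a new bandwidth limit from the controller.
        self.limit = limit


class Controller:
    """Hypothetical global controller for a shared back-end of fixed capacity."""

    def __init__(self, capacity: float, points: list[ControlPoint]):
        # Assumed admission rule: the floors must fit into the capacity.
        if sum(p.floor for p in points) > capacity:
            raise ValueError("QoS floors exceed back-end capacity")
        self.capacity = capacity
        self.points = points

    def step(self) -> dict[str, float]:
        demands = {p.name: p.report() for p in self.points}
        # Each component first receives its floor, capped at its demand.
        base = {p.name: min(demands[p.name], p.floor) for p in self.points}
        unmet = {n: demands[n] - base[n] for n in demands}
        spare = self.capacity - sum(base.values())
        total_unmet = sum(unmet.values())
        # Serve all unmet demand if it fits; otherwise scale it down evenly.
        scale = 1.0 if total_unmet <= spare else spare / total_unmet
        limits = {n: base[n] + scale * unmet[n] for n in demands}
        for p in self.points:
            p.apply(limits[p.name])
        return limits


# Illustrative workload: three components competing for 10 GB/s.
points = [
    ControlPoint("simulation", floor=2.0, demand=8.0),
    ControlPoint("checkpoint", floor=1.0, demand=6.0),
    ControlPoint("analytics", floor=1.0, demand=0.5),
]
controller = Controller(capacity=10.0, points=points)
print(controller.step())
```

In this sketch the monitoring path (report) and the control path (apply) are kept as separate methods to mirror the separation of concerns named above. A real controller would run the step periodically and would also have to account for storage tiers, data placement, and malleable resource changes, none of which are modelled here.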
The project will allow the throughput of HPC systems as well as the performance of individual applications to be substantially increased – and consequently energy consumption to be decreased – by taking advantage of fast and power-efficient node-local storage tiers through novel, European ad-hoc storage systems, and in-transit/in-situ processing facilities. An integrated and operational prototype will be validated and demonstrated with several use case applications from various domains, including climate/weather, life sciences, physics, remote sensing, and deep learning. The consortium comprises leading European companies, research organisations and universities, bringing together several PRACE members and Centres of Excellence for HPC applications, allowing us to jointly increase our impact in the HPC ecosystem with real-world data-intensive applications.