Job Management With Mpi_ Jm

HIGH PERFORMANCE COMPUTING, ISC HIGH PERFORMANCE 2018(2018)

引用 2|浏览9
暂无评分
摘要
Access to Leadership computing is required for HPC applications that require a large fraction of compute nodes for a single computation and also for use cases where the volume of smaller tasks can only be completed in a competitive or reasonable time frame through use of these Leadership computing facilities. In the latter case, a robust and lightweight manager is ideal so that all these tasks can be computed in a machine-friendly way, notably with minimal use of mpirun or equivalent to launch the executables (simple bundling of tasks can over-tax the service nodes and crash the entire scheduler). Our library, mpi_jm, can manage such allocations, provided access to the requisite MPI functionality is provided. mpi_jm is fault-tolerant against a modest number of down or non-communicative nodes, can begin executing work on smaller portions of a larger allocation before all nodes become available for the allocation, can manage GPU-intensive and CPU-only work independently and can overlay them peacefully on shared nodes. It is easily incorporated into existing MPI-capable executables, which then can run both independently and under mpi_jm management. It provides a flexible Python interface, unlocking many high-level libraries, while also tightly binding users' executables to hardware.
更多
查看译文
关键词
Pilot systems, Job management, CORAL
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要