Computer Science
Parallel Computing in Matlab using MPI 4.0 via the Caryam C/C++ interface
Published - Journées Calcul et Données
Matlab is widely used in the scientific community and is an effective framework for developing code prototypes in numerical modeling. Developers can easily test mathematical methods on complex problems and take advantage of multicore hardware architectures through parallelization techniques. The Parallel Computing Toolbox (PCT) provided by MathWorks is a user-friendly way to implement the explicit message-passing parallel programming paradigm on distributed-memory architectures. However, this toolbox can restrict the scope of large-memory applications: an additional license (Matlab Parallel Server) is mandatory to run applications on several nodes or with more than 12 cores. Moreover, these toolboxes do not provide as many features as standard APIs such as MPI, which is broadly used with compiled programming languages in the HPC field, especially when domain decomposition methods are involved.

In response to these limitations, ad hoc MPI wrappers can be developed to make standard MPI features usable from a Matlab program. We developed the Caryam interface, based on a recent C/C++ API of Matlab (R2023a); its MEX files can be compiled with MPICH 4.0.2, Intel MPI 2021, and Open MPI 4.1.3. Caryam exposes a major part of the MPI 4.0 feature set: point-to-point and collective communications and derived datatypes, but also Cartesian and graph communicator topologies and asynchronous (non-blocking) operations, which are not accessible in the MathWorks PCT (minimal sketches of these patterns are given below). Caryam v1.0 thus helps developers familiarize themselves with the MPI API in an existing Matlab prototype before rewriting it in a compiled language when needed.

We conducted benchmarks on typical algorithms to compare the performance of Caryam and the PCT. The benchmark set was run on the supercomputer RUCHE hosted by the regional Mésocentre du Moulon. Each RUCHE node contains 40 cores (2 x Intel Xeon Gold 6230) and 192 GB of memory, and the nodes are interconnected through a 100 Gbit/s Omni-Path (OPA) network. We first performed point-to-point and collective operations on data sizes up to 1 GB. In all cases, the results show communications at least twice as fast with the Caryam interface as with the PCT. Caryam also delivers performance similar to the pure C++ version of the benchmarks for data sizes above 1 MB. We also describe Caryam's capability to perform asynchronous operations.

We then carried out scalability studies based on a parallelized Conjugate Gradient (PCG) solver applied to a Toeplitz system without preconditioning, computing performance metrics (speedup, efficiency, Gflop/s) in both strong and weak scaling cases. On the resolution of a ~6 GB matrix system, Caryam provides scalability similar to the PCT while consuming less memory (-14%) on up to 12 cores. Furthermore, version 1.0 of Caryam reaches a peak performance of 1.046 Tflop/s when solving a 1 GiB/core system with the PCG algorithm on 1000 cores of the RUCHE supercomputer. We finally show how Caryam v1.0 enables a Matlab prototype to solve a 30-million-degree-of-freedom problem on 320 cores of RUCHE; this experiment focuses on a mixed domain decomposition method for solving 2D and 3D magnetostatic problems.
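To make the programming model concrete, the sketch below shows a point-to-point ring exchange and a global reduction as they could look through an MPI-style Matlab wrapper. This abstract does not list Caryam's actual entry points, so every caryam_* name here and in the following sketches is a hypothetical placeholder mirroring the corresponding MPI C routine.

    % Minimal point-to-point and collective sketch. All caryam_* names are
    % hypothetical placeholders mirroring the MPI C API; Caryam's actual
    % entry points may differ.
    caryam_init();                               % MPI_Init
    rank   = caryam_comm_rank();                 % MPI_Comm_rank on MPI_COMM_WORLD
    nprocs = caryam_comm_size();                 % MPI_Comm_size on MPI_COMM_WORLD

    n = 1e6;
    x = rank * ones(n, 1);                       % local block of a distributed vector

    % Ring exchange of one boundary value, ordered by rank parity so a
    % cycle of blocking sends cannot deadlock.
    next = mod(rank + 1, nprocs);
    prev = mod(rank - 1 + nprocs, nprocs);
    if mod(rank, 2) == 0
        caryam_send(x(end), next, 0);            % MPI_Send, tag 0
        halo = caryam_recv(prev, 0);             % MPI_Recv, tag 0
    else
        halo = caryam_recv(prev, 0);
        caryam_send(x(end), next, 0);
    end

    % Collective: global dot product via a sum reduction over all ranks.
    local_dot  = x' * x;
    global_dot = caryam_allreduce(local_dot, 'sum');   % MPI_Allreduce

    caryam_finalize();                           % MPI_Finalize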
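The asynchronous mode mentioned above can be sketched the same way: non-blocking calls return request handles so that local computation can overlap the transfer, mirroring MPI_Isend, MPI_Irecv, and MPI_Wait (again with hypothetical names).

    % Non-blocking variant of the same boundary exchange, overlapping
    % communication with computation.
    req_s = caryam_isend(x(end), next, 0);       % MPI_Isend: post the send immediately
    req_r = caryam_irecv(prev, 0);               % MPI_Irecv: post the matching receive

    y = sin(x(1:end-1));                         % interior work while messages are in flight

    halo = caryam_wait(req_r);                   % MPI_Wait: block until the data has arrived
    caryam_wait(req_s);                          % send buffer is reusable afterwards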
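Cartesian communicator topologies, another feature absent from the PCT, would follow the same pattern, mirroring MPI_Dims_create, MPI_Cart_create, and MPI_Cart_shift (hypothetical names as before).

    % 2-D Cartesian process grid sketch.
    dims   = caryam_dims_create(nprocs, 2);           % balanced 2-D factorization of nprocs
    comm2d = caryam_cart_create(dims, [0 0], true);   % non-periodic grid, reordering allowed
    [src, dst] = caryam_cart_shift(comm2d, 1, 1);     % neighbour ranks along the second dimension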
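Finally, the PCG scalability study exploits the fact that a distributed, unpreconditioned CG iteration communicates only through the matrix-vector product and two scalar reductions. A schematic of one iteration, under the same hypothetical wrapper names and with matvec_with_halo standing in for the distributed Toeplitz operator (the actual Caryam solver code is not reproduced here):

    % Core of one distributed CG iteration on a row-partitioned system.
    % Communication is confined to the matrix-vector product (halo
    % exchange as sketched above) and two scalar MPI_Allreduce
    % reductions; rz holds r'*r from the previous iteration.
    q       = matvec_with_halo(A_local, p_local);             % distributed A*p
    alpha   = rz / caryam_allreduce(p_local' * q, 'sum');     % global p'*A*p
    x_local = x_local + alpha * p_local;                      % update local solution block
    r_local = r_local - alpha * q;                            % update local residual block
    rz_new  = caryam_allreduce(r_local' * r_local, 'sum');    % global r'*r
    p_local = r_local + (rz_new / rz) * p_local;              % new search direction
    rz      = rz_new;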