GaspiLS – The GPI-2 based Sparse linear Solver library


GaspiLS is a scalable solver library for sparse linear systems. It is based on the communication library GPI-2 and incorporates the shift in programming paradigm stimulated by the GPI-2 API. This yields superior performance and scalability (cf. the scalability figure below) compared to, e.g., the MPI-based PETSc or Trilinos. The interoperability of GPI-2 with MPI and the PETSc/Trilinos-compliant interface allow for an easy integration into legacy applications. GaspiLS is used in industrial CFD and FEM simulations.


Use GaspiLS to speed up your simulation.


  • (P)CG, BiPCGStab, GMRES solvers
  • Jacobi, ILU(0), ILUM(0) preconditioners
  • Real and complex arithmetic
  • Multi-threaded matrix assembly
  • Superior performance and scalability
  • Efficient SpMVM kernel
    • Hybrid-parallel implementation
    • GPI-2: Optimal overlap of communication and computation
    • Task-based parallelization: Optimal load balancing
  • Easily extendable / standard API
  • MPI interoperable
  • Open source (GPLv3)


GaspiLS shows excellent scalability.

Comparison of GaspiLS vs. PETSc strong scalability: Jacobi-preconditioned modified Richardson iteration applied to the solution of the second-order finite difference discretization of a 3D Poisson problem on a cubic grid (359³). GaspiLS scales almost perfectly. Architecture: Intel Ivy Bridge, two sockets per node, 10 cores per socket, InfiniBand interconnect.

GaspiLS design and implementation

GaspiLS follows an object-oriented C++ design in order to provide an easily extendable interface. The interface is in the spirit of the PETSc/Trilinos APIs in order to allow for a smooth porting of legacy applications. The GaspiLS implementation currently provides the concepts of a vector, a matrix, a preconditioner, and a solver. The matrix concept is realized within the library by the compressed sparse row (CSR) format. The provided solvers include the CG, PCG, and BiPCGStab methods. As a preconditioner, the library provides the Jacobi method. End users may complement GaspiLS with their own implementations, e.g. another matrix format and/or linear solver.

GaspiLS is built on top of a C++ template design in order to exploit performance benefits available at compile time. For example, a small number of unknowns in the problem to solve may allow for an index type with a reduced memory footprint. This increases the effectively available memory bandwidth within the SpMVM kernel.

GaspiLS uses GPI-2, the reference implementation of the GASPI specification. GASPI provides a compact yet powerful API for parallel computations. It provides single-sided, truly asynchronous communication primitives with a separate, fine-granular data synchronisation mechanism. It adopts the PGAS model, in which each rank owns one or more globally accessible memory segments. GPI-2 can be used in conjunction with MPI.

In contrast to bulk-synchronous execution, GaspiLS incorporates the shift in programming paradigm stimulated by the GPI-2 API. GaspiLS follows a hybrid SPMD execution policy in which one process using several threads is started for every NUMA domain. Potential global synchronization points are disentangled into a multi-threaded, data-dependency-driven task parallelism, which allows for optimal load balancing. Every thread can compute and communicate.

GaspiLS uses an internal splitting of the matrix into contributions with exclusive dependencies on either local or remote vector entries. This achieves maximum overlap of communication and computation. The lightweight GPI-2 runtime system is minimally intrusive and allows for optimal overlap.

GaspiLS code example

// Solve Poisson's equation in 3D with finite differences.

#include <GASPI.h>
#include <GASPI_Ext.h>
#include <GaspiLS/GaspiLS.hpp>
#include <GaspiLS/Solvers.hpp>
#include <cmath>
#include <cstdlib>
#include <cstdio>
#include <ctime>

// The problem settings
int    n1d;    // grid points per dimension
int    n;      // total number of unknowns, n1d^3
double h;      // grid spacing
double h2inv;  // 1/h^2

// Matrix and vector classes used in this example
typedef GaspiLS::Vector<double, int, int>    vec;
typedef GaspiLS::CSRMatrix<double, int, int> mat;

// The right-hand side of Poisson's equation.
double func_f(double x, double y, double z)
{
  // Approximate a Dirac delta.
  x -= 0.5;
  y -= 0.5;
  z -= 0.5;
  if(sqrt(x*x + y*y + z*z) < h) return 1.0/(h*h*h);
  return 0.0;
}

// Set up the right-hand side of the linear system.
void setRHS(vec &f)
{
  double x, y, z;
  double *fEntries = f.localEntryPointer();
  int    k         = f.distribution().minGlobalIndex();

  for(int i = 0; i < f.localSize(); i++, k++) {
    x = (k%n1d + 1)*h;
    y = (k%(n1d*n1d)/n1d + 1)*h;
    z = (k/(n1d*n1d) + 1)*h;
    fEntries[i] = -func_f(x, y, z);
  }
}

// Discretize the Laplace operator with the 7-point finite
// difference stencil.
void setOperator(mat &A)
{
  for(int k  = A.rowDistribution().minGlobalIndex();
          k <= A.rowDistribution().maxGlobalIndex();
          k++) {

    A.insertEntry(k, k, 6.0*h2inv);

    if(k%n1d > 0) {
      A.insertEntry(k, k - 1, -1.0*h2inv);
    }
    if(k%n1d < n1d - 1) {
      A.insertEntry(k, k + 1, -1.0*h2inv);
    }
    if(k%(n1d*n1d) >= n1d) {
      A.insertEntry(k, k - n1d, -1.0*h2inv);
    }
    if(k%(n1d*n1d) < n1d*n1d - n1d) {
      A.insertEntry(k, k + n1d, -1.0*h2inv);
    }
    if(k >= n1d*n1d) {
      A.insertEntry(k, k - n1d*n1d, -1.0*h2inv);
    }
    if(k < n1d*n1d*n1d - n1d*n1d) {
      A.insertEntry(k, k + n1d*n1d, -1.0*h2inv);
    }
  }
  // Fully threaded matrix finalization (call elided in this listing).
}

int main(int argc, char *argv[])
{
  if(argc < 5 || argc > 6) {
    printf("Usage: poisson3d N_1D rtol maxit N_THREADS [OUTPUT_FILE]\n");
    return -1;
  }

  // Read in the problem size.
  n1d   = atoi(argv[1]);
  n     = n1d*n1d*n1d;
  h     = 1.0/(n1d + 1);
  h2inv = 1.0/(h*h);

  // Size of the GASPI memory segment (1 GiB).
  const size_t segmentSize = static_cast<size_t>(1) << 30;

  // First, create a thread pool with the number of threads given by
  // argv[4]. Thread pools are used by many GaspiLS objects to per-
  // form parallel computations. They are independent from any user-
  // generated threads, and the user might even use a different
  // threading package, such as OpenMP. Please refer to the GaspiLS
  // documentation for thread pinning options.
  GaspiLS::ThreadPool             pool(atoi(argv[4]));
  // Create a GASPI interface for the group GASPI_GROUP_ALL
  // using segment 0 and queue 0.
  GaspiLS::GASPIInterface         interface(GASPI_GROUP_ALL, 0, 0);
  // Now create a uniform distribution of n elements on the GASPI
  // interface. Distributions have two template parameters which
  // specify the local and the global index type, respectively. Using
  // int here will improve performance in later computations, but it
  // restricts the distribution to 2^31 - 1 elements. If the distri-
  // bution has more elements, long must be used for the global index
  // type, while the local index type should still be int as long as
  // possible.
  GaspiLS::Distribution<int, int> distribution(interface, n);
  // Vectors have three template parameters. The first parameter
  // specifies the scalar type, such as double or
  // std::complex<float>. The second and third parameter again
  // specify the local and global index type. Each vector requires
  // a distribution to determine which vector entries are stored
  // on which rank, as well as a thread pool for parallel vector
  // computations.
  vec                             u(distribution, pool);
  vec                             f(distribution, pool);
  // The parameters for creating a matrix are very similar to those
  // of the vector class. The main difference is that matrices have
  // two distributions. The first one is called the row distribution
  // and it specifies which matrix rows are owned by which rank.
  // The second one is called the domain distribution and it
  // corresponds to the distribution of the vector x during the
  // matrix-vector product y = Ax. It also determines the number
  // of columns of the matrix A.
  mat                             A(distribution, distribution, pool);
  GaspiLS::CGSolver<mat, vec>     cg;

  // Discretize the PDE.
  setOperator(A);
  setRHS(f);

  // Set up the linear solver.
  cg.setProblem(A, u, f);

  // Solve the linear system. (The solve call, configured with the
  // relative tolerance argv[2] and the iteration limit argv[3], is
  // elided in this listing.)

  // Write the solution to disk if an output file was given.
  if(argc == 6) {
    // (Output code elided in this listing.)
  }

  if(interface.groupRank() == 0) {
    // Report the solver statistics; the accessor calls supplying
    // the arguments are elided in this listing:
    // printf("FINISHED (%d iterations, rel. residual: %e, "
    //        "abs. residual: %e)\n", ...);
  }

  return 0;
}

Scalability for CFD and FEM simulations

In general, scalability is the right measure to evaluate the efficiency of a parallel program and its ability to exploit the hardware-provided parallelism. It essentially quantifies the additional benefit gained by adding extra resources.

Perfect scalability implies

  • improved time to solution
  • access to advanced/more complex modeling
  • full resource utilization
  • energy efficient computing

In practice, optimal scalability is achieved by a good overlap of communication and computation in order to avoid communication induced latencies and by a good load balance over the cores and sockets. This is achieved by avoiding or at least minimizing explicit synchronization.

The GPI-2 programming model underlying our library allows the operations to run fully asynchronously, to balance the workload across threads, and to fully overlap communication with computation. This is a disruptive approach compared to other linear algebra tools on the market, such as PETSc or Trilinos.

Many modeling, simulation and/or optimization problems in engineering and scientific computing are based on CFD or FEM simulations. Sparse linear systems typically appear after the discretization of the underlying PDEs. The solution of these sparse linear systems usually is the most expensive part with respect to the run time of the entire simulation. Thus performance and efficiency gains achievable within linear algebra libraries like GaspiLS have a huge impact for industrial and scientific simulations.

GaspiLS has proven to improve the scalability of CFD simulations

Performance and scalability improvement achieved by GaspiLS in comparison to an original MPI implementation in an industrial CFD simulation. Shown is the scalability of a single PISO pressure correction step iteration on a grid with 1.1 million cells. The solver is a Jacobi-preconditioned Conjugate Gradient method. For GaspiLS, the scalability on a larger grid containing 7.8 million cells is also shown. Architecture: Intel Ivy Bridge, two sockets per node, 10 cores per socket, InfiniBand interconnect.