Parallelizing the assembly over the elements
I'm investigating the feasibility of parallelizing assembly over elements.
Take the following snippet (from #439):
```python
import pygmsh as pm
import numpy as np
from skfem.io import from_meshio
from skfem.helpers import ddot, grad, dot, transpose, prod
import matplotlib.pyplot as plt
from math import pi
from skfem import *


def vdet(A):
    # vectorized determinant of 3x3 matrices; trailing axes run over elements/quadrature points
    detA = np.zeros_like(A[0, 0])
    detA = A[0, 0] * (A[1, 1] * A[2, 2] - A[1, 2] * A[2, 1]) -\
           A[0, 1] * (A[1, 0] * A[2, 2] - A[1, 2] * A[2, 0]) +\
           A[0, 2] * (A[1, 0] * A[2, 1] - A[1, 1] * A[2, 0])
    return detA


def vinv(A):
    # vectorized inverse (and determinant) of 3x3 matrices via the adjugate
    invA = np.zeros_like(A)
    detA = vdet(A)
    invA[0, 0] = (-A[1, 2] * A[2, 1] + A[1, 1] * A[2, 2]) / detA
    invA[1, 0] = (A[1, 2] * A[2, 0] - A[1, 0] * A[2, 2]) / detA
    invA[2, 0] = (-A[1, 1] * A[2, 0] + A[1, 0] * A[2, 1]) / detA
    invA[0, 1] = (A[0, 2] * A[2, 1] - A[0, 1] * A[2, 2]) / detA
    invA[1, 1] = (-A[0, 2] * A[2, 0] + A[0, 0] * A[2, 2]) / detA
    invA[2, 1] = (A[0, 1] * A[2, 0] - A[0, 0] * A[2, 1]) / detA
    invA[0, 2] = (-A[0, 2] * A[1, 1] + A[0, 1] * A[1, 2]) / detA
    invA[1, 2] = (A[0, 2] * A[1, 0] - A[0, 0] * A[1, 2]) / detA
    invA[2, 2] = (-A[0, 1] * A[1, 0] + A[0, 0] * A[1, 1]) / detA
    return invA, detA


def firstPKStress(u):
    # first Piola-Kirchhoff stress P(F) with F = I + grad(u)
    F = grad(u)
    F[0, 0] += 1.
    F[1, 1] += 1.
    F[2, 2] += 1.
    J = vdet(F)
    invF, _ = vinv(F)
    return mu * F - mu * transpose(invF) + lmbda * J * (J - 1) * transpose(invF)


def jacobianPK(u):
    # tangent dP/dF as a fourth-order tensor with trailing element/quadrature axes
    F = grad(u)
    eye = np.zeros_like(F)
    for i in range(3):
        F[i, i] += 1.
        eye[i, i] += 1.
    Finv, J = vinv(F)
    dFdF = np.einsum("ik...,jl...->ijkl...", eye, eye)
    dFinvdF = np.einsum("jk...,li...->ijkl...", Finv, Finv)
    C = mu * dFdF + mu * dFinvdF -\
        lmbda * J * (J - 1) * dFinvdF +\
        lmbda * (2 * J - 1) * J * np.einsum("ji...,lk...->ijkl...", Finv, Finv)
    return C


mesh = MeshTet()
mesh.refine(4)
elem = ElementTetP1()
uelem = ElementVectorH1(elem)
iBasis = InteriorBasis(mesh, uelem)
fBasis = FacetBasis(mesh, uelem)
u = np.zeros(iBasis.N)  # this already accounts for the vector dimension

# material parameters and initialization
bodyForce = np.array([0., -1./2, 0])
E, nu = 10., 0.3
mu = E/2/(1+nu)
lmbda = 2*mu*nu/(1-2*nu)

dofs = {
    "left": iBasis.get_dofs(lambda x: x[0] == 0),
    "right": iBasis.get_dofs(lambda x: x[0] == 1.)
}

# assign Dirichlet BC
# variables used in the FEniCS demo
scale = y0 = z0 = 0.5
theta = pi/3.
# scaling factor bta for Newton's method
bta = 0.7
u1Right = 0.
u2Right = lambda x, y, z: scale*(y0 + (y - y0)*np.cos(theta) - (z - z0)*np.sin(theta) - y)
u3Right = lambda x, y, z: scale*(z0 + (y - y0)*np.sin(theta) + (z - z0)*np.cos(theta) - z)
rightNodes = mesh.p[:, mesh.nodes_satisfying(lambda x: np.isclose(x[0], 1.))]
leftNodes = mesh.p[:, mesh.nodes_satisfying(lambda x: np.isclose(x[0], 0.))]
u[dofs["left"].nodal['u^1']] = 0.
u[dofs["left"].nodal['u^2']] = 0.
u[dofs["left"].nodal['u^3']] = 0.
u[dofs["right"].nodal['u^1']] = 0.
u[dofs["right"].nodal['u^2']] = u2Right(*iBasis.doflocs[:, dofs["right"].nodal['u^2']])
u[dofs["right"].nodal['u^3']] = u3Right(*iBasis.doflocs[:, dofs["right"].nodal['u^3']])
I = iBasis.complement_dofs(dofs)


@LinearForm
def rhs(v, w):
    return ddot(firstPKStress(w["w"]), grad(v))  # + dot(bodyForce, v)


@BilinearForm
def jac(u, v, w):
    return np.einsum('ijkl...,ij...,kl...', jacobianPK(w["w"]), grad(u), grad(v))


w = iBasis.interpolate(u)
```
Assembly takes quite a while because so many floating-point operations are done inside the form:

```
In [3]: %time J = asm(jac, iBasis, w=w)
CPU times: user 12.4 s, sys: 4.01 s, total: 16.4 s
Wall time: 16.4 s
```

This is despite there being only about 4k nodes, i.e. 3 × 4k ≈ 12k DOFs:

```
In [12]: mesh.p.shape
Out[12]: (3, 4233)
```
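To see how much of that time is spent evaluating `jacobianPK` itself rather than in the generic assembly machinery, a quick experiment (a sketch, not part of the original timings) would be to time an almost trivial bilinear form on the same basis and compare:

```python
# Rough check (not from the issue): time a cheap form on the same basis to see
# how much of the 16 s comes from evaluating jacobianPK inside `jac`.
import time

@BilinearForm
def laplace_like(u, v, w):
    # nearly trivial vector "stiffness" form with very little floating-point work per point
    return ddot(grad(u), grad(v))

t0 = time.perf_counter()
asm(laplace_like, iBasis)
print("cheap form:", time.perf_counter() - t0, "s")

t0 = time.perf_counter()
asm(jac, iBasis, w=w)
print("hyperelastic tangent:", time.perf_counter() - t0, "s")
```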
What happens if we assemble only half of the elements?
```
In [6]: ib1 = InteriorBasis(mesh, uelem, elements=mesh.elements_satisfying(lambda x: x[0]<0.5))

In [7]: ib2 = InteriorBasis(mesh, uelem, elements=mesh.elements_satisfying(lambda x: x[0]>0.5))

In [9]: w1 = ib1.interpolate(u)

In [10]: w2 = ib2.interpolate(u)

In [11]: %time asm(jac, ib1, w=w1)
CPU times: user 5.99 s, sys: 1.92 s, total: 7.91 s
Wall time: 7.92 s
Out[11]:
<12699x12699 sparse matrix of type '<class 'numpy.float64'>'
    with 260207 stored elements in Compressed Sparse Row format>
```
Looking at `htop` while this is running, we see that the assembly (in this case) mostly uses a single core. So we could try assembling `ib1` and `ib2` in parallel and save some time. Let's try this using dask:
```
In [6]: import dask.bag as db

In [7]: b = db.from_sequence([(ib1, w1), (ib2, w2)])

In [9]: c = b.map(lambda x: asm(jac, x[0], w=x[1]))

In [12]: %time c.compute()
CPU times: user 83.7 ms, sys: 236 ms, total: 320 ms
Wall time: 12.6 s
Out[12]:
[<12699x12699 sparse matrix of type '<class 'numpy.float64'>'
    with 260207 stored elements in Compressed Sparse Row format>,
 <12699x12699 sparse matrix of type '<class 'numpy.float64'>'
    with 260693 stored elements in Compressed Sparse Row format>]
```
12.6 s < 16.4 s, so we seem to have a chance of saving a few seconds by splitting the assembly over the elements.
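Since the global matrix is just the sum of element contributions, the two partial matrices can simply be added, provided the two element sets cover every element exactly once (with the predicates above, an element whose midpoint lies exactly at x[0] == 0.5 would be missed by both bases, so `< 0.5` / `>= 0.5` would be safer). A sketch:

```python
# Sketch: recover the full matrix by summing the per-chunk matrices.
# Valid because assembly is a sum over elements and the chunks are disjoint.
K1, K2 = c.compute()
K = K1 + K2

# optional sanity check against the serial assembly
# J = asm(jac, iBasis, w=w)
# print(abs(K - J).max())
```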
Remaining questions:
- Can we combine the resulting matrices so that it actually saves time? I suppose the correct place to do this is before a call to any `scipy.sparse` routines.
- Does it make sense to provide a method in `Basis` which splits it into multiple `Basis` objects?
- Should we make this parallelization transparent to the user, or simply provide an example demonstrating it?
- What actually causes the assembly to run on a single core only? (Do some profiling; a first attempt is sketched below this list.)
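On that last point, a quick first step (not done in the issue) would be something like:

```python
# Rough profiling sketch (not from the issue): see where the time goes when
# assembling one element chunk. If numpy ufuncs and einsum dominate, that would
# also be consistent with the single-core behaviour, since most of them are
# single-threaded.
import cProfile

cProfile.run("asm(jac, ib1, w=w1)", sort="cumtime")
```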
I’ll make a branch which explores this when I have time.
Top GitHub Comments
My guess is that if parallelization over elements is to be performed, combining the results should be done before a call to `_assemble_scipy_matrix` here, e.g. by initializing `data`, `rows` and `cols` so that they can hold the entire matrix and then doing the loops before `_assemble_scipy_matrix` only for a subset of elements per thread.

Yes, I think so, unless you want to try what's suggested in the title of the issue and in the first post, i.e. parallelize over the elements. That should end up being multiple times faster if done properly.
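As a rough illustration of that suggestion, the pattern could look like the following self-contained toy. None of this is skfem code: the connectivity, the per-element matrix and the chunking are all made up, and the placeholder "local assembly" stands in for the loops that would normally run before `_assemble_scipy_matrix`.

```python
# Toy of the suggested pattern: preallocate global COO triplet arrays sized for
# all elements, let each worker fill a disjoint slice for its subset of
# elements, then call scipy.sparse exactly once at the end.
import numpy as np
from concurrent.futures import ThreadPoolExecutor
from scipy.sparse import coo_matrix

nelems, nlocal, N = 1000, 12, 500                  # elements, local dofs per element, global dofs
rng = np.random.default_rng(0)
conn = rng.integers(0, N, size=(nelems, nlocal))   # fake connectivity

entries_per_elem = nlocal * nlocal
data = np.empty(nelems * entries_per_elem)
rows = np.empty(nelems * entries_per_elem, dtype=np.int64)
cols = np.empty(nelems * entries_per_elem, dtype=np.int64)

def fill(chunk):
    # stand-in for the real local assembly: each element contributes a dense nlocal x nlocal block
    for e in chunk:
        sl = slice(e * entries_per_elem, (e + 1) * entries_per_elem)
        local = np.ones((nlocal, nlocal))          # placeholder for the real element matrix
        rows[sl] = np.repeat(conn[e], nlocal)      # row index of each local entry
        cols[sl] = np.tile(conn[e], nlocal)        # column index of each local entry
        data[sl] = local.ravel()

chunks = np.array_split(np.arange(nelems), 4)
with ThreadPoolExecutor(max_workers=4) as ex:
    list(ex.map(fill, chunks))

K = coo_matrix((data, (rows, cols)), shape=(N, N)).tocsr()  # duplicate entries are summed here
```

Whether plain threads help in practice depends on how much of the per-chunk work releases the GIL; if not, the same pattern works with processes, at the cost of shipping the chunks to the workers.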