Parallelizing the assembly over the elements
I'm investigating the feasibility of parallelizing assembly over elements.
Take the following snippet (from #439):
```python
import pygmsh as pm
import numpy as np
from skfem.io import from_meshio
from skfem.helpers import ddot, grad, dot, transpose, prod
import matplotlib.pyplot as plt
from math import pi
from skfem import *


def vdet(A):
    # vectorized determinant of 3x3 matrices; trailing axes run over elements/quadrature points
    detA = np.zeros_like(A[0, 0])
    detA = A[0, 0] * (A[1, 1] * A[2, 2] - A[1, 2] * A[2, 1]) -\
           A[0, 1] * (A[1, 0] * A[2, 2] - A[1, 2] * A[2, 0]) +\
           A[0, 2] * (A[1, 0] * A[2, 1] - A[1, 1] * A[2, 0])
    return detA


def vinv(A):
    # vectorized inverse (and determinant) of 3x3 matrices via the adjugate
    invA = np.zeros_like(A)
    detA = vdet(A)
    invA[0, 0] = (-A[1, 2] * A[2, 1] + A[1, 1] * A[2, 2]) / detA
    invA[1, 0] = (A[1, 2] * A[2, 0] - A[1, 0] * A[2, 2]) / detA
    invA[2, 0] = (-A[1, 1] * A[2, 0] + A[1, 0] * A[2, 1]) / detA
    invA[0, 1] = (A[0, 2] * A[2, 1] - A[0, 1] * A[2, 2]) / detA
    invA[1, 1] = (-A[0, 2] * A[2, 0] + A[0, 0] * A[2, 2]) / detA
    invA[2, 1] = (A[0, 1] * A[2, 0] - A[0, 0] * A[2, 1]) / detA
    invA[0, 2] = (-A[0, 2] * A[1, 1] + A[0, 1] * A[1, 2]) / detA
    invA[1, 2] = (A[0, 2] * A[1, 0] - A[0, 0] * A[1, 2]) / detA
    invA[2, 2] = (-A[0, 1] * A[1, 0] + A[0, 0] * A[1, 1]) / detA
    return invA, detA


def firstPKStress(u):
    # first Piola-Kirchhoff stress P(F) with F = I + grad(u)
    F = grad(u)
    F[0, 0] += 1.
    F[1, 1] += 1.
    F[2, 2] += 1.
    J = vdet(F)
    invF, _ = vinv(F)
    return mu * F - mu * transpose(invF) + lmbda * J * (J - 1) * transpose(invF)


def jacobianPK(u):
    # tangent dP/dF as a fourth-order tensor with trailing element/quadrature axes
    F = grad(u)
    eye = np.zeros_like(F)
    for i in range(3):
        F[i, i] += 1.
        eye[i, i] += 1.
    Finv, J = vinv(F)
    dFdF = np.einsum("ik...,jl...->ijkl...", eye, eye)
    dFinvdF = np.einsum("jk...,li...->ijkl...", Finv, Finv)
    C = mu * dFdF + mu * dFinvdF -\
        lmbda * J * (J - 1) * dFinvdF +\
        lmbda * (2 * J - 1) * J * np.einsum("ji...,lk...->ijkl...", Finv, Finv)
    return C


mesh = MeshTet()
mesh.refine(4)
elem = ElementTetP1()
uelem = ElementVectorH1(elem)
iBasis = InteriorBasis(mesh, uelem)
fBasis = FacetBasis(mesh, uelem)
u = np.zeros(iBasis.N)  # this already accounts for the vector dimension

# material parameters and initialization
bodyForce = np.array([0., -1./2, 0])
E, nu = 10., 0.3
mu = E/2/(1+nu)
lmbda = 2*mu*nu/(1-2*nu)

dofs = {
    "left": iBasis.get_dofs(lambda x: x[0] == 0),
    "right": iBasis.get_dofs(lambda x: x[0] == 1.)
}

# assign Dirichlet BC
# variables used in the FEniCS demo
scale = y0 = z0 = 0.5
theta = pi/3.
# scaling factor bta for Newton's method
bta = 0.7
u1Right = 0.
u2Right = lambda x, y, z: scale*(y0 + (y - y0)*np.cos(theta) - (z - z0)*np.sin(theta) - y)
u3Right = lambda x, y, z: scale*(z0 + (y - y0)*np.sin(theta) + (z - z0)*np.cos(theta) - z)
rightNodes = mesh.p[:, mesh.nodes_satisfying(lambda x: np.isclose(x[0], 1.))]
leftNodes = mesh.p[:, mesh.nodes_satisfying(lambda x: np.isclose(x[0], 0.))]
u[dofs["left"].nodal['u^1']] = 0.
u[dofs["left"].nodal['u^2']] = 0.
u[dofs["left"].nodal['u^3']] = 0.
u[dofs["right"].nodal['u^1']] = 0.
u[dofs["right"].nodal['u^2']] = u2Right(*iBasis.doflocs[:, dofs["right"].nodal['u^2']])
u[dofs["right"].nodal['u^3']] = u3Right(*iBasis.doflocs[:, dofs["right"].nodal['u^3']])
I = iBasis.complement_dofs(dofs)


@LinearForm
def rhs(v, w):
    return ddot(firstPKStress(w["w"]), grad(v))  # + dot(bodyForce, v)


@BilinearForm
def jac(u, v, w):
    return np.einsum('ijkl...,ij...,kl...', jacobianPK(w["w"]), grad(u), grad(v))


w = iBasis.interpolate(u)
```
Assembly takes quite a while because so many floating-point operations are done inside the form:

```
In [3]: %time J = asm(jac, iBasis, w=w)
CPU times: user 12.4 s, sys: 4.01 s, total: 16.4 s
Wall time: 16.4 s
```

This is despite there being only about 4k nodes, i.e. 3 × 4k ≈ 12k DOFs:

```
In [12]: mesh.p.shape
Out[12]: (3, 4233)
```
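To see how much of that time is spent evaluating `jacobianPK` itself rather than in the generic assembly machinery, a quick experiment (a sketch, not part of the original timings) would be to time an almost trivial bilinear form on the same basis and compare:

```python
# Rough check (not from the issue): time a cheap form on the same basis to see
# how much of the 16 s comes from evaluating jacobianPK inside `jac`.
import time

@BilinearForm
def laplace_like(u, v, w):
    # nearly trivial vector "stiffness" form with very little floating-point work per point
    return ddot(grad(u), grad(v))

t0 = time.perf_counter()
asm(laplace_like, iBasis)
print("cheap form:", time.perf_counter() - t0, "s")

t0 = time.perf_counter()
asm(jac, iBasis, w=w)
print("hyperelastic tangent:", time.perf_counter() - t0, "s")
```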
What happens if we assemble only half of the elements?
```
In [6]: ib1 = InteriorBasis(mesh, uelem, elements=mesh.elements_satisfying(lambda x: x[0]<0.5))

In [7]: ib2 = InteriorBasis(mesh, uelem, elements=mesh.elements_satisfying(lambda x: x[0]>0.5))

In [9]: w1 = ib1.interpolate(u)

In [10]: w2 = ib2.interpolate(u)

In [11]: %time asm(jac, ib1, w=w1)
CPU times: user 5.99 s, sys: 1.92 s, total: 7.91 s
Wall time: 7.92 s
Out[11]:
<12699x12699 sparse matrix of type '<class 'numpy.float64'>'
    with 260207 stored elements in Compressed Sparse Row format>
```
Looking at `htop` while this is running, we see that the assembly (in this case) mostly uses a single core. So we could try assembling `ib1` and `ib2` in parallel and save some time. Let's try this using dask:
```
In [6]: import dask.bag as db

In [7]: b = db.from_sequence([(ib1, w1), (ib2, w2)])

In [9]: c = b.map(lambda x: asm(jac, x[0], w=x[1]))

In [12]: %time c.compute()
CPU times: user 83.7 ms, sys: 236 ms, total: 320 ms
Wall time: 12.6 s
Out[12]:
[<12699x12699 sparse matrix of type '<class 'numpy.float64'>'
    with 260207 stored elements in Compressed Sparse Row format>,
 <12699x12699 sparse matrix of type '<class 'numpy.float64'>'
    with 260693 stored elements in Compressed Sparse Row format>]
```
12.6 s < 16.4 s, so we seem to have a chance of saving a few seconds by splitting the assembly over the elements.
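Since the global matrix is just the sum of element contributions, the two partial matrices can simply be added, provided the two element sets cover every element exactly once (with the predicates above, an element whose midpoint lies exactly at x[0] == 0.5 would be missed by both bases, so `< 0.5` / `>= 0.5` would be safer). A sketch:

```python
# Sketch: recover the full matrix by summing the per-chunk matrices.
# Valid because assembly is a sum over elements and the chunks are disjoint.
K1, K2 = c.compute()
K = K1 + K2

# optional sanity check against the serial assembly
# J = asm(jac, iBasis, w=w)
# print(abs(K - J).max())
```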
Remaining questions:
- Can we combine the resulting matrices so that it actually saves time? I suppose the correct place to do this is before a call to any `scipy.sparse` routines.
- Does it make sense to provide a method in `Basis` which splits it into multiple `Basis` objects?
- Should we make this parallelization transparent to the user, or simply provide an example demonstrating it?
- What actually causes the assembly to run on a single core only? (Do some profiling; a first attempt is sketched below this list.)
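On that last point, a quick first step (not done in the issue) would be something like:

```python
# Rough profiling sketch (not from the issue): see where the time goes when
# assembling one element chunk. If numpy ufuncs and einsum dominate, that would
# also be consistent with the single-core behaviour, since most of them are
# single-threaded.
import cProfile

cProfile.run("asm(jac, ib1, w=w1)", sort="cumtime")
```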
I’ll make a branch which explores this when I have time.
Top GitHub Comments
My guess is that if parallelization over elements is to be performed, combining the results should be done before a call to `_assemble_scipy_matrix` here, e.g. by initializing `data`, `rows` and `cols` so that they can hold the entire matrix and then doing the loops before `_assemble_scipy_matrix` only for a subset of elements per thread.

Yes, I think so, unless you want to try what's suggested in the title of the issue and in the first post, i.e. parallelize over the elements. That should end up being multiple times faster if done properly.
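As a rough illustration of that suggestion, the pattern could look like the following self-contained toy. None of this is skfem code: the connectivity, the per-element matrix and the chunking are all made up, and the placeholder "local assembly" stands in for the loops that would normally run before `_assemble_scipy_matrix`.

```python
# Toy of the suggested pattern: preallocate global COO triplet arrays sized for
# all elements, let each worker fill a disjoint slice for its subset of
# elements, then call scipy.sparse exactly once at the end.
import numpy as np
from concurrent.futures import ThreadPoolExecutor
from scipy.sparse import coo_matrix

nelems, nlocal, N = 1000, 12, 500                  # elements, local dofs per element, global dofs
rng = np.random.default_rng(0)
conn = rng.integers(0, N, size=(nelems, nlocal))   # fake connectivity

entries_per_elem = nlocal * nlocal
data = np.empty(nelems * entries_per_elem)
rows = np.empty(nelems * entries_per_elem, dtype=np.int64)
cols = np.empty(nelems * entries_per_elem, dtype=np.int64)

def fill(chunk):
    # stand-in for the real local assembly: each element contributes a dense nlocal x nlocal block
    for e in chunk:
        sl = slice(e * entries_per_elem, (e + 1) * entries_per_elem)
        local = np.ones((nlocal, nlocal))          # placeholder for the real element matrix
        rows[sl] = np.repeat(conn[e], nlocal)      # row index of each local entry
        cols[sl] = np.tile(conn[e], nlocal)        # column index of each local entry
        data[sl] = local.ravel()

chunks = np.array_split(np.arange(nelems), 4)
with ThreadPoolExecutor(max_workers=4) as ex:
    list(ex.map(fill, chunks))

K = coo_matrix((data, (rows, cols)), shape=(N, N)).tocsr()  # duplicate entries are summed here
```

Whether plain threads help in practice depends on how much of the per-chunk work releases the GIL; if not, the same pattern works with processes, at the cost of shipping the chunks to the workers.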