
Low Efficiency caused by matrix-vector product in for loop


Hi all,

When I run a moderately big (?) problem with 670 sources and 50,000 cells (a 3D IP inversion), it takes about 8 hrs to run 15 GN iterations, which is not too bad. I need to give a bit of context on IP inversion. It is a linear inversion, but it requires the sensitivity of the DC problem: d = Gm. For 3D, rather than generating the sensitivity matrix explicitly, I store the factorization of the system matrix and use it to solve the linear IP problem. Hence, this should be fast …, but it was slower than I expected. So I did a few experiments and found an issue.
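(For concreteness, here is a minimal sketch of the factorize-once, back-substitute-many pattern described above, using scipy.sparse.linalg.splu; the matrix and sizes below are made-up stand-ins, not SimPEG's actual DC system.)

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Made-up, diagonally dominant stand-in for the DC system matrix
n = 1000
e = np.ones(n)
A = sp.spdiags(np.vstack((-e, 4 * e, -e)), np.array([-1, 0, 1]), n, n).tocsc()

lu = spla.splu(A)                # expensive LU factorization: done once

rhs = np.random.randn(n, 670)    # one right-hand side per source
x = lu.solve(rhs)                # cheap: back-substitution only, for all sources

Since the factorization is reused, each additional source should cost only a back-substitution, which is why the timings below were surprising.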

The issue is illustrated below:

[screenshot from 2017-04-20 12:07:11]

In the first cell above, I factorize A and solve to compute the predicted (IP) data, so it requires both the factorization of A and back-substitution. But in the second cell A is already factorized, so we do not need to factorize again; only back-substitution is required. However, it still takes 19 s, which is not that different from the 22 s including the factorization. So I was curious whether back-substitution really takes that much time… Outside of the solve call, I evaluated how long it takes:

[screenshot from 2017-04-20 12:08:31]

It only took 0.52 seconds. But the evaluation of problem.getAderiv took 16 seconds, which is the major portion of the total time. So I broke problem.getAderiv apart as follows:

[screenshot from 2017-04-20 12:09:16]

Then I figured out that problem.MeSigmaDeriv takes most of the time. This requires the evaluation of mesh3D.getEdgeInnerProductDeriv, which can be broken apart further:

[screenshot from 2017-04-20 12:10:18]

Basically, the above shows that the matrix-vector product in a for loop is the monster. Does anyone have a good idea of what is happening here, and a good fix for this issue? @rowanc1 @grosenkj @jcapriot @fourndo @lheagy @bsmithyman

I believe most of the SimPEG code can suffer from this, and solving this issue could hugely increase the efficiency of a number of SimPEG codes!
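(As a minimal, self-contained illustration of the pattern in question, with made-up sizes, the sketch below compares a per-column matvec loop against a single sparse-matrix times dense-matrix product; the single product avoids the Python loop and slicing overhead.)

import numpy as np
import scipy.sparse as sp

# Made-up sizes; the real problem has ~50,000 cells and 670 sources
n, k = int(1e5), 50
e = np.ones(n)
A = sp.spdiags(np.vstack((e, e, e)), np.array([-1, 0, 1]), n, n).tocsr()
V = np.random.randn(n, k)

# Slow pattern: one sparse matvec per column inside a Python for loop
out_loop = np.empty_like(V)
for i in range(k):
    out_loop[:, i] = A.dot(V[:, i])

# Usually much faster: a single sparse-matrix * dense-matrix product
out_block = A.dot(V)

assert np.allclose(out_loop, out_block)

Whether this batching applies directly depends on how the derivative operators are built inside SimPEG, but it is the usual cure for per-source matvec loops.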

Issue Analytics

  • State: open
  • Created: 6 years ago
  • Comments: 12 (11 by maintainers)

Top GitHub Comments

1 reaction
ahartikainen commented, Apr 22, 2017

Hi, the copy thing is only for the StackOverflow example; it has nothing to do with SimPEG being slow.

The point was that slicing a multidimensional array is slower than using the whole array (contiguous layout); see the code and figure below.

The problem here is probably the sparse hstack. It is called every time mesh.aveE2CC is accessed. I understand that this value cannot simply be saved to a variable, but maybe memoization could help?

In Python 3 there is lru_cache in functools; for Python 2 we would probably need to write a custom memoization function.
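(As a rough sketch of the memoization idea, one pattern that works in both Python 2 and 3 is caching the operator in an instance attribute on first access, rather than using lru_cache; the class name and the _aveEx/_aveEy/_aveEz blocks below are hypothetical placeholders, not SimPEG's actual internals.)

import scipy.sparse as sp

class Mesh(object):
    def __init__(self, aveEx, aveEy, aveEz):
        # Hypothetical per-component averaging blocks
        self._aveEx, self._aveEy, self._aveEz = aveEx, aveEy, aveEz
        self._aveE2CC = None  # cache: empty until first access

    @property
    def aveE2CC(self):
        # Run the expensive sparse hstack only once, then reuse the result
        if self._aveE2CC is None:
            self._aveE2CC = sp.hstack(
                (self._aveEx, self._aveEy, self._aveEz), format='csr'
            )
        return self._aveE2CC

If the mesh can change after construction, the cache would need to be invalidated, which is what the "lock" idea below would make explicit.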

Other options are:

  • the possibility to “lock” the mesh so aveE2CC can be saved internally
  • avoiding the looping (is this even possible?)
  • creating a custom hstack, see link here and the sketch after this list
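(Here is a sketch of what a custom hstack could look like: for COO blocks, the coordinate arrays can be concatenated directly, skipping the generic block-matrix machinery that scipy.sparse.hstack goes through. The helper name and structure are my own, not from the linked answer.)

import numpy as np
import scipy.sparse as sp

def coo_hstack(blocks):
    # Stack COO matrices side by side by concatenating their
    # (data, row, col) arrays, shifting column indices per block.
    nrows = blocks[0].shape[0]
    ncols = sum(b.shape[1] for b in blocks)
    offsets = np.cumsum([0] + [b.shape[1] for b in blocks[:-1]])
    data = np.concatenate([b.data for b in blocks])
    row = np.concatenate([b.row for b in blocks])
    col = np.concatenate([b.col + off for b, off in zip(blocks, offsets)])
    return sp.coo_matrix((data, (row, col)), shape=(nrows, ncols))

# e.g. coo_hstack([Ax.tocoo(), Ay.tocoo(), Az.tocoo()]).tocsr()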

The example array A (see code below) has size 1M x 1M. I am not sure what the size of your 3D mesh is, but if you run hstack 600 times it will take some time even if it is smaller than the example here. It would be a good idea to check what the case is with your mesh.

A
<1000000x1000000 sparse matrix of type '<class 'numpy.float64'>'
with 2999998 stored elements in Compressed Sparse Row format>

%timeit sp.hstack((A, A, A))
1 loop, best of 3: 297 ms per loop

%timeit sp.hstack((A, A, A), format='csr')
1 loop, best of 3: 1.49 s per loop

%timeit sp.hstack((A, A, A), format='coo')
1 loop, best of 3: 301 ms per loop

A_ = A.tocoo()
A_
<1000000x1000000 sparse matrix of type '<class 'numpy.float64'>'
with 2999998 stored elements in COOrdinate format>

%timeit sp.hstack((A_, A_, A_))
1 loop, best of 3: 180 ms per loop

%timeit sp.hstack((A_, A_, A_), format='csr')
1 loop, best of 3: 1.2 s per loop

%timeit sp.hstack((A_, A_, A_), format='coo')
1 loops, best of 3: 181 ms per loop

The timing code below is taken from StackOverflow:

import numpy as np
import scipy.sparse as sp
import matplotlib.pyplot as plt
plt.style.use('ggplot')

# Tridiagonal 1M x 1M sparse matrix in CSR format
n = int(1e6)
m = 100
e = np.ones(n)
A = sp.spdiags(np.vstack((e, e, e)), np.array([-1, 0, 1]), n, n)
A = A.tocsr()

# The same dense right-hand sides in C (row-major) and Fortran (column-major) order
u_C = np.random.randn(n, m)
u_F = np.asfortranarray(u_C.copy())

# Multiply by a column slice of the C-ordered array (non-contiguous columns)
col_slice_C = list(range(1, 101, 5))
times_slice_C = []
for col_size in col_slice_C:
    t = %timeit -o -q A * u_C[:, :col_size]
    times_slice_C.append(t.best)

# Multiply by a column slice of the Fortran-ordered array (contiguous columns)
col_slice_F = list(range(1, 101, 5))
times_slice_F = []
for col_size in col_slice_F:
    t = %timeit -o -q (A * u_F[:, :col_size])
    times_slice_F.append(t.best)

# Copy the slice into a fresh contiguous array before multiplying
col_copy = list(range(1, 101, 5))
times_copy = []
for col_size in col_copy:
    u_ = u_C[:, :col_size].copy()
    t = %timeit -o -q A * u_
    times_copy.append(t.best)

# Top panel: total time; bottom panel: time per column
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(7, 14))

ax1.scatter(col_slice_C, np.array(times_slice_C)*1000, lw=3, label='C-slice', color='royalblue')
ax1.scatter(col_slice_F, np.array(times_slice_F)*1000, lw=3, label='F-slice', color='orange')
ax1.scatter(col_copy, np.array(times_copy)*1000, lw=3, label='1D copy', color='tomato')

ax2.scatter(col_slice_C,
            np.array(times_slice_C)*1000/np.array(col_slice_C),
            lw=3, label='C-slice', color='royalblue')
ax2.scatter(col_slice_F,
            np.array(times_slice_F)*1000/np.array(col_slice_F),
            lw=3, label='F-slice', color='orange')
ax2.scatter(col_copy,
            np.array(times_copy)*1000/np.array(col_copy),
            lw=3, label='1D copy', color='tomato')

ax2.set_ylim(0, ax2.get_ylim()[1])

ax1.legend(fontsize=12)
ax2.legend(fontsize=12)

ax1.set_xlabel('slice size', fontsize=12)
ax1.set_ylabel('time (ms)', fontsize=12)

ax2.set_xlabel('slice size', fontsize=12)
ax2.set_ylabel('time per col (ms)', fontsize=12)

fig.savefig("./timing_example", dpi=200, bbox_inches='tight')

1 reaction
sgkang commented, Apr 21, 2017

That is true @lheagy. I’ll put it together and make an issue there!


