PERF Optimize dot product order
When multiplying 3 or more matrices, the order of parenthesization doesn't affect the result, but it can have a very significant impact on the number of operations and on performance; see https://en.wikipedia.org/wiki/Matrix_chain_multiplication
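As a quick illustration (the shapes below are made up purely for demonstration), both parenthesizations compute the same product, but their operation counts differ by roughly three orders of magnitude:

```python
import numpy as np
from timeit import timeit

rng = np.random.RandomState(0)
A = rng.rand(10, 5_000)
B = rng.rand(5_000, 10)
C = rng.rand(10, 5_000)

# (A @ B) @ C needs ~1e6 scalar multiplications (small 10 x 10 intermediate),
# while A @ (B @ C) needs ~5e8 (huge 5_000 x 5_000 intermediate).
print(timeit(lambda: (A @ B) @ C, number=5))
print(timeit(lambda: A @ (B @ C), number=5))
```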
For matrix multiplication of dense arrays there is numpy.linalg.multi_dot, and I think we should use it. To find existing occurrences where it could be used, see for instance the result of
git grep 'dot(.*dot'
sklearn/datasets/_samples_generator.py: return np.dot(np.dot(u, s), v.T)
sklearn/datasets/_samples_generator.py: X = np.dot(np.dot(U, 1.0 + np.diag(generator.rand(n_dim))), Vt)
sklearn/decomposition/_fastica.py: w -= np.dot(np.dot(w, W[:j].T), W[:j])
sklearn/decomposition/_fastica.py: return np.dot(np.dot(u * (1. / np.sqrt(s)), u.T), W)
sklearn/decomposition/_fastica.py: S = np.dot(np.dot(W, K), X).T
sklearn/decomposition/_nmf.py: norm_WH = trace_dot(np.dot(np.dot(W.T, W), H), H)
sklearn/decomposition/_nmf.py: denominator = np.dot(np.dot(W.T, W), H)
sklearn/discriminant_analysis.py: self.coef_ = np.dot(self.means_, evecs).dot(evecs.T)
sklearn/gaussian_process/_gpc.py: s_1 = .5 * a.T.dot(C).dot(a) - .5 * R.T.ravel().dot(C.ravel())
sklearn/gaussian_process/_gpc.py: s_3 = b - K.dot(R.dot(b)) # Line 14
sklearn/linear_model/_bayes.py: coef_ = np.dot(X.T, np.dot(
sklearn/linear_model/_logistic.py: ret[:n_features] = X.T.dot(dX.dot(s[:n_features]))
sklearn/linear_model/_ridge.py: AXy = A.dot(X_op.T.dot(y))
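For illustration, the first occurrence above could be rewritten with multi_dot as follows (the shapes here are invented for the example and are not the ones _samples_generator.py actually uses):

```python
import numpy as np
from numpy.linalg import multi_dot

rng = np.random.RandomState(0)
u = rng.rand(500, 10)
s = rng.rand(10, 10)
v = rng.rand(500, 10)

# nested form currently in the code
out_nested = np.dot(np.dot(u, s), v.T)
# multi_dot chooses the cheapest parenthesization from the shapes
out_multi = multi_dot([u, s, v.T])

assert np.allclose(out_nested, out_multi)
```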
Ideally each replacement should be benchmarked.
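A minimal benchmarking sketch using timeit, with shapes picked arbitrarily (real benchmarks would need the shapes each estimator actually encounters):

```python
from timeit import timeit
import numpy as np
from numpy.linalg import multi_dot

rng = np.random.RandomState(0)
W = rng.rand(2_000, 50)
H = rng.rand(50, 3_000)

nested = lambda: np.dot(np.dot(W.T, W), H)   # as in _nmf.py
chained = lambda: multi_dot([W.T, W, H])

print("nested np.dot:", timeit(nested, number=20))
print("multi_dot:    ", timeit(chained, number=20))
```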
For matrix multiplication with safe_sparse_dot, using a combination of sparse and dense matrices, some of this could apply as well, though defining a general heuristic is probably a bit more difficult there.
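A toy illustration of why the order also matters in the mixed sparse/dense case (the shapes, density, and the trailing vector are arbitrary choices for the example, not a proposed heuristic):

```python
import numpy as np
import scipy.sparse as sp
from timeit import timeit
from sklearn.utils.extmath import safe_sparse_dot

rng = np.random.RandomState(0)
X = sp.random(10_000, 2_000, density=0.01, format="csr", random_state=rng)
W = rng.rand(2_000, 2_000)
v = rng.rand(2_000)

# (X @ W) @ v materializes a large dense 10_000 x 2_000 intermediate ...
left_first = lambda: safe_sparse_dot(safe_sparse_dot(X, W), v)
# ... while X @ (W @ v) only ever builds a length-2_000 vector.
right_first = lambda: safe_sparse_dot(X, np.dot(W, v))

print("left first: ", timeit(left_first, number=5))
print("right first:", timeit(right_first, number=5))
```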
Top GitHub Comments
Turns out multi_dot can be slower [1] than dot depending on the size and variation in the size of the arrays. However, multi_dot uses a much simpler logic to identify the right order if the dot product is on 3 matrices [2]. Considering that most of the nested dot products in the code seem to have 3 matrices, maybe multi_dot can provide performance gains.

[1] https://stackoverflow.com/questions/45852228/how-is-numpy-multi-dot-slower-than-numpy-dot
[2] https://github.com/numpy/numpy/blob/94721320b1e13fd60046dc8bd0d343c54c2dd2e9/numpy/linalg/linalg.py#L2664
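For reference, the 3-matrix special case mentioned above reduces to comparing two closed-form costs computed from the shapes; a rough sketch in the spirit of the linked numpy code (not a verbatim copy):

```python
import numpy as np

def dot_three(A, B, C):
    # Choose the cheaper of (AB)C and A(BC) from the shapes alone,
    # in the spirit of numpy.linalg.multi_dot's 3-array special case.
    a0, a1 = A.shape
    b1 = B.shape[1]
    c1 = C.shape[1]
    cost_left = a0 * b1 * (a1 + c1)    # cost of (AB) plus (AB)C
    cost_right = a1 * c1 * (a0 + b1)   # cost of (BC) plus A(BC)
    if cost_left < cost_right:
        return np.dot(np.dot(A, B), C)
    return np.dot(A, np.dot(B, C))
```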
Yes, that makes sense. I got these errors when I changed them in the wrong places, such as in _ridge.py. I pushed the changes for FastICA, NMF, BayesianRidge, and ARDRegression in #17737. Do you want the changes for other modules that have not been benchmarked yet to go in a separate PR?