Introduction
This lesson gives a brief introduction to code vectorization and explains it with the help of an example.
We'll cover the following...
Problem vectorization is much harder than code vectorization because it means that you fundamentally have to rethink your problem in order to make it vectorizable. Most of the time this means you have to use a different algorithm to solve your problem or even worse… to invent a new one. The difficulty is thus to think out-of-the-box.
To illustrate this, let’s consider a simple problem where given two vectors X
and
Y
, we want to compute the sum of X[i]*Y[j]
for all pairs of indices i
,
j
.
One simple and obvious solution is given below.
def compute_python(X, Y):result = 0for i in range(len(X)):for j in range(len(Y)):result += X[i] * Y[j]return resultprint(compute_python([1,2],[1,2]))
However, this first and naïve implementation requires two loops and we already know it will be slow:
import numpy as npfrom tools import timeitdef compute_python(X, Y):result = 0for i in range(len(X)):for j in range(len(Y)):result += X[i] * Y[j]return resultX = np.arange(1000)timeit("compute_python(X,X)", globals())
How to vectorize the problem then? If you remember your linear algebra course,
you may have identified the expression X[i] * Y[j]
to be very similar to a
matrix product expression. So maybe we could benefit from some NumPy
speedup. One wrong solution would be to write:
def compute_numpy_wrong(X, Y):return (X*Y).sum()print(compute_numpy_wrong([1,2],[1,2]))
This is wrong because the X*Y
expression will actually compute a new vector
Z
such that Z[i] = X[i] * Y[i]
and this is not what we want. Instead, we
can exploit NumPy broadcasting by first reshaping the two vectors and then
multiplying them:
def compute_numpy(X, Y):Z = X.reshape(len(X),1) * Y.reshape(1,len(Y))return Z.sum()
Here we have Z[i,j] == X[i,0]*Y[0,j]
and if we take the sum over each elements of
Z
, we get the expected result. Let’s see how much speed we gain in the
process:
import numpy as npfrom tools import timeitdef compute_numpy(X, Y):Z = X.reshape(len(X),1) * Y.reshape(1,len(Y))return Z.sum()X = np.arange(1000)timeit("compute_numpy(X,X)", globals())
This is better, we gained a factor of ~150. But we can do much better.
If you look again and more closely at the pure Python version, you can see that
the inner loop is using X[i]
that does not depend on the j
index, meaning
it can be removed ...