I am using cython for the first time to get some speed for a function. The function takes a square matrix of floating point numbers and outputs a single floating point number: the permanent of the matrix.
When the matrix is 30 by 30, my code currently takes about 60 seconds on my PC.
In the code below I have implemented the Balasubramanian-Bax-Franklin-Glynn formula for the permanent from the Wikipedia page. I have called the matrix M.
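For reference, this is the formula as I have transcribed it from that page (so worth double-checking against the source):

$$\operatorname{perm}(M) = \frac{1}{2^{n-1}} \sum_{\delta} \Big( \prod_{k=1}^{n} \delta_k \Big) \prod_{j=1}^{n} \sum_{i=1}^{n} \delta_i m_{ij},$$

where the sum runs over the $2^{n-1}$ vectors $\delta = (\delta_1, \dots, \delta_n) \in \{+1, -1\}^n$ with $\delta_1 = 1$. In the code, the array d plays the role of $\delta$, v holds the signed column sums $\sum_i \delta_i m_{ij}$, s tracks $\prod_k \delta_k$, and the division by $2^{n-1}$ is the final return statement.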
One sophisticated part of the code is the array f, which holds the index of the next position to flip in the array d. The array d holds values that are ±1. The manipulation of f and j in the loop is just a clever way to step through a Gray code quickly.
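To show what that bookkeeping does in isolation, here is a minimal plain-Python sketch of just the f and j updates (my own illustration, not part of the function). It prints the sequence of flip positions, which for n = 4 comes out as 0, 1, 0, 2, 0, 1, 0, the standard reflected Gray code order over the first n-1 positions:

    # Gray code stepping on its own, stripped of the permanent computation.
    n = 4
    f = list(range(n))   # f[k] holds the index of the next position to flip
    j = 0
    while j < n - 1:
        print("flip position", j)   # in the real function: flip d[j] and update v
        f[0] = 0
        f[j] = f[j + 1]
        f[j + 1] = j + 1
        j = f[0]

The full cython function is below.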
    from __future__ import division
    import numpy as np
    cimport numpy as np
    cimport cython

    DTYPE_int = np.int
    ctypedef np.int_t DTYPE_int_t
    DTYPE_float = np.float64
    ctypedef np.float64_t DTYPE_float_t

    @cython.boundscheck(False)  # turn off bounds-checking for entire function
    @cython.wraparound(False)   # turn off negative index wrapping for entire function
    def permfunc(np.ndarray [DTYPE_float_t, ndim=2, mode='c'] M):
        cdef int n = M.shape[0]
        cdef np.ndarray [DTYPE_float_t, ndim=1, mode='c'] d = np.ones(n, dtype=DTYPE_float)
        cdef int j = 0
        cdef int s = 1
        cdef np.ndarray [DTYPE_int_t, ndim=1, mode='c'] f = np.arange(n, dtype=DTYPE_int)
        cdef np.ndarray [DTYPE_float_t, ndim=1, mode='c'] v = M.sum(axis=0)
        cdef DTYPE_float_t p = 1
        cdef int i
        cdef DTYPE_float_t prod
        for i in range(n):
            p *= v[i]
        while (j < n-1):
            for i in range(n):
                v[i] -= 2*d[j]*M[j, i]   # update the signed column sums for the flipped position
            d[j] = -d[j]
            s = -s
            prod = 1
            for i in range(n):
                prod *= v[i]
            p += s*prod
            f[0] = 0                     # Gray code bookkeeping: find the next position to flip
            f[j] = f[j+1]
            f[j+1] = j+1
            j = f[0]
        return p/2**(n-1)
I have used all the simple optimizations I found in the cython tutorial. Some aspects I have to admit I don't fully understand. For example, if I make the array d ints, which seems natural as its values are only ever ±1, the code runs about 10% more slowly, so I have left it as float64.
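For reference, that slower variant changes only the declaration of d inside permfunc (everything else stays the same):

    # Slower variant I tried: d as a C-long array instead of float64
    cdef np.ndarray [DTYPE_int_t, ndim=1, mode='c'] d = np.ones(n, dtype=DTYPE_int)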
Is there anything else I can do to speed up the code?
This is the result of cython -a. As you can see, everything in the loop is compiled to C, so the basic optimizations have worked.
Here is the same function in numpy, which is over 100 times slower than my current cython version.
    def npperm(M):
        n = M.shape[0]
        d = np.ones(n)
        j = 0
        s = 1
        f = np.arange(n)
        v = M.sum(axis=0)
        p = np.prod(v)
        while (j < n-1):
            v -= 2*d[j]*M[j]
            d[j] = -d[j]
            s = -s
            prod = np.prod(v)
            p += s*prod
            f[0] = 0
            f[j] = f[j+1]
            f[j+1] = j+1
            j = f[0]
        return p/2**(n-1)
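As a quick sanity check (not part of any timing), both versions can be compared against a brute-force permanent computed straight from the definition on a small matrix; the helper below exists only for this check:

    import itertools
    import numpy as np

    def brute_perm(M):
        # Permanent from the definition: sum over all permutations of
        # products of entries. Only feasible for small n.
        n = M.shape[0]
        return sum(np.prod([M[i, p[i]] for i in range(n)])
                   for p in itertools.permutations(range(n)))

    A = np.random.rand(6, 6)
    print(np.allclose(npperm(A), brute_perm(A)))   # expect True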
Here are timings (using ipython) of my cython version, the numpy version and romeric's improvement to the cython code. I have set the seed for reproducibility.
    from scipy.stats import ortho_group
    import pyximport; pyximport.install()
    import permlib  # This loads in the functions from permlib.pyx
    import numpy as np; np.random.seed(7)
    M = ortho_group.rvs(23)  # Creates a random orthogonal matrix

    %timeit permlib.npperm(M)  # The numpy version
    1 loop, best of 3: 44.5 s per loop

    %timeit permlib.permfunc(M)  # The cython version
    1 loop, best of 3: 273 ms per loop

    %timeit permlib.permfunc_modified(M)  # romeric's improvement
    10 loops, best of 3: 198 ms per loop

    M = ortho_group.rvs(28)

    %timeit permlib.permfunc(M)  # The cython version run on a 28x28 matrix
    1 loop, best of 3: 15.8 s per loop

    %timeit permlib.permfunc_modified(M)  # romeric's improvement run on a 28x28 matrix
    1 loop, best of 3: 12.4 s per loop
Can the cython code be sped up at all?
I am using gcc and the CPU is an AMD FX 8350.