## Can this cython code be optimized?

I am using cython for the first time to get some speed for a function. The function takes a square matrix `A`

floating point numbers and outputs a single floating point number. The function it is computing is the permanent of a matrix

When A is 30 by 30 my code takes about 60 seconds currently on my PC.

In the code below I have implemented the Balasubramanian-Bax/Franklin-Glynn formula for the permanent from the wiki page. I have called the matrix M.

One sophisticated part of the code is the array f which is used to hold the index of the next position to flip in the array d. The array d holds values that are +-1. The manipulation of f and j in the loop is just a clever way to update a Gray code quickly.

```
from __future__ import division
import numpy as np
cimport numpy as np
cimport cython
DTYPE_int = np.int
ctypedef np.int_t DTYPE_int_t
DTYPE_float = np.float64
ctypedef np.float64_t DTYPE_float_t
@cython.boundscheck(False) # turn off bounds-checking for entire function
@cython.wraparound(False) # turn off negative index wrapping for entire function
def permfunc(np.ndarray [DTYPE_float_t, ndim =2, mode='c'] M):
cdef int n = M.shape[0]
cdef np.ndarray[DTYPE_float_t, ndim =1, mode='c' ] d = np.ones(n, dtype=DTYPE_float)
cdef int j = 0
cdef int s = 1
cdef np.ndarray [DTYPE_int_t, ndim =1, mode='c'] f = np.arange(n, dtype=DTYPE_int)
cdef np.ndarray [DTYPE_float_t, ndim =1, mode='c'] v = M.sum(axis=0)
cdef DTYPE_float_t p = 1
cdef int i
cdef DTYPE_float_t prod
for i in range(n):
p *= v[i]
while (j < n-1):
for i in range(n):
v[i] -= 2*d[j]*M[j, i]
d[j] = -d[j]
s = -s
prod = 1
for i in range(n):
prod *= v[i]
p += s*prod
f[0] = 0
f[j] = f[j+1]
f[j+1] = j+1
j = f[0]
return p/2**(n-1)
```

I have used all the simple optimizations I found in the cython tutorial. Some aspects I have to admit I don't fully understand. For example, if I make the array `d`

ints, as the values are only ever +-1, the code runs about 10% more slowly so I have left it as float64s.

Is there anything else I can do to speed up the code?

This is the result of cython -a . As you can see everything in the loop is compiled to C so the basic optimizations have worked.

Here is the same function in numpy which is over 100 times slower than my current cython version.

```
def npperm(M):
n = M.shape[0]
d = np.ones(n)
j = 0
s = 1
f = np.arange(n)
v = M.sum(axis=0)
p = np.prod(v)
while (j < n-1):
v -= 2*d[j]*M[j]
d[j] = -d[j]
s = -s
prod = np.prod(v)
p += s*prod
f[0] = 0
f[j] = f[j+1]
f[j+1] = j+1
j = f[0]
return p/2**(n-1)
```

**Timings updated**

Here are timings (using ipython) of my cython version, the numpy version and romeric's improvement to the cython code. I have set the seed for reproducibility.

```
from scipy.stats import ortho_group
import pyximport; pyximport.install()
import permlib # This loads in the functions from permlib.pyx
import numpy as np; np.random.seed(7)
M = ortho_group.rvs(23) #Creates a random orthogonal matrix
%timeit permlib.npperm(M) # The numpy version
1 loop, best of 3: 44.5 s per loop
%timeit permlib.permfunc(M) # The cython version
1 loop, best of 3: 273 ms per loop
%timeit permlib.permfunc_modified(M) #romeric's improvement
10 loops, best of 3: 198 ms per loop
M = ortho_group.rvs(28)
%timeit permlib.permfunc(M) # The cython version run on a 28x28 matrix
1 loop, best of 3: 15.8 s per loop
%timeit permlib.permfunc_modified(M) # romeric's improvement run on a 28x28 matrix
1 loop, best of 3: 12.4 s per loop
```

Can the cython code be sped up at all?

I am using gcc and the CPU is the AMD FX 8350.

Show source

## Answers ( 2 )

Well, one obvious optimization is to set d[i] to -2 and +2 and avoid the multiplication by 2. I suspect this won't make any difference, but still.

Another is to make sure the C++ compiler that compiles the resulting code has all the optimizations turned on (especially vectorization).

The loop that calculates the new v[i]s can be parallelized with Cython's support of OpenMP. At 30 iterations this also might not make a difference.

There isn't much you can do with your

`cython`

function, as it is already well optimised. However, you will still be able to get a moderate speed-up by completely avoiding the calls to`numpy`

.Here are essential checks and timings:

EDITLets perform some basic`SSE`

vectorisation by unrolling the inner`prod`

loop, that is change the loop in the above code to the followingand now the timing

So almost

2Xspeed-up over the original optimised cython code, not bad.