# [SOLVED] Optimizing a matrix transpose function with OpenMP

## Issue

I have this code that transposes a matrix using loop tiling strategy.

```c
void transposer(int n, int m, double *dst, const double *src) {
    const int blocksize = 32;  // tune for cache; assumed to divide n and m
    for (int i = 0; i < n; i += blocksize) {
        for (int j = 0; j < m; j += blocksize) {
            // transpose the block beginning at [i,j]
            for (int k = i; k < i + blocksize; ++k) {
                for (int l = j; l < j + blocksize; ++l) {
                    dst[k + l*n] = src[l + k*m];
                }
            }
        }
    }
}
```

I want to optimize this with multi-threading using OpenMP, but I am not sure how to handle so many nested for loops. I thought about just adding `#pragma omp parallel for`, but doesn't that only parallelize the outermost loop?

## Solution

When you try to parallelize a loop nest, you should ask yourself how many levels are conflict-free, meaning every iteration writes to a different location. If two iterations (potentially) write to the same location, you need to either:

1. use a reduction,
2. use a critical section or other synchronization,
3. decide that this loop is not worth parallelizing, or
4. rewrite your algorithm.
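To illustrate option 1 with a hypothetical example unrelated to the transpose (the function name and the summation are mine): if every iteration accumulated into one shared scalar, a `reduction` clause would resolve the conflict:

```c
#include <stddef.h>

// Hypothetical sketch of option 1: every iteration writes the shared
// variable `sum`, which would be a data race without the reduction clause.
double sum_elements(size_t n, const double *a) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (size_t i = 0; i < n; ++i) {
        sum += a[i];  // each thread accumulates a private copy, merged at the end
    }
    return sum;
}
```

The reduction gives each thread a private accumulator and combines them after the loop, so no synchronization is needed inside the loop body.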

In your case, the write index is `k + l*n`. Since `0 <= k < n`, two distinct pairs `(k,l)` and `(k',l')` always map to distinct locations. Furthermore, no two inner iterations have the same `(k,l)` pair. So all four loops are parallel, and they are perfectly nested, so you can use `collapse(4)`. (One caveat: the bounds of the two inner loops depend on the outer indices, so collapsing all four requires a compiler with OpenMP 5.0's non-rectangular collapse support; collapsing just the two outer block loops with `collapse(2)` is portable.)
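A minimal sketch of the parallel version (the blocksize value and the assumption that it divides `n` and `m` are mine; `collapse(2)` is shown for portability, since `collapse(4)` on this nest needs OpenMP 5.0):

```c
#include <omp.h>

void transposer_omp(int n, int m, double *dst, const double *src) {
    const int blocksize = 32;  // assumption: blocksize divides n and m
    // Each (i,j) block writes a disjoint region of dst, so the
    // collapsed block loops are conflict-free.
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < n; i += blocksize) {
        for (int j = 0; j < m; j += blocksize) {
            for (int k = i; k < i + blocksize; ++k) {
                for (int l = j; l < j + blocksize; ++l) {
                    dst[k + l*n] = src[l + k*m];
                }
            }
        }
    }
}
```

Collapsing only the block loops usually exposes enough parallelism anyway, since there are `(n/blocksize) * (m/blocksize)` independent blocks to distribute across threads.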

You could also have drawn this conclusion by considering the algorithm in the abstract: in a matrix transposition each target location is written exactly once, so no matter how you traverse the target data structure, it’s completely parallel.
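Concretely, even the untiled traversal parallelizes cleanly for the same reason (a minimal sketch; the function name is mine):

```c
// Untiled transpose: every (k,l) pair writes a distinct dst element,
// so the collapsed loop nest is conflict-free.
void transpose_simple(int n, int m, double *dst, const double *src) {
    #pragma omp parallel for collapse(2)
    for (int k = 0; k < n; ++k)
        for (int l = 0; l < m; ++l)
            dst[k + l*n] = src[l + k*m];
}
```

This version loses the cache benefits of tiling, but it shows that the parallelism itself does not depend on the traversal order.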