## Issue

I have this code that transposes a matrix using loop tiling strategy.

```
void transposer(int n, int m, double *dst, const double *src) {
int blocksize;
for (int i = 0; i < n; i += blocksize) {
for (int j = 0; j < m; j += blocksize) {
// transpose the block beginning at [i,j]
for (int k = i; k < i + blocksize; ++k) {
for (int l = j; l < j + blocksize; ++l) {
dst[k + l*n] = src[l + k*m];
}
}
}
}
}
```

I want to optimize this with multi-threading using OpenMP, however I am not sure what to do when having so many nested for loops. I thought about just adding `#pragma omp parallel for`

but doesn’t this just parallelize the outer loop?

## Solution

When you try to parallelize a loop nest, you should ask yourself how many levels are conflict free. As in: every iteration writing to a different location. If two iterations write (potentially) to the same location, you need to 1. use a reduction 2. use a critical section or other synchronization 3. decide that this loop is not worth parallelizing, or 4. rewrite your algorithm.

In your case, the write location depends on `k,l`

. Since `k<n`

and `l*n`

, there are no pairs `k.l`

/ `k',l'`

that write to the same location. Furthermore, there are no two inner iterations that have the same `k`

or `l`

value. So all four loops are parallel, and they are perfectly nested, so you can use `collapse(4)`

.

You could also have drawn this conclusion by considering the algorithm in the abstract: in a matrix transposition each target location is written exactly once, so no matter how you traverse the target data structure, it’s completely parallel.

Answered By – Victor Eijkhout

Answer Checked By – Katrina (BugsFixing Volunteer)