[SOLVED] Why is nested for loop not faster with multiple threads using OpenMP?

Issue

I wrote a function with the following 5 parameters:

  1. array of lines of a text;
  2. array of some words;
  3. number of lines (around 40k);
  4. number of words (about 4-5);
  5. number of threads to be created (k, tried many values such that 2<k<20).

The function searches for the given words in a for loop, scanning every line for one word in each iteration (outer for).

The inner for loop:

  • loops through the lines;
  • checks if the line contains the given word;
  • counts the word’s occurrence in the line;
  • prints a message if the word was found in the line;
  • increases the total occurrence of the word in the whole text with its occurrence in the line.

I wrote it as a serial program first and then tried to parallelize it using OpenMP, to compare the time each version takes to analyze the whole text.
Unfortunately I am not able to improve on the time of the serial version with the parallel program.

For measuring the time I use the following functions:

double ClockCounter;

void start()
{
        ClockCounter = omp_get_wtime();
}

void stop()
{
        ClockCounter = omp_get_wtime() - ClockCounter;
        printf("Elapsed time: %f\n", ClockCounter);
}

And the text analyzing function:

void analyze_text(char **lines, char *words[], size_t nr_lines, int nr_words, int k) {
        int i, total_word_count, word_occurence;
        start();
#       pragma omp parallel for num_threads(k) \
        default(none) shared(total_word_count, word_occurence) private(i, j, nr_words, nr_lines, word_occurence, counter)
        for (i=0; i<nr_words; i++) {
                total_word_count = 0;
                size_t j;
                printf("The word is: %s.\nSEARCHING...\n", words[i]);
//#             pragma omp parallel for num_threads(k) \
//              default(none) shared(total_word_count) private(j, nr_lines, word_occurence, counter)
                for (j=0; j<nr_lines; j++) {
                        char *line = NULL;
                        line = (char *) malloc (strlen(lines[j]) + 1);
                        strcpy(line, lines[j]);
                        word_occurence = 0;
                        if (strstr(line, words[i]) != NULL) {
                                printf("Word found at line %ld, position(s): ", j);
                                char *found_word = NULL;
                                found_word = (char *) malloc (strlen(words[i]) +1);
                                char delim[] = " -";
                                char *ptr = strtok(line, delim);
                                int counter =  0;
                                while (ptr != NULL) {
                                        counter++;
                                        strncpy(found_word, ptr, strlen(words[i]));
                                        found_word[strlen(found_word)] = '\0';
                                        if (strcmp(words[i], found_word) == 0) {
                                                word_occurence++;
                                                printf("%d ", counter);
                                        }
                                        ptr =  strtok(NULL, delim);

                                }
                                printf("\nline[%3zu]: %s\n", j, lines[j]);
                        }
                        total_word_count += word_occurence;
                }
                printf("The word appears %d times in the text in total.\n\n", total_word_count);
        }
        stop();
}

As you can see, I also tried parallelizing only the inner for loop, but I got no improvement with that either.
Since the nested for loops are a bit messy, I am not sure where and how to place the parallel for directive, nor how to correctly declare the variables as shared or private.

The sequential version looks exactly the same but without the omp directive being declared.

Time (~=) with the sequential version:

Elapsed time: 0.025583

Time (~=) with the parallel version (k=2, 4, 6, 8, always the same):

Elapsed time: 0.025690

Can somebody please explain to me what I am missing here?

Solution

TL;DR answer: Your code is memory bound (lots of memory reads and writes, practically no CPU-intensive calculation). The CPU can easily exhaust the available memory bandwidth, so increasing the number of threads will not improve the performance. This may happen on your hardware, and surprisingly it also happens on Compiler Explorer, but not on my computer. I am really puzzled.

Detailed answer:

Your code has some problems (a data race, a memory leak, a variable appearing more than once in data-sharing clauses, etc.), so I did the following to fix these errors and to make your code faster:

  1. Used the variables at their minimum required scope. This helps to avoid data races and helps the compiler make better optimizations.
  2. Corrected the OpenMP directives. Before the i loop you have to add the following line:

#pragma omp parallel for num_threads(k) default(none) shared(lines, words, nr_words, nr_lines)

The other alternative is to parallelize the j loop. There, to avoid a race condition, you have to use a reduction (reduction(+:total_word_count)):

#pragma omp parallel for num_threads(k) default(none) shared(i,lines, words, nr_words, nr_lines) reduction(+:total_word_count)

  3. Removed the time-consuming malloc and replaced it with static allocation:

//Maximum length of a line - modify it as necessary and make sure that no lines are longer than this value
#define MAX_LINE_LEN 1000

...
char line[MAX_LINE_LEN];
....
char found_word[MAX_LINE_LEN]; //it may be smaller

  4. strcpy(line, lines[j]); was moved inside the subsequent if statement. It is quite time consuming and is only needed when a word was found. So

char *line = NULL;
line = (char *) malloc (strlen(lines[j]) + 1);
strcpy(line, lines[j]);
word_occurence = 0;
if (strstr(line, words[i]) != NULL) {

was changed to

int word_occurence = 0;
if (strstr(lines[j], words[i]) != NULL) {
  char line[MAX_LINE_LEN];
  strcpy(line, lines[j]);
  ...
  //Note that the code below is not critical (rarely executed)...

  5. Wrapped all the printf calls in

#ifdef USE_PRINTF
 printf(...);
#endif

so it is easy to turn the printout on and off, which makes it easier to measure performance. Note that when more threads are used the printout will be interleaved, so you would have to collect the output first and print it later.

  6. Note that it would be much better to rewrite your code to loop over the lines of the text first and search for every word within a single line. In this case the cache utilization would be much better, which would increase the performance.

  7. Added a main function that creates a text (by randomly repeating a handful of predefined sample lines) to test the performance.

The code is the following:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <omp.h>

//Maximum length of a line - modify it as necessary and make sure that no lines are longer than this value
#define MAX_LINE_LEN 1000

//Uncomment this line to see printout
//#define USE_PRINTF

double ClockCounter;
void start()
{
        ClockCounter = omp_get_wtime();
}

void stop()
{
        ClockCounter = omp_get_wtime() - ClockCounter;
        printf("Elapsed time: %f\n", ClockCounter);
}


void analyze_text(char **lines, const char *words[], const size_t nr_lines, const int nr_words, const int k) {
        start();
        #pragma omp parallel for num_threads(k) default(none) shared(lines, words, nr_words, nr_lines) 
        for (int i=0; i<nr_words; i++) {
                int total_word_count = 0;
                #ifdef USE_PRINTF
                printf("The word is: %s.\nSEARCHING...\n", words[i]);
                #endif
                //#pragma omp parallel for num_threads(k) default(none) shared(i,lines, words, nr_words, nr_lines) reduction(+:total_word_count)
                for (size_t j=0; j<nr_lines; j++) {
                        int word_occurence = 0;
                        if (strstr(lines[j], words[i]) != NULL) {
                                char line[MAX_LINE_LEN];                        
                                strcpy(line, lines[j]);
                                #ifdef USE_PRINTF
                                  printf("Word found at line %zu, position(s): ", j);
                                #endif
                                char found_word[MAX_LINE_LEN]; 
                                char delim[] = " -";
                                char *ptr = strtok(line, delim);
                                int counter =  0;
                                while (ptr != NULL) {
                                        counter++;
                                        strncpy(found_word, ptr, strlen(words[i]));
                                        //strncpy does not null-terminate when the token is longer than the word
                                        found_word[strlen(words[i])] = '\0';
                                        if (strcmp(words[i], found_word) == 0) {
                                                word_occurence++;
                                                #ifdef USE_PRINTF
                                                printf("%d ", counter);
                                                #endif
                                        }
                                        ptr =  strtok(NULL, delim);

                                }
                                #ifdef USE_PRINTF
                                printf("\nline[%3zu]: %s\n", j, lines[j]);
                                #endif
                        }
                        total_word_count += word_occurence;
                }
                #ifdef USE_PRINTF
                printf("The word appears %d times in the text in total.\n\n", total_word_count);
                #endif
        }
        stop();
}

int main(){    
    const size_t nr_lines=40000;
    char* lines[nr_lines];
    //sample lines, the text have these lines in a random order
    const char* samples[]={
        "xxx yyy zzz qqq sss dddd aaaaa bbbbb cccc dd eeee fff ggg hh iii jjj kkkk llll mmm nnn oo pppp qqq rrrr ssss ttt",
        "one yyy zzz qqq sss dddd aaaaa bbbbb cccc dd eeee fff ggg hh iii jjj kkkk llll mmm nnn oo pppp qqq rrrr ssss ttt",
        "xxx two zzz qqq sss dddd aaaaa bbbbb cccc dd eeee fff ggg hh iii jjj kkkk llll mmm nnn oo pppp qqq rrrr ssss ttt",
        "xxx yyy three qqq sss dddd aaaaa bbbbb cccc dd eeee fff ggg hh iii jjj kkkk llll mmm nnn oo pppp qqq rrrr ssss ttt",
        "xxx yyy zzz four sss dddd aaaaa bbbbb cccc dd eeee fff ggg hh iii jjj kkkk llll mmm nnn oo pppp qqq rrrr ssss ttt",
        "xxx yyy zzz qqq five dddd aaaaa bbbbb cccc dd eeee fff ggg hh iii jjj kkkk llll mmm nnn oo pppp qqq rrrr ssss ttt"
    };


    for(size_t i=0;i<nr_lines;++i){
        unsigned long int rnd=rand()%1000;
        rnd = rnd >= sizeof(samples)/sizeof(samples[0]) ? 0 : rnd;
        lines[i]=(char*) malloc(strlen(samples[rnd])+1);
        strcpy(lines[i],samples[rnd]);

    }
    
    printf("nr_lines=%zu\n",nr_lines);

    const char* words[]={"one", "two", "three", "four","five"};
    const size_t nr_words=sizeof(words)/sizeof(words[0]);

    printf("nr_words=%zu\n",nr_words);

    for(int i=1;i<6;i++)
    {
        printf("Threads used=%d  --- ",i);
        analyze_text(lines, words, nr_lines, nr_words, i);
    }
  // You should free allocated memory here.
  return 0;
}

I tested the code on Compiler Explorer, first parallelizing the outer (i) loop, using 400000 lines (note that 40000 was simply too fast, so I used 10x more lines than you) and 5 words. The result was:

nr_lines=400000
nr_words=5
Threads used=1  --- Elapsed time: 0.032551
Threads used=2  --- Elapsed time: 0.028461
Threads used=3  --- Elapsed time: 0.029304
Threads used=4  --- Elapsed time: 0.027002
Threads used=5  --- Elapsed time: 0.026587

on my laptop the scaling with cores was much better:

gcc 5.c -fopenmp -O3 -mavx2  && ./a.out
nr_lines=400000
nr_words=5
Threads used=1  --- Elapsed time: 0.046173
Threads used=2  --- Elapsed time: 0.028958
Threads used=3  --- Elapsed time: 0.019951
Threads used=4  --- Elapsed time: 0.015462
Threads used=5  --- Elapsed time: 0.020994

I also parallelized the second (j) loop (in this case the first loop was not parallelized, to avoid nested parallelism). The results on Compiler Explorer:

nr_lines=400000
nr_words=5
Threads used=1  --- Elapsed time: 0.032781
Threads used=2  --- Elapsed time: 0.026288
Threads used=3  --- Elapsed time: 0.026601
Threads used=4  --- Elapsed time: 0.027013
Threads used=5  --- Elapsed time: 0.027567

On my laptop:

gcc 5.c -fopenmp -O3 -mavx2  && ./a.out
nr_lines=400000
nr_words=5
Threads used=1  --- Elapsed time: 0.047092
Threads used=2  --- Elapsed time: 0.024240
Threads used=3  --- Elapsed time: 0.016849
Threads used=4  --- Elapsed time: 0.014475
Threads used=5  --- Elapsed time: 0.012770
Threads used=6  --- Elapsed time: 0.011703
Threads used=7  --- Elapsed time: 0.009543
Threads used=8  --- Elapsed time: 0.012953

You can see that

  1. The results obtained on Compiler Explorer are disappointing. I simply do not know the reason for it. Most probably it is related to the memory bandwidth / available memory channels, although I suppose high-end processors are used to host Compiler Explorer.

  2. On my laptop the scaling is much better, especially when the second loop is parallelized. This is not surprising, because the cache utilization is much better in that case. Therefore I suggest rewriting your code to first go through the lines and search for all the words within a single line.

Final note: if performance is critical, you can most probably find a parallel library designed to do this type of search as efficiently as possible.

Answered By – Laci
