[SOLVED] Most time-efficient sorting algorithm for array of struct with conditions

Issue

I know quicksort is the best sorting algorithm and this would be my first thought as well. But if I have the following situation:

typedef struct{
   char name[35];
   int age;
}person;

If I have an array of type person with N elements and I want to sort people aged between 30-40 by name. Would quicksort still be the best here, since we don’t sort all the people in the array, just the ones aged between 30 to 40. If so how would I pick the pivot (since picking the middle element wouldn’t guarantee that the person is in the right age range)? I’ve also thought that maybe merge sort would be a good candidate, since I could just split the array into 2 different arrays and put the conditions.

Solution

The first thing to do is to find the items with age in the range 30..40 and copy them in an optimized growing data structure. This data structure is an array with a capacity growing exponentially every time there is not enough remaining space (ie. std::vector in C++ or for example this in C).

Each item in this new data structure is defined as:

typedef struct {
   uint32_t name_prefix;
   uint32_t id; /* Assume the number of persons is not bigger than 4 billion */
} person_ref;

where name_prefix contains the first characters of person::name and where id contains the position of the item in the initial unfiltered array. Assuming person::name contains ASCII/ANSI characters, it can be for example (p.name[0] << 24) | (p.name[1] << 16) | (p.name[2] << 8) | p.name[3]. While it may sound expensive, modern mainstream processors can do that in only 1 fast instruction. If there is some additional restriction about the name (eg. upper case printable ASCII characters), you can pack more character in this prefix.

Then you can sort the new data structure using a simple quick-sort (or better: an Introsort). Note that qsort can be used in C and std::sort in C++. The comparison operator can be:

/* For a C++ code, please use references instead of pointers */
bool compare(const person& p1, const person* p2) {
    const uint32_t prefix1 = p1.name_prefix;
    const uint32_t prefix2 = p2.name_prefix;
    return prefix1 < prefix2 || 
           (prefix1 == prefix2 && strcmp(array[p1.id].name, array[p2.id].name) < 0);
}

where array is the original array with all the unfiltered items.

Finally, you can copy the item back thanks to the person_ref::id field.


This approach works well since the prefixes of the names are assumed often not to be equal. Indeed, comparing integers is much faster than comparing two strings with strcmp. Moreover, working on small items makes sorting algorithm faster because the copy is less expensive and also because the full array can better fit in fast CPU caches. If a lot prefixes are equal and the input data structure is big, then it should be better to use a 64-bit integer prefix, or to copy the filtered items in another data structure in the worst case (so to better use CPU caches).

If the number of filtered items is huge and you want an even faster sort, you can use a linear-time radix-sort combined with an intro-sort so to sort the person sharing the same prefix.

Answered By – Jérôme Richard

Answer Checked By – Jay B. (BugsFixing Admin)

Leave a Reply

Your email address will not be published. Required fields are marked *