[SOLVED] Best way to implement bit shifting over a block of memory


I’d like to bit-shift a pointed-to block of memory of arbitrary size. I thought about working on the problem char by char, and I came up with this solution, which works.

#include <limits.h>   /* CHAR_BIT */
#include <stdbool.h>
#include <stddef.h>   /* size_t */

void rshift(void *self, int shift, size_t size) {
    char *s = self;
    bool take, pass;
    for (int i = 0; i < shift; i++) {
        take = false;
        for (size_t j = 0; j < size; j++, take = pass) {
            pass = *(s + j) & 1;                                /* bit that falls off this byte */
            *(s + j) >>= 1;
            *(s + j) ^= (-take ^ *(s + j)) & (1 << (CHAR_BIT - 1));  /* set top bit to take */
        }
    }
}

Here self is the memory block to shift, size is its size in chars, and shift is how many bits to shift. Essentially, for each byte I shift it by one, passing the bit that would be lost to the next byte.

However, this algorithm is quite terrible — it runs in O(shift * size) — and I’m quite sure the whole thing can be solved in O(size).

Note: I showed right shift; left shifting is essentially the same.


First of all, you do not need the shift loop. You can first compute the remainder int n = shift % CHAR_BIT; and the quotient int m = shift / CHAR_BIT;. Indeed, shifting the block by k * CHAR_BIT bits is equivalent to moving whole bytes directly by k positions. Moreover, you can directly extract the n least significant bits of *(s + j) using int pass = *(s + j) & ((1 << n) - 1);. Then, a shift of n bits can be done directly with a simple right shift (*(s + j) >>= n;). Finally, the "merge" of the pass saved from the previous byte into the shifted value can be done using *(s + j) |= pass << (CHAR_BIT - n);. Note that these operations should work on unsigned char variables, which are safer than signed char here, since a right shift of a negative signed type is implementation-defined. This makes the algorithm run in O(size) time.

An optimisation is to do nothing at all when shift is 0, and otherwise to use memmove when n is 0 (the shift then reduces to a plain byte move).

Another optimization is to work on several char items at a time using bigger types like uint64_t. However, this assumes CHAR_BIT is 8, which is the case on almost all (sane) modern processors. You should still be very careful with this optimization, since the loaded/stored values need to be correctly aligned and the type punning must be done safely (using either memcpy or a union) so as not to break the strict aliasing rule (which would result in undefined behaviour).

An alternative optimization is to use SIMD intrinsics (like SSE or AVX2 on modern x86-64 processors, or NEON on most ARM processors). This optimization is more efficient than working on bigger scalar types. However, using intrinsics requires advanced programming skills, makes the code less portable, and often makes it barely readable. SSE/NEON can work simultaneously on 16 8-bit items per instruction and AVX2 on up to 32 8-bit items, resulting in drastically faster code. Note that compilers can sometimes do this automatically for you (assuming optimization flags are enabled).

Note that writing s[j] instead of *(s + j) is shorter and easier to read. Also note that the expression -take ^ *(s + j) likely has implementation-defined behaviour, since -1 can be represented differently on different architectures (see two’s complement and ones’ complement), although almost all modern processors use two’s complement.

Answered By – Jérôme Richard

Answer Checked By – Cary Denson (BugsFixing Admin)
