# [SOLVED] Best way to implement bit shifting over a block of memory

## Issue

I’d like to bit shift a pointed block of memory of arbitrary size. I thought about working on the problem `char` by `char` and I found out this solution that works.

```c
#include <limits.h>   /* CHAR_BIT */
#include <stdbool.h>
#include <stddef.h>   /* size_t */

void rshift(void *self, int shift, size_t size) {
    char *s = self;
    bool take, pass;
    for (int i = 0; i < shift; i++) {
        take = false;
        for (size_t j = 0; j < size; j++, take = pass) {
            pass = *(s + j) & 1;
            *(s + j) >>= 1;
            *(s + j) ^= (-take ^ *(s + j)) & (1 << (CHAR_BIT - 1));
        }
    }
}
```

Where `self` is the memory block to shift, `size` is its size in `char`s, and `shift` is how many bits to shift. Essentially, on each of the `shift` iterations I shift every byte right by one, passing the bit that falls off to the next byte.

However, this algorithm is quite terrible: it runs in `O(shift * size)`, and I'm quite sure the whole thing can be solved in `O(size)`.

Note: I showed the right shift; left shifting is essentially the same.

## Solution

First of all, you do not need the `shift` loop. You can split the shift with a modulus, `int n = shift % CHAR_BIT;`, and a division, `size_t m = shift / CHAR_BIT;`. Indeed, shifting by `k * CHAR_BIT` bits is equivalent to moving whole bytes by `k` positions. Moreover, you can directly extract the `n` least significant bits of `*(s + j)` using `int pass = *(s + j) & ((1 << n) - 1);`. Then, a shift of `n` bits can be done directly with a simple right shift (`*(s + j) >>= n;`). Finally, the "merge" of the bits passed from the previous byte with the freshly shifted value can be done using `*(s + j) |= pass << (CHAR_BIT - n);`. Note that these operations should work on `unsigned char` values, which are safer than a signed `char` here, since a right shift of a negative signed value is implementation-defined. This makes the algorithm run in `O(size)` time.
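Putting the pieces together, here is one possible sketch of the `O(size)` version (the function signature and the byte ordering, with bits flowing from byte 0 toward byte `size - 1`, follow the question's code; the exact structure is my own):

```c
#include <limits.h>  /* CHAR_BIT */
#include <stddef.h>  /* size_t */
#include <string.h>  /* memmove, memset */

void rshift(void *self, int shift, size_t size) {
    unsigned char *s = self;              /* unsigned: shifts are well defined */
    size_t m = (size_t)shift / CHAR_BIT;  /* whole-byte part of the shift */
    int n = shift % CHAR_BIT;             /* remaining bit part */

    if (m >= size) {                      /* everything is shifted out */
        memset(s, 0, size);
        return;
    }
    if (m > 0) {                          /* byte-aligned part: one move */
        memmove(s + m, s, size - m);
        memset(s, 0, m);
    }
    if (n > 0) {
        unsigned char take = 0;           /* bits carried from the previous byte */
        for (size_t j = m; j < size; j++) {
            unsigned char pass = s[j] & ((1u << n) - 1);  /* bits that fall off */
            s[j] = (unsigned char)((s[j] >> n) | (take << (CHAR_BIT - n)));
            take = pass;
        }
    }
}
```

For example, with `unsigned char buf[2] = {0xB5, 0x00};`, calling `rshift(buf, 3, 2)` leaves `{0x16, 0xA0}`, the same result the byte-by-byte loop in the question produces.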

One optimization is to do nothing at all when `shift` is 0, and to use a plain `memmove` (plus zero-filling) when `n` is 0, i.e. when the shift is a whole number of bytes.
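For illustration, the byte-aligned fast path alone could be sketched like this (`rshift_aligned` is a hypothetical helper name, and the byte ordering matches the question's code):

```c
#include <limits.h>  /* CHAR_BIT */
#include <stddef.h>  /* size_t */
#include <string.h>  /* memmove, memset */

/* Fast path for byte-aligned shifts: when shift is a multiple of CHAR_BIT,
   the whole operation is one byte move plus a zero fill. */
void rshift_aligned(unsigned char *s, int shift, size_t size) {
    size_t m = (size_t)shift / CHAR_BIT;  /* number of whole bytes to move */
    if (m == 0)
        return;                           /* shift == 0: nothing to do */
    if (m >= size) {
        memset(s, 0, size);               /* everything shifted out */
        return;
    }
    memmove(s + m, s, size - m);          /* move bytes toward higher indices */
    memset(s, 0, m);                      /* zero the vacated leading bytes */
}
```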

Another optimization is to work on several `char` items at once using a bigger type like `uint64_t`. However, this assumes `CHAR_BIT` is 8, which is the case on almost all (sane) modern processors. You should also be very careful with this optimization, since the loaded/stored values need to be correctly aligned and the type punning must be done safely (using either `memcpy` or a `union`) so as not to break the strict aliasing rule (which would result in undefined behaviour).
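The safe type-punning part of that optimization can be sketched with a pair of hypothetical helpers (compilers typically turn these `memcpy` calls into single loads/stores):

```c
#include <stdint.h>  /* uint64_t */
#include <string.h>  /* memcpy */

/* Read a uint64_t from a byte buffer without violating strict aliasing. */
static uint64_t load_u64(const unsigned char *p) {
    uint64_t w;
    memcpy(&w, p, sizeof w);  /* safe type punning */
    return w;
}

/* Write a uint64_t back to the byte buffer, same technique. */
static void store_u64(unsigned char *p, uint64_t w) {
    memcpy(p, &w, sizeof w);
}
```

A chunked `rshift` would then walk the buffer 8 bytes at a time, carrying the low `n` bits of each word into the next. Beware that on a little-endian host the native word layout does not match the byte order assumed by the byte-by-byte version, so the words have to be byte-swapped (or assembled explicitly) before shifting.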

An alternative optimization is to use SIMD intrinsics (SSE/AVX2 on modern x86-64 processors, NEON on most ARM processors). This is more efficient than working on bigger scalar types and also safer regarding aliasing. However, using intrinsics requires advanced programming skills and makes the code less portable and often barely readable. SSE/NEON can operate on 16 8-bit items at a time and AVX2 on up to 32, resulting in drastically faster code. Note that compilers can sometimes vectorize this automatically for you (assuming optimization flags are enabled).

Note that writing `s[j]` instead of `*(s + j)` is shorter and easier to read. Also note that the expression `-take ^ *(s + j)` likely relies on implementation-defined behaviour, since the representation of `-1` differs between architectures (see two's complement vs. one's complement), although almost all modern processors use two's complement.
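For reference, the conditional top-bit update can be written without the `-take` trick, using only unsigned arithmetic (`set_top_bit` is a hypothetical helper, not part of the question's code):

```c
#include <limits.h>  /* CHAR_BIT */

/* Set or clear the most significant bit of `byte` according to `take`
   (0 or 1), without any signed arithmetic. */
unsigned char set_top_bit(unsigned char byte, unsigned take) {
    unsigned char mask = (unsigned char)(1u << (CHAR_BIT - 1));
    return take ? (unsigned char)(byte | mask)
                : (unsigned char)(byte & (unsigned char)~mask);
}
```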