Popcnt is a quite common function which counts the amount of hot-bits in some input; also known as the Hamming Weight. It is a well researched topic. However, still lacking a 'parallel' implementation for arbitrary precision integers, I decided to make my own.

For a recent project I was looking to use Bloom filters to verify a serialized structure, with meaningfull error messages. And so I looked into implementing my own as I wanted to use tc39-BigInt's, which have arbitrary precision, to create my filters.

So from a common 'parallel' implementation with 64-bit constants, I derived algorithm to generate n-bit (n = 2**k) constants. Allowing for efficient arbitrary precision popcnt.

The 'parallel' implementation has the benefits of being 'constant-time' and O(log2(n)), rather than O(n).

You can check the relevant source code on github

## Research

At first I used the basic popcnt algorithm:

popcnt(v: bigint) {
let c = 0n
for (; v; c++) {
// clear the least significant bit set
v &= v - 1n;
}
return c
}

And this worked fine, but because I was already writing my own implementation, I thought I might as well optimize it.

First looking if anyone had found a good algorithm for arbitrary size integers, I found that most implementations unrolled their data into 64-bits.

However then I found this:

An answer which highlighted the naive implementation of the parallel popcnt.

Here is a simplified implementation:

 const masks = [
0x5555555555555555,
0x3333333333333333,
0x0F0F0F0F0F0F0F0F,
0x00FF00FF00FF00FF,
0x0000FFFF0000FFFF,
0x00000000FFFFFFFF
]
popcnt(n: u64){
for (let i = 0; i < masks.length; i++){
n = (n & masks[i]) + (n ^ ~masks[i] >> (1 << i))
}
return n
} 

Supprisingly everywhere I looked (1, 2, 3) it only was made for 64-bit integers using the same constant. Even though the principle should work for any power-of-2 bits.

And so I decided to extends this commonly used algorithm to work for arbitrarily sized integers.

## Algorithm

### Constraints

We start by asserting our constraints, the parallel algorithm works for any power of 2.

// Convert to power of 2 (e.g.) 118 -> 128
let approx_bits = bits
bits = 1n;
while(approx_bits < bits) bits <<= 1n;


## Creating masks

The algorithm heavily relies on the different 'masks'. Each mask is used with its complement, to count bits. These masks can be reused for any size smaller or equal than the constant (also smaller/faster by truncating it, although we won't be using that).

The masks follow a relatively simple pattern, however this is harder to generate efficiently in binary form.

Extending these constants to bigger integers requires adding more masks, and increase the size of the existing masks; you need ceil(log2(n)) masks per n-bit integer.

### Generating masks in reverse

We can generate the constants by using the patterns observed; which is

$$m^{n} = (m^{n+1} \ll (\frac{b^{n+1}}{2})) \oplus m^{n+1}$$

Where b is the minimum repeating pattern size of m (e.g. m = 0b01010101, b = 2) Or when put in this context:

$$b^{n} = |m| \gg n$$

However since each mask is dependent on the one after it we need to generate the masks in reverse. This is slightly annoying because we can't create an implementation where the masks are dynamically created as popcnt runs, but what can you do.

With that our final implementation becomes:

let mask = (1n << bits) - 1n
// prevent overflow because of the arbitrary precision int
const fmask = mask
let masks: bigint[] = []
while (bits > 1) {
bits >>= 1n
mask = ((mask << bits) ^ mask) & fmask
masks.push(mask)
}
masks = masks.reverse()

### Generating masks in execution order

After creating the reversed algorithm I observed another pattern.
For any n-bit number the following pattern can be generated:

$$m^{n} = (1 \ll n) - 1$$

You can also see this at the beginning of the reversed algorithm. The problem with this is that it generates a number only n-bit long, so if we need to mask for an higher bit number it wouldn't cover the entire input. However we can cover for this by just repeating the pattern as many times as we have masks.

    let masks: bigint[] = []
for (let n = 1n; n < bits;){
masks = masks.map(x => x | (x << n))
masks.push((1n << n) - 1n)
n <<= 1n;
}
return masks

Sadly this algorithm still requires us keep mask-generation and popcnt separate as the masks are extended to the right bit-size during the algorithm. This could be fixed by calculating the bit-size generating each mask, but I'm unsure bout the efficacy of this.

We don't really need it as bloom-filters have a constant bit-size and it would also be a lot slower, but it would be nice if doesn't require pre-computation.

### Calculate popcnt

Finally with our masks in place the popcnt implementation becomes quite simple:

popcnt(v: bigint) {
let bits = 1n
for (const mask of masks) {
v = (v & mask) + ((v & ~mask) >> bits)
bits <<= 1n
}
return v
}

## Benchmarks

Here is the benchmarks for the original non-parallel popcnt vs our new generalized popcnt.

In our benchmark we run each algorithm 100000 times. fastpopcnt does include mask generation, which is only run once. We use a 8192-bit number with 1024-bits being hot, this ratio is realistic for a partially filled bloom filter.

popcnt:            1024, samples: 100000, 16488ms, 6.06501698204755p/ms
fastpopcnt(8192b): 1024, samples: 100000, 947ms,   105.59662090813094p/ms

As we can see our new algorithm is almost 21x faster than the original.

Looking at the Big-O for each of the algorithms this makes sense, popcnt has O(n), while fastpopcnt has O(log2(n)). The difference is that popcnt scales by the hot-bits, while fastpopcnt scales by the total bits.

Going to the extreme with this principle we devise a worst-case and best-case scenario: all/signel bit(s) are/is set in a large integer. Here are the results:

popcnt:            8192, samples: 100000, 585785ms, 0.17071109p/ms
fastpopcnt(8192b): 8192, samples: 100000, 2281ms,   43.8404208p/ms
Best-case: 256.8x faster

popcnt:            1,    samples: 100000, 23ms,     4347.82608p/ms
fastpopcnt(8192b): 1,    samples: 100000, 668ms,    149.700598p/ms
Worst-case: 29x slower

While in a worst case it is 29x slower, this is a decent trade-off for our application as bloom-filters are often 1:1 for hot-to-cold-bit ratio.

## Interested?

If you're interested or want more information, feel free to contact us! We are more than happy to help.