# [SOLVED] Is there a faster way to get the index of the maximum value and keep count?

## Issue

Is there a faster way to get the count of each team having the highest score?

### Input

``````# Random scores per player per round (each player corresponds to an integer from 0 - 100)
scores_per_round = np.random.rand(10_000,100)

# 100,000 random teams of 8
teams = np.array([random.sample(list(range(100)), 8) for _ in range(100_000)])
``````

### Desired Output

``````# Count of top scores per team key being the index of the team in the teams array and the value being the amount of wins.
{
0: 20,
1: 12,
...
}
``````

Currently I loop through the rounds, add up each teams score and then grab the maximum values index using `np.argmax` and store the count in a dictionary.

``````import random
from collections import defaultdict

win_count = defaultdict(int)

# Random scores
scores_per_round = np.random.rand(10_000,100)

# 100,000 random teams of 8
teams = np.array([random.sample(list(range(100)), 8) for _ in range(100_000)])

# Loop through and keep track of teams wins
for round in range(10_000):
win_count[np.argmax(np.sum(np.take(scores_per_round[round], teams), axis=1))] += 1
``````

## Solution

The initial code is slow because it allocates pretty-big temporary arrays. Iterating over them 10_000 is expensive because the RAM or the last-level cache are relatively slow (compared to the L1 cache or registers). Using Numba can fix this problem by computing the arrays on-the-fly in a more cache-friendly way.

Here is a simple parallel implementation:

``````import numpy as np
import numba as nb

@nb.njit('int32[::1](float64[:,::1], int32[:,::1])', parallel=True, fastmath=True)
def computeTeamWins(scores_per_round, teams):
roundCount = scores_per_round.shape
result = np.empty(roundCount, dtype=np.int32)
n, m = teams.shape
assert m == 8 # See the comment below

for r in nb.prange(roundCount):
iMax, sMax = -1, -1.0
for i in range(n):
s = 0.0
# Faster if the size is known as the loop can be unrolled
for j in range(8):
s += scores_per_round[r, teams[i, j]]
if s > sMax:
iMax, sMax = i, s
result[r] = iMax

return result

win_count = defaultdict(int)
for v in computeTeamWins(scores_per_round, teams):
win_count[v] += 1
``````

On my 6-core machine, it takes 0.8 second while the initial code takes 54.3 seconds. This means the Numba implementation is about 68 times faster. If you cast `teams` to an `np.uint8` array, then the computation takes only 0.57 seconds resulting in a 95 times faster execution (because of the caches). Note that this means the maximum integer value is bounded to 255 (included). Note that the final loop takes only few milliseconds.