# [SOLVED] Python: faster way of counting occurences in numpy arrays (large dataset)

## Issue

I am new to Python. I have a `numpy.array` which size is `66049x1` (66049 rows and 1 column). The values are sorted smallest to largest and are of `float` type, with some of them being repeated.

I need to determine the frequency of occurrences of each value (the number of times a given value is equalled but not surpassed, e.g. X<=x in statistical terms), in order to later plot the Sample Cumulative Distribution Function.

The code I am currently using is as follows, but it is extremely slow, as it has to loop `66049x66049=4362470401` times. Is there any way to augment the speed of such piece of code? Will perhaps the use of `dictionaries` help in any way? Unfortunately I cannot change the size of the arrays I am working with.

``````+++Function header+++
...
...
directoryPath=raw_input('Directory path for native csv file: ')
csvfile = numpy.genfromtxt(directoryPath, delimiter=",")
x=csvfile[:,2]
x1=numpy.delete(x, 0, 0)
x2=numpy.zeros((x1.shape[0]))
x2=sorted(x1)
x3=numpy.around(x2, decimals=3)
count=numpy.zeros(len(x3))

#Iterates over the x3 array to find the number of occurrences of each value
for i in range(len(x3)):
temp=x3[i]
for j in range(len(x3)):
if (temp<=x3[j]):
count[j]=count[j]+1

#Creates a 2D array with (value, occurrences)
x4=numpy.zeros((len(x3), 2))
for i in range(len(x3)):
x4[i,0]=x3[i]
x4[i,1]=numpy.around((count[i]/x1.shape[0]),decimals=3)
...
...
+++Function continues+++
``````

## Solution

``````import numpy as np
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt

arr = np.random.randint(0, 100, (100000,1))

df = pd.DataFrame(arr)

cnt = Counter(df[0])

df_p = pd.DataFrame(cnt, index=['data'])

df_p.T.plot(kind='hist')

plt.show()
``````

That whole script took a very short period to execute (~2s) for (`100,000x1)` array. I didn’t time, but if you provide the time it took to do yours we can compare.

I used `[Counter][2]` from `collections` to count the number of occurrences, my experiences with it have always been great (timewise). I converted it into `DataFrame` to plot and used `T` to transpose.

Your data does replicate a bit, but you can try and refine it some more. As it is, it’s pretty fast.

Edit

Create CDF using `cumsum()`

``````import numpy as np
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt

arr = np.random.randint(0, 100, (100000,1))

df = pd.DataFrame(arr)

cnt = Counter(df[0])

df_p = pd.DataFrame(cnt, index=['data']).T

df_p['cumu'] = df_p['data'].cumsum()

df_p['cumu'].plot(kind='line')

plt.show()
``````

Edit 2

For `scatter()` plot you must specify the (x,y) explicitly. Also, calling `df_p['cumu']` will result in a `Series`, not a `DataFrame`.

To properly display a scatter plot you’ll need the following:

``````import numpy as np
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt

arr = np.random.randint(0, 100, (100000,1))

df = pd.DataFrame(arr)

cnt = Counter(df[0])

df_p = pd.DataFrame(cnt, index=['data']).T

df_p['cumu'] = df_p['data'].cumsum()

df_p.plot(kind='scatter', x='data', y='cumu')

plt.show()
``````