## Issue

I am new to Python. I have a `numpy.array`

which size is `66049x1`

(66049 rows and 1 column). The values are sorted smallest to largest and are of `float`

type, with some of them being repeated.

I need to determine the frequency of occurrences of each value (the number of times a given value is **equalled but not surpassed**, e.g. **X<=x in statistical terms**), in order to later plot the Sample Cumulative Distribution Function.

The code I am currently using is as follows, but it is extremely slow, as it has to loop `66049x66049=4362470401`

times. Is there any way to augment the speed of such piece of code? Will perhaps the use of `dictionaries`

help in any way? Unfortunately I cannot change the size of the arrays I am working with.

```
+++Function header+++
...
...
directoryPath=raw_input('Directory path for native csv file: ')
csvfile = numpy.genfromtxt(directoryPath, delimiter=",")
x=csvfile[:,2]
x1=numpy.delete(x, 0, 0)
x2=numpy.zeros((x1.shape[0]))
x2=sorted(x1)
x3=numpy.around(x2, decimals=3)
count=numpy.zeros(len(x3))
#Iterates over the x3 array to find the number of occurrences of each value
for i in range(len(x3)):
temp=x3[i]
for j in range(len(x3)):
if (temp<=x3[j]):
count[j]=count[j]+1
#Creates a 2D array with (value, occurrences)
x4=numpy.zeros((len(x3), 2))
for i in range(len(x3)):
x4[i,0]=x3[i]
x4[i,1]=numpy.around((count[i]/x1.shape[0]),decimals=3)
...
...
+++Function continues+++
```

## Solution

```
import numpy as np
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt
arr = np.random.randint(0, 100, (100000,1))
df = pd.DataFrame(arr)
cnt = Counter(df[0])
df_p = pd.DataFrame(cnt, index=['data'])
df_p.T.plot(kind='hist')
plt.show()
```

That whole script took a very short period to execute (~2s) for (`100,000x1)`

array. I didn’t time, but if you provide the time it took to do yours we can compare.

I used `[Counter][2]`

from `collections`

to count the number of occurrences, my experiences with it have always been great (timewise). I converted it into `DataFrame`

to plot and used `T`

to transpose.

Your data does replicate a bit, but you can try and refine it some more. As it is, it’s pretty fast.

**Edit**

Create CDF using `cumsum()`

```
import numpy as np
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt
arr = np.random.randint(0, 100, (100000,1))
df = pd.DataFrame(arr)
cnt = Counter(df[0])
df_p = pd.DataFrame(cnt, index=['data']).T
df_p['cumu'] = df_p['data'].cumsum()
df_p['cumu'].plot(kind='line')
plt.show()
```

**Edit 2**

For `scatter()`

plot you must specify the (x,y) explicitly. Also, calling `df_p['cumu']`

will result in a `Series`

, not a `DataFrame`

.

To properly display a scatter plot you’ll need the following:

```
import numpy as np
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt
arr = np.random.randint(0, 100, (100000,1))
df = pd.DataFrame(arr)
cnt = Counter(df[0])
df_p = pd.DataFrame(cnt, index=['data']).T
df_p['cumu'] = df_p['data'].cumsum()
df_p.plot(kind='scatter', x='data', y='cumu')
plt.show()
```

Answered By – Leb

Answer Checked By – Candace Johnson (BugsFixing Volunteer)