## Issue

I was tired of waiting while loading a simple distance matrix from a CSV file using `numpy.genfromtxt`. Following another SO question, I performed a perfplot benchmark, including some additional methods. The results (source code at the end):

The result for the largest input size shows that the best method is `read_csv`, which is this:

```
import csv
import numpy as np

def load_read_csv(path: str):
    with open(path, 'r') as csv_file:
        reader = csv.reader(csv_file)
        matrix = None
        first_row = True
        for row_index, row in enumerate(reader):
            if first_row:
                size = len(row)
                matrix = np.zeros((size, size), dtype=int)
                first_row = False
            matrix[row_index] = row
    return matrix
```

Now I doubt that reading the file line by line, converting each line to a list of strings, then calling `int()` on each item and storing it in a NumPy matrix is the best possible way.

Can this function be optimized further, or is there some fast library for CSV loading (like Univocity parser in Java), or maybe just a dedicated NumPy function?

The source code of the test:

```
import perfplot
import csv
import numpy as np
import pandas as pd


def load_read_csv(path: str):
    with open(path, 'r') as csv_file:
        reader = csv.reader(csv_file)
        matrix = None
        first_row = True
        for row_index, row in enumerate(reader):
            if first_row:
                size = len(row)
                matrix = np.zeros((size, size), dtype=int)
                first_row = False
            # matrix[row_index] = [int(item) for item in row]
            matrix[row_index] = row
    return matrix


def load_loadtxt(path: str):
    matrix = np.loadtxt(path, dtype=int, comments=None, delimiter=",", encoding="utf-8")
    return matrix


def load_genfromtxt(path: str):
    matrix = np.genfromtxt(path, dtype=int, comments=None, delimiter=",", deletechars=None, replace_space=None, encoding="utf-8")
    return matrix


def load_pandas(path: str):
    df = pd.read_csv(path, header=None, dtype=np.int32)
    return df.values


def load_pandas_engine_pyarrow(path: str):
    df = pd.read_csv(path, header=None, dtype=np.int32, engine='pyarrow')
    return df.values


def load_pandas_engine_python(path: str):
    df = pd.read_csv(path, header=None, dtype=np.int32, engine='python')
    return df.values


def setup(n):
    matrix = np.random.randint(0, 10000, size=(n, n), dtype=int)
    filename = f"square_matrix_of_size_{n}.csv"
    np.savetxt(filename, matrix, fmt="%d", delimiter=",")
    return filename


b = perfplot.bench(
    setup=setup,
    kernels=[
        load_read_csv,
        load_loadtxt,
        load_genfromtxt,
        load_pandas,
        load_pandas_engine_pyarrow,
        load_pandas_engine_python,
    ],
    n_range=[2 ** k for k in range(15)],
)
b.save("out.png")
b.show()
```

## Solution

Parsing CSV files correctly while supporting several data types (e.g. floating-point numbers, integers, strings) and possibly ill-formed input files is clearly not easy, and doing so efficiently is actually pretty hard. Moreover, decoding UTF-8 strings is much slower than reading ASCII directly. These are the reasons why most CSV libraries are pretty slow. On top of that, wrapping a library in Python can introduce significant overhead depending on the input types (especially strings).

Fortunately, if you need to read a CSV file containing a square matrix of integers that is assumed to be correctly formed, then you can write **much faster code dedicated to your specific needs** (which does not care about floating-point numbers, strings, UTF-8, header decoding, error handling, etc.).

That being said, any call to a basic CPython function tends to introduce a large overhead. Even a simple `open` + `read` is relatively slow (binary mode is significantly faster than text mode, but unfortunately still not that fast). The trick is to use NumPy to load the *whole binary file* into RAM with `np.fromfile`. This function is extremely fast: it just reads the whole file at once, puts its binary content in a raw memory buffer, and returns a view on it. When the file is in the operating system cache or on a high-throughput NVMe SSD storage device, it can load the file at a speed of several GiB/s.
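To illustrate just this loading step, here is a minimal sketch (the file name and its contents are made up for the example):

```python
import numpy as np

# Write a tiny CSV file to demonstrate (the file name is arbitrary)
with open("tiny_matrix.csv", "wb") as f:
    f.write(b"1,2\r\n3,4\r\n")

# Load the whole file into a contiguous uint8 buffer in a single call
raw = np.fromfile("tiny_matrix.csv", dtype=np.uint8)

print(raw.dtype, raw.size)  # one uint8 entry per byte, separators included
print(bytes(raw))
```

The buffer contains every byte of the file, including the `,`, `\r` and `\n` separators, which is exactly what the decoding step below consumes.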

Once the file is loaded, you can decode it with **Numba** (or Cython) so the decoding can be nearly as fast as native code. Note that Numba does not support strings/bytes well or efficiently. Fortunately, `np.fromfile` produces a contiguous byte array, which Numba can process very quickly. You can determine the size of the matrix by reading just the first line and counting the commas. Then you can fill the matrix very efficiently by decoding integers on the fly, packing them into a flattened matrix, and treating end-of-line characters as regular separators. Note that both `\r` and `\n` can appear in the buffer since the file is read in binary mode.
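As a side illustration (not part of the original answer), the size-detection step can be sketched in plain NumPy on a made-up buffer:

```python
import numpy as np

# Hypothetical buffer holding a 3x3 integer matrix in CSV form
raw = np.frombuffer(b"10,2,3\r\n4,5,6\r\n7,8,9\r\n", dtype=np.uint8)

# Index of the first end-of-line byte (np.argmax finds the first True)
first_eol = int(np.argmax((raw == ord('\r')) | (raw == ord('\n'))))

# The matrix size is the number of commas in the first line, plus one
n = int(np.count_nonzero(raw[:first_eol] == ord(','))) + 1
print(n)  # 3
```

The Numba version below does the same thing with an explicit loop, which lets it also handle the degenerate empty-file case.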

Here is the resulting implementation:

```
import numba as nb
import numpy as np


@nb.njit('int32[:,:](uint8[::1],)', cache=True)
def decode_csv_buffer(rawData):
    COMMA = np.uint8(ord(','))
    CR = np.uint8(ord('\r'))
    LF = np.uint8(ord('\n'))
    ZERO = np.uint8(ord('0'))

    # Find the size of the matrix (`n`)
    n = 0
    lineSize = 0
    for i in range(rawData.size):
        c = rawData[i]
        if c == CR or c == LF:
            break
        n += rawData[i] == COMMA
        lineSize += 1
    n += 1

    # Empty matrix
    if lineSize == 0:
        return np.empty((0, 0), dtype=np.int32)

    # Initialization
    res = np.empty(n * n, dtype=np.int32)

    # Fill the matrix
    curInt = 0
    curPos = 0
    lastCharIsDigit = True
    for i in range(len(rawData)):
        c = rawData[i]
        if c == CR or c == LF or c == COMMA:
            if lastCharIsDigit:
                # Write the last int in the flattened matrix
                res[curPos] = curInt
                curPos += 1
            curInt = 0
            lastCharIsDigit = False
        else:
            curInt = curInt * 10 + (c - ZERO)
            lastCharIsDigit = True

    return res.reshape(n, n)


def load_numba(filename):
    # Load the whole file into a raw memory buffer
    rawData = np.fromfile(filename, dtype=np.uint8)

    # Decode the buffer using the Numba JIT.
    # This method only works for this specific format and
    # can simply crash if the file content is invalid.
    return decode_csv_buffer(rawData)
```

Be aware that this code is not robust (any malformed input results in undefined behaviour, including a crash), but it is very fast.
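If the input is not fully trusted, one inexpensive mitigation (not in the original answer, just a sketch; the function name is made up) is a vectorized pre-check that the buffer contains only the bytes the fast decoder expects. It does not catch every malformed file (e.g. ragged rows), but it rejects anything with signs, decimal points or non-numeric text:

```python
import numpy as np

def is_plain_int_csv(raw: np.ndarray) -> bool:
    """Cheap vectorized pre-check: the buffer must contain only ASCII
    digits, commas, CR and LF for the fast decoder to be safe."""
    digits = (raw >= ord('0')) & (raw <= ord('9'))
    seps = (raw == ord(',')) | (raw == ord('\r')) | (raw == ord('\n'))
    return bool(np.all(digits | seps))

good = np.frombuffer(b"1,2\n3,4\n", dtype=np.uint8)
bad = np.frombuffer(b"1,x\n3,4\n", dtype=np.uint8)
print(is_plain_int_csv(good))  # True
print(is_plain_int_csv(bad))   # False
```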

Here are the results on my machine:

As you can see, the above Numba implementation is **at least one order of magnitude faster than all the others**. Note that you can write even faster code using multiple threads during the decoding, but this makes the code significantly more complex.

Answered By – Jérôme Richard

Answer Checked By – Cary Denson (BugsFixing Admin)