## Issue

I’m comparing the single-thread performance of matrix-matrix products in **TensorFlow 2** and **NumPy**, separately for single precision (float32) and double precision (float64). I find that the **NumPy** performance is almost equivalent to the Intel MKL C++ implementation (used as the benchmark for matrix multiplication) for both single and double precision (SGEMM and DGEMM). In **TensorFlow**, however, only the single precision (float32) performance matches MKL, while the double precision (float64) performance is significantly slower. **Why is TensorFlow slower with double-precision data?**
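(For context, one way to confirm which BLAS library a NumPy build links against, and hence whether the NumPy baseline really runs on MKL, is the diagnostic below; this is only a verification sketch, not part of the benchmark.)

```
import numpy as np

# Lists the BLAS/LAPACK libraries NumPy was built against;
# an MKL-backed build mentions MKL (e.g. "mkl_rt") in this output.
np.show_config()
```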

**Sample Scripts:**

Consider the following matrix multiplication, which reproduces my observation:

C = AB, where A and B are of size 3000 × 3000.

The TensorFlow 2 and NumPy scripts are given below:

**TensorFlow 2 code**

```
import tensorflow as tf
import os
import time

# Check if MKL is enabled
import tensorflow.python.framework as tff
print("MKL Enabled : ", tff.test_util.IsMklEnabled())

# Set threads
tf.config.threading.set_inter_op_parallelism_threads(1)
tf.config.threading.set_intra_op_parallelism_threads(1)

# Problem size
N = 3000
REPS = 20
DTYPE = tf.float64
#DTYPE = tf.float32

@tf.function
def gemm_implicit_noup(A, B):
    # C = A @ B
    start = tf.timestamp()
    with tf.control_dependencies([start]):
        C = tf.matmul(A, B)
    with tf.control_dependencies([C]):
        end = tf.timestamp()
    tf.print(end - start)
    return C

tf.config.run_functions_eagerly(False)

A = tf.random.normal([N, N], dtype=DTYPE)
B = tf.random.normal([N, N], dtype=DTYPE)

# Building trace (first call, not timed)
C = gemm_implicit_noup(A, B)
for i in range(REPS):
    C = gemm_implicit_noup(A, B)
```

**NumPy code**

```
import os
os.environ["OMP_NUM_THREADS"] = "1"

import numpy as np
import time

N = 3000
REPS = 20
DTYPE = np.float64
#DTYPE = np.float32

def gemm_implicit_noup(A, B):
    # C = A @ B
    C = np.matmul(A, B)
    return C

A = np.random.randn(N, N).astype(DTYPE)
B = np.random.randn(N, N).astype(DTYPE)

for i in range(REPS):
    start = time.perf_counter()
    C = gemm_implicit_noup(A, B)
    end = time.perf_counter()
    print(end - start)
```

**System and Installation settings:**

The performance was compared on an Intel Xeon Skylake 2.1 GHz machine running CentOS 7 and also on a MacBook Pro 2018 running Big Sur. Both **TensorFlow 2.7** and **2.8**, built with Intel MKL, were tested, with **Python 3.9.7** and **3.7.4**. I compare single-thread performance so that the results can be reliably reproduced (the thread pinning used is sketched after the timings below). I observe similar performance numbers in all of these settings:

Single precision performance is as expected:

- Intel MKL C++ SGEMM ~ **0.5s**
- NumPy float32 ~ **0.5s**
- TensorFlow float32 ~ **0.5s**

But double precision performance:

- Intel MKL C++ DGEMM ~ **0.9s**
- NumPy float64 ~ **1s**
- TensorFlow float64 > **2.5s** (much slower!)
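All of the timings above were taken with the thread pools pinned to a single thread. A minimal sketch of how this can be done from Python, assuming an MKL- or OpenMP-backed NumPy build as in the script above:

```
import os

# These must be set before NumPy (and its BLAS backend) is imported,
# otherwise the thread pool may already have been initialized.
os.environ["OMP_NUM_THREADS"] = "1"   # OpenMP-based BLAS backends
os.environ["MKL_NUM_THREADS"] = "1"   # Intel MKL specifically

import numpy as np
```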

## Solution

Assuming that you are using an **Intel® AVX-512** instruction-supported processor, try installing the **Intel® Optimization for TensorFlow** wheel via pip, built specifically for AVX-512. These packages are available as *.whl files on the Intel® website for specific Python versions, or can be installed with the following command for Python versions 3.7, 3.8, and 3.9 (Linux only).

```
pip install intel-tensorflow-avx512==2.7.0
```

This is documented on the official Intel® website in the following sections:

- Intel® Optimization for TensorFlow: Installation Guide
- Intel® Optimization for TensorFlow: Install the Intel® Optimization for TensorFlow Wheel via PIP

**AVX-512** is a Single Instruction, Multiple Data (SIMD) instruction set extension whose wide 512-bit vector registers are particularly beneficial for double-precision workloads. To take full advantage of Intel® architecture and extract maximum performance, the TensorFlow framework has been optimized using **oneAPI Deep Neural Network Library (oneDNN)** primitives, a popular performance library for deep learning applications. As an additional optimization step, also try setting the environment variable **TF_ENABLE_ONEDNN_OPTS** to **1** in your Linux terminal with the following command before running the TensorFlow code:

```
export TF_ENABLE_ONEDNN_OPTS=1
```
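Equivalently, the flag can be set from inside the Python script, as long as it is set before TensorFlow is imported; a minimal sketch:

```
import os

# Set before importing TensorFlow so the flag is seen when the runtime initializes.
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "1"

import tensorflow as tf
```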

The single-thread performance obtained for double-precision matrix-matrix products using the code that you provided is given below. The test was done on an *Intel® Xeon® Platinum 8260M CPU @ 2.40GHz* with *Python 3.8* and *Intel® MKL- and AVX-512-optimized TensorFlow 2.7*.

- NumPy float64 ~ **1.44s**
- TensorFlow float64 (MKL enabled) ~ **2.77s**
- TensorFlow float64 (MKL enabled, AVX-512 optimized, oneDNN optimizations enabled) ~ **1.19s**
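After installing the AVX-512 wheel, the same MKL check used in the question's script can be reused to confirm that the optimized build is the one actually being imported; a small verification sketch:

```
import tensorflow as tf
import tensorflow.python.framework as tff

# Confirm the version and that the MKL/oneDNN-enabled build is active
print("TensorFlow version :", tf.__version__)
print("MKL Enabled        :", tff.test_util.IsMklEnabled())
```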
