# [SOLVED] Cosine Similarity scores for each array combination in a list of arrays in Python

## Issue

I have a list of arrays and I want to calculate the cosine similarity for each combination of arrays in the list.

My full list comprises 20 arrays, each 3 x 25000. A small selection is shown below:

```
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, cosine_distances

C = np.array([[-127, -108, -290],
              [-123,  -83, -333],
              [-126,  -69, -354],
              [-146, -211, -241],
              [-151, -209, -253],
              [-157, -200, -254]])

D = np.array([[-129, -146, -231],
              [-127, -148, -238],
              [-132, -157, -231],
              [ -93, -355, -112],
              [ -95, -325, -137],
              [ -99, -282, -163]])

E = np.array([[-141, -133, -200],
              [-132, -123, -202],
              [-119, -117, -204],
              [-107, -210, -228],
              [-101, -194, -243],
              [-105, -175, -244]])

ArrayList = (C, D, E)
```

My first problem is that I get a pairwise result for each row of each array, whereas what I want is a single result that treats each array as a whole.

For example, I try:

```
scores = cosine_similarity(C, D)
scores
array([[0.98078461, 0.98258287, 0.97458466, 0.643815  , 0.71118811,
        0.7929595 ],
       [0.95226207, 0.95528395, 0.9428837 , 0.55905221, 0.63291722,
        0.7240552 ],
       [0.9363733 , 0.93972303, 0.9255921 , 0.51752531, 0.59402196,
        0.68918496],
       [0.98998438, 0.98903931, 0.99377116, 0.85494921, 0.8979725 ,
        0.9449272 ],
       [0.99335622, 0.99255262, 0.99635952, 0.84106771, 0.88619755,
        0.93616556],
       [0.9955969 , 0.99463213, 0.99794805, 0.82706302, 0.8738389 ,
        0.92640196]])
```

What I am expecting is a single value, such as 0.989… (this is a made-up number).

The next challenge is how to iterate over each array in my list of arrays to get a pairwise result per array, something like this:

```
      C     D     E
C  1.0   0.97  0.95
D  0.97  1.0   0.96
E  0.95  0.95  1.0
```

As a beginner in Python, I am not sure how to proceed. Any help is appreciated.

## Solution

If I understand correctly, you want the cosine similarity obtained by treating each matrix as a single `1 x n` vector. The easiest approach, in my opinion, is to implement the cosine similarity in vectorized form with NumPy functions. As a reminder, given two 1D vectors `x` and `y`, the cosine similarity is given by:

```
cosine_similarity = x.dot(y) / (np.linalg.norm(x, 2) * np.linalg.norm(y, 2))
```
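As a minimal sketch of that formula applied to whole arrays, the snippet below flattens `C` and `D` from the question with `ravel()` and returns one score for the pair. The helper name `flat_cosine_similarity` is made up for illustration, and it assumes both arrays have the same number of elements.

```python
import numpy as np

C = np.array([[-127, -108, -290],
              [-123,  -83, -333],
              [-126,  -69, -354],
              [-146, -211, -241],
              [-151, -209, -253],
              [-157, -200, -254]])

D = np.array([[-129, -146, -231],
              [-127, -148, -238],
              [-132, -157, -231],
              [ -93, -355, -112],
              [ -95, -325, -137],
              [ -99, -282, -163]])


def flat_cosine_similarity(a, b):
    """Cosine similarity of two arrays, each treated as one long vector.

    Hypothetical helper; assumes `a` and `b` hold the same number of elements.
    """
    x = a.ravel().astype(float)  # flatten to 1D and avoid integer overflow
    y = b.ravel().astype(float)
    return x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))


score = flat_cosine_similarity(C, D)
print(round(score, 4))  # 0.9127
```

This gives the single value per pair that the question asks for, instead of the 6 x 6 row-by-row matrix.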

To do this with the three matrices, we first flatten each one into a 1D representation and stack them together, one row per matrix:

```
matrices_1d = np.vstack((C.reshape(1, -1), D.reshape(1, -1), E.reshape(1, -1)))
```

Now that we have a vector representation of each matrix, we can compute the L2 norm of each row with `np.linalg.norm` (see its NumPy documentation for details):

```
norm_vec = np.linalg.norm(matrices_1d, ord=2, axis=1)
```

Finally, we can compute the cosine similarities as follows:

```
cos_sim = matrices_1d.dot(matrices_1d.T) / np.outer(norm_vec, norm_vec)
# array([[1.        , 0.9126993 , 0.9699609 ],
#        [0.9126993 , 1.        , 0.93485159],
#        [0.9699609 , 0.93485159, 1.        ]])
```

As a sanity check, note that the diagonal values are 1, since the cosine similarity of a vector with itself is 1.
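For a longer list, such as the 20 arrays mentioned in the question, the same steps can be wrapped in a function that accepts any sequence of arrays. The sketch below is one possible way to do that. The function name `pairwise_flat_cosine` and the small demo arrays are invented for illustration, and it assumes every array has the same number of elements and none is all zeros (which would cause a division by zero).

```python
import numpy as np


def pairwise_flat_cosine(arrays):
    """Return an (n, n) matrix of cosine similarities between flattened arrays.

    Hypothetical helper; assumes all arrays have the same number of elements.
    """
    # One row per array, each array flattened to a 1D float vector.
    flat = np.stack([np.asarray(a, dtype=float).ravel() for a in arrays])
    norms = np.linalg.norm(flat, axis=1)
    return flat @ flat.T / np.outer(norms, norms)


# Small made-up arrays, only to show the call pattern.
demo = [
    np.array([[1, 0], [0, 1]]),
    np.array([[2, 0], [0, 2]]),  # same direction as the first array
    np.array([[0, 1], [1, 0]]),  # orthogonal to the first array
]
sim = pairwise_flat_cosine(demo)
print(sim)
```

Calling `pairwise_flat_cosine(ArrayList)` with the arrays from the question should give the same 3 x 3 matrix as above. To get a labelled table, the result can be wrapped as `pd.DataFrame(sim, index=names, columns=names)`, where `names` is a list of labels you supply.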

The cosine distance is defined as `1 - cos_sim` and is easy to compute once you have the similarity.
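A short sketch of that last step, using the similarity values printed earlier. The `np.clip` call is an optional extra, not part of the definition: floating-point rounding can leave tiny negative values on the diagonal, and clipping removes them.

```python
import numpy as np

# Similarity matrix copied from the result shown earlier.
cos_sim = np.array([[1.0,       0.9126993,  0.9699609],
                    [0.9126993, 1.0,        0.93485159],
                    [0.9699609, 0.93485159, 1.0]])

cos_dist = 1.0 - cos_sim
# Cosine distance lies in [0, 2]; clip away any rounding noise outside that range.
cos_dist = np.clip(cos_dist, 0.0, 2.0)
print(cos_dist.round(4))
```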