[SOLVED] How to make a function to check if the combination and/or redundancy of words has a correlation with the number of sales?

Issue

In my dataframe of online product sales, I have a column that contains the description of each product sold.

I would like to create an algorithm to check whether the combination and/or redundancy of words correlates with the number of sales.

But I would like to be able to filter out words that are too generic, like the product type. For example, my dataframe deals with the sale of wines, so the algorithm must not take the word "wine" in the description into account.

In my df I have 700 rows consisting of 4 columns:

  • product_id: id for each product
  • product_price: product price
  • total_sales: total number of product sales
  • product_description: product description (e.g.: "Fruity wine, perfect as a starter"; "Dry and full-bodied wine"; "Fresh and perfect wine as a starter"; "Wine combining strength and character"; "Wine with a ruby color, full-bodied"; etc…)

Edit:
I added:

  • the column ‘CA’ (chiffre d’affaires, i.e. revenue): the product’s total sales * the product’s price
  • an example of my df

My DataFrame example:

import pandas as pd

data = {'Product_id': [1, 2, 3, 4, 5],
        'Price': [24, 13.5, 12.9, 34, 26],
        'Total_sales': [28, 34, 29, 42, 10],
        'CA': [672, 459, 374.1, 1428, 260],
        'Product_description': ["Fruity wine, perfect as a starter",
                                "Dry and full-bodied wine",
                                "Fresh and perfect wine as a starter",
                                "Wine combining strength and character",
                                "Wine with a ruby ​​color, full-bodied "]}

df = pd.DataFrame(data)
df

Edit 2:

  • My goal is to find out whether certain words (and/or combinations of words) correlate with the number of sales. For that I thought I could create a heatmap with the distinct values of my column ["total_sales"] on the ordinate and a list of the most used words from the column ["Product_description"] on the abscissa. I thought an ANOVA could help me verify a correlation between these two variables, or a chi-square test…
    My action process:
  • find the number of unique values for my column ["total_sales"]; I have 43 different ones.
  • create a list stopwords=[list of redundant words (e.g. ‘the’, ‘a’, ‘by’, etc…)]
  • split the words of all my lines for the column ["description"]
wordslist = df["description"].str.split()
  • I can’t filter the results of my wordslist variable with stopwords (see the note after this edit):
comp = re.compile('|'.join(stopwords))
z = [re.sub(comp, '', i).strip() for i in wordslist]

print(z)

I get

TypeError: expected string or bytes-like object
  • After that I intend to get the frequency of each word in the column df["description"]
  • The words with a significant frequency should appear on the abscissa of my heatmap, with the sales figures on the ordinate

Is this a good method (provided I solve my error) to check whether the use of a word or a combination of words has an impact on the sales of a product?

Could you please give me some hints?
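
For reference: the TypeError above occurs because df["description"].str.split() returns a list of tokens per row, so re.sub receives a list rather than a string. A minimal sketch of a fix, filtering the token lists directly (assuming the df and stopwords names from the steps above):

stopwords = ['the', 'a', 'by']  # example stop list
# str.split() yields one list of tokens per row, so filter each list
wordslist = df["description"].str.split()
z = [[w for w in tokens if w.lower() not in stopwords] for tokens in wordslist]
print(z)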

Edit 3:
Thanks to @maaniB for the great help; it got me a big step closer to the final solution, but I still have a little way to go. Here is where I am:

I’m French, so for the cleaning method with stop words I replaced NLTK with spaCy:

import re
from spacy.lang.fr.stop_words import STOP_WORDS as fr_stop
from spacy.lang.en.stop_words import STOP_WORDS as en_stop

# to lowercase
df['Product_description'] = df['Product_description'].str.lower()
# replace_stop_characters
stop_chars = [
    "\\/",
    "\\[",
    "\\]",
    "\\:",
    "\\|",
    '\\"',
    "\\?",
    "\\<",
    "\\>",
    "\\,",
    "\\(",
    "\\)",
    "\\\\",
    "\\.",
    "\\+",
    "\\-",
    "\\!",
    "\\$",
    "\\`",
    "\\،",
    "\\_",
]
stop_chars_pattern = "|".join(stop_chars)
df['Product_description'] = df.apply(
    lambda row: re.sub(stop_chars_pattern, "", row["Product_description"]),
    axis=1
)
# remove stop words using the French and English lists combined
stop_words = list(fr_stop) + list(en_stop)
stop_words.extend(['wine']) # extend the list as you wish
df['Product_description'] = df['Product_description'].map(
    lambda x: ' '.join([w for w in x.split() if w not in stop_words])
)
print(df)
  • To extract the features I tried both CountVectorizer and TfidfVectorizer (I had confused the latter with TfidfTransformer), and I got better results with TfidfVectorizer:
from sklearn.feature_extraction.text import TfidfVectorizer
# change the ngram_range to make combinations of words
tfidf_vector = TfidfVectorizer(stop_words=stop_words,
                               ngram_range=(1, 4),
                               encoding="utf-8")
tpl_cntvec = tfidf_vector.fit_transform(df_produits_en_ligne['post_excerpt'])
df_cntvec = pd.DataFrame(tpl_cntvec.toarray(),
                         columns=tfidf_vector.get_feature_names(),  # get_feature_names_out() in scikit-learn >= 1.0
                         index=df_produits_en_ligne.index)
df_total_bow = pd.concat([df_produits_en_ligne['total_sales'], df_cntvec],
                         axis=1)
df_total_bow
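
A side note (my suggestion, not part of @maaniB’s answer): with ngram_range=(1, 4) on ~700 short descriptions the vocabulary grows to roughly 47,000 features, far more than rows, which makes the regression below slow and unstable. TfidfVectorizer’s min_df and max_features parameters can cap the vocabulary:

from sklearn.feature_extraction.text import TfidfVectorizer
# keep only n-grams that appear in at least 5 descriptions, up to 500 features
tfidf_vector = TfidfVectorizer(stop_words=stop_words,
                               ngram_range=(1, 2),
                               min_df=5,
                               max_features=500,
                               encoding="utf-8")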

And I’m stuck at the last step: I tried @maaniB’s version with ordinary least squares

import statsmodels.api as sm
# Here, I used ordinary least square regression method
x = df_total_bow.drop(columns='total_sales').to_numpy()
y = df_total_bow['total_sales'].to_numpy()
ols = sm.OLS(y, x)
fit = ols.fit()
print(fit.summary())

To run it and get a result in a Jupyter notebook, I had to raise the --NotebookApp.iopub_data_rate_limit
via the command line:

jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10

It worked after 3 minutes of processing, but I’m totally lost with the result: it returned 46,987 lines and I don’t know how to interpret them.
Here is a screenshot of my results.

Could someone explain to me how to interpret it, please?

[Screenshots: extracted features; model result summary (top and bottom)]
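
For what it’s worth, each of those 46,987 lines is the coefficient of one n-gram feature, in the column order of df_total_bow. A minimal sketch (my addition, reusing fit and tfidf_vector from the snippets above) that ranks n-grams by p-value instead of scrolling through the raw summary:

import pandas as pd

# one coefficient and p-value per n-gram, in the vectorizer's column order
feature_names = tfidf_vector.get_feature_names()  # get_feature_names_out() in scikit-learn >= 1.0
results = pd.DataFrame({
    'feature': feature_names,
    'coef': fit.params,
    'pvalue': fit.pvalues,
})
# smallest p-values first: the n-grams most strongly associated with total_sales
print(results.sort_values('pvalue').head(20))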

I tried another method, but after an hour of processing with no result
I cancelled it:

from numpy import mean
from numpy import std
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
# define dataset
x = df_total_bow.drop(columns='total_sales').to_numpy()
y = df_total_bow['total_sales'].to_numpy()
# create pipeline: RFE feature selection followed by a decision tree
rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=5)
model = DecisionTreeClassifier()
pipeline = Pipeline(steps=[('s', rfe), ('m', model)])
# evaluate model (note: this treats each distinct sales count as a class)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(pipeline, x, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Edit 4:

  • I tried, in vain, to make a heatmap with df_total_bow:
import seaborn as sns
import matplotlib.pyplot as plt

tx = df_total_bow.drop(columns='total_sales').to_numpy()
ty = df_total_bow['total_sales'].to_numpy()
n = len(df_produits_en_ligne)
indep = tx.dot(ty) / n

c = df_total_bow.fillna(0)
measure = (c - indep)**2 / indep
xi_n = measure.sum().sum()
table = measure / xi_n
sns.heatmap(table.iloc[:-1, :-1], annot=c.iloc[:-1, :-1])
plt.show()

But I get

ValueError: shapes (714,46987) and (714,) not aligned: 46987 (dim 1) != 714 (dim 0)
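
For reference, the ValueError comes from tx.dot(ty): tx has shape (714, 46987) and ty has shape (714,), so the inner dimensions do not match (tx.T.dot(ty) would align). For a words-versus-sales heatmap, a simpler sketch (my suggestion, reusing df_total_bow from above) is to correlate the most frequent n-grams with total_sales:

import seaborn as sns
import matplotlib.pyplot as plt

# the 20 n-grams with the largest summed TF-IDF weights
top = df_total_bow.drop(columns='total_sales').sum().nlargest(20).index
# Pearson correlation of each top n-gram with total_sales
corr = df_total_bow[list(top) + ['total_sales']].corr()
sns.heatmap(corr[['total_sales']].drop(index='total_sales'), annot=True)
plt.show()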

Solution

Your question is a combination of text mining tasks, which I try to address briefly here. The first step, as always in NLP and text mining projects, is cleaning, including removing stop words, stop characters, etc.:

import re

import pandas as pd
from nltk.corpus import stopwords  # requires a one-time nltk.download('stopwords')

# to lowercase
df['Product_description'] = df['Product_description'].str.lower()
# replace_stop_characters
stop_chars = [
    "\\/",
    "\\[",
    "\\]",
    "\\:",
    "\\|",
    '\\"',
    "\\?",
    "\\<",
    "\\>",
    "\\,",
    "\\(",
    "\\)",
    "\\\\",
    "\\.",
    "\\+",
    "\\-",
    "\\!",
    "\\$",
    "\\`",
    "\\،",
    "\\_",
]
stop_chars_pattern = "|".join(stop_chars)
df['Product_description'] = df.apply(
    lambda row: re.sub(stop_chars_pattern, "", row["Product_description"]),
    axis=1
)
# replace stop words
stop_words = stopwords.words('english')
stop_words.extend(['wine']) # extend the list as you wish
df['Product_description'] = df['Product_description'].map(
    lambda x: ' '.join([w for w in x.split() if w not in stop_words])
)
print(df)

#   Product_id  Price  Total_sales      CA           Product_description
# 0           1   24.0           28   672.0        fruity perfect starter
# 1           2   13.5           34   459.0                dry fullbodied
# 2           3   12.9           29   374.1         fresh perfect starter
# 3           4   34.0           42  1428.0  combining strength character
# 4           5   26.0           10   260.0         ruby color fullbodied

Next, you need to extract features (you mentioned counts of words and phrases).

from sklearn.feature_extraction.text import CountVectorizer
# change the ngram_range to make combinations of words
count_vector = CountVectorizer(ngram_range=(1, 4), encoding="utf-8")
tpl_cntvec = count_vector.fit_transform(df['Product_description'])
df_cntvec = pd.DataFrame(
    tpl_cntvec.toarray(),
    columns=count_vector.get_feature_names(),  # get_feature_names_out() in scikit-learn >= 1.0
    index=df.index,
)
df_total_bow = pd.concat([df['Total_sales'], df_cntvec], axis = 1)
df_total_bow
#    Total_sales  character  color  color fullbodied  combining  ...  ruby color  ruby color fullbodied  starter  strength  strength character
# 0           28          0      0                 0          0  ...           0                      0        1         0                   0
# 1           34          0      0                 0          0  ...           0                      0        0         0                   0
# 2           29          0      0                 0          0  ...           0                      0        1         0                   0
# 3           42          1      0                 0          1  ...           0                      0        0         1                   1
# 4           10          0      1                 1          0  ...           1                      1        0         0                   0

Finally, you can make your models on the data:

import statsmodels.api as sm
# Here, I used ordinary least square regression method
x = df_total_bow.drop(columns='Total_sales').to_numpy()
y = df_total_bow['Total_sales'].to_numpy()
ols = sm.OLS(y, x)
fit = ols.fit()
print(fit.summary())
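
Note that sm.OLS does not add an intercept on its own; a small variant of the snippet above fits one via sm.add_constant:

# prepend a column of ones so the model also fits an intercept
x_const = sm.add_constant(x)
fit = sm.OLS(y, x_const).fit()
print(fit.summary())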

Regarding your other questions:

  • There are various statistical methods to find the importance of words in a text and their correlation with other variables. CountVectorizer is just a simple feature-extraction method; there are better-weighted ones like TfidfTransformer.
  • The type of statistical test or model depends on the problem. Since you just need to find out the correlation of word combinations with sales statistics, simple regression-based methods with feature extraction are helpful. To rank features (find the word combinations with the highest correlations and significance), recursive feature elimination (sklearn.feature_selection.RFE) might be practical; a minimal sketch follows below.
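
Since Total_sales is a count rather than a class label, RFE pairs more naturally with a regressor than with the classifier tried in Edit 3. A minimal sketch (my addition, assuming the x and y arrays from above and feature names taken from the vectorizer):

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# recursively drop the n-grams with the weakest regression coefficients
rfe = RFE(estimator=LinearRegression(), n_features_to_select=10)
rfe.fit(x, y)
selected = [name for name, keep in zip(count_vector.get_feature_names(), rfe.support_) if keep]
print(selected)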

Answered By – maaniB

Answer Checked By – Marilyn (BugsFixing Volunteer)
