Issue
In my dataframe highlighting product sales on the internet, I have a column that contains the description of each product sold.
I would like to create an algorithm to check whether the combination and/or redundancy of words correlates with the number of sales.
But I would like to be able to filter out words that are too redundant like the product type. For example, my dataframe deals with the sale of wines, so the algorithm must not take into account the word "wine" in the description.
In my df I have 700 rows consisting of 4 columns:
- product_id: id for each product
- product_price: product price
- total_sales: total number of product sales
- product_description: product description (e.g.: "Fruity wine, perfect as a starter"; "Dry and full-bodied wine"; "Fresh and perfect wine as a starter"; "Wine combining strength and character"; "Wine with a ruby color, full-bodied "; etc…)
Edit:
I added:
- the column ‘CA’: the total sales by product * the product’s price
- an example of my df
My DataFrame example:
import pandas as pd
data = {'Product_id': [1, 2, 3, 4, 5],
'Price': [24, 13.5, 12.9, 34, 26],
'Total_sales': [28, 34, 29, 42, 10],
'CA': [672, 459, 374.1, 1428, 260],
'Product_description': ["Fruity wine, perfect as a starter",
"Dry and full-bodied wine",
"Fresh and perfect wine as a starter",
"Wine combining strength and character",
"Wine with a ruby color, full-bodied "]}
df = pd.DataFrame(data)
df
Edit 2:
- I want to find out whether certain words (and/or combinations of words) correlate with the number of sales. For that I thought I could create a heatmap with the distinct values of my ["total_sales"] column on the ordinate and a list of the most used words from the ["Product_description"] column on the abscissa. I thought an ANOVA could help me verify a correlation between these two variables, or a Chi-square test (a rough sketch of this idea is at the end of this edit)…
My action process:
- find the number of unique values for my column ["total_sales"]; I have 43 different ones.
- create a list stopwords=[list of redundant words (ex: ‘the’, ‘the’, ‘by’, etc…)]
- split the words of all my lines for the column ["description"]
wordslist = df["description"].str.split()
- but I can’t filter the results of my wordslist variable with stopwords:
import re
comp = re.compile('|'.join(stopwords))
z = [re.sub(comp, '', i).strip() for i in wordslist]
print(z)
I get
TypeError: expected string or bytes-like object
- After that I intend to get the frequency of each word in the column df["description"]
- The words with a significant frequency should appear on the abscissa of my heatmap, with the sales figures on the ordinate
Is this a good method (provided I find the solution to my error) to check if the use of a word/a combination of words has an impact on the sale of a product?
Could you please give me some hints?
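(As a rough sketch of the ANOVA idea above, assuming the example df and scipy.stats, one could test one word at a time by comparing Total_sales between descriptions that do and do not contain it; the word "starter" is only an illustration.)
from scipy.stats import f_oneway
# illustrative word; in practice loop over the most frequent words
word = "starter"
has_word = df["Product_description"].str.lower().str.contains(word)
# one-way ANOVA: do sales differ between products with and without the word?
f_stat, p_value = f_oneway(df.loc[has_word, "Total_sales"],
                           df.loc[~has_word, "Total_sales"])
print(f_stat, p_value)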
Edit 3:
Thanks to @maaniB for the great help; with it I took a big step towards the final solution, but I still have a little way to go. Here is where I am:
I'm French, so for the cleaning method with stop_words I replaced nltk with spaCy.
import re
from spacy.lang.fr.stop_words import STOP_WORDS as fr_stop
from spacy.lang.en.stop_words import STOP_WORDS as en_stop
# to lowercase
df['Product_description'] = df['Product_description'].str.lower()
# replace_stop_characters
stop_chars = [
"\\/",
"\\[",
"\\]",
"\\:",
"\\|",
'\\"',
"\\?",
"\\<",
"\\>",
"\\,",
"\\(",
"\\)",
"\\\\",
"\\.",
"\\+",
"\\-",
"\\!",
"\\$",
"\\`",
"\\،",
"\\_",
]
stop_chars_pattern = "|".join(stop_chars)
df['Product_description'] = df.apply(
lambda row: re.sub(stop_chars_pattern, "", row["Product_description"]),
axis=1
)
# replace stop words with 2 list
stop_words = list(fr_stop) + list(en_stop)
stop_words.extend(['wine']) # extend the list as you wish
df['Product_description'] = df['Product_description'].map(
lambda x: ' '.join([w for w in x.split() if w not in stop_words])
)
print(df)
- To extract the features I tried CountVectorizer and TfidfVectorizer (I confused it with TfidfTransformer) and I got better results with TfidfVectorizer:
from sklearn.feature_extraction.text import TfidfVectorizer
# change the ngram_range to make combinations of words
tfidf_vector = TfidfVectorizer(stop_words=stop_words,
ngram_range=(1, 4),
encoding="utf-8")
tpl_cntvec = tfidf_vector.fit_transform(df_produits_en_ligne['post_excerpt'])
df_cntvec = pd.DataFrame(tpl_cntvec.toarray(),
columns=tfidf_vector.get_feature_names(),
index=df_produits_en_ligne.index)
df_total_bow = pd.concat([df_produits_en_ligne['total_sales'], df_cntvec],
axis=1)
df_total_bow
And I'm stuck on the last step. I tried @maaniB's version with ordinary least squares:
import statsmodels.api as sm
# Here, I used ordinary least square regression method
x = df_total_bow[df_total_bow.drop('total_sales', 1).columns].to_numpy()
y = df_total_bow['total_sales'].to_numpy()
ols = sm.OLS(y, x)
fit = ols.fit()
print(fit.summary())
To run it and get a result in Jupyter Notebook I had to change --NotebookApp.iopub_data_rate_limit via the command line
jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10
It finished after 3 minutes of processing, but I'm totally lost with the result: it returned 46987 rows and I don't know how to interpret them.
Here is a screenshot of my results.
Could someone explain to me how to interpret it, please?
I tried another method, but after an hour of processing without a result I cancelled it:
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
# define dataset
x = df_total_bow[df_total_bow.drop('total_sales', 1).columns].to_numpy()
y = df_total_bow['total_sales'].to_numpy()
# create pipeline
rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=5)
model = DecisionTreeClassifier()
pipeline = Pipeline(steps=[('s',rfe),('m',model)])
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(pipeline, x, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
Edit 4:
- I tried, in vain, to make a heatmap with df_total_bow:
import seaborn as sns
import matplotlib.pyplot as plt
tx = df_total_bow[df_total_bow.drop('total_sales', 1).columns].to_numpy()
ty = df_total_bow['total_sales'].to_numpy()
n = len(df_produits_en_ligne)
indep = tx.dot(ty) / n
c = df_total_bow.fillna(0)
measure = (c - indep)**2 / indep
xi_n = measure.sum().sum()
table = measure / xi_n
sns.heatmap(table.iloc[:-1, :-1], annot=c.iloc[:-1, :-1])
plt.show()
But I get
ValueError: shapes (714,46987) and (714,) not aligned: 46987 (dim 1) != 714 (dim 0)
Solution
Your question is a combination of text-mining tasks, which I will try to address briefly here. The first step, as always in NLP and text-mining projects, is cleaning, including removing stop words, stop characters, etc.:
import re
import pandas as pd
from nltk.corpus import stopwords
# to lowercase
df['Product_description'] = df['Product_description'].str.lower()
# replace_stop_characters
stop_chars = [
"\\/",
"\\[",
"\\]",
"\\:",
"\\|",
'\\"',
"\\?",
"\\<",
"\\>",
"\\,",
"\\(",
"\\)",
"\\\\",
"\\.",
"\\+",
"\\-",
"\\!",
"\\$",
"\\`",
"\\،",
"\\_",
]
stop_chars_pattern = "|".join(stop_chars)
df['Product_description'] = df.apply(
lambda row: re.sub(stop_chars_pattern, "", row["Product_description"]),
axis=1
)
# replace stop words
stop_words = stopwords.words('english')
stop_words.extend(['wine']) # extend the list as you wish
df['Product_description'] = df['Product_description'].map(
lambda x: ' '.join([w for w in x.split() if w not in stop_words])
)
print(df)
# Product_id Price Total_sales CA Product_description
# 0 1 24.0 28 672.0 fruity perfect starter
# 1 2 13.5 34 459.0 dry fullbodied
# 2 3 12.9 29 374.1 fresh perfect starter
# 3 4 34.0 42 1428.0 combining strength character
# 4 5 26.0 10 260.0 ruby color fullbodied
Next, you need to extract features (you mentioned count of words, phrases).
from sklearn.feature_extraction.text import CountVectorizer
# change the ngram_range to make combinations of words
count_vector = CountVectorizer(ngram_range=(1, 4), encoding="utf-8")
tpl_cntvec = count_vector.fit_transform(df['Product_description'])
df_cntvec = pd.DataFrame(
tpl_cntvec.toarray(), columns=count_vector.get_feature_names_out(), index=df.index
)
df_total_bow = pd.concat([df['Total_sales'], df_cntvec], axis = 1)
df_total_bow
#    Total_sales  character  color  color fullbodied  combining  ...  ruby color  ruby color fullbodied  starter  strength  strength character
# 0           28          0      0                 0          0  ...           0                      0        1         0                   0
# 1           34          0      0                 0          0  ...           0                      0        0         0                   0
# 2           29          0      0                 0          0  ...           0                      0        1         0                   0
# 3           42          1      0                 0          1  ...           0                      0        0         1                   1
# 4           10          0      1                 1          0  ...           1                      1        0         0                   0
Finally, you can make your models on the data:
import statsmodels.api as sm
# Here, I used ordinary least square regression method
x = df_total_bow.drop(columns='Total_sales').to_numpy()
y = df_total_bow['Total_sales'].to_numpy()
ols = sm.OLS(y, x)
fit = ols.fit()
print(fit.summary())
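If the full summary is too long to read (as with tens of thousands of n-gram columns), a minimal sketch for collecting each n-gram's coefficient and p-value into a sorted table, assuming the fit and df_total_bow objects built above:
# gather each n-gram's estimated coefficient and p-value, most significant first
feature_names = df_total_bow.drop(columns='Total_sales').columns
coef_table = pd.DataFrame({
    'feature': feature_names,
    'coef': fit.params,      # estimated effect of the n-gram on Total_sales
    'p_value': fit.pvalues,  # significance of that effect
}).sort_values('p_value')
print(coef_table.head(20))   # the 20 most significant word combinations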
Regarding your other questions:
- There are various statistical methods to find the importance of words in a text and their correlation with other variables. CountVectorizer is just a simple feature-extraction method; there are better ones, such as TfidfTransformer.
- The type of statistical test or model depends on the problem. Since you just need to find the correlation of word combinations with the sales statistics, simple regression-based methods with feature extraction are helpful. To rank the features (find the word combinations with the highest correlation and significance), recursive feature elimination (sklearn.feature_selection.RFE) might be practical; a rough sketch follows below.
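Since Total_sales is a numeric quantity rather than a class label, a regressor is a more natural estimator for RFE than the DecisionTreeClassifier tried above; a minimal sketch, reusing the x and y arrays from the regression step (the n_features_to_select value is arbitrary):
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeRegressor
# recursively drop the least useful n-grams until only a handful remain
rfe = RFE(estimator=DecisionTreeRegressor(random_state=0), n_features_to_select=10)
rfe.fit(x, y)
# the columns flagged by support_ are the word combinations RFE kept
selected = df_total_bow.drop(columns='Total_sales').columns[rfe.support_]
print(list(selected))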
Answered By – maaniB