Customer Segmentation is a powerful analytics technique to group customers and enable businesses to customize their product offering and marketing strategies.
For example, we can group customers by the month of their first purchase, segment by their recency, frequency and monetary values or use a clustering model like K-means to identify similar groups of customers based on their purchasing behavior.
Cohort analysis is a subset of behavioral analytics that takes the data from a given data set (e-commerce platform, web application, or online game) and rather than looking at all users as one unit, it breaks them into related groups for analysis. These related groups, or cohorts, usually share common characteristics or experiences within a defined time-span. Cohort analysis allows a company to “see patterns clearly across the life-cycle of a customer (or user), rather than slicing across all customers blindly without accounting for the natural cycle that a customer undergoes.” By seeing these patterns of time, a company can adapt and tailor its service to those specific cohorts. While cohort analysis is sometimes associated with a cohort study, they are different and should not be viewed as one and the same.
Types of cohorts
Elements of cohort analysis
The dataset contains over $0.5$ million transactions from a UK-based online retail store. We will work with a randomly sampled $20\%$ subset of it.
import pandas as pd
import numpy as np
import datetime as dt
online = pd.read_csv('online.csv', index_col='Unnamed: 0')
online.info()
online.InvoiceDate = pd.to_datetime(online.InvoiceDate)
online.dtypes
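As a side note, the date parsing could also be done at load time instead of converting afterwards; a minimal sketch, assuming the same online.csv file and column name:
# Hypothetical alternative: parse InvoiceDate while reading the CSV
online = pd.read_csv('online.csv', index_col='Unnamed: 0', parse_dates=['InvoiceDate'])
online.dtypes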
Defining a cohort is the first step to cohort analysis. We will now create daily cohorts based on the day each customer has made their first transaction.
# Define a function that will parse the date
def get_day(x):
    return dt.datetime(x.year, x.month, x.day)
# Create InvoiceDay column
online['InvoiceDay'] = online['InvoiceDate'].apply(get_day)
# Group by CustomerID and select the InvoiceDay value
grouping = online.groupby('CustomerID')['InvoiceDay']
# Assign a minimum InvoiceDay value to the dataset
online['CohortDay'] = grouping.transform('min')
# View the top 5 rows
online.head()
Now each customer belongs to a daily acquisition cohort that we can use for further analysis!
Calculating a time offset for each transaction allows us to report the metrics for each cohort in a comparable fashion.
First, we will create $6$ variables that capture the integer values of the year, month and day for the Invoice and Cohort dates, using a get_date_int() function.
def get_date_int(df, column):
    year = df[column].dt.year
    month = df[column].dt.month
    day = df[column].dt.day
    return year, month, day
# Get the integers for date parts from the InvoiceDay column
invoice_year, invoice_month, invoice_day = get_date_int(online, 'InvoiceDay')
# Get the integers for date parts from the CohortDay column
cohort_year, cohort_month, cohort_day = get_date_int(online, 'CohortDay')
Now we will use these integer values to calculate business metrics for our time cohorts!
Great work! Now we have $6$ different series with year, month and day values for the Invoice and Cohort dates - invoice_year, cohort_year, invoice_month, cohort_month, invoice_day, and cohort_day.
In this exercise we will calculate the difference between the Invoice and Cohort dates in years, months and days separately and then calculate the total days difference between the two. This will be our days offset which we will use in the next exercise to visualize the customer count.
# Calculate difference in years
years_diff = invoice_year - cohort_year
# Calculate difference in months
months_diff = invoice_month - cohort_month
# Calculate difference in days
days_diff = invoice_day - cohort_day
# Calculate the total day offset (approximating a year as 365 days and a month as 30 days)
online['CohortIndex'] = years_diff * 365 + months_diff * 30 + days_diff + 1
online.head()
We have successfully assigned the daily time offset to each transaction and can use it for running daily cohort analysis!
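As an aside, since InvoiceDay and CohortDay are both datetime columns, an exact day offset could also be computed by subtracting them directly; a minimal sketch, assuming the columns created above (the CohortIndex_exact name is just illustrative):
# Hypothetical alternative: exact day offset via datetime subtraction
online['CohortIndex_exact'] = (online['InvoiceDay'] - online['CohortDay']).dt.days + 1
online[['CohortIndex', 'CohortIndex_exact']].head()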
Customer retention: cohort_counts table
We have seen how retention and average quantity metrics tables are created for monthly acquisition cohorts. Now it's time to build the retention metrics ourselves.
# Define a function that will parse the month from the date
def get_month(x):
    return dt.datetime(x.year, x.month, 1)
# Extract the difference in months from all previous values
online['CohortIndex_M'] = years_diff * 12 + months_diff + 1
# Create InvoiceMonth column
online['InvoiceMonth'] = online['InvoiceDate'].apply(get_month)
# Group by CustomerID and select the InvoiceMonth value
grouping = online.groupby('CustomerID')['InvoiceMonth']
# Assign a minimum InvoiceMonth value to the dataset
online['CohortMonth'] = grouping.transform('min')
# View the top 5 rows
online.head()
# Group by CohortMonth and CohortIndex
grouping = online.groupby(['CohortMonth', 'CohortIndex_M'])
# Count the number of customers in each group
cohort_data = grouping['CustomerID'].apply(pd.Series.nunique)
# Reset the index
cohort_data = cohort_data.reset_index()
# Create a pivot table with CohortMonth as rows, CohortIndex_M as the columns, and unique CustomerID counts as the values
cohort_counts = cohort_data.pivot(index='CohortMonth', columns='CohortIndex_M', values='CustomerID')
cohort_counts
# Select the first column and store it to cohort_sizes
cohort_sizes = cohort_counts.iloc[:,0]
# Divide the cohort count by cohort sizes along the rows
retention = cohort_counts.divide(cohort_sizes, axis=0)
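To make the retention table easier to read, it could also be expressed as percentages; a quick sketch, assuming the retention DataFrame created above:
# View retention as percentages, rounded to one decimal
print((retention * 100).round(1))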
We will now calculate the average price metric and analyze if there are any differences in shopping patterns across time and across cohorts.
# Create a groupby object and pass the monthly cohort and cohort index as a list
grouping = online.groupby(['CohortMonth', 'CohortIndex_M'])
# Calculate the average of the unit price
cohort_data = grouping['UnitPrice'].mean()
# Reset the index of cohort_data
cohort_data = cohort_data.reset_index()
# Create a pivot
average_price = cohort_data.pivot(index='CohortMonth', columns='CohortIndex_M', values='UnitPrice')
print(average_price.round(1))
We are now going to visualize average quantity values in a heatmap.
# Create a groupby object and pass the monthly cohort and cohort index as a list
grouping = online.groupby(['CohortMonth', 'CohortIndex_M'])
# Calculate the average of the quantity
cohort_data = grouping['Quantity'].mean()
# Reset the index of cohort_data
cohort_data = cohort_data.reset_index()
# Create a pivot
average_quantity = cohort_data.pivot(index='CohortMonth', columns='CohortIndex_M', values='Quantity')
import matplotlib.pyplot as plt
import seaborn as sns
# Initialize a 12 by 6 inches plot figure
fig, ax = plt.subplots(figsize=(12, 6))
# Add a title
ax.set_title('Average Quantity by Monthly Cohorts')
# Create the heatmap
sns.heatmap(average_quantity, annot=True, fmt=".1f", annot_kws={"size": 12}, ax=ax)
# Work around a matplotlib/seaborn issue that cuts off the top and bottom heatmap rows
b, t = ax.set_ylim()
b += 0.5
t -= 0.5
ax.set_ylim(b, t)
plt.show()
What is RFM segmentation?
Behavioral customer segmentation based on three metrics: Recency (R), Frequency (F) and Monetary value (M).
Grouping RFM values
The RFM values can be grouped in several ways:
We are going to implement percentile-based grouping.
Short review of percentiles
Process of calculating percentiles:
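As an illustration of percentile-based grouping, a minimal sketch using pandas qcut on a made-up toy series (the data and labels are purely illustrative):
# Hypothetical example: split a toy series into 4 percentile groups labelled 1-4
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])
print(pd.qcut(toy, q=4, labels=range(1, 5)))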
Calculate Recency, Frequency and Monetary values for the online dataset we have used before.
# Calculate the total sum
online['TotalSum'] = online['Quantity'] * online['UnitPrice']
# Print min and max dates
print('Min:{}; Max:{}'.format(min(online.InvoiceDate), max(online.InvoiceDate)))
# Create a hypothetical snapshot_day data as if we're doing analysis recently
snapshot_date = max(online.InvoiceDate) + dt.timedelta(days=1)
print(snapshot_date)
# Calculate Recency, Frequency and Monetary value for each customer
datamart = online.groupby(['CustomerID']).agg({
    'InvoiceDate': lambda x: (snapshot_date - x.max()).days,
    'InvoiceNo': 'count',
    'TotalSum': 'sum'})
# Rename the columns
datamart.rename(columns={'InvoiceDate': 'Recency',
                         'InvoiceNo': 'Frequency',
                         'TotalSum': 'MonetaryValue'}, inplace=True)
# Print top 5 rows
datamart.head()
We will now group the customers into three separate groups based on Recency and Frequency. We will use the result in the next exercise, where we will group customers based on the MonetaryValue and finally calculate an RFM_Score.
# Create labels for Recency and Frequency
r_labels = range(3, 0, -1); f_labels = range(1, 4)
# Assign these labels to three equal percentile groups
r_groups = pd.qcut(datamart['Recency'], q=3, labels=r_labels)
# Assign these labels to three equal percentile groups
f_groups = pd.qcut(datamart['Frequency'], q=3, labels=f_labels)
# Create new columns R and F
datamart = datamart.assign(R=r_groups.values, F=f_groups.values)
We will now finish the job by assigning customers to three groups based on the MonetaryValue percentiles and then calculate an RFM_Score, which is a sum of the $R$, $F$, and $M$ values.
# Create labels for MonetaryValue
m_labels = range(1, 4)
# Assign these labels to three equal percentile groups
m_groups = pd.qcut(datamart['MonetaryValue'], q=3, labels=m_labels)
# Create new column M
datamart = datamart.assign(M=m_groups.values)
# Calculate RFM_Score
datamart['RFM_Score'] = datamart[['R','F','M']].sum(axis=1)
datamart.head()
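A common companion to the numeric score is a concatenated RFM segment string; a sketch, assuming the R, F and M columns created above (the RFM_Segment name is just illustrative):
# Hypothetical: concatenate R, F and M into a string segment such as '343'
datamart['RFM_Segment'] = (datamart['R'].astype(str)
                           + datamart['F'].astype(str)
                           + datamart['M'].astype(str))
datamart[['RFM_Segment', 'RFM_Score']].head()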
Here we will create a custom segmentation based on RFM_Score values. We will create a function to build the segmentation and then assign it to each customer.
# Define rfm_level function
def rfm_level(df):
    if df['RFM_Score'] >= 9:
        return 'Top'
    elif ((df['RFM_Score'] >= 5) and (df['RFM_Score'] < 9)):
        return 'Middle'
    else:
        return 'Low'
# Create a new variable RFM_Level
datamart['RFM_Level'] = datamart.apply(rfm_level, axis=1)
# Print the top 5 rows to the console
datamart.head()
As a final step, we will analyze the average values of Recency, Frequency and MonetaryValue for the custom segments we've created.
# Calculate average values for each RFM_Level, and return a size of each segment
rfm_level_agg = datamart.groupby('RFM_Level').agg({
    'Recency': 'mean',
    'Frequency': 'mean',
    # Return the size of each segment
    'MonetaryValue': ['mean', 'count']}).round(1)
# Print the aggregated dataset
rfm_level_agg
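Since the aggregation above applies multiple functions to one column, the result has hierarchical column labels; a sketch of flattening them, with purely illustrative new column names:
# Hypothetical: flatten the MultiIndex columns produced by agg()
rfm_level_agg.columns = ['RecencyMean', 'FrequencyMean', 'MonetaryMean', 'Count']
rfm_level_agg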
Advantages of k-means clustering (with certain assumptions about the data)
Key k-means assumptions: symmetric (non-skewed) distribution of variables, and variables on the same scale
We will now explore the RFM distributions.
datamart_rfm = datamart[['Recency', 'Frequency', 'MonetaryValue']]
# Plot recency distribution
plt.subplot(3, 1, 1); sns.distplot(datamart_rfm['Recency'])
# Plot frequency distribution
plt.subplot(3, 1, 2); sns.distplot(datamart_rfm['Frequency'])
# Plot monetary value distribution
plt.subplot(3, 1, 3); sns.distplot(datamart_rfm['MonetaryValue'])
# Show the plot
plt.tight_layout()
plt.show()
Since the variables are skewed and are on different scales, we will now un-skew and scale them.
from sklearn.preprocessing import StandardScaler
# Unskew the data
datamart_log = np.log(datamart_rfm)
# Initialize a standard scaler and fit it
scaler = StandardScaler()
scaler.fit(datamart_log)
# Scale and center the data
datamart_normalized = scaler.transform(datamart_log)
# Create a pandas DataFrame
datamart_normalized = pd.DataFrame(data=datamart_normalized, index=datamart_rfm.index,
columns=datamart_rfm.columns)
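As a quick sanity check (a sketch using the DataFrame just created), the transformed variables should now have roughly zero mean and unit standard deviation:
# Verify that the normalized variables are centered and scaled
print(datamart_normalized.describe().round(2))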
Now we will plot the normalized and unskewed variables to see the difference in the distribution as well as the range of the values.
# Plot recency distribution
plt.subplot(3, 1, 1); sns.distplot(datamart_normalized['Recency'])
# Plot frequency distribution
plt.subplot(3, 1, 2); sns.distplot(datamart_normalized['Frequency'])
# Plot monetary value distribution
plt.subplot(3, 1, 3); sns.distplot(datamart_normalized['MonetaryValue'])
# Show the plot
plt.tight_layout()
plt.show()
We will now build a $3$-cluster k-means model.
# Import KMeans
from sklearn.cluster import KMeans
# Initialize KMeans
kmeans = KMeans(n_clusters=3, random_state=1)
# Fit k-means clustering on the normalized data set
kmeans.fit(datamart_normalized)
# Extract cluster labels
cluster_labels = kmeans.labels_
We will now analyze the average RFM values of the three clusters we've created in the previous exercise.
# Create a DataFrame by adding a new cluster label column
datamart_rfm_k3 = datamart_rfm.assign(Cluster=cluster_labels)
# Group the data by cluster
grouped = datamart_rfm_k3.groupby(['Cluster'])
# Calculate average RFM values and segment sizes per cluster value
grouped.agg({
'Recency': 'mean',
'Frequency': 'mean',
'MonetaryValue': ['mean', 'count']
}).round(1)
You can immediately see the differences in RFM values of these segments!
Methods
Elbow criterion method
Experimental approach - analyze segments
In this exercise, we will calculate the sum of squared errors for different numbers of clusters, ranging from $1$ to $20$.
sse = dict()
# Fit KMeans and calculate SSE for each k
for k in range(1, 21):
    # Initialize KMeans with k clusters
    kmeans = KMeans(n_clusters=k, random_state=1)
    # Fit KMeans on the normalized dataset
    kmeans.fit(datamart_normalized)
    # Assign the sum of squared distances to the k element of the dictionary
    sse[k] = kmeans.inertia_
Now we will plot the sum of squared errors for each value of $k$ and identify if there is an elbow. This will guide us towards the recommended number of clusters to use.
# Add the plot title "The Elbow Method"
plt.title('The Elbow Method')
# Add X-axis label "k"
plt.xlabel('k')
# Add Y-axis label "SSE"
plt.ylabel('SSE')
# Plot SSE values for each key in the dictionary
sns.pointplot(x=list(sse.keys()), y=list(sse.values()))
plt.show()
You can see the elbow is clearly around $2-3$ clusters!
The Silhouette Coefficient is calculated using the mean intra-cluster distance $(a)$ and the mean nearest-cluster distance $(b)$ for each sample. The Silhouette Coefficient for a sample is
$$s = \frac{b - a}{\max(a, b)}$$
To clarify, $b$ is the mean distance between a sample and all points in the nearest cluster that the sample is not a part of, and $a$ is the mean distance between a sample and all other points in the same cluster. Note that the Silhouette Coefficient is only defined if the number of labels satisfies $2 \leq n\_labels \leq n\_samples - 1$.
The best value is $1$ and the worst value is $-1$. Values near $0$ indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.
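Before the full visualization below, a minimal sketch of how the average silhouette score could be computed for a few candidate values of k (assuming the datamart_normalized data and the KMeans setup used earlier):
from sklearn.metrics import silhouette_score
# Average silhouette score for a few candidate cluster counts
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=1).fit_predict(datamart_normalized)
    print(k, round(silhouette_score(datamart_normalized, labels), 3))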
Now we will plot the silhouette coefficients for each value of $k$ and identify the highest value. This will guide us towards the recommended number of clusters to use.
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.cm as cm
X = datamart_normalized.values
range_n_clusters = [2, 3, 4, 5, 6]
for n_clusters in range_n_clusters:
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)
    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])
    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(X)
    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)
    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(X, cluster_labels)
    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]
        ith_cluster_silhouette_values.sort()
        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i
        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)
        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples
    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")
    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")
    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])
    # 2nd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(X[:, 0], X[:, 1], marker='.', s=30, lw=0, alpha=0.7, c=colors, edgecolor='k')
    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1], marker='o', c="white", alpha=1, s=200, edgecolor='k')
    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1, s=50, edgecolor='k')
    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")
    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')
    plt.show()
Approaches to build customer personas
Snake plots to understand and compare segments
Now we will prepare data for the snake plot. We will use the 3-cluster RFM segmentation solution we have built previously. We will transform the normalized RFM data into a long format by "melting" the metric columns into two columns - one for the name of the metric, and another for the actual numeric value.
datamart_normalized['Cluster'] = datamart_rfm_k3['Cluster']
# Melt the normalized dataset and reset the index
datamart_melt = pd.melt(datamart_normalized.reset_index(),
                        # Assign CustomerID and Cluster as ID variables
                        id_vars=['CustomerID', 'Cluster'],
                        # Assign RFM values as value variables
                        value_vars=['Recency', 'Frequency', 'MonetaryValue'],
                        # Name the variable and value columns
                        var_name='Metric', value_name='Value')
datamart_melt.head()
We will now use the melted dataset to build the snake plot.
# Add the plot title
plt.title('Snake plot of normalized variables')
# Add the x axis label
plt.xlabel('Metric')
# Add the y axis label
plt.ylabel('Value')
# Plot a line for each value of the cluster variable
sns.lineplot(data=datamart_melt, x='Metric', y='Value', hue='Cluster')
plt.show()
Relative importance of segment attributes
Now we will calculate the relative importance of the RFM values within each cluster.
# Calculate average RFM values for each cluster
cluster_avg = datamart_rfm_k3.groupby(['Cluster']).mean()
# Calculate average RFM values for the total customer population
population_avg = datamart_rfm.mean()
# Calculate relative importance of cluster's attribute value compared to population
relative_imp = cluster_avg / population_avg - 1
# Print relative importance score rounded to 2 decimals
relative_imp.round(2)
As the ratio moves further away from $0$, the attribute's importance for a segment (relative to the total population) increases.
Now we will build a heatmap visualizing the relative scores for each cluster.
# Initialize a plot with a figure size of 8 by 2 inches
plt.figure(figsize=(8, 2))
# Add the plot title
plt.title('Relative importance of attributes')
# Plot the heatmap
sns.heatmap(data=relative_imp, annot=True, fmt='.2f', cmap='RdYlGn', annot_kws={"size": 16})
# Work around the same matplotlib/seaborn issue that cuts off the top and bottom heatmap rows
b, t = plt.ylim()
b += 0.5
t -= 0.5
plt.ylim(b, t)
plt.show()