Unsupervised learning use case: mobile users segmentation

In this post I’d like to show a minimal example of using Python for an unsupervised learning task: clustering. My goal is to segment customers based on their past mobile data usage.

In particular, I’ve used pandas and NumPy for data manipulation and preprocessing, scikit-learn for the k-means algorithm, and matplotlib for data visualization.

Source dataframe

I’m going to play with a pandas dataframe that contains the monthly mobile data usage (in MB) of N users, with one row per user and one column per month:

# Print transposed for better understanding
print(df.head(3).transpose())
                 USER1       USER2     ...  USERN 
2016-01-01         421         185     ...   1260
2015-12-01         739         282     ...    973
2015-11-01         751         186     ...    978
2015-10-01         738         388     ...    767
2015-09-01         419         185     ...    688
2015-08-01        1108         489     ...    792
2015-07-01         670         185     ...    981
2015-06-01         758         582     ...    858
2015-05-01         537         185     ...    716
2015-04-01         445         381     ...    817
2015-03-01         138         185     ...    763
2015-02-01         188         384     ...    800
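
The original dataset is not included in this post, so if you want to follow along, a synthetic dataframe with the same layout can stand in for it. This is only a sketch: the function name make_usage_dataframe, the three usage levels (200, 600 and 1000 MB) and the noise level are assumptions made for illustration, not properties of the real data.

```python
import numpy as np
import pandas as pd

def make_usage_dataframe(num_users=500, num_months=12, seed=0):
    """Build a synthetic users x months dataframe of monthly usage in MB."""
    rng = np.random.default_rng(seed)

    # Month-start dates, most recent first, as in the printout above.
    months = pd.date_range(end="2016-01-01", periods=num_months, freq="MS")[::-1]

    # Hypothetical typical usage (MB) for low, medium and high users.
    levels = rng.choice([200, 600, 1000], size=num_users)

    # Each user's monthly usage fluctuates around their own level;
    # negative values are clipped to 0.
    noise = rng.normal(0, 100, size=(num_users, num_months))
    usage = np.clip(levels[:, None] + noise, 0, None).round().astype(int)

    index = ["USER%d" % (i + 1) for i in range(num_users)]
    return pd.DataFrame(usage, index=index, columns=months)

df = make_usage_dataframe()
print(df.head(3).transpose())
```

Because the columns are real timestamps, the strftime call used for the x-ticks of the heatmap should work on this dataframe unchanged.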

Unclustered heatmap

Let’s plot a heatmap of data usage by 500 customers during the past 12 months.

import numpy as np
import matplotlib.pyplot as plt

def plot_dataframe_heatmap(df, title):
  # replace missing values with 0 and cast to int
  df = df.fillna(0).astype(int)

  # create a new figure
  plt.figure(figsize=(18, 5))

  # generate the heatmap (rows: users, columns: months);
  # to_numpy() replaces as_matrix(), which newer pandas no longer has
  m = plt.pcolormesh(df.to_numpy(), cmap='viridis_r')
  plt.colorbar(m)

  # set an x-tick at the center of each month column
  plt.xticks(
    np.arange(0.5, len(df.columns), 1),
    [e.strftime('%Y-%m') for e in df.columns],
    rotation=45)

  # set plot labels and title
  plt.xlabel('Last %d months' % len(df.columns))
  plt.ylabel('Mobile users')
  plt.title(title)

  # show the plot
  plt.show()

plot_dataframe_heatmap(df, title="Mobile MB Usage Heatmap")
Figure: heatmap showing the source unclustered data.

Clustering with k-means

Choosing the optimal number of clusters is not trivial and is beyond the scope of this post. I’ll assume there are 3 clusters, corresponding to low, medium and high data-usage customers.

import pandas as pd
from sklearn.cluster import KMeans

def get_clustered_dataframe(df):
  # k-means cannot handle NaNs, so fill them with 0 as in the heatmap
  X = df.fillna(0).to_numpy()

  # create the predictor, fit our data and
  # get the cluster index assigned to each user
  pred = KMeans(n_clusters=3)
  y_pred = pred.fit_predict(X)

  # add a new temporary column to the dataframe
  # indicating the cluster index, then sort by it
  df_pred = pd.DataFrame(
    y_pred,
    index=df.index,
    columns=['cluster'])
  df_clustered = pd.concat([df, df_pred], axis=1)
  df_clustered = df_clustered.sort_values(by='cluster')
  del df_clustered['cluster']

  return df_clustered, pred.cluster_centers_

df_clustered, clusters = get_clustered_dataframe(df)
plot_dataframe_heatmap(df_clustered,
  title='Mobile MB Usage Heatmap (clustered)')
Figure: heatmap of the data sorted by cluster.
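
The cluster indices returned by k-means are arbitrary, so cluster 0 is not necessarily the low-usage group. One possible way to attach the low, medium and high names to the clusters is to rank the centroids by their mean monthly usage. The sketch below assumes that ranking is a sensible criterion; the helper label_clusters_by_usage and the toy data are hypothetical and not part of the original code.

```python
import numpy as np
from sklearn.cluster import KMeans

def label_clusters_by_usage(cluster_centers, names=("low", "medium", "high")):
    """Map each cluster index to a name, ordered by mean centroid usage."""
    # Average each centroid over the months: one usage level per cluster.
    mean_usage = np.asarray(cluster_centers).mean(axis=1)
    # argsort lists cluster indices from lowest to highest mean usage.
    order = np.argsort(mean_usage)
    return {int(cluster): name for cluster, name in zip(order, names)}

# Toy data (an assumption): three groups of 50 users, 12 months each,
# with clearly separated usage levels in MB.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(level, 50, size=(50, 12))
               for level in (200, 600, 1000)])

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = label_clusters_by_usage(model.cluster_centers_)

# Translate each user's cluster index into a segment name.
segments = [labels[c] for c in model.labels_]
print(labels)
```

With the function from this post, the same helper could be applied to the second return value, for example label_clusters_by_usage(clusters).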

Conclusion

Clustering is a powerful way to extract value from unlabeled data, and scikit-learn makes it really easy. In this case, it allowed me to segment customers into groups that share a similar network usage pattern, which is really valuable for a mobile operator. In addition, plotting heatmaps has helped me to better understand the data and verify the clustering behaviour.
