In this project, I apply unsupervised learning techniques to identify segments of the population that form the core customer base for a mail-order sales company in Germany. These segments can then be used to direct marketing campaigns towards audiences that will have the highest expected rate of return.
# import libraries here; add more as necessary
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.core.display import HTML
from scipy.stats import itemfreq, chisquare
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
# Set base plotting style
plt.style.use('seaborn-ticks')
# Set base plotting size
plt.rcParams['figure.figsize'] = 12, 9
# Magic word for producing visualizations in notebook
%matplotlib inline
# Increase figure resolution for high dpi screens
%config InlineBackend.figure_format = 'retina'
# Autoreload modules
%load_ext autoreload
%autoreload 2
There are four files associated with this project:

- Demographics.csv: Demographics data for the general population of Germany; 891211 persons (rows) x 85 features (columns).
- Customers.csv: Demographics data for customers of a mail-order company; 191652 persons (rows) x 85 features (columns).
- Data_Dictionary.md: Detailed information file about the features in the provided datasets.
- Feature_Summary.csv: Summary of feature attributes for demographics data; 85 features (rows) x 4 columns.

Each row of the demographics files represents a single person, but also includes information outside of individuals, including information about their household, building, and neighborhood. I will use this information to cluster the general population into groups with similar demographic properties. Then, I will see how the people in the customers dataset fit into those created clusters. The hope here is that certain clusters are over-represented in the customers data, as compared to the general population; those over-represented clusters will be assumed to be part of the core userbase. This information can then be used for further applications, such as targeting for a marketing campaign.
# Load in the general demographics data.
azdias = pd.read_csv('Demographics.csv', delimiter=';')
# Load in the feature summary file.
feat_info = pd.read_csv('Feature_Summary.csv', delimiter=';')
feat_info.head()
azdias.head()
df = azdias.copy()
df.head()
df.describe()
df.info()
# Percentage of NaNs per column before processing
nan_perc = (df.isnull().mean() * 100).sort_values(ascending=False)
nan_perc[:20]
KK_KUNDENTYP has the highest percentage of missing values (65%); the next highest, KBA05_ANTG1, is at 14.9%, a very different scale.
# How many columns don't have NaNs?
nan_perc[nan_perc == 0].shape[0]
# Percentage of columns that have NaNs
nan_perc[nan_perc > 0].shape[0] / df.shape[1] * 100
# How many nans before processing?
print('Total percentage of Nans before processing: ',
round(df.isnull().sum().sum() / np.product(df.shape) * 100, 2),
'%')
The fourth column of the feature attributes summary (loaded in above as feat_info) documents the codes from the data dictionary that indicate missing or unknown data. While the file encodes this as a list (e.g. [-1,0]), it will get read in as a string object.
# find out the different kind of encodings
feat_info.missing_or_unknown.unique()
# check for the XX type
feat_info[feat_info.missing_or_unknown == '[XX]']
def convert_to_list_for_X_or_XX(s):
    '''Take the column with the missing value
    encodings and convert it to a list, specifically handling
    the cases where X or XX is in the encodings.
    TODO: Generalize for any new string'''
if 'XX' in s:
if len(s) == 4:
return [s[1:-1]]
else:
return [-1, 'XX']
elif 'X' in s:
return [-1, 'X']
else:
return eval(s)
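As a possible way to address the TODO above, here is a sketch of a more general parser (the helper name parse_missing_codes is mine, not part of the project); instead of special-casing X and XX, it treats any non-numeric code as a string:

import ast

def parse_missing_codes(s):
    '''Parse a string like "[-1,0]" or "[-1,XX]" into a list,
    keeping non-numeric codes such as X or XX as strings.'''
    parsed = []
    for item in s.strip('[]').split(','):
        item = item.strip()
        if not item:
            continue
        try:
            parsed.append(ast.literal_eval(item))  # numeric codes
        except (ValueError, SyntaxError):
            parsed.append(item)                    # string codes like 'X', 'XX'
    return parsed

# e.g. parse_missing_codes('[-1,XX]') -> [-1, 'XX']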
# Process the missing values encodings
feat_info.missing_or_unknown = feat_info.missing_or_unknown.apply(convert_to_list_for_X_or_XX)
# Create a dictionary to map column names to missing data encodings
missing_data_encode = dict(zip(feat_info.attribute, feat_info.missing_or_unknown))
# test
missing_data_encode['AGER_TYP']
# test
missing_data_encode['CAMEO_DEU_2015']
# Convert to nans using the missing_data_encode dictionary
for col in missing_data_encode.keys():
nans_before = df[col].isnull().sum()
df.loc[df[col].isin(missing_data_encode[col]), col] = np.nan
nans_after = df[col].isnull().sum()
if nans_after != nans_before:
print(col, 'nans before: ', nans_before)
print(col, 'nans after: ', nans_after, '\n')
# How many nans after processing?
print('Total percentage of Nans after processing: ',
round(df.isnull().sum().sum() / np.product(df.shape) * 100, 2),
'%')
# A Series with the proportion of NaNs per column after processing
nan_perc = (df.isnull().mean()).sort_values(ascending=False)
nan_perc[:20]
# How many columns don't have NaNs?
nan_perc[nan_perc == 0].shape[0]
# as a percentage
nan_perc[nan_perc == 0].shape[0]/ df.shape[1] * 100
def hist_box_plot(x, x_label, y_label, bin_incr):
    '''Take an array as input and draw a histogram with a boxplot above it'''
f, (ax_box, ax_hist) = plt.subplots(2,
sharex=True,
gridspec_kw={
"height_ratios": (.15, .85)},
figsize=(14, 6))
sns.boxplot(x, ax=ax_box)
bins = np.arange(0, x.max() + bin_incr, bin_incr)
    x.hist(grid=False, bins=bins, ax=ax_hist)
ax_box.set(yticks=[])
ax_hist.set_ylabel(y_label)
ax_hist.set_xlabel(x_label)
sns.despine(ax=ax_hist)
sns.despine(ax=ax_box, left=True)
hist_box_plot(nan_perc, x_label='Proportion of missing values', y_label='% of features', bin_incr=0.05);
The majority of features have less than 20% missing values. There are six features with a considerably higher number of missing values.
# Which are the columns with the highest proportion of NaN values?
high_nan = nan_perc[nan_perc>0.2]
high_nan
# Merge feat_info with nan_perc
nan_perc_df = pd.DataFrame(nan_perc, columns=['perc_missing'])
feat_nan = (feat_info
.merge(nan_perc_df, left_on='attribute', right_index=True)
.drop(columns='missing_or_unknown', axis=1)
)
# Split the features into four categories by NaN proportion:
# High (>20%)
# Moderate (10-20%)
# Low (>0-10%)
# Zero (no NaNs)
feat_nan['nan_cat'] = pd.cut(feat_nan.perc_missing,
bins=[-np.inf, 0, 0.1, 0.2, np.inf],
labels=['zero', 'low', 'moderate', 'high'],
)
feat_nan.head()
fig, ax1 = plt.subplots(figsize=(14,8))
order = ['person', 'household', 'building', 'postcode', 'community',
'microcell_rr3', 'microcell_rr4', 'region_rr1', 'macrocell_plz8']
hue_order = ['zero', 'low', 'moderate', 'high']
sns.countplot(data=feat_nan,
x='information_level',
hue='nan_cat',
order=order,
hue_order=hue_order,
ax=ax1)
plt.legend(loc='upper right')
sns.despine(ax=ax1);
Observations

- The person information level has features with zero NaNs.
- NaN features (>0% - 10%) are distributed among the person, household and region information levels.
- NaN features (10% - 20%) is the only category distributed among all information levels.
- NaN features (>20%) are distributed among the person, household and microcell_rr3 information levels.

fig, ax1 = plt.subplots(figsize=(14,8))
order = ['person', 'household', 'building', 'postcode', 'community',
'microcell_rr3', 'microcell_rr4', 'region_rr1', 'macrocell_plz8']
hue_order = ['zero', 'low', 'moderate', 'high']
sns.swarmplot(data=feat_nan,
x='information_level',
y='perc_missing',
hue='nan_cat',
order=order,
hue_order=hue_order,
ax=ax1)
plt.legend(loc='upper right')
sns.despine(ax=ax1);
Observations

- Features within the same information level tend to have very similar percentages of missing values, especially for postcode, community, microcell_rr3 and 4 and macrocell_plz8. This suggests that there exists some methodical bias in creating the NaNs for each information level.
- The household information level has the most evenly distributed NaN categories.
- person and household consist mainly of moderate NaN categories.

# Look at the similar values of microcell_rr3
feat_nan[(feat_nan.information_level == 'microcell_rr3') & (feat_nan.nan_cat == 'moderate')]
This confirms the observation that features of the same attribute type (the KBA05 features classify according to the number of family houses and buildings in the microcell) have the exact same percentage of missing values.
The outlying columns with a high percentage of NaNs will be removed.
cols_to_remove = feat_nan[feat_nan.nan_cat == 'high'].attribute
cols_to_remove
df = df.drop(columns=cols_to_remove, axis=1)
df.shape
The total percentage of NaNs after converting the missing value encodings to NaNs is 11.05%.
Fig1 observations: The majority of features (79 out of 85) have less than 20% missing values. There are six features with a considerably higher number of missing values. These features are:

- TITEL_KZ: Academic title flag (1: Dr., 4: Prof. Dr., 5: other)
- AGER_TYP: Best-ager typology (1: passive elderly, 3: experience-driven elderly)
- KK_KUNDENTYP: Consumer pattern over past 12 months (1: regular customer, 6: passive customer). The overlap with SHOPPER_TYP and the high level of missing values would suggest that this feature can be dropped without too much loss of information.
- KBA05_BAUMAX: Most common building type within the microcell (1: mainly 1-2 family homes in the microcell, 4: mainly 10+ family homes in the microcell, 5: mainly business buildings in the microcell). The overlap with the other KBA05_ features and the high level of missing values suggests that this feature can be dropped without too much information loss.
- GEBURTSJAHR: Year of birth (Numeric). The overlap with ALTERSKATEGORIE_GROB suggests that this feature can be dropped without too much information loss.
- ALTER_HH: Birthdate of head of household (1: 1895-01-01 to 1899-12-31, 21: 1995-01-01 to 1999-12-31)

Fig2 observations:
- The person information level has features with zero NaNs.
- NaN features (>0% - 10%) are distributed among the person, household and region information levels.
- NaN features (10% - 20%) is the only category distributed among all information levels.
- NaN features (>20%) are distributed among the person, household and microcell_rr3 information levels.

Fig3 observations:
- Features within the same information level tend to have very similar percentages of missing values, especially for postcode, community, microcell_rr3 and 4 and macrocell_plz8. For example, the KBA05-type features with a moderate level of NaNs all have 14.6% of missing data. This suggests that there may exist some methodical bias in creating the NaNs for each information level.
- The household information level has the most evenly distributed NaN categories.
- person and household consist mainly of moderate NaN categories.

All six outlier columns with a high percentage of NaNs (>20%) will be removed.
df['row_nan_perc'] = df.isnull().mean(axis=1)
hist_box_plot(df['row_nan_perc'], x_label='Proportion of missing values', y_label='% of rows', bin_incr=0.01)
We will use 10% as the cut-off threshold to split the data into two separate subsets and compare their distributions.
threshhold = 0.1
df_lowna = df.loc[df['row_nan_perc'] < threshhold, :]
df_highna = df.loc[df['row_nan_perc'] > threshhold, :]
df_lowna.shape
# percentage of low na rows
df_lowna.shape[0] / df.shape[0]
df_highna.shape
# percentage of high na rows
df_highna.shape[0] / df.shape[0]
# find columns with no nans at lowna
df_lowna.dropna(axis=1).shape
# find columns with no nans at highna
df_highna.dropna(axis=1).shape
# get the common columns
common_cols = list(set(df_lowna.dropna(axis=1).columns).intersection(df_highna.dropna(axis=1).columns))
common_cols
# Drop the 'row_nan_perc'
common_cols.remove('row_nan_perc')
# compare distributions
for col in common_cols[:6]:
fig, axes = plt.subplots(1,2, figsize=(14, 6), sharey=True)
sns.countplot(df_lowna[col], ax=axes[0], color='b')
sns.countplot(df_highna[col], ax=axes[1], color='r')
Observations
# Statistical comparison of distributions
for col in common_cols:
print(col)
print(chisquare(df_highna[col].value_counts().sort_index(),
df_lowna[col].value_counts().sort_index(),
axis=0),
'\n')
The p-value is effectively 0 for all the chi-squared tests, meaning that the probability of observing the high-NaN subset's frequencies, if they followed the distribution of the low-NaN subset, is about 0%. This is statistically significant and supports the assumption that the two distributions are different [5].
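A caveat (my assumption, not part of the original analysis): scipy.stats.chisquare expects the observed and expected counts to have the same total, and the low-NaN subset is much larger than the high-NaN one. A sketch that rescales the expected frequencies first, assuming both subsets contain the same category levels for these columns:

for col in common_cols:
    f_obs = df_highna[col].value_counts().sort_index()
    f_exp = df_lowna[col].value_counts().sort_index()
    # rescale the expected counts to the observed total before testing
    f_exp = f_exp / f_exp.sum() * f_obs.sum()
    print(col, chisquare(f_obs, f_exp), '\n')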
Fig4 observations:
We used 10% as the cut-off threshold to split the data into two separate subsets and compared their distributions using 5 sample columns that do not have any NaN values.
Fig5 observations:
We will continue the analysis with the dataset that contains rows with less than 10% NaNs
# Keep only the rows with NaNs perc below 10%
df = df_lowna
# Drop the row nan percentage column
df = df[df.columns[:-1]]
df.head()
# Drop the missing_or_unknown column from the
# processed feat_info Df
feat_info = feat_info[feat_info.columns[:-1]]
feat_info.head()
# filter feat_info to the remaining columns
feat_info = feat_info[feat_info.attribute.isin(df.columns)]
feat_info.shape
fig, ax1 = plt.subplots(figsize=(14,8))
sns.countplot(data=feat_info,
x='type',
color='c',
ax=ax1)
ax1.set_ylabel('No of features')
ax1.set_xlabel('Data Type')
for p in ax1.patches:
ax1.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.35, p.get_height()+1))
sns.despine(ax=ax1);
# Get categorical features
feat_info[feat_info.type == 'categorical']
The majority of the categorical features are of the person information_level.
# How many of each information_level
feat_info[feat_info.type == 'categorical'].groupby('information_level').count().iloc[:,-1]
# Get the categorical feature names
df_cat_cols = feat_info[feat_info.type == 'categorical'].attribute
df_cat_cols
# Find the number of levels
df[df_cat_cols].nunique()
# Count the number of levels, sorted
df[df_cat_cols].nunique().sort_values()
# how many binary
sum(df[df_cat_cols].nunique() == 2)
# how many multilevel
sum(df[df_cat_cols].nunique() != 2)
df[df_cat_cols].dtypes
The last three columns are of dtype object. Let's have a look at them.
df[df_cat_cols[-3:]].sample(10)
- OST_WEST_KZ: Building location via former East / West Germany (GDR / FRG) (O: East (GDR), W: West (FRG)).
- CAMEO_DEUG_2015: German CAMEO: Wealth / Life Stage Typology, rough scale (1: upper class, 9: urban working class), currently of object dtype. Since it is a multilevel categorical feature that will be transformed to dummy variables or dropped, it can stay like this.
- CAMEO_DEU_2015 should be a float64 data type but is encoded as object. It is a more detailed version of CAMEO_DEUG_2015 and the two would probably be highly correlated; further investigation should be done.

# Reencode the binary OST_WEST_KZ to numeric categories
df = df.copy()
df['OST_WEST_KZ'] = df['OST_WEST_KZ'].replace({'O': 0,
'W': 1})
# test
df[df_cat_cols[-3:]].sample(10)
# Choose the multilevel categorical features
cat_nunique = df[df_cat_cols].nunique()
cat_to_encode = cat_nunique[cat_nunique>2]
cat_to_encode.sort_values()
# DROPPING THE MULTI-LEVEL CATEGORICAL COLUMNS IN THIS ITERATION
df = df.drop(columns=cat_to_encode.index)
df.info()
The dataset contains 18 categorical variables, 5 of which are binary and 13 are multilevel.
- OST_WEST_KZ: Building location via former East / West Germany (GDR / FRG) (O: East (GDR), W: West (FRG)) has been re-encoded to numeric categories (0/1).
- CAMEO_DEUG_2015: German CAMEO: Wealth / Life Stage Typology, rough scale (1: upper class, 9: urban working class).

The rest of the binary features will be kept as is. The multilevel categorical features need to be one-hot encoded in order to be used for feature transformation and training our model. They are:
Feature | # of Levels |
---|---|
NATIONALITAET_KZ | 3 |
SHOPPER_TYP | 4 |
LP_FAMILIE_GROB | 5 |
LP_STATUS_GROB | 5 |
CJT_GESAMTTYP | 6 |
FINANZTYP | 6 |
ZABEOTYP | 6 |
GEBAEUDETYP | 7 |
CAMEO_DEUG_2015 | 9 |
LP_STATUS_FEIN | 10 |
LP_FAMILIE_FEIN | 11 |
GFK_URLAUBERTYP | 12 |
CAMEO_DEU_2015 | 44 |
Since some of the information they contain is covered by other features in the dataset (CAMEO_DEUG_2015 for CAMEO_DEU_2015, the FINANZ_ features for FINANZTYP, and LP_STATUS_FEIN and LP_STATUS_GROB for LP_FAMILIE_FEIN and LP_FAMILIE_GROB), we chose to drop them in this iteration of the analysis, to keep things more straightforward and not increase the number of variables too much at this phase.
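For a later iteration, a minimal sketch of the alternative (keeping the multi-level categorical features and one-hot encoding them instead of dropping them); here df_with_cats is a hypothetical name for the cleaned dataframe before the drop in the cell above:

# One-hot encode the multi-level categorical columns listed in cat_to_encode
df_onehot = pd.get_dummies(df_with_cats,
                           columns=cat_to_encode.index.tolist(),
                           prefix_sep='_')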
# Get mixed dtype features
feat_info[feat_info.type == 'mixed']
Data Dictionary:
PRAEGENDE_JUGENDJAHRE: Dominating movement of person's youth (avantgarde vs. mainstream; east vs. west)
# Let's have a look at it
df.PRAEGENDE_JUGENDJAHRE.sample(10)
# Let's have a look at it
df.PRAEGENDE_JUGENDJAHRE.nunique()
plt.style.use('ggplot')
fig, ax1 = plt.subplots(figsize=(14,8))
sns.countplot(data=df,
x='PRAEGENDE_JUGENDJAHRE',
color="darkslategray",
ax=ax1);
df.PRAEGENDE_JUGENDJAHRE.value_counts()
df.PRAEGENDE_JUGENDJAHRE.isnull().sum()
df.PRAEGENDE_JUGENDJAHRE.isnull().mean() * 100
Observations

- Level 14 (90s - digital media kids (Mainstream, E+W)) leads with close to 172952 rows, followed by level 8 (70s - family orientation (Mainstream, E+W)) with 134165 observations.
- Levels 7 (60s - opponents to the building of the Wall (Avantgarde, E)), 13 (80s - Swords into ploughshares (Avantgarde, E)) and 2 (40s - reconstruction years (Avantgarde, E+W)) have the least observations with 3888, 5300 and 7340 rows respectively.
- NaNs: 2.16 % of the total.

New Variables Data encoding:
- ENG_DECADE: Person's decade of youth.
- ENG_MOVEMENT: Person's movement alignment.
decade_dict = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3, 7: 3,
8: 4, 9: 4, 10: 5, 11: 5, 12: 5, 13: 5,
14: 6, 15: 6
}
movement_dict = {1: 1, 3: 1, 5: 1, 8: 1, 10: 1, 12: 1, 14: 1,
2: 2, 4: 2, 6: 2, 7: 2, 9: 2, 11: 2, 13: 2, 15: 2
}
# It appears that map is approximately 10x faster than replace.
# Ref: https://stackoverflow.com/questions/20250771/remap-values-in-pandas-column-with-a-dict
df['ENG_DECADE'] = df['PRAEGENDE_JUGENDJAHRE'].map(decade_dict)
df['ENG_MOVEMENT'] = df['PRAEGENDE_JUGENDJAHRE'].map(movement_dict)
ENG_DECADE categories

fig, ax1 = plt.subplots(figsize=(14,8))
sns.countplot(data=df,
x='ENG_DECADE',
palette="Greens",
ax=ax1);
df.ENG_DECADE.value_counts()
# Sanity check
df.ENG_DECADE.isnull().sum()
Observations

- Level 6 (90s) leads with close to 212488 rows, followed by level 4 (70s) with 166658 observations, consistent with the original feature.
- Levels 1 (40s) and 2 (50s) have the least observations with 26923 and 71852 rows respectively.
- NaNs: 2.16 % of the feature's total values.

ENG_MOVEMENT categories

fig, ax1 = plt.subplots(figsize=(14,8))
sns.countplot(data=df,
x='ENG_MOVEMENT',
palette="Greens_r",
ax=ax1);
df.ENG_MOVEMENT.value_counts()
# Sanity check
df.ENG_MOVEMENT.isnull().sum()
Observations

- 1: Mainstream leads with close to 564070 rows, almost 3.5 times more than 2: Avantgarde with 166918 observations.
- NaNs: 2.16 % of the total.

Data Dictionary:
CAMEO_INTL_2015: Wealth / Life Stage Typology, mapped to international code
# Let's have a look at it
df.CAMEO_INTL_2015.head(10)
CAMEO_INTL_2015 categories

fig, ax1 = plt.subplots(figsize=(14,8))
levels_sort = [str(int(l)) for l in np.sort(df.CAMEO_INTL_2015.astype(float).unique())[:-1]]
sns.countplot(data=df,
x='CAMEO_INTL_2015',
color = 'darkslategray',
order = levels_sort,
ax=ax1);
df.CAMEO_INTL_2015.value_counts()
df.CAMEO_INTL_2015.isnull().sum()
df.CAMEO_INTL_2015.isnull().mean() * 100
Observations

- 51: Poorer Households - Pre-Family Couples & Singles leads with 128033 rows, followed by level 41: Less Affluent Households - Pre-Family Couples & Singles with 87902 observations, almost 1/3 less.
- 33: Comfortable Households - Families With School Age Children, 32: Comfortable Households - Young Couples With Children and 35: Comfortable Households - Elders In Retirement have the least observations with 9161, 9777 and 9882 rows respectively.
- NaNs: 4.32% of its total.

CAMEO_INTL_2015 combines information on two axes: wealth and life stage. We break up the two-digit codes by their 'tens'-place and 'ones'-place digits into two new ordinal variables (which, for the purposes of this project, is equivalent to just treating them as their raw numeric values).
New Variables Data encoding:

- ENG_WEALTH: Household's wealth
- ENG_LIFE_STAGE: Person's Life Stage
def breakup_codes(x, digit):
''' Breaks up two-digit codes by their
'tens'-place and 'ones'-place digits
into two new ordinal variables.
Leaves NaNs unchanged'''
if not pd.isna(x):
if digit == 'first':
return int(str(x)[0])
elif digit == 'second':
return int(str(x)[1])
return x
df['ENG_WEALTH'] = df['CAMEO_INTL_2015'].apply(breakup_codes, digit='first')
df['ENG_LIFE_STAGE'] = df['CAMEO_INTL_2015'].apply(breakup_codes, digit='second')
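A vectorized alternative to the apply above (my sketch; it assumes the CAMEO_INTL_2015 codes parse as two-digit numbers, with NaNs propagating through the arithmetic):

cameo_num = pd.to_numeric(df['CAMEO_INTL_2015'], errors='coerce')
wealth_alt = cameo_num // 10      # 'tens'-place digit -> wealth
life_stage_alt = cameo_num % 10   # 'ones'-place digit -> life stage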
ENG_WEALTH categories

fig, ax1 = plt.subplots(figsize=(14,8))
sns.countplot(data=df,
x='ENG_WEALTH',
color = 'teal',
ax=ax1);
df.ENG_WEALTH.value_counts()
# Sanity check
df.ENG_WEALTH.isnull().sum()
Observations

- 5: Poorer Households leads with close to 213978 rows, followed by level 4: Less Affluent Households with 181187 observations, consistent with the original feature.
- 3: Comfortable Households and 1: Wealthy Households have the least observations with 62642 and 111831 rows respectively.

ENG_LIFE_STAGE categories

fig, ax1 = plt.subplots(figsize=(14,8))
sns.countplot(data=df,
x='ENG_LIFE_STAGE',
color = 'c',
ax=ax1);
df.ENG_LIFE_STAGE.value_counts()
# Sanity check
df.ENG_LIFE_STAGE.isnull().sum()
mixed_feat_cols = feat_info[feat_info.type == 'mixed'].attribute
mixed_feat_cols
# How many levels per feature
df[mixed_feat_cols].nunique()
# Drop mixed dtype features
df = df.drop(mixed_feat_cols, axis=1)
df.head()
Observations

- 1: Pre-Family Couples & Singles leads with 232220 rows, followed by level 4: Older Families & Mature Couples with 220077 observations.
- 2: Young Couples With Children and 3: Families With School Age Children have the least observations with 71973 and 108061 rows respectively.

The dataset contains 6 mixed-type variables.
The mixed feature PRAEGENDE_JUGENDJAHRE: Dominating movement of person's youth (avantgarde vs. mainstream; east vs. west) (1: 40s - war years (Mainstream, E+W), 15: 90s - ecological awareness (Avantgarde, E+W)) has been explored, with the following observations (Fig7):

- 14 (90s - digital media kids (Mainstream, E+W)) leads with close to 172952 rows, followed by level 8 (70s - family orientation (Mainstream, E+W)) with 134165 observations.
- 7 (60s - opponents to the building of the Wall (Avantgarde, E)), 13 (80s - Swords into ploughshares (Avantgarde, E)) and 2 (40s - reconstruction years (Avantgarde, E+W)) have the least observations with 3888, 5300 and 7340 rows respectively.
- NaNs: 2.16 % of its total.

and two new variables have been engineered:
ENG_DECADE: Person's decade of youth (1: 40s, 6: 90s), Fig8 observations:

- Level 6 (90s) leads with close to 212488 rows, followed by level 4 (70s) with 166658 observations, consistent with the original feature.
- Levels 1 (40s) and 2 (50s) have the least observations with 26923 and 71852 rows respectively.

ENG_MOVEMENT: Person's movement alignment (1: Mainstream, 2: Avantgarde), Fig9 observations:

- 1: Mainstream leads with close to 564070 rows, almost 3.5 times more than 2: Avantgarde with 166918 observations.
- NaNs: 2.16 % of the total.

The mixed feature CAMEO_INTL_2015: German CAMEO: Wealth / Life Stage Typology, mapped to international code (11: Wealthy Households - Pre-Family Couples & Singles, 55: Poorer Households - Elders In Retirement), has been explored, with the following observations (Fig10):
- 51: Poorer Households - Pre-Family Couples & Singles leads with 128033 rows, followed by level 41: Less Affluent Households - Pre-Family Couples & Singles with 87902 observations, almost 1/3 less.
- 33: Comfortable Households - Families With School Age Children, 32: Comfortable Households - Young Couples With Children and 35: Comfortable Households - Elders In Retirement have the least observations with 9161, 9777 and 9882 rows respectively.
- NaNs: 4.32% of its total.

and two new variables have been engineered:
ENG_WEALTH: Household's wealth (1: Wealthy Households, 5: Poorer Households), Fig11 observations:

- 5: Poorer Households leads with close to 213978 rows, followed by level 4: Less Affluent Households with 181187 observations, consistent with the original feature.
- 3: Comfortable Households and 1: Wealthy Households have the least observations with 62642 and 111831 rows respectively.

ENG_LIFE_STAGE: Person's Life Stage (1: Pre-Family Couples & Singles, 5: Elders In Retirement), Fig12 observations:

- 1: Pre-Family Couples & Singles leads with 232220 rows, followed by level 4: Older Families & Mature Couples with 220077 observations.
- 2: Young Couples With Children and 3: Families With School Age Children have the least observations with 71973 and 108061 rows respectively.

The rest of the mixed features are:
Feature | # of Levels |
---|---|
LP_LEBENSPHASE_FEIN | 40 |
LP_LEBENSPHASE_GROB | 12 |
WOHNLAGE | 8 |
PLZ8_BAUMAX | 5 |
which we chose to drop in this iteration of the analysis, to keep things more straightforward and not increase the number of variables too much at this phase.
df.columns
df.shape
# Check if all colums are of numeric dtype
from pandas.api.types import is_numeric_dtype
#should be equal to the no of cols if all True
assert(sum([is_numeric_dtype(df[col]) for col in df.columns]) == df.shape[1])
print('Tests OK')
def clean_data(df):
"""
Perform feature trimming, re-encoding, and engineering for demographics
data
INPUT: Demographics DataFrame
OUTPUT: Trimmed and cleaned demographics DataFrame
"""
# Put in code here to execute all main cleaning steps:
# convert missing value codes into NaNs, ...
df = convert_missingdata_encodings_to_nans(df)
print('Convert missing value codes into NaN, OK')
# remove selected columns , ...
df = drop_selected_columns(df)
# remove selected rows and return both dfs
df, df_cust_highna = drop_selected_rows(df)
print('Remove selected columns and rows, OK')
# select, re-encode, and engineer column values.
df = reencode_categorical_feat(df)
df = engineer_mixed_feat(df)
print('Select, re-encode, and engineer column value, OK')
# Return the cleaned dataframe and the high NaN Df.
return df, df_cust_highna
def convert_missingdata_encodings_to_nans(df, debug=False):
'''Convert to nans using the missing_data_encode dictionary'''
for col in missing_data_encode.keys():
if debug:
print(col, 'nans before: ', df[col].isnull().sum())
df.loc[df[col].isin(missing_data_encode[col]), col] = np.nan
if debug:
print(col, 'nans after: ', df[col].isnull().sum(), '\n')
return df
def drop_selected_columns(df):
    '''Drop the columns with a high percentage of NaNs (cols_to_remove).'''
    return df.drop(columns=cols_to_remove)
def drop_selected_rows(df):
    '''Split the dataset into two parts:
    one with a percentage of NaNs per row over the threshold
    and one below. Return both.'''
threshhold = 0.1
df['row_nan_perc'] = df.isnull().mean(axis=1)
df_lowna = df.loc[df['row_nan_perc'] < threshhold, :].\
drop(columns='row_nan_perc', axis=1)
df_highna = df.loc[df['row_nan_perc'] > threshhold, :].\
drop(columns='row_nan_perc', axis=1)
return df_lowna, df_highna
def reencode_categorical_feat(df):
    '''Re-encode the binary OST_WEST_KZ to numeric categories and
    drop the multi-level categorical features.'''
# Reencode the binary OST_WEST_KZ to numeric categories
df['OST_WEST_KZ'] = df['OST_WEST_KZ'].replace({'O': 0,
'W': 1})
# choose the categorical features with more than 2 levels
df_cat_cols = feat_info[feat_info.type == 'categorical'].attribute
cat_nunique = df[df_cat_cols].nunique()
cat_to_encode = cat_nunique[cat_nunique > 2]
# DROP THE MULTI-LEVEL CATEGORICAL FEATURES
df = df.drop(columns=cat_to_encode.index, axis=0)
return df
def engineer_mixed_feat(df):
    '''Engineer new variables from the
    mixed features and drop the old features.'''
# Engineer two new variables from 'PRAEGENDE_JUGENDJAHRE'
decade_dict = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3, 7: 3,
8: 4, 9: 4, 10: 5, 11: 5, 12: 5, 13: 5,
14: 6, 15: 6
}
movement_dict = {1: 1, 3: 1, 5: 1, 8: 1, 10: 1, 12: 1, 14: 1,
2: 2, 4: 2, 6: 2, 7: 2, 9: 2, 11: 2, 13: 2, 15: 2
}
df['ENG_DECADE'] = df['PRAEGENDE_JUGENDJAHRE'].map(decade_dict)
df['ENG_MOVEMENT'] = df['PRAEGENDE_JUGENDJAHRE'].map(movement_dict)
# Engineer two new variables from 'CAMEO_INTL_2015'
df['ENG_WEALTH'] = df['CAMEO_INTL_2015'].apply(
breakup_codes, digit='first')
df['ENG_LIFE_STAGE'] = df['CAMEO_INTL_2015'].apply(
breakup_codes, digit='second')
# Drop mixed dtype features
mixed_feat_cols = feat_info[feat_info.type == 'mixed'].attribute
df = df.drop(mixed_feat_cols, axis=1)
return df
def test_clean_data(df):
'''A simple test function to test that the cleaning function
is at least producing an identical Dataframe as the one we created
TODO: proper unit tests'''
    df_test = pd.read_csv('Demographics.csv', delimiter=';')
print('Load the dataset, OK')
pd.testing.assert_frame_equal(clean_data(df_test)[0], df)
print('Tests Passed!')
test_clean_data(df)
# Percentage of nans per column
(df.isnull().mean(axis=0) * 100).sort_values(ascending=False)[:15]
W_KEIT_KIND_HH has the highest percentage of missing values (6%), with REGIOTYP and KKK next at 5.4%.
# Percentage of nans per row
nan_perc = (df.isnull().mean(axis=1) * 100).sort_values(ascending=False)
#How many rows have nans?
nan_perc[nan_perc > 0].shape[0]
# as a percentage of rows
print('Total percentage of rows that have NaNs: ',
round(nan_perc[nan_perc > 0].shape[0]/ df.shape[0] * 100, 4),
'%')
df.isnull().sum().sum()
np.product(df.shape)
# How many nans as perc?
print('Total percentage of Nans : ',
round(df.isnull().sum().sum() / np.product(df.shape) * 100, 2),
'%')
# Fit the Scaler without NaNs
df_drop = df.dropna(axis=0)
scaler = StandardScaler()
scaler.fit(df_drop)
# Impute missing values in the full df with the mean
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
X_impute = imp.fit_transform(df)
X_impute
# Transform the df with the fitted scaler
X_scale = scaler.transform(X_impute)
# Check the distribution of the feature with the highest perc of NaNs
df_scale = pd.DataFrame(X_scale, columns=df.columns)
df_scale.W_KEIT_KIND_HH.describe()
# check the distributions
df_scale.describe()
# How many nans as perc after?
print('Total percentage of Nans after drop: ',
round(df_drop.isnull().sum().sum() / np.product(df_drop.shape) * 100, 2),
'%')
The total percentage of NaNs in the dataset is 0.67%, while the percentage of rows that contain NaNs is 16.37%.
Dropping those rows would discard a significant amount of data in order to handle a much smaller amount of actually missing values.
Imputing them with the mean would result in them having a value of 0 after using StandardScaler(), possibly reducing the variance of the resulting variables.
The alternative used here is to fit the StandardScaler to the data without the NaNs, then impute the missing values with the mean and transform the full dataset with the fitted StandardScaler object, resulting in distributions that retain a higher amount of information.
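A minimal sketch wrapping this approach so that the exact same fitted imputer and scaler can later be re-applied to the customer data (the helper name impute_and_scale is mine, not part of the project template):

def impute_and_scale(data, imputer, scaler):
    '''Mean-impute the NaNs, then scale with a StandardScaler
    that was fitted on the complete (NaN-free) rows only.'''
    return scaler.transform(imputer.transform(data))

# e.g. X_scale = impute_and_scale(df, imp, scaler)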
X_scale.shape
# With all the features
pca = PCA().fit(X_scale)
num_components=len(pca.explained_variance_ratio_)
inds = np.arange(num_components)
vals = pca.explained_variance_ratio_
fig, ax = plt.subplots(figsize=(12,9))
plt.bar(inds, vals, color='b');
fig, ax = plt.subplots(figsize=(16,10))
plt.plot(np.cumsum(pca.explained_variance_ratio_), marker='o', color='b')
ax.set_xticks(inds+1, minor=False);
The numbers that stand out are 12, 18 and 25.
def scree_plot(pca, c, ann_f=6, figsize=(12, 9), pct_change=False):
'''
Creates a scree plot associated with the principal components
    INPUT: pca - the result of instantiation of PCA in scikit-learn
OUTPUT:
None
'''
num_components = len(pca.explained_variance_ratio_)
ind = np.arange(num_components) + 1
if pct_change:
vals = -pd.Series(pca.explained_variance_ratio_).pct_change()
# the first item is NaN
vals = vals[1:].tolist()
ind = ind[1:]
else:
vals = pca.explained_variance_ratio_
plt.figure(figsize=figsize)
    ax = plt.subplot(111)
cumvals = np.cumsum(vals)
ax.bar(ind, vals, color=c)
ax.plot(ind, cumvals, color=c)
for i in range(num_components):
try: # Bandage aid - need to fix
ax.annotate(r"%s%%" % (
(str(vals[i] * 100)[:4])),
(ind[i] + 0.2, vals[i]),
va="bottom", ha="center", fontsize=ann_f)
except IndexError:
pass
ax.xaxis.set_tick_params(width=0)
ax.set_xticks(ind, minor=False)
ax.yaxis.set_tick_params(width=2, length=12)
ax.set_xlabel("Principal Component")
ax.set_ylabel("Variance Explained (%)")
plt.title('Explained Variance Per Principal Component')
# Re-apply PCA to the data while selecting for number of components to retain.
pca_25 = PCA(n_components=25)
X_pca_25 = pca_25.fit(X_scale)
scree_plot(pca_25, c='slategray', ann_f=8, figsize=(12,9))
Observations
There seems to be a stabilization of the variance explained and a decrease in the ratio of cumulative variance explained around 12 components.
np.cumsum(pca.explained_variance_ratio_[:12])[-1]
The cumulative explained variance of the first 12 components is 66.21%.
# Fit on X for 12 components
pca = PCA(n_components=12)
X_pca = pca.fit_transform(X_scale)
X_pca.shape
df_pca = pd.DataFrame(X_pca)
df_pca.head()
df_pca.describe()
We applied PCA() dimensionality reduction to the scaled dataset. By observing the combined plot of explained variance and cumulative explained variance per principal component (Fig15), we chose to keep 12 components, as at this number the gain in cumulative explained variance from adding new components starts to drop off.
The cumulative explained variance of the first 12 components is 66.21%.
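As a cross-check (my sketch, not part of the original workflow), scikit-learn's PCA also accepts a float n_components and keeps just enough components to reach that share of explained variance, which should land close to the 12 components chosen above:

pca_var = PCA(n_components=0.66, svd_solver='full')
X_pca_var = pca_var.fit_transform(X_scale)
print(pca_var.n_components_, 'components retained')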
def map_pca_weights_to_feats(pca, df, comp_no):
'''Map pca weights to individual features
    and return two pd.Series, one with the highest
positive weights and one with the lowest negative
weights'''
weights = pd.DataFrame(np.round(pca.components_, 4), columns=df.keys())
component = weights.iloc[comp_no - 1, :]
comp_pos = component[component > 0].sort_values(ascending=False)
comp_neg = component[component < 0].sort_values(ascending=True)
return comp_pos, comp_neg
def create_bar_table(comp_pos, comp_neg):
'''Create and display a conditionally styled pandas dataframe
    for the interpretation of the positive and negative weights
of PCA and centroid distances for KMeans'''
head = """
<table>
"""
row = ""
for serie in [comp_pos[:15],comp_neg[:15]]:
s = serie.copy()
s.name=''
row += "<td>{}</td>".format(s.to_frame().style.bar(
align='mid',
color=['#d65f5f', '#5fba7d'],
width=100).render()
)
row += '</tr>'
head += row
head+= """
</table>"""
display(HTML(head))
# the weights df
weights = pd.DataFrame(np.round(pca.components_, 4), columns = df_drop.keys())
weights.head()
pd.Series(X_pca[:, 1]).describe()
comp_pos1, comp_neg1 = map_pca_weights_to_feats(pca, df_drop, 1)
create_bar_table(comp_pos1, comp_neg1)
The first component seems to capture the population density and financial affluence of the person.
PLZ8_ANTG3, PLZ8_ANTG4, ENG_WEALTH, HH_EINKOMMEN_SCORE, ORTSGR_KLS9 and EWDICHTE have high positive weights (people that live in more densely populated areas and with lower financial affluence).
The assumption is supported by the negative weights of MOBI_REGIO, KBA05_ANTG1, PLZ8_ANTG1 and FINANZ_MINIMALIST (people that live in less densely populated areas and with higher financial affluence).
comp_pos2, comp_neg2 = map_pca_weights_to_feats(pca, df_drop, 2)
create_bar_table(comp_pos2, comp_neg2)
The second component seems to capture the age, generation and culture (financial, buying and other) of the person.
RETOURTYP_BK_S, SEMIO_ERL, SEMIO_LUST, ALTERSKATEGORIE_GROB and FINANZ_VORSORGER have high positive weights (older people with lower financial preparation and home ownership, less sensual and less event-oriented).
The assumption is supported by the negative weights of SEMIO_REL, ENG_DECADE, FINANZ_SPARER and SEMIO_TRADV (younger generation, with lower religious affinity, lower money saving, less inconspicuous and less traditional-minded).
comp_pos3, comp_neg3 = map_pca_weights_to_feats(pca, df_drop, 3)
create_bar_table(comp_pos3, comp_neg3)
The third component seems to capture the gender of the person and the corresponding personality types (dreamfulness, dominance, rationality, social and family attitude).
SEMIO_VERT, SEMIO_SOZ, SEMIO_FAM, SEMIO_KULT, FINANZ_MINIMALIST and RETOURTYP_BK_S have high positive weights, which, in conjunction with the very strong negative weight of ANREDE_KZ, can refer to men that are less dreamful, less socially-minded, less cultural-minded, not financial minimalists and conservative return-shopper types.
The assumption is supported by the negative weights of ANREDE_KZ, SEMIO_KAEM, SEMIO_DOM, SEMIO_KRIT, SEMIO_RAT and FINANZ_ANLEGER (women that are less dominant-minded, less critical-minded, less rational, have a less combative attitude and are less prone to investing).
1. The first component seems to capture the population density and financial status of the person - Fig16.
PLZ8_ANTG3, PLZ8_ANTG4, ENG_WEALTH, HH_EINKOMMEN_SCORE, ORTSGR_KLS9 and EWDICHTE have high positive weights, correlated with people that live in more densely populated areas and with lower financial affluence.
The assumption is supported by the negative weights of MOBI_REGIO, KBA05_ANTG1, PLZ8_ANTG1 and FINANZ_MINIMALIST, correlated with people that live in less densely populated areas and with higher financial affluence.
2. The second component seems to capture the age, generation and culture (financial, buying and other) of the person - Fig17.
RETOURTYP_BK_S, SEMIO_ERL, SEMIO_LUST, ALTERSKATEGORIE_GROB and FINANZ_VORSORGER have high positive weights, correlated with older people with lower financial preparation and home ownership that are less sensual and less event-oriented.
The assumption is supported by the negative weights of SEMIO_REL, ENG_DECADE, FINANZ_SPARER and SEMIO_TRADV, correlated with a younger generation, with lower religious affinity, lower money saving, less inconspicuous and less traditional-minded.
3. The third component seems to capture the gender of the person and the corresponding personality types (dreamfulness, dominance, rationality, social and family attitude) - Fig18.
SEMIO_VERT, SEMIO_SOZ, SEMIO_FAM, SEMIO_KULT, FINANZ_MINIMALIST and RETOURTYP_BK_S have high positive weights which, in conjunction with the very strong negative weight of ANREDE_KZ, can refer to men that are less dreamful, less socially-minded, less cultural-minded, not financial minimalists and conservative return-shopper types.
The assumption is supported by the negative weights of ANREDE_KZ, SEMIO_KAEM, SEMIO_DOM, SEMIO_KRIT, SEMIO_RAT and FINANZ_ANLEGER, correlated with women that are less dominant-minded, less critical-minded, less rational, have a less combative attitude and are less prone to investing.
#THIS TAKES A LONG TIME - DON'T RUN EVERY TIME!
# compute the average within-cluster distances.
# Over a number of different cluster counts...
scores = {}
for k in range(2, 31):
# run k-means clustering on the data and...
scores[k] = np.abs(KMeans(n_clusters=k, n_jobs=-1).fit(X_pca).score(X_pca))
    # print('Calculated: ', k)
# Plot relationship plot
fig, ax = plt.subplots(figsize=(16,10))
ax = pd.Series(scores).plot(marker='o', color='slategray')
ax.set_xticks(np.arange(2, 31), minor=False);
ax.set_xlabel("# of Clusters")
ax.set_ylabel("Total Distance to centroid");
# Chosen number of clusters
K=8
# Re-fit the k-means model with the selected number of clusters and obtain
# cluster predictions for the general population demographics data.
kmeans = KMeans(n_clusters=K, n_jobs=-1, random_state=0).fit(X_pca)
kmeans.labels_.shape
kmeans.cluster_centers_.shape
The curve in the distance vs. number of clusters plot (Fig19) is quite even. The most obvious elbow, where adding additional clusters becomes less effective in reducing the within-cluster distance, can be inferred to be at 8, and this number has been used as the number of clusters to fit the KMeans algorithm on the transformed demographics data and predict cluster labels.
Another elbow can be observed around 12 clusters, but k = 8 was preferred in this iteration in order to keep the number of clusters lower and see if meaningful separation can be observed with this number.
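As an additional check on the choice of k (a sketch, not part of the original analysis; see [4] for clustering-evaluation options), the silhouette score can be compared for k = 8 and k = 12 on a subsample, since computing it on all ~800k rows is expensive:

from sklearn.metrics import silhouette_score

np.random.seed(0)
sample_idx = np.random.choice(X_pca.shape[0], size=10000, replace=False)
X_sample = X_pca[sample_idx]
for k in (8, 12):
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(X_sample)
    print(k, 'clusters, silhouette score:', round(silhouette_score(X_sample, labels), 3))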
# Load in the customer demographics data.
customers = pd.read_csv('Customers.csv', delimiter=';')
customers.head()
df_cust, df_cust_highna = clean_data(customers)
df_cust.shape
df_cust_highna.shape
# Impute
X_cust_imp = imp.transform(df_cust)
# Scale
X_cust_sc = scaler.transform(X_cust_imp)
X_cust_pca = pca.transform(X_cust_sc)
X_cust_pca.shape
df_cust_pca = pd.DataFrame(X_cust_pca, columns=np.arange(1, 13))
df_cust_pca.describe()
X_cust_kmeans = kmeans.predict(X_cust_pca)
np.unique(X_cust_kmeans)
X_cust_kmeans.shape
Compare the proportion of data in each cluster for the customer data to the proportion of data in each cluster for the general population.
# General population dataset kmean labels
gen_clust_labels = kmeans.labels_
# Add the high NaN dataset as a cluster with label -1
gen_clust_labels_na = np.append(gen_clust_labels, [-1] * df_highna.shape[0])
# Make proportion table
gen_clust_freq = itemfreq(gen_clust_labels_na)
gen_proportion_col = gen_clust_freq[:, 1] / gen_clust_freq[:, 1].sum()
gen_clust_freq = np.c_[gen_clust_freq, gen_proportion_col]
# customer dataset
cust_clust_labels = X_cust_kmeans
# Add the high NaN dataset as a cluster with label -1
cust_clust_labels_na = np.append(cust_clust_labels, [-1] * df_cust_highna.shape[0])
# Make proportion table
cust_clust_freq = itemfreq(cust_clust_labels_na)
cust_proportion_col = cust_clust_freq[:, 1] / cust_clust_freq[:, 1].sum()
cust_clust_freq = np.c_[cust_clust_freq, cust_proportion_col]
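# Aside (not in the original notebook): scipy.stats.itemfreq is deprecated and
# removed in newer SciPy releases; np.unique with return_counts=True builds the
# same frequency table if these cells need to run on a recent environment.
labels_alt, counts_alt = np.unique(gen_clust_labels_na, return_counts=True)
gen_clust_freq_alt = np.c_[labels_alt, counts_alt, counts_alt / counts_alt.sum()]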
# Plot
fig, axes = plt.subplots(1,2, figsize=(16, 10), sharey=True)
sns.barplot(x=gen_clust_freq[:, 0], y=gen_clust_freq[:, 2], color='b', ax=axes[0]);
sns.barplot(x=cust_clust_freq[:, 0], y=cust_clust_freq[:, 2], color='r', ax=axes[1]);
axes[0].set_title('General Population')
axes[1].set_title('Customer Data');
# create plot
fig, ax = plt.subplots(figsize=(16, 10))
index = np.append([-1], np.arange(K))
bar_width = 0.35
opacity = 0.8
gen = plt.bar(index, gen_clust_freq[:, 2], bar_width,
alpha=opacity,
color='b',
label='General Population')
cust = plt.bar(index + bar_width, cust_clust_freq[:, 2], bar_width,
alpha=opacity,
color='g',
label='Customer Data')
plt.xlabel('# of Cluster')
plt.ylabel('Proportion of cluster')
plt.title('Proportions of Clusters for General and Customer Data')
plt.xticks(index + bar_width, index)
plt.legend()
plt.tight_layout()
sns.set(style="whitegrid")
sns.despine();
# Calculate difference of proportions
prop_diff = cust_clust_freq[:, 2] - gen_clust_freq[:, 2]
positive= prop_diff > 0
# create plot
fig, ax = plt.subplots(figsize=(14, 8))
index = np.append([-1], np.arange(K))
bar_width = 0.35
opacity = 0.8
diff = plt.bar(index, prop_diff, bar_width,
alpha=opacity,
color=np.vectorize({True: 'k', False: 'r'}.get)(positive)
)
plt.xlabel('# of Cluster')
plt.ylabel('Difference of proportions')
plt.title('Difference of Proportions of Clusters for General and Customer Data')
plt.xticks(index)
plt.tight_layout()
sns.set(style="whitegrid")
sns.despine();
Observations
# Add cluster labels
df_cust_pca['cluster'] = X_cust_kmeans
g = sns.pairplot(df_cust_pca,
vars = np.arange(1, 13),
diag_kind = 'kde',
hue="cluster",
palette="Paired");
Observations
Let's have a closer look at the first three principal components.
# Rename components for better reading
df_cust_pca = df_cust_pca.rename(columns = {1: 'pop_density-finance_status',
2: 'age-generation',
3: 'gender'}
)
g = sns.pairplot(df_cust_pca,
vars = ['pop_density-finance_status', 'age-generation', 'gender'],
diag_kind = 'kde',
hue="cluster",
palette="Paired");
Observations
def map_kmeans_weights_to_feats(kmeans, df, clust_no):
    '''Map a KMeans cluster centroid to the individual features
    (here: the principal components) and return two pd.Series, one with
    the highest positive values and one with the lowest negative values.'''
weights = pd.DataFrame(np.round(kmeans.cluster_centers_, 4), columns=df.keys())
centroid = weights.iloc[clust_no, :]
cent_pos = centroid[centroid > 0].sort_values(ascending=False)
cent_neg = centroid[centroid < 0].sort_values(ascending=True)
return cent_pos, cent_neg
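The centroids above live in PCA space; to read one in terms of the original features, the PCA and the scaling can be inverted (a sketch of mine, using cluster 3, the cluster examined as over-represented below):

centroid_orig = scaler.inverse_transform(
    pca.inverse_transform(kmeans.cluster_centers_[3].reshape(1, -1)))
centroid_series = pd.Series(centroid_orig.ravel(), index=df.columns)
centroid_series.sort_values(ascending=False)[:10]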
Fig20, Fig21 and Fig22 observations:
Fig22 observations:
# Overrepresented
over_pos, over_neg = map_kmeans_weights_to_feats(kmeans, df_cust_pca.iloc[:, :-1], 3)
create_bar_table(over_pos, over_neg)
The negative value for pop_density-finance_status and the strongly positive value for gender confirm the pairplot observations and strengthen the assumption that the overrepresented group are: men that are less dreamful, less socially-minded, less cultural-minded, not financial minimalists, conservative return-shopper types, that have a higher financial status and live in less densely populated areas. The negative weight for the age-generation principal component, in conjunction with the attitude characteristics of these men, suggests that they could be of relatively younger age.

# Underrepresented
under_pos, under_neg = map_kmeans_weights_to_feats(kmeans, df_cust_pca.iloc[:, :-1], 6)
create_bar_table(under_pos, under_neg)
The positive value for pop_density-finance_status and the strongly negative value for gender, along with the negative weight for the age-generation principal component, confirm the pairplot observations and strengthen the assumption that the underrepresented group are: younger women that are less dominant-minded, less critical-minded, less rational, have a less combative attitude and are less prone to investing, and live in densely populated areas with a lower financial status. They have lower religious affinity, a lower money-saving culture and are less traditional-minded.

[1]: https://stackoverflow.com/questions/20250771/remap-values-in-pandas-column-with-a-dict
[2]: https://stackoverflow.com/questions/16992713/translate-every-element-in-numpy-array-
[3]: https://stackoverflow.com/questions/36653443/retaining-nan-values-after-get-dummies-in-pandas
[4]: http://scikit-learn.org/stable/modules/clustering.html#clustering-evaluation
[5]: http://greenteapress.com/thinkstats2/html/thinkstats2010.html#sec94
[6]: https://github.com/jadianes/data-science-your-way/tree/master/03-dimensionality-reduction-and-clustering
[7]: https://www.mailman.columbia.edu/research/population-health-methods/cluster-analysis-using-k-means