Table of Contents

Customer Segmentation - Introduction

In this project, I apply unsupervised learning techniques to identify segments of the population that form the core customer base for a mail-order sales company in Germany. These segments can then be used to direct marketing campaigns towards audiences that will have the highest expected rate of returns.

Imports

In [1]:
# import libraries here; add more as necessary
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.core.display import HTML


from scipy.stats import itemfreq, chisquare

from sklearn.preprocessing import Imputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans 

# Set base plotting style
plt.style.use('seaborn-ticks')
# # Set base plotting size
plt.rcParams['figure.figsize'] = 12, 9
# Magic word for producing visualizations in notebook
%matplotlib inline
# Increase figure resolution for high dpi screens
%config InlineBackend.figure_format = 'retina'
# Autoreload modules
%load_ext autoreload
%autoreload 2

Data Loading

There are four files associated with this project:

  • Demographics.csv: Demographics data for the general population of Germany; 891211 persons (rows) x 85 features (columns).
  • Customers.csv: Demographics data for customers of a mail-order company; 191652 persons (rows) x 85 features (columns).
  • Data_Dictionary.md: Detailed information file about the features in the provided datasets.
  • Feature_Summary.csv: Summary of feature attributes for demographics data; 85 features (rows) x 4 columns

Each row of the demographics files represents a single person, but also includes information outside of individuals, including information about their household, building, and neighborhood. I will use this information to cluster the general population into groups with similar demographic properties. Then, I will see how the people in the customers dataset fit into those created clusters. The hope here is that certain clusters are over-represented in the customers data, as compared to the general population; those over-represented clusters will be assumed to be part of the core userbase. This information can then be used for further applications, such as targeting for a marketing campaign.

In [3]:
# Load in the general demographics data.
azdias = pd.read_csv('Demographics.csv', delimiter=';')

# Load in the feature summary file.
feat_info = pd.read_csv('Feature_Summary.csv', delimiter=';')
In [4]:
feat_info.head()
Out[4]:
attribute information_level type missing_or_unknown
0 AGER_TYP person categorical [-1,0]
1 ALTERSKATEGORIE_GROB person ordinal [-1,0,9]
2 ANREDE_KZ person categorical [-1,0]
3 CJT_GESAMTTYP person categorical [0]
4 FINANZ_MINIMALIST person ordinal [-1]
In [5]:
azdias.head()
Out[5]:
AGER_TYP ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER ... PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB
0 -1 2 1 2.0 3 4 3 5 5 3 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 -1 1 2 5.0 1 5 2 5 4 5 ... 2.0 3.0 2.0 1.0 1.0 5.0 4.0 3.0 5.0 4.0
2 -1 3 2 3.0 1 4 1 2 3 5 ... 3.0 3.0 1.0 0.0 1.0 4.0 4.0 3.0 5.0 2.0
3 2 4 2 2.0 4 2 5 2 1 2 ... 2.0 2.0 2.0 0.0 1.0 3.0 4.0 2.0 3.0 3.0
4 -1 3 1 5.0 4 3 4 1 3 2 ... 2.0 4.0 2.0 1.0 2.0 3.0 3.0 4.0 6.0 5.0

5 rows × 85 columns

Preprocessing

Assess Missing Data

Identify missing or unknown data values and convert them to NaNs.
In [6]:
df = azdias.copy()
df.head()
Out[6]:
AGER_TYP ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER ... PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB
0 -1 2 1 2.0 3 4 3 5 5 3 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 -1 1 2 5.0 1 5 2 5 4 5 ... 2.0 3.0 2.0 1.0 1.0 5.0 4.0 3.0 5.0 4.0
2 -1 3 2 3.0 1 4 1 2 3 5 ... 3.0 3.0 1.0 0.0 1.0 4.0 4.0 3.0 5.0 2.0
3 2 4 2 2.0 4 2 5 2 1 2 ... 2.0 2.0 2.0 0.0 1.0 3.0 4.0 2.0 3.0 3.0
4 -1 3 1 5.0 4 3 4 1 3 2 ... 2.0 4.0 2.0 1.0 2.0 3.0 3.0 4.0 6.0 5.0

5 rows × 85 columns

In [7]:
df.describe()
Out[7]:
AGER_TYP ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER ... PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB
count 891221.000000 891221.000000 891221.000000 886367.000000 891221.000000 891221.000000 891221.000000 891221.000000 891221.000000 891221.000000 ... 774706.000000 774706.000000 774706.000000 774706.000000 774706.000000 774706.000000 774706.000000 794005.000000 794005.000000 794005.00000
mean -0.358435 2.777398 1.522098 3.632838 3.074528 2.821039 3.401106 3.033328 2.874167 3.075121 ... 2.253330 2.801858 1.595426 0.699166 1.943913 3.612821 3.381087 3.167854 5.293002 3.07222
std 1.198724 1.068775 0.499512 1.595021 1.321055 1.464749 1.322134 1.529603 1.486731 1.353248 ... 0.972008 0.920309 0.986736 0.727137 1.459654 0.973967 1.111598 1.002376 2.303739 1.36298
min -1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000 0.000000 1.00000
25% -1.000000 2.000000 1.000000 2.000000 2.000000 1.000000 3.000000 2.000000 2.000000 2.000000 ... 1.000000 2.000000 1.000000 0.000000 1.000000 3.000000 3.000000 3.000000 4.000000 2.00000
50% -1.000000 3.000000 2.000000 4.000000 3.000000 3.000000 3.000000 3.000000 3.000000 3.000000 ... 2.000000 3.000000 2.000000 1.000000 1.000000 4.000000 3.000000 3.000000 5.000000 3.00000
75% -1.000000 4.000000 2.000000 5.000000 4.000000 4.000000 5.000000 5.000000 4.000000 4.000000 ... 3.000000 3.000000 2.000000 1.000000 3.000000 4.000000 4.000000 4.000000 7.000000 4.00000
max 3.000000 9.000000 2.000000 6.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 ... 4.000000 4.000000 3.000000 2.000000 5.000000 5.000000 5.000000 9.000000 9.000000 9.00000

8 rows × 81 columns

In [8]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891221 entries, 0 to 891220
Data columns (total 85 columns):
AGER_TYP                 891221 non-null int64
ALTERSKATEGORIE_GROB     891221 non-null int64
ANREDE_KZ                891221 non-null int64
CJT_GESAMTTYP            886367 non-null float64
FINANZ_MINIMALIST        891221 non-null int64
FINANZ_SPARER            891221 non-null int64
FINANZ_VORSORGER         891221 non-null int64
FINANZ_ANLEGER           891221 non-null int64
FINANZ_UNAUFFAELLIGER    891221 non-null int64
FINANZ_HAUSBAUER         891221 non-null int64
FINANZTYP                891221 non-null int64
GEBURTSJAHR              891221 non-null int64
GFK_URLAUBERTYP          886367 non-null float64
GREEN_AVANTGARDE         891221 non-null int64
HEALTH_TYP               891221 non-null int64
LP_LEBENSPHASE_FEIN      886367 non-null float64
LP_LEBENSPHASE_GROB      886367 non-null float64
LP_FAMILIE_FEIN          886367 non-null float64
LP_FAMILIE_GROB          886367 non-null float64
LP_STATUS_FEIN           886367 non-null float64
LP_STATUS_GROB           886367 non-null float64
NATIONALITAET_KZ         891221 non-null int64
PRAEGENDE_JUGENDJAHRE    891221 non-null int64
RETOURTYP_BK_S           886367 non-null float64
SEMIO_SOZ                891221 non-null int64
SEMIO_FAM                891221 non-null int64
SEMIO_REL                891221 non-null int64
SEMIO_MAT                891221 non-null int64
SEMIO_VERT               891221 non-null int64
SEMIO_LUST               891221 non-null int64
SEMIO_ERL                891221 non-null int64
SEMIO_KULT               891221 non-null int64
SEMIO_RAT                891221 non-null int64
SEMIO_KRIT               891221 non-null int64
SEMIO_DOM                891221 non-null int64
SEMIO_KAEM               891221 non-null int64
SEMIO_PFLICHT            891221 non-null int64
SEMIO_TRADV              891221 non-null int64
SHOPPER_TYP              891221 non-null int64
SOHO_KZ                  817722 non-null float64
TITEL_KZ                 817722 non-null float64
VERS_TYP                 891221 non-null int64
ZABEOTYP                 891221 non-null int64
ALTER_HH                 817722 non-null float64
ANZ_PERSONEN             817722 non-null float64
ANZ_TITEL                817722 non-null float64
HH_EINKOMMEN_SCORE       872873 non-null float64
KK_KUNDENTYP             306609 non-null float64
W_KEIT_KIND_HH           783619 non-null float64
WOHNDAUER_2008           817722 non-null float64
ANZ_HAUSHALTE_AKTIV      798073 non-null float64
ANZ_HH_TITEL             794213 non-null float64
GEBAEUDETYP              798073 non-null float64
KONSUMNAEHE              817252 non-null float64
MIN_GEBAEUDEJAHR         798073 non-null float64
OST_WEST_KZ              798073 non-null object
WOHNLAGE                 798073 non-null float64
CAMEO_DEUG_2015          792242 non-null object
CAMEO_DEU_2015           792242 non-null object
CAMEO_INTL_2015          792242 non-null object
KBA05_ANTG1              757897 non-null float64
KBA05_ANTG2              757897 non-null float64
KBA05_ANTG3              757897 non-null float64
KBA05_ANTG4              757897 non-null float64
KBA05_BAUMAX             757897 non-null float64
KBA05_GBZ                757897 non-null float64
BALLRAUM                 797481 non-null float64
EWDICHTE                 797481 non-null float64
INNENSTADT               797481 non-null float64
GEBAEUDETYP_RASTER       798066 non-null float64
KKK                      770025 non-null float64
MOBI_REGIO               757897 non-null float64
ONLINE_AFFINITAET        886367 non-null float64
REGIOTYP                 770025 non-null float64
KBA13_ANZAHL_PKW         785421 non-null float64
PLZ8_ANTG1               774706 non-null float64
PLZ8_ANTG2               774706 non-null float64
PLZ8_ANTG3               774706 non-null float64
PLZ8_ANTG4               774706 non-null float64
PLZ8_BAUMAX              774706 non-null float64
PLZ8_HHZ                 774706 non-null float64
PLZ8_GBZ                 774706 non-null float64
ARBEIT                   794005 non-null float64
ORTSGR_KLS9              794005 non-null float64
RELAT_AB                 794005 non-null float64
dtypes: float64(49), int64(32), object(4)
memory usage: 578.0+ MB
In [9]:
# Percentage of nans per row before precessing
nan_perc = (df.isnull().mean() * 100).sort_values(ascending=False)
nan_perc[:20]
Out[9]:
KK_KUNDENTYP        65.596749
KBA05_ANTG1         14.959701
KBA05_ANTG2         14.959701
KBA05_ANTG3         14.959701
KBA05_ANTG4         14.959701
KBA05_BAUMAX        14.959701
KBA05_GBZ           14.959701
MOBI_REGIO          14.959701
REGIOTYP            13.598872
KKK                 13.598872
PLZ8_ANTG3          13.073637
PLZ8_ANTG1          13.073637
PLZ8_GBZ            13.073637
PLZ8_ANTG2          13.073637
PLZ8_HHZ            13.073637
PLZ8_BAUMAX         13.073637
PLZ8_ANTG4          13.073637
W_KEIT_KIND_HH      12.073549
KBA13_ANZAHL_PKW    11.871354
CAMEO_INTL_2015     11.106000
dtype: float64

KK_KUNDENTYP has the highest percentage of missing values (65%) with the next KBA05_ANTG1 at 14.9% at a different scale

In [10]:
#How many rows don't have nans?
nan_perc[nan_perc == 0].shape[0]
Out[10]:
32
In [11]:
# as a percentage
nan_perc[nan_perc > 0].shape[0]/ df.shape[0]  * 100
Out[11]:
0.005946897570860651
In [12]:
# How many nans before processing?
print('Total percentage of Nans before processing: ',
      round(df.isnull().sum().sum() / np.product(df.shape) * 100, 2),
      '%')
Total percentage of Nans before processing:  6.46 %

Convert Missing Value Codes to NaNs

The fourth column of the feature attributes summary (loaded in above as feat_info) documents the codes from the data dictionary that indicate missing or unknown data. While the file encodes this as a list (e.g. [-1,0]), this will get read in as a string object.

In [13]:
# find out the different kind of encodings
feat_info.missing_or_unknown.unique()
Out[13]:
array(['[-1,0]', '[-1,0,9]', '[0]', '[-1]', '[]', '[-1,9]', '[-1,X]',
       '[XX]', '[-1,XX]'], dtype=object)
In [14]:
# check for the XX type
feat_info[feat_info.missing_or_unknown == '[XX]']
Out[14]:
attribute information_level type missing_or_unknown
58 CAMEO_DEU_2015 microcell_rr4 categorical [XX]
In [15]:
def convert_to_list_for_X_or_XX(s):
    '''Takes the column with the missing value
    encodings and converts to list, specially processing
    the casses where X or  XX is in the encodings

    TODO : Generalize for any new string'''
    if 'XX' in s:
        if len(s) == 4:
            return [s[1:-1]]
        else:
            return [-1, 'XX']
    elif 'X' in s:
        return [-1, 'X']
    else:
        return eval(s)
In [16]:
# Process the missing values encodings
feat_info.missing_or_unknown = feat_info.missing_or_unknown.apply(convert_to_list_for_X_or_XX)
In [17]:
# Create a dictionary to map column names to missing data encodings
missing_data_encode = dict(zip(feat_info.attribute, feat_info.missing_or_unknown))
In [18]:
# test
missing_data_encode['AGER_TYP']
Out[18]:
[-1, 0]
In [19]:
# test
missing_data_encode['CAMEO_DEU_2015']
Out[19]:
['XX']
In [20]:
# Convert to nans using the missing_data_encode dictionary
for col in missing_data_encode.keys():
    nans_before = df[col].isnull().sum()
    df.loc[df[col].isin(missing_data_encode[col]), col] = np.nan
    nans_after = df[col].isnull().sum()
    if nans_after != nans_before:
        print(col, 'nans before: ', nans_before)
        print(col, 'nans after: ', nans_after, '\n')
AGER_TYP nans before:  0
AGER_TYP nans after:  685843 

ALTERSKATEGORIE_GROB nans before:  0
ALTERSKATEGORIE_GROB nans after:  2881 

GEBURTSJAHR nans before:  0
GEBURTSJAHR nans after:  392318 

HEALTH_TYP nans before:  0
HEALTH_TYP nans after:  111196 

LP_LEBENSPHASE_FEIN nans before:  4854
LP_LEBENSPHASE_FEIN nans after:  97632 

LP_LEBENSPHASE_GROB nans before:  4854
LP_LEBENSPHASE_GROB nans after:  94572 

LP_FAMILIE_FEIN nans before:  4854
LP_FAMILIE_FEIN nans after:  77792 

LP_FAMILIE_GROB nans before:  4854
LP_FAMILIE_GROB nans after:  77792 

NATIONALITAET_KZ nans before:  0
NATIONALITAET_KZ nans after:  108315 

PRAEGENDE_JUGENDJAHRE nans before:  0
PRAEGENDE_JUGENDJAHRE nans after:  108164 

SHOPPER_TYP nans before:  0
SHOPPER_TYP nans after:  111196 

TITEL_KZ nans before:  73499
TITEL_KZ nans after:  889061 

VERS_TYP nans before:  0
VERS_TYP nans after:  111196 

ALTER_HH nans before:  73499
ALTER_HH nans after:  310267 

W_KEIT_KIND_HH nans before:  107602
W_KEIT_KIND_HH nans after:  147988 

ANZ_HAUSHALTE_AKTIV nans before:  93148
ANZ_HAUSHALTE_AKTIV nans after:  99611 

CAMEO_DEUG_2015 nans before:  98979
CAMEO_DEUG_2015 nans after:  99352 

CAMEO_DEU_2015 nans before:  98979
CAMEO_DEU_2015 nans after:  99352 

CAMEO_INTL_2015 nans before:  98979
CAMEO_INTL_2015 nans after:  99352 

KBA05_BAUMAX nans before:  133324
KBA05_BAUMAX nans after:  476524 

KKK nans before:  121196
KKK nans after:  158064 

REGIOTYP nans before:  121196
REGIOTYP nans after:  158064 

ARBEIT nans before:  97216
ARBEIT nans after:  97375 

ORTSGR_KLS9 nans before:  97216
ORTSGR_KLS9 nans after:  97274 

RELAT_AB nans before:  97216
RELAT_AB nans after:  97375 

Assess Missing Data in Each Column

  • How much missing data is present in each column?
  • For the remaining features, are there any patterns in which columns have, or share, missing data?
In [21]:
# How many nans after processing?
print('Total percentage of Nans after processing: ',
      round(df.isnull().sum().sum() / np.product(df.shape) * 100, 2),
      '%')
Total percentage of Nans after processing:  11.05 %
In [22]:
# A Series with the proportion of nans per row after precessing
nan_perc = (df.isnull().mean()).sort_values(ascending=False)
nan_perc[:20]
Out[22]:
TITEL_KZ          0.997576
AGER_TYP          0.769554
KK_KUNDENTYP      0.655967
KBA05_BAUMAX      0.534687
GEBURTSJAHR       0.440203
ALTER_HH          0.348137
REGIOTYP          0.177357
KKK               0.177357
W_KEIT_KIND_HH    0.166051
KBA05_ANTG4       0.149597
KBA05_GBZ         0.149597
MOBI_REGIO        0.149597
KBA05_ANTG1       0.149597
KBA05_ANTG2       0.149597
KBA05_ANTG3       0.149597
PLZ8_GBZ          0.130736
PLZ8_HHZ          0.130736
PLZ8_BAUMAX       0.130736
PLZ8_ANTG4        0.130736
PLZ8_ANTG1        0.130736
dtype: float64
In [23]:
#How many coluns don't have nans?
nan_perc[nan_perc == 0].shape[0]
Out[23]:
24
In [24]:
# as a percentage
nan_perc[nan_perc == 0].shape[0]/ df.shape[1]  * 100
Out[24]:
28.235294117647058
In [25]:
def hist_box_plot(x, x_label, y_label, bin_incr):
    '''Take an array as input and draw a histogram with a boxblot above it'''
    f, (ax_box, ax_hist) = plt.subplots(2,
                                        sharex=True,
                                        gridspec_kw={
                                            "height_ratios": (.15, .85)},
                                        figsize=(14, 6))

    sns.boxplot(x, ax=ax_box)
    bins = np.arange(0, x.max() + bin_incr, bin_incr)
    x.hist(grid=False, bins=bins)
    ax_box.set(yticks=[])
    ax_hist.set_ylabel(y_label)
    ax_hist.set_xlabel(x_label)
    sns.despine(ax=ax_hist)
    sns.despine(ax=ax_box, left=True)
In [26]:
hist_box_plot(nan_perc, x_label='Proportion of missing values', y_label='% of features', bin_incr=0.05);

The majority of features have less than 20% missing values. There are six features with a considerably higher number of missing values.

In [27]:
# Which are the columns with the higher NaN values?
high_nan = nan_perc[nan_perc>0.2]
high_nan
Out[27]:
TITEL_KZ        0.997576
AGER_TYP        0.769554
KK_KUNDENTYP    0.655967
KBA05_BAUMAX    0.534687
GEBURTSJAHR     0.440203
ALTER_HH        0.348137
dtype: float64

Investigate patterns in the amount of missing data in each column.

In [28]:
# Merge feat_info with nan_perc 
nan_perc_df = pd.DataFrame(nan_perc, columns=['perc_missing'])
feat_nan = (feat_info
            .merge(nan_perc_df, left_on='attribute', right_index=True)
            .drop(columns='missing_or_unknown', axis=1)
           )

# Split the data into three categories:
# High percentage of nans (>20%)
# Moderate percentage of nans (5-20%)
# Low percentage of nans (0-5%)
# No nans
feat_nan['nan_cat'] = pd.cut(feat_nan.perc_missing,
                             bins=[-np.inf, 0, 0.1, 0.2, np.inf],
                             labels=['zero', 'low', 'moderate', 'high'],
                             )
feat_nan.head()
Out[28]:
attribute information_level type perc_missing nan_cat
0 AGER_TYP person categorical 0.769554 high
1 ALTERSKATEGORIE_GROB person ordinal 0.003233 low
2 ANREDE_KZ person categorical 0.000000 zero
3 CJT_GESAMTTYP person categorical 0.005446 low
4 FINANZ_MINIMALIST person ordinal 0.000000 zero
In [29]:
fig, ax1 = plt.subplots(figsize=(14,8))
order = ['person', 'household', 'building', 'postcode', 'community',
       'microcell_rr3', 'microcell_rr4', 'region_rr1', 'macrocell_plz8']
hue_order = ['zero', 'low', 'moderate', 'high']
sns.countplot(data=feat_nan,
              x='information_level',
              hue='nan_cat',
              order=order,
              hue_order=hue_order,
              ax=ax1)
plt.legend(loc='upper right')
sns.despine(ax=ax1);

Observations

  • Only the person informational level has features with zero NaNs.
  • Low NaN features (>0% - 10%) are distributed among person, household and region information levels.
  • Moderate NaN features (10% - 20%) is the only category distributed among all information levels.
  • High NaN features (>20%) are distributed among person, household and microcell_rr3 information levels.
In [30]:
fig, ax1 = plt.subplots(figsize=(14,8))
order = ['person', 'household', 'building', 'postcode', 'community',
       'microcell_rr3', 'microcell_rr4', 'region_rr1', 'macrocell_plz8']
hue_order = ['zero', 'low', 'moderate', 'high']
sns.swarmplot(data=feat_nan,
              x='information_level',
              y='perc_missing',
              hue='nan_cat',
              order=order,
              hue_order=hue_order,
              ax=ax1)
plt.legend(loc='upper right')
sns.despine(ax=ax1);

Observations

  • The moderate - NaN features (10% - 20%) have all the same values for each from the information levels: postcode, community, microcell_rr3 and 4 and macrocell_plz8. This suggests that they exist some methodical bias in creating the NaNs for each information level.
  • The household information level has the most evenly distributed Nan categories.
  • All the information levels, except person and household consist mainly of moderate NaN categories.
  • The high NaN categories are clear outliers.
In [31]:
# Look at the similar values of microcell_rr3
feat_nan[(feat_nan.information_level == 'microcell_rr3') & (feat_nan.nan_cat == 'moderate')]
Out[31]:
attribute information_level type perc_missing nan_cat
60 KBA05_ANTG1 microcell_rr3 ordinal 0.149597 moderate
61 KBA05_ANTG2 microcell_rr3 ordinal 0.149597 moderate
62 KBA05_ANTG3 microcell_rr3 ordinal 0.149597 moderate
63 KBA05_ANTG4 microcell_rr3 ordinal 0.149597 moderate
65 KBA05_GBZ microcell_rr3 ordinal 0.149597 moderate

Confirms the observation that features of the same attribute type (KBA05 classifies according to the number of family houses and buildings in the microcell) have the exact same percentage of missing values.

Remove the outlier columns from the dataset.

The outlying columns with a high percentage of NaNs will be removed.

In [32]:
cols_to_remove = feat_nan[feat_nan.nan_cat == 'high'].attribute             
cols_to_remove
Out[32]:
0         AGER_TYP
11     GEBURTSJAHR
40        TITEL_KZ
43        ALTER_HH
47    KK_KUNDENTYP
64    KBA05_BAUMAX
Name: attribute, dtype: object
In [33]:
df = df.drop(columns=cols_to_remove, axis=1)
df.shape
Out[33]:
(891221, 79)

Summary

The total percentage of Nans after the converting missing value encoding to NaNs is 11.05 %

Fig1 observations: The majority of features (79 out of 85) have less than 20% missing values. There are six features with a considerably higher number of missing values. These features are:

  • TITEL_KZ: Academic title flag (1: Dr., 4: Prof. Dr., 5: other)
    • 99.8% missing, making it unusable.
  • AGER_TYP: Best-ager typology (1: passive elderly, 3: experience-driven elderly)
    • 76.9% missing values.This could potentially be an interesting feature for identifying subgroups within older persons groups but the very high level of missing values would make the information unreliable
  • KK_KUNDENTYP: Consumer pattern over past 12 months (1: regular customer, 6: passive customer)
    • 65.6% missing values. This could also potentially be an interesting feature for identifying consumer behavior but the fact that that there are already other features that encode shopping behavior like SHOPPER_TYP and the high level of missing values would suggest that this feature can be dropped without too much loss of information.
  • KBA05_BAUMAX: Most common building type within the microcell (1: mainly 1-2 family homes in the microcell, 4: mainly 10+ family homes in the microcell, 5: mainly business buildings in the microcell)
    • 53.5% missing values. Similar information to the other KBA05_ features and the high level of missing values suggest that this feature can be dropped without too much information loss.
  • GEBURTSJAHR: Year of birth (Numeric)
    • 44.0% missing. The high level of missing values for a numeric variable and the fact that that there exist also other features that encode age like ALTERSKATEGORIE_GROB suggest that this feature can be dropped without too much information loss.
  • ALTER_HH: Birthdate of head of household (1: 1895-01-01 to 1899-12-31, 21: 1995-01-01 to 1999-12-31)
    • 34.8% missing values: This feature could provide valuable information about to the age groupings. We will drop it in this phase due to the high level of missing values would make the information unreliable. It could in a later iteration be imputed and tested if it's inclusion leads to better results.

Fig2 observations:

  • Only the person informational level has features with zero NaNs.
  • Low NaN features (>0% - 10%) are distributed among person, household and region information levels.
  • Moderate NaN features (10% - 20%) is the only category distributed among all information levels.
  • High NaN features (>20%) are distributed among person, household and microcell_rr3 information levels.

Fig3 observations:

  • The moderate - NaN features (10% - 20%) have all the same values for each from the information levels: postcode, community, microcell_rr3 and 4 and macrocell_plz8.
    For example all the KBA05-type features with a moderate level of NaNs have 14.6% of missing data. This suggests that there may exist some methodical bias in creating the NaNs for each information level.
  • The household information level has the most evenly distributed Nan categories.
  • All the information levels, except person and household consist mainly of moderate NaN categories.
  • The high NaN categories are clear outliers.

All six outlier columns with a high percentage of NaNs (>20%) will be removed.

Assess Missing Data in Each Row

How much data is missing in each row of the dataset?
In [34]:
df['row_nan_perc'] = df.isnull().mean(axis=1) 
hist_box_plot(df['row_nan_perc'], x_label='Proportion of missing values', y_label='% of rows', bin_incr=0.01)
  • The majority of the rows is below the 10% threshhold of missing values.
  • The iqr is around the 5 % threshhold.

We will use 10% as the cut of threshhold to split the data into two separate subsets and compare their distributions.

Divide the data into two subsets based on the number of missing values in each row.
In [35]:
threshhold = 0.1

df_lowna = df.loc[df['row_nan_perc'] < threshhold, :]
df_highna = df.loc[df['row_nan_perc'] > threshhold, :]
In [36]:
df_lowna.shape
Out[36]:
(747109, 80)
In [37]:
# percentage of low na rows
df_lowna.shape[0] / df.shape[0]
Out[37]:
0.8382982447675716
In [38]:
df_highna.shape
Out[38]:
(144112, 80)
In [39]:
# percentage of high na rows
df_highna.shape[0] / df.shape[0]
Out[39]:
0.16170175523242833

Compare the distribution of values for at least five columns where there are no or few missing values, between the two subsets.

In [40]:
# find columns with no nans at lowna
df_lowna.dropna(axis=1).shape
Out[40]:
(747109, 40)
In [41]:
# find columns with no nans at highna
df_highna.dropna(axis=1).shape
Out[41]:
(144112, 25)
In [42]:
# get the commmon columns
common_cols = list(set(df_lowna.dropna(axis=1).columns).intersection(df_highna.dropna(axis=1).columns))
common_cols
Out[42]:
['SEMIO_DOM',
 'SEMIO_MAT',
 'SEMIO_KRIT',
 'row_nan_perc',
 'SEMIO_TRADV',
 'SEMIO_PFLICHT',
 'SEMIO_REL',
 'GREEN_AVANTGARDE',
 'FINANZ_ANLEGER',
 'FINANZ_VORSORGER',
 'SEMIO_LUST',
 'SEMIO_RAT',
 'SEMIO_ERL',
 'ZABEOTYP',
 'SEMIO_SOZ',
 'FINANZ_UNAUFFAELLIGER',
 'ANREDE_KZ',
 'SEMIO_KAEM',
 'FINANZTYP',
 'FINANZ_MINIMALIST',
 'FINANZ_HAUSBAUER',
 'SEMIO_FAM',
 'SEMIO_VERT',
 'SEMIO_KULT',
 'FINANZ_SPARER']
In [43]:
# Drop the 'row_nan_perc'
common_cols.remove('row_nan_perc')
In [44]:
# compare distributions
for col in common_cols[:6]:
    fig, axes = plt.subplots(1,2, figsize=(14, 6), sharey=True)
    sns.countplot(df_lowna[col], ax=axes[0], color='b')
    sns.countplot(df_highna[col], ax=axes[1], color='r')

Observations

  • The distributions are discrete and look distinctly different.
In [45]:
# Statistical comparison of distributions
for col in common_cols:
    print(col)
    print(chisquare(df_highna[col].value_counts().sort_index(),
                    df_lowna[col].value_counts().sort_index(),
                    axis=0),
                    '\n')
SEMIO_DOM
Power_divergenceResult(statistic=578605.8437949589, pvalue=0.0) 

SEMIO_MAT
Power_divergenceResult(statistic=551804.8483307012, pvalue=0.0) 

SEMIO_KRIT
Power_divergenceResult(statistic=553268.8941379717, pvalue=0.0) 

SEMIO_TRADV
Power_divergenceResult(statistic=522819.77933806577, pvalue=0.0) 

SEMIO_PFLICHT
Power_divergenceResult(statistic=546673.9839817142, pvalue=0.0) 

SEMIO_REL
Power_divergenceResult(statistic=530242.3705674206, pvalue=0.0) 

GREEN_AVANTGARDE
Power_divergenceResult(statistic=491086.218469656, pvalue=0.0) 

FINANZ_ANLEGER
Power_divergenceResult(statistic=528002.691086477, pvalue=0.0) 

FINANZ_VORSORGER
Power_divergenceResult(statistic=524160.8354827201, pvalue=0.0) 

SEMIO_LUST
Power_divergenceResult(statistic=570363.2314128326, pvalue=0.0) 

SEMIO_RAT
Power_divergenceResult(statistic=514140.4834929664, pvalue=0.0) 

SEMIO_ERL
Power_divergenceResult(statistic=544795.960523225, pvalue=0.0) 

ZABEOTYP
Power_divergenceResult(statistic=501233.1126444783, pvalue=0.0) 

SEMIO_SOZ
Power_divergenceResult(statistic=529546.6526565166, pvalue=0.0) 

FINANZ_UNAUFFAELLIGER
Power_divergenceResult(statistic=560081.7545795058, pvalue=0.0) 

ANREDE_KZ
Power_divergenceResult(statistic=486726.3350846125, pvalue=0.0) 

SEMIO_KAEM
Power_divergenceResult(statistic=532804.5291834823, pvalue=0.0) 

FINANZTYP
Power_divergenceResult(statistic=588409.658322646, pvalue=0.0) 

FINANZ_MINIMALIST
Power_divergenceResult(statistic=506973.43732221884, pvalue=0.0) 

FINANZ_HAUSBAUER
Power_divergenceResult(statistic=513968.48003178235, pvalue=0.0) 

SEMIO_FAM
Power_divergenceResult(statistic=558890.2452963234, pvalue=0.0) 

SEMIO_VERT
Power_divergenceResult(statistic=648640.5005068531, pvalue=0.0) 

SEMIO_KULT
Power_divergenceResult(statistic=526705.8311784347, pvalue=0.0) 

FINANZ_SPARER
Power_divergenceResult(statistic=536958.2495254633, pvalue=0.0) 

The p value is 0 for all the chisquared tests which means that we saw the expected frequencies based on the observed distribution (High NaN features) about 0% of the time which is statistically significant and confirms the assumption that the two distributions are different [5]

Summary

Fig4 observations:

  • The majority of the rows (83.8%) is below the 10% of missing values.

We used 10% as the cut of threshhold to split the data into two separate subsets and compare their distributions taking 5 sample columns that do no have any NaN values.

Fig5 observations:

  • These distributions are discrete and look distinctly different, an assumption that was confirmed by performing a Chisquared test for testing their proportions.

We will continue the analysis with the dataset that contains rows with less than 10% NaNs

Select and Re-Encode Features

In [46]:
# Keep only the rows with NaNs perc  below 10%
df = df_lowna

# Drop the row nan percentage row
df = df[df.columns[:-1]]
df.head()
Out[46]:
ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER FINANZTYP ... PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB
1 1.0 2.0 5.0 1.0 5.0 2.0 5.0 4.0 5.0 1.0 ... 2.0 3.0 2.0 1.0 1.0 5.0 4.0 3.0 5.0 4.0
2 3.0 2.0 3.0 1.0 4.0 1.0 2.0 3.0 5.0 1.0 ... 3.0 3.0 1.0 0.0 1.0 4.0 4.0 3.0 5.0 2.0
3 4.0 2.0 2.0 4.0 2.0 5.0 2.0 1.0 2.0 6.0 ... 2.0 2.0 2.0 0.0 1.0 3.0 4.0 2.0 3.0 3.0
4 3.0 1.0 5.0 4.0 3.0 4.0 1.0 3.0 2.0 5.0 ... 2.0 4.0 2.0 1.0 2.0 3.0 3.0 4.0 6.0 5.0
5 1.0 2.0 2.0 3.0 1.0 5.0 2.0 2.0 5.0 2.0 ... 2.0 3.0 1.0 1.0 1.0 5.0 5.0 2.0 3.0 3.0

5 rows × 79 columns

In [47]:
# Drop the missing_or_unknown column from the
# processed feat_info Df
feat_info = feat_info[feat_info.columns[:-1]]
feat_info.head()
Out[47]:
attribute information_level type
0 AGER_TYP person categorical
1 ALTERSKATEGORIE_GROB person ordinal
2 ANREDE_KZ person categorical
3 CJT_GESAMTTYP person categorical
4 FINANZ_MINIMALIST person ordinal
In [48]:
# filter feat_info to the remaining columns
feat_info = feat_info[feat_info.attribute.isin(df.columns)]
feat_info.shape
Out[48]:
(79, 3)
How many features are there of each data type?
In [49]:
fig, ax1 = plt.subplots(figsize=(14,8))
sns.countplot(data=feat_info,
              x='type',
              color='c',
              ax=ax1)
ax1.set_ylabel('No of features')
ax1.set_xlabel('Data Type')
for p in ax1.patches:
    ax1.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.35, p.get_height()+1))
sns.despine(ax=ax1);
  • 49 out of 79 features (62%) are ordinal.
  • The categorical and mixed features, that need to be processed, represent 18 + 6 (30.38%) of the features.

Re-Encode Categorical Features

In [50]:
# Get categorical features
feat_info[feat_info.type == 'categorical']
Out[50]:
attribute information_level type
2 ANREDE_KZ person categorical
3 CJT_GESAMTTYP person categorical
10 FINANZTYP person categorical
12 GFK_URLAUBERTYP person categorical
13 GREEN_AVANTGARDE person categorical
17 LP_FAMILIE_FEIN person categorical
18 LP_FAMILIE_GROB person categorical
19 LP_STATUS_FEIN person categorical
20 LP_STATUS_GROB person categorical
21 NATIONALITAET_KZ person categorical
38 SHOPPER_TYP person categorical
39 SOHO_KZ person categorical
41 VERS_TYP person categorical
42 ZABEOTYP person categorical
52 GEBAEUDETYP building categorical
55 OST_WEST_KZ building categorical
57 CAMEO_DEUG_2015 microcell_rr4 categorical
58 CAMEO_DEU_2015 microcell_rr4 categorical

The majority of the categorical features is of the person information_level.

In [51]:
# How many of each information_level
feat_info[feat_info.type == 'categorical'].groupby('information_level').count().iloc[:,-1]
Out[51]:
information_level
building          2
microcell_rr4     2
person           14
Name: type, dtype: int64
In [52]:
# Get the categorical feature names
df_cat_cols = feat_info[feat_info.type == 'categorical'].attribute
df_cat_cols
Out[52]:
2            ANREDE_KZ
3        CJT_GESAMTTYP
10           FINANZTYP
12     GFK_URLAUBERTYP
13    GREEN_AVANTGARDE
17     LP_FAMILIE_FEIN
18     LP_FAMILIE_GROB
19      LP_STATUS_FEIN
20      LP_STATUS_GROB
21    NATIONALITAET_KZ
38         SHOPPER_TYP
39             SOHO_KZ
41            VERS_TYP
42            ZABEOTYP
52         GEBAEUDETYP
55         OST_WEST_KZ
57     CAMEO_DEUG_2015
58      CAMEO_DEU_2015
Name: attribute, dtype: object
In [53]:
# Find the number of levels
df[df_cat_cols].nunique()
Out[53]:
ANREDE_KZ            2
CJT_GESAMTTYP        6
FINANZTYP            6
GFK_URLAUBERTYP     12
GREEN_AVANTGARDE     2
LP_FAMILIE_FEIN     11
LP_FAMILIE_GROB      5
LP_STATUS_FEIN      10
LP_STATUS_GROB       5
NATIONALITAET_KZ     3
SHOPPER_TYP          4
SOHO_KZ              2
VERS_TYP             2
ZABEOTYP             6
GEBAEUDETYP          7
OST_WEST_KZ          2
CAMEO_DEUG_2015      9
CAMEO_DEU_2015      44
dtype: int64
In [54]:
# Cound the number of levels
df[df_cat_cols].nunique().sort_values()
Out[54]:
ANREDE_KZ            2
OST_WEST_KZ          2
VERS_TYP             2
SOHO_KZ              2
GREEN_AVANTGARDE     2
NATIONALITAET_KZ     3
SHOPPER_TYP          4
LP_STATUS_GROB       5
LP_FAMILIE_GROB      5
FINANZTYP            6
ZABEOTYP             6
CJT_GESAMTTYP        6
GEBAEUDETYP          7
CAMEO_DEUG_2015      9
LP_STATUS_FEIN      10
LP_FAMILIE_FEIN     11
GFK_URLAUBERTYP     12
CAMEO_DEU_2015      44
dtype: int64
In [55]:
# how many binary
sum(df[df_cat_cols].nunique() == 2)
Out[55]:
5
In [56]:
# how many multilevel
sum(df[df_cat_cols].nunique() != 2)
Out[56]:
13
In [57]:
df[df_cat_cols].dtypes
Out[57]:
ANREDE_KZ           float64
CJT_GESAMTTYP       float64
FINANZTYP           float64
GFK_URLAUBERTYP     float64
GREEN_AVANTGARDE    float64
LP_FAMILIE_FEIN     float64
LP_FAMILIE_GROB     float64
LP_STATUS_FEIN      float64
LP_STATUS_GROB      float64
NATIONALITAET_KZ    float64
SHOPPER_TYP         float64
SOHO_KZ             float64
VERS_TYP            float64
ZABEOTYP            float64
GEBAEUDETYP         float64
OST_WEST_KZ          object
CAMEO_DEUG_2015      object
CAMEO_DEU_2015       object
dtype: object

The last three columns are of dtype object. Let's have a look at them

In [58]:
df[df_cat_cols[-3:]].sample(10)
Out[58]:
OST_WEST_KZ CAMEO_DEUG_2015 CAMEO_DEU_2015
13845 W 4 4E
400531 O 9 9E
722889 W 5 5F
63590 W 5 5A
673145 W 4 4A
462521 W 9 9E
433515 W 4 4C
137159 W 2 2D
412390 W 6 6B
773771 W 7 7A
  • OST_WEST_KZ: Building location via former East / West Germany (GDR / FRG) (O: East (GDR), W: West (FRG))
    • is a binary categorical variable and is emcoded with string values and and we will reencode the string catgories in numerical types so that it can be used with the algorithms.
  • CAMEO_DEUG_2015 German CAMEO: Wealth / Life Stage Typology, rough scale (1: upper class, 9: urban working class)
    • is numeric but encoded as object dtype. Since it is a multilevel categorical feature and will be transformed to a dummy variable or dropped it can stay like this.
  • CAMEO_DEU_2015 should be a float64 data type but is encoded as object.
  • CAMEO_DEU_2015 is a more detailed version CAMEO_DEUG_2015 and would probably be highly correlated. Further investigation should be done.

Re-encode categorical variable(s) to be kept in the analysis.

In [59]:
# Reencode the binary OST_WEST_KZ to numeric categories
df = df.copy()
df['OST_WEST_KZ'] = df['OST_WEST_KZ'].replace({'O': 0, 
                                               'W': 1})
In [60]:
# test
df[df_cat_cols[-3:]].sample(10)
Out[60]:
OST_WEST_KZ CAMEO_DEUG_2015 CAMEO_DEU_2015
624729 0 6 6B
668818 1 6 6C
109095 1 4 4A
250516 1 9 9D
752581 1 3 3D
569456 1 9 9D
231614 1 2 2B
319830 1 5 5D
738995 1 9 9D
739989 1 7 7B
In [61]:
# Choose the multilevel categorical features
cat_nunique = df[df_cat_cols].nunique()
cat_to_encode = cat_nunique[cat_nunique>2]
cat_to_encode.sort_values()
Out[61]:
NATIONALITAET_KZ     3
SHOPPER_TYP          4
LP_FAMILIE_GROB      5
LP_STATUS_GROB       5
CJT_GESAMTTYP        6
FINANZTYP            6
ZABEOTYP             6
GEBAEUDETYP          7
CAMEO_DEUG_2015      9
LP_STATUS_FEIN      10
LP_FAMILIE_FEIN     11
GFK_URLAUBERTYP     12
CAMEO_DEU_2015      44
dtype: int64
In [62]:
# DROPING THE MULTI-LEVEL CATEGORICAL COLUMNS IN THIS ITERATION
df = df.drop(columns=cat_to_encode.index, axis=0)
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 747109 entries, 1 to 891220
Data columns (total 66 columns):
ALTERSKATEGORIE_GROB     744640 non-null float64
ANREDE_KZ                747109 non-null float64
FINANZ_MINIMALIST        747109 non-null float64
FINANZ_SPARER            747109 non-null float64
FINANZ_VORSORGER         747109 non-null float64
FINANZ_ANLEGER           747109 non-null float64
FINANZ_UNAUFFAELLIGER    747109 non-null float64
FINANZ_HAUSBAUER         747109 non-null float64
GREEN_AVANTGARDE         747109 non-null float64
HEALTH_TYP               722753 non-null float64
LP_LEBENSPHASE_FEIN      714735 non-null float64
LP_LEBENSPHASE_GROB      717274 non-null float64
PRAEGENDE_JUGENDJAHRE    730988 non-null float64
RETOURTYP_BK_S           747109 non-null float64
SEMIO_SOZ                747109 non-null float64
SEMIO_FAM                747109 non-null float64
SEMIO_REL                747109 non-null float64
SEMIO_MAT                747109 non-null float64
SEMIO_VERT               747109 non-null float64
SEMIO_LUST               747109 non-null float64
SEMIO_ERL                747109 non-null float64
SEMIO_KULT               747109 non-null float64
SEMIO_RAT                747109 non-null float64
SEMIO_KRIT               747109 non-null float64
SEMIO_DOM                747109 non-null float64
SEMIO_KAEM               747109 non-null float64
SEMIO_PFLICHT            747109 non-null float64
SEMIO_TRADV              747109 non-null float64
SOHO_KZ                  747109 non-null float64
VERS_TYP                 722753 non-null float64
ANZ_PERSONEN             747109 non-null float64
ANZ_TITEL                747109 non-null float64
HH_EINKOMMEN_SCORE       747109 non-null float64
W_KEIT_KIND_HH           701696 non-null float64
WOHNDAUER_2008           747109 non-null float64
ANZ_HAUSHALTE_AKTIV      741793 non-null float64
ANZ_HH_TITEL             744114 non-null float64
KONSUMNAEHE              747063 non-null float64
MIN_GEBAEUDEJAHR         747109 non-null float64
OST_WEST_KZ              747109 non-null int64
WOHNLAGE                 747109 non-null float64
CAMEO_INTL_2015          743878 non-null object
KBA05_ANTG1              736120 non-null float64
KBA05_ANTG2              736120 non-null float64
KBA05_ANTG3              736120 non-null float64
KBA05_ANTG4              736120 non-null float64
KBA05_GBZ                736120 non-null float64
BALLRAUM                 746624 non-null float64
EWDICHTE                 746624 non-null float64
INNENSTADT               746624 non-null float64
GEBAEUDETYP_RASTER       747104 non-null float64
KKK                      706689 non-null float64
MOBI_REGIO               736120 non-null float64
ONLINE_AFFINITAET        747109 non-null float64
REGIOTYP                 706689 non-null float64
KBA13_ANZAHL_PKW         746377 non-null float64
PLZ8_ANTG1               744272 non-null float64
PLZ8_ANTG2               744272 non-null float64
PLZ8_ANTG3               744272 non-null float64
PLZ8_ANTG4               744272 non-null float64
PLZ8_BAUMAX              744272 non-null float64
PLZ8_HHZ                 744272 non-null float64
PLZ8_GBZ                 744272 non-null float64
ARBEIT                   743376 non-null float64
ORTSGR_KLS9              743455 non-null float64
RELAT_AB                 743376 non-null float64
dtypes: float64(64), int64(1), object(1)
memory usage: 381.9+ MB

Summary

The dataset contains 18 categorical variables, 5 of which are binary and 13 are multilevel.

  • OST_WEST_KZ: Building location via former East / West Germany (GDR / FRG) (O: East (GDR), W: West (FRG))
    • is a binary categorical variable and is emcoded with string values and and we will reencode the string catgories in numerical types so that it can be used with the algorithms.
  • CAMEO_DEUG_2015: German CAMEO: Wealth / Life Stage Typology, rough scale (1: upper class, 9: urban working class)

The rest of the binary features will be kept as is. The multilevel categorical features need to be one hot encoded in order to be used for feature transformation and training our model. They are:

Feature # of Levels
NATIONALITAET_KZ 3
SHOPPER_TYP 4
LP_FAMILIE_GROB 5
LP_STATUS_GROB 5
CJT_GESAMTTYP 6
FINANZTYP 6
ZABEOTYP 6
GEBAEUDETYP 7
CAMEO_DEUG_2015 9
LP_STATUS_FEIN 10
LP_FAMILIE_FEIN 11
GFK_URLAUBERTYP 12
CAMEO_DEU_2015 44

Since some of the information they contain is covered by other features in the dataset like CAMEO_DEUG_2015 for CAMEO_DEU_2015 or FINANZ_ for FINANZTYP LP_STATUS_FEIN and LP_STATUS_GROB for LP_FAMILIE_FEIN and LP_FAMILIE_GROB, we will choose to drop them in this iteration of the analysis and keep things more straightforward and not increase the number of variables too much at this phase.

Engineer Mixed-Type Features

In [63]:
# Get mixed dtype features
feat_info[feat_info.type == 'mixed']
Out[63]:
attribute information_level type
15 LP_LEBENSPHASE_FEIN person mixed
16 LP_LEBENSPHASE_GROB person mixed
22 PRAEGENDE_JUGENDJAHRE person mixed
56 WOHNLAGE building mixed
59 CAMEO_INTL_2015 microcell_rr4 mixed
79 PLZ8_BAUMAX macrocell_plz8 mixed

Investigate "PRAEGENDE_JUGENDJAHRE" and engineer two new variables.

Data Dictionary:

PRAEGENDE_JUGENDJAHRE: Dominating movement of person's youth (avantgarde vs. mainstream; east vs. west)

  • -1: unknown
  • 0: unknown
  • 1: 40s - war years (Mainstream, E+W)
  • 2: 40s - reconstruction years (Avantgarde, E+W)
  • 3: 50s - economic miracle (Mainstream, E+W)
  • 4: 50s - milk bar / Individualisation (Avantgarde, E+W)
  • 5: 60s - economic miracle (Mainstream, E+W)
  • 6: 60s - generation 68 / student protestors (Avantgarde, W)
  • 7: 60s - opponents to the building of the Wall (Avantgarde, E)
  • 8: 70s - family orientation (Mainstream, E+W)
  • 9: 70s - peace movement (Avantgarde, E+W)
  • 10: 80s - Generation Golf (Mainstream, W)
  • 11: 80s - ecological awareness (Avantgarde, W)
  • 12: 80s - FDJ / communist party youth organisation (Mainstream, E)
  • 13: 80s - Swords into ploughshares (Avantgarde, E)
  • 14: 90s - digital media kids (Mainstream, E+W)
  • 15: 90s - ecological awareness (Avantgarde, E+W)
In [64]:
# Let's have a look at it
df.PRAEGENDE_JUGENDJAHRE.sample(10)
Out[64]:
803911     5.0
221428     8.0
324692     5.0
247918    10.0
552519     NaN
293176    14.0
21530      4.0
778637    10.0
687600     9.0
880926    14.0
Name: PRAEGENDE_JUGENDJAHRE, dtype: float64
In [65]:
# Let's have a look at it
df.PRAEGENDE_JUGENDJAHRE.nunique()
Out[65]:
15
In [66]:
plt.style.use('ggplot')
fig, ax1 = plt.subplots(figsize=(14,8))
sns.countplot(data=df,
              x='PRAEGENDE_JUGENDJAHRE',
              color="darkslategray",
              ax=ax1);
In [67]:
df.PRAEGENDE_JUGENDJAHRE.value_counts()
Out[67]:
14.0    172952
8.0     134165
5.0      81722
10.0     80527
3.0      51784
15.0     39536
11.0     33173
9.0      32493
6.0      25120
12.0     23337
4.0      20068
1.0      19583
2.0       7340
13.0      5300
7.0       3888
Name: PRAEGENDE_JUGENDJAHRE, dtype: int64
In [68]:
df.PRAEGENDE_JUGENDJAHRE.isnull().sum()
Out[68]:
16121
In [69]:
df.PRAEGENDE_JUGENDJAHRE.isnull().mean() * 100
Out[69]:
2.157784205517535

Observations

  • Level 14 (90s - digital media kids (Mainstream, E+W)) leads with close to 172952 rows followed by level 8 (70s - family orientation (Mainstream, E+W)) with 134165 observations
  • Levels 7 (60s - opponents to the building of the Wall (Avantgarde, E)), 13 ((80s - Swords into ploughshares (Avantgarde, E)) and 2 (40s - reconstruction years (Avantgarde, E+W)) have the least observations with 3888, 5300 and 7340 rows respectively
  • The feature has 16121 NaNs - 2.16 % of the total.

New Variables Data encoding:

ENG_DECADE:

Person’s decade of youth.

  • 1: 40s
  • 2: 50s
  • 3: 60s
  • 4: 70s
  • 5: 80s
  • 6: 90s

ENG_MOVEMENT:

Person’s movement alignment.

  • 1: Mainstream
  • 2: Avantgarde
In [70]:
decade_dict = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3, 7:  3,
               8: 4, 9: 4, 10: 5, 11: 5, 12: 5, 13: 5,
               14: 6, 15: 6
              }
movement_dict = {1: 1, 3: 1, 5: 1, 8: 1, 10: 1, 12: 1, 14: 1,
                2: 2, 4: 2, 6: 2, 7: 2, 9: 2, 11: 2, 13: 2, 15: 2
                }

# It appears that map is approximately 10x faster than replace.
# Ref: https://stackoverflow.com/questions/20250771/remap-values-in-pandas-column-with-a-dict
df['ENG_DECADE'] = df['PRAEGENDE_JUGENDJAHRE'].map(decade_dict)
df['ENG_MOVEMENT'] = df['PRAEGENDE_JUGENDJAHRE'].map(movement_dict)
In [71]:
fig, ax1 = plt.subplots(figsize=(14,8))
sns.countplot(data=df,
              x='ENG_DECADE',
              palette="Greens",
              ax=ax1);
In [72]:
df.ENG_DECADE.value_counts()
Out[72]:
6.0    212488
4.0    166658
5.0    142337
3.0    110730
2.0     71852
1.0     26923
Name: ENG_DECADE, dtype: int64
In [73]:
# Sanity check
df.ENG_DECADE.isnull().sum()
Out[73]:
16121

Observations

  • In the new feature Level 6 90s leads with close to 212488 rows followed by level 4 70s with 166658 observations consistently with the original feature.
  • Levels 1: 40s, and 2 50s have the least observations with 26923 and 71852 rows respectively.
  • The feature as expected still has 16121 NaNs - 2.16 % of the feature's total values.
In [74]:
fig, ax1 = plt.subplots(figsize=(14,8))
sns.countplot(data=df,
              x='ENG_MOVEMENT',
              palette="Greens_r",
              ax=ax1);
In [75]:
df.ENG_MOVEMENT.value_counts()
Out[75]:
1.0    564070
2.0    166918
Name: ENG_MOVEMENT, dtype: int64
In [76]:
# Sanity check
df.ENG_MOVEMENT.isnull().sum()
Out[76]:
16121

Observations

  • In this new binary feature 1: Mainstream leads with close to 564070 rows almost 3.5 times more than 2: Avantgarde with 166918 observations.
  • The feature as expected still has 16121 NaNs - 2.16 % of the total.

Investigate "CAMEO_INTL_2015" and engineer two new variables.

Data Dictionary:

CAMEO_INTL_2015: Wealth / Life Stage Typology, mapped to international code

  • -1: unknown
  • 11: Wealthy Households - Pre-Family Couples & Singles
  • 12: Wealthy Households - Young Couples With Children
  • 13: Wealthy Households - Families With School Age Children
  • 14: Wealthy Households - Older Families & Mature Couples
  • 15: Wealthy Households - Elders In Retirement
  • 21: Prosperous Households - Pre-Family Couples & Singles
  • 22: Prosperous Households - Young Couples With Children
  • 23: Prosperous Households - Families With School Age Children
  • 24: Prosperous Households - Older Families & Mature Couples
  • 25: Prosperous Households - Elders In Retirement
  • 31: Comfortable Households - Pre-Family Couples & Singles
  • 32: Comfortable Households - Young Couples With Children
  • 33: Comfortable Households - Families With School Age Children
  • 34: Comfortable Households - Older Families & Mature Couples
  • 35: Comfortable Households - Elders In Retirement
  • 41: Less Affluent Households - Pre-Family Couples & Singles
  • 42: Less Affluent Households - Young Couples With Children
  • 43: Less Affluent Households - Families With School Age Children
  • 44: Less Affluent Households - Older Families & Mature Couples
  • 45: Less Affluent Households - Elders In Retirement
  • 51: Poorer Households - Pre-Family Couples & Singles
  • 52: Poorer Households - Young Couples With Children
  • 53: Poorer Households - Families With School Age Children
  • 54: Poorer Households - Older Families & Mature Couples
  • 55: Poorer Households - Elders In Retirement
  • XX: unknown
In [77]:
# Let's have a look at it
df.CAMEO_INTL_2015.head(10)
Out[77]:
1     51
2     24
3     12
4     43
5     54
6     22
7     14
8     13
9     15
10    51
Name: CAMEO_INTL_2015, dtype: object
In [78]:
fig, ax1 = plt.subplots(figsize=(14,8))
levels_sort = [str(int(l)) for l in np.sort(df.CAMEO_INTL_2015.astype(float).unique())[:-1]]
color_map = dict(zip())
sns.countplot(data=df,
              x='CAMEO_INTL_2015',
              color = 'darkslategray',
              order = levels_sort,
              ax=ax1);
In [79]:
df.CAMEO_INTL_2015.value_counts()
Out[79]:
51    128033
41     87902
24     85759
14     59027
43     54044
54     43637
25     37564
22     30589
45     25124
13     24528
55     22681
23     20328
52     19627
34     17537
15     16296
31     16285
44     14117
12     11980
35      9882
32      9777
33      9161
Name: CAMEO_INTL_2015, dtype: int64
In [80]:
df.CAMEO_INTL_2015.isnull().sum()
Out[80]:
3231
In [81]:
df.CAMEO_INTL_2015.isnull().mean() * 100
Out[81]:
0.4324670161917471

Observations

  • Level 51: Poorer Households - Pre-Family Couples & Singles leads with 128033 rows followed by level 41: Less Affluent Households - Pre-Family Couples & Singles) with 87902 observations, almost 1/3 less.
  • Levels 33: Comfortable Households - Families With School Age Children, 32: Comfortable Households - Young Couples With Children and 35 (40s - reconstruction years (Avantgarde, E+W)) have the least observations with 9161, 9777 and 9882 rows respectively
  • The feature has 3231 NaNs - 4.32% of it's total.

"CAMEO_INTL_2015" combines information on two axes: wealth and life stage. Break up the two-digit codes by their 'tens'-place and 'ones'-place digits into two new ordinal variables (which, for the purposes of this project, is equivalent to just treating them as their raw numeric values)

New Variables Data encoding:

ENG_WEALTH:

Household's wealth

  • 1: Wealthy Households
  • 2: Prosperous Households
  • 3: Comfortable Households
  • 4: Less Affluent Households
  • 5: Poorer Households

ENG_LIFE_STAGE:

Person's Life Stage

  • 1: Pre-Family Couples & Singles
  • 2: Young Couples With Children
  • 3: Families With School Age Children
  • 4: Older Families & Mature Couples
  • 5: Elders In Retirement
In [82]:
def breakup_codes(x, digit):
    ''' Breaks up  two-digit codes by their
    'tens'-place and 'ones'-place digits
    into two new ordinal variables.
    Leaves NaNs unchanged'''

    if not pd.isna(x):
        if digit == 'first':
            return int(str(x)[0])
        elif digit == 'second':
            return int(str(x)[1])
    return x
In [83]:
df['ENG_WEALTH'] = df['CAMEO_INTL_2015'].apply(breakup_codes, digit='first')
df['ENG_LIFE_STAGE'] = df['CAMEO_INTL_2015'].apply(breakup_codes, digit='second')
In [84]:
fig, ax1 = plt.subplots(figsize=(14,8))
sns.countplot(data=df,
              x='ENG_WEALTH',
              color = 'teal',
              ax=ax1);
In [85]:
df.ENG_WEALTH.value_counts()
Out[85]:
5.0    213978
4.0    181187
2.0    174240
1.0    111831
3.0     62642
Name: ENG_WEALTH, dtype: int64
In [86]:
# Sanity check
df.ENG_WEALTH.isnull().sum()
Out[86]:
3231

Observations

  • In the new feature Level 5: Poorer Households leads with close to 213978 rows followed by level 4: Less Affluent Households with 181187 observations consistently with the original feature.
  • Levels 3: Comfortable Households, and 1: Wealthy Households have the least observations with 62642 and 111831 rows respectively.
In [87]:
fig, ax1 = plt.subplots(figsize=(14,8))
sns.countplot(data=df,
              x='ENG_LIFE_STAGE',
              color = 'c',
              ax=ax1);
In [88]:
df.ENG_LIFE_STAGE.value_counts()
Out[88]:
1.0    232220
4.0    220077
5.0    111547
3.0    108061
2.0     71973
Name: ENG_LIFE_STAGE, dtype: int64
In [89]:
# Sanity check
df.ENG_LIFE_STAGE.isnull().sum()
Out[89]:
3231
In [90]:
mixed_feat_cols = feat_info[feat_info.type == 'mixed'].attribute
mixed_feat_cols
Out[90]:
15      LP_LEBENSPHASE_FEIN
16      LP_LEBENSPHASE_GROB
22    PRAEGENDE_JUGENDJAHRE
56                 WOHNLAGE
59          CAMEO_INTL_2015
79              PLZ8_BAUMAX
Name: attribute, dtype: object
In [91]:
# How many levels per feature
df[mixed_feat_cols].nunique()
Out[91]:
LP_LEBENSPHASE_FEIN      40
LP_LEBENSPHASE_GROB      12
PRAEGENDE_JUGENDJAHRE    15
WOHNLAGE                  8
CAMEO_INTL_2015          21
PLZ8_BAUMAX               5
dtype: int64
In [92]:
# Drop mixed dtype features
df = df.drop(mixed_feat_cols, axis=1)
df.head()
Out[92]:
ALTERSKATEGORIE_GROB ANREDE_KZ FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER GREEN_AVANTGARDE HEALTH_TYP ... PLZ8_ANTG4 PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB ENG_DECADE ENG_MOVEMENT ENG_WEALTH ENG_LIFE_STAGE
1 1.0 2.0 1.0 5.0 2.0 5.0 4.0 5.0 0.0 3.0 ... 1.0 5.0 4.0 3.0 5.0 4.0 6.0 1.0 5.0 1.0
2 3.0 2.0 1.0 4.0 1.0 2.0 3.0 5.0 1.0 3.0 ... 0.0 4.0 4.0 3.0 5.0 2.0 6.0 2.0 2.0 4.0
3 4.0 2.0 4.0 2.0 5.0 2.0 1.0 2.0 0.0 2.0 ... 0.0 3.0 4.0 2.0 3.0 3.0 4.0 1.0 1.0 2.0
4 3.0 1.0 4.0 3.0 4.0 1.0 3.0 2.0 0.0 3.0 ... 1.0 3.0 3.0 4.0 6.0 5.0 4.0 1.0 4.0 3.0
5 1.0 2.0 3.0 1.0 5.0 2.0 2.0 5.0 0.0 3.0 ... 1.0 5.0 5.0 2.0 3.0 3.0 2.0 1.0 5.0 4.0

5 rows × 64 columns

Observations

  • In the new feature Level 1: Pre-Family Couples & Singles leads with 232220 rows followed by level 4: Older Families & Mature Couples with 220077 observations.
  • Levels 2: Young Couples With Children, and 3: Families With School Age Children have the least observations with 71973 and 108061 rows respectively.

Summary

The dataset contains 6 mixed variables, 5 of which are binary and 13 are multilevel.

The mixed feature PRAEGENDE_JUGENDJAHRE: Dominating movement of person's youth (avantgarde vs. mainstream; east vs. west) (1: 40s - war years (Mainstream, E+W), 15: 90s - ecological awareness (Avantgarde, E+W)), with the following observations Fig7:

  • Level 14 (90s - digital media kids (Mainstream, E+W)) leads with close to 172952 rows followed by level 8 (70s - family orientation (Mainstream, E+W)) with 134165 observations
  • Levels 7 (60s - opponents to the building of the Wall (Avantgarde, E)), 3 ((80s - Swords into ploughshares (Avantgarde, E)) and 2 (40s - reconstruction years (Avantgarde, E+W)) have the least observations with 3888, 5300 and 7340 rows respectively
  • The feature has 16121 NaNs - 2.16 % of it's total.

and two new variables have been engineered:

  1. ENG_DECADE: Person’s decade of youth. (1: 40s , 6: 90s) , Fig8 observations:

    • In the new feature Level 6 90s leads with close to 212488 rows followed by level 4 70s with 166658 observations consistently with the original feature.
    • Levels 1: 40s, and 2 50s have the least observations with 26923 and 71852 rows respectively.
  2. ENG_MOVEMENT: Person’s movement alignment (1: Mainstream, 2: Avantgarde ), Fig9 observations:

    • In this new binary feature 1: Mainstream leads with close to 564070 rows almost 3.5 times more than 2: Avantgarde with 166918 observations.
    • The feature as expected still has 16121 NaNs - 2.16 % of the total.

The mixed feature CAMEO_INTL_2015: German CAMEO: Wealth / Life Stage Typology, mapped to international code (11: Wealthy Households - Pre-Family Couples & Singles, 55: Poorer Households - Elders In Retirement), has been explored, with the following observations Fig10:

  • Level 51: Poorer Households - Pre-Family Couples & Singles leads with 128033 rows followed by level 41: Less Affluent Households - Pre-Family Couples & Singles) with 87902 observations, almost 1/3 less.
  • Levels 33: Comfortable Households - Families With School Age Children, 32: Comfortable Households - Young Couples With Children and 35 (40s - reconstruction years (Avantgarde, E+W)) have the least observations with 9161, 9777 and 9882 rows respectively
  • The feature has 3231 NaNs - 4.32% of it's total.

and two new variables have been engineered:

  1. ENG_WEALTH: Household's wealth (1: Wealthy Households, 5: Poorer Households) , Fig11 observations:

    • In the new feature Level 5: Poorer Households leads with close to 213978 rows followed by level 4: Less Affluent Households with 181187 observations consistently with the original feature.
    • Levels 3: Comfortable Households, and 1: Wealthy Households have the least observations with 62642 and 111831 rows respectively.
  2. ENG_MOVEMENT: Person's Life Stage (Pre-Family Couples & Singles, 5: Elders In Retirement), Fig12 observations:

    • In the new feature Level 1: Pre-Family Couples & Singles leads with 232220 rows followed by level 4: Older Families & Mature Couples with 220077 observations.
    • Levels 2: Young Couples With Children, and 3: Families With School Age Children have the least observations with 71973 and 108061 rows respectively.

The rest of the mixed features are:

Feature # of Levels
LP_LEBENSPHASE_FEIN 40
LP_LEBENSPHASE_GROB 12
WOHNLAGE 8
PLZ8_BAUMAX 5

which we chose to drop in this iteration of the analysis and keep things more straightforward and not increase the number of variables too much at this phase.

Complete Feature Selection

If there are other re-engineering tasks you need to perform, make sure you take care of them here. (Dealing with missing data will come in step 2.1.)
Do whatever you need to in order to ensure that the dataframe only contains the columns that should be passed to the algorithm functions.
In [93]:
df.columns
Out[93]:
Index(['ALTERSKATEGORIE_GROB', 'ANREDE_KZ', 'FINANZ_MINIMALIST',
       'FINANZ_SPARER', 'FINANZ_VORSORGER', 'FINANZ_ANLEGER',
       'FINANZ_UNAUFFAELLIGER', 'FINANZ_HAUSBAUER', 'GREEN_AVANTGARDE',
       'HEALTH_TYP', 'RETOURTYP_BK_S', 'SEMIO_SOZ', 'SEMIO_FAM', 'SEMIO_REL',
       'SEMIO_MAT', 'SEMIO_VERT', 'SEMIO_LUST', 'SEMIO_ERL', 'SEMIO_KULT',
       'SEMIO_RAT', 'SEMIO_KRIT', 'SEMIO_DOM', 'SEMIO_KAEM', 'SEMIO_PFLICHT',
       'SEMIO_TRADV', 'SOHO_KZ', 'VERS_TYP', 'ANZ_PERSONEN', 'ANZ_TITEL',
       'HH_EINKOMMEN_SCORE', 'W_KEIT_KIND_HH', 'WOHNDAUER_2008',
       'ANZ_HAUSHALTE_AKTIV', 'ANZ_HH_TITEL', 'KONSUMNAEHE',
       'MIN_GEBAEUDEJAHR', 'OST_WEST_KZ', 'KBA05_ANTG1', 'KBA05_ANTG2',
       'KBA05_ANTG3', 'KBA05_ANTG4', 'KBA05_GBZ', 'BALLRAUM', 'EWDICHTE',
       'INNENSTADT', 'GEBAEUDETYP_RASTER', 'KKK', 'MOBI_REGIO',
       'ONLINE_AFFINITAET', 'REGIOTYP', 'KBA13_ANZAHL_PKW', 'PLZ8_ANTG1',
       'PLZ8_ANTG2', 'PLZ8_ANTG3', 'PLZ8_ANTG4', 'PLZ8_HHZ', 'PLZ8_GBZ',
       'ARBEIT', 'ORTSGR_KLS9', 'RELAT_AB', 'ENG_DECADE', 'ENG_MOVEMENT',
       'ENG_WEALTH', 'ENG_LIFE_STAGE'],
      dtype='object')
In [94]:
df.shape
Out[94]:
(747109, 64)
In [95]:
# Check if all colums are of numeric dtype
from pandas.api.types import is_numeric_dtype

#should be equal to the no of cols if all True
assert(sum([is_numeric_dtype(df[col]) for col in df.columns]) == df.shape[1])
print('Tests OK')
Tests OK

Create a Cleaning Function

In [96]:
def clean_data(df):
    """
    Perform feature trimming, re-encoding, and engineering for demographics
    data

    INPUT: Demographics DataFrame
    OUTPUT: Trimmed and cleaned demographics DataFrame
    """

    # Put in code here to execute all main cleaning steps:
    # convert missing value codes into NaNs, ...
    df = convert_missingdata_encodings_to_nans(df)
    print('Convert missing value codes into NaN, OK')

    # remove selected columns , ...
    df = drop_selected_columns(df)
    # remove selected rows and return both dfs
    df, df_cust_highna = drop_selected_rows(df)
    print('Remove selected columns and rows, OK')
    
    # select, re-encode, and engineer column values.
    df = reencode_categorical_feat(df)
    df = engineer_mixed_feat(df)
    print('Select, re-encode, and engineer column value, OK')
    
    # Return the cleaned dataframe and the high NaN Df.
    return df, df_cust_highna


def convert_missingdata_encodings_to_nans(df, debug=False):
    '''Convert to nans using the missing_data_encode dictionary'''

    for col in missing_data_encode.keys():
        if debug:
            print(col, 'nans before: ', df[col].isnull().sum())
        df.loc[df[col].isin(missing_data_encode[col]), col] = np.nan
        if debug:
            print(col, 'nans after: ', df[col].isnull().sum(), '\n')
    return df


def drop_selected_columns(df):
    ''''''
    return df.drop(columns=cols_to_remove, axis=1)


def drop_selected_rows(df):
    '''split the dataset into two parts:
    One with a percntage of NaN per row over the threshhold
    and one below. Return both.'''
    threshhold = 0.1
    df['row_nan_perc'] = df.isnull().mean(axis=1)
    
    df_lowna = df.loc[df['row_nan_perc'] < threshhold, :].\
                                drop(columns='row_nan_perc', axis=1)
    df_highna = df.loc[df['row_nan_perc'] > threshhold, :].\
                                drop(columns='row_nan_perc', axis=1)

    return df_lowna, df_highna


def reencode_categorical_feat(df):
    ''''''
    # Reencode the binary OST_WEST_KZ to numeric categories
    df['OST_WEST_KZ'] = df['OST_WEST_KZ'].replace({'O': 0,
                                                   'W': 1})

    # choose the categorical features with more than 2 levels
    df_cat_cols = feat_info[feat_info.type == 'categorical'].attribute
    cat_nunique = df[df_cat_cols].nunique()
    cat_to_encode = cat_nunique[cat_nunique > 2]

    # DROP THE MULTI-LEVEL CATEGORICAL FEATURES
    df = df.drop(columns=cat_to_encode.index, axis=0)

    return df


def engineer_mixed_feat(df):
    '''Engineer new variables from the 
    mixed features. Drop he old features'''
    # Engineer two new variables from 'PRAEGENDE_JUGENDJAHRE'
    decade_dict = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3, 7:  3,
                   8: 4, 9: 4, 10: 5, 11: 5, 12: 5, 13: 5,
                   14: 6, 15: 6
                   }
    movement_dict = {1: 1, 3: 1, 5: 1, 8: 1, 10: 1, 12: 1, 14: 1,
                     2: 2, 4: 2, 6: 2, 7: 2, 9: 2, 11: 2, 13: 2, 15: 2
                     }

    df['ENG_DECADE'] = df['PRAEGENDE_JUGENDJAHRE'].map(decade_dict)
    df['ENG_MOVEMENT'] = df['PRAEGENDE_JUGENDJAHRE'].map(movement_dict)
    df.drop

    # Engineer two new variables from 'CAMEO_INTL_2015'
    df['ENG_WEALTH'] = df['CAMEO_INTL_2015'].apply(
        breakup_codes, digit='first')
    df['ENG_LIFE_STAGE'] = df['CAMEO_INTL_2015'].apply(
        breakup_codes, digit='second')

    # Drop mixed dtype features
    mixed_feat_cols = feat_info[feat_info.type == 'mixed'].attribute
    df = df.drop(mixed_feat_cols, axis=1)

    return df
In [97]:
def test_clean_data(df):
    '''A simple test function to test that the cleaning function 
    is at least producing an identical Dataframe as the one we created
    TODO: proper unit tests'''
    df_test = pd.read_csv('Udacity_AZDIAS_Subset.csv', delimiter=';')
    print('Load the dataset, OK')
    pd.testing.assert_frame_equal(clean_data(df_test)[0], df)
    print('Tests Passed!')
    
test_clean_data(df)
Load the dataset, OK
Convert missing value codes into NaN, OK
Remove selected columns and rows, OK
Select, re-encode, and engineer column value, OK
Tests Passed!

Feature Transformation

Apply Feature Scaling

Explore and clean the dataset of all NaN values

In [98]:
# Percentage of nans per column
(df.isnull().mean(axis=0) * 100).sort_values(ascending=False)[:15]
Out[98]:
W_KEIT_KIND_HH         6.078497
KKK                    5.410188
REGIOTYP               5.410188
HEALTH_TYP             3.260033
VERS_TYP               3.260033
ENG_MOVEMENT           2.157784
ENG_DECADE             2.157784
KBA05_ANTG1            1.470870
KBA05_ANTG3            1.470870
KBA05_ANTG4            1.470870
KBA05_GBZ              1.470870
MOBI_REGIO             1.470870
KBA05_ANTG2            1.470870
ANZ_HAUSHALTE_AKTIV    0.711543
ARBEIT                 0.499659
dtype: float64

W_KEIT_KIND_HH has the highest percentage of missing values (6%) with the next REGIOTYP and KKK at 5.4%.

In [99]:
# Percentage of nans per row 
nan_perc = (df.isnull().mean(axis=1) * 100).sort_values(ascending=False)

#How many rows have nans?
nan_perc[nan_perc > 0].shape[0]
Out[99]:
123898
In [100]:
# as a percentage of rows
print('Total percentage of row that have NaNs : ',
    round(nan_perc[nan_perc > 0].shape[0]/ df.shape[0]  * 100, 4),
      '%')
Total percentage of row that have NaNs :  16.5837 %
In [101]:
df.isnull().sum().sum() 
Out[101]:
320763
In [102]:
np.product(df.shape)
Out[102]:
47814976
In [103]:
# How many nans as perc?
print('Total percentage of Nans : ',
      round(df.isnull().sum().sum() / np.product(df.shape) * 100, 2),
      '%')
Total percentage of Nans :  0.67 %
In [104]:
# Fit the Scaler without NaNs
df_drop = df.dropna(axis=0)
scaler = StandardScaler()
scaler.fit(df_drop)
Out[104]:
StandardScaler(copy=True, with_mean=True, with_std=True)
In [105]:
# Imput missing values in the full df with the mean
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
X_impute = imp.fit_transform(df)
In [106]:
X_impute
Out[106]:
array([[1., 2., 1., ..., 1., 5., 1.],
       [3., 2., 1., ..., 2., 2., 4.],
       [4., 2., 4., ..., 1., 1., 2.],
       ...,
       [2., 2., 2., ..., 1., 2., 4.],
       [1., 1., 1., ..., 1., 5., 1.],
       [4., 1., 4., ..., 1., 4., 3.]])
In [107]:
# Transform the df with the fitted scaler
X_scale = scaler.transform(X_impute)
In [108]:
# Check the distribution of the feature with the highest pec of NaNs
df_scale = pd.DataFrame(X_scale, columns=df.columns)
df_scale.W_KEIT_KIND_HH.describe()
Out[108]:
count    747109.000000
mean         -0.001767
std           0.985268
min          -1.839105
25%          -0.688383
50%          -0.001767
75%           1.037699
max           1.037699
Name: W_KEIT_KIND_HH, dtype: float64
In [109]:
# check the distributions
df_scale.describe()
Out[109]:
ALTERSKATEGORIE_GROB ANREDE_KZ FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER GREEN_AVANTGARDE HEALTH_TYP ... PLZ8_ANTG4 PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB ENG_DECADE ENG_MOVEMENT ENG_WEALTH ENG_LIFE_STAGE
count 747109.000000 747109.000000 747109.000000 747109.000000 747109.000000 747109.000000 747109.000000 747109.000000 747109.000000 747109.000000 ... 747109.000000 747109.000000 747109.000000 747109.000000 747109.000000 747109.000000 747109.000000 747109.000000 747109.000000 747109.000000
mean 0.003698 0.015413 0.001930 0.008389 -0.000648 0.030617 0.024543 -0.011613 -0.026449 0.000891 ... -0.020237 -0.002344 0.020456 -0.019430 -0.034443 -0.021471 0.005359 -0.014822 -0.029240 -0.000760
std 0.997498 0.999536 1.006980 1.008161 0.995106 1.011898 1.009243 1.015730 0.982942 0.979313 ... 0.993209 1.009009 0.996363 1.003865 0.999515 1.002373 0.995194 0.979819 1.003429 0.992829
min -1.746287 -1.022678 -1.512226 -1.141397 -1.765054 -1.221852 -1.139026 -1.588887 -0.553672 -1.591635 ... -0.981973 -2.721004 -2.111170 -2.223015 -1.914219 -1.563993 -2.280170 -0.553672 -1.595951 -1.251111
25% -0.772089 -1.022678 -0.777350 -1.141397 -1.045045 -1.221852 -1.139026 -0.856544 -0.553672 -0.273495 ... -0.981973 -0.639701 -0.314765 -0.201175 -0.607437 -0.823841 -0.902320 -0.553672 -0.909992 -1.251111
50% 0.202108 0.977825 -0.042475 0.219832 0.394972 0.158605 -0.410325 -0.124201 -0.553672 -0.273495 ... 0.384671 -0.002344 -0.314765 -0.201175 -0.171843 -0.083688 -0.213395 -0.553672 0.461926 0.082843
75% 1.176305 0.977825 0.692400 0.900446 1.114980 0.848833 1.047076 0.608142 -0.553672 1.044646 ... 0.384671 0.400951 0.583438 0.809745 0.699344 0.656464 1.164455 -0.553672 1.147884 0.749820
max 1.176305 0.977825 1.427276 1.581061 1.114980 1.539061 1.775776 1.340485 1.806125 1.044646 ... 1.751315 1.441603 1.481641 1.820665 1.570532 1.396616 1.164455 1.806125 1.147884 1.416797

8 rows × 64 columns

In [110]:
# How many nans as perc after?
print('Total percentage of Nans after  drop: ',
      round(df_drop.isnull().sum().sum() / np.product(df_drop.shape) * 100, 2),
      '%')
Total percentage of Nans after  drop:  0.0 %

Summary

The total percentage of NaNs in the dataset is 0.67% while the percentage of rows that contain NanNs is 16,37%.

Droping them would discard a significant amount of data in order to handle much smaller amount of actually missing data.

Imputing them with the mean will result them having a 0 value after using StandardScaler(), possibly reducing the variance of the resulting variables.

An alternative which was used is to fit the StandardScaler to the data without the NaNs and then reintroduce them, imputing them with the mean and transforming them with the fitted StandardScaler object thus resulting in a distributions that retain a higher amount of information.

Dimensionality Reduction

Investigate the variance accounted for by each principal component.

In [111]:
X_scale.shape
Out[111]:
(747109, 64)
In [112]:
# With all the feaures
pca = PCA().fit(X_scale)
In [113]:
num_components=len(pca.explained_variance_ratio_)
inds = np.arange(num_components)
vals = pca.explained_variance_ratio_

fig, ax = plt.subplots(figsize=(12,9))
plt.bar(inds, vals, color='b');
In [114]:
fig, ax = plt.subplots(figsize=(16,10))
plt.plot(np.cumsum(pca.explained_variance_ratio_), marker='o', color='b')
ax.set_xticks(inds+1, minor=False);

The numbers that stand out are 12, 18 and 25.

In [115]:
def scree_plot(pca, c, ann_f=6, figsize=(12, 9), pct_change=False):
    '''
    Creates a scree plot associated with the principal components

    INPUT: pca - the result of instantian of PCA in scikit learn

    OUTPUT:
            None
    '''
    num_components = len(pca.explained_variance_ratio_)
    ind = np.arange(num_components) + 1
    if pct_change:
        vals = -pd.Series(pca.explained_variance_ratio_).pct_change()
        # the first item is NaN
        vals = vals[1:].tolist()
        ind = ind[1:]
    else:
        vals = pca.explained_variance_ratio_

    plt.figure(figsize=figsize)
    ax = plt.   subplot(111)
    cumvals = np.cumsum(vals)
    ax.bar(ind, vals, color=c)
    ax.plot(ind, cumvals, color=c)
    for i in range(num_components):
        try:  # Bandage aid - need to fix
            ax.annotate(r"%s%%" % (
                (str(vals[i] * 100)[:4])),
                (ind[i] + 0.2, vals[i]),
                va="bottom", ha="center", fontsize=ann_f)
        except IndexError:
            pass

    ax.xaxis.set_tick_params(width=0)
    ax.set_xticks(ind, minor=False)
    ax.yaxis.set_tick_params(width=2, length=12)

    ax.set_xlabel("Principal Component")
    ax.set_ylabel("Variance Explained (%)")
    plt.title('Explained Variance Per Principal Component')
In [116]:
# Re-apply PCA to the data while selecting for number of components to retain.
pca_25 = PCA(n_components=25)
X_pca_25 = pca_25.fit(X_scale)

scree_plot(pca_25, c='slategray', ann_f=8, figsize=(12,9))

Observations

There seems to be a stabilization of the variance explained and a decrease in the ratio of cumulative variance explained round 12 components.

In [117]:
np.cumsum(pca.explained_variance_ratio_[:12])[-1]
Out[117]:
0.6621255269843482

The cumulative explained variance of the first 12 components is 66.21%.

Final dimensionality reduction

In [118]:
# Fit on X for 12 components
pca = PCA(n_components=12)
X_pca = pca.fit_transform(X_scale)
In [119]:
X_pca.shape
Out[119]:
(747109, 12)
In [120]:
df_pca = pd.DataFrame(X_pca)
df_pca.head()
Out[120]:
0 1 2 3 4 5 6 7 8 9 10 11
0 3.907796 -2.656621 -2.895773 -0.441863 -2.512817 2.215937 0.704196 0.841138 1.356967 -0.523036 0.015870 2.084486
1 -0.893332 0.179033 -3.051267 2.233205 -0.904234 -0.238592 0.286181 -0.989670 0.839629 -1.411833 -1.036324 -0.259661
2 -4.068326 1.143361 -0.822045 -0.920451 0.778765 0.887047 -1.682292 -0.519529 -1.366521 -0.240180 0.464542 -1.224527
3 0.571175 0.223070 3.234685 0.625786 1.402563 -2.781060 -0.620200 2.284333 1.291613 -0.727196 1.193146 1.175043
4 -0.710996 -1.353162 -1.047915 -1.812801 -3.406913 -1.069949 0.458173 -0.608483 -2.732078 1.532272 0.296287 2.311738
In [121]:
df_pca.describe()
Out[121]:
0 1 2 3 4 5 6 7 8 9 10 11
count 7.471090e+05 7.471090e+05 7.471090e+05 7.471090e+05 7.471090e+05 7.471090e+05 7.471090e+05 7.471090e+05 7.471090e+05 7.471090e+05 7.471090e+05 7.471090e+05
mean 8.274190e-17 5.448174e-16 1.742525e-16 -2.588775e-16 1.005267e-17 2.195799e-16 -2.259710e-16 7.197594e-17 1.714089e-16 -3.414482e-16 1.674240e-16 -2.921835e-16
std 3.305431e+00 2.941747e+00 2.445980e+00 1.834498e+00 1.575535e+00 1.436010e+00 1.385909e+00 1.327920e+00 1.240437e+00 1.212733e+00 1.158618e+00 1.134653e+00
min -8.158599e+00 -8.056119e+00 -5.388341e+00 -5.578758e+00 -6.321833e+00 -4.393262e+00 -1.001662e+01 -5.470201e+00 -6.285885e+00 -4.982893e+00 -5.095626e+00 -5.487985e+00
25% -2.797101e+00 -2.248605e+00 -2.213944e+00 -1.315640e+00 -9.998736e-01 -9.600904e-01 -5.595538e-01 -8.943880e-01 -8.435577e-01 -8.298672e-01 -7.765242e-01 -7.862076e-01
50% -1.499720e-01 6.883725e-02 -4.478043e-01 -1.937895e-01 6.640127e-02 -1.627032e-01 1.417113e-01 3.172766e-02 -6.029639e-02 -3.995200e-03 2.304979e-02 -2.780641e-02
75% 2.670554e+00 2.171403e+00 2.290881e+00 1.100015e+00 1.042296e+00 7.587979e-01 8.241712e-01 9.202707e-01 7.958256e-01 8.167085e-01 7.949304e-01 7.713673e-01
max 1.216322e+01 1.094159e+01 6.402612e+00 8.286278e+00 7.252907e+00 2.614140e+01 1.680698e+01 1.238407e+01 1.515561e+01 1.096451e+01 1.025272e+01 6.254072e+00

Summary

We applied the dimensionality reduction PCA() method to the scaled dataset and by observing the combined plot of explained variance and cumulative explained variance and per principal component cumulative explained variance plot (Fig15) we chose to keep 12 components as at this number the seems to start a decrease in the ratio of cumulative variance explained as new components are added.

The cumulative explained variance of the first 12 components is 66.21%.

Interpret Principal Components

In [122]:
def map_pca_weights_to_feats(pca, df, comp_no):
    '''Map pca weights to  individual features
    and return two pd.Series on with the highest
    positive weights and one with the lowest negative
    weights'''

    weights = pd.DataFrame(np.round(pca.components_, 4), columns=df.keys())
    component = weights.iloc[comp_no - 1, :]
    comp_pos = component[component > 0].sort_values(ascending=False)
    comp_neg = component[component < 0].sort_values(ascending=True)

    return comp_pos, comp_neg
In [123]:
def create_bar_table(comp_pos, comp_neg):
    '''Create and display a conditionally styled pandas dataframe
    for the interpertation of the positive and negative weights
    of PCA and centroid distances for KMeans'''
    
    head = """
    <table>
    """
    row = ""
    for serie in [comp_pos[:15],comp_neg[:15]]:
        s = serie.copy()
        s.name=''
        row += "<td>{}</td>".format(s.to_frame().style.bar(
                                                           align='mid',
                                                           color=['#d65f5f', '#5fba7d'],
                                                           width=100).render()
                                   ) 
    row += '</tr>'
    head += row

    head+= """
    </table>"""

    display(HTML(head))

Map weights for the first principal component to corresponding feature names.

In [124]:
# the weights df
weights = pd.DataFrame(np.round(pca.components_, 4), columns = df_drop.keys())
weights.head()
Out[124]:
ALTERSKATEGORIE_GROB ANREDE_KZ FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER GREEN_AVANTGARDE HEALTH_TYP ... PLZ8_ANTG4 PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB ENG_DECADE ENG_MOVEMENT ENG_WEALTH ENG_LIFE_STAGE
0 -0.1175 0.0057 -0.2193 0.1436 -0.1105 0.0650 0.0734 0.1645 -0.1083 0.0283 ... 0.2178 0.0419 -0.1672 0.1441 0.1970 0.1353 0.0997 -0.1068 0.2078 -0.1236
1 0.2604 0.0864 0.0719 -0.2418 0.2337 -0.2121 -0.2312 0.1161 -0.0134 -0.0581 ... 0.0861 0.0130 -0.0694 0.0652 0.0882 0.0630 -0.2515 -0.0113 0.0709 0.0028
2 0.0765 -0.3727 0.1599 -0.1035 0.0975 -0.1905 -0.0951 -0.0495 0.0501 -0.0151 ... 0.0432 0.0052 -0.0358 0.0323 0.0462 0.0305 -0.1049 0.0493 0.0237 -0.0079
3 -0.0435 0.0403 0.0640 -0.0003 -0.0163 -0.1265 0.0904 -0.1095 0.3895 0.0056 ... 0.0796 0.1483 0.0425 0.0462 0.2498 0.0936 0.0398 0.3877 -0.1280 0.0593
4 0.0055 0.0078 0.0859 -0.0064 -0.0647 0.0361 -0.0460 -0.0977 0.0636 0.0133 ... 0.0316 -0.4701 -0.3933 0.2175 0.0950 0.1405 0.0285 0.0613 -0.0117 0.0213

5 rows × 64 columns

In [125]:
pd.Series(X_pca[:, 1]).describe()
Out[125]:
count    7.471090e+05
mean     5.448174e-16
std      2.941747e+00
min     -8.056119e+00
25%     -2.248605e+00
50%      6.883725e-02
75%      2.171403e+00
max      1.094159e+01
dtype: float64
In [126]:
comp_pos1, comp_neg1 = map_pca_weights_to_feats(pca, df_drop, 1)
create_bar_table(comp_pos1, comp_neg1)
PLZ8_ANTG3 0.2264
PLZ8_ANTG4 0.2178
HH_EINKOMMEN_SCORE 0.2082
ENG_WEALTH 0.2078
ORTSGR_KLS9 0.197
EWDICHTE 0.1969
FINANZ_HAUSBAUER 0.1645
PLZ8_ANTG2 0.1539
KBA05_ANTG4 0.153
ARBEIT 0.1441
FINANZ_SPARER 0.1436
ANZ_HAUSHALTE_AKTIV 0.1384
RELAT_AB 0.1353
KBA05_ANTG3 0.1339
SEMIO_REL 0.1089
MOBI_REGIO -0.2425
PLZ8_ANTG1 -0.2245
KBA05_ANTG1 -0.223
FINANZ_MINIMALIST -0.2193
KBA05_GBZ -0.2183
KONSUMNAEHE -0.1694
PLZ8_GBZ -0.1672
INNENSTADT -0.164
BALLRAUM -0.126
ENG_LIFE_STAGE -0.1236
GEBAEUDETYP_RASTER -0.1177
ALTERSKATEGORIE_GROB -0.1175
FINANZ_VORSORGER -0.1105
GREEN_AVANTGARDE -0.1083
ENG_MOVEMENT -0.1068

Most Positive

  • PLZ8_ANTG3: Number of 6-10 family houses in the PLZ8 region (0: no 6-10 family homes, 3: high share of 6-10 family homes)
    • Higher share of medium apartment buildings
  • PLZ8_ANTG4 : Number of 10+ family houses in the PLZ8 region (0: no 10+ family home, 2: high share of 10+ family homes)
    • Higher share of larger apartment buildings
  • HH_EINKOMMEN_SCORE: Estimated household net income (1: highest income, 6: very low income)
    • Lower incomes
  • ENG_WEALTH: Household wealth - Engineered (1: Wealthy Households, 5: Poorer Households)
    • Poorer Households
  • ORTSGR_KLS9: Size of community (1: <= 2,000 inhabitants, 9: > 700,000 inhabitants)
    • Bigger community size
  • EWDICHTE : Density of households per square kilometer (1: less than 34 households per km^2, 6: more than 999 households per km^2)
    • Denser populated areas

Most Negative

  • MOBI_REGIO : Movement patterns (1: very high movement, 6: none)
    • Lower movement patterns
  • PLZ8_ANTG1: Number of 1-2 family houses in the PLZ8 region (0: no 1-2 family homes, 4: very high share of 1-2 family homes)
    • Higher share of of 1-2 family homes
  • KBA05_ANTG1 : Number of 1-2 family houses in the microcell (0: no 1-2 family homes, 4: very high share of 1-2 family homes)
    • Higher share of of 1-2 family homes
  • FINANZ_MINIMALIST : Financial typology - MINIMALIST: low financial interest (1: very high, 5: very low)
    • Lower financial minimalism

The first component component seems to capture the population density, and financial affluence of the person

PLZ8_ANTG3, PLZ8_ANTG4 , ENG_WEALTH, HH_EINKOMMEN_SCORE, RTSGR_KLS9 and EWDICHTE have high positive weights (People that live in more densely populated areas and with lower financial affluence).

The assumption is supported by the negative weights of MOBI_REGIO , KBA05_ANTG1, PLZ8_ANTG1 and FINANZ_MINIMALIST (People that live in less densely populated areas and with higher financial affluence).

Map weights for the second principal component to corresponding feature names.

In [127]:
comp_pos2, comp_neg2 = map_pca_weights_to_feats(pca, df_drop, 2)
create_bar_table(comp_pos2, comp_neg2)
ALTERSKATEGORIE_GROB 0.2604
FINANZ_VORSORGER 0.2337
SEMIO_ERL 0.2297
SEMIO_LUST 0.1812
RETOURTYP_BK_S 0.1681
FINANZ_HAUSBAUER 0.1161
W_KEIT_KIND_HH 0.116
SEMIO_KRIT 0.1136
SEMIO_KAEM 0.1054
PLZ8_ANTG3 0.0908
EWDICHTE 0.0905
ORTSGR_KLS9 0.0882
ANREDE_KZ 0.0864
PLZ8_ANTG4 0.0861
FINANZ_MINIMALIST 0.0719
SEMIO_REL -0.2609
ENG_DECADE -0.2515
FINANZ_SPARER -0.2418
FINANZ_UNAUFFAELLIGER -0.2312
SEMIO_TRADV -0.2276
SEMIO_PFLICHT -0.2263
SEMIO_KULT -0.2168
FINANZ_ANLEGER -0.2121
SEMIO_FAM -0.179
SEMIO_RAT -0.1699
ONLINE_AFFINITAET -0.1658
SEMIO_MAT -0.1566
SEMIO_SOZ -0.1016
PLZ8_ANTG1 -0.0876
MOBI_REGIO -0.0872

Most Positive

  • ALTERSKATEGORIE_GROB: Estimated age based on given name analysis (1: < 30 years old, 4: > 60 years old, 9: uniformly distributed)
    • Older or evenly distributed
  • FINANZ_VORSORGER: Financial typology - Be prepared (1: Very high, 5: Very low)
    • Lower financial preparation
  • SEMIO_ERL: Personality typology - event-oriented ( 1: highest affinity, 7: lowest affinity)
    • Lower event oriented affinity
  • SEMIO_LUST: Personality typology - sensual-minded ( 1: highest affinity, 7: lowest affinity)
    • Lower sensual-minded affinity
  • RETOURTYP_BK_S: Return type (1: influenceable Crazy-Shopper, 5: determined Minimal-Returner)
    • More conservative return type
  • FINANZ_HAUSBAUER: Financial typology - home ownership (1: Very high, 5: Very low)
    • Lower home ownership

Most Negative

  • SEMIO_REL: Personality typology - religious ( 1: highest affinity, 7: lowest affinity)
    • Lower religious affinity
  • ENG_DECADE: Person’s decade of youth (1: 40s , 6: 90s)
    • Younger generation
  • FINANZ_SPARER: Financial typology - money-saver (1: Very high, 5: Very low)
    • Lower money saving
  • FINANZ_UNAUFFAELLIGER: Financial typology - inconspicuous (1: Very high, 5: Very low)
    • Less inconspicuous
  • SEMIO_TRADV: Personality typology - tradional-minded ( 1: highest affinity, 7: lowest affinity)
    • Less traditional minded

The second component seems to capture the age, generation and culture (financial, buying and other) of the person.

RETOURTYP_BK_S, SEMIO_ERL and SEMIO_LUST ALTERSKATEGORIE_GROB and FINANZ_VORSORGER have high positive weights and (older people with lower financial preparation and home ownership and less sensual and event oriented).

The assumption is supported by the negative weights of SEMIO_REL, ENG_DECADE, FINANZ_SPARER and SEMIO_TRADV (younger generation, with lower religious affinity, lower money saving, less inconspicuous and less traditional minded).

Map weights for the third principal component to corresponding feature names.

In [128]:
comp_pos3, comp_neg3 = map_pca_weights_to_feats(pca, df_drop, 3)
create_bar_table(comp_pos3, comp_neg3)
SEMIO_VERT 0.3524
SEMIO_SOZ 0.2664
SEMIO_FAM 0.2529
SEMIO_KULT 0.2354
FINANZ_MINIMALIST 0.1599
RETOURTYP_BK_S 0.112
FINANZ_VORSORGER 0.0975
W_KEIT_KIND_HH 0.0878
SEMIO_REL 0.0786
ALTERSKATEGORIE_GROB 0.0765
SEMIO_LUST 0.0635
SEMIO_MAT 0.0546
GREEN_AVANTGARDE 0.0501
ENG_MOVEMENT 0.0493
EWDICHTE 0.0465
ANREDE_KZ -0.3727
SEMIO_KAEM -0.3332
SEMIO_DOM -0.3087
SEMIO_KRIT -0.2713
SEMIO_RAT -0.2119
FINANZ_ANLEGER -0.1905
SEMIO_ERL -0.178
ENG_DECADE -0.1049
FINANZ_SPARER -0.1035
FINANZ_UNAUFFAELLIGER -0.0951
SEMIO_TRADV -0.076
SEMIO_PFLICHT -0.0729
ONLINE_AFFINITAET -0.054
FINANZ_HAUSBAUER -0.0495
PLZ8_ANTG1 -0.0431

Most Positive

  • SEMIO_VERT: Personality typology - VERT: dreamful (1: highest affinity, 7: lowest affinity)
    • Less dreamful
  • SEMIO_SOZ: Personality typology - SOZ: socially-minded (1: highest affinity, 7: lowest affinity)
    • Less socially-minded
  • SEMIO_FAM: Personality typology - family-minded ( 1: highest affinity, 7: lowest affinity)
    • Less family-minded
  • SEMIO_KULT: Personality typology - cultural-minded ( 1: highest affinity, 7: lowest affinity)
    • Less cultural-minded
  • FINANZ_MINIMALIST : Financial typology - MINIMALIST: low financial interest (1: very high, 5: very low)
    • Lower financial minimalism
  • RETOURTYP_BK_S: Return type (1: influenceable Crazy-Shopper, 5: determined Minimal-Returner)
    • More conservative return type

Most Negative

  • ANREDE_KZ: Gender ( 1: male, 2: female)
    • Female
  • SEMIO_KAEM: Personality typology - KAEM: combative attitude ( 1: highest affinity, 7: lowest affinity)
    • Less combative attitude
  • SEMIO_DOM: Personality typology - DOM: dominant-minded ( 1: highest affinity, 7: lowest affinity)
    • Less dominant-minded
  • SEMIO_KRIT: Personality typology - KRIT: critical-minded ( 1: highest affinity, 7: lowest affinity)
    • Less critical-minded
  • SEMIO_RAT: Personality typology - RAT: rational ( 1: highest affinity, 7: lowest affinity)
    • Less rational
  • FINANZ_ANLEGER : Financial typology - ANLEGER: investor (1: very high, 5: very low)
    • Less prone to investing

The third component seems to capture the gender and their main corresponding personality types (dreamfulness, dominance, rationality, social and family attitude) of the person.

SEMIO_VERT, SEMIO_SOZ and SEMIO_FAM, SEMIO_KULT and FINANZ_MINIMALIST and RETOURTYP_BK_S have high positive weights and in conjunction with the very strong negative weight of ANREDE_KZ can refer to (Men that are less dreamful, less socially-minded, less cultural-minded not financial minimalists and conservative return shopper types).

The assumption is supported by the negative weights of ANREDE_KZ, SEMIO_KAEM, SEMIO_DOM, SEMIO_KRIT, SEMIO_RAT and FINANZ_ANLEGER (Women that are less dominant-minded, less critical-minded, less rational have less combative attitude and are less prone to investing).

Summary

1.The first component component seems to capture the population density, and financial status of the person - Fig16.

  • PLZ8_ANTG3, PLZ8_ANTG4 , ENG_WEALTH, HH_EINKOMMEN_SCORE, RTSGR_KLS9 and EWDICHTE have high positive weights correlated with people that live in more densely populated areas and with lower financial affluence.

  • The assumption is supported by the negative weights of MOBI_REGIO , KBA05_ANTG1, PLZ8_ANTG1 and FINANZ_MINIMALIST correlated with People that live in less densely populated areas and with higher financial affluence.

2.The second component seems to capture the age, generation and culture (financial, buying and other) of the person - Fig17.

  • RETOURTYP_BK_S, SEMIO_ERL and SEMIO_LUST ALTERSKATEGORIE_GROB and FINANZ_VORSORGER have high positive weights correlated with older people with lower financial preparation and home ownership that are less sensual and event oriented.

  • The assumption is supported by the negative weights of SEMIO_REL, ENG_DECADE, FINANZ_SPARER and SEMIO_TRADV correlated with (younger generation, with lower religious affinity, lower money saving, less inconspicuous and less traditional minded).

3.The third component seems to capture the gender and their main corresponding personality types (dreamfulness, dominance, rationality, social and family attitude) of the person - Fig18.

  • SEMIO_VERT, SEMIO_SOZ and SEMIO_FAM, SEMIO_KULT and FINANZ_MINIMALIST and RETOURTYP_BK_S have high positive weights and in conjunction with the very strong negative weight of ANREDE_KZ can refer to (men that are less dreamful, less socially-minded, less cultural-minded not financial minimalists and conservative return shopper types).

  • The assumption is supported by the negative weights of ANREDE_KZ, SEMIO_KAEM, SEMIO_DOM, SEMIO_KRIT, SEMIO_RAT and FINANZ_ANLEGER correlated with (women that are less dominant-minded, less critical-minded, less rational, have less combative attitude and are less prone to investing).

Clustering

Apply Clustering to General Population

Investigate the change in within-cluster distance across number of clusters.

In [493]:
#THIS TAKES A LONG TIME - DON'T RUN EVERY TIME!

# compute the average within-cluster distances.
# Over a number of different cluster counts...
scores = {}
for k in range(2, 31):
    # run k-means clustering on the data and...
    scores[k] = np.abs(KMeans(n_clusters=k, n_jobs=-1).fit(X_pca).score(X_pca))
#     1print('Calculated: ', k)

# Plot relationship plot
fig, ax = plt.subplots(figsize=(16,10))    
ax = pd.Series(scores).plot(marker='o', color='slategray')
ax.set_xticks(np.arange(2, 31), minor=False);  
ax.set_xlabel("# of Clusters")
ax.set_ylabel("Total Distance to centroid");

Final general population clustering

In [129]:
# Chosen number of clusters
K=8
In [130]:
# Re-fit the k-means model with the selected number of clusters and obtain
# cluster predictions for the general population demographics data.
kmeans = KMeans(n_clusters=K, n_jobs=-1, random_state=0).fit(X_pca)
In [131]:
kmeans.labels_.shape
Out[131]:
(747109,)
In [132]:
kmeans.cluster_centers_.shape
Out[132]:
(8, 12)

Summary

The curve in the Distance to Number of Clusters relationship plot (Fig19) is quite even and the the most obvious elbow in the curve where adding additional clusters less effective in reducing the within-cluster distance can be inferred to be 8 and this number has been used as the number of clusters to fit the the KMeans algorithm on the transformed demographics data and predict cluster labels.

Another elbow can be observed around the 12 clusters but the k = 8 was preferred in this iteration in order to keep the number of clusters lower and see if meaningful separation can be observed with this number.

Apply All Steps to the Customer Data

In [133]:
# Load in the customer demographics data.
customers = pd.read_csv('Customers.csv', delimiter=';')
In [134]:
customers.head()
Out[134]:
AGER_TYP ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER ... PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB
0 2 4 1 5.0 5 1 5 1 2 2 ... 3.0 3.0 1.0 0.0 1.0 5.0 5.0 1.0 2.0 1.0
1 -1 4 1 NaN 5 1 5 1 3 2 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 -1 4 2 2.0 5 1 5 1 4 4 ... 2.0 3.0 3.0 1.0 3.0 3.0 2.0 3.0 5.0 3.0
3 1 4 1 2.0 5 1 5 2 1 2 ... 3.0 2.0 1.0 0.0 1.0 3.0 4.0 1.0 3.0 1.0
4 -1 3 1 6.0 3 1 4 4 5 2 ... 2.0 4.0 2.0 1.0 2.0 3.0 3.0 3.0 5.0 1.0

5 rows × 85 columns

Data cleaning

In [135]:
df_cust, df_cust_highna = clean_data(customers)
Convert missing value codes into NaN, OK
Remove selected columns and rows, OK
Select, re-encode, and engineer column value, OK
In [136]:
df_cust.shape
Out[136]:
(133427, 64)
In [137]:
df_cust_highna.shape
Out[137]:
(58225, 79)

Feature transformation

In [138]:
# Impute
X_cust_imp = imp.transform(df_cust)
In [139]:
# Scale
X_cust_sc = scaler.transform(X_cust_imp)

Dimensionality reduction

In [140]:
X_cust_pca = pca.transform(X_cust_sc)
X_cust_pca.shape
Out[140]:
(133427, 12)
In [141]:
df_cust_pca = pd.DataFrame(X_cust_pca, columns=np.arange(1, 13))
df_cust_pca.describe()
Out[141]:
1 2 3 4 5 6 7 8 9 10 11 12
count 133427.000000 133427.000000 133427.000000 133427.000000 133427.000000 133427.000000 133427.000000 133427.000000 133427.000000 133427.000000 133427.000000 133427.000000
mean -2.420121 1.310225 1.459637 0.926338 -0.021179 0.016726 0.159784 0.006172 0.219032 0.335859 0.310805 -0.033378
std 2.719404 2.177325 2.195575 2.011782 1.504549 1.364308 1.366125 1.417460 1.265658 1.058863 1.127363 1.099069
min -7.957290 -7.519114 -4.753556 -5.340468 -6.097607 -3.509534 -9.756177 -5.348061 -5.287467 -4.524880 -5.010266 -5.786345
25% -4.481219 -0.028086 -0.776505 -0.708078 -0.983224 -0.793144 -0.347024 -0.997691 -0.634143 -0.383103 -0.451180 -0.805974
50% -3.155341 1.249607 2.196779 0.962499 0.083418 -0.116760 0.300930 0.037469 0.180509 0.359224 0.310001 -0.121571
75% -0.792494 2.715572 3.232513 2.563815 1.013612 0.618804 0.920310 1.012337 1.043033 1.067154 1.069303 0.702715
max 11.321042 10.919396 6.259865 9.449031 6.793778 29.673035 17.875117 7.616711 7.292835 5.457265 4.869678 5.209275

Clustering

In [142]:
X_cust_kmeans = kmeans.predict(X_cust_pca)
In [143]:
np.unique(X_cust_kmeans)
Out[143]:
array([0, 1, 2, 3, 4, 5, 6, 7], dtype=int32)
In [144]:
X_cust_kmeans.shape
Out[144]:
(133427,)

Compare Customer Data to Demographics Data

Compare proportions

Compare the proportion of data in each cluster for the customer data to the proportion of data in each cluster for the general population.

In [146]:
# General population dataset kmean labels
gen_clust_labels = kmeans.labels_

# Add the high NaN dataset as a cluster with label -1 
gen_clust_labels_na = np.append(gen_clust_labels, [-1] * df_highna.shape[0])

# Make proportion table
gen_clust_freq = itemfreq(gen_clust_labels_na)
gen_proportion_col = gen_clust_freq[:, 1] / gen_clust_freq[:, 1].sum()
gen_clust_freq = np.c_[gen_clust_freq, gen_proportion_col]


# customer dataset
cust_clust_labels = X_cust_kmeans

# Add the high NaN dataset as a cluster with label -1 
cust_clust_labels_na = np.append(cust_clust_labels, [-1] * df_cust_highna.shape[0])

# Make proportion table
cust_clust_freq = itemfreq(cust_clust_labels_na)
cust_proportion_col = cust_clust_freq[:, 1] / cust_clust_freq[:, 1].sum()
cust_clust_freq = np.c_[cust_clust_freq, cust_proportion_col]
In [147]:
# Plot
fig, axes = plt.subplots(1,2, figsize=(16, 10), sharey=True)
sns.barplot(x=gen_clust_freq[:, 0], y=gen_clust_freq[:, 2], color='b', ax=axes[0]);
sns.barplot(x=cust_clust_freq[:, 0], y=cust_clust_freq[:, 2], color='r', ax=axes[1]);

axes[0].set_title('General Population')
axes[1].set_title('Customer Data');
In [148]:
# create plot
fig, ax = plt.subplots(figsize=(16, 10))
index = np.append([-1], np.arange(K))
bar_width = 0.35
opacity = 0.8
 
gen = plt.bar(index, gen_clust_freq[:, 2], bar_width,
                 alpha=opacity,
                 color='b',
                 label='General Population')
 
cust = plt.bar(index + bar_width, cust_clust_freq[:, 2], bar_width,
                 alpha=opacity,
                 color='g',
                 label='Customer Data')
 
plt.xlabel('# of Cluster')
plt.ylabel('Proportion of cluster')
plt.title('Proportions of Clusters for General and Customer Data')
plt.xticks(index + bar_width, index)
plt.legend()
 
plt.tight_layout()
sns.set(style="whitegrid")
sns.despine();
In [149]:
#Calculate differece of proportions
prop_diff = cust_clust_freq[:, 2] - gen_clust_freq[:, 2]
positive= prop_diff > 0

# create plot
fig, ax = plt.subplots(figsize=(14, 8))
index = np.append([-1], np.arange(K))
bar_width = 0.35
opacity = 0.8
 
diff = plt.bar(index, prop_diff, bar_width,
                 alpha=opacity,
                 color=np.vectorize({True: 'k', False: 'r'}.get)(positive)
              )

 
plt.xlabel('# of Cluster')
plt.ylabel('Difference of proportions')
plt.title('Difference of Proportions of Clusters for General and Customer Data')
plt.xticks(index)
 
plt.tight_layout()
sns.set(style="whitegrid")
sns.despine();

Observations

  • Cluster 3 is the most overrepresented and is actually dominating the customer data and cluster 6 the most underrepresented.
  • The high-missing NaN cluster is quite overrepresented in the customer data which suggests a possibly methodical problem in the data acquisition of the customer data.
In [150]:
# Add cluster labels
df_cust_pca['cluster'] = X_cust_kmeans
In [151]:
g = sns.pairplot(df_cust_pca,
                 vars = np.arange(1, 13), 
                 diag_kind = 'kde',
                 hue="cluster",
                 palette="Paired");

Observations

  • By looking at the density plots we see that there seems to be a much clearer separation of clusters in the first 3 components and a more uniform spread of clusters at the rest of the components.
  • There is an interesting binary separation in most scatter plots that involve the third component which agrees with our assumption of it being a latent gender feature.
  • We have not include the high-NAN in this plot as it refers to different dataset characteristics.

Let's have a closer look at the first three principal components.

In [152]:
# Rename components for better reading
df_cust_pca = df_cust_pca.rename(columns = {1: 'pop_density-finance_status',
                                            2: 'age-generation',
                                            3: 'gender'}
                                )
In [153]:
g = sns.pairplot(df_cust_pca,
                 vars = ['pop_density-finance_status', 'age-generation', 'gender'], 
                 diag_kind = 'kde',
                 hue="cluster",
                 palette="Paired");

Observations

  • Cluster No 3 (dark green) seems to be strongly correlated to lower values of the 1st component (Population Density - Financial status) , values around 0 for the 2nd component (Age, Generation and Culture) and higher values of the 3rd component (Gender and their main corresponding personality types).
  • Cluster No 6 (ligh orange) seems to be the opposite of Cluster no 3 and correlates with higher values of the first component, values around 0 for the 2nd component (Age, Generation and Culture) and lower values for the 3rd principal component(Gender and their main corresponding personality)
In [155]:
def map_kmeans_weights_to_feats(kmeans, df, clust_no):
    '''Map pca weights to  individual features
    and return two pd.Series on with the highest
    positive weights and one with the lowest negative
    weights'''

    weights = pd.DataFrame(np.round(kmeans.cluster_centers_, 4), columns=df.keys())
    centroid = weights.iloc[clust_no, :]
    cent_pos = centroid[centroid > 0].sort_values(ascending=False)
    cent_neg = centroid[centroid < 0].sort_values(ascending=True)

    return cent_pos, cent_neg

Summary

Fig20, Fig21 and Fig22 observations:

  • Cluster 3 is the most overrepresented and is actually dominating the customer data and cluster 6 the most underrepresented.
  • The high-missing NaN cluster is quite overrepresented in the customer data which suggests a possibly methodical problem in the data acquisition of the customer data.

Fig22 observations:

  • By looking at the density plots we see that there seems to be a much clearer separation of clusters in the first 3 components and a more uniform spread of clusters at the rest of the components.
  • There is an interesting binary separation in most scatter plots that involve the 3rd component which agrees with our assumption of it being a latent gender feature.

What kinds of people are part of a cluster that is overrepresented in the customer data compared to the general population?

In [156]:
# Overrepresented 
over_pos, over_neg = map_kmeans_weights_to_feats(kmeans, df_cust_pca.iloc[:, :-1], 3)
create_bar_table(over_pos, over_neg)
gender 2.2146
12 0.1437
8 0.1346
5 0.0963
11 0.0732
7 0.0425
pop_density-finance_status -3.4025
9 -0.2033
10 -0.0651
age-generation -0.0223
6 -0.0134
4 -0.0116
  • Looking at the centroid distances table with the highly negative value for pop_density-finance_status and the strongly positive value for gender, confirms the pairplot observations and strengthens the assumption that the overrepresented group are: Men that are less dreamful, less socially-minded, less cultural-minded, not financial minimalists, conservative return shopper types, that have a higher financial status and live in less densely populated areas. The negative weight for the age-generation principal component in conjunction with the attitude characteristics of these men suggests that they could be of relative younger age.

What kinds of people are part of a cluster that is underrepresented in the customer data compared to the general population?

In [157]:
# Underrepresented 
under_pos, under_neg = map_kmeans_weights_to_feats(kmeans, df_cust_pca.iloc[:, :-1], 6)
create_bar_table(under_pos, under_neg)
pop_density-finance_status 4.3755
10 0.4814
4 0.361
11 0.1258
7 0.108
5 0.059
gender -2.6167
age-generation -1.2051
8 -0.2711
9 -0.1662
6 -0.0795
12 -0.049
  • Looking at the centroid distances table with the highly positive value for pop_density-finance_status and the strongly negative value for gender along with negative weight for the age-generation principal component confirms the pairplot observations and strengthens the assumption that the underrepresented group are: Younger women that are less dominant-minded, less critical-minded, less rational, have a less combative attitude and are less prone to investing and live in densely populated areas with a lower financial status. They have lower religious affinity, lower money saving culture and are less traditional minded.