In this project, I apply unsupervised learning techniques to identify segments of the population that form the core customer base for a mail-order sales company in Germany. These segments can then be used to direct marketing campaigns towards audiences that will have the highest expected rate of return.
# import libraries here; add more as necessary
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.core.display import HTML
from scipy.stats import itemfreq, chisquare
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
# Set base plotting style
plt.style.use('seaborn-ticks')
# Set base plotting size
plt.rcParams['figure.figsize'] = 12, 9
# Magic word for producing visualizations in notebook
%matplotlib inline
# Increase figure resolution for high dpi screens
%config InlineBackend.figure_format = 'retina'
# Autoreload modules
%load_ext autoreload
%autoreload 2
There are four files associated with this project:

- Demographics.csv: Demographics data for the general population of Germany; 891211 persons (rows) x 85 features (columns).
- Customers.csv: Demographics data for customers of a mail-order company; 191652 persons (rows) x 85 features (columns).
- Data_Dictionary.md: Detailed information file about the features in the provided datasets.
- Feature_Summary.csv: Summary of feature attributes for demographics data; 85 features (rows) x 4 columns.

Each row of the demographics files represents a single person, but also includes information outside of individuals, including information about their household, building, and neighborhood. I will use this information to cluster the general population into groups with similar demographic properties. Then, I will see how the people in the customers dataset fit into those created clusters. The hope here is that certain clusters are over-represented in the customers data, as compared to the general population; those over-represented clusters will be assumed to be part of the core userbase. This information can then be used for further applications, such as targeting for a marketing campaign.
# Load in the general demographics data.
azdias = pd.read_csv('Demographics.csv', delimiter=';')
# Load in the feature summary file.
feat_info = pd.read_csv('Feature_Summary.csv', delimiter=';')
feat_info.head()
azdias.head()
df = azdias.copy()
df.head()
df.describe()
df.info()
# Percentage of NaNs per column before processing
nan_perc = (df.isnull().mean() * 100).sort_values(ascending=False)
nan_perc[:20]
KK_KUNDENTYP has the highest percentage of missing values (65%); the next highest, KBA05_ANTG1, is at 14.9%, a very different scale.
# How many columns don't have NaNs?
nan_perc[nan_perc == 0].shape[0]
# Percentage of columns that have NaNs
nan_perc[nan_perc > 0].shape[0] / df.shape[1] * 100
# How many nans before processing?
print('Total percentage of Nans before processing: ',
round(df.isnull().sum().sum() / np.product(df.shape) * 100, 2),
'%')
The fourth column of the feature attributes summary (loaded in above as feat_info) documents the codes from the data dictionary that indicate missing or unknown data. While the file encodes this as a list (e.g. [-1,0]), it will get read in as a string object.
# find out the different kind of encodings
feat_info.missing_or_unknown.unique()
# check for the XX type
feat_info[feat_info.missing_or_unknown == '[XX]']
def convert_to_list_for_X_or_XX(s):
    '''Take the column with the missing value
    encodings and convert it to a list, specifically handling
    the cases where X or XX is in the encodings.
    TODO: Generalize for any new string'''
if 'XX' in s:
if len(s) == 4:
return [s[1:-1]]
else:
return [-1, 'XX']
elif 'X' in s:
return [-1, 'X']
else:
return eval(s)
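As a possible way to address the TODO above, here is a sketch of a more general parser (the helper name parse_missing_codes is mine, not part of the project); instead of special-casing X and XX, it treats any non-numeric code as a string:

import ast

def parse_missing_codes(s):
    '''Parse a string like "[-1,0]" or "[-1,XX]" into a list,
    keeping non-numeric codes such as X or XX as strings.'''
    parsed = []
    for item in s.strip('[]').split(','):
        item = item.strip()
        if not item:
            continue
        try:
            parsed.append(ast.literal_eval(item))  # numeric codes
        except (ValueError, SyntaxError):
            parsed.append(item)                    # string codes like 'X', 'XX'
    return parsed

# e.g. parse_missing_codes('[-1,XX]') -> [-1, 'XX']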
# Process the missing values encodings
feat_info.missing_or_unknown = feat_info.missing_or_unknown.apply(convert_to_list_for_X_or_XX)
# Create a dictionary to map column names to missing data encodings
missing_data_encode = dict(zip(feat_info.attribute, feat_info.missing_or_unknown))
# test
missing_data_encode['AGER_TYP']
# test
missing_data_encode['CAMEO_DEU_2015']
# Convert to nans using the missing_data_encode dictionary
for col in missing_data_encode.keys():
nans_before = df[col].isnull().sum()
df.loc[df[col].isin(missing_data_encode[col]), col] = np.nan
nans_after = df[col].isnull().sum()
if nans_after != nans_before:
print(col, 'nans before: ', nans_before)
print(col, 'nans after: ', nans_after, '\n')
# How many nans after processing?
print('Total percentage of Nans after processing: ',
round(df.isnull().sum().sum() / np.product(df.shape) * 100, 2),
'%')
# A Series with the proportion of NaNs per column after processing
nan_perc = (df.isnull().mean()).sort_values(ascending=False)
nan_perc[:20]
# How many columns don't have NaNs?
nan_perc[nan_perc == 0].shape[0]
# as a percentage
nan_perc[nan_perc == 0].shape[0]/ df.shape[1] * 100
def hist_box_plot(x, x_label, y_label, bin_incr):
    '''Take an array as input and draw a histogram with a boxplot above it'''
f, (ax_box, ax_hist) = plt.subplots(2,
sharex=True,
gridspec_kw={
"height_ratios": (.15, .85)},
figsize=(14, 6))
sns.boxplot(x, ax=ax_box)
bins = np.arange(0, x.max() + bin_incr, bin_incr)
    x.hist(grid=False, bins=bins, ax=ax_hist)
ax_box.set(yticks=[])
ax_hist.set_ylabel(y_label)
ax_hist.set_xlabel(x_label)
sns.despine(ax=ax_hist)
sns.despine(ax=ax_box, left=True)
hist_box_plot(nan_perc, x_label='Proportion of missing values', y_label='% of features', bin_incr=0.05);
The majority of features have less than 20% missing values. There are six features with a considerably higher number of missing values.
# Which are the columns with the highest proportion of NaN values?
high_nan = nan_perc[nan_perc>0.2]
high_nan
# Merge feat_info with nan_perc
nan_perc_df = pd.DataFrame(nan_perc, columns=['perc_missing'])
feat_nan = (feat_info
.merge(nan_perc_df, left_on='attribute', right_index=True)
.drop(columns='missing_or_unknown', axis=1)
)
# Split the features into four categories by NaN proportion:
# High (>20%)
# Moderate (10-20%)
# Low (>0-10%)
# Zero (no NaNs)
feat_nan['nan_cat'] = pd.cut(feat_nan.perc_missing,
bins=[-np.inf, 0, 0.1, 0.2, np.inf],
labels=['zero', 'low', 'moderate', 'high'],
)
feat_nan.head()
fig, ax1 = plt.subplots(figsize=(14,8))
order = ['person', 'household', 'building', 'postcode', 'community',
'microcell_rr3', 'microcell_rr4', 'region_rr1', 'macrocell_plz8']
hue_order = ['zero', 'low', 'moderate', 'high']
sns.countplot(data=feat_nan,
x='information_level',
hue='nan_cat',
order=order,
hue_order=hue_order,
ax=ax1)
plt.legend(loc='upper right')
sns.despine(ax=ax1);
Observations

- The person information level has features with zero NaNs.
- NaN features (>0% - 10%) are distributed among the person, household and region information levels.
- NaN features (10% - 20%) is the only category distributed among all information levels.
- NaN features (>20%) are distributed among the person, household and microcell_rr3 information levels.

fig, ax1 = plt.subplots(figsize=(14,8))
order = ['person', 'household', 'building', 'postcode', 'community',
'microcell_rr3', 'microcell_rr4', 'region_rr1', 'macrocell_plz8']
hue_order = ['zero', 'low', 'moderate', 'high']
sns.swarmplot(data=feat_nan,
x='information_level',
y='perc_missing',
hue='nan_cat',
order=order,
hue_order=hue_order,
ax=ax1)
plt.legend(loc='upper right')
sns.despine(ax=ax1);
Observations

- Features within the same information level tend to have very similar percentages of missing values, especially for postcode, community, microcell_rr3 and 4 and macrocell_plz8. This suggests that there exists some methodical bias in creating the NaNs for each information level.
- The household information level has the most evenly distributed NaN categories.
- person and household consist mainly of moderate NaN categories.

# Look at the similar values of microcell_rr3
feat_nan[(feat_nan.information_level == 'microcell_rr3') & (feat_nan.nan_cat == 'moderate')]
This confirms the observation that features of the same attribute type (the KBA05 features classify according to the number of family houses and buildings in the microcell) have the exact same percentage of missing values.
The outlying columns with a high percentage of NaNs will be removed.
cols_to_remove = feat_nan[feat_nan.nan_cat == 'high'].attribute
cols_to_remove
df = df.drop(columns=cols_to_remove, axis=1)
df.shape
The total percentage of NaNs after converting the missing value encodings to NaNs is 11.05%.
Fig1 observations: The majority of features (79 out of 85) have less than 20% missing values. There are six features with a considerably higher number of missing values. These features are:

- TITEL_KZ: Academic title flag (1: Dr., 4: Prof. Dr., 5: other)
- AGER_TYP: Best-ager typology (1: passive elderly, 3: experience-driven elderly)
- KK_KUNDENTYP: Consumer pattern over past 12 months (1: regular customer, 6: passive customer). The overlap with SHOPPER_TYP and the high level of missing values would suggest that this feature can be dropped without too much loss of information.
- KBA05_BAUMAX: Most common building type within the microcell (1: mainly 1-2 family homes in the microcell, 4: mainly 10+ family homes in the microcell, 5: mainly business buildings in the microcell). The overlap with the other KBA05_ features and the high level of missing values suggests that this feature can be dropped without too much information loss.
- GEBURTSJAHR: Year of birth (Numeric). The overlap with ALTERSKATEGORIE_GROB suggests that this feature can be dropped without too much information loss.
- ALTER_HH: Birthdate of head of household (1: 1895-01-01 to 1899-12-31, 21: 1995-01-01 to 1999-12-31)

Fig2 observations:
- The person information level has features with zero NaNs.
- NaN features (>0% - 10%) are distributed among the person, household and region information levels.
- NaN features (10% - 20%) is the only category distributed among all information levels.
- NaN features (>20%) are distributed among the person, household and microcell_rr3 information levels.

Fig3 observations:
- Features within the same information level tend to have very similar percentages of missing values, especially for postcode, community, microcell_rr3 and 4 and macrocell_plz8. For example, the KBA05-type features with a moderate level of NaNs all have 14.6% of missing data. This suggests that there may exist some methodical bias in creating the NaNs for each information level.
- The household information level has the most evenly distributed NaN categories.
- person and household consist mainly of moderate NaN categories.

All six outlier columns with a high percentage of NaNs (>20%) will be removed.
df['row_nan_perc'] = df.isnull().mean(axis=1)
hist_box_plot(df['row_nan_perc'], x_label='Proportion of missing values', y_label='% of rows', bin_incr=0.01)
We will use 10% as the cut-off threshold to split the data into two separate subsets and compare their distributions.
threshhold = 0.1
df_lowna = df.loc[df['row_nan_perc'] < threshhold, :]
df_highna = df.loc[df['row_nan_perc'] > threshhold, :]
df_lowna.shape
# percentage of low na rows
df_lowna.shape[0] / df.shape[0]
df_highna.shape
# percentage of high na rows
df_highna.shape[0] / df.shape[0]
# find columns with no nans at lowna
df_lowna.dropna(axis=1).shape
# find columns with no nans at highna
df_highna.dropna(axis=1).shape
# get the common columns
common_cols = list(set(df_lowna.dropna(axis=1).columns).intersection(df_highna.dropna(axis=1).columns))
common_cols
# Drop the 'row_nan_perc'
common_cols.remove('row_nan_perc')
# compare distributions
for col in common_cols[:6]:
fig, axes = plt.subplots(1,2, figsize=(14, 6), sharey=True)
sns.countplot(df_lowna[col], ax=axes[0], color='b')
sns.countplot(df_highna[col], ax=axes[1], color='r')
Observations
# Statistical comparison of distributions
for col in common_cols:
print(col)
print(chisquare(df_highna[col].value_counts().sort_index(),
df_lowna[col].value_counts().sort_index(),
axis=0),
'\n')
The p-value is effectively 0 for all the chi-squared tests, meaning that the probability of observing the high-NaN subset's frequencies, if they followed the distribution of the low-NaN subset, is about 0%. This is statistically significant and supports the assumption that the two distributions are different [5].
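A caveat (my assumption, not part of the original analysis): scipy.stats.chisquare expects the observed and expected counts to have the same total, and the low-NaN subset is much larger than the high-NaN one. A sketch that rescales the expected frequencies first, assuming both subsets contain the same category levels for these columns:

for col in common_cols:
    f_obs = df_highna[col].value_counts().sort_index()
    f_exp = df_lowna[col].value_counts().sort_index()
    # rescale the expected counts to the observed total before testing
    f_exp = f_exp / f_exp.sum() * f_obs.sum()
    print(col, chisquare(f_obs, f_exp), '\n')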
Fig4 observations:
We used 10% as the cut-off threshold to split the data into two separate subsets and compared their distributions using 5 sample columns that do not have any NaN values.
Fig5 observations:
We will continue the analysis with the dataset that contains rows with less than 10% NaNs
# Keep only the rows with NaNs perc below 10%
df = df_lowna
# Drop the row nan percentage column
df = df[df.columns[:-1]]
df.head()
# Drop the missing_or_unknown column from the
# processed feat_info Df
feat_info = feat_info[feat_info.columns[:-1]]
feat_info.head()
# filter feat_info to the remaining columns
feat_info = feat_info[feat_info.attribute.isin(df.columns)]
feat_info.shape
fig, ax1 = plt.subplots(figsize=(14,8))
sns.countplot(data=feat_info,
x='type',
color='c',
ax=ax1)
ax1.set_ylabel('No of features')
ax1.set_xlabel('Data Type')
for p in ax1.patches:
ax1.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.35, p.get_height()+1))
sns.despine(ax=ax1);
# Get categorical features
feat_info[feat_info.type == 'categorical']
The majority of the categorical features are of the person information_level.
# How many of each information_level
feat_info[feat_info.type == 'categorical'].groupby('information_level').count().iloc[:,-1]
# Get the categorical feature names
df_cat_cols = feat_info[feat_info.type == 'categorical'].attribute
df_cat_cols
# Find the number of levels
df[df_cat_cols].nunique()
# Count the number of levels, sorted
df[df_cat_cols].nunique().sort_values()
# how many binary
sum(df[df_cat_cols].nunique() == 2)
# how many multilevel
sum(df[df_cat_cols].nunique() != 2)
df[df_cat_cols].dtypes
The last three columns are of dtype object. Let's have a look at them.
df[df_cat_cols[-3:]].sample(10)
- OST_WEST_KZ: Building location via former East / West Germany (GDR / FRG) (O: East (GDR), W: West (FRG)).
- CAMEO_DEUG_2015: German CAMEO: Wealth / Life Stage Typology, rough scale (1: upper class, 9: urban working class), currently of object dtype. Since it is a multilevel categorical feature that will be transformed to dummy variables or dropped, it can stay like this.
- CAMEO_DEU_2015 should be a float64 data type but is encoded as object. It is a more detailed version of CAMEO_DEUG_2015 and the two would probably be highly correlated; further investigation should be done.

# Reencode the binary OST_WEST_KZ to numeric categories
df = df.copy()
df['OST_WEST_KZ'] = df['OST_WEST_KZ'].replace({'O': 0,
'W': 1})
# test
df[df_cat_cols[-3:]].sample(10)
# Choose the multilevel categorical features
cat_nunique = df[df_cat_cols].nunique()
cat_to_encode = cat_nunique[cat_nunique>2]
cat_to_encode.sort_values()
# DROPPING THE MULTI-LEVEL CATEGORICAL COLUMNS IN THIS ITERATION
df = df.drop(columns=cat_to_encode.index)
df.info()
The dataset contains 18 categorical variables, 5 of which are binary and 13 are multilevel.
- OST_WEST_KZ: Building location via former East / West Germany (GDR / FRG) (O: East (GDR), W: West (FRG)) has been re-encoded to numeric categories (0/1).
- CAMEO_DEUG_2015: German CAMEO: Wealth / Life Stage Typology, rough scale (1: upper class, 9: urban working class).

The rest of the binary features will be kept as is. The multilevel categorical features need to be one-hot encoded in order to be used for feature transformation and training our model. They are:
Feature | # of Levels |
---|---|
NATIONALITAET_KZ | 3 |
SHOPPER_TYP | 4 |
LP_FAMILIE_GROB | 5 |
LP_STATUS_GROB | 5 |
CJT_GESAMTTYP | 6 |
FINANZTYP | 6 |
ZABEOTYP | 6 |
GEBAEUDETYP | 7 |
CAMEO_DEUG_2015 | 9 |
LP_STATUS_FEIN | 10 |
LP_FAMILIE_FEIN | 11 |
GFK_URLAUBERTYP | 12 |
CAMEO_DEU_2015 | 44 |
Since some of the information they contain is covered by other features in the dataset (CAMEO_DEUG_2015 for CAMEO_DEU_2015, the FINANZ_ features for FINANZTYP, and LP_STATUS_FEIN and LP_STATUS_GROB for LP_FAMILIE_FEIN and LP_FAMILIE_GROB), we chose to drop them in this iteration of the analysis, to keep things more straightforward and not increase the number of variables too much at this phase.
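For a later iteration, a minimal sketch of the alternative (keeping the multi-level categorical features and one-hot encoding them instead of dropping them); here df_with_cats is a hypothetical name for the cleaned dataframe before the drop in the cell above:

# One-hot encode the multi-level categorical columns listed in cat_to_encode
df_onehot = pd.get_dummies(df_with_cats,
                           columns=cat_to_encode.index.tolist(),
                           prefix_sep='_')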
# Get mixed dtype features
feat_info[feat_info.type == 'mixed']
Data Dictionary:
PRAEGENDE_JUGENDJAHRE: Dominating movement of person's youth (avantgarde vs. mainstream; east vs. west)
# Let's have a look at it
df.PRAEGENDE_JUGENDJAHRE.sample(10)
# Let's have a look at it
df.PRAEGENDE_JUGENDJAHRE.nunique()
plt.style.use('ggplot')
fig, ax1 = plt.subplots(figsize=(14,8))
sns.countplot(data=df,
x='PRAEGENDE_JUGENDJAHRE',
color="darkslategray",
ax=ax1);
df.PRAEGENDE_JUGENDJAHRE.value_counts()
df.PRAEGENDE_JUGENDJAHRE.isnull().sum()
df.PRAEGENDE_JUGENDJAHRE.isnull().mean() * 100
Observations

- Level 14 (90s - digital media kids (Mainstream, E+W)) leads with close to 172952 rows, followed by level 8 (70s - family orientation (Mainstream, E+W)) with 134165 observations.
- Levels 7 (60s - opponents to the building of the Wall (Avantgarde, E)), 13 (80s - Swords into ploughshares (Avantgarde, E)) and 2 (40s - reconstruction years (Avantgarde, E+W)) have the least observations with 3888, 5300 and 7340 rows respectively.
- NaNs: 2.16 % of the total.

New Variables Data encoding:
- ENG_DECADE: Person's decade of youth.
- ENG_MOVEMENT: Person's movement alignment.
decade_dict = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3, 7: 3,
8: 4, 9: 4, 10: 5, 11: 5, 12: 5, 13: 5,
14: 6, 15: 6
}
movement_dict = {1: 1, 3: 1, 5: 1, 8: 1, 10: 1, 12: 1, 14: 1,
2: 2, 4: 2, 6: 2, 7: 2, 9: 2, 11: 2, 13: 2, 15: 2
}
# It appears that map is approximately 10x faster than replace.
# Ref: https://stackoverflow.com/questions/20250771/remap-values-in-pandas-column-with-a-dict
df['ENG_DECADE'] = df['PRAEGENDE_JUGENDJAHRE'].map(decade_dict)
df['ENG_MOVEMENT'] = df['PRAEGENDE_JUGENDJAHRE'].map(movement_dict)
ENG_DECADE categories

fig, ax1 = plt.subplots(figsize=(14,8))
sns.countplot(data=df,
x='ENG_DECADE',
palette="Greens",
ax=ax1);
df.ENG_DECADE.value_counts()
# Sanity check
df.ENG_DECADE.isnull().sum()
Observations

- Level 6 (90s) leads with close to 212488 rows, followed by level 4 (70s) with 166658 observations, consistent with the original feature.
- Levels 1 (40s) and 2 (50s) have the least observations with 26923 and 71852 rows respectively.
- NaNs: 2.16 % of the feature's total values.

ENG_MOVEMENT categories

fig, ax1 = plt.subplots(figsize=(14,8))
sns.countplot(data=df,
x='ENG_MOVEMENT',
palette="Greens_r",
ax=ax1);
df.ENG_MOVEMENT.value_counts()
# Sanity check
df.ENG_MOVEMENT.isnull().sum()
Observations

- 1: Mainstream leads with close to 564070 rows, almost 3.5 times more than 2: Avantgarde with 166918 observations.
- NaNs: 2.16 % of the total.

Data Dictionary:
CAMEO_INTL_2015: Wealth / Life Stage Typology, mapped to international code
# Let's have a look at it
df.CAMEO_INTL_2015.head(10)
CAMEO_INTL_2015 categories

fig, ax1 = plt.subplots(figsize=(14,8))
levels_sort = [str(int(l)) for l in np.sort(df.CAMEO_INTL_2015.astype(float).unique())[:-1]]
sns.countplot(data=df,
x='CAMEO_INTL_2015',
color = 'darkslategray',
order = levels_sort,
ax=ax1);
df.CAMEO_INTL_2015.value_counts()
df.CAMEO_INTL_2015.isnull().sum()
df.CAMEO_INTL_2015.isnull().mean() * 100
Observations

- 51: Poorer Households - Pre-Family Couples & Singles leads with 128033 rows, followed by level 41: Less Affluent Households - Pre-Family Couples & Singles with 87902 observations, almost 1/3 less.
- 33: Comfortable Households - Families With School Age Children, 32: Comfortable Households - Young Couples With Children and 35: Comfortable Households - Elders In Retirement have the least observations with 9161, 9777 and 9882 rows respectively.
- NaNs: 4.32% of its total.

CAMEO_INTL_2015 combines information on two axes: wealth and life stage. We break up the two-digit codes by their 'tens'-place and 'ones'-place digits into two new ordinal variables (which, for the purposes of this project, is equivalent to just treating them as their raw numeric values).
New Variables Data encoding:

- ENG_WEALTH: Household's wealth
- ENG_LIFE_STAGE: Person's Life Stage
def breakup_codes(x, digit):
''' Breaks up two-digit codes by their
'tens'-place and 'ones'-place digits
into two new ordinal variables.
Leaves NaNs unchanged'''
if not pd.isna(x):
if digit == 'first':
return int(str(x)[0])
elif digit == 'second':
return int(str(x)[1])
return x
df['ENG_WEALTH'] = df['CAMEO_INTL_2015'].apply(breakup_codes, digit='first')
df['ENG_LIFE_STAGE'] = df['CAMEO_INTL_2015'].apply(breakup_codes, digit='second')
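A vectorized alternative to the apply above (my sketch; it assumes the CAMEO_INTL_2015 codes parse as two-digit numbers, with NaNs propagating through the arithmetic):

cameo_num = pd.to_numeric(df['CAMEO_INTL_2015'], errors='coerce')
wealth_alt = cameo_num // 10      # 'tens'-place digit -> wealth
life_stage_alt = cameo_num % 10   # 'ones'-place digit -> life stage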
ENG_WEALTH categories

fig, ax1 = plt.subplots(figsize=(14,8))
sns.countplot(data=df,
x='ENG_WEALTH',
color = 'teal',
ax=ax1);
df.ENG_WEALTH.value_counts()
# Sanity check
df.ENG_WEALTH.isnull().sum()
Observations

- 5: Poorer Households leads with close to 213978 rows, followed by level 4: Less Affluent Households with 181187 observations, consistent with the original feature.
- 3: Comfortable Households and 1: Wealthy Households have the least observations with 62642 and 111831 rows respectively.

ENG_LIFE_STAGE categories

fig, ax1 = plt.subplots(figsize=(14,8))
sns.countplot(data=df,
x='ENG_LIFE_STAGE',
color = 'c',
ax=ax1);
df.ENG_LIFE_STAGE.value_counts()
# Sanity check
df.ENG_LIFE_STAGE.isnull().sum()
mixed_feat_cols = feat_info[feat_info.type == 'mixed'].attribute
mixed_feat_cols
# How many levels per feature
df[mixed_feat_cols].nunique()
# Drop mixed dtype features
df = df.drop(mixed_feat_cols, axis=1)
df.head()
Observations

- 1: Pre-Family Couples & Singles leads with 232220 rows, followed by level 4: Older Families & Mature Couples with 220077 observations.
- 2: Young Couples With Children and 3: Families With School Age Children have the least observations with 71973 and 108061 rows respectively.

The dataset contains 6 mixed-type variables.
The mixed feature PRAEGENDE_JUGENDJAHRE: Dominating movement of person's youth (avantgarde vs. mainstream; east vs. west) (1: 40s - war years (Mainstream, E+W), 15: 90s - ecological awareness (Avantgarde, E+W)) has been explored, with the following observations (Fig7):

- 14 (90s - digital media kids (Mainstream, E+W)) leads with close to 172952 rows, followed by level 8 (70s - family orientation (Mainstream, E+W)) with 134165 observations.
- 7 (60s - opponents to the building of the Wall (Avantgarde, E)), 13 (80s - Swords into ploughshares (Avantgarde, E)) and 2 (40s - reconstruction years (Avantgarde, E+W)) have the least observations with 3888, 5300 and 7340 rows respectively.
- NaNs: 2.16 % of its total.

and two new variables have been engineered:
ENG_DECADE: Person's decade of youth (1: 40s, 6: 90s), Fig8 observations:

- Level 6 (90s) leads with close to 212488 rows, followed by level 4 (70s) with 166658 observations, consistent with the original feature.
- Levels 1 (40s) and 2 (50s) have the least observations with 26923 and 71852 rows respectively.

ENG_MOVEMENT: Person's movement alignment (1: Mainstream, 2: Avantgarde), Fig9 observations:

- 1: Mainstream leads with close to 564070 rows, almost 3.5 times more than 2: Avantgarde with 166918 observations.
- NaNs: 2.16 % of the total.

The mixed feature CAMEO_INTL_2015: German CAMEO: Wealth / Life Stage Typology, mapped to international code (11: Wealthy Households - Pre-Family Couples & Singles, 55: Poorer Households - Elders In Retirement), has been explored, with the following observations (Fig10):
- 51: Poorer Households - Pre-Family Couples & Singles leads with 128033 rows, followed by level 41: Less Affluent Households - Pre-Family Couples & Singles with 87902 observations, almost 1/3 less.
- 33: Comfortable Households - Families With School Age Children, 32: Comfortable Households - Young Couples With Children and 35: Comfortable Households - Elders In Retirement have the least observations with 9161, 9777 and 9882 rows respectively.
- NaNs: 4.32% of its total.

and two new variables have been engineered:
ENG_WEALTH: Household's wealth (1: Wealthy Households, 5: Poorer Households), Fig11 observations:

- 5: Poorer Households leads with close to 213978 rows, followed by level 4: Less Affluent Households with 181187 observations, consistent with the original feature.
- 3: Comfortable Households and 1: Wealthy Households have the least observations with 62642 and 111831 rows respectively.

ENG_LIFE_STAGE: Person's Life Stage (1: Pre-Family Couples & Singles, 5: Elders In Retirement), Fig12 observations:

- 1: Pre-Family Couples & Singles leads with 232220 rows, followed by level 4: Older Families & Mature Couples with 220077 observations.
- 2: Young Couples With Children and 3: Families With School Age Children have the least observations with 71973 and 108061 rows respectively.

The rest of the mixed features are:
Feature | # of Levels |
---|---|
LP_LEBENSPHASE_FEIN | 40 |
LP_LEBENSPHASE_GROB | 12 |
WOHNLAGE | 8 |
PLZ8_BAUMAX | 5 |
which we chose to drop in this iteration of the analysis, to keep things more straightforward and not increase the number of variables too much at this phase.
df.columns
df.shape
# Check if all colums are of numeric dtype
from pandas.api.types import is_numeric_dtype
#should be equal to the no of cols if all True
assert(sum([is_numeric_dtype(df[col]) for col in df.columns]) == df.shape[1])
print('Tests OK')
def clean_data(df):
"""
Perform feature trimming, re-encoding, and engineering for demographics
data
INPUT: Demographics DataFrame
OUTPUT: Trimmed and cleaned demographics DataFrame
"""
# Put in code here to execute all main cleaning steps:
# convert missing value codes into NaNs, ...
df = convert_missingdata_encodings_to_nans(df)
print('Convert missing value codes into NaN, OK')
# remove selected columns , ...
df = drop_selected_columns(df)
# remove selected rows and return both dfs
df, df_cust_highna = drop_selected_rows(df)
print('Remove selected columns and rows, OK')
# select, re-encode, and engineer column values.
df = reencode_categorical_feat(df)
df = engineer_mixed_feat(df)
print('Select, re-encode, and engineer column value, OK')
# Return the cleaned dataframe and the high NaN Df.
return df, df_cust_highna
def convert_missingdata_encodings_to_nans(df, debug=False):
'''Convert to nans using the missing_data_encode dictionary'''
for col in missing_data_encode.keys():
if debug:
print(col, 'nans before: ', df[col].isnull().sum())
df.loc[df[col].isin(missing_data_encode[col]), col] = np.nan
if debug:
print(col, 'nans after: ', df[col].isnull().sum(), '\n')
return df
def drop_selected_columns(df):
    '''Drop the columns with a high percentage of NaNs (cols_to_remove).'''
    return df.drop(columns=cols_to_remove)
def drop_selected_rows(df):
    '''Split the dataset into two parts:
    one with a percentage of NaNs per row over the threshold
    and one below. Return both.'''
threshhold = 0.1
df['row_nan_perc'] = df.isnull().mean(axis=1)
df_lowna = df.loc[df['row_nan_perc'] < threshhold, :].\
drop(columns='row_nan_perc', axis=1)
df_highna = df.loc[df['row_nan_perc'] > threshhold, :].\
drop(columns='row_nan_perc', axis=1)
return df_lowna, df_highna
def reencode_categorical_feat(df):
    '''Re-encode the binary OST_WEST_KZ to numeric categories and
    drop the multi-level categorical features.'''
# Reencode the binary OST_WEST_KZ to numeric categories
df['OST_WEST_KZ'] = df['OST_WEST_KZ'].replace({'O': 0,
'W': 1})
# choose the categorical features with more than 2 levels
df_cat_cols = feat_info[feat_info.type == 'categorical'].attribute
cat_nunique = df[df_cat_cols].nunique()
cat_to_encode = cat_nunique[cat_nunique > 2]
# DROP THE MULTI-LEVEL CATEGORICAL FEATURES
df = df.drop(columns=cat_to_encode.index, axis=0)
return df
def engineer_mixed_feat(df):
    '''Engineer new variables from the
    mixed features and drop the old features.'''
# Engineer two new variables from 'PRAEGENDE_JUGENDJAHRE'
decade_dict = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3, 7: 3,
8: 4, 9: 4, 10: 5, 11: 5, 12: 5, 13: 5,
14: 6, 15: 6
}
movement_dict = {1: 1, 3: 1, 5: 1, 8: 1, 10: 1, 12: 1, 14: 1,
2: 2, 4: 2, 6: 2, 7: 2, 9: 2, 11: 2, 13: 2, 15: 2
}
df['ENG_DECADE'] = df['PRAEGENDE_JUGENDJAHRE'].map(decade_dict)
df['ENG_MOVEMENT'] = df['PRAEGENDE_JUGENDJAHRE'].map(movement_dict)
# Engineer two new variables from 'CAMEO_INTL_2015'
df['ENG_WEALTH'] = df['CAMEO_INTL_2015'].apply(
breakup_codes, digit='first')
df['ENG_LIFE_STAGE'] = df['CAMEO_INTL_2015'].apply(
breakup_codes, digit='second')
# Drop mixed dtype features
mixed_feat_cols = feat_info[feat_info.type == 'mixed'].attribute
df = df.drop(mixed_feat_cols, axis=1)
return df
def test_clean_data(df):
'''A simple test function to test that the cleaning function
is at least producing an identical Dataframe as the one we created
TODO: proper unit tests'''
    df_test = pd.read_csv('Demographics.csv', delimiter=';')
print('Load the dataset, OK')
pd.testing.assert_frame_equal(clean_data(df_test)[0], df)
print('Tests Passed!')
test_clean_data(df)
# Percentage of nans per column
(df.isnull().mean(axis=0) * 100).sort_values(ascending=False)[:15]
W_KEIT_KIND_HH has the highest percentage of missing values (6%), with REGIOTYP and KKK next at 5.4%.
# Percentage of nans per row
nan_perc = (df.isnull().mean(axis=1) * 100).sort_values(ascending=False)
#How many rows have nans?
nan_perc[nan_perc > 0].shape[0]
# as a percentage of rows
print('Total percentage of rows that have NaNs: ',
round(nan_perc[nan_perc > 0].shape[0]/ df.shape[0] * 100, 4),
'%')
df.isnull().sum().sum()
np.product(df.shape)
# How many nans as perc?
print('Total percentage of Nans : ',
round(df.isnull().sum().sum() / np.product(df.shape) * 100, 2),
'%')
# Fit the Scaler without NaNs
df_drop = df.dropna(axis=0)
scaler = StandardScaler()
scaler.fit(df_drop)
# Impute missing values in the full df with the mean
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
X_impute = imp.fit_transform(df)
X_impute
# Transform the df with the fitted scaler
X_scale = scaler.transform(X_impute)
# Check the distribution of the feature with the highest perc of NaNs
df_scale = pd.DataFrame(X_scale, columns=df.columns)
df_scale.W_KEIT_KIND_HH.describe()
# check the distributions
df_scale.describe()
# How many nans as perc after?
print('Total percentage of Nans after drop: ',
round(df_drop.isnull().sum().sum() / np.product(df_drop.shape) * 100, 2),
'%')
The total percentage of NaNs in the dataset is 0.67%, while the percentage of rows that contain NaNs is 16.37%.
Dropping those rows would discard a significant amount of data in order to handle a much smaller amount of actually missing values.
Imputing them with the mean would result in them having a value of 0 after using StandardScaler(), possibly reducing the variance of the resulting variables.
The alternative used here is to fit the StandardScaler to the data without the NaNs, then impute the missing values with the mean and transform the full dataset with the fitted StandardScaler object, resulting in distributions that retain a higher amount of information.
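A minimal sketch wrapping this approach so that the exact same fitted imputer and scaler can later be re-applied to the customer data (the helper name impute_and_scale is mine, not part of the project template):

def impute_and_scale(data, imputer, scaler):
    '''Mean-impute the NaNs, then scale with a StandardScaler
    that was fitted on the complete (NaN-free) rows only.'''
    return scaler.transform(imputer.transform(data))

# e.g. X_scale = impute_and_scale(df, imp, scaler)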
X_scale.shape
# With all the features
pca = PCA().fit(X_scale)
num_components=len(pca.explained_variance_ratio_)
inds = np.arange(num_components)
vals = pca.explained_variance_ratio_
fig, ax = plt.subplots(figsize=(12,9))
plt.bar(inds, vals, color='b');
fig, ax = plt.subplots(figsize=(16,10))
plt.plot(np.cumsum(pca.explained_variance_ratio_), marker='o', color='b')
ax.set_xticks(inds+1, minor=False);
The numbers that stand out are 12, 18 and 25.
def scree_plot(pca, c, ann_f=6, figsize=(12, 9), pct_change=False):
'''
Creates a scree plot associated with the principal components
    INPUT: pca - the result of instantiation of PCA in scikit-learn
OUTPUT:
None
'''
num_components = len(pca.explained_variance_ratio_)
ind = np.arange(num_components) + 1
if pct_change:
vals = -pd.Series(pca.explained_variance_ratio_).pct_change()
# the first item is NaN
vals = vals[1:].tolist()
ind = ind[1:]
else:
vals = pca.explained_variance_ratio_
plt.figure(figsize=figsize)
    ax = plt.subplot(111)
cumvals = np.cumsum(vals)
ax.bar(ind, vals, color=c)
ax.plot(ind, cumvals, color=c)
for i in range(num_components):
try: # Bandage aid - need to fix
ax.annotate(r"%s%%" % (
(str(vals[i] * 100)[:4])),
(ind[i] + 0.2, vals[i]),
va="bottom", ha="center", fontsize=ann_f)
except IndexError:
pass
ax.xaxis.set_tick_params(width=0)
ax.set_xticks(ind, minor=False)
ax.yaxis.set_tick_params(width=2, length=12)
ax.set_xlabel("Principal Component")
ax.set_ylabel("Variance Explained (%)")
plt.title('Explained Variance Per Principal Component')
# Re-apply PCA to the data while selecting for number of components to retain.
pca_25 = PCA(n_components=25)
X_pca_25 = pca_25.fit(X_scale)
scree_plot(pca_25, c='slategray', ann_f=8, figsize=(12,9))
Observations
There seems to be a stabilization of the variance explained and a decrease in the ratio of cumulative variance explained around 12 components.
np.cumsum(pca.explained_variance_ratio_[:12])[-1]
The cumulative explained variance of the first 12 components is 66.21%.
# Fit on X for 12 components
pca = PCA(n_components=12)
X_pca = pca.fit_transform(X_scale)
X_pca.shape
df_pca = pd.DataFrame(X_pca)
df_pca.head()
df_pca.describe()
We applied PCA() dimensionality reduction to the scaled dataset. By observing the combined plot of explained variance and cumulative explained variance per principal component (Fig15), we chose to keep 12 components, as at this number the gain in cumulative explained variance from adding new components starts to drop off.
The cumulative explained variance of the first 12 components is 66.21%.
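As a cross-check (my sketch, not part of the original workflow), scikit-learn's PCA also accepts a float n_components and keeps just enough components to reach that share of explained variance, which should land close to the 12 components chosen above:

pca_var = PCA(n_components=0.66, svd_solver='full')
X_pca_var = pca_var.fit_transform(X_scale)
print(pca_var.n_components_, 'components retained')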
def map_pca_weights_to_feats(pca, df, comp_no):
'''Map pca weights to individual features
    and return two pd.Series, one with the highest
positive weights and one with the lowest negative
weights'''
weights = pd.DataFrame(np.round(pca.components_, 4), columns=df.keys())
component = weights.iloc[comp_no - 1, :]
comp_pos = component[component > 0].sort_values(ascending=False)
comp_neg = component[component < 0].sort_values(ascending=True)
return comp_pos, comp_neg
def create_bar_table(comp_pos, comp_neg):
'''Create and display a conditionally styled pandas dataframe
    for the interpretation of the positive and negative weights
of PCA and centroid distances for KMeans'''
head = """
<table>
"""
row = ""
for serie in [comp_pos[:15],comp_neg[:15]]:
s = serie.copy()
s.name=''
row += "<td>{}</td>".format(s.to_frame().style.bar(
align='mid',
color=['#d65f5f', '#5fba7d'],
width=100).render()
)
row += '</tr>'
head += row
head+= """
</table>"""
display(HTML(head))
# the weights df
weights = pd.DataFrame(np.round(pca.components_, 4), columns = df_drop.keys())
weights.head()
pd.Series(X_pca[:, 1]).describe()
comp_pos1, comp_neg1 = map_pca_weights_to_feats(pca, df_drop, 1)
create_bar_table(comp_pos1, comp_neg1)
The first component seems to capture the population density and financial affluence of the person.
PLZ8_ANTG3, PLZ8_ANTG4, ENG_WEALTH, HH_EINKOMMEN_SCORE, ORTSGR_KLS9 and EWDICHTE have high positive weights (people that live in more densely populated areas and with lower financial affluence).
The assumption is supported by the negative weights of MOBI_REGIO, KBA05_ANTG1, PLZ8_ANTG1 and FINANZ_MINIMALIST (people that live in less densely populated areas and with higher financial affluence).
comp_pos2, comp_neg2 = map_pca_weights_to_feats(pca, df_drop, 2)
create_bar_table(comp_pos2, comp_neg2)
The second component seems to capture the age, generation and culture (financial, buying and other) of the person.
RETOURTYP_BK_S, SEMIO_ERL, SEMIO_LUST, ALTERSKATEGORIE_GROB and FINANZ_VORSORGER have high positive weights (older people with lower financial preparation and home ownership, less sensual and less event-oriented).
The assumption is supported by the negative weights of SEMIO_REL, ENG_DECADE, FINANZ_SPARER and SEMIO_TRADV (younger generation, with lower religious affinity, lower money saving, less inconspicuous and less traditional-minded).
comp_pos3, comp_neg3 = map_pca_weights_to_feats(pca, df_drop, 3)
create_bar_table(comp_pos3, comp_neg3)
The third component seems to capture the gender of the person and the corresponding personality types (dreamfulness, dominance, rationality, social and family attitude).
SEMIO_VERT, SEMIO_SOZ, SEMIO_FAM, SEMIO_KULT, FINANZ_MINIMALIST and RETOURTYP_BK_S have high positive weights, which, in conjunction with the very strong negative weight of ANREDE_KZ, can refer to men that are less dreamful, less socially-minded, less cultural-minded, not financial minimalists and conservative return-shopper types.
The assumption is supported by the negative weights of ANREDE_KZ, SEMIO_KAEM, SEMIO_DOM, SEMIO_KRIT, SEMIO_RAT and FINANZ_ANLEGER (women that are less dominant-minded, less critical-minded, less rational, have a less combative attitude and are less prone to investing).
1. The first component seems to capture the population density and financial status of the person - Fig16.
PLZ8_ANTG3, PLZ8_ANTG4, ENG_WEALTH, HH_EINKOMMEN_SCORE, ORTSGR_KLS9 and EWDICHTE have high positive weights, correlated with people that live in more densely populated areas and with lower financial affluence.
The assumption is supported by the negative weights of MOBI_REGIO, KBA05_ANTG1, PLZ8_ANTG1 and FINANZ_MINIMALIST, correlated with people that live in less densely populated areas and with higher financial affluence.
2. The second component seems to capture the age, generation and culture (financial, buying and other) of the person - Fig17.
RETOURTYP_BK_S, SEMIO_ERL, SEMIO_LUST, ALTERSKATEGORIE_GROB and FINANZ_VORSORGER have high positive weights, correlated with older people with lower financial preparation and home ownership that are less sensual and less event-oriented.
The assumption is supported by the negative weights of SEMIO_REL, ENG_DECADE, FINANZ_SPARER and SEMIO_TRADV, correlated with a younger generation, with lower religious affinity, lower money saving, less inconspicuous and less traditional-minded.
3. The third component seems to capture the gender of the person and the corresponding personality types (dreamfulness, dominance, rationality, social and family attitude) - Fig18.
SEMIO_VERT, SEMIO_SOZ, SEMIO_FAM, SEMIO_KULT, FINANZ_MINIMALIST and RETOURTYP_BK_S have high positive weights which, in conjunction with the very strong negative weight of ANREDE_KZ, can refer to men that are less dreamful, less socially-minded, less cultural-minded, not financial minimalists and conservative return-shopper types.
The assumption is supported by the negative weights of ANREDE_KZ, SEMIO_KAEM, SEMIO_DOM, SEMIO_KRIT, SEMIO_RAT and FINANZ_ANLEGER, correlated with women that are less dominant-minded, less critical-minded, less rational, have a less combative attitude and are less prone to investing.
#THIS TAKES A LONG TIME - DON'T RUN EVERY TIME!
# compute the average within-cluster distances.
# Over a number of different cluster counts...
scores = {}
for k in range(2, 31):
# run k-means clustering on the data and...
scores[k] = np.abs(KMeans(n_clusters=k, n_jobs=-1).fit(X_pca).score(X_pca))
    # print('Calculated: ', k)
# Plot relationship plot
fig, ax = plt.subplots(figsize=(16,10))
ax = pd.Series(scores).plot(marker='o', color='slategray')
ax.set_xticks(np.arange(2, 31), minor=False);
ax.set_xlabel("# of Clusters")
ax.set_ylabel("Total Distance to centroid");
# Chosen number of clusters
K=8
# Re-fit the k-means model with the selected number of clusters and obtain
# cluster predictions for the general population demographics data.
kmeans = KMeans(n_clusters=K, n_jobs=-1, random_state=0).fit(X_pca)
kmeans.labels_.shape
kmeans.cluster_centers_.shape
The curve in the distance vs. number of clusters plot (Fig19) is quite even. The most obvious elbow, where adding additional clusters becomes less effective in reducing the within-cluster distance, can be inferred to be at 8, and this number has been used as the number of clusters to fit the KMeans algorithm on the transformed demographics data and predict cluster labels.
Another elbow can be observed around 12 clusters, but k = 8 was preferred in this iteration in order to keep the number of clusters lower and see if meaningful separation can be observed with this number.
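As an additional check on the choice of k (a sketch, not part of the original analysis; see [4] for clustering-evaluation options), the silhouette score can be compared for k = 8 and k = 12 on a subsample, since computing it on all ~800k rows is expensive:

from sklearn.metrics import silhouette_score

np.random.seed(0)
sample_idx = np.random.choice(X_pca.shape[0], size=10000, replace=False)
X_sample = X_pca[sample_idx]
for k in (8, 12):
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(X_sample)
    print(k, 'clusters, silhouette score:', round(silhouette_score(X_sample, labels), 3))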
# Load in the customer demographics data.
customers = pd.read_csv('Customers.csv', delimiter=';')
customers.head()
df_cust, df_cust_highna = clean_data(customers)
df_cust.shape
df_cust_highna.shape
# Impute
X_cust_imp = imp.transform(df_cust)
# Scale
X_cust_sc = scaler.transform(X_cust_imp)
X_cust_pca = pca.transform(X_cust_sc)
X_cust_pca.shape
df_cust_pca = pd.DataFrame(X_cust_pca, columns=np.arange(1, 13))
df_cust_pca.describe()
X_cust_kmeans = kmeans.predict(X_cust_pca)
np.unique(X_cust_kmeans)
X_cust_kmeans.shape
Compare the proportion of data in each cluster for the customer data to the proportion of data in each cluster for the general population.
# General population dataset kmean labels
gen_clust_labels = kmeans.labels_
# Add the high NaN dataset as a cluster with label -1
gen_clust_labels_na = np.append(gen_clust_labels, [-1] * df_highna.shape[0])
# Make proportion table
gen_clust_freq = itemfreq(gen_clust_labels_na)
gen_proportion_col = gen_clust_freq[:, 1] / gen_clust_freq[:, 1].sum()
gen_clust_freq = np.c_[gen_clust_freq, gen_proportion_col]
# customer dataset
cust_clust_labels = X_cust_kmeans
# Add the high NaN dataset as a cluster with label -1
cust_clust_labels_na = np.append(cust_clust_labels, [-1] * df_cust_highna.shape[0])
# Make proportion table
cust_clust_freq = itemfreq(cust_clust_labels_na)
cust_proportion_col = cust_clust_freq[:, 1] / cust_clust_freq[:, 1].sum()
cust_clust_freq = np.c_[cust_clust_freq, cust_proportion_col]
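# Aside (not in the original notebook): scipy.stats.itemfreq is deprecated and
# removed in newer SciPy releases; np.unique with return_counts=True builds the
# same frequency table if these cells need to run on a recent environment.
labels_alt, counts_alt = np.unique(gen_clust_labels_na, return_counts=True)
gen_clust_freq_alt = np.c_[labels_alt, counts_alt, counts_alt / counts_alt.sum()]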
# Plot
fig, axes = plt.subplots(1,2, figsize=(16, 10), sharey=True)
sns.barplot(x=gen_clust_freq[:, 0], y=gen_clust_freq[:, 2], color='b', ax=axes[0]);
sns.barplot(x=cust_clust_freq[:, 0], y=cust_clust_freq[:, 2], color='r', ax=axes[1]);
axes[0].set_title('General Population')
axes[1].set_title('Customer Data');
# create plot
fig, ax = plt.subplots(figsize=(16, 10))
index = np.append([-1], np.arange(K))
bar_width = 0.35
opacity = 0.8
gen = plt.bar(index, gen_clust_freq[:, 2], bar_width,
alpha=opacity,
color='b',
label='General Population')
cust = plt.bar(index + bar_width, cust_clust_freq[:, 2], bar_width,
alpha=opacity,
color='g',
label='Customer Data')
plt.xlabel('# of Cluster')
plt.ylabel('Proportion of cluster')
plt.title('Proportions of Clusters for General and Customer Data')
plt.xticks(index + bar_width, index)
plt.legend()
plt.tight_layout()
sns.set(style="whitegrid")
sns.despine();
# Calculate difference of proportions
prop_diff = cust_clust_freq[:, 2] - gen_clust_freq[:, 2]
positive= prop_diff > 0
# create plot
fig, ax = plt.subplots(figsize=(14, 8))
index = np.append([-1], np.arange(K))
bar_width = 0.35
opacity = 0.8
diff = plt.bar(index, prop_diff, bar_width,
alpha=opacity,
color=np.vectorize({True: 'k', False: 'r'}.get)(positive)
)
plt.xlabel('# of Cluster')
plt.ylabel('Difference of proportions')
plt.title('Difference of Proportions of Clusters for General and Customer Data')
plt.xticks(index)
plt.tight_layout()
sns.set(style="whitegrid")
sns.despine();
Observations
# Add cluster labels
df_cust_pca['cluster'] = X_cust_kmeans
g = sns.pairplot(df_cust_pca,
vars = np.arange(1, 13),
diag_kind = 'kde',
hue="cluster",
palette="Paired");
Observations
Let's have a closer look at the first three principal components.
# Rename components for better reading
df_cust_pca = df_cust_pca.rename(columns = {1: 'pop_density-finance_status',
2: 'age-generation',
3: 'gender'}
)
g = sns.pairplot(df_cust_pca,
vars = ['pop_density-finance_status', 'age-generation', 'gender'],
diag_kind = 'kde',
hue="cluster",
palette="Paired");
Observations
def map_kmeans_weights_to_feats(kmeans, df, clust_no):
    '''Map a KMeans cluster centroid to the individual features
    (here: the principal components) and return two pd.Series, one with
    the highest positive values and one with the lowest negative values.'''
weights = pd.DataFrame(np.round(kmeans.cluster_centers_, 4), columns=df.keys())
centroid = weights.iloc[clust_no, :]
cent_pos = centroid[centroid > 0].sort_values(ascending=False)
cent_neg = centroid[centroid < 0].sort_values(ascending=True)
return cent_pos, cent_neg
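The centroids above live in PCA space; to read one in terms of the original features, the PCA and the scaling can be inverted (a sketch of mine, using cluster 3, the cluster examined as over-represented below):

centroid_orig = scaler.inverse_transform(
    pca.inverse_transform(kmeans.cluster_centers_[3].reshape(1, -1)))
centroid_series = pd.Series(centroid_orig.ravel(), index=df.columns)
centroid_series.sort_values(ascending=False)[:10]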
Fig20, Fig21 and Fig22 observations:
Fig22 observations:
# Overrepresented
over_pos, over_neg = map_kmeans_weights_to_feats(kmeans, df_cust_pca.iloc[:, :-1], 3)
create_bar_table(over_pos, over_neg)
The negative value for pop_density-finance_status and the strongly positive value for gender confirm the pairplot observations and strengthen the assumption that the overrepresented group are: men that are less dreamful, less socially-minded, less cultural-minded, not financial minimalists, conservative return-shopper types, that have a higher financial status and live in less densely populated areas. The negative weight for the age-generation principal component, in conjunction with the attitude characteristics of these men, suggests that they could be of relatively younger age.

# Underrepresented
under_pos, under_neg = map_kmeans_weights_to_feats(kmeans, df_cust_pca.iloc[:, :-1], 6)
create_bar_table(under_pos, under_neg)
The positive value for pop_density-finance_status and the strongly negative value for gender, along with the negative weight for the age-generation principal component, confirm the pairplot observations and strengthen the assumption that the underrepresented group are: younger women that are less dominant-minded, less critical-minded, less rational, have a less combative attitude and are less prone to investing, and live in densely populated areas with a lower financial status. They have lower religious affinity, a lower money-saving culture and are less traditional-minded.

[1]: https://stackoverflow.com/questions/20250771/remap-values-in-pandas-column-with-a-dict
[2]: https://stackoverflow.com/questions/16992713/translate-every-element-in-numpy-array-
[3]: https://stackoverflow.com/questions/36653443/retaining-nan-values-after-get-dummies-in-pandas
[4]: http://scikit-learn.org/stable/modules/clustering.html#clustering-evaluation
[5]: http://greenteapress.com/thinkstats2/html/thinkstats2010.html#sec94
[6]: https://github.com/jadianes/data-science-your-way/tree/master/03-dimensionality-reduction-and-clustering
[7]: https://www.mailman.columbia.edu/research/population-health-methods/cluster-analysis-using-k-means