Ioannis Karakasoglou Breier

Exploratory Analysis of the Titanic Dataset¶

Display ad for Titanic's first but never made sailing from New York on 20 April 1912

Project Overview¶

In this project, I investigate the Titanic Dataset with the use of the Python libraries Scipy, NumPy, Pandas, Matplotlib and Seaborn.

Dataset Information/ Data Dictionary/Variable Notes¶

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

Titanic Data - Contains demographics and passenger information from 891 of the 2224 passengers and crew on board the Titanic.

Variable	Definition	Key
Survived	Survival	0 = No, 1 = Yes
Pclass	Ticket class	1 = 1st, 2 = 2nd, 3 = 3rd
Sex	Sex
Age	Age in years
Sibsp	# of siblings / spouses aboard the Titanic
Parch	# of parents / children aboard the Titanic
Ticket	Ticket number
Fare	Passenger fare
Cabin	Cabin number
Embarked	Port of Embarkation	C = Cherbourg, Q = Queenstown,S = Southampton

Pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower

Age: Age is fractional if less than 1. If the age is estimated, is it in the form of xx.5

Sibsp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

Parch: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.

Initial Questions & Data Investigation¶

Did the Port of Embarkation affect the chances of surviving?.
How did the other factors influence this?
Did Age effect the chances of surviving?

We will load the necessary Python libraries for our analysis and set some parameters:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import chi2_contingency
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.style.use('seaborn-ticks')

SMALL_SIZE = 13
MEDIUM_SIZE = 14
BIGGER_SIZE = 16

plt.rc('font', size=SMALL_SIZE)          # controls default text sizes
plt.rc('axes', titlesize=SMALL_SIZE)     # fontsize of the axes title
plt.rc('axes', labelsize=MEDIUM_SIZE)    # fontsize of the x and y labels
plt.rc('xtick', labelsize=MEDIUM_SIZE)    # fontsize of the tick labels
plt.rc('ytick', labelsize=MEDIUM_SIZE)    # fontsize of the tick labels
plt.rc('legend', fontsize=SMALL_SIZE)    # legend fontsize

Data Acquisition¶

#Load the CSV into a Pandas Dataframe
titanic_data = pd.read_csv('titanic-data.csv')

Let's take a first look of our Dataframe using pandas descriptive statitstics functions

titanic_data.head(5)

titanic_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

We observe that there are missing values at the Age, Cabin and Embarked columns.

Data Cleaning¶

Since most of Cabin column values are missing we will omit this column along with the Ticket, Fare, PassengerId and Name columns that we will not use for this initial investigation. We will make a new Dataframe in case we want to access the initial one again.

#Drop the unwanted columns
n_titanic_data=titanic_data.drop(['Cabin','Ticket','Name',
                                  'Fare','PassengerId'],axis=1)

n_titanic_data.head()

n_titanic_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 7 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Sex         891 non-null object
Age         714 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Embarked    889 non-null object
dtypes: float64(1), int64(4), object(2)
memory usage: 48.8+ KB

We have only 714 Age values out of 891 of the entries and 2 values missing from the Embarked Variable. We will have to decide whether to omit these or impute them with some values when we model relationships based on Age or Embarked.

Imputing missing data is a complicated procedure and creating and evaluating a regression model to predict them based on the other variables is out of the scope of this analysis.
However, by using the mean or median , we can bias any relationships that we are modeling.[1] [2] .

Therefore we will choose to omit the missing Age and Embarked data whenever we are modeling relationships based on these two variables using the Available-case Analysis[3] method (Where different aspects of the problem are studied with different subsets of the data) and accept the limitations of this approach (Lack of consistency between analyzed subsets).

Further Exploration - Visualizations¶

We will change the keys to make them better readable and explore the initial composition of the passengers.

#Make another copy of the new dataframe
descript = n_titanic_data.copy()

#Change the embarked keys to better readable ones
descript.loc[:,'Embarked'].replace(['C','S','Q'],
                                      ['Cherbourg','Southampton','Queenstown'],
                                      inplace=True)
#And the survived keys
descript.loc[:,'Survived'].replace([0,1],['No','Yes'],inplace=True)

We will make a function for this operation since we will use for the other variables as well.

# Make a function to get the composition of the variables per number of passengers    
def Groupby_OneCol_comp_plot(df, col, plt_style = 'seaborn-ticks', color_palette = "coolwarm"):
    '''
    Group by col1, sort by size , return and plot the dataframe with a bar and pie plot
    '''
    gr=pd.DataFrame()
    gr['{} No'.format(col)] = df.groupby(col).size()
    gr['{} Ratio'.format(col)] = np.round(gr['{} No'.format(col)].divide(gr['{} No'.format(col)].sum())*100,0)
    
    print ('Total No. of {}:{}'.format(col,gr['{} No'.format(col)].sum()))
    
    plt.style.use(plt_style)
    sns.set_palette(sns.color_palette(color_palette))
    
    fig=plt.figure()
    plt.axis('off')

    fig.add_subplot(121)
    
    ax=gr['{} No'.format(col)].plot(kind='bar', title='{} Counts'.format(col), figsize=(16,8), color=sns.color_palette())
    _ = plt.setp(ax.get_xticklabels(), rotation=0)
    for p in ax.patches: ax.annotate(np.round(p.get_height(),decimals=2),
                                     (p.get_x()+p.get_width()/2., p.get_height()),
                                     ha='center', va='center', xytext=(0, 10), textcoords='offset points')
    ax.get_yaxis().set_ticks([])
    plt.xlabel('')

    fig.add_subplot(122)
    plt.axis('off')
    gr.loc[:,'{} Ratio'.format(col)].plot(kind= 'pie',
                                     autopct='%1.1f%%',shadow=False,
                                     title='{} Ratio'.format(col), legend=False, labels=None);
    sns.despine(top=True, right=True, left=True, bottom=False);

Analysis of the `Embarked` variable.¶

We start with the composition of passengers based on their port of embarkation.

Groupby_OneCol_comp_plot(descript, 'Embarked')

Total No. of Embarked:889

We see that the majority of passengers (644 of 889 - 72%) embarked in Southhampton and only 77 passengers - 9% - embarked in Queenstown

Let's examine the percentages of passengers that survived ,depending on their port of embarkation.

We will make functions for this operation since we will use for the other variables as well:

Correlation of `Survived` with `Embarked` .¶

def plot(table, legloc='upper right',
                                    plt_style = 'seaborn-ticks',
                                    color_palette="dark",sorter=None, stacked=False,
                                    kind = 'bar', percentage = True,
                               custom_title=None, minimal=True, figsize=(19,10), width=0.7 ):     
    grouped = table
    
    #Tranform to percentages
    if percentage == True:
        grouped = np.round(grouped.divide(grouped['Total'],axis=0)*100,0)
    try:   
        del grouped['Total']
    except:
        pass
    
    # rearrange the columns
    if sorter:
        grouped = grouped[sorter]

    plt.style.use(plt_style)
    sns.set_palette(sns.color_palette(color_palette))
    ax = grouped.plot(kind=kind,stacked=stacked, figsize=figsize, width=width)
    _ = plt.setp(ax.get_xticklabels(), rotation=0)  # Rotate labels
    plt.legend(loc=legloc) # plot the legend normally
    
    #annotate the bars
    if percentage == True:
      for p in ax.patches:
            ax.annotate('{}%'.format(int(np.round(p.get_height(),decimals=2))),
                                         (p.get_x()+p.get_width()/2.,
                                          p.get_height()), ha='center', va='center',
                                        xytext=(0, 10), textcoords='offset points')
    else:
      for p in ax.patches:
            ax.annotate(np.round(p.get_height(),decimals=2),
                                         (p.get_x()+p.get_width()/2.,
                                          p.get_height()), ha='center', va='center',
                                        xytext=(0, 10), textcoords='offset points')
    if minimal == True:
        ax.get_yaxis().set_ticks([])
        plt.xlabel('')
        sns.despine(top=True, right=True, left=True, bottom=False);
    else:
        pass     
    # set custom title    
    plt.title(custom_title)
    
def Groupby_TwoCol_Plot(df, col1, col2, legloc='upper right',
                                    plt_style = 'ggplot',
                                    color_palette="dark",sorter=None, stacked=False,
                                    kind = 'bar', percentage = True,
                               custom_title=None, minimal=True, figsize=(14,6), width=0.6):   
    
    #Group by Placement and Representative and unstack by Placement
    grouped = df.groupby([col2,col1]).size().unstack(col2)
    
    #Make a totals column sort and delete after
    grouped['Total'] = grouped.sum(axis=1)
    #grouped = grouped.sort_values('Total', ascending = False)
   
    plot(grouped, legloc=legloc,
                                    plt_style = plt_style,
                                    color_palette=color_palette,sorter=sorter, stacked=stacked,
                                    kind = kind , percentage = percentage,
                               custom_title=custom_title, minimal=minimal, figsize=figsize, width=width)

Groupby_TwoCol_Plot(descript,'Embarked', 'Survived', color_palette=('darkred','steelblue'),
                    plt_style = 'seaborn-ticks', custom_title='Proportion of Survived per Embarkation Port')

We see that 55% of passengers embarked in Cherbourg survived compared to 34% and 39% at Southhampton and Queensberg respectively.

This is counter-intuitive at a first look. Investigating deeper into the composition of the passengers regarding their gender and their class may given us more information about this relationship.

Correlation of `Embarked` with `Pclass`.¶

#Calculate percentages of port passengers per Class
Groupby_TwoCol_Plot(descript,'Embarked', 'Pclass', color_palette=('cubehelix'),
                    plt_style = 'seaborn-ticks', custom_title='Proportion of Embarked per PcClass', sorter = [1,2,3])

51% of the passengers embarked in Cherbourg are in the 1st Pclass compared to 20% and 3% respectively for Southhampton and Queenstown.

It looks like the class may play a role in port of embarkation's relationship with survibability.

Let's explore the survivability based on the Pcclass variable further.

Correlation of `Survived` with `Pclass`.¶

Groupby_TwoCol_Plot(descript,'Pclass', 'Survived', color_palette=('darkred','steelblue'),
                    plt_style = 'seaborn-ticks', custom_title='Proportion of Survived per PcClass')

63% of 1st class passengers survived compared to 47% and 24% for the 2nd and 3rd class respectively.

Indeed, survivability seems to be correlated with the Pcclass and this could be the main factor behind the correlation with the port of embarkation as well.

Let's investigate Embarked and the correlation with Sex

Correlation of `Embarked` with `Sex`.¶

#Calculate percentages of port passengers per Sex
Groupby_TwoCol_Plot(descript,'Embarked', 'Sex', color_palette=('lightpink','green'),
                    plt_style = 'seaborn-ticks', custom_title='Proportion of Sex per PcClass',
                   legloc='upper left')

There does not seem to be a clear pattern related to Sex that could be contributing to the increased survivability of the Cherbourg passengers.

Let's see the Sex composition of the whole population.

Analysis of the `Sex` variable.¶

Groupby_OneCol_comp_plot(descript, 'Sex', color_palette = ('lightpink','green') )

Total No. of Sex:891

And of the Pclass.

Correlation of `Sex` with `Pclass`.¶

#Calculate percentages of Pclass per Sex
Groupby_TwoCol_Plot(descript,'Pclass', 'Sex', color_palette=('lightpink','green'),
                    plt_style = 'seaborn-ticks', custom_title='Proportion of Sex per PcClass',
                   legloc='upper left')

We observe that the 3rd class has a higher than average (71 % vs 65%) male percentage.

And the survivability based on Sex

Correlation of `Sex` with `Survived`¶

Groupby_TwoCol_Plot(descript,'Survived', 'Sex', color_palette=('lightpink','green'),
                    plt_style = 'seaborn-ticks', custom_title='Proportion of Sex per Survived',
                   legloc='upper left')

74% of females survived compared to 19% for males. 44% of the 1st class(which had a 65% survivability) comprised of females compared to 29% of the third class (24% survivability).

We can observe this relationship in the following seaborn barplot where the black lines represent confidence intervals built using bootstrapping.

Correlation of `Survived` with `Sex` and `Pclass`.¶

plt.figure(figsize=(14,6))
sns.set_palette(sns.color_palette(('green','lightpink')))
sns.barplot(data=n_titanic_data, x="Pclass", hue='Sex', y='Survived', estimator=np.mean);
plt.ylabel('proportion of survival')
sns.despine(top=True, right=True, left=False, bottom=False);

The proportion of survival for females in the first class was almost 100% in the first class compared to 50% in the third class.

Further statistical tests need to be conducted but it seems that Age together with the Class have a compound effect on survivability as well as on the correlation of other variables to survivability.

Analysis of the `Age` variable¶

Let's examine now the age distribution of the passengers and how Age affected their chances of survival.

We will start with a plot of the entire population.

#Make a dataframe for non missing 'Age'values
not_missing = n_titanic_data[(n_titanic_data['Age'].notnull())] 

#And replace the survived keys
not_missing.loc[:,'Survived'].replace([0,1],['No','Yes'],inplace=True)

print ('No. of Passengers with not missing Age Values:{}'.format(len(not_missing)))

No. of Passengers with not missing Age Values:714

ax=plt.figure()
plt.suptitle('Passenger Age Distribution')
ax.add_subplot(121)
sns.distplot(not_missing['Age'],bins=11)
ax.add_subplot(122)
sns.violinplot(not_missing['Age']);

# Get summary descriptive statistics
v= pd.DataFrame(not_missing['Age'].describe())

#Change the index labels and round the values reported
v.index = ['Population Size', 'Mean', 'Std. Deviation', 'Min', '25% Qt', 'Median',\
               '75% Qt', 'Max']
v = v.round(decimals=3)
v

And the density distribution and boxplot of the Age variable depending by survivability.

not_missing.hist(column="Age",by="Survived",sharey=True,normed=True)
plt.suptitle('Age Density Distribution grouped by Survived');

We observe that the percentage of children below 10 that survived was significantly higher and almost nobody over 70 year's old survived. We would like to examine if this was by luck or by some other underlying reason (like the 'Women and Children first' rule).

#Make a datframe with the sample populations
age = pd.DataFrame()
age['all'] = not_missing['Age']
not_survived = age['Not-survived'] = not_missing['Age'][not_missing['Survived']=='No']
survived = age['Survived'] = not_missing['Age'][not_missing['Survived']=='Yes']

#Get the summary statistics
var = age.describe()

#Change the index labels and round the values reported
var.index = ['Sample Size', 'Mean', 'Std. Deviation', 'Min', '25% Qt', 'Median',\
               '75% Qt', 'Max']
var = var.round(decimals=3)

The not-survived and survived age populations have the following descriptive statistics:

var.loc[:,['Not-survived','Survived']]

`Survived`- `Age` Statistical Chi-SquaredTest¶

We will conduct a statistical chi-squared test to establish whether the Survived and Age variables are related.

Dependent Variable: Survived
Independent Variable: Age

$O_{i}$: the observed value of survived for the given age
$E_{i}$: the expected value of survived for the given age

We will test the following hypotheses:

$H_0$: The Null Hypothesis, that there is no relationship between the Survived and Age variables (independent) $\rightarrow O_{i} \neq E_{i}$

$H_A$: The Alternative Hypothesis, that there is a relationship between the Survived and Age variables (dependent) $\rightarrow O_{i} = E_{i}$

#Create age-groups
age_labels = ['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69',
              '70-80']
age_group_values = pd.cut(not_missing.Age, range(0,81,10),
                                   right=False, labels=age_labels)
not_missing.loc[:,'age-groups'] = age_group_values

#Set the value for the one 80-year old outside the bins 
#chi-squared is notvalid for no of observations below 5
not_missing.loc[not_missing['Age']>=80, 'age-groups'] = '70-80'

#Make an observed-table for chi-squared test
obs_table = pd.crosstab([not_missing['Survived']],[not_missing['age-groups']])

obs_table

We will compute the Pearson's Chi-square statistic based on the above observations table:

#Compute Chi-square statistic
chi2, p, dof, expected = chi2_contingency(obs_table)

#report results
print('chi2:{}\ndof:{}\np:{}'.format(chi2,dof,p))

chi2:17.4277216059
dof:7
p:0.0148368781128

For a=.05 and 7 degrees of freedom, p is smaller than 0.05 and we therefore reject the Null-Hypothesis and accept that Survived and Age are dependent variables and that there is indeed a relationship between age and survivability.

Further statistical tests can be conducted to explore in more detail their relationship and correlation.

Note¶

All conclusions above are tentative and and subject to further investigation and statistical tests.
The missing Age values could be adding an undefined bias to our hypothesis test and conclusions.

References/Sources¶

[1]:https://discussions.udacity.com/t/help-predicting-missing-age-values-in-titanic-dataset/194349/2
[2]:https://discussions.udacity.com/t/missing-age-titanic-data/165798/2
[3]:http://www.stat.columbia.edu/~gelman/arm/missing.pdf
[4]:http://www.ling.upenn.edu/~clight/chisquared.htm [5]:http://stackoverflow.com/questions/25447700/annotate-bars-with-values-on-pandas-bar-plots
[6]: Invaluable Udacity Project Reviewer Feedback

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

	Not-survived	Survived
Sample Size	424.000	290.000
Mean	30.626	28.344
Std. Deviation	14.172	14.951
Min	1.000	0.420
25% Qt	21.000	19.000
Median	28.000	28.000
75% Qt	39.000	36.000
Max	74.000	80.000

	Age
Population Size	714.000
Mean	29.699
Std. Deviation	14.526
Min	0.420
25% Qt	20.125
Median	28.000
75% Qt	38.000
Max	80.000

age-groups	0-9	10-19	20-29	30-39	40-49	50-59	60-69	70-80
Survived
No	24	61	143	94	55	28	13	6
Yes	38	41	77	73	34	20	6	1

Exploratory Analysis of the Titanic Dataset¶

Project Overview¶

Dataset Information/ Data Dictionary/Variable Notes¶

Initial Questions & Data Investigation¶

Data Acquisition¶

Data Cleaning¶

Further Exploration - Visualizations¶

Analysis of the Embarked variable.¶

Correlation of Survived with Embarked .¶

Correlation of Embarked with Pclass.¶

Correlation of Survived with Pclass.¶

Correlation of Embarked with Sex.¶

Analysis of the Sex variable.¶

Correlation of Sex with Pclass.¶

Correlation of Sex with Survived¶

Correlation of Survived with Sex and Pclass.¶

Analysis of the Age variable¶

Survived- Age Statistical Chi-SquaredTest¶