In this project, I investigate the Titanic dataset using the Python libraries SciPy, NumPy, Pandas, Matplotlib and Seaborn.
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
Titanic Data - Contains demographics and passenger information from 891 of the 2224 passengers and crew on board the Titanic.
Variable | Definition | Key |
---|---|---|
Survived | Survival | 0 = No, 1 = Yes |
Pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
Sex | Sex | |
Age | Age in years | |
Sibsp | # of siblings / spouses aboard the Titanic | |
Parch | # of parents / children aboard the Titanic | |
Ticket | Ticket number | |
Fare | Passenger fare | |
Cabin | Cabin number | |
Embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
Pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower
Age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5
Sibsp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)
Parch: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.
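The Age convention in the notes above can be checked mechanically. The following is a minimal sketch on a small made-up Series rather than the real data; the helper name `is_estimated_age` is hypothetical and only serves the illustration.

```python
import pandas as pd

def is_estimated_age(age: pd.Series) -> pd.Series:
    """Flag ages of 1 or more that end in .5 (estimated, per the data dictionary)."""
    # Ages below 1 are genuine fractions, so they are excluded from the flag
    return (age >= 1) & ((age % 1) == 0.5)

# Made-up values: an infant, an exact age, two estimated ages, one missing age
ages = pd.Series([0.42, 22.0, 28.5, 40.5, None])
flags = is_estimated_age(ages)
print(flags.tolist())
```

Missing ages compare as False, so they are never flagged as estimated.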
We will load the necessary Python libraries for our analysis and set some parameters:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import chi2_contingency
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.style.use('seaborn-ticks')
SMALL_SIZE = 13
MEDIUM_SIZE = 14
BIGGER_SIZE = 16
plt.rc('font', size=SMALL_SIZE) # controls default text sizes
plt.rc('axes', titlesize=SMALL_SIZE) # fontsize of the axes title
plt.rc('axes', labelsize=MEDIUM_SIZE) # fontsize of the x and y labels
plt.rc('xtick', labelsize=MEDIUM_SIZE) # fontsize of the tick labels
plt.rc('ytick', labelsize=MEDIUM_SIZE) # fontsize of the tick labels
plt.rc('legend', fontsize=SMALL_SIZE) # legend fontsize
#Load the CSV into a Pandas Dataframe
titanic_data = pd.read_csv('titanic-data.csv')
Let's take a first look at our Dataframe using the pandas descriptive statistics functions.
titanic_data.head(5)
titanic_data.info()
We observe that there are missing values in the Age, Cabin and Embarked columns.
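Missing values can also be counted directly instead of being read off the `info()` output. A minimal sketch, assuming a tiny made-up frame (`toy`) that stands in for `titanic_data`:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for titanic_data
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Embarked': ['S', 'C', np.nan, 'S'],
    'Sex': ['male', 'female', 'female', 'male'],
})

# Number and share of missing entries per column
missing = toy.isnull().sum()
missing_share = toy.isnull().mean().round(2)
print(pd.DataFrame({'missing': missing, 'share': missing_share}))
```

The same two calls on the real Dataframe would list the missing count for every column at once.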
Since most of the Cabin column values are missing, we will omit this column along with the Ticket, Fare, PassengerId and Name columns, which we will not use for this initial investigation. We will make a new Dataframe in case we want to access the initial one again.
#Drop the unwanted columns
n_titanic_data=titanic_data.drop(['Cabin','Ticket','Name',
'Fare','PassengerId'],axis=1)
n_titanic_data.head()
n_titanic_data.info()
Only 714 of the 891 entries have an Age value, and 2 values are missing from the Embarked variable. We will have to decide whether to omit these or impute them with some values when we model relationships based on Age or Embarked.
Imputing missing data is a complicated procedure and creating and evaluating a regression model to predict them based on the other variables is out of the scope of this analysis.
However, filling the gaps with the mean or median can bias any relationships that we are modeling [1] [2].
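One way to see the bias from mean imputation is that it leaves the mean unchanged but shrinks the spread of the variable. A minimal sketch with made-up ages (not the Titanic data):

```python
import numpy as np
import pandas as pd

# Made-up ages with three missing entries
age = pd.Series([2.0, 20.0, 30.0, 40.0, 58.0, np.nan, np.nan, np.nan])

# Fill every gap with the mean of the observed values
imputed = age.fillna(age.mean())

# The mean is preserved, but the standard deviation drops
print('observed: mean={:.1f} std={:.2f}'.format(age.mean(), age.std()))
print('imputed:  mean={:.1f} std={:.2f}'.format(imputed.mean(), imputed.std()))
```

A narrower spread can weaken or distort any relationship measured against the imputed variable.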
Therefore we will omit the missing Age and Embarked data whenever we are modeling relationships based on these two variables, using the available-case analysis method [3] (where different aspects of the problem are studied with different subsets of the data), and accept the limitation of this approach (lack of consistency between the analyzed subsets).
We will change the keys to make them more readable and explore the initial composition of the passengers.
#Make another copy of the new dataframe
descript = n_titanic_data.copy()
#Change the embarked keys to more readable ones
#(assign the result back; an in-place replace on a .loc selection may act on a copy)
descript['Embarked'] = descript['Embarked'].replace(
    {'C': 'Cherbourg', 'S': 'Southampton', 'Q': 'Queenstown'})
#And the survived keys
descript['Survived'] = descript['Survived'].replace({0: 'No', 1: 'Yes'})
We will make a function for this operation since we will use it for the other variables as well.
# Make a function to get the composition of the variables per number of passengers
def Groupby_OneCol_comp_plot(df, col, plt_style = 'seaborn-ticks', color_palette = "coolwarm"):
    '''
    Group by col, print the total count and plot the group counts
    as a bar plot and the group ratios as a pie plot
    '''
    gr=pd.DataFrame()
    gr['{} No'.format(col)] = df.groupby(col).size()
    gr['{} Ratio'.format(col)] = np.round(gr['{} No'.format(col)].divide(gr['{} No'.format(col)].sum())*100,0)
    print ('Total No. of {}:{}'.format(col,gr['{} No'.format(col)].sum()))
    plt.style.use(plt_style)
    sns.set_palette(sns.color_palette(color_palette))
    fig=plt.figure()
    plt.axis('off')
    fig.add_subplot(121)
    ax=gr['{} No'.format(col)].plot(kind='bar', title='{} Counts'.format(col), figsize=(16,8), color=sns.color_palette())
    _ = plt.setp(ax.get_xticklabels(), rotation=0)
    #annotate each bar with its height
    for p in ax.patches:
        ax.annotate(np.round(p.get_height(),decimals=2),
                    (p.get_x()+p.get_width()/2., p.get_height()),
                    ha='center', va='center', xytext=(0, 10), textcoords='offset points')
    ax.get_yaxis().set_ticks([])
    plt.xlabel('')
    fig.add_subplot(122)
    plt.axis('off')
    gr.loc[:,'{} Ratio'.format(col)].plot(kind= 'pie',
                                          autopct='%1.1f%%',shadow=False,
                                          title='{} Ratio'.format(col), legend=False, labels=None);
    sns.despine(top=True, right=True, left=True, bottom=False);
Embarked variable
We start with the composition of passengers based on their port of embarkation.
Groupby_OneCol_comp_plot(descript, 'Embarked')
We see that the majority of passengers (644 of 889, or 72%) embarked in Southampton and only 77 passengers (9%) embarked in Queenstown.
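The counts and percentages quoted above can also be tabulated without plotting. A minimal sketch using `value_counts` on a made-up Series of ten ports; the last lines only re-check the arithmetic of the figures quoted in the text:

```python
import pandas as pd

# Made-up sample of ten embarkation ports (not the real data)
ports = pd.Series(['Southampton'] * 7 + ['Cherbourg'] * 2 + ['Queenstown'])

counts = ports.value_counts()
share = (ports.value_counts(normalize=True) * 100).round(0)
summary = pd.DataFrame({'count': counts, 'percent': share})
print(summary)

# Arithmetic check of the figures quoted in the text
southampton_pct = round(644 / 889 * 100)
queenstown_pct = round(77 / 889 * 100)
print(southampton_pct, queenstown_pct)
```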
Let's examine the percentages of passengers that survived, depending on their port of embarkation.
We will make functions for this operation since we will use them for the other variables as well:
Survived with Embarked
def plot(table, legloc='upper right',
         plt_style = 'seaborn-ticks',
         color_palette="dark",sorter=None, stacked=False,
         kind = 'bar', percentage = True,
         custom_title=None, minimal=True, figsize=(19,10), width=0.7 ):
    grouped = table
    #Transform to percentages
    if percentage == True:
        grouped = np.round(grouped.divide(grouped['Total'],axis=0)*100,0)
    #Drop the helper totals column if it is present
    try:
        del grouped['Total']
    except KeyError:
        pass
    # rearrange the columns
    if sorter:
        grouped = grouped[sorter]
    plt.style.use(plt_style)
    sns.set_palette(sns.color_palette(color_palette))
    ax = grouped.plot(kind=kind,stacked=stacked, figsize=figsize, width=width)
    _ = plt.setp(ax.get_xticklabels(), rotation=0) # Rotate labels
    plt.legend(loc=legloc) # plot the legend normally
    #annotate the bars
    if percentage == True:
        for p in ax.patches:
            ax.annotate('{}%'.format(int(np.round(p.get_height(),decimals=2))),
                        (p.get_x()+p.get_width()/2.,
                         p.get_height()), ha='center', va='center',
                        xytext=(0, 10), textcoords='offset points')
    else:
        for p in ax.patches:
            ax.annotate(np.round(p.get_height(),decimals=2),
                        (p.get_x()+p.get_width()/2.,
                         p.get_height()), ha='center', va='center',
                        xytext=(0, 10), textcoords='offset points')
    if minimal == True:
        ax.get_yaxis().set_ticks([])
        plt.xlabel('')
        sns.despine(top=True, right=True, left=True, bottom=False);
    else:
        pass
    # set custom title
    plt.title(custom_title)
def Groupby_TwoCol_Plot(df, col1, col2, legloc='upper right',
                        plt_style = 'ggplot',
                        color_palette="dark",sorter=None, stacked=False,
                        kind = 'bar', percentage = True,
                        custom_title=None, minimal=True, figsize=(14,6), width=0.6):
    #Group by col2 and col1, then unstack by col2 (col1 values become the rows)
    grouped = df.groupby([col2,col1]).size().unstack(col2)
    #Make a totals column; plot() uses it for the percentages and deletes it after
    grouped['Total'] = grouped.sum(axis=1)
    #grouped = grouped.sort_values('Total', ascending = False)
    plot(grouped, legloc=legloc,
         plt_style = plt_style,
         color_palette=color_palette,sorter=sorter, stacked=stacked,
         kind = kind , percentage = percentage,
         custom_title=custom_title, minimal=minimal, figsize=figsize, width=width)
Groupby_TwoCol_Plot(descript,'Embarked', 'Survived', color_palette=('darkred','steelblue'),
plt_style = 'seaborn-ticks', custom_title='Proportion of Survived per Embarkation Port')
We see that 55% of the passengers who embarked in Cherbourg survived, compared to 34% and 39% for Southampton and Queenstown respectively.
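The same row-wise percentages can be obtained as a table with `pd.crosstab` and `normalize='index'`. A minimal sketch on a made-up frame (`toy` is not the real data):

```python
import pandas as pd

# Made-up passengers: port of embarkation and survival label
toy = pd.DataFrame({
    'Embarked': ['Cherbourg', 'Cherbourg', 'Southampton',
                 'Southampton', 'Southampton', 'Queenstown'],
    'Survived': ['Yes', 'No', 'No', 'No', 'Yes', 'No'],
})

# normalize='index' divides each row by its own total, so rows sum to 100%
rates = pd.crosstab(toy['Embarked'], toy['Survived'], normalize='index') * 100
print(rates.round(1))
```

Each row answers the question "of the passengers from this port, what share survived?".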
This is counter-intuitive at first look. Investigating the composition of the passengers by gender and class may give us more information about this relationship.
Embarked with Pclass
#Calculate percentages of port passengers per Class
Groupby_TwoCol_Plot(descript,'Embarked', 'Pclass', color_palette=('cubehelix'),
                    plt_style = 'seaborn-ticks', custom_title='Proportion of Pclass per Embarkation Port', sorter = [1,2,3])
51% of the passengers who embarked in Cherbourg travelled in 1st class, compared to 20% and 3% respectively for Southampton and Queenstown.
It looks like the class may play a role in the relationship between port of embarkation and survivability.
Let's explore the survivability based on the Pclass variable further.
Survived with Pclass
Groupby_TwoCol_Plot(descript,'Pclass', 'Survived', color_palette=('darkred','steelblue'),
                    plt_style = 'seaborn-ticks', custom_title='Proportion of Survived per Pclass')
63% of 1st class passengers survived compared to 47% and 24% for the 2nd and 3rd class respectively.
Indeed, survivability seems to be correlated with Pclass, and this could be the main factor behind the correlation with the port of embarkation as well.
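The idea that class could explain the port effect can be probed by comparing survival rates within each class. The sketch below uses a made-up frame that was deliberately constructed so the ports differ overall but not within any class; it illustrates the technique and makes no claim about the real data.

```python
import pandas as pd

# Made-up data: port C carries mostly 1st class, port S mostly 3rd class
toy = pd.DataFrame({
    'Embarked': ['C', 'C', 'C', 'S', 'S', 'S', 'S', 'S'],
    'Pclass':   [1, 1, 3, 1, 3, 3, 3, 3],
    'Survived': [1, 1, 0, 1, 0, 0, 0, 0],
})

# Overall survival rate per port
overall = toy.groupby('Embarked')['Survived'].mean()

# Survival rate per port within each class
by_class = toy.pivot_table(index='Embarked', columns='Pclass',
                           values='Survived', aggfunc='mean')
print(overall)
print(by_class)
```

When the within-class rates match while the overall rates differ, the apparent port effect comes from the class mix of each port.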
Let's investigate the relationship between Embarked and Sex.
Embarked with Sex
#Calculate percentages of port passengers per Sex
Groupby_TwoCol_Plot(descript,'Embarked', 'Sex', color_palette=('lightpink','green'),
                    plt_style = 'seaborn-ticks', custom_title='Proportion of Sex per Embarkation Port',
                    legloc='upper left')
There does not seem to be a clear pattern related to Sex that could be contributing to the increased survivability of the Cherbourg passengers.
Let's see the Sex composition of the whole population.
Sex variable
Groupby_OneCol_comp_plot(descript, 'Sex', color_palette = ('lightpink','green') )
And the Sex composition of each Pclass.
Sex with Pclass
#Calculate percentages of Pclass per Sex
Groupby_TwoCol_Plot(descript,'Pclass', 'Sex', color_palette=('lightpink','green'),
                    plt_style = 'seaborn-ticks', custom_title='Proportion of Sex per Pclass',
                    legloc='upper left')
We observe that the 3rd class has a higher than average male percentage (71% vs 65%).
And the survivability based on Sex.
Sex with Survived
Groupby_TwoCol_Plot(descript,'Survived', 'Sex', color_palette=('lightpink','green'),
                    plt_style = 'seaborn-ticks', custom_title='Proportion of Sex per Survived',
                    legloc='upper left')
74% of females survived compared to 19% of males. Females made up 44% of the 1st class (which had a 63% survival rate) compared to 29% of the 3rd class (24% survival rate).
We can observe this relationship in the following seaborn barplot where the black lines represent confidence intervals built using bootstrapping.
Survived with Sex and Pclass
plt.figure(figsize=(14,6))
sns.set_palette(sns.color_palette(('green','lightpink')))
sns.barplot(data=n_titanic_data, x="Pclass", hue='Sex', y='Survived', estimator=np.mean);
plt.ylabel('proportion of survival')
sns.despine(top=True, right=True, left=False, bottom=False);
The proportion of survival for females was almost 100% in the first class compared to 50% in the third class.
Further statistical tests need to be conducted, but it seems that Sex together with the Class has a compound effect on survivability as well as on the correlation of other variables with survivability.
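The bootstrapped confidence intervals mentioned for the barplot can be approximated by hand. The following is a minimal percentile-bootstrap sketch with NumPy; the function name `bootstrap_ci` and its parameters are hypothetical, the data are made up, and seaborn's internal procedure may differ in detail.

```python
import numpy as np

def bootstrap_ci(values, n_boot=2000, ci=95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement and record the mean of each resample
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    means = values[idx].mean(axis=1)
    # Take the central `ci` percent of the resampled means
    lower = np.percentile(means, (100 - ci) / 2)
    upper = np.percentile(means, 100 - (100 - ci) / 2)
    return lower, upper

# Made-up survival indicators for one group: 15 survivors out of 20
survived = np.array([1] * 15 + [0] * 5)
low, high = bootstrap_ci(survived)
print('mean={:.2f}, CI=({:.2f}, {:.2f})'.format(survived.mean(), low, high))
```

Smaller groups give wider intervals, which is why the error bars differ between classes in a plot of this kind.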
Age variable
Let's now examine the age distribution of the passengers and how Age affected their chances of survival.
We will start with a plot of the entire population.
#Make a dataframe for non missing 'Age' values (a copy, so edits do not touch n_titanic_data)
not_missing = n_titanic_data[(n_titanic_data['Age'].notnull())].copy()
#And replace the survived keys
not_missing['Survived'] = not_missing['Survived'].replace({0: 'No', 1: 'Yes'})
print ('No. of Passengers with not missing Age Values:{}'.format(len(not_missing)))
fig=plt.figure()
plt.suptitle('Passenger Age Distribution')
fig.add_subplot(121)
#distplot is deprecated since seaborn 0.11; sns.histplot(..., kde=True) is its replacement
sns.distplot(not_missing['Age'],bins=11)
fig.add_subplot(122)
sns.violinplot(x=not_missing['Age']);
# Get summary descriptive statistics
v= pd.DataFrame(not_missing['Age'].describe())
#Change the index labels and round the values reported
v.index = ['Population Size', 'Mean', 'Std. Deviation', 'Min', '25% Qt', 'Median',\
'75% Qt', 'Max']
v = v.round(decimals=3)
v
And the density distribution of the Age variable grouped by survival.
#density=True replaces the normed argument, which newer Matplotlib versions no longer accept
not_missing.hist(column="Age",by="Survived",sharey=True,density=True)
plt.suptitle('Age Density Distribution grouped by Survived');
We observe that the percentage of children below 10 that survived was noticeably higher and that almost nobody over 70 years old survived. We would like to examine whether this was by luck or due to some other underlying reason (like the 'Women and Children first' rule).
#Make a dataframe with the sample populations
age = pd.DataFrame()
age['all'] = not_missing['Age']
not_survived = age['Not-survived'] = not_missing['Age'][not_missing['Survived']=='No']
survived = age['Survived'] = not_missing['Age'][not_missing['Survived']=='Yes']
#Get the summary statistics
var = age.describe()
#Change the index labels and round the values reported
var.index = ['Sample Size', 'Mean', 'Std. Deviation', 'Min', '25% Qt', 'Median',\
'75% Qt', 'Max']
var = var.round(decimals=3)
The not-survived and survived age populations have the following descriptive statistics:
var.loc[:,['Not-survived','Survived']]
Survived - Age Statistical Chi-Squared Test
We will conduct a statistical chi-squared test to establish whether the Survived and Age variables are related.
Dependent Variable: Survived
Independent Variable: Age
$O_{i}$: the observed value of survived for the given age
$E_{i}$: the expected value of survived for the given age
We will test the following hypotheses:
$H_0$: The Null Hypothesis, that there is no relationship between the Survived and Age variables (independent) $\rightarrow O_{i} = E_{i}$
$H_A$: The Alternative Hypothesis, that there is a relationship between the Survived and Age variables (dependent) $\rightarrow O_{i} \neq E_{i}$
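For reference, Pearson's statistic combines the observed and expected values defined above, summing over every cell of the observations table. With $R_r$ the total of row $r$, $C_c$ the total of column $c$ and $N$ the grand total (these three symbols are introduced here only for the illustration), the standard definitions are:

$$\chi^2 = \sum_{i} \frac{(O_{i} - E_{i})^2}{E_{i}}, \qquad E_{rc} = \frac{R_r \, C_c}{N}, \qquad \text{dof} = (\text{rows} - 1)(\text{columns} - 1)$$

When the two variables are independent, the observed counts stay close to the expected ones and the statistic is small; large values are evidence against independence.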
#Create age-groups
age_labels = ['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69',
'70-80']
age_group_values = pd.cut(not_missing.Age, range(0,81,10),
right=False, labels=age_labels)
not_missing.loc[:,'age-groups'] = age_group_values
#Set the value for the one 80-year-old, who falls outside the bins
#(the chi-squared test is not valid for cells with fewer than 5 observations)
not_missing.loc[not_missing['Age']>=80, 'age-groups'] = '70-80'
#Make an observed-table for chi-squared test
obs_table = pd.crosstab([not_missing['Survived']],[not_missing['age-groups']])
obs_table
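The binning step relies on how `pd.cut` treats bin edges. A minimal sketch on made-up ages shows that with `right=False` each bin includes its left edge and excludes its right edge, so an age of exactly 80 gets no bin until it is assigned one by hand:

```python
import pandas as pd

# Made-up ages that sit on or near bin edges
ages = pd.Series([0.42, 9.0, 10.0, 79.0, 80.0])
labels = ['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-80']

# Bins are [0, 10), [10, 20), ..., [70, 80): 80 itself falls outside
groups = pd.cut(ages, range(0, 81, 10), right=False, labels=labels)
print(groups.tolist())

# Assign the out-of-range age to the last group, as in the analysis
fixed = groups.copy()
fixed[ages >= 80] = '70-80'
print(fixed.tolist())
```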
We will compute Pearson's chi-squared statistic based on the above observations table:
#Compute Chi-square statistic
chi2, p, dof, expected = chi2_contingency(obs_table)
#report results
print('chi2:{}\ndof:{}\np:{}'.format(chi2,dof,p))
For $\alpha = .05$ and 7 degrees of freedom, p is smaller than 0.05. We therefore reject the Null Hypothesis and conclude that Survived and Age are dependent variables, that is, there is indeed a relationship between age and survivability.
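To make the test less of a black box, the expected counts and the statistic can be reproduced by hand and compared with `chi2_contingency`. The sketch below uses a made-up 2x3 table, not the Titanic observations table; with more than one degree of freedom `chi2_contingency` applies no continuity correction, so the two results should agree.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Made-up 2x3 table of counts (rows: outcome, columns: group)
observed = np.array([[20, 30, 50],
                     [30, 30, 40]])

# Expected count of a cell = row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_by_hand = row_totals * col_totals / observed.sum()

# Pearson's statistic and the degrees of freedom
chi2_by_hand = ((observed - expected_by_hand) ** 2 / expected_by_hand).sum()
dof_by_hand = (observed.shape[0] - 1) * (observed.shape[1] - 1)

chi2, p, dof, expected = chi2_contingency(observed)
print(chi2_by_hand, chi2, dof_by_hand, dof)

# Rule of thumb used above: every expected count should be at least 5
print('smallest expected count:', expected.min())
```

The `expected` array returned by `chi2_contingency` is also a convenient place to check the minimum-count rule of thumb.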
Further statistical tests can be conducted to explore in more detail their relationship and correlation.
All conclusions above are tentative and subject to further investigation and statistical tests.
The missing Age values could be adding an undefined bias to our hypothesis test and conclusions.
[1]: https://discussions.udacity.com/t/help-predicting-missing-age-values-in-titanic-dataset/194349/2
[2]: https://discussions.udacity.com/t/missing-age-titanic-data/165798/2
[3]: http://www.stat.columbia.edu/~gelman/arm/missing.pdf
[4]: http://www.ling.upenn.edu/~clight/chisquared.htm
[5]: http://stackoverflow.com/questions/25447700/annotate-bars-with-values-on-pandas-bar-plots
[6]: Invaluable Udacity Project Reviewer Feedback