Ioannis Karakasoglou Breier

Exploratory Analysis of the Titanic Dataset



Figure: Display ad for the Titanic's first sailing from New York, scheduled for 20 April 1912 but never made.

Project Overview


In this project, I investigate the Titanic dataset using the Python libraries SciPy, NumPy, Pandas, Matplotlib and Seaborn.

Dataset Information / Data Dictionary / Variable Notes


The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

Titanic Data - Contains demographics and passenger information from 891 of the 2224 passengers and crew on board the Titanic.

Variable   Definition                                   Key
Survived   Survival                                     0 = No, 1 = Yes
Pclass     Ticket class                                 1 = 1st, 2 = 2nd, 3 = 3rd
Sex        Sex
Age        Age in years
Sibsp      # of siblings / spouses aboard the Titanic
Parch      # of parents / children aboard the Titanic
Ticket     Ticket number
Fare       Passenger fare
Cabin      Cabin number
Embarked   Port of Embarkation                          C = Cherbourg, Q = Queenstown, S = Southampton

Pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower

Age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5

Sibsp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

Parch: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.

Initial Questions & Data Investigation


  • Did the Port of Embarkation affect the chances of surviving?
  • How did the other factors influence this?
  • Did Age affect the chances of surviving?

We will load the necessary Python libraries for our analysis and set some parameters:

Data Acquisition

Let's take a first look at our DataFrame using the pandas descriptive statistics functions.
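A minimal sketch of such a first look. The file name in the comment and the variable name `sample_df` are assumptions; to keep the example self-contained, it rebuilds only the five rows printed in the output that follows instead of reading the full file.

```python
import pandas as pd

# In the notebook the full dataset would be read from disk, for example:
#   titanic_df = pd.read_csv("titanic_data.csv")   # hypothetical file name
# Here only the first five rows are rebuilt so the sketch runs on its own.
sample_df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Th...",  # truncated in the source
        "Heikkinen, Miss. Laina",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
        "Allen, Mr. William Henry",
    ],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "SibSp": [1, 1, 0, 1, 0],
    "Parch": [0, 0, 0, 0, 0],
    "Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "373450"],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Cabin": [None, "C85", None, "C123", None],
    "Embarked": ["S", "C", "S", "S", "S"],
})

print(sample_df.head())          # first rows of the table
sample_df.info()                 # dtypes and non-null counts per column
print(sample_df.isnull().sum())  # number of missing values per column
```

On the full dataset, the non-null counts reported by `info()` are what reveal the gaps in the Age, Cabin and Embarked columns.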

Out[15]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

We observe that there are missing values in the Age, Cabin and Embarked columns.

Data Cleaning

Since most of the Cabin values are missing, we will omit this column, along with the Ticket, Fare, PassengerId and Name columns, which we will not use in this initial investigation. We will store the result in a new DataFrame in case we want to access the original one again.
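One way this step could look in pandas. The names `titanic_df` and `titanic_clean` are assumptions, and the two rows are only there to make the sketch runnable.

```python
import pandas as pd

# Two illustrative rows carrying the original twelve columns.
titanic_df = pd.DataFrame({
    "PassengerId": [1, 2],
    "Survived": [0, 1],
    "Pclass": [3, 1],
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
    "Sex": ["male", "female"],
    "Age": [22.0, 26.0],
    "SibSp": [1, 0],
    "Parch": [0, 0],
    "Ticket": ["A/5 21171", "STON/O2. 3101282"],
    "Fare": [7.25, 7.925],
    "Cabin": [None, None],
    "Embarked": ["S", "S"],
})

columns_to_drop = ["Cabin", "Ticket", "Fare", "PassengerId", "Name"]

# drop() returns a new DataFrame, so titanic_df itself is left unchanged
# and remains available if the dropped columns are needed later.
titanic_clean = titanic_df.drop(columns=columns_to_drop)

print(titanic_clean.columns.tolist())
```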

Out[18]:
   Survived  Pclass     Sex   Age  SibSp  Parch  Embarked
0         0       3    male  22.0      1      0         S
1         1       1  female  38.0      1      0         C
2         1       3  female  26.0      0      0         S
3         1       1  female  35.0      1      0         S
4         0       3    male  35.0      0      0         S
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 7 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Sex         891 non-null object
Age         714 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Embarked    889 non-null object
dtypes: float64(1), int64(4), object(2)
memory usage: 48.8+ KB

Only 714 of the 891 entries have an Age value, and 2 values are missing from the Embarked variable. We will have to decide whether to omit these entries or impute the missing values when we model relationships based on Age or Embarked.

Imputing missing data is a complicated procedure, and creating and evaluating a regression model to predict the missing values from the other variables is out of the scope of this analysis.
Simply filling them in with the mean or median, however, can bias any relationships that we are modeling [1] [2].

Therefore, whenever we model relationships based on these two variables, we will omit the missing Age and Embarked data using the Available-case Analysis [3] method (where different aspects of the problem are studied with different subsets of the data) and accept the limitation of this approach (lack of consistency between the analyzed subsets).
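A small sketch of what available-case analysis means in practice, using made-up toy rows (not taken from the dataset) and assumed variable names: each analysis drops only the rows that lack the variable it actually uses.

```python
import numpy as np
import pandas as pd

# Toy data for illustration: one missing Age and one missing Embarked value.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Age": [22.0, np.nan, 26.0, 35.0, 54.0],
    "Embarked": ["S", "C", None, "S", "Q"],
})

# Rows used when studying Age: only those with a known Age.
age_cases = toy.dropna(subset=["Age"])

# Rows used when studying Embarked: only those with a known port.
embarked_cases = toy.dropna(subset=["Embarked"])

print(len(age_cases), len(embarked_cases))
print(sorted(age_cases.index), sorted(embarked_cases.index))
```

Both subsets have the same size here but contain different passengers, which is exactly the consistency limitation mentioned above.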

Further Exploration - Visualizations


We will change the keys to make them more readable and explore the initial composition of the passengers.

We will write a function for this operation since we will use it for the other variables as well.
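A possible shape for such a helper. The function name `relabel` and the `LABELS` dictionary are hypothetical; the label wording follows the data dictionary.

```python
import pandas as pd

# Hypothetical label maps based on the data dictionary.
LABELS = {
    "Survived": {0: "No", 1: "Yes"},
    "Pclass": {1: "1st", 2: "2nd", 3: "3rd"},
    "Embarked": {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"},
}

def relabel(df, labels=LABELS):
    """Return a copy of df with coded values replaced by readable labels."""
    out = df.copy()
    for column, mapping in labels.items():
        if column in out.columns:
            # map() leaves missing values as NaN, so gaps stay visible.
            out[column] = out[column].map(mapping)
    return out

toy = pd.DataFrame({"Survived": [0, 1], "Pclass": [3, 1], "Embarked": ["S", "C"]})
readable = relabel(toy)
print(readable)
```

Working on a copy keeps the coded columns intact for any later numerical step.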

Analysis of the Embarked variable.

We start with the composition of passengers based on their port of embarkation.

Total No. of Embarked:889

We see that the majority of passengers (644 of 889 - 72%) embarked at Southampton and only 77 passengers - 9% - embarked at Queenstown.
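The percentages can be checked with a few lines of pandas. The Southampton and Queenstown counts are the ones quoted in the text; the Cherbourg count is not stated and is derived here as the remainder of the 889 passengers.

```python
import pandas as pd

# Cherbourg is derived as 889 - 644 - 77; the other two counts are quoted.
counts = pd.Series({
    "Southampton": 644,
    "Cherbourg": 889 - 644 - 77,
    "Queenstown": 77,
})

# Share of each port as a whole-number percentage.
shares = (100 * counts / counts.sum()).round(0).astype(int)
print(shares)

# With the full DataFrame, the equivalent would be something like
#   df["Embarked"].value_counts(normalize=True)
```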

Let's examine the percentages of passengers that survived, depending on their port of embarkation.

We will write functions for this operation since we will use them for the other variables as well:

Correlation of Survived with Embarked.
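A minimal sketch of one such function, assuming a row-normalised cross-tabulation is what is wanted. The name `percent_by_group` and the toy rows are hypothetical.

```python
import pandas as pd

def percent_by_group(df, group_col, target_col="Survived"):
    """Percentage of each target value within each group (rows sum to 100)."""
    table = pd.crosstab(df[group_col], df[target_col], normalize="index")
    return (100 * table).round(1)

# Toy rows for illustration only.
toy = pd.DataFrame({
    "Embarked": ["C", "C", "S", "S", "S", "S", "Q", "Q"],
    "Survived": [1, 1, 0, 0, 0, 1, 0, 1],
})

result = percent_by_group(toy, "Embarked")
print(result)
```

Because the grouping column is a parameter, the same call works for Pclass or Sex.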

We see that 55% of the passengers who embarked at Cherbourg survived, compared to 34% and 39% for Southampton and Queenstown respectively.

This is counter-intuitive at first glance. Looking deeper into the composition of the passengers by gender and class may give us more information about this relationship.

Correlation of Embarked with Pclass.

51% of the passengers who embarked at Cherbourg were in the 1st Pclass, compared to 20% and 3% for Southampton and Queenstown respectively.

It looks like class may play a role in the relationship between port of embarkation and survivability.

Let's explore survivability based on the Pclass variable further.

Correlation of Survived with Pclass.

63% of 1st class passengers survived compared to 47% and 24% for the 2nd and 3rd class respectively.

Indeed, survivability seems to be correlated with Pclass, and this could be the main factor behind the correlation with the port of embarkation as well.

Let's investigate the correlation of Embarked with Sex.

Correlation of Embarked with Sex.

There does not seem to be a clear pattern related to Sex that could be contributing to the increased survivability of the Cherbourg passengers.

Let's see the Sex composition of the whole population.

Analysis of the Sex variable.

Total No. of Sex:891

And the Sex composition of each Pclass.

Correlation of Sex with Pclass.

We observe that the 3rd class has a higher than average male percentage (71% vs 65%).

And the survivability based on Sex:

Correlation of Sex with Survived

74% of females survived compared to 19% of males. Females made up 44% of the 1st class (which had a 63% survivability) compared to 29% of the 3rd class (24% survivability).

We can observe this relationship in the following seaborn barplot where the black lines represent confidence intervals built using bootstrapping.
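To make the idea of a bootstrapped confidence interval concrete, here is a minimal percentile-bootstrap sketch in NumPy. It illustrates the general technique only; seaborn's internal procedure may differ in detail, and the function name, the resample count and the toy outcomes are assumptions.

```python
import numpy as np

def bootstrap_ci(values, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of values."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement n_boot times and record each resample's mean.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    # Cut off equal tails on both sides of the bootstrap distribution.
    tail = 100 * (1 - level) / 2
    low, high = np.percentile(boot_means, [tail, 100 - tail])
    return low, high

# Illustrative 0/1 survival outcomes for one group (not real data).
toy_survived = np.array([1] * 14 + [0] * 6)

low, high = bootstrap_ci(toy_survived)
print(toy_survived.mean(), low, high)
```

Smaller groups give wider intervals, which is why the error bars differ in length between bars of the plot.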

Correlation of Survived with Sex and Pclass.

The proportion of females who survived was almost 100% in the first class compared to 50% in the third class.

Further statistical tests need to be conducted, but it seems that Sex together with Pclass has a compound effect on survivability as well as on the correlation of other variables with survivability.

Analysis of the Age variable

Let's examine now the age distribution of the passengers and how Age affected their chances of survival.

We will start with a plot of the entire population.

No. of Passengers with not missing Age Values:714
Out[23]:
                     Age
Population Size  714.000
Mean              29.699
Std. Deviation    14.526
Min                0.420
25% Qt            20.125
Median            28.000
75% Qt            38.000
Max               80.000

And the density distribution and boxplot of the Age variable by survival status.

We observe that the percentage of children below 10 who survived was noticeably higher, and almost nobody over 70 years old survived. We would like to examine whether this was due to luck or to some other underlying reason (like the 'Women and Children first' rule).

The not-survived and survived age populations have the following descriptive statistics:

Out[26]:
                 Not-survived  Survived
Sample Size           424.000   290.000
Mean                   30.626    28.344
Std. Deviation         14.172    14.951
Min                     1.000     0.420
25% Qt                 21.000    19.000
Median                 28.000    28.000
75% Qt                 39.000    36.000
Max                    74.000    80.000

Survived - Age Statistical Chi-Squared Test

We will conduct a statistical chi-squared test to establish whether the Survived and Age variables are related.

Dependent Variable: Survived
Independent Variable: Age

$O_{i}$: the observed number of passengers in cell $i$ of the Survived by age-group table
$E_{i}$: the number of passengers expected in cell $i$ if Survived and Age were independent
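
For reference, Pearson's statistic combines these two quantities over all cells of the contingency table, with each expected count computed from the row and column totals under independence:

$$\chi^2 = \sum_{i} \frac{(O_{i} - E_{i})^2}{E_{i}}, \qquad E_{i} = \frac{(\text{row total}) \times (\text{column total})}{\text{grand total}}$$

For a table with $r$ rows and $c$ columns, the statistic is compared against a chi-squared distribution with $(r-1)(c-1)$ degrees of freedom.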

We will test the following hypotheses:

$H_0$: The Null Hypothesis, that there is no relationship between the Survived and Age variables (independent) $\rightarrow O_{i} = E_{i}$

$H_A$: The Alternative Hypothesis, that there is a relationship between the Survived and Age variables (dependent) $\rightarrow O_{i} \neq E_{i}$
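
Since Age is continuous, it has to be grouped before a contingency table can be built. A sketch of one way to do this with `pd.cut`; the bin edges are an assumption inferred from the age-group labels, and the toy rows are illustrative.

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "Age": [0.42, 9.0, 10.0, 29.0, 35.0, 70.0, 80.0],
    "Survived": [1, 1, 0, 0, 1, 0, 1],
})

# Assumed edges: ten-year bins, with the last bin widened to include age 80.
edges = [0, 10, 20, 30, 40, 50, 60, 70, 81]
labels = ["0-9", "10-19", "20-29", "30-39", "40-49", "50-59", "60-69", "70-80"]

# right=False makes each bin closed on the left, e.g. [10, 20) -> "10-19".
toy["age-groups"] = pd.cut(toy["Age"], bins=edges, labels=labels, right=False)

# Observed counts of Survived (rows) by age group (columns).
observed = pd.crosstab(toy["Survived"], toy["age-groups"])
print(observed)
```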

Out[28]:
age-groups  0-9  10-19  20-29  30-39  40-49  50-59  60-69  70-80
Survived
No           24     61    143     94     55     28     13      6
Yes          38     41     77     73     34     20      6      1

We will compute the Pearson's Chi-square statistic based on the above observations table:
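
A sketch of this computation with `scipy.stats.chi2_contingency`, using the observed counts from the table above. Whether the original notebook called this exact function is an assumption.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts copied from the observations table
# (rows: not survived, survived; columns: the eight age groups).
observed = np.array([
    [24, 61, 143, 94, 55, 28, 13, 6],
    [38, 41, 77, 73, 34, 20, 6, 1],
])

# Returns the statistic, the p-value, the degrees of freedom and the
# table of expected counts under independence.
chi2, p, dof, expected = chi2_contingency(observed)

print(f"chi2:{chi2:.4f}")
print(f"dof:{dof}")
print(f"p:{p:.4f}")
```

The returned `expected` array holds the counts implied by independence; inspecting it is a quick way to spot sparse cells, such as those of the 70-80 group here.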

chi2:17.4277216059
dof:7
p:0.0148368781128

For $\alpha = 0.05$ and 7 degrees of freedom, p (0.0148) is smaller than 0.05. We therefore reject the Null Hypothesis in favour of the alternative: Survived and Age appear to be dependent, i.e. there is a relationship between age and survivability.

Further statistical tests can be conducted to explore in more detail their relationship and correlation.

Note

  • All conclusions above are tentative and subject to further investigation and statistical tests.

  • The missing Age values could be adding an undefined bias to our hypothesis test and conclusions.