In this project, I investigate the Titanic dataset using the Python libraries SciPy, NumPy, pandas, Matplotlib and seaborn.
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
Titanic Data - Contains demographics and passenger information from 891 of the 2224 passengers and crew on board the Titanic.
Variable | Definition | Key |
---|---|---|
Survived | Survival | 0 = No, 1 = Yes |
Pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
Sex | Sex | |
Age | Age in years | |
Sibsp | # of siblings / spouses aboard the Titanic | |
Parch | # of parents / children aboard the Titanic | |
Ticket | Ticket number | |
Fare | Passenger fare | |
Cabin | Cabin number | |
Embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
Pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower
Age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5
Sibsp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)
Parch: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.
We will load the necessary Python libraries for our analysis and set some parameters:
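The setup cell is not shown here, so the following is only a minimal sketch of what it could look like. The display options are assumed values, the file name `titanic-data.csv` is an assumption, and the inline sample rows are made up so that the snippet runs without the data file. The plotting imports (Matplotlib, seaborn) would normally sit next to the pandas import.

```python
import io

import pandas as pd

# Display settings (assumed values, adjust to taste)
pd.set_option("display.max_columns", 20)
pd.set_option("display.precision", 2)

# A few made-up rows with the same columns as the Titanic CSV (not real passengers)
SAMPLE_CSV = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Doe, Mr. John",male,22,1,0,A 1001,7.25,,S
2,1,1,"Roe, Mrs. Jane",female,38,1,0,B 2002,71.28,C85,C
3,1,3,"Poe, Miss. Ann",female,,0,0,C 3003,7.92,,Q
"""

# With the real file this would be something like pd.read_csv("titanic-data.csv")
titanic_df = pd.read_csv(io.StringIO(SAMPLE_CSV))
print(titanic_df.shape)
```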
Let's take a first look at our DataFrame using the pandas descriptive statistics functions.
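The output of this step is not reproduced here. As a rough sketch, a first look could combine `info()`, `describe()` and a per-column count of missing values. The rows below are made up for illustration; the real DataFrame has 891 rows.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the real data
titanic_df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", "Q", None],
})

titanic_df.info()                     # dtypes and non-null counts per column
print(titanic_df.describe())          # summary statistics of the numeric columns
missing = titanic_df.isnull().sum()   # number of missing values per column
print(missing)
```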
We observe that there are missing values in the Age, Cabin and Embarked columns.
Since most of the `Cabin` column values are missing, we will omit this column along with the `Ticket`, `Fare`, `PassengerId` and `Name` columns, which we will not use for this initial investigation. We will make a new DataFrame in case we want to access the initial one again.
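A minimal sketch of this step, using made-up rows with the original column layout. The names `unused_columns` and `passengers_df` are hypothetical; `drop` returns a new DataFrame, so the initial one stays available.

```python
import pandas as pd

# Made-up rows with the original column layout (not the real data)
titanic_df = pd.DataFrame({
    "PassengerId": [1, 2],
    "Survived": [0, 1],
    "Pclass": [3, 1],
    "Name": ["Doe, Mr. John", "Roe, Mrs. Jane"],
    "Sex": ["male", "female"],
    "Age": [22.0, 38.0],
    "SibSp": [1, 1],
    "Parch": [0, 0],
    "Ticket": ["A 1001", "B 2002"],
    "Fare": [7.25, 71.28],
    "Cabin": [None, "C85"],
    "Embarked": ["S", "C"],
})

unused_columns = ["Cabin", "Ticket", "Fare", "PassengerId", "Name"]

# drop() returns a copy without these columns; titanic_df is left untouched
passengers_df = titanic_df.drop(columns=unused_columns)
print(passengers_df.columns.tolist())
```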
We have only 714 `Age` values out of the 891 entries, and 2 values are missing from the `Embarked` variable. We will have to decide whether to omit these or impute them with some values when we model relationships based on `Age` or `Embarked`.
Imputing missing data is a complicated procedure, and creating and evaluating a regression model to predict the missing values from the other variables is out of the scope of this analysis.
However, imputing with the mean or median can bias any relationships that we are modeling [1] [2].
Therefore we will omit the missing `Age` and `Embarked` data whenever we model relationships based on these two variables, using the Available-case Analysis [3] method (where different aspects of the problem are studied with different subsets of the data), and accept the limitation of this approach (lack of consistency between the analyzed subsets).
We will change the keys to make them more readable and explore the initial composition of the passengers.
We will make a function for this operation since we will use it for the other variables as well.
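The original helper is not shown, so here is one possible sketch. The function name `composition` and the sample rows are assumptions; the port labels come from the key in the variable table.

```python
import pandas as pd

# Made-up rows (not the real data)
df = pd.DataFrame({"Embarked": ["S", "S", "C", "Q", "S", "C"]})

# Replace the coded keys with readable labels
port_names = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}
df["Embarked"] = df["Embarked"].map(port_names)

def composition(frame, column):
    """Return the count and percentage of each value in `column` (missing values are ignored)."""
    counts = frame[column].value_counts()
    percent = (counts / counts.sum() * 100).round(1)
    return pd.DataFrame({"count": counts, "percent": percent})

embarked_composition = composition(df, "Embarked")
print(embarked_composition)
```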
`Embarked` variable
We start with the composition of passengers based on their port of embarkation.
We see that the majority of passengers (644 of 889, 72%) embarked in Southampton and only 77 passengers (9%) embarked in Queenstown.
Let's examine the percentages of passengers that survived, depending on their port of embarkation.
We will make functions for this operation since we will use them for the other variables as well:
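The original functions are not shown. One way to sketch this is a row-normalised cross-tabulation, which gives the share of non-survivors and survivors within each group. The function name `survival_percent` and the sample rows are assumptions.

```python
import pandas as pd

# Made-up rows (not the real data)
df = pd.DataFrame({
    "Embarked": ["Cherbourg", "Cherbourg", "Southampton",
                 "Southampton", "Southampton", "Queenstown"],
    "Survived": [1, 0, 0, 0, 1, 1],
})

def survival_percent(frame, column):
    """Percentage of non-survivors (0) and survivors (1) within each group of `column`."""
    # normalize="index" makes every row of the table sum to 1
    table = pd.crosstab(frame[column], frame["Survived"], normalize="index")
    return (table * 100).round(1)

by_port = survival_percent(df, "Embarked")
print(by_port)
```

The same helper can be reused with any other grouping column, for example `Pclass` or `Sex`.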
`Survived` with `Embarked`
We see that 55% of the passengers who embarked in Cherbourg survived, compared to 34% and 39% for Southampton and Queenstown respectively.
This is counter-intuitive at first glance. Looking deeper into the composition of the passengers by gender and class may give us more information about this relationship.
`Embarked` with `Pclass`
51% of the passengers who embarked in Cherbourg are in the 1st `Pclass`, compared to 20% and 3% for Southampton and Queenstown respectively.
It looks like class may play a role in the relationship between port of embarkation and survivability.
Let's explore the survivability based on the `Pclass` variable further.
`Survived` with `Pclass`
63% of 1st class passengers survived compared to 47% and 24% for the 2nd and 3rd class respectively.
Indeed, survivability seems to be correlated with `Pclass`, and this could be the main factor behind the correlation with the port of embarkation as well.
Let's investigate the relationship between `Embarked` and `Sex`.
`Embarked` with `Sex`
There does not seem to be a clear pattern related to `Sex` that could be contributing to the increased survivability of the Cherbourg passengers.
Let's see the `Sex` composition of the whole population.
`Sex` variable
And of each `Pclass`.
`Sex` with `Pclass`
We observe that the 3rd class has a higher than average (71% vs 65%) male percentage.
And the survivability based on `Sex`.
`Sex` with `Survived`
74% of females survived compared to 19% of males. Females made up 44% of the 1st class (which had a 65% survivability) compared to 29% of the 3rd class (24% survivability).
We can observe this relationship in the following seaborn barplot where the black lines represent confidence intervals built using bootstrapping.
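To make the error bars concrete: by default seaborn's `barplot` draws the mean of each group and a 95% confidence interval estimated by bootstrapping. The idea can be sketched in NumPy as a percentile bootstrap. The function name `bootstrap_ci`, the seed and the 0/1 flags below are assumptions for illustration, not the real data or seaborn's internal code.

```python
import numpy as np

def bootstrap_ci(values, n_boot=1000, ci=95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement n_boot times and take the mean of each resample
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    lower = np.percentile(boot_means, (100 - ci) / 2)
    upper = np.percentile(boot_means, 100 - (100 - ci) / 2)
    return lower, upper

# Made-up 0/1 survival flags for one group of passengers
survived = np.array([1, 1, 0, 1, 0, 1, 1, 1, 0, 1] * 5)
low, high = bootstrap_ci(survived)
print(survived.mean(), low, high)
```

Smaller groups give wider intervals, which is why the bars for small subgroups carry longer black lines.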
`Survived` with `Sex` and `Pclass`
The proportion of females who survived was almost 100% in the first class compared to 50% in the third class.
Further statistical tests need to be conducted, but it seems that Sex together with Pclass has a compound effect on survivability as well as on the correlation of other variables with survivability.
`Age` variable
Let's now examine the age distribution of the passengers and how `Age` affected their chances of survival.
We will start with a plot of the entire population.
And the density distribution and boxplot of the `Age` variable by survival status.
We observe that the percentage of children below 10 who survived was noticeably higher, and almost nobody over 70 years old survived. We would like to examine whether this was by luck or due to some underlying reason (like the 'Women and Children first' rule).
The not-survived and survived age populations have the following descriptive statistics:
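The statistics table itself is not reproduced here. A sketch of how such a summary could be produced with `groupby` and `describe`, on made-up rows and with the rows lacking an age dropped first:

```python
import numpy as np
import pandas as pd

# Made-up rows (not the real data)
df = pd.DataFrame({
    "Survived": [0, 0, 0, 1, 1, 1, 1],
    "Age": [22.0, 40.0, np.nan, 4.0, 30.0, 58.0, np.nan],
})

# Available-case approach: keep only the rows where Age is known
age_df = df.dropna(subset=["Age"])

# One row of descriptive statistics per survival outcome (0 = No, 1 = Yes)
age_stats = age_df.groupby("Survived")["Age"].describe()
print(age_stats)
```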
`Survived` - `Age` Statistical Chi-Squared Test
We will conduct a statistical chi-squared test to establish whether the `Survived` and `Age` variables are related.
Dependent Variable: Survived
Independent Variable: Age
$O_{i}$: the observed value of survived for the given age
$E_{i}$: the expected value of survived for the given age
We will test the following hypotheses:
$H_0$: The Null Hypothesis, that there is no relationship between the `Survived` and `Age` variables (independent) $\rightarrow O_{i} = E_{i}$
$H_A$: The Alternative Hypothesis, that there is a relationship between the `Survived` and `Age` variables (dependent) $\rightarrow O_{i} \neq E_{i}$
We will compute Pearson's chi-squared statistic based on the above observations table:
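Pearson's statistic is $\chi^2 = \sum_i (O_{i} - E_{i})^2 / E_{i}$, summed over all cells of the contingency table. The sketch below shows one way to compute it with `scipy.stats.chi2_contingency`. The data are synthetic, and the binning into eight ten-year age bands is an assumption chosen because it yields $(2-1)(8-1) = 7$ degrees of freedom; the bins actually used may differ.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic passengers (not the real data): 50 per ten-year age band
idx = np.arange(400)
age = (idx % 80) + 0.5
survived = ((idx % 3 == 0) | (age < 10)).astype(int)
df = pd.DataFrame({"Age": age, "Survived": survived})

# Assumed binning: [0, 10), [10, 20), ..., [70, 80)
age_band = pd.cut(df["Age"], bins=list(range(0, 81, 10)), right=False)

# Observed counts: rows are survival outcomes, columns are age bands
observed = pd.crosstab(df["Survived"], age_band)

# Returns the statistic, the p-value, the degrees of freedom and the expected counts
chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(f"chi2={chi2:.2f}, p={p_value:.4f}, dof={dof}")
```

If `p_value` falls below the chosen significance level, the null hypothesis of independence is rejected.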
For $\alpha = 0.05$ and 7 degrees of freedom, p is smaller than 0.05. We therefore reject the Null Hypothesis and accept that `Survived` and `Age` are dependent variables and that there is indeed a relationship between age and survivability.
Further statistical tests can be conducted to explore in more detail their relationship and correlation.
All conclusions above are tentative and subject to further investigation and statistical tests.
The missing Age values could be adding an undefined bias to our hypothesis test and conclusions.
[1]: https://discussions.udacity.com/t/help-predicting-missing-age-values-in-titanic-dataset/194349/2
[2]: https://discussions.udacity.com/t/missing-age-titanic-data/165798/2
[3]: http://www.stat.columbia.edu/~gelman/arm/missing.pdf
[4]: http://www.ling.upenn.edu/~clight/chisquared.htm
[5]: http://stackoverflow.com/questions/25447700/annotate-bars-with-values-on-pandas-bar-plots
[6]: Invaluable Udacity Project Reviewer Feedback