

Kaggle Titanic Project 🚢



Almost everyone knows the story of the Titanic, and most people have probably seen the movie.
If you are a data enthusiast, you are probably also familiar with the famous Kaggle Titanic machine learning project, which, as the name suggests, is based on the Titanic.

The goal is to apply a machine learning algorithm to predict, from a set of data features, whether each passenger survived the Titanic shipwreck.

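Below are the first 10 lines of the training CSV file (the header row plus nine passengers). The test CSV file has the same layout, but it is smaller and does not contain the Survived column, since that is what we have to predict.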

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,0,1,"McCarthy, Mr. Timothy J",male,54,0,0,17463,51.8625,E46,S
8,0,3,"Palsson, Master. Gosta Leonard",male,2,3,1,349909,21.075,,S
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27,0,2,347742,11.1333,,S

Data features :
- PassengerId : Unique identifier of the passenger
- Survived : 1 for survived and 0 for didn't survive
- Pclass : Ticket class, we have 3 classes : 1st, 2nd and 3rd
- Name : Name of the passenger
- Sex : Sex of the passenger
- Age : Age of the passenger
- SibSp : Number of siblings or spouses aboard the ship
- Parch : Number of parents or children aboard the ship
- Ticket : Ticket number
- Fare : Passenger fare
- Cabin : Cabin number
- Embarked : Port of embarkation, C = Cherbourg, Q = Queenstown, S = Southampton


Model

We will be using the Random Forest Classifier from the Scikit-learn library in Python.
A random forest is an ensemble of decision trees whose individual predictions are combined into a single one.
We can treat this problem as a classification task because we are trying to predict whether a given passenger survived the shipwreck.
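
To make the "ensemble of decision trees" idea concrete, here is a minimal sketch, assuming only the nine sample rows shown earlier as toy data (far too few for a real model). In Scikit-learn, the fitted trees are exposed through `estimators_`, and the forest's class probabilities are the average of the per-tree probabilities. The forest size and depth used here are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy data: numeric columns of the nine sample rows shown earlier.
X = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 3, 1, 3, 3],
    "SibSp":  [1, 1, 0, 1, 0, 0, 0, 3, 0],
    "Parch":  [0, 0, 0, 0, 0, 0, 0, 1, 2],
})
y = pd.Series([0, 1, 1, 1, 0, 0, 0, 0, 1], name="Survived")

# A small forest: 10 trees, each at most 2 levels deep (illustrative values).
forest = RandomForestClassifier(n_estimators=10, max_depth=2, random_state=1)
forest.fit(X, y)

# Ask every individual tree for its class probabilities.
tree_probas = np.stack(
    [tree.predict_proba(X.to_numpy()) for tree in forest.estimators_]
)

# The forest's answer is the mean of the per-tree probabilities.
forest_proba = forest.predict_proba(X)
print(np.allclose(tree_probas.mean(axis=0), forest_proba))
```

Averaging many shallow trees, each trained on a different random resample of the data, is what tends to make the forest more stable than any single tree.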

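import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Load the Kaggle CSV files (paths are an assumption, adjust them to your setup)
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")

# Target column we want to predict
y = train_data["Survived"]

# One-hot encode the selected features (turns "Sex" into numeric columns)
features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

# Train a forest of 100 trees, each at most 5 levels deep
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)

# Pair each test passenger with its predicted outcome
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})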
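
The `pd.get_dummies` call is needed because the model expects numeric inputs, while Sex is stored as text. As a small sketch, using two rows from the sample above, this is roughly what the encoding does: numeric columns are kept as they are, and the text column is split into one indicator column per value.

```python
import pandas as pd

# Two rows taken from the training sample shown earlier.
sample = pd.DataFrame({
    "Pclass": [3, 1],
    "Sex": ["male", "female"],
    "SibSp": [1, 1],
    "Parch": [0, 0],
})

# Text columns are replaced by one indicator column per distinct value.
encoded = pd.get_dummies(sample)
print(list(encoded.columns))
# ['Pclass', 'SibSp', 'Parch', 'Sex_female', 'Sex_male']
```

Because the training and test sets are encoded separately, this only lines up when both contain the same categories, which is the case for Sex here.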

The output, or what we are trying to predict, is the Survived column. As inputs, we pick the following data features : ticket class, sex, number of siblings or spouses, and number of parents or children.
There are many ways to select relevant features from a dataset (more on those in a later post). For example, looking at how strongly each feature correlates with the output is a simple place to start.
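
As a rough illustration of the correlation idea, the sketch below ranks a few features by the absolute value of their correlation with Survived. It only uses the nine sample rows shown earlier, so the numbers mean very little on their own; the same lines could be run on the full training set. The `IsFemale` column is a hypothetical helper name introduced here to turn Sex into a number.

```python
import pandas as pd

# The nine sample rows shown earlier (a subset of the columns).
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 0, 0, 1],
    "Pclass":   [3, 1, 3, 1, 3, 3, 1, 3, 3],
    "Sex":      ["male", "female", "female", "female", "male",
                 "male", "male", "male", "female"],
    "SibSp":    [1, 1, 0, 1, 0, 0, 0, 3, 0],
    "Parch":    [0, 0, 0, 0, 0, 0, 0, 1, 2],
    "Fare":     [7.25, 71.2833, 7.925, 53.1, 8.05,
                 8.4583, 51.8625, 21.075, 11.1333],
})

# Correlation needs numbers, so encode Sex as 0/1 (hypothetical column name).
sample["IsFemale"] = (sample["Sex"] == "female").astype(int)

# Pearson correlation of every numeric feature with the target.
correlations = sample.drop(columns="Sex").corr()["Survived"].drop("Survived")

# Strongest relationships first, regardless of sign.
ranked = correlations.abs().sort_values(ascending=False)
print(ranked)
```

In this tiny sample, every woman survived and every man did not, so `IsFemale` comes out on top with a correlation of 1. On the full dataset the relationship is weaker, which is why the ranking should be recomputed there before drawing conclusions.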


Results

The output defined in the code above gives us the following results :

PassengerId    Survived
   892            0
   893            1
   894            0
   895            0
   896            1

We obtain a table like the one above (only the first 5 rows are displayed here).
It contains the PassengerId for identification and the predicted Survived value.
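
A table like this can be previewed and exported with a few lines of pandas. The sketch below rebuilds only the five rows shown above and writes them as CSV into an in-memory buffer. Passing a file name such as `submission.csv` (an assumed name) to `to_csv` instead would write the file to disk.

```python
import io

import pandas as pd

# The five predictions displayed in the table above.
output = pd.DataFrame({
    "PassengerId": [892, 893, 894, 895, 896],
    "Survived": [0, 1, 0, 0, 1],
})

# Preview the first rows.
print(output.head(5))

# Export as CSV; index=False keeps the row index out of the file.
buffer = io.StringIO()
output.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```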


We have managed to predict which passengers survived the Titanic shipwreck and which didn't, using the Random Forest Classifier.
We can try picking other features, for example, to see how the results change.
If you have any questions or want to know more about the project, feel free to email me or tweet at me!

Source : Titanic - Machine Learning from Disaster
Related : Housing Prices Prediction

