Almost everybody is familiar with the Titanic and the story
behind it, and most people have probably seen the movie too.
If you are a data enthusiast, then you are probably also familiar with the famous Kaggle Titanic machine learning
project which, as the name suggests, is based on the Titanic.
The goal is to apply a machine learning algorithm to predict whether or not each passenger survived the Titanic
shipwreck, based on some data features.
Below are the first 10 lines of the training CSV file (a header row followed by nine passengers). The test CSV
file is similar to this one, but it has no Survived column and is smaller.
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,0,1,"McCarthy, Mr. Timothy J",male,54,0,0,17463,51.8625,E46,S
8,0,3,"Palsson, Master. Gosta Leonard",male,2,3,1,349909,21.075,,S
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27,0,2,347742,11.1333,,S
Data features:
- PassengerId: Unique identifier of the passenger
- Survived: 1 for survived and 0 for didn't survive
- Pclass: Ticket class, one of three classes: 1st, 2nd or 3rd
- Name: Name of the passenger
- Sex: Sex of the passenger
- Age: Age of the passenger
- SibSp: Number of siblings or spouses aboard the ship
- Parch: Number of parents or children aboard the ship
- Ticket: Ticket number
- Fare: Passenger fare
- Cabin: Cabin number
- Embarked: Port of embarkation, C = Cherbourg, Q = Queenstown, S = Southampton
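To get a feel for these features, here is a minimal sketch that loads the nine sample rows shown above with pandas and checks their types and missing values. The rows are inlined as a string (the name SAMPLE_CSV is only for illustration) so the snippet runs without the Kaggle files; with the real data you would pass the path of the training file to pd.read_csv instead.

```python
import io

import pandas as pd

# The nine sample rows shown above, inlined so the example runs without train.csv.
SAMPLE_CSV = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,0,1,"McCarthy, Mr. Timothy J",male,54,0,0,17463,51.8625,E46,S
8,0,3,"Palsson, Master. Gosta Leonard",male,2,3,1,349909,21.075,,S
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27,0,2,347742,11.1333,,S
"""

sample = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Number of rows and columns, then the type pandas inferred for each column.
print(sample.shape)
print(sample.dtypes)

# Count missing values per column and show only the columns that have some.
missing = sample.isna().sum()
print(missing[missing > 0])
```

Even in this tiny sample, Age and Cabin have gaps, which is one reason a first model may prefer features that are always filled in.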
We will be using the Random Forest Classifier
from the scikit-learn library in Python.
A random forest is an ensemble of decision trees: each tree is trained on a random sample of the data, and the
forest combines their predictions.
We can treat this problem as a classification task because we are trying to predict whether a given passenger
survived the shipwreck or not.
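To make the "ensemble of decision trees" idea concrete, here is a small sketch on synthetic data (generated with make_classification, so it is not Titanic data, and the variable names are just for illustration). It fits a small forest, looks at the individual trees through the estimators_ attribute, and checks that averaging the trees' predicted probabilities gives the same numbers as the forest's own predict_proba.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only to illustrate the ensemble idea (not Titanic data).
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

forest = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0)
forest.fit(X_demo, y_demo)

# The fitted forest exposes its individual decision trees.
trees = forest.estimators_
print(len(trees))

# Average the class probabilities predicted by each tree ...
tree_average = np.mean([tree.predict_proba(X_demo) for tree in trees], axis=0)

# ... and compare them with what the forest itself reports.
print(np.allclose(tree_average, forest.predict_proba(X_demo)))
```

The final prediction of the forest is then simply the class with the highest averaged probability.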
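import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Load the Kaggle files (these paths are an assumption; adjust them to your setup)
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")

# Target column, selected features, and one-hot encoding of the categorical Sex column
y = train_data["Survived"]
features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

# Train the forest, then predict survival for the test passengers
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})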
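The call to pd.get_dummies in the code above is what turns the text column Sex into numeric columns the model can use. Here is a small sketch of what it does, on tiny made-up frames (train_demo and test_demo are hypothetical, not real passengers). It also shows an optional safeguard that is not part of the original code: if a category is missing from the test data, the two encoded frames end up with different columns, and reindex can align them.

```python
import pandas as pd

# Tiny made-up frames, only to show the encoding; these are not real passengers.
train_demo = pd.DataFrame({"Pclass": [3, 1, 2], "Sex": ["male", "female", "male"]})
test_demo = pd.DataFrame({"Pclass": [1, 3], "Sex": ["male", "male"]})

# Text columns become one indicator column per category; numeric columns are kept.
X_train_demo = pd.get_dummies(train_demo)
X_test_demo = pd.get_dummies(test_demo)
print(list(X_train_demo.columns))
print(list(X_test_demo.columns))  # no Sex_female here: no female in test_demo

# Align the test columns to the training columns, filling absent dummies with 0.
X_test_aligned = X_test_demo.reindex(columns=X_train_demo.columns, fill_value=0)
print(list(X_test_aligned.columns))
```

For the four features used in this post the alignment is unlikely to matter, since both values of Sex appear in the training and test files, but it becomes useful as soon as we add rarer categorical features.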
The output, or what we are trying to predict, is the Survived column. We pick the following
data features: ticket class, sex,
number of siblings or spouses, and number of parents or children.
There are many methods for extracting relevant features from a dataset (more on those in a later
post). For example, comparing the correlation of each feature with the output could work
wonders!
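As a rough sketch of the correlation idea, the snippet below rebuilds a few columns of the nine sample rows shown earlier and computes the Pearson correlation of each feature with Survived. The IsFemale column is a helper introduced here for illustration, since correlation needs numbers rather than text.

```python
import pandas as pd

# Selected columns from the nine sample rows shown earlier in this post.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 0, 0, 1],
    "Pclass":   [3, 1, 3, 1, 3, 3, 1, 3, 3],
    "Sex":      ["male", "female", "female", "female", "male",
                 "male", "male", "male", "female"],
    "SibSp":    [1, 1, 0, 1, 0, 0, 0, 3, 0],
    "Parch":    [0, 0, 0, 0, 0, 0, 0, 1, 2],
})

# Correlation needs numbers, so encode Sex as 0/1 first (IsFemale is a helper name).
numeric = sample.assign(IsFemale=(sample["Sex"] == "female").astype(int))
numeric = numeric.drop(columns="Sex")

# Pearson correlation of every feature with the target, strongest first.
correlations = numeric.corr()["Survived"].drop("Survived")
print(correlations.sort_values(key=abs, ascending=False))
```

Nine rows are far too few to draw conclusions: in this sample every woman survived and every man did not, so IsFemale looks perfectly correlated with the output. On the full training file the numbers will be less extreme, but the same approach applies.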
The output defined in the code above will give us the following results:
PassengerId Survived
892 0
893 1
894 0
895 0
896 1
We obtain a table like the one above (here, we are only displaying the first 5
rows).
We have the PassengerId for identification and the predicted output Survived.
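If we want to keep or submit this table, one option is to save it as a CSV file. Here is a minimal sketch using the five rows displayed above, rebuilt by hand (output_demo is a stand-in for the real output frame, and the file name in the last comment is an assumption).

```python
import pandas as pd

# The five predictions displayed above, rebuilt by hand for illustration.
output_demo = pd.DataFrame({
    "PassengerId": [892, 893, 894, 895, 896],
    "Survived": [0, 1, 0, 0, 1],
})

# index=False keeps the row index out of the CSV, leaving only the two columns.
# Without a path, to_csv returns the CSV content as a string.
csv_text = output_demo.to_csv(index=False)
print(csv_text)

# To write a file instead, pass a path ("submission.csv" is an assumed name):
# output_demo.to_csv("submission.csv", index=False)
```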
We have managed to predict which passengers survived
the Titanic shipwreck and which didn't, using the Random Forest Classifier.
We can try picking other features, for example, to see how different our results will be.
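One possible way to compare feature sets is cross-validation on the training data. The sketch below is an assumption about how one could do it, not part of the original project: score_features is a hypothetical helper, and the demo frame is made-up random data with Titanic-like column names so that the snippet runs on its own. With the real data, we would pass train_data instead.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def score_features(data, features, target="Survived"):
    """Return the mean 5-fold cross-validated accuracy for one feature set."""
    X = pd.get_dummies(data[features])
    y = data[target]
    model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
    return cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()


# Made-up stand-in data so the sketch runs on its own; replace with train_data.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),
    "Sex": rng.choice(["male", "female"], size=n),
    "SibSp": rng.integers(0, 3, size=n),
    "Parch": rng.integers(0, 3, size=n),
})
# Synthetic target loosely tied to Sex, with roughly 20% of the labels flipped.
demo["Survived"] = ((demo["Sex"] == "female") ^ (rng.random(n) < 0.2)).astype(int)

# Compare the original feature set with the same set minus Sex.
for features in (["Pclass", "Sex", "SibSp", "Parch"], ["Pclass", "SibSp", "Parch"]):
    print(features, round(score_features(demo, features), 3))
```

The scores on this made-up data mean nothing by themselves; the point is only the pattern of scoring each candidate feature set the same way before choosing one.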
If you have any questions or want to know more about the project,
feel free to email me or
tweet at me!