Ever wanted to predict house prices?
This is your chance with this dataset of residential homes in Ames, Iowa.
We will be using some feature engineering methods and some regression machine learning models and techniques.
This is a Kaggle competition intended for data science students who have just finished a machine learning course.
We have a large dataset with many columns (features) to pick from.
We will be focusing on the following features :
- LotArea : Lot size in square feet
- YearBuilt : Original construction date
- 1stFlrSF : First Floor square feet
- 2ndFlrSF : Second floor square feet
- FullBath : Full bathrooms above grade
- BedroomAbvGr : Bedrooms above grade
- TotRmsAbvGrd : Total rooms above grade (does not include bathrooms)
We want to predict the SalePrice column, which is the property's sale price in dollars.
We will be using the Random Forest Regressor algorithm from the Scikit-learn library in Python.
Random Forest Regressor is a supervised machine learning algorithm that predicts a continuous value
by training many decision trees, each on a random sample of the training data, and averaging their predictions.
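To make the averaging idea concrete, here is a minimal sketch on small synthetic data (not the Ames dataset; the names demo_X, demo_y and forest are made up for illustration). It checks that the forest's prediction matches the mean of its individual trees' predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (illustrative only, not the Ames dataset)
rng = np.random.default_rng(0)
demo_X = rng.uniform(0, 10, size=(200, 3))
demo_y = 3.0 * demo_X[:, 0] + demo_X[:, 1] ** 2 + rng.normal(0, 1, size=200)

forest = RandomForestRegressor(n_estimators=20, random_state=1)
forest.fit(demo_X, demo_y)

# The fitted decision trees are exposed in forest.estimators_
tree_predictions = np.array(
    [tree.predict(demo_X[:5]) for tree in forest.estimators_]
)
mean_of_trees = tree_predictions.mean(axis=0)
forest_predictions = forest.predict(demo_X[:5])

# Compare the forest output with the average over its trees
print(np.allclose(mean_of_trees, forest_predictions))
```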
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
# Load the training data
iowa_file_path = '../input/train.csv'
home_data = pd.read_csv(iowa_file_path)
# Target: the column we want to predict
y = home_data.SalePrice
# Features: the columns used as model inputs
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[features]
X.head()
# Hold out part of the data for validation
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
# Define a random forest model
rf_model = RandomForestRegressor(random_state=1)
rf_model.fit(train_X, train_y)
rf_val_predictions = rf_model.predict(val_X)
# scikit-learn's convention is (y_true, y_pred)
rf_val_mae = mean_absolute_error(val_y, rf_val_predictions)
print("Validation MAE for Random Forest Model: {:,.0f}".format(rf_val_mae))
We use the mean absolute error (MAE) to measure how far our predictions are from the actual sale prices: it is the average of the absolute differences between predicted and true values.
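As a small worked example of that metric, the sketch below computes the mean absolute error by hand and compares it with scikit-learn's mean_absolute_error. The three prices are made up for illustration and do not come from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up prices for illustration (not from the Ames dataset)
actual_prices = np.array([200000, 150000, 300000])
predicted_prices = np.array([210000, 140000, 330000])

# Absolute errors are 10,000, 10,000 and 30,000; their mean is the MAE
manual_mae = np.mean(np.abs(actual_prices - predicted_prices))
sklearn_mae = mean_absolute_error(actual_prices, predicted_prices)

print("Manual MAE:  {:,.0f}".format(manual_mae))   # Manual MAE:  16,667
print("sklearn MAE: {:,.0f}".format(sklearn_mae))  # sklearn MAE: 16,667
```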
The variable rf_val_predictions contains our predictions: the predicted sale price of each property in val_X.
We have split our data into a training set (variables train_X and train_y) and a validation set (variables val_X and val_y).
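To see what the split does, here is a minimal sketch on a toy table of 100 rows (the demo_ and tr_/va_ names are made up for illustration). When no test_size is given, train_test_split holds out 25% of the rows, and the features and target stay aligned row by row.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with 100 rows (illustrative only)
demo_X = pd.DataFrame({"LotArea": range(100), "YearBuilt": range(1900, 2000)})
demo_y = pd.Series(range(100), name="SalePrice")

tr_X, va_X, tr_y, va_y = train_test_split(demo_X, demo_y, random_state=1)

# Default split: 75 rows for training, 25 for validation
print(len(tr_X), len(va_X))

# Features and target keep matching row labels after shuffling
print((tr_X.index == tr_y.index).all())
```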
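For the mean absolute error, we got about 21 thousand. MAE is expressed in the same units as SalePrice, so this is a dollar amount
rather than a percentage: on average, our predictions are roughly $21,000 away from the true value (in the variable val_y).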
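Since MAE comes out in the units of the target (dollars here), expressing the error as a percentage takes one more step. Two common options are sketched below with made-up prices: dividing the MAE by the mean actual price, or averaging the per-property percentage errors (often called MAPE). The variable names are illustrative.

```python
import numpy as np

# Made-up prices for illustration (not from the Ames dataset)
actual_prices = np.array([200000, 150000, 300000])
predicted_prices = np.array([210000, 140000, 330000])

absolute_errors = np.abs(actual_prices - predicted_prices)

# MAE in dollars
mae_dollars = absolute_errors.mean()

# Option 1: MAE relative to the average sale price
mae_share_of_mean = mae_dollars / actual_prices.mean()

# Option 2: mean absolute percentage error (per-property relative error)
mape = np.mean(absolute_errors / actual_prices)

print("MAE: {:,.0f} dollars".format(mae_dollars))
print("MAE / mean price: {:.1%}".format(mae_share_of_mean))
print("MAPE: {:.1%}".format(mape))
```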
We have managed to predict house prices using the random forest regressor machine learning algorithm.
We can try picking other features, for example, to see how the result changes.
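One way to run that kind of experiment is sketched below. The helper validation_mae and the synthetic demo_data table are assumptions made for illustration; with the real data you would pass home_data and lists of its column names instead.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split


def validation_mae(data, feature_names, target="SalePrice"):
    """Hypothetical helper: fit a forest on the given columns, return validation MAE."""
    X = data[feature_names]
    y = data[target]
    train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
    model = RandomForestRegressor(random_state=1)
    model.fit(train_X, train_y)
    return mean_absolute_error(val_y, model.predict(val_X))


# Synthetic stand-in data (illustrative only, not the Ames dataset)
rng = np.random.default_rng(0)
n = 300
demo_data = pd.DataFrame({
    "LotArea": rng.uniform(2000, 15000, n),
    "YearBuilt": rng.integers(1900, 2010, n),
    "1stFlrSF": rng.uniform(500, 2000, n),
})
# In this toy setup the price depends on LotArea and 1stFlrSF only
demo_data["SalePrice"] = (
    5 * demo_data["LotArea"] + 80 * demo_data["1stFlrSF"] + rng.normal(0, 5000, n)
)

candidate_feature_sets = {
    "year_only": ["YearBuilt"],
    "area_and_floor": ["LotArea", "1stFlrSF"],
    "all_three": ["LotArea", "YearBuilt", "1stFlrSF"],
}

# Score each candidate feature set on the same validation split
results = {
    name: validation_mae(demo_data, columns)
    for name, columns in candidate_feature_sets.items()
}
for name, mae in results.items():
    print("{}: {:,.0f}".format(name, mae))
```

Keeping random_state fixed in both the split and the model means differences in MAE come from the feature choice rather than from a different shuffle.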
If you have any questions or want to know more about the project, feel free to email me or tweet at me!