
Data Analysis - Master's Project 🔍



This is my first-year master's project. The project is called TER (Travaux Encadrés de Recherche = Supervised Research Work) and the subject is Data Analysis.
In the project, I talk about Principal Component Analysis (PCA), Singular Value Decomposition (SVD), Least Squares, and Pseudo-Inverses.
Principal Component Analysis - PCA, or ACP in French - is an unsupervised machine learning algorithm and one of the main dimensionality reduction techniques.

In the last chapter, I applied the principal component analysis to a dataset that was provided by my supervisor.
The dataset contains 200 Swiss banknotes, 100 genuine and 100 counterfeit. The goal is to distinguish genuine banknotes from counterfeit ones using different measurements of the banknotes.

(All documents displayed will be in French!)


PCA

The goal of PCA is to reduce the number of variables in a dataset. We want to keep only important and decisive variables, all while preserving as much information as possible.
These "new" variables are called Principal Components.
The rows of the data represent the individuals, and the columns represent the variables. Each individual has an associated weight.
It is recommended to scale and/or to center your data before applying PCA.
In the project, I chose the Euclidean distance to measure the distance between individuals, and as the origin I chose the center of gravity of the point cloud.
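
To make this concrete, here is a minimal sketch (my own toy example, not from the project) of centering/scaling and PCA done by hand in R; the matrix X is made up:

# Toy data: 5 individuals, 2 variables (made-up numbers)
X <- matrix(c(2.5, 2.4,
              0.5, 0.7,
              2.2, 2.9,
              1.9, 2.2,
              3.1, 3.0), ncol = 2, byrow = TRUE)

# Center and scale each column
Xs <- scale(X, center = TRUE, scale = TRUE)

# Principal components = eigenvectors of the correlation matrix
eig <- eigen(cov(Xs))
scores <- Xs %*% eig$vectors      # coordinates of the individuals

# Same result (up to sign) with the built-in function
prcomp(X, center = TRUE, scale. = TRUE)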


SVD

Singular value decomposition extends matrix diagonalization to rectangular matrices.
Decomposing a matrix \( A \) into singular values gives us two orthogonal matrices and a (rectangular) diagonal matrix containing the singular values: \( A = U \Sigma V^{\top} \).
We define singular values and singular vectors in the project, and we state the theorem of singular value decomposition.
Lastly, we give an example of a rectangular matrix and its singular value decomposition.
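
As a quick illustration (a toy matrix, not the one from the report), R's built-in svd() returns the three factors directly:

A <- matrix(1:6, nrow = 2)        # a 2x3 rectangular matrix
s <- svd(A)
s$d                               # the singular values
s$u                               # left singular vectors (orthonormal columns)
s$v                               # right singular vectors (orthonormal columns)

# Reconstruction: A = U * Sigma * t(V)
s$u %*% diag(s$d) %*% t(s$v)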


Least Squares and Pseudo-Inverses

The least squares method is very well-known.

We study the linear system \( Ax = b \): we want to find a solution \( x \) such that the vector \( Ax \) is as close as possible to \( b \).
In other words, we look for the best approximate solution by solving the minimization problem \( \min_{x} \|Ax - b\|_{2} \).

In the project, I state the Gauss-Markov theorem and discuss ordinary least squares and its formula in statistics.
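
For the record, when \( A \) has full column rank the OLS solution of \( \min_{x} \|Ax - b\|_{2} \) is \( \hat{x} = (A^{\top}A)^{-1}A^{\top}b \). A quick sanity check in R on made-up data (the names A and b are just for illustration):

set.seed(1)
A <- cbind(1, rnorm(10))                  # design matrix with an intercept column
b <- 2 + 3 * A[, 2] + rnorm(10, sd = 0.1) # noisy linear response

# Solve the normal equations t(A) A x = t(A) b
x_hat <- solve(t(A) %*% A, t(A) %*% b)
x_hat

# Same estimate with the built-in linear model
coef(lm(b ~ A[, 2]))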

Moving on to Pseudo-inverses.

In the project, I start by defining the pseudo-inverse of an arbitrary matrix \( A \). I then define and discuss the Moore-Penrose pseudo-inverse, and I state some important propositions and theorems.
Before ending the theoretical part of the project, I define the QR decomposition and say a few words about the Gram-Schmidt process.
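
As a small companion sketch (my own illustration, not the project's code), the Moore-Penrose pseudo-inverse can be computed from the SVD, and R's qr() gives the QR factors; MASS::ginv() is used only as a cross-check:

library(MASS)                     # for ginv()

A <- matrix(c(1, 2, 3,
              4, 5, 6), nrow = 3) # a 3x2 rectangular matrix (filled column-wise)

# Moore-Penrose pseudo-inverse from the SVD: A+ = V * diag(1/d) * t(U)
# (invert only the nonzero singular values; both are nonzero here)
s <- svd(A)
A_pinv <- s$v %*% diag(1 / s$d) %*% t(s$u)
all.equal(A_pinv, ginv(A))        # TRUE

# QR decomposition (R uses Householder reflections, the same
# factorization the Gram-Schmidt process produces)
qr_A <- qr(A)
Q <- qr.Q(qr_A)
R <- qr.R(qr_A)
all.equal(Q %*% R, A)             # TRUE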


PCA Application

As stated before, I did my application on 200 Swiss banknotes, 100 genuine and 100 counterfeit.
The first 100 rows are genuine banknotes and the last 100 are counterfeit.

This is what the first 10 and last 10 rows of the dataset look like:



The columns, or variables, are:

- Diagonal: Length of the diagonal of the banknote (mm)
- Top: Top margin width of the banknote (mm)
- Bottom: Bottom margin width of the banknote (mm)
- Right: Width of the right edge of the banknote (mm)
- Left: Width of the left edge of the banknote (mm)
- Length: Length of the banknote (mm)


I decided to create another dataset in a separate file, grouping all the genuine banknotes together and all the counterfeit banknotes together.
The goal of this approach is to create a biplot of the "grouped" individuals and the variables that is easier to read and interpret.
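
Since the first 100 rows are genuine and the last 100 are counterfeit, the group column can be built directly in R. A hypothetical sketch (file names and paths are mine; the project's code below reads an .xlsx version of the same data):

billets <- readxl::read_excel("banque.xlsx")          # original dataset (illustrative path)
billets$Groupe <- rep(c("Vrai", "Faux"), each = 100)  # rows 1-100 genuine, 101-200 counterfeit
write.csv(billets, "banque_with_groups.csv", row.names = FALSE)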

The PCA application was coded in R.

library("readxl")
library("FactoMineR")
library("factoextra")
library("corrplot")

filename <- "D:/hind-/M1 MATHS/2021-2022/ter/banque.xlsx"

billets <- read_excel(filename)

data <- data.frame(billets)

pca <- PCA(data[-1], graph = FALSE)

eig.val <- get_eigenvalue(pca)

var <- get_pca_var(pca)

# Bar chart of the eigenvalues (scree plot)
fviz_eig(pca, addlabels = TRUE, ylim = c(0, 50))

# Correlation circle of the variables
fviz_pca_var(pca, col.var = "black")

# Quality of representation of the variables on axis 1
fviz_cos2(pca, choice = "var", axes = 1)

# Quality of representation of the variables on axis 2
fviz_cos2(pca, choice = "var", axes = 2)

# Relative contributions of the variables to axis 1
fviz_contrib(pca, choice = "var", axes = 1, top = 10)

# Relative contributions of the variables to axis 2
fviz_contrib(pca, choice = "var", axes = 2, top = 10)

# Quality of representation of the individuals
fviz_pca_ind(pca, col.ind = "cos2",
             gradient.cols = c("aliceblue", "steelblue", "darkblue"),
             repel = TRUE)

# We add top = 30 below to display
# only the first 30 values

# Quality of representation of the individuals on axis 1
fviz_cos2(pca, choice = "ind", axes = 1, top = 30)

# Quality of representation of the individuals on axis 2
fviz_cos2(pca, choice = "ind", axes = 2, top = 30)

# Relative contributions of the individuals to axis 1
fviz_contrib(pca, choice = "ind", axes = 1, top = 30)

# Relative contributions of the individuals to axis 2
fviz_contrib(pca, choice = "ind", axes = 2, top = 30)

##### After grouping individuals

other_filename <- "D:/hind-/M1 MATHS/2021-2022/ter/banque_with_groups.xlsx"

billets_with_groups <- read_excel(other_filename)

data_with_groups <- data.frame(billets_with_groups)

pca_with_groups <- PCA(data_with_groups[-1], graph = FALSE)

# New quality-of-representation plot of the individuals
fviz_pca_ind(pca_with_groups,
             geom.ind = "point",
             col.ind = data_with_groups$Groupe,
             palette = c("#00AFBB", "#E7B800", "#FC4E07"),
             addEllipses = TRUE,
             legend.title = "Groupes"
)

# Biplot of individuals and variables
fviz_pca_biplot(pca_with_groups,
                col.ind = data_with_groups$Groupe, palette = "jco",
                addEllipses = TRUE, label = "var",
                col.var = "black", repel = TRUE,
                legend.title = "Groupes")


We get the following results:
(Pages from my thesis presentation)



Take a moment and look at all the plots. What can you conclude?

Let's take the first plot. We can see that the first two dimensions account for 70.4% of the variance, which is more than half, so we will focus on these two dimensions.

The second plot is the correlation circle. We can see that the variables Diagonale (Diagonal), Hauteur.Gauche (Left) and Hauteur.Droite (Right) are well represented.
Hauteur.Gauche (Left) and Hauteur.Droite (Right) are positively correlated.
Diagonale (Diagonal) and Bord.Inferieur (Bottom) are negatively correlated.

The third and fourth plots show that Diagonale (Diagonal), Hauteur.Droite (Right) and Hauteur.Gauche (Left) contribute the most to the first dimension,
while the second dimension is dominated by the variable Longueur (Length).

The fifth plot shows how the individuals, in our case the banknotes, are represented.

The last plot combines individuals and variables: in Groupes, "Faux" is the group of counterfeit banknotes and "Vrai" is the group of genuine ones.

Right away we can tell that the genuine banknotes and the counterfeit ones form two well-defined clusters.

We can see that the group of counterfeit banknotes points in the direction of the variables Hauteur.Droite (Right), Hauteur.Gauche (Left), Bord.Inferieur (Bottom) and Bord.Supérieur (Top),
while the group of genuine banknotes points in the direction of the variables Diagonale (Diagonal) and Longueur (Length).

This means that the counterfeit banknotes tend to have larger values for Hauteur.Droite (Right), Hauteur.Gauche (Left), Bord.Inferieur (Bottom) and Bord.Supérieur (Top), while the genuine banknotes tend to have larger values for Diagonale (Diagonal) and Longueur (Length).
We can also verify our result by simply looking at our dataset !
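
For instance, comparing the column means of the two groups is enough to confirm the biplot's reading. A quick check, assuming the grouped data frame from the code above:

# Mean of each measurement per group
aggregate(. ~ Groupe, data = data_with_groups, FUN = mean)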


This was my thesis project for the first year of my master's degree. It was really interesting and I had a lot of fun working on it.
For information, I wrote the report in LaTeX and the presentation in Beamer!
If you have any questions or want to know more about the project, feel free to email me or tweet at me!


