Environnement, packages, références¶
Environnement Julia¶
In [40]:
# activer l'environnement
using Pkg
Pkg.activate("env_julia_mvstats")
Activating project at `c:\Users\ricco\Desktop\demo\env_julia_mvstats`
In [41]:
# liste des packages installés
Pkg.status()
Status `C:\Users\ricco\Desktop\demo\env_julia_mvstats\Project.toml` [324d7699] CategoricalArrays v1.1.0 [a93c6f00] DataFrames v1.8.2 [7073ff75] IJulia v1.34.4 [add582a8] MLJ v0.23.2 [1b6a4a23] MLJMultivariateStatsInterface v0.5.4 [6f286f6a] MultivariateStats v0.10.4 [91a5bcdd] Plots v1.41.6 [fdbf4ff8] XLSX v0.11.4
Référence bibliographique¶
"Pratique de l'analyse discriminante linéaire" -- https://hal.science/hal-04868585v1
Analyse discriminante prédictive¶
Importation et préparation des données d'apprentissage¶
In [42]:
# packages
import DataFrames as DFR
import XLSX
# lecture des données d'apprentissage
dfTrain = DFR.DataFrame(XLSX.readtable("./wave_train.xlsx"))
# premières lignes
println(DFR.size(dfTrain))
(300, 22)
In [43]:
# description
println(DFR.describe(dfTrain))
22×7 DataFrame Row │ variable mean min median max nmissing eltype │ Symbol Union… Any Union… Any Int64 DataType ─────┼─────────────────────────────────────────────────────────────────────── 1 │ x1 -0.0466335 -2.52043 0.01854 2.66637 0 Float64 2 │ x2 0.320517 -2.89351 0.294345 3.54477 0 Float64 3 │ x3 0.67669 -2.42351 0.68652 3.69728 0 Float64 4 │ x4 1.12775 -2.35432 1.17077 4.97699 0 Float64 5 │ x5 1.48515 -2.54653 1.4697 5.49571 0 Float64 6 │ x6 2.06779 -2.85966 1.91515 6.94504 0 Float64 7 │ x7 3.00775 -1.83927 2.88807 7.72531 0 Float64 8 │ x8 2.89695 -2.88727 2.93308 7.69873 0 Float64 9 │ x9 2.81287 -1.86449 2.98307 6.90046 0 Float64 10 │ x10 3.03862 -1.62797 2.95758 6.68646 0 Float64 11 │ x11 3.26922 -0.57114 3.17471 7.66825 0 Float64 12 │ x12 3.01467 -0.61422 2.9593 6.71937 0 Float64 13 │ x13 2.46963 -1.38337 2.57814 6.81612 0 Float64 14 │ x14 2.39693 -2.23617 2.43635 7.24212 0 Float64 15 │ x15 2.51824 -1.58677 2.44801 7.66272 0 Float64 16 │ x16 1.76906 -2.29467 1.55258 7.04885 0 Float64 17 │ x17 1.34181 -2.87635 1.19806 5.66506 0 Float64 18 │ x18 0.987342 -2.84758 0.92226 5.02358 0 Float64 19 │ x19 0.623762 -2.62458 0.649665 4.11995 0 Float64 20 │ x20 0.306729 -2.69401 0.349165 3.12017 0 Float64 21 │ x21 -0.0006192 -3.06149 0.01875 2.83729 0 Float64 22 │ label A C 0 String
In [44]:
# retirer la limitation d'affichage
ENV["LINES"] = 1000
# schema : types au sens de MLJ
import MLJ
MLJ.schema(dfTrain)
┌───────┬────────────┬─────────┐ │ names │ scitypes │ types │ ├───────┼────────────┼─────────┤ │ x1 │ Continuous │ Float64 │ │ x2 │ Continuous │ Float64 │ │ x3 │ Continuous │ Float64 │ │ x4 │ Continuous │ Float64 │ │ x5 │ Continuous │ Float64 │ │ x6 │ Continuous │ Float64 │ │ x7 │ Continuous │ Float64 │ │ x8 │ Continuous │ Float64 │ │ x9 │ Continuous │ Float64 │ │ x10 │ Continuous │ Float64 │ │ x11 │ Continuous │ Float64 │ │ x12 │ Continuous │ Float64 │ │ x13 │ Continuous │ Float64 │ │ x14 │ Continuous │ Float64 │ │ x15 │ Continuous │ Float64 │ │ x16 │ Continuous │ Float64 │ │ x17 │ Continuous │ Float64 │ │ x18 │ Continuous │ Float64 │ │ x19 │ Continuous │ Float64 │ │ x20 │ Continuous │ Float64 │ │ x21 │ Continuous │ Float64 │ │ label │ Textual │ String │ └───────┴────────────┴─────────┘
In [45]:
# isoler y et X dans des structures distinctes
yTrain, XTrain = MLJ.unpack(dfTrain,==(:label))
# dimensions
println("Dim. de y = $(DFR.size(yTrain))")
println("Dim. de X = $(DFR.size(XTrain))")
Dim. de y = (300,) Dim. de X = (300, 21)
In [46]:
# coder en catégorielle la variable cible
import CategoricalArrays as CA
yTrain = CA.categorical(yTrain)
# vérification des modalités
CA.levels(yTrain)
3-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"A"
"B"
"C"
In [47]:
# fréquences des classes
DFR.combine(DFR.groupby(DFR.DataFrame(y=yTrain),:y),DFR.nrow => :freq)
3×2 DataFrame
| Row | y | freq |
|---|---|---|
| Cat… | Int64 | |
| 1 | A | 89 |
| 2 | B | 110 |
| 3 | C | 101 |
Modélisation -- Entraînement du modèle¶
In [49]:
# chargement du modèle
# perceptron simple
#BLDA = @MLJ.load LDA pkg = "MultivariateStats"
BLDA = @MLJ.load BayesianLDA pkg = "MultivariateStats"
import MLJMultivariateStatsInterface ✔
┌ Info: For silent loading, specify `verbosity=0`. └ @ Main C:\Users\ricco\.julia\packages\MLJModels\9LbNu\src\loading.jl:159
MLJMultivariateStatsInterface.BayesianLDA
In [50]:
# instanciation - décomposition en valeurs singulières (generalized)
lda = BLDA()
BayesianLDA( method = :gevd, cov_w = StatsBase.SimpleCovariance(false), cov_b = StatsBase.SimpleCovariance(false), outdim = 0, regcoef = 1.0e-6, priors = nothing)
In [51]:
# préparation à l'entraînement
mach_lda = MLJ.machine(lda,XTrain,yTrain)
# entraînement sur les données
# on ne dispose pas de modèle explicite ici malheureusement
# Livre, section 1.3
MLJ.fit!(mach_lda)
┌ Info: Training machine(BayesianLDA(method = gevd, …), …). └ @ MLJBase C:\Users\ricco\.julia\packages\MLJBase\krfwA\src\machines.jl:499
trained Machine; caches model-specific representations of data
model: BayesianLDA(method = gevd, …)
args:
1: Source @585 ⏎ ScientificTypesBase.Table{AbstractVector{ScientificTypesBase.Continuous}}
2: Source @134 ⏎ AbstractVector{ScientificTypesBase.Multiclass{3}}
Evaluation en test du modèle prédictif¶
Chargement, inspection et préparation de l'échantillon test.
In [52]:
# lecture des données de test
dfTest = DFR.DataFrame(XLSX.readtable("./wave_test.xlsx"))
# premières lignes
println(DFR.size(dfTest))
(5000, 22)
In [53]:
# isoler y et X dans des structures distinctes
yTest, XTest = MLJ.unpack(dfTest,==(:label))
# dimensions
println("Dim. de y = $(DFR.size(yTest))")
println("Dim. de X = $(DFR.size(XTest))")
Dim. de y = (5000,) Dim. de X = (5000, 21)
In [54]:
# coder en catégorielle la variable cible
yTest = CA.categorical(yTest)
# vérification des modalités
CA.levels(yTest)
3-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"A"
"B"
"C"
In [55]:
# prédiction des probas d'appartenance
# Livre, section 1.6
proba = MLJ.predict(mach_lda,XTest)
# premières lignes
DFR.first(proba,5)
5-element CategoricalDistributions.UnivariateFiniteVector{ScientificTypesBase.Multiclass{3}, String, UInt32, Float64}:
UnivariateFinite{ScientificTypesBase.Multiclass{3}}(A=>0.00923, B=>0.68, C=>0.311)
UnivariateFinite{ScientificTypesBase.Multiclass{3}}(A=>0.106, B=>0.000868, C=>0.893)
UnivariateFinite{ScientificTypesBase.Multiclass{3}}(A=>0.074, B=>0.922, C=>0.00402)
UnivariateFinite{ScientificTypesBase.Multiclass{3}}(A=>0.000672, B=>0.0335, C=>0.966)
UnivariateFinite{ScientificTypesBase.Multiclass{3}}(A=>0.599, B=>0.375, C=>0.0262)
In [56]:
# prédiction des classes
# affectation à la classe la plus probable
pred = MLJ.predict_mode(mach_lda,XTest)
first(pred,5)
5-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"B"
"C"
"B"
"C"
"A"
In [57]:
# matrice de confusion
MLJ.confusion_matrix(pred,yTest)
┌──────────────┐
│ Ground Truth │
┌─────────┼────┬────┬────┤
│Predicted│ A │ B │ C │
├─────────┼────┼────┼────┤
│ A │1362│153 │133 │
├─────────┼────┼────┼────┤
│ B │ 92 │1446│273 │
├─────────┼────┼────┼────┤
│ C │207 │ 61 │1273│
└─────────┴────┴────┴────┘
In [58]:
# accuracy
MLJ.accuracy(pred,yTest)
0.8162
Distance de Mahalanobis aux centres de classes¶
Accès aux informations de l'objet avec report(.)¶
In [59]:
# informations récupérables à partir de l'objet
rm = MLJ.report(mach_lda)
keys(rm)
(:classes, :indim, :outdim, :mean, :nclasses, :class_means, :class_weights, :Sb, :Sw)
In [60]:
# les classes
rm.classes
3-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"A"
"B"
"C"
In [61]:
# leur nombre
length(rm.classes)
3
In [62]:
# effectifs conditionnels
rm.class_weights
3-element Vector{Float64}:
89.0
110.0
101.0
In [63]:
# moyennes marginales des variables
rm.mean
21-element Vector{Float64}:
-0.04663353333333334
0.3205167333333333
0.6766895666666666
1.1277548
1.4851508
2.0677927
3.0077525000000005
2.8969464000000005
2.812871099999999
3.038620366666666
3.2692178000000003
3.0146677
2.4696274999999996
2.3969274333333326
2.5182356666666674
1.7690586666666666
1.3418125000000005
0.9873421333333336
0.6237621666666666
0.3067289666666667
-0.0006192000000000385
In [64]:
# moyennes des variables conditionnellement aux classes
rm.class_means
21×3 Matrix{Float64}:
-0.119213 -0.0877993 0.0621566
-0.0721506 0.41643 0.56207
-0.0799418 1.09612 0.886621
-0.0177369 1.69372 1.52075
-0.164589 2.21258 2.14663
0.463895 3.06385 2.39632
1.18992 4.45604 3.03226
1.57374 4.21096 2.63184
2.06958 4.1432 2.01898
2.94504 3.97212 2.10439
4.18132 3.84576 1.83756
4.12977 2.90958 2.1465
3.92905 1.74144 1.97668
3.93963 1.0317 2.5244
4.00465 0.919461 2.94966
2.97232 0.387724 2.21319
2.00999 0.117088 2.08688
1.50123 0.0211349 1.58681
0.892425 -0.0219629 1.09028
0.65298 -0.103002 0.447858
-0.042834 -0.0230152 0.0609717
In [65]:
# matrice de variance-covariance intra-classes
# Livre, section 1.2.2 et 8.1.2.3
rm.Sw
21×21 Matrix{Float64}:
270.301 -32.886 -0.656529 … 13.0369 1.62773 -11.3546
-32.886 366.679 71.7423 0.236244 -1.16685 6.41157
-0.656529 71.7423 395.748 -20.4388 -49.4942 27.0752
-0.99474 52.8144 80.0276 -49.0439 -63.5087 21.6122
-6.38966 91.0351 178.378 -53.1832 -30.5544 -28.2988
1.98076 119.799 186.496 … -72.2495 -57.3624 29.6027
-7.22901 151.008 237.027 -104.031 -84.2127 0.218241
-4.78343 110.219 190.02 -110.972 -63.7187 10.8366
-8.30031 -10.6637 110.249 -132.59 -72.4806 1.64708
-24.406 1.27312 35.7332 -70.3067 -64.6131 -16.8611
-31.8808 -62.6919 -76.0114 … -49.2427 -43.1109 -18.5147
-20.576 -72.1974 -109.88 28.2486 -4.42427 -10.4524
-15.897 -114.356 -164.802 43.0851 36.7287 -31.1433
-10.0012 -74.9887 -184.964 110.697 86.0442 -31.3902
30.635 -97.3068 -181.107 154.131 89.482 -22.8973
3.91292 -77.2836 -135.143 … 147.935 109.683 -7.73113
14.7096 -83.1205 -113.395 102.241 105.717 -16.6727
-14.5772 -50.3244 -71.5855 86.6781 50.3059 5.67119
13.0369 0.236244 -20.4388 350.072 47.0075 -6.01434
1.62773 -1.16685 -49.4942 47.0075 282.803 -34.3456
-11.3546 6.41157 27.0752 … -6.01434 -34.3456 312.491
In [66]:
# matrice de variance-covariance inter-classes
# Livre, section 8.1.2.3
rm.Sb
21×21 Matrix{Float64}:
1.85061 4.75629 5.29493 … 6.3146 1.16942 1.05085
4.75629 20.6278 35.9891 -4.82009 -12.9803 2.74164
5.29493 35.9891 74.7541 -37.992 -39.228 3.11538
9.15467 55.591 111.582 -49.0732 -55.2065 5.35413
14.6309 81.467 158.681 -59.9479 -74.1956 8.52108
9.45992 74.5759 160.928 … -93.6205 -89.6359 5.61586
5.45354 79.4067 189.753 -145.183 -120.945 3.41434
-0.315755 53.6381 144.108 -137.464 -103.778 0.0851491
-9.9458 20.6431 94.5977 -149.673 -94.1798 -5.42324
-13.8877 -9.67311 29.562 -112.563 -58.2734 -7.75968
-24.2333 -60.7208 -65.1766 … -86.6004 -18.2844 -13.7531
-16.2665 -61.2593 -98.3475 -6.77943 26.7247 -9.33129
-11.5462 -70.712 -142.326 63.3927 70.7674 -6.75572
-2.38255 -65.2074 -164.171 139.866 110.889 -1.63987
2.37847 -58.2886 -164.711 169.431 124.013 1.03782
3.36248 -45.789 -135.341 … 147.814 105.668 1.645
9.41635 -18.0952 -85.7028 138.075 86.4098 5.14158
7.64253 -13.528 -66.4731 109.163 67.9285 4.17865
6.3146 -4.82009 -37.992 74.2716 44.0321 3.48347
1.16942 -12.9803 -39.228 44.0321 31.1486 0.58641
1.05085 2.74164 3.11538 … 3.48347 0.58641 0.596917
Distance de Mahalanobis aux centres de classes (Livre, section 1.8)¶
Distances des individus en test par rapport aux centres de classes calculées sur l'échantillon d'apprentissage !!!
In [67]:
# librairie pour calcul algébrique
import LinearAlgebra as linalg
# effectif
N = DFR.nrow(dfTrain)
# nombre de classes
K = length(rm.classes)
# matrice des moyennes conditionnelles -- transposée
M = Matrix(rm.class_means)'
# inverse de la matrice des variances covariances intre
inv_W = linalg.inv(N/(N-K)*rm.Sw)
# distances de Mahalanobis
# pas la distance généralisée ici (on ne tient pas compte de la prévalence des classes)
DM = [(x - m)' * inv_W * (x-m) for x in eachrow(Matrix(XTest)), m in eachrow(M)]
# affichage
DM[1:5,:]
5×3 Matrix{Float64}:
0.123249 0.0962728 0.100872
0.0677114 0.100847 0.0545106
0.121817 0.106569 0.141884
0.113441 0.0890353 0.0662892
0.0808624 0.0853429 0.102337
Analyse factorielle discriminante (Livre, chapitre 8)¶
Informations de fitted_params(.)¶
In [68]:
# informations additionnelles sur la structure du modèle obtenu
fp = MLJ.fitted_params(mach_lda)
keys(fp)
(:classes, :projection_matrix, :priors)
In [69]:
# classes
fp.classes
3-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"A"
"B"
"C"
In [70]:
# fréquences a priori
fp.priors
UnivariateFinite{ScientificTypesBase.Multiclass{3}}(A=>0.297, B=>0.367, C=>0.337)
In [71]:
# coefficient pour la projection dans le plan factoriel
# plan (2D) parce que nous avons K=3 modalités
# Livre, section 8.1.2.6
fp.projection_matrix
21×2 Matrix{Float64}:
-0.000726192 0.00226647
0.00422932 0.00328851
0.00531823 0.00136124
0.00335293 0.00434475
-0.00955831 0.0166763
-0.00251836 -0.000359763
-0.00292454 -0.00260365
-0.00764494 -0.000333735
-0.00669738 -0.00302746
-0.00628354 -0.00746827
-0.00130523 -0.0158443
0.00384929 -0.00941728
0.000682806 -0.00726484
0.00954596 0.00143746
0.00537249 0.00109776
0.0025379 -0.00527687
0.00581189 0.00779223
0.00193115 0.0100564
0.00303657 0.00896
0.00486523 -0.00510801
0.00107837 0.00144465
Coordonnées factorielles des individus (TRAIN)¶
In [72]:
# coordonnées factorielles (dans le plan puisque (K-1) => 2 et p = 21)
# Livre, 8.1.3
Z = MLJ.transform(mach_lda,XTrain)
DFR.first(Z,10)
10×2 DataFrame
| Row | x1 | x2 |
|---|---|---|
| Float64 | Float64 | |
| 1 | 0.0219954 | -0.152593 |
| 2 | 0.156002 | -0.0839627 |
| 3 | -0.100994 | -0.195167 |
| 4 | -0.0865563 | -0.119689 |
| 5 | -0.0726575 | -0.271802 |
| 6 | 0.0107328 | -0.00656699 |
| 7 | 0.0823143 | -0.108251 |
| 8 | -0.0806854 | 0.0212487 |
| 9 | -0.106563 | 0.0187711 |
| 10 | -0.0677708 | -0.142402 |
In [73]:
# graphique dans le plan
# en fonction des classes d'appartenance
import Plots
Plots.scatter(Z[:,1],Z[:,2],group=yTrain)
Centres de classes dans le repère factoriel¶
Position des centres de classes (Solution 1)
In [74]:
# calcul des positions des centres de classes
# en appliquant la matrice de projection
# sur les moyennes conditionnelles dans l'espace originel
# on aurait pu faire appel à transform() aussi
coord_centres = rm.class_means' * fp.projection_matrix
size(coord_centres)
(3, 2)
In [75]:
# valeurs des centres de classes dans l'espace factoriel
coord_centres
3×2 Matrix{Float64}:
0.0520339 -0.139441
-0.0901927 -0.108516
0.00338405 -0.0145756
Position des centres de classes (Solution 2)
In [76]:
# calcul à partir des coordonnées factorielles
# ajouter la colonne des classes d'appartenance
Z.y = yTrain
import Statistics
# calculer les moyennes conditionnellement à y
# sur toutes les colonnes sauf y
DFR.combine(DFR.groupby(Z,:y), DFR.Not(:y) .=> Statistics.mean)
3×3 DataFrame
| Row | y | x1_mean | x2_mean |
|---|---|---|---|
| Cat… | Float64 | Float64 | |
| 1 | A | 0.0520339 | -0.139441 |
| 2 | B | -0.0901927 | -0.108516 |
| 3 | C | 0.00338405 | -0.0145756 |
In [77]:
# nuages de points avec les barycentres conditionnels
# Livre, section 8.1.3.2
Plots.scatter(Z[:,1],Z[:,2],group=yTrain,alpha=0.4)
Plots.scatter!(coord_centres[:,1],coord_centres[:,2],group=fp.classes,
marker=:x,markerstrokewidth=6,markersize=10,
label="")
Projection des individus TEST dans le repère factoriel¶
In [78]:
# plonger les individus en test dans l'espace factoriel
# en 2D donc
ZTest = MLJ.transform(mach_lda,XTest)
ZTest[1:5,:]
5×2 DataFrame
| Row | x1 | x2 |
|---|---|---|
| Float64 | Float64 | |
| 1 | -0.0954113 | -0.0344686 |
| 2 | 0.128989 | 0.0158463 |
| 3 | -0.090691 | -0.204306 |
| 4 | -0.0616318 | 0.0788989 |
| 5 | -0.0155028 | -0.180704 |
Distance aux centres de classes dans le repère factoriel¶
On passe bien par la distance euclidienne ici (Livre, section 8.2). Voir l'équivalence avec la distance de Mahalanobis -- Livre, section 8.2.1.3.
In [79]:
# pour les individus en test -- (carré de la) distance Euclidienne aux centres de classes
DE = [sum((x - m) .^ 2) for x in eachrow(Matrix(ZTest)), m in eachrow(Matrix(coord_centres))]
# affichage
DE[1:5,:]
5×3 Matrix{Float64}:
0.0327593 0.00551032 0.0101563
0.0300361 0.0635065 0.016702
0.0245779 0.00917596 0.0448479
0.0605922 0.0359402 0.0129645
0.00626384 0.0107896 0.0279553