Environment and packages
In [76]:
# activate the environment
using Pkg
Pkg.activate("env_julia_curve")
Activating project at `c:\Users\ricco\Desktop\demo\env_julia_curve`
In [77]:
# list the installed packages
Pkg.status()
Status `C:\Users\ricco\Desktop\demo\env_julia_curve\Project.toml`
  [324d7699] CategoricalArrays v1.1.0
  [a93c6f00] DataFrames v1.8.2
  [7073ff75] IJulia v1.34.4
  [add582a8] MLJ v0.23.2
  [6ee0df7b] MLJLinearModels v0.10.1
  [23777cdb] MLJTransforms v0.1.5
  [30f210dd] ScientificTypesBase v3.1.0
  [a19d573c] StatisticalMeasures v0.3.5
  [2913bbd2] StatsBase v0.34.10
  [4c63d2b9] StatsFuns v1.5.2
  [f3b207a7] StatsPlots v0.15.8
  [fdbf4ff8] XLSX v0.11.3
Data import and preparation
Import - Variable types
In [78]:
# packages
import DataFrames as DFR
import XLSX
# read the data
df = DFR.DataFrame(XLSX.readtable("./spambase.xlsx"))
# descriptive statistics (one row per variable)
println(DFR.describe(df))
56×7 DataFrame
 Row │ variable                    mean        min  median  max     nmissing  eltype
     │ Symbol                      Union…      Any  Union…  Any     Int64     DataType
─────┼──────────────────────────────────────────────────────────────────────────────────
   1 │ wf_make                     0.104553    0.0  0.0     4.54    0         Float64
   2 │ wf_address                  0.213015    0.0  0.0     14.28   0         Float64
   3 │ wf_all                      0.280656    0.0  0.0     5.1     0         Float64
   4 │ wf_3d                       0.0654249   0.0  0.0     42.81   0         Float64
   5 │ wf_our                      0.312223    0.0  0.0     10.0    0         Float64
   6 │ wf_over                     0.0959009   0.0  0.0     5.88    0         Float64
   7 │ wf_remove                   0.114208    0.0  0.0     7.27    0         Float64
   8 │ wf_internet                 0.105295    0.0  0.0     11.11   0         Float64
   9 │ wf_order                    0.0900674   0.0  0.0     5.26    0         Float64
  10 │ wf_mail                     0.239413    0.0  0.0     18.18   0         Float64
  11 │ wf_receive                  0.059824    0.0  0.0     2.61    0         Float64
  12 │ wf_will                     0.541702    0.0  0.1     9.67    0         Float64
  13 │ wf_people                   0.0939296   0.0  0.0     5.55    0         Float64
  14 │ wf_report                   0.0586264   0.0  0.0     10.0    0         Float64
  15 │ wf_addresses                0.0492045   0.0  0.0     4.41    0         Float64
  16 │ wf_free                     0.248848    0.0  0.0     20.0    0         Float64
  17 │ wf_business                 0.142586    0.0  0.0     7.14    0         Float64
  18 │ wf_email                    0.184745    0.0  0.0     9.09    0         Float64
  19 │ wf_you                      1.6621      0.0  1.31    18.75   0         Float64
  20 │ wf_credit                   0.085577    0.0  0.0     18.18   0         Float64
  21 │ wf_your                     0.809761    0.0  0.22    11.11   0         Float64
  22 │ wf_font                     0.121202    0.0  0.0     17.1    0         Float64
  23 │ wf_000                      0.101645    0.0  0.0     5.45    0         Float64
  24 │ wf_money                    0.0942686   0.0  0.0     12.5    0         Float64
  25 │ wf_hp                       0.549504    0.0  0.0     20.83   0         Float64
  26 │ wf_hpl                      0.265384    0.0  0.0     16.66   0         Float64
  27 │ wf_lab                      0.0989155   0.0  0.0     14.28   0         Float64
  28 │ wf_labs                     0.102852    0.0  0.0     5.88    0         Float64
  29 │ wf_telnet                   0.0647533   0.0  0.0     12.5    0         Float64
  30 │ wf_857                      0.0470485   0.0  0.0     4.76    0         Float64
  31 │ wf_data                     0.0972289   0.0  0.0     18.18   0         Float64
  32 │ wf_415                      0.0478353   0.0  0.0     4.76    0         Float64
  33 │ wf_85                       0.105412    0.0  0.0     20.0    0         Float64
  34 │ wf_technology               0.0974766   0.0  0.0     7.69    0         Float64
  35 │ wf_1999                     0.136953    0.0  0.0     6.89    0         Float64
  36 │ wf_parts                    0.0132015   0.0  0.0     8.33    0         Float64
  37 │ wf_pm                       0.0786286   0.0  0.0     11.11   0         Float64
  38 │ wf_direct                   0.0648337   0.0  0.0     4.76    0         Float64
  39 │ wf_cs                       0.0436666   0.0  0.0     7.14    0         Float64
  40 │ wf_meeting                  0.132339    0.0  0.0     14.28   0         Float64
  41 │ wf_original                 0.0460987   0.0  0.0     3.57    0         Float64
  42 │ wf_project                  0.0791958   0.0  0.0     20.0    0         Float64
  43 │ wf_re                       0.301224    0.0  0.0     21.42   0         Float64
  44 │ wf_edu                      0.179824    0.0  0.0     22.05   0         Float64
  45 │ wf_table                    0.00544447  0.0  0.0     2.17    0         Float64
  46 │ wf_conference               0.0318692   0.0  0.0     10.0    0         Float64
  47 │ cf_;                        0.0385747   0.0  0.0     4.385   0         Float64
  48 │ cf_(                        0.13903     0.0  0.065   9.752   0         Float64
  49 │ cf_[                        0.0169759   0.0  0.0     4.081   0         Float64
  50 │ cf_!                        0.269071    0.0  0.0     32.478  0         Float64
  51 │ cf_$                        0.0758107   0.0  0.0     6.003   0         Float64
  52 │ cf_#                        0.0442382   0.0  0.0     19.829  0         Float64
  53 │ capital_run_length_average  5.19152     1.0  2.276   1102.5  0         Float64
  54 │ capital_run_length_longest  52.1728     1    15.0    9989    0         Int64
  55 │ capital_run_length_total    283.289     1    95.0    15841   0         Int64
  56 │ spam                                    no           yes     0         String
In [79]:
# dimensions (rows, columns)
println(DFR.size(df))
(4601, 56)
In [80]:
# remove the display limit on the number of lines
ENV["LINES"] = 1000
# check the schema of the table
# and the scientific types
import MLJ
MLJ.schema(df)
┌────────────────────────────┬────────────┬─────────┐
│ names                      │ scitypes   │ types   │
├────────────────────────────┼────────────┼─────────┤
│ wf_make                    │ Continuous │ Float64 │
│ wf_address                 │ Continuous │ Float64 │
│ wf_all                     │ Continuous │ Float64 │
│ wf_3d                      │ Continuous │ Float64 │
│ wf_our                     │ Continuous │ Float64 │
│ wf_over                    │ Continuous │ Float64 │
│ wf_remove                  │ Continuous │ Float64 │
│ wf_internet                │ Continuous │ Float64 │
│ wf_order                   │ Continuous │ Float64 │
│ wf_mail                    │ Continuous │ Float64 │
│ wf_receive                 │ Continuous │ Float64 │
│ wf_will                    │ Continuous │ Float64 │
│ wf_people                  │ Continuous │ Float64 │
│ wf_report                  │ Continuous │ Float64 │
│ wf_addresses               │ Continuous │ Float64 │
│ wf_free                    │ Continuous │ Float64 │
│ wf_business                │ Continuous │ Float64 │
│ wf_email                   │ Continuous │ Float64 │
│ wf_you                     │ Continuous │ Float64 │
│ wf_credit                  │ Continuous │ Float64 │
│ wf_your                    │ Continuous │ Float64 │
│ wf_font                    │ Continuous │ Float64 │
│ wf_000                     │ Continuous │ Float64 │
│ wf_money                   │ Continuous │ Float64 │
│ wf_hp                      │ Continuous │ Float64 │
│ wf_hpl                     │ Continuous │ Float64 │
│ wf_lab                     │ Continuous │ Float64 │
│ wf_labs                    │ Continuous │ Float64 │
│ wf_telnet                  │ Continuous │ Float64 │
│ wf_857                     │ Continuous │ Float64 │
│ wf_data                    │ Continuous │ Float64 │
│ wf_415                     │ Continuous │ Float64 │
│ wf_85                      │ Continuous │ Float64 │
│ wf_technology              │ Continuous │ Float64 │
│ wf_1999                    │ Continuous │ Float64 │
│ wf_parts                   │ Continuous │ Float64 │
│ wf_pm                      │ Continuous │ Float64 │
│ wf_direct                  │ Continuous │ Float64 │
│ wf_cs                      │ Continuous │ Float64 │
│ wf_meeting                 │ Continuous │ Float64 │
│ wf_original                │ Continuous │ Float64 │
│ wf_project                 │ Continuous │ Float64 │
│ wf_re                      │ Continuous │ Float64 │
│ wf_edu                     │ Continuous │ Float64 │
│ wf_table                   │ Continuous │ Float64 │
│ wf_conference              │ Continuous │ Float64 │
│ cf_;                       │ Continuous │ Float64 │
│ cf_(                       │ Continuous │ Float64 │
│ cf_[                       │ Continuous │ Float64 │
│ cf_!                       │ Continuous │ Float64 │
│ cf_$                       │ Continuous │ Float64 │
│ cf_#                       │ Continuous │ Float64 │
│ capital_run_length_average │ Continuous │ Float64 │
│ capital_run_length_longest │ Count      │ Int64   │
│ capital_run_length_total   │ Count      │ Int64   │
│ spam                       │ Textual    │ String  │
└────────────────────────────┴────────────┴─────────┘
Preparing the data structures
In [81]:
import MLJ
# split y and X into separate structures
y, X = MLJ.unpack(df,==(:spam))
# dimensions
println("Dim. of y = $(DFR.size(y))")
println("Dim. of X = $(DFR.size(X))")
Dim. of y = (4601,)
Dim. of X = (4601, 55)
Adjusting the variable types
In [82]:
# convert the Count columns of X to Continuous
# using the ScientificTypesBase package
import ScientificTypesBase
X = MLJ.coerce(X,ScientificTypesBase.Count => ScientificTypesBase.Continuous)
# schema
MLJ.schema(X)
┌────────────────────────────┬────────────┬─────────┐
│ names                      │ scitypes   │ types   │
├────────────────────────────┼────────────┼─────────┤
│ wf_make                    │ Continuous │ Float64 │
│ wf_address                 │ Continuous │ Float64 │
│ wf_all                     │ Continuous │ Float64 │
│ wf_3d                      │ Continuous │ Float64 │
│ wf_our                     │ Continuous │ Float64 │
│ wf_over                    │ Continuous │ Float64 │
│ wf_remove                  │ Continuous │ Float64 │
│ wf_internet                │ Continuous │ Float64 │
│ wf_order                   │ Continuous │ Float64 │
│ wf_mail                    │ Continuous │ Float64 │
│ wf_receive                 │ Continuous │ Float64 │
│ wf_will                    │ Continuous │ Float64 │
│ wf_people                  │ Continuous │ Float64 │
│ wf_report                  │ Continuous │ Float64 │
│ wf_addresses               │ Continuous │ Float64 │
│ wf_free                    │ Continuous │ Float64 │
│ wf_business                │ Continuous │ Float64 │
│ wf_email                   │ Continuous │ Float64 │
│ wf_you                     │ Continuous │ Float64 │
│ wf_credit                  │ Continuous │ Float64 │
│ wf_your                    │ Continuous │ Float64 │
│ wf_font                    │ Continuous │ Float64 │
│ wf_000                     │ Continuous │ Float64 │
│ wf_money                   │ Continuous │ Float64 │
│ wf_hp                      │ Continuous │ Float64 │
│ wf_hpl                     │ Continuous │ Float64 │
│ wf_lab                     │ Continuous │ Float64 │
│ wf_labs                    │ Continuous │ Float64 │
│ wf_telnet                  │ Continuous │ Float64 │
│ wf_857                     │ Continuous │ Float64 │
│ wf_data                    │ Continuous │ Float64 │
│ wf_415                     │ Continuous │ Float64 │
│ wf_85                      │ Continuous │ Float64 │
│ wf_technology              │ Continuous │ Float64 │
│ wf_1999                    │ Continuous │ Float64 │
│ wf_parts                   │ Continuous │ Float64 │
│ wf_pm                      │ Continuous │ Float64 │
│ wf_direct                  │ Continuous │ Float64 │
│ wf_cs                      │ Continuous │ Float64 │
│ wf_meeting                 │ Continuous │ Float64 │
│ wf_original                │ Continuous │ Float64 │
│ wf_project                 │ Continuous │ Float64 │
│ wf_re                      │ Continuous │ Float64 │
│ wf_edu                     │ Continuous │ Float64 │
│ wf_table                   │ Continuous │ Float64 │
│ wf_conference              │ Continuous │ Float64 │
│ cf_;                       │ Continuous │ Float64 │
│ cf_(                       │ Continuous │ Float64 │
│ cf_[                       │ Continuous │ Float64 │
│ cf_!                       │ Continuous │ Float64 │
│ cf_$                       │ Continuous │ Float64 │
│ cf_#                       │ Continuous │ Float64 │
│ capital_run_length_average │ Continuous │ Float64 │
│ capital_run_length_longest │ Continuous │ Float64 │
│ capital_run_length_total   │ Continuous │ Float64 │
└────────────────────────────┴────────────┴─────────┘
In [83]:
# convert y to a categorical variable for logistic regression
# equivalent of the factor type in R
# using the CategoricalArrays package
# /!\ VERY IMPORTANT: declare "yes" as the "positive" class (2nd level here)
import CategoricalArrays as CA
y = CA.categorical(y,ordered=true,levels=["no","yes"])
# check the levels
CA.levels(y)
2-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"no"
"yes"
In [84]:
# class frequencies
DFR.combine(DFR.groupby(DFR.DataFrame(y=y),:y),DFR.nrow => :freq)
2×2 DataFrame
| Row | y | freq |
|---|---|---|
| | Cat… | Int64 |
| 1 | no | 2788 |
| 2 | yes | 1813 |
TRAIN/TEST partition
In [85]:
# number of observations
n = DFR.nrow(X)
println(n)
4601
In [86]:
# import the random number generator
import Random as RND
# set a seed (random_state)
# so that the experiment is fully reproducible
# MersenneTwister ships with Julia (it was the default generator before Julia 1.7)
rng = RND.MersenneTwister(42)
# identifiers -> sample 601 out of 4601, without replacement
# using the generator above with the specified seed
import StatsBase
idTrain = StatsBase.sample(rng,1:n,601,replace=false)
println(length(idTrain))
601
In [87]:
# and the observations kept for the test set
idTest = setdiff(1:n,idTrain)
println(length(idTest))
4000
In [88]:
# y and X structures for train/test
# obtained by indexing with the identifiers
yTrain, yTest = y[idTrain], y[idTest]
XTrain, XTest = X[idTrain,:], X[idTest,:]
# print the dimensions as a check
println("Dim. of y = $(DFR.size(yTrain)) and $(DFR.size(yTest))")
println("Dim. of X = $(DFR.size(XTrain)) and $(DFR.size(XTest))")
Dim. of y = (601,) and (4000,)
Dim. of X = (601, 55) and (4000, 55)
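The same kind of partition can be sketched with Julia's standard library alone. The helper `split_indices` below is hypothetical (it is not part of the notebook), and because it draws a random permutation instead of calling `StatsBase.sample`, it only reproduces the 601/4000 split sizes, not the exact indices used above.

```julia
import Random

# Hypothetical helper: split 1:n into n_train training indices
# (drawn without replacement) and the remaining test indices.
function split_indices(rng, n, n_train)
    perm = Random.randperm(rng, n)        # random permutation of 1:n
    id_train = perm[1:n_train]            # first n_train indices -> TRAIN
    id_test = perm[(n_train + 1):end]     # all the others -> TEST
    return id_train, id_test
end

rng_demo = Random.MersenneTwister(42)
id_train, id_test = split_indices(rng_demo, 4601, 601)
println(length(id_train), " ", length(id_test))
```

Since both parts come from one permutation, they are disjoint by construction and together cover every observation exactly once.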
Standardizing the features
Standardization parameters (means, standard deviations)
In [89]:
# load the standardization tool (centering and scaling)
# MLJ acts as a wrapper
# the tool lives in the MLJTransforms package
Standardizer = @MLJ.load Standardizer pkg=MLJTransforms
import MLJTransforms ✔
┌ Info: For silent loading, specify `verbosity=0`.
└ @ Main C:\Users\ricco\.julia\packages\MLJModels\9LbNu\src\loading.jl:159
MLJTransforms.Standardizer
In [90]:
# compute the standardization parameters
# the class is instantiated on the fly
# of course, << only the training data are used >>
std = MLJ.machine(Standardizer(),XTrain)
# fit! updates the std machine in place (hence the "!")
MLJ.fit!(std)
┌ Info: Training machine(Standardizer(features = Symbol[], …), …).
└ @ MLJBase C:\Users\ricco\.julia\packages\MLJBase\krfwA\src\machines.jl:499
trained Machine; caches model-specific representations of data
model: Standardizer(features = Symbol[], …)
args:
1: Source @182 ⏎ ScientificTypesBase.Table{AbstractVector{ScientificTypesBase.Continuous}}
In [91]:
# fitted parameters (means, standard deviations)
fp = MLJ.fitted_params(std)
# means
# collect() turns the tuple into a vector
println(DFR.DataFrame(var=DFR.names(XTrain),moyennes=collect(fp.means)))
55×2 DataFrame
 Row │ var                         moyennes
     │ String                      Float64
─────┼──────────────────────────────────────────
   1 │ wf_make                     0.100516
   2 │ wf_address                  0.203062
   3 │ wf_all                      0.288819
   4 │ wf_3d                       0.0341764
   5 │ wf_our                      0.309767
   6 │ wf_over                     0.109884
   7 │ wf_remove                   0.101481
   8 │ wf_internet                 0.0891348
   9 │ wf_order                    0.073777
  10 │ wf_mail                     0.217171
  11 │ wf_receive                  0.0475707
  12 │ wf_will                     0.559151
  13 │ wf_people                   0.0986855
  14 │ wf_report                   0.0697338
  15 │ wf_addresses                0.0550582
  16 │ wf_free                     0.227255
  17 │ wf_business                 0.135507
  18 │ wf_email                    0.174676
  19 │ wf_you                      1.71411
  20 │ wf_credit                   0.106156
  21 │ wf_your                     0.762829
  22 │ wf_font                     0.0696007
  23 │ wf_000                      0.107155
  24 │ wf_money                    0.0671048
  25 │ wf_hp                       0.659268
  26 │ wf_hpl                      0.295042
  27 │ wf_lab                      0.0892845
  28 │ wf_labs                     0.11426
  29 │ wf_telnet                   0.0775042
  30 │ wf_857                      0.0509651
  31 │ wf_data                     0.0753577
  32 │ wf_415                      0.0532446
  33 │ wf_85                       0.092812
  34 │ wf_technology               0.117704
  35 │ wf_1999                     0.144276
  36 │ wf_parts                    0.0140599
  37 │ wf_pm                       0.0882696
  38 │ wf_direct                   0.0694509
  39 │ wf_cs                       0.0461231
  40 │ wf_meeting                  0.120965
  41 │ wf_original                 0.0440932
  42 │ wf_project                  0.138985
  43 │ wf_re                       0.314792
  44 │ wf_edu                      0.217421
  45 │ wf_table                    0.00227953
  46 │ wf_conference               0.0537271
  47 │ cf_;                        0.0312995
  48 │ cf_(                        0.131552
  49 │ cf_[                        0.0244992
  50 │ cf_!                        0.343062
  51 │ cf_$                        0.0712313
  52 │ cf_#                        0.0367155
  53 │ capital_run_length_average  5.81461
  54 │ capital_run_length_longest  53.827
  55 │ capital_run_length_total    278.87
In [92]:
# standard deviations (column renamed: the values are standard deviations, not means)
println(DFR.DataFrame(var=DFR.names(XTrain),ecarts_type=collect(fp.stds)))
55×2 DataFrame
 Row │ var                         ecarts_type
     │ String                      Float64
─────┼─────────────────────────────────────────
   1 │ wf_make                     0.300785
   2 │ wf_address                  1.32281
   3 │ wf_all                      0.486717
   4 │ wf_3d                       0.805123
   5 │ wf_our                      0.625448
   6 │ wf_over                     0.388862
   7 │ wf_remove                   0.376624
   8 │ wf_internet                 0.318037
   9 │ wf_order                    0.218076
  10 │ wf_mail                     0.525793
  11 │ wf_receive                  0.145138
  12 │ wf_will                     0.96617
  13 │ wf_people                   0.339056
  14 │ wf_report                   0.41476
  15 │ wf_addresses                0.263164
  16 │ wf_free                     0.664566
  17 │ wf_business                 0.36657
  18 │ wf_email                    0.466861
  19 │ wf_you                      1.78898
  20 │ wf_credit                   0.83424
  21 │ wf_your                     1.08933
  22 │ wf_font                     0.769281
  23 │ wf_000                      0.336949
  24 │ wf_money                    0.204748
  25 │ wf_hp                       1.85746
  26 │ wf_hpl                      0.957733
  27 │ wf_lab                      0.531623
  28 │ wf_labs                     0.471564
  29 │ wf_telnet                   0.389475
  30 │ wf_857                      0.348871
  31 │ wf_data                     0.387251
  32 │ wf_415                      0.350923
  33 │ wf_85                       0.442947
  34 │ wf_technology               0.435172
  35 │ wf_1999                     0.429476
  36 │ wf_parts                    0.181686
  37 │ wf_pm                       0.415519
  38 │ wf_direct                   0.336706
  39 │ wf_cs                       0.383414
  40 │ wf_meeting                  0.684683
  41 │ wf_original                 0.193371
  42 │ wf_project                  1.13889
  43 │ wf_re                       1.02488
  44 │ wf_edu                      1.23855
  45 │ wf_table                    0.0245281
  46 │ wf_conference               0.373172
  47 │ cf_;                        0.197124
  48 │ cf_(                        0.228392
  49 │ cf_[                        0.205984
  50 │ cf_!                        1.47563
  51 │ cf_$                        0.205738
  52 │ cf_#                        0.310332
  53 │ capital_run_length_average  33.2306
  54 │ capital_run_length_longest  143.227
  55 │ capital_run_length_total    609.168
Transforming TRAIN
In [93]:
# transform the training set
ZTrain = MLJ.transform(std,XTrain)
# type of the resulting object
println(typeof(ZTrain))
DataFrames.DataFrame
In [94]:
# check => summary statistics - means
# broadcasting over the columns with "."
# the means are zero on TRAIN (up to floating-point error)
import StatsBase
StatsBase.mean.(DFR.eachcol(ZTrain))
55-element Vector{Float64}:
-4.1933548517455996e-17
-9.864544012476432e-17
1.5143183434883062e-16
-4.888398800190217e-17
2.309116107789427e-18
1.926957391950277e-17
-1.6625635976083876e-17
6.354687528636504e-17
-1.0621934095831365e-16
5.3756222989337864e-17
-1.2931050203620791e-17
7.204442256303012e-17
-7.426117402650797e-17
-8.312817988041938e-19
-1.2866394952602689e-16
7.222915185165328e-17
-2.5377186024605803e-17
2.0135492459923805e-17
-7.500009118100059e-17
-6.908875394505966e-17
-1.108375731738925e-17
8.234308040377097e-17
-4.9553631673161104e-17
-5.209365939172948e-17
1.4778343089852334e-18
-1.221060597799049e-16
3.2235260864740404e-17
1.7826376352134377e-17
3.417491839528352e-18
-5.772790269473568e-17
4.1656454584521265e-17
-2.9556686179704667e-18
7.620083155705109e-17
-1.1822674471881867e-17
-1.140703357247977e-17
2.849449277012153e-17
-2.8309763481498376e-17
-2.6785746850357353e-18
1.2099768404816598e-17
-8.90395171163603e-17
3.334363659647933e-17
7.804812444328263e-18
-3.837750971146028e-17
-3.5560388059957176e-17
1.5378713277877583e-17
4.4335029269557e-17
2.4014807521010043e-18
1.2349152944457855e-16
-3.198125809288357e-17
-9.979999817865904e-17
-8.033198459364312e-18
3.1427070227014105e-17
-4.5489587323451716e-17
-1.0621934095831365e-18
1.0113928552117691e-17
In [95]:
# standard deviations == 1 on TRAIN
StatsBase.std.(DFR.eachcol(ZTrain))
55-element Vector{Float64}:
0.9999999999999998
1.0000000000000002
0.9999999999999998
1.0000000000000004
1.0
1.0000000000000004
0.9999999999999999
0.9999999999999997
0.9999999999999998
1.0
1.0000000000000002
1.0
1.0
1.0000000000000002
1.0000000000000002
0.9999999999999999
1.0
0.9999999999999999
1.0
0.9999999999999999
0.9999999999999997
0.9999999999999996
1.0000000000000004
0.9999999999999998
1.0
0.9999999999999993
1.0000000000000002
1.0
1.0000000000000004
1.0000000000000004
1.0000000000000002
0.9999999999999996
0.9999999999999998
1.0000000000000004
1.0
1.0000000000000002
1.0000000000000002
1.0000000000000002
0.9999999999999991
1.0000000000000002
1.0
0.9999999999999997
1.0000000000000007
1.0000000000000004
1.0
0.9999999999999997
0.9999999999999999
1.0000000000000002
0.9999999999999996
1.0
1.0000000000000002
1.0
1.0000000000000002
0.9999999999999999
1.0000000000000002
Transforming TEST (with the TRAIN parameters)
In [96]:
# transform the TEST sample
# this time the means are not necessarily zero
ZTest = MLJ.transform(std,XTest)
# the means are indeed non-zero
StatsBase.mean.(DFR.eachcol(ZTest))
55-element Vector{Float64}:
0.015440217177922709
0.008654629574588532
-0.019289722078527454
0.044643624957049315
0.00451747741846949
-0.04136052487096174
0.0388694082432708
0.05844524583642155
0.0859239359195257
0.04865717438459207
0.09710942620186036
-0.02077420224066479
-0.016134564786717178
-0.030804047557246958
-0.025585750913939366
0.0373746688513609
0.022212717760662658
0.02480816788214461
-0.033440703770394334
-0.028374808772504125
0.049557115211658724
0.07715555849046948
-0.01880771103137323
0.15260333043102015
-0.06797207073205161
-0.03561961536823071
0.020838015292299245
-0.027826680798455794
-0.03765751452556837
-0.012913244508499995
0.0649637327215418
-0.01773062422718018
0.032719518422051154
-0.053464662091416114
-0.019613922569648605
-0.005434650221465671
-0.026688461177178113
-0.015773134052504807
-0.007369649214185281
0.01910729225494208
0.011929533590212229
-0.06038553312375047
-0.01522819778033067
-0.03491649359187598
0.14842012832693569
-0.06737406477067076
0.0424517201829847
0.037661550354327736
-0.04201126358088621
-0.05767550244346965
0.025602770418852148
0.02788312878542861
-0.021568071307880715
-0.013284563493068597
0.008344216413196832
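What the fitted standardizer does can be illustrated on a toy matrix using only the `Statistics` standard library. The values below are made up, and the sketch assumes the usual centering and scaling by the sample standard deviation (n - 1 denominator), which is consistent with the unit standard deviations checked on TRAIN above.

```julia
import Statistics

# Toy data (made-up values): rows are observations, columns are features
train = [1.0 10.0; 2.0 20.0; 3.0 60.0]
test = [2.0 30.0; 4.0 30.0]

# Parameters estimated on TRAIN only
mu = Statistics.mean(train, dims=1)       # 1×2 row of column means
sigma = Statistics.std(train, dims=1)     # 1×2 row of column standard deviations

# Broadcasting applies the same TRAIN parameters to both samples
z_train = (train .- mu) ./ sigma
z_test = (test .- mu) ./ sigma

println(Statistics.mean(z_train, dims=1))  # numerically zero on TRAIN
println(Statistics.mean(z_test, dims=1))   # generally not zero on TEST
```

Reusing `mu` and `sigma` on the test rows is the point of the exercise: the test sample is treated as unseen data, so its own means and standard deviations are never computed.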
Logistic regression with MLJ
Importing the model class
In [97]:
# import logistic regression
# from the MLJLinearModels module
LogisticClassifier = @MLJ.load LogisticClassifier pkg=MLJLinearModels
import MLJLinearModels ✔
┌ Info: For silent loading, specify `verbosity=0`.
└ @ Main C:\Users\ricco\.julia\packages\MLJModels\9LbNu\src\loading.jl:159
MLJLinearModels.LogisticClassifier
Instantiation - Hyperparameters
In [98]:
# instantiate the model
# this is where the algorithm's hyperparameters are passed
# lambda = 0: unpenalized regression (the other settings keep their defaults)
lr_1 = LogisticClassifier(lambda=0)
# print the hyperparameters
for (param,value) in pairs(MLJ.params(lr_1))
println("$param = $value")
end
lambda = 0
gamma = 0.0
penalty = l2
fit_intercept = true
penalize_intercept = false
scale_penalty_with_samples = true
solver = nothing
Training the model
In [99]:
# prepare the object for training
# with MLJ's machine() function
# note that ZTrain (standardized data) is used
mach_1 = MLJ.machine(lr_1,ZTrain,yTrain)
# run the training -> the machine is updated in place (hence the "!")
MLJ.fit!(mach_1)
┌ Info: Training machine(LogisticClassifier(lambda = 0, …), …).
└ @ MLJBase C:\Users\ricco\.julia\packages\MLJBase\krfwA\src\machines.jl:499
┌ Info: Solver: MLJLinearModels.LBFGS{Optim.Options{Float64, Nothing}, @NamedTuple{}}
│ optim_options: Optim.Options{Float64, Nothing}
│ lbfgs_options: @NamedTuple{} NamedTuple()
└ @ MLJLinearModels C:\Users\ricco\.julia\packages\MLJLinearModels\s9vSj\src\mlj\interface.jl:72
trained Machine; caches model-specific representations of data
model: LogisticClassifier(lambda = 0, …)
args:
1: Source @535 ⏎ ScientificTypesBase.Table{AbstractVector{ScientificTypesBase.Continuous}}
2: Source @913 ⏎ AbstractVector{ScientificTypesBase.OrderedFactor{2}}
Inspecting the estimated coefficients
In [100]:
# coefficients
fp_1 = MLJ.fitted_params(mach_1)
fp_1.coefs
55-element Vector{Pair{Symbol, Float64}}:
:wf_make => -0.20662775844013093
:wf_address => -0.3811389804797835
:wf_all => 0.1645957927535096
:wf_3d => 1.11931487567822
:wf_our => 0.3540115582062516
:wf_over => 0.07002507745031764
:wf_remove => 1.0852294566826728
:wf_internet => 0.5195139493086103
:wf_order => 0.40124329592484476
:wf_mail => 0.2670135142542665
:wf_receive => 0.5321140875041915
:wf_will => 0.10099089146678401
:wf_people => -0.257730835030072
:wf_report => -0.04610808253153313
:wf_addresses => 1.3262804424779204
:wf_free => 0.41701855421032163
:wf_business => 0.3637316339216612
:wf_email => -0.01332269386255259
:wf_you => 0.14104609674309038
:wf_credit => 0.8136149640427681
:wf_your => 0.7186893017220286
:wf_font => 0.21524435629540017
:wf_000 => 0.2091229954856885
:wf_money => 0.11819911614596045
:wf_hp => -2.738351089085636
:wf_hpl => -1.928803437799816
:wf_lab => -3.879073074778169
:wf_labs => 0.05031573236454722
:wf_telnet => -2.0846209075015976
:wf_857 => -2.2226802727324
:wf_data => -2.8800415674462236
:wf_415 => 1.9654102576293766
:wf_85 => -0.421562053277384
:wf_technology => 0.3406481624420365
:wf_1999 => -1.543889169617358
:wf_parts => -0.10611032300699687
:wf_pm => -0.39000503239031853
:wf_direct => -0.43360773751399
:wf_cs => -8.954766832401562
:wf_meeting => -1.4551539698007934
:wf_original => 0.036688704959824306
:wf_project => -1.0456474780122373
:wf_re => -1.3255834345909456
:wf_edu => -0.5680405634676539
:wf_table => 0.03938125758481521
:wf_conference => -1.6353268137272965
Symbol("cf_;") => -0.32929001515593
Symbol("cf_(") => -0.20975118646724641
Symbol("cf_[") => -1.4251811365093452
:cf_! => 0.4224999024795942
Symbol("cf_\$") => 0.7097391172800116
Symbol("cf_#") => -0.02705090438170104
:capital_run_length_average => 14.125278263170905
:capital_run_length_longest => -2.7564022249517106
:capital_run_length_total => 2.790850531449405
In [101]:
# type of the object: pairs of Symbol (variable name) and value (coefficient)
println(typeof(fp_1.coefs))
Vector{Pair{Symbol, Float64}}
In [102]:
# to work with the values (coefficients) only -> last
# e.g. the first 10 values
last.(fp_1.coefs)[1:10]
10-element Vector{Float64}:
-0.20662775844013093
-0.3811389804797835
0.1645957927535096
1.11931487567822
0.3540115582062516
0.07002507745031764
1.0852294566826728
0.5195139493086103
0.40124329592484476
0.2670135142542665
In [103]:
# sum of squared coefficients
# to be compared later with the penalized regression
println(sum(last.(fp_1.coefs) .^ 2))
363.68086825172304
In [104]:
# the intercept
fp_1.intercept
-3.5008901731060185
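As a rough illustration of how these estimates are used, the probability of the positive class is obtained by passing the linear score through the logistic function. The helper names below are hypothetical, and the sketch assumes that the coefficients model the second level ("yes"). For an observation whose standardized features are all zero (an "average" training e-mail), only the intercept matters.

```julia
# Logistic (sigmoid) function: maps a real-valued score into (0, 1)
sigmoid(t) = 1 / (1 + exp(-t))

# Hypothetical helper: P(y = "yes") for one standardized observation z
prob_yes(z, coefs, intercept) = sigmoid(intercept + sum(coefs .* z))

# Intercept printed above; the two coefficients are the first two of the
# list, rounded for the example (the fitted model has 55 of them)
intercept_demo = -3.5008901731060185
coefs_demo = [-0.2066, -0.3811]

# "Average" observation: every standardized feature equals zero
p_avg = prob_yes([0.0, 0.0], coefs_demo, intercept_demo)
println(round(p_avg, digits=3))
```

With such a negative intercept the baseline probability of spam is only a few percent; it is the feature contributions that push individual e-mails towards "yes".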
Prediction on the test sample (class membership probabilities)
In [105]:
# predict on the test sample -> membership probabilities
# through the trained machine
# applied to the transformed test data ZTest
proba_1 = MLJ.predict(mach_1,ZTest)
# first values
DFR.first(proba_1,10)
10-element CategoricalDistributions.UnivariateFiniteVector{ScientificTypesBase.OrderedFactor{2}, String, UInt32, Float64}:
UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>1.0, yes=>1.03e-12)
UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>0.841, yes=>0.159)
UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>0.897, yes=>0.103)
UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>0.00133, yes=>0.999)
UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>0.0672, yes=>0.933)
UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>0.936, yes=>0.0643)
UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>0.000712, yes=>0.999)
UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>0.759, yes=>0.241)
UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>0.0349, yes=>0.965)
UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>4.54e-8, yes=>1.0)
ROC curve and AUC (Area Under Curve)
In [106]:
# compute the points of the ROC curve
# the third returned value is the threshold, which is not needed here
# the function KNOWS that (y = yes) is the target level
# because "yes" is the second level of the ordered CategoricalArray built above
fpr_1, tpr_1, _ = MLJ.roc_curve(proba_1,yTest)
# check
println(fpr_1[1:5])
println(tpr_1[1:5])
[0.0, 0.0004127115146512588, 0.0004127115146512588, 0.0004127115146512588, 0.0004127115146512588]
[0.0, 0.019023462270133164, 0.020291693088142042, 0.02092580849714648, 0.02155992390615092]
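To make the construction of these points concrete, here is a minimal sketch of a ROC computation by threshold sweep, on made-up scores. The helper `roc_points` is hypothetical and is not MLJ's implementation: it sorts observations by decreasing score, lowers the threshold one observation at a time, and ignores tied scores for simplicity.

```julia
# Hypothetical helper: ROC points from scores (higher = more likely positive)
# and boolean labels (true = positive class). Ties are not handled.
function roc_points(scores, is_pos)
    order = sortperm(scores, rev=true)    # indices by decreasing score
    n_pos = count(is_pos)
    n_neg = length(is_pos) - n_pos
    fpr = [0.0]
    tpr = [0.0]
    tp = 0
    fp = 0
    for i in order
        # lowering the threshold past observation i adds one TP or one FP
        is_pos[i] ? (tp += 1) : (fp += 1)
        push!(fpr, fp / n_neg)
        push!(tpr, tp / n_pos)
    end
    return fpr, tpr
end

scores_demo = [0.9, 0.8, 0.7, 0.6, 0.3]
labels_demo = [true, true, false, true, false]
fpr_demo, tpr_demo = roc_points(scores_demo, labels_demo)
println(fpr_demo)
println(tpr_demo)
```

The curve always starts at (0, 0) and ends at (1, 1); a positive observation moves it up, a negative one moves it to the right.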
In [107]:
# StatsPlots (statistical plots, an extension of Plots)
import StatsPlots
# draw the curve
StatsPlots.plot(fpr_1,tpr_1,legend=false,xlabel="FPR",ylabel="TPR",lw=2)
# add the diagonal to the existing plot with "!"
StatsPlots.plot!([0,1],[0,1],linestyle=:dash,legend=false)
In [108]:
# compute the AUC
# note: it takes the predicted probabilities and the observed classes
# the function KNOWS that (y = yes) is the target class
# because y was coded that way with CategoricalArrays
import MLJ
auc_1 = MLJ.auc(proba_1,yTest)
# display
println("AUC (Model 1) = $(auc_1)")
AUC (Model 1) = 0.948841829947677
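One way to read this number: the AUC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition on made-up scores; `auc_pairs` is a hypothetical helper, not the routine MLJ uses, and its cost grows with the number of (positive, negative) pairs.

```julia
# Hypothetical helper: share of (positive, negative) pairs in which the
# positive example has the higher score; ties count for one half.
function auc_pairs(scores, is_pos)
    pos = scores[is_pos]       # scores of the positive examples
    neg = scores[.!is_pos]     # scores of the negative examples
    wins = sum((p > q) + 0.5 * (p == q) for p in pos, q in neg)
    return wins / (length(pos) * length(neg))
end

scores_demo = [0.9, 0.8, 0.7, 0.6, 0.3]
labels_demo = [true, true, false, true, false]
println(auc_pairs(scores_demo, labels_demo))
```

On this toy example 5 of the 6 pairs are correctly ordered, which is also the area under the corresponding step-shaped ROC curve.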
In [109]:
# count the positive and negative examples
# in the test sample
n_pos = sum(yTest .== "yes")
n_neg = sum(yTest .== "no")
# display
print("Positive examples = $n_pos, negative examples = $n_neg")
Positive examples = 1577, negative examples = 2423
In [110]:
import StatsFuns
"""
    auc_ci(auc, n_pos, n_neg, alpha=0.1)

Compute a two-sided confidence interval for the AUC
using the Hanley & McNeil (1982) approximation.

Arguments:
- auc   : AUC value (between 0 and 1)
- n_pos : number of positive examples
- n_neg : number of negative examples
- alpha : risk level (default 0.1, i.e. a 90% CI)

Returns:
- (lower, upper)
"""
function auc_ci(auc, n_pos, n_neg, alpha=0.1)
    # quantile of the standard normal distribution
    z = StatsFuns.norminvcdf(1 - alpha/2)
    # intermediate terms
    Q1 = auc / (2 - auc)
    Q2 = (2 * auc^2) / (1 + auc)
    # estimated variance
    var_auc = (
        auc * (1 - auc) +
        (n_pos - 1) * (Q1 - auc^2) +
        (n_neg - 1) * (Q2 - auc^2)
    ) / (n_pos * n_neg)
    se = sqrt(var_auc)
    lower = auc - z * se
    upper = auc + z * se
    # optional: clamp the bounds to [0,1]
    return (max(0.0, lower), min(1.0, upper))
end
auc_ci
In [111]:
# interval for our AUC (default alpha = 0.1, i.e. 90%)
println(auc_ci(auc_1,n_pos,n_neg))
(0.9422691642446792, 0.9554144956506747)
RIDGE regression
Settings
In [112]:
# model 2 -- RIDGE, with a stronger penalty
# by default, penalty = l2
lr_2 = LogisticClassifier(lambda=5.0)
# print the hyperparameters
for (param,value) in pairs(MLJ.params(lr_2))
println("$param = $value")
end
lambda = 5.0
gamma = 0.0
penalty = l2
fit_intercept = true
penalize_intercept = false
scale_penalty_with_samples = true
solver = nothing
Training - Inspecting the coefficients
In [113]:
# prepare the object for training
mach_2 = MLJ.machine(lr_2,ZTrain,yTrain)
# run the training -> the machine is updated in place (hence the "!")
MLJ.fit!(mach_2)
# coefficients
fp_2 = MLJ.fitted_params(mach_2)
# sum of squared coefficients
# to be compared with the sum for the unpenalized regression (363.68)
# the penalty has shrunk the coefficients considerably
println("SUM square coefs = $(sum(last.(fp_2.coefs) .^ 2))")
SUM square coefs = 0.014676052838459935
┌ Info: Training machine(LogisticClassifier(lambda = 5.0, …), …).
└ @ MLJBase C:\Users\ricco\.julia\packages\MLJBase\krfwA\src\machines.jl:499
┌ Info: Solver: MLJLinearModels.LBFGS{Optim.Options{Float64, Nothing}, @NamedTuple{}}
│ optim_options: Optim.Options{Float64, Nothing}
│ lbfgs_options: @NamedTuple{} NamedTuple()
└ @ MLJLinearModels C:\Users\ricco\.julia\packages\MLJLinearModels\s9vSj\src\mlj\interface.jl:72
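The size of the shrinkage becomes easier to understand once the penalized criterion is written out. The sketch below is an assumption about the form of the objective, not code taken from MLJLinearModels: a logistic loss with labels coded -1/+1, plus an L2 penalty weighted by `n * lambda / 2` (the scaling by the number of samples is assumed from `scale_penalty_with_samples = true`). The function name and the toy data are hypothetical.

```julia
# Hypothetical helper: logistic loss + (n * lambda / 2) * ||beta||^2
# Z: matrix of standardized features, y: labels coded -1 / +1
function ridge_objective(beta, intercept, Z, y, lambda)
    scores = Z * beta .+ intercept
    loss = sum(log1p.(exp.(-y .* scores)))              # logistic loss
    penalty = length(y) * lambda / 2 * sum(abs2, beta)  # L2 penalty
    return loss + penalty
end

# Toy data (made-up values)
Z_demo = [1.0 0.0; 0.0 1.0; -1.0 -1.0]
y_demo = [1, -1, -1]
beta_demo = [1.0, -1.0]

# Same coefficients, with and without penalty
println(ridge_objective(beta_demo, 0.0, Z_demo, y_demo, 0.0))
println(ridge_objective(beta_demo, 0.0, Z_demo, y_demo, 5.0))
```

Under this assumed form, with 601 training observations and lambda = 5 the squared norm of the coefficients would be weighted by roughly 1500, so large coefficients are very costly and the optimizer keeps them close to zero.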
Worth it? ROC curve and AUC for RIDGE
In [114]:
# predict on the test sample -> membership probabilities
proba_2 = MLJ.predict(mach_2,ZTest)
# compute the points of the ROC curve
fpr_2, tpr_2, _ = MLJ.roc_curve(proba_2,yTest)
([0.0, 0.0, 0.0, 0.0, 0.0004127115146512588, 0.0004127115146512588, 0.0004127115146512588, 0.0004127115146512588, 0.0004127115146512588, 0.0004127115146512588 … 0.9942220387948824, 0.9946347503095336, 0.9950474618241849, 0.9954601733388362, 0.9958728848534875, 0.9962855963681386, 0.9966983078827899, 0.9971110193974412, 0.9975237309120925, 1.0], [0.0, 0.0006341154090044388, 0.0012682308180088776, 0.0019023462270133164, 0.0019023462270133164, 0.0025364616360177552, 0.0031705770450221942, 0.003804692454026633, 0.004438807863031071, 0.0050729232720355105 … 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], [0.7943463076716266, 0.787591482250194, 0.7720663676128136, 0.7374097517897057, 0.6965341388285624, 0.6799778221588728, 0.6677390838935167, 0.6634270741130385, 0.6584955592725488, 0.6458663965658581 … 0.21600702643453248, 0.18557917711040775, 0.174984399668831, 0.17467907244710576, 0.1744815223317159, 0.1691136805299214, 0.16881602053673636, 0.16864978219631957, 0.1625690454900294, 0.1554931369455556])
In [115]:
# draw the curves
StatsPlots.plot(fpr_1,tpr_1,lw=2,title="ROC Curve",label="Reg.Log.")
StatsPlots.plot!(fpr_2,tpr_2,lw=2,label="Ridge (lambda=5)")
# add the diagonal
StatsPlots.plot!([0,1],[0,1],linestyle=:dash,label=nothing)
In [116]:
# AUC value - slightly lower
# but the confidence interval is needed before drawing a conclusion
auc_2 = MLJ.auc(proba_2,yTest)
println("AUC (Model 2) = $auc_2")
AUC (Model 2) = 0.9354262195075674
In [117]:
# confidence interval
println(auc_ci(auc_2,n_pos,n_neg))
(0.9280643403432914, 0.9427880986718434)
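A quick, informal way to read the two intervals is to check whether they overlap. The sketch below uses the bounds printed above; `overlap` is a hypothetical helper, and the result is only a rough indication, since both AUCs are computed on the same test sample and are therefore not independent.

```julia
# 90% intervals printed above, as (lower, upper)
ci_1 = (0.9422691642446792, 0.9554144956506747)   # unpenalized regression
ci_2 = (0.9280643403432914, 0.9427880986718434)   # ridge, lambda = 5

# Hypothetical helper: two intervals overlap when the larger of the
# lower bounds does not exceed the smaller of the upper bounds
overlap(a, b) = max(a[1], b[1]) <= min(a[2], b[2])

println(overlap(ci_1, ci_2))
```

The intervals do overlap, but only barely (0.9423 against 0.9428), so the loss of performance due to the ridge penalty is at the limit of what these intervals can distinguish.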
LASSO regression
Settings
In [118]:
# model 3 -- LASSO
lr_3 = LogisticClassifier(lambda=0.1,penalty="l1")
# print the hyperparameters
for (param,value) in pairs(MLJ.params(lr_3))
println("$param = $value")
end
lambda = 0.1
gamma = 0.0
penalty = l1
fit_intercept = true
penalize_intercept = false
scale_penalty_with_samples = true
solver = nothing
Training and inspecting the coefficients
In [119]:
# prepare the object for training
mach_3 = MLJ.machine(lr_3,ZTrain,yTrain)
# run the training -> the machine is updated in place (hence the "!")
MLJ.fit!(mach_3)
# coefficients
fp_3 = MLJ.fitted_params(mach_3)
# !!! NUMBER OF COEFFICIENTS EQUAL TO ZERO (since this is LASSO)
println("\n<< Number of zero coefs = $(sum(last.(fp_3.coefs) .== 0)) >>\n")
<< Number of zero coefs = 44 >>
┌ Info: Training machine(LogisticClassifier(lambda = 0.1, …), …).
└ @ MLJBase C:\Users\ricco\.julia\packages\MLJBase\krfwA\src\machines.jl:499
┌ Info: Solver: MLJLinearModels.ProxGrad
│ accel: Bool true
│ max_iter: Int64 1000
│ tol: Float64 0.0001
│ max_inner: Int64 100
│ beta: Float64 0.8
│ gram: Bool false
└ @ MLJLinearModels C:\Users\ricco\.julia\packages\MLJLinearModels\s9vSj\src\mlj\interface.jl:72
List of the non-zero coefficients
In [120]:
# turn the coefs structure into a data frame
df_temp = DFR.DataFrame(cle=first.(fp_3.coefs),valeur=last.(fp_3.coefs))
# filter the data frame
DFR.filter(row -> row.valeur != 0, df_temp)
11×2 DataFrame
| Row | cle | valeur |
|---|---|---|
| | Symbol | Float64 |
| 1 | wf_remove | 0.114821 |
| 2 | wf_order | 0.073568 |
| 3 | wf_receive | 0.0456103 |
| 4 | wf_free | 0.049217 |
| 5 | wf_business | 0.0429596 |
| 6 | wf_your | 0.468963 |
| 7 | wf_000 | 0.114493 |
| 8 | wf_money | 0.042865 |
| 9 | wf_hp | -0.00400554 |
| 10 | cf_$ | 0.113806 |
| 11 | capital_run_length_longest | 0.0258599 |
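Exact zeros like these are typical of L1 penalties. Proximal gradient methods (the training log above reports a `ProxGrad` solver) generally handle the L1 term with a soft-thresholding step: each coefficient is pulled towards zero by a fixed amount and set to exactly zero if it is smaller than that amount. The sketch below shows this operator in isolation, on made-up values; it illustrates the general technique and is not code taken from the solver.

```julia
# Soft-thresholding: shrink x towards zero by t, and return zero
# whenever |x| <= t (proximal operator of t * |x|)
soft_threshold(x, t) = sign(x) * max(abs(x) - t, 0.0)

# Made-up coefficients before the thresholding step
raw = [0.8, 0.05, -0.3, -0.02]
shrunk = soft_threshold.(raw, 0.1)

println(shrunk)
println(count(iszero, shrunk))   # number of coefficients set to zero
```

Coefficients whose magnitude is below the threshold vanish entirely, which is how LASSO performs variable selection, whereas the ridge penalty only shrinks coefficients without cancelling them.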
Evaluation on the test set - ROC curve
In [121]:
# predict on the test sample -> membership probabilities
proba_3 = MLJ.predict(mach_3,ZTest)
# compute the points of the ROC curve
fpr_3, tpr_3, _ = MLJ.roc_curve(proba_3,yTest)
([0.0, 0.0, 0.0004127115146512588, 0.0004127115146512588, 0.0004127115146512588, 0.0008254230293025176, 0.0008254230293025176, 0.0008254230293025176, 0.0008254230293025176, 0.0012381345439537762 … 0.9954601733388362, 0.9966983078827899, 0.9971110193974412, 0.9975237309120925, 0.9979364424267437, 0.998349153941395, 0.9987618654560462, 0.9991745769706974, 0.9995872884853487, 1.0], [0.0, 0.0006341154090044388, 0.0006341154090044388, 0.0012682308180088776, 0.0019023462270133164, 0.0019023462270133164, 0.0025364616360177552, 0.0031705770450221942, 0.003804692454026633, 0.003804692454026633 … 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], [0.9781154370866937, 0.9741093907340005, 0.970412451188318, 0.9676896769028709, 0.9666205285456669, 0.9623977084405708, 0.9612543569594248, 0.9609321779360758, 0.9590932745894498, 0.9575430988337869 … 0.26639193543725564, 0.26632137183450755, 0.26588809204447694, 0.26578237876364835, 0.26574714696295565, 0.2656766923021226, 0.26465101337864433, 0.2643792252377439, 0.26427389590053063, 0.264066360590983])
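The three vectors returned above are the false positive rates, the true positive rates and the corresponding thresholds. To make the construction concrete, here is a minimal sketch in plain Julia that builds the ROC points by sweeping the threshold over the sorted scores. It is not how `MLJ.roc_curve` is implemented; the function names `roc_points` and `trapz_auc` and the toy data are hypothetical, and tied scores are not merged.

```julia
# Build ROC points by lowering the threshold one observation at a time.
# scores: estimated P(positive); labels: true when the observation is positive.
function roc_points(scores::AbstractVector{<:Real}, labels::AbstractVector{Bool})
    order = sortperm(scores, rev=true)   # highest scores first
    n_pos = count(labels)
    n_neg = length(labels) - n_pos
    fpr = [0.0]
    tpr = [0.0]
    tp = 0
    fp = 0
    for i in order
        # each observation moves the curve up (positive) or right (negative)
        labels[i] ? (tp += 1) : (fp += 1)
        push!(fpr, fp / n_neg)
        push!(tpr, tp / n_pos)
    end
    return fpr, tpr
end

# area under the curve by the trapezoidal rule
trapz_auc(fpr, tpr) = sum((fpr[i+1] - fpr[i]) * (tpr[i+1] + tpr[i]) / 2 for i in 1:length(fpr)-1)

# hypothetical toy data (not the spam data set)
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [true, true, false, true, false, false]
fpr, tpr = roc_points(scores, labels)
println("AUC (toy data) = ", trapz_auc(fpr, tpr))
```

The step sizes explain the values in the output above: the FPR increases in steps of 1/(number of negatives) and the TPR in steps of 1/(number of positives).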
In [124]:
# plot the curves
# << LASSO >> -- visibly TOO MANY variables removed
StatsPlots.plot(fpr_1,tpr_1,lw=2,title="ROC Curve",label="Reg.Log.")
StatsPlots.plot!(fpr_2,tpr_2,lw=2,label="Ridge (lambda = 5)")
StatsPlots.plot!(fpr_3,tpr_3,lw=2,label="Lasso (lambda = 0.1)")
# add the diagonal
StatsPlots.plot!([0,1],[0,1],linestyle=:dash,label=nothing)
In [123]:
# AUC and confidence interval
auc_3 = MLJ.auc(proba_3,yTest)
println("AUC (Modele 3) = $auc_3")
# confidence interval
bb_3, bh_3 = auc_ci(auc_3,n_pos,n_neg)
print("Int. de confiance = [$bb_3, $bh_3]")
AUC (Modele 3) = 0.8848493001046042
Int. de confiance = [0.8751620534709881, 0.8945365467382203]
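The `auc_ci` function is defined outside this section, so its formula is not shown here. As an illustration only, the sketch below uses the Hanley-McNeil standard error with a normal approximation at the 90% level. The name `auc_ci_sketch` is hypothetical, and the class counts (1577 positives, 2423 negatives) are an assumption inferred from the step sizes of the ROC output. Under these assumptions the bounds come out close to the ones printed above, which suggests, without proving it, that `auc_ci` follows a similar formula.

```julia
# Hanley-McNeil standard error of the AUC + normal-approximation interval.
# ASSUMPTION: this is a hypothetical stand-in, not the notebook's `auc_ci`.
# z defaults to the 0.95 quantile of the standard normal (two-sided 90% interval).
function auc_ci_sketch(auc::Real, n_pos::Integer, n_neg::Integer; z::Real=1.6448536269514722)
    q1 = auc / (2 - auc)
    q2 = 2 * auc^2 / (1 + auc)
    var_auc = (auc * (1 - auc) +
               (n_pos - 1) * (q1 - auc^2) +
               (n_neg - 1) * (q2 - auc^2)) / (n_pos * n_neg)
    se = sqrt(var_auc)
    return auc - z * se, auc + z * se
end

# AUC printed above; class counts inferred from the ROC step sizes (assumption)
lo, hi = auc_ci_sketch(0.8848493001046042, 1577, 2423)
println("approx. interval = [$lo, $hi]")
```

Passing `z=1.96` instead would give the more common 95% interval, which is wider.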
Defining the processing steps as a PIPELINE (standardization + regression)¶
Building the pipeline - Training, prediction¶
In [125]:
# build the processing chain as a pipeline
# the |> operator chains the steps
pipe = Standardizer() |> LogisticClassifier(lambda=0)
# training ON THE UNTRANSFORMED TRAIN DATA
mach_pipe = MLJ.machine(pipe,XTrain,yTrain)
MLJ.fit!(mach_pipe)
# predict the probabilities ON THE UNTRANSFORMED TEST DATA
proba_pipe = MLJ.predict(mach_pipe,XTest)
# first values
DFR.first(proba_pipe,7)
┌ Info: Training machine(ProbabilisticPipeline(standardizer = Standardizer(features = Symbol[], …), …), …).
└ @ MLJBase C:\Users\ricco\.julia\packages\MLJBase\krfwA\src\machines.jl:499
┌ Info: Training machine(:standardizer, …).
└ @ MLJBase C:\Users\ricco\.julia\packages\MLJBase\krfwA\src\machines.jl:499
┌ Info: Training machine(:logistic_classifier, …).
└ @ MLJBase C:\Users\ricco\.julia\packages\MLJBase\krfwA\src\machines.jl:499
┌ Info: Solver: MLJLinearModels.LBFGS{Optim.Options{Float64, Nothing}, @NamedTuple{}}
│ optim_options: Optim.Options{Float64, Nothing}
│ lbfgs_options: @NamedTuple{} NamedTuple()
└ @ MLJLinearModels C:\Users\ricco\.julia\packages\MLJLinearModels\s9vSj\src\mlj\interface.jl:72
7-element CategoricalDistributions.UnivariateFiniteVector{ScientificTypesBase.OrderedFactor{2}, String, UInt32, Float64}:
UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>1.0, yes=>1.03e-12)
UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>0.841, yes=>0.159)
UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>0.897, yes=>0.103)
UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>0.00133, yes=>0.999)
UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>0.0672, yes=>0.933)
UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>0.936, yes=>0.0643)
UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>0.000712, yes=>0.999)
Comparison with the approach without a pipeline¶
In [126]:
# as a reminder, proba_1
DFR.first(proba_1,7)
7-element CategoricalDistributions.UnivariateFiniteVector{ScientificTypesBase.OrderedFactor{2}, String, UInt32, Float64}:
UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>1.0, yes=>1.03e-12)
UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>0.841, yes=>0.159)
UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>0.897, yes=>0.103)
UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>0.00133, yes=>0.999)
UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>0.0672, yes=>0.933)
UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>0.936, yes=>0.0643)
UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>0.000712, yes=>0.999)
In [127]:
# global check - the sum of squared differences should be 0
sum((MLJ.pdf.(proba_1,"yes") .- MLJ.pdf.(proba_pipe,"yes")) .^ 2)
0.0
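A zero difference is what one would expect if the earlier standardized tables were produced by a standardizer fitted on the training data, since the pipeline does the same thing internally: it learns the centering and scaling parameters on the training data, then reuses them at prediction time. The sketch below shows that fit/transform logic in plain Julia. The function names and toy matrices are hypothetical, and the use of the sample standard deviation (n-1 denominator) is an assumption that may differ from MLJ's internals.

```julia
using Statistics  # standard library: mean, std

# "fit" step: learn column means and standard deviations on the training matrix only
fit_standardizer(X::AbstractMatrix) = (mu = mean(X, dims=1), sigma = std(X, dims=1))

# "transform" step: reuse the training parameters on any matrix with the same columns
apply_standardizer(params, X::AbstractMatrix) = (X .- params.mu) ./ params.sigma

# hypothetical toy matrices (rows = observations, columns = variables)
X_train = [1.0 10.0; 2.0 20.0; 3.0 30.0]
X_test  = [2.0 40.0; 4.0 20.0]

params = fit_standardizer(X_train)
Z_train = apply_standardizer(params, X_train)
# test data are scaled with the TRAIN parameters, never with their own
Z_test = apply_standardizer(params, X_test)
println(Z_test)
```

Packaging both steps in a pipeline removes the risk of scaling the test data with the wrong parameters, because the same fitted object handles training and prediction.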