Environment and packages

In [76]:
# activate the environment
using Pkg
Pkg.activate("env_julia_curve")
  Activating project at `c:\Users\ricco\Desktop\demo\env_julia_curve`
In [77]:
# list the installed packages
Pkg.status()
Status `C:\Users\ricco\Desktop\demo\env_julia_curve\Project.toml`
  [324d7699] CategoricalArrays v1.1.0
  [a93c6f00] DataFrames v1.8.2
  [7073ff75] IJulia v1.34.4
  [add582a8] MLJ v0.23.2
  [6ee0df7b] MLJLinearModels v0.10.1
  [23777cdb] MLJTransforms v0.1.5
  [30f210dd] ScientificTypesBase v3.1.0
  [a19d573c] StatisticalMeasures v0.3.5
  [2913bbd2] StatsBase v0.34.10
  [4c63d2b9] StatsFuns v1.5.2
  [f3b207a7] StatsPlots v0.15.8
  [fdbf4ff8] XLSX v0.11.3

Data import and preparation

Import - Variable types

In [78]:
# packages
import DataFrames as DFR
import XLSX

# read the data
df = DFR.DataFrame(XLSX.readtable("./spambase.xlsx"))

# summary statistics for each column
println(DFR.describe(df))
56×7 DataFrame
 Row │ variable                    mean        min  median  max     nmissing  eltype   
     │ Symbol                      Union…      Any  Union…  Any     Int64     DataType 
─────┼─────────────────────────────────────────────────────────────────────────────────
   1 │ wf_make                     0.104553    0.0  0.0     4.54           0  Float64
   2 │ wf_address                  0.213015    0.0  0.0     14.28          0  Float64
   3 │ wf_all                      0.280656    0.0  0.0     5.1            0  Float64
   4 │ wf_3d                       0.0654249   0.0  0.0     42.81          0  Float64
   5 │ wf_our                      0.312223    0.0  0.0     10.0           0  Float64
   6 │ wf_over                     0.0959009   0.0  0.0     5.88           0  Float64
   7 │ wf_remove                   0.114208    0.0  0.0     7.27           0  Float64
   8 │ wf_internet                 0.105295    0.0  0.0     11.11          0  Float64
   9 │ wf_order                    0.0900674   0.0  0.0     5.26           0  Float64
  10 │ wf_mail                     0.239413    0.0  0.0     18.18          0  Float64
  11 │ wf_receive                  0.059824    0.0  0.0     2.61           0  Float64
  12 │ wf_will                     0.541702    0.0  0.1     9.67           0  Float64
  13 │ wf_people                   0.0939296   0.0  0.0     5.55           0  Float64
  14 │ wf_report                   0.0586264   0.0  0.0     10.0           0  Float64
  15 │ wf_addresses                0.0492045   0.0  0.0     4.41           0  Float64
  16 │ wf_free                     0.248848    0.0  0.0     20.0           0  Float64
  17 │ wf_business                 0.142586    0.0  0.0     7.14           0  Float64
  18 │ wf_email                    0.184745    0.0  0.0     9.09           0  Float64
  19 │ wf_you                      1.6621      0.0  1.31    18.75          0  Float64
  20 │ wf_credit                   0.085577    0.0  0.0     18.18          0  Float64
  21 │ wf_your                     0.809761    0.0  0.22    11.11          0  Float64
  22 │ wf_font                     0.121202    0.0  0.0     17.1           0  Float64
  23 │ wf_000                      0.101645    0.0  0.0     5.45           0  Float64
  24 │ wf_money                    0.0942686   0.0  0.0     12.5           0  Float64
  25 │ wf_hp                       0.549504    0.0  0.0     20.83          0  Float64
  26 │ wf_hpl                      0.265384    0.0  0.0     16.66          0  Float64
  27 │ wf_lab                      0.0989155   0.0  0.0     14.28          0  Float64
  28 │ wf_labs                     0.102852    0.0  0.0     5.88           0  Float64
  29 │ wf_telnet                   0.0647533   0.0  0.0     12.5           0  Float64
  30 │ wf_857                      0.0470485   0.0  0.0     4.76           0  Float64
  31 │ wf_data                     0.0972289   0.0  0.0     18.18          0  Float64
  32 │ wf_415                      0.0478353   0.0  0.0     4.76           0  Float64
  33 │ wf_85                       0.105412    0.0  0.0     20.0           0  Float64
  34 │ wf_technology               0.0974766   0.0  0.0     7.69           0  Float64
  35 │ wf_1999                     0.136953    0.0  0.0     6.89           0  Float64
  36 │ wf_parts                    0.0132015   0.0  0.0     8.33           0  Float64
  37 │ wf_pm                       0.0786286   0.0  0.0     11.11          0  Float64
  38 │ wf_direct                   0.0648337   0.0  0.0     4.76           0  Float64
  39 │ wf_cs                       0.0436666   0.0  0.0     7.14           0  Float64
  40 │ wf_meeting                  0.132339    0.0  0.0     14.28          0  Float64
  41 │ wf_original                 0.0460987   0.0  0.0     3.57           0  Float64
  42 │ wf_project                  0.0791958   0.0  0.0     20.0           0  Float64
  43 │ wf_re                       0.301224    0.0  0.0     21.42          0  Float64
  44 │ wf_edu                      0.179824    0.0  0.0     22.05          0  Float64
  45 │ wf_table                    0.00544447  0.0  0.0     2.17           0  Float64
  46 │ wf_conference               0.0318692   0.0  0.0     10.0           0  Float64
  47 │ cf_;                        0.0385747   0.0  0.0     4.385          0  Float64
  48 │ cf_(                        0.13903     0.0  0.065   9.752          0  Float64
  49 │ cf_[                        0.0169759   0.0  0.0     4.081          0  Float64
  50 │ cf_!                        0.269071    0.0  0.0     32.478         0  Float64
  51 │ cf_$                        0.0758107   0.0  0.0     6.003          0  Float64
  52 │ cf_#                        0.0442382   0.0  0.0     19.829         0  Float64
  53 │ capital_run_length_average  5.19152     1.0  2.276   1102.5         0  Float64
  54 │ capital_run_length_longest  52.1728     1    15.0    9989           0  Int64
  55 │ capital_run_length_total    283.289     1    95.0    15841          0  Int64
  56 │ spam                                    no           yes            0  String
In [79]:
# dimensions (rows, columns)
println(DFR.size(df))
(4601, 56)
In [80]:
# lift the display limit on the number of rows
ENV["LINES"] = 1000
# check the table schema
# and the scientific types
import MLJ
MLJ.schema(df)
┌────────────────────────────┬────────────┬─────────┐
│ names                      │ scitypes   │ types   │
├────────────────────────────┼────────────┼─────────┤
│ wf_make                    │ Continuous │ Float64 │
│ wf_address                 │ Continuous │ Float64 │
│ wf_all                     │ Continuous │ Float64 │
│ wf_3d                      │ Continuous │ Float64 │
│ wf_our                     │ Continuous │ Float64 │
│ wf_over                    │ Continuous │ Float64 │
│ wf_remove                  │ Continuous │ Float64 │
│ wf_internet                │ Continuous │ Float64 │
│ wf_order                   │ Continuous │ Float64 │
│ wf_mail                    │ Continuous │ Float64 │
│ wf_receive                 │ Continuous │ Float64 │
│ wf_will                    │ Continuous │ Float64 │
│ wf_people                  │ Continuous │ Float64 │
│ wf_report                  │ Continuous │ Float64 │
│ wf_addresses               │ Continuous │ Float64 │
│ wf_free                    │ Continuous │ Float64 │
│ wf_business                │ Continuous │ Float64 │
│ wf_email                   │ Continuous │ Float64 │
│ wf_you                     │ Continuous │ Float64 │
│ wf_credit                  │ Continuous │ Float64 │
│ wf_your                    │ Continuous │ Float64 │
│ wf_font                    │ Continuous │ Float64 │
│ wf_000                     │ Continuous │ Float64 │
│ wf_money                   │ Continuous │ Float64 │
│ wf_hp                      │ Continuous │ Float64 │
│ wf_hpl                     │ Continuous │ Float64 │
│ wf_lab                     │ Continuous │ Float64 │
│ wf_labs                    │ Continuous │ Float64 │
│ wf_telnet                  │ Continuous │ Float64 │
│ wf_857                     │ Continuous │ Float64 │
│ wf_data                    │ Continuous │ Float64 │
│ wf_415                     │ Continuous │ Float64 │
│ wf_85                      │ Continuous │ Float64 │
│ wf_technology              │ Continuous │ Float64 │
│ wf_1999                    │ Continuous │ Float64 │
│ wf_parts                   │ Continuous │ Float64 │
│ wf_pm                      │ Continuous │ Float64 │
│ wf_direct                  │ Continuous │ Float64 │
│ wf_cs                      │ Continuous │ Float64 │
│ wf_meeting                 │ Continuous │ Float64 │
│ wf_original                │ Continuous │ Float64 │
│ wf_project                 │ Continuous │ Float64 │
│ wf_re                      │ Continuous │ Float64 │
│ wf_edu                     │ Continuous │ Float64 │
│ wf_table                   │ Continuous │ Float64 │
│ wf_conference              │ Continuous │ Float64 │
│ cf_;                       │ Continuous │ Float64 │
│ cf_(                       │ Continuous │ Float64 │
│ cf_[                       │ Continuous │ Float64 │
│ cf_!                       │ Continuous │ Float64 │
│ cf_$                       │ Continuous │ Float64 │
│ cf_#                       │ Continuous │ Float64 │
│ capital_run_length_average │ Continuous │ Float64 │
│ capital_run_length_longest │ Count      │ Int64   │
│ capital_run_length_total   │ Count      │ Int64   │
│ spam                       │ Textual    │ String  │
└────────────────────────────┴────────────┴─────────┘

Preparing the data structures

In [81]:
import MLJ

# split y and X into separate structures
y, X = MLJ.unpack(df,==(:spam))

# dimensions
println("Shape of y = $(DFR.size(y))")
println("Shape of X = $(DFR.size(X))")
Shape of y = (4601,)
Shape of X = (4601, 55)

Adjusting the variable types

In [82]:
# convert the Count columns of X to Continuous
# uses the ScientificTypesBase package
import ScientificTypesBase
X = MLJ.coerce(X,ScientificTypesBase.Count => ScientificTypesBase.Continuous)

# schema
MLJ.schema(X)
┌────────────────────────────┬────────────┬─────────┐
│ names                      │ scitypes   │ types   │
├────────────────────────────┼────────────┼─────────┤
│ wf_make                    │ Continuous │ Float64 │
│ wf_address                 │ Continuous │ Float64 │
│ wf_all                     │ Continuous │ Float64 │
│ wf_3d                      │ Continuous │ Float64 │
│ wf_our                     │ Continuous │ Float64 │
│ wf_over                    │ Continuous │ Float64 │
│ wf_remove                  │ Continuous │ Float64 │
│ wf_internet                │ Continuous │ Float64 │
│ wf_order                   │ Continuous │ Float64 │
│ wf_mail                    │ Continuous │ Float64 │
│ wf_receive                 │ Continuous │ Float64 │
│ wf_will                    │ Continuous │ Float64 │
│ wf_people                  │ Continuous │ Float64 │
│ wf_report                  │ Continuous │ Float64 │
│ wf_addresses               │ Continuous │ Float64 │
│ wf_free                    │ Continuous │ Float64 │
│ wf_business                │ Continuous │ Float64 │
│ wf_email                   │ Continuous │ Float64 │
│ wf_you                     │ Continuous │ Float64 │
│ wf_credit                  │ Continuous │ Float64 │
│ wf_your                    │ Continuous │ Float64 │
│ wf_font                    │ Continuous │ Float64 │
│ wf_000                     │ Continuous │ Float64 │
│ wf_money                   │ Continuous │ Float64 │
│ wf_hp                      │ Continuous │ Float64 │
│ wf_hpl                     │ Continuous │ Float64 │
│ wf_lab                     │ Continuous │ Float64 │
│ wf_labs                    │ Continuous │ Float64 │
│ wf_telnet                  │ Continuous │ Float64 │
│ wf_857                     │ Continuous │ Float64 │
│ wf_data                    │ Continuous │ Float64 │
│ wf_415                     │ Continuous │ Float64 │
│ wf_85                      │ Continuous │ Float64 │
│ wf_technology              │ Continuous │ Float64 │
│ wf_1999                    │ Continuous │ Float64 │
│ wf_parts                   │ Continuous │ Float64 │
│ wf_pm                      │ Continuous │ Float64 │
│ wf_direct                  │ Continuous │ Float64 │
│ wf_cs                      │ Continuous │ Float64 │
│ wf_meeting                 │ Continuous │ Float64 │
│ wf_original                │ Continuous │ Float64 │
│ wf_project                 │ Continuous │ Float64 │
│ wf_re                      │ Continuous │ Float64 │
│ wf_edu                     │ Continuous │ Float64 │
│ wf_table                   │ Continuous │ Float64 │
│ wf_conference              │ Continuous │ Float64 │
│ cf_;                       │ Continuous │ Float64 │
│ cf_(                       │ Continuous │ Float64 │
│ cf_[                       │ Continuous │ Float64 │
│ cf_!                       │ Continuous │ Float64 │
│ cf_$                       │ Continuous │ Float64 │
│ cf_#                       │ Continuous │ Float64 │
│ capital_run_length_average │ Continuous │ Float64 │
│ capital_run_length_longest │ Continuous │ Float64 │
│ capital_run_length_total   │ Continuous │ Float64 │
└────────────────────────────┴────────────┴─────────┘
In [83]:
# convert y to a categorical variable for logistic regression
# equivalent of the factor type in R
# uses the CategoricalArrays package
# /!\ VERY IMPORTANT: declare "yes" as the "positive" class (in 2nd position here)
import CategoricalArrays as CA
y = CA.categorical(y,ordered=true,levels=["no","yes"])

# check the levels
CA.levels(y)
2-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "no"
 "yes"
In [84]:
# class frequencies
DFR.combine(DFR.groupby(DFR.DataFrame(y=y),:y),DFR.nrow => :freq)
2×2 DataFrame
 Row │ y     freq  
     │ Cat…  Int64 
─────┼─────────────
   1 │ no     2788
   2 │ yes    1813

TRAIN/TEST split

In [85]:
# number of observations
n = DFR.nrow(X) 
println(n)
4601
In [86]:
# import the random number generator
import Random as RND

# set a seed (random_state)
# so that the experiment is fully reproducible
# MersenneTwister is a generator from Julia's Random standard library
# (the default generator has been Xoshiro since Julia 1.7)
rng = RND.MersenneTwister(42)

# indices -> sample 601 out of 4601, without replacement
# using the generator above with the specified seed
import StatsBase
idTrain = StatsBase.sample(rng,1:n,601,replace=false)
println(length(idTrain))
601
In [87]:
# and the test observations (all remaining indices)
idTest = setdiff(1:n,idTrain)
println(length(idTest))
4000
In [88]:
# y and X structures for train/test
# obtained by indexing with the index vectors
yTrain, yTest = y[idTrain], y[idTest]
XTrain, XTest = X[idTrain,:], X[idTest,:]

# display the dimensions as a check
println("Shape of y = $(DFR.size(yTrain)) and $(DFR.size(yTest))")
println("Shape of X = $(DFR.size(XTrain)) and $(DFR.size(XTest))")
Shape of y = (601,) and (4000,)
Shape of X = (601, 55) and (4000, 55)
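
The split above can also be sketched with the Random standard library alone, which makes the logic explicit: shuffle the row indices, keep the first 601 for TRAIN and the rest for TEST. The helper name `split_indices` is hypothetical and introduced only for illustration; it will not reproduce the exact indices drawn by `StatsBase.sample`, even with the same seed.

```julia
import Random as RND

# hypothetical helper: split 1:n into n_train training indices and their complement
function split_indices(rng, n, n_train)
    perm = RND.randperm(rng, n)            # random permutation of 1:n
    id_train = sort(perm[1:n_train])       # first n_train entries -> train
    id_test = sort(perm[n_train+1:end])    # remaining entries -> test
    return id_train, id_test
end

rng_demo = RND.MersenneTwister(42)
id_train, id_test = split_indices(rng_demo, 4601, 601)
println(length(id_train), " / ", length(id_test))
```

Because both index sets come from one permutation, they are disjoint and together cover every row, which is what `setdiff` guarantees in the cells above.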

Standardizing the features

Standardization parameters (means, standard deviations)

In [89]:
# load the standardization (center and scale) tool
# MLJ acts as a wrapper
# the tool lives in the MLJTransforms package
Standardizer = @MLJ.load Standardizer pkg=MLJTransforms
import MLJTransforms ✔
┌ Info: For silent loading, specify `verbosity=0`. 
└ @ Main C:\Users\ricco\.julia\packages\MLJModels\9LbNu\src\loading.jl:159
MLJTransforms.Standardizer
In [90]:
# compute the standardization parameters
# the class is instantiated on the fly
# of course, << only the training data are used >>
std = MLJ.machine(Standardizer(),XTrain)

# fit! updates the std machine in place (hence the "!")
MLJ.fit!(std)
┌ Info: Training machine(Standardizer(features = Symbol[], …), …).
└ @ MLJBase C:\Users\ricco\.julia\packages\MLJBase\krfwA\src\machines.jl:499
trained Machine; caches model-specific representations of data
  model: Standardizer(features = Symbol[], …)
  args: 
    1:	Source @182 ⏎ ScientificTypesBase.Table{AbstractVector{ScientificTypesBase.Continuous}}
In [91]:
# fitted parameters (means, standard deviations)
fp = MLJ.fitted_params(std)

# means
# collect() turns the tuple into a vector
println(DFR.DataFrame(var=DFR.names(XTrain),means=collect(fp.means)))
55×2 DataFrame
 Row │ var                         means        
     │ String                      Float64      
─────┼──────────────────────────────────────────
   1 │ wf_make                       0.100516
   2 │ wf_address                    0.203062
   3 │ wf_all                        0.288819
   4 │ wf_3d                         0.0341764
   5 │ wf_our                        0.309767
   6 │ wf_over                       0.109884
   7 │ wf_remove                     0.101481
   8 │ wf_internet                   0.0891348
   9 │ wf_order                      0.073777
  10 │ wf_mail                       0.217171
  11 │ wf_receive                    0.0475707
  12 │ wf_will                       0.559151
  13 │ wf_people                     0.0986855
  14 │ wf_report                     0.0697338
  15 │ wf_addresses                  0.0550582
  16 │ wf_free                       0.227255
  17 │ wf_business                   0.135507
  18 │ wf_email                      0.174676
  19 │ wf_you                        1.71411
  20 │ wf_credit                     0.106156
  21 │ wf_your                       0.762829
  22 │ wf_font                       0.0696007
  23 │ wf_000                        0.107155
  24 │ wf_money                      0.0671048
  25 │ wf_hp                         0.659268
  26 │ wf_hpl                        0.295042
  27 │ wf_lab                        0.0892845
  28 │ wf_labs                       0.11426
  29 │ wf_telnet                     0.0775042
  30 │ wf_857                        0.0509651
  31 │ wf_data                       0.0753577
  32 │ wf_415                        0.0532446
  33 │ wf_85                         0.092812
  34 │ wf_technology                 0.117704
  35 │ wf_1999                       0.144276
  36 │ wf_parts                      0.0140599
  37 │ wf_pm                         0.0882696
  38 │ wf_direct                     0.0694509
  39 │ wf_cs                         0.0461231
  40 │ wf_meeting                    0.120965
  41 │ wf_original                   0.0440932
  42 │ wf_project                    0.138985
  43 │ wf_re                         0.314792
  44 │ wf_edu                        0.217421
  45 │ wf_table                      0.00227953
  46 │ wf_conference                 0.0537271
  47 │ cf_;                          0.0312995
  48 │ cf_(                          0.131552
  49 │ cf_[                          0.0244992
  50 │ cf_!                          0.343062
  51 │ cf_$                          0.0712313
  52 │ cf_#                          0.0367155
  53 │ capital_run_length_average    5.81461
  54 │ capital_run_length_longest   53.827
  55 │ capital_run_length_total    278.87
In [92]:
# standard deviations
println(DFR.DataFrame(var=DFR.names(XTrain),stds=collect(fp.stds)))
55×2 DataFrame
 Row │ var                         stds        
     │ String                      Float64     
─────┼─────────────────────────────────────────
   1 │ wf_make                       0.300785
   2 │ wf_address                    1.32281
   3 │ wf_all                        0.486717
   4 │ wf_3d                         0.805123
   5 │ wf_our                        0.625448
   6 │ wf_over                       0.388862
   7 │ wf_remove                     0.376624
   8 │ wf_internet                   0.318037
   9 │ wf_order                      0.218076
  10 │ wf_mail                       0.525793
  11 │ wf_receive                    0.145138
  12 │ wf_will                       0.96617
  13 │ wf_people                     0.339056
  14 │ wf_report                     0.41476
  15 │ wf_addresses                  0.263164
  16 │ wf_free                       0.664566
  17 │ wf_business                   0.36657
  18 │ wf_email                      0.466861
  19 │ wf_you                        1.78898
  20 │ wf_credit                     0.83424
  21 │ wf_your                       1.08933
  22 │ wf_font                       0.769281
  23 │ wf_000                        0.336949
  24 │ wf_money                      0.204748
  25 │ wf_hp                         1.85746
  26 │ wf_hpl                        0.957733
  27 │ wf_lab                        0.531623
  28 │ wf_labs                       0.471564
  29 │ wf_telnet                     0.389475
  30 │ wf_857                        0.348871
  31 │ wf_data                       0.387251
  32 │ wf_415                        0.350923
  33 │ wf_85                         0.442947
  34 │ wf_technology                 0.435172
  35 │ wf_1999                       0.429476
  36 │ wf_parts                      0.181686
  37 │ wf_pm                         0.415519
  38 │ wf_direct                     0.336706
  39 │ wf_cs                         0.383414
  40 │ wf_meeting                    0.684683
  41 │ wf_original                   0.193371
  42 │ wf_project                    1.13889
  43 │ wf_re                         1.02488
  44 │ wf_edu                        1.23855
  45 │ wf_table                      0.0245281
  46 │ wf_conference                 0.373172
  47 │ cf_;                          0.197124
  48 │ cf_(                          0.228392
  49 │ cf_[                          0.205984
  50 │ cf_!                          1.47563
  51 │ cf_$                          0.205738
  52 │ cf_#                          0.310332
  53 │ capital_run_length_average   33.2306
  54 │ capital_run_length_longest  143.227
  55 │ capital_run_length_total    609.168

Transforming TRAIN

In [93]:
# transform the training sample
ZTrain = MLJ.transform(std,XTrain)

# type of the object
println(typeof(ZTrain))
DataFrames.DataFrame
In [94]:
# check -> summary statistics - means
# broadcasting with "."
# the means are zero on TRAIN (up to floating-point rounding)
import StatsBase
StatsBase.mean.(DFR.eachcol(ZTrain))
55-element Vector{Float64}:
 -4.1933548517455996e-17
 -9.864544012476432e-17
  1.5143183434883062e-16
 -4.888398800190217e-17
  2.309116107789427e-18
  1.926957391950277e-17
 -1.6625635976083876e-17
  6.354687528636504e-17
 -1.0621934095831365e-16
  5.3756222989337864e-17
 -1.2931050203620791e-17
  7.204442256303012e-17
 -7.426117402650797e-17
 -8.312817988041938e-19
 -1.2866394952602689e-16
  7.222915185165328e-17
 -2.5377186024605803e-17
  2.0135492459923805e-17
 -7.500009118100059e-17
 -6.908875394505966e-17
 -1.108375731738925e-17
  8.234308040377097e-17
 -4.9553631673161104e-17
 -5.209365939172948e-17
  1.4778343089852334e-18
 -1.221060597799049e-16
  3.2235260864740404e-17
  1.7826376352134377e-17
  3.417491839528352e-18
 -5.772790269473568e-17
  4.1656454584521265e-17
 -2.9556686179704667e-18
  7.620083155705109e-17
 -1.1822674471881867e-17
 -1.140703357247977e-17
  2.849449277012153e-17
 -2.8309763481498376e-17
 -2.6785746850357353e-18
  1.2099768404816598e-17
 -8.90395171163603e-17
  3.334363659647933e-17
  7.804812444328263e-18
 -3.837750971146028e-17
 -3.5560388059957176e-17
  1.5378713277877583e-17
  4.4335029269557e-17
  2.4014807521010043e-18
  1.2349152944457855e-16
 -3.198125809288357e-17
 -9.979999817865904e-17
 -8.033198459364312e-18
  3.1427070227014105e-17
 -4.5489587323451716e-17
 -1.0621934095831365e-18
  1.0113928552117691e-17
In [95]:
# standard deviations == 1 on TRAIN
StatsBase.std.(DFR.eachcol(ZTrain))
55-element Vector{Float64}:
 0.9999999999999998
 1.0000000000000002
 0.9999999999999998
 1.0000000000000004
 1.0
 1.0000000000000004
 0.9999999999999999
 0.9999999999999997
 0.9999999999999998
 1.0
 1.0000000000000002
 1.0
 1.0
 1.0000000000000002
 1.0000000000000002
 0.9999999999999999
 1.0
 0.9999999999999999
 1.0
 0.9999999999999999
 0.9999999999999997
 0.9999999999999996
 1.0000000000000004
 0.9999999999999998
 1.0
 0.9999999999999993
 1.0000000000000002
 1.0
 1.0000000000000004
 1.0000000000000004
 1.0000000000000002
 0.9999999999999996
 0.9999999999999998
 1.0000000000000004
 1.0
 1.0000000000000002
 1.0000000000000002
 1.0000000000000002
 0.9999999999999991
 1.0000000000000002
 1.0
 0.9999999999999997
 1.0000000000000007
 1.0000000000000004
 1.0
 0.9999999999999997
 0.9999999999999999
 1.0000000000000002
 0.9999999999999996
 1.0
 1.0000000000000002
 1.0
 1.0000000000000002
 0.9999999999999999
 1.0000000000000002

Transforming TEST (with the TRAIN parameters)

In [96]:
# transform the TEST sample
# the means are not necessarily zero this time
ZTest = MLJ.transform(std,XTest)

# the means are indeed nonzero
StatsBase.mean.(DFR.eachcol(ZTest))
55-element Vector{Float64}:
  0.015440217177922709
  0.008654629574588532
 -0.019289722078527454
  0.044643624957049315
  0.00451747741846949
 -0.04136052487096174
  0.0388694082432708
  0.05844524583642155
  0.0859239359195257
  0.04865717438459207
  0.09710942620186036
 -0.02077420224066479
 -0.016134564786717178
 -0.030804047557246958
 -0.025585750913939366
  0.0373746688513609
  0.022212717760662658
  0.02480816788214461
 -0.033440703770394334
 -0.028374808772504125
  0.049557115211658724
  0.07715555849046948
 -0.01880771103137323
  0.15260333043102015
 -0.06797207073205161
 -0.03561961536823071
  0.020838015292299245
 -0.027826680798455794
 -0.03765751452556837
 -0.012913244508499995
  0.0649637327215418
 -0.01773062422718018
  0.032719518422051154
 -0.053464662091416114
 -0.019613922569648605
 -0.005434650221465671
 -0.026688461177178113
 -0.015773134052504807
 -0.007369649214185281
  0.01910729225494208
  0.011929533590212229
 -0.06038553312375047
 -0.01522819778033067
 -0.03491649359187598
  0.14842012832693569
 -0.06737406477067076
  0.0424517201829847
  0.037661550354327736
 -0.04201126358088621
 -0.05767550244346965
  0.025602770418852148
  0.02788312878542861
 -0.021568071307880715
 -0.013284563493068597
  0.008344216413196832
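
To make the train/test asymmetry concrete, here is a minimal sketch of the same standardization done by hand with the Statistics standard library. It runs on synthetic data (an assumption, not the spambase table) and assumes the usual corrected (n-1) standard deviation, which is consistent with the checks above returning 1 on TRAIN. The names `A`, `mu` and `sigma` are illustrative only.

```julia
import Statistics as ST
import Random as RND

# synthetic data (assumption): 200 rows, 3 features on different scales
rng_demo = RND.MersenneTwister(1)
A = randn(rng_demo, 200, 3) .* [1.0 10.0 100.0] .+ [0.0 5.0 -3.0]
A_train, A_test = A[1:150, :], A[151:end, :]

# parameters estimated on the training rows only
mu = ST.mean(A_train, dims=1)
sigma = ST.std(A_train, dims=1)    # corrected (n-1) estimator by default

# the same parameters are applied to both samples
Z_train = (A_train .- mu) ./ sigma
Z_test = (A_test .- mu) ./ sigma

println(ST.mean(Z_train, dims=1))  # close to 0, up to rounding
println(ST.std(Z_train, dims=1))   # close to 1, up to rounding
println(ST.mean(Z_test, dims=1))   # generally not 0
```

Reusing the TRAIN parameters on TEST keeps the test sample out of the fitting stage, which is why its means drift slightly away from zero.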

Logistic regression with MLJ

Importing the model class

In [97]:
# import logistic regression
# from the MLJLinearModels module
LogisticClassifier = @MLJ.load LogisticClassifier pkg=MLJLinearModels
import MLJLinearModels ✔
┌ Info: For silent loading, specify `verbosity=0`. 
└ @ Main C:\Users\ricco\.julia\packages\MLJModels\9LbNu\src\loading.jl:159
MLJLinearModels.LogisticClassifier

Instantiation - Hyperparameters

In [98]:
# instantiate the model
# this is where the algorithm's hyperparameters are passed
# lambda = 0 -> unpenalized regression (the other hyperparameters keep their defaults)
lr_1 = LogisticClassifier(lambda=0)

# display the hyperparameters
for (param,value) in pairs(MLJ.params(lr_1))
    println("$param = $value")
end
lambda = 0
gamma = 0.0
penalty = l2
fit_intercept = true
penalize_intercept = false
scale_penalty_with_samples = true
solver = nothing

Training the model

In [99]:
# prepare the object for training
# with MLJ's machine() function
# note that we use ZTrain (the standardized data)
mach_1 = MLJ.machine(lr_1,ZTrain,yTrain)

# run the training -> fit! updates the mach_1 object in place (hence the "!")
MLJ.fit!(mach_1)
┌ Info: Training machine(LogisticClassifier(lambda = 0, …), …).
└ @ MLJBase C:\Users\ricco\.julia\packages\MLJBase\krfwA\src\machines.jl:499
┌ Info: Solver: MLJLinearModels.LBFGS{Optim.Options{Float64, Nothing}, @NamedTuple{}}
│   optim_options: Optim.Options{Float64, Nothing}
│   lbfgs_options: @NamedTuple{} NamedTuple()
└ @ MLJLinearModels C:\Users\ricco\.julia\packages\MLJLinearModels\s9vSj\src\mlj\interface.jl:72
trained Machine; caches model-specific representations of data
  model: LogisticClassifier(lambda = 0, …)
  args: 
    1:	Source @535 ⏎ ScientificTypesBase.Table{AbstractVector{ScientificTypesBase.Continuous}}
    2:	Source @913 ⏎ AbstractVector{ScientificTypesBase.OrderedFactor{2}}

Examining the estimated coefficients

In [100]:
# coefficients
fp_1 = MLJ.fitted_params(mach_1)
fp_1.coefs
55-element Vector{Pair{Symbol, Float64}}:
                    :wf_make => -0.20662775844013093
                 :wf_address => -0.3811389804797835
                     :wf_all => 0.1645957927535096
                      :wf_3d => 1.11931487567822
                     :wf_our => 0.3540115582062516
                    :wf_over => 0.07002507745031764
                  :wf_remove => 1.0852294566826728
                :wf_internet => 0.5195139493086103
                   :wf_order => 0.40124329592484476
                    :wf_mail => 0.2670135142542665
                 :wf_receive => 0.5321140875041915
                    :wf_will => 0.10099089146678401
                  :wf_people => -0.257730835030072
                  :wf_report => -0.04610808253153313
               :wf_addresses => 1.3262804424779204
                    :wf_free => 0.41701855421032163
                :wf_business => 0.3637316339216612
                   :wf_email => -0.01332269386255259
                     :wf_you => 0.14104609674309038
                  :wf_credit => 0.8136149640427681
                    :wf_your => 0.7186893017220286
                    :wf_font => 0.21524435629540017
                     :wf_000 => 0.2091229954856885
                   :wf_money => 0.11819911614596045
                      :wf_hp => -2.738351089085636
                     :wf_hpl => -1.928803437799816
                     :wf_lab => -3.879073074778169
                    :wf_labs => 0.05031573236454722
                  :wf_telnet => -2.0846209075015976
                     :wf_857 => -2.2226802727324
                    :wf_data => -2.8800415674462236
                     :wf_415 => 1.9654102576293766
                      :wf_85 => -0.421562053277384
              :wf_technology => 0.3406481624420365
                    :wf_1999 => -1.543889169617358
                   :wf_parts => -0.10611032300699687
                      :wf_pm => -0.39000503239031853
                  :wf_direct => -0.43360773751399
                      :wf_cs => -8.954766832401562
                 :wf_meeting => -1.4551539698007934
                :wf_original => 0.036688704959824306
                 :wf_project => -1.0456474780122373
                      :wf_re => -1.3255834345909456
                     :wf_edu => -0.5680405634676539
                   :wf_table => 0.03938125758481521
              :wf_conference => -1.6353268137272965
              Symbol("cf_;") => -0.32929001515593
              Symbol("cf_(") => -0.20975118646724641
              Symbol("cf_[") => -1.4251811365093452
                       :cf_! => 0.4224999024795942
             Symbol("cf_\$") => 0.7097391172800116
              Symbol("cf_#") => -0.02705090438170104
 :capital_run_length_average => 14.125278263170905
 :capital_run_length_longest => -2.7564022249517106
   :capital_run_length_total => 2.790850531449405
In [101]:
# type of the object: pairs of Symbol (variable name) and value (coefficient)
println(typeof(fp_1.coefs))
Vector{Pair{Symbol, Float64}}
In [102]:
# to work with the values (coefficients) only -> last
# e.g. the first 10 values
last.(fp_1.coefs)[1:10]
10-element Vector{Float64}:
 -0.20662775844013093
 -0.3811389804797835
  0.1645957927535096
  1.11931487567822
  0.3540115582062516
  0.07002507745031764
  1.0852294566826728
  0.5195139493086103
  0.40124329592484476
  0.2670135142542665
In [103]:
# sum of squared coefficients
# to be compared later with the penalized regression!
println(sum(last.(fp_1.coefs) .^ 2))
363.68086825172304
In [104]:
# the intercept
fp_1.intercept
-3.5008901731060185
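
As a reminder of how such coefficients are used, a binary logistic model turns the linear score (intercept plus the dot product of coefficients and standardized features) into a probability through the sigmoid function. The sketch below uses made-up values (`beta0`, `beta`, `z` are hypothetical, not taken from the fitted model); with the real objects one would presumably plug in `fp_1.intercept`, `last.(fp_1.coefs)` and a row of ZTest, assuming the package follows this standard formulation.

```julia
# logistic (sigmoid) function
sigmoid(t) = 1 / (1 + exp(-t))

# hypothetical values, for illustration only (not the fitted model)
beta0 = -0.5
beta = [1.2, -0.7]
z = [0.3, 1.5]          # one standardized observation

# linear score, then probability of the positive class
score = beta0 + sum(beta .* z)
p_yes = sigmoid(score)
println("score = $score, P(yes) = $p_yes")
```

A negative score gives a probability below 0.5, a positive score a probability above 0.5, and a score of zero gives exactly 0.5.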

Prediction on the test sample (class membership probabilities)

In [105]:
# predict on the test sample -> class membership probabilities
# via the trained machine
# applied to the transformed test data ZTest
proba_1 = MLJ.predict(mach_1,ZTest)

# first values
DFR.first(proba_1,10)
10-element CategoricalDistributions.UnivariateFiniteVector{ScientificTypesBase.OrderedFactor{2}, String, UInt32, Float64}:
 UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>1.0, yes=>1.03e-12)
 UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>0.841, yes=>0.159)
 UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>0.897, yes=>0.103)
 UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>0.00133, yes=>0.999)
 UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>0.0672, yes=>0.933)
 UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>0.936, yes=>0.0643)
 UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>0.000712, yes=>0.999)
 UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>0.759, yes=>0.241)
 UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>0.0349, yes=>0.965)
 UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>4.54e-8, yes=>1.0)

ROC curve and AUC (Area Under the Curve)

In [106]:
# compute the points of the ROC curve
# the third returned value holds the thresholds, which we do not need here
# the function knows that (y = yes) is the target class
# because we encoded it that way earlier (CategoricalArray)
fpr_1, tpr_1, _ = MLJ.roc_curve(proba_1,yTest)

# check
println(fpr_1[1:5])
println(tpr_1[1:5])
[0.0, 0.0004127115146512588, 0.0004127115146512588, 0.0004127115146512588, 0.0004127115146512588]
[0.0, 0.019023462270133164, 0.020291693088142042, 0.02092580849714648, 0.02155992390615092]
In [107]:
# StatsPlots (statistical plots, an extension of Plots)
import StatsPlots

# draw the curve
StatsPlots.plot(fpr_1,tpr_1,legend=false,xlabel="FPR",ylabel="TPR",lw=2)

# add the diagonal with "!"
StatsPlots.plot!([0,1],[0,1],linestyle=:dash,legend=false)
[Figure: ROC curve of model 1 (TPR against FPR), with the dashed diagonal as reference]
In [108]:
# compute the AUC
# note: it uses the predicted probabilities and the observed classes
# the tool knows that (y = yes) is the target class
# because y was encoded that way with CategoricalArrays
import MLJ
auc_1 = MLJ.auc(proba_1,yTest)

# display
println("AUC (Model 1) = $(auc_1)")
AUC (Model 1) = 0.948841829947677
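
The AUC is the area under the piecewise-linear ROC curve, so it can also be approximated from the (FPR, TPR) points with the trapezoidal rule. A minimal sketch, assuming toy points rather than the spam data (`trapezoid_auc` is a helper written for this illustration, not an MLJ function):

```julia
# Trapezoidal area under the piecewise-linear curve through (x[i], y[i])
function trapezoid_auc(x::AbstractVector{<:Real}, y::AbstractVector{<:Real})
    length(x) == length(y) || throw(ArgumentError("x and y must have the same length"))
    area = 0.0
    for i in 2:length(x)
        # width of the step times the mean height of its two ends
        area += (x[i] - x[i-1]) * (y[i] + y[i-1]) / 2
    end
    return area
end

# toy ROC points (hypothetical, not taken from the spam data)
fpr = [0.0, 0.0, 0.5, 1.0]
tpr = [0.0, 0.5, 1.0, 1.0]

println(trapezoid_auc(fpr, tpr))

# the diagonal (random classifier) has area 0.5
println(trapezoid_auc([0.0, 1.0], [0.0, 1.0]))
```

Applied to `fpr_1` and `tpr_1`, this should give a value close to the one returned by `MLJ.auc`, although the library may compute it differently.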
In [109]:
# count the positive and negative examples
# in the test sample
n_pos = sum(yTest .== "yes")
n_neg = sum(yTest .== "no")

# display
print("Positive examples = $n_pos, negative examples = $n_neg")
Positive examples = 1577, negative examples = 2423
In [110]:
import StatsFuns

"""
    auc_ci(auc, n_pos, n_neg, alpha=0.1)

Compute a two-sided confidence interval for the AUC
using the Hanley & McNeil (1982) approximation.

Arguments:
- auc   : value of the AUC (between 0 and 1)
- n_pos : number of positive examples
- n_neg : number of negative examples
- alpha : risk level (default 0.1, i.e. a 90% CI)

Returns:
- (lower, upper)
"""
function auc_ci(auc, n_pos, n_neg, alpha=0.1)
    # quantile of the standard normal distribution
    z = StatsFuns.norminvcdf(1 - alpha/2)

    # intermediate terms
    Q1 = auc / (2 - auc)
    Q2 = (2 * auc^2) / (1 + auc)

    # estimated variance
    var_auc = (
        auc * (1 - auc) +
        (n_pos - 1) * (Q1 - auc^2) +
        (n_neg - 1) * (Q2 - auc^2)
    ) / (n_pos * n_neg)

    se = sqrt(var_auc)

    lower = auc - z * se
    upper = auc + z * se

    # optional: clamp to [0,1]
    return (max(0.0, lower), min(1.0, upper))
end
auc_ci
In [111]:
# computation for our AUC
println(auc_ci(auc_1,n_pos,n_neg))
(0.9422691642446792, 0.9554144956506747)
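
For reference, the computation performed by `auc_ci` can be written out as follows, with A the AUC, n₊ and n₋ the numbers of positive and negative examples, and z the normal quantile (the notation is chosen here for illustration; it simply restates the function body):

```latex
Q_1 = \frac{A}{2 - A}, \qquad Q_2 = \frac{2A^2}{1 + A}

\widehat{\operatorname{Var}}(A) =
  \frac{A(1 - A) + (n_{+} - 1)\,(Q_1 - A^2) + (n_{-} - 1)\,(Q_2 - A^2)}{n_{+}\, n_{-}}

\text{CI}_{1-\alpha} = A \pm z_{1 - \alpha/2}\, \sqrt{\widehat{\operatorname{Var}}(A)}
```

With A = 0.9488, n₊ = 1577, n₋ = 2423 and α = 0.1 (z ≈ 1.645), the half-width of the interval printed above is about 0.0066.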

RIDGE regression

Configuration

In [112]:
# model 2 -- RIDGE, with a stronger penalty
# by default, penalty = l2
lr_2 = LogisticClassifier(lambda=5.0)

# display the parameters
for (param,value) in pairs(MLJ.params(lr_2))
    println("$param = $value")
end
lambda = 5.0
gamma = 0.0
penalty = l2
fit_intercept = true
penalize_intercept = false
scale_penalty_with_samples = true
solver = nothing
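
To make the role of these parameters concrete, here is the objective that such a configuration is expected to minimize. This is a sketch, assuming the objective given in the MLJLinearModels documentation for `LogisticClassifier` with `penalty = l2` and `scale_penalty_with_samples = true`; L denotes the logistic loss on the training sample, n the number of training observations and θ the coefficients (the intercept is left out of the penalty since `penalize_intercept = false`):

```latex
\min_{\theta}\; L(y, X\theta) + \frac{n\,\lambda}{2}\,\lVert \theta \rVert_2^2
```

Under that assumption the penalty grows with the sample size, so `lambda = 5.0` is a strong regularization on a training sample of any appreciable size.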

Training - Inspecting the coefficients

In [113]:
# prepare the object for training
mach_2 = MLJ.machine(lr_2,ZTrain,yTrain)

# run the training -> the mach object is updated in place, hence the "!"
MLJ.fit!(mach_2)

# coefficients
fp_2 = MLJ.fitted_params(mach_2)

# sum of the squared coefficients
# to be compared with the sum from the unpenalized regression (363.68)
# the penalty has shrunk the coefficients considerably
println("SUM square coefs = $(sum(last.(fp_2.coefs) .^ 2))")
SUM square coefs = 0.014676052838459935
┌ Info: Training machine(LogisticClassifier(lambda = 5.0, …), …).
└ @ MLJBase C:\Users\ricco\.julia\packages\MLJBase\krfwA\src\machines.jl:499
┌ Info: Solver: MLJLinearModels.LBFGS{Optim.Options{Float64, Nothing}, @NamedTuple{}}
│   optim_options: Optim.Options{Float64, Nothing}
│   lbfgs_options: @NamedTuple{} NamedTuple()
└ @ MLJLinearModels C:\Users\ricco\.julia\packages\MLJLinearModels\s9vSj\src\mlj\interface.jl:72

Worthwhile? ROC curve and AUC for RIDGE

In [114]:
# predict on the test sample -> class membership probabilities
proba_2 = MLJ.predict(mach_2,ZTest)

# compute the points of the ROC curve
fpr_2, tpr_2, _ = MLJ.roc_curve(proba_2,yTest)
([0.0, 0.0, 0.0, 0.0, 0.0004127115146512588, 0.0004127115146512588, 0.0004127115146512588, 0.0004127115146512588, 0.0004127115146512588, 0.0004127115146512588  …  0.9942220387948824, 0.9946347503095336, 0.9950474618241849, 0.9954601733388362, 0.9958728848534875, 0.9962855963681386, 0.9966983078827899, 0.9971110193974412, 0.9975237309120925, 1.0], [0.0, 0.0006341154090044388, 0.0012682308180088776, 0.0019023462270133164, 0.0019023462270133164, 0.0025364616360177552, 0.0031705770450221942, 0.003804692454026633, 0.004438807863031071, 0.0050729232720355105  …  1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], [0.7943463076716266, 0.787591482250194, 0.7720663676128136, 0.7374097517897057, 0.6965341388285624, 0.6799778221588728, 0.6677390838935167, 0.6634270741130385, 0.6584955592725488, 0.6458663965658581  …  0.21600702643453248, 0.18557917711040775, 0.174984399668831, 0.17467907244710576, 0.1744815223317159, 0.1691136805299214, 0.16881602053673636, 0.16864978219631957, 0.1625690454900294, 0.1554931369455556])
In [115]:
# draw the curves
StatsPlots.plot(fpr_1,tpr_1,lw=2,title="ROC Curve",label="Log.Reg.")
StatsPlots.plot!(fpr_2,tpr_2,lw=2,label="Ridge (lambda=5)")

# add the diagonal
StatsPlots.plot!([0,1],[0,1],linestyle=:dash,label=nothing)
[Figure: ROC curves of the logistic regression and Ridge (lambda = 5), with the dashed diagonal as reference]
In [116]:
# value of the AUC - slightly lower than for model 1
# but the confidence interval is needed before drawing a firm conclusion
auc_2 = MLJ.auc(proba_2,yTest)
println("AUC (Model 2) = $auc_2")
AUC (Model 2) = 0.9354262195075674
In [117]:
# confidence interval
println(auc_ci(auc_2,n_pos,n_neg))
(0.9280643403432914, 0.9427880986718434)
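
A quick way to read the two intervals side by side is to check whether they overlap. A minimal sketch, using the bounds printed above (`overlap` is a helper defined here for illustration):

```julia
# intervals copied from the outputs above
ci_1 = (0.9422691642446792, 0.9554144956506747)   # logistic regression
ci_2 = (0.9280643403432914, 0.9427880986718434)   # ridge, lambda = 5

# two intervals overlap when the larger lower bound
# does not exceed the smaller upper bound
overlap(a, b) = max(a[1], b[1]) <= min(a[2], b[2])

println(overlap(ci_1, ci_2))
```

The intervals overlap, but only barely. Overlap is a rough heuristic at best: both AUCs are estimated on the same test sample, so a paired comparison would be needed for a rigorous conclusion.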

LASSO regression

Configuration

In [118]:
# model 3 -- LASSO
lr_3 = LogisticClassifier(lambda=0.1,penalty="l1")

# display the parameters
for (param,value) in pairs(MLJ.params(lr_3))
    println("$param = $value")
end
lambda = 0.1
gamma = 0.0
penalty = l1
fit_intercept = true
penalize_intercept = false
scale_penalty_with_samples = true
solver = nothing

Training and inspecting the coefficients

In [119]:
# prepare the object for training
mach_3 = MLJ.machine(lr_3,ZTrain,yTrain)

# run the training -> the mach object is updated in place, hence the "!"
MLJ.fit!(mach_3)

# coefficients
fp_3 = MLJ.fitted_params(mach_3)

# NUMBER OF ZERO COEFFICIENTS (expected with LASSO)
println("\n<< Number of zero coefs = $(sum(last.(fp_3.coefs) .== 0)) >>\n")
<< Number of zero coefs = 44 >>

┌ Info: Training machine(LogisticClassifier(lambda = 0.1, …), …).
└ @ MLJBase C:\Users\ricco\.julia\packages\MLJBase\krfwA\src\machines.jl:499
┌ Info: Solver: MLJLinearModels.ProxGrad
│   accel: Bool true
│   max_iter: Int64 1000
│   tol: Float64 0.0001
│   max_inner: Int64 100
│   beta: Float64 0.8
│   gram: Bool false
└ @ MLJLinearModels C:\Users\ricco\.julia\packages\MLJLinearModels\s9vSj\src\mlj\interface.jl:72

List of the non-zero coefficients

In [120]:
# turn the coefs structure into a data frame
df_temp = DFR.DataFrame(cle=first.(fp_3.coefs),valeur=last.(fp_3.coefs))

# filter the data frame
DFR.filter(row -> row.valeur != 0, df_temp)
11×2 DataFrame
 Row │ cle                         valeur
     │ Symbol                      Float64
─────┼─────────────────────────────────────────
   1 │ wf_remove                    0.114821
   2 │ wf_order                     0.073568
   3 │ wf_receive                   0.0456103
   4 │ wf_free                      0.049217
   5 │ wf_business                  0.0429596
   6 │ wf_your                      0.468963
   7 │ wf_000                       0.114493
   8 │ wf_money                     0.042865
   9 │ wf_hp                       -0.00400554
  10 │ cf_$                         0.113806
  11 │ capital_run_length_longest   0.0258599
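
The exact zeros come from the L1 penalty: proximal gradient methods such as the `ProxGrad` solver reported in the log handle it through the soft-thresholding operator, which shrinks every coefficient toward zero and sets the small ones exactly to zero. A minimal sketch of that operator, assuming hypothetical coefficient values and threshold (this is not the MLJLinearModels implementation):

```julia
# soft-thresholding: proximal operator of t * |x|
soft_threshold(x::Real, t::Real) = sign(x) * max(abs(x) - t, 0.0)

# hypothetical coefficient values before the proximal step
raw = [0.47, -0.05, 0.02, -0.30]

# every value is shrunk by 0.1; those within [-0.1, 0.1] become zero
shrunk = soft_threshold.(raw, 0.1)

println(shrunk)
println(count(==(0), shrunk))
```

An L2 penalty, by contrast, only rescales the coefficients, which is why the ridge model keeps all of its variables.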

Evaluation on the test sample - ROC curve

In [121]:
# predict on the test sample -> class membership probabilities
proba_3 = MLJ.predict(mach_3,ZTest)

# compute the points of the ROC curve
fpr_3, tpr_3, _ = MLJ.roc_curve(proba_3,yTest)
([0.0, 0.0, 0.0004127115146512588, 0.0004127115146512588, 0.0004127115146512588, 0.0008254230293025176, 0.0008254230293025176, 0.0008254230293025176, 0.0008254230293025176, 0.0012381345439537762  …  0.9954601733388362, 0.9966983078827899, 0.9971110193974412, 0.9975237309120925, 0.9979364424267437, 0.998349153941395, 0.9987618654560462, 0.9991745769706974, 0.9995872884853487, 1.0], [0.0, 0.0006341154090044388, 0.0006341154090044388, 0.0012682308180088776, 0.0019023462270133164, 0.0019023462270133164, 0.0025364616360177552, 0.0031705770450221942, 0.003804692454026633, 0.003804692454026633  …  1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], [0.9781154370866937, 0.9741093907340005, 0.970412451188318, 0.9676896769028709, 0.9666205285456669, 0.9623977084405708, 0.9612543569594248, 0.9609321779360758, 0.9590932745894498, 0.9575430988337869  …  0.26639193543725564, 0.26632137183450755, 0.26588809204447694, 0.26578237876364835, 0.26574714696295565, 0.2656766923021226, 0.26465101337864433, 0.2643792252377439, 0.26427389590053063, 0.264066360590983])
In [124]:
# draw the curves
# << LASSO >> -- visibly TOO MANY variables removed
StatsPlots.plot(fpr_1,tpr_1,lw=2,title="ROC Curve",label="Log.Reg.")
StatsPlots.plot!(fpr_2,tpr_2,lw=2,label="Ridge (lambda = 5)")
StatsPlots.plot!(fpr_3,tpr_3,lw=2,label="Lasso (lambda = 0.1)")

# add the diagonal
StatsPlots.plot!([0,1],[0,1],linestyle=:dash,label=nothing)
[Figure: ROC curves of the logistic regression, Ridge (lambda = 5) and Lasso (lambda = 0.1), with the dashed diagonal as reference]
In [123]:
# AUC and confidence interval
auc_3 = MLJ.auc(proba_3,yTest)
println("AUC (Model 3) = $auc_3")

# confidence interval
bb_3, bh_3 = auc_ci(auc_3,n_pos,n_neg)
print("Confidence interval = [$bb_3, $bh_3]")
AUC (Model 3) = 0.8848493001046042
Confidence interval = [0.8751620534709881, 0.8945365467382203]

Defining the processing steps as a PIPELINE (standardization + regression)

Building the pipeline - Training, prediction

In [125]:
# build the model as a pipeline
# |> chains the steps
pipe = Standardizer() |> LogisticClassifier(lambda=0)

# training ON THE UNTRANSFORMED TRAINING DATA
mach_pipe = MLJ.machine(pipe,XTrain,yTrain)
MLJ.fit!(mach_pipe)

# predicted probabilities ON THE UNTRANSFORMED TEST DATA
proba_pipe = MLJ.predict(mach_pipe,XTest)

# first values
DFR.first(proba_pipe,7)
┌ Info: Training machine(ProbabilisticPipeline(standardizer = Standardizer(features = Symbol[], …), …), …).
└ @ MLJBase C:\Users\ricco\.julia\packages\MLJBase\krfwA\src\machines.jl:499
┌ Info: Training machine(:standardizer, …).
└ @ MLJBase C:\Users\ricco\.julia\packages\MLJBase\krfwA\src\machines.jl:499
┌ Info: Training machine(:logistic_classifier, …).
└ @ MLJBase C:\Users\ricco\.julia\packages\MLJBase\krfwA\src\machines.jl:499
┌ Info: Solver: MLJLinearModels.LBFGS{Optim.Options{Float64, Nothing}, @NamedTuple{}}
│   optim_options: Optim.Options{Float64, Nothing}
│   lbfgs_options: @NamedTuple{} NamedTuple()
└ @ MLJLinearModels C:\Users\ricco\.julia\packages\MLJLinearModels\s9vSj\src\mlj\interface.jl:72
7-element CategoricalDistributions.UnivariateFiniteVector{ScientificTypesBase.OrderedFactor{2}, String, UInt32, Float64}:
 UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>1.0, yes=>1.03e-12)
 UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>0.841, yes=>0.159)
 UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>0.897, yes=>0.103)
 UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>0.00133, yes=>0.999)
 UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>0.0672, yes=>0.933)
 UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>0.936, yes=>0.0643)
 UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>0.000712, yes=>0.999)

Comparison with the approach without a pipeline

In [126]:
# as a reminder, proba_1
DFR.first(proba_1,7)
7-element CategoricalDistributions.UnivariateFiniteVector{ScientificTypesBase.OrderedFactor{2}, String, UInt32, Float64}:
 UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>1.0, yes=>1.03e-12)
 UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>0.841, yes=>0.159)
 UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>0.897, yes=>0.103)
 UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>0.00133, yes=>0.999)
 UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>0.0672, yes=>0.933)
 UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>0.936, yes=>0.0643)
 UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(no=>0.000712, yes=>0.999)
In [127]:
# overall check - the sum of the squared differences should be 0
sum((MLJ.pdf.(proba_1,"yes") .- MLJ.pdf.(proba_pipe,"yes")) .^ 2)
0.0
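
The two approaches agree because the pipeline performs the same steps internally: the standardization parameters are learned on the training data and then reused, unchanged, on the test data. A minimal sketch of that principle on a single toy feature, assuming made-up values and the corrected sample standard deviation (whether `Standardizer` uses exactly this estimator is an assumption here):

```julia
using Statistics

# hypothetical toy feature, split into train and test parts
x_train = [1.0, 2.0, 3.0, 4.0, 5.0]
x_test  = [2.0, 6.0]

# the parameters are estimated on the training part only
mu = mean(x_train)
sigma = std(x_train)

# the same parameters are applied to both parts
z_train = (x_train .- mu) ./ sigma
z_test  = (x_test .- mu) ./ sigma

println(z_train)
println(z_test)
```

Re-estimating the mean and standard deviation on the test part would put the two samples on different scales and the fitted coefficients would no longer apply.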