Environment and packages

In [92]:
# activate the environment
using Pkg
Pkg.activate("env_julia_nlp")
  Activating project at `c:\Users\ricco\Desktop\demo\env_julia_nlp`
In [93]:
# list the installed packages
Pkg.status()
Status `C:\Users\ricco\Desktop\demo\env_julia_nlp\Project.toml`
  [324d7699] CategoricalArrays v1.1.1
  [a93c6f00] DataFrames v1.8.2
  [add582a8] MLJ v0.23.2
  [33e4bacb] MLJNaiveBayesInterface v0.1.7
  [9bbee03b] NaiveBayes v0.6.0
  [a2db99b7] TextAnalysis v0.8.5
  [6385f0a0] WordCloud v1.3.3
  [fdbf4ff8] XLSX v0.11.10

Importing the labeled corpus

In [94]:
# packages
import DataFrames as DFR
import XLSX

# read the data
df = DFR.DataFrame(XLSX.readtable("./reuters_r8.xlsx"))

# first rows
println(DFR.first(df,5))
5×2 DataFrame
 Row │ classe  texte                             
     │ String  String                            
─────┼───────────────────────────────────────────
   1 │ trade   asian exporters fear damage from…
   2 │ grain   china daily says vermin eat pct …
   3 │ ship    australian foreign ship ban ends…
   4 │ acq     sumitomo bank aims at quick reco…
   5 │ earn    amatil proposes two for five bon…
In [95]:
# class frequencies
DFR.combine(DFR.groupby(df,:classe),DFR.nrow => :frequence)
8×2 DataFrame
 Row │ classe    frequence
     │ String    Int64
─────┼─────────────────────
   1 │ trade           326
   2 │ grain            51
   3 │ ship            144
   4 │ acq            2292
   5 │ earn           3923
   6 │ money_fx        293
   7 │ interest        271
   8 │ crude           374

Building the document-term matrix

Conversion to a corpus and preprocessing

In [96]:
# convert the raw texts into a corpus
import TextAnalysis as TA
crps = TA.Corpus(TA.StringDocument.(df.texte))
crps
A Corpus with 7674 documents:
 * 7674 StringDocument's
 * 0 FileDocument's
 * 0 TokenDocument's
 * 0 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
In [97]:
# working copy of the corpus, to be cleaned
corpus = deepcopy(crps)
corpus[1]
A TextAnalysis.StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: asian exporters fear damage from u s japan rift mo
In [98]:
# full text of document 1
TA.text(corpus[1])
"asian exporters fear damage from u s japan rift mounting trade friction between the u s and japan has raised fears among many of asia s exporting nations that the row could inflict far reaching economic damage businessmen and officials said they told reuter corresponden" ⋯ 3739 bytes ⋯ "ime minister yasuhiro nakasone s avowed fiscal reform program deputy u s trade representative michael smith and makoto kuroda japan s deputy minister of international trade and industry miti are due to meet in washington this week in an effort to end the dispute reuter "
In [99]:
# preparation and cleaning
TA.prepare!(corpus, TA.strip_punctuation | TA.strip_stopwords | TA.strip_numbers | TA.strip_whitespace)

# stemming (disabled here)
#TA.stem!(corpus)

# check
TA.text(corpus[1])
"asian exporters fear damage japan rift mounting trade friction japan raised fears asia exporting nations row inflict reaching economic damage businessmen officials told reuter correspondents asian capitals move japan boost protectionist sentiment lead curbs american imp" ⋯ 2293 bytes ⋯ "nding emergency measure stimulate economy despite prime minister yasuhiro nakasone avowed fiscal reform program deputy trade representative michael smith makoto kuroda japan deputy minister international trade industry miti due meet washington week effort dispute reuter"

Term lexicon

In [100]:
# build the lexicon
TA.update_lexicon!(corpus)

# lexicon terms with their corpus-wide counts
termes = TA.lexicon(corpus)
termes
Dict{String, Int64} with 23165 entries:
  "tucsonm"     => 1
  "chem"        => 15
  "mh"          => 1
  "rha"         => 17
  "gout"        => 2
  "henry"       => 20
  "skylight"    => 1
  "tadxes"      => 1
  "bidder"      => 30
  "gooderham"   => 9
  "rises"       => 51
  "hampshire"   => 9
  "beckett"     => 2
  "brandt"      => 1
  "sunstar"     => 1
  "progression" => 1
  "tribunal"    => 1
  "il"          => 2
  "belgo"       => 4
  ⋮             => ⋮
In [101]:
# the 30 most frequent terms
termes_30 = sort(collect(termes), by = x -> x[2], rev = true)[1:30]
termes_30
30-element Vector{Pair{String, Int64}}:
     "mln" => 14554
      "vs" => 14124
    "dlrs" => 9492
     "cts" => 8054
  "reuter" => 7024
     "net" => 6759
    "loss" => 5019
     "pct" => 4518
     "shr" => 4101
 "company" => 4075
           ⋮
   "trade" => 1729
   "sales" => 1569
    "note" => 1539
      "co" => 1538
   "offer" => 1426
 "quarter" => 1374
   "april" => 1373
  "market" => 1306
   "march" => 1227

Word cloud

In [102]:
# the same terms displayed as a word cloud
import WordCloud as WC
wc = WC.wordcloud(termes_30,fonts="Consolas")
WC.generate!(wc)
colors = 0
angles = 0
backgroundcolor = :maskcolor
shape(ellipse, 195, 166, color="#B6D2D1", padding=15)
gathering style: rt = 1, ellipse
▸1. Set spacing = 2; scale = 36.882891821965266
Completed after 58 epochs.
[Figure: word cloud of the 30 most frequent terms]

Building the DTM - Frequency weighting

In [103]:
# DTM (document-term matrix)
# default weighting: term frequency (raw counts)
dtm = TA.DocumentTermMatrix(corpus)
dtm
A 7674 X 23165 DocumentTermMatrix
In [104]:
# the lexicon again, as the DTM's term list
dtm.terms
23165-element Vector{String}:
 "aa"
 "aaa"
 "aabex"
 "aac"
 "aachener"
 "aagiy"
 "aaica"
 "aaix"
 "aam"
 "aame"
 ⋮
 "zuccherifici"
 "zuckerman"
 "zuheir"
 "zulia"
 "zur"
 "zurich"
 "zuyuan"
 "zy"
 "zzzz"
In [105]:
# its type
typeof(dtm)
TextAnalysis.DocumentTermMatrix{String}

DTM - Binary weighting

In [106]:
# convert to a binary matrix
# presence / absence of each term in each document
# the result is a sparse matrix (SparseMatrixCSC)
dtm_bin = Int.(TA.dtm(dtm) .> 0)
dtm_bin
7674×23165 SparseArrays.SparseMatrixCSC{Int64, Int64} with 301642 stored entries:
⎡⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎤
⎢⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎥
⎢⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎥
⎢⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎥
⎢⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎥
⎢⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎥
⎣⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⎦
In [107]:
# convert to a dense matrix, wrapped in a DataFrame
X = DFR.DataFrame(Matrix(dtm_bin),dtm.terms)
size(X)
(7674, 23165)
In [108]:
# check: display a small block
X[1:10,1:10]
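10×10 DataFrame
 Row │ aa     aaa    aabex  aac    aachener  aagiy  aaica  aaix   aam    aame
     │ Int64  Int64  Int64  Int64  Int64     Int64  Int64  Int64  Int64  Int64
─────┼────────────────────────────────────────────────────────────────────────
   1 │     0      0      0      0         0      0      0      0      0      0
   2 │     0      0      0      0         0      0      0      0      0      0
   3 │     0      0      0      0         0      0      0      0      0      0
   4 │     0      0      0      0         0      0      0      0      0      0
   5 │     0      0      0      0         0      0      0      0      0      0
   6 │     0      0      0      0         0      0      0      0      0      0
   7 │     0      0      0      0         0      0      0      0      0      0
   8 │     0      0      0      0         0      0      0      0      0      0
   9 │     0      0      0      0         0      0      0      0      0      0
  10 │     0      0      0      0         0      0      0      0      0      0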
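
The sizes printed in this section (7674 × 23165, with 301642 stored entries) give a rough idea of what the dense conversion costs. The sketch below is only a back-of-the-envelope estimate: it assumes 8-byte `Int64` values and indices and the usual compressed-sparse-column layout (values, row indices, column pointers), and it ignores any overhead added by `DataFrame` itself.

```julia
# Figures copied from the outputs of this section
n_docs, n_terms, n_stored = 7674, 23165, 301642

# Dense storage: one Int64 per cell
dense_bytes = n_docs * n_terms * sizeof(Int64)

# CSC storage (assumed layout): one value and one row index per stored entry,
# plus one column pointer per column (and one extra)
sparse_bytes = n_stored * (sizeof(Int64) + sizeof(Int64)) + (n_terms + 1) * sizeof(Int64)

# Share of cells that actually hold a stored entry
density = n_stored / (n_docs * n_terms)

println("dense  ≈ ", round(dense_bytes / 1e9, digits = 2), " GB")
println("sparse ≈ ", round(sparse_bytes / 1e6, digits = 2), " MB")
println("density ≈ ", round(100 * density, digits = 2), " %")
```

With these figures the dense matrix takes roughly 1.4 GB against about 5 MB for the sparse one, which is worth keeping in mind for larger corpora.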

Difference between the two weightings

In [109]:
# total number of occurrences of "trade"
# across all documents
# i.e. "trade" may occur several times within one document
termes["trade"]
1729
In [110]:
# number of documents in which "trade" occurs
# at least once (sum of the binary column)
# `sum` lives in Base, so no extra import is needed
DFR.combine(X, :trade => sum)
1×1 DataFrame
 Row │ trade_sum
     │ Int64
─────┼───────────
   1 │       478
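
The gap between the two figures for "trade" (1729 occurrences, 478 documents) comes from repeated occurrences inside a single document. A minimal sketch on three made-up, already tokenized documents (hypothetical data, not taken from the Reuters corpus) shows the two counts side by side:

```julia
# Three toy documents (hypothetical, for illustration only)
docs = [["trade", "deficit", "trade"],
        ["oil", "price"],
        ["trade", "talks"]]

term = "trade"

# Frequency weighting: every occurrence counts
total_occurrences = sum(count(==(term), d) for d in docs)

# Binary weighting: each document counts at most once
document_frequency = count(d -> term in d, docs)

println((total_occurrences, document_frequency))  # (3, 2)
```

The first value corresponds to the lexicon count, the second to the column sum of the binary matrix.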

Preparing for predictive modeling

In [111]:
# target variable -> encode as a categorical ("factor") vector
import CategoricalArrays as CA
y = CA.categorical(df.classe)

# list of levels
CA.levels(y)
8-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "acq"
 "crude"
 "earn"
 "grain"
 "interest"
 "money_fx"
 "ship"
 "trade"
In [112]:
# row indices for the train/test split (60% / 40%, stratified on y)
import MLJ
idTrain, idTest = MLJ.partition(1:DFR.nrow(df),0.6,shuffle=true,stratify=y,rng=42)

# sizes of the two index sets
println(size(idTrain))
println(size(idTest))
(4605,)
(3069,)

Modeling and evaluation

Naive Bayes (package "NaiveBayes" via "MLJ")

In [113]:
# naive Bayes: load the model type
using MLJNaiveBayesInterface
NBayes = MLJ.@load MultinomialNBClassifier pkg = "NaiveBayes"
import MLJNaiveBayesInterface ✔
┌ Info: For silent loading, specify `verbosity=0`. 
└ @ Main C:\Users\ricco\.julia\packages\MLJModels\AWkxi\src\loading.jl:159
MultinomialNBClassifier

Instantiation - Training on the training set

In [114]:
# instantiate the model
modele = NBayes()

# machine binding the model to the data
mach = MLJ.machine(modele,X,y)
untrained Machine; caches model-specific representations of data
  model: MultinomialNBClassifier(alpha = 1)
  args: 
    1:	Source @949 ⏎ ScientificTypesBase.Table{AbstractVector{ScientificTypesBase.Count}}
    2:	Source @369 ⏎ AbstractVector{ScientificTypesBase.Multiclass{8}}
In [115]:
# train on the training sample
# rows = indices of the TRAIN observations
MLJ.fit!(mach,rows=idTrain)
┌ Info: Training machine(MultinomialNBClassifier(alpha = 1), …).
└ @ MLJBase C:\Users\ricco\.julia\packages\MLJBase\DCbte\src\machines.jl:499
trained Machine; caches model-specific representations of data
  model: MultinomialNBClassifier(alpha = 1)
  args: 
    1:	Source @949 ⏎ ScientificTypesBase.Table{AbstractVector{ScientificTypesBase.Count}}
    2:	Source @369 ⏎ AbstractVector{ScientificTypesBase.Multiclass{8}}

Inspection

In [116]:
# fitted params: quantities learned by the algorithm
fp = MLJ.fitted_params(mach)
keys(fp)
(:c_counts, :x_counts, :x_totals, :n_obs, :class_pool)
In [117]:
# e.g. the class counts
fp.c_counts
Dict{String, Int64} with 8 entries:
  "ship"     => 87
  "trade"    => 197
  "grain"    => 32
  "earn"     => 2355
  "acq"      => 1376
  "money_fx" => 177
  "crude"    => 225
  "interest" => 164

Evaluation on the test set

In [118]:
# predictions on the test set
pred = MLJ.predict_mode(mach,rows=idTest)

# first values
pred[1:10]
10-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "trade"
 "crude"
 "trade"
 "crude"
 "crude"
 "crude"
 "earn"
 "earn"
 "earn"
 "money_fx"
In [119]:
# confusion matrix
mc = MLJ.confusion_matrix(pred,y[idTest])
mc
8×8 Matrix{Int64}:
 905   11    86  3  12    5  12    3
   0  132     0  3   0    0  12    0
   8    2  1479  3   6    4   0    6
   0    0     0  0   0    0   0    0
   0    0     0  1  60    4   0    0
   2    0     0  1  25  100   0    6
   0    1     0  0   0    0  29    0
   2    4     4  9   5    4   5  115
In [120]:
# accuracy
MLJ.accuracy(pred,y[idTest])
0.9188660801564027
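
The accuracy can be recomputed by hand from the confusion matrix, and the same matrix gives a per-class view that the single figure hides. The sketch below copies the matrix from the output and relies on two assumptions: rows are predicted classes and columns are true classes (MLJ's display convention, which also agrees with the column sums here, e.g. 917 for the first column, about 40% of the 2292 "acq" documents), and classes follow the order of `CA.levels(y)`.

```julia
using LinearAlgebra  # tr, diag

# Confusion matrix copied from the output above.
# Assumption: rows = predicted class, columns = true class.
mc = [905  11   86  3  12    5  12    3;
        0 132    0  3   0    0  12    0;
        8   2 1479  3   6    4   0    6;
        0   0    0  0   0    0   0    0;
        0   0    0  1  60    4   0    0;
        2   0    0  1  25  100   0    6;
        0   1    0  0   0    0  29    0;
        2   4    4  9   5    4   5  115]

# Assumption: same class order as CA.levels(y)
classes = ["acq", "crude", "earn", "grain", "interest", "money_fx", "ship", "trade"]

# accuracy = correctly classified documents / all test documents
acc = tr(mc) / sum(mc)

# recall per class = diagonal / column total (true class size in the test set)
recall = diag(mc) ./ vec(sum(mc, dims = 1))

println("accuracy = ", acc)
for (c, r) in zip(classes, recall)
    println(rpad(c, 10), round(r, digits = 3))
end
```

The diagonal sums to 2820 out of 3069 test documents, which reproduces the accuracy reported by `MLJ.accuracy`. Under the stated assumptions, the fourth class ("grain") is never predicted, since its row is all zeros, so its recall is 0 despite the high overall accuracy.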