Environment and packages
In [92]:
# activate the environment
using Pkg
Pkg.activate("env_julia_nlp")
Activating project at `c:\Users\ricco\Desktop\demo\env_julia_nlp`
In [93]:
# list the installed packages
Pkg.status()
Status `C:\Users\ricco\Desktop\demo\env_julia_nlp\Project.toml`
  [324d7699] CategoricalArrays v1.1.1
  [a93c6f00] DataFrames v1.8.2
  [add582a8] MLJ v0.23.2
  [33e4bacb] MLJNaiveBayesInterface v0.1.7
  [9bbee03b] NaiveBayes v0.6.0
  [a2db99b7] TextAnalysis v0.8.5
  [6385f0a0] WordCloud v1.3.3
  [fdbf4ff8] XLSX v0.11.10
Importing the labeled corpus
In [94]:
# packages
import DataFrames as DFR
import XLSX
# read the data
df = DFR.DataFrame(XLSX.readtable("./reuters_r8.xlsx"))
# first rows
println(DFR.first(df,5))
5×2 DataFrame
 Row │ classe  texte
     │ String  String
─────┼───────────────────────────────────────────
   1 │ trade   asian exporters fear damage from…
   2 │ grain   china daily says vermin eat pct …
   3 │ ship    australian foreign ship ban ends…
   4 │ acq     sumitomo bank aims at quick reco…
   5 │ earn    amatil proposes two for five bon…
In [95]:
# class frequencies
DFR.combine(DFR.groupby(df,:classe),DFR.nrow => :frequence)
8×2 DataFrame
| Row | classe | frequence |
|---|---|---|
| | String | Int64 |
| 1 | trade | 326 |
| 2 | grain | 51 |
| 3 | ship | 144 |
| 4 | acq | 2292 |
| 5 | earn | 3923 |
| 6 | money_fx | 293 |
| 7 | interest | 271 |
| 8 | crude | 374 |
Building the document-term matrix
Conversion to a corpus and preprocessing
In [96]:
# convert the raw texts into a corpus
import TextAnalysis as TA
crps = TA.Corpus(TA.StringDocument.(df.texte))
crps
A Corpus with 7674 documents:
 * 7674 StringDocument's
 * 0 FileDocument's
 * 0 TokenDocument's
 * 0 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
In [97]:
# working copy of the corpus, to be cleaned
corpus = deepcopy(crps)
corpus[1]
A TextAnalysis.StringDocument{String}
* Language: Languages.English()
* Title: Untitled Document
* Author: Unknown Author
* Timestamp: Unknown Time
* Snippet: asian exporters fear damage from u s japan rift mo
In [98]:
# full text of document 1
TA.text(corpus[1])
"asian exporters fear damage from u s japan rift mounting trade friction between the u s and japan has raised fears among many of asia s exporting nations that the row could inflict far reaching economic damage businessmen and officials said they told reuter corresponden" ⋯ 3739 bytes ⋯ "ime minister yasuhiro nakasone s avowed fiscal reform program deputy u s trade representative michael smith and makoto kuroda japan s deputy minister of international trade and industry miti are due to meet in washington this week in an effort to end the dispute reuter "
In [99]:
# preparation and cleaning
TA.prepare!(corpus, TA.strip_punctuation | TA.strip_stopwords | TA.strip_numbers | TA.strip_whitespace)
# stemming (disabled here)
#TA.stem!(corpus)
# check
TA.text(corpus[1])
"asian exporters fear damage japan rift mounting trade friction japan raised fears asia exporting nations row inflict reaching economic damage businessmen officials told reuter correspondents asian capitals move japan boost protectionist sentiment lead curbs american imp" ⋯ 2293 bytes ⋯ "nding emergency measure stimulate economy despite prime minister yasuhiro nakasone avowed fiscal reform program deputy trade representative michael smith makoto kuroda japan deputy minister international trade industry miti due meet washington week effort dispute reuter"
Term dictionary
In [100]:
# build the lexicon
TA.update_lexicon!(corpus)
# terms of the lexicon, with their counts
termes = TA.lexicon(corpus)
termes
Dict{String, Int64} with 23165 entries:
"tucsonm" => 1
"chem" => 15
"mh" => 1
"rha" => 17
"gout" => 2
"henry" => 20
"skylight" => 1
"tadxes" => 1
"bidder" => 30
"gooderham" => 9
"rises" => 51
"hampshire" => 9
"beckett" => 2
"brandt" => 1
"sunstar" => 1
"progression" => 1
"tribunal" => 1
"il" => 2
"belgo" => 4
⋮ => ⋮
In [101]:
# the 30 most frequent terms
termes_30 = sort(collect(termes), by = x -> x[2], rev = true)[1:30]
termes_30
30-element Vector{Pair{String, Int64}}:
"mln" => 14554
"vs" => 14124
"dlrs" => 9492
"cts" => 8054
"reuter" => 7024
"net" => 6759
"loss" => 5019
"pct" => 4518
"shr" => 4101
"company" => 4075
⋮
"trade" => 1729
"sales" => 1569
"note" => 1539
"co" => 1538
"offer" => 1426
"quarter" => 1374
"april" => 1373
"market" => 1306
"march" => 1227
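The sort-then-slice step above can be sketched on a small dictionary. The counts below are made up for illustration, and `first(v, k)` is used here instead of `[1:30]` only because it returns at most `k` items without failing on a short lexicon.

```julia
# Hypothetical term counts (not taken from the corpus)
toy_lexicon = Dict("mln" => 3, "vs" => 2, "dlrs" => 1)

# Sort the (term => count) pairs by decreasing count
sorted_terms = sort(collect(toy_lexicon), by = last, rev = true)

# first(v, k) returns at most k items, so it does not fail on small lexicons
top2 = first(sorted_terms, 2)
println(top2)
```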
Wordcloud
In [102]:
# or, alternatively, as a word cloud
import WordCloud as WC
wc = WC.wordcloud(termes_30,fonts="Consolas")
WC.generate!(wc)
colors = 0
angles = 0
backgroundcolor = :maskcolor
shape(ellipse, 195, 166, color="#B6D2D1", padding=15)
gathering style: rt = 1, ellipse
▸1. Set spacing = 2; scale = 36.882891821965266
Completed after 58 epochs.
Building the DTM - frequency weighting
In [103]:
# DTM (document-term matrix)
# default weighting: term frequency
dtm = TA.DocumentTermMatrix(corpus)
dtm
A 7674 X 23165 DocumentTermMatrix
In [104]:
# the lexicon again, this time as the sorted column terms of the DTM
dtm.terms
23165-element Vector{String}:
"aa"
"aaa"
"aabex"
"aac"
"aachener"
"aagiy"
"aaica"
"aaix"
"aam"
"aame"
⋮
"zuccherifici"
"zuckerman"
"zuheir"
"zulia"
"zur"
"zurich"
"zuyuan"
"zy"
"zzzz"
In [105]:
# its type
typeof(dtm)
TextAnalysis.DocumentTermMatrix{String}
DTM - binary weighting
In [106]:
# convert to a binary matrix
# presence / absence of each term in each document
# the result is a sparse matrix (SparseMatrixCSC)
dtm_bin = Int.(TA.dtm(dtm) .> 0)
dtm_bin
7674×23165 SparseArrays.SparseMatrixCSC{Int64, Int64} with 301642 stored entries:
⎡⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎤
⎢⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎥
⎢⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎥
⎢⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎥
⎢⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎥
⎢⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎥
⎣⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⎦
In [107]:
# convert to a dense matrix, wrapped in a DataFrame with the terms as column names
X = DFR.DataFrame(Matrix(dtm_bin),dtm.terms)
size(X)
(7674, 23165)
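Densifying a matrix this sparse is costly. The short calculation below uses only the dimensions and the stored-entry count printed above to estimate the footprint: roughly 1.4 GB for the dense `Int64` cells alone. Variable names are illustrative, and the estimate ignores any DataFrame overhead.

```julia
# Dimensions and stored-entry count reported above for dtm_bin
n_docs, n_terms, n_stored = 7674, 23165, 301642

n_cells = n_docs * n_terms               # cells in the dense matrix
dense_bytes = n_cells * sizeof(Int64)    # 8 bytes per Int64 cell
density = n_stored / n_cells             # share of nonzero cells

println("dense cells : ", n_cells)
println("dense size  : ", round(dense_bytes / 1e9, digits = 2), " GB")
println("density     : ", round(100 * density, digits = 2), " %")
```

With fewer than 0.2 % of the cells nonzero, keeping the sparse representation is worth considering whenever the downstream model accepts it.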
In [108]:
# check: display a small excerpt
X[1:10,1:10]
10×10 DataFrame
| Row | aa | aaa | aabex | aac | aachener | aagiy | aaica | aaix | aam | aame |
|---|---|---|---|---|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Difference between the two weightings
In [109]:
# number of occurrences of "trade"
# across all documents
# i.e. "trade" may appear several times in the same document
termes["trade"]
1729
In [110]:
# number of documents in which "trade" appears
# at least once
DFR.combine(X, :trade => sum)
1×1 DataFrame
| Row | trade_sum |
|---|---|
| | Int64 |
| 1 | 478 |
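The gap between 1729 and 478 comes from what the column sums count under each weighting. A minimal sketch on a made-up 3×3 count matrix (the values are hypothetical, not taken from the corpus):

```julia
using SparseArrays

# Hypothetical term counts: 3 documents (rows) x 3 terms (columns)
counts = sparse([2 0 1;
                 0 0 3;
                 1 1 0])

# Binary weighting: 1 if the term occurs in the document, 0 otherwise
binary = Int.(counts .> 0)

# Column sums of the counts: total occurrences of each term
total_occurrences = vec(sum(counts, dims = 1))

# Column sums of the binary matrix: number of documents containing each term
document_frequency = vec(sum(binary, dims = 1))

println(total_occurrences)
println(document_frequency)
```

The first vector plays the role of `termes["trade"]`, the second that of `trade_sum`.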
Preparation for predictive modeling
In [111]:
# target variable -> encoded as a categorical ("factor") type
import CategoricalArrays as CA
y = CA.categorical(df.classe)
# list of levels
CA.levels(y)
8-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"acq"
"crude"
"earn"
"grain"
"interest"
"money_fx"
"ship"
"trade"
In [112]:
# row indices for the train/test split (60 % train, stratified on y)
import MLJ
idTrain, idTest = MLJ.partition(1:DFR.nrow(df),0.6,shuffle=true,stratify=y,rng=42)
# sizes
println(size(idTrain))
println(size(idTest))
(4605,)
(3069,)
Modeling and evaluation
Naive Bayes ("NaiveBayes" package via "MLJ")
In [113]:
# naive Bayes: load the model type
using MLJNaiveBayesInterface
NBayes = MLJ.@load MultinomialNBClassifier pkg = "NaiveBayes"
import MLJNaiveBayesInterface ✔
┌ Info: For silent loading, specify `verbosity=0`.
└ @ Main C:\Users\ricco\.julia\packages\MLJModels\AWkxi\src\loading.jl:159
MultinomialNBClassifier
Instantiation - Training on the learning sample
In [114]:
# instantiate the model
modele = NBayes()
# machine binding the model to the data structures
mach = MLJ.machine(modele,X,y)
untrained Machine; caches model-specific representations of data
model: MultinomialNBClassifier(alpha = 1)
args:
1: Source @949 ⏎ ScientificTypesBase.Table{AbstractVector{ScientificTypesBase.Count}}
2: Source @369 ⏎ AbstractVector{ScientificTypesBase.Multiclass{8}}
In [115]:
# train on the learning sample
# rows = indices of the TRAIN observations
MLJ.fit!(mach,rows=idTrain)
┌ Info: Training machine(MultinomialNBClassifier(alpha = 1), …).
└ @ MLJBase C:\Users\ricco\.julia\packages\MLJBase\DCbte\src\machines.jl:499
trained Machine; caches model-specific representations of data
model: MultinomialNBClassifier(alpha = 1)
args:
1: Source @949 ⏎ ScientificTypesBase.Table{AbstractVector{ScientificTypesBase.Count}}
2: Source @369 ⏎ AbstractVector{ScientificTypesBase.Multiclass{8}}
Inspection
In [116]:
# fitted params: quantities produced by the algorithm
fp = MLJ.fitted_params(mach)
keys(fp)
(:c_counts, :x_counts, :x_totals, :n_obs, :class_pool)
In [117]:
# class frequencies, for example
fp.c_counts
Dict{String, Int64} with 8 entries:
"ship" => 87
"trade" => 197
"grain" => 32
"earn" => 2355
"acq" => 1376
"money_fx" => 177
"crude" => 225
"interest" => 164
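To see how class counts and per-class term counts can feed a prediction, here is a minimal sketch of the textbook multinomial naive Bayes rule with Laplace smoothing (alpha = 1). It runs on made-up data. Every name in it is illustrative, and it is not claimed to reproduce the internals of NaiveBayes.jl term by term.

```julia
# Toy data: 3 documents x 3 terms, with hypothetical counts and labels
X_toy = [2 0 1;    # document 1, class "a"
         1 0 0;    # document 2, class "a"
         0 3 1]    # document 3, class "b"
y_toy = ["a", "a", "b"]
alpha = 1                       # Laplace smoothing constant
classes = unique(y_toy)
n_terms = size(X_toy, 2)

# Per-class term counts and log prior probabilities
term_counts = Dict(c => vec(sum(X_toy[y_toy .== c, :], dims = 1)) for c in classes)
log_prior = Dict(c => log(count(==(c), y_toy) / length(y_toy)) for c in classes)

# Smoothed log-probability of each term given the class
log_lik = Dict(c => log.((term_counts[c] .+ alpha) ./
                         (sum(term_counts[c]) + alpha * n_terms)) for c in classes)

# Log-posterior score (up to a constant) and predicted class for a count vector x
score(x, c) = log_prior[c] + sum(x .* log_lik[c])
predict_class(x) = classes[argmax([score(x, c) for c in classes])]

println(predict_class([0, 2, 0]))
```

A document made only of the second term is assigned to "b", the only class in which that term was observed.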
Evaluation on the test set
In [118]:
# prediction on the test set
pred = MLJ.predict_mode(mach,rows=idTest)
# first values
pred[1:10]
10-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"trade"
"crude"
"trade"
"crude"
"crude"
"crude"
"earn"
"earn"
"earn"
"money_fx"
In [119]:
# confusion matrix
mc = MLJ.confusion_matrix(pred,y[idTest])
mc
8×8 Matrix{Int64}:
905 11 86 3 12 5 12 3
0 132 0 3 0 0 12 0
8 2 1479 3 6 4 0 6
0 0 0 0 0 0 0 0
0 0 0 1 60 4 0 0
2 0 0 1 25 100 0 6
0 1 0 0 0 0 29 0
2 4 4 9 5 4 5 115
In [120]:
# accuracy
MLJ.accuracy(pred,y[idTest])
0.9188660801564027
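The accuracy can be cross-checked by hand from the confusion matrix: the diagonal holds the correctly classified documents. The sketch below copies the matrix printed above into a plain array (`mc_values` is an illustrative name). It assumes that rows are predictions and columns are observed classes, which is the usual MLJ layout.

```julia
using LinearAlgebra

# Confusion matrix copied from the output above
mc_values = [905  11   86  3  12   5  12   3;
               0 132    0  3   0   0  12   0;
               8   2 1479  3   6   4   0   6;
               0   0    0  0   0   0   0   0;
               0   0    0  1  60   4   0   0;
               2   0    0  1  25 100   0   6;
               0   1    0  0   0   0  29   0;
               2   4    4  9   5   4   5 115]

# Accuracy = correctly classified documents / all test documents
acc = tr(mc_values) / sum(mc_values)

# Number of test documents assigned to each class (row sums)
predicted_per_class = vec(sum(mc_values, dims = 2))

println(acc)
println(predicted_per_class)
```

The ratio is 2820 / 3069, which matches the value returned by `MLJ.accuracy`. The fourth row is entirely zero. If the rows follow the level order listed earlier (an assumption), this means "grain", the rarest class, is never predicted.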