Data import - descriptive statistics
In [ ]:
#working directory
import os
os.chdir("C:/Users/ricco/Desktop/demo")
#load the data
import pandas
D = pandas.read_excel("iris_clustering.xlsx")
D.describe()
Out[ ]:
| | Sepal_Length | Sepal_Width | Petal_Length | Petal_Width |
|---|---|---|---|---|
| count | 150.000000 | 150.000000 | 150.000000 | 150.000000 |
| mean | 5.843333 | 3.057333 | 3.758000 | 1.199333 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
| min | 4.300000 | 2.000000 | 1.000000 | 0.100000 |
| 25% | 5.100000 | 2.800000 | 1.600000 | 0.300000 |
| 50% | 5.800000 | 3.000000 | 4.350000 | 1.300000 |
| 75% | 6.400000 | 3.300000 | 5.100000 | 1.800000 |
| max | 7.900000 | 4.400000 | 6.900000 | 2.500000 |
Analysis process
Pipeline definition
With three components: (1) missing-value imputation using the mean; (2) centering and scaling; (3) K-Means with 3 clusters.
In [ ]:
#libraries
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
#instantiate the Pipeline
wkf = Pipeline([
    ('impute', SimpleImputer()),
    ('std', StandardScaler()),
    ('km', KMeans(n_clusters=3, n_init=1, random_state=0))
])
#fit on the data
wkf.fit(D)
c:\Users\ricco\anaconda3\Lib\site-packages\sklearn\cluster\_kmeans.py:1382: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=1. warnings.warn(
Out[ ]:
Pipeline(steps=[('impute', SimpleImputer()), ('std', StandardScaler()), ('km', KMeans(n_clusters=3, n_init=1, random_state=0))])
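The warning emitted during fitting is cosmetic, but it can be silenced exactly as the message suggests: limit OpenMP to one thread before scikit-learn is first imported. A minimal sketch:

```python
#silence the Windows/MKL KMeans memory-leak warning reported above:
#set OMP_NUM_THREADS before the first sklearn import for it to take effect
import os
os.environ["OMP_NUM_THREADS"] = "1"
```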
Inspecting the results: cluster membership
In [ ]:
#cluster membership of each observation
grp = wkf.predict(D)
#cluster sizes
import numpy
numpy.unique(grp, return_counts=True)
Out[ ]:
(array([0, 1, 2]), array([50, 56, 44], dtype=int64))
Inspecting the results: interpretation
In [ ]:
#standardized coordinates
#(note: this calls the scaler alone, bypassing the imputation step)
Z = wkf.named_steps['std'].transform(D)
#as a data frame
DZ = pandas.DataFrame(Z, columns=D.columns)
#first 5 rows
DZ.head()
c:\Users\ricco\anaconda3\Lib\site-packages\sklearn\base.py:432: UserWarning: X has feature names, but StandardScaler was fitted without feature names warnings.warn(
Out[ ]:
| | Sepal_Length | Sepal_Width | Petal_Length | Petal_Width |
|---|---|---|---|---|
| 0 | -0.900681 | 1.019004 | -1.340227 | -1.315444 |
| 1 | -1.143017 | -0.131979 | -1.340227 | -1.315444 |
| 2 | -1.385353 | 0.328414 | -1.397064 | -1.315444 |
| 3 | -1.506521 | 0.098217 | -1.283389 | -1.315444 |
| 4 | -1.021849 | 1.249201 | -1.340227 | -1.315444 |
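Calling the StandardScaler alone, as above, skips the imputer and is what triggers the feature-names warning. When pipeline slicing is available (scikit-learn ≥ 0.24), every step except the final KMeans can be applied in one call with `wkf[:-1].transform(D)`. A sketch on a toy data frame standing in for `D` (the columns and values below are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

#toy stand-in for D, with one missing value to exercise the imputer
D = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0],
                  "y": [0.5, 1.5, 2.5, 3.5]})
wkf = Pipeline([("impute", SimpleImputer()),
                ("std", StandardScaler()),
                ("km", KMeans(n_clusters=2, n_init=1, random_state=0))])
wkf.fit(D)
#wkf[:-1] is the pipeline minus its last step: imputation + scaling
Z = wkf[:-1].transform(D)
```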
In [ ]:
#append the cluster memberships
DZ['groupes'] = grp
DZ.head()
Out[ ]:
| | Sepal_Length | Sepal_Width | Petal_Length | Petal_Width | groupes |
|---|---|---|---|---|---|
| 0 | -0.900681 | 1.019004 | -1.340227 | -1.315444 | 0 |
| 1 | -1.143017 | -0.131979 | -1.340227 | -1.315444 | 0 |
| 2 | -1.385353 | 0.328414 | -1.397064 | -1.315444 | 0 |
| 3 | -1.506521 | 0.098217 | -1.283389 | -1.315444 | 0 |
| 4 | -1.021849 | 1.249201 | -1.340227 | -1.315444 | 0 |
In [ ]:
#conditional means
gb = DZ.groupby('groupes')
moyennes = gb.mean()
#display
moyennes
Out[ ]:
| groupes | Sepal_Length | Sepal_Width | Petal_Length | Petal_Width |
|---|---|---|---|---|
| 0 | -1.014579 | 0.853263 | -1.304987 | -1.254893 |
| 1 | -0.011396 | -0.876008 | 0.377076 | 0.311153 |
| 2 | 1.167434 | 0.145303 | 1.003026 | 1.030002 |
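These means are on the standardized scale; to report them in the original measurement units, the fitted scaler's `inverse_transform` can be applied (with the pipeline above that would be `wkf.named_steps['std'].inverse_transform(moyennes)`). A self-contained sketch on invented toy data:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

#toy stand-in: two columns with different scales
X = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                  "b": [10.0, 20.0, 30.0, 40.0]})
scaler = StandardScaler().fit(X)
#hypothetical standardized group means (one row per cluster)
means_z = np.array([[-1.0, -1.0],
                    [ 1.0,  1.0]])
#map back to the original units: z * std + mean, column by column
means_orig = scaler.inverse_transform(means_z)
```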
In [ ]:
#polar chart / radar chart
#https://plotly.com/python/radar-chart/
import plotly.graph_objects as go
#create the figure
fig = go.Figure()
#one trace per cluster
for i in range(3):
    fig.add_trace(go.Scatterpolar(r=moyennes.iloc[i,:].values,
                                  theta=moyennes.columns,
                                  fill='toself',
                                  name=str(i)))
#layout settings
fig.update_layout(polar=dict(radialaxis=dict(visible=True)), showlegend=True)
#display
fig.show(renderer='notebook')
In [ ]:
#the kaleido package is required for static image export
#!pip install kaleido
In [ ]:
#save as a png file
fig.write_image('./deploy/radar.png')
Exporting the pipeline for deployment
In [ ]:
#serialization with the dill library, more capable than pickle
#cf. https://dill.readthedocs.io/en/latest/
import dill
#open the file in binary write mode and serialize the pipeline;
#the with-statement closes the file automatically
with open("./deploy/workflow.sav", "wb") as f:
    dill.dump(wkf, f)
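To check that the serialized pipeline can actually be restored on the deployment side, a `dill.load` round-trip can be sketched; here an in-memory buffer stands in for the `workflow.sav` file, and a plain dict stands in for the pipeline:

```python
import io
import dill

#stand-in object for the fitted pipeline
obj = {"steps": ["impute", "std", "km"]}
#serialize into an in-memory buffer instead of a file on disk
buf = io.BytesIO()
dill.dump(obj, buf)
#rewind and deserialize, as the deployment script would with the file
buf.seek(0)
restored = dill.load(buf)
```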