Loading the pre-trained model
In [8]:
# Python version
import sys
sys.version
Out[8]:
'3.10.15 | packaged by conda-forge | (main, Oct 16 2024, 01:15:49) [MSC v.1941 64 bit (AMD64)]'
In [9]:
# transformers version
import transformers
transformers.__version__
Out[9]:
'4.46.3'
In [10]:
# load the pre-trained model; note that the first call is slow (download)
# the files are then cached on disk, see /user/.cache
from transformers import pipeline
# the model is named explicitly here; if "model" is omitted, the PyTorch
# default for this task is facebook/bart-large-mnli rather than this one
# https://huggingface.co/FacebookAI/roberta-large-mnli
# this environment uses the PyTorch backend
# tensorflow/tf_keras would also work (with different warnings)
classifier = pipeline("zero-shot-classification", model='roberta-large-mnli')
Some weights of the model checkpoint at roberta-large-mnli were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
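The cache location mentioned in the comments can be checked from Python. The sketch below is a simplified guess at how the Hugging Face Hub cache directory is resolved, assuming the `HF_HUB_CACHE` and `HF_HOME` environment variables and the usual `~/.cache/huggingface/hub` fallback; `guess_hub_cache_dir` is a hypothetical helper, not a library function, and it ignores other overrides that the library may honor.

```python
import os
from pathlib import Path

def guess_hub_cache_dir():
    """Hypothetical helper: simplified guess at where Hub downloads are cached."""
    # An explicit hub cache path takes priority (assumption: HF_HUB_CACHE is set by the user).
    explicit = os.environ.get("HF_HUB_CACHE")
    if explicit:
        return Path(explicit).expanduser()
    # Otherwise the hub cache is assumed to live in a "hub" folder under HF_HOME.
    hf_home = os.environ.get("HF_HOME")
    if hf_home:
        return Path(hf_home).expanduser() / "hub"
    # Fallback: the usual per-user cache folder.
    return Path.home() / ".cache" / "huggingface" / "hub"

cache_dir = guess_hub_cache_dir()
print(cache_dir, "exists" if cache_dir.exists() else "not found")
```

If the folder exists after the first download, later calls to `pipeline(...)` with the same model name should reuse it instead of downloading again.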
A few zero-shot trials
In [11]:
# example from the roberta-large-mnli model page
sequence_to_classify = "one day I will see the world"
candidate_labels = ['travel', 'cooking', 'dancing']
classifier(sequence_to_classify, candidate_labels)
Out[11]:
{'sequence': 'one day I will see the world', 'labels': ['travel', 'cooking', 'dancing'], 'scores': [0.979964017868042, 0.010604999959468842, 0.009431022219359875]}
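How such scores arise can be sketched without loading the model. The pipeline turns each candidate label into an NLI hypothesis (the default template in transformers is "This example is {}."), asks the model whether the input text entails it, and in single-label mode applies a softmax over the entailment logits of all labels. The logits below are made-up values for illustration, not outputs of roberta-large-mnli, and `build_hypotheses` and `rank_labels` are hypothetical helpers.

```python
import math

def build_hypotheses(labels, template="This example is {}."):
    # One hypothesis sentence per candidate label.
    return [template.format(label) for label in labels]

def rank_labels(labels, entailment_logits):
    # Numerically stable softmax over the entailment logits of all labels.
    top = max(entailment_logits)
    exps = [math.exp(x - top) for x in entailment_logits]
    total = sum(exps)
    scores = [e / total for e in exps]
    # Sort labels by decreasing score, as in the pipeline output.
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    return {"labels": [l for l, _ in ranked], "scores": [s for _, s in ranked]}

labels = ['travel', 'cooking', 'dancing']
hypotheses = build_hypotheses(labels)
fake_logits = [3.0, -1.5, -1.6]  # assumption: invented values, one per label
result = rank_labels(labels, fake_logits)
print(hypotheses)
print(result)
```

Because of the softmax, the scores always sum to 1 across the candidate labels, so a high score only means "more entailed than the other candidates".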
In [12]:
# https://medium.com/@TheDataScience-ProF/zero-shot-classification-using-transformers-unlocking-the-power-of-ai-for-text-based-tasks-e5118398ef17
# Define your input text and possible labels (classes)
input_text = "Astronomy is the study of stars and planets."
possible_labels = ["Science", "History", "Sports"]
# Perform zero-shot classification
result = classifier(input_text, possible_labels)
# Print the result
print("Input Text:", input_text)
print("Predicted Class:", result["labels"][0])
print("Confidence Score:", result["scores"][0])
Input Text: Astronomy is the study of stars and planets.
Predicted Class: Science
Confidence Score: 0.8417725563049316
In [13]:
# gender bias - stereotype
sequence_to_classify = "The CEO had a strong handshake."
candidate_labels = ['male', 'female']
# note that the query can also be given some context through the hypothesis template
hypothesis_template = "This text speaks about a {} profession."
classifier(sequence_to_classify, candidate_labels, hypothesis_template=hypothesis_template)
Out[13]:
{'sequence': 'The CEO had a strong handshake.', 'labels': ['male', 'female'], 'scores': [0.8384836912155151, 0.16151630878448486]}
Behavior on IMDB Reviews
Loading and inspecting the dataset
In [14]:
# change the working directory
import os
os.chdir("C:/Users/ricco/Desktop/demo")
# load the data and display its structure
import pandas
df = pandas.read_excel("imbd_reviews_100.xlsx")
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   avis          100 non-null    object
 1   commentaires  100 non-null    object
dtypes: object(2)
memory usage: 1.7+ KB
In [15]:
# first rows
# we work on the raw text, with no preprocessing (cleaning, case normalization, etc.)
df.head()
Out[15]:
 | avis | commentaires |
---|---|---|
0 | like | Panic in the Streets is a fairly unknown littl... |
1 | like | I'm writing this 9 years after the final episo... |
2 | like | I find this movie the best movie I have ever s... |
3 | dislike | While Bondarchuk was by no means a young man w... |
4 | dislike | Oh, man! This thing scared the heck out of me ... |
In [16]:
# class distribution
df.avis.value_counts(normalize = True)
Out[16]:
avis
like       0.51
dislike    0.49
Name: proportion, dtype: float64
Classifying the reviews
In [17]:
# no train/test split is needed here since this is zero-shot
# candidate classes
labels_possibles = ['like', 'dislike']
# loop over the documents and collect the predicted class
# note: this is not very fast
predictions = []
for doc in df.commentaires:
    # apply the model
    one_pred = classifier(doc, labels_possibles)
    # keep the top-ranked class
    predictions.append(one_pred["labels"][0])
# convert to a numpy array
import numpy
predictions = numpy.asarray(predictions)
# display the first predictions
predictions[:10]
Out[17]:
array(['like', 'like', 'like', 'dislike', 'like', 'like', 'like', 'like', 'dislike', 'dislike'], dtype='<U7')
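The explicit loop can also be replaced by a single call, since the zero-shot pipeline accepts a list of sequences and returns one result dict per sequence. The sketch below wraps that in a hypothetical helper, `predict_labels`; `fake_classifier` is only a stand-in so the example runs without downloading the model, and in the notebook the real `classifier` object would be passed instead.

```python
def predict_labels(classifier, docs, candidate_labels):
    """Hypothetical helper: top-ranked label for each document."""
    results = classifier(list(docs), candidate_labels)
    # Some versions return a bare dict for a single input; normalize to a list.
    if isinstance(results, dict):
        results = [results]
    return [res["labels"][0] for res in results]

def fake_classifier(sequences, candidate_labels):
    # Stand-in for the real pipeline (assumption: two labels, keyword rule).
    out = []
    for text in sequences:
        first = candidate_labels[0] if "good" in text.lower() else candidate_labels[1]
        ordered = [first] + [l for l in candidate_labels if l != first]
        out.append({"sequence": text, "labels": ordered, "scores": [0.9, 0.1]})
    return out

demo_predictions = predict_labels(
    fake_classifier, ["A good film.", "Dull and slow."], ['like', 'dislike']
)
print(demo_predictions)
```

Passing a list mainly removes the Python-level loop; it is not guaranteed to be faster by itself. The pipeline call also accepts a `batch_size` argument, which may help on a GPU.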
In [18]:
# confusion matrix (rows: actual class, columns: predicted class)
pandas.crosstab(df.avis, predictions)
Out[18]:
col_0 | dislike | like |
---|---|---|
avis | ||
dislike | 35 | 14 |
like | 1 | 50 |
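Per-class recall and precision can be read off this matrix. A minimal sketch in plain Python, with the counts copied by hand from the table above (rows are actual classes, columns are predicted classes); the variable names are illustrative only.

```python
# Counts copied from the confusion matrix: confusion[actual][predicted]
confusion = {
    "dislike": {"dislike": 35, "like": 14},
    "like": {"dislike": 1, "like": 50},
}

classes = list(confusion)
metrics = {}
for c in classes:
    true_pos = confusion[c][c]
    n_actual = sum(confusion[c].values())                 # row total
    n_predicted = sum(confusion[r][c] for r in classes)   # column total
    metrics[c] = {
        "recall": true_pos / n_actual,
        "precision": true_pos / n_predicted,
    }

for c, m in metrics.items():
    print(f"{c}: recall={m['recall']:.3f}, precision={m['precision']:.3f}")
```

The errors are asymmetric: 14 of the 49 dislike reviews are predicted as like, while only 1 of the 51 like reviews is predicted as dislike.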
In [19]:
# i.e. expressed as accuracy
# quite good, although this task is in fact close to sentiment analysis
# the tool does not know that: it works "blind",
# on classes whose meaning it does not "understand"
# but whose association with the text (movie reviews) seems sensible
numpy.mean(df.avis.values == predictions)
Out[19]:
np.float64(0.85)
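To put this figure in perspective, it can be compared with a majority-class baseline and given a rough uncertainty range. The sketch below uses the class proportions and accuracy reported above and a normal-approximation (Wald) 95% interval, which is only a coarse guide with 100 reviews; it is an illustration, not part of the original analysis.

```python
import math

n = 100                  # number of reviews
accuracy = 0.85          # observed accuracy
baseline = 0.51          # always predicting the majority class ('like')

# Normal-approximation 95% interval for a proportion (rough with n = 100).
std_err = math.sqrt(accuracy * (1 - accuracy) / n)
low = accuracy - 1.96 * std_err
high = accuracy + 1.96 * std_err

print(f"baseline={baseline:.2f}, accuracy={accuracy:.2f}, "
      f"95% interval=[{low:.3f}, {high:.3f}]")
```

Even the lower end of this rough interval stays well above the 0.51 baseline.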
In [ ]:
# attempts to add context
# through the hypothesis_template parameter
# did not give particularly convincing results...
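One way to run such a comparison is to compute the accuracy once per template. The sketch below is an assumption about how this could be organized, not the code used for the attempts mentioned above: `accuracy_by_template` is a hypothetical helper, the example templates are untested suggestions, and `fake_classifier` is a stand-in so that the block runs without the model.

```python
def accuracy_by_template(classifier, docs, truth, candidate_labels, templates):
    """Hypothetical helper: accuracy of the top-ranked label for each template."""
    docs, truth = list(docs), list(truth)
    scores = {}
    for template in templates:
        hits = 0
        for doc, expected in zip(docs, truth):
            res = classifier(doc, candidate_labels, hypothesis_template=template)
            hits += res["labels"][0] == expected
        scores[template] = hits / len(docs)
    return scores

def fake_classifier(text, candidate_labels, hypothesis_template="This example is {}."):
    # Stand-in for the real pipeline: keyword rule, ignores the template.
    first = candidate_labels[0] if "good" in text.lower() else candidate_labels[1]
    ordered = [first] + [l for l in candidate_labels if l != first]
    return {"sequence": text, "labels": ordered, "scores": [0.9, 0.1]}

templates = [
    "This example is {}.",                      # default template in transformers
    "The author of this review would {} the movie.",  # assumption: untested wording
]
demo_docs = ["A good film.", "Dull and slow.", "Not good at all."]
demo_truth = ['like', 'dislike', 'dislike']
template_scores = accuracy_by_template(
    fake_classifier, demo_docs, demo_truth, ['like', 'dislike'], templates
)
print(template_scores)
```

In the notebook, `classifier`, `df.commentaires`, `df.avis` and `labels_possibles` would take the place of the demo inputs; each template costs a full pass over the 100 reviews.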