Versions, installation, and checks
In [1]:
# ignore warnings
# (FutureWarning of all kinds)
import warnings
# set all warnings to be ignored
warnings.filterwarnings('ignore')
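The blanket filter above silences every warning, including ones that may matter. A narrower variant, sketched here on the assumption that only `FutureWarning` needs hiding, leaves the other categories visible. The helper name `visible_warning_categories` is hypothetical.

```python
import warnings

def visible_warning_categories():
    """Emit one FutureWarning and one UserWarning; return the categories that get through."""
    with warnings.catch_warnings(record=True) as caught:
        # start from a known state: show every warning
        warnings.simplefilter("always")
        # then hide FutureWarning only (the most recently added filter takes precedence)
        warnings.filterwarnings("ignore", category=FutureWarning)
        warnings.warn("this API will change", FutureWarning)
        warnings.warn("something else happened", UserWarning)
    return [w.category.__name__ for w in caught]

visible = visible_warning_categories()
print(visible)
```

Outside of this demonstration, the single call `warnings.filterwarnings("ignore", category=FutureWarning)` would be enough.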
In [2]:
# Python version
import sys
sys.version
Out[2]:
'3.10.12 (main, Nov 6 2024, 20:22:13) [GCC 11.4.0]'
In [3]:
# pre-installed version of spaCy
# caution: major changes since 3.x (compared with 2.x)
# many tutorials on the web are obsolete
import spacy
spacy.__version__
Out[3]:
'3.7.5'
In [ ]:
# import the pre-trained pipeline into our environment
# https://spacy.io/models/en#en_core_web_md
# the download must be repeated if the environment is not persistent
#!python -m spacy download en_core_web_md
In [ ]:
#!pip uninstall pydantic -y
In [ ]:
#!pip install pydantic==1.10.11
In [9]:
# load the pre-trained pipeline
# starting model
nlp = spacy.load('en_core_web_md')
In [ ]:
# install the package in our environment
# must be repeated as well, since the environment is not persistent
#!pip install classy-classification
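Because the environment does not persist, it can help to check what is already available before re-running the commented-out install commands. This is a minimal sketch using only the standard library. The helper name `is_installed` is hypothetical, and `classy_classification` is assumed to be the import name of the classy-classification package (it matches the module path shown in the component output further down).

```python
import importlib.util

def is_installed(module_name: str) -> bool:
    """Return True if `module_name` can be found in the current environment."""
    return importlib.util.find_spec(module_name) is not None

# report the status of the three pieces this notebook relies on
for name in ("spacy", "en_core_web_md", "classy_classification"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

Anything reported as missing would need the corresponding install or download cell to be uncommented and run.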
First attempt, based on the online documentation.
In [10]:
# example data: 3 documents per class
data = {
    "furniture": ["This text is about chairs.",
                  "Couches, benches and televisions.",
                  "I really need to get a new sofa."],
    "kitchen": ["There also exist things like fridges.",
                "I hope to be getting a new stove today.",
                "Do you also have some ovens."]
}
# add a "classy_classification" pipeline component
# to perform few-shot classification
nlp.add_pipe("classy_classification",
    config={
        "data": data,
        "model": "spacy"
    }
)
Out[10]:
<classy_classification.classifiers.classy_spacy.ClassySpacyInternalFewShot at 0x7a3ea3e5d420>
In [11]:
# behaviour on a single item to classify
print(nlp("I am looking for kitchen appliances.")._.cats)
{'furniture': 0.473858160496116, 'kitchen': 0.5261418395038839}
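The `_.cats` attribute is a plain dictionary mapping each class to a score, so the predicted class is simply the key with the highest score. A small sketch using the scores printed above; the helper name `top_label` is hypothetical.

```python
# scores copied from the output above
cats = {"furniture": 0.473858160496116, "kitchen": 0.5261418395038839}

def top_label(scores: dict) -> str:
    """Return the class whose score is highest."""
    return max(scores, key=scores.get)

label = top_label(cats)
print(label, round(cats[label], 3))
```

Here the two scores are close, which suggests that six training sentences give the classifier only a weak signal for this input.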
Classifying IMDB reviews with classy
In [44]:
# a few sample reviews for each class
# cf. the training file of the tutorial
# https://www.youtube.com/watch?v=OVXLyQKNNaY ('quanteda' for R)
qqs_commentaires = {
"like": ["""I know, I know: it's childish. But I just love this type of movie.
A bird that suffered a lot of mishaps and still hasn't lost his faith
in humanity and his sense of humor. What's special about this film is
the fact that the main character is Paulie -the parrot- and he's not
used as a boost to some hotshot human actor. Furthermore I like the
storyline: Paulie tells his lifestory to a cleaner at the point he
hit rock bottom. (By the way: Jay Mohr's voice almost sounds like Joe
Pesci's!). And Cheech Marin of course, the man IS humor to me. Ever
since I saw Up in Smoke I have appreciated his naive way of performing,
making a simple situation a hilarious one.. can't help myself.""",
"""Such a Long Journey is a well crafted film, a good shoot, and a
showcase for some good performances. However, the story is such a jumble
of subplots and peculiar characters that it becomes a sort of Jack
of all plots and master of none. Also, Western audiences will likely
find the esoterics of the rather obscure Parsee culture a little much to
get their arms around in 1.7 hours. Recommended for those with an interest in India.""",
"""Atlantis was much better than I had anticipated. In some ways it had a better story
than come of the other films aimed at a higher age. Although this film did demand a
solid attention span at times. It was a great film for all ages. I noticed some of
the younger audience expected a comedy but got an adventure. I think everyone is
tired of an endless parade of extreme parodies. A lot of these kids have seen
nothing but parodies. After a short time everyone seemed very intensely watching
Atlantis."""],
"dislike": ["""OK..this movie could have been soooo good! All generations have been
exposed to Thunderbirds and have come to love it and this film had some
of the features one would look for in a good thunderbirds movie. The
craft themselves and Tracey Island were realistically transferred to the
big screen, whilst still keeping to the designs we fell in love with.
Sophia Miles was, simply, fantastic, as Lady P and Bill Paxton, whilst
not exactly who I envisaged Jeff Tracey being, was solid enough...but then
the adults were taken out of the equation and we were asked to believe
8 year olds could fly 200 tonne machines.<br /><br />It's not so much the
fact that the movie was centred around the children that made me feel
like Jonathon Frakes was slapping me with a wet fish and laughing at
my hard earned money spent on the film, it was the fact that Alan Tracey
was so obnoxious in the film and that he seemed to be as able to fly
the machines as well as his brothers...who were at least 19/20. Seriously,
these are some pretty damn simple machines to use if this is the case.<br />
<br />The film didn't seem to know whether it wanted to be serious or farcical.
It tried to pay homage whilst satirising and it just generally fell flat on its
face. 3/10 (2 for the machines, 1 for Lady P)""",
"""BE WARNED. This movie is such a mess. It's a catastrophe. Don't waste your
time with this one. I warned you!<br /><br />The acting, story, dialogue, music...
basically everything is so over the top, it's absolutely annoying and ridiculous.
It made me want to throw up (if the dialogue/acting/story wasn't doing it,
it's everyone being shot crooked). You'll feel like you're watching a comedy.
The problem is, the parts that are supposedly funny isn't even funny.
The acting, story, cinematography, you can feel everything is just trying WAAAAY too
hard -- but it never succeeds. Practically every shot is canted, but so what?!
This movie just feels like a student film. No wonder they shot this in HD because
it would be a waste to spend more money to shoot this on film.<br /><br />If you're
easily amused or like poor acting, writing, editing, directing, full of clichés,
everything that's forced in your face, oh and did I mention poor acting?
(well, actually, it's not all the actor's fault - it's the director!) then I guess
you'll like this movie.<br /><br />I had to watch this for a class. I would have
turned it off right away if I could. If you still can't tell by now, I HATED this
movie. It made me want to throw up and get my time back... at least I didn't have
to pay for this garbage.<br /><br />Jeff Goldblum, you know... the guy from Jurassic
Park/Independence Day, is in this movie but he sure went downhill from then --
accepting roles for movies like this catastrophe.""",
"""A doctor who is trying to complete the medical dream of transplantation is experimenting secretly
on corpses from the hospital with varying success. His final best chance comes when he lovingly
wraps his girlfriend's head in his jacket as he rescues it from a burning vehicle.<br /><br />I
was looking for cheese and with this premise I believed I found it. It has everything everything
that bad movie hunters look for - chest and brain surgery with the surgeons leaving with pristine
white scrubs, unique camera angles (I always love watching the rear passenger wheels of cars),
cheesy clarinet stripper music, and one of the longest death scenes in movie history. But
unfortunately these so-bad-they-are-good moments can't overcome the too-bad-they-stink
stretches.<br /><br />Jan in the Pan annoyed me, with her droning monologues in a hoarse whisper,
the somewhat less than evil laughter, and the fact she was kept alive with some Columbian home
brew coffee and 2 DD batteries.<br /><br />I couldn't even entertain myself with Dr Bill's horrid
overacting and moral self righteousness. Usually such ham makes these movies a must see in my
opinion, in this case I was bored with it.<br /><br />The best part of the movie in my opinion was
the 1960's version of "body shopping" and I even found myself nodding off during that.<br /><br/>
Don't spend money on this one - there are better bad movies out there to entertain your sick
sense of humor."""]
}
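Several of the sample reviews still contain HTML line breaks (`<br />`). As an optional preprocessing step, they could be replaced by spaces before the text is given to the model. This is only a sketch: the helper name `strip_br` is hypothetical, and the results reported in this notebook were obtained on the raw text, so applying it may change the scores.

```python
import re

def strip_br(text: str) -> str:
    """Replace HTML line breaks with spaces and collapse repeated whitespace."""
    without_tags = re.sub(r"<br\s*/?>", " ", text)
    return re.sub(r"\s+", " ", without_tags).strip()

sample = "I warned you!<br /><br />The acting, story, dialogue, music..."
cleaned = strip_br(sample)
print(cleaned)
```

The same function could be mapped over every review in the dictionary, and over the reviews to classify, so that training and test texts are treated alike.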
In [45]:
# build our model
# load the reference model
modele_classy = spacy.load('en_core_web_md')
# configure it for our problem
# with the labelled examples
modele_classy.add_pipe("classy_classification",
    config={
        "data": qqs_commentaires,
        "model": "spacy"
    }
)
Out[45]:
<classy_classification.classifiers.classy_spacy.ClassySpacyInternalFewShot at 0x7a3e76c9fa60>
In [25]:
# load the reviews to classify
import pandas
df = pandas.read_excel("/content/imbd_reviews_100.xlsx")
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   avis          100 non-null    object
 1   commentaires  100 non-null    object
dtypes: object(2)
memory usage: 1.7+ KB
In [26]:
# first rows
df.head()
Out[26]:
 | avis | commentaires |
---|---|---|
0 | like | Panic in the Streets is a fairly unknown littl... |
1 | like | I'm writing this 9 years after the final episo... |
2 | like | I find this movie the best movie I have ever s... |
3 | dislike | While Bondarchuk was by no means a young man w... |
4 | dislike | Oh, man! This thing scared the heck out of me ... |
In [27]:
# class distribution
df.avis.value_counts(normalize=True)
Out[27]:
avis | |
---|---|
like | 0.51 |
dislike | 0.49 |
In [28]:
# document no. 0
df.iloc[0,:]
Out[28]:
0 | |
---|---|
avis | like |
commentaires | Panic in the Streets is a fairly unknown littl... |
In [46]:
# classify document no. 0
modele_classy(df.commentaires.iloc[0])._.cats
Out[46]:
{'dislike': 0.1685206924882393, 'like': 0.8314793075117607}
In [47]:
# classify the movie reviews
predictions = []
# for each document
for doc in df.commentaires:
    # apply the model
    one_pred = modele_classy(doc)._.cats
    # "like" if its probability exceeds 0.5
    if one_pred["like"] > 0.5:
        predictions.append("like")
    else:
        predictions.append("dislike")
# convert to a NumPy array
import numpy
predictions = numpy.array(predictions)
# distribution of the predictions
numpy.unique(predictions, return_counts=True)
Out[47]:
(array(['dislike', 'like'], dtype='<U7'), array([60, 40]))
In [48]:
# confusion matrix
pandas.crosstab(df.avis, predictions)
Out[48]:
col_0 | dislike | like |
---|---|---|
avis | ||
dislike | 36 | 13 |
like | 24 | 27 |
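The confusion matrix can be read further by computing precision and recall for each class. The sketch below copies the four counts from the table above (rows are the actual classes, columns the predicted ones); the helper name `precision_recall` is hypothetical.

```python
# counts copied from the confusion matrix above: confusion[actual][predicted]
confusion = {
    "dislike": {"dislike": 36, "like": 13},
    "like": {"dislike": 24, "like": 27},
}

def precision_recall(matrix: dict, label: str) -> tuple:
    """Return (precision, recall) for `label` from a nested actual/predicted count table."""
    true_pos = matrix[label][label]
    predicted_total = sum(row[label] for row in matrix.values())  # column total
    actual_total = sum(matrix[label].values())                    # row total
    return true_pos / predicted_total, true_pos / actual_total

for label in confusion:
    precision, recall = precision_recall(confusion, label)
    print(f"{label}: precision={precision:.3f} recall={recall:.3f}")
```

The classifier leans toward "dislike": it predicts that class 60 times for 49 actual cases, which gives a high recall for "dislike" (about 0.735) and a low one for "like" (about 0.529).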
In [49]:
# accuracy
numpy.mean(df.avis.values == predictions)
Out[49]:
0.63
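To put this accuracy in context, it can be compared with a majority-class baseline (always predicting "like", the more frequent class at 0.51) and given a rough uncertainty range. The sketch below uses a normal approximation to the binomial, which is an assumption added here and not part of the original analysis.

```python
import math

n_docs = 100           # number of classified reviews
accuracy = 0.63        # value computed above
majority_share = 0.51  # share of "like" in the class distribution

# accuracy obtained by always predicting the most frequent class
baseline = majority_share

# approximate 95% interval for the accuracy (normal approximation)
std_err = math.sqrt(accuracy * (1 - accuracy) / n_docs)
low = accuracy - 1.96 * std_err
high = accuracy + 1.96 * std_err

print(f"baseline={baseline:.2f} accuracy={accuracy:.2f} interval=({low:.3f}, {high:.3f})")
```

The interval lies above the baseline but is wide, so with only 100 reviews any comparison between packages should be read with caution.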
It is now up to you to check the performance obtained with the "spacy-setfit" package.