Versions, installations, checks

In [1]:
# ignore warnings
# (FutureWarning and similar)
import warnings

# Silence all warnings
warnings.filterwarnings('ignore')
In [2]:
# Python version
import sys
sys.version
Out[2]:
'3.10.12 (main, Nov  6 2024, 20:22:13) [GCC 11.4.0]'
In [3]:
# pre-installed version of spaCy
# note: major changes since 3.x (compared with 2.x);
# many tutorials on the web are obsolete
import spacy
spacy.__version__
Out[3]:
'3.7.5'
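
Since the 2.x and 3.x APIs differ, it can help to check versions programmatically before running the rest of the notebook. The following is a minimal sketch using only the standard library; `installed_version` and `major_minor` are hypothetical helper names, and the parser assumes a plain "major.minor.patch" version string.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def major_minor(version_string: str) -> Tuple[int, int]:
    """Parse the leading 'major.minor' part of a plain version string."""
    major, minor = version_string.split(".")[:2]
    return int(major), int(minor)


# Version string reported by the notebook: it must be at least 3.0
assert major_minor("3.7.5") >= (3, 0)

# Report what is installed in the current environment (None if missing)
for name in ("spacy", "classy-classification", "pydantic"):
    print(name, installed_version(name))
```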
In [ ]:
# download the pre-trained pipeline into our environment
# https://spacy.io/models/en#en_core_web_md
# must be repeated if the environment is not persistent
#!python -m spacy download en_core_web_md
In [ ]:
#!pip uninstall pydantic -y
In [ ]:
#!pip install pydantic==1.10.11
In [9]:
# load the pre-trained pipeline
# (base model)
nlp = spacy.load('en_core_web_md')
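
In a non-persistent environment, the load above fails until the model has been downloaded again. One possible way to combine the two steps is sketched below. `load_or_download` is a hypothetical helper, not part of spaCy; it assumes the loader signals a missing model by raising `OSError`, and it takes the loader and downloader as arguments so that it can be tried without spaCy installed.

```python
from typing import Any, Callable


def load_or_download(
    name: str,
    load: Callable[[str], Any],
    download: Callable[[str], Any],
) -> Any:
    """Try load(name); on OSError, call download(name) and retry once."""
    try:
        return load(name)
    except OSError:
        # Assumption: a missing model surfaces as OSError
        download(name)
        return load(name)


# Intended use in this notebook (assumes spaCy is installed):
#   import spacy
#   from spacy.cli import download
#   nlp = load_or_download("en_core_web_md", spacy.load, download)
```

Depending on the environment, a runtime restart may still be needed before a freshly downloaded model can be loaded.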
In [ ]:
# install the package into our environment
# must also be repeated, since the environment is not persistent
#!pip install classy-classification

First attempt, based on the online documentation.

In [10]:
# example data: 3 documents per class
data = {
    "furniture": ["This text is about chairs.",
               "Couches, benches and televisions.",
               "I really need to get a new sofa."],
    "kitchen": ["There also exist things like fridges.",
                "I hope to be getting a new stove today.",
                "Do you also have some ovens."]
}

# add a "classy_classification" component to the pipeline
# to perform few-shot classification
nlp.add_pipe("classy_classification",
    config={
        "data": data,
        "model": "spacy"
    }
)
Out[10]:
<classy_classification.classifiers.classy_spacy.ClassySpacyInternalFewShot at 0x7a3ea3e5d420>
In [11]:
# behaviour on a single text to classify
print(nlp("I am looking for kitchen appliances.")._.cats)
{'furniture': 0.473858160496116, 'kitchen': 0.5261418395038839}
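
The output is a dictionary mapping each class to a score. A minimal sketch of turning it into a single predicted label, using the values printed above (the variable names are illustrative):

```python
# Scores copied from the output above
cats = {"furniture": 0.473858160496116, "kitchen": 0.5261418395038839}

# The predicted class is the key with the highest score
predicted = max(cats, key=cats.get)
print(predicted, round(cats[predicted], 3))

# In this output the two scores sum to 1 (up to floating-point rounding)
assert abs(sum(cats.values()) - 1.0) < 1e-9
```

Taking the maximum generalises to more than two classes, whereas a fixed 0.5 threshold only makes sense in the binary case.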

Classifying IMDB reviews with classy

In [44]:
# a few sample reviews for each class
# cf. the training file of the tutorial
# https://www.youtube.com/watch?v=OVXLyQKNNaY ('quanteda' for R)
qqs_commentaires = {
    "like": ["""I know, I know: it's childish. But I just love this type of movie.
             A bird that suffered a lot of mishaps and still hasn't lost his faith
             in humanity and his sense of humor. What's special about this film is
             the fact that the main character is Paulie -the parrot- and he's not
             used as a boost to some hotshot human actor. Furthermore I like the
             storyline: Paulie tells his lifestory to a cleaner at the point he
             hit rock bottom. (By the way: Jay Mohr's voice almost sounds like Joe
             Pesci's!). And Cheech Marin of course, the man IS humor to me. Ever
             since I saw Up in Smoke I have appreciated his naive way of performing,
             making a simple situation a hilarious one.. can't help myself.""",
             """Such a Long Journey is a well crafted film, a good shoot, and a
             showcase for some good performances. However, the story is such a jumble
             of subplots and peculiar characters that it becomes a sort of Jack
             of all plots and master of none. Also, Western audiences will likely
             find the esoterics of the rather obscure Parsee culture a little much to
             get their arms around in 1.7 hours. Recommended for those with an interest in India.""",
             """Atlantis was much better than I had anticipated. In some ways it had a better story
             than come of the other films aimed at a higher age. Although this film did demand a
             solid attention span at times. It was a great film for all ages. I noticed some of
             the younger audience expected a comedy but got an adventure. I think everyone is
             tired of an endless parade of extreme parodies. A lot of these kids have seen
             nothing but parodies. After a short time everyone seemed very intensely watching
             Atlantis."""],
    "dislike": ["""OK..this movie could have been soooo good! All generations have been
                exposed to Thunderbirds and have come to love it and this film had some
                of the features one would look for in a good thunderbirds movie. The
                craft themselves and Tracey Island were realistically transferred to the
                big screen, whilst still keeping to the designs we fell in love with.
                Sophia Miles was, simply, fantastic, as Lady P and Bill Paxton, whilst
                not exactly who I envisaged Jeff Tracey being, was solid enough...but then
                the adults were taken out of the equation and we were asked to believe
                8 year olds could fly 200 tonne machines.<br /><br />It's not so much the
                fact that the movie was centred around the children that made me feel
                like Jonathon Frakes was slapping me with a wet fish and laughing at
                my hard earned money spent on the film, it was the fact that Alan Tracey
                was so obnoxious in the film and that he seemed to be as able to fly
                the machines as well as his brothers...who were at least 19/20. Seriously,
                these are some pretty damn simple machines to use if this is the case.<br />
                <br />The film didn't seem to know whether it wanted to be serious or farcical.
                It tried to pay homage whilst satirising and it just generally fell flat on its
                face. 3/10 (2 for the machines, 1 for Lady P)""",
                """BE WARNED. This movie is such a mess. It's a catastrophe. Don't waste your
                time with this one. I warned you!<br /><br />The acting, story, dialogue, music...
                basically everything is so over the top, it's absolutely annoying and ridiculous.
                It made me want to throw up (if the dialogue/acting/story wasn't doing it,
                it's everyone being shot crooked). You'll feel like you're watching a comedy.
                The problem is, the parts that are supposedly funny isn't even funny.
                The acting, story, cinematography, you can feel everything is just trying WAAAAY too
                hard -- but it never succeeds. Practically every shot is canted, but so what?!
                This movie just feels like a student film. No wonder they shot this in HD because
                it would be a waste to spend more money to shoot this on film.<br /><br />If you're
                easily amused or like poor acting, writing, editing, directing, full of clichés,
                everything that's forced in your face, oh and did I mention poor acting?
                (well, actually, it's not all the actor's fault - it's the director!) then I guess
                you'll like this movie.<br /><br />I had to watch this for a class. I would have
                turned it off right away if I could. If you still can't tell by now, I HATED this
                movie. It made me want to throw up and get my time back... at least I didn't have
                to pay for this garbage.<br /><br />Jeff Goldblum, you know... the guy from Jurassic
                Park/Independence Day, is in this movie but he sure went downhill from then --
                accepting roles for movies like this catastrophe.""",
                """A doctor who is trying to complete the medical dream of transplantation is experimenting secretly
                on corpses from the hospital with varying success. His final best chance comes when he lovingly
                wraps his girlfriend's head in his jacket as he rescues it from a burning vehicle.<br /><br />I
                was looking for cheese and with this premise I believed I found it. It has everything everything
                that bad movie hunters look for - chest and brain surgery with the surgeons leaving with pristine
                white scrubs, unique camera angles (I always love watching the rear passenger wheels of cars),
                cheesy clarinet stripper music, and one of the longest death scenes in movie history. But
                unfortunately these so-bad-they-are-good moments can't overcome the too-bad-they-stink
                stretches.<br /><br />Jan in the Pan annoyed me, with her droning monologues in a hoarse whisper,
                the somewhat less than evil laughter, and the fact she was kept alive with some Columbian home
                brew coffee and 2 DD batteries.<br /><br />I couldn't even entertain myself with Dr Bill's horrid
                overacting and moral self righteousness. Usually such ham makes these movies a must see in my
                opinion, in this case I was bored with it.<br /><br />The best part of the movie in my opinion was
                the 1960's version of "body shopping" and I even found myself nodding off during that.<br /><br/>
                Don't spend money on this one - there are better bad movies out there to entertain your sick
                sense of humor."""]
}
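
The sample reviews still contain HTML line breaks (`<br />`) and the indentation of the triple-quoted strings. The notebook passes them to the model as they are. As an optional variation, they could be normalised first; the sketch below shows one way to do so. `clean_review` is a hypothetical helper, and whether this changes the results has not been checked here.

```python
import re


def clean_review(text: str) -> str:
    """Replace <br> tags with spaces and collapse runs of whitespace."""
    no_breaks = re.sub(r"<br\s*/?>", " ", text, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", no_breaks).strip()


# Short excerpt in the same shape as the examples above
sample = """BE WARNED. This movie is such a mess.
                I warned you!<br /><br />The acting, story, dialogue, music..."""
print(clean_review(sample))

# Applied to the whole dictionary, this would read:
#   cleaned = {label: [clean_review(t) for t in texts]
#              for label, texts in qqs_commentaires.items()}
```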
In [45]:
# build our model
# load the reference model
modele_classy = spacy.load('en_core_web_md')

# configure it for our problem
# with the labelled examples
modele_classy.add_pipe("classy_classification",
    config={
        "data": qqs_commentaires,
        "model": "spacy"
    }
)
Out[45]:
<classy_classification.classifiers.classy_spacy.ClassySpacyInternalFewShot at 0x7a3e76c9fa60>
In [25]:
# load the reviews to classify
import pandas
df = pandas.read_excel("/content/imbd_reviews_100.xlsx")
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   avis          100 non-null    object
 1   commentaires  100 non-null    object
dtypes: object(2)
memory usage: 1.7+ KB
In [26]:
# first rows
df.head()
Out[26]:
      avis                                       commentaires
0     like  Panic in the Streets is a fairly unknown littl...
1     like  I'm writing this 9 years after the final episo...
2     like  I find this movie the best movie I have ever s...
3  dislike  While Bondarchuk was by no means a young man w...
4  dislike  Oh, man! This thing scared the heck out of me ...
In [27]:
# class distribution
df.avis.value_counts(normalize=True)
Out[27]:
avis
like       0.51
dislike    0.49

In [28]:
# document no. 0
df.iloc[0,:]
Out[28]:
                                                              0
avis                                                       like
commentaires  Panic in the Streets is a fairly unknown littl...

In [46]:
# classify document no. 0
modele_classy(df.commentaires.iloc[0])._.cats
Out[46]:
{'dislike': 0.1685206924882393, 'like': 0.8314793075117607}
In [47]:
# classify the movie reviews
predictions = []

# for each document
for doc in df.commentaires:
    # apply the model
    one_pred = modele_classy(doc)._.cats
    # "like" if its probability exceeds 0.5
    if one_pred["like"] > 0.5:
        predictions.append("like")
    else:
        predictions.append("dislike")

# convert to a NumPy array
import numpy
predictions = numpy.array(predictions)

# distribution of the predictions
numpy.unique(predictions, return_counts=True)
Out[47]:
(array(['dislike', 'like'], dtype='<U7'), array([60, 40]))
In [48]:
# confusion matrix (rows: actual labels, columns: predictions)
pandas.crosstab(df.avis, predictions)
Out[48]:
col_0    dislike  like
avis
dislike       36    13
like          24    27
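
The matrix shows that the errors are not symmetric: 24 "like" reviews are predicted as "dislike", against 13 in the other direction. Per-class precision and recall make this explicit. Here is a minimal sketch using the counts above, with illustrative variable names:

```python
# Counts copied from the confusion matrix above: confusion[actual][predicted]
confusion = {
    "dislike": {"dislike": 36, "like": 13},
    "like": {"dislike": 24, "like": 27},
}

labels = list(confusion)
metrics = {}

for c in labels:
    true_pos = confusion[c][c]
    predicted_c = sum(confusion[a][c] for a in labels)  # column total
    actual_c = sum(confusion[c].values())               # row total
    precision = true_pos / predicted_c
    recall = true_pos / actual_c
    metrics[c] = (precision, recall)
    print(f"{c}: precision={precision:.3f} recall={recall:.3f}")
```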
In [49]:
# accuracy
numpy.mean(df.avis.values == predictions)
Out[49]:
0.63
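
With only 100 reviews, this accuracy carries noticeable uncertainty. The sketch below computes a rough 95% interval with the normal approximation for a proportion, and compares it with the majority-class baseline of 0.51 from the class distribution. The approximation is an assumption made for illustration, not part of the original analysis.

```python
import math

accuracy = 0.63   # from the output above
n = 100           # number of reviews in the file
majority = 0.51   # share of the most frequent class ("like")

# Normal-approximation standard error of a proportion
std_err = math.sqrt(accuracy * (1 - accuracy) / n)
low = accuracy - 1.96 * std_err
high = accuracy + 1.96 * std_err

print(f"95% interval: [{low:.3f}, {high:.3f}] vs majority baseline {majority}")
```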

Over to you: see how the "spacy-setfit" package performs.

https://github.com/davidberenstein1957/spacy-setfit