#nltk
import nltk
print(nltk.__version__)
3.6.5
#gensim
import gensim
print(gensim.__version__)
4.0.1
#change the default working directory
import os
os.chdir("C:/Users/ricco/Desktop/demo")
#import the data file
import pandas
D = pandas.read_excel("imdb_reviews_100.xlsx")
D.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   ID            200 non-null    int64
 1   label         200 non-null    object
 2   commentaires  200 non-null    object
dtypes: int64(1), object(2)
memory usage: 4.8+ KB
#first rows
D.head()
| | ID | label | commentaires |
|---|---|---|---|
| 0 | 1 | neg | This guy has no idea of cinema. Okay, it seems... |
| 1 | 2 | neg | This movie was extremely depressing. The cha... |
| 2 | 3 | neg | Now, I'm one to watch movies that got poor rev... |
| 3 | 4 | neg | One hour, eight minutes and twelve seconds int... |
| 4 | 5 | neg | Another FRIDAY THE 13TH ripoff, even featuring... |
#retrieve the comments as a list
corpus = D['commentaires'].tolist()
print(corpus[0])
This guy has no idea of cinema. Okay, it seems he made a few interestig theater shows in his youth, and about two acceptable movies that had success more of political reasons cause they tricked the communist censorship. This all is very good, but look carefully: HE DOES NOT KNOW HIS JOB! The scenes are unbalanced, without proper start and and, with a disordered content and full of emptiness. He has nothing to say about the subject, so he over-licitates with violence, nakedness and gutter language. How is it possible to keep alive such a rotten corpse who never understood anything of cinematographic profession and art? Why don't they let him succumb in piece?
#convert to lowercase
corpus = [doc.lower() for doc in corpus]
print(corpus[0])
this guy has no idea of cinema. okay, it seems he made a few interestig theater shows in his youth, and about two acceptable movies that had success more of political reasons cause they tricked the communist censorship. this all is very good, but look carefully: he does not know his job! the scenes are unbalanced, without proper start and and, with a disordered content and full of emptiness. he has nothing to say about the subject, so he over-licitates with violence, nakedness and gutter language. how is it possible to keep alive such a rotten corpse who never understood anything of cinematographic profession and art? why don't they let him succumb in piece?
#list of punctuation characters
import string
ponctuations = list(string.punctuation)
print(ponctuations)
['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
#remove punctuation
corpus = ["".join([char for char in list(doc) if not (char in ponctuations)]) for doc in corpus]
print(corpus[0])
this guy has no idea of cinema okay it seems he made a few interestig theater shows in his youth and about two acceptable movies that had success more of political reasons cause they tricked the communist censorship this all is very good but look carefully he does not know his job the scenes are unbalanced without proper start and and with a disordered content and full of emptiness he has nothing to say about the subject so he overlicitates with violence nakedness and gutter language how is it possible to keep alive such a rotten corpse who never understood anything of cinematographic profession and art why dont they let him succumb in piece
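#note: str.translate does the same character filtering in one pass -- a
#minimal equivalent of the comprehension above (sketch only; corpus_alt is a
#hypothetical name, the step is not re-applied in this walkthrough)
table = str.maketrans('', '', string.punctuation)
corpus_alt = [doc.translate(table) for doc in corpus]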
#the 'punkt' tokenization model is required
#download it online if not already done
import nltk
#nltk.download()
nltk.download('punkt')
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ricco\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
True
#turn the corpus into a list of lists (the documents)
#via tokenization
from nltk.tokenize import word_tokenize
corpus_tk = [word_tokenize(doc) for doc in corpus]
#before
print(corpus[0])
#after tokenization
print('\n')
print(corpus_tk[0])
this guy has no idea of cinema okay it seems he made a few interestig theater shows in his youth and about two acceptable movies that had success more of political reasons cause they tricked the communist censorship this all is very good but look carefully he does not know his job the scenes are unbalanced without proper start and and with a disordered content and full of emptiness he has nothing to say about the subject so he overlicitates with violence nakedness and gutter language how is it possible to keep alive such a rotten corpse who never understood anything of cinematographic profession and art why dont they let him succumb in piece

['this', 'guy', 'has', 'no', 'idea', 'of', 'cinema', 'okay', 'it', 'seems', 'he', 'made', 'a', 'few', 'interestig', 'theater', 'shows', 'in', 'his', 'youth', 'and', 'about', 'two', 'acceptable', 'movies', 'that', 'had', 'success', 'more', 'of', 'political', 'reasons', 'cause', 'they', 'tricked', 'the', 'communist', 'censorship', 'this', 'all', 'is', 'very', 'good', 'but', 'look', 'carefully', 'he', 'does', 'not', 'know', 'his', 'job', 'the', 'scenes', 'are', 'unbalanced', 'without', 'proper', 'start', 'and', 'and', 'with', 'a', 'disordered', 'content', 'and', 'full', 'of', 'emptiness', 'he', 'has', 'nothing', 'to', 'say', 'about', 'the', 'subject', 'so', 'he', 'overlicitates', 'with', 'violence', 'nakedness', 'and', 'gutter', 'language', 'how', 'is', 'it', 'possible', 'to', 'keep', 'alive', 'such', 'a', 'rotten', 'corpse', 'who', 'never', 'understood', 'anything', 'of', 'cinematographic', 'profession', 'and', 'art', 'why', 'dont', 'they', 'let', 'him', 'succumb', 'in', 'piece']
#import the library for lemmatization
#if not already done
#nltk.download('wordnet')
#lemmatization
from nltk.stem import WordNetLemmatizer
lem = WordNetLemmatizer()
corpus_lm = [[lem.lemmatize(mot) for mot in doc] for doc in corpus_tk]
print(corpus_lm[0])
['this', 'guy', 'ha', 'no', 'idea', 'of', 'cinema', 'okay', 'it', 'seems', 'he', 'made', 'a', 'few', 'interestig', 'theater', 'show', 'in', 'his', 'youth', 'and', 'about', 'two', 'acceptable', 'movie', 'that', 'had', 'success', 'more', 'of', 'political', 'reason', 'cause', 'they', 'tricked', 'the', 'communist', 'censorship', 'this', 'all', 'is', 'very', 'good', 'but', 'look', 'carefully', 'he', 'doe', 'not', 'know', 'his', 'job', 'the', 'scene', 'are', 'unbalanced', 'without', 'proper', 'start', 'and', 'and', 'with', 'a', 'disordered', 'content', 'and', 'full', 'of', 'emptiness', 'he', 'ha', 'nothing', 'to', 'say', 'about', 'the', 'subject', 'so', 'he', 'overlicitates', 'with', 'violence', 'nakedness', 'and', 'gutter', 'language', 'how', 'is', 'it', 'possible', 'to', 'keep', 'alive', 'such', 'a', 'rotten', 'corpse', 'who', 'never', 'understood', 'anything', 'of', 'cinematographic', 'profession', 'and', 'art', 'why', 'dont', 'they', 'let', 'him', 'succumb', 'in', 'piece']
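#the odd lemmas 'ha' and 'doe' appear because WordNetLemmatizer defaults to
#the noun part of speech, so 'has' and 'does' are read as plural nouns --
#a quick check (illustrative only; the POS-aware variant is not used here)
print(lem.lemmatize('has'))           # 'ha'   - treated as a noun
print(lem.lemmatize('has', pos='v'))  # 'have' - treated as a verb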
#import the stopwords library
#if not already done
#nltk.download('stopwords')
#load the stopwords
from nltk.corpus import stopwords
mots_vides = stopwords.words('english')
print(mots_vides)
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
#remove the stopwords
corpus_sw = [[mot for mot in doc if not (mot in mots_vides)] for doc in corpus_lm]
#check - original version
print(corpus_lm[0])
#without the stopwords
print('\n')
print(corpus_sw[0])
['this', 'guy', 'ha', 'no', 'idea', 'of', 'cinema', 'okay', 'it', 'seems', 'he', 'made', 'a', 'few', 'interestig', 'theater', 'show', 'in', 'his', 'youth', 'and', 'about', 'two', 'acceptable', 'movie', 'that', 'had', 'success', 'more', 'of', 'political', 'reason', 'cause', 'they', 'tricked', 'the', 'communist', 'censorship', 'this', 'all', 'is', 'very', 'good', 'but', 'look', 'carefully', 'he', 'doe', 'not', 'know', 'his', 'job', 'the', 'scene', 'are', 'unbalanced', 'without', 'proper', 'start', 'and', 'and', 'with', 'a', 'disordered', 'content', 'and', 'full', 'of', 'emptiness', 'he', 'ha', 'nothing', 'to', 'say', 'about', 'the', 'subject', 'so', 'he', 'overlicitates', 'with', 'violence', 'nakedness', 'and', 'gutter', 'language', 'how', 'is', 'it', 'possible', 'to', 'keep', 'alive', 'such', 'a', 'rotten', 'corpse', 'who', 'never', 'understood', 'anything', 'of', 'cinematographic', 'profession', 'and', 'art', 'why', 'dont', 'they', 'let', 'him', 'succumb', 'in', 'piece']

['guy', 'ha', 'idea', 'cinema', 'okay', 'seems', 'made', 'interestig', 'theater', 'show', 'youth', 'two', 'acceptable', 'movie', 'success', 'political', 'reason', 'cause', 'tricked', 'communist', 'censorship', 'good', 'look', 'carefully', 'doe', 'know', 'job', 'scene', 'unbalanced', 'without', 'proper', 'start', 'disordered', 'content', 'full', 'emptiness', 'ha', 'nothing', 'say', 'subject', 'overlicitates', 'violence', 'nakedness', 'gutter', 'language', 'possible', 'keep', 'alive', 'rotten', 'corpse', 'never', 'understood', 'anything', 'cinematographic', 'profession', 'art', 'dont', 'let', 'succumb', 'piece']
#remove tokens with fewer than 3 letters
corpus_sw = [[mot for mot in doc if len(mot) >= 3] for doc in corpus_sw]
print(corpus_sw[0])
['guy', 'idea', 'cinema', 'okay', 'seems', 'made', 'interestig', 'theater', 'show', 'youth', 'two', 'acceptable', 'movie', 'success', 'political', 'reason', 'cause', 'tricked', 'communist', 'censorship', 'good', 'look', 'carefully', 'doe', 'know', 'job', 'scene', 'unbalanced', 'without', 'proper', 'start', 'disordered', 'content', 'full', 'emptiness', 'nothing', 'say', 'subject', 'overlicitates', 'violence', 'nakedness', 'gutter', 'language', 'possible', 'keep', 'alive', 'rotten', 'corpse', 'never', 'understood', 'anything', 'cinematographic', 'profession', 'art', 'dont', 'let', 'succumb', 'piece']
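#note that stripping apostrophes before stopword removal turned "don't" into
#'dont', which is not in the NLTK stopword list, so such artifacts survive
#(e.g. 'dont' above); a possible extra filter -- a sketch with a hypothetical
#list, not applied here so that the outputs below stay exactly as shown
extra_sw = ['dont', 'didnt', 'doesnt', 'isnt', 'wasnt', 'cant', 'wont', 'ive']
corpus_sw_clean = [[mot for mot in doc if mot not in extra_sw] for doc in corpus_sw]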
#rebuild the documents as strings
documents = [" ".join(doc) for doc in corpus_sw]
print(documents[0])
guy idea cinema okay seems made interestig theater show youth two acceptable movie success political reason cause tricked communist censorship good look carefully doe know job scene unbalanced without proper start disordered content full emptiness nothing say subject overlicitates violence nakedness gutter language possible keep alive rotten corpse never understood anything cinematographic profession art dont let succumb piece
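#the whole chain can be folded into one reusable function -- a sketch that
#reuses the objects defined above (ponctuations, lem, mots_vides)
def preprocess(doc):
    #lowercase, strip punctuation, tokenize, lemmatize, filter
    doc = "".join(char for char in doc.lower() if char not in ponctuations)
    tokens = [lem.lemmatize(mot) for mot in word_tokenize(doc)]
    return [mot for mot in tokens if (mot not in mots_vides) and (len(mot) >= 3)]

#should reproduce the step-by-step result for the first document
print(preprocess(D['commentaires'][0]) == corpus_sw[0])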
#word2vec
from gensim.models import Word2Vec
modele = Word2Vec(corpus_sw,vector_size=2,window=5)
#type of the object
print(type(modele))
<class 'gensim.models.word2vec.Word2Vec'>
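#vector_size=2 is deliberately tiny so that the words can be plotted in the
#plane; gensim's defaults also apply here (min_count=5 -- which is why rare
#words are dropped from the vocabulary -- sg=0 i.e. CBOW, epochs=5); a more
#realistic configuration might look like this (illustrative values, not tuned)
modele_100d = Word2Vec(corpus_sw, vector_size=100, window=5, min_count=5, sg=1, epochs=20, seed=1)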
#properties of the object
dir(modele)
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_adapt_by_suffix', '_check_corpus_sanity', '_check_training_sanity', '_clear_post_train', '_do_train_epoch', '_do_train_job', '_get_next_alpha', '_get_thread_working_mem', '_job_producer', '_load_specials', '_log_epoch_end', '_log_epoch_progress', '_log_progress', '_log_train_end', '_raw_word_count', '_save_specials', '_scan_vocab', '_smart_save', '_train_epoch', '_train_epoch_corpusfile', '_worker_loop', '_worker_loop_corpusfile', 'add_lifecycle_event', 'add_null_word', 'alpha', 'batch_words', 'build_vocab', 'build_vocab_from_freq', 'cbow_mean', 'comment', 'compute_loss', 'corpus_count', 'corpus_total_words', 'create_binary_tree', 'cum_table', 'effective_min_count', 'epochs', 'estimate_memory', 'get_latest_training_loss', 'hashfxn', 'hs', 'init_sims', 'init_weights', 'layer1_size', 'lifecycle_events', 'load', 'make_cum_table', 'max_final_vocab', 'max_vocab_size', 'min_alpha', 'min_alpha_yet_reached', 'min_count', 'negative', 'ns_exponent', 'null_word', 'predict_output_word', 'prepare_vocab', 'prepare_weights', 'random', 'raw_vocab', 'reset_from', 'running_training_loss', 'sample', 'save', 'scan_vocab', 'score', 'seed', 'seeded_vector', 'sg', 'sorted_vocab', 'syn1neg', 'total_train_time', 'train', 'train_count', 'update_weights', 'vector_size', 'window', 'workers', 'wv']
#propriété "wv" -> wordvector
words = modele.wv
#type
print(type(words))
<class 'gensim.models.keyedvectors.KeyedVectors'>
#the properties of KeyedVectors
dir(words)
['__class__', '__contains__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_adapt_by_suffix', '_load_specials', '_log_evaluate_word_analogies', '_save_specials', '_smart_save', '_upconvert_old_d2vkv', '_upconvert_old_vocab', 'add_lifecycle_event', 'add_vector', 'add_vectors', 'allocate_vecattrs', 'closer_than', 'cosine_similarities', 'distance', 'distances', 'doesnt_match', 'evaluate_word_analogies', 'evaluate_word_pairs', 'expandos', 'fill_norms', 'get_index', 'get_normed_vectors', 'get_vecattr', 'get_vector', 'has_index_for', 'index2entity', 'index2word', 'index_to_key', 'init_sims', 'intersect_word2vec_format', 'key_to_index', 'load', 'load_word2vec_format', 'log_accuracy', 'log_evaluate_word_pairs', 'mapfile_path', 'most_similar', 'most_similar_cosmul', 'most_similar_to_given', 'n_similarity', 'next_index', 'norms', 'rank', 'rank_by_centrality', 'relative_cosine_similarity', 'resize_vectors', 'save', 'save_word2vec_format', 'set_vecattr', 'similar_by_key', 'similar_by_vector', 'similar_by_word', 'similarity', 'similarity_unseen_docs', 'sort_by_descending_frequency', 'unit_normalize_all', 'vector_size', 'vectors', 'vectors_lockf', 'vectors_norm', 'vocab', 'wmdistance', 'word_vec', 'words_closer_than']
#display the terms and their indices
words.key_to_index
{'film': 0, 'movie': 1, 'one': 2, 'like': 3, 'character': 4, 'time': 5, 'good': 6, 'would': 7, 'even': 8, 'see': 9, 'story': 10, 'get': 11, 'show': 12, 'scene': 13, 'great': 14, 'could': 15, 'people': 16, 'way': 17, 'also': 18, 'really': 19, 'make': 20, 'much': 21, 'first': 22, 'well': 23, 'thing': 24, 'made': 25, 'bad': 26, 'think': 27, 'life': 28, 'acting': 29, 'actor': 30, 'end': 31, 'watch': 32, 'little': 33, 'love': 34, 'year': 35, 'dont': 36, 'seen': 37, 'never': 38, 'though': 39, 'ever': 40, 'know': 41, 'give': 42, 'plot': 43, 'best': 44, 'back': 45, 'still': 46, 'work': 47, 'funny': 48, 'two': 49, 'better': 50, 'got': 51, 'look': 52, 'lot': 53, 'watching': 54, 'without': 55, 'director': 56, 'something': 57, 'take': 58, 'say': 59, 'fun': 60, 'man': 61, 'day': 62, 'many': 63, 'guy': 64, 'real': 65, 'come': 66, 'always': 67, 'ive': 68, 'play': 69, 'star': 70, 'another': 71, 'actually': 72, 'enough': 73, 'performance': 74, 'part': 75, 'fan': 76, 'horror': 77, 'doe': 78, 'old': 79, 'woman': 80, 'every': 81, 'anything': 82, 'fact': 83, 'want': 84, 'big': 85, 'kind': 86, 'going': 87, 'point': 88, 'music': 89, 'feel': 90, 'didnt': 91, 'original': 92, 'comedy': 93, 'however': 94, 'episode': 95, 'doesnt': 96, 'series': 97, 'find': 98, 'king': 99, 'quite': 100, 'action': 101, 'girl': 102, 'right': 103, 'rather': 104, 'kid': 105, 'pretty': 106, 'cast': 107, 'world': 108, 'thats': 109, 'isnt': 110, 'role': 111, 'done': 112, 'wasnt': 113, 'nothing': 114, 'true': 115, 'set': 116, 'seems': 117, 'new': 118, 'probably': 119, 'anyone': 120, 'left': 121, 'course': 122, 'friend': 123, 'sure': 124, 'face': 125, 'let': 126, 'use': 127, 'lead': 128, 'bit': 129, 'last': 130, 'far': 131, 'worst': 132, 'minute': 133, 'different': 134, 'camera': 135, 'effect': 136, 'yet': 137, 'thought': 138, 'screen': 139, 'poor': 140, 'audience': 141, 'around': 142, 'almost': 143, 'excellent': 144, 'moment': 145, 'idea': 146, 'watched': 147, 'piece': 148, 'house': 149, 'reason': 150, 'shot': 151, 'found': 152, 'start': 153, 'else': 154, 'least': 155, 'line': 156, 'version': 157, 'making': 158, 'need': 159, 'may': 160, 'long': 161, 'dialogue': 162, 'book': 163, 'song': 164, 'rest': 165, 'main': 166, 'saw': 167, 'script': 168, 'wife': 169, 'second': 170, 'death': 171, 'child': 172, 'interesting': 173, 'mean': 174, 'must': 175, 'away': 176, 'boy': 177, 'young': 178, 'tell': 179, 'try': 180, 'cant': 181, 'picture': 182, 'violence': 183, 'instead': 184, 'looking': 185, 'either': 186, 'season': 187, 'often': 188, 'eye': 189, 'hit': 190, 'stupid': 191, 'money': 192, 'absolutely': 193, 'hard': 194, 'sound': 195, 'three': 196, 'talent': 197, 'youre': 198, 'put': 199, 'favorite': 200, 'everyone': 201, 'production': 202, 'whole': 203, 'brilliant': 204, 'hate': 205, 'along': 206, 'lion': 207, 'review': 208, 'next': 209, 'blood': 210, 'might': 211, 'dvd': 212, 'beginning': 213, 'become': 214, 'ending': 215, 'place': 216, 'seem': 217, 'completely': 218, 'top': 219, 'voice': 220, 'seeing': 221, 'amazing': 222, 'police': 223, 'father': 224, 'later': 225, 'boring': 226, 'problem': 227, 'waste': 228, 'played': 229, 'looked': 230, 'definitely': 231, 'theyre': 232, 'home': 233, 'comment': 234, 'theme': 235, 'example': 236, 'totally': 237, 'youll': 238, 'small': 239, 'simply': 240, 'since': 241, 'michael': 242, 'human': 243, 'although': 244, 'style': 245, 'perfect': 246, 'family': 247, 'keep': 248, 'lady': 249, 'shes': 250, 'trying': 251, 'care': 252, 'laugh': 253, 'read': 254, 'able': 255, 'opinion': 256, 'early': 257, 'today': 258, 'move': 
259, 'american': 260, 'tom': 261, 'beautiful': 262, 'behind': 263, 'given': 264, 'lack': 265, 'becomes': 266, 'enjoyed': 267, 'job': 268, 'high': 269, 'break': 270, 'direction': 271, 'black': 272, 'room': 273, 'everything': 274, 'guess': 275, 'came': 276, 'husband': 277, 'together': 278, 'wanted': 279, 'gamera': 280, 'side': 281, 'school': 282, 'title': 283, 'finally': 284, 'night': 285, 'hour': 286, 'worth': 287, 'comic': 288, 'figure': 289, 'getting': 290, 'trash': 291, 'jack': 292, 'john': 293, 'late': 294, 'fine': 295, 'writer': 296, 'town': 297, 'attempt': 298, 'final': 299, 'dark': 300, 'certainly': 301, 'beyond': 302, 'head': 303, 'killer': 304, 'nice': 305, 'opening': 306, 'others': 307, 'classic': 308, 'low': 309, 'budget': 310, 'liked': 311, 'stuff': 312, 'turn': 313, 'run': 314, 'theater': 315, 'thinking': 316, 'mention': 317, 'especially': 318, 'whether': 319, 'someone': 320, 'chance': 321, 'throughout': 322, 'effort': 323, 'enjoy': 324, 'believe': 325, 'add': 326, 'maybe': 327, 'serial': 328, 'type': 329, 'question': 330, 'name': 331, 'leave': 332, 'genre': 333, 'loved': 334, 'social': 335, 'half': 336, 'yes': 337, 'written': 338, 'act': 339, 'joke': 340, 'number': 341, 'evil': 342, 'felt': 343, 'overall': 344, 'commercial': 345, 'expect': 346, 'car': 347, 'ten': 348, 'female': 349, 'understand': 350, 'said': 351, 'full': 352, 'brother': 353, 'gore': 354, 'playing': 355, 'hope': 356, 'james': 357, 'couple': 358, 'period': 359, 'serious': 360, 'cinema': 361, 'city': 362, 'perhaps': 363, 'mind': 364, 'please': 365, 'special': 366, 'couldnt': 367, 'wont': 368, 'needed': 369, 'obvious': 370, 'setting': 371, 'help': 372, 'art': 373, 'awful': 374, 'writing': 375, 'recommend': 376, 'basically': 377, 'pay': 378, 'lost': 379, 'robert': 380, 'wrong': 381, 'carpenter': 382, 'sci': 383, 'documentary': 384, 'danny': 385, 'feature': 386, 'allen': 387, 'musical': 388, 'strange': 389, 'within': 390, 'studio': 391, 'quality': 392, 'relationship': 393, 'space': 394, 'highly': 395, 'fall': 396, 'complex': 397, 'cheesy': 398, 'light': 399, 'simple': 400, 'wouldnt': 401, 'hero': 402, 'particular': 403, 'team': 404, 'sexual': 405, 'close': 406, 'soon': 407, 'terrible': 408, 'viewer': 409, 'using': 410, 'feeling': 411, 'waterman': 412, 'longer': 413, 'attention': 414, 'except': 415, 'level': 416, 'richard': 417, 'change': 418, 'took': 419, 'used': 420, 'seemed': 421, 'anyway': 422, 'truly': 423, 'knew': 424, 'short': 425, 'mother': 426, 'person': 427, 'learn': 428, 'extremely': 429, 'son': 430, 'element': 431, 'begin': 432, 'exactly': 433, 'hand': 434, 'return': 435, 'detective': 436, 'called': 437, 'released': 438, 'several': 439, 'murder': 440, 'unbelievable': 441, 'remember': 442, 'soundtrack': 443, 'sad': 444, 'based': 445, 'zombie': 446, 'clearly': 447, 'went': 448, 'premise': 449, 'editing': 450, 'mystery': 451, 'earlier': 452, 'spend': 453, 'boyfriend': 454, 'crime': 455, 'dead': 456, 'viewing': 457, 'involved': 458, 'air': 459, 'nature': 460, 'list': 461, 'scoop': 462, 'happens': 463, 'stand': 464, 'famous': 465, 'talk': 466, 'sex': 467, 'particularly': 468, 'cut': 469, 'supposed': 470, 'unfortunately': 471, 'wonderful': 472, 'appear': 473, 'street': 474, 'humor': 475, 'view': 476, 'predictable': 477, 'appearance': 478, 'directed': 479, 'rating': 480, 'word': 481, 'adult': 482, 'similar': 483, 'group': 484, 'god': 485, 'cool': 486, 'visual': 487, 'twist': 488, 'meet': 489, 'hollywood': 490, 'flick': 491, 'stage': 492, 'video': 493, 'scooby': 494, 'sense': 495, 'noir': 496, 'creepy': 497, 
'image': 498, 'interest': 499, 'thriller': 500, 'easily': 501, 'member': 502, 'accent': 503, 'brought': 504, 'positive': 505, 'lugosi': 506, 'hilarious': 507, 'tone': 508, 'exciting': 509, 'strong': 510, 'seriously': 511, 'age': 512, 'military': 513, 'entertainment': 514, 'despite': 515, 'single': 516, 'narrative': 517, 'sequence': 518, 'experience': 519, 'adaptation': 520, 'dialog': 521, 'tried': 522, 'answer': 523, 'fight': 524, 'class': 525, 'copy': 526, 'subject': 527, 'cover': 528, 'fantasy': 529, 'sitting': 530, 'whilst': 531, 'recent': 532, 'mood': 533, 'shanghai': 534, 'war': 535, 'utterly': 536, 'entire': 537, 'killed': 538, 'possible': 539, 'hotel': 540, 'career': 541, 'sort': 542, 'admit': 543, 'romance': 544, 'whats': 545, 'doubt': 546, 'surprise': 547, 'giving': 548, 'excuse': 549, 'reality': 550, 'teenage': 551, 'case': 552, 'complete': 553, 'trailer': 554, 'poorly': 555, 'happen': 556, 'dad': 557, 'shown': 558, 'manner': 559, 'tale': 560, 'woody': 561, 'peter': 562, 'live': 563, 'wrote': 564, 'nudity': 565, 'standard': 566, 'term': 567, 'actual': 568, 'showed': 569, 'named': 570, 'ear': 571, 'past': 572, 'western': 573, 'kazan': 574, 'singing': 575, 'easy': 576, 'modern': 577, 'apart': 578, 'including': 579, 'lighting': 580, 'ill': 581, 'taking': 582, 'amusing': 583, 'etc': 584, 'student': 585, 'teenager': 586, 'mark': 587, 'call': 588, 'bought': 589, 'deal': 590, 'magic': 591, 'gon': 592, 'record': 593, 'superb': 594, 'history': 595, 'white': 596, 'sit': 597, 'villain': 598, 'parent': 599, 'loose': 600, 'considered': 601, 'camp': 602, 'possibly': 603, 'somehow': 604, 'note': 605, 'charlie': 606, 'appears': 607, 'obviously': 608, 'horrible': 609, 'cagney': 610, 'scream': 611, 'paris': 612, 'cold': 613, 'sadly': 614, 'attack': 615, 'amateur': 616, 'coming': 617, 'leaf': 618, 'hold': 619, 'cause': 620, 'hear': 621, 'daughter': 622, 'red': 623, 'sometimes': 624, 'youve': 625, 'ray': 626, 'timon': 627, 'cinematic': 628, 'various': 629, 'animal': 630, 'believable': 631, 'heart': 632, 'shame': 633, 'crew': 634, 'men': 635, 'plenty': 636, 'wish': 637, 'fear': 638, 'aint': 639, 'sick': 640, 'footage': 641, 'chemistry': 642, 'surprised': 643, 'fascinating': 644, 'brilliantly': 645, 'dog': 646, 'mountain': 647, 'photography': 648, 'werent': 649, 'filmed': 650, 'important': 651, 'fellow': 652, 'successful': 653, 'stay': 654, 'kick': 655, 'recommended': 656, 'drunk': 657, 'era': 658, 'radiation': 659, 'giant': 660, 'issue': 661, 'minor': 662, 'monster': 663, 'fairly': 664, 'catch': 665, 'van': 666, 'independent': 667, 'smart': 668, 'major': 669, 'fantastic': 670, 'language': 671, 'twin': 672, 'game': 673, 'universal': 674, 'whose': 675, 'grew': 676, 'target': 677, 'frankly': 678, 'pumbaa': 679, 'folk': 680, 'usual': 681, 'detail': 682, 'italian': 683, 'caught': 684, 'score': 685, 'price': 686, 'saying': 687, 'fit': 688, 'welles': 689, 'credit': 690, 'unlike': 691, 'bar': 692, 'shocking': 693, 'annoying': 694, 'decided': 695, 'force': 696, 'shooting': 697, 'extreme': 698, 'genius': 699, 'rated': 700, 'approach': 701, 'reviewer': 702, 'enjoyable': 703, 'saving': 704, 'private': 705, 'gun': 706, 'personally': 707, 'costume': 708, 'master': 709, 'political': 710, 'slow': 711, 'television': 712, 'third': 713, 'spirit': 714, 'share': 715, 'generally': 716, 'happy': 717, 'dramatic': 718, 'managed': 719, 'criminal': 720, 'arent': 721, 'touch': 722, 'seek': 723, 'personal': 724, 'epic': 725, 'atmosphere': 726, 'belief': 727, 'turned': 728, 'mary': 729, 'parade': 730, 'sorry': 731, 'realize': 
732, 'suspense': 733, 'disappointed': 734, 'exception': 735, 'check': 736, 'meant': 737, 'mom': 738, 'meaning': 739, 'remake': 740, 'previous': 741, 'told': 742, 'moral': 743, 'terrific': 744, 'stop': 745, 'towards': 746, 'apparently': 747, 'reference': 748, 'cradle': 749, 'filth': 750, 'missing': 751, 'familiar': 752, 'yaara': 753, 'wondering': 754, 'cinematography': 755, 'dracula': 756, 'imdb': 757, 'system': 758, 'clear': 759, 'hell': 760, 'country': 761, 'gag': 762, 'judge': 763, 'masterpiece': 764, 'became': 765, 'stargate': 766, 'bounty': 767, 'hunter': 768, 'london': 769, 'sg1': 770, 'started': 771, 'wasted': 772, 'form': 773, 'immediately': 774, 'festival': 775, 'ruby': 776, 'match': 777, 'storyline': 778, 'dick': 779, 'lover': 780, 'footlight': 781, 'nearly': 782, 'murphy': 783, 'trashy': 784, 'novel': 785, 'hidden': 786, 'widmark': 787, 'initial': 788, 'entertaining': 789, 'working': 790, 'shining': 791, 'barbarian': 792, '2007': 793, 'fake': 794, 'wayne': 795, 'arthur': 796, 'visconti': 797, 'source': 798, 'somewhat': 799, 'kill': 800, 'living': 801, 'emotional': 802, 'extra': 803, 'marriage': 804, 'commentary': 805, 'memorable': 806, 'sondra': 807, 'karloff': 808, 'joe': 809, 'weird': 810, 'mask': 811, 'gem': 812, 'club': 813, 'alive': 814, 'artist': 815, 'affair': 816, 'raise': 817, 'success': 818, 'sexy': 819, 'plan': 820, 'normal': 821, 'fast': 822, 'local': 823, 'drama': 824, 'scare': 825, 'ghost': 826, 'combination': 827, 'outstanding': 828, 'disney': 829, 'kalifornia': 830, 'build': 831, 'singer': 832, 'beauty': 833, 'riker': 834, 'cruel': 835, 'solid': 836, 'cube': 837, 'mine': 838, 'plus': 839, 'gave': 840, 'worse': 841, 'matter': 842, 'expression': 843, 'entirely': 844, 'hot': 845, 'actress': 846, 'fire': 847, 'offer': 848, 'window': 849, 'david': 850, 'rate': 851, 'development': 852, 'metal': 853, 'silly': 854, 'cheap': 855, 'chandon': 856, 'surprisingly': 857, 'four': 858, 'thread': 859, 'context': 860, 'reminds': 861, 'ago': 862, 'ultimately': 863, 'lee': 864, 'cute': 865, 'demon': 866, 'slightly': 867, 'oscar': 868, 'concern': 869, 'none': 870, 'victim': 871, 'spot': 872, 'five': 873, 'expectation': 874, 'barely': 875, 'secret': 876, 'command': 877, 'impressive': 878, 'humour': 879, 'sky': 880, 'river': 881, 'date': 882, 'near': 883, 'fair': 884, 'compared': 885, 'bored': 886, 'ended': 887, 'follow': 888, 'mental': 889, 'imagination': 890, 'save': 891, 'adventure': 892, 'viewed': 893, 'player': 894, 'died': 895, 'sequel': 896, 'physical': 897, 'alone': 898, 'plague': 899, 'flat': 900, 'producer': 901, 'indeed': 902, '310': 903, 'value': 904, 'door': 905, 'taken': 906, 'critic': 907, 'machine': 908, 'alien': 909, 'running': 910, 'project': 911, 'double': 912, 'worthy': 913, 'ball': 914, 'edge': 915, 'decent': 916, 'albeit': 917, 'front': 918, 'phantasm': 919, 'merely': 920, 'satisfying': 921, 'impact': 922, 'kubrick': 923, 'instance': 924, 'violent': 925, 'queen': 926, 'count': 927, 'mysterious': 928, 'power': 929, 'energy': 930, 'doo': 931, 'drawn': 932, 'intention': 933, 'chris': 934, 'crap': 935, 'foot': 936, 'incredibly': 937, 'advice': 938, 'climax': 939, 'happened': 940, 'whatever': 941, 'due': 942, 'total': 943, 'smile': 944, 'happiness': 945, 'lane': 946, 'animation': 947, 'card': 948, 'reading': 949, 'usually': 950, 'clue': 951, 'across': 952, 'romantic': 953, 'garbage': 954, 'hair': 955, 'material': 956, 'fault': 957, 'science': 958, 'sight': 959, 'forced': 960, 'prof': 961, 'location': 962, 'pathetic': 963, 'hardly': 964, 'travel': 965, 'turtle': 966, 
'forever': 967, 'large': 968}
words.vectors.shape
(969, 2)
#coordinates of 'boring'
vec1 = words['boring']
print(vec1)
[1.592801 0.45731232]
#coordinates of 'love'
vec2 = words['love']
print(vec2)
[2.3717234 1.644279 ]
#cosine similarity -- the formula
import numpy
print(numpy.dot(vec1,vec2)/(numpy.linalg.norm(vec1)*numpy.linalg.norm(vec2)))
0.9471343
#gensim's similarity function
words.similarity('boring','love')
0.94713444
#terms closest to "boring"
words.most_similar("boring")
[('edge', 0.9999995231628418), ('whether', 0.9999991655349731), ('understand', 0.9999988675117493), ('released', 0.9999964833259583), ('light', 0.9999958276748657), ('bought', 0.9999942779541016), ('slow', 0.9999867081642151), ('command', 0.9999855756759644), ('tried', 0.9999783039093018), ('brother', 0.9999707937240601)]
#closest to the combination of "boring" and "love"
print(words.most_similar(positive=['boring','love'],topn=4))
[('past', 0.9999996423721313), ('rest', 0.9999969601631165), ('trailer', 0.9999938607215881), ('though', 0.9999935626983643)]
#plus proches de "love", loin de ("boring")
print(words.most_similar(positive=['love'],negative=['boring'],topn=4))
[('offer', 0.9882609844207764), ('oscar', 0.9538273811340332), ('share', 0.9297242760658264), ('forever', 0.9213628172874451)]
#find the odd one out
print(words.doesnt_match(['love','romance','nice','awful']))
awful
#collect the vectors in a data frame
df = pandas.DataFrame(words.vectors,columns=['V1','V2'],index=words.key_to_index.keys())
print(df)
                V1        V2
film      4.278973  2.685822
movie     3.959529  2.825736
one       3.428247  2.062633
like      3.759431  2.556596
character 3.315631  1.988176
...            ...       ...
hardly    0.504609  0.640190
travel    0.312583 -0.003713
turtle    0.611507  0.261184
forever  -0.031238  0.714805
large     0.110307  0.044220

[969 rows x 2 columns]
#a few keywords
mots = ['bad','good','plot','character','actor','dialogue','music']
dfMots = df.loc[mots,:]
print(dfMots)
                 V1        V2
bad        2.387161  1.698941
good       2.874387  2.253183
plot       2.840403  1.320738
character  3.315631  1.988176
actor      2.535941  1.646468
dialogue   1.134862  1.121174
music      2.226920  1.023955
#plot in the 2D plane
import matplotlib.pyplot as plt
plt.scatter(dfMots.V1,dfMots.V2,s=0.5)
for i in range(dfMots.shape[0]):
    plt.annotate(dfMots.index[i],(dfMots.V1[i],dfMots.V2[i]))
plt.show()
#see -- https://fauconnier.github.io/
from gensim.models import KeyedVectors
trained = KeyedVectors.load_word2vec_format("frWac_non_lem_no_postag_no_phrase_500_skip_cut100.bin",binary=True,unicode_errors='ignore')
#size of the vocabulary
print(len(trained.key_to_index))
155562
#words most similar to "benzema"
trained.most_similar(positive=["benzema"])
[('arfa', 0.913178563117981), ('govou', 0.9104522466659546), ('juninho', 0.8955627679824829), ('toulalan', 0.8621358275413513), ('malouda', 0.8552051186561584), ('coupet', 0.8476995229721069), ('kallstrom', 0.8429117798805237), ('baros', 0.8416147828102112), ('abidal', 0.8360183238983154), ('wiltord', 0.8322257399559021)]
#similar to "benzema", excluding "ol" (Olympique Lyonnais)
trained.most_similar(positive=["benzema"],negative=["ol"])
[('malouda', 0.43094778060913086), ('vieira', 0.42352476716041565), ('sagnol', 0.42134761810302734), ('thuram', 0.4146784842014313), ('anelka', 0.41370975971221924), ('toulalan', 0.40895143151283264), ('makelele', 0.4072572886943817), ('trezeguet', 0.3994947373867035), ('evra', 0.3987172544002533), ('govou', 0.39852991700172424)]
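#the same API supports the classic analogy queries; with this French model one
#might try the king/woman/man query below (output omitted -- the result
#depends entirely on the pre-trained vectors)
print(trained.most_similar(positive=['roi','femme'],negative=['homme'],topn=3))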
#tag the documents with their identifier
from gensim.models.doc2vec import TaggedDocument
tagged_docs = [TaggedDocument(words=corpus_sw[i],tags=["d"+str(D.ID[i])]) for i in range(len(corpus_sw))]
#first doc, for example
print(tagged_docs[0])
TaggedDocument(['guy', 'idea', 'cinema', 'okay', 'seems', 'made', 'interestig', 'theater', 'show', 'youth', 'two', 'acceptable', 'movie', 'success', 'political', 'reason', 'cause', 'tricked', 'communist', 'censorship', 'good', 'look', 'carefully', 'doe', 'know', 'job', 'scene', 'unbalanced', 'without', 'proper', 'start', 'disordered', 'content', 'full', 'emptiness', 'nothing', 'say', 'subject', 'overlicitates', 'violence', 'nakedness', 'gutter', 'language', 'possible', 'keep', 'alive', 'rotten', 'corpse', 'never', 'understood', 'anything', 'cinematographic', 'profession', 'art', 'dont', 'let', 'succumb', 'piece'], ['d1'])
#second doc.
print(tagged_docs[1])
TaggedDocument(['movie', 'extremely', 'depressing', 'character', 'cold', 'mother', 'main', 'character', 'everything', 'motherly', 'unhappy', 'marriage', 'always', 'put', 'husband', 'child', 'first', 'husband', 'visit', 'son', 'meet', 'hunk', 'sleeping', 'daughter', 'end', 'sleeping', 'part', 'movie', 'right', 'excellent', 'watched', 'guy', 'charming', 'blame', 'motherly', 'sleep', 'daughter', 'lover', 'let', 'blame', 'shock', 'losing', 'husband', 'becomes', 'totally', 'obsessed', 'guy', 'think', 'part', 'started', 'dislike', 'movie', 'shes', 'always', 'wanting', 'please', 'old', 'fashioned', 'way', 'snack', 'working', 'son', 'house', 'guess', 'thing', 'ever', 'learned', 'way', 'could', 'get', 'attention', 'guy', 'obviously', 'interested', 'actually', 'seems', 'like', 'considered', 'sleeping', 'charitable', 'activity', 'instead', 'insulted', 'continues', 'beg', 'bed', 'nice', 'becomes', 'abusive', 'want', 'please', 'tell', 'desperate', 'way', 'insulting', 'badly', 'outraged', 'movie', 'utter', 'lack', 'selfrespect', 'mother', 'tell', 'craig', 'something', 'like', 'shapeless', 'lump', 'first', 'time', 'sleep', 'together', 'movie', 'insult', 'woman', 'kind', 'would', 'bought', 'little', 'object', 'would', 'brought', 'satisfaction', 'lot', 'emotional', 'pain'], ['d2'])
#doc2vec
from gensim.models.doc2vec import Doc2Vec
modeleDoc = Doc2Vec(vector_size=2,window=5)
#first step: build the vocabulary
modeleDoc.build_vocab(tagged_docs)
#coordinates of the terms - as with word2vec
pandas.DataFrame(modeleDoc.wv.vectors,columns=['V1','V2'],index=modeleDoc.wv.key_to_index.keys())
| | V1 | V2 |
|---|---|---|
| film | -0.026811 | 0.011822 |
| movie | 0.255167 | 0.450464 |
| one | -0.465147 | -0.355840 |
| like | 0.322944 | 0.448649 |
| character | -0.250771 | -0.188169 |
| ... | ... | ... |
| hardly | 0.203586 | 0.450866 |
| travel | -0.151883 | -0.291927 |
| turtle | 0.150994 | -0.021792 |
| forever | -0.498972 | 0.420885 |
| large | -0.366944 | -0.246520 |

969 rows × 2 columns
#train the model to position the documents
modeleDoc.train(tagged_docs,total_examples=modeleDoc.corpus_count,epochs=100)
print(modeleDoc.dv)
<gensim.models.keyedvectors.KeyedVectors object at 0x00000226A5609430>
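#an unseen document can be projected into the same 2D space with infer_vector
#(sketch: the token list below is hypothetical and should come from the same
#preprocessing chain as the training corpus)
nouveau = ['boring','movie','terrible','acting','weak','plot']
print(modeleDoc.infer_vector(nouveau))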
#coordinates of the documents
#print(modeleDoc.dv.key_to_index)
#data frame with the coordinates
dfDoc2Vec = pandas.DataFrame(modeleDoc.dv.vectors,columns=['X1','X2'])
print(dfDoc2Vec)
           X1        X2
0   -0.616892  2.129723
1    0.236581  2.737840
2   -0.392282  2.317758
3    0.117344  3.013039
4   -2.857508  0.032675
..        ...       ...
195 -1.558749  0.654467
196 -0.582765  1.513740
197 -3.515436 -0.829151
198  1.048651  3.434941
199 -5.043906 -1.480523

[200 rows x 2 columns]
#plot
import seaborn as sns
sns.scatterplot(data=dfDoc2Vec,x='X1',y='X2')
<AxesSubplot:xlabel='X1', ylabel='X2'>
#the document in the south-west quadrant
dfDoc2Vec.loc[dfDoc2Vec.X1 < -15,:]
| | X1 | X2 |
|---|---|---|
| 124 | -16.358557 | -5.379943 |
#doc corresponding to the point in the south-west quadrant
D.commentaires[124]
'Coming immediately on the heels of Match Point (2005), a fine if somewhat self-repetitive piece of "serious Woody," Scoop gives new hope to Allen\'s small but die-hard band of followers (among whom I number myself) that the master has once again found his form. A string of disappointing efforts, culminating in the dreary Melinda and Melinda (2004) and the embarrassing Anything Else (2003) raised serious doubts that another first rate Woody comedy, with or without his own participation as an actor, was in the cards. Happily, the cards turn out to be a Tarot deck that serves as Scoop\'s clever Maguffin and proffers an optimistic reading for the future of Woody Allen comedy. Even more encouraging, Woody\'s self-casting - sadly one of the weakest elements of his films in recent years - is here an inspired bit of self-parody as well as a humble recognition at last that he can no longer play romantic leads with women young enough to be his daughters or granddaughters. In Scoop, Allen astutely assigns himself the role of Sid Waterman, an aging magician with cheap tricks and tired stage-patter who, much like Woody himself, has brought his act to London, where audiences - if not more receptive - are at least more polite. Like Chaplin\'s Calvero in Limelight (1952), Sid Waterman affords Allen the opportunity to don the slightly distorted mask of an artist whose art has declined and whose audience is no longer large or appreciative. Moreover, because they seem in character, Allen\'s ticks and prolonged stammers are less distracting here than they have been in some time. Waterman\'s character also functions neatly in the plot. His fake magic body-dissolving box becomes the ironically plausible location for visitations from Joe Strombel (Ian McShane), a notorious journalistic muckraker and recent cardiac arrest victim. Introduced on a River Styx ferryboat-to-Hades, Strombel repeatedly jumps ship because he just can\'t rest in eternity without communicating one last "scoop" about the identity of the notorious "Tarot killer." Unfortunately, his initial return from the dead leads him to Waterman\'s magic show and the only conduit for his hot lead turns out to be a journalism undergraduate, Sondra Pransky (Scarlett Johansson), who has been called up from the audience as a comic butt for the magician\'s climactic trick. Sondra enthusiastically seizes the journalistic opportunity and drags the reluctant Waterman into the investigation to play the role of her millionaire father. As demonstrated in Lost in Translation, Johansson has a talent for comedy, and the querulous by-play between her and Allen is very amusing - and all the more so for never threatening to become a prelude to romance. Scoop\'s serial killer plot, involving grisly murders of prostitutes and an aristocratic chief suspect, Peter Lyman (Hugh Jackman), is the no doubt predictable result of Allen\'s lengthy sabbatical exposure to London\'s ubiquitous Jack the Ripper landmarks and lore. Yet other facets of Scoop (as of Match Point) also derive from Woody\'s late life encounter with English culture. Its class structure, manners, idiom, dress, architecture, and, yes, peculiar driving habits give Woody fresh new material for wry observation of human behavior as well as sharp social satire. When, for instance, Sondra is trying to ingratiate herself with Peter Lyman at a ritzy private club, Waterman observes "from his point of view we\'re scum." 
A good deal of humor is also generated by the contretemps of stiffly reserved British social manners encountering Waterman\'s insistent Borscht-belt Jewish plebeianism. And, then, of course, there is Waterman\'s hilarious exit in a Smart Car he can\'t remember to drive on the left side of the road. As usual, Allen\'s humor in Scoop includes heavy doses of in-jokes, taking the form of sly allusions to film and literary sources as well as, increasingly, references to his own filmography. In addition to the pervasive Jack the Ripper references, for instance, the film\'s soundtrack is dominated by an arrangement of Grieg\'s "The Hall of the Mountain King," compulsively whistled by Hans Beckert in M, the first masterpiece of the serial killer genre. The post-funeral gathering of journalists who discuss the exploits of newly departed Joe Strombel clearly mimics the opening of Broadway Danny Rose (1984). References to Deconstructing Harry (1997) include the use of Death as a character (along with his peculiar voice and costume), the use of Mandelbaum as a character name, and the mention of Adair University (Harry\'s "alma mater" and where Sondra is now a student). Moreover, the systematic use of Greek mythology in the underworld river cruise to Hades recalls the use of Greek gods and a Chorus in Mighty Aphrodite (1995). As to quotable gags, Allen\'s scripts rely less on one-liners than they did earlier in his career, but Scoop does provides at least a couple of memorable ones. To a question about his religion, Waterman answers: "I was born in the Hebrew persuasion, but later I converted to narcissism." And Sondra snaps off this put-down of Waterman\'s wannabe crime-detecting: "If we put our heads together you\'ll hear a hollow noise." All in all, Scoop is by far Woody Allen\'s most satisfying comedy in a decade.'
#point in the north-east quadrant
dfDoc2Vec.loc[dfDoc2Vec.X1 > 4,:]
| | X1 | X2 |
|---|---|---|
| 53 | 4.053157 | 8.560942 |
#comment no. 53
D.commentaires[53]
'We now travel to a parallel universe where the appearance of giant prehistoric monsters flattening cities are part of the daily routine. It\'s the world of Godzilla, Rodan, Mothra Ghidrah and their kind - a strange world, and one made even stranger by the appearance of an unidentified flying turtle called Gamera. Forever in the shadow of the monolithic Toho Studios, second rung Daiei Studios were more famous for samurai sagas than monster movies. In the mid 60s they decided to join the giant reptile race and designed a rival monster series to Toho\'s mammothly successful Godzilla. They wisely chose Gamera as their flagship - a giant turtle that shoots flames from between its snaggle-teeth, and spins through the air by shooting flames through its shell\'s feet-holes (and at one point you almost see the paper mache shell catch fire!). The first Gamera film "Gamera The Invincible" (as it was sold to the US) is a virtual mirror of the first Godzilla film, only 10 years behind. American fighters chase an unmarked plane over the Arctic to its fiery demise - the nuclear bomb on board ignites and awakens the giant Gamera from its icy slumber. Feeding off atomic energy, it immediately goes on a rampage, and the world wants to destroy Gamera once and for all, but a little Japanese boy named Kenny, who has a psychic connection with the giant turtle and even keeps a miniature version in an aquarium by his bedside, believes Gamera is essentially kind and benevolent. He\'s like a little Jewish kid with a pinup of Hitler. "Gamera is a GOOD turtle," he pleads, then sulks, and puts on a face like someone\'s pooped in his coco pops. Miraculously the world\'s leaders listen to him, and so begins Z-Plan to save the world AND Gamera from complete destruction. Released in 1965, Gamera was a surprising hit. The annoying infantile anthropomorphism actually worked on kiddie audiences in both Japan and the US, and the sight of Gamera on two feet stomping miniatures of Tokyo and the North Pole is gloriously chintzy. Most surprising of all is the longevity of the series: eight original Gamera films, plus a slew of recent remakes. Not bad for a mutant reptile whose only friend is mewing eight year old milquetoast - and if I hear "Gamera is friends to ALL children" one more time I\'M going to crush Tokyo. Which appears to be an easy task in the parallel universe where children are smart and turtles are bigger than a Seiko billboard in the 1965 turtle-fest Gamera.'
#add the polarity
dfDoc2Vec['polarite'] = D.label
#redo the plot with the polarity
sns.scatterplot(data=dfDoc2Vec,x='X1',y='X2',hue='polarite')
<AxesSubplot:xlabel='X1', ylabel='X2'>
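#the document coordinates can serve as features for predicting the polarity --
#a hedged sketch assuming scikit-learn is available (2 dimensions is far too
#few in practice; this only illustrates the idea)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
scores = cross_val_score(LogisticRegression(),dfDoc2Vec[['X1','X2']],dfDoc2Vec['polarite'],cv=5)
print(scores.mean())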