Saturday, September 7, 2019

Once Upon a Time: NLP on Disney Movie Scripts & Their Original Stories

Introduction


Practically everyone knows some Disney movies, but not everyone knows the original stories behind them. Those stories come from many places. For example, The Little Mermaid comes from a fairy tale by Hans Christian Andersen, and The Hunchback of Notre Dame comes from a Gothic novel by Victor Hugo. It would be interesting to see the parallels between the Disney movie scripts and their original writings, which is exactly what my project seeks to do. In addition to finding interesting relationships between main characters in the movie scripts and original stories, my project culminates in a rudimentary Flask app that serves as a book/movie recommendation system.

Methodology


I first gathered 17 Disney movie scripts and their corresponding original stories. Then, I tokenized my data in two different ways: on the word level and on the sentence level. On the word level, I removed punctuation and stop words, lemmatized the tokens, used part-of-speech tagging to keep only nouns and adjectives, and weighted the remaining terms with TF-IDF. I did this pre-processing in order to feed my data into an NMF model for topic modeling. I also used Word2Vec and PCA to find relationships between main characters in the original stories and in the movie scripts. On the sentence/line level, I kept punctuation so that I could do sentiment analysis with VADER, and I applied dynamic time warping to compare sentiment change over time in the books and movies. With my NMF model and sentiment analysis, I was able to come up with a rudimentary book/movie recommendation system. The user picks a story or movie they liked from the list of 34 and enters a sentiment similarity weight and a topic similarity weight, and the system returns a book or movie that matches those criteria.

My methodology is summed up in this picture:
















Findings


Word2Vec & PCA

Using Word2Vec and reducing the embeddings to 2 dimensions with PCA, I was able to visualize the spread of main characters in the stories and in the movies. In the original stories, there is a greater spread of main characters, with some characters closer to the words "love" and "happy" and some rather far away from them. In the movies, however, the main characters are bunched together, and everyone seems to be relatively close to the words "love" and "happy". Does this mean that the Disney movies are simply more similar to each other and sugar-coat the original stories? This idea makes sense, since many of the original stories are darker and do not always have happy endings. For example, many of the main characters die in Victor Hugo's The Hunchback of Notre Dame, but the Disney version is a lot happier.
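The dimensionality reduction itself can be sketched in a few lines of NumPy. This is only an illustration of PCA, not my actual pipeline: the embeddings below are random placeholder vectors standing in for trained Word2Vec output, and the labels and the pca_2d helper are hypothetical, so the printed coordinates carry no meaning.

```python
import numpy as np


def pca_2d(vectors):
    """Project row vectors onto their first two principal components."""
    X = np.asarray(vectors, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


# Placeholder embeddings: random vectors, NOT real Word2Vec output.
labels = ["ariel", "quasimodo", "alice", "love", "happy"]
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(labels), 50))

coords = pca_2d(embeddings)
for label, (x, y) in zip(labels, coords):
    print(f"{label:>10}  x={x:+.2f}  y={y:+.2f}")
```

With real embeddings, the two columns of coords would be the x and y positions of each character or word in a scatter plot.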


















Sentiment Analysis 

Using VADER, I was able to assign a sentiment score to each line or sentence of a movie script or story. This created a time series of sentiment change over the plot of the movie or story. I then used dynamic time warping to compare these time series and measure how similar each movie and its corresponding story 'feel'. My findings show that the most similar book and movie pair is Winnie the Pooh by A.A. Milne and The Adventures of Winnie the Pooh. The Winnie the Pooh book/movie pair received a dynamic time warping distance of 7, the lowest of all the book/movie pairs. The most dissimilar book and movie according to sentiment is The Hunchback of Notre Dame by Victor Hugo and The Hunchback of Notre Dame. This book/movie pair received a dynamic time warping distance of 33. You can see the change in sentiment over time in the following graphs. (I used a rolling mean of 5% of the lines/sentences to smooth out the graphs.)
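For readers unfamiliar with dynamic time warping, here is a minimal pure-Python sketch of the smoothing and distance steps. It is a textbook version under my own assumptions (absolute difference as the local cost, no window constraint, no length normalization), so its numbers will not necessarily match the distances reported above. The two score lists are made up, and the VADER scoring step is omitted.

```python
def rolling_mean(values, fraction=0.05):
    """Smooth a series with a trailing window covering `fraction` of its length."""
    window = max(1, round(len(values) * fraction))
    smoothed, running = [], 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]  # drop the value that left the window
        smoothed.append(running / min(i + 1, window))
    return smoothed


def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Made-up sentiment scores in [-1, 1], one per line or sentence.
book_scores = [0.1, 0.4, -0.2, -0.6, -0.1, 0.3, 0.7, 0.5]
movie_scores = [0.2, 0.5, 0.4, -0.1, -0.5, 0.2, 0.6, 0.8, 0.7, 0.6]

distance = dtw_distance(rolling_mean(book_scores, 0.25), rolling_mean(movie_scores, 0.25))
print(f"DTW distance: {distance:.3f}")
```

Because DTW aligns the two series before summing the differences, a book and a script of different lengths can still be compared directly.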
















Recommendation System

Finally, I built my recommendation system from two ingredients: topic modeling over all 34 books and movies, and the dynamic time warping distances. NMF produced 34 topics, each of which corresponded very well to one movie or book. I wanted the user to be able to control two things: how similar the recommended book or movie should feel to their chosen one, and how similar in topic it should be. To provide those options, I used cosine similarity to compare every book or movie against every other one using the topic weight vectors I received from NMF. I also used the dynamic time warping distances to compare how similar in sentiment each book/movie was to every other book/movie.
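One way such a weighted recommendation could be computed is sketched below. The scoring rule is an assumption for illustration, not a description of my app: it rescales each DTW distance into a similarity by dividing by the largest distance, then takes a weighted sum with the cosine similarity. The titles "A", "B", and "C", their vectors, and their distances are made-up placeholders.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two topic weight vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def recommend(chosen, topic_vectors, dtw_distances, topic_weight, sentiment_weight):
    """Return the title with the highest weighted similarity to `chosen`."""
    max_distance = max(dtw_distances.values()) or 1.0
    best_title, best_score = None, -math.inf
    for title, vector in topic_vectors.items():
        if title == chosen:
            continue
        topic_sim = cosine_similarity(topic_vectors[chosen], vector)
        # Assumed conversion: a small DTW distance means a high sentiment similarity.
        distance = dtw_distances[frozenset((chosen, title))]
        sentiment_sim = 1.0 - distance / max_distance
        score = topic_weight * topic_sim + sentiment_weight * sentiment_sim
        if score > best_score:
            best_title, best_score = title, score
    return best_title


# Placeholder data: three hypothetical titles with toy topic vectors and distances.
topic_vectors = {"A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.0, 1.0]}
dtw_distances = {
    frozenset(("A", "B")): 30.0,
    frozenset(("A", "C")): 5.0,
    frozenset(("B", "C")): 20.0,
}

print(recommend("A", topic_vectors, dtw_distances, topic_weight=1.0, sentiment_weight=0.0))
print(recommend("A", topic_vectors, dtw_distances, topic_weight=0.0, sentiment_weight=1.0))
```

With all the weight on topic, the sketch picks "B", whose topic vector is closest to "A"; with all the weight on sentiment, it picks "C", which has the smallest DTW distance to "A".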

Here is a demonstration of the Flask app I built:



I will end with a quote: 
"Begin at the beginning," the King said, very gravely, "and go on till you come to the end: then stop."
-Lewis Carroll, Alice in Wonderland