Thursday, May 7, 2020

Diary Project Part 1: Diary Page Generator App

The History

I have kept a diary for 18 years. I don't write in it every day, and some entries are missing, but I have come to see my diary collection as a treasure trove of memories and of the events I have experienced in my life. It is also a great source of data for an NLP project.

I thought: what if I could program a computer to write like me? That was the start of my diary generator app.

A Moment

Here is an excerpt from a "page" of my generated diary:

"I am able to listen to my parents talk about things and my own needs. Honestly, I really want to be able to listen to them, but I don't listen very much. I think I really need to learn how to speak to my own needs. I think I know what I really want to know, but I haven't really communicated it to my parents very well. My dad doesn't like to talk to me so much, but he does talk to me occasionally. I think I am getting better at communicating with dad because I am reflecting back what I say to him. Honestly, I think I need to do more of the following: 1. Be more present. Be passionate about something. Be passionate about math. Talk to more people. Try to keep up with habits. 2. Respect my parents. Be authentic to people. 3. Give accurate and detailed descriptions of what you are doing. 4. Be up to speed with what you are doing. Learning is not just about finding what you are doing. It is also about being noticed."

I can say that there is an essence of me in this generated text. I truly want to be able to communicate with my parents better, and I vaguely remember writing something on that topic. The numbered list of things "I think I need to do" is mostly true, and the part about being "more present" could not be more true. But aside from the meaning of the generated text, the writing style does feel like a diary entry of mine.

The Details

The heart of this text generator is a GPT-2 model. I used Max Woolf's gpt-2-simple (https://github.com/minimaxir/gpt-2-simple) to fine-tune a GPT-2 model on my diary dataset.
According to Wikipedia, GPT-2 is a text-generating model developed by OpenAI. It was trained on a corpus called WebText, which contains over 8 million documents taken from web pages linked in Reddit posts. It is good at generating fake news, and OpenAI initially withheld the full model out of concern that it could be misused. (This project was just for fun, and I don't intend to threaten anybody with my diary entries.)

I was able to find and type up 393 diary entries of varying lengths and concatenate them into one .txt file. The GPT-2 model can be fine-tuned on a custom corpus supplied as a .txt file, with each sentence on its own line. I chose the "small" 124M-parameter model to fine-tune on my diary entries.
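
As a rough sketch of this step, the Python below joins a list of entries into a one-sentence-per-line corpus and then hands the file to gpt-2-simple. The naive sentence splitter, the helper names, the run name and the step count are illustrative assumptions, not necessarily the settings used for this project. The calls download_gpt2, start_tf_sess and finetune come from gpt-2-simple.

```python
import re


def entries_to_corpus(entries):
    """Join diary entries into one string with one sentence per line."""
    lines = []
    for entry in entries:
        # Naive split: break after ., ! or ? when followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", entry.strip()):
            if sentence:
                lines.append(sentence)
    return "\n".join(lines) + "\n"


def write_corpus(entries, path):
    """Write the one-sentence-per-line corpus to a .txt file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(entries_to_corpus(entries))


def fine_tune(corpus_path, steps=1000, run_name="diary"):
    """Fine-tune the 124M model on the corpus.

    Needs gpt-2-simple and TensorFlow installed; steps and run_name
    are placeholder values, not the ones used for the real app.
    """
    import gpt_2_simple as gpt2  # lazy import so the helpers above work without it

    gpt2.download_gpt2(model_name="124M")  # fetch the pretrained "small" model
    sess = gpt2.start_tf_sess()
    gpt2.finetune(
        sess,
        dataset=str(corpus_path),
        model_name="124M",
        steps=steps,
        run_name=run_name,
    )
    return sess
```

Calling write_corpus(entries, "diary.txt") followed by fine_tune("diary.txt") would run the whole step. The fine-tuning call is best run somewhere with a GPU, such as a Colab notebook.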

After fine-tuning the model for about 45 minutes, I was able to start generating text with it. My app has only one setting, the default length and temperature, but it is possible to tweak these so that the model generates more or less text with more or less "creativity" (a higher temperature means "letting the model pick suboptimal predictions"). You can even enter a prefix so that the generated text starts from a certain letter, word, or phrase. My app, however, has just one button, which runs the model and generates text.
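
To make the temperature idea concrete, here is a small, self-contained sketch of how temperature reshapes the probabilities of the next word. It is a generic illustration, not code taken from gpt-2-simple, and the three scores are made up.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next words.
logits = [2.0, 1.0, 0.1]
cool = softmax_with_temperature(logits, temperature=0.5)
warm = softmax_with_temperature(logits, temperature=1.5)

# At the lower temperature the top word dominates; at the higher one
# the less likely ("suboptimal") words get a bigger share.
print([round(p, 3) for p in cool])
print([round(p, 3) for p in warm])
```

In gpt-2-simple, these knobs are the length, temperature and prefix keyword arguments of its generate function.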

The fine-tuning was run in a Google Colab notebook, specifically with this notebook - https://colab.research.google.com/drive/1VLG8e7YSEwypxU-noRNhsv5dW4NfTGce

After I downloaded the fine-tuned model to my computer, I created a Flask app to run it. I then used Docker and Google Cloud to deploy my app.
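
For readers who have not built one before, a one-button Flask app can be sketched roughly as follows. This is not a copy of my actual code (that is in the repository linked below). The names page_html, create_app and generate_text are hypothetical, and the text generator is passed in as a plain function so the sketch stays independent of the model.

```python
import html


def page_html(text):
    """Build a minimal page: one button plus the (escaped) generated text."""
    return (
        "<form method='post'><button type='submit'>Generate</button></form>"
        f"<p>{html.escape(text)}</p>"
    )


def create_app(generate_text):
    """Return a Flask app.

    generate_text is any zero-argument function that returns a string;
    in a real deployment it would call the fine-tuned model.
    """
    from flask import Flask, request  # lazy import: only needed when serving

    app = Flask(__name__)

    @app.route("/", methods=["GET", "POST"])
    def index():
        # Only run the (slow) generator when the button has been pressed.
        text = generate_text() if request.method == "POST" else ""
        return page_html(text)

    return app
```

Something like create_app(my_generator).run(host="0.0.0.0", port=8080) would serve it locally. Inside a Docker container on a managed service, the port usually comes from an environment variable instead of being hard-coded.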

This was my first time using Docker and Google Cloud.

You can access my diary generator app here - https://diary-khmwtmm5lq-uc.a.run.app/

And the code for it here - https://github.com/morningkaren/diary-generator

Stay tuned for Diary Project Part 2, where I delve into the topic modeling portion of my project.
