Friday, August 7, 2020

On Being Asian and in a STEM Career

I want to first say that my new website, which shines light on different topics in data science, can be accessed at morningmodule.com.

The second thing I want to say is that I will be pivoting this blog to be more of a personal reflection type of blog where I will be writing my thoughts on matters that may or may not be directly related to data science or being an aspiring data scientist. So, instead of this blog having a focus on my projects, it will be less technical in nature and more of my musings. 

Lately, I had some thoughts about being in a STEM career and being Asian American. These thoughts were prompted by my sister, who has always had a big influence on me whether I acknowledge it or not. My sister is the complete opposite of me in terms of her personality, interests, and of course, career aspirations. She is outspoken and stubborn and I am shy and more of a push-over. She majored in Chinese and English and I majored in Math. She went to college in NYC and I went to school upstate. She is interested in Asian American politics, civic engagement, and immigration law, and I am interested in figuring out how to connect backend Python code to a frontend Angular framework.

My sister also started a newsletter which focuses on all things Asian American, and collaborated with others to start a podcast called "Fresh off the Vote" which focuses on voting, mainly in the Asian American Pacific Islander community. That's pretty amazing, right? Considering there aren't that many Asian Americans in politics, she may be doing things at the frontier of civic engagement among Asian Americans.

As for me, I feel more like I have followed a stereotypical route of being on a STEM track almost my whole life. Although I was technically a "Social Science major" in high school, I still went to a technical high school (Brooklyn Technical High School) and took the hardest math classes they had to offer. I also contemplated being a Biology major in college. I can read my old journal entries from elementary school and see how focused I was on grades as well. I just feel I have checked all the boxes of being a stereotypical Asian American.

Now that I am an aspiring data scientist, I sometimes wonder why I chose this field. Of course it is interesting--there is no doubt. But I sometimes wonder how I can bring more of the humanities to data science. I remember really enjoying my "Literature and Society" class and studying abroad in Spain to learn the language, culture, and history. I also remember enjoying getting my Masters in Teaching, although teaching itself was not as enjoyable. Although I enjoyed learning about the humanities and teaching, I ended up in a more technical field than ever. I think it isn't because I inherently liked STEM more, but because I was raised in a way that put STEM on a higher pedestal.

I simply have to remember that I am a multi-faceted person and although I have chosen data science as a career track, I am not just a data scientist. I am a human with other interests and perhaps other obligations as a person who has the ability to vote. 

I suppose my sister's newsletter and the podcast she is a part of are awakening the other side of me that wants to make a difference not just in data science but in other parts of society. Her existence is a call to action. It is time to remember to read the news sometimes, understand some politics, and take action when needed.

"Fresh off the Vote" is available everywhere. My sister's newsletter can be accessed through nonnative.substack.com




Tuesday, May 19, 2020

Diary Project Part 2: Topic Modeling

Why?

I thought it would be interesting to see the major themes that showed up throughout my life, specifically before high school, during high school, during college, and post-college. That was the rationale behind doing topic modeling on my diary entries.

What I Found

By just looking at the most common bigrams (two adjacent words that appear in my writing after I "cleaned" it up with some Natural Language Processing techniques), I gained some insight into my life. It seemed like the central things I wrote about involved school of some sort, volunteering, career, and being Chinese.

Before high school, the 8 most common bigrams were: "table_leader", "fourth_grade", "social_study", "good_night", "younger_sister", "best_wish", "bad_news", and "picture_frame".
It is true that most of my diary entries before high school were written during the 3rd and 4th grades. "table_leader" most likely refers to when I was a table leader in my class, which was just a role my teacher assigned to a student to collect work from other students at the same table.

During high school, the 8 most common bigrams were "red_envelope", "kings_plaza", "nursing_home", "horseshoe_crab", "math_fair", "touch_pool", "fuck_fuck", and "ca_wait". Red envelopes are literally red envelopes stuffed with cash, usually given out during Chinese New Year by married people to children or unmarried relatives. I used to live in Brooklyn and we used to go to Kings Plaza to shop for clothes. "nursing_home" probably refers to the time when I volunteered at a nursing home in high school. "horseshoe_crab" and "touch_pool" most likely refer to the time I volunteered at the NY Aquarium. "math_fair" refers to the time when I wrote a math research paper, and the expletive is probably from being stressed about college applications and decisions.

During college, the 8 most common bigrams were "actuarial_exam", "high_school", "study_abroad", "host_mom", "fall_asleep", "concrete_dream", "career_path", and "family_member". The bigrams "actuarial_exam", "concrete_dream", and "career_path" all relate to the time when I was searching for a career path and thought maybe actuarial science was something I wanted to do. "study_abroad" and "host_mom" are related to each other, as I studied abroad in Spain and lived with a host parent for four months. "high_school" probably appears because I was reminiscing about high school during college.

After college, the 8 most common bigrams were "data_science", "san_francisco", "data_scientist", "high_school", "imposter_syndrome", "chinese_culture", "walked_around", and "make_difference". What surprised me is that teaching-related bigrams did not show up in my top 30 bigrams. It may be because I did not write about teaching much during my time as a teacher.

Topic Modeling with LDA

To see what kinds of topics topic modeling would pick up in the four stages of my life, I decided to use LDA, or Latent Dirichlet Allocation. LDA assumes that documents with similar topics will use similar words. ("Documents are probability distribution over latent topics. Topics are probability distribution over words.")

In this case, I created a document for each sentence I have written over the years. I had 1005 documents for before high school, 4030 documents during high school,  2589 documents during college, and 3204 documents post-college.

Each document was then "cleaned" by removing punctuation, lower-casing words, removing stop-words, and lemmatizing words. Then, the documents were tokenized and bigrams were created.
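A minimal sketch of those cleaning steps, using only the Python standard library. It is an illustration rather than the original code: the stop-word list is a tiny hypothetical one, lemmatizing is left out (it needs a library such as NLTK or spaCy), and every adjacent pair of remaining words becomes a bigram, which may be more permissive than the bigram detection actually used.

```python
import string

# Hypothetical, deliberately tiny stop-word list; a real project would use a
# fuller list from an NLP library.
STOP_WORDS = {"i", "a", "an", "the", "was", "in", "my", "of", "to", "and", "at"}


def clean(sentence):
    """Lower-case, strip punctuation, and drop stop-words (no lemmatizing)."""
    no_punct = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in no_punct.split() if word not in STOP_WORDS]


def to_bigrams(tokens):
    """Join each pair of adjacent tokens with an underscore."""
    return [f"{first}_{second}" for first, second in zip(tokens, tokens[1:])]


tokens = clean("I was the table leader in fourth grade!")
print(tokens)
print(to_bigrams(tokens))
```

Joining the pair with an underscore matches the "table_leader" style of the bigrams listed earlier.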

Then comes LDA. We need to create a corpus from our bigrams by first assigning each bigram a number and then counting how many times each bigram appears in a document. This is done using gensim. I used the code below to create my corpus.
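For readers without gensim at hand, the idea can be sketched in plain Python. This is an illustrative stand-in, not the gensim code referred to above: `build_dictionary` and `doc_to_bow` are hypothetical helper names that mimic what gensim's `corpora.Dictionary` and its `doc2bow` method do, and the two sample documents are made up.

```python
from collections import Counter


def build_dictionary(documents):
    """Assign each distinct bigram an integer id, in order of first appearance."""
    token_to_id = {}
    for doc in documents:
        for token in doc:
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id


def doc_to_bow(doc, token_to_id):
    """Return sorted (bigram_id, count) pairs for one document."""
    counts = Counter(token_to_id[token] for token in doc)
    return sorted(counts.items())


# Hypothetical documents, each already reduced to a list of bigrams.
documents = [
    ["table_leader", "fourth_grade", "table_leader"],
    ["fourth_grade", "report_card"],
]
token_to_id = build_dictionary(documents)
corpus = [doc_to_bow(doc, token_to_id) for doc in documents]
print(token_to_id)
print(corpus)
```

Each document ends up as a short list of (id, count) pairs, which is the bag-of-words form that LDA takes as input.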




Now, to use LDA, we have to specify how many topics we assume there are, or want to have, in our documents. We can pick an arbitrary number, but coherence scores offer a way to choose. From my understanding, the higher the coherence score, the more semantic similarity there is between the high-scoring bigrams in a certain topic. (Topics are made up of a group of bigrams.) It is an okay evaluation for the model in some cases, so I decided to compute the coherence score for every number of topics between 2 and 15 inclusive.

What I found was that for the before high school corpus, the greatest coherence score was achieved at 9 topics. But it got tricky for the other corpora. The graphs all had some sort of U shape, showing that the max coherence score was either at very few topics (2 or 3) or a lot of topics (14-15). In that case, for during high school I picked 8 topics, which was the first peak after a drop in coherence score from 2 topics, as you can see in the graph below.


















I could have picked 6 as well, but I had a lot more documents during high school than before high school, and it didn't make sense that I would have fewer topics than before high school.
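The "first peak after the initial drop" rule can be written down as a small function. This is only a sketch of that rule of thumb: the function name is hypothetical and the coherence values are invented for illustration, not read off the actual graph.

```python
def first_peak_after_drop(scores):
    """Given {num_topics: coherence}, skip the initial decline, then climb
    until the score stops rising and return that number of topics."""
    ks = sorted(scores)
    i = 0
    # Walk down the initial drop from the smallest number of topics.
    while i + 1 < len(ks) and scores[ks[i + 1]] < scores[ks[i]]:
        i += 1
    # Climb until the next score is lower than the current one.
    while i + 1 < len(ks) and scores[ks[i + 1]] >= scores[ks[i]]:
        i += 1
    return ks[i]


# Invented coherence scores with the U-like shape described above.
example_scores = {
    2: 0.62, 3: 0.50, 4: 0.47, 5: 0.49, 6: 0.52,
    7: 0.54, 8: 0.57, 9: 0.53, 10: 0.51,
}
print(first_peak_after_drop(example_scores))
```

A rule like this only proposes a candidate; checking whether the resulting topics make sense is still a judgment call.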

For the college corpus, the coherence graph looked similar to the high school coherence graph and I picked 6 topics. For post-college, I picked only 3 topics as the graph looked like this:


















I also checked the topics that came out to see if I could make sense of them for the number of topics chosen and I was satisfied with them. I suppose it is hard to truly evaluate how many topics one should choose for LDA, as there will always be a subjective component to it.

The Topics

So after I chose how many topics I wanted each corpus to produce, I got some results.
For before high school, these were my results:

[(0,
    '0.115*"eating_pizza" + 0.115*"digged_paper" + 0.104*"lowest_score" + '
    '0.092*"bumped_head" + 0.092*"sicky_problem" + 0.082*"hong_kong" + '
    '0.072*"butter_corn" + 0.007*"table_leader" + 0.007*"bad_news" + '
    '0.007*"new_year"'),
   (1,
    '0.306*"dragon_ball" + 0.121*"going_die" + 0.096*"coney_island" + '
    '0.082*"eleven_clock" + 0.066*"flying_chair" + 0.035*"adventure_fall" + '
    '0.035*"free_whale" + 0.028*"multiplication_bee" + 0.005*"year_old" + '
    '0.005*"jerwel_box"'),
   (2,
    '0.234*"whole_class" + 0.146*"fourth_grade" + 0.096*"sesame_place" + '
    '0.089*"student_month" + 0.089*"canal_street" + 0.056*"wat_nice" + '
    '0.038*"adventure_fall" + 0.005*"sign_return" + 0.005*"progress_report" + '
    '0.005*"social_study"'),
   (3,
    '0.182*"year_old" + 0.152*"report_card" + 0.152*"dear_journal" + '
    '0.074*"best_wish" + 0.064*"big_candle" + 0.051*"chat_chat" + '
    '0.051*"entered_multiplication" + 0.051*"ice_cream" + 0.005*"easy_peesy" + '
    '0.005*"whole_class"'),
   (4,
    '0.214*"report_card" + 0.197*"social_study" + 0.090*"even_though" + '
    '0.090*"progress_report" + 0.071*"hope_feel" + 0.007*"new_year" + '
    '0.007*"whole_class" + 0.007*"easy_peesy" + 0.007*"hersey_bar" + '
    '0.007*"paste_sticker"'),
   (5,
    '0.234*"report_dued" + 0.191*"good_night" + 0.103*"younger_sister" + '
    '0.083*"police_officer" + 0.083*"blood_black" + 0.006*"whole_class" + '
    '0.006*"dragon_ball" + 0.006*"dear_journal" + 0.006*"report_card" + '
    '0.006*"year_old"'),
   (6,
    '0.251*"dear_journal" + 0.203*"table_leader" + 0.097*"picture_frame" + '
    '0.067*"ca_unlock" + 0.067*"jerwel_box" + 0.006*"new_year" + '
    '0.006*"multiplication_bee" + 0.006*"adventure_fall" + 0.006*"report_card" + '
    '0.006*"year_old"'),
   (7,
    '0.312*"new_year" + 0.084*"phone_number" + 0.068*"red_eye" + '
    '0.067*"really_high" + 0.067*"around_five" + 0.067*"stop_bleeding" + '
    '0.067*"sign_return" + 0.036*"free_whale" + 0.005*"fourth_grade" + '
    '0.005*"progress_report"'),
   (8,
    '0.147*"easy_peesy" + 0.100*"smartest_girl" + 0.100*"paste_sticker" + '
    '0.100*"need_sleep" + 0.100*"hersey_bar" + 0.091*"bad_news" + '
    '0.079*"throw_dice" + 0.006*"lowest_score" + 0.006*"wat_nice" + '
    '0.006*"good_night"')]

As you can see, there are 9 groups (labeled 0-8) of bigrams. Each group of bigrams represents one topic. LDA does not name your topics for you, so you have to come up with your own interpretation of what each grouping of bigrams means. Each number represents the probability that the given bigram will be used in that topic. Within a topic, the bigrams are sorted by that probability in descending order, so the first bigram carries the most weight.
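Because each topic comes back as one long string, it can help to split it into (bigram, weight) pairs before comparing topics. A minimal sketch using only the standard library; `parse_topic` is a hypothetical helper name, and the sample string is the start of topic 1 from the output above.

```python
import re

# Matches pieces like 0.306*"dragon_ball" inside a topic string.
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')


def parse_topic(topic_string):
    """Turn a topic string into a list of (bigram, weight) pairs."""
    return [(term, float(weight)) for weight, term in TERM_PATTERN.findall(topic_string)]


sample = '0.306*"dragon_ball" + 0.121*"going_die" + 0.096*"coney_island"'
pairs = parse_topic(sample)
for bigram, weight in pairs:
    print(f"{bigram}: {weight}")
```

Once the pairs are separated, it is easy to see that the weights only list the top bigrams and so add up to less than 1.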

The during high school corpus created these topics:
[(0,
  '0.136*"ivy_league" + 0.082*"social_science" + 0.082*"fuck_shit" + '
  '0.043*"joyce_thomas" + 0.040*"wo_able" + 0.037*"sat_prep" + '
  '0.033*"park_ranger" + 0.033*"cough_suppressant" + '
  '0.029*"competitive_college" + 0.028*"garbage_bag"'),
 (1,
  '0.048*"last_forever" + 0.039*"aquarium_docent" + 0.038*"rice_ball" + '
  '0.034*"new_hairstyle" + 0.034*"spur_moment" + 0.032*"real_sorrow" + '
  '0.030*"blood_pressure" + 0.029*"tank_top" + 0.029*"force_sneaker" + '
  '0.029*"pair_air"'),
 (2,
  '0.229*"fuck_fuck" + 0.077*"bad_attribute" + 0.055*"sat_ii" + '
  '0.047*"fish_hook" + 0.047*"professional_job" + 0.038*"grandmother_grave" + '
  '0.033*"regent_week" + 0.031*"ap_spanish" + 0.026*"basketball_board" + '
  '0.025*"ap_class"'),
 (3,
  '0.051*"math_team" + 0.039*"play_basketball" + 0.034*"new_york" + '
  '0.030*"eat_breakfast" + 0.028*"rite_aid" + 0.027*"aunt_uncle" + '
  '0.026*"staten_island" + 0.024*"speak_english" + 0.022*"dim_sum" + '
  '0.021*"book_buddy"'),
 (4,
  '0.080*"birthday_party" + 0.052*"horseshoe_crab" + 0.046*"fur_seal" + '
  '0.044*"touch_pool" + 0.038*"sea_star" + 0.034*"math_fair" + '
  '0.032*"lunar_new" + 0.028*"king_highway" + 0.028*"back_forth" + '
  '0.025*"random_guy"'),
 (5,
  '0.115*"red_envelope" + 0.059*"king_plaza" + 0.049*"everyone_else" + '
  '0.049*"living_room" + 0.045*"ice_cream" + 0.038*"dark_secret" + '
  '0.037*"coney_island" + 0.036*"good_luck" + 0.033*"ca_wait" + '
  '0.030*"linda_mindy"'),
 (6,
  '0.067*"ipod_touch" + 0.058*"nursing_home" + 0.046*"photography_club" + '
  '0.036*"ap_lit" + 0.036*"talked_iff" + 0.035*"eats_sleep" + '
  '0.035*"brother_wife" + 0.035*"dad_trimmed" + 0.035*"turned_head" + '
  '0.035*"home_depot"'),
 (7,
  '0.050*"summer_reading" + 0.042*"hot_pot" + 0.038*"cent_store" + '
  '0.038*"took_picture" + 0.035*"year_eve" + 0.033*"celebrate_birthday" + '
  '0.032*"least_worry" + 0.031*"hope_safe" + 0.029*"talk_trivial" + '
  '0.028*"long_thin"')]

The during college corpus created these topics:
[(0,
  '0.086*"fall_asleep" + 0.082*"career_path" + 0.076*"host_mom" + '
  '0.039*"emergency_room" + 0.035*"chinese_american" + 0.035*"read_compass" + '
  '0.033*"fit_lifestyle" + 0.030*"stream_consciousness" + '
  '0.027*"everyone_else" + 0.024*"worth_gift"'),
 (1,
  '0.103*"high_school" + 0.074*"family_member" + 0.059*"study_abroad" + '
  '0.038*"pretty_cool" + 0.035*"certain_way" + 0.035*"paper_cup" + '
  '0.034*"month_ago" + 0.029*"un_poco" + 0.026*"food_court" + '
  '0.021*"por_dios"'),
 (2,
  '0.033*"dim_sum" + 0.029*"vanity_fancy" + 0.028*"spring_fling" + '
  '0.027*"graduate_school" + 0.023*"jodi_picoult" + 0.022*"grad_school" + '
  '0.022*"change_career" + 0.021*"nature_preserve" + 0.020*"apple_hill" + '
  '0.020*"education_system"'),
 (3,
  '0.061*"growing_vegetable" + 0.041*"going_hiking" + '
  '0.041*"peruvian_restaurant" + 0.041*"getting_married" + 0.041*"pio_pio" + '
  '0.041*"truly_understand" + 0.041*"bally_laundromat" + 0.030*"bite_nail" + '
  '0.026*"walked_around" + 0.023*"estas_imaginaciones"'),
 (4,
  '0.108*"actuarial_exam" + 0.039*"concrete_dream" + 0.033*"flawless_muse" + '
  '0.030*"teach_america" + 0.024*"new_shoe" + 0.024*"group_therapy" + '
  '0.021*"psychotic_episode" + 0.020*"mud_bath" + 0.019*"pas_exam" + '
  '0.018*"studying_p"'),
 (5,
  '0.058*"new_york" + 0.050*"estoy_aquí" + 0.041*"el_mundo" + '
  '0.037*"siempre_estoy" + 0.035*"city_madrid" + 0.032*"best_worst" + '
  '0.028*"e_tan" + 0.028*"en_mi" + 0.028*"follow_compass" + '
  '0.025*"existential_crisis"')]

Topic 4 for the college corpus makes some sense to me: it is about career goals and a desire to attain things that aren't attainable, like being a "flawless_muse". Topic 4 also has a lot of words related to the actuarial exams, including exam P.

The post college corpus created these topics:
[(0,
  '0.033*"mental_health" + 0.030*"back_ny" + 0.027*"need_relax" + '
  '0.026*"high_school" + 0.024*"taking_risk" + 0.023*"comfort_home" + '
  '0.022*"dating_jon" + 0.017*"give_cash" + 0.016*"job_description" + '
  '0.014*"dim_sum"'),
 (1,
  '0.041*"kaggle_competition" + 0.028*"imposter_syndrome" + '
  '0.028*"throughout_day" + 0.025*"weekly_spread" + 0.024*"wisdom_tell" + '
  '0.023*"make_difference" + 0.019*"gratitude_log" + 0.017*"pay_attention" + '
  '0.017*"afraid_letting" + 0.017*"sam_harris"'),
 (2,
  '0.107*"data_science" + 0.037*"data_scientist" + 0.026*"bullet_journal" + '
  '0.019*"believed_ability" + 0.016*"someone_else" + 0.015*"bring_joy" + '
  '0.015*"apply_job" + 0.014*"passed_away" + 0.013*"new_jersey" + '
  '0.013*"letting_go"')]

Topic 2 in post college is a solid topic as well. It is clearly about data science and I was interning in New Jersey as a data science intern as well.

Trends

I would say that I had a pretty school-centered childhood, a somewhat angsty high school phase, some sort of existential or career crisis in college, and a period of greater acceptance of how things are post-college. This may just be tainted by my own perspective, of course. I could keep on delving into the analysis of my life, but at this time, I don't want to. This diary project was interesting and it gave me some insights into what I journal about and what I cared about in the distant and recent past.

This project aside, I used to think that keeping a journal was a habit I never had. But seeing all these entries that built up over the years, I can say I do have a habit of journal writing and that it is a habit I would like to keep. It is a joy to read them one more time.

Thursday, May 7, 2020

Diary Project Part 1: Diary Page Generator App

The History

I have kept a diary for 18 years. Although I don't write in my diary every day and some entries are missing, I recognized my diary collection as a treasure trove of my memories and events that I experienced in my life. It was also a great source of data for doing an NLP project.

I thought, what if I can program a computer to write like me? That was the start of my diary generator app.

A Moment

Here is an excerpt from a "page" of my generated diary:

"I am able to listen to my parents talk about things and my own needs. Honestly, I really want to be able to listen to them, but I don't listen very much. I think I really need to learn how to speak to my own needs. I think I know what I really want to know, but I haven't really communicated it to my parents very well. My dad doesn't like to talk to me so much, but he does talk to me occasionally. I think I am getting better at communicating with dad because I am reflecting back what I say to him. Honestly, I think I need to do more of the following: 1. Be more present. Be passionate about something. Be passionate about math. Talk to more people. Try to keep up with habits. 2. Respect my parents. Be authentic to people. 3. Give accurate and detailed descriptions of what you are doing. 4. Be up to speed with what you are doing. Learning is not just about finding what you are doing. It is also about being noticed."

I can say that there is an essence of me in this generated text. I truly want to be able to communicate with my parents better, and I vaguely remember writing something on that topic. And the numbered list of things "I think I need to do" is mostly true. The being "more present" part could not be more true. But, aside from the meaning of the generated text, the writing style does feel like a diary entry of mine.

The Details

The heart of this text-generator is a GPT-2 model. I used Max Woolf's gpt-2-simple (https://github.com/minimaxir/gpt-2-simple) to fine-tune a GPT-2 model on my diary dataset.
According to Wikipedia, GPT-2 is a text-generating model developed by OpenAI. It was trained on a corpus called WebText, which has over 8 million documents scraped from web pages linked on Reddit. It is great at generating fake news, and was initially considered enough of a threat that OpenAI held back the full model. (This project was just for fun and I don't intend to threaten anybody with my diary entries.)

I was able to find and type up 393 diary entries of varying lengths and concatenate them into one .txt file. The GPT-2 model can be fine-tuned on a custom corpus supplied as a .txt file with each sentence on a new line. I chose the "small" 124M parameter model to fine-tune on my diary entries.

After fine-tuning the model for about 45 minutes, I was able to start generating text with it. Although my app only has one setting, which is the default setting of length and temperature, it is possible to tweak it so that it generates more or less text with more or less "creativity" ("letting the model pick suboptimal predictions"). You can even enter a prefix to tell the generator to start from a certain letter, word, or phrase. But, my app only has one button to run the model and generate text.
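To make the "creativity" setting more concrete, here is a minimal sketch of how temperature reshapes the probabilities a model samples from. It is a generic illustration, not gpt-2-simple's internals: the candidate words and their scores are hypothetical, and `softmax_with_temperature` is a name chosen for this example.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; higher temperature flattens them."""
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical next-word candidates and scores, not real GPT-2 output.
words = ["diary", "journal", "pizza"]
logits = [2.0, 1.0, 0.1]

for temperature in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])

# Sample a few next words at a moderate temperature.
rng = random.Random(0)
print(rng.choices(words, weights=softmax_with_temperature(logits, 0.7), k=5))
```

At a low temperature nearly all the probability sits on the top-scoring word; raising it gives the less likely ("suboptimal") words more of a chance to be picked.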

The fine-tuning was run in a Google Colab notebook, specifically with this notebook - https://colab.research.google.com/drive/1VLG8e7YSEwypxU-noRNhsv5dW4NfTGce

After I downloaded the fine-tuned model to my computer, I created a Flask app to run it. I then used Docker and Google Cloud to deploy my app.

This was my first time using Docker and Google Cloud.

You can access my diary generator app here - https://diary-khmwtmm5lq-uc.a.run.app/

And the code for it here - https://github.com/morningkaren/diary-generator

Stay tuned for Diary Project Part 2, where I delve into the topic modeling portion of my project.

Sunday, April 19, 2020

CORD-19 Submission: Closed Domain Question & Answer (CDQA) Search & Summarize (CDQASS) on COVID-19 Literature

The Team
A group of us from my company came together to tackle a Kaggle competition where we were given 50,000+ scientific research articles about COVID-19 and related diseases and around 10 main tasks with sub-questions that needed to be answered. Although our team was in different time zones, we made time to get together to come up with a lovely submission over a span of 2.5 weeks.

Creating a Search Engine and Question and Answering System
During that time, we decided that we would create a search engine that can answer questions. We came up with many ideas for how to approach the question and answering problem, including using GPT2 on all the article abstracts to come up with a hypothetical answer to a query or question input. But the answers that GPT2 comes up with could be wrong, since the abstracts are just inspiration for the generated answers and not the real answers themselves. We also tried to use BERT and ALBERT fine-tuned on SQuAD, but when the input text was too large, neither model was able to pick out the right answer to the question asked. That is a common challenge with Question and Answering as a Natural Language Understanding task.

To address the challenge of having a large input text, we used a closed domain question and answer model based on a retriever-reader dual algorithmic approach developed by André Macedo Farias et al. [1][2]. The CDQA model has a cousin, the ODQA (open domain question answer). The ODQA pipeline first narrows down the input text to the top articles where the answer might be present (The Retriever) using search (e.g., tf-idf, BM25) and then finds the best potential answer (The Reader) using a Q&A model. Our CDQA model was just trained on the articles in our dataset. After using CDQA, we summarized the top answers using an abstractive summarizer. We also included a WordCloud visualization at the end.
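The Retriever half of that pipeline can be illustrated with a small tf-idf ranking in plain Python. This is a sketch of the general idea only, not the CDQA library's implementation: all function names are hypothetical, the three "articles" are made-up one-liners, and a real Reader model would then search the top-ranked articles for the answer span.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def tfidf_vectors(docs):
    """Build one sparse tf-idf vector (a dict) per document."""
    tokenized = [tokenize(doc) for doc in docs]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(len(docs) / freq) + 1.0 for term, freq in doc_freq.items()}
    vectors = [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf


def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, docs, top_k=2):
    """Return the indices of the top_k documents most similar to the query."""
    vectors, idf = tfidf_vectors(docs)
    query_vec = {t: c * idf.get(t, 0.0) for t, c in Counter(tokenize(query)).items()}
    ranked = sorted(range(len(docs)), key=lambda i: cosine(query_vec, vectors[i]), reverse=True)
    return ranked[:top_k]


# Made-up stand-ins for article text.
articles = [
    "bats are a major reservoir of coronaviruses",
    "chloroquine was studied as a treatment",
    "economic impact of a large pandemic",
]
print(retrieve("did the virus come from bats", articles, top_k=1))
```

Narrowing the search this way is what keeps the Reader's input small enough for a model like BERT to handle.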

Here is a diagram of the steps our final solution consisted of:

[Figure: Die! Corona! Kaggle team - Pipeline.png]

A detailed walk-through of this approach can be found in our Kaggle kernel at https://www.kaggle.com/ikantdumas/cdqass-die-corona 
You can even run the kernel yourself, ask it a question, and see the output. 

Here are some examples of questions and answers you can get:

Query: What are treatments for COVID-19?
Summary: Early identification, timely and effective treatments, maintenance of hemodynamics and electrophysiological stability are of great significance on effective treatment and long-term prognosis. We suggest traditional Indian medicinal plants as possible novel therapeutic approaches, exclusively targeting SARS-CoV-2 and its pathways. There is theoretical, experimental, preclinical and clinical evidence of the effectiveness of chloroquine in patients affected with COVID-19.
Wordcloud:




















Query: What is the geographic distribution of COVID-19?
Summary: Coronavirus disease 2019 (COVID-19) is a newly emerged infection of severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) and has been pandemic all over the world. Covid-19 has infected more than 300,000 patients and become a global health emergency due to the very high risk of spread and impact.
Wordcloud:





















Query: What is the economic impact of this pandemic?
Summary: Although SARS infected some 10,000 individuals, killing around 1000, it did not lead to the devastating health impact that many feared, but a rather disproportionate economic impact. A large-scale pandemic could cause severe health, social, and economic impacts. Interventions can reduce the impact of an outbreak and buy time until vaccines are developed, but they may have high social and economic costs.
Wordcloud: 





















Query: Did the coronavirus originate from bats?
Summary: Bats are a major reservoir of viruses, a few of which have been highly pathogenic to humans. Severe acute respiratory syndrome (SARS)-like WIV1-coronavirus (CoV) was first isolated from Rhinolophus sinicus bats. MERS-CoV was believed to be of zoonotic origin from bats with dromedary camels as intermediate hosts.
Wordcloud:





















Query: What are risk factors of COVID-19?
Summary: C-reactive protein (CRP) levels, NCP severity, and underlying comorbidities were the risk factors for cardiac abnormalities. Poor sleep quality and high working pressure were positively associated with high risks of COVID-19. tuberculosis (MTB), the pathogen that causes TB and latently infects ~25% of the global population, may be a risk factor for SARS-CoV-2 infection and severe COVID-19 pneumonia.
Wordcloud:




















References