Sunday, April 19, 2020

CORD-19 Submission: Closed Domain Question & Answer (CDQA) Search & Summarize (CDQASS) on COVID-19 Literature

The Team
A group of us from my company came together to tackle a Kaggle competition where we were given 50,000+ scientific research articles about COVID-19 and related diseases and around 10 main tasks with sub-questions that needed to be answered. Although our team was in different time zones, we made time to get together to come up with a lovely submission over a span of 2.5 weeks.

Creating a Search Engine and Question and Answering System
During that time, we decided to create a search engine that can answer questions. We came up with many ideas for approaching the question answering problem, including running GPT-2 on all the article abstracts to generate a hypothetical answer to an input query or question. However, the answers GPT-2 generates could be wrong, since the abstracts only serve as inspiration for the generated text and are not the real answers themselves. We also tried BERT and ALBERT fine-tuned on SQuAD, but when the input text was too large, neither model was able to pick out the right answer to the question asked. This is a common challenge with Question Answering as a Natural Language Understanding task.

To address the challenge of large input text, we used a closed domain question answering (CDQA) model based on a two-stage retriever-reader approach developed by André Macedo Farias et al. [1][2]. The CDQA model has a cousin, ODQA (open domain question answering). The ODQA pipeline first narrows the input text down to the top articles where the answer might be present (the Retriever) using search (e.g., tf-idf, BM25), and then finds the best potential answer within them (the Reader) using a Q&A model. Our CDQA model was trained only on the articles in our dataset. After running CDQA, we summarized the top answers using an abstractive summarizer. We also included a WordCloud visualization at the end.
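
The retriever-reader idea described above can be illustrated with a minimal Python sketch. This is not the code from our kernel or the CDQA library's API: the function names (bm25_rank, toy_reader, answer), the BM25 parameters k1 and b, and the three toy documents are all assumptions made for illustration. In particular, toy_reader is only a stand-in that picks the sentence with the most query-term overlap, whereas a real Reader would be a fine-tuned Q&A model such as BERT.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Retriever stage: return document indices sorted by descending BM25 score."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)


def toy_reader(query, passage):
    """Reader stage (hypothetical stand-in for a neural Q&A model):
    return the sentence sharing the most terms with the query."""
    q = set(tokenize(query))
    sentences = re.split(r"(?<=[.!?])\s+", passage.strip())
    return max(sentences, key=lambda s: len(q & set(tokenize(s))))


def answer(query, docs, top_k=2):
    """Run the retriever, then apply the reader only to the top_k documents."""
    top = bm25_rank(query, docs)[:top_k]
    return [(i, toy_reader(query, docs[i])) for i in top]


# Toy corpus, invented for this example.
docs = [
    "Bats are a major reservoir of viruses. Several coronaviruses were first isolated from bats.",
    "Hand washing reduces transmission. Masks are also recommended in crowded places.",
    "Economic costs of a pandemic can be severe. Interventions buy time until vaccines exist.",
]
results = answer("Did the coronavirus originate from bats?", docs, top_k=1)
for doc_index, sentence in results:
    print(doc_index, sentence)
```

The point of the two stages is that the expensive Reader only ever sees the few passages the cheap Retriever ranks highest, which sidesteps the input-length problem we hit when feeding large texts directly to BERT and ALBERT.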

Here is a diagram of the steps our final solution consisted of:

[Figure: pipeline diagram (Die! Corona! Kaggle team - Pipeline.png)]

A detailed walk-through of this approach can be found in our Kaggle kernel at https://www.kaggle.com/ikantdumas/cdqass-die-corona 
You can even run the kernel yourself, ask it a question, and see the output. 

Here are some examples of questions and answers you can get:

Query: What are treatments for COVID-19?
Summary: Early identification, timely and effective treatments, maintenance of hemodynamics and electrophysiological stability are of great significance on effective treatment and long-term prognosis. We suggest traditional Indian medicinal plants as possible novel therapeutic approaches, exclusively targeting SARS-CoV-2 and its pathways. There is theoretical, experimental, preclinical and clinical evidence of the effectiveness of chloroquine in patients affected with COVID-19.
Wordcloud: [image not shown]

Query: What is the geographic distribution of COVID-19?
Summary: Coronavirus disease 2019 (COVID-19) is a newly emerged infection of severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) and has been pandemic all over the world. Covid-19 has infected more than 300,000 patients and become a global health emergency due to the very high risk of spread and impact.
Wordcloud: [image not shown]

Query: What is the economic impact of this pandemic?
Summary: Although SARS infected some 10,000 individuals, killing around 1000, it did not lead to the devastating health impact that many feared, but a rather disproportionate economic impact. A large-scale pandemic could cause severe health, social, and economic impacts. Interventions can reduce the impact of an outbreak and buy time until vaccines are developed, but they may have high social and economic costs.
Wordcloud: [image not shown]

Query: Did the coronavirus originate from bats?
Summary: Bats are a major reservoir of viruses, a few of which have been highly pathogenic to humans. Severe acute respiratory syndrome (SARS)-like WIV1-coronavirus (CoV) was first isolated from Rhinolophus sinicus bats. MERS-CoV was believed to be of zoonotic origin from bats with dromedary camels as intermediate hosts.
Wordcloud: [image not shown]

Query: What are risk factors of COVID-19?
Summary: C-reactive protein (CRP) levels, NCP severity, and underlying comorbidities were the risk factors for cardiac abnormalities. Poor sleep quality and high working pressure were positively associated with high risks of COVID-19. tuberculosis (MTB), the pathogen that causes TB and latently infects ~25% of the global population, may be a risk factor for SARS-CoV-2 infection and severe COVID-19 pneumonia.
Wordcloud: [image not shown]

References