Written by Asmit Joshi on August 30, 2021

How web scraping and data mining can help predict, track and contain current and future disease outbreaks

Coronavirus as seen under a microscope
A collection of particles (coloured pink) of the new coronavirus emerging from an infected cell in a scanning-electron-microscope image. (Credit: NIAID-RML/de Wit/Fischer/Nature)

COVID-19, the disease caused by a novel strain of coronavirus, started as a rare respiratory illness in the port city of Wuhan, capital of China’s Hubei province. Since 31 December 2019, when the Chinese government first reported several cases of unusual pneumonia, the virus has spread around the world and left governments scrambling to contain it.

On 30 January 2020, the WHO declared the then-unnamed coronavirus a global emergency, with new confirmed cases increasing by the thousands every day. One researcher believes 40 to 70 percent of the world’s population will be infected within the coming year.

As of 11 March 2020, there have been almost 120,000 confirmed cases in more than 100 countries, including more than 4,200 deaths. While the spread appears to be under control in China, cases in the rest of the world, mainly Europe and the USA, are increasing rapidly.

Impact on Global Economy

The Organisation for Economic Cooperation and Development (OECD) has projected that the global economy could grow at its slowest rate (2.4%) since 2009 because of the coronavirus outbreak. It added that the forecast would look much worse if the virus wasn’t contained within the first quarter of 2020 and spread throughout Asia, Europe and North America.

The Dow Jones and FTSE 100 plunged 4.4% and 3.5% respectively on 27 February, as major stock markets lost $1.5 trillion in global share value that week, their worst weekly performance since the 2008 financial crisis. Conditions deteriorated further over the following week, with all major stock markets again posting their worst numbers since the 2008 crisis.

Impact on Global Tech Industry

The world of tech is also not immune to the effects of the coronavirus outbreak.

Almost all major events and conferences have either been cancelled or restricted to online media, including Barcelona’s Mobile World Congress, Facebook F8, Google Cloud Next, Google I/O, IBM’s Think and Austin’s South by Southwest. The economic loss as a result of these cancellations is reportedly more than $1 billion.

Empty Seats at a Conference
Most major events and conferences have either been cancelled or restricted to online media.

Companies are also warning consumers to expect manufacturing and supply chain delays on their products, with offices, stores and factories in China still closed and employees urged to refrain from non-essential travel.

Role of Technology

As local and international authorities continue to contain the outbreak, incorporating data and technology into day-to-day decision-making would be not only shrewd but also highly effective. With more relevant data, authorities can build a fuller picture of the outbreak and take aggressive measures where they are needed.

However, quick access to accurate and reliable data is not straightforward in the current climate of privacy concerns, fake news and conflicting information between sources.

The WHO has encouraged researchers, governments, businesses and scientific communities to collaborate and share data among themselves to better understand the virus and its spread, and to develop concrete action plans. This data will also be crucial in developing vaccines and preventing similar outbreaks in the future.

Artificial Intelligence and Data Analytics

AI and Big Data are at the forefront of the technological involvement in combating the global outbreak. 

Techniques like web scraping and data mining play integral roles by gathering factual data and minimizing the flow of misinformation. This data helps doctors and health experts to assess their successes or failures, and reorient their actions. 


HealthMap CoVID-19 Data
Visualization of COVID-19 spread by HealthMap as of 11 March, 2020.

Tools like HealthMap (above) and the Johns Hopkins University dashboard (below) are good examples. Both have become some of the most popular resources for information on the current outbreak.

JHU CoVID-19 Data
Visualization of COVID-19 cases by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (JHU) as of 11 March, 2020.

These tools use web scraping, data mining, machine learning and Geographic Information Systems (GIS) technologies to collect information from a variety of sources, including local, national and international public hospitals and health centers, news reports, chatrooms and forums. This disparate data is then organized to generate visualizations that show what course the outbreak is taking.
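To make the scraping step concrete, here is a minimal sketch of turning a published HTML table of case counts into structured data. It is only an illustration, not how HealthMap or the JHU dashboard actually work: the table layout, the `CaseTableParser` class, the `parse_case_table` function and the sample figures are all assumptions made for this example, and a real scraper would first fetch the page over HTTP.

```python
from html.parser import HTMLParser


class CaseTableParser(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_case_table(html):
    """Return {region: confirmed_cases} from a two-column HTML table."""
    parser = CaseTableParser()
    parser.feed(html)
    parser.close()
    counts = {}
    for row in parser.rows[1:]:  # skip the header row
        if len(row) >= 2:
            region, confirmed = row[0], row[1]
            counts[region] = int(confirmed.replace(",", ""))
    return counts


# Made-up sample page fragment (hypothetical regions and numbers).
SAMPLE_HTML = """
<table>
  <tr><th>Region</th><th>Confirmed</th></tr>
  <tr><td>Region A</td><td>1,250</td></tr>
  <tr><td>Region B</td><td>87</td></tr>
</table>
"""

if __name__ == "__main__":
    print(parse_case_table(SAMPLE_HTML))
```

Once every source is reduced to the same region-to-count shape, the results can be merged and passed to a mapping or charting layer.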

Social media is another useful source for such GIS technologies, where users’ posts can be scraped, and keywords (or hashtags) associated with the outbreak turned into actionable data to determine areas of interest. 
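As a rough sketch of the idea above, the snippet below counts posts that carry outbreak-related hashtags and groups them by location. The tag list, the post structure (`location` and `text` fields) and the sample posts are hypothetical; real platforms expose different fields and have their own terms of use.

```python
import re
from collections import Counter

# Hypothetical set of hashtags treated as outbreak-related (lowercase).
OUTBREAK_TAGS = {"covid19", "coronavirus"}
HASHTAG_RE = re.compile(r"#(\w+)")


def count_mentions_by_location(posts, tags=OUTBREAK_TAGS):
    """Count posts per location that contain at least one tracked hashtag."""
    counts = Counter()
    for post in posts:
        found = {tag.lower() for tag in HASHTAG_RE.findall(post["text"])}
        if found & tags:
            counts[post["location"]] += 1
    return counts


# Made-up posts for illustration only.
SAMPLE_POSTS = [
    {"location": "City A", "text": "Clinics are full today #COVID19"},
    {"location": "City A", "text": "Stay home if you can #coronavirus #covid19"},
    {"location": "City B", "text": "Lovely weather #sunshine"},
    {"location": "City B", "text": "Schools closed this week #Coronavirus"},
]

if __name__ == "__main__":
    for location, mentions in count_mentions_by_location(SAMPLE_POSTS).most_common():
        print(location, mentions)
```

A post is counted once even if it carries several tracked hashtags, so the totals reflect the number of posts rather than the number of tags.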

Projects like these supplement traditional data-collection techniques used by organizations like the WHO, and are used by governments and health officials to develop prevention measures and action plans. 

AI in Diagnostics

On 14 March, Chinese media outlet CGTN reported that AI could now detect cases of COVID-19 in 20 seconds with 96% accuracy. The algorithm, a convolutional neural network, learns from 5,000 CT scans and can be trained in a week.

A deep learning classifier first analyzes images for abnormalities, which are then segmented, and a large number of texture features are extracted. The AI can then quickly differentiate between the lungs of patients with common viral pneumonia and those with COVID-19, while also calculating the number and size of lesions, and determining the severity of each case.
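The reported system's internals are not public in this article, so the following is only an illustrative stand-in for the last step described above: given a binary segmentation mask (1 marks lesion pixels), it counts the lesions and measures each one's area using a plain connected-component search. The `lesion_sizes` function and the tiny sample mask are assumptions made for this example.

```python
from collections import deque


def lesion_sizes(mask):
    """Return the pixel area of each 4-connected region of 1s in a 2D mask."""
    if not mask:
        return []
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    sizes = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search over one connected region.
            seen[r][c] = True
            queue = deque([(r, c)])
            area = 0
            while queue:
                y, x = queue.popleft()
                area += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and mask[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            sizes.append(area)
    return sizes


# Made-up 4x5 mask containing three separate regions.
SAMPLE_MASK = [
    [1, 1, 0, 0, 0],
    [1, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0],
]

if __name__ == "__main__":
    sizes = lesion_sizes(SAMPLE_MASK)
    print("lesions:", len(sizes), "areas:", sizes)  # lesions: 3 areas: [3, 2, 1]
```

In practice the mask would come from a trained segmentation model and the areas would be converted from pixels to physical units using the scan's resolution.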

Predictive Analytics

BlueDot is another project that has gone a step further. After collecting disease data, it uses airline flight information to predict where the disease might appear next. The resulting information is valuable in identifying potentially infectious travelers and isolating or quarantining them as soon as they land, containing the contagion at the very first point of contact and preventing any further spread.
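BlueDot's actual model is not described here, so the sketch below shows only one simple way flight data could be combined with disease data: each destination's risk is the sum, over incoming routes, of passenger volume multiplied by the infection prevalence at the origin. The `rank_destinations` function, the route tuples and all numbers are hypothetical.

```python
def rank_destinations(routes, prevalence):
    """Rank destinations by expected number of infected arrivals.

    routes: iterable of (origin, destination, passengers) tuples.
    prevalence: {origin: fraction of travelers assumed to be infected}.
    """
    risk = {}
    for origin, destination, passengers in routes:
        expected = passengers * prevalence.get(origin, 0.0)
        risk[destination] = risk.get(destination, 0.0) + expected
    # Highest expected number of infected arrivals first.
    return sorted(risk.items(), key=lambda item: item[1], reverse=True)


# Made-up inputs for illustration only.
SAMPLE_PREVALENCE = {"Origin X": 0.002, "Origin Y": 0.0005}
SAMPLE_ROUTES = [
    ("Origin X", "City A", 1000),
    ("Origin X", "City B", 4000),
    ("Origin Y", "City A", 2000),
]

if __name__ == "__main__":
    for destination, expected in rank_destinations(SAMPLE_ROUTES, SAMPLE_PREVALENCE):
        print(f"{destination}: {expected:.1f} expected infected arrivals")
```

With these sample inputs, City B ranks above City A because it receives far more passengers from the higher-prevalence origin.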

Other research efforts, like the Global Virome Project (GVP), are building genetic and ecological databases of viruses in animal populations that could potentially be transmitted to humans. The GVP researchers aim to develop vaccines and other preventive measures against potential future outbreaks.

People in Masks
People wearing masks in Hong Kong on 23 January, 2020. (Credit: Bloomberg)

Thanks to advances in AI, machine learning and GIS technologies, disaster response times have never been quicker. But as with everything, there are limitations to the current techniques. There are still some blind spots around the world, such as rural areas and their populations, which may be generating little or no online data at all.

Having said that, the enormous amount of data that these technologies collect can be used to train AI algorithms to better deal with the more disastrous disease outbreaks of the future. 

P.S.: As the COVID-19 outbreak continues to affect lives all over the world, we’d like to urge our readers and customers to stay safe and take all preventive measures. We wish the very best of health to you and your loved ones.
