The term Black Swan has become a metaphor for a rare event that is unforeseen and has an enormous impact - an event in human history that was unprecedented and unexpected at the time it occurred. Economist Nassim Nicholas Taleb popularized the term in his 2001 book Fooled By Randomness. The Black Swan Theory originates with a 2nd-century Latin expression by the Roman poet Juvenal, who characterized such phenomena as: "rara avis in terris nigroque simillima cygno".
This Latin expression translates to "a rare bird in the lands and very much like a black swan". When the phrase was coined, black swans were widely believed not to exist. Commonly cited examples of Black Swan events include the success of Google, the 9/11 attacks, the dissolution of the Soviet Union, and the rise of the internet. Since Chinese authorities first reported the coronavirus to the World Health Organization, the term has become synonymous with the pandemic. A Google News search for "Black Swan Event COVID-19" returns over 2.4 million results. However, Nassim Nicholas Taleb, who co-authored a January 26th paper warning that the spread of COVID-19 would be "nonlinear" and potentially severe, denies that the novel coronavirus is a Black Swan event. The pandemic is, in reality, a Grey Rhino.
Michele Wucker, a policy analyst, coined the term Grey Rhino after the 2012 Greek financial crisis and introduced it to economists and world leaders in a speech at the World Economic Forum (WEF) in Davos in 2013. It stands for a known, urgent risk that is not being acted on: a highly probable but neglected threat that has an enormous impact. Regarding the COVID-19 crisis, Wucker recently wrote:
"Given what we know about pandemics and their increased likelihood, outbreaks are highly probable and high impact. I coined the term "Grey Rhino" for exactly such events: prominent, visible, coming right at you, with enormous potential impact and highly probable consequences."
The COVID-19 pandemic continues to ravage the world, and governments, businesses, and individuals are making extraordinary changes, such as mandatory lockdowns and universal basic income, that were unforeseeable in the months preceding the pandemic. Citizens have had to completely change their way of life to confront the emerging economic and health challenges. These shifts in human and institutional behavior have implications that extend far beyond the direct impact of the virus itself. Organizations have invested millions of dollars in machine learning (ML) systems and pipelines, which they depend on to inform their strategic decision making. The abrupt changes in consumer and corporate behavior brought on by the pandemic will likely have a significant impact on the accuracy of forecasting models that rely on historical data to inform their predictions. What does this mean for data and analytics professionals? And what are the consequences for the industry?
Table of Contents
- How COVID-19 Impacts Data Models
- How COVID-19 Impacts Data Teams
- How Data Teams Reclaim Productivity
- Effects of Recession in the Data Sector
- Integrate.io: The Perfect ETL Tool for Data Professionals
How COVID-19 Impacts Data Models
The impact on data science production setups has been dramatic. Many of the models used for segmentation or forecasting started to fail when traffic and shopping patterns changed, supply chains were interrupted, and borders were locked down. Decision-making becomes more challenging during periods of stress, especially where there is uncertainty about the future. We see a mutated and accelerated version of Concept Drift, one of the core problems of data science: handling data whose fundamental nature changes over time.
In the case of COVID, concept drift occurs due to changes in social behavior and the inherent instability in economic activity resulting from social distancing, lockdown, self-isolation, and various other human responses to the pandemic. These reactions weaken or invalidate assumptions about market trends, fraud prediction, demand forecasts, etc., and a large proportion of models built to predict patterns, outcomes, and behaviors are no longer viable. A notable example is fraud detection models. Traditionally, models would see the purchase of a one-way flight ticket as a red flag and a significant indication of airline fraud - of course, this is no longer the case.
COVID shows us that data scientists can't train models on historical data and expect them to work as before once deployed in the real world. The industry needs to become more agile and adaptive and to develop new processes that keep deployed models responsive, so that models are more robust and have adequate fail-safes in a less predictable world. Model auditing and stress testing will become the new normal.
In short, when people's behavior changes fundamentally, data science models based on prior behavior patterns struggle to convey objective realities. Data science systems can sometimes self-correct. In other cases, retraining is impossible and the system is rendered useless because the base assumptions built into it no longer hold. In these cases, the model and the entire data pipeline are meaningless.
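One way to make this kind of drift visible is to compare the distribution of a model input before and after the behavior change. The sketch below uses the Population Stability Index (PSI), which is a technique chosen here for illustration and not one the article prescribes. The sample data, the function name, and the 0.25 alert threshold are all assumptions; the threshold is a common rule of thumb that should be tuned per model.

```python
import math

def population_stability_index(baseline, recent, bins=10):
    """PSI between two numeric samples, using equal-width bins over the baseline range."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def bin_shares(sample):
        counts = [0] * bins
        for x in sample:
            idx = int((x - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values into the edge bins
            counts[idx] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(sample), 1e-6) for c in counts]

    expected = bin_shares(baseline)
    actual = bin_shares(recent)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))

# Hypothetical daily order counts before and during lockdown.
pre_covid = [20 + (i % 10) for i in range(200)]
lockdown = [5 + (i % 10) for i in range(200)]

psi = population_stability_index(pre_covid, lockdown)
drifted = psi > 0.25  # assumed rule-of-thumb threshold
```

A check like this only flags that an input has shifted; deciding whether to retrain, re-weight, or retire the model remains a judgment call.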
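The fail-safes mentioned above could take the form of a simple circuit breaker around a deployed forecast. The following is a minimal sketch under stated assumptions: the class name, the `model_predict` callable, the error threshold, and the naive "last observed value" fallback are all hypothetical choices, not a description of any particular production system.

```python
from collections import deque

class ForecastCircuitBreaker:
    """Serve the model's forecast until its recent error grows too large,
    then fall back to a naive 'last observed value' forecast."""

    def __init__(self, model_predict, max_mean_abs_error, window=7):
        self.model_predict = model_predict          # hypothetical callable: features -> forecast
        self.max_mean_abs_error = max_mean_abs_error
        self.errors = deque(maxlen=window)          # rolling window of absolute model errors
        self.last_actual = None

    @property
    def tripped(self):
        # Trip only once the window is full, so a single bad day does not disable the model.
        full = len(self.errors) == self.errors.maxlen
        return full and sum(self.errors) / len(self.errors) > self.max_mean_abs_error

    def predict(self, features):
        if self.tripped and self.last_actual is not None:
            return self.last_actual                 # naive fallback
        return self.model_predict(features)

    def record(self, features, actual):
        """Feed back the observed outcome so the breaker can track the model's error."""
        self.errors.append(abs(self.model_predict(features) - actual))
        self.last_actual = actual

# A stand-in model that always forecasts 100, facing a sudden drop in demand.
breaker = ForecastCircuitBreaker(lambda features: 100.0, max_mean_abs_error=20.0, window=3)
for actual in (98.0, 35.0, 30.0, 28.0):
    breaker.record(None, actual)
forecast = breaker.predict(None)  # served by the fallback once the breaker has tripped
```

Because the rolling error is always measured against the model rather than the fallback, the breaker closes again on its own if the model's forecasts recover.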
How COVID-19 Impacts Data Teams
Data teams have faced disruption not only in their codebases but also in their working environments. Remote working has become the new normal, and while there are many upsides, there are also many downsides. Data teams are traditionally very collaborative and require a free flow of information and communication for productive output. Software engineering practices and processes cannot simply be applied to machine learning work - they are inherently different disciplines.
Software engineering is essentially the design and implementation of programs that a computer can execute to perform a defined function. It's possible to prove that a program satisfies a formal specification of its behavior, and this has shaped the tools and processes for software engineering. Many data science teams use workflow processes that are very similar to software engineering methodologies. However, they are not very useful. It is not possible to prove the correctness of an AI or ML model. Within (supervised) machine learning, the training set is the only provable or guaranteed aspect of the system; uncertainty is intrinsic to machine learning.
As new processes emerge to improve productivity for remote teams, data professionals should not be treated as interchangeable with software engineers.
How Data Teams Reclaim Productivity
Data teams around the world are using various strategies to cope with current discrepancies and murky data streams. The following are some techniques used to rekindle dying models.
- Cull your data and certify its relevance. Delete anomalies caused by COVID and impute values based on pre-COVID data. Use moving averages, where one computes the average over a sliding subset of the data to balance out random fluctuations. Apply smoothing forecasting techniques to bridge pre- and post-pandemic data.
- Increase the use of external data and decoupled systems to allow for hot-swapping of external data sources. Obtaining objective reality may require feeding the model combinations of disparate data sources. The system design should allow for hot-swapping and fast, easy deployment of variants of the model, so that teams can test a hypothesis against several variants at once rather than relying on one.
- Audit models consistently and perform automated stress testing. In a time of extreme uncertainty, we can no longer rely on established social habits as a basis for our models. A degree of chaos engineering should be in place to find and mitigate Grey Rhino events. Resilience is the key and is only obtainable by rigorous testing. Data teams should liaise with Site Reliability engineers to plan and architect such fail-safes and circuit breakers.
- Compile a library of specialized models that can be pulled off the shelf and deployed with speed; scenario and simulation planning can generate models for Grey Rhino events. For example, it would be possible to create a model for a scenario where a vaccine has not been effective and one where it has been effective in 18-24-year-olds only. As mentioned in the introduction, a Grey Rhino stands for a known, urgent risk that is not being acted on (highly probable but neglected threats that have an enormous impact). There is no excuse not to plan and prepare for such scenarios. It is the key to survival and the best way to mitigate risk.
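The first technique in the list (imputing anomalous values from pre-COVID data, then smoothing) can be sketched in a few lines. This is an illustrative example only: the weekly sales figures, the anomaly flags, and the function names are assumptions, and a real pipeline would need its own rule for deciding which points count as anomalies.

```python
def impute_from_baseline(series, is_anomalous, baseline):
    """Replace flagged points with a baseline value (e.g. a pre-COVID average)."""
    return [baseline if flag else x for x, flag in zip(series, is_anomalous)]

def moving_average(series, window):
    """Trailing moving average; early points average over whatever history exists."""
    out = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def exponential_smoothing(series, alpha):
    """Simple exponential smoothing: s_t = alpha * x_t + (1 - alpha) * s_(t-1)."""
    smoothed = [series[0]]
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

# Hypothetical weekly sales; weeks 4 and 5 are flagged as lockdown anomalies.
weekly_sales = [100, 104, 96, 20, 15, 102]
flags = [False, False, False, True, True, False]
pre_covid_mean = sum(weekly_sales[:3]) / 3

cleaned = impute_from_baseline(weekly_sales, flags, pre_covid_mean)
trend = moving_average(cleaned, window=3)
smooth = exponential_smoothing(cleaned, alpha=0.5)
```

Imputation hides the shock from the model, so it suits cases where the anomaly is expected to be temporary; if the new behavior persists, the flagged points are signal and should be kept.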
Effects of Recession in the Data Sector
Data scientists have been pivotal in the battle against COVID. Healthcare providers are leveraging data from countries that were affected earlier by the pandemic to forecast needs for hospital beds, masks, and ventilators. Society has seen the value of the sector, and surveys indicate that data teams have yet to see widespread layoffs or furloughs, particularly among larger companies. However, workers at tech startups are susceptible to layoffs. The New York Times published an article investigating the challenges that the coronavirus pandemic is presenting for tech startups, in particular:
"The coronavirus is causing a sudden shake-up that has defied comparison for startups. Employees are learning of furloughs or cuts via video calls, plans for IPO are on hold, and funding is drying up for many young tech companies".
Burtch Works conducted a survey in partnership with the International Institute for Analytics: 57% of the 300 survey respondents indicated no change to their staffing and hiring plans, and another 22% have introduced hiring freezes or are preparing for layoffs, but have yet to take action; 5% have introduced "substantial" salary cuts, hiring freezes, or layoffs.
Before COVID, the market for data jobs was strong; according to a Dice Tech Jobs report released in February, demand for data engineers was up 50%, and demand for data scientists was up 32% in 2019 compared to the prior year. Companies have found it very difficult to hire data talent over the last decade, so they will try their utmost to retain the data talent they have.
Integrate.io: The Perfect ETL Tool for Data Professionals
The current environment is challenging and unpredictable for data professionals. There are more unknowns, and choosing the right tool for the job is more important than ever. The COVID world requires an amalgamation of multiple raw data sources across disparate platforms to obtain objective reality in an unpredictable world. ETL tools do just this and will be at the fore.
ETL tools extract data from a source database, transform the data using operations like sorting, joining, reformatting, filtering, merging, and aggregation, and load the information into your data store or data warehouse for swift accessibility and use with your business intelligence tools. Modern ETL tools include graphical interfaces for faster, more accessible results than traditional methods of moving data through hand-coded data pipelines.
ETL tools break down data silos and make data accessible and palatable for analysis. In short, ETL tools are the first essential step in the data warehousing process that eventually distills into data-driven decision making (DDDM) for the business.
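For readers who have not seen one, the extract, transform, and load steps described above can be sketched as a small hand-coded pipeline. This is a toy illustration using in-memory SQLite databases; the table names, columns, and sample rows are assumptions made up for the example and say nothing about how any particular ETL product is implemented.

```python
import sqlite3

def run_etl(source, warehouse):
    # Extract: pull raw rows from the source database.
    rows = source.execute("SELECT region, amount, status FROM orders").fetchall()

    # Transform: filter out cancelled orders, reformat region labels,
    # aggregate amounts per region, and sort the result.
    totals = {}
    for region, amount, status in rows:
        if status != "cancelled":
            key = region.strip().upper()
            totals[key] = totals.get(key, 0.0) + amount
    summary = sorted(totals.items())

    # Load: write the aggregated result into the warehouse table.
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS sales_by_region (region TEXT PRIMARY KEY, total REAL)"
    )
    warehouse.executemany("INSERT OR REPLACE INTO sales_by_region VALUES (?, ?)", summary)
    warehouse.commit()
    return summary

# Hypothetical source data with inconsistent region labels.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (region TEXT, amount REAL, status TEXT)")
source.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    ("emea", 120.0, "paid"), ("EMEA ", 80.0, "paid"),
    ("apac", 50.0, "cancelled"), ("apac", 70.0, "paid"),
])
warehouse = sqlite3.connect(":memory:")
summary = run_etl(source, warehouse)
```

Even at this size the pipeline needs explicit handling for labels, filters, and schema, which gives a sense of the maintenance burden once many sources are involved.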
However, not all ETL tools are the same. Some, like Integrate.io, are vastly easier to use and more powerful than others. Integrate.io is a cloud-based ETL platform that allows you to create visual data pipelines within minutes. Our visual, drag-and-drop interface is so easy to use, you might feel like you're cheating! Schedule an intro call to book a risk-free pilot and see it for yourself.