The Need for Better External Data Discovery in 2021
I don’t think anyone expected 2020 to end up the way it did. It has been a very long time since we had such an eventful year, for better or worse. Even so, 2020 was not all bad. Despite the difficult times, technology gave people and organizations the ability to adapt and face their new circumstances. From new paradigms for remote work to improvements in how we understand our customers, adaptation was the name of the game.
The pandemic was bad news for almost every industry. Retail, food services, hospitality, and most other customer-facing businesses were hit hard by lockdowns, mask mandates, reductions in foot traffic, and a rocky economic landscape. This new status quo forced companies to adapt and find ways to navigate these murky waters effectively.
Organizations increasingly turned to data science and data science platforms to gain some visibility and better respond to the rapidly changing world. 2020 was, in some ways, a pivotal year for data science. As the field grows, however, the things that drive it forward are going to change.
2020 broke our models, so how can we adapt?
One of my biggest takeaways from 2020 was the validation of something we have been seeing for some time: fine-tuning our machine learning (ML) models will only get us so far in the absence of quality data. On the one hand, there are now powerful ML tools available to organizations regardless of their data science expertise. On the other, as more teams competed on models alone, there was less and less room for uplift in results. Tweaking your hyperparameters is a game of diminishing returns; after a while, the gains become marginal.
When we finally grasped the magnitude of the pandemic, there was a moment of panic. However, it took us a few more months to truly understand that the historical data we were feeding our ML models had suddenly become a lot less informative and valuable. Those finely tuned prediction engines could no longer make the right calls, leaving organizations scrambling for answers.
So where do you turn when models that have already been tuned to their maximum potential can no longer give you the answers you need? To the other side of the equation: data. What happens if the last three months of data you collected look nothing like the 16 months before them? Early on, it became apparent that to find solid footing in the new landscape, organizations needed more than just their own data. Consequently, there was a rush to find the right data to feed models.
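To make the "last three months look nothing like the 16 months before" problem concrete, here is a minimal sketch of one common way to check for it: comparing the distribution of a feature in a reference window against a recent window with a two-sample Kolmogorov-Smirnov statistic. This is an illustration, not a method described above. The data, the function names, and the 0.5 threshold are all assumptions made for the example.

```python
from bisect import bisect_right


def ks_statistic(reference, recent):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of the two samples (0 = identical, 1 = no overlap)."""
    ref = sorted(reference)
    rec = sorted(recent)
    max_gap = 0.0
    # The gap between two step functions can only peak at an observed value.
    for x in ref + rec:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_rec = bisect_right(rec, x) / len(rec)
        max_gap = max(max_gap, abs(cdf_ref - cdf_rec))
    return max_gap


def has_drifted(reference, recent, threshold=0.5):
    # The threshold is an illustrative cutoff, not a recommended value.
    return ks_statistic(reference, recent) > threshold


# Hypothetical weekly foot-traffic counts: a stable period, then a disrupted one.
before = [1000 + 10 * (i % 7 - 3) for i in range(64)]  # values from 970 to 1030
after = [600 + 25 * (i % 5) for i in range(12)]        # values from 600 to 700

if has_drifted(before, after):
    print("Recent data no longer resembles the training window.")
```

In practice you would likely run a check like this per feature and use a library routine such as `scipy.stats.ks_2samp`, which also returns a p-value. The point is simply that a shift of this kind can be measured rather than guessed at.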
This is the new reality we live in. More than ever, we need to understand this new shifting landscape, and our models are stumbling in the dark without a guiding light.
2021 will be the year of data discovery
The need for new data means that it has to be accessible quickly and affordably. The volume of data itself is not the issue; there isn’t significantly more data now than there was a year ago (or even three years ago). The major difference in 2021 will be the emergence of data discovery tools that help organizations find the data they need quickly, without spending weeks’ worth of effort and resources.
Data discovery platforms like Explorium give organizations a valuable tool for data science that can cut down on a lot of the legwork. Instead of spending weeks looking for a single valuable dataset, you can use data discovery tools to access thousands in a matter of minutes and ensure that they’re all relevant.
Perhaps nothing exemplified this more for us than a customer who reached out because the models they were already using had seen a massive drop in their ability to make successful predictions. Their models worked, and they had years of success to prove it. However, COVID had thrown a wrench into them, making it harder to assess risk and properly contextualize the new data being collected. By using our proprietary signals to track COVID-related data, events, and impacts, the company was able to quickly retrain its models with information that was relevant and, more importantly, reflected the situation on the ground.
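For readers who want a picture of what "retraining with external signals" involves at the data level, the sketch below joins a hypothetical external signal onto internal records before they are fed back into a model. It is not Explorium's API or the customer's actual pipeline. The field names (`zip`, `foot_traffic_change`), the sample values, and the `enrich` helper are all invented for illustration.

```python
def enrich(internal_rows, external_rows, key):
    """Left-join external columns onto internal rows by a shared key.

    Internal rows without a match are kept unchanged, so no training
    examples are lost. Returns (enriched_rows, match_rate).
    """
    lookup = {row[key]: row for row in external_rows}
    enriched, matched = [], 0
    for row in internal_rows:
        extra = lookup.get(row[key])
        if extra is not None:
            matched += 1
            # Internal values win if a column name appears in both sources.
            row = {**extra, **row}
        enriched.append(row)
    match_rate = matched / len(internal_rows) if internal_rows else 0.0
    return enriched, match_rate


# Hypothetical internal records and a hypothetical external regional signal.
applications = [
    {"zip": "10001", "amount": 5000},
    {"zip": "94105", "amount": 12000},
    {"zip": "60601", "amount": 8000},
]
regional_signal = [
    {"zip": "10001", "foot_traffic_change": -0.45},
    {"zip": "94105", "foot_traffic_change": -0.60},
]

rows, rate = enrich(applications, regional_signal, key="zip")
print(f"{rate:.0%} of internal rows gained an external signal")
```

The match rate matters as much as the join itself: an external dataset that covers only a small share of your records will add little, however relevant it looks on paper.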
Even if you look at it broadly, data discovery will play a key role in expanding the value and ROI of ML in general. Even without the pandemic to account for, ML models are most effective when they have more data to learn from (or, rather, more quality data), and until now, the inability to find it effectively was a major roadblock to adoption and reaching ML’s full potential. Automated data discovery is poised to radically change the way we evaluate data, our models, and how we think of the answers to the predictive questions we ask.
The ability to connect the science and data sides of data science will allow organizations and data scientists to spend more time building new models, improving the ones they have, and understanding their predictive problems. They’ll also be able to do this from a much better foundation, using external data as a springboard for new insights, greater perspective, and scalable implementations.
Picking up the pieces and reimagining data science
2020 was a rough year, no doubt about it. 2021 will be better, and data science will play a large role in helping multiple industries adapt to the new normal. However, to do this, the field will have to embrace the fact that hyperparameter tuning is no longer the only factor in building better models.
To take the field to the next level, we’re going to need faster pipelines to the external data that organizations and data science practitioners need. Fortunately, 2020 showed us that data discovery is a scalable, viable solution, and one that we’re already implementing in the field to resounding success.