Never in history have so many government bodies around the world had so much scope, opportunity, and — most importantly — public support to harvest intricate, personal data on their citizens. The global COVID-19 pandemic has seen emergency surveillance measures that would make Orwell gasp, introduced with little resistance even in the most privacy-conscious nations. It’s not just governments and health departments, either: from giants like Facebook and Google down to small apps like CityMapper, data harvesting and corporate encroachment on privacy are making a lot of people nervous. Customers are warier than ever.
For data scientists and others who are interested in examining this data or building machine learning models, these dubious activities raise major ethical considerations. Do you exploit the potential commercial opportunity they present, knowing the privacy implications — and the fact that you will inevitably have to account for your decision further down the line?
What’s more, while this surveillance and data collection plays a vital role in slowing the spread of coronavirus, citizens, privacy advocates, and international bodies are already scrambling to secure checks and balances that will limit misuse of data once normal life resumes. This will also create significant logistical challenges and have serious consequences for data and machine learning risk in your organization, and you need to be prepared.
Rapid, extensive data collection and modeling continue to play a key role in combating the global COVID-19 crisis, with life-saving results in places like Taiwan and South Korea. However, not all data collected by governments and other organizations directly relates to people’s health.
As countries around the world share their COVID-19 genome sequences through GISAID, open-source applications such as Nextstrain have used this data to model how the virus has evolved and to project how it will continue to evolve.
In some countries, such as China, cutting-edge facial recognition technology that can also detect a rise in body temperature has been rolled out. Ostensibly, this is to spot early warning signs that someone may have contracted COVID-19; similar technology has also been used in Russia to identify people breaking isolation rules.
By monitoring individuals’ geolocation data from their mobile devices, governments are able to track when people are breaking travel restrictions and can better enforce quarantine rules.
In the immediate context of a global pandemic, many people accept that this sweeping data collection is a valuable, even necessary, strategy to save lives and prevent healthcare systems from buckling. The fear, though, is what happens to that data and the systems in place to collect it once the crisis passes.
Having external bodies view all your personal health data is uncomfortable enough on its own. Worse, right now almost everything you do can realistically be treated as health data, including where you go and what you Google.
Having sensitive information passed on to governments by the apps and websites you use isn’t unprecedented, after all. PayPal, for example, shares financial data with government agencies, and the video-sharing social media app TikTok is facing a class-action suit in the U.S. for scraping data from users’ mobiles and transferring it to servers in China, where it would be available to the government on request, all without user permission.
Governments may be tracking phones right now in order to enforce quarantines, but that’s not to say they won’t try to keep tracking the devices of critics, protesters, opposition groups, minority groups, and journalists for more nefarious purposes going forward. It’s not just totalitarian states that run this risk, either; take the US, which, two decades on, continues to rely on the emergency surveillance measures rolled out in the PATRIOT Act in the immediate aftermath of 9/11.
Then there’s the issue of what happens if this data is stolen, hacked, or leaked. Cybercriminals have every reason to want to get their hands on such a potentially lucrative source of information, whether for fraud or to sell to unscrupulous marketers. Even if you trust the organizations in charge of collecting and storing data to do the right thing, if they get hacked, it’s out of your hands — and theirs.
There’s also the risk that open-source datasets and other data changing hands between health researchers could be exploited by private companies for commercial gain, or even that some official bodies may be willing to sell it. The U.K. Department of Health and Social Care already sells personal health information to pharmaceutical giants, and experts warn that the supposedly anonymized data is easy to trace back to individuals.
While there’s no doubt that such granular, personal, hard-to-obtain data presents a huge temptation for many companies, it’s best left alone. Not only is this an ethical minefield that could harm your reputation with customers in the future, but it could also increase business risk, and specifically machine learning risk, later on.
As we’ll see in a moment, improper use of data could land you in hot water with regulatory bodies once the pandemic passes, while leaving you reliant on data sources that are far from guaranteed.
Data privacy concerns have always been a thorny issue in machine learning projects. Even when customer data has been anonymized, multiple experiments have revealed just how easy it is to identify individuals by combining datasets, undermining an individual’s right to privacy.
This is especially true when Big Data repositories create sprawling, centralized databases. Moving all that customer data into one place makes it easier and easier to fill out the picture — in a sense, reforming data points into real people. Regulatory bodies and privacy advocates in Europe and North America have pushed for years for stronger measures to prevent unauthorized parties from cracking the code and identifying the real people behind the data. Given the vast amount of incredibly sensitive health data captured in the current crisis, these efforts to prevent data misuse and privacy breaches are likely to gain traction.
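The re-identification risk described above is often called a linkage attack: joining an “anonymized” dataset with a public one on shared quasi-identifiers (such as ZIP code, birth year, and sex) re-attaches names to sensitive records. The following is a minimal sketch with entirely fabricated toy data, not a real dataset:

```python
# "Anonymized" health records: names removed, but quasi-identifiers remain.
health_rows = [
    {"zip": "90210", "birth_year": 1985, "sex": "F", "diagnosis": "diabetes"},
    {"zip": "10001", "birth_year": 1992, "sex": "M", "diagnosis": "asthma"},
]

# A separate public dataset (e.g. a voter roll) with names and the
# same quasi-identifiers.
voter_roll = [
    {"name": "Alice Smith", "zip": "90210", "birth_year": 1985, "sex": "F"},
    {"name": "Bob Jones",   "zip": "10001", "birth_year": 1992, "sex": "M"},
]

def quasi_id(row):
    # The combination of ZIP, birth year, and sex is often unique per person.
    return (row["zip"], row["birth_year"], row["sex"])

# Join the two datasets on the quasi-identifiers: names meet diagnoses.
lookup = {quasi_id(v): v["name"] for v in voter_roll}
reidentified = [
    {"name": lookup[quasi_id(h)], "diagnosis": h["diagnosis"]}
    for h in health_rows
    if quasi_id(h) in lookup
]
print(reidentified)
```

Removing names alone is clearly not enough; this is why regulators focus on quasi-identifiers and on keeping datasets from being combined in the first place.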
While this should frustrate attempts by bad actors to skirt the rules, it will also create more red tape for companies who have every intention of using data ethically. Without a creative approach and the right tools or platforms to help, you may find it more difficult to collect data yourself, and certainly, to obtain and integrate the external datasets you need to succeed.
Some well-known projects and regulatory attempts to safeguard data privacy across the board, from Big Data for marketing to machine learning in banking, include:
GDPR requires by law that companies secure the clear consent of all users and website visitors before using their data for each new purpose they intend. While an EU regulation, it applies to any company or site that captures data on people who access sites and platforms from within Europe, meaning that in reality, most of the world needs to comply.
The introduction of GDPR placed enormous responsibility on companies to prove they had permission to use customer data or to stop doing so. From a machine learning risk point of view, initiatives like this demonstrate that just because you have the data you need for your algorithms and predictive models now, doesn’t mean you always will.
Similar to GDPR, the CCPA gives users the right to deny sites and platforms permission to collect and use their data, and to ask for their data to be removed from existing datasets. Again, while it only applies to residents of California, in reality, it changes how US companies operate across the board — and also introduces additional machine learning risk for companies dependent on this data for their models.
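At the dataset level, honoring a deletion request means more than dropping a row: derived features, backups, and models trained on the record are all affected. A minimal sketch of the first step, using a hypothetical in-memory dataset keyed by illustrative user IDs:

```python
# Hypothetical training data keyed by user ID; all values are illustrative.
training_data = {
    "u1": {"age": 34, "spend": 120.0},
    "u2": {"age": 51, "spend": 80.5},
    "u3": {"age": 29, "spend": 200.0},
}

def handle_deletion_request(dataset, user_id):
    """Remove a user's records, as a CCPA/GDPR erasure request requires.

    A real pipeline must also purge backups and derived features, and
    retrain or invalidate models that learned from this record.
    """
    dataset.pop(user_id, None)  # no error if the user is already gone
    return dataset

handle_deletion_request(training_data, "u2")
print(sorted(training_data))  # u2 no longer feeds future model training
```

The point for machine learning risk: if your models depend on data that users can lawfully withdraw, deletion requests become retraining events, not just database operations.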
Rather than placing restrictions on what businesses can do with third-party data, PET is designed to facilitate access to valuable external data sources without undermining individual privacy. It uses comprehensive encryption technology to allow businesses to access the data they need for their machine learning models while preventing the leakage of any confidential or potentially identifying user information.
By voluntarily adopting PET, businesses can better manage data risk, demonstrating their commitment to the ethical use of data and ensuring that they avoid collecting data wholesale that they may be required to relinquish later, causing significant disruption to the business.
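PET covers a family of techniques, from homomorphic encryption and secure enclaves to differential privacy. One of the simplest members of that family, keyed pseudonymization, can be sketched as follows; this is an illustration only, far weaker than the full encryption-based approaches described above, and the key name is an assumption:

```python
import hashlib
import hmac

# Illustrative only: in practice, keep this in a secrets manager and rotate it.
SECRET_KEY = b"rotate-me-regularly"

def pseudonymize(identifier: str) -> str:
    """Replace a direct identifier with a keyed hash (HMAC-SHA256).

    Unlike a plain hash, an attacker without the key cannot brute-force
    the mapping from a list of known emails or phone numbers.
    """
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()

record = {"email": "user@example.com", "basket_value": 42.0}

# The downstream model sees a stable token, never the raw identifier.
safe_record = {
    "user_token": pseudonymize(record["email"]),
    "basket_value": record["basket_value"],
}
print(safe_record["user_token"][:16])
```

Because the token is stable, records belonging to the same user can still be joined for modeling, while the raw identifier never leaves the ingestion boundary.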
We mentioned above that data centralization is considered a major enemy of data privacy and that regulatory efforts increasingly push against this. From a CRO’s perspective, this means that opting for decentralized, interconnected data sources is not only a more agile and efficient way to access only the data you need but actually mitigates machine learning risk.
By using a state-of-the-art platform designed to connect to approved sources of external and alternative data, you avoid falling into the trap of accumulating masses of data in-house, only to be told in a few months’ time that you’re no longer permitted to use, store or manipulate this data as you need.
When Google removed “don’t be evil” from its code of conduct in 2018, some commentators saw this as a scary indication of where corporate data ethics was headed. Now, many people around the world are starting to doubt whether their own governments’ data surveillance policies are going the same way.
While plenty of companies will use this as an excuse to drop their ethics too, in the long run doing so will only increase your risk exposure. Corrective measures are on the horizon to curtail exploitative use of data and promote privacy. Anticipating these and getting your house in order might mean you don’t capitalize on the unethical opportunity now, but the upside is that your data strategy and predictive models will be built on robust foundations rather than stolen goods that could have to be surrendered at any time. Do the right thing. It’s too risky not to.