Artificial intelligence – more specifically, the machine learning (ML) subset of AI – has a number of privacy problems.
Not only does ML require vast amounts of data for training, but the resulting system is also given access to even greater volumes of data during inference, once it is in operation. These AI systems need to access and “consume” huge amounts of data in order to exist, and in many use cases the data involved is private: faces, medical records, financial data, location information, biometrics, personal records, and communications.
Preserving privacy and security in these systems is a great challenge. The problem grows in sensitivity as the public becomes more aware of the consequences of their privacy being violated and misused. Regulations are continually evolving to restrict organizations and penalize offenders who fail to respect users’ rights. The UK’s Information Commissioner’s Office, for example, recently announced its intention to fine British Airways roughly $228 million under the EU’s General Data Protection Regulation.
There is currently a fine line that AI developers must walk to create useful systems to benefit society and yet avoid violating privacy rights.
For example, AI systems are an excellent candidate to help law enforcement rescue abducted and exploited children by identifying them in social media posts. Such a system would be relentless in scouring all posts and matching images to missing persons, even taking into account how a face is likely to change as the years pass, something impossible for humans to accomplish accurately or at scale. However, such a system would need to do facial recognition analysis on every picture posted in a social network.
That could identify, and ultimately help track, everyone, even bystanders in the background of images. It sounds creepy, and many would likely object – this is where privacy regulations and ethics must define what is allowable. Bringing home kidnapped children, or those forced into sex trafficking, is a worthwhile goal, but it still requires adherence to privacy fundamentals so that greater harms aren’t inevitably created.
To accomplish such a noble feat, a system would need to be trained to recognize the faces of children. For accuracy, it would require a training database with millions of children’s faces. To follow the laws in some jurisdictions, the parents of each child in the training data set would need to approve the use of their child’s image as part of the learning process. No such approved database currently exists and it would be a tremendous undertaking to build one. It would probably take many decades to coordinate such an effort, leaving the promise of an efficient AI solution for finding kidnapped or exploited children just a hopeful concept for the foreseeable future.
Such is the dilemma of AI and privacy. This type of conflict arises when AI systems are in training and also when they are put to work to process real data.
Take that same facial recognition system and connect it to both a federal citizen registry and millions of surveillance cameras. Now the government could identify and track people wherever they go, regardless of whether they have committed a crime, which is very Orwellian. But innovation is coming to help – federated learning, differential privacy, and homomorphic encryption are technologies that can assist in navigating such challenges. However, they are just tools, not complete solutions. They can help in specific usages but always come with drawbacks and limitations, many of which can be significant.
Federated learning (aka collaborative learning) makes it possible to train algorithms without local data sets being exchanged or centralized. It’s all about compartmentalization, which is great for privacy, but it is difficult to set up and scale. Additionally, it can be limiting for data researchers who are desperate for massive data sets containing the rich information needed to train AI systems.
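To make the idea concrete, here is a minimal sketch of federated averaging (FedAvg), the canonical federated-learning algorithm: each client takes a training step on its own private data, and only the resulting model weights – never the raw records – are sent to a server for aggregation. The toy one-dimensional linear model and the client data below are hypothetical, chosen purely to show the mechanics; real systems use frameworks such as TensorFlow Federated.

```python
# Minimal FedAvg sketch: a 1-D linear model y = w * x trained with
# squared loss. Only weights cross the client/server boundary.

def local_update(weights, data, lr=0.1):
    """One local gradient step on a client's private (x, y) pairs."""
    w = weights[0]
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return [w - lr * grad]

def federated_average(client_weights, client_sizes):
    """Server aggregates only model weights, weighted by data-set size."""
    total = sum(client_sizes)
    return [
        sum(w[i] * s for w, s in zip(client_weights, client_sizes)) / total
        for i in range(len(client_weights[0]))
    ]

# Two hypothetical clients (say, two hospitals) keep their records local.
client_data = [
    [(1.0, 2.0), (2.0, 4.0)],  # client A's private data
    [(3.0, 6.0)],              # client B's private data
]

global_weights = [0.0]
for _ in range(10):  # ten federated training rounds
    updates = [local_update(global_weights, d) for d in client_data]
    global_weights = federated_average(updates, [len(d) for d in client_data])

print(round(global_weights[0], 3))  # converges toward the true slope 2.0
```

Note that the server learns a shared model without ever seeing a single (x, y) record, which is exactly the compartmentalization described above.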
Differential privacy takes a different approach, attempting to obfuscate the details by providing aggregate information without sharing specific data, i.e., “describe the forest, but not individual trees”. It is often used in conjunction with federated learning. Again, there are privacy benefits, but it can result in serious degradation of the AI system’s accuracy, thereby undermining its overall value and purpose.
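As an illustration of “forest, not trees,” the classic Laplace mechanism releases a noisy aggregate instead of the exact one. The function names and records below are hypothetical, and a real deployment would use a vetted library (e.g. OpenDP) rather than hand-rolled sampling:

```python
# Laplace-mechanism sketch: answer a counting query with noise calibrated
# to the privacy budget epsilon, so no single record is revealed.
import math
import random

def laplace_noise(scale):
    # Inverse-CDF sampling of a Laplace(0, scale) random variate
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(records, predicate, epsilon):
    # A counting query has sensitivity 1: adding or removing one person
    # changes the count by at most 1, so the noise scale is 1 / epsilon.
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

ages = [34, 29, 41, 57, 62, 45, 38]  # hypothetical private records
noisy = dp_count(ages, lambda a: a >= 40, epsilon=1.0)
print(noisy)  # near the true count of 4, but any one row stays deniable
```

The accuracy trade-off mentioned above is visible in the single parameter epsilon: a smaller epsilon gives stronger privacy but noisier, less useful answers.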
Homomorphic encryption, one of my favorites, is a promising technology that allows data to remain encrypted while still permitting useful computations, as if it were unencrypted. Imagine a class of students being asked who their favorite teacher is: Alice or Bob. To protect the privacy of the answers, an encrypted database is created containing the names of individual students and the corresponding name of their favorite teacher. While the data remains encrypted, calculations could be done, in theory, to tabulate how many votes there were for Alice and for Bob without actually looking at the individual choice made by each student. Applying this to AI development, data privacy remains intact while training can still proceed. Sounds great, but in real-world scenarios it is extremely limited and takes tremendous computing power to accomplish. For most AI applications, it is simply not a feasible way to train the system.
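The teacher-vote example can be sketched with a toy version of the Paillier cryptosystem, which is additively homomorphic: multiplying two ciphertexts yields an encryption of the sum of their plaintexts, so the tally can be computed without decrypting any individual ballot. The primes here are demo-sized and utterly insecure – this is an illustration of the principle, not a usable scheme; real deployments rely on libraries such as Microsoft SEAL.

```python
# Toy Paillier scheme (requires Python 3.8+ for pow(x, -1, n)).
import math
import random

p, q = 293, 433                     # demo-sized primes (NOT secure)
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
L = lambda x: (x - 1) // n
mu = pow(L(pow(g, lam, n2)), -1, n)

def encrypt(m):
    r = random.randrange(2, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(2, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return (L(pow(c, lam, n2)) * mu) % n

# 1 = a vote for Alice, 0 = a vote for Bob; each ballot is encrypted.
ballots = [encrypt(v) for v in [1, 0, 1, 1, 0]]

# Multiplying ciphertexts adds the hidden plaintexts, so the tally is
# computed without ever decrypting an individual ballot.
tally = 1
for c in ballots:
    tally = (tally * c) % n2

alice_votes = decrypt(tally)
print(alice_votes, len(ballots) - alice_votes)  # prints "3 2"
```

Even this tiny demo hints at the cost problem: every encrypted operation involves large modular exponentiations, which is why fully homomorphic training of real AI models remains impractical today.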
For now, there is no perfect solution on the horizon. It currently takes the expertise of, and committed partnerships between, privacy, legal, ethics, and AI development professionals to evaluate individual use cases and determine the best course of action. Even then, most of the focus is placed on current concerns rather than on the more difficult strategic question of what challenges will emerge in the future. The only thing that is clear is that we need to achieve the right level of privacy so we can benefit from the tremendous advantages that AI potentially holds for mankind. How to achieve that in an effective, efficient, timely, and consistent manner is beyond what anyone has figured out to date.