There is more to learn from the European Data Protection Board’s recent opinion on AI models.

I previously reviewed the EDPB’s take on the potential consequences of unlawfully processing personal data in the development phase of an AI model. I also looked at how to analyze “legitimate interest.”

This week, I look at the anonymity of AI models.

Some key points:

  • AI models are not always anonymous. They need to be assessed on a case-by-case basis.
  • AI models cannot be considered anonymous if they were specifically designed to provide personal data regarding individuals whose personal data was used to train the model.
  • For an AI model to be considered anonymous, both (1) the likelihood of direct (including probabilistic) extraction of personal data regarding individuals whose personal data were used to develop the model and (2) the likelihood of obtaining, intentionally or not, such personal data from queries, should be insignificant. It is important to take into account “all the means reasonably likely to be used” by the controller or another person.
  • Pay special attention to the risk of singling out individuals whose data was used to train the model; the EDPB considers this risk substantial.
  • Consider all means reasonably likely to be used by the controller or another person to identify individuals. These may include: characteristics of the training data, the AI model, and the training procedure; context; additional information; the cost and time needed to obtain such information; and available technology and technological developments.
  • Such means and levels of testing may differ between a publicly available model and one intended only for internal use by employees.
  • Consider the risk of identification by the controller and by different types of “other persons,” including unintended third parties accessing the AI model and unintended reuse or disclosure of the model.
  • Be able to prove, through steps taken and documentation, that you have taken effective measures to anonymize the AI model. Otherwise, you may be in breach of your accountability obligations under Article 5(2) GDPR.

Factors to consider:

  • Selection of sources: Selection criteria; relevance and adequacy of chosen sources; exclusion of inappropriate sources.
  • Preparation of data for the training phase: Could you use anonymous or pseudonymous data? If not, why not? Data minimization strategies and techniques to restrict the volume of personal data included in the training process; data filtering processes to remove irrelevant personal data (see the first sketch after this list).
  • Methodological choices regarding training: Improving model generalization and reducing overfitting; privacy-preserving techniques such as differential privacy (second sketch below).
  • Measures applied to the model’s outputs, to lower the likelihood of obtaining personal data related to the training data from queries (third sketch below).
  • Conduct sufficient tests on the model covering widely known, state-of-the-art attacks, e.g. attribute and membership inference; exfiltration; regurgitation of training data; model inversion; and reconstruction attacks (fourth sketch below).
  • Document the process, including: the DPIA; advice from the DPO; technical and organizational measures; and the AI model’s theoretical resistance to re-identification techniques.
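
To make the data-filtering bullet concrete, here is a minimal Python sketch of a pre-training scrubbing step, assuming a plain-text corpus. The regex patterns, placeholder tokens, and the cutoff for dropping identifier-heavy records are all illustrative assumptions rather than anything the EDPB prescribes; a real pipeline would add NER-based PII detection and human review.

```python
import re

# Illustrative patterns only; these two cover common direct identifiers.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub_record(text: str) -> str:
    """Replace directly identifying strings with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text

def filter_corpus(records: list[str]) -> list[str]:
    """Scrub records, then drop any that remain identifier-heavy
    (the threshold of 3 placeholders is an arbitrary illustrative cutoff)."""
    scrubbed = [scrub_record(r) for r in records]
    return [r for r in scrubbed if r.count("[EMAIL]") + r.count("[PHONE]") < 3]

print(filter_corpus(["Email jane.doe@example.com or call +44 20 7946 0958."]))
# -> ['Email [EMAIL] or call [PHONE].']
```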
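
For the differential-privacy bullet, the second sketch shows the core mechanics of DP-SGD on a toy logistic regression: clip each per-example gradient to a fixed L2 norm, then add Gaussian noise calibrated to that clipping bound. The hyperparameters (lr, clip, noise_multiplier) are illustrative assumptions; in practice you would use a maintained library such as Opacus or TensorFlow Privacy and track the cumulative (ε, δ) budget with a privacy accountant.

```python
import numpy as np

def dp_sgd_step(w, X, y, lr=0.1, clip=1.0, noise_multiplier=1.0, rng=None):
    """One DP-SGD step for logistic regression on a mini-batch (X, y)."""
    if rng is None:
        rng = np.random.default_rng(0)
    preds = 1.0 / (1.0 + np.exp(-X @ w))          # sigmoid probabilities
    per_example_grads = (preds - y)[:, None] * X  # shape (batch, dim)
    # Clip each example's gradient so no single record dominates the update.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    per_example_grads *= np.minimum(1.0, clip / np.maximum(norms, 1e-12))
    # Gaussian noise calibrated to the sensitivity (= clip) of the summed gradient.
    noise = rng.normal(0.0, noise_multiplier * clip, size=w.shape)
    grad = (per_example_grads.sum(axis=0) + noise) / len(X)
    return w - lr * grad

# Toy usage on synthetic data (assumed shapes: X (batch, dim), y in {0, 1}).
rng = np.random.default_rng(1)
X = rng.normal(size=(32, 5))
y = (X[:, 0] > 0).astype(float)
w = np.zeros(5)
for _ in range(100):
    w = dp_sgd_step(w, X, y, rng=rng)
```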
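
For output measures, one common guardrail is to screen generated text for long verbatim overlaps with the training corpus before returning it. In the third sketch, the 8-word window and the blocking policy are assumptions chosen for illustration; production systems typically add fuzzy matching and output-side PII detection as well.

```python
def build_ngram_index(training_texts: list[str], n: int = 8) -> set[str]:
    """Index every n-word window of the training corpus."""
    index = set()
    for text in training_texts:
        words = text.split()
        for i in range(len(words) - n + 1):
            index.add(" ".join(words[i:i + n]))
    return index

def reproduces_training_text(output: str, index: set[str], n: int = 8) -> bool:
    """True if any n-word window of `output` appears verbatim in training data."""
    words = output.split()
    return any(" ".join(words[i:i + n]) in index
               for i in range(len(words) - n + 1))

# Usage: block or rewrite candidate outputs before returning them.
# index = build_ngram_index(train_corpus)
# if reproduces_training_text(candidate, index):
#     candidate = "[withheld: output matched training data]"
```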
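
Finally, on testing: the simplest membership-inference baseline is a loss-threshold attack, which exploits the tendency of models to assign lower loss to records they were trained on. The fourth sketch runs that baseline on synthetic per-example losses; a serious assessment would also run the stronger attacks the bullet above lists (shadow-model membership inference, attribute inference, model inversion, reconstruction).

```python
import numpy as np

def loss_threshold_attack(member_losses, nonmember_losses):
    """Best accuracy of a 'low loss => member' threshold rule.
    Values near 0.5 suggest little membership leakage on this split;
    values well above 0.5 indicate the model memorizes training records."""
    losses = np.concatenate([member_losses, nonmember_losses])
    labels = np.concatenate([np.ones_like(member_losses),
                             np.zeros_like(nonmember_losses)])
    best = 0.5
    for t in np.unique(losses):
        acc = ((losses <= t).astype(float) == labels).mean()
        best = max(best, acc)
    return best

# Synthetic example: members have slightly lower loss than non-members.
rng = np.random.default_rng(0)
members = rng.normal(0.8, 0.3, 1000)     # per-example losses on training data
nonmembers = rng.normal(1.2, 0.3, 1000)  # per-example losses on held-out data
print(loss_threshold_attack(members, nonmembers))  # roughly 0.75 here
```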