AIhub.org

Considerations for differentially private learning with large-scale public pretraining – interview with Gautam Kamath

by Lucy Smith

14 August 2024

Firstly, what is differential privacy, and how, and why, have researchers been using public data in the training stage for differentially private machine learning models?

Differential privacy is a rigorous and provable notion of data privacy. Among other things, training a machine learning model with differential privacy can prevent it from spitting out its training data. The issue is that training a model with differential privacy generally comes at a significant cost to the model’s utility. Incorporating “public data” (i.e., data that is not subject to privacy constraints) into the training procedure can help alleviate this concern and increase the resulting model’s utility.
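
Editorial illustration (not part of the interview): for readers who want the formal statement behind "rigorous and provable", the standard definition of differential privacy is reproduced below. The symbols M, D, D', S, ε and δ are the usual textbook notation and are not taken from the interview.

```latex
% A randomized algorithm M is (\varepsilon, \delta)-differentially private if,
% for every pair of datasets D and D' that differ in one individual's record
% and for every set S of possible outputs,
\[
  \Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon} \, \Pr[\,M(D') \in S\,] + \delta .
\]
```

Smaller values of ε and δ mean that the output distribution changes less when any single record changes, which is why a model trained this way is limited in how much it can reveal about any one training example.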

What is the paradigm that your work challenges?

We challenge the paradigm of pretraining models with public data, and then privately fine-tuning the weights with sensitive data. We question whether such a model ought to be considered privacy-preserving and further speculate about whether such a model is useful for downstream privacy-sensitive tasks.
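
Editorial illustration (not part of the interview): the "private fine-tuning" half of this paradigm is commonly implemented with DP-SGD, which clips each example's gradient and adds Gaussian noise before updating the weights. The following is a minimal NumPy sketch under stated assumptions: the function names, the toy linear model, and the values of `clip_norm` and `noise_multiplier` are hypothetical, the "pretrained" weights are a stand-in, and no privacy accounting (computing the resulting ε) is performed.

```python
import numpy as np


def dp_sgd_step(weights, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD update: clip each example's gradient, sum, add noise, average."""
    # L2 norm of each example's gradient, shape (batch,)
    norms = np.linalg.norm(per_example_grads, axis=1)
    # Scale each gradient by min(1, clip_norm / norm) so its norm is at most clip_norm
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * scale[:, None]
    # Gaussian noise with standard deviation proportional to the clipping norm
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=weights.shape)
    noisy_mean = (clipped.sum(axis=0) + noise) / per_example_grads.shape[0]
    return weights - lr * noisy_mean


def per_example_grads_linear(weights, X, y):
    """Per-example gradients of the squared loss 0.5 * (x @ w - y)**2."""
    residual = X @ weights - y
    return residual[:, None] * X


def demo(steps=100, seed=0):
    """Privately fine-tune a toy linear model on synthetic 'sensitive' data."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(64, 3))
    y = X @ np.array([1.0, -2.0, 0.5])
    # Stand-in for weights obtained from public pretraining (assumption)
    weights = np.zeros(3)
    for _ in range(steps):
        grads = per_example_grads_linear(weights, X, y)
        weights = dp_sgd_step(weights, grads, lr=0.1, clip_norm=1.0,
                              noise_multiplier=1.0, rng=rng)
    return weights
```

Note that only the fine-tuning data receives protection in this sketch. Whatever data produced the starting weights is untouched by the clipping and noise, which is the gap the interview goes on to discuss.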

Could you talk about the three key issues you’ve identified related to using large public datasets for training private models?

The three points we raise are the following:

a) The first step of the stated paradigm is to pre-train the model on “public” data. The understated question is where this data comes from, and what it is used for. Frequently, it is scraped en masse from the public Internet. Other times, it may come from proprietary or otherwise undisclosed sources. Our primary concern is that data from all of these sources, treated as fully “public” and employed without considerations of privacy, may actually contain significant amounts of sensitive data. Uncritically treating such a model as “privacy-preserving” thus erodes trust in privacy-enhancing technologies.
b) In these settings, evaluation of machine learning models is often performed on benchmark datasets which are ported from the non-private setting. These datasets may be unrepresentative of application settings where privacy is actually a practical consideration. In particular, they may resemble data available from public sources, leading to significant gains in utility when pre-training on public data. In contrast, data in privacy-sensitive settings may qualitatively differ from publicly available data. It thus remains to be seen whether models pre-trained on public data and privately fine-tuned on sensitive data are still useful for settings pertinent to real applications.
c) To take full advantage of public pre-training, one traditionally needs very large models, which are too big to fit on a single device. This necessitates outsourcing a user’s data to the cloud, and thus losing data privacy in another sense.

What are some of the principal open questions that surround this field?

The broad challenge in private machine learning is how to train high-utility models while preserving privacy of the training data, with respect to meaningful privacy semantics.

In the paper, you suggest some potential ways to address your concerns. Could you talk about a few of these?

To address some of the specific issues we identified above, we give a few broad suggestions, though our position paper is focused primarily on identifying and highlighting these problems for the community, rather than resolving them. We suggest that model curators go beyond a naïve dichotomy of treating data as either “public” or “private.” Such a dichotomy is misaligned with individual expectations of privacy norms. Another suggestion is that privacy researchers evaluate their techniques on datasets and settings that may more closely resemble those pertinent to privacy-sensitive applications. Finally, we suggest the community focus on a more holistic view of privacy. Rather than focusing primarily on the (important) task of training models with differential privacy, it is important to go further and connect that with real-world privacy norms and considerations.

About Gautam

Gautam Kamath is an Assistant Professor at the David R. Cheriton School of Computer Science at the University of Waterloo, and a Canada CIFAR AI Chair and Faculty Member at the Vector Institute. He has a B.S. in Computer Science and Electrical and Computer Engineering from Cornell University, and an M.S. and Ph.D. in Computer Science from the Massachusetts Institute of Technology. He is interested in reliable and trustworthy statistics and machine learning, including considerations such as data privacy and robustness. He was a Microsoft Research Fellow, as a part of the Simons-Berkeley Research Fellowship Program at the Simons Institute for the Theory of Computing. He serves as an Editor in Chief of Transactions on Machine Learning Research, and is the program committee co-chair of the 36th International Conference on Algorithmic Learning Theory (ALT 2025). He is the recipient of several awards, including the Caspar Bowden Award for Outstanding Research in Privacy Enhancing Technologies, a best paper award at the Forty-first International Conference on Machine Learning (ICML 2024), and the Faculty of Math Golden Jubilee Research Excellence Award.

tags: ICML, ICML2024

Lucy Smith is Senior Managing Editor for AIhub.
