Quality Data Acquisition for Machine Learning


Event details

Date 27.08.2019
Hour 11:00 - 13:00
Speaker Adam Richardson
Location
Category Conferences - Seminars
EDIC candidacy exam
Exam president: Prof. Martin Jaggi
Thesis advisor: Prof. Boi Faltings
Co-examiner: Prof. Volkan Cevher

Abstract
Modern machine learning has seen tremendous growth in recent years, largely due to an abundance of data used to train complex learning models. As these models become more integral to daily life, the need for such data keeps increasing. However, relatively little attention has been paid to ensuring that this data has the right statistical properties to produce a high-quality model. In particular, we are concerned with how to incentivize self-interested agents to report quality data in a crowdsourcing context. We build on the idea of the Peer Prediction mechanism presented in [Peer Truth Serum: Incentives for Crowdsourcing Measurements and Opinions], which incentivizes truthful reporting of a distribution of observations under certain conditions. We observe that in the context of machine learning this problem has additional structure: we are not simply concerned with a distribution of observations; rather, we are concerned with the ability to predict a mapping within that distribution of observations. Cai et al. attempt to address this problem in [Optimum Statistical Estimation with Strategic Data Sources] under some strong assumptions. We propose a mechanism for linear regression learning based on the notion of influence defined in [Understanding Black-box Predictions via Influence Functions].
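To make the notion of influence concrete, the following is a minimal sketch (not the mechanism from our prior work) of scoring a single reported data point by its effect on a ridge-regression model's test loss, following the influence-function formulation of Koh and Liang. The payment idea in the comments, the function names, and the synthetic data are illustrative assumptions only.

# Minimal sketch: influence of one reported point on ridge-regression test loss.
# The scoring/payment idea here is an illustrative assumption, not the proposed mechanism.
import numpy as np

def fit_ridge(X, y, lam=1e-3):
    """Closed-form ridge regression: theta = (X^T X + lam*I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def influence_on_test_loss(x_i, y_i, X, y, X_test, y_test, lam=1e-3):
    """Approximate effect of up-weighting the reported point (x_i, y_i) on the mean test loss.

    For squared loss, grad_theta L(z) = (x^T theta - y) x, and the Hessian of the
    regularized empirical risk is (X^T X + lam*I) / n, so the influence is
    -grad L(z_test)^T H^{-1} grad L(z_i), as in Koh and Liang (2017).
    """
    n, d = X.shape
    theta = fit_ridge(X, y, lam)
    H = (X.T @ X + lam * np.eye(d)) / n
    grad_i = (x_i @ theta - y_i) * x_i                              # gradient at the reported point
    grad_test = ((X_test @ theta - y_test) @ X_test) / len(y_test)  # mean test-loss gradient
    return -grad_test @ np.linalg.solve(H, grad_i)

# Illustrative usage: compare an honest report with a corrupted one on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=200)
X_test = rng.normal(size=(50, 3))
y_test = X_test @ theta_true
honest = influence_on_test_loss(X[0], y[0], X, y, X_test, y_test)
corrupted = influence_on_test_loss(X[0], y[0] + 5.0, X, y, X_test, y_test)
print(f"influence of honest report:    {honest:+.4f}")
print(f"influence of corrupted report: {corrupted:+.4f}")

A payment proportional to such an influence score rewards reports that improve the learned model; the actual mechanism, its game-theoretic guarantees, and its assumptions are discussed in the referenced prior work.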

In prior work, we have shown that our influence mechanism induces truthful reporting under more relaxed assumptions than [Cai et al.]. However, to strengthen our findings, we wish to show that our mechanism generalizes to non-linear models and to strengthen our game-theoretic guarantees. We also wish to apply our mechanism in the context of federated learning. This would involve either extending the mechanism to be privacy-preserving with respect to the data, or re-examining the federated learning pipeline in order to construct a privacy-preserving mechanism.

Background papers
Peer Truth Serum: Incentives for Crowdsourcing Measurements and Opinions, by Faltings, B., et al.
Optimum Statistical Estimation with Strategic Data Sources, by Cai, Y., et al.
Understanding Black-box Predictions via Influence Functions, by Koh, P. W., and Liang, P.

Practical information

  • General public
  • Free

Tags

EDIC candidacy exam
