Dataset Pruning and Privacy Risks in Machine Learning
3 min read
Quick take - The article covers dataset pruning, an emerging data-centric AI technique that shrinks training datasets while preserving machine learning performance. It highlights the technique's alignment with GDPR's data minimization principle, the privacy risks it introduces, and new privacy inference methods developed to study those risks.
Fast Facts
- Dataset pruning, or coreset selection, optimizes dataset size while maintaining machine learning model performance by eliminating redundant data.
- The technique supports GDPR’s data minimization principle by reducing unnecessary data exposure during training, but it also introduces potential privacy risks.
- Recent research highlights that sensitive information can still be inferred from excluded data, necessitating innovative privacy inference approaches like Data-Centric Membership Inference (DCMI) and Data Lineage Inference (DaLI).
- The study developed a Brimming score metric to quantify privacy risks associated with different pruning methods, revealing varying degrees of privacy leakage based on the amount of data pruned.
- The authors advocate for further research to create dataset pruning methods that balance efficiency, utility, and privacy protection, emphasizing implications for data service providers and regulatory compliance.
Dataset Pruning in Data-Centric AI
Dataset pruning, also known as coreset selection, is gaining attention in the field of data-centric AI. This technique focuses on optimizing the size of datasets while maintaining the performance of machine learning models. The primary goal is to eliminate redundant data, creating a more compact and efficient subset for training. This approach enhances overall efficiency without significantly compromising model effectiveness.
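To make the idea concrete, here is a minimal, illustrative sketch of score-based pruning, one common family of coreset selection. It is not any specific method from the study; the proxy-loss scores and the `prune_by_score` helper are assumptions for illustration only.

```python
import numpy as np

def prune_by_score(scores: np.ndarray, keep_fraction: float = 0.7) -> np.ndarray:
    """Keep the highest-scoring fraction of a dataset.

    `scores` holds one difficulty/utility score per training example, e.g. the
    loss of a lightly trained proxy model. Low-scoring examples are treated as
    redundant and pruned away.
    """
    n_keep = int(len(scores) * keep_fraction)
    keep_idx = np.argsort(scores)[-n_keep:]   # indices of the top-scoring examples
    return np.sort(keep_idx)

# Example: 10,000 collected examples, retain the 70% judged most informative.
rng = np.random.default_rng(0)
proxy_losses = rng.random(10_000)             # stand-in for real proxy-model losses
selected = prune_by_score(proxy_losses, keep_fraction=0.7)
redundant = np.setdiff1d(np.arange(10_000), selected)  # collected but never trained on
```

The `redundant` set, data that was collected and scored but never used for training, is exactly the data whose privacy the rest of the article is concerned with.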
Privacy Implications
Dataset pruning aligns with the General Data Protection Regulation (GDPR) principle of data minimization: by reducing unnecessary data exposure during training, it supports data protection efforts. However, it also introduces privacy risks tied to data that, although excluded from training, may still contain sensitive information. Existing research mainly addresses privacy attacks against training samples, leaving a gap in understanding what can be inferred about data that is collected and considered during the pruning phase but never used for training.
A recent study systematically investigates these privacy concerns in machine learning systems. The findings reveal that it is possible to infer the membership status of data within the redundant set through certain attacks, even if the data is not actively used for model training. Traditional privacy inference techniques, which typically rely on model outputs, are inadequate for these upstream privacy issues, highlighting the need for innovative approaches.
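For contrast, a standard model-output attack is easy to state: a classic loss-threshold membership inference test (a well-known baseline, not one of this study's attacks) predicts "member" whenever the trained model's loss on a sample falls below some calibrated threshold. Because pruned-away data never directly shapes the trained model, that signal is exactly what the upstream setting lacks. The helper below is a hypothetical illustration.

```python
import numpy as np

def loss_threshold_mia(sample_losses: np.ndarray, threshold: float) -> np.ndarray:
    """Classic model-output membership inference.

    Predict membership (True) when the model's loss on a sample is below the
    threshold, on the intuition that models fit their training data more
    tightly than unseen data. Data discarded during pruning never influences
    the model this way, which is why such attacks miss upstream leakage.
    """
    return sample_losses < threshold

# Example: samples with small loss are flagged as likely training members.
losses = np.array([0.05, 1.30, 0.10, 2.40])
print(loss_threshold_mia(losses, threshold=0.5))  # [ True False  True False]
```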
New Privacy Inference Paradigms
To address this challenge, the authors introduce the concept of Data-Centric Membership Inference (DCMI) and propose a new privacy inference paradigm called Data Lineage Inference (DaLI). DaLI includes four threshold-based attack methods: WhoDis, CumDis, ArraDis, and SpiDis. These methods can detect redundant data with minimal prior knowledge. The research indicates that different dataset pruning methods exhibit varying degrees of privacy leakage, with associated risks fluctuating based on the fraction of data pruned.
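The specifics of WhoDis, CumDis, ArraDis, and SpiDis are not reproduced here, but a toy example conveys what threshold-based lineage inference can look like: if a candidate sample sits very close, in some embedding space, to an example that was retained, a pruning method that removes near-duplicates would plausibly have collected and then discarded it as redundant. The sketch below is that intuition only, with assumed embeddings and an assumed threshold; it is not the paper's attacks, which operate under varying levels of adversarial knowledge.

```python
import numpy as np

def infer_redundant(candidate_emb: np.ndarray,
                    coreset_embs: np.ndarray,
                    threshold: float) -> bool:
    """Toy threshold-based lineage inference (illustration only).

    Flags a candidate as "likely collected but pruned as redundant" when its
    nearest retained example lies within `threshold` in embedding space.
    """
    dists = np.linalg.norm(coreset_embs - candidate_emb, axis=1)
    return float(dists.min()) < threshold

# Example with made-up 2-D embeddings.
coreset = np.array([[0.0, 0.0], [5.0, 5.0]])
print(infer_redundant(np.array([0.1, 0.1]), coreset, threshold=0.5))  # True
print(infer_redundant(np.array([3.0, 0.0]), coreset, threshold=0.5))  # False
```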
To quantify these risks, the authors developed a metric known as the Brimming score, which assists in selecting pruning methods that prioritize privacy considerations. Their experiments utilized twelve distinct pruning methods across three datasets: MNIST, CIFAR10, and CIFAR100, assessing the efficacy of their proposed attacks. The results showed that these attacks could successfully infer membership status under diverse conditions of adversarial knowledge.
The study underscores the necessity for a deeper understanding of the privacy risks linked to data collected but excluded from training. It advocates for the development of effective defenses against DCMI. The authors propose a strategy called ReDoMi to enhance the indistinguishability between redundant and non-member data. They emphasize the implications of their findings for data service providers, particularly concerning compliance with data regulations. Furthermore, they call for additional research aimed at devising dataset pruning methods that strike an optimal balance between efficiency, utility, and privacy protection.
Original Source: Read the Full Article Here