A Comprehensive Overview of the OD-HP Algorithm for Histogram Publishing Under Shuffled Differential Privacy

Introduction

Privacy-preserving data publishing has become increasingly important in the era of big data, where organizations frequently collect and analyze user data. However, the risk of privacy breaches necessitates robust mechanisms to protect sensitive information while maintaining data utility. Histogram publishing, a common technique for analyzing categorical data, involves partitioning datasets into non-overlapping bins and representing data characteristics using frequencies or counts. Traditional approaches under Centralized Differential Privacy (CDP) and Local Differential Privacy (LDP) have been widely studied, but they face challenges in balancing privacy and utility.

CDP relies on a trusted central server to apply noise, but real-world scenarios often lack such trust. LDP, on the other hand, perturbs data at the user level, ensuring strong privacy but often at the cost of reduced data accuracy. To address these limitations, the Shuffled Differential Privacy (SDP) model has emerged as a middle ground, offering a balance between privacy and utility. The SDP model introduces a shuffler that anonymizes user data before it reaches the data collector, enhancing privacy while maintaining reasonable accuracy.

This article presents the OD-HP (Optimized Local Hashing and Dummy Points) algorithm, a novel approach for histogram publishing under the SDP model. OD-HP combines optimized local hashing (OLH) with the introduction of dummy data points to improve accuracy and resist collusion attacks between the shuffler and the collector.

Background and Related Work

Centralized and Local Differential Privacy

Centralized Differential Privacy (CDP) assumes a trusted server that applies noise to aggregated data before release. Common mechanisms include the Laplace and Exponential mechanisms, which add calibrated noise to query results. For histogram publishing, techniques like LAP, Boost, and NoiseFirst have been proposed. LAP adds Laplace noise to each bin, while Boost uses hierarchical trees to improve accuracy. NoiseFirst combines noise addition with histogram grouping to reduce errors. However, these methods struggle to balance grouping errors with noise-induced errors.

Local Differential Privacy (LDP) shifts the perturbation process to individual users, ensuring privacy even if the server is untrusted. Google’s RAPPOR uses randomized response and Bloom filters to encode user data, while OUE and OLH improve accuracy for large value domains. However, LDP-based methods often suffer from high variance, especially for multidimensional or continuous data.
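The randomized-response primitive underlying these LDP encoders can be illustrated with generalized randomized response (GRR). This is a minimal sketch of the general idea, not RAPPOR's actual implementation; the function name and structure are illustrative:

```python
import math
import random

def grr_perturb(value, domain, epsilon, rng=random):
    """Generalized randomized response: keep the true value with
    probability p = e^eps / (e^eps + k - 1), otherwise report a
    uniformly random *different* value from the domain."""
    k = len(domain)
    p = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if rng.random() < p:
        return value
    # Choose uniformly among the k - 1 other values.
    return rng.choice([v for v in domain if v != value])
```

Every report stays inside the domain, and the ratio of the keep-probability to the per-value lie-probability is exactly e^ε, which is what bounds an observer's inference about any single user. The high variance the text mentions comes from the k − 1 term: as the domain grows, p shrinks and reports become mostly noise.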

Shuffled Differential Privacy

The SDP model bridges the gap between CDP and LDP by introducing a shuffler that anonymizes user reports before analysis. Because the shuffler breaks the link between users and their reports, the locally added noise is amplified: SDP avoids CDP's trusted-server assumption while offering better utility than pure LDP. Early SDP-based methods, such as DDPS and SH, focused on binary data but faced accuracy limitations for large value domains. MURS improved on these by using local hashing to map large domains to smaller ones, but it lacked dedicated shuffling and post-processing techniques.

Recent advancements include mixDUMP, which introduces dummy data to enhance privacy, and HP-SDP, which employs local hashing and post-processing for histogram refinement. However, these methods either do not fully address collusion attacks or fail to optimize for large value domains.

The OD-HP Algorithm

Overview

OD-HP is designed to address two key challenges in SDP-based histogram publishing:

  1. Large Value Domains: Traditional methods suffer from high errors when the data domain is large.
  2. Collusion Attacks: The shuffler and collector may collude to infer sensitive user data.

To mitigate these issues, OD-HP combines OLH for data perturbation with dummy data insertion to obscure real user contributions.

Key Components

  1. Optimized Local Hashing (OLH)
    • Users encode their data using a universal hash function, mapping large domains to smaller ones.

    • Perturbation follows a randomized response mechanism: the hashed value is retained with probability p and replaced with a random value from the hash domain with probability q.

    • This reduces errors caused by large value ranges while preserving privacy.

  2. Dummy Data Points
    • Each user generates a random number of dummy data points alongside their perturbed data.

    • The shuffler mixes real and dummy data uniformly, making it difficult for adversaries to distinguish between them.

    • This prevents collusion attacks between the shuffler and collector.

  3. Expectation-Maximization (EM) Refinement
    • After shuffling, the collector uses the EM algorithm to refine the histogram.

    • EM iteratively estimates true frequencies by accounting for noise and dummy data, improving accuracy.
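The first two components can be sketched as user-side code. This is a toy rendering under stated assumptions, not the paper's exact procedure: the seeded-tuple hash stands in for a universal hash family, the dummy-count distribution is uniform for simplicity, and all names are illustrative:

```python
import math
import random

def olh_report(value, epsilon, g=None, rng=random):
    """OLH-style encoding: hash into a small domain of size g, then
    apply randomized response over the hashed value."""
    if g is None:
        g = int(round(math.exp(epsilon))) + 1  # common choice: g = e^eps + 1
    seed = rng.randrange(2**31)
    hashed = hash((seed, value)) % g  # stand-in for a universal hash family
    p = math.exp(epsilon) / (math.exp(epsilon) + g - 1)
    if rng.random() >= p:
        # Replace with a uniformly random *different* value in [0, g).
        hashed = rng.choice([v for v in range(g) if v != hashed])
    return seed, hashed

def make_dummies(max_dummies, g, rng=random):
    """Each user emits a random number of uniform dummy reports,
    indistinguishable in form from real (seed, hashed-value) pairs."""
    n = rng.randrange(max_dummies + 1)
    return [(rng.randrange(2**31), rng.randrange(g)) for _ in range(n)]
```

Because dummies share the exact format of real reports and are drawn uniformly over the hash domain, a collector (even one colluding with the shuffler) cannot tell which reports carry real user data.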

Algorithm Steps

  1. User-Side Processing
    • Each user applies OLH to encode and perturb their data.

    • They generate dummy data points and combine them with their perturbed data.

    • The mixed data is sent to the shuffler.

  2. Shuffler Operation
    • The shuffler collects data from all users and performs a uniform random permutation.

    • This breaks the link between users and their data, enhancing privacy.

  3. Collector-Side Processing
    • The collector receives the shuffled data and computes initial frequency estimates.

    • EM refinement is applied to correct biases introduced by perturbation and dummy data.

    • The final histogram is published.
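The three stages can be wired together in a toy end-to-end simulation. For brevity this sketch perturbs values with plain GRR (no hashing) and substitutes the paper's EM refinement with a closed-form debiasing step; the function name and parameters are illustrative assumptions:

```python
import math
import random

def simulate_pipeline(values, domain, epsilon, max_dummies, rng):
    k = len(domain)
    p = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    q = (1 - p) / (k - 1)

    # 1. User side: perturb each value and append uniform dummy points.
    reports, n_dummy = [], 0
    for v in values:
        r = v if rng.random() < p else rng.choice([u for u in domain if u != v])
        reports.append(r)
        for _ in range(rng.randrange(max_dummies + 1)):
            reports.append(rng.choice(domain))
            n_dummy += 1

    # 2. Shuffler: a uniform random permutation breaks user/report links.
    rng.shuffle(reports)

    # 3. Collector: count, subtract expected dummy mass, invert GRR bias.
    n = len(values)
    est = {}
    for v in domain:
        c = reports.count(v) - n_dummy / k   # remove dummy contribution
        est[v] = (c / n - q) / (p - q)       # debias the perturbation
    return est
```

The debiasing line is the standard unbiased frequency estimator for randomized response; OD-HP's EM refinement plays the same role but additionally enforces non-negativity and normalization across bins, which is where its accuracy gain comes from.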

Privacy and Utility Analysis

Privacy Guarantees

OD-HP satisfies (ε, δ)-differential privacy, ensuring that the inclusion or exclusion of a single user’s data does not significantly affect the output. The privacy budget ε is controlled by the OLH parameters and the number of dummy points.
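Formally, a randomized mechanism M satisfies (ε, δ)-differential privacy if, for all neighboring datasets D and D′ (differing in one user's record) and every set S of outputs:

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \,\Pr[M(D') \in S] + \delta
```

The δ term captures the small failure probability that shuffling-based amplification arguments introduce on top of a pure-ε local guarantee.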

Theoretical analysis shows that the shuffler’s uniform mixing amplifies privacy, making it difficult for adversaries to infer individual contributions. The dummy data further obscures real data, resisting collusion attacks.

Utility Metrics

The algorithm’s accuracy is measured using Mean Squared Error (MSE), which quantifies the difference between true and estimated frequencies. OD-HP achieves lower MSE compared to existing methods like MURS, HP-SDP, and mixDUMP, particularly for large value domains.

Key factors influencing utility include:
• Privacy Budget (ε): A higher ε (less noise) improves accuracy but weakens privacy.

• Dummy Data Volume: More dummy points enhance privacy but increase computational overhead.

• Hash Domain Size: A smaller hash domain reduces variance but may introduce collisions.
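The interplay of these factors can be made concrete with the standard OLH analysis from the LDP literature (these are textbook formulas, not ones stated in the article): the variance-minimizing hash-domain size is g = e^ε + 1, which gives keep-probability p = 1/2 and collision rate q = 1/g, and a per-bin estimation variance of roughly 4e^ε / (n(e^ε − 1)²):

```python
import math

def olh_params(epsilon):
    """Standard OLH parameter choices: g = e^eps + 1 minimizes variance,
    yielding p = 1/2 and q = 1/g."""
    g = math.exp(epsilon) + 1
    p = math.exp(epsilon) / (math.exp(epsilon) + g - 1)  # = 1/2 at the optimum
    q = 1.0 / g
    return g, p, q

def olh_variance(epsilon, n):
    """Approximate variance of a single frequency estimate over n users."""
    e = math.exp(epsilon)
    return 4 * e / (n * (e - 1) ** 2)
```

Note the variance is independent of the original domain size — this is precisely why hashing into a small domain helps for large value ranges, at the cost of the hash collisions mentioned above.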

Efficiency

OD-HP’s computational complexity is dominated by the EM refinement step, which scales linearly with the number of users and dummy points. Experiments on real-world datasets (IPUMS and Kosarak) confirm that OD-HP maintains practical runtime while outperforming baseline methods in accuracy.

Experimental Evaluation

Datasets

  1. IPUMS: A U.S. census dataset with 602,325 users and 915 distinct cities.
  2. Kosarak: A clickstream dataset with 1 million users and 42,178 possible values.

Comparison Methods

• MURS: Uses local hashing but lacks dummy data and EM refinement.

• HP-SDP: Combines hashing with post-processing but does not address collusion attacks.

• mixDUMP: Introduces dummy data but uses GRR perturbation, which is less accurate for large domains.

Results

  1. MSE vs. Privacy Budget (ε)
    • OD-HP consistently achieves lower MSE than baselines across all ε values.

    • The advantage is more pronounced for smaller ε (stronger privacy), where noise dominates errors.

  2. Scalability
    • OD-HP’s runtime grows linearly with dataset size, making it suitable for large-scale applications.

    • The EM step adds moderate overhead but significantly improves accuracy.

  3. Collusion Resistance
    • Tests simulating shuffler-collector collusion confirm that dummy data effectively protects user privacy.

Conclusion

The OD-HP algorithm advances histogram publishing under shuffled differential privacy by addressing two critical challenges: large value domains and collusion attacks. By integrating optimized local hashing, dummy data insertion, and EM refinement, OD-HP achieves superior privacy-utility trade-offs compared to existing methods.

Future research directions include:

  1. Efficient Shuffling Algorithms: Faster shuffling techniques to handle large datasets.
  2. Dynamic Data Support: Extending OD-HP to streaming or time-varying data.

OD-HP represents a significant step toward practical, privacy-preserving data publishing in untrusted environments.

DOI: 10.19734/j.issn.1001-3695.2024.01.0110
