
Preserving Privacy in a Big Data World

What is real user data anonymization? Can anonymized data still be useful? Can a user get better service by sacrificing privacy?

My colleagues and I at Ericsson Research set out to answer these questions. We presented the results at an IEEE conference in Izmir, Turkey, on July 16, 2012.

Part 1 – Anonymization Technologies

Data Publication with Privacy Concern

With the increased digitization of society, more and more information about the physical world and its citizens is collected and stored in databases. The collection of information by governments and corporations has created massive opportunities for knowledge-based decision making. Driven either by mutual benefit or by regulations on publicly available information, there is a demand for the exchange and publication of data among various parties. This user data can be "mined" for insights or used to build useful computer systems, such as recommendation engines. We see this every day on e-commerce sites that track a user's shopping history and analyze it to recommend new products he or she might be interested in. Online movie streaming applications do the same, tracking users' viewing history and/or self-reported ratings in order to suggest additional movies.

Simplified Data Publication Process

Given the trend towards the release of user data, user privacy has become an important concern.

People are uncomfortable with so much of their personal information being shared with a variety of third parties, many of them unidentified. As a result, data sharing has been strongly limited: released data usually contains sensitive information about users, and publishing it directly would violate their privacy. This is why it’s extremely important that users' safety and integrity are preserved.

Privacy preserving techniques can be broken down into five dimensions:

  • data distribution,
  • data modification,
  • data mining algorithm,
  • data or rule hiding, and
  • privacy preservation.

Our research deals with all of these dimensions.

Privacy preserving data publishing (PPDP) is a field of research that focuses on manipulating a user dataset to create greater user anonymity while still maintaining the value of the dataset. Using PPDP techniques, a data publisher might "anonymize" a dataset and then give the anonymized dataset, rather than the original, to a third party. That way, the receiver can still use the data for meaningful data mining activities but cannot learn private information about any particular user.

A number of PPDP techniques have been developed. One simple technique is to replace entities' names with anonymous identifiers (e.g., random numbers) or to remove the names altogether. This approach is often not enough. For example, if I know my neighbor rented “Brokeback Mountain” and “Mulholland Drive” last Saturday, I might be able to pick out my neighbor's record in the dataset released by Blockbuster, even though the name has been removed.
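To make the rental example concrete, here is a minimal Python sketch. The records, names, dates, and the helper functions `pseudonymize` and `candidates` are all hypothetical, and sequential identifiers stand in for random numbers so the result is reproducible. It only illustrates how a little background knowledge can narrow a "nameless" dataset down to one person.

```python
def pseudonymize(records):
    """Replace each user name with an opaque identifier; keep everything else."""
    ids = {}
    released = []
    for name, title, date in records:
        # Sequential ids are an assumption; a real publisher might use random numbers.
        pid = ids.setdefault(name, f"user-{len(ids) + 1}")
        released.append((pid, title, date))
    return released


def candidates(released, known_rentals):
    """Return the pseudonyms whose rentals include everything the attacker knows."""
    by_user = {}
    for pid, title, date in released:
        by_user.setdefault(pid, set()).add((title, date))
    return sorted(pid for pid, rentals in by_user.items()
                  if known_rentals <= rentals)


# Hypothetical rental log: (customer, title, date).
records = [
    ("Alice", "Brokeback Mountain", "2012-07-14"),
    ("Alice", "Mulholland Drive", "2012-07-14"),
    ("Bob", "Brokeback Mountain", "2012-07-14"),
    ("Bob", "Toy Story", "2012-07-13"),
    ("Carol", "Mulholland Drive", "2012-07-10"),
]
released = pseudonymize(records)

# What the curious neighbor already knows.
known = {("Brokeback Mountain", "2012-07-14"),
         ("Mulholland Drive", "2012-07-14")}
matches = candidates(released, known)
print(matches)
```

If only one pseudonym matches, every other record published under that pseudonym is exposed as well.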

Now here’s where it gets technical...

More complex techniques aim to stop malicious actors from reverse-engineering personal user information from the dataset considered as a whole. These include perturbation and $k$-anonymity.

In $k$-anonymity, the data publisher attempts to protect data by constructing groups of anonymous records. In the case of $k = 10$, the data publisher groups the records in clusters of at least ten and assigns the same value to all records in the same group. In this way, no record in the dataset is unique: there are always at least nine other records with exactly the same value. Because neither the number of groups nor the number of records in each group is known in advance, finding the optimal solution to this problem can be very time consuming.
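The grouping idea can be sketched in a few lines of Python. This is an illustration under assumptions, not our algorithm: the ages and the group assignment are made up, $k = 3$ is used instead of 10 to keep the data small, and the "same value" given to a group is taken to be the group mean. The function names `publish_group_means` and `is_k_anonymous` are hypothetical.

```python
from collections import Counter, defaultdict


def publish_group_means(values, group_of):
    """Replace every value with the mean of its group, so that the
    members of a group become indistinguishable from one another."""
    members = defaultdict(list)
    for value, g in zip(values, group_of):
        members[g].append(value)
    mean = {g: sum(vs) / len(vs) for g, vs in members.items()}
    return [mean[g] for g in group_of]


def is_k_anonymous(released, k):
    """True if every released value is shared by at least k records."""
    return all(count >= k for count in Counter(released).values())


# Hypothetical ages and a hand-picked grouping into three groups of three.
ages = [21, 23, 22, 40, 41, 43, 65, 66, 70]
group_of = [0, 0, 0, 1, 1, 1, 2, 2, 2]

released = publish_group_means(ages, group_of)
print(is_k_anonymous(ages, 3))      # raw data: every age is unique
print(is_k_anonymous(released, 3))  # grouped data
```

Checking $k$-anonymity is easy once the groups are fixed. The hard part is choosing groups that lose as little information as possible.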

Different sub-optimal solutions have been proposed. One approach is called “Fixed-size $k$-gather”, which constructs $\left\lfloor \frac{n}{k} \right\rfloor$ anonymous groups, where $n$ is the number of records, and assigns the remaining records to these groups. Fixed-size $k$-gather is often very efficient and simple, but it tends to create unnatural groups, which can cause high information loss. For example, suppose we have nine data points drawn from two clusters, one with four entries and one with five. With the fixed 3-gather approach, the data entries must be divided into three groups of three, so one of the groups will unavoidably contain entries from both clusters.
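The nine-point example can be reproduced with a small greedy sketch in Python. This is a simplified stand-in written for this post, not the published fixed-size $k$-gather algorithm: it builds `n // k` groups by taking a seed point and its nearest unassigned neighbors, then attaches any leftover points to the group with the nearest centroid. The point coordinates, the seed rule, and the function names are all assumptions.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points, idx):
    """Mean position of the points whose indices are in idx."""
    return tuple(sum(c) / len(idx) for c in zip(*(points[i] for i in idx)))


def fixed_size_k_gather(points, k):
    """Greedy sketch: build n // k groups of exactly k points, then attach
    each leftover point to the group with the nearest centroid."""
    n = len(points)
    if n < k:
        raise ValueError("need at least k points")
    n_groups = n // k
    unassigned = set(range(n))
    groups = []
    for _ in range(n_groups):
        seed = min(unassigned)  # deterministic seed choice (an assumption)
        nearest = sorted(unassigned,
                         key=lambda i: (dist2(points[seed], points[i]), i))[:k]
        groups.append(nearest)
        unassigned -= set(nearest)
    for i in sorted(unassigned):
        best = min(range(n_groups),
                   key=lambda g: dist2(points[i], centroid(points, groups[g])))
        groups[best].append(i)
    return groups


# Hypothetical data: a cluster of four points and a cluster of five.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
cluster = [0, 0, 0, 0, 1, 1, 1, 1, 1]

groups = fixed_size_k_gather(points, 3)
mixed = [g for g in groups if len({cluster[i] for i in g}) > 1]
print(groups)
print("groups mixing both clusters:", mixed)
```

However the seeds are chosen, three groups of three cannot split a cluster of four and a cluster of five cleanly, so at least one group mixes distant points, and replacing them with a common value loses information.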

In my next post, I’ll explain how our anonymization algorithm can provide much better results, so keep an eye out for it. In the meantime, I'm happy to address any questions or comments you have.

-- Vincent Huang, Ericsson Research
