Preserving Privacy in a Big Data World – Part 3
Part 3 – Privacy or Service Quality: A Trade-Off
In my previous posts – Part 1 and Part 2 – I described how our anonymization technique anonymizes user data before it’s published. The released data is actually user information aggregated at different levels. Normally, even releasing aggregated user data requires user consent. However, consider the GDP of a country: it aggregates information from all citizens, yet no consent is required to publish it. This may seem like a strictly legal question of whether consent is required before user data is published, but it also has a technical side: which privacy level is chosen for each user.
Not all users will have the same privacy concerns. For instance, in a movie recommender system, some subscribers may be willing to give up a little bit of their privacy to get more accurate recommendations. In this case, the privacy level is tightly connected to the service quality. When some users give up part of their privacy, the overall anonymization quality can potentially improve, in the sense that less information is lost.
To study the relationship between privacy level and service quality, we have modified our Bisecting $k$-gather (BKG), adding functions so that it serves as a heuristic for the 1-$k$-gather problem. The new heuristic is called bisecting one-$k$-gather (BOKG). Put simply, we have two groups of users. Users in the first group do not care about their privacy, so such a user can even be alone in a cluster. Users in the second group all share the same $k$-anonymity requirement: each of them has to be hidden in a cluster of at least $k$ users. We therefore call this the 1-$k$-gather problem.
To illustrate the new algorithm, we used the same example as in my previous post and randomly assigned the 12 points to these two groups. The red circles represent users who don’t have any privacy concerns, and the black dots represent users with 3-anonymity privacy requirements (at least 3 users in each group). With our modified algorithm, we obtained six clusters: three of them contain three users each, while the other three contain only one user each. Users are free to prioritize either privacy or service quality, and all of the users’ privacy requirements were satisfied.
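The requirement that every user's privacy level is satisfied can be written as a simple check on a finished clustering. The sketch below is only an illustration of that constraint, not the BOKG heuristic itself: the function name `satisfies_privacy`, the user labels `u0` to `u11`, and the split into 3 unconcerned and 9 concerned users are all assumptions made for the example.

```python
from typing import Dict, List


def satisfies_privacy(clusters: List[List[str]], required_size: Dict[str, int]) -> bool:
    """Return True if every user sits in a cluster at least as large as
    that user's own requirement (1 = no privacy concern, k = k-anonymity)."""
    for cluster in clusters:
        # The strictest member decides how large the cluster must be.
        needed = max((required_size[user] for user in cluster), default=0)
        if len(cluster) < needed:
            return False
    return True


# Hypothetical assignment: u0..u2 have no privacy concerns,
# u3..u11 require 3-anonymity.
required_size = {f"u{i}": 1 for i in range(3)}
required_size.update({f"u{i}": 3 for i in range(3, 12)})

# Six clusters: three singletons and three clusters of three users.
clusters = [
    ["u0"], ["u1"], ["u2"],
    ["u3", "u4", "u5"], ["u6", "u7", "u8"], ["u9", "u10", "u11"],
]

ok = satisfies_privacy(clusters, required_size)
# A 3-anonymity user placed in a cluster of two violates the requirement.
bad = satisfies_privacy([["u0", "u3"]], required_size)
```

Note that a user without privacy concerns may still end up in a larger cluster; the check only enforces a minimum size per user.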
Now, the interesting question: is there any difference in service quality for these two groups of users?
We used the same movie recommendation data that I told you about in my last post and were able to predict user ratings.
The result is rather interesting as shown in the following figure:
We again studied the mean absolute error (MAE) as a function of $k$. The solid line represents the performance when all users have the same privacy requirements, as described last time. In the mixed case, the dotted line is the performance for the group of users with privacy concerns, and the dashed line is the performance for the group of users without privacy concerns.
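For readers who want to reproduce this kind of per-group comparison on their own data, here is a minimal sketch of computing MAE separately for each group of users. The function names, the group labels `"open"` and `"private"`, and the ratings are made up for illustration; they are not taken from our evaluation.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def mae_by_group(
    ratings: List[Tuple[str, float, float]],
    group_of: Dict[str, str],
) -> Dict[str, float]:
    """Compute the mean absolute error per user group.

    ratings: (user, predicted_rating, actual_rating) triples.
    group_of: maps each user to a group label.
    """
    abs_errors = defaultdict(list)
    for user, predicted, actual in ratings:
        abs_errors[group_of[user]].append(abs(predicted - actual))
    # MAE = average of |predicted - actual| within each group.
    return {group: sum(errs) / len(errs) for group, errs in abs_errors.items()}


# Toy data (hypothetical): two users per group.
group_of = {"a": "open", "b": "open", "c": "private", "d": "private"}
ratings = [
    ("a", 4.0, 5.0),  # error 1.0
    ("b", 3.0, 3.0),  # error 0.0
    ("c", 2.0, 4.0),  # error 2.0
    ("d", 5.0, 4.0),  # error 1.0
]

result = mae_by_group(ratings, group_of)
```

Repeating this for each value of $k$ gives one curve per group.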
The evaluation results show that users who have sacrificed their privacy get better recommendations on average. However, even for these users, recommendations get worse as the other users’ privacy levels increase, because the data available from those other users becomes less accurate.
Our research showed that there is always a trade-off between privacy and service quality. When users are provided with high-quality personalized and customized services, they have to be aware of what privacy they may lose. The good news is that they have the possibility to choose. We are continuing to work on many interesting research topics in this area, such as optimizing QoS with fixed privacy requirements, clustering users with multiple privacy levels, dynamic privacy settings, etc. If you have any questions, please drop us a line.
-- Vincent Huang, Ericsson Research