We live in a data-driven world—one where even seemingly small amounts of data may contain enough information to identify an individual. Companies collect enormous amounts of data about their customers, and have access to large volumes of data about their employees and partners.
While that data is collected for business reasons such as payments, invoices, and so on, it's also an extremely important source of potential insight. This leaves us with a question: how can companies enable their analysts to work with data, while still respecting and protecting the privacy of individuals within that data?
SAP HANA Cloud has an answer to this issue: data anonymization. Thanks to data anonymization, organizations can now unlock the value of previously unusable data that had been restricted by privacy concerns.
On a side note: The SAP HANA Cloud data anonymization team are recipients of a 2020 Hasso Plattner Founders’ Award, in the category Products and Technology. These awards are the highest employee recognition at SAP, awarded annually by the CEO. You can find out more about the awards here.
What is Data Anonymization?
Data anonymization is a key feature of SAP HANA Cloud. It gives you a structured approach to anonymizing data: sensitive information is protected without changing your original dataset, while enough data quality is retained to support meaningful analysis.
Let’s say you have a dataset containing data about your company’s 500 employees. This dataset includes their name, age, sex, start date, salary, employee ID number, birthplace, nationality, and other sensitive data about each employee.
Unique data attributes, such as name and employee number, are called identifiers. With this information, you’d be able to instantly identify the employee. But removing or hiding identifiers isn’t enough to protect their privacy.
Other attributes, such as birthplace or nationality, are not unique but could still be used to identify individuals. These types of attributes are called quasi-identifiers. For example, if only two of your 500 employees were born in Munich, then you’d be able to identify each employee by checking other attributes, like age or salary.
True data anonymization needs to blur even these quasi-identifiers. Otherwise, individuals’ privacy is at risk.
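To make the risk concrete, here is a minimal Python sketch that counts how many records share each combination of quasi-identifier values. The records, column names, and helper functions are hypothetical and only illustrate the idea; they are not part of SAP HANA Cloud.

```python
from collections import Counter

# Hypothetical records; identifiers (name, employee ID) are already removed.
employees = [
    {"birthplace": "Munich", "age": 34, "salary": 72000},
    {"birthplace": "Munich", "age": 51, "salary": 95000},
    {"birthplace": "Berlin", "age": 34, "salary": 68000},
    {"birthplace": "Berlin", "age": 34, "salary": 70000},
    {"birthplace": "Berlin", "age": 29, "salary": 61000},
]


def group_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def unique_records(records, quasi_identifiers):
    """Return the records whose quasi-identifier combination appears only once."""
    sizes = group_sizes(records, quasi_identifiers)
    return [
        r for r in records
        if sizes[tuple(r[q] for q in quasi_identifiers)] == 1
    ]


# Records that can be singled out by birthplace and age alone.
at_risk = unique_records(employees, ["birthplace", "age"])
print(len(at_risk), "of", len(employees), "records are unique on birthplace + age")
```

Any record returned by `unique_records` can be singled out by someone who knows only the person's birthplace and age, even though no name or ID is present.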
Data Anonymization in SAP HANA Cloud
Anonymization standards are set by a data controller. This refers to someone within your company who is authorized to determine when and how personal data is accessed and processed.
The data controller creates anonymized SQL views in SAP HANA Cloud. Each view can use different strategies to anonymize the data, according to how the data needs to be used. The data controller then grants view access to specific users, allowing analysts to work on data while protecting individual privacy. These views can also be sent as data sources to other systems, such as SAP Analytics Cloud, or used in data models that combine the views with data from other systems.
These are the methods you can use to create anonymized data views in SAP HANA Cloud.
1. K-Anonymity
K-anonymity refers to the process of hiding an individual’s identity by placing them into categories with other similar individuals. The letter k refers to the minimum number of individuals that must be in a category, and is set by the data controller. Let’s say that your data controller sets k to 10.
Now, let’s revisit that dataset from earlier, containing info about your company’s 500 employees. If you have only two employees who were born in Munich, then that quasi-identifier is too specific to them. Using k-anonymity, you can make birthplace data more general by showing the birthplace as a country instead. If you have 20 employees who were born in Germany, it then becomes much harder to identify an individual based on this attribute. And since k was set to 10, with 20 German-born employees, you have more than enough individuals to fulfill your k-anonymity standard.
Here’s an example diagram of how setting k-anonymization sorts individuals into larger groups to help protect their privacy:
K-anonymity can be applied in many different ways: for example, you could also set age to show up in ranges of five or ten years, instead of a specific number. The SAP HANA data anonymization technology determines the optimal level of generalization to render data anonymous while keeping as much information as possible.
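The generalization idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the city-to-country mapping, the ten-year age ranges, and the function names are hypothetical, and the generalization level is fixed by hand, whereas SAP HANA determines a suitable level for you. A small k of 2 is used so that the toy dataset stays readable.

```python
from collections import Counter

# Hypothetical generalization hierarchy: city -> country.
CITY_TO_COUNTRY = {
    "Munich": "Germany",
    "Berlin": "Germany",
    "Hamburg": "Germany",
    "Paris": "France",
    "Lyon": "France",
}


def generalize(record):
    """Replace the city with its country and the exact age with a ten-year range."""
    low = (record["age"] // 10) * 10
    return {
        "birthplace": CITY_TO_COUNTRY[record["birthplace"]],
        "age": f"{low}-{low + 9}",
    }


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    sizes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(size >= k for size in sizes.values())


# Hypothetical sample records.
raw = [
    {"birthplace": "Munich", "age": 31},
    {"birthplace": "Berlin", "age": 35},
    {"birthplace": "Hamburg", "age": 38},
    {"birthplace": "Paris", "age": 42},
    {"birthplace": "Lyon", "age": 47},
]
QUASI_IDENTIFIERS = ["birthplace", "age"]

generalized = [generalize(r) for r in raw]
print(is_k_anonymous(raw, QUASI_IDENTIFIERS, k=2))          # every raw record is unique
print(is_k_anonymous(generalized, QUASI_IDENTIFIERS, k=2))  # groups of 3 and 2 records
```

In the raw data every record forms its own group. After generalization the records fall into two groups, (Germany, 30-39) and (France, 40-49), and each group contains at least two individuals.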
2. L-Diversity
When k-anonymity isn’t enough to protect individual privacy, you can add another layer of protection using l-diversity. L-diversity further refines k-anonymity by requiring that each k-anonymity group contains at least l distinct values of the sensitive attribute.
The importance of this is easily explained: assume that a group contains only persons of the same age. If age is the sensitive attribute, then k-anonymity alone does not hide it: because the value of age is the same for every individual within the group, anyone who can place a person in that group also learns that person’s age.
The below chart shows us a small group of your employees:
In this chart, the Age column is considered sensitive data, and we want to keep it hidden. Using the k-anonymization method of sorting our employees by birthplace will not work in this case: because both Juliette and Fabienne are the same age, even if we set birthplace to country instead, we would still be able to identify the age of each woman born in France as 28.
Depending on your l-diversity settings, this data could be further anonymized in different ways. Birthplace could be generalized to continent, protecting Juliette and Fabienne from being identified. If l-diversity parameters cannot guarantee privacy, then Juliette and Fabienne may be removed from the data that is visible to analysts to prevent them from being identified.
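A check for this condition can be sketched as follows. The rows, column names, and functions are hypothetical stand-ins loosely modeled on the example above (two people born in France who are both 28); this is not how SAP HANA Cloud implements l-diversity internally.

```python
from collections import defaultdict


def distinct_sensitive_values(records, quasi_identifiers, sensitive):
    """For each quasi-identifier group, count the distinct sensitive values it contains."""
    groups = defaultdict(set)
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        groups[key].add(r[sensitive])
    return {key: len(values) for key, values in groups.items()}


def is_l_diverse(records, quasi_identifiers, sensitive, l):
    """True if every group contains at least l distinct sensitive values."""
    counts = distinct_sensitive_values(records, quasi_identifiers, sensitive)
    return all(n >= l for n in counts.values())


# Hypothetical rows: birthplace is kept at two generalization levels for comparison.
rows = [
    {"country": "France", "continent": "Europe", "age": 28},
    {"country": "France", "continent": "Europe", "age": 28},
    {"country": "Germany", "continent": "Europe", "age": 31},
    {"country": "Germany", "continent": "Europe", "age": 45},
]

# Grouped by country, the France group holds a single age value, so age leaks.
print(is_l_diverse(rows, ["country"], "age", l=2))
# Grouped by continent, the single Europe group holds three different ages.
print(is_l_diverse(rows, ["continent"], "age", l=2))
```

Grouping by country satisfies 2-anonymity but not 2-diversity, because everyone in the France group has the same age. Generalizing to continent merges the groups and restores diversity, at the cost of less precise birthplace data.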
3. Differential Privacy
Differential privacy is also available in SAP HANA Cloud. It anonymizes data by randomizing sensitive information while still allowing meaningful analysis of the data. Differential privacy aims to ensure that, even if an individual record is removed from the dataset, the outcome of queries on the dataset will remain approximately the same. It’s best applied to numerical attributes.
Differential privacy works by adding “noise” to sensitive data to protect privacy. Let’s say that an analyst wants to get information about pay ranges at your company, based on how many years individuals have worked at the company.
In the below chart, you can see a sample of salary data. Identifiers have already been removed.
The data controller has labelled the actual salary number as sensitive data that should not be revealed to analysts. Therefore, before the data is shown to analysts, the salary attribute will have “noise” applied to it so that analysts won’t see the actual figures:
This “noise” will turn each salary value into a number that, on its own, is meaningless to analysts. But overall, analysts will still be able to get meaningful statistics from the dataset for the purpose of this exercise, such as how long-time employee salaries compare to the salaries of new starters.
The true beauty of differential privacy is that removing an individual record from a dataset barely affects the analysis outcome. Because of the “noise”, any single record has only a small influence on aggregate results, so if a record needs to be removed from the dataset for privacy reasons, the remaining records still support the same analysis. This way, a query on the dataset will return approximately the same result, whether or not individual records are missing.
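One common way to add such noise is to draw it from a Laplace distribution whose scale is the sensitivity divided by a privacy parameter epsilon. The sketch below assumes that approach for illustration only; the salary figures, the `sensitivity` and `epsilon` values, and the function names are hypothetical, and nothing here describes the internals of SAP HANA Cloud.

```python
import random
import statistics


def laplace_noise(rng, scale):
    """Draw one Laplace(0, scale) sample as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def add_noise(values, sensitivity, epsilon, rng):
    """Add independent Laplace noise with scale sensitivity / epsilon to each value."""
    scale = sensitivity / epsilon
    return [v + laplace_noise(rng, scale) for v in values]


rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical salaries for two groups of 500 employees each.
new_starters = [45000 + 20 * i for i in range(500)]
long_timers = [75000 + 20 * i for i in range(500)]

# Assumed parameters: sensitivity roughly the salary range, epsilon = 1.0.
noisy_new = add_noise(new_starters, sensitivity=40000, epsilon=1.0, rng=rng)
noisy_long = add_noise(long_timers, sensitivity=40000, epsilon=1.0, rng=rng)

true_gap = statistics.mean(long_timers) - statistics.mean(new_starters)
noisy_gap = statistics.mean(noisy_long) - statistics.mean(noisy_new)

# Dropping a single record changes the aggregate only slightly.
gap_without_one = statistics.mean(noisy_long[1:]) - statistics.mean(noisy_new)

print("true gap:", true_gap)
print("noisy gap:", round(noisy_gap))
print("noisy gap with one record removed:", round(gap_without_one))
```

Individual noisy salaries can be far from the real figures, yet the gap between the two group averages stays close to the true gap, because the noise largely averages out over many records. A smaller epsilon would mean more noise and stronger privacy, but less accurate statistics.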
Data Anonymization: A Necessity in a Data-driven World
SAP HANA Cloud data anonymization enables and facilitates new applications and business models, such as the syndication of data that had been constrained by privacy concerns. As a direct result, companies can now reveal value that would have otherwise been lost, either because data was deemed inaccessible or made too generic to draw useful analysis from it.
The examples that we’ve covered above are only a tiny fraction of the possibilities that data anonymization offers. SAP HANA Cloud data anonymization can be applied to any industry and any organization, from healthcare to environmental services. And it’s included within SAP HANA Cloud, ready to be used.
Want to learn more about how to use data anonymization? You’ll find the data anonymization learning track here.