{"id":27510,"date":"2024-04-29T05:00:00","date_gmt":"2024-04-29T03:00:00","guid":{"rendered":"https:\/\/sii.pl\/blog\/?p=27510"},"modified":"2024-05-13T12:44:00","modified_gmt":"2024-05-13T10:44:00","slug":"data-pseudonymization-in-google-cloud","status":"publish","type":"post","link":"https:\/\/sii.pl\/blog\/en\/data-pseudonymization-in-google-cloud\/","title":{"rendered":"Data pseudonymization in Google Cloud"},"content":{"rendered":"\n<p>Pseudonymisation and anonymization of Personal Identifiable Information (PII) are often confused. Both techniques are relevant within the General Data Protection Regulation (GDPR) context. This confusion arises because, in legal terms, personal data is information that can directly identify a person. Still, data that doesn&#8217;t directly identify a person is also considered personal data.<\/p>\n\n\n\n<p>Pseudonymization is a method that allows you to switch the original data set (for example, e-mail or a name) with an alias or pseudonym. It is a reversible process that de-identifies data but allows for re-identification later on if necessary.<\/p>\n\n\n\n<p>Anonymization is a technique that irreversibly alters data so an individual is no longer identifiable directly or indirectly.<\/p>\n\n\n\n<p>Although both are used to secure data, they are not the same. 
The following figure illustrates the distinctions between the two techniques:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><a href=\"https:\/\/sii.pl\/blog\/wp-content\/uploads\/2024\/04\/image1-1.png\"><img decoding=\"async\" width=\"1024\" height=\"546\" src=\"https:\/\/sii.pl\/blog\/wp-content\/uploads\/2024\/04\/image1-1-1024x546.png\" alt=\"The distinctions between pseudonymization and anonymization\" class=\"wp-image-27511\" srcset=\"https:\/\/sii.pl\/blog\/wp-content\/uploads\/2024\/04\/image1-1-1024x546.png 1024w, https:\/\/sii.pl\/blog\/wp-content\/uploads\/2024\/04\/image1-1-300x160.png 300w, https:\/\/sii.pl\/blog\/wp-content\/uploads\/2024\/04\/image1-1-768x410.png 768w, https:\/\/sii.pl\/blog\/wp-content\/uploads\/2024\/04\/image1-1.png 1500w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/a><figcaption class=\"wp-element-caption\">Fig. 1 The distinctions between pseudonymization and anonymization<\/figcaption><\/figure>\n\n\n\n<p>Compared to anonymization, pseudonymization is a much more sophisticated option since it leaves you the key (also known as a crypto key) to \u201cunlock\u201d the data. This way, data is not regarded as immediately identifiable, yet it is also not anonymized, so it retains its original value.<\/p>\n\n\n\n<p>Anonymized, non-personal data is aggregated or altered to the point that particular events can no longer be associated with a specific individual. This increases data privacy and enables organizations to comply with GDPR and other data protection rules.<\/p>\n\n\n\n<p>To demonstrate pseudonymization, consider tokenizing each row such that it may be returned to its original value. 
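That tokenization idea can be sketched in a few lines of plain Python. Note this is only an illustration of determinism and referential integrity: the demo key is hypothetical, and a keyed hash is used here for simplicity, whereas Cloud DLP's actual deterministic transformation is a reversible, KMS-protected encryption.

```python
import hashlib
import hmac

# Hypothetical demo key; with Cloud DLP this would be a KMS-wrapped crypto key.
KEY = b"demo-secret-key"

def tokenize(value: str) -> str:
    """Deterministic token: the same input always yields the same token,
    so joins and aggregate queries on the column keep working."""
    return hmac.new(KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

def mask(value: str) -> str:
    """Anonymization-style masking: the original value is irreversibly lost."""
    return "*" * 8

email = "john.doe@example.com"
assert tokenize(email) == tokenize(email)               # deterministic
assert tokenize(email) != tokenize("jane@example.com")  # distinct values, distinct tokens
print(tokenize(email), mask(email))
```

Unlike this one-way HMAC sketch, a real DLP token can be reversed by anyone holding the crypto key, which is exactly what distinguishes pseudonymization from anonymization.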
To demonstrate anonymization, try changing all values in a certain column to ******** or null, thereby making the data meaningless.<\/p>\n\n\n\n<p>A more detailed comparison can be seen in this table:<\/p>\n\n\n\n<figure class=\"wp-block-table aligncenter\"><table><tbody><tr><td><strong>Pseudonymization<\/strong><\/td><td><strong>Anonymization<\/strong><\/td><\/tr><tr><td>Protects data at a record level<\/td><td>Protects entire datasets (columns)<\/td><\/tr><tr><td>Re-identification possible<\/td><td>Re-identification not possible<\/td><\/tr><tr><td>Still considered PII<\/td><td>Not considered PII according to GDPR<\/td><\/tr><tr><td>Business value retained<\/td><td>Business value lost<\/td><\/tr><tr><td>More sophisticated, harder to implement<\/td><td>Simpler, easier to implement<\/td><\/tr><\/tbody><\/table><figcaption class=\"wp-element-caption\">Tab. 1 Pseudonymization and anonymization \u2013 comparison<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>De-identification and re-identification with Cloud DLP<\/strong><\/h2>\n\n\n\n<p>De-identification and re-identification of PII are critical processes for organizations that handle sensitive data. Google Cloud offers a couple of powerful solutions to this problem. One of them is the Cloud Data Loss Prevention (DLP) service, which is now a part of Sensitive Data Protection. It provides a suite of tools for de-identifying and re-identifying PII.<\/p>\n\n\n\n<p>The de-identification process involves removing or obfuscating any information that can be used to identify an individual while still preserving the utility of the data. Cloud DLP offers several techniques for de-identification, including redaction, replacement, masking, tokenization, or bucketing.<\/p>\n\n\n\n<p>Different transformations can be applied to various data objects, including unstructured text, records, and images. 
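For unstructured text, a de-identify request could take the following shape. This is only a sketch of the REST payload for the `projects.content.deidentify` method, here replacing any detected EMAIL_ADDRESS with its infoType name; the sample text is made up.

```python
import json

# Sketch of a projects.content.deidentify request body: any EMAIL_ADDRESS
# detected in the free text is replaced with the literal "[EMAIL_ADDRESS]".
deidentify_request = {
    "item": {"value": "Contact john.doe@example.com for details"},
    "inspectConfig": {"infoTypes": [{"name": "EMAIL_ADDRESS"}]},
    "deidentifyConfig": {
        "infoTypeTransformations": {
            "transformations": [
                {
                    "infoTypes": [{"name": "EMAIL_ADDRESS"}],
                    "primitiveTransformation": {"replaceWithInfoTypeConfig": {}},
                }
            ]
        }
    },
}
print(json.dumps(deidentify_request, indent=2))
```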
These transformations help organizations effectively de-identify PII within their datasets, ensuring compliance with data privacy regulations and minimizing the risk of unauthorized exposure.<\/p>\n\n\n\n<p>The vast majority of methods are meant to be used on values in tabular data that are marked as a certain infoType. You can create and manage these configurations with de-identification templates within DLP. Templates let you reuse the same set of transformations repeatedly and keep the configuration separate from how requests are carried out.<\/p>\n\n\n\n<p>DLP&#8217;s deidentifyConfig allows for many different transformations, such as primitiveTransformations and infoTypeTransformations, to handle sensitive data properly. These features allow businesses to handle and protect their private data on a large scale while ensuring they follow data privacy laws and lowering the risk of data being seen by people who aren&#8217;t supposed to see it.<\/p>\n\n\n\n<p>For example, here is a de-identification template that masks all characters in email addresses, except for \u201c.\u201d and \u201c@\u201d:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\n&quot;deidentifyConfig&quot;:{\n    &quot;infoTypeTransformations&quot;:{\n      &quot;transformations&quot;:&#x5B;\n        {\n          &quot;infoTypes&quot;:&#x5B;\n            {\n              &quot;name&quot;:&quot;EMAIL_ADDRESS&quot;\n            }\n          ],\n          &quot;primitiveTransformation&quot;:{\n            &quot;characterMaskConfig&quot;:{\n              &quot;maskingCharacter&quot;:&quot;#&quot;,\n              &quot;reverseOrder&quot;:false,\n              &quot;charactersToIgnore&quot;:&#x5B;\n                {\n                  &quot;charactersToSkip&quot;:&quot;.@&quot;\n                }\n              ]\n            }\n          }\n        }\n      ]\n    }\n  },\n\n<\/pre><\/div>\n\n\n<p>Re-identification, on the other hand, involves matching de-identified 
data with its original source. This is useful in cases where the de-identified data needs to be re-associated with its original source for analysis or other purposes.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Reference architecture of data pseudonymization using Google Cloud products<\/strong><\/h2>\n\n\n\n<p>This article will explore using Cloud DLP with Cloud Dataflow (Google\u2019s data processing engine) to pseudonymize PII data using a de-identification template that uses a deterministic crypto tokenization.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><a href=\"https:\/\/sii.pl\/blog\/wp-content\/uploads\/2024\/04\/image3-2.png\"><img decoding=\"async\" width=\"736\" height=\"465\" src=\"https:\/\/sii.pl\/blog\/wp-content\/uploads\/2024\/04\/image3-2.png\" alt=\"Reference architecture of data pseudonymization using Google Cloud products\" class=\"wp-image-27518\" srcset=\"https:\/\/sii.pl\/blog\/wp-content\/uploads\/2024\/04\/image3-2.png 736w, https:\/\/sii.pl\/blog\/wp-content\/uploads\/2024\/04\/image3-2-300x190.png 300w\" sizes=\"(max-width: 736px) 100vw, 736px\" \/><\/a><figcaption class=\"wp-element-caption\">Fig. 2 Reference architecture of data pseudonymization using Google Cloud products<\/figcaption><\/figure>\n\n\n\n<p>Dataflow is a fully managed service for executing data processing pipelines. It supports parallel execution of tasks, auto-scaling, and seamless integration with other Google Cloud services, making it an ideal choice for processing large volumes of data.<\/p>\n\n\n\n<p>Here, we use it to read the data from a source that can be anything: object storage, RDBMS, NoSQL database, data warehouse, API. Then, we use a de-identification template secured with a crypto key to perform a recordTransformation on our data. 
Dataflow calls Cloud DLP API with the data to be transformed and retrieves data obfuscated by one of the available techniques.<\/p>\n\n\n\n<p>In this example, we use tokenization, one way to pseudonymize the data. It replaces the original data with a deterministic token, preserving referential integrity. You can use the token to join data or use the token in aggregate analysis. You can reverse or re-identify the data using the same key you used to create the token. This key is itself encrypted with a Cloud Key Management Service (KMS) key, which is why it is called a wrapped key.<\/p>\n\n\n\n<p>You can think of the whole process as replacing every identifying value in each row with a token. It allows you to query and aggregate the data, but you no longer see the original data unless you have permission to re-identify it.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><a href=\"https:\/\/sii.pl\/blog\/wp-content\/uploads\/2024\/04\/image4-1.png\"><img decoding=\"async\" width=\"920\" height=\"495\" src=\"https:\/\/sii.pl\/blog\/wp-content\/uploads\/2024\/04\/image4-1.png\" alt=\"Tokenization and Replacement\" class=\"wp-image-27521\" style=\"width:848px;height:auto\" srcset=\"https:\/\/sii.pl\/blog\/wp-content\/uploads\/2024\/04\/image4-1.png 920w, https:\/\/sii.pl\/blog\/wp-content\/uploads\/2024\/04\/image4-1-300x161.png 300w, https:\/\/sii.pl\/blog\/wp-content\/uploads\/2024\/04\/image4-1-768x413.png 768w\" sizes=\"(max-width: 920px) 100vw, 920px\" \/><\/a><figcaption class=\"wp-element-caption\">Fig. 
3 Tokenization and Replacement<\/figcaption><\/figure>\n\n\n\n<p>By adhering to Google&#8217;s best practices and leveraging DLP templates, organizations can effectively manage and secure their sensitive data at scale while also ensuring compliance with data privacy regulations and minimizing the risk of unauthorized exposure.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Provisioning of the DLP template<\/strong><\/h2>\n\n\n\n<p>First, we must create a blueprint for our data pseudonymization techniques. For this example, I&#8217;ll utilize the official Google Cloud Terraform provider.<\/p>\n\n\n\n<p>The following HCL code demonstrates how to create a Cloud DLP de-identification template that can be used to pseudonymize data:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nresource &quot;google_data_loss_prevention_deidentify_template&quot; &quot;dlp_template_tokenization&quot; {\n  parent       = &quot;projects\/${var.project_id}\/locations\/${local.regions.primary}&quot;\n  display_name = &quot;dlp_pii_tokenization&quot;\n\n  deidentify_config {\n    record_transformations {\n      field_transformations {\n        fields {\n          name = &quot;PII_FIELD&quot;\n        }\n        primitive_transformation {\n          crypto_deterministic_config {\n            crypto_key {\n              kms_wrapped {\n                wrapped_key     = data.google_secret_manager_secret_version.dlp_wrapped_key.secret_data\n                crypto_key_name = &quot;projects\/${var.project_id}\/locations\/${local.regions.primary}\/keyRings\/${var.project_id}\/cryptoKeys\/dlp-${var.environment}&quot;\n              }\n            }\n          }\n        }\n      }\n    }\n  }\n}\n\n<\/pre><\/div>\n\n\n<p>We apply a recordTransformation of the deterministic crypto type, which requires a crypto key and a wrapped key. 
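One way to produce that key material (a sketch only; how you generate and wrap the key is up to you, and the `gcloud` invocation with placeholder key names is shown as an assumption in the comments) is to generate a random 256-bit data encryption key locally, wrap it with a Cloud KMS key, and store the base64-encoded result in Secret Manager:

```python
import base64
import secrets

# Generate a random 256-bit data encryption key (DEK).
dek = secrets.token_bytes(32)

# The DEK would then be wrapped (encrypted) with a Cloud KMS key, e.g.:
#   gcloud kms encrypt --location=<region> --keyring=<ring> --key=<key> \
#       --plaintext-file=dek.bin --ciphertext-file=dek.enc
# and the base64-encoded ciphertext stored in Secret Manager, from where
# the Terraform data source reads it.
encoded = base64.b64encode(dek).decode()
assert base64.b64decode(encoded) == dek  # base64 round-trip sanity check
print(len(dek), encoded)
```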
A wrapped key is a base64-encoded data encryption key, while a crypto key is a Cloud KMS \/ HSM \/ EKM key used to decrypt a wrapped key. It is up to you how you create them.<\/p>\n\n\n\n<p>The complete list of possible transformations may be found in <a aria-label=\" (opens in a new tab)\" href=\"https:\/\/cloud.google.com\/sensitive-data-protection\/docs\/reference\/rest\/v2\/projects.deidentifyTemplates#deidentifyconfig\" target=\"_blank\" rel=\"noreferrer noopener\" class=\"ek-link\" rel=\"nofollow\" >the Cloud DLP API Reference<\/a>. [7] After applying this with Terraform, the Cloud DLP de-identification template will appear in your GCP project.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Data processing with Apache Beam Python SDK<\/strong><\/h2>\n\n\n\n<p>After deploying DLP-related resources, we can finally utilize Dataflow to pseudonymize the PII data. Creating and executing jobs in Dataflow requires developing code using the Apache Beam framework.<\/p>\n\n\n\n<p>Apache Beam is an open-source, unified model for defining batch and streaming data-parallel processing pipelines. Using one of the open-source Beam SDKs, you build a program that defines the pipeline. The pipeline is then executed by one of Beam\u2019s supported distributed processing back-ends, which include Apache Flink, Apache Spark, and Google Cloud Dataflow. 
[8]<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><a href=\"https:\/\/sii.pl\/blog\/wp-content\/uploads\/2024\/04\/image5-1.png\"><img decoding=\"async\" width=\"1024\" height=\"913\" src=\"https:\/\/sii.pl\/blog\/wp-content\/uploads\/2024\/04\/image5-1-1024x913.png\" alt=\"Data processing with Apache Beam Python SDK\" class=\"wp-image-27524\" srcset=\"https:\/\/sii.pl\/blog\/wp-content\/uploads\/2024\/04\/image5-1-1024x913.png 1024w, https:\/\/sii.pl\/blog\/wp-content\/uploads\/2024\/04\/image5-1-300x267.png 300w, https:\/\/sii.pl\/blog\/wp-content\/uploads\/2024\/04\/image5-1-768x685.png 768w, https:\/\/sii.pl\/blog\/wp-content\/uploads\/2024\/04\/image5-1.png 1375w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/a><figcaption class=\"wp-element-caption\">Fig. 4 Data processing with Apache Beam Python SDK<\/figcaption><\/figure>\n\n\n\n<p>The following Apache Beam Python SDK pseudocode snippet demonstrates how to achieve pseudonymization on the fly while migrating data from one source to another:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport logging\n\nimport apache_beam as beam\nfrom apache_beam.io.gcp.bigquery import BigQueryDisposition, WriteToBigQuery\nfrom apache_beam.io.jdbc import ReadFromJdbc\nfrom apache_beam.options.pipeline_options import PipelineOptions\nfrom google.cloud import dlp_v2\nfrom google.cloud.dlp_v2 import types\n\n\nclass CustomPipelineOptions(PipelineOptions):\n    @classmethod\n    def _add_argparse_args(cls, parser):\n        # JDBC parameters\n        parser.add_value_provider_argument(&quot;--jdbc_url&quot;, type=str, help=&quot;JDBC connection URL&quot;)\n        parser.add_value_provider_argument(&quot;--driver_class_name&quot;, type=str, help=&quot;JDBC driver class name&quot;)\n        parser.add_value_provider_argument(&quot;--username&quot;, type=str, help=&quot;JDBC connection username&quot;)\n        
parser.add_value_provider_argument(&quot;--password&quot;, type=str, help=&quot;JDBC connection password&quot;)\n        parser.add_value_provider_argument(&quot;--src_table_name&quot;, type=str, help=&quot;Source table name&quot;)\n\n        # BigQuery parameters\n        parser.add_value_provider_argument(&quot;--project_id&quot;, type=str, help=&quot;GCP project ID&quot;)\n        parser.add_value_provider_argument(&quot;--dataset_id&quot;, type=str, help=&quot;BigQuery dataset ID&quot;)\n        parser.add_value_provider_argument(&quot;--dest_table_name&quot;, type=str, help=&quot;BigQuery destination table name&quot;)\n\n        # DLP parameters\n        parser.add_value_provider_argument(&quot;--project&quot;, type=str, help=&quot;GCP project ID for DLP&quot;)\n        parser.add_value_provider_argument(&quot;--location&quot;, type=str, help=&quot;DLP service region\/location&quot;)\n        parser.add_value_provider_argument(&quot;--deidentify_template_name&quot;, type=str, help=&quot;De-identify template name in DLP&quot;)\n        parser.add_value_provider_argument(&quot;--columns_to_pseudonymize&quot;, type=str, help=&quot;Comma-separated list of columns to pseudonymize&quot;)\n\n\nclass PseudonymizeData(beam.DoFn):\n    def __init__(self, options):\n        self.options = options\n        self.client = None\n\n    def start_bundle(self):\n        &quot;&quot;&quot;Initialize resources at the start of a bundle.&quot;&quot;&quot;\n        # The DLP client is initialized here to ensure it&#039;s set up in the worker&#039;s context\n        self.client = dlp_v2.DlpServiceClient()\n\n    def deidentify_table(self, table, project, location, deidentify_template_name, columns_to_pseudonymize):\n        &quot;&quot;&quot;De-identify a table using the DLP API.&quot;&quot;&quot;\n        parent = f&quot;projects\/{project}\/locations\/{location}&quot;\n\n        headers = &#x5B;types.FieldId(name=&quot;PII_FIELD&quot; if col in columns_to_pseudonymize else col) for 
col in table&#x5B;0].keys()]\n        rows = &#x5B;types.Table.Row(values=&#x5B;types.Value(string_value=str(cell)) for cell in row.values()]) for row in table]\n\n        table_item = types.ContentItem(table=types.Table(headers=headers, rows=rows))\n\n        deidentify_request = types.DeidentifyContentRequest(parent=parent, deidentify_template_name=deidentify_template_name, item=table_item)\n\n        response = self.client.deidentify_content(request=deidentify_request)\n        deidentified_table = &#x5B;{col: value.string_value for col, value in zip(table&#x5B;0].keys(), row.values)} for row in response.item.table.rows]\n        return deidentified_table\n\n    def process(self, element):\n        &quot;&quot;&quot;Process each element (a table) in the PCollection.&quot;&quot;&quot;\n        project = self.options.project.get()\n        location = self.options.location.get()\n        deidentify_template_name = self.options.deidentify_template_name.get()\n        columns_to_pseudonymize = self.options.columns_to_pseudonymize.get().split(&quot;,&quot;)\n\n        deidentified_table = self.deidentify_table(element, project, location, deidentify_template_name, columns_to_pseudonymize)\n        yield deidentified_table\n\n\ndef main(argv=None):\n    pipeline_options = PipelineOptions(argv)\n    custom_options = pipeline_options.view_as(CustomPipelineOptions)\n\n    with beam.Pipeline(options=pipeline_options) as p:\n        (\n            p\n            | &quot;ReadFromJdbc&quot;\n            &gt;&gt; ReadFromJdbc(\n                table_name=custom_options.src_table_name.get(),\n                driver_class_name=custom_options.driver_class_name.get(),\n                jdbc_url=custom_options.jdbc_url.get(),\n                username=custom_options.username.get(),\n                password=custom_options.password.get(),\n            )\n            | &quot;PseudonymizeData&quot; &gt;&gt; beam.ParDo(PseudonymizeData(custom_options))\n            | 
&quot;WriteToBigQuery&quot;\n            &gt;&gt; WriteToBigQuery(\n                project=custom_options.project_id.get(),\n                dataset=custom_options.dataset_id.get(),\n                table=custom_options.dest_table_name.get(),\n                create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,\n                write_disposition=BigQueryDisposition.WRITE_TRUNCATE,\n                method=WriteToBigQuery.Method.FILE_LOADS,\n            )\n        )\n\n\nif __name__ == &quot;__main__&quot;:\n    logging.getLogger().setLevel(logging.INFO)\n    main()\n\n<\/pre><\/div>\n\n\n<p>In this Python SDK code sample, we build a pipeline that reads a table over JDBC from a relational database such as PostgreSQL. The PseudonymizeData DoFn leverages the Cloud DLP API to tokenize the data: its deidentify_table() method calls deidentify_content(), which uses the supplied de-identification template to de-identify items. We pseudonymize fields from all rows based on the &#8220;columns_to_pseudonymize&#8221; pipeline option. We receive tokenized data and write it to a BigQuery table.<\/p>\n\n\n\n<p>The following steps are intentionally left out: building a Dataflow template, either classic or flex, creating a Google Cloud Storage (GCS) bucket to store temporary data, and creating a BigQuery dataset and table. You should execute these steps in conjunction with your existing configuration and requirements.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Triggering and scheduling a Cloud Dataflow job with Apache Airflow<\/strong><\/h2>\n\n\n\n<p>When you create a Dataflow template, nothing prevents you from scheduling the application to migrate a daily batch of fresh pseudonymized data to your destination.<\/p>\n\n\n\n<p>A plethora of services and solutions are available for this. I suggest the tried-and-true method of orchestrating ETL processes using Apache Airflow. 
You can use Cloud Composer, a managed Airflow service on Google Cloud.<\/p>\n\n\n\n<p>Here&#8217;s an example of how to trigger a Dataflow application with a Flex template using an official Apache Airflow operator:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nfrom airflow.decorators import dag\nfrom airflow.providers.google.cloud.operators.dataflow import DataflowStartFlexTemplateOperator\nfrom airflow.timetables.trigger import CronTriggerTimetable\nfrom datetime import datetime, timedelta\n\nfrom dependencies.callbacks import on_failure_function\n\ndefault_args = {\n    &quot;owner&quot;: &quot;Someone&quot;,\n    &quot;email&quot;: &quot;someone@example.com&quot;,\n    &quot;on_failure_callback&quot;: on_failure_function,\n    &quot;retries&quot;: 1,\n    &quot;retry_delay&quot;: timedelta(minutes=1),\n}\n\n\n# Constants such as REGION, PROJECT_ID, DATAFLOW_BUCKET, TEMPLATE_VERSION,\n# TEMPLATE_NAME and DF_TASK_TIMEOUT, as well as the dataflow_* values, are\n# assumed to be defined in your project configuration.\n@dag(\n    dag_id=&quot;dataflow_pii&quot;,\n    default_args=default_args,\n    schedule=CronTriggerTimetable(&quot;0 0 * * *&quot;, timezone=&quot;UTC&quot;),\n    start_date=datetime(2024, 1, 1),\n    catchup=False,\n    is_paused_upon_creation=True,\n    tags=&#x5B;&quot;dataflow&quot;, &quot;pii&quot;, &quot;example&quot;],\n)\ndef dataflow_pii_dag():\n    DataflowStartFlexTemplateOperator(\n        task_id=&quot;run_dataflow_pii_job&quot;,\n        location=REGION,\n        project_id=PROJECT_ID,\n        body={\n            &quot;launchParameter&quot;: {\n                &quot;jobName&quot;: dataflow_job_name,\n                &quot;parameters&quot;: dataflow_parameters,\n                &quot;environment&quot;: dataflow_environment,\n                &quot;containerSpecGcsPath&quot;: f&quot;gs:\/\/{DATAFLOW_BUCKET}\/tag\/{TEMPLATE_VERSION}\/{TEMPLATE_NAME}.json&quot;,\n            }\n        },\n        cancel_timeout=360,\n        deferrable=True,\n        append_job_name=True,\n        execution_timeout=DF_TASK_TIMEOUT,\n    
)\n\n\ndataflow_pii_dag()\n\n<\/pre><\/div>\n\n\n<p>In this example, we define a Directed Acyclic Graph (DAG) that schedules the pseudonymization process to run daily. The DataflowStartFlexTemplateOperator is used to trigger the Dataflow job. Various other operators can run Dataflow jobs in different ways.<\/p>\n\n\n\n<p>The launchParameter dict contains not only the job name but also parameters for the template, such as source and destination configuration, addresses for sensitive values stored in Secret Manager, and any other parameters that may be used as application variables.<\/p>\n\n\n\n<p>The Dataflow environment includes values set at runtime, such as the maximum number of workers, the compute engine availability zone, the service account used to run the job, the machine type size, the networking configuration for VMs, and more. The containerSpecGcsPath is the Cloud Storage path to the serialized Flex template JSON file.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Summary<\/strong><\/h2>\n\n\n\n<p>By following these steps, organizations can leverage Cloud DLP and Dataflow to pseudonymize PII data securely and at scale. The use of Terraform for resource provisioning and Dataflow for data processing enables efficient and consistent management of sensitive information. This is feasible because of de-identification templates, which are an excellent way to keep track of how sensitive data is handled.<\/p>\n\n\n\n<p>Apache Airflow, a popular open-source workflow management platform, can schedule the Dataflow job. 
By doing so, organizations can automate the pseudonymization process and ensure the secure handling of sensitive data.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Sources<\/strong><\/h2>\n\n\n\n<ol class=\"wp-block-list\">\n<li><a href=\"https:\/\/www.privacycompany.eu\/blog\/what-are-the-differences-between-anonymisation-and-pseudonymisation\" target=\"_blank\" aria-label=\" (opens in a new tab)\" rel=\"noreferrer noopener\" class=\"ek-link\" rel=\"nofollow\" >What are the Differences Between Anonymisation and Pseudonymisation, Privacy Company<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/dataprivacymanager.net\/pseudonymization-according-to-the-gdpr\/\" target=\"_blank\" aria-label=\" (opens in a new tab)\" rel=\"noreferrer noopener\" class=\"ek-link\" rel=\"nofollow\" >Pseudonymization according to the GDPR<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/cloud.google.com\/sensitive-data-protection\/docs\/inspect-sensitive-text-de-identify\" target=\"_blank\" aria-label=\" (opens in a new tab)\" rel=\"noreferrer noopener\" class=\"ek-link\" rel=\"nofollow\" >De-identify and re-identify sensitive data<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/cloud.google.com\/sensitive-data-protection\/docs\/deidentify-sensitive-data#charactermaskconfig\" target=\"_blank\" aria-label=\" (opens in a new tab)\" rel=\"noreferrer noopener\" class=\"ek-link\" rel=\"nofollow\" >De-identifying sensitive data<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/cloud.google.com\/architecture\/de-identification-re-identification-pii-using-cloud-dlp\" target=\"_blank\" aria-label=\" (opens in a new tab)\" rel=\"noreferrer noopener\" class=\"ek-link\" rel=\"nofollow\" >De-identification and re-identification of PII in large-scale datasets using Sensitive Data Protection<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/registry.terraform.io\/providers\/hashicorp\/google\/latest\/docs\" target=\"_blank\" aria-label=\" (opens in a new tab)\" rel=\"noreferrer noopener\" class=\"ek-link\" rel=\"nofollow\" >Terraform 
provider for Google Cloud<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/cloud.google.com\/sensitive-data-protection\/docs\/reference\/rest\/v2\/projects.deidentifyTemplates#deidentifyconfig\" target=\"_blank\" aria-label=\" (opens in a new tab)\" rel=\"noreferrer noopener\" class=\"ek-link\" rel=\"nofollow\" >DeidentifyConfig<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/beam.apache.org\/get-started\/beam-overview\/\" target=\"_blank\" aria-label=\" (opens in a new tab)\" rel=\"noreferrer noopener\" class=\"ek-link\" rel=\"nofollow\" >Apache Beam Overview<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/airflow.apache.org\/docs\/apache-airflow-providers-google\/stable\/operators\/cloud\/dataflow.html#templated-jobs\" target=\"_blank\" aria-label=\" (opens in a new tab)\" rel=\"noreferrer noopener\" class=\"ek-link\" rel=\"nofollow\" >Google Cloud Dataflow Operators<\/a><\/li>\n<\/ol>\n","protected":false},"excerpt":{"rendered":"<p>Pseudonymization and anonymization of Personally Identifiable Information (PII) are often confused. Both techniques are relevant within the General Data Protection &hellip; <a class=\"continued-btn\" href=\"https:\/\/sii.pl\/blog\/en\/data-pseudonymization-in-google-cloud\/\">Continued<\/a><\/p>\n","protected":false},"author":628,"featured_media":27528,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"_editorskit_title_hidden":false,"_editorskit_reading_time":0,"_editorskit_is_block_options_detached":false,"_editorskit_block_options_position":"{}","inline_featured_image":false,"footnotes":""},"categories":[1320],"tags":[2200,2199,1578,1526],"class_list":["post-27510","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-hard-development","tag-data","tag-google","tag-cloud-2","tag-guidebook"],"acf":[],"aioseo_notices":[],"republish_history":[],"featured_media_url":"https:\/\/sii.pl\/blog\/wp-content\/uploads\/2024\/04\/Data-pseudonymization-in-Google-Cloud-in-practice-.jpg","category_names":["Hard 
development"],"_links":{"self":[{"href":"https:\/\/sii.pl\/blog\/en\/wp-json\/wp\/v2\/posts\/27510"}],"collection":[{"href":"https:\/\/sii.pl\/blog\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sii.pl\/blog\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sii.pl\/blog\/en\/wp-json\/wp\/v2\/users\/628"}],"replies":[{"embeddable":true,"href":"https:\/\/sii.pl\/blog\/en\/wp-json\/wp\/v2\/comments?post=27510"}],"version-history":[{"count":2,"href":"https:\/\/sii.pl\/blog\/en\/wp-json\/wp\/v2\/posts\/27510\/revisions"}],"predecessor-version":[{"id":27595,"href":"https:\/\/sii.pl\/blog\/en\/wp-json\/wp\/v2\/posts\/27510\/revisions\/27595"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/sii.pl\/blog\/en\/wp-json\/wp\/v2\/media\/27528"}],"wp:attachment":[{"href":"https:\/\/sii.pl\/blog\/en\/wp-json\/wp\/v2\/media?parent=27510"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sii.pl\/blog\/en\/wp-json\/wp\/v2\/categories?post=27510"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sii.pl\/blog\/en\/wp-json\/wp\/v2\/tags?post=27510"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}