By Denys Tyshetskyy, Principal Data Specialist – CyberCX
By Rishi Singla, Solutions Architect – AWS


Data has emerged as a critical driver for businesses across diverse verticals, necessitating a strong reliance on timely data acquisition and efficient utilization of existing data.

Data engineering, although part of the software development realm, presents distinct challenges that set it apart. One such challenge stems from the extensive variety of data types encountered in data engineering projects, which significantly complicates the availability of suitable synthetic data for development and testing outside the production environment.

Adhering to established best practices from the software development domain, it’s imperative to meticulously design, develop, and thoroughly test solutions within non-production environments prior to their deployment.

However, the practical implementation of best practices can become intricate when only production data is accessible, particularly considering the potential security vulnerabilities that arise from using less secure non-production environments.

Regrettably, many companies opt for the “easier” route by employing production data in non-production environments, inadvertently increasing the risk of sensitive data exposure in the event of a data breach.

Data masking, or data obfuscation, is the process of modifying sensitive data in such a way that it’s of little or no value to unauthorized intruders while still being usable by software or authorized personnel. Data masking is also sometimes referred to as anonymization or tokenization.

In this post, we will explore how CyberCX identified the challenge of limited amounts of non-production data available for testing—a scenario that a masking solution was built to help with. We’ll also share how CyberCX collaborated with AWS to develop a solution empowering customers to leverage production data within non-production environments, reaping the benefits while effectively safeguarding sensitive information.

CyberCX is an AWS Premier Tier Services Partner and AWS Marketplace Seller with Competencies in DevOps, Data and Analytics, Security, and other key areas. CyberCX embraces an AWS-first and secure cloud transformation approach through its deep AWS specialization.

CyberCX Serverless Data

To address these challenges, CyberCX constructs Amazon Web Services (AWS) serverless data lake platforms that include attributes such as:

  • 100% AWS-native: The serverless data lake platform seamlessly integrates with the vast ecosystem of AWS services, enabling CyberCX to leverage the right tools for each specific task.
  • High modularity: The inherent modularity of the platform facilitates effortless integration into existing data platforms without disrupting ongoing workloads. This ensures a smooth transition and harmonious coexistence with existing infrastructures.
  • Fast time to market: In 10-15 days, the foundation of the data lake can be deployed to a customer’s production environment, ready to be used.
  • DevOps: A data lake can be provisioned to multiple environments to ensure every change is thoroughly tested before it’s deployed to production.

Benefits of a data masking solution include:

  • Allows the use of production data in a non-production environment by effectively masking all sensitive data, and only then moving it to the destination non-production environment for further usage.
  • Allows users to mask data in near-real time, which helps to keep production and non-production environments as similar as possible.
  • Combines automatic identification of sensitive data with an additional human review for edge cases and validation.
  • Limited effort required to onboard new data sources.
  • Fully managed solution with no infrastructure to worry about.
  • Cost-effective; you only pay when there is data to process.

Solution Overview

A data masking solution is built as a module of the serverless data lake platform and consists of the following components:

  • AWS Step Functions: Serves as the orchestration mechanism for the overall workflow, enabling users to streamline and coordinate various stages of the data masking process.
  • AWS Glue DataBrew: Identifies and masks personally identifiable information (PII) and payment card industry (PCI) data present within the data, ensuring compliance with data protection regulations and preserving data privacy.
  • AWS Lambda: Executes essential calculations and transformations.
  • Amazon DynamoDB: Key-value store used to persist metadata about the scanned dataset.
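The orchestration described above can be sketched as an Amazon States Language definition: profile the dataset with a DataBrew job, branch on whether sensitive columns were found, then mask or copy the file. This is an illustrative sketch only; the state names, job names, and the `sensitiveColumnCount` field are assumptions, not the actual CyberCX implementation.

```python
import json


def build_masking_state_machine(profile_job: str, mask_job: str) -> dict:
    """Build a simplified Amazon States Language definition for the
    masking workflow. Job and state names are illustrative."""
    return {
        "Comment": "Illustrative data masking workflow (names are assumptions)",
        "StartAt": "ProfileDataset",
        "States": {
            # Run the DataBrew profile job and wait for it to finish.
            "ProfileDataset": {
                "Type": "Task",
                "Resource": "arn:aws:states:::databrew:startJobRun.sync",
                "Parameters": {"Name": profile_job},
                "Next": "HasSensitiveColumns",
            },
            # Branch on whether profiling flagged any sensitive columns
            # (assumed to be surfaced as $.sensitiveColumnCount).
            "HasSensitiveColumns": {
                "Type": "Choice",
                "Choices": [
                    {
                        "Variable": "$.sensitiveColumnCount",
                        "NumericGreaterThan": 0,
                        "Next": "MaskDataset",
                    }
                ],
                "Default": "CopyToNonProd",
            },
            # Run the DataBrew masking (recipe) job on flagged datasets.
            "MaskDataset": {
                "Type": "Task",
                "Resource": "arn:aws:states:::databrew:startJobRun.sync",
                "Parameters": {"Name": mask_job},
                "Next": "CopyToNonProd",
            },
            # A Lambda function copies the (masked) file to the
            # non-production S3 bucket.
            "CopyToNonProd": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "End": True,
            },
        },
    }


definition = build_masking_state_machine("profile-bikes", "mask-bikes")
print(json.dumps(definition)[:60])
```

The `databrew:startJobRun.sync` and `lambda:invoke` service integrations let the state machine wait for each step to complete before branching, which keeps the workflow logic out of the Lambda functions themselves.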

AWS Glue DataBrew is the main component in the workflow, handling the data profiling as well as the subsequent masking. The remaining elements make the solution more generic, repeatable, scalable, and efficient. You can read more about AWS Glue DataBrew data masking in this AWS blog post.

How it Works

All of the required infrastructure for the data masking workflow is built using the AWS Cloud Development Kit (AWS CDK). This simplifies the maintenance and deployment process, and the solution scales based on the number of incoming files that require masking.

Figure 1 – CyberCX’s data masking solution.

A data masking solution can be adjusted to ingest historical data as well as handle ongoing ingestion. In practice, it has proven more beneficial to run two separate Step Functions state machines—one for historical data and one for ongoing ingestion.

Use Case Scenario

Let’s say Company Y provides the service of hiring electric bikes. It collects data about the locations of the bikes, the types of bikes, transactions, and the customers who hire bikes.

The company wants to ingest its data into the data lakehouse and use that data in a non-production environment for experimenting and testing new features. The first datasets that are ingested are bike, transaction, and customer. The company runs the masking pipeline in production to de-sensitize the datasets before they can be moved to a non-production environment.

The masking step function would first run for all three datasets and profile the data in each of them. As a result, the profiling job would generate a response for each dataset identifying which columns contain sensitive data. The response is recorded, and the workflow moves to the next stage: data masking.
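Handling the profiling response might look like the sketch below: a small function that extracts the names of flagged columns from the recorded profile result. The input shape here is a simplified assumption for illustration, not the exact DataBrew profile-output schema.

```python
def sensitive_columns(profile_result: dict) -> list[str]:
    """Return the names of columns the profiling step flagged as
    sensitive. Assumes each column entry carries an 'entityTypes'
    list that is non-empty when PII/PCI was detected (a simplified
    shape, not the exact DataBrew output schema)."""
    return [
        col["name"]
        for col in profile_result.get("columns", [])
        if col.get("entityTypes")  # non-empty list => sensitive data found
    ]


# Hypothetical recorded response for the transaction dataset.
profile = {
    "dataset": "transaction",
    "columns": [
        {"name": "id", "entityTypes": []},
        {"name": "bank_account", "entityTypes": ["BANK_ACCOUNT"]},
        {"name": "amount", "entityTypes": ["FINANCIAL"]},
    ],
}

print(sensitive_columns(profile))  # ['bank_account', 'amount']
```

A list like this is what would be persisted (for example, in DynamoDB) and what a human reviewer could adjust before the masking stage runs.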

If the dataset has sensitive data, the relevant columns would be de-sensitized using a one-way hash or other similar method, and the resulting file copied to an Amazon Simple Storage Service (Amazon S3) bucket in a non-production account.

In the transaction dataset, the ID would remain unchanged, but fields such as “bank account” and “amount” would be masked. If the dataset doesn’t contain sensitive data, the file would be directly copied to the non-production account. The customer can also adjust the results of the profiling job by adding or removing certain fields from masking if necessary.
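The one-way-hash masking of the transaction record above can be sketched as follows. The salt value and field names are illustrative assumptions; a real deployment would manage the salt as a secret and truncate (or not) according to its own policy.

```python
import hashlib


def mask_record(record: dict, sensitive_fields: set[str],
                salt: str = "demo-salt") -> dict:
    """Replace sensitive field values with a salted one-way SHA-256
    hash, leaving other fields (such as the ID) untouched. The salt
    is illustrative; production systems should store it as a secret."""
    masked = {}
    for key, value in record.items():
        if key in sensitive_fields:
            digest = hashlib.sha256((salt + str(value)).encode()).hexdigest()
            masked[key] = digest[:16]  # truncated for readability
        else:
            masked[key] = value
    return masked


txn = {"id": "T-1001", "bank_account": "12-3456-7890", "amount": 42.50}
masked = mask_record(txn, {"bank_account", "amount"})

print(masked["id"])            # unchanged: 'T-1001'
print(masked["bank_account"])  # 16-char hex digest, not the real account
```

Because the hash is deterministic, the same account number always masks to the same value, which preserves joins and group-bys across datasets in the non-production environment while keeping the original value unrecoverable.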

Figure 2 – Data masking workflow.


Conclusion

To keep up with the ever-growing amount of data from diverse systems, it’s vital for companies to be able to easily adopt new data sources. It’s also important to have confidence in the solutions built for processing data, and the only way to gain this is to be able to exercise the logic before it goes to the production environment.

This post outlines a data masking mechanism from CyberCX which can generate meaningful data for non-production environments, facilitating the development and testing of solutions within the data lake.

To learn more about CyberCX’s data lake platform and data masking offering, check out the product page. You can also learn more about CyberCX in AWS Marketplace.


CyberCX – AWS Partner Spotlight

CyberCX is an AWS Premier Tier Services Partner that embraces an AWS-first and secure cloud transformation approach through its deep AWS specialization.

Contact CyberCX | Partner Overview | AWS Marketplace