AWS Glue DataBrew: Enabling customers to clean and normalize data without writing code

Amazon Web Services announced the general availability of AWS Glue DataBrew, a new visual data preparation tool that enables customers to clean and normalize data without writing code.

Since 2016, data engineers have used AWS Glue to create, run, and monitor extract, transform, and load (ETL) jobs. AWS Glue provides both code-based and visual interfaces, and has dramatically simplified extracting, orchestrating, and loading data in the cloud for customers.

Data analysts and data scientists have wanted an easier way to clean and transform this data. DataBrew delivers that with a service that lets them explore and experiment with data directly from AWS data lakes, data warehouses, and databases without writing code.

AWS Glue DataBrew offers customers over 250 pre-built transformations to automate data preparation tasks (e.g. filtering anomalies, standardizing formats, and correcting invalid values) that would otherwise require days or weeks of writing hand-coded transformations.

Once the data is prepared, customers can immediately start using it with AWS and third-party analytics and machine learning services to query the data and train machine learning models. There are no upfront commitments or costs to use AWS Glue DataBrew, and customers only pay for creating and running transformations on datasets.

Preparing data for analytics and machine learning involves several necessary and time-consuming tasks, including data extraction, cleaning, normalization, loading, and the orchestration of ETL workflows at scale.

For extracting, orchestrating, and loading data at scale, data engineers and ETL developers skilled in SQL or programming languages like Python or Scala can use AWS Glue.

ETL developers often prefer the visual interfaces common in modern ETL tools over writing SQL, Python, or Scala, so AWS recently introduced AWS Glue Studio, a new visual interface to help author, run, and monitor ETL jobs without having to write any code.

Once the data has been reliably moved, the underlying data still needs to be cleaned and normalized by data analysts and data scientists who operate in the lines of business and understand the context of the data.

To clean and normalize the data, data analysts and data scientists have to either work with small batches of the data in Excel or Jupyter Notebooks, which cannot accommodate large datasets, or rely on scarce data engineers and ETL developers to write custom code to perform cleaning and normalization transformations.

In an effort to spot anomalies in the data, highly skilled data engineers and ETL developers spend days or weeks writing custom workflows to pull data from different sources, then pivot, transpose, and slice the data multiple times, before they can iterate with data analysts or data scientists to identify and fix data quality issues.

After they have developed these transformations, data engineers and ETL developers still need to schedule the custom workflows to run on an ongoing basis, so new incoming data can automatically be cleaned and normalized.

Each time a data analyst or data scientist wants to change or add a transformation, the data engineers and ETL developers need to extract, load, clean, normalize, and orchestrate the data preparation tasks over again.

This iterative process can take several weeks to months to complete, and as a result, customers spend as much as 80% of their time cleaning and normalizing data instead of actually analyzing the data and extracting value from it.

AWS Glue DataBrew is a visual data preparation tool for AWS Glue that allows data analysts and data scientists to clean and transform data with an interactive, point-and-click visual interface, without writing any code.

With AWS Glue DataBrew, end users can easily access and visually explore any amount of data across their organization directly from their Amazon Simple Storage Service (S3) data lake, Amazon Redshift data warehouse, and Amazon Aurora and Amazon Relational Database Service (RDS) databases.

Customers can choose from over 250 built-in functions to combine, pivot, and transpose the data without writing code. AWS Glue DataBrew recommends data cleaning and normalization steps like filtering anomalies, normalizing data to standard date and time values, generating aggregates for analyses, and correcting invalid, misclassified, or duplicative data.
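For readers who want a concrete picture of the cleaning steps named above, here is a minimal plain-Python sketch of three of them: normalizing dates to a standard value, removing duplicate rows, and filtering anomalies. It does not use or represent DataBrew's own implementation. The field names, the list of date formats, and the median-based outlier rule are all assumptions chosen for illustration.

```python
from datetime import datetime
from statistics import median

# Hypothetical date formats seen in the raw data (an assumption for this sketch).
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y")


def normalize_date(text):
    """Return the date as ISO 8601 (YYYY-MM-DD), or None if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving input order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def filter_anomalies(rows, field, max_ratio=10.0):
    """Drop rows whose value exceeds max_ratio times the median of the field.

    The median rule is an illustrative choice, not DataBrew's method.
    """
    mid = median(row[field] for row in rows)
    return [row for row in rows if row[field] <= max_ratio * mid]


raw = [
    {"order_date": "2020-11-11", "amount": 20.0},
    {"order_date": "11/12/2020", "amount": 25.0},
    {"order_date": "11/12/2020", "amount": 25.0},     # exact duplicate
    {"order_date": "13 Nov 2020", "amount": 9000.0},  # outlier
]

cleaned = [{**row, "order_date": normalize_date(row["order_date"])} for row in raw]
cleaned = drop_duplicates(cleaned)
cleaned = filter_anomalies(cleaned, "amount")
```

In this sample, the duplicate row and the 9000.0 outlier are dropped, and the remaining dates share one format. Hand-written code of this kind is what the pre-built transformations are meant to replace.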

For complex tasks like converting words to a common base or root word (e.g. converting “yearly” and “yearlong” to “year”), AWS Glue DataBrew also provides transformations that use advanced machine learning techniques like Natural Language Processing (NLP).
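The idea of reducing words to a common root can be illustrated with a toy suffix-stripping function. This is a deliberately naive sketch and not the NLP technique DataBrew uses, which the announcement does not describe. The suffix list and the minimum stem length are assumptions picked to cover the "yearly" and "yearlong" example.

```python
# Hypothetical suffix list, checked in order; longer suffixes come first.
SUFFIXES = ("long", "ly")


def naive_stem(word, min_stem=3):
    """Strip the first matching suffix if the remaining stem is long enough."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem:
            return stem
    return word


stems = [naive_stem(w) for w in ("yearly", "yearlong", "year")]
```

Here all three inputs map to "year". Real stemmers and lemmatizers handle far more cases, such as irregular forms, which is why this kind of task is usually left to an NLP library or a managed transformation.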

Users can then save these cleaning and normalization steps into a workflow (called a recipe) and apply them automatically to future incoming data. If changes need to be made to the workflow, data analysts and data scientists simply update the cleaning and normalization steps in the recipe, and they are automatically applied to new data as it arrives.
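Conceptually, a recipe behaves like an ordered list of steps that is replayed on each new batch of data. The sketch below models that idea in plain Python. It is not DataBrew's recipe format or API, and the step functions and field names are hypothetical.

```python
def strip_whitespace(row):
    """Trim leading and trailing whitespace from every string value."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}


def uppercase_country(row):
    """Standardize the (hypothetical) country field to upper case."""
    return {**row, "country": row["country"].upper()}


def apply_recipe(recipe, rows):
    """Run every step of the recipe, in order, on each row."""
    result = []
    for row in rows:
        for step in recipe:
            row = step(row)
        result.append(row)
    return result


# The saved recipe: an ordered list of cleaning steps.
recipe = [strip_whitespace, uppercase_country]
first_batch = apply_recipe(recipe, [{"name": " Ana ", "country": "jp"}])

# Updating the recipe changes how every later batch is processed.
recipe.append(lambda row: {**row, "name": row["name"].title()})
later_batch = apply_recipe(recipe, [{"name": "bob smith ", "country": " us"}])
```

Keeping the steps as data, separate from the code that applies them, is what lets a step be added or changed once and then take effect on all data that arrives afterward.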

AWS Glue DataBrew publishes the prepared data to Amazon S3, which makes it easy for customers to immediately use it in analytics and machine learning. AWS Glue DataBrew is serverless and fully managed, so customers never need to configure, provision, or manage any compute resources.

“AWS customers are using data for analytics and machine learning at an unprecedented pace. However, these customers regularly tell us that their teams spend too much time on the undifferentiated, repetitive, and mundane tasks associated with data preparation,” said Raju Gulabani, VP of Database and Analytics, AWS.

“Customers love the scalability and flexibility of code-based data preparation services like AWS Glue, but they could also benefit from allowing business users, data analysts, and data scientists to visually explore and experiment with data independently, without writing code.

“AWS Glue DataBrew features an easy-to-use visual interface that helps data analysts and data scientists of all technical levels understand, combine, clean, and transform data.”

AWS Glue DataBrew is generally available in US East (N. Virginia), US East (Ohio), US West (Oregon), EU (Ireland), EU (Frankfurt), Asia Pacific (Sydney), and Asia Pacific (Tokyo), with availability in additional regions coming soon.

Tokyo-based NTT DOCOMO is the largest mobile service provider in Japan, serving more than 80 million customers. “Our analysts profile and query various kinds of structured and unstructured data in order to better understand usage patterns,” said Takashi Ito, General Manager of Marketing Platform Planning Department, NTT DOCOMO.

“AWS Glue DataBrew provides a visual interface that enables both our technical and non-technical users to analyze data quickly and easily. Its advanced data profiling capability helps us better understand our data and monitor the data quality. AWS Glue DataBrew and other AWS analytics services have allowed us to streamline our workflow and increase productivity.”

bp is one of the world’s largest integrated energy companies. “A data lake is a critical part of our analytics strategy. One of the challenges we face is not being able to easily explore data before ingestion into our data lake,” said John Maio, Director, Data & Analytics Platforms Architecture, bp.

“AWS Glue DataBrew has sophisticated data profiling functionality and a rich set of built-in transformations. This enables our data engineers to easily explore new datasets in a visual interface and make modifications in order to optimize ingestion and allow analysts to shape the data for their analytics solutions.

“We see AWS Glue DataBrew as a way to help us better manage our data platform and improve efficiencies in our data pipelines.”

INVISTA, a subsidiary of Koch Industries, is one of the world’s largest integrated producers of chemical intermediates, polymers, and fibers.

“Data is critical to optimizing our manufacturing processes. One of the challenges we face is ensuring we have a clean data lake that can serve as the source of truth for our analytics and machine learning applications,” said Tanner Gonzalez, Analytics and Cloud leader, INVISTA.

“The data ingested into our data lake often contains duplicate values, incorrect formatting and other imperfections that make it difficult to use in its raw form. Amazon AWS Glue DataBrew will allow our data analysts to visually inspect large data sets, clean and enrich data, and perform advanced transformations.

“AWS Glue DataBrew will empower our analysts and data scientists to perform advanced data engineering activities, giving them the freedom to explore their data and decreasing the time to derive new insights.”
