Amazon Web Services’ new visual data preparation tool for AWS Glue lets users clean and normalize data through an interactive point-and-click interface, without writing custom code.
AWS Glue DataBrew helps data scientists and data analysts get data ready for analytics and machine learning (ML) up to 80 percent faster than traditional data preparation approaches, according to the cloud provider, which made the tool generally available on Wednesday.
The new offering builds on AWS Glue, which AWS made generally available in April 2017. AWS Glue is a serverless, fully managed extract, transform and load (ETL) service that categorizes, cleans, enriches and moves data between various data stores. It has a central data repository called the AWS Glue Data Catalog, an ETL engine that automatically generates Python code, and a flexible scheduler that handles dependency resolution, job monitoring and retries.
For AWS Glue DataBrew, AWS is offering more than 250 pre-built functions to automate data preparation tasks that otherwise would require days to weeks to code, according to the company.
AWS customers are using data for analytics and ML at an unprecedented pace, but regularly tell AWS that their teams are spending too much time on the “undifferentiated, repetitive, and mundane tasks associated with data preparation,” according to Raju Gulabani, AWS’ vice president of database and analytics. They spend up to 80 percent of their time cleaning and normalizing data rather than analyzing and extracting value from it, according to AWS.
“Customers love the scalability and flexibility of code-based data preparation services like AWS Glue, but they could also benefit from allowing business users, data analysts and data scientists to visually explore and experiment with data independently, without writing code,” Gulabani said in a statement. “AWS Glue DataBrew features an easy-to-use visual interface that helps data analysts and data scientists of all technical levels understand, combine, clean and transform data.”
AWS Glue DataBrew currently is generally available in AWS’ US East (N. Virginia), US East (Ohio), US West (Oregon), EU (Ireland), EU (Frankfurt), Asia Pacific (Sydney) and Asia Pacific (Tokyo) cloud regions.
Users can access and visually explore any amount of data directly from their Amazon Simple Storage Service (S3) data lake, Amazon Redshift data warehouse, and Amazon Aurora and Amazon Relational Database Service (RDS) databases.
The built-in functions to combine and transpose data include filtering anomalies, standardizing formats, generating aggregates for analysis, and correcting invalid, misclassified or duplicate data. Some of the prebuilt transformations use advanced ML techniques such as natural language processing.
“Once your data is ready, you can immediately use it with AWS and third-party services to gain further insights, such as Amazon SageMaker for machine learning, Amazon Redshift and Amazon Athena for analytics, and Amazon QuickSight and Tableau for business intelligence,” Danilo Poccia, AWS “chief evangelist” for Europe, the Middle East and Africa, said in a blog post.
Users can then save these cleaning and normalization steps into a workflow, called a “recipe,” and apply them automatically to future incoming data.
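Conceptually, a recipe is an ordered list of transformation steps that is saved once and re-applied to each new batch of data. The sketch below builds such a step list as plain JSON; the recipe name, operation names (“REMOVE_DUPLICATES”, “UPPER_CASE”) and parameter keys are illustrative assumptions modeled on that idea, not verified values from the DataBrew API:

```python
import json

# Illustrative "recipe": an ordered list of transformation steps.
# The operation names and parameter keys below are assumptions for
# illustration, not verified AWS Glue DataBrew API values.
recipe = {
    "Name": "clean-customer-data",
    "Steps": [
        {"Action": {"Operation": "REMOVE_DUPLICATES",
                    "Parameters": {"sourceColumn": "customer_id"}}},
        {"Action": {"Operation": "UPPER_CASE",
                    "Parameters": {"sourceColumn": "country_code"}}},
    ],
}

# Saved once, the same ordered steps can be run against future data.
print(json.dumps(recipe, indent=2))
```

Keeping the steps as declarative data rather than code is what lets a visual tool record, replay and audit them, which is also the basis of the lineage tracking described below.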
“At any point in time, you can visually track and explore how datasets are linked to projects, recipes and job runs,” Poccia said. “In this way, you can understand how data flows and what are the changes. This information is called ‘data lineage’ and can help you find the root cause in case of errors in your output.”
AWS is not requiring customers to make upfront commitments or payments to use AWS Glue DataBrew. Customers pay only for creating and running transformations on datasets, with pricing varying by region.
Interactive sessions are billed per session, and AWS Glue DataBrew jobs are billed by the minute.
Each session runs 30 minutes and starts when a customer opens a DataBrew project. The first 40 interactive sessions are free for first-time users. Use of the DataBrew API operations is billed at the same rates.
The per-session price currently listed for the AWS Asia Pacific (Sydney) region is US$0.44.
AWS charges 48 US cents per DataBrew node hour in some regions. A node provides 4 vCPUs and 16 GB of memory.
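Putting those list prices together, a rough cost estimate is simple arithmetic. The sketch below uses the Sydney session price and the 48-cent node-hour rate quoted above; actual prices vary by region and change over time, and the helper functions are illustrative, not part of any AWS tooling:

```python
# Back-of-the-envelope AWS Glue DataBrew cost estimates using the list
# prices cited above (Sydney session price; 48-cent node-hour rate).
# Prices vary by region and over time -- check the AWS pricing page.

SESSION_PRICE_USD = 0.44    # per 30-minute interactive session (Sydney)
NODE_HOUR_PRICE_USD = 0.48  # per DataBrew node hour (some regions)
FREE_SESSIONS = 40          # free interactive sessions for first-time users

def session_cost(sessions_used: int) -> float:
    """Sessions beyond the free allotment are billed per session."""
    billable = max(0, sessions_used - FREE_SESSIONS)
    return billable * SESSION_PRICE_USD

def job_cost(nodes: int, minutes: float) -> float:
    """Jobs are billed by the minute, per node, at the node-hour rate."""
    return nodes * (minutes / 60.0) * NODE_HOUR_PRICE_USD

# 50 interactive sessions: the first 40 are free, so 10 are billed.
print(round(session_cost(50), 2))  # 4.4
# A 10-minute job running on 5 nodes.
print(round(job_cost(5, 10), 2))   # 0.4
```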
Customers face additional charges if their AWS Glue DataBrew jobs use other AWS services or transfer data.
Who’s Using It
Japanese mobile service provider NTT DOCOMO, London-based energy company BP and INVISTA, a chemical intermediates, polymers and fibers producer with headquarters in the US, are among the AWS customers that have been using AWS Glue DataBrew.
Data analysts for Tokyo-based NTT DOCOMO profile and query structured and unstructured data to better understand usage patterns, according to Takashi Ito, NTT DOCOMO’s general manager of marketing platform planning.
“AWS Glue DataBrew provides a visual interface that enables both our technical and non-technical users to analyze data quickly and easily,” Ito said in a statement. “Its advanced data profiling capability helps us better understand our data and monitor the data quality. AWS Glue DataBrew and other AWS analytics services have allowed us to streamline our workflow and increase productivity.”
A data lake is a critical part of BP’s analytics strategy, and the company has been challenged by not being able to easily explore data before ingestion into its data lake, according to John Maio, the company’s director of data and analytics platforms architecture. AWS Glue DataBrew helps the company better manage its data platform and improve data pipeline efficiencies, he said.
“AWS Glue DataBrew has sophisticated data profiling functionality and a rich set of built-in transformations,” Maio said. “This enables our data engineers to easily explore new datasets in a visual interface and make modifications in order to optimize ingestion and allow analysts to shape the data for their analytics solutions.”
Data is essential to optimize INVISTA’s manufacturing processes, according to Tanner Gonzalez, the company’s analytics and cloud leader.
“One of the challenges we face is ensuring we have a clean data lake that can serve as the source of truth for our analytics and machine learning applications,” Gonzalez said. “The data ingested into our data lake often contains duplicate values, incorrect formatting and other imperfections that make it difficult to use in its raw form. AWS Glue DataBrew will empower our analysts and data scientists to perform advanced data engineering activities, giving them the freedom to explore their data and decreasing the time to derive new insights.”