Building Reusable ETL Pipelines in AWS Glue with Custom Visual Transforms

By Eswara Sridharan Technical Blog March 28, 2025

In today’s AI-driven world, data engineers are focusing on eliminating manual processes and minimizing heavy lifting. The shift towards low-code or no-code solutions is gaining momentum, and visual ETL (Extract, Transform, Load) is at the forefront of this trend. AWS Glue Studio offers a powerful platform to perform visual ETL, making data transformation tasks simpler and more efficient.

AWS Glue Studio

AWS Glue Studio provides a user-friendly graphical interface that simplifies the process of creating, running, and monitoring data integration jobs. With its visual composition capabilities, you can design data transformation workflows with ease and seamlessly execute them on AWS Glue’s serverless, Apache Spark–based ETL engine.

In AWS Glue Studio, each node within the visual interface represents a specific function in an ETL pipeline. There are three main types of nodes: Source, Action, and Target.

Source: This node establishes connections to data sources such as RDS, S3, SharePoint, Salesforce, and more.
Action: This node allows you to apply business logic and write custom code to meet data transformation and analytics requirements.
Target: This node loads the transformed data into the desired destination, such as S3, Redshift, Snowflake, or other supported storage solutions.

AWS Glue Studio offers a wide range of predefined transformations that can be easily applied using a drag-and-drop interface. However, if the required transformation isn’t available, you can use the “Custom Transform” node to write your PySpark code to implement the desired logic. The challenge is that custom transformations are created at the job level and are not designed for reusability, making it difficult to share or reuse them across different jobs or with other developers.

Custom Visual Transform

This is where the Custom Visual Transform comes in. It lets you create a custom action node that performs the transformation and can be reused across multiple jobs and by different developers, without rewriting the code each time. With this feature, you can build a reusable library of action nodes and apply them to various jobs, which solves the reusability problem and streamlines development.

Creating a Custom Visual Transform involves following specific steps to convert your custom code into a reusable visual component:

Create a Configuration File: This JSON file defines how your custom visual transform appears and behaves in the editor, including its name, description, and input parameters.
Write Custom Code: Implement the logic for your custom transformation using PySpark. This code will be the core of your custom action node.
Upload to S3 Bucket: Store both the configuration file and custom code in an S3 bucket, which AWS Glue Studio will reference to make your custom transform available in the visual interface.

We will build a custom action node that masks personally identifiable information (PII) in our dataset. This is a common requirement when handling sensitive data, and creating a reusable custom visual transform will allow us to apply PII masking easily across multiple jobs.

Create a Configuration File

{
    "name": "custom_filter_state",
    "displayName": "Apply Masking",
    "description": "This transform masks the email columns selected by the user",
    "functionName": "custom_filter_state",
    "parameters": [
        {
            "name": "colName",
            "displayName": "Columns with Email Addresses",
            "type": "list",
            "listOptions": "column",
            "listType": "str",
            "description": "Columns in the data that hold email addresses to mask"
        }
    ]
}

The configuration file for a Custom Visual Transform is a JSON file that defines the UI elements and behavior of the masking node. Its key fields are:

name: The system identifier for the transform.
displayName: The name displayed in AWS Glue Studio’s visual editor.
description: A brief, searchable description shown in AWS Glue Studio.
functionName: The Python function to be invoked in the script.
parameters: A list of input fields used for configuring the transform.

Within parameters, you have:

name: The parameter name, passed as a named argument to the Python function. It must follow Python variable naming rules.
displayName: The label shown in the AWS Glue Studio editor for the parameter.
type: The data type of the parameter (e.g., str, list). In our example, it’s a list so that the user can select several values.
listOptions: The options to display in a Select or Multiselect UI. Setting it to “column”, as in our example, auto-populates the options with column names from the parent node’s schema.
listType: The type of the items in the list (used when type is list). In our example, it’s str.
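Since a typo in the configuration file only surfaces once the node is loaded in the editor, it can help to check the file locally first. The following is a minimal sketch, not an official validator: the helper name validate_transform_config, the REQUIRED_FIELDS set, and the ALLOWED_TYPES set are assumptions made for this example, and the checks cover only the fields described above.

```python
import json
import keyword

# Assumptions for this sketch: which fields are mandatory and which type values are accepted.
REQUIRED_FIELDS = ("name", "functionName")
ALLOWED_TYPES = {"str", "int", "float", "list", "bool"}


def validate_transform_config(text):
    """Return a list of human-readable problems found in a transform config (JSON text)."""
    problems = []
    config = json.loads(text)

    for field in REQUIRED_FIELDS:
        if field not in config:
            problems.append(f"missing top-level field: {field}")

    for param in config.get("parameters", []):
        name = param.get("name", "")
        # Parameter names become keyword arguments, so they must be valid identifiers.
        if not name.isidentifier() or keyword.iskeyword(name):
            problems.append(f"parameter name is not a valid Python identifier: {name!r}")
        if param.get("type") not in ALLOWED_TYPES:
            problems.append(f"unexpected type for {name!r}: {param.get('type')!r}")
        if "listType" in param and param.get("type") != "list":
            problems.append(f"listType is set on non-list parameter {name!r}")

    return problems


sample = """
{
    "name": "custom_filter_state",
    "functionName": "custom_filter_state",
    "parameters": [
        {"name": "colName", "type": "list", "listOptions": "column", "listType": "str"}
    ]
}
"""

print(validate_transform_config(sample))  # prints []
```

An empty list means none of these particular checks failed; it does not guarantee that AWS Glue Studio will accept the file.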

This configuration creates a node that retrieves all the column names from the parent schema and displays them in a multiselect dropdown. The user can then select the PII fields that require masking.

Create the Custom Transform Code

from awsglue import DynamicFrame

def myTransform(self, email, phone, age=None, gender="",
                country="", promotion=False):
    # Placeholder: apply your transformation to self (a DynamicFrame) here
    resulting_dynf = self
    return resulting_dynf

# Register the function as a method on DynamicFrame
DynamicFrame.myTransform = myTransform

The custom code for a Custom Visual Transform follows this structure: the function is attached to the DynamicFrame class as a method, so self refers to the DynamicFrame being transformed.

Parameter Naming:

The parameter names in the code must match those defined in the config file.
If a parameter is optional, provide a default value in the function.

Function Naming:

Ensure the function name in the code matches the one defined in the config file.
It’s advisable to give the name field in the config file the same value as the function name, to avoid confusion between them.

Assigning Results:

In the final line, attach the function to the DynamicFrame class (for example, DynamicFrame.myTransform = myTransform) so that the job can call it as a method.
Always return the transformed result as a DynamicFrame.
If needed, you can convert the DynamicFrame into a DataFrame with toDF() for further transformations, including with libraries such as pandas and NumPy.
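The naming rules above can be checked mechanically before uploading anything. The sketch below uses Python's standard inspect module to compare a config (already parsed into a dict) with the function that implements it. The helper name check_signature and the trimmed-down config dict are assumptions made for this illustration, and the stand-in function returns its input unchanged so the example runs without a Glue environment.

```python
import inspect


def check_signature(config, func):
    """Compare a parsed transform config with the Python function that implements it."""
    problems = []

    if config["functionName"] != func.__name__:
        problems.append(
            f"functionName {config['functionName']!r} does not match function {func.__name__!r}"
        )

    sig_params = list(inspect.signature(func).parameters.values())
    if not sig_params:
        problems.append("function takes no arguments; the first one should receive the DynamicFrame")

    # Skip the first parameter (self), which receives the DynamicFrame.
    func_params = sig_params[1:]
    func_names = {p.name for p in func_params}
    config_names = {p["name"] for p in config.get("parameters", [])}

    for missing in sorted(config_names - func_names):
        problems.append(f"config parameter {missing!r} is not accepted by the function")

    # A function parameter that the UI never supplies needs a default value.
    for p in func_params:
        if p.name not in config_names and p.default is inspect.Parameter.empty:
            problems.append(f"function parameter {p.name!r} has no default and is not in the config")

    return problems


config = {
    "functionName": "custom_filter_state",
    "parameters": [{"name": "colName", "type": "list"}],
}


def custom_filter_state(self, colName):
    # Stand-in body for the check; a real transform would return a DynamicFrame.
    return self


print(check_signature(config, custom_filter_state))  # prints []
```

Running this check against a function whose parameter is spelled col_name instead of colName would report the mismatch on both sides, which is the kind of error that otherwise only shows up when the job runs.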

Below is the code that masks the columns selected by the user in the UI.

from awsglue import DynamicFrame
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
import re

# Define the masking function for email:
# keep up to two leading characters of the local part and the full domain
def mask_email(email):
    if email and isinstance(email, str):
        return re.sub(r'^([^@]{1,2})[^@]*(@.*)$', r'\1***\2', email)
    return email

# Define the custom transformation
def custom_filter_state(self, colName):
    # Create UDF for Spark SQL
    mask_udf = udf(mask_email, StringType())

    # Convert DynamicFrame to DataFrame for transformation
    df = self.toDF()

    # Apply masking to each column selected in the UI
    for column_name in colName:
        df = df.withColumn(column_name, mask_udf(df[column_name]))

    # Convert DataFrame back to DynamicFrame
    return DynamicFrame.fromDF(df, self.glue_ctx, self.name)

# Assign custom transformation to DynamicFrame class
DynamicFrame.custom_filter_state = custom_filter_state
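Because the masking rule itself is plain Python, it can be exercised without Spark or a Glue job. The sketch below re-declares a masking helper with an anchored pattern and applies it to a few in-memory rows. The names mask_email_value and mask_columns, the sample rows, and the choice to keep at most two leading characters are assumptions of this example rather than part of the Glue API.

```python
import re

# Keep up to two leading characters of the local part, then the full domain.
EMAIL_PATTERN = re.compile(r"^([^@]{1,2})[^@]*(@.*)$")


def mask_email_value(value):
    """Mask one email string; non-strings and empty values are returned unchanged."""
    if value and isinstance(value, str):
        return EMAIL_PATTERN.sub(r"\1***\2", value)
    return value


def mask_columns(rows, columns):
    """Apply the mask to the selected keys of each row (rows are plain dicts here)."""
    return [
        {key: mask_email_value(val) if key in columns else val for key, val in row.items()}
        for row in rows
    ]


rows = [
    {"id": 1, "personal_email_address": "john.doe@example.com", "work_email_address": "jd@corp.example"},
    {"id": 2, "personal_email_address": "a@x.com", "work_email_address": None},
]

masked = mask_columns(rows, ["personal_email_address", "work_email_address"])
for row in masked:
    print(row)
```

With these inputs, john.doe@example.com becomes jo***@example.com, the one-character address a@x.com becomes a***@x.com, and the None value passes through untouched. Values without an @ sign are left as they are, so a column that was selected by mistake is not altered.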

Deploying Custom Visual Transforms

Save Configuration and Code Files:

Save the configuration file as mytransform.json and the transformation code as mytransform.py.
Give both files the same base name; AWS Glue Studio uses the matching names to pair the configuration with the code.

Upload to S3 Bucket:

Upload both files to the specified S3 bucket location: s3://aws-glue-assets-<accountid>-<region>/transforms/.
This S3 path makes the transform accessible for your AWS Glue jobs in that region and account.
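The upload step can be scripted. The sketch below builds the bucket name and object keys from the path pattern above and then uploads both files with boto3. The helper names transform_upload_plan and upload_transform are made up for this example, the account ID and region are placeholder values, and the upload function assumes boto3 is installed and AWS credentials with write access to the bucket are configured.

```python
from pathlib import Path


def transform_upload_plan(account_id, region, base_name, local_dir="."):
    """Return (local_path, bucket, key) tuples for the .json and .py files of one transform."""
    bucket = f"aws-glue-assets-{account_id}-{region}"
    plan = []
    for suffix in (".json", ".py"):
        local_path = Path(local_dir) / f"{base_name}{suffix}"
        plan.append((str(local_path), bucket, f"transforms/{base_name}{suffix}"))
    return plan


def upload_transform(plan):
    """Upload every file in the plan to S3."""
    # Imported here so that the plan can be built and inspected without boto3 installed.
    import boto3

    s3 = boto3.client("s3")
    for local_path, bucket, key in plan:
        s3.upload_file(local_path, bucket, key)


# Placeholder account ID and region; replace with your own values.
plan = transform_upload_plan("123456789012", "us-east-1", "mytransform")
for local_path, bucket, key in plan:
    print(f"{local_path} -> s3://{bucket}/{key}")

# upload_transform(plan)  # uncomment to perform the actual upload
```

Building the plan separately from the upload makes it easy to review the target paths first, and the same plan function can be called once per account and region when the transform has to be copied to several environments.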

Cross-Region and Cross-Account Reusability:

To reuse the transform across multiple AWS regions or accounts, you can:

Manually replicate the configuration and code files across your accounts and regions.
Use S3 Replication to synchronize your transforms automatically across environments.

End-to-End Pipeline Implementation

After you upload the configuration and code files to the designated S3 bucket, your custom node will appear shortly in the “Transforms” tab within AWS Glue Studio. To demonstrate an end-to-end pipeline, consider a dataset containing two PII fields: personal_email_address and work_email_address. Once you select these fields in your custom masking node, AWS Glue applies the masking to both columns. In the “Data Preview” tab, you can verify that the masking has been applied. This shows how custom visual transforms fit into the AWS Glue Studio environment alongside the built-in nodes.

Conclusion

By leveraging custom visual transforms in AWS Glue Studio, you can build specialized ETL transformations that fit your exact needs. The entire procedure is intended to improve and streamline data management tasks, from setting up configuration files to deploying the solution across regions. AWS Glue’s flexibility and scalability make it an excellent choice for building robust, automated data pipelines in the cloud.


