Run a Spark/Scala/Python Jar/Script Using an AWS Glue Job (Serverless) and Schedule It Using a Glue Trigger
Motivation
Running a Python script on Glue and scheduling it. How complicated can it be? It's not! It's quite simple and straightforward. But as we all know, sometimes the simplest things are the hardest to find and figure out.
Like hundreds of other users, I thought the same, and eventually figured out the solution (it took days, after some annoying blockers). So I felt like sharing my experience, so that others out there can build on my efforts and get it done in a few easy clicks.
AWS GLUE in short
Straight from their textbook :
Yup, you read it right. Glue is a fully managed AWS tool. (Obviously it has its limitations, which we'll cover in a short while.)
Benefits
Less hassle
AWS Glue is integrated across a wide range of AWS services, meaning less hassle for you when on-boarding. AWS Glue natively supports data stored in Amazon Aurora and all other Amazon RDS engines, Amazon Redshift, and Amazon S3, as well as common database engines and databases in your Virtual Private Cloud (Amazon VPC) running on Amazon EC2.
Cost effective
AWS Glue is serverless. There is no infrastructure to provision or manage. AWS Glue handles provisioning, configuration, and scaling of the resources required to run your ETL jobs on a fully managed, scale-out Apache Spark environment. You pay only for the resources used while your jobs are running.
More power
AWS Glue automates much of the effort in building, maintaining, and running ETL jobs. AWS Glue crawls your data sources, identifies data formats, and suggests schemas and transformations. AWS Glue automatically generates the code to execute your data transformations and loading processes.
Let's Begin . . .
Creating a Glue Job:
I will continue from where we left off in the last blog {you can find it here}, where I had a Python script to load partitions dynamically into an AWS Athena schema. This blog is more pictorial, with lots of screenshots, and I go step by step to keep it as simple as possible.
Open the Glue landing page in the AWS console and navigate to Jobs in the ETL section.
Let’s get started by creating our very own job.
You can have 3 types of jobs in Glue
1. Spark
2. Spark Streaming
3. Python Shell
I have not had a chance to explore Spark Streaming, so I won't comment much on it.
Basically, you can write your script in Scala or Python, whichever you prefer.
You can create it inside the console or point the job to a script file in an S3 location.
Depending on your Glue job type, you might get extra configurations such as ADVANCED PROPERTIES & MONITORING OPTIONS. These are specific to job type SPARK and include Job bookmark, Job metrics, Continuous logging and Spark UI logs.
Tags are the usual AWS tags, used to organize and identify resources.
Let's discuss Security configuration, script libraries, and job parameters in a bit more detail for the Python and Spark job types.
Python library path, Referenced files path & Dependent jars path (specific to Spark): these are used to provide external dependencies, properties or config files to Python, Scala or Spark applications. You can pass multiple files by listing the S3 paths comma-separated.
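Here is a minimal sketch of how several S3 paths could be joined into the comma-separated form. It assumes the three console fields correspond to the Glue special parameters --extra-py-files, --extra-jars and --extra-files; the helper name, bucket and file names are made up for illustration.

```python
from typing import Dict, Iterable, Optional


def dependency_arguments(
    py_files: Optional[Iterable[str]] = None,
    jars: Optional[Iterable[str]] = None,
    files: Optional[Iterable[str]] = None,
) -> Dict[str, str]:
    """Build Glue job arguments that point at external dependencies.

    Each group of S3 paths is joined with commas and no spaces.
    """
    groups = {
        "--extra-py-files": py_files,  # Python library path (assumed mapping)
        "--extra-jars": jars,          # Dependent jars path (assumed mapping)
        "--extra-files": files,        # Referenced files path (assumed mapping)
    }
    arguments: Dict[str, str] = {}
    for key, paths in groups.items():
        cleaned = [p.strip() for p in (paths or []) if p and p.strip()]
        for path in cleaned:
            if not path.startswith("s3://"):
                raise ValueError(f"expected an S3 path, got {path!r}")
        if cleaned:
            arguments[key] = ",".join(cleaned)
    return arguments


# Hypothetical bucket and file names, for illustration only.
example = dependency_arguments(
    py_files=[
        "s3://my-bucket/libs/helpers.zip",
        "s3://my-bucket/libs/utils-1.0-py3-none-any.whl",
    ],
    files=["s3://my-bucket/conf/app.properties"],
)
print(example)
```

The resulting dictionary can be used as job parameters; groups that were left empty are simply omitted.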
NOTE : Currently AWS Glue only supports a specific set of built-in Python libraries such as Boto3, NumPy, SciPy, sklearn and a few others. AWS also mentions that “Only pure Python libraries can be used. Libraries that rely on C extensions, such as the pandas Python Data Analysis Library, are not yet supported.”
BUT there is a way: you can download or create *.egg / *.whl files of the dependencies you use and pass them to Glue.
The *.whl or *.egg is basically a *.zip file in disguise. If you rename the extension from *.whl to *.zip, you can open it up with the zip application of your choice and examine the files and folders inside at your leisure.
Not all Python dependencies will work, but most will. Give it a try ;)
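To see the "zip in disguise" point without renaming anything, a wheel can be opened directly with Python's standard zipfile module. The sketch below builds a tiny fake wheel in memory (the package name mylib is a toy, not a real library) and lists what is inside.

```python
import io
import zipfile
from typing import List


def list_archive_members(archive_bytes: bytes) -> List[str]:
    """Return the file names stored inside a .whl/.egg/.zip archive."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return sorted(archive.namelist())


def build_fake_wheel() -> bytes:
    """Create a tiny in-memory archive laid out like a wheel (toy content)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mylib/__init__.py", "VERSION = '0.1'\n")
        archive.writestr("mylib-0.1.dist-info/METADATA", "Name: mylib\nVersion: 0.1\n")
    return buffer.getvalue()


wheel_bytes = build_fake_wheel()
print(zipfile.is_zipfile(io.BytesIO(wheel_bytes)))
print(list_archive_members(wheel_bytes))
```

For a wheel on disk, reading the file in binary mode and passing the bytes to list_archive_members works the same way. Seeing compiled files such as *.so or *.pyd in the listing is a hint that the library relies on C extensions.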
Worker Types (specific to Spark jobs): these are like the instance types in an EMR configuration. Glue currently supports 3 worker types:
1. Standard: Standard comes with an extra option to set Maximum capacity. Maximum capacity is the number of AWS Glue data processing units (DPUs) that can be allocated when this job runs. A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory. The Standard worker type has a 50 GB disk and 2 executors.
The maximum capacity can be set between 2 and 100. The default is 10. This job type cannot have a fractional DPU allocation.
As per my tests, I have found the Maximum Capacity to Executors ratio to be:
Maximum Capacity : Executors
5 : 7
10 : 17
15 : 25
20 : 37
& so on..
2. G.1X: This worker type is used for memory-intensive jobs. When we choose this type, instead of Maximum capacity we get Number of workers. Each worker maps to 1 DPU (4 vCPU, 16 GB of memory, 64 GB disk), and provides 1 executor per worker.
The maximum number of workers you can define for G.1X is 299.
3. G.2X: Similar to the above, this worker type is also recommended for memory-intensive jobs and jobs that run ML transformations. Here too we provide the Number of workers. Each worker maps to 2 DPU (8 vCPU, 32 GB of memory, 128 GB disk), and provides 1 executor per worker.
The maximum number of workers you can define for G.2X is 149.
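The sizing rules above can be sketched as a small helper. The limits (2 to 100 DPUs for Standard, 299 and 149 workers for G.1X and G.2X) come from the text. The executor formula is an assumption on my side: one DPU reserved for the master, 2 executors on each remaining DPU, and one executor slot taken by the driver. It reproduces the 5, 10 and 20 rows of my observations but gives 27 rather than 25 for 15 DPUs, so treat it as an estimate only. The function names are hypothetical.

```python
from typing import Dict, Union

# Limits quoted in the text above; check the current AWS docs before relying on them.
STANDARD_CAPACITY_RANGE = (2, 100)
WORKER_LIMITS = {"G.1X": 299, "G.2X": 149}


def standard_executors(max_capacity: int) -> int:
    """Estimate executors for the Standard worker type (assumed formula)."""
    low, high = STANDARD_CAPACITY_RANGE
    if not low <= max_capacity <= high:
        raise ValueError(f"maximum capacity must be between {low} and {high}")
    # One DPU for the master, 2 executors per remaining DPU, minus the driver.
    return (max_capacity - 1) * 2 - 1


def spark_capacity_args(worker_type: str, size: int) -> Dict[str, Union[str, int, float]]:
    """Return the capacity-related keyword arguments for a Spark job."""
    if worker_type == "Standard":
        standard_executors(size)  # reuse the range check
        return {"MaxCapacity": float(size)}
    if worker_type not in WORKER_LIMITS:
        raise ValueError(f"unknown worker type {worker_type!r}")
    limit = WORKER_LIMITS[worker_type]
    if not 1 <= size <= limit:
        raise ValueError(f"{worker_type} accepts at most {limit} workers")
    # G.1X and G.2X are sized by worker count instead of maximum capacity.
    return {"WorkerType": worker_type, "NumberOfWorkers": size}


for dpus in (5, 10, 15, 20):
    print(dpus, standard_executors(dpus))
print(spark_capacity_args("G.1X", 10))
```

The dictionary returned by spark_capacity_args uses the parameter names of the boto3 Glue create_job call, so it can be merged into a job definition.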
Maximum capacity (for Python): You can set the value to 0.0625 or 1. The default is 0.0625. It is similar to the Standard worker type of Spark. You can run Python shell jobs using 1 DPU (Data Processing Unit) or 0.0625 DPU (which is 1/16 DPU). A single DPU provides processing capacity that consists of 4 vCPUs of compute and 16 GB of memory.
Max Concurrency: the maximum number of concurrent runs that are allowed for that job. The default is 1. An error is returned when this threshold is reached.
Job timeout (minutes): sets the maximum execution time in minutes. The default is 2880 minutes (48 hrs). If execution time is greater than this limit, the job run state changes to “TIMEOUT”.
Delay notification threshold: sets the threshold (in minutes) before a delay notification is sent. You can set this threshold to send notifications when a RUNNING, STARTING, or STOPPING job run takes more than an expected number of minutes.
Number of retries: the number of times, from 0 to 10, that AWS Glue should automatically restart the job if it fails.
Job parameters: a set of key-value pairs that are passed as named parameters to the script. These are default values that are used when the script is run, but you can override them in triggers or when you run the job. You must prefix the key name with --; for example: --myKey. You pass job parameters as a map when using the AWS Command Line Interface.
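The same map idea applies when starting a run from Python. Below is a minimal sketch that adds the -- prefix and passes the map to the boto3 Glue client's start_job_run call. The helper names and the example keys are hypothetical, and start_run needs boto3 plus AWS credentials, so only the pure helper is exercised here.

```python
from typing import Dict, Mapping


def to_glue_arguments(params: Mapping[str, object]) -> Dict[str, str]:
    """Prefix keys with "--" (if missing) and turn values into strings."""
    arguments: Dict[str, str] = {}
    for key, value in params.items():
        name = key if key.startswith("--") else f"--{key}"
        arguments[name] = str(value)
    return arguments


def start_run(job_name: str, params: Mapping[str, object]) -> str:
    """Start a job run, overriding the job's default parameters."""
    import boto3  # imported lazily so the helper above works without boto3

    glue = boto3.client("glue")
    response = glue.start_job_run(
        JobName=job_name,
        Arguments=to_glue_arguments(params),
    )
    return response["JobRunId"]


# Hypothetical parameter names, for illustration only.
print(to_glue_arguments({"myKey": "myValue", "--run_date": "2021-09-23"}))
```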
Non-overrideable Job parameters: a set of special job parameters that cannot be overridden in triggers or when you run the job. These key-value pairs are passed as named parameters to the script. getResolvedOptions() returns both job parameters and non-overrideable job parameters in a single map.
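On the script side, reading the parameters could look like the sketch below. Inside Glue it calls getResolvedOptions from awsglue.utils. Outside Glue, where the awsglue package is not installed, it falls back to a small argparse stand-in that I added only so the script can be tried locally; read_job_args and the demo values are hypothetical.

```python
import argparse
from typing import Dict, List


def read_job_args(argv: List[str], names: List[str]) -> Dict[str, str]:
    """Return the named job parameters as a dictionary.

    In a Glue job this delegates to getResolvedOptions. The argparse branch
    is a local stand-in and does not mimic every detail of the real function.
    """
    try:
        from awsglue.utils import getResolvedOptions
    except ImportError:
        parser = argparse.ArgumentParser()
        for name in names:
            parser.add_argument(f"--{name}", required=True)
        # argv[0] is the script name; unknown parameters are ignored.
        known, _unknown = parser.parse_known_args(argv[1:])
        return vars(known)
    return getResolvedOptions(argv, names)


# In a real job you would pass sys.argv; this list simulates it.
demo_argv = ["script.py", "--myKey", "myValue", "--other", "x"]
args = read_job_args(demo_argv, ["myKey"])
print(args["myKey"])
```

Note that the key is requested without the -- prefix, even though the parameter is defined as --myKey on the job.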
Catalog options: Use Glue data catalog as the Hive metastore.
Select this to use the AWS Glue Data Catalog as the Hive metastore. The IAM role used for the job must have the glue:CreateDatabase permission. A database called “default” is created in the Data Catalog if it does not exist.
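For anyone who prefers code over console clicks, the job settings described above map fairly directly onto the boto3 Glue create_job call. The sketch below builds a Python shell job definition. The job name, role ARN and script location are placeholders I made up, and create_job_from needs boto3 and AWS credentials, so only the definition builder is run here.

```python
from typing import Dict, Optional


def python_shell_job_definition(
    name: str,
    role_arn: str,
    script_location: str,
    max_capacity: float = 0.0625,
    timeout_minutes: int = 2880,
    max_retries: int = 0,
    max_concurrent_runs: int = 1,
    default_arguments: Optional[Dict[str, str]] = None,
) -> Dict[str, object]:
    """Build keyword arguments for creating a Python shell job."""
    if max_capacity not in (0.0625, 1.0):
        raise ValueError("Python shell jobs accept 0.0625 or 1 DPU")
    if not 0 <= max_retries <= 10:
        raise ValueError("number of retries must be between 0 and 10")
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "pythonshell",
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        "MaxCapacity": max_capacity,
        "Timeout": timeout_minutes,
        "MaxRetries": max_retries,
        "ExecutionProperty": {"MaxConcurrentRuns": max_concurrent_runs},
        "DefaultArguments": dict(default_arguments or {}),
    }


def create_job_from(definition: Dict[str, object]) -> str:
    """Create the job in AWS Glue and return its name."""
    import boto3  # imported lazily so the builder works without boto3

    glue = boto3.client("glue")
    return glue.create_job(**definition)["Name"]


# Placeholder values, for illustration only.
job = python_shell_job_definition(
    name="load-athena-partitions",
    role_arn="arn:aws:iam::123456789012:role/MyGlueRole",
    script_location="s3://my-bucket/scripts/load_partitions.py",
    default_arguments={"--myKey": "myValue"},
)
print(job)
```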
NOTE : You can also run your existing Scala/Python Spark jar from inside a Glue job by writing a simple Python/Scala script that calls the jar's main function, and passing the jar as an external dependency in “Python Library Path”, “Dependent Jars Path” or “Referenced Files Path” under Security configuration, script libraries, and job parameters.
Special parameters which can be set specifically for Spark jobs are :
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html
You can also have custom log groups for Spark jobs :
https://docs.aws.amazon.com/glue/latest/dg/monitor-continuous-logging-enable.html
https://aws.amazon.com/premiumsupport/knowledge-center/glue-not-writing-logs-cloudwatch/
Also, in order to change the Spark default parameters like spark.driver.memory, spark.executor.memory or spark.yarn.executor.memoryOverhead, you can send them via job parameters under Glue job settings:
key: --conf
value: spark.yarn.executor.memoryOverhead=1G
In case of multiple values, you can send the parameters like below:
key: --conf
value: spark.executor.memory=10G --conf spark.driver.memory=10G --conf spark.yarn.executor.memoryOverhead=1G
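The chaining works because job parameters form a map, so the --conf key can appear only once and every further setting has to ride along inside the value. A small sketch that builds this value from a dictionary (spark_conf_argument is a hypothetical helper name):

```python
from typing import Dict, Mapping


def spark_conf_argument(settings: Mapping[str, str]) -> Dict[str, str]:
    """Pack several Spark settings into the single "--conf" job parameter.

    The first setting becomes the start of the value; each further setting
    is appended with its own "--conf" in front.
    """
    if not settings:
        raise ValueError("at least one Spark setting is required")
    pairs = [f"{key}={value}" for key, value in settings.items()]
    return {"--conf": " --conf ".join(pairs)}


conf = spark_conf_argument(
    {
        "spark.executor.memory": "10G",
        "spark.driver.memory": "10G",
        "spark.yarn.executor.memoryOverhead": "1G",
    }
)
print(conf["--conf"])
```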
Please note : There is a limit on these settings depending on the worker type selected, i.e. for Standard the maximum executor memory is 12G.
AWS officially does not recommend changing the default parameters. Treat it as a last resort, since Glue is a managed service and overriding its defaults can lead to errors.
An AWS Glue connection is a Data Catalog object that stores connection information for a particular data store. Connections store login credentials, URI strings, virtual private cloud (VPC) information, and more. Creating connections in the Data Catalog saves the effort to specify all connection details every time you create a crawler or job. You can use connections for both sources and targets.
The following connection types are available:
* JDBC
* Amazon Relational Database Service (Amazon RDS)
* Amazon Redshift
* MongoDB, including Amazon DocumentDB (with MongoDB compatibility)
For this example, I won't be using any connections.
I am using the same script that I created in my previous blog to load partitions programmatically, and running it periodically through a Glue job.
Let's create a trigger which will run our Glue job. A trigger fires on demand, based on a schedule, or based on a combination of events. There are three types of triggers:
1. Scheduled: A time-based trigger based on cron.
2. Job events (Conditional): A trigger that fires when a previous job or crawler, or multiple jobs or crawlers, satisfy a list of conditions. Each condition checks for one of the job statuses (Succeeded, Failed, Stopped, Timeout).
3. On-demand: A trigger that fires when you activate it.
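I use a scheduled trigger in this blog, but for completeness here is a rough sketch of what a conditional trigger definition could look like for the boto3 Glue create_trigger call. The trigger and job names are invented, and the builder only produces the dictionary; nothing is sent to AWS.

```python
from typing import Dict, List

# Job statuses a condition can check for, as listed above.
ALLOWED_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"}


def conditional_trigger_definition(
    name: str,
    watched_jobs: List[str],
    job_to_start: str,
    state: str = "SUCCEEDED",
) -> Dict[str, object]:
    """Build a trigger that fires once all watched jobs reach `state`."""
    if state not in ALLOWED_STATES:
        raise ValueError(f"state must be one of {sorted(ALLOWED_STATES)}")
    if not watched_jobs:
        raise ValueError("at least one job to watch is required")
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Logical": "AND",  # every condition must hold
            "Conditions": [
                {"LogicalOperator": "EQUALS", "JobName": job, "State": state}
                for job in watched_jobs
            ],
        },
        "Actions": [{"JobName": job_to_start}],
        "StartOnCreation": True,
    }


# Hypothetical names, for illustration only.
trigger = conditional_trigger_definition(
    name="after-ingest",
    watched_jobs=["ingest-a", "ingest-b"],
    job_to_start="load-athena-partitions",
)
print(trigger["Predicate"])
```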
I have selected a Scheduled trigger to run daily at 02:00 UTC, as per my use case.
Timeout : The JobRun timeout in minutes. This is the maximum time that a job run can consume resources before it is terminated and enters TIMEOUT status. The default is 2,880 minutes (48 hours).
This overrides the timeout value set in the parent job.
You can then select the job that the trigger should start. The same trigger can start multiple jobs.
Lastly, tick Enable trigger on creation to activate it as soon as it is created.
There you go. Your trigger shows up in the console with a green info message confirming successful creation. Quite simple, right?
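The same scheduled trigger can be sketched in Python with the boto3 Glue create_trigger call. The cron expression follows the AWS six-field format (minutes, hours, day of month, month, day of week, year). The trigger name, job name and the 60 minute timeout are placeholders, and create_trigger_from needs boto3 and AWS credentials, so only the builders are run here.

```python
from typing import Dict, List, Optional


def daily_cron(hour_utc: int, minute: int = 0) -> str:
    """Return an AWS cron expression that fires once a day (UTC)."""
    if not (0 <= hour_utc <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute 0-59")
    return f"cron({minute} {hour_utc} * * ? *)"


def scheduled_trigger_definition(
    name: str,
    job_names: List[str],
    schedule: str,
    timeout_minutes: Optional[int] = None,
) -> Dict[str, object]:
    """Build a scheduled trigger that starts one or more jobs."""
    actions = []
    for job_name in job_names:
        action: Dict[str, object] = {"JobName": job_name}
        if timeout_minutes is not None:
            # Overrides the timeout set on the job itself.
            action["Timeout"] = timeout_minutes
        actions.append(action)
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": schedule,
        "Actions": actions,
        "StartOnCreation": True,  # same as "Enable trigger on creation"
    }


def create_trigger_from(definition: Dict[str, object]) -> str:
    """Create the trigger in AWS Glue and return its name."""
    import boto3  # imported lazily so the builders work without boto3

    return boto3.client("glue").create_trigger(**definition)["Name"]


# Placeholder names and timeout, for illustration only.
daily = scheduled_trigger_definition(
    name="daily-2am-utc",
    job_names=["load-athena-partitions"],
    schedule=daily_cron(2),
    timeout_minutes=60,
)
print(daily)
```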
That’s it. Our very own glue job is created and scheduled using a trigger.
In AWS Glue, you can use workflows to create and visualize complex extract, transform, and load (ETL) activities involving multiple crawlers, jobs and triggers. Each workflow manages the execution and monitoring of all its components.
I won't be able to cover much about workflows in this blog, but you can refer to the AWS Glue workflow documentation for that.
Most of the content in this article is taken directly from the AWS documentation and just made more concise. That's it for this blog, guys!!
Thank you for reading till the end. I hope you found it worthwhile. I am not an AWS expert, but I am working towards becoming one. I just shared my personal experience from working on a POC, which I thought would help others like me. Do like the blog, comment with your feedback and suggestions, and let me know if you need any help understanding it. Please follow for more such easy and interesting write-ups. This would motivate me to keep writing and sharing my experiences.
Till then Keep Smiling and Keep Coding ✌️😊 !!
About ME :
Data Engineer/Software Developer/Lead with Masters in Data Analytics from Dublin City University having 6+ years of work experience in Data Pipelines, Core Development, Reporting, Visualizations, DB & Data Technologies with progressive growth. I believe in taking ownership of projects and proactively contributing towards the enhancement of the overall business solution.
Currently working with Verizon Connect as a Big Data Engineer. Thanks Verizon Connect for giving me this awesome opportunity to work for some really cool projects of migrating a legacy on-premise data server to AWS thereby, getting my hands dirty while working on neat POCs on AWS.
Updated On : 23/Sept/2021 || Published On : 11/June/2020 || Version : 4
version 4 :
Glue Trigger Timeout info added. Trigger timeout overrides the timeout value set in the parent job
version 3 :
Glue Spark Job Special Parameters which can be set + custom log group for spark jobs
version 2 :
Added Maximum Capacity detailed info for Python Shell Job
version 1 :
Published