ETL stands for Extract, Transform, and Load. An ETL pipeline is really just a data transformation process — extracting data from one place, doing something with it, and then loading it back to the same or a different place.
If you're working with natural language processing via APIs, which I'm guessing many of you will start doing, you can easily hit the timeout threshold of AWS Lambda when processing your data, especially if at least one function exceeds 15 minutes. So, while Lambda is great because it's quick and really cheap, the timeout can be a hassle.
The alternative here is to deploy your code as a container that has the option of running as long as it needs to, and run it on a schedule. So, instead of spinning up a function as you do with Lambda, we can spin up a container to run in an ECS cluster using Fargate.
For clarification, Lambda, ECS and EventBridge are all AWS services.
Just as with Lambda, the cost of running a container for an hour or two is minimal. However, it's a bit more complicated than running a serverless function. But if you're reading this, you've probably run into the same issues and are wondering what the easiest way to transition is.
I've created a very simple ETL template that uses Google BigQuery to extract and load data. This template will get you up and running within a few minutes if you follow along.
Using BigQuery is completely optional, but I usually store my long-term data there.
Instead of building something complex here, I'll show you how to build something minimal and keep it really lean.
If you don't need to process data in parallel, you shouldn't need to include something like Airflow. I've seen a few articles out there that unnecessarily set up complex workflows, which aren't strictly necessary for simple data transformation.
Besides, if you feel like you want to add on to this later, that option is yours.
Workflow
We'll build our script in Python, as we're doing data transformation, then package it up with Docker and push it to an ECR repository.
From there, we can create a task definition using AWS Fargate and run it on a schedule in an ECS cluster.
Don't worry if this feels foreign; you'll understand all these services and what they do as we go along.
Technology
If you're new to working with containers, think of ECS (Elastic Container Service) as something that helps us set up an environment where we can run one or several containers simultaneously.
Fargate, on the other hand, helps us simplify the management and setup of the containers themselves using Docker images — which are referred to as tasks in AWS.
There is the option of using EC2 to set up your containers, but you would have to do a lot more manual work. Fargate manages the underlying instances for us, whereas with EC2 you are required to manage and deploy your own compute instances. Hence, Fargate is often called the 'serverless' option.
I found a thread on Reddit discussing this, if you're keen to read a bit about how users find using EC2 versus Fargate. It can give you an idea of how people compare the two.
Not that I'm saying Reddit is the source of truth, but it's useful for getting a sense of user perspectives.
Costs
The primary concern I usually have is to keep the code running efficiently while also managing the total cost.
As we're only running the container when we need to, we only pay for the amount of resources we use. The price we pay is determined by several factors, such as the number of tasks running, the execution duration of each task, the number of virtual CPUs (vCPUs) used for the task, and memory usage.
But to give you a rough idea, at a high level, the total cost of running one task is around $0.01384 per hour for the EU region, depending on the resources you've provisioned.
If we compare this price with AWS Glue, we get a bit of perspective on whether it is good or not.
If an ETL job requires 4 DPUs (the default number for an AWS Glue job) and runs for an hour, it would cost 4 DPUs * $0.44 = $1.76. This cost is for only one hour and is significantly higher than running a simple container.
This is, of course, a simplified calculation, and the actual number of DPUs can vary depending on the job. You can check out AWS Glue pricing in more detail on their pricing page.
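If you want to sanity-check the Fargate side of that comparison yourself, here is a minimal sketch of the arithmetic. The per-hour rates below are placeholder values I've filled in for illustration, so check the Fargate pricing page for your region's current numbers.
# Rough Fargate cost estimate for one scheduled task.
# Fargate bills per vCPU-hour and per GB-hour; the rates below are example values,
# not current prices - look them up on the Fargate pricing page for your region.
VCPU_RATE = 0.04048      # assumed USD per vCPU-hour
MEMORY_RATE = 0.004445   # assumed USD per GB-hour
task_vcpu = 0.25         # .25 vCPU, as in the task definition later on
task_memory_gb = 0.5     # 512 MB
hours_per_run = 1        # one hour-long run per day
hourly_cost = task_vcpu * VCPU_RATE + task_memory_gb * MEMORY_RATE
print(f"~${hourly_cost:.4f} per hour, ~${hourly_cost * hours_per_run * 30:.2f} per month")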
For long-running scripts, setting up your own container and deploying it on ECS with Fargate makes sense, both in terms of efficiency and cost.
To follow this article, I've created a simple ETL template to help you get up and running quickly.
This template uses BigQuery to extract and load data. It will extract a few rows, do something simple and then load them back into BigQuery.
When I run my pipelines I have other things that transform data — I use APIs for natural language processing that run for a few hours in the morning — but that's up to you to add on later. This is just to give you a template that will be easy to work with.
To follow along with this tutorial, the main steps will be as follows:
- Setting up your local code.
- Setting up an IAM user & the AWS CLI.
- Building & pushing the Docker image to AWS.
- Creating an ECS task definition.
- Creating an ECS cluster.
- Scheduling your tasks.
In total it shouldn't take you longer than 20 minutes to get through this, using the code I provide. This assumes you have an AWS account ready; if not, add on 5 to 10 minutes.
The Code
First, create a new folder locally and cd into it.
mkdir etl-pipelines
cd etl-pipelines
Make sure you have Python installed.
python --version
If not, install it locally.
Once you're ready, you can go ahead and clone the template I've already set up.
git clone https://github.com/ilsilfverskiold/etl-pipeline-fargate.git
When it has finished fetching the code, open it up in your code editor.
First check the main.py file to see how I've structured the code and understand what it does.
Essentially, it will fetch all names containing "Doe" from a table in BigQuery that you specify, transform those names and then insert them back into the same data table as new rows.
You can go into each helper function to see how we set up the SQL query job, transform the data and then insert it back into the BigQuery table.
The idea is of course that you set up something more complex; this is just a simple test run to make it easy to tweak the code.
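If you just want the gist without opening the repo, the flow of main.py looks roughly like this. Treat it as a simplified sketch rather than the exact template code, with the table ID as a placeholder.
from google.cloud import bigquery
TABLE_ID = "your-project.your_dataset.your_table"  # placeholder, set your own table ID
# Authenticate with the service account key in the root folder
client = bigquery.Client.from_service_account_json("google_credentials.json")
# Extract: fetch all rows where the name contains "Doe"
rows = client.query(f'SELECT name FROM `{TABLE_ID}` WHERE name LIKE "%Doe%"').result()
# Transform: something trivial, e.g. uppercase the names
transformed = [{"name": row["name"].upper()} for row in rows]
# Load: insert the transformed rows back into the same table as new rows
errors = client.insert_rows_json(TABLE_ID, transformed)
print("New rows have been added" if not errors else f"Errors: {errors}")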
Setting Up BigQuery
If you want to continue with the code I've prepared, you will need to set up a few things in BigQuery. Otherwise you can skip this part.
Here are the things you will need (a small sketch for setting up the table follows the list):
- A BigQuery table with a field called 'name' of type string.
- A couple of rows in the data table with the name "Doe" in them.
- A service account that has access to this dataset.
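If you'd rather create the table and seed rows from code than click through the console, a minimal sketch could look like this (you'll need the service account key from the steps below first; the project, dataset and table names are placeholders).
from google.cloud import bigquery
client = bigquery.Client.from_service_account_json("google_credentials.json")
table_id = "your-project.your_dataset.names"  # placeholder IDs
schema = [bigquery.SchemaField("name", "STRING")]
client.create_table(bigquery.Table(table_id, schema=schema), exists_ok=True)
# Seed a couple of rows containing "Doe" so the pipeline has something to fetch
client.insert_rows_json(table_id, [{"name": "John Doe"}, {"name": "Jane Doe"}])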
To get a service account, you need to navigate to IAM in the Google Cloud Console and then to Service Accounts.
Once there, create a new service account.
Once it has been created, you need to give your service account BigQuery User access globally via IAM.
You also need to give this service account access to the dataset itself, which you do in BigQuery directly via the dataset's Share button and then by pressing Add Principal.
After you've given the user the appropriate permissions, make sure you go back to Service Accounts and download a key. This will give you a JSON file that you need to put in your root folder.
Now, the most important part is making sure the code has access to the Google credentials and is using the correct data table.
You need the JSON file you've downloaded with the Google credentials in your root folder as google_credentials.json, and then you want to specify the correct table ID.
Now you may argue that you don't want to store your credentials locally, which is only right.
You can add the option of storing your JSON file in AWS Secrets Manager later. However, to start, this will be easier.
Run the ETL Pipeline Locally
We'll run this code locally first, just so we can see that it works.
So, set up a Python virtual environment and activate it.
python -m venv etl-env
source etl-env/bin/activate # On Windows use `etl-env\Scripts\activate`
Then install the dependencies. We only have google-cloud-bigquery in there, but ideally you'll have more dependencies.
pip install -r requirements.txt
Run the main script.
python main.py
This should log 'New rows have been added' in your terminal. This confirms that the code works as intended.
The Docker Image
Now, to push this code to ECS we have to package it up into a Docker image, which means you'll need Docker installed locally.
If you don't have Docker installed, you can download it here.
Docker helps us package an application and its dependencies into an image, which can be easily recognized and run on any system. With ECS, we are required to package our code into Docker images, which are then referenced by a task definition to run as containers.
I've already set up a Dockerfile in your folder. You should be able to look at it there.
FROM --platform=linux/amd64 python:3.11-slim
WORKDIR /app
COPY . /app
RUN pip install --no-cache-dir -r requirements.txt
CMD ["python", "main.py"]
As you can see, I've kept this really lean, as we're not connecting web traffic to any ports here.
We're specifying AMD64, which you may not need if you're not on a Mac with an M1 chip, but it shouldn't hurt. This tells AWS the architecture of the docker image so we don't run into compatibility issues.
Create an IAM User
When working with AWS, access needs to be specified. Most of the issues you'll run into are permission issues. We'll be working with the CLI locally, and for this to work we have to create an IAM user that will need quite broad permissions.
Go to the AWS console and navigate to IAM. Create a new user, add permissions and then create a new policy to attach to it.
I've specified the permissions needed in your code in the aws_iam_user.json file. You'll see a short snippet below of what this JSON file looks like.
{
"Model": "2012-10-17",
"Assertion": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": [
"logs:CreateLogGroup",
"iam:CreateRole",
"iam:AttachRolePolicy",
"iam:PutRolePolicy",
"ecs:DescribeTaskDefinition",
...more
],
"Useful resource": "*"
}
]
}
Make sure you go into this file to get all the permissions you will need to set; this is just a short snippet. I've set quite a few, which you may want to tweak to your own preferences later.
Once you've created the IAM user and added the correct permissions to it, you will need to generate an access key. Choose 'Command Line Interface (CLI)' when asked about your use case.
Download the credentials. We'll use these to authenticate in a bit.
Set Up the AWS CLI
Next, we'll connect our terminal to our AWS account.
If you don't have the CLI set up yet, you can follow the instructions here. It's very easy to set up.
Once you've installed the AWS CLI, you'll need to authenticate with the IAM user we just created.
aws configure
Use the credentials we downloaded for the IAM user in the previous step.
Create an ECR Repository
Now we can get started with the DevOps of it all.
We first need to create a repository in Elastic Container Registry. ECR is where we store and manage our docker images. We'll be able to reference these images from ECR when we set up our task definition.
To create a new ECR repository, run this command in your terminal. This will create a repository called bigquery-etl-pipeline.
aws ecr create-repository --repository-name bigquery-etl-pipeline
Note the repository URI you get back.
From here, we have to build the docker image and then push it to this repository.
To do this, you can technically go into the AWS console and find the ECR repository we just created. There, AWS lets you see all the push commands you need to run to authenticate, build and push your docker image to this ECR repository.
However, if you're on a Mac, I'd advise you to specify the architecture when building the docker image or you may run into issues.
If you're following along with me, start by authenticating your docker client like so.
aws ecr get-login-password --region YOUR_REGION | docker login --username AWS --password-stdin YOUR_ACCOUNT_ID.dkr.ecr.YOUR_REGION.amazonaws.com
Make sure to change the values for region and account ID where applicable.
Build the docker image.
docker buildx build --platform=linux/amd64 -t bigquery-etl-pipeline .
This is where I've tweaked the command to specify the linux/amd64 architecture.
Tag the docker image.
docker tag bigquery-etl-pipeline:latest YOUR_ACCOUNT_ID.dkr.ecr.YOUR_REGION.amazonaws.com/bigquery-etl-pipeline:latest
Push the docker image.
docker push YOUR_ACCOUNT_ID.dkr.ecr.YOUR_REGION.amazonaws.com/bigquery-etl-pipeline:latest
If everything worked as planned, you'll see something like this in your terminal.
9f691c4f0216: Pushed
ca0189907a60: Pushed
687f796c98d5: Pushed
6beef49679a3: Pushed
b0dce122021b: Pushed
4de04bd13c4a: Pushed
cf9b23ff5651: Pushed
644fed2a3898: Pushed
Now that we have pushed the docker image to an ECR repository, we can use it to set up our task definition using Fargate.
If you run into EOF issues here, it's most likely related to IAM permissions. Make sure the IAM user has everything it needs, in this case full access to ECR to tag and push the image.
Roles & Log Teams
Keep in mind what I instructed you earlier than, the most important points you’ll run into in AWS pertains to roles between totally different companies.
For this to movement neatly we’ll have to ensure we arrange a number of issues earlier than we begin establishing a job definition and an ECS cluster.
To do that, we first should create a job position — this position is the position that can want entry to companies within the AWS ecosystem from our container — after which the execution position — so the container will be capable of pull the docker picture from ECR.
aws iam create-role --role-name etl-pipeline-task-role --assume-role-policy-document file://ecs-tasks-trust-policy.json
aws iam create-role --role-name etl-pipeline-execution-role --assume-role-policy-document file://ecs-tasks-trust-policy.json
I've specified a JSON file called ecs-tasks-trust-policy.json in your folder locally that will be used to create these roles.
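If you're curious what such a trust policy contains, it's typically just a single statement letting ecs-tasks.amazonaws.com assume the role, something like this:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ecs-tasks.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}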
The script that we're pushing won't need permission to access other AWS services, so for now there is no need to attach policies to the task role. But you may want to do this later.
However, for the execution role we do need to give it ECR access so it can pull the docker image.
To attach the policy AmazonECSTaskExecutionRolePolicy to the execution role, run this command.
aws iam attach-role-policy --role-name etl-pipeline-execution-role --policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy
We also create one last role while we're at it, a service-linked role.
aws iam create-service-linked-role --aws-service-name ecs.amazonaws.com
If you don't create the service-linked role, you may end up with an error such as 'Unable to assume the service linked role. Please verify that the ECS service linked role exists' when you try to run a task.
The last thing we create is a log group. Creating a log group is essential for capturing and accessing the logs generated by your container.
To create a log group, you can run this command.
aws logs create-log-group --log-group-name /ecs/etl-pipeline-logs
Once you've created the execution role, the task role, the service-linked role and the log group, we can proceed to set up the ECS task definition.
Create an ECS Task Definition
A task definition is a blueprint for your tasks, specifying what container image to use, how much CPU and memory is required, and other configurations. We use this blueprint to run tasks in our ECS cluster.
I've already set up the task definition in your code at task-definition.json. However, you need to set your account ID as well as region in there to make sure it runs as it should.
{
"household": "my-etl-task",
"taskRoleArn": "arn:aws:iam::ACCOUNT_ID:position/etl-pipeline-task-role",
"executionRoleArn": "arn:aws:iam::ACCOUNT_ID:position/etl-pipeline-execution-role",
"networkMode": "awsvpc",
"containerDefinitions": [
{
"name": "my-etl-container",
"image": "ACCOUNT_ID.dkr.ecr.REGION.amazonaws.com/bigquery-etl-pipeline:latest",
"cpu": 256,
"memory": 512,
"essential": true,
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/etl-pipeline-logs",
"awslogs-region": "REGION",
"awslogs-stream-prefix": "ecs"
}
}
}
],
"requiresCompatibilities": ["FARGATE"],
"cpu": "256",
"reminiscence": "512"
}
Remember the URI we got back when we created the ECR repository? This is where we use it. Remember the execution role, the task role and the log group? We use them here as well.
If you've named the ECR repository, the roles and the log group exactly what I named mine, then you can simply change the account ID and region in this JSON; otherwise make sure the URI is the correct one.
You can also set the CPU and memory here for what you need to run your task — i.e. your code. I've set .25 vCPU and 512 MB of memory.
Once you're satisfied, you can register the task definition in your terminal.
aws ecs register-task-definition --cli-input-json file://task-definition.json
Now you should be able to go into Amazon Elastic Container Service and find the task definition we've created under Task Definitions.
This task — i.e. blueprint — won't run on its own; we need to invoke it later.
Create an ECS Cluster
An ECS cluster serves as a logical grouping of tasks or services. You specify this cluster when running tasks or creating services.
To create a cluster via the CLI, run this command.
aws ecs create-cluster --cluster-name etl-pipeline-cluster
Once you run this command, you'll be able to see this cluster in ECS in your AWS console if you look there.
We'll attach the task definition we just created to this cluster when we run it in the next part.
Run the Task
Before we can run the task, we need to get hold of the subnets that are available to us along with a security group ID.
We can do this directly in the terminal via the CLI.
Run this command in the terminal to get the available subnets.
aws ec2 describe-subnets
You'll get back an array of objects here, and you're looking for the SubnetId of each object.
If you run into issues here, make sure your IAM user has the appropriate permissions. See the aws_iam_user.json file in your root folder for the permissions the IAM user connected to the CLI will need. I'll stress this, because it's the main issue I always run into.
To get the security group ID, you can run this command.
aws ec2 describe-security-groups
You're looking for the GroupId here in the terminal.
Once you've got at least one SubnetId and a GroupId for a security group, we're ready to run the task to test that the blueprint — i.e. task definition — works.
aws ecs run-task \
  --cluster etl-pipeline-cluster \
  --launch-type FARGATE \
  --task-definition my-etl-task \
  --count 1 \
  --network-configuration "awsvpcConfiguration={subnets=[SUBNET_ID],securityGroups=[SECURITY_GROUP_ID],assignPublicIp=ENABLED}"
Do remember to change the names if you've named your cluster and task definition differently. Remember to also set your subnet ID and security group ID.
Now you can navigate to the AWS console to see the task running.
If you're having issues, you can look into the logs.
If successful, you should see a few transformed rows added to BigQuery.
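If you have AWS CLI v2 installed, you can also tail the log group we created earlier straight from the terminal instead of clicking through the console.
aws logs tail /ecs/etl-pipeline-logs --follow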
EventBridge Schedule
Now we've managed to set up the task to run in an ECS cluster. But what we're really after is making it run on a schedule. This is where EventBridge comes in.
EventBridge will set up our scheduled events, and we can set this up using the CLI as well. However, before we set up the schedule, we first need to create a new role.
That's life when working with AWS; everything needs permission to interact with everything else.
In this case, EventBridge needs permission to run the task in our ECS cluster on our behalf.
In the repository you have a file called trust-policy-for-eventbridge.json that I've already put there; we'll use this file to create the EventBridge role.
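It's the same shape as the ECS tasks trust policy from before, just with EventBridge as the principal. Roughly:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "events.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}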
Paste this command into the terminal and run it.
aws iam create-role \
  --role-name ecsEventsRole \
  --assume-role-policy-document file://trust-policy-for-eventbridge.json
We then need to attach a policy to this role.
aws iam attach-role-policy \
  --role-name ecsEventsRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonECS_FullAccess
It needs at least ecs:RunTask, but we've given it full access here. If you prefer to limit the permissions, you can create a custom policy with just the necessary permissions instead.
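If you do want to lock it down, a minimal custom policy would need roughly ecs:RunTask on the task definition plus iam:PassRole on the two roles it references. Something along these lines, with the ARNs as placeholders:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "ecs:RunTask",
      "Resource": "arn:aws:ecs:REGION:ACCOUNT_ID:task-definition/my-etl-task:*"
    },
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": [
        "arn:aws:iam::ACCOUNT_ID:role/etl-pipeline-task-role",
        "arn:aws:iam::ACCOUNT_ID:role/etl-pipeline-execution-role"
      ]
    }
  ]
}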
Now let's set up the rule to schedule the task to run with the task definition every day at 5 am UTC. That's usually the time I'd like it to process data for me, so if it fails I can look into it after breakfast.
aws events put-rule \
  --name "ETLPipelineDailyRun" \
  --schedule-expression "cron(0 5 * * ? *)" \
  --state ENABLED
You should receive back an object with a field called RuleArn here. This is just to confirm that it worked.
The next step is to associate the rule with the ECS task definition.
aws events put-targets --rule "ETLPipelineDailyRun" \
  --targets '[{"Id":"1","Arn":"arn:aws:ecs:REGION:ACCOUNT_NUMBER:cluster/etl-pipeline-cluster","RoleArn":"arn:aws:iam::ACCOUNT_NUMBER:role/ecsEventsRole","EcsParameters":{"TaskDefinitionArn":"arn:aws:ecs:REGION:ACCOUNT_NUMBER:task-definition/my-etl-task","TaskCount":1,"LaunchType":"FARGATE","NetworkConfiguration":{"awsvpcConfiguration":{"Subnets":["SUBNET_ID"],"SecurityGroups":["SECURITY_GROUP_ID"],"AssignPublicIp":"ENABLED"}}}}]'
Remember to set your own values here for region, account number, subnet and security group.
Use the subnets and security group that we got earlier. You can set several subnets.
Once you've run the command, the task is scheduled for 5 am every day and you'll find it under Scheduled Tasks in the AWS console.
AWS Secrets Manager (Optional)
Keeping your Google credentials in the root folder isn't ideal, even if you've restricted the Google service account's access to your datasets.
Here we can add the option of moving these credentials to another AWS service and then accessing them from our container.
For this to work, you need to move the credentials file to Secrets Manager, tweak the code so it can fetch them to authenticate, and make sure the task role has permission to access AWS Secrets Manager on your behalf.
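As a rough sketch of what that tweak could look like, assuming you've stored the whole key file as a secret called google-credentials, added boto3 to requirements.txt, and given the task role secretsmanager:GetSecretValue on it:
import json
import boto3
from google.cloud import bigquery
from google.oauth2 import service_account
# Fetch the service account key from Secrets Manager instead of reading a local file
secrets = boto3.client("secretsmanager", region_name="YOUR_REGION")
secret = secrets.get_secret_value(SecretId="google-credentials")  # assumed secret name
service_account_info = json.loads(secret["SecretString"])
credentials = service_account.Credentials.from_service_account_info(service_account_info)
client = bigquery.Client(credentials=credentials, project=credentials.project_id)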
When you're done, you can simply push the updated docker image to the ECR repo you set up before.
The End Result
Now you've got a very simple ETL pipeline running in a container on AWS on a schedule. The idea is that you add to it to do your own data transformations.
Hopefully this was a useful piece for anyone transitioning to running their long-running data transformation scripts on ECS in a simple, cost-effective and straightforward way.
Let me know if you run into any issues, in case there's something I missed including.
❤