Implement a Custom Training Solution Based on Amazon EC2 | by Chaim Rand | Jan, 2024


A Simple Solution for Managing Cloud-Based ML Training, Part 2

Photo by Vlad D on Unsplash

This is a sequel to a recent post on the topic of building custom, cloud-based solutions for machine learning (ML) model development using low-level instance provisioning services. Our focus in this post will be on Amazon EC2.

Cloud service providers (CSPs) typically offer fully managed solutions for training ML models in the cloud. Amazon SageMaker, for example, Amazon’s managed service offering for ML development, simplifies the training process considerably. Not only does SageMaker automate the end-to-end training execution, from auto-provisioning the requested instance types, to setting up the training environment, to running your training workload, to saving the training artifacts and shutting everything down, but it also offers a number of auxiliary services that support ML development, such as automated model tuning, platform-optimized distributed training libraries, and more. However, as is often the case with high-level solutions, the increased ease-of-use of SageMaker training comes with a certain loss of control over the underlying flow.
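
To illustrate the contrast, the snippet below sketches what such a fully managed training job might look like with the SageMaker Python SDK. This is a rough sketch only; the entry point, role ARN, and version arguments are placeholders rather than values taken from this post.

from sagemaker.pytorch import PyTorch

# a rough sketch of a managed training job (placeholder values, not from this post)
estimator = PyTorch(
    entry_point='train.py',                  # your training script
    role='<sagemaker-execution-role-arn>',   # replace with an execution role
    instance_count=1,
    instance_type='ml.g5.xlarge',
    framework_version='2.1',
    py_version='py310',
)
estimator.fit()  # SageMaker provisions, trains, saves artifacts, and shuts down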

In our previous post we noted some of the limitations commonly imposed by managed training services such as SageMaker, including reduced user privileges, inaccessibility of some instance types, reduced control over multi-node device placement, and more. Some scenarios require a higher level of autonomy over the environment specification and training flow. In this post, we illustrate one approach to addressing these cases by creating a custom training solution built on top of Amazon EC2.

Many thanks to Max Rabin for his contributions to this post.

In our previous post we listed a minimal set of features that we would require from an automated training solution and proceeded to demonstrate, in a step-by-step manner, one way of implementing these in Google Cloud Platform (GCP). And although the same sequence of steps would apply to any other cloud platform, the details can be quite different due to the unique nuances of each one. Our intention in this post will be to propose an implementation based on Amazon EC2 using the create_instances command of the AWS Python SDK (version 1.34.23). As in our previous post, we will begin with a simple EC2 instance creation command and gradually supplement it with additional components that will incorporate our desired management features. The create_instances command supports many controls. For the purposes of our demonstration, we will focus only on the ones that are relevant to our solution. We will assume the existence of a default VPC and an IAM instance profile with appropriate permissions (including access to the Amazon EC2, S3, and CloudWatch services).

Note that there are multiple ways of using Amazon EC2 to fulfill the minimal set of features that we defined. We have chosen to demonstrate one possible implementation. Please do not interpret our choice of AWS, EC2, or any details of the specific implementation we have chosen as an endorsement. The best ML training solution for you will greatly depend on the specific needs and details of your project.

We begin with a minimal example of a single EC2 instance request. We have chosen a GPU accelerated g5.xlarge instance type and a recent Deep Learning AMI (with an Ubuntu 20.04 operating system).

import boto3

region = 'us-east-1'
job_id = 'my-experiment' # replace with unique id
num_instances = 1
image_id = 'ami-0240b7264c1c9e6a9' # replace with image of choice
instance_type = 'g5.xlarge' # replace with instance of choice

ec2 = boto3.resource('ec2', region_name=region)

instances = ec2.create_instances(
    MaxCount=num_instances,
    MinCount=num_instances,
    ImageId=image_id,
    InstanceType=instance_type,
)

The first enhancement we would like to apply is for our training workload to start automatically as soon as our instance is up and running, without any need for manual intervention. Towards this goal, we will utilize the UserData argument of the create_instances API that lets you specify what to run at launch. In the code block below, we propose a sequence of commands that sets up the training environment (i.e., updates the PATH environment variable to point to the prebuilt PyTorch environment included in our image), downloads our training code from Amazon S3, installs the project dependencies, runs the training script, and syncs the output artifacts to persistent S3 storage. The demonstration assumes that the training code has already been created and uploaded to the cloud and that it contains two files: a requirements file (requirements.txt) and a stand-alone training script (train.py). In practice, the precise contents of the startup sequence will depend on the project. We include a pointer to our predefined IAM instance profile which is required for accessing S3.

import boto3

region = 'us-east-1'
job_id = 'my-experiment' # replace with unique id
num_instances = 1
image_id = 'ami-0240b7264c1c9e6a9' # replace with image of choice
instance_type = 'g5.xlarge' # replace with instance of choice
instance_profile_arn = 'instance-profile-arn' # replace with profile arn

ec2 = boto3.resource('ec2', region_name=region)

script = """#!/bin/bash
# environment setup
export PATH=/opt/conda/envs/pytorch/bin:$PATH

# download and unpack code
aws s3 cp s3://my-s3-path/my-code.tar .
tar -xvf my-code.tar

# install dependencies
python3 -m pip install -r requirements.txt

# run training workload
python3 train.py

# sync output artifacts
aws s3 sync artifacts s3://my-s3-path/artifacts
"""

instances = ec2.create_instances(
    MaxCount=num_instances,
    MinCount=num_instances,
    ImageId=image_id,
    InstanceType=instance_type,
    IamInstanceProfile={'Arn': instance_profile_arn},
    UserData=script
)

Note that the script above syncs the training artifacts only at the end of training. A more fault-tolerant solution would sync intermediate model checkpoints throughout the training job.
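
For example, the following sketch (hypothetical code, assuming the training script writes its checkpoints to a local checkpoints/ directory and that the bucket is the same my-s3-path used above) shows how train.py could copy each checkpoint to S3 as soon as it is saved:

import os
import boto3
import torch

s3 = boto3.client('s3')

def save_checkpoint(model_state, epoch, bucket='my-s3-path'):
    # hypothetical helper for train.py: persist the checkpoint locally,
    # then copy it to S3 so that an interrupted job can resume from the last epoch
    os.makedirs('checkpoints', exist_ok=True)
    local_path = f'checkpoints/epoch_{epoch}.pt'
    torch.save(model_state, local_path)
    s3.upload_file(local_path, bucket, f'checkpoints/epoch_{epoch}.pt')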

When you train using a managed service, your instances are automatically shut down as soon as your script completes, ensuring that you only pay for what you need. In the code block below, we append a self-destruction command to the end of our UserData script. We do this using the AWS CLI terminate-instances command. The command requires that we know the instance-id and the hosting region of our instance, which we extract from the instance metadata. Our updated script assumes that our IAM instance profile has the appropriate instance-termination authorization.

script = """#!/bin/bash
# environment setup
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
INST_MD=http://169.254.169.254/latest/meta-data
CURL_FLAGS=(-s -H "X-aws-ec2-metadata-token: ${TOKEN}")
INSTANCE_ID=$(curl "${CURL_FLAGS[@]}" $INST_MD/instance-id)
REGION=$(curl "${CURL_FLAGS[@]}" $INST_MD/placement/region)
export PATH=/opt/conda/envs/pytorch/bin:$PATH

# download and unpack code
aws s3 cp s3://my-s3-path/my-code.tar .
tar -xvf my-code.tar

# install dependencies
python3 -m pip install -r requirements.txt

# run training workload
python3 train.py

# sync output artifacts
aws s3 sync artifacts s3://my-s3-path/artifacts

# self-destruct
aws ec2 terminate-instances --instance-ids $INSTANCE_ID --region $REGION
"""

We highly recommend introducing additional mechanisms for verifying appropriate instance deletion to avoid the possibility of having unused (“orphan”) instances in the system racking up unnecessary costs. In a recent post we showed how serverless functions can be used to address this kind of problem.
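
As a simple illustration (a sketch only, not the serverless solution referenced above), a scheduled task could scan for running training instances, identified here by the JOB_ID tag introduced in the next section, that have been up longer than some chosen threshold, and flag or terminate them:

import boto3
from datetime import datetime, timedelta, timezone

# hypothetical watchdog: find running training instances (identified by a JOB_ID
# tag, as applied in the next section) that are older than a chosen threshold
ec2_client = boto3.client('ec2', region_name='us-east-1')
threshold = datetime.now(timezone.utc) - timedelta(hours=12)

pages = ec2_client.get_paginator('describe_instances').paginate(Filters=[
    {'Name': 'tag-key', 'Values': ['JOB_ID']},
    {'Name': 'instance-state-name', 'Values': ['running']}
])
for page in pages:
    for reservation in page['Reservations']:
        for inst in reservation['Instances']:
            if inst['LaunchTime'] < threshold:
                print(f"possible orphan: {inst['InstanceId']}")
                # optionally: ec2_client.terminate_instances(InstanceIds=[inst['InstanceId']])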

Amazon EC2 enables you to apply custom metadata to your instance using EC2 instance tags. This lets you pass information to the instance regarding the training workload and/or the training environment. Here, we use the TagSpecifications setting to pass in an instance name and a unique training job id. We use the unique id to define a dedicated S3 path for our job artifacts. Note that we need to explicitly enable the instance to access the metadata tags via the MetadataOptions setting.

import boto3

region = 'us-east-1'
job_id = 'my-experiment' # replace with unique id
num_instances = 1
image_id = 'ami-0240b7264c1c9e6a9' # replace with image of choice
instance_type = 'g5.xlarge' # replace with instance of choice
instance_profile_arn = 'instance-profile-arn' # replace with profile arn

ec2 = boto3.resource('ec2', region_name=region)

script = """#!/bin/bash
# environment setup
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
INST_MD=http://169.254.169.254/latest/meta-data
CURL_FLAGS=(-s -H "X-aws-ec2-metadata-token: ${TOKEN}")
INSTANCE_ID=$(curl "${CURL_FLAGS[@]}" $INST_MD/instance-id)
REGION=$(curl "${CURL_FLAGS[@]}" $INST_MD/placement/region)
JOB_ID=$(curl "${CURL_FLAGS[@]}" $INST_MD/tags/instance/JOB_ID)
export PATH=/opt/conda/envs/pytorch/bin:$PATH

# download and unpack code
aws s3 cp s3://my-s3-path/$JOB_ID/my-code.tar .
tar -xvf my-code.tar

# install dependencies
python3 -m pip install -r requirements.txt

# run training workload
python3 train.py

# sync output artifacts
aws s3 sync artifacts s3://my-s3-path/$JOB_ID/artifacts

# self-destruct
aws ec2 terminate-instances --instance-ids $INSTANCE_ID --region $REGION
"""

instances = ec2.create_instances(
    MaxCount=num_instances,
    MinCount=num_instances,
    ImageId=image_id,
    InstanceType=instance_type,
    IamInstanceProfile={'Arn': instance_profile_arn},
    UserData=script,
    MetadataOptions={"InstanceMetadataTags": "enabled"},
    TagSpecifications=[{
        'ResourceType': 'instance',
        'Tags': [
            {'Key': 'NAME', 'Value': 'test-vm'},
            {'Key': 'JOB_ID', 'Value': f'{job_id}'}
        ]
    }],
)

Using metadata tags to pass information to our instances will be particularly useful in the sections that follow.

Naturally, we require the ability to analyze our application’s output logs both during and after training. This requires that they be periodically synced to persistent logging. In this post we implement this using Amazon CloudWatch. Below we define a minimal JSON configuration file for enabling CloudWatch log collection, which we add to our source code tar-ball as cw_config.json. Please see the official documentation for details on CloudWatch setup and configuration.

{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/output.log",
            "log_group_name": "/aws/ec2/training-jobs",
            "log_stream_name": "job-id"
          }
        ]
      }
    }
  }
}

In practice, we want the log_stream_name to uniquely identify the training job. Towards that end, we use the sed command to replace the generic “job-id” string with the job id metadata tag from the previous section. Our enhanced script also includes the CloudWatch agent start-up command and modifications for piping the standard output to the designated output.log defined in the CloudWatch config file.

script = """#!/bin/bash
# environment setup
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
INST_MD=http://169.254.169.254/latest/meta-data
CURL_FLAGS=(-s -H "X-aws-ec2-metadata-token: ${TOKEN}")
INSTANCE_ID=$(curl "${CURL_FLAGS[@]}" $INST_MD/instance-id)
REGION=$(curl "${CURL_FLAGS[@]}" $INST_MD/placement/region)
JOB_ID=$(curl "${CURL_FLAGS[@]}" $INST_MD/tags/instance/JOB_ID)
export PATH=/opt/conda/envs/pytorch/bin:$PATH

# download and unpack code
aws s3 cp s3://my-s3-path/$JOB_ID/my-code.tar .
tar -xvf my-code.tar

# configure cloudwatch
sed -i "s/job-id/${JOB_ID}/g" cw_config.json
/opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:cw_config.json -s

# install dependencies
python3 -m pip install -r requirements.txt 2>&1 | tee -a output.log

# run training workload
python3 train.py 2>&1 | tee -a output.log

# sync output artifacts
aws s3 sync artifacts s3://my-s3-path/$JOB_ID/artifacts

# self-destruct
aws ec2 terminate-instances --instance-ids $INSTANCE_ID --region $REGION
"""

Nowadays, it is quite common for training jobs to run on multiple nodes in parallel. Modifying our instance request code to support multiple nodes is simply a matter of modifying the num_instances setting. The challenge is how to configure the environment in a manner that supports distributed training, i.e., a manner that enables, and optimizes, the transfer of data between the instances.

To minimize the network latency between the instances and maximize throughput, we add a pointer to a predefined cluster placement group in the Placement field of our ec2 instance request. The following command line demonstrates the creation of a cluster placement group.

aws ec2 create-placement-group --group-name cluster-placement-group --strategy cluster
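
Alternatively, and purely as an illustration consistent with the boto3 usage in the rest of this post, the same placement group could be created with the SDK (assuming the same group name and region as above):

import boto3

# create the cluster placement group via boto3 (equivalent to the CLI command above)
ec2_client = boto3.client('ec2', region_name='us-east-1')
ec2_client.create_placement_group(
    GroupName='cluster-placement-group',
    Strategy='cluster'
)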

For our instances to communicate with one another, they need to be aware of each other’s presence. In this post we will demonstrate a minimal environment configuration required for running data parallel training in PyTorch. For PyTorch DistributedDataParallel (DDP), each instance needs to know the IP of the master node, the master port, the total number of instances, and its serial rank among all of the nodes. The script below demonstrates the configuration of a data parallel training job using the environment variables MASTER_ADDR, MASTER_PORT, NUM_NODES, and NODE_RANK.

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def mp_fn(local_rank, *args):
    # discover topology settings
    num_nodes = int(os.environ.get('NUM_NODES', 1))
    node_rank = int(os.environ.get('NODE_RANK', 0))
    gpus_per_node = torch.cuda.device_count()
    world_size = num_nodes * gpus_per_node
    global_rank = (node_rank * gpus_per_node) + local_rank
    print(f'local rank {local_rank} '
          f'global rank {global_rank} '
          f'world size {world_size}')
    # init_process_group assumes the existence of the MASTER_ADDR
    # and MASTER_PORT environment variables
    dist.init_process_group(backend='nccl',
                            rank=global_rank,
                            world_size=world_size)
    torch.cuda.set_device(local_rank)
    # add training logic here

if __name__ == '__main__':
    mp.spawn(mp_fn,
             args=(),
             nprocs=torch.cuda.device_count())

The node rank can be retrieved from the ami-launch-index. The number of nodes and the master port are known at the time of the create_instances invocation and can be passed in as EC2 instance tags. However, the IP address of the master node is only determined once the master instance is created, and can only be communicated to the instances following the create_instances call. In the code block below, we chose to pass the master address to each of the instances using a dedicated call to the AWS Python SDK create_tags API. We use the same call to update the name tag of each instance according to its launch-index value.

The full solution for multi-node training appears below:

import boto3

region = 'us-east-1'
job_id = 'my-multinode-experiment' # replace with unique id
num_instances = 4
image_id = 'ami-0240b7264c1c9e6a9' # replace with image of choice
instance_type = 'g5.xlarge' # replace with instance of choice
instance_profile_arn = 'instance-profile-arn' # replace with profile arn
placement_group = 'cluster-placement-group' # replace with placement group

ec2 = boto3.resource('ec2', region_name=region)

script = """#!/bin/bash
# environment setup
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
INST_MD=http://169.254.169.254/latest/meta-data
CURL_FLAGS=(-s -H "X-aws-ec2-metadata-token: ${TOKEN}")
INSTANCE_ID=$(curl "${CURL_FLAGS[@]}" $INST_MD/instance-id)
REGION=$(curl "${CURL_FLAGS[@]}" $INST_MD/placement/region)
JOB_ID=$(curl "${CURL_FLAGS[@]}" $INST_MD/tags/instance/JOB_ID)
export NODE_RANK=$(curl "${CURL_FLAGS[@]}" $INST_MD/ami-launch-index)
export NUM_NODES=$(curl "${CURL_FLAGS[@]}" $INST_MD/tags/instance/NUM_NODES)
export MASTER_PORT=$(curl "${CURL_FLAGS[@]}" $INST_MD/tags/instance/MASTER_PORT)
export PATH=/opt/conda/envs/pytorch/bin:$PATH

# download and unpack code
aws s3 cp s3://my-s3-path/$JOB_ID/my-code.tar .
tar -xvf my-code.tar

# configure cloudwatch
sed -i "s/job-id/${JOB_ID}_${NODE_RANK}/g" cw_config.json
/opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:cw_config.json -s

# install dependencies
python3 -m pip install -r requirements.txt 2>&1 | tee -a output.log

# retrieve master address
# should be available, but just in case tag application is delayed...
while true; do
  export MASTER_ADDR=$(curl "${CURL_FLAGS[@]}" $INST_MD/tags/instance/MASTER_ADDR)
  if [[ $MASTER_ADDR == "<?xml"* ]]; then
    echo 'tags missing, sleep for 5 seconds' 2>&1 | tee -a output.log
    sleep 5
  else
    break
  fi
done

# run training workload
python3 train.py 2>&1 | tee -a output.log

# sync output artifacts
aws s3 sync artifacts s3://my-s3-path/$JOB_ID/artifacts

# self-destruct
aws ec2 terminate-instances --instance-ids $INSTANCE_ID --region $REGION
"""

instances = ec2.create_instances(
    MaxCount=num_instances,
    MinCount=num_instances,
    ImageId=image_id,
    InstanceType=instance_type,
    IamInstanceProfile={'Arn': instance_profile_arn},
    UserData=script,
    MetadataOptions={"InstanceMetadataTags": "enabled"},
    TagSpecifications=[{
        'ResourceType': 'instance',
        'Tags': [
            {'Key': 'NAME', 'Value': 'test-vm'},
            {'Key': 'JOB_ID', 'Value': f'{job_id}'},
            {'Key': 'MASTER_PORT', 'Value': '7777'},
            {'Key': 'NUM_NODES', 'Value': f'{num_instances}'}
        ]
    }],
    Placement={'GroupName': placement_group}
)

if num_instances > 1:

    # find master_addr
    for inst in instances:
        if inst.ami_launch_index == 0:
            master_addr = inst.network_interfaces_attribute[0]['PrivateIpAddress']
            break

    # update ec2 tags
    for inst in instances:
        res = ec2.create_tags(
            Resources=[inst.id],
            Tags=[
                {'Key': 'NAME', 'Value': f'test-vm-{inst.ami_launch_index}'},
                {'Key': 'MASTER_ADDR', 'Value': f'{master_addr}'}]
        )

A popular way of reducing training costs is to use discounted Amazon EC2 Spot Instances. Utilizing Spot instances effectively requires that you implement a way of detecting interruptions (e.g., by listening for termination notices) and taking the appropriate action (e.g., resuming incomplete workloads). Below, we show how to modify our script to use Spot instances using the InstanceMarketOptions API setting.
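
One possible detection mechanism (a sketch only, not part of the solution above) is to poll the instance metadata service for a Spot interruption notice from within the training job, for example in a background thread of train.py, and save a checkpoint when a notice appears. The spot/instance-action document exists only once AWS has scheduled an interruption; until then the request fails with a 404.

import time
import urllib.request
import urllib.error

IMDS = 'http://169.254.169.254/latest'

def imds_token():
    # request an IMDSv2 session token, as the UserData scripts above do with curl
    req = urllib.request.Request(
        f'{IMDS}/api/token', method='PUT',
        headers={'X-aws-ec2-metadata-token-ttl-seconds': '21600'})
    with urllib.request.urlopen(req, timeout=2) as resp:
        return resp.read().decode()

def interruption_pending(token):
    # returns True once a Spot interruption has been scheduled for this instance
    req = urllib.request.Request(
        f'{IMDS}/meta-data/spot/instance-action',
        headers={'X-aws-ec2-metadata-token': token})
    try:
        with urllib.request.urlopen(req, timeout=2):
            return True
    except urllib.error.URLError:
        return False

# hypothetical monitoring loop: on notice, save a checkpoint and exit gracefully
token = imds_token()
while not interruption_pending(token):
    time.sleep(5)
print('Spot interruption notice received')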

import boto3

region = 'us-east-1'
job_id = 'my-spot-experiment' # replace with unique id
num_instances = 1
image_id = 'ami-0240b7264c1c9e6a9' # replace with image of choice
instance_type = 'g5.xlarge' # replace with instance of choice
instance_profile_arn = 'instance-profile-arn' # replace with profile arn
placement_group = 'cluster-placement-group' # replace with placement group

ec2 = boto3.resource('ec2', region_name=region)

instances = ec2.create_instances(
    MaxCount=num_instances,
    MinCount=num_instances,
    ImageId=image_id,
    InstanceType=instance_type,
    IamInstanceProfile={'Arn': instance_profile_arn},
    UserData=script,
    MetadataOptions={"InstanceMetadataTags": "enabled"},
    TagSpecifications=[{
        'ResourceType': 'instance',
        'Tags': [
            {'Key': 'NAME', 'Value': 'test-vm'},
            {'Key': 'JOB_ID', 'Value': f'{job_id}'},
        ]
    }],
    InstanceMarketOptions={
        'MarketType': 'spot',
        'SpotOptions': {
            "SpotInstanceType": "one-time",
            "InstanceInterruptionBehavior": "terminate"
        }
    }
)

Please see our previous posts (e.g., here and here) for some ideas on how to implement a solution for Spot instance life-cycle management.

Managed cloud services for AI development can simplify model training and lower the entry bar for newcomers. However, there are some situations where greater control over the training process is required. In this post we have illustrated one approach to building a custom managed training environment on top of Amazon EC2. Of course, the precise details of the solution will greatly depend on the specific needs of the projects at hand.

As always, please feel free to respond to this post with comments, questions, or corrections.
