Step Functions: Orchestrating Serverless Workflows
Build complex serverless workflows with AWS Step Functions. Learn state machine patterns, error handling, parallel execution, and integration with Lambda, ECS, and other AWS services.
Step Functions orchestrates complex workflows without hand-written orchestration code. Instead of Lambda calling Lambda calling Lambda (while you manage timeouts, retries, and state yourself), you define a state machine that handles all of it. Here’s how to build production workflows.
When to Use Step Functions
✅ Multi-step processes with branching logic
✅ Long-running workflows (up to 1 year with Standard workflows)
✅ Human approval workflows
✅ Parallel processing with aggregation
✅ Retry and error handling across services
❌ Simple request-response APIs (use Lambda directly)
❌ Real-time streaming (use Kinesis)
❌ Sub-second latency requirements
Basic State Machine
Order Processing Workflow
{
"Comment": "Order processing workflow",
"StartAt": "ValidateOrder",
"States": {
"ValidateOrder": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789012:function:validate-order",
"Next": "CheckInventory",
"Catch": [
{
"ErrorEquals": ["ValidationError"],
"Next": "OrderFailed",
"ResultPath": "$.error"
}
]
},
"CheckInventory": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789012:function:check-inventory",
"Next": "ProcessPayment",
"Catch": [
{
"ErrorEquals": ["OutOfStockError"],
"Next": "NotifyBackorder"
}
]
},
"ProcessPayment": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-payment",
"Retry": [
{
"ErrorEquals": ["PaymentGatewayTimeout"],
"IntervalSeconds": 5,
"MaxAttempts": 3,
"BackoffRate": 2.0
}
],
"Catch": [
{
"ErrorEquals": ["PaymentDeclined"],
"Next": "OrderFailed",
"ResultPath": "$.error"
}
],
"Next": "FulfillOrder"
},
"FulfillOrder": {
"Type": "Parallel",
"Branches": [
{
"StartAt": "UpdateInventory",
"States": {
"UpdateInventory": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789012:function:update-inventory",
"End": true
}
}
},
{
"StartAt": "SendConfirmation",
"States": {
"SendConfirmation": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789012:function:send-confirmation",
"End": true
}
}
},
{
"StartAt": "InitiateShipping",
"States": {
"InitiateShipping": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789012:function:initiate-shipping",
"End": true
}
}
}
],
"Next": "OrderComplete"
},
"NotifyBackorder": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789012:function:notify-backorder",
"Next": "WaitForInventory"
},
"WaitForInventory": {
"Type": "Wait",
"Seconds": 3600,
"Next": "CheckInventory"
},
"OrderComplete": {
"Type": "Succeed"
},
"OrderFailed": {
"Type": "Fail",
"Error": "OrderProcessingFailed",
"Cause": "Order could not be processed"
}
}
}
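The Catch blocks above match on error names. For a Python Lambda, the error name Step Functions sees is the exception class name, so raising a class called ValidationError is enough to route the execution to OrderFailed. The article does not show the function code, so the following is only a minimal sketch; the payload fields (order_id, items, quantity) are assumptions made for illustration.

```python
class ValidationError(Exception):
    """Raised for orders that must not proceed; Catch matches on this class name."""


def validate_order(event, context=None):
    """Hypothetical validate-order handler: returns the order if it looks valid."""
    if not event.get("order_id"):
        raise ValidationError("order_id is required")
    items = event.get("items") or []
    if not items:
        raise ValidationError("order must contain at least one item")
    if any(item.get("quantity", 0) <= 0 for item in items):
        raise ValidationError("every item needs a positive quantity")
    # ValidateOrder sets no ResultPath, so this return value becomes
    # the input of the next state (CheckInventory).
    return event
```

Errors raised under any other class name would not match the ValidationError catcher and would fail the execution unless another catcher handles them.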
Terraform Deployment
resource "aws_sfn_state_machine" "order_processing" {
name = "order-processing"
role_arn = aws_iam_role.step_functions.arn
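# The template file should reference these variables (for example "${validate_order_arn}") instead of hardcoded ARNs
definition = templatefile("${path.module}/state-machines/order-processing.json", {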
validate_order_arn = aws_lambda_function.validate_order.arn
check_inventory_arn = aws_lambda_function.check_inventory.arn
process_payment_arn = aws_lambda_function.process_payment.arn
update_inventory_arn = aws_lambda_function.update_inventory.arn
send_confirmation_arn = aws_lambda_function.send_confirmation.arn
initiate_shipping_arn = aws_lambda_function.initiate_shipping.arn
notify_backorder_arn = aws_lambda_function.notify_backorder.arn
})
logging_configuration {
log_destination = "${aws_cloudwatch_log_group.sfn.arn}:*"
include_execution_data = true
level = "ALL"
}
tracing_configuration {
enabled = true
}
tags = var.tags
}
resource "aws_iam_role" "step_functions" {
name = "step-functions-role"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = {
Service = "states.amazonaws.com"
}
}]
})
}
resource "aws_iam_role_policy" "step_functions" {
name = "step-functions-policy"
role = aws_iam_role.step_functions.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Action = "lambda:InvokeFunction"
Resource = [
aws_lambda_function.validate_order.arn,
aws_lambda_function.check_inventory.arn,
aws_lambda_function.process_payment.arn,
aws_lambda_function.update_inventory.arn,
aws_lambda_function.send_confirmation.arn,
aws_lambda_function.initiate_shipping.arn,
aws_lambda_function.notify_backorder.arn,
]
},
{
Effect = "Allow"
Action = [
"logs:CreateLogDelivery",
"logs:GetLogDelivery",
"logs:UpdateLogDelivery",
"logs:DeleteLogDelivery",
"logs:ListLogDeliveries",
"logs:PutResourcePolicy",
"logs:DescribeResourcePolicies",
"logs:DescribeLogGroups"
]
Resource = "*"
},
{
Effect = "Allow"
Action = [
"xray:PutTraceSegments",
"xray:PutTelemetryRecords"
]
Resource = "*"
}
]
})
}
Express Workflows
For high-volume, short-duration workflows (under 5 minutes):
resource "aws_sfn_state_machine" "data_transform" {
name = "data-transform"
type = "EXPRESS" # High volume, short duration; can be invoked synchronously or asynchronously
role_arn = aws_iam_role.step_functions.arn
definition = jsonencode({
StartAt = "Transform"
States = {
Transform = {
Type = "Task"
Resource = aws_lambda_function.transform.arn
Next = "Validate"
}
Validate = {
Type = "Task"
Resource = aws_lambda_function.validate.arn
Next = "Store"
}
Store = {
Type = "Task"
Resource = aws_lambda_function.store.arn
End = true
}
}
})
tags = var.tags
}
# Pass the Express state machine ARN to the Lambda that invokes it synchronously
resource "aws_lambda_function" "api_handler" {
# ... lambda config ...
environment {
variables = {
EXPRESS_STATE_MACHINE_ARN = aws_sfn_state_machine.data_transform.arn
}
}
}
Express vs Standard
Standard Workflows:
- Duration: up to 1 year
- Execution: exactly-once
- Pricing: per state transition ($0.025/1000)
- Use for: long-running, audit-required workflows
Express Workflows:
- Duration: up to 5 minutes
- Execution: at-least-once (async) or at-most-once (sync)
- Pricing: per execution + duration ($1.00/million + $0.00001667/GB-sec)
- Use for: high-volume, short-duration processing
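To make the pricing difference concrete, here is a rough cost comparison using only the rates listed above. It is a sketch under simplifying assumptions: it ignores the free tier, tiered duration pricing, and regional differences, and the workload figures (execution count, transitions, duration, memory) are invented for illustration.

```python
STANDARD_PER_TRANSITION = 0.025 / 1000    # USD per state transition
EXPRESS_PER_REQUEST = 1.00 / 1_000_000    # USD per execution
EXPRESS_PER_GB_SECOND = 0.00001667        # USD per GB-second of duration


def standard_cost(executions: int, transitions_per_execution: int) -> float:
    """Standard workflows are billed per state transition."""
    return executions * transitions_per_execution * STANDARD_PER_TRANSITION


def express_cost(executions: int, avg_duration_seconds: float, memory_gb: float) -> float:
    """Express workflows are billed per execution plus duration times memory."""
    request_cost = executions * EXPRESS_PER_REQUEST
    duration_cost = executions * avg_duration_seconds * memory_gb * EXPRESS_PER_GB_SECOND
    return request_cost + duration_cost


# Assumed workload: 1M executions, 5 transitions each, 1 second at 64 MB.
print(round(standard_cost(1_000_000, 5), 2))             # about 125 USD
print(round(express_cost(1_000_000, 1.0, 0.0625), 2))    # about 2 USD
```

The gap narrows as executions get longer or use more memory, so it is worth rerunning the numbers with your own workload before choosing a type.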
Advanced Patterns
Map State for Batch Processing
{
"StartAt": "GetBatch",
"States": {
"GetBatch": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789012:function:get-batch",
"ResultPath": "$.items",
"Next": "ProcessBatch"
},
"ProcessBatch": {
"Type": "Map",
"ItemsPath": "$.items",
"MaxConcurrency": 10,
"ItemProcessor": {
"ProcessorConfig": {
"Mode": "DISTRIBUTED",
"ExecutionType": "EXPRESS"
},
"StartAt": "ProcessItem",
"States": {
"ProcessItem": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-item",
"End": true
}
}
},
"ResultPath": "$.results",
"Next": "Aggregate"
},
"Aggregate": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789012:function:aggregate-results",
"End": true
}
}
}
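If the data flow through ProcessBatch is hard to picture, a local analogue may help: read the array at $.items, run a worker per item with a concurrency cap, and attach the collected outputs at $.results. This is only an illustration and not how Step Functions runs the state; process_item and its fields (id, price, quantity) are hypothetical stand-ins for the process-item Lambda.

```python
from concurrent.futures import ThreadPoolExecutor


def process_item(item: dict) -> dict:
    """Stand-in for the process-item Lambda (hypothetical logic)."""
    return {"id": item["id"], "total": item["price"] * item["quantity"]}


def run_map_state(state_input: dict, max_concurrency: int = 10) -> dict:
    """Local analogue of ProcessBatch: ItemsPath=$.items, ResultPath=$.results."""
    items = state_input["items"]
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # Executor.map yields results in the same order as the input items.
        results = list(pool.map(process_item, items))
    # ResultPath=$.results keeps the original input and adds the results next to it.
    return {**state_input, "results": results}


batch = {"items": [{"id": 1, "price": 2.0, "quantity": 3},
                   {"id": 2, "price": 1.5, "quantity": 2}]}
print(run_map_state(batch)["results"])
```

The Aggregate state then receives both the original items and the results, because ResultPath adds to the input rather than replacing it.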
Human Approval Workflow
{
"StartAt": "SubmitRequest",
"States": {
"SubmitRequest": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789012:function:submit-request",
"Next": "WaitForApproval"
},
"WaitForApproval": {
"Type": "Task",
"Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
"Parameters": {
"QueueUrl": "https://sqs.us-east-1.amazonaws.com/123456789012/approval-queue",
"MessageBody": {
"taskToken.$": "$$.Task.Token",
"request.$": "$.request",
"approverEmail.$": "$.approverEmail"
}
},
"TimeoutSeconds": 86400,
"Next": "CheckDecision",
"Catch": [
{
"ErrorEquals": ["States.Timeout"],
"Next": "RequestExpired"
}
]
},
"CheckDecision": {
"Type": "Choice",
"Choices": [
{
"Variable": "$.approved",
"BooleanEquals": true,
"Next": "ProcessApproved"
}
],
"Default": "RequestDenied"
},
"ProcessApproved": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-approved",
"End": true
},
"RequestDenied": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789012:function:notify-denied",
"End": true
},
"RequestExpired": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789012:function:notify-expired",
"End": true
}
}
}
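Something has to consume the message that WaitForApproval puts on the queue and hand the task token to the approver. One way to do that, sketched here under assumptions, is a worker that turns the message into approve and reject links. The base URL is a placeholder for whatever endpoint fronts the callback Lambda, and the "reject" decision value is an assumption; only the message fields come from the state machine above.

```python
import json
from urllib.parse import parse_qs, urlencode, urlparse


def build_approval_links(message_body: str, base_url: str) -> dict:
    """Turn the SQS message sent by WaitForApproval into approve/reject URLs."""
    message = json.loads(message_body)
    token = message["taskToken"]
    links = {}
    for decision in ("approve", "reject"):
        # urlencode escapes characters such as '+', '/' and '=' that a
        # token may contain, so it survives the round trip through a URL.
        query = urlencode({"token": token, "decision": decision})
        links[decision] = f"{base_url}?{query}"
    return links


# Example with a made-up token and a placeholder endpoint.
body = json.dumps({"taskToken": "AAA+bb/cc==", "request": {"id": 1},
                   "approverEmail": "approver@example.com"})
links = build_approval_links(body, "https://example.com/approvals")
print(parse_qs(urlparse(links["approve"]).query)["token"])  # the original token
```

Because anyone holding the link can complete the task, treat these URLs as secrets and consider adding authentication in front of the callback endpoint.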
Callback Lambda for Approval
import json

import boto3

sfn = boto3.client('stepfunctions')


def approve_handler(event, context):
    """Called when the approver clicks an approval or rejection link."""
    task_token = event['queryStringParameters']['token']
    decision = event['queryStringParameters']['decision']

    # A rejection is still a successful task result: the CheckDecision
    # Choice state branches on the "approved" flag in the output.
    if decision == 'approve':
        output = {'approved': True}
    else:
        output = {'approved': False, 'reason': 'Rejected by approver'}

    try:
        sfn.send_task_success(
            taskToken=task_token,
            output=json.dumps(output)
        )
        return {
            'statusCode': 200,
            'body': f'Decision recorded: {decision}'
        }
    except sfn.exceptions.TaskTimedOut:
        return {
            'statusCode': 400,
            'body': 'Request has expired'
        }
    except sfn.exceptions.InvalidToken:
        return {
            'statusCode': 400,
            'body': 'Invalid or already used token'
        }
SDK Integrations (No Lambda Needed)
{
"StartAt": "QueryDynamoDB",
"States": {
"QueryDynamoDB": {
"Type": "Task",
"Resource": "arn:aws:states:::dynamodb:getItem",
"Parameters": {
"TableName": "orders",
"Key": {
"pk": {"S.$": "$.orderId"}
}
},
"ResultPath": "$.order",
"Next": "SendNotification"
},
"SendNotification": {
"Type": "Task",
"Resource": "arn:aws:states:::sns:publish",
"Parameters": {
"TopicArn": "arn:aws:sns:us-east-1:123456789012:notifications",
"Message": {
"orderId.$": "$.orderId",
"status.$": "$.order.Item.status.S"
}
},
"Next": "UpdateStatus"
},
"UpdateStatus": {
"Type": "Task",
"Resource": "arn:aws:states:::dynamodb:updateItem",
"Parameters": {
"TableName": "orders",
"Key": {
"pk": {"S.$": "$.orderId"}
},
"UpdateExpression": "SET notified = :true",
"ExpressionAttributeValues": {
":true": {"BOOL": true}
}
},
"End": true
}
}
}
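The reference to $.order.Item.status.S in SendNotification only works because QueryDynamoDB sets ResultPath to $.order, which places the getItem response inside the state input instead of replacing it. The following is a rough model of that behavior for simple dotted paths. It is an approximation for illustration, not the service's implementation, and the sample result is trimmed to the fields used here.

```python
import copy


def apply_result_path(state_input: dict, result, result_path):
    """Approximate how ResultPath combines a task result with the state input."""
    if result_path is None:        # "ResultPath": null discards the result
        return state_input
    if result_path == "$":         # default: the result replaces the input
        return result
    if not result_path.startswith("$."):
        raise ValueError("only simple paths such as $.a.b are supported here")
    output = copy.deepcopy(state_input)
    keys = result_path[2:].split(".")
    node = output
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = result
    return output


get_item_result = {"Item": {"status": {"S": "SHIPPED"}}}
state = apply_result_path({"orderId": "o-1"}, get_item_result, "$.order")
print(state["orderId"], state["order"]["Item"]["status"]["S"])
```

Without the ResultPath, the getItem response would replace the input and $.orderId would no longer be available to the later states.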
Error Handling
Comprehensive Retry and Catch
{
"ProcessPayment": {
"Type": "Task",
"Resource": "arn:aws:lambda:...:process-payment",
"Retry": [
{
"ErrorEquals": ["Lambda.ServiceException", "Lambda.AWSLambdaException"],
"IntervalSeconds": 2,
"MaxAttempts": 6,
"BackoffRate": 2
},
{
"ErrorEquals": ["PaymentGatewayTimeout", "NetworkError"],
"IntervalSeconds": 5,
"MaxAttempts": 3,
"BackoffRate": 1.5
}
],
"Catch": [
{
"ErrorEquals": ["PaymentDeclined"],
"Next": "HandleDeclinedPayment",
"ResultPath": "$.error"
},
{
"ErrorEquals": ["InsufficientFunds"],
"Next": "RequestAlternatePayment",
"ResultPath": "$.error"
},
{
"ErrorEquals": ["States.ALL"],
"Next": "HandleUnexpectedError",
"ResultPath": "$.error"
}
],
"Next": "PaymentSuccessful"
}
}
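It helps to work out how long these retry policies can actually hold an execution. Each retry waits IntervalSeconds multiplied by BackoffRate raised to the number of retries already made, and MaxAttempts counts retries rather than the first attempt. A small sketch of that arithmetic follows; it ignores optional settings such as a maximum delay or jitter, and the time each attempt itself takes.

```python
def retry_delays(interval_seconds: float, max_attempts: int, backoff_rate: float) -> list:
    """Wait before each retry: interval * backoff_rate ** n for n = 0 .. max_attempts - 1."""
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts)]


lambda_delays = retry_delays(2, 6, 2)       # Lambda service errors
gateway_delays = retry_delays(5, 3, 1.5)    # PaymentGatewayTimeout / NetworkError

print(lambda_delays, sum(lambda_delays))    # [2, 4, 8, 16, 32, 64] 126
print(gateway_delays, sum(gateway_delays))  # [5.0, 7.5, 11.25] 23.75
```

So the first policy can add just over two minutes of waiting before the Catch blocks are consulted, and the second about 24 seconds.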
Starting Workflows
From Lambda
import json
import os
import time

import boto3

sfn = boto3.client('stepfunctions')


def start_workflow(order_data: dict) -> str:
    """Start the order processing workflow (Standard, asynchronous)."""
    response = sfn.start_execution(
        stateMachineArn=os.environ['STATE_MACHINE_ARN'],
        # The timestamp keeps repeated starts for the same order from
        # colliding on the execution name.
        name=f"order-{order_data['order_id']}-{int(time.time())}",
        input=json.dumps(order_data)
    )
    return response['executionArn']


def start_sync_workflow(data: dict) -> dict:
    """Start the Express workflow and wait for its result."""
    response = sfn.start_sync_execution(
        stateMachineArn=os.environ['EXPRESS_STATE_MACHINE_ARN'],
        input=json.dumps(data)
    )
    if response['status'] == 'SUCCEEDED':
        return json.loads(response['output'])
    # 'error' and 'cause' are only present when the execution did not succeed.
    raise RuntimeError(
        f"Workflow {response['status']}: {response.get('error')} - {response.get('cause')}"
    )
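Execution names are limited to 80 characters and reject whitespace and a number of special characters, so building the name directly from an order ID can fail for unusual IDs. Below is a hedged sketch of a name builder. The helper and its conservative character set (letters, digits, hyphens, underscores) are assumptions for illustration, and stricter than the service requires.

```python
import re
import time

MAX_NAME_LENGTH = 80  # Step Functions limit for execution names


def execution_name(order_id, now=None) -> str:
    """Build an execution name like order-<id>-<unix time> that fits the limit."""
    timestamp = str(int(time.time() if now is None else now))
    # Replace anything outside a conservative safe set with a hyphen.
    safe_id = re.sub(r"[^A-Za-z0-9_-]", "-", str(order_id))
    prefix = "order-"
    # Truncate the ID rather than the timestamp so the suffix stays intact.
    room = MAX_NAME_LENGTH - len(prefix) - len(timestamp) - 1
    return f"{prefix}{safe_id[:room]}-{timestamp}"


print(execution_name("A 12/3", now=1700000000))  # order-A-12-3-1700000000
```

Note that two starts for the same order within the same second still produce the same name, which may or may not be what you want.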
From EventBridge
resource "aws_cloudwatch_event_rule" "order_created" {
name = "order-created"
event_pattern = jsonencode({
source = ["myapp.orders"]
detail-type = ["Order Created"]
})
}
resource "aws_cloudwatch_event_target" "step_functions" {
rule = aws_cloudwatch_event_rule.order_created.name
arn = aws_sfn_state_machine.order_processing.arn
role_arn = aws_iam_role.eventbridge.arn
input_transformer {
input_paths = {
orderId = "$.detail.orderId"
items = "$.detail.items"
}
input_template = <<EOF
{
"orderId": <orderId>,
"items": <items>,
"source": "eventbridge"
}
EOF
}
}
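For the rule above to fire, the producer has to publish an event whose source and detail-type match the pattern, with orderId and items in the detail so the input transformer can pick them up. Here is a minimal sketch of building such an entry. The function name and the default bus are assumptions; the field values come from the rule above.

```python
import json


def order_created_entry(order_id: str, items: list, bus_name: str = "default") -> dict:
    """Build a PutEvents entry that matches the order-created rule."""
    return {
        "Source": "myapp.orders",          # matches source in the event pattern
        "DetailType": "Order Created",     # matches detail-type in the event pattern
        # Detail must be a JSON string; the input transformer reads
        # $.detail.orderId and $.detail.items from it.
        "Detail": json.dumps({"orderId": order_id, "items": items}),
        "EventBusName": bus_name,
    }


entry = order_created_entry("o-1", [{"sku": "a", "quantity": 2}])
print(entry["DetailType"], json.loads(entry["Detail"])["orderId"])
```

The entry can then be sent with boto3, for example boto3.client("events").put_events(Entries=[entry]).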
Monitoring
CloudWatch Alarms
resource "aws_cloudwatch_metric_alarm" "execution_failed" {
alarm_name = "step-functions-failed"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 1
metric_name = "ExecutionsFailed"
namespace = "AWS/States"
period = 300
statistic = "Sum"
threshold = 0
alarm_description = "Step Functions execution failed"
dimensions = {
StateMachineArn = aws_sfn_state_machine.order_processing.arn
}
alarm_actions = [aws_sns_topic.alerts.arn]
}
resource "aws_cloudwatch_metric_alarm" "execution_throttled" {
alarm_name = "step-functions-throttled"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 1
metric_name = "ExecutionThrottled"
namespace = "AWS/States"
period = 300
statistic = "Sum"
threshold = 0
dimensions = {
StateMachineArn = aws_sfn_state_machine.order_processing.arn
}
alarm_actions = [aws_sns_topic.alerts.arn]
}
Key Takeaways
- Use SDK integrations when possible — skip Lambda for DynamoDB, SNS, SQS
- Express for high volume — cheaper and faster for short workflows
- Map state for parallelism — process batches concurrently
- Callbacks for human workflows — wait for external input without polling
- Comprehensive error handling — use Retry + Catch for resilience
“Step Functions turns spaghetti orchestration code into a visual diagram you can actually debug.”