Step Functions orchestrate complex workflows without writing orchestration code. Instead of Lambda calling Lambda calling Lambda (and dealing with timeouts, retries, and state), you define a state machine that handles it all. Here’s how to build production workflows.

When to Use Step Functions

✅ Multi-step processes with branching logic
✅ Long-running workflows (up to 1 year)
✅ Human approval workflows
✅ Parallel processing with aggregation
✅ Retry and error handling across services

❌ Simple request-response APIs (use Lambda directly)
❌ Real-time streaming (use Kinesis)
❌ Sub-second latency requirements

Basic State Machine

Order Processing Workflow

{
  "Comment": "Order processing workflow",
  "StartAt": "ValidateOrder",
  "States": {
    "ValidateOrder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:validate-order",
      "Next": "CheckInventory",
      "Catch": [
        {
          "ErrorEquals": ["ValidationError"],
          "Next": "OrderFailed",
          "ResultPath": "$.error"
        }
      ]
    },
    "CheckInventory": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:check-inventory",
      "Next": "ProcessPayment",
      "Catch": [
        {
          "ErrorEquals": ["OutOfStockError"],
          "Next": "NotifyBackorder"
        }
      ]
    },
    "ProcessPayment": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-payment",
      "Retry": [
        {
          "ErrorEquals": ["PaymentGatewayTimeout"],
          "IntervalSeconds": 5,
          "MaxAttempts": 3,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["PaymentDeclined"],
          "Next": "OrderFailed",
          "ResultPath": "$.error"
        }
      ],
      "Next": "FulfillOrder"
    },
    "FulfillOrder": {
      "Type": "Parallel",
      "Branches": [
        {
          "StartAt": "UpdateInventory",
          "States": {
            "UpdateInventory": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-east-1:123456789012:function:update-inventory",
              "End": true
            }
          }
        },
        {
          "StartAt": "SendConfirmation",
          "States": {
            "SendConfirmation": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-east-1:123456789012:function:send-confirmation",
              "End": true
            }
          }
        },
        {
          "StartAt": "InitiateShipping",
          "States": {
            "InitiateShipping": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-east-1:123456789012:function:initiate-shipping",
              "End": true
            }
          }
        }
      ],
      "Next": "OrderComplete"
    },
    "NotifyBackorder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:notify-backorder",
      "Next": "WaitForInventory"
    },
    "WaitForInventory": {
      "Type": "Wait",
      "Seconds": 3600,
      "Next": "CheckInventory"
    },
    "OrderComplete": {
      "Type": "Succeed"
    },
    "OrderFailed": {
      "Type": "Fail",
      "Error": "OrderProcessingFailed",
      "Cause": "Order could not be processed"
    }
  }
}

Terraform Deployment

resource "aws_sfn_state_machine" "order_processing" {
  name     = "order-processing"
  role_arn = aws_iam_role.step_functions.arn

  definition = templatefile("${path.module}/state-machines/order-processing.json", {
    validate_order_arn    = aws_lambda_function.validate_order.arn
    check_inventory_arn   = aws_lambda_function.check_inventory.arn
    process_payment_arn   = aws_lambda_function.process_payment.arn
    update_inventory_arn  = aws_lambda_function.update_inventory.arn
    send_confirmation_arn = aws_lambda_function.send_confirmation.arn
    initiate_shipping_arn = aws_lambda_function.initiate_shipping.arn
    notify_backorder_arn  = aws_lambda_function.notify_backorder.arn
  })

  logging_configuration {
    log_destination        = "${aws_cloudwatch_log_group.sfn.arn}:*"
    include_execution_data = true
    level                  = "ALL"
  }

  tracing_configuration {
    enabled = true
  }

  tags = var.tags
}

resource "aws_iam_role" "step_functions" {
  name = "step-functions-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action = "sts:AssumeRole"
      Effect = "Allow"
      Principal = {
        Service = "states.amazonaws.com"
      }
    }]
  })
}

resource "aws_iam_role_policy" "step_functions" {
  name = "step-functions-policy"
  role = aws_iam_role.step_functions.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = "lambda:InvokeFunction"
        Resource = [
          aws_lambda_function.validate_order.arn,
          aws_lambda_function.check_inventory.arn,
          aws_lambda_function.process_payment.arn,
          aws_lambda_function.update_inventory.arn,
          aws_lambda_function.send_confirmation.arn,
          aws_lambda_function.initiate_shipping.arn,
          aws_lambda_function.notify_backorder.arn,
        ]
      },
      {
        Effect = "Allow"
        Action = [
          "logs:CreateLogDelivery",
          "logs:GetLogDelivery",
          "logs:UpdateLogDelivery",
          "logs:DeleteLogDelivery",
          "logs:ListLogDeliveries",
          "logs:PutResourcePolicy",
          "logs:DescribeResourcePolicies",
          "logs:DescribeLogGroups"
        ]
        Resource = "*"
      },
      {
        Effect = "Allow"
        Action = [
          "xray:PutTraceSegments",
          "xray:PutTelemetryRecords"
        ]
        Resource = "*"
      }
    ]
  })
}

Express Workflows

For high-volume, short-duration workflows (under 5 minutes):

resource "aws_sfn_state_machine" "data_transform" {
  name = "data-transform"
  type = "EXPRESS"  # High volume, synchronous
  role_arn = aws_iam_role.step_functions.arn

  definition = jsonencode({
    StartAt = "Transform"
    States = {
      Transform = {
        Type     = "Task"
        Resource = aws_lambda_function.transform.arn
        Next     = "Validate"
      }
      Validate = {
        Type     = "Task"
        Resource = aws_lambda_function.validate.arn
        Next     = "Store"
      }
      Store = {
        Type     = "Task"
        Resource = aws_lambda_function.store.arn
        End      = true
      }
    }
  })

  tags = var.tags
}

# Invoke Express workflow synchronously
resource "aws_lambda_function" "api_handler" {
  # ... lambda config ...

  environment {
    variables = {
      STATE_MACHINE_ARN = aws_sfn_state_machine.data_transform.arn
    }
  }
}

Express vs Standard

Standard Workflows:
- Duration: up to 1 year
- Execution: exactly-once
- Pricing: per state transition ($0.025/1000)
- Use for: long-running, audit-required workflows

Express Workflows:
- Duration: up to 5 minutes
- Execution: at-least-once (sync) or at-most-once (async)
- Pricing: per execution + duration ($1.00/million + $0.00001667/GB-sec)
- Use for: high-volume, short-duration processing

Advanced Patterns

Map State for Batch Processing

{
  "StartAt": "GetBatch",
  "States": {
    "GetBatch": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:get-batch",
      "ResultPath": "$.items",
      "Next": "ProcessBatch"
    },
    "ProcessBatch": {
      "Type": "Map",
      "ItemsPath": "$.items",
      "MaxConcurrency": 10,
      "ItemProcessor": {
        "ProcessorConfig": {
          "Mode": "DISTRIBUTED",
          "ExecutionType": "EXPRESS"
        },
        "StartAt": "ProcessItem",
        "States": {
          "ProcessItem": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-item",
            "End": true
          }
        }
      },
      "ResultPath": "$.results",
      "Next": "Aggregate"
    },
    "Aggregate": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:aggregate-results",
      "End": true
    }
  }
}

Human Approval Workflow

{
  "StartAt": "SubmitRequest",
  "States": {
    "SubmitRequest": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:submit-request",
      "Next": "WaitForApproval"
    },
    "WaitForApproval": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
      "Parameters": {
        "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123456789012/approval-queue",
        "MessageBody": {
          "taskToken.$": "$$.Task.Token",
          "request.$": "$.request",
          "approverEmail.$": "$.approverEmail"
        }
      },
      "TimeoutSeconds": 86400,
      "Next": "CheckDecision",
      "Catch": [
        {
          "ErrorEquals": ["States.Timeout"],
          "Next": "RequestExpired"
        }
      ]
    },
    "CheckDecision": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.approved",
          "BooleanEquals": true,
          "Next": "ProcessApproved"
        }
      ],
      "Default": "RequestDenied"
    },
    "ProcessApproved": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-approved",
      "End": true
    },
    "RequestDenied": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:notify-denied",
      "End": true
    },
    "RequestExpired": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:notify-expired",
      "End": true
    }
  }
}

Callback Lambda for Approval

import boto3
import os

sfn = boto3.client('stepfunctions')

def approve_handler(event, context):
    """Called when approver clicks approval link."""
    task_token = event['queryStringParameters']['token']
    decision = event['queryStringParameters']['decision']

    try:
        if decision == 'approve':
            sfn.send_task_success(
                taskToken=task_token,
                output='{"approved": true}'
            )
        else:
            sfn.send_task_success(
                taskToken=task_token,
                output='{"approved": false, "reason": "Rejected by approver"}'
            )

        return {
            'statusCode': 200,
            'body': f'Request {decision}d successfully'
        }
    except sfn.exceptions.TaskTimedOut:
        return {
            'statusCode': 400,
            'body': 'Request has expired'
        }
    except sfn.exceptions.InvalidToken:
        return {
            'statusCode': 400,
            'body': 'Invalid or already used token'
        }

SDK Integrations (No Lambda Needed)

{
  "StartAt": "QueryDynamoDB",
  "States": {
    "QueryDynamoDB": {
      "Type": "Task",
      "Resource": "arn:aws:states:::dynamodb:getItem",
      "Parameters": {
        "TableName": "orders",
        "Key": {
          "pk": {"S.$": "$.orderId"}
        }
      },
      "ResultPath": "$.order",
      "Next": "SendNotification"
    },
    "SendNotification": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "arn:aws:sns:us-east-1:123456789012:notifications",
        "Message": {
          "orderId.$": "$.orderId",
          "status.$": "$.order.Item.status.S"
        }
      },
      "Next": "UpdateStatus"
    },
    "UpdateStatus": {
      "Type": "Task",
      "Resource": "arn:aws:states:::dynamodb:updateItem",
      "Parameters": {
        "TableName": "orders",
        "Key": {
          "pk": {"S.$": "$.orderId"}
        },
        "UpdateExpression": "SET notified = :true",
        "ExpressionAttributeValues": {
          ":true": {"BOOL": true}
        }
      },
      "End": true
    }
  }
}

Error Handling

Comprehensive Retry and Catch

{
  "ProcessPayment": {
    "Type": "Task",
    "Resource": "arn:aws:lambda:...:process-payment",
    "Retry": [
      {
        "ErrorEquals": ["Lambda.ServiceException", "Lambda.AWSLambdaException"],
        "IntervalSeconds": 2,
        "MaxAttempts": 6,
        "BackoffRate": 2
      },
      {
        "ErrorEquals": ["PaymentGatewayTimeout", "NetworkError"],
        "IntervalSeconds": 5,
        "MaxAttempts": 3,
        "BackoffRate": 1.5
      }
    ],
    "Catch": [
      {
        "ErrorEquals": ["PaymentDeclined"],
        "Next": "HandleDeclinedPayment",
        "ResultPath": "$.error"
      },
      {
        "ErrorEquals": ["InsufficientFunds"],
        "Next": "RequestAlternatePayment",
        "ResultPath": "$.error"
      },
      {
        "ErrorEquals": ["States.ALL"],
        "Next": "HandleUnexpectedError",
        "ResultPath": "$.error"
      }
    ],
    "Next": "PaymentSuccessful"
  }
}

Starting Workflows

From Lambda

import boto3
import json

sfn = boto3.client('stepfunctions')

def start_workflow(order_data: dict) -> str:
    """Start order processing workflow."""
    response = sfn.start_execution(
        stateMachineArn=os.environ['STATE_MACHINE_ARN'],
        name=f"order-{order_data['order_id']}-{int(time.time())}",
        input=json.dumps(order_data)
    )
    return response['executionArn']


def start_sync_workflow(data: dict) -> dict:
    """Start Express workflow synchronously."""
    response = sfn.start_sync_execution(
        stateMachineArn=os.environ['EXPRESS_STATE_MACHINE_ARN'],
        input=json.dumps(data)
    )

    if response['status'] == 'SUCCEEDED':
        return json.loads(response['output'])
    else:
        raise Exception(f"Workflow failed: {response['error']}")

From EventBridge

resource "aws_cloudwatch_event_rule" "order_created" {
  name = "order-created"

  event_pattern = jsonencode({
    source      = ["myapp.orders"]
    detail-type = ["Order Created"]
  })
}

resource "aws_cloudwatch_event_target" "step_functions" {
  rule     = aws_cloudwatch_event_rule.order_created.name
  arn      = aws_sfn_state_machine.order_processing.arn
  role_arn = aws_iam_role.eventbridge.arn

  input_transformer {
    input_paths = {
      orderId = "$.detail.orderId"
      items   = "$.detail.items"
    }
    input_template = <<EOF
{
  "orderId": <orderId>,
  "items": <items>,
  "source": "eventbridge"
}
EOF
  }
}

Monitoring

CloudWatch Alarms

resource "aws_cloudwatch_metric_alarm" "execution_failed" {
  alarm_name          = "step-functions-failed"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "ExecutionsFailed"
  namespace           = "AWS/States"
  period              = 300
  statistic           = "Sum"
  threshold           = 0
  alarm_description   = "Step Functions execution failed"

  dimensions = {
    StateMachineArn = aws_sfn_state_machine.order_processing.arn
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}

resource "aws_cloudwatch_metric_alarm" "execution_throttled" {
  alarm_name          = "step-functions-throttled"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "ExecutionThrottled"
  namespace           = "AWS/States"
  period              = 300
  statistic           = "Sum"
  threshold           = 0

  dimensions = {
    StateMachineArn = aws_sfn_state_machine.order_processing.arn
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}

Key Takeaways

  1. Use SDK integrations when possible — skip Lambda for DynamoDB, SNS, SQS
  2. Express for high volume — cheaper and faster for short workflows
  3. Map state for parallelism — process batches concurrently
  4. Callbacks for human workflows — wait for external input without polling
  5. Comprehensive error handling — use Retry + Catch for resilience

“Step Functions turns spaghetti orchestration code into a visual diagram you can actually debug.”