AWS Model Development


Amazon Nova Forge

Custom Frontier Models

Build optimized variants of Nova called 'Novellas' by blending proprietary data with Nova's frontier capabilities. Start from early checkpoints across pre-training, mid-training, and post-training phases for maximum customization control.

Similar Technologies
Early checkpoint access · Data blending · Novella outputs · Phase selection
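The data-blending idea behind Novellas can be illustrated with a toy sampler that mixes proprietary examples into each batch at a fixed ratio. This is a minimal sketch with made-up function and parameter names (`blend_batches`, `blend_ratio`), not the Nova Forge API:

```python
import random

def blend_batches(proprietary, curated, blend_ratio=0.3, batch_size=4, seed=0):
    """Build one training batch that mixes proprietary and Nova-curated
    examples at a fixed ratio (hypothetical sketch, not the Forge API)."""
    rng = random.Random(seed)
    n_prop = max(1, round(batch_size * blend_ratio))  # proprietary share
    batch = [*rng.sample(proprietary, n_prop),
             *rng.sample(curated, batch_size - n_prop)]
    rng.shuffle(batch)
    return batch

batch = blend_batches(["p1", "p2", "p3"], ["c1", "c2", "c3", "c4"])
```

In practice the blend ratio would likely vary by training phase (heavier proprietary weighting in pre-training, lighter in post-training).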
Reinforcement Fine-Tuning (RFT)

Execute Reinforcement Fine-Tuning with reward functions in your own environment. Train models with custom reward signals to optimize for specific behaviors, outcomes, or quality metrics unique to your use case.

Similar Technologies
Custom rewards · Environment integration · Behavior optimization · Quality targeting
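The shape of a custom reward signal can be sketched as a plain Python function scoring each completion. The policy here (keyword presence plus a verbosity penalty) and the name `reward_fn` are illustrative assumptions, not an RFT interface defined by Nova Forge:

```python
def reward_fn(prompt: str, completion: str) -> float:
    """Toy reward: +1 if the completion addresses the required topic,
    minus a penalty for excessive length. Illustrates the shape of a
    custom reward signal, not Nova Forge's actual API."""
    score = 1.0 if "refund" in completion.lower() else 0.0
    score -= 0.001 * max(0, len(completion) - 200)  # penalize verbosity
    return score

# A terse, on-topic reply scores higher than a verbose, off-topic one.
good = reward_fn("Customer asks about returns",
                 "We will issue a refund within 5 days.")
bad = reward_fn("Customer asks about returns",
                "Thanks for reaching out! " * 20)
```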
Responsible AI Toolkit

Implement custom safety guardrails using Nova Forge's built-in responsible AI toolkit. Define and enforce safety policies, content moderation rules, and alignment constraints specific to your deployment context.

Similar Technologies
Custom guardrails · Safety policies · Content moderation · Alignment
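A custom guardrail reduces, at its simplest, to a policy check applied before a response is returned. The sketch below uses regex patterns for a made-up PII policy; the toolkit's real policy language and enforcement hooks would differ:

```python
import re

# Hypothetical policy: block responses mentioning SSNs.
BLOCKED_PATTERNS = [re.compile(p, re.IGNORECASE) for p in (
    r"\bssn\b",
    r"\b\d{3}-\d{2}-\d{4}\b",  # SSN-shaped numbers
)]

def apply_guardrail(text: str) -> tuple[bool, str]:
    """Return (allowed, text-or-refusal) under the toy content policy."""
    for pattern in BLOCKED_PATTERNS:
        if pattern.search(text):
            return False, "Response blocked by safety policy."
    return True, text

allowed, out = apply_guardrail("Your order ships tomorrow.")
blocked, refusal = apply_guardrail("My SSN is 123-45-6789")
```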
Bedrock Integration

Import custom Nova models as private models on Amazon Bedrock. They then get the same security model, consistent APIs, and broader AWS integrations as any other Bedrock model, enabling seamless production deployment.

Similar Technologies
Private hosting · Bedrock APIs · AWS integration · Production-ready
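Importing a custom model into Bedrock is driven by an import-job request. The sketch below only builds the request payload (bucket, role ARN, and names are placeholders); the actual call, via the boto3 `bedrock` client's model-import job API, is left commented out so the example runs without AWS credentials (check the boto3 reference for the current operation name and parameters):

```python
# Request payload for a Bedrock custom-model import job.
# All identifiers below are placeholders, not real resources.
import_job = {
    "jobName": "nova-novella-import",
    "importedModelName": "novella-support-v1",
    "roleArn": "arn:aws:iam::123456789012:role/BedrockImportRole",
    "modelDataSource": {
        "s3DataSource": {"s3Uri": "s3://my-bucket/novella-artifacts/"}
    },
}

# import boto3
# bedrock = boto3.client("bedrock")
# bedrock.create_model_import_job(**import_job)
```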

Nova Forge Training Phases

Phase | Description | Data Blending | Use Case
--- | --- | --- | ---
Pre-training | Start from earliest checkpoints for maximum model modification | Full proprietary dataset integration with Nova-curated data | Domain-specific foundation models
Mid-training | Continue from intermediate checkpoints with established capabilities | Blend domain data while preserving base knowledge | Specialized knowledge injection
Post-training | Fine-tune from near-final checkpoints for behavior refinement | Task-specific data with proven recipes | Behavioral alignment, instruction tuning

Amazon SageMaker HyperPod

Purpose-built infrastructure for distributed training at scale with built-in resiliency. Automatic fault detection, diagnosis, and recovery reduce training time by up to 40%, enabling continuous model development for months without disruption.

Resilient Training Infrastructure

Automatic fault detection, diagnosis, and recovery without manual intervention. Run model development workloads continuously for months without disruption; by eliminating downtime, recovery reduces training time by up to 40%.

Similar Technologies
Auto fault recovery · 40% faster training · Months-long runs · Zero manual intervention
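The recovery pattern HyperPod automates is resume-from-checkpoint: on a fault, restart from the last saved step rather than from step 0. A minimal simulation of that pattern (the function and its parameters are invented for illustration):

```python
def train_with_recovery(total_steps, fail_at=(), checkpoint_every=10):
    """Simulate a training loop that checkpoints periodically and, on a
    simulated fault, resumes from the last known good state instead of
    restarting the whole job. Sketch only; HyperPod does this for real
    GPU/node failures on real clusters."""
    checkpoint, step, restarts = 0, 0, 0
    failures = set(fail_at)
    while step < total_steps:
        try:
            step += 1
            if step in failures:
                failures.remove(step)
                raise RuntimeError(f"simulated GPU fault at step {step}")
            if step % checkpoint_every == 0:
                checkpoint = step  # persist last known good state
        except RuntimeError:
            restarts += 1
            step = checkpoint  # resume, losing at most checkpoint_every steps
    return step, restarts

final_step, restarts = train_with_recovery(50, fail_at=(23,))
```

The trade-off is checkpoint frequency: checkpointing more often bounds the progress lost per fault but adds I/O overhead per interval.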
Training Operator for Kubernetes

Purpose-built Kubernetes extension for resilient foundation model training. Surgical recovery restarts only affected resources instead of full job restarts. Customizable hanging job monitoring via YAML configurations.

Similar Technologies
Surgical recovery · K8s native · Hanging detection · YAML config
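Hanging-job detection typically reduces to a heartbeat watchdog: flag any worker whose last heartbeat is older than a configured timeout, then restart only that worker. A toy version of the idea (class name and timings are invented; HyperPod configures the real thresholds via YAML):

```python
class HeartbeatMonitor:
    """Toy hanging-job detector: a worker is considered hung when its
    last heartbeat is older than `timeout` seconds."""
    def __init__(self, timeout: float):
        self.timeout = timeout
        self.last_beat: dict[str, float] = {}

    def beat(self, worker: str, now: float) -> None:
        self.last_beat[worker] = now

    def hanging(self, now: float) -> list[str]:
        return [w for w, t in self.last_beat.items() if now - t > self.timeout]

mon = HeartbeatMonitor(timeout=30.0)
mon.beat("worker-0", now=0.0)
mon.beat("worker-1", now=0.0)
mon.beat("worker-0", now=60.0)   # worker-1 goes silent
stalled = mon.hanging(now=65.0)  # only worker-1 exceeds the timeout
```

Restarting just the stalled workers ("surgical recovery") preserves the progress of the healthy ones, instead of tearing down the full job.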
Distributed Training at Scale

Train across hundreds or thousands of GPUs with built-in distributed training primitives. Data parallelism, model parallelism, and pipeline parallelism patterns supported out of the box with optimized communication.

Similar Technologies
Multi-GPU · Data parallel · Model parallel · Pipeline parallel
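The core step of data parallelism is the gradient all-reduce: each replica computes gradients on its own shard of the batch, then all replicas average them so every copy of the model applies the identical update. Shown here on plain lists instead of GPU tensors (in practice a framework collective such as PyTorch's `all_reduce` does this):

```python
def allreduce_mean(per_gpu_grads):
    """Average gradients element-wise across replicas: the all-reduce
    step of data parallelism, on plain lists for illustration."""
    n = len(per_gpu_grads)
    return [sum(g[i] for g in per_gpu_grads) / n
            for i in range(len(per_gpu_grads[0]))]

# Two "GPUs", each with gradients from its own data shard.
grads = allreduce_mean([[1.0, 2.0], [3.0, 4.0]])
```

Model and pipeline parallelism differ in what they shard (parameters across devices, layers across pipeline stages) but rely on the same optimized collectives.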
IDE & Notebook Integration

Run JupyterLab or Code Editor directly on HyperPod clusters, or connect a local IDE. Interactive AI workloads share the same persistent EKS clusters used for training and inference.

Similar Technologies
JupyterLab · Code Editor · Local IDE · Persistent clusters

HyperPod Capabilities

Capability | Description | Benefit
--- | --- | ---
Fault Tolerance | Automatic detection and recovery from GPU failures, network issues, and node problems | Continuous training without manual restarts
Health Monitoring | Real-time cluster health checks with proactive issue identification | Prevent failures before they impact training
Checkpoint Management | Automatic checkpointing and recovery from last known good state | Minimal progress loss on failures
Resource Optimization | Dynamic resource allocation and efficient GPU utilization | Cost efficiency at scale
Multi-framework Support | PyTorch, TensorFlow, JAX with optimized distributed backends | Use preferred ML framework

HyperPod Flexible Training Plans

Capacity Reservation

  • Reserve GPU capacity up to 8 weeks in advance
  • Durations up to 6 months
  • Cluster sizes of 1-256 instances
  • P4d, P5, P5e, P5en GPU instances
  • Start times in as little as 30 minutes

Regional Availability

  • US West (N. California)
  • Asia Pacific (Sydney, Mumbai)
  • Europe (Stockholm, London)
  • South America (São Paulo)
  • Additional regions expanding

Development Tools

  • JupyterLab integration
  • Code Editor support
  • Local IDE connectivity
  • HyperPod CLI & SDK
  • Kubernetes orchestration

Nova Forge + HyperPod Architecture

1. Nova Checkpoint: Pre/Mid/Post-training
2. Data Blending: Proprietary + Nova-curated
3. HyperPod Training: Distributed GPU clusters
4. Bedrock Import: Private model hosting

HyperPod Adoption

Enterprise

Salesforce · Thomson Reuters · BMW

AI Startups

Luma · Perplexity · Stability AI · Hugging Face