AWS Model Development
Amazon Nova Forge
Build optimized variants of Nova called 'Novellas' by blending proprietary data with Nova's frontier capabilities. Start from early checkpoints across pre-training, mid-training, and post-training phases for maximum customization control.
Execute Reinforcement Fine-Tuning with reward functions in your own environment. Train models with custom reward signals to optimize for specific behaviors, outcomes, or quality metrics unique to your use case.
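Nova Forge's actual reward interface is not shown here; as a purely illustrative sketch, a reward function is just a callable that scores a model completion against your own quality criteria. The function name and scoring rules below are hypothetical, not part of any Nova Forge API.

```python
# Hypothetical reward function for Reinforcement Fine-Tuning: score a
# completion on whether it returns valid JSON with the fields we require.
# The interface (prompt/completion in, scalar reward out) is illustrative.
import json

def reward_fn(prompt: str, completion: str) -> float:
    """Return 1.0 for valid JSON with all required fields, partial credit otherwise."""
    try:
        payload = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0  # malformed output earns no reward
    if not isinstance(payload, dict):
        return 0.0

    required = {"summary", "sentiment"}          # fields this use case cares about
    return len(required & payload.keys()) / len(required)

if __name__ == "__main__":
    print(reward_fn("Summarize:", '{"summary": "ok", "sentiment": "positive"}'))  # 1.0
    print(reward_fn("Summarize:", "not json"))                                     # 0.0
```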
Implement custom safety guardrails using Nova Forge's built-in responsible AI toolkit. Define and enforce safety policies, content moderation rules, and alignment constraints specific to your deployment context.
Import custom Nova models as private models on Amazon Bedrock. Get the same security controls, consistent APIs, and broader AWS integrations as any other Bedrock model for seamless production deployment.
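Once imported, the model is called through the standard Bedrock runtime interface. A minimal sketch with boto3 follows; the model ARN is a placeholder and the request body schema depends on the model you imported.

```python
# Minimal sketch: invoke a privately imported Nova variant via the standard
# Amazon Bedrock runtime API. The ARN is a placeholder and the request body
# schema depends on the imported model.
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

model_arn = "arn:aws:bedrock:us-east-1:123456789012:imported-model/EXAMPLE"  # placeholder

response = bedrock.invoke_model(
    modelId=model_arn,
    body=json.dumps({"prompt": "Summarize our Q3 support tickets.", "max_tokens": 256}),
)
print(json.loads(response["body"].read()))
```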
Nova Forge Training Phases
| Phase | Description | Data Blending | Use Case |
|---|---|---|---|
| Pre-training | Start from earliest checkpoints for maximum model modification | Full proprietary dataset integration with Nova-curated data | Domain-specific foundation models |
| Mid-training | Continue from intermediate checkpoints with established capabilities | Blend domain data while preserving base knowledge | Specialized knowledge injection |
| Post-training | Fine-tune from near-final checkpoints for behavior refinement | Task-specific data with proven recipes | Behavioral alignment, instruction tuning |
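Conceptually, the data-blending column above is weighted sampling from a proprietary corpus and a Nova-curated corpus, with the mix shifting by phase. The sketch below is only an illustration; the ratios, corpora, and sampling scheme are made up, and the real blending is handled inside Nova Forge.

```python
# Conceptual sketch of phase-dependent data blending: draw each example from
# the proprietary corpus or the Nova-curated corpus with a per-phase ratio.
# Ratios and corpora here are illustrative, not Nova Forge defaults.
import random

MIX_RATIOS = {            # probability of drawing a proprietary example
    "pre-training": 0.5,
    "mid-training": 0.3,
    "post-training": 0.1,
}

def blended_batch(phase, proprietary, curated, batch_size=8, seed=0):
    rng = random.Random(seed)
    ratio = MIX_RATIOS[phase]
    return [
        rng.choice(proprietary) if rng.random() < ratio else rng.choice(curated)
        for _ in range(batch_size)
    ]

if __name__ == "__main__":
    print(blended_batch("mid-training",
                        proprietary=["internal doc A", "internal doc B"],
                        curated=["curated text X", "curated text Y"]))
```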
Amazon SageMaker HyperPod
Purpose-built infrastructure for distributed training at scale with built-in resiliency, reducing training time by up to 40% by eliminating failure-related downtime.
Automatic fault detection, diagnosis, and recovery without manual intervention lets model development workloads run continuously for months without disruption.
A Kubernetes extension purpose-built for resilient foundation model training: surgical recovery restarts only the affected resources rather than the whole job, and hung-job monitoring is customizable through YAML configuration.
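Surgical recovery only pays off if training state is checkpointed regularly, so a restarted worker can resume rather than start over. Below is a minimal save-and-resume pattern in PyTorch; the checkpoint path and state layout are illustrative, and HyperPod's own auto-resume automates the same idea.

```python
# Minimal checkpoint-and-resume pattern that fault recovery relies on:
# persist model/optimizer state periodically, and on restart continue from
# the most recent checkpoint. Path and layout are illustrative.
import os
import torch

CKPT_PATH = "/fsx/checkpoints/latest.pt"   # shared storage visible to all nodes

def save_checkpoint(model, optimizer, step):
    torch.save(
        {"model": model.state_dict(), "optimizer": optimizer.state_dict(), "step": step},
        CKPT_PATH,
    )

def load_checkpoint(model, optimizer):
    """Restore state if a checkpoint exists; return the step to resume from."""
    if not os.path.exists(CKPT_PATH):
        return 0
    state = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"] + 1
```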
Train across hundreds or thousands of GPUs with built-in distributed training primitives. Data parallelism, model parallelism, and pipeline parallelism are supported out of the box with optimized communication.
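For the data-parallel case, the per-worker training script is ordinary PyTorch DistributedDataParallel; HyperPod supplies the cluster and launch environment. The sketch below is a toy example (model, data, and hyperparameters are placeholders) launched with torchrun, one process per GPU.

```python
# Toy data-parallel training script using PyTorch DistributedDataParallel.
# Launch with: torchrun --nproc_per_node=<gpus_per_node> train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")      # torchrun sets rank/world-size env vars
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = f"cuda:{local_rank}"

    model = DDP(torch.nn.Linear(1024, 1024).to(device), device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):                       # stand-in for the real training loop
        x = torch.randn(32, 1024, device=device)
        loss = model(x).pow(2).mean()            # dummy loss; DDP all-reduces the gradients
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```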
Run JupyterLab or Code Editor directly on HyperPod clusters, or connect local IDEs. Interactive AI workloads run on the same persistent EKS clusters used for training and inference.
HyperPod Capabilities
| Capability | Description | Benefit |
|---|---|---|
| Fault Tolerance | Automatic detection and recovery from GPU failures, network issues, and node problems | Continuous training without manual restarts |
| Health Monitoring | Real-time cluster health checks with proactive issue identification | Prevent failures before they impact training |
| Checkpoint Management | Automatic checkpointing and recovery from last known good state | Minimal progress loss on failures |
| Resource Optimization | Dynamic resource allocation and efficient GPU utilization | Cost efficiency at scale |
| Multi-framework Support | PyTorch, TensorFlow, JAX with optimized distributed backends | Use preferred ML framework |
HyperPod Flexible Training Plans
Capacity Reservation
- Reserve GPU capacity up to 8 weeks in advance
- Durations up to 6 months
- Cluster sizes of 1-256 instances
- P4d, P5, P5e, P5en GPU instances
- Start times as soon as 30 minutes after a plan is reserved (see the reservation sketch below)
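A hypothetical reservation flow with boto3 is sketched below. The API names (search_training_plan_offerings, create_training_plan) and the parameter and response shapes are assumptions based on the SageMaker flexible training plan feature; verify them against the current boto3 documentation before use.

```python
# Hypothetical sketch of reserving HyperPod capacity with a flexible training
# plan. API, parameter, and response names are assumptions to verify against
# the current SageMaker documentation.
import boto3

sm = boto3.client("sagemaker", region_name="us-west-1")

# 1. Find offerings matching the desired instance type, count, and duration.
offerings = sm.search_training_plan_offerings(
    InstanceType="ml.p5.48xlarge",
    InstanceCount=16,
    DurationHours=720,                      # roughly one month
    TargetResources=["hyperpod-cluster"],
)

# 2. Reserve the first matching offering under a named plan.
plan = sm.create_training_plan(
    TrainingPlanName="nova-forge-pretraining",
    TrainingPlanOfferingId=offerings["TrainingPlanOfferings"][0]["TrainingPlanOfferingId"],
)
print(plan["TrainingPlanArn"])
```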
Regional Availability
- US West (N. California)
- Asia Pacific (Sydney, Mumbai)
- Europe (Stockholm, London)
- South America (São Paulo)
- Expanding to additional regions
Development Tools
- JupyterLab integration
- Code Editor support
- Local IDE connectivity
- HyperPod CLI & SDK
- Kubernetes orchestration
