World Models
Generates consistent, explorable 3D worlds from a single image. Supports agent embodiment — an AI character can take actions inside the generated environment and the world responds coherently. Action-conditioned video generation produces frames that respect physics and spatial layout.
- Single-image to 3D world generation
- Agent embodiment with action-conditioned output
- Consistent physics and spatial reasoning
- Currently requires ~4 TPUs per 60 seconds of generation
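The interaction loop described above can be sketched as a minimal interface: a single seed image initializes the world, and each action produces the next frame. `GenerativeWorld`, its `step` method, and the action names are hypothetical stand-ins, not a real API, and `np.roll` fakes what would actually be a large generative model's forward pass.

```python
import numpy as np

# Hypothetical sketch: one seed image initializes a world whose state
# advances one frame per action. np.roll stands in for the model pass.
class GenerativeWorld:
    def __init__(self, seed_image):
        self.state = seed_image.astype(np.float32)

    def step(self, action):
        # Map each action to a fake state change; a real model would
        # generate the next frame conditioned on state + action here.
        shift = {"forward": 1, "left": -1, "right": 2}.get(action, 0)
        self.state = np.roll(self.state, shift, axis=1)
        return self.state

world = GenerativeWorld(np.zeros((64, 64, 3)))
frame = world.step("forward")
print(frame.shape)  # each action yields a frame matching the seed's size
```

The point of the sketch is the contract, not the internals: the world keeps state between calls, so successive actions see the consequences of earlier ones.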
Runs DOOM entirely inside a diffusion model at 20+ FPS with no game engine whatsoever. The neural network replaces rendering, physics, and game logic. Each frame is generated by conditioning on the previous frames and player input.
- 20+ FPS neural game simulation
- No game engine — pure diffusion model
- Conditioned on previous frames + player actions
- Trained on recorded DOOM gameplay
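That autoregressive conditioning can be sketched with a rolling buffer of recent frames; `fake_denoiser` and the context length of 4 are illustrative assumptions standing in for the real diffusion sampler.

```python
import numpy as np
from collections import deque

CONTEXT = 4  # how many past frames condition each new one (illustrative)

def fake_denoiser(frames, action):
    # Stand-in for diffusion sampling: a real model would denoise noise
    # into the next frame, conditioned on past frames + player input.
    return frames.mean(axis=0) + action

# Rolling buffer of conditioning frames; each generated frame re-enters
# the buffer, so the model always sees its own recent output.
history = deque([np.zeros((32, 32)) for _ in range(CONTEXT)], maxlen=CONTEXT)
for action in [1.0, 0.0, 1.0]:
    next_frame = fake_denoiser(np.stack(history), action)
    history.append(next_frame)

print(history[-1].shape)
```

The `deque` with `maxlen` is the key structural idea: the context window slides forward one frame per tick, which is what lets generation run indefinitely at a fixed cost.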
Real-time playable Minecraft-like world running entirely via neural inference. Open weights. The model acts as the game engine — it takes player inputs, runs a forward pass, and outputs the next frame.
- Real-time playable output
- Open weights available
- Inference-as-engine architecture
- Minecraft-style open world
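Under that inference-as-engine design, the classic update-then-render loop collapses into one forward pass per tick. A toy sketch, where `model_forward` is a stand-in function rather than the real network:

```python
import numpy as np

def model_forward(frame, key):
    # Stand-in for the network: one pass does the work of physics,
    # game logic, and rendering combined.
    delta = {"w": 1.0, "s": -1.0}.get(key, 0.0)
    return np.clip(frame + delta, 0.0, 255.0)

frame = np.full((16, 16), 128.0)       # current screen
for key in ["w", "w", "s"]:            # player inputs, one per tick
    frame = model_forward(frame, key)  # one forward pass = one game tick
print(frame.mean())
```

Nothing in the loop knows about blocks, collisions, or lighting; whatever "rules" the world has exist only inside the forward pass.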
Simulates real-world interactions for robotics and embodied AI training. Given an action description or control signal, it generates a video of what would happen. Designed for policy learning — a robot can train in UniSim's simulated world before touching real hardware.
- Action-conditioned video generation
- Real-world scene simulation
- Robotics training without physical hardware
- Supports diverse action modalities
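The policy-learning idea can be sketched with a toy simulator in place of the generative model; `simulate`, the reward, and the candidate actions are all illustrative assumptions, and a greedy planner stands in for a full learning algorithm.

```python
import numpy as np

def simulate(obs, action):
    # Stand-in for the learned simulator: a real system would generate a
    # video of the outcome; here we return the next state and a reward.
    nxt = obs + action
    return nxt, -abs(nxt.sum() - 10.0)

# Greedy planner: try each candidate action in simulation, commit to the
# best one. No physical hardware is touched anywhere in this loop.
obs = np.zeros(4)
for _ in range(5):
    best = max([-1.0, 0.0, 1.0], key=lambda a: simulate(obs, a)[1])
    obs, _ = simulate(obs, best)
print(obs.sum())
```

The same structure scales up: replace `simulate` with a generative world model and the candidate sweep with policy-gradient updates, and the robot still only touches simulation until deployment.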
Diffusion world model trained on Atari games. Learns environment dynamics entirely from pixel observations, then plays games using those learned dynamics rather than interacting with the real environment.
- Diffusion-based world model
- Trained on Atari gameplay pixels
- Plays games via learned dynamics
- No environment interaction at inference
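The two phases can be sketched separately: fit dynamics from logged pixel transitions, then act using only the fitted model. A per-action average delta stands in for the diffusion model used in reality; the scripted logging "environment" is likewise illustrative.

```python
import numpy as np

# Phase 1: fit dynamics from logged pixel transitions. A per-action
# average frame delta stands in for a trained diffusion model.
deltas = {0: [], 1: []}
obs = np.zeros((8, 8))
for t in range(20):
    a = t % 2
    nxt = obs + (1.0 if a else -1.0)   # scripted "environment" for logging
    deltas[a].append(nxt - obs)
    obs = nxt
dynamics = {a: np.mean(d, axis=0) for a, d in deltas.items()}

# Phase 2: play entirely inside the learned model; the environment is
# never queried again at inference time.
obs = np.zeros((8, 8))
for a in [1, 1, 0]:
    obs = obs + dynamics[a]
print(obs.mean())
```

Phase 2 is the distinctive part: every "observation" after training is a prediction, which is exactly what the final bullet above means by no environment interaction at inference.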
Video Generation
Text-to-video and image-to-video diffusion transformer capable of generating up to 1 minute of high-fidelity video. Strong physics intuition — understands 3D space, object permanence, and cause-and-effect. Limited public access.
- Up to 60 seconds of coherent video
- Strong physics intuition
- Text-to-video and image-to-video
- Limited public availability
xAI's video generation model integrated into the Grok platform. Fewer content restrictions than competitors. Part of xAI's push to make Grok a multimodal creative tool.
- Text-to-video generation
- Fewer content restrictions
- Integrated into Grok platform
- Coherent motion and physics
Google DeepMind's second-generation video model. Strong photorealism and 1080p output with cinematic camera control and temporal consistency. Available via Vertex AI for production use.
- Strong photorealism
- 1080p video generation
- Available via Vertex AI
- Cinematic camera control
Chinese video generation model with quality competitive with Western alternatives and broader public access. Generates high-quality video from text and image prompts with good motion consistency.
- Competitive video quality
- More accessible than Sora
- Text and image prompts
- Good motion consistency
Production-focused third-generation video model. Fine-grained control over motion, style, and composition via API and web interface. Widely used in professional film and VFX.
- Production-focused
- Good motion consistency
- API and web interface
- Fine-grained style control
Open-weights video generation model with 13B parameters. Strong quality for an open model, enabling local deployment and fine-tuning without API dependence.
- Open weights (13B parameters)
- Local deployment possible
- Fine-tuning friendly
- Solid generation quality
Open-source video generation model from Zhipu AI (makers of GLM). Solid quality with open weights, enabling research and custom deployment.
- Open source
- Solid generation quality
- Research-friendly
- Custom deployment
Consumer-tier video generators with faster iteration cycles and lower barriers to entry. Pika focuses on creative editing, Luma Dream Machine on 3D-aware generation, and Haiper on accessible quick generation.
- Fast iteration and generation
- Lower barrier to entry
- Consumer-friendly interfaces
- Rapid model updates
Key Concepts
Action-Conditioned Generation
The model takes an action and generates the next visual frame showing the result. This is what makes a world interactive rather than just a video.
Neural Simulation
A neural network learns to predict the next state, replacing hand-coded physics and rendering with a single model forward pass.
Agent Embodiment
An AI agent is placed inside the generated world and takes actions. The world model generates consistent responses, enabling training without a real environment.
Temporal Consistency
Characters, objects, and physics remain coherent across frames. Critical for both playable worlds and video generation — the model must maintain a persistent world state.
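One common way to get that persistence is a latent state carried across frames, so each new frame is decoded from memory of everything that came before. This sketch is illustrative only; `StatefulGenerator`, the decay factor, and the decode step are made up for the example.

```python
import numpy as np

# A persistent latent state carried across frames is one way to keep
# objects and physics coherent from one frame to the next.
class StatefulGenerator:
    def __init__(self):
        self.latent = np.zeros(8)            # persistent world memory

    def next_frame(self, action):
        self.latent = 0.9 * self.latent      # decay old information
        self.latent[0] += action             # fold in the new action
        return np.tile(self.latent, (4, 1))  # "decode" memory to a frame

gen = StatefulGenerator()
f1 = gen.next_frame(1.0)
f2 = gen.next_frame(0.0)
# With no new action, frame 2 still reflects frame 1's world state.
print(f2[0, 0])
```

If the latent were reset between calls, `f2` would be all zeros: objects would pop out of existence the moment the player stopped acting on them, which is precisely the failure mode temporal consistency guards against.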
