World Models

Genie 2 (Google DeepMind, Dec 2024)

Generates consistent, explorable 3D worlds from a single image. Supports agent embodiment — an AI character can take actions inside the generated environment and the world responds coherently. Action-conditioned video generation produces frames that respect physics and spatial layout.

Key Features
  • Single-image to 3D world generation
  • Agent embodiment with action-conditioned output
  • Consistent physics and spatial reasoning
  • Currently requires ~4 TPUs per 60 seconds of generation
Related Systems
GameNGen, Oasis, UniSim, DIAMOND

GameNGen (Google, 2024)

Runs DOOM entirely inside a diffusion model at 20+ FPS with no game engine whatsoever. The neural network replaces rendering, physics, and game logic. Each frame is generated by conditioning on the previous frames and player input.
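The frame-by-frame loop described above can be sketched in a few lines. This is a toy illustration, not GameNGen's code: `toy_next_frame` is a hypothetical stand-in for the diffusion model, a frame is a short tuple of integers, and the context length `CONTEXT = 4` is an arbitrary assumption.

```python
from collections import deque
from typing import Deque, List, Tuple

Frame = Tuple[int, ...]  # toy "frame": a 1-D strip of pixel values

CONTEXT = 4  # hypothetical context length, not GameNGen's actual value


def toy_next_frame(frames: List[Frame], actions: List[int]) -> Frame:
    """Hypothetical stand-in for the model: shift the last frame by the latest action."""
    last, shift = frames[-1], actions[-1]
    n = len(last)
    return tuple(last[(i - shift) % n] for i in range(n))


def rollout(first: Frame, player_actions: List[int]) -> List[Frame]:
    """Generate one frame per player action, feeding outputs back in as context."""
    frames: Deque[Frame] = deque([first], maxlen=CONTEXT)
    actions: Deque[int] = deque(maxlen=CONTEXT)
    generated: List[Frame] = []
    for action in player_actions:
        actions.append(action)
        nxt = toy_next_frame(list(frames), list(actions))
        frames.append(nxt)  # the model's own output becomes future context
        generated.append(nxt)
    return generated


frames = rollout((1, 0, 0, 0), [1, 1, -1])
# frames == [(0, 1, 0, 0), (0, 0, 1, 0), (0, 1, 0, 0)]
```

The point of the sketch is that every generated frame is appended to the context window, so later frames are conditioned on the model's own earlier outputs as well as on player input.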

Key Features
  • 20+ FPS neural game simulation
  • No game engine — pure diffusion model
  • Conditioned on previous frames + player actions
  • Trained on recorded DOOM gameplay
Related Systems
Genie 2, Oasis, DIAMOND

Oasis (Decart, 2024)

Real-time playable Minecraft-like world running entirely via neural inference. Open weights. The model acts as the game engine — it takes player inputs, runs a forward pass, and outputs the next frame.
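Real-time play puts a hard bound on inference latency: one forward pass has to finish within the time slot of a single frame. The arithmetic below assumes a 20 FPS target and made-up latency figures purely for illustration; neither is a published Oasis number.

```python
def frame_budget_ms(target_fps: float) -> float:
    """Time available for one forward pass at the target frame rate."""
    if target_fps <= 0:
        raise ValueError("target_fps must be positive")
    return 1000.0 / target_fps


def is_real_time(forward_pass_ms: float, target_fps: float) -> bool:
    """True if one inference step fits inside the per-frame budget."""
    return forward_pass_ms <= frame_budget_ms(target_fps)


budget = frame_budget_ms(20)       # 50.0 ms per frame at an assumed 20 FPS
fits = is_real_time(40.0, 20)      # True: a 40 ms pass fits in 50 ms
too_slow = is_real_time(60.0, 20)  # False: a 60 ms pass misses the budget
```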

Key Features
  • Real-time playable output
  • Open weights available
  • Inference-as-engine architecture
  • Minecraft-style open world
Related Systems
Genie 2, GameNGen, DIAMOND

UniSim (Google DeepMind)

Simulates real-world interactions for robotics and embodied AI training. Given an action description or control signal, it generates a video of what would happen. Designed for policy learning — a robot can train in UniSim's simulated world before touching real hardware.
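A minimal sketch of the train-in-simulation workflow, under heavy assumptions: `simulate` is a hypothetical stand-in for a learned video simulator, and the "policy" is just a proportional gain. It only shows the pattern of scoring candidate policies in simulation and picking the best one before any hardware is involved.

```python
from typing import Callable, List


def simulate(gain: float, target: float = 1.0, steps: int = 5) -> float:
    """Hypothetical simulator: roll out a proportional controller, return final error."""
    pos = 0.0
    for _ in range(steps):
        pos += gain * (target - pos)  # the policy acts inside the simulation only
    return abs(target - pos)


def pick_policy(candidates: List[float], sim: Callable[[float], float]) -> float:
    """Choose the candidate with the lowest simulated error."""
    return min(candidates, key=sim)


best_gain = pick_policy([0.1, 0.5, 1.0, 1.5], simulate)  # 1.0 reaches the target exactly
```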

Key Features
  • Action-conditioned video generation
  • Real-world scene simulation
  • Robotics training without physical hardware
  • Supports diverse action modalities
Related Systems
Genie 2, DayDreamer, Dreamer V3

DIAMOND (University of Geneva, University of Edinburgh, Microsoft Research)

Diffusion world model trained on Atari games. Learns environment dynamics entirely from pixel observations, then trains a reinforcement-learning agent inside those learned dynamics rather than in the real environment.
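The pattern of learning dynamics first and then acting inside the learned model can be illustrated with a toy example. Everything here is an assumption made for illustration: a five-state chain replaces Atari, a lookup table replaces the diffusion model, and a greedy rule replaces the agent. Only the structure (fit dynamics from logged transitions, then choose actions using the fitted model alone) mirrors the description above.

```python
from typing import Dict, List, Tuple

N_STATES, GOAL = 5, 4
ACTIONS = (-1, +1)
Transition = Tuple[int, int, int]  # (state, action, next_state)


def real_env_step(state: int, action: int) -> int:
    """The 'real' environment: a walk on a chain, clamped at both ends."""
    return min(max(state + action, 0), N_STATES - 1)


def logged_transitions() -> List[Transition]:
    """Hypothetical logged gameplay: every state-action pair observed once."""
    return [(s, a, real_env_step(s, a)) for s in range(N_STATES) for a in ACTIONS]


def fit_model(data: List[Transition]) -> Dict[Tuple[int, int], int]:
    """'Learn' the dynamics as a lookup table from observed transitions."""
    return {(s, a): s2 for s, a, s2 in data}


def act_in_model(model: Dict[Tuple[int, int], int], start: int, max_steps: int = 10) -> List[int]:
    """Pick actions using only the learned model; the real environment is never called."""
    state, path = start, [start]
    for _ in range(max_steps):
        if state == GOAL:
            break
        action = min(ACTIONS, key=lambda a: abs(GOAL - model[(state, a)]))
        state = model[(state, action)]
        path.append(state)
    return path


model = fit_model(logged_transitions())
path = act_in_model(model, start=0)  # [0, 1, 2, 3, 4]
```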

Key Features
  • Diffusion-based world model
  • Trained on Atari gameplay pixels
  • Plays games via learned dynamics
  • No environment interaction at inference
Related Systems
Genie 2, GameNGen, Dreamer V3

Video Generation

Sora (OpenAI)

Text-to-video and image-to-video diffusion transformer capable of generating up to 1 minute of high-fidelity video. Strong physics intuition — understands 3D space, object permanence, and cause-and-effect. Limited public access.

Key Features
  • Up to 60 seconds of coherent video
  • Strong physics intuition
  • Text-to-video and image-to-video
  • Limited public availability
Competitors
Grok Aurora, Veo 2, Runway Gen-3

Grok Aurora (xAI)

xAI's video generation model integrated into the Grok platform. Fewer content restrictions than competitors. Part of xAI's push to make Grok a multimodal creative tool.

Key Features
  • Text-to-video generation
  • Fewer content restrictions
  • Integrated into Grok platform
  • Coherent motion and physics
Competitors
Sora, Veo 2, Runway Gen-3

Veo 2 (Google DeepMind)

Google DeepMind's second-generation video model. Strong photorealism and 1080p output with cinematic camera control and temporal consistency. Available via Vertex AI for production use.

Key Features
  • Strong photorealism
  • 1080p video generation
  • Available via Vertex AI
  • Cinematic camera control
Competitors
Sora, Grok Aurora, Runway Gen-3

Kling (Kuaishou)

Chinese video generation model with competitive quality and broader accessibility than Western alternatives. Generates high-quality video from text and image prompts with good motion consistency.

Key Features
  • Competitive video quality
  • More accessible than Sora
  • Text and image prompts
  • Good motion consistency
Competitors
Sora, Hunyuan Video, CogVideoX

Runway Gen-3 Alpha (Runway)

Production-focused third-generation video model. Fine-grained control over motion, style, and composition via API and web interface. Widely used in professional film and VFX.

Key Features
  • Production-focused
  • Good motion consistency
  • API and web interface
  • Fine-grained style control
Competitors
Sora, Veo 2, Kling

Hunyuan Video (Tencent)

Open-weights video generation model with 13B parameters. Strong quality for an open model, enabling local deployment and fine-tuning without API dependence.
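To make the local-deployment claim concrete, here is a back-of-the-envelope estimate of the memory needed just to hold 13B parameters. The bytes-per-parameter values are the standard sizes of each number format; the estimate ignores activations and any auxiliary components, so real requirements will be higher.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


fp32_gb = weight_memory_gb(13e9, "fp32")  # 52.0
fp16_gb = weight_memory_gb(13e9, "fp16")  # 26.0
int8_gb = weight_memory_gb(13e9, "int8")  # 13.0
```

Read this as a lower bound: halving the precision halves the weight footprint, but it says nothing about output quality at that precision.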

Key Features
  • Open weights (13B parameters)
  • Local deployment possible
  • Fine-tuning friendly
  • Solid generation quality
Competitors
CogVideoX, Kling, Runway Gen-3

CogVideoX (Zhipu AI)

Open-source video generation model from Zhipu AI (makers of GLM). Solid quality with open weights, enabling research and custom deployment.

Key Features
  • Open source
  • Solid generation quality
  • Research-friendly
  • Custom deployment
Competitors
Hunyuan Video, Kling, Runway Gen-3

Pika, Luma, Haiper (Various)

Consumer-tier video generators with faster iteration cycles and lower barriers to entry. Pika focuses on creative editing, Luma Dream Machine on 3D-aware generation, and Haiper on accessible quick generation.

Key Features
  • Fast iteration and generation
  • Lower barrier to entry
  • Consumer-friendly interfaces
  • Rapid model updates
Competitors
Runway Gen-3, Kling, CogVideoX

Key Concepts

Action-Conditioned Generation

The model takes an action and generates the next visual frame showing the result. This is what makes a world interactive rather than just a video.
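A toy illustration of the difference: from the same starting state, each action yields a different next frame. The one-row text "frame" and the `next_frame` function are hypothetical stand-ins for a real generative model.

```python
WIDTH = 5
MOVES = {"left": -1, "stay": 0, "right": +1}


def render(player_x: int) -> str:
    """Draw a one-row frame with the player marked as '@'."""
    return "".join("@" if i == player_x else "." for i in range(WIDTH))


def next_frame(player_x: int, action: str) -> str:
    """Action-conditioned step: the output depends on both the state and the action."""
    new_x = min(max(player_x + MOVES[action], 0), WIDTH - 1)
    return render(new_x)


# Same state (player at x=2), three different actions, three different frames.
frames = {action: next_frame(2, action) for action in MOVES}
# frames == {"left": ".@...", "stay": "..@..", "right": "...@."}
```

A plain video model has no `action` argument, so it can only continue the clip one way; the extra input is what lets a player or agent steer the outcome.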

Neural Simulation

A neural network learns to predict the next state, replacing hand-coded physics and rendering with a single model forward pass.
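As a minimal sketch of the idea, the example below fits a single linear unit to data produced by a hand-coded physics rule and then uses the fitted unit in place of the rule. The falling-object rule, the constants `G` and `DT`, and the closed-form least-squares fit are all illustrative assumptions; a real world model would be a deep network trained on pixels.

```python
from typing import List, Tuple

G, DT = 9.8, 0.1  # assumed gravity (m/s^2) and time step (s)


def physics_step(v: float) -> float:
    """Hand-coded rule: velocity of a falling object after one time step."""
    return v - G * DT


def fit_linear(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Least-squares fit of y = w * x + b, i.e. a single linear unit."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var = sum((x - mean_x) ** 2 for x in xs)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = cov / var
    return w, mean_y - w * mean_x


# Training data comes from observing the hand-coded simulator.
train_v = [float(v) for v in range(-5, 6)]
w, b = fit_linear(train_v, [physics_step(v) for v in train_v])


def learned_step(v: float) -> float:
    """Learned predictor: one multiply-add replaces the explicit physics rule."""
    return w * v + b
```

After fitting, `w` is close to 1 and `b` is close to `-G * DT`, so `learned_step` reproduces the rule without ever being told what gravity is.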

Agent Embodiment

An AI agent is placed inside the generated world and takes actions. The world model generates consistent responses, enabling training without a real environment.

Temporal Consistency

Characters, objects, and physics remain coherent across frames. Critical for both playable worlds and video generation — the model must maintain a persistent world state.
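One crude way to put a number on this is the average per-pixel change between consecutive frames. The `flicker` function below is a hypothetical proxy, not a standard benchmark metric, and it cannot tell legitimate motion apart from incoherent flicker; it is only meant to make the idea measurable on toy data.

```python
from typing import Sequence


def flicker(frames: Sequence[Sequence[float]]) -> float:
    """Mean absolute per-pixel change between consecutive frames (lower is steadier)."""
    if len(frames) < 2:
        raise ValueError("need at least two frames")
    total, count = 0.0, 0
    for prev, cur in zip(frames, frames[1:]):
        for p, c in zip(prev, cur):
            total += abs(c - p)
            count += 1
    return total / count


steady = [[0.0, 1.0, 0.0]] * 3
jumpy = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]

steady_score = flicker(steady)  # 0.0: nothing changes between frames
jumpy_score = flicker(jumpy)    # 1.0: every pixel flips on every frame
```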