World Models
Generates consistent, explorable 3D worlds from a single image. Supports agent embodiment — an AI character can take actions inside the generated environment and the world responds coherently. Action-conditioned video generation produces frames that respect physics and spatial layout.
- Single-image to 3D world generation
- Agent embodiment with action-conditioned output
- Consistent physics and spatial reasoning
- Currently requires ~4 TPUs per 60 seconds of generation
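The interaction loop described above can be sketched as a minimal interface: a single seed image initializes the world, and each action produces the next frame. `GenerativeWorld`, its `step` method, and the action names are hypothetical stand-ins, not a real API, and `np.roll` fakes what would actually be a large generative model's forward pass.

```python
import numpy as np

# Hypothetical sketch: one seed image initializes a world whose state
# advances one frame per action. np.roll stands in for the model pass.
class GenerativeWorld:
    def __init__(self, seed_image):
        self.state = seed_image.astype(np.float32)

    def step(self, action):
        # Map each action to a fake state change; a real model would
        # generate the next frame conditioned on state + action here.
        shift = {"forward": 1, "left": -1, "right": 2}.get(action, 0)
        self.state = np.roll(self.state, shift, axis=1)
        return self.state

world = GenerativeWorld(np.zeros((64, 64, 3)))
frame = world.step("forward")
print(frame.shape)  # each action yields a frame matching the seed's size
```

The point of the sketch is the contract, not the internals: the world keeps state between calls, so successive actions see the consequences of earlier ones.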
Runs DOOM entirely inside a diffusion model at 20+ FPS with no game engine whatsoever. The neural network replaces rendering, physics, and game logic. Each frame is generated by conditioning on the previous frames and player input.
- 20+ FPS neural game simulation
- No game engine — pure diffusion model
- Conditioned on previous frames + player actions
- Trained on recorded DOOM gameplay
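That autoregressive conditioning can be sketched with a rolling buffer of recent frames; `fake_denoiser` and the context length of 4 are illustrative assumptions standing in for the real diffusion sampler.

```python
import numpy as np
from collections import deque

CONTEXT = 4  # how many past frames condition each new one (illustrative)

def fake_denoiser(frames, action):
    # Stand-in for diffusion sampling: a real model would denoise noise
    # into the next frame, conditioned on past frames + player input.
    return frames.mean(axis=0) + action

# Rolling buffer of conditioning frames; each generated frame re-enters
# the buffer, so the model always sees its own recent output.
history = deque([np.zeros((32, 32)) for _ in range(CONTEXT)], maxlen=CONTEXT)
for action in [1.0, 0.0, 1.0]:
    next_frame = fake_denoiser(np.stack(history), action)
    history.append(next_frame)

print(history[-1].shape)
```

The `deque` with `maxlen` is the key structural idea: the context window slides forward one frame per tick, which is what lets generation run indefinitely at a fixed cost.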
Real-time playable Minecraft-like world running entirely via neural inference. Open weights. The model acts as the game engine — it takes player inputs, runs a forward pass, and outputs the next frame.
- Real-time playable output
- Open weights available
- Inference-as-engine architecture
- Minecraft-style open world
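Under that inference-as-engine design, the classic update-then-render loop collapses into one forward pass per tick. A toy sketch, where `model_forward` is a stand-in function rather than the real network:

```python
import numpy as np

def model_forward(frame, key):
    # Stand-in for the network: one pass does the work of physics,
    # game logic, and rendering combined.
    delta = {"w": 1.0, "s": -1.0}.get(key, 0.0)
    return np.clip(frame + delta, 0.0, 255.0)

frame = np.full((16, 16), 128.0)       # current screen
for key in ["w", "w", "s"]:            # player inputs, one per tick
    frame = model_forward(frame, key)  # one forward pass = one game tick
print(frame.mean())
```

Nothing in the loop knows about blocks, collisions, or lighting; whatever "rules" the world has exist only inside the forward pass.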
Simulates real-world interactions for robotics and embodied AI training. Given an action description or control signal, it generates a video of what would happen. Designed for policy learning — a robot can train in UniSim's simulated world before touching real hardware.
- Action-conditioned video generation
- Real-world scene simulation
- Robotics training without physical hardware
- Supports diverse action modalities
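The policy-learning idea can be sketched with a toy simulator in place of the generative model; `simulate`, the reward, and the candidate actions are all illustrative assumptions, and a greedy planner stands in for a full learning algorithm.

```python
import numpy as np

def simulate(obs, action):
    # Stand-in for the learned simulator: a real system would generate a
    # video of the outcome; here we return the next state and a reward.
    nxt = obs + action
    return nxt, -abs(nxt.sum() - 10.0)

# Greedy planner: try each candidate action in simulation, commit to the
# best one. No physical hardware is touched anywhere in this loop.
obs = np.zeros(4)
for _ in range(5):
    best = max([-1.0, 0.0, 1.0], key=lambda a: simulate(obs, a)[1])
    obs, _ = simulate(obs, best)
print(obs.sum())
```

The same structure scales up: replace `simulate` with a generative world model and the candidate sweep with policy-gradient updates, and the robot still only touches simulation until deployment.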
Diffusion world model trained on Atari games. Learns environment dynamics entirely from pixel observations, then plays games using those learned dynamics rather than interacting with the real environment.
- Diffusion-based world model
- Trained on Atari gameplay pixels
- Plays games via learned dynamics
- No environment interaction at inference
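The two phases can be sketched separately: fit dynamics from logged pixel transitions, then act using only the fitted model. A per-action average delta stands in for the diffusion model used in reality; the scripted logging "environment" is likewise illustrative.

```python
import numpy as np

# Phase 1: fit dynamics from logged pixel transitions. A per-action
# average frame delta stands in for a trained diffusion model.
deltas = {0: [], 1: []}
obs = np.zeros((8, 8))
for t in range(20):
    a = t % 2
    nxt = obs + (1.0 if a else -1.0)   # scripted "environment" for logging
    deltas[a].append(nxt - obs)
    obs = nxt
dynamics = {a: np.mean(d, axis=0) for a, d in deltas.items()}

# Phase 2: play entirely inside the learned model; the environment is
# never queried again at inference time.
obs = np.zeros((8, 8))
for a in [1, 1, 0]:
    obs = obs + dynamics[a]
print(obs.mean())
```

Phase 2 is the distinctive part: every "observation" after training is a prediction, which is exactly what the final bullet above means by no environment interaction at inference.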
Video Generation
Text-to-video and image-to-video diffusion transformer capable of generating up to 1 minute of high-fidelity video. Strong physics intuition — understands 3D space, object permanence, and cause-and-effect. Limited public access.
- Up to 60 seconds of coherent video
- Strong physics intuition
- Text-to-video and image-to-video
- Limited public availability
xAI's video generation model integrated into the Grok platform. Fewer content restrictions than competitors. Part of xAI's push to make Grok a multimodal creative tool.
- Text-to-video generation
- Fewer content restrictions
- Integrated into Grok platform
- Coherent motion and physics
Google DeepMind's second-generation video model. Strong photorealism and 1080p output with cinematic camera control and temporal consistency. Available via Vertex AI for production use.
- Strong photorealism
- 1080p video generation
- Available via Vertex AI
- Cinematic camera control
Chinese video generation model with quality competitive with Western alternatives and broader public access. Generates high-quality video from text and image prompts with good motion consistency.
- Competitive video quality
- More accessible than Sora
- Text and image prompts
- Good motion consistency
Production-focused third-generation video model. Fine-grained control over motion, style, and composition via API and web interface. Widely used in professional film and VFX.
- Production-focused
- Good motion consistency
- API and web interface
- Fine-grained style control
Open-weights video generation model with 13B parameters. Strong quality for an open model, enabling local deployment and fine-tuning without API dependence.
- Open weights (13B parameters)
- Local deployment possible
- Fine-tuning friendly
- Solid generation quality
Open-source video generation model from Zhipu AI (makers of GLM). Solid quality with open weights, enabling research and custom deployment.
- Open source
- Solid generation quality
- Research-friendly
- Custom deployment
Consumer-tier video generators with faster iteration cycles and lower barriers to entry. Pika focuses on creative editing, Luma Dream Machine on 3D-aware generation, and Haiper on accessible quick generation.
- Fast iteration and generation
- Lower barrier to entry
- Consumer-friendly interfaces
- Rapid model updates
Key Concepts
Action-Conditioned Generation
The model takes an action and generates the next visual frame showing the result. This is what makes a world interactive rather than just a video.
Neural Simulation
A neural network learns to predict the next state, replacing hand-coded physics and rendering with a single model forward pass.
Agent Embodiment
An AI agent is placed inside the generated world and takes actions. The world model generates consistent responses, enabling training without a real environment.
Temporal Consistency
Characters, objects, and physics remain coherent across frames. Critical for both playable worlds and video generation — the model must maintain a persistent world state.
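One common way to get that persistence is a latent state carried across frames, so each new frame is decoded from memory of everything that came before. This sketch is illustrative only; `StatefulGenerator`, the decay factor, and the decode step are made up for the example.

```python
import numpy as np

# A persistent latent state carried across frames is one way to keep
# objects and physics coherent from one frame to the next.
class StatefulGenerator:
    def __init__(self):
        self.latent = np.zeros(8)            # persistent world memory

    def next_frame(self, action):
        self.latent = 0.9 * self.latent      # decay old information
        self.latent[0] += action             # fold in the new action
        return np.tile(self.latent, (4, 1))  # "decode" memory to a frame

gen = StatefulGenerator()
f1 = gen.next_frame(1.0)
f2 = gen.next_frame(0.0)
# With no new action, frame 2 still reflects frame 1's world state.
print(f2[0, 0])
```

If the latent were reset between calls, `f2` would be all zeros: objects would pop out of existence the moment the player stopped acting on them, which is precisely the failure mode temporal consistency guards against.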
