WonderWorld: Interactive 3D Scene Generation from a Single Image

Interactive Scene Generation

WonderWorld supports real-time rendering and fast scene generation, so a user can navigate existing content and specify both where and what to generate next. Here are examples where a user specifies scene contents (via text) and locations (via camera movement).

Generated Virtual World

Here are some examples of generated scenes with different camera path styles: rotational, casual, and straight.

Interactive Viewing

Controls: Move with "W/A/S/D", look around with "I/J/K/L". After loading, please click on the canvas to activate the controls.
Note: The rendering here is done in your browser in real time. Loading a scene (~100MB) may take a while.

Abstract

We present WonderWorld, a novel framework for interactive 3D scene extrapolation that enables users to explore and shape virtual environments based on a single input image and user-specified text. While significant improvements have been made to the visual quality of scene generation, existing methods run offline, taking tens of minutes to hours to generate a scene. By leveraging Fast Gaussian Surfels and a guided diffusion-based depth estimation method, WonderWorld generates geometrically consistent extrapolations while significantly reducing computational time. Our framework generates connected and diverse 3D scenes in less than 10 seconds on a single A6000 GPU, enabling real-time user interaction and exploration. We demonstrate the potential of WonderWorld for applications in virtual reality, gaming, and creative design, where users can quickly generate and navigate immersive, potentially infinite virtual worlds from a single image. Our approach represents a significant advancement in interactive 3D scene generation, opening up new possibilities for user-driven content creation and exploration in virtual environments. We will release full code and software for reproducibility.

Approach

WonderWorld takes a single image as input and generates connected, diverse 3D scenes to form a virtual world. Users can specify new scene contents and styles via text, and can specify where to generate new scenes via camera movement, since our system renders in real time. Our system generates a single 3D scene in less than 10 seconds thanks to two technical innovations. First, our 3D scene representation, Fast Gaussian Surfels, takes less than 1 second to optimize thanks to its geometry-based initialization. Second, our layer-wise scene generation strategy allows us to use only a single view for each scene, rather than generating multiple views, without introducing large disocclusion holes.
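To make the idea of a geometry-based initialization more concrete, here is a minimal sketch in Python with NumPy. It is not the WonderWorld implementation: it assumes a pinhole camera, places one surfel per pixel by unprojecting an estimated depth map, and sets each surfel's radius to the pixel footprint at that depth (depth divided by focal length). The function name `init_surfels_from_depth` and its parameters are hypothetical, chosen for illustration only.

```python
import numpy as np


def init_surfels_from_depth(depth, focal, cx=None, cy=None):
    """Unproject an (H, W) depth map into one surfel per pixel.

    Assumes a pinhole camera with a single focal length in pixels.
    Returns positions of shape (H*W, 3) in camera coordinates and
    radii of shape (H*W,), where each radius is the size of one
    pixel projected to that surfel's depth.
    """
    h, w = depth.shape
    # Default the principal point to the image center (an assumption).
    cx = (w - 1) / 2.0 if cx is None else cx
    cy = (h - 1) / 2.0 if cy is None else cy

    # Pixel grid: v indexes rows, u indexes columns.
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")

    # Standard pinhole unprojection.
    x = (u - cx) / focal * depth
    y = (v - cy) / focal * depth
    positions = np.stack([x, y, depth], axis=-1).reshape(-1, 3)

    # One pixel covers roughly depth / focal world units at that depth.
    radii = (depth / focal).reshape(-1)
    return positions, radii


# Toy input: a flat 2x3 depth map, two units from the camera.
depth = np.full((2, 3), 2.0)
positions, radii = init_surfels_from_depth(depth, focal=2.0)
```

Starting the surfels at plausible positions and scales, rather than at random, is one reason an optimization could converge in few steps. How the actual system initializes orientation, opacity, and color is not covered by this sketch.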