We introduce Kaleido, a family of spatial generative models that achieves photorealistic, unified object- and scene-level neural rendering. Kaleido sets a new state-of-the-art, significantly outperforming existing generative models in few-view settings, and is the first zero-shot generative model to match the rendering quality of per-scene optimisation models in many-view settings.
Kaleido is a general-purpose generative neural rendering engine that can synthesise photorealistic images and videos of any scene or object from any viewpoint. Unlike traditional methods that require per-scene optimisation or are limited to specific object categories, Kaleido uses a sequence-to-sequence transformer architecture to learn a unified understanding of 3D space from diverse video and 3D datasets. This allows Kaleido to render new subjects zero-shot, producing realistic renderings of both individual objects and complex environments without any scene-specific fine-tuning.
Kaleido's design is guided by a simple philosophy: 3D perception is a form of visual common sense. Instead of relying on explicit 3D structures and inductive biases, we treat 3D as a specialised form of video and unify both 3D and video modelling under a consistent sequence-to-sequence design. This means Kaleido learns its understanding of space, geometry, and physics directly from watching millions of videos, much as a human learns by observing the world. This foundational visual understanding, learned from large-scale video data, is then efficiently transferred and refined using structured, camera-labelled 3D datasets. The entire two-stage process takes place within a single, unified model, without any task-specific components, making Kaleido remarkably versatile, efficient, and easy to scale.
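To make the two-stage recipe concrete, here is a minimal conceptual sketch, not the actual training code: the model interface, the data loaders, and the `cameras=None` convention are illustrative assumptions; the only details taken from the text are that a single sequence-to-sequence model is first trained on raw video and then refined on camera-labelled 3D data.

```python
# Conceptual sketch only; `model.sequence_loss` and the loaders are
# hypothetical placeholders, not Kaleido's actual API.
def train_two_stage(model, video_loader, labelled_3d_loader, optimiser):
    # Stage 1: large-scale video pre-training. Frames are treated as an
    # ordered token sequence; no camera labels are assumed here.
    for frames in video_loader:
        loss = model.sequence_loss(frames, cameras=None)
        loss.backward()
        optimiser.step()
        optimiser.zero_grad()

    # Stage 2: refinement on structured, camera-labelled 3D datasets.
    # The same model and objective are reused; only the conditioning changes.
    for frames, cameras in labelled_3d_loader:
        loss = model.sequence_loss(frames, cameras=cameras)
        loss.backward()
        optimiser.step()
        optimiser.zero_grad()
```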
Kaleido is able to generate stunningly realistic renderings that capture everything from fine-grained material details to complex global lighting effects.
A primary application of Kaleido is generative rendering from a single image. Our process is simple: given one input image, Kaleido first generates 32 to 48 distinct novel views along a pre-defined camera path. Then, FiLM, our lightweight view-interpolation model, seamlessly stitches these keyframes together by predicting all the intermediate frames, creating a smooth, continuous video. The following examples, rendered at 30 FPS, showcase Kaleido's ability to produce high-quality, spatially consistent 4-to-6-second videos from a single image across a wide range of real-world and synthetic subjects.
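As an illustration, the following sketch outlines this two-stage pipeline under assumed interfaces: `kaleido.generate` and `film.interpolate` are hypothetical wrappers, since the actual APIs are not described here.

```python
# Minimal sketch of the single-image rendering pipeline described above.
def render_from_single_image(kaleido, film, image, camera_path, num_keyframes=48):
    """Generate a smooth novel-view video from one reference image."""
    # Stage 1: Kaleido generates 32-48 spatially consistent keyframes
    # along a pre-defined camera path, conditioned on the single image.
    keyframes = kaleido.generate(reference_images=[image],
                                 target_cameras=camera_path[:num_keyframes])

    # Stage 2: the lightweight FiLM view-interpolation model predicts the
    # intermediate frames between consecutive keyframes, yielding a smooth
    # 4-to-6-second video at 30 FPS.
    frames = []
    for prev, nxt in zip(keyframes[:-1], keyframes[1:]):
        frames.append(prev)
        frames.extend(film.interpolate(prev, nxt))
    frames.append(keyframes[-1])
    return frames
```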
*Note: None of the images and videos shown were used for model training. They are provided for research demonstration and non-commercial purposes only.
Kaleido's generative rendering quality scales impressively as more reference views are provided. Below, we compare Kaleido against several state-of-the-art rendering methods on the widely used NeRF Synthetic dataset (at 256px resolution) and LLFF (at 512px resolution), without using frame interpolation. Kaleido produces significantly higher-quality renderings with far less flickering than other generative models such as EscherNet and SEVA. In many-view settings, its quality matches that of per-scene optimisation methods like Instant-NGP and 3D Gaussian Splatting, achieving the accuracy of specialised scene-specific models within a single, general-purpose rendering framework.
Here, we probe Kaleido's performance upper bound by demonstrating its ability to render a long, 480-frame video of a complex real-world scene (the "Room" scene from Mip-NeRF 360), featuring a fast-moving camera trajectory with loop closure.
To generate this sequence, Kaleido was conditioned on just 8 widely-spaced reference images. The video was then generated autoregressively for 40 steps, with the model predicting 12 new frames at each step. To ensure long-term global consistency, within each generation step, the model attended to both the original 8 reference frames and the 4 spatially closest previously generated frames.
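The rollout described above can be summarised in pseudocode. The `kaleido.generate` call and the `pose_distance` helper are hypothetical; only the conditioning scheme (8 fixed reference frames, 12 new frames per step, plus the 4 spatially nearest previously generated frames) comes from the text.

```python
# Sketch of the autoregressive long-video rollout, under assumed interfaces.
def pose_distance(p, q):
    # Hypothetical metric, e.g. Euclidean distance between camera centres.
    return sum((a - b) ** 2 for a, b in zip(p.position, q.position)) ** 0.5

def generate_long_video(kaleido, reference_frames, reference_poses,
                        target_poses, frames_per_step=12, num_context=4):
    """480-frame rollout: 40 steps x 12 frames, anchored to 8 references."""
    generated_frames, generated_poses = [], []

    for start in range(0, len(target_poses), frames_per_step):
        chunk_poses = target_poses[start:start + frames_per_step]

        # Pick the previously generated frames whose cameras are spatially
        # closest to the current chunk, to preserve local continuity.
        order = sorted(range(len(generated_poses)),
                       key=lambda i: min(pose_distance(generated_poses[i], p)
                                         for p in chunk_poses))
        context_ids = order[:num_context]

        # Condition on the original references (global anchor) plus the
        # nearest generated frames (local continuity).
        cond_frames = reference_frames + [generated_frames[i] for i in context_ids]
        cond_poses = reference_poses + [generated_poses[i] for i in context_ids]

        new_frames = kaleido.generate(reference_images=cond_frames,
                                      reference_cameras=cond_poses,
                                      target_cameras=chunk_poses)
        generated_frames.extend(new_frames)
        generated_poses.extend(chunk_poses)

    return generated_frames
```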
The resulting video showcases Kaleido's ability to produce high-quality, spatially consistent novel views with minimal flickering, even under these challenging conditions. It successfully captures the scene's complex geometry, materials, and lighting, highlighting its strong generalisation capabilities and its potential for practical applications in virtual reality, telepresence, and 3D content creation.
Reference Images
Kaleido Output
We showcase Kaleido's precise multi-view rendering by using its generated views for 3D reconstruction with NeuS2, an off-the-shelf surface reconstruction framework (a sketch of this setup follows the comparison below). Below, we compare reconstructions from views generated by Kaleido against those from EscherNet. The results show that Kaleido's generated views lead to significantly higher-quality meshes. Notably, at 1024px resolution, Kaleido's reconstructions are nearly indistinguishable from the ground truth, capturing fine geometric details and producing sharp, realistic textures.
EscherNet [256 Res.]
Kaleido [256 Res.]
Kaleido [1024 Res.]
Ground-Truth
GSO-30 Dataset
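For readers who want to reproduce a similar evaluation, the sketch below shows one plausible way to hand Kaleido's generated views to a NeuS2-style reconstruction pipeline. The `kaleido.generate` call, the camera attributes, and the Instant-NGP-style transforms.json layout are all assumptions; consult the NeuS2 repository for its exact data format.

```python
# Illustrative export of Kaleido-generated views for surface reconstruction.
import json
from pathlib import Path

def export_for_reconstruction(kaleido, reference_images, reference_cameras,
                              target_cameras, out_dir="kaleido_views"):
    out = Path(out_dir)
    (out / "images").mkdir(parents=True, exist_ok=True)

    # Multi-view generation conditioned on the available reference views.
    views = kaleido.generate(reference_images=reference_images,
                             reference_cameras=reference_cameras,
                             target_cameras=target_cameras)

    frames = []
    for i, (img, cam) in enumerate(zip(views, target_cameras)):
        path = out / "images" / f"{i:04d}.png"
        img.save(path)  # assuming PIL-style image objects
        frames.append({"file_path": str(path),
                       "transform_matrix": cam.camera_to_world.tolist()})

    # NeRF/Instant-NGP-style camera file; field names are assumptions.
    with open(out / "transforms.json", "w") as f:
        json.dump({"camera_angle_x": target_cameras[0].fov_x,
                   "frames": frames}, f, indent=2)
```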
Kaleido's sequence-to-sequence design and its powerful data-driven prior give rise to several emergent capabilities not found in prior generative rendering models. To demonstrate this, we test Kaleido on unconventional inputs that are clearly out of distribution relative to our training data, such as image collages and colour-padded images.
As shown below, Kaleido is able to reasonably interpret the 3D structure and appearance of these unusual inputs, generating novel views that are spatially coherent and perceptually plausible. This strong generalisation highlights that Kaleido has learned a robust understanding of 3D space as a form of "visual common sense."
Reference Images
Kaleido Output
In this project, we introduced Kaleido, a new family of generative models that redefines neural rendering as a pure sequence-to-sequence problem, unifying 3D and video modelling. Through extensive ablations, we progressively modernised the architecture and training strategies, resulting in a model with exceptional rendering precision and spatial consistency. Kaleido exhibits strong scaling properties and achieves state-of-the-art performance across a wide range of view synthesis and 3D reconstruction benchmarks. Most notably, it is the first generative rendering model to match the quality of per-scene optimisation methods in a zero-shot setting, representing a significant step towards a universal, general-purpose rendering engine. Despite its strong performance, Kaleido has several limitations that open exciting avenues for future research:
Texture Flickering and Sticking. In certain challenging scenarios, we observe two main types of visual artefacts in Kaleido's generations. Texture flickering can occur in scenes with high-frequency details (e.g., the LLFF Fern scene), particularly at lower resolutions or when conditioned on very few reference views (e.g., a single view). We also occasionally observe texture sticking, where the generated sequence exhibits a discontinuous jump between frames. Improving spatial consistency in these most challenging settings remains an important direction for future work.
Fixed Camera Intrinsics. Kaleido currently does not model camera intrinsics, which prevents it from generating effects like dolly zooms, a capability present in models like SEVA. Future work could explore incorporating an intrinsics parameterisation, potentially via another RoPE-based positional encoding design, to allow for more flexible camera control.
Degraded Generations with Large Viewpoint Changes. While Kaleido maintains excellent spatial consistency, its generated views can sometimes lack semantic plausibility when the viewpoint change is extreme. This suggests that while video pre-training builds a strong geometric foundation, it may not provide the diverse semantic knowledge required for high-fidelity single-image realism. Integrating priors from large-scale text-to-image/video models could be a promising direction to address this limitation.
Towards Faster Rendering. Kaleido's generation time scales with the number of input views, and it is far from real-time. To fully bridge the gap with efficient scene-specific methods like 3D Gaussian Splatting, future work will focus on improving inference speed through techniques like step distillation or architectural optimisations.
Towards 4D Generation. Our unified positional encoding for space and time provides a natural foundation for true 4D generation. A promising future direction is to extend Kaleido to precisely control scenes across both space and time, enabling generative modelling of dynamic, four-dimensional worlds.
If you find this work useful in your own research, please consider citing the following.
@article{liu2025kaleido,
title={Scaling Sequence-to-Sequence Generative Neural Rendering},
author={Liu, Shikun and Ng, Kam Woh and Jang, Wonbong and Guo, Jiadong and Han, Junlin and Liu, Haozhe and Douratsos, Yiannis and Pérez, Juan C. and Zhou, Zijian and Phung, Chi and Xiang, Tao and Pérez-Rúa, Juan-Manuel},
journal={arXiv preprint},
year={2025}
}