OpenAI’s recent reveal of its stunning generative model Sora pushed the envelope of what’s possible with text-to-video. Now Google DeepMind brings us text-to-video games.
The new model, called Genie, can take a short description, a hand-drawn sketch, or a photo and turn it into a playable video game in the style of classic 2D platformers like Super Mario Bros. But don’t expect anything fast-paced. The games run at one frame per second, versus the typical 30 to 60 frames per second of most modern games.
“It’s cool work,” says Matthew Guzdial, an AI researcher at the University of Alberta, who developed a similar game generator a few years ago.
Genie was trained on 30,000 hours of video of hundreds of 2D platform games taken from the internet. Others have taken that approach before, says Guzdial. His own game generator learned from videos to create abstract platformers. Nvidia used video data to train a model called GameGAN, which could produce clones of games like Pac-Man.
Nvidia trained GameGAN with input actions (such as button presses on a controller), as well as video footage: a video frame showing Mario jumping was paired with the Jump action, and so on. Tagging video footage with input actions takes a lot of work, which has limited the amount of training data available.
In contrast, Genie and Guzdial’s model were both trained on video footage alone. Guzdial’s model learned level layouts and game rules, represented in code. In Genie’s case, the generative model learned a visual representation, which allows it to turn starter images into game levels. This approach turns countless hours of existing online video into potential training data.
Genie learned which of eight possible actions would cause the game character in a video to change its position. It generates each new frame of the game on the fly depending on the action the player takes. Press Jump, and Genie updates the current image to show the game character jumping; press Left and the image changes to show the character moved to the left. The game ticks along action by action, each new frame generated from scratch as the player plays.
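The frame-by-frame loop described above can be illustrated with a toy sketch in Python. This is not Genie's implementation: the real system is a learned generative model, while the `next_frame` function here is a hand-written stand-in, and the grid "image", the `@` character marker, and the meaning given to each of the eight action indices are all assumptions made for illustration.

```python
from typing import List, Tuple

Frame = List[List[str]]  # a tiny stand-in for an image: rows of characters

NUM_ACTIONS = 8  # the article reports eight possible actions

# Hypothetical meaning of each action index as an (dx, dy) offset, with y
# growing downward. A learned model would discover these from video; here
# they are simply written by hand.
ACTION_OFFSETS: List[Tuple[int, int]] = [
    (0, 0),    # 0: no-op
    (-1, 0),   # 1: left
    (1, 0),    # 2: right
    (0, -1),   # 3: up (jump)
    (0, 1),    # 4: down
    (-1, -1),  # 5: up-left
    (1, -1),   # 6: up-right
    (0, 0),    # 7: unused in this toy
]


def find_character(frame: Frame) -> Tuple[int, int]:
    """Return the (x, y) position of the '@' character in the frame."""
    for y, row in enumerate(frame):
        for x, cell in enumerate(row):
            if cell == "@":
                return x, y
    raise ValueError("no character in frame")


def next_frame(frame: Frame, action: int) -> Frame:
    """Stand-in for the generative model: build a new frame from the
    current frame and the chosen action, rather than editing in place."""
    if not 0 <= action < NUM_ACTIONS:
        raise ValueError(f"action must be in 0..{NUM_ACTIONS - 1}")
    height, width = len(frame), len(frame[0])
    x, y = find_character(frame)
    dx, dy = ACTION_OFFSETS[action]
    # Keep the character inside the frame.
    new_x = min(max(x + dx, 0), width - 1)
    new_y = min(max(y + dy, 0), height - 1)
    new = [["." for _ in range(width)] for _ in range(height)]
    new[new_y][new_x] = "@"
    return new


def play(start: Frame, actions: List[int]) -> List[Frame]:
    """Advance the game one action at a time, one new frame per action."""
    frames = [start]
    for action in actions:
        frames.append(next_frame(frames[-1], action))
    return frames


if __name__ == "__main__":
    start = [list("...."), list("...."), list(".@..")]
    for frame in play(start, [2, 3]):  # move right, then jump
        print("\n".join("".join(row) for row in frame), end="\n\n")
```

The point of the sketch is the shape of the loop: each frame depends only on the previous frame and the player's latest action, which is why the game advances one action at a time.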
Future versions of Genie could run faster. “There is no fundamental limitation that prevents us from reaching 30 frames per second,” says Tim Rocktäschel, a research scientist at Google DeepMind who leads the team behind the work. “Genie uses many of the same technologies as contemporary large language models, where there has been significant progress in improving inference speed.”
Genie learned some common visual quirks found in platformers. Many games of this type use parallax, where the foreground moves sideways faster than the background. Genie often adds this effect to the games it generates.
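For readers unfamiliar with parallax, the effect can be sketched in a few lines of Python. This shows how a conventional hand-coded game might compute it, not how Genie produces it (Genie picked the effect up from video). The layer names and speed factors are assumptions chosen for illustration.

```python
from typing import Dict


def layer_offsets(camera_x: float, factors: Dict[str, float]) -> Dict[str, float]:
    """Return the horizontal shift of each layer for a given camera position.

    Layers shift in the opposite direction to the camera. A factor of 1.0
    moves a layer at full speed; smaller factors move it more slowly,
    which makes it look farther away.
    """
    return {name: -camera_x * factor for name, factor in factors.items()}


# Hypothetical speed factors: the background moves at a quarter of the
# foreground's speed.
FACTORS = {"background": 0.25, "foreground": 1.0}

if __name__ == "__main__":
    for camera_x in (0.0, 4.0, 8.0):
        print(camera_x, layer_offsets(camera_x, FACTORS))
```

With these factors, moving the camera 8 units shifts the foreground by 8 units but the background by only 2, which produces the impression of depth.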
While Genie is an in-house research project and won’t be released, Guzdial notes that the Google DeepMind team says it could one day be turned into a game-making tool—something he’s working on too. “I’m definitely interested to see what they build,” he says.
Virtual playgrounds
But the Google DeepMind researchers are interested in more than just game generation. The team behind Genie works on open-ended learning, where AI-controlled bots are dropped into a virtual environment and left to solve various tasks by trial and error (a technique known as reinforcement learning).
In 2021, a different DeepMind team developed a virtual playground called XLand, in which bots learned how to cooperate on simple tasks such as moving obstacles. Sandboxes like XLand will be crucial for training future bots on a range of different challenges before pitting them against real-world scenarios. The video-game examples prove that Genie could be used to generate such virtual playgrounds.
Others have developed similar world-building tools. For example, David Ha at Google Brain and Jürgen Schmidhuber at the AI lab IDSIA in Switzerland developed a tool in 2018 that trained bots inside game-based virtual environments, which they called world models. But again, unlike Genie, these required the training data to include input actions.
The team demonstrated that this ability to learn actions from video alone is useful in robotics, too. When Genie was shown videos of real robot arms manipulating a variety of household objects, the model learned what actions the arms could perform and how to control them. Future robots could learn new tasks by watching video tutorials.
“It is hard to predict what use cases will be enabled,” says Rocktäschel. “We hope projects like Genie will eventually provide people with new tools to express their creativity.”
Correction: This article has been updated to clarify that Genie and XLand were developed by different teams and to clarify the similarities between Genie and Guzdial’s existing work.