Digital Disruption in the Film Industry – Part 4: Generative AI for Video Synthesis
Posted on April 20, 2025
These are still the early days of artificial intelligence (AI) applied to video, but applications have accelerated rapidly of late, and enough has happened to warrant comment. OpenAI’s Sora and Google’s Veo 2 are the current vanguards, implementing extraordinary innovations in AI video, while others such as Luma, Pika, and Runway remain very competitive. How will these developments disrupt the current paradigm of media production, and what do they mean for people wanting to use these new tools?
Generative AI for televisual video synthesis has become a major focus in my class EST 240 – Visual Rhetoric and Information Technologies. Although lectures have mainly focused on the signifying practices used in film, television, and YouTube channels, I am adding sections that cover the AI techniques and prompt engineering required for effective AI-generated media content. Connecting the vocabulary of televisual production to the possibilities of AI introduces students to new techniques that can enhance their careers.
This post introduces how AI is disrupting the media industry’s capacity to synthesize motion imagery. Generating video with AI requires not only creating visually plausible individual image frames but, crucially, forming a coherent sequence with consistent objects, characters, environments, and logical motion over time. AI models achieve this by learning intricate patterns, relationships, and temporal dynamics from massive datasets of existing video and image content.
AI can now take text prompts and generate full-motion video, thanks to a new class of computing models called text-to-video generative AI. These models interpret natural language descriptions and produce short video clips with varying levels of realism and coherence. These instructions guide the AI’s generation and synthesis of “tokens” – data points that are combined into the visual sequence through various algorithmic processes. Text prompts can be plugged into AI platforms like Runway, Pika Labs, or OpenAI’s Sora, and they often follow categories like those below (a short sketch assembling such a prompt follows the list).
Visuals: Earth modeled with digital/glowing textures
Focus: North and South America (you can add “seen from the western hemisphere”)
Grid Style: Spreadsheet-like structure wrapping or hovering around the globe
Mood: Futuristic, data-driven, cinematic
Motion: Rotating Earth, flickering grid lines, digital particles
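To make the structure concrete, here is a minimal Python sketch that assembles a prompt from the categories above. The category names and wording are illustrative only, not a required syntax for Runway, Pika, or Sora.

```python
# A minimal sketch of assembling a text-to-video prompt from the categories
# listed above. The category names and wording are illustrative, not a
# required syntax for any particular platform.
prompt_parts = {
    "visuals": "Earth modeled with digital, glowing textures",
    "focus": "North and South America, seen from the western hemisphere",
    "grid_style": "a spreadsheet-like grid wrapping around the globe",
    "mood": "futuristic, data-driven, cinematic",
    "motion": "slow rotation, flickering grid lines, drifting digital particles",
}

prompt = (
    f"{prompt_parts['visuals']}, focused on {prompt_parts['focus']}, "
    f"with {prompt_parts['grid_style']}. Mood: {prompt_parts['mood']}. "
    f"Motion: {prompt_parts['motion']}."
)
print(prompt)
```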
Developing the rhetorical language that can guide technical detail is important for harnessing the capabilities of generative AI. At a general level, we use the French meaning-making concepts of mise-en-scène and montage from film analysis to develop the understanding and language needed for visual prompts. More on that below, after a brief scan of previous work in this series.
The Continuity of Disruption
For decades, AI primarily existed within the realm of science fiction cinema, often depicted as threatening humanity. 2001’s HAL 9000, The Terminator’s Skynet, and the machines of the Matrix series served as cautionary but fantastical tales. Now, AI has transitioned from a narrative trope into a tangible technological force actively reshaping modern moviemaking and continuing the mode of disruption that began with the introduction of microprocessing power in the film and video industries.
In my last post, I discussed the first examples of computer special effects (F/X) in movies such as Westworld (1973) and Star Trek II: The Wrath of Khan (1982), based on NASA’s work with its Jet Propulsion Laboratory (JPL) to develop an imaging system for Mariner 4, the Mars explorer. Digital F/X continued with technology such as Pixar’s RenderMan, one of the first rendering software packages capable of using advanced rendering algorithms and shading techniques to create photorealistic images in CGI. It allowed filmmakers to achieve lifelike lighting, textures, and reflections, enhancing the realism of digital environments and characters and contributing to numerous Academy Award-winning visual effects.
AI-powered tools and software have transformed the field of visual effects (VFX) and animation, enabling filmmakers to create stunning, photorealistic CGI sequences and lifelike digital characters. AI algorithms can automate and streamline various aspects of the VFX production process, from rendering and compositing to motion capture and facial animation, saving time and resources while enhancing visual quality.
Before that, I posted on the transformation of post-production practices with the advent of non-linear editing (NLE) using AVID and other applications. I was there at the start of the NLE revolution, when the University of Hawaii became the first higher education institution to purchase an AVID NLE. I also used Clayton Christensen’s theory of disruptive innovation to describe how digital editing progressed from very basic, almost crude computer applications to the sophisticated, and now often very inexpensive, techniques available on devices like smartphones, tablets, and PCs.
I started with how cameras had moved from film to digital, including a discussion of charge-coupled device (CCD) technology developed initially for spy satellites and the development of cheaper, more energy-efficient complementary metal-oxide semiconductor (CMOS) technology for digital cameras. The 4K resolution achieved by the Red One camera rocked the film industry in 2007, and the same company’s 6K Red Dragon sensor, introduced in 2013, has since been extended into its KOMODO and RAPTOR series.
Although useful in several stages of movie-making and promotion, video synthesis is a cornerstone of AI’s disruptive potential in filmmaking and has been progressing over time. Deepfake technology was the first form of video synthesis to capture the public’s attention, using AI for face-swapping or recreating actors’ likenesses. Drawing on computer graphics, neural rendering has been used since around 2020 in visual effects (VFX) to create realistic textures, lighting, and animations. AI-assisted editing includes tools that automate scene cuts and color grading or suggest improvements. Virtual production is a term that includes AI for real-time rendering, facial tracking, and scene generation. Synthetic media involves AI-generated visuals, dialogue, or characters, such as the digital doubles and de-aged actors in Martin Scorsese’s The Irishman (2019).
Generative AI and Prompt Engineering for Video Synthesis
Generative AI companies such as OpenAI, maker of Sora, and New York City’s Runway are primarily focused on creating models and products for generating videos, images, and various multimedia content. AI is not a single, monolithic entity but rather a collection of rapidly changing technologies – including machine learning, natural language processing (NLP), computer vision, and sophisticated generative models – that are impacting nearly every facet of how films are conceived, created, and consumed. AI systems using machine learning algorithms and natural language processing have been used to generate and analyze scripts, develop story ideas, and create entirely new digital content.
AI-driven systems also analyze vast amounts of data, including audience preferences, trends, and historical box office performance, to inform content creation decisions and even predict potential commercial success. The pace of change has accelerated dramatically in recent years, propelled by breakthroughs in generative AI, particularly diffusion models capable of creating increasingly realistic images and video sequences. These tools are increasingly available for use if you know how to access and guide them.
Notice the guidelines in this information about prompts from a dedicated YouTube channel.
Several platforms have emerged as leaders in the text-to-video and image-to-video generation space. Google’s Imagen Video and Veo, Meta’s Make-A-Video, Pika, Runway’s Gen-3, and Stability AI’s Stable Video Diffusion currently offer some of the most innovative models. These platforms introduce entirely new techniques for audiovisual content creation, including generating synthetic actors or digital doubles; creating photorealistic VFX elements like environments or specific effects such as explosions and intense weather; synthesizing video directly from text or images; generating dialogue or sound effects; and performing digital de-aging or applying digital makeup.
The process is fairly straightforward. At one level, it follows the basic Turing computer model of input, processing, and output. You give a prompt like: “A dynamic 3D spreadsheet grid forms in space, encapsulating a glowing digital Earth. The Earth rotates slowly, with North and South America prominently displayed. The grid pulses with data streams and numbers, representing global analytics. Cinematic lighting with a futuristic blue and green palette, viewed from a slow-moving orbital camera.” At another level, the AI processes the request using a multimodal transformer model trained on text and video data to interpret the scene described by the prompt. It then outputs a short video (typically 2–20 seconds) showing that scene with motion, lighting, and camera movement.
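As a rough illustration of that input–processing–output flow, here is a hedged Python sketch. The VideoJob class, the generate() function, and the parameter names are hypothetical placeholders for illustration, not the actual API of Sora, Runway, or Pika.

```python
# An illustrative sketch of the input -> processing -> output flow.
# "VideoJob", "generate", and the parameter names are hypothetical
# placeholders, not the real API of any text-to-video platform.
from dataclasses import dataclass

@dataclass
class VideoJob:
    prompt: str
    duration_seconds: int
    resolution: str

def generate(job: VideoJob) -> str:
    # Processing stage: a real platform would pass the prompt through a
    # multimodal model and render a clip; here we simulate the round trip
    # and return a placeholder file path.
    print(f"Interpreting prompt: {job.prompt[:60]}...")
    return f"output_{job.resolution}_{job.duration_seconds}s.mp4"

job = VideoJob(
    prompt=("A dynamic 3D spreadsheet grid forms in space, encapsulating "
            "a glowing digital Earth..."),
    duration_seconds=10,
    resolution="1080p",
)
print(generate(job))   # Output stage: path to the synthesized clip
```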
Engines of Visual Generation
Conceptually, generative AI models are a class of AI systems specifically designed to create new, original data (text, images, audio, video, 3D models, etc.) that mimics the patterns and characteristics learned from their training data. Large Language Models (LLMs) like ChatGPT are particularly useful for researching and generating text that answers basic queries and research questions. Unlike discriminative models that classify or predict based on input, generative models learn the underlying distribution of the data to synthesize novel outputs. The process typically involves encoding the text prompt into a meaningful representation (often using models like CLIP, a neural network that efficiently learns visual concepts from natural language supervision), which then conditions or guides the generative model (usually a diffusion model) during the video synthesis process.
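For readers curious about that encoding step, the sketch below uses the open-source CLIP text encoder to turn a prompt into the kind of embedding that conditions a video diffusion model. It assumes the Hugging Face transformers library and the public openai/clip-vit-base-patch32 checkpoint; both are choices made here for illustration.

```python
# A minimal sketch of encoding a text prompt into an embedding with CLIP,
# assuming the Hugging Face transformers library and the public
# "openai/clip-vit-base-patch32" checkpoint. Video generators typically use
# an encoder like this to condition the denoising network on the prompt.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

prompt = "a glowing digital Earth wrapped in a spreadsheet-like grid"
tokens = tokenizer([prompt], padding=True, return_tensors="pt")

with torch.no_grad():
    # last_hidden_state holds one embedding per token; this sequence of
    # vectors is what guides the diffusion model during synthesis.
    embeddings = text_encoder(**tokens).last_hidden_state

print(embeddings.shape)  # (1, num_tokens, 512) for this checkpoint
```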
Several core machine learning architectures serve as the engines of visual generation underpinning modern AI video generation. Key architectures include Generative Adversarial Networks (GANs), Diffusion Models, Variational Autoencoders (VAEs), and Transformers and RNNs (LSTMs). Each has specific strengths and weaknesses in generating different types of media.
GANs consist of two neural networks — a generator that creates synthetic data (images/video frames) and a discriminator that tries to distinguish between real and synthetic data. Through this adversarial process, the generator learns to produce increasingly realistic outputs. GANs are known for generating sharp, detailed images and can be relatively fast at generation once trained. However, they can be notoriously difficult to train stably and may suffer from “mode collapse,” where the generator produces only a limited variety of outputs. While used in some video synthesis approaches, they have been largely superseded by diffusion models for state-of-the-art results.
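The adversarial setup can be summarized in a few lines of PyTorch. This is a toy sketch of one discriminator step and one generator step on random stand-in data, not a production video model; the layer sizes and learning rates are arbitrary.

```python
# A compact, illustrative GAN sketch in PyTorch: a generator maps random
# noise to a fake "frame" and a discriminator scores real vs. synthetic
# samples; the two are trained adversarially.
import torch
import torch.nn as nn

latent_dim, frame_pixels = 64, 32 * 32

generator = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, frame_pixels), nn.Tanh(),        # fake frame in [-1, 1]
)
discriminator = nn.Sequential(
    nn.Linear(frame_pixels, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),                # probability "real"
)

bce = nn.BCELoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

real_frames = torch.rand(16, frame_pixels) * 2 - 1  # stand-in for training data

# Discriminator step: learn to separate real frames from generated ones.
fake_frames = generator(torch.randn(16, latent_dim)).detach()
d_loss = (bce(discriminator(real_frames), torch.ones(16, 1)) +
          bce(discriminator(fake_frames), torch.zeros(16, 1)))
d_opt.zero_grad()
d_loss.backward()
d_opt.step()

# Generator step: learn to fool the discriminator.
fake_frames = generator(torch.randn(16, latent_dim))
g_loss = bce(discriminator(fake_frames), torch.ones(16, 1))
g_opt.zero_grad()
g_loss.backward()
g_opt.step()
```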
Diffusion models are the class of models that has become dominant in high-quality image and video generation. The process involves two stages: first, gradually adding noise to training data over many steps until it becomes pure noise; then, training a model (typically a U-Net architecture) to reverse this process, starting from noise and iteratively removing it (denoising) to generate a clean sample. Diffusion models generally produce higher-quality and more diverse outputs than GANs, often achieving superior realism, and they offer more stable training. The main drawback is the significantly slower generation speed due to the iterative denoising process, which can be thousands of times slower than GANs. Latent Diffusion Models (LDMs) partially address this by performing the diffusion process in a lower-dimensional “latent space” created by an encoder (like a VAE), making it more computationally efficient.
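The two stages can be sketched in toy form. The code below builds a simple noise schedule, defines the forward noising step, and runs the iterative reverse (denoising) loop, with an untrained placeholder network standing in for the U-Net a real model would use.

```python
# A toy sketch of the two diffusion stages: forward noising of data, then
# iterative denoising from pure noise. The "denoiser" is an untrained
# placeholder standing in for a real model's U-Net.
import torch
import torch.nn as nn

steps = 50
betas = torch.linspace(1e-4, 0.02, steps)           # noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def add_noise(x0, t):
    """Forward process: blend clean data with Gaussian noise at step t."""
    noise = torch.randn_like(x0)
    return alphas_bar[t].sqrt() * x0 + (1 - alphas_bar[t]).sqrt() * noise, noise

denoiser = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))

# Reverse process: start from pure noise and iteratively remove predicted noise.
x = torch.randn(1, 16)
for t in reversed(range(steps)):
    predicted_noise = denoiser(x)                    # a trained U-Net in practice
    alpha = 1.0 - betas[t]
    x = (x - betas[t] / (1 - alphas_bar[t]).sqrt() * predicted_noise) / alpha.sqrt()
    if t > 0:
        x = x + betas[t].sqrt() * torch.randn_like(x)  # re-inject a little noise
print(x.shape)                                       # a "clean" sample emerges
```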
Variational Autoencoders (VAEs) are generative models that have been repurposed for modern generative pipelines. They learn to encode data into a compressed latent representation and then decode it back. While they can generate images on their own, the results are sometimes blurrier than GAN outputs. Their primary role in modern video synthesis is as the encoder and decoder components within Latent Diffusion Models, enabling efficient processing in the latent space. They have also been explored for predicting motion in video generation and for generating particular aspects of images.
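A minimal VAE sketch shows the encode–decode round trip. In a latent diffusion pipeline, the denoising loop above would operate on latents like z rather than on raw pixels; the tiny network here is illustrative only.

```python
# A minimal VAE sketch in PyTorch: the encoder compresses a frame into a
# small latent vector and the decoder reconstructs it. In latent diffusion,
# denoising happens on latents like `z` rather than on raw pixels.
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    def __init__(self, pixels=32 * 32, latent=8):
        super().__init__()
        self.enc = nn.Linear(pixels, 64)
        self.mu, self.logvar = nn.Linear(64, latent), nn.Linear(64, latent)
        self.dec = nn.Sequential(nn.Linear(latent, 64), nn.ReLU(),
                                 nn.Linear(64, pixels), nn.Sigmoid())

    def forward(self, x):
        h = torch.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterize
        return self.dec(z), mu, logvar

frame = torch.rand(1, 32 * 32)                 # stand-in for a video frame
reconstruction, mu, logvar = TinyVAE()(frame)
print(reconstruction.shape, mu.shape)
```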
Transformers and RNNs (LSTMs) are architectures that excel at processing sequential data. Transformers, particularly models like CLIP (Contrastive Language-Image Pretraining), are crucial for understanding the relationship between text prompts and visual content, enabling effective text-to-image and text-to-video generation by guiding the diffusion process. Vision Transformer (ViT) blocks are often integrated within the U-Net architecture of diffusion models. Recurrent Neural Networks (RNNs), such as LSTMs, have been used in earlier or alternative video generation models to help maintain temporal consistency across frames.
Temporal Consistency
The central challenge of coherent motion is achieving temporal consistency – ensuring that objects, characters, lighting, and motion remain coherent and believable from one frame to the next throughout the video sequence. Without it, videos can appear jittery or nonsensical, or suffer from flickering artifacts. Diffusion models employ several techniques to address this critical hurdle for AI video generation. One is the 3D U-Net, an architecture that extends the standard 2D U-Net used in image diffusion by incorporating a temporal dimension. Convolutions and attention mechanisms are factorized to operate across both space (within a frame) and time (across frames).
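The factorization can be illustrated with a small PyTorch module: a spatial convolution that mixes pixels within each frame, followed by a temporal convolution that mixes the same location across frames. The layer sizes are arbitrary and chosen only for illustration.

```python
# A sketch of the "factorized" space/time idea behind 3D U-Nets: a spatial
# convolution applied within each frame followed by a temporal convolution
# applied across frames, on a (batch, channels, time, height, width) tensor.
import torch
import torch.nn as nn

class FactorizedSpatioTemporalConv(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # kernel (1, 3, 3): mixes pixels within a single frame only
        self.spatial = nn.Conv3d(channels, channels, (1, 3, 3), padding=(0, 1, 1))
        # kernel (3, 1, 1): mixes the same pixel location across frames only
        self.temporal = nn.Conv3d(channels, channels, (3, 1, 1), padding=(1, 0, 0))

    def forward(self, x):
        return self.temporal(torch.relu(self.spatial(x)))

video = torch.randn(1, 16, 8, 64, 64)   # batch, channels, 8 frames, 64x64
print(FactorizedSpatioTemporalConv(16)(video).shape)
```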
Another technique is temporal attention. These are specific layers added to the network architecture that allow different parts of a frame to “attend to,” or share information with, corresponding parts in other frames, explicitly modeling temporal relationships. Different attention strategies exist, such as attending to the same spatial location across all frames (temporal attention) or attending only to past frames (causal attention).
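A temporal attention layer can be sketched as standard multi-head attention run across the frame axis, with the spatial tokens folded into the batch so each location attends only to itself over time. The dimensions are again illustrative.

```python
# A sketch of a temporal attention layer: each spatial location attends to
# the same location in every other frame. Input is (batch, frames, tokens, dim),
# where "tokens" are spatial patches of a frame.
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        b, f, t, d = x.shape
        # Fold spatial tokens into the batch so attention runs over frames only.
        x = x.permute(0, 2, 1, 3).reshape(b * t, f, d)
        out, _ = self.attn(x, x, x)
        return out.reshape(b, t, f, d).permute(0, 2, 1, 3)

frames = torch.randn(1, 8, 64, 128)      # 8 frames, 64 patches each, dim 128
print(TemporalAttention(128)(frames).shape)
```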
Frame interpolation is a video processing technique that creates new intermediate image frames between existing ones, improving perceived video quality by increasing the frame rate. It can be achieved through motion estimation, which calculates the motion vectors of pixels or blocks of pixels between frames. These vectors track how pixels or blocks move from one frame to the next, and an algorithm uses them to predict how objects should move between frames and to generate new frames that follow the estimated motion paths.
A simpler approach blends the existing frames to create a new one, although this can produce blurring or ghosting effects. Morphing, which shapes objects from one frame into the next, is another option but can be computationally intensive.
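Here is a simple NumPy sketch of the blend-based baseline. A production interpolator would instead warp pixels along estimated motion vectors; plain blending is exactly what produces the ghosting mentioned above.

```python
# A simple sketch of frame interpolation by blending, the baseline approach
# described above. Real interpolators estimate motion vectors and warp pixels
# along them; plain blending can produce ghosting.
import numpy as np

def blend_interpolate(frame_a: np.ndarray, frame_b: np.ndarray, t: float) -> np.ndarray:
    """Return an intermediate frame at position t (0..1) between two frames."""
    return ((1.0 - t) * frame_a + t * frame_b).astype(frame_a.dtype)

frame_a = np.zeros((64, 64, 3), dtype=np.float32)   # stand-in frames
frame_b = np.ones((64, 64, 3), dtype=np.float32)
middle = blend_interpolate(frame_a, frame_b, 0.5)   # doubles the frame rate
print(middle.mean())                                 # 0.5
```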
A related technique is hierarchical upsampling, which operates like building a movie from a storyboard to provide high-resolution, temporally coherent video. It is a core design principle in advanced generative models like Sora, Runway Gen-2, and Pika for scaling from idea to realistic video. The model starts with a sparse set of low-resolution keyframes that capture the core semantics of the scene – motion, structure, and general composition – and then generates new, intermediate frames (using the model itself or simpler interpolation), progressively adding more frames per second. It is easier to generate low-resolution previews first and refine details afterward: noise prediction and diffusion are applied to enhance detail, and these upsampling stages refine the spatial resolution and temporal consistency before investing full computation in the final high-resolution output (720p or 1080p at 24–30 fps) with fine textures, lighting, shadows, and subtle motion.
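The coarse-to-fine idea can be mocked up as a short pipeline: generate sparse low-resolution keyframes, interpolate to raise the frame rate, then upsample spatially. The generate_keyframes, interpolate_frames, and upsample functions below are placeholders standing in for model stages, not a real API.

```python
# An illustrative coarse-to-fine pipeline mirroring hierarchical upsampling:
# sparse low-res keyframes, then temporal interpolation, then spatial
# upsampling. All three functions are placeholders for model stages.
import numpy as np

def generate_keyframes(prompt: str, n: int = 4, size: int = 64) -> np.ndarray:
    return np.random.rand(n, size, size, 3)            # stand-in for a diffusion model

def interpolate_frames(frames: np.ndarray, factor: int = 4) -> np.ndarray:
    out = []
    for a, b in zip(frames[:-1], frames[1:]):
        for i in range(factor):
            t = i / factor
            out.append((1 - t) * a + t * b)             # temporal super-resolution
    out.append(frames[-1])
    return np.stack(out)

def upsample(frames: np.ndarray, scale: int = 4) -> np.ndarray:
    return frames.repeat(scale, axis=1).repeat(scale, axis=2)  # spatial super-resolution

keyframes = generate_keyframes("glowing digital Earth in a data grid")
video = upsample(interpolate_frames(keyframes))
print(video.shape)   # (13, 256, 256, 3): more frames, higher resolution
```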
Newer models like Sora aim to build more sophisticated internal representations of the world, including basic physics and object permanence, to generate more consistent and plausible motion and interactions. Despite progress, challenges remain. Maintaining high quality and consistency often becomes harder as the desired video length increases. Fine details, such as text legibility or the accurate depiction of complex objects like hands, can still be problematic, resulting in garbled or distorted outputs.
The quest for robust temporal consistency represents the next major frontier in AI video generation. While image generation models now produce stunningly realistic static visuals, the true utility of AI video in professional filmmaking hinges on its ability to create not just beautiful frames, but coherent, believable sequences. The techniques being developed—temporal attention, 3D architectures, world models—are direct responses to this fundamental challenge.
The qualitative difference between various AI video models often lies precisely in their capacity to handle motion, object persistence, and logical progression over time. Consequently, advancements in ensuring temporal coherence will be the primary driver determining how quickly and effectively AI video transitions from generating short, experimental clips to becoming a practical tool for longer-form narrative filmmaking. Overcoming current limitations, such as the occasional physics glitches or inconsistencies observed even in advanced models, is paramount. This area is where significant research, development, and competitive differentiation among AI platforms will likely occur in the near future.
That’s a Wrap!
As a continuation of my focus on disruption in the film industry, this post discusses the rapid advancements in AI-generated video and its growing impact on the media industry, particularly filmmaking. Generative AI models like Sora and Google’s Veo 2 are making significant strides in creating realistic and coherent video from user-initiated text prompts. The post emphasizes the importance of teaching “prompt engineering” (crafting effective text instructions for AI) in media production courses, connecting it to traditional filmmaking concepts like mise-en-scène and montage.
A major hurdle in AI video generation is achieving temporal consistency, which means ensuring that objects, characters, and motion remain believable and coherent across video frames. It explains the workings of key AI architectures used in video synthesis, including GANs, Diffusion Models, VAEs, and Transformers. It highlights the dominance of Diffusion Models in achieving high-quality results. The post details specific techniques used to address the challenge of creating coherent motion, such as 3D U-Nets, temporal attention layers, frame interpolation, and hierarchical upsampling. The next major frontier in AI video generation is improving temporal consistency and overcoming limitations like inconsistencies and artifacts, especially in longer-form video.
In the meantime, many people can prepare for the opportunities inherent in AI-generated video production by developing the understanding and vocabulary of televisual production that they will need to effectively prompt the moving-image content they want.
Citation APA (7th Edition)
Pennings, A.J. (2025, Apr 20). Digital Disruption in the Film Industry – Part 4: Generative AI for Video Synthesis. apennings.com https://apennings.com/ditigal_destruction/digital-disruption-in-the-film-industry-part-4-generative-ai-for-video-synthesis/
© ALL RIGHTS RESERVED

Tags: Diffusion Models > Generative Adversarial Networks (GANs) > generative AI > prompt engineering > text-to-video > Variational Autoencoders (VAEs)