Video Processing on the Web

Michael Ivanov
Aug 23, 2024


[Cover image generated with Leonardo.AI. Copyright CodeArtworks Ltd. 2024.]

In the modern web, video is king. YouTube, Canva, and thousands of video editing apps of every kind dominate the digital content creation landscape. Yet the technology underneath often falls short of the expectations of developers, who work hard in a competitive market to stand out among a multitude of solutions by delivering the best tools and smoothest experiences. In this blog post, I’ll share my thoughts on the current state of web-based video processing APIs, drawing from my practical experience developing video editing software in recent projects.

How do we usually make a judgment regarding web-based graphics or video editing software? We typically do so by comparing it to desktop tools. Even in the year 2024, the following statement still holds true: desktop video editing software is superior to web-based software in both features and performance. We cannot achieve native performance on the web not only because of the limitations of the JavaScript virtual machine but also because we cannot take advantage of many low-level graphics processing APIs such as CUDA and hardware-accelerated video encoders. Web browser vendors are reluctant to expose these APIs, even though it is technically possible to do so. Why? “For security reasons” is usually the official explanation. However, I recall that the decision not to extend WebGL 2.0 to support advanced features such as compute shaders was also based on “security considerations,” yet they have now shipped WebGPU, which brings support for compute shaders to web browsers.

Web technologies have made progress in improving multimedia APIs. For years, web developers relied on the HTMLMediaElement to play video, and playback was essentially all the early web video APIs were designed for. Today, however, the web is filled with video editing apps, many of which attempt to provide advanced features similar to professional desktop video editors. Unfortunately, the HTMLMediaElement is, in layman’s terms, simply not good enough for this kind of job. A common example is precise video seeking, a task at which it often fails. Stack Overflow is full of questions and proposed solutions, many of which don’t effectively solve the problem. The issue stems from the fact that HTMLMediaElement was never designed to provide frame-accurate seeking, because of the performance overhead it incurs.
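To make the problem concrete, here is a minimal sketch of the usual HTMLMediaElement seeking pattern. The helper names are mine; the point is that the API only lets you request a time in seconds and wait for the seeked event, with no guarantee about which exact frame ends up displayed.

```ts
// A minimal sketch of time-based seeking with HTMLMediaElement.
// There is no frame-accurate API here: you request a time in seconds and
// the browser decides which frame to present (fastSeek() is even allowed
// to snap to the nearest keyframe).
function seekTo(video: HTMLVideoElement, time: number): Promise<void> {
  return new Promise((resolve) => {
    video.addEventListener("seeked", () => resolve(), { once: true });
    video.currentTime = time;
  });
}

// Hypothetical usage: grab whatever frame the browser presents for t = 3.2s.
async function grabFrame(video: HTMLVideoElement, ctx: CanvasRenderingContext2D) {
  await seekTo(video, 3.2);
  ctx.drawImage(video, 0, 0); // may not be the exact frame at 3.2s
}
```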

To explain what we’re dealing with: seeking to a precise time in a video is not a lightweight task. Without getting too technical, here’s how it works: in the world of video codecs, a video stream consists of different types of frames. The key frames are called I-frames, and they contain all the information required to reconstruct a complete picture. Between them sit P-frames and B-frames, which carry only partial information about the frame they represent: a P-frame gets the rest of its data by referencing previous frames, while a B-frame can reference both previous and future frames.

To seek precisely, the decoding logic must first locate the nearest I-frame and then decode all the intermediate frames until it reaches the requested time. Depending on the distance between the keyframe and the requested time, this process can introduce noticeable delays, especially for high-definition or high-bitrate videos. The process involves decompressing each frame, which can be computationally expensive, particularly on lower-powered devices like smartphones, even though most modern mobile chipsets provide hardware-accelerated video decoding.
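Here is a sketch of that decode-forward logic at the packet level. The Packet shape and the decode callback are hypothetical stand-ins for a real demuxer and decoder, and real codecs also reorder B-frames between decode and presentation order, which I omit here.

```ts
// Hypothetical packet shape produced by a demuxer.
interface Packet {
  isKeyFrame: boolean; // true for I-frames
  timestamp: number;   // presentation time in microseconds
  data: Uint8Array;    // compressed bitstream data
}

// Frame-accurate seek: decode from the last keyframe before the target.
function preciseSeek(packets: Packet[], target: number, decode: (p: Packet) => void) {
  // 1. Find the last keyframe at or before the requested time.
  let start = 0;
  for (let i = 0; i < packets.length && packets[i].timestamp <= target; i++) {
    if (packets[i].isKeyFrame) start = i;
  }
  // 2. Decode every frame from that keyframe up to the target; intermediate
  //    frames are reference data and cannot be skipped. The cost grows with
  //    the keyframe interval, which is where the latency comes from.
  for (let i = start; i < packets.length && packets[i].timestamp <= target; i++) {
    decode(packets[i]);
  }
}
```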

Media players and browsers are designed to be responsive. If the seek operation requires decoding a large number of frames, it might introduce latency. To avoid such delays, media players often jump to the nearest keyframe and start playback from there, even if it’s not the exact requested time. This approach prioritizes responsiveness over precise seeking.

The standard HTML video API is designed to provide decent performance and a smooth experience, which is likely why the HTMLMediaElement lacks functionality for precise seeking. However, there has been a growing demand for such functionality from video editing software creators. This demand was likely noticed by the W3C, leading to the rollout of the WebCodecs API. I used the WebCodecs video decoder API while working on a project for one of my clients, and I’d like to share my thoughts in retrospect.

My use case was rather unconventional, as is often the case with the projects I participate in. The application that needed to consume the video decoding output was written in C++ and cross-compiled to WASM. Inside the WASM app, a WebGL renderer was supposed to receive the decoded video frame data, update the WebGL texture, and draw it to the canvas. Precise frame seeking was a basic requirement, so we considered either using a sophisticated native solution like LibAV or exploring other options.

The initial idea was to find a native solution that could be part of the WebAssembly engine system. FFmpeg WASM was not an option because we needed the decoder code to be integrated into the C++ codebase. Porting it would have taken a considerable amount of time, and we weren’t sure we could comply with its LGPL/GPL license terms, which can require releasing an application’s source code when the library is built into distributed software.

I attempted to cross-compile an AV1 decoder to WebAssembly, but the decoding was extremely slow, not even close to real time, primarily because the build lacked threading support (no Web Workers backing pthreads). Software AV1 implementations are in general slow compared to mature H.264 libraries like x264. Long story short, we couldn’t find a decent open-source C/C++ video decoder that could provide real-time decoding speed at Full HD resolutions, so we turned to explore browser-based tools.

The WebCodecs API makes a good impression overall. It’s a relatively low-level API that gives the user full control over the decoding process, including demultiplexing (demuxing) the video stream, feeding packets into the decoder sink, and retrieving the decoded frames. The user is also responsible for converting the output frame, which is provided in the YUV color space — the standard in video processing — to RGB, unless the output is used to update a WebGL texture, in which case the conversion is handled under the hood.
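For illustration, here is a minimal sketch of that flow using the actual WebCodecs interfaces. The codec string and the chunk contents are assumptions; in a real pipeline both come from the demuxer (for H.264 in MP4 you would typically also pass the avcC box as description in the configuration).

```ts
declare const chunkData: Uint8Array; // compressed packet from a demuxer (assumed)

const decoder = new VideoDecoder({
  output: (frame: VideoFrame) => {
    // frame.format is typically a YUV layout such as "I420".
    // Consume it (e.g., upload to a texture), then release it promptly.
    frame.close();
  },
  error: (e: DOMException) => console.error("decode error:", e),
});

decoder.configure({
  codec: "avc1.64001f", // H.264 High profile, level 3.1 (example value)
  codedWidth: 1920,
  codedHeight: 1080,
});

decoder.decode(new EncodedVideoChunk({
  type: "key",   // "key" for I-frames, "delta" for P-/B-frames
  timestamp: 0,  // microseconds
  data: chunkData,
}));

await decoder.flush(); // drain any buffered output
```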

However, the API doesn’t provide built-in demuxing functionality, which I consider a serious shortcoming. The rationale for not including a demuxing API is somewhat understandable. Video containers, like MPEG4, have many variations, some of which may deviate from the standard. Additionally, some applications, like camera streamers, deliver raw video bitstreams that don’t require demuxers. Still, I found it extremely inconvenient not to have an out-of-the-box demultiplexer, as I couldn’t find a good third-party implementation.

In fact, the only JavaScript-based demuxer I found was MP4Box.js, which is used in the WebCodecs API code examples. The issue is that this tool is more like an MP4 container inspector than a true demuxer; it can only parse the container serially from start to end, with no seeking or looping capabilities. While a developer could attempt to add the missing functionality, doing so is not easy without a solid knowledge of MP4 container specifications — something I suspect very few web developers possess. Implementing a demuxer from scratch is an even more daunting task.
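For what it does cover, MP4Box.js can at least feed WebCodecs for linear playback. Here is a rough sketch, assuming the decoder object from the previous snippet and a fetched MP4 file; the timescale arithmetic converts the container’s sample timestamps into the microseconds WebCodecs expects.

```ts
import MP4Box from "mp4box"; // npm package; also available as a script-tag global

const file = MP4Box.createFile();

file.onReady = (info: any) => {
  const track = info.videoTracks[0];
  file.setExtractionOptions(track.id);
  file.start(); // begin delivering samples to onSamples
};

file.onSamples = (_trackId: number, _user: any, samples: any[]) => {
  for (const s of samples) {
    decoder.decode(new EncodedVideoChunk({
      type: s.is_sync ? "key" : "delta",
      timestamp: (s.cts * 1_000_000) / s.timescale, // to microseconds
      duration: (s.duration * 1_000_000) / s.timescale,
      data: s.data,
    }));
  }
};

const buffer = await (await fetch("video.mp4")).arrayBuffer();
(buffer as any).fileStart = 0; // MP4Box.js requires a fileStart offset on each buffer
file.appendBuffer(buffer);
file.flush();
```

Notice there is no seek anywhere in this sketch; it runs the file front to back, which is exactly the limitation described above.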

As far as I know, web browsers like Chromium and Firefox use FFmpeg’s libav* libraries under the hood for video processing tasks originating from JavaScript, and those libraries include robust, battle-tested demuxers. Why this functionality isn’t exposed as an option in the WebCodecs API is unclear to me.

Eventually, we implemented a custom demuxer in C++ to parse the container in the WASM app. This demuxer would send the packets to a JavaScript layer, which would then forward them to the WebCodecs decoder. The decoded YUV frames would be sent back to the WASM app. This entire process was challenging to optimize, as moving data between WASM and JavaScript incurs noticeable overhead. Additionally, YUV to RGB conversion of the decoded data, even in WASM, is slow. In a native application, I would use a CUDA kernel or compute shader to perform this conversion quickly, but such options are not available on the web. This limitation led me to wonder why the WebCodecs API designers didn’t include color space conversion functionality that could be handled by the browser. Implicit conversion of VideoFrame data from YUV to RGB is supported for WebGL textures, so providing RGB output from the WebCodecs video decoder should not be a major issue.
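To illustrate the two halves of that trade-off, here is a sketch of both paths for consuming a decoded VideoFrame. The pointer in the second helper is a hypothetical offset obtained from the WASM app’s allocator; everything else uses the standard WebCodecs and WebGL surfaces.

```ts
// (a) Upload straight to a WebGL texture: the browser performs the
//     YUV -> RGB conversion for you, typically on the GPU.
function uploadToTexture(gl: WebGL2RenderingContext, tex: WebGLTexture, frame: VideoFrame) {
  gl.bindTexture(gl.TEXTURE_2D, tex);
  gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, gl.RGBA, gl.UNSIGNED_BYTE, frame);
  frame.close();
}

// (b) Copy the raw (still YUV) planes into the WASM heap for CPU-side work.
//     `ptr` is a hypothetical pointer returned by the WASM allocator.
async function copyToWasm(frame: VideoFrame, memory: WebAssembly.Memory, ptr: number) {
  const size = frame.allocationSize();
  const dst = new Uint8Array(memory.buffer, ptr, size);
  await frame.copyTo(dst); // plane layout follows the frame's format
  frame.close();
}
```

Path (a) is why an RGB output option from the decoder itself feels within reach; path (b) is where the WASM-to-JavaScript copy overhead and the CPU-side YUV to RGB cost show up.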

Ideally, I would expect a low-level API like this to be supported by the Emscripten project. If I could access it via Emscripten APIs, it would eliminate the need to develop such a complex flow, where some decoding functionality resides in WASM and some in JavaScript, thereby minimizing overhead. As of now, this support has not materialized, and I have no information on whether the Emscripten developers plan to support WebCodecs natively.

In the end, we managed to optimize the entire decoding pipeline to achieve solid real-time performance and efficient precise video seeking, but this required substantial engineering effort from developers with strong expertise in graphics and video processing. Therefore, I don’t see how the WebCodecs API, in its current state, can be easily adopted by typical front-end developers. It will likely take some time for robust high-level frameworks to be built around the API to provide all the functionalities I discussed here out of the box.
