Digital Video Decoded

First published in 1997. NOTE: This article is more than ten years old! It's out of date! Don't copy and paste it into your homework! Last modified 03-Dec-2011.
The rise of digital video editing and recording has brought two parts of the high technology world together. The computer geeks have found themselves reading the same catalogues as the video experts, and both are finding that their expertise in one field doesn’t necessarily transfer to the strange, hybrid world of digital video. This article’s aimed at getting both groups up to speed, by discussing things that computer and video experts, respectively, take for granted as common knowledge.
Computer folks know what resolution is. If your monitor’s set to show 800 by 600 pixels, you know for a fact that that’s EXACTLY what you get. And yes, you can point to a single pixel, though it may take a magnifying glass to do it.
Analogue video is fuzzier, in the literal as well as the conceptual sense. Its horizontal resolution can be defined in terms of the maximum number of alternating vertical black and white lines that can be displayed without becoming unacceptably blurred, but only after you decide when adjacent lines are to be regarded as "unacceptably blurred". Nobody has ever reached a consensus on how to do that.
A better way to measure analogue video resolution is with bandwidth. Bandwidth, expressed in megahertz (MHz), is how much space the video signal takes up. For comparison purposes, you can use the bandwidth of the signal going to the monitor, without any fancy encoding such as is used to make video smaller on tape and easier to transmit, or you can use the smaller encoded bandwidth if you’re comparing identically encoded signals. Bandwidth can be defined precisely: the edge of the signal bandwidth lies at the point where a test signal recorded in the format is down one decibel (dB) in intensity from what it should be. This eliminates subjective judging of fuzzy black and white lines.
Fuzzy line estimates, however, are still in common use, because they tend to make for more impressive numbers – especially if you do the fuzzy-line estimate while drunk and squinting, then run the bandwidth-to-line resolution equation backwards and convert your highly optimistic line resolution estimate into an inflated bandwidth figure.
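The bandwidth-to-line conversion itself is just arithmetic. Here’s a sketch of the usual rule of thumb; the active line times and the factor-of-two cycle counting are nominal textbook figures, not numbers from this article:

```python
def tv_lines_per_mhz(active_line_us, aspect=4 / 3):
    """TV lines (per picture height) resolvable per MHz of luminance bandwidth.

    One cycle of the test signal covers one black and one white line,
    hence the factor of two.  Real-world numbers depend on where you
    draw the 'unacceptably blurred' threshold."""
    return 2 * 1e6 * (active_line_us * 1e-6) / aspect

# Nominal active line times: PAL ~52 us, NTSC ~52.7 us.
pal_lines_per_mhz = tv_lines_per_mhz(52.0)    # ~78 lines per MHz
ntsc_lines_per_mhz = tv_lines_per_mhz(52.7)   # ~79 lines per MHz

# So a 3MHz "VHS-grade" luminance signal resolves roughly:
print(round(3 * pal_lines_per_mhz))   # ~234 lines
```

Run the same sums backwards from a generously squinted line count and you get the inflated bandwidth figures mentioned above.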
And then there’s vertical resolution. The vertical resolution of a video signal would seem to be firmly fixed – it's the number of scanlines, full stop, right? It Is Written that NTSC video has 525 lines interlaced at 60 fields (30 frames) per second, and PAL video has 625 lines similarly interlaced at 50 fields (25 frames) per second. But, of course, it’s not that simple.
Canonically, at standard overscan, 575 lines are visible in PAL and 485 in NTSC; the rest of the lines aren't used for video signals. The lost lines are where closed captioning lives, for instance, and a few are lost to the vertical blanking interval while the electron beam retraces to the top of the screen.
Of the lines that are used for video, some may not be clearly visible on a given monitor, you may see fewer if your vertical size control is set high, and lines may fall victim to a restricted-resolution storage format. Lines may also be doubled by funky TVs, which with digital jiggery-pokery can fool you into thinking you’re looking at genuine broadcast detail, not clever interpolation.
Your average semi-professional digital video dude labours under the misconception that regular "clean VHS" grade video is 640 by 480 pixels in 24 bit colour. Well, that might be what your computer digitises, but it’s definitely not what the original source contains.
Analogue video comes in two basic flavours – component and composite. Component video uses multiple signals to build a picture, composite rolls it all into one. The whole idea of component video is to retain full bandwidth, so (theoretically) no detail is lost in storage or transmission.
Pro video is component, with three channels. When people talk about component video, they mean this version. The most straightforward version of three channel component video is RGB, with separate channels for the red, the green and the blue information in the video signal. Combine red, green and blue data in the standard additive colour model and you get your colour picture.
RGB is the same basic system used to feed all modern computer monitors, but it’s practically never used for recording, because leaving the space-hog green data in there makes RGB take up 50% more bandwidth than the less straightforward version of component video – colour difference.
In colour difference component video, the first channel is luminance. Notated Y, the standard abbreviation for intensity, the luminance is the signal’s brightness information with no colour data. Y by itself gives a black and white picture. The other two channels are called colour difference signals – they’re notated R-Y and B-Y, and are the difference between red and the luminance and the difference between blue and the luminance, respectively. The colour difference channels can be algebraically recombined with the luminance to give a full colour picture, without having to transmit the green data that, on most video, takes up more bandwidth than the other two colours put together (on average, green is 59% of a video signal).
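The algebra is simple enough to sketch. The weights below are the standard Rec. 601 luminance coefficients (green’s 0.587 is the 59% mentioned above); the code is an illustration, not any particular piece of equipment:

```python
# Rec. 601 luminance weights: green carries ~59% of the Y signal.
KR, KG, KB = 0.299, 0.587, 0.114

def encode(r, g, b):
    """RGB -> (Y, R-Y, B-Y) colour difference components."""
    y = KR * r + KG * g + KB * b
    return y, r - y, b - y

def decode(y, r_y, b_y):
    """Recover RGB; green is reconstructed algebraically, never sent."""
    r = r_y + y
    b = b_y + y
    g = (y - KR * r - KB * b) / KG
    return r, g, b

y, r_y, b_y = encode(0.2, 0.7, 0.4)
r, g, b = decode(y, r_y, b_y)   # back to (0.2, 0.7, 0.4), to rounding error
```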
Prosumer Y/C video is sorta-kinda component, with two channels, but is never referred to as component video, for two reasons. One, everyone knows what component video is and letting $3000 camcorders into the club would cause confusion. And two, all Y/C does is extend the separate chrominance and luminance signals that most low-end video formats already encode into the cable connections. Every VHS VCR records video as separate luminance and chrominance; SVHS and Hi-8 ones with Y/C output can send and receive this two-channel video without squishing it into composite. Y/C is better than composite, but still much worse than component.
The chrominance signal Y/C uses is PAL or NTSC-modulated, ready for integration into a composite or RF modulated signal in the appropriate format. This makes Y/C video inherently PAL or NTSC, while real component video is format-agnostic – it can be encoded into any format you like. Y/C, or S-Video as it’s also known, is used only by SVHS, Hi8 and ¾" SP equipment.
The "smallest" standard analogue video format is composite. Composite video combines the luminance and chrominance components into one signal, encoded differently depending on whether the system is built to the PAL, NTSC or SECAM standards. Composite video is transmissible on a single two conductor cable, but it restricts the image bandwidth (detail) and is very difficult to decode into pure luminance and chrominance.
There is a very large amount of data in a high resolution video signal. In order to fit the signal into a relatively small amount of bandwidth, some form of data-reduction encoding has to be performed.
Computer geeks tend to be under the impression that data compression, especially the "lossy" compression also known as data reduction that made desktop video feasible, is a relatively new field of endeavour. They also think that compression has to be digital. This is incorrect; all analogue video formats use lossy encoding designed to minimise the amount of bandwidth occupied by the signal while maintaining a given quality level. They just do it with infinitely variable voltages, not strings of zeroes and ones.
Digital compression works better than analogue compression, because it’s got more brains. As far as an analogue system is concerned, a video signal is just a darn great waveform, which includes timing information to make sure the display system sweeps the electron beam to the start of the next line on time and back up to the top left corner after each field has been displayed. An analogue system can have no comprehension of the picture it’s displaying beyond the vaguest information, like overall brightness and the frequency characteristics at a given point. So analogue compression is based on throwing away some amount of image detail in a dumb, unconsidered way, and you can’t get a great deal of compression that way without losing an unacceptable amount of detail.
Digital systems, on the other hand, work on each field or frame as a single image (intraframe or spatial compression), and may also take into account the images before and after it (interframe/interfield temporal compression). This lets digital systems more intelligently determine what parts of an image are full of important detail and what parts aren’t. They can also spot similar areas in single frames or, with interframe, strings of frames, and describe these areas in terms of each other so that all of the data needed to draw the area everywhere it appears doesn’t have to be stored. This is why digital compression systems can fit a visually indistinguishable signal into less than a sixth of the bandwidth of the analogue original.
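A toy illustration of the temporal half of this – real codecs use motion estimation and lossy transforms rather than exact pixel deltas, so treat this strictly as a sketch of the idea of describing frames in terms of each other:

```python
def interframe_encode(frames):
    """Toy temporal compression: store the first frame whole (intraframe),
    then, for each later frame, only the pixels that changed."""
    encoded = [("key", list(frames[0]))]
    for prev, cur in zip(frames, frames[1:]):
        delta = [(i, v) for i, (p, v) in enumerate(zip(prev, cur)) if p != v]
        encoded.append(("delta", delta))
    return encoded

def interframe_decode(encoded):
    frames, prev = [], None
    for kind, data in encoded:
        if kind == "key":
            prev = list(data)
        else:
            prev = list(prev)
            for i, v in data:   # patch only the changed pixels
                prev[i] = v
        frames.append(prev)
    return frames

clip = [[1, 1, 1, 1], [1, 1, 2, 1], [1, 1, 2, 3]]   # mostly static "video"
assert interframe_decode(interframe_encode(clip)) == clip
```

Note how the mostly-static clip stores only two changed pixels after the first frame – and how editing frame two would invalidate every delta that follows it, which is exactly the editing problem discussed later.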
Digital compression has a number of issues that parallel those analogue video users have always had to confront, but digital can give the user more control. In analogue, there’s a limited amount of manipulation that can be performed to make an image look better or worse, and practically nothing at the actual storage point. Sure, you can tweak a variety of infinitely variable knobs on cameras, mixing desks and filter boxes, but the tape system just does its best with what it’s given.
In digital, on the other hand, every system lets you adjust the compression rate. More compression means more video fits on a drive and less data has to be pumped around per second, but also means lower image quality. It’s roughly analogous to changing the tape speed of an analogue system; a regular VHS VCR in long play mode may be able to fit twice as much on a tape, but quality is lost.
Exactly how much quality is lost when you crank up digital compression, however, is hard to say. Digital compression quality is about as quantifiable as analogue resolution.
Advertisements for digital video products like to tout the lowest compression rate Board X can work at, as a ratio figure – say, 3:1. This indicates a product’s ability to handle tremendous data rates (3:1 is about half a gigabyte per minute). This is good for impressing your friends, but doesn’t, in and of itself, tell you much about the image quality. You can only use raw ratio figures to compare compression systems if Compression Rate X looks the same on everyone’s hardware. And it doesn’t.
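That half-a-gigabyte figure is easy to check, taking uncompressed ITU-R 601 video at roughly 21 megabytes per second as the baseline:

```python
UNCOMPRESSED_601_MB_PER_S = 21   # uncompressed ITU-R 601, as quoted in the text

def data_rate_mb_per_min(ratio):
    """Data rate implied by a compression ratio against an ITU-R 601 source."""
    return UNCOMPRESSED_601_MB_PER_S / ratio * 60

print(data_rate_mb_per_min(3))   # 420 MB/min -- "about half a gigabyte"
```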
Most digital video products available today use some variant on editable MPEG or Motion JPEG image compression – but all variants are not alike, and a board with more processing power can do more intelligent compression, making a lower data rate signal that produces a visually indistinguishable picture. More recent wavelet compression based boards raise the bar again, with a different compression system.
One company’s 4:1 24 bit colour wavelet compression is most definitely not the same as another company’s 4:1 16 bit colour M-JPEG compression; the latter will have notably worse compression artefacts, and its lousy number of possible colours (16 bit is 5 bit red, 5 bit blue, 6 bit green, for 65,536 possible colours) will make unprocessed video look slightly rougher and invite horrid banding effects when mixing sources. Output to composite can hide a multitude of sins, but poor mixing of 16 bit sources pushes the defect-blurring abilities of even an elderly VHS tape.
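The 16 bit packing is worth seeing concretely. A minimal sketch of the usual 5-6-5 layout – green keeps the extra bit because the eye is most sensitive to it:

```python
def pack565(r, g, b):
    """Pack 8-bit-per-channel RGB into one 16-bit 5-6-5 word."""
    return ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)

def unpack565(word):
    """Expand back to 8 bits per channel; the discarded low bits are gone
    for good, which is where the banding comes from."""
    r = (word >> 11) & 0x1F
    g = (word >> 5) & 0x3F
    b = word & 0x1F
    return (r << 3, g << 2, b << 3)

assert unpack565(pack565(255, 255, 255)) == (248, 252, 248)  # detail lost
assert 2 ** 16 == 65536                                      # total colours
```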
Remember this: a simple compression ratio is an easily remembered, readily comparable statistic which is based on a large number of complex variables. Like most such statistics, it is not very meaningful, and can be highly misleading.
The wheels are in motion for a cross-platform standard with which to evaluate compression quality, but they’re turning slowly, primarily because different compression systems work in different ways on different video input, and video input is infinitely variable. Devising a workable, standardised suite of input signals and a set of criteria for judging the results is not entirely unlike nailing jelly to a tree.
When Digital Video (DV) becomes more popular, the compression issue will be largely forgotten in the semi-pro and low end professional markets. We have to deal with it now, because acquisition and delivery devices are overwhelmingly analogue. To digitally edit most video, you have to record it with an analogue device, digitise it with whatever flavour and quantity of compression your system uses, do your edit, and pump it out to another analogue device. DV promises simple, one-wire connections from cameras to edit machines to VTRs, with zero transfer and duplication loss. DV does high quality compression at the original acquisition point, and then just pumps the digital data around. There’s room for quality to improve with more powerful compression hardware in the cameras and VTRs, but DV is already quoted as equal to 3:1 to 3.5:1 M-JPEG, which is plenty good enough for broadcast.
High end digital systems – like those used at pro MPEG encoding shops, for example – give a lot more control over compression. A human can judge much better than a computer which elements of an image are important, and can set compression accordingly to wring the best possible picture out of the final result. This sort of control, however, is practical only for creation of a final distribution copy from high grade source material – in an editing environment, it’d drive you nuts.
When people talk pro digital video, they talk ITU-R 601. Well, sometimes they talk CCIR 601, but these are just two names for the same thing.
ITU-R 601 is the international encoding standard for digital television, whether intended for PAL, NTSC or SECAM applications. The standard covers colour difference and RGB video, but the colour difference version is more popular, because colour difference video in general is more popular. Colour difference ITU-R 601 is referred to as a Y, Cr, Cb format – it digitises luminance, R-Y and B-Y. Luminance is sampled at 13.5MHz, and the two colour difference signals are each sampled at 6.75MHz. This sampling is referred to as 4:2:2, but the A:B:C format for describing digital video sampling modes is internally inconsistent and leads to lots of confusion. ITU-R 601 video is uncompressed, and has a continuous data rate of 21 megabytes per second.
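The 21 megabytes per second figure follows from the active-picture sample counts. A quick sanity check, using the standard 720-samples-per-line active picture (a detail not spelled out above):

```python
# Active-picture data rate of 8 bit 4:2:2 ITU-R 601, 625-line figures.
LUMA_SAMPLES_PER_LINE = 720      # sampled at 13.5MHz
CHROMA_SAMPLES_PER_LINE = 360    # per colour difference channel, at 6.75MHz
ACTIVE_LINES, FPS = 576, 25      # 480 lines at 30fps gives the same product

bytes_per_frame = (LUMA_SAMPLES_PER_LINE + 2 * CHROMA_SAMPLES_PER_LINE) * ACTIVE_LINES
rate_mb_per_s = bytes_per_frame * FPS / 1e6
print(rate_mb_per_s)   # ~20.7 MB/s -- the "21 megabytes per second" above
```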
The samples can be either 8 bit or 10 bit, giving theoretical maximum numbers of unique colours of 16.8 million and 1.1 billion, respectively. Black is actually defined as level 16 and white as level 235 (on a 0 to 255 scale), and the colour difference channels are defined as occupying the values from 16 to 240, to leave anti-clipping headroom and "footroom". If you ever see Y, Pr, Pb video referred to, it’s a variant on this scheme where the possible luma and chroma digitisation values are identical. Pr and Pb are not allowed for in ITU-R 601, but they’re the digital equivalent of the equal-bandwidth analogue colour difference components used by, among other formats, Betacam and M-II.
The restricted digitisation values for the components make the real numbers of possible colours for Y, Cr, Cb video in 8 and 10 bit a mere 11.1 million and 706 million, respectively. The useable colour resolution, in actual studios, is further reduced by all of those gamma curves and colour space manipulations that people go to university to learn about. In any case, the colour resolution of even 8 bit ITU-R 601 is such that any colour artefacts visible in the pure video are the fault of the acquisition or display equipment, not the format. The 10 bit version gives a wider colour gamut for easier mixing and other processing of sources.
Note that Cr and Cb are, correctly, the names for the DIGITISED versions of the colour difference signals; if you’re talking about the analogue versions, they’re plain old R-Y and B-Y, respectively. Cr and Cb trip off the tongue more easily, though, so the digital terms are becoming standard usage when referring to the analogue components as well.
While we’re on the subject of incorrect usage, people who work with PAL video sometimes call the analogue components Y, U and V respectively. U and V, correctly, are PAL subcarrier axes that are modulated by scaled and filtered versions of the R-Y and B-Y signals – they’re not the signals themselves. Likewise, NTSC video aficionados sometimes incorrectly call the analogue components Y, I and Q, although since most NTSC broadcast equipment now uses the YUV axes, this particular incorrect usage may be swallowed by the PAL variant.
Digital Video (DV) write-ups like to describe the format as "virtually lossless". This is one of those weasel phrases like "almost unique" and "fresh frozen" which cynical consumers have come to recognise as The Slippery Salesman’s Friend. Fortunately, nothing underhanded is happening in DV’s case.
Digital data is, classically, incorruptible. Provided damage to the storage medium or transmission noise doesn’t destroy the signal, digital data will be the same after any amount of transmission, copying and viewing. And this is the case for digital video – provided it hasn’t been lossily compressed.
Lossy compression moves the goalposts. By definition, you lose data when you use lossy compression, and it is axiomatic that this data cannot be recovered, full stop. The trick is to tune the compression so the final image quality is good enough for the demands of your application. But that’s not the end of the story.
Digital video editing formats, like the DV format, don’t use interframe compression. Interframe compression – constructing frames in terms of previous and following frames – allows non-editable MPEG like that used by Video CD and DVD to very effectively compress high grade moving images, but any change to the sequence of frames requires modification of the data for every frame that refers to a modified frame.
But because editing formats don’t use interframe compression, and all of their compressed frames are independent of each other, plain old cut-cut editing is easy. All the editing system has to do is nose-to-tail the appropriate chunks of data, in a process completely analogous to splicing film, and the job is done. But anything which requires changes to the frames themselves, as opposed to their sequence, becomes a problem.
You can’t do effects directly on lossily compressed frames. They are not a plain, simple map of pixel colours that can be mathematically tinkered with. They’re a recipe which, when followed, allows an image to be generated that looks as much like the original uncompressed frame as the compression scheme allows.
So to do, say, a simple cross-fade between compressed sources that takes up 50 frames, the 50 frames of each video stream that will be used for the effect are decompressed. The decompressed data is mathematically combined with appropriate weighting. And then the frames are recompressed for insertion in the final video stream, the other frames of which are straight copies of the compressed frames in the source files.
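A crossfade on decompressed frames is nothing more exotic than a weighted average. A toy sketch, treating each frame as a flat list of pixel values (real systems do this per channel on full images):

```python
def crossfade(decoded_a, decoded_b):
    """Weighted blend of two equal-length runs of decompressed frames."""
    n = len(decoded_a)
    out = []
    for i, (fa, fb) in enumerate(zip(decoded_a, decoded_b)):
        t = i / (n - 1)          # 0.0 at the first frame, 1.0 at the last
        out.append([(1 - t) * pa + t * pb for pa, pb in zip(fa, fb)])
    return out

a = [[100, 100]] * 5   # five decompressed frames of one source
b = [[200, 200]] * 5   # five decompressed frames of the other
mix = crossfade(a, b)
assert mix[0] == [100, 100] and mix[-1] == [200, 200]
assert mix[2] == [150.0, 150.0]   # the halfway point
```

Every frame of `mix` then has to be recompressed before it can join the output stream – and that’s the step that hurts.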
It’s the recompression that causes the problem.
Clever compression schemes can recognise their own artefacts, to some degree. Any digitisation and compression system that samples chrominance at a lower resolution than luminance, for instance, will produce a picture that’s really a nice high resolution greyscale image, coloured in with a wall of lower resolution blocks of single hues. This method is a very popular way of reducing data rates, because the human eye has much more detailed response to luminance than to chrominance. This explains the lower bandwidth assigned to the colour difference signals in analogue component video, and the lower sampling rate for Cr and Cb in digital component video.
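Chroma subsampling itself is trivial to sketch. A toy version of halving horizontal chroma resolution – real equipment filters rather than box-averages, so treat this as illustrative:

```python
def subsample_chroma(chroma_line, factor=2):
    """Average each run of `factor` chroma samples down to one -- the
    4:2:2-style trick of halving horizontal chroma resolution."""
    return [sum(chroma_line[i:i + factor]) / factor
            for i in range(0, len(chroma_line), factor)]

def upsample_chroma(subsampled, factor=2):
    """Naive reconstruction repeats each value: the 'blocks of single hues'."""
    return [v for v in subsampled for _ in range(factor)]

line = [10, 12, 30, 30, 50, 54, 80, 80]
half = subsample_chroma(line)    # [11.0, 30.0, 52.0, 80.0]
back = upsample_chroma(half)     # chroma detail within each pair is gone
```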
Now, let’s say that a frame thus compressed is for some reason decompressed, nothing at all is done to the data and the frame is recompressed again. The compression algorithm, if it’s smart enough, can recognise the chroma blocks as artefacts of its own action and won’t try to reduce the chroma resolution any more. It may throw away a bit more detail in the luma, because it can’t recognise every single twiddly artefact of its own compression, but by and large the frame should look the same. You’d have to decompress and recompress it quite a few times to lose much more image quality.
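The reason a second pass is nearly free can be seen in the simplest possible lossy scheme, plain quantisation. Snapping values to a grid loses data once; values already on the grid snap to themselves:

```python
def quantise(samples, step=8):
    """Toy lossy 'compression': snap each sample to the nearest multiple
    of `step`.  The first pass loses data; further passes change nothing,
    because already-quantised values are their own nearest multiple."""
    return [round(s / step) * step for s in samples]

original = [3, 10, 17, 100, 203]
first = quantise(original)    # data lost here
second = quantise(first)      # no further loss
assert first != original
assert first == second
```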
All of this happens because lossy image compression only runs one way. It’s easy to compress a frame. It’s impossible to take a compressed frame and turn it back into the exact same frame that generated it. Since a compression algorithm can’t tell what a compressed frame looked like before the original compression, it can’t tell what changes have been made and can’t see for sure what’s an artefact and what’s real image detail. If a compression scheme can maintain pixel-perfect identicality of decompressed and recompressed data, it’s not lossy, by definition.
Fortunately, there’s no reason for a digital video system to decompress and recompress frames without doing anything to them. When playing video, the system just decompresses, and doesn’t change the compressed data. When doing cut-cut editing, it just rearranges the frame order, again without changing the data that comprises each frame. But when doing something that requires the appearance of frames to change – like the crossfade mentioned above – it has to decompress and recompress. And, worse still, it has to recompress image data that is not a plain, single, decompressed frame, but is some combination of frames, or a combination of frames and text, or part of a frame being wiped away, or whatever.
The variety of video content is, for practical purposes, infinite. There are 16.8 million to the power of 307,200 possible images that can be displayed on a 640 by 480, 24 bit screen.
This is well over a two million digit number of possible frames. It is difficult to even describe how many this is in human-comprehensible terms. Use all the matter in the universe to make televisions that show 25 frames per second for the entire life of the universe so far, and you're still hilariously short of the mark. Magically turn every particle in the universe into a TV that shows 25 thousand frames per second for a million times the age of the universe, and you're still not even within shouting distance.
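If you want to check the two-million-digit claim, logarithms will do it without writing the number out:

```python
import math

POSSIBLE_COLOURS = 16.8e6   # 24 bit colour, as in the text
PIXELS = 640 * 480          # 307,200

# Decimal digits in POSSIBLE_COLOURS ** PIXELS, via logarithms, because
# the number itself is far too big to compute or print directly.
digits = math.floor(PIXELS * math.log10(POSSIBLE_COLOURS)) + 1
print(digits)   # a bit over 2.2 million digits
```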
Compression schemes are, by definition, finite in complexity. It is thus literally impossible for a compression scheme, no matter how complex, to recognise all the things that can be thrown at it for recompression – combinations of artefacts, the addition of extra digital effects and all the other things that people do to their video. It is thus also impossible for it to avoid losing more data when frames are recompressed.
What does all this mean, in brief? Well, it means that any digital video that uses lossy compression cannot, even after the initial compression stage, be described as truly lossless. Any operation that requires decompression and recompression of frames will reduce the quality of those frames, full stop. This should not be a significant consideration unless you’re layering effect on effect on effect in separate sessions, but it’s the source of descriptions of, for example, DV, as "virtually lossless".
Analogue video folks often take the "virtually lossless" description to mean "slight generational losses that they don’t want us to know about". This is incorrect. To reiterate – lossily compressed digital video can be dubbed, cut-cut edited and transmitted as many times as you like, with literal zero loss. It’s only when you start changing frames that quality can fall.