A Guide to MPEG Fundamentals and Protocol Analysis

Contents
Section 1 – Introduction to MPEG
1.1 Convergence
1.2 Why Compression is Needed
1.3 Applications of Compression
1.4 Introduction to Video Compression
1.5 Introduction to Audio Compression
1.6 MPEG Signals
1.7 Need for Monitoring and Analysis
1.8 Pitfalls of Compression
Section 2 – Compression in Video
2.1 Spatial or Temporal Coding?
2.2 Spatial Coding
2.3 Weighting
2.4 Scanning
2.5 Entropy Coding
2.6 A Spatial Coder
2.7 Temporal Coding
2.8 Motion Compensation
2.9 Bidirectional Coding
2.10 I, P, and B Pictures
2.11 An MPEG Compressor
2.12 Preprocessing
2.13 Profiles and Levels
2.14 Wavelets
Section 3 – Audio Compression
3.1 The Hearing Mechanism
3.2 Subband Coding
3.3 MPEG Layer 1
3.4 MPEG Layer 2
3.5 Transform Coding
3.6 MPEG Layer 3
3.7 AC-3
Section 4 – Elementary Streams
4.1 Video Elementary Stream Syntax
4.2 Audio Elementary Streams
Section 5 – Packetized Elementary Streams (PES)
5.1 PES Packets
5.2 Time Stamps
5.3 PTS/DTS
Section 6 – Program Streams
6.1 Recording vs. Transmission
6.2 Introduction to Program Streams
Section 7 – Transport Streams
7.1 The Job of a Transport Stream
7.2 Packets
7.3 Program Clock Reference (PCR)
7.4 Packet Identification (PID)
7.5 Program Specific Information (PSI)
Section 8 – Introduction to DVB/ATSC
8.1 An Overall View
8.2 Remultiplexing
8.3 Service Information (SI)
8.4 Error Correction
8.5 Channel Coding
8.6 Inner Coding
8.7 Transmitting Digits
Section 9 – MPEG Testing
9.1 Testing Requirements
9.2 Analyzing a Transport Stream
9.3 Hierarchic View
9.4 Interpreted View
9.5 Syntax and CRC Analysis
9.6 Filtering
9.7 Timing Analysis
9.8 Elementary Stream Testing
9.9 Sarnoff Compliant Bit Streams
9.10 Elementary Stream Analysis
9.11 Creating a Transport Stream
9.12 Jitter Generation
9.13 DVB Tests
Glossary
Section 1 – Introduction to MPEG
MPEG is one of the most popular audio/video compression techniques because
it is not just a single standard. Instead, it is a range of standards suitable for
different applications but based on similar principles. MPEG is an acronym for
the Moving Picture Experts Group, which was set up by the ISO (International
Organization for Standardization) to work on compression.
MPEG can seem like little more than an interaction of acronyms. As ETSI states, “The
CAT is a pointer to enable the IRD to find the EMMs associated with the CA
system(s) that it uses.” If you can understand that sentence, you don’t need this
book!
1.1 Convergence
Digital techniques have made rapid progress in audio and video for a number
of reasons. Digital information is more robust and can be coded to substantially
eliminate error. This means that generation-loss in recording and losses in
transmission are eliminated. The Compact Disc was the first consumer product
to demonstrate this.
While the CD has an improved sound quality with respect to its vinyl predecessor,
comparison of quality alone misses the point. The real point is that digital
recording and transmission techniques allow content manipulation to a degree
that is impossible with analog. Once audio or video are digitized, they become
data. Such data cannot be distinguished from any other kind of data; therefore,
digital video and audio become the province of computer technology.
The convergence of computers and audio/video is an inevitable consequence of
the key inventions of computing and Pulse Code Modulation. Digital media can
store any type of information, so it is easy to use a computer storage device for
digital video. The nonlinear workstation was the first example of an application
of convergent technology that did not have an analog forerunner. Another
example, multimedia, mixed the storage of audio, video, graphics, text, and
data on the same medium. Multimedia is impossible in the analog domain.
1.2 Why Compression is Needed
The initial success of digital video was in post-production applications, where
the high cost of digital video was offset by its limitless layering and effects
capability. However, production-standard digital video generates over 200
megabits per second of data and this bit rate requires extensive capacity for
storage and wide bandwidth for transmission. Digital video could only be used
in wider applications if the storage and bandwidth requirements could be
eased; easing these requirements is the purpose of compression.
Compression is a way of expressing digital audio and video by using less data.
Compression has the following advantages:
- A smaller amount of storage is needed for a given amount of source material.
With high-density recording such as tape, compression allows highly miniaturized
equipment for consumer and Electronic News Gathering (ENG) use. The
access time of tape improves with compression because less tape needs to be
shuttled to skip over a given amount of program. With expensive storage media
such as RAM, compression makes new applications affordable.
- When working in real time, compression reduces the bandwidth needed.
Additionally, compression allows faster-than-real-time transfer between media,
for example, between tape and disk.
- A compressed recording format can afford a lower recording density, and this can
make the recorder less sensitive to environmental factors and maintenance.
1.3 Applications of Compression
Compression has a long association with television. Interlace is a simple form
of compression giving a 2:1 reduction in bandwidth. The use of color-difference
signals instead of GBR is another form of compression. Because the eye is less
sensitive to color detail, the color-difference signals need less bandwidth. When
color broadcasting was introduced, the channel structure of monochrome had
to be retained and composite video was developed. Composite video systems,
such as PAL, NTSC, and SECAM, are forms of compression because they use
the same bandwidth for color as was used for monochrome.
Figure 1-1a shows that in traditional television systems, the GBR camera signal
is converted to Y, Pr, Pb components for production and encoded into analog
composite for transmission. Figure 1-1b shows the modern equivalent. The Y, Pr,
Pb signals are digitized and carried as Y, Cr, Cb signals in SDI form through the
production process prior to being encoded with MPEG for transmission. Clearly,
MPEG can be considered by the broadcaster as a more efficient replacement
for composite video. In addition, MPEG has greater flexibility because the bit
rate required can be adjusted to suit the application. At lower bit rates and resolutions,
MPEG can be used for video conferencing and video telephones.
DVB and ATSC (the European- and American-originated digital-television broadcasting
standards) would not be viable without compression because the bandwidth
required would be too great. Compression extends the playing time of
DVD (digital video/versatile disc) allowing full-length movies on a standard size
compact disc. Compression also reduces the cost of Electronic News Gathering and
other contributions to television production.
In tape recording, mild compression eases tolerances and adds reliability in
Digital Betacam and Digital-S, whereas in SX, DVC, DVCPRO, and DVCAM, the
goal is miniaturization. In magnetic disk drives, such as the Tektronix Profile® storage
system, that are used in file servers and networks (especially for news purposes),
compression lowers storage cost. Compression also lowers bandwidth,
which allows more users to access a given server. This characteristic is also
important for VOD (Video On Demand) applications.
1.4 Introduction to Video Compression
In all real program material, there are two types of components of the signal: those
which are novel and unpredictable and those which can be anticipated. The
novel component is called entropy and is the true information in the signal. The
remainder is called redundancy because it is not essential. Redundancy may be
spatial, as it is in large plain areas of picture where adjacent pixels have
almost the same value. Redundancy can also be temporal as it is where similarities
between successive pictures are used. All compression systems work by separating
the entropy from the redundancy in the encoder. Only the entropy is
recorded or transmitted and the decoder computes the redundancy from the transmitted
signal. Figure 1-2a shows this concept.
An ideal encoder would extract all the entropy and only this will be transmitted
to the decoder. An ideal decoder would then reproduce the original signal. In practice,
this ideal cannot be reached. An ideal coder would be complex and cause
a very long delay in order to use temporal redundancy. In certain applications,
such as recording or broadcasting, some delay is acceptable, but in videoconferencing
it is not. In some cases, a very complex coder would be too expensive.
It follows that there is no one ideal compression system.
In practice, a range of coders is needed which have a range of processing
delays and complexities. The power of MPEG is that it is not a single compression
format, but a range of standardized coding tools that can be combined
flexibly to suit a range of applications. The way in which coding has been performed
is included in the compressed data so that the decoder can automatically
handle whatever the coder decided to do.
MPEG coding is divided into several profiles that have different complexity, and
each profile can be implemented at a different level depending on the resolution of
the input picture. Section 2 considers profiles and levels in detail.
Figure 1-1. (a) A traditional system: the camera's GBR signal is matrixed to Y, Pr, Pb
for the production process and then composite-encoded for analog output (PAL, NTSC,
or SECAM). (b) The modern equivalent: Y, Pr, Pb are digitized by ADCs and carried as
Y, Cr, Cb in SDI form through the production process before an MPEG coder produces
the compressed output.
There are many different digital video formats and each has a different bit rate.
For example a high definition system might have six times the bit rate of a
standard definition system. Consequently, just knowing the bit rate out of the
coder is not very useful. What matters is the compression factor, which is the
ratio of the input bit rate to the compressed bit rate, for example 2:1, 5:1, and
so on.
Unfortunately, the number of variables involved makes it very difficult to determine
a suitable compression factor. Figure 1-2a shows that for an ideal coder,
if all of the entropy is sent, the quality is good. However, if the compression
factor is increased in order to reduce the bit rate, not all of the entropy is sent
and the quality falls. Note that in a compressed system when the quality loss
occurs, it is steep (Figure 1-2b). If the available bit rate is inadequate, it is better
to avoid this area by reducing the entropy of the input picture. This can be
done by filtering. The loss of resolution caused by the filtering is subjectively
more acceptable than the compression artifacts.
To identify the entropy perfectly, an ideal compressor would have to be
extremely complex. A practical compressor may be less complex for economic
reasons and must send more data to be sure of carrying all of the entropy. Figure
1-2b shows the relationship between coder complexity and performance. The higher
the compression factor required, the more complex the encoder has to be.
The entropy in video signals varies. A recording of an announcer delivering the
news has much redundancy and is easy to compress. In contrast, it is more difficult
to compress a recording with leaves blowing in the wind or one of a football
crowd that is constantly moving and therefore has less redundancy (more information
or entropy). In either case, if all the entropy is not sent, there will be quality loss.
Thus, we may choose between a constant bit-rate channel with variable quality or a
constant quality channel with variable bit rate. Telecommunications network operators
tend to prefer a constant bit rate for practical purposes, but a buffer memory
can be used to average out entropy variations if the resulting increase in delay
is acceptable. In recording, a variable bit rate may be easier to handle and DVD
uses variable bit rate, speeding up the disc where difficult material exists.
Intra-coding (intra = within) is a technique that exploits spatial redundancy, or
redundancy within the picture; inter-coding (inter = between) is a technique
that exploits temporal redundancy.

Figure 1-2. (a) An ideal coder sends only the entropy of the PCM video; a non-ideal
coder has to send more, and a short-delay coder has to send even more. (b) Quality
as a function of compression factor for coders of differing complexity. (c) Quality
as a function of compression factor and latency.

Intra-coding may be used alone, as in the JPEG
standard for still pictures, or combined with inter-coding as in MPEG.
Intra-coding relies on two characteristics of typical images. First, not all spatial
frequencies are simultaneously present, and second, the higher the spatial frequency,
the lower the amplitude is likely to be. Intra-coding requires analysis of
the spatial frequencies in an image. This analysis is the purpose of transforms
such as wavelets and DCT (discrete cosine transform). Transforms produce
coefficients which describe the magnitude of each spatial frequency. Typically,
many coefficients will be zero, or nearly zero, and these coefficients can be
omitted, resulting in a reduction in bit rate.
Inter-coding relies on finding similarities between successive pictures. If a
given picture is available at the decoder, the next picture can be created by sending
only the picture differences. The picture differences will be increased when
objects move, but this magnification can be offset by using motion compensation,
since a moving object does not generally change its appearance very much
from one picture to the next. If the motion can be measured, a closer approximation
to the current picture can be created by shifting part of the previous picture to
a new location. The shifting process is controlled by a vector that is transmitted to
the decoder. The vector transmission requires less data than sending the picture-
difference data.
MPEG can handle both interlaced and non-interlaced images. An image at some
point on the time axis is called a “picture,” whether it is a field or a frame.
Interlace is not ideal as a source for digital compression because it is in itself a compression
technique. Temporal coding is made more complex because pixels in
one field are in a different position to those in the next.
Motion compensation minimizes but does not eliminate the differences between
successive pictures. The picture-difference is itself a spatial image and can be
compressed using transform-based intra-coding as previously described.
Motion compensation simply reduces the amount of data in the difference
image.
The efficiency of a temporal coder rises with the time span over which it can act.
Figure 1-2c shows that if a high compression factor is required, a longer time
span in the input must be considered and thus a longer coding delay will be
experienced. Clearly, temporally coded signals are difficult to edit because the
content of a given output picture may be based on image data which was transmitted
some time earlier. Production systems will have to limit the degree of
temporal coding to allow editing and this limitation will in turn limit the available
compression factor.
1.5 Introduction to Audio Compression
The bit rate of a PCM digital audio channel is only about one megabit per second,
which is about 0.5% of 4:2:2 digital video. With mild video compression
schemes, such as Digital Betacam, audio compression is unnecessary. But, as
the video compression factor is raised, it becomes necessary to compress the
audio as well.
Audio compression takes advantage of two facts. First, in typical audio signals,
not all frequencies are simultaneously present. Second, because of the phenomenon
of masking, human hearing cannot discern every detail of an audio
signal. Audio compression splits the audio spectrum into bands by filtering or
transforms, and includes less data when describing bands in which the level is
low. Where masking prevents or reduces audibility of a particular band, even
less data needs to be sent.
Audio compression is not as easy to achieve as is video compression because
of the acuity of hearing. Masking only works properly when the masking and
the masked sounds coincide spatially. Spatial coincidence is always the case in
mono recordings but not in stereo recordings, where low-level signals can still
be heard if they are in a different part of the sound stage. Consequently, in
stereo and surround sound systems, a lower compression factor is allowable
for a given quality. Another factor complicating audio compression is that
delayed resonances in poor loudspeakers actually mask compression artifacts.
Testing a compressor with poor speakers gives a false result, and signals
which are apparently satisfactory may be disappointing when heard on good
equipment.
1.6 MPEG Signals
The output of a single MPEG audio or video coder is called an Elementary
Stream. An Elementary Stream is an endless near real-time signal. For convenience,
it can be broken into conveniently sized data blocks, forming a Packetized
Elementary Stream (PES). These data blocks need header information to identify
the start of the packets and must include time stamps because packetizing
disrupts the time axis.
Figure 1-3 shows that one video PES and a number of audio PES can be combined
to form a Program Stream, provided that all of the coders are locked to a
common clock. Time stamps in each PES ensure lip-sync between the video
and audio. Program Streams have variable-length packets with headers. They
find use in data transfers to and from optical and hard disks, which are error
free and in which files of arbitrary sizes are expected. DVD uses Program
Streams.
For transmission and digital broadcasting, several programs and their associated
PES can be multiplexed into a single Transport Stream. A Transport Stream
differs from a Program Stream in that the PES packets are further subdivided
into short fixed-size packets and in that multiple programs encoded with different
clocks can be carried. This is possible because a transport stream has a
program clock reference (PCR) mechanism which allows transmission of multiple
clocks, one of which is selected and regenerated at the decoder. A Single
Program Transport Stream (SPTS) is also possible and this may be found
between a coder and a multiplexer. Since a Transport Stream can genlock the
decoder clock to the encoder clock, the SPTS is more common than the
Program Stream.
A Transport Stream is more than just a multiplex of audio and video PES. In
addition to the compressed audio, video, and data, a Transport Stream includes
a great deal of metadata describing the bit stream. This includes the Program
Association Table (PAT) that lists every program in the transport stream. Each
entry in the PAT points to a Program Map Table (PMT) that lists the elementary
streams making up each program. Some programs will be open, but some programs
may be subject to conditional access (encryption) and this information is
also carried in the metadata.
The Transport Stream consists of fixed-size data packets, each containing 188
bytes. Each packet carries a packet identifier code (PID). Packets in the same
elementary stream all have the same PID, so that the decoder (or a demultiplexer)
can select the elementary stream(s) it wants and reject the remainder.
Packet-continuity counts ensure that every packet that is needed to decode a
stream is received. An effective synchronization system is needed so that
decoders can correctly identify the beginning of each packet and deserialize
the bit stream into words.
1.7 Need for Monitoring and Analysis
The MPEG transport stream is an extremely complex structure using interlinked
tables and coded identifiers to separate the programs and the elementary
streams within the programs. Within each elementary stream, there is a complex
structure, allowing a decoder to distinguish between, for example, vectors,
coefficients and quantization tables.
Failures can be divided into two broad categories. In the first category, the
transport system correctly multiplexes and delivers information from an
encoder to a decoder with no bit errors or added jitter, but the encoder or the
decoder has a fault. In the second category, the encoder and decoder are fine,
but the transport of data from one to the other is defective. It is very important
to know whether the fault lies in the encoder, the transport, or the decoder if a
prompt solution is to be found.
Synchronizing problems, such as loss or corruption of sync patterns, may prevent
reception of the entire transport stream. Transport-stream protocol defects
may prevent the decoder from finding all of the data for a program, perhaps
delivering picture but not sound.

Figure 1-3. Video and audio encoders feed packetizers to create video and audio PES.
A Program Stream multiplexer combines them for recording applications such as DVD,
while a Transport Stream multiplexer combines them, along with data, into a Single
Program Transport Stream.

Correct delivery of the data but with excessive
jitter can cause decoder timing problems.
If a system using an MPEG transport stream fails, the fault could be in the
encoder, the multiplexer, or in the decoder. How can this fault be isolated?
First, verify that the transport stream is compliant with the MPEG-coding standards.
If the stream is not compliant, a decoder can hardly be blamed for having
difficulty. If it is compliant, the decoder may need attention.
Traditional video testing tools, the signal generator, the waveform monitor, and
vectorscope, are not appropriate in analyzing MPEG systems, except to ensure
that the video signals entering and leaving an MPEG system are of suitable
quality. Instead, a reliable source of valid MPEG test signals is essential for
testing receiving equipment and decoders. With a suitable analyzer, the performance
of encoders, transmission systems, multiplexers, and remultiplexers
can be assessed with a high degree of confidence. As a long standing supplier
of high grade test equipment to the video industry, Tektronix continues to provide
test and measurement solutions as the technology evolves, giving the
MPEG user the confidence that complex compressed systems are correctly
functioning and allowing rapid diagnosis when they are not.
1.8 Pitfalls of Compression
MPEG compression is lossy in that what is decoded is not identical to the original.
The entropy of the source varies, and when entropy is high, the compression
system may leave visible artifacts when decoded. In temporal compression,
redundancy between successive pictures is assumed. When this is not the
case, the system fails. An example is video from a press conference where
flashguns are firing. Individual pictures containing the flash are totally different
from their neighbors, and coding artifacts become obvious.
Irregular motion or several independently moving objects on screen require a
lot of vector bandwidth and this requirement may only be met by reducing the
picture-data bandwidth. Again, visible artifacts may occur whose level varies
and depends on the motion. This problem often occurs in sports-coverage
video.
Coarse quantizing results in luminance contouring and posterized color. These
can be seen as blotchy shadows and blocking on large areas of plain color.
Subjectively, compression artifacts are more annoying than the relatively constant
impairments resulting from analog television transmission systems.
The only solution to these problems is to reduce the compression factor.
Consequently, the compression user has to make a value judgment between
the economy of a high compression factor and the level of artifacts.
In addition to extending the encoding and decoding delay, temporal coding also
causes difficulty in editing. In fact, an MPEG bit stream cannot be arbitrarily
edited at all. This restriction occurs because, in temporal coding, the decoding
of one picture may require the contents of an earlier picture and the contents
may not be available following an edit. The fact that pictures may be sent out
of sequence also complicates editing.
If suitable coding has been used, edits can take place only at splice points,
which are relatively widely spaced. If arbitrary editing is required, the MPEG
stream must undergo a read-modify-write process, which will result in generation
loss.
The viewer is not interested in editing, but the production user will have to
make another value judgment about the edit flexibility required. If greater flexibility
is required, the temporal compression has to be reduced and a higher bit
rate will be needed.
Section 2 – Compression in Video
This section shows how video compression is based on the perception of the
eye. Important enabling techniques, such as transforms and motion compensation,
are considered as an introduction to the structure of an MPEG coder.
2.1 Spatial or Temporal Coding?
As was seen in Section 1, video compression can take advantage of both spatial
and temporal redundancy. In MPEG, temporal redundancy is reduced first
by using similarities between successive pictures. As much as possible of the
current picture is created or “predicted” by using information from pictures
already sent. When this technique is used, it is only necessary to send a difference
picture, which corrects the prediction to match the actual picture. The
difference picture is then subject to spatial compression. As
a practical matter it is easier to explain spatial compression prior to explaining
temporal compression.
Spatial compression relies on similarities between adjacent pixels in plain areas
of picture and on dominant spatial frequencies in areas of patterning. The JPEG
system uses spatial compression only, since it is designed to transmit individual
still pictures. However, JPEG may be used to code a succession of individual
pictures for video. In the so-called “Motion JPEG” application, the compression
factor will not be as good as if temporal coding was used, but the bit
stream will be freely editable on a picture-by-picture basis.
2.2 Spatial Coding
The first step in spatial coding is to perform an analysis of spatial frequency
using a transform. A transform is simply a way of expressing a waveform in a
different domain, in this case, the frequency domain. The output of a transform
is a set of coefficients that describe how much of a given frequency is present.
An inverse transform reproduces the original waveform. If the coefficients are
handled with sufficient accuracy, the output of the inverse transform is identical
to the original waveform.
The most well known transform is the Fourier transform. This transform finds
each frequency in the input signal. It finds each frequency by multiplying the
input waveform by a sample of a target frequency, called a basis function, and
integrating the product. Figure 2-1 shows that when the input waveform does
not contain the target frequency, the integral will be zero, but when it does, the
integral will be a coefficient describing the amplitude of that component frequency.
The results will be as described if the frequency component is in phase with
the basis function. However if the frequency component is in quadrature with
the basis function, the integral will still be zero. Therefore, it is necessary to
perform two searches for each frequency, with the basis functions in quadrature
with one another so that every phase of the input will be detected.
Figure 2-1. Correlation with a basis function: when the input frequency differs from
the basis function, there is no correlation and the integral is zero; when the
frequencies are the same, the correlation is high and the integral yields a
coefficient.
The Fourier transform has the disadvantage of requiring coefficients for both
sine and cosine components of each frequency. In the cosine transform, the
input waveform is time-mirrored with itself prior to multiplication by the basis
functions. Figure 2-2 shows that this mirroring cancels out all sine components
and doubles all of the cosine components. The sine basis function is unnecessary
and only one coefficient is needed for each frequency.
The discrete cosine transform (DCT) is the sampled version of the cosine transform
and is used extensively in two-dimensional form in MPEG. A block of
8 x 8 pixels is transformed to become a block of 8 x 8 coefficients. Since the
transform requires multiplication by fractions, there is wordlength extension,
resulting in coefficients that have longer wordlength than the pixel values.
Typically an 8-bit pixel block results in an 11-bit coefficient block. Thus, a DCT
does not result in any compression; in fact it results in the opposite. However,
the DCT converts the source pixels into a form where compression is easier.
Figure 2-3 shows the results of an inverse transform of each of the individual
coefficients of an 8 x 8 DCT. In the case of the luminance signal, the top-left
coefficient is the average brightness or DC component of the whole block.
Moving across the top row, horizontal spatial frequency increases. Moving
down the left column, vertical spatial frequency increases. In real pictures, different
vertical and horizontal spatial frequencies may occur simultaneously and
a coefficient at some point within the block will represent all possible horizontal
and vertical combinations.
Figure 2-2. In the cosine transform the input is mirrored: the cosine component is
coherent through the mirror, while the sine component inverts at the mirror and
cancels.

Figure 2-3. The result of an inverse transform of each individual coefficient of an
8 x 8 DCT: horizontal spatial frequency (H) increases across the top row and vertical
spatial frequency (V) increases down the left column.
Figure 2-3 also shows the coefficients as a one-dimensional horizontal waveform.
Combining these waveforms with various amplitudes and either polarity
can reproduce any combination of 8 pixels. Thus combining the 64 coefficients
of the 2-D DCT will result in the original 8 x 8 pixel block. Clearly for color pictures,
the color difference samples will also need to be handled. Y, Cr, and Cb
data are assembled into separate 8 x 8 arrays and are transformed individually.
In much real program material, many of the coefficients will have zero or near-zero
values and, therefore, will not be transmitted. This fact results in significant
compression that is virtually lossless. If a higher compression factor is
needed, then the wordlength of the non-zero coefficients must be reduced. This
reduction will reduce accuracy of these coefficients and will introduce losses
into the process. With care, the losses can be introduced in a way that is least
visible to the viewer.
2.3 Weighting
Figure 2-4 shows that the human perception of noise in pictures is not uniform
but is a function of the spatial frequency. More noise can be tolerated at high
spatial frequencies. Also, video noise is effectively masked by fine detail in the
picture, whereas in plain areas it is highly visible. The reader will be aware that
traditional noise measurements are always weighted so that the technical
measurement relates to the subjective result.
Compression reduces the accuracy of coefficients and has a similar effect to
using shorter wordlength samples in PCM; that is, the noise level rises. In
PCM, the result of shortening the wordlength is that the noise level rises
equally at all frequencies. As the DCT splits the signal into different frequencies,
it becomes possible to control the spectrum of the noise. Effectively, low-frequency
coefficients are rendered more accurately than high-frequency coefficients
by a process of weighting.
Figure 2-5 shows that, in the weighting process, the coefficients from the DCT
are divided by constants that are a function of two-dimensional frequency.
Low-frequency coefficients will be divided by small numbers, and high-frequency
coefficients will be divided by large numbers. Following the division,
the least-significant bit is discarded or truncated. This truncation is a form of
requantizing. In the absence of weighting, this requantizing would have the
effect of doubling the size of the quantizing step, but with weighting, it increases
the step size according to the division factor.
As a result, coefficients representing low spatial frequencies are requantized
with relatively small steps and suffer little increased noise. Coefficients representing
higher spatial frequencies are requantized with large steps and suffer
more noise. However, fewer steps means that fewer bits are needed to identify
the step and a compression is obtained.
In the decoder, a low-order zero will be added to return the weighted coefficients
to their correct magnitude. They will then be multiplied by inverse
weighting factors. Clearly, at high frequencies the multiplication factors will be
larger, so the requantizing noise will be greater. Following inverse weighting,
the coefficients will have their original DCT output values, plus requantizing
error, which will be greater at high frequency than at low frequency.
As an alternative to truncation, weighted coefficients may be nonlinearly
requantized so that the quantizing step size increases with the magnitude of
the coefficient. This technique allows higher compression factors but worse
levels of artifacts.
Clearly, the degree of compression obtained and, in turn, the output bit rate
obtained, is a function of the severity of the requantizing process. Different bit
rates will require different weighting tables. In MPEG, it is possible to use various
different weighting tables and the table in use can be transmitted to the
decoder, so that correct decoding automatically occurs.
Figure 2-4. Human noise sensitivity as a function of spatial frequency: sensitivity
falls as spatial frequency rises.
2.4 Scanning
In typical program material, the significant DCT coefficients are generally found
in the top-left corner of the matrix. After weighting, low-value coefficients might
be truncated to zero. More efficient transmission can be obtained if all of the
non-zero coefficients are sent first, followed by a code indicating that the
remainder are all zero. Scanning is a technique which increases the probability
of achieving this result, because it sends coefficients in descending order of
magnitude probability. Figure 2-6a shows that in a noninterlaced system, the
probability of a coefficient having a high value is highest in the top-left corner
and lowest in the bottom-right corner. A 45 degree diagonal zig-zag scan is the
best sequence to use here.
In Figure 2-6b, the scan for an interlaced source is shown. In an interlaced picture,
an 8 x 8 DCT block from one field extends over twice the vertical screen
area, so that for a given picture detail, vertical frequencies will appear to be
twice as great as horizontal frequencies. Thus, the ideal scan for an interlaced
picture will be on a diagonal that is twice as steep. Figure 2-6b shows that a
given vertical spatial frequency is scanned before scanning the same horizontal
spatial frequency.
2.5 Entropy Coding
In real video, not all spatial frequencies are simultaneously present; therefore,
the DCT coefficient matrix will have zero terms in it. Despite the use of scanning,
zero coefficients will still appear between the significant values. Runlength
coding (RLC) allows these coefficients to be handled more efficiently.
Figure 2-5. Weighting in practice: the input DCT coefficients of a complex block are
divided first by the quantizing matrix values (one value per coefficient location,
increasing with two-dimensional frequency) and then by a quantizing scale value that
applies to the complete 8 x 8 block; the scale code selects either a linear or a
non-linear quantizing scale. (The numeric values in the figure are for display only,
not actual results.)
Where repeating values, such as a string of 0s, are present, run-length coding
simply transmits the number of zeros rather than each individual bit.
The probability of occurrence of particular coefficient values in real video can
be studied. In practice, some values occur very often; others occur less often.
This statistical information can be used to achieve further compression using
variable length coding (VLC). Frequently occurring values are converted to short
code words, and infrequent values are converted to long code words. To aid
deserialization, no code word can be the prefix of another.
2.6 A Spatial Coder
Figure 2-7 ties together all of the preceding spatial coding concepts. The input
signal is assumed to be 4:2:2 SDI (Serial Digital Interface), which may have 8-
or 10-bit wordlength. MPEG uses only 8-bit resolution; therefore, a rounding
stage will be needed when the SDI signal contains 10-bit words. Most MPEG
profiles operate with 4:2:0 sampling; therefore, a vertical low-pass filter/interpolation
stage will be needed. Rounding and color subsampling introduce a
small irreversible loss of information and a proportional reduction in bit rate.
The raster scanned input format will need to be stored so that it can be converted
to 8 x 8 pixel blocks.
Figure 2-6. Coefficient scan patterns: (a) the zig-zag or classic scan, nominally for
frames; (b) the alternate scan, nominally for fields.
Figure 2-7. A spatial coder. The full-bit-rate 10-bit 4:2:2 input is converted to
8-bit 4:2:0 (information lost, data reduced), transformed by the DCT (no loss, no
data reduction), quantized (information lost, data reduced), and entropy coded (no
loss, data reduced) into a buffer whose occupancy feeds back as rate control to the
quantizing. Quantizing reduces the number of bits for each coefficient, giving
preference to certain coefficients; the reduction can differ for each coefficient.
Entropy coding comprises run-length coding, which sends a unique code word instead
of strings of zeros, and variable-length coding, which uses short words for the most
frequent values (like Morse code).
The DCT stage transforms the picture information to the frequency domain. The
DCT itself does not achieve any compression. Following DCT, the coefficients
are weighted and truncated, providing the first significant compression. The
coefficients are then zig-zag scanned to increase the probability that the significant
coefficients occur early in the scan. After the last non-zero coefficient, an
EOB (end of block) code is generated.
Coefficient data are further compressed by run-length and variable-length coding.
In a variable bit-rate system, the quantizing is fixed, but in a fixed bit-rate
system, a buffer memory is used to absorb variations in coding difficulty. Highly
detailed pictures will tend to fill the buffer, whereas plain pictures will allow it
to empty. If the buffer is in danger of overflowing, the requantizing steps will
have to be made larger, so that the compression factor is effectively raised.
In the decoder, the bit stream is deserialized and the entropy coding is
reversed to reproduce the weighted coefficients. The inverse weighting is
applied and coefficients are placed in the matrix according to the zig-zag scan
to recreate the DCT matrix. Following an inverse transform, the 8 x 8 pixel
block is recreated. To obtain a raster-scanned output, the blocks are stored in
RAM, which is read a line at a time. To obtain a 4:2:2 output from 4:2:0 data, a
vertical interpolation process will be needed as shown in Figure 2-8.
The chroma samples in 4:2:0 are positioned halfway between luminance samples
in the vertical axis so that they are evenly spaced when an interlaced
source is used.
Figure 2-8. Chroma sample positions in 4:2:2 (Rec. 601), 4:1:1, and 4:2:0, showing
the luminance samples (Y) and chrominance samples (Cb, Cr) on lines N to N+3.

2.7 Temporal Coding
Temporal redundancy can be exploited by inter-coding, or transmitting only the
differences between pictures. Figure 2-9 shows that a one-picture delay combined
with a subtractor can compute the picture differences. The picture difference
is an image in its own right and can be further compressed by the spatial
coder as was previously described. The decoder reverses the spatial coding
and adds the difference picture to the previous picture to obtain the next
picture.

Figure 2-9. A one-picture delay and a subtractor compute the picture difference
between this picture and the previous picture.
There are some disadvantages to this simple system. First, as only differences
are sent, it is impossible to begin decoding after the start of the transmission.
This limitation makes it difficult for a decoder to provide pictures following a
switch from one bit stream to another (as occurs when the viewer changes
channels). Second, if any part of the difference data is incorrect, the error in
the picture will propagate indefinitely.
The solution to these problems is to use a system that is not completely differential.
Figure 2-10 shows that periodically complete pictures are sent. These
are called Intra-coded pictures (or I-pictures), and they are obtained by spatial
compression only. If an error or a channel switch occurs, it will be possible to
resume correct decoding at the next I-picture.
2.8 Motion Compensation
Motion reduces the similarities between pictures and increases the data needed
to create the difference picture. Motion compensation is used to increase
the similarity. Figure 2-11 shows the principle. When an object moves across
the TV screen, it may appear in a different place in each picture, but it does
not change in appearance very much. The picture difference can be reduced by
measuring the motion at the encoder. This is sent to the decoder as a vector.
The decoder uses the vector to shift part of the previous picture to a more
appropriate place in the new picture.
Figure 2-10. A system that is not completely differential: transmission starts with
an intra-coded picture, temporally coded difference pictures follow, and the next
intra picture allows decoding to restart.

Figure 2-11. Motion compensation of part of a moving object: 1. compute the motion
vector; 2. shift data from picture N using the vector to make a predicted picture
N+1; 3. compare the actual picture with the predicted picture; 4. send the vector
and the prediction error.
One vector controls the shifting of an entire area of the picture that is known
as a macroblock. The size of the macroblock is determined by the DCT coding
and the color subsampling structure. Figure 2-12a shows that, with a 4:2:0
system, the vertical and horizontal spacing of color samples is exactly twice the
spacing of luminance. A single 8 x 8 DCT block of color samples extends over
the same area as four 8 x 8 luminance blocks; therefore this is the minimum
picture area which can be shifted by a vector. One 4:2:0 macroblock contains
four luminance blocks, one Cr block, and one Cb block.
In the 4:2:2 profile, color is only subsampled in the horizontal axis. Figure
2-12b shows that in 4:2:2, a single 8 x 8 DCT block of color samples extends
over two luminance blocks. A 4:2:2 macroblock contains four luminance
blocks, two Cr blocks, and two Cb blocks.
The motion estimator works by comparing the luminance data from two successive
pictures. A macroblock in the first picture is used as a reference. When
the input is interlaced, pixels will be at different vertical locations in the two
fields, and it will, therefore, be necessary to interpolate one field before it can
be compared with the other. The correlation between the reference and the
next picture is measured at all possible displacements with a resolution of half
a pixel over the entire search range. When the greatest correlation is found,
this correlation is assumed to represent the correct motion.
The motion vector has a vertical and horizontal component. In typical program
material, a moving object may extend over a number of macroblocks. A greater
compression factor is obtained if the vectors are transmitted differentially.
When a large object moves, adjacent macroblocks have the same vectors and
the vector difference becomes zero.
Motion vectors are associated with macroblocks, not with real objects in the
image, and there will be occasions where part of the macroblock moves and
part of it does not. In this case, it is impossible to compensate properly. If the
motion of the moving part is compensated by transmitting a vector, the stationary
part will be incorrectly shifted, and it will need difference data to be corrected.
Figure 2-12. Macroblock structure: (a) 4:2:0 has one quarter as many chroma sampling
points as Y, so a 16 x 16 macroblock contains four 8 x 8 Y blocks plus one 8 x 8 Cr
block and one 8 x 8 Cb block; (b) 4:2:2 has twice as much chroma data as 4:2:0, so a
16 x 16 macroblock contains four Y blocks, two Cr blocks, and two Cb blocks.
9.13 DVB Tests · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · 53
Glossary · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · 55
Section 1 – Introduction to MPEG
MPEG is one of the most popular audio/video compression techniques because
it is not just a single standard. Instead, it is a range of standards suitable for
different applications but based on similar principles. MPEG is an acronym for
the Moving Picture Experts Group, which was set up by the ISO (the International
Organization for Standardization) to work on compression.
At times, MPEG can seem like little more than an interaction of acronyms. As ETSI
stated: “The CAT is a pointer to enable the IRD to find the EMMs associated with
the CA system(s) that it uses.” If you can understand that sentence, you don't
need this book!
1.1 Convergence
Digital techniques have made rapid progress in audio and video for a number
of reasons. Digital information is more robust and can be coded to substantially
eliminate error. This means that generation-loss in recording and losses in
transmission are eliminated. The Compact Disc was the first consumer product
to demonstrate this.
While the CD has an improved sound quality with respect to its vinyl predecessor,
comparison of quality alone misses the point. The real point is that digital
recording and transmission techniques allow content manipulation to a degree
that is impossible with analog. Once audio or video are digitized, they become
data. Such data cannot be distinguished from any other kind of data; therefore,
digital video and audio become the province of computer technology.
The convergence of computers and audio/video is an inevitable consequence of
the key inventions of computing and Pulse Code Modulation. Digital media can
store any type of information, so it is easy to use a computer storage device for
digital video. The nonlinear workstation was the first example of an application
of convergent technology that did not have an analog forerunner. Another
example, multimedia, mixed the storage of audio, video, graphics, text, and
data on the same medium. Multimedia is impossible in the analog domain.
1.2 Why Compression is Needed
The initial success of digital video was in post-production applications, where
the high cost of digital video was offset by its limitless layering and effects
capability. However, production-standard digital video generates over 200
megabits per second of data and this bit rate requires extensive capacity for
storage and wide bandwidth for transmission. Digital video could only be used
in wider applications if the storage and bandwidth requirements could be
eased; easing these requirements is the purpose of compression.
Compression is a way of expressing digital audio and video by using less data.
Compression has the following advantages:
- A smaller amount of storage is needed for a given amount of source material.
With high-density recording such as tape, compression allows highly miniaturized
equipment for consumer and Electronic News Gathering (ENG) use. The access time
of tape improves with compression because less tape needs to be shuttled to skip
over a given amount of program. With expensive storage media such as RAM,
compression makes new applications affordable.
- When working in real time, compression reduces the bandwidth needed.
Additionally, compression allows faster-than-real-time transfer between media,
for example, between tape and disk.
- A compressed recording format can afford a lower recording density, which can
make the recorder less sensitive to environmental factors and maintenance.
1.3 Applications of Compression
Compression has a long association with television. Interlace is a simple form
of compression giving a 2:1 reduction in bandwidth. The use of color-difference
signals instead of GBR is another form of compression. Because the eye is less
sensitive to color detail, the color-difference signals need less bandwidth. When
color broadcasting was introduced, the channel structure of monochrome had
to be retained and composite video was developed. Composite video systems,
such as PAL, NTSC, and SECAM, are forms of compression because they use
the same bandwidth for color as was used for monochrome.
Figure 1-1a shows that in traditional television systems, the GBR camera signal
is converted to Y, Pr, Pb components for production and encoded into analog
composite for transmission. Figure 1-1b shows the modern equivalent. The Y, Pr,
Pb signals are digitized and carried as Y, Cr, Cb signals in SDI form through the
production process prior to being encoded with MPEG for transmission. Clearly,
MPEG can be considered by the broadcaster as a more efficient replacement
for composite video. In addition, MPEG has greater flexibility because the bit
rate required can be adjusted to suit the application. At lower bit rates and resolutions,
MPEG can be used for video conferencing and video telephones.
DVB and ATSC (the European- and American-originated digital-television broadcasting
standards) would not be viable without compression because the bandwidth
required would be too great. Compression extends the playing time of
DVD (digital video/versatile disc) allowing full-length movies on a standard size
compact disc. Compression also reduces the cost of Electronic News Gathering and
other contributions to television production.
In tape recording, mild compression eases tolerances and adds reliability in
Digital Betacam and Digital-S, whereas in SX, DVC, DVCPRO, and DVCAM, the
goal is miniaturization. In magnetic disk drives, such as the Tektronix Profile® storage
system, that are used in file servers and networks (especially for news purposes),
compression lowers storage cost. Compression also lowers bandwidth,
which allows more users to access a given server. This characteristic is also
important for VOD (Video On Demand) applications.
1.4 Introduction to Video Compression
In all real program material, there are two types of components of the signal: those
which are novel and unpredictable and those which can be anticipated. The
novel component is called entropy and is the true information in the signal. The
remainder is called redundancy because it is not essential. Redundancy may be
spatial, as it is in large plain areas of picture where adjacent pixels have
almost the same value. Redundancy can also be temporal as it is where similarities
between successive pictures are used. All compression systems work by separating
the entropy from the redundancy in the encoder. Only the entropy is
recorded or transmitted and the decoder computes the redundancy from the transmitted
signal. Figure 1-2a shows this concept.
An ideal encoder would extract all the entropy and only this will be transmitted
to the decoder. An ideal decoder would then reproduce the original signal. In practice,
this ideal cannot be reached. An ideal coder would be complex and cause
a very long delay in order to use temporal redundancy. In certain applications,
such as recording or broadcasting, some delay is acceptable, but in videoconferencing
it is not. In some cases, a very complex coder would be too expensive.
It follows that there is no one ideal compression system.
In practice, a range of coders is needed which have a range of processing
delays and complexities. The power of MPEG is that it is not a single compression
format, but a range of standardized coding tools that can be combined
flexibly to suit a range of applications. The way in which coding has been performed
is included in the compressed data so that the decoder can automatically
handle whatever the coder decided to do.
MPEG coding is divided into several profiles that have different complexity, and
each profile can be implemented at a different level depending on the resolution of
the input picture. Section 2 considers profiles and levels in detail.
Figure 1-1. a) Traditional system: the camera's GBR output is matrixed to Y, Pr,
Pb for the production process and encoded to analog composite out (PAL, NTSC, or
SECAM). b) Modern equivalent: Y, Pr, Pb is digitized by an ADC, carried as Y, Cr,
Cb over SDI through the production process, and MPEG coded to a compressed
digital output.
There are many different digital video formats and each has a different bit rate.
For example, a high definition system might have six times the bit rate of a
standard definition system. Consequently, just knowing the bit rate out of the
coder is not very useful. What matters is the compression factor, which is the
ratio of the input bit rate to the compressed bit rate, for example 2:1, 5:1, and
so on.
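As a minimal worked example (both bit rates below are assumed, round figures,
not taken from any particular standard), the calculation is simply a ratio:

# Compression factor = input bit rate / compressed bit rate.
# Both figures here are illustrative assumptions.
input_bit_rate = 270_000_000       # an assumed 270 Mb/s uncompressed feed
compressed_bit_rate = 6_000_000    # an assumed 6 Mb/s compressed output

compression_factor = input_bit_rate / compressed_bit_rate
print(f"{compression_factor:.0f}:1")   # prints 45:1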
Unfortunately, the number of variables involved makes it very difficult to
determine a suitable compression factor. Figure 1-2a shows that for an ideal
coder, if all of the entropy is sent, the quality is good. However, if the
compression factor is increased in order to reduce the bit rate, not all of the
entropy is sent and the quality falls. Note that in a compressed system, when
quality loss occurs, it occurs steeply (Figure 1-2b). If the available bit rate
is inadequate, it is better
to avoid this area by reducing the entropy of the input picture. This can be
done by filtering. The loss of resolution caused by the filtering is subjectively
more acceptable than the compression artifacts.
To identify the entropy perfectly, an ideal compressor would have to be
extremely complex. A practical compressor may be less complex for economic
reasons and must send more data to be sure of carrying all of the entropy. Figure
1-2b shows the relationship between coder complexity and performance. The higher
the compression factor required, the more complex the encoder has to be.
The entropy in video signals varies. A recording of an announcer delivering the
news has much redundancy and is easy to compress. In contrast, it is more difficult
to compress a recording with leaves blowing in the wind or one of a football
crowd that is constantly moving and therefore has less redundancy (more information
or entropy). In either case, if all the entropy is not sent, there will be quality loss.
Thus, we may choose between a constant bit-rate channel with variable quality
and a constant-quality channel with variable bit rate. Telecommunications network operators
tend to prefer a constant bit rate for practical purposes, but a buffer memory
can be used to average out entropy variations if the resulting increase in delay
is acceptable. In recording, a variable bit rate may be easier to handle and DVD
uses variable bit rate, speeding up the disc where difficult material exists.
Figure 1-2. a) An ideal coder sends only the entropy; a non-ideal coder has to
send more, and a short-delay coder has to send even more. b) Quality worsens as
the compression factor rises and improves with coder complexity. c) Latency
rises with the compression factor.

Intra-coding (intra = within) is a technique that exploits spatial redundancy, or
redundancy within the picture; inter-coding (inter = between) is a technique
that exploits temporal redundancy. Intra-coding may be used alone, as in the JPEG
standard for still pictures, or combined with inter-coding as in MPEG.
Intra-coding relies on two characteristics of typical images. First, not all spatial
frequencies are simultaneously present, and second, the higher the spatial frequency,
the lower the amplitude is likely to be. Intra-coding requires analysis of
the spatial frequencies in an image. This analysis is the purpose of transforms
such as wavelets and DCT (discrete cosine transform). Transforms produce
coefficients which describe the magnitude of each spatial frequency. Typically,
many coefficients will be zero, or nearly zero, and these coefficients can be
omitted, resulting in a reduction in bit rate.
Inter-coding relies on finding similarities between successive pictures. If a
given picture is available at the decoder, the next picture can be created by sending
only the picture differences. The picture differences will be increased when
objects move, but this magnification can be offset by using motion compensation,
since a moving object does not generally change its appearance very much
from one picture to the next. If the motion can be measured, a closer approximation
to the current picture can be created by shifting part of the previous picture to
a new location. The shifting process is controlled by a vector that is transmitted to
the decoder. The vector transmission requires less data than sending the picture-
difference data.
MPEG can handle both interlaced and non-interlaced images. An image at some
point on the time axis is called a “picture,” whether it is a field or a frame.
Interlace is not ideal as a source for digital compression because it is in itself a compression
technique. Temporal coding is made more complex because pixels in
one field are in a different position to those in the next.
Motion compensation minimizes but does not eliminate the differences between
successive pictures. The picture-difference is itself a spatial image and can be
compressed using transform-based intra-coding as previously described.
Motion compensation simply reduces the amount of data in the difference
image.
The efficiency of a temporal coder rises with the time span over which it can act.
Figure 1-2c shows that if a high compression factor is required, a longer time
span in the input must be considered and thus a longer coding delay will be
experienced. Clearly, temporally coded signals are difficult to edit because the
content of a given output picture may be based on image data which was transmitted
some time earlier. Production systems will have to limit the degree of
temporal coding to allow editing and this limitation will in turn limit the available
compression factor.
1.5 Introduction to Audio Compression
The bit rate of a PCM digital audio channel is only about one megabit per second,
which is about 0.5% of 4:2:2 digital video. With mild video compression
schemes, such as Digital Betacam, audio compression is unnecessary. But, as
the video compression factor is raised, it becomes necessary to compress the
audio as well.
Audio compression takes advantage of two facts. First, in typical audio signals,
not all frequencies are simultaneously present. Second, because of the phenomenon
of masking, human hearing cannot discern every detail of an audio
signal. Audio compression splits the audio spectrum into bands by filtering or
transforms, and includes less data when describing bands in which the level is
low. Where masking prevents or reduces audibility of a particular band, even
less data needs to be sent.
Audio compression is not as easy to achieve as video compression because
of the acuity of hearing. Masking only works properly when the masking and
the masked sounds coincide spatially. Spatial coincidence is always the case in
mono recordings but not in stereo recordings, where low-level signals can still
be heard if they are in a different part of the sound stage. Consequently, in
stereo and surround sound systems, a lower compression factor is allowable
for a given quality. Another factor complicating audio compression is that
delayed resonances in poor loudspeakers actually mask compression artifacts.
Testing a compressor with poor speakers gives a false result, and signals
which are apparently satisfactory may be disappointing when heard on good
equipment.
1.6 MPEG Signals
The output of a single MPEG audio or video coder is called an Elementary
Stream. An Elementary Stream is an endless near real-time signal. For convenience,
it can be broken into data blocks of manageable size, forming a Packetized
Elementary Stream (PES). These data blocks need header information to identify
the start of the packets and must include time stamps because packetizing
disrupts the time axis.
Figure 1-3 shows that one video PES and a number of audio PES can be combined
to form a Program Stream, provided that all of the coders are locked to a
common clock. Time stamps in each PES ensure lip-sync between the video
and audio. Program Streams have variable-length packets with headers. They
find use in data transfers to and from optical and hard disks, which are error
free and in which files of arbitrary sizes are expected. DVD uses Program
Streams.
For transmission and digital broadcasting, several programs and their associated
PES can be multiplexed into a single Transport Stream. A Transport Stream
differs from a Program Stream in that the PES packets are further subdivided
into short fixed-size packets and in that multiple programs encoded with different
clocks can be carried. This is possible because a transport stream has a
program clock reference (PCR) mechanism which allows transmission of multiple
clocks, one of which is selected and regenerated at the decoder. A Single
Program Transport Stream (SPTS) is also possible and this may be found
between a coder and a multiplexer. Since a Transport Stream can genlock the
decoder clock to the encoder clock, the Single Program Transport Stream
(SPTS) is more common than the Program Stream.
A Transport Stream is more than just a multiplex of audio and video PES. In
addition to the compressed audio, video, and data, a Transport Stream includes
a great deal of metadata describing the bit stream. This includes the Program
Association Table (PAT) that lists every program in the transport stream. Each
entry in the PAT points to a Program Map Table (PMT) that lists the elementary
streams making up each program. Some programs will be open, but some programs
may be subject to conditional access (encryption) and this information is
also carried in the metadata.
The Transport Stream consists of fixed-size data packets, each containing 188
bytes. Each packet carries a packet identifier code (PID). Packets in the same
elementary stream all have the same PID, so that the decoder (or a demultiplexer)
can select the elementary stream(s) it wants and reject the remainder.
Packet-continuity counts ensure that every packet that is needed to decode a
stream is received. An effective synchronization system is needed so that
decoders can correctly identify the beginning of each packet and deserialize
the bit stream into words.
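The fixed layout makes the essential fields easy to locate. The Python sketch
below pulls the PID and continuity count out of one 188-byte packet; the header
layout used (sync byte 0x47, 13-bit PID, 4-bit continuity counter) follows the
MPEG-2 Systems packet header, but the function itself is only an illustrative
fragment, not part of any real demultiplexer:

SYNC_BYTE = 0x47
PACKET_SIZE = 188

def parse_ts_header(packet: bytes) -> dict:
    """Return the key header fields of one transport stream packet."""
    if len(packet) != PACKET_SIZE or packet[0] != SYNC_BYTE:
        raise ValueError("not a valid 188-byte transport packet")
    pid = ((packet[1] & 0x1F) << 8) | packet[2]  # 13-bit packet identifier
    cc = packet[3] & 0x0F                        # 4-bit continuity counter
    pusi = bool(packet[1] & 0x40)                # payload unit start indicator
    return {"pid": pid, "cc": cc, "payload_start": pusi}

# A hypothetical packet: PID 256, continuity count 7, payload start set.
pkt = bytes([0x47, 0x41, 0x00, 0x17]) + bytes(184)
print(parse_ts_header(pkt))   # {'pid': 256, 'cc': 7, 'payload_start': True}

A demultiplexer built along these lines would keep the packets whose PID matches
the wanted elementary stream and check that the continuity count increments
modulo 16, flagging any lost packets.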
1.7 Need for Monitoring and Analysis
The MPEG transport stream is an extremely complex structure using interlinked
tables and coded identifiers to separate the programs and the elementary
streams within the programs. Within each elementary stream, there is a complex
structure, allowing a decoder to distinguish between, for example, vectors,
coefficients and quantization tables.
Failures can be divided into two broad categories. In the first category, the
transport system correctly multiplexes and delivers information from an
encoder to a decoder with no bit errors or added jitter, but the encoder or the
decoder has a fault. In the second category, the encoder and decoder are fine,
but the transport of data from one to the other is defective. It is very important
to know whether the fault lies in the encoder, the transport, or the decoder if a
prompt solution is to be found.
Figure 1-3. Video data and audio data are encoded and packetized into a video
PES and an audio PES; these, together with other data, feed either a Program
Stream multiplexer (as used by DVD) or a Transport Stream multiplexer that
produces a Single Program Transport Stream.

Synchronizing problems, such as loss or corruption of sync patterns, may prevent
reception of the entire transport stream. Transport-stream protocol defects
may prevent the decoder from finding all of the data for a program, perhaps
delivering picture but not sound. Correct delivery of the data but with excessive
jitter can cause decoder timing problems.
If a system using an MPEG transport stream fails, the fault could be in the
encoder, the multiplexer, or in the decoder. How can this fault be isolated?
First, verify that a transport stream is compliant with the MPEG-coding standards.
If the stream is not compliant, a decoder can hardly be blamed for having
difficulty. If the stream is compliant, the decoder itself may need attention.
Traditional video testing tools (the signal generator, the waveform monitor, and
the vectorscope) are not appropriate for analyzing MPEG systems, except to ensure
that the video signals entering and leaving an MPEG system are of suitable
quality. Instead, a reliable source of valid MPEG test signals is essential for
testing receiving equipment and decoders. With a suitable analyzer, the performance
of encoders, transmission systems, multiplexers, and remultiplexers
can be assessed with a high degree of confidence. As a long-standing supplier
of high-grade test equipment to the video industry, Tektronix continues to provide
test and measurement solutions as the technology evolves, giving the
MPEG user the confidence that complex compressed systems are correctly
functioning and allowing rapid diagnosis when they are not.
1.8 Pitfalls of Compression
MPEG compression is lossy in that what is decoded is not identical to the original.
The entropy of the source varies, and when entropy is high, the compression
system may leave visible artifacts when decoded. In temporal compression,
redundancy between successive pictures is assumed. When this is not the
case, the system fails. An example is video from a press conference where
flashguns are firing. Individual pictures containing the flash are totally different
from their neighbors, and coding artifacts become obvious.
Irregular motion or several independently moving objects on screen require a
lot of vector bandwidth and this requirement may only be met by reducing the
picture-data bandwidth. Again, visible artifacts may occur whose level varies
and depends on the motion. This problem often occurs in sports-coverage
video.
Coarse quantizing results in luminance contouring and posterized color. These
can be seen as blotchy shadows and blocking on large areas of plain color.
Subjectively, compression artifacts are more annoying than the relatively constant
impairments resulting from analog television transmission systems.
The only solution to these problems is to reduce the compression factor.
Consequently, the compression user has to make a value judgment between
the economy of a high compression factor and the level of artifacts.
In addition to extending the encoding and decoding delay, temporal coding also
causes difficulty in editing. In fact, an MPEG bit stream cannot be arbitrarily
edited at all. This restriction occurs because, in temporal coding, the decoding
of one picture may require the contents of an earlier picture and the contents
may not be available following an edit. The fact that pictures may be sent out
of sequence also complicates editing.
If suitable coding has been used, edits can take place only at splice points,
which are relatively widely spaced. If arbitrary editing is required, the MPEG
stream must undergo a read-modify-write process, which will result in generation
loss.
The viewer is not interested in editing, but the production user will have to
make another value judgment about the edit flexibility required. If greater flexibility
is required, the temporal compression has to be reduced and a higher bit
rate will be needed.
Section 2 – Compression in Video
This section shows how video compression is based on the perception of the
eye. Important enabling techniques, such as transforms and motion compensation,
are considered as an introduction to the structure of an MPEG coder.
2.1 Spatial or Temporal Coding?
As was seen in Section 1, video compression can take advantage of both spatial
and temporal redundancy. In MPEG, temporal redundancy is reduced first
by using similarities between successive pictures. As much as possible of the
current picture is created or “predicted” by using information from pictures
already sent. When this technique is used, it is only necessary to send a difference
picture, which eliminates the differences between the actual picture and
the prediction. The difference picture is then subject to spatial compression. As
a practical matter it is easier to explain spatial compression prior to explaining
temporal compression.
Spatial compression relies on similarities between adjacent pixels in plain areas
of picture and on dominant spatial frequencies in areas of patterning. The JPEG
system uses spatial compression only, since it is designed to transmit individual
still pictures. However, JPEG may be used to code a succession of individual
pictures for video. In the so-called “Motion JPEG” application, the compression
factor will not be as good as if temporal coding was used, but the bit
stream will be freely editable on a picture-by-picture basis.
2.2 Spatial Coding
The first step in spatial coding is to perform an analysis of spatial frequency
using a transform. A transform is simply a way of expressing a waveform in a
different domain, in this case, the frequency domain. The output of a transform
is a set of coefficients that describe how much of a given frequency is present.
An inverse transform reproduces the original waveform. If the coefficients are
handled with sufficient accuracy, the output of the inverse transform is identical
to the original waveform.
The most well known transform is the Fourier transform. This transform finds
each frequency in the input signal. It finds each frequency by multiplying the
input waveform by a sample of a target frequency, called a basis function, and
integrating the product. Figure 2-1 shows that when the input waveform does
not contain the target frequency, the integral will be zero, but when it does, the
integral will be a coefficient describing the amplitude of that component frequency.
The results will be as described if the frequency component is in phase with
the basis function. However if the frequency component is in quadrature with
the basis function, the integral will still be zero. Therefore, it is necessary to
perform two searches for each frequency, with the basis functions in quadrature
with one another so that every phase of the input will be detected.
Figure 2-1. Searching for a frequency: there is no correlation between the input
and the basis function if the frequencies differ, and high correlation if the
frequencies are the same.
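A small numeric sketch of this idea (the block length and frequencies below are
arbitrary choices for illustration): correlating against both a cosine and a
sine basis function catches a component at any phase.

import numpy as np

N = 256
n = np.arange(N)
signal = np.cos(2 * np.pi * 8 * n / N + 0.7)   # one component, arbitrary phase

def correlation(freq):
    cos_basis = np.cos(2 * np.pi * freq * n / N)
    sin_basis = np.sin(2 * np.pi * freq * n / N)
    # Two searches in quadrature, combined, detect every input phase.
    return np.hypot(np.dot(signal, cos_basis), np.dot(signal, sin_basis))

print(round(correlation(8)))   # large: the target frequency is present
print(round(correlation(9)))   # ~0: no correlation at a different frequency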
The Fourier transform has the disadvantage of requiring coefficients for both
sine and cosine components of each frequency. In the cosine transform, the
input waveform is time-mirrored with itself prior to multiplication by the basis
functions. Figure 2-2 shows that this mirroring cancels out all sine components
and doubles all of the cosine components. The sine basis function is unnecessary
and only one coefficient is needed for each frequency.
The discrete cosine transform (DCT) is the sampled version of the cosine transform
and is used extensively in two-dimensional form in MPEG. A block of
8 x 8 pixels is transformed to become a block of 8 x 8 coefficients. Since the
transform requires multiplication by fractions, there is wordlength extension,
resulting in coefficients that have longer wordlength than the pixel values.
Typically an 8-bit pixel block results in an 11-bit coefficient block. Thus, a DCT
does not result in any compression; in fact it results in the opposite. However,
the DCT converts the source pixels into a form where compression is easier.
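As a sketch of the arithmetic (the pixel values are invented, and the
orthonormally scaled DCT-II used here is a common textbook convention rather
than MPEG's exact integer arithmetic), the forward and inverse 8 x 8 transforms
can be written as two matrix products:

import numpy as np

N = 8
k = np.arange(N)
# Row u of C samples a cosine of spatial frequency u (DCT-II, orthonormal).
C = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * N))
C[0, :] = np.sqrt(1.0 / N)

block = np.full((N, N), 128.0)   # invented pixels: a plain gray block
block[:, 4:] = 140.0             # with one vertical edge

coeffs = C @ block @ C.T         # forward 2-D DCT: 64 pixels -> 64 coefficients
restored = C.T @ coeffs @ C      # inverse 2-D DCT
print(np.count_nonzero(np.abs(coeffs) > 1e-9))  # 5: most coefficients are zero
print(np.allclose(block, restored))             # True: exact if not requantized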
Figure 2-3 shows the results of an inverse transform of each of the individual
coefficients of an 8 x 8 DCT. In the case of the luminance signal, the top-left
coefficient is the average brightness or DC component of the whole block.
Moving across the top row, horizontal spatial frequency increases. Moving
down the left column, vertical spatial frequency increases. In real pictures, different
vertical and horizontal spatial frequencies may occur simultaneously and
a coefficient at some point within the block will represent all possible horizontal
and vertical combinations.
Figure 2-2. Mirroring the input: the cosine component is coherent through the
mirror, while the sine component inverts at the mirror and cancels.

Figure 2-3. The 8 x 8 array of DCT basis patterns, with horizontal spatial
frequency (H) increasing to the right and vertical spatial frequency (V)
increasing downward, together with the corresponding one-dimensional horizontal
spatial frequency waveforms.
Figure 2-3 also shows the coefficients as a one-dimensional horizontal waveform.
Combining these waveforms with various amplitudes and either polarity
can reproduce any combination of 8 pixels. Thus combining the 64 coefficients
of the 2-D DCT will result in the original 8 x 8 pixel block. Clearly for color pictures,
the color difference samples will also need to be handled. Y, Cr, and Cb
data are assembled into separate 8 x 8 arrays and are transformed individually.
In much real program material, many of the coefficients will have zero or near-zero
values and, therefore, will not be transmitted. This fact results in significant
compression that is virtually lossless. If a higher compression factor is
needed, then the wordlength of the non-zero coefficients must be reduced. This
reduction will reduce accuracy of these coefficients and will introduce losses
into the process. With care, the losses can be introduced in a way that is least
visible to the viewer.
2.3 Weighting
Figure 2-4 shows that the human perception of noise in pictures is not uniform
but is a function of the spatial frequency. More noise can be tolerated at high
spatial frequencies. Also, video noise is effectively masked by fine detail in the
picture, whereas in plain areas it is highly visible. The reader will be aware that
traditional noise measurements are always weighted so that the technical
measurement relates to the subjective result.
Compression reduces the accuracy of coefficients and has a similar effect to
using shorter wordlength samples in PCM; that is, the noise level rises. In
PCM, the result of shortening the word-length is that the noise level rises
equally at all frequencies. As the DCT splits the signal into different frequencies,
it becomes possible to control the spectrum of the noise. Effectively, low-frequency
coefficients are rendered more accurately than high-frequency coefficients
by a process of weighting.
Figure 2-5 shows that, in the weighting process, the coefficients from the DCT
are divided by constants that are a function of two-dimensional frequency.
Low-frequency coefficients will be divided by small numbers, and high-frequency
coefficients will be divided by large numbers. Following the division,
the least-significant bit is discarded or truncated. This truncation is a form of
requantizing. In the absence of weighting, this requantizing would have the
effect of doubling the size of the quantizing step, but with weighting, it increases
the step size according to the division factor.
As a result, coefficients representing low spatial frequencies are requantized
with relatively small steps and suffer little increased noise. Coefficients representing
higher spatial frequencies are requantized with large steps and suffer
more noise. However, having fewer steps means that fewer bits are needed to
identify each step, and compression is obtained.
In the decoder, a low-order zero will be added to return the weighted coefficients
to their correct magnitude. They will then be multiplied by inverse
weighting factors. Clearly, at high frequencies the multiplication factors will be
larger, so the requantizing noise will be greater. Following inverse weighting,
the coefficients will have their original DCT output values, plus requantizing
error, which will be greater at high frequency than at low frequency.
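Both halves of this process can be sketched numerically. In the fragment below,
the coefficient values and the divisor table are invented (real MPEG weighting
tables are 8 x 8 and are carried in the bit stream):

import numpy as np

coeffs = np.array([[980.0, 48.0],
                   [ 36.0, 27.0]])      # invented "DCT" coefficients
weights = np.array([[ 8.0, 16.0],
                    [16.0, 32.0]])      # small divisors at low frequencies

sent = np.round(coeffs / weights)       # requantize: the lossy step
restored = sent * weights               # decoder applies inverse weighting
print(sent)                             # quantized values are small integers
print(coeffs - restored)                # error is bounded by half of each divisor

Because the divisors grow with spatial frequency, high-frequency coefficients
survive requantizing only if they are large, which is exactly the behavior the
eye tolerates best.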
As an alternative to truncation, weighted coefficients may be nonlinearly
requantized so that the quantizing step size increases with the magnitude of
the coefficient. This technique allows higher compression factors but worse
levels of artifacts.
Clearly, the degree of compression obtained and, in turn, the output bit rate
obtained, is a function of the severity of the requantizing process. Different bit
rates will require different weighting tables. In MPEG, it is possible to use various
different weighting tables and the table in use can be transmitted to the
decoder, so that correct decoding automatically occurs.
Figure 2-4. Human noise sensitivity as a function of spatial frequency.
2.4 Scanning
In typical program material, the significant DCT coefficients are generally found
in the top-left corner of the matrix. After weighting, low-value coefficients might
be truncated to zero. More efficient transmission can be obtained if all of the
non-zero coefficients are sent first, followed by a code indicating that the
remainder are all zero. Scanning is a technique which increases the probability
of achieving this result, because it sends coefficients in descending order of
magnitude probability. Figure 2-6a shows that in a noninterlaced system, the
probability of a coefficient having a high value is highest in the top-left corner
and lowest in the bottom-right corner. A 45-degree diagonal zig-zag scan is the
best sequence to use here.
In Figure 2-6b, the scan for an interlaced source is shown. In an interlaced picture,
an 8 x 8 DCT block from one field extends over twice the vertical screen
area, so that for a given picture detail, vertical frequencies will appear to be
twice as great as horizontal frequencies. Thus, the ideal scan for an interlaced
picture will be on a diagonal that is twice as steep. Figure 2-6b shows that a
given vertical spatial frequency is scanned before scanning the same horizontal
spatial frequency.
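A sketch of how such a scan order can be generated (this walks the
anti-diagonals of the block, alternating direction, which reproduces the classic
zig-zag of Figure 2-6a; the steeper scan for interlaced sources would traverse
vertical frequencies earlier):

def zigzag(n=8):
    """Classic zig-zag scan order as (row, column) pairs."""
    order = []
    for s in range(2 * n - 1):              # s = row + column, one diagonal
        diag = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        if s % 2 == 0:
            diag.reverse()                  # alternate direction per diagonal
        order.extend(diag)
    return order

print(zigzag()[:10])
# [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2), (0, 3), (1, 2), (2, 1), (3, 0)]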
2.5 Entropy Coding
In real video, not all spatial frequencies are simultaneously present; therefore,
the DCT coefficient matrix will have zero terms in it. Despite the use of scanning,
zero coefficients will still appear between the significant values. Run-length
coding (RLC) allows these coefficients to be handled more efficiently.
Figure 2-5. Weighting in practice: input DCT coefficients (a more complex block)
are divided first by quant matrix values (the value used corresponds to the
coefficient location) and then by a quant scale value (one value used for the
complete 8 x 8 block; both linear and non-linear quant scale tables exist) to
produce the output DCT coefficients.
Where repeating values, such as a string of 0s, are present, run-length coding
simply transmits the number of zeros rather than each individual zero.
The probability of occurrence of particular coefficient values in real video can
be studied. In practice, some values occur very often; others occur less often.
This statistical information can be used to achieve further compression using
variable length coding (VLC). Frequently occurring values are converted to short
code words, and infrequent values are converted to long code words. To aid
deserialization, no code word can be the prefix of another.
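The fragment below sketches the run-length half of this scheme on an invented
coefficient list; in a real coder each (run, value) pair would then be given a
variable-length code word, with the shortest words reserved for the most
frequent pairs:

def rlc_encode(coeffs):
    """Run-length code the zero runs in a scanned coefficient list."""
    out, run = [], 0
    for c in coeffs:
        if c == 0:
            run += 1              # count zeros instead of sending each one
        else:
            out.append((run, c))  # (length of zero run, nonzero value)
            run = 0
    out.append("EOB")             # end of block: the remainder is all zero
    return out

print(rlc_encode([42, 0, 0, 5, -1, 0, 0, 0, 2, 0, 0, 0, 0]))
# [(0, 42), (2, 5), (0, -1), (3, 2), 'EOB']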
2.6 A Spatial Coder
Figure 2-7 ties together all of the preceding spatial coding concepts. The input
signal is assumed to be 4:2:2 SDI (Serial Digital Interface), which may have 8-
or 10-bit wordlength. MPEG uses only 8-bit resolution; therefore, a rounding
stage will be needed when the SDI signal contains 10-bit words. Most MPEG
profiles operate with 4:2:0 sampling; therefore, a vertical low-pass filter/interpolation
stage will be needed. Rounding and color subsampling introduce a small
irreversible loss of information and a proportional reduction in bit rate. The
raster-scanned input format will need to be stored so that it can be converted
to 8 x 8 pixel blocks.
Figure 2-6. a) Zigzag or classic scan (nominally for frames). b) Alternate scan
(nominally for fields).

Figure 2-7. A spatial coder. Full bit-rate 10-bit data is converted from 4:2:2
to 8-bit 4:2:0 (information lost, data reduced), transformed by the DCT (no
loss, no data reduction), quantized (information lost, data reduced), and
entropy coded (data reduced, no loss) into a buffer whose fullness drives rate
control back to the quantizer. Entropy coding combines run-length coding (send a
unique code word instead of strings of zeros) and variable-length coding (use
short words for the most frequent values, like Morse code).
The DCT stage transforms the picture information to the frequency domain. The
DCT itself does not achieve any compression. Following DCT, the coefficients
are weighted and truncated, providing the first significant compression. The
coefficients are then zig-zag scanned to increase the probability that the significant
coefficients occur early in the scan. After the last non-zero coefficient, an
EOB (end of block) code is generated.
Coefficient data are further compressed by run-length and variable-length coding.
In a variable bit-rate system, the quantizing is fixed, but in a fixed bit-rate
system, a buffer memory is used to absorb variations in coding difficulty. Highly
detailed pictures will tend to fill the buffer, whereas plain pictures will allow it
to empty. If the buffer is in danger of overflowing, the requantizing steps will
have to be made larger, so that the compression factor is effectively raised.
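The feedback can be pictured with a toy model. Every number below is invented,
and the bits-per-block function is only a stand-in for a real coder:

buffer_bits, buffer_limit = 0, 2_000       # invented buffer model
drain_per_block, quant_scale = 800, 8      # invented constant channel drain

def bits_for_block(difficulty, scale):
    return int(difficulty / scale)         # stand-in: finer quantizing costs more

for difficulty in [5_000, 9_000, 20_000, 20_000, 6_000]:
    produced = bits_for_block(difficulty, quant_scale)
    buffer_bits = max(0, buffer_bits + produced - drain_per_block)
    if buffer_bits > 0.8 * buffer_limit:   # nearly full: raise the step size
        quant_scale *= 2
    print(buffer_bits, quant_scale)        # quantizing coarsens as buffer fills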
In the decoder, the bit stream is deserialized and the entropy coding is
reversed to reproduce the weighted coefficients. The inverse weighting is
applied and coefficients are placed in the matrix according to the zig-zag scan
to recreate the DCT matrix. Following an inverse transform, the 8 x 8 pixel
block is recreated. To obtain a raster-scanned output, the blocks are stored in
RAM, which is read a line at a time. To obtain a 4:2:2 output from 4:2:0 data, a
vertical interpolation process will be needed as shown in Figure 2-8.
The chroma samples in 4:2:0 are positioned halfway between luminance samples in
the vertical axis so that they are evenly spaced when an interlaced source is used.

Figure 2-8. Chroma sample positions in 4:2:2 (Rec. 601), 4:1:1, and 4:2:0,
showing the positions of luminance samples (Y) and chrominance samples (Cb, Cr)
on successive lines N, N+1, N+2, N+3.

2.7 Temporal Coding

Figure 2-9. A one-picture delay and a subtractor compute the picture difference
between this picture and the previous picture.

Temporal redundancy can be exploited by inter-coding or transmitting only the
differences between pictures. Figure 2-9 shows that a one-picture delay combined
with a subtractor can compute the picture differences. The picture difference
is an image in its own right and can be further compressed by the spatial
coder as was previously described. The decoder reverses the spatial coding
and adds the difference picture to the previous picture to obtain the next
picture.
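In sketch form (with tiny invented arrays standing in for pictures), the scheme
is an accumulator at the decoder:

import numpy as np

pictures = [np.array([[10, 10], [10, 10]]),
            np.array([[10, 12], [10, 10]]),
            np.array([[10, 12], [11, 10]])]   # invented 2 x 2 "pictures"

# Encoder: send the first picture, then only the differences.
deltas = [pictures[0]] + [b - a for a, b in zip(pictures, pictures[1:])]

# Decoder: rebuild each picture by adding the difference to the last one.
decoded = deltas[0]
for d in deltas[1:]:
    decoded = decoded + d
print(np.array_equal(decoded, pictures[-1]))  # True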
There are some disadvantages to this simple system. First, as only differences
are sent, it is impossible to begin decoding after the start of the transmission.
This limitation makes it difficult for a decoder to provide pictures following a
switch from one bit stream to another (as occurs when the viewer changes
channels). Second, if any part of the difference data is incorrect, the error in
the picture will propagate indefinitely.
The solution to these problems is to use a system that is not completely differential.
Figure 2-10 shows that periodically complete pictures are sent. These
are called Intra-coded pictures (or I-pictures), and they are obtained by spatial
compression only. If an error or a channel switch occurs, it will be possible to
resume correct decoding at the next I-picture.
2.8 Motion Compensation
Motion reduces the similarities between pictures and increases the data needed
to create the difference picture. Motion compensation is used to increase
the similarity. Figure 2-11 shows the principle. When an object moves across
the TV screen, it may appear in a different place in each picture, but it does
not change in appearance very much. The picture difference can be reduced by
measuring the motion at the encoder. This is sent to the decoder as a vector.
The decoder uses the vector to shift part of the previous picture to a more
appropriate place in the new picture.
Figure 2-10. Transmission starts with an intra-coded picture; temporally coded
difference pictures follow, and a periodic intra picture allows decoding to
restart.

Figure 2-11. Motion compensation actions: 1. Compute the motion vector of the
moving object. 2. Shift data from picture N using the vector to make a predicted
picture N+1. 3. Compare the actual picture with the predicted picture. 4. Send
the vector and the prediction error.
One vector controls the shifting of an entire area of the picture that is known
as a macroblock. The size of the macroblock is determined by the DCT coding
and the color subsampling structure. Figure 2-12a shows that, with a 4:2:0
system, the vertical and horizontal spacing of color samples is exactly twice the
spacing of luminance. A single 8 x 8 DCT block of color samples extends over
the same area as four 8 x 8 luminance blocks; therefore this is the minimum
picture area which can be shifted by a vector. One 4:2:0 macroblock therefore
contains four luminance blocks, one Cr block, and one Cb block.
In the 4:2:2 profile, color is only subsampled in the horizontal axis. Figure
2-12b shows that in 4:2:2, a single 8 x 8 DCT block of color samples extends
over two luminance blocks. A 4:2:2 macroblock contains four luminance blocks,
two Cr blocks, and two Cb blocks.
The motion estimator works by comparing the luminance data from two successive
pictures. A macroblock in the first picture is used as a reference. When
the input is interlaced, pixels will be at different vertical locations in the two
fields, and it will, therefore, be necessary to interpolate one field before it can
be compared with the other. The correlation between the reference and the
next picture is measured at all possible displacements with a resolution of half
a pixel over the entire search range. When the greatest correlation is found,
this correlation is assumed to represent the correct motion.
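A full-pel sketch of this search follows (real MPEG estimators work to
half-pixel resolution and over larger ranges; here correlation is measured as a
minimum sum of absolute differences, one common choice):

import numpy as np

def best_vector(ref_block, next_pic, top, left, search=4):
    """Exhaustive full-pel search for the best-matching displacement."""
    h, w = ref_block.shape
    best_v, best_sad = (0, 0), float("inf")
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if (y < 0 or x < 0 or
                    y + h > next_pic.shape[0] or x + w > next_pic.shape[1]):
                continue
            sad = np.abs(next_pic[y:y + h, x:x + w] - ref_block).sum()
            if sad < best_sad:
                best_v, best_sad = (dy, dx), sad
    return best_v

# Invented pictures: a bright 8 x 8 object moves two pixels to the right.
pic1 = np.zeros((32, 32)); pic1[8:16, 8:16] = 200.0
pic2 = np.zeros((32, 32)); pic2[8:16, 10:18] = 200.0
print(best_vector(pic1[8:16, 8:16], pic2, 8, 8))   # (0, 2)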
The motion vector has a vertical and horizontal component. In typical program
material, a moving object may extend over a number of macroblocks. A greater
compression factor is obtained if the vectors are transmitted differentially.
When a large object moves, adjacent macroblocks have the same vectors and
the vector difference becomes zero.
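Sketched with invented vectors for a row of macroblocks crossed by one large
moving object, the differences are mostly zero and therefore cheap to code:

vectors = [(0, 2), (0, 2), (0, 2), (1, 2), (1, 2)]  # invented per-macroblock vectors

prev = (0, 0)
diffs = []
for v in vectors:
    diffs.append((v[0] - prev[0], v[1] - prev[1]))  # send vector differences
    prev = v
print(diffs)   # [(0, 2), (0, 0), (0, 0), (1, 0), (0, 0)]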
Motion vectors are associated with macroblocks, not with real objects in the
image and there will be occasions where part of the macroblock moves and
part of it does not. In this case, it is impossible to compensate properly. If the
motion of the moving part is compensated by transmitting a vector, the stationary
part will be incorrectly shifted, and it will need difference data to be corrected.
Figure 2-12. a) In 4:2:0, which has one-quarter as many chroma sampling points
as Y, a 16 x 16 macroblock comprises four 8 x 8 Y blocks, one Cr block, and one
Cb block. b) In 4:2:2, which has twice as much chroma data as 4:2:0, a
macroblock comprises four Y blocks, two Cr blocks, and two Cb blocks.