# Corruption Detection
**Name:** "Corruption Detection"; "Extension for Automatic Detection of Video
Corruptions"

**Formal name:**

**Status:** This extension is defined here to allow for experimentation.

**Contact:** <sprang@google.com>
NOTE: This explainer is a work in progress and may change without notice.
The Corruption Detection extension (sometimes referred to as automatic
corruption detection, or ACD) is intended to be part of a system that estimates
the likelihood that a video transmission is in a valid state. That is, that the
input to the video encoder on the send side corresponds to the output of the
video decoder on the receive side, with the only differences being the expected
distortions from lossy compression.
The goal is to be able to detect outright coding errors caused by things such as
bugs in encoder/decoders, malformed packetization data, incorrect relay
decisions in SFU-type servers, incorrect handling of packet loss/reordering, and
so forth. We want to accomplish this with a high signal-to-noise ratio while
consuming a minimum of resources in terms of bandwidth and/or computation. It
should be noted that it is _not_ a goal to be able to e.g. gauge general video
quality using this method.
This explainer contains two parts:
1) A definition of the RTP header extension itself and how it is to be parsed.
2) The intended usage and implementation details for a WebRTC sender and
receiver respectively.
If this extension has been negotiated, all the client behavior outlined in this
doc MUST be adhered to.
## RTP Header Extension Format
### Data Layout Overview
The message format of the header extension:
```
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|B|  seq index  |    std dev    | Y err | UV err|   sample 0    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   sample 1    |   sample 2    | … up to sample <= 12
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
```
### Data Layout Details
* B (1 bit): Whether the seq index field should be interpreted as the MSB or
  the LSB of the full 14-bit sequence index described in the next point.
* seq index (7 bits): The index into the Halton sequence (used to locate
  where the samples should be drawn from).
  * If B is set: the 7 most significant bits of the true index. The 7 least
    significant bits of the true index shall be interpreted as 0. This is the
    point where we can guarantee that the sender and receiver have the same
    full index. B MUST be set on key-frames. On droppable frames B MUST NOT
    be set.
  * If B is not set: the 7 least significant bits of the true index. The 7
    most significant bits should be inferred based on the most recent message.
* std dev (8 bits): The standard deviation of the Gaussian filter used to
  weigh the samples. The value is scaled using a linear map:
  0 = 0.0 to 255 = 40.0. A std dev of 0 is interpreted as directly using
  just the sample value at the desired coordinate, without any weighting.
* Y err (4 bits): The allowed error for the luma channel.
* UV err (4 bits): The allowed error for the chroma channels.
* Sample N (8 bits): The N:th filtered sample from the input image. Each
  sample represents a new point in one of the image planes, the plane and
  coordinates being determined by the index into the Halton sequence
  (starting at seq index and incrementing by one for each sample). Each
  sample has gone through a Gaussian filter with the std dev specified above,
  and has been rounded down (floored) to the nearest integer.
A special case is the so-called "synchronization" message: a message that
contains only the first byte. Such messages are used to keep the sender and
receiver in sync even if no "full" message has been received for a while. They
MUST NOT be sent on droppable frames.
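
As an illustration, parsing a received extension payload could look like the
sketch below. This is not a normative implementation; `AcdMessage` and
`ParseAcd` are hypothetical names, and the payload is assumed to have already
been extracted from the RTP header extension.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical in-memory representation of a parsed ACD message.
struct AcdMessage {
  bool b = false;                // MSB (true) vs LSB (false) interpretation.
  uint8_t seq_index = 0;         // 7-bit partial sequence index.
  double std_dev = 0.0;          // Gaussian std dev, mapped to [0.0, 40.0].
  uint8_t y_err = 0;             // Allowed luma error.
  uint8_t uv_err = 0;            // Allowed chroma error.
  std::vector<uint8_t> samples;  // Filtered samples; at most 13 (0..12).
};

std::optional<AcdMessage> ParseAcd(const uint8_t* data, size_t size) {
  if (size == 0) return std::nullopt;
  AcdMessage msg;
  msg.b = (data[0] & 0x80) != 0;      // Top bit of byte 0.
  msg.seq_index = data[0] & 0x7F;     // Remaining 7 bits.
  if (size == 1) return msg;          // Synchronization message: byte 0 only.
  if (size < 4) return std::nullopt;  // Full messages carry >= 1 sample.
  msg.std_dev = data[1] * 40.0 / 255.0;  // Linear map: 0 -> 0.0, 255 -> 40.0.
  msg.y_err = data[2] >> 4;
  msg.uv_err = data[2] & 0x0F;
  msg.samples.assign(data + 3, data + size);
  if (msg.samples.size() > 13) return std::nullopt;
  return msg;
}
```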
### A note on encryption
Privacy and security are core parts of nearly every WebRTC-based application,
which means that some sort of encryption needs to be present. The most common
form of encryption is SRTP, defined in RFC 3711. However, as mentioned in
section 9.4 of that RFC, RTP header extensions are considered part of the header
and are thus not encrypted.
The automatic corruption detection header extension is different from most other
header extensions in that it provides not only metadata about the media stream
being transmitted but in practice comprises an extremely sparse representation
of the actual video stream itself. Given a static scene and enough time, a
crude image of the otherwise encrypted video can rather trivially be
reconstructed.
As such, most applications should use this extension with SRTP only if
additional security is present to protect it. That could be for example in the
form of explicit header extension encryption provided by RFC 6904/RFC 9335, or
by encapsulating the entire RTP stream in an additional layer such as IPsec.
## Usage & Guidelines
In this section we’ll first look at a general overview of the intended usage of
this header extension, followed by more details around the expected
implementation.
### Overview
The premise of the extension described here is that we can validate the state of
the video pipeline by quasi-randomly selecting a few samples from the raw input
frame to an encoder, and then checking them against the output of a decoder.
Assuming that a lossless codec is used we can follow these steps:
1) In an image that is to be encoded, quasi-randomly select N sampling positions
and store the sample values for those positions from the raw input image.
2) Encode the image, and attach the selected sample values to the RTP packets
containing the encoded bitstream of that image.
3) Transmit the RTP packets to a remote receiver.
4) At the receiver, collect the attached sample values from the RTP packets when
assembling the frame, and then pass the bitstream to a decoder.
5) Using the same quasi-random sequence as in (1), calculate the corresponding N
sampling positions.
6) Take the output of the decoder, compute the values at the corresponding
sampling positions, and compare them against the sample values from the RTP
packets (as sketched below). If they differ significantly, it is likely that an
image corruption has occurred.
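
Under the (temporary) lossless assumption, steps (1) and (6) reduce to a direct
value comparison. The sketch below illustrates this; `Plane`, `CollectSamples`,
and `LikelyCorrupt` are hypothetical names, not part of any specified API.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// One image plane, row-major; a hypothetical helper type for this sketch.
struct Plane {
  int width;
  int height;
  std::vector<uint8_t> data;
  uint8_t At(int row, int col) const { return data[row * width + col]; }
};

// Step 1: the sender records the raw values at the given (row, col)
// positions. The positions themselves come from a quasi-random sequence
// (the Halton sequence, described later in this document).
std::vector<uint8_t> CollectSamples(
    const Plane& raw, const std::vector<std::pair<int, int>>& positions) {
  std::vector<uint8_t> samples;
  for (const auto& pos : positions) {
    samples.push_back(raw.At(pos.first, pos.second));
  }
  return samples;
}

// Step 6: the receiver recomputes the same positions and compares. With a
// truly lossless codec, any mismatch indicates a broken pipeline.
bool LikelyCorrupt(const Plane& decoded,
                   const std::vector<std::pair<int, int>>& positions,
                   const std::vector<uint8_t>& remote_samples) {
  for (size_t i = 0; i < positions.size(); ++i) {
    if (decoded.At(positions[i].first, positions[i].second) !=
        remote_samples[i]) {
      return true;
    }
  }
  return false;
}
```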
Lossless encoding is, however, rarely used in practice, and lossy compression
introduces problems for the above algorithm:
* Quantization causes values to be different from the desired value.
* Whole blocks of pixels might be shifted somewhat due to inaccuracies in motion
vectors.
* Inaccuracies caused by in-loop or post-process filtering.
* etc.
We must therefore take these distortions into consideration, as they are merely
a natural side-effect of the compression and their effect is not to be
considered an “invalid state”. We aim to accomplish this using two tools.
First, instead of a sample being a single raw sample value, let it be a filtered
one: a weighted average of samples in the vicinity of the desired location, with
the weights being a 2D Gaussian centered at that location and the variance
adjusted depending on the magnitude of the expected distortions
(higher distortion => higher variance). This smooths out inaccuracies caused by
both quantization and motion compensation.
Secondly, even with a very large filter kernel the new sample might not converge
towards the exact desired value. For that reason, set an “allowed error
threshold” that removes small magnitude differences. Since chroma and luma
channels have different scales, separate error thresholds are available for
them.
### Sequence Index Handling
The quasi-random sequence of choice for this extension is a 2D
The index into the Halton Sequence is indicated by the header extension and
results in a 14 bit unsigned integer which on overflow will wrap around back to
0.
For each sample contained within the extension, the sequence index should be
considered to be incremented by one. Thus the sequence index at the start of the
header should be considered “the sequence index for the next sample to be
drawn”.
The ACD extension may be sent containing either the 7 most significant bits
(B = true) or the 7 least significant bits (B = false) of the sequence index.
Key-frames MUST be populated with the ACD extension, and those MUST use B = true
indicating only the 7 most significant bits are transmitted.
The sender may choose any arbitrary starting point. The biggest reason to not
always start with (B = true, seq index = 0) is that with frequent/periodic
keyframes you might end up always sampling the same small subset of image
locations over and over.
If B = false and the LSB seq index plus the number of samples exceeds the
capacity of the 7-bit field (i.e. > 0x7F), then the most significant bits of
the 14-bit sequence counter should be considered to be implicitly incremented
by the overflow. For example, a message with LSB seq index 0x78 and 13 samples
covers indices 0x78 through 0x84, so the MSB part is implicitly incremented by
one and the index for the next sample has LSB 0x05.
Delta-frames may be encoded as “droppable” or “non-droppable”. Consider for
example temporal layering using a three-layer structure such as L1T3:
key-frames and all T0 frames are non-droppable, while all T1 and T2 frames are
droppable.
For non-droppable frames, B MAY be set to true even though there is often little
utility for it.
For droppable frames B MUST NOT be set to true, since a receiver could otherwise
easily end up out of sync with the sender.
A receiver must store a state containing the last sequence index used. If an ACD
extension is received with B = false but the LSB does not match the last known
sequence index state, this indicates that an instrumented frame has been
dropped. The receiver should recover from this by incrementing the last known
sequence index until the 7 least significant bits match.
Because of this, the sender MUST send ACD messages on non-droppable frames such
that the delta in sequence index (from the last sample of the previous message
to the first sample of the next) does not exceed 0x7F. A synchronization message
may be used for this purpose if there is no wish to instrument the
non-droppable frame.
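
As an illustration, receiver-side index tracking along those lines might look
as follows (a minimal sketch; `SequenceIndexTracker` and its methods are
hypothetical names, not an existing API):

```cpp
#include <cstdint>

// Tracks the full 14-bit sequence index on the receive side. The stored
// index always refers to "the next sample to be drawn".
class SequenceIndexTracker {
 public:
  // Returns the full 14-bit index for the first sample of this message.
  uint16_t OnAcdMessage(bool b, uint8_t seq_index_7bits) {
    if (b) {
      // MSB message: the 7 LSB of the true index are zero.
      last_index_ = static_cast<uint16_t>(seq_index_7bits) << 7;
    } else {
      // LSB message: advance until the 7 LSB match. This also recovers
      // from instrumented frames that were dropped in between.
      while ((last_index_ & 0x7F) != seq_index_7bits) {
        last_index_ = (last_index_ + 1) & 0x3FFF;  // 14-bit wrap-around.
      }
    }
    return last_index_;
  }

  // Accounts for the samples consumed by the message just processed.
  void AdvanceBy(int num_samples) {
    last_index_ = (last_index_ + num_samples) & 0x3FFF;
  }

 private:
  uint16_t last_index_ = 0;
};
```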
It is not required to add the ACD extension to every frame. Indeed, for
performance reasons it may be reasonable to instrument only a small subset of
frames, for example one frame per second.
Additionally, when encoding a structure that has independent decode targets
(e.g. L3T3_KEY), the sender should generate an independent ACD sequence per
target resolution so that a receiver can validate the state of the sub-stream
it receives.
// TODO: Add concrete examples.
### Sample Selection
As mentioned above, a Halton Sequence is used to generate sampling coordinates.
Base 2 is used for selecting the rows, and base 3 is used for selecting columns.
Each sample in the ACD extension represents a single image sample, meaning it
belongs to a single channel rather than e.g. being an RGB pixel.
The initial version of the ACD extension supports only the I420 chroma
subsampling format. When determining which plane a location belongs to, it is
easiest to visualize it as the chroma planes being “stacked” to the side of the
luma plane:
```
+------+---+
|      | U |
|  Y   +---+
|      | V |
+------+---+
```
In pseudo code:
```
row = GetHaltonSequence(seq_index, /*base=*/2) * image_height;
col = GetHaltonSequence(seq_index, /*base=*/3) * image_width * 1.5;
if (col < image_width) {
  HandleSample(Y_PLANE, row, col);
} else if (row < image_height / 2) {
  HandleSample(U_PLANE, row, col - image_width);
} else {
  HandleSample(V_PLANE, row - (image_height / 2), col - image_width);
}
seq_index++;
```
Support for other layout types may be added in later versions of this extension.
Note that the image dimensions are not explicitly part of the ACD extension;
they have to be inferred from the raw image itself.
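
For reference, the Halton sequence is the standard radical-inverse
construction: the digits of the index, written in the given base, mirrored
around the decimal point. A self-contained sketch (the function name matches
the pseudocode above but is otherwise illustrative):

```cpp
#include <cstdio>

// Returns the index:th element of the Halton sequence for the given base,
// a quasi-random value in [0, 1).
double GetHaltonSequence(int index, int base) {
  double result = 0.0;
  double fraction = 1.0 / base;
  while (index > 0) {
    result += (index % base) * fraction;  // Mirror the next digit.
    index /= base;
    fraction /= base;
  }
  return result;
}

int main() {
  // Base 2 yields 0, 0.5, 0.25, 0.75, 0.125, ...
  for (int i = 0; i < 5; ++i) {
    printf("%d: base2=%.4f base3=%.4f\n", i,
           GetHaltonSequence(i, 2), GetHaltonSequence(i, 3));
  }
  return 0;
}
```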
### Sample Filtering
As mentioned above, when filtering a sample we create a weighted average around
the desired location. Only samples in the same plane are considered. The
weighting consists of a 2D Gaussian centered on the desired location, with the
standard deviation specified in the ACD extension header.
If the standard deviation is specified as 0.0, we consider only a single
sample. Otherwise, we first determine a cutoff distance beyond which the
weights are considered too small to matter. For now, we have set the weight
cutoff to 0.2, meaning the maximum distance from the center sample we need to
consider is max_d = ceil(sqrt(-2.0 * ln(0.2) * stddev^2)) - 1. For example,
with stddev = 2.0 this gives max_d = ceil(3.59) - 1 = 3.
Any samples outside the plane are considered to have weight 0.
In pseudo-code, that means we get the following:
```
sample_sum = 0;
weight_sum = 0;
for (y = max(0, row - max_d) to min(plane_height - 1, row + max_d)) {
  for (x = max(0, col - max_d) to min(plane_width - 1, col + max_d)) {
    weight = e^(-1 * ((y - row)^2 + (x - col)^2) / (2 * stddev^2));
    sample_sum += SampleAt(x, y) * weight;
    weight_sum += weight;
  }
}
filtered_sample = sample_sum / weight_sum;
```
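
The same computation as runnable C++, assuming a row-major plane buffer (names
are illustrative; this mirrors the pseudocode above rather than any existing
implementation):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Computes the Gaussian-filtered sample centered at (row, col) in a
// row-major plane, clamping the kernel at the plane edges.
double FilteredSample(const std::vector<uint8_t>& plane, int plane_width,
                      int plane_height, int row, int col, double stddev) {
  if (stddev == 0.0) {
    return plane[row * plane_width + col];  // Single-sample special case.
  }
  // Maximum distance at which a weight can still reach the 0.2 cutoff.
  const double cutoff_dist = std::sqrt(-2.0 * std::log(0.2) * stddev * stddev);
  const int max_d = static_cast<int>(std::ceil(cutoff_dist)) - 1;
  double sample_sum = 0.0;
  double weight_sum = 0.0;
  for (int y = std::max(0, row - max_d);
       y <= std::min(plane_height - 1, row + max_d); ++y) {
    for (int x = std::max(0, col - max_d);
         x <= std::min(plane_width - 1, col + max_d); ++x) {
      const double dist_sq = (y - row) * (y - row) + (x - col) * (x - col);
      const double weight = std::exp(-dist_sq / (2.0 * stddev * stddev));
      sample_sum += plane[y * plane_width + x] * weight;
      weight_sum += weight;
    }
  }
  return sample_sum / weight_sum;
}
```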
### Receive Side Considerations
When a frame has been decoded and an ACD message is present, the receiver
performs the following steps:
* Update the sequence index so that it is consistent with the ACD message.
* Calculate the sample positions from the Halton sequence.
* Filter each sample of the decoded image using the standard deviation provided
in the ACD message.
We then need to compare the actual samples present in the ACD message and the
samples generated from the locally decoded frame, and take the allowed error
into account:
```
for (i = 0 to num_samples) {
  // Allowed error from ACD message, depending on which plane sample i is in.
  allowed_error = SampleType(i) == Y_PLANE ? Y_ERR : UV_ERR;
  delta_i = max(0, abs(RemoteSample(i) - LocalSample(i)) - allowed_error);
}
```
It is then up to the receiver how to interpret these deltas. A suggested method
is to calculate a “corruption score” as sum(delta(i)^2), where delta(i) is the
delta for the i:th sample in the message, and then scale and cap that result to
a maximum of 1.0. By squaring the deltas, we make sure that even a single
sample that is far outside its expected value causes a noticeable shift in the
score. Another possibility is to calculate the distance and cap it using a
sigmoid function.
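
A sketch of the suggested squared-sum scoring (the `scale` value is an
arbitrary placeholder that an application would tune, not a specified
constant):

```cpp
#include <algorithm>
#include <vector>

// Computes a corruption score in [0.0, 1.0] from per-sample deltas, where
// each delta already has the allowed error subtracted (clamped at 0).
double CorruptionScore(const std::vector<double>& deltas,
                       double scale = 1.0 / 255.0) {
  double sum_sq = 0.0;
  for (double d : deltas) {
    sum_sq += d * d;  // Squaring makes single large outliers dominate.
  }
  return std::min(1.0, sum_sq * scale);
}
```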
This extension message format does not make recommendations about what a
receiver should do with the corruption scores, but some possibilities are:
* Expose it as a statistic connected to the video receive stream, and let the
application decide what to do with the information.
* Let the WebRTC application use a corruption signal to take proactive measures.
E.g. request a key-frame in order to recover, or try to switch to another
codec type or implementation.
### Determining Filter Settings & Error Thresholds
It is up to the sender to estimate how large the filter kernel and the allowed
error thresholds should be.
One method to do this is to analyze example outputs from different encoders and
map the average frame QP to suitable settings. There will of course have to be
different such mappings for e.g. AV1 compared to VP8, but it is also possible
to get “tighter” values with knowledge of the exact implementation used, e.g. a
mapping designed just for libaom encoder version X running with speed setting Y.
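
For example, such a mapping could be as simple as a per-codec lookup table.
All breakpoints and values below are invented for illustration and would have
to be derived from analysis of real encoder output; they are not
recommendations:

```cpp
#include <cstdint>

// Filter/threshold settings to use when instrumenting a frame.
struct AcdSettings {
  double std_dev;  // Gaussian filter standard deviation.
  uint8_t y_err;   // Allowed luma error.
  uint8_t uv_err;  // Allowed chroma error.
};

// Hypothetical QP-to-settings mapping for one codec.
AcdSettings SettingsForVp8Qp(int avg_qp) {
  if (avg_qp < 30) return {0.5, 2, 2};
  if (avg_qp < 60) return {1.5, 4, 3};
  if (avg_qp < 90) return {3.0, 8, 6};
  return {5.0, 12, 8};
}
```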
Another method is to use the actual reconstruction state from the encoder. That
of course means the encoder has to expose that state, which is not common.
A benefit of doing it that way is that the filter size and allowed error can be
very small (really, only post-processing could introduce distortions in that
scenario). A drawback is that if the reconstructed state already contains
corruption due to an encoder bug, then we would not be able to detect that
corruption at all.
There are also possibly more accurate but probably much more costly alternatives
as well, such as training an ML model to determine the settings based on both
the content of the source frame and any metadata present in the encoded
bitstream.
Regardless of method, the implementation at the send side SHOULD strive to set
the filter size and error thresholds such that 99.5% of filtered samples end up
with a delta <= the error threshold for that plane, based on a representative
set of test clips and bandwidth constraints.
Note: The extension MUST NOT be present in more than one packet per video frame.