AlgorithmDescription.md

<div style="font-size:3em; text-align:center;"> Algorithm Description </div>

# Abstract

This document describes technical aspects of coding tools included in

the associated codec. This document is not a specification of the associated

codec. Instead, it summarizes the highlighted features of coding tools for new

developers. This document should be updated when significant new normative

changes have been integrated into the associated codec.

# Table of Contents

[Abbreviations](#Abbreviations)

[Algorithm description](#Algorithm-Description)

- [Block Partitioning](#Block-Partitioning)

  - [Coding block partition](#Coding-block-partition)

  - [Transform block partition](#Transform-block-partition)

- [Intra Prediction](#Intra-Prediction)

  - [Directional intra prediction modes](#Directional-intra-prediction-modes)

  - [Non-directional intra prediction modes](#Non-directional-intra-prediction-modes)

  - [Recursive filtering modes](#Recursive-filtering-modes)

  - [Chroma from Luma mode](#Chroma-from-Luma-mode)

- [Inter Prediction](#Inter-Prediction)

  - [Motion vector prediction](#Motion-vector-prediction)

  - [Motion vector coding](#Motion-vector-coding)

  - [Interpolation filter for motion compensation](#Interpolation-filter-for-motion-compensation)

  - [Warped motion compensation](#Warped-motion-compensation)

  - [Overlapped block motion compensation](#Overlapped-block-motion-compensation)

  - [Reference frames](#Reference-frames)

  - [Compound Prediction](#Compound-Prediction)

- [Transform](#Transform)

- [Quantization](#Quantization)

- [Entropy Coding](#Entropy-Coding)

- [Loop filtering and post-processing](#Loop-filtering-and-post-processing)

  - [Deblocking](#Deblocking)

  - [Constrained directional enhancement](#Constrained-directional-enhancement)

  - [Loop Restoration filter](#Loop-Restoration-filter)

  - [Frame super-resolution](#Frame-super-resolution)

  - [Film grain synthesis](#Film-grain-synthesis)

- [Screen content coding](#Screen-content-coding)

  - [Intra block copy](#Intra-block-copy)

  - [Palette mode](#Palette-mode)

[References](#References)

# Abbreviations

CfL: Chroma from Luma\

IntraBC: Intra block copy\

LCU: Largest coding unit\

OBMC: Overlapped Block Motion Compensation\

CDEF: Constrained Directional Enhancement Filter

# Algorithm Description

## Block Partitioning

### Coding block partition

The largest coding block unit (LCU) applied in this codec is 128×128. In

addition to no split mode `PARTITION_NONE`, the partition tree supports 9

different partitioning patterns, as shown in below figure.

<figure class="image"> <center><img src="img\partition_codingblock.svg"

alt="Partition" width="360" /> <figcaption>Figure 1: Supported coding block

partitions</figcaption> </figure>

According to the number of sub-partitions, the 9 partition modes are summarized

as follows: 1. Four partitions: `PARTITION_SPLIT`, `PARTITION_VERT_4`,

`PARTITION_HORZ_4` 2. Three partitions (T-Shape): `PARTITION_HORZ_A`,

`PARTITION_HORZ_B`, `PARTITION_VERT_A`, `PARTITION_HORZ_B` 3. Two partitions:

`PARTITION_HORZ`, `PARTITION_VERT`

Among all the 9 partitioning patterns, only `PARTITION_SPLIT` mode supports

recursive partitioning, i.e., sub-partitions can be further split, other

partitioning modes cannot further split. Particularly, for 8x8 and 128x128,

`PARTITION_VERT_4`, `PARTITION_HORZ_4` are not used, and for 8x8, T-Shape

partitions are not used either.

### Transform block partition

For both intra and inter coded blocks, the coding block can be further

partitioned into multiple transform units with the partitioning depth up to 2

levels. The mapping from the transform size of the current depth to the

transform size of the next depth is shown in the following Table 1.

<figure class="image"> <center><figcaption>Table 1: Transform partition size

setting</figcaption> <img src="img\tx_partition.svg" alt="Partition" width="220"

/> </figure>

Furthermore, for intra coded blocks, the transform partition is done in a way

that all the transform blocks have the same size, and the transform blocks are

coded in a raster scan order. An example of the transform block partitioning for

intra coded block is shown in the Figure 2.

<figure class="image"> <center><img src="img\intra_tx_partition.svg"

alt="Partition" width="600" /> <figcaption>Figure 2: Example of transform

partitioning for intra coded block</figcaption> </figure>

For inter coded blocks, the transform unit partitioning can be done in a

recursive manner with the partitioning depth up to 2 levels. The transform

partitioning supports 1:1 (square), 1:2/2:1, and 1:4/4:1 transform unit sizes

ranging from 4×4 to 64×64. If the coding block is smaller than or equal to

64x64, the transform block partitioning can only apply to luma component, for

chroma blocks, the transform block size is identical to the coding block size.

Otherwise, if the coding block width or height is greater than 64, then both the

luma and chroma coding blocks will implicitly split into multiples of min(W,

64)x min(H, 64) and min(W, 32)x min(H, 32) transform blocks, respectively.

<figure class="image"> <center><img src="img\inter_tx_partition.svg"

alt="Partition" width="400" /> <figcaption>Figure 3: Example of transform

partitioning for inter coded block</figcaption> </figure>

## Intra Prediction

### Directional intra prediction modes

Directional intra prediction modes are applied in intra prediction, which models

local textures using a given direction pattern. Directional intra prediction

modes are represented by nominal modes and angle delta. The nominal modes are

similar set of intra prediction angles used in VP9, which includes 8 angles. The

index value of angle delta is ranging from -3 ~ +3, and zero delta angle

indicates a nominal mode. The prediction angle is represented by a nominal intra

angle plus an angle delta. In total, there are 56 directional intra prediction

modes, as shown in the following figure. In the below figure, solid arrows

indicate directional intra prediction modes and dotted arrows represent non-zero

angle delta.

<figure class="image"> <center><img src="img\intra_directional.svg"

alt="Directional intra" width="300" /> <figcaption>Figure 4: Directional intra

prediction modes</figcaption> </figure>

The nominal mode index and angle delta index is signalled separately, and

nominal mode index is signalled before the associated angle delta index. It is

noted that for small block sizes, where the coding gain from extending intra

prediction angles may saturate, only the nominal modes are used and angle delta

index is not signalled.

### Non-directional intra prediction modes

In addition to directional intra prediction modes, four non-directional intra

modes which simulate smooth textures are also included. The four non-directional

intra modes include `SMOOTH_V`, `SMOOTH_H`, `SMOOTH` and `PAETH predictor`.

In `SMOOTH V`, `SMOOTH H` and `SMOOTH modes`, the prediction values are

generated using quadratic interpolation along vertical, horizontal directions,

or the average thereof. The samples used in the quadratic interpolation include

reconstructed samples from the top and left neighboring blocks and samples from

the right and bottom boundaries which are approximated by top reconstructed

samples and the left reconstructed samples.

In `PAETH predictor` mode, the prediction for each sample is assigned as one

from the top (T), left (L) and top-left (TL) reference samples, which has the

value closest to the Paeth predictor value, i.e., T + L -TL. The samples used in

`PAETH predictor` are illustrated in below figure.

<figure class="image"> <center><img src="img\intra_paeth.svg" alt="Directional

intra" width="300" /> <figcaption>Figure 5: Paeth predictor</figcaption>

</figure>

### Recursive filtering modes

Five filtering intra modes are defined, and each mode specify a set of eight

7-tap filters. Given the selected filtering mode index (0~4), the current block

is divided into 4x2 sub-blocks. For one 4×2 sub-block, each sample is predicted

by 7-tap interpolation using the 7 top and left neighboring samples as inputs.

Different filters are applied for samples located at different coordinates

within a 4×2 sub-block. The prediction process can be done recursively in unit

4x2 sub-block, which means that prediction samples generated for one 4x2

prediction block can be used to predict another 4x2 sub-block.

<figure class="image"> <center><img src="img\intra_recursive.svg"

alt="Directional intra" width="300" /> <figcaption>Figure 6: Recursive filtering

modes</figcaption> </figure>

### Chroma from Luma mode

Chroma from Luma (CfL) is a chroma intra prediction mode, which models chroma

samples as a linear function of co-located reconstructed luma samples. To align

the resolution between luma and chroma samples for different chroma sampling

format, e.g., 4:2:0 and 4:2:2, reconstructed luma pixels may need to be

sub-sampled before being used in CfL mode. In addition, the DC component is

removed to form the AC contribution. In CfL mode, the model parameters which

specify the linear function between two color components are optimized by

encoder signalled in the bitstream.

<figure class="image"> <center><img src="img\intra_cfl.svg" alt="Directional

intra" width="700" /> <figcaption>Figure 7: CfL prediction</figcaption>

</figure>

## Inter Prediction

### Motion vector prediction

Motion vectors are predicted by neighboring blocks which can be either spatial

neighboring blocks, or temporal neighboring blocks located in a reference frame.

A set of MV predictors will be identified by checking all these blocks and

utilized to encode the motion vector information.

**Spatial motion vector prediction**

There are two sets of spatial neighboring blocks that can be utilized for

finding spatial MV predictors, including the adjacent spatial neighbors which

are direct top and left neighbors of the current block, and second outer spatial

neighbors which are close but not directly adjacent to the current block. The

two sets of spatial neighboring blocks are illustrated in an example shown in

Figure 8.

<figure class="image"> <center><img src="img\inter_spatial_mvp.svg"

alt="Directional intra" width="350" /><figcaption>Figure 8: Motion field

estimation by linear projection</figcaption></figure>

For each set of spatial neighbors, the top row will be checked from left to

right and then the left column will be checked from top to down. For the

adjacent spatial neighbors, an additional top-right block will be also checked

after checking the left column neighboring blocks. For the non-adjacent spatial

neighbors, the top-left block located at (-1, -1) position will be checked

first, then the top row and left column in a similar manner as the adjacent

neighbors. The adjacent neighbors will be checked first, then the temporal MV

predictor that will be described in the next subsection will be checked second,

after that, the non-adjacent spatial neighboring blocks will be checked.

For compound prediction which utilizes a pair of reference frames, the

non-adjacent spatial neighbors are not used for deriving the MV predictor.

**Temporal motion vector prediction**

In addition to spatial neighboring blocks, MV predictor can be also derived

using co-located blocks of reference pictures, namely temporal MV predictor. To

generate temporal MV predictor, the MVs of reference frames are first stored

together with reference indices associated with the reference frame. Then for

each 8x8 block of the current frame, the MVs of a reference frame which pass the

8x8 block are identified and stored together with the reference frame index in a

temporal MV buffer. In an example shown in Figure 5, the MV of reference frame 1

(R1) pointing from R1 to a reference frame of R1 is identified, i.e., MVref,

which passes a 8x8 block (shaded in blue dots) of current frame. Then this MVref

is stored in the temporal MV buffer associated with this 8x8 block. <figure

class="image"> <center><img src="img\inter_motion_field.svg" alt="Directional

intra" width="800" /><figcaption>Figure 9: Motion field estimation by linear

projection</figcaption></figure> Finally, given a couple of pre-defined block

coordinates, the associated MVs stored in the temporal MV buffer are identified

and projected accordingly to derive a temporal MV predictor which points from

the current block to its reference frame, e.g., MV0 in Figure 5. In Figure 6,

the pre-defined block positions for deriving temporal MV predictors of a 16x16

block are shown and up to 7 blocks will be checked to find valid temporal MV

predictors.<figure class="image"> <center><img

src="img\inter_tmvp_positions.svg" alt="Directional intra" width="300"

/><figcaption>Figure 10: Block positions for deriving temporal MV

predictors</figcaption></figure> The temporal MV predictors are checked after

the nearest spatial MV predictors but before the non-adjacent spatial MV

predictors.

All the spatial and temporal MV candidates will be put together in a pool, with

each predictor associated with a weighting determined during the scanning of the

spatial and temporal neighboring blocks. Based on the associated weightings, the

candidates are sorted and ranked, and up to four candidates will be used as a

list MV predictor list.

### Motion vector coding

### Interpolation filter for motion compensation

<mark>[Ed.: to be added]</mark>

### Warped motion compensation

**Global warped motion**

The global motion information is signalled at each inter frame, wherein the

global motion type and motion parameters are included. The global motion types

and the number of the associated parameters are listed in the following table.

| Global motion type   | Number of parameters   |

|:------------------:|:--------------------:|

| Identity (zero motion)| 0 |

| Translation | 2 |

| Rotzoom  | 4 |

| General affine | 6 |

For an inter coded block, after the reference frame index is

transmitted, if the motion of current block is indicated as global motion, the

global motion type and the associated parameters of the given reference will be

used for current block.

**Local warped motion**

For an inter coded block, local warped motion is allowed when the following

conditions are all satisfied:

* Current block is single prediction

* Width or height is greater than or equal to 8 samples

* At least one of the immediate neighbors uses same reference frame with current block

If the local warped motion is used for current block, instead of signalling the

affine parameters, they are estimated by using mean square minimization of the

distance between the reference projection and modeled projection based on the

motion vectors of current block and its immediate neighbors. To estimate the

parameters of local warped motion, the projection sample pair of the center

pixel in neighboring block and its corresponding pixel in the reference frame

are collected if the neighboring block uses the same reference frame with

current block. After that, 3 extra samples are created by shifting the center

position by a quarter sample in one or two dimensions, and these samples are

also considered as projection sample pairs to ensure the stability of the model

parameter estimation process.

### Overlapped block motion compensation

For an inter-coded block, overlapped block motion compensation (OBMC) is allowed

when the following conditions are all satisfied.

* Current block is single prediction

* Width or height is greater than or equal to 8 samples

* At least one of the neighboring blocks are inter-coded blocks

When OBMC is applied to current block, firstly, the initial inter prediction

samples is generated by using the assigned motion vector of current block, then

the inter predicted samples for the current block and inter predicted samples

based on motion vectors from the above and left blocks are blended to generate

the final prediction samples.The maximum number of neighboring motion vectors is

limited based on the size of current block, and up to 4 motion vectors from each

of upper and left blocks can be involved in the OBMC process of current block.

One example of the processing order of neighboring blocks is shown in the

following picture, wherein the values marked in each block indicate the

processing order of the motion vectors of current block and neighboring blocks.

To be specific, the motion vector of current block is firstly applied to

generate inter prediction samples P0(x,y). Then motion vector of block 1 is

applied to generate the prediction samples p1(x,y). After that, the prediction

samples in the overlapping area between block 0 and block 1 is an weighted

average of p0(x,y) and p1(x,y). The overlapping area of block 1 and block 0 is

marked in grey in the following picture. The motion vectors of block 2, 3, 4 are

further applied and blended in the same way.

<figure class="image"> <center><img src="img\inter_obmc.svg" alt="Directional

intra" width="300" /><figcaption>Figure 11: neighboring blocks for OBMC

process</figcaption></figure>

### Reference frames

<mark>[Ed.: to be added]</mark>

### Compound Prediction

<mark>[Ed.: to be added]</mark>

**Compound wedge prediction**

<mark>[Ed.: to be added]</mark>

**Difference-modulated masked prediction**

<mark>[Ed.: to be added]</mark>

**Frame distance-based compound prediction**

<mark>[Ed.: to be added]</mark>

**Compound inter-intra prediction**

<mark>[Ed.: to be added]</mark>

## Transform

The separable 2D transform process is applied on prediction residuals. For the

forward transform, a 1-D vertical transform is performed first on each column of

the input residual block, then a horizontal transform is performed on each row

of the vertical transform output. For the backward transform, a 1-D horizontal

transform is performed first on each row of the input de-quantized coefficient

block, then a vertical transform is performed on each column of the horizontal

transform output. The primary 1-D transforms include four different types of

transform: a) 4-point, 8-point, 16-point, 32-point, 64-point DCT-2; b) 4-point,

8-point, 16-point asymmetric DST’s (DST-4, DST-7) and c) their flipped

versions; d) 4-point, 8-point, 16-point, 32-point identity transforms. When

transform size is 4-point, ADST refers to DST-7, otherwise, when transform size

is greater than 4-point, ADST refers to DST-4.

<figure class="image"> <center><figcaption>Table 2: Transform basis functions

(DCT-2, DST-4 and DST-7 for N-point input.</figcaption> <img src=

"img\tx_basis.svg" alt="Partition" width="450" /> </figure>

For luma component, each transform block can select one pair of horizontal and

vertical transform combination given a pre-defined set of transform type

candidates, and the selection is explicitly signalled into the bitstream.

However, the selection is not signalled when Max(width,height) is 64. When

the maximum of transform block width and height is greater than or equal to 32,

the set of transform type candidates depend on the prediction mode, as described

in Table 3. Otherwise, when the maximum of transform block width and height is

smaller than 32, the set of transform type candidates depend on the prediction

mode, as described in Table 4.

<figure class="image"> <center><figcaption>Table 3: Transform type candidates

for luma component when max(width, height) is greater than or equal to 32.

</figcaption> <img src="img\tx_cands_large.svg" alt="Partition" width="370" />

</figure>

<figure class="image"> <center><figcaption>Table 4: Transform type candidates

for luma component when max(width, height) is smaller than 32. </figcaption>

<img src="img\tx_cands_small.svg" alt="Partition" width="440" /> </figure>

The set of transform type candidates (namely transform set) is defined in Table

5.

<figure class="image"> <center><figcaption>Table 5: Definition of transform set.

</figcaption> <img src="img\tx_set.svg" alt="Partition" width="450" /> </figure>

For chroma component, the transform type selection is done in an implicit way.

For intra prediction residuals, the transform type is selected according to the

intra prediction mode, as specified in Table 4. For inter prediction residuals,

the transform type is selected according to the transform type selection of the

co-located luma block. Therefore, for chroma component, there is no transform

type signalling in the bitstream.

<figure class="image"> <center><figcaption>Table 6: Transform type selection for

chroma component intra prediction residuals.</figcaption> <img src=

"img\tx_chroma.svg" alt="Partition" width="500" /> </figure>

The computational cost of large size (e.g., 64-point) transforms is further

reduced by zeroing out all the coefficients except the following two cases:

1. The top-left 32×32 quadrant for 64×64/64×32/32×64 DCT_DCT hybrid transforms

2. The left 32×16 area for 64×16 and top 16×32 for16×64 DCT_DCT hybrid transforms.

Both the DCT-2 and ADST (DST-4, DST-7) are implemented using butterfly structure

[1], which included multiple stages of butterfly operations. Each butterfly

operations can be calculated in parallel and different stages are cascaded in a

sequential order.

## Quantization

Quantization of transform coefficients may apply different quantization step

size for DC and AC transform coefficients, and different quantization step size

for luma and chroma transform coefficients. To specify the quantization step

size, in the frame header, a _**base_q_idx**_ syntax element is first signalled,

which is a 8-bit fixed length code specifying the quantization step size for

luma AC coefficients. The valid range of _**base_q_idx**_ is [0, 255].

After that, the delta value relative to base_q_idx for Luma DC coefficients,

indicated as DeltaQYDc is further signalled. Furthermore, if there are more than

one color plane, then a flag _**diff_uv_delta**_ is signaled to indicate whether

Cb and Cr color components apply different quantization index values. If

_**diff_uv_delta**_ is signalled as 0, then only the delta values relative to

base_q_idx for chroma DC coefficients (indicated as DeltaQUDc) and AC

coefficients (indicated as DeltaQUAc) are signalled. Otherwise, the delta values

relative to base_q_idx for both the Cb and Cr DC coefficients (indicated as

DeltaQUDc and DeltaQVDc) and AC coefficients (indicated as DeltaQUAc and

DeltaQVAc) are signalled.

The above decoded DeltaQYDc, DeltaQUAc, DeltaQUDc, DeltaQVAc and DeltaQVDc are

added to _base_q_idx_ to derive the quantization indices. Then these

quantization indices are further mapped to quantization step size according to

two tables. For DC coefficients, the mapping from quantization index to

quantization step size for 8-bit, 10-bit and 12-bit internal bit depth is

specified by a lookup table Dc_Qlookup[3][256], and the mapping from

quantization index to quantization step size for 8-bit, 10-bit and 12-bit is

specified by a lookup table Ac_Qlookup[3][256].

<figure class="image"> <center><img src="img\quant_dc.svg" alt="quant_dc"

width="800" /><figcaption>Figure 11: Quantization step size of DC coefficients

for different internal bit-depth</figcaption></figure>

<figure class="image"> <center><img src="img\quant_ac.svg" alt="quant_ac"

width="800" /><figcaption>Figure 12: Quantization step size of AC coefficients

for different internal bit-depth</figcaption></figure>

Given the quantization step size, indicated as _Q<sub>step_, the input quantized

coefficients is further de-quantized using the following formula:

_F_ = sign * ( (_f_ * _Q<sub>step_) % 0xFFFFFF ) / _deNorm_

, where _f_ is the input quantized coefficient, _F_ is the output dequantized

coefficient, _deNorm_ is a constant value derived from the transform block area

size, as indicated by the following table:

| _deNorm_ | Tx block area size |

|----------|:--------------------------|

| 1| Less than 512 samples |

| 2 | 512 or 1024 samples |

| 4 | Greater than 1024 samples |

When the quantization index is 0, the quantization is performed using a

quantization step size equal to 1, which is lossless coding mode.

## Entropy Coding

**Entropy coding engine**

<mark>[Ed.: to be added]</mark>

**Coefficient coding**

For each transform unit, the coefficient coding starts with coding a skip sign,

which is followed by the signaling of primary transform kernel type and the

end-of-block (EOB) position in case the transform coding is not skipped. After

that, the coefficient values are coded in a multiple level map manner plus sign

values. The level maps are coded as three level planes, namely lower-level,

middle-level and higher-level planes, and the sign is coded as another separate

plane. The lower-level, middle-level and higher-level planes correspond to

correspond to different ranges of coefficient magnitudes. The lower level plane

corresponds to the range of 0–2, the middle level plane takes care of the

range of 3–14, and the higher-level plane covers the range of 15 and above.

The three level planes are coded as follows. After the EOB position is coded,

the lower-level and middle-level planes are coded together in backward scan

order, and the scan order refers to zig-zag scan applied on the entire transform

unit basis. Then the sign plane and higher-level plane are coded together in

forward scan order. After that, the remainder (coefficient level minus 14) is

entropy coded using Exp-Golomb code.

The context model applied to the lower level plane depends on the primary

transform directions, including: bi-directional, horizontal, and vertical, as

well as transform size, and up to five neighbor (in frequency domain)

coefficients are used to derive the context. The middle level plane uses a

similar context model, but the number of context neighbor coefficients is

reduced from 5 to 2. The higher-level plane is coded by Exp-Golomb code without

using context model. For the sign plane, except the DC sign that is coded using

the DC signs from its neighboring transform units, sign values of other

coefficients are coded directly without using context model.

## Loop filtering and post-processing

### Deblocking

There are four methods when picking deblocking filter level, which are listed

below:

* LPF_PICK_FROM_FULL_IMAGE: search the full image with different values

* LPF_PICK_FROM_Q: estimate the filter level based on quantizer and frame type

* LPF_PICK_FROM_SUBIMAGE: estimate the level from a portion of image

* LPF_PICK_MINIMAL_LPF: set the filter level to 0 and disable the deblocking

When estimating the filter level from the full image or sub-image, the searching

starts from the previous frame filter level, ends when the filter step is less

or equal to zero. In addition to filter level, there are some other parameters

which control the deblocking filter such as sharpness level, mode deltas, and

reference deltas.

Deblocking is performed at 128x128 super block level, and the vertical and

horizontal edges are filtered respectively. For a 128x128 super block, the

vertical/horizontal edges aligned with each 8x8 block is firstly filtered. If

the 4x4 transform is used, the internal edge aligned with a 4x4 block will be

further filtered. The filter length is switchable from 4-tap, 6-tap, 8-tap,

14-tap, and 0-tap (no filtering). The location of filter taps are identified

based on the number of filter taps in order to compute the filter mask. When

finally performing the filtering, outer taps are added if there is high edge

variance.

### Constrained directional enhancement filter

**Edge Direction Estimation**\

In CDEF, edge direction search is performed at 8x8 block-level. There are

eight edge directions in total, as illustrated in Figure 13.

<figure class="image"> <center><img src="img\edge_direction.svg"

alt="Edge direction" width="700" /> <figcaption>Figure 13: Line number

k for pixels following direction d=0:7 in an 8x8 block.</figcaption> </figure>

The optimal edge direction d_opt is found by maximizing the following

term [3]:

<figure class="image"> <center><img src="img\equ_edge_direction.svg"

alt="Equation edge direction" width="250" /> </figure>

<!-- $$d_{opt}=\max_{d} s_d$$

$$s_d = \sum_{k}\frac{1}{N_{d,k}}(\sum_{p\in P_{d,k}}x_p)^2,$$ -->

where x_p is the value of pixel p, P_{d,k} is the set of pixels in

line k following direction d, N_{d,k} is the cardinality of P_{d,k}.

**Directional filter**\

CDEF consists two filter taps: the primary tap and the secondary tap.

The primary tap works along the edge direction (as shown in Figure 14),

while the secondary tap forms an oriented 45 degree off the edge direction

 (as shown in Figure 15).

<figure class="image"> <center><img src="img\primary_tap.svg"

alt="Primary tap" width="700" /> <figcaption>Figure 14: Primary filter

taps following edge direction. For even strengths a = 2 and b = 4, for

odd strengths a = 3 and b = 3. The filtered pixel is shown in the

highlighted center.</figcaption> </figure>

<figure class="image"> <center><img src="img\secondary_tap.svg"

alt="Edge direction" width="700" /> <figcaption>Figure 15: Secondary

filter taps. The filtered pixel is shown in the highlighted center.

</figcaption> </figure>

CDEF can be described by the following equation:

<figure class="image"> <center><img src="img\equ_dir_search.svg"

alt="Equation direction search" width="720" /> </figure>

<!-- $$y(i,j)=x(i,j)+round(\sum_{m,n}w^{(p)}_{d,m,n}f(x(m,x)-x(i,j),S^{(p)},

D)+\sum_{m,n}w^{(s)}_{d,m,n}f(x(m,x)-x(i,j),S^{(s)},D)),$$ -->

where x(i,j) and y(i,j) are the input and output reconstructed values

of CDEF. p denotes primary tap, and s denotes secondary tap, w is

the weight between primary and secondary tap. f(d,S,D) is a non-linear

filtering function, S denotes filter strength, D is a damping parameter.

For 8-bit content, S^p ranges from 0 to 15, and S^s can be

0, 1, 2, or 4. D ranges from 3 to 6 for luma, and 2 to 4 for chroma.

**Non linear filter**\

CDEF uses a non-linear filtering function to prevent excessive blurring

when applied across an edge. It is achieved by ignoring pixels that are

too different from the current pixels to be filtered. When the difference

between current pixel and it's neighboring pixel d is within a threshold,

f(d,S,D) = d, otherwise f(d,S,D) = 0. Specifically, the strength S

determines the maximum difference allowed and damping D determines the

point to ignore the filter tap.

### Loop Restoration filter

**Separable symmetric wiener filter**

Let F be a w x w 2D filter taps around the pixel to be filtered, denoted as

a w^2 x 1 column vector. When compared with traditional Wiener Filter,

Separable Symmetric Wiener Filter has the following three constraints in order

to save signaling bits and reduce complexity [4]:

1) The w x w filter window of is separated into horizontal and vertical w-tap

convolutions.

2) The horizontal and vertical filters are constrained to be symmetric.

3) It is assumed that the summation of horizontal/vertical filter coefficients

is 1.

As a result, F can be written as F = column_vectorize[ab^T], subject to a(i)

= a(w - 1 - i), b(i) = b(w - 1 - i), for i = [0, r - 1], and sum(a(i)) =

sum(b(i)) = 1, where a is the vertical filters and b is the horizontal filters.

The derivation of the filters a and b starts from an initial guess of

horizontal and vertical filters, optimizing one of the two while holding the

other fixed. In the implementation w = 7, thus, 3 taps need to be sent for

filters a and b, respectively. When signaling the filter coefficients, 4, 5 and

6 bits are used for the first three filter taps, and the remaining ones are

obtained from the normalization and symmetry constraints. 30 bits in total are

transmitted for both vertical and horizontal filters.

**Dual self-guided filter**

Dual self-guided filter is designed to firstly obtain two coarse restorations

X1 and X2 of the degraded frame X, and the final restoration Xr is obtained as

a combination of the degraded samples, and the difference between the degraded

samples and the coarse restorations [4]:

<figure class="image"> <center><img src="img\equ_dual_self_guided.svg"

alt="Equation dual self guided filter" width="300" /> </figure>

<!-- $$X_r = X + \alpha (X_1 - X) + \beta (X_2 - X)$$ -->

At encoder side, alpha and beta are computed using:

<figure class="image"> <center><img src="img\equ_dual_self_para.svg"

alt="Equation dual self guided filter parameter" width="220" /> </figure>

<!-- $${\alpha, \beta}^T = (A^T A) ^{-1} A^T b,$$ -->

where A = {X1 - X, X2 - X}, b = Y - X, and Y is the original source.

X1 and X2 are obtained using guided filtering, and the filtering is controlled

by a radius r and a noise parameter e, where a higher r implies a higher

spatial variance and a higher e implies a higher range variance [4]. X1 and X2

can be described by {r1, e1} and {r2, e2}, respectively.

The encoder sends a 6-tuple {r1, e1, r2, e2, alpha, beta} to the decoder. In

the implementation, {r1, e1, r2, e2} uses a 3-bit codebook, and {alpha, beta}

uses 7-bit each due to much higher precision, resulting in a total of 17 bits.

r is always less or equal to 3 [4].

Guided filtering can be described by a local linear model:

<figure class="image"> <center><img src="img\equ_guided_filter.svg"

alt="Equation guided filter" width="155" /> </figure>

<!-- $$y=Fx+G,$$ -->

where x and y are the input and output samples, F and G are determined by the

statistics in the neighboring of the pixel to be filtered. It is called

self-guided filtering when the guidance image is the same as the degraded

image[4].

Following are three steps when deriving F and G of the self-guided filtering:

1) Compute mean u and variance d of pixels in a (2r + 1) x (2r + 1) window

around the pixel to be filtered.

2) For each pixel, compute f = d / (d + e); g = (1 - f)u.

3) Compute F and G for each pixel as averages of f and g values in a 3 x 3

window around the pixel for use in step 2.

### Frame super-resolution

In order to improve the perceptual quality of decoded pictures, a

super-resolution process is applied at low bit-rates [5]. First, at encoder

side, the source video is downscaled as a non-normative procedure. Second,

the downscaled video is encoded, followed by deblocking and CDEF process.

Third, a linear upscaling process is applied as a normative procedure to bring

the encoded video back to it's original spatial resolution. Lastly, the loop

restoration is applied to resolve part of the high frequency lost. The last

two steps together are called super-resolving process [5]. Similarly, decoding,

deblocking and CDEF processes are applied at lower spatial resolution at

decoder side. Then, the frames go through the super-resolving process.

In order to reduce overheads in line-buffers with respect to hardware

implementation, the upscaling and downscaling process are applied to

horizontal dimension only.

### Film grain synthesis

At encoder side, film grain is removed from the input video as a denoising

process. Then, the structure and intensity of the input video are analyzed

by canny edge detector, and smooth areas are used to estimate the strength

of film grain. Once the strength is estimated, the denoised video and film

grain parameters are sent to decoder side. Those parameters are used to

synthesis the grain and add it back to the decoded video, producing the final

output video.

In order to reconstruct the film grain, the following parameters are sent to

decoder side: lag value, autoregressive coefficients, values for precomputed

look-up table index of chroma components, and a set of points for a piece-wise

linear scaling function [6]. Those parameters are signaled as quantized

integers including 64 bytes for scaling function and 74 bytes for

autoregressive coefficients. Once the parameters are received, an

autoregressive process is applied in a raster scan order to generate one 64x64

luma and two 32x32 chroma film grain templates [6]. Those templates are used

to generate the grain for the remaining part of a picture.

## Screen content coding

To improve the coding performance of screen content coding, the associated video

codec incorporates several coding tools，for example, intra block copy

(IntraBC) is employed to handle the repeated patterns in a screen picture, and

palette mode is used to handle the screen blocks with a limited number of

different colors.

### Intra block copy

Intra Block Copy (IntraBC) [2] is a coding tool similar to inter-picture

prediction. The main difference is that in IntraBC, a predictor block is

formed from the reconstructed samples (before application of in-loop filtering)

of the current picture. Therefore, IntraBC can be considered as "motion

compensation" within current picture.

A block vector (BV) was coded to specify the location of the predictor block.

The BV precision is integer. The BV will be signalled in the bitstream since the

decoder needs it to locate the predictor. For current block, the flag use

IntraBC indicating whether current block is IntraBC mode is first transmitted in

bit stream. Then, if the current block is IntraBC mode, the BV difference diff

is obtained by subtracting the reference BV from the current BV, and then diff

is classified into four types according to the diff values of horizontal and

vertical component. Type information needs to be transmitted into the bitstream,

after that, diff values of two components may be signalled based on the type

info.

IntraBC is very effective for screen content coding, but it also brings a lot of

difficulties to hardware design. To facilitate the hardware design, the

following modifications are adopted.

1) when IntraBC is allowed, the loop filters are disabled, which are de-blocking

filter, the CDEF (Constrained Directional Enhancement Filter), and the Loop

Restoration. By doing this, picture buffer of reconstructed samples can be

shared between IntraBC and inter prediction.

2) To facilitate parallel decoding, the prediction cannot exceed the restricted

areas. For one super block, if the coordinate of its top-left position is (x0,

y0), the prediction at position (x, y) can be accessed by IntraBC, if y < y0 and

x < x0 + 2 * (y0 - y)

3) To allow hardware writing back delay, immediate reconstructed areas cannot be

accessed by IntraBC prediction. The restricted immediate reconstructed area can

be 1 ∼ n super blocks. So on top of modification 2, if the coordinate of one

super block's top-left position is (x0, y0), the prediction at position (x, y)

can be accessed by IntraBC, if y < y0 and x < x0 + 2 * (y0 - y) - D, where D

denotes the restricted immediate reconstructed area. When D is one super block,

the prediction area is shown in below figure.

<figure class="image"> <center><img src="img\SCC_IntraBC.svg" alt="Intra block

copy" width="600" /> <figcaption>Figure 13: the prediction area for IntraBC mode

in one super block prediction</figcaption> </figure>

### Palette mode

# References

[1] J. Han, Y. Xu and D. Mukherjee, "A butterfly structured design of the hybrid

transform coding scheme," 2013 Picture Coding Symposium (PCS), San Jose, CA,

2013, pp. 17-20.\

[2] J. Li, H. Su, A. Converse, B. Li, R. Zhou, B. Lin, J. Xu, Y. Lu, and R.

Xiong, "Intra Block Copy for Screen Content in the Emerging AV1 Video Codec,"

2018 Data Compression Conference, Snowbird, Utah, USA.\

[3] S. Midtskogen and J.M. Valin. "The AV1 constrained directional enhancement

 filter (CDEF)." In 2018 IEEE International Conference on Acoustics, Speech

  and Signal Processing (ICASSP), pp. 1193-1197. IEEE, 2018.\

[4] D. Mukherjee, S. Li, Y. Chen, A. Anis, S. Parker, and

J. Bankoski. "A switchable loop-restoration with side-information framework

for the emerging AV1 video codec." In 2017 IEEE International Conference on

Image Processing (ICIP), pp. 265-269. IEEE, 2017.\

[5] Y. Chen, D. Murherjee, J. Han, A. Grange, Y. Xu, Z. Liu,... & C.H.Chiang,

(2018, June). "An overview of core coding tools in the AV1 video codec.""

In 2018 Picture Coding Symposium (PCS) (pp. 41-45). IEEE.\

[6] A. Norkin, & N. Birkbeck, (2018, March). "Film grain synthesis for AV1

video codec." In 2018 Data Compression Conference (pp. 3-12). IEEE.

Source code

Revision control

Copy as Markdown

Other Tools