implementation_details.md

# Implementation details of 2d-IDCT and reinterpreting-DCT

Note first that, for most sizes, we only need to generate Nx2N, NxN, and 2NxN transforms. The exceptions are the 8x32/32x8 IDCTs and the 1x4/4x1 reinterpreting-DCTs.

The 4x1/1x4 reinterpreting-DCTs are very small and don't need special consideration.

Large transforms use the same implementation strategy, but avoid increasing code size by using size-generic code.

## Code generation

Code is generated with Python scripts. The following bash snippet generates the relevant files:

```bash
# One generated file per IDCT size.
for i in 2 4 8 16 32
do
    python3 gen_idct.py "$i" > "src/idct$i.rs"
done

# One generated file per reinterpreting-DCT size.
for i in 2 4 8 16 32
do
    python3 gen_reinterpreting_dct.py "$i" > "src/reinterpreting_dct$i.rs"
done

# 2D transforms, then format all generated code.
python3 gen_idct2d.py > src/idct2d.rs
python3 gen_reinterpreting_dct2d.py > src/reinterpreting_dct2d.rs
cargo fmt
```

## SIMD type selection

The compiler generates suboptimal code when mixing different vector sizes. Thus, as a first step, we "downgrade" to the largest vector size that divides both side lengths of the transform.
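The selection rule above can be sketched in a few lines of Python. This is only an illustration: the function name `pick_vector_len` and the list of candidate lane counts are assumptions made for the example, not names or values taken from the generators.

```python
def pick_vector_len(rows, cols, lane_counts=(16, 8, 4, 2, 1)):
    """Return the largest lane count that divides both side lengths.

    `lane_counts` is a hypothetical set of vector lengths; the real set
    depends on the SIMD types available on the target.
    """
    for k in sorted(lane_counts, reverse=True):
        if rows % k == 0 and cols % k == 0:
            return k
    raise ValueError("no candidate lane count divides both sides")
```

With these assumed lane counts, an 8x32 transform would be processed with 8-lane vectors even if 16-lane vectors exist, because 16 does not divide 8.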

## DCT/IDCT Implementation

Both the DCT and the IDCT use the same recursive algorithm as libjxl to compute a vector's worth of DCTs/IDCTs at a time.
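To make the idea of a recursive DCT concrete, here is a minimal scalar sketch of an even/odd recursion for the unnormalized DCT-II, checked against the direct formula. It is an illustration under assumptions: the scaling, the function names, and the scalar (non-SIMD) form are choices made for this example and are not claimed to match the generated code or libjxl. In a vectorized version, each list element would be one vector, so every lane computes an independent transform.

```python
import math


def dct_naive(x):
    """Direct O(n^2) unnormalized DCT-II, used only as a reference."""
    n = len(x)
    return [
        sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n))
        for k in range(n)
    ]


def dct_recursive(x):
    """Unnormalized DCT-II by even/odd recursion; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    h = n // 2
    # Even-indexed outputs: half-size DCT of the mirrored sums.
    sums = [x[i] + x[n - 1 - i] for i in range(h)]
    # Odd-indexed outputs: half-size DCT of the scaled mirrored differences,
    # followed by adding adjacent coefficients.
    diffs = [
        (x[i] - x[n - 1 - i]) / (2 * math.cos(math.pi * (2 * i + 1) / (2 * n)))
        for i in range(h)
    ]
    even = dct_recursive(sums)
    odd = dct_recursive(diffs) + [0.0]  # coefficient h is identically zero
    out = [0.0] * n
    for m in range(h):
        out[2 * m] = even[m]
        out[2 * m + 1] = odd[m] + odd[m + 1]
    return out
```

The recursion halves the problem size at each level, which is why generating one function per power-of-two size (2 to 32) is natural.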

## 2d transforms

The code is written to minimize transposition cost while still ensuring we load full vectors at a time. We don't use any additional memory to store transposes.

Let K be the vector length (which divides both sides of the transform, as described under SIMD type selection).

### N x 2N transforms and 8x32 IDCT

For these transforms, the final output should have the same shape as the input. Thus, we logically need to transpose, DCT, transpose and DCT. However, we can instead first do a set of row-DCTs on K rows, transposing every KxK sub-matrix in place in advance, then do a column-DCT on the first K columns, and finally transpose the KxK sub-matrices in those columns again.
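The building block of this scheme is transposing each KxK sub-matrix in place. The sketch below shows that operation on a plain list-of-lists matrix; the function name and the data layout are assumptions made for illustration, and the real code works on SIMD vectors rather than Python lists.

```python
def transpose_blocks_in_place(m, k):
    """Transpose every k x k sub-matrix of m in place.

    m is a list of equal-length rows whose dimensions are multiples of k.
    No extra storage is used apart from the swap temporaries.
    """
    rows, cols = len(m), len(m[0])
    for r0 in range(0, rows, k):
        for c0 in range(0, cols, k):
            # Swap the strictly upper and lower triangles of this block.
            for i in range(k):
                for j in range(i + 1, k):
                    m[r0 + i][c0 + j], m[r0 + j][c0 + i] = (
                        m[r0 + j][c0 + i],
                        m[r0 + i][c0 + j],
                    )


strip = [[0, 1, 2, 3],
         [4, 5, 6, 7]]
transpose_blocks_in_place(strip, 2)
# strip is now [[0, 4, 2, 6], [1, 5, 3, 7]]
```

After the block transpose, each group of K adjacent elements in a row holds one column of the original K-row strip, so a transform that steps over full vectors effectively runs along the original rows with one row per lane. Applying the same function a second time restores the original layout.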

### N x N transforms

Square transforms are easy: we can do column-DCTs, then swap KxK blocks between the lower and upper triangular parts of the block-matrix, going K columns by K columns and transposing during the swap, and do a column-DCT after each group of columns is complete.
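The block swap described above amounts to a full in-place transpose performed one group of K columns at a time. The following sketch shows only that data movement; the function name and the optional `on_columns_done` callback (a stand-in for the column-DCT that runs once a group of columns is complete) are hypothetical and introduced for illustration.

```python
def transpose_square_by_blocks(m, k, on_columns_done=None):
    """Transpose a square matrix in place, one group of k columns at a time.

    Each k x k block on or below the diagonal is swapped with its mirror
    block above the diagonal, transposing both during the swap.
    """
    n = len(m)
    for bc in range(0, n, k):  # current group of k columns
        for br in range(bc, n, k):  # blocks on or below the diagonal
            for i in range(k):
                for j in range(k):
                    if br == bc and j <= i:
                        continue  # diagonal block: swap each pair only once
                    m[br + i][bc + j], m[bc + j][br + i] = (
                        m[bc + j][br + i],
                        m[br + i][bc + j],
                    )
        # Columns bc .. bc + k - 1 now hold their final, transposed contents.
        if on_columns_done is not None:
            on_columns_done(bc)
```

Because blocks above the diagonal in a column group were already filled in by earlier iterations, each group is final as soon as its own iteration ends, which is what lets the second transform pass start without waiting for the whole transpose.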

### 2N x N IDCTs and 32x8 IDCT

For these transforms, we have a special implementation of the 1D-IDCT that does part of the transpose.

In particular, we transpose NxN blocks as in the square case. We are then left with doing the row-DCT and interleaving blocks so that they go from being stacked horizontally to being stacked vertically. Since that can be done by just reshuffling individual columns of vectors, we merge that operation with the DCT.

### 2N x N DCTs

This is basically the same as the IDCTs, but in reverse order. Thus, the "special" DCT applies a different permutation.