Why does stacking many JPEG webcam frames reveal an 8×8 grid pattern?

Question

I averaged about 200,000 frames from a webcam pointed at a very dark, uniform scene. The camera and scene were static, and individual frames are extremely underexposed and noisy. After auto-equalising the stacked result, I can see the lens mount shading as expected, but I also see a fine grid across the whole image that looks aligned to 8×8 JPEG blocks.

I expected stacking so many frames to smooth out random noise. Why does the 8×8 grid remain so visible instead of averaging away?

user11924 · Accepted Answer

TL;DR: The DCT compression used in JPEGs introduces small edge-discontinuity errors between 8×8 blocks. These errors are systematic rather than random, so they survive in your image stack while the sensor noise averages away, which is why the "grid" is so prominent.

If the frame stacking/smoothing process had failed or had not used enough frames, I would have expected 8×8 blocks of varying colours, not just their edges.

You appear to have a slight misunderstanding about the compression used in JPEG images. You are correct that the compression in (most) JPEGs is done on blocks of 8×8 pixels. Your expectation would hold if the compression simply replaced each 8×8 block with that block's average pixel value (trivially achieving a 64:1 "compression" ratio).

Incidentally, this would be identical to downsampling your image(s) by a factor of 8 in each dimension (averaging each block), then "blowing up" the downsampled image by a factor of 8 without interpolation. This would produce poor-quality images, which is why it isn't done.
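A minimal sketch of that equivalence, assuming "downsampling" means averaging each block (a box filter) and "blowing up" means plain pixel repetition. The function names and the 16×16 test image are made up for illustration.

```python
def downsample(img, n=8):
    """Shrink by n in each dimension; each output pixel is the mean of an n×n block."""
    h, w = len(img), len(img[0])
    return [[sum(img[y][x] for y in range(by, by + n) for x in range(bx, bx + n)) / (n * n)
             for bx in range(0, w, n)]
            for by in range(0, h, n)]

def upsample_nearest(small, n=8):
    """Blow up by n in each dimension by repeating pixels (no interpolation)."""
    return [[small[y // n][x // n] for x in range(len(small[0]) * n)]
            for y in range(len(small) * n)]

def block_average(img, n=8):
    """Replace every n×n block with its own mean value."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for by in range(0, h, n):
        for bx in range(0, w, n):
            mean = sum(img[y][x] for y in range(by, by + n) for x in range(bx, bx + n)) / (n * n)
            for y in range(by, by + n):
                for x in range(bx, bx + n):
                    out[y][x] = mean
    return out

# Synthetic 16×16 test image: a diagonal gradient (illustrative data only).
img = [[float(x + y) for x in range(16)] for y in range(16)]
same = block_average(img) == upsample_nearest(downsample(img))
print(same)
```

Both routes give the same blocky image, with every 8×8 block flattened to a single value.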

JPEG compression takes advantage of the discrete cosine transform (DCT) of each 8×8 block. Like any Fourier-like transform, the DCT converts spatial information (such as an image) or time-domain information (such as audio) into frequency information. The DCT is favored in JPEG compression over other transforms (such as the discrete sine transform (DST) or the discrete Fourier transform (DFT)) for two reasons:

1. The DCT coefficients (the result of performing a discrete cosine transform over data) fall off to near zero more quickly than those of other transforms; and
2. The DCT "behaves well" at the edges of the data sample. Qualitatively, this means that the DCT introduces the least edge-discontinuity between neighboring pixel blocks. However, while the edge-discontinuity is small (compared to other transforms), the slope of the signal on either side of a block boundary is not continuous.
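As a rough illustration of the first point, here is a one-dimensional sketch that transforms a smooth 8-sample ramp with an orthonormal DCT-II and measures how much of the signal's energy lands in the first two coefficients. The ramp and the helper name `dct` are assumptions for the example. It shows only the compaction itself, not a comparison against other transforms.

```python
import math

N = 8  # JPEG block size

def dct(block):
    """Orthonormal 1-D DCT-II of an N-sample block."""
    coeffs = []
    for k in range(N):
        scale = math.sqrt((1 if k == 0 else 2) / N)
        coeffs.append(scale * sum(
            x * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
            for n, x in enumerate(block)))
    return coeffs

ramp = [float(n) for n in range(N)]   # smooth test signal: 0, 1, ..., 7
coeffs = dct(ramp)
energy = [c * c for c in coeffs]      # orthonormal transform, so energy is preserved
fraction_in_first_two = sum(energy[:2]) / sum(energy)

print([round(c, 3) for c in coeffs])
print(f"energy in first two coefficients: {fraction_in_first_two:.1%}")
```

For this ramp, more than 99% of the energy sits in the first two coefficients, which is what makes discarding the rest so cheap in terms of visible error.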

Mathematically, a discrete cosine transform produces fractional coefficients that would require high-precision arithmetic to store exactly. ("Discrete" here only means that the transform operates on a finite set of samples, not that its output is integer-valued.) JPEG therefore quantizes the coefficients: each one is divided by a step size from a quantization table and rounded to an integer. This rounding is part of the absolute error between a digital image and its JPEG equivalent.

The other part of the error, and where most of the compression comes from, is that the high-frequency DCT coefficients are quantized so coarsely that most of them round to zero and are, in effect, thrown away. This is analogous to representing the number ⅓ in decimal: 3 tenths (0.3), plus 3 hundredths (0.03), plus 3 thousandths (0.003), ad infinitum, giving the never-ending 0.3333... For our purposes, we say 0.333 is a decent approximation (the relative error is one part in 1,000, or 0.1%).

The error is small, but it is there. Specifically, even though the DCT is better at the edges than other Fourier-like transforms, the error is most visible at the block edges. This is what you're seeing in your composite image.
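The boundary effect can be reproduced in one dimension. The sketch below compresses two adjacent 8-sample blocks of a straight ramp independently, then looks at the sample-to-sample steps in the decoded signal. The parameters `keep=2` and `step=1.0` are arbitrary illustrative choices, not real JPEG quantization-table values, and `lossy_block` is a hypothetical helper.

```python
import math

N = 8  # JPEG block size

def scale(k):
    """Normalisation factor that makes the DCT orthonormal."""
    return math.sqrt((1 if k == 0 else 2) / N)

def dct(block):
    """Orthonormal 1-D DCT-II of an N-sample block."""
    return [scale(k) * sum(x * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                           for n, x in enumerate(block))
            for k in range(N)]

def idct(coeffs):
    """Inverse of dct(): rebuild the N samples from the coefficients."""
    return [sum(scale(k) * c * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                for k, c in enumerate(coeffs))
            for n in range(N)]

def lossy_block(block, keep=2, step=1.0):
    """Round the first `keep` coefficients to multiples of `step`; zero the rest."""
    coeffs = dct(block)
    quantized = [round(c / step) * step if k < keep else 0.0
                 for k, c in enumerate(coeffs)]
    return idct(quantized)

ramp = [float(i) for i in range(16)]                     # rises by exactly 1 per sample
decoded = lossy_block(ramp[:8]) + lossy_block(ramp[8:])  # blocks coded independently
steps = [b - a for a, b in zip(decoded, decoded[1:])]    # steps[7] crosses the boundary

for i, s in enumerate(steps):
    marker = "  <- block boundary" if i == 7 else ""
    print(f"{i:2d}->{i + 1:2d}: {s:.3f}{marker}")
```

The original ramp rises by exactly 1 everywhere. In the decoded signal the slope flattens toward each block edge, and the step across the boundary between samples 7 and 8 is larger than any step inside a block.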

Stacking/averaging more or less eliminates random (stochastic) noise (i.e., unbiased sensor noise), because the noise is as likely to be positive as negative and so tends to cancel; for independent noise, averaging N frames shrinks it by a factor of about √N. When you throw 200,000 fair dice, each number comes up roughly equally often. This is why you don't see magnified sensor noise in your composite image.

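However, biased data, whether there by its very nature (e.g., the image of your stepped lens mount) or introduced externally, does not average away. It is the same in every frame, so it stays at full strength while the noise around it shrinks, and auto-equalising the result then stretches it across the visible range. Every one of your frames was a JPEG, so every frame carries DCT-compression error on the same 8×8 grid, and that error is preserved in your stack.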
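A small simulation of this difference, under stated assumptions: each "frame" is a 16-sample row made of a fixed error of +0.5 at every 8th sample (a stand-in for a block-aligned compression artifact) plus zero-mean Gaussian noise ten times larger. All of these values are invented for illustration.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

WIDTH, FRAMES = 16, 20000
NOISE_SIGMA = 5.0

# Hypothetical systematic error, identical in every frame: +0.5 at samples 0 and 8.
bias = [0.5 if x % 8 == 0 else 0.0 for x in range(WIDTH)]

total = [0.0] * WIDTH
for _ in range(FRAMES):
    for x in range(WIDTH):
        total[x] += bias[x] + random.gauss(0.0, NOISE_SIGMA)

stack = [t / FRAMES for t in total]  # the averaged ("stacked") row
print([round(v, 2) for v in stack])
```

In any single frame the 0.5 offset is buried under noise with a standard deviation of 5. After averaging 20,000 frames the residual noise is about 5/√20000 ≈ 0.035, so the two biased samples stand out clearly.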

You are seeing a pronounced grid because DCT compression, through quantization error and block-by-block low-pass filtering of spatial frequencies, leaves slight edge-discontinuity errors between 8×8 blocks in every frame. These errors are systematic, so stacking accumulates them instead of cancelling them.
