I think it would be very interesting to determine whether these images do come back with notably better reconstruction.
Technically, this is simply "VAE-based image compression" (using Stable Diffusion v1.4's pretrained variational autoencoder): it takes the VAE latent representations and quantizes them.
(Note: I'm not saying this isn't interesting or useful; just that it's not what it says on the label.)
Using the "denoising UNet" would make the method more computationally expensive, but probably even better (e.g., you can quantize the internal VAE representations more aggressively, since the denoising step might be able to recover the original data anyway).
There's something to be said for compression algorithms being predictable, deterministic, and only capable of introducing defects that stand out as compression artifacts.
Plus, decoding performance and power consumption matter, especially on mobile devices (which also happen to be the setting where bandwidth gains are most meaningful).
See e.g. the Hutter Prize.
Memory <> Compression <> Language <> Signal Strength <> Harmonics and Ratios
1. This is a brilliant hack. Kudos.
2. It would be great to see the best current codecs included in the comparison - AVIF and JPEG XL. Without those it's rather incomplete. No surprise that JPEG and WebP totally fall apart at that bitrate.
3. A significant limitation of the approach seems to be that it targets extremely low bitrates where other codecs fall apart, but at these bitrates it incurs problems of its own (artifacts take the form of meaningful changes to the source image rather than blur or blocking, and decoding is computationally very expensive).
When only moderate compression is needed, codecs like JPEG XL already achieve very good results. This proof of concept focuses on the extreme case, but I wonder what would happen if you targeted much higher bitrates, say 5x higher than used here. I suspect (but have no evidence) that JPEG XL would improve in fidelity faster than this SD-based technique as you gave it more bits. Transparent compression, where the eye can't tell the difference between source and transcode (at least without zooming in), is the optimal case for JPEG XL. I wonder what sort of bitrate you'd need to provide that kind of guarantee with this technique.
There already are client-side algorithms that increase the quality of JPEGs a lot. For some reason, they are not used in browsers yet.
For comparison, I made a version of the Lama with such a current JPEG denoising algorithm:
A Stable Diffusion based enhancement would probably be much nicer in most cases.
There might be an interesting race over the next few years to bring client-side image enhancement to browsers.
According to the paper, the VQ variant also performs better overall; SD may have chosen the KL variant only to lower VRAM use.
Where reconstruction with traditional codecs introduces syntactic noise, this will introduce semantic noise.
Imagine seeing a seemingly perfect high-res picture, right up until you see the source image and discover that it was reinterpreted.
It is also going to be interesting to see whether this method gets chosen for specific pictures, e.g. pictures of celebrity objects (or people, when/if the issues around that resolve), while for novel things we still need to use "syntactic" compression.
This might actually be interesting/important for the OpenVINO adaptation of SD ... from what I gathered from the OpenVINO documentation, quantization is a big part of optimization, as it allows the use of Intel's new(-ish) NN instruction sets.
If you could compute the true conditional Kolmogorov complexity of an image or video file given all visual online media as the prior, I imagine you would obtain mind-blowing compression ratios.
People complain about the biased artifacts that appear when using neural networks for compression, but I’m not concerned in the long term. The ability to extract algorithmic redundancy from images using neural networks is obviously on its way to outclassing manually crafted approaches, and it’s just a matter of time before we are able to tack a debiasing step onto the process (such that the distribution of error between the reconstructed image and the ground truth has certain nice properties).
Imagine if some day a computer could take a snapshot of the weights and memory bits of the brain and then reconstruct memories and thoughts.
However, on some images the artefacts are terrible. It does not work as a general-purpose lossy compressor unless you don't care about details.
The main obstacle is compute. The model is quite large, but HDDs are cheap. The real problem is that reconstruction requires a GPU with lots of VRAM. Even with a GPU it takes 15 seconds to reconstruct an image in Google Colab. You could do it on a CPU, but then it's extremely slow. This is only viable if compute costs go down a lot.
However, a cautionary tale on AI medical image "denoising":
(and beyond, in science)
- See the artifacts?
The algorithm plugs stuff it has seen before / was trained on into ambiguous areas of the image. So, if such a system were used to "denoise" (or compress, which, if you think about it, is basically the same operation) CT scans, X-rays, MRIs, etc., it could plug diseased tissue into ambiguous areas where the ground truth was actually healthy.
Or the opposite, which is even worse: substitute diseased areas of the scan with healthy-looking imagery it had been trained on.
Judging from recent publications that try to do "denoising" or resolution "enhancement" in medical imaging contexts, the authors seem to be completely oblivious to this pitfall.
(maybe they had a background as World Bank / IMF economists?)
I.e., “there’s a neighbourhood here” is more of an abstraction than “here’s this exact neighbourhood with the correct layout, just fuzzy or noisy.”
This could be interesting, but I'm wondering whether the compression ratio is more a result of what is essentially a massive offline dictionary built into the decoder than of some intrinsic benefit of processing the image in latent space based on the information in the image alone.
That said... I suppose it's actually quite hard to implement a "standard image dictionary" and this could be a good way to do that.
And here we have computers doing the same thing! Reconstructing an image from a highly compressed memory and filling in appropriate, if not necessarily exact, details. The human eye looks at it casually and, yeah, that's it, that's how I remember it. Except that not all the details are right.
Which is one of those "Whoa!" moments, like many many years ago, when I wrote a "Connect 4" implementation in BASIC on the Commodore 64, played it and lost! How did the machine get so smart all of a sudden?
The way this actually works is pretty impressive. I wonder if it could be made lossless or less lossy in a manner similar to FLAC and/or video compression algorithms... basically, first do the compression, and then add a correction that converts the result partially or completely back into the true image. Essentially, you'd encode the most egregiously modified regions of the photo from the real image and patch them back over the result.
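A minimal sketch of that residual-correction idea, assuming numpy/zlib and uint8 image arrays (nothing here is from the original post; it's just the FLAC-style "lossy base + lossless correction" scheme described above):

    import zlib
    import numpy as np

    def encode_residual(original, lossy_recon):
        # Both are uint8 HxWxC arrays; the int16 difference avoids wraparound.
        residual = original.astype(np.int16) - lossy_recon.astype(np.int16)
        return zlib.compress(residual.tobytes(), level=9)

    def decode_with_residual(lossy_recon, payload):
        # Add the stored residual back onto the lossy reconstruction,
        # recovering the original exactly.
        residual = np.frombuffer(zlib.decompress(payload), dtype=np.int16)
        residual = residual.reshape(lossy_recon.shape)
        return (lossy_recon.astype(np.int16) + residual).clip(0, 255).astype(np.uint8)

The catch is that the residual of a semantic reconstruction isn't small-amplitude noise the way it is for a conventional codec, so it may not compress well; restricting the correction to the most egregiously changed regions, as suggested above, is one way around that.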