Stable Diffusion Based Image Compression



egypturnash 10d
> To evaluate this experimental compression codec, I didn’t use any of the standard test images or images found online in order to ensure that I’m not testing it on any data that might have been used in the training set of the Stable Diffusion model (because such images might get an unfair compression advantage, since part of their data might already be encoded in the trained model).

I think it would be very interesting to determine if these images do come back with notably better compression.

fjkdlsjflkds 10d
This is not really "stable-diffusion based image compression", since it only uses the VAE part of "stable diffusion", and not the denoising UNet.

Technically, this is simply "VAE-based image compression" (that uses stable diffusion v1.4's pretrained variational autoencoder) that takes the VAE representations and quantizes them.

(Note: not saying this is not interesting or useful; just that it's not what it says on the label)

Using the "denoising UNet" would make the method more computationally expensive, but probably even better (e.g., you can quantize the internal VAE representations more aggressively, since the denoising step might be able to recover the original data anyway).

RosanaAnaDana 10d
Something interesting about the San Francisco test image is that if you start to look into the details, it's clear that some real changes have been made to the city. Rather than losing texture or grain or clarity, the information lost is information about the particular layout of a neighborhood of streets, which has now been replaced as if someone were drawing the scene from memory. This is a very different kind of loss that, without the original, might be imperceptible, because the lost information isn't replaced with random or systematic noise but with new, structured information.
fzzt 10d
The prospect of the images getting "structurally" garbled in unpredictable ways would probably limit real-world applications:

There's something to be said about compression algorithms being predictable, deterministic, and only capable of introducing defects that stand out as compression artifacts.

Plus, decoding performance and power consumption matter, especially on mobile devices (which also happen to be the setting where bandwidth gains are most meaningful).

eru 10d
Compare compressed sensing's single pixel camera:
minimaxir 10d
For text, GPT-2 was used in a similar demo a year ago, albeit said demo is now defunct:
vjeux 10d
How long does it take to compress and decompress an image that way?
dwohnitmok 10d
Indeed one way of looking at intelligence is that it is a method of compressing the external universe.

See e.g. the Hutter Prize.

kgeist 10d
I heard Stable Diffusion's model is just 4 GB. It's incredible that billions of images could be squeezed in just 4 GB. Sure it's lossy compression but still.
pyinstallwoes 10d
This relates to a strong hunch that consciousness is tightly coupled to whatever compression is as an irreducible entity.

Memory <> Compression <> Language <> Signal Strength <> Harmonics and Ratios

bscphil 10d
A few thoughts that aren't related to each other.

1. This is a brilliant hack. Kudos.

2. It would be great to see the best codecs included in the comparison - AVIF and JPEG XL. Without those it's rather incomplete. No surprise that JPEG and WEBP totally fall apart at that bitrate.

3. A significant limitation of the approach seems to be that it targets extremely low bitrates where other codecs fall apart, but at these bitrates it incurs problems of its own (artifacts take the form of meaningful changes to the source image instead of blur or blocking, very high computational complexity for the decoder).

When only moderate compression is needed, codecs like JPEG XL already achieve very good results. This proof of concept focuses on the extreme case, but I wonder what would happen if you targeted much higher bitrates, say 5x higher than used here. I suspect (but have no evidence) that JPEG XL would improve in fidelity faster as you gave it more bits than this SD-based technique. Transparent compression, where the eye can't tell a visual difference between source and transcode (at least without zooming in) is the optimal case for JPEG XL. I wonder what sort of bitrate you'd need to provide that kind of guarantee with this technique.

Dwedit 10d
This is why for compression tests, they incorporate the size of everything needed to decompress the file. You can compress down to 4.97KB all you want, just include the 4GB trained model.
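
A rough sketch of that accounting, for illustration only. The function name is hypothetical, and the byte counts are just the figures from this comment read as decimal units (an assumption):

```python
def amortized_bytes_per_image(model_bytes, image_bytes, num_images):
    """Average cost per image when the decoder itself counts toward the total."""
    return (model_bytes + image_bytes * num_images) / num_images


MODEL_BYTES = 4 * 10**9   # "4GB" read as 4e9 bytes (assumption)
IMAGE_BYTES = 4970        # "4.97KB" read as 4970 bytes (assumption)

for n in (1, 1_000, 1_000_000):
    # Print how the per-image cost changes as more images share one model.
    print(n, round(amortized_bytes_per_image(MODEL_BYTES, IMAGE_BYTES, n)))
```

The fixed cost of the model only fades once the same model is used to decode a very large number of images.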
illubots 10d
In theory, it would be possible to benefit from the ability of Stable Diffusion to increase perceived image quality without even using a new compression format. We could just enhance existing JPG images in the browser.

There already are client side algorithms that increase the quality of JPGs a lot. For some reason, they are not used in browsers yet.

For comparison, I made a version of the Lama with such a current JPG denoising algorithm:

A Stable Diffusion based enhancement would probably be much nicer in most cases.

There might be an interesting race to do client side image enhancements coming to the browsers over the next years.

Jack000 10d
The VAE used in Stable Diffusion is not ideal for compression. I think it would be better to use the vector-quantized variant (by the same authors as latent diffusion) instead of the KL variant, then store the indexes for each quantized vector using standard entropy coding algorithms.

According to the paper, the VQ variant also performs better overall; SD may have chosen the KL variant only to lower VRAM use.
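
To make the idea concrete, here is a toy sketch in plain Python: each latent vector is replaced by the index of its nearest codebook entry, and the Shannon entropy of the index stream estimates what an ideal entropy coder could reach. The codebook, the latents, and the function names are made up for illustration; a real VQ model has a learned codebook.

```python
import math
from collections import Counter


def nearest_index(vec, codebook):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(vec, codebook[i])),
    )


def entropy_bits_per_symbol(indices):
    """Shannon entropy of the index stream, a lower bound for an ideal entropy coder."""
    counts = Counter(indices)
    n = len(indices)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


# Toy 2-D codebook and latents (hypothetical values).
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0), (0.0, -1.0)]
latents = [(0.1, -0.1), (0.9, 1.2), (0.0, 0.2), (-0.8, 0.9), (0.1, 0.0), (0.05, 0.1)]

indices = [nearest_index(v, codebook) for v in latents]
bits = entropy_bits_per_symbol(indices)
print(indices, bits)
```

With four codebook entries a fixed-length code needs 2 bits per index; when some indexes occur more often than others, the entropy drops below that, which is the saving entropy coding would exploit.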

jwr 10d
While this is great as an experiment, before you jump into practical applications, it is worth remembering that the decompressor is roughly 5GB in size :-)
holoduke 9d
In the future you could have full 16K movies stored as seeds of only 1.44 MB. A giant 500-petabyte trained model file could play those movies. You could even generate your own movie by uploading a book.
madsbuch 10d
It is really interesting to talk about semantic lossy compression, which is probably what we get.

Where recreating with traditional codecs introduces syntactic noise, this will introduce semantic noise.

Imagine seeing a high-res, perfect picture, right up until you see the source image and discover that it was reinterpreted.

It is also going to be interesting to see if this method will be chosen for specific pictures, e.g. pictures of celebrity objects (or people, when/if issues around that resolve), while for novel things we still need to use "syntactical" compression.

fho 9d
> Quantizing the latents from floating point to 8-bit unsigned integers by scaling, clamping and then remapping them results in only very little visible reconstruction error.

This might actually be interesting/important for the OpenVINO adaptation of SD ... from what I gathered from the OpenVINO documentation, quantizing is actually a big part of optimizing, as this allows the usage of Intel's new(-ish) NN instruction sets.
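
For reference, the scale/clamp/remap step from the quote could look roughly like this. It is a minimal sketch in plain Python: the clamp range `lo`/`hi` and the function names are assumptions for illustration, not values taken from the article.

```python
def quantize(latents, lo=-4.0, hi=4.0):
    """Map floats to 8-bit unsigned integers. lo/hi are assumed, not from the article."""
    scale = 255.0 / (hi - lo)
    out = []
    for x in latents:
        x = min(max(x, lo), hi)                   # clamp to [lo, hi]
        out.append(int(round((x - lo) * scale)))  # remap to 0..255
    return bytes(out)


def dequantize(codes, lo=-4.0, hi=4.0):
    """Inverse mapping back to floats; exact only up to half a quantization step."""
    step = (hi - lo) / 255.0
    return [lo + c * step for c in codes]


latents = [-5.0, -1.25, 0.0, 0.7, 3.9]
codes = quantize(latents)
restored = dequantize(codes)
print(list(codes), restored)
```

Values inside the clamp range come back with an error of at most half a quantization step; values outside it are saturated to the nearest bound.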

Xcelerate 9d
Great idea to use Stable Diffusion for image compression. There are deep links between machine learning and data compression (which I’m sure the author is aware of).

If you could compute the true conditional Kolmogorov complexity of an image or video file given all visual online media as the prior, I imagine you would obtain mind-blowing compression ratios.

People complain of the biased artifacts that appear when using neural networks for compression, but I’m not concerned in the long term. The ability to extract algorithmic redundancy from images using neural networks is obviously on its way to outclassing manually crafted approaches, and it’s just a matter of time before we are able to tack on a debiasing step to the process (such that the distribution of error between the reconstructed image and the ground truth has certain nice properties).

sod 9d
This may give insights into how brain memory and thinking work.

Imagine if some day a computer could take a snapshot of the weights and memory bits of the brain and then reconstruct memories and thoughts.

euphetar 9d
I am currently also playing around with this. The best part is that for storage you don't need to store the reconstructed image, just the latent representation and the VAE decoder (which can do the reconstructing later). So you can store the image as relatively few numbers in a database. In my experiment I was able to compress a (512, 384, 3) RGB image to (48, 64, 4) floats. In terms of memory it was an 8x reduction.

However, on some images the artefacts are terrible. It does not work as a general-purpose lossy compressor unless you don't care about details.

The main obstacle is compute. The model is quite large, but HDDs are cheap. The real problem is that reconstruction requires a GPU with lots of VRAM. Even with a GPU it takes 15 seconds to reconstruct an image in Google Colab. You could do it on CPU, but then it's extremely slow. This is only viable if compute costs go down a lot.

DrNosferatu 9d
Nice work!

However, a cautionary tale on AI medical image "denoising":

(and beyond, in science)

- See the artifacts?

The algorithm fills ambiguous areas of the image with stuff it has seen before (i.e., what it was trained on). So, if such a system were to "denoise" (or compress, which - if you think about it - is basically the same operation) CT scans, X-rays, MRIs, etc., in ambiguous areas it could plug in diseased tissue where the ground truth was actually healthy.

Or the opposite, which is even worse: substitute diseased areas of the scan with healthy looking imagery it had been trained on.

Reading recent publications that try to do "denoising" or resolution "enhancement" in medical imaging contexts, the authors seem to be completely oblivious to this pitfall.

(maybe they had a background as World Bank / IMF economists?)

red75prime 9d
It reminded me of a scene from "A Fire Upon the Deep" where connection bitrate is abysmal, but the video is crisp and realistic. It is used as a tool for deception, as it happens. Invisible information loss has its costs.
Waterluvian 9d
I wonder if this technique could be called something like “abstraction” rather than “compression” given it will actually change information rather than its quality.

I.e., “There’s a neighbourhood here” is more of an abstraction than “here’s this exact neighbourhood with the correct layout, just fuzzy or noisy.”

zcw100 9d
You can do lossless neural compression too.
codeflo 9d
One interesting feature of ML-based image encoders is that it’ll be hard to evaluate them with standard benchmarks, because those are likely to be part of the training set, simply by virtue of being scraped from the web. How many copies of Lenna has Stable Diffusion been trained with? It’s on so many websites.
tomxor 9d
Doesn't decompression require the entire Stable Diffusion model? (and the exact same model at that)

This could be interesting but I'm wondering if the compression size is more a result of the benefit of what is essentially a massive offline dictionary built into the decoder vs some intrinsic benefit to processing the image in latent space based on the information in the image alone.

That said... I suppose it's actually quite hard to implement a "standard image dictionary" and this could be a good way to do that.
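
The "offline dictionary built into the decoder" effect is easy to see on a small scale with zlib's preset dictionary support. This is only an analogy, not how the SD-based codec works, and the sample strings are made up for illustration:

```python
import zlib

# Data both encoder and decoder already hold (hypothetical sample text).
shared = b"the quick brown fox jumps over the lazy dog " * 4
message = b"the lazy dog jumps over the quick brown fox"

# Baseline: compress the message with no prior knowledge.
plain = zlib.compress(message, 9)

# Same message, but the compressor may reference the shared dictionary.
enc = zlib.compressobj(level=9, zdict=shared)
with_dict = enc.compress(message) + enc.flush()

# Decoding only works if the decoder holds the exact same dictionary.
dec = zlib.decompressobj(zdict=shared)
restored = dec.decompress(with_dict) + dec.flush()

print(len(message), len(plain), len(with_dict))
```

The saving comes entirely from what the decoder already knows, and it disappears if the decoder's dictionary differs by even one byte, which mirrors the "exact same model" requirement.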

fritzo 9d
I'd love to see a series of increasingly compressed images, say 8kb -> 4kb -> 2kb -> ... -> 2bits -> 1bit. This would be a great way to demonstrate the increasing fictionalization of the method's recall.
MarkusWandel 9d
The one with the different buildings in the reconstructed image is a bit spooky. I've always argued that human memory is highly compressed, storing, for older memories anyway, a "vibe" plus pointers to relevant experiences/details that can be used to flesh it out as needed. Details may be wrong in the recollecting/retelling, but the "feel" is right.

And here we have computers doing the same thing! Reconstructing an image from a highly compressed memory and filling in appropriate, if not necessarily exact details. Human eye looks at it casually and yeah, that's it, that's how I remember it. Except that not all the details are right.

Which is one of those "Whoa!" moments, like many many years ago, when I wrote a "Connect 4" implementation in BASIC on the Commodore 64, played it and lost! How did the machine get so smart all of a sudden?

bjornsing 9d
If it’s a VAE then the latents should really be distributions, usually represented as the mean and variance of a normal distribution. If so then it should be possible to use the variance to determine to what precision a particular latent needs to be encoded. Could perhaps help increase the compression further.
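
One way this could be sketched, as a heuristic only: pick the bit depth for each latent so that the quantization step is about one standard deviation of its distribution. The value range, the bit limits, and the function name are all assumptions for illustration.

```python
import math


def bits_for_latent(std, value_range=8.0, min_bits=1, max_bits=8):
    """Pick a bit depth so the quantization step is roughly one std.

    Heuristic assumption: errors smaller than the encoder's own
    uncertainty are not worth spending bits on.
    """
    levels = value_range / max(std, 1e-6)   # number of steps of size ~std
    bits = math.ceil(math.log2(levels))
    return min(max(bits, min_bits), max_bits)


# Hypothetical per-latent standard deviations, from confident to uncertain.
stds = [0.03, 0.25, 1.0, 4.0]
allocation = [bits_for_latent(s) for s in stds]
print(allocation)
```

Latents with a tight distribution get the full 8 bits, while uncertain ones get only a few; the decoder would need the same allocation, so it would have to be stored or derived on both sides.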
a-dub 9d
hm. would be interesting to see if any of the perceptual image compression quality metrics could be inserted into the vae step to improve quality and performance...
UniverseHacker 9d
From the title, I expected this to be basically pairing stable diffusion with an image captioning algorithm by 'compressing' the image to a simple human readable description, and then regenerating a comparable image from the text. I imagine that would work and be possible, essentially an autoencoder with a 'latent space' of single short human readable sentences.

The way this actually works is pretty impressive. I wonder if it could be made lossless or less lossy in a similar manner to FLAC and/or video compression algorithms... basically first do the compression, and then add on a correction that converts the result partially or completely into the true image. Essentially, e.g. encoding real images of the most egregiously modified regions of the photo and putting them back over the result.
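
A minimal sketch of that correction step, assuming the pixels are available as raw bytes: store the difference between the original and the lossy reconstruction, compressed with a general-purpose coder. The `reconstructed` buffer below is only a stand-in for a real decoder's output, and the function names are hypothetical.

```python
import zlib


def encode_with_residual(original, reconstructed):
    """Compress the per-byte difference (mod 256) between original and reconstruction."""
    residual = bytes((o - r) % 256 for o, r in zip(original, reconstructed))
    return zlib.compress(residual, 9)


def decode_with_residual(reconstructed, packed):
    """Add the stored residual back onto the lossy reconstruction."""
    residual = zlib.decompress(packed)
    return bytes((r + d) % 256 for r, d in zip(reconstructed, residual))


original = bytes([10, 12, 200, 201, 50, 51, 52, 53] * 32)
# Stand-in for a lossy decoder output: close to the original, but not exact.
reconstructed = bytes((b + 1) % 256 if i % 8 == 0 else b for i, b in enumerate(original))

packed = encode_with_residual(original, reconstructed)
restored = decode_with_residual(reconstructed, packed)
print(len(original), len(packed), restored == original)
```

The better the lossy reconstruction, the more the residual looks like long runs of zeros and the cheaper the correction becomes; keeping only part of the residual would give the "less lossy" middle ground.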