Abstract

In this work, we propose a unified representation for Super-Resolution (SR) and Image Compression, termed Factorized Fields, motivated by the shared principles between these two tasks. Both single-image super-resolution (SISR) and image compression require recovering and preserving fine image details, whether by enhancing resolution or by reconstructing compressed data. Unlike previous methods that mainly focus on network architecture, our proposed approach uses a basis-coefficient decomposition to explicitly capture multi-scale visual features and structural components in images, addressing the core challenges of both tasks. We first derive our SR model, which includes a Coefficient Backbone and a Basis Swin Transformer for generalizable Factorized Fields. Then, to further unify these two tasks, we leverage the strong information-recovery capabilities of the trained SR modules as priors in the compression pipeline, improving both compression efficiency and detail reconstruction. Additionally, we introduce a merged-basis compression branch that consolidates shared structures, further optimizing the compression process. Extensive experiments show that our unified representation delivers state-of-the-art performance, achieving an average relative improvement of 204.4% in PSNR over the baseline in SR and a 9.35% BD-rate reduction in image compression compared to the previous SOTA.


TL;DR: A new unified representation for both super-resolution and image compression.

The correlation between coordinate transformation and downsampling

(a) The sawtooth transformation example with \(k=2\). (b) The PixelUnShuffle downsample. (c) To explicitly model the information for sampling with a sawtooth, we rearrange the feature map in a dilation-like manner in the downsample layer of the Basis Swin Transformer. This way, the sampled features correctly capture the information in the original layout.
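One plausible reading of the dilation-like rearrangement in (c) can be sketched as follows: instead of folding the \(k \times k\) strided subgrids into channels (as PixelUnShuffle does), each subgrid is kept spatial but moved into a contiguous tile, so that window-based operations see pixels that are \(k\) apart in the original layout. The helper `dilation_rearrange` below is a hypothetical illustration of this idea, not the paper's exact implementation.

```python
import torch

def dilation_rearrange(x: torch.Tensor, k: int = 2) -> torch.Tensor:
    """Spatially regroup a (B, C, H, W) map so that the k*k strided
    subgrids x[..., i::k, j::k] become contiguous tiles: subgrid (i, j)
    occupies the (i, j) block of the output. The shape is unchanged."""
    b, c, h, w = x.shape
    assert h % k == 0 and w % k == 0
    hs, ws = h // k, w // k
    out = x.new_empty(b, c, h, w)
    for i in range(k):
        for j in range(k):
            # Pixels with stride k (dilation-like sampling) land in one tile.
            out[:, :, i * hs:(i + 1) * hs, j * ws:(j + 1) * ws] = x[:, :, i::k, j::k]
    return out

x = torch.arange(16).reshape(1, 1, 4, 4)
print(dilation_rearrange(x, 2)[0, 0])
# tensor([[ 0,  2,  1,  3],
#         [ 8, 10,  9, 11],
#         [ 4,  6,  5,  7],
#         [12, 14, 13, 15]])
```

With \(k=2\), each quadrant of the output holds one of the four stride-2 subgrids, so a window over a quadrant attends to exactly the samples a sawtooth coordinate transformation would select.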


The overall pipeline of image super-resolution with our Factorized Fields

Given a low-resolution image \(I_{LR}\), we first extract the coefficient feature map \(X_\text{coeff}\) with the coefficient backbone, which is then decoded into the coefficients and, separately, passed through the Basis Swin Transformer to produce the basis. Finally, the coefficients and basis are sampled, multiplied, and decoded into the final high-resolution output \(I_{HR}\), where \(s\), \(H\), and \(W\) denote the scale factor, height, and width, respectively.
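The sample-multiply-decode step above can be sketched as a toy reconstruction: upsample the coefficient map to the target resolution, modulate it factor-wise by the basis, and reduce over the factor dimension. The function `factorized_upsample`, the shapes, and the elementwise product are illustrative assumptions, not the paper's exact operator.

```python
import torch
import torch.nn.functional as F

def factorized_upsample(coeff: torch.Tensor, basis: torch.Tensor, s: int) -> torch.Tensor:
    """Toy Factorized Fields reconstruction.

    coeff: (B, K, h, w) coefficient map from the coefficient backbone.
    basis: (B, K, s*h, s*w) basis maps from the Basis Swin Transformer.
    Returns a (B, 1, s*h, s*w) feature ready for the output decoder.
    """
    # Sample the coefficients at the high-resolution grid.
    coeff_up = F.interpolate(coeff, scale_factor=s, mode="bilinear",
                             align_corners=False)      # (B, K, s*h, s*w)
    # Factor-wise product of coefficients and basis, summed over factors.
    return (coeff_up * basis).sum(dim=1, keepdim=True)  # (B, 1, s*h, s*w)

coeff = torch.randn(1, 8, 16, 16)
basis = torch.randn(1, 8, 64, 64)
out = factorized_upsample(coeff, basis, s=4)
print(out.shape)  # torch.Size([1, 1, 64, 64])
```

In the actual pipeline a learned decoder maps this feature to RGB; the sketch only shows how a basis-coefficient decomposition turns two compact factors into a high-resolution signal.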


The illustration of our joint image-compression and super-resolution framework compared with the traditional compression-only method

(a) Traditional learning-based compression methods. (b) Our approach surpasses (a) by incorporating our Super-Resolution (SR) Module as an information-recovery prior. (c) Expanding on (b), we introduce a multi-image compression strategy that uses both our SR Module and a Basis Merging Transformer to capture shared structure.


Visual comparisons on super-resolution (4×)



Performance (RD-Curve) evaluation on image compression using different datasets



Acknowledgements

This research was funded by the National Science and Technology Council, Taiwan, under Grants NSTC 112-2222-E-A49-004-MY2 and 113-2628-E-A49-023-. The authors are grateful to Google, NVIDIA, and MediaTek Inc. for their generous donations. Yu-Lun Liu acknowledges the Yushan Young Fellow Program by the MOE in Taiwan.