DiffBIR: Towards Blind Image Restoration with Generative Diffusion Prior

1 Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; 2 Shanghai Artificial Intelligence Laboratory
(a) Visual comparison of blind image super-resolution (BSR) methods on real-world low-quality images.
(b) Visual comparison of blind face restoration (BFR) methods on real-world low-quality face images.

Comparison of DiffBIR with state-of-the-art BSR/BFR methods on real-world images. Compared to BSR methods, DiffBIR is more effective at (1) generating natural textures; (2) reconstructing semantic regions; (3) preserving small details; (4) handling severely degraded cases. Compared to BFR methods, DiffBIR can (1) handle occlusion; (2) produce satisfactory restoration beyond facial areas (e.g., headwear, earrings).


We present DiffBIR, which leverages pretrained text-to-image diffusion models for the blind image restoration problem. Our framework adopts a two-stage pipeline. In the first stage, we pretrain a restoration module across diversified degradations to improve generalization in real-world scenarios. The second stage leverages the generative ability of latent diffusion models to achieve realistic image restoration. Specifically, we introduce an injective modulation sub-network, LAControlNet, for finetuning, while the pretrained Stable Diffusion is kept fixed to maintain its generative ability. Finally, we introduce a controllable module that lets users balance quality and fidelity by introducing latent image guidance into the denoising process during inference. Extensive experiments demonstrate its superiority over state-of-the-art approaches for both blind image super-resolution and blind face restoration on synthetic and real-world datasets.
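The quality/fidelity trade-off can be illustrated with a toy guidance step. This is a minimal sketch and not the authors' implementation: it assumes the guidance penalizes the squared distance between an estimate of the clean latent and the latent of the stage-one output, and it applies the gradient directly to the next latent instead of backpropagating through the denoiser. The function name `apply_latent_guidance`, the `scale` parameter, and the tensor shapes are all hypothetical.

```python
import numpy as np


def apply_latent_guidance(z_next, z0_estimate, cond_latent, scale):
    """Pull the next latent toward the condition latent (illustrative only).

    Assumed loss: 0.5 * ||z0_estimate - cond_latent||^2, whose gradient with
    respect to z0_estimate is simply (z0_estimate - cond_latent).
    scale = 0 leaves sampling untouched; larger values favor fidelity.
    """
    grad = z0_estimate - cond_latent
    return z_next - scale * grad


def dist(a, b):
    """Euclidean distance between two latents."""
    return float(np.linalg.norm(a - b))


rng = np.random.default_rng(0)
cond_latent = rng.standard_normal((1, 4, 8, 8))  # stand-in for the encoded stage-one output
z0_estimate = rng.standard_normal((1, 4, 8, 8))  # stand-in for the predicted clean latent
z_next = z0_estimate.copy()                      # toy choice: next latent equals the estimate

unguided = apply_latent_guidance(z_next, z0_estimate, cond_latent, scale=0.0)
guided = apply_latent_guidance(z_next, z0_estimate, cond_latent, scale=0.5)

# The guided latent ends up closer to the condition latent than the unguided one.
print(dist(guided, cond_latent) < dist(unguided, cond_latent))
```

In this toy setting a scale of 0 reproduces the unguided sample, while a scale of 0.5 halves the distance to the condition latent; a real sampler would apply such a correction at each denoising step.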


DiffBIR Architecture

The two-stage pipeline of DiffBIR: (1) pretrain a Restoration Module (RM) for degradation removal to obtain \(I_{reg}\); (2) leverage the fixed Stable Diffusion through our proposed LAControlNet for realistic image reconstruction to obtain \(I_{diff}\). RM is trained across diversified degradations in a self-supervised manner and is fixed during stage two. LAControlNet contains a parallel module that is partially initialized with the denoiser's checkpoint and has several fusion layers. It uses the VAE's encoder to project \(I_{reg}\) to the latent space and concatenates the result with the randomly sampled noisy \(z_t\) as the conditioning mechanism.
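The concatenation-based conditioning described above can be sketched as follows. This is a shape-level illustration under stated assumptions, not the released code: the encoded \(I_{reg}\) is replaced by a random array, the 4-channel 64x64 latent size is an assumption borrowed from typical Stable Diffusion configurations, and `build_parallel_module_input` is a hypothetical helper name.

```python
import numpy as np


def build_parallel_module_input(z_t, cond_latent):
    """Concatenate the noisy latent and the condition latent along channels.

    Both arrays use the (batch, channels, height, width) layout and must
    share the same shape, since the condition is already in latent space.
    """
    if z_t.shape != cond_latent.shape:
        raise ValueError("z_t and cond_latent must have identical shapes")
    return np.concatenate([z_t, cond_latent], axis=1)


rng = np.random.default_rng(0)
# Stand-in for the VAE encoder applied to I_reg (assumed latent size).
cond_latent = rng.standard_normal((1, 4, 64, 64))
# Randomly sampled noisy latent z_t at some timestep t.
z_t = rng.standard_normal((1, 4, 64, 64))

x = build_parallel_module_input(z_t, cond_latent)
print(x.shape)  # (1, 8, 64, 64)
```

Because the channel count doubles, the first layer of the parallel module would need to accept the wider input, which is consistent with the module being only partially initialized from the denoiser's checkpoint.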


Visual results on real-world general images
Visual results on real-world face images


  author    = {Xinqi Lin and Jingwen He and Ziyan Chen and Zhaoyang Lyu and Ben Fei and Bo Dai and Wanli Ouyang and Yu Qiao and Chao Dong},
  title     = {DiffBIR: Towards Blind Image Restoration with Generative Diffusion Prior},
  journal   = {arXiv},
  year      = {2023},