Ultra-Fast Language Generation
via Discrete Diffusion Divergence Instruct

Haoyang Zheng1, Xinyang Liu2, Cindy Xiangrui Kong1, Nan Jiang3, Zheyuan Hu4, Weijian Luo5, Wei Deng6, Guang Lin1

1Purdue University 2University of Texas at Austin 3University of Texas at El Paso 4National University of Singapore
5hi-Lab, Xiaohongshu Inc 6ML Research, Morgan Stanley

We unlock high-quality language generation in the blink of an eye with DiDi-Instruct.

🚀 Feel the Ultra-Fast Generation Speed:
Interactive demo comparing DiDi-Instruct (64×), MDMs (2×), and ARMs (1×): masked tokens are progressively unmasked as the number of function evaluations (NFEs) grows.

Contributions

DiDi-Instruct distills a few-step generator from a masked discrete diffusion language model, achieving up to 64× speed-ups with comparable or superior quality to its teacher and GPT-2 baselines.

• Principled Training Method for Fast (Language) Sequence Generation: We reformulate the distillation objective from a general policy-gradient perspective, deriving a simple yet tractable update rule in which the few-step student is optimized against a reward function. Using an adversarial language discriminator to estimate the log-density ratio (the reward) between the teacher dLLM and the student, we obtain a practical DiDi-Instruct algorithm that jointly trains the few-step student and its auxiliary discriminator (see the sketch after this list).
• Simple yet Effective Techniques in Training and Inference: We introduce grouped reward normalization, intermediate-state matching, and a reward-guided ancestral sampler (RGAS) that significantly improve training stability, model coverage, and inference quality, reducing generation perplexity by 30%.
• State-of-the-Art Fast Sequence Generation: DiDi-Instruct sets a new state of the art on the OpenWebText benchmark: consistently lower perplexity than the compared baselines across 8 to 128 NFEs, negligible entropy loss, and over 20× faster distillation; detailed ablations, model scaling, and protein sequence generation further confirm its robustness.
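To make the training recipe more concrete, here is a minimal PyTorch-style sketch of one distillation step with a discriminator-estimated reward and grouped reward normalization. It is only an illustration under assumed interfaces: student.sample, the discriminator signature, and the exact loss forms are our assumptions, not the released implementation.

    import torch
    import torch.nn.functional as F

    def didi_instruct_step(student, discriminator, teacher_batch,
                           opt_g, opt_d, group_size=8, eps=1e-6):
        """One hypothetical DiDi-Instruct-style update (sketch, not official code).

        Assumed interfaces:
          - discriminator(x) -> logit approximating log p_teacher(x) - log p_student(x)
          - student.sample(n) -> (sequences, per-sample log-probabilities with grad)
        """
        # Discriminator update: separate teacher sequences from student generations.
        with torch.no_grad():
            fake, _ = student.sample(group_size)
        loss_d = (F.softplus(-discriminator(teacher_batch)).mean()
                  + F.softplus(discriminator(fake)).mean())
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()

        # Student update: policy-gradient surrogate with grouped reward normalization.
        seqs, logp = student.sample(group_size)               # logp: shape (group_size,)
        reward = discriminator(seqs).detach().squeeze(-1)     # estimated log-density ratio
        reward = (reward - reward.mean()) / (reward.std() + eps)  # normalize within the group
        loss_g = -(reward * logp).mean()                      # REINFORCE-style objective
        opt_g.zero_grad(); loss_g.backward(); opt_g.step()
        return loss_d.item(), loss_g.item()

In this sketch, normalizing rewards within each sampled group is what "grouped reward normalization" refers to; the paper reports that it is one of the ingredients that stabilizes training.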
Perplexity vs. NFEs for DiDi-Instruct and baselines. Baselines include GPT-2 Small, masked diffusion language models (MDLM; Sahoo et al., 2024), diffusion duality (DUO; Sahoo et al., 2025), and self-distillation through time (SDTT; Deschenaux et al., 2025).

Abstract

Fast and high-quality language generation is the holy grail pursued in the age of AI. In this work, we introduce Discrete Diffusion Divergence Instruct (DiDi-Instruct), a training-based method that initializes from a pre-trained masked discrete diffusion language model and distills a few-step student for fast generation. The resulting DiDi-Instruct model achieves comparable or superior performance to its dLLM teacher and a GPT-2 baseline while enabling up to 64× acceleration. The theoretical foundation of DiDi-Instruct is a novel framework based on integral KL-divergence minimization, which yields a practical training algorithm. We further introduce grouped reward normalization, intermediate-state matching, and the reward-guided ancestral sampler that significantly improve training stability, model coverage, and inference quality. On OpenWebText, DiDi-Instruct achieves perplexities ranging from 62.2 (8 NFEs) to 18.4 (128 NFEs) and reduces additional training wall-clock time by more than 20× compared to competing dLLM distillation methods. We validate the robustness and effectiveness of DiDi-Instruct through extensive ablation studies, model scaling, and generation of discrete protein sequences. In conclusion, DiDi-Instruct is an efficient yet effective distillation method, enabling language generation in the blink of an eye.
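For readers who want the shape of the integral KL objective, it can be sketched as below. This is a hedged sketch in the spirit of integral-KL distillation: the weighting w_t, the time convention, and the marginal notation are our assumptions, not the paper's exact formulation.

    % Assumed form of an integral KL distillation objective:
    %   q_{t,\theta}: marginal of few-step student samples re-masked/diffused to noise level t
    %   p_t:          corresponding marginal under the teacher dLLM
    \mathcal{L}(\theta) \;=\; \int_0^1 w_t \,
      D_{\mathrm{KL}}\!\left( q_{t,\theta} \,\middle\|\, p_t \right) \mathrm{d}t ,
    \qquad w_t \ge 0 .

Under this reading, the discriminator's log-density ratio estimate plays the role of the reward in the update sketched above.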

BibTeX

If you find this useful, please cite:

    @article{zheng2025ultra,
      title={{Ultra-Fast Language Generation via Discrete Diffusion Divergence Instruct}},
      author={Zheng, Haoyang and Liu, Xinyang and Kong, Cindy Xiangrui and Jiang, Nan and Hu, Zheyuan and Luo, Weijian and Deng, Wei and Lin, Guang},
      journal={arXiv preprint arXiv:2509.25035},
      year={2025}
    }