What did we claim: MDM-Prime-v2 claims that the model scales better than autoregressive models (ARMs) in compute-optimal comparisons. However, the loss function used is not comparable to metrics like log likelihood of data. This means conclusions about the scaling behavior are not comparable to those of masked diffusion models (MDMs) or ARMs.
What does that mean for the results in the paper: The claims in the paper about the perplexity performance of MDM-Prime-v2 are incorrect. In zero-shot settings, we do see consistent improvements on some benchmarks as the model scales; this merits further investigation to better understand how this arises.
What we're doing about it: The MDM-Prime-v2 preprint has been withdrawn. We are writing this post jointly with the EDLM [1] authors, because the same issue occurs in EDLM, to provide a single, clear technical record for the community.
How did this happen? This issue arises from the loss function used in masked diffusion models that capture dependencies between tokens. Both EDLM [1] and the subsequent work MDM-Prime [2] use such a loss function, which we now understand is not comparable to perplexity. This is why we are documenting it jointly.
If you're building on the MDM loss function and introduce any form of dependency between tokens in your parameterization of $p_\theta(\boldsymbol{x}_0|\boldsymbol{x}_t)$ (energy-based corrections [1], sub-token joint modeling [2], autoregressive decoders over groups, or anything else that couples the predictions), you can still train a perfectly good masked diffusion model. However, the number your loss function reports is no longer an upper bound on the negative log-likelihood.
This note shows that, if we replace the independent model $\prod_{i=1}^{L} p_\theta(x_0^i|\boldsymbol{x}_t)$ in the MDM loss with a dependent model $p_\theta(\boldsymbol{x}_0|\boldsymbol{x}_t)$, the model can achieve a loss lower than the entropy of the data. Specifically, if we calculate the lowest possible loss ($\inf_{p_\theta} \mathcal{L}$), we will see that $\inf_{p_\theta} \mathcal{L} < \mathcal{H}(\boldsymbol{x}_0)$. Since a valid variational bound must satisfy $\mathcal{L} \geq \mathcal{H}(\boldsymbol{x}_0)$, this suffices to show that the loss is invalid for perplexity evaluation.
| Symbol | Description | Symbol | Description |
|---|---|---|---|
| \(\boldsymbol{x}_0\) | a sequence of tokens | \(\boldsymbol{x}_t\) | a sequence of noised tokens |
| \(L\) | sequence length | \(\alpha_t\) | a decreasing scheduling function |
Masked diffusion models (MDMs) minimize the following loss during training:
$$ \mathcal{L} = \int_0^1 \frac{\alpha_t'}{1-\alpha_t} \mathbb{E}_q\left[\log p_\theta(\boldsymbol{x}_0|\boldsymbol{x}_t)\right] dt $$
where $q(\boldsymbol{x}_0, \boldsymbol{x}_t) = q(\boldsymbol{x}_t|\boldsymbol{x}_0)p_\text{data}(\boldsymbol{x}_0)$ and $q(\boldsymbol{x}_t|\boldsymbol{x}_0) = \prod_{i=1}^{L} q(x_t^i|x_0^i)$ is a masked diffusion kernel.
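For reference, here is a minimal sketch (ours, not taken from any of the papers discussed) of sampling from this factorized forward kernel; `MASK_ID` is a placeholder mask-token id:

```python
import numpy as np

MASK_ID = -1  # placeholder id for the mask token (an assumption, not from any codebase)

def sample_xt(x0: np.ndarray, alpha_t: float, rng: np.random.Generator) -> np.ndarray:
    """Sample x_t ~ q(x_t | x_0) = prod_i q(x_t^i | x_0^i).

    Each position independently keeps its token with probability alpha_t and is
    replaced by the mask token otherwise.
    """
    keep = rng.random(x0.shape) < alpha_t
    return np.where(keep, x0, MASK_ID)

# Example: noise a length-6 sequence at alpha_t = 0.4.
rng = np.random.default_rng(0)
print(sample_xt(np.array([5, 2, 9, 1, 3, 7]), alpha_t=0.4, rng=rng))
```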
This function is a valid variational bound only when \(\textstyle p_\theta(\boldsymbol{x}_0|\boldsymbol{x}_t) = \prod_{i=1}^{L} p_\theta(x_0^i|\boldsymbol{x}_t)\) factorizes independently over positions. This is the case in standard MDM [3].
If the parameterization of \(p_\theta(\boldsymbol{x}_0|\boldsymbol{x}_t)\) captures the dependencies among the variables \(\{x_0^1, \cdots, x_0^L\}\), the above loss function CANNOT represent a valid variational bound. In EDLM [1], \(p_\theta(\boldsymbol{x}_0|\boldsymbol{x}_t)\) is implemented as a (residual) energy-based model.
If \(p_\theta(\boldsymbol{x}_0|\boldsymbol{x}_t)\) captures partial dependencies among groups of variables (\(\{x_0^{1,1},\cdots,x_0^{1,\ell}\}\), \(\{x_0^{2,1},\cdots,x_0^{2,\ell}\}\), \(\cdots\),\(\{x_0^{L/\ell,1},\cdots,x_0^{L/\ell,\ell}\}\); each group has \(\ell\) variables), the above loss function also CANNOT represent a valid variational bound. In MDM-Prime [2], dependencies exist between the sub-tokens of each token but not between the tokens themselves.
Takeaway: If the tokens $x_0^1, \cdots, x_0^L$ of $\boldsymbol{x}_0$ are modeled as dependent in the parameterized predictor, the above loss does not yield a valid variational bound.
We show that both EDLM [1] and MDM-Prime [2] parameterize a dependence in the distribution over $\boldsymbol{x}_0$, which invalidates the bound. We provide a more detailed derivation here to explain how the ELBO arises.
Let's first look at the optimality of the independent, dependent, and partially dependent cases, i.e., $\inf_{p_\theta} \mathcal{L}$:
$$\inf_{p_\theta}\mathcal{L}=\inf_{p_\theta}\int_0^1\frac{\alpha_t'}{1-\alpha_t}\mathbb{E}_q[\log p_\theta(\boldsymbol{x}_0|\boldsymbol{x}_t)]dt =\int_0^1\frac{\alpha_t'}{1-\alpha_t} \sup_{p_\theta}\mathbb{E}_q[\log p_\theta(\boldsymbol{x}_0|\boldsymbol{x}_t)]dt,$$where the equality holds since $\frac{\alpha_t'}{1-\alpha_t} < 0$. The quantity we are interested in is:
$$\sup_{p_\theta}\mathbb{E}_q[\log p_\theta(\boldsymbol{x}_0|\boldsymbol{x}_t)]=\sup_{p_\theta}-\mathbb{E}_{q(\boldsymbol{x}_t)}[\mathbb{D}_\text{KL}[p(\boldsymbol{x}_0|\boldsymbol{x}_t)\|p_\theta(\boldsymbol{x}_0|\boldsymbol{x}_t)]]+C,$$where $q(\boldsymbol{x}_t)=\sum_{\boldsymbol{x}_0} q(\boldsymbol{x}_0,\boldsymbol{x}_t)$, $p(\boldsymbol{x}_0|\boldsymbol{x}_t)\triangleq \frac{q(\boldsymbol{x}_0,\boldsymbol{x}_t)}{q(\boldsymbol{x}_t)}$, and $C=\mathbb{E}_{q}[\log p(\boldsymbol{x}_0|\boldsymbol{x}_t)]$ is a constant that does not influence the optimization (a short derivation of this rewrite is given after the list below). The optimal solution for each of the above scenarios is as follows:
1. Independent case: $p_\theta^*(\boldsymbol{x}_0|\boldsymbol{x}_t)=\prod_{i=1}^L p(x_0^i|\boldsymbol{x}_t)$ (product of marginal conditional data distribution)
2. Dependent case: $p_\theta^*(\boldsymbol{x}_0|\boldsymbol{x}_t)=p(\boldsymbol{x}_0|\boldsymbol{x}_t)$ (conditional data distribution)
3. Partially dependent case: $p_\theta^*(\boldsymbol{x}_0|\boldsymbol{x}_t)=\prod_{i=1}^{L/\ell} p(\boldsymbol{x}_0^i|\boldsymbol{x}_t)$ (product of marginal conditional data distribution across groups.) Each $\boldsymbol{x}_0^i$ is represented as a vector $\boldsymbol{x}_0^i=[x_0^{i,1}, \cdots, x_0^{i,\ell}]$.
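For completeness, the KL rewrite above is the standard cross-entropy decomposition, with the expectation taken under the joint $q(\boldsymbol{x}_0,\boldsymbol{x}_t)$ at a fixed $t$:
$$\mathbb{E}_q\left[\log p_\theta(\boldsymbol{x}_0|\boldsymbol{x}_t)\right] = \mathbb{E}_{q(\boldsymbol{x}_t)}\mathbb{E}_{p(\boldsymbol{x}_0|\boldsymbol{x}_t)}\left[\log \frac{p_\theta(\boldsymbol{x}_0|\boldsymbol{x}_t)}{p(\boldsymbol{x}_0|\boldsymbol{x}_t)} + \log p(\boldsymbol{x}_0|\boldsymbol{x}_t)\right] = -\mathbb{E}_{q(\boldsymbol{x}_t)}\left[\mathbb{D}_\text{KL}\left[p(\boldsymbol{x}_0|\boldsymbol{x}_t)\,\|\,p_\theta(\boldsymbol{x}_0|\boldsymbol{x}_t)\right]\right] + C.$$
For the unrestricted (dependent) family, the KL can be driven to zero, giving case 2. For a product family, $\mathbb{E}_{p(\boldsymbol{x}_0|\boldsymbol{x}_t)}\left[\log \prod_i q_i(x_0^i)\right] = \sum_i \mathbb{E}_{p(x_0^i|\boldsymbol{x}_t)}\left[\log q_i(x_0^i)\right]$ is maximized by setting each factor $q_i$ to the corresponding true (group-)marginal, giving cases 1 and 3.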
Case 1 (Independent case) is well-established and used in standard MDM [3]. In the following example, we show that case 2 (Dependent case) and case 3 (Partially dependent case) violate $\mathcal{H}(\boldsymbol{x}_0) \leq \inf_{p_\theta} \mathcal{L}$. We revisit the loss function used by EDLM [1] and MDM-Prime [2] at the end.
This counterexample comes from Jiacheng You's Twitter post. We provide a more formal version with detailed calculations below for ease of reading.
In the following example, we suppose that the losses are minimized:
• $p_\theta^*(\boldsymbol{x}_0|\boldsymbol{x}_t) = \prod_{i=1}^{L} p(x_0^i|\boldsymbol{x}_t)$ (Independent)
• $p_\theta^*(\boldsymbol{x}_0|\boldsymbol{x}_t) = p(\boldsymbol{x}_0|\boldsymbol{x}_t)$ (Dependent)
• $p_\theta^*(\boldsymbol{x}_0|\boldsymbol{x}_t) = \prod_{i=1}^{L/\ell} p(\boldsymbol{x}_0^i|\boldsymbol{x}_t)$ (Partially dependent)
Suppose we have two states: $\boldsymbol{x}_0 \in \{0000, 1111\}$ with probability $1/2$ each.
• Length: $L = 4$, $\ell = 2$.
• Entropy: $\mathcal{H}(\boldsymbol{x}_0) = \log 2$.
• Any valid bound must satisfy: $\log 2 = \mathcal{H}(\boldsymbol{x}_0) \leq \inf_{p_\theta} \mathcal{L}$.
We use linear scheduling: $\alpha_t = 1 - t$ (the result is invariant to this choice [3]).
In all cases, we first calculate the time-wise loss $\mathcal{L}_\text{time-wise} = \mathbb{E}_q[\log p_\theta^*(\boldsymbol{x}_0|\boldsymbol{x}_t)]$ and then compute the integration $\int_0^1 \frac{\alpha_t'}{1-\alpha_t} \mathcal{L}_\text{time-wise}\, dt$.
Independent model: $p_\theta^*(\boldsymbol{x}_0|\boldsymbol{x}_t) = \prod_{i=1}^{L} p(x_0^i|\boldsymbol{x}_t)$
Case A. all masks: $\log p(\boldsymbol{x}_0|\mathbf{[m,m,m,m]})$
• $p(\boldsymbol{x}_0|\mathbf{[m,m,m,m]})$ is $\frac{1}{2}\cdot\frac{1}{2}\cdot\frac{1}{2}\cdot\frac{1}{2} = \frac{1}{16}$ for any $\boldsymbol{x}_0$.
• The chance of sampling $\boldsymbol{x}_t = \mathbf{[m,m,m,m]}$ is $(1-\alpha_t)^L = t^L$.
• The expectation is:
$$ \mathcal{L}_\text{time-wise} = -(1-\alpha_t)^L \log 16 = t^L(-\log 16). $$
Case B. at least one unmasked token, e.g., $\log p(\boldsymbol{x}_0|\mathbf{[m,x_0^i,m,m]})$
• Any revealed token results in $p(\boldsymbol{x}_0|\mathbf{[m,x_0^i,m,m]}) = 1\cdot1\cdot1\cdot1 = 1$ (since $\boldsymbol{x}_0$ can only take 0000 or 1111).
• The chance of sampling an $\boldsymbol{x}_t$ with at least one unmasked token is $(1-t^L)$.
• The expectation is:
$$ \mathcal{L}_\text{time-wise} = (1-t^L)\cdot 0 = 0. $$
Final loss:
$$ \mathcal{L} = \int_0^1 \frac{-1}{t}\cdot(-t^L\log 16 + 0)\,dt = \log 16 \int_0^1 t^{L-1}\,dt = \log 16 \cdot \frac{1}{L} = \frac{\log 16}{L} $$
→ The loss equals the entropy (valid): $\inf_{p_\theta}\mathcal{L} = \frac{\log 16}{L} = \frac{\log 16}{4} = \log 2 = \mathcal{H}(\boldsymbol{x}_0)$, so $\mathcal{H}(\boldsymbol{x}_0) \leq \inf_{p_\theta}\mathcal{L}$ holds.
Dependent model: $p_\theta^*(\boldsymbol{x}_0|\boldsymbol{x}_t) = p(\boldsymbol{x}_0|\boldsymbol{x}_t)$
Case A. all masks: $\log p(\boldsymbol{x}_0|\mathbf{[m,m,m,m]})$
• $p(\boldsymbol{x}_0|\mathbf{[m,m,m,m]})$ is $\frac{1}{2}$ for either data sequence ($0000$ or $1111$).
• The chance of sampling $\boldsymbol{x}_t = \mathbf{[m,m,m,m]}$ is $(1-\alpha_t)^L = t^L$.
• The expectation is:
$$ \mathcal{L}_\text{time-wise} = -(1-\alpha_t)^L\log 2 = t^L(-\log 2). $$
Case B. at least one unmasked token, e.g., $\log p(\boldsymbol{x}_0|\mathbf{[m,x_0^i,m,m]})$
• Any revealed token results in $p(\boldsymbol{x}_0|\mathbf{[m,x_0^i,m,m]}) = 1$.
• The chance of sampling an $\boldsymbol{x}_t$ with at least one unmasked token is $(1-t^L)$.
• The expectation is:
$$ \mathcal{L}_\text{time-wise} = (1-t^L)\cdot 0 = 0. $$
Final loss:
$$ \mathcal{L} = \int_0^1 \frac{-1}{t}\cdot(-t^L\log 2 + 0)\,dt = \log 2 \int_0^1 t^{L-1}\,dt = \log 2 \cdot \frac{1}{L} = \frac{\log 2}{L} $$
→ The loss falls below the entropy (invalid): $\inf_{p_\theta}\mathcal{L} = \frac{\log 2}{L} = \frac{\log 2}{4} < \log 2 = \mathcal{H}(\boldsymbol{x}_0)$.
Partially dependent model: $p_\theta^*(\boldsymbol{x}_0|\boldsymbol{x}_t) = \prod_{i=1}^{L/\ell} p(\boldsymbol{x}_0^i|\boldsymbol{x}_t)$
Case A. all masks: $\log p(\boldsymbol{x}_0|\mathbf{[(m,m),(m,m)]})$
• $p(\boldsymbol{x}_0|\mathbf{[(m,m),(m,m)]})$ is $\frac{1}{2}\cdot\frac{1}{2} = \frac{1}{4}$ for either data sequence ($0000$ or $1111$).
• The chance of sampling $\boldsymbol{x}_t = \mathbf{[(m,m),(m,m)]}$ is $(1-\alpha_t)^L = t^L$.
• The expectation is:
$$ \mathcal{L}_\text{time-wise} = -(1-\alpha_t)^L\log 4 = t^L(-\log 4). $$
Case B. at least one unmasked token, e.g., $\log p(\boldsymbol{x}_0|\mathbf{[(m,x_0^i),(m,m)]})$
• Any revealed token results in $p(\boldsymbol{x}_0|\mathbf{[(m,x_0^i),(m,m)]}) = 1\cdot1 = 1$.
• The chance of sampling an $\boldsymbol{x}_t$ with at least one unmasked token is $(1-t^L)$.
• The expectation is:
$$ \mathcal{L}_\text{time-wise} = (1-t^L)\cdot 0 = 0. $$
Final loss:
$$ \mathcal{L} = \int_0^1 \frac{-1}{t}\cdot(-t^L\log 4 + 0)\,dt = \log 4 \int_0^1 t^{L-1}\,dt = \log 4 \cdot \frac{1}{L} = \frac{\log 4}{L} $$
→ The loss falls below the entropy (invalid): $\inf_{p_\theta}\mathcal{L} = \frac{\log 4}{L} = \frac{2\log 2}{4} = \frac{\log 2}{2} < \log 2 = \mathcal{H}(\boldsymbol{x}_0)$.
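As a quick numerical sanity check (our own script, not from either paper), the three closed-form losses above can be reproduced by integrating the time-wise loss under the linear schedule; with this data distribution, any revealed token pins down the whole sequence, so only the all-masked pattern contributes:

```python
import numpy as np
from scipy.integrate import quad

L, ell = 4, 2
entropy = np.log(2)  # H(x_0) for the two-sequence distribution {0000, 1111}

# -log p*(x_0 | all-masked x_t) under each optimal predictor; every other mask
# pattern contributes 0 because a single revealed token determines the sequence.
all_masked_nll = {
    "independent":         L * np.log(2),           # product of 4 marginals of 1/2 -> log 16
    "dependent":           np.log(2),               # joint puts 1/2 on each valid sequence
    "partially dependent": (L // ell) * np.log(2),  # one factor of 1/2 per group   -> log 4
}

for name, c in all_masked_nll.items():
    # Loss = int_0^1 (1/t) * P(all masked) * c dt, with P(all masked) = t**L.
    loss, _ = quad(lambda t: t ** (L - 1) * c, 0.0, 1.0)  # (1/t) * t**L simplified to t**(L-1)
    print(f"{name:>20s}: loss = {loss:.4f}, entropy = {entropy:.4f}, "
          f"valid bound: {loss >= entropy - 1e-9}")
```

This prints $\log 2 \approx 0.693$ for the independent case, and $\log 2 / 4$ and $\log 2 / 2$ for the dependent and partially dependent cases, matching the calculations above.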
Intuitively, the dependent model concentrates all probability mass on the two valid sequences $\boldsymbol{x}_0 \in \{0000, 1111\}$ when all tokens are masked, whereas the independent model must spread mass across all 16 combinations including the 14 impossible ones. This is what allows the loss to fall below entropy.
Remark: While the minimal loss does depend on the data distribution, the violation holds generically whenever the true joint $p(\boldsymbol{x}_0|\boldsymbol{x}_t) \neq \prod_{i=1}^{L} p(x_0^i|\boldsymbol{x}_t)$. This means the loss function is valid for learning but invalid for a fair comparison to perplexity or log likelihood.
The parameterization determines whether dependencies among the variables are modeled. Here, we show that EDLM [1] is a Dependent Model and MDM-Prime [2] is a Partially Dependent Model.
EDLM [1] models inter-token dependencies where $p_\theta(\boldsymbol{x}_0|\boldsymbol{x}_t)$ is parameterized as follows:
$$ p_\theta(\boldsymbol{x}_0|\boldsymbol{x}_t) = \frac{\exp(-E_\theta(\boldsymbol{x}_0, \boldsymbol{x}_t))}{Z_\theta(\boldsymbol{x}_t)} \mu_\theta(\boldsymbol{x}_0|\boldsymbol{x}_t) $$
where $\mu_\theta(\boldsymbol{x}_0|\boldsymbol{x}_t) = \prod_{i=1}^{L} \mu_\theta(x_0^i|\boldsymbol{x}_t) = \prod_{i=1}^{L} \text{softmax}_\theta(x_0^i|\boldsymbol{x}_t)$.
The energy-correction term $\frac{\exp(-E_\theta(\boldsymbol{x}_0, \boldsymbol{x}_t))}{Z_\theta(\boldsymbol{x}_t)}$ acts as one large softmax normalized over the entire $\boldsymbol{x}_0$ space, which has $(\text{vocab\_size})^L$ elements. In general, this quantity cannot be decomposed:
$$ \frac{\exp(-E_\theta(\boldsymbol{x}_0,\boldsymbol{x}_t))}{Z_\theta(\boldsymbol{x}_t)} = \text{softmax}_\theta(\boldsymbol{x}_0|\boldsymbol{x}_t) \neq \prod_{i=1}^{L} \text{softmax}_\theta(x_0^i|\boldsymbol{x}_t) $$
Therefore, the independence assumption does not hold. The resulting loss $\mathcal{L}$ has optimal solution $p_\theta^*(\boldsymbol{x}_0|\boldsymbol{x}_t) = p(\boldsymbol{x}_0|\boldsymbol{x}_t) \neq \prod_{i=1}^{L} p(x_0^i|\boldsymbol{x}_t)$, so the loss is no longer lower-bounded by the data entropy (the Dependent case above).
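To make the non-factorization concrete, here is a toy numerical sketch (our illustration with random numbers, not EDLM's actual model) on a length-2 sequence with a 2-symbol vocabulary; the energy-corrected joint generally differs from the product of its own marginals:

```python
import numpy as np

rng = np.random.default_rng(0)
V, L = 2, 2  # toy vocabulary size and sequence length

# Independent base predictor mu_theta(x_0 | x_t): one categorical per position.
mu = rng.dirichlet(np.ones(V), size=L)  # shape (L, V)

# Arbitrary energy E_theta(x_0, x_t) defined on the full joint space (V**L entries).
E = rng.normal(size=(V,) * L)

# p_theta(x_0 | x_t) proportional to exp(-E(x_0)) * prod_i mu_i(x_0^i), normalized over all V**L states.
joint = np.exp(-E) * np.outer(mu[0], mu[1])
joint /= joint.sum()

# Compare the corrected joint with the product of its own marginals.
product_of_marginals = np.outer(joint.sum(axis=1), joint.sum(axis=0))
print("factorizes?", np.allclose(joint, product_of_marginals))  # generally False
```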
MDM-Prime [2] captures intra-token dependencies using sub-tokens: $f(x_0^i) = \boldsymbol{y}_0^i = [y_0^{i,1}, \cdots, y_0^{i,\ell}]$ represents the sequence of sub-tokens encoding token $x_0^i$. The predictor factorizes over tokens but not over the sub-tokens within each token:
$$ p_\theta(\boldsymbol{y}_0|\boldsymbol{y}_t) = \prod_{i=1}^{L} p_\theta(\boldsymbol{y}_0^i|\boldsymbol{y}_t) $$
The optimal solution satisfies $p_\theta^*(\boldsymbol{y}_0|\boldsymbol{y}_t) = \prod_{i=1}^{L} p(\boldsymbol{y}_0^i|\boldsymbol{y}_t) \neq \prod_{i=1}^{L}\prod_{j=1}^{\ell} p(y_0^{i,j}|\boldsymbol{y}_t)$, so the loss is no longer lower-bounded by the data entropy (the Partially dependent case above).
One interpretation is to view the loss function through the lens of a weighted KL divergence. Let $w(t) = \left|\frac{\alpha_t'}{1-\alpha_t}\right|$ be a positive weighting term. Then:
$$ \nabla_\theta \mathcal{L} = \nabla_\theta \int_0^1 w(t)\,\mathbb{E}_q[-\log p_\theta(\boldsymbol{x}_0|\boldsymbol{x}_t)]\,dt = \nabla_\theta \int_0^1 w(t)\,\mathbb{E}_{q(\boldsymbol{x}_t)}\left[\mathbb{D}_\text{KL}[p(\boldsymbol{x}_0|\boldsymbol{x}_t)\| p_\theta(\boldsymbol{x}_0|\boldsymbol{x}_t)]\right] dt $$
where the second equality holds because $\mathbb{E}_q[-\log p_\theta(\boldsymbol{x}_0|\boldsymbol{x}_t)]$ and the KL term differ only by a constant w.r.t. $\theta$. In the reverse diffusion process, we can use the following kernel:
$$ p(\boldsymbol{x}_s|\boldsymbol{x}_t) = \mathbb{E}_{p(\boldsymbol{x}_0|\boldsymbol{x}_t)}\left[q(\boldsymbol{x}_s|\boldsymbol{x}_t, \boldsymbol{x}_0)\right] $$
Minimizing the loss still yields a valid diffusion model that captures $p(\boldsymbol{x}_0|\boldsymbol{x}_t)$. However, the resulting loss value is not lower-bounded by the data entropy. In technical terms, these losses are a "valid training objective" but not a "valid evaluation metric" for the likelihood of the data.
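For concreteness, one ancestral step with the kernel above can be sketched as follows (a generic masked-diffusion reverse step; `sample_x0` and `MASK_ID` are placeholders, not functions from either codebase):

```python
import numpy as np

MASK_ID = -1  # placeholder mask-token id (an assumption, as before)

def reverse_step(x_t: np.ndarray, alpha_s: float, alpha_t: float,
                 sample_x0, rng: np.random.Generator) -> np.ndarray:
    """One ancestral step of p(x_s | x_t) = E_{p_theta(x_0 | x_t)}[q(x_s | x_t, x_0)].

    `sample_x0(x_t)` draws a full sequence from the learned p_theta(x_0 | x_t).
    For s < t (so alpha_s > alpha_t), each currently-masked position is revealed
    to its sampled value with probability (alpha_s - alpha_t) / (1 - alpha_t);
    already-unmasked positions are carried over unchanged.
    """
    x0 = sample_x0(x_t)
    unmask_prob = (alpha_s - alpha_t) / (1.0 - alpha_t)
    reveal = (x_t == MASK_ID) & (rng.random(x_t.shape) < unmask_prob)
    return np.where(reveal, x0, x_t)

# Example with a dummy predictor that always proposes zeros.
rng = np.random.default_rng(0)
x_t = np.array([MASK_ID, 3, MASK_ID, MASK_ID])
print(reverse_step(x_t, alpha_s=0.75, alpha_t=0.5,
                   sample_x0=lambda _: np.zeros(4, dtype=int), rng=rng))
```

The point is that the reverse process only needs samples from $p_\theta(\boldsymbol{x}_0|\boldsymbol{x}_t)$, so a dependent predictor is perfectly usable for generation even though its training loss is not a likelihood bound.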
Takeaway: The loss can be interpreted as a weighted KL divergence objective that will drive your model toward the correct $p_\theta(\boldsymbol{x}_0|\boldsymbol{x}_t)$, and sampling will work fine. But you cannot take that loss value, compare it to an autoregressive model's loss, and draw conclusions about which model is better; the two numbers are measuring different things. The parameterization for which the standard MDM loss is a valid ELBO (and therefore comparable to NLL/perplexity) is the fully independent factorization $\prod_{i=1}^{L} p(x_0^i|\boldsymbol{x}_t)$. This matters when studying scaling laws where the methodology rests on comparing loss curves across model families on a common footing.
To ensure that the bound is valid, we can marginalize the joint distribution to derive factorized distributions. Let $\boldsymbol{x}_0^{\neq i}$ denote all tokens of $\boldsymbol{x}_0$ except $x_0^i$, and let $\mathcal{M} \triangleq \{\boldsymbol{x}_0^{\neq i} \mid x_0^j \in \mathcal{X} \text{ for all } j \neq i\}$ be the set of all possible assignments to these tokens, where $\mathcal{X}$ is the vocabulary. Then:
$$ p_\theta(x_0^i|\boldsymbol{x}_t) \triangleq \sum_{\boldsymbol{x}_0^{\neq i} \in \mathcal{M}} p_\theta(x_0^i, \boldsymbol{x}_0^{\neq i}|\boldsymbol{x}_t) $$
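A minimal sketch of this marginalization for a single token's joint over its sub-tokens (our illustration; the array layout and names are assumptions, not MDM-Prime's actual code):

```python
import numpy as np

def factorized_log_prob(joint: np.ndarray, y: tuple) -> float:
    """Evaluate sum_j log p(y^j | x_t) from a joint over one token's sub-tokens.

    joint : normalized array of shape (b,) * ell over the b**ell sub-token
            combinations predicted for a single token.
    y     : the ell observed sub-token values of that token.

    Marginalizing out the other sub-tokens before taking the log recovers the
    fully factorized predictor for which the MDM loss is a valid bound.
    """
    logp = 0.0
    for j, y_j in enumerate(y):
        other_axes = tuple(a for a in range(joint.ndim) if a != j)
        marginal_j = joint.sum(axis=other_axes)  # p(y^j | x_t)
        logp += np.log(marginal_j[y_j])
    return logp

# Toy comparison for one token with ell = 2 sub-tokens over base b = 4.
joint = np.random.default_rng(0).dirichlet(np.ones(16)).reshape(4, 4)
y = (1, 3)
print("log joint (incorrect evaluation) :", np.log(joint[y]))
print("factorized log-prob (valid bound):", factorized_log_prob(joint, y))
```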
In MDM-Prime [2], the marginalized distribution can be derived efficiently since the sub-token space is relatively small. The following table compares the MDM-Prime ELBO calculated in the correct and incorrect ways (286M-parameter model trained on 168B tokens of OpenWebText):
| | MDM-Prime (\(\ell=6\)) — Joint (incorrect) | MDM-Prime (\(\ell=6\)) — Marginalized (correct) |
|---|---|---|
| NLL (\(\downarrow\)) | \(\leq 2.596\) | \(\leq 3.675\) |
| PPL (\(\downarrow\)) | \(\leq 13.41\) | \(\leq 39.48\) |
Credit: We'd like to sincerely thank Jiacheng You for raising this issue, which prompted us to dig deeper and formally write out this note explaining what we found [twitter].
[1] Xu et al. Energy-Based Diffusion Language Models for Text Generation. ICLR 2025.
[2] Chao et al. Beyond Masked and Unmasked: Discrete Diffusion Models via Partial Masking. NeurIPS 2025.
[3] Sahoo et al. Simple and Effective Masked Diffusion Language Models. NeurIPS 2024.
@article{chao2026dependency,
title = {Dependency breaks validity of loss functions in masked diffusion models},
author = {Chao, Chen-Hao and Xu, Minkai and Geffner, Tomas and Vahdat, Arash and Krishnan, Rahul G.},
journal = {chen-hao-chao.github.io},
year = {2026}
}