This blog post provides an introduction to our proposed DLSM training method. We begin with a review of the denoising score-matching (DSM) method. Then, we discuss the limitations of the current conditional score-based generation methods. Next, we formulate the proposed DLSM loss. Finally, we present experimental results to demonstrate the effectiveness of DLSM. If you have any questions, please feel free to email Chen-Hao Chao. If you find this information useful, please consider sharing it with your friends.
The following paragraphs provide a review and background information on conditional score-based generation methods.
▶Parzen Density Estimator: Given a true data distribution $p(x)$ and a set of i.i.d. samples $\{x_i\}_{i=1}^{N}$ drawn from it, the Parzen density estimator with a Gaussian kernel approximates the data distribution as $q_\sigma(\tilde{x}) = \frac{1}{N}\sum_{i=1}^{N}\mathcal{N}(\tilde{x};\, x_i,\, \sigma^2 I)$, whose score function $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x})$ can be written in closed form.
This score function can be directly applied to generate samples with Langevin diffusion. Unfortunately, evaluating it requires a summation over all $N$ training samples at every step, which becomes computationally prohibitive as the dataset grows.
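To make this cost concrete, below is a minimal PyTorch sketch of the Parzen score computed from its closed form (the function name and tensor shapes are our own illustration, not from the paper); note the summation over all $N$ data points in every call.

```python
import torch

def parzen_score(x_tilde, data, sigma):
    """Score of the Gaussian Parzen density q_sigma(x~) = (1/N) sum_i N(x~; x_i, sigma^2 I)."""
    # Pairwise differences between every query point and every data point: O(N) per query.
    diff = data.unsqueeze(0) - x_tilde.unsqueeze(1)            # (B, N, D)
    log_kernel = -(diff ** 2).sum(dim=-1) / (2 * sigma ** 2)   # (B, N)
    # Responsibility of each kernel for x~ (softmax over the N kernels).
    w = torch.softmax(log_kernel, dim=1)                       # (B, N)
    # grad_x~ log q_sigma(x~) = sum_i w_i * (x_i - x~) / sigma^2
    return (w.unsqueeze(-1) * diff).sum(dim=1) / sigma ** 2    # (B, D)
```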
▶Denoising Score Matching (DSM): Score matching (SM) [2] was proposed to estimate the score function with a model $s_\theta(\tilde{x}) \approx \nabla_{\tilde{x}} \log q_\sigma(\tilde{x})$ by minimizing $\mathcal{L}_{SM} = \frac{1}{2}\,\mathbb{E}_{q_\sigma(\tilde{x})}\big[\lVert s_\theta(\tilde{x}) - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x}) \rVert^2\big]$.
This objective requires evaluating the intractable score $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x})$. Vincent [1] showed that minimizing $\mathcal{L}_{SM}$ is equivalent to minimizing the denoising score-matching (DSM) loss $\mathcal{L}_{DSM} = \frac{1}{2}\,\mathbb{E}_{q_\sigma(\tilde{x}, x)}\big[\lVert s_\theta(\tilde{x}) - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) \rVert^2\big]$,
where $q_\sigma(\tilde{x} \mid x) = \mathcal{N}(\tilde{x};\, x,\, \sigma^2 I)$ is the Gaussian perturbation kernel, so the target $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) = (x - \tilde{x}) / \sigma^2$ has a closed form and the objective can be optimized efficiently without summing over the dataset.
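As a rough illustration, the DSM objective can be implemented in a few lines of PyTorch. This sketch assumes a single fixed noise scale $\sigma$ and a `score_model` that maps noisy inputs to score estimates; practical score models are additionally conditioned on the noise level.

```python
import torch

def dsm_loss(score_model, x, sigma):
    """Denoising score-matching loss for one noise scale sigma."""
    noise = torch.randn_like(x)
    x_tilde = x + sigma * noise              # x~ ~ q_sigma(x~ | x)
    target = (x - x_tilde) / sigma ** 2      # closed-form grad_x~ log q_sigma(x~ | x)
    pred = score_model(x_tilde)              # s_theta(x~)
    return 0.5 * ((pred - target) ** 2).sum(dim=-1).mean()
```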
▶Score Decomposition via Bayes' Theorem: Score models can be extended to conditional models when conditioned on a certain label $y$. By Bayes' theorem, $q_\sigma(\tilde{x} \mid y) = q_\sigma(y \mid \tilde{x})\, q_\sigma(\tilde{x}) / q_\sigma(y)$, and thus the posterior score decomposes as $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid y) = \nabla_{\tilde{x}} \log q_\sigma(y \mid \tilde{x}) + \nabla_{\tilde{x}} \log q_\sigma(\tilde{x})$.
The equalities hold since $\nabla_{\tilde{x}} \log q_\sigma(y) = 0$, as $q_\sigma(y)$ does not depend on $\tilde{x}$. The likelihood score $\nabla_{\tilde{x}} \log q_\sigma(y \mid \tilde{x})$ can be approximated by differentiating a classifier $p_\phi(y \mid \tilde{x})$ with respect to its input [3, 4], while the prior score $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x})$ is approximated by a score model $s_\theta(\tilde{x})$.
The classifier can be trained with the cross-entropy (CE) loss $\mathcal{L}_{CE} = \mathbb{E}\big[-\log p_\phi(y \mid \tilde{x})\big]$. However, a classifier trained solely with $\mathcal{L}_{CE}$ is not explicitly encouraged to produce accurate likelihood scores, which gives rise to a score mismatch issue in the estimated posterior scores.
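Putting the two pieces together, a posterior score estimate can be assembled from a classifier and an unconditional score model. The sketch below assumes the classifier returns per-class logits; both model interfaces are illustrative.

```python
import torch

def posterior_score(score_model, classifier, x_tilde, y):
    """Posterior score via Bayes: grad log q(x~|y) = grad log p(y|x~) + grad log q(x~)."""
    x_tilde = x_tilde.detach().requires_grad_(True)
    log_probs = torch.log_softmax(classifier(x_tilde), dim=-1)    # log p_phi(. | x~)
    selected = log_probs.gather(1, y.unsqueeze(1)).sum()          # pick the target labels
    likelihood_score = torch.autograd.grad(selected, x_tilde)[0]  # grad_x~ log p_phi(y | x~)
    with torch.no_grad():
        prior_score = score_model(x_tilde)                        # s_theta(x~)
    return likelihood_score + prior_score
```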
To further investigate the score mismatch issue, we first leverage a motivational experiment on the inter-twining moon dataset to examine the extent of the discrepancy between the estimated and true posterior scores. In this experiment, we consider five different methods (denoted as (a)~(e)) to calculate the posterior scores: (a) the sum of the likelihood score estimated by a classifier trained with $\mathcal{L}_{CE}$ and the prior score estimated by a score model; (b) the same as (a), except that the likelihood score is multiplied by a scaling factor; (c) the posterior score directly estimated by a conditional score model trained with the DSM objective; (d) the sum of the true likelihood score and the prior score estimated by a score model; and (e) the true posterior score derived from the Parzen density estimator.
Figure 1. The visualized results on the inter-twining moon dataset. The plots presented in the first two rows correspond to the visualized vector fields for the posterior scores of the two classes, while the plots in the last row depict the points sampled with each of the methods (a)~(e).
Fig. 1 visualizes the posterior scores and the sampled points based on the five methods. It is observed that the posterior scores estimated using methods (a) and (b) are significantly different from the true posterior scores measured by method (e). This causes the sampled points in methods (a) and (b) to deviate from those sampled based on method (e). On the other hand, the estimated posterior scores and the sampled points in method (c) are relatively similar to those in method (e). The above results therefore suggest that the score mismatch issue is severe under the cases where methods (a) and (b) are adopted, but is alleviated when method (c) is used.
In order to inspect the potential causes for the differences between the results produced by methods (a), (b), and (c), we incorporate metrics for evaluating the sampling quality and the errors between the scores in a quantitative manner. The sampling quality is evaluated using the precision and recall [5] metrics. On the other hand, the estimation errors of the score functions are measured by the expected values of the L2 distances between the estimated scores and the true scores, where the true posterior and likelihood scores are derived from the Parzen density estimator.
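As a hypothetical illustration of such an error metric, the expectation can be approximated by Monte-Carlo sampling over noisy data points; the exact norm and weighting used in the original experiments may differ.

```python
import torch

def expected_score_error(est_score_fn, true_score_fn, x_tilde):
    """Monte-Carlo estimate of E[ || s_est(x~) - s_true(x~) || ] over noisy samples x~."""
    err = (est_score_fn(x_tilde) - true_score_fn(x_tilde)).norm(dim=-1)
    return err.mean().item()
```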
Table 1. The experimental results on the inter-twining moon dataset. The quality of the sampled data for different methods is measured in terms of the precision and recall metrics. The errors of the score functions for different methods are measured using the expected L2 distances to the true scores.
Table 1 presents the quantitative results. Consistent with the visualization in Fig. 1, methods (a) and (b) yield noticeably larger score errors and inferior precision and recall, while method (c) achieves lower score errors and sampling quality closer to that of method (e).
The above experimental clues therefore shed light on two essential issues to be further explored and dealt with. First, although employing a classifier trained with $\mathcal{L}_{CE}$ is a straightforward way to obtain the likelihood score, the resulting estimates may deviate severely from the true likelihood scores, which in turn distorts the estimated posterior scores. Second, since a model trained with the score-matching objective (i.e., method (c)) estimates the posterior scores more accurately, incorporating score matching into the classifier's training procedure may potentially alleviate the mismatch issue.
In this section, we introduce DLSM, a new training objective that encourages the classifier to capture the true likelihood score.
▶Objective Function: As discussed in the previous section, a score model trained with the score-matching objective can potentially
be beneficial in producing a better posterior score estimation. In light of this, a classifier may be enhanced if the score-matching process is
involved during its training procedure. An intuitive way to accomplish this aim is through minimizing the explicit likelihood score-matching loss $\mathcal{L}_{LSM} = \frac{1}{2}\,\mathbb{E}_{q_\sigma(\tilde{x}, y)}\big[\lVert \nabla_{\tilde{x}} \log p_\phi(y \mid \tilde{x}) - \nabla_{\tilde{x}} \log q_\sigma(y \mid \tilde{x}) \rVert^2\big]$.
This loss term, however, involves the calculation of the true likelihood score, whose computational cost grows with respect to the dataset size.
In order to reduce the computational cost, we follow the derivation of DSM as well as Bayes' theorem, and formulate an alternative objective
called the denoising likelihood score-matching (DLSM) loss: $\mathcal{L}_{DLSM} = \frac{1}{2}\,\mathbb{E}_{q_\sigma(\tilde{x}, x, y)}\big[\lVert \nabla_{\tilde{x}} \log p_\phi(y \mid \tilde{x}) + \nabla_{\tilde{x}} \log q_\sigma(\tilde{x}) - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) \rVert^2\big]$, where the denoising target $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) = (x - \tilde{x}) / \sigma^2$ is the same closed-form quantity used in DSM.
Proposition. Minimizing $\mathcal{L}_{DLSM}$ with respect to the classifier parameters $\phi$ is equivalent to minimizing $\mathcal{L}_{LSM}$, as the two losses differ only by a constant independent of $\phi$.
The proposition suggests that optimizing $\mathcal{L}_{DLSM}$ implicitly optimizes $\mathcal{L}_{LSM}$ while avoiding the explicit calculation of the true likelihood score. Since the true prior score $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x})$ is itself intractable, it is in practice replaced by a pretrained score model $s_\theta(\tilde{x})$, yielding the approximated loss $\mathcal{L}_{DLSM'}$.
The underlying intuition of $\mathcal{L}_{DLSM'}$ is that the classifier's likelihood score and the prior score provided by the fixed score model should jointly match the tractable denoising target $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x)$; by Bayes' theorem, this steers the classifier's input gradient toward the true likelihood score.
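A minimal PyTorch sketch of $\mathcal{L}_{DLSM'}$ is given below, again assuming a single fixed noise scale; `create_graph=True` is the key detail that lets the loss be backpropagated through the classifier's input gradient to update $\phi$.

```python
import torch

def dlsm_loss(classifier, score_model, x, y, sigma):
    """Approximated denoising likelihood score-matching loss (single noise scale for brevity)."""
    x_tilde = (x + sigma * torch.randn_like(x)).detach().requires_grad_(True)
    log_probs = torch.log_softmax(classifier(x_tilde), dim=-1)
    log_p_y = log_probs.gather(1, y.unsqueeze(1)).sum()
    # create_graph=True keeps the graph so the squared error below can be
    # backpropagated through this input gradient to the classifier parameters.
    lik_score = torch.autograd.grad(log_p_y, x_tilde, create_graph=True)[0]
    with torch.no_grad():
        prior_score = score_model(x_tilde)          # pretrained s_theta, kept fixed
    target = (x - x_tilde.detach()) / sigma ** 2    # denoising target grad log q(x~|x)
    return 0.5 * ((lik_score + prior_score - target) ** 2).sum(dim=-1).mean()
```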
▶Training Procedure: Following the theoretical derivation, we next discuss the practical aspects of training, and propose to train the classifier by jointly minimizing the approximated denoising likelihood score-matching loss and the cross-entropy loss. In practice, the total training objective of the classifier can be written as follows:
$\mathcal{L}_{Total} = \mathcal{L}_{DLSM'} + \mathcal{L}_{CE}$, where $\mathcal{L}_{DLSM'}$ denotes the approximated DLSM loss described above, and both terms are evaluated on the perturbed samples $\tilde{x}$.
Figure 2. The training procedure of the proposed methodology.
Fig. 2 depicts the two-stage training procedure adopted in this work. In stage 1, a score model $s_\theta$ is trained with $\mathcal{L}_{DSM}$ to capture the prior score. In stage 2, the parameters of $s_\theta$ are fixed, and the classifier is trained with $\mathcal{L}_{Total}$, using the frozen score model to provide the prior-score term in $\mathcal{L}_{DLSM'}$.
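Below is a sketch of one stage-2 update, reusing the `dlsm_loss` sketch above; the noise conditioning and multi-scale schedule used in practice are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def classifier_train_step(classifier, score_model, optimizer, x, y, sigma):
    """One stage-2 update: L_total = L_DLSM' + L_CE, with score_model frozen."""
    ce = F.cross_entropy(classifier(x + sigma * torch.randn_like(x)), y)
    dlsm = dlsm_loss(classifier, score_model, x, y, sigma)  # sketch from above
    loss = dlsm + ce
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```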
We compare the base method, scaling method, and our method on the CIFAR-10 and CIFAR-100 datasets, which contain real-world RGB
images with 10 and 100 categories of objects, respectively. For the sampling algorithm, we adopt the predictor-corrector (PC) sampler described in [6]
with the number of sampling steps held fixed across all compared methods.
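For intuition, the sketch below performs corrector-style Langevin updates guided by the `posterior_score` function defined earlier; the actual PC sampler of [6] additionally interleaves predictor steps across a schedule of noise scales.

```python
import torch

def guided_langevin(score_model, classifier, y, shape, step_size, n_steps):
    """Simplified classifier-guided Langevin sampling (corrector-only sketch)."""
    x = torch.randn(shape)
    for _ in range(n_steps):
        score = posterior_score(score_model, classifier, x, y)  # sketch from above
        x = x.detach() + step_size * score + (2 * step_size) ** 0.5 * torch.randn(shape)
    return x.detach()
```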
▶Evaluation on CIFAR-10 and CIFAR-100: In this section, we examine the effectiveness of the base method, the scaling method, and our proposed
method on the CIFAR-10 and CIFAR-100 benchmarks with several key evaluation metrics. We adopt the Inception Score (IS) and the Fréchet Inception Distance (FID)
as the metrics for evaluating the overall sampling quality by comparing the similarity between the distributions of the generated images and the real images.
We also evaluate the methods using the Precision (P), Recall (R) [5], Density (D), and Coverage (C) [7] metrics to further examine the fidelity and diversity
of the generated images. In addition, we report the Classification Accuracy Score (CAS) [8] to measure whether the generated samples bear representative class information.
Given a dataset containing samples generated by each method along with their conditioning labels, CAS is calculated by training a classifier from scratch on the generated samples and then evaluating its classification accuracy on the real test set; a higher CAS indicates that the generated samples carry more accurate class information.
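A sketch of the CAS evaluation loop: assuming a classifier has already been trained only on the generated samples, its accuracy is measured on real test data (the function and loader names are placeholders).

```python
import torch

def classification_accuracy_score(trained_on_generated, real_test_loader):
    """Accuracy on real test data of a classifier trained ONLY on generated samples."""
    correct = total = 0
    with torch.no_grad():
        for x, y in real_test_loader:
            correct += (trained_on_generated(x).argmax(dim=-1) == y).sum().item()
            total += y.numel()
    return correct / total
```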
Table 2. The evaluation results on the CIFAR-10 and CIFAR-100 datasets. The P / R / D / C metrics with '(CW)' in the last four rows represent the average class-wise metrics. The arrow symbols indicate whether higher (↑) or lower (↓) values of each metric are better.
Table 2 reports the quantitative results of the above methods. It is observed that the proposed method outperforms the other two methods by substantial margins in terms of FID and IS, indicating that the generated samples bear closer resemblance to the real data. Meanwhile, for the P / R / D / C metrics, the scaling method is superior to the other two methods in terms of the fidelity metrics (i.e., precision and density). However, this method may cause the diversity of the generated images to degrade, as depicted in Fig. 3, resulting in significant performance drops for the diversity metrics (i.e., the recall and the coverage metrics).
Figure 3. A comparison of the samples generated via the base method and the scaling method.
Another insight is that the base method achieves relatively better performance on the precision and
density metrics in comparison to our method. However, it fails to deliver an analogous tendency on
the CAS metric. This behavior indicates that the base method may be susceptible to generating
false positive samples, since the evaluation of the P / R / D / C metrics does not involve the class
information, and thus may fail to consider samples with wrong classes. Such a phenomenon motivates
us to further introduce a set of class-wise (CW) metrics, which takes the class information into
account by evaluating the P / R / D / C metrics on a per-class basis. Specifically, the class-wise
metrics are evaluated separately for each class and then averaged over all classes.
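A sketch of this class-wise evaluation, where `prdc_fn` is a placeholder for any implementation of the four metrics (e.g., one following [5] and [7]):

```python
import numpy as np

def class_wise_prdc(real_feats, real_labels, fake_feats, fake_labels, prdc_fn):
    """Average P / R / D / C over classes, computed on per-class feature subsets."""
    per_class = []
    for c in np.unique(real_labels):
        m = prdc_fn(real_feats[real_labels == c], fake_feats[fake_labels == c])
        per_class.append([m["precision"], m["recall"], m["density"], m["coverage"]])
    return np.mean(per_class, axis=0)   # the (CW) metrics reported in Table 2
```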
Figure 4. The sorted differences between the proposed method and the base method evaluated on the
CIFAR-10 and CIFAR-100 datasets for the class-wise P / R / D / C metrics. Each colored bar in the plots
represents the difference between our method and the base method evaluated using one of the P / R / D / C metrics for a certain class. A positive difference indicates that our method outperforms the base method for that class.
Based on this evidence, it can be concluded that the proposed method outperforms both baseline methods in terms of FID and IS, implying that our method possesses a better ability to capture the true data distribution. Additionally, the evaluation results on the CAS and the class-wise metrics suggest that our method offers a classifier a superior ability to learn accurate class information as compared to the base method.
▶Ablation Study: To further investigate the characteristics of $\mathcal{L}_{DLSM'}$, we compare classifiers trained with different combinations of the loss terms and track how their score errors and cross-entropy errors evolve during training.
Figure 5. The evaluation curves of (a) the score errors and (b) the cross-entropy errors during the training iterations for classifiers trained with different combinations of the loss terms.
[1] P. Vincent. A Connection between Score Matching and Denoising Autoencoders. Neural computation, 23(7):1661-1674, 2011.
[2] A. Hyvärinen. Estimation of Non-normalized Statistical Models by Score Matching. Journal of Machine Learning Research (JMLR), 6(24):695-709, 2005.
[3] A. M. Nguyen, J. Clune, Y. Bengio, A. Dosovitskiy, and J. Yosinski. Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space. In Proc. Int. Conf. on Computer Vision and Pattern Recognition (CVPR), 2017.
[4] P. Dhariwal and A. Nichol. Diffusion Models Beat GANs on Image Synthesis. arXiv preprint arXiv:2105.05233, 2021.
[5] T. Kynkäänniemi, T. Karras, S. Laine, J. Lehtinen, and T. Aila. Improved Precision and Recall Metric for Assessing Generative Models. In Proc. of Conf. on Neural Information Processing Systems (NeurIPS), 2019.
[6] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-Based Generative Modeling through Stochastic Differential Equations. In Proc. Int. Conf. on Learning Representations (ICLR), 2021.
[7] M. F. Naeem, S. J. Oh, Y. Uh, Y. Choi, and J. Yoo. Reliable Fidelity and Diversity Metrics for Generative Models. In Proc. of the Int. Conf. on Machine Learning (ICML), 2020.
[8] S. Ravuri and O. Vinyals. Classification Accuracy Score for Conditional Generative Models. In Proc. of Conf. on Neural Information Processing Systems (NeurIPS), 2019.
@inproceedings{
chao2022denoising,
title={Denoising Likelihood Score Matching for Conditional Score-based Data Generation},
author={Chen-Hao Chao and Wei-Fang Sun and Bo-Wun Cheng and Yi-Chen Lo and Chia-Che Chang and Yu-Lun Liu and Yu-Lin Chang and Chia-Ping Chen and Chun-Yi Lee},
booktitle={International Conference on Learning Representations (ICLR)},
year={2022},
url={https://openreview.net/forum?id=LcF-EEt8cCC}
}