NIPS 2023

trigger inversion

Given a diffusion model suspected of being backdoored, trigger inversion aims to recover the underlying trigger from the model.

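The authors' key observation is that the model's output distribution differs depending on whether the input noise is clean ($\epsilon$) or shifted ($\epsilon+g$). The gap between the two output means is proportional to the trigger: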
$$
\mu_b^t-\mu_c^t=\lambda^t\tau
$$
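Here $\tau$ is the ideal trigger and $\lambda^t$ is a step-dependent scale factor. $\mu_b^{t-1}=M(x_c^t+\lambda^t\tau, t)$ and $\mu_c^{t-1}=M(x_c^t, t)$ are the model outputs for the triggered and clean inputs, where $M$ denotes the potentially backdoored model.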

In other words, we can initialize a trigger $\tau$ and optimize it so that the equation above holds, by minimizing:
$$
L=\Vert \mu_b^{t-1}-\mu_c^{t-1}-\lambda^{t-1}\tau\Vert_2
$$
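To make the objective concrete, here is a minimal NumPy sketch that only evaluates the loss. It is not the authors' implementation. `toy_model`, `TRUE_TAU`, `RHO`, `LAM_IN` and `LAM_OUT` are hypothetical stand-ins invented for illustration: the toy model is built so that the relation $\mu_b-\mu_c=\lambda\tau$ holds for one planted trigger. In practice $M$ would be the real denoiser and $\tau$ would be updated by gradient descent through it with an autodiff framework.

```python
import numpy as np

def inversion_loss(model, x_c, tau, t, lam_in, lam_out):
    """L = || mu_b - mu_c - lam_out * tau ||_2, averaged over the batch."""
    mu_b = model(x_c + lam_in * tau, t)   # output for the trigger-shifted input
    mu_c = model(x_c, t)                  # output for the clean input
    residual = mu_b - mu_c - lam_out * tau
    return float(np.mean(np.linalg.norm(residual, axis=1)))

# --- Hypothetical toy setup (assumption, not from the paper) ---
TRUE_TAU = np.full(16, 5.0)               # planted trigger
RHO, LAM_IN, LAM_OUT = 0.5, 1.0, 0.9      # arbitrary illustrative constants

def toy_model(x, t):
    # Projection coefficient of each sample onto the planted trigger.
    coeff = x @ TRUE_TAU / (TRUE_TAU @ TRUE_TAU)
    # The "backdoor" fires only when the input contains the trigger.
    gate = (coeff > 0.5).astype(x.dtype)[:, None]
    # t is ignored by this toy model.
    return RHO * x + (LAM_OUT - RHO * LAM_IN) * gate * TRUE_TAU

rng = np.random.default_rng(0)
x_c = rng.normal(size=(8, 16))            # batch of clean noisy inputs
wrong_tau = 5.0 * np.tile([1.0, -1.0], 8) # same norm, orthogonal to TRUE_TAU

loss_true = inversion_loss(toy_model, x_c, TRUE_TAU, 10, LAM_IN, LAM_OUT)
loss_wrong = inversion_loss(toy_model, x_c, wrong_tau, 10, LAM_IN, LAM_OUT)
```

In this toy setting the planted trigger drives the loss to (numerically) zero, while a candidate of the same norm that does not activate the backdoor leaves a clearly positive residual.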
Note that recovering a candidate trigger $\tau$ does not mean the model is definitely backdoored: the same procedure applied to a benign model also yields some $\tau$, just one that is meaningless.

backdoor detection

Under the threat model, the defender has no knowledge of the target image, so $ASR$ (attack success rate) cannot be applied directly. Instead, the authors use pairwise similarity as the evaluation metric:
$$
S(x_{[1,n]})=E[\Vert x_i-x_j\Vert_2]
$$
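A minimal NumPy sketch of this score follows. It assumes the expectation is taken over all unordered pairs $i<j$ of generated samples, which the formula above does not state explicitly. The function name `pairwise_score` is made up for illustration. Since the score is an average distance, a lower value means the samples are more alike; presumably a real trigger makes the outputs collapse toward a single target image and therefore gives a small score.

```python
import numpy as np

def pairwise_score(images):
    """Mean L2 distance over all unordered pairs (i < j) of samples."""
    x = np.asarray(images, dtype=np.float64)
    x = x.reshape(x.shape[0], -1)         # flatten each sample to a vector
    n = x.shape[0]
    if n < 2:
        raise ValueError("need at least two samples")
    i, j = np.triu_indices(n, k=1)        # index pairs with i < j
    return float(np.mean(np.linalg.norm(x[i] - x[j], axis=1)))

# Three 2-D "images": the pairwise distances are 5, 10 and 5.
spread = pairwise_score([[0, 0], [3, 4], [6, 8]])
# Identical samples, as if every output had collapsed to one image.
collapsed = pairwise_score([[1, 2], [1, 2], [1, 2]])
```

Averaging over unordered pairs avoids counting the zero self-distances, which would otherwise pull the score down as $n$ shrinks.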

How-to-remove-backdoors-in-diffusion-models