@CreamyLong, have you reproduced the layout results from the original Stable Diffusion paper?
I used your layout config and trained for 10 epochs (COCO dataset, batch size = 16), but the images logged during the training phase are shown below,
and I don't think they are correct training results. The training loss stayed around 0.27 and did not decrease during the whole training process.
input bbox

input image

decoded reconstruction directly from the first-stage embedded latent

sampled image from the latent diffusion model (ddim_step=200, eta=0.)

Originally posted by @Tonsty in #14 (comment)