Generative learning produces structured outputs such as sentences, images, or speech.
Below we quickly go through two main approaches in generative AI: the autoregressive (AR) model and the non-autoregressive (NAR) model.
Autoregressive model:
- generates one unit at a time, e.g. one word or one pixel, with each unit conditioned on the units already generated.
- slower, because the units are generated in sequence.
- usually better quality.
- commonly used in the word/sentence domain.
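The AR loop can be sketched with a toy example. This is only an illustration: the `NEXT_TOKENS` table is a hypothetical stand-in for a trained model, and `generate_autoregressive` is a name made up for this sketch.

```python
import random

# Hypothetical toy "model": for each previous token, the tokens that may follow.
NEXT_TOKENS = {
    "<s>": ["the"],
    "the": ["cat", "dog"],
    "cat": ["sat", "ran"],
    "dog": ["sat", "ran"],
    "sat": ["</s>"],
    "ran": ["</s>"],
}

def generate_autoregressive(max_len=10, seed=0):
    """Generate tokens one at a time, each conditioned on the previous token."""
    rng = random.Random(seed)
    tokens = ["<s>"]
    steps = 0
    while tokens[-1] != "</s>" and len(tokens) < max_len:
        # Step t needs the output of step t-1, so this loop is sequential.
        tokens.append(rng.choice(NEXT_TOKENS[tokens[-1]]))
        steps += 1
    return tokens[1:], steps  # drop the start marker

sentence, steps = generate_autoregressive()
print(sentence, "generated in", steps, "sequential steps")
```

The number of sequential steps grows with the output length, which is where the slowness comes from.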
Non-autoregressive model:
- fixes the output size in advance, e.g. 200 words or 1000 pixels, then produces all units at once.
- faster, provided the computation runs in parallel.
- commonly used in the image domain.
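A minimal sketch of the NAR idea, assuming a hypothetical per-position predictor `predict_position` in place of a real network: the output size is fixed up front, and no position depends on another output position, so all of them could be computed in parallel.

```python
OUTPUT_SIZE = 8  # fixed in advance

def predict_position(latent, i):
    """Hypothetical predictor: uses only the input and the position index,
    never the values of other output positions."""
    return (latent * (i + 1) + 7) % 256  # a toy "pixel" value in 0..255

def generate_non_autoregressive(latent, size=OUTPUT_SIZE):
    # Each call is independent of the others, so the order does not matter
    # and the work could be spread across parallel workers.
    return [predict_position(latent, i) for i in range(size)]

pixels = generate_non_autoregressive(latent=42)
print(pixels)
```

Because positions are independent, computing them in reverse order gives the same output, which an AR model cannot do.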
Can we use both (AR + NAR) at the same time? Yes.
In the word/sentence domain, generating the full output with AR alone takes a long time. The solution is to use AR first to quickly generate a coarse intermediate result, then use NAR to generate the fine result from it.
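The two-stage idea can be sketched as follows. Both functions are hypothetical toys rather than real models: the AR stage produces a short coarse plan sequentially, and the NAR stage expands it to the full-size output in one pass.

```python
def coarse_autoregressive(length, start=3):
    """AR stage: each coarse value depends on the previous one."""
    coarse = [start]
    for _ in range(length - 1):
        coarse.append((coarse[-1] * 5 + 1) % 16)
    return coarse

def refine_non_autoregressive(coarse, factor):
    """NAR stage: every fine position is computed independently
    from the coarse plan, so this pass could run in parallel."""
    size = len(coarse) * factor
    return [coarse[i // factor] * factor + i % factor for i in range(size)]

coarse = coarse_autoregressive(length=4)
fine = refine_non_autoregressive(coarse, factor=4)
print(len(coarse), "coarse values generated sequentially,", len(fine), "fine outputs")
```

Only the short coarse sequence pays the sequential cost; the longer fine output is filled in at once.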
In the image domain, we apply NAR repeatedly, looping it a few times to improve accuracy. The first passes produce a vague result, and each later pass refines it so that it becomes more and more accurate. This is also the basic idea behind the “Diffusion model”.
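The repeated-NAR loop can be illustrated with a toy refinement. This is not a real diffusion model: `TARGET` is an assumed stand-in for what a trained network would predict, and `denoise_step` is a hypothetical update that moves every position halfway toward it in each pass.

```python
import random

TARGET = [0.0, 0.25, 0.5, 0.75, 1.0]  # stand-in for the "clean" output

def denoise_step(current):
    """One NAR pass: all positions are updated at once."""
    return [x + 0.5 * (t - x) for x, t in zip(current, TARGET)]

def error(current):
    """Largest distance from the clean output over all positions."""
    return max(abs(x - t) for x, t in zip(current, TARGET))

rng = random.Random(0)
sample = [rng.uniform(-1, 2) for _ in TARGET]  # start from noise
errors = [error(sample)]
for _ in range(5):
    sample = denoise_step(sample)
    errors.append(error(sample))

print([round(e, 4) for e in errors])
```

The error shrinks on every pass, which mirrors the vague-to-accurate progression described above.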
Images ref: https://www.youtube.com/watch?v=AihBniegMKg&list=PLJV_el3uVTsOePyfmkfivYZ7Rqr2nMk3W&index=6&t=122s