How do different forms of “masking” affect a transformer?
There are many values you can mask with, including zeros, Gaussian noise (values drawn from a normal distribution), or negative numbers (there may be more)…
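For concreteness, here is a toy sketch (in numpy, with names I made up) of the variants I mean, applied to dummy patch embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy batch of patch embeddings: (num_patches, dim) -- stand-in for real data.
patches = rng.standard_normal((16, 8)).astype(np.float32)

# Randomly pick ~50% of patches to mask (the ratio here is arbitrary).
mask = rng.random(16) < 0.5

# Variant 1: replace masked patches with zeros.
zero_masked = patches.copy()
zero_masked[mask] = 0.0

# Variant 2: replace masked patches with fresh Gaussian noise.
noise_masked = patches.copy()
noise_masked[mask] = rng.standard_normal((mask.sum(), 8)).astype(np.float32)

# Variant 3: replace masked patches with a negative constant.
neg_masked = patches.copy()
neg_masked[mask] = -1.0
```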
I want to understand how the choice of masking affects transformer learning (or rather pre-training). If we have an image dataset, what kind of masking should we use, and why?