What is ElevenLabs doing? How is it so good?

Basically the title. What's their trick? On everything but voice, local models are pretty good for what they are, but ElevenLabs just blows everyone out of the water.

Is it full Transformer? Some sort of Diffuser? Do they model the human anatomy to add accuracy to the model?