Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis
In this paper, we propose a new task - generating speech from videos of people and their transcripts (VTTS) -...