PR-380: Textless Speech Emotion Conversion using Discrete and Decomposed Representations

Uploaded: Apr 10, 2022
Duration: 2912 s

JoonHo LEE

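Hello, this is JoonHo Lee, a member of PR12. Recently, Meta AI released an AI model that is trained on speech and generates natural-sounding dialogue, and Google AI released a speech-to-speech translation dataset. Processing information directly from speech to speech, rather than generating speech by way of text, has the advantage of preserving non-verbal elements such as laughter, yawns, silences, and backchannel responses. As a result, the generated speech can sound much more natural than speech generated from text.

In PR-380, as part of this line of Textless NLP research, I covered the paper "Textless Speech Emotion Conversion using Discrete & Decomposed Representations" (https://arxiv.org/abs/2111.07402), whose demo was released on the Meta AI blog on the last day of March. We first listen to the demo released by Meta AI, then look one by one at the following four main blocks that make up the overall system.

(1) HuBERT: a speech encoder that converts raw audio into a discrete representation (units)
(2) S2S Transformer: a network that converts source speech units into target speech units expressing the desired emotion
(3) Prosody predictor: a network that predicts the key elements of prosody, such as duration and pitch
(4) HiFi-GAN: a vocoder that takes the converted units together with the predicted prosody, the speaker, and so on, and generates raw audio

After detailed tuning, the system built this way shows substantially better performance on both qualitative and quantitative measures than existing systems that rely on text-to-speech. It makes me wonder whether Textless NLP, which is advancing rapidly, might soon enable AI that speaks more humanly than humans do. Enjoy! JoonHo Lee.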
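To make the data flow between the four blocks concrete, here is a minimal Python sketch. It is not the paper's implementation: the names `Prosody`, `collapse_repeats`, `expand_units`, and `convert_emotion` are hypothetical, the three model stages are passed in as plain callables, and the input is assumed to be the frame-level unit sequence that block (1) would produce. Collapsing repeated units before conversion and re-expanding them with predicted durations is also an assumption made here to show why a duration predictor is needed.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import Callable, List, Sequence, Tuple


@dataclass
class Prosody:
    """Hypothetical container for what block (3) predicts."""
    durations: List[int]  # number of output frames each unit should last
    f0: List[float]       # one pitch value per output frame


def collapse_repeats(frame_units: Sequence[int]) -> List[Tuple[int, int]]:
    """Run-length encode frame-level units into (unit, duration) pairs."""
    return [(unit, sum(1 for _ in run)) for unit, run in groupby(frame_units)]


def expand_units(units: Sequence[int], durations: Sequence[int]) -> List[int]:
    """Repeat each unit for its duration to get back a frame-level sequence."""
    if len(units) != len(durations):
        raise ValueError("units and durations must have the same length")
    frames: List[int] = []
    for unit, n_frames in zip(units, durations):
        frames.extend([unit] * n_frames)
    return frames


def convert_emotion(
    frame_units: Sequence[int],   # assumed output of block (1), one unit per frame
    target_emotion: str,
    speaker: str,
    translate: Callable[[List[int], str], List[int]],            # stand-in for block (2)
    predict_prosody: Callable[[List[int], str], Prosody],        # stand-in for block (3)
    vocode: Callable[[List[int], List[float], str, str], List[float]],  # stand-in for block (4)
) -> List[float]:
    """Chain the stages: units -> converted units -> prosody -> waveform."""
    # Drop the source durations; the target emotion gets its own timing.
    source_units = [unit for unit, _ in collapse_repeats(frame_units)]
    target_units = translate(source_units, target_emotion)
    prosody = predict_prosody(target_units, target_emotion)
    target_frames = expand_units(target_units, prosody.durations)
    return vocode(target_frames, prosody.f0, speaker, target_emotion)
```

In this sketch the source timing is discarded on purpose: only the unit identities go through the conversion stage, and the timing and pitch for the target emotion come entirely from the prosody stage. Real models would replace the three callables; any toy function with the matching signature is enough to exercise the flow.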
