- Oct 2024
-
www.semanticscholar.org
-
This change reduces the algorithm latency of the video encoder from 120 ms (3 frames) to 40 ms (1 frame)
does anyone know what this means?
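(One plausible reading, assuming 25 fps video so that each frame spans 40 ms: the encoder's lookahead drops from 3 frames to 1, i.e. 3 × 40 ms = 120 ms of algorithmic latency down to 1 × 40 ms = 40 ms.)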
-
- Apr 2024
-
www.semanticscholar.org
-
v
should be a
-
j
j?
is j equal to i?
-
- Mar 2024
-
www.semanticscholar.org
-
first training an audio-only SSL model, which is used to generate unit targets for audio-visual SSL pre-training.
Instead of MFCC? Improvement over AV-HuBERT
-
We improve AV-HuBERT (Shi et al., 2022a) with simplified training protocol and updated model architecture, and scale model size up to 2B parameters for performance in the multilingual setting.
Bigger model
-
-
arxiv.org
-
The decoder optimises the first loss term only, the encoder optimises the first and the last loss terms, and the embeddings are optimised by the middle loss term
Good summary
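This gradient routing matches the standard VQ-VAE objective (reconstruction + codebook + commitment). A minimal sketch, assuming an MSE reconstruction term and the usual stop-gradient placement (the function name and β are mine):

```python
import torch.nn.functional as F

def vq_vae_loss(x, x_hat, z_e, e, beta=0.25):
    # First term: reconstruction -- trains the decoder directly, and the
    # encoder via the straight-through estimator (z_q = z_e + (e - z_e).detach()).
    recon = F.mse_loss(x_hat, x)
    # Middle term: codebook loss -- z_e is detached, so only the embeddings
    # e receive gradient.
    codebook = F.mse_loss(e, z_e.detach())
    # Last term: commitment loss -- e is detached, so only the encoder
    # receives gradient.
    commit = beta * F.mse_loss(z_e, e.detach())
    return recon + codebook + commit
```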
-
- Feb 2024
-
www.semanticscholar.org
-
C
Not used in code for AR. Just for reference
-
-
www.semanticscholar.org
-
SpeechTokenizer training
-
-
www.semanticscholar.org
-
BigVGAN-base (without filter & snake) refers to the vanilla HiFi-GAN architecture trained with MRD replacing MSD.
HiFi-GAN comparison:
* +MRD
* -MSD
* +filter & snake
* +bigger model size
* +larger training dataset
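For reference, the "snake" item is BigVGAN's periodic activation, snake(x) = x + (1/α)·sin²(αx); BigVGAN learns α per channel, so the scalar default below is a simplification:

```python
import torch

def snake(x, alpha=1.0):
    # Snake activation: x + (1/alpha) * sin^2(alpha * x)
    return x + torch.sin(alpha * x).pow(2) / alpha
```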
-
- Jan 2024
-
www.semanticscholar.org
-
α = 1 and β = 0
α and β are adjusted only for the student, regardless of the teacher
-
(A-data2vec)
This is vanilla data2vec
-
- Dec 2023
-
www.semanticscholar.org
-
validating that L_mse improves the convergence
could also be a poison (hurt rather than help)
-
- Nov 2023
-
www.semanticscholar.org
-
It has been found in the literature [32, 31] that waveform reconstruction from VQ acoustic feature needs additional prosody feature
additional information: a prosody feature is needed to reconstruct the waveform
-
In particular, txt2vec basically becomes a classification model rather than a traditional regression model
nice design
-
- Oct 2023
-
www.semanticscholar.org
-
In practice, we use an equally weighted combination of two spectrogram losses for 16kHz audio: one with large FFT window size of 2048 and hop size of 512, and one with small FFT window size of 512 and hop size of 128. The larger one gives more frequency resolution, while the smaller one gives more temporal resolution.
MultiResolutionSTFTLoss
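A minimal sketch of that two-resolution loss, assuming L1 on STFT magnitudes and a Hann window (neither is specified in the excerpt):

```python
import torch
import torch.nn.functional as F

def stft_magnitude(x, n_fft, hop):
    window = torch.hann_window(n_fft, device=x.device)
    return torch.stft(x, n_fft=n_fft, hop_length=hop, window=window,
                      return_complex=True).abs()

def two_resolution_stft_loss(pred, target):
    # Equally weighted sum over the two resolutions quoted above:
    # (2048, 512) favours frequency resolution, (512, 128) temporal.
    loss = 0.0
    for n_fft, hop in [(2048, 512), (512, 128)]:
        loss = loss + F.l1_loss(stft_magnitude(pred, n_fft, hop),
                                stft_magnitude(target, n_fft, hop))
    return 0.5 * loss
```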
-
First, we train the feed-forward WaveNet for 500K steps with learning rate 0.001, using L1 loss and spectrogram loss.
stage 1
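A stage-1 loop in the same spirit, reusing two_resolution_stft_loss from the sketch above; the Conv1d model is a trivial stand-in for the paper's feed-forward WaveNet, the random batch is a placeholder, and Adam is my assumption:

```python
import torch
import torch.nn.functional as F

model = torch.nn.Conv1d(1, 1, kernel_size=9, padding=4)  # stand-in, not a WaveNet
opt = torch.optim.Adam(model.parameters(), lr=0.001)

for step in range(500_000):
    wav = torch.randn(4, 1, 16000)   # placeholder batch of 16 kHz audio
    pred = model(wav)
    loss = (F.l1_loss(pred, wav)
            + two_resolution_stft_loss(pred.squeeze(1), wav.squeeze(1)))
    opt.zero_grad()
    loss.backward()
    opt.step()
```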
-
-
www.semanticscholar.org
-
For this purpose, we use the L1 loss between the STFT magnitudes of the real and synthesized audio as follows
Just magnitude, no phase
-
We did not attempt batch normalization, which worked well for the generator, since this interferes with the gradient penalty for our adversarial loss [16].
No Batch Norm For Critics
-
To discriminate the real from the synthesized waveforms, we adapt the critic from [25].
Critic is from MelGAN
-
In an attempt to alleviate the issue of abrupt frame transitions, we use an overlap of 50% between the generated waveform frames
50% overlap, so the decoder structure shown in Fig. 3 works out to deconvolution upsampling by 1280× rather than 640×; this is because half of every frame is swallowed by the averaging
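A sketch of how 50%-overlapped frames are merged by averaging, which is why a nominal 1280-sample frame only advances the output by 640 samples (the helper name is mine):

```python
import torch

def overlap_add_average(frames: torch.Tensor, hop: int) -> torch.Tensor:
    """Average 50%-overlapped waveform frames.

    frames: (n_frames, frame_len) with frame_len == 2 * hop; consecutive
    frames overlap by `hop` samples, so each frame advances the output by
    only half its length (e.g. 1280-sample frames -> 640-sample hop).
    """
    n_frames, frame_len = frames.shape
    assert frame_len == 2 * hop
    out = torch.zeros(n_frames * hop + hop)
    norm = torch.zeros_like(out)
    for i in range(n_frames):
        out[i * hop : i * hop + frame_len] += frames[i]
        norm[i * hop : i * hop + frame_len] += 1.0
    return out / norm.clamp(min=1.0)  # average where frames overlap
```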
-
-
www.semanticscholar.org
-
we use the speech representation Z without pitch supervision to predict pitch
Z trained without pitch supervision hasn't learned pitch information
-
WaveGAN including VAE and GAN,
WaveGAN=VAE(Encoder-Decoder)+GAN
-
λ3·L_recons + λ4·L_adv,g + λ5·L_fm
STFT loss + GAN loss + feature-matching loss
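A sketch of that combination for the generator update; the least-squares adversarial form and the unit λ defaults are my assumptions (the excerpt only names the three terms):

```python
import torch
import torch.nn.functional as F

def generator_loss(l_recons, disc_fake_scores, disc_real_feats, disc_fake_feats,
                   lam3=1.0, lam4=1.0, lam5=1.0):
    # l_recons: reconstruction term (e.g. an STFT loss on magnitudes).
    # L_adv,g: least-squares GAN loss on the discriminator's fake scores.
    l_adv = sum((1.0 - s).pow(2).mean() for s in disc_fake_scores)
    # L_fm: L1 feature matching between real/fake discriminator activations.
    l_fm = sum(F.l1_loss(f, r.detach())
               for r, f in zip(disc_real_feats, disc_fake_feats))
    return lam3 * l_recons + lam4 * l_adv + lam5 * l_fm
```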
-
we do not directly use any predesigned acoustic features such as Mel-spectrum. This is because the estimated approximations of the Mel distribution from the acoustic model are still different from those inputs to a neural vocoder, which may cause artifacts, i.e., metallic effects in synthesized audios.
no mel-spectrogram!
-
-
www.semanticscholar.org
-
(i) content encoder; (ii) F0 encoder; (iii) speaker identity encoder; and a decoder network.
structure
-
The study by [23] is the closest to our work. In which, the authors suggest training a VQ-VAE with two encoders, one for the waveform and the other for F0
VQ-VAE with dual encoders
-
-
arxiv.org
-
LLMs employing StreamingLLM maintain reasonable accuracy even as input lengths approach 120K tokens
120K tokens
-
While the window attention method works efficiently, it exhibits low accuracy due to random outputs when the input length exceeds the cache size. Conversely, StreamingLLM excels by efficiently handling the streaming format, aligning with the one-shot, sample-by-sample baseline accuracy.
This is the paper's advantage over window attention
-
Why do LLMs break when removing initial tokens’ KV?
Key point of the article
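The paper's answer is "attention sinks": softmax dumps excess attention mass onto the first few tokens, so evicting their KV entries breaks the attention distribution. A minimal sketch of the resulting cache policy (the sizes are illustrative; four sink tokens is the paper's typical choice):

```python
import torch

def streaming_kv_evict(keys: torch.Tensor, values: torch.Tensor,
                       n_sink: int = 4, window: int = 1020):
    """Keep the first n_sink tokens (attention sinks) plus the most recent
    `window` tokens; drop everything in between.

    keys/values: (seq_len, ...) per-layer KV tensors."""
    seq_len = keys.shape[0]
    if seq_len <= n_sink + window:
        return keys, values
    idx = torch.cat([torch.arange(n_sink),
                     torch.arange(seq_len - window, seq_len)])
    return keys[idx], values[idx]
```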
-
- Sep 2023
-
www.semanticscholar.org
-
and the speaker embedding [16, 18] is obtained by encoding random short audio clip of the same speaker with another 3 layered 1D convolutions with 128, 256, and 256 hidden dimensions and a kernel size of 7.
Their speaker embedding doesn't seem to be made with the original Speaker Encoder; it uses the same structure as the synthesizer
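A sketch of the described speaker-embedding branch; the input feature dimension and the mean-pool at the end are my assumptions (the excerpt only fixes the three conv layers, their channel widths, and the kernel size):

```python
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    # Three 1D convolutions with 128, 256, 256 channels and kernel size 7,
    # as quoted; in_dim=80 (mel features) is a placeholder assumption.
    def __init__(self, in_dim: int = 80):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(in_dim, 128, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv1d(128, 256, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=7, padding=3), nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_dim, time) -> fixed-size embedding (batch, 256)
        return self.convs(x).mean(dim=-1)
```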
-