- Oct 2024
-
www.semanticscholar.org
-
This change reduces the algorithm latency of the video encoder from 120 ms (3 frames) to 40 ms (1 frame)
does anyone know what this means?
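(One plausible reading, assuming 25 fps video so that each frame spans 40 ms: the encoder's lookahead drops from 3 frames to 1, i.e. 3 × 40 ms = 120 ms of algorithmic latency down to 1 × 40 ms = 40 ms.)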
-
- Apr 2024
-
www.semanticscholar.org
-
v
should be a
-
j
j?
is j equal to i?
-
- Mar 2024
-
www.semanticscholar.org
-
first training an audio-only SSL model, which is used to generate unit targets for audio-visual SSL pre-training.
Instead of MFCC? Improvement over AV-HuBERT
-
We improve AV-HuBERT (Shi et al., 2022a) with simplified training protocol and updated model architecture, and scale model size up to 2B parameters for performance in the multilingual setting.
Bigger model
-
-
arxiv.org
-
The decoder optimises the first loss term only, the encoder optimises the first and the last loss terms, and the embeddings are optimised by the middle loss term
Good summary
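This gradient routing matches the standard VQ-VAE objective (reconstruction + codebook + commitment). A minimal sketch, assuming an MSE reconstruction term and the usual stop-gradient placement (the function name and β are mine):

```python
import torch.nn.functional as F

def vq_vae_loss(x, x_hat, z_e, e, beta=0.25):
    # First term: reconstruction -- trains the decoder directly, and the
    # encoder via the straight-through estimator (z_q = z_e + (e - z_e).detach()).
    recon = F.mse_loss(x_hat, x)
    # Middle term: codebook loss -- z_e is detached, so only the embeddings
    # e receive gradient.
    codebook = F.mse_loss(e, z_e.detach())
    # Last term: commitment loss -- e is detached, so only the encoder
    # receives gradient.
    commit = beta * F.mse_loss(z_e, e.detach())
    return recon + codebook + commit
```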
-
- Feb 2024
-
www.semanticscholar.org
-
C
Not used in code for AR. Just for reference
-
-
www.semanticscholar.org
-
SpeechTokenizer training
-
-
www.semanticscholar.org
-
BigVGAN-base (without filter & snake) refers to the vanilla HiFi-GAN architecture trained with MRD replacing MSD.
HiFi-GAN comparison:
* +MRD
* -MSD
* +filter & snake
* +bigger model size
* +larger training dataset
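For reference, the "snake" item is BigVGAN's periodic activation, snake(x) = x + (1/α)·sin²(αx); BigVGAN learns α per channel, so the scalar default below is a simplification:

```python
import torch

def snake(x, alpha=1.0):
    # Snake activation: x + (1/alpha) * sin^2(alpha * x)
    return x + torch.sin(alpha * x).pow(2) / alpha
```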
-
- Jan 2024
-
www.semanticscholar.org
-
α = 1 and β = 0
α and β are adjusted only for the student, regardless of the teacher
-
(A-data2vec)
This is vanilla data2vec
-
- Dec 2023
-
www.semanticscholar.org
-
validating that L_mse improves the convergence
could also be a poison (hurt rather than help)
-
- Nov 2023
-
www.semanticscholar.org
-
It has been found in the literature [32, 31] that waveform reconstruction from VQ acoustic feature needs additional prosody feature
additional information: a prosody feature is needed to reconstruct the waveform
-
In particular, txt2vec basically becomes a classification model rather than a traditional regression model
nice design
-
- Oct 2023
-
www.semanticscholar.org
-
In practice, we use an equally weighted combination of two spectrogram losses for 16kHz audio: one with large FFT window size of 2048 and hop size of 512, and one with small FFT window size of 512 and hop size of 128. The larger one gives more frequency resolution, while the smaller one gives more temporal resolution.
MultiResolutionSTFTLoss
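A minimal sketch of that two-resolution loss, assuming L1 on STFT magnitudes and a Hann window (neither is specified in the excerpt):

```python
import torch
import torch.nn.functional as F

def stft_magnitude(x, n_fft, hop):
    window = torch.hann_window(n_fft, device=x.device)
    return torch.stft(x, n_fft=n_fft, hop_length=hop, window=window,
                      return_complex=True).abs()

def two_resolution_stft_loss(pred, target):
    # Equally weighted sum over the two resolutions quoted above:
    # (2048, 512) favours frequency resolution, (512, 128) temporal.
    loss = 0.0
    for n_fft, hop in [(2048, 512), (512, 128)]:
        loss = loss + F.l1_loss(stft_magnitude(pred, n_fft, hop),
                                stft_magnitude(target, n_fft, hop))
    return 0.5 * loss
```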
-
First, we train the feed-forward WaveNet for 500K steps with learning rate 0.001, using L1 loss and spectrogram loss.
stage 1
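A stage-1 loop in the same spirit, reusing two_resolution_stft_loss from the sketch above; the Conv1d model is a trivial stand-in for the paper's feed-forward WaveNet, the random batch is a placeholder, and Adam is my assumption:

```python
import torch
import torch.nn.functional as F

model = torch.nn.Conv1d(1, 1, kernel_size=9, padding=4)  # stand-in, not a WaveNet
opt = torch.optim.Adam(model.parameters(), lr=0.001)

for step in range(500_000):
    wav = torch.randn(4, 1, 16000)   # placeholder batch of 16 kHz audio
    pred = model(wav)
    loss = (F.l1_loss(pred, wav)
            + two_resolution_stft_loss(pred.squeeze(1), wav.squeeze(1)))
    opt.zero_grad()
    loss.backward()
    opt.step()
```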
-
-
www.semanticscholar.org
-
For this purpose, we use the L1 loss between the STFT magnitudes of the real and synthesized audio as follows
Just magnitude, no phase
-
We did not attempt batch normalization, which worked well for the generator, since this interferes with the gradient penalty for our adversarial loss [16].
No Batch Norm For Critics
-
To discriminate the real from the synthesized waveforms, we adapt the critic from [25].
Critic is from MelGAN
-
In an attempt to alleviate the issue of abrupt frame transitions, we use an overlap of 50% between the generated waveform frames
50% overlap, so the decoder structure shown in Fig. 3 works out to deconvolution upsampling by 1280× rather than 640×; this is because half of every frame is swallowed by the averaging
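A sketch of how 50%-overlapped frames are merged by averaging, which is why a nominal 1280-sample frame only advances the output by 640 samples (the helper name is mine):

```python
import torch

def overlap_add_average(frames: torch.Tensor, hop: int) -> torch.Tensor:
    """Average 50%-overlapped waveform frames.

    frames: (n_frames, frame_len) with frame_len == 2 * hop; consecutive
    frames overlap by `hop` samples, so each frame advances the output by
    only half its length (e.g. 1280-sample frames -> 640-sample hop).
    """
    n_frames, frame_len = frames.shape
    assert frame_len == 2 * hop
    out = torch.zeros(n_frames * hop + hop)
    norm = torch.zeros_like(out)
    for i in range(n_frames):
        out[i * hop : i * hop + frame_len] += frames[i]
        norm[i * hop : i * hop + frame_len] += 1.0
    return out / norm.clamp(min=1.0)  # average where frames overlap
```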
-
-
www.semanticscholar.org
-
we use the speech representation Z without pitch supervision to predict pitch
Z trained without pitch supervision hasn't learned pitch information
-
WaveGAN including VAE and GAN,
WaveGAN=VAE(Encoder-Decoder)+GAN
-
λ3·L_recons + λ4·L_adv,g + λ5·L_fm
STFT loss + GAN loss + feature-matching loss
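A sketch of that combination for the generator update; the least-squares adversarial form and the unit λ defaults are my assumptions (the excerpt only names the three terms):

```python
import torch
import torch.nn.functional as F

def generator_loss(l_recons, disc_fake_scores, disc_real_feats, disc_fake_feats,
                   lam3=1.0, lam4=1.0, lam5=1.0):
    # l_recons: reconstruction term (e.g. an STFT loss on magnitudes).
    # L_adv,g: least-squares GAN loss on the discriminator's fake scores.
    l_adv = sum((1.0 - s).pow(2).mean() for s in disc_fake_scores)
    # L_fm: L1 feature matching between real/fake discriminator activations.
    l_fm = sum(F.l1_loss(f, r.detach())
               for r, f in zip(disc_real_feats, disc_fake_feats))
    return lam3 * l_recons + lam4 * l_adv + lam5 * l_fm
```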
-
we do not directly use any predesigned acoustic features such as Mel-spectrum. This is because the estimated approximations of the Mel distribution from the acoustic model are still different from those inputs to a neural vocoder, which may cause artifacts, i.e., metallic effects in synthesized audios.
no mel-spectrogram!
-
-
www.semanticscholar.org
-
(i) content encoder; (ii) F0 encoder; (iii) speaker identity encoder; and a decoder network.
structure
-
The study by [23] is the closest to our work. In which, the authors suggest training a VQ-VAE with two encoders, one for the waveform and the other for F0
VQ-VAE with dual encoders
-
-
arxiv.org
-
LLMs employing StreamingLLM maintain reasonable accuracy even as input lengths approach 120K tokens
120K tokens
-
While the window attention method works efficiently, it exhibits low accuracy due to random outputs when the input length exceeds the cache size. Conversely, StreamingLLM excels by efficiently handling the streaming format, aligning with the one-shot, sample-by-sample baseline accuracy.
This is the paper's advantage over window attention
-
Why do LLMs break when removing initial tokens’ KV?
Key point of the article
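The paper's answer is "attention sinks": softmax dumps excess attention mass onto the first few tokens, so evicting their KV entries breaks the attention distribution. A minimal sketch of the resulting cache policy (the sizes are illustrative; four sink tokens is the paper's typical choice):

```python
import torch

def streaming_kv_evict(keys: torch.Tensor, values: torch.Tensor,
                       n_sink: int = 4, window: int = 1020):
    """Keep the first n_sink tokens (attention sinks) plus the most recent
    `window` tokens; drop everything in between.

    keys/values: (seq_len, ...) per-layer KV tensors."""
    seq_len = keys.shape[0]
    if seq_len <= n_sink + window:
        return keys, values
    idx = torch.cat([torch.arange(n_sink),
                     torch.arange(seq_len - window, seq_len)])
    return keys[idx], values[idx]
```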
-
- Sep 2023
-
www.semanticscholar.org
-
and the speaker embedding [16, 18] is obtained by encoding random short audio clip of the same speaker with another 3 layered 1D convolutions with 128, 256, and 256 hidden dimensions and a kernel size of 7.
Their speaker embedding doesn't seem to be made with the original Speaker Encoder; it uses the same structure as the synthesizer
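A sketch of the described speaker-embedding branch; the input feature dimension and the mean-pool at the end are my assumptions (the excerpt only fixes the three conv layers, their channel widths, and the kernel size):

```python
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    # Three 1D convolutions with 128, 256, 256 channels and kernel size 7,
    # as quoted; in_dim=80 (mel features) is a placeholder assumption.
    def __init__(self, in_dim: int = 80):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(in_dim, 128, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv1d(128, 256, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=7, padding=3), nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_dim, time) -> fixed-size embedding (batch, 256)
        return self.convs(x).mean(dim=-1)
```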
-