R&D Result Detail

Original Title

Improving Speaker Verification with Self-Pretrained Transformer Models

English Title

Improving Speaker Verification with Self-Pretrained Transformer Models

Type

Paper in proceedings (conference paper)

Original Abstract

Recently, fine-tuning large pre-trained Transformer models on downstream datasets has attracted growing interest. Despite their success, it is still challenging to disentangle the benefits of large-scale datasets and Transformer structures from the limitations of the pre-training. In this paper, we introduce a hierarchical training approach, named self-pretraining, in which Transformer models are pre-trained and fine-tuned on the same dataset. Three pre-trained models, HuBERT, Conformer and WavLM, are evaluated on four speaker verification datasets of varying sizes. Our experiments show that these self-pretrained models achieve competitive performance on downstream speaker verification tasks, such as VoxCeleb1 and CNCeleb1, with only one-third of the data compared to LibriSpeech pre-training. Furthermore, when pre-trained only on VoxCeleb2-dev, the Conformer model outperforms the one pre-trained on 94k hours of data using the same fine-tuning settings.

English abstract

Recently, fine-tuning large pre-trained Transformer models on downstream datasets has attracted growing interest. Despite their success, it is still challenging to disentangle the benefits of large-scale datasets and Transformer structures from the limitations of the pre-training. In this paper, we introduce a hierarchical training approach, named self-pretraining, in which Transformer models are pre-trained and fine-tuned on the same dataset. Three pre-trained models, HuBERT, Conformer and WavLM, are evaluated on four speaker verification datasets of varying sizes. Our experiments show that these self-pretrained models achieve competitive performance on downstream speaker verification tasks, such as VoxCeleb1 and CNCeleb1, with only one-third of the data compared to LibriSpeech pre-training. Furthermore, when pre-trained only on VoxCeleb2-dev, the Conformer model outperforms the one pre-trained on 94k hours of data using the same fine-tuning settings.

Keywords

speaker verification, pre-trained speech transformer model, pre-training

Key words in English

speaker verification, pre-trained speech transformer model, pre-training

Authors

PENG, J.; PLCHOT, O.; STAFYLAKIS, T.; MOŠNER, L.; BURGET, L.; ČERNOCKÝ, J.

RIV year

2024

Released

20.08.2023

Publisher

International Speech Communication Association

Location

Dublin

Book

Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

ISSN

1990-9772

Periodical

Proceedings of Interspeech

Volume

2023

Number

08

State

French Republic

Pages from

5361

Pages to

5365

Pages count

5

URL

https://www.isca-speech.org/archive/pdfs/interspeech_2023/peng23_interspeech.pdf

BibTex

@inproceedings{BUT185575,
  author="Junyi {Peng} and Oldřich {Plchot} and Themos {Stafylakis} and Ladislav {Mošner} and Lukáš {Burget} and Jan {Černocký}",
  title="Improving Speaker Verification with Self-Pretrained Transformer Models",
  booktitle="Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
  year="2023",
  journal="Proceedings of Interspeech",
  volume="2023",
  number="08",
  pages="5361--5365",
  publisher="International Speech Communication Association",
  address="Dublin",
  doi="10.21437/Interspeech.2023-453",
  issn="1990-9772",
  url="https://www.isca-speech.org/archive/pdfs/interspeech_2023/peng23_interspeech.pdf"
}

Documents

peng23_interspeech2023_improving
