Dataset

Dataset Name

Type

#Samples

# Hours

Task

Metrics

MDRM-test [1]

Short Clips

22,208

87

short financial clip ASR

WER

SPGISpeech-test [2]

Short Clips

39,341

130

short financial clip ASR

WER

Earning-21 [3]

Long Audio

44

39

long financial audio ASR

WER

Earning-22 [4]

Long Audio

125

120

long financial audio ASR

WER

FinAudioSum

Long Audio

64

55

long financial audio Summarization

Rouge-L & BERTScore

We create FinAudioSum based on the ECTSum dataset, originally designed for earnings call summarization using textual data. ECTSum comprises 2,425 earnings transcripts paired with expert-generated, telegram-style summaries. We obtain corresponding audio recordings for the ECTSum test set from earningscast. Overlapping recordings with Earnings-21 and Earnings-22 (spanning 2019–2022) are removed. The final FinAudioSum dataset includes 64 recordings totaling 55 hours.

[1] [What You Say and How You Say It Matters: Predicting Financial Risk Using Verbal and Vocal Cues](https://aclanthology.org/P19-1038.pdf) ACL 2019. [Data](https://github.com/GeminiLn/EarningsCall_Dataset/tree/master)

[2] [SPGISpeech: A Large-Scale Dataset for Financial Speech Recognition](https://arxiv.org/pdf/2104.02014) InterSpeech 2021.

[3] [Earnings-21: A practical benchmark for ASR in the wild](https://arxiv.org/pdf/2104.11348) InterSpeech 2021.

[4] Earnings-22: A Practical Benchmark for Accents in the Wild