Common Voice Scripted Speech 25.0 - Swahili

Mozilla Foundation

Date

2026-03-23

Authors

Mozilla Foundation

Publisher

Mozilla Foundation

Abstract

A collection of read speech recordings in Swahili (Kiswahili). This datasheet describes release cv-corpus-25.0-2026-03-09 of the Mozilla Common Voice Scripted Speech dataset for Swahili (Kiswahili, code sw). The dataset contains 730,187 clips representing 1,064.02 hours of recorded speech (392.11 hours validated) from 1,518 speakers, recorded from a text corpus of 140,486 sentences. This dataset is released under the Creative Commons Zero (CC-0) licence. By downloading this data you agree not to determine the identity of speakers in the dataset.
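The headline figures in the abstract imply a couple of derived quantities that can be useful when planning training or evaluation. The following is a minimal sketch that uses only the numbers quoted above; the function name `summarize` and the dictionary keys are illustrative choices and are not part of the dataset or any Common Voice tooling.

```python
# Figures copied from the abstract of this datasheet.
TOTAL_CLIPS = 730_187
TOTAL_HOURS = 1064.02
VALIDATED_HOURS = 392.11


def summarize(total_clips: int, total_hours: float, validated_hours: float) -> dict:
    """Derive simple ratios from the headline figures (hypothetical helper)."""
    return {
        # Fraction of recorded hours that have passed validation.
        "validated_share": validated_hours / total_hours,
        # Mean clip length in seconds across all recorded clips.
        "mean_clip_seconds": total_hours * 3600 / total_clips,
    }


stats = summarize(TOTAL_CLIPS, TOTAL_HOURS, VALIDATED_HOURS)
print(f"validated share of hours: {stats['validated_share']:.1%}")
print(f"mean clip length: {stats['mean_clip_seconds']:.2f} s")
```

On these figures, a little over a third of the recorded hours are validated, and the mean clip runs slightly longer than five seconds. Both values are averages over the whole release and say nothing about how clips are distributed across speakers.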

Description

<table border="1" cellpadding="8" cellspacing="0" style="border-collapse: collapse; width: 100%; font-family: Arial, sans-serif; font-size: 0.9em;"> <thead> <tr style="background-color: #f2f2f2; text-align: left;"> <th>Code</th> <th>Variant</th> <th>Clips</th> <th>Speakers</th> </tr> </thead> <tbody> <tr> <td>sw-baratz</td> <td>Kiswahili cha Bara ya Tanzania</td> <td>22,486 (3.1%)</td> <td>24 (1.6%)</td> </tr> <tr style="background-color: #f9f9f9;"> <td>sw-sanifu</td> <td>Kiswahili Sanifu (EA)</td> <td>21,833 (3.0%)</td> <td>69 (4.5%)</td> </tr> <tr> <td>sw-barake</td> <td>Kiswahili cha Bara ya Kenya</td> <td>6,487 (0.9%)</td> <td>59 (3.9%)</td> </tr> <tr style="background-color: #f9f9f9;"> <td>sw-kingwana</td> <td>Kingwana (DRC)</td> <td>1,756 (0.2%)</td> <td>34 (2.2%)</td> </tr> <tr> <td>sw-kimvita</td> <td>Kimvita (KE) — Central dialect</td> <td>15 (0.0%)</td> <td>2 (0.1%)</td> </tr> <tr style="background-color: #f9f9f9;"> <td>sw-katanga</td> <td>Katanga (DRC)</td> <td>5 (0.0%)</td> <td>1 (0.1%)</td> </tr> </tbody> </table>
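The percentages in the variant table can be reproduced from the totals in the abstract, which also shows how much of the dataset carries a variant label at all. This is a sketch under the assumption that each percentage is taken relative to the full dataset (730,187 clips, 1,518 speakers); the names `VARIANTS` and `pct` are illustrative only.

```python
# Totals from the abstract; per-variant counts from the table above.
TOTAL_CLIPS = 730_187
TOTAL_SPEAKERS = 1_518

# code -> (clips, speakers)
VARIANTS = {
    "sw-baratz": (22_486, 24),
    "sw-sanifu": (21_833, 69),
    "sw-barake": (6_487, 59),
    "sw-kingwana": (1_756, 34),
    "sw-kimvita": (15, 2),
    "sw-katanga": (5, 1),
}


def pct(part: int, whole: int) -> float:
    """Percentage of `whole`, rounded to one decimal place as in the table."""
    return round(100 * part / whole, 1)


# Recompute the (clips %, speakers %) pair for every row.
rows = {
    code: (pct(clips, TOTAL_CLIPS), pct(speakers, TOTAL_SPEAKERS))
    for code, (clips, speakers) in VARIANTS.items()
}

# Share of all clips that carry any variant label.
labelled_clips = sum(clips for clips, _ in VARIANTS.values())
labelled_share = pct(labelled_clips, TOTAL_CLIPS)

for code, (clip_pct, speaker_pct) in rows.items():
    print(f"{code:12s} clips {clip_pct:4.1f}%  speakers {speaker_pct:4.1f}%")
print(f"clips with a variant label: {labelled_clips} ({labelled_share}%)")
```

Under that assumption the recomputed values match the table, and the labelled rows together cover only a small fraction of the clips, so variant-specific subsets are much smaller than the headline clip count suggests.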