Natural Language Processing Research Group

Datasets

These datasets currently include text only. Audio data will be published in a future release. Access to the datasets can be requested through the contact person.

  • Bahasa Alas
    Aceh Tenggara
    250 sentences
  • Bahasa Minangkabau
    Dialek Lima Puluh Kota
    500 sentences
  • Baso Palembang
    500 sentences

Datasets for more languages and dialects are coming soon, including Malay, Minangkabau, Banjarese, Batak, Buginese, Javanese, and Sundanese.

Contributors

  • Bima Alfiansyah and M. Maulana
    Bahasa Alas
  • Bintang Fauzan
    Bahasa Minangkabau, Dialek Lima Puluh Kota
  • Marcello Yasta
    Baso Palembang

License

Each dataset is distributed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). Users are permitted to use, reproduce, and modify the dataset for non-commercial research and educational purposes, provided that appropriate credit is given to the original authors and source. Any derivative works or adaptations based on this dataset must be distributed under the same license (CC BY-NC-SA 4.0). Commercial use of the dataset, in whole or in part, including but not limited to incorporation into proprietary systems or services, is prohibited without prior written permission from the authors.

Citing in papers

Authorship for each dataset reflects its primary contributor(s), with Yusra and Muhammad Fikry included as co-authors across all datasets. For example:

Fauzan, B., Yusra, & Fikry, M. (2026). Bahasa Minangkabau (Dialek Lima Puluh Kota) NLP dataset [Data set]. Bhinneka NLP-RG. https://nlp-rg.yusrafikry.com
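
For anyone assembling reference lists programmatically, the authorship rule above could be sketched in Python roughly as follows. This is only an illustration: the function name format_citation, its parameters, and the default year are assumptions taken from the single example citation, not an official tool or format specification from the research group.

```python
def format_citation(primary_authors, title, year=2026):
    """Build a citation string in the style of the example citation.

    primary_authors: names already written as "Surname, I." (hypothetical input format).
    title: dataset name without the trailing "NLP dataset" words.
    year: assumed from the one example given; check it for each dataset.
    """
    # Yusra and Muhammad Fikry are listed as co-authors on every dataset.
    authors = list(primary_authors) + ["Yusra", "Fikry, M."]
    # Join as "A, B, & C" to match the punctuation of the example.
    author_str = ", ".join(authors[:-1]) + ", & " + authors[-1]
    return (
        f"{author_str} ({year}). {title} NLP dataset [Data set]. "
        "Bhinneka NLP-RG. https://nlp-rg.yusrafikry.com"
    )


# Reproduce the example citation from the text.
citation = format_citation(
    ["Fauzan, B."], "Bahasa Minangkabau (Dialek Lima Puluh Kota)"
)
print(citation)
```

The sketch only appends the two standing co-authors after the primary contributor(s); author name forms and publication years for the other datasets should be confirmed with the contact person.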

Yusra and Muhammad Fikry (2025-2026). Natural Language Processing Research Group (Bhinneka NLP-RG).
Prodi Teknik Informatika, Fakultas Sains dan Teknologi, UIN Suska Riau.