Development of the Annotated Swahili Digraph Corpus Using a CNN-Based Digraph Extraction Model

Tirus Muya Maina, Aaron Mogeni Oirere, Stephen Kahara

Section: Research Paper, Product Type: Journal-Paper
Vol. 12, Issue 6, pp. 58-65, Dec 2024

Online published on Dec 31, 2024

Copyright © Tirus Muya Maina, Aaron Mogeni Oirere, Stephen Kahara. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to Cite this Paper

IEEE Style Citation: Tirus Muya Maina, Aaron Mogeni Oirere, Stephen Kahara, “Development of the Annotated Swahili Digraph Corpus Using a CNN-Based Digraph Extraction Model,” International Journal of Scientific Research in Computer Science and Engineering, Vol.12, Issue.6, pp.58-65, 2024.

MLA Style Citation: Tirus Muya Maina, Aaron Mogeni Oirere, Stephen Kahara. "Development of the Annotated Swahili Digraph Corpus Using a CNN-Based Digraph Extraction Model." International Journal of Scientific Research in Computer Science and Engineering 12.6 (2024): 58-65.

APA Style Citation: Tirus Muya Maina, Aaron Mogeni Oirere, Stephen Kahara (2024). Development of the Annotated Swahili Digraph Corpus Using a CNN-Based Digraph Extraction Model. International Journal of Scientific Research in Computer Science and Engineering, 12(6), 58-65.

BibTeX Style Citation:
@article{Maina_2024,
author = {Tirus Muya Maina and Aaron Mogeni Oirere and Stephen Kahara},
title = {Development of the Annotated Swahili Digraph Corpus Using a CNN-Based Digraph Extraction Model},
journal = {International Journal of Scientific Research in Computer Science and Engineering},
issue_date = {12 2024},
volume = {12},
number = {6},
month = {12},
year = {2024},
issn = {2347-2693},
pages = {58--65},
url = {https://www.isroset.org/journal/IJSRCSE/full_paper_view.php?paper_id=3772},
publisher = {IJCSE, Indore, INDIA}
}

RIS Style Citation:
TY  - JOUR
UR  - https://www.isroset.org/journal/IJSRCSE/full_paper_view.php?paper_id=3772
TI  - Development of the Annotated Swahili Digraph Corpus Using a CNN-Based Digraph Extraction Model
T2  - International Journal of Scientific Research in Computer Science and Engineering
AU  - Tirus Muya Maina
AU  - Aaron Mogeni Oirere
AU  - Stephen Kahara
PY  - 2024
DA  - 2024/12/31
PB  - IJCSE, Indore, INDIA
SP  - 58
EP  - 65
IS  - 6
VL  - 12
SN  - 2347-2693
ER  - 

Abstract:
This study develops the Annotated Swahili Digraph Corpus using a convolutional neural network (CNN) based model designed specifically for digraph extraction. The work addresses the lack of dedicated digraph corpora for Swahili, a resource increasingly needed for Natural Language Processing (NLP) applications. The CNN-based model was designed to optimize the extraction and classification of digraphs, drawing on the annotated features within the corpus. Digraphs are pairs of letters that together represent a single distinct sound, and Swahili's linguistic structure presents its own challenges and requirements in this regard. Specialized tools and models are therefore essential for accurate transcription and efficient speech recognition tailored to Swahili. The resulting Swahili Digraph Corpus comprises 31,197 words, each systematically annotated to highlight its digraphs. The corpus covers the nine key Swahili digraphs: "ch," "dh," "gh," "kh," "ng’," "ny," "sh," "th," and "ng." It also includes annotations for vowel distribution across the five core vowels "a," "e," "i," "o," and "u." The annotated corpus supports a range of NLP applications, giving researchers and developers accurate linguistic data for tasks such as text processing, machine translation, and speech synthesis. The aim is to expand the resources available for processing Swahili and, ultimately, to make the language more accessible in the digital landscape.
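
To make the annotation target concrete, the following is a minimal rule-based sketch of what a per-word record with digraphs and vowel counts might look like. It is not the CNN-based extraction model described in the abstract: it uses plain string matching, and the `annotate` function, the record fields, and the example words are all illustrative assumptions rather than the corpus's actual schema.

```python
import re
from collections import Counter

# The nine digraphs listed in the abstract. "ng'" is placed first so the
# longer form wins over "ng" when an apostrophe follows.
DIGRAPHS = ["ng'", "ch", "dh", "gh", "kh", "ny", "sh", "th", "ng"]
VOWELS = "aeiou"

# Alternation is tried left to right, so list order sets match priority.
_PATTERN = re.compile("|".join(re.escape(d) for d in DIGRAPHS))


def annotate(word):
    """Return a hypothetical annotation record for a single word."""
    # Lowercase and map the typographic apostrophe to the ASCII one.
    normalized = word.lower().replace("\u2019", "'")
    digraphs = _PATTERN.findall(normalized)
    vowels = Counter(ch for ch in normalized if ch in VOWELS)
    return {"word": normalized, "digraphs": digraphs, "vowels": dict(vowels)}


if __name__ == "__main__":
    for w in ["chakula", "ng\u2019ombe", "shangazi"]:
        print(annotate(w))
```

A pure string match like this cannot tell whether a letter pair is actually functioning as a digraph in a given word, which is presumably part of the motivation for a learned model.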

Keywords / Index Terms:
Annotated, Swahili, Digraph, Corpus, NLP, CNN, Dense layer
