Ethiopian Language Identification from Text Data Using Hybrid Approach

Sitotaw, Saba

St. Mary's University Institutional Repository

Please use this identifier to cite or link to this item: http://hdl.handle.net/123456789/7885

Title:	Ethiopian Language Identification from Text Data Using Hybrid Approach
Authors:	Sitotaw, Saba
Keywords:	Language Identification, Multinomial NB and DT, RF, Gradient Boost.
Issue Date:	Feb-2024
Publisher:	St. Mary's University
Abstract:	Text identification is an automatic recognition task that seeks to determine a word's meaning from its context in a given text in a target language. This problem has been thoroughly studied for richly resourced languages, such as European languages, but not for low-resourced languages, especially Ethiopian languages. To mitigate this, many researchers have proposed language identifier systems, and the topic has become a main research focus. To address the problem, this study proposes a language identifier system and evaluates it in three experiments: the first uses unigrams, bigrams, and a mixture of both; the second uses analyzer='char' with n-gram range = (1, 3); the third uses twenty feature sets as columns. In the first experiment, all classifiers used a unigram (n=1) feature set with four language classes: Hadiyya, Wolaytta (Wolaytegna), Somali, and Sidama. The average classification accuracy across all languages was 81% for Naïve Bayes, and 85%, 90%, 79%, and 89% for Logistic Regression, Random Forest, Decision Tree, and Gradient Boosting, respectively. With a 1% mixture of unigrams and bigrams, the average classification accuracy of the Naïve Bayes, Logistic Regression, Random Forest, Decision Tree, and Gradient Boosting classifiers was 95.25%, 96.7%, 97.56%, 91%, and 96.6%, respectively. With a 60% mixture of unigrams and bigrams, for all classifiers and the same four language classes, Naïve Bayes, Logistic Regression, Random Forest, Decision Tree, and Gradient Boosting showed average classification accuracies of 91%, 94%, 95.96%, 88.36%, and 94.87%, respectively. With analyzer='char' and n-gram range = (1, 3), Logistic Regression achieved an overall average performance of 98.9%, the highest of all the classifiers, with per-language rates of 99%, 98%, 100%, and 99% for Hadiyya, Sidama, Somali, and Wolaytta, respectively.
In the third experiment, twenty sets of features were employed as columns for each model. The average rate of correct classification using Naïve Bayes is 59.71%, whereas the rates given for Logistic Regression, Random Forest, Decision Tree, and Gradient Boosting are 70.41%, 78.11%, and 76.69% (only three values are stated for these four classifiers).
URI:	http://hdl.handle.net/123456789/7885
Appears in Collections:	Master of computer science
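
The abstract names two kinds of features: word unigrams and bigrams, and character n-grams with analyzer='char' and n-gram range = (1, 3). Those parameter names resemble scikit-learn vectorizer settings, but that is an assumption; the record does not say which toolkit was used. The following is a minimal standard-library sketch of what such feature extraction might look like. The function names `char_ngrams` and `word_ngrams` and the sample string are hypothetical, and the sample is a placeholder rather than corpus data.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Count every character n-gram of length n_min..n_max in text."""
    # Lowercase and collapse runs of whitespace to single spaces.
    text = " ".join(text.lower().split())
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def word_ngrams(text, n_min=1, n_max=2):
    """Count word n-grams: unigrams and bigrams with the default range."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Placeholder input, not taken from the thesis corpus.
sample = "abab"
char_features = char_ngrams(sample)    # character 1- to 3-grams
word_features = word_ngrams("x y x")   # mixture of unigrams and bigrams
```

Count dictionaries like these could then be turned into a feature matrix and passed to any of the classifiers listed in the abstract; how the thesis actually built its features is not stated in this record.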

Files in This Item:

File	Description	Size	Format
6.Saba Sitotaw.pdf		1.35 MB	Adobe PDF
