Abstract: | Language identification is an automatic recognition task that determines the language of a given text from its content. This problem has been studied extensively for richly resourced languages, such as the European languages, but not for low-resource languages, particularly Ethiopian languages, and it has therefore become an active research topic. To address this gap, we propose a language identifier system and explore three experiments: the first uses unigrams, bigrams, and a mixture of both; the second uses a character analyzer (analyzer='char') with an n-gram range of (1, 3); and the third uses twenty feature sets as columns. All experiments target four language classes: Hadiyya, Wolaytta (Wolaytegna), Somali, and Sidama. In the first experiment, with a unigram (n=1) feature set for all classifiers, the average classification accuracy across all languages was 81% for Naïve Bayes, and 85%, 90%, 79%, and 89% for Logistic Regression, Random Forest, Decision Tree, and Gradient Boosting, respectively. With the 1% mixture of unigrams and bigrams, the Naïve Bayes, Logistic Regression, Random Forest, Decision Tree, and Gradient Boosting classifiers achieved average accuracies of 95.25%, 96.7%, 97.56%, 91%, and 96.6%, respectively; with the 60% mixture, they achieved 91%, 94%, 95.96%, 88.36%, and 94.87%, respectively. In the second experiment, using analyzer='char' with an n-gram range of (1, 3), Logistic Regression had the highest overall average performance of all the classifiers, 98.9%, with per-language accuracies of 99% for Hadiyya, 98% for Sidama, 100% for Somali, and 99% for Wolaytta. In the third experiment, with twenty feature sets employed as columns for each model, the average classification accuracy of Naïve Bayes was 59.71%, whereas the rates for Logistic Regression, Random Forest, Decision Tree, and Gradient Boosting were 70.41%, 78.11%, and 76.69%, respectively. |