Abstract: | The rapidly expanding use of information technology has led to a sharp rise in hacking and other unauthorized activity. The variety and volume of attacks are growing dramatically as hardware and software advance, and the rapid increase in Internet users makes classifying network traffic increasingly important. Every day, individuals and groups develop new threats aimed at breaching computer networks and stealing data and personally identifiable information. Many organizations deploy a layered defense to thwart these attacks, including robust firewalls, authentication systems, encryption, antivirus software, and up-to-date hardware. Intrusion detection is a further method for reducing network breaches. Numerous intrusion detection systems have been created to monitor networks and systems and identify unusual behavior, but most of them suffer from a low detection rate, long training time, and a comparatively high false alarm rate. To address these issues, we propose an approach that combines big data, anomaly detection, and machine learning to produce better results faster. The proposed system has three major components: training, validation, and testing. In the training component, the gathered training data is preprocessed and passed to the classification model. We employ and compare four classification models: Random Forest, Neural Network, Logistic Regression, and Decision Tree.
In the validation component, hyperparameter tuning for each machine learning algorithm uses a grid search combined with 5-fold cross-validation to find the best value for each hyperparameter and raise the models' detection rate. The final model is then built by training each classification model with its optimal parameters. Lastly, the trained model classifies the test data as normal or attack. Each classification model is implemented with the Apache Spark big data framework. The experimental study uses the NSL-KDD dataset, which contains both normal and attack traffic. The dataset was divided into three subsets: training (80%), validation (10%), and testing (10%). The results show that nearly every algorithm achieves high prediction performance.
The Neural Network achieved the best results of all the algorithms, with 96.9% accuracy, 96.8% precision, 96.7% recall, and a 96.7% F1-score. |
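
The tuning and evaluation flow summarized in the abstract can be illustrated with a minimal sketch. It is not the Spark implementation used in the study: it uses plain Python, synthetic one-dimensional data, and a toy k-nearest-neighbour classifier standing in for the four models. All function names (`split_80_10_10`, `k_folds`, `grid_search`, and so on) are hypothetical, and running cross-validation on the training split while keeping the validation split as a separate check is an assumption about how the components fit together.

```python
import random


def make_toy_rows(n=200, seed=1):
    """Synthetic (feature, label) rows; label 1 = attack, 0 = normal."""
    rng = random.Random(seed)
    return [(rng.gauss(3.0 if i % 2 else 0.0, 1.0), i % 2) for i in range(n)]


def split_80_10_10(rows, seed=0):
    """Shuffle, then split into train/validation/test at 80/10/10."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = len(rows) * 8 // 10
    n_val = len(rows) // 10
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]


def k_folds(rows, k=5):
    """Yield (train, held_out) pairs; every row is held out exactly once."""
    for i in range(k):
        held_out = rows[i::k]
        train = [r for j, r in enumerate(rows) if j % k != i]
        yield train, held_out


def knn_predict(train, x, k):
    """Toy stand-in model: majority vote among the k nearest training rows."""
    nearest = sorted(train, key=lambda r: abs(r[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 with attack (1) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


def evaluate(train, rows, k):
    """Fit on `train` (k-NN just stores it) and score predictions on `rows`."""
    preds = [knn_predict(train, x, k) for x, _ in rows]
    return binary_metrics([y for _, y in rows], preds)


def grid_search(rows, grid, folds=5):
    """Return the hyperparameters with the best mean cross-validated accuracy."""
    best_params, best_score = None, -1.0
    for k in grid["k"]:
        scores = [evaluate(tr, held, k)["accuracy"] for tr, held in k_folds(rows, folds)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = {"k": k}, mean_score
    return best_params, best_score


rows = make_toy_rows()
train, val, test = split_80_10_10(rows)
best_params, cv_accuracy = grid_search(train, {"k": [1, 3, 5]})
val_metrics = evaluate(train, val, best_params["k"])    # check the chosen setting
test_metrics = evaluate(train, test, best_params["k"])  # final held-out report
print(best_params, test_metrics)
```

In a Spark-based pipeline the same roles would typically be filled by the framework's own cross-validation and evaluation utilities; the sketch only makes the order of operations explicit: split, tune by cross-validated grid search, refit with the selected parameters, then score the held-out test data.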