Developing a Clustering-Based Semi-Supervised Learning Algorithm for Data Streams
Keywords:
Software Engineering, Data streams, Machine learning algorithm
Abstract
Continuous data streams are now produced in enormous volumes, and processing them to derive insights can be overwhelming. To address this challenge, we propose a clustering-based semi-supervised learning algorithm. The choice is motivated by the fact that stream data is often imperfect, exhibiting partial labelling, missing values, and noise; a clustering-based semi-supervised learner can handle such diverse and inconsistent data because it works with both labelled and unlabelled instances. Background research identified one algorithm as well suited to the project's requirements: the cluster-and-label classifier. In this project, variations of the cluster-and-label method were used to build the proposed algorithm. At the current stage, the proposed model uses an ensemble architecture in which multiple cluster-and-label methods collaborate to classify data, and it combines two ensemble methods, bagging and the Random Subspace Method. When evaluated on accuracy, kappa, and execution time, the current model yields results similar to those of the cluster-and-label method. However, the current model is considerably slower to execute, approximately 160 times slower than the cluster-and-label method.
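As an illustration of the architecture described above, the following is a minimal batch sketch, not the project's implementation. It assumes k-means as the clustering step, majority voting both inside each cluster and across ensemble members, and `None` as the marker for an unlabelled instance; the abstract specifies none of these. The names `ClusterAndLabel` and `ClusterAndLabelEnsemble` and all parameters are hypothetical, and stream-specific concerns such as incremental updates are left out.

```python
import random
from collections import Counter

def nearest(p, centroids):
    """Index of the centroid closest to p (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))

def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means, seeded from k distinct points (assumed clusterer)."""
    distinct = sorted(set(map(tuple, points)))
    centroids = [list(p) for p in rng.sample(distinct, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for i, g in enumerate(groups):
            if g:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(g) for col in zip(*g)]
    return centroids

class ClusterAndLabel:
    """Cluster all instances, then label each cluster by majority vote
    over the labelled instances that fall inside it."""
    def __init__(self, k, rng):
        self.k, self.rng = k, rng

    def fit(self, X, y):
        self.centroids = kmeans(X, self.k, self.rng)
        votes = [Counter() for _ in range(self.k)]
        for x, label in zip(X, y):
            if label is not None:  # None marks an unlabelled instance
                votes[nearest(x, self.centroids)][label] += 1
        seen = Counter(label for label in y if label is not None)
        fallback = seen.most_common(1)[0][0] if seen else None
        # A cluster without labelled members falls back to the overall majority.
        self.labels = [v.most_common(1)[0][0] if v else fallback for v in votes]
        return self

    def predict(self, x):
        return self.labels[nearest(x, self.centroids)]

class ClusterAndLabelEnsemble:
    """Each member sees a bootstrap sample of the rows (bagging) and a
    random subset of the features (Random Subspace Method)."""
    def __init__(self, n_members=7, k=2, n_features=1, seed=0):
        self.n_members, self.k, self.n_features = n_members, k, n_features
        self.rng = random.Random(seed)

    def fit(self, X, y):
        self.members = []
        for _ in range(self.n_members):
            feats = self.rng.sample(range(len(X[0])), self.n_features)
            rows = [self.rng.randrange(len(X)) for _ in X]  # sample with replacement
            Xb = [[X[i][f] for f in feats] for i in rows]
            yb = [y[i] for i in rows]
            self.members.append((feats, ClusterAndLabel(self.k, self.rng).fit(Xb, yb)))
        return self

    def predict(self, x):
        votes = Counter()
        for feats, member in self.members:
            label = member.predict([x[f] for f in feats])
            if label is not None:
                votes[label] += 1
        return votes.most_common(1)[0][0] if votes else None

# Toy data: two well-separated groups, with only every fourth instance labelled.
X = [[i * 0.1, i * 0.1] for i in range(20)] + \
    [[10 + i * 0.1, 10 + i * 0.1] for i in range(20)]
y = [("a" if i < 20 else "b") if i % 4 == 0 else None for i in range(40)]
model = ClusterAndLabelEnsemble().fit(X, y)
print(model.predict([0.5, 0.5]), model.predict([11.0, 11.0]))
```

In this sketch every ensemble member reruns the clustering on its own resampled data, so training cost grows with the number of members. That is one plausible contributor to the slowdown reported above, although the abstract does not attribute it to any specific cause.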