Abstract
Plagiarism in programming assignments has been increasing, which undermines the fair evaluation of students. This paper proposes a machine learning approach to plagiarism detection in programming assignments. Features are computed for each pair of source files from n-gram similarity scores, code style similarity, and dead code. An XGBoost model is then trained on these features to predict whether a pair of source files is plagiarised. Many plagiarism detection techniques ignore dead code, such as unused variables and functions, when making predictions; in this paper, the number of unused variables and functions in the source code is included as a feature. Using these features, the model achieved an accuracy of 94% and an average F1-score of 0.905 on the test set. We also compared the XGBoost model with a support vector machine (SVM) and found that XGBoost performed better on our dataset.
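To make the feature idea concrete, the following is a minimal sketch of how two of the pairwise features might be computed. It is an illustration under stated assumptions, not the implementation used in this paper: the crude regex tokenizer, the Jaccard measure over token trigrams, the "identifier that appears only once" heuristic for dead code, and all function names (`tokenize`, `ngram_similarity`, `unused_identifier_count`, `pair_features`) are hypothetical choices made for the example.

```python
import re
from collections import Counter

# Assumption: a small ignore list for C-like sources; a real system would use a parser.
IGNORED = {"int", "float", "double", "char", "void", "return",
           "if", "else", "for", "while", "main"}

def tokenize(source):
    # Crude lexer: identifiers/keywords, integer literals, then any other single character.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", source)

def ngrams(tokens, n=3):
    # Set of all contiguous token n-grams.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_similarity(src_a, src_b, n=3):
    # Jaccard similarity of the two n-gram sets (an assumed similarity measure).
    grams_a = ngrams(tokenize(src_a), n)
    grams_b = ngrams(tokenize(src_b), n)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)

def unused_identifier_count(source):
    # Heuristic stand-in for dead-code detection: an identifier that occurs
    # exactly once is declared but never referenced again.
    names = [t for t in tokenize(source) if re.fullmatch(r"[A-Za-z_]\w*", t)]
    counts = Counter(names)
    return sum(1 for name, c in counts.items() if c == 1 and name not in IGNORED)

def pair_features(src_a, src_b):
    # One feature row per pair of submissions; such rows could then be
    # passed to a classifier such as XGBoost or an SVM.
    return [
        ngram_similarity(src_a, src_b),
        unused_identifier_count(src_a),
        unused_identifier_count(src_b),
    ]

original = "int main() { int x = 1; return x; }"
padded = "int main() { int x = 1; int unused = 2; return x; }"
features = pair_features(original, padded)
```

Here `padded` differs from `original` only by an inserted unused variable, so the similarity feature stays high while the dead-code count for the second file rises, which is the kind of signal the dead-code features are meant to capture.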
