CodEX: Source Code Plagiarism Detection Based on Abstract Syntax Trees

Mengya Zheng, Xingyu Pan and David Lillis

In Proceedings of the 29th Irish Conference on Artificial Intelligence and Cognitive Science (AICS 2018), Dublin, Ireland, 2018.

Abstract

CodEX is a source code search engine that allows users to search a repository of source code snippets using source code snippets as queries. A potential use for such a search engine is to help educators identify cases of plagiarism in students' programming assignments. This paper evaluates CodEX in this context. Abstract Syntax Trees (ASTs) are used to represent source code files at an abstract level. This representation, combined with node hashing and similarity calculations, allows users to search for source code snippets that match suspected plagiarism cases. A number of techniques commonly employed to avoid plagiarism detection are identified, and the CodEX system is evaluated for its ability to detect plagiarism cases even when these techniques are used. Evaluation results are promising, with 95% of test cases being identified successfully.
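
The abstract does not specify how CodEX hashes nodes or computes similarity, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes Python source as input, uses Python's standard `ast` module, and makes its own illustrative choices: each subtree is hashed from its node types only (identifier names and literal values are ignored), and two snippets are compared with a multiset Jaccard score. The function names `subtree_hashes` and `similarity` are hypothetical.

```python
import ast
import hashlib
from collections import Counter


def subtree_hashes(source: str) -> Counter:
    """Return a multiset of structural hashes, one per AST subtree."""
    hashes = Counter()

    def visit(node: ast.AST) -> str:
        # Hash the node type plus the hashes of its children, in order.
        # Identifier names and literal values are deliberately left out
        # (an assumption of this sketch), so renaming a variable does
        # not change the hash.
        child_hashes = [visit(child) for child in ast.iter_child_nodes(node)]
        payload = type(node).__name__ + "(" + ",".join(child_hashes) + ")"
        digest = hashlib.sha1(payload.encode("utf-8")).hexdigest()
        hashes[digest] += 1
        return digest

    visit(ast.parse(source))
    return hashes


def similarity(source_a: str, source_b: str) -> float:
    """Multiset Jaccard similarity between the subtree hashes of two snippets."""
    a, b = subtree_hashes(source_a), subtree_hashes(source_b)
    intersection = sum((a & b).values())  # per-hash minimum counts
    union = sum((a | b).values())         # per-hash maximum counts
    return intersection / union if union else 1.0


original = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
renamed = "def sum_all(values):\n    acc = 0\n    for v in values:\n        acc += v\n    return acc\n"
different = "def total(xs):\n    return max(xs)\n"

print(similarity(original, renamed))
print(similarity(original, different))
```

Because the hash ignores names, the renamed copy scores 1.0 against the original, while the structurally different function scores lower. A real detector would also need to handle disguises that this sketch does not cover, such as reordered statements or inserted dead code.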