
  Publication date: 1396/9/14   Publication: The 21st International Conference on Asian Language Processing, Singapore   Pages: 4
Compiling a Text Re-Use Detection Corpus from Scientific Papers with Semi-Real Cases of Plagiarism

Abstract

Automatic plagiarism detection deals with retrieving
reused fragments of text in a document and finding their
source documents. As various plagiarism detection methods
have been developed, large-scale plagiarism corpora are
needed to evaluate them. Despite their importance,
few plagiarism detection corpora have been developed in recent
years, especially for low-resource languages. For legal and
ethical reasons, a collection of real plagiarism cases cannot be
released for evaluation purposes. Because of these limitations,
simulated and artificial methods are the two main approaches
to compiling a plagiarism corpus. These approaches
try to simulate real cases of plagiarism from different points
of view. However, there are still fundamental differences
between simulated corpora and real cases of plagiarism. In
this paper, a semi-real approach is proposed for creating a
collection of plagiarism cases as a corpus. This approach
is based on removing correct references from scientific
papers so that the cited passages become plagiarized passages. Unlike
methods based on simulated and artificial approaches, the proposed
corpus can correctly simulate real cases of text re-use. The
evaluation results show that the proposed corpus achieves high accuracy
with respect to n-gram similarity for different ranges of n.
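
The abstract does not say which n-gram similarity measure was used in the evaluation. As a minimal sketch only, the following assumes word-level n-grams and a containment score (the fraction of the suspicious passage's n-grams that also occur in the source). The function names word_ngrams and ngram_containment, the whitespace tokenization, and the sample sentences are illustrative assumptions, not taken from the paper.

```python
def word_ngrams(text, n):
    """Return the set of word-level n-grams in text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_containment(suspicious, source, n):
    """Fraction of the suspicious passage's n-grams that also appear in the source.

    Assumed measure for illustration; the paper may use a different one.
    """
    susp = word_ngrams(suspicious, n)
    if not susp:
        return 0.0  # passage shorter than n tokens: no n-grams to compare
    return len(susp & word_ngrams(source, n)) / len(susp)


if __name__ == "__main__":
    # Hypothetical passages, used only to show how the score changes with n.
    source = "the quick brown fox jumps over the lazy dog"
    suspicious = "a quick brown fox jumps over a sleeping cat"
    for n in (1, 2, 3):
        print(n, ngram_containment(suspicious, source, n))
```

Under this assumed measure, the score usually falls as n grows, because longer n-grams match only where longer runs of words were copied verbatim. That is one reason to report similarity over a range of n rather than a single value.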


Authors: Salar Mohtaj, Habibollah Asghari, Vahid Zarrabi