Given a set of suspicious and source documents written in Persian, the task is to find all plagiarized sections in the suspicious documents and, if available, the corresponding source sections. This External Plagiarism Detection Task provides a standard setting for evaluating Persian plagiarism detection (PD) systems.
The general principles of the PD system consist of:
<document reference="suspicious-documentXYZ.txt">
  <feature name="detected-plagiarism" this_offset="5" this_length="1000" source_reference="source-documentABC.txt" source_offset="100" source_length="1000" />
  <feature ... />
  ...
</document>
Use Cases:
This sub-task covers the construction of text alignment plagiarism detection corpora. A corpus may be monolingual (Persian) or bilingual (Persian combined with any other language). Participants compile Persian PD corpora, and the submitted corpora are evaluated and ranked by quality. The PD systems proposed for the detection task will also be run on the corpora submitted to this sub-task, which serves to validate the Persian plagiarism detection corpora. The intellectual property of each submitted corpus belongs to its owner and will not be transferred to Persian Plagdet2016.
The submitted corpora should follow the standard PAN text alignment annotation structure. Each corpus shall contain a source directory, a suspicious directory and an XML directory, which hold the source documents, the suspicious documents and the annotated XML documents, respectively. In addition, a text file named pairs should list all pairs of suspicious and source documents to be compared. You can find a sample corpus here
To be available on July 15, 2016
The training corpus will be available to competitors for developing their methods and tuning the related parameters. The corpus consists of suspicious files, source files and XML files. In addition, a text file lists the pairs of suspicious and source documents to be checked for plagiarism. For each pair of suspicious and source documents, the associated XML file gives the exact offsets of the plagiarized fragments shared by the two documents. A sample XML file on this page shows the structure and content.
To be available on August 15, 2016
The test corpus will be available for evaluating the competitors. Its structure is similar to that of the training corpus, except that it contains no XML files. Participants should generate XML files in the same format as those in the training corpus, one for each pair of suspicious and source documents.
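As an illustration of how such an annotation file could be read, the following minimal sketch parses one XML document with Python's standard library. The attribute names are taken from the sample XML on this page; the function name `read_annotations` and the embedded `SAMPLE` string are hypothetical, and the sketch does not filter on the `name` attribute because its value in the released corpus is not stated here.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, modelled on the XML structure shown on this page.
SAMPLE = """<document reference="suspicious-documentXYZ.txt">
  <feature name="detected-plagiarism" this_offset="5" this_length="1000"
           source_reference="source-documentABC.txt"
           source_offset="100" source_length="1000" />
</document>"""


def read_annotations(xml_text):
    """Return one dict per <feature> element in an annotation document."""
    root = ET.fromstring(xml_text)
    cases = []
    for feature in root.iter("feature"):
        cases.append({
            "suspicious": root.get("reference"),
            "source": feature.get("source_reference"),
            "name": feature.get("name"),
            # Offsets and lengths are character counts, stored as strings.
            "this_offset": int(feature.get("this_offset")),
            "this_length": int(feature.get("this_length")),
            "source_offset": int(feature.get("source_offset")),
            "source_length": int(feature.get("source_length")),
        })
    return cases


if __name__ == "__main__":
    for case in read_annotations(SAMPLE):
        print(case)
```

Each returned dict describes one aligned passage: a span in the suspicious document and the matching span in the source document.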
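A detector's output for one document pair could be serialized along the following lines. This is only a sketch: the element and attribute names follow the sample XML on this page, while the function name `build_detection_xml` and the dict keys used for each detection are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def build_detection_xml(suspicious_name, source_name, detections):
    """Serialize detected passages for one suspicious/source pair.

    `detections` is a list of dicts with the (hypothetical) keys
    this_offset, this_length, source_offset and source_length,
    all given in characters.
    """
    root = ET.Element("document", {"reference": suspicious_name})
    for d in detections:
        ET.SubElement(root, "feature", {
            "name": "detected-plagiarism",
            "this_offset": str(d["this_offset"]),
            "this_length": str(d["this_length"]),
            "source_reference": source_name,
            "source_offset": str(d["source_offset"]),
            "source_length": str(d["source_length"]),
        })
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    xml_text = build_detection_xml(
        "suspicious-documentXYZ.txt",
        "source-documentABC.txt",
        [{"this_offset": 5, "this_length": 1000,
          "source_offset": 100, "source_length": 1000}],
    )
    print(xml_text)
```

Building the tree with `ElementTree` rather than string concatenation keeps the output well-formed even if a file name contains characters that need escaping.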
Training Data Release |
Test Data Release |
Run Submission Deadline |
Results Declared |
Working Notes Due | 15th October 2016
Conference | 8-10 Dec 2016
Performance will be measured using macro-averaged precision and recall, granularity, and the plagdet score.
Precision and recall will be computed at the character level. In addition, the granularity measure quantifies whether the contiguity of plagiarized text passages is properly recognized, that is, whether a plagiarized passage is reported as one detection rather than as several fragments. The plagdet score is a combination of precision, recall and granularity.
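In the PAN evaluation framework, plagdet is the harmonic mean (F1) of precision and recall divided by log2(1 + granularity), so a granularity of 1 leaves F1 unchanged and larger values lower the score. The sketch below only combines the three measures once they are known; the function names are illustrative, and the official evaluation script remains the reference for how precision, recall and granularity themselves are computed.

```python
import math


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def plagdet(precision, recall, granularity):
    """Combine precision, recall and granularity into the plagdet score.

    Granularity is at least 1; a value of 1 means every plagiarized
    passage was reported as a single contiguous detection.
    """
    if granularity < 1:
        raise ValueError("granularity must be >= 1")
    return f1_score(precision, recall) / math.log2(1 + granularity)


if __name__ == "__main__":
    # Precision, recall and granularity of one run from the results on this page.
    print(round(plagdet(0.95927, 0.85820, 1.0), 3))
```

With a precision of 0.95927, a recall of 0.85820 and a granularity of 1, this gives roughly 0.906, in line with the Plagdet value reported for that run.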
For more information read the related article:
The following Python code will be used to compute the measures mentioned above. The code is provided by Martin Potthast at PAN@CLEF:
Performance will be measured by assessing the validity of submitted corpora in different ways:
Peer-review: Your corpus will be made available to the other participants of this task and be subject to peer-review.
Detection: The submitted corpora will be fed into the text alignment prototypes from task 1. The performance of each text alignment prototype in detecting the plagiarism in your corpus will be measured.
The winner is the NLP Research Lab of Shahid Beheshti University.
Rank | Team | Plagdet | Granularity | Precision | Recall
1 | Fatemeh Mashhadi, Mehrnoush Shamsfard | 0.92204 | 1.00146 | 0.92688 | 0.91919
2 | Hadi Veisi, Kayvan Bijari, Kiarash Zahirnia, Erfaneh Gharavi | 0.90593 | 1 | 0.95927 | 0.85820
3 | Mozhgan Momtaz, Kayvan Bijari, Davood Heidarpour | 0.87103 | 1 | 0.89258 | 0.85049
4 | Behrouz Minaei, Mahdi Niknam | 0.83015 | 1.03968 | 0.92034 | 0.79602
5 | Faezeh Esteki, Faramarz Safi Esfahani | 0.80083 | 1.0 | 0.93337 | 0.70124
6 | Alireza Talebpour, Mohammad Shirzadi, Zahra Aminolroaya, Mohammad Adibi, Ahmad Mahmoudi-Aznaveh | 0.77496 | 1.22759 | 0.96383 | 0.83615
7 | Nava Ehsan, Azadeh Shakeri | 0.72662 | 1 | 0.74962 | 0.70499
8 | Lee Gillam, Anna Vartapetiance | 0.39968 | 1.52803 | 0.75484 | 0.41407
9 | Muharram Mansoorizadeh, Taher Rahpooy | 0.38994 | 3.53698 | 0.90002 | 0.80659
Steering Committee
Vahid Zarrabi, ICT Research Institute, ACECR, Iran
Mehrnoosh Shamsfard, Shahid Beheshti University, Iran
Omid Fatemi, University of Tehran, Iran
Hesham Faili, University of Tehran, Iran
Salar Mohtaj, ICT Research Institute, ACECR, Iran
Behrouz Minaei, University of Science & Technology, Iran
Habibollah Asghari, ICT Research Institute of ACECR, Iran
Paolo Rosso, Universitat Politècnica de València, Spain