
Publication date: 1397/2/16 (Solar Hijri); Venue: International Conference on Language Resources and Evaluation; Pages: 7

Parsivar: A Language Processing Toolkit for Persian

Abstract

With the growth of Internet usage, a massive amount of textual data is generated on social media and the Web. Because text on the Web is written by different authors with varied writing styles and different encodings, a preprocessing step is required before any NLP task can be applied. The goal of preprocessing is to convert text into a standard format that makes it easy to extract information from documents and sentences. The problem is more acute for Arabic script-based languages, which have several encoding schemes, several writing styles, and variation in the spaces between or within words. This paper introduces Parsivar, a comprehensive toolkit for Persian text preprocessing. The toolkit performs normalization, space correction, tokenization, stemming, part-of-speech tagging, and shallow parsing. To evaluate the toolkit, both intrinsic and extrinsic approaches have been applied, with a Persian plagiarism detection system serving as the downstream task for the extrinsic evaluation. The results show that our toolkit outperforms the available Persian preprocessing toolkits by about 8 percent in terms of F1.
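To make the encoding problem mentioned in the abstract concrete, the following is a minimal sketch of character normalization followed by naive tokenization. It is not Parsivar's implementation or API. The function names `normalize` and `tokenize` and the two-entry character map are assumptions chosen for illustration, and the sketch uses only the Python standard library.

```python
import re

# Arabic code points that often appear in Persian text in place of their
# Persian counterparts. This small table is an illustrative assumption,
# not the mapping used by Parsivar.
CHAR_MAP = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
})


def normalize(text: str) -> str:
    """Unify character variants and collapse runs of whitespace."""
    text = text.translate(CHAR_MAP)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text: str) -> list[str]:
    """Split normalized text on whitespace (a deliberately naive tokenizer)."""
    return normalize(text).split()
```

Two strings that look identical on screen can differ at the code-point level, so a step like this is what lets later stages (stemming, tagging, matching in a plagiarism detector) treat them as the same word. A real toolkit would also need to handle punctuation and the zero-width non-joiner, which this sketch leaves untouched.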

Authors: Salar Mohtaj, Behnam Roshanfekr, Atefeh Zafarian, Habibollah Asghari
