Research on Anti-plagiarism System and the Law of Plagiarism (abridged)
SHEN Yang1 LI Shu-chen2 TIAN Chen-geng3 CHENG Ming2
Abstract: Internet plagiarism and the growth of the "paper industry" have made plagiarism more widespread, so research on the problem and the development of anti-plagiarism systems are urgently needed. Some countries have made substantial progress in the anti-plagiarism field, but because of the heavy workload of literature collection and the special characteristics of Chinese characters, Chinese anti-plagiarism work is still at an early stage of development. Moreover, plagiarism includes both direct copying and semantic plagiarism, so there is no unified definition of plagiarism. From 2006 to 2008, on the basis of two patents, we implemented an anti-plagiarism system named ROST AntiP. It covers 18.8 billion web pages and 4.9 million literature documents, uses a flexible matching technique based on attribute value strings, lets users define plagiarism rules flexibly, and supports fuzzy detection. The system is in practical use in several editorial offices and universities. From the data collected in practice, we identify the PlagTrendHot phenomenon and the first page plagiarism phenomenon, and we use them to improve the macro-level algorithm and so address the problem of detection speed. In addition, we accurately estimate the context boundary errors involved in fuzzy matching, and we have preliminarily reached the goal of practical deployment.
Keywords: anti-plagiarism, plagiarism, the law of plagiarism, academic ethics, e-learning
1. Background
Digital resources have gradually shifted the way learners obtain knowledge from person-to-person communication to person-to-computer communication. Because students and scholars are the groups most capable of acquiring knowledge in this way, the shift has intensified plagiarism. Joan Woodward of Nottingham Trent University points out that the use of new technology to cheat is becoming much more serious, so it is necessary to develop and extend anti-plagiarism systems against the background of internet plagiarism and the explosion of the "paper industry".[1] An anti-plagiarism system is an information processing system that reports a plagiarism similarity result by comparing a source document with a target document. Anti-plagiarism is essential in digital libraries, intellectual property protection, e-learning, reprint monitoring, information security, and the protection of music and multimedia.
In response, researchers in several countries have done a great deal of work with fruitful results. In 1993, Manber presented the concept of an approximate index to measure the similarity between strings in different documents.[2] In 1995, Brin and Garcia-Molina in the United States introduced the COPS system and its corresponding algorithm as part of a digital library project, which laid the structural foundation for anti-plagiarism systems.[3] Si and Leong built the CHECK prototype, which measures the similarity between texts by counting keywords and brings text structure information into the measurement.[4] In 2000, Monostori used a suffix vector to store a suffix tree in order to search for the longest matching string among strings, and built the MDR prototype.[5] At present, Turnitin.com[6] provides anti-plagiarism service in more than 90 countries, covers more than 6.6 billion web pages, has 6.5 million users, and supports the four largest e-learning platforms: BlackBoard, WebCT, Moodle, and Angel. However, these websites and software packages share some disadvantages: they accept a limited set of document formats and do not support Chinese. Researchers currently concentrate on the choice of similar-string detection algorithms and storage techniques, while research on the law of plagiarism is comparatively lacking.
2. Processing content of the anti-plagiarism system model
The main objects processed by the anti-plagiarism system (Figure 1) are Chinese characters, numbers, common foreign languages, and all kinds of general and subject-specific symbols. By content, documents can be divided into books awaiting publication, paper drafts awaiting publication, homework awaiting submission, source code and mathematical proofs awaiting submission, web pages or websites awaiting submission, and other documents that demand originality, such as patent documents. By file type, they can be divided into TXT, DOC, PDF, HTML, XML, and so on. By the user's anti-plagiarism requirements, checking can be divided into one-time checking and continuous checking. By media form, plagiarism can be divided into text plagiarism, picture plagiarism, music plagiarism, model plagiarism, and so on.
This paper focuses mainly on text plagiarism. Text plagiarism can be divided into direct plagiarism and semantic plagiarism. In direct plagiarism, plagiarizers search the internet or a literature library, copy all or part of a text, and piece the copied parts together. In semantic plagiarism, plagiarizers rework the wording extensively after collecting the literature: they adjust the text structure, change sentence patterns, and replace some keywords.
By the pattern of plagiarizing, text plagiarism can be divided into "traditional plagiarism" and "internet plagiarism". Traditional plagiarism copies either from printed literature or through personal networks, as when classmates copy from each other. Internet plagiarism refers to direct copying, simple translation, and the use of software to generate papers automatically, all carried out through search engines and literature libraries. It relies on two carriers. One is the multitude of sources on the internet, such as search engines, literature libraries, translation websites, and translation software. This covers both Chinese literature plagiarism and cross-language plagiarism, in which research results from other countries are plagiarized by manual translation, through translation websites such as Google, or with translation software such as Jinshan Fast Translation. The other is automatic article generation software. For example, the software "ZuoWenKuaiShou" generates an article to your requirements from a single input keyword.[8] Such techniques steadily raise the efficiency of plagiarism and cause the plagiarism of homework and papers to flourish.
Books awaiting publication, course papers, degree papers, and academic papers are usually submitted in DOC or PDF format. The references at the end of a paper, or in its footnotes, point to the sources of borrowed material, so we need to extract the plain text and filter out the sentences that are cited. In other words, we remove both the cited content and the reference section of the paper. What remains is the usable text content that the anti-plagiarism system will process.
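The filtering step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not ROST AntiP's actual procedure: it assumes the reference list starts at a heading line such as "References" or "参考文献", that cited sentences carry a numeric marker like [3] before their final punctuation, and that sentences end with common Western or Chinese punctuation. The function name `extract_checkable_text` and both patterns are hypothetical.

```python
import re

# Assumed heading names that open a reference list; real documents vary.
REFERENCE_HEADING = re.compile(
    r"^\s*(references|bibliography|参考文献)\s*:?\s*$", re.IGNORECASE
)
# Assumed citation style: numeric markers such as [3] or [3, 7].
CITATION_MARKER = re.compile(r"\[\d+(?:\s*,\s*\d+)*\]")
# A run of non-terminal characters plus an optional sentence-ending mark.
SENTENCE = re.compile(r"[^.!?。！？]+[.!?。！？]?")


def extract_checkable_text(document: str) -> str:
    """Drop the reference section and any sentence that carries a citation."""
    body_lines = []
    for line in document.splitlines():
        if REFERENCE_HEADING.match(line):
            break  # everything from the heading onward is the reference list
        body_lines.append(line)
    body = " ".join(body_lines)

    kept = []
    for sentence in SENTENCE.findall(body):
        sentence = sentence.strip()
        # Keep only non-empty sentences with no citation marker.
        if sentence and not CITATION_MARKER.search(sentence):
            kept.append(sentence)
    return " ".join(kept)
```

A real extractor would also have to handle footnote citations, author-year citation styles, and decimal points inside numbers, which this naive sentence splitter treats as sentence ends.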
3. System structure and difficulties
An anti-plagiarism system is a system that integrates many processing technologies. For the related technologies, we have implemented about ten independent pieces of freely shared software and published them on the internet.[9] A direct search for "ROST 词频" returns about 3,160 web pages. The technologies include a document format recognition plug-in, meta-search engine technology and a search engine, Web Service and ActiveX, word segmentation and word segmentation statistics, high-speed flexible string matching, efficient information storage and information fingerprinting,[10] web page text extraction, high-speed searching and labeling, and so on. The system structure is shown in Figure 2.
Figure 2: ROST AntiP anti-plagiarism system model chart
From the processing content described above, it is clear that an anti-plagiarism system faces many difficulties. We need to take every possible measure to eliminate or temporarily sidestep the existing technical difficulties, so that the system can gradually be put into practical use and practice can in turn drive further research. Several points are summarized briefly below.
1. Literature library sources
Because plagiarizers may copy from all kinds of sources, constructing a large-scale literature resource library is the key to the success of an anti-plagiarism system. Only a library that contains enough documents can achieve a high coverage rate. There are three main sources of documents: first, groups of documents submitted within a certain scope, such as the homework submitted by one class; second, documents obtained through a meta-search engine; third, documents from other literature libraries, such as those of China Journal Net. After considerable effort, ROST AntiP now covers 18.8 billion Chinese web pages and 4.9 million Chinese literature documents.
2. Document format conversion
Users may submit documents in a variety of formats, such as Founder's layout format and the EML email format, so recognizing and integrating these known and unknown document types is a problem. The system should include a format conversion module to deal with it. In our practical use, common file types are easy to process, while two-column PDF documents are hard to handle.
3. Definition of plagiarism
How many consecutive identical words must a text contain before it is defined as plagiarism? How similar must two semantic structures be before they are defined as plagiarism? Some systems in other countries define 26 consecutive identical words in two texts as plagiarism. We introduce the concept of a similarity degree rather than giving only a yes-or-no conclusion. The similarity degree tells users how many words in the current text are 80% similar to the target texts, and what percentage of the current text, excluding references, those plagiarized words make up. Users can freely configure the system to raise an alert when the percentage exceeds a chosen value.
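As a rough illustration of the similarity degree report, the sketch below computes the flagged percentage from per-block results and compares it with a user-set alert value. The `BlockResult` type, the function names, and the inclusive 80% cutoff are assumptions made for this example; the paper does not specify how ROST AntiP represents these values internally.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BlockResult:
    char_count: int    # characters in the block, references already excluded
    similarity: float  # best similarity to any target text, from 0.0 to 1.0


def plagiarism_percentage(blocks: List[BlockResult],
                          similarity_cutoff: float = 0.80) -> float:
    """Percentage of the text that lies in blocks at or above the cutoff."""
    total = sum(b.char_count for b in blocks)
    if total == 0:
        return 0.0
    # Treating the cutoff as inclusive is an assumption of this sketch.
    flagged = sum(b.char_count for b in blocks
                  if b.similarity >= similarity_cutoff)
    return 100.0 * flagged / total


def should_alert(blocks: List[BlockResult], alert_percentage: float) -> bool:
    """True when the flagged percentage exceeds the user-configured value."""
    return plagiarism_percentage(blocks) > alert_percentage
```

For instance, a text with one 100-character block at similarity 0.9 and one 300-character block at similarity 0.2 has 25% of its characters flagged, so an alert value of 20 would trigger and a value of 30 would not.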
4. Plagiarism phenomena and improved algorithm
4.1 Practical test data
Our anti-plagiarism software is now in use for papers awaiting publication at the editorial office of Publishing Science, for the course papers of a public elective course at Wuhan University, and for some master's theses. The statistics cover 488 papers totaling 3.07 million words. Applying ROST's plagiarism document statistics and analysis tool gives the following result:
From the above table we can see that course papers and bachelor's theses have a high plagiarism rate, and the plagiarism rate of master's theses is not encouraging either. Academic papers have a relatively low plagiarism rate, but given how seriously academic papers should be treated, the state of academic misconduct gives no cause for optimism. From our practical test data we summarize the following phenomena:
PlagTrendHot phenomenon
Among the submitted course papers, 13 are about human search, and 4 of them plagiarize the same article from the internet. When a whole class is given the same assignment topic, checking all the homework submitted by that class reveals that many students copy text from the same documents. A document that has already been copied heavily has a certain probability of being copied by someone else. In other words, the number of heavily copied documents and the number of repeatedly copied documents within a group are correlated. We call this the PlagTrendHot phenomenon. We can therefore select a subset of websites or documents as checking targets and store them in a cache of high-copy-rate documents, which effectively increases checking speed.
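One way to picture the high-copy-rate document cache is a hit counter that moves frequently matched sources to the front of the checking order. The sketch below is a minimal illustration under that assumption; the class `HotSourceCache`, its capacity parameter, and the `check_order` helper are hypothetical and are not taken from ROST AntiP.

```python
from collections import Counter
from typing import List


class HotSourceCache:
    """Tracks how often each source was matched and exposes the hottest ones."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity  # how many hot sources to keep in the cache
        self.hits: Counter = Counter()

    def record_match(self, source_id: str) -> None:
        # Call this each time a submitted paper is found to copy from source_id.
        self.hits[source_id] += 1

    def hot_sources(self) -> List[str]:
        # The most frequently copied sources, most copied first.
        return [source for source, _ in self.hits.most_common(self.capacity)]


def check_order(cache: HotSourceCache, all_sources: List[str]) -> List[str]:
    """Hot sources first, then the remaining sources in their original order."""
    hot = cache.hot_sources()
    hot_set = set(hot)
    return hot + [s for s in all_sources if s not in hot_set]
```

With this ordering, a paper that copies from a popular source is caught after comparing against only a few cached documents, while less popular sources are still checked afterwards.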
Plagiarism first page phenomenon
Of the copied documents, 89.3% appear on the first page of results returned by a literature library or search engine, 7.2% appear on the second page, and the remaining 3.5% appear on later pages. We call this the plagiarism first page phenomenon. Hence we can store just the first-page results of the literature library or search engine in the library of documents with a high probability of being plagiarized, which both raises speed and reduces the storage workload.
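The trade-off behind this choice can be made concrete with a small calculation based on the percentages reported above. The dictionary and function names are hypothetical, and the calculation assumes that the reported shares would carry over to future submissions.

```python
# Share of copied documents found on each result page, as reported in the text.
# The remaining 3.5% is spread over later pages and is not broken down further.
SHARE_BY_PAGE = {1: 89.3, 2: 7.2}


def coverage_if_indexing(first_n_pages: int) -> float:
    """Percent of copied sources covered when only the first n pages are stored."""
    return sum(share for page, share in SHARE_BY_PAGE.items()
               if page <= first_n_pages)
```

Storing only the first page covers 89.3% of the copied sources, and adding the second page raises coverage to about 96.5%, a gain of 7.2 percentage points for roughly double the stored results.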
4.2 Improved algorithm
When constructing the library of plagiarism source literature, we can exploit the characteristics of the returned result pages to increase checking speed. To state the problem clearly, we give the following explanation and definitions:
First, obtain the usable content and determine the category it belongs to; comparison between unrelated categories is redundant. Second, split the document into blocks. Then extract the attribute term strings and send these terms to the literature index library for checking (this library is a hash-based index file that holds the set of attribute terms of each text block). If the similarity degree is greater than 0, record the number of characters in the block, the similarity degree, and the source or web link of the similar literature. Third, once the whole document has been checked, compute the percentage of characters whose similarity degree is greater than 80% in the whole text excluding references, and report the source or web link of the most similar document.
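The block-and-index procedure above might look roughly like the following sketch. Several details are assumptions made only for illustration and are not described in the paper: blocks are fixed 40-character windows, character bigrams stand in for the attribute terms, a Python dictionary stands in for the hash-based index file, and similarity is the fraction of a block's terms shared with its best-matching source block. The category check in the first step is omitted.

```python
from collections import defaultdict

BLOCK_SIZE = 40  # characters per block; a hypothetical value


def split_blocks(text, size=BLOCK_SIZE):
    return [text[i:i + size] for i in range(0, len(text), size)]


def attribute_terms(block):
    # Stand-in for the paper's attribute terms: character bigrams, spaces removed.
    compact = "".join(block.split())
    return {compact[i:i + 2] for i in range(len(compact) - 1)}


def build_index(sources):
    """Map each term to the (source_id, block_no) pairs that contain it."""
    index = defaultdict(set)
    for source_id, text in sources.items():
        for block_no, block in enumerate(split_blocks(text)):
            for term in attribute_terms(block):
                index[term].add((source_id, block_no))
    return index


def check_document(text, index):
    """Return (percent of characters above 80% similarity, best source, records)."""
    records = []  # (char_count, similarity, source_id) for blocks with similarity > 0
    for block in split_blocks(text):
        terms = attribute_terms(block)
        votes = defaultdict(int)  # shared-term count per source block
        for term in terms:
            for key in index.get(term, ()):
                votes[key] += 1
        if not votes:
            continue  # similarity is 0, nothing to record
        (source_id, _), shared = max(votes.items(), key=lambda kv: kv[1])
        records.append((len(block), shared / len(terms), source_id))
    total = len(text)
    flagged = sum(count for count, sim, _ in records if sim > 0.80)
    percentage = 100.0 * flagged / total if total else 0.0
    best = max(records, key=lambda r: r[1])[2] if records else None
    return percentage, best, records
```

Fixed windows only line up when copied text starts on a block boundary, so a practical system would need overlapping or content-defined blocks; the sketch keeps fixed windows to stay short.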
