A System for Citations Retrieval on the Web

Figure 1: Application main window

References to earlier research articles, papers, reports, and similar works (publications) are called citations. The number of citations a publication receives matters because it indicates the publication's significance, quality, or novelty. Taken over all publications of an individual author, the citation count also tells us, to some extent, how successful the researcher is.

At present, the vast majority of papers are available on the web, which makes it possible to discover and count citations automatically. CiteSeeker is an application created as part of an MSc thesis. It is based on .NET technology and written almost entirely in C#. It searches the web starting from given URLs (start points). The start points may also be obtained from Google, as CiteSeeker can communicate with its web APIs.
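The idea of searching from start points can be sketched as a breadth-first traversal. The example below is only an illustration in Python, not CiteSeeker's actual C# code: the names `crawl`, `fetch`, `max_documents`, and the in-memory `PAGES` dictionary are assumptions made for this sketch, and no real HTTP requests are issued.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags in an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_points, fetch, max_documents=100):
    """Breadth-first crawl; returns the URLs in the order they were visited.

    `fetch` is a hypothetical callable that maps a URL to its HTML text,
    or returns None when the document cannot be retrieved.
    """
    queue = deque(start_points)
    seen = set(start_points)
    visited = []
    while queue and len(visited) < max_documents:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            target, _fragment = urldefrag(urljoin(url, href))
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


# A tiny in-memory "web" (made-up pages) stands in for real HTTP requests.
PAGES = {
    "http://example.org/": '<a href="a.html">A</a> <a href="b.html">B</a>',
    "http://example.org/a.html": '<a href="/b.html#top">B again</a>',
    "http://example.org/b.html": '<a href="http://example.org/">home</a>',
}

order = crawl(["http://example.org/"], PAGES.get)
print(order)
```

The `seen` set keeps the crawler from revisiting a page that is reachable by several links, and `max_documents` is one simple way to bound the search.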

Figure 2: Setting search parameters

The web is crawled by following links to other web pages and to documents in specified formats. Supported formats are plain text, HTML, PDF, and PS, as well as archives (ZIP, GZ, TAR). Free external utilities are used for text conversion and unpacking. Inputs and outputs are text files with a specific structure. The main input is a list of publications whose citations are to be found. In addition, start points and web search restrictions may be specified. The main output is a list of the citations found, together with a statistical summary. Exact and fuzzy search methods are combined to cope with errors in the search strings.
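One way to combine exact and fuzzy search is to try a plain substring match first and fall back to an approximate comparison only when it fails. The sketch below is an assumption about how such a combination might look, not the method used in CiteSeeker: it relies on the similarity ratio from Python's `difflib`, and the function names, the `threshold` value, and the sample strings are all made up for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse every run of non-alphanumerics to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def find_citation(title, document, threshold=0.85):
    """Return 'exact', 'fuzzy', or None for one title in one document."""
    needle = normalize(title)
    haystack = normalize(document)
    if not needle:
        return None
    # Exact pass: cheap substring test on the normalized text.
    if needle in haystack:
        return "exact"
    # Fuzzy pass: slide a window of as many words as the title has
    # and accept the first window that is similar enough.
    words = haystack.split()
    n = len(needle.split())
    for i in range(max(len(words) - n + 1, 1)):
        window = " ".join(words[i:i + n])
        if SequenceMatcher(None, needle, window).ratio() >= threshold:
            return "fuzzy"
    return None


title = "A System for Citations Retrieval on the Web"
clean = "[3] A system for citations retrieval on the web, MSc thesis."
noisy = "[3] A systern for citatons retrieval on the web"  # conversion errors
other = "An unrelated paper on graph drawing."

print(find_citation(title, clean))
print(find_citation(title, noisy))
print(find_citation(title, other))
```

Running the exact pass first keeps the common case fast; the slower fuzzy pass is needed only for text damaged by, for example, PS or PDF conversion.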

CiteSeeker turns out to be especially useful for conference servers with newly published papers. For instance, Table 1 shows the results of a search for citations to a list of 129 publications by one author. The search took about three hours.

Table 1: Searching

Execution time                    3:01:51:382
Documents searched                1 335
Documents successfully searched   8
New servers found                 82
Kilobytes processed               794 031
Archives checked                  270
PS and PDF checked                811
Text extraction errors            8
Extracted PS (average time)       255 (19.17 sec)
Extracted PDF (average time)      548 (0.57 sec)
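The figures in Table 1 can be cross-checked with a little arithmetic. The sketch below rests on two assumptions that the table itself does not state: that the execution time reads as hours:minutes:seconds:milliseconds, and that the 8 text extraction errors are the PS and PDF files that could not be extracted. Because the average times are rounded, the totals are approximate.

```python
from datetime import timedelta

# Values copied from Table 1; the h:mm:ss:ms reading is an assumption.
total = timedelta(hours=3, minutes=1, seconds=51, milliseconds=382)
ps_count, ps_avg = 255, 19.17    # extracted PS files, average seconds each
pdf_count, pdf_avg = 548, 0.57   # extracted PDF files, average seconds each
checked, errors = 811, 8         # PS and PDF checked, extraction errors

# Do extracted files plus errors account for every PS/PDF file checked?
consistent = ps_count + pdf_count + errors == checked

# Approximate time spent on text extraction and its share of the run.
extraction_seconds = ps_count * ps_avg + pdf_count * pdf_avg
share = extraction_seconds / total.total_seconds()

print(consistent)
print(f"{extraction_seconds:.2f} s, {share:.0%} of the run")
```

Under these assumptions, the counts add up (255 + 548 + 8 = 811), and text extraction takes close to half of the total execution time, almost all of it spent on PS files.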

The complete description can be found in the thesis. You may also have a look at the conference paper. Binaries are available here (7.8 MB zip). For comments or requests, send me an email.

Last modified: 09. 09. 2005 at 15:32
Webmaster:  Dalibor Fiala