References to earlier research articles, papers, reports, etc. (publications) are called citations. The number of citations a publication receives matters because it indicates the publication's significance, quality or novelty. Taken across all publications of an individual author, it may also tell us, to some extent, how successful the researcher is.
At present, the vast majority of papers are available on the web, which
makes it possible to discover and count citations automatically.
CiteSeeker is an application created as part of an MSc thesis.
It is based on .NET technology and written almost entirely in C#.
It searches the web starting from given URLs (start points). The start
points may also be obtained from Google as CiteSeeker is able to
communicate with its web APIs.
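Searching from start points can be pictured as a breadth-first traversal of the link graph. The following is a minimal Python sketch of that idea only; it is not CiteSeeker's C# code. The names `LinkExtractor`, `crawl`, `fetch`, `max_documents` and the `example.org` pages are hypothetical, and an in-memory dictionary stands in for real HTTP requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags in one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_points, fetch, max_documents=100):
    """Breadth-first traversal from the start points.

    `fetch` is a hypothetical callable mapping a URL to HTML text, or to
    None when the document cannot be retrieved.
    Returns the URLs in the order they were visited.
    """
    queue = deque(dict.fromkeys(start_points))  # de-duplicated, order kept
    seen = set(queue)
    visited = []
    while queue and len(visited) < max_documents:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments.
            target, _ = urldefrag(urljoin(url, href))
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


# A tiny in-memory "web" (assumed data) instead of network access.
PAGES = {
    "http://example.org/": '<a href="a.html">A</a> <a href="/b.html">B</a>',
    "http://example.org/a.html": '<a href="b.html#top">B again</a>',
    "http://example.org/b.html": '<a href="http://example.org/">home</a>',
}

order = crawl(["http://example.org/"], PAGES.get)
print(order)
```

The `seen` set keeps each document from being queued twice, and `max_documents` is one simple form of search restriction; a real crawler would also need timeouts, format detection and per-server limits.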
The web is crawled by following links to other web pages and to documents in specified formats. Supported formats are text, HTML, PDF and PS, as well as archives (ZIP, GZ, TAR). Free external utilities are used for text conversion and unpacking. Inputs and outputs are text files with a specific structure. The main input is a list of publications whose citations should be found. In addition, start points and web search restrictions may be specified. The main output is a list of the citations found, together with a statistical summary. Exact and fuzzy search methods are combined to cope with errors in the search strings.
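To illustrate how exact and fuzzy search can be combined, here is a minimal Python sketch. It is an assumption about the general approach, not CiteSeeker's actual algorithm: the fuzzy step uses the standard library's `difflib.SequenceMatcher` as a stand-in, and the names `normalize`, `find_citation`, the `threshold` value and the sample title are all hypothetical.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse every run of non-alphanumerics to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def find_citation(title, document, threshold=0.85):
    """Return ("exact", 1.0), ("fuzzy", score) or (None, best_score)."""
    needle = normalize(title)
    haystack = normalize(document)
    if not needle:
        return None, 0.0

    # Exact pass: whole-word substring match on the normalized text.
    if f" {needle} " in f" {haystack} ":
        return "exact", 1.0

    # Fuzzy pass: compare the title with every window of the same word count.
    words = haystack.split()
    width = len(needle.split())
    best = 0.0
    for i in range(max(1, len(words) - width + 1)):
        window = " ".join(words[i:i + width])
        best = max(best, SequenceMatcher(None, needle, window).ratio())
    if best >= threshold:
        return "fuzzy", best
    return None, best


title = "Fuzzy search in text extracted from PDF files"  # made-up title
clean = "[3] J. Doe. Fuzzy Search in Text Extracted from PDF Files. 2004."
noisy = "[3] J. Doe. Fuzzy searoh in text extraoted from PDF files. 2004."
other = "An unrelated paper about graph colouring."

for doc in (clean, noisy, other):
    kind, score = find_citation(title, doc)
    print(kind, round(score, 2))
```

The cheap exact pass handles clean text, while the slower fuzzy pass tolerates the kind of character errors that text extraction from PS or PDF can introduce. The threshold trades missed citations against false matches and would need tuning on real data.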
CiteSeeker turns out to be especially useful for conference servers with newly published papers. For instance, Table 1 shows the results of a search for citations to a list of 129 publications by one author. The search took about three hours.
Table 1. Search statistics.

| Statistic | Value |
| --- | --- |
| Execution time (h:mm:ss:ms) | 3:01:51:382 |
| Documents searched | 1 335 |
| Documents successfully searched | 8 |
| New servers found | 82 |
| Kilobytes processed | 794 031 |
| Archives checked | 270 |
| PS and PDF checked | 811 |
| Text extraction errors | 8 |
| Extracted PS (average time) | 255 (19.17 sec) |
| Extracted PDF (average time) | 548 (0.57 sec) |
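The figures in Table 1 can be cross-checked with a little arithmetic. The short Python sketch below does this under two assumptions that the table itself does not state: that the execution time is formatted as hours:minutes:seconds:milliseconds, and that text extraction ran sequentially within that time. The variable names are illustrative only.

```python
# Figures copied from Table 1.
execution_time = "3:01:51:382"   # assumed format: h:mm:ss:ms
ps_count, ps_avg_s = 255, 19.17
pdf_count, pdf_avg_s = 548, 0.57
ps_pdf_checked = 811
extraction_errors = 8
kilobytes = 794_031

h, m, s, ms = (int(part) for part in execution_time.split(":"))
total_s = h * 3600 + m * 60 + s + ms / 1000

# Total extraction time per format = count * average time.
ps_total_s = ps_count * ps_avg_s
pdf_total_s = pdf_count * pdf_avg_s

print(f"total run time: {total_s:.0f} s")
print(f"PS extraction:  {ps_total_s:.0f} s ({ps_total_s / total_s:.0%} of run time)")
print(f"PDF extraction: {pdf_total_s:.0f} s ({pdf_total_s / total_s:.0%} of run time)")
print(f"throughput:     {kilobytes / total_s:.1f} KB/s")

# Extracted documents plus extraction errors versus documents checked.
print(ps_count + pdf_count + extraction_errors, "vs", ps_pdf_checked)
```

Under these assumptions, PS extraction alone accounts for roughly 45% of the run time, whereas PDF extraction accounts for about 3%. The 255 extracted PS files, the 548 extracted PDF files and the 8 extraction errors also add up to the 811 PS and PDF documents checked.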
The complete description can be found in the
thesis. You may also have a look at the conference
paper.
Binaries are available here (7.8 MB zip).
For comments or requests, send me an email.