Deej Guide to Life, Universe and Everything I Don't Know About: Bibliographic Reference Parsing

All research builds on earlier research, and scientists are scrupulous about acknowledging others' work. So every research paper contains a reference section that points to that earlier work. Unfortunately, these sections are not computer-readable as they stand, because they do not all share the same format. There are many conventions for citing other people's work in a paper, so references cannot be read by a machine without some pre-processing.

So there are projects like Google Scholar, ParaCite and CiteBase doing this work for us. But they are not perfect, and neither is Plazi. Plazi describes its aim as:

"Plazi is an association supporting and promoting the development of persistent and openly accessible digital taxonomic literature."

My aim is to identify the references in a research paper and then parse them to extract fields like Author Name, Year, Title and Publication. With this information in hand, we can save it in our database and create digital identifiers for the references. The project is useful to Plazi, so I get to work on it over the summer.
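To make the target concrete, here is a minimal sketch of what "parsing a reference into fields" could look like for one single, tidy citation layout. The `Reference` class, the `PATTERN` regular expression and the "Authors. Year. Title. Publication." layout are all my own assumptions for illustration; real reference sections vary far more than this, and this is not the code used in the project.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Reference:
    # Hypothetical record holding the four fields named in the post.
    authors: str
    year: str
    title: str
    publication: str


# Assumed layout: "Authors. Year. Title. Publication."
# The year may optionally be wrapped in parentheses.
PATTERN = re.compile(
    r"^(?P<authors>.+?)\s+\(?(?P<year>\d{4})\)?\.?\s+"
    r"(?P<title>.+?)\.\s+(?P<publication>.+?)\.?$"
)


def parse_reference(text: str) -> Optional[Reference]:
    """Return the parsed fields, or None if the text does not fit the layout."""
    match = PATTERN.match(text.strip())
    if match is None:
        return None
    return Reference(**match.groupdict())


# An invented reference string, used only to exercise the pattern.
example = "Smith, J. 1999. A new ant from Borneo. Journal of Invented Zoology 12: 34-56."
print(parse_reference(example))
```

A pattern like this only works when the title contains no full stop and the publication name is written out, which is exactly what cannot be relied upon in practice.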

So, what I have been doing so far:

Writing lots of small programs to test things, to gather data, to build training data, etc.
Understanding the huge codebase of ParaCite, and then finding out that it is not something we can build upon.
Starting from scratch for our project.

The thing with most zoological papers is that they are not structured the way modern scientific papers are; some are archaic and do not follow a format that is easy to process by computer. Another challenge is micro-citation: references often give the name of the publication not in full but in abbreviated form, and the abbreviations often end with the same separators that are used to separate the fields of a reference. Thus, we cannot use some of the common approaches taken by conventional projects like ParaCite.
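The separator problem is easy to reproduce. The sketch below splits a reference on full stops, which is the kind of simple field splitting that works for tidy citations. The function name `naive_split` and the reference string, including the abbreviated journal name "Ann. Invent. Zool.", are invented for illustration.

```python
from typing import List


def naive_split(reference: str) -> List[str]:
    """Split a reference into fields on '. ', dropping empty pieces."""
    return [part.strip() for part in reference.split(". ") if part.strip()]


# Invented reference with an abbreviated publication name.
reference = "Smith, J. 1999. A new ant from Borneo. Ann. Invent. Zool. 12: 34-56."

for part in naive_split(reference):
    print(repr(part))
```

We would like four fields (author, year, title, publication), but the split returns seven pieces, because every full stop inside the abbreviation is treated as a field boundary.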

So we have been using some conventional but effective approaches to do the job: looking up publications in a set of known publications, learning from that data, and making predictions for the ones I cannot find.
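As a rough sketch of that idea, and only a sketch: first look for a known publication name inside the reference, then "learn" the tokens that occur in the names found so far, and finally guess the publication in an unseen reference as the longest run of such tokens. The function names, the tiny list of known names and the token-run heuristic are all assumptions made for this example; the actual learning step in the project may look quite different.

```python
from typing import Iterable, List, Optional, Set


def find_known_publication(reference: str, known: Iterable[str]) -> Optional[str]:
    """Return the longest known publication name contained in the reference."""
    hits = [name for name in known if name in reference]
    return max(hits, key=len) if hits else None


def learn_vocabulary(found: Iterable[str]) -> Set[str]:
    """Collect the tokens that appear in publication names found so far."""
    return {token for name in found for token in name.split()}


def predict_publication(reference: str, vocabulary: Set[str]) -> Optional[str]:
    """Guess the publication as the longest run of consecutive vocabulary tokens."""
    best: List[str] = []
    run: List[str] = []
    for token in reference.split():
        if token in vocabulary:
            run.append(token)
            if len(run) > len(best):
                best = list(run)
        else:
            run = []
    return " ".join(best) if best else None


# Invented publication names and reference, for illustration only.
known_names = ["Ann. Invent. Zool.", "Mag. Invent. Hist."]
vocabulary = learn_vocabulary(known_names)

unseen = "Smith, J. 1999. A new ant from Borneo. Ann. Invent. Hist. 12: 34-56."
print(find_known_publication(unseen, known_names))  # lookup fails for this one
print(predict_publication(unseen, vocabulary))      # falls back to a prediction
```

The lookup handles references whose publication is already in the set, and the prediction only takes over when the lookup comes back empty.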

The project is challenging and interesting. I will post more updates later, and I would like to hear your views on it in the comments.

Wednesday, July 30, 2008
