Project : atoll
Section: New Results
Processing Botanical Corpora
Participants : Guillaume Rousse, Éric Villemonte de la Clergerie, François Role.
BIOTIM Action: http://atoll.inria.fr, section « Projets »
In the context of the French action BIOTIM (cf. 7.2), ATOLL is involved in processing botanical corpora.
The work carried out this year on BIOTIM is twofold.
First, we continued last year's effort on the NLP pipeline. We had to rework it completely to integrate tools developed by other team members during the EASY campaign, and to ensure better compliance with MAF.
Experiments on terminology extraction were presented at TIA 2005 [31]. One of the problems we faced was the poor typographic quality of the OCRed corpora, so a new OCR pass was carried out. However, despite the improved quality of the newly OCRed corpora, there are still issues with spelling errors and with noise induced by the layout of the original documents (pagination, illustrations, digitization artifacts, etc.). Hence the need for some kind of input filtering.
To address this very issue, we started to work on recovering the logical structure of the corpora. With a generic chunker based on regular expressions, together with corpus-specific configurations, we are able to separate and label the various parts of interest in a document, mainly taxon descriptions. Domain-specific integrity rules allow some automatic error correction for undetected patterns, such as missing taxa in the taxonomic hierarchy. Building on this structuring, coupled with morpho-syntactic processing by our NLP pipeline, François Role carried out a preliminary study to assess the possibility of extracting an ontology and representing it in OWL [19].
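As an illustration only, the following minimal Python sketch shows how a regular-expression chunker driven by a corpus-specific configuration, followed by one integrity rule, might be organised. It is not the ATOLL implementation: the configuration format, the labels, the heading patterns, the rule (inserting an inferred genus when a species heading appears without one) and the sample text are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical corpus-specific configuration: (label, line pattern, starts_block).
# Labels and patterns are illustrative assumptions, not the BIOTIM ones.
CONFIG = [
    ("noise", re.compile(r"^\s*\d+\s*$"), False),  # bare page number
    ("taxon", re.compile(
        r"^(?P<rank>Family|Genus|Species)\s+"
        r"(?P<name>[A-Z][a-z]+(?: [a-z]+)?)\s*$"), True),
    ("bibliography", re.compile(r"^Bibliography\b"), True),
]


@dataclass
class Chunk:
    label: str
    lines: List[str]
    rank: Optional[str] = None
    name: Optional[str] = None
    inferred: bool = False  # True when added by an integrity rule


def chunk(text: str) -> List[Chunk]:
    """Split text into labelled chunks using the patterns in CONFIG."""
    chunks: List[Chunk] = []
    current: Optional[Chunk] = None
    for line in text.splitlines():
        if not line.strip():
            continue
        for label, pattern, starts_block in CONFIG:
            m = pattern.match(line)
            if m:
                break
        else:
            label, m, starts_block = None, None, False
        if m is None:
            # Unlabelled line: attach it to the block currently open.
            if current is None:
                current = Chunk("text", [])
                chunks.append(current)
            current.lines.append(line)
        elif starts_block:
            groups = m.groupdict()
            current = Chunk(label, [line], groups.get("rank"), groups.get("name"))
            chunks.append(current)
        else:
            # Single-line noise: labelled, but the open block stays open.
            chunks.append(Chunk(label, [line]))
    return chunks


def repair_hierarchy(chunks: List[Chunk]) -> List[Chunk]:
    """Assumed integrity rule: every species must sit under its genus."""
    repaired: List[Chunk] = []
    genus_seen: Optional[str] = None
    for c in chunks:
        if c.label == "taxon":
            if c.rank == "Family":
                genus_seen = None
            elif c.rank == "Genus":
                genus_seen = c.name
            elif c.rank == "Species":
                genus = c.name.split()[0]  # first word of the binomial
                if genus != genus_seen:
                    repaired.append(Chunk("taxon", [], "Genus", genus, True))
                    genus_seen = genus
        repaired.append(c)
    return repaired


# Made-up sample input, for illustration only.
SAMPLE = """Family Fagaceae
Trees or shrubs, leaves alternate.
12
Species Quercus robur
Leaves lobed, acorns on long stalks.
Bibliography
Entry one.
"""
```

Calling `repair_hierarchy(chunk(SAMPLE))` labels the page number as noise without closing the family description, and inserts an inferred genus chunk before the species heading. Keeping the patterns in a separate configuration table is what would let the same chunker be reused across corpora with different layouts.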
We have also started parsing some corpora, exploiting the logical structure to remove irrelevant parts (such as bibliographical notices). We now need to assess the quality of the parses in order to eventually tune the meta-grammar to the very specific style of these botanical corpora. We also need to complete and enrich a domain-specific lexicon. The next step, in collaboration with LIFO (Univ. of Orléans), will be to exploit the dependencies produced during parsing to extract a small lexical ontology.