Revision of Data sources used from Tue, 2013-03-05 12:12

To obtain information on biotic associations and distribution records, we used three sources:

  1. Encyclopedia of Life (EOL), which provides information on biotic associations from general sources such as Wikipedia and NatureServe, as well as more specific sources such as the North American Butterfly Knowledge Network.
  2. Biodiversity Heritage Library (BHL), which provides digital versions of legacy biodiversity literature held in natural history and botanical libraries.
  3. Global Biodiversity Information Facility (GBIF), which provides distribution records of varying accuracy, from country, province or ecoregion names to point localities.

Because each data source structures and organizes its data differently, we developed a data extraction protocol specific to each source:

Protocol for data extraction from EOL

For the EOL search, we will repeat the search with the new list of keywords to evaluate how much the number of species with matched objects increases. To determine the final number of text objects we have evaluated, we will repeat questions 1 to 4. Because the contents of EOL pages change periodically, we consider it necessary to repeat this search during the course of the project. Newly recorded text objects can be incorporated into the extraction process as they are found.
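The comparison between the old and new keyword lists could be sketched as follows. This is only an illustration under stated assumptions: the function name `species_with_matches`, the dictionary layout (species name mapped to a list of text objects), the example texts and both keyword lists are hypothetical and are not taken from the actual EOL search.

```python
import re

def species_with_matches(text_objects, keywords):
    """Return the species that have at least one text object mentioning a keyword.

    text_objects: dict mapping a species name to a list of text strings
                  (hypothetical layout, not the EOL data format).
    keywords:     list of words or phrases to look for, case-insensitively.
    """
    if not keywords:
        return set()
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    return {
        species
        for species, texts in text_objects.items()
        if any(pattern.search(text) for text in texts)
    }

# Hypothetical example data, for illustration only.
text_objects = {
    "Danaus plexippus": ["Larvae feed on milkweed, the host plant of the species."],
    "Phengaris arion": ["Caterpillars are tended by ants of the genus Myrmica."],
    "Vanessa cardui": ["A widespread migratory species."],
}
old_keywords = ["host plant"]
new_keywords = ["host plant", "tended by", "feed on"]

before = species_with_matches(text_objects, old_keywords)
after = species_with_matches(text_objects, new_keywords)
gained = after - before  # species matched only with the new keyword list
```

Running the same function each time the search is repeated would give comparable counts across dates, which matters because EOL page contents change periodically.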

Protocol for data extraction from BHL

For the BHL search, we will identify the PDFs with false positives (index pages and bibliography pages), which may contain the butterfly name while the word describing the biotic association appears out of context.

For both sources of information, EOL and BHL, we will extract plant names, ant names, parasitoid names and mimic names for each butterfly species. A simple way to do this automatically would be to compare the locations of the keyword and the plant names within the text, but semantic methods could be explored to handle more complex sentences. The results will be checked by manual inspection to validate the context in which the plant species has been mentioned, especially to differentiate between nectar sources for the adults, confirmed host plants for the larvae, and records of unsuitable host plants (for example, when the caterpillar was unsuccessfully reared under lab conditions).
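One possible reading of the location comparison described above is a character-window check: a name is paired with a keyword when the two occur close together in the text. The sketch below assumes this reading. The function name `candidate_associations`, the `max_chars` window and the example sentence are hypothetical, and the output is only a list of candidates for manual inspection, not a confirmed association.

```python
import re

def candidate_associations(text, keywords, names, max_chars=80):
    """Pair each keyword occurrence with every name found within max_chars of it.

    Returns (keyword, name, gap) tuples sorted by gap, where gap is the number
    of characters between the nearest edges of the two matches.
    """
    lowered = text.lower()
    hits = []
    for keyword in keywords:
        for k in re.finditer(re.escape(keyword.lower()), lowered):
            for name in names:
                for n in re.finditer(re.escape(name.lower()), lowered):
                    if n.start() >= k.end():      # name follows the keyword
                        gap = n.start() - k.end()
                    elif k.start() >= n.end():    # name precedes the keyword
                        gap = k.start() - n.end()
                    else:                         # the two matches overlap
                        gap = 0
                    if gap <= max_chars:
                        hits.append((keyword, name, gap))
    return sorted(hits, key=lambda hit: hit[2])

# Hypothetical example text, for illustration only.
text = ("The larvae feed on Asclepias syriaca. "
        "Adults take nectar from Solidago canadensis.")
hits = candidate_associations(
    text,
    keywords=["feed on", "nectar"],
    names=["Asclepias syriaca", "Solidago canadensis"],
    max_chars=10,
)
```

The window size strongly affects the result. With a wider window in this example, "nectar" would also be paired with the plant named in the previous sentence, which is the kind of error the manual inspection is meant to catch.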

Protocol for data extraction from GBIF

For the GBIF search, we will evaluate the reliability of the automatically generated range maps (based on approaches a and b) by assigning a confidence weight. For maps with low confidence, we will evaluate whether additional distribution information is available in other web sources or in the published literature.
