The Biodiversity Heritage Library (BHL) is a consortium of natural history and botanical libraries that cooperate to digitize and make accessible the legacy literature of biodiversity held in their collections and to make that literature available for open access and responsible use as a part of a global “biodiversity commons.” Read more about BHL here.
BHL provides metadata at four levels: titles (volume level records), items (records for original monographs or journals), parts (records for articles, chapters, treatments, etc.) and individual pages. We focused our search on items and pages.
Step 1: Automatic search
For each butterfly species in the checklist we did a preliminary automatic search in BHL for pages where the species was mentioned. We made use of the BHL API (version 2) function NameGetDetail. This function matches the given name string to a Name Bank ID and returns basic title, item and page metadata for each page on which the specified name appears. Please note that users are required to obtain an API Key in order to use the BHL API.
We summarized the results of the API call for each species, and built a list of page IDs. We then used the BHL API function GetPageOcrText to download the text file for each one of these pages.
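The two calls in this step can be sketched roughly as follows. This is a minimal illustration, not the code we ran: the endpoint URL, the query parameter names (`op`, `name`, `pageid`, `format`, `apikey`) and the nested response layout (Result, Titles, Items, Pages, PageID) are assumptions that should be checked against the BHL API documentation, and the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint for version 2 of the BHL API; verify against the official docs.
BHL_API_V2 = "https://www.biodiversitylibrary.org/api2/httpquery.ashx"


def build_url(op, api_key, **params):
    """Build a request URL for one API operation, asking for JSON output."""
    query = {"op": op, **params, "format": "json", "apikey": api_key}
    return BHL_API_V2 + "?" + urlencode(query)


def call_bhl(op, api_key, **params):
    """Send the request and return the decoded JSON (needs network access)."""
    with urlopen(build_url(op, api_key, **params)) as response:
        return json.load(response)


def collect_page_ids(name_detail):
    """Gather unique page IDs from a NameGetDetail response.

    Assumes the layout Result -> Titles -> Items -> Pages -> PageID.
    """
    page_ids = []
    result = name_detail.get("Result") or {}
    for title in result.get("Titles") or []:
        for item in title.get("Items") or []:
            for page in item.get("Pages") or []:
                page_id = page.get("PageID")
                if page_id is not None and page_id not in page_ids:
                    page_ids.append(page_id)
    return page_ids
```

Under these assumptions, `call_bhl("NameGetDetail", api_key, name="Heliconius erato")` would return the response passed to `collect_page_ids`, and `call_bhl("GetPageOcrText", api_key, pageid=page_id)` would fetch the OCR text of one page.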
Step 2: Key word matching
We used regular expression matching to search for keywords within the text of each page. We used several complete and partial keywords that we consider indicative of species associations, and recorded their location within the text. We discarded pages without keyword matches.
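A keyword search of this kind might look like the sketch below. The keyword list is hypothetical (the actual list is not reproduced here), and treating each keyword as a prefix is only one way to handle partial keywords.

```python
import re

# Hypothetical keywords; the list actually used is not given in the text.
KEYWORDS = ["host", "food plant", "feed", "larva"]

# Each keyword matches as a word prefix, so "feed" also finds "feeds" and "feeding".
KEYWORD_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in KEYWORDS) + r")\w*",
    re.IGNORECASE,
)


def find_keywords(text):
    """Return (matched text, start, end) for every keyword hit in the page text."""
    return [(m.group(0), m.start(), m.end()) for m in KEYWORD_RE.finditer(text)]


def pages_with_keywords(ocr_by_page):
    """Keep only the pages whose OCR text has at least one keyword match."""
    hits = {page_id: find_keywords(text) for page_id, text in ocr_by_page.items()}
    return {page_id: found for page_id, found in hits.items() if found}
```

The start and end values are character offsets into the page text, which is one plausible reading of the positions recorded in this step.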
Step 3: Species names
For every page with at least one keyword match, we requested the complete list of species names found in that page using the BHL API function GetPageNames. For each name found, we recorded its location within the text.
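Locating the returned names in the page text could be done along these lines. The sketch assumes the names are already available as a plain list of strings (the exact field that GetPageNames uses for them is not stated here), and `locate_names` is a hypothetical helper.

```python
import re


def locate_names(text, names):
    """Return (name, start, end) for each occurrence of each name in the text."""
    found = []
    for name in names:
        # Escape the name so characters such as "." are matched literally.
        pattern = re.compile(re.escape(name), re.IGNORECASE)
        for match in pattern.finditer(text):
            found.append((name, match.start(), match.end()))
    # Order the hits by their position in the page.
    return sorted(found, key=lambda hit: hit[1])
```

Exact matching like this will miss names that the OCR has garbled, so some names reported for a page may have no recorded position.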
We built a file with the following information:
Column name | Description | Example of content
item.id | URL code that identifies the original monograph or journal |
page.id | URL code that identifies the page |
kws | Keyword or name matched | Pterichis, Heliconius
fam | Whether the word in “kws” is a plant name, a butterfly name, or a keyword | Keyword, Plantae, Animalia
start | Position in the page text where the word starts | 146
end | Position in the page text where the word ends | 162
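As an illustration of the file layout above, the sketch below writes one row per match. The CSV format and the ID and position values are assumptions made for the example; only the column names come from the table.

```python
import csv
import io

COLUMNS = ["item.id", "page.id", "kws", "fam", "start", "end"]


def write_matches(rows, handle):
    """Write one line per matched keyword or name, with a header line first."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


# Made-up placeholder values, not real BHL identifiers or positions.
rows = [
    {"item.id": 1, "page.id": 2, "kws": "Heliconius", "fam": "Animalia",
     "start": 146, "end": 156},
    {"item.id": 1, "page.id": 2, "kws": "feed", "fam": "Keyword",
     "start": 160, "end": 164},
]

buffer = io.StringIO()
write_matches(rows, buffer)
```

Passing an open file instead of the in-memory buffer would write the same content to disk.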
We then added four (4) columns for manual validation and information extraction. We first classified each page into one of four (4) categories (“index”, “reference”, “announcement” or “text”) based on the type of information it contained.
We ignored all pages classified as “index”, “reference” or “announcement”, because they do not contain information about host plants. We manually evaluated all pages classified as “text”, which contained species descriptions, field records, biology, catalogue descriptions, etc., all of them with potential host plant records.
We defined four (4) different actions: “ready” when the content of a text page was verified and any association information present was extracted; “to postpone” when the content of a text page was about associations other than host plant ones (mimicry, parasitism, etc.); “ignore” when the page was classified as “index”, “reference” or “announcement”; “to process” for all text pages not yet verified.
Column name | Description | Example of content
Document type | Type of information contained in the page | “index”, “announcement”, “text”
Action | Whether the page should be manually processed or not | “ignore”, “to process”, “ready”, “to postpone”
Responsible | Person who processed the page | “ASM”, “CL”, “LZ”
Date | Date when the page was validated | “20130701”
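The classification and action rules described above can be summarized in a short sketch. The function names and the boolean flag are hypothetical; the document types, actions and date format are the ones given in the text. Classification and validation were done by hand, so this only encodes the bookkeeping.

```python
from datetime import date

# Page types that carry no host plant information.
IGNORED_TYPES = {"index", "reference", "announcement"}


def initial_action(document_type):
    """Action assigned to a page before any manual verification."""
    if document_type in IGNORED_TYPES:
        return "ignore"
    if document_type == "text":
        return "to process"
    raise ValueError(f"unknown document type: {document_type!r}")


def validate_text_page(other_association, responsible, when):
    """Validation columns for a text page once a person has read it.

    other_association is True when the page deals with associations other
    than host plants (mimicry, parasitism, etc.).
    """
    return {
        "Document type": "text",
        "Action": "to postpone" if other_association else "ready",
        "Responsible": responsible,
        "Date": when.strftime("%Y%m%d"),
    }
```

For example, `validate_text_page(False, "ASM", date(2013, 7, 1))` yields a row marked “ready” with the date written as “20130701”.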