BHL data search

The Biodiversity Heritage Library (BHL) is a consortium of natural history and botanical libraries that cooperate to digitize and make accessible the legacy literature of biodiversity held in their collections and to make that literature available for open access and responsible use as a part of a global “biodiversity commons.”

BHL provides metadata at four different levels: titles (volume level records), items (records for original monographs or journals), parts (records for articles, chapters, treatments, etc.) and individual pages. We focused our search on items and pages.

Step 1: Automatic search

For each butterfly species in the checklist we did a preliminary automatic search in BHL for pages where the species was mentioned. We made use of the BHL API (version 2) function NameGetDetail. This function matches the given name string to a Name Bank ID and returns basic title, item and page metadata for each page on which the specified name appears. Please note that users are required to obtain an API Key in order to use the BHL API.
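As a minimal sketch of how such a request could be assembled, the snippet below builds the query URL for one species name. The endpoint address, the parameter names (`op`, `name`, `format`, `apikey`) and the JSON output format are assumptions based on our reading of the BHL API v2 HTTP query interface, and the helper names are hypothetical; check them against the current BHL documentation before use.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed address of the BHL API v2 HTTP query interface.
BHL_API2_URL = "https://www.biodiversitylibrary.org/api2/httpquery.ashx"


def name_get_detail_url(name, api_key):
    """Build the request URL for a NameGetDetail call on one species name."""
    params = {
        "op": "NameGetDetail",  # API function named in the text
        "name": name,           # species name from the checklist
        "format": "json",       # assumed: ask for a JSON response
        "apikey": api_key,      # personal API key (required by BHL)
    }
    return BHL_API2_URL + "?" + urlencode(params)


def fetch_json(url):
    """Download a URL and decode the body as JSON (needs network access)."""
    with urlopen(url) as response:
        return json.load(response)
```

A call such as `fetch_json(name_get_detail_url("Heliconius erato", "MY_KEY"))` would then return the decoded response for one species, where `MY_KEY` stands for a real API key.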

We summarized the results of the API call for each species, and built a list of page IDs. We then used the BHL API function GetPageOcrText to download the text file for each one of these pages.
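The summarizing step can be sketched as follows. We assume here that the decoded result nests pages inside items inside titles, under the keys `Titles`, `Items`, `Pages` and `PageID`; these key names, the endpoint address and the helper names are assumptions for illustration, not a description of the exact code we ran.

```python
from urllib.parse import urlencode

# Assumed address of the BHL API v2 HTTP query interface.
BHL_API2_URL = "https://www.biodiversitylibrary.org/api2/httpquery.ashx"


def collect_page_ids(result):
    """Return the unique page IDs of a name result, in order of appearance.

    Assumes the shape Titles -> Items -> Pages -> PageID.
    """
    page_ids = []
    seen = set()
    for title in result.get("Titles") or []:
        for item in title.get("Items") or []:
            for page in item.get("Pages") or []:
                page_id = page.get("PageID")
                if page_id is not None and page_id not in seen:
                    seen.add(page_id)
                    page_ids.append(page_id)
    return page_ids


def page_ocr_text_url(page_id, api_key):
    """Build the request URL for a GetPageOcrText call on one page."""
    params = {
        "op": "GetPageOcrText",
        "pageid": page_id,
        "format": "json",
        "apikey": api_key,
    }
    return BHL_API2_URL + "?" + urlencode(params)
```

Removing duplicates at this stage avoids downloading the same page twice when a name is indexed under more than one item.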

Step 2: Key word matching

We used regular expression matching to search for keywords within the text of each page. We used several complete and partial keywords that we considered indicative of species associations, and recorded their locations within the text. Pages without keyword matches were discarded.
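The following sketch illustrates this kind of matching. The keyword patterns shown are hypothetical examples chosen for illustration, not the list we actually used; partial keywords are expressed as a stem followed by `\w*`.

```python
import re

# Hypothetical keyword patterns: complete words and partial stems.
KEYWORD_PATTERNS = [
    r"host\s?plants?",  # "hostplant", "host plants"
    r"food\s?plants?",  # "foodplant", "food plants"
    r"feed\w*",         # partial keyword: "feed", "feeds", "feeding"
    r"larva\w*",        # partial keyword: "larva", "larvae", "larval"
]
KEYWORD_RE = re.compile(r"\b(?:" + "|".join(KEYWORD_PATTERNS) + r")", re.IGNORECASE)


def find_keywords(text):
    """Return (matched word, start, end) for every keyword hit in the text."""
    return [(m.group(0), m.start(), m.end()) for m in KEYWORD_RE.finditer(text)]


def has_keywords(text):
    """True when the page text holds at least one keyword match."""
    return KEYWORD_RE.search(text) is not None
```

Pages for which `has_keywords` returns `False` would be dropped before the next step.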

Step 3: Species names

For every page with at least one keyword match, we requested the complete list of species names found on that page using the BHL API function GetPageNames. For each name found, we recorded its location within the text.
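One plausible way to record the locations is a literal search for each returned name in the page text, sketched below. The function name is hypothetical, and we assume the names are available as plain strings; OCR errors in the page text would cause some names to be missed by an exact search like this one.

```python
import re


def locate_names(text, names):
    """Return (name, start, end) for every literal occurrence of each name."""
    hits = []
    for name in names:
        # re.escape keeps characters such as "." or "(" from acting as regex syntax
        for match in re.finditer(re.escape(name), text):
            hits.append((name, match.start(), match.end()))
    hits.sort(key=lambda hit: hit[1])  # order by position in the page
    return hits
```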

We built a file with the following information:

| Column name | Description | Example of content |
| --- | --- | --- |
| item.id | URL that identifies the original monograph or journal | http://www.biodiversitylibrary.org/item/699 |
| page.id | URL that identifies the page | http://www.biodiversitylibrary.org/page/13297 |
| kws | Keywords matched | Pterichis, Heliconius |
| fam | Whether the word contained in “kws” is a plant name, a butterfly name, or a keyword | Keyword, Plantae, Animalia |
| start | Position at which the word starts within the page text | 146 |
| end | Position at which the word ends within the page text | 162 |
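A file with these columns could be written as in the sketch below. We assume one row per matched word and a comma-separated layout; both choices, and the helper names, are illustrative assumptions.

```python
import csv
import io

COLUMNS = ["item.id", "page.id", "kws", "fam", "start", "end"]
BHL_BASE = "http://www.biodiversitylibrary.org"


def make_row(item_id, page_id, word, fam, start, end):
    """Build one row; fam is "Keyword", "Plantae" or "Animalia"."""
    return {
        "item.id": f"{BHL_BASE}/item/{item_id}",
        "page.id": f"{BHL_BASE}/page/{page_id}",
        "kws": word,
        "fam": fam,
        "start": start,
        "end": end,
    }


def write_rows(rows, handle):
    """Write a header line followed by one line per row."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


# Usage: write to an in-memory buffer (a real run would open a file instead).
buffer = io.StringIO()
write_rows([make_row(699, 13297, "Heliconius", "Animalia", 146, 156)], buffer)
```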

 

We then added four (4) columns for manual validation and extraction of the information. We first classified each page into one of four (4) categories based on the type of information it contained: “index”, “reference”, “announcement” or “text”.

 

We ignored all pages classified as “index”, “reference” or “announcement”, because they do not contain information about host plants. We manually evaluated all pages classified as “text”, which contained species descriptions, field records, biology, catalogue descriptions, etc., all of them with potential host plant records.

 

We defined four (4) actions: “ready”, when the content of a text page had been verified and any association information present had been extracted; “to postpone”, when a text page dealt with associations other than host plant ones (mimicry, parasitism, etc.); “ignore”, when the page was classified as “index”, “reference” or “announcement”; and “to process”, for all text pages not yet verified.
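The rule linking the page category to its initial action can be written down as a small function. This is only a sketch of the rule described above, with a hypothetical function name; the “ready” and “to postpone” values were assigned by hand after reading each text page.

```python
IGNORED_TYPES = {"index", "reference", "announcement"}


def default_action(document_type):
    """Return the initial action for a page, given its category."""
    if document_type in IGNORED_TYPES:
        return "ignore"      # no host plant information expected
    if document_type == "text":
        return "to process"  # waits for manual verification
    raise ValueError(f"unknown document type: {document_type!r}")
```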

 

| Column name | Description | Example of content |
| --- | --- | --- |
| Document type | Type of information contained in the page | “index”, “announcement”, “text” |
| Action | Whether the page should be manually processed or not | “ignore”, “to process”, “ready”, “to postpone” |
| Responsible | Person who processed the page | “ASM”, “CL”, “LZ” |
| Date | Date when the page was validated | “20130701” |

 

 
