BHL data search

The Biodiversity Heritage Library (BHL) is a consortium of natural history and botanical libraries that cooperate to digitize and make accessible the legacy literature of biodiversity held in their collections and to make that literature available for open access and responsible use as a part of a global “biodiversity commons.” Read more about BHL here.

BHL provides metadata at four levels: titles (records for original monographs or journals), items (volume-level records), parts (records for articles, chapters, treatments, etc.) and individual pages. We focused our search on items and pages.

Step 1: Automatic search

For each butterfly species in the checklist we did a preliminary automatic search in BHL for pages where the species was mentioned. We made use of the BHL API (version 2) function NameGetDetail. This function matches the given name string to a Name Bank ID and returns basic title, item and page metadata for each page on which the specified name appears. Please note that users are required to obtain an API Key in order to use the BHL API.
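As a minimal sketch of the request described above, the snippet below builds a NameGetDetail URL for one species name. The endpoint address and the parameter names (`op`, `name`, `format`, `apikey`) are assumptions based on the usual layout of BHL API v2 queries and should be checked against the API documentation; the species name and the key are placeholders.

```python
from urllib.parse import urlencode

# Assumed address of the BHL API v2 HTTP query endpoint.
BHL_API2 = "https://www.biodiversitylibrary.org/api2/httpquery.ashx"


def name_get_detail_url(name, api_key):
    """Build a NameGetDetail request URL for one species name."""
    params = {
        "op": "NameGetDetail",  # API function named in the protocol
        "name": name,           # assumed parameter name for the name string
        "format": "json",       # assumed parameter for the response format
        "apikey": api_key,      # personal API key (placeholder below)
    }
    return BHL_API2 + "?" + urlencode(params)


# Illustrative species name and placeholder key.
url = name_get_detail_url("Heliconius erato", "YOUR-API-KEY")
print(url)
```

The URL can then be fetched with any HTTP client; the request itself is left out so that the sketch runs without network access or a valid key.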

We summarized the results of the API call for each species, and built a list of page IDs. We then used the BHL API function GetPageOcrText to download the text file for each one of these pages.
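The summarizing step might look like the sketch below, which walks one NameGetDetail result and collects the unique page IDs, then builds a GetPageOcrText URL for each. The nesting `Titles` > `Items` > `Pages` > `PageID`, the endpoint address and the `pageid` parameter are assumptions about the API v2 response and query format, and the sample result is invented for illustration.

```python
from urllib.parse import urlencode

# Assumed address of the BHL API v2 HTTP query endpoint.
BHL_API2 = "https://www.biodiversitylibrary.org/api2/httpquery.ashx"


def collect_page_ids(result):
    """Return the sorted, de-duplicated page IDs in one NameGetDetail result."""
    page_ids = set()
    for title in result.get("Titles") or []:
        for item in title.get("Items") or []:
            for page in item.get("Pages") or []:
                page_ids.add(page["PageID"])
    return sorted(page_ids)


def ocr_text_url(page_id, api_key):
    """Build a GetPageOcrText request URL for one page."""
    params = {"op": "GetPageOcrText", "pageid": page_id,
              "format": "json", "apikey": api_key}
    return BHL_API2 + "?" + urlencode(params)


# Hypothetical, heavily trimmed result; real responses carry more fields.
sample = {
    "Titles": [
        {"TitleID": 1, "Items": [
            {"ItemID": 699, "Pages": [{"PageID": 13297}, {"PageID": 13298}]}]},
        {"TitleID": 2, "Items": [
            {"ItemID": 700, "Pages": [{"PageID": 13297}]}]},
    ]
}

page_ids = collect_page_ids(sample)
ocr_urls = [ocr_text_url(pid, "YOUR-API-KEY") for pid in page_ids]
print(page_ids)
```

Collecting the IDs in a set avoids downloading the same page twice when it is reported under more than one title.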

Step 2: Key word matching

We used regular expression matching to search for keywords within the text of each page. We used several complete and partial keywords that we consider indicative of species associations, and recorded their location within the text. We discarded pages without keyword matches.
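A minimal sketch of this step is shown below. The keyword list is hypothetical, since the protocol does not list the keywords actually used, and the page texts are invented examples. Partial keywords such as `larva` also match longer words such as "larvae".

```python
import re

# Hypothetical keywords; the protocol does not list the ones actually used.
KEYWORDS = ["host", "food ?plant", "larva", "feed"]
KEYWORD_RE = re.compile("|".join(KEYWORDS), re.IGNORECASE)


def find_keywords(text):
    """Return (matched text, start, end) for every keyword match in a page."""
    return [(m.group(0), m.start(), m.end()) for m in KEYWORD_RE.finditer(text)]


# Invented OCR texts keyed by page ID.
pages = {
    13297: "The larvae feed on Passiflora, the usual foodplant.",
    13298: "Index to volume 12.",
}

matches = {page_id: find_keywords(text) for page_id, text in pages.items()}
# Pages without any keyword match are discarded.
kept = {page_id: found for page_id, found in matches.items() if found}
print(kept)
```

The start and end values are character offsets into the page text, which is one possible reading of the "start" and "end" positions recorded in the output file.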

Step 3: Species names

For every page with at least one keyword, we requested the complete list of species names found in that page using the BHL API function GetPageNames. For each name found, we recorded its location within the text.
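Locating the names in the text could be sketched as follows. The example assumes the names returned by GetPageNames have already been reduced to a plain list of strings (the response format is not shown in this protocol), and that each name appears in the OCR text exactly as returned, which may not hold for pages with OCR errors.

```python
import re


def locate_names(text, names):
    """Return (name, start, end) for each occurrence of each name, in text order."""
    hits = []
    for name in names:
        # re.escape keeps punctuation in names from being read as regex syntax.
        for m in re.finditer(re.escape(name), text):
            hits.append((name, m.start(), m.end()))
    return sorted(hits, key=lambda hit: hit[1])


# Invented page text and name list for illustration.
text = "Heliconius erato larvae feed on Passiflora."
names = ["Passiflora", "Heliconius erato"]

hits = locate_names(text, names)
print(hits)
```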

We built a file with the following information:

Column name	Description	Example of content
item.id	URL identifying the item (the scanned monograph or journal volume)	http://www.biodiversitylibrary.org/item/699
page.id	URL identifying the page	http://www.biodiversitylibrary.org/page/13297
kws	Keyword or species name matched	Pterichis, Heliconius
fam	Whether the word in “kws” is a plant name, a butterfly name, or a keyword	Keyword, Plantae, Animalia
start	Starting position of the word within the page text	146
end	Ending position of the word within the page text	162
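The file layout above can be sketched with Python's `csv` module. The choice of one row per matched word and of a tab-delimited format are assumptions, as are the example values; only the column names and the URL patterns come from the table.

```python
import csv
import io

COLUMNS = ["item.id", "page.id", "kws", "fam", "start", "end"]


def make_row(item_id, page_id, word, fam, start, end):
    """Build one output row for a single matched keyword or name."""
    return {
        "item.id": f"http://www.biodiversitylibrary.org/item/{item_id}",
        "page.id": f"http://www.biodiversitylibrary.org/page/{page_id}",
        "kws": word,
        "fam": fam,
        "start": start,
        "end": end,
    }


# Invented matches for one page.
rows = [
    make_row(699, 13297, "Heliconius", "Animalia", 146, 156),
    make_row(699, 13297, "food plant", "Keyword", 210, 220),
]

# Written to an in-memory buffer here; a real run would open a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS, delimiter="\t",
                        lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
table_text = buffer.getvalue()
print(table_text)
```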

Then we added four (4) columns for the manual validation and extraction of information. We first classified each page into one of four (4) categories based on the type of information it contained.

We ignored all pages classified as “index”, “reference” or “announcement”, because they do not contain information about host plants. We manually evaluated all pages classified as “text”, which contained species descriptions, field records, biology, catalogue descriptions, etc., all of them with potential host plant records.

We defined four (4) different actions: “ready” when the content of a text page was verified and the association information, if present, was extracted; “to postpone” when the content of a text page was about associations other than host plant ones (mimicry, parasitism, etc.); “ignore” when the page was classified as “index”, “reference” or “announcement”; “to process” for all text pages not yet verified.

Column name	Description	Example of content
Document type	Type of information contained in the page	“index”, “reference”, “announcement”, “text”
Action	Whether the page should be manually processed or not	“ignore”, “to process”, “ready”, “to postpone”
Responsible	Person who processed the page	“ASM”, “CL”, “LZ”
Date	Date when the page was validated	“20130701”
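The relationship between document type and action described in this section can be summarized as a small function. This is only a sketch of the rules as stated: the function name and the two boolean flags are hypothetical, and in practice the decision was made manually by the person processing the page.

```python
IGNORED_TYPES = {"index", "reference", "announcement"}


def choose_action(document_type, verified=False, other_association=False):
    """Map a page's document type and review status to one of the four actions."""
    if document_type in IGNORED_TYPES:
        return "ignore"
    if document_type != "text":
        raise ValueError(f"unknown document type: {document_type!r}")
    if not verified:
        # Text pages that nobody has checked yet.
        return "to process"
    # Verified text pages: postpone those about mimicry, parasitism, etc.
    return "to postpone" if other_association else "ready"


print(choose_action("index"))
print(choose_action("text"))
print(choose_action("text", verified=True))
print(choose_action("text", verified=True, other_association=True))
```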
