Revision of EOL data search from Mon, 2013-03-04 11:20

We used a tentative global checklist that we assembled by collating data from authoritative checklists and on-line resources (Checklist of Checklists).

Step 1

For each butterfly species in the checklist we did a preliminary automatic search for data objects with information in text format.

We found 39,116 text objects for 11,730 butterfly species.

Step 2

Then, we used regular expression matching to identify the text objects that contain keywords related to biotic associations in general (parasitoid, ant) or hostplants in particular. We did a preliminary search using the following list of 18 keywords:

“ant”

“ant association”

“egg”

“fabaceae”

“feed on”

“feeding”

“foodplant”

“host”

“host ant”

“host plant”

“legume”

“mimetic”

“mimicry”

“ovipos”

“parasite”

“parasitoid”

“plant”

“poaceae”

All keywords were used in lower case.
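A minimal Python sketch of this keyword search is given here. It is not the script that was actually used: the names KEYWORDS, PATTERNS and match_keywords are hypothetical, and anchoring each keyword at the start of a word is an assumption. With that choice “ant” is not triggered by “plant”, while the stem “ovipos” still matches “oviposition”.

```python
import re

# The 18 keywords of the preliminary search, in lower case.
KEYWORDS = [
    "ant", "ant association", "egg", "fabaceae", "feed on", "feeding",
    "foodplant", "host", "host ant", "host plant", "legume", "mimetic",
    "mimicry", "ovipos", "parasite", "parasitoid", "plant", "poaceae",
]

# Assumption: a keyword must begin at a word boundary but may be followed
# by more letters, so stems such as "ovipos" behave as prefixes.
PATTERNS = {kw: re.compile(r"\b" + re.escape(kw)) for kw in KEYWORDS}


def match_keywords(text):
    """Return the keywords found in the text, in the order of KEYWORDS."""
    lowered = text.lower()
    return [kw for kw, pattern in PATTERNS.items() if pattern.search(lowered)]


if __name__ == "__main__":
    example = "Larvae feed on Eriogonum; the host plant is a buckwheat."
    print(" :: ".join(match_keywords(example)))
```

An unanchored search would be simpler, but short keywords such as “ant” would then also match inside unrelated words.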

Step 3

We built a file with the following information:

Column name | Description | Example of content
arch | Archive name | rslt_58645205b4ca632339783f5fa5775cec4723fa7a_11566465.json
nmbr | Scientific name in EOL | Philotiella leona
Id | Identification code assigned to the EOL web page | 11566465
ntaxa | Number of taxon concepts associated with the name given in “nmbr” | 2
j | Number assigned to each | 1
mT | MIME type or data type | text/html
dR | Data rating | 2.5
long | Number of characters in the text object | 277
title | Title that describes the object content | Statistics of barcoding coverage: Graphium illyris
title_2 | Title reclassified into a short label | “Barcoding”, “Conservation”, “Introduction”, “Distribution and Abundance”, “General comments”, “Acraea machequena”
ag | Agents | NatureServe
vS | Vetted status | Trusted
kws | Keywords matched | host :: host plant :: feed on
val | Scientific name from the checklist | Philotiella leona
genera | Genera assigned in the checklist |
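As an illustration of how one row of this file could be handled, here is a short Python sketch. The class TextObjectRow and the helpers split_keywords and id_from_archive are hypothetical names. The archive-name pattern is inferred only from the single example given for “arch”, where the trailing number equals the “Id” value, so it should be treated as an assumption.

```python
import re
from dataclasses import dataclass, field


@dataclass
class TextObjectRow:
    """A subset of the columns described above (hypothetical container)."""
    arch: str
    nmbr: str
    Id: int
    long: int
    kws: list = field(default_factory=list)


def split_keywords(kws_field):
    """Split a kws value such as 'host :: host plant :: feed on'."""
    return [part.strip() for part in kws_field.split("::") if part.strip()]


def id_from_archive(arch):
    """Read the trailing number of an archive name, or None if it does not fit.

    Assumption: names look like rslt_<hexadecimal string>_<Id>.json.
    """
    found = re.fullmatch(r"rslt_[0-9a-f]+_(\d+)\.json", arch)
    return int(found.group(1)) if found else None


if __name__ == "__main__":
    name = "rslt_58645205b4ca632339783f5fa5775cec4723fa7a_11566465.json"
    row = TextObjectRow(
        arch=name,
        nmbr="Philotiella leona",
        Id=id_from_archive(name),
        long=277,
        kws=split_keywords("host :: host plant :: feed on"),
    )
    print(row)
```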

Step 4

Then, we will evaluate this file, which holds the information about the text objects found, following a flowchart approach.

Question 1: Did the text objects have content? How many text objects have content (long > 0), how many are empty files (long = 0), and how many files were incorrectly downloaded?

Question 2: Did the keywords match? How many text objects with content matched at least one of the keywords used? We will call this subset the “matching subgroup” hereafter.

Question 3: Is the association already in our database? How many objects from the matching subgroup provide information for species already recorded in our database?

Question 4: Do the objects have information on other themes? To evaluate the effectiveness of the keywords used, we have to look in more detail at the subgroup of text objects that did not match any keyword, the “not-matching subgroup”.

There are two possible reasons for this disagreement:

1) The object does not have information about biotic association.

2) The object has information about biotic association, but refers to it with words not contained in our initial keyword list.

To evaluate whether the objects had information about biotic association, we first checked the list of titles (title_2 column) and decided which of them were likely to contain such information. Some more specific titles clearly describe subjects other than biotic association, such as barcoding, distribution or phylogeny. Thus we selected the objects with the following values in the title_2 column and excluded them from our initial analysis:

"Barcoding"

"Distribution and Abundance"

"Conservation"

"Other Comments"

"Geographical distribution"

"nomenclature"

"Discussion of Phylogenetic Relationships"

This results in a subgroup of objects that we are going to exclude from subsequent analysis.
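The questions of this step can be sketched as a small classification routine in Python. This is only an illustration under stated assumptions: rows are taken to be dictionaries keyed by the column names of Step 3, an incorrectly downloaded file is assumed to show up as a missing “long” value, and the names classify, tally and known_species (the species already in our database) are hypothetical.

```python
from collections import Counter

# Values of title_2 that are excluded from the initial analysis.
EXCLUDED_TITLES = {
    "Barcoding", "Distribution and Abundance", "Conservation",
    "Other Comments", "Geographical distribution", "nomenclature",
    "Discussion of Phylogenetic Relationships",
}


def classify(row, known_species=frozenset()):
    """Walk one row through Questions 1 to 4 and return its subgroup."""
    length = row.get("long")
    if length is None:            # Question 1: assumed sign of a failed download
        return "not downloaded"
    if length == 0:               # Question 1: empty file
        return "empty"
    if row.get("kws"):            # Question 2: at least one keyword matched
        if row.get("val") in known_species:   # Question 3
            return "matching, species in database"
        return "matching, species not in database"
    if row.get("title_2") in EXCLUDED_TITLES:  # Question 4
        return "not matching, excluded by title"
    return "not matching, keep for review"


def tally(rows, known_species=frozenset()):
    """Count how many rows fall into each subgroup."""
    return Counter(classify(row, known_species) for row in rows)
```

Calling tally on the full file would give the counts asked for in Questions 1 to 3 in a single pass.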

To evaluate the second possibility, that the keyword list used is incomplete, we created a subgroup of the objects that we suspect could contain information of interest. Most of the objects do not have any title (20,573), while an important group has titles referring to species names, either with general terms (e.g. “names”, 1,049 objects) or with specific species or genus names (e.g. “Telipna sheffieldi”, “Bibla”, 731 objects). Other titles are “General comments” (61), “Introduction” (331), and “General introduction” (48). All these titles are ambiguous, and the objects may or may not contain information about biotic associations. There were also objects with more specific titles, such as “Ecology and Biology” (224), “Life cycle” (95), and “Association” (70), that are likely to contain information about hostplants. So we are going to select all the objects with such titles and evaluate a random sample of 1%.
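The counts quoted above sum to 23,182 objects, so a 1% sample is about 232 objects. A minimal Python sketch of drawing such a sample follows. The names TITLE_COUNTS, sample_size and draw_sample are hypothetical, and the fixed seed is only an assumption made so that the draw can be repeated.

```python
import random

# Object counts per title group, as quoted in the text above.
TITLE_COUNTS = {
    "(no title)": 20573,
    "names": 1049,
    "species or genus name": 731,
    "General comments": 61,
    "Introduction": 331,
    "General introduction": 48,
    "Ecology and Biology": 224,
    "Life cycle": 95,
    "Association": 70,
}


def sample_size(n_objects, fraction=0.01):
    """Number of objects to draw: the given fraction, but at least one."""
    return max(1, round(n_objects * fraction))


def draw_sample(objects, fraction=0.01, seed=0):
    """Draw a simple random sample without replacement."""
    pool = list(objects)
    k = min(len(pool), sample_size(len(pool), fraction))
    return random.Random(seed).sample(pool, k)


if __name__ == "__main__":
    total = sum(TITLE_COUNTS.values())
    print(total, sample_size(total))
```

A simple random sample will consist mostly of untitled objects, because they make up most of the subgroup. Sampling 1% within each title group separately would be an alternative.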

Step 5

The evaluation consisted of a manual search of information in EOL for 230 species. For each butterfly species, we recorded the keywords found in the document and the plant genera or species reported. Of these, 63% definitely did not have any information about biotic interactions. A further 32% had at least one of the keywords used, but no match was found because the file did not contain any characters (long = 0); we presumed that a download error had occurred and that we had to download the information for these species again. The remaining objects provided the following list of new keywords:

“attracted to”, “deposited on”, “eggs”, “fruits”, “grasses”, “herbaceous”, “hosts”, “laid in”, “laid on”, “larvae”, “larval”, “mimics”, “reported on”

Thus, we are going to include them to improve the search, and we will also include the names of all plant families recorded in our database. We will replicate the keyword list in English, Spanish, French and German, because most of the texts were written in those languages.
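A short Python sketch of how the extended keyword list could be assembled is given here. The function names are hypothetical, and the plant families passed in the example are only the two that already appear in the original list. No translations are shown, since they are not part of this text. The helper covered_by_substring relies on the assumption that keywords are matched as substrings, in which case some of the new keywords add nothing.

```python
ORIGINAL = [
    "ant", "ant association", "egg", "fabaceae", "feed on", "feeding",
    "foodplant", "host", "host ant", "host plant", "legume", "mimetic",
    "mimicry", "ovipos", "parasite", "parasitoid", "plant", "poaceae",
]

NEW = [
    "attracted to", "deposited on", "eggs", "fruits", "grasses",
    "herbaceous", "hosts", "laid in", "laid on", "larvae", "larval",
    "mimics", "reported on",
]


def merge_keywords(*lists):
    """Join keyword lists in lower case, dropping exact duplicates."""
    merged, seen = [], set()
    for keywords in lists:
        for keyword in keywords:
            key = keyword.strip().lower()
            if key and key not in seen:
                seen.add(key)
                merged.append(key)
    return merged


def covered_by_substring(new, original):
    """New keywords that already contain an original keyword."""
    return [kw for kw in new if any(old in kw for old in original)]


if __name__ == "__main__":
    families = ["Fabaceae", "Poaceae"]  # placeholder for the database list
    print(len(merge_keywords(ORIGINAL, NEW, families)))
    print(covered_by_substring(NEW, ORIGINAL))
```

Under the substring assumption only “eggs” and “hosts” are redundant, so the other eleven new keywords would widen the search.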
