EOL data search

Step 1: Automatic search

We used a tentative global checklist that we assembled by collating data from authoritative checklists and on-line resources (Checklist of Checklists).

For each butterfly species in the checklist we did a preliminary automatic search for data objects with information in text format.

We found 39,116 text objects from 11,730 butterfly species.

Step 2: Key word matching

Next, we used regular expression matching to identify the text objects that contain keywords related to biotic associations in general (parasitoid, ant) or to host plants in particular. We did a preliminary search using the following list of 18 keywords:

“ant”, “ant association”, “egg”, “fabaceae”, “feed on”, “feeding”, “foodplant”, “host”, “host ant”, “host plant”, “legume”, “mimetic”, “mimicry”, “ovipos”, “parasite”, “parasitoid”, “plant”, and “poaceae”

All keywords were used in lower case.
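The matching step could be sketched as follows. This is a minimal illustration, not the script actually used: the function names `match_keywords` and `format_kws` are hypothetical, and the rule that a keyword must start at a word boundary (so that “ovipos” matches “oviposition” but “ant” does not match “plant”) is an assumption, since the exact pattern is not given here.

```python
import re

# The 18 keywords listed above, all in lower case.
KEYWORDS = [
    "ant", "ant association", "egg", "fabaceae", "feed on", "feeding",
    "foodplant", "host", "host ant", "host plant", "legume", "mimetic",
    "mimicry", "ovipos", "parasite", "parasitoid", "plant", "poaceae",
]

# Assumption: a keyword must begin at a word boundary but may be a stem.
PATTERNS = {kw: re.compile(r"\b" + re.escape(kw)) for kw in KEYWORDS}


def match_keywords(text):
    """Return the keywords found in `text`, in the order of KEYWORDS."""
    lowered = text.lower()  # keywords are lower case, so lower-case the text too
    return [kw for kw, pattern in PATTERNS.items() if pattern.search(lowered)]


def format_kws(matched):
    """Join matched keywords with ' :: ', as in the example of the kws column."""
    return " :: ".join(matched)
```

With this rule, short keywords such as “ant” still match any word that begins with those letters, so the matches should be treated as candidates to review rather than as confirmed associations.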


Step 3: Summary file

We built a file with the following information:

Column name  Description                                                         Example of content
-----------  ------------------------------------------------------------------  ------------------
arch         Archive name                                                        rslt_58645205b4ca632339783f5fa5775cec4723fa7a_11566465.json
nmbr         Scientific name in EOL                                              Philotiella leona
Id           Identification code assigned to the EOL web page                    11566465
ntaxa        Number of taxon concepts associated with the name given in “nmbr”   2
j            Number assigned to each                                             1
mT           Mime type or data type                                              text/html
dR           Data rating                                                         2.5
long         Number of characters in the text object                             277
title        Title that describes the object content                             Statistics of barcoding coverage: Graphium illyris
title_2      Title reclassified into a short label                               "Barcoding", "Conservation", "Introduction", "Distribution and Abundance", "General comments", "Acraea machequena"
ag           Agents                                                              NatureServe
vS           Vetted status                                                       Trusted
kws          Keywords matched                                                    host :: host plant :: feed on
val          Scientific name from the checklist                                  Philotiella leona
genera       Genera assigned in the checklist


Step 4: Evaluation

In this step we wanted to know how much information we had and, specifically, how many text documents we needed to analyze to extract the required information. Because of the high number of objects, we decided to categorize them in order of importance, based on the probability that they contain the required information. We identified three subgroups of text documents:

  • Excluded subgroup
  • Matching subgroup
  • Non-matching subgroup

We also wanted to quantify how effective our preliminary search and keyword matching were, and how we could improve our selection criteria to include as much information as possible. To do that, we followed a flowchart approach:

Question 1: Did the text objects have content? How many text objects have content (long > 0), and how many are empty files (long = 0) or files that were incorrectly downloaded? The latter two are called the “excluded subgroup” hereafter.

Question 2: Did the keywords match? How many text objects with content matched at least one of the keywords used? We call this subset the “matching subgroup” hereafter.

Question 3: Is the association already in our database? How many objects from the matching subgroup provide information for species already recorded in our database?

Question 4: Do the objects contain information on other themes? To evaluate the effectiveness of the keywords used, we have to look in more detail at the subgroup of text objects that did not match any keyword, the “non-matching subgroup”.

There are two possible reasons for this disagreement:

1) The object does not have information about biotic association.

2) The object has information about biotic associations, but it is described with words not contained in our initial keyword list.

To evaluate whether the objects had information about biotic associations, we first checked the list of titles (title_2) and decided which of them were likely to contain such information. Some specific titles clearly describe subjects other than biotic associations, such as barcoding, distribution or phylogeny. We therefore selected the objects with the following values in the column title_2 and excluded them from our initial analysis:

"Barcoding", "Distribution and Abundance", "Conservation", "Discussion of Phylogenetic Relationships", "Other Comments", "Geographical distribution", "nomenclature"

The resulting subgroup of objects will be excluded from subsequent analysis.
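Questions 1, 2 and 4 can be sketched as a small classification routine over the rows of the summary file. This is only an illustration under stated assumptions: each row is taken to be a dictionary keyed by the column names above, a missing `long` value is treated as a failed download, and the function names and the sample rows are hypothetical.

```python
from collections import Counter

# Titles judged unrelated to biotic associations (list given above).
OFF_TOPIC_TITLES = {
    "Barcoding", "Distribution and Abundance", "Conservation",
    "Discussion of Phylogenetic Relationships", "Other Comments",
    "Geographical distribution", "nomenclature",
}


def classify(record):
    """Assign one summary-file row to a subgroup."""
    length = record.get("long")
    if length is None or length <= 0:
        return "excluded"      # Question 1: empty file or failed download
    if record.get("kws"):
        return "matching"      # Question 2: at least one keyword matched
    return "non-matching"      # examined further in Question 4


def has_off_topic_title(record):
    """True if title_2 names a subject other than biotic associations."""
    return record.get("title_2") in OFF_TOPIC_TITLES


# Made-up rows, for illustration only.
rows = [
    {"long": 277, "kws": "host :: host plant", "title_2": "Introduction"},
    {"long": 0, "kws": "", "title_2": ""},
    {"long": 120, "kws": "", "title_2": "Barcoding"},
    {"long": 95, "kws": "", "title_2": "Life cycle"},
]
counts = Counter(classify(row) for row in rows)
```

Keeping the title check separate from `classify` makes it possible to count how many non-matching objects are set aside because of their title and how many remain as candidates for re-evaluation.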

To evaluate the second possibility, that the keyword list used is incomplete, we created a subgroup with the objects that we suspected could contain information of interest.

Most of the objects do not have any title (20,573), while an important group has titles referring to species names, either in general terms (e.g. “names”, 1,049 objects) or specific ones such as species or genus names (e.g. "Telipna sheffieldi", "Bibla", 731 objects), or general headings: “General comments” (61), “Introduction” (331), and “General introduction” (48). All these titles are ambiguous, and the objects may or may not contain information about biotic associations. There were also objects with more specific titles, such as “Ecology and Biology” (224), “Life cycle” (95), and “Association” (70), that are likely to contain information about host plants.

We will therefore select all the objects with such titles and re-evaluate a random sample of 1%.
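Drawing the 1% sample could look like the sketch below. The function name, the fixed seed and the rule of sampling at least one object are assumptions made for illustration; the text does not say how the sample was actually drawn.

```python
import random


def sample_for_reevaluation(object_ids, fraction=0.01, seed=0):
    """Draw a reproducible simple random sample, without replacement."""
    ids = list(object_ids)
    if not ids:
        return []
    # Assumption: always re-evaluate at least one object.
    k = max(1, round(len(ids) * fraction))
    # A dedicated generator with a fixed seed makes the draw repeatable.
    return random.Random(seed).sample(ids, k)
```

Fixing the seed means the same objects are selected if the protocol is rerun on the same list, which helps when comparing results before and after the keyword list is revised.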

Step 5: Reevaluation

The reevaluation consisted of a manual search for information in EOL for 230 species. For each butterfly species, we recorded the keywords found in the document and the plant genera or species reported.

  • 63% of the objects definitely did not have any information about biotic interactions,
  • 32% had at least one of the keywords used, but no match was found because the file did not contain any characters (long = 0).

We presumed that a download error had occurred, so the information for these species had to be downloaded again.

The remaining objects provided the following list of new keywords:

“attracted to”, “deposited on”, “eggs”, “fruits”, “grasses”, “herbaceous”, “hosts”, “laid in”, “laid on”, “larvae”, “larval”, “mimics”, and “reported on”.

We will therefore add these to our list, together with the names of the most frequent host plant families. The keyword list should be replicated in English, Spanish, French and German, because most of the texts were written in those languages.

Then we will repeat the protocol from step 1 to step 4 and evaluate how many text objects are added by the improved keyword list.
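The gain from the extended list could be measured along the lines of the sketch below, which merges the two English lists and reports the texts matched only by the new keywords. It is an illustration under assumptions: the function names are hypothetical, keywords are assumed to match from the start of a word, and the host plant families and the Spanish, French and German terms are not included because they are not listed here.

```python
import re

ORIGINAL_KEYWORDS = [
    "ant", "ant association", "egg", "fabaceae", "feed on", "feeding",
    "foodplant", "host", "host ant", "host plant", "legume", "mimetic",
    "mimicry", "ovipos", "parasite", "parasitoid", "plant", "poaceae",
]
NEW_KEYWORDS = [
    "attracted to", "deposited on", "eggs", "fruits", "grasses",
    "herbaceous", "hosts", "laid in", "laid on", "larvae", "larval",
    "mimics", "reported on",
]


def merge_keywords(*lists):
    """Merge keyword lists, lower-casing and dropping exact duplicates."""
    merged = {}
    for keywords in lists:
        for kw in keywords:
            merged.setdefault(kw.lower(), None)  # dicts keep insertion order
    return list(merged)


def matches_any(text, keywords):
    """True if any keyword starts at a word boundary in `text` (assumed rule)."""
    lowered = text.lower()
    return any(re.search(r"\b" + re.escape(kw), lowered) for kw in keywords)


def newly_matched(texts, old_keywords, new_keywords):
    """Texts missed by the old list but caught by the new one."""
    return [t for t in texts
            if not matches_any(t, old_keywords) and matches_any(t, new_keywords)]
```

Under the word-start assumption, plural forms such as “eggs” and “hosts” are already caught by “egg” and “host”, so most of the gain would be expected from the genuinely new terms such as “attracted to” or “reported on”.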

