Thursday, February 26, 2015

Stone Age to Industrial Age: The evolution of eBird's filter system

Did you ever wonder why eBird asks for details about a report of American Dipper from Adams County, Colorado, but lets the same report from Park County, Wyoming, sail through without a twitch? I just want to say one word to you; just one word... "filters."

A filter is what creates the list of species (and non-species entries) that you see when you enter a checklist into eBird. Imagine, with the current world-wide scope of eBird, having to rummage through the entire list of the world's 10,000± species of birds just to enter the seven species that you saw in a nine-minute jaunt through your yard. Ugh. No one would use eBird. Thus, filters were created from the program's inception in order to streamline the data-entry process.

Filters have at least a couple of other purposes, which go hand in hand with the primary purpose. The first of these is to alert the eBirder entering data that a species is not expected at the location, which helps point out possible data-entry errors. I've done it; you've done it. You meant to enter an American Bittern, but you put the number in the box for Least Bittern. Oops. eBird will point that out to you when it asks you to confirm the entry, as the species is rare in this region. Filters also involve abundance limits, which, again, enable the software to point out possible data-entry errors. This one I've done numerous times with my fat fingers. You meant to enter 1 American Bittern, but you accidentally also hit the zero on the number pad, resulting in 10 American Bitterns. With the filter set at something less than 10, eBird will point out that error when it asks you to confirm the number, which is atypical for anywhere in this region.

In the beginning (2002), eBird was a very simple and simplistic world. The first iterations of filters were state-/province-based things (the program was originally restricted to the U.S. and Canada) that provided gross estimates of numbers acceptable for that state in each of the 12 months of the calendar. They were created as Excel spreadsheets, with species down and months across, each cell filled with an integer that was a gross approximation of what was thought to be the realistic maximum that one might encounter in a day's birding in the state/province, no matter where in the state/province. Abundant species, such as Red-winged Blackbird, might have a cell (or each of the 12 cells) filled with 100,000.
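
To make that spreadsheet layout concrete, here is a minimal Python sketch of a month-based filter. It is not eBird's code: the dictionary layout, the function name, the American Bittern numbers, and the choice to flag any species missing from the table are all assumptions made for illustration. Only the 100,000 for Red-winged Blackbird comes from the description above.

```python
from datetime import date

# Hypothetical statewide filter: species -> 12 monthly maxima (Jan..Dec).
# The American Bittern values are invented for illustration only.
MONTHLY_LIMITS = {
    "Red-winged Blackbird": [100_000] * 12,
    "American Bittern": [0, 0, 0, 2, 5, 5, 5, 5, 3, 1, 0, 0],
}


def needs_review(species, count, obs_date, limits=MONTHLY_LIMITS):
    """Return True if the entry should be flagged for confirmation."""
    monthly = limits.get(species)
    if monthly is None:
        # Assumption: a species absent from the filter is always flagged.
        return True
    # One cell per calendar month; any count above the cell is flagged.
    return count > monthly[obs_date.month - 1]


if __name__ == "__main__":
    # The fat-finger case: 10 bitterns entered instead of 1.
    print(needs_review("American Bittern", 10, date(2015, 5, 10)))
    print(needs_review("American Bittern", 1, date(2015, 5, 10)))
```

Note how a cell of 100,000 lets nearly any number through without review, which is exactly the ramification that such generous cell entries carry.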

Chris Wood (currently the eBird Project Manager, but then "just" a Colorado birder with a strong knowledge of the state's bird distribution) and I constructed the first Colorado filter in 2001 and we did not really think through the ramifications of those cell entries. We did not consider data-entry errors. We did not consider what those cell entries might mean regarding permitting some fairly outlandish numbers to skip the review process. At the time, there were no non-species entries. That is, no spuhs, slashes, hybrids, subspecies. There were just species.

As eBird has become more refined with much more capacity and capability (thanks to Jeff Gerbracht and the rest of eBird's architect team), filters have become incredibly more complex. First was the separation of the statewide filter into regional filters, using the county as the basic block, at least in the Lower 48. Circa 2005, Chris and I divided Colorado's statewide filter into five regional filters in which we lumped counties with similar avifauna: Northeast, Southeast, Mountains, Northwest, and Southwest. That, obviously, required some fine-tuning of each of those five filters to more-closely match each subregion's avifauna, such as excluding Northern Bobwhite from the three western filters and excluding Gunnison Sage-Grouse from the two eastern filters and the Northwest filter.

Next was the addition of various above-species-level entries, the spuhs (e. g., "goose sp.") and the slashes (e. g., "Semipalmated/Western Sandpiper"). That meant going through each of the five then-extant filters and adding those non-species entries relevant to each filter, which I did (Chris was now working at eBird) on a fairly conservative basis, putting in just the really common non-specific entries, such as "Snow/Ross's Goose" and "accipiter sp." That wasn't too bad. Tedious, but not too bad, and at the time, I was the only person working on Colorado's eBird filters. Due to Wyoming's low human population (thus birder population), the state did not then have a resident filter meister; the Stone Age (statewide) filter was established and maintained at eBird Central at the Cornell Lab of Ornithology.

At some point after the addition of non-species-level entries, I started splitting the five Colorado filter regions into smaller ones in order to account for the complexity of bird distribution in the state, such as taking the species-rich and well-birded counties of Boulder and Larimer out of the northeastern filter and constructing a new filter that could be more tightly focused on that smaller region's avifauna. With available time, the number of Colorado filter regions grew, cracking double digits and never looking back.

In 2007, with the addition of Marshall Iliff as the third member of the eBird management team (Team eBird; Brian Sullivan and Chris Wood being the other two), eBird's abilities expanded further, with a more-in-depth taxonomy that was to cover the entire planet (2010). (I find it very interesting that Brian, Chris, and Marshall all worked for me at Rocky Mountain Bird Observatory at various times and that all worked on the same project in Mexico with me one year!) Hybrids were added, as were many, many, many more non-species entries, such that there are now nearly as many non-species-level entries available in the ABA area as species-level entries, some used exceedingly rarely, some widely used.

Then, eBird tackled the "April problem." Those of us in the filter and record-review aspects of eBird had for years complained that the rigid monthly structure to the filters made for some major problems, with April being the poster child for such problems. In much of the ABA area, particularly the Lower 48, filter makers/editors had to decide between filtering out all occurrences of a migrant species that arrived in the filter region in the last few days of April, or allowing all occurrences in the month of such species, even in early April when they were unknown. In Colorado, MacGillivray's Warbler is an excellent case in point, with the vast bulk of migrants arriving in May and a small number typically noted in the last week of April, but with the species unknown in the state prior to the 22nd or so.

The solution was to throw out the monthly framework, replacing it with up to 13 individually adjustable time periods. The new system allowed chopping up, particularly, the short, intense spring migration of most migrant species into periods as small as five days, with each period allowed its own abundance limit (a number that is "permitted," with any larger number of birds of that species in that time period requiring review). As an example, the Lincoln County, Colorado, filter has five filter periods covering the spring migration of Clay-colored Sparrow, each with its own abundance limit (in parentheses): 22-30 April (1), 1-7 May (9), 8-14 May (29), 15-21 May (15), and 22-31 May (9). The filters are also now constructed online, a move nearly enforced by the new system (Fig. 1); think about trying to construct an Excel spreadsheet that does this.
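
The Clay-colored Sparrow periods above can be written down as a small lookup table. The Python below is only a sketch of the idea, not how eBird stores or evaluates filters: the tuple layout and function names are assumptions, and only the five spring periods quoted above are modeled, so dates outside them return no limit at all rather than a guess.

```python
from datetime import date

MAX_PERIODS = 13  # the per-entry cap on time periods described above

# (start (month, day), end (month, day), abundance limit), both ends inclusive.
# Spring periods for Clay-colored Sparrow in the Lincoln County filter.
CLAY_COLORED_SPRING = [
    ((4, 22), (4, 30), 1),
    ((5, 1), (5, 7), 9),
    ((5, 8), (5, 14), 29),
    ((5, 15), (5, 21), 15),
    ((5, 22), (5, 31), 9),
]


def limit_for(obs_date, periods):
    """Return the abundance limit for the period containing obs_date."""
    key = (obs_date.month, obs_date.day)
    for start, end, limit in periods:
        if start <= key <= end:
            return limit
    return None  # date not covered by the periods modeled in this sketch


def needs_review(count, obs_date, periods=CLAY_COLORED_SPRING):
    """Return True if count exceeds the limit for that date's period."""
    limit = limit_for(obs_date, periods)
    if limit is None:
        raise ValueError("date is outside the periods modeled here")
    return count > limit


if __name__ == "__main__":
    # Ten birds need review in late April but not in mid-May.
    print(needs_review(10, date(2015, 4, 25)))
    print(needs_review(10, date(2015, 5, 10)))
```

The same count of ten birds is treated differently two weeks apart, which is what a single monthly cell could not do.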


Figure 1. Abundance and seasonal limits of a variety of sparrow species (and non-species entries), January through about 15 August in Lincoln County, Colorado, as indicated by the current eBird filter for that county. Note that the short lengths of some of the temporal periods of the filter mean that values of more than one digit are partially hidden by the scroll arrows in each period (see Clay-colored Sparrow). Note that there are three non-species entries, two subspecies of Brewer's Sparrow (neither of which is allowed without review, as identification is quite difficult) and a spuh (Spizella sp.). One can quickly determine from this filter those species of this selection that are known to breed in the county, just Cassin's and Vesper sparrows. Note also that the individual abundance limits of the spuh entry allow for all temporally-occurring species of the genus, as it is fairly easy to imagine most of a large flock of mixed Spizella not being identified to species. Finally, the grayed '+' buttons indicate entries (Chipping Sparrow, Clay-colored Sparrow, Spizella sp.) for which all of the possible 13 temporal periods are used; that is, the occurrence patterns represented in the filter cannot be further divided.

As something of an aside, an abundance limit is the result of a decision about a tenuous balance between the number that might occur and catching data-entry errors, and such decisions need to be made for as many as 13 temporal periods for each of the species and non-species entries in each filter (the current Boulder County, Colorado, filter -- split off from the Boulder/Larimer filter a few years back -- lists 413 species and 231 non-species entries).

While the new filter system allows great flexibility in constructing species- and location-specific filter limits, it is also much more complex and much more time-consuming to construct. It takes me something like 5-20 hours of tedious effort per filter, whether constructing a new filter from scratch or completely overhauling an existing filter.

In the early years of this decade, big changes came about in the Wyoming birding community and in Wyoming eBird review and filters. First, Shawn Billerman arrived to attend to graduate studies at University of Wyoming. Before arrival, though, he had already been shanghaied by Team eBird (he did come straight from school at Cornell, so was already known by the powers-that-be) into tackling the state's eBird review. James Maley arrived a year later (working at University of Wyoming) and was quickly added to what was then a two-person team. Perhaps more importantly, though, they took on the task of bringing Wyoming eBird into the Industrial Age, as far as filters go. They divided the statewide filter into seven filter regions (Fig. 2).

Figure 2. The seven Wyoming filter regions.

Meanwhile, Colorado, with its considerably more-substantial eBird data set (and, perhaps, a filter meister with a wee bit more time on his hands), has 37 filter regions. As depicted in Figure 3, these regions are, generally, individual counties, though with a few two-county regions. The large, multi-county regions are in the process of being broken up into smaller units, with the Southeast region being an excellent example. Just in late February 2015, I have split the old large and unwieldy version of the Southeast filter into three regions: Baca, Crowley and Otero, and the rest of the counties. The San Luis Valley region, though large, is fairly homogeneous, so will not be broken up into smaller regions until some rather major changes in how eBird deals with filter regions come to pass (see below).

Figure 3. The 37 Colorado filter regions.

As it always has been, the primary impetus to establish a new filter region is to fine-tune filters to the landscape and the avifaunal occurrence (both spatial and temporal) thereon. Perhaps one of the best examples of the need for such new filters is Phillips County, Colorado (Fig. 3). At the time that I constructed the filter, the county was covered by the general Northeast filter (which also included Weld at the time). Note that all of those counties but Phillips have at least part of a major water body in them, while the largest water body in Phillips is probably a sewage pond (at Haxtun). Thus, Phillips County data were "allowed" to include large numbers of waterbird species that are actually fairly rare there.

However, just because a particular filter covers just one county does not mean that there aren't still difficult decisions about filtering to be made, and Wyoming is the epitome of that problem. Because of the state's low human population, its counties are overly large. The state's geography is also more varied and, well, eccentric. The combination of these two factors means that, unlike eastern Colorado, most of Wyoming's counties contain both low-elevation "flat" lands and high mountains, which makes for a wonderfully varied avifauna from a birding standpoint, but a nightmare from an eBird-filter point of view.

One of the best examples of this problem is Big Horn County, which combines the spine of the Big Horn Mountains (and the associated suite of subalpine forest species) and the low-elevation Big Horn Basin (and the associated grassland and shrubsteppe species). Because one would not expect to encounter Long-billed Curlew in the forest near the pinnacle of the Big Horn Mountains nor White-winged Crossbill in the open country northeast of Greybull, eBird really should not "allow" such occurrences. However, with county-based filter regions, such problems are encountered frequently. In the long term, the solution is filter regions based on physiognomy and habitat, not geopolitical boundaries -- what we might term Space Age filters. Though there have been arbitrarily defined filter regions in the past (e. g., most or all of the California coastal counties have "offshore" regions), Team eBird is moving in that direction in a general fashion. However, the process will be long, involved, and tedious, and will not happen tomorrow.

As more and more data are entered into eBird, the data sets for individual filter regions become more robust and allow for more-precise filter limits and temporal periods, so reviewers are constantly fine-tuning filters. However, as noted above, this is a slow process. Thus, you may encounter remnants of previous filter strategies mixed with more-"modern" strategies on individual filters in both states when entering data into eBird, simply because no one has found the free time to completely revamp older filters. So, when you encounter something that a filter flags that you believe should not be flagged, please let us know (politely!) in the species's comment field. That goes the same for occurrences that you feel ought to be flagged, but aren't. Most of you know your local area much better than any of us do; we're happy to learn such bits of information, particularly in Wyoming, which has many fewer data backing the various filters.

As eBird continues to grow in popularity, demands on its programming, design, and operations will also grow. With the very recent addition of Ian Davies to the management team and the resultant spreading of the workload at eBird Central, we can expect further enhancements, even radically new capability, to come online in the near-term future. I cannot wait to see the changes!

[This essay is largely based on a previous version that I posted to Cobirds, the Colorado birds listserve. Thanks to Team eBird and various members of the Colorado and Wyoming review teams for comments on an earlier version of this essay.]

Tony Leukering

Lead Colorado eBird reviewer and senior author of the Colorado & Wyoming column in North American Birds

1 comment:

  1. Tony, thanks for this detailed explanation. The amount of time that you and other behind-the-scenes volunteers have spent on the excellent filtering system is truly astounding. As the site-specific databases become more robust, I hope that the eBird team will consider instituting "smart filters" which would use artificial intelligence to determine what is normal or unusual (i.e. filterable). eBird already does this to some extent with the hotspot-specific bar charts and the "more likely" feature for entering checklists (which lists unreported species, and species reported on <10% of checklists, separately). An observation that represents a filterable outlier could be flagged if it is outside a "normal" temporal or numeric range, where normal is defined as 90% (conservative) or 99% (liberal) of sightings in the database for that hotspot. Every outlier that is flagged and accepted upon further review by moderators would automatically adjust the "normal" limits. I believe this code would be feasible, and while it should not replace the current filter system, it could take priority within hotspots that have sufficient data. To illustrate the utility of such a system, consider Golden-crowned Kinglet at Lake Loveland. This rare sighting (there are only 1 or 2 in the database) should always be flagged, but the current county-based system overlooks it, because the species is "normal" elsewhere in the county. Here is another illustration: California Gull concentrations of over 400 birds are flagged throughout Larimer County. However, bigger concentrations are normal at a handful of hotspots - landfill, Boyd Lake, Lake Loveland, Horseshoe Lake. A "smart filter" would determine what is "normal" and what should be flagged for each hotspot based on the individual experience of each hotspot. And "normal" will change automatically over time in parallel with the bird populations themselves.
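
     The numeric half of that suggestion might look something like the Python sketch below. It is purely illustrative and is not an eBird feature: the function names, the nearest-rank percentile, the `min_records` cutoff of 30, and the fallback to the county limit are all assumptions. It covers only the count threshold, not the temporal range or species that are rare at a hotspot.

```python
def percentile_limit(counts, coverage_pct=99):
    """Nearest-rank percentile: the smallest historical count that is
    at least as large as coverage_pct percent of all historical counts."""
    ordered = sorted(counts)
    # Ceiling division done in integers to avoid floating-point surprises.
    rank = max(1, -(-coverage_pct * len(ordered) // 100))
    return ordered[rank - 1]


def smart_needs_review(count, hotspot_counts, county_limit,
                       coverage_pct=99, min_records=30):
    """Flag a count using hotspot history when enough of it exists."""
    if len(hotspot_counts) < min_records:
        # Too few records at this hotspot: use the county-wide limit.
        return count > county_limit
    return count > percentile_limit(hotspot_counts, coverage_pct)


if __name__ == "__main__":
    # Invented history for a hotspot where large gull flocks are routine.
    history = list(range(10, 1010, 10))  # 100 counts: 10, 20, ..., 1000
    print(smart_needs_review(600, history, county_limit=400))
    print(smart_needs_review(600, [500, 700], county_limit=400))
```

     Appending each accepted record to a hotspot's history would shift its limit over time, which is the self-adjusting behavior described in the comment.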
