Cleaning GBIF Data

This article walks through a practical case study using the barRoso package to standardize and harmonize plant specimen records downloaded from the Global Biodiversity Information Facility (GBIF) portal.

Goal

Perform a full cleaning workflow for GBIF records using barRoso, including:

  • Standardizing collector names and record numbers
  • Harmonizing taxonomic and geographic fields
  • Flagging and filtering duplicate specimens
  • Generating a cleaned dataset ready for downstream use

Step 1: Download GBIF Records

Use the rgbif package to download occurrence records. Here’s an example for the Fabaceae family:

# install.packages("rgbif")
library(rgbif)

occ <- occ_search(scientificName = "Fabaceae", limit = 1000, hasCoordinate = TRUE)
df <- occ$data

Step 2: Load barRoso and Standardize Records

# install.packages("devtools")
devtools::install_github("DBOSlab/barRoso")

library(barRoso)

cleaned <- barroso_std(df,
                       flag_duplicates = TRUE,
                       rm_duplicates = FALSE)

Step 3: Review Standardized Output

table(cleaned$duplicate)
head(cleaned[, c("recordedBy", "recordNumber", "duplicate")])

Step 4: Save Results

write.csv(cleaned, "gbif_cleaned.csv", row.names = FALSE)

Summary

In this case study, we:

  • Programmatically downloaded GBIF data for a plant family
  • Used barRoso to standardize and harmonize specimen metadata
  • Flagged duplicates and prepared a clean dataset for biodiversity research

You can apply this pipeline to other sources such as SEINet, REFLORA Virtual Herbarium, JABOT, or speciesLink.