Cleaning GBIF Data
This article walks through a practical case study using the barRoso package to standardize and harmonize plant specimen records downloaded from the Global Biodiversity Information Facility (GBIF) portal.
Goal
Perform a full cleaning workflow for GBIF records using barRoso, including:
- Standardizing collector names and record numbers
- Harmonizing taxonomic and geographic fields
- Flagging and filtering duplicate specimens
- Generating a cleaned dataset ready for downstream use
Step 1: Download GBIF Records
Use the rgbif package to download occurrence records. Here’s an example for the Fabaceae family:
# install.packages("rgbif")
library(rgbif)
occ <- occ_search(scientificName = "Fabaceae", limit = 1000, hasCoordinate = TRUE)
# Keep only the occurrence table from the search result
df <- occ$data
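Before cleaning, it can help to confirm that the fields used later in the workflow are present, since the columns returned for a query depend on what the matching records contain. The following is a minimal base-R sketch on a toy data frame; the helper name missing_fields and the toy values are hypothetical and not part of rgbif or barRoso.

```r
# Toy stand-in for an occurrence table (illustrative values only)
toy <- data.frame(
  scientificName = c("Inga edulis Mart.", "Mimosa pudica L."),
  recordedBy     = c("Silva, A.", "Souza, B."),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the expected column names that are absent from x
missing_fields <- function(x, fields) {
  setdiff(fields, names(x))
}

# Fields referenced later in this article
expected <- c("scientificName", "recordedBy", "recordNumber")
absent <- missing_fields(toy, expected)
print(absent)
```

On real data, the same call would be missing_fields(df, expected), with the field list adjusted to whatever the rest of the analysis needs.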
Step 2: Load barRoso and Standardize Records
# install.packages("devtools")
devtools::install_github("DBOSlab/barRoso")
library(barRoso)
cleaned <- barroso_std(df,
                       flag_duplicates = TRUE,
                       rm_duplicates = FALSE)
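To give a sense of what flagging duplicate specimens means, here is a minimal base-R sketch that marks records sharing the same collector and record number. It is an illustration only and is not barRoso's algorithm; the column dup_flag and the toy values are assumptions made for this example.

```r
# Toy records: the first two share collector and record number
toy <- data.frame(
  recordedBy   = c("Silva, A.", "Silva, A.", "Souza, B.", "Silva, A."),
  recordNumber = c("101", "101", "7", "102"),
  stringsAsFactors = FALSE
)

# Build a simple key from collector and record number
key <- paste(toy$recordedBy, toy$recordNumber, sep = "|")

# Flag every record whose key occurs more than once,
# including the first occurrence of each repeated key
toy$dup_flag <- duplicated(key) | duplicated(key, fromLast = TRUE)
print(toy)
```

An exact key match like this only works once collector names and numbers are written consistently, which is why standardization comes before duplicate detection.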
Step 3: Review Standardized Output
table(cleaned$duplicate)
head(cleaned[, c("recordedBy", "recordNumber", "duplicate")])
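A common next step is to split flagged records from the rest so they can be reviewed separately. The sketch below assumes the duplicate column is logical, which this article does not confirm; checking it with str() on the real output first is advisable. The toy data frame stands in for the cleaned table.

```r
# Toy stand-in for the standardized output (the logical `duplicate` column is an assumption)
toy_cleaned <- data.frame(
  recordedBy   = c("Silva, A.", "Silva, A.", "Souza, B.", "Silva, A."),
  recordNumber = c("101", "101", "7", "102"),
  duplicate    = c(TRUE, TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Count flags, showing missing values if any are present
flag_counts <- table(toy_cleaned$duplicate, useNA = "ifany")
print(flag_counts)

# which() drops NA flags, so neither subset gains all-NA rows
flagged   <- toy_cleaned[which(toy_cleaned$duplicate), ]
unflagged <- toy_cleaned[which(!toy_cleaned$duplicate), ]
```

Whether the flagged subset should be dropped entirely or reduced to one representative per duplicate set depends on how the flag is defined, so it is worth inspecting a few flagged rows before filtering.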
Step 4: Save Results
write.csv(cleaned, "gbif_cleaned.csv", row.names = FALSE)
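When the file is read back later, read.csv guesses column types by default, which can turn a record number such as "007" into the number 7. The sketch below is a round-trip check on a toy data frame written to a temporary file; reading every column as character is one way to avoid that, and is a suggestion rather than part of the barRoso workflow.

```r
# Toy table with a record number that has leading zeros
toy <- data.frame(
  recordedBy   = c("Silva, A.", "Souza, B."),
  recordNumber = c("007", "12"),
  stringsAsFactors = FALSE
)

# Write to a temporary file so the example leaves nothing behind
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)

# Read every column as character to preserve identifiers exactly
back <- read.csv(path, colClasses = "character", stringsAsFactors = FALSE)
print(identical(back$recordNumber, toy$recordNumber))

unlink(path)
```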
Summary
In this case study, we:
- Programmatically downloaded GBIF data for a plant family
- Used barRoso to standardize and harmonize specimen metadata
- Flagged duplicates and prepared a clean dataset for biodiversity research
You can apply this pipeline to other sources such as SEINet, REFLORA Virtual Herbarium, JABOT, or speciesLink.