Detect and Flag Duplicates
Overview
Duplicate specimens are common in herbarium collections due to specimen exchange among institutions. Identifying and flagging them is crucial for avoiding data inflation and ensuring analytical accuracy. The function barroso_flag_duplicates() provides a fast and flexible way to detect potential duplicates based on collector name, collection number, taxon, and collection date.
Function: barroso_flag_duplicates()
Purpose
This function flags duplicate specimens by comparing fields that typically indicate specimen identity and collection event.
Fields used in comparison:
- recordedBy (collector name)
- recordNumber (collection number)
- year, month, day
- family, genus, specificEpithet
You can run this function on its own or as part of barroso_std().
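The comparison described above can be approximated in base R. The following is only a minimal sketch of the general idea, not the internal implementation of barroso_flag_duplicates(), which this page does not describe. The toy data and the column name duplicate_sketch are made up for illustration.

```r
# Toy records; column names follow the comparison fields listed above.
specimens <- data.frame(
  recordedBy      = c("Silva, J.", "Silva, J.", "Costa, M.", "Silva, J."),
  recordNumber    = c("101", "101", "55", "102"),
  year            = c(1998, 1998, 2003, 1998),
  month           = c(5, 5, 11, 5),
  day             = c(12, 12, 3, 12),
  family          = c("Fabaceae", "Fabaceae", "Rubiaceae", "Fabaceae"),
  genus           = c("Inga", "Inga", "Psychotria", "Inga"),
  specificEpithet = c("edulis", "edulis", "nuda", "vera"),
  stringsAsFactors = FALSE
)

key_fields <- c("recordedBy", "recordNumber", "year", "month", "day",
                "family", "genus", "specificEpithet")
keys <- specimens[, key_fields]

# duplicated() alone marks only the later repeats of a key; scanning from
# both ends marks every row whose key occurs more than once.
specimens$duplicate_sketch <- duplicated(keys) |
  duplicated(keys, fromLast = TRUE)

specimens[specimens$duplicate_sketch, c("recordedBy", "recordNumber")]
```

Exact matching like this misses records that differ only in spelling or formatting, which is one reason to standardize fields before flagging.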
Example
library(barRoso)
# Load sample specimen dataset
df <- read.csv("raw_herbarium_data.csv")
# Detect duplicates and add flag column
df_flagged <- barroso_flag_duplicates(df,
rm_duplicates = FALSE)
# View flagged rows
subset(df_flagged, duplicate == TRUE)
Arguments
- df: A data frame of herbarium records
- rm_duplicates: Logical; if TRUE, duplicates will be removed (default: FALSE)
Output
Returns the same data frame with an additional column:
- duplicate: Logical column indicating whether a row is a suspected duplicate
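Once the duplicate column exists, a quick summary and a sorted view make the flagged rows easier to review. The sketch below uses a hand-built stand-in for the function's output, so it runs without the package. The flag values are invented, and it assumes every member of a duplicate group is flagged, which this page does not confirm.

```r
# Hand-built stand-in for the output of barroso_flag_duplicates();
# the `duplicate` values here are assigned manually for illustration.
df_flagged <- data.frame(
  recordedBy   = c("Silva, J.", "Costa, M.", "Silva, J.", "Lima, A."),
  recordNumber = c("101", "55", "101", "7"),
  duplicate    = c(TRUE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Count flagged versus unflagged rows
table(df_flagged$duplicate)

# Keep flagged rows and sort them so records sharing a collector and
# collection number sit next to each other
flagged <- df_flagged[df_flagged$duplicate, ]
flagged <- flagged[order(flagged$recordedBy, flagged$recordNumber), ]
flagged
```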
Best Practices
- Run this step after standardizing fields using barroso_std()
- Check for misspellings in recordedBy and inconsistencies in dates before trusting flags
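One way to spot spelling variants in recordedBy is to reduce each name to a simplified form and list the raw spellings that collapse together. This is a rough sketch under assumed conventions. normalize_collector() is a hypothetical helper, not part of barRoso, and it may differ from whatever barroso_std() does.

```r
# Hypothetical helper (not a barRoso function): lower-case the name,
# replace punctuation with spaces, and collapse repeated whitespace.
normalize_collector <- function(x) {
  x <- tolower(trimws(x))
  x <- gsub("[[:punct:]]", " ", x)
  x <- gsub("[[:space:]]+", " ", x)
  trimws(x)
}

collectors <- c("Silva, J.", "silva J", " SILVA,  J. ", "Costa, M.")
normalized <- normalize_collector(collectors)

# Raw spellings grouped by normalized form; groups with more than one
# spelling are candidates for manual correction.
variants <- split(collectors, normalized)
variants[lengths(variants) > 1]
```

On real data, pass unique(df$recordedBy) rather than the full column so that each spelling is listed once.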
Integration Tip
Run barroso_flag_duplicates() with rm_duplicates = FALSE first, so that you can inspect the flagged records before removing them.
# To keep only unique records:
df_clean <- barroso_flag_duplicates(df, rm_duplicates = TRUE)
See Also
- barroso_std(): integrates duplicate detection automatically
- Standardize Specimen Records