Detect and Flag Duplicates
Overview
Duplicate specimens are common in herbarium collections due to specimen exchange among institutions. Identifying and flagging them is crucial for avoiding data inflation and ensuring analytical accuracy. The function barroso_flag_duplicates() provides a fast and flexible way to detect potential duplicates based on collector name, collection number, taxon, and collection date.
Function: barroso_flag_duplicates()
Purpose
This function flags duplicate specimens by comparing fields that typically indicate specimen identity and collection event.
Fields used in comparison:
- recordedBy (collector name)
- recordNumber (collection number)
- year, month, day
- family, genus, specificEpithet
You can run this function on its own or as part of barroso_std().
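To make the comparison concrete, here is a minimal base-R sketch of flagging rows that share the same values in the fields listed above. It illustrates the general idea only and is not the internals of barroso_flag_duplicates(), which may normalize values, handle missing data, or decide which members of a group to flag differently. The helper name flag_by_key() and the toy records are hypothetical.

```r
# Hypothetical helper: flag every row whose key fields match another row.
# This is NOT the barRoso implementation, only the underlying idea.
flag_by_key <- function(df, keys) {
  key_df <- df[, keys, drop = FALSE]
  # duplicated() marks the second and later occurrences; scanning from the
  # end as well marks the first occurrence, so the whole group is flagged.
  df$duplicate <- duplicated(key_df) | duplicated(key_df, fromLast = TRUE)
  df
}

keys <- c("recordedBy", "recordNumber", "year", "month", "day",
          "family", "genus", "specificEpithet")

# Toy records (invented): rows 1 and 3 describe the same collection event.
toy <- data.frame(
  recordedBy      = c("A. Silva", "B. Costa", "A. Silva"),
  recordNumber    = c("101", "7", "101"),
  year            = c(2001, 1998, 2001),
  month           = c(5, 11, 5),
  day             = c(12, 3, 12),
  family          = c("Fabaceae", "Myrtaceae", "Fabaceae"),
  genus           = c("Inga", "Eugenia", "Inga"),
  specificEpithet = c("edulis", "uniflora", "edulis"),
  stringsAsFactors = FALSE
)

toy_flagged <- flag_by_key(toy, keys)
toy_flagged$duplicate
```

Because the match is exact, any difference in spelling or formatting between two records of the same collection event would leave both unflagged in this sketch.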
Example
library(barRoso)
# Load sample specimen dataset
df <- read.csv("raw_herbarium_data.csv")
# Detect duplicates and add flag column
df_flagged <- barroso_flag_duplicates(df, rm_duplicates = FALSE)
# View flagged rows
subset(df_flagged, duplicate == TRUE)
Arguments
- df: A data frame of herbarium records
- rm_duplicates: Logical; if TRUE, duplicates will be removed (default: FALSE)
Output
Returns the same data frame with an additional column:
- duplicate: Logical column indicating whether a row is a suspected duplicate
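As a rough sketch of how the duplicate column might be used for review, the snippet below counts flagged rows and sorts them so that matching records sit next to each other. The data frame is a toy stand-in with invented values, not real output from barroso_flag_duplicates(), and it assumes every member of a duplicate group is flagged.

```r
# Toy stand-in for a flagged data frame; values are invented for illustration.
df_flagged <- data.frame(
  recordedBy   = c("A. Silva", "B. Costa", "A. Silva", "C. Lima"),
  recordNumber = c("101", "7", "101", "55"),
  duplicate    = c(TRUE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Count how many rows were flagged.
n_flagged <- sum(df_flagged$duplicate)

# Keep only flagged rows and sort them so matching records are adjacent.
flagged <- df_flagged[df_flagged$duplicate, ]
flagged <- flagged[order(flagged$recordedBy, flagged$recordNumber), ]
flagged
```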
Best Practices
- Run this step after standardizing fields using barroso_std()
- Check for misspellings in recordedBy and inconsistencies in dates before trusting flags
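To illustrate why the recordedBy check matters, the sketch below shows how small formatting differences can hide a repeated collector name from an exact comparison, and how a simple normalization exposes it. The helper normalize_collector() is hypothetical and is not part of barRoso; barroso_std() may apply different rules.

```r
# Hypothetical helper (not part of barRoso): reduce formatting noise in names.
normalize_collector <- function(x) {
  x <- trimws(x)             # drop leading and trailing whitespace
  x <- gsub("\\s+", " ", x)  # collapse runs of whitespace to one space
  tolower(x)                 # ignore differences in letter case
}

collectors <- c("A. Silva", "a. silva ", "A.  Silva", "B. Costa")

# Raw strings: no exact repeats, so an exact comparison finds no matches.
raw_matches <- duplicated(collectors)

# Normalized strings: the first three collapse to the same value.
clean_matches <- duplicated(normalize_collector(collectors))
```

This kind of cleanup only handles case and spacing. Genuine misspellings, such as transposed letters in a surname, still need manual review.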
Integration Tip
Use barroso_flag_duplicates() with rm_duplicates = FALSE to inspect the flagged duplicates before removing them.
# To keep only unique records:
df_clean <- barroso_flag_duplicates(df, rm_duplicates = TRUE)
See Also
- barroso_std(): integrates duplicate detection automatically
- Standardize Specimen Records