---
title: "Detect and Flag Duplicates"
format:
  html:
    toc: true
    code-fold: true
    page-layout: full
execute:
  echo: true
  warning: false
  message: false
---

## Overview

Duplicate specimens are common in herbarium collections because institutions routinely exchange specimens. Identifying and flagging them keeps a single collection event from being counted more than once, which would otherwise inflate record counts and distort downstream analyses. The function `barroso_flag_duplicates()` detects potential duplicates by comparing collector name, collection number, taxon, and collection date.

## Function: `barroso_flag_duplicates()`

### Purpose
This function flags duplicate specimens by comparing fields that typically indicate specimen identity and collection event.

### Fields used in comparison:
- `recordedBy` (collector name)
- `recordNumber` (collection number)
- `year`, `month`, `day`
- `family`, `genus`, `specificEpithet`

You can run this function on its own or as part of `barroso_std()`, which integrates duplicate detection automatically.
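
To make the comparison concrete, the base R sketch below flags rows that share identical values across the fields listed above. It only illustrates the idea and is not the package's implementation: the helper name `flag_by_key()` and the toy data are hypothetical, and `barroso_flag_duplicates()` may match records differently.

```r
# Hypothetical helper: flags every row whose key fields match another row.
# This is NOT barRoso's implementation, only a sketch of exact-key matching.
flag_by_key <- function(df,
                        keys = c("recordedBy", "recordNumber",
                                 "year", "month", "day",
                                 "family", "genus", "specificEpithet")) {
  stopifnot(all(keys %in% names(df)))
  key_df <- df[, keys, drop = FALSE]
  # duplicated() marks later repeats; fromLast = TRUE marks earlier ones,
  # so combining both flags every member of a duplicate group.
  df$duplicate <- duplicated(key_df) | duplicated(key_df, fromLast = TRUE)
  df
}

# Invented records: rows 1 and 2 are the same gathering held by two herbaria
toy <- data.frame(
  recordedBy      = c("Silva, J.", "Silva, J.", "Costa, M."),
  recordNumber    = c("102", "102", "55"),
  year            = c(1998, 1998, 2004),
  month           = c(3, 3, 11),
  day             = c(14, 14, 2),
  family          = c("Fabaceae", "Fabaceae", "Myrtaceae"),
  genus           = c("Inga", "Inga", "Eugenia"),
  specificEpithet = c("edulis", "edulis", "uniflora"),
  herbarium       = c("RB", "NY", "RB")
)

toy_flagged <- flag_by_key(toy)
toy_flagged$duplicate
```

In this toy data the first two rows are flagged and the third is not. Exact matching of this kind is sensitive to spelling and formatting differences, which is why standardizing fields first matters.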

## Example

```r
library(barRoso)

# Load sample specimen dataset
df <- read.csv("raw_herbarium_data.csv")

# Detect duplicates and add flag column
df_flagged <- barroso_flag_duplicates(df,
                                      rm_duplicates = FALSE)

# View flagged rows
subset(df_flagged, duplicate == TRUE)
```

## Arguments

- `df`: A data frame of herbarium records
- `rm_duplicates`: Logical; if `TRUE`, duplicates will be removed (default: `FALSE`)

## Output
Returns the same data frame with an additional column:

- `duplicate`: Logical column indicating whether a row is a suspected duplicate
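
Once the flag column exists, a quick tally helps judge how much of the dataset is affected. The sketch below uses an invented stand-in for the function's output; only the `duplicate` column name comes from this page, and the other columns and values are assumptions for illustration.

```r
# Stand-in for the function's output; values are invented for illustration.
df_flagged <- data.frame(
  recordedBy   = c("Silva, J.", "Silva, J.", "Costa, M.", "Lima, A."),
  recordNumber = c("102", "102", "55", "7"),
  duplicate    = c(TRUE, TRUE, FALSE, FALSE)
)

# How many records are flagged, and what share of the dataset is that?
n_flagged   <- sum(df_flagged$duplicate, na.rm = TRUE)
pct_flagged <- 100 * n_flagged / nrow(df_flagged)

# Group flagged rows by collector and number so suspected duplicates
# sit next to each other during manual inspection
flagged <- df_flagged[which(df_flagged$duplicate), ]
flagged <- flagged[order(flagged$recordedBy, flagged$recordNumber), ]
flagged
```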

## Best Practices
- Run this step **after** standardizing fields using `barroso_std()`
- Check for misspellings in `recordedBy` and inconsistencies in dates before trusting flags
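
The checks above can be sketched in base R. This is a hypothetical pre-check that does not use any barRoso function: it assumes `recordedBy`, `year`, `month`, and `day` columns as named on this page, and the data are invented.

```r
# Hypothetical pre-check, independent of barRoso: surface collector-name
# variants and impossible dates before running duplicate detection.
df <- data.frame(
  recordedBy = c("Silva, J.", "silva,  j.", "Costa, M.", "Costa, M."),
  year       = c(1998, 1998, 2004, 2004),
  month      = c(3, 3, 11, 2),
  day        = c(14, 14, 2, 30),
  stringsAsFactors = FALSE
)

# Normalize case and whitespace so trivially different spellings collapse
name_key <- tolower(gsub("\\s+", " ", trimws(df$recordedBy)))

# Keys with more than one raw spelling are candidates for cleanup
variants     <- tapply(df$recordedBy, name_key, function(x) length(unique(x)))
variant_keys <- names(variants[variants > 1])

# With an explicit format, as.Date() yields NA for impossible dates
# such as 30 February, so NA marks rows whose date needs review
date_str  <- sprintf("%04d-%02d-%02d", df$year, df$month, df$day)
parsed    <- as.Date(date_str, format = "%Y-%m-%d")
bad_dates <- which(is.na(parsed))
```

Here `variant_keys` points to the collector written two ways and `bad_dates` to the row dated 30 February. Catching only case and whitespace differences is a deliberate simplification; true misspellings would need fuzzy matching or manual review.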

## Integration Tip
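Run `barroso_flag_duplicates()` with `rm_duplicates = FALSE` first and inspect the flagged rows. Once you are satisfied with the flags, rerun it with `rm_duplicates = TRUE` to remove them.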

```r
# To keep only unique records:
df_clean <- barroso_flag_duplicates(df, rm_duplicates = TRUE)
```

## See Also
- `barroso_std()` – integrates duplicate detection automatically
- [Standardize Specimen Records](/articles/standardize_specimens.qmd)
 