std_recordedBy

Standardize Collector Names in Biodiversity Records
barRoso::std_recordedBy()

Description

Cleans and standardizes the recordedBy and recordNumber fields in biodiversity collection data, consolidating collector names and removing inconsistencies across herbarium records. The function identifies and formats collector initials, extracts the main collector name, and handles multilingual and complex name structures, including multiple collectors, Asian names written in non-Latin (Unicode) scripts, and Brazilian surname conventions.

Details

This function is part of the barRoso package. It supports reconciliation of biodiversity records, especially the resolution of collector name discrepancies across duplicate specimens. When multiple collectors are detected, a new column addCollector is created, in which secondary collectors are recorded as "et al.". The original columns can be either preserved or overwritten.

Specifically, this function performs extensive string cleaning including:

  • Converting names written in non-Latin Unicode scripts (e.g., Chinese) to Latin script

  • Parsing and normalizing collector names split by &, and, e, y, ;, |, etc.

  • Handling cases of one, two, or more collectors

  • Cleaning spacing, punctuation, and known collector aliases

  • Adding standardized initials or removing redundant suffixes (e.g., “et al.”)
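
The separator-splitting step listed above can be illustrated with a minimal base R sketch. This is not the barRoso implementation: split_collectors() is a hypothetical helper, the separator set is limited to the ones named in the list, and commas are deliberately not treated as separators because they often sit between a surname and its initials.

```r
# Hypothetical helper, NOT part of barRoso: splits one raw collector string
# on common separators and reports the main collector plus a flag for others.
split_collectors <- function(x) {
  # Assumed separators: "&", ";", "|", or the words "and", "e", "y"
  # when surrounded by whitespace (case-sensitive, to avoid matching initials).
  sep <- "\\s*[&;|]\\s*|\\s+(?:and|e|y)\\s+"
  parts <- trimws(strsplit(x, sep, perl = TRUE)[[1]])
  parts <- parts[nzchar(parts)]  # drop empty fragments
  list(
    main = if (length(parts) > 0) parts[1] else NA_character_,
    # Mirror the documented behavior: extra collectors are summarized as "et al."
    additional = if (length(parts) > 1) "et al." else NA_character_
  )
}

res <- split_collectors("Silva, J. & Souza, M.")
res$main        # main collector
res$additional  # "et al." when more than one collector was found
```

A real implementation would also need to handle aliases, initials, and transliteration, which this sketch leaves out.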

Arguments

Argument              Description
df                    A data frame containing biodiversity records.
colname_recordedBy    Column name for the main collector (default: "recordedBy").
colname_recordNumber  Column name for the collector number (default: "recordNumber").
rm_original_column    Logical; if TRUE, original columns are removed after cleaning. If FALSE, they are retained with *Original suffixes (default: FALSE).

Value

A data frame with cleaned and harmonized collector name fields. A new column addCollector is added where additional collectors are identified.

Examples

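library(barRoso)
# Read the raw herbarium records
df <- read.csv("herbarium_records.csv")
# Standardize collector names and numbers, keeping the original columns
df_clean <- std_recordedBy(df,
                           colname_recordedBy = "coletor",
                           colname_recordNumber = "num_coleta",
                           rm_original_column = FALSE)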