catGenes::mineTaxa()
Description
A comprehensive function built on the rentrez package to search, download, and process DNA sequences from GenBank using taxonomic query terms. It performs several operations, including taxon name cleaning, voucher information extraction, duplicate removal, and plastome separation. It is particularly useful for phylogenetic studies that require large-scale sequence retrieval with consistent naming conventions.
Details
The function performs the following steps:
- Connects to GenBank using the specified search term
- Downloads sequences (up to retmax records)
- If clean_taxa = TRUE:
  - Extracts and cleans organism names from each accession
  - Optionally adds voucher information
  - Standardizes naming format: Genus_species_voucher_accession
- Separates plastomes if plastome_apart = TRUE
- Removes duplicates if rm_duplicated = TRUE
- Saves results to the specified directory in a date-stamped subfolder
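The first two steps can be sketched with rentrez directly. This is a minimal illustration, not the internal code of mineTaxa(): the helper name fetch_fasta_sketch is hypothetical, and the use of the NCBI history server is an assumption made to keep the fetch request small.

```r
# Hypothetical helper, not part of catGenes: search GenBank and fetch FASTA text.
fetch_fasta_sketch <- function(term, db = "nucleotide", retmax = 4000) {
  if (!requireNamespace("rentrez", quietly = TRUE)) {
    stop("The 'rentrez' package is required for this sketch.")
  }
  # Search first; use_history = TRUE stores the hits on the NCBI history server
  # so that a long ID list does not have to be sent back in the fetch request.
  hits <- rentrez::entrez_search(db = db, term = term,
                                 retmax = retmax, use_history = TRUE)
  if (length(hits$ids) == 0) {
    return(character(0))
  }
  # Fetch the stored hits as one FASTA-formatted character string.
  rentrez::entrez_fetch(db = db, web_history = hits$web_history,
                        rettype = "fasta", retmax = retmax)
}

# Usage (requires network access):
# fasta_txt <- fetch_fasta_sketch("Solanaceae[Organism] AND matk", retmax = 20)
# cat(substr(fasta_txt, 1, 300))
```

For very large result sets, a single fetch may not be accepted by NCBI, so downloading in chunks (via the retstart argument) would likely be needed.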
Taxon name cleaning includes:
- Removing periods and hyphens
- Converting “var.”, “subsp.”, “f.” to simple spaces
- Converting spaces to underscores
- Trimming “sp.” designations
- Removing trailing underscores
- Handling hybrid cultivars
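The cleaning rules listed above can be approximated with a few base R regular expressions. This is a rough sketch under stated assumptions, not the package's actual implementation: the function name clean_taxon_sketch is hypothetical, "sp." is assumed to be dropped together with any text that follows it, and hybrid cultivars are not handled.

```r
# Hypothetical helper, not part of catGenes: approximate the name-cleaning rules.
clean_taxon_sketch <- function(x) {
  # Replace infraspecific rank markers ("var.", "subsp.", "f.") with one space.
  # This must run before periods are stripped, or the markers lose their dot.
  x <- gsub("\\s+(var|subsp|f)\\.\\s*", " ", x)
  # Assumption: drop "sp." and anything after it (e.g. an informal code).
  x <- sub("\\s+sp\\..*$", "", x)
  # Remove remaining periods and hyphens.
  x <- gsub("[.-]", "", x)
  # Convert runs of whitespace to single underscores.
  x <- gsub("\\s+", "_", trimws(x))
  # Remove any trailing underscores.
  sub("_+$", "", x)
}

clean_taxon_sketch(c(
  "Solanum lycopersicum var. cerasiforme",
  "Brassica rapa subsp. oleifera",
  "Capsicum sp. XY-12"
))
```

The order of the substitutions matters: rank markers and "sp." are matched while their periods are still present, and underscores are introduced only after all other characters have been settled.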
Arguments
| Argument | Description |
|---|---|
| term | Character string specifying the search query for GenBank. Should follow NCBI Entrez search syntax (e.g., “Solanaceae[Organism] AND matk”, “Arabidopsis[Organism] AND rbcL[Gene]”). See NCBI’s Entrez Help for detailed syntax. |
| db | Character string specifying the NCBI database to search. Default is “nucleotide”. Other options include “protein”, “popset”, etc. See entrez_search for details. |
| filename | Character string for the output FASTA file name. Default is “mined_seqs_by_taxon.fasta”. |
| clean_taxa | Logical. If TRUE (default), taxon names are cleaned and standardized (removes authors, subsp./var. indicators, special characters). If FALSE, original GenBank names are preserved. |
| add_voucher | Logical. If TRUE (default), voucher specimen information (when available) is extracted and appended to sequence names in the format: Genus_species_voucher_accession. If FALSE, voucher information is omitted. |
| original_query | Logical. If TRUE, a copy of the original unprocessed FASTA file is saved with an “_ORIGINAL_QUERY” suffix. If FALSE (default), only the cleaned file is kept. Ignored if clean_taxa = FALSE. |
| plastome_apart | Logical. If TRUE (default), complete plastome sequences are written to a separate file with a “PLASTOMES_” prefix. Useful for distinguishing whole plastomes from individual gene sequences. |
| rm_duplicated | Logical. If TRUE (default), removes duplicate sequences from the same species, keeping only the longest sequence. Applied separately to plastomes and other sequences. |
| retmax | Numeric. Maximum number of sequences to retrieve from GenBank. Default is 4000. Increase for large queries, but note NCBI rate limits. |
| verbose | Logical. If TRUE (default), progress messages are printed to the console. If FALSE, function runs silently. |
| save | Logical. If TRUE (default), results are saved to disk in the specified directory. If FALSE, results are returned as R objects without saving. |
| dir | Character string specifying the directory path for saving results. Default is “RESULTS_mineTaxa”. A subdirectory named with the current date is created within this directory. |
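The behaviour described for rm_duplicated (one sequence per species, keeping the longest) can be illustrated in base R. This is only a sketch: the function name keep_longest_per_species and its inputs (a named character vector of sequences plus a parallel vector of species labels) are assumptions made for the example, and the package may resolve ties or define "species" differently.

```r
# Hypothetical helper, not part of catGenes: keep the longest sequence per species.
keep_longest_per_species <- function(seqs, species) {
  stopifnot(length(seqs) == length(species))
  lens <- nchar(seqs)
  # Group sequence positions by species label.
  idx_by_species <- split(seq_along(seqs), species)
  # Within each group, pick the position of the longest sequence
  # (which.max returns the first one on ties).
  keep <- vapply(idx_by_species,
                 function(i) i[which.max(lens[i])],
                 integer(1))
  # Return the survivors in their original order.
  seqs[sort(unname(keep))]
}

# Toy data: two accessions for one species, one for another.
seqs <- c(acc1 = "ACGT", acc2 = "ACGTACGT", acc3 = "AC")
species <- c("Solanum_nigrum", "Solanum_nigrum", "Capsicum_annuum")
keep_longest_per_species(seqs, species)
```

In this toy input, acc2 is retained for Solanum_nigrum because it is longer than acc1, and acc3 is retained as the only Capsicum_annuum record.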
Value
If save = TRUE, invisibly returns the sequences, either as a list in DNAbin format (if clean_taxa = TRUE) or as raw FASTA text (if clean_taxa = FALSE), and writes the files to disk in both cases. If save = FALSE, returns the processed sequences as an R object without writing any files.
Examples
# Example 1: Basic search for matK sequences in Solanaceae (excluding Solanum)
result1 <- mineTaxa(
  term = "(Solanaceae[Organism] AND matk NOT Solanum[Organism])",
  filename = "Solanaceae_outSolanum.fasta"
)

# Example 2: Search for rbcL sequences in Fabaceae with custom settings
result2 <- mineTaxa(
  term = "Fabaceae[Organism] AND rbcL[Gene]",
  filename = "Fabaceae_rbcL.fasta",
  clean_taxa = TRUE,
  add_voucher = FALSE,     # Don't add voucher info
  original_query = TRUE,   # Keep original unprocessed file
  plastome_apart = FALSE,  # Don't separate plastomes
  rm_duplicated = FALSE,   # Keep all sequences
  retmax = 1000,
  verbose = TRUE
)

# Example 3: Protein database search
result3 <- mineTaxa(
  term = "Rubisco[Protein] AND plants[Organism]",
  db = "protein",
  filename = "plant_Rubisco.fasta",
  clean_taxa = FALSE  # Don't clean names for proteins
)

# Example 4: Return results without saving to disk
result4 <- mineTaxa(
  term = "Arabidopsis[Organism] AND chloroplast[Title]",
  verbose = FALSE,
  save = FALSE
)