library(catGenes)

# List the example alignment files shipped with catGenes
genes <- list.files(system.file("DNAlignments/Vataireoids",
                                package = "catGenes"))

# Read each NEXUS alignment into a list, one element per file
Vataireoids <- list()
for (i in genes) {
  Vataireoids[[i]] <- ape::read.nexus.data(
    system.file("DNAlignments/Vataireoids", i, package = "catGenes")
  )
}
names(Vataireoids) <- gsub("[.].*", "", names(Vataireoids))

Standardizing DNA alignments
Why standardization matters
Before using the concatenation functions in catGenes, DNA alignments must be consistently formatted. The package compares taxa across individual alignments based on their sequence labels. When labels are inconsistent, taxa may fail to match across loci, leading to incomplete or incorrect concatenated datasets.
For datasets without duplicated accessions, catGenes matches taxa primarily by scientific name. For datasets with multiple accessions per species, the package uses both the scientific name and an associated identifier. For this reason, careful standardization of sequence labels is an essential first step.
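As a rough illustration of why label consistency matters, the sketch below compares sequence labels across two mock alignments using base R only. It does not reproduce the matching logic inside catGenes, which is not shown here. The object `alignments`, its labels, and its sequences are invented and only mimic the named-list structure returned by ape::read.nexus.data().

```r
# Mock alignments: each locus is a named list of sequences.
# Labels and sequences are invented for illustration only.
alignments <- list(
  ITS  = list(Vatairea_fusca        = c("a", "c", "g"),
              Vatairea_guianensis   = c("a", "c", "t")),
  rbcL = list(Vatairea_fusca        = c("g", "g", "a"),
              "Vatairea-guianensis" = c("g", "g", "c"))  # inconsistent label
)

# Collect the sequence labels of each locus
labels_by_locus <- lapply(alignments, names)

# Labels present in every locus
shared <- Reduce(intersect, labels_by_locus)

# Labels in each locus that are not shared by all loci
unmatched <- lapply(labels_by_locus, setdiff, y = shared)

print(shared)
print(unmatched)
```

Here the hyphenated label in the second locus fails to match its underscore-separated counterpart, so the same taxon would be treated as two different ones.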
General principles
When preparing alignments for catGenes:
- use consistent taxon naming across all loci
- separate all components of sequence labels with underscores
- avoid spaces and hyphens in sequence names
- keep identifiers stable across all alignments when the same accession is represented in multiple loci
- use simple and consistent file names for each alignment
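The principles above can be checked before concatenation with a few regular expressions. The following is a minimal sketch in base R; `check_labels()` is a hypothetical helper, not a catGenes function, and it assumes labels contain only letters, digits, and underscores, which may be stricter than what the package actually accepts.

```r
# Hypothetical helper (not part of catGenes): flag labels that break the
# formatting principles listed above.
check_labels <- function(labels) {
  data.frame(
    label       = labels,
    has_space   = grepl(" ", labels, fixed = TRUE),
    has_hyphen  = grepl("-", labels, fixed = TRUE),
    # Assumption: at least two alphanumeric components joined by underscores
    well_formed = grepl("^[A-Za-z0-9]+(_[A-Za-z0-9]+)+$", labels),
    stringsAsFactors = FALSE
  )
}

labels <- c("Vatairea_fusca_Cardoso2939",
            "Vatairea fusca",
            "Vatairea-fusca_Cardoso2939")
report <- check_labels(labels)
print(report)
```

Running such a check on the labels of every alignment makes it easier to spot problem labels early, before they cause silent mismatches across loci.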
1. Datasets with a single sequence per species
When each species is represented by only one sequence per locus, taxon labels can be formatted simply as:
Genus_species
G_species
Labels may also include additional information after the species name, provided that the scientific name remains consistently formatted. For example:
Genus_species_identifier
Genus_species_identifier_accession
Examples:
Vatairea_fusca
Vatairea_fusca_Cardoso2939
Vatairea_fusca_Cardoso2939_JX152598
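To see how labels with different amounts of trailing information can still share the same scientific name, the sketch below extracts the first two underscore-separated components of each label. `scientific_name()` is a hypothetical helper written for this illustration and is not how catGenes necessarily parses labels. Note that it would also drop an infraspecific epithet.

```r
# Hypothetical helper (not part of catGenes): keep only the first two
# underscore-separated components of each label.
scientific_name <- function(labels) {
  sub("^([^_]+_[^_]+).*$", "\\1", labels)
}

labels <- c("Vatairea_fusca",
            "Vatairea_fusca_Cardoso2939",
            "Vatairea_fusca_Cardoso2939_JX152598")
sci <- scientific_name(labels)
print(sci)
```

All three labels reduce to the same name, which is only possible because the scientific name is formatted identically in each of them.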


2. Datasets with duplicated species or multiple accessions
When one or more species are represented by multiple accessions, catGenes requires labels that include both:
- the scientific name
- a stable accession identifier
This identifier may correspond to a collector number, voucher code, DNA extraction code, or another stable accession label used consistently across loci.
Recommended format:
Genus_species_identifier
Genus_species_identifier_everythingelse
Examples:
Vatairea_fusca_Cardoso2939
Vatairea_fusca_Cardoso2939_JX152598
In these cases, use catmultGenes() rather than catfullGenes().
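One way to find out whether a dataset falls into this category is to look for scientific names that occur more than once among the labels of a locus. The sketch below does this in base R. The labels, including the identifiers `Silva101` and `Lima55`, are invented, and the code only suggests a function name as a string rather than calling catGenes.

```r
# Invented labels for one locus, in Genus_species_identifier format
labels <- c("Vatairea_fusca_Cardoso2939",
            "Vatairea_fusca_Silva101",
            "Vatairea_guianensis_Lima55")

# Scientific name = first two underscore-separated components
parts <- strsplit(labels, "_", fixed = TRUE)
sci <- vapply(parts, function(p) paste(p[1:2], collapse = "_"), character(1))

# Species represented by more than one accession
duplicated_species <- unique(sci[duplicated(sci)])

# Following the text: duplicated species call for catmultGenes()
suggested <- if (length(duplicated_species) > 0) "catmultGenes" else "catfullGenes"

print(duplicated_species)
print(suggested)
```

In practice the same check would be repeated for every alignment, since a species may be duplicated in one locus but not in another.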

3. Genus-level identifications and infraspecific taxa
Accessions identified only to genus level can be formatted as:
Genus_sp
Genus_sp1
Genus_sp2
Genus_spA
Genus_spB
Abbreviated generic names can also be used if done consistently:
G_sp
G_sp1
G_spA
For infraspecific taxa, add the infraspecific epithet after the species epithet, without including terms such as var. or subsp. in the label. For example:
Genus_species_variety_identifier
Genus_species_subspecies_identifier
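When labels start out as free-text taxon names, the rank terms can be removed programmatically. The following is a minimal sketch, assuming names are written with spaces and the rank abbreviations `var.`, `subsp.`, or `ssp.`; `make_label()` and the identifiers `Id1` and `Id2` are hypothetical and used only for illustration.

```r
# Hypothetical helper (not part of catGenes): build an underscore-separated
# label from a free-text taxon name and an identifier, dropping rank terms.
make_label <- function(taxon, identifier) {
  x <- gsub("\\b(var|subsp|ssp)\\.", "", taxon, perl = TRUE)  # drop rank terms
  x <- gsub("-", "", x, fixed = TRUE)                         # drop hyphens
  x <- gsub("\\s+", "_", trimws(x))                           # spaces to underscores
  paste(x, identifier, sep = "_")
}

labels <- make_label(
  c("Genus species var. variety", "Genus species subsp. subspecies"),
  c("Id1", "Id2")
)
print(labels)
```

The resulting labels follow the Genus_species_infraspecific_identifier pattern described above.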

4. Naming alignment files
We recommend using simple file names based on the corresponding gene or locus name, without spaces or hyphens. For example:
ITS.nex
rbcL.nex
psbAtrnH.nex
COX1.nex
This makes it easier to load all alignments into R and helps preserve clear partition names in downstream concatenation workflows.
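As a small illustration of how simple file names carry over to alignment names, the sketch below checks the example file names for spaces or hyphens and strips the extension with tools::file_path_sans_ext(). The variable names `files`, `bad`, and `partition_names` are chosen for this example only.

```r
# File names following the recommendation above
files <- c("ITS.nex", "rbcL.nex", "psbAtrnH.nex", "COX1.nex")

# Flag any file name containing a space or a hyphen
bad <- files[grepl("[ -]", files)]

# Strip the extension to obtain one name per alignment
partition_names <- tools::file_path_sans_ext(files)

print(bad)
print(partition_names)
```

With names like these, the list of loaded alignments can be named directly after the gene or locus each file contains.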
5. Loading multiple alignments into R
The example below uses the Vataireoid example alignments included in catGenes.