Concatenate datasets with duplicated accessions
Overview
The function catmultGenes() is the main catGenes workflow for concatenating multilocus DNA datasets when one or more species are represented by multiple accessions across individual alignments. This situation is common in phylogenetic and phylogenomic studies that include different vouchers, collections, DNA extractions, or GenBank accessions for the same species.
Unlike catfullGenes(), which assumes one sequence per species per locus, catmultGenes() compares alignments while taking accession identity into account. For this reason, it is the appropriate function whenever taxa are duplicated in one or more input matrices.
This article explains how to prepare the input data, how catmultGenes() works, how the main arguments affect the concatenated output, and how to export the resulting dataset for downstream phylogenetic analyses.
When to use catmultGenes()
Use catmultGenes() when:
- one or more species are represented by multiple accessions in one or more loci
- the same accession must be tracked across loci
- labels include both taxon name and a stable identifier
- you want to maximize taxon coverage while preserving accession-level matching
- you are working with multilocus datasets that include duplicated taxa
If every species is represented by only one sequence per locus, use catfullGenes() instead.
Why duplicated accessions require a different workflow
When a species appears only once per locus, taxa can be matched across alignments using the scientific name alone. However, when a species is represented by multiple accessions, the scientific name is not enough to determine which sequence in one locus corresponds to which sequence in another.
For example, a species such as Vatairea_fusca may appear in one alignment with two accessions:
Vatairea_fusca_Cardoso2939
Vatairea_fusca_Silva1820
If the same species also appears in another locus, catGenes must know which accession should be matched with which. This is why the duplicated-accession workflow requires labels that include both:
- the scientific name
- a stable accession identifier
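The ambiguity can be shown with a few lines of base R. This is only a sketch: the two loci and the helper split_label() are hypothetical, and the code does not reproduce how catGenes matches labels internally. It assumes labels of the form Genus_species_identifier.

```r
# Hypothetical labels for the same two accessions in two loci.
locus_a <- c("Vatairea_fusca_Cardoso2939", "Vatairea_fusca_Silva1820")
locus_b <- c("Vatairea_fusca_Silva1820", "Vatairea_fusca_Cardoso2939")

# Split "Genus_species_identifier" into species and identifier parts.
# split_label() is an illustrative helper, not a catGenes function.
split_label <- function(labels) {
  parts <- strsplit(labels, "_", fixed = TRUE)
  data.frame(
    species    = vapply(parts, function(p) paste(p[1:2], collapse = "_"), character(1)),
    identifier = vapply(parts, function(p) p[3], character(1)),
    stringsAsFactors = FALSE
  )
}

a <- split_label(locus_a)
b <- split_label(locus_b)

# Matching on species alone pairs every accession with every other one.
species_only <- merge(a, b, by = "species")
nrow(species_only)   # 4 candidate pairs for 2 accessions

# Matching on species plus identifier gives one pair per accession.
by_accession <- merge(a, b, by = c("species", "identifier"))
nrow(by_accession)   # 2
```

With the species name alone, two accessions yield four possible pairings; adding the identifier leaves exactly one pairing per accession.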
Input structure
catmultGenes() expects a named list of individual DNA alignments, usually loaded from NEXUS files with ape::read.nexus.data().
Each alignment should be part of a named list in which:
- each list element corresponds to one locus
- the list names correspond to gene or locus names
- sequence labels follow a consistent accession-aware format
For example:
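library(catGenes)
# List the alignment files in the folder (file names only).
genes <- list.files("path_to_DNA_alignments_folder")
# Read each NEXUS file into one element of a list.
my_alignments <- list()
for (i in genes) {
  my_alignments[[i]] <- ape::read.nexus.data(
    file.path("path_to_DNA_alignments_folder", i)
  )
}
# Drop the file extension so that each list element is named after its locus.
names(my_alignments) <- gsub("[.].*", "", names(my_alignments))
Label formatting for duplicated taxa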
When using catmultGenes(), labels should include both taxon name and identifier, for example:
Genus_species_identifier
Genus_species_identifier_everythingelse
Examples:
Vatairea_fusca_Cardoso2939
Vatairea_fusca_Cardoso2939_JX152598
Vatairea_fusca_Silva1820
Vatairea_fusca_RB123456
The key requirement is that the identifying component be used consistently across loci for the same accession.
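Before concatenation, labels can be screened for this format with a regular expression. The pattern below is an assumption made for illustration (a capitalized genus, a lowercase epithet, an alphanumeric identifier, then optional extra fields); it is not a rule enforced by catGenes, and real datasets with hyphenated epithets or infraspecific names would need a looser pattern.

```r
# Example labels; the last one is deliberately missing its identifier.
labels <- c(
  "Vatairea_fusca_Cardoso2939",
  "Vatairea_fusca_Cardoso2939_JX152598",
  "Vatairea_fusca_Silva1820",
  "Vatairea_fusca"
)

# Assumed pattern: Genus_species_identifier, optionally followed by _anything.
label_pattern <- "^[A-Z][a-z]+_[a-z]+_[A-Za-z0-9]+(_.*)?$"
is_valid <- grepl(label_pattern, labels)

# Labels that do not carry an identifier component.
invalid_labels <- labels[!is_valid]
invalid_labels

# The identifier is the third underscore-separated field of each valid label.
identifiers <- vapply(
  strsplit(labels[is_valid], "_", fixed = TRUE),
  function(p) p[3],
  character(1)
)
identifiers
```

Here the first two labels share the identifier Cardoso2939 even though one of them carries an additional GenBank-style suffix.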
Example of accession-aware labels
The figure below illustrates the kind of sequence labels expected when species are duplicated with multiple accessions.
Basic usage
A typical concatenation workflow with duplicated accessions looks like this:
catdf <- catmultGenes(
  my_alignments,
  maxspp = TRUE,
  shortaxlabel = TRUE,
  missdata = TRUE
)
This compares taxa across loci while taking duplicated accessions into account and returns a list of equalized data frames ready for export.
What catmultGenes() returns
As with catfullGenes(), the output is a list of data frames of equal size. Each data frame corresponds to one locus and contains:
- a species column with the matched accession-aware taxon labels
- a sequence column with the corresponding aligned DNA sequence
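A quick inspection of this structure might look like the sketch below. It uses a hand-built stand-in rather than real catmultGenes() output: the locus names (ITS, matK), the sequences, and the column names species and sequence are assumptions based on the description above.

```r
# Toy stand-in for the output shape described above (not real catGenes output).
catdf_toy <- list(
  ITS = data.frame(
    species  = c("Vatairea_fusca_Cardoso2939", "Vatairea_fusca_Silva1820"),
    sequence = c("acgt", "acga"),
    stringsAsFactors = FALSE
  ),
  matK = data.frame(
    species  = c("Vatairea_fusca_Cardoso2939", "Vatairea_fusca_Silva1820"),
    sequence = c("ttgacc", "??????"),
    stringsAsFactors = FALSE
  )
)

# Every locus should have the same number of rows.
n_rows <- vapply(catdf_toy, nrow, integer(1))
same_size <- length(unique(n_rows)) == 1

# Every locus should list the same labels in the same order.
same_labels <- all(vapply(
  catdf_toy,
  function(df) identical(df$species, catdf_toy[[1]]$species),
  logical(1)
))

# Aligned length per locus; each locus is expected to have a single length.
aln_length <- vapply(
  catdf_toy,
  function(df) unique(nchar(df$sequence)),
  integer(1)
)
aln_length
```

If vapply() fails on the last step, at least one locus contains sequences of unequal length, which is worth resolving before export.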
Understanding maxspp
The argument maxspp determines how species without duplicated accessions are handled in the final concatenated output. A typical call is:
catdf <- catmultGenes(
  my_alignments,
  maxspp = TRUE,
  shortaxlabel = TRUE,
  missdata = TRUE
)
When maxspp = TRUE, species that are not duplicated in any alignment are retained in the final concatenated dataset. This helps maximize taxon coverage, which is usually desirable. When maxspp = FALSE, taxa that are never duplicated may end up being removed or duplicated depending on the interaction with the missdata setting and the structure of the dataset.
In most cases, maxspp = TRUE is the recommended setting.
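To see which species maxspp affects in a given dataset, it can help to list the species that are never duplicated. The following base R sketch does this on hypothetical labels; the locus names, the Lima101 identifier, and the helper species_of() are illustrative, and the code is not the logic used inside catmultGenes().

```r
# Hypothetical labels per locus.
locus_labels <- list(
  ITS  = c("Vatairea_fusca_Cardoso2939",
           "Vatairea_fusca_Silva1820",
           "Vatairea_guianensis_Lima101"),
  matK = c("Vatairea_fusca_Cardoso2939",
           "Vatairea_guianensis_Lima101")
)

# Keep only the Genus_species part of each label (illustrative helper).
species_of <- function(labels) sub("^([^_]+_[^_]+)_.*$", "\\1", labels)

# Species with more than one accession in at least one locus.
dup_in_locus <- lapply(locus_labels, function(x) {
  sp <- species_of(x)
  unique(sp[duplicated(sp)])
})
duplicated_spp <- unique(unlist(dup_in_locus, use.names = FALSE))

# Species that are never duplicated: the ones maxspp is concerned with.
all_species <- unique(species_of(unlist(locus_labels, use.names = FALSE)))
never_duplicated <- setdiff(all_species, duplicated_spp)

duplicated_spp
never_duplicated
```

In this toy dataset Vatairea_fusca is duplicated in ITS, while Vatairea_guianensis is the kind of taxon whose retention depends on maxspp.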
Understanding shortaxlabel
The argument shortaxlabel controls how much of the original label is retained in the output. When shortaxlabel = TRUE, output labels are kept shorter and more standardized:
catdf <- catmultGenes(
  my_alignments,
  maxspp = TRUE,
  shortaxlabel = TRUE,
  missdata = TRUE
)
When shortaxlabel = FALSE, more of the original accession label is retained:
catdf <- catmultGenes(
  my_alignments,
  maxspp = TRUE,
  shortaxlabel = FALSE,
  missdata = TRUE
)
Retaining the full original label is often useful in genomic, voucher-rich, or accession-traceable workflows where the exact identity of each accession must be preserved throughout downstream analyses.
Understanding missdata
The argument missdata controls whether taxa lacking one or more loci are retained in the final concatenated dataset.
When missdata = TRUE, incomplete taxa are retained and missing partitions are filled with missing data:
catdf <- catmultGenes(
  my_alignments,
  maxspp = TRUE,
  shortaxlabel = TRUE,
  missdata = TRUE
)
This is often the most useful setting when trying to maximize taxon representation.
When missdata = FALSE, taxa lacking any locus are excluded:
catdf_complete <- catmultGenes(
  my_alignments,
  maxspp = TRUE,
  shortaxlabel = TRUE,
  missdata = FALSE
)
This produces a more conservative matrix containing only accessions with complete locus coverage.
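The difference between the two settings can be anticipated by tabulating locus coverage per accession before running the function. The sketch below uses hypothetical labels and plain base R; it only approximates the idea of complete versus incomplete coverage and makes no claim about how catmultGenes() computes it.

```r
# Hypothetical labels per locus; Silva1820 is missing from matK.
locus_labels <- list(
  ITS  = c("Vatairea_fusca_Cardoso2939", "Vatairea_fusca_Silva1820"),
  matK = c("Vatairea_fusca_Cardoso2939")
)

# All accessions seen in any locus.
all_acc <- sort(unique(unlist(locus_labels, use.names = FALSE)))

# Presence/absence matrix: rows are accessions, columns are loci.
coverage <- do.call(cbind, lapply(locus_labels, function(x) all_acc %in% x))
rownames(coverage) <- all_acc
coverage

# Accessions present in every locus (what a complete-coverage matrix would hold).
complete <- rownames(coverage)[rowSums(coverage) == ncol(coverage)]

# Accessions missing at least one locus (kept only if missing data are allowed).
incomplete <- setdiff(all_acc, complete)
```

A long incomplete vector suggests that missdata = FALSE would shrink the matrix considerably.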
Understanding outgroup
As in other concatenation workflows, outgroup can be used to indicate one or more taxa that should be treated as outgroups in incomplete datasets. For example:
catdf <- catmultGenes(
  my_alignments,
  maxspp = TRUE,
  shortaxlabel = TRUE,
  missdata = TRUE,
  outgroup = "Outgroup_species_identifier"
)
Or with multiple outgroups:
catdf <- catmultGenes(
  my_alignments,
  maxspp = TRUE,
  shortaxlabel = TRUE,
  missdata = TRUE,
  outgroup = c("Outgroup_1", "Outgroup_2")
)
This is useful when incomplete taxa are retained and critical outgroup accessions must remain in the concatenated dataset.
Understanding verbose
The argument verbose controls how much progress information is printed during matching and concatenation.
catdf <- catmultGenes(
  my_alignments,
  maxspp = TRUE,
  shortaxlabel = TRUE,
  missdata = TRUE,
  verbose = TRUE
)
To suppress the detailed output:
catdf <- catmultGenes(
  my_alignments,
  maxspp = TRUE,
  shortaxlabel = TRUE,
  missdata = TRUE,
  verbose = FALSE
)
This can be useful for large datasets or rendered tutorials.
A typical workflow
A common duplicated-accession workflow is:
Step 1. Load the alignments
library(catGenes)
# List the alignment files in the folder (file names only).
genes <- list.files("path_to_DNA_alignments_folder")
# Read each NEXUS file into one element of a list.
my_alignments <- list()
for (i in genes) {
  my_alignments[[i]] <- ape::read.nexus.data(
    file.path("path_to_DNA_alignments_folder", i)
  )
}
# Drop the file extension so that each list element is named after its locus.
names(my_alignments) <- gsub("[.].*", "", names(my_alignments))
Step 2. Run catmultGenes()
catdf <- catmultGenes(
  my_alignments,
  maxspp = TRUE,
  shortaxlabel = TRUE,
  missdata = TRUE
)
At this point, the dataset is ready for export.
Exporting the concatenated output
After running catmultGenes(), the most common next step is to export the result with writeNexus() or writePhylip().
Export as NEXUS
writeNexus(
  catdf,
  file = "duplicated_accessions_dataset.nex",
  genomics = FALSE,
  interleave = TRUE,
  bayesblock = TRUE
)
Export as PHYLIP
writePhylip(
  catdf,
  file = "duplicated_accessions_dataset.phy",
  genomics = FALSE,
  catalignments = TRUE,
  partitionfile = TRUE
)
These exports produce final concatenated datasets that can be used in downstream phylogenetic analyses.
Preserving full accession labels in exported datasets
When the full original accession labels need to be preserved, use shortaxlabel = FALSE during concatenation.
catdf <- catmultGenes(
  my_alignments,
  maxspp = TRUE,
  shortaxlabel = FALSE,
  missdata = TRUE
)
Then export using settings that preserve this level of identifier detail.
For example:
writeNexus(
  catdf,
  file = "full_labels_dataset.nex",
  genomics = TRUE,
  interleave = TRUE,
  bayesblock = TRUE
)
This is particularly useful in phylogenomic datasets where accession identity needs to remain explicit in the final matrix.
Handling differing identifiers across partitions
When full labels are retained, writeNexus() can preserve differing identifiers across partitions by representing them in a structured way in the output. This helps maintain traceability between the concatenated dataset and the original accessions. The examples below illustrate how these labels can appear in exported matrices.
Removing redundant duplicated accessions after concatenation
In some cases, after concatenation you may decide that duplicated accessions of the same species should be reduced to the best or most complete representative. The function dropSeq() was designed for this purpose.
clean_catdf <- dropSeq(catdf)
This function removes smaller or less informative duplicated sequences, typically favoring accessions with better completeness. The screenshots below illustrate this type of cleanup.
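To make the idea of completeness concrete, the sketch below keeps, for each species, the accession with the most informative characters. It is a simplified base R illustration on invented sequences and should not be read as the algorithm implemented in dropSeq(); the column names, the Lima101 identifier, and the choice of ?, -, and n as uninformative characters are assumptions.

```r
# Toy single-locus table with two accessions of one species.
toy <- data.frame(
  species  = c("Vatairea_fusca_Cardoso2939",
               "Vatairea_fusca_Silva1820",
               "Vatairea_guianensis_Lima101"),
  sequence = c("acgtacgt", "ac??????", "acgtac--"),
  stringsAsFactors = FALSE
)

# Count characters other than missing data, gaps, or ambiguous n.
informative <- nchar(gsub("[?nN-]", "", toy$sequence))

# Group row indices by species (Genus_species part of the label).
sp <- sub("^([^_]+_[^_]+)_.*$", "\\1", toy$species)
rows_by_species <- split(seq_len(nrow(toy)), sp)

# Within each species keep the row with the most informative characters.
keep <- unlist(
  lapply(rows_by_species, function(i) i[which.max(informative[i])]),
  use.names = FALSE
)
cleaned <- toy[sort(keep), ]
cleaned$species
```

Ties are resolved here by taking the first accession, which is an arbitrary choice for the purposes of the example.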
Relationship with large genomic datasets
catmultGenes() is particularly useful in larger phylogenomic workflows where many loci are available and one or more taxa may be represented by multiple genome-derived accessions. The function has been used efficiently with larger multilocus organellar datasets. In such contexts, careful label standardization and accession consistency are especially important.
The example below illustrates a larger concatenated genomic matrix.
Common mistakes
Using catfullGenes() on duplicated-accession datasets
If taxa are duplicated in one or more loci, catfullGenes() is not the correct workflow. In that case, use catmultGenes().
Inconsistent accession identifiers across loci
If the same accession is labeled differently in different alignments, catGenes will treat those labels as different accessions. For example, these may not be matched correctly:
Vatairea_fusca_Cardoso2939
Vatairea_fusca_DCardoso2939
Vatairea_fusca_RB2939
Missing accession identifiers in some loci
If duplicated taxa are identified with accession-aware labels in one alignment but not in another, the function may not be able to resolve them correctly.
Mixing label schemes across loci
Using collector numbers in one locus and GenBank accession numbers in another for the same accession can create mismatches unless the same stable identifier is carried across all loci.
Unexpected loss of taxa
If fewer taxa appear than expected, inspect the settings of missdata and maxspp, and make sure the input labels are truly consistent across loci.
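One way to inspect label consistency is to list labels that occur in only a single locus. The sketch below does this in base R on hypothetical loci built from the inconsistent labels shown earlier. A label found in only one locus is not necessarily an error, since the accession may simply be unsampled elsewhere, so the result is a list of candidates to review rather than a diagnosis.

```r
# Hypothetical loci in which one accession is labeled three different ways.
locus_labels <- list(
  ITS  = c("Vatairea_fusca_Cardoso2939", "Vatairea_fusca_Silva1820"),
  matK = c("Vatairea_fusca_DCardoso2939", "Vatairea_fusca_Silva1820"),
  trnL = c("Vatairea_fusca_RB2939", "Vatairea_fusca_Silva1820")
)

# Long table: one row per label per locus.
long <- data.frame(
  locus = rep(names(locus_labels), lengths(locus_labels)),
  label = unlist(locus_labels, use.names = FALSE),
  stringsAsFactors = FALSE
)

# Number of loci in which each label occurs.
n_loci <- table(long$label)

# Labels seen in a single locus are candidates for inconsistent naming.
singletons <- names(n_loci)[n_loci == 1]
suspects <- long[long$label %in% singletons, ]
suspects
```

Sorting the suspects by species makes near-identical identifiers such as Cardoso2939 and DCardoso2939 easy to spot side by side.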
Recommended practice
For the smoothest use of catmultGenes():
- choose one stable identifier for each accession
- use that identifier consistently across all loci
- separate label components with underscores
- load alignments into R as a named list
- use maxspp = TRUE in most cases
- inspect the resulting output before export
- use dropSeq() when cleanup of redundant duplicated accessions is needed
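For the named-list recommendation, locus names can be derived from file names in a way that ignores unrelated files in the same folder. The sketch below is a minimal base R example on a temporary folder with hypothetical file names; it assumes the alignments use a .nex extension and leaves the actual reading step as a comment because it depends on the ape package and on real alignment files.

```r
# Throwaway folder with hypothetical file names (empty files, for illustration).
aln_dir <- file.path(tempdir(), "alignments_demo")
dir.create(aln_dir, showWarnings = FALSE)
invisible(file.create(file.path(aln_dir, c("ITS.nex", "matK.nex", "notes.txt"))))

# Keep only NEXUS files and carry their full paths.
paths <- list.files(aln_dir, pattern = "\\.nex$", full.names = TRUE)

# Locus names are the file names without the extension.
locus_names <- tools::file_path_sans_ext(basename(paths))

# Locus names must be unique to serve as list names.
stopifnot(!anyDuplicated(locus_names))
locus_names

# With real alignments, the list could then be built as:
# my_alignments <- lapply(paths, ape::read.nexus.data)
# names(my_alignments) <- locus_names
```

Filtering by extension keeps stray files such as notes.txt out of the list, which would otherwise cause the reading step to fail or add a spurious locus.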
Next step
Once a duplicated-accession dataset has been concatenated with catmultGenes(), the next step is usually to export the result with writeNexus() or writePhylip(), then proceed to downstream phylogenetic analyses and tree visualization.