Working with duplicated taxa and multiple accessions
Overview
A common challenge in multilocus phylogenetic datasets is that one or more species may be represented by multiple accessions across individual alignments. These duplicated taxa may correspond to different vouchers, collections, DNA extractions, or sequencing accessions. catGenes was designed to handle this situation explicitly, but doing so requires consistent label formatting across loci.
This article explains how duplicated taxa and multiple accessions should be represented in input alignments, how catGenes distinguishes them from simple single-accession datasets, and when to use catmultGenes() instead of catfullGenes().
Why duplicated accessions matter
In a simple multilocus dataset, each species is represented by a single sequence per locus. In that case, taxa can be matched across loci using only the scientific name.
However, many real datasets include cases where the same species is represented by more than one accession, for example:
- multiple specimens collected from different populations
- different vouchers from the same species
- independent DNA extractions
- different sequencing accessions for the same taxon
In these cases, matching taxa across loci by scientific name alone is not sufficient, because catGenes must determine which sequence in one alignment corresponds to which sequence in another alignment.
For this reason, duplicated-accession datasets require both:
- a consistently formatted scientific name
- a stable identifier that distinguishes one accession from another
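To make the ambiguity concrete, here is a minimal base R sketch. The helpers name_key() and accession_key() are hypothetical and are not part of catGenes; the sketch only illustrates why a key built from the scientific name alone cannot pair sequences across loci, while a key that includes the identifier can. It assumes labels follow an underscore-separated Genus_species_identifier pattern.

```r
# Two loci holding the same two accessions of one species, in different order.
locus_a <- c("Vatairea_fusca_Cardoso3060", "Vatairea_fusca_Silva1820")
locus_b <- c("Vatairea_fusca_Silva1820", "Vatairea_fusca_Cardoso3060")

# Hypothetical helper: key from the scientific name only (first two fields).
name_key <- function(x) sub("^([^_]+_[^_]+).*$", "\\1", x)

# Hypothetical helper: key from the name plus the identifier (first three fields).
accession_key <- function(x) sub("^([^_]+_[^_]+_[^_]+).*$", "\\1", x)

# With the name alone, both sequences in locus_a collapse to the same key.
name_is_ambiguous <- anyDuplicated(name_key(locus_a)) > 0

# With name plus identifier, each sequence in locus_a finds exactly one partner.
partner_in_b <- match(accession_key(locus_a), accession_key(locus_b))
```

Here name_is_ambiguous is TRUE, whereas partner_in_b gives one unambiguous position in locus_b for each sequence in locus_a.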
Choosing the correct concatenation function
catGenes provides two main concatenation workflows:
- catfullGenes() for datasets in which each species has a single sequence per locus
- catmultGenes() for datasets in which one or more species are represented by multiple accessions
As a rule:
- use catfullGenes() when labels can be matched by taxon name alone
- use catmultGenes() when the same taxon may occur multiple times in one or more alignments
Using the wrong function may lead to mismatched sequences, duplicated taxa in the output, or unexpected exclusion of accessions.
General labeling rule for duplicated taxa
When species are represented by multiple accessions, sequence labels must include both:
- the taxon name
- an identifier that remains stable across loci
Recommended format:
Genus_species_identifier
Genus_species_identifier_everythingelse
Examples:
Vatairea_fusca_Cardoso3060
Vatairea_fusca_Cardoso2939_JX152598
Vatairea_fusca_Silva1820
Vatairea_fusca_Lima7344
In this format, catGenes uses the scientific name plus the identifier to recognize which accessions should be matched across loci.
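As a rough illustration of the recommended format, the base R sketch below splits a label into its components. The function parse_label() is a hypothetical helper, not a catGenes function, and it does not claim to reproduce how catGenes parses labels internally. It assumes every label has at least the Genus and species fields, separated by underscores.

```r
# Hypothetical helper (not part of catGenes): split a label of the form
# Genus_species_identifier[_everythingelse] into its components.
parse_label <- function(label) {
  parts <- strsplit(label, "_", fixed = TRUE)[[1]]
  list(
    taxon      = paste(parts[1:2], collapse = "_"),
    identifier = if (length(parts) >= 3) parts[3] else NA_character_,
    extra      = if (length(parts) > 3) {
      paste(parts[-(1:3)], collapse = "_")
    } else {
      NA_character_
    }
  )
}

parsed <- parse_label("Vatairea_fusca_Cardoso2939_JX152598")
parsed$taxon       # "Vatairea_fusca"
parsed$identifier  # "Cardoso2939"
parsed$extra       # "JX152598"
```

A label without a third field, such as Vatairea_fusca, yields NA for the identifier, which is a quick way to spot labels that are not ready for an accession-aware workflow.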
Example of duplicated-accession labels
The figure below illustrates how labels should be formatted when one or more species are represented by multiple accessions.

The key principle is that the identifying component must be consistent wherever the same accession appears.
What counts as a stable identifier?
A stable identifier is any label that uniquely identifies the accession and is used consistently across loci. Examples include:
- collector surname and collection number
- voucher number
- DNA extraction code
- laboratory accession code
- specimen barcode
Good examples:
Cardoso2939
Silva1820
RB123456
DNA45
Poor choices include identifiers that change from locus to locus or identifiers that are missing from some alignments for the same accession.
When not to use accession identifiers
If each species is represented by only one sequence per locus, accession identifiers are optional. In those cases, labels such as:
Genus_species
G_species
Genus_species_moretext
may still be suitable for catfullGenes(), as long as the taxon name is consistently formatted.
The accession-aware workflow is only necessary when a species is actually duplicated in one or more alignments.
Detecting duplicated taxa in your dataset
A simple way to recognize whether you need catmultGenes() is to inspect sequence labels in your alignments and ask:
- does the same species appear more than once within an alignment?
- do different loci include multiple sequences for the same taxon?
- do I need to preserve accession identity across loci?
If the answer to any of these is yes, catmultGenes() is usually the correct choice.
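The first two questions can be checked programmatically. The sketch below is one possible approach in base R, assuming the sequence labels have already been pulled out of each alignment into a named list of character vectors (how to extract them depends on how the alignments were read). The object labels_by_locus, the locus names, the placeholder label Genus_species_DNA45, and the helper duplicated_taxa() are all hypothetical.

```r
# Hypothetical labels, one character vector per locus.
labels_by_locus <- list(
  locus1 = c("Vatairea_fusca_Cardoso3060",
             "Vatairea_fusca_Silva1820",
             "Genus_species_DNA45"),
  locus2 = c("Vatairea_fusca_Cardoso3060",
             "Genus_species_DNA45")
)

# Hypothetical helper: species names that occur more than once within a locus.
duplicated_taxa <- function(labels) {
  taxon <- sub("^([^_]+_[^_]+).*$", "\\1", labels)
  unique(taxon[duplicated(taxon)])
}

# Duplicated species per locus, and an overall flag.
dups <- lapply(labels_by_locus, duplicated_taxa)
needs_catmult <- any(lengths(dups) > 0)
```

In this toy input, Vatairea_fusca is duplicated in locus1 only, so needs_catmult is TRUE and catmultGenes() would usually be the appropriate choice.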
Example workflow with duplicated accessions
The example below assumes that the alignments have already been loaded into R as a named list and that labels include stable identifiers.
catdf <- catmultGenes(
  my_alignments,
  maxspp = TRUE,
  shortaxlabel = TRUE,
  missdata = TRUE
)
This compares alignments while taking duplicated accessions into account.
Understanding the maxspp argument
The catmultGenes() function includes the argument maxspp, which controls how taxa without duplicated accessions are treated.
catdf <- catmultGenes(
  my_alignments,
  maxspp = TRUE,
  shortaxlabel = TRUE,
  missdata = TRUE
)
When maxspp = TRUE, species that are not duplicated in any individual alignment are retained in the final concatenated dataset, helping maximize taxon coverage. This is usually the recommended option, because it avoids unnecessarily dropping taxa that only have one accession but still belong in the final matrix.
Understanding the shortaxlabel argument
In datasets with duplicated accessions, shortaxlabel determines how much of the original label is retained in the output.
catdf <- catmultGenes(
  my_alignments,
  maxspp = TRUE,
  shortaxlabel = FALSE,
  missdata = TRUE
)
- shortaxlabel = TRUE keeps labels shorter and more standardized
- shortaxlabel = FALSE retains more of the original identifying information
Keeping full labels can be useful in phylogenomic or voucher-rich datasets, especially when accessions must remain clearly traceable throughout downstream analyses.
Understanding the missdata argument
As in other concatenation workflows, the missdata argument controls whether taxa lacking sequences in one or more loci are retained.
catdf <- catmultGenes(
  my_alignments,
  maxspp = TRUE,
  shortaxlabel = TRUE,
  missdata = TRUE
)
- missdata = TRUE retains incomplete taxa and fills missing partitions with missing data
- missdata = FALSE excludes taxa lacking a complete sequence for one or more loci
This is especially important in duplicated-accession datasets, where different accessions may have different locus coverage.
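Before choosing a value for missdata, it can help to see which accessions are actually missing from some loci. The base R sketch below builds a simple presence/absence table. It is only a preview based on labels and is not how catGenes evaluates coverage internally; labels_by_locus, the locus names, and accession_key() are hypothetical.

```r
# Hypothetical labels, one character vector per locus.
labels_by_locus <- list(
  locus1 = c("Vatairea_fusca_Cardoso3060",
             "Vatairea_fusca_Silva1820",
             "Vatairea_fusca_Lima7344"),
  locus2 = c("Vatairea_fusca_Cardoso3060",
             "Vatairea_fusca_Silva1820")
)

# Hypothetical helper: Genus_species_identifier key (first three fields).
accession_key <- function(x) sub("^([^_]+_[^_]+_[^_]+).*$", "\\1", x)

# All accession keys seen in any locus.
all_keys <- sort(unique(unlist(lapply(labels_by_locus, accession_key))))

# Logical matrix: rows are accessions, columns are loci, TRUE means present.
coverage <- do.call(
  cbind,
  lapply(labels_by_locus, function(x) all_keys %in% accession_key(x))
)
rownames(coverage) <- all_keys

# Accessions absent from at least one locus.
incomplete <- rownames(coverage)[rowSums(coverage) < ncol(coverage)]
```

In this toy input, Vatairea_fusca_Lima7344 is the only incomplete accession. Going by the description of missdata above, such an accession would be kept and padded with missing data when missdata = TRUE, and excluded when missdata = FALSE.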
What happens after concatenation?
Once catmultGenes() has been run successfully, the resulting list of equalized alignments can be exported with:
- writeNexus()
- writePhylip()
For example:
writeNexus(
  catdf,
  file = "duplicated_accessions_dataset.nex",
  genomics = FALSE,
  interleave = TRUE,
  bayesblock = TRUE
)
or
writePhylip(
  catdf,
  file = "duplicated_accessions_dataset.phy",
  genomics = FALSE,
  catalignments = TRUE,
  partitionfile = TRUE
)
Keeping original identifiers in exported datasets
In some workflows, you may want to preserve all original identifiers in the concatenated output. This is especially useful for genomic or accession-rich datasets.
When you run catfullGenes() or catmultGenes() with shortaxlabel = FALSE, the export functions can preserve accession detail in the final output.
For example:
catdf <- catmultGenes(
  my_alignments,
  maxspp = TRUE,
  shortaxlabel = FALSE,
  missdata = TRUE
)
writeNexus(
  catdf,
  file = "full_labels_dataset.nex",
  genomics = TRUE,
  interleave = TRUE,
  bayesblock = TRUE
)
This helps maintain traceability between the concatenated matrix and the original sequence labels.
Removing redundant duplicated accessions after concatenation
After concatenation, you may decide that some duplicated accessions should be removed, particularly when they represent the same species but differ in completeness.
The function dropSeq() was developed for this purpose. It removes smaller or less informative duplicated sequences, typically favoring the accession with better data coverage.
clean_catdf <- dropSeq(catdf)
This can be useful before exporting the final matrix.
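To give a feel for what "favoring the accession with better data coverage" can mean, the base R sketch below keeps, for each species, the accession with the most informative characters. This is an illustration of the general idea only and is not dropSeq()'s implementation, whose exact criteria are not described here. The toy sequences, the placeholder label Genus_species_DNA45, and the choice to treat "?", "-", and "N" as uninformative are all assumptions.

```r
# Toy sequences named by label (made-up data for illustration only).
seqs <- c(
  Vatairea_fusca_Cardoso3060 = "ACGTACGTAC",
  Vatairea_fusca_Silva1820   = "ACGT??????",
  Genus_species_DNA45        = "ACGTAC????"
)

# Count characters that are not missing or ambiguous (assumed: ?, -, N).
n_informative <- nchar(gsub("[?N-]", "", seqs))

# Species name from the first two underscore-separated fields.
taxon <- sub("^([^_]+_[^_]+).*$", "\\1", names(seqs))

# For each species, keep the label with the highest informative count.
keep <- unname(unlist(lapply(
  split(n_informative, taxon),
  function(v) names(v)[which.max(v)]
)))

kept_seqs <- seqs[keep]
```

With this input, Vatairea_fusca_Silva1820 is set aside in favor of Vatairea_fusca_Cardoso3060, and the single accession of the placeholder species is kept.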
Common mistakes
Using catfullGenes() when taxa are duplicated
If one or more species are represented by multiple accessions, catfullGenes() may not match them correctly because it is intended for one-sequence-per-species datasets.
Inconsistent identifiers across loci
If the same accession is labeled differently in different alignments, catGenes will treat them as different accessions. For example, these may fail to match as intended:
Vatairea_fusca_Cardoso2939
Vatairea_fusca_DCardoso2939
Vatairea_fusca_RB2939
Missing identifiers in only some loci
If duplicated taxa are identified with accession codes in one alignment but not in another, catGenes may not be able to resolve them correctly.
Mixing multiple identification schemes
Using collector numbers in one alignment and GenBank accessions in another for the same accession can create mismatches unless the same identifier is carried consistently across loci.
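Several of these mistakes can be caught with a quick label audit before concatenation. The base R sketch below flags labels that lack an identifier and accession keys that appear in only one locus. It assumes labels are available as a named list of character vectors; labels_by_locus, the locus names, and accession_key() are hypothetical and not part of catGenes. A key seen in a single locus is not necessarily an error, since an accession may genuinely lack data for other loci, but it is worth a second look.

```r
# Hypothetical labels reproducing the mistakes described above.
labels_by_locus <- list(
  locus1 = c("Vatairea_fusca_Cardoso2939", "Vatairea_fusca_Silva1820"),
  locus2 = c("Vatairea_fusca_DCardoso2939", "Vatairea_fusca_Silva1820"),
  locus3 = c("Vatairea_fusca", "Vatairea_fusca_Silva1820")
)

# Labels with fewer than three underscore-separated fields carry no identifier.
no_identifier <- lapply(labels_by_locus, function(x) {
  x[lengths(strsplit(x, "_", fixed = TRUE)) < 3]
})

# Hypothetical helper: Genus_species_identifier key (first three fields).
# Labels without a third field are returned unchanged.
accession_key <- function(x) sub("^([^_]+_[^_]+_[^_]+).*$", "\\1", x)

# Count in how many loci each key occurs, then list keys seen only once.
keys <- unlist(lapply(labels_by_locus, function(x) unique(accession_key(x))))
key_counts <- table(keys)
seen_once <- names(key_counts)[key_counts == 1]
```

Here no_identifier flags the bare Vatairea_fusca label in locus3, and seen_once lists Vatairea_fusca_Cardoso2939 and Vatairea_fusca_DCardoso2939 (plus the bare label), which points to an inconsistently written collector identifier.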
Recommended practice
For duplicated-accession datasets, the safest workflow is:
- choose a single stable identifier for each accession
- use that identifier consistently across all loci
- separate label components with underscores
- load alignments into R as a named list
- use catmultGenes() for concatenation
- inspect the output and optionally use dropSeq() if cleanup is needed