Concatenate datasets with duplicated accessions

Overview

The function catmultGenes() is the main catGenes workflow for concatenating multilocus DNA datasets when one or more species are represented by multiple accessions across individual alignments. This situation is common in phylogenetic and phylogenomic studies that include different vouchers, collections, DNA extractions, or GenBank accessions for the same species.

Unlike catfullGenes(), which assumes one sequence per species per locus, catmultGenes() compares alignments while taking accession identity into account. For this reason, it is the appropriate function whenever taxa are duplicated in one or more input matrices.

This article explains how to prepare the input data, how catmultGenes() works, how the main arguments affect the concatenated output, and how to export the resulting dataset for downstream phylogenetic analyses.

When to use `catmultGenes()`

Use catmultGenes() when:

one or more species are represented by multiple accessions in one or more loci
the same accession must be tracked across loci
labels include both taxon name and a stable identifier
you want to maximize taxon coverage while preserving accession-level matching
you are working with multilocus datasets that include duplicated taxa

If every species is represented by only one sequence per locus, use catfullGenes() instead.

Why duplicated accessions require a different workflow

When a species appears only once per locus, taxa can be matched across alignments using the scientific name alone. However, when a species is represented by multiple accessions, the scientific name is not enough to determine which sequence in one locus corresponds to which sequence in another.

For example, a species such as Vatairea_fusca may appear in one alignment with two accessions:

Vatairea_fusca_Cardoso2939
Vatairea_fusca_Silva1820

If the same species also appears in another locus, catGenes must know which accession should be matched with which. This is why the duplicated-accession workflow requires labels that include both:

the scientific name
a stable accession identifier
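
The ambiguity can be seen in a minimal base-R sketch. It is not part of catGenes, and the two label vectors and the helper species_of() are made up for the example. Matching on the species name alone finds two candidate partners in the second locus, whereas matching on the full accession-aware label finds exactly one:

```r
# Hypothetical labels from two loci (invented for illustration)
its_labels  <- c("Vatairea_fusca_Cardoso2939", "Vatairea_fusca_Silva1820")
matk_labels <- c("Vatairea_fusca_Cardoso2939", "Vatairea_fusca_Silva1820")

# Hypothetical helper: keep only the first two underscore-separated fields
species_of <- function(x) sub("^([^_]+_[^_]+).*$", "\\1", x)

# Matching by species name: both sequences in the second locus are candidates
by_species <- matk_labels[species_of(matk_labels) == species_of(its_labels[1])]

# Matching by full label: a single, unambiguous partner
by_label <- matk_labels[matk_labels == its_labels[1]]

print(length(by_species))
print(length(by_label))
```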

Input structure

catmultGenes() expects a named list of individual DNA alignments, usually loaded from NEXUS files with ape::read.nexus.data().

Each alignment should be part of a named list in which:

each list element corresponds to one locus
the list names correspond to gene or locus names
sequence labels follow a consistent accession-aware format

For example:

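library(catGenes)

# List the alignment files in the folder (one NEXUS file per locus)
genes <- list.files("path_to_DNA_alignments_folder")
my_alignments <- list()

# Read each file into a list element named after the file
for (i in genes) {
  my_alignments[[i]] <- ape::read.nexus.data(
    paste0("path_to_DNA_alignments_folder/", i)
  )
}

# Strip everything from the first dot onward so list names are locus names
names(my_alignments) <- gsub("[.].*", "", names(my_alignments))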
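
Before concatenation, it can help to confirm that the list has the expected shape. The sketch below builds a small mock object in the form returned by ape::read.nexus.data() (a named list of character vectors, one per sequence). The locus names, labels, and sequences are invented; in practice you would run the same checks on your own my_alignments.

```r
# Mock input: two loci, each a named list of sequences split into characters
toy_alignments <- list(
  ITS = list(
    Vatairea_fusca_Cardoso2939 = strsplit("acgt-a", "")[[1]],
    Vatairea_fusca_Silva1820   = strsplit("acgtta", "")[[1]]
  ),
  matK = list(
    Vatairea_fusca_Cardoso2939 = strsplit("ttga", "")[[1]]
  )
)

# Locus names come from the list names
locus_names <- names(toy_alignments)

# Number of sequences per locus
n_seqs <- lengths(toy_alignments)

# Within each locus, all sequences should have the same aligned length
aligned_ok <- vapply(
  toy_alignments,
  function(locus) length(unique(lengths(locus))) == 1L,
  logical(1)
)

print(locus_names)
print(n_seqs)
print(aligned_ok)
```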

Label formatting for duplicated taxa

When using catmultGenes(), labels should include both taxon name and identifier, for example:

Genus_species_identifier
Genus_species_identifier_everythingelse

Examples:

Vatairea_fusca_Cardoso2939
Vatairea_fusca_Cardoso2939_JX152598
Vatairea_fusca_Silva1820
Vatairea_fusca_RB123456

The key requirement is that the identifying component be used consistently across loci for the same accession.
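
To make the convention concrete, the sketch below splits labels on underscores and treats the third field as the identifier. It only illustrates the naming scheme described above and is not a description of how catGenes parses labels internally; parse_label() is a hypothetical helper.

```r
# Hypothetical helper: split Genus_species_identifier[_everythingelse] labels
parse_label <- function(labels) {
  parts <- strsplit(labels, "_", fixed = TRUE)
  data.frame(
    label      = labels,
    # First two fields form the species name
    species    = vapply(parts, function(p) paste(p[1:2], collapse = "_"),
                        character(1)),
    # Third field, when present, is taken as the identifier
    identifier = vapply(parts, function(p) if (length(p) >= 3) p[3] else NA_character_,
                        character(1)),
    stringsAsFactors = FALSE
  )
}

parsed <- parse_label(c(
  "Vatairea_fusca_Cardoso2939",
  "Vatairea_fusca_Cardoso2939_JX152598",
  "Vatairea_fusca_Silva1820"
))
print(parsed)
```

Under this reading, the first two labels refer to the same accession because they share both the species name and the identifier, even though one carries an extra trailing field.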

Example of accession-aware labels

The figure below illustrates the kind of sequence labels expected when species are duplicated with multiple accessions.

Basic usage

A typical concatenation workflow with duplicated accessions looks like this:

catdf <- catmultGenes(
  my_alignments,
  maxspp = TRUE,
  shortaxlabel = TRUE,
  missdata = TRUE
)

This compares taxa across loci while taking duplicated accessions into account and returns a list of equalized data frames ready for export.

What `catmultGenes()` returns

As with catfullGenes(), the output is a list of data frames of equal size. Each data frame corresponds to one locus and contains:

a species column with the matched accession-aware taxon labels
a sequence column with the corresponding aligned DNA sequence
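
The shape of this output can be pictured with a small mock object. The column names species and sequence follow the description above, but the exact names, the locus names, the data, and the assumption that labels appear in the same order in every locus are all illustrative; check names(catdf[[1]]) on a real result.

```r
# Mock of the described output: one data frame per locus, equal row counts
toy_catdf <- list(
  ITS = data.frame(
    species  = c("Vatairea_fusca_Cardoso2939", "Vatairea_fusca_Silva1820"),
    sequence = c("acgt-a", "acgtta"),
    stringsAsFactors = FALSE
  ),
  matK = data.frame(
    species  = c("Vatairea_fusca_Cardoso2939", "Vatairea_fusca_Silva1820"),
    sequence = c("ttga", "????"),
    stringsAsFactors = FALSE
  )
)

# Every locus should have the same number of rows
rows_per_locus <- vapply(toy_catdf, nrow, integer(1))

# Assumed here: the same labels in the same order in every locus
same_labels <- all(vapply(
  toy_catdf,
  function(df) identical(df$species, toy_catdf[[1]]$species),
  logical(1)
))

print(rows_per_locus)
print(same_labels)
```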

Understanding `maxspp`

The argument maxspp determines how species without duplicated accessions are handled in the final concatenated output. A typical call is:

catdf <- catmultGenes(
  my_alignments,
  maxspp = TRUE,
  shortaxlabel = TRUE,
  missdata = TRUE
)

When maxspp = TRUE, species that are not duplicated in any alignment are retained in the final concatenated dataset, which maximizes taxon coverage and is usually desirable. When maxspp = FALSE, taxa that are never duplicated may be removed or duplicated, depending on the missdata setting and the structure of the dataset.

In most cases, maxspp = TRUE is the recommended setting.

Understanding `shortaxlabel`

The argument shortaxlabel controls how much of the original label is retained in the output. When shortaxlabel = TRUE, output labels are kept shorter and more standardized:

catdf <- catmultGenes(
  my_alignments,
  maxspp = TRUE,
  shortaxlabel = TRUE,
  missdata = TRUE
)

When shortaxlabel = FALSE, more of the original accession label is retained:

catdf <- catmultGenes(
  my_alignments,
  maxspp = TRUE,
  shortaxlabel = FALSE,
  missdata = TRUE
)

Retaining the full original label is often useful in genomic, voucher-rich, or accession-traceable workflows where the exact identity of each accession must be preserved throughout downstream analyses.

Understanding `missdata`

The argument missdata controls whether taxa lacking one or more loci are retained in the final concatenated dataset.

When missdata = TRUE, incomplete taxa are retained and missing partitions are filled with missing data:

catdf <- catmultGenes(
  my_alignments,
  maxspp = TRUE,
  shortaxlabel = TRUE,
  missdata = TRUE
)

This is often the most useful setting when trying to maximize taxon representation.

When missdata = FALSE, taxa lacking any locus are excluded:

catdf_complete <- catmultGenes(
  my_alignments,
  maxspp = TRUE,
  shortaxlabel = TRUE,
  missdata = FALSE
)

This produces a more conservative matrix containing only accessions with complete locus coverage.
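
The difference between the two settings can be pictured with plain set operations on labels. This is a conceptual sketch, not the package's implementation: it ignores maxspp, outgroup, and label shortening, and the labels and locus names are invented.

```r
# Hypothetical labels present in each of three loci
labels_by_locus <- list(
  ITS  = c("Vatairea_fusca_Cardoso2939", "Vatairea_fusca_Silva1820",
           "Vatairea_guianensis_Lima77"),
  matK = c("Vatairea_fusca_Cardoso2939", "Vatairea_fusca_Silva1820"),
  trnL = c("Vatairea_fusca_Cardoso2939", "Vatairea_guianensis_Lima77")
)

# Accessions found in at least one locus (analogous to missdata = TRUE)
in_any <- Reduce(union, labels_by_locus)

# Accessions found in every locus (analogous to missdata = FALSE)
in_all <- Reduce(intersect, labels_by_locus)

print(in_any)
print(in_all)
```

In this toy case three accessions would be kept when incomplete taxa are allowed, but only one has complete locus coverage.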

Understanding `outgroup`

As in other concatenation workflows, the outgroup argument indicates one or more taxa that should be treated as outgroups in incomplete datasets. For example:

catdf <- catmultGenes(
  my_alignments,
  maxspp = TRUE,
  shortaxlabel = TRUE,
  missdata = TRUE,
  outgroup = "Outgroup_species_identifier"
)

Or with multiple outgroups:

catdf <- catmultGenes(
  my_alignments,
  maxspp = TRUE,
  shortaxlabel = TRUE,
  missdata = TRUE,
  outgroup = c("Outgroup_1", "Outgroup_2")
)

This is useful when incomplete taxa are retained and critical outgroup accessions must remain in the concatenated dataset.

Understanding `verbose`

The argument verbose controls how much progress information is printed during matching and concatenation.

catdf <- catmultGenes(
  my_alignments,
  maxspp = TRUE,
  shortaxlabel = TRUE,
  missdata = TRUE,
  verbose = TRUE
)

To suppress the detailed output:

catdf <- catmultGenes(
  my_alignments,
  maxspp = TRUE,
  shortaxlabel = TRUE,
  missdata = TRUE,
  verbose = FALSE
)

This can be useful for large datasets or rendered tutorials.

A typical workflow

A common duplicated-accession workflow is:

Step 1. Load the alignments

library(catGenes)

genes <- list.files("path_to_DNA_alignments_folder")
my_alignments <- list()

for (i in genes) {
  my_alignments[[i]] <- ape::read.nexus.data(
    paste0("path_to_DNA_alignments_folder/", i)
  )
}

names(my_alignments) <- gsub("[.].*", "", names(my_alignments))

Step 2. Run catmultGenes()

catdf <- catmultGenes(
  my_alignments,
  maxspp = TRUE,
  shortaxlabel = TRUE,
  missdata = TRUE
)

At this point, the dataset is ready for export.

Exporting the concatenated output

After running catmultGenes(), the most common next step is to export the result with writeNexus() or writePhylip().

Export as NEXUS

writeNexus(
  catdf,
  file = "duplicated_accessions_dataset.nex",
  genomics = FALSE,
  interleave = TRUE,
  bayesblock = TRUE
)

Export as PHYLIP

writePhylip(
  catdf,
  file = "duplicated_accessions_dataset.phy",
  genomics = FALSE,
  catalignments = TRUE,
  partitionfile = TRUE
)

These exports produce final concatenated datasets that can be used in downstream phylogenetic analyses.
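
Conceptually, concatenation joins the per-locus sequences of each accession end to end, and the partition boundaries follow from the aligned length of each locus. The base-R sketch below shows that idea on a mock object; it is not what writeNexus() or writePhylip() do internally, and the labels, locus names, and column names are assumptions.

```r
# Mock equalized output: same accessions, same order, in every locus
toy_catdf <- list(
  ITS  = data.frame(species = c("A_b_X1", "A_b_X2"),
                    sequence = c("acgt-a", "acgtta"),
                    stringsAsFactors = FALSE),
  matK = data.frame(species = c("A_b_X1", "A_b_X2"),
                    sequence = c("ttga", "????"),
                    stringsAsFactors = FALSE)
)

# Concatenate the per-locus sequences row by row
concatenated <- do.call(
  paste0,
  unname(lapply(toy_catdf, function(df) df$sequence))
)
names(concatenated) <- toy_catdf[[1]]$species

# Partition boundaries from the aligned length of each locus
locus_len  <- vapply(toy_catdf, function(df) nchar(df$sequence[1]), integer(1))
part_end   <- cumsum(locus_len)
part_start <- part_end - locus_len + 1L
partitions <- paste0(names(toy_catdf), " = ", part_start, "-", part_end)

print(concatenated)
print(partitions)
```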

Preserving full accession labels in exported datasets

When the full original accession labels need to be preserved, use shortaxlabel = FALSE during concatenation.

catdf <- catmultGenes(
  my_alignments,
  maxspp = TRUE,
  shortaxlabel = FALSE,
  missdata = TRUE
)

Then export using settings that preserve this level of identifier detail.

For example:

writeNexus(
  catdf,
  file = "full_labels_dataset.nex",
  genomics = TRUE,
  interleave = TRUE,
  bayesblock = TRUE
)

This is particularly useful in phylogenomic datasets where accession identity needs to remain explicit in the final matrix.

Handling differing identifiers across partitions

When full labels are retained, writeNexus() can preserve differing identifiers across partitions by representing them in a structured way in the output. This helps maintain traceability between the concatenated dataset and the original accessions. The examples below illustrate how these labels can appear in exported matrices.

Removing redundant duplicated accessions after concatenation

In some cases, after concatenation you may decide that duplicated accessions of the same species should be reduced to the best or most complete representative. The function dropSeq() was designed for this purpose.

clean_catdf <- dropSeq(catdf)

This function removes smaller or less informative duplicated sequences, typically favoring accessions with better completeness. The screenshots below illustrate this type of cleanup.
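
The general idea of keeping the most complete accession per species can be sketched in base R as follows. This is an illustration of the concept only, under the assumption that completeness is measured by counting characters other than "?", "-", and "n"; it is not dropSeq()'s actual algorithm, and the sequences are invented.

```r
# Mock concatenated sequences: two accessions of one species and one singleton
seqs <- c(
  Vatairea_fusca_Cardoso2939 = "acgtacgtac",
  Vatairea_fusca_Silva1820   = "acgt??????",
  Vatairea_guianensis_Lima77 = "acg-acgtac"
)

# Count informative characters (anything other than '?', '-', 'n' or 'N')
informative <- nchar(gsub("[?nN-]", "", seqs))
names(informative) <- names(seqs)

# Species name = first two underscore-separated fields
species <- sub("^([^_]+_[^_]+).*$", "\\1", names(seqs))

# Within each species, keep the accession with the most informative characters
keep <- unlist(
  lapply(split(names(seqs), species), function(labs) {
    labs[which.max(informative[labs])]
  }),
  use.names = FALSE
)

print(keep)
```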

Relationship with large genomic datasets

catmultGenes() is particularly useful in large phylogenomic workflows where many loci are available and one or more taxa may be represented by multiple genome-derived accessions. The function has been used efficiently with large multilocus organellar datasets. In such contexts, careful label standardization and accession consistency are especially important.

The example below illustrates a larger concatenated genomic matrix.

Common mistakes

Using catfullGenes() on duplicated-accession datasets

If taxa are duplicated in one or more loci, catfullGenes() is not the correct workflow. In that case, use catmultGenes().

Inconsistent accession identifiers across loci

If the same accession is labeled differently in different alignments, catGenes will treat those labels as different accessions. For example, these may not be matched correctly:

Vatairea_fusca_Cardoso2939
Vatairea_fusca_DCardoso2939
Vatairea_fusca_RB2939
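
A quick way to spot this kind of problem before concatenation is to look for labels that share a species name and a trailing number but differ in spelling. The sketch below is a heuristic written in base R, not a catGenes feature, and it can raise false alarms (for instance, two different collectors who happen to share a collection number), so any hit should be checked by hand.

```r
# Labels for what may be the same accession, as written in three loci
labels <- c(
  ITS  = "Vatairea_fusca_Cardoso2939",
  matK = "Vatairea_fusca_DCardoso2939",
  trnL = "Vatairea_fusca_RB2939"
)

# Exact comparison: how many distinct strings are there?
n_distinct <- length(unique(labels))

# Heuristic key: species name plus the run of digits at the end of the label
species <- sub("^([^_]+_[^_]+).*$", "\\1", labels)
number  <- sub("^.*[^0-9]([0-9]+)$", "\\1", labels)

# Group labels by key and keep groups with more than one spelling
suspect <- split(unname(labels), paste(species, number))
suspect <- suspect[lengths(suspect) > 1]

print(n_distinct)
print(suspect)
```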

Missing accession identifiers in some loci

If duplicated taxa are identified with accession-aware labels in one alignment but not in another, the function may not be able to resolve them correctly.

Mixing label schemes across loci

Using collector numbers in one locus and GenBank accession numbers in another for the same accession can create mismatches unless the same stable identifier is carried across all loci.

Unexpected loss of taxa

If fewer taxa appear than expected, inspect the settings of missdata and maxspp, and make sure the input labels are truly consistent across loci.
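
One way to inspect label consistency is a presence/absence table of labels by locus. The following is a minimal base-R sketch with invented labels; with real data, labels_by_locus could be built with lapply(my_alignments, names).

```r
# Hypothetical labels per locus
labels_by_locus <- list(
  ITS  = c("Vatairea_fusca_Cardoso2939", "Vatairea_fusca_Silva1820"),
  matK = c("Vatairea_fusca_Cardoso2939", "Vatairea_fusca_DSilva1820")
)

# Long table of (label, locus) pairs, then a presence/absence matrix
long <- data.frame(
  label = unlist(labels_by_locus, use.names = FALSE),
  locus = rep(names(labels_by_locus), lengths(labels_by_locus)),
  stringsAsFactors = FALSE
)
presence <- table(long$label, long$locus) > 0

# Labels missing from at least one locus deserve a closer look
incomplete <- rownames(presence)[rowSums(presence) < ncol(presence)]

print(presence)
print(incomplete)
```

Here the two spellings of the Silva accession each appear in only one locus, which is the typical signature of an inconsistent identifier rather than genuinely missing data.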

Recommended practice

For the smoothest use of catmultGenes():

choose one stable identifier for each accession
use that identifier consistently across all loci
separate label components with underscores
load alignments into R as a named list
use maxspp = TRUE in most cases
inspect the resulting output before export
use dropSeq() when cleanup of redundant duplicated accessions is needed

Next step

Once a duplicated-accession dataset has been concatenated with catmultGenes(), the next step is usually to export the result with writeNexus() or writePhylip(), then proceed to downstream phylogenetic analyses and tree visualization.
