Remove redundant accessions

Overview

In multilocus datasets that include duplicated taxa or multiple accessions per species, it is often useful to simplify the final matrix by retaining only the best or most informative accession for each species. In catGenes, this cleanup step is performed with dropSeq().

The function dropSeq() removes the shorter or less informative of the duplicated accessions from a concatenated dataset, typically keeping the accession with the greatest sequence completeness. This is especially helpful after running catmultGenes() and before exporting the final dataset with writeNexus() or writePhylip().

This article explains when to use dropSeq(), how it fits into the broader workflow, and how to inspect the cleaned result before export.

When to use dropSeq()

Use dropSeq() when:

  • one or more species are represented by multiple accessions in the concatenated dataset
  • you want to reduce redundancy before export
  • you want to keep the accession with the most complete sequence information
  • you want a simplified final matrix for downstream phylogenetic analysis

This function is especially relevant after workflows based on catmultGenes().

If your input dataset already contains only one sequence per species per locus, dropSeq() is usually not needed.

Why redundant accessions may remain after concatenation

The purpose of catmultGenes() is to correctly match duplicated accessions across loci. This is often the right strategy during dataset assembly, because it preserves accession identity and avoids premature loss of information.

However, after concatenation you may decide that:

  • duplicated accessions of the same species are unnecessary in the final matrix
  • only one accession per species should be retained
  • the accession with the best data coverage is preferable for downstream analyses

dropSeq() performs this reduction after concatenation, when the full matrix can be evaluated more directly.

Input required by dropSeq()

The function dropSeq() expects a concatenated object, typically the result of catmultGenes().

For example:

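catdf <- catmultGenes(
  my_alignments,          # the aligned loci, already loaded into R
  maxspp = TRUE,
  shortaxlabel = TRUE,    # use short labels in the concatenated output
  missdata = TRUE         # retain incomplete accessions during concatenation
)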

Basic usage

A minimal workflow is:

clean_catdf <- dropSeq(catdf)

This returns a new concatenated object in which redundant duplicated accessions have been removed.

How dropSeq() decides which accession to keep

The general purpose of dropSeq() is to retain the most informative accession when a species is duplicated. In practice, this typically means favoring the accession with:

  • fewer missing characters
  • greater sequence completeness
  • more informative data across the concatenated matrix

The function is especially useful when duplicated accessions differ mainly in how complete their sequences are.
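
The selection idea described above can be sketched in base R. This is only an illustration of "keep the most complete accession per species", not the code that dropSeq() runs. The toy sequences, the label format Genus_species_accession, and the set of characters counted as missing ("?", "-", "N") are all assumptions made for the example.

```r
# Illustrative only: NOT catGenes code. Labels, sequences, and the
# missing-data symbols are hypothetical.
seqs <- c(
  Aus_bus_acc1 = "ACGT??--ACGT",
  Aus_bus_acc2 = "ACGTACGTACGT",
  Bus_cus_acc3 = "ACG-ACGTAC??"
)

# Species name = label without its final "_accession" field (assumed format).
species <- sub("_[^_]+$", "", names(seqs))

# Number of missing characters ("?", "-", "N" or "n") in each sequence.
n_missing <- nchar(seqs) - nchar(gsub("[?Nn-]", "", seqs))

# Within each species, pick the index of the accession with the fewest
# missing characters.
keep <- unlist(tapply(seq_along(seqs), species,
                      function(i) i[which.min(n_missing[i])]))

# Keep the selected accessions in their original order.
best <- seqs[sort(keep)]
names(best)
```

In this toy case the two Aus_bus accessions differ only in completeness, so the one without missing characters is kept. Real datasets with ties or subspecies-level labels would need a more careful rule.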

Typical workflow with dropSeq()

A common duplicated-accession workflow is:

  • load the aligned loci into R
  • concatenate the dataset with catmultGenes()
  • inspect the concatenated output
  • run dropSeq() to reduce redundancy
  • export the cleaned result with writeNexus() or writePhylip()

This allows accession-aware matching to happen first, followed by a deliberate cleanup of the final matrix.

Example workflow

Step 1. Run catmultGenes()

catdf <- catmultGenes(
  my_alignments,
  maxspp = TRUE,
  shortaxlabel = TRUE,
  missdata = TRUE
)

Step 2. Remove redundant duplicated accessions

clean_catdf <- dropSeq(catdf)

Step 3. Export the cleaned dataset

writeNexus(
  clean_catdf,
  file = "cleaned_dataset.nex",
  genomics = FALSE,
  interleave = TRUE,
  bayesblock = TRUE
)

or

writePhylip(
  clean_catdf,
  file = "cleaned_dataset.phy",
  genomics = FALSE,
  catalignments = TRUE,
  partitionfile = TRUE
)

Visual example of the cleanup step

The screenshots below illustrate a concatenated dataset before and after applying dropSeq(). The first screenshot shows a concatenated dataset in which some species are still represented by multiple accessions:

After running dropSeq(), redundant accessions are removed and the dataset becomes simpler:

Relationship with shortaxlabel

The behavior of dropSeq() is easiest to interpret when the concatenated object already has consistent and informative labels. Whether the labels are short or complete depends on the shortaxlabel argument used when catmultGenes() was run.

For example:

catdf <- catmultGenes(
  my_alignments,
  maxspp = TRUE,
  shortaxlabel = FALSE,
  missdata = TRUE
)

With shortaxlabel = FALSE, full labels are retained and the cleaned output may preserve more accession detail. With shortaxlabel = TRUE, the cleaned output will be correspondingly simplified.

Relationship with missdata

dropSeq() is often most useful when catmultGenes() was run with missdata = TRUE, because incomplete accessions are retained during concatenation and can then be evaluated for removal afterward.

For example:

catdf <- catmultGenes(
  my_alignments,
  maxspp = TRUE,
  shortaxlabel = TRUE,
  missdata = TRUE
)

clean_catdf <- dropSeq(catdf)

This workflow first maximizes taxon representation and then reduces redundancy in a more informed way.

When not to use dropSeq()

You may not want to use dropSeq() when:

  • accession-level representation is biologically important
  • multiple accessions per species are intentionally being retained for downstream analyses
  • the goal is to preserve voucher-level or population-level diversity
  • no duplicated taxa remain after concatenation

In such cases, the full accession-aware output from catmultGenes() may already be the appropriate final dataset.

Typical use cases

dropSeq() is especially useful when:

  • building a species-level phylogeny from accession-rich input data
  • simplifying a matrix after accession-aware matching
  • selecting the best accession of each species for final export
  • reducing redundancy before downstream model selection or phylogenetic inference

It is less relevant in workflows focused explicitly on within-species sampling or accession-level phylogenetic structure.

Common issues

No duplicated accessions are removed

If dropSeq() does not appear to change the dataset, this may mean:

  • no duplicated species remained after concatenation
  • the labels are not formatted in a way that allows duplicated taxa to be recognized
  • the input object did not contain the expected accession-level redundancy
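
To tell these cases apart, it can help to count how many accessions each species has. The sketch below works on a plain character vector of labels. How you extract labels from your concatenated object is left open, and the Genus_species_accession label format is an assumption for the example.

```r
# Hypothetical labels; in practice, take them from your concatenated object.
labels <- c("Aus_bus_acc1", "Aus_bus_acc2", "Aus_cus_acc3", "Bus_dus_acc4")

# Species name = label without its final "_accession" field (assumed format).
species <- sub("_[^_]+$", "", labels)

# Number of accessions per species.
counts <- table(species)

# Species represented by more than one accession.
duplicated_species <- names(counts[counts > 1])
duplicated_species
```

If this returns an empty result for your own labels, either no redundancy remains or the labels do not follow the format assumed here.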

Labels are inconsistent

If the same species is labeled inconsistently across loci or across accessions, the cleanup step may not behave as intended. Always standardize labels before concatenation.
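
A minimal base-R sketch of label standardization is shown here. The helper standardize_labels() is hypothetical and not part of catGenes. It only trims whitespace and normalizes separators, so spelling variants or synonyms would still need manual correction.

```r
# Hypothetical helper, not part of catGenes.
standardize_labels <- function(x) {
  x <- trimws(x)                     # drop leading and trailing whitespace
  x <- gsub("[[:space:]]+", "_", x)  # runs of whitespace become one underscore
  x <- gsub("_+", "_", x)            # collapse repeated underscores
  x
}

raw <- c(" Aus bus_acc1", "Aus_bus__acc2", "Aus  bus_acc3 ")
clean <- standardize_labels(raw)
clean
```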

Important accessions were removed unintentionally

Because dropSeq() is designed to simplify the dataset, always inspect the cleaned result before export, especially if some accessions are biologically important.
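
One simple way to inspect the result is to compare the labels before and after cleanup. The sketch below uses two hypothetical label vectors standing in for the labels of catdf and clean_catdf. It assumes that labels are not renamed during cleanup, and the must_keep vector is an illustrative list of accessions you consider important.

```r
# Hypothetical label vectors for the dataset before and after cleanup.
before <- c("Aus_bus_acc1", "Aus_bus_acc2", "Aus_cus_acc3")
after  <- c("Aus_bus_acc2", "Aus_cus_acc3")

# Accessions that were dropped by the cleanup step.
removed <- setdiff(before, after)

# Labels that appear only after cleanup (should normally be none).
unexpected <- setdiff(after, before)

# Accessions you wanted to keep but that were dropped (illustrative list).
must_keep <- c("Aus_bus_acc1")
lost <- intersect(must_keep, removed)

removed
lost
```

A non-empty lost vector is a signal to revisit the cleanup before export.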

Next step

Once redundant accessions have been removed with dropSeq(), the next step is usually to export the cleaned dataset with writeNexus() or writePhylip(), then proceed to downstream phylogenetic analyses.