Remove redundant accessions

Overview

In multilocus datasets that include duplicated taxa or multiple accessions per species, it is often useful to simplify the final matrix by retaining only the best or most informative accession for each species. In catGenes, this cleanup step is performed with dropSeq().

The function dropSeq() removes smaller or less informative duplicated accessions from a concatenated dataset, typically favoring the accession with the greatest sequence completeness. This is especially helpful after running catmultGenes() and before exporting the final dataset with writeNexus() or writePhylip().

This article explains when to use dropSeq(), how it fits into the broader workflow, and how to inspect the cleaned result before export.

When to use `dropSeq()`

Use dropSeq() when:

one or more species are represented by multiple accessions in the concatenated dataset
you want to reduce redundancy before export
you want to keep the accession with the most complete sequence information
you want a simplified final matrix for downstream phylogenetic analysis

This function is especially relevant after workflows based on catmultGenes().

If your input dataset already contains only one sequence per species per locus, dropSeq() is usually not needed.
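
One quick way to check for duplicated species is to count accessions per species from the taxon labels. The sketch below uses base R only and a toy label vector. The "Genus_species_accession" label format and all names in it are assumptions made for illustration and are not taken from a catGenes object.

```r
# Hypothetical taxon labels in "Genus_species_accession" form (an assumed format).
labels <- c(
  "Vatairea_fusca_A1",
  "Vatairea_fusca_A2",
  "Vatairea_guianensis_B1",
  "Luetzelburgia_auriculata_C1"
)

# Keep only the first two underscore-separated fields as the species name.
species <- sub("^([^_]+_[^_]+).*$", "\\1", labels)

# Count accessions per species and list the species that occur more than once.
accession_counts <- table(species)
duplicated_species <- names(accession_counts[accession_counts > 1])
duplicated_species
```

If `duplicated_species` is empty, there is nothing for a cleanup step to remove.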

Why redundant accessions may remain after concatenation

The purpose of catmultGenes() is to correctly match duplicated accessions across loci. This is often the right strategy during dataset assembly, because it preserves accession identity and avoids premature loss of information.

However, after concatenation you may decide that:

duplicated accessions of the same species are unnecessary in the final matrix
only one accession per species should be retained
the accession with the best data coverage is preferable for downstream analyses

dropSeq() performs this reduction after concatenation, when the full matrix can be evaluated more directly.

Input required by `dropSeq()`

The function dropSeq() expects a concatenated object, typically the result of catmultGenes().

For example:

catdf <- catmultGenes(
  my_alignments,
  maxspp = TRUE,
  shortaxlabel = TRUE,
  missdata = TRUE
)

Basic usage

A minimal workflow is:

clean_catdf <- dropSeq(catdf)

This returns a new concatenated object in which redundant duplicated accessions have been removed.

How `dropSeq()` decides which accession to keep

The general purpose of dropSeq() is to retain the most informative accession when a species is duplicated. In practice, this typically means favoring the accession with:

fewer missing characters
greater sequence completeness
more informative data across the concatenated matrix

The function is especially useful when duplicated accessions differ mainly in how complete their sequences are.
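
The selection idea described above can be sketched in base R on a toy table. This is not the implementation of dropSeq(). It only illustrates one plausible completeness rule, under the assumptions that "?", "-", and "N" all count as missing and that labels follow a "Genus_species_accession" pattern. The data frame and its column names are hypothetical.

```r
# Toy concatenated matrix: one row per accession (hypothetical data and column names).
toy <- data.frame(
  species = c("Vatairea_fusca_A1", "Vatairea_fusca_A2", "Vatairea_guianensis_B1"),
  sequence = c("ACGT??????", "ACGTAC-?GT", "ACGTACGTAC"),
  stringsAsFactors = FALSE
)

# Species name = first two underscore-separated fields of the label.
toy$taxon <- sub("^([^_]+_[^_]+).*$", "\\1", toy$species)

# Completeness = number of characters that are not "?", "-", or "N" (an assumed rule).
toy$n_complete <- nchar(gsub("[?N-]", "", toy$sequence))

# Sort so the most complete accession comes first within each species,
# then keep only the first row per species.
ordered <- toy[order(toy$taxon, -toy$n_complete), ]
best <- ordered[!duplicated(ordered$taxon), ]
best$species
```

In this toy case the second Vatairea fusca accession is kept because it has more non-missing characters than the first.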

Typical workflow with dropSeq()

A common duplicated-accession workflow is:

load the aligned loci into R
concatenate the dataset with catmultGenes()
inspect the concatenated output
run dropSeq() to reduce redundancy
export the cleaned result with writeNexus() or writePhylip()

This allows accession-aware matching to happen first, followed by a deliberate cleanup of the final matrix.

Example workflow

Step 1. Run catmultGenes()

catdf <- catmultGenes(
  my_alignments,
  maxspp = TRUE,
  shortaxlabel = TRUE,
  missdata = TRUE
)

Step 2. Remove redundant duplicated accessions

clean_catdf <- dropSeq(catdf)

Step 3. Export the cleaned dataset

writeNexus(
  clean_catdf,
  file = "cleaned_dataset.nex",
  genomics = FALSE,
  interleave = TRUE,
  bayesblock = TRUE
)

or

writePhylip(
  clean_catdf,
  file = "cleaned_dataset.phy",
  genomics = FALSE,
  catalignments = TRUE,
  partitionfile = TRUE
)

Visual example of the cleanup step

The screenshots below illustrate a concatenated dataset before and after applying dropSeq(). The first screenshot shows a concatenated dataset in which some species are still represented by multiple accessions:

After running dropSeq(), redundant accessions are removed and the dataset becomes simpler:

Relationship with `shortaxlabel`

The behavior of dropSeq() is easiest to interpret when the concatenated object already has consistent and informative labels. Whether the labels are shorter or more complete depends on how catmultGenes() was run.

For example:

catdf <- catmultGenes(
  my_alignments,
  maxspp = TRUE,
  shortaxlabel = FALSE,
  missdata = TRUE
)

If full labels are retained, the cleaned output may preserve more accession detail. If short labels were used, the cleaned output will be correspondingly simplified.

Relationship with `missdata`

dropSeq() is often most useful when catmultGenes() was run with missdata = TRUE, because incomplete accessions are retained during concatenation and can then be evaluated for removal afterward.

For example:

catdf <- catmultGenes(
  my_alignments,
  maxspp = TRUE,
  shortaxlabel = TRUE,
  missdata = TRUE
)

clean_catdf <- dropSeq(catdf)

This workflow first maximizes taxon representation and then reduces redundancy in a more informed way.

When not to use `dropSeq()`

You may not want to use dropSeq() when:

accession-level representation is biologically important
multiple accessions per species are intentionally being retained for downstream analyses
the goal is to preserve voucher-level or population-level diversity
no duplicated taxa remain after concatenation

In such cases, the full accession-aware output from catmultGenes() may already be the appropriate final dataset.

Typical use cases

dropSeq() is especially useful when:

building a species-level phylogeny from accession-rich input data
simplifying a matrix after accession-aware matching
selecting the best accession of each species for final export
reducing redundant terminals before downstream model selection or phylogenetic inference

It is less relevant in workflows focused explicitly on within-species sampling or accession-level phylogenetic structure.

Common issues

No duplicated accessions are removed

If dropSeq() does not appear to change the dataset, this may mean:

no duplicated species remained after concatenation
the labels are not formatted in a way that allows duplicated taxa to be recognized
the input object did not contain the expected accession-level redundancy

Labels are inconsistent

If the same species is labeled inconsistently across loci or across accessions, the cleanup step may not behave as intended. Always standardize labels before concatenation.
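
A minimal base R sketch of label standardization is shown below. The label vector and the cleaning rules (trim whitespace, replace internal spaces with underscores, capitalize the genus) are assumptions chosen for illustration. Adapt them to the conventions actually used in your alignments.

```r
# Hypothetical labels for the same species, written inconsistently.
raw_labels <- c("Vatairea fusca_A1", " Vatairea_fusca_A2", "vatairea_fusca_A3")

# Remove leading and trailing whitespace.
clean_labels <- trimws(raw_labels)

# Replace any run of internal whitespace with a single underscore.
clean_labels <- gsub("[[:space:]]+", "_", clean_labels)

# Capitalize the first letter so the genus is written consistently.
clean_labels <- paste0(
  toupper(substring(clean_labels, 1, 1)),
  substring(clean_labels, 2)
)

# After cleaning, all three labels should share the same species prefix.
unique(sub("^([^_]+_[^_]+).*$", "\\1", clean_labels))
```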

Important accessions were removed unintentionally

Because dropSeq() is designed to simplify the dataset, always inspect the cleaned result before export, especially if some accessions are biologically important.
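
One way to make this inspection concrete is to compare the taxon labels before and after the cleanup. The sketch below uses two toy label vectors in base R. With real data, the labels would first have to be extracted from the original and cleaned objects, and how to do that depends on their structure, which is not assumed here. All names are hypothetical.

```r
# Hypothetical labels before and after the cleanup step.
labels_before <- c("Vatairea_fusca_A1", "Vatairea_fusca_A2", "Vatairea_guianensis_B1")
labels_after <- c("Vatairea_fusca_A2", "Vatairea_guianensis_B1")

# Accessions present before the cleanup but absent afterward.
removed <- setdiff(labels_before, labels_after)

# Accessions you consider biologically important (for example, type vouchers).
important <- c("Vatairea_fusca_A1")

# Important accessions that were dropped and may need to be restored by hand.
lost_important <- intersect(important, removed)
lost_important
```

A non-empty `lost_important` is a signal to review the cleaned dataset before export.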

Recommended practice

For the smoothest use of dropSeq():

use it after catmultGenes(), not before concatenation
inspect the concatenated output before deciding to simplify it
verify that labels are standardized and accession-aware
compare the cleaned and original outputs before export
use dropSeq() only when species-level simplification is the intended goal

Next step

Once redundant accessions have been removed with dropSeq(), the next step is usually to export the cleaned dataset with writeNexus() or writePhylip(), then proceed to downstream phylogenetic analyses.
