Combine FASTA files and align sequences

Overview

A common step in phylogenetic workflows is to gather multiple sequence files, organize them into locus-specific datasets, and perform multiple sequence alignment before downstream concatenation or phylogenetic analysis. In catGenes, this workflow is supported mainly by two functions:

  • combineFASTA() to combine multiple FASTA files into a single file
  • alignSeqs() to perform automated multiple sequence alignment

These functions are especially useful when sequences have already been retrieved from GenBank with mineSeq() or mineTaxa(), or mined from plastomes or mitochondrial genomes with minePlastome() or mineMitochondrion().

This article explains when to combine FASTA files, how to run alignments, how to choose output formats, and how to prepare aligned files for later use in catGenes.

Typical workflow

A common sequence-processing workflow is:

  1. retrieve or mine sequences
  2. save them as FASTA files
  3. combine FASTA files when needed
  4. align the sequences
  5. save aligned output in NEXUS, FASTA, or PHYLIP
  6. load the alignments into R for downstream concatenation

Not every project requires the combineFASTA() step, but it is often useful when sequences are spread across multiple files and need to be merged before alignment.

When to use combineFASTA()

Use combineFASTA() when:

  • the sequences for a locus are split across multiple FASTA files
  • you want to merge separate downloads into one file before alignment
  • you want a single combined FASTA file for inspection or downstream processing
  • you want to organize several sequence batches into a unified locus dataset

If your sequences are already in one FASTA file per locus, you may go directly to alignSeqs().
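
To decide whether a locus needs combining, it can help to group the available files by locus first. The sketch below uses base R only; the file names and the "<locus>_<batch>.fasta" naming pattern are assumptions made for illustration, not a catGenes convention.

```r
# Hypothetical file names; in practice these could come from list.files().
fasta_files <- c("ITS_part1.fasta", "ITS_part2.fasta", "matK.fasta")

# Locus name = everything before the first underscore or dot (assumed pattern).
locus <- sub("[_.].*$", "", basename(fasta_files))

# Group the files by locus.
files_by_locus <- split(fasta_files, locus)

# Loci spread over more than one file are candidates for combineFASTA().
needs_combining <- names(files_by_locus)[lengths(files_by_locus) > 1]
print(needs_combining)
```

Loci that do not appear in needs_combining already have one file each and can go straight to alignment.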

Basic usage of combineFASTA()

A simple workflow looks like this:

library(catGenes)

result <- combineFASTA(
  input_files = c("gene1.fasta", "gene2.fasta", "gene3.fasta")
)

This reads the listed files, combines all sequences into a single object, and by default saves the result to disk.
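
Conceptually, combining FASTA files amounts to stacking their records in input order. The base R sketch below illustrates that idea with two toy files in a temporary folder. It is not the implementation used by combineFASTA(), and the file names and sequences are invented.

```r
# Two small FASTA files written to a temporary directory for illustration.
tmp <- tempfile("combine_demo_")
dir.create(tmp)
writeLines(c(">Taxon_A", "ACGTACGT"), file.path(tmp, "part1.fasta"))
writeLines(c(">Taxon_B", "ACGTTCGT"), file.path(tmp, "part2.fasta"))

# Read every file and stack the records in the order the files are listed.
input_files <- file.path(tmp, c("part1.fasta", "part2.fasta"))
combined_lines <- unlist(lapply(input_files, readLines))

# Write the stacked records to a single FASTA file.
combined_path <- file.path(tmp, "combined.fasta")
writeLines(combined_lines, combined_path)

# Each record has exactly one header line, so counting headers counts records.
n_records <- sum(startsWith(combined_lines, ">"))
print(n_records)
```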

Choosing the output file name

You can specify a custom file name for the combined output.

result <- combineFASTA(
  input_files = c("data/part1.fasta", "data/part2.fasta"),
  output_file = "combined_sequences.fasta"
)

This is especially useful when the combined file should reflect a locus name or project name.

Returning the combined sequences without saving

If you want to inspect the combined sequences in R without immediately writing them to disk, set save = FALSE.

result <- combineFASTA(
  input_files = c("temp1.fasta", "temp2.fasta"),
  save = FALSE,
  verbose = TRUE
)

In this case, the function returns the combined sequence object and summary information without creating an output file.

Inspecting the result of combineFASTA()

The result returned by combineFASTA() includes:

  • sequences: a DNAbin object containing all combined sequences
  • summary: a data frame with summary information
  • output_path: the output file path when saving is enabled

Notes on duplicate sequences

combineFASTA() does not remove duplicate sequences. It simply combines all sequences from the files you specify.

This means that:

  • repeated accessions remain in the combined file
  • duplicate taxa are not filtered automatically
  • downstream alignment and inspection may still be needed before concatenation

This behavior is useful because it preserves the original contents of the input files.
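
Because duplicates are kept, a quick check of the header lines can show whether any label occurs more than once. The following is a minimal base R sketch; the labels are invented, and in practice the lines would be read from the combined file with readLines().

```r
# Example content of a combined FASTA file (invented labels and sequences).
fasta_lines <- c(
  ">Taxon_A_AB000001", "ACGT",
  ">Taxon_B_AB000002", "ACGA",
  ">Taxon_A_AB000001", "ACGT"
)
# In practice: fasta_lines <- readLines("combined_sequences.fasta")

# Header lines start with ">"; strip the marker to obtain the labels.
seq_labels <- sub("^>", "", fasta_lines[startsWith(fasta_lines, ">")])

# Labels that occur more than once in the combined file.
dup_labels <- unique(seq_labels[duplicated(seq_labels)])
print(dup_labels)
```

This only detects identical labels. Two records for the same taxon with different accession numbers would need a comparison on the taxon part of the label instead.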

When to use alignSeqs()

Use alignSeqs() when you have one or more FASTA files containing unaligned sequences and want to generate aligned outputs for downstream phylogenetic workflows.

The function uses the msa package and supports:

  • ClustalW
  • Muscle

It can write aligned results in:

  • NEXUS
  • FASTA
  • PHYLIP
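
Since alignment relies on msa, it can be worth confirming that the package is available before a long run. msa is distributed through Bioconductor rather than CRAN. Whether catGenes installs it automatically is not stated here, so the check below is a precaution, not a required step.

```r
# TRUE if the msa package can be loaded, FALSE otherwise.
has_msa <- requireNamespace("msa", quietly = TRUE)

if (!has_msa) {
  message("Package 'msa' was not found. It is available from Bioconductor:")
  message('  install.packages("BiocManager"); BiocManager::install("msa")')
}
```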

Basic alignment workflow

A minimal alignment workflow looks like this:

alignSeqs(
  filepath = "path_to_fasta_files",
  method = "ClustalW",
  format = "NEXUS"
)

This reads one or more FASTA files from the specified directory, aligns them, and writes the aligned output in NEXUS format.

Choosing an alignment method

alignSeqs() currently supports two alignment algorithms:

  • ClustalW
  • Muscle

For example, using ClustalW:

alignSeqs(
  filepath = "path_to_fasta_files",
  method = "ClustalW",
  format = "NEXUS"
)

Or using Muscle:

alignSeqs(
  filepath = "path_to_fasta_files",
  method = "Muscle",
  format = "NEXUS"
)

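Both methods are widely used. Which one performs better depends on the dataset, for example on the number of sequences, their length, and how divergent they are, so running both on the same locus and comparing the resulting alignments is a practical way to choose.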

Choosing the output format

The argument format determines how the aligned files are written. Supported output formats are:

  • NEXUS
  • FASTA
  • PHYLIP

For example, to write aligned files in FASTA format:

alignSeqs(
  filepath = "path_to_fasta_files",
  method = "ClustalW",
  format = "FASTA"
)

Or in PHYLIP format:

alignSeqs(
  filepath = "path_to_fasta_files",
  method = "Muscle",
  format = "PHYLIP"
)

In most catGenes concatenation workflows, NEXUS is the most convenient format.

Adjusting the gap opening penalty

The argument gapOpening controls the gap opening penalty used by the alignment algorithm.

alignSeqs(
  filepath = "path_to_fasta_files",
  method = "ClustalW",
  gapOpening = "default",
  format = "NEXUS"
)

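In many cases, the default setting is adequate. A higher gap opening penalty makes gaps more costly, so the aligner tends to open fewer of them; a lower penalty allows more. Because ClustalW and Muscle do not necessarily use the same penalty scale, a value tuned for one method should not be assumed to suit the other.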

A complete sequence-processing example

A typical workflow might look like this:

Step 1. Retrieve sequences

seqs <- mineSeq(
  inputdf = my_accession_table,
  gb.colnames = c("ITS", "matK", "rbcL"),
  save = TRUE,
  filename = "GenBank_seqs",
  dir = "RESULTS_mineSeq"
)

Step 2. Combine FASTA files if needed

combined <- combineFASTA(
  input_files = c(
    "RESULTS_mineSeq/09Mar2026/locus_part1.fasta",
    "RESULTS_mineSeq/09Mar2026/locus_part2.fasta"
  ),
  output_file = "ITS_combined.fasta",
  dir = "RESULTS_combineFASTA"
)

Step 3. Align the combined sequences

alignSeqs(
  filepath = "RESULTS_combineFASTA/09Mar2026",
  method = "ClustalW",
  format = "NEXUS",
  dir = "RESULTS_alignSeqs"
)

This produces aligned files ready for loading into R and later concatenation.

Aligning multiple FASTA files from one folder

If a directory contains several FASTA files, alignSeqs() can process them together.

alignSeqs(
  filepath = "RESULTS_mineSeq/09Mar2026",
  method = "Muscle",
  format = "NEXUS",
  dir = "RESULTS_alignSeqs"
)

This is useful when each file represents a different locus and all need to be aligned before loading into R.

Output directories

By default, aligned files are written to a directory such as:

RESULTS_alignSeqs/09Mar2026/

Similarly, combined FASTA outputs may be written to:

RESULTS_combineFASTA/09Mar2026/

This date-based structure helps keep results organized across independent runs.
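
When scripting several steps in one session, the date-based folder name can be built in R instead of being typed by hand. The sketch below assumes the folder name follows a day, abbreviated month, year pattern, as in the examples above. The abbreviated month from "%b" depends on the R session locale, so the result matches names such as 09Mar2026 only in an English locale.

```r
# Folder name for today's run, assuming a day + abbreviated month + year pattern.
run_folder <- format(Sys.Date(), "%d%b%Y")
align_dir <- file.path("RESULTS_alignSeqs", run_folder)
print(align_dir)

# List the date folders left by earlier runs, if the parent directory exists.
if (dir.exists("RESULTS_alignSeqs")) {
  previous_runs <- list.dirs("RESULTS_alignSeqs", recursive = FALSE, full.names = FALSE)
  print(previous_runs)
}
```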

File naming after alignment

If filename is not specified in alignSeqs(), output files are usually named based on the original input file names, with an added identifier such as aligned. This makes it easier to trace each aligned output back to the original input locus.

Inspecting aligned outputs

After running alignSeqs(), it is good practice to inspect the resulting alignment files before using them in downstream workflows.

Questions to check include:

  • are all expected sequences present?
  • do sequence labels look correct and consistent?
  • are the alignments the expected length?
  • were the files written in the intended format?

Once the outputs are confirmed, they can be loaded into R for concatenation.
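
Some of these checks can be scripted. The sketch below works on a toy alignment stored as a named list with one character vector per sequence, which is the structure returned by ape::read.nexus.data(). check_alignment() is a hypothetical helper written for this example, not a catGenes function.

```r
# A toy alignment: a named list with one character vector per sequence.
aln <- list(
  Taxon_A = c("a", "c", "g", "t", "-"),
  Taxon_B = c("a", "c", "g", "t", "a"),
  Taxon_C = c("a", "c", "-", "t", "a")
)

# Hypothetical helper: summarise the basic properties of one alignment.
check_alignment <- function(aln) {
  seq_lengths <- lengths(aln)
  list(
    n_sequences       = length(aln),
    equal_length      = length(unique(seq_lengths)) == 1L,
    alignment_length  = max(seq_lengths),
    duplicated_labels = names(aln)[duplicated(names(aln))]
  )
}

report <- check_alignment(aln)
str(report)
```

If equal_length is FALSE, the file probably does not contain a finished alignment and should be re-examined before concatenation.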

Loading aligned files into R

If aligned outputs were written in NEXUS format, they can be loaded directly with ape::read.nexus.data().

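# List the aligned files written by alignSeqs()
genes <- list.files("RESULTS_alignSeqs/09Mar2026")
my_alignments <- list()

# Read each alignment and store it under its file name
for (i in genes) {
  my_alignments[[i]] <- ape::read.nexus.data(
    paste0("RESULTS_alignSeqs/09Mar2026/", i)
  )
}

# Drop everything from the first dot onward so each element is named after its file
names(my_alignments) <- gsub("[.].*", "", names(my_alignments))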

This creates the named list structure expected by catGenes concatenation functions.
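
If the output folder may contain files other than alignments, a variant of the loop above can filter by extension. The sketch below writes two toy NEXUS files to a temporary folder so that it runs on its own. The ".nex" extension is an assumption and should be adjusted to whatever extension your aligned files actually have.

```r
# Write two toy alignments to a temporary folder (invented data).
aln_dir <- tempfile("aligned_demo_")
dir.create(aln_dir)
toy <- list(Taxon_A = c("a", "c", "g", "t"), Taxon_B = c("a", "c", "g", "a"))
ape::write.nexus.data(toy, file = file.path(aln_dir, "ITS.nex"), interleaved = FALSE)
ape::write.nexus.data(toy, file = file.path(aln_dir, "matK.nex"), interleaved = FALSE)
writeLines("run notes", file.path(aln_dir, "README.txt"))  # a stray non-NEXUS file

# Keep only files ending in .nex (assumed extension) and ask for full paths.
nexus_files <- list.files(aln_dir, pattern = "\\.nex$", full.names = TRUE)

# Read each file and name the list elements after the file name without extension.
my_alignments <- lapply(nexus_files, ape::read.nexus.data)
names(my_alignments) <- tools::file_path_sans_ext(basename(nexus_files))
print(names(my_alignments))
```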

Typical workflow after alignment

Once the alignments are loaded into R, the next step is usually:

  • catfullGenes() for datasets with one sequence per species per locus
  • catmultGenes() for datasets containing duplicated taxa or multiple accessions

For example:

catdf <- catfullGenes(
  my_alignments,
  shortaxlabel = TRUE,
  missdata = TRUE
)

or

catdf <- catmultGenes(
  my_alignments,
  maxspp = TRUE,
  shortaxlabel = TRUE,
  missdata = TRUE
)
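
To decide between the two functions, it helps to check whether any locus contains more than one sequence for the same species. The sketch below assumes labels of the form "Genus_species_identifier"; both this pattern and the toy data are assumptions, and species_of() is a hypothetical helper.

```r
# Toy alignments; only the labels matter for this check.
my_alignments <- list(
  ITS  = list(Genus_alpha_v1 = "acgt", Genus_alpha_v2 = "acgt", Genus_beta_v3 = "acga"),
  matK = list(Genus_alpha_v1 = "ttga", Genus_beta_v3 = "ttgc")
)

# Species name = first two underscore-separated fields of a label (assumed pattern).
species_of <- function(seq_labels) sub("^([^_]+_[^_]+).*$", "\\1", seq_labels)

# For each locus, is any species represented by more than one sequence?
has_repeats <- vapply(
  my_alignments,
  function(aln) any(duplicated(species_of(names(aln)))),
  logical(1)
)
print(has_repeats)
```

If any locus shows repeated species, catmultGenes() is the more likely fit; otherwise catfullGenes() applies.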

Common issues

Input files are not in FASTA format

alignSeqs() expects FASTA inputs. If your files are in another format, convert them to FASTA before alignment.
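
A simple pre-check is to confirm that the first non-empty line of each file is a FASTA header. looks_like_fasta() below is a hypothetical helper, not part of catGenes, and it tests only this one property rather than validating the whole file.

```r
# Hypothetical helper: TRUE if the first non-empty line starts with ">".
looks_like_fasta <- function(path) {
  file_lines <- readLines(path, warn = FALSE)
  file_lines <- file_lines[nzchar(trimws(file_lines))]
  length(file_lines) > 0 && startsWith(file_lines[1], ">")
}

# Two toy files in a temporary folder: one FASTA, one not.
tmp <- tempfile("format_demo_")
dir.create(tmp)
good <- file.path(tmp, "locus.fasta")
bad  <- file.path(tmp, "locus.txt")
writeLines(c(">Taxon_A", "ACGT"), good)
writeLines(c("#NEXUS", "begin data;"), bad)

print(vapply(c(good, bad), looks_like_fasta, logical(1), USE.NAMES = FALSE))
```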

Alignment method is not specified correctly

The method argument must match one of the two supported algorithms: ClustalW or Muscle.

Sequence labels are inconsistent

Alignment can still run when labels are inconsistent, but later concatenation may fail or produce unexpected results. Standardize labels before or immediately after retrieval.
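
One way to spot inconsistent labels is to list those that do not occur in every locus. Such labels may reflect genuinely missing data, but they can also reveal spelling or formatting variants of the same taxon. The sketch uses invented data and base R only.

```r
# Toy alignments; "Genus beta" in matK is a formatting variant of "Genus_beta".
my_alignments <- list(
  ITS  = list(Genus_alpha = "acgt", Genus_beta = "acga"),
  matK = list(Genus_alpha = "ttga", "Genus beta" = "ttgc")
)

# All distinct labels across loci.
all_labels <- unique(unlist(lapply(my_alignments, names)))

# Presence/absence matrix: one row per label, one column per locus.
presence <- vapply(
  my_alignments,
  function(aln) all_labels %in% names(aln),
  logical(length(all_labels))
)
rownames(presence) <- all_labels

# Labels missing from at least one locus deserve a closer look.
not_shared <- all_labels[rowSums(presence) < length(my_alignments)]
print(not_shared)
```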

Duplicate sequences are unintentionally combined

Because combineFASTA() keeps all sequences, duplicated entries may persist into the alignment stage. Inspect combined files when this matters for downstream analysis.

Output format is not appropriate for downstream use

If your next step is concatenation with catGenes, NEXUS is often the most convenient output format.

Next step

Once your sequences have been combined and aligned, the next step is usually to:

  • load the alignments into R
  • choose catfullGenes() or catmultGenes()
  • concatenate loci for downstream phylogenetic analysis

See the next tutorials for loading alignments and concatenating multilocus datasets with catGenes.