About catGenes

Tools for DNA Alignment Concatenation, Sequence Mining, and Phylogenetic Analysis

Overview

catGenes is an R package designed to support reproducible phylogenetic and phylogenomic workflows, from sequence retrieval and alignment preparation to multilocus dataset assembly, evolutionary model selection, Bayesian inference, and phylogenetic tree visualization.

Although originally developed to compare and concatenate multiple DNA alignments, the package now provides a broader ecosystem of tools for retrieving sequences from GenBank, mining loci from plastid and mitochondrial genomes, combining FASTA files, performing automated multiple sequence alignment, converting alignment formats, exporting partitioned datasets, generating MrBayes command blocks, running MrBayes from R, and editing phylogenetic trees with ggtree.

The package aims to simplify and standardize many steps that researchers typically perform manually when preparing multilocus DNA datasets for phylogenetic analysis.


Motivation

Modern phylogenetic workflows often involve multiple software tools, repeated file conversions, and extensive manual editing of sequence labels, partitions, and alignment datasets. These steps can introduce inconsistencies and make analyses difficult to reproduce.

catGenes was developed to streamline these processes within the R environment by integrating several common tasks into a unified workflow. By automating sequence retrieval, dataset preparation, and phylogenetic analysis support, the package lets researchers focus on biological interpretation rather than file manipulation.

The package is particularly useful for:

  • multilocus datasets ranging from Sanger-sequenced loci to genome-scale data
  • plastid and mitochondrial loci mined from organellar genomes
  • datasets containing duplicated taxa or multiple accessions
  • partitioned Bayesian phylogenetic analyses
  • reproducible phylogenetic tree visualization and editing

Package Information

Package: catGenes
Type: R Package
Current Version: 1.0.0

Authors:

Maintainer:
Domingos Cardoso

Source code:
https://github.com/DBOSlab/catGenes
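
Since the source code is hosted on GitHub, installation would typically go through the remotes package. The lines below are a minimal sketch, assuming the repository is laid out as a standard R package at its root; this is an assumption and is not confirmed by this page.

```r
# Assumption: DBOSlab/catGenes is installable as a standard R package from GitHub.
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}
remotes::install_github("DBOSlab/catGenes")

# Load the package once installation succeeds.
library(catGenes)
```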


Key Capabilities

catGenes integrates multiple components of phylogenetic data preparation:

Sequence retrieval

  • Download DNA sequences from GenBank using accession numbers
  • Retrieve sequences using taxonomic queries
  • Mine targeted loci from plastid and mitochondrial genomes

Alignment processing

  • Combine multiple FASTA files
  • Perform automated multiple sequence alignment
  • Convert alignments among FASTA, NEXUS, and PHYLIP formats

Dataset assembly

  • Compare taxa across loci
  • Concatenate multilocus datasets
  • Handle datasets with duplicated taxa or multiple accessions
  • Export partitioned datasets for downstream phylogenetic analyses
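
To make the dataset assembly steps above concrete, here is a minimal base-R sketch of what comparing taxa across loci and concatenating alignments involves. It is not the catGenes API: the function concat_loci, its arguments, and the toy sequences are hypothetical and written only for illustration. Taxa missing from a locus are padded with a missing-data symbol, and the start and end position of each locus is recorded so that the result can be partitioned later.

```r
# Illustrative base-R sketch; concat_loci() is hypothetical, not a catGenes function.
# Each locus is a named character vector: names are taxa, values are aligned sequences.
concat_loci <- function(loci, missing = "?") {
  stopifnot(is.list(loci), length(loci) > 0, !is.null(names(loci)))

  # Alignment length of each locus; sequences within a locus must be equal in length.
  lens <- vapply(loci, function(x) {
    n <- unique(nchar(x))
    if (length(n) != 1) stop("sequences within a locus differ in length")
    n
  }, integer(1))

  # Union of taxa across all loci, sorted for a stable row order.
  taxa <- sort(unique(unlist(lapply(loci, names))))

  # Per locus: keep the sequence where the taxon is present, otherwise pad with `missing`.
  blocks <- Map(function(x, n) {
    out <- rep(strrep(missing, n), length(taxa))
    names(out) <- taxa
    out[names(x)] <- x
    out
  }, loci, lens)

  # Paste the per-locus blocks together, taxon by taxon.
  seqs <- do.call(paste0, unname(blocks))
  names(seqs) <- taxa

  # Record where each locus starts and ends in the concatenated matrix.
  ends <- cumsum(lens)
  partitions <- data.frame(locus = names(loci),
                           start = ends - lens + 1,
                           end = ends,
                           row.names = NULL)

  list(sequences = seqs, partitions = partitions)
}

# Toy data: sp_A lacks matK and sp_C lacks ITS.
loci <- list(
  ITS  = c(sp_A = "ACGT", sp_B = "ACGA"),
  matK = c(sp_B = "TTG", sp_C = "TAG")
)
res <- concat_loci(loci)
print(res$sequences)
print(res$partitions)
```

In this toy case sp_A and sp_C each receive a run of "?" for the locus they lack, and the partition table places ITS at positions 1 to 4 and matK at positions 5 to 7. The sketch assumes one sequence per taxon per locus; datasets with duplicated taxa or multiple accessions need extra handling that it does not attempt.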

Phylogenetic analysis support

  • Perform evolutionary model selection
  • Generate MrBayes command blocks
  • Run MrBayes directly from R
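
As an illustration of what a generated MrBayes command block can look like for a partitioned dataset, the base-R sketch below builds the charset and partition lines from a table of locus boundaries. The function mrbayes_partition_block and the partitions data frame are hypothetical and are not part of catGenes; only the MrBayes commands themselves (charset, partition, set partition) are standard.

```r
# Illustrative base-R sketch; mrbayes_partition_block() is hypothetical, not a catGenes function.
# `partitions` is a data frame with columns locus, start, end (1-based, inclusive).
mrbayes_partition_block <- function(partitions, name = "by_locus") {
  stopifnot(all(c("locus", "start", "end") %in% names(partitions)))
  locus <- as.character(partitions$locus)

  # One charset line per locus, e.g. "charset ITS = 1-4;".
  charsets <- sprintf("  charset %s = %d-%d;", locus,
                      as.integer(partitions$start), as.integer(partitions$end))

  c("begin mrbayes;",
    charsets,
    # Declare the partition scheme and make it the active one.
    sprintf("  partition %s = %d: %s;", name, nrow(partitions),
            paste(locus, collapse = ", ")),
    sprintf("  set partition = %s;", name),
    "end;")
}

parts <- data.frame(locus = c("ITS", "matK"), start = c(1, 5), end = c(4, 7))
block <- mrbayes_partition_block(parts)
writeLines(block)
```

A real analysis block would also carry model settings (lset, prset) and run settings (mcmc), which are left out here because they depend on the model selected for each partition.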

Tree visualization

  • Plot and edit phylogenetic trees using ggtree
  • Produce publication-ready phylogenetic figures

Contributing

Contributions are welcome. You can help improve catGenes by:

  • reporting bugs or requesting features
  • submitting pull requests
  • suggesting improvements to documentation or tutorials

Please open issues or submit pull requests on GitHub:

https://github.com/DBOSlab/catGenes


License

catGenes is open-source software released under the MIT License. Open tools promote reproducibility, transparency, and collaborative scientific development.


Acknowledgements

catGenes is developed within the research environment of the Rio de Janeiro Botanical Garden (JBRJ). We gratefully acknowledge support from CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico). This institutional support enables the development of open computational tools that facilitate biodiversity research, phylogenetics, and reproducible scientific workflows.