About catGenes
Tools for DNA Alignment Concatenation, Sequence Mining, and Phylogenetic Analysis
Overview
catGenes is an R package designed to support reproducible phylogenetic and phylogenomic workflows, from sequence retrieval and alignment preparation to multilocus dataset assembly, evolutionary model selection, Bayesian inference, and phylogenetic tree visualization.
Although originally developed to compare and concatenate multiple DNA alignments, the package now provides a broader ecosystem of tools for retrieving sequences from GenBank, mining loci from plastid and mitochondrial genomes, combining FASTA files, performing automated multiple sequence alignment, converting alignment formats, exporting partitioned datasets, generating MrBayes command blocks, running MrBayes from R, and editing phylogenetic trees with ggtree.
The package aims to simplify and standardize many steps that researchers typically perform manually when preparing multilocus DNA datasets for phylogenetic analysis.
Motivation
Modern phylogenetic workflows often involve multiple software tools, repeated file conversions, and extensive manual editing of sequence labels, partitions, and alignment datasets. These steps can introduce inconsistencies and make analyses difficult to reproduce.
catGenes was developed to streamline these processes within the R environment by integrating several common tasks into a unified workflow. By automating sequence retrieval, dataset preparation, and phylogenetic analysis support, the package lets researchers focus on biological interpretation rather than file manipulation.
The package is particularly useful for:
- multilocus datasets, from Sanger-sequenced loci to genome-scale data
- plastid and mitochondrial loci mined from organellar genomes
- datasets containing duplicated taxa or multiple accessions
- partitioned Bayesian phylogenetic analyses
- reproducible phylogenetic tree visualization and editing
Package Information
Package: catGenes
Type: R Package
Current Version: 1.0.0
Authors:
Maintainer: Domingos Cardoso (domingoscardoso@jbrj.gov.br)
Source code: https://github.com/DBOSlab/catGenes
Key capabilities
catGenes integrates multiple components of phylogenetic data preparation:
Sequence retrieval
- Download DNA sequences from GenBank using accession numbers
- Retrieve sequences using taxonomic queries
- Mine targeted loci from plastid and mitochondrial genomes
Alignment processing
- Combine multiple FASTA files
- Perform automated multiple sequence alignment
- Convert alignments among FASTA, NEXUS, and PHYLIP formats
Dataset assembly
- Compare taxa across loci
- Concatenate multilocus datasets
- Handle datasets with duplicated taxa or multiple accessions
- Export partitioned datasets for downstream phylogenetic analyses
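To make the dataset-assembly steps concrete, here is a minimal base R sketch of concatenating loci that do not share the same set of taxa. It is not the catGenes API: the helper `concatenate_loci()`, the toy alignments, and the choice of `?` as the missing-data symbol are all assumptions made for illustration.

```r
# Toy alignments: one named character vector per locus (names are taxon labels).
# All labels and sequences are made up for illustration.
alignments <- list(
  ITS  = c(Taxon_A = "ACGTAC", Taxon_B = "ACGTTC", Taxon_C = "ACGAAC"),
  matK = c(Taxon_A = "GGTTAACC", Taxon_C = "GGTAAACC", Taxon_D = "GGTTAACT")
)

# Hypothetical helper (not a catGenes function): pads taxa that are absent
# from a locus and records where each locus sits in the concatenated matrix.
concatenate_loci <- function(alignments, missing = "?") {
  # Union of taxon labels across all loci
  taxa <- sort(unique(unlist(lapply(alignments, names))))

  # Alignment length per locus; every sequence within a locus must match
  lens <- vapply(alignments, function(aln) {
    n <- unique(nchar(aln))
    stopifnot(length(n) == 1L)
    n
  }, integer(1))

  # Fill each locus with missing data, then overwrite the taxa that are present
  padded <- lapply(names(alignments), function(locus) {
    aln <- alignments[[locus]]
    out <- setNames(rep(strrep(missing, lens[[locus]]), length(taxa)), taxa)
    out[names(aln)] <- aln
    out
  })

  # Partition boundaries in the concatenated alignment
  ends <- cumsum(lens)
  starts <- ends - lens + 1

  list(
    sequences  = setNames(do.call(paste0, padded), taxa),
    partitions = data.frame(locus = names(lens), start = starts, end = ends,
                            row.names = NULL)
  )
}

combined <- concatenate_loci(alignments)
combined$sequences
combined$partitions
```

In this sketch, `Taxon_B` and `Taxon_D` each lack one locus, so their concatenated sequences are padded with `?`, and the `partitions` table records the column range of each locus for later use in a partitioned analysis.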
Phylogenetic analysis support
- Perform evolutionary model selection
- Generate MrBayes command blocks
- Run MrBayes directly from R
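For readers unfamiliar with MrBayes command blocks, the following base R sketch shows roughly what such a block looks like for a two-locus partitioned dataset. It does not use or reproduce the catGenes functions: the helper `mrbayes_block()`, the partition table, and the GTR+G settings (`nst=6 rates=gamma`) are placeholders chosen for illustration, and in practice the model for each partition would come from model selection.

```r
# Hypothetical partition table: locus name plus first and last alignment column
partitions <- data.frame(
  locus = c("ITS", "matK"),
  start = c(1, 7),
  end   = c(6, 14)
)

# Hypothetical helper (not a catGenes function): builds the lines of a
# MrBayes block with one charset per locus and an unlinked partitioned model.
mrbayes_block <- function(partitions, ngen = 1e6, samplefreq = 1000) {
  charsets <- sprintf("  charset %s = %d-%d;",
                      partitions$locus,
                      as.integer(partitions$start),
                      as.integer(partitions$end))
  c(
    "begin mrbayes;",
    charsets,
    sprintf("  partition by_locus = %d: %s;",
            nrow(partitions), paste(partitions$locus, collapse = ", ")),
    "  set partition = by_locus;",
    # Placeholder model (GTR+G) applied to every partition
    "  lset applyto=(all) nst=6 rates=gamma;",
    # Let each partition estimate its own parameters and rate
    "  unlink statefreq=(all) revmat=(all) shape=(all);",
    "  prset applyto=(all) ratepr=variable;",
    sprintf("  mcmc ngen=%d samplefreq=%d;",
            as.integer(ngen), as.integer(samplefreq)),
    "  sumt;",
    "end;"
  )
}

block <- mrbayes_block(partitions)
writeLines(block)
```

The resulting lines can be appended to a NEXUS file after the data block, which is the usual way to pass analysis settings to MrBayes.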
Tree visualization
- Plot and edit phylogenetic trees using ggtree
- Produce publication-ready phylogenetic figures
Contributing
Contributions are welcome. You can help improve catGenes by:
- reporting bugs or requesting features
- submitting pull requests
- suggesting improvements to documentation or tutorials
Please open issues or pull requests on GitHub:
https://github.com/DBOSlab/catGenes
License
catGenes is open-source software released under the MIT License. Open tools promote reproducibility, transparency, and collaborative scientific development.
Acknowledgements
Development of catGenes takes place within the research environment of the Rio de Janeiro Botanical Garden (JBRJ). We gratefully acknowledge support from CNPq - Conselho Nacional de Desenvolvimento Científico e Tecnológico. These institutions support the development of open computational tools that facilitate biodiversity research, phylogenetics, and reproducible scientific workflows.