NAME

minimeta - assembler for long-read metagenomic/metatranscriptomic data sets

SYNOPSIS

minimeta --in <reads.fq> --out <consensus.fasta>

DESCRIPTION

Produces a polished consensus assembly from long-read sequencing data using miniasm, racon, and medaka. Software settings are tuned for metagenomic/metatranscriptomic assemblies of variable, sometimes low, coverage.

PREREQUISITES

Requires the following non-core Perl libraries:

BioX::Seq

Additionally, the following external programs are required for one or more of the optional processing modules (errors will be thrown for missing programs only if that module is requested). All optional dependencies are available in Bioconda.

minimap2
miniasm
racon
medaka
samtools
bedtools
seqkit
redundans
cutadapt
homopolish

OPTIONS

Input

--in filename: Path to input reads in FASTx format (required)
--assembly filename: Path to existing assembly. If provided, assembly is skipped and only polishing is performed (default: none).
--homopolish filename: Path to reference FASTA file used by homopolish. Providing this filename also triggers polishing using homopolish (default: none).

Output

--out filename: Path to write consensus sequence to, as FASTA (default: STDOUT)

Configuration

--min_cov integer: Minimum read coverage required by assembler to keep position (default: 2)
--min_len integer: Minimum contig length to keep (default: 1)
--mask_below integer: If given, final assembly positions with coverage depth below this value will be hard masked with 'N' (default: off)
--split float: If given in conjunction with --mask_below, splits contigs at masked regions into smaller pieces. (default: off)
--only_split_at_hp: If given in conjunction with --split, only splits low coverage regions if at least one junction is at a homopolymer stretch (default: off)
--threads integer: Number of processing threads to use for mapping and polishing (default: 1)
--n_racon integer: Number of Racon polishing rounds to perform (default: 3)
--n_medaka integer: Number of Medaka polishing rounds to perform (default: 1)
--medaka_model string: Name of model to be used by medaka_consensus (based on basecalling model used for data) (default: depends on medaka version)
--medaka_batch_size integer: Batch size (medaka_consensus parameter -b) for medaka to use; using a smaller value should reduce memory consumption (default: 100)
--shred_len integer: For re-assemblies, the maximum length of pseudoreads to generate as an absolute value; the actual value will be the minimum of this and the value of --shred_max_frac times the actual contig length (default: 2000)
--shred_max_frac float: For re-assemblies, the maximum length of pseudoreads to generate as a fraction of the contig length; the actual value will be the minimum of this and the value of --shred_len (default: 0.66)
--shred_tgt_depth integer: For re-assemblies, the target depth of the pseudoreads on each contig; this is used to calculate how many reads to generate (default: 10)
--hp_model string: Name of model to be used by homopolish. Has no effect if --homopolish not used. (default: R9.4.pkl)
--noshuffle: Don't randomly shuffle input reads prior to assembly (default: shuffle)
--trim_polyN: Trim long poly-N stretches from reads prior to assembly (default: off)
--reassemblies integer: Perform one or more rounds of pseudo-assembly in order to minimize redundancy. For each round, the existing assembly is shredded into pseudoreads and reassembled.
--chunk_size integer: If this option is given, input reads will be split into chunks of --chunk_size reads and each chunk will be assembled independently. The resulting assemblies will be combined, shredded into pseudoreads, and reassembled.
--deterministic: Use a fixed seed for random processes such as shuffling (default: off)
--reduce: Apply a reduction algorithm to the pre-final assembly to remove redundant contigs (i.e. contigs mostly or completely overlapping with identity above a cutoff specified by --min_ident). Currently this is done using Redundans, which must be installed. (default: off)
--min_ident float: Minimum identity (0 to 1) between contigs required to remove shorter contig during redundancy reduction. (default: 0.8)
--minimizer_cutoff integer: During all-vs-all mapping, discard minimizers occurring above this frequency. This is the -f parameter to minimap2, and can be useful with high-coverage input datasets that may otherwise consume very large amounts of memory and time. A value between 1000 and 10,000 may be useful in these cases. (default: off)
--quiet: Don't write status messages to STDERR
--help: Print usage description and exit
--version: Print software version and exit
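
The interaction of --shred_len, --shred_max_frac, and --shred_tgt_depth described above can be illustrated with a short sketch. This is not code from minimeta itself: the function names are hypothetical, and the rounding behavior (truncating the length, rounding the read count up) is an assumption made for illustration only.

```python
import math


def pseudoread_length(contig_len, shred_len=2000, shred_max_frac=0.66):
    """Length of pseudoreads for one contig.

    Takes the minimum of --shred_len and --shred_max_frac times the
    contig length, as the option descriptions state. Truncating to an
    integer and enforcing a floor of 1 are assumptions.
    """
    return max(1, min(shred_len, int(shred_max_frac * contig_len)))


def pseudoread_count(contig_len, read_len, tgt_depth=10):
    """Number of pseudoreads needed to reach the target depth.

    Depth is approximated as (count * read_len) / contig_len, so the
    count is tgt_depth * contig_len / read_len. Rounding up is an
    assumption.
    """
    return math.ceil(tgt_depth * contig_len / read_len)


if __name__ == "__main__":
    for contig_len in (1000, 10000):
        length = pseudoread_length(contig_len)
        count = pseudoread_count(contig_len, length)
        print(contig_len, length, count)
```

Under these assumptions, a contig shorter than roughly 3 kb (2000 / 0.66) is limited by --shred_max_frac, while longer contigs are limited by --shred_len.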

CAVEATS AND BUGS

Please submit bug reports to the issue tracker in the distribution repository.

AUTHOR

Jeremy Volkening (jeremy.volkening@base2bio.com)

LICENSE AND COPYRIGHT

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.