Usage
Installation
To use Snapper, first create a conda environment with the required dependencies, then install Snapper using pip:
(base) $ conda create -n snapper python=3.7
(base) $ conda activate snapper
(snapper) $ conda install -c bioconda ont-fast5-api ont-tombo
(snapper) $ pip install snapper-ont
We recommend using Python 3.7 because later Python versions might be incompatible with internal Biopython dependencies.
Usage
The Snapper pipeline is designed to perform only comparative analysis and requires two samples to be sequenced. In most cases, these samples are native DNA and control DNA obtained after whole-genome amplification (WGA), but this is not a strict constraint. For example, the algorithm can be used to compare two native DNA datasets, where the first has been extracted from wild-type bacteria and the second from a mutant. Such an example is considered in detail in the Usercases section.
Suppose that we have a fast5_native folder containing single-fast5 files for the native sample, a fast5_wga folder for the control sample, and reference.fna with the genome of the organism of interest.
First, the fast5 files should be resquiggled using Tombo and converted to the multi-fast5 format using ont-fast5-api:
(snapper) $ tombo resquiggle fast5_native reference.fna
(snapper) $ tombo resquiggle fast5_wga reference.fna
(snapper) $ single_to_multi_fast5 -i fast5_native -s fast5_multi_native
(snapper) $ single_to_multi_fast5 -i fast5_wga -s fast5_multi_wga
Before running the resquiggle command, ensure that basecalling data have been added to the fast5 files (see the tombo preprocess annotate_raw_with_fastqs command).
The authors do not recommend changing the batch-size parameter (-n, --batch-size) when running the single_to_multi_fast5 command, since the default value (4000 reads per batch) seems reasonable in terms of efficiency.
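The preprocessing steps above can be collected into a small helper. This is a minimal sketch and not part of Snapper: the function names and the FASTQ file names are hypothetical, and by default the commands are only printed, not executed. It assumes the `--fast5-basedir` and `--fastq-filenames` options of `tombo preprocess annotate_raw_with_fastqs`.

```python
import subprocess


def preprocessing_commands(single_dir, multi_dir, reference, fastq):
    """Return the command lines that prepare one sample for Snapper.

    All arguments are user-chosen paths; nothing is executed here.
    """
    return [
        # 1. Attach basecalls to the single-fast5 files (needed by resquiggle).
        ["tombo", "preprocess", "annotate_raw_with_fastqs",
         "--fast5-basedir", single_dir, "--fastq-filenames", fastq],
        # 2. Align the raw signal to the reference genome.
        ["tombo", "resquiggle", single_dir, reference],
        # 3. Pack the resquiggled reads into multi-fast5 files
        #    (the default batch size is kept).
        ["single_to_multi_fast5", "-i", single_dir, "-s", multi_dir],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical FASTQ names; replace them with your basecaller output.
    samples = [
        ("fast5_native", "fast5_multi_native", "native.fastq"),
        ("fast5_wga", "fast5_multi_wga", "wga.fastq"),
    ]
    for single, multi, fastq in samples:
        run_all(preprocessing_commands(single, multi, "reference.fna", fastq))
```

Calling `run_all(..., dry_run=False)` inside the activated snapper environment would execute the steps in order and stop at the first failure.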
A typical Snapper command:
(snapper) $ snapper -sample_fast5dir fast5_multi_native -control_fast5dir fast5_multi_wga -reference reference.fna
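When Snapper is launched from a script, the argument list can be assembled programmatically. The sketch below is illustrative only: `snapper_command` is a hypothetical helper, not part of the package, and it assumes that every keyword argument corresponds to an option documented on this page.

```python
def snapper_command(sample_dir, control_dir, reference, **options):
    """Build the argument list for a Snapper run.

    Keyword arguments are passed through as "-name value" pairs, so their
    names must match the optional parameters documented for Snapper.
    """
    cmd = [
        "snapper",
        "-sample_fast5dir", sample_dir,
        "-control_fast5dir", control_dir,
        "-reference", reference,
    ]
    for name, value in options.items():
        cmd += ["-" + name, str(value)]
    return cmd


if __name__ == "__main__":
    cmd = snapper_command(
        "fast5_multi_native", "fast5_multi_wga", "reference.fna",
        threads=16, outdir="snapper_results",
    )
    print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to start the run.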
Optional parameters:
-ks_t: The threshold value of -log10(p-value) in the Kolmogorov-Smirnov test (default is 3, which means the desired p-value is less than 1e-3). In general, the lower this value, the higher the sensitivity of the algorithm. However, the authors recommend tuning this parameter only when the coverage of either the native or the control sample is less than 20 per position, since the algorithm internally balances k-mer coverage while computing the statistics.
-outdir: Output directory name. By default, the output directory is named Results_yyyy_mm_dd_tttttt according to the run date and time, for example Results_2022_09_02_180002.
-coverage: The genome coverage threshold (default is 40). The algorithm collects raw signal data for each sample until this threshold is reached.
-threads: Number of threads used (default is 8).
-k_size: k-mer size; must be odd and should not be less than 11 (default is 15).
-long_k_size: Long k-mer size; must be odd and should not be less than 21 (default is 29).
-max_motifs: The maximum expected number of motifs that the algorithm will try to extract (default is 20). The resulting number of motifs might be lower.
-min_conf: The minimal confidence value (chi-square statistic). Default is 100.
-target_chr: Target chromosome name. By default, all sequences available in the input reference fasta file are processed independently.
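The constraints on these parameters can be checked before starting a long run. The following sketch is not Snapper's own validation code; the helper names are hypothetical. It assumes that the ks_t logarithm is base 10 (consistent with a threshold of 3 corresponding to 1e-3) and that the `tttttt` part of the default output directory name is the time as HHMMSS.

```python
from datetime import datetime


def ks_t_to_pvalue(ks_t):
    """Convert a -log10(p-value) threshold into the p-value cutoff."""
    return 10.0 ** (-ks_t)


def check_kmer_size(value, minimum, name):
    """Raise ValueError unless value is odd and not less than minimum."""
    if value % 2 == 0:
        raise ValueError(f"{name} must be odd, got {value}")
    if value < minimum:
        raise ValueError(f"{name} should not be less than {minimum}, got {value}")
    return value


def default_outdir(now=None):
    """Build a Results_yyyy_mm_dd_tttttt name (tttttt assumed to be HHMMSS)."""
    now = now or datetime.now()
    return now.strftime("Results_%Y_%m_%d_%H%M%S")


if __name__ == "__main__":
    print(ks_t_to_pvalue(3))                   # p-value cutoff for the default ks_t
    check_kmer_size(15, 11, "k_size")          # default k_size passes
    check_kmer_size(29, 21, "long_k_size")     # default long_k_size passes
    print(default_outdir(datetime(2022, 9, 2, 18, 0, 2)))
```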
Output files explanation
The analysis is performed independently for each contig and each strand, so the output folder contains several result files for each contig and strand combination.
Consider the forward strand of a contig named contig_1. The following files are generated for this particular combination:
passed_motifs_forward_contig_1.fasta: This file contains all 11-mers that have a significant signal shift between the native and control samples.
final_motifs_forward_contig_1.fasta: This file contains a sorted list of all potential methylation motifs extracted by the greedy algorithm implemented in Snapper. For each extracted motif, the confidence value is given in the header. For motifs with a confidence level lower than 3000, the authors recommend checking their correctness manually (see Usercases).
seq_iter/forward_contig_1/ folder: In this folder, the greedy algorithm saves the set of unexplained 11-mer sequences at each greedy iteration. Formally, seqs_iter_1.fasta is identical to passed_motifs_forward_contig_1.fasta, since all passed 11-mers are unexplained at the first iteration.
plots_forward_contig_1: This folder contains signal distribution plots for each extracted methylation motif.
motif_refine: This folder contains the long motif enrichment results.
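Motifs that fall below the confidence level of 3000 can be listed automatically for manual review. The sketch below is an illustration under stated assumptions: the exact header layout of final_motifs_forward_contig_1.fasta is not described here, so the code assumes that the confidence is the last number in each header and that the motif is the sequence line. The example records are made up.

```python
import re

CONF_THRESHOLD = 3000  # below this value a manual check is recommended


def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        else:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def confidence(header):
    """ASSUMPTION: the confidence is the last number found in the header."""
    numbers = re.findall(r"\d+(?:\.\d+)?", header)
    return float(numbers[-1]) if numbers else None


def motifs_to_check(lines, threshold=CONF_THRESHOLD):
    """Return (motif, confidence) pairs whose confidence is below threshold."""
    flagged = []
    for header, seq in read_fasta(lines):
        conf = confidence(header)
        if conf is not None and conf < threshold:
            flagged.append((seq, conf))
    return flagged


if __name__ == "__main__":
    # Made-up records for illustration only; real headers may differ.
    example = [">MOTIF_1 conf=15230.4", "GATC", ">MOTIF_2 conf=812.0", "CCWGG"]
    print(motifs_to_check(example))
```

To apply it to real output, pass an open file handle for the final motifs file instead of the example list, after adjusting `confidence` to the actual header format.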
Citation
Dmitry N Konanov, Vladislav V Babenko, Aleksandra M Belova, Arina G Madan, Daria I Boldyreva, Oksana E Glushenko, Ivan O Butenko, Dmitry E Fedorov, Alexander I Manolov, Danil V Krivonos, Vassilii N Lazarev, Vadim M Govorun, Elena N Ilina, Snapper: high-sensitive detection of methylation motifs based on Oxford Nanopore reads, Bioinformatics, 2023