ShoRAH

Short Reads Assembly into Haplotypes

Generally speaking

The global analysis is performed by running shorah.py on the sorted BAM input file. This runs a shotgun local analysis, followed by global haplotype reconstruction and frequency estimation. The output is a file with the extension .global_haps.fasta. It is a FASTA file containing the reconstructed haplotype sequences, where the number after the underscore in each header is the estimated frequency. So, for example,

>HAP0_0.264857
CCTCAGATCACTCTTTGGCAACGACCCCTCGTCACAATAAAGATAGGGG

means that the haplotype was estimated to have a frequency of 26.5%.
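
As an illustration of the header convention described above, the frequencies can be read back with a few lines of Python. This is only a minimal sketch, not part of ShoRAH: the function name read_haplotypes is hypothetical, and it assumes that every header has the form >NAME_FREQUENCY, with the frequency after the last underscore, as in the example.

```python
import io


def read_haplotypes(handle):
    """Yield (name, frequency, sequence) for each record in a FASTA handle.

    Assumes headers look like '>HAP0_0.264857', i.e. the frequency is the
    text after the last underscore (hypothetical helper, not ShoRAH code).
    """
    name, freq, chunks = None, None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, freq, "".join(chunks)
            name, _, freq_text = line[1:].rpartition("_")
            freq = float(freq_text)
            chunks = []
        else:
            # Sequence lines may be wrapped, so collect and join them.
            chunks.append(line)
    if name is not None:
        yield name, freq, "".join(chunks)


# The record shown above, held in memory instead of a file on disk.
example = io.StringIO(
    ">HAP0_0.264857\n"
    "CCTCAGATCACTCTTTGGCAACGACCCCTCGTCACAATAAAGATAGGGG\n"
)
records = list(read_haplotypes(example))
for name, freq, seq in records:
    print(f"{name}: {freq:.1%} ({len(seq)} bases)")
```

To use it on a real output file, pass an open file object for the .global_haps.fasta file instead of the in-memory example.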

This file contains only the most frequent of the reconstructed haplotypes. The full set of reconstructed haplotypes is in the file with extension .popl.

A word of caution

Inferring haplotypes over a region longer than the reads is hard. Many false positives can be introduced if the reads are shorter than the region one would need to observe to capture enough diversity. See the references.

The .popl file will typically contain many haplotypes, most of them at very low frequencies. You are advised not to place high confidence in haplotypes whose frequency is below a certain threshold, which varies from case to case.

Other software you could use for global reconstruction

