Hello, everyone. I recently downloaded the latest variant data of the 1000 Genomes Project from ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/release/20110521/ and the mapped data from ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/data/HG00096/. I want to turn these variants (INDEL and SV) into sequences for each individual. I haven't found any available program that does this, so I wrote a script that follows these rules: 1) for an insertion, next_pos = current_pos + 1; 2) for a deletion or SV, next_pos = current_pos + length(ref_allele). However, I ran into a problem with adjacent variants such as the following:
CHROM POS ID REF ALT HG00096
20 131503 . ATC A 1|0:0.940:0.00,-0.70,-31.60
20 131505 rs3079754 CTCT C 1|0:1.000:-3.20,0.00,-40.50
The reference sequence is ATCTCTTG ##chr20:131503-131510. When I resolved the first variant at position 131503, the next position became 131506 (131503+3), which skips the variant rs3079754. So I got A (131503-131505) + TCTTG (131506-131510), giving ATCTTG as the sequence containing the variants. In IGV, however, the mapped reads show ATCTG. The two are different. Which one should I choose?
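To make the behaviour concrete, here is a minimal Python sketch of the rules described above. It is not my actual script: the function name `apply_variants` and the tuple layout `(pos, ref_allele, alt_allele)` are made up for illustration, and it assumes all variants passed in belong to the same haplotype. Since an insertion has a one-base REF allele, both rules reduce to next_pos = pos + length(ref_allele).

```python
def apply_variants(ref_seq, ref_start, variants):
    """Apply variants left to right to one haplotype.

    ref_seq    -- reference bases for the region
    ref_start  -- 1-based coordinate of ref_seq[0]
    variants   -- iterable of (pos, ref_allele, alt_allele), 1-based pos
    Returns (new_sequence, skipped_variants).
    """
    out = []
    skipped = []
    cur = ref_start  # next reference position still to be written
    for pos, ref_allele, alt_allele in sorted(variants):
        if pos < cur:
            # Variant starts inside a span already consumed by an earlier one.
            skipped.append((pos, ref_allele, alt_allele))
            continue
        offset = pos - ref_start
        # Sanity check: the REF allele must match the reference at this position.
        if ref_seq[offset:offset + len(ref_allele)] != ref_allele:
            raise ValueError("REF allele mismatch at position %d" % pos)
        out.append(ref_seq[cur - ref_start:offset])  # untouched reference bases
        out.append(alt_allele)                       # the variant itself
        cur = pos + len(ref_allele)                  # next_pos rule
    out.append(ref_seq[cur - ref_start:])            # remaining reference tail
    return "".join(out), skipped


region = "ATCTCTTG"  # chr20:131503-131510
both = [(131503, "ATC", "A"), (131505, "CTCT", "C")]
print(apply_variants(region, 131503, both))
print(apply_variants(region, 131503, both[1:]))
```

With both variants the sketch returns ATCTTG and reports rs3079754 as skipped, which is what my script does. With only rs3079754 applied it returns ATCTG, the same string I see in IGV. That alone does not tell me which call is correct.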
There are other instances as well:
20 179735 rs76051089 TTAAAA T
20 179741 . TAAAAG T
The reference sequence is TTAAAATAAAAGTGAA ##chr20:179735-179750. I got the va ...