PED : Polymorphic Edge Detection

Polymorphic Edge Detection (PED) is an analysis flow for detecting DNA polymorphisms from short reads produced by next-generation sequencers (NGS). I developed two methods that detect polymorphisms by locating the polymorphic edge: one is based on bidirectional alignment, and the other on comparison of k-mers. Examples of PED results and other useful information are available on the web pages (English) (Japanese) (Paper) (Blog).

Polymorphic Edge

A DNA polymorphism is any difference in DNA sequence between individuals. These differences include single nucleotide polymorphisms (SNPs), insertions, deletions, inversions, translocations and copy number variations. In a non-polymorphic region, the sequences of two individuals are identical. At the position of a SNP, or at the beginning of any other polymorphism, the nucleotide must differ between the individuals. This first differing position is the polymorphic edge.

Bidirectional alignment method

                                                                Chr11 80443004
                                                                |
TTTTTAATTGAAAAGGCATTAAGCTGGGTCTATGCAGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTAGATAGGTAGAAAAAAAAAACCACTATCAGCAACA Reference sequence matching from 5'-end
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||  | |     | ||||||||   |  |       |  
TTTTTAATTGAAAAGGCATTAAGCTGGGTCTATGCAGTGTGTGTGTGTGTGTGTGTGTGTGTGTAGATAGGTAGAAAAAAAAAACCACTATCAGCAACAGT Short read sequence
|||       ||             |  | |     |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
TTTAATTGAAAAGGCATTAAGCTGGGTCTATGCAGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTAGATAGGTAGAAAAAAAAAACCACTATCAGCAACAGT Reference sequence matching from 3'-end
                                   |
                                   Chr11 80442977

The short read sequence is aligned with the reference sequence from both the 5’- and 3’-ends. The positions indicated above and below the alignment are the first mismatched bases, i.e., the polymorphic edges. The bidirectional alignment clearly indicates a two-base (GT) deletion in the short read. The bidirectional alignment method can detect not only deletions but also SNPs, insertions, inversions and translocations.
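
The idea can be sketched in Python on a toy sequence. This is only an illustration under simplifying assumptions, not the implementation in ped.pl: the function name `bidirectional_edges`, the `seed` parameter and the use of `str.find` to anchor each end of the read are hypothetical, and the toy reference is invented for the example.

```python
def bidirectional_edges(ref, read, seed=8):
    """Extend exact matches from both read ends and report the polymorphic edges.

    Hypothetical helper: anchoring with str.find on a short seed is an
    assumption made for this toy example, not how ped.pl locates reads.
    If the read matches completely, both edges fall just outside the read.
    """
    head = ref.find(read[:seed])    # where the 5'-end seed sits on the reference
    tail = ref.find(read[-seed:])   # where the 3'-end seed sits on the reference
    if head < 0 or tail < 0:
        return None                 # one end could not be anchored

    off5 = head                         # ref index = read index + off5 (5' frame)
    off3 = tail - (len(read) - seed)    # ref index = read index + off3 (3' frame)

    # Walk from the 5'-end until the first mismatch.
    i = 0
    while i < len(read) and off5 + i < len(ref) and ref[off5 + i] == read[i]:
        i += 1

    # Walk from the 3'-end until the first mismatch.
    j = len(read) - 1
    while j >= 0 and 0 <= off3 + j < len(ref) and ref[off3 + j] == read[j]:
        j -= 1

    return {
        "edge5": off5 + i,     # 0-based reference index where 5' matching stops
        "edge3": off3 + j,     # 0-based reference index where 3' matching stops
        "shift": off3 - off5,  # > 0: bases missing from the read; < 0: extra bases
    }


# Toy reference with a GT repeat; the read lacks one GT unit.
ref = "GATTACAGGC" + "GTGTGT" + "AGATAGCCTA"
read = "GATTACAGGC" + "GTGT" + "AGATAGCCTA"
result = bidirectional_edges(ref, read)
print(result)
```

For the toy read this reports edge5 = 14, edge3 = 11 and shift = 2, i.e. a two-base deletion. As in the alignment above, the 5' and 3' matches overlap inside the repeat, so the 3' edge lies upstream of the 5' edge.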

K-mer method

Individual_A AAATGGTACATTTATATTAT
Individual_B AAATGGTACATTTATATTAC

All short reads from Individual_A and Individual_B are sliced into k-mers (e.g. k = 20) at every position. For example, Individual_A has the k-mer AAATGGTACATTTATATTAT but not AAATGGTACATTTATATTAC, whereas Individual_B has AAATGGTACATTTATATTAC but not AAATGGTACATTTATATTAT. The last base of the k-mer is T in Individual_A and C in Individual_B. Such a differing last base must be a SNP or the edge of an insertion, deletion, inversion, translocation or copy number variation. The k-mer method detects polymorphic edges from differences in the last base of k-mers that share the same first k-1 bases. This enables polymorphisms to be detected by direct comparison of NGS data.
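
A minimal Python sketch of this comparison follows. It is an illustration only, not the code in ped.pl: the function names `last_bases` and `polymorphic_edges` are hypothetical, and the sketch ignores k-mer counts, sequencing errors and the reverse strand, all of which matter for real data. It also holds everything in memory, which is only practical for toy input.

```python
from collections import defaultdict


def last_bases(reads, k=20):
    """Map each (k-1)-mer prefix to the set of bases that follow it in the reads."""
    table = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            table[kmer[:-1]].add(kmer[-1])
    return table


def polymorphic_edges(reads_a, reads_b, k=20):
    """Return prefixes present in both samples whose last bases differ."""
    a = last_bases(reads_a, k)
    b = last_bases(reads_b, k)
    return {
        prefix: ("".join(sorted(a[prefix])), "".join(sorted(b[prefix])))
        for prefix in a.keys() & b.keys()
        if a[prefix] != b[prefix]
    }


# The two 20-mers from the example above, treated as one read per individual.
edges = polymorphic_edges(["AAATGGTACATTTATATTAT"], ["AAATGGTACATTTATATTAC"])
print(edges)
```

The shared 19-mer AAATGGTACATTTATATTA is reported with last base T for Individual_A and C for Individual_B.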

For analysis of SARS-CoV-2 (COVID-19) data

perl download.pl accession=SRR11542244
perl ped.pl target=SRR11542244,ref=SARS-CoV-2

docker run -v `pwd`:/work -w /ped akiomiyao/ped perl download.pl accession=SRR11542244,wd=/work
docker run -v `pwd`:/work -w /ped akiomiyao/ped perl ped.pl target=SRR11542244,ref=SARS-CoV-2,wd=/work

The run time of ped.pl is only two minutes for one accession on a standard desktop computer running Linux (Ubuntu).
To analyze your private sequences,

cd ped
mkdir your_sample_name
mkdir your_sample_name/read
cp somewhere/read_data.fastq your_sample_name/read
perl ped.pl target=your_sample_name,ref=SARS-CoV-2
or 
docker run -v `pwd`:/work -w /ped akiomiyao/ped perl ped.pl target=your_sample_name,ref=SARS-CoV-2,wd=/work

The target name given to ped.pl is the name of the directory that contains the read subdirectory.
Detailed Link for COVID-19 analysis

Simplified instruction

ped.pl is a multithreaded (multiprocess) script suited to multi-core CPUs, such as those with 4 or 8 cores.
ped.pl also runs on a dual- or single-core machine, but slowly.
ped.pl runs on Linux (or FreeBSD) and Mac machines with at least 4 GB of RAM and 1 TB of hard disk (or SSD) space.

The following demonstrates detection of spontaneous SNPs and SVs in Caenorhabditis elegans propagated for 250 generations.
```
cd ped
perl download.pl accession=ERR3063486
perl download.pl accession=ERR3063487
perl ped.pl target=ERR3063487,control=ERR3063486,ref=WBcel235
```
Installation of fastq-dump and ped scripts is described below.
The docker container for Linux includes fastq-dump and ped scripts.
```
docker run -v `pwd`:/work -w /ped akiomiyao/ped perl download.pl accession=ERR3063486,wd=/work
docker run -v `pwd`:/work -w /ped akiomiyao/ped perl download.pl accession=ERR3063487,wd=/work
docker run -v `pwd`:/work -w /ped akiomiyao/ped perl ped.pl target=ERR3063487,control=ERR3063486,ref=WBcel235,wd=/work
```
ERR3063487 is the sequence of the nematode after 250 generations of propagation from ERR3063486.
BioProject https://www.ncbi.nlm.nih.gov/bioproject/PRJEB30822
Downloading fastq files may take several hours, because the connection of fastq-dump to NCBI-SRA is slow.
Sometimes download.pl reports a network timeout. In that case, the connection is re-established and the download resumes.
Fastq files will be saved in ERR3063486/read and ERR3063487/read.
SNPs and SVs in ERR3063487 relative to ERR3063486, i.e. spontaneous mutations arising during 250 generations, will be saved in the ERR3063487 (target) directory.
If control is omitted, polymorphisms relative to the reference genome will be saved in the target directory.
If a script is run without arguments, a description of how to use it is shown.
ERR3063487.vcf is the result in vcf format. The vcf file can be opened with Integrative Genomics Viewer (IGV).
Options,
thread=8 : specify the maximum number of threads (processes).
Default is the number of logical cores.
tmpdir=/mnt/ssd : set the temporary directory to /mnt/ssd. Default is the target directory.
clipping=100 : force the clipping length to 100. If the length of the short reads is not fixed, ped.pl determines a suitable clipping length by itself.
Add the clipping option only if you want to force the clipping length.
The distribution of read counts by sequence length can be obtained with check_length.pl
perl check_length.pl target=ERR3063487
A clipping length giving 90-95% coverage is enough.
The current version of ped.pl has an auto clipping function.

Result files,

File name               Description
ERR3063487.aln          Bidirectional alignment
ERR3063487.bi.primer    Primer data for PCR
ERR3063487.bi.snp       SNP data (original format)
ERR3063487.bi.snp.count SNP data (Showing snp counts from aln data)
ERR3063487.index        Index file for alignment search
ERR3063487.log          Process log
ERR3063487.report       Log of ped.pl
ERR3063487.sv           Structural variation data
ERR3063487.sv.count     Structural variation data (Showing snp counts from aln data)
ERR3063487.sv.primer    Primer data for PCR
ERR3063487.vcf          SNP and SV data (vcf format, for IGV)
ERR3063487.full.vcf     SNP and SV data (vcf format, full output)
ERR3063487.count.vcf    SNP and SV data (vcf format, full output with unverified data)

For analyses of metagenomes or mixed genomes (e.g. SARS-CoV-2 data from a patient), using the count data is recommended,
because SNPs or SVs detected at nearby positions but on different genome strands may be filtered out during the verification process.

Installation

If you do not want to use the docker container, you need to download the programs.
The programs run on Unix platforms (FreeBSD or Linux) and Mac.
Download the zip file of PED from https://github.com/akiomiyao/ped and extract it,
or
```
git clone https://github.com/akiomiyao/ped.git
```

If your machine does not have git, install it from a package.

sudo apt install git (Ubuntu)
sudo yum install git (CentOS)
sudo pkg install git (FreeBSD)

If you obtained the scripts with git clone, updating to the newest version is easy with git pull.
```
git pull
```
To download sequence data, fastq-dump from NCBI is required.
The toolkit can be downloaded from
https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software
Details of setting up fastq-dump are described at
https://akiomiyao.github.io/ped/sratoolkit/index.html
To download reference data, curl is required.
If your machine does not have curl, install it from a package.
```
sudo apt install curl   # Ubuntu
sudo yum install curl   # CentOS
sudo pkg install curl   # FreeBSD
```

Setup of Docker (For Docker users, Optional)

If docker is installed, ped can be run with the docker command without installing ped beforehand.
https://docs.docker.com/install/linux/docker-ce/ubuntu/
```
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
```
or
```
sudo apt install docker
sudo apt install docker.io
```
To get or update the container,
```
sudo docker pull akiomiyao/ped
```
To check running containers,
```
sudo docker stats
```
To kill running container,
```
sudo docker kill Container_ID
```
If you want to run the docker container without sudo or su,
```
sudo usermod -a -G docker your_username
```
After the next login, docker commands can be executed from your account.

Supported reference genomes

  Name             Description
  97103            Water melon (Citrullus lanatus subsp. vulgaris) cv. 97103 v2
  Asagao1.2        Asagao (Ipomoea nil) Japanese morning glory
  B73v4            Corn (Zea mays B73) RefGen v4
  Bomo             Silkworm (Bombyx mori) Genome assembly (Nov.2016)
  Bsubtilis        Bacillus subtilis subsp. subtilis str. 168 (NC_000964.3)
  Camarosa1.0      Strawberry (Fragaria x ananassa) Camarosa genome assembly v1.0
  Camellia20200506 Black tea (Camellia sinensis) assembly
  CharlestonGray2  Water melon (Citrullus lanatus subsp. vulgaris) cv. Charleston Gray v2
  CsinensisHz2     Sweet orange (Citrus sinensis) Hzau v2.0
  Ecoli            Escherichia coli str. K-12 substr. MG1655 (NC_000913.3)
  Fielder1         Wheat (Triticum aestivum L. cv. Fielder) Version 1
  GRCm38           Mouse (Mus musculus) Genome Reference Consortium Mouse Build 38
  Gifu1.2          Lotus japonicus Gifu 
  Gmax275v2.0      Soybean (Glycine max) genome project assembly version 2
  HBV              Hepatitis B virus (strain ayw, NC_003977.2)
  IBSC2            Barley (Hordeum vulgare L. cv. Morex) Release 47
  IRGSP1.0         Rice (Oryza sativa L. cv. Nipponbare) version 1.0
  IWGSC1.0         Wheat (Triticum aestivum L. cv. Chinese Spring) Version 1.0
  KOD1             Thermococcus kodakarensis KOD1 (NC_006624.1)
  LJ3              Lotus japonicus MG20 v3.0 (Download from https://lotus.au.dk/data/download into LJ3 directory)
  Lactuca_sativa   Lettuce (Lactuca sativa)
  RIB40            Aspergillus oryzae RIB40 (ASM18445v3) 
  Reikou2.3        Strawberry Reikou genome v2.3
  SARS-CoV-2       Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2, NC_045512.2)
  SL3              Tomato (Solanum lycopersicum cv. Heinz 1706) Build 3.0
  SScrofa11.1      Pig (Sus scrofa) Release-97
  TAIR10           Arabidopsis thaliana version TAIR10
  UMD3.1           Cow (Bos taurus L1 Dominette 01449) UMD 3.1
  Vcholerae        Vibrio cholerae O1 biovar El Tor str. N16961
  WBcel235         Caenorhabditis elegans WBcel235
  danRer11         Zebrafish (Danio rerio) Genome Reference Consortium Zebrafish Build 11
  dmel626          Drosophila melanogaster
  hg19             Human (Homo sapiens) Genome Reference Consortium Human Build 37 (GRCh37)
  hg38             Human (Homo sapiens) Genome Reference Consortium Human Build 38
  sacCer3          Saccharomyces cerevisiae (UCSC sacCer3)

If the script fails to fetch the fasta file, fetch the file separately, save it in the reference directory, and run the script again.
If only ref is specified, ped.pl builds the reference data and does nothing else.

To build reference data for a genome that is absent from the config file,

perl ped.pl ref=reference,file=fasta_file_name

For example,

mkdir IRGSP1.0  
cp somewhere/IRGSP-1.0_genome.fasta.gz IRGSP1.0  
perl ped.pl ref=IRGSP1.0,file=IRGSP-1.0_genome.fasta.gz

Instruction for k-mer method

For example,

perl ped.pl target=ERR3063487,control=ERR3063486,ref=WBcel235,method=kmer

SNPs specific to ERR3063487 relative to ERR3063486 will be detected.

If you want to detect SNPs against the reference genome,

perl ped.pl target=ERR3063487,ref=WBcel235,method=kmer

ERR3063487.kmer.snp is the list of SNPs.
ERR3063487.kmer.vcf is the vcf file for SNPs.
ERR3063487.kmer.primer is the list of primer sequences for detecting the SNPs.
Primer files are experimental.
The primer-selection algorithm was developed from my experience with PCR experiments.
The k-mer method can detect polymorphisms by direct comparison between two sets of short read data.
To detect SNPs between target and control without reference data,
run the script without specifying a reference.
For example,
```
perl ped.pl target=ERR3063487,control=ERR3063486,method=kmer
```
ERR3063487.kmer is the list of polymorphic edges.
SNPs tagged with the first 19-mer can be used for genetic analysis,
such as segregation analysis.
The 19-mer can serve as the identifier (i.e. name) of the SNP in the analysis.

Examples of result

A part of the SNP list from the bidirectional alignment method is

1       3189273 T       C       50      0       15      9       H
1       3189333 T       C       50      0       14      0       R
1       3189345 A       G       50      0       13      17      H
1       3189429 G       A       50      0       15      0       R
1       3189498 G       A       50      0       0       39      M
1       3189503 T       G       50      0       2       1       R
1       3189527 T       G       50      0       10      0       R
1       3189609 T       A       50      0       41      1       R
1       3189704 C       G       50      0       0       11      M
1       3189741 A       G       50      0       0       23      M
1       3189819 A       G       50      0       0       4       R
1       3189833 C       T       50      0       0       46      M
Column 1: Chromosome number
Column 2: Position of SNP
Column 3: Reference base at the SNP position
Column 4: Alternative base
Column 5: Number of reads in the control sort_uniq file with control type polymorphism
Column 6: Number of reads in the control sort_uniq file with target type polymorphism
Column 7: Number of reads in the target sort_uniq file with control type polymorphism
Column 8: Number of reads in the target sort_uniq file with target type polymorphism
Column 9: Genotype (M: homozygous, H: heterozygous, R: reference type, N: not applicable)
The following columns appear only in the primer file.
Column 10: Left primer sequence
Column 11: Right primer sequence
Column 12: Estimated size of the amplified fragment
Column 13: Upstream and downstream sequence around the SNP
The primer-selection algorithm was developed from my experience with PCR experiments.
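
For downstream scripting, the SNP rows can be read into named fields. The following Python sketch assumes the nine whitespace-separated fields visible in the sample rows above; the class name `BiSnp`, the function `parse_bi_snp` and the field names are hypothetical and are not part of PED.

```python
from typing import NamedTuple


class BiSnp(NamedTuple):
    """One row of the bidirectional SNP list (field names are hypothetical)."""
    chrom: str
    pos: int
    ref: str
    alt: str
    control_control: int  # control reads showing the control type
    control_target: int   # control reads showing the target type
    target_control: int   # target reads showing the control type
    target_target: int    # target reads showing the target type
    genotype: str         # M, H, R or N


def parse_bi_snp(line):
    """Split one whitespace-separated row into a BiSnp record."""
    f = line.split()
    return BiSnp(f[0], int(f[1]), f[2], f[3],
                 int(f[4]), int(f[5]), int(f[6]), int(f[7]), f[8])


# Two rows copied from the example list above.
rows = [
    "1       3189273 T       C       50      0       15      9       H",
    "1       3189498 G       A       50      0       0       39      M",
]
snps = [parse_bi_snp(r) for r in rows]
homozygous = [s for s in snps if s.genotype == "M"]
print(homozygous)
```

Filtering on the genotype field keeps only the row at position 3189498, where all 39 target reads carry the alternative base.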

A part of the structural variation result is
```
1 902788 1 902774 f deletion -1 50 0 10 9 H AAAAAAAAAAAAAA
1 907869 1 907835 f insertion 32 50 0 19 11 H C
1 911222 1 911221 f deletion -1 50 0 21 15 H T
1 923312 1 923312 f deletion -1 50 0 0 34 M _
1 931147 1 931131 f insertion 4 50 0 7 19 N CCCTCCCTCCC
1 932618 1 932617 f deletion -4 50 0 1 32 M TTTC
```

Column 1: Chromosome number of junction detected by 5’ to 3’ matching
Column 2: Position of junction detected by 5’ to 3’ matching
Column 3: Chromosome number of junction detected by 3’ to 5’ matching
Column 4: Position of junction detected by 3’ to 5’ matching
Column 5: Direction
Column 6: Type of polymorphism (insertion, deletion, inversion and translocation)
Column 7: Size of insertion or deletion
Column 8: Number of reads in the control sort_uniq file with control type polymorphism
Column 9: Number of reads in the control sort_uniq file with target type polymorphism
Column 10: Number of reads in the target sort_uniq file with control type polymorphism
Column 11: Number of reads in the target sort_uniq file with target type polymorphism
Column 12: Genotype (M: homozygous, H: heterozygous, R: reference type, N: not applicable)
Column 13: Sequence between junctions (_ means no sequence between the junctions)
The following columns appear only in the primer file.
Column 14: Left primer sequence
Column 15: Right primer sequence
Column 16: Estimated size of the amplified fragment
Column 17: Upstream and downstream sequences around the junction
The primer-selection algorithm was developed from my experience with PCR experiments.
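
As a sketch of how the thirteen structural variation columns might be consumed, the Python below parses rows into dictionaries and tallies them by type. The function name `parse_sv` and the dictionary keys are hypothetical. Treating a size of N as missing and `_` as an empty junction sequence is an assumption based on the rows shown in this document.

```python
from collections import Counter


def parse_sv(line):
    """Parse one structural variation row (keys are hypothetical names)."""
    f = line.split()
    return {
        "chrom5": f[0], "pos5": int(f[1]),   # junction from 5' to 3' matching
        "chrom3": f[2], "pos3": int(f[3]),   # junction from 3' to 5' matching
        "direction": f[4],
        "type": f[5],
        "size": None if f[6] == "N" else int(f[6]),   # N appears where no size is given
        "counts": tuple(int(x) for x in f[7:11]),     # columns 8-11
        "genotype": f[11],
        "junction_seq": "" if f[12] == "_" else f[12],
    }


# Three rows copied from the example above.
rows = [
    "1 902788 1 902774 f deletion -1 50 0 10 9 H AAAAAAAAAAAAAA",
    "1 907869 1 907835 f insertion 32 50 0 19 11 H C",
    "1 923312 1 923312 f deletion -1 50 0 0 34 M _",
]
svs = [parse_sv(r) for r in rows]
by_type = Counter(sv["type"] for sv in svs)
print(by_type)
```

For these three rows the tally is two deletions and one insertion.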


- A part of SNP result by the *k*-mer method is

X 14544549 ACAGTTGTATTTTTCAATT A A G r 0 0 0 13 0 27 0 0 13 0 0 30 M
X 14544549 TACACCACTGTAAGTCAAC A A G f 13 0 0 0 0 0 27 0 13 0 0 30 M
X 14592687 AAAAATTTGGATTTTTGGA G G GT r 0 15 0 0 20 20 0 1 5 0 10 0 R
X 14592707 AAAAATTTGGATTTTTGGA G G AG f 0 0 15 0 20 0 20 0 5 0 10 0 R
X 14615799 ACCCCTATATATAGTGTTT T T AT r 25 0 0 0 16 0 0 12 26 0 18 0 R
X 14615819 ACCCCTATATATAGTGTTT T T AT f 0 0 0 25 12 0 0 16 26 0 23 0 R
X 14624654 GTTGTGCTTTATTTATTTG A A AT r 0 0 0 19 15 0 0 20 19 0 19 0 R
X 14624674 GTTGTGCTTTATTTATTTG A A AG f 18 0 0 0 18 0 16 0 19 0 17 0 R
X 14632040 ATTTTTTTCACAAACAAGG T T A r 26 0 0 0 0 0 0 23 21 0 0 21 M
X 14632040 CTTAAATCGGAGAACAAAT T T A f 0 0 0 25 22 0 0 0 21 0 0 21 M
X 14682371 TCAGTTCAATCACGATAAA A A AT r 0 0 0 19 10 0 0 19 24 0 20 0 R

Column 1: Chromosome number
Column 2: Position of SNP
Column 3: (k-1)-mer (k = 20)
Column 4: Base of reference at the position of SNP
Column 5: Base of control
Column 6: Base of target
Column 7: Direction of k-mer sequence on the reference
Column 8: Number of k-mer with A at the end of k-mer in the control
Column 9: Number of k-mer with C at the end of k-mer in the control
Column 10: Number of k-mer with G at the end of k-mer in the control
Column 11: Number of k-mer with T at the end of k-mer in the control
Column 12: Number of k-mer with A at the end of k-mer in the target
Column 13: Number of k-mer with C at the end of k-mer in the target
Column 14: Number of k-mer with G at the end of k-mer in the target
Column 15: Number of k-mer with T at the end of k-mer in the target
Column 16: Number of reads in the control sort_uniq file with control type base
Column 17: Number of reads in the control sort_uniq file with target type base
Column 18: Number of reads in the target sort_uniq file with control type base
Column 19: Number of reads in the target sort_uniq file with target type base
Column 20: Genotype (M: homozygous, H: heterozygous, R: reference type, N: not applicable)
The following columns appear only in the primer file.
Column 21: Left primer sequence
Column 22: Right primer sequence
Column 23: Estimated size of the amplified fragment
Column 24: Upstream and downstream sequence around the SNP
The primer-selection algorithm was developed from my experience with PCR experiments.
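
In the sample rows, k-mers in the r direction appear to be counted by the complementary base: in the first row the control base is A while the control count sits in the T column. The Python sketch below checks that reading on the first two rows. It is an observation from the example data, not a documented guarantee. The names `dominant_base` and `check_row` are hypothetical, and rows whose base column holds two letters (e.g. GT) are not handled.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
BASES = "ACGT"  # order of the count columns: A, C, G, T


def dominant_base(counts, direction):
    """Most frequent last base, converted to the reference strand."""
    base = BASES[counts.index(max(counts))]
    return COMPLEMENT[base] if direction == "r" else base


def check_row(line):
    """Compare the dominant counted bases with the reported control and target bases."""
    f = line.split()
    control_base, target_base, direction = f[4], f[5], f[6]
    control_counts = [int(x) for x in f[7:11]]   # columns 8-11
    target_counts = [int(x) for x in f[11:15]]   # columns 12-15
    return (dominant_base(control_counts, direction) == control_base,
            dominant_base(target_counts, direction) == target_base)


# The first two rows of the example above (one r, one f, same SNP position).
rows = [
    "X 14544549 ACAGTTGTATTTTTCAATT A A G r 0 0 0 13 0 27 0 0 13 0 0 30 M",
    "X 14544549 TACACCACTGTAAGTCAAC A A G f 13 0 0 0 0 0 27 0 13 0 0 30 M",
]
checks = [check_row(r) for r in rows]
print(checks)
```

Both rows agree with the reported bases once the r row is complemented, which suggests that the two rows describe the same SNP seen from opposite strands.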

## Detection of polymorphisms between control and target
- SNPs between ERR3063486 (wild-type) and ERR3063487 (mutant)

I 27950 A T 31 0 0 17 M
I 892680 C G 8 1 5 3 H
I 1196268 T A 9 0 8 4 H
I 1380502 A T 22 0 0 13 M
I 1728826 T G 8 0 6 3 H
I 3203676 T G 16 1 11 5 H
I 3407954 C A 28 0 0 23 M
I 3656814 A C 8 1 8 4 H
I 5001132 T G 8 1 6 3 H
I 6324213 G T 24 0 0 21 M
I 7249395 T G 19 1 7 4 H
I 7263091 T G 14 1 9 6 H
I 9136539 T A 20 0 0 16 M
I 10137843 T G 6 0 9 4 H
I 11168832 A C 5 1 5 3 H
I 14097862 A T 17 1 9 13 H
II 86891 A G 7 0 8 4 H
II 271122 C A 16 0 6 3 H
II 1179320 C A 11 0 0 17 M
II 2500482 C A 15 0 0 24 M
II 3552886 A C 15 1 5 3 H
II 3624396 A C 12 1 9 5 H
II 3648966 C T 9 0 0 27 M
II 3771956 G C 19 0 0 16 M
II 3935824 A C 7 0 6 3 H
II 3979685 T G 9 0 5 3 H
II 8284226 C A 19 0 0 20 M
II 8447001 T G 10 0 9 4 H
II 8553707 T A 21 0 0 6 M
II 9410187 C T 23 0 0 18 M
II 9937543 T G 21 0 0 18 M
II 10629519 A C 17 1 12 8 H
II 10685303 T A 21 0 0 16 M
II 12732768 A C 13 1 6 3 H
II 14096056 T G 6 0 5 4 H
III 198231 T A 16 0 0 18 M
III 3091906 T G 19 0 9 4 H
III 3643248 G C 6 0 6 3 H
III 4824486 C G 33 0 0 23 M
III 7532164 T G 8 1 6 3 H
III 9723566 G A 25 0 0 23 M
III 11532166 C T 15 0 1 18 M
III 13092063 C T 13 0 6 6 H
III 13273147 A T 15 1 6 4 H
III 13350315 A C 14 0 7 4 H
IV 554740 A C 11 1 9 4 H
IV 876681 A G 20 0 7 4 H
IV 1159003 A C 13 0 7 6 H
IV 1327294 G A 26 0 0 31 M
IV 2417073 G A 32 0 0 21 M
IV 2858610 C T 21 0 0 14 M
IV 3931877 G A 13 0 0 20 M
IV 4298928 A C 11 1 9 4 H
IV 5200355 G T 27 0 0 23 M
IV 6481216 C G 17 0 0 8 M
IV 6796145 C T 16 0 0 19 M
IV 6967218 G A 25 0 0 25 M
IV 8053747 T A 23 0 0 19 M
IV 8951120 T G 10 1 13 6 H
IV 9709645 C A 22 0 0 27 M
IV 10195034 T G 14 1 8 4 H
IV 13672183 C T 18 0 0 18 M
IV 14217812 A C 9 0 6 3 H
IV 14760376 T G 9 0 5 3 H
IV 16870604 A C 12 1 8 4 H
V 7011 T G 10 1 11 5 H
V 974819 A C 6 0 9 4 H
V 3052728 A C 9 1 10 5 H
V 3277678 G A 22 0 1 19 M
V 3786240 T A 23 0 0 17 M
V 4261370 T G 14 0 9 4 H
V 5134172 C T 19 0 17 10 H
V 7816318 A C 20 1 8 4 H
V 9771516 A T 25 0 18 12 H
V 10513400 A C 11 0 9 4 H
V 11310419 G T 15 0 0 18 M
V 15880652 T C 20 1 6 3 H
V 19657843 T A 25 0 0 24 M
V 19718778 G A 11 0 0 11 M
V 19733914 A C 6 0 6 3 H
V 19742410 T G 6 1 5 3 H
V 20202209 T G 5 0 5 3 H
V 20413571 T G 13 1 6 3 H
X 2843588 A C 13 0 9 4 H
X 4194330 T C 23 0 0 22 M
X 4247242 T G 7 1 6 3 H
X 5446454 C T 21 0 0 7 M
X 6994561 A T 29 0 0 29 M
X 7299494 A G 7 0 6 3 H
X 7586849 T G 8 1 8 4 H
X 10486673 A T 31 0 0 23 M
X 12500491 T G 7 1 6 3 H
X 14544549 A G 13 0 0 30 M
X 14632040 T A 21 0 0 21 M
X 14735899 T G 8 0 5 3 H
X 15395882 A G 17 0 13 6 H
X 15815152 T G 13 1 5 3 H
X 16861146 C T 9 0 0 11 M
X 16964164 T G 14 0 5 3 H

- Structural variations between ERR3063486 (wild-type) and ERR3063487 (mutant)

I 834776 I 834746 f deletion -12 6 0 7 4 H TCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCT
I 1251597 I 1251573 f deletion -2 5 0 8 4 H TGTGTGTGTGTGTGTGTGTGTGTGT
I 1412142 I 1411952 f insertion 102 12 1 6 3 H CCCCCCGCTGACCCCAAACCAATATCCCGTCAAAAAACGAAAATTCATATTTTTCTTAATCTACAGTAATCCTACAGTGCCCCTACA
I 1560682 I 1560674 f deletion -1 17 0 0 19 M TTTTTTTT
I 1560682 I 1561254 r inversion N 13 0 0 6 M TTAAAGGTGGTGTGGTCGAATTTTTTTT
I 2160919 I 2160898 f deletion -1 9 0 8 4 H TTTTTTTTCAAAAAAAAAAAA
I 2384261 I 2384244 f insertion 1 14 1 8 4 H CAAAAAAAAAAAAAA
I 2715230 III 12981863 r translocation N 5 1 5 3 H CGTATTGCACAGCACATTTGACGCGCAAAAT
I 5081495 I 5081478 f deletion -2 7 1 6 4 H TTTTTTTTTTTTTTTTTT
I 8028077 I 8028066 f deletion -1 6 0 6 3 H TTTTTTTTTTT
I 8028078 I 8028065 f insertion 1 6 0 6 3 H TTTTTTTTTTT
I 9825014 I 9825007 f insertion 1 23 0 8 13 H AAAAA
I 10622455 IV 9192034 f translocation N 5 1 6 3 H GTTCAAATAAAAATATTTTTTT
I 10887954 I 10887958 f deletion -6 11 0 1 10 M A
I 11005509 I 11005489 f deletion -1 5 1 0 5 M AAATTTTTTTTTTTTTTTTT
I 12856424 I 12856415 f deletion -1 14 0 8 5 H TTTTTTTTT
I 13734806 I 13734788 f deletion -1 5 0 6 4 H TTTTTTTTTTTTTTTTTT
II 191614 II 191601 f insertion 1 11 0 9 4 H AAAAAAAAAAA
II 948033 II 948099 f deletion -69 20 0 1 15 M AT
II 1777672 II 1770125 f insertion 7506 8 1 9 4 H ATGGTGAGTAGCCGGTAATTTCATAGTTATTGAAATTTGA
II 2361947 II 2361931 f deletion -1 10 1 0 15 M TTTTTCTTTTTTTTTT
II 4463297 V 1000739 r translocation N 6 0 6 4 H TTTCGATTTTCCAGAAAATCAAAAAAAAA
II 4895056 V 18778607 r translocation N 5 0 5 3 H TTCTACGTTTTGCAATGTGTTTTTT
II 5473562 II 3675168 r inversion N 8 1 5 3 H TTTTACTCAGTTATGTTTTTTTT
II 12746320 III 4520340 f translocation N 6 1 5 3 H TGTAAAATTGTTTTTTTTT
II 13112130 V 20740040 f translocation N 6 0 7 4 H AAAAAAAAACGCATGCATTTTTCG
II 13327324 II 13327697 r inversion N 6 1 6 5 H TTTTGACACTTTTTAGTAATAAATGCAAAAAAAATCAACAAAAATAGACTAAACATTGTAAAAACTGTAAAAACTAAGAGAAAAAAT
III 1663181 III 1663357 f deletion -209 6 1 7 5 H TTTTTTCCAGAAATTAATATTTCTAGAAAAAT
III 2300741 X 15839886 r translocation N 19 0 11 6 H TTAAAGGTGGAGTAGCGCCAGTGGGAAAATTGCTTTAAAACATGCCTATGGTACCACAATGACCAAATATCAT
III 2520715 X 97782 f translocation N 7 1 6 5 H TATTTTTTCGCCATTTTTTTT
III 2985966 IV 876821 f translocation N 5 1 6 3 H AAAAAAATTTTTTTTTT
III 3566956 III 3566962 f deletion -48 8 1 6 4 H TCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCT
III 6231151 III 6231140 f deletion -1 14 0 6 3 H AAAAAAAAAAA
III 10402397 III 10402387 f deletion -2 5 1 7 4 H TTTTTTTTTTT
III 11280814 X 2272016 r translocation N 12 0 6 3 H TTTTTTTCAAAAAAAAAAAAA
III 12543896 III 12543907 f deletion -62 8 1 9 4 H AAATTTCCGGAAAACATGCAAATTGCCAGAATTGAAAATTTCCGGCAAAT
III 13419473 III 13419443 f insertion 6 5 1 10 5 H TGTGTGTGTGTGTGTGTGTGTGT
IV 1786345 IV 1786337 f deletion -1 15 0 0 12 M AAAAAAAA
IV 2112309 IV 2112287 f deletion -1 7 1 5 3 H TTTTTTGTTTTTTTTTTTTTTT
IV 2314588 IV 2314573 f deletion -1 10 1 8 4 H TTTTTTTTTTTTTTT
IV 2445289 IV 2445276 f deletion -1 15 0 5 3 H TTTTTTTTTTTTT
IV 3192017 IV 3192001 f deletion -1 10 1 8 5 H AAAAAAAAAAAAAAAA
IV 3192018 IV 3192000 f insertion 1 10 0 7 4 H AAAAAAAAAAAAAAAA
IV 3336297 IV 3336328 f deletion -31 21 0 0 23 M A
IV 3867139 IV 3867116 f deletion -2 11 1 6 3 H ATATATATATATATATATATATAT
IV 4399486 IV 4399441 f insertion 2 8 0 10 6 H GAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA
IV 4489131 IV 4489122 f deletion -1 16 0 1 17 M TTTTTTTTT
V 1027708 V 1027700 f deletion -39 9 1 10 5 H GCCTATGGCCTACGCCTATGGCCTACGCCTATGGCCTACGCCTATG
V 1989616 V 1989642 f deletion -3 6 1 6 3 H GATAAAAACTACTTGGATAAATGA
V 9026732 V 9026728 f deletion -3 6 1 7 4 H ATTATT
V 11884926 II 3151004 r translocation N 5 1 6 5 H GTGCCGAGTGCCGATCGGCACAATGTG
V 18672436 IV 2681633 r translocation N 7 1 5 3 H GGGAAAATTGCTTTAAAACATGCCTATGGTACTACAA
V 18890234 V 18892184 r inversion N 6 1 6 3 H TTTTTATTGAAAACTAGTATAAAAATATA
V 20123760 V 20123893 f deletion -150 12 0 7 4 H GGGGTTCGAACCCCGG
X 99547 X 99523 f deletion -1 8 1 9 5 H AAAAAAAATTTTTTTTTTTTTTTT
X 438457 X 438446 f insertion 1 12 0 5 5 H TTTTTTTTT
X 562164 X 562155 f insertion 1 22 0 0 22 M AAAAAAA
X 1522918 X 1522902 f deletion -1 6 1 5 3 H AAAAAAAAAAAAAAAA
X 2498483 X 2498466 f deletion -1 6 1 5 3 H TTTTTTTTTTTTTTTTT
X 3823840 X 3823828 f deletion -1 10 0 8 4 H AAAAAAAAAAAA
X 5885390 X 5885377 f deletion -1 11 0 10 5 H TTTTTTTTTTTTT
X 7312229 X 11435923 r inversion N 6 0 5 4 H TATTCACCCCGTTCGACTGTGCAATGGGTTTAATCTATTCACTTTGTAAATCAAAGAATCGACGACCGCCTCCTGAA
X 10023790 III 7850798 r translocation N 5 0 6 3 H ATATCAAAATTTCATTTTTTTT
X 14258766 III 303520 f translocation N 6 0 9 5 H TCACAAAATTCTTTGGCCGCCCCAAGTGTCCTAACTCGAAG

## Detection of Copy Number Variation

```
perl ped.pl target=ttm2,control=ttm5,ref=IRGSP1.0,method=cnv
```

The copy number (read count) at each position of the reference will be saved in the ttm2.ttm5.cnv file in the target directory.
If control is omitted, the reference sequence will be used as the control.

Author

Akio Miyao, Ph.D. miyao@affrc.go.jp
Institute of Crop Science / National Agriculture and Food Research Organization
2-1-2, Kannondai, Tsukuba, Ibaraki 305-8518, Japan

Version

Version 1.6 Scripts for grid engine have been removed.
Version 1.5 Threads in ped.pl have been changed to forking process.
Version 1.4 Add clipping of short reads for RT-PCR data. Add application of COVID-19 analysis.
Version 1.3 Update for search.pl for confirmation of alignment. Improvement of making sort_uniq data.
Version 1.2 sort_uniq files are divided to 64 subfiles by first three nucleotide sequence. Remake of reference data is required.
Version 1.1 sort_uniq files are compressed by gzip. Requirement of disk space is reduced but requires more CPU time.
Version 1.0 Original version for PED paper.

Citing PED

Cite this article as: Polymorphic edge detection (PED): two efficient methods of polymorphism detection from next-generation sequencing data
Akio Miyao, Jianyu Song Kiyomiya, Keiko Iida, Koji Doi, Hiroshi Yasue
BMC Bioinformatics. 2019 20(1):362.
URL https://doi.org/10.1186/s12859-019-2955-6
PDF https://rdcu.be/bH7e8

License

NARO NON-COMMERCIAL LICENSE AGREEMENT Version 1.0

This license is for ‘Non-Commercial’ use of software for polymorphic edge detection (PED)

Scientific use of PED is permitted free of charge.
Modification of PED is only permitted to the person of downloaded and his/her colleagues.
The National Agriculture and Food Research Organization (hereinafter referred to as NARO) does not guarantee that defects, errors or malfunction will not occur with respect to PED.
NARO shall not be responsible or liable for any damage or loss caused or be alleged to be caused, directly or indirectly, by the download and use of PED.
NARO shall not be obligated to correct or repair the program regardless of the extent, even if there are any defects of malfunctions in PED.
The copyright and all other rights of PED belong to NARO.
Selling, renting, re-use of license, or use for business purposes etc. of PED shall not be allowed. For commercial use, license of commercial use is required. Inquiries for such commercial license are directed to ped_request@ml.affrc.go.jp.
The PED may be changed, or the distribution maybe canceled without advance notification.
In case the result obtained using PED in used for publication in academic journals etc., please refer the publication of PED and/or acknowledge the use of PED in the publication.