连续纯合片段(runs of homozygosity, ROH)是亲代将同源相同的单倍型传递给子代而形成的,基于ROH计算的基因组近交系数是个体真实的近交系数,它比传统系谱计算的期望近交系数更加准确,而且不再依赖于系谱记录的准确性和完整性,能准确地反映两个配子间的关系。
ROH片段的长短还可以推断近交历史。长的ROH片段反映最近世代发生过近交,而短的ROH说明较远世代产生近交,因为世代数越短ROH片段被重组打断的可能性就越小。
计算连续纯合片段 runs of homozygosity
芯片文件为例
1.转成vcf
nohup /home/software/plink --allow-extra-chr --chr-set 26 --file project_96s --make-bed --out project_96s &
nohup /home/software/plink --allow-extra-chr --chr-set 26 --bfile project_96s --export vcf --out project_96s &
2.过滤缺失0.1,MAF0.05
只使用常染色体上的位点,最大缺失率(max-missing)<10%即检出率>90%的SNP位点,最小等位基因频率(MAF)>0.05(哈代温伯格平衡检验的P值大于10^-6 -hwe)。
nohup /home/software/plink --allow-extra-chr --chr-set 26 -bfile project_96s --geno 0.1 --maf 0.05 --make-bed --out project_96s-geno01-maf05 &
3.去多等位基因,INDEL
nohup /home/software/plink --allow-extra-chr --chr-set 26 --bfile project_96s-geno01-maf05 --export vcf --out project_96s-geno01-maf05 & (bed-vcf)
bgzip -c -f project_96s-geno01-maf05.vcf > project_96s-geno01-maf05.vcf.gz
nohup /home/software/bcftools/bcftools view -m 2 -M 2 --type "snps" project_96s-geno01-maf05.vcf.gz -O z -o project_96s-geno01-maf05-snps.vcf.gz &
4.去重复标记,对vcf文件进行去重并排序,加头
# 去重
sort -k 1,2 -u project_96s-geno01-maf05-snps.vcf > project_96s-geno01-maf05-snps.sorted.uniq.vcf
# 排序
cat project_96s-geno01-maf05-snps.sorted.uniq.vcf |/home/software/vcftools/src/perl/vcf-sort > project_96s-geno01-maf05-snps.sorted.uniq.recode.vcf
# 提取前40行vcf文件
head -40 project_96s-geno01-maf05-snps.vcf > 40.vcf
# 加头
cat 40.vcf project_96s-geno01-maf05-snps.sorted.uniq.recode.vcf > project_96s-geno01-maf05-snps.sorted.uniq-40tou.recode.vcf
5.(基因填充beagle)
6.去除0号染色体
# 删除0、27、30号染色体
nohup /home/software/vcftools/src/cpp/vcftools --vcf project_96s-geno01-maf05-snps.sorted.uniq-40tou.recode.vcf --recode --recode-INFO-all --not-chr 0 --out project_96s-geno01-maf05-snps.sorted.uniq-40tou-noinclude0 &
nohup /home/software/vcftools/src/cpp/vcftools --vcf project_96s-geno01-maf05-snps.sorted.uniq-40tou-noinclude0.recode.vcf --recode --recode-INFO-all --not-chr 27 --out project_96s-geno01-maf05-snps.sorted.uniq-40tou-noinclude0-27 &
nohup /home/software/vcftools/src/cpp/vcftools --vcf project_96s-geno01-maf05-snps.sorted.uniq-40tou-noinclude0-27.recode.vcf --recode --recode-INFO-all --not-chr 30 --out project_96s-geno01-maf05-snps.sorted.uniq-40tou-noinclude0-27-30 &
7.转成map ped,调整ped文件前两列 ,第一列为品种名称,第二列为样品名称,做出bed bam fam文件
nohup /home/software/vcftools/src/cpp/vcftools --vcf project_96s-geno01-maf05-snps.sorted.uniq-40tou-noinclude0-27-30.recode.vcf --plink --out project_96s-geno01-maf05-snps.sorted.uniq-40tou-noinclude0-27-30.recode &(vcf---map ped)
cat project_96s-geno01-maf05-snps.sorted.uniq-40tou-noinclude0-27-30.recode.ped | awk -F " " '{$1=null;print}' > project_96s-geno01-maf05-snps.sorted.uniq-40tou-noinclude0-27-30-1.recode.ped
cat project_96s-geno01-maf05-snps.sorted.uniq-40tou-noinclude0-27-30-1.recode.ped | awk -F " " '{$1=null;print}' > project_96s-geno01-maf05-snps.sorted.uniq-40tou-noinclude0-27-30-12.recode.ped
dos2unix 96.txt
paste -d "" 96.txt project_96s-geno01-maf05-snps.sorted.uniq-40tou-noinclude0-27-30-12.recode.ped > project_96s-geno01-maf05-snps.sorted.uniq-40tou-noinclude0-27-30.recode.ped
sed -i 's/ /\t/g' project_96s-geno01-maf05-snps.sorted.uniq-40tou-noinclude0-27-30.recode.ped
nohup /home/software/plink --allow-extra-chr --chr-set 26 --file project_96s-geno01-maf05-snps.sorted.uniq-40tou-noinclude0-27-30.recode --make-bed --out project_96s-geno01-maf05-snps.sorted.uniq-40tou-noinclude0-27-30.recode &(map ped ---bed bim fam)
8.计算ROH
nohup /home/software/plink --allow-extra-chr --chr-set 26 --bfile project_96s-geno01-maf05-snps.sorted.uniq-40tou-noinclude0-27-30.recode --homozyg-window-snp 50 --homozyg-snp 50 --homozyg-kb 300 --homozyg-density 50 --homozyg-gap 1000 --homozyg-window-missing 5 --homozyg-window-threshold 0.05 --homozyg-window-het 03 --out project_96s-geno01-maf05-snps.sorted.uniq-40tou-noinclude0-27-30.recode &
#.indiv文件中NSEG为个数,KB为长度,KBAVG为平均长度。
将 ROHs 分为了三类,(1)短片段 ROHs(class A);(2)中等长度 ROHs(class B);(3)长片段 ROHs(class C)。
对照的界值为 class A: 500Kb-700Kb, class B: 700Kb-1500Kb, class C:超过 1500Kb 。
http://events.jianshu.io/p/5f5d78def191
Plink 中使用滑动窗口的方法来进行 ROHs 的检测:特定 SNP 数目的窗口以一定的步长在 SNP 芯片上滑动,对能够满足一定参数(默认:缺失率<5,杂合率<1)的窗口进行拼接从而完成对 ROHs 的检测。
这个过程在 Plink v1.07中通过命令“homozyg“来完成,考虑到基因组上存在一定数量的拷贝数改变,这种拷贝数改变(CNV)往往较短,通过 Plink 进行 ROHs 的检测时无法将这类拷贝数改变区分开来,参考以前的研究,长度在 0.5Mb 以上的拷贝数改变在健康人群中非常少见,因此将 ROHs 的长度限制设置为 0.5Mb。另外,如果一个ROH 包含两个距离超过 0.1Mb 的 SNPs,那么这个 ROH 被打断为两个 ROHs。
每个 ROH 包含的 SNPs 数量阈值尚无统一定论,因此分别评估 SNPs 阈值为50,60,70,80,90,100 个 SNPs 的 ROHs 的情况。
ROH分析(重测序数据分染色体做)
# 求和:
cat D1.hwe | awk '{a+=$7}END{print a}'
# 计算ROH:
do /home/software/plink --file /home/zhangsj/Wild_European/Plink/Wild_European_chr${K}? --homozyg --out Wild_European_chr${K}
# 得到三个文件:.hom /.hom.summary / .hom.indiv
#使用indiv文件进行统计(利用上面的求和脚本),计算每个个体的平均ROH的长度,和数量。
#使用excel进行统计
=SUM(OFFSET(D$1,(ROW(A1)-1)*19,,19)) }
ROH画图R脚本
detectRUNS: an R package to detect runs of homozygosity and heterozygosity in diploid genomes (r-project.org)
注意:R中使用的map ped 文件均为空格格式!!!
install.packages('detectRUNS')
library('detectRUNS')
setwd("D:/putty/Rdata/ROH/detectRUNS_0.9.6/dec")
# getting map and ped paths
genotypeFile <- system.file("dec", "2.recode.ped", package = "detectRUNS")
mapFile <- system.file("dec", "2.recode.map", package = "detectRUNS")
# sliding-window-based run detectionslid
slidingRuns <- slidingRUNS.run( genotypeFile = "2.recode.ped",mapFile = "2.recode.map", windowSize = 15, threshold = 0.05, minSNP = 20, ROHet = FALSE, maxOppWindow = 1, maxMissWindow = 1, maxGap = 10^6, minLengthBps = 250000, minDensity = 1/10^3, maxOppRun = NULL, maxMissRun = NULL )
# consecutive SNP-based run detection
consecutiveRuns <- consecutiveRUNS.run(genotypeFile = "Kijas2016_Sheep_subset.ped",mapFile = "Kijas2016_Sheep_subset.map", minSNP = 20, ROHet = FALSE, maxGap = 10^6, minLengthBps = 250000, maxOppRun = 1, maxMissRun = 1 )
# “Runs of heterozygosity” 计算杂合
slidingRuns_het <- slidingRUNS.run( genotypeFile = "Kijas2016_Sheep_subset.ped",mapFile = "Kijas2016_Sheep_subset.map", windowSize = 10, threshold = 0.05, minSNP = 10, ROHet = TRUE, maxOppWindow = 2, maxMissWindow = 1, maxGap = 10^6, minLengthBps = 10000, minDensity = 1/10^6, maxOppRun = NULL, maxMissRun = NULL )
consecutiveRuns_het <- consecutiveRUNS.run( genotypeFile = "Kijas2016_Sheep_subset.ped",mapFile = "Kijas2016_Sheep_subset.map", minSNP = 10, ROHet = TRUE, maxGap = 10^6, minLengthBps = 10000, maxOppRun = 2, maxMissRun = 1 )
# Summary statistics on detected runs
summaryList <- summaryRuns( runs = slidingRuns, mapFile = "2.recode.map", genotypeFile = "2.recode.ped", Class = 6, snpInRuns = TRUE)
summaryList$summary_ROH_count # 可以看到雅各布斯绵羊中0-6mbp有894个ROH。
summaryList$summary_ROH_mean_chr # 得到每个染色体和每个品种的平均ROH数。
head(summaryList$SNPinRun) # 对于每个SNP,数据框“SNPinRun”包含了它在任何给定人群/组中运行的次数比例。
# The summary information included in the list returned from summaryRuns() is conveniently organized in data.frames, so that it can easily be visualized, manipulated and written out to text files.
# 包含在从summaryRuns()返回的列表中的摘要信息可以方便地组织在data.frames中,这样就可以方便地可视化、操作并将其写入文本文件。
# Plots
plot_Runs(runs = slidingRuns[slidingRuns$chrom==2,])
plot_StackedRuns(runs = slidingRuns)
plot_StackedRuns(runs = slidingRuns[slidingRuns$group=="XLL",])
plot_SnpsInRuns(runs = slidingRuns[slidingRuns$chrom==2,], genotypeFile = "2.recode.ped", mapFile = "2.recode.map")
plot_SnpsInRuns( runs = slidingRuns[slidingRuns$chrom==24,], genotypeFile = "Kijas2016_Sheep_subset.ped", mapFile = "Kijas2016_Sheep_subset.map")
topRuns <- tableRuns( runs = slidingRuns, genotypeFile = "2.recode.ped", mapFile ="2.recode.map", threshold = 0.7)
plot_manhattanRuns( runs = slidingRuns[slidingRuns$group=="XLL",], genotypeFile ="2.recode.ped", mapFile = "2.recode.map")
# From runs of homozygosity (ROH)
# getting map and ped paths
genotypeFile <- system.file("dec", "2.recode.ped", package = "detectRUNS")
mapFile <- system.file("dec", "2.recode.map", package = "detectRUNS")
# calculating runs of Homozygosity
## Not run:
# skipping runs calculation
runs <- slidingRUNS.run( genotypeFile= "2.recode.ped", mapFile ="2.recode.map", windowSize = 15, threshold = 0.1, minSNP = 15,ROHet = FALSE, maxOppositeGenotype = 1, maxMiss = 1, minLengthBps = 100000, minDensity = 1/10000)
## End(Not run)
# loading pre-calculated data
runsFile <- system.file("dec", "2.recode.csv", package="detectRUNS")
runsDF <- readExternalRuns(inputFile = runsFile, program = 'detectRUNS')
Froh_inbreeding(runs = runsDF, mapFile = "2.recode.map")
Froh_inbreeding(runs = runsDF, mapFile = "2.recode.map", genome_wide=FALSE)
slidingRuns <- slidingRUNS.run( genotypeFile = "2.recode.ped",mapFile = "2.recode.map", windowSize = 15, threshold = 0.05, minSNP = 20, ROHet = FALSE, maxOppWindow = 1, maxMissWindow = 1, maxGap = 10^6, minLengthBps = 250000, minDensity = 1/10^3, maxOppRun = NULL, maxMissRun = NULL )
ROH画图孙脚本
制作391.recode.hom.indiv.txt
Kazakh SRR 246 140312
Kazakh SRR 235 123750
Mongolian SRR 242 124087
Mongolian SRR 212 106493
Mongolian SRR 214 257443
Mongolian SRR 178 81486.2
Jingjiang SRR 647 352246
Jingjiang SRR 612 345693
b=read.table("391.recode.hom.indiv.txt",header=F)
# txt为indiv数据
head(b)
library("ggplot2")
qplot(b[,3],b[,4],col=factor(b[,1]),xlab="11",ylab="22")
Breed=b[,1]
png("roh.png",width = 2200,height = 1400,units="px",res=391) # res为分辨率
ggplot(data = b , aes(x = b[,3],y = b[,4],group = Breed,shape = Breed,color = Breed))+labs(x="Number",y="Length")+geom_point(size=1.2)+scale_shape_manual(values = seq(0,75))
dev.off()
x轴为数量
y轴为长度
plink计算近交系数 inbreeding coefficient
近交系数是指根据近亲交配的世代数,将基因的纯化程度用百分数来表示即为近交系数,也指个体由于近交而造成异质基因减少时,同质基因或纯合子所占的百分比也叫近交系数,个体中两个亲本的共祖系数。
从基因层面来说,近亲婚配的后果就是一个基因的allele来自共同祖先,即血缘同源IBD。 近交系数的定义为在一个常染色体基因座上获得 2 个同源相同(identity by descent, IBD)等位基因的概率。为了更客观的描述个体间近亲婚配情况,提出以下两个概念:
1.coffcient of relationship,针对两个个体间,表示的是两个个体间来自共同祖先的同源基因比例,称之为共祖系数。
2.cofficient of inbreeding,针对一个个体,表示的是该个体任意一个基因的两个allele来自同一个祖先的概率,称之为近交系数。
1.基于基因组关系矩阵的近交系数(FGRM)
2.基因组近交系数
(1)基于ROH计算的基因组近交系数 (FROH) -homozyg
FROH:Froh refers to the proportion of the genome (0-1) that is in runs of homozygosity (ROHs).
# 计算LROH(分染色体做,后合在一起)
vi ROH.sh
for K in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 ;do /home/software/vcftools/src/cpp/vcftools --vcf test.vcf --LROH --chr ${K} --out test.vcf.${K}ROH ;done
nohup bash ROH.sh &
cat *.LROH > all.LROH # 将所有的.ROH制作成all.ROH,打开去除表头
# 芯片v3.0中绵羊基因组全长2653.84Mb
# 牛基因组全长2489.37Mb
# 重测序绵羊基因组全长
Froh.png
R统计LROH
a=read.table("all.LROH",header=FALSE)
cha=a[,3]-a[,2]
jishu=aggregate(cha,by=list(a[,8]),length)
zc=aggregate(cha,by=list(a[,8]),sum)
aa=data.frame(jishu,zc)
write.table(aa,"aa.txt",col.names=F,row.names=F,quote=F,sep="\t")
(2)基于基因组杂合度的近交系数(Fhom) -het (使用plink软件根据snp同源性计算近交系数)
Fh is the canonical estimate of genomic F based on excess SNP homozygosity.
[Keller Matthew C,Visscher Peter M,Goddard Michael E. Quantification of inbreeding due to distant ancestors and its detection using dense single nucleotide polymorphism data.[J]. Genetics,2011,189(1).]
nohup /home/software/plink --noweb --file project_96s-geno01-maf05-snps.sorted.uniq-40tou-noinclude0-27-30.recode --het --out project_96s-geno01-maf05-snps.sorted.uniq-40tou-noinclude0-27-30.recode &
#该分析将自动跳过单倍体标记(男性X和Y染色体标记)
#.het文件中 FID:家系ID,IID:个体ID,O(HOM):观察到的纯合子数量,E(HOM):期望的纯合子数量,N(NM):非缺失基因型数量,F:近交系数估计(本例:F=0.036538)
值得注意的是,从概念的定义可以看出,F值理论上是位于0到1范围内的正数,而软件的计算结果中会出现负数,这通常是计算过程中随机抽样的误差,说明该计算结果不是很可靠。但是如果负值非常大,比如-0.5以上,这说明这个样本杂合子较多,可能存在了DNA的污染,其分型结果是有问题的。
例如:大部分个体没有近交情况,绝大多数个体近交系数在0~0.15,少部分个体近交程度比较高,近交系数最高为0.25。[郝鑫宇,冯健,孙延晓,陈琳琳,王起山,潘玉春.保种猪场系谱记录、管理及群体遗传结构分析——以莱芜猪为例[J].中国畜牧杂志,2020,56(04):60-65.]
3.系谱近交系数
系谱近交系数 (FPED)
plink计算亲缘系数
https://www.jianshu.com/p/8340055d3646?utm_campaign=haruki
亲缘系数又称血缘系数,亲属间遗传相关系数,亲缘相关系数。是指将群体中个体之间基因组成的相似程度用数值来表示即为血缘系数。意义即拥有共同祖先的两个人,在某一位点上具有相同基因的概率。它与近交系数不同 , 近交系数 Fχ是说明 x 这个个体本身是什么程度的近交产生的 , 而亲缘系数则是反映两个个体间的遗传相关(亲缘)程度。
#该步骤进行样本间的亲缘关系估计,并将清除未标记明显亲子关系的个体亲缘关系过近的个体。
plink --noweb --file 3 --genome --out 4
# 3表示质控后的PED和MAP文件
# 输出:4.genome
FID1 First sample’s family ID
IID1 First sample’s within-family ID
FID2 Second sample’s family ID
IID2 Second sample’s within-family ID
RT Relationship type inferred from .fam/.ped file,从fam/ped文件推断的样本关系 UN
EZ IBD sharing expected value, based on just .fam/.ped relationship NA
Z0 P(IBD=0)
Z1 P(IBD=1)
Z2 P(IBD=2)
PI_HAT P(IBD=2)+0.5*P(IBD=1) ( proportion IBD ) ////// Proportion IBD, i.e. P(IBD=2) + 0.5*P(IBD=1),IBD比例
PHE Pairwise phenotypic code (1,0,-1 = AA, AU and UU pairs) ////// Pairwise phenotypic code (1, 0, -1 = case-case, case-ctrl, and ctrl-ctrl pairs, respectively)
DST IBS distance (IBS2 + 0.5*IBS1) / ( N SNP pairs ) ////// IBS distance, i.e. (IBS2 + 0.5*IBS1) / (IBS0 + IBS1 + IBS2)
PPC IBS binomial test
RATIO Of HETHET : IBS 0 SNPs (expected value is 2)
例如:利用基因组同源相同信息估计个体间亲缘关系,在全基因组关联分析等研究中常被用于无关个体的挑选,部分研究使用的标准为个体间pi_ hat<0. 125。传统上,利用系谱信息估计亲缘系数为0.125时,反映的是至少3代以上个体间的亲缘关系。与此同时,有研究认为亲缘关系在3代以外的个体可以近似为无关个体。本研究表明,约50%的个体对间亲缘系数大于等于0.125,即个体间存在较近的亲缘关系。同样,利用基因组信息通过计算基因组纯合片段占比发现,约16.67%的样本近交系数大于等于0.125。上述分析表明, 恩施黑猪个体间亲缘关系较近,具有明显的近交累积现象。
[吴林慧,孙琦,王荔茹,张凯丽,谢胜松,李新云,赵书红,马云龙.恩施黑猪基因组群体遗传学参数的估计与选择信号研究[J].畜牧兽医学报,2019,50(03):485-494.]
例如:绝大部分个体间的亲缘系数在0~0.35,亲缘系数最高的为0.579。[郝鑫宇,冯健,孙延晓,陈琳琳,王起山,潘玉春.保种猪场系谱记录、管理及群体遗传结构分析——以莱芜猪为例[J].中国畜牧杂志,2020,56(04):60-65.]
R计算近交系数与亲缘系数
https://blog.csdn.net/yijiaobani/article/details/87876901
https://cloud.tencent.com/developer/article/1550322
1.近交系数
ID <- c("X","B","A","C","D","E","H","G","I")
Sire <- c("B","E","C","E","E","H",0,0,0)
Dam <- c("A",0,"D",0,"G","G","I","I",0)
ped <- data.frame(ID,Sire,Dam)
ped
install.packages('nadiv')
library(nadiv)
pped =prepPed(ped)
A = as.matrix(makeA(pped));A
re = data.frame(ID=row.names(A),inbreeding = diag(A)-1)
write.csv(re,"inbreeding.csv",row.names=F)
2.亲缘系数
id <- c(289,135,181,90,188,49)
sire <- c(135,90,49,16,0,16)
dam <- c(181,188,188,0,0,0)
ped <- data.frame(id,sire,dam)
ped
library(nadiv)
pped = prepPed(ped)
mat = as.matrix(makeA(pped))
# 135 VS 181
mat[5,6]/sqrt(mat[5,5]*mat[6,6])
# 289 VS 16
mat[1,7]/sqrt(mat[1,1]*mat[7,7])
id = row.names(mat)
id1 = rep(id,length(id))
id2 = rep(id,each=length(id))
coeff = matrix(0,dim(mat)[1],dim(mat)[1])
for(i in 1:dim(mat)[1]){
for(j in 1:dim(mat)[1]){
coeff[i,j] = mat[i,j]/sqrt(mat[i,i]*mat[j,j])
}
}
coeff_value = as.vector(coeff)
re = data.frame(id1,id2,coeff_value)
write.csv(re,"coeff.csv",row.names=F)
IBD与IBS
IBD全称Identity By Descent, 又叫做血缘同源,指的是两个个体中共有的等位基因来源于共同祖先。
IBS全称Identity By State, 又叫做状态同源,指的是两个个体中共有的等位基因序列相同。IBS的遗传距离可以衡量样本 间的相似性,评价其亲缘关系。
对于某个等位基因,IBS state只要求allel的个数相同即可,而IBD state则进一步要求相同的allel来自于共同祖先。
IBD与IBS.png
1.iBD
https://blog.csdn.net/weixin_43569478/article/details/108079637
iBD
# 利用IBD可以描述两个样本间的亲缘关系,采用plink计算(IBD)PI_HAT:
nohup /home/software/plink --allow-extra-chr --chr-set 29 --noweb --file test --genome --allow-no-sex --out test-ibd &
# 理想状态下父子关系的两个样本,Z0, Z1, Z2对应的值分别为0,1, 0,所有位点的一个allel都继承自父本;同卵双胞胎的两个样本,则为0,0,1,所有的allel都来自共同的祖先,对于异卵双胞胎,则为0.25,0.5,0.25
# PI_HAT这个统计量的取值范围为0-1,数值越大,两个样本的亲缘关系越近,当为1时,表示的就是同卵双胞胎,或者重复样本,可以根据这个值筛选亲缘关系近的样本进行过滤。
各列含义.png
2.iBS
https://blog.csdn.net/weixin_43569478/article/details/108079152
https://www.jianshu.com/p/d555dc28c3c0?from=singlemessage
IBS.png
iBS
# 在家系数据中,由于有父代的分型数据, IBD运用的很多,在自然群体中,则通常使用IBS。
# 基于SNP分型结果, 我们可以计算样本间的IBS距离。
# 距离可以衡量样本间的相似性,根据IBS distance距离矩阵,可以对样本进行MDS分析。
# 通过MDS分析对样本构成可以有一个清楚的认识,也可以用于剔除明显偏离群体的样本。在实际的数据分析中,可以借助plink软件来计算IBS距离矩阵,用法如下:
nohup /home/software/plink --allow-extra-chr --chr-set 29 --file QC.project_96s_noinclude0.recode-502502.bed-geno02-maf03 --cluster --matrix &
# 默认情况下会生成plink.mibs文件,是一个距离矩阵,可以用R语言读取,然后进行MDS分析。R语言中MDS分析的代码如下R画图
m <- as.matrix(read.table("plink.mibs"))
mds <- cmdscale(as.dist(1-m))
k <- c( rep("green",45) , rep("blue",44) )
plot(mds,pch=20,col=k)
legend("topleft", legend = c("CHB", "JPB"), pch = 20, col = c("green", "blue"))
计算SNP位点杂合度:期望杂合度 观测杂合度(分群体计算后,所有的加在一起除以位点数)
nohup /home/software/plink --allow-extra-chr --chr-set 29 --hardy --file com-517-51 --out com-517-51 &
分别计算
O(HET)
cat test.hwe |awk '{sum+=$7} END {print "Average = ", sum/NR}'
E(HET)
cat test.hwe |awk '{sum+=$8} END {print "Average = ", sum/NR}'
查看ped文件有多少列
cat filename | awk '{print NF}'
cat filename | awk -F '{print NF}' (非tab键文件使用-F)