The genetic evaluations discussed so far combined phenotypic data and pedigree information (resemblance between relatives) to predict individuals’ performance. We use pedigree information to built the numerator relationship matrix based on probabilities that alleles are identical by descent (IBD). In addition, it is possible to obtain the information of relationship between individuals based on molecular markers. The idea is to estimate the proportion of DNA that are similar between a pair of individuals based on matching of markers alleles, being known as identical by state (IBS).
We can incorporate the markers information into the linear mixed model and predict the estimated breeding values (\(EBVs\)) of individual with phenotypic record and also individuals without phenotypic record, once they have been genotyped. Here, we will use the information of single nucleotide polimorphism (\(SNPs\)) markers into the genomic best linear unbiased prediction (\(GBLUP\)) to predict genomic values of individuals.
rm(list=ls())
require(asreml)
## Online License checked out Tue Dec 6 08:58:06 2022
require(dplyr)
require(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.2
require(superheat)
This dataset was simulated using the coalescent theory implemented in \(AlphasimR\) package (Gaynor, Gorjanc, and Hickey 2021). This dataset mimic an evaluation of 500 maize genotypes, in one location, and the trait measured was Yield. Also, a set of 6759 single nucleotide polymorphism (SNPs) were randomly sampled through the 12 pair of chromosomes.
Factor | Levels |
---|---|
Location | One |
Genotypes | 500 |
Block | 3 |
Traits | Yield |
Heritability | 0.4 |
# Reading the dataset
phenodata = read.table("GBLUP_phenodata.txt", h=TRUE,na.strings='NA')
# Plotting the trait distribution
ggplot(phenodata, aes(x = Yield)) +
geom_histogram(aes(y = after_stat(density)), bins = 30,
colour = "black", fill = "blue") +
geom_density(alpha = .7, linewidth = 1.5, colour = "red") +
labs(x = NULL, title = "Yield",y="Density")
# Setting as factors
phenodata$Genotype = phenodata$Genotype %>% as.factor()
phenodata$Rep = phenodata$Rep %>% as.factor()
The dosage matrix (SNP matrix) represents the counting of the number of allele copies at each position/loci in the individuals’ DNA. Hence,
\[ AA = 2 \] \[ Aa = 1 \] \[ aa = 0 \]
# Reading the SNP matrix
genodata = read.table("GBLUP_genodata.txt", h=TRUE)
# Transforming it into matrix format for latter
SNP_data = as.matrix(genodata)
dim(SNP_data)
## [1] 500 6759
SNP_data[1:10,1:10]
## SNP_1 SNP_2 SNP_3 SNP_4 SNP_5 SNP_6 SNP_7 SNP_8 SNP_9 SNP_10
## G1 2 0 0 0 NA 0 0 2 NA 1
## G2 2 0 0 0 2 0 0 2 2 1
## G3 1 NA 0 0 1 NA 0 1 2 1
## G4 1 1 0 0 2 0 0 1 2 NA
## G5 1 1 1 0 1 0 0 1 2 1
## G6 1 1 0 0 1 1 0 1 2 1
## G7 1 1 1 0 1 0 0 1 2 1
## G8 NA NA 1 0 NA 0 0 1 NA NA
## G9 1 1 0 0 1 1 NA 1 2 NA
## G10 0 2 0 0 0 1 0 0 NA 2
We implement a one step GBLUP. The model was given as follows:
\[ y = Xb + Zg + e \]
where \(y\) is the vector of phenotypic data; \(b\) is the vector of repetitions (assumed to be fixed); \(g\) is the vector of genotype effects (assumed to be random), where \(g~N(0,G\sigma^2_g)\), \(G\) is a matrix that describes the genomic similarities between pair of genotypes and \(\sigma^2_g\) represents the genetic variance; and \(e\) is the vector of residuals (random), where \(e~N(0,\sigma^2_{res})\), \(\sigma^2_{res}\) is the residual variance . The letters \(X\) and \(Z\) refers to the incidence matrices for \(b\) and \(g\), respectively.
The kernel is variance-covariance structure that represents the relationship between observations (in our case individuals) based on some measured marker or trait (in our case the SNPs).
# Toy example
mat = matrix(c(0,1,2,0,1,2,0,1,2,0,1,2,0,1,2,0,2,1,0,0,1,2,0,1), ncol=6)
mat1 = scale(mat)
# Creating the kernel
mat1%*%t(mat1)
## [,1] [,2] [,3] [,4]
## [1,] 3.5454545 -1.090909 -3.272727 0.8181818
## [2,] -1.0909091 4.909091 -2.181818 -1.6363636
## [3,] -3.2727273 -2.181818 7.090909 -1.6363636
## [4,] 0.8181818 -1.636364 -1.636364 2.4545455
mm=tcrossprod(mat1)
Here, we will built the additive relationship matrix following the propositions made by VanRaden (2008), as follows:
\[ G = {\frac{ZZ'}{2{\sum{p_i(1-p_i)}}}} \]
where \(p_i\) and \(1-p_i\) represents the allele frequency for both \(A\) and \(a\) at each loci. In this case, we divided by \({2{\sum{p_i(1-p_i)}}}\) to scale G to be analogous to the relationship matrix A ((Isik, Holland, and Maltecca 2017; VanRaden 2008)).
To build the additive kernel (aka genomic relationship matrix), we will use the package AGHMatrix (Amadeu et al. 2016). This package uses the SNP information coded as 2,1,0 to create a relationship matrix between the pair of genotypes. Also, we are able to built other kernels, such as dominance relationship matrix. In our case, we should pay attention in three parameters that will be important while we create the kernel:
Minor allele frequency (\(maf\)): This parameter is connected with the frequency of the alleles in the population. As rare alleles tend to decrease the importance of alleles that contributed with the target trait, we can filter and drop those with small frequency.
Threshold: This parameter is connected with the quality of the SNP dataset. It represents a threshold to the amount of missing data that we will allow for both, SNP and genotypes.
Method: In this case, which one should be the method used to built the kernel. For additive kernels, using the SNP data, the method indicated is the one from VanRaden (2008), as previously discussed.
require(AGHmatrix)
## Carregando pacotes exigidos: AGHmatrix
# Built the G matrix based on Van Raden method.
G.Mat = Gmatrix(SNPmatrix=SNP_data,
maf=0.05,
thresh = 0.8,
method="VanRaden",
missingValue = NA)
## Initial data:
## Number of Individuals: 456
## Number of Markers: 6759
##
## Missing data check:
## Total SNPs: 6759
## 0 SNPs dropped due to missing data threshold of 0.8
## Total of: 6759 SNPs
## MAF check:
## 1043 SNPs dropped with MAF below 0.05
## Total: 5716 SNPs
## Monomorphic check:
## No monomorphic SNPs
## Summary check:
## Initial: 6759 SNPs
## Final: 5716 SNPs ( 1043 SNPs removed)
##
## Completed! Time = 3.84 seconds
# Check the G matrix
G.Mat[1:10,1:10]
## G1 G2 G3 G4 G5 G6 G7
## G1 1.3034627 0.2862966 0.3369708 0.4856171 0.4576207 0.19614447 0.2623478
## G2 0.2862966 1.3315674 0.3809725 0.5775074 0.3423925 0.33667944 0.4248994
## G3 0.3369708 0.3809725 1.2743977 0.4855393 0.1729801 0.40527340 0.3505137
## G4 0.4856171 0.5775074 0.4855393 1.3700459 0.2929668 0.27572141 0.4963317
## G5 0.4576207 0.3423925 0.1729801 0.2929668 1.2419685 0.25970945 0.2827460
## G6 0.1961445 0.3366794 0.4052734 0.2757214 0.2597095 1.15886458 0.2875433
## G7 0.2623478 0.4248994 0.3505137 0.4963317 0.2827460 0.28754327 1.1840038
## G8 0.3495422 0.3252146 0.3196278 0.3591825 0.3874575 0.32915606 0.3354296
## G9 0.1717145 0.4330856 0.3846506 0.3456988 0.2600038 0.38117939 0.3773658
## G10 0.1326128 0.2064758 0.2198335 0.2048313 0.1208767 0.07926512 0.1781531
## G8 G9 G10
## G1 0.3495422 0.17171452 0.13261281
## G2 0.3252146 0.43308562 0.20647583
## G3 0.3196278 0.38465057 0.21983352
## G4 0.3591825 0.34569876 0.20483127
## G5 0.3874575 0.26000383 0.12087669
## G6 0.3291561 0.38117939 0.07926512
## G7 0.3354296 0.37736579 0.17815311
## G8 1.1858051 0.25891975 0.14649824
## G9 0.2589197 1.12299564 0.09962113
## G10 0.1464982 0.09962113 1.34586500
superheat::superheat(G.Mat)
# Adding a value to the diagonal of the matrix if you get an error of singularities
#diag(G.Mat) = diag(G.Mat) + 0.001
# Requisition of ASREML
attr(G.Mat,"rowNames")<-as.character(colnames(G.Mat))
attr(G.Mat,"colNames")<-as.character(colnames(G.Mat))
str(G.Mat)
## num [1:500, 1:500] 1.303 0.286 0.337 0.486 0.458 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:500] "G1" "G2" "G3" "G4" ...
## ..$ : chr [1:500] "G1" "G2" "G3" "G4" ...
## - attr(*, "rowNames")= chr [1:500] "G1" "G2" "G3" "G4" ...
## - attr(*, "colNames")= chr [1:500] "G1" "G2" "G3" "G4" ...
# Model 1: GBLUP
mod1= asreml(fixed = Yield ~ Rep,
random = ~ vm(Genotype, G.Mat),
data = phenodata)
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Dec 6 08:58:15 2022
## LogLik Sigma2 DF wall cpu
## 1 -4664.733 169.996 1497 08:58:16 0.2
## 2 -4659.928 173.576 1497 08:58:16 0.2
## 3 -4659.158 175.672 1497 08:58:16 0.2
## 4 -4659.157 175.825 1497 08:58:16 0.2
## 5 -4659.157 175.760 1497 08:58:17 0.2
summary(mod1)
## $call
## asreml(fixed = Yield ~ Rep, random = ~vm(Genotype, G.Mat), data = phenodata)
##
## $loglik
## [1] -4659.157
##
## $nedf
## [1] 1497
##
## $sigma
## [1] 13.25744
##
## $varcomp
## component std.error z.ratio bound %ch
## vm(Genotype, G.Mat) 7.819558 2.498582 3.129599 P 0.4
## units!R 175.759702 6.771460 25.955952 P 0.0
##
## $bic
## [1] 9332.936
## attr(,"parameters")
## [1] 2
##
## $aic
## [1] 9322.313
## attr(,"parameters")
## [1] 2
##
## attr(,"class")
## [1] "summary.asreml"
The cross-validations methods are divided into four different classes (sensu Jarquin et al. (2018)), regarding the genotypes and environments. As follows:
Figure 1 - Cross-validations schemes.
In the first part we will organize the phenotypic information and create the partitions to the cross validation scheme. Generally, we can divide the phenotypic dataset into two:
Testing set: Population where the phenotype will be predicted based on the markers and in the information from the training set.
Training set: Population with the information that we use to calibrate the model.
# Create an ID to track genotypes in the TST and TRN data sets
phenodata$unique_ID = paste(phenodata$Genotype, phenodata$Rep, sep='_')
head(phenodata)
## Location Genotype Rep Yield unique_ID
## 1 Florida G1 1 147.3268 G1_1
## 2 Florida G2 1 140.5955 G2_1
## 3 Florida G3 1 136.1656 G3_1
## 4 Florida G4 1 149.5841 G4_1
## 5 Florida G5 1 130.7658 G5_1
## 6 Florida G6 1 150.9746 G6_1
# Creating the folds for cross validation
nInd=500 # Number of individuals
k=10 # Number of folds
set.seed(1234) # Guarantee the same individuals in each fold
folds = sample(1:k, size=nInd,replace=T) # Creating the folds
# Object to store the values across loop
corr = vector()
Second, the implementation of the prediction into a loop structure, as follows:
for(i in 1:max(folds)){ # loop across the folds
# Store the information of individuals that we will predict
pred.gen=which(folds==i)
pred.gen=paste0("G",pred.gen)
test.set = phenodata %>%
filter(Genotype %in% pred.gen) %>%
droplevels()
# Assuming NA in the data for the individuals that we will predict
train.set = phenodata
train.set$yNA = replace(train.set$Yield, (train.set$Genotype %in% pred.gen), NA)
# GBLUP model
mod2 = asreml(fixed = yNA ~ Rep,
random = ~ vm(Genotype,G.Mat),
na.action=na.method(y="include"),
data = train.set)
# Assessing the predictions by 'predict' function
yHat =predict(mod2,classify="Genotype+Rep",sed=T)$pvals
# Creating the id for later combining
yHat.set = yHat %>%
mutate(unique_ID = paste(Genotype,Rep,sep="_")) %>%
droplevels()
# Combining the datasets
comb_data = merge(yHat.set, test.set, by="unique_ID")
# Prediction accuracy (correlation)
corr[i] = cor(comb_data$predicted.value,comb_data$Yield, use="complete")
}
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Dec 6 08:58:17 2022
## LogLik Sigma2 DF wall cpu
## 1 -4246.950 172.742 1359 08:58:17 0.1
## 2 -4242.763 176.359 1359 08:58:17 0.1
## 3 -4242.039 178.580 1359 08:58:17 0.1
## 4 -4242.037 178.755 1359 08:58:17 0.1
## 5 -4242.037 178.680 1359 08:58:17 0.1
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Dec 6 08:58:18 2022
## LogLik Sigma2 DF wall cpu
## 1 -4242.037 178.711 1359 08:58:18 0.2
## 2 -4242.037 178.707 1359 08:58:18 0.1
## 3 -4242.037 178.700 1359 08:58:19 1.7
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Dec 6 08:58:20 2022
## LogLik Sigma2 DF wall cpu
## 1 -4303.417 170.506 1380 08:58:20 0.1
## 2 -4299.308 173.950 1380 08:58:20 0.1
## 3 -4298.608 176.060 1380 08:58:20 0.1
## 4 -4298.607 176.229 1380 08:58:20 0.1
## 5 -4298.607 176.155 1380 08:58:20 0.1
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Dec 6 08:58:20 2022
## LogLik Sigma2 DF wall cpu
## 1 -4298.607 176.186 1380 08:58:21 0.1
## 2 -4298.607 176.182 1380 08:58:21 0.1
## 3 -4298.607 176.175 1380 08:58:22 1.6
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Dec 6 08:58:22 2022
## LogLik Sigma2 DF wall cpu
## 1 -4152.031 167.587 1335 08:58:23 0.1
## 2 -4147.909 171.112 1335 08:58:23 0.1
## 3 -4147.210 173.251 1335 08:58:23 0.1
## 4 -4147.209 173.411 1335 08:58:23 0.1
## 5 -4147.208 173.342 1335 08:58:23 0.1
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Dec 6 08:58:23 2022
## LogLik Sigma2 DF wall cpu
## 1 -4147.208 173.371 1335 08:58:23 0.2
## 2 -4147.208 173.367 1335 08:58:23 0.1
## 3 -4147.208 173.360 1335 08:58:25 1.6
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Dec 6 08:58:25 2022
## LogLik Sigma2 DF wall cpu
## 1 -4186.479 169.227 1344 08:58:25 0.1
## 2 -4182.547 172.662 1344 08:58:26 0.1
## 3 -4181.878 174.754 1344 08:58:26 0.1
## 4 -4181.877 174.910 1344 08:58:26 0.1
## 5 -4181.876 174.842 1344 08:58:26 0.1
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Dec 6 08:58:26 2022
## LogLik Sigma2 DF wall cpu
## 1 -4181.876 174.871 1344 08:58:26 0.2
## 2 -4181.876 174.867 1344 08:58:26 0.1
## 3 -4181.876 174.860 1344 08:58:28 1.8
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Dec 6 08:58:28 2022
## LogLik Sigma2 DF wall cpu
## 1 -4200.508 170.430 1347 08:58:28 0.2
## 2 -4196.877 173.762 1347 08:58:29 0.1
## 3 -4196.172 175.946 1347 08:58:29 0.1
## 4 -4196.170 176.162 1347 08:58:29 0.1
## 5 -4196.169 176.072 1347 08:58:29 0.1
## 6 -4196.169 176.108 1347 08:58:29 0.1
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Dec 6 08:58:29 2022
## LogLik Sigma2 DF wall cpu
## 1 -4196.169 176.093 1347 08:58:29 0.2
## 2 -4196.169 176.095 1347 08:58:29 0.1
## 3 -4196.169 176.098 1347 08:58:31 1.6
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Dec 6 08:58:31 2022
## LogLik Sigma2 DF wall cpu
## 1 -4010.927 169.615 1287 08:58:31 0.1
## 2 -4006.418 173.472 1287 08:58:31 0.1
## 3 -4005.779 175.574 1287 08:58:32 0.1
## 4 -4005.778 175.657 1287 08:58:32 0.1
## 5 -4005.778 175.619 1287 08:58:32 0.1
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Dec 6 08:58:32 2022
## LogLik Sigma2 DF wall cpu
## 1 -4005.778 175.636 1287 08:58:32 0.2
## 2 -4005.778 175.634 1287 08:58:32 0.1
## 3 -4005.778 175.630 1287 08:58:34 1.7
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Dec 6 08:58:34 2022
## LogLik Sigma2 DF wall cpu
## 1 -4174.676 166.288 1344 08:58:34 0.1
## 2 -4169.166 170.412 1344 08:58:34 0.1
## 3 -4168.429 172.572 1344 08:58:34 0.1
## 4 -4168.428 172.669 1344 08:58:35 0.1
## 5 -4168.428 172.627 1344 08:58:35 0.1
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Dec 6 08:58:35 2022
## LogLik Sigma2 DF wall cpu
## 1 -4168.428 172.645 1344 08:58:35 0.2
## 2 -4168.428 172.642 1344 08:58:35 0.1
## 3 -4168.428 172.638 1344 08:58:37 1.8
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Dec 6 08:58:37 2022
## LogLik Sigma2 DF wall cpu
## 1 -4241.406 171.337 1359 08:58:37 0.1
## 2 -4236.251 175.364 1359 08:58:37 0.1
## 3 -4235.552 177.501 1359 08:58:38 0.1
## 4 -4235.551 177.591 1359 08:58:38 0.1
## 5 -4235.551 177.551 1359 08:58:38 0.1
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Dec 6 08:58:38 2022
## LogLik Sigma2 DF wall cpu
## 1 -4235.551 177.568 1359 08:58:38 0.2
## 2 -4235.551 177.566 1359 08:58:38 0.1
## 3 -4235.551 177.562 1359 08:58:40 1.9
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Dec 6 08:58:40 2022
## LogLik Sigma2 DF wall cpu
## 1 -4265.812 172.774 1365 08:58:40 0.1
## 2 -4261.310 176.504 1365 08:58:41 0.1
## 3 -4260.598 178.682 1365 08:58:41 0.1
## 4 -4260.597 178.828 1365 08:58:41 0.1
## 5 -4260.596 178.765 1365 08:58:41 0.1
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Dec 6 08:58:41 2022
## LogLik Sigma2 DF wall cpu
## 1 -4260.596 178.791 1365 08:58:41 0.2
## 2 -4260.596 178.788 1365 08:58:41 0.1
## 3 -4260.596 178.781 1365 08:58:43 1.6
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Dec 6 08:58:43 2022
## LogLik Sigma2 DF wall cpu
## 1 -4211.011 170.732 1350 08:58:43 0.1
## 2 -4206.258 174.569 1350 08:58:44 0.1
## 3 -4205.593 176.648 1350 08:58:44 0.1
## 4 -4205.592 176.748 1350 08:58:44 0.1
## 5 -4205.592 176.703 1350 08:58:44 0.1
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Dec 6 08:58:44 2022
## LogLik Sigma2 DF wall cpu
## 1 -4205.592 176.723 1350 08:58:44 0.1
## 2 -4205.592 176.720 1350 08:58:44 0.1
## 3 -4205.592 176.715 1350 08:58:46 1.6
mean(corr)
## [1] 0.2423157