In this file I show how to implement a GBLUP model in a CV2 scheme using BGLR (Pérez-Rodríguez & de los Campos, 2022) in a single- and multi-trait framework.
To perform the analyses, we will need the following packages:
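The chunk that loads the packages is not displayed in the output; a minimal sketch, assuming only the three packages that appear in the loading messages below are needed:
#--- Loading the required packages
require(AGHmatrix)
require(BGLR)
require(tidyverse)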
## Loading required package: AGHmatrix
## Loading required package: BGLR
## Loading required package: tidyverse
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.2 ✔ purrr 1.0.1
## ✔ tibble 3.2.1 ✔ dplyr 1.1.2
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
This dataset was simulated using the coalescent theory implemented in the \(AlphaSimR\) package (Gaynor et al., 2021). It mimics an evaluation of 500 maize genotypes in four locations, with four traits measured. In addition, a set of 3000 single nucleotide polymorphisms (SNPs) was randomly sampled across the 10 pairs of chromosomes.
Phenotypic data
We generated BLUEs (best linear unbiased estimates) for each trait, which can be loaded from the Phenotypes.csv file.
#--- 1. Loading the data
Phenotypes = read.csv("Phenotypes.csv")
#--- 2. As factor
Phenotypes$Genotype = Phenotypes$Genotype %>% as.factor
Phenotypes$Env = Phenotypes$Env %>% as.factor
Genomic data
The genomic data (3000 SNP markers) were coded as 0, 1, 2 for aa, Aa, and AA, respectively. The dataset for all 500 genotypes can be loaded from Genotypes.csv.
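The chunk that reads the marker file is not shown; a minimal sketch, assuming Genotypes.csv has genotypes in rows and SNPs in columns, with genotype identifiers in the first column (the file layout is an assumption):
#--- 1. Loading the genomic data (0/1/2 coding)
Genotypes = read.csv("Genotypes.csv", row.names = 1) %>% as.matrix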
Creating the G matrix
Here, we will build the additive genomic relationship matrix (G matrix) following the propositions of VanRaden (2008).
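The call that builds G.Mat is not shown in the file; a minimal sketch, assuming the VanRaden method of AGHmatrix::Gmatrix() with a MAF threshold of zero, consistent with the log below (no SNPs removed):
#--- 2. Additive genomic relationship matrix (VanRaden, 2008)
G.Mat = Gmatrix(SNPmatrix = Genotypes, method = "VanRaden", maf = 0) # Genotypes: 0/1/2 marker matrix loaded above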
## Initial data:
## Number of Individuals: 500
## Number of Markers: 3000
##
## Missing data check:
## Total SNPs: 3000
## 0 SNPs dropped due to missing data threshold of 1
## Total of: 3000 SNPs
## MAF check:
## No SNPs with MAF below 0
## Monomorphic check:
## No monomorphic SNPs
## Summary check:
## Initial: 3000 SNPs
## Final: 3000 SNPs ( 0 SNPs removed)
##
## Completed! Time = 2.29 seconds
Single-trait model
The GBLUP model in a single-trait framework is given as follows:
\[ y = 1u + Za + e \]
where \(y\) is the vector of BLUEs for the target trait; \(u\) represents the mean (fixed); \(a\) is the vector of genotype effects (assumed to be random), where \(a \sim N(0,G\sigma^2_a)\), \(G\) is a matrix that describes the genomic similarities between pairs of genotypes, and \(\sigma^2_a\) represents the genetic variance; and \(e\) is the vector of residuals (random), where \(e\sim N(0,I\sigma^2_{res})\) and \(\sigma^2_{res}\) is the residual variance. The letter \(Z\) refers to the incidence matrix for \(a\).
Multi-trait model
The GBLUP model in a multi-trait framework is given as follows:
\[ y = 1u + Za + e \]
where \(y\) is the matrix of BLUEs for the target traits; \(u\) represents the mean (fixed); \(a\) is the matrix of genotype effects (assumed to be random), where \(a \sim N(0,\Sigma_a\otimes G)\), \(G\) is a matrix that describes the genomic similarities between pairs of genotypes, and \(\Sigma_a\) represents the genetic covariance matrix; and \(e\) is the matrix of residuals (random), where \(e\sim N(0,\Sigma_{res}\otimes I)\) and \(\Sigma_{res}\) is the residual covariance matrix. The letter \(Z\) refers to the incidence matrix for \(a\).
For the model implementation, we use two different strategies. The first is a single-trait model, in which the information from the trait in each environment is used to predict the unobserved individuals in that same environment. So, we will use Trait 1 measured in all four environments.
#--- 1. Organizing the phenotypic data
Y <- Phenotypes[, c(1, 2, 3)] %>%
  pivot_wider(names_from = 'Env', values_from = 'Trait1')
#--- 2. Scale the data
y = scale(Y[,-1])
Implementing the model using the function BGLR()
#--- 1. Setting 100 individuals per environment to NA (testing sets)
set.seed(524521)
yNA <- y
yNA[sample(1:dim(Y)[1],100),1] <- NA
yNA[sample(1:dim(Y)[1],100),2] <- NA
yNA[sample(1:dim(Y)[1],100),3] <- NA
yNA[sample(1:dim(Y)[1],100),4] <- NA
#--- 2. Preparing the single-trait models (one per environment)
y1NA <- yNA[,1] # Trait 1 in Env 1
y2NA <- yNA[,2] # Trait 1 in Env 2
y3NA <- yNA[,3] # Trait 1 in Env 3
y4NA <- yNA[,4] # Trait 1 in Env 4
#--- 3. ETA and model for each trait
ETA <- list(A=list(K=G.Mat, model='RKHS'))
fm1 <- BGLR(y1NA,ETA=ETA,nIter=1200,burnIn=120,verbose=FALSE)
fm2 <- BGLR(y2NA,ETA=ETA,nIter=1200,burnIn=120,verbose=FALSE)
fm3 <- BGLR(y3NA,ETA=ETA,nIter=1200,burnIn=120,verbose=FALSE)
fm4 <- BGLR(y4NA,ETA=ETA,nIter=1200,burnIn=120,verbose=FALSE)
#--- 4. Predictive correlations for the single-trait models
test <- is.na(yNA)
tst1 <- test[,1]; tst2 <- test[,2]; tst3 <- test[,3]; tst4 <- test[,4]
COR.ST1 <- cor(fm1$yHat[tst1],y[tst1,1])
COR.ST2 <- cor(fm2$yHat[tst2],y[tst2,2])
COR.ST3 <- cor(fm3$yHat[tst3],y[tst3,3])
COR.ST4 <- cor(fm4$yHat[tst4],y[tst4,4])
In the second strategy, the same information is modeled jointly, treating the four environments as correlated traits with the Multitrait() function from BGLR. Here, individuals with partial records are predicted by borrowing information from the associated traits (environments), which characterizes the CV2 scheme.
#--- 1. Setting the same individuals to NA (same seed, same testing sets)
set.seed(524521)
yNA <- y
yNA[sample(1:dim(Y)[1],100),1] <- NA
yNA[sample(1:dim(Y)[1],100),2] <- NA
yNA[sample(1:dim(Y)[1],100),3] <- NA
yNA[sample(1:dim(Y)[1],100),4] <- NA
#--- 2. Model
fm <- Multitrait(y = yNA,
ETA = list(A=list(K = G.Mat, model='RKHS', Cov = list(type = 'UN'))),
nIter = 1200,
burnIn = 120,
thin = 5,
verbose = FALSE)
## Checking variance co-variance matrix K for linear term 1
## Ok
## Setting linear term 1
## MSx=1.49553739794528
## UNstructured covariance matrix
## df0 was set to 5
## S0 set to
## 1 2 3 4
## 1 3.357451 1.1475108 1.230007 1.2813106
## 2 1.147511 3.5389379 1.358102 0.9823556
## 3 1.230007 1.3581023 3.507723 1.2229457
## 4 1.281311 0.9823556 1.222946 3.4048273
## Initializing resCov
## Setting hyperparameters for UNstructured R
## S0 was set to
## 1 2 3 4
## 1 5.021194 1.716145 1.839521 1.916248
## 2 1.716145 5.292614 2.031093 1.469150
## 3 1.839521 2.031093 5.245931 1.828961
## 4 1.916248 1.469150 1.828961 5.092047
## Done
#--- 3. Extracting the estimated parameters
YHatInt <- fm$ETAHat
#--- 4. Predictive correlation
test <- is.na(yNA)
tst1 <- test[,1]; tst2 <- test[,2]; tst3 <- test[,3]; tst4 <- test[,4]
COR.T1<-cor(YHatInt[tst1,1],y[tst1,1])
COR.T2<-cor(YHatInt[tst2,2],y[tst2,2])
COR.T3<-cor(YHatInt[tst3,3],y[tst3,3])
COR.T4<-cor(YHatInt[tst4,4],y[tst4,4])
#--- 5. Extracting estimates of variance parameters
(COR.RES<-fm$resCov$R) # residual covariance matrix
## 1 2 3 4
## 1 0.53942397 0.08142737 0.1345714 0.07613355
## 2 0.08142737 0.74555259 0.2194581 0.06983051
## 3 0.13457140 0.21945809 0.7647425 0.12847156
## 4 0.07613355 0.06983051 0.1284716 0.67923635
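The next two matrices are the estimated genetic (co)variance matrix of the genomic term and its corresponding correlation matrix. The calls that produce them are not shown above; a minimal sketch, assuming the fitted unstructured covariance of the first linear term is stored in fm$ETA[[1]]$Cov$Omega (an assumption about the fitted BGLR object; check str(fm$ETA[[1]]) if the slot differs):
(COV.GEN <- fm$ETA[[1]]$Cov$Omega) # genetic covariance matrix (assumed slot)
cov2cor(COV.GEN) # genetic correlation matrix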
## 1 2 3 4
## 1 0.3694649 0.1542590 0.1543947 0.2009270
## 2 0.1542590 0.2499385 0.1314739 0.1302072
## 3 0.1543947 0.1314739 0.2333951 0.1516298
## 4 0.2009270 0.1302072 0.1516298 0.2830937
## 1 2 3 4
## 1 1.0000000 0.5076302 0.5257750 0.6212790
## 2 0.5076302 1.0000000 0.5443485 0.4895005
## 3 0.5257750 0.5443485 1.0000000 0.5898936
## 4 0.6212790 0.4895005 0.5898936 1.0000000
#--- 1. Correlation output from both models
CorCV2 <- matrix(NA,nrow=4,ncol=2)
colnames(CorCV2) <- c('MT','ST')
# MTM
CorCV2[1,1] <- COR.T1
CorCV2[2,1] <- COR.T2
CorCV2[3,1] <- COR.T3
CorCV2[4,1] <- COR.T4
# STM
CorCV2[1,2] <- COR.ST1
CorCV2[2,2] <- COR.ST2
CorCV2[3,2] <- COR.ST3
CorCV2[4,2] <- COR.ST4
#--- 2. Comparison table
CorCV2
## MT ST
## [1,] 0.5177216 0.4094381
## [2,] 0.6541127 0.7008189
## [3,] 0.3581804 0.2873510
## [4,] 0.3133186 0.2624125
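In this CV2 scenario, the multi-trait model reached higher predictive correlations than the single-trait model in three of the four environments (all but the second).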
University of Florida, deamorimpeixotom@ufl.edu