In this file I show how to implement a GBLUP model in a CV1 scheme using BGLR (Pérez-Rodrı́guez & Los Campos, 2022) in a Single- and Multi-trait framework.
To perform the analyses, we will need the following packages:
This dataset was simulated using the coalescent theory implemented in \(AlphasimR\) package (Gaynor et al., 2021). IT mimics an evaluation of 500 maize genotypes, in four locations, and four traits were measured. Also, a set of 3000 single nucleotide polymorphism (SNPs) were randomly sampled through the 10 pair of chromosomes.
Phenotypic data
We generated BLUEs (best linear unbiased estimation) for each trait, and
it can be loaded from Phenotypes.csv file.
#--- 1. Loading the data
Phenotypes = read.csv("Phenotypes.csv")
#--- 2. As factor
Phenotypes$Genotype = Phenotypes$Genotype %>% as.factor
Phenotypes$Env = Phenotypes$Env %>% as.factor
#--- 3. Plotting
Phenotypes[,c(1,3:6)] %>%
pivot_longer(., cols = c('Trait1','Trait2','Trait3','Trait4')) %>%
ggplot(., aes(x = Env, y = value)) +
geom_boxplot(color="brown", fill="orange", alpha=0.2) +
theme(strip.text = element_text(face = "bold", size = 20, colour = 'blue'))+
labs(x = 'Environment', title = "Traits (BLUEs) distribution across environments",y="Density")+
facet_wrap(~name,scales = "free")
Genomic data
The genomic data (3000 SNP markers) was coded as 0,1,2 for aa,Aa,AA,
respectively. The dataset for all 500 genotypes can be loaded from
Genotypes.csv file.
#--- 1. Loading the SNPs
Genotypes = as.matrix(read.csv("Genotypes.csv"))
#--- 2. Genotypes names
rownames(Genotypes) = unique(Phenotypes$Genotype)
Creating G Matrix
Here, we will compose the additive relationship matrix (G matrix) following the propositions made by VanRaden (2008) in the AGHmatrix package (Amadeu et al., 2016).
Single-trait model
The GBLUP model in a singletrait framework is given as follows:
\[ y = 1u + Za + e \]
where \(y\) is the vector of BLUEs data for the target trait; \(u\) represents the mean (fixed); \(a\) is the vector of genotype effects (assumed to be random), where \(a \sim N(0,G\sigma^2_a)\), \(G\) is a matrix that describes the genomic similarities between pair of genotypes and \(\sigma^2_a\) represents the genetic variance; and \(e\) is the vector of residuals (random), where \(e\sim N(0,I\sigma^2_{res})\), \(\sigma^2_{res}\) is the residual variance. The letter \(Z\) refers to the incidence matrix for \(a\).
Multi-trait model
The GBLUP model in a multitrait framework is given as follows:
\[ y = 1u + Za + e \]
where \(y\) is the matrix of BLUEs data for the target traits; \(u\) represents the mean (fixed); \(a\) is the matrix of genotype effects (assumed to be random), where \(a \sim N(0,\sum_a\otimes G)\), \(G\) is a matrix that describes the genomic similarities between pair of genotypes and \(\sum_a\) represents the genetic covariance matrix; and \(e\) is the residuals (random), where \(e\sim N(0,\sum_{res}\otimes I)\), \(\sum_{res}\) is the residual covariance matrix. The letter \(Z\) refers to the incidence matrix for \(a\).
For this model, we used the trait 1 from the environment 1.
#--- 1. Organizing the phenotypic data
Y<- Phenotypes[Phenotypes$Env == 1,c(1:3)]
#--- 2. Scale the pheno data
y = scale(Y[,3]) # only the trait
Deploy the model using the Gmatrix, phenodata and the BGLR() function
#--- 1. Parameters for the folds
nReps = 5 # change here for more repetitions (10-20 is a good fit)
nFolds = 5 # 5 is reasonable, but it depends of or dataset size
#--- 2. Output for the estimated values
yHatCV=rep(NA,length(Y$Trait1))
pred = data.frame()
#--- 3. Model
# Phenodata
y = as.matrix(y)
# Time for the loop
set.seed(0928761)
for(Rep in 1:nReps){
folds=sample(1:nFolds,size=length(Y$Trait1),replace=T)
for(i in 1:max(folds)){
tst=which(folds==i)
yNA=y
yNA[tst]=NA
# model
fm=BGLR(y=yNA,
ETA=list(list(K=GMat,model='RKHS')),
nIter=1200,
burnIn=120,
verbose = FALSE)
# Predicted values
yHatCV[tst]=fm$yHat[tst]
}
# Accuracy for each fold
pred=rbind(pred,
data.frame(Repetitions = Rep,
Cor=cor(yHatCV, y)))
}
#--- 4. Mean for the correlation
mean(pred$Cor)
To implement this analyses we used trait number 1 over four environments.
#--- Organizing the phenotypic data
Y<- Phenotypes[,c(1,2,3)] %>%
pivot_wider(., names_from = 'Env', values_from = 'Trait1')
#--- Scale the data
y = scale(Y[,-1])
dim(y)
Deploy the model using Multitrait() function.
#--- 1. Parameters for the folds
nReps = 5 # change here for more repetitions (10-20 is a good fit)
nFolds = 5 # 5 is reasonable, but it depends of or dataset size
#--- 2. Output for the estimated values
yHatCV=data.frame(yHat1 = rep(NA,dim(y)[1]),
yHat2 = rep(NA,dim(y)[1]),
yHat3 = rep(NA,dim(y)[1]),
yHat3 = rep(NA,dim(y)[1]))
pred = data.frame()
#--- 3. Model
set.seed(0928761)
for (i in 1:nFolds){
# Partitions for a 5-FCV
folds=sample(1:nFolds,size=dim(y)[1],replace=T)
# Model using multitrait
for(p in 1:nReps){
tst=which(folds==p)
yNA=y
yNA[tst,]=NA
# Model
fm <- Multitrait(y=yNA,
ETA = list(list(K=GMat,model='RKHS')),
resCov = list(type="DIAG"),
nIter = 1200,
burnIn = 120,
saveAt='MGB_',
verbose = FALSE)
yHatCV[tst,] = fm$ETAHat[tst,]
}
# Accuracy for each fold
pred=rbind(pred,
data.frame(Fold = i,
Cor1=diag(cor(yHatCV,y))[1],
Cor2=diag(cor(yHatCV,y))[2],
Cor3=diag(cor(yHatCV,y))[3],
Cor4=diag(cor(yHatCV,y))[4]))
}
pred
#--- 4. Extracting estimates of variance parameters
(CORB.RES<-fm$resCov$R) # residual covariance matrix
(CORB.u<-fm$ETA[[1]]$Cov$Omega) # genetic covariance matrix UE
#--- 5. Trait correlation
cov2cor(CORB.u)
Now, lets use the information of 4 traits in just one environment (number 4).
#--- 1. Organizing the phenotypic data
Y<- Phenotypes[Phenotypes$Env == 4,]
#--- 2. Scale the pheno data
y = scale(Y[,-c(1:2)])
Deploy the model using Multitrait() function.
#--- 1. Parameters for the folds
nReps = 5 # change here for more repetitions (10-20 is a good fit)
nFolds = 5 # 5 is reasonable, but it depends of or dataset size
#--- 2. Output for the estimated values
yHatCV=data.frame(yHat1 = rep(NA,dim(y)[1]),
yHat2 = rep(NA,dim(y)[1]),
yHat3 = rep(NA,dim(y)[1]),
yHat3 = rep(NA,dim(y)[1]))
pred = data.frame()
#--- 3. Model
set.seed(0928761)
for (i in 1:nFolds){
# Partitions for a 5-FCV
folds=sample(1:nFolds,size=dim(y)[1],replace=T)
# Model using multitrait
for(p in 1:nReps){
tst=which(folds==p)
yNA=y
yNA[tst]=NA
fm <- Multitrait(y=yNA,
ETA = list(list(K=GMat,model='RKHS')),
resCov = list(type="DIAG"),
nIter = 1200,
burnIn = 120,
saveAt='MGB_',
verbose = FALSE)
yHatCV[tst,] = fm$ETAHat[tst,]
}
# Accuracy for each fold
pred=rbind(pred,
data.frame(Fold = i,
Cor1=diag(cor(yHatCV,y))[1],
Cor2=diag(cor(yHatCV,y))[2],
Cor3=diag(cor(yHatCV,y))[3],
Cor4=diag(cor(yHatCV,y))[4]))
}
pred
#--- 4. Extracting estimates of variance parameters
(CORB.RES<-fm$resCov$R) # residual covariance matrix
(CORB.u<-fm$ETA[[1]]$Cov$Omega) # genetic covariance matrix UE
#--- 5. Trait correlation
cov2cor(CORB.u)
University of Florida, deamorimpeixotom@ufl.edu↩︎