CV1_BGLR.knit

1. Genomic selection in a CV1 scheme

In this file I show how to implement a GBLUP model in a CV1 scheme using BGLR (Pérez-Rodrı́guez & Los Campos, 2022) in a Single- and Multi-trait framework.

To perform the analyses, we will need the following packages:

rm(list=ls())
require(AGHmatrix)
require(BGLR)
require(tidyverse)
require(dplyr)
require(ggplot2)

2. Datasets

This dataset was simulated using the coalescent theory implemented in \(AlphasimR\) package (Gaynor et al., 2021). IT mimics an evaluation of 500 maize genotypes, in four locations, and four traits were measured. Also, a set of 3000 single nucleotide polymorphism (SNPs) were randomly sampled through the 10 pair of chromosomes.

Phenotypic data
We generated BLUEs (best linear unbiased estimation) for each trait, and it can be loaded from Phenotypes.csv file.

#--- 1. Loading the data
Phenotypes = read.csv("Phenotypes.csv")

#--- 2. As factor
Phenotypes$Genotype = Phenotypes$Genotype %>% as.factor
Phenotypes$Env = Phenotypes$Env %>% as.factor

#--- 3. Plotting
Phenotypes[,c(1,3:6)] %>%
  pivot_longer(., cols = c('Trait1','Trait2','Trait3','Trait4')) %>% 
ggplot(., aes(x = Env, y = value)) +
  geom_boxplot(color="brown", fill="orange", alpha=0.2) + 
  theme(strip.text = element_text(face = "bold", size = 20, colour = 'blue'))+
  labs(x = 'Environment', title = "Traits (BLUEs) distribution across environments",y="Density")+
  facet_wrap(~name,scales = "free")

Figure 1. Trait performance over the four environments

Genomic data
The genomic data (3000 SNP markers) was coded as 0,1,2 for aa,Aa,AA, respectively. The dataset for all 500 genotypes can be loaded from Genotypes.csv file.

#--- 1. Loading the SNPs
Genotypes = as.matrix(read.csv("Genotypes.csv"))

#--- 2. Genotypes names
rownames(Genotypes) = unique(Phenotypes$Genotype)

Creating G Matrix

Here, we will compose the additive relationship matrix (G matrix) following the propositions made by VanRaden (2008) in the AGHmatrix package (Amadeu et al., 2016).

#--- 1. Additive matrix
GMat = AGHmatrix::Gmatrix(Genotypes)

3. Statistical model

Single-trait model

The GBLUP model in a singletrait framework is given as follows:

\[ y = 1u + Za + e \]

where \(y\) is the vector of BLUEs data for the target trait; \(u\) represents the mean (fixed); \(a\) is the vector of genotype effects (assumed to be random), where \(a \sim N(0,G\sigma^2_a)\), \(G\) is a matrix that describes the genomic similarities between pair of genotypes and \(\sigma^2_a\) represents the genetic variance; and \(e\) is the vector of residuals (random), where \(e\sim N(0,I\sigma^2_{res})\), \(\sigma^2_{res}\) is the residual variance. The letter \(Z\) refers to the incidence matrix for \(a\).

Multi-trait model

The GBLUP model in a multitrait framework is given as follows:

\[ y = 1u + Za + e \]

where \(y\) is the matrix of BLUEs data for the target traits; \(u\) represents the mean (fixed); \(a\) is the matrix of genotype effects (assumed to be random), where \(a \sim N(0,\sum_a\otimes G)\), \(G\) is a matrix that describes the genomic similarities between pair of genotypes and \(\sum_a\) represents the genetic covariance matrix; and \(e\) is the residuals (random), where \(e\sim N(0,\sum_{res}\otimes I)\), \(\sum_{res}\) is the residual covariance matrix. The letter \(Z\) refers to the incidence matrix for \(a\).

4. Single-trait GBLUP using BGLR() function

For this model, we used the trait 1 from the environment 1.

#--- 1. Organizing the phenotypic data
Y<- Phenotypes[Phenotypes$Env == 1,c(1:3)] 

#--- 2. Scale the pheno data
y = scale(Y[,3]) # only the trait

Deploy the model using the Gmatrix, phenodata and the BGLR() function

#--- 1. Parameters for the folds
nReps = 5 # change here for more repetitions (10-20 is a good fit)
nFolds = 5 # 5 is reasonable, but it depends of or dataset size

#--- 2. Output for the estimated values
yHatCV=rep(NA,length(Y$Trait1))
pred = data.frame()

#--- 3. Model
# Phenodata
y = as.matrix(y)

# Time for the loop
set.seed(0928761) 
 for(Rep in 1:nReps){
   
 folds=sample(1:nFolds,size=length(Y$Trait1),replace=T)  
 
  for(i in 1:max(folds)){
    tst=which(folds==i)
    yNA=y
    yNA[tst]=NA
    # model
    fm=BGLR(y=yNA,
            ETA=list(list(K=GMat,model='RKHS')),
            nIter=1200,
            burnIn=120,
            verbose = FALSE)
    # Predicted values
    yHatCV[tst]=fm$yHat[tst]
  }
 # Accuracy for each fold
 pred=rbind(pred, 
            data.frame(Repetitions = Rep,
                       Cor=cor(yHatCV, y)))
}
 

#--- 4. Mean for the correlation
mean(pred$Cor)

5. Multi-trait GBLUP using Multitrait() function

To implement this analyses we used trait number 1 over four environments.

#--- Organizing the phenotypic data

Y<- Phenotypes[,c(1,2,3)] %>%  
  pivot_wider(., names_from = 'Env', values_from = 'Trait1') 

#--- Scale the data
y = scale(Y[,-1])
dim(y)

Deploy the model using Multitrait() function.

#--- 1. Parameters for the folds
nReps = 5 # change here for more repetitions (10-20 is a good fit)
nFolds = 5 # 5 is reasonable, but it depends of or dataset size

#--- 2. Output for the estimated values
yHatCV=data.frame(yHat1 = rep(NA,dim(y)[1]),
                  yHat2 = rep(NA,dim(y)[1]),
                  yHat3 = rep(NA,dim(y)[1]),
                  yHat3 = rep(NA,dim(y)[1]))
pred = data.frame()

#--- 3. Model
set.seed(0928761)

for (i in 1:nFolds){
  
# Partitions for a 5-FCV
folds=sample(1:nFolds,size=dim(y)[1],replace=T)  
  
# Model using multitrait
for(p in 1:nReps){
  
  tst=which(folds==p)
  yNA=y
  yNA[tst,]=NA
  
  # Model
  fm <- Multitrait(y=yNA,
                   ETA = list(list(K=GMat,model='RKHS')),
                   resCov = list(type="DIAG"),
                   nIter = 1200,
                   burnIn = 120,
                   saveAt='MGB_',
                   verbose = FALSE)
  
  yHatCV[tst,] = fm$ETAHat[tst,]
}

# Accuracy for each fold
 pred=rbind(pred, 
            data.frame(Fold = i,
                       Cor1=diag(cor(yHatCV,y))[1],
                       Cor2=diag(cor(yHatCV,y))[2],
                       Cor3=diag(cor(yHatCV,y))[3],
                       Cor4=diag(cor(yHatCV,y))[4]))

}

pred


#--- 4. Extracting estimates of variance parameters
(CORB.RES<-fm$resCov$R) # residual covariance matrix
(CORB.u<-fm$ETA[[1]]$Cov$Omega) # genetic covariance matrix UE

#--- 5. Trait correlation  
cov2cor(CORB.u)

6. Multi-trait GBLUP using Multitrait() function

Now, lets use the information of 4 traits in just one environment (number 4).

#--- 1. Organizing the phenotypic data
Y<- Phenotypes[Phenotypes$Env == 4,] 

#--- 2. Scale the pheno data
y = scale(Y[,-c(1:2)])

Deploy the model using Multitrait() function.

#--- 1. Parameters for the folds
nReps = 5 # change here for more repetitions (10-20 is a good fit)
nFolds = 5 # 5 is reasonable, but it depends of or dataset size

#--- 2. Output for the estimated values
yHatCV=data.frame(yHat1 = rep(NA,dim(y)[1]),
                  yHat2 = rep(NA,dim(y)[1]),
                  yHat3 = rep(NA,dim(y)[1]),
                  yHat3 = rep(NA,dim(y)[1]))

pred = data.frame()

#--- 3. Model
set.seed(0928761)

for (i in 1:nFolds){
  
# Partitions for a 5-FCV
folds=sample(1:nFolds,size=dim(y)[1],replace=T)  
  
# Model using multitrait
for(p in 1:nReps){
  
  tst=which(folds==p)
  yNA=y
  yNA[tst]=NA
  
  fm <- Multitrait(y=yNA,
                   ETA = list(list(K=GMat,model='RKHS')),
                   resCov = list(type="DIAG"),
                   nIter = 1200,
                   burnIn = 120,
                   saveAt='MGB_',
                   verbose = FALSE)
  
  yHatCV[tst,] = fm$ETAHat[tst,]
}

# Accuracy for each fold
 pred=rbind(pred, 
            data.frame(Fold = i,
                       Cor1=diag(cor(yHatCV,y))[1],
                       Cor2=diag(cor(yHatCV,y))[2],
                       Cor3=diag(cor(yHatCV,y))[3],
                       Cor4=diag(cor(yHatCV,y))[4]))
}

pred

#--- 4. Extracting estimates of variance parameters
(CORB.RES<-fm$resCov$R) # residual covariance matrix
(CORB.u<-fm$ETA[[1]]$Cov$Omega) # genetic covariance matrix UE

#--- 5. Trait correlation  
cov2cor(CORB.u)

References

Amadeu, R. R., Cellon, C., Olmstead, J. W., Garcia, A. A., Resende Jr, M. F., & Muñoz, P. R. (2016). AGHmatrix: R package to construct relationship matrices for autotetraploid and diploid species: A blueberry example. The Plant Genome, 9(3), plantgenome2016–01.

Gaynor, R. C., Gorjanc, G., & Hickey, J. M. (2021). AlphaSimR: An r package for breeding program simulations. G3, 11(2), jkaa017.

Pérez-Rodrı́guez, P., & Los Campos, G. de. (2022). Multitrait bayesian shrinkage and variable selection models with the BGLR-r package. Genetics, 222(1), iyac112.

VanRaden, P. M. (2008). Efficient methods to compute genomic predictions. Journal of Dairy Science, 91(11), 4414–4423.

University of Florida, deamorimpeixotom@ufl.edu ↩︎

Cross-validation scheme 1

Marco Antonio Peixoto, Post doc¹

1. Genomic selection in a CV1 scheme

2. Datasets

3. Statistical model

4. Single-trait GBLUP using BGLR() function

5. Multi-trait GBLUP using Multitrait() function

6. Multi-trait GBLUP using Multitrait() function

References

Cross-validation scheme 1

Marco Antonio Peixoto, Post doc1

1. Genomic selection in a CV1 scheme

2. Datasets

3. Statistical model

4. Single-trait GBLUP using BGLR() function

5. Multi-trait GBLUP using Multitrait() function

6. Multi-trait GBLUP using Multitrait() function

References

Marco Antonio Peixoto, Post doc¹