CV2_BGLR.knit

1. Genomic selection in a CV2 scheme

In this file I show how to implement a GBLUP model in a CV2 scheme using BGLR (Pérez-Rodrı́guez & Los Campos, 2022) in a Single- and Multi-trait framework.

To perform the analyses, we will need the following packages:

#--- 1. Cleaning
rm(list=ls())

#--- 2. Packages
require(AGHmatrix)

## Carregando pacotes exigidos: AGHmatrix

require(BGLR)

## Carregando pacotes exigidos: BGLR

require(tidyverse)

## Carregando pacotes exigidos: tidyverse

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.2     ✔ purrr   1.0.1
## ✔ tibble  3.2.1     ✔ dplyr   1.1.2
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.2     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

require(dplyr)
require(ggplot2)

2. Datasets

This dataset was simulated using the coalescent theory implemented in \(AlphasimR\) package (Gaynor et al., 2021). IT mimics an evaluation of 500 maize genotypes, in four locations, and four traits were measured. Also, a set of 3000 single nucleotide polymorphism (SNPs) were randomly sampled through the 10 pair of chromosomes.

Phenotypic data
We generated BLUEs (best linear unbiased estimation) for each trait, and it can be loaded from Phenotypes.csv file.

#--- 1. Loading the data
Phenotypes = read.csv("Phenotypes.csv")

#--- 2. As factor
Phenotypes$Genotype = Phenotypes$Genotype %>% as.factor
Phenotypes$Env = Phenotypes$Env %>% as.factor

Genomic data
The genomic data (3000 SNP markers) was coded as 0,1,2 for aa,Aa,AA, respectively. The dataset for all 500 genotypes can be loaded from Genotypes.csv.

Creating G Matrix

Here, we will compose the additive relationship matrix (G matrix) following the propositions made by VanRaden (2008).

## Initial data: 
##  Number of Individuals: 500 
##  Number of Markers: 3000 
## 
## Missing data check: 
##  Total SNPs: 3000 
##   0 SNPs dropped due to missing data threshold of 1 
##  Total of: 3000  SNPs 
## MAF check: 
##  No SNPs with MAF below 0 
## Monomorphic check: 
##  No monomorphic SNPs 
## Summary check: 
##  Initial:  3000 SNPs 
##  Final:  3000  SNPs ( 0  SNPs removed) 
##  
## Completed! Time = 2.29  seconds

3. Statistical model

Single-trait model

The GBLUP model in a singletrait framework is given as follows:

\[ y = 1u + Za + e \]

where \(y\) is the vector of BLUEs data for the target trait; \(u\) represents the mean (fixed); \(a\) is the vector of genotype effects (assumed to be random), where \(a \sim N(0,G\sigma^2_a)\), \(G\) is a matrix that describes the genomic similarities between pair of genotypes and \(\sigma^2_a\) represents the genetic variance; and \(e\) is the vector of residuals (random), where \(e\sim N(0,I\sigma^2_{res})\), \(\sigma^2_{res}\) is the residual variance. The letter \(Z\) refers to the incidence matrix for \(a\).

Multi-trait model

The GBLUP model in a multitrait framework is given as follows:

\[ y = 1u + Za + e \]

where \(y\) is the matrix of BLUEs data for the target traits; \(u\) represents the mean (fixed); \(a\) is the matrix of genotype effects (assumed to be random), where \(a \sim N(0,\sum_a\otimes G)\), \(G\) is a matrix that describes the genomic similarities between pair of genotypes and \(\sum_a\) represents the genetic covariance matrix; and \(e\) is the residuals (random), where \(e\sim N(0,\sum_{res}\otimes I)\), \(\sum_{res}\) is the residual covariance matrix. The letter \(Z\) refers to the incidence matrix for \(a\).

4. Single-trait GBLUP using BGLR() function

For the model implementation we use three different strategies. The first strategy is the prediction a single trait model where I used the information from the trait to predict individuals in the same environment.

So, we will use all trait 1 over four environments.

#--- 1. Organizing the phenotypic data

Y<- Phenotypes[,c(1,2,3)] %>%  
  pivot_wider(., names_from = 'Env', values_from = 'Trait1') 

#--- 2. Scale the data
y = scale(Y[,-1])

Single-trait model

Implementing the model using the function BGLR()

#--- 1. Setting NAs for the individuals
  set.seed(524521)
  yNA <-y
  yNA[sample(1:dim(Y)[1],100),1] <- NA
  yNA[sample(1:dim(Y)[1],100),2] <- NA
  yNA[sample(1:dim(Y)[1],100),3] <- NA
  yNA[sample(1:dim(Y)[1],100),4] <- NA

#--- 2. Preparing the single trait model
  y1NA <- yNA[,1] #Trait 1
  y2NA <- yNA[,2] #Trait 2
  y3NA <- yNA[,3] #Trait 3
  y4NA <- yNA[,4] #Trait 4

#--- 3. ETA and model for each trait
  ETA <- list(A=list(K=G.Mat, model='RKHS'))
  fm1 <- BGLR(y1NA,ETA=ETA,nIter=1200,burnIn=120,verbose=FALSE)
  fm2 <- BGLR(y2NA,ETA=ETA,nIter=1200,burnIn=120,verbose=FALSE)
  fm3 <- BGLR(y3NA,ETA=ETA,nIter=1200,burnIn=120,verbose=FALSE)
  fm4 <- BGLR(y4NA,ETA=ETA,nIter=1200,burnIn=120,verbose=FALSE)

#--- 4. Correlation for single trait
  test<-is.na(yNA)
  tst1<-test[,1];tst2<-test[,2]<-tst3<-test[,3];tst4<-test[,4]
  COR.ST1 <- cor(fm1$yHat[tst1],y[tst1,1])
  COR.ST2 <- cor(fm2$yHat[tst2],y[tst2,2])
  COR.ST3 <- cor(fm3$yHat[tst3],y[tst3,3])
  COR.ST4 <- cor(fm4$yHat[tst4],y[tst4,4])

Multi trait model

The same information is used with several traits at once. In this second scenario, we used the function Multitrait() from BGLR. The prediction in this case is for the partial information via associated traits.

#--- 1. Set to NA the
  set.seed(524521)
  yNA <-y
  yNA[sample(1:dim(Y)[1],100),1] <- NA
  yNA[sample(1:dim(Y)[1],100),2] <- NA
  yNA[sample(1:dim(Y)[1],100),3] <- NA
  yNA[sample(1:dim(Y)[1],100),4] <- NA

#--- 2. Model
fm <- Multitrait(y = yNA,
                 ETA = list(A=list(K = G.Mat, model='RKHS', Cov = list(type = 'UN'))), 
                 nIter = 1200, 
                 burnIn = 120, 
                 thin = 5, 
                 verbose = FALSE)

## Checking variance co-variance matrix K  for linear term 1

## Ok

## Setting linear term 1

## MSx=1.49553739794528

## UNstructured covariance matrix

## df0 was set to 5

## S0 set to

##          1         2        3         4
## 1 3.357451 1.1475108 1.230007 1.2813106
## 2 1.147511 3.5389379 1.358102 0.9823556
## 3 1.230007 1.3581023 3.507723 1.2229457
## 4 1.281311 0.9823556 1.222946 3.4048273

## Initializing resCov

## Setting hyperparameters for UNstructured R

## S0 was set to

##          1        2        3        4
## 1 5.021194 1.716145 1.839521 1.916248
## 2 1.716145 5.292614 2.031093 1.469150
## 3 1.839521 2.031093 5.245931 1.828961
## 4 1.916248 1.469150 1.828961 5.092047

## Done

#--- 3. Extracting the estimated parameters
YHatInt <- fm$ETAHat

#--- 4. Predictive correlation
test<-is.na(yNA)
tst1<-test[,1];tst2<-test[,2]<-tst3<-test[,3];tst4<-test[,4]
COR.T1<-cor(YHatInt[tst1,1],y[tst1,1])
COR.T2<-cor(YHatInt[tst2,2],y[tst2,2])
COR.T3<-cor(YHatInt[tst3,3],y[tst3,3])
COR.T4<-cor(YHatInt[tst4,4],y[tst4,4])

#--- 5. Extracting estimates of variance parameters
(COR.RES<-fm$resCov$R) # residual covariance matrix

##            1          2         3          4
## 1 0.53942397 0.08142737 0.1345714 0.07613355
## 2 0.08142737 0.74555259 0.2194581 0.06983051
## 3 0.13457140 0.21945809 0.7647425 0.12847156
## 4 0.07613355 0.06983051 0.1284716 0.67923635

(COR.u<-fm$ETA[[1]]$Cov$Omega) # genetic covariance matrix UE

##           1         2         3         4
## 1 0.3694649 0.1542590 0.1543947 0.2009270
## 2 0.1542590 0.2499385 0.1314739 0.1302072
## 3 0.1543947 0.1314739 0.2333951 0.1516298
## 4 0.2009270 0.1302072 0.1516298 0.2830937

#--- 6. Trait correlation  
cov2cor(COR.u)

##           1         2         3         4
## 1 1.0000000 0.5076302 0.5257750 0.6212790
## 2 0.5076302 1.0000000 0.5443485 0.4895005
## 3 0.5257750 0.5443485 1.0000000 0.5898936
## 4 0.6212790 0.4895005 0.5898936 1.0000000

Correlation for the models (MT and ST)

#--- 1. Correlation output from both models
CorCV2 <- matrix(NA,nrow=4,ncol=2)
colnames(CorCV2) <- c('MT','ST')

  # MTM
  CorCV2[1,1] <- COR.T1
  CorCV2[2,1] <- COR.T2
  CorCV2[3,1] <- COR.T3
  CorCV2[4,1] <- COR.T4
 
  # STM
  CorCV2[1,2] <- COR.ST1
  CorCV2[2,2] <- COR.ST2
  CorCV2[3,2] <- COR.ST3
  CorCV2[4,2] <- COR.ST4

#--- 2. Comparison table

CorCV2

##             MT        ST
## [1,] 0.5177216 0.4094381
## [2,] 0.6541127 0.7008189
## [3,] 0.3581804 0.2873510
## [4,] 0.3133186 0.2624125

References

Gaynor, R. C., Gorjanc, G., & Hickey, J. M. (2021). AlphaSimR: An r package for breeding program simulations. G3, 11(2), jkaa017.

Pérez-Rodrı́guez, P., & Los Campos, G. de. (2022). Multitrait bayesian shrinkage and variable selection models with the BGLR-r package. Genetics, 222(1), iyac112.

VanRaden, P. M. (2008). Efficient methods to compute genomic predictions. Journal of Dairy Science, 91(11), 4414–4423.

University of Florida, deamorimpeixotom@ufl.edu ↩︎

Cross validation scheme 2 (CV2) in BGLR package

Marco Antonio Peixoto, Post doc¹

1. Genomic selection in a CV2 scheme

2. Datasets

3. Statistical model

4. Single-trait GBLUP using BGLR() function

Single-trait model

Multi trait model

Correlation for the models (MT and ST)

References

Cross validation scheme 2 (CV2) in BGLR package

Marco Antonio Peixoto, Post doc1

1. Genomic selection in a CV2 scheme

2. Datasets

3. Statistical model

4. Single-trait GBLUP using BGLR() function

Single-trait model

Multi trait model

Correlation for the models (MT and ST)

References

Marco Antonio Peixoto, Post doc¹