{width=10%} Cross-validation scheme 1

```{=html}``` ## 1. Genomic selection in a CV1 scheme In this file I show how to implement a **GBLUP** model in a CV1 scheme using BGLR [@perez2022multitrait] in a Single- and Multi-trait framework. To perform the analyses, we will need the following packages: ```{r echo=TRUE, warning=FALSE,eval=FALSE} rm(list=ls()) require(AGHmatrix) require(BGLR) require(tidyverse) require(dplyr) require(ggplot2) ```
## 2. Datasets This dataset was simulated using the coalescent theory implemented in $AlphasimR$ package [@gaynor2021alphasimr]. IT mimics an evaluation of 500 maize genotypes, in four locations, and four traits were measured. Also, a set of 3000 single nucleotide polymorphism (SNPs) were randomly sampled through the 10 pair of chromosomes. **Phenotypic data** We generated BLUEs (best linear unbiased estimation) for each trait, and it can be loaded from *Phenotypes.csv* file. ```{r echo=TRUE, eval=FALSE} #--- 1. Loading the data Phenotypes = read.csv("Phenotypes.csv") #--- 2. As factor Phenotypes$Genotype = Phenotypes$Genotype %>% as.factor Phenotypes$Env = Phenotypes$Env %>% as.factor #--- 3. Plotting Phenotypes[,c(1,3:6)] %>% pivot_longer(., cols = c('Trait1','Trait2','Trait3','Trait4')) %>% ggplot(., aes(x = Env, y = value)) + geom_boxplot(color="brown", fill="orange", alpha=0.2) + theme(strip.text = element_text(face = "bold", size = 20, colour = 'blue'))+ labs(x = 'Environment', title = "Traits (BLUEs) distribution across environments",y="Density")+ facet_wrap(~name,scales = "free") ``` ![Figure 1. Trait performance over the four environments](Figure1(CV1).png){width=100%}
**Genomic data** The genomic data (3000 SNP markers) was coded as 0,1,2 for aa,Aa,AA, respectively. The dataset for all 500 genotypes can be loaded from **Genotypes.csv** file. ```{r echo=TRUE, eval=FALSE} #--- 1. Loading the SNPs Genotypes = as.matrix(read.csv("Genotypes.csv")) #--- 2. Genotypes names rownames(Genotypes) = unique(Phenotypes$Genotype) ``` **Creating G Matrix** Here, we will compose the additive relationship matrix (*G* matrix) following the propositions made by @vanraden2008efficient in the AGHmatrix package [@amadeu2016aghmatrix]. ```{r echo=TRUE, eval=FALSE} #--- 1. Additive matrix GMat = AGHmatrix::Gmatrix(Genotypes) ```
## 3. Statistical model **Single-trait model** The **GBLUP** model in a singletrait framework is given as follows: $$ y = 1u + Za + e $$ where $y$ is the vector of BLUEs data for the target trait; $u$ represents the mean (fixed); $a$ is the vector of genotype effects (assumed to be random), where $a \sim N(0,G\sigma^2_a)$, $G$ is a matrix that describes the genomic similarities between pair of genotypes and $\sigma^2_a$ represents the genetic variance; and $e$ is the vector of residuals (random), where $e\sim N(0,I\sigma^2_{res})$, $\sigma^2_{res}$ is the residual variance. The letter $Z$ refers to the incidence matrix for $a$.
**Multi-trait model** The **GBLUP** model in a multitrait framework is given as follows: $$ y = 1u + Za + e $$ where $y$ is the matrix of BLUEs data for the target traits; $u$ represents the mean (fixed); $a$ is the matrix of genotype effects (assumed to be random), where $a \sim N(0,\sum_a\otimes G)$, $G$ is a matrix that describes the genomic similarities between pair of genotypes and $\sum_a$ represents the genetic covariance matrix; and $e$ is the residuals (random), where $e\sim N(0,\sum_{res}\otimes I)$, $\sum_{res}$ is the residual covariance matrix. The letter $Z$ refers to the incidence matrix for $a$.
## 4. Single-trait GBLUP using BGLR() function For this model, we used the trait 1 from the environment 1. ```{r echo=TRUE, eval=FALSE} #--- 1. Organizing the phenotypic data Y<- Phenotypes[Phenotypes$Env == 1,c(1:3)] #--- 2. Scale the pheno data y = scale(Y[,3]) # only the trait ``` Deploy the model using the Gmatrix, phenodata and the BGLR() function ```{r, echo=-TRUE, eval=FALSE} #--- 1. Parameters for the folds nReps = 5 # change here for more repetitions (10-20 is a good fit) nFolds = 5 # 5 is reasonable, but it depends of or dataset size #--- 2. Output for the estimated values yHatCV=rep(NA,length(Y$Trait1)) pred = data.frame() #--- 3. Model # Phenodata y = as.matrix(y) # Time for the loop set.seed(0928761) for(Rep in 1:nReps){ folds=sample(1:nFolds,size=length(Y$Trait1),replace=T) for(i in 1:max(folds)){ tst=which(folds==i) yNA=y yNA[tst]=NA # model fm=BGLR(y=yNA, ETA=list(list(K=GMat,model='RKHS')), nIter=1200, burnIn=120, verbose = FALSE) # Predicted values yHatCV[tst]=fm$yHat[tst] } # Accuracy for each fold pred=rbind(pred, data.frame(Repetitions = Rep, Cor=cor(yHatCV, y))) } #--- 4. Mean for the correlation mean(pred$Cor) ```
## 5. Multi-trait GBLUP using Multitrait() function To implement this analyses we used trait number 1 over four environments. ```{r echo=TRUE, eval=FALSE} #--- Organizing the phenotypic data Y<- Phenotypes[,c(1,2,3)] %>% pivot_wider(., names_from = 'Env', values_from = 'Trait1') #--- Scale the data y = scale(Y[,-1]) dim(y) ``` Deploy the model using Multitrait() function. ```{r, echo = TRUE, eval=FALSE} #--- 1. Parameters for the folds nReps = 5 # change here for more repetitions (10-20 is a good fit) nFolds = 5 # 5 is reasonable, but it depends of or dataset size #--- 2. Output for the estimated values yHatCV=data.frame(yHat1 = rep(NA,dim(y)[1]), yHat2 = rep(NA,dim(y)[1]), yHat3 = rep(NA,dim(y)[1]), yHat3 = rep(NA,dim(y)[1])) pred = data.frame() #--- 3. Model set.seed(0928761) for (i in 1:nFolds){ # Partitions for a 5-FCV folds=sample(1:nFolds,size=dim(y)[1],replace=T) # Model using multitrait for(p in 1:nReps){ tst=which(folds==p) yNA=y yNA[tst,]=NA # Model fm <- Multitrait(y=yNA, ETA = list(list(K=GMat,model='RKHS')), resCov = list(type="DIAG"), nIter = 1200, burnIn = 120, saveAt='MGB_', verbose = FALSE) yHatCV[tst,] = fm$ETAHat[tst,] } # Accuracy for each fold pred=rbind(pred, data.frame(Fold = i, Cor1=diag(cor(yHatCV,y))[1], Cor2=diag(cor(yHatCV,y))[2], Cor3=diag(cor(yHatCV,y))[3], Cor4=diag(cor(yHatCV,y))[4])) } pred #--- 4. Extracting estimates of variance parameters (CORB.RES<-fm$resCov$R) # residual covariance matrix (CORB.u<-fm$ETA[[1]]$Cov$Omega) # genetic covariance matrix UE #--- 5. Trait correlation cov2cor(CORB.u) ```
## 6. Multi-trait GBLUP using Multitrait() function Now, lets use the information of 4 traits in just one environment (number 4). ```{r echo=TRUE, eval=FALSE} #--- 1. Organizing the phenotypic data Y<- Phenotypes[Phenotypes$Env == 4,] #--- 2. Scale the pheno data y = scale(Y[,-c(1:2)]) ``` Deploy the model using Multitrait() function. ```{r, echo=TRUE, eval=FALSE} #--- 1. Parameters for the folds nReps = 5 # change here for more repetitions (10-20 is a good fit) nFolds = 5 # 5 is reasonable, but it depends of or dataset size #--- 2. Output for the estimated values yHatCV=data.frame(yHat1 = rep(NA,dim(y)[1]), yHat2 = rep(NA,dim(y)[1]), yHat3 = rep(NA,dim(y)[1]), yHat3 = rep(NA,dim(y)[1])) pred = data.frame() #--- 3. Model set.seed(0928761) for (i in 1:nFolds){ # Partitions for a 5-FCV folds=sample(1:nFolds,size=dim(y)[1],replace=T) # Model using multitrait for(p in 1:nReps){ tst=which(folds==p) yNA=y yNA[tst]=NA fm <- Multitrait(y=yNA, ETA = list(list(K=GMat,model='RKHS')), resCov = list(type="DIAG"), nIter = 1200, burnIn = 120, saveAt='MGB_', verbose = FALSE) yHatCV[tst,] = fm$ETAHat[tst,] } # Accuracy for each fold pred=rbind(pred, data.frame(Fold = i, Cor1=diag(cor(yHatCV,y))[1], Cor2=diag(cor(yHatCV,y))[2], Cor3=diag(cor(yHatCV,y))[3], Cor4=diag(cor(yHatCV,y))[4])) } pred #--- 4. Extracting estimates of variance parameters (CORB.RES<-fm$resCov$R) # residual covariance matrix (CORB.u<-fm$ETA[[1]]$Cov$Omega) # genetic covariance matrix UE #--- 5. Trait correlation cov2cor(CORB.u) ``` ## References ::: {#refs} :::

Marco Antonio Peixoto