Introduction

Herein, we will discuss some practical aspects and the implementation of genomic selection in a multi-environment trial, using the reaction norms model (Jarquin et al. (2018)) for the implementation. To perform the analyses, we will need the following packages (if you don’t have them installed, please, use the function ‘install.packages(“pack_name”)’ to do so):

# Clear environment safely
rm(list = ls())

# Load required packages quietly
suppressPackageStartupMessages({
  library(dplyr)
  library(ggplot2)
  library(AGHmatrix)
  library(BGLR)
  library(sommer)
  library(ggfortify)
})


Dataset

For the implementation, we can load the file 1.GxE_Dataset.RData. It will comprise four files: myYPheno: file containing the phenotypic information for the target trait (SG trait), in ten environments (EnvID), and a column for repetition from the experimental design. MarkDat: markers coded in 0,1,2,3,4 (tetraploid potato data) for all individuals. bluesTrait1: best linear unbiased estimation for the genotypes in each environment (it was estimated from the phenotypic data in a previously analyses). EnvMarkers: a set of environmental covariates for each one of the ten environments. It came from soil and weather (rain, solar radiation, temperature, etc.) sources.

# Reading the data
load("1.GxE_Dataset.RData", verbose = TRUE)
## Loading objects:
##   EnvMarkers
##   myYPheno
##   MarkDat
##   bluesTrait1

Factors influencing genomic selection

We have discussed several factors influencing genomic models accuracy. They are divided in five classes:

🧬 1. Genotyping Factors

Details about markers such as density, method, and quality of the genotyping is important. In this case, the clones included in the analyses had previously been genotyped using the potato Infinium SNP array. Tetraploid genotype calls (coded 0–4) were made using a normal mixture model with R/fitPoly (Zych et al. 2019) and imputed via R/randomForest (Breiman 2001; Liaw and Wiener 2002).

🌱 2. Training Population

The first factor is related with the training population. The number of individuals in the training population and phenotypes (quality) may influence the model performance. However, we can adjust this factor. In addition, linkage disequilibrium (LD) between markers and QTLs, and the effective population size (N) play an important role.

# Number of individuals

tapply(myYPheno$GenID, myYPheno$EnvID, FUN = function(a) length(unique(a)))
##  Env1 Env10  Env2  Env3  Env4  Env5  Env6  Env7  Env8  Env9 
##   144    64   148   136   136   140    60    61    62    64
# Number of markers
dim(MarkDat)
## [1]   157 14852

To understand more about the population, lets use the marker data to explore the population structure. First, lets compose a relationship matrix.

Relationship matrix

Here, we will compose the additive relationship matrix (G matrix) following the propositions made by VanRaden (2008), as follows:

\[ G = {\frac{ZZ'}{2{\sum{p_i(1-p_i)}}}} \]

where \(p_i\) and \(1-p_i\) represents the allele frequency for both \(A\) and \(a\) at each loci. In this case, we divided by \({2{\sum{p_i(1-p_i)}}}\) to scale G to be analogous to the numerator relationship matrix A ((VanRaden, 2008)).

To compose the additive kernel (aka genomic relationship matrix), we will use the package AGHMatrix (Amadeu et al., 2023). This package uses the SNP information coded as 2,1,0 to create a relationship matrix between the pair of genotypes. Also, we are able to build other kernels, such as dominance relationship matrix. In our case, we should pay attention in three parameters that will be important while we create the kernel:

Minor allele frequency (\(maf\)): This parameter is connected with the frequency of the alleles in the population. As rare alleles tend to decrease the importance of alleles that contributed with the target trait, we can filter out those with small frequency.

Threshold (\(thresh.missing\)): This parameter is connected with the quality of the SNP dataset. It represents a threshold to the amount of missing data that we will allow for both, SNP and genotypes.

Method (\(method\)): In this case, which one should be the method used to build the kernel. For additive kernels, using SNP data, the method indicated is the one from VanRaden (2008), as previously discussed.

Ploidy (\(ploidy\)): We are dealing with the a tetraploidy potato example, and we will use 4 for this option once it aligns with the genotypic calls.

# Additive matrix
GMat = AGHmatrix::Gmatrix(SNPmatrix=MarkDat, 
                          maf=0.05,                # Minor allele frequency
                          thresh.missing = 0.8,    # threshold for missing data
                          method="VanRaden",       # Kernel (i.e., additive, dominance, etc.)
                          missingValue = NA,       # Missing data representation
                          ploidy = 4)              # data ploidy  
## Initial data: 
##  Number of Individuals: 157 
##  Number of Markers: 14852 
## 
## Missing data check: 
##  Total SNPs: 14852 
##   0 SNPs dropped due to missing data threshold of 0.8 
##  Total of: 14852  SNPs 
## 
## MAF check: 
##   3009 SNPs dropped with MAF below 0.05 
##  Total: 11843  SNPs 
## 
## Heterozigosity data check: 
##  No SNPs with heterozygosity, missing threshold of =  0 
## 
## Summary check: 
##  Initial:  14852 SNPs 
##  Final:  11843  SNPs ( 3009  SNPs removed) 
##  
## Completed! Time = 0.663  seconds


Population structure

### Checking for population structure

pca_result <- prcomp(GMat, scale. = TRUE)

autoplot(pca_result, data = GMat,
         geom = "point",
         shape = 20, 
         size = 5,
         alpha = 1,
         color = "#CC9900") +  
  theme(panel.grid.minor = element_blank(),
        panel.grid.major = element_blank(),
        axis.text.x = element_text(size = 12, colour = "black"),
        axis.text.y.left = element_text(size = 12, colour = "black"),
        axis.text.y.right = element_blank(),
        axis.title = element_text(size = 18, face = "bold"))
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the ggfortify package.
##   Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

🌾 3. Genetic Architecture

Another important factor would be the heritability of the trait and the number of QTLs (architecture of the trait). This is a figure reflecting the broad sense heritability per trial for the target trait per environment.

💻 4. Statistical Models

We can pick the best model where the assumptions will meet the data the most. It could be a GBLUP, RRBLUP, BayesA, BayesB, RKHS, etc.

🌦 5. Genotype × Environment Interaction

The last factor to play a role is the extent of GxE in the dataset analysed. Large amounts of GxE can create changes in the ranking of the individuals, which complicates the selection process.

Let’s analyze the GxE in the dataset:

## 1. GxE model in sommer to account for sources of variation
ans1 <- sommer::mmes(SG ~ Rep,
                     random= ~ EnvID + GenID + GenID:EnvID,
                     rcov= ~ units,
                     data=myYPheno)
## iteration    LogLik     wall    cpu(sec)   restrained
##     1      251.812   8:13:35      32           0
##     2      601.321   8:14:7      64           0
##     3      700.393   8:14:39      96           0
##     4      709.479   8:15:12      129           0
##     5      709.564   8:15:44      161           0
##     6      709.565   8:16:18      195           0
##     7      709.565   8:16:51      228           0
# Plot
df <- data.frame(
  Category = c("E", "G", "GE", "Res"),
  Value    = c(round(summary(ans1)$varcomp[1,1]/sum(summary(ans1)$varcomp[,1])*100,1),
               round(summary(ans1)$varcomp[2,1]/sum(summary(ans1)$varcomp[,1])*100,1),
               round(summary(ans1)$varcomp[3,1]/sum(summary(ans1)$varcomp[,1])*100,1),
               round(summary(ans1)$varcomp[4,1]/sum(summary(ans1)$varcomp[,1])*100,1)))


# Create pie chart
ggplot(df, aes(x = "", y = Value, fill = Category)) +
  geom_bar(stat = "identity", width = 1, color = "white") +  # stacked bar
  coord_polar(theta = "y") +  # convert to pie
  theme_void() +              # remove background and axes
  geom_text(aes(label = paste0(Value, "%")),
            position = position_stack(vjust = 0.5)) +  # label inside slices
  scale_fill_manual(values = c("E" = "#FF9999",
                               "G" = "#66B2FF",
                               "GE" = "#CC9900",
                               "Res" = "blue")) 

Looking into the environmental covariates matrix for modelling GxE

W <- as.matrix(EnvMarkers)
WMat = ((W) %*% t((W)))/ncol(W) # Is W scaled?
heatmap(WMat)

nEnv = 10
ListEnv = unique(bluesTrait1$EnvID)

Implementing the genomic models

The model used accounted for the sources of variation before discussed. From an simple example, we can understand why and when we need a model with the job.

The model implementation

The model implemented can be represented in matrix form as follows:

\[ y = Xb + Zg + Tp + e\] where, y is the vector of BLUEs (two-stage analyses), b is the vector of fixed effects, comprising the environment effect, g is the vector of genotypes, which follows \(g ~ N(0, G \otimes \sigma^2_g)\), where \(\sigma^2_g\) is the genetic variance and G is the additive relationship matrix, p is the vector of genotype-by-environment interaction, which follows \(p ~ N(0, GEI\otimes\sigma^2_{gxe})\), where \(\sigma^2_{gxe}\) is the genotype-by-environment interaction and GEI is the combination via hadamard product of the W and G matrices (explained below), and e is the vector of residuals, which follows \(e ~ N(0, I\sigma^2_{res})\), where \(\sigma^2_{res}\) is the residual variance.X, Z, and T are incidence matrices for b, g, and p, respectively.

The GBLUP model before described will be inplemented in a Bayesian context. For the Bayesian assumptions for each component in BGLR, please see Pérez & Los Campos (2014). In BGLR software, every components has to have the same size of the y vector. For such, we need to create the matrices and give them to the ETA argument together with the model specified. Then, we need to expand the matrices by using the matrix multiplication as in:

Matrix multiplication

# Example G matrix (3x3)
G <- matrix(c(1, 0.3, -0.2,
              0.3, 1, 0.5,
              -0.2, 0.5, 1), 
            nrow = 3, ncol = 3, byrow = TRUE)

# Example incidence matrix Z_L (5x3)
Z_L <- matrix(c(1, 0, 0,
                0, 1, 0,
                0, 0, 1,
                1, 0, 0,
                0, 1, 0),
              nrow = 5, ncol = 3, byrow = TRUE)

# Compute Z_L %*% G %*% t(Z_L)
ZGZt <- Z_L %*% G %*% t(Z_L)

# Show result
ZGZt
##      [,1] [,2] [,3] [,4] [,5]
## [1,]  1.0  0.3 -0.2  1.0  0.3
## [2,]  0.3  1.0  0.5  0.3  1.0
## [3,] -0.2  0.5  1.0 -0.2  0.5
## [4,]  1.0  0.3 -0.2  1.0  0.3
## [5,]  0.3  1.0  0.5  0.3  1.0

Hadamard multiplication

# Example G matrix (3x3)
G <- matrix(c(1, 0.3, -0.2,
              0.3, 1, 0.5,
              -0.2, 0.5, 1), 
            nrow = 3, ncol = 3, byrow = TRUE)

# Example W matrix (3x3)
W <- matrix(c(2, 0.1, 0.4,
              0.1, 3, 0.7,
              0.4, 0.7, 4), 
            nrow = 3, ncol = 3, byrow = TRUE)

# Hadamard (element-wise) product
GW <- G * W

# Show result
GW
##       [,1] [,2]  [,3]
## [1,]  2.00 0.03 -0.08
## [2,]  0.03 3.00  0.35
## [3,] -0.08 0.35  4.00

Implementing the models

## ---------- 3.1 Organizing the models ------------- ##

##>>>---- Creating the matrix for reaction norm Model
myYphi = bluesTrait1
myYphi$GenID = myYphi$GenID %>% as.character()

##>>>---- Incidence matrix for genotypes
IDs <- myYphi$GenID
IDsFinal <- factor(IDs, levels=colnames(GMat))
Z_L<-as.matrix(model.matrix(~IDsFinal-1))
KG=Z_L%*%GMat%*%t(Z_L) # Matrix multiplication


##>>>---- Creating design matrix for Environments
LIDs <- myYphi$EnvID
LIDs <- factor(LIDs, levels=colnames(WMat))
Z_E <- as.matrix(model.matrix(~LIDs-1))
KW=Z_E%*%WMat%*%t(Z_E) # Matrix multiplication

## Creating the matrix KGW by Hadamard product
KGW = KW*KG

## ---------- 3.2. Creating the models ------------- ##

##>>>---- Model 1: y = u + Zg + e
etaM1=list(list(K=KG,model='RKHS'))

##>>>---- Model 2: y = Xb + Zg + e
etaM2=list(list(model='BRR',X=Z_E),
           list(K=KG,model='RKHS'))

##>>>---- Model 3: y = Xb + Zg + Tp + e
etaM3=list(list(model='BRR',X=Z_E),
           list(K=KG,model='RKHS'),
           list(K=KGW, model='RKHS'))

## list
etaList = list(etaM1, etaM2, etaM3)

## ---------- 3.3. Running the models ------------- ##
for(Model in 1:3){ # Loop for the models (from 1 to 3)
 
for(TargetEnv in ListEnv){ # Loop for the environments (from 1 to 10)

##>>>---- Assigning NAs
myYna = myYphi
myYna$SG[myYna$EnvID == TargetEnv] <- NA

##>>>----  Model
fm=BGLR::BGLR(y = as.matrix(myYna$SG),
              ETA = etaList[[Model]],
              nIter = 1200,
              burnIn = 120,
              thin = 10,
              saveAt = paste0('RN_',TargetEnv),
              verbose = TRUE)


##>>>----  Accuracy for each fold
predM =  data.frame(Env = TargetEnv,
                    Model = Model,
                    Cor=cor(fm$yHat[fm$whichNa], myYphi$SG[fm$whichNa], use="p"),
                    RMSE = sqrt(mean((fm$yHat[fm$whichNa] - myYphi$SG[fm$whichNa])^2, na.rm = TRUE)))

write.table(predM, file = "Pred_Outcomes.txt", quote = F, row.names= F, append = TRUE, col.names = FALSE)

file.remove(list.files(pattern = "\\.dat$"))

  } # Close loop for models

} # Close loop for environments

Plotting the results

The results will show the prediction in each environment

dfPred <- read.table("Pred_Outcomes.txt")
colnames(dfPred) <- c("Env", "Model", "Cor", "RMSE")
dfPred$Model = as.factor(dfPred$Model)

# Looking at the averages
round(tapply(dfPred$Cor, dfPred$Model, mean),2)
##    1    2    3 
## 0.67 0.70 0.72
round(tapply(dfPred$RMSE, dfPred$Model, mean),2)
##    1    2    3 
## 0.87 0.83 0.82
ggplot(dfPred, aes(x = Env, y = Cor, color = Model)) +
  geom_point(size = 4, position = position_dodge(width = 0.6)) +
  theme_minimal(base_size = 14) +
  labs(
    y = "Accuracy",
    x = "Environment",
    color = "Model",
    title = ""
  ) +
  theme(
    legend.position = "right",
    axis.text.x = element_text(size = 16, angle = 45, hjust = 1),
    axis.text.y = element_text(size = 16),
    axis.title.x = element_text(size = 18, face = "bold"),
    axis.title.y = element_text(size = 18, face = "bold")
  )


References

Amadeu, R. R., Garcia, A. A. F., Munoz, P. R., & Ferrão, L. F. V. (2023). AGHmatrix: Genetic relationship matrices in r. Bioinformatics, 39(7), btad445.
Jarquin, D., Howard, R., Xavier, A., & Das Choudhury, S. (2018). Increasing predictive ability by modeling interactions between environments, genotype and canopy coverage image data for soybeans. Agronomy, 8(4), 51.
Pérez, P., & Los Campos, G. de. (2014). Genome-wide regression and prediction with the BGLR statistical package. Genetics, 198(2), 483–495.
VanRaden, P. M. (2008). Efficient methods to compute genomic predictions. Journal of Dairy Science, 91(11), 4414–4423.

  1. University of Florida, ↩︎

  2. University of Florida, ↩︎