Genomic models accounting for GxE term
Herein, we will discuss some practical aspects and the implementation of genomic selection in a multi-environment trial, using the reaction norms model (Jarquin et al. (2018)) for the implementation. To perform the analyses, we will need the following packages (if you don’t have them installed, please, use the function ‘install.packages(“pack_name”)’ to do so):
# Clear environment safely
rm(list = ls())
# Load required packages quietly
suppressPackageStartupMessages({
library(dplyr)
library(ggplot2)
library(AGHmatrix)
library(BGLR)
library(sommer)
library(ggfortify)
})For the implementation, we can load the file 1.GxE_Dataset.RData. It will comprise four files: myYPheno: file containing the phenotypic information for the target trait (SG trait), in ten environments (EnvID), and a column for repetition from the experimental design. MarkDat: markers coded in 0,1,2,3,4 (tetraploid potato data) for all individuals. bluesTrait1: best linear unbiased estimation for the genotypes in each environment (it was estimated from the phenotypic data in a previously analyses). EnvMarkers: a set of environmental covariates for each one of the ten environments. It came from soil and weather (rain, solar radiation, temperature, etc.) sources.
## Loading objects:
## EnvMarkers
## myYPheno
## MarkDat
## bluesTrait1
We have discussed several factors influencing genomic models accuracy. They are divided in five classes:
Details about markers such as density, method, and quality of the genotyping is important. In this case, the clones included in the analyses had previously been genotyped using the potato Infinium SNP array. Tetraploid genotype calls (coded 0–4) were made using a normal mixture model with R/fitPoly (Zych et al. 2019) and imputed via R/randomForest (Breiman 2001; Liaw and Wiener 2002).
The first factor is related with the training population. The number of individuals in the training population and phenotypes (quality) may influence the model performance. However, we can adjust this factor. In addition, linkage disequilibrium (LD) between markers and QTLs, and the effective population size (N) play an important role.
## Env1 Env10 Env2 Env3 Env4 Env5 Env6 Env7 Env8 Env9
## 144 64 148 136 136 140 60 61 62 64
## [1] 157 14852
To understand more about the population, lets use the marker data to explore the population structure. First, lets compose a relationship matrix.
Relationship matrix
Here, we will compose the additive relationship matrix (G matrix) following the propositions made by VanRaden (2008), as follows:
\[ G = {\frac{ZZ'}{2{\sum{p_i(1-p_i)}}}} \]
where \(p_i\) and \(1-p_i\) represents the allele frequency for both \(A\) and \(a\) at each loci. In this case, we divided by \({2{\sum{p_i(1-p_i)}}}\) to scale G to be analogous to the numerator relationship matrix A ((VanRaden, 2008)).
To compose the additive kernel (aka genomic relationship matrix), we will use the package AGHMatrix (Amadeu et al., 2023). This package uses the SNP information coded as 2,1,0 to create a relationship matrix between the pair of genotypes. Also, we are able to build other kernels, such as dominance relationship matrix. In our case, we should pay attention in three parameters that will be important while we create the kernel:
Minor allele frequency (\(maf\)): This parameter is connected with the frequency of the alleles in the population. As rare alleles tend to decrease the importance of alleles that contributed with the target trait, we can filter out those with small frequency.
Threshold (\(thresh.missing\)): This parameter is connected with the quality of the SNP dataset. It represents a threshold to the amount of missing data that we will allow for both, SNP and genotypes.
Method (\(method\)): In this case, which one should be the method used to build the kernel. For additive kernels, using SNP data, the method indicated is the one from VanRaden (2008), as previously discussed.
Ploidy (\(ploidy\)): We are dealing with the a tetraploidy potato example, and we will use 4 for this option once it aligns with the genotypic calls.
# Additive matrix
GMat = AGHmatrix::Gmatrix(SNPmatrix=MarkDat,
maf=0.05, # Minor allele frequency
thresh.missing = 0.8, # threshold for missing data
method="VanRaden", # Kernel (i.e., additive, dominance, etc.)
missingValue = NA, # Missing data representation
ploidy = 4) # data ploidy ## Initial data:
## Number of Individuals: 157
## Number of Markers: 14852
##
## Missing data check:
## Total SNPs: 14852
## 0 SNPs dropped due to missing data threshold of 0.8
## Total of: 14852 SNPs
##
## MAF check:
## 3009 SNPs dropped with MAF below 0.05
## Total: 11843 SNPs
##
## Heterozigosity data check:
## No SNPs with heterozygosity, missing threshold of = 0
##
## Summary check:
## Initial: 14852 SNPs
## Final: 11843 SNPs ( 3009 SNPs removed)
##
## Completed! Time = 0.663 seconds
Population structure
### Checking for population structure
pca_result <- prcomp(GMat, scale. = TRUE)
autoplot(pca_result, data = GMat,
geom = "point",
shape = 20,
size = 5,
alpha = 1,
color = "#CC9900") +
theme(panel.grid.minor = element_blank(),
panel.grid.major = element_blank(),
axis.text.x = element_text(size = 12, colour = "black"),
axis.text.y.left = element_text(size = 12, colour = "black"),
axis.text.y.right = element_blank(),
axis.title = element_text(size = 18, face = "bold"))## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the ggfortify package.
## Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Another important factor would be the heritability of the trait and the number of QTLs (architecture of the trait). This is a figure reflecting the broad sense heritability per trial for the target trait per environment.
We can pick the best model where the assumptions will meet the data the most. It could be a GBLUP, RRBLUP, BayesA, BayesB, RKHS, etc.
The last factor to play a role is the extent of GxE in the dataset analysed. Large amounts of GxE can create changes in the ranking of the individuals, which complicates the selection process.
Let’s analyze the GxE in the dataset:
## 1. GxE model in sommer to account for sources of variation
ans1 <- sommer::mmes(SG ~ Rep,
random= ~ EnvID + GenID + GenID:EnvID,
rcov= ~ units,
data=myYPheno)## iteration LogLik wall cpu(sec) restrained
## 1 251.812 8:13:35 32 0
## 2 601.321 8:14:7 64 0
## 3 700.393 8:14:39 96 0
## 4 709.479 8:15:12 129 0
## 5 709.564 8:15:44 161 0
## 6 709.565 8:16:18 195 0
## 7 709.565 8:16:51 228 0
# Plot
df <- data.frame(
Category = c("E", "G", "GE", "Res"),
Value = c(round(summary(ans1)$varcomp[1,1]/sum(summary(ans1)$varcomp[,1])*100,1),
round(summary(ans1)$varcomp[2,1]/sum(summary(ans1)$varcomp[,1])*100,1),
round(summary(ans1)$varcomp[3,1]/sum(summary(ans1)$varcomp[,1])*100,1),
round(summary(ans1)$varcomp[4,1]/sum(summary(ans1)$varcomp[,1])*100,1)))
# Create pie chart
ggplot(df, aes(x = "", y = Value, fill = Category)) +
geom_bar(stat = "identity", width = 1, color = "white") + # stacked bar
coord_polar(theta = "y") + # convert to pie
theme_void() + # remove background and axes
geom_text(aes(label = paste0(Value, "%")),
position = position_stack(vjust = 0.5)) + # label inside slices
scale_fill_manual(values = c("E" = "#FF9999",
"G" = "#66B2FF",
"GE" = "#CC9900",
"Res" = "blue")) Looking into the environmental covariates matrix for modelling GxE
The model used accounted for the sources of variation before discussed. From an simple example, we can understand why and when we need a model with the job.
The model implemented can be represented in matrix form as follows:
\[ y = Xb + Zg + Tp + e\] where, y is the vector of BLUEs (two-stage analyses), b is the vector of fixed effects, comprising the environment effect, g is the vector of genotypes, which follows \(g ~ N(0, G \otimes \sigma^2_g)\), where \(\sigma^2_g\) is the genetic variance and G is the additive relationship matrix, p is the vector of genotype-by-environment interaction, which follows \(p ~ N(0, GEI\otimes\sigma^2_{gxe})\), where \(\sigma^2_{gxe}\) is the genotype-by-environment interaction and GEI is the combination via hadamard product of the W and G matrices (explained below), and e is the vector of residuals, which follows \(e ~ N(0, I\sigma^2_{res})\), where \(\sigma^2_{res}\) is the residual variance.X, Z, and T are incidence matrices for b, g, and p, respectively.
The GBLUP model before described will be inplemented in a Bayesian context. For the Bayesian assumptions for each component in BGLR, please see Pérez & Los Campos (2014). In BGLR software, every components has to have the same size of the y vector. For such, we need to create the matrices and give them to the ETA argument together with the model specified. Then, we need to expand the matrices by using the matrix multiplication as in:
# Example G matrix (3x3)
G <- matrix(c(1, 0.3, -0.2,
0.3, 1, 0.5,
-0.2, 0.5, 1),
nrow = 3, ncol = 3, byrow = TRUE)
# Example incidence matrix Z_L (5x3)
Z_L <- matrix(c(1, 0, 0,
0, 1, 0,
0, 0, 1,
1, 0, 0,
0, 1, 0),
nrow = 5, ncol = 3, byrow = TRUE)
# Compute Z_L %*% G %*% t(Z_L)
ZGZt <- Z_L %*% G %*% t(Z_L)
# Show result
ZGZt## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.0 0.3 -0.2 1.0 0.3
## [2,] 0.3 1.0 0.5 0.3 1.0
## [3,] -0.2 0.5 1.0 -0.2 0.5
## [4,] 1.0 0.3 -0.2 1.0 0.3
## [5,] 0.3 1.0 0.5 0.3 1.0
# Example G matrix (3x3)
G <- matrix(c(1, 0.3, -0.2,
0.3, 1, 0.5,
-0.2, 0.5, 1),
nrow = 3, ncol = 3, byrow = TRUE)
# Example W matrix (3x3)
W <- matrix(c(2, 0.1, 0.4,
0.1, 3, 0.7,
0.4, 0.7, 4),
nrow = 3, ncol = 3, byrow = TRUE)
# Hadamard (element-wise) product
GW <- G * W
# Show result
GW## [,1] [,2] [,3]
## [1,] 2.00 0.03 -0.08
## [2,] 0.03 3.00 0.35
## [3,] -0.08 0.35 4.00
## ---------- 3.1 Organizing the models ------------- ##
##>>>---- Creating the matrix for reaction norm Model
myYphi = bluesTrait1
myYphi$GenID = myYphi$GenID %>% as.character()
##>>>---- Incidence matrix for genotypes
IDs <- myYphi$GenID
IDsFinal <- factor(IDs, levels=colnames(GMat))
Z_L<-as.matrix(model.matrix(~IDsFinal-1))
KG=Z_L%*%GMat%*%t(Z_L) # Matrix multiplication
##>>>---- Creating design matrix for Environments
LIDs <- myYphi$EnvID
LIDs <- factor(LIDs, levels=colnames(WMat))
Z_E <- as.matrix(model.matrix(~LIDs-1))
KW=Z_E%*%WMat%*%t(Z_E) # Matrix multiplication
## Creating the matrix KGW by Hadamard product
KGW = KW*KG
## ---------- 3.2. Creating the models ------------- ##
##>>>---- Model 1: y = u + Zg + e
etaM1=list(list(K=KG,model='RKHS'))
##>>>---- Model 2: y = Xb + Zg + e
etaM2=list(list(model='BRR',X=Z_E),
list(K=KG,model='RKHS'))
##>>>---- Model 3: y = Xb + Zg + Tp + e
etaM3=list(list(model='BRR',X=Z_E),
list(K=KG,model='RKHS'),
list(K=KGW, model='RKHS'))
## list
etaList = list(etaM1, etaM2, etaM3)
## ---------- 3.3. Running the models ------------- ##
for(Model in 1:3){ # Loop for the models (from 1 to 3)
for(TargetEnv in ListEnv){ # Loop for the environments (from 1 to 10)
##>>>---- Assigning NAs
myYna = myYphi
myYna$SG[myYna$EnvID == TargetEnv] <- NA
##>>>---- Model
fm=BGLR::BGLR(y = as.matrix(myYna$SG),
ETA = etaList[[Model]],
nIter = 1200,
burnIn = 120,
thin = 10,
saveAt = paste0('RN_',TargetEnv),
verbose = TRUE)
##>>>---- Accuracy for each fold
predM = data.frame(Env = TargetEnv,
Model = Model,
Cor=cor(fm$yHat[fm$whichNa], myYphi$SG[fm$whichNa], use="p"),
RMSE = sqrt(mean((fm$yHat[fm$whichNa] - myYphi$SG[fm$whichNa])^2, na.rm = TRUE)))
write.table(predM, file = "Pred_Outcomes.txt", quote = F, row.names= F, append = TRUE, col.names = FALSE)
file.remove(list.files(pattern = "\\.dat$"))
} # Close loop for models
} # Close loop for environmentsThe results will show the prediction in each environment
dfPred <- read.table("Pred_Outcomes.txt")
colnames(dfPred) <- c("Env", "Model", "Cor", "RMSE")
dfPred$Model = as.factor(dfPred$Model)
# Looking at the averages
round(tapply(dfPred$Cor, dfPred$Model, mean),2)## 1 2 3
## 0.67 0.70 0.72
## 1 2 3
## 0.87 0.83 0.82
ggplot(dfPred, aes(x = Env, y = Cor, color = Model)) +
geom_point(size = 4, position = position_dodge(width = 0.6)) +
theme_minimal(base_size = 14) +
labs(
y = "Accuracy",
x = "Environment",
color = "Model",
title = ""
) +
theme(
legend.position = "right",
axis.text.x = element_text(size = 16, angle = 45, hjust = 1),
axis.text.y = element_text(size = 16),
axis.title.x = element_text(size = 18, face = "bold"),
axis.title.y = element_text(size = 18, face = "bold")
)University of Florida, mresende@ufl.edu↩︎
University of Florida, deamorimpeixotom@ufl.edu↩︎