Molecular Markers Assisted in Plant Breeding - Genomic selection
In this practice, we will show how to implement genomic selection models. We will introduce the rrBLUP, GBLUP, and BayesB genomic models, and a cross-validation scheme.
To perform the analyses, we will need the following packages (if you don’t have them installed, please use the function ‘install.packages(“pack_name”)’ to do so):
## Loading required package: AGHmatrix
## Loading required package: BGLR
## Loading required package: ggplot2
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
The dataset we will use was simulated using the coalescent theory implemented in the AlphaSimR package (Gaynor et al., 2021). It mimics an evaluation of 500 maize genotypes in one location, with one simulated trait. In addition, a set of 3,000 single nucleotide polymorphisms (SNPs) was randomly sampled across 10 pairs of chromosomes.
Phenotypic data
We generated BLUEs (best linear unbiased estimations) for the trait, which can be loaded from Phenotypes.csv.
# Loading the dataset
Phenotypes = read.csv("Phenotypes.csv")
# Naming it
phenodata = Phenotypes
# Plotting the distribution
hist(phenodata$Trait1)
Genomic data
The genomic data (3,000 SNP markers) were coded as 0, 1, and 2 for aa, Aa, and AA, respectively. The dataset for all 500 genotypes can be loaded from Genotypes.csv.
# Loading the SNP data
Genotypes = as.matrix(read.csv("Genotypes.csv"))
# Genotype names
rownames(Genotypes)= phenodata$Genotype
# Genotypes
Genotypes[1:5,1:5]
## X1_2 X1_7 X1_16 X1_19 X1_27
## 26662 2 0 2 0 2
## 30678 0 0 2 0 2
## 30129 2 0 2 0 2
## 28907 2 0 2 0 2
## 30522 2 0 2 0 2
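Because the markers are coded as 0/1/2 counts of the A allele, per-SNP allele frequencies follow directly from column means. A minimal sketch, using a small simulated matrix `M` in place of `Genotypes` (names here are illustrative, not from the dataset):

```r
# Small simulated 0/1/2 dosage matrix standing in for Genotypes
set.seed(123)
M = matrix(sample(0:2, 20 * 10, replace = TRUE), nrow = 20, ncol = 10)

# Frequency of the A allele at each SNP (mean dosage divided by 2)
p = colMeans(M) / 2

# Minor allele frequency is the smaller of p and 1 - p
maf = pmin(p, 1 - p)

# SNPs that would pass a 0.05 MAF filter (as used later for the G matrix)
keep = which(maf >= 0.05)
```

This is the same quantity the `maf` filter below acts on when building the relationship matrix.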
Relationship matrix
Here, we will compose the additive relationship matrix (G matrix) following the proposition made by VanRaden (2008), as follows:
\[ G = {\frac{ZZ'}{2{\sum{p_i(1-p_i)}}}} \]
where \(p_i\) and \(1-p_i\) represent the allele frequencies of \(A\) and \(a\) at each locus. In this case, we divide by \(2\sum{p_i(1-p_i)}\) to scale \(G\) to be analogous to the numerator relationship matrix \(A\) (VanRaden, 2008).
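The formula above can also be computed by hand, which helps to see what the package does internally. A minimal sketch, assuming a small simulated 0/1/2 dosage matrix `M` in place of `Genotypes` (no filtering applied):

```r
# Small simulated 0/1/2 dosage matrix standing in for Genotypes
set.seed(123)
M = matrix(sample(0:2, 20 * 10, replace = TRUE), nrow = 20, ncol = 10)

p = colMeans(M) / 2             # allele frequency p_i at each SNP
Z = sweep(M, 2, 2 * p, "-")     # center each column by 2 * p_i
denom = 2 * sum(p * (1 - p))    # VanRaden scaling factor
G = tcrossprod(Z) / denom       # G = ZZ' / (2 * sum(p_i * (1 - p_i)))
dim(G)                          # 20 x 20 relationship matrix
```

In practice we use AGHmatrix instead, since it also handles missing data, MAF filtering, and quality checks.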
To compose the additive kernel (aka genomic relationship matrix), we will use the package AGHmatrix (Amadeu et al., 2023). This package uses the SNP information coded as 2, 1, 0 to create a relationship matrix between each pair of genotypes. It can also build other kernels, such as the dominance relationship matrix. In our case, we should pay attention to three parameters when we create the kernel:
Minor allele frequency (\(maf\)): This parameter is connected with the frequency of the alleles in the population. Markers whose minor allele is very rare carry little information about the target trait and may add noise to the model, so we can filter out those below a minimum frequency.
Threshold (\(thresh.missing\)): This parameter is connected with the quality of the SNP dataset. It represents the maximum proportion of missing data that we will allow for both SNPs and genotypes.
Method (\(method\)): The method used to build the kernel. For additive kernels using SNP data, the indicated method is the one from VanRaden (2008), as previously discussed.
# Additive matrix
GMat = AGHmatrix::Gmatrix(SNPmatrix=Genotypes,
maf=0.05, # Minor allele frequency
thresh.missing = 0.8, # threshold for missing data
method="VanRaden", # Kernel (i.e., additive, dominance, etc.)
missingValue = NA) # Missing data representation
## Initial data:
## Number of Individuals: 500
## Number of Markers: 3000
##
## Missing data check:
## Total SNPs: 3000
## 0 SNPs dropped due to missing data threshold of 0.8
## Total of: 3000 SNPs
##
## MAF check:
## 545 SNPs dropped with MAF below 0.05
## Total: 2455 SNPs
##
## Heterozigosity data check:
## No SNPs with heterozygosity, missing threshold of = 0
##
## Summary check:
## Initial: 3000 SNPs
## Final: 2455 SNPs ( 545 SNPs removed)
##
## Completed! Time = 0.479 seconds
The first model implemented is a GBLUP in a Bayesian framework. For this, we will use the BGLR package (Pérez & de los Campos, 2014). In this package, we need an element named “ETA”. In the ETA, we set the linear predictors of the model and the priors, such as a relationship matrix (K) or marker data (X), and the specified model, such as BRR, RKHS, BayesA, BayesB, BayesC, etc.
Statistical model
The GBLUP model was given as follows:
\[ y = 1u + Za + e \]
where \(y\) is the vector of BLUEs for the target trait; \(u\) represents the mean (fixed); \(a\) is the vector of genetic effects (assumed to be random), where \(a \sim N(0,G\sigma^2_a)\), \(G\) is a matrix that describes the genomic similarities between pairs of genotypes, and \(\sigma^2_a\) represents the genetic variance; and \(e\) is the vector of residuals (random), where \(e\sim N(0,I\sigma^2_{res})\) and \(\sigma^2_{res}\) is the residual variance. The letter \(Z\) refers to the incidence matrix for \(a\).
###----------- Linear predictor for BGLR
ETA = list(A = list(K = GMat, # K (relationship matrix)
model = 'RKHS')) # Specify the model
###----------- Phenotypic data
Y = as.matrix(phenodata[,3])
###----------- Model for trait 1
fmGBLUP = BGLR(y=Y, # Phenotypic data
ETA=ETA, # Model ETA
nIter=1200, # Number of iterations for the model
burnIn=120, # Burnin iterations
thin=5, # Sampling throughout iterations
verbose = FALSE)
## (console output omitted: scalar model summaries and the vector of 500 fitted values)
# Estimated breeding values (Genotypic effect)
plot(fmGBLUP$ETA[[1]]$u, col=4, cex=.5, type='o', main='GBLUP')
Statistical model
Ridge regression BLUP, or rrBLUP, is a marker-based model given as follows:
\[ y = 1u + Zm + e \]
where \(y\) is the vector of BLUEs for the target trait; \(u\) represents the mean (fixed); \(m\) is the vector of marker effects (assumed to be random), where \(m \sim N(0,I\sigma^2_m)\) and \(\sigma^2_m\) represents the marker genetic variance; and \(e\) is the vector of residuals (random), where \(e \sim N(0,I\sigma^2_{res})\) and \(\sigma^2_{res}\) is the residual variance. The letter \(Z\) refers to the incidence matrix for \(m\).
Now we can deploy the rrBLUP model.
###----------- Linear predictor
ETA = list(A = list(X = Genotypes, # change from K (relationship matrix) to X (Markers) in ETA
model = 'BRR')) # Specify the model
###----------- Phenotypic data
Y = as.matrix(phenodata[,3])
###----------- Model for trait 1
fmrrBLUP = BGLR(y=Y, # Phenotypic data
ETA=ETA, # Model ETA
nIter=1200, # Number of iterations for the model
burnIn=120, # Burnin iterations
thin=5, # Sampling throughout iterations
verbose = FALSE)
## (console output omitted: scalar model summaries)
# Estimated effect values (SNP effect)
plot(fmrrBLUP$ETA[[1]]$b, col=4, cex=.5, type='o', main='rrBLUP')
Equivalence between rrBLUP and GBLUP:
In the model above, we estimated the additive marker effects. From these, we can backsolve to obtain a genomic estimated breeding value (GEBV) for each genotype. The simplest way is the product of the SNP matrix and the vector of additive marker effects:
# Backsolve to GEBV values
GEBV = Genotypes %*% fmrrBLUP$ETA[[1]]$b
# Data frame
GBLUP_rrBLUP = data.frame("Genotype" = phenodata$Genotype,
"GEBV_GBLUP" = fmGBLUP$ETA[[1]]$u,
"GEBV_rrBLUP" = GEBV)
# Estimated effect values (SNP effect)
ggplot(GBLUP_rrBLUP, aes(x = GEBV_GBLUP, y = GEBV_rrBLUP))+
geom_point()
# Filtering for one environment and one trait
phenodat0 = Phenotypes
###----------- Phenotypic data
Y = as.matrix(phenodat0[,3])
###----------- Model for trait 1
fmBB <- BGLR(y=Y, # Phenotypic data
ETA=list(list(X=Genotypes, model='BayesB')), # Model ETA
nIter=1200, # Number of iterations for the model
burnIn=120, # Burnin iterations
thin=5,
verbose = FALSE) # Sampling throughout iterations
###----------- Model for trait 1
fmBRR <- BGLR(y=Y, # Phenotypic data
ETA=list(list(X=Genotypes, model='BRR')), # Model ETA
nIter=1200, # Number of iterations for the model
burnIn=120, # Burnin iterations
thin=5,
verbose = FALSE) # Sampling throughout iterations
# Plotting the effects
BayesB_rrBLUP = data.frame("effect_BayesB" = fmBB$ETA[[1]]$b,
"effect_rrBLUP" = fmBRR$ETA[[1]]$b)
ggplot(BayesB_rrBLUP, aes(x = effect_BayesB, y = effect_rrBLUP))+
geom_point()
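Since BayesB performs variable selection while BRR shrinks all markers uniformly, it can help to compare the two effect vectors numerically as well as visually. A short sketch, assuming the `BayesB_rrBLUP` data frame created above:

```r
# Correlation between BayesB and rrBLUP (BRR) marker effects
cor(BayesB_rrBLUP$effect_BayesB, BayesB_rrBLUP$effect_rrBLUP)

# BayesB pushes many effects close to zero; compare absolute magnitudes
summary(abs(BayesB_rrBLUP$effect_BayesB))
summary(abs(BayesB_rrBLUP$effect_rrBLUP))
```

A larger spread of absolute effects under BayesB, with many near-zero values, is the expected signature of its variable-selection prior.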
Using the information to predict accuracy in a CV1 method
Cross-validation methods are divided into four different classes (sensu Jarquin et al. (2018)), according to whether the genotypes and environments have already been tested. Here, we will use the CV1 scheme, which predicts the performance of genotypes that have not been phenotyped.
Phenotypic information
The first step is to organize the phenotypic information and create the partitions for the cross-validation scheme (CV1). Generally, we divide the phenotypic dataset into two:
Testing set: Population whose phenotypes will be predicted based on the markers and on the information from the training set.
Training set: Population with the information that we use to calibrate the model.
###----------- Parameters for the folds
nFolds = 5
###----------- Output for the estimated values
yHatCV=rep(NA,length(phenodata$Trait1))
pred = data.frame()
###----------- Model
# Phenodata
y = as.matrix(phenodata$Trait1)
# Time for the loop
set.seed(0928761)
folds=sample(1:nFolds,size=length(phenodata$Trait1),replace=T)
for(i in 1:max(folds)){
tst=which(folds==i)
yNA=y
yNA[tst]=NA
# model
fm=BGLR(y=yNA,
ETA=list(list(K=GMat,model='RKHS')),
nIter=1200,
burnIn=120,
verbose = FALSE)
# Predicted values
yHatCV[tst]=fm$yHat[tst]
}
# Accuracy for each fold
pred=data.frame(Model = 'GBLUP',
Cor=cor(yHatCV, y),
rmse = sqrt(mean((y - yHatCV)^2, na.rm = T)))
# Accuracy for CV1
pred
## Model Cor rmse
## 1 GBLUP 0.3906002 7.989514
Now we compare three different kernels under the same cross-validation scheme: the VanRaden G matrix built above, a simple cross-product kernel, and a distance-based kernel.
# G matrix via WtW
GWtW = Genotypes %*% t(Genotypes) / ncol(Genotypes)
# G matrix via distance
Gdist = 1-(scale(dist(Genotypes, method = "euclidean", diag = TRUE, upper = TRUE), scale = FALSE))
###----------- Parameters for the folds
nReps = 5
nFolds = 5
###----------- Output for the estimated values
yHatCV=rep(NA,length(phenodata$Trait1))
yHatCV1=rep(NA,length(phenodata$Trait1))
yHatCV2=rep(NA,length(phenodata$Trait1))
pred = data.frame()
pred1 = data.frame()
pred2 = data.frame()
###----------- Model
# Phenodata
y = as.matrix(phenodata$Trait1)
# Time for the loop
set.seed(0928761)
for(Rep in 1:nReps){
folds=sample(1:nFolds,size=length(phenodata$Trait1),replace=T)
for(i in 1:max(folds)){
tst=which(folds==i)
yNA=y
yNA[tst]=NA
# model 1: VanRaden implementation
fm=BGLR(y=yNA,
ETA=list(list(K=GMat,model='RKHS')),
nIter=1200,
burnIn=120,
verbose = FALSE)
# Predicted values
yHatCV[tst]=fm$yHat[tst]
# model 2: WtW matrix
fm1=BGLR(y=yNA,
ETA=list(list(K=GWtW,model='RKHS')),
nIter=1200,
burnIn=120,
verbose = FALSE)
# Predicted values
yHatCV1[tst]=fm1$yHat[tst]
# model 3: distance matrix
fm2=BGLR(y=yNA,
ETA=list(list(K=Gdist,model='RKHS')),
nIter=1200,
burnIn=120,
verbose = FALSE)
# Predicted values
yHatCV2[tst]=fm2$yHat[tst]
}
# Accuracy for each fold
pred=rbind(pred,
data.frame(Repetitions = Rep,
Cor=cor(yHatCV, y)))
pred1=rbind(pred1,
data.frame(Repetitions = Rep,
Cor=cor(yHatCV1, y)))
pred2=rbind(pred2,
data.frame(Repetitions = Rep,
Cor=cor(yHatCV2, y)))
}
## data frame with for the correlation
corComp = data.frame(Gmat = mean(pred$Cor),
wtw = mean(pred1$Cor),
dist = mean(pred2$Cor))
corComp
## Gmat wtw dist
## 1 0.3791834 0.3799011 0.3555165
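Beyond comparing means, we can inspect the variability of the predictive correlation across repetitions. A sketch using ggplot2, assuming the `pred`, `pred1`, and `pred2` data frames filled in the loop above:

```r
# Stack the per-repetition correlations with a kernel label
corAll = rbind(data.frame(Kernel = "VanRaden", Cor = pred$Cor),
               data.frame(Kernel = "WtW", Cor = pred1$Cor),
               data.frame(Kernel = "Distance", Cor = pred2$Cor))

# Boxplot of predictive correlation by kernel
ggplot(corAll, aes(x = Kernel, y = Cor)) +
  geom_boxplot()
```

With only five repetitions, the boxes are coarse, but they show whether the small differences in mean correlation are consistent across repetitions.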
University of Florida, mresende@ufl.edu
University of Florida, deamorimpeixotom@ufl.edu