1. Introduction

The genetic evaluations discussed so far combined phenotypic data and pedigree information (resemblance between relatives) to predict individuals’ performance. We use pedigree information to built the numerator relationship matrix based on probabilities that alleles are identical by descent (IBD). In addition, it is possible to obtain the information of relationship between individuals based on molecular markers. The idea is to estimate the proportion of DNA that are similar between a pair of individuals based on matching of markers alleles, being known as identical by state (IBS).

We can incorporate the markers information into the linear mixed model and predict the estimated breeding values (\(EBVs\)) of individual with phenotypic record and also individuals without phenotypic record, once they have been genotyped. Here, we will use the information of single nucleotide polimorphism (\(SNPs\)) markers into the genomic best linear unbiased prediction (\(GBLUP\)) to predict genomic values of individuals.

2. Dataset

Packages

rm(list=ls())
require(asreml)
## Online License checked out Tue Dec  6 08:58:06 2022
require(dplyr)
require(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.2
require(superheat)

Simulated dataset parameters

This dataset was simulated using the coalescent theory implemented in \(AlphasimR\) package (Gaynor, Gorjanc, and Hickey 2021). This dataset mimic an evaluation of 500 maize genotypes, in one location, and the trait measured was Yield. Also, a set of 6759 single nucleotide polymorphism (SNPs) were randomly sampled through the 12 pair of chromosomes.

Factor Levels
Location One
Genotypes 500
Block 3
Traits Yield
Heritability 0.4

Phenotypic dataset

# Reading the dataset
phenodata = read.table("GBLUP_phenodata.txt", h=TRUE,na.strings='NA')

# Plotting the trait distribution
ggplot(phenodata, aes(x = Yield)) +  
  geom_histogram(aes(y = after_stat(density)), bins = 30, 
                 colour = "black", fill = "blue") +
  geom_density(alpha = .7, linewidth = 1.5, colour = "red") +
  labs(x = NULL, title = "Yield",y="Density")

# Setting as factors
phenodata$Genotype = phenodata$Genotype %>% as.factor()
phenodata$Rep = phenodata$Rep %>% as.factor()

Dosage matrix dataset

The dosage matrix (SNP matrix) represents the counting of the number of allele copies at each position/loci in the individuals’ DNA. Hence,

\[ AA = 2 \] \[ Aa = 1 \] \[ aa = 0 \]

# Reading the SNP matrix
genodata = read.table("GBLUP_genodata.txt", h=TRUE)

# Transforming it into matrix format for latter
SNP_data = as.matrix(genodata)
dim(SNP_data)
## [1]  500 6759
SNP_data[1:10,1:10]
##     SNP_1 SNP_2 SNP_3 SNP_4 SNP_5 SNP_6 SNP_7 SNP_8 SNP_9 SNP_10
## G1      2     0     0     0    NA     0     0     2    NA      1
## G2      2     0     0     0     2     0     0     2     2      1
## G3      1    NA     0     0     1    NA     0     1     2      1
## G4      1     1     0     0     2     0     0     1     2     NA
## G5      1     1     1     0     1     0     0     1     2      1
## G6      1     1     0     0     1     1     0     1     2      1
## G7      1     1     1     0     1     0     0     1     2      1
## G8     NA    NA     1     0    NA     0     0     1    NA     NA
## G9      1     1     0     0     1     1    NA     1     2     NA
## G10     0     2     0     0     0     1     0     0    NA      2

4. Model implementation

Statistical model

We implement a one step GBLUP. The model was given as follows:

\[ y = Xb + Zg + e \]

where \(y\) is the vector of phenotypic data; \(b\) is the vector of repetitions (assumed to be fixed); \(g\) is the vector of genotype effects (assumed to be random), where \(g~N(0,G\sigma^2_g)\), \(G\) is a matrix that describes the genomic similarities between pair of genotypes and \(\sigma^2_g\) represents the genetic variance; and \(e\) is the vector of residuals (random), where \(e~N(0,\sigma^2_{res})\), \(\sigma^2_{res}\) is the residual variance . The letters \(X\) and \(Z\) refers to the incidence matrices for \(b\) and \(g\), respectively.

Fitting the model

Creating the kernel for GBLUP

The kernel is variance-covariance structure that represents the relationship between observations (in our case individuals) based on some measured marker or trait (in our case the SNPs).

# Toy example
mat = matrix(c(0,1,2,0,1,2,0,1,2,0,1,2,0,1,2,0,2,1,0,0,1,2,0,1), ncol=6)

mat1 = scale(mat)

# Creating the kernel
mat1%*%t(mat1)
##            [,1]      [,2]      [,3]       [,4]
## [1,]  3.5454545 -1.090909 -3.272727  0.8181818
## [2,] -1.0909091  4.909091 -2.181818 -1.6363636
## [3,] -3.2727273 -2.181818  7.090909 -1.6363636
## [4,]  0.8181818 -1.636364 -1.636364  2.4545455
mm=tcrossprod(mat1)

Here, we will built the additive relationship matrix following the propositions made by VanRaden (2008), as follows:

\[ G = {\frac{ZZ'}{2{\sum{p_i(1-p_i)}}}} \]

where \(p_i\) and \(1-p_i\) represents the allele frequency for both \(A\) and \(a\) at each loci. In this case, we divided by \({2{\sum{p_i(1-p_i)}}}\) to scale G to be analogous to the relationship matrix A ((Isik, Holland, and Maltecca 2017; VanRaden 2008)).

To build the additive kernel (aka genomic relationship matrix), we will use the package AGHMatrix (Amadeu et al. 2016). This package uses the SNP information coded as 2,1,0 to create a relationship matrix between the pair of genotypes. Also, we are able to built other kernels, such as dominance relationship matrix. In our case, we should pay attention in three parameters that will be important while we create the kernel:

Minor allele frequency (\(maf\)): This parameter is connected with the frequency of the alleles in the population. As rare alleles tend to decrease the importance of alleles that contributed with the target trait, we can filter and drop those with small frequency.

Threshold: This parameter is connected with the quality of the SNP dataset. It represents a threshold to the amount of missing data that we will allow for both, SNP and genotypes.

Method: In this case, which one should be the method used to built the kernel. For additive kernels, using the SNP data, the method indicated is the one from VanRaden (2008), as previously discussed.

require(AGHmatrix) 
## Carregando pacotes exigidos: AGHmatrix
# Built the G matrix based on Van Raden method.
G.Mat = Gmatrix(SNPmatrix=SNP_data, 
                maf=0.05, 
                thresh = 0.8, 
                method="VanRaden",
                missingValue = NA)
## Initial data: 
##  Number of Individuals: 456 
##  Number of Markers: 6759 
## 
## Missing data check: 
##  Total SNPs: 6759 
##   0 SNPs dropped due to missing data threshold of 0.8 
##  Total of: 6759  SNPs 
## MAF check: 
##   1043 SNPs dropped with MAF below 0.05 
##  Total: 5716  SNPs 
## Monomorphic check: 
##  No monomorphic SNPs 
## Summary check: 
##  Initial:  6759 SNPs 
##  Final:  5716  SNPs ( 1043  SNPs removed) 
##  
## Completed! Time = 3.84  seconds
# Check the G matrix
G.Mat[1:10,1:10]
##            G1        G2        G3        G4        G5         G6        G7
## G1  1.3034627 0.2862966 0.3369708 0.4856171 0.4576207 0.19614447 0.2623478
## G2  0.2862966 1.3315674 0.3809725 0.5775074 0.3423925 0.33667944 0.4248994
## G3  0.3369708 0.3809725 1.2743977 0.4855393 0.1729801 0.40527340 0.3505137
## G4  0.4856171 0.5775074 0.4855393 1.3700459 0.2929668 0.27572141 0.4963317
## G5  0.4576207 0.3423925 0.1729801 0.2929668 1.2419685 0.25970945 0.2827460
## G6  0.1961445 0.3366794 0.4052734 0.2757214 0.2597095 1.15886458 0.2875433
## G7  0.2623478 0.4248994 0.3505137 0.4963317 0.2827460 0.28754327 1.1840038
## G8  0.3495422 0.3252146 0.3196278 0.3591825 0.3874575 0.32915606 0.3354296
## G9  0.1717145 0.4330856 0.3846506 0.3456988 0.2600038 0.38117939 0.3773658
## G10 0.1326128 0.2064758 0.2198335 0.2048313 0.1208767 0.07926512 0.1781531
##            G8         G9        G10
## G1  0.3495422 0.17171452 0.13261281
## G2  0.3252146 0.43308562 0.20647583
## G3  0.3196278 0.38465057 0.21983352
## G4  0.3591825 0.34569876 0.20483127
## G5  0.3874575 0.26000383 0.12087669
## G6  0.3291561 0.38117939 0.07926512
## G7  0.3354296 0.37736579 0.17815311
## G8  1.1858051 0.25891975 0.14649824
## G9  0.2589197 1.12299564 0.09962113
## G10 0.1464982 0.09962113 1.34586500
superheat::superheat(G.Mat)

# Adding a value to the diagonal of the matrix if you get an error of singularities
#diag(G.Mat) = diag(G.Mat) + 0.001

# Requisition of ASREML
attr(G.Mat,"rowNames")<-as.character(colnames(G.Mat))
attr(G.Mat,"colNames")<-as.character(colnames(G.Mat))

str(G.Mat)
##  num [1:500, 1:500] 1.303 0.286 0.337 0.486 0.458 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:500] "G1" "G2" "G3" "G4" ...
##   ..$ : chr [1:500] "G1" "G2" "G3" "G4" ...
##  - attr(*, "rowNames")= chr [1:500] "G1" "G2" "G3" "G4" ...
##  - attr(*, "colNames")= chr [1:500] "G1" "G2" "G3" "G4" ...

Fitting the model using the G.Mat matrix

# Model 1: GBLUP
  mod1= asreml(fixed = Yield ~ Rep, 
             random = ~ vm(Genotype, G.Mat), 
             data = phenodata)
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Dec  6 08:58:15 2022
##           LogLik        Sigma2     DF     wall    cpu
##  1     -4664.733       169.996   1497 08:58:16    0.2
##  2     -4659.928       173.576   1497 08:58:16    0.2
##  3     -4659.158       175.672   1497 08:58:16    0.2
##  4     -4659.157       175.825   1497 08:58:16    0.2
##  5     -4659.157       175.760   1497 08:58:17    0.2
summary(mod1)
## $call
## asreml(fixed = Yield ~ Rep, random = ~vm(Genotype, G.Mat), data = phenodata)
## 
## $loglik
## [1] -4659.157
## 
## $nedf
## [1] 1497
## 
## $sigma
## [1] 13.25744
## 
## $varcomp
##                      component std.error   z.ratio bound %ch
## vm(Genotype, G.Mat)   7.819558  2.498582  3.129599     P 0.4
## units!R             175.759702  6.771460 25.955952     P 0.0
## 
## $bic
## [1] 9332.936
## attr(,"parameters")
## [1] 2
## 
## $aic
## [1] 9322.313
## attr(,"parameters")
## [1] 2
## 
## attr(,"class")
## [1] "summary.asreml"

Using the information to predict accuracy in a CV1 method

The cross-validations methods are divided into four different classes (sensu Jarquin et al. (2018)), regarding the genotypes and environments. As follows:

Figure 1 - Cross-validations schemes.

Phenotypic information

In the first part we will organize the phenotypic information and create the partitions to the cross validation scheme. Generally, we can divide the phenotypic dataset into two:

Testing set: Population where the phenotype will be predicted based on the markers and in the information from the training set.

Training set: Population with the information that we use to calibrate the model.

# Create an ID to track genotypes in the TST and TRN data sets
phenodata$unique_ID = paste(phenodata$Genotype, phenodata$Rep, sep='_')

head(phenodata)
##   Location Genotype Rep    Yield unique_ID
## 1  Florida       G1   1 147.3268      G1_1
## 2  Florida       G2   1 140.5955      G2_1
## 3  Florida       G3   1 136.1656      G3_1
## 4  Florida       G4   1 149.5841      G4_1
## 5  Florida       G5   1 130.7658      G5_1
## 6  Florida       G6   1 150.9746      G6_1
# Creating the folds for cross validation
nInd=500                                # Number of individuals
k=10                                    # Number of folds

set.seed(1234)                          # Guarantee the same individuals in each fold
folds = sample(1:k, size=nInd,replace=T)  # Creating the folds 

# Object to store the values across loop
corr = vector()  
Implementation through a loop

Second, the implementation of the prediction into a loop structure, as follows:

for(i in 1:max(folds)){              # loop across the folds

  # Store the information of individuals that we will predict
  pred.gen=which(folds==i)               
  pred.gen=paste0("G",pred.gen)
  
 test.set = phenodata %>%
    filter(Genotype %in% pred.gen) %>%
    droplevels()                    

  # Assuming NA in the data for the individuals that we will predict
  train.set = phenodata
  train.set$yNA = replace(train.set$Yield, (train.set$Genotype %in% pred.gen), NA)
  
   
  # GBLUP model 
  mod2 = asreml(fixed = yNA ~ Rep, 
             random = ~ vm(Genotype,G.Mat), 
             na.action=na.method(y="include"),
             data = train.set)
  
  # Assessing the predictions by 'predict' function
  yHat =predict(mod2,classify="Genotype+Rep",sed=T)$pvals

  # Creating the id for later combining
  yHat.set = yHat %>%
       mutate(unique_ID = paste(Genotype,Rep,sep="_")) %>%
       droplevels()

  # Combining the datasets
  comb_data = merge(yHat.set, test.set, by="unique_ID")

  # Prediction accuracy (correlation)
  corr[i] = cor(comb_data$predicted.value,comb_data$Yield, use="complete")
}
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Dec  6 08:58:17 2022
##           LogLik        Sigma2     DF     wall    cpu
##  1     -4246.950       172.742   1359 08:58:17    0.1
##  2     -4242.763       176.359   1359 08:58:17    0.1
##  3     -4242.039       178.580   1359 08:58:17    0.1
##  4     -4242.037       178.755   1359 08:58:17    0.1
##  5     -4242.037       178.680   1359 08:58:17    0.1
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Dec  6 08:58:18 2022
##           LogLik        Sigma2     DF     wall    cpu
##  1     -4242.037       178.711   1359 08:58:18    0.2
##  2     -4242.037       178.707   1359 08:58:18    0.1
##  3     -4242.037       178.700   1359 08:58:19    1.7
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Dec  6 08:58:20 2022
##           LogLik        Sigma2     DF     wall    cpu
##  1     -4303.417       170.506   1380 08:58:20    0.1
##  2     -4299.308       173.950   1380 08:58:20    0.1
##  3     -4298.608       176.060   1380 08:58:20    0.1
##  4     -4298.607       176.229   1380 08:58:20    0.1
##  5     -4298.607       176.155   1380 08:58:20    0.1
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Dec  6 08:58:20 2022
##           LogLik        Sigma2     DF     wall    cpu
##  1     -4298.607       176.186   1380 08:58:21    0.1
##  2     -4298.607       176.182   1380 08:58:21    0.1
##  3     -4298.607       176.175   1380 08:58:22    1.6
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Dec  6 08:58:22 2022
##           LogLik        Sigma2     DF     wall    cpu
##  1     -4152.031       167.587   1335 08:58:23    0.1
##  2     -4147.909       171.112   1335 08:58:23    0.1
##  3     -4147.210       173.251   1335 08:58:23    0.1
##  4     -4147.209       173.411   1335 08:58:23    0.1
##  5     -4147.208       173.342   1335 08:58:23    0.1
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Dec  6 08:58:23 2022
##           LogLik        Sigma2     DF     wall    cpu
##  1     -4147.208       173.371   1335 08:58:23    0.2
##  2     -4147.208       173.367   1335 08:58:23    0.1
##  3     -4147.208       173.360   1335 08:58:25    1.6
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Dec  6 08:58:25 2022
##           LogLik        Sigma2     DF     wall    cpu
##  1     -4186.479       169.227   1344 08:58:25    0.1
##  2     -4182.547       172.662   1344 08:58:26    0.1
##  3     -4181.878       174.754   1344 08:58:26    0.1
##  4     -4181.877       174.910   1344 08:58:26    0.1
##  5     -4181.876       174.842   1344 08:58:26    0.1
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Dec  6 08:58:26 2022
##           LogLik        Sigma2     DF     wall    cpu
##  1     -4181.876       174.871   1344 08:58:26    0.2
##  2     -4181.876       174.867   1344 08:58:26    0.1
##  3     -4181.876       174.860   1344 08:58:28    1.8
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Dec  6 08:58:28 2022
##           LogLik        Sigma2     DF     wall    cpu
##  1     -4200.508       170.430   1347 08:58:28    0.2
##  2     -4196.877       173.762   1347 08:58:29    0.1
##  3     -4196.172       175.946   1347 08:58:29    0.1
##  4     -4196.170       176.162   1347 08:58:29    0.1
##  5     -4196.169       176.072   1347 08:58:29    0.1
##  6     -4196.169       176.108   1347 08:58:29    0.1
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Dec  6 08:58:29 2022
##           LogLik        Sigma2     DF     wall    cpu
##  1     -4196.169       176.093   1347 08:58:29    0.2
##  2     -4196.169       176.095   1347 08:58:29    0.1
##  3     -4196.169       176.098   1347 08:58:31    1.6
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Dec  6 08:58:31 2022
##           LogLik        Sigma2     DF     wall    cpu
##  1     -4010.927       169.615   1287 08:58:31    0.1
##  2     -4006.418       173.472   1287 08:58:31    0.1
##  3     -4005.779       175.574   1287 08:58:32    0.1
##  4     -4005.778       175.657   1287 08:58:32    0.1
##  5     -4005.778       175.619   1287 08:58:32    0.1
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Dec  6 08:58:32 2022
##           LogLik        Sigma2     DF     wall    cpu
##  1     -4005.778       175.636   1287 08:58:32    0.2
##  2     -4005.778       175.634   1287 08:58:32    0.1
##  3     -4005.778       175.630   1287 08:58:34    1.7
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Dec  6 08:58:34 2022
##           LogLik        Sigma2     DF     wall    cpu
##  1     -4174.676       166.288   1344 08:58:34    0.1
##  2     -4169.166       170.412   1344 08:58:34    0.1
##  3     -4168.429       172.572   1344 08:58:34    0.1
##  4     -4168.428       172.669   1344 08:58:35    0.1
##  5     -4168.428       172.627   1344 08:58:35    0.1
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Dec  6 08:58:35 2022
##           LogLik        Sigma2     DF     wall    cpu
##  1     -4168.428       172.645   1344 08:58:35    0.2
##  2     -4168.428       172.642   1344 08:58:35    0.1
##  3     -4168.428       172.638   1344 08:58:37    1.8
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Dec  6 08:58:37 2022
##           LogLik        Sigma2     DF     wall    cpu
##  1     -4241.406       171.337   1359 08:58:37    0.1
##  2     -4236.251       175.364   1359 08:58:37    0.1
##  3     -4235.552       177.501   1359 08:58:38    0.1
##  4     -4235.551       177.591   1359 08:58:38    0.1
##  5     -4235.551       177.551   1359 08:58:38    0.1
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Dec  6 08:58:38 2022
##           LogLik        Sigma2     DF     wall    cpu
##  1     -4235.551       177.568   1359 08:58:38    0.2
##  2     -4235.551       177.566   1359 08:58:38    0.1
##  3     -4235.551       177.562   1359 08:58:40    1.9
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Dec  6 08:58:40 2022
##           LogLik        Sigma2     DF     wall    cpu
##  1     -4265.812       172.774   1365 08:58:40    0.1
##  2     -4261.310       176.504   1365 08:58:41    0.1
##  3     -4260.598       178.682   1365 08:58:41    0.1
##  4     -4260.597       178.828   1365 08:58:41    0.1
##  5     -4260.596       178.765   1365 08:58:41    0.1
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Dec  6 08:58:41 2022
##           LogLik        Sigma2     DF     wall    cpu
##  1     -4260.596       178.791   1365 08:58:41    0.2
##  2     -4260.596       178.788   1365 08:58:41    0.1
##  3     -4260.596       178.781   1365 08:58:43    1.6
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Dec  6 08:58:43 2022
##           LogLik        Sigma2     DF     wall    cpu
##  1     -4211.011       170.732   1350 08:58:43    0.1
##  2     -4206.258       174.569   1350 08:58:44    0.1
##  3     -4205.593       176.648   1350 08:58:44    0.1
##  4     -4205.592       176.748   1350 08:58:44    0.1
##  5     -4205.592       176.703   1350 08:58:44    0.1
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Dec  6 08:58:44 2022
##           LogLik        Sigma2     DF     wall    cpu
##  1     -4205.592       176.723   1350 08:58:44    0.1
##  2     -4205.592       176.720   1350 08:58:44    0.1
##  3     -4205.592       176.715   1350 08:58:46    1.6
mean(corr)
## [1] 0.2423157

5. Final considerations

  1. The implementation of Genomic BLUP (GBLUP) resembles the analyses on genetic assessment conduced so far, with the replacement of pedigree-based matrix to an SNP-based matrix.
  2. From a set of SNPs we can create a relationship matrix between individuals.
  3. The G matrix can be used efficiently in the context of linear mixed models to predict individuals’ performance.

6. References

Amadeu, Rodrigo R, Catherine Cellon, James W Olmstead, Antonio AF Garcia, Marcio FR Resende Jr, and Patricio R Muñoz. 2016. “AGHmatrix: R Package to Construct Relationship Matrices for Autotetraploid and Diploid Species: A Blueberry Example.” The Plant Genome 9 (3): plantgenome2016–01.
Gaynor, R Chris, Gregor Gorjanc, and John M Hickey. 2021. “AlphaSimR: An r Package for Breeding Program Simulations.” G3 11 (2): jkaa017.
Isik, Fikret, James Holland, and Christian Maltecca. 2017. Genetic Data Analysis for Plant and Animal Breeding. Vol. 400. Springer.
Jarquin, Diego, Reka Howard, Alencar Xavier, and Sruti Das Choudhury. 2018. “Increasing Predictive Ability by Modeling Interactions Between Environments, Genotype and Canopy Coverage Image Data for Soybeans.” Agronomy 8 (4): 51.
VanRaden, Paul M. 2008. “Efficient Methods to Compute Genomic Predictions.” Journal of Dairy Science 91 (11): 4414–23.