Lost in translation: On the problem of data coding in penalized whole genome regression with interactions

Johannes Martini; Francisco Rosales; Ngoc-Thuy Ha; Johannes Heise; Valentin Wimmer; Thomas Kneib

doi:10.1534/g3.118.200961

Research output: Contribution to journal › Article › peer-review

7 Citations (Scopus)

Abstract

Mixed models can be considered a type of penalized regression and are everyday tools in statistical genetics. The standard mixed model for whole genome regression (WGR) is ridge regression best linear unbiased prediction (RRBLUP), which is based on an additive marker effect model. Many publications have extended the additive WGR approach by incorporating interactions between loci or between genes and environment. In this context of penalized regressions with interactions, it has been reported that translating the coding of single nucleotide polymorphisms (for instance from -1,0,1 to 0,1,2) has an impact on the prediction of genetic values and interaction effects. In this work, we identify the reason for the relevance of variable coding in the general context of penalized polynomial regression. We show that in many cases, predictions of the genetic values are not invariant to translations of the variable coding, with an exception when only the sizes of the coefficients of monomials of highest total degree are penalized. The invariance of RRBLUP can be considered a special case of this setting, with a polynomial of total degree 1, penalizing additive effects (total degree 1) but not the fixed effect (total degree 0). The extended RRBLUP (eRRBLUP), which includes interactions, is not invariant to translations because it penalizes not only interactions (total degree 2) but also additive effects (total degree 1). This observation implies that translation-invariance can be maintained in a pair-wise epistatic WGR if only interaction effects are penalized, but not the additive effects. In this regard, approaches of pre-selecting loci may not only reduce computation time, but can also help to avoid the variable coding issue. To illustrate the practical relevance, we compare different regressions on a publicly available wheat data set.
We show that for an eRRBLUP, the relevance of the marker coding for interaction effect estimates increases with the number of variables included in the model. A biological interpretation of estimated interaction effects may therefore become more difficult. Consequently, when reproducing kernel Hilbert space (RKHS) approaches are compared to WGR approaches that model effects explicitly, the supposed advantage of an increased interpretability of the latter may not be real. Our theoretical results are generally valid for penalized regressions, including, for instance, the least absolute shrinkage and selection operator (LASSO). Moreover, they apply to any type of interaction modeled by products of predictor variables in a penalized regression approach or by Hadamard products of covariance matrices in a mixed model.
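
The invariance statement in the abstract can be checked numerically on toy data. The following is a minimal sketch, not the authors' code: it assumes a plain ridge-type penalty solved through the normal equations, three simulated markers, and hypothetical helper names (`design`, `penalized_fit`, `prediction_shift`). It fits the same data under the -1,0,1 and 0,1,2 codings and reports the largest change in fitted values for three penalty choices.

```python
import numpy as np


def design(M, interactions):
    """Build columns: intercept, markers, and optionally all pairwise products."""
    n, p = M.shape
    cols = [np.ones(n)] + [M[:, j] for j in range(p)]
    if interactions:
        cols += [M[:, j] * M[:, k] for j in range(p) for k in range(j + 1, p)]
    return np.column_stack(cols)


def penalized_fit(X, y, penalized, lam):
    """Minimize ||y - X b||^2 + lam * sum of b_j^2 over the penalized columns."""
    D = np.diag(penalized.astype(float))
    return np.linalg.solve(X.T @ X + lam * D, X.T @ y)


def prediction_shift(M, y, interactions, penalize_additive, lam=5.0):
    """Largest change in fitted values when the coding moves from -1,0,1 to 0,1,2."""
    p = M.shape[1]
    n_pairs = p * (p - 1) // 2 if interactions else 0
    # The intercept (total degree 0) is never penalized; interactions always are.
    penalized = np.array([False] + [penalize_additive] * p + [True] * n_pairs)
    fitted = []
    for shift in (0.0, 1.0):
        X = design(M + shift, interactions)
        fitted.append(X @ penalized_fit(X, y, penalized, lam))
    return float(np.max(np.abs(fitted[0] - fitted[1])))


# Simulated toy data (an assumption for illustration, not the wheat data set).
rng = np.random.default_rng(1)
M = rng.integers(-1, 2, size=(60, 3)).astype(float)
y = M @ np.array([1.0, -0.5, 0.3]) + 0.8 * M[:, 0] * M[:, 1] + rng.normal(size=60)

print("additive model, additive effects penalized:", prediction_shift(M, y, False, True))
print("interaction model, all effects penalized:  ", prediction_shift(M, y, True, True))
print("interaction model, only interactions pen.: ", prediction_shift(M, y, True, False))
```

Under these assumptions, the first and third settings should change only at the level of floating-point error, while the second (the eRRBLUP-like penalty) should show a clearly nonzero change. Leaving the additive effects unpenalized requires the intercept and marker columns to have full column rank, which is why the sketch uses few markers relative to the number of observations.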

Original language	English
Pages (from-to)	1117-1129
Number of pages	13
Journal	G3: Genes, Genomes, Genetics
Volume	9
Issue number	4
DOI	https://doi.org/10.1534/g3.118.200961
Status	Published - Apr 2019
Externally published	Yes

Bibliographical note

Funding Information:
JWRM thanks KWS SAAT SE as well as the German Research Foundation (DFG) via the research training group 1644 “Scaling Problems in Statistics” for support during his PhD thesis.

Publisher Copyright:
Copyright © 2019 Martini et al.

Cite this

@article{beb637ed7b594e20b087d37139cdeef0,

title = "Lost in translation: On the problem of data coding in penalized whole genome regression with interactions",

abstract = "Mixed models can be considered as a type of penalized regression and are everyday tools in statistical genetics. The standard mixed model for whole genome regression (WGR) is ridge regression best linear unbiased prediction (RRBLUP) which is based on an additive marker effect model. Many publications have extended the additive WGR approach by incorporating interactions between loci or between genes and environment. In this context of penalized regressions with interactions, it has been reported that translating the coding of single nucleotide polymorphisms -for instance from -1,0,1 to 0,1,2- has an impact on the prediction of genetic values and interaction effects. In this work, we identify the reason for the relevance of variable coding in the general context of penalized polynomial regression. We show that in many cases, predictions of the genetic values are not invariant to translations of the variable coding, with an exception when only the sizes of the coefficients of monomials of highest total degree are penalized. The invariance of RRBLUP can be considered as a special case of this setting, with a polynomial of total degree 1, penalizing additive effects (total degree 1) but not the fixed effect (total degree 0). The extended RRBLUP (eRRBLUP), which includes interactions, is not invariant to translations because it does not only penalize interactions (total degree 2), but also additive effects (total degree 1). This observation implies that translation-invariance can be maintained in a pair-wise epistatic WGR if only interaction effects are penalized, but not the additive effects. In this regard, approaches of pre-selecting loci may not only reduce computation time, but can also help to avoid the variable coding issue. To illustrate the practical relevance, we compare different regressions on a publicly available wheat data set. 
We show that for an eRRBLUP, the relevance of the marker coding for interaction effect estimates increases with the number of variables included in the model. A biological interpretation of estimated interaction effects may therefore become more difficult. Consequently, comparing reproducing kernel Hilbert space (RKHS) approaches to WGR approaches modeling effects explicitly, the supposed advantage of an increased interpretability of the latter may not be real. Our theoretical results are generally valid for penalized regressions, for instance also for the least absolute shrinkage and selection operator (LASSO). Moreover, they apply to any type of interaction modeled by products of predictor variables in a penalized regression approach or by Hadamard products of covariance matrices in a mixed model.",

keywords = "EGBLUP, GenPred, Genomic Prediction, Genomic selection, GxE, Hadamard products, Interactions, Shared Data Resources, Whole genome regression",

author = "Johannes Martini and Francisco Rosales and Ngoc-Thuy Ha and Johannes Heise and Valentin Wimmer and Thomas Kneib",

note = "Funding Information: JWRM thanks KWS SAAT SE as well as the German Research Foundation (DFG) via the research training group 1644 “Scaling Problems in Statistics” for support during his PhD thesis. Publisher Copyright: Copyright {\textcopyright} 2019 Martini et al.",

year = "2019",

month = apr,

doi = "10.1534/g3.118.200961",

language = "English",

volume = "9",

pages = "1117--1129",

journal = "G3: Genes, Genomes, Genetics",

issn = "2160-1836",

publisher = "Genetics Society of America",

number = "4",

}

TY - JOUR

T1 - Lost in translation

T2 - On the problem of data coding in penalized whole genome regression with interactions

AU - Martini, Johannes

AU - Rosales, Francisco

AU - Ha, Ngoc-Thuy

AU - Heise, Johannes

AU - Wimmer, Valentin

AU - Kneib, Thomas

N1 - Funding Information: JWRM thanks KWS SAAT SE as well as the German Research Foundation (DFG) via the research training group 1644 “Scaling Problems in Statistics” for support during his PhD thesis. Publisher Copyright: Copyright © 2019 Martini et al.

PY - 2019/4

Y1 - 2019/4

N2 - Mixed models can be considered as a type of penalized regression and are everyday tools in statistical genetics. The standard mixed model for whole genome regression (WGR) is ridge regression best linear unbiased prediction (RRBLUP) which is based on an additive marker effect model. Many publications have extended the additive WGR approach by incorporating interactions between loci or between genes and environment. In this context of penalized regressions with interactions, it has been reported that translating the coding of single nucleotide polymorphisms -for instance from -1,0,1 to 0,1,2- has an impact on the prediction of genetic values and interaction effects. In this work, we identify the reason for the relevance of variable coding in the general context of penalized polynomial regression. We show that in many cases, predictions of the genetic values are not invariant to translations of the variable coding, with an exception when only the sizes of the coefficients of monomials of highest total degree are penalized. The invariance of RRBLUP can be considered as a special case of this setting, with a polynomial of total degree 1, penalizing additive effects (total degree 1) but not the fixed effect (total degree 0). The extended RRBLUP (eRRBLUP), which includes interactions, is not invariant to translations because it does not only penalize interactions (total degree 2), but also additive effects (total degree 1). This observation implies that translation-invariance can be maintained in a pair-wise epistatic WGR if only interaction effects are penalized, but not the additive effects. In this regard, approaches of pre-selecting loci may not only reduce computation time, but can also help to avoid the variable coding issue. To illustrate the practical relevance, we compare different regressions on a publicly available wheat data set. 
We show that for an eRRBLUP, the relevance of the marker coding for interaction effect estimates increases with the number of variables included in the model. A biological interpretation of estimated interaction effects may therefore become more difficult. Consequently, comparing reproducing kernel Hilbert space (RKHS) approaches to WGR approaches modeling effects explicitly, the supposed advantage of an increased interpretability of the latter may not be real. Our theoretical results are generally valid for penalized regressions, for instance also for the least absolute shrinkage and selection operator (LASSO). Moreover, they apply to any type of interaction modeled by products of predictor variables in a penalized regression approach or by Hadamard products of covariance matrices in a mixed model.

KW - EGBLUP

KW - GenPred

KW - Genomic Prediction

KW - Genomic selection

KW - GxE

KW - Hadamard products

KW - Interactions

KW - Shared Data Resources

KW - Whole genome regression

UR - http://www.scopus.com/inward/record.url?scp=85064724821&partnerID=8YFLogxK

UR - https://www.mendeley.com/catalogue/84b85aec-d51d-3d25-9da0-06886107716d/

U2 - 10.1534/g3.118.200961

DO - 10.1534/g3.118.200961

M3 - Article

C2 - 30760541

SN - 2160-1836

VL - 9

SP - 1117

EP - 1129

JO - G3: Genes, Genomes, Genetics

JF - G3: Genes, Genomes, Genetics

IS - 4

ER -
