DNA Methylation mediation of Adverse Outcomes in Type 2 Diabetes

1 Million Veteran Program

  • Current data release:

    • 539,020 Individuals

      • 45,460 individuals with methylation data

        • 28,709 (~64%) with whole genome sequencing
    • Among patients with methylation data, ~40% are T2D.

1.1 Methylation Data

  • 768,569 CpG site probes

  • CpG site methylation is expressed in terms of beta values, i.e.,

\[ \beta = \frac{\text{methylated signal}}{\text{methylated signal} + \text{unmethylated signal}}. \]

  • For each sample, the estimated percentage of each blood cell type (B-cell, CD4 T, CD8 T, Eos, Mono, Neu, NK) is available in the Houseman dataset.

2 Study Objectives

  • Identify differentially methylated CpG sites associated with glycemia exposures

  • Identify differentially methylated CpG sites associated with diabetic complications

    • retinopathy

    • nephropathy

  • Determine whether the identified CpG sites are responsible for the association between glycemia exposures and DM complications

3 Cohort Identification

Figure 1: Sample Subject Timeline with Landmark Events
  • Subjects must have \(\ge 1\) DM diagnosis code or equivalent prior to DNA sample collections

  • Subjects diagnosed with DM after DNA sample collection will be excluded

  • Subjects diagnosed with a DM related complication before DNA sample collection will be excluded as well

flowchart TD
    A["All MVP subjects with <br/> methylation data <br/> N = 45,460"]
    B1["Subjects without a <br/>diabetes diagnosis<br/>or equivalent<br/> _____"]
    A --> B1
    A ===> B2["Subjects with ≥1<br/>diabetes diagnosis or equivalent<br/>prior to DNA sample collection<br/>N=13,741"]
    B2 --> C1["Subjects <b>with</b> <br/> diabetes related complications<br/> prior to DNA sample collection <br/>_____"]
    B2 ===> C2["Subjects <b>without</b> <br/> diabetes related complications<br/> prior to DNA sample collection"]
    subgraph excl["<div style='padding: 298px 0 0 0;'>Excluded from Sample</div>"]
        B1 ~~~ C1
    end

Figure 2: Cohort Selection Flowchart
Note

Additionally, we may wish to not just include diabetic subjects, but also pre-diabetic subjects (5.7% < HbA1c < 6.5%).

  • These subjects may already be on glucose lowering medication(s), such as metformin

  • Follow-up HbA1c measurements may be sparse for pre-diabetic subjects

4 Outcome Definition

4.1 Incidence Outcomes

  • Use available patient level EHR data to identify diabetic subjects

    • Diagnosis codes alone may fail to capture subjects diagnosed outside of the VA
  • Diabetic Retinopathy will be identified by

    • PheCode definition or the case algorithm defined in (Breeyear et al. 2023). Reproducing the ‘V2’ algorithm specific to MVP:

      • (Any combination of 2 or more unique days of ICD9 or ICD10 codes for DR of any severity (nsdr_icd, mnpdr_icd, snpdr_icd, pdr_icd)
        • OR health factor codes [any_dr_hf, nsdr_hf, mnpdr_hf, snpdr_hf, pdr_hf])
        • OR (ophthalmic exam CPT codes specific to fundus photography [opt_exam]
        • OR evidence of teleretinal visit through health factors [tri_evidence]
        • OR general CPT cdoe with a 408 stop code to confirm visit to eye clinic [opt_code_use_with_stop_code])
          • AND (1 or more ICD9 or ICD10 codes for DR of any severity OR health factor codes for DR within 24 hours of the ophthalmic exam
  • Chronic Kidney Disease (CKD) or nephropathy will be identified by

  • Additional diabetes related variables to be constructed using definitions provided in (Yang et al. 2024)

4.2 Time-to-Event Outcomes

  • Will be defined by taking the earliest date of diagnosis with a specific diabetic complication (Retinopathy/Nephropathy)

    • Need to consider possible competing risks including death

      • Review of available software for time-to-event EWAS with competing risks

4.3 Continuous Outcomes

  • EGFR

5 Exposure Definition

  • Want to construct a measure that quantifies the total exposure to high glycemia in the five1 years immediately prior to DNA sample collection.

1 It may be appropriate to have a longer window, or even lookback through the full subject history. Model choice will depend on the distribution of available data pre-DNA sample collection

Figure 3: Sample HbA1C trajectory in pre-period, where sample collection occurs nine years following initial DM diagnosis
Figure 4: Time-In-Range definition as defined by Conlin
Figure 5: Proposed Time-In-Range exposure
  1. Mean HbA1c level in the five years prior to DNA methylation sample collection

  2. Maximum HbA1c level in the five years prior to DNA methylation sample collection

  3. Cumulative HbA1c level in the five years prior to DNA methylation sample collection

  4. CV of HbA1c level in the five years prior to DNA methylation sample collection

  5. Time out of range for Normal HbA1c level in the five years prior to DNA methylation sample collection

Figure 6: Exposure value versus Lookback time from MVP Enrollment
Note

For pre-diabetic subjects (5.7% < HbA1c < 6.5%), the above definitions for glycemia exposure will likely be difficult or impossible to calculate. Pre-diabetic subjects may by asymptomatic or have few symptoms of high glycemia, so regular HbA1c measurements are likely to be unavailable for them.

6 Current Workflow

  • Identify exposure associated CpG sites by EWAS
  • DMRcate to identify exposure-associated methylation regions
    • ENmix provides combp and ipdmr which can perform the same task, but they do not work very well
  • Using identified regions, can perform:
    • Functional Pathway Annotation

7 Analysis Plan

flowchart LR
  cohort(Cohort Identification) --> ewas(EWAS)
  outcome(Outcome Definition) --> ewas
  exposure(Exposure Definition) --> ewas
  subgraph Mediation Analysis
  retina(Retinopathy)
  ckd(Chronic Kidney Disease)
  end
  ewas(EWAS) ---> retina & ckd
  ewas(EWAS) --> meQTL(meQTL Identification) --> MR-retina & MR-ckd
  ewas(EWAS) --> funanno(Functional Annotation) 
  subgraph Mendelian Randomization
  MR-retina(Retinopathy)
  MR-ckd(Chronic Kidney Disease)
  end
  ewas ---> validation(External Data Validation)

Figure 7

8 EWAS: Analysis to determine which CpG sites methylation are associated with HbA1c level

  • Hypothesis: HbA1c has a cumulative effect on DNA methylation levels, i.e., the current DNA methylation depends more on one’s historical HbA1C levels rather than current HbA1c level.

    • To analyze this we will perform an EWAS across all CpG sites under the different models for prior exposure Section 5
  • EWAS covariates for adjustment (base model): sex, age, cell composition, potential batch effects, duration of diabetes prior to sample collection (considered below). Therefore we will consider:

    1. Model 1:

    \[ \beta \sim \mbox{Mean HbA1c + sex + age + cell composition + batch effects} \]

    1. Model 2:

    \[ \beta \sim \mbox{Maximum HbA1c + sex + age + cell composition + batch effects} \]

    1. Model 3:

    \[ \beta \sim \mbox{Cumulative HbA1c level + sex + age + cell composition + batch effects} \]

    1. Model 4:

    \[ \beta \sim \mbox{CV of HbA1c level + sex + age + cell composition + batch effects} \]

  • Adjustment of Diabetes duration: interval between the date of the first DM diagnosis code to MVP recruitment date, potentially can include it as an interval censored covariates. Alternatively, we will have to extract a newly diagnosed DM cohort (much smaller sample size)

  • Included in batch effects are some technical details specific to MVP:

    • Scanner: identifier of scanner used to scan the assay for hybridization

    • Duration of Sample Storage: the time frame (in days) between blood draw and hybridization to Epic methylation assay

    • Per sample batch effects may be complicated or difficult to adjust for given that subjects enroll continuously and MVP being a multi-site program with a complex population. Possible strategies:

      • Include technicalMetric_PC[1-27] as covariates

        • these are the first 27 technical principal components calculated using background and control signals across probes to examine sample clustering
      • Model batch effect as a random effect (meffil supports this for EWAS)

  • Threshold for p-value significance will be set at \(5\%\) after Bonferroni correction.

  • The R package meffil will be used for EWAS on MVP data, and has already been used to fit an EWAS for body mass index.

  • The R package bacon was used to reduce inflation and bias for the aforementioned analysis.

8.1 Sensitivity Analysis

  • We will evaluate the following additional clinical variables (see Table 1) as potential confounders in the EWAS analysis. We will add 1 variable at a time to the base model and evaluate the change in p-value of historical HbA1c exposures with and without each of above variables
Table 1: Additional clinical covariates to be considered.
Category Variable
Demographics race
smoking status
body mass index
Cardiac and blood lipids systolic blood pressure
diastolic blood pressure
(total?) cholesterol
triglyceride
low-density lipoprotein
high-density lipoprotein
Diabetic Complications PDR
SNPDR
CSME
AER \(>30\) mg / 24h
Cardiovascular Disease

8.2 Validation

We will validate the results using external data,

  • UCLA ATLAS
  • MESA data
  • ?

8.3 Annotation

  • HbA1c associated regions will be defined as those having at least two HbA1c associated CpG sites in a 500 base pair window.

  • Additionally, each significant HbA1c associated CpG will be annotated based on its genomic location relative to RefSeq genes.

9 meQTL Identification

  • Using genomic data, a GWAS at a Bonferroni corrected significance level \(\alpha=5\%\) will be conducted for each Hb1A1c-associated CpG site identified in Section 8, adjusting for age, sex, cell composition, and batch effects.

  • For each CpG site, SNPs significantly associated with CpG site methylation will be identified as meQTLs.

10 Causal Mediation Analysis (Last Revision 2025.08.26)

10.1 Step ‘0’: Identify at-risk subset(s)

  • For the outcomes:

    • Retinopathy (non-specific)

      • Defined as earliest diagnosis of DR (non-specific), from various sources (ICD or Health Factors with additional criterion, see Breeyear)

      • Excluding subjects with any DR diagnosis prior to MVP blood sample collection

    • Diabetic Neuropathy (DN)

      • Defined as earliest diagnosis code date of DN

      • Excluding subjects with any DN diagnosis prior to MVP blood sample collection

    • Diabetic Kidney Disease (DKD)

      • eGFR calculated using race-agnostic CKD-EPI equation

        • Stage 3a CKD, defined as eGFR <60 mL / (min * m2)

        • Stage 3b CKD, defined as eGFR <45 mL / (min * m2)

      • Subjects must have at least one eGFR measurement in the 5-year period preceeding blood sample collection

      • Baseline eGFR >= 60 mL / (min * m2), as defined by the most recent eGFR measurement up to MVP blood sample collection

      • CKD diagnosis date defined as the earliest date of two consecutive eGFR measurements meeting the above criteria (3a/3b) and at least 45 days apart, up to 365 days

10.2 Step 1: Methylation | Glycemic Exposure

  • For each diabetic complication, in the at-risk subset,

    • Outcome: DNA methylation

    • (\(A \to M\)) At each CpG, regress CpG site methylation on glycemic exposure, controlling for confounders of DNA methylation

    • Equivalent to EWAS in the at-risk subset

10.3 Step 2: Complication Onset | Glycemic Exposure + Methylation + Confounders

  • Time-to-event Modeling (Accelerated Failure Time Model):

    • Outcome: time from MVP blood sample collection to the development of diabetic complication

    • Covariates:

      • Age at MVP enrollment
      • Sex
      • HbA1c-derived exposure prior to MVP (up to 5 years)
      • Prior diabetes duration at MVP enrollment

10.4 Step 3:

  • For each CpG site, construct joint-significance test statistic (MaxP/JS-Uniform)

10.5 Step 4:

  • For each HbA1c associated CpG site, we will fit the same model as in step 1, this time including both historical HbA1c level and DNA methylation level as covariates.

  • Following Chen, we will quantify the mediation effect by:

The percentage of statistic score change (\(z\)-score in Cox proportional-hazards models and \(t\)-score in linear regression) on historical HbA1c from step 3 versus step 1 was then calculated to obtain the percentage of DNA methylation that explains the association between HbA1c and complications (defined from step 1)

\[ \mbox{Percent Change} = \frac{\mbox{test statistic for HbA1c in Step 1} - \mbox{test statistic for HbA1c in Step 3}}{\mbox{test statistic for HbA1c in Step 1}} \times 100\% \]

10.6 Step 5

  • For each outcome, Chen selected the top 10 HbA1c-associated CpG sites and performed a brute force best subset selection, picking the CpG sites that best explained the risk of the specific diabetes complication. Then, the percentage change was constructed similarly as in Step 3.

11 Mendelian Randomization (MR)

MR can be performed for each outcome, analyzing the causal effect of DNA methylation levels on diabetic complications (Chen et al. 2020). In MVP, since there are \(~650,000 - 45,460 = ~605,000\) individuals with genotyping data and without methylation data, this cohort can be used to perform Subsample Instrumental Variable Estimation (Pierce and Burgess 2013)

  • For each outcome, the procedure is as follows:

    1. Using the smaller cohort with methylation data, regress the exposure \(X\) (in-this case CpG site Methylation at all CpG sites associated with the outcome) on the genetic information \(G\), this is already available from Section 9

    2. Using the larger cohort, who do not have methylation data, conduct a GWAS for the outcome \(Y\), and obtain fitted values \(\widehat{Y}\)

    3. Using the fitted model from (1), obtain the predicted values for the exposure \(X\) in the cohort where it is unobserved, \(\widehat{X}\).

    4. The causal effect of methylation \(X\) on the outcome \(Y\) can be estimated by regressing \(\widehat{Y}\) on \(\widehat{X}\).

  • This is to be repeated for each HbA1c associated CpG site

  • R package TwoSampleMR for this.

12 Subgroup Analysis

  • Stratify by baseline or pre-baseline HbA1c level

  • Stratify by race/ethnicity

  • Stratify by baseline or pre-baseline eGFR level

  • Stratify by relative time of diagnosis with Type 2 diabetes (recently diagnosed vs. not recently diagnosed)

Glossary

  • DM: Diabetes Mellitus

  • EWAS: Epigenome-wide Association Study

  • DCCT: Diabetes Control and Complications Trial (1983-1993)

    • EDIC: Epidemiology of Diabetes Interventions and Complications (1994-present), longitudinal monitoring of patients enrolled in DCCT
  • CpG site: cytosine/guanine

  • PDR: Proliferative Diabetic Retinopathy

  • SNPDR: Severe Nonproliferative Diabetic Retinopathy

  • CSME: Clinically Significant Macular Edema

  • AER: Albumin Excretion Rate

  • meQTL: Methylation Quantitative Trait Loci

References

Breeyear, Joseph H, Sabrina L Mitchell, Cari L Nealon, Jacklyn N Hellwege, Brian Charest, Anjali Khakaria, Christopher W Halladay, et al. 2023. “Development of Portable Electronic Health Record Based Algorithms to Identify Individuals with Diabetic Retinopathy.” medRxiv, 2023–11.
Chen, Zhuo, Feng Miao, Barbara H Braffett, John M Lachin, Lingxiao Zhang, Xiwei Wu, Delnaz Roshandel, et al. 2020. “DNA Methylation Mediates Development of HbA1c-Associated Complications in Type 1 Diabetes.” Nature Metabolism 2 (8): 744–62.
Kim, Do Hyun, Aubrey Jensen, Kelly Jones, Sridharan Raghavan, Lawrence S Phillips, Adriana Hung, Yan V Sun, et al. 2023. “A Platform for Phenotyping Disease Progression and Associated Longitudinal Risk Factors in Large-Scale EHRs, with Application to Incident Diabetes Complications in the UK Biobank.” JAMIA Open 6 (1): ooad006.
Pierce, Brandon L, and Stephen Burgess. 2013. “Efficient Design for Mendelian Randomization Studies: Subsample and 2-Sample Instrumental Variable Estimators.” American Journal of Epidemiology 178 (7): 1177–84.
Yang, Peter K, Sandra L Jackson, Brian R Charest, Yiling J Cheng, Yan V Sun, Sridharan Raghavan, Elizabeth M Litkowski, et al. 2024. “Type 1 Diabetes Genetic Risk in 109,954 Veterans with Adult-Onset Diabetes: The Million Veteran Program (MVP).” Diabetes Care 47 (6): 1032–41.