flowchart TD
A["All MVP subjects with <br/> methylation data <br/> N = 45,460"]
B1["Subjects without a <br/>diabetes diagnosis<br/>or equivalent<br/> _____"]
A --> B1
A ===> B2["Subjects with ≥1<br/>diabetes diagnosis or equivalent<br/>prior to DNA sample collection<br/>N=13,741"]
B2 --> C1["Subjects <b>with</b> <br/> diabetes related complications<br/> prior to DNA sample collection <br/>_____"]
B2 ===> C2["Subjects <b>without</b> <br/> diabetes related complications<br/> prior to DNA sample collection"]
subgraph excl["<div style='padding: 298px 0 0 0;'>Excluded from Sample</div>"]
B1 ~~~ C1
end
DNA Methylation mediation of Adverse Outcomes in Type 2 Diabetes
1 Million Veteran Program
Current data release:
539,020 Individuals
45,460 individuals with methylation data
- 28,709 (~64%) with whole genome sequencing
Among patients with methylation data, ~40% are T2D.
1.1 Methylation Data
768,569 CpG site probes
CpG site methylation is expressed in terms of beta values, i.e.,
\[ \beta = \frac{\text{methylated signal}}{\text{methylated signal} + \text{unmethylated signal}}. \]
- For each sample, the estimated percentage of each blood cell type (B-cell, CD4 T, CD8 T, Eos, Mono, Neu, NK) is available in the
Housemandataset.
2 Study Objectives
Identify differentially methylated CpG sites associated with glycemia exposures
Identify differentially methylated CpG sites associated with diabetic complications
retinopathy
nephropathy
Determine whether the identified CpG sites are responsible for the association between glycemia exposures and DM complications
3 Cohort Identification
Subjects must have \(\ge 1\) DM diagnosis code or equivalent prior to DNA sample collections
Subjects diagnosed with DM after DNA sample collection will be excluded
Subjects diagnosed with a DM related complication before DNA sample collection will be excluded as well
Additionally, we may wish to not just include diabetic subjects, but also pre-diabetic subjects (5.7% < HbA1c < 6.5%).
These subjects may already be on glucose lowering medication(s), such as metformin
Follow-up HbA1c measurements may be sparse for pre-diabetic subjects
4 Outcome Definition
4.1 Incidence Outcomes
Use available patient level EHR data to identify diabetic subjects
- Diagnosis codes alone may fail to capture subjects diagnosed outside of the VA
Diabetic Retinopathy will be identified by
PheCode definition or the case algorithm defined in (Breeyear et al. 2023). Reproducing the ‘V2’ algorithm specific to MVP:
- (Any combination of 2 or more unique days of ICD9 or ICD10 codes for DR of any severity (nsdr_icd, mnpdr_icd, snpdr_icd, pdr_icd)
- OR health factor codes [any_dr_hf, nsdr_hf, mnpdr_hf, snpdr_hf, pdr_hf])
- OR (ophthalmic exam CPT codes specific to fundus photography [opt_exam]
- OR evidence of teleretinal visit through health factors [tri_evidence]
- OR general CPT cdoe with a 408 stop code to confirm visit to eye clinic [opt_code_use_with_stop_code])
- AND (1 or more ICD9 or ICD10 codes for DR of any severity OR health factor codes for DR within 24 hours of the ophthalmic exam
- (Any combination of 2 or more unique days of ICD9 or ICD10 codes for DR of any severity (nsdr_icd, mnpdr_icd, snpdr_icd, pdr_icd)
Chronic Kidney Disease (CKD) or nephropathy will be identified by
- Leveraging the phenotyping construction developed in (Kim et al. 2023)
Additional diabetes related variables to be constructed using definitions provided in (Yang et al. 2024)
4.2 Time-to-Event Outcomes
Will be defined by taking the earliest date of diagnosis with a specific diabetic complication (Retinopathy/Nephropathy)
Need to consider possible competing risks including death
- Review of available software for time-to-event EWAS with competing risks
4.3 Continuous Outcomes
- EGFR
5 Exposure Definition
- Want to construct a measure that quantifies the total exposure to high glycemia in the five1 years immediately prior to DNA sample collection.
1 It may be appropriate to have a longer window, or even lookback through the full subject history. Model choice will depend on the distribution of available data pre-DNA sample collection
Mean HbA1c level in the five years prior to DNA methylation sample collection
Maximum HbA1c level in the five years prior to DNA methylation sample collection
Cumulative HbA1c level in the five years prior to DNA methylation sample collection
CV of HbA1c level in the five years prior to DNA methylation sample collection
Time out of range for Normal HbA1c level in the five years prior to DNA methylation sample collection
For pre-diabetic subjects (5.7% < HbA1c < 6.5%), the above definitions for glycemia exposure will likely be difficult or impossible to calculate. Pre-diabetic subjects may by asymptomatic or have few symptoms of high glycemia, so regular HbA1c measurements are likely to be unavailable for them.
6 Current Workflow
- Identify exposure associated CpG sites by EWAS
DMRcateto identify exposure-associated methylation regionsENmixprovidescombpandipdmrwhich can perform the same task, but they do not work very well
- Using identified regions, can perform:
- Functional Pathway Annotation
7 Analysis Plan
flowchart LR cohort(Cohort Identification) --> ewas(EWAS) outcome(Outcome Definition) --> ewas exposure(Exposure Definition) --> ewas subgraph Mediation Analysis retina(Retinopathy) ckd(Chronic Kidney Disease) end ewas(EWAS) ---> retina & ckd ewas(EWAS) --> meQTL(meQTL Identification) --> MR-retina & MR-ckd ewas(EWAS) --> funanno(Functional Annotation) subgraph Mendelian Randomization MR-retina(Retinopathy) MR-ckd(Chronic Kidney Disease) end ewas ---> validation(External Data Validation)
8 EWAS: Analysis to determine which CpG sites methylation are associated with HbA1c level
Hypothesis: HbA1c has a cumulative effect on DNA methylation levels, i.e., the current DNA methylation depends more on one’s historical HbA1C levels rather than current HbA1c level.
- To analyze this we will perform an EWAS across all CpG sites under the different models for prior exposure Section 5
EWAS covariates for adjustment (base model): sex, age, cell composition, potential batch effects, duration of diabetes prior to sample collection (considered below). Therefore we will consider:
- Model 1:
\[ \beta \sim \mbox{Mean HbA1c + sex + age + cell composition + batch effects} \]
- Model 2:
\[ \beta \sim \mbox{Maximum HbA1c + sex + age + cell composition + batch effects} \]
- Model 3:
\[ \beta \sim \mbox{Cumulative HbA1c level + sex + age + cell composition + batch effects} \]
- Model 4:
\[ \beta \sim \mbox{CV of HbA1c level + sex + age + cell composition + batch effects} \]
Adjustment of Diabetes duration: interval between the date of the first DM diagnosis code to MVP recruitment date, potentially can include it as an interval censored covariates. Alternatively, we will have to extract a newly diagnosed DM cohort (much smaller sample size)
Included in batch effects are some technical details specific to MVP:
Scanner: identifier of scanner used to scan the assay for hybridization
Duration of Sample Storage: the time frame (in days) between blood draw and hybridization to Epic methylation assay
Per sample batch effects may be complicated or difficult to adjust for given that subjects enroll continuously and MVP being a multi-site program with a complex population. Possible strategies:
Include
technicalMetric_PC[1-27]as covariates- these are the first 27 technical principal components calculated using background and control signals across probes to examine sample clustering
Model batch effect as a random effect (
meffilsupports this for EWAS)
Threshold for p-value significance will be set at \(5\%\) after Bonferroni correction.
The R package
meffilwill be used for EWAS on MVP data, and has already been used to fit an EWAS for body mass index.The R package
baconwas used to reduce inflation and bias for the aforementioned analysis.
8.1 Sensitivity Analysis
- We will evaluate the following additional clinical variables (see Table 1) as potential confounders in the EWAS analysis. We will add 1 variable at a time to the base model and evaluate the change in p-value of historical HbA1c exposures with and without each of above variables
| Category | Variable |
|---|---|
| Demographics | race |
| smoking status | |
| body mass index | |
| Cardiac and blood lipids | systolic blood pressure |
| diastolic blood pressure | |
| (total?) cholesterol | |
| triglyceride | |
| low-density lipoprotein | |
| high-density lipoprotein | |
| Diabetic Complications | PDR |
| SNPDR | |
| CSME | |
| AER \(>30\) mg / 24h | |
| Cardiovascular Disease |
8.2 Validation
We will validate the results using external data,
- UCLA ATLAS
- MESA data
- ?
8.3 Annotation
HbA1c associated regions will be defined as those having at least two HbA1c associated CpG sites in a 500 base pair window.
Additionally, each significant HbA1c associated CpG will be annotated based on its genomic location relative to RefSeq genes.
9 meQTL Identification
Using genomic data, a GWAS at a Bonferroni corrected significance level \(\alpha=5\%\) will be conducted for each Hb1A1c-associated CpG site identified in Section 8, adjusting for age, sex, cell composition, and batch effects.
For each CpG site, SNPs significantly associated with CpG site methylation will be identified as meQTLs.
10 Causal Mediation Analysis (Last Revision 2025.08.26)
10.1 Step ‘0’: Identify at-risk subset(s)
For the outcomes:
Retinopathy (non-specific)
Defined as earliest diagnosis of DR (non-specific), from various sources (ICD or Health Factors with additional criterion, see Breeyear)
Excluding subjects with any DR diagnosis prior to MVP blood sample collection
Diabetic Neuropathy (DN)
Defined as earliest diagnosis code date of DN
Excluding subjects with any DN diagnosis prior to MVP blood sample collection
Diabetic Kidney Disease (DKD)
eGFR calculated using race-agnostic CKD-EPI equation
Stage 3a CKD, defined as eGFR <60 mL / (min * m2)
Stage 3b CKD, defined as eGFR <45 mL / (min * m2)
Subjects must have at least one eGFR measurement in the 5-year period preceeding blood sample collection
Baseline eGFR >= 60 mL / (min * m2), as defined by the most recent eGFR measurement up to MVP blood sample collection
CKD diagnosis date defined as the earliest date of two consecutive eGFR measurements meeting the above criteria (3a/3b) and at least 45 days apart, up to 365 days
10.2 Step 1: Methylation | Glycemic Exposure
For each diabetic complication, in the at-risk subset,
Outcome: DNA methylation
(\(A \to M\)) At each CpG, regress CpG site methylation on glycemic exposure, controlling for confounders of DNA methylation
Equivalent to EWAS in the at-risk subset
10.3 Step 2: Complication Onset | Glycemic Exposure + Methylation + Confounders
Time-to-event Modeling (Accelerated Failure Time Model):
Outcome: time from MVP blood sample collection to the development of diabetic complication
Covariates:
- Age at MVP enrollment
- Sex
- HbA1c-derived exposure prior to MVP (up to 5 years)
- Prior diabetes duration at MVP enrollment
10.4 Step 3:
- For each CpG site, construct joint-significance test statistic (MaxP/JS-Uniform)
10.5 Step 4:
For each HbA1c associated CpG site, we will fit the same model as in step 1, this time including both historical HbA1c level and DNA methylation level as covariates.
Following Chen, we will quantify the mediation effect by:
The percentage of statistic score change (\(z\)-score in Cox proportional-hazards models and \(t\)-score in linear regression) on historical HbA1c from step 3 versus step 1 was then calculated to obtain the percentage of DNA methylation that explains the association between HbA1c and complications (defined from step 1)
\[ \mbox{Percent Change} = \frac{\mbox{test statistic for HbA1c in Step 1} - \mbox{test statistic for HbA1c in Step 3}}{\mbox{test statistic for HbA1c in Step 1}} \times 100\% \]
10.6 Step 5
- For each outcome, Chen selected the top 10 HbA1c-associated CpG sites and performed a brute force best subset selection, picking the CpG sites that best explained the risk of the specific diabetes complication. Then, the percentage change was constructed similarly as in Step 3.
11 Mendelian Randomization (MR)
MR can be performed for each outcome, analyzing the causal effect of DNA methylation levels on diabetic complications (Chen et al. 2020). In MVP, since there are \(~650,000 - 45,460 = ~605,000\) individuals with genotyping data and without methylation data, this cohort can be used to perform Subsample Instrumental Variable Estimation (Pierce and Burgess 2013)
For each outcome, the procedure is as follows:
Using the smaller cohort with methylation data, regress the exposure \(X\) (in-this case CpG site Methylation at all CpG sites associated with the outcome) on the genetic information \(G\), this is already available from Section 9
Using the larger cohort, who do not have methylation data, conduct a GWAS for the outcome \(Y\), and obtain fitted values \(\widehat{Y}\)
Using the fitted model from (1), obtain the predicted values for the exposure \(X\) in the cohort where it is unobserved, \(\widehat{X}\).
The causal effect of methylation \(X\) on the outcome \(Y\) can be estimated by regressing \(\widehat{Y}\) on \(\widehat{X}\).
This is to be repeated for each HbA1c associated CpG site
R package
TwoSampleMRfor this.
12 Subgroup Analysis
Stratify by baseline or pre-baseline HbA1c level
Stratify by race/ethnicity
Stratify by baseline or pre-baseline eGFR level
Stratify by relative time of diagnosis with Type 2 diabetes (recently diagnosed vs. not recently diagnosed)
Glossary
DM: Diabetes Mellitus
EWAS: Epigenome-wide Association Study
DCCT: Diabetes Control and Complications Trial (1983-1993)
- EDIC: Epidemiology of Diabetes Interventions and Complications (1994-present), longitudinal monitoring of patients enrolled in DCCT
CpG site: cytosine/guanine
PDR: Proliferative Diabetic Retinopathy
SNPDR: Severe Nonproliferative Diabetic Retinopathy
CSME: Clinically Significant Macular Edema
AER: Albumin Excretion Rate
meQTL: Methylation Quantitative Trait Loci