The purpose of this vignette is to accompany xxx manuscript on step counts in NHANES 2011-2014.
Background
NHANES is a nationally representative study of about 5,000 Americans per year. The study includes demographic, socioeconomic, dietary, and health-related questions; data from the survey are publicly available. All NHANES study participants age six and older (2011-12, Wave G) or age three or older (2013-14, Wave H) were asked to wear an ActiGraph GT3X+ on the non-dominant wrist starting on the day of their exam at the NHANES Mobile Examination Center. They were instructed to wear the device at all times for seven consecutive days and remove the device on the morning of the ninth day. The devices collected triaxial acceleration at 80 Hz.
In 2022, the raw accelerometer data were released. We downloaded the raw data and applied four open source and one proprietary step count algorithm to the raw data. The algorithms are:
Our pipeline for processing the raw data and obtaining minute-level step counts can be found at https://github.com/muschellij2/nhanes_80hz. The minute level data, in 1440 format, are available for download on Physionet (include link).
In this vignette, we start with data that has been summarized at the subject level, and investigate:
Distributions of steps by age
Correlation between step algorithms and other PA summaries
Performance of step counting algorithms in single- and multivariable mortality prediction
Briefly, the process from minute-level data to subject-level data is as follows:
For each step algorithm or PA variable, create day summaries by summing values across any minute of the day that does not have any data quality flags and is classified by NHANES as wake wear, sleep wear, or unknown
For each subject, take mean of day summaries across all valid days, where valid days are defined as days that meet the following criteria:
At least 1,368 minutes (95% of a full day) were classified as wake wear, sleep wear, or unknown and did not have any data quality flags
At least seven hours (420 minutes) of the day were classified as wake wear
At least seven hours (420 minutes) had non-zero MIMS
Data overview
Code
# read in data df_all=readRDS(here::here("data", "covariates_accel_mortality_df.rds"))df_all%>%glimpse()
The entire dataset consists of 19,931 individuals. The varaibles that start with total or peak denote accelerometery-derived physical activity variables; total is mean of the daily total values of that variable, while peak1 is the mean of the highest value obtained in each day, and peak30 is the mean of the mean of the 30 highest values obtained in each day.
However, not all of these individuals received accelerometers, had valid accelerometer data, or were 18 years and older. Thus, we will exclude some of these individuals from our analysis. We can generate a CONSORT diagram to summarize the number of individuals excluded for various reasons.
We make a survey-weighted table of population characteristics for these individuals:
Code
labs=c("SEQN","Data release cycle","Interview Examination Status","Sex","Age (yrs)","Age (mos)","Race/ethnicity","Six month time period","Educ. level adults","Marital status","2 yr int weight","2 yr exam weight","Pseudo PSU","Psueudo stratum","Annual HH income","Weight (kg)","Height (cm)","BMI (kg/m2)","Overweight","Diabetes orig.","Diabetes","Arthritis","Coronary Heart Failure","Congenital Heart Disease","Angina","Heart attack","Stroke","Cancer",colnames(df_all)[29:33],"Alcohol use","BMI Category","Smoking status","Mobility problem","General health condition","Eligbility","Died by 5 years follow up","COD","COD Diabetes","COD Hypertension","Person-months follow up from interview","Person-months follow up from exam","Eligibility category","COD category","AC","MIMS","Actilife steps","ADEPT steps","log10 AC","log10 MIMS","Oak steps","Sc. RF steps","Sc. SSL steps","Verisense rev. steps","Verisense steps","Peak1 AC","Peak1 MIMS","Peak1 Actilife steps","Peak1 ADEPT steps","Peak1 log10 AC","Peak1 log10 MIMS","Peak1 Oak steps","Peak1 Sc. RF steps","Peak1 Sc. SSL steps","Peak1 Verisense rev. steps","Peak1 Verisense steps","Peak30 AC","Peak30 MIMS","Peak30 Actilife steps","Peak30 ADEPT steps","Peak30 log10 AC","Peak30 log10 MIMS","Peak30 Oak steps","Peak30 Sc. RF steps","Peak30 Sc. SSL steps","Peak30 Verisense rev. steps","Peak30 Verisense steps","No. valid days","Received accelerometer","Valid accelerometry","Inclusion category","Education level")names(labs)=colnames(df_accel)df_tab1=df_all%>%labelled::set_variable_labels(!!!labs)# survey weighted table# question about this!!df_svy=df_tab1%>%filter(has_accel)%>%filter(valid_accel)%>%select(gender,age_in_years_at_screening,race_hispanic_origin,cat_education,cat_bmi,bin_diabetes,chf,chd,stroke,cat_alcohol,cat_smoke,bin_mobilityproblem,general_health_condition,mortstat,data_release_cycle,masked_variance_pseudo_psu,masked_variance_pseudo_stratum,full_sample_2_year_mec_exam_weight)%>%mutate(WTMEC4YR =full_sample_2_year_mec_exam_weight/2, WTMEC4YR_norm =WTMEC4YR/mean(WTMEC4YR, na.rm =TRUE))%>%select(-full_sample_2_year_mec_exam_weight)%>%svydesign( ids =~masked_variance_pseudo_psu, weights =~WTMEC4YR_norm, strata =~masked_variance_pseudo_stratum, data =., nest =TRUE)# survey weighted tabletab=df_svy%>%tbl_svysummary( by =data_release_cycle, include =-c(masked_variance_pseudo_psu,masked_variance_pseudo_stratum,WTMEC4YR,WTMEC4YR_norm), statistic =list(all_continuous()~"{mean} ({sd})",all_categorical()~"{n} ({p}%)"), digits =all_continuous()~2, missing_text ="Missing",)%>%add_overall()%>%modify_caption("Demographic Characteristics, All Adults")tab
Demographic Characteristics, All Adults
Characteristic
Overall N = 12,5171
7 N = 6,2341
8 N = 6,2831
Sex
Female
6,508 (52%)
3,239 (52%)
3,269 (52%)
Male
6,009 (48%)
2,995 (48%)
3,014 (48%)
Age (yrs)
40.99 (21.49)
41.47 (21.07)
40.53 (21.89)
Race/ethnicity
Non-Hispanic White
8,013 (64%)
4,034 (65%)
3,979 (63%)
Non-Hispanic Black
1,484 (12%)
751 (12%)
734 (12%)
Other Race - Including Multi-Rac
972 (7.8%)
457 (7.3%)
515 (8.2%)
Mexican American
1,257 (10%)
571 (9.2%)
686 (11%)
Other Hispanic
790 (6.3%)
421 (6.7%)
370 (5.9%)
Education level
More than HS
6,193 (63%)
3,125 (63%)
3,068 (63%)
Less than HS
1,560 (16%)
824 (17%)
736 (15%)
HS/HS equivalent
2,098 (21%)
1,029 (21%)
1,069 (22%)
Missing
2,666
1,256
1,410
BMI Category
Normal
3,743 (30%)
1,920 (31%)
1,823 (29%)
Obese
3,891 (31%)
1,887 (31%)
2,004 (32%)
Overweight
3,549 (29%)
1,833 (30%)
1,715 (27%)
Underweight
1,240 (10.0%)
540 (8.7%)
700 (11%)
Missing
94
54
41
Diabetes
1,052 (8.4%)
508 (8.2%)
543 (8.6%)
Missing
7
4
3
Coronary Heart Failure
290 (2.9%)
158 (3.2%)
132 (2.7%)
Missing
2,668
1,260
1,409
Congenital Heart Disease
371 (3.8%)
170 (3.4%)
201 (4.1%)
Missing
2,684
1,267
1,417
Stroke
309 (3.1%)
164 (3.3%)
145 (3.0%)
Missing
2,665
1,256
1,408
Alcohol use
Never drinker
1,240 (9.9%)
559 (9.0%)
681 (11%)
Former drinker
1,446 (12%)
726 (12%)
720 (11%)
Moderate drinker
3,117 (25%)
1,644 (26%)
1,473 (23%)
Heavy drinker
785 (6.3%)
439 (7.0%)
347 (5.5%)
Missing alcohol
5,928 (47%)
2,866 (46%)
3,062 (49%)
Smoking status
Never smoker
5,656 (56%)
2,758 (55%)
2,898 (57%)
Former smoker
2,449 (24%)
1,256 (25%)
1,192 (24%)
Current smoker
1,912 (19%)
962 (19%)
950 (19%)
Missing
2,501
1,259
1,242
Mobility problem
1,656 (17%)
749 (15%)
907 (19%)
Missing
2,668
1,256
1,413
General health condition
Poor
282 (2.3%)
127 (2.0%)
155 (2.5%)
Fair
1,684 (13%)
789 (13%)
895 (14%)
Good
4,582 (37%)
2,240 (36%)
2,343 (37%)
Very good
4,055 (32%)
2,104 (34%)
1,951 (31%)
Excellent
1,913 (15%)
974 (16%)
939 (15%)
Died by 5 years follow up
761 (7.5%)
416 (8.1%)
345 (6.8%)
Missing
2,370
1,125
1,245
1 n (%); Mean (SD)
Mean physical activity variables
We can also make a table of the mean physical activity variables by wave and age group.
Code
df_adult=df_accel%>%filter(age_in_years_at_screening>=18)%>%mutate(weight =full_sample_2_year_mec_exam_weight/2, weight_norm =weight/mean(weight))%>%mutate(group ="All adults")df_old=df_accel%>%filter(age_in_years_at_screening>=50)%>%mutate(weight =full_sample_2_year_mec_exam_weight/2, weight_norm =weight/mean(weight))%>%mutate(group ="50+")df=bind_rows(df_adult, df_old)df=df%>%rename( Verisense ="total_vssteps","Verisense rev"="total_vsrevsteps", Oak ="total_oaksteps", ADEPT ="total_adeptsteps","Stepcount RF"="total_scrfsteps","Stepcount SSL"="total_scsslsteps", MIMS ="total_PAXMTSM", Actilife ="total_actisteps","log10 MIMS"="total_log10PAXMTSM","log10 AC"="total_log10AC", AC ="total_AC")%>%select(!contains("peak"))df_analysis_svy=survey::svydesign( id =~masked_variance_pseudo_psu, strata =~masked_variance_pseudo_stratum, weights =~weight_norm, data =df, nest =TRUE)tab=df_analysis_svy%>%tbl_strata( strata =data_release_cycle, .tbl_fun =~.x%>%tbl_svysummary(., include =c(contains("Actilife"),contains("ADEPT"),contains("Oak"),contains("rf"),contains("ssl"),"Verisense","Verisense rev","MIMS","log10 MIMS","AC","log10 AC"), by =group, statistic =list(all_continuous()~"{mean} ({sd})",all_categorical()~"{n} ({p}%)")), .header ="**Wave {strata}**, N = {n}")%>%modify_caption("Physical Activity Mean Totals Stratified by Age and Wave")tab
Physical Activity Mean Totals Stratified by Age and Wave
Characteristic
Wave 7, N = 6513
Wave 8, N = 6404
50+ N = 2,1461
All adults N = 4,3671
50+ N = 2,1071
All adults N = 4,2971
Actilife
11,195 (4,001)
12,169 (3,995)
10,850 (4,008)
11,902 (4,048)
ADEPT
2,342 (1,593)
2,659 (1,569)
2,151 (1,680)
2,453 (1,575)
Oak
10,254 (5,027)
11,794 (5,061)
9,733 (4,886)
11,381 (5,065)
Stepcount RF
9,888 (5,542)
11,509 (5,722)
9,502 (5,442)
11,263 (5,764)
Stepcount SSL
8,358 (4,402)
9,144 (4,400)
8,027 (4,388)
8,846 (4,399)
Verisense
8,337 (4,027)
9,497 (4,062)
7,974 (4,019)
9,163 (4,065)
Verisense rev
7,532 (4,521)
9,102 (4,756)
7,122 (4,402)
8,725 (4,730)
MIMS
12,435 (3,642)
13,572 (3,758)
12,243 (3,574)
13,467 (3,784)
log10 MIMS
960 (158)
998 (154)
952 (160)
994 (155)
AC
2,333,927 (791,711)
2,573,262 (815,192)
2,291,942 (766,303)
2,549,846 (816,305)
log10 AC
2,878 (383)
2,955 (366)
2,847 (394)
2,933 (370)
1 Mean (SD)
Physical activity variables
Distributions
We plot the distributions of each step and physical activity variable. We plot the 99th percentile with a vertical line to indicate the effect of Winsorizing at the 99th percentile. We choose to use the Winsorized variables in the mortality analysis since the raw distributions are all quite right-skewed.
Next we plot the survey-weighted means by age for each algorithm. First we calculate the survey-weghted means, then smooth the means and confidence intervals with a loess smooth.
Code
survey_design=df_accel%>%filter(age_in_years_at_screening>=18&age_in_years_at_screening<80)%>%mutate(weight =full_sample_2_year_mec_exam_weight/2, weight_norm =weight/mean(weight))%>%ungroup()%>%survey::svydesign( id =~masked_variance_pseudo_psu, strata =~masked_variance_pseudo_stratum, weights =~weight_norm, data =., nest =TRUE)# function to calculate mean estimates by agecalc_by_age=function(var, df){# var = "total_oaksteps"formula=as.formula(paste("~", var))total_by_age_gender=svyby(formula,~age_in_years_at_screening,df,svymean)%>%rename(mean =contains(var))%>%mutate(metric =var)}means_df=map_dfr(.x =df_accel%>%select(contains("total")|contains("peak"))%>%colnames(), .f =calc_by_age, df =survey_design)# get smooths for means and confidence intervalsmodels=means_df%>%mutate(lb =mean-1.96*se, ub =mean+1.96*se)%>%tidyr::nest(data =-c(metric))%>%dplyr::mutate(# Perform loess calculation on each group m =purrr::map(data, loess, formula =mean~age_in_years_at_screening, span =.75),# Retrieve the fitted values from each model fitted_mean =purrr::map(m, `[[`, "fitted"), l =purrr::map(data, loess, formula =lb~age_in_years_at_screening, span =.75),# Retrieve the fitted values from each model fitted_lb =purrr::map(l, `[[`, "fitted"), u =purrr::map(data, loess, formula =ub~age_in_years_at_screening, span =.75),# Retrieve the fitted values from each model fitted_ub =purrr::map(u, `[[`, "fitted"))# Apply fitted y's as a new columnresults=models%>%dplyr::select(-m, -l, -u)%>%tidyr::unnest(cols =c(data, contains("fitted")))
Now we can plot the means by age for both totals and peak variables:
Code
# name color vector as desirednames(col_vec)=c("Actilife","ADEPT","Oak","Stepcount RF","Stepcount SSL","Verisense","Verisense rev.")results%>%filter(grepl("step", metric)&grepl("total", metric))%>%mutate(across(contains("fitted"), ~.x/1000), metric =factor(metric, levels =c("total_actisteps","total_adeptsteps","total_oaksteps","total_scrfsteps","total_scsslsteps","total_vssteps","total_vsrevsteps"), labels =c("Actilife","ADEPT","Oak","Stepcount RF","Stepcount SSL","Verisense","Verisense rev.")))%>%ggplot(aes( x =age_in_years_at_screening, y =fitted_mean, ymin =fitted_lb, ymax =fitted_ub, color =metric, fill =metric))+facet_grid(.~metric)+geom_line(linewidth =1)+geom_ribbon(alpha =.2, linetype =0)+scale_fill_manual(values =col_vec, name ="Algorithm")+theme_light()+scale_color_manual(values =col_vec, name ="Algorithm")+geom_hline(aes(yintercept =10), col ="darkgrey", linetype ="dashed")+theme( legend.position =c(0.65, 0.15), legend.title =element_blank(), strip.text =element_text(size =12), axis.title =element_text(size =14), title =element_text(size =14), axis.text =element_text(size =12), panel.grid.minor =element_blank(), legend.text =element_text(size =12))+labs(x ="Age (years)", y ="Mean Daily Steps (x1000)", title ="Smoothed Survey Weighted Mean Daily Steps by Age ")+scale_y_continuous(breaks =seq(0, 16, 1))+scale_x_continuous(breaks =seq(20, 80, 20))+guides(fill =guide_legend(nrow =2, byrow =FALSE), color =guide_legend(nrow =2, byrow =FALSE))
Code
results%>%filter(grepl("peak", metric)&grepl("step", metric))%>%mutate( type =sub("_.*", "", metric), method =sub("^[^_]*_", "", metric), type =factor(type, labels =c("1-minute", "30-minute")))%>%filter(grepl("step", metric))%>%mutate(method =factor(method, levels =c("actisteps","adeptsteps","oaksteps","scrfsteps","scsslsteps","vssteps","vsrevsteps"), labels =c("Actilife","ADEPT","Oak","Stepcount RF","Stepcount SSL","Verisense","Verisense rev.")))%>%ggplot(aes( x =age_in_years_at_screening, y =fitted_mean, ymin =fitted_lb, ymax =fitted_ub, fill =method, color =method))+facet_grid(.~type)+geom_line()+geom_ribbon(alpha =0.2, color =NA)+scale_fill_manual(values =col_vec, name ="Algorithm")+scale_color_manual(values =col_vec, name ="Algorithm")+theme_light()+theme( legend.title =element_blank(), strip.text =element_text(size =12), axis.title =element_text(size =14), title =element_text(size =14), axis.text =element_text(size =12), panel.grid.minor =element_blank(), legend.text =element_text(size =12), legend.position =c(0.3, 0.1))+labs(x ="Age (years)", y ="Peak Cadence (steps/min)", title ="Smoothed Survey Weighted Peak Cadence by Age")+scale_y_continuous(breaks =seq(20, 120, 10))+scale_x_continuous(breaks =seq(20, 80, 10))+guides(fill =guide_legend(nrow =2, byrow =FALSE), color =guide_legend(nrow =2, byrow =FALSE))
Finally, we can calculate the % change per year in the smoothed means and plot:
Code
results%>%filter(grepl("step", metric)&grepl("total", metric))%>%mutate(metric =factor(metric, levels =c("total_actisteps","total_adeptsteps","total_oaksteps","total_scrfsteps","total_scsslsteps","total_vssteps","total_vsrevsteps"), labels =c("Actilife","ADEPT","Oak","Stepcount RF","Stepcount SSL","Verisense","Verisense rev.")))%>%group_by(metric)%>%mutate(pct_chg =(fitted_mean-dplyr::lag(fitted_mean))/dplyr::lag(fitted_mean)*100)%>%ggplot(aes( x =age_in_years_at_screening, y =pct_chg, color =metric, fill =metric))+geom_line(linewidth =1)+geom_ribbon(alpha =.2, linetype =0, aes(ymax =0, ymin =0))+scale_fill_manual(values =col_vec, name ="Algorithm")+theme_light()+scale_color_manual(values =col_vec, name ="Algorithm")+geom_hline(aes(yintercept =0), col ="darkgrey", linetype ="dashed", linewidth =1.1)+theme_light()+labs(x ="Age (years)", y ="% change from previous year", title ="Survey weighted estimated per-year difference in mean daily steps")+theme(legend.title =element_blank(), legend.position =c(0.4, 0.25), axis.title =element_text(size =14), title =element_text(size =14), axis.text =element_text(size =12), strip.text =element_text(size =12), legend.text =element_text(size =14))+scale_x_continuous(breaks =seq(20, 80, 10), limits =c(18, 79))+guides(fill =guide_legend(nrow =2, byrow =FALSE), color =guide_legend(nrow =2, byrow =FALSE))
Correlations
We calculate the pairwise Spearman correlations between each step variable and physical activity variable, then plot the correlation matrix.
Code
cor_mat=df_accel%>%select(contains("total"))%>%select(contains("acti"),contains("adept"),contains("oak"),contains("sc"),contains("vss"),contains("vsr"),total_AC,total_PAXMTSM,total_log10AC,total_log10PAXMTSM)%>%cor(., use ="complete", method ="spearman")# correlation p value matrixpvals=df_accel%>%select(contains("total"))%>%select(contains("acti"),contains("adept"),contains("oak"),contains("sc"),contains("vss"),contains("vsr"),total_AC,total_PAXMTSM,total_log10AC,total_log10PAXMTSM)%>%rstatix::cor_pmat(.)%>%select(-rowname)%>%as.matrix()colnames(cor_mat)=rownames(cor_mat)=colnames(pvals)=rownames(pvals)=c("Actilife steps","ADEPT","Oak","Stepcount RF","Stepcount SSL","Verisense","Verisense rev.","AC","MIMS","log10 AC","log10 MIMS")# r for box around correlationr=c("ADEPT", "Actilife steps", "Verisense rev.", "Verisense")corrplot::corrplot(cor_mat, method ="color", type ="upper", col =paletteer_d("colorBlindness::Blue2Orange8Steps"), tl.col ="black", tl.srt =30, p.mat =pvals, col.lim =c(0.2, 1), sig.level =0.05, insig ="blank", is.corr =FALSE, addgrid.col ="white", diag =FALSE, addCoef.col ='black')%>%corrplot::corrRect(namesMat =r)
We can also plot the mean differences (x1000).
Code
mean_mat=function(df, abs=FALSE){# Ensure the input is a dataframeif(!is.data.frame(df)){stop("Input must be a dataframe")}# Check if all columns are numericif(!all(sapply(df, is.numeric))){stop("All columns in the dataframe must be numeric")}# Get the number of columnsnum_cols<-ncol(df)# Initialize a matrix to store the mean absolute differencesmad_matrix<-matrix(NA, nrow =num_cols, ncol =num_cols)rownames(mad_matrix)<-colnames(df)colnames(mad_matrix)<-colnames(df)# Calculate the mean absolute difference for each pair of columnsfor(iin1:num_cols){for(jin1:num_cols){if(i<=j){# Calculate mean absolute difference, handling potential NAsif(abs){d=abs(df[, i]-df[, j])}else{d=df[, i]-df[, j]}# Debug output# print(paste("Calculating MAD for columns", colnames(df)[i], "and", colnames(df)[j]))# print(head(abs_diff))# Calculate mean# mad_matrix[i, j] <- mean(abs_diff %>% unlist, na.rm = TRUE)mad_matrix[i, j]<-mean(d%>%unlist, na.rm =TRUE)mad_matrix[j, i]<-mad_matrix[i, j]# Symmetric matrix}}}return(mad_matrix)}diff_mat=df_accel%>%select(contains("total"))%>%select(contains("acti"),contains("adept"),contains("oak"),contains("sc"),contains("vss"),contains("vsr"),total_AC,total_PAXMTSM,total_log10AC,total_log10PAXMTSM)%>%mean_mat()# absolute difference matrixabs_diff_mat=df_accel%>%select(contains("total"))%>%select(contains("acti"),contains("adept"),contains("oak"),contains("sc"),contains("vss"),contains("vsr"),total_AC,total_PAXMTSM,total_log10AC,total_log10PAXMTSM)%>%mean_mat(., abs =TRUE)# make NA for non-step algosdiff_mat[, c("total_AC","total_log10AC","total_PAXMTSM","total_log10PAXMTSM")]<-NAdiff_mat[c("total_AC","total_log10AC","total_PAXMTSM","total_log10PAXMTSM"), ]<-NAcolnames(diff_mat)=rownames(diff_mat)=c("Actilife steps","ADEPT","Oak","Stepcount RF","Stepcount SSL","Verisense","Verisense rev.","AC","MIMS","log10 AC","log10 MIMS")diff_mat=diff_mat/1000corrplot::corrplot(diff_mat, method ="color", type ="lower", col =paletteer_c("grDevices::Purple-Green", 10), col.lim =c(-10, 10), tl.col ="black", tl.srt =30, insig ="blank", is.corr =FALSE, diag =FALSE,# title = "Mean Difference in Steps (x1000)", addgrid.col ="white", addCoef.col ='grey50', na.label ="-")
Mortality prediction
Single variable
To investigate the association between each step algorithm and mortality, we fit a Cox proportional hazards model with the step algorithm as the only predictor. We also fit these single variable models with traditional predictors of mortality.
Where \(\texttt{followup\_time}\) is the time to death or censoring, \(\texttt{mortality}\) is the binary indicator of death, and \(\texttt{predictor}\) is the step algorithm or traditional predictor. We perform 10-fold cross validation and estimate the mean concordance over the 10 folds of the model. We then repeat this process 100 times and calculate the mean concordance from the repeated models. The code below illustrates the process, but we have precomputed the results.
Code
# packages for parallel processing; not necessary but will speed up require(future)require(furrr)# make sure dataset meets criteria df_mortality=df_all%>%filter(valid_accel)%>%# valid accelerometryfilter(age_in_years_at_screening>=50&age_in_years_at_screening<80)%>%# age criteriafilter(if_all(.cols =c(age_in_years_at_screening, gender,race_hispanic_origin, cat_education,cat_bmi, chd, chf, heartattack, stroke, cancer,bin_diabetes, cat_alcohol, cat_smoke, bin_mobilityproblem,general_health_condition, mortstat, permth_exm, total_PAXMTSM),~!is.na(.x)))%>%# no missing datamutate(event_time =permth_exm/12)# event time in years = person months since exam / 12# winsorize at 99th percentile df_mortality_win=df_mortality%>%ungroup()%>%mutate(across(c(contains("total"), contains("peak")), ~winsorize(.x, quantile(.x, probs =c(0, 0.99)))))# concordance is survival metric survival_metrics=metric_set(concordance_survival)# proportional hazards model survreg_spec=proportional_hazards()%>%set_engine("survival")%>%set_mode("censored regression")# function to fit model on one variable fit_model=function(var, folds, spec, metrics, mort_df){require(tidyverse); require(tidymodels); require(censored)# create workflowwflow=workflow()%>%add_model(spec)%>%add_variables(outcomes =mort_surv, predictors =all_of(var))%>%add_case_weights(case_weights_imp)# fit model on foldsres=fit_resamples(wflow, resamples =folds, metrics =metrics, control =control_resamples(save_pred =TRUE))# get metrics -- for some reason if you just use collect_metrics it doesn't take into account case weights,# so we write this function to get concordanceget_concordance=function(row_num, preds, surv_df){preds%>%slice(row_num)%>%unnest(.predictions)%>%select(.pred_time, .row, mort_surv)%>%left_join(surv_df%>%select(row_ind, case_weights_imp), by =c(".row"="row_ind"))%>%concordance_survival(truth =mort_surv, estimate =".pred_time", case_weights =case_weights_imp)%>%pull(.estimate)}# get concordance for each foldconcordance_vec=map_dbl(.x =1:nrow(res), .f =get_concordance, preds =res, surv_df =mort_df)rm(res)# return as tibbletibble(concordance =concordance_vec, variable =var)}# all of the variables we want to use in single variable prediction - traditional demo_vars=c("age_in_years_at_screening","cat_bmi","gender","race_hispanic_origin","bin_diabetes","cat_education","chf","chd","heartattack","stroke","cancer","cat_alcohol","cat_smoke","bin_mobilityproblem","general_health_condition")# pa predictors pa_vars=df_mortality_win%>%select(contains("total"), contains("peak"))%>%colnames()vars=c(demo_vars, pa_vars)# normalize weights df=df_mortality_win%>%mutate(weight =full_sample_2_year_mec_exam_weight/2, weight_norm =weight/mean(weight))# create a survival objectsurv_df=df%>%mutate(mort_surv =Surv(event_time, mortstat))%>%mutate(case_weights_imp =hardhat::importance_weights(weight_norm))%>%mutate(row_ind =row_number())# set seed and generate 100x 10-folds set.seed(4575)folds=vfold_cv(surv_df, v =10, repeats =100)fname="metrics_wtd_100_singlevar.rds"plan(multisession, workers =ncores)# parallel setup # run on all variables results=furrr::future_map_dfr( .x =vars, .f =fit_model, spec =survreg_spec, metrics =survival_metrics, folds =folds, mort_df =surv_df, .options =furrr_options(seed =TRUE, globals =TRUE))if(!dir.exists(here::here("results"))){dir.create(here::here("results"))}# save results saveRDS(results, here::here("results", fname))
We read in the results, and make a table and figure ranking the variables in terms of concordance.
Code
wt_single=readRDS(here::here("results", "metrics_wtd_100_singlevar.rds"))%>%group_by(variable)%>%mutate(ind =row_number(), rep =floor((row_number()-1)/10))%>%group_by(variable, rep)%>%summarize(concordance =mean(concordance))# calculating mean concordance from each 10-fold repeat df_mortality=df_all%>%filter(num_valid_days>=3)%>%filter(age_in_years_at_screening>=50&age_in_years_at_screening<80)# data used for mortality analysis; Winsorizeddf_mortality_win=df_mortality%>%filter(if_all(.cols =c(age_in_years_at_screening, gender,race_hispanic_origin, cat_education,cat_bmi, chd, chf, heartattack, stroke, cancer,bin_diabetes, cat_alcohol, cat_smoke, bin_mobilityproblem,general_health_condition, mortstat, permth_exm),~!is.na(.x)))%>%mutate(event_time =permth_exm/12)%>%ungroup()%>%mutate(across(c(contains("total"), contains("peak")), ~winsorize(.x, quantile(.x, probs =c(0, 0.99)))))vlabs=c("Age (yrs)","Diabetes","Mobility problem","Cancer","Alcohol use","BMI Category","Education category","Smoking status","Congenital Heart Disease","Coronary Heart Failure","Gender","General health condition","Heart attack","Peak1 AC","Peak1 Actilife steps","Peak1 ADEPT steps","Peak1 log10 AC","Peak1 log10 MIMS","Peak1 Oak steps","Peak1 MIMS","Peak1 Stepcount RF steps","Peak1 Stepcount SSL steps","Peak1 Verisense rev. steps","Peak1 Verisense steps","Peak30 AC","Peak30 Actilife steps","Peak30 ADEPT steps","Peak30 log10 AC","Peak30 log10 MIMS","Peak30 Oak steps","Peak30 MIMS","Peak30 Stepcount RF steps","Peak30 Stepcount SSL steps","Peak30 Verisense rev. steps","Peak30 Verisense steps","Race/ethnicity","Stroke","AC","Actilife steps","ADEPT steps","log10 AC","log10 MIMS","Oak steps","MIMS","Stepcount RF steps","Stepcount SSL steps","Verisense rev. steps","Verisense steps")data_summary=wt_single%>%group_by(variable)%>%summarize(across(concordance, list( mean =~mean(.x), se =~sd(.x)/sqrt(n()))))%>%mutate(variable_fac =factor(variable, labels =vlabs))%>%filter(!grepl("peak", variable))df_means=df_mortality_win%>%group_by(mortstat)%>%summarize(across(c(contains("total"), contains("peak"), contains("age")), ~mean(.x)),across(c(bin_mobilityproblem, cancer, stroke, heartattack, chd, chf, bin_diabetes), ~sum(.x)/n()*100), gender =sum(gender=="Female")/n()*100)%>%pivot_longer(cols =-mortstat)%>%pivot_wider(names_from =mortstat, values_from =value, names_prefix ='died_')%>%rename(variable =name)# get the best variable based on concodancebest_var=data_summary%>%arrange(desc(concordance_mean))%>%slice(1)%>%pull(variable)best_var_vec=wt_single%>%filter(variable==best_var)%>%pull(concordance)t_tests=wt_single%>%group_by(variable)%>%filter(variable!=best_var)%>%nest()%>%mutate(t_test =map(data,~t.test(.x$concordance,best_var_vec, var.eq =FALSE, paired =FALSE, alternative ="less")), res =map(t_test, tidy))%>%unnest(res)%>%ungroup()%>%select(variable, p.value)# join t tests with summary data and make tabletab=data_summary%>%left_join(t_tests)%>%left_join(df_means)%>%arrange(desc(concordance_mean))%>%mutate( variable_fac =forcats::fct_reorder(variable_fac, concordance_mean), lb =concordance_mean-1.96*concordance_se, ub =concordance_mean+1.96*concordance_se, ci2 =sprintf("%0.3f", round(concordance_mean, 3)), ci =paste0(sprintf("%0.3f", round(concordance_mean, 3))," (",sprintf("%0.3f", round(lb, 3)),", ",sprintf("%0.3f", round(ub, 3)),")"))%>%select(variable_fac,ci2,# ci,# p.value, total_alive =died_0, total_deceased =died_1)%>%# mutate(pvalue = style_pvalue(p.value, digits = 3)) %>%# mutate(pvalue = ifelse(is.na(pvalue), "---", pvalue),# across(starts_with("mean"), ~ if_else(is.na(.x), "---", as.character(round(# .x, 0# ))))) %>%rename( Variable =variable_fac,"Mean"=ci2,# "Mean (95% CI)" = ci,# "p-value" = pvalue,"Mean value among alive"=total_alive,"Mean value among deceased"=total_deceased)%>%gt::gt()%>%gt::fmt_number(columns =starts_with("mean"), decimals =0)%>%# gt::tab_style(# style = list(cell_text(weight = "bold")),# locations = cells_body(columns = everything(), rows = p.value > 0.05 |# is.na(p.value))# ) %>%# gt::cols_hide(p.value) %>%tab_header(title ="100x 10-fold Cross-Validated Single Variable Concordance", subtitle ="")%>%# tab_footnote(footnote = "p-value for one sided t-test comparing variable to variable with higest mean concordance",# locations = cells_column_labels(columns = "p-value")) %>%cols_align(columns =everything(), align ="left"); tab
100x 10-fold Cross-Validated Single Variable Concordance
Traditional predictors & steps from stepcount RF & MIMS
We use the same 100-times repeated 10-fold cross validation as before. Below is the code to run the multivariable models for a large combination of models; again, we prerun the code but include below.