NHANES Steps Vignette

Code
# load packages and define colors 
library(tidyverse)
library(gtsummary)
library(gt)
library(tidymodels)
library(censored)
library(paletteer)
library(survey)
library(patchwork)
library(consort)
library(data.table)
col1 = "#CC5151"; col2 = "#422CB2" # from paletteer_d("colorBlindness::SteppedSequential5Steps")
# paletteer_d("ggthemes::colorblind") # for colorblind friendly palette
col_vec = c("#000000FF", "#009E73FF", "#CC79A7FF", "#E69F00FF", "#D55E00FF", "#56B4E9FF", "#0072B2FF")

winsorize = function (x, val = quantile(x, probs = c(0.05, 0.95), na.rm = FALSE)) {
  x[x < val[1L]] <- val[1L]
  x[x > val[2L]] <- val[2L]
  return(x)
}

The purpose of this vignette is to accompany xxx manuscript on step counts in NHANES 2011-2014.

Background

NHANES is a nationally representative study of about 5,000 Americans per year. The study includes demographic, socioeconomic, dietary, and health-related questions; data from the survey are publicly available. All NHANES study participants age six and older (2011-12, Wave G) or age three or older (2013-14, Wave H) were asked to wear an ActiGraph GT3X+ on the non-dominant wrist starting on the day of their exam at the NHANES Mobile Examination Center. They were instructed to wear the device at all times for seven consecutive days and remove the device on the morning of the ninth day. The devices collected triaxial acceleration at 80 Hz.

In 2022, the raw accelerometer data were released. We downloaded the raw data and applied four open source and one proprietary step count algorithm to the raw data. The algorithms are:

  • Actilife (proprietary)
  • ADEPT1
  • Oak2
  • Verisense (original and revised)3
  • stepcount (random forest and self-supervised learner)4

We also obtained ActiGraph Activity Counts (AC) and used Monitor Independent Movement Summaries (MIMS) provided by NHANES.

Our pipeline for processing the raw data and obtaining minute-level step counts can be found at https://github.com/muschellij2/nhanes_80hz. The minute level data, in 1440 format, are available for download on Physionet (include link).

In this vignette, we start with data that has been summarized at the subject level, and investigate:

  1. Distributions of steps by age
  2. Correlation between step algorithms and other PA summaries
  3. Performance of step counting algorithms in single- and multivariable mortality prediction
Note

The github repo accompanying this analysis can be found at https://github.com/lilykoff/nhanes_steps_mortality

Data processing

Briefly, the process from minute-level data to subject-level data is as follows:

  • For each step algorithm or PA variable, create day summaries by summing values across any minute of the day that does not have any data quality flags and is classified by NHANES as wake wear, sleep wear, or unknown
  • For each subject, take mean of day summaries across all valid days, where valid days are defined as days that meet the following criteria:
    • At least 1,368 minutes (95% of a full day) were classified as wake wear, sleep wear, or unknown and did not have any data quality flags
    • At least seven hours (420 minutes) of the day were classified as wake wear
    • At least seven hours (420 minutes) had non-zero MIMS

Data overview

Code
# read in data 
df_all = readRDS(here::here("data", "covariates_accel_mortality_df.rds"))
df_all %>% 
  glimpse() 
Rows: 19,931
Columns: 85
$ SEQN                                    <chr> "62161", "62162", "62163", "62…
$ data_release_cycle                      <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, …
$ interview_examination_status            <fct> Both interviewed and MEC exami…
$ gender                                  <fct> Male, Female, Male, Female, Fe…
$ age_in_years_at_screening               <dbl> 22, 3, 14, 44, 14, 9, 0, 6, 21…
$ age_in_months_at_screening_0_to_24_mos  <dbl> NA, NA, NA, NA, NA, NA, 11, NA…
$ race_hispanic_origin                    <fct> Non-Hispanic White, Mexican Am…
$ six_month_time_period                   <fct> May 1 through October 31, Nove…
$ education_level_adults_20               <fct> High school graduate/GED or eq…
$ marital_status                          <fct> Never married, NA, NA, Married…
$ full_sample_2_year_interview_weight     <dbl> 102641.406, 15457.737, 7397.68…
$ full_sample_2_year_mec_exam_weight      <dbl> 104236.583, 16116.354, 7869.48…
$ masked_variance_pseudo_psu              <dbl> 1, 3, 3, 1, 2, 1, 2, 2, 1, 3, …
$ masked_variance_pseudo_stratum          <dbl> 91, 92, 90, 94, 90, 91, 92, 10…
$ annual_household_income                 <fct> "$75,000 to $99,999", "$15,000…
$ weight_kg                               <dbl> 69.2, 12.7, 49.4, 67.2, 69.1, …
$ standing_height_cm                      <dbl> 172.3, 94.7, 168.9, 170.1, 159…
$ body_mass_index_kg_m_2                  <dbl> 23.3, 14.2, 17.3, 23.2, 27.2, …
$ overweight                              <dbl> 0, NA, NA, 0, NA, NA, NA, NA, …
$ diabetes                                <fct> No, No, No, No, No, No, NA, No…
$ bin_diabetes                            <dbl> 0, 0, 0, 0, 0, 0, NA, 0, 0, 0,…
$ arthritis                               <dbl> 0, NA, NA, 0, NA, NA, NA, NA, …
$ chf                                     <dbl> 0, NA, NA, 0, NA, NA, NA, NA, …
$ chd                                     <dbl> 0, NA, NA, 0, NA, NA, NA, NA, …
$ angina                                  <dbl> 0, NA, NA, 0, NA, NA, NA, NA, …
$ heartattack                             <dbl> 0, NA, NA, 0, NA, NA, NA, NA, …
$ stroke                                  <dbl> 0, NA, NA, 0, NA, NA, NA, NA, …
$ cancer                                  <dbl> 0, NA, NA, 0, NA, NA, NA, NA, …
$ days_physically_active_at_least_60_min  <dbl> NA, 7, NA, NA, NA, 7, NA, 2, N…
$ minutes_sedentary_activity              <dbl> 300, NA, 720, 300, 600, NA, NA…
$ minutes_walk_bicycle_for_transportation <dbl> NA, NA, 60, NA, 45, NA, NA, NA…
$ any_physical_activities_past_7_days     <dbl> NA, NA, NA, NA, NA, NA, NA, NA…
$ need_special_equipment_to_walk          <dbl> 0, NA, NA, 0, NA, NA, NA, NA, …
$ cat_alcohol                             <fct> Never drinker, Missing alcohol…
$ cat_bmi                                 <fct> Normal, Underweight, Underweig…
$ cat_smoke                               <fct> Never smoker, NA, NA, Never sm…
$ bin_mobilityproblem                     <dbl> 0, NA, NA, 0, NA, NA, NA, NA, …
$ general_health_condition                <fct> Good, Excellent, Good, Excelle…
$ eligstat                                <int> 1, 2, 2, 1, 2, 2, 2, 2, 1, 2, …
$ mortstat                                <dbl> 0, NA, NA, 0, NA, NA, NA, NA, …
$ ucod_leading                            <int> NA, NA, NA, NA, NA, NA, NA, NA…
$ cod_diabetes                            <int> NA, NA, NA, NA, NA, NA, NA, NA…
$ cod_hyperten                            <int> NA, NA, NA, NA, NA, NA, NA, NA…
$ permth_int                              <int> 93, NA, NA, 90, NA, NA, NA, NA…
$ permth_exm                              <int> 92, NA, NA, 89, NA, NA, NA, NA…
$ cat_eligibility                         <chr> "Eligible", "Under age 18, not…
$ cat_cod                                 <chr> NA, NA, NA, NA, NA, NA, NA, NA…
$ peak1_AC                                <dbl> 14064.947, NA, 22163.589, 1277…
$ peak30_AC                               <dbl> 10180.908, NA, 14438.444, 9783…
$ peak1_actisteps                         <dbl> 85.00000, NA, 71.42857, 68.571…
$ peak30_actisteps                        <dbl> 55.81111, NA, 54.21429, 46.566…
$ peak1_adeptsteps                        <dbl> 80.38455, NA, 48.27829, 76.579…
$ peak30_adeptsteps                       <dbl> 37.409279, NA, 22.057184, 30.1…
$ peak1_log10AC                           <dbl> 4.141743, NA, 4.340481, 4.1055…
$ peak30_log10AC                          <dbl> 3.999456, NA, 4.142875, 3.9861…
$ peak1_log10PAXMTSM                      <dbl> 1.805037, NA, 2.018282, 1.8105…
$ peak30_log10PAXMTSM                     <dbl> 1.665479, NA, 1.804895, 1.6676…
$ peak1_oaksteps                          <dbl> 105.46667, NA, 99.92143, 98.89…
$ peak30_oaksteps                         <dbl> 72.19778, NA, 77.21238, 68.572…
$ peak1_PAXMTSM                           <dbl> 63.99733, NA, 105.37614, 63.85…
$ peak30_PAXMTSM                          <dbl> 46.03225, NA, 65.27048, 46.170…
$ peak1_scrfsteps                         <dbl> 110.16667, NA, 107.57143, 102.…
$ peak30_scrfsteps                        <dbl> 86.10556, NA, 92.83333, 83.619…
$ peak1_scsslsteps                        <dbl> 115.33333, NA, 109.85714, 102.…
$ peak30_scsslsteps                       <dbl> 86.93333, NA, 85.70952, 75.509…
$ peak1_vsrevsteps                        <dbl> 102.16667, NA, 87.71429, 86.57…
$ peak30_vsrevsteps                       <dbl> 69.49444, NA, 69.44762, 57.400…
$ peak1_vssteps                           <dbl> 94.00000, NA, 75.85714, 74.571…
$ peak30_vssteps                          <dbl> 57.57222, NA, 55.44762, 47.504…
$ total_AC                                <dbl> 2484713, NA, 2994159, 2536749,…
$ total_actisteps                         <dbl> 11515.667, NA, 11744.857, 1152…
$ total_adeptsteps                        <dbl> 3219.4974, NA, 1334.1346, 2148…
$ total_log10AC                           <dbl> 2781.876, NA, 3074.808, 3110.3…
$ total_log10PAXMTSM                      <dbl> 945.7735, NA, 1048.7594, 1030.…
$ total_oaksteps                          <dbl> 12493.367, NA, 12700.121, 1117…
$ total_PAXMTSM                           <dbl> 13194.086, NA, 15756.931, 1341…
$ total_scrfsteps                         <dbl> 11235.833, NA, 12292.571, 8737…
$ total_scsslsteps                        <dbl> 9746.833, NA, 8257.000, 6576.5…
$ total_vsrevsteps                        <dbl> 10336.8333, NA, 10931.5714, 75…
$ total_vssteps                           <dbl> 9978.833, NA, 9039.857, 7692.1…
$ num_valid_days                          <int> 6, NA, 7, 7, 6, 3, NA, 7, 7, 7…
$ has_accel                               <lgl> TRUE, FALSE, TRUE, TRUE, TRUE,…
$ valid_accel                             <lgl> TRUE, FALSE, TRUE, TRUE, TRUE,…
$ inclusion_type                          <fct> >= 3 valid days, Didn't receiv…
$ cat_education                           <fct> HS/HS equivalent, NA, NA, More…

The entire dataset consists of 19,931 individuals. The varaibles that start with total or peak denote accelerometery-derived physical activity variables; total is mean of the daily total values of that variable, while peak1 is the mean of the highest value obtained in each day, and peak30 is the mean of the mean of the 30 highest values obtained in each day.

However, not all of these individuals received accelerometers, had valid accelerometer data, or were 18 years and older. Thus, we will exclude some of these individuals from our analysis. We can generate a CONSORT diagram to summarize the number of individuals excluded for various reasons.

Code
# consort diagram
df_all_temp =
  df_all %>%
  mutate(missing_covar = if_any(.cols = c(age_in_years_at_screening, gender,
                                          race_hispanic_origin, cat_education,
                                          cat_bmi, chd, chf, heartattack, stroke, cancer,
                                          bin_diabetes, cat_alcohol, cat_smoke, bin_mobilityproblem,
                                          general_health_condition, mortstat, permth_exm),
                                ~is.na(.x)))

r2 = data.table(
  id = df_all_temp$SEQN,
  wave = df_all_temp$data_release_cycle,
  age = df_all_temp$age_in_years_at_screening,
  has_accel = df_all_temp$has_accel,
  num_valid_days = df_all_temp$num_valid_days,
  valid_accel = df_all_temp$valid_accel,
  inclusion_type = df_all_temp$inclusion_type,
  missing_covars = df_all_temp$missing_covar
) %>%
  mutate(
    wave = if_else(wave == 7, "2011-2012", "2013-2014"),
    no_accel = as.factor(if_else(!has_accel, "Did not receive accelerometer", NA_character_)),
    invalid_accel =
           case_when(num_valid_days == 0 ~ "Zero valid days",
                     num_valid_days < 3 ~ "Less than 3 valid days",
                     TRUE ~ NA_character_),
         younger_18 = if_else(valid_accel & age < 18, "Age < 18", NA_character_),
         younger_50 = age < 50,
         older_79 = age > 79,
         accel_analysis = valid_accel & age > 18,
         mort_exc =
           case_when(valid_accel & age < 50 & age > 18 ~ "Younger than 50",
                     valid_accel & age >= 80 ~ "Older than 79",
                     valid_accel & missing_covars ~ "Missing covariates",
                     TRUE ~ NA_character_),
         mort_analysis = valid_accel & age >= 50 & age < 80 & !missing_covars)

consort_plot(r2,
             orders = c(id      = 'NHANES Study Population',
                        wave = "",
                        no_accel    = "Excluded",
                        has_accel    = 'Received accelerometer',
                        invalid_accel = 'Excluded: insufficient valid wear time',
                        valid_accel = 'Valid accelerometry',
                        younger_18 = "Excluded",
                        accel_analysis = "Accelerometry analysis\nassessed",
                        mort_exc = 'Excluded sequentially',
                        mort_analysis = 'Mortality analysis\nassessed'),
             side_box = c('no_accel', 'invalid_accel', 'younger_18', 'mort_exc'),
             allocation = 'wave')

Population characteristics

For the remainder of the analysis, we exclude individuals younger than 18 and individuals without at least 3 days of valid accelerometry data.

Code
df_accel =
  df_all %>%
  filter(num_valid_days >= 3 & age_in_years_at_screening >= 18)

We make a survey-weighted table of population characteristics for these individuals:

Code
labs = c(
  "SEQN",
  "Data release cycle",
  "Interview Examination Status",
  "Sex",
  "Age (yrs)",
  "Age (mos)",
  "Race/ethnicity",
  "Six month time period",
  "Educ. level adults",
  "Marital status",
  "2 yr int weight",
  "2 yr exam weight",
  "Pseudo PSU",
  "Psueudo stratum",
  "Annual HH income",
  "Weight (kg)",
  "Height (cm)",
  "BMI (kg/m2)",
  "Overweight",
  "Diabetes orig.",
  "Diabetes",
  "Arthritis",
  "Coronary Heart Failure",
  "Congenital Heart Disease",
  "Angina",
  "Heart attack",
  "Stroke",
  "Cancer",
  colnames(df_all)[29:33],
  "Alcohol use",
  "BMI Category",
  "Smoking status",
  "Mobility problem",
  "General health condition",
  "Eligbility",
  "Died by 5 years follow up",
  "COD",
  "COD Diabetes",
  "COD Hypertension",
  "Person-months follow up from interview",
  "Person-months follow up from exam",
  "Eligibility category",
  "COD category",
  "AC",
  "MIMS",
  "Actilife steps",
  "ADEPT steps",
  "log10 AC",
  "log10 MIMS",
  "Oak steps",
  "Sc. RF steps",
  "Sc. SSL steps",
  "Verisense rev. steps",
  "Verisense steps",
  "Peak1 AC",
  "Peak1 MIMS",
  "Peak1 Actilife steps",
  "Peak1 ADEPT steps",
  "Peak1 log10 AC",
  "Peak1 log10 MIMS",
  "Peak1 Oak steps",
  "Peak1 Sc. RF steps",
  "Peak1 Sc. SSL steps",
  "Peak1 Verisense rev. steps",
  "Peak1 Verisense steps",
  "Peak30 AC",
  "Peak30 MIMS",
  "Peak30 Actilife steps",
  "Peak30 ADEPT steps",
  "Peak30 log10 AC",
  "Peak30 log10 MIMS",
  "Peak30 Oak steps",
  "Peak30 Sc. RF steps",
  "Peak30 Sc. SSL steps",
  "Peak30 Verisense rev. steps",
  "Peak30 Verisense steps",
  "No. valid days",
  "Received accelerometer",
  "Valid accelerometry",
  "Inclusion category",
  "Education level"
)
names(labs) = colnames(df_accel)

df_tab1 =
  df_all %>%
  labelled::set_variable_labels(!!!labs)

# survey weighted table
# question about this!!
df_svy =
  df_tab1 %>%
  filter(has_accel) %>%
  filter(valid_accel) %>%
  select(
    gender,
    age_in_years_at_screening,
    race_hispanic_origin,
    cat_education,
    cat_bmi,
    bin_diabetes,
    chf,
    chd,
    stroke,
    cat_alcohol,
    cat_smoke,
    bin_mobilityproblem,
    general_health_condition,
    mortstat,
    data_release_cycle,
    masked_variance_pseudo_psu,
    masked_variance_pseudo_stratum,
    full_sample_2_year_mec_exam_weight
  )  %>%
  mutate(WTMEC4YR = full_sample_2_year_mec_exam_weight / 2,
         WTMEC4YR_norm = WTMEC4YR / mean(WTMEC4YR, na.rm = TRUE)) %>%
  select(-full_sample_2_year_mec_exam_weight) %>%
  svydesign(
    ids = ~ masked_variance_pseudo_psu,
    weights = ~ WTMEC4YR_norm,
    strata = ~ masked_variance_pseudo_stratum,
    data = .,
    nest = TRUE
  )

# survey weighted table
tab = df_svy %>%
  tbl_svysummary(
    by = data_release_cycle,
    include = -c(
      masked_variance_pseudo_psu,
      masked_variance_pseudo_stratum,
      WTMEC4YR,
      WTMEC4YR_norm
    ),
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 2,
    missing_text = "Missing",
  ) %>%
  add_overall() %>%
  modify_caption("Demographic Characteristics, All Adults") 
tab
Demographic Characteristics, All Adults
Characteristic Overall
N = 12,5171
7
N = 6,2341
8
N = 6,2831
Sex


    Female 6,508 (52%) 3,239 (52%) 3,269 (52%)
    Male 6,009 (48%) 2,995 (48%) 3,014 (48%)
Age (yrs) 40.99 (21.49) 41.47 (21.07) 40.53 (21.89)
Race/ethnicity


    Non-Hispanic White 8,013 (64%) 4,034 (65%) 3,979 (63%)
    Non-Hispanic Black 1,484 (12%) 751 (12%) 734 (12%)
    Other Race - Including Multi-Rac 972 (7.8%) 457 (7.3%) 515 (8.2%)
    Mexican American 1,257 (10%) 571 (9.2%) 686 (11%)
    Other Hispanic 790 (6.3%) 421 (6.7%) 370 (5.9%)
Education level


    More than HS 6,193 (63%) 3,125 (63%) 3,068 (63%)
    Less than HS 1,560 (16%) 824 (17%) 736 (15%)
    HS/HS equivalent 2,098 (21%) 1,029 (21%) 1,069 (22%)
    Missing 2,666 1,256 1,410
BMI Category


    Normal 3,743 (30%) 1,920 (31%) 1,823 (29%)
    Obese 3,891 (31%) 1,887 (31%) 2,004 (32%)
    Overweight 3,549 (29%) 1,833 (30%) 1,715 (27%)
    Underweight 1,240 (10.0%) 540 (8.7%) 700 (11%)
    Missing 94 54 41
Diabetes 1,052 (8.4%) 508 (8.2%) 543 (8.6%)
    Missing 7 4 3
Coronary Heart Failure 290 (2.9%) 158 (3.2%) 132 (2.7%)
    Missing 2,668 1,260 1,409
Congenital Heart Disease 371 (3.8%) 170 (3.4%) 201 (4.1%)
    Missing 2,684 1,267 1,417
Stroke 309 (3.1%) 164 (3.3%) 145 (3.0%)
    Missing 2,665 1,256 1,408
Alcohol use


    Never drinker 1,240 (9.9%) 559 (9.0%) 681 (11%)
    Former drinker 1,446 (12%) 726 (12%) 720 (11%)
    Moderate drinker 3,117 (25%) 1,644 (26%) 1,473 (23%)
    Heavy drinker 785 (6.3%) 439 (7.0%) 347 (5.5%)
    Missing alcohol 5,928 (47%) 2,866 (46%) 3,062 (49%)
Smoking status


    Never smoker 5,656 (56%) 2,758 (55%) 2,898 (57%)
    Former smoker 2,449 (24%) 1,256 (25%) 1,192 (24%)
    Current smoker 1,912 (19%) 962 (19%) 950 (19%)
    Missing 2,501 1,259 1,242
Mobility problem 1,656 (17%) 749 (15%) 907 (19%)
    Missing 2,668 1,256 1,413
General health condition


    Poor 282 (2.3%) 127 (2.0%) 155 (2.5%)
    Fair 1,684 (13%) 789 (13%) 895 (14%)
    Good 4,582 (37%) 2,240 (36%) 2,343 (37%)
    Very good 4,055 (32%) 2,104 (34%) 1,951 (31%)
    Excellent 1,913 (15%) 974 (16%) 939 (15%)
Died by 5 years follow up 761 (7.5%) 416 (8.1%) 345 (6.8%)
    Missing 2,370 1,125 1,245
1 n (%); Mean (SD)

Mean physical activity variables

We can also make a table of the mean physical activity variables by wave and age group.

Code
df_adult = df_accel %>%
  filter(age_in_years_at_screening >= 18) %>%
  mutate(weight = full_sample_2_year_mec_exam_weight / 2,
         weight_norm = weight / mean(weight)) %>%
  mutate(group = "All adults")

df_old = df_accel %>%
  filter(age_in_years_at_screening >= 50) %>%
  mutate(weight = full_sample_2_year_mec_exam_weight / 2,
         weight_norm = weight / mean(weight)) %>%
  mutate(group = "50+")


df = bind_rows(df_adult, df_old)

df = df %>%
  rename(
    Verisense = "total_vssteps",
    "Verisense rev" = "total_vsrevsteps",
    Oak = "total_oaksteps",
    ADEPT = "total_adeptsteps",
    "Stepcount RF" =  "total_scrfsteps",
    "Stepcount SSL" =  "total_scsslsteps",
    MIMS = "total_PAXMTSM",
    Actilife = "total_actisteps",
    "log10 MIMS"  = "total_log10PAXMTSM",
    "log10 AC" = "total_log10AC",
    AC = "total_AC"
  ) %>%
  select(!contains("peak"))

df_analysis_svy = survey::svydesign(
  id = ~ masked_variance_pseudo_psu,
  strata = ~ masked_variance_pseudo_stratum,
  weights = ~ weight_norm,
  data = df,
  nest = TRUE
)


tab = df_analysis_svy %>%
  tbl_strata(
    strata = data_release_cycle,
    .tbl_fun =
      ~ .x %>%
      tbl_svysummary(
        .,
        include = c(
          contains("Actilife"),
          contains("ADEPT"),
          contains("Oak"),
          contains("rf"),
          contains("ssl"),
          "Verisense",
          "Verisense rev",
          "MIMS",
          "log10 MIMS",
          "AC",
          "log10 AC"
        ),
        by = group,
        statistic = list(
          all_continuous() ~ "{mean} ({sd})",
          all_categorical() ~ "{n} ({p}%)"
        )
      ),
    .header = "**Wave {strata}**, N = {n}"
  ) %>%
  modify_caption("Physical Activity Mean Totals Stratified by Age and Wave")
tab
Physical Activity Mean Totals Stratified by Age and Wave
Characteristic
Wave 7, N = 6513
Wave 8, N = 6404
50+
N = 2,1461
All adults
N = 4,3671
50+
N = 2,1071
All adults
N = 4,2971
Actilife 11,195 (4,001) 12,169 (3,995) 10,850 (4,008) 11,902 (4,048)
ADEPT 2,342 (1,593) 2,659 (1,569) 2,151 (1,680) 2,453 (1,575)
Oak 10,254 (5,027) 11,794 (5,061) 9,733 (4,886) 11,381 (5,065)
Stepcount RF 9,888 (5,542) 11,509 (5,722) 9,502 (5,442) 11,263 (5,764)
Stepcount SSL 8,358 (4,402) 9,144 (4,400) 8,027 (4,388) 8,846 (4,399)
Verisense 8,337 (4,027) 9,497 (4,062) 7,974 (4,019) 9,163 (4,065)
Verisense rev 7,532 (4,521) 9,102 (4,756) 7,122 (4,402) 8,725 (4,730)
MIMS 12,435 (3,642) 13,572 (3,758) 12,243 (3,574) 13,467 (3,784)
log10 MIMS 960 (158) 998 (154) 952 (160) 994 (155)
AC 2,333,927 (791,711) 2,573,262 (815,192) 2,291,942 (766,303) 2,549,846 (816,305)
log10 AC 2,878 (383) 2,955 (366) 2,847 (394) 2,933 (370)
1 Mean (SD)

Physical activity variables

Distributions

We plot the distributions of each step and physical activity variable. We plot the 99th percentile with a vertical line to indicate the effect of Winsorizing at the 99th percentile. We choose to use the Winsorized variables in the mortality analysis since the raw distributions are all quite right-skewed.

Code
df_accel_win =
  df_accel %>%
  mutate(across(c(contains("total"), contains("peak")), ~winsorize(.x, quantile(.x, probs = c(0, 0.99)))))


q99 = 
  df_accel %>%
  summarize(across(c(contains("total")), ~quantile(.x, 0.99))) %>% 
  pivot_longer(cols = contains("total"), values_to = "q")

df_accel %>%
  select(contains("steps") & contains("total"), SEQN) %>%
  pivot_longer(cols = -c(SEQN)) %>%
  left_join(q99, by = "name") %>%
  mutate(name = factor(name, labels = c("Actilife steps", "ADEPT", "Oak", "Stepcount RF", "Stepcount SSL", "Verisense", "Verisense rev."))) %>%
  ggplot(aes(x = value / 1000))+
  geom_density(fill = col1, color = col1) +
  facet_wrap(.~name, nrow = 1)+
  geom_vline(aes(xintercept = q/1000))+
  theme_bw() +
  theme(legend.position = "none",
        panel.grid.minor.x = element_blank(),
        axis.title = element_text(size = 14),
        title = element_text(size = 14),
        axis.text = element_text(size = 12),
        strip.text = element_text(size = 12))+
  labs(x = "Mean Daily Steps x 1000", y = "Density", title = "Distribution of Mean Daily Step Counts",
       subtitle = "All Individuals aged 18+ with Valid Accelerometry Data")

Code
df_accel %>%
  select(!contains("steps") & contains("total"), SEQN) %>%
  pivot_longer(cols = -c(SEQN)) %>%
  left_join(q99, by = "name") %>%
   mutate(name = factor(name, labels = c("AC", "log10 AC", "MIMS", "log10 MIMS"))) %>%
  ggplot(aes(x = value / 1000))+
  geom_density(fill = col2, color = col2) +
  facet_wrap(.~name, nrow = 1, scales = "free")+
  geom_vline(aes(xintercept = q/1000))+
  theme_bw() +
  theme(legend.position = "none",
        panel.grid.minor.x = element_blank(),
        axis.title = element_text(size = 14),
        title = element_text(size = 14),
        axis.text = element_text(size = 12),
        strip.text = element_text(size = 12)) +
  labs(x = "Mean Daily Value x 1000", y = "Density", title = "Distribution of Mean Daily  Counts",
       subtitle = "All Individuals aged 18+ with Valid Accelerometry Data")

Means by age

Next we plot the survey-weighted means by age for each algorithm. First we calculate the survey-weghted means, then smooth the means and confidence intervals with a loess smooth.

Code
survey_design =
  df_accel %>%
  filter(age_in_years_at_screening >= 18 & age_in_years_at_screening < 80) %>%
  mutate(weight = full_sample_2_year_mec_exam_weight / 2,
         weight_norm = weight / mean(weight)) %>%
  ungroup()  %>%
  survey::svydesign(
  id = ~ masked_variance_pseudo_psu,
  strata = ~ masked_variance_pseudo_stratum,
  weights = ~ weight_norm,
  data = .,
  nest = TRUE
)

# function to calculate mean estimates by age
calc_by_age =
  function(var, df) {
    # var = "total_oaksteps"
    formula = as.formula(paste("~", var))
    total_by_age_gender = svyby(formula,
                                ~ age_in_years_at_screening,
                                df,
                                svymean) %>%
      rename(mean = contains(var)) %>%
      mutate(metric = var)
  }

means_df =
  map_dfr(.x = df_accel %>% select(contains("total") | contains("peak")) %>% colnames(),
          .f = calc_by_age, df = survey_design)

# get smooths for means and confidence intervals
models = means_df %>%
  mutate(lb = mean - 1.96 * se,
         ub = mean + 1.96 * se) %>%
  tidyr::nest(data = -c(metric)) %>%
  dplyr::mutate(
    # Perform loess calculation on each group
    m = purrr::map(data, loess,
                   formula = mean ~ age_in_years_at_screening, span = .75),
    # Retrieve the fitted values from each model
    fitted_mean = purrr::map(m, `[[`, "fitted"),
    l = purrr::map(data, loess,
                   formula = lb ~ age_in_years_at_screening, span = .75),
    # Retrieve the fitted values from each model
    fitted_lb = purrr::map(l, `[[`, "fitted"),
    u = purrr::map(data, loess,
                   formula = ub ~ age_in_years_at_screening, span = .75),
    # Retrieve the fitted values from each model
    fitted_ub = purrr::map(u, `[[`, "fitted")
  )

# Apply fitted y's as a new column
results = models %>%
  dplyr::select(-m, -l, -u) %>%
  tidyr::unnest(cols = c(data, contains("fitted")))

Now we can plot the means by age for both totals and peak variables:

Code
# name color vector as desired
names(col_vec) = c(
  "Actilife",
  "ADEPT",
  "Oak",
  "Stepcount RF",
  "Stepcount SSL",
  "Verisense",
  "Verisense rev."
)
results %>%
  filter(grepl("step", metric) & grepl("total", metric)) %>%
  mutate(across(contains("fitted"), ~ .x / 1000),
         metric = factor(
           metric,
           levels = c(
             "total_actisteps",
             "total_adeptsteps",
             "total_oaksteps",
             "total_scrfsteps",
             "total_scsslsteps",
             "total_vssteps",
             "total_vsrevsteps"
           ),
           labels = c(
             "Actilife",
             "ADEPT",
             "Oak",
             "Stepcount RF",
             "Stepcount SSL",
             "Verisense",
             "Verisense rev."
           )
         )) %>%
  ggplot(
    aes(
      x = age_in_years_at_screening,
      y = fitted_mean,
      ymin = fitted_lb,
      ymax = fitted_ub,
      color = metric,
      fill = metric
    )
  ) +
  facet_grid(. ~ metric) +
  geom_line(linewidth = 1) +
  geom_ribbon(alpha = .2, linetype = 0) +
  scale_fill_manual(values = col_vec, name = "Algorithm") +
  theme_light() +
  scale_color_manual(values = col_vec, name = "Algorithm") +
  geom_hline(aes(yintercept = 10),
             col = "darkgrey",
             linetype = "dashed") +
  theme(
    legend.position = c(0.65, 0.15),
    legend.title = element_blank(),
    strip.text = element_text(size = 12),
    axis.title = element_text(size = 14),
    title = element_text(size = 14),
    axis.text = element_text(size = 12),
    panel.grid.minor = element_blank(),
    legend.text = element_text(size = 12)
  ) +
  labs(x = "Age (years)", y = "Mean Daily Steps (x1000)", title = "Smoothed Survey Weighted Mean Daily Steps by Age ") +
  scale_y_continuous(breaks = seq(0, 16, 1)) +
  scale_x_continuous(breaks = seq(20, 80, 20)) + 
  guides(fill = guide_legend(nrow = 2, byrow = FALSE),
         color = guide_legend(nrow = 2, byrow = FALSE))

Code
 results %>%
  filter(grepl("peak", metric) & grepl("step", metric)) %>%
  mutate(
    type = sub("_.*", "", metric),
    method = sub("^[^_]*_", "", metric),
    type = factor(type, labels = c("1-minute", "30-minute"))
  ) %>%
  filter(grepl("step", metric)) %>%
  mutate(method = factor(
    method,
    levels = c(
      "actisteps",
      "adeptsteps",
      "oaksteps",
      "scrfsteps",
      "scsslsteps",
      "vssteps",
      "vsrevsteps"
    ),
    labels = c(
      "Actilife",
      "ADEPT",
      "Oak",
      "Stepcount RF",
      "Stepcount SSL",
      "Verisense",
      "Verisense rev."
    )
  )) %>%
  ggplot(
    aes(
      x = age_in_years_at_screening,
      y = fitted_mean,
      ymin = fitted_lb,
      ymax = fitted_ub,
      fill = method,
      color = method
    )
  ) +
  facet_grid(. ~ type) +
  geom_line() +
  geom_ribbon(alpha = 0.2, color = NA) +
  scale_fill_manual(values = col_vec, name = "Algorithm") +
  scale_color_manual(values = col_vec, name = "Algorithm") +
  theme_light() +
  theme(
    legend.title = element_blank(),
    strip.text =  element_text(size = 12),
    axis.title = element_text(size = 14),
    title = element_text(size = 14),
    axis.text = element_text(size = 12),
    panel.grid.minor = element_blank(),
    legend.text = element_text(size = 12),
    legend.position = c(0.3, 0.1)
  ) +
  labs(x = "Age (years)", y = "Peak Cadence (steps/min)", title = "Smoothed Survey Weighted Peak Cadence by Age") +
  scale_y_continuous(breaks = seq(20, 120, 10)) +
  scale_x_continuous(breaks = seq(20, 80, 10)) + 
  guides(fill = guide_legend(nrow = 2, byrow = FALSE),
         color = guide_legend(nrow = 2, byrow = FALSE))

Finally, we can calculate the % change per year in the smoothed means and plot:

Code
results %>%
  filter(grepl("step", metric) & grepl("total", metric)) %>%
  mutate(metric = factor(
    metric,
    levels = c(
      "total_actisteps",
      "total_adeptsteps",
      "total_oaksteps",
      "total_scrfsteps",
      "total_scsslsteps",
      "total_vssteps",
      "total_vsrevsteps"
    ),
    labels = c(
      "Actilife",
      "ADEPT",
      "Oak",
      "Stepcount RF",
      "Stepcount SSL",
      "Verisense",
      "Verisense rev."
    )
  )) %>%
  group_by(metric) %>%
  mutate(pct_chg = (fitted_mean - dplyr::lag(fitted_mean)) / dplyr::lag(fitted_mean) *
           100) %>%
  ggplot(aes(
    x = age_in_years_at_screening,
    y = pct_chg,
    color = metric,
    fill = metric
  )) +
  geom_line(linewidth = 1) +
  geom_ribbon(alpha = .2, linetype = 0, aes(ymax = 0, ymin = 0)) +
  scale_fill_manual(values = col_vec, name = "Algorithm") +
  theme_light() +
  scale_color_manual(values = col_vec, name = "Algorithm") +
  geom_hline(
    aes(yintercept = 0),
    col = "darkgrey",
    linetype = "dashed",
    linewidth = 1.1
  ) +
  theme_light() +
  labs(x = "Age (years)", y = "% change from previous year", title = "Survey weighted estimated per-year difference in mean daily steps") +
  theme(legend.title = element_blank(),
        legend.position = c(0.4, 0.25),
        axis.title = element_text(size = 14),
        title = element_text(size = 14),
        axis.text = element_text(size = 12),
        strip.text = element_text(size = 12),
        legend.text = element_text(size = 14)) +
  scale_x_continuous(breaks = seq(20, 80, 10), limits = c(18, 79))+
  guides(fill = guide_legend(nrow = 2, byrow = FALSE),
         color = guide_legend(nrow = 2, byrow = FALSE))

Correlations

We calculate the pairwise Spearman correlations between each step variable and physical activity variable, then plot the correlation matrix.

Code
cor_mat =
  df_accel %>% select(contains("total")) %>%
  select(
    contains("acti"),
    contains("adept"),
    contains("oak"),
    contains("sc"),
    contains("vss"),
    contains("vsr"),
    total_AC,
    total_PAXMTSM,
    total_log10AC,
    total_log10PAXMTSM
  ) %>%
  cor(., use = "complete", method = "spearman")

# correlation p value matrix
pvals = df_accel %>% select(contains("total")) %>%
  select(
    contains("acti"),
    contains("adept"),
    contains("oak"),
    contains("sc"),
    contains("vss"),
    contains("vsr"),
    total_AC,
    total_PAXMTSM,
    total_log10AC,
    total_log10PAXMTSM
  ) %>%
  rstatix::cor_pmat(.) %>%
  select(-rowname) %>%
  as.matrix()


colnames(cor_mat) = rownames(cor_mat) = colnames(pvals) = rownames(pvals) = c(
  "Actilife steps",
  "ADEPT",
  "Oak",
  "Stepcount RF",
  "Stepcount SSL",
  "Verisense",
  "Verisense rev.",
  "AC",
  "MIMS",
  "log10 AC",
  "log10 MIMS"
)

# r for box around correlation
r = c("ADEPT", "Actilife steps", "Verisense rev.", "Verisense")

corrplot::corrplot(
  cor_mat,
  method = "color",
  type = "upper",
  col = paletteer_d("colorBlindness::Blue2Orange8Steps"),
  tl.col = "black",
  tl.srt = 30,
  p.mat = pvals,
  col.lim = c(0.2, 1),
  sig.level = 0.05,
  insig = "blank",
  is.corr = FALSE,
  addgrid.col = "white",
  diag = FALSE,
  addCoef.col = 'black'
) %>%
  corrplot::corrRect(namesMat  = r)

We can also plot the mean differences (x1000).

Code
mean_mat = function(df, abs = FALSE) {
  # Ensure the input is a dataframe
  if (!is.data.frame(df)) {
    stop("Input must be a dataframe")
  }

  # Check if all columns are numeric
  if (!all(sapply(df, is.numeric))) {
    stop("All columns in the dataframe must be numeric")
  }

  # Get the number of columns
  num_cols <- ncol(df)

  # Initialize a matrix to store the mean absolute differences
  mad_matrix <- matrix(NA, nrow = num_cols, ncol = num_cols)
  rownames(mad_matrix) <- colnames(df)
  colnames(mad_matrix) <- colnames(df)

  # Calculate the mean absolute difference for each pair of columns
  for (i in 1:num_cols) {
    for (j in 1:num_cols) {
      if (i <= j) {
        # Calculate mean absolute difference, handling potential NAs
        if(abs){
          d = abs(df[, i] - df[, j])
        }
        else{
          d = df[, i] - df[, j]
        }
        # Debug output
        # print(paste("Calculating MAD for columns", colnames(df)[i], "and", colnames(df)[j]))
        # print(head(abs_diff))

        # Calculate mean
        # mad_matrix[i, j] <- mean(abs_diff %>% unlist, na.rm = TRUE)
        mad_matrix[i, j] <- mean(d %>% unlist, na.rm = TRUE)
        mad_matrix[j, i] <- mad_matrix[i, j]  # Symmetric matrix
      }
    }
  }

  return(mad_matrix)
}
diff_mat =
  df_accel %>% select(contains("total")) %>%
  select(
    contains("acti"),
    contains("adept"),
    contains("oak"),
    contains("sc"),
    contains("vss"),
    contains("vsr"),
    total_AC,
    total_PAXMTSM,
    total_log10AC,
    total_log10PAXMTSM
  ) %>%
  mean_mat()
# absolute difference matrix
abs_diff_mat =
  df_accel %>% select(contains("total")) %>%
  select(
    contains("acti"),
    contains("adept"),
    contains("oak"),
    contains("sc"),
    contains("vss"),
    contains("vsr"),
    total_AC,
    total_PAXMTSM,
    total_log10AC,
    total_log10PAXMTSM
  ) %>%
  mean_mat(., abs = TRUE)

# make NA for non-step algos
diff_mat[, c("total_AC",
             "total_log10AC",
             "total_PAXMTSM",
             "total_log10PAXMTSM")] <- NA
diff_mat[c("total_AC",
           "total_log10AC",
           "total_PAXMTSM",
           "total_log10PAXMTSM"), ] <- NA
colnames(diff_mat) = rownames(diff_mat) =  c(
  "Actilife steps",
  "ADEPT",
  "Oak",
  "Stepcount RF",
  "Stepcount SSL",
  "Verisense",
  "Verisense rev.",
  "AC",
  "MIMS",
  "log10 AC",
  "log10 MIMS"
)
diff_mat = diff_mat / 1000
corrplot::corrplot(
  diff_mat,
  method = "color",
  type = "lower",
  col = paletteer_c("grDevices::Purple-Green", 10),
  col.lim = c(-10, 10),
  tl.col = "black",
  tl.srt = 30,
  insig = "blank",
  is.corr = FALSE,
  diag = FALSE,
  # title = "Mean Difference in Steps (x1000)",
  addgrid.col = "white",
  addCoef.col = 'grey50',
  na.label = "-"
)

Mortality prediction

Single variable

To investigate the association between each step algorithm and mortality, we fit a Cox proportional hazards model with the step algorithm as the only predictor. We also fit these single variable models with traditional predictors of mortality.

Specifically, we fit the following model:

\[\text{coxph(Surv(followup\_time, mortality)} \sim \text{predictor})\]

Where \(\texttt{followup\_time}\) is the time to death or censoring, \(\texttt{mortality}\) is the binary indicator of death, and \(\texttt{predictor}\) is the step algorithm or traditional predictor. We perform 10-fold cross validation and estimate the mean concordance over the 10 folds of the model. We then repeat this process 100 times and calculate the mean concordance from the repeated models. The code below illustrates the process, but we have precomputed the results.

Code
# packages for parallel processing; not necessary but will speed up 
require(future)
require(furrr)

# make sure dataset meets criteria 
df_mortality =
  df_all %>%
  filter(valid_accel) %>% # valid accelerometry
  filter(age_in_years_at_screening >= 50 &
           age_in_years_at_screening < 80) %>%  # age criteria
  filter(if_all(.cols = c(age_in_years_at_screening, gender,
                          race_hispanic_origin, cat_education,
                          cat_bmi, chd, chf, heartattack, stroke, cancer,
                          bin_diabetes, cat_alcohol, cat_smoke, bin_mobilityproblem,
                          general_health_condition, mortstat, permth_exm, total_PAXMTSM),
                ~!is.na(.x))) %>% # no missing data
  mutate(event_time = permth_exm / 12) # event time in years = person months since exam / 12

# winsorize at 99th percentile 
df_mortality_win =
  df_mortality %>%
  ungroup() %>%
  mutate(across(c(contains("total"), contains("peak")), ~winsorize(.x, quantile(.x, probs = c(0, 0.99)))))

# concordance is survival metric 
survival_metrics = metric_set(concordance_survival)
# proportional hazards model 

survreg_spec = proportional_hazards() %>%
  set_engine("survival") %>%
  set_mode("censored regression")

# function to fit model on one variable 
fit_model = function(var, folds, spec, metrics, mort_df){
  require(tidyverse); require(tidymodels); require(censored)
  # create workflow
  wflow = workflow() %>%
    add_model(spec) %>%
    add_variables(outcomes = mort_surv,
                  predictors = all_of(var)) %>%
    add_case_weights(case_weights_imp)

  # fit model on folds
  res = fit_resamples(
    wflow,
    resamples = folds,
    metrics = metrics,
    control = control_resamples(save_pred = TRUE)
  )

  # get metrics -- for some reason if you just use collect_metrics it doesn't take into account case weights,
  # so we write this function to get concordance
  get_concordance = function(row_num, preds, surv_df){
    preds %>%
      slice(row_num) %>%
      unnest(.predictions) %>%
      select(.pred_time, .row, mort_surv) %>%
      left_join(surv_df %>% select(row_ind, case_weights_imp), by = c(".row" = "row_ind")) %>%
      concordance_survival(truth = mort_surv, estimate = ".pred_time", case_weights = case_weights_imp) %>%
      pull(.estimate)
  }

  # get concordance for each fold
  concordance_vec = map_dbl(.x = 1:nrow(res), .f = get_concordance, preds = res, surv_df = mort_df)
  rm(res)
  # return as tibble
  tibble(concordance = concordance_vec,
         variable = var)
}

# all of the variables we want to use in single variable prediction - traditional 
demo_vars = c(
  "age_in_years_at_screening",
  "cat_bmi",
  "gender",
  "race_hispanic_origin",
  "bin_diabetes",
  "cat_education",
  "chf",
  "chd",
  "heartattack",
  "stroke",
  "cancer",
  "cat_alcohol",
  "cat_smoke",
  "bin_mobilityproblem",
  "general_health_condition")

# pa predictors 
pa_vars =
  df_mortality_win %>%
  select(contains("total"), contains("peak")) %>%
  colnames()

vars = c(demo_vars, pa_vars)

# normalize weights 
df = df_mortality_win %>%
  mutate(weight = full_sample_2_year_mec_exam_weight / 2, weight_norm = weight / mean(weight))

# create a survival object
surv_df =
  df %>%
  mutate(mort_surv = Surv(event_time, mortstat)) %>%
  mutate(case_weights_imp = hardhat::importance_weights(weight_norm)) %>%
  mutate(row_ind = row_number())

# set seed and generate 100x 10-folds 
set.seed(4575)
folds = vfold_cv(surv_df, v = 10, repeats = 100)
fname = "metrics_wtd_100_singlevar.rds"

plan(multisession, workers = ncores) # parallel setup 
# run on all variables 
results =
  furrr::future_map_dfr(
    .x = vars,
    .f = fit_model,
    spec = survreg_spec,
    metrics = survival_metrics,
    folds = folds,
    mort_df = surv_df,
    .options = furrr_options(seed = TRUE, globals = TRUE)
  )

if(!dir.exists(here::here("results"))){
  dir.create(here::here("results"))
}
# save results 
saveRDS(results, here::here("results", fname))

We read in the results, and make a table and figure ranking the variables in terms of concordance.

Code
wt_single = readRDS(here::here("results", "metrics_wtd_100_singlevar.rds")) %>%
  group_by(variable) %>%
  mutate(ind = row_number(),
         rep = floor((row_number()-1)/10)) %>%
  group_by(variable, rep) %>%
  summarize(concordance = mean(concordance)) # calculating mean concordance from each 10-fold repeat 

df_mortality =
  df_all %>%
  filter(num_valid_days >= 3) %>%
  filter(age_in_years_at_screening >= 50 &
           age_in_years_at_screening < 80)


# data used for mortality analysis; Winsorized
df_mortality_win =
  df_mortality %>%
  filter(if_all(.cols = c(age_in_years_at_screening, gender,
                                  race_hispanic_origin, cat_education,
                                  cat_bmi, chd, chf, heartattack, stroke, cancer,
                                  bin_diabetes, cat_alcohol, cat_smoke, bin_mobilityproblem,
                                  general_health_condition, mortstat, permth_exm),
                ~!is.na(.x))) %>%
  mutate(event_time = permth_exm / 12) %>%
  ungroup() %>%
  mutate(across(c(contains("total"), contains("peak")), ~winsorize(.x, quantile(.x, probs = c(0, 0.99)))))

vlabs = c(
  "Age (yrs)",
  "Diabetes",
  "Mobility problem",
  "Cancer",
  "Alcohol use",
  "BMI Category",
  "Education category",
  "Smoking status",
  "Congenital Heart Disease",
  "Coronary Heart Failure",
  "Gender",
  "General health condition",
  "Heart attack",
  "Peak1 AC",
  "Peak1 Actilife steps",
  "Peak1 ADEPT steps",
  "Peak1 log10 AC",
  "Peak1 log10 MIMS",
  "Peak1 Oak steps",
  "Peak1 MIMS",
  "Peak1 Stepcount RF steps",
  "Peak1 Stepcount SSL steps",
  "Peak1 Verisense rev. steps",
  "Peak1 Verisense steps",
  "Peak30 AC",
  "Peak30 Actilife steps",
  "Peak30 ADEPT steps",
  "Peak30 log10 AC",
  "Peak30 log10 MIMS",
  "Peak30 Oak steps",
  "Peak30 MIMS",
  "Peak30 Stepcount RF steps",
  "Peak30 Stepcount SSL steps",
  "Peak30 Verisense rev. steps",
  "Peak30 Verisense steps",
  "Race/ethnicity",
  "Stroke",
  "AC",
  "Actilife steps",
  "ADEPT steps",
  "log10 AC",
  "log10 MIMS",
  "Oak steps",
  "MIMS",
  "Stepcount RF steps",
  "Stepcount SSL steps",
  "Verisense rev. steps",
  "Verisense steps"
)

data_summary = wt_single %>%
  group_by(variable) %>%
  summarize(across(concordance, list(
    mean = ~ mean(.x), se = ~ sd(.x) / sqrt(n())
  ))) %>%
  mutate(variable_fac = factor(
    variable,
    labels = vlabs
  )) %>%
  filter(!grepl("peak", variable))

df_means =
  df_mortality_win %>%
  group_by(mortstat) %>%
  summarize(across(c(contains("total"), contains("peak"), contains("age")), ~ mean(.x)),
            across(c(bin_mobilityproblem, cancer, stroke, heartattack, chd, chf, bin_diabetes), ~sum(.x)/n()*100),
            gender = sum(gender == "Female")/n()*100) %>%
  pivot_longer(cols = -mortstat) %>%
  pivot_wider(names_from = mortstat,
              values_from = value,
              names_prefix = 'died_') %>%
  rename(variable = name)

# get the best variable based on concodance
best_var = data_summary %>%
  arrange(desc(concordance_mean)) %>%
  slice(1) %>%
  pull(variable)

best_var_vec = wt_single %>%
  filter(variable == best_var) %>%
  pull(concordance)

t_tests = wt_single %>%
  group_by(variable) %>%
  filter(variable != best_var) %>%
  nest() %>%
  mutate(t_test = map(
    data,
    ~ t.test(
      .x$concordance,
      best_var_vec,
      var.eq = FALSE,
      paired = FALSE,
      alternative = "less"
    )
  ),
  res = map(t_test, tidy)) %>%
  unnest(res) %>%
  ungroup() %>%
  select(variable, p.value)

# join t tests with summary data and make table
tab = data_summary %>%
  left_join(t_tests) %>%
  left_join(df_means) %>%
  arrange(desc(concordance_mean)) %>%
  mutate(
    variable_fac = forcats::fct_reorder(variable_fac, concordance_mean),
    lb = concordance_mean - 1.96 * concordance_se,
    ub = concordance_mean + 1.96 * concordance_se,
    ci2 = sprintf("%0.3f", round(concordance_mean, 3)),
    ci = paste0(
      sprintf("%0.3f", round(concordance_mean, 3)),
      " (",
      sprintf("%0.3f", round(lb, 3)),
      ", ",
      sprintf("%0.3f", round(ub, 3)),
      ")"
    )
  ) %>%
  select(variable_fac,
         ci2,
         # ci,
         # p.value,
         total_alive = died_0,
         total_deceased = died_1) %>%
  # mutate(pvalue = style_pvalue(p.value, digits = 3)) %>%
  # mutate(pvalue = ifelse(is.na(pvalue), "---", pvalue),
  #        across(starts_with("mean"), ~ if_else(is.na(.x), "---", as.character(round(
  #          .x, 0
  #        ))))) %>%
  rename(
    Variable = variable_fac,
    "Mean" = ci2,
    # "Mean (95% CI)" = ci,
    # "p-value" = pvalue,
    "Mean value among alive" = total_alive,
    "Mean value among deceased" = total_deceased
  ) %>%
  gt::gt() %>%
  gt::fmt_number(columns = starts_with("mean"), decimals = 0) %>%
  # gt::tab_style(
  #   style = list(cell_text(weight = "bold")),
  #   locations = cells_body(columns = everything(), rows = p.value > 0.05 |
  #                            is.na(p.value))
  # ) %>%
  # gt::cols_hide(p.value) %>%
  tab_header(title = "100x 10-fold Cross-Validated Single Variable Concordance", subtitle = "") %>%
  # tab_footnote(footnote = "p-value for one sided t-test comparing variable to variable with higest mean concordance",
  #              locations = cells_column_labels(columns = "p-value")) %>%
  cols_align(columns = everything(), align = "left"); tab
100x 10-fold Cross-Validated Single Variable Concordance
Variable Mean Mean value among alive Mean value among deceased
Stepcount RF steps 0.732 10,591 6,502
ADEPT steps 0.725 2,391 1,380
Oak steps 0.723 10,845 7,067
Verisense rev. steps 0.720 8,021 4,878
Verisense steps 0.716 8,885 5,885
Actilife steps 0.710 11,790 8,740
Stepcount SSL steps 0.702 8,786 5,784
AC 0.688 2,454,243 1,900,197
MIMS 0.682 13,002 10,438
Mobility problem 0.675 26 57
Age (yrs) 0.673 62 68
General health condition 0.662 NA NA
log10 MIMS 0.647 987 884
log10 AC 0.630 2,939 2,708
Diabetes 0.594 20 37
Smoking status 0.589 NA NA
Education category 0.572 NA NA
Alcohol use 0.551 NA NA
Congenital Heart Disease 0.551 5 15
Coronary Heart Failure 0.549 3 15
Gender 0.547 53 44
Heart attack 0.546 5 16
Cancer 0.539 14 24
Stroke 0.533 5 13
BMI Category 0.524 NA NA
Race/ethnicity 0.524 NA NA
Code
var_labels =
  tibble(
    names =
      unique(wt_single$variable),
    labels = c(
      "Age at screening",
      "Diabetes",
      "Mobility problem",
      "Cancer",
      "Alcohol use",
      "BMI Category",
      "Education level",
      "Smoking status",
      "CHF",
      "CHD",
      "Gender",
      "Self-reported health",
      "Heart attack",
      "Peak1 AC",
      "Peak1 MIMS",
      "Peak1 Actilife steps",
      "Peak1 ADEPT steps",
      "Peak1 log10 AC",
      "Peak1 log10 MIMS",
      "Peak1 Oak steps",
      "Peak1 Stepcount RF steps",
      "Peak1 Stepcount SSL steps",
      "Peak1 Verisense rev. steps",
      "Peak1 Verisense steps",
      "Peak30 AC",
      "Peak30 MIMS",
      "Peak30 Actilife steps",
      "Peak30 ADEPT steps",
      "Peak30 log10 AC",
      "Peak30 log10 MIMS",
      "Peak30 Oak steps",
      "Peak30 Stepcount RF steps",
      "Peak30 Stepcount SSL steps",
      "Peak30 Verisense rev. steps",
      "Peak30 Verisense steps",
      "Race/ethnicity",
      "Stroke",
      "AC",
      "MIMS",
      "Actilife steps",
      "ADEPT steps",
      "log10 AC",
      "log10 MIMS",
      "Oak steps",
      "Stepcount RF steps",
      "Stepcount SSL steps",
      "Verisense rev. steps",
      "Verisense steps"
    )
  )


wt_single %>%
  filter(!grepl("peak", variable)) %>%
  mutate(var_group = case_when(
    grepl("steps", variable) ~ "Step variable",
    grepl("total", variable) ~ "Non-step accelerometry variable",
    TRUE ~ "Non-accelerometry variable"
  ),
  var_group = factor(var_group, levels = c("Step variable", "Non-step accelerometry variable", "Non-accelerometry variable"))) %>%
  group_by(variable, var_group) %>%
  summarize(mean = mean(concordance),
            sd = sd(concordance),
            se = sd(concordance)/sqrt(n())) %>%
  mutate(ci_low = mean - 1.96*se,
         ci_high = mean + 1.96*se) %>%
  ungroup() %>%
  left_join(var_labels, by = c("variable" = "names")) %>%
  mutate(labels = factor(labels),
         labels = fct_reorder(labels, mean)) %>%
  ggplot(aes(y = labels, x = mean, xmin = ci_low, xmax = ci_high, color = var_group, shape = var_group))+
  geom_point(size = 3) +
  # geom_errorbarh(height = .3) +
  theme_bw() +
  # scale_color_manual(values = c("#CC79A7FF", "#009E73FF", "#0072B2FF"), name = "")+
  scale_color_manual(values = c("#FF6DB6", "#009292", "#006DDB"), name = "")+ # colors from paletteer_d("colorBlindness::paletteMartin")
  scale_shape_manual(values = c(8, 17, 16), name = "")+
  scale_x_continuous(limits=c(0.5, 0.75), breaks=seq(0.5, 0.75, .05))+
  theme(legend.position = c(.35, .75),
        legend.title = element_blank(),
        legend.text = element_text(size = 14),
        axis.title = element_text(size = 14),
        axis.text.y = element_text(size = 12),
        strip.text = element_text(size = 14))+
  labs(x = "Mean 100-times repeated 10-fold Cross-Validated Survey-Weighted Concordance", y = "")