Using Functional Data Analysis
4/15/25
What do I mean by “walking fingerprinting”?
Accelerometry: collected from a wearable device
Between 15 and 100 observations per second in 3 dimensions
Measured in \(g\) units, where \(1g = 9.81\) \(m/s^2\)
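As a minimal sketch of the data format (simulated values; the x, y, z column names and 100 Hz sampling rate are assumptions), tri-axial accelerometry is often collapsed to a single vector-magnitude signal in \(g\) units:

library(dplyr)

acc_df =
  tibble(
    time = seq(0, 0.99, by = 0.01),      # one second of data at 100 Hz
    x = rnorm(100, mean = 0, sd = 0.1),  # simulated axes, in g units
    y = rnorm(100, mean = 1, sd = 0.1),  # gravity roughly along one axis
    z = rnorm(100, mean = 0, sd = 0.1)
  ) %>%
  mutate(vm = sqrt(x^2 + y^2 + z^2))     # vector magnitude, in g units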
Hat tip to Edward Gunning for the idea for these figures
\[\text{logit}(p_{ij}^{i_0}) =\beta_0^{i_0} + \int_{u=1}^S\int_{s=u}^SF_{i_0}\{ v_{ij}(s), v_{ij}(s-u), u\}\,ds\,du \]
\(p_{ij}^{i_0}\): probability that second \(j\) from subject \(i\) belongs to subject \(i_0\)
\(u = 1, \dots, S = 100\) (number of observations per second)
\(v_{ij}(s)\): acceleration at centisecond \(s\) for subject \(i\) in second \(j\)
\(F_{i_0}(\cdot, \cdot, \cdot)\): trivariate smooth function
Toy example
4 observations per second (i.e., one observation every 1/4 second)
2 seconds
1 individual
data = \[\begin{bmatrix} v_1(1) & v_1(2) & v_1(3) & v_1(4) \\ v_2(1) & v_2(2) & v_2(3) & v_2(4) \\ \end{bmatrix} \]
acceleration matrix = \[\begin{bmatrix} v_1(2) & v_1(3) & v_1(4) & v_1(3) & v_1(4) & v_1(4) \\ v_2(2) & v_2(3) & v_2(4) & v_2(3) & v_2(4) & v_2(4) \\ \end{bmatrix} \]
lag acceleration matrix = \[\begin{bmatrix} v_1(1) & v_1(1) & v_1(1) & v_1(2) & v_1(2) & v_1(3) \\ v_2(1) & v_2(1) & v_2(1) & v_2(2) & v_2(2) & v_2(3) \\ \end{bmatrix} \]
lag matrix = \[\begin{bmatrix} 1 & 2 & 3 & 1 & 2 & 1\\ 1 & 2 & 3 & 1 & 2 & 1\\\end{bmatrix} \]
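A minimal code sketch of this construction (the placeholder matrix V stands in for the toy data; rows are seconds, columns are the within-second observations):

S = 4                                     # observations per second in the toy example
V = matrix(1:8, nrow = 2, byrow = TRUE)   # toy data: rows = seconds, cols = v_j(1), ..., v_j(4)

# all (lagged index l, current index s) pairs with l < s; the lag is u = s - l
pairs = expand.grid(l = 1:(S - 1), s = 1:S)
pairs = pairs[pairs$s > pairs$l, ]
pairs = pairs[order(pairs$l, pairs$s), ]

acc_mat     = V[, pairs$s, drop = FALSE]  # acceleration matrix: v_j(s)
lag_acc_mat = V[, pairs$l, drop = FALSE]  # lag acceleration matrix: v_j(s - u)
lag_mat     = matrix(pairs$s - pairs$l, nrow = nrow(V), ncol = nrow(pairs), byrow = TRUE)  # lag matrix: u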
Define walking bout: \(\geq\) 10s where at least every other second has steps
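A hedged sketch of how such bouts might be flagged (an assumed helper, not the exact implementation): given a per-second step indicator, a run is broken whenever two adjacent seconds are both step-free, and only runs of at least 10 seconds are kept:

find_walking_bouts = function(has_steps, min_len = 10) {
  n = length(has_steps)
  prev = c(FALSE, has_steps[-n])           # step indicator for the previous second
  broken = !has_steps & !prev              # two step-free seconds in a row end a run
  run_id = cumsum(broken)                  # label the resulting maximal runs
  run_len = ave(rep(1L, n), run_id, FUN = sum)
  run_len >= min_len                       # TRUE for seconds inside a qualifying bout
}

# a 12-second stretch with a single skipped second still counts as one bout
find_walking_bouts(c(rep(TRUE, 5), FALSE, rep(TRUE, 6), rep(FALSE, 4)))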
Remove near-zero-variance grid cells with recipes::step_nzv() (part of tidymodels)
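A minimal recipes sketch (assuming train_data holds the outcome plus the grid-cell count columns):

library(recipes)

nzv_rec =
  recipe(outcome ~ ., data = train_data) %>%
  step_nzv(all_predictors()) %>%
  prep()

train_filtered = bake(nzv_rec, new_data = NULL)  # grid cells with near-zero variance dropped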
Linear scalar-on-function regression (SoFR) model: \[\text{logit}(p_{ij}^{i_0}) =\beta_0^{i_0} + \int_S \beta_1(s) X_{ij}(s)\,ds\] where \(s \in \{1, 2, \dots, 432\}\) indexes grid cells and \(X_{ij}(s)\) is the number of points for subject \(i\) in second \(j\) in grid cell \(s\)
Nonlinear scalar-on-function regression (SoFR) model: \[\text{logit}(p_{ij}^{i_0}) =\beta_0^{i_0} + \int_S F\big(X_{ij}(s), s\big)\,ds\] where \(F\) is a bivariate smooth function (the additive functional term fit by af() below)
library(dplyr)

# predictor matrix: one row per second, one column per grid cell (432 cells)
pred_mat =
  train_data %>%
  select(starts_with("x")) %>%
  as.matrix()

# scalar covariates plus the functional predictor stored as a matrix column
xdf =
  train_data %>%
  select(-starts_with("x")) %>%
  mutate(pred_mat = pred_mat)

if (linear) {
  # linear SoFR: functional linear term via lf()
  pfr_fit =
    refund::pfr(
      outcome ~ lf(pred_mat, argvals = seq(1, 432)),
      family = binomial(link = "logit"),
      method = "REML",
      data = xdf
    )
} else {
  # nonlinear SoFR: additive functional term via af()
  pfr_fit =
    refund::pfr(
      outcome ~ af(pred_mat, argvals = seq(1, 432)),
      family = binomial(link = "logit"),
      method = "REML",
      data = xdf
    )
}
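Predicted probabilities that each held-out second belongs to subject \(i_0\) can then be obtained with predict(); a sketch assuming a test set test_xdf built the same way as xdf:

pred_probs = predict(pfr_fit, newdata = test_xdf, type = "response")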
Because the classes are imbalanced (the predicted subject contributes only a small share of the seconds), we can oversample the predicted subject to a fixed percentage of the training data and see whether this improves the model
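A hedged sketch of that oversampling step (the is_target indicator for the predicted subject and the 30% default are assumptions):

library(dplyr)

oversample_target = function(train_data, target_prop = 0.3) {
  target_rows = filter(train_data, is_target == 1)  # seconds from the predicted subject
  other_rows  = filter(train_data, is_target == 0)  # seconds from all other subjects
  # resample the predicted subject's seconds (with replacement) so that they
  # make up target_prop of the augmented training data
  n_needed = round(target_prop / (1 - target_prop) * nrow(other_rows))
  bind_rows(other_rows, slice_sample(target_rows, n = n_needed, replace = TRUE))
}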
Increase the amount of walking observed for each subject to 6 minutes per person? The intuition is that more data per subject should improve model performance
Calculate, for each subject, the proportion of time spent in each grid cell and fit a separate regression for each grid cell \(c\):
\[\text{time in cell}_i = \beta_0 + \beta_1\text{mortality at 5 years}_i \]
We do this for each cell, then plot the results. Greyed-out cells were not significant after Bonferroni correction.
Interpret red cells as: the estimated difference in the proportion of time spent in cell \(c\) between subjects who died within 5 years and those who did not. Next step: image-on-scalar regression
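A sketch of this cell-wise procedure (subj_df, with one row per subject, columns prop_cell_1, ..., prop_cell_432, and mortality_5yr, is a hypothetical layout):

cell_cols = grep("^prop_cell_", names(subj_df), value = TRUE)

cell_results = do.call(rbind, lapply(cell_cols, function(cell) {
  fit = lm(subj_df[[cell]] ~ subj_df$mortality_5yr)  # time in cell ~ 5-year mortality
  coefs = summary(fit)$coefficients
  data.frame(cell = cell, beta1 = coefs[2, 1], p_value = coefs[2, 4])
}))

# Bonferroni: a cell is significant only if p < 0.05 / number of cells tested
cell_results$significant = cell_results$p_value < 0.05 / length(cell_cols)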