---
title: "How To Use CGMissingDataR"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{How To Use CGMissingDataR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning = FALSE,
  message = FALSE
)
```

# Overview

CGMissingDataR imputes missing glucose values in continuous glucose monitoring
(CGM) data. The main user-facing function is:

```r
run_missing_glucose_imputation()
```

The function is designed for glucose values that are actually missing from the
data. It handles two common forms of CGM missingness:

1. explicit missing glucose values, where a row exists but the glucose value is
   `NA`; and
2. implicit missing readings, where expected timestamps are absent from the data.

Before imputation, the function regularizes each subject to an equal
`interval_minutes` timestamp grid. Missing timestamp gaps are converted into
explicit rows with `target_col = NA`, then imputed by the same workflow used for
explicit missing glucose values.

The returned data frame is intentionally minimal. It contains the original
user-supplied columns plus a completed glucose column named
`imputed_glucose_value`. Internal columns used for timestamp regularization,
time features, lag features, rolling means, model fitting, and missingness
tracking are not returned.

The core workflow is:

1. read a data frame or CSV file;
2. parse and sort timestamps by subject;
3. regularize each subject to an equal `interval_minutes` timestamp grid;
4. insert missing timestamp rows with `target_col = NA`;
5. create internal time, lag, and rolling-mean features;
6. impute the target and feature matrix;
7. choose the final model from `models`, using the post-regularization missing
   rate for the default automatic selection;
8. return the original columns plus `imputed_glucose_value`.

# Installation

Install the CRAN release with:

```r
install.packages("CGMissingDataR")
```

Install the development version with:

```r
install.packages("devtools")
devtools::install_github("ZhangLabUKY/CGMmissingDataR")
```

Load the package:

```{r setup}
library(CGMissingDataR)
```

# Example data

`CGMExmplDat10Pct` is a small multi-subject CGM data set included with the
package. It contains a subject identifier, raw timestamp column, glucose column,
age, and HbA1c.

```{r example-data}
data("CGMExmplDat10Pct")

summary_table <- data.frame(
  Rows = nrow(CGMExmplDat10Pct),
  Columns = ncol(CGMExmplDat10Pct),
  Subjects = length(unique(CGMExmplDat10Pct$USUBJID)),
  MissingGlucose = sum(is.na(CGMExmplDat10Pct$LBORRES)),
  MissingPercent = round(mean(is.na(CGMExmplDat10Pct$LBORRES)) * 100, 1)
)

summary_table
head(CGMExmplDat10Pct)
```

The example data intentionally does not include `TimeSeries`. The imputation
function creates required time features internally from the raw `Time` column.

# Required input columns

At minimum, the imputation function needs:

| Role | Argument | Example column |
|---|---|---|
| Glucose value to impute | `target_col` | `LBORRES` |
| Subject identifier | `id_col` | `USUBJID` |
| Raw timestamp | `time_col` | `Time` |
| Additional predictors | `feature_cols` | `AGE`, `hba1c`, `SEX` |

The target column may contain missing values. Predictor columns should be numeric
or coercible to numeric. The `SEX` column, when present, is internally encoded
as `M = 1` and `F = 0`.
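
The encoding rule can be pictured with a small base-R sketch. The toy data frame below is made up for illustration, and the code only mirrors the stated `M = 1`, `F = 0` convention and numeric coercion; the package's internal implementation may differ.

```r
# Toy predictors (illustrative values only); column names follow the example data.
predictors <- data.frame(
  AGE   = c("34", "51", "47"),  # stored as character, but coercible to numeric
  hba1c = c(6.8, 7.4, NA),
  SEX   = c("M", "F", "M"),
  stringsAsFactors = FALSE
)

# Encode SEX as M = 1 and F = 0; any other value becomes NA.
predictors$SEX <- ifelse(predictors$SEX == "M", 1,
                         ifelse(predictors$SEX == "F", 0, NA_real_))

# Coerce character predictors to numeric.
predictors$AGE <- as.numeric(predictors$AGE)

str(predictors)
```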

# What counts as missing?

CGM exports can represent missingness in two ways.

## Explicit missing glucose values

A row exists, but the glucose value is missing:

| Time | LBORRES |
|---|---:|
| 00:00 | 120 |
| 00:05 | NA |
| 00:10 | 125 |

The row with `LBORRES = NA` is imputed.

## Timestamp gaps

A row is absent entirely, producing a jump in the timestamp sequence:

| Time | LBORRES |
|---|---:|
| 00:00 | 120 |
| 00:05 | 122 |
| 00:30 | 130 |

With `interval_minutes = 5`, the function internally regularizes this to:

| Time | LBORRES |
|---|---:|
| 00:00 | 120 |
| 00:05 | 122 |
| 00:10 | NA |
| 00:15 | NA |
| 00:20 | NA |
| 00:25 | NA |
| 00:30 | 130 |

The inserted rows are then imputed using the same workflow as explicit `NA`
values. Because of this, the returned data frame may have more rows than the
input data when timestamp gaps are present.
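
The gap-filling step above can be reproduced for a single subject with base R. This is a minimal sketch assuming the grid runs from the first to the last observed timestamp; the date is arbitrary, and the package's own regularization may handle alignment differently.

```r
# Toy single-subject data with a gap between 00:05 and 00:30.
cgm <- data.frame(
  Time = as.POSIXct(
    c("2020-01-16 00:00:00", "2020-01-16 00:05:00", "2020-01-16 00:30:00"),
    tz = "UTC"
  ),
  LBORRES = c(120, 122, 130)
)

interval_minutes <- 5

# Build the full expected grid from the first to the last observed timestamp.
grid <- data.frame(
  Time = seq(min(cgm$Time), max(cgm$Time), by = interval_minutes * 60)
)

# A left join onto the grid inserts the absent timestamps with LBORRES = NA.
regular <- merge(grid, cgm, by = "Time", all.x = TRUE)
regular
```

The three input rows become seven rows, four of which carry `LBORRES = NA`.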

# Basic real-imputation workflow

For the CRAN-safe R-native path, use `imputer_backend = "mice"`.

```{r basic-imputation}
impute_out <- suppressWarnings(
  run_missing_glucose_imputation(
    CGMExmplDat10Pct,
    target_col = "LBORRES",
    feature_cols = c("AGE", "hba1c", "SEX"),
    id_col = "USUBJID",
    time_col = "Time",
    imputer_backend = "mice",
    xgb_nrounds = 5
  )
)
```

The result is a data frame:

```{r output-shape}
class(impute_out)
nrow(impute_out)
names(impute_out)
```

The returned columns are the original user-supplied columns plus
`imputed_glucose_value`.

| Column | Meaning |
|---|---|
| Original columns | All user-supplied input columns. |
| Original target column, e.g. `LBORRES` | The original glucose column, not overwritten. Values that were originally missing or were inserted for timestamp gaps remain `NA`. |
| `imputed_glucose_value` | Completed glucose values after imputation. |

```{r output-preview}
head(impute_out[c(
  "USUBJID",
  "SEX",
  "Time",
  "LBORRES",
  "AGE",
  "hba1c",
  "imputed_glucose_value"
)])
```

The original target column is not overwritten:

```{r original-target-unchanged}
sum(is.na(CGMExmplDat10Pct$LBORRES))
sum(is.na(impute_out$LBORRES))
sum(is.na(impute_out$imputed_glucose_value))
```

Inspect rows where the original target column is missing. These include explicit
missing glucose values and, when timestamp gaps are present, rows inserted during
timestamp regularization.

```{r missing-row-preview}
missing_rows <- is.na(impute_out$LBORRES)
head(impute_out[missing_rows, c(
  "USUBJID",
  "Time",
  "LBORRES",
  "imputed_glucose_value"
)])
```

# How the method is selected

By default, the function automatically chooses the final imputation model from
the target missing rate after timestamp-gap regularization:

- if the missing rate is less than or equal to `use_arima_if_missing_leq`, the
  final method is `MICE+ARIMA`;
- otherwise, the final method is `MICE+XGBoost`.

The default threshold is `0.05`.
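
The default rule amounts to a single comparison. The helper below is hypothetical and named only for illustration; it is not a function exported by the package.

```r
# Hypothetical helper that mirrors the documented default rule.
choose_final_method <- function(target, use_arima_if_missing_leq = 0.05) {
  # Missing rate of the target after timestamp-gap rows have been inserted.
  missing_rate <- mean(is.na(target))
  if (missing_rate <= use_arima_if_missing_leq) "MICE+ARIMA" else "MICE+XGBoost"
}

# Regularized glucose series from the timestamp-gap example: 4 of 7 values missing.
glucose <- c(120, 122, NA, NA, NA, NA, 130)
choose_final_method(glucose)
#> [1] "MICE+XGBoost"
```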

Users can override this automatic rule by setting `models` to exactly one of
`"arima"`, `"xgboost"`, `"rf"`, `"knn"`, or `"lightgbm"`. These options run
MICE first, then use the selected model with the same internal time, lag, and
rolling-mean features.

Method labels and missingness-tracking columns are internal implementation
details and are not part of the minimal user-facing output. The returned data
frame keeps only the original input columns plus `imputed_glucose_value`.

# Thread control

Real-imputation model engines use one thread by default:

```r
n_threads = 1
```

This conservative default is friendly to CRAN checks and shared computing
systems. Users can increase `n_threads` for faster local XGBoost, Random Forest,
or LightGBM runs. ARIMA and kNN do not use this setting.
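
One possible way to pick a larger value on a local machine is to leave one physical core free. This is a sketch of a common convention, not a package recommendation.

```r
# detectCores() can return NA on some platforms, so fall back to one thread.
cores <- parallel::detectCores(logical = FALSE)
n_threads <- if (is.na(cores)) 1L else max(1L, cores - 1L)
n_threads
```

The result can then be passed as `n_threads = n_threads` in the call to `run_missing_glucose_imputation()`.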

# Time handling and timestamp regularization

The function accepts common timestamp formats, including colon-separated,
hyphen-separated, slash-separated, ISO-style, and `POSIXct` inputs.

Examples of accepted character formats include:

```r
"2020:01:16:00:00"
"2020-01-16 00:00:00"
"2020/01/16 00:00:00"
"01/16/2020 00:00"
"2020-01-16T00:00:00"
```
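
To check ahead of time whether a timestamp column matches one of these formats, a base-R sketch such as the following can help. The helper name `parse_cgm_time()` and the format list are assumptions made for this example; the package's internal parser may accept more formats or resolve ambiguous strings differently.

```r
# Hypothetical helper: try each known format in turn and keep the first that parses.
parse_cgm_time <- function(x, tz = "UTC") {
  formats <- c(
    "%Y:%m:%d:%H:%M",
    "%Y-%m-%d %H:%M:%S",
    "%Y/%m/%d %H:%M:%S",
    "%m/%d/%Y %H:%M",
    "%Y-%m-%dT%H:%M:%S"
  )
  out <- .POSIXct(rep(NA_real_, length(x)), tz = tz)
  for (fmt in formats) {
    # Only retry the strings that no earlier format could parse.
    todo <- is.na(out) & !is.na(x)
    if (!any(todo)) break
    out[todo] <- as.POSIXct(strptime(x[todo], format = fmt, tz = tz))
  }
  out
}

times <- c(
  "2020:01:16:00:00",
  "2020-01-16 00:00:00",
  "2020/01/16 00:00:00",
  "01/16/2020 00:00",
  "2020-01-16T00:00:00",
  "not a timestamp"
)
parsed <- parse_cgm_time(times)

# Values that could not be parsed stay NA and can be listed for inspection.
times[is.na(parsed)]
```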

The function uses the timestamp column and `interval_minutes` to regularize each
subject's data to an expected CGM interval. The default is:

```r
interval_minutes = 5
```

Observed timestamps are aligned to the subject-level interval grid, missing grid
positions are inserted, and the inserted target values are set to `NA` before
imputation.

# Internal engineered features

The workflow creates `TimeSeries`, `TimeDifferenceMinutes`, lag features, and a
rolling mean before imputation. These features help the model use temporal order,
time spacing, and recent glucose history.

For example, after timestamp regularization, lag features are created on the
expanded grid:

| Time | LBORRES | lag1 | lag2 | lag3 |
|---|---:|---:|---:|---:|
| 00:00 | 120 | NA | NA | NA |
| 00:05 | 122 | 120 | NA | NA |
| 00:10 | NA | 122 | 120 | NA |
| 00:15 | NA | NA | 122 | 120 |
| 00:20 | NA | NA | NA | 122 |

These engineered columns are used internally by the imputer and final model but
are removed from the returned data frame.

```{r internal-feature-check}
grep("^lag[0-9]+$|^rollmean$|^TimeSeries$|^TimeDifferenceMinutes$", names(impute_out), value = TRUE)
```

This should return an empty character vector because those features are internal
implementation details.
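
The lag columns in the table above can be reproduced with a short base-R sketch. The helper `lag_by()` is hypothetical, and the sketch covers a single subject only; the package builds these features internally, per subject, and its code may differ.

```r
# Hypothetical helper: shift a vector down by k positions, padding the start with NA.
lag_by <- function(x, k) c(rep(NA, k), x)[seq_along(x)]

# Regularized single-subject grid from the table above.
regular <- data.frame(
  Time    = c("00:00", "00:05", "00:10", "00:15", "00:20"),
  LBORRES = c(120, 122, NA, NA, NA)
)

# Add lag1, lag2, and lag3 on the expanded grid.
for (k in 1:3) {
  regular[[paste0("lag", k)]] <- lag_by(regular$LBORRES, k)
}
regular
```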

# Continuous imputed values

`imputed_glucose_value` is returned as a continuous numeric model estimate. It is
not rounded to the nearest whole number by default because downstream analyses
may benefit from retaining the model-estimated precision.

Users who need whole-number glucose values for reporting can round after
imputation:

```{r rounding-example, eval = FALSE}
impute_out$imputed_glucose_value_rounded <- round(impute_out$imputed_glucose_value)
```

# Optional Python-compatible backend

For closest agreement with the Python reference workflow, use:

```r
imputer_backend = "sklearn"
```

In that mode, the function sends the input data frame to Python through
`reticulate`. Python then performs preprocessing and imputation with:

- `pandas` for data-frame operations;
- `scikit-learn` for `IterativeImputer`, Random Forest, and kNN;
- `statsmodels` for ARIMA;
- Python `xgboost` for XGBoost regression;
- Python `lightgbm` when forcing LightGBM.

The completed pandas data frame is then converted back to R.

## Installing optional Python dependencies

Install `reticulate` in R:

```r
install.packages("reticulate")
```

Declare the Python dependencies before running the Python backend:

```r
reticulate::py_require(c(
  "numpy",
  "pandas",
  "scikit-learn",
  "statsmodels",
  "xgboost"
))

# Optional, only needed for models = "lightgbm"
reticulate::py_require("lightgbm")
```

Then call the function with `imputer_backend = "sklearn"`:

```{r python-backend-example, eval = FALSE}
out_py <- run_missing_glucose_imputation(
  CGMExmplDat10Pct,
  target_col = "LBORRES",
  feature_cols = c("AGE", "hba1c"),
  id_col = "USUBJID",
  time_col = "Time",
  imputer_backend = "sklearn",
  xgb_nrounds = 5
)

head(out_py[c(
  "USUBJID",
  "Time",
  "LBORRES",
  "imputed_glucose_value"
)])
```

The Python backend is optional. It is not required for package installation or
for building this vignette.

# Choosing a backend

| Backend | Use case | Notes |
|---|---|---|
| `mice` | Default R-native workflow | CRAN-safe and does not require Python. |
| `sklearn` | Closest Python-compatible workflow | Requires `reticulate` and Python packages. |

Use `mice` for simple installation and CRAN-safe examples. Use `sklearn` when
comparing with the Python reference workflow or when you want Python libraries
to perform preprocessing and imputation end to end.

# Exporting results

Set `export = TRUE` to write the returned imputed data frame to a timestamped
CSV file in the current working directory.

```{r export-example, eval = FALSE}
out <- run_missing_glucose_imputation(
  CGMExmplDat10Pct,
  target_col = "LBORRES",
  feature_cols = c("AGE", "hba1c"),
  id_col = "USUBJID",
  time_col = "Time",
  imputer_backend = "mice",
  export = TRUE
)
```

The exported CSV contains the original input columns plus
`imputed_glucose_value`.

# Troubleshooting

## Timestamp parsing errors

If you see an error such as:

```
Some timestamp values could not be parsed
```

check the values in your timestamp column:

```r
head(unique(your_data$Time))
```

Use a standard format such as `YYYY-mm-dd HH:MM:SS`, `YYYY:mm:dd:HH:MM`, or a
`POSIXct` column.

## Unexpected row counts

If the returned data frame has more rows than the input data, this is expected
when timestamp gaps are present. The function creates rows for missing expected
CGM readings before imputation.

If the increase is larger than expected, check whether the timestamp column
contains off-grid times (for example, nonzero seconds or minutes that are not
multiples of `interval_minutes`) or mixed timestamp formats.
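
A quick way to look for off-grid timestamps is sketched below. It assumes the grid is anchored at each subject's first timestamp, which may not match the package's alignment rule exactly, and uses made-up data.

```r
# Toy data: the third reading (00:12:30) does not fall on a 5-minute grid.
cgm <- data.frame(
  USUBJID = c("S1", "S1", "S1", "S1"),
  Time = as.POSIXct(
    c("2020-01-16 00:00:00", "2020-01-16 00:05:00",
      "2020-01-16 00:12:30", "2020-01-16 00:15:00"),
    tz = "UTC"
  )
)

interval_minutes <- 5

# Seconds elapsed since each subject's first timestamp.
first_time <- ave(as.numeric(cgm$Time), cgm$USUBJID, FUN = min)
offset_sec <- as.numeric(cgm$Time) - first_time

# A timestamp is on the grid when its offset is a whole number of intervals.
cgm$off_grid <- offset_sec %% (interval_minutes * 60) != 0
cgm[cgm$off_grid, ]
```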

## Python module errors

If the Python backend reports a missing module such as `sklearn`, remember that
the package is installed as `scikit-learn` but imported as `sklearn`. Declare
the requirements under their install names:

```r
reticulate::py_require(c(
  "numpy",
  "pandas",
  "scikit-learn",
  "statsmodels",
  "xgboost"
))

# Optional, only needed for models = "lightgbm"
reticulate::py_require("lightgbm")
```

If Python was already initialized before the requirements were declared, restart
R, declare the requirements first, and then run the call again.

## Warnings from `mice`

Small or highly collinear data sets can cause `mice` to report logged events.
This is common with tiny examples and does not necessarily indicate failure.
With real data, inspect those warnings to decide whether columns should be
removed, recoded, or simplified.

# Session information

```{r session-info}
utils::sessionInfo()
```