---
title: "Bulk RNA-Seq Example"
output:
  html_document:
    toc: true
    toc_depth: 2
    number_sections: true
    theme: flatly
params:
  de_results: NULL
  enrichment_results: NULL
  grn_object: NULL
  report_date: !r Sys.Date()
vignette: >
  %\VignetteIndexEntry{Bulk RNA-Seq Example}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---


## Bulk RNA-Seq Analysis with XYomics
Authors: Enrico Glaab and Sophie Le Bars

# Introduction

This vignette demonstrates a standard workflow for analyzing bulk RNA-seq data to identify sex-specific effects using the **XYomics** package. The package provides a streamlined set of tools for differential expression, pathway analysis, and network-based interpretation.

This tutorial covers the following key steps:

1.  **Simulating a bulk RNA-seq dataset** with built-in sex-specific and shared gene expression changes.
2.  **Discussing and applying two different strategies for differential expression analysis**: sex-stratified analysis, and interaction analysis.
3.  **Identifying and categorizing sex-specific genes** into distinct groups (male-specific, female-specific, etc.) based on stratified analysis.
4.  **Visualizing expression patterns** of top genes using the package's built-in plotting functions.
5.  **Conducting pathway enrichment analysis** to determine the biological functions of the identified gene sets.
6.  **Constructing and visualizing a protein-protein interaction network** to explore the interactions between key genes.

# Data Simulation

We illustrate the package's capabilities with a simulated bulk RNA-seq dataset. The dataset was generated with the `simulate_omics_data` function and contains 500 genes and 100 samples, balanced across phenotype (Control/Disease) and sex (male/female). The chunk below loads the precomputed version shipped with the package.

Genes are categorized into four groups:

-   **Male-specific:** Genes up-regulated in the disease state only in males.
-   **Female-specific:** Genes up-regulated in the disease state only in females.
-   **Sex-dimorphic:** Genes regulated in opposite directions in males and females in the disease state.
-   **Sex-neutral:** Genes regulated in the same direction in males and females in the disease state.
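
To make the four categories concrete, the following base-R sketch builds a small toy matrix with one gene per category. It is an illustration only and does not reproduce the internals of `simulate_omics_data`; the gene names, the group size, and the effect size (`shift`) are assumptions chosen for the example.

```r
# Toy illustration of the four categories
# (NOT the simulate_omics_data() implementation).
set.seed(42)

n_per_group <- 25  # assumed: 4 groups x 25 samples = 100 samples
design <- expand.grid(
  replicate = seq_len(n_per_group),
  phenotype = c("Control", "Disease"),
  sex = c("male", "female"),
  stringsAsFactors = FALSE
)

# One hypothetical gene per category, filled with baseline noise
gene_names <- c("male_gene", "female_gene", "dimorphic_gene", "neutral_gene")
toy_expr <- matrix(
  rnorm(length(gene_names) * nrow(design)),
  nrow = length(gene_names),
  dimnames = list(gene_names, paste0("sample", seq_len(nrow(design))))
)

diseased_males <- design$phenotype == "Disease" & design$sex == "male"
diseased_females <- design$phenotype == "Disease" & design$sex == "female"
shift <- 2  # assumed effect size on the log scale

# Male-specific: up in diseased males only
toy_expr["male_gene", diseased_males] <-
  toy_expr["male_gene", diseased_males] + shift

# Female-specific: up in diseased females only
toy_expr["female_gene", diseased_females] <-
  toy_expr["female_gene", diseased_females] + shift

# Sex-dimorphic: up in diseased males, down in diseased females
toy_expr["dimorphic_gene", diseased_males] <-
  toy_expr["dimorphic_gene", diseased_males] + shift
toy_expr["dimorphic_gene", diseased_females] <-
  toy_expr["dimorphic_gene", diseased_females] - shift

# Sex-neutral: up in both diseased groups
diseased <- diseased_males | diseased_females
toy_expr["neutral_gene", diseased] <- toy_expr["neutral_gene", diseased] + shift

# Confirm the balanced design
table(design$sex, design$phenotype)
```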

```{r simulate-data, message=FALSE, warning=FALSE}
library(XYomics)
library(dplyr)

# Load precomputed bulk dataset
expression_data <- readRDS(
  system.file("extdata", "bulk_expression.rds", package = "XYomics")
)

sex <- readRDS(
  system.file("extdata", "bulk_sex.rds", package = "XYomics")
)

phenotype <- readRDS(
  system.file("extdata", "bulk_phenotype.rds", package = "XYomics")
)
```

# Differential Expression Analysis Strategies

When analyzing sex differences, researchers must be aware of the pitfalls of associated statistical analyses, including the limitations of sex-stratified analyses and the challenges of analyzing interactions between sex and disease state.

**Sex-stratified analyses** apply standard tests for differential molecular abundance to each sex separately to detect disease-associated changes. These include classical parametric tests, such as Welch’s t-test for normally distributed data; non-parametric tests, such as the Mann–Whitney U test; and moderated statistics designed for high-dimensional omics data, such as the empirical Bayes moderated t-statistic. However, a purely sex-stratified analysis may misclassify a change as sex-specific if it uses a standard significance threshold to assess both the presence and the absence of an effect. Stochastic variation of p-values around the chosen threshold can produce apparent significance in only one sex, especially when the p-value in the other sex only marginally exceeds the threshold. In addition, such an analysis may miss sex-modulated changes, in which both sexes show significant changes in the same direction that differ significantly in magnitude; detecting these requires a direct cross-sex comparison.
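
The threshold pitfall can be illustrated with a small base-R simulation. In this sketch, males and females share exactly the same true disease effect, yet each sex is tested separately with Welch's t-test. The group size, effect size, and number of simulated genes are assumptions chosen for illustration, and the code is independent of the XYomics functions.

```r
# Toy simulation: the same true disease effect in both sexes, tested per sex.
set.seed(1)

n_per_group <- 25   # assumed samples per sex x phenotype group
true_effect <- 0.5  # assumed identical effect (in SD units) in both sexes
alpha <- 0.05       # significance threshold
n_sim <- 2000       # number of simulated genes

# p-value of Welch's t-test for one sex
# (t.test() applies the Welch correction by default)
p_one_sex <- function() {
  control <- rnorm(n_per_group, mean = 0)
  disease <- rnorm(n_per_group, mean = true_effect)
  t.test(disease, control)$p.value
}

p_male <- replicate(n_sim, p_one_sex())
p_female <- replicate(n_sim, p_one_sex())

# Fraction of genes significant in exactly one sex, i.e. genes that a
# threshold-only rule would label "sex-specific"
apparent_sex_specific <- mean(xor(p_male < alpha, p_female < alpha))
apparent_sex_specific
```

Because both sexes share the same true effect in this simulation, every gene flagged by the rule is a false sex-specific call that arises only from sampling noise around the threshold.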

**Interaction analysis** formally tests whether the relationship between disease and molecular changes differs significantly between males and females. Interaction terms can reveal complexities in disease mechanisms that would otherwise be obscured in analyses that do not consider sex as a biological variable (SABV), and they can detect changes that are limited to the magnitude of an effect, which sex-stratified analyses do not capture. However, robust estimation of interaction effects requires large sample sizes, which are often unavailable because of the cost of advanced molecular profiling techniques such as single-cell RNA sequencing.
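
For a single gene, the idea of an interaction test can be sketched with an ordinary linear model in base R. The example below simulates a sex-modulated gene (same direction in both sexes, larger effect in males) and reads off the `phenotype:sex` interaction term. It is a generic illustration with assumed effect sizes, not the model fitted by `sex_interaction_analysis_bulk()`.

```r
# Toy single-gene interaction test with lm()
# (generic illustration, not the XYomics implementation).
set.seed(7)

n_per_group <- 25
toy <- expand.grid(
  replicate = seq_len(n_per_group),
  phenotype = c("Control", "Disease"),
  sex = c("female", "male"),
  stringsAsFactors = FALSE
)
# Fix the reference levels so the coefficient names are predictable
toy$phenotype <- factor(toy$phenotype, levels = c("Control", "Disease"))
toy$sex <- factor(toy$sex, levels = c("female", "male"))

# Assumed effects: disease raises expression by 1 in females and by 2 in males
effect <- ifelse(
  toy$phenotype == "Disease",
  ifelse(toy$sex == "male", 2, 1),
  0
)
toy$expression <- rnorm(nrow(toy), mean = effect)

# phenotype * sex expands to both main effects plus their interaction
fit <- lm(expression ~ phenotype * sex, data = toy)

# The interaction coefficient estimates how much the disease effect
# differs between males and females
interaction_row <- coef(summary(fit))["phenotypeDisease:sexmale", ]
interaction_row[c("Estimate", "Pr(>|t|)")]
```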

The **XYomics** package provides functions for both types of analyses.

## Method 1: Sex-Stratified Analysis

Here, we use the `sex_stratified_analysis_bulk()` function to perform a stratified analysis.

```{r stratified-analysis, eval=FALSE}
res <- sex_stratified_analysis_bulk(expression_data, sex, phenotype)
# The `min_samples` parameter defines the minimum number of samples required
# to perform the stratified analysis. The default value is 3.
```

```{r load-stratified-results}
res <- readRDS(
  system.file("extdata", "bulk_results_degs.rds", package = "XYomics")
)
# Internal QC results from XYomics can be accessed through:
# res$validation
#
# This includes validation of sex and phenotype balance, group sample sizes,
# detection of design imbalances, and identification of groups below the
# minimum sample threshold required for reliable downstream analysis.
res$validation
```
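
A group-size check of the kind described in the comments above can be sketched in a few lines of base R. The vectors below are invented toy data, and the code is not the package's internal validation routine; only the `min_samples` name and its default of 3 are taken from this vignette.

```r
# Toy group-size check (illustrative; not the package's validation code).
sex_toy <- c(rep("male", 12), rep("female", 5))
phenotype_toy <- c(
  rep("Control", 10), rep("Disease", 2),  # males
  rep("Control", 3), rep("Disease", 2)    # females
)
min_samples <- 3  # default of the stratified analysis, as stated above

# Cross-tabulate sex and phenotype to see the design at a glance
group_sizes <- table(sex = sex_toy, phenotype = phenotype_toy)
group_sizes

# List the groups that fall below the minimum sample threshold
too_small <- as.data.frame(group_sizes, stringsAsFactors = FALSE)
too_small <- too_small[too_small$Freq < min_samples, ]
too_small
```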


## Identification of Sex-Specific Genes (from Stratified Analysis)

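The `categorize_sex()` function categorizes genes based on the results of the **stratified analysis**. This provides a useful classification, albeit a potentially less robust one than the interaction analysis, because it depends on significance thresholds applied separately in each sex.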

```{r sex-specific-analysis, eval=FALSE}
res_cat <- categorize_sex(res$male_DEGs, res$female_DEGs)
```

```{r load-categorized-results}
res_cat <- readRDS(
  system.file("extdata", "bulk_results_cat.rds", package = "XYomics")
)

cat("Top sex-stratified differentially expressed genes:\n")
head(res_cat)

table(res_cat$DEG_Type)
```
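
The exact rules applied by `categorize_sex()` are not spelled out in this vignette. As a rough mental model, the sketch below assigns the four categories from per-sex FDR values and log fold changes. The helper `categorize_toy()`, the 0.05 threshold, and the toy values are all hypothetical and chosen for illustration only.

```r
# Hypothetical categorization rule (an assumption, not the categorize_sex() code).
categorize_toy <- function(male_logfc, male_fdr,
                           female_logfc, female_fdr,
                           alpha = 0.05) {
  sig_male <- male_fdr < alpha
  sig_female <- female_fdr < alpha
  same_direction <- sign(male_logfc) == sign(female_logfc)

  # Genes significant in neither sex stay NA
  category <- rep(NA_character_, length(male_fdr))
  category[sig_male & !sig_female] <- "male-specific"
  category[!sig_male & sig_female] <- "female-specific"
  category[sig_male & sig_female & !same_direction] <- "sex-dimorphic"
  category[sig_male & sig_female & same_direction] <- "sex-neutral"
  category
}

# Toy input: one gene per category plus one non-significant gene
toy_categories <- categorize_toy(
  male_logfc   = c(1.2,  0.1,  1.5,   0.8,  0.1),
  male_fdr     = c(0.01, 0.50, 0.001, 0.02, 0.90),
  female_logfc = c(0.2, -1.1, -1.3,   0.9,  0.0),
  female_fdr   = c(0.40, 0.01, 0.002, 0.03, 0.80)
)
toy_categories
```

A rule of this form inherits the threshold sensitivity discussed earlier: a gene with an FDR of 0.049 in one sex and 0.051 in the other would be labeled sex-specific.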


## Method 2: Interaction Term Analysis

We use the `sex_interaction_analysis_bulk()` function to perform a formal interaction analysis.

```{r interaction-term-analysis-bulk, eval=FALSE}
interaction_term_results_bulk <- sex_interaction_analysis_bulk(expression_data, phenotype, sex)
# The `min_samples` parameter defines the minimum number of samples required
# to perform the interaction analysis. The default value for the interaction
# analysis is 20.
```

```{r load-interaction-results}
interaction_term_results_bulk <- readRDS(
  system.file("extdata", "bulk_interaction_results.rds", package = "XYomics")
)

# Significant sex-modulated DEGs are available in
# interaction_term_results_bulk$sig_results
head(interaction_term_results_bulk$summary_stats)
```


# Visualization of Expression Patterns

We use `generate_violinplot_bulk()` to visualize the expression of top genes identified by the stratified analysis.

```{r visualization}
top_male_gene <- res_cat %>%
  filter(DEG_Type == "male-specific") %>%
  arrange(Male_FDR) %>%
  pull(Gene_Symbols) %>%
  head(1)

top_female_gene <- res_cat %>%
  filter(DEG_Type == "female-specific") %>%
  arrange(Female_FDR) %>%
  pull(Gene_Symbols) %>%
  head(1)

top_dimorphic_gene <- res_cat %>%
  filter(DEG_Type == "sex-dimorphic") %>%
  arrange(Female_FDR) %>%
  pull(Gene_Symbols) %>%
  head(1)

top_neutral_gene <- res_cat %>%
  filter(DEG_Type == "sex-neutral") %>%
  arrange(Female_FDR) %>%
  pull(Gene_Symbols) %>%
  head(1)

top_genes <- c(top_male_gene, top_female_gene, top_dimorphic_gene, top_neutral_gene)

generate_violinplot_bulk(expression_data, sex, phenotype, top_genes)
```

# Pathway Enrichment Analysis

We perform pathway enrichment on the gene lists derived from the stratified analysis using `categorized_enrich()`.

```{r pathway-analysis, eval=FALSE}
pathway_results <- categorized_enrich(res_cat, enrichment_db = "GO")

# The `enrichment_db` parameter can be set to "GO", "KEGG", or "REACTOME".
# To use a custom database, provide a TERM2GENE data frame via the
# `custom_db` parameter.
#
# Gene identifiers can be of any type supported by OrgDb
# (e.g., "SYMBOL", "ENTREZID", "ENSEMBL", "UNIPROT"); specify the type
# using the `gene_type` parameter.
#
# The `return_df` parameter controls the output format:
# if set to TRUE, results are returned as data frames;
# by default (FALSE), enrichResult objects are returned.
```

```{r load-pathway-results}
pathway_results <- readRDS(
  system.file("extdata", "bulk_pathway_results.rds", package = "XYomics")
)

head(pathway_results)
```
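
Enrichment results of this kind are commonly based on over-representation analysis, which asks whether a gene list overlaps a pathway more than expected by chance. The vignette does not state which test `categorized_enrich()` uses, so the sketch below is a generic illustration of the hypergeometric test with invented counts, not a description of the package internals.

```r
# Generic over-representation test with invented counts
# (an assumption about the general approach, not the package internals).
universe_size <- 500  # all genes tested
pathway_size <- 40    # genes annotated to the pathway
deg_count <- 30       # genes in the list of interest
overlap <- 8          # list genes that belong to the pathway

# P(overlap >= 8) under random sampling without replacement
p_hyper <- phyper(
  overlap - 1,
  pathway_size,
  universe_size - pathway_size,
  deg_count,
  lower.tail = FALSE
)

# The same test expressed as a one-sided Fisher's exact test on a 2x2 table
contingency <- matrix(
  c(
    overlap,
    deg_count - overlap,
    pathway_size - overlap,
    universe_size - pathway_size - deg_count + overlap
  ),
  nrow = 2,
  dimnames = list(
    in_list = c("yes", "no"),
    in_pathway = c("yes", "no")
  )
)
p_fisher <- fisher.test(contingency, alternative = "greater")$p.value

c(hypergeometric = p_hyper, fisher = p_fisher)
```

In practice, the p-values of all tested pathways are then adjusted for multiple testing, for example with `p.adjust(p_values, method = "BH")`.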

```{r celltype1_plot , results='asis', fig.keep='all', fig.width=15, fig.height=7, message=FALSE}
# Visualize the enrichment results as dot plots
enrichment_plot <- plot_enrichment_dotplots(pathway_results)
print(enrichment_plot)
```

# Protein-Protein Interaction Network Analysis

Finally, we construct a protein-protein interaction network using the results from the stratified analysis.

### 1. Fetching the STRING Network

First, we obtain a protein-protein interaction network, either by downloading it from the STRING database with `get_string_network()` or by loading the example network shipped with the package.

```{r get-string-network, message=FALSE}
# Fetch STRING network (can be replaced with a custom network)
# g <- get_string_network(organism = "9606", score_threshold = 900)

# Load a pre-existing network from a file.
# The hormonal network "hormonal_network_metacore.rds" can be loaded instead of
# "string_example_network.rds".
g <- readRDS(
  system.file("extdata", "string_example_network.rds", package = "XYomics")
)
```
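
STRING reports a combined confidence score between 0 and 1000 for each interaction, and a threshold of 900 keeps only the highest-confidence edges. The base-R sketch below shows what such a filter does on an invented edge list. The column layout mimics STRING link tables, and whether `get_string_network()` filters with `>=` or `>` is an assumption.

```r
# Toy edge list in the style of a STRING link table
# (scores invented for illustration).
toy_edges <- data.frame(
  protein1 = c("TP53", "TP53", "BRCA1", "EGFR"),
  protein2 = c("MDM2", "EGFR", "BARD1", "GRB2"),
  combined_score = c(999, 420, 998, 910)
)

score_threshold <- 900  # same value as in the commented call above

# Keep only edges at or above the threshold (assumed inclusive)
high_confidence <- toy_edges[toy_edges$combined_score >= score_threshold, ]
high_confidence
```

A stricter threshold yields a sparser network with fewer false-positive edges, at the cost of dropping genes that have no high-confidence interaction partners.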

### 2. Constructing the PCSF Network

Next, we define "prizes" for our genes based on their statistical significance (here, the -log10 of the mean of the male and female FDR values) and use the `construct_ppi_pcsf()` function to build a context-specific network.

```{r construct-pcsf-network, eval=FALSE}
# Use sex-dimorphic DE results to define prizes
dimorphic_specific <- res_cat[res_cat$DEG_Type == "sex-dimorphic", ]
dimorphic_prizes <- -log10((dimorphic_specific$Male_FDR + dimorphic_specific$Female_FDR) / 2)
names(dimorphic_prizes) <- dimorphic_specific$Gene_Symbols

# Construct the PCSF subnetwork for the sex-dimorphic genes
dimorphic_network <- construct_ppi_pcsf(g = g, prizes = dimorphic_prizes)

# Construct networks from the full categorized results in a single call
network <- ppi_pipeline(res_cat, g)
```
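
The prize definition can be tried out on a small toy table. The sketch below mirrors the formula from the chunk above using invented FDR values. The `pmax()` floor is an addition for illustration: it guards against infinite prizes when an FDR is reported as exactly 0, and whether `construct_ppi_pcsf()` requires this is not stated in this vignette.

```r
# Toy prize computation (invented values; column names follow res_cat).
toy_dimorphic <- data.frame(
  Gene_Symbols = c("A", "B", "C"),
  Male_FDR = c(1e-4, 0.01, 0),
  Female_FDR = c(1e-2, 0.03, 0)
)

# Mean FDR across the two sexes, as in the chunk above
mean_fdr <- (toy_dimorphic$Male_FDR + toy_dimorphic$Female_FDR) / 2

# Assumed safeguard: floor the mean FDR so that -log10(0) cannot produce Inf
floored_fdr <- pmax(mean_fdr, .Machine$double.xmin)

# Smaller FDR values yield larger prizes
toy_prizes <- setNames(-log10(floored_fdr), toy_dimorphic$Gene_Symbols)
toy_prizes
```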

```{r load-network-results}
network <- readRDS(
  system.file("extdata", "bulk_network_results.rds", package = "XYomics")
)

neutral_network <- readRDS(
  system.file("extdata", "bulk_neutral_network.rds", package = "XYomics")
)

neutral_specific <- res_cat[res_cat$DEG_Type == "sex-neutral", ]
```

### 3. Visualizing the Network

Finally, we use the `plot_network()` function to visualize the network, highlighting nodes based on their degree (i.e., significance). Each node also includes a barplot showing logFC values, blue for males and pink for females.

```{r visualize-network}


plot_network(neutral_network, "sex-neutral", neutral_specific, show_barplot = TRUE)

# Generate plots for all categories
all_plots <- plot_network_pipeline(network, res_cat)

```

### Generating a Report
The `generate_cat_report` function can be used to compile all results into a single HTML report.

```{r report, eval=FALSE}
# Fetch the bulk report template shipped with the package
template_path <- system.file("extdata", "Template_report_bulk.Rmd", package = "XYomics")

# Generate a comprehensive HTML report
generate_cat_report(res_cat, pathway_results, network, template_path = template_path)


```

# Conclusion

This vignette has demonstrated two key approaches for analyzing sex-specific effects in bulk RNA-seq data. While sex-stratified analysis is a common first step, **interaction analysis provides a more statistically robust method** for identifying genes whose regulation by disease is truly dependent on sex. We recommend using an interaction-based approach when possible, while being mindful of the need for adequate sample sizes.