Skip to content

Conversation

dlaehnemann
Copy link
Member

@dlaehnemann dlaehnemann commented Oct 8, 2025

This will generate one heatmap per list, and bootstrap plots for each of the genes in the union of the lists.

Summary by CodeRabbit

  • New Features

    • Support multiple named gene lists — one heatmap and one bootstrap plot per list; list name included in filenames, logs, and outputs.
    • Ability to exclude samples via a YAML list.
  • Refactor

    • Modernized per-list plotting and bootstrap flow; heatmap generation moved to a tidyverse/ggplot-based pipeline; per-transcript bootstrap handling updated.
  • Documentation

    • Clarified multi-list usage, per-list plotting, sample-exclude format, and Ensembl/Pfam guidance.
  • Chores

    • Default bootstrap FDR raised to 0.5; updated Ensembl (115) and Pfam (37.1) and adjusted R environment pins.

Copy link
Contributor

coderabbitai bot commented Oct 8, 2025

Walkthrough

Replaces single-string diffexp.genes_of_interest.genelist with a map diffexp.genes_of_interest.gene_lists across configs and schema; updates Snakemake rules and wildcards to iterate by {gene_list}; adapts R scripts and conda envs to produce per-gene-list heatmaps and per-transcript bootstraps; bumps Ensembl and Pfam refs and adjusts test configs.

Changes

Cohort / File(s) Summary
Configs (main and tests)
config/config.yaml, .test/config/config.yaml, .test/three_prime/config/config.yaml
Replace diffexp.genes_of_interest.genelist (string) with diffexp.genes_of_interest.gene_lists (map of named file paths, e.g. gene_list_1, gene_list_2); update experiment.resources.ref.release 114→115 and pfam 37.0→37.1; add/clarify diffexp.exclude examples.
Schema
workflow/schemas/config.schema.yaml
Replace genelist (string) with gene_lists (object) using patternProperties allowing arbitrary named string entries; update required fields to expect gene_lists.
Snakemake rules
workflow/rules/common.smk, workflow/rules/diffexp.smk
Remove input_genelist(predef_genelist) helper; lookup diffexp/genes_of_interest/gene_lists; switch wildcard/iteration from mode to {gene_list}; include {gene_list} in inputs/outputs/logs/labels and pass per-list paths to bootstrap/heatmap rules.
R scripts
workflow/scripts/plot-bootstrap.R, workflow/scripts/plot_diffexp_heatmap.R
plot-bootstrap.R: read named TSV gene-lists when activated, derive transcripts_of_interest from observed data, and output per-transcript bootstrap PDFs including gene/transcript/model in filenames. plot_diffexp_heatmap.R: use {gene_list} wildcard, support topn or predefined per {gene_list}, select genes, and render heatmaps with tidyverse/ggalign ggplot pipeline (replacing pheatmap-based code).
Conda envs
workflow/envs/heatmap.yaml, workflow/envs/sleuth.yaml
heatmap.yaml: replace r-pheatmap, r-dplyr, r-tidyr with r-base, r-tidyverse, r-ggalign. sleuth.yaml: loosen r-base pin from =4 to >=4.1.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant CFG as Config (diffexp.gene_lists)
  participant SMK as Snakemake
  participant HEAT as plot_diffexp_heatmap.R
  participant BOOT as plot-bootstrap.R

  CFG->>SMK: provide map of named gene-list paths
  rect rgb(240,248,255)
    note right of SMK: Snakemake iterates per `gene_list` key
    SMK->>SMK: expand `{gene_list}` wildcard
    SMK->>HEAT: run heatmap job with {model, gene_list, predef_gene_list}
    HEAT-->>SMK: produce `{model}.{gene_list}.diffexp-heatmap.{gene_list}.pdf` + log
  end

  rect rgb(245,255,240)
    note right of SMK: bootstrap reads all enabled gene_lists
    SMK->>BOOT: pass gene_lists map (paths + activate flags)
    BOOT->>BOOT: read TSVs, derive transcripts_of_interest from observed counts
    BOOT-->>SMK: generate bootstrap PDFs per transcript (filenames include gene, transcript, model)
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • johanneskoester
  • FelixMoelder

Poem

I hop through configs, tidy and neat,
Named lists now gather each gene's heartbeat.
Heatmaps and bootstraps stamped with their name,
Per-list wildcards frame each plotting game. 🐰

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title succinctly highlights the breaking feature of supporting multiple gene lists which aligns exactly with the PR objectives of enabling multiple gene_list inputs and updating downstream plotting behavior. It is concise, uses conventional commit syntax to denote a breaking change, and clearly conveys the primary change without extraneous details.
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.
✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat/allow-multiple-gene-set-lists-heatmap-bootstrap-plots

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4e015b3 and 8dbfca8.

📒 Files selected for processing (1)
  • config/config.yaml (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • config/config.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: run-rna-workflow
  • GitHub Check: run-three-prime-rna-workflow

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
workflow/rules/diffexp.smk (1)

258-279: Consider adding wildcard constraint for {gene_list}.

The rule correctly implements per-gene-list heatmap generation. However, consider adding a wildcard constraint for {gene_list} to prevent potential issues with special characters in gene list names and maintain consistency with existing patterns (see line 35: {model,[^.]+}).

For example, add a constraint in the output definition:

-            "results/plots/diffexp-heatmap/{model}.{gene_list}.diffexp-heatmap.{mode}.pdf",
+            "results/plots/diffexp-heatmap/{model}.{gene_list,[^./]+}.diffexp-heatmap.{mode}.pdf",

This prevents gene list names containing dots or slashes from causing wildcard ambiguity.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e580493 and a6eaf38.

⛔ Files ignored due to path filters (2)
  • .test/resources/gene_list.tsv is excluded by !**/*.tsv
  • .test/resources/gene_list_2.tsv is excluded by !**/*.tsv
📒 Files selected for processing (8)
  • .test/config/config.yaml (1 hunks)
  • .test/three_prime/config/config.yaml (1 hunks)
  • config/config.yaml (1 hunks)
  • workflow/rules/common.smk (1 hunks)
  • workflow/rules/diffexp.smk (2 hunks)
  • workflow/schemas/config.schema.yaml (1 hunks)
  • workflow/scripts/plot-bootstrap.R (1 hunks)
  • workflow/scripts/plot_diffexp_heatmap.R (1 hunks)
🧰 Additional context used
🪛 GitHub Actions: Tests
workflow/rules/common.smk

[error] 336-336: SyntaxError: positional argument follows keyword argument.

🔇 Additional comments (1)
workflow/rules/diffexp.smk (1)

149-149: LGTM! Correct implementation for multi-list bootstrap aggregation.

The change correctly retrieves the entire gene_lists map from the config, enabling the bootstrap plot script to aggregate genes across all provided lists as intended.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
workflow/rules/common.smk (1)

329-340: Syntax error from previous review has been addressed.

The critical issues flagged in the past review (missing dpath= keyword and incorrect path) have been resolved. The current implementation correctly uses lookup(within=config, dpath="diffexp/genes_of_interest/gene_lists").

For consistency with line 326 (which uses gene_list=["topn"]), consider explicitly converting the lookup result to a list:

             gene_list=lookup(
                 within=config, dpath="diffexp/genes_of_interest/gene_lists"
             ),
+            gene_list=list(
+                lookup(
+                    within=config, dpath="diffexp/genes_of_interest/gene_lists"
+                )
+            ),

This makes the type explicit and aligns with the default gene_list specification on line 326.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1a67b6a and dca8aef.

📒 Files selected for processing (2)
  • workflow/rules/common.smk (1 hunks)
  • workflow/rules/diffexp.smk (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • workflow/rules/diffexp.smk
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: run-three-prime-rna-workflow
  • GitHub Check: run-rna-workflow
🔇 Additional comments (1)
workflow/rules/common.smk (1)

323-328: LGTM! Default heatmap generation for top N genes is well-structured.

The addition of the {gene_list} wildcard and the default gene_list=["topn"] ensures that a heatmap for the top differentially expressed genes is always generated, which maintains backward compatibility while supporting the new multi-list feature.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
workflow/scripts/plot-bootstrap.R (2)

23-30: Well-implemented gene list aggregation.

The logic correctly handles multiple gene lists:

  • character() initialization prevents crashes when gene_lists is empty (addresses previous review)
  • union() automatically deduplicates genes across lists
  • Pipe operators and transformations are appropriate

Consider adding error handling around file reads to provide clearer feedback if a gene list file is missing or malformed:

   for (gene_list in unname(snakemake@params[["genes_of_interest"]][["gene_lists"]])) {
-    new_genes <- read_tsv(
-      gene_list,
-      col_names = FALSE
-    )
+    new_genes <- tryCatch(
+      read_tsv(gene_list, col_names = FALSE, show_col_types = FALSE),
+      error = function(e) {
+        stop(sprintf("Failed to read gene list '%s': %s", gene_list, e$message))
+      }
+    )
     genes <- union(genes, new_genes |> unlist() |> unname())
   }

36-39: Update outdated comment.

The comment refers to an old implementation detail. Consider updating it to reflect the current logic:

 } else {
-    # "genes" is null, if the list provided in config.yaml is empty
+    # Create empty tibble when genes_of_interest is not activated
     genes_of_interest <- tibble(ext_gene = character())
 }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c4fb68b and 84aca9e.

📒 Files selected for processing (1)
  • workflow/scripts/plot-bootstrap.R (1 hunks)
🔇 Additional comments (2)
workflow/scripts/plot-bootstrap.R (2)

22-22: LGTM! Parameter access updated correctly.

The parameter path change from genes to genes_of_interest aligns with the PR's shift to multiple gene lists.


31-35: LGTM! Tibble creation is correct.

enframe() appropriately converts the gene vector to a tibble with the expected ext_gene column name.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
workflow/scripts/plot-bootstrap.R (1)

22-35: Add error handling for missing or malformed gene list files.

The loop reads TSV files without error handling. If a file path is invalid or the file is malformed, the script will crash with potentially unclear errors. Consider wrapping the file read operations in tryCatch() to provide informative error messages.

Additionally, the native pipe operator |> requires R >= 4.1.0, which may break on older R versions if not properly documented or enforced.

Apply this diff to add error handling:

 if (snakemake@params[["genes_of_interest"]][["activate"]] == TRUE) {
   genes = character()
   for (gene_list in unname(snakemake@params[["genes_of_interest"]][["gene_lists"]])) {
-    new_genes <- read_tsv(
-      gene_list,
-      col_names = FALSE
-    )
-    genes <- union(genes, new_genes |> unlist() |> unname())
+    tryCatch({
+      new_genes <- read_tsv(
+        gene_list,
+        col_names = FALSE,
+        show_col_types = FALSE
+      )
+      genes <- union(genes, new_genes |> unlist() |> unname())
+    }, error = function(e) {
+      stop(sprintf("Failed to read gene list file '%s': %s", gene_list, e$message))
+    })
   }
   genes_of_interest <- enframe(
     genes,
     name = NULL,
     value = "ext_gene"
   )
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 84aca9e and f6709e7.

📒 Files selected for processing (1)
  • workflow/scripts/plot-bootstrap.R (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: run-three-prime-rna-workflow
  • GitHub Check: run-rna-workflow
🔇 Additional comments (1)
workflow/scripts/plot-bootstrap.R (1)

37-38: LGTM!

The empty tibble initialization is consistent with the structure created when genes_of_interest is activated, ensuring downstream compatibility.

@dlaehnemann
Copy link
Member Author

Actionable comments posted: 0
🧹 Nitpick comments (1)

workflow/scripts/plot-bootstrap.R (1)> 22-35: Add error handling for missing or malformed gene list files.

The loop reads TSV files without error handling. If a file path is invalid or the file is malformed, the script will crash with potentially unclear errors. Consider wrapping the file read operations in tryCatch() to provide informative error messages.
Additionally, the native pipe operator |> requires R >= 4.1.0, which may break on older R versions if not properly documented or enforced.
Apply this diff to add error handling:

 if (snakemake@params[["genes_of_interest"]][["activate"]] == TRUE) {
   genes = character()
   for (gene_list in unname(snakemake@params[["genes_of_interest"]][["gene_lists"]])) {
-    new_genes <- read_tsv(
-      gene_list,
-      col_names = FALSE
-    )
-    genes <- union(genes, new_genes |> unlist() |> unname())
+    tryCatch({
+      new_genes <- read_tsv(
+        gene_list,
+        col_names = FALSE,
+        show_col_types = FALSE
+      )
+      genes <- union(genes, new_genes |> unlist() |> unname())
+    }, error = function(e) {
+      stop(sprintf("Failed to read gene list file '%s': %s", gene_list, e$message))
+    })
   }
   genes_of_interest <- enframe(
     genes,
     name = NULL,
     value = "ext_gene"
   )

📜 Review details

@coderabbitai read_tsv() is a function from a tidyverse package (readr) and these provide good error handling and clear error messages. Thus, for tidyverse package functions, no manual error handling for common cases is necessary.

Copy link
Contributor

coderabbitai bot commented Oct 10, 2025

Tip

For best results, initiate chat on the files or code changes.

@dlaehnemann, thank you for the clarification! You're absolutely right that read_tsv() and other tidyverse functions provide robust error handling with clear, informative error messages out of the box. I'll keep this in mind for future reviews.


✏️ Learnings added
Learnt from: dlaehnemann
URL: https://github.com/snakemake-workflows/rna-seq-kallisto-sleuth/pull/164

Timestamp: 2025-10-10T07:43:04.524Z
Learning: In this Snakemake workflow project, tidyverse package functions (such as `read_tsv()` from `readr`) provide good error handling and clear error messages. Manual error handling with `tryCatch()` for common cases is not necessary when using tidyverse functions.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Copy link
Contributor

@fxwiegand fxwiegand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite understand the rotation of the genes into the horizontal list. I would not say that this helps readability either. Also previous lists would need to be adapted again. Couldnt you simply rotate the list where it is needed in the code?

Also one thought that came to my mind was that this is a breaking change correct? One could make this non-breaking by simply allowing to add a second gene set via another column.

However all this is probably nitpicky stuff - looks good in general i would say!

@dlaehnemann
Copy link
Member Author

Good points, thanks. The explanatory comment in the config/config.yaml and the actual example gene list for the tests were previously disagreeing, because I added a wrong explanatory comment there (I guess I misunderstood the explicit specification of the \t separator in previous code, which would not be necessary). I made them agree in this way, but possibly the other way around makes the code a bit clearer, being able to directly parse in a column. I'll adjust this.

Also, this is a breaking change, yes. So we should reflect this in the version bump. I also like the idea of having only a single file with multiple columns, but this would also require a header line with column names, to name each list. So I guess this would also be breaking. And I am actually unsure whether users would rather maintain a single file with multiple columns, or multiple files with a single column each. The former would mean that genes arbitrarily line up in rows, even though they don't relate in any way. So I guess I would stick with one file per gene list?

@dlaehnemann dlaehnemann changed the title feat: allow for multiple gene_lists feat!: allow for multiple gene_lists Oct 13, 2025
@fxwiegand
Copy link
Contributor

Good points, thanks. The explanatory comment in the config/config.yaml and the actual example gene list for the tests were previously disagreeing, because I added a wrong explanatory comment there (I guess I misunderstood the explicit specification of the \t separator in previous code, which would not be necessary). I made them agree in this way, but possibly the other way around makes the code a bit clearer, being able to directly parse in a column. I'll adjust this.

Yeah i get that! The .tsv extension is quite confusing when theres only a single column and it somehow also is a line break delimited file then.

Also, this is a breaking change, yes. So we should reflect this in the version bump. I also like the idea of having only a single file with multiple columns, but this would also require a header line with column names, to name each list. So I guess this would also be breaking. And I am actually unsure whether users would rather maintain a single file with multiple columns, or multiple files with a single column each. The former would mean that genes arbitrarily line up in rows, even though they don't relate in any way. So I guess I would stick with one file per gene list?

Ah yes right the columns would at least need a header for naming the gene set then so yeah the change will be breaking anyway. In that case i also don't really have a strong opinion on single file vs. multiple files. I agree that the single file solution file could be confusing when looking at a row of that table.

@Addimator
Copy link
Collaborator

For the single file multiple columns discussion I want to add that I personally like single files (especially since the cluster is so damn slow a lot of times) to not have to open multiple files. But since this is a really specific scenario and there should not be too many files anyways I do not think this is an important reason to switch to single file multiple cols.

@Addimator
Copy link
Collaborator

I just finished the review. Again some nitpicking stuff and nothing major to add. Great PR from my side too!

@dlaehnemann
Copy link
Member Author

OK, I went back to line-separated gene lists. But then I couldn't bear the current status of the code and changed it all to proper tidyverse syntax and hopefully made it all a bit more readable. Also fixed the bootstrap plot filtering in the process, which was omitting quite a few requested genes. And switched the heatmap plotting from the very specific pheatmap() to the ggplot2-based ggalign(), which should be much easier to adjust and improve.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (2)
config/config.yaml (2)

144-156: Clarify gene-list docs: avoid “infix”, add period, and state “no header”

Tighten wording and remove ambiguity about headers.

Apply:

-  # If you want a heatmap and bootstrap plots for particular sets of genes,
+  # If you want a heatmap and bootstrap plots for particular sets of genes,
   # you can set `activate: true` and provide one or more gene list files. In
-  # those files, list all your HGNC gene symbols of interest, one gene per line.
+  # those files, list all your HGNC gene symbols of interest, one gene per line (no header).
@@
-    gene_lists:
-      # use a descriptive gene list name, as this will be used as an infix in
-      # the filename of the resulting heatmap plot
+    gene_lists:
+      # Use a descriptive gene list name; it will be used in the filename of the resulting heatmap plot.
       gene_list_1: "config/gene_list.tsv"

42-42: Consider updating default Ensembl release (duplicate of prior nit)

Default is "114". If you want “latest working” by default, consider bumping to current.

What is the latest Ensembl release as of today, and is release 114 still recommended as a default for new analyses?
🧹 Nitpick comments (5)
workflow/scripts/plot_diffexp_heatmap.R (2)

35-38: Robust gene-list read: handle headerless files explicitly

If lists are single-column without header (as docs suggest), read as headerless and rename to avoid accidental header rows being treated as genes.

Apply:

-    predef_gene_list <-
-        read_tsv(
-            snakemake@input[["predef_gene_list"]],
-            col_names = c("gene")
-        )
+    predef_gene_list <-
+        read_tsv(
+            snakemake@input[["predef_gene_list"]],
+            col_names = FALSE,
+            show_col_types = FALSE
+        ) |>
+        rename(gene = 1)

47-70: Ensure proper matrix type for ggheatmap and qualify unit()

  • Coerce to a numeric matrix so ggheatmap has the expected input.
  • Use grid::unit to avoid missing import.

Apply:

-matrix <- selected_genes |>
+matrix <- selected_genes |>
     mutate(transcript = str_pad(transcript, 18, "right")) |>
     replace_na(list(gene = "   ")) |>
     mutate(t = str_c(gene, " - ", transcript)) |>
     select(-gene, -transcript) |>
-    column_to_rownames(var = "t")
+    column_to_rownames(var = "t") |>
+    as.matrix()
@@
-    anno_top(size = unit(3, "cm")) +
+    anno_top(size = grid::unit(3, "cm")) +
@@
-    anno_right(size = unit(3, "cm")) +
+    anno_right(size = grid::unit(3, "cm")) +
workflow/scripts/plot-bootstrap.R (3)

14-19: Prefer slice_min over superseded top_n and avoid negative n

Cleaner and more explicit.

Apply:

-top_genes <- transcripts |>
-    filter(qval <= snakemake@params[["fdr"]]) |>
-    top_n(top_n, qval) |>
+top_genes <- transcripts |>
+    filter(qval <= snakemake@params[["fdr"]]) |>
+    slice_min(qval, n = abs(top_n)) |>
     dplyr::select(ext_gene) |>
     drop_na() |>
     distinct(ext_gene)

35-35: Create output dir recursively and silence warnings

Avoid warnings if the directory exists and handle nested dirs.

Apply:

-dir.create(snakemake@output[[1]])
+dir.create(snakemake@output[[1]], showWarnings = FALSE, recursive = TRUE)

12-12: Optional: suppress type messages for read_tsv

Keeps logs clean.

Apply:

-transcripts <- read_tsv(snakemake@input[["transcripts"]])
+transcripts <- read_tsv(snakemake@input[["transcripts"]], show_col_types = FALSE)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f6709e7 and a4b56c6.

⛔ Files ignored due to path filters (2)
  • .test/resources/gene_list.tsv is excluded by !**/*.tsv
  • .test/resources/gene_list_2.tsv is excluded by !**/*.tsv
📒 Files selected for processing (5)
  • .test/config/config.yaml (2 hunks)
  • config/config.yaml (3 hunks)
  • workflow/envs/heatmap.yaml (1 hunks)
  • workflow/scripts/plot-bootstrap.R (1 hunks)
  • workflow/scripts/plot_diffexp_heatmap.R (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • .test/config/config.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: formatting

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (1)
config/config.yaml (1)

153-156: Clarify wording (“infix”), add period, and note “no header”.

Make wording clearer and guide users to single-column, headerless files to match the readers.

-      # use a descriptive gene list name, as this will be used as an infix in
-      # the filename of the resulting heatmap plot
+      # Use a descriptive gene list name; it will be part of the resulting heatmap
+      # filename.
+      # Each gene list file must be single-column, no header; one HGNC symbol per line.
       gene_list_1: "config/gene_list.tsv"
🧹 Nitpick comments (5)
workflow/scripts/plot_diffexp_heatmap.R (2)

47-53: Ensure a true numeric matrix for ggheatmap.

Coerce to matrix after setting row names.

 matrix <- selected_genes |>
     mutate(transcript = str_pad(transcript, 18, "right")) |>
     replace_na(list(gene = "   ")) |>
     mutate(t = str_c(gene, " - ", transcript)) |>
     select(-gene, -transcript) |>
-    column_to_rownames(var = "t")
+    column_to_rownames(var = "t") |>
+    as.matrix()

57-65: Namespace unit() to avoid runtime resolution issues.

Explicitly use grid::unit.

-    anno_top(size = unit(3, "cm")) +
+    anno_top(size = grid::unit(3, "cm")) +
@@
-    anno_right(size = unit(3, "cm")) +
+    anno_right(size = grid::unit(3, "cm")) +
workflow/scripts/plot-bootstrap.R (3)

14-19: Use slice_min instead of negative top_n.

Cleaner and future-proof.

-top_genes <- transcripts |>
-    filter(qval <= snakemake@params[["fdr"]]) |>
-    top_n(top_n, qval) |>
+top_genes <- transcripts |>
+    filter(qval <= snakemake@params[["fdr"]]) |>
+    slice_min(qval, n = as.integer(snakemake@params[["top_n"]]), with_ties = FALSE) |>
     dplyr::select(ext_gene) |>
     drop_na() |>
     distinct(ext_gene)

35-35: Make dir.create idempotent.

Avoid warnings when dir exists and ensure parents are created.

-dir.create(snakemake@output[[1]])
+dir.create(snakemake@output[[1]], recursive = TRUE, showWarnings = FALSE)

22-26: Multi-file read_tsv: confirm behavior.

read_tsv() (via vroom) can read a vector of files and row-bind them. Confirm that your readr version guarantees this across environments; otherwise, loop and bind_rows.

# Alternative (if needed):
genes_of_interest <- purrr::map_df(
  unname(snakemake@params[["genes_of_interest"]][["gene_lists"]]),
  ~ read_tsv(.x, col_names = "ext_gene")
) |> distinct()
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f6709e7 and a4b56c6.

⛔ Files ignored due to path filters (2)
  • .test/resources/gene_list.tsv is excluded by !**/*.tsv
  • .test/resources/gene_list_2.tsv is excluded by !**/*.tsv
📒 Files selected for processing (5)
  • .test/config/config.yaml (2 hunks)
  • config/config.yaml (3 hunks)
  • workflow/envs/heatmap.yaml (1 hunks)
  • workflow/scripts/plot-bootstrap.R (1 hunks)
  • workflow/scripts/plot_diffexp_heatmap.R (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • .test/config/config.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: run-rna-workflow
  • GitHub Check: run-three-prime-rna-workflow

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (8)
workflow/scripts/plot_diffexp_heatmap.R (4)

10-12: Optional: quiet readr parsing output

Consider suppressing column-type messages to keep logs clean.

-log_counts <- read_tsv(
-    snakemake@input[["logcountmatrix_file"]]
-)
+log_counts <- readr::read_tsv(
+  snakemake@input[["logcountmatrix_file"]],
+  show_col_types = FALSE
+)

47-53: Guard empty selections to avoid ggheatmap crash

If no rows are selected (empty gene list or no matches), ggheatmap likely errors. Short‑circuit with a friendly message and create an empty placeholder plot or skip saving.

-matrix <- selected_genes |>
+if (nrow(selected_genes) == 0) {
+  message("No genes selected for heatmap; skipping plot.")
+  quit(save = "no", status = 0)
+}
+
+matrix <- selected_genes |>

54-70: Namespace grid::unit for clarity

Be explicit to avoid masked imports.

-    anno_top(size = unit(3, "cm")) +
+    anno_top(size = grid::unit(3, "cm")) +
...
-    anno_right(size = unit(3, "cm")) +
+    anno_right(size = grid::unit(3, "cm")) +

34-38: Gene list read: document “no header” assumption

You assign col_names = "gene", implying a headerless single-column file. Consider a quick check to detect and warn on accidental headers.

 predef_gene_list <-
-    read_tsv(
+    readr::read_tsv(
         snakemake@input[["predef_gene_list"]],
-        col_names = c("gene")
+        col_names = "gene",
+        show_col_types = FALSE
     )
config/config.yaml (2)

45-51: Avoid hardcoding “release-114” in examples

Use a placeholder or the current release to prevent confusion.

-# https://ftp.ensembl.org/pub/release-114/species_EnsemblVertebrates.txt
+# https://ftp.ensembl.org/pub/release-<release>/species_EnsemblVertebrates.txt
...
-# https://ftp.ensembl.org/pub/release-114/fasta/homo_sapiens/cdna/
+# https://ftp.ensembl.org/pub/release-<release>/fasta/homo_sapiens/cdna/

144-151: Clarify gene list file format: no header

Explicitly state that files must be single-column, headerless TSV/CSV with one HGNC symbol per line.

-# you can set `activate: true` and provide one or more gene list files. In
-# those files, list all your HGNC gene symbols of interest, one gene per line.
+# you can set `activate: true` and provide one or more gene list files.
+# Each file must be a single-column, headerless TSV/CSV listing HGNC gene
+# symbols of interest, one gene per line.
workflow/scripts/plot-bootstrap.R (2)

35-35: Optional: make dir.create idempotent

Avoid warnings if the directory already exists.

-dir.create(snakemake@output[[1]])
+dir.create(snakemake@output[[1]], recursive = TRUE, showWarnings = FALSE)

14-19: Prefer slice_min over deprecated top_n

top_n() is superseded; slice_min expresses intent directly.

-top_genes <- transcripts |>
-    filter(qval <= snakemake@params[["fdr"]]) |>
-    top_n(top_n, qval) |>
+top_genes <- transcripts |>
+    filter(qval <= snakemake@params[["fdr"]]) |>
+    dplyr::slice_min(order_by = qval, n = as.integer(snakemake@params[["top_n"]])) |>
     dplyr::select(ext_gene) |>
     drop_na() |>
     distinct(ext_gene)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f6709e7 and 4c4536c.

⛔ Files ignored due to path filters (2)
  • .test/resources/gene_list.tsv is excluded by !**/*.tsv
  • .test/resources/gene_list_2.tsv is excluded by !**/*.tsv
📒 Files selected for processing (5)
  • .test/config/config.yaml (3 hunks)
  • config/config.yaml (4 hunks)
  • workflow/envs/heatmap.yaml (1 hunks)
  • workflow/scripts/plot-bootstrap.R (1 hunks)
  • workflow/scripts/plot_diffexp_heatmap.R (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: run-three-prime-rna-workflow
  • GitHub Check: run-rna-workflow
🔇 Additional comments (2)
.test/config/config.yaml (1)

85-87: FDR=0.5 in tests: confirm intent

A 0.5 FDR will yield many positives (more bootstrap plots). If this is to stress plotting in CI, fine; otherwise consider a tighter threshold.

workflow/envs/heatmap.yaml (1)

6-7: r-ggalign includes ggheatmap(), anno_top(), anno_right(), and align_dendro(); no new dependencies required.

Copy link
Collaborator

@Addimator Addimator left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another round, but again only two small things. The R improvements were particularly interesting to me since I originally wrote the pheatmap stuff. It was my first time looking into R after a long time. So rule of thumb: Whenever possible use tidyverse and ggplot for doing stuff in R, I will remember that :)

type: string
required:
- genelist
- gene_lists
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is gene_lists required? When the user sets activate to FALSE, then we do not need gene_lists?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is debatable. It was in there previously and I kept it, because even if we set it to FALSE, we usually keep the placeholder to have the expected structure in place. Also, the bootstrap rule doesn't actually specify the respective files under input, so we would not get a missing input exception, but rather some lookup error where I wouldn't be sure if it is clearly interpretable. On the other hand, if someone really doesn't care about these lists, but wants other updates to this workflow, they will be forced to update their config. So we could consider dropping this.
Maybe @fxwiegand has another opinion?

dlaehnemann and others added 2 commits October 14, 2025 13:04
Co-authored-by: Adrian Prinz <44083468+Addimator@users.noreply.github.com>
@dlaehnemann
Copy link
Member Author

Another round, but again only two small things. The R improvements were particularly interesting to me since I originally wrote the pheatmap stuff. It was my first time looking into R after a long time. So rule of thumb: Whenever possible use tidyverse and ggplot for doing stuff in R, I will remember that :)

I think this was actually re-implementing the sleuth-internal heatmap, right? Because that already wasn't configurable in the way we wanted it to be? But yeah, I also wanted to try this package, which seems to be very systematic and allows for all kinds of complex multi-panel plots with nice alignment among them:
https://yunuuuu.github.io/ggalign-gallery/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants