
Commit ec1b6f0

Merge pull request #13 from Center-for-Health-Data-Science/update_recount3
Several updates to the Recount3 subpackage
2 parents 2ccd6d4 + 71d0c6b commit ec1b6f0

11 files changed: 35 additions, 9020 deletions

bulkDGD/execs/dgd_get_recount3_data.py

Lines changed: 26 additions & 4 deletions
@@ -262,6 +262,9 @@ def main():
     # Create a list to store the futures.
     futures = []

+    # Create a set to store the names of the output/log files.
+    output_names = set()
+
     # For each row of the data frame containing the samples' batches
     for num_batch, row in enumerate(df.itertuples(index = False), 1):

@@ -288,9 +291,29 @@ def main():

         #-------------------------------------------------------------#

+        # Get the overall name for the output/log files.
+        output_name = f"{project_name}_{samples_category}"
+
+        # Set a counter in case the name already exists and we need
+        # to name the files differently.
+        counter = 1
+
+        # If the name already exists
+        while output_name in output_names:
+
+            # Uniquify the name by adding a counter.
+            output_name = output_name + f"_{counter}"
+
+            # Update the counter.
+            counter += 1
+
+        # Add the new name to the list of names.
+        output_names.add(output_name)
+
+        #-------------------------------------------------------------#
+
         # Get the name of the output file.
-        output_csv_name = \
-            f"{project_name}_{samples_category}_{num_batch}.csv"
+        output_csv_name = f"{output_name}.csv"

         # Get the path to the output file.
         output_csv_path = os.path.join(wd, output_csv_name)
@@ -306,8 +329,7 @@ def main():
         #-------------------------------------------------------------#

         # Get the path to the log file and the file's extension.
-        log_file_name = \
-            f"{project_name}_{samples_category}_{num_batch}.log"
+        log_file_name = f"{output_name}.log"

         # Get the path to the log file.
         log_file_path = os.path.join(wd, log_file_name)
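
Note on the naming loop added above: because the suffix is appended to the running `output_name` rather than to the base name, a third batch with the same project and category would likely end up as `<base>_1_2` rather than `<base>_2`. The sketch below reproduces that behaviour in isolation and compares it with a variant that always rebuilds the candidate from the base name. Both function names and the `proj_cat` base name are hypothetical and used only for illustration; neither function is part of bulkDGD.

```python
def unique_names_as_committed(base_names):
    """Mirror the loop in this commit: the suffix is appended to the
    running name, so suffixes accumulate on repeated collisions."""
    seen = set()
    result = []
    for base in base_names:
        name = base
        counter = 1
        while name in seen:
            name = name + f"_{counter}"
            counter += 1
        seen.add(name)
        result.append(name)
    return result


def unique_names_from_base(base_names):
    """Hypothetical variant: rebuild each candidate from the base name,
    so every collision yields exactly one numeric suffix."""
    seen = set()
    result = []
    for base in base_names:
        name = base
        counter = 1
        while name in seen:
            name = f"{base}_{counter}"
            counter += 1
        seen.add(name)
        result.append(name)
    return result


# Three batches sharing the same (hypothetical) project and category.
batches = ["proj_cat"] * 3

print(unique_names_as_committed(batches))
# ['proj_cat', 'proj_cat_1', 'proj_cat_1_2']

print(unique_names_from_base(batches))
# ['proj_cat', 'proj_cat_1', 'proj_cat_2']
```

Both approaches produce unique file names; they differ only in how readable the suffixes are once there are three or more batches with the same base name.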

bulkDGD/recount3/data/gtex_tissues.txt

Lines changed: 0 additions & 94 deletions
This file was deleted.

bulkDGD/recount3/data/readme.md

Lines changed: 1 addition & 60 deletions
@@ -1,6 +1,6 @@
 # `data`

-Last updated: 29/03/2024
+Last updated: 22/07/2024

 ## `gtex_metadata_fileds`

@@ -23,43 +23,6 @@ AGE
 DTHHRDY
 ```

-## `gtex_tissues.txt`
-
-A plain text file containing the list of available GTEx tissues. `dgd_get_recount3_data` uses it to check whether the user-provided tissue is valid.
-
-Example:
-
-```
-# GTEx tissue types - STUDY_NA is not included
-
-# Adipose tissue
-ADIPOSE_TISSUE
-
-# Adrenal gland
-ADRENAL_GLAND
-
-# Blood
-BLOOD
-
-# Blood vessel
-BLOOD_VESSEL
-```
-
-## `sra_codes.txt`
-
-A plain text file containing the list of available SRA codes. `dgd_get_recount3_data` uses it to check whether the user-provided SRA code is valid.
-
-Example:
-
-```
-# SRA codes
-
-SRP107565
-SRP149665
-SRP017465
-SRP119165
-```
-
 ## `sra_metadata_fields.txt`

 A plain text file containing the fields (= columns) found in the files describing the metadata associated with SRA samples downloaded from the Recount3 platform.
@@ -78,28 +41,6 @@ sample_acc
 experiment_acc
 ```

-## `tcga_cancer_types.txt`
-
-A plain text file containing the list of TCGA cancer types. `dgd_get_recount3_data` uses it to check whether the user-provided cancer type is valid.
-
-Example:
-
-```
-# TCGA cancer types (from https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/tcga-study-abbreviations, CNTL, FPPP, and MISC excluded)
-
-# Adrenocortical carcinoma
-ACC
-
-# Bladder Urothelial Carcinoma
-BLCA
-
-# Breast Invasive Carcinoma
-BRCA
-
-# Cervical squamous cell carcinoma and endocervical carcinoma
-CESC
-```
-
 ## `tcga_metadata_fields.txt`

 A plain text file containing the fields (= columns) found in the files describing the metadata associated with TCGA samples downloaded from the Recount3 platform.
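
For reference, the list files shown in the removed readme examples follow a simple layout: one entry per line, with blank lines and `#` comment lines interspersed. A minimal sketch of reading a file in that layout is given below. The `read_entries` helper is hypothetical and is not taken from bulkDGD; whether the remaining metadata-field files use the same comment convention is an assumption.

```python
import io


def read_entries(handle):
    """Return the non-empty, non-comment lines of a plain text list file.

    Hypothetical helper: assumes one entry per line, with '#' marking
    comment lines, as in the removed readme examples.
    """
    entries = []
    for line in handle:
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        entries.append(line)
    return entries


# In-memory stand-in for a file such as the removed `sra_codes.txt`.
example = io.StringIO("# SRA codes\n\nSRP107565\nSRP149665\n")
print(read_entries(example))
# ['SRP107565', 'SRP149665']
```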
