This repository is for CSVs of DS data at various stages of extraction, transformation, and enrichment. It also includes RDF and JSON data extracted from our linked database for display and search through a user interface.
New files should be called: DATE-name.csv
, where DATE
is the date the file was created in YYYYMMDD
format and name
is a description, like jhu
, combined
, upenn
, etc. When more than one descriptor is applied, descriptors are separated by a dash, such as in 20220920-language-combined-enriched
. In addition, when element
is provided in the instructions, it is a description of the metadata element, field, or type of data, such as in languages.csv
, named-subjects-unreconciled.csv
, or 20220705-places-combined-enriched.csv
.
- Navigate to
member-data
directory. - Create sub-directory for institution using code list (if directory does not exist).
- Create sub-sub-directory based on filename DATE in YYYY-MM format (if directory does not exist).
- Click on
Add File
button and clickUpload files
from context menu. - Drag and drop or
choose your files
to be uploaded. Commit changes
directly to main branch.
- Navigate to
ds-data/terms/batch
directory. - Create sub-directory based on filename DATE in YYYY-MM format (if does not exist).
- Click on
Add File
button and clickUpload files
from context menu (file to be uploaded should beDATE-element-enriched.csv
). - Drag and drop or
choose your files
to be uploaded. Commit changes
directly to main branch.
See also instructions at ds-open-refine/instructions/
- Navigate to
ds-data/terms/reconciled
directory - Click on
Add File
button and clickUpload files
from context menu (file to be uploaded should be 'element.csv`). - Drag and drop or
choose your files
to be uploaded. Commit changes
directly to main branch.
.
├── README.md
├── Workflow-README-template.md
├── config.yml
├── member-data
│ ├── burke
│ │ ├── 2022-02
│ │ │ ├── 2022-02-23-burke-marc-combined-import.csv
│ │ │ ├── 2022-02-25-burke-marc-mmw-import.csv
│ │ ├── 2022-04
│ │ │ ├── 2022-04-04-burke-hebrew-import.csv
│ ├── ccny
│ │ ├── 2022-02
│ │ │ ├── 2022-02-23-ccny-mets-import.csv
│ │ │ ├── 2022-02-25-ccny-csv-import.csv
│ ├── columbia
│ │ ├── 2023-01
│ │ │ ├── 2022-02-23-columbia-marcxml-european-import.csv
etc.
├── split_data.rb
├── test
│ ├── missing_inst_name.csv
│ ├── missing_qid.csv
│ └── unknown_qid.csv
└── workflow
├── 2022-02-23-combined-README.md
├── 2022-02-23-combined-enriched.csv
├── 2022-02-23-combined.csv
├── 2022-04-04-combined-README.md
├── 2022-04-04-combined-enriched.csv
└── 2022-04-04-combined.csv
CSVs in the workflow directory should be split into institution-specific
directories. The split_data.rb
script splits the CSV on the QID in the
holding_institution
column and puts the file in folder as defined in the
config.yml
file.
The configuration file contains the QID, name and a single-word folder for each institution. New repositories should be added to the configuration. The format of the entries is like so:
---
- :qid: Q814779
:name: Beinecke Rare Book & Manuscript Library
:directory: beinecke
- :qid: Q995265
:name: Bryn Mawr College
:directory: brynmawr
- :qid: Q63969940
:name: Burke Library at Union Theological Seminary
:directory: burke
split_data.rb
validates the config file and the CSV.