Rafael JPD edited this page Jun 2, 2025 · 1 revision

Description

datachecker is a command-line tool that validates the content of XML files to ensure they meet required standards. The following checks are performed:

  • Accessibility data validation
  • Affiliation data validation
  • Alternatives validation
  • App group validation
  • Article abstract validation
  • Article and subarticles validation
  • Article contributors validation
  • Article DOI validation
  • Article license validation
  • Author notes validation
  • Cross-reference validation
  • Data availability validation
  • Dates validation
  • Errata validation
  • Figure validation
  • Footnote validation
  • Formula validation
  • Funding group validation
  • Issue metadata validation
  • Journal metadata validation
  • Language validation
  • Media validation
  • Peer review validation
  • Preprint validation
  • References validation
  • Related articles validation
  • Supplementary material validation
  • Table of contents sections validation
  • Table validation

Technologies

  • Python 3.9 or later
  • lxml

Features

  • Batch Validation: Validates a single XML file or all XML files in a folder.
  • Error Reporting: Generates a CSV file summarizing validation errors.
  • Exception Logging: Outputs a JSONL file with detailed exception information.
  • Output Management: Automatically creates the output directory if it does not exist.
  • Flexible Output: Option to generate one CSV file per XML file with --csv_per_xml.
  • Command-Line Interface: Simple usage with required and optional arguments.

Future versions

  • Web interface for XML validation

Prerequisites

To use datachecker, you must have Python 3.9 or later installed. You can download it from the Python website.

Installation

datachecker is installed as part of packtools, which is available through pip. The following sections provide step-by-step installation instructions for Linux and Windows.

Linux

Create a folder, enter it, create a virtual environment called .venv, activate it, and install packtools:

mkdir scielo-packtools
cd scielo-packtools
python3 -m venv .venv
source .venv/bin/activate
pip install "packtools>=4.10.0"

Windows

Create a folder, enter it, create a virtual environment called .venv, activate it, and install packtools:

md scielo-packtools
cd scielo-packtools
python3 -m venv .venv
.venv\Scripts\activate
pip install "packtools>=4.10.0"

Usage

Before using the utility, make sure your virtual environment is active. Change to the scielo-packtools directory and activate the environment if needed. When running the command, specify the path to the XML file or folder and the desired output directory. Keep in mind that these two parameters are mandatory and must be provided in the specified order (first, the XML file or folder, then the output directory).

For Linux:

cd scielo-packtools
source .venv/bin/activate

For Windows:

cd scielo-packtools
.venv\Scripts\activate

To validate a single XML file:

data_checker.py path/to/article.xml path/to/output

To validate all XML files in a folder:

data_checker.py path/to/folder path/to/output

To validate all XML files in a folder creating one CSV file per XML file:

data_checker.py path/to/folder path/to/output --csv_per_xml

The command-line arguments are:

usage: data_checker.py [-h] [--csv_per_xml] xml_path output_path

XML data checker

positional arguments:
  xml_path       XML folder or file path
  output_path    Ouput folder path

options:
  -h, --help     show this help message and exit
  --csv_per_xml  Create one csv per xml
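
The help text above has the shape that Python's argparse module produces. As a minimal sketch, assuming the interface is defined with argparse (this is a reconstruction from the help text, not the packtools source, and build_parser is a hypothetical name), a parser with two required positional arguments and one optional flag could look like this:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical reconstruction based only on the help text shown above.
    parser = argparse.ArgumentParser(
        prog="data_checker.py", description="XML data checker"
    )
    # Positional arguments are mandatory and are read in the order declared:
    # first the XML file or folder, then the output folder.
    parser.add_argument("xml_path", help="XML folder or file path")
    parser.add_argument("output_path", help="Output folder path")
    # store_true makes the flag optional and False unless it is given.
    parser.add_argument(
        "--csv_per_xml", action="store_true", help="Create one csv per xml"
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["path/to/folder", "path/to/output", "--csv_per_xml"]
    )
    print(args.xml_path, args.output_path, args.csv_per_xml)
```

The sketch illustrates why the order of the two paths matters: positional arguments are matched by position, while the --csv_per_xml flag may appear anywhere on the command line.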

Real examples

# Validate a single XML file
python packtools/data_checker.py ~/WX7Vm7ZQm6k6d9DCQ3dXnDH.xml ~/results

In this case, two files will be created in the ~/results folder:

2025-06-02T105237913325-errors.csv  2025-06-02T105237913325-exceptions.jsonl

The first file provides a summary of the errors detected in the XML file, while the second file includes detailed exception information from the validation process. Below is an example of the 2025-06-02T105237913325-errors.csv file, which is a CSV with five columns:

xml,response,context,advice,detail
/home/rafaeljpd/WX7Vm7ZQm6k6d9DCQ3dXnDH.xml,WARNING,bibliographic strip,Unable to check if issue is registered,"{'volume': '38', 'number': None, 'supplement': None}"
/home/rafaeljpd/WX7Vm7ZQm6k6d9DCQ3dXnDH.xml,WARNING,subject,"Unable to check if Original Article (<subject-group subj-group-type=""heading""><subject>Original Article</subject></subject-group>) is a valid table of contents section because the journal (ABCD. Arquivos Brasileiros de Cirurgia Digestiva (São Paulo)) sections were not informed","{'parent': 'article', 'parent_id': None, 'parent_lang': 'en', 'parent_article_type': 'research-article', 'original_article_type': 'research-article', 'journal': 'ABCD. Arquivos Brasileiros de Cirurgia Digestiva (São Paulo)', 'article_title': 'FECAL CALPROTECTIN AND INTESTINAL METABOLITES: WHAT IS THEIR IMPORTANCE IN THE ACTIVITY AND DIFFERENTIATION OF PATIENTS WITH INFLAMMATORY BOWEL DISEASES?', 'subject': 'Original Article', 'subj_group_type': 'heading', 'section': 'Original Article', 'subsections': []}"
/home/rafaeljpd/WX7Vm7ZQm6k6d9DCQ3dXnDH.xml,CRITICAL,contrib,Lucas Correia LINS : Mark the contrib role. Consult SPS documentation for detailed instructions,"{'contrib-group-type': None, 'parent': 'article', 'parent_id': None, 'parent_lang': 'en', 'parent_article_type': 'research-article', 'original_article_type': 'research-article', 'contrib_type': 'author', 'contrib_ids': {'orcid': '0000-0001-6355-2775'}, 'contrib_name': {'surname': 'LINS', 'given-names': 'Lucas Correia'}, 'contrib_full_name': 'Lucas Correia LINS', 'contrib_xref': [{'rid': 'aff1', 'ref_type': 'aff', 'text': '1'}]}"

The CSV file contains five columns:

  • xml: the path to the validated XML file.
  • response: the severity or type of the validation result.
  • context: where in the XML the issue was found.
  • advice: a brief recommendation or explanation.
  • detail: additional information about the validation finding.
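
Because the errors file is a regular CSV with a header row, it can be post-processed with Python's standard csv module. The following is a minimal sketch, not part of packtools: the function names count_by_response and rows_with_response are hypothetical, and the file is assumed to be UTF-8 encoded. Only the column names come from the example above.

```python
import csv
from collections import Counter


def count_by_response(csv_path):
    """Count rows per value of the 'response' column (e.g. WARNING, CRITICAL)."""
    # Assumption: the file is UTF-8 encoded.
    with open(csv_path, newline="", encoding="utf-8") as f:
        return Counter(row["response"] for row in csv.DictReader(f))


def rows_with_response(csv_path, response):
    """Return every row (as a dict keyed by column name) with the given response."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return [row for row in csv.DictReader(f) if row["response"] == response]
```

For instance, rows_with_response(path, "CRITICAL") would return only the critical findings, each with its xml, context, advice, and detail values, which is a convenient starting point for triage.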

Below is an example command to validate all XML files in a folder:

# Validate all XML files in a folder
python packtools/data_checker.py ~/xml_folder ~/results
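
After a run, the output files can be located and loaded from Python. The sketch below is not part of packtools. It assumes that folder runs use the same timestamp-prefixed names shown in the single-file example (…-errors.csv and …-exceptions.jsonl), so that sorting by name also sorts by time; the names latest_outputs and read_jsonl are hypothetical. No assumption is made about the keys inside each JSONL record.

```python
import json
from pathlib import Path


def latest_outputs(results_dir):
    """Return (errors_csv, exceptions_jsonl) with the newest timestamp prefix.

    Either element is None when no matching file exists.
    Assumption: file names start with a sortable timestamp, as in the
    single-file example.
    """
    results = Path(results_dir)
    errors = sorted(results.glob("*-errors.csv"))
    exceptions = sorted(results.glob("*-exceptions.jsonl"))
    return (
        errors[-1] if errors else None,
        exceptions[-1] if exceptions else None,
    )


def read_jsonl(path):
    """Parse a JSONL file: one JSON value per non-empty line."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```

An empty list from read_jsonl would mean the exceptions file contains no records.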
