XML Data Checker

Description

datachecker is a command-line tool that validates the content of XML files to ensure they meet required standards. The following checks are performed:

Accessibility data validation
Affiliation data validation
Alternatives validation
App group validation
Article abstract validation
Article and subarticles validation
Article contributors validation
Article DOI validation
Article license validation
Author notes validation
Cross-reference validation
Data availability validation
Dates validation
Errata validation
Figure validation
Footnote validation
Formula validation
Funding group validation
Issue metadata validation
Journal metadata validation
Language validation
Media validation
Peer review validation
Preprint validation
References validation
Related articles validation
Supplementary material validation
Table of contents sections validation
Table validation

Technologies

Python 3.x
lxml

Features

Batch Validation: Validates a single XML file or all XML files in a folder.
Error Reporting: Generates a CSV file summarizing validation errors.
Exception Logging: Outputs a JSONL file with detailed exception information.
Output Management: Automatically creates the output directory if it does not exist.
Flexible Output: Option to generate one CSV file per XML file with --csv_per_xml.
Command-Line Interface: Simple usage with required and optional arguments.
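
The batch validation and output management behaviors listed above can be pictured with a short sketch. This is not the tool's source code: `collect_xml_paths` and `ensure_output_dir` are hypothetical helper names, and the choice to pick up only top-level files ending in `.xml` is an assumption made for illustration.

```python
from pathlib import Path


def collect_xml_paths(xml_path):
    """Return the XML files to validate for a file or folder path (hypothetical helper)."""
    path = Path(xml_path)
    if path.is_file():
        # A single file was given: validate just that file.
        return [path]
    if path.is_dir():
        # Assumption: only top-level *.xml files are considered (no recursion).
        return sorted(
            p for p in path.iterdir()
            if p.is_file() and p.suffix.lower() == ".xml"
        )
    raise FileNotFoundError(f"No such file or folder: {xml_path}")


def ensure_output_dir(output_path):
    """Create the output directory (and any missing parents) if it does not exist."""
    out = Path(output_path)
    out.mkdir(parents=True, exist_ok=True)
    return out
```

Using `exist_ok=True` makes the directory creation safe to repeat, which matches the idea that the output folder is created only when it is missing.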

Future versions

Web interface for XML validation

Prerequisites

To use the XML Data Checker, you must have Python 3.9 or greater installed. You can download it from the Python website.

Installation

Packtools can be installed using pip. The following sections provide step-by-step instructions for installation on both Linux and Windows systems.

Linux

Create a folder, enter it, create a virtual environment called .venv, activate it, and install packtools:

mkdir scielo-packtools
cd scielo-packtools
python3 -m venv .venv
source .venv/bin/activate
pip install "packtools>=4.10.0"

Windows

Create a folder, enter it, create a virtual environment called .venv, activate it, and install packtools:

md scielo-packtools
cd scielo-packtools
python3 -m venv .venv
.venv\Scripts\activate
pip install "packtools>=4.10.0"

Usage

Before using the utility, make sure your virtual environment is active: change to the scielo-packtools directory and activate the environment if needed. When running the command, specify the path to the XML file or folder, followed by the output directory. Both arguments are mandatory and must be given in that order.

For Linux:

cd scielo-packtools
source .venv/bin/activate

For Windows:

cd scielo-packtools
.venv\Scripts\activate

To validate a single XML file:

data_checker.py path/to/article.xml path/to/output

To validate all XML files in a folder:

data_checker.py path/to/folder path/to/output

To validate all XML files in a folder creating one CSV file per XML file:

data_checker.py path/to/folder path/to/output --csv_per_xml

Here are the command-line arguments:

usage: data_checker.py [-h] [--csv_per_xml] xml_path output_path

XML data checker

positional arguments:
  xml_path       XML folder or file path
  output_path    Ouput folder path

options:
  -h, --help     show this help message and exit
  --csv_per_xml  Create one csv per xml
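
For readers who want to see how an interface like this is typically declared, here is a minimal `argparse` sketch that yields a similar help message. It is an illustration written for this page, not the actual source of data_checker.py; `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser with two positional arguments and one optional flag."""
    parser = argparse.ArgumentParser(
        prog="data_checker.py",
        description="XML data checker",
    )
    # Positional arguments are mandatory and are read in this order.
    parser.add_argument("xml_path", help="XML folder or file path")
    parser.add_argument("output_path", help="Output folder path")
    # store_true makes the flag default to False when it is omitted.
    parser.add_argument(
        "--csv_per_xml",
        action="store_true",
        help="Create one csv per xml",
    )
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["path/to/folder", "path/to/output", "--csv_per_xml"]
)
print(args.xml_path, args.output_path, args.csv_per_xml)
```

Because both paths are positional, swapping them on the command line is not detected by the parser, which is why the order matters.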

Real examples

# Validate a single XML file
python packtools/data_checker.py ~/WX7Vm7ZQm6k6d9DCQ3dXnDH.xml ~/results

In this case, two files will be created in the ~/results folder:

2025-06-02T105237913325-errors.csv  2025-06-02T105237913325-exceptions.jsonl

The first file provides a summary of the errors detected in the XML file, while the second file includes detailed exception information from the validation process. Below is an example of the 2025-06-02T105237913325-errors.csv file:

xml,response,context,advice,detail
/home/rafaeljpd/WX7Vm7ZQm6k6d9DCQ3dXnDH.xml,WARNING,bibliographic strip,Unable to check if issue is registered,"{'volume': '38', 'number': None, 'supplement': None}"
/home/rafaeljpd/WX7Vm7ZQm6k6d9DCQ3dXnDH.xml,WARNING,subject,"Unable to check if Original Article (<subject-group subj-group-type=""heading""><subject>Original Article</subject></subject-group>) is a valid table of contents section because the journal (ABCD. Arquivos Brasileiros de Cirurgia Digestiva (São Paulo)) sections were not informed","{'parent': 'article', 'parent_id': None, 'parent_lang': 'en', 'parent_article_type': 'research-article', 'original_article_type': 'research-article', 'journal': 'ABCD. Arquivos Brasileiros de Cirurgia Digestiva (São Paulo)', 'article_title': 'FECAL CALPROTECTIN AND INTESTINAL METABOLITES: WHAT IS THEIR IMPORTANCE IN THE ACTIVITY AND DIFFERENTIATION OF PATIENTS WITH INFLAMMATORY BOWEL DISEASES?', 'subject': 'Original Article', 'subj_group_type': 'heading', 'section': 'Original Article', 'subsections': []}"
/home/rafaeljpd/WX7Vm7ZQm6k6d9DCQ3dXnDH.xml,CRITICAL,contrib,Lucas Correia LINS : Mark the contrib role. Consult SPS documentation for detailed instructions,"{'contrib-group-type': None, 'parent': 'article', 'parent_id': None, 'parent_lang': 'en', 'parent_article_type': 'research-article', 'original_article_type': 'research-article', 'contrib_type': 'author', 'contrib_ids': {'orcid': '0000-0001-6355-2775'}, 'contrib_name': {'surname': 'LINS', 'given-names': 'Lucas Correia'}, 'contrib_full_name': 'Lucas Correia LINS', 'contrib_xref': [{'rid': 'aff1', 'ref_type': 'aff', 'text': '1'}]}"

The CSV file contains five columns: xml, response, context, advice, and detail. The xml column shows the path to the validated XML file. The response column indicates the severity or type of the validation result (WARNING and CRITICAL in the sample above). The context column describes where in the XML the issue was found (for example, contrib or subject). The advice column provides a brief recommendation or explanation. The detail column offers additional information about the validation finding, which in the sample rows is a dictionary-like string.

Below is an example command to validate all XML files in a folder:

# Validate all XML files in a folder
python packtools/data_checker.py ~/xml_folder ~/results
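
After several runs, the results folder may hold more than one report. The sketch below picks the most recent errors CSV by file name. It assumes the prefix seen in the earlier example (2025-06-02T105237913325) is a timestamp, so that sorting by name is sorting by time; `latest_errors_csv` is a hypothetical helper, and file naming under --csv_per_xml may differ.

```python
from pathlib import Path


def latest_errors_csv(results_dir):
    """Return the newest *-errors.csv in results_dir, or None if there is none."""
    # expanduser() resolves a leading ~ such as in ~/results.
    candidates = sorted(Path(results_dir).expanduser().glob("*-errors.csv"))
    # Assumption: the name prefix is a timestamp, so name order is time order.
    return candidates[-1] if candidates else None
```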

