Skip to content

Archival Data Quality

Asbjørn Skødt edited this page Aug 10, 2022 · 15 revisions

Introduction

Archiving preserves data for prosterity. The goal of archiving documents is to take a snapshot of the data as they were at the time of archiving. Then, the documents may at a later time be made accessible to a user, either the original data producer, a researcher or anyone interested, who have legitimate access rights to the documents. Ideally, the document should be as similar as possible to the original document, they were copied from, both in content and structure, but technical changes may be performed by the archive in order to ensure persistent accessibility to the document throughout changing technological landscapes. This may result in data loss, but if so it must be reduced to the minimum and it must be documented.

Here follows a list of requirements, which CLISC checks for and performs actions to the archived spreadsheet in order to make sure the spreadsheet meets the necessary data quality level for long-term storage. If archiving is not selected when using CLISC, the list of requirements will not be followed.

Archival requirements

Must have data

Spreadsheets containing zero data have no value for later reuse and hence no value for archiving.

Risk

There's two risks for spreadsheets, either having no sheets or no cells with data. Fortunately, Excel does not allow user to save a spreadsheet without any sheets and a COM exception is also thrown when using Excel Interop.

System.Runtime.InteropServices.COMException: 'The project folder must have at least one visible worksheet.

However, other ways of deleting the sheets without using Excel might exist, and secondly in Excel saving a spreadsheet without any cell values is allowed.

Solution

Check for number of sheets (minimum one) and check for presense of cell value in any sheet in the spreadsheet (minimum one).

Embedded objects

Risk

Solution

External Relationships

External relationships include linked OLE objects (they are not embedded) and linked cell values fetched from another spreadsheet. Both external relationships relate to files in your local directory, and which relationship will be broken, if the files or spreadsheet is removed from the local directory.

Risk

Data from the context of external relationships will be lost, if a spreadsheet is archived, which typically means removing the spreadsheet from it's local environment parameters if the archive is pursuing a data migration strategy.

Example of data loss

Solution

Remove any external relationships. In the case of linked cell values, keep the actual cell values as the snapshot. In the case of unembedded OLE objects, they should be handled manually and the object should ideally be archived separately and the relationship documented in metadata. Alternative solution is to embed the OLE object.

RTD functions

Microsoft Office Excel provides a worksheet function, RealTimeData (RTD). This function enables you to call a Component Object Model (COM) Automation server to retrieve data in real time.

The RTD function uses the following syntax =RTD(RealTimeServerProgID,ServerName,Topic1,[Topic2], ...)

Source

Risk

When opening an archived spreadsheet many years from now, the server connection might still be available and the values will auto-update. This intervenes with the archival goal of taking snapshots of data at a given time.

Solution

Remove any RTD formula functions in cells in any sheet in a spreadsheet, but keep the actual cell values as the snapshot.

Clone this wiki locally