Newspaper Curation App

NCA on Github

Fixing Flagged Workflow Issues

This refers to issues which were already queued from their uploaded location, had derivatives generated, and were ready for metadata entry.

The metadata entry person flags errors to say essentially, “NCA’s UI cannot fix the problems on this issue”. We have seen a variety of problems like this:

  • The PDF or JP2 image derivatives are corrupt in some way, even though the tools which generated them seemed to run without errors
  • The pages are out of order - somebody reordered pages incorrectly, and the issue now has to be manually pulled, fixed, and re-inserted into the workflow
  • The issue is incorrect in some other way, and wasn’t caught when queueing from the uploads area (e.g., a publisher uploaded two issues in the same location, pages were missing from an upload, etc.)

Most errors can be caught prior to queueing an issue for processing, so it is very important that curators be aware of the additional cost of having to fix issues that are incorrect after they’ve gotten into the workflow.

Identifying bad issues

NCA provides a place for privileged users to process “unfixable” errors. Anybody with the “Issue Manager” role can see a tab in the Workflow section of the application labeled “Unfixable Errors”. In this tab, issue managers can claim and then process these issues, choosing to return them back to NCA if they were flagged incorrectly, or move them to a configured error location (ERRORED_ISSUES_PATH in the settings file).

When moved to the error location, the issues will be put into a directory based on the current month (e.g., /mnt/news/errors/2024-01) so that they’re somewhat organized without having so many subdirectories as to make the process more painful than necessary.

Within the month subdirectory, issues will be identifiable by their LCCN, date, edition, and database id in the same way they existed in the workflow location. This will look something like sn96088087-2010041901-1. Under that directory you will find content and derivatives. The derivatives are preserved just in case debugging is necessary (e.g., if a JP2 is broken, but the source PDF seems fine). The content directory will contain the source files, including original uploads in the case of publishers’ sftp-delivered files, in an archive called original.tar.

Additionally, a file called actions.txt will be present and describe all actions taken on the issue along with any comments written by curators and reviewers.

Fixing removed issues

This is a much more difficult problem to solve, because of the wide variety of errors that can occur. Fixing problems will typically require a manual examination of each issue that was removed. The actions.txt file should help understand what caused the issue to be removed, but that won’t necessarily help fix the problem.

There are cases where the only option is to delete the issue entirely and accept that it will not be able to be a part of your archive. This can happen if the publisher uploaded the wrong issue and no longer has access to the original files, or if scanned papers’ TIFFs were corrupt and the original paper is no longer available.

Putting issues back into the workflow

If an issue can be fixed and put back into the workflow, there are typically two options:

  • Pretend it’s a new upload and start right at the beginning. Derivative files must be removed, and the uploads must conform to the file and folder specs as defined in our file/folder upload specs
  • Just delete the issue folder and get it re-scanned or re-uploaded from the publisher. In this case, the normal procedures are followed and the issue will be completely new for all intents and purposes.

Database manipulation is almost never the right approach once issues have been moved out of the workflow. If you believe that you need to manipulate the database to get an issue back into the workflow, you need to be 100% certain you understand the application as well as the precise meaning of every field in every table.

Generally speaking, if database manipulation is the correct approach, it should have been done instead of reporting an error and removing the issue.