Newspaper Curation App

NCA on GitHub

Technical Details

This document explains the entire workflow from upload to batch generation so that developers can understand what’s needed and at least begin investigating when something goes wrong.

Jobs and the Job Queue

All background work in NCA is made up of relatively small parts tied together in a single “pipeline”. A pipeline represents a distinct operation composed of smaller units, the jobs themselves. A job is usually the smallest atomic “thing” we can run: updating an issue’s status in the database, calling out to openjpeg to generate JP2 derivatives from an issue’s PDFs, etc. We attempt to make all jobs idempotent: re-running a job that has already run should never change the database, filesystem, or application state.
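
As a rough illustration of what “idempotent” means here (the type, table, and column names below are invented for the example, not NCA’s real code), a status-update job might check the current state before doing anything, so running it a second time is a no-op:

    package jobs

    import "database/sql"

    // Hypothetical sketch of an idempotent job; the type, table, and column
    // names are illustrative, not NCA's real code. The job checks current
    // state first, so re-running it after a success changes nothing.
    type SetIssueStatus struct {
        IssueID   int64
        NewStatus string
    }

    func (j *SetIssueStatus) Run(db *sql.DB) error {
        var current string
        err := db.QueryRow(`SELECT status FROM issues WHERE id = ?`, j.IssueID).Scan(&current)
        if err != nil {
            return err
        }
        if current == j.NewStatus {
            return nil // already applied on a previous run; do nothing
        }
        _, err = db.Exec(`UPDATE issues SET status = ? WHERE id = ?`, j.NewStatus, j.IssueID)
        return err
    }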

The pipeline organizes jobs into more complex operations. For instance, when it’s time to pull PDFs from SFTP into NCA, that generates a pipeline consisting of over a dozen atomic jobs: updating the issue’s status so NCA knows it’s being worked on, copying the files to the workflow location, splitting pages, etc. Even the “move files” operation is built from idempotent steps: one job copies the files, a second verifies that the copied files are correct, and a third removes the source files.
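
A loose sketch of that shape, with hypothetical names rather than NCA’s real pipeline API (copyTree and verifyTree are placeholders for the real copy and verify jobs):

    package jobs

    import "os"

    // Hypothetical sketch, not NCA's real pipeline API: a pipeline is an
    // ordered list of small jobs, each safe to re-run on its own.
    type Pipeline struct {
        Name string
        Jobs []func() error // always executed in order, never in parallel
    }

    // The "move files from SFTP" operation modeled as three idempotent jobs
    // rather than one destructive move.
    func sftpMovePipeline(src, dest string) *Pipeline {
        return &Pipeline{
            Name: "move-sftp-issue",
            Jobs: []func() error{
                func() error { return copyTree(src, dest) },   // copy; source left intact
                func() error { return verifyTree(src, dest) }, // compare the copy to the originals
                func() error { return os.RemoveAll(src) },     // only after verification, remove the source
            },
        }
    }

    func copyTree(src, dest string) error   { return nil } // placeholder for the real copy job
    func verifyTree(src, dest string) error { return nil } // placeholder for the real verify job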

The job runner, started by the run-jobs binary, regularly scans the database looking for jobs to run. The default setup has different queues to keep I/O-heavy jobs, such as derivative generation, from delaying fast jobs like small database updates. This makes NCA more efficient, as jobs can run in parallel when there won’t be resource contention. Jobs in the same pipeline will never be run in parallel, as it’s assumed there are dependencies from one to the next, but when multiple pipelines are queued up, NCA will process whatever is next in each pipeline.
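
Roughly, and with made-up names rather than NCA’s actual scheduler, the loop looks like this: find at most one pending job per pipeline, run those concurrently, then scan again:

    package jobs

    import (
        "sync"
        "time"
    )

    // Hypothetical sketch of the scan-and-run loop; structure and names are
    // illustrative, not NCA's actual scheduler. nextJobPerPipeline would
    // query the database for, at most, one pending job per pipeline.
    type QueuedJob struct {
        PipelineID int64
        Run        func() error
    }

    func runLoop(nextJobPerPipeline func() []QueuedJob) {
        for {
            var wg sync.WaitGroup
            for _, j := range nextJobPerPipeline() {
                wg.Add(1)
                go func(j QueuedJob) { // separate pipelines run concurrently...
                    defer wg.Done()
                    _ = j.Run() // ...but each pipeline only ever runs its next job
                }(j)
            }
            wg.Wait()
            time.Sleep(10 * time.Second) // scan interval is illustrative
        }
    }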

If you’re trying to watch the job logs as a whole, this can be confusing: each pipeline’s jobs run in sequence, but different pipelines can be running at the same time, so the combined log can look chaotic. To follow a single operation, group the log entries by pipeline to make sense of what’s going on.

The job runner also looks for issues in the scan and page review areas that are ready to enter the workflow. These aren’t actual jobs and aren’t tied to pipelines; they’re handled by a separate background task that’s always watching those locations.

All jobs store logs in the database, but these are currently not exposed to end users (not even admins). To help mitigate this, the job runner also logs to STDERR, though without pipeline filtering those logs can be tricky to parse unless you use a more capable log-filtering tool.

Uploads

Whenever issues are uploaded into NCA, the application’s “Uploaded Issues” pages will display them along with any obvious errors the application was able to detect. After a reasonable amount of time (to ensure uploading is complete; some publishers slowly upload issue pages throughout the day, or even over multiple days), issues may be queued up for processing. Issues that are too new will still be displayed, but queueing will be disabled.

Born-digital issues, when queued, are preprocessed (to ensure derivatives can be generated, to force one PDF per page, etc.), then moved into the page review area. The pages are named sequentially in the format seq-dddd.pdf: seq-0001.pdf, seq-0002.pdf, and so on. These PDFs might already be ordered correctly, but we’ve had to manually reorder them often enough that we decided an out-of-band process for reviewing and reordering is necessary. An easy approach is to have somebody use something like Adobe Bridge to review and rename in bulk. Once review is complete, an issue’s pages need to be ordered by their filenames, e.g., 0001.pdf, 0002.pdf, etc. Until all of an issue’s pages have fully numeric names, the job runner will not pick the issue up.
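
For illustration, a check along these lines (a sketch, not NCA’s actual code) would treat seq-0001.pdf as “not ready” and 0001.pdf as ready:

    package pagereview

    import (
        "path/filepath"
        "regexp"
    )

    // Sketch only, not NCA's actual check: every page of an issue must have
    // a fully numeric base name (0001.pdf, 0002.pdf, ...) before the job
    // runner will pick the issue up; seq-0001.pdf is treated as "not ready".
    var numericPDF = regexp.MustCompile(`^\d+\.pdf$`)

    func readyForPickup(pagePaths []string) bool {
        for _, p := range pagePaths {
            if !numericPDF.MatchString(filepath.Base(p)) {
                return false
            }
        }
        return len(pagePaths) > 0
    }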

Note: if issue folders are deleted from the page review location for any reason, they must be cleaned up manually: Handling Page Review Problems. Once NCA is tracking uploads, deleting them outside the system will cause error logs to go a bit haywire, and the issues can’t be re-uploaded since NCA will believe they quasi-exist.

For scanned issues, since they’re produced in-house for us, it is assumed they’re already properly named (<number>.tif and <number>.pdf) and ordered, so after being queued they are immediately moved and processed for derivatives.
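
A sketch of what “properly named” amounts to for scanned issues (again illustrative, not NCA’s code): every numbered page should exist as a matching TIFF/PDF pair.

    package scans

    import (
        "path/filepath"
        "strings"
    )

    // Sketch only: for scanned issues, every numbered page should exist as a
    // <number>.tif / <number>.pdf pair before derivative processing.
    func pagesPaired(files []string) bool {
        tiffs, pdfs := map[string]bool{}, map[string]bool{}
        for _, f := range files {
            base := filepath.Base(f)
            name := strings.TrimSuffix(base, filepath.Ext(base))
            switch strings.ToLower(filepath.Ext(base)) {
            case ".tif":
                tiffs[name] = true
            case ".pdf":
                pdfs[name] = true
            }
        }
        if len(tiffs) == 0 || len(tiffs) != len(pdfs) {
            return false
        }
        for name := range tiffs {
            if !pdfs[name] {
                return false
            }
        }
        return true
    }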

The bulk upload queue tool (compiled to bin/bulk-issue-queue) can be used to push all issues of a given type (scan vs. born-digital) and issue key into the workflow as if they’d been queued from the web app. It should only be run when people aren’t using the NCA queueing front-end: it queues things faster than the NCA cache is refreshed, which can leave NCA’s web view out of sync with reality. The data will be intact, but it can be confusing. Also note that for scanned issues this tool can take a long time, because it verifies the DPI of all images embedded in the PDFs.

Derivative Processing

Once issues are ready for derivatives (born-digital issues have been queued, pre-processed, and renamed; scanned issues have been queued and moved), a job is queued for derivative processing. This creates JP2 images from either the PDFs (born-digital) or the TIFFs (scanned), as well as ALTO-compatible OCR XML based on the text in the PDF. In our process, the PDFs for scanned issues are created by OCRing the TIFFs; that step is manual and out-of-band, since we rely on Abbyy and there isn’t a particularly easy way to integrate it into our workflow.

The derivative generation process is probably the slowest job in the system. As such, it is particularly susceptible to things like server power outages. If a job is interrupted mid-operation, somebody will have to modify the database to change the job’s status from in_process to pending.
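
If you do have to do that, it amounts to a one-row update; here’s a hedged sketch, where the “jobs” table and “status” column names are assumptions you should verify against your actual NCA schema before running anything:

    package maintenance

    import "database/sql"

    // Sketch only: reset a job that was interrupted mid-run so the runner
    // picks it up again. The "jobs" table and "status" column names are
    // assumptions; verify them against your actual NCA schema first.
    func resetStuckJob(db *sql.DB, jobID int64) error {
        _, err := db.Exec(
            `UPDATE jobs SET status = 'pending' WHERE id = ? AND status = 'in_process'`,
            jobID,
        )
        return err
    }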

The derivative jobs are very fault-tolerant:

  • Derivatives are generated in a temporary location, and only moved into the issue folder after the derivative has been generated successfully
  • Derivatives which were already created are not rebuilt

These two factors make it easy to re-kick-off a derivative process without worrying about data corruption.
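
A minimal sketch of that pattern (not NCA’s derivative code; buildJP2 is a stand-in for the real openjpeg call): skip work that’s already done, build into a temporary file, and only move it into place on success.

    package derivatives

    import (
        "os"
        "path/filepath"
    )

    // Sketch of the two safety rules above, not NCA's derivative code:
    // skip derivatives that already exist, and only move a new derivative
    // into the issue folder after it was built successfully.
    func makeDerivative(srcPage, destJP2 string, buildJP2 func(src, dest string) error) error {
        if _, err := os.Stat(destJP2); err == nil {
            return nil // built on a previous run; leave it alone
        }
        tmp := filepath.Join(os.TempDir(), filepath.Base(destJP2)+".wip")
        if err := buildJP2(srcPage, tmp); err != nil {
            os.Remove(tmp) // a failed build never touches the issue folder
            return err
        }
        // A real implementation would copy+remove if tmp and the issue
        // folder live on different filesystems, where a rename fails.
        return os.Rename(tmp, destJP2)
    }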

Note that different OSes can report to NCA that a write succeeded while the files still haven’t been fully synced to disk. This is out of NCA’s control, and it is exceedingly rare that it causes problems, but really unusual events (like a very unfortunately-timed power failure or a catastrophic OS crash) can leave things in a state NCA can’t do anything about. These kinds of events are virtually nonexistent even when power failures occur, but there are ways to help prevent problems:

  • Make sure your system has a UPS so small power failures don’t cause problems.
  • Make sure your system’s got enough disk space! Disk exhaustion is one of the worst problems that even modern OSes still handle very poorly.
  • Replace faulty hardware! A hard-crash that’s bad enough can interrupt a process before the OS has a chance to finalize file I/O.

Error Reports

If an issue has some kind of problem which cannot be fixed with metadata entry, the metadata person will report an error. Once an error is reported, the issue will be hidden from all but Issue Managers in the NCA UI and one of them will have to decide how to handle it. See Fixing Flagged Workflow Issues.

Post-Metadata / Batch Generation

After metadata has been entered and approved, the issue is considered “done”. An issue XML is generated (using the METS template defined by the METS_XML_TEMPLATE_PATH setting), and born-digital issues’ original PDFs are moved into the issue location for safe-keeping. Assuming these steps complete without error, the issue is marked “ready for batching”.

A “batch builder” can then select organizations (e.g., the MARC org codes) they want batches built for by visiting the “Create Batches” page in NCA. General high-level aggregate data should give the batch builder enough information to choose what to batch, after which they decide how big the batches should be.

Alternatively, the batch queue command-line script (compiled to bin/queue-batches) grabs all issues which are ready to be batched, organizes them by organization (a.k.a., MARC Org Code / awardee) for batching (each awardee must have its issues in a separate batch), and generates batches if there are enough pages (see the MINIMUM_ISSUE_PAGES setting).

Note: the MINIMUM_ISSUE_PAGES setting will be ignored if any issues waiting to be batched have been ready for batching for more than 30 days. This is necessary to handle cases where an issue had to have special treatment after the bulk of a batch was completed, and would otherwise just sit and wait indefinitely.
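
A sketch of that decision logic (illustrative only, not the real bin/queue-batches code): sum the pages for an awardee’s ready issues, but batch regardless of the minimum when any issue has been waiting over 30 days.

    package batching

    import "time"

    // Sketch only, not the real bin/queue-batches logic: a batch is built
    // for an awardee once its ready issues reach MINIMUM_ISSUE_PAGES, or as
    // soon as any issue has been waiting for more than 30 days.
    type ReadyIssue struct {
        Pages      int
        ReadySince time.Time
    }

    func shouldBatch(issues []ReadyIssue, minimumIssuePages int, now time.Time) bool {
        var pages int
        for _, i := range issues {
            pages += i.Pages
            if now.Sub(i.ReadySince) > 30*24*time.Hour {
                return true // an issue has waited over 30 days: ignore the minimum
            }
        }
        return pages >= minimumIssuePages
    }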

Batch Management

Once a batch is queued for generation:

  • The files will be put into the configured BATCH_OUTPUT_PATH
  • The live files (non-TIFF, non-tar originals, etc.) are synced to the BATCH_PRODUCTION_PATH
  • NCA sends a command to the staging ONI Agent (configured via STAGING_AGENT) to load the batch
  • NCA polls the agent until the batch load is reported as successful
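
The last two steps amount to “submit, then poll until done”; a rough sketch with hypothetical function names (the real ONI Agent calls may differ):

    package batchload

    import (
        "errors"
        "time"
    )

    // Sketch only: submit a batch load to the staging ONI Agent
    // (STAGING_AGENT), then poll until it reports a result. submit and
    // checkStatus stand in for whatever the agent's real calls are.
    func loadAndWait(batchName string, submit func(string) error,
        checkStatus func(string) (done, failed bool, err error)) error {
        if err := submit(batchName); err != nil {
            return err
        }
        for {
            done, failed, err := checkStatus(batchName)
            if err != nil {
                return err
            }
            if failed {
                return errors.New("agent reported a failed batch load")
            }
            if done {
                return nil
            }
            time.Sleep(time.Minute) // polling interval is illustrative
        }
    }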

Once all these jobs are complete, the batch will be visible to batch reviewers, letting them know action is needed to approve the batch. The batch page will have a link to the staging environment’s batch page for easier review, as well as two possible actions to take: approve the batch for production or reject it from staging due to problems in one or more issues.

If rejected, batch reviewers will need to find and flag the problem issues so NCA can process the rest of the batch. Issues will be flagged as unfixable (moving to a state where issue managers will have to take action), and the batch reviewer will need to enter a comment to help identify what was wrong. Once issues are done being flagged, the batch reviewer can finalize the batch, rebuilding it with only the good issues, and NCA will reload it on staging where it will be ready for another round of QC.

Once a batch has been approved in staging, NCA will contact the production ONI Agent (configured via PRODUCTION_AGENT) to load it live, and then poll the agent regularly until the batch load has completed.

After batches are live, NCA will move the original batch and any backups (the source PDFs for born-digital batches, for instance) to the archival location specified by the configured BATCH_ARCHIVE_PATH. At this point the batch is ready for final archival.

Batch Archival

At UO, our archive path is actually a “staging area” for batches which are getting ready for a move to the dark archive. Your process may not be quite the same, but hopefully this is of help even if only to understand why NCA handles batches the way it does.

Once we have enough content to justify a push to the dark archive, we move a pile of batches from the staging area into the transfer area. Our archival team runs the process (they generate their own manifests, for instance, and sync all files to the archive). When they confirm that batches are archived, we flag them in NCA as such.

Whether or not you follow that process, you will still need to specify an archive path (BATCH_ARCHIVE_PATH) in your settings file and flag batches as archived. Your archive path may be a direct mount to your final archive, a location you manage manually, or some dummy location you simply delete if you aren’t preserving the original content for some reason.

Once flagged as archived, NCA stores the archival date and time. This is important for knowing when it’s safe to clean up the files. The issue deletion script (bin/delete-live-done-issues, created by a standard make run) will look for batches archived more than four weeks ago and then completely delete all files in NCA tied to these batches. The files in your archive will not be removed, of course, but NCA will ensure its workflow directories are cleaned up to make room for new incoming files.
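
The cutoff itself is simple date arithmetic; a sketch of the check (not the script’s actual code):

    package cleanup

    import "time"

    // Sketch only: a batch's workflow files become eligible for deletion
    // once the batch was flagged as archived more than four weeks ago.
    func eligibleForDeletion(archivedAt, now time.Time) bool {
        return !archivedAt.IsZero() && now.Sub(archivedAt) > 4*7*24*time.Hour
    }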

The four-week “timer” was originally put in place to ensure files have had a chance to be fully backed up offsite, but it also serves another purpose: it gives you a chance to handle problems that weren’t caught during the QC process. Once NCA’s workflow files are removed, reprocessing a batch becomes significantly more difficult.

Unless you have very few batches, very small batches, or a lot of disk space, bin/delete-live-done-issues should be run regularly to avoid running out of storage. NCA can handle most problems gracefully, but running out of storage is almost guaranteed to cause you some headaches.