PLC Disaster Recovery — Backup, Restore and Testing

Two Different Things That Get Conflated

The first mistake on pharma projects is treating the operational backup and the validated configuration archive as the same thing. They serve completely different purposes and they need to be maintained separately.

The operational backup is the daily automated snapshot of GMP data — the historian database, SCADA application, and PLC project archive — that you restore from if a server fails. It protects against data loss. It runs every night and retains rolling copies according to a defined retention schedule.

The validated configuration archive is the permanent baseline record of the approved software at each key milestone: FAT, SAT, IQ, OQ, PQ, and every approved change. It protects the validated state. It is never overwritten. It is updated only when a formally approved change has been made to the software, and each update adds a new entry — it does not replace the previous one.

These two things can co-exist on the same NAS and they can both be covered by the same procedure document. But they must be managed separately, versioned separately, and referenced separately in the qualification evidence. Confusing them — using the nightly backup as the configuration archive, or treating the validated archive as the restore source — is a data integrity gap.

RTO and RPO — Define Them Before the URS Is Approved

The Recovery Time Objective and Recovery Point Objective must be defined as requirements — in the URS — before the system is designed. They drive hardware and architecture decisions: storage type, backup frequency, network infrastructure. If you define them after the system is built, you may find you have designed a system that cannot meet the operational needs of the site.

The RTO is how long the site can tolerate the system being offline from the point of failure declaration to the point of verified restoration. For most pharma utility and monitoring systems, ≤4 hours is achievable with a well-prepared restore procedure and reasonable hardware. For continuous manufacturing processes with in-process product at risk, the tolerance may be 30 minutes or less — which forces a very different architecture.

The RPO is the maximum acceptable data loss measured in time: effectively, how old the most recent backup can be. A daily backup at 02:00 means the worst-case data loss is approximately 24 hours, because a server failure at 01:59 means up to a day of historian records may not be in the backup. The data buffering capability in the PLC (which stores records in non-volatile memory and synchronises them to the historian on reconnection) reduces the actual data loss within that window, but the RPO for the historian database itself is bounded by the backup frequency.
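The worst-case arithmetic can be checked with a short sketch. It assumes one backup per day at a fixed time and that every scheduled backup succeeded; the function name `data_at_risk` and the example dates are illustrative, not part of any procedure.

```python
from datetime import datetime, time, timedelta


def data_at_risk(failure: datetime, backup_time: time = time(2, 0)) -> timedelta:
    """Return the age of the most recent scheduled backup at the moment of failure.

    Assumes a single daily backup at `backup_time` and that every scheduled
    run completed successfully (both are simplifying assumptions).
    """
    last_backup = datetime.combine(failure.date(), backup_time)
    if last_backup > failure:
        # Today's backup has not run yet, so the latest copy is yesterday's.
        last_backup -= timedelta(days=1)
    return failure - last_backup


# Failure one minute before the nightly run: almost a full day is exposed.
worst = data_at_risk(datetime(2024, 3, 15, 1, 59))
# Failure five minutes after the nightly run: only five minutes are exposed.
best = data_at_risk(datetime(2024, 3, 15, 2, 5))
print(worst, best)
```

If the worst case exceeds the RPO written into the URS, the backup frequency has to increase; PLC buffering alone does not change this figure for the historian database.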

EU GMP Annex 11 Requirement

Annex 11 Clause 7.2 requires regular backups of all relevant data, and requires that the integrity and accuracy of backup data and the ability to restore it are checked during validation and monitored periodically. Clause 17 covers archived data, which should be checked for accessibility, readability and integrity. The requirement is not prescriptive about architecture, but it is clear that backup must exist, restore must be tested, and the test evidence must be retained. An untested backup procedure is not compliant, regardless of how well the backup itself is configured. The OQ is where the test evidence is generated.

What to Back Up — The Four Components

A complete pharma system backup covers four distinct components. Missing any one of them means the restore will be incomplete.

Historian SQL database — the GMP process data, alarm history, and audit trail. This is the most critical backup target. It must be backed up with the database engine offline or using a hot-backup method that guarantees transactional consistency. Copying the raw SQL files while the database is running does not produce a consistent backup.
SCADA application snapshot — the InTouch, WinCC, Ignition, or FactoryTalk application itself: screens, scripts, tag database, alarm configuration. This is what gets restored when the SCADA server hardware fails. Without it, you have the historian data but no application to display or query it.
Historian tag configuration export — the tag configuration tells the historian which tags to record, at what scan rate, with what deadband, and how long to retain each. This is often stored inside the SCADA application but should be separately exported, because a misconfigured historian after restore (pointing at wrong tag names or missing tags) will record data silently with gaps or wrong engineering units.
PLC project archive — the full TIA Portal, Studio 5000, or equivalent project file. This is what you load onto a replacement CPU if the PLC hardware fails. Without it, you cannot restore the validated PLC program. The archive in the nightly backup should match the validated archive; if they diverge (because a change was made without going through change control) that divergence is itself a compliance finding.
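A completeness check over the four components could look like the sketch below. The file-name patterns are hypothetical; real names depend on the site's backup script and the vendor tooling in use.

```python
from pathlib import Path

# Hypothetical file-name patterns for the four components (assumptions,
# not a vendor convention).
REQUIRED_COMPONENTS = {
    "historian_db": "*.bak",
    "scada_app": "scada_app_*.zip",
    "historian_tags": "historian_tags_*.csv",
    "plc_project": "plc_project_*.zip",
}


def missing_components(backup_dir: Path) -> list[str]:
    """Return the required components that have no non-empty file in backup_dir."""
    missing = []
    for name, pattern in REQUIRED_COMPONENTS.items():
        files = [
            p for p in backup_dir.glob(pattern)
            if p.is_file() and p.stat().st_size > 0
        ]
        if not files:
            missing.append(name)
    return missing
```

A non-empty result would mark the nightly run as failed even if the database dump itself succeeded, since a restore from that set would be incomplete.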

Operational Backup Architecture

The backup should run automatically; never rely on a manual step to initiate it. A scheduled task at a low-traffic time (02:00 is standard) triggers the backup script. The script performs the backup, calculates a SHA-256 checksum of the backup files, and writes the checksum alongside the backup. The SCADA system reads the backup status tag and displays it on the maintenance dashboard. If the backup has failed, a High alarm appears in the active alarm banner within 15 minutes of the end of the backup window. Nobody has to log in and check a log file to know the backup failed.
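The logic that drives the status tag can be sketched as a small pure function. The function name, its parameters, and the "OK" and "PENDING" values are assumptions for illustration; only the 15-minute delay and the FAILED state come from the text.

```python
from datetime import datetime, timedelta
from typing import Optional

ALARM_DELAY = timedelta(minutes=15)


def backup_status(
    window_start: datetime,
    window_end: datetime,
    last_success: Optional[datetime],
    now: datetime,
) -> str:
    """Classify the current backup run for the dashboard status tag."""
    if last_success is not None and last_success >= window_start:
        # A successful run was recorded inside (or after) this window.
        return "OK"
    if now < window_end + ALARM_DELAY:
        # No success yet, but the grace period has not expired.
        return "PENDING"
    # No success and the grace period is over: raise the High alarm.
    return "FAILED"
```

Keeping the decision in one function makes the alarm behaviour easy to exercise in a test with fixed timestamps.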

The backup target should be a physically separate device from the SCADA server, such as a NAS on the OT network segment. A backup disk in the same chassis as the SCADA server is not a backup, because a single server failure can take down both. For sites with an offsite or cloud backup policy, the NAS should synchronise to the offsite target after the primary backup completes. The GMP data retention requirement (typically seven years for process records) must be reflected in the backup retention policy: 30 daily backups online, annual backups archived for seven years.
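The retention rule can be expressed as a selection over backup dates. This sketch assumes the annual copy is the last backup of each calendar year and that the seven-year horizon is counted in calendar years; both are assumptions, and the function only selects candidates without deleting anything.

```python
from datetime import date, timedelta


def backups_to_keep(
    backup_dates: list[date], today: date, daily: int = 30, years: int = 7
) -> set[date]:
    """Return the backup dates that the retention policy says to keep."""
    ordered = sorted(set(backup_dates), reverse=True)  # newest first
    keep = set(ordered[:daily])  # the most recent daily backups

    annual: dict[int, date] = {}
    for d in ordered:
        if today.year - d.year <= years:
            # Newest-first order means the first date seen for a year
            # is the last backup taken in that year.
            annual.setdefault(d.year, d)
    keep.update(annual.values())
    return keep


today = date(2024, 3, 15)
history = [today - timedelta(days=i) for i in range(440)]
print(len(backups_to_keep(history, today)))
```

Anything outside the returned set is only a candidate for removal; whether it is actually removed is a decision for the site's data retention procedure.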

The checksum requirement is important for a pharma context. If a backup file is silently corrupted — bit rot on the NAS, a partial write due to a network interruption — you need to know before you discover it at the worst possible moment during a restore. The SHA-256 checksum calculated immediately after backup and stored alongside the file gives you the ability to verify integrity at any point. During restore, you recalculate the checksum and compare. If it does not match, you do not use that backup.
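A minimal sketch of the checksum step, using only the Python standard library. The helper names and the `.sha256` sidecar suffix are assumptions; a real backup script may store the checksum differently.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large backups do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksum(backup_file: Path) -> Path:
    """Write '<hash>  <filename>' to a sidecar file next to the backup."""
    sidecar = backup_file.with_name(backup_file.name + ".sha256")
    sidecar.write_text(f"{sha256_of(backup_file)}  {backup_file.name}\n")
    return sidecar


def verify_checksum(backup_file: Path) -> bool:
    """Recalculate the hash and compare it with the recorded value."""
    sidecar = backup_file.with_name(backup_file.name + ".sha256")
    recorded = sidecar.read_text().split()[0]
    return sha256_of(backup_file) == recorded
```

`write_checksum` would run immediately after the backup completes, and `verify_checksum` before any restore; a `False` result means that backup set is not used.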

The Validated Configuration Archive — Separate and Permanent

The validated configuration archive is not a backup. It is a controlled record. It lives on the SCADA server itself (in a protected directory) and is replicated to the NAS. Its contents never get deleted or overwritten — each new entry is additive.

The archive holds three items for each milestone: the PLC project archive file, the SCADA application snapshot, and the historian tag configuration export. Each file is named with a convention that embeds the system ID, component, software version, date, and change control reference. The hash log file records the SHA-256 hash of each item at each milestone — this is what creates the hash chain that the Validation Summary Report confirms as intact.
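One way to make the hash log tamper-evident is to have each entry include the hash of the previous entry. The sketch below takes that interpretation of "hash chain", which is an assumption; the entry fields and function names are illustrative.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _entry_hash(body: dict) -> str:
    # Canonical JSON (sorted keys) so the same body always hashes the same.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_entry(log: list[dict], milestone: str, item_hashes: dict[str, str]) -> dict:
    """Add a new milestone entry; existing entries are never modified."""
    prev = log[-1]["entry_hash"] if log else GENESIS
    body = {"milestone": milestone, "items": item_hashes, "prev_hash": prev}
    entry = {**body, "entry_hash": _entry_hash(body)}
    log.append(entry)
    return entry


def chain_intact(log: list[dict]) -> bool:
    """Check that every entry matches its own hash and links to its predecessor."""
    prev = GENESIS
    for entry in log:
        body = {k: entry[k] for k in ("milestone", "items", "prev_hash")}
        if entry["prev_hash"] != prev or entry["entry_hash"] != _entry_hash(body):
            return False
        prev = entry["entry_hash"]
    return True
```

With this structure, editing or removing an earlier milestone entry invalidates every later entry, which is what lets a reviewer confirm the log as intact.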

The nightly backup and the validated archive are separate things with separate purposes. Conflating them is a data integrity gap.

The critical discipline here is the naming convention. Every archive file must be uniquely named so that an engineer three years from now can look at the archive folder and immediately understand the version history. The convention should embed at minimum: system identifier, component (PLC / SCADA / Historian), software version tag, date in YYYYMMDD format, and the change control reference that authorised the update. Never use "latest" or "final" in a file name.
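A naming convention is easier to enforce when it is checked by a script. The pattern below is one hypothetical layout (system ID, component, version, date, change control reference); the separators, the `CC-` prefix, and the `HIST` abbreviation are assumptions, not a prescribed standard.

```python
import re
from datetime import datetime

# Hypothetical layout: <SYSTEM>_<COMPONENT>_v<MAJOR.MINOR.PATCH>_<YYYYMMDD>_<CC-ref>.<ext>
ARCHIVE_NAME = re.compile(
    r"^(?P<system>[A-Z0-9]+(?:-[A-Z0-9]+)*)_"
    r"(?P<component>PLC|SCADA|HIST)_"
    r"v(?P<version>\d+\.\d+\.\d+)_"
    r"(?P<date>\d{8})_"
    r"(?P<change>CC-\d+)"
    r"\.(?P<ext>[a-z0-9]+)$"
)
FORBIDDEN_WORDS = ("latest", "final")


def parse_archive_name(filename: str) -> dict:
    """Return the embedded fields, or raise ValueError if the name is not compliant."""
    lowered = filename.lower()
    if any(word in lowered for word in FORBIDDEN_WORDS):
        raise ValueError(f"forbidden word in archive name: {filename}")
    match = ARCHIVE_NAME.match(filename)
    if match is None:
        raise ValueError(f"archive name does not follow the convention: {filename}")
    # strptime raises ValueError for impossible dates such as 20240230.
    datetime.strptime(match["date"], "%Y%m%d")
    return match.groupdict()


print(parse_archive_name("UTL-01_PLC_v1.4.0_20240315_CC-0042.zip"))
```

Running a check like this when a file is added to the archive catches non-compliant names at the time of the change rather than at the next audit.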

The Restore Procedure — What It Must Cover

The restore procedure is where most sites are exposed. They have backups. They have never tested the restore. An untested restore procedure in a GMP context is not a compliant backup strategy — EU GMP Annex 11 is explicit that recovery procedures must be tested.

A complete restore procedure covers fifteen steps in a defined sequence. The key ones that get missed:

QA approval before execution. A system restore on a production validated system is not an IT task — it is a change to the validated state. QA must be informed and must approve before the restore begins. Record the approval, the time it was given, and who gave it.
Hash verification after restore. Once the system is back up, recalculate the PLC and SCADA hashes and compare them against the validated archive. If they match, the restore has returned the system to its validated state. If they do not match — because the backup was taken from a state that does not correspond to the validated baseline — that is a deviation requiring investigation before the system returns to production.
Post-restore data integrity checks. Before returning the system to production: query the historian for records from before the backup timestamp to confirm pre-backup data is intact; log in with AD credentials across all defined roles to confirm authentication works; verify that a sample of audit trail records are present and read-only.
RTO measurement. Record the start time when the restore begins and the end time when post-restore verification is complete. Calculate the elapsed time. The RTO criterion — ≤4 hours — must be met. If it is not, that is a finding: either the procedure needs to be optimised or the hardware needs to be upgraded.
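The hash comparison and the RTO calculation from the steps above can be combined into one post-restore check. The function and field names are illustrative assumptions; the 4-hour limit is the criterion stated in the text.

```python
from datetime import datetime, timedelta

RTO_LIMIT = timedelta(hours=4)


def restore_verdict(
    start: datetime,
    end: datetime,
    restored_hashes: dict[str, str],
    validated_hashes: dict[str, str],
) -> dict:
    """Summarise a restore: elapsed time against the RTO and any hash mismatches."""
    elapsed = end - start
    # A component is mismatched if its restored hash differs from the
    # validated archive, or if it was not hashed at all after the restore.
    mismatched = sorted(
        name
        for name, expected in validated_hashes.items()
        if restored_hashes.get(name) != expected
    )
    return {
        "elapsed_minutes": elapsed.total_seconds() / 60,
        "rto_met": elapsed <= RTO_LIMIT,
        "hash_mismatches": mismatched,
    }
```

A non-empty `hash_mismatches` list corresponds to the deviation case described above: the system does not return to production until it has been investigated.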

The Two OQ Test Cases

Backup and recovery generates two OQ test cases. Both are mandatory.

Backup Failure Detection (OQ-075)

Simulate a backup failure by making the backup target temporarily inaccessible — disconnect the NAS or rename the target share. Trigger a manual backup run. Verify that within 15 minutes of the backup window end, a High alarm appears in the active alarm banner and the backup status tag on the SCADA maintenance dashboard shows FAILED. This confirms the monitoring works. Record the time from backup attempt to alarm as evidence.

Full Restore Test with RTO Measurement (OQ-076)

This is the test that actually validates the recovery procedure. Confirm a recent successful backup exists and record its timestamp. Initiate a full restore per the recovery procedure. Take the SCADA server offline during the restore. Record start time. Bring the system back up, complete all post-restore verification steps, and record end time. Calculate and record the RTO. The acceptance criterion is ≤4 hours (240 minutes). Post-restore verification covers: historian data from before the backup timestamp is intact, AD authentication is functional, and the audit trail is intact and read-only up to the backup timestamp.

The OQ evidence for OQ-076 includes screenshots of the restored system online, the RTO calculation, and a historian query export showing pre-backup data. This evidence feeds into the Validation Summary Report, which reports the measured RTO and confirms it was within specification.

In the QLean Framework

The Recovery and Backup Procedure (RBP-SYS-001) covers the following areas: backup architecture (daily 02:00, full backup to NAS-BKP-01, SHA-256 checksum, 30-day online retention, 7-year annual archive), validated configuration archive (four archive items, naming convention, hash log, never-overwrite rule), scheduled backup monitoring (operator dashboard check, backup failure alarm ALM-DATA-002), 15-step full system restore procedure (QA approval, hash re-verification after restore, post-restore data integrity checks, RTO measurement), and backup failure investigation. RTO specification is ≤4 hours; RPO is ≤1 calendar day. OQ-SYS-001 contains OQ-075 (backup failure detection) and OQ-076 (full restore test with RTO measurement) as High-risk test cases. The VSR confirms the measured RTO from OQ-076 as part of the Phase Results Summary.

Backup in the Periodic Review

The backup strategy does not stop being a compliance topic after the OQ is closed. EU GMP Annex 11 requires that backup and recovery are reviewed as part of the periodic review cycle. The periodic review should check: backup success rate over the review period (100% is the target; any failure should have a corresponding MDL entry with root cause and resolution), whether the hardware on the backup path has changed (new NAS, new server), and whether the RTO is still achievable given any changes to the data volumes or restore infrastructure since the OQ was executed.
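The success-rate figure for the review can be derived from the nightly outcomes. This sketch assumes one expected backup per calendar day and an outcome record keyed by date; the function name and return fields are illustrative.

```python
from datetime import date, timedelta


def review_backup_history(outcomes: dict[date, bool], start: date, end: date) -> dict:
    """Summarise backup outcomes for every day in the review period (inclusive)."""
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    failed = [d for d in days if outcomes.get(d) is False]
    # A day with no record at all is treated as a gap, not as a success.
    missing = [d for d in days if d not in outcomes]
    succeeded = len(days) - len(failed) - len(missing)
    return {
        "success_rate_pct": 100.0 * succeeded / len(days),
        "failed": failed,
        "missing": missing,
    }
```

Each date in `failed` or `missing` is one that the reviewer would expect to find explained by a logged entry with root cause and resolution.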

A site that runs the same backup restore test at each periodic review — every 24 months — has much stronger evidence of ongoing compliance than one that tested it once at OQ five years ago and has not touched it since. The restore test takes a few hours. The compliance benefit is substantial.