Redundancy in Pharma PLC Systems — When It's Required

The Wrong Way to Answer the Redundancy Question

The most common way the redundancy question gets answered on pharma projects is the wrong way. The client asks for a redundant CPU "because it's a pharma project," the system integrator (SI) specifies it to avoid the conversation, or, equally bad, nobody asks the question and a non-redundant system gets installed in a critical process area that would have justified redundancy.

Redundancy is not a pharma compliance requirement in the same way that audit trails or electronic signatures are. No regulation mandates a hot-standby PLC. What the regulations and guidance do expect, through GAMP 5, EU GMP Annex 11, and the general GMP principle of system reliability, is that you have understood the failure modes of your system and applied appropriate controls. Redundancy is one type of control. In some cases it is the right one. In others it is engineering overhead that adds cost, complexity, and additional validation scope without meaningfully reducing risk.

The correct approach is to let the risk assessment drive the decision, and then document the outcome in the Validation Plan and Hardware Design Specification. That is the approach this article explains.

What Redundancy Actually Means — Four Distinct Levels

When someone says "redundant system" on a pharma project they could mean any of four different things, and each has a different cost, complexity, and validation implication. Conflating them is where scope misalignments happen.

// REDUNDANCY LEVELS ARE INDEPENDENT DECISIONS. PSU REDUNDANCY IS ALMOST ALWAYS JUSTIFIED. CPU REDUNDANCY RARELY IS UNLESS THE PROCESS CANNOT TOLERATE A CONTROLLED SHUTDOWN.

Level 1 — Power Supply Redundancy

Dual 24VDC power supplies in parallel with a redundancy module is the lowest-cost, lowest-complexity redundancy option and the one most consistently justified across pharma projects. A single PSU failure in a non-redundant panel takes down the entire control system — including the safe-state logic, the alarm system, and the data historian connection — until the PSU is replaced. With a redundant pair, a PSU failure generates a High alarm and the system continues operating on the remaining unit. Maintenance can replace the failed unit under a controlled procedure without a process shutdown.

The validation impact is minimal: the IQ verifies both PSUs are present and the redundancy module is installed, and the OQ includes a PSU failure test. It adds two lines to the test script. This is the one form of redundancy that should be in the URS for essentially every GMP panel.

Level 2 — Network Ring Topology

For distributed I/O architectures — where remote I/O islands are connected back to the main CPU over PROFINET or similar — a ring topology with MRP (Media Redundancy Protocol) protection is cheap insurance. A linear daisy-chain topology means a single cable or switch failure breaks communication to every downstream device. A ring means the network self-heals around the fault; PROFINET MRP recovery time is typically under 200 milliseconds. The field devices stay online. The process continues. An alarm fires so maintenance knows to investigate, but the process does not see a control interruption.
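
The difference between the two topologies can be illustrated with a small connectivity check. This is a minimal sketch, assuming a hypothetical layout of one CPU and three remote I/O islands (the names `CPU`, `RIO1` to `RIO3` are illustrative). It models cable connectivity only, not MRP itself or its recovery time.

```python
from collections import deque

def reachable(nodes, links, start):
    """Return the set of nodes reachable from start over undirected links."""
    adjacency = {n: set() for n in nodes}
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in adjacency[current]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen

def lost_devices(nodes, links, failed_link, controller="CPU"):
    """Devices with no remaining path to the controller after one link fails."""
    remaining = [link for link in links if set(link) != set(failed_link)]
    return sorted(set(nodes) - reachable(nodes, remaining, controller))

# Hypothetical layout: one CPU and three remote I/O islands.
NODES = ["CPU", "RIO1", "RIO2", "RIO3"]
LINE_LINKS = [("CPU", "RIO1"), ("RIO1", "RIO2"), ("RIO2", "RIO3")]
RING_LINKS = LINE_LINKS + [("RIO3", "CPU")]  # one extra cable closes the ring

FAILED = ("RIO1", "RIO2")  # simulated cable break
line_lost = lost_devices(NODES, LINE_LINKS, FAILED)
ring_lost = lost_devices(NODES, RING_LINKS, FAILED)

print("daisy-chain loses:", line_lost)
print("ring loses:", ring_lost)
```

In the daisy-chain case every island downstream of the break drops out, while in the ring every island still has a path back to the CPU in the other direction.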

The additional hardware cost is modest — one extra managed switch port and a cable run to close the ring. The validation impact is an additional OQ test case confirming ring healing behaviour on simulated cable disconnection. Again: well worth it for any distributed I/O layout.

Level 3 — Sensor Redundancy

Fitting redundant field instruments for critical quality attributes is a risk-driven decision. The question is: what is the consequence of losing the measurement entirely versus accepting a brief data gap while the single sensor is repaired or replaced? For the highest-severity CQAs (conductivity in a purified water system, temperature in a sterile fill environment) losing the measurement means losing the ability to demonstrate product quality. A redundant sensor allows the system to continue monitoring and controlling from the backup measurement while the failed sensor is addressed under change control.
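
The backup-measurement behaviour can be sketched as a simple selection rule. This is an illustration only, written in Python for readability. In a real system the equivalent logic would live in the PLC, and the names `Reading`, `Selection`, and `select_measurement`, along with the alarm texts, are assumptions rather than anything defined by a standard or vendor.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Reading:
    value: float
    good_quality: bool  # False models a Bad Quality status from the instrument

@dataclass
class Selection:
    value: Optional[float]  # None when no valid measurement is available
    source: str             # "primary", "backup", or "none"
    alarm: Optional[str]    # alarm text to raise, if any

def select_measurement(primary: Reading, backup: Reading) -> Selection:
    """Use the primary sensor when healthy, otherwise fall back to the backup."""
    if primary.good_quality:
        # Still alarm on a failed backup so the redundancy gets restored.
        alarm = None if backup.good_quality else "Backup sensor failed"
        return Selection(primary.value, "primary", alarm)
    if backup.good_quality:
        return Selection(backup.value, "backup", "Primary sensor failed, running on backup")
    return Selection(None, "none", "Both sensors failed, measurement lost")

# Hypothetical conductivity readings: primary has failed, backup is healthy.
result = select_measurement(Reading(0.0, False), Reading(1.1, True))
print(result.source, result.value, result.alarm)
```

Note that a failure of either sensor raises an alarm, even when the measurement is still available, so that maintenance can restore the redundant pair.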

The risk assessment drives this. If the failure mode "sensor failure" scores a Severity of 5 (direct patient safety impact) and an Occurrence of 3 or above, redundancy as a risk control lowers the Occurrence score for losing the measurement and can improve detection, since the two readings can be compared against each other. The risk register should document this decision explicitly, not leave it as an implicit engineering choice.

Level 4 — CPU Redundancy

Hot-standby CPU redundancy — where a secondary CPU shadows the primary and takes over in milliseconds on primary failure — is the most expensive, most complex, and most frequently over-specified form of redundancy in pharma. The correct question is not "is this a pharma system?" but "does this process require continuous uninterrupted control such that even a 30-second controlled shutdown for CPU replacement is unacceptable?"

For most pharma utility and monitoring systems — water systems, EMS, HVAC — the answer is no. A controlled CPU failure response (fail-safe positions, alarm to operator, process held in a known safe state) is entirely acceptable. The process can wait for a controlled restart. CPU redundancy in these applications adds significant capital cost, doubles the validated scope (two CPUs means two software configurations to verify and maintain), and complicates change control for every future modification.

CPU redundancy is genuinely justified for continuous manufacturing processes where any interruption in control directly impacts in-progress product that cannot be held — live cell culture, continuous chromatography, aseptic fill lines with in-process product that cannot be paused. If your process falls into that category, the justification is clear and the cost is appropriate. If it doesn't, the risk assessment should document why it was not required, not just silently omit it.

How the Risk Assessment Drives the Decision

Every redundancy decision should be traceable to a risk register entry. This is not bureaucratic overhead — it is how you demonstrate to QA and to auditors that the decision was made deliberately rather than by default or oversight.

The risk register entry for, say, "conductivity sensor failure" should show Severity (product quality consequence of distributing out-of-spec water), Occurrence (frequency of sensor failure given calibration interval and instrument reliability), and Detectability (how quickly the failure is detected: immediately via a Bad Quality alarm, or only on the next calibration). The initial RPN drives the control decision: if the RPN without redundancy is acceptable, document it and move on. If it is not (if Severity is 5 and the risk without redundancy is still High), then redundancy as an engineering control is the justified response, and the risk register records that the redundant sensor reduces the Occurrence score and therefore brings the residual RPN down to an acceptable level.
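
As a worked illustration of that scoring logic, the sketch below uses the conventional FMEA product RPN = Severity x Occurrence x Detectability on a 1 to 5 scale. The Detectability score of 2 and the acceptance threshold of 12 are hypothetical values chosen for the example. Your own risk assessment procedure defines the real scales and limits.

```python
from dataclasses import dataclass

ACCEPTABLE_RPN = 12  # hypothetical acceptance threshold, set by your own procedure

@dataclass(frozen=True)
class RiskScore:
    severity: int       # 1 (negligible) to 5 (direct patient safety impact)
    occurrence: int     # 1 (rare) to 5 (frequent)
    detectability: int  # 1 (detected immediately) to 5 (unlikely to be detected)

    def rpn(self) -> int:
        """Risk priority number as the product of the three scores."""
        return self.severity * self.occurrence * self.detectability

def needs_further_control(score: RiskScore, limit: int = ACCEPTABLE_RPN) -> bool:
    """True when the RPN exceeds the acceptance threshold."""
    return score.rpn() > limit

# Conductivity sensor failure, single sensor (illustrative scores).
initial = RiskScore(severity=5, occurrence=3, detectability=2)
# Same failure mode with a redundant sensor: Occurrence of losing the measurement drops.
with_redundancy = RiskScore(severity=5, occurrence=1, detectability=2)

print("initial RPN:", initial.rpn(), "needs control:", needs_further_control(initial))
print("residual RPN:", with_redundancy.rpn(), "needs control:", needs_further_control(with_redundancy))
```

With these example scores the initial RPN is 30, above the threshold, and the residual RPN with the redundant sensor is 10, below it. Severity is unchanged, because the consequence of out-of-spec water is the same. Only the likelihood of losing the measurement has moved.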

The same logic applies to every other redundancy level. Dual PSU: failure of a single PSU taking down the control system has a Severity and Occurrence; the redundant pair reduces Occurrence to near-zero and reduces Severity of the hardware failure itself to a managed alarm rather than a process shutdown. Document it. Network ring: single cable failure breaks distributed I/O communication; ring topology reduces Occurrence of process impact from that failure mode. Document it.

What Auditors Actually Ask

An inspector reviewing your hardware design will not ask "is this system redundant?" They will ask: "What happens if [component X] fails, and how did you decide whether that was acceptable?" A system with no CPU redundancy but a clear risk register entry explaining why a controlled shutdown is acceptable is in a stronger position than a redundant CPU system where nobody can explain why it was specified. The documented decision, either way, is what matters.

Where the Decision Gets Documented

The redundancy strategy belongs in two places: the Validation Plan and the Hardware Design Specification.

In the Validation Plan, the system description section should summarise the overall redundancy strategy at an architectural level — not component by component, but the principle: which functions have redundancy, which don't, and what the consequence of failure is for each. This gives QA the overview they need before the HDS is issued.

In the HDS, the redundancy strategy table is the definitive record. It lists every component or function that has a redundancy provision, specifies what that provision is, and describes the failure consequence for a single-unit failure. The table also implicitly documents the components that do not have redundancy — anything absent from the redundancy table is non-redundant, and the HDS author should have a risk register justification to hand for any reviewer who asks.
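
One way to keep that implicit record honest is to check that every major component appears either in the redundancy table or against a risk register justification. The sketch below is a minimal illustration. The component names, the table fields, and the reference `RISK-EXAMPLE-01` are all hypothetical and not taken from any real HDS.

```python
def undocumented_components(all_components, redundancy_table, justifications):
    """Components with neither a redundancy provision nor a documented justification."""
    return sorted(
        component
        for component in all_components
        if component not in redundancy_table and component not in justifications
    )

# Hypothetical component inventory for a small system.
ALL_COMPONENTS = ["24VDC PSU", "PROFINET network", "PLC CPU", "SCADA server"]

# Hypothetical redundancy strategy table: provision and single-unit failure consequence.
REDUNDANCY_TABLE = {
    "24VDC PSU": {
        "provision": "Dual PSUs in parallel with redundancy module",
        "single_failure": "High alarm, system continues on remaining unit",
    },
    "PROFINET network": {
        "provision": "Ring topology with MRP",
        "single_failure": "Network heals around the fault, alarm raised",
    },
}

# Non-redundant components mapped to a (hypothetical) risk register reference.
JUSTIFICATIONS = {"PLC CPU": "RISK-EXAMPLE-01"}

gaps = undocumented_components(ALL_COMPONENTS, REDUNDANCY_TABLE, JUSTIFICATIONS)
print("needs a documented decision:", gaps)
```

Here the SCADA server is flagged because it has neither a redundancy provision nor a recorded justification, which is exactly the gap a reviewer would ask about.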

In the QLean Framework

The Hardware Design Specification (HDS-SYS-001) includes a formal Redundancy Strategy table in Section 1.2 covering all major components. The framework's worked example specifies: dual redundant 24VDC PSUs in parallel with automatic failover (single PSU failure generates High alarm, system continues on remaining unit); dual-ring PROFINET topology for field I/O with managed redundancy (single cable or switch failure heals automatically, no data loss); and server-side RAID 5 storage with dual power supply (hot-swap). The architecture note in Section 1 explicitly states that control logic executes locally in the PLC independent of SCADA availability — making SCADA server failure a High alarm rather than a process control loss. The Risk Assessment workbook (RA-SYS-001) includes "redundant sensors" as a risk control against conductivity sensor failure, with the Occurrence score reduction documented. The framework does not specify CPU redundancy — deliberately, with the architectural rationale documented in HDS Section 1.2.

Validation Implications of Each Redundancy Level

Redundancy adds validation scope. Before specifying it, understand what you are committing to verify:

PSU redundancy: IQ verification that both units are installed and the redundancy module is present. OQ test case: simulate single PSU failure, confirm High alarm fires and system continues operating. One additional IQ check item, one additional OQ test case. Minimal impact.
Network ring: IQ verification that the ring topology is correctly configured in managed switches. OQ test case: disconnect a ring segment, confirm network heals within acceptance criterion (typically under 500ms), confirm no I/O data loss. One network config check, one OQ test case. Low impact.
Sensor redundancy: Two instruments on every GMP-critical point means twice the calibration certificates at IQ, twice the loop checks at SAT, and an additional OQ test section covering the switchover logic (what triggers switchover, is there an alarm, does the SCADA display the backup sensor value correctly). Moderate impact — plan for it in the test schedule.
CPU redundancy: Significant validation impact. Two CPUs require two software baselines that must be maintained in sync. The OQ must include CPU switchover testing under load. Change control for every software modification must verify both CPUs. The hash chain verification at each change covers both units. This is not a reason to avoid CPU redundancy if it is justified — but it must be scoped correctly in the Validation Plan from the start.
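
The commitments above can be captured as a simple lookup, so that the validation scope for a chosen combination of redundancy levels can be listed before it goes into the Validation Plan. This is a sketch only. The dictionary keys and phase names (`iq`, `sat`, `oq`, `change_control`) are illustrative choices, and the entries just restate the list above.

```python
# Validation activities per redundancy level, restating the list above.
VALIDATION_SCOPE = {
    "psu": {
        "impact": "minimal",
        "iq": ["Both PSUs installed and redundancy module present"],
        "oq": ["Simulate single PSU failure: High alarm fires, system continues"],
    },
    "network_ring": {
        "impact": "low",
        "iq": ["Ring topology correctly configured in managed switches"],
        "oq": ["Disconnect a ring segment: network heals within criterion, no I/O data loss"],
    },
    "sensor": {
        "impact": "moderate",
        "iq": ["Calibration certificates for both instruments on each GMP-critical point"],
        "sat": ["Loop checks for both instruments"],
        "oq": ["Switchover logic: trigger, alarm, SCADA display of backup value"],
    },
    "cpu": {
        "impact": "significant",
        "oq": ["CPU switchover under load"],
        "change_control": ["Verify both CPUs at every software modification"],
    },
}

def validation_commitments(selected_levels):
    """Combine the activities for the selected levels, grouped by phase."""
    combined = {}
    for level in selected_levels:
        for phase, items in VALIDATION_SCOPE[level].items():
            if phase == "impact":
                continue  # impact is a rating, not an activity
            combined.setdefault(phase, []).extend(items)
    return combined

# Example: the two levels the article treats as almost always justified.
plan = validation_commitments(["psu", "network_ring"])
for phase, items in plan.items():
    print(phase.upper(), items)
```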

The SCADA Layer Is Not the PLC

One important distinction that gets confused on pharma projects: SCADA server failure and PLC CPU failure are not the same event and do not require the same redundancy response.

A well-designed pharma system executes all control logic in the PLC. The SCADA provides operator interface, alarm display, data historisation, and reporting. If the SCADA server fails, the PLC continues executing its logic autonomously. The process does not stop. The operator loses their visual interface and alarm display — which is a High alarm and a significant operational problem — but the system maintains its safe state and continues controlling the process.

This means SCADA server high-availability (virtualisation, server clustering, rapid restore from backup) is often a more practical and cost-effective approach than PLC CPU redundancy — because the thing most likely to fail in a complex system is the Windows server, not the PLC CPU. The RBP-SYS-001 backup and recovery procedure, with a defined Recovery Time Objective of four hours or less, is the framework's approach to SCADA server availability. A four-hour RTO for SCADA restoration is acceptable for most pharma utility systems; it would not be acceptable for a continuous manufacturing process where SCADA loss means loss of batch visibility.

Understand which components are in the control path versus the monitoring path. Design redundancy for the control path based on risk. Design availability for the monitoring path based on operational requirements and RTO.