The Wrong Way to Answer the Redundancy Question

On pharma projects, the redundancy question is most often answered the wrong way. The client asks for a redundant CPU "because it's a pharma project," or the system integrator (SI) specifies one to avoid the conversation, or, equally bad, nobody asks the question and a non-redundant system gets installed in a critical process area that would have justified redundancy.

Redundancy is not a pharma compliance requirement in the same way that audit trails or electronic signatures are. No regulation mandates a hot-standby PLC. What the regulations and guidance do require (EU GMP Annex 11, GAMP 5, and the general GMP principle of system reliability) is that you understand the failure modes of your system and apply appropriate controls. Redundancy is one type of control. In some cases it is the right one. In others it is engineering overhead that adds cost, complexity, and additional validation scope without meaningfully reducing risk.

The correct approach is to let the risk assessment drive the decision, and then document the outcome in the Validation Plan and Hardware Design Specification. That is the approach this article explains.

What Redundancy Actually Means — Four Distinct Levels

When someone says "redundant system" on a pharma project they could mean any of four different things, and each has a different cost, complexity, and validation implication. Conflating them is where scope misalignments happen.

Four levels of redundancy in pharma PLC systems:

Level 1, PSU. Dual 24VDC PSUs in parallel with a redundancy module. Cost: low. Failover: automatic. Validation impact: low. Recommended for all GMP panels.

Level 2, Network. Ring topology using a PROFINET MRP self-healing ring. Cost: low to medium. Failover: automatic. Validation impact: low. Recommended for distributed I/O runs.

Level 3, Sensor. Dual redundant field instruments for critical CQAs. Cost: medium. Failover: logic. Validation impact: medium. Risk-driven for Severity 4–5 CQAs.

Level 4, CPU. Hot-standby CPU with bumpless switchover on primary failure. Cost: high. Failover: automatic. Validation impact: high. Justified only for continuous processes.

Redundancy levels are independent decisions. PSU redundancy is almost always justified. CPU redundancy rarely is, unless the process cannot tolerate a controlled shutdown.

Level 1 — Power Supply Redundancy

Dual 24VDC power supplies in parallel with a redundancy module is the lowest-cost, lowest-complexity redundancy option and the one most consistently justified across pharma projects. A single PSU failure in a non-redundant panel takes down the entire control system — including the safe-state logic, the alarm system, and the data historian connection — until the PSU is replaced. With a redundant pair, a PSU failure generates a High alarm and the system continues operating on the remaining unit. Maintenance can replace the failed unit under a controlled procedure without a process shutdown.

The validation impact is minimal: the IQ verifies both PSUs are present and the redundancy module is installed, and the OQ includes a PSU failure test. It adds two lines to the test script. This is the one form of redundancy that should be in the URS for essentially every GMP panel.

Level 2 — Network Ring Topology

For distributed I/O architectures, where remote I/O islands are connected back to the main CPU over PROFINET or similar, a ring topology with MRP (Media Redundancy Protocol) protection is cheap insurance. A linear daisy-chain topology means a single cable or switch failure breaks communication to every downstream device. A ring means the network self-heals around the fault; PROFINET MRP recovery time is typically under 200 milliseconds. Provided the I/O watchdog times are set longer than that recovery window, the field devices stay online and the process continues. An alarm fires so maintenance knows to investigate, but the process does not see a control interruption.

The additional hardware cost is modest — one extra managed switch port and a cable run to close the ring. The validation impact is an additional OQ test case confirming ring healing behaviour on simulated cable disconnection. Again: well worth it for any distributed I/O layout.
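Whether a device actually rides through ring healing depends on its watchdog setting. The sketch below is a rough check under two assumptions: that the watchdog time is the product of the I/O update time and the number of accepted missed cycles (as is common in PROFINET engineering tools), and that 200 ms is the worst-case recovery time. The function names and example settings are hypothetical.

```python
MRP_MAX_RECOVERY_MS = 200.0  # assumed worst-case ring recovery time


def watchdog_time_ms(update_time_ms: float, accepted_missed_cycles: int) -> float:
    """Time a device tolerates without fresh I/O data before declaring a failure."""
    return update_time_ms * accepted_missed_cycles


def rides_through_ring_healing(update_time_ms: float,
                               accepted_missed_cycles: int,
                               recovery_ms: float = MRP_MAX_RECOVERY_MS) -> bool:
    """True if the watchdog outlasts the ring recovery window."""
    return watchdog_time_ms(update_time_ms, accepted_missed_cycles) > recovery_ms


if __name__ == "__main__":
    # Hypothetical settings: 8 ms update time.
    print(rides_through_ring_healing(8.0, 3))   # 24 ms watchdog, prints False
    print(rides_through_ring_healing(8.0, 32))  # 256 ms watchdog, prints True
```

A check like this is worth recording alongside the OQ ring-healing test case, since it explains why a given device did or did not drop out during the simulated cable disconnection.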

Level 3 — Sensor Redundancy

Fitting redundant field instruments for critical quality attributes (CQAs) is a risk-driven decision. The question is: what is the consequence of losing the measurement entirely versus accepting a brief data gap while the single sensor is repaired or replaced? For the highest-severity CQAs, such as conductivity in a purified water system or temperature in a sterile fill environment, losing the measurement means losing the ability to demonstrate product quality. A redundant sensor allows the system to continue monitoring and controlling from the backup measurement while the failed sensor is addressed under change control.

The risk assessment drives this. If the failure mode "sensor failure" scores a Severity of 5 (direct patient safety impact) and an Occurrence of 3 or above, redundancy as a risk control lowers the Occurrence score, because both sensors must fail before the measurement is lost, and can improve detection, because a deviation between the two readings exposes a faulty sensor. The risk register should document this decision explicitly, not leave it as an implicit engineering choice.
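The failover for sensor redundancy lives in logic rather than hardware. A minimal sketch of that selection logic follows, written in Python for readability; on a real project it would be implemented in the PLC. The type names, alarm texts, and the deviation limit are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Reading:
    value: float
    good_quality: bool  # False models a Bad Quality status from the instrument


@dataclass
class Selection:
    value: Optional[float]  # None means no usable measurement
    source: str             # "primary", "backup" or "none"
    alarms: List[str] = field(default_factory=list)


def select_measurement(primary: Reading, backup: Reading,
                       max_deviation: float) -> Selection:
    """Pick the measurement to control from and raise the matching alarms."""
    if primary.good_quality and backup.good_quality:
        alarms = []
        # Two healthy sensors that disagree point to drift in one of them.
        if abs(primary.value - backup.value) > max_deviation:
            alarms.append("SENSOR_DEVIATION")
        return Selection(primary.value, "primary", alarms)
    if primary.good_quality:
        return Selection(primary.value, "primary", ["BACKUP_SENSOR_FAILED"])
    if backup.good_quality:
        return Selection(backup.value, "backup", ["PRIMARY_SENSOR_FAILED"])
    return Selection(None, "none", ["MEASUREMENT_LOST"])
```

For example, `select_measurement(Reading(0.0, False), Reading(1.1, True), 0.2)` selects the backup value and raises a primary-failure alarm, which is the behaviour an OQ test for this level would need to demonstrate.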

Level 4 — CPU Redundancy

Hot-standby CPU redundancy — where a secondary CPU shadows the primary and takes over in milliseconds on primary failure — is the most expensive, most complex, and most frequently over-specified form of redundancy in pharma. The correct question is not "is this a pharma system?" but "does this process require continuous uninterrupted control such that even a 30-second controlled shutdown for CPU replacement is unacceptable?"

For most pharma utility and monitoring systems — water systems, EMS, HVAC — the answer is no. A controlled CPU failure response (fail-safe positions, alarm to operator, process held in a known safe state) is entirely acceptable. The process can wait for a controlled restart. CPU redundancy in these applications adds significant capital cost, doubles the validated scope (two CPUs means two software configurations to verify and maintain), and complicates change control for every future modification.

CPU redundancy is genuinely justified for continuous manufacturing processes where any interruption in control directly impacts in-progress product that cannot be held — live cell culture, continuous chromatography, aseptic fill lines with in-process product that cannot be paused. If your process falls into that category, the justification is clear and the cost is appropriate. If it doesn't, the risk assessment should document why it was not required, not just silently omit it.

Every redundancy decision should be traceable to a risk register entry. This is not bureaucratic overhead — it is how you demonstrate to QA and to auditors that the decision was made deliberately rather than by default or oversight.

The risk register entry for, say, "conductivity sensor failure" should show three scores: Severity (the product quality consequence of distributing out-of-spec water), Occurrence (the frequency of sensor failure given calibration interval and instrument reliability), and Detectability (how quickly the failure is detected: immediately via a Bad Quality alarm, or only at the next calibration). The initial RPN drives the control decision. If the RPN without redundancy is acceptable, document it and move on. If it is not, for example because Severity is 5 and the risk without redundancy is still High, then redundancy as an engineering control is the justified response, and the risk register records that the redundant sensor reduces the Occurrence score and therefore brings the residual RPN down to an acceptable level.
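The arithmetic behind such an entry can be sketched as follows, assuming the common convention that RPN is the product of the three scores on a 1 to 5 scale. The individual scores and the acceptance threshold are hypothetical; a real project takes them from its own risk assessment procedure.

```python
ACCEPTABLE_RPN = 15  # hypothetical acceptance threshold


def rpn(severity: int, occurrence: int, detectability: int) -> int:
    """Risk Priority Number as the product of three 1-5 scores."""
    for score in (severity, occurrence, detectability):
        if not 1 <= score <= 5:
            raise ValueError("scores are expected on a 1-5 scale")
    return severity * occurrence * detectability


def is_acceptable(value: int, threshold: int = ACCEPTABLE_RPN) -> bool:
    return value <= threshold


# Hypothetical scores for "conductivity sensor failure".
initial = rpn(severity=5, occurrence=3, detectability=2)   # single sensor
residual = rpn(severity=5, occurrence=1, detectability=2)  # redundant pair

if __name__ == "__main__":
    print(initial, is_acceptable(initial))    # 30 False
    print(residual, is_acceptable(residual))  # 10 True
```

Severity is unchanged by the control: the consequence of out-of-spec water is the same, and only the likelihood of losing the measurement has been reduced.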

The same logic applies to every other redundancy level. Dual PSU: failure of a single PSU taking down the control system has a Severity and Occurrence; the redundant pair reduces Occurrence to near-zero and reduces Severity of the hardware failure itself to a managed alarm rather than a process shutdown. Document it. Network ring: single cable failure breaks distributed I/O communication; ring topology reduces Occurrence of process impact from that failure mode. Document it.

What Auditors Actually Ask

An inspector reviewing your hardware design will not ask "is this system redundant?" They will ask: "What happens if [component X] fails, and how did you decide whether that was acceptable?" A system with no CPU redundancy but a clear risk register entry explaining why a controlled shutdown is acceptable is in a stronger position than a redundant CPU system where nobody can explain why it was specified. The documented decision, either way, is what matters.

Where the Decision Gets Documented

The redundancy strategy belongs in two places: the Validation Plan and the Hardware Design Specification.

In the Validation Plan, the system description section should summarise the overall redundancy strategy at an architectural level — not component by component, but the principle: which functions have redundancy, which don't, and what the consequence of failure is for each. This gives QA the overview they need before the HDS is issued.

In the HDS, the redundancy strategy table is the definitive record. It lists every component or function that has a redundancy provision, specifies what that provision is, and describes the failure consequence for a single-unit failure. The table also implicitly documents the components that do not have redundancy — anything absent from the redundancy table is non-redundant, and the HDS author should have a risk register justification to hand for any reviewer who asks.
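One way to keep that implicit list honest is to hold the table as data and check it against the full component list. The sketch below assumes a simple record per table row; the field names, component names, and risk register IDs are hypothetical and not taken from any particular HDS template.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class RedundancyEntry:
    component: str
    provision: str
    single_failure_consequence: str


REDUNDANCY_TABLE = [
    RedundancyEntry("24VDC PSU",
                    "Dual PSUs in parallel with redundancy module",
                    "High alarm; system continues on remaining unit"),
    RedundancyEntry("Field I/O network",
                    "PROFINET MRP ring",
                    "Ring heals around the fault; alarm raised"),
]


def non_redundant(components: Iterable[str],
                  table: List[RedundancyEntry]) -> List[str]:
    """Components absent from the redundancy table, i.e. non-redundant."""
    covered = {entry.component for entry in table}
    return sorted(c for c in components if c not in covered)


def missing_justification(components: Iterable[str],
                          table: List[RedundancyEntry],
                          justifications: Dict[str, str]) -> List[str]:
    """Non-redundant components with no risk register reference on file."""
    return [c for c in non_redundant(components, table)
            if not justifications.get(c)]
```

With components `["24VDC PSU", "Field I/O network", "PLC CPU", "Operator panel"]` and justifications `{"PLC CPU": "RISK-003"}`, the second function returns `["Operator panel"]`, which is the item a reviewer would ask about.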

In the QLean Framework

The Hardware Design Specification (HDS-SYS-001) includes a formal Redundancy Strategy table in Section 1.2 covering all major components. The framework's worked example specifies: dual redundant 24VDC PSUs in parallel with automatic failover (single PSU failure generates High alarm, system continues on remaining unit); dual-ring PROFINET topology for field I/O with managed redundancy (single cable or switch failure heals automatically, no data loss); and server-side RAID 5 storage with dual power supply (hot-swap). The architecture note in Section 1 explicitly states that control logic executes locally in the PLC independent of SCADA availability — making SCADA server failure a High alarm rather than a process control loss. The Risk Assessment workbook (RA-SYS-001) includes "redundant sensors" as a risk control against conductivity sensor failure, with the Occurrence score reduction documented. The framework does not specify CPU redundancy — deliberately, with the architectural rationale documented in HDS Section 1.2.

Validation Implications of Each Redundancy Level

Redundancy adds validation scope. Before specifying it, understand what you are committing to verify:

The SCADA Layer Is Not the PLC

One important distinction that gets confused on pharma projects: SCADA server failure and PLC CPU failure are not the same event and do not require the same redundancy response.

A well-designed pharma system executes all control logic in the PLC. The SCADA provides operator interface, alarm display, data historisation, and reporting. If the SCADA server fails, the PLC continues executing its logic autonomously. The process does not stop. The operator loses their visual interface and alarm display — which is a High alarm and a significant operational problem — but the system maintains its safe state and continues controlling the process.

This means SCADA server high-availability (virtualisation, server clustering, rapid restore from backup) is often a more practical and cost-effective approach than PLC CPU redundancy — because the thing most likely to fail in a complex system is the Windows server, not the PLC CPU. The RBP-SYS-001 backup and recovery procedure, with a defined Recovery Time Objective of four hours or less, is the framework's approach to SCADA server availability. A four-hour RTO for SCADA restoration is acceptable for most pharma utility systems; it would not be acceptable for a continuous manufacturing process where SCADA loss means loss of batch visibility.

Understand which components are in the control path versus the monitoring path. Design redundancy for the control path based on risk. Design availability for the monitoring path based on operational requirements and RTO.
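As a rough illustration of that split, the sketch below tags each component with its path and returns the design basis that applies. The component names are hypothetical, and the four-hour default mirrors the RTO quoted above rather than a general rule.

```python
from dataclasses import dataclass
from enum import Enum


class Path(Enum):
    CONTROL = "control"
    MONITORING = "monitoring"


@dataclass(frozen=True)
class Component:
    name: str
    path: Path


def design_basis(component: Component) -> str:
    """Return what should drive the design decision for this component."""
    if component.path is Path.CONTROL:
        return "redundancy decided by risk assessment"
    return "availability decided by operational requirements and RTO"


def meets_rto(restore_hours: float, rto_hours: float = 4.0) -> bool:
    """True if a measured restore time is within the Recovery Time Objective."""
    return restore_hours <= rto_hours


if __name__ == "__main__":
    for comp in (Component("PLC CPU", Path.CONTROL),
                 Component("SCADA server", Path.MONITORING)):
        print(comp.name, "->", design_basis(comp))
    print(meets_rto(3.5))  # True
    print(meets_rto(6.0))  # False
```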