N+1 redundancy

N+1 redundancy is a form of resilience that ensures system availability in the event of component failure. Components (N) have at least one independent backup component (+1). This level of resilience is referred to as active/passive or standby, as the backup components do not actively participate in the system during normal operation. The level of transparency (the degree of disruption to system availability) during failover depends on the specific solution, and system resilience is degraded while failover takes place.[1]
It is also possible to have N+1 redundancy with active-active components. In such cases the backup component remains active in normal operation even when all other components are fully functional, and the system can continue to operate through, and recover from, a single component failure.

Examples of N+1 redundancy

N+1 redundancy:

  • Connecting devices (servers, etc.) in dual-switch SAN (storage area network) fabrics employ a discrete path to each switch. Only one path is active at any given time; resiliency is provided by the availability of an additional path if the active path becomes unavailable.
  • Data centre power generators that activate when the normal power source is unavailable.

1+1 redundancy:

1+1 redundancy typically offers the advantage of additional failover transparency in the event of component failure. This level of resilience is referred to as active/active or hot, as the backup components actively participate in the system during normal operation. Failover is generally transparent (there is little or no disruption to system availability) because a distinct failover event does not actually occur; only system resilience degrades, since the backup components were already active within the system.[1]
  • Dual active power supplies in a server.
  • Mirrored hard drives within a server/PC system.

2+1 or 3+1 redundancy:

2+1 redundancy or 3+1 redundancy is common in power systems for blade servers, where a relatively small number of highly rated PSUs efficiently power a greater number of blades. An example is a server chassis that has three power supplies; the system may be set to 2+1 redundancy so that the blades can draw the power of two PSUs and have one available to provide redundancy if one fails. It is also common to mix live (hot) redundancy, where PSUs are online, with cold standby redundancy, where they are offline until needed. The reason for this, in the case of PSUs, is that a common failure mode is end-of-life component failure, and if PSUs are equally used they are likely to fail within a short time of each other toward the end of their service life.
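As a rough illustration of how such a PSU count might be chosen, the sketch below sizes an N+1 power configuration from a total load and a PSU rating; the wattage figures are invented for the example and are not taken from any particular chassis.

```python
import math

def psus_required(total_load_w: float, psu_rating_w: float, spares: int = 1) -> int:
    """PSUs needed to carry the load (N), plus the extra redundant units (+1, +M)."""
    active = math.ceil(total_load_w / psu_rating_w)  # minimum PSUs to power the blades
    return active + spares                           # add the standby capacity

# Hypothetical chassis: 14 blades drawing 350 W each, 2700 W PSUs.
load_w = 14 * 350                    # 4900 W total draw
print(psus_required(load_w, 2700))   # ceil(4900/2700) = 2 active + 1 spare -> 3 PSUs (2+1)
```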
Applications
Redundant systems are often used in data centers to ensure that computer systems continue without service interruption. Other common implementations include aerospace, where redundant systems are used to improve the safety and integrity of spacecraft electric power systems, and automobiles, where the emergency brake is available as a redundant component in case of failure of the main brake system.

High availability

High availability is a characteristic of a system, which aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period.
Modernization has resulted in an increased reliance on these systems. For example, hospitals and data centers require high availability of their systems to perform routine daily activities. Availability refers to the ability of the user community to obtain a service or good, or to access the system, whether to submit new work, update or alter existing work, or collect the results of previous work. If a user cannot access the system, it is – from the user's point of view – unavailable.[1] Generally, the term downtime is used to refer to periods when a system is unavailable.

Principles

There are three principles of systems design in reliability engineering which can help achieve high availability.
  1. Elimination of single points of failure. This means adding redundancy to the system so that failure of a component does not mean failure of the entire system.
  2. Reliable crossover. In redundant systems, the crossover point itself tends to become a single point of failure. Reliable systems must provide for reliable crossover.
  3. Detection of failures as they occur. If the two principles above are observed, then a user may never see a failure – but the maintenance activity must.

Scheduled and unscheduled downtime

A distinction can be made between scheduled and unscheduled downtime. Typically, scheduled downtime is a result of maintenance that is disruptive to system operation and usually cannot be avoided with a currently installed system design. Scheduled downtime events might include patches to system software that require a reboot or system configuration changes that only take effect upon a reboot. In general, scheduled downtime is usually the result of some logical, management-initiated event. Unscheduled downtime events typically arise from some physical event, such as a hardware or software failure or environmental anomaly. Examples of unscheduled downtime events include power outages, failed CPU or RAM components (or possibly other failed hardware components), an over-temperature related shutdown, logically or physically severed network connections, security breaches, or various application, middleware, and operating system failures.
If users can be warned away from scheduled downtimes, then the distinction is useful. But if the requirement is for true high availability, then downtime is downtime whether or not it is scheduled.
Many computing sites exclude scheduled downtime from availability calculations, assuming that it has little or no impact upon the computing user community. By doing this, they can claim to have phenomenally high availability, which might give the illusion of continuous availability. Systems that exhibit truly continuous availability are comparatively rare and higher priced, and most have carefully implemented specialty designs that eliminate any single point of failure and allow online hardware, network, operating system, middleware, and application upgrades, patches, and replacements. For certain systems, scheduled downtime does not matter, for example system downtime at an office building after everybody has gone home for the night.

Percentage calculation

Availability is usually expressed as a percentage of uptime in a given year. The following table shows the downtime that will be allowed for a particular percentage of availability, presuming that the system is required to operate continuously. Service level agreements often refer to monthly downtime or availability in order to calculate service credits to match monthly billing cycles. The following table shows the translation from a given availability percentage to the corresponding amount of time a system would be unavailable.
Availability % | Downtime per year[2] | Downtime per month | Downtime per week | Downtime per day
55.5555555% ("nine fives") | 162.33 days | 13.53 days | 74.92 hours | 10.67 hours
90% ("one nine") | 36.53 days | 73.05 hours | 16.80 hours | 2.40 hours
95% ("one and a half nines") | 18.26 days | 36.53 hours | 8.40 hours | 1.20 hours
97% | 10.96 days | 21.92 hours | 5.04 hours | 43.20 minutes
98% | 7.31 days | 14.61 hours | 3.36 hours | 28.80 minutes
99% ("two nines") | 3.65 days | 7.31 hours | 1.68 hours | 14.40 minutes
99.5% ("two and a half nines") | 1.83 days | 3.65 hours | 50.40 minutes | 7.20 minutes
99.8% | 17.53 hours | 87.66 minutes | 20.16 minutes | 2.88 minutes
99.9% ("three nines") | 8.77 hours | 43.83 minutes | 10.08 minutes | 1.44 minutes
99.95% ("three and a half nines") | 4.38 hours | 21.92 minutes | 5.04 minutes | 43.20 seconds
99.99% ("four nines") | 52.60 minutes | 4.38 minutes | 1.01 minutes | 8.64 seconds
99.995% ("four and a half nines") | 26.30 minutes | 2.19 minutes | 30.24 seconds | 4.32 seconds
99.999% ("five nines") | 5.26 minutes | 26.30 seconds | 6.05 seconds | 864.00 milliseconds
99.9999% ("six nines") | 31.56 seconds | 2.63 seconds | 604.80 milliseconds | 86.40 milliseconds
99.99999% ("seven nines") | 3.16 seconds | 262.98 milliseconds | 60.48 milliseconds | 8.64 milliseconds
99.999999% ("eight nines") | 315.58 milliseconds | 26.30 milliseconds | 6.05 milliseconds | 864.00 microseconds
99.9999999% ("nine nines") | 31.56 milliseconds | 2.63 milliseconds | 604.80 microseconds | 86.40 microseconds
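The table's arithmetic can be reproduced with a few lines of code. The sketch below assumes a 365.25-day year (as the table does) and simply multiplies the unavailability fraction by the length of each period.

```python
def allowed_downtime_hours(availability_pct: float) -> dict:
    """Allowed downtime, in hours, per year/month/week/day for a given availability %."""
    unavailability = 1 - availability_pct / 100.0
    period_hours = {"year": 365.25 * 24, "month": 365.25 * 24 / 12, "week": 7 * 24, "day": 24}
    return {period: unavailability * hours for period, hours in period_hours.items()}

for pct in (99.0, 99.9, 99.999):
    d = allowed_downtime_hours(pct)
    print(f"{pct}%: {d['year']:.2f} h/year, {d['week'] * 60:.2f} min/week, {d['day'] * 3600:.2f} s/day")
# 99.999%: 0.09 h/year (about 5.26 minutes), 0.10 min/week, 0.86 s/day
```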
Uptime and availability can be used synonymously as long as the items being discussed are kept consistent. A system can be up while its services are not available, as in the case of a network outage. Conversely, a system may be available to work on while its services are down from a functional perspective (as opposed to a software service/process perspective). The perspective matters: the item being discussed may be the server hardware, the server OS, a functional service, or a software service/process. As long as the perspective is kept consistent throughout a discussion, uptime and availability can be used synonymously.

"Nines"

Percentages of a particular order of magnitude are sometimes referred to by the number of nines or "class of nines" in the digits. For example, electricity that is delivered without interruptions (blackouts, brownouts or surges) 99.999% of the time would have 5 nines reliability, or class five.[3] In particular, the term is used in connection with mainframes[4][5] or enterprise computing, often as part of a service-level agreement.
Similarly, percentages ending in a 5 have conventional names, traditionally the number of nines, then "five", so 99.95% is "three nines five", abbreviated 3N5.[6][7] This is casually referred to as "three and a half nines",[8] but this is incorrect: a 5 is only a factor of 2, while a 9 is a factor of 10, so a 5 is 0.3 nines (per the formula below: \log_{10} 2 \approx 0.3):[a] 99.95% availability is 3.3 nines, not 3.5 nines.[9] More simply, going from 99.9% availability to 99.95% availability is a factor of 2 (0.1% to 0.05% unavailability), but going from 99.95% to 99.99% availability is a factor of 5 (0.05% to 0.01% unavailability), over twice as much.[b]
A formulation of the class of 9s c based on a system's unavailability x would be
c := \lfloor -\log_{10} x \rfloor
A similar measurement is sometimes used to describe the purity of substances.
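For illustration, the formula above translates directly into code; the only subtlety is working from the unavailability x rather than from the availability figure itself (a minimal sketch, not a standard library function).

```python
import math

def class_of_nines(unavailability: float) -> int:
    """c = floor(-log10(x)) for an unavailability x, per the formula above."""
    return math.floor(-math.log10(unavailability))

print(class_of_nines(0.001))    # 3 -> 99.9% availability, "three nines"
print(class_of_nines(0.0005))   # 3 -> 99.95% is still class 3, not "3.5 nines"
print(class_of_nines(0.0001))   # 4 -> 99.99% availability, "four nines"
```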
In general, the number of nines is not often used by a network engineer when modeling and measuring availability, because it is hard to apply in formulae. More often, the unavailability expressed as a probability (such as 0.00001) or a downtime per year is quoted. Availability specified as a number of nines is often seen in marketing documents.[citation needed] The use of the "nines" has been called into question, since it does not appropriately reflect that the impact of unavailability varies with its time of occurrence.[10] For large numbers of nines, the "unavailability" index (a measure of downtime rather than uptime) is easier to handle. For example, this is why an "unavailability" rather than an availability metric is used for hard disk or data link bit error rates.

Measurement and interpretation

Availability measurement is subject to some degree of interpretation. A system that has been up for 365 days in a non-leap year might have been eclipsed by a network failure that lasted for 9 hours during a peak usage period; the user community will see the system as unavailable, whereas the system administrator will claim 100% uptime. However, given the true definition of availability, the system will be approximately 99.9% available, or three nines (8751 hours of available time out of 8760 hours per non-leap year). Also, systems experiencing performance problems are often deemed partially or entirely unavailable by users, even when the systems are continuing to function. Similarly, unavailability of select application functions might go unnoticed by administrators yet be devastating to users — a true availability measure is holistic.
Availability must be measured to be determined, ideally with comprehensive monitoring tools ("instrumentation") that are themselves highly available. If there is a lack of instrumentation, systems supporting high volume transaction processing throughout the day and night, such as credit card processing systems or telephone switches, are often inherently better monitored, at least by the users themselves, than systems which experience periodic lulls in demand.
An alternative metric is mean time between failures (MTBF).

Closely related concepts

Recovery time (or estimated time of repair (ETR), also known as recovery time objective (RTO)) is closely related to availability: it is the total time required for a planned outage or the time required to fully recover from an unplanned outage. Another metric is mean time to recovery (MTTR). Recovery time could be infinite with certain system designs and failures, i.e. full recovery is impossible. One such example is a fire or flood that destroys a data center and its systems when there is no secondary disaster recovery data center.
Another related concept is data availability, that is the degree to which databases and other information storage systems faithfully record and report system transactions. Information management specialists often focus separately on data availability in order to determine acceptable (or actual) data loss with various failure events. Some users can tolerate application service interruptions but cannot tolerate data loss.
A service level agreement ("SLA") formalizes an organization's availability objectives and requirements.

System design

Paradoxically, adding more components to an overall system design can undermine efforts to achieve high availability. That is because complex systems inherently have more potential failure points and are more difficult to implement correctly. While some analysts would put forth the theory that the most highly available systems adhere to a simple architecture (a single, high quality, multi-purpose physical system with comprehensive internal hardware redundancy), this architecture suffers from the requirement that the entire system must be brought down for patching and operating system upgrades. More advanced system designs allow for systems to be patched and upgraded without compromising service availability (see load balancing and failover).
High availability requires less human intervention to restore operation in complex systems, since the most common cause of outages is human error.[11]
Redundancy is used to create systems with high levels of availability (e.g. aircraft flight computers). In this case it is required to have high levels of failure detectability and avoidance of common cause failures. Two kinds of redundancy are passive redundancy and active redundancy.
Passive redundancy is used to achieve high availability by including enough excess capacity in the design to accommodate a performance decline. The simplest example is a boat with two separate engines driving two separate propellers. The boat continues toward its destination despite failure of a single engine or propeller. A more complex example is multiple redundant power generation facilities within a large system involving electric power transmission. Malfunction of single components is not considered to be a failure unless the resulting performance decline exceeds the specification limits for the entire system.
Active redundancy is used in complex systems to achieve high availability with no performance decline. Multiple items of the same kind are incorporated into a design that includes a method to detect failure and automatically reconfigure the system to bypass failed items using a voting scheme. This is used with complex computing systems that are linked. Internet routing is derived from early work by Birman and Joseph in this area.[12] Active redundancy may introduce more complex failure modes into a system, such as continuous system reconfiguration due to faulty voting logic.
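A toy sketch of the voting idea behind active redundancy is given below: redundant units all produce an output, a majority vote masks a disagreeing unit, and that unit is flagged for bypass. This only illustrates the principle, not the logic of any particular flight computer or product.

```python
from collections import Counter

def vote(outputs):
    """Majority-vote the outputs of redundant active units.
    Returns the agreed value and the indices of units that disagree (candidates for bypass)."""
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise RuntimeError("no majority: the failure cannot be masked")
    disagreeing = [i for i, v in enumerate(outputs) if v != value]
    return value, disagreeing

print(vote([42, 42, 41]))   # (42, [2]) -> unit 2 is out-voted and can be bypassed
```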
Zero downtime system design means that modeling and simulation indicates mean time between failures significantly exceeds the period of time between planned maintenance or upgrade events, or the system lifetime. Zero downtime involves massive redundancy, which is needed for some types of aircraft and for most kinds of communications satellites. The Global Positioning System is an example of a zero downtime system.
Fault instrumentation can be used in systems with limited redundancy to achieve high availability. Maintenance actions occur during brief periods of down-time only after a fault indicator activates. Failure is only significant if this occurs during a mission critical period.
Modeling and simulation is used to evaluate the theoretical reliability for large systems. The outcome of this kind of model is used to evaluate different design options. A model of the entire system is created, and the model is stressed by removing components. Redundancy simulation involves the N-x criterion: N represents the total number of components in the system, and x is the number of components used to stress the system. N-1 means the model is stressed by evaluating performance with all possible combinations where one component is faulted. N-2 means the model is stressed by evaluating performance with all possible combinations where two components are faulted simultaneously.
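The N-x stress test described above amounts to enumerating failure combinations against a capacity model. The sketch below does this for a made-up set of component ratings and a made-up demand figure, purely to illustrate the procedure.

```python
from itertools import combinations

def n_minus_x_violations(capacities, demand, x):
    """Return every combination of x simultaneously failed components
    that would leave the remaining capacity below demand."""
    violations = []
    for failed in combinations(range(len(capacities)), x):
        remaining = sum(c for i, c in enumerate(capacities) if i not in failed)
        if remaining < demand:
            violations.append(failed)
    return violations

capacities = [30, 30, 30, 20]                   # illustrative component ratings only
print(n_minus_x_violations(capacities, 60, 1))  # N-1: [] -> any single failure is survivable
print(n_minus_x_violations(capacities, 60, 2))  # N-2: [(0, 1), (0, 2), (1, 2)] -> some double failures are not
```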

Reasons for unavailability

A survey among academic availability experts in 2010 ranked reasons for unavailability of enterprise IT systems. All reasons refer to not following best practice in each of the following areas (in order of importance):[13]
  1. Monitoring of the relevant components
  2. Requirements and procurement
  3. Operations
  4. Avoidance of network failures
  5. Avoidance of internal application failures
  6. Avoidance of external services that fail
  7. Physical environment
  8. Network redundancy
  9. Technical solution of backup
  10. Process solution of backup
  11. Physical location
  12. Infrastructure redundancy
  13. Storage architecture redundancy
The factors themselves are based on the work of Evan Marcus and Hal Stern.[14]

Costs of unavailability

In a 1998 report from IBM Global Services, unavailable systems were estimated to have cost American businesses $4.54 billion in 1996, due to lost productivity and revenues.[15]
High availability is one of the primary requirements of the control systems in unmanned vehicles[16] and autonomous maritime vessels.[17] If the controlling system becomes unavailable, the Ground Combat Vehicle (GCV) or ASW Continuous Trail Unmanned Vessel (ACTUV) would be lost.

High-availability cluster


High-availability clusters (also known as HA clusters or fail-over clusters) are groups of computers that support server applications that can be reliably utilized with a minimum amount of down-time. They operate by using high availability software to harness redundant computers in groups or clusters that provide continued service when system components fail. Without clustering, if a server running a particular application crashes, the application will be unavailable until the crashed server is fixed. HA clustering remedies this situation by detecting hardware/software faults, and immediately restarting the application on another system without requiring administrative intervention, a process known as failover. As part of this process, clustering software may configure the node before starting the application on it. For example, appropriate file systems may need to be imported and mounted, network hardware may have to be configured, and some supporting applications may need to be running as well.[1]
HA clusters are often used for critical databases, file sharing on a network, business applications, and customer services such as electronic commerce websites.
HA cluster implementations attempt to build redundancy into a cluster to eliminate single points of failure, including multiple network connections and data storage which is redundantly connected via storage area networks.
HA clusters usually use a heartbeat private network connection which is used to monitor the health and status of each node in the cluster. One subtle but serious condition all clustering software must be able to handle is split-brain, which occurs when all of the private links go down simultaneously, but the cluster nodes are still running. If that happens, each node in the cluster may mistakenly decide that every other node has gone down and attempt to start services that other nodes are still running. Having duplicate instances of services may cause data corruption on the shared storage.
HA clusters often also use quorum witness storage (local or cloud) to avoid this scenario. A witness device cannot be shared between two halves of a split cluster, so in the event that all cluster members cannot communicate with each other (e.g., failed heartbeat), if a member cannot access the witness, it cannot become active.
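The decision rule can be sketched as a simple vote count: a partition may host services only if it can see a strict majority of the votes, with the witness counting as one extra vote. This is a simplified model of quorum, not the algorithm of any specific clustering product.

```python
def may_become_active(reachable_nodes: int, total_nodes: int, witness_reachable: bool) -> bool:
    """Simplified quorum rule: each node and the witness carry one vote;
    a partition may run services only if it holds a strict majority of all votes."""
    total_votes = total_nodes + 1                                # nodes + witness
    my_votes = reachable_nodes + (1 if witness_reachable else 0)
    return my_votes > total_votes // 2

# Two-node cluster whose heartbeat link has just failed (each half sees only itself):
print(may_become_active(1, 2, witness_reachable=True))    # True  - this half owns the witness
print(may_become_active(1, 2, witness_reachable=False))   # False - stay passive, avoiding split-brain
```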

Application design requirements

Not every application can run in a high-availability cluster environment, and the necessary design decisions need to be made early in the software design phase. In order to run in a high-availability cluster environment, an application must satisfy at least the following technical requirements, the last two of which are critical to its reliable function in a cluster, and are the most difficult to satisfy fully:
  • There must be a relatively easy way to start, stop, force-stop, and check the status of the application. In practical terms, this means the application must have a command line interface or scripts to control the application, including support for multiple instances of the application.
  • The application must be able to use shared storage (NAS/SAN).
  • Most importantly, the application must store as much of its state on non-volatile shared storage as possible. Equally important is the ability to restart on another node at the last state before failure using the saved state from the shared storage.
  • The application must not corrupt data if it crashes, or restarts from the saved state.
  • A number of these constraints can be minimized through the use of virtual server environments, wherein the hypervisor itself is cluster-aware and provides seamless migration of virtual machines (including running memory state) between physical hosts -- see Microsoft Server 2012 and 2016 Failover Clusters.
    • A key difference between this approach and running cluster-aware applications is that the latter can deal with server application crashes and support live "rolling" software upgrades while maintaining client access to the service (e.g. database), by having one instance provide service while another is being upgraded or repaired. This requires the cluster instances to communicate, flush caches and coordinate file access during hand-off.

Node configurations

2 node High Availability Cluster network diagram
The most common size for an HA cluster is a two-node cluster, since that is the minimum required to provide redundancy, but many clusters consist of many more, sometimes dozens of nodes.
The attached diagram is a good overview of a classic HA cluster, with the caveat that it does not make any mention of quorum/witness functionality (see above).
Such configurations can sometimes be categorized into one of the following models:
  • Active/active — Traffic intended for the failed node is either passed onto an existing node or load balanced across the remaining nodes. This is usually only possible when the nodes use a homogeneous software configuration.
  • Active/passive — Provides a fully redundant instance of each node, which is only brought online when its associated primary node fails.[2] This configuration typically requires the most extra hardware.
  • N+1 — Provides a single extra node that is brought online to take over the role of the node that has failed. In the case of heterogeneous software configuration on each primary node, the extra node must be universally capable of assuming any of the roles of the primary nodes it is responsible for. This normally refers to clusters that have multiple services running simultaneously; in the single service case, this degenerates to active/passive.
  • N+M — In cases where a single cluster is managing many services, having only one dedicated failover node might not offer sufficient redundancy. In such cases, more than one (M) standby server is included and available. The number of standby servers is a tradeoff between cost and reliability requirements.
  • N-to-1 — Allows the failover standby node to become the active one temporarily, until the original node can be restored or brought back online, at which point the services or instances must be failed-back to it in order to restore high availability.
  • N-to-N — A combination of active/active and N+M clusters, N to N clusters redistribute the services, instances or connections from the failed node among the remaining active nodes, thus eliminating (as with active/active) the need for a 'standby' node, but introducing a need for extra capacity on all active nodes.
The terms logical host or cluster logical host are used to describe the network address that is used to access services provided by the cluster. This logical host identity is not tied to a single cluster node. It is actually a network address/hostname that is linked with the service(s) provided by the cluster. If a cluster node with a running database goes down, the database will be restarted on another cluster node.

Node reliability

HA clusters usually use all available techniques to make the individual systems and shared infrastructure as reliable as possible. These measures help minimize the chances that the clustering failover between systems will be required. In such a failover, the service provided is unavailable for at least a little while, so measures to avoid failover are preferred.

Failover strategies

Systems that handle failures in distributed computing have different strategies to cure a failure. For instance, the Apache Cassandra API Hector defines three ways to configure a failover (a generic sketch follows the list):
  • Fail Fast, scripted as "FAIL_FAST", means that the attempt to cure the failure fails if the first node cannot be reached.
  • On Fail, Try One - Next Available, scripted as "ON_FAIL_TRY_ONE_NEXT_AVAILABLE", means that the system tries one host, the most accessible or available, before giving up.
  • On Fail, Try All, scripted as "ON_FAIL_TRY_ALL_AVAILABLE", means that the system tries all existing, available nodes before giving up.
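The sketch below shows how the three policies differ, using a generic, client-agnostic retry loop; the `request` callable and the policy names as Python strings are placeholders, not the actual Hector API.

```python
def call_with_failover(hosts, request, policy="ON_FAIL_TRY_ALL_AVAILABLE"):
    """Try hosts in order according to the chosen failover policy."""
    if policy == "FAIL_FAST":
        candidates = hosts[:1]      # give up if the first node cannot be reached
    elif policy == "ON_FAIL_TRY_ONE_NEXT_AVAILABLE":
        candidates = hosts[:2]      # the preferred node plus a single fallback
    else:                           # ON_FAIL_TRY_ALL_AVAILABLE
        candidates = hosts          # keep trying until every node has been attempted
    last_error = None
    for host in candidates:
        try:
            return request(host)    # placeholder for the real client call
        except ConnectionError as err:
            last_error = err
    if last_error is None:
        raise RuntimeError("no hosts to try")
    raise last_error
```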

Implementations

Several free and commercial HA cluster solutions are available.

RAID


RAID (Redundant Array of Independent Disks, originally Redundant Array of Inexpensive Disks) is a data storage virtualization technology that combines multiple physical disk drive components into one or more logical units for the purposes of data redundancy, performance improvement, or both.[1]
Data is distributed across the drives in one of several ways, referred to as RAID levels, depending on the required level of redundancy and performance. The different schemes, or data distribution layouts, are named by the word "RAID" followed by a number, for example RAID 0 or RAID 1. Each scheme, or RAID level, provides a different balance among the key goals: reliability, availability, performance, and capacity. RAID levels greater than RAID 0 provide protection against unrecoverable sector read errors, as well as against failures of whole physical drives.

History

The term "RAID" was invented by David Patterson, Garth A. Gibson, and Randy Katz at the University of California, Berkeley in 1987. In their June 1988 paper "A Case for Redundant Arrays of Inexpensive Disks (RAID)", presented at the SIGMOD conference, they argued that the top performing mainframe disk drives of the time could be beaten on performance by an array of the inexpensive drives that had been developed for the growing personal computer market. Although failures would rise in proportion to the number of drives, by configuring for redundancy, the reliability of an array could far exceed that of any large single drive.[2]
Although not yet using that terminology, the technologies of the five levels of RAID named in the June 1988 paper were used in various products prior to the paper's publication,[3] including the following:
  • Mirroring (RAID 1) was well established in the 1970s including for example Tandem NonStop Systems.
  • In 1977, Norman Ken Ouchi at IBM filed a patent disclosing what was subsequently named RAID 4.[4]
  • Around 1983, DEC began shipping subsystem mirrored RA8X disk drives (now known as RAID 1) as part of its HSC50 subsystem.[5]
  • In 1986, Clark et al. at IBM filed a patent disclosing what was subsequently named RAID 5.[6]
  • Around 1988, the Thinking Machines' DataVault used error correction codes (now known as RAID 2) in an array of disk drives.[7] A similar approach was used in the early 1960s on the IBM 353.[8][9]
Industry RAID manufacturers later redefined the acronym to stand for "Redundant Array of Independent Disks".[10][11][12][13]

Overview

Many RAID levels employ an error protection scheme called "parity", a widely used method in information technology to provide fault tolerance in a given set of data. Most use simple XOR, but RAID 6 uses two separate parities based respectively on addition and multiplication in a particular Galois field or Reed–Solomon error correction.[14]
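A minimal illustration of single (XOR) parity: the parity block is the byte-wise XOR of the data blocks, so any one missing block can be reconstructed by XOR-ing everything that survives. The block contents here are toy values; real arrays operate on full sectors or stripes rather than two-byte blocks.

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

d0, d1, d2 = b"\x10\x20", b"\x0f\x0f", b"\xaa\x55"   # toy data blocks on three drives
parity = xor_blocks(d0, d1, d2)                      # written to the parity drive

# If the drive holding d1 fails, its contents are recovered from the survivors:
assert xor_blocks(d0, d2, parity) == d1
```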
RAID can also provide data security with solid-state drives (SSDs) without the expense of an all-SSD system. For example, a fast SSD can be mirrored with a mechanical drive. For this configuration to provide a significant speed advantage an appropriate controller is needed that uses the fast SSD for all read operations. Adaptec calls this "hybrid RAID".[15]

Standard levels

Storage servers with 24 hard disk drives and built-in hardware RAID controllers supporting various RAID levels
A number of standard schemes have evolved. These are called levels. Originally, there were five RAID levels, but many variations have evolved, notably several nested levels and many non-standard levels (mostly proprietary). RAID levels and their associated data formats are standardized by the Storage Networking Industry Association (SNIA) in the Common RAID Disk Drive Format (DDF) standard:[16][17]
RAID 0
RAID 0 consists of striping, but no mirroring or parity. Compared to a spanned volume, the capacity of a RAID 0 volume is the same; it is the sum of the capacities of the disks in the set. But because striping distributes the contents of each file among all disks in the set, the failure of any disk causes all files, the entire RAID 0 volume, to be lost. A broken spanned volume at least preserves the files on the unfailing disks. The benefit of RAID 0 is that the throughput of read and write operations to any file is multiplied by the number of disks because, unlike spanned volumes, reads and writes are done concurrently;[11] the cost is complete vulnerability to drive failures.
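A toy sketch of how striping places data, assuming a simple round-robin layout with one block per stripe unit (real implementations use configurable stripe sizes): consecutive logical blocks land on different disks, which is why reads and writes can proceed concurrently, and why losing any one disk loses part of every file.

```python
def raid0_location(logical_block: int, num_disks: int) -> tuple[int, int]:
    """Map a logical block number to (disk index, block offset on that disk)
    for a simple round-robin RAID 0 stripe."""
    return logical_block % num_disks, logical_block // num_disks

print([raid0_location(b, num_disks=4) for b in range(8)])
# [(0, 0), (1, 0), (2, 0), (3, 0), (0, 1), (1, 1), (2, 1), (3, 1)]
```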
RAID 1
RAID 1 consists of data mirroring, without parity or striping. Data is written identically to two drives, thereby producing a "mirrored set" of drives. Thus, any read request can be serviced by any drive in the set. If a request is broadcast to every drive in the set, it can be serviced by the drive that accesses the data first (depending on its seek time and rotational latency), improving performance. Sustained read throughput, if the controller or software is optimized for it, approaches the sum of throughputs of every drive in the set, just as for RAID 0. Actual read throughput of most RAID 1 implementations is slower than the fastest drive. Write throughput is always slower because every drive must be updated, and the slowest drive limits the write performance. The array continues to operate as long as at least one drive is functioning.[11]
RAID 2
RAID 2 consists of bit-level striping with dedicated Hamming-code parity. All disk spindle rotation is synchronized and data is striped such that each sequential bit is on a different drive. Hamming-code parity is calculated across corresponding bits and stored on at least one parity drive.[11] This level is of historical significance only; although it was used on some early machines (for example, the Thinking Machines CM-2),[18] as of 2014 it is not used by any commercially available system.[19]
RAID 3
RAID 3 consists of byte-level striping with dedicated parity. All disk spindle rotation is synchronized and data is striped such that each sequential byte is on a different drive. Parity is calculated across corresponding bytes and stored on a dedicated parity drive.[11] Although implementations exist,[20] RAID 3 is not commonly used in practice.
RAID 4
RAID 4 consists of block-level striping with dedicated parity. This level was previously used by NetApp, but has now been largely replaced by a proprietary implementation of RAID 4 with two parity disks, called RAID-DP.[21] The main advantage of RAID 4 over RAID 2 and 3 is I/O parallelism: in RAID 2 and 3, a single read I/O operation requires reading the whole group of data drives, while in RAID 4 one I/O read operation does not have to spread across all data drives. As a result, more I/O operations can be executed in parallel, improving the performance of small transfers.[1]
RAID 5
RAID 5 consists of block-level striping with distributed parity. Unlike RAID 4, parity information is distributed among the drives, requiring all drives but one to be present to operate. Upon failure of a single drive, subsequent reads can be calculated from the distributed parity such that no data is lost. RAID 5 requires at least three disks.[11] Like all single-parity concepts, large RAID 5 implementations are susceptible to system failures because of trends regarding array rebuild time and the chance of drive failure during rebuild (see "Increasing rebuild time and failure probability" section, below).[22] Rebuilding an array requires reading all data from all disks, opening a chance for a second drive failure and the loss of the entire array. In August 2012, Dell posted an advisory against the use of RAID 5 in any configuration on Dell EqualLogic arrays and RAID 50 with "Class 2 7200 RPM drives of 1 TB and higher capacity" for business-critical data.[23]
RAID 6
RAID 6 consists of block-level striping with double distributed parity. Double parity provides fault tolerance up to two failed drives. This makes larger RAID groups more practical, especially for high-availability systems, as large-capacity drives take longer to restore. RAID 6 requires a minimum of four disks. As with RAID 5, a single drive failure results in reduced performance of the entire array until the failed drive has been replaced.[11] With a RAID 6 array, using drives from multiple sources and manufacturers, it is possible to mitigate most of the problems associated with RAID 5. The larger the drive capacities and the larger the array size, the more important it becomes to choose RAID 6 instead of RAID 5.[24] RAID 10 also minimizes these problems.[25]

Nested (hybrid) RAID

In what was originally termed hybrid RAID,[26] many storage controllers allow RAID levels to be nested. The elements of a RAID may be either individual drives or arrays themselves. Arrays are rarely nested more than one level deep.[27]
The final array is known as the top array. When the top array is RAID 0 (such as in RAID 1+0 and RAID 5+0), most vendors omit the "+" (yielding RAID 10 and RAID 50, respectively).
  • RAID 0+1: creates two stripes and mirrors them. If a single drive failure occurs, then one of the stripes has failed; at this point it is running effectively as RAID 0 with no redundancy. Significantly higher risk is introduced during a rebuild than with RAID 1+0, as all the data from all the drives in the remaining stripe has to be read rather than just from one drive, increasing the chance of an unrecoverable read error (URE) and significantly extending the rebuild window.[28][29][30]
  • RAID 1+0: (see: RAID 10) creates a striped set from a series of mirrored drives. The array can sustain multiple drive losses so long as no mirror loses all its drives.[31]
  • JBOD RAID N+N: With JBOD (just a bunch of disks), it is possible to concatenate disks, but also volumes such as RAID sets. With larger drive capacities, write delay and rebuilding time increase dramatically (especially, as described above, with RAID 5 and RAID 6). By splitting a larger RAID N set into smaller subsets and concatenating them with linear JBOD,[clarification needed] write and rebuilding time will be reduced. If a hardware RAID controller is not capable of nesting linear JBOD with RAID N, then linear JBOD can be achieved with OS-level software RAID in combination with separate RAID N subset volumes created within one, or more, hardware RAID controller(s). Besides a drastic speed increase, this also provides a substantial advantage: the possibility to start a linear JBOD with a small set of disks and to be able to expand the total set with disks of different size, later on (in time, disks of bigger size become available on the market). There is another advantage in the form of disaster recovery (if a RAID N subset happens to fail, then the data on the other RAID N subsets is not lost, reducing restore time).[citation needed]

Non-standard levels

Many configurations other than the basic numbered RAID levels are possible, and many companies, organizations, and groups have created their own non-standard configurations, in many cases designed to meet the specialized needs of a small niche group. Such configurations include the following:
  • Linux MD RAID 10 provides a general RAID driver that in its "near" layout defaults to a standard RAID 1 with two drives, and a standard RAID 1+0 with four drives; however, it can include any number of drives, including odd numbers. With its "far" layout, MD RAID 10 can run both striped and mirrored, even with only two drives in f2 layout; this runs mirroring with striped reads, giving the read performance of RAID 0. Regular RAID 1, as provided by Linux software RAID, does not stripe reads, but can perform reads in parallel.[31][32][33]
  • Hadoop has a RAID system that generates a parity file by xor-ing a stripe of blocks in a single HDFS file.[34]
  • BeeGFS, the parallel file system, has internal striping (comparable to file-based RAID0) and replication (comparable to file-based RAID10) options to aggregate throughput and capacity of multiple servers and is typically based on top of an underlying RAID to make disk failures transparent.

Implementations

The distribution of data across multiple drives can be managed either by dedicated computer hardware or by software. A software solution may be part of the operating system, part of the firmware and drivers supplied with a standard drive controller (so-called "hardware-assisted software RAID"), or it may reside entirely within the hardware RAID controller.

Software-based

Software RAID implementations are provided by many modern operating systems. Software RAID can be implemented as:
  • A layer that abstracts multiple devices, thereby providing a single virtual device (e.g. Linux kernel's md and OpenBSD's softraid)
  • A more generic logical volume manager (provided with most server-class operating systems, e.g. Veritas or LVM)
  • A component of the file system (e.g. ZFS, Spectrum Scale, or Btrfs)
  • A layer that sits above any file system and provides parity protection to user data (e.g. RAID-F)[35]
Some advanced file systems are designed to organize data across multiple storage devices directly, without needing the help of a third-party logical volume manager:
  • ZFS supports the equivalents of RAID 0, RAID 1, RAID 5 (RAID-Z1) single-parity, RAID 6 (RAID-Z2) double-parity, and a triple-parity version (RAID-Z3), also referred to as RAID 7.[36] As it always stripes over top-level vdevs, it supports equivalents of the 1+0, 5+0, and 6+0 nested RAID levels (as well as striped triple-parity sets) but not other nested combinations. ZFS is the native file system on Solaris and illumos, and is also available on FreeBSD and Linux. Open-source ZFS implementations are actively developed under the OpenZFS umbrella project.[37][38][39][40][41]
  • Spectrum Scale, initially developed by IBM for media streaming and scalable analytics, supports declustered RAID protection schemes up to n+3. A particularity is the dynamic rebuilding priority which runs with low impact in the background until a data chunk hits n+0 redundancy, in which case this chunk is quickly rebuilt to at least n+1. On top, Spectrum Scale supports metro-distance RAID 1.[42]
  • Btrfs supports RAID 0, RAID 1 and RAID 10 (RAID 5 and 6 are under development).[43][44]
  • XFS was originally designed to provide an integrated volume manager that supports concatenating, mirroring and striping of multiple physical storage devices.[45] However, the implementation of XFS in Linux kernel lacks the integrated volume manager.[46]
Many operating systems provide RAID implementations, including the following:
  • Hewlett-Packard's OpenVMS operating system supports RAID 1. The mirrored disks, called a "shadow set", can be in different locations to assist in disaster recovery.[47]
  • Apple's macOS and macOS Server support RAID 0, RAID 1, and RAID 1+0.[48][49]
  • FreeBSD supports RAID 0, RAID 1, RAID 3, and RAID 5, and all nestings via GEOM modules and ccd.[50][51][52]
  • Linux's md supports RAID 0, RAID 1, RAID 4, RAID 5, RAID 6, and all nestings.[53] Certain reshaping/resizing/expanding operations are also supported.[54]
  • Microsoft's server operating systems support RAID 0, RAID 1, and RAID 5. Some of the Microsoft desktop operating systems support RAID. For example, Windows XP Professional supports RAID level 0, in addition to spanning multiple drives, but only if using dynamic disks and volumes. Windows XP can be modified to support RAID 0, 1, and 5.[55] Windows 8 and Windows Server 2012 introduced a RAID-like feature known as Storage Spaces, which also allows users to specify mirroring, parity, or no redundancy on a folder-by-folder basis.[56]
  • NetBSD supports RAID 0, 1, 4, and 5 via its software implementation, named RAIDframe.[57]
  • OpenBSD supports RAID 0, 1 and 5 via its software implementation, named softraid.[58]
If a boot drive fails, the system has to be sophisticated enough to be able to boot off the remaining drive or drives. For instance, consider a computer whose disk is configured as RAID 1 (mirrored drives); if the first drive in the array fails, then a first-stage boot loader might not be sophisticated enough to attempt loading the second-stage boot loader from the second drive as a fallback. The second-stage boot loader for FreeBSD is capable of loading a kernel from such an array.[59]

Firmware- and driver-based

SATA 3.0 controller that provides RAID functionality through proprietary firmware and drivers
Software-implemented RAID is not always compatible with the system's boot process, and it is generally impractical for desktop versions of Windows. However, hardware RAID controllers are expensive and proprietary. To fill this gap, inexpensive "RAID controllers" were introduced that do not contain a dedicated RAID controller chip, but simply a standard drive controller chip with proprietary firmware and drivers. During early bootup, the RAID is implemented by the firmware and, once the operating system has been more completely loaded, the drivers take over control. Consequently, such controllers may not work when driver support is not available for the host operating system.[60] An example is Intel Matrix RAID, implemented on many consumer-level motherboards.[61][62]
Because some minimal hardware support is involved, this implementation approach is also called "hardware-assisted software RAID",[63][64][65] "hybrid model" RAID,[65] or even "fake RAID".[66] If RAID 5 is supported, the hardware may provide a hardware XOR accelerator. An advantage of this model over the pure software RAID is that—if using a redundancy mode—the boot drive is protected from failure (due to the firmware) during the boot process even before the operating system's drivers take over.[65]

Integrity

Data scrubbing (referred to in some environments as patrol read) involves periodic reading and checking by the RAID controller of all the blocks in an array, including those not otherwise accessed. This detects bad blocks before use.[67] Data scrubbing checks for bad blocks on each storage device in an array, but also uses the redundancy of the array to recover bad blocks on a single drive and to reassign the recovered data to spare blocks elsewhere on the drive.[68]
Frequently, a RAID controller is configured to "drop" a component drive (that is, to assume a component drive has failed) if the drive has been unresponsive for eight seconds or so; this might cause the array controller to drop a good drive because that drive has not been given enough time to complete its internal error recovery procedure. Consequently, using RAID for consumer-marketed drives can be risky, and so-called "enterprise class" drives limit this error recovery time to reduce risk.[citation needed] Western Digital's desktop drives used to have a specific fix. A utility called WDTLER.exe limited a drive's error recovery time. The utility enabled TLER (time limited error recovery), which limits the error recovery time to seven seconds. Around September 2009, Western Digital disabled this feature in their desktop drives (e.g. the Caviar Black line), making such drives unsuitable for use in RAID configurations.[69] However, Western Digital enterprise class drives are shipped from the factory with TLER enabled. Similar technologies are used by Seagate, Samsung, and Hitachi. For non-RAID usage, an enterprise class drive with a short error recovery timeout that cannot be changed is therefore less suitable than a desktop drive.[69] In late 2010, the Smartmontools program began supporting the configuration of ATA Error Recovery Control, allowing the tool to configure many desktop class hard drives for use in RAID setups.[69]
While RAID may protect against physical drive failure, the data is still exposed to operator, software, hardware, and virus destruction. Many studies cite operator fault as the most common source of malfunction,[70] such as a server operator replacing the incorrect drive in a faulty RAID, and disabling the system (even temporarily) in the process.[71]
An array can be overwhelmed by catastrophic failure that exceeds its recovery capacity and the entire array is at risk of physical damage by fire, natural disaster, and human forces, while backups can be stored off site. An array is also vulnerable to controller failure because it is not always possible to migrate it to a new, different controller without data loss.[72]

Weaknesses

Correlated failures

In practice, the drives are often the same age (with similar wear) and subject to the same environment. Since many drive failures are due to mechanical issues (which are more likely on older drives), this violates the assumptions of independent, identical rate of failure amongst drives; failures are in fact statistically correlated.[11] In practice, the chances for a second failure before the first has been recovered (causing data loss) are higher than the chances for random failures. In a study of about 100,000 drives, the probability of two drives in the same cluster failing within one hour was four times larger than predicted by the exponential statistical distribution—which characterizes processes in which events occur continuously and independently at a constant average rate. The probability of two failures in the same 10-hour period was twice as large as predicted by an exponential distribution.[73]

Unrecoverable read errors during rebuild

Unrecoverable read errors (URE) present as sector read failures, also known as latent sector errors (LSE). The associated media assessment measure, unrecoverable bit error (UBE) rate, is typically guaranteed to be less than one bit in 10^15 for enterprise-class drives (SCSI, FC, SAS or SATA), and less than one bit in 10^14 for desktop-class drives (IDE/ATA/PATA or SATA). Increasing drive capacities and large RAID 5 instances have led to the maximum error rates being insufficient to guarantee a successful recovery, due to the high likelihood of such an error occurring on one or more remaining drives during a RAID set rebuild.[11][74] When rebuilding, parity-based schemes such as RAID 5 are particularly prone to the effects of UREs as they affect not only the sector where they occur, but also reconstructed blocks using that sector for parity computation. Thus, an URE during a RAID 5 rebuild typically leads to a complete rebuild failure.[75]
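A back-of-the-envelope calculation makes the problem concrete. Assuming independent bit errors at the quoted rates (a simplification; real error behaviour is more clustered), the chance of reading an entire surviving set without hitting a URE shrinks quickly as capacity grows. The drive sizes below are illustrative.

```python
def rebuild_survives_ure(data_to_read_tb: float, bit_error_rate: float) -> float:
    """Probability of re-reading the surviving drives without a single URE,
    under a naive independent-bit-error model."""
    bits = data_to_read_tb * 1e12 * 8        # decimal terabytes -> bits
    return (1 - bit_error_rate) ** bits

# Rebuilding a 4-drive RAID 5 set of 4 TB drives means re-reading ~12 TB:
print(rebuild_survives_ure(12, 1e-14))   # ~0.38 with a desktop-class drive (1 error per 10^14 bits)
print(rebuild_survives_ure(12, 1e-15))   # ~0.91 with an enterprise-class drive (1 error per 10^15 bits)
```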
Double-protection parity-based schemes, such as RAID 6, attempt to address this issue by providing redundancy that allows double-drive failures; as a downside, such schemes suffer from elevated write penalty—the number of times the storage medium must be accessed during a single write operation.[76] Schemes that duplicate (mirror) data in a drive-to-drive manner, such as RAID 1 and RAID 10, have a lower risk from UREs than those using parity computation or mirroring between striped sets.[25][77] Data scrubbing, as a background process, can be used to detect and recover from UREs, effectively reducing the risk of them happening during RAID rebuilds and causing double-drive failures. The recovery of UREs involves remapping of affected underlying disk sectors, utilizing the drive's sector remapping pool; in case of UREs detected during background scrubbing, data redundancy provided by a fully operational RAID set allows the missing data to be reconstructed and rewritten to a remapped sector.[78][79]

Increasing rebuild time and failure probability

Drive capacity has grown at a much faster rate than transfer speed, and error rates have only fallen a little in comparison. Therefore, larger-capacity drives may take hours if not days to rebuild, during which time other drives may fail or yet undetected read errors may surface. The rebuild time is also limited if the entire array is still in operation at reduced capacity.[80] Given an array with only one redundant drive (which applies to RAID levels 3, 4 and 5, and to "classic" two-drive RAID 1), a second drive failure would cause complete failure of the array. Even though individual drives' mean time between failure (MTBF) have increased over time, this increase has not kept pace with the increased storage capacity of the drives. The time to rebuild the array after a single drive failure, as well as the chance of a second failure during a rebuild, have increased over time.[22]
Some commentators have declared that RAID 6 is only a "band aid" in this respect, because it only kicks the problem a little further down the road.[22] However, according to the 2006 NetApp study of Berriman et al., the chance of failure decreases by a factor of about 3,800 (relative to RAID 5) for a proper implementation of RAID 6, even when using commodity drives.[81] Nevertheless, if the currently observed technology trends remain unchanged, in 2019 a RAID 6 array will have the same chance of failure as its RAID 5 counterpart had in 2010.[74][81]
Mirroring schemes such as RAID 10 have a bounded recovery time as they require the copy of a single failed drive, compared with parity schemes such as RAID 6, which require the copy of all blocks of the drives in an array set. Triple parity schemes, or triple mirroring, have been suggested as one approach to improve resilience to an additional drive failure during this large rebuild time.[81]

Atomicity: including parity inconsistency due to system crashes

A system crash or other interruption of a write operation can result in states where the parity is inconsistent with the data due to non-atomicity of the write process, such that the parity cannot be used for recovery in the case of a disk failure (the so-called RAID 5 write hole).[11] The RAID write hole is a known data corruption issue in older and low-end RAIDs, caused by interrupted destaging of writes to disk.[82] The write hole can be addressed with write-ahead logging. mdadm recently addressed it by introducing a dedicated journaling device (to avoid the performance penalty, SSDs or other non-volatile memory devices are typically preferred) for that purpose.[83][84]
This is a little understood and rarely mentioned failure mode for redundant storage systems that do not utilize transactional features. Database researcher Jim Gray wrote "Update in Place is a Poison Apple" during the early days of relational database commercialization.[85]

Write-cache reliability

There are concerns about write-cache reliability, specifically regarding devices equipped with a write-back cache, which is a caching system that reports the data as written as soon as it is written to cache, as opposed to when it is written to the non-volatile medium. If the system experiences a power loss or other major failure, the data may be irrevocably lost from the cache before reaching the non-volatile storage. For this reason good write-back cache implementations include mechanisms, such as redundant battery power, to preserve cache contents across system failures (including power failures) and to flush the cache at system restart time. [86]
source: https://en.m.wikipedia.org/wiki/N%2B1_redundancy#N+1_redundancy:
