Disaster recovery in the data center
By S. L. Sander
Disaster is a perpetual threat to your storage systems. It can appear in many forms, such as failed hard drives and data corruption. A virus that infects a hard drive or RAID array, for example, can corrupt or, in some cases, even delete large amounts of data on the drive or array.
When catastrophic events such as these occur, users who don't have a disaster recovery plan can pay a high price.
Disgruntled employees, former employees and malicious hackers intent on doing damage are all serious threats. Not to be discounted are natural and physical disasters: fire (and the water damage caused by sprinklers, halon systems or fire hoses), flood, earthquake, hurricane, tornado, even a runaway car slamming into a building.
The loss, theft or damage of notebook computers used by remote workers can be devastating to the workers who have lost their data, and can impose considerable costs on organizations that have not properly secured the data on those remote machines.
In this uncertain environment, IT managers must be able to anticipate and prepare for disaster. Several approaches can help them do so, the most prominent being backup, redundancy, mirroring and snapshot technologies.
Backup

Backup may be the granddaddy of all disaster preparation approaches. Historically, system files and data have been backed up onto tape cartridges, which are frequently stored offsite for security reasons. That way, users always have access to a set of data cartridges that can help them restore their data.
Complete restoration may also require identically configured replacement systems and compatible tape drives. Organizations that keep such equipment on hand don't have to wait for it to become available, but this approach carries a significant cost: standby equipment continues to decline in value while it sits unused.
Also, when restoring from tape, complete system and data restoration may take days or weeks if large amounts of data and multiple systems must be restored. Even a system that can restore 1 GB per minute will need more than 16 hours to restore a single terabyte. In today's multi-terabyte environments, even the fastest tape backup and restore may be impractical for organizations with large amounts of data.
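As a rough illustration of that arithmetic, the short sketch below assumes a sustained restore rate of 1 GB per minute; the figures are used only for the sake of the example:

```python
# Back-of-the-envelope restore-time estimate (illustrative figures only).
def restore_hours(data_gb: float, rate_gb_per_min: float) -> float:
    """Return the time in hours to restore data_gb at rate_gb_per_min."""
    return data_gb / rate_gb_per_min / 60

# 1 TB (about 1,000 GB) at 1 GB per minute:
print(f"{restore_hours(1000, 1.0):.1f} hours")   # ~16.7 hours
# A 10 TB environment at the same sustained rate:
print(f"{restore_hours(10000, 1.0):.1f} hours")  # ~166.7 hours, nearly a week
```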
Backup has been working its way to other, faster media. Affordable, recordable DVD provides relatively low capacity (when compared to tape), but much higher data rates.
Hard drives have often been suggested for system backup. With the price of hard drives continuing to decline faster than the cost of tape, it won't be long before hard drives cost less per megabyte than tape cartridges with comparable storage capacities.
Although hard drives may be somewhat more susceptible to damage from mishandling, hard-drive performance is clearly preferable to that of tape. New NAS devices, such as the Maxtor 4300 from Maxtor Corp., are designed to deliver high-performance system protection. The Maxtor 4300 provides 400 GB of hard-drive storage and costs about $6,000.
Backing up a SAN with a second SAN is another strategy. With an appropriate failover capability, the backup SAN can maintain operations while the failed SAN is replaced, and it can also be used to rebuild the files on the new SAN. (In most cases, however, a second SAN would be mirrored rather than configured for backup.)
Redundancy

Redundancy is built into many storage components. The idea is obvious: if a device fails, a second device already built into the system takes over. In tape libraries, for example, it is common to see redundant power supplies, redundant fans, redundant drives and perhaps even redundant controllers.
In storage arrays, RAID technologies mitigate the risk of component failure. Should a drive in a RAID array go down, the remaining drives continue to provide data services while notifying the IT manager of the failure. In most systems, the failed drive can be removed and quickly replaced with a hot-swappable drive, after which its contents are rebuilt onto the new drive from the data and parity striped across the other drives in the array.
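The rebuild works because parity lets any single missing block be recomputed from the surviving blocks in its stripe. The following is a minimal sketch of that idea in the style of RAID 5, using XOR parity; it is an illustration of the principle, not any vendor's implementation:

```python
# Minimal sketch of RAID-5-style reconstruction: each stripe stores a parity
# block equal to the XOR of its data blocks, so any single lost block can be
# rebuilt by XOR-ing the surviving blocks in that stripe.
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

# Three data blocks plus their parity block (one "stripe" across four drives).
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Simulate losing drive 1: rebuild its block from the survivors and the parity.
surviving = [data[0], data[2], parity]
rebuilt = xor_blocks(surviving)
assert rebuilt == data[1]
print("rebuilt block:", rebuilt)
```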
In the examples above, redundancy has been applied to storage and networking components. For better disaster preparedness, however, redundancy should also be considered at the level of entire sites.
In an extreme example, a data center in Los Angeles may be connected to a redundant data center in New York. Should a disaster befall either data center, the other would be able to immediately support users coast to coast.
Mirroring

In the Los Angeles/New York example, perhaps the most effective way to implement disaster preparation would be mirroring. With mirroring, data written to or deleted from one array is also written to or deleted from the mirror array. By mirroring its data systems, an organization can be reasonably assured that, should one system fail, the mirror can maintain operations. The mirror can also be used to rebuild its repaired or replaced counterpart at the other location.
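A minimal sketch of synchronous mirroring follows, assuming two directories stand in for the primary and mirror arrays (the paths and function names are placeholders, not a vendor API). It shows the essential behavior: every write and delete is applied to both copies, and either copy can rebuild the other:

```python
# Sketch of synchronous mirroring: every write and delete goes to both a
# primary and a mirror location, so either copy can rebuild the other.
import os
import shutil

PRIMARY, MIRROR = "/data/primary", "/data/mirror"   # placeholder paths

def mirrored_write(relpath: str, payload: bytes) -> None:
    for root in (PRIMARY, MIRROR):
        target = os.path.join(root, relpath)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        with open(target, "wb") as f:
            f.write(payload)

def mirrored_delete(relpath: str) -> None:
    for root in (PRIMARY, MIRROR):
        target = os.path.join(root, relpath)
        if os.path.exists(target):
            os.remove(target)

def rebuild_primary_from_mirror() -> None:
    # After the failed primary hardware is replaced, copy the mirror back.
    shutil.copytree(MIRROR, PRIMARY, dirs_exist_ok=True)
```

Note that the same property cuts both ways: because mirrored_delete removes both copies, a bad deletion or corruption is faithfully propagated to the mirror, which is the risk discussed below.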
Mirroring can be performed over considerable distances, as in the transcontinental example above, or between drives only inches apart. A hard drive may be mirrored to another drive in the same computer, for example, or complete SANs can be mirrored over long distances using fiber or IP. With the emergence of 2 Gbit/sec links, and with 10 Gigabit Ethernet on the horizon, mirroring data over secure Ethernet connections is attracting increasing interest.
Mirroring, however, also carries risk. Because what occurs on one drive is copied to the mirror, a virus or other malicious code that damages data on one drive will damage the mirror as well.
"Mirroring takes care of physical redundancy," says Alan Welsh, CTO and CEO of Columbia Data Systems. Welsh notes that his system had been "zapped," destroying the data on a drive. "If we had been mirrored, we would have been in big trouble, because the mirrored drive would also have been zapped," he comments.
Although mirroring is an excellent approach for recovery from certain disasters, such as a failed hard drive, it may not be the most effective disaster recovery approach for other disaster types.
Snapshot technologies

Snapshot technologies are also gaining popularity. Although several approaches to snapshots have been proposed, the one developed by Columbia Data Systems may be the most successful.
This technology has been included in systems offered by IBM, and has been integrated, under license, by Microsoft in its Server Appliance Kit. The technology is built into Maxtor's MaxAttach 4300, which is based on Windows 2000 with the Server Appliance Kit extensions.
Snapshot technology takes a series of images of the data on system drives and stores them in a protected area on a hard drive. At regular intervals, an image of the changes made to the target drive(s) since the previous snapshot is recorded in the snapshot cache file.
Disasters such as virus attacks, hacker attacks or user error can be remedied by reverting to a snapshot. If necessary, the system can be restarted and instructed to rebuild the drive(s) from a snapshot image, restoring the complete image.
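The sketch below is a simplified model of that behavior, not the licensed implementation described above: each snapshot records only the blocks written since the previous snapshot, and a revert rebuilds the volume from the base image plus the snapshot chain. The class and method names are illustrative:

```python
# Simplified snapshot sketch: snapshots store per-interval deltas, and a
# revert replays the chain up to the chosen point in time.
class Volume:
    def __init__(self, blocks):
        self._base = dict(blocks)   # image of the volume when protection began
        self.blocks = dict(blocks)  # live data: block number -> contents
        self.snapshots = []         # one delta dict per snapshot
        self._dirty = {}            # blocks written since the last snapshot

    def write(self, block, data):
        self.blocks[block] = data
        self._dirty[block] = data

    def take_snapshot(self):
        """Record the changes since the last snapshot; return its index."""
        self.snapshots.append(dict(self._dirty))
        self._dirty = {}
        return len(self.snapshots) - 1

    def revert(self, snap_index):
        """Roll the live volume back to the state captured at snap_index."""
        state = dict(self._base)
        for delta in self.snapshots[: snap_index + 1]:
            state.update(delta)
        self.blocks = state
        self._dirty = {}

vol = Volume({0: b"good", 1: b"good"})
clean = vol.take_snapshot()          # snapshot taken before the incident
vol.write(1, b"corrupted by virus")  # disaster strikes
vol.revert(clean)                    # roll back to the pre-incident image
assert vol.blocks[1] == b"good"
```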
In principle, snapshots can be used to protect terabytes of data. In current implementations, such as the one on the Maxtor 4300, considerably less data is protected. For multi-terabyte installations, an appropriate strategy, one that protects the most critical system data via snapshots or that puts devices equipped with snapshot capabilities into the most important workgroups, should be carefully considered. Backup to mirror drives or tape should be considered for data that does not require immediate recovery in the event of a disaster.
As effective as snapshots may eventually be, however, this approach does not solve the "bullet-in-a-server" type of disaster. If the snapshot image resides on the server, and the server or drive fails, recovery from a snapshot isn't possible, because the device that holds the snapshot isn't available. Welsh favors the idea of mirroring an entire data volume and system volume over a network. That way, in the event of a disaster, the failed drive can be replaced or restored, and the snapshot on the mirror can be used to restore the replacement drive.
While virtualization may provide considerable benefits for data storage managers, it also poses considerable challenges for those who must protect and restore virtualized storage in the event of a disaster. According to Marc Farley, president of Building Storage Inc. and author of "Building Storage Networks," "If I have a machine that is virtualizing, the host file system only sees the virtual address space. Depending on how you put it together, it is very difficult to get the same kind of configuration to get the storage that you had following a disaster."
"You need to make sure that you save your configuration information for your virtual space, because having virtual space makes it harder to back up and recover everything," Farley says. "If you are trying to do the right thing as an IT person, you need to understand how you will piece all of this virtualized data back together, and what your comfort zone is. It is worthwhile, before you install or rebuild a virtual solution, to try to figure out how you will do recovery, and where the information is, in order to keep it whole."