Fooling Around with Storage
The Motley Fool, a popular online financial forum, wisely looked to Storage Area Networks to keep up with soaring user demand.
networkmagazine.com
by Katrina Glerum
Storage Area Networks (SANs) have been getting a lot of attention lately, and rightly so: They meet the five goals of storage, which are to increase available disk space, serve multiple consumers of disk space, reduce access time to data on the disks, ensure the reliability or coherency of data on the disks, and assure the security of data on the disks.
----------------------------------------
Business Profile: The Motley Fool Financial Forum
Headquarters: Alexandria, Virginia
Web address: www.fool.com
Industry: Online financial forum
Chief techie geek/project visionary: Dwight Gibbs
Number of visitors per month (mid-1999): More than two million
Technology in focus: Storage Area Networks
Business challenge: By mid-1998, Motley Fool's pool of Windows NT/Internet Information Server (IIS) Web servers couldn't be scaled quickly or cheaply enough to keep up with site and audience growth. The NT boxes were overloaded, making high availability and 24-by-7 uptime elusive. Looking at the current environment and projecting for expected growth, Dwight Gibbs realized that the network needed to be rearchitected to achieve scalability, reliability, and availability.
Solution: The solution was divided into two projects. First, Gibbs invested in a network storage device, the Network Appliance (www.netapp.com) F720 filer. By storing all static Web files on a single, high-speed, optimized device, Gibbs dramatically reduced I/O requests on all Windows NT Web servers, which helped stabilize and speed up the entire network. Second, Gibbs increased reliability and enabled failover on the Web servers hitting the company's SQL databases by implementing Microsoft Cluster Server (MSCS), which is dual-node NT clustering on a Fibre Channel array.
The entire project took about seven months, mostly because other events took priority at one time or another. However, the results were highly successful. Not only is the network now far more reliable, scalable, and robust, but the new architecture is already creating savings in projected costs. Furthermore, with experience in SAN technology, the company is better positioned for future growth.
----------------------------------------
Considering those goals, the advent of cheap, plentiful drive space hasn't eliminated storage problems. In fact, some would argue that the problems have gotten worse because cheap storage encourages enterprises to fill disk space without addressing speed, access, or reliability issues. Hence the enthusiasm about SANs, which address these issues more effectively than traditional network file systems by taking advantage of advances in clustering, hierarchical storage management, and high-bandwidth Fibre Channel connectivity.
A number of competitors are aggressively and creatively going after this market, but how does the technology work in the real world? The Motley Fool Financial Forum found out when the company invested in SANs using Network Appliance (www.netapp.com) file servers (or filers) and server clustering from Compaq Computer.
WHAT KIND OF FOOL AM I? Dwight Gibbs, the Motley Fool's “Chief Techie Geek,” took a serious look at SANs in mid-1998. At the rate the Fool's Web site and audience were growing, storage issues demanded serious attention. This wasn't a startling revelation, mind you, but rapid growth, limited resources, and untested technology always made storage an easy decision to postpone. And although Gibbs had always agreed with the theory behind SANs, “the hype surrounding them made us pretty skeptical—as we usually are.”
In this, Gibbs shows his true Fool credentials. He is technologist number one, systems architect, original programmer, and technical visionary for a company that has earned its place as the “most-consulted financial forum in the online world” through its candid, humorous, and skeptical financial advice.
The Motley Fool was founded in 1993 by brothers David and Tom Gardner on America Online. The company name derives from Elizabethan drama, where only the court jester, or fool, could tell the king the truth without getting his head lopped off.
Gibbs' career with the Fool is a frenetic Internet startup success story. When he began to help build the Motley Fool's own Web site in September 1994, at the tender age of 28, Gibbs was also taking classes toward an MBA (rounding out BAs in finance and MIS and a master's degree in MIS). He was working 10 hours a week in the university computer lab and was newly married. Until the Fool could hire additional help, Gibbs was doing “pretty much anything and everything,” including database design, user support, LAN/WAN installation, applications programming, and so on. Today he leads a department of 42 and gets a lot more sleep.
A FOOL'S CHALLENGE The Motley Fool launched to about 60 visitors in 1994. By 1998, it was closing on one million visitors per month, and now it counts its monthly audience in the multimillions. At first, Gibbs, running a Windows NT/Internet Information Server (IIS) Web server environment, could add capacity to the site by simply throwing another NT workstation onto the network. He likes NT because he and his team do a lot of Active Server Pages (ASPs), which they find easy to develop in.
However, ASP is CPU-intensive, and since the IP stack isn't highly optimized on NT, a single page call with a dozen IP graphics requests hoards CPU cycles needed by applications. “Don't waste cycles serving files,” Gibbs says, “because anything can serve files.” At $30,000 a pop for an NT server, any fool can see that bogging down frontline hardware with I/O requests isn't cost-effective.
Besides, Gibbs knew that simply adding servers would create scalability problems because he would just end up with more machines that needed to be published to. Even with 500 servers, he would still be copying one file to every server.
So, he asked himself, why not copy all the files to one box and let everything grab them as need be? By putting all graphics files on a single file server, for instance, Gibbs realized he could free up the Web servers running ASPs to serve one file per page request. That wouldn't solve the high-availability problem, however. Even if Gibbs could improve network reliability by lightening the load on the Web servers, this strategy wouldn't help if a server crashed or needed maintenance.
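The arithmetic behind Gibbs' reasoning can be sketched in a few lines. This is an illustration only: the function names are hypothetical, and the counts simply restate the article's own examples (one file pushed to 500 servers, and a page with a dozen graphics).

```python
def copies_per_publish(files_changed: int, web_servers: int, shared_filer: bool) -> int:
    """Number of file copies a single publishing run has to make."""
    if shared_filer:
        # Every Web server reads from the one filer, so each file is copied once.
        return files_changed
    # Otherwise each changed file has to be pushed to every Web server.
    return files_changed * web_servers


def web_server_requests_per_page(graphics_on_page: int, graphics_offloaded: bool) -> int:
    """Requests a Web server handles for one page view."""
    # One request is always needed for the ASP page itself.
    if graphics_offloaded:
        return 1
    return 1 + graphics_on_page


if __name__ == "__main__":
    print(copies_per_publish(1, 500, shared_filer=False))               # 500
    print(copies_per_publish(1, 500, shared_filer=True))                # 1
    print(web_server_requests_per_page(12, graphics_offloaded=False))   # 13
    print(web_server_requests_per_page(12, graphics_offloaded=True))    # 1
```

The publishing cost stops growing with the number of Web servers, and each ASP server is left with one request per page view.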
Gibbs decided on two projects to solve his reliability problems. The first, which he handed off to his team of “Web Server Dudes,” was to lower the I/O load on his Web servers. The second project, which he assigned to his database administrators, was to cluster the NT servers hitting the company's SQL databases.
SEEKING A FILER For the first project, Gibbs liked Network Appliance's F720 filer the best. “I guess it was the simplicity that appealed to us,” he recalls. Network Appliance's multiprotocol servers are basically an array of disks on a Fibre Channel, running the proprietary Data ONTAP operating system, which is “essentially BSD stripped down to its file-serving guts.” There are three primary elements in the Data ONTAP microkernel: a real-time mechanism for process execution, the Write Anywhere File Layout (WAFL) file system, and the RAID Manager.
Network Appliance promises multiprotocol data access, high availability, ease of management, and, above all, performance. The company steers clear of specific claims because its machines are often at the mercy of the particular implementation, but Fibre Channel itself routinely offers transfer rates of 100Mbytes/sec, compared to 40Mbytes/sec for SCSI.
Gibbs took two big risks by choosing a Network Appliance box. One, the F720's proprietary OS made his environment more heterogeneous, which meant more training and administrative hassles, especially when he could have just set up a Unix or NT box for dedicated file serving.
Two, the F720 was expensive. The Fool declined to give specific figures, but Network Appliance's top-end box listed at about $90,000 in 1998—and that didn't include the $15,000 the Fool spent on hubs and controllers for the SAN.
Gibbs admits that his choice made some people nervous. But after considering other storage devices and other SAN solutions, Gibbs picked the F720 box because it was fast and offered “exactly what we needed and no more.” He also liked WAFL, which he knew “should scale like crazy.”
FOOLS RUSH IN Gibbs mitigated his risk by learning as much as he could about Network Appliance, including performing a financial analysis to make sure the company would be around in the near future. After that he brought in the F720 for a 30-day evaluation.
Initially, the two guys keeping the Motley Fool's hardware running weren't too happy to be asked to install and test this strange machine. According to Gibbs, despite Network Appliance's claims, “It wasn't just put it down, turn it on, and it works. There was a learning curve. It was long, it was steep, but hammer on the box intensively for about two weeks, and you have a pretty good handle on it.” In addition, Network Appliance responded very quickly to questions, especially during evaluation.
Because of a heavy workload, Gibbs doubled the evaluation period to 60 days, but his guys still couldn't get maximum performance from the machine. They tried different architectures, but as he says, “When we benchmarked it against our NT boxes, NT was kicking NetApp's butt all over. It was embarrassing.”
In frustration Gibbs' team called Network Appliance and said, “Hey, isn't it strange that your box costs twice as much and works half as well?” When this tactic didn't produce any answers, the team asked a more constructive question: “Any secrets to tuning this thing?” They were rewarded with an e-mail full of settings to try.
The recommended configuration changes gave the Network Appliance box double the performance of NT, and the Motley Fool team was sold. In fact, Gibbs now has one box for the Web site, another one for corporate headquarters, and a third one on the way for a large SQL database that will be used on the Web site.
Ultimately, Gibbs and his team built a straightforward SAN. A 100Mbit/sec connection leads into a Cisco Systems switch, and the Network Appliance filer hangs off this switch in front of a BIG/ip load balancer, which balances the load over a network of Compaq NT Web servers (see figure).
At first, Gibbs and his team considered putting the filer behind the load balancer, with the servers. But considering the number of IP requests that could be shunted off to the F720, they realized they could save the balancer significant work by placing the filer in front of it. If traffic grows to the point that they need multiple filers requiring load balancing, Gibbs figures they can use a round-robin approach.
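For readers unfamiliar with the term, round-robin simply means handing each new request to the next filer in a fixed rotation. A minimal sketch follows; the filer names are made up, and the article does not say how the Fool would actually implement it (round-robin DNS would be one common option).

```python
from itertools import cycle


def round_robin(filers):
    """Return an endless iterator that visits the filers in rotating order."""
    return cycle(filers)


# Hypothetical filer names, for illustration only.
picker = round_robin(["filer-a", "filer-b", "filer-c"])

# Each incoming request takes the next filer in the rotation.
first_five = [next(picker) for _ in range(5)]

if __name__ == "__main__":
    print(first_five)
```

Round-robin ignores how busy each filer is, which is why it suits a pool of identical boxes serving the same static files.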
CLUSTERING TOGETHER The clustering problem posed different challenges. Gibbs and his team decided to work with Compaq to explore installing Microsoft Cluster Server (MSCS) on a pair of servers accessing “cluster-aware” SQL 6.5 databases. MSCS supports failover between two NT nodes, each of which can have independent workloads. However, it isn't considered true clustering, such as that offered by Unix. With a maximum of two nodes, scalability is severely limited, for example.
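The failover behavior described here can be modeled as a toy: two nodes, each with its own workload, and a survivor that inherits everything when its partner goes down. This is not the MSCS API; the class, node, and workload names are all hypothetical.

```python
class TwoNodeCluster:
    """Toy model of two-node failover; it does not talk to MSCS."""

    def __init__(self, workloads_by_node: dict[str, list[str]]):
        if len(workloads_by_node) != 2:
            raise ValueError("this model supports exactly two nodes")
        self.workloads = {node: list(w) for node, w in workloads_by_node.items()}
        self.up = {node: True for node in workloads_by_node}

    def fail(self, node: str) -> None:
        """Mark a node as down and move its workloads to the other node."""
        survivor = next(n for n in self.workloads if n != node)
        if not self.up[survivor]:
            raise RuntimeError("both nodes are down: nowhere to fail over to")
        self.up[node] = False
        self.workloads[survivor].extend(self.workloads[node])
        self.workloads[node] = []


if __name__ == "__main__":
    cluster = TwoNodeCluster({"node1": ["sql-db-1"], "node2": ["sql-db-2"]})
    cluster.fail("node1")
    print(cluster.workloads)
```

The model also shows the two-node limit: once the survivor is carrying both workloads, a second failure has nowhere to go.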
MSCS didn't work straight out of the box. All told, the Fibre Channel array, hub, and controllers only cost about $15,000. The real investment, Gibbs says, was in time. The Motley Fool worked through an established relationship with Compaq, however, and one of Compaq's MSCS gurus in Houston provided assistance and advice. Still, it took at least two solid weeks to get the system working.
Once implemented, the Motley Fool had two Compaq servers, each connected to a SQL database, in an active-active MSCS cluster connected via a Fibre Channel array. The technical team put a lot of cache on the controllers, and it's all RAID 5, so it's pretty fast and pretty reliable. Gibbs put it into production in September 1998, and since then it has been up to the task, enough so that Gibbs put in another identical cluster in the summer of 1999.
IN PRODUCTION Gibbs beat his original cost projections (under $100,000), but he didn't make the two-to-three-month timeline he'd been going for. Nevertheless, the new SAN works as advertised. Gibbs' team has moved all the company's static content over to the Network Appliance box, reducing the load on production NT boxes so much that “they don't crash as often.” The Motley Fool has also saved tens of thousands of dollars that it would have had to invest in bandwidth, disk space, and CPU resources—all just to get the performance the company's getting today. Plus, the Motley Fool is getting performance without incurring higher costs of management or site reorganization. Gibbs is also confident about the future, because if one box can no longer handle the entire site, the Network Appliance boxes cluster pretty well.
This is not to say that there haven't been a few hiccups. For one, the Data ONTAP OS is case-sensitive, which interfered with some of the pathing because the site was developed in non-case-sensitive NT. Gibbs' simple solution was to force all file names to lower case.
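The article doesn't say how the Fool did the renaming. One way to force a content tree to lower case is sketched below; the function name is hypothetical, and links inside pages would need the same treatment, which this sketch does not attempt.

```python
import os
from pathlib import Path


def lowercase_tree(root: str) -> list[tuple[str, str]]:
    """Rename every file and directory under root to lower case.

    Returns the (old, new) path pairs that were renamed. The root
    directory itself is left alone.
    """
    renamed = []
    # Walk bottom-up so that children are renamed before their parent directory.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            lower = name.lower()
            if lower == name:
                continue
            src = Path(dirpath) / name
            dst = Path(dirpath) / lower
            # Refuse to overwrite a different file that already has the lower-case name.
            if dst.exists() and not dst.samefile(src):
                raise FileExistsError(f"{dst} already exists; resolve by hand")
            src.rename(dst)
            renamed.append((str(src), str(dst)))
    return renamed
```

The collision check matters on a case-sensitive system, where `Logo.gif` and `logo.gif` can sit side by side and a blind rename would silently replace one of them.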
A more interesting problem was that the box would die after logging 2Gbytes of HTTP requests. Aside from the obvious performance problem this caused, the logging was unnecessary: Gibbs already had tools for logging HTTP requests. However, he couldn't turn off the logging function until Network Appliance wrote a special release for the kernel.
Richard Soffell, Web Server Dude for the Motley Fool, ran into another issue that called for a kernel fix: Although the Network Appliance filer could handle 512 simultaneous HTTP connections, traffic spikes caused by major market moves would overwhelm the box. To free up connections, he decided to reduce the HTTP connection timeout. The disadvantage to this is a higher overhead for re-establishing lost connections, but Soffell says the OS is so fast that the effect appears to be negligible.
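Why a shorter timeout frees up connections follows from Little's law: the average number of connections in use equals the arrival rate of new connections times the average time each one is held. The sketch below turns that around for a fixed pool of 512 slots. The hold times are invented for illustration; the article gives no timeout values.

```python
def max_new_connections_per_sec(slots: int, hold_seconds: float) -> float:
    """Largest steady arrival rate a fixed pool of connection slots can absorb.

    From Little's law (in_use = rate * hold_time), rearranged for the rate.
    """
    if hold_seconds <= 0:
        raise ValueError("hold_seconds must be positive")
    return slots / hold_seconds


if __name__ == "__main__":
    # Hypothetical hold times: transfer time plus idle time before the timeout fires.
    print(max_new_connections_per_sec(512, 16.0))  # 32.0
    print(max_new_connections_per_sec(512, 4.0))   # 128.0
```

Cutting the time a connection sits idle raises the ceiling in direct proportion, at the cost Soffell mentions of re-establishing connections more often.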
With these changes in place, the box is now humming along with minimal hassle and costs, short of running standard diagnostics and making lease payments. Gibbs is pleased with Network Appliance's products, technology, and service. Between the Network Appliance box and the Fibre Channel array, Gibbs is confident that his environment is far more flexible and robust than it was a year ago. More to the point, with traffic continuing to soar every month, Gibbs is glad to have a scalable, well-conceived architecture in place.
LESSONS FOR THE WISE When asked to reflect on what he's learned from this project, Gibbs had a few words of advice.
“It will always take longer than you think. Tying shoelaces on the Web takes longer than you think.”
“No box's default configuration is even close to what you want it to be.”
“It's always a good idea to decrease load on your main Web servers.”
“You have to spend money to save money.”
Furthermore, he says SANs make a lot of sense, and Fibre Channel really works. He's currently working with Compaq on a setup in which multiple clusters read off one Fibre Channel array, using Fibre Channel hubs. Gibbs is still waiting for real n-node clustering to come for NT, but this next-generation technology from Compaq looks promising. Meanwhile he's keeping an eye on some interesting work IBM is doing as part of its ongoing effort to bring mainframe technology to the NT server world.
As a last piece of advice, Gibbs says, “Initially you have to forget about the technology and ask yourself, ‘Does the concept make sense?’ Then worry about the technology.” For example, when the Fool went looking for a SQL data store, Gibbs says, “Here's 80Gbytes that's real easy to expand and scale, and it's fast as hell. Well, that's pretty compelling.”
By methodically investigating and testing various pieces of SAN technology for the past year, Gibbs and his team now have a handle on a technology that lets them keep up with soaring demand proactively, rather than reactively. And he's done it while saving the Fool money. In today's wild Web arena, that's pretty compelling.
In addition to technology journalism, Katrina Glerum specializes in multilingual Web site production. She can be reached at kat@midwinter.com.