The following is a 9/30/96 InformationWeek article:
Databases -- Towering Terabytes -- Fast-growing data warehouses are reaching terabyte size and creating a new elite among corporations. The payoff can be huge -- and immediate.
By John Foley
Corporations have discovered a simple formula for turning their data into a strategic advantage: More is better. Organizations are stockpiling ever-larger amounts of data on products, customers, and transactions in an effort to understand what sells, reach new prospects, and make better business decisions. As a result, data warehouses and operational databases are swelling by hundreds of Gbytes; some are expected to balloon 10 times in size over the next 12 months.
The phenomenon is creating a new elite in the user community: companies with databases holding trillions of bytes -- or terabytes -- of data. It's also forcing technology managers to hastily devise methods of dealing with the runaway growth.
Building and maintaining terabyte-size databases costs millions of dollars, but the payback can be huge -- and immediate. Lucent Technologies in Murray Hill, N.J., recouped $10 million the first day a terabyte data warehouse went into operation for its Network Systems business. "We found a $10 million chunk of product we had shipped and not billed," says Doug Lewis, CIO of Lucent, which is being spun off from AT&T this year as part of the telecom carrier's so-called trivestiture. "It was such a quick hit it made us feel we were on the right track." The warehouse also has a role to play in Lucent's independent future. "We're using it to manage our financials as a new company," says Lewis.
AT&T competitor MCI is using a year-old data warehouse -- at three terabytes, one of the world's biggest -- to cut the expense of finding new customers. "In 1994, it cost 65 to 70 cents for every lead we generated," says Lance Boxer, CIO at MCI in Washington. "Now we're paying about 4.5 to 6 cents per lead, and the cost is going down." A related benefit is that customers are sticking longer with MCI. The data warehouse, Boxer says, "is probably one of the most important projects in the company, and it's being expanded."
A key to MCI's success is the sheer volume of data pouring into its database, which runs on Informix Software's Extended Parallel Server and IBM's parallel SP hardware. MCI matches up billing and call-detail information culled from its telecommunications network with demographic data provided by third-party suppliers. Data is put into 21 models with 10,000 variables that define customer behavior. The warehouse is growing at 100 Gbytes per month. "I have a feeling it's not going to slow down," says Boxer.
MasterCard International in Purchase, N.Y., is using a rapidly growing warehouse of credit-card transaction data to better serve its 22,000 member banks. The financial giant's MasterCard OnLine database, deployed last year, can be used by the banks to analyze customer buying patterns -- and hence to launch innovative products and services. The database sprouted to 1.5 terabytes in its first year. "By this time next year, we're anticipating that it will be in the 9- to 10-terabyte range," says Anne Grim, senior VP of global information services with MasterCard. Grim expects the database, which runs on Oracle7, to hit 30 terabytes by 1998.
Similarly, Wal-Mart Stores Inc.'s data warehouse has helped the retailer increase sales per square foot of store space and keep a higher percentage of fast-moving products in stock. In March, Wal-Mart announced plans to more than double the storage capacity of its NCR Teradata warehouse to 7.5 terabytes.
And CMG Direct Interactive is collecting World Wide Web "click stream" data -- a record of which sites users select -- to offer target marketing online. CMG anticipates that its Informix XPS database will grow from 200 Gbytes to 2 terabytes in a year.
Such mega-databases are expensive. Boxer estimates that MCI has invested $40 million in its data warehouse -- or more than $10 million per terabyte. But practitioners say a large data warehouse should pay for itself. The state of Michigan, for example, is spending $10 million on a multiterabyte warehouse provided by Bull HN Information Systems Inc. that will be used by 20 state departments, starting with the Medical Services Administration. While that's a huge sum, if the database can help Medical Services shave just 1% from its $8 billion Medicaid program through greater efficiency and improved fraud control, savings to the state could total $80 million. "That's eight times the cost of the warehouse," says Gary Swindon, Michigan's director of computing and telecommunications.
COLLECTING AND KEEPING
Data warehouses are reaching the terabyte level because businesses and agencies like Michigan's are not only collecting more data, but they're also keeping it longer to analyze trends. In addition, database administrators are opening up warehouses to more users and creating indexes to support their queries. Databases, in turn, are "growing like pension funds," says Alan Paller, director of research with the Data Warehousing Institute in Bethesda, Md.
The Terabyte Club, a common interest group formed in June by the Data Warehousing Institute for companies that manage huge databases, has been slow to take off, in large part because huge databases remain rare. Paller estimates that only two dozen or so terabyte databases are deployed around the world, though others say the number is several times higher. But all agree the number will get very big, very fast.
One reason huge databases remain rare is that a terabyte is difficult to handle. "At 10 Mbytes per second -- faster than any disk and faster than most disk controllers -- it takes 1.2 days to read a terabyte," says Jim Gray, a senior researcher with Microsoft's Bay Area Research Center in San Francisco. Gray believes that massively parallel processing (MPP) computers are the only way to efficiently handle such large volumes of data. As the cost of computer hardware and storage drops, MPP-based warehouses will become an option for more companies. "That is why Microsoft is interested," says Gray. "It can be a commodity business to manage terabyte databases."
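Gray's figure checks out with back-of-envelope arithmetic, which also shows why parallelism is the escape hatch. A minimal sketch in Python (the stream counts are illustrative assumptions, not Gray's):

```python
# Time to scan one terabyte at the 10-Mbyte/sec rate Gray cites,
# and how spreading the scan across parallel disks shortens it.
TERABYTE = 10**12            # bytes
RATE = 10 * 10**6            # 10 Mbytes/sec per stream

for streams in (1, 10, 100):             # parallel streams (illustrative)
    seconds = TERABYTE / (RATE * streams)
    print(f"{streams:3d} stream(s): {seconds / 86_400:.2f} days "
          f"({seconds / 60:.0f} minutes)")

# Output:
#   1 stream(s): 1.16 days (1667 minutes)   <- Gray's "1.2 days"
#  10 stream(s): 0.12 days (167 minutes)
# 100 stream(s): 0.01 days (17 minutes)     <- the case for MPP
```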
Not all terabyte databases are data warehouses. Winter Corp., a Cambridge, Mass., database consultancy, surveyed 138 companies this year on the size and growth rates of large databases. The survey turned up three terabyte-plus online transaction-processing (OLTP) databases, including United Parcel Service's mammoth 3.2-terabyte package-distribution database running on an IBM mainframe.
The Winter survey also found that nearly 30% of companies with databases larger than 100 Gbytes expected those databases to grow beyond a terabyte within the next year. Several even predicted their databases would exceed 30 terabytes within three years. "Large database sizes in 1997 will probably be at least twice the large database sizes in 1996," says Richard Winter, the consulting company's president.
If database growth is not carefully managed, a large data warehouse can be an expensive flop. The ramifications can include cost overruns, poor performance, disgruntled users, and a strain on an IT department's resources. "There's a nasty little secret of large-scale data warehousing," says Winter. "A high percentage are not very successful."
Also, planning for this kind of growth is tricky. Raw data -- pulled from transaction-processing systems and other sources -- generally accounts for only a quarter to half of the overall size of a warehouse. Indexes, temporary files, backup capacity, mirrored data, and workspace make up the rest. MCI's 3-terabyte warehouse, for example, is really 1.5 terabytes of raw data that is mirrored, or duplicated for redundancy. It's not unusual, experts say, for only 250 Gbytes of data to be contained within a terabyte warehouse, though the data-to-database ratio varies with each company.
As a result, miscalculations are common and costly. Storage devices -- the most expensive component in a terabyte-size data warehouse -- have to be added to accommodate unanticipated volume. "We've seen underestimates on the order of two to two-and-a-half times within the first 12 months of implementation," says Roy Sanford, director of enterprise alliances with storage specialist EMC Corp. in Hopkinton, Mass. Sanford's rule of thumb: Storage requirements are 70% to 250% above the raw data volume.
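Those rules of thumb reduce to simple arithmetic. A minimal sizing sketch, assuming Sanford's 70%-to-250% overhead range and optional mirroring (the function and its defaults are our illustration, not EMC's formula):

```python
# Estimate total warehouse storage from raw data volume.
# overhead: fraction added for indexes, temp files, workspace, and backup,
#           0.7 to 2.5 per Sanford's 70%-to-250% rule of thumb
def warehouse_gbytes(raw_gb: float, overhead: float = 1.5,
                     mirrored: bool = False) -> float:
    total = raw_gb * (1 + overhead)
    return total * 2 if mirrored else total

# 250 Gbytes of raw data at the high end of the overhead range lands
# near the "250 Gbytes inside a terabyte warehouse" case cited above.
print(warehouse_gbytes(250, overhead=2.5))                   # 875.0
# MCI's case: 1.5 terabytes of data, doubled by mirroring.
print(warehouse_gbytes(1_500, overhead=0.0, mirrored=True))  # 3000.0
```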
Barry Rosenberg, senior manager with Deloitte & Touche Consulting Group in Washington and organizer of the Data Warehousing Institute's Terabyte Club, says users pushing toward the terabyte threshold should be prepared for problems. "All the database vendors claim to be able to support data warehouses of this size, but it's not been truly tried and tested," Rosenberg says. "You wind up learning-as does the vendor-things the product can't do."
To help, Informix, Oracle, NCR, and Sybase have all demonstrated multiterabyte support on their respective database-management systems. Microsoft recently began a project to load a terabyte of data on its SQL Server database. IBM is working on terabyte support for its nonmainframe DB2 platform. Such demos have become a proving ground. In February, for example, NCR and EMC teamed to show off an 11-terabyte system in Tokyo.
But IT managers with experience warn that operational environments are more complicated. As data volume approaches a terabyte, it strains the same database-management systems that worked flawlessly in public demos. As databases grow, modeling the database, loading it with data, "cleansing" the data, and creating indexes are more complicated and time-consuming. Performance can bog down as ad hoc queries scour gobs of records. Data Warehousing Institute's Paller says that as data volume goes up, "speed goes down super-linearly."
"We're pushing the limits," adds MasterCard VP Grim. "That has been an ongoing concern." And MCI CIO Boxer admits:"We do have some size issues."
Experienced users can work through the pitfalls -- for example, by tuning the database and adding middleware. "You should expect to remodel your data structures three to six times," says Don Haderle, an IBM fellow and director of data management architecture and technology with IBM Software. "That means take the data out and put it all back in."
SPECIALISTS REQUIRED
In other words, it's a huge job. Lucent has developed specialists in data engineering and performance management to maintain acceptable response times from its data warehouse. "You simply can't explain to a senior VP that this is going to take 10 minutes to compute," says Lewis. "It isn't going to wash."
Some companies resist the move to terabyte databases. GTE Corp.'s telephone operations unit is capping off a new warehouse of customer demographic information at 600 Gbytes. David North, director of data management with the GTE unit, says the demand for a larger warehouse exists, but "it's being throttled by our ability to manage that quantity of data." Of particular concern is the ability to maintain high-quality data. "If you're going to market to individuals," North adds, "the data has to be accurate at the individual customer's level."
GTE's warehouse, running on Informix software and Hewlett-Packard hardware, is the newest and largest of 121 decision-support applications at GTE. Of those, North estimates that the leading 26 applications alone cost $50 million annually to support. The top 10, he adds, have 2.7 terabytes of overlapping data. Cost and data redundancy are pushing GTE toward a common architecture.
Like its peers with larger data warehouses, GTE expects a business boost from its warehouse of demographic data. "With [telecommunications] competition coming, we're getting pinched on the customer end first," says North. "We have to proactively market to those customers."
BREATHING ROOM
Database vendors are working to improve the scalability of their software to give administrators of large databases some breathing room. Oracle's next-generation Oracle8 database management system, being tested now, will support hundreds of terabytes, company officials promise. Informix makes similar claims about its Universal Server, which was scheduled to enter testing in September.
But what happens if data volume grows faster than a database management system's ability to keep up? Some IT managers say a distributed architecture is the only answer. Capital One Financial Corp., a credit-card issuer in Fairview Park, Va., has spread 2 terabytes of data across four Oracle7 data marts. The advantage, says Dave Buch, IT director of data warehousing with Capital One, is that the data marts are more manageable. The disadvantage is that it's harder to build a central repository of corporate information if that becomes important to the business. "It's kind of like trying to go back and build the first floor of a building after you've built the second floor," says Buch.
MasterCard's Grim says a distributed "logical" warehouse may be the only way to handle the tens of terabytes that will eventually be stored in the MasterCard OnLine database. Since the warehouse is used by banks in different parts of the world, it may be possible to segment the database for efficiency. "It would give us some redundancy, some ability to do load balancing, and also accommodate time differences," she says.
Among database suppliers, Sybase recommends the distributed approach over a so-called single-instance database. "We don't believe that bigger is better," says Josh Bersin, group director of data warehouse solutions with Sybase. "As things get bigger, everything gets more expensive, harder to administer, and takes longer. You pay a price for size." Still, at least one Sybase customer, telecommunications carrier KDD of Japan, has built a terabyte warehouse using Sybase's SQL Server 11 database management system.
The critical factor in planning a large-scale data warehouse is to focus the effort on attainable goals, say experts. Winter recommends limiting the initial rollout to a maximum of three applications and building from there. "Data warehousing is all about creating the right infrastructure for the long term," he says. Michigan's Swindon agrees: "Tackle the project in serial fashion so you have some success. Under penalty of death, don't attempt the grand program [all at once], because you will fail."
For those that succeed at managing terabytes, the payback can be handsome. A growing number of corporations have no option. "Survival in the financial industry is dependent upon effective use of information and creating value from the vast amount of information we have," says MasterCard's Grim. "It's central to MasterCard's future."
---
Union Pacific Gets Data On Track
For some companies, building a large data warehouse is the light at the end of the tunnel. But as Union Pacific Railroad Co.'s warehouse grew toward a terabyte in size, "we could see the wall coming rapidly," says Betty Kight, senior manager of systems development with Union Pacific Technologies, the company's IT division in St. Louis.
Specifically, Union Pacific faced three major problems with its fast-growing data warehouse: scalability, performance, and efficiency. The railroad hopes to bypass all three this month when it replaces the data warehouse's underlying NCR 3600 server with a new NCR WorldMark 5100. The upgrade promises to free up valuable storage space -- enough to accommodate data growth and maintain performance as new users log onto the system.
When Union Pacific installed the NCR 3600 in 1992, its Teradata database management system had just 40 Gbytes of data. Four years later, that figure had mushroomed to 440 Gbytes -- double that when the mirrored storage was factored in.
Now, by using RAID 5 storage in conjunction with the WorldMark 5100, Union Pacific will no longer have to mirror its data. The system will come equipped with 1.1 terabytes of storage capacity, giving Union Pacific, with its 440 Gbytes of data, plenty of room to grow. "We have pretty much tripled our capacity," says Kight. "The 5100's high-speed interconnect will help keep performance levels high."
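The capacity gain comes straight from the redundancy arithmetic: mirroring stores every byte twice, while RAID 5 adds only one disk's worth of parity per group. A minimal sketch (the five-disk group size is our assumption; the 440 Gbytes is Union Pacific's figure):

```python
# Disk space needed to protect D gigabytes of data against a single
# disk failure:
#   mirroring (RAID 1):         2 * D
#   RAID 5, N disks per group:  D * N / (N - 1), one disk holding parity
data_gb = 440        # Union Pacific's current data volume
group = 5            # disks per RAID 5 group (assumed)

print(f"Mirrored: {data_gb * 2} Gbytes on disk")                 # 880
print(f"RAID 5:   {data_gb * group / (group - 1):.0f} Gbytes")   # 550
```

Dropping the mirror roughly halves the on-disk footprint for the same data while still tolerating a single disk failure per group.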
The improvements are needed. Union Pacific continually adds users to the warehouse, and these new users inevitably ask for more data-lots more. Also, the system is shared with Union Pacific Railroad's parent company, Union Pacific Corp. in Bethlehem, Pa., and sister company Overnite Transportation Co. in Richmond, Va. In 1992, the database supported 20 users. Today, that number is rapidly approaching 2,000 users-and growing by 100 users a month.
In fact, Kight expects Union Pacific to push the 1.1-terabyte limit on its new warehouse within 18 months. "People ask me about implementation tips," she says. "I tell them, 'Plan for unprecedented growth.'"
---
A Terabyte Holds...
- A 100-byte record for every person on earth and an index on those records, or
- A JPEG-compressed pixel for every square meter of land -- enough to create a fine-grain photo of Earth, or
- 1 billion business letters, taking up 150 miles of bookshelf space, or
- 10 million MPEG images, enough to run a video continuously for 10 days and nights.
Data: Microsoft
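Two of these equivalences are easy to sanity-check. A rough sketch, assuming a 1996 world population of about 5.8 billion and an average business letter near 1 Kbyte (both assumptions ours, not Microsoft's):

```python
TERABYTE = 10**12

# 100-byte record per person: the records alone come to ~0.58 TB;
# the index on them accounts for much of the remaining space.
print(5.8e9 * 100 / TERABYTE)   # 0.58

# 1 billion letters at ~1 Kbyte each fill a terabyte almost exactly.
print(1e9 * 1_000 / TERABYTE)   # 1.0
```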
Copyright © 1996 CMP Media Inc.