Differentiating Data Warehouse and Data Lake – from Deloitte Consulting
The Data Warehouse
Data warehouses have long been part of many BI platforms. Typically, they allow general business users without deep data expertise to perform ad hoc analyses and reporting on pre-structured data. Select sets of data go through the extract, transform, and load (ETL) process to apply business rules before they are stored, so users don’t have to impose structure on the data themselves when they want to analyze it; it is already in the right form. Data is then made available 24/7, along with analytics tools such as dashboards and predesigned reports for specific uses, all accessed via a user-friendly front end.
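As an illustrative sketch of that ETL pattern (not drawn from the source), the following Python example extracts raw order records, applies a couple of business rules during transformation, and loads the pre-structured result into a relational table. The file name, column names, and the use of SQLite as a stand-in warehouse engine are all assumptions made for the sake of a runnable example.

```python
import sqlite3
import pandas as pd

# --- Extract: pull a select set of raw records (file name is hypothetical) ---
orders = pd.read_csv("raw_orders.csv")  # assumed columns: order_id, amount, region, order_date

# --- Transform: apply business rules up front, so the data lands pre-structured ---
orders["order_date"] = pd.to_datetime(orders["order_date"])  # enforce a consistent date type
orders = orders[orders["amount"] > 0]                        # drop invalid rows per a business rule
orders["region"] = orders["region"].str.upper()              # normalize a dimension value

# --- Load: write the cleaned, structured data into the warehouse table ---
# SQLite stands in for a real warehouse engine in this sketch.
conn = sqlite3.connect("warehouse.db")
orders.to_sql("fact_orders", conn, if_exists="append", index=False)
conn.close()
```

Because structure and rules are imposed here, before storage, a business user querying fact_orders never has to repeat this work.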
Data accuracy, timeliness, and consistency are all essential in data warehouses, which tend to rely on traditional hardware and software, including columnar storage and relational databases. Typically, they are associated with service-level agreements between IT and the business.
The Data Lake
Data lakes, on the other hand, are oriented more toward big data. Data is stored in much higher volumes, but in more primitive form; users are often data scientists and other specialists performing exploratory analyses. Because data is not first transformed, high-volume storage is relatively inexpensive. Expert users can then curate the subsets they need for ever-evolving purposes, imposing structure as they go and using temporary sandbox areas to build and train models for any given task. Once they have a model that works, they can operationalize it and move on.
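The “impose structure as you go” workflow might look something like the following sketch, in which raw, semi-structured events sit in the lake as JSON lines and an analyst projects only the fields a given model needs into a temporary sandbox DataFrame at read time. The file path and field names are hypothetical.

```python
import json
import pandas as pd

# Raw clickstream events stored as JSON lines in the lake (path is hypothetical)
records = []
with open("lake/events/clicks.jsonl") as f:
    for line in f:
        event = json.loads(line)
        # Impose structure at read time: keep only the fields this analysis
        # needs, tolerating records that happen to be missing them
        records.append({
            "user_id": event.get("user_id"),
            "page": event.get("page"),
            "ts": event.get("timestamp"),
        })

# The curated subset becomes a temporary "sandbox" table for building a model
sandbox = pd.DataFrame(records)
sandbox["ts"] = pd.to_datetime(sandbox["ts"], errors="coerce")
print(sandbox.head())
```

Nothing about the raw files changes; a different task tomorrow can impose a different structure on the same data.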
Data lakes tend to use massively parallel processing and data partitioning; compute and storage are usually kept separate, so they can be scaled independently, and users can pay just for storage when they’re not doing any analysis. By making a large repository of corporate data available in a central location, these platforms can be particularly valuable for expert users, such as those developing advanced analytical models and machine learning algorithms. Users can search the data lake, use data-wrangling tools to build data sets, and spin up compute resources on demand to build analytical models without worrying about expensive and idle hardware.
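To make the separation of storage and on-demand compute concrete, here is a hedged PySpark sketch: a session is spun up only for the duration of the analysis, and a filter on a partition column lets the engine read just the relevant slice of a partitioned dataset. The bucket path, partition column, and metric are illustrative assumptions, and reading from object storage would additionally require the appropriate connector configuration.

```python
from pyspark.sql import SparkSession, functions as F

# Spin up compute on demand; the data itself lives separately in object storage
spark = SparkSession.builder.appName("lake-exploration").getOrCreate()

# Assuming the dataset is partitioned by event_date, this filter prunes
# partitions so only the matching files are actually read
events = spark.read.parquet("s3://corp-data-lake/events/")
recent = events.filter(F.col("event_date") >= "2023-01-01")

daily_counts = recent.groupBy("event_date").agg(F.count("*").alias("events"))
daily_counts.show()

spark.stop()  # release the compute; only the cheap storage keeps accruing cost
```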
This kind of storage often makes the most sense for data for which specific use cases have not yet been defined. A searchable data catalog and appropriate data retention policies are essential, as are strong governance processes to help ensure data quality and prevent the lake from becoming a “data swamp.”
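As a rough illustration of what a catalog entry and a retention rule might capture, here is a minimal Python sketch; real data catalogs are far richer, and every field name below is an assumption.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class CatalogEntry:
    """A minimal, illustrative record for one dataset in the lake's catalog."""
    name: str
    path: str
    owner: str
    description: str
    ingested_on: date
    retention_days: int               # governance: how long the raw data is kept
    tags: list = field(default_factory=list)

    def is_expired(self, today: date) -> bool:
        # The kind of simple retention check a governance job might run
        return (today - self.ingested_on).days > self.retention_days

entry = CatalogEntry(
    name="clickstream_raw",
    path="lake/events/",
    owner="web-analytics",
    description="Raw site click events, JSON lines, one file per hour",
    ingested_on=date(2023, 1, 1),
    retention_days=730,
    tags=["raw", "reviewed"],
)
print(entry.is_expired(date.today()))
```

Searchable metadata like this is what keeps a lake navigable rather than letting it silt up into a swamp.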
A Hybrid Option
Both data lakes and data warehouses have a place in the enterprise world. Some organizations use one or the other, but there is also a combined approach in which companies can ingest incoming data into a data lake and then move select subsets into a data warehouse as specific use cases are defined. When data ages past a certain point or falls into disuse, dynamic tiering functionality can automatically move it back to the data lake for cheaper long-term storage. By keeping “hot” data in a warehouse and “cold” data in a lake, companies can make the most of the strengths of each storage option.
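A hand-rolled version of that dynamic tiering idea might look like the sketch below, reusing the hypothetical fact_orders table from the earlier ETL example: rows older than a cutoff are exported from the warehouse to cheap columnar files in the lake and then removed from the “hot” table. The table name, the cutoff, and the SQLite stand-in are assumptions; commercial platforms automate this movement.

```python
import os
import sqlite3
import pandas as pd

CUTOFF = "2022-01-01"  # data older than this is treated as "cold" (illustrative)

conn = sqlite3.connect("warehouse.db")

# Find the cold rows in the warehouse's hot tier
cold = pd.read_sql_query(
    "SELECT * FROM fact_orders WHERE order_date < ?", conn, params=(CUTOFF,)
)

if not cold.empty:
    # Park them in the lake as cheap columnar storage (path is hypothetical)
    os.makedirs("lake/archive", exist_ok=True)
    cold.to_parquet("lake/archive/fact_orders_pre2022.parquet", index=False)
    # Drop them from the warehouse so the hot tier stays small and fast
    conn.execute("DELETE FROM fact_orders WHERE order_date < ?", (CUTOFF,))
    conn.commit()

conn.close()
```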