Technology Stocks : Investing in Exponential Growth

From: Paul H. Christiansen, 9/5/2018 2:33:36 PM
Differentiating Data Warehouse and Data Lake – from Deloitte Consulting

The Data Warehouse

Data warehouses have long been part of many BI platforms. Typically, they allow general business users without deep data expertise to perform ad hoc analyses and reporting on pre-structured data. Select sets of data go through the extract, transform, and load (ETL) process to apply business rules before they are stored, so users don’t have to impose structure on the data themselves when they want to analyze it—it’s already in the right form. Data is then made available 24/7, along with analytics tools such as dashboards and predesigned reports for specific uses, all accessed via a user-friendly front end.
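As a minimal sketch of the ETL idea described above: business rules (type casts, canonical codes) are applied once, up front, so the stored table is already analysis-ready. The field names, raw records, and rules here are hypothetical, not from any particular warehouse.

```python
from datetime import date

# Hypothetical raw records as they might arrive from a source system (extract).
raw_rows = [
    {"order_id": "1001", "amount": "250.00", "order_date": "2018-09-01", "region": "na"},
    {"order_id": "1002", "amount": "75.50",  "order_date": "2018-09-02", "region": "emea"},
]

def transform(row):
    """Apply business rules before storage so users query analysis-ready data."""
    return {
        "order_id": int(row["order_id"]),                     # numeric key
        "amount": float(row["amount"]),                       # cast to numeric
        "order_date": date.fromisoformat(row["order_date"]),  # typed date
        "region": row["region"].upper(),                      # canonical region code
    }

# Load: the warehouse table now holds structured, consistent rows.
warehouse_table = [transform(r) for r in raw_rows]
```

Because the structure is imposed here, a downstream dashboard can sum `amount` or group by `region` without any per-user cleanup.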

Data accuracy, timeliness, and consistency are all essential in data warehouses, which tend to rely on traditional hardware and software including columnar storage and relational databases. Typically, they are associated with service-level agreements between IT and the business.

The Data Lake

Data lakes, on the other hand, are oriented more toward big data. Data is stored in much higher volumes, but in more primitive form; users are often data scientists and other specialists performing exploratory analyses. Because data is not first transformed, high-volume storage is relatively inexpensive. Expert users can then curate the subsets they need for ever-evolving purposes, imposing structure as they go and using temporary sandbox areas to build and train models for any given task. Once they have a model that works, they can operationalize it and move on.

Data lakes tend to use massively parallel processing and data partitioning; compute and storage are usually kept separate, so they can be scaled independently, and users can pay just for storage when they’re not doing any analysis. By making a large repository of corporate data available in a central location, these databases can be particularly valuable for expert users who focus on developing advanced analytical models and machine learning algorithms, for example. Users can search the data lake, use data-wrangling tools to build data sets, and spin up compute resources on demand to build analytical models without worrying about expensive and idle hardware.
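The "impose structure as they go" pattern is often called schema-on-read: raw records sit in the lake untransformed, and an expert user applies just enough structure at query time to answer one exploratory question. A minimal sketch, with hypothetical event records and an assumed question (average click latency per user):

```python
import json

# Raw events land in the lake as-is: cheap storage, no upfront schema.
lake = [
    '{"user": "a", "event": "click", "ms": 120}',
    '{"user": "b", "event": "view"}',                # fields may be missing
    '{"user": "a", "event": "click", "ms": 95}',
]

def curate_click_latency(raw_records):
    """Impose structure at read time for one exploratory analysis:
    average click latency (ms) per user."""
    samples = {}
    for line in raw_records:
        rec = json.loads(line)
        # Only now do we decide which fields matter and how to handle gaps.
        if rec.get("event") == "click" and "ms" in rec:
            samples.setdefault(rec["user"], []).append(rec["ms"])
    return {user: sum(ms) / len(ms) for user, ms in samples.items()}
```

A different question tomorrow would curate a different subset from the same raw records, which is why the untransformed lake stays cheap and flexible.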

This kind of storage often makes the most sense for data for which specific use cases have not yet been defined. A searchable data catalog and appropriate data retention policies are essential, as are strong governance processes to help ensure data quality and prevent the lake from becoming a “data swamp.”

A Hybrid Option

Both data lakes and data warehouses have a place in the enterprise world. Some organizations use one or the other, but there is also a combined approach in which companies can ingest incoming data into a data lake and then move select subsets into a data warehouse as specific use cases are defined. When data ages past a certain point or falls into disuse, dynamic tiering functionality can automatically move it back to the data lake for cheaper long-term storage. By keeping “hot” data in a warehouse and “cold” data in a lake, companies can make the most of the strengths of each storage option.
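The dynamic-tiering idea above can be sketched as a simple age-based policy: rows past a cutoff move from the warehouse back to the lake. The 90-day threshold and record shape are assumptions for illustration, not a description of any vendor's tiering feature.

```python
from datetime import date, timedelta

def tier_by_age(warehouse, lake, today, max_age_days=90):
    """Keep 'hot' rows (newer than max_age_days) in the warehouse;
    move 'cold' rows to the lake for cheaper long-term storage."""
    cutoff = today - timedelta(days=max_age_days)
    hot = [r for r in warehouse if r["order_date"] >= cutoff]
    cold = [r for r in warehouse if r["order_date"] < cutoff]
    return hot, lake + cold
```

In practice such a policy would run on a schedule, with the reverse path (promoting lake data into the warehouse) triggered when a new use case is defined.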