Friday 28 September 2018

Data Lake vs Warehouse

According to Wikipedia, a Data Warehouse is defined as:

“... are central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place that are used for creating analytical reports for workers throughout the enterprise.”

“The main source of the data is cleansed, transformed, catalogued, and made available for use by managers and other business professionals for data mining, online analytical processing, market research and decision support. However, the means to retrieve and analyze data, to extract, transform, and load data, and to manage the data dictionary are also considered essential components of a data warehousing system.”

and a Data Lake as:

“...a system or repository of data stored in its natural format, usually object blobs or files. A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualization, analytics and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video).”

In a nutshell, the differences are:
Data Warehouse | Data Lake
Only retains the data required, reducing noise from the users | Retains all data
Arranges data into standardised structures | Can handle data of any variety / structure
Consolidates data from a variety of sources into ‘One Source of the Truth’ | Supports a variety of users but they may require specific skills
Supports decision making and can be used by a range of user abilities | Ability to adapt quickly to new data types
Stringent Row/Column security can be put in place quickly | Enables quicker data analysis

From a data science perspective, which is better? Well, that depends. A Data Warehouse has its advantages: the structure and reproducibility allow quick, consistent analysis to be performed. A Data Lake, on the other hand, may be more difficult to work with and is not always reproducible, but it gives you the more granular and wide-ranging data needed to find the really interesting analysis.

So my answer: a Data Warehouse where I can bring my own data. Maybe something like a Data Ware-Lake?
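
To make the "bring my own data" idea a little more concrete, here is a minimal Python sketch. It is only an illustration under assumptions: an in-memory SQLite table stands in for the curated warehouse, a handful of raw JSON lines stand in for lake-style data, and every table name, field and value is hypothetical.

```python
import json
import sqlite3

# Warehouse side: a curated table with a fixed schema (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer_id INTEGER, region TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, "North", 120.0), (2, "South", 80.0), (3, "North", 45.5)],
)

# Lake side: raw records kept in their original format; fields vary per record.
raw_events = [
    '{"customer_id": 1, "channel": "web", "clicks": 14}',
    '{"customer_id": 2, "channel": "email"}',
    '{"customer_id": 3, "channel": "web", "clicks": 3, "note": "trial user"}',
]


def load_own_data(connection, lines):
    """Apply a schema on read: keep only the fields this analysis needs."""
    connection.execute(
        "CREATE TEMP TABLE my_events (customer_id INTEGER, channel TEXT, clicks INTEGER)"
    )
    rows = []
    for line in lines:
        record = json.loads(line)
        # Missing fields get a default instead of failing the load.
        rows.append((record["customer_id"], record.get("channel"), record.get("clicks", 0)))
    connection.executemany("INSERT INTO my_events VALUES (?, ?, ?)", rows)


load_own_data(conn, raw_events)

# Join the curated warehouse table with the data I brought myself.
revenue_by_channel = conn.execute(
    """
    SELECT e.channel, SUM(s.revenue)
    FROM sales AS s
    JOIN my_events AS e ON e.customer_id = s.customer_id
    GROUP BY e.channel
    ORDER BY e.channel
    """
).fetchall()

print(revenue_by_channel)  # [('email', 80.0), ('web', 165.5)]
```

The warehouse table keeps its structure and reproducibility, while the raw records only get a shape at the moment they are loaded, so extra fields such as "note" can sit in the source until an analysis needs them.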
