According to Wikipedia, a Data Warehouse is defined as:
“... are central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place that are used for creating analytical reports for workers throughout the enterprise.”
“The main source of the data is cleansed, transformed, catalogued, and made available for use by managers and other business professionals for data mining, online analytical processing, market research and decision support. However, the means to retrieve and analyze data, to extract, transform, and load data, and to manage the data dictionary are also considered essential components of a data warehousing system.”
and a Data Lake as:
“...a system or repository of data stored in its natural format, usually object blobs or files. A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualization, analytics and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video).”
In a nutshell, the differences are:
Data Warehouse | Data Lake
Only retains the data required, reducing noise for the users | Retains all data
Arranges data into standardised structures | Can handle data of any variety / structure
Consolidates data from a variety of sources into ‘One Source of the Truth’ | Supports a variety of users but they may require specific skills
Supports decision making and can be used by a range of user abilities | Ability to adapt quickly to new data types
Stringent row/column security can be put in place quickly | Enables quicker data analysis
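To make the "standardised structures" versus "any variety / structure" contrast a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration only: the sample events, the `sales` table, and the helper names `load_warehouse`, `load_lake` and `lake_fields` are hypothetical, and an in-memory SQLite database simply stands in for a warehouse while a plain list of raw JSON strings stands in for a lake.

```python
import json
import sqlite3

# Hypothetical raw events, as they might arrive from different source systems.
RAW_EVENTS = [
    '{"customer": "C1", "amount": 10.5, "channel": "web"}',
    '{"customer": "C2", "amount": "7.25"}',
    '{"customer": "C1", "note": "called support"}',
]


def load_warehouse(raw_events):
    """Warehouse style: keep only records that fit the agreed structure."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (customer TEXT NOT NULL, amount REAL NOT NULL)"
    )
    for line in raw_events:
        record = json.loads(line)
        # Records without the required fields never reach the table.
        if "customer" in record and "amount" in record:
            conn.execute(
                "INSERT INTO sales (customer, amount) VALUES (?, ?)",
                (record["customer"], float(record["amount"])),
            )
    conn.commit()
    return conn


def load_lake(raw_events):
    """Lake style: keep every record exactly as it arrived."""
    return list(raw_events)


def lake_fields(lake):
    """Structure is only worked out when the data is read."""
    fields = set()
    for line in lake:
        fields.update(json.loads(line).keys())
    return fields


warehouse = load_warehouse(RAW_EVENTS)
lake = load_lake(RAW_EVENTS)

# The warehouse answers a standard question quickly and consistently.
total = warehouse.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
```

In this sketch the warehouse ends up with only the two records that match the `sales` structure, so a question like total sales is a one-line query. The lake still holds all three records, including the support note and the `channel` field that the warehouse dropped, but the analyst has to work out the structure at read time.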
From a data science perspective, which is better? Well, that depends. A Data Warehouse has its advantages: the structure and reproducibility allow quick, consistent analysis to be performed. A Data Lake, on the other hand, may be more difficult to work with and is not always reproducible, but it gives you the more granular and wide-ranging data needed to find the really interesting analysis.
So my answer – a Data Warehouse where I can bring my own data. Maybe something like a Data Ware-Lake?