Monday 23 January 2023

Data: Does quantity beat quality?

“Data is the new science. Big Data holds the answers.”

Pat Gelsinger

 

In general, both data quality and data quantity are important considerations for machine learning. The importance of each factor can vary depending on the specific task and the characteristics of the data.

 

Data quality refers to the accuracy, relevance, and completeness of the data used to train an ML model. High-quality data is essential for training accurate and reliable machine learning models, as a model's performance is directly influenced by the quality of its training data.
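As a minimal sketch of what those quality dimensions can mean in practice, the checks below flag missing values (completeness) and implausible values (accuracy) in a small record set. The field names, example rows, and the valid age range are all illustrative assumptions, not a real dataset:

```python
# Minimal data-quality checks: completeness (no missing values) and
# validity (values within an expected range). Field names and the
# 0-120 age range are illustrative assumptions.
records = [
    {"customer_id": 1, "age": 34, "country": "UK"},
    {"customer_id": 2, "age": None, "country": "UK"},   # incomplete
    {"customer_id": 3, "age": 214, "country": "UK"},    # implausible
]

def quality_issues(rows, field, lo, hi):
    """Return (missing, out_of_range) counts for a numeric field."""
    missing = sum(1 for r in rows if r.get(field) is None)
    out_of_range = sum(
        1 for r in rows
        if r.get(field) is not None and not (lo <= r[field] <= hi)
    )
    return missing, out_of_range

missing, bad = quality_issues(records, "age", 0, 120)
print(missing, bad)  # 1 record missing age, 1 outside the 0-120 range
```

Simple counts like these are often the first signal that a source needs attention before any model sees it.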




On the other hand, data quantity refers to the amount of data available to train an ML model. In general, more data is better, as it allows the model to learn more about the patterns and relationships in the data.
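One way to see why quantity helps: with more samples, estimates of the underlying pattern stabilise. A toy sketch using only the Python standard library (the distribution parameters are arbitrary, chosen just for illustration):

```python
import random
import statistics

random.seed(42)

def estimate_mean(n):
    """Estimate the mean of a noisy process from n samples."""
    samples = [random.gauss(50, 10) for _ in range(n)]
    return statistics.mean(samples)

# The spread of repeated estimates shrinks as the sample size grows,
# which is the statistical intuition behind "more data is better".
for n in (10, 1000):
    estimates = [estimate_mean(n) for _ in range(100)]
    print(n, round(statistics.stdev(estimates), 2))
```

The estimate from 1,000 samples varies far less between runs than the one from 10 samples; the same logic applies to the patterns an ML model tries to learn.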

 

In practice, it is often desirable to have both high-quality data and a large quantity of data for ML. However, the need for high-quality data must be balanced against the practical considerations of collecting and storing large amounts of it: where is the data coming from, and is the source considered reliable? This is where the business is essential - they understand the business processes that go into collecting that data. These little details can really mean the difference between a right and a wrong ML prediction.

 

The optimal balance between data quality and data quantity depends on the specific task and the characteristics of the data. It is important to carefully evaluate the trade-offs and ensure that the data used to train an ML model is of sufficient quality and quantity to support accurate and reliable predictions.

 

So how can we assess the quality of our data? Well, actually there are plenty of approaches in Oracle technology. Within the database, we can use some SQL against the data dictionary to get overall descriptive analytics of our data:


select …   -- pick the dictionary columns you need, e.g. stat.num_distinct, stat.num_nulls

  from sys.all_tab_columns col

  join sys.all_tab_col_statistics stat
    on stat.owner       = col.owner
   and stat.table_name  = col.table_name
   and stat.column_name = col.column_name


When we are using the Autonomous Data Warehouse (ADW), then when loading data, such as a CSV file, we can also get a description of the data through its evaluation and the construction of data models. In the example below we can pick out our fact table and the associated dimensions:









There is a Data Analysis tool within ADW; however, in my opinion it is a little limited:



If we are in Oracle Analytics Cloud (OAS has a similar feature), then we have the following available once the data flow is created. This is, in my opinion, useful; however, what we don't have is the distribution curve of the data, so it doesn't draw the eye to what may be biased within the data. Potentially a very easy and nice view for a domain user, though.



This first stage of data understanding brings the first complexity of ML to the surface, data (pre)processing:

 

Data (pre)processing: In many cases, the raw data used to train an ML model is not in a usable form. It may be dirty, incomplete, or unstructured, and must be cleaned, transformed, and structured in a way that is suitable for machine learning. This process, known as data (pre)processing, can be complex and time-consuming. Understanding what is appropriate to change, remove or update is a discipline in itself, and often requires domain expertise to get right.
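A tiny sketch of the kind of decisions (pre)processing involves. The raw rows, field names, and the choice to drop (rather than impute) incomplete records below are all illustrative assumptions; making the right call for each field is exactly where domain expertise comes in:

```python
# Toy preprocessing: coerce types, normalise casing, and decide what
# to do with rows that cannot be repaired. Raw values are illustrative.
raw = [
    {"amount": "100.50", "region": "north"},
    {"amount": " 87.2 ", "region": "NORTH"},   # stray whitespace, casing
    {"amount": "",       "region": "south"},   # missing amount
]

def preprocess(rows):
    cleaned = []
    for r in rows:
        amount = r["amount"].strip()
        if not amount:          # domain call: drop, impute, or flag?
            continue            # here we simply drop the row
        cleaned.append({
            "amount": float(amount),                # coerce text to number
            "region": r["region"].strip().lower(),  # normalise casing
        })
    return cleaned

print(preprocess(raw))  # two usable rows; the incomplete one is dropped
```

Even in this toy example there are judgement calls (drop vs. impute, which casing is canonical) that a purely technical view cannot settle on its own.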



