Small Lab, Big Data: January 2023

Monday, 23 January 2023

Data: What quantity beats quality?

“Data is the new science. Big Data holds the answers.”

Pat Gelsinger

In general, both data quality and data quantity are important considerations for machine learning. The importance of each factor can vary depending on the specific task and the characteristics of the data.

Data quality refers to the accuracy, relevance, and completeness of the data that is used to train an ML model. High-quality data is essential for training accurate and reliable machine learning models, as the model's performance is directly influenced by the quality of the data.

On the other hand, data quantity refers to the amount of data that is available to train an ML model. In general, more data is better, as it allows the model to learn more about the patterns and relationships in the data.

In practice, it is often desirable to have both high-quality data and a large quantity of data for ML. However, it is important to balance the need for high-quality data against the practical considerations of collecting and storing large amounts of data: where is the data coming from, and is it considered a reliable source? This is where the business is essential: they understand the business processes that go into collecting that data. These little details can really mean the difference between a right and a wrong ML prediction.

The optimal balance between data quality and data quantity depends on the specific task and the characteristics of the data. It is important to carefully evaluate the trade-offs and ensure that the data used to train an ML model is of sufficient quality and quantity to support accurate and reliable predictions.

So how can we assess the quality of our data? There are plenty of approaches in Oracle technology. Within the database, we can use some SQL to get overall descriptive analytics of our data.

select …

from sys.all_tab_columns col

join sys.all_tab_col_statistics stat

…
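As a rough illustration of what "descriptive analytics" of a table means, here is a minimal Python sketch that is independent of the dictionary views above. It counts rows, missing values and distinct values per column. The function names (profile_column, profile_table) and the sample rows are made up for illustration; inside the database, views such as all_tab_col_statistics expose comparable figures (for example distinct and null counts) without pulling the data out.

```python
from collections import Counter


def profile_column(values):
    """Summarise one column: rows, missing values, distinct values, most common value."""
    # Treat None and empty strings as missing; 0 is a real value and is kept.
    present = [v for v in values if v is not None and v != ""]
    counts = Counter(present)
    most_common = counts.most_common(1)[0][0] if counts else None
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(counts),
        "most_common": most_common,
    }


def profile_table(rows):
    """Profile every column of a table given as a list of dicts."""
    columns = sorted({key for row in rows for key in row})
    return {col: profile_column([row.get(col) for row in rows]) for col in columns}


# Hypothetical sample rows, invented purely for illustration.
sample_rows = [
    {"player": "A", "position": "GK", "goals": 0},
    {"player": "B", "position": "FW", "goals": 3},
    {"player": "C", "position": "", "goals": None},
    {"player": "D", "position": "FW", "goals": 1},
]

for column, stats in profile_table(sample_rows).items():
    print(column, stats)
```

Even on four rows this shows the kind of question a domain expert can answer and a data scientist cannot: is a blank position a data entry error, or a player who genuinely has no fixed position?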

When we load data, such as a CSV file, into the Autonomous Database (ADW), we can also get a description of the data through the evaluation of the data and the construction of data models. In the example below we can pick out our fact table and the associated dimensions:

There is a Data Analysis tool within ADW; however, in my opinion, it is a little limited:

If we are in Oracle Analytics Cloud (OAS has something similar too), then we have the following available once the data flow is created. This, in my opinion, is useful. What we don't have is the curve of the data, and it doesn't start to draw the eye to what is biased within the data. Still, it is potentially a very easy and nice view for a domain user.

This first stage, data understanding, brings the first complexity of ML to the surface, data processing:

Data (pre)processing: In many cases, the raw data used to train an ML model is not in a usable form. It may be dirty, incomplete, or unstructured, and must be cleaned, transformed, and structured in a way that is suitable for machine learning. This process, known as data (pre)processing, can be complex and time-consuming. Understanding what is appropriate to change, remove or update is a field in itself, and often requires domain expertise to get right.
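To make "cleaned, transformed, and structured" a little more concrete, here is a small Python sketch of a few common cleaning steps. The function name clean_records, the field names and the sample rows are all hypothetical. The choices it makes (lower-casing text, dropping rather than imputing incomplete rows) are assumptions for the example, and are exactly the kind of decision that needs domain input.

```python
def clean_records(records, required_fields):
    """Return cleaned copies of the records plus the number of rows dropped.

    Illustrative steps only:
    1. trim whitespace and lower-case text values,
    2. turn empty strings into None,
    3. drop rows missing any required field,
    4. drop rows that are exact duplicates after normalisation.
    """
    cleaned, seen = [], set()
    for record in records:
        row = {}
        for key, value in record.items():
            if isinstance(value, str):
                value = value.strip().lower() or None
            row[key] = value
        if any(row.get(field) is None for field in required_fields):
            continue  # incomplete row
        fingerprint = tuple(sorted(row.items(), key=lambda item: item[0]))
        if fingerprint in seen:
            continue  # duplicate row
        seen.add(fingerprint)
        cleaned.append(row)
    return cleaned, len(records) - len(cleaned)


# Hypothetical raw data, invented for illustration.
raw = [
    {"city": " London ", "sales": 120},
    {"city": "london", "sales": 120},   # duplicate once normalised
    {"city": "", "sales": 95},          # missing city
    {"city": "Leeds", "sales": None},   # missing sales
    {"city": "Leeds", "sales": 80},
]

cleaned, dropped = clean_records(raw, ["city", "sales"])
print(cleaned, dropped)
```

Three of the five rows disappear here. Whether that is acceptable, or whether the missing values should be filled in instead, depends on how the data was collected.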

Thursday, 19 January 2023

We are very Diverse

"The most interesting people you’ll find are ones that don’t fit into your average cardboard box. They’ll make what they need, they’ll make their own boxes"

Dr. Temple Grandin

A small divergence from the usual blogs, and one more personal to me. I’ve never stepped away from this one, and that is being an Aspie; more specifically, I’m a female adult who finds themselves on the autistic spectrum.

I know some people have often found it hard to understand; after all, I can be found at numerous conferences and within groups of people. So, I thought I’d offer an insight into my world.

Autism spectrum disorder (ASD) is a developmental disability caused by differences in the brain. Scientists believe there are multiple causes of ASD that act together to change the most common ways people develop.

Adults with autism often have some of the following ‘signs’, as described by the NHS:

not understanding social "rules", such as not talking over people
avoiding eye contact
getting too close to other people, or getting very upset if someone touches or gets too close to you
noticing small details, patterns, smells or sounds that others do not
having a very keen interest in certain subjects or activities
liking to plan things carefully before doing them

With the above list, and with some help from the other half, here are examples of how I’m often autistic.

I don’t always know when to talk in social gatherings or meals – so I don’t talk. I may appear shy and introverted, but I’ve learnt that I can’t get in trouble if I don’t talk.

Throughout a conference, I have a checklist of things to do while I’m at an event. For example: raise your head, try to make eye contact, or look at their forehead/over their head. Don’t stay on the outside of the room all the time.

My biggest problem – noise! It hurts my head, and it’s hard to explain the pain, so if you see me wearing headphones, then it is likely I’m playing white noise/familiar songs to essentially drown out the noise or distract me. I’ve met other people with autism, and for them it’s sometimes textures, so don’t be offended if they don’t want that free t-shirt.

I’m extremely keen on data and mathematical challenges, and if given an opportunity to do an escape room – you have very little chance of getting me off task! That’s the other problem, if I’m on task, you’ll be hard pressed to get me ‘off Task’.

Planning – I pack, repack, unpack, and pack again at least 2 weeks before the conference. I will have planned out my exact route with at least 3 alternatives in the event anything is cancelled or delayed. So should you ever get lost or not know which train to catch – drop me a message, I’ve probably got that route planned out!

Autism is just one of the many forms of neurodiversity out there (neurodiversity describes the idea that people experience and interact with the world around them in many ways).

So why write this? Because whilst the world is different for those of us who are neurodiverse, we also bring a unique view to the world. We see patterns, details, or approaches that the typical brain doesn’t see. We will likely test a system in a unique way or find a data enrichment that wasn’t thought about before.

In the past, I know of several companies that would have overlooked the neurodiverse community. It is changing, and just like Women in Technology (WIT), I’m also supportive of neurodiversity in technology.

So, at your next meeting or conference, I wonder if you will see those who see the world differently, and maybe, if they are having a tough day, reach out a helping hand – because they might show you something very new and different. If you are in the UK, one way to spot this is a lanyard: it indicates that the person wearing it has a hidden disability. I’ve found it very useful when travelling through busy airports and train stations.

Friday, 6 January 2023

Machine Learning – why isn’t it everywhere?

“If we have data, let’s look at data. If all we have are opinions, let’s go with mine.”

Jim Barksdale

Let’s start with a basic: ML is not the same thing as AI. Rather, Machine Learning (ML) is a type of artificial intelligence (AI) that involves the use of algorithms and statistical models to enable computers to improve their performance on a particular task through experience.

In ML, a computer is fed a large dataset and uses statistical analysis to identify patterns and relationships within the data. The computer can then use this knowledge to make predictions or decisions without being explicitly programmed to do so. If this is the case, then the question may well be: why isn’t it everywhere?
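For anyone who wants to see "learning from data rather than explicit programming" in its smallest form, here is a Python sketch that fits a straight line by ordinary least squares and then predicts for a value it has never seen. The function names and the tiny dataset are invented for illustration; real ML models are far richer, but the principle is the same: the rule (slope and intercept) comes from the data, not from the programmer.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) pair to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical observations that happen to follow y = 2x + 1 exactly.
hours = [1, 2, 3, 4, 5]
scores = [3, 5, 7, 9, 11]

model = fit_line(hours, scores)
print(model, predict(model, 10))
```

Nowhere in the code is the relationship "double it and add one" written down; it is recovered from the five observations.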

ML poses many challenges when it comes to conducting and implementing it within businesses. Some of these challenges are:

Data Quality and Quantity: ML algorithms require large amounts of high-quality data to learn effectively. This can be a challenge because it is often difficult to obtain large amounts of clean, accurate, and relevant data for the question we are trying to answer.
Overfitting: Overfitting occurs when an ML model fits the training data too closely and does not generalize well to new, unseen data. This can be a problem because the model will perform poorly when deployed in the real world or when conditions change over time.
Feature Engineering: The process of selecting and creating the input features that will be used to train an ML model. The key to this is domain knowledge and expertise, which can be time-consuming and difficult to convey.
Hyperparameter Tuning: ML algorithms have several hyperparameters that control their behaviour and performance. Finding the best values for these hyperparameters can be a challenge because it requires experimentation and evaluation.
Bias and Fairness: ML algorithms can sometimes perpetuate or amplify societal biases that are present in the training data.
Explainability: Many ML models are considered "black boxes" because it is difficult to understand how they arrived at a particular prediction. This lack of explainability can make it difficult to trust and deploy ML systems in certain contexts.
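Of the challenges above, overfitting is the easiest to show in a few lines. The Python sketch below uses a deliberately contrived, hand-made dataset (the true rule is "label 1 when x is at least 5", with two training labels flipped as noise). It compares a model that memorises the training data (1-nearest-neighbour) with a much simpler one (a single learned threshold). All names and numbers are assumptions for the example, not taken from any Oracle tool.

```python
def accuracy(predict, data):
    """Fraction of (x, label) pairs the model gets right."""
    return sum(predict(x) == label for x, label in data) / len(data)


def fit_nearest_neighbour(train):
    """Memorising model: copy the label of the closest training point."""
    def predict(x):
        _, label = min(train, key=lambda pair: abs(pair[0] - x))
        return label
    return predict


def fit_threshold(train):
    """Simple model: pick the cut-off with the best training accuracy."""
    xs = sorted({x for x, _ in train})
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]

    def rule(t):
        return lambda x: int(x >= t)

    best = max(candidates, key=lambda t: accuracy(rule(t), train))
    return rule(best), best


# Contrived training data: labels at x=2 and x=7 are flipped on purpose (noise).
train = [(1, 0), (2, 1), (3, 0), (4, 0), (5.5, 1),
         (6, 1), (7, 0), (8, 1), (9, 1), (10, 1)]
# Unseen data that follows the true rule with no noise.
test = [(1.2, 0), (2.3, 0), (3.4, 0), (5.2, 1),
        (6.8, 1), (7.3, 1), (8.4, 1), (9.6, 1)]

memoriser = fit_nearest_neighbour(train)
simple, cut_off = fit_threshold(train)

for name, model in [("nearest neighbour", memoriser), ("threshold", simple)]:
    print(name, accuracy(model, train), accuracy(model, test))
```

The memorising model is perfect on the data it was trained on, noise included, and noticeably worse on the unseen data; the simpler model gives up a little training accuracy and generalises better. On real data the gap is rarely this neat, which is why a held-out test set matters.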

Based on the above, there are several reasons why companies may not adopt and use ML on a regular basis:

Lack of resources: Implementing ML can require a significant investment in terms of time, money, and personnel.
Lack of expertise: ML requires specialized knowledge and skills, which may not be present within a company.
Complexity: ML projects can be complex and require a significant amount of infrastructure and technical expertise to set up and maintain.
Concerns about bias and fairness: ML algorithms can sometimes perpetuate or amplify societal biases that are present in the training data.
Legal and regulatory issues: There may be legal or regulatory hurdles that a company must navigate to implement machine learning. For example, there may be concerns about data privacy or the ethical use of ML.
Lack of clear ROI: In some cases, it may be difficult to quantify the potential benefits of a machine learning project, which can make it difficult for a company to justify the investment.

Over the coming blogs, we’ll investigate these areas of concern and how we can address some of them within the Oracle ecosystem, looking at the how-tos for solving these complexities and how you can become an ML superhero within Oracle technology.

Oracle has several tools and features that support ML, including Oracle Machine Learning (OML), Oracle Cloud Infrastructure (OCI) Data Science, and Oracle Autonomous Database (ADW).

OML is a suite of tools and libraries that allows users to build, train, and deploy ML models within the Oracle Database. It includes several pre-built machine learning algorithms and supports integration with popular open-source machine learning libraries such as scikit-learn and TensorFlow.

These can also be controlled through Oracle Analytics Cloud (OAC):

OCI Data Science is a cloud-based platform that provides a range of tools and services for data science and ML, including data preparation, model training, and model deployment. It also includes support for popular ML libraries and frameworks.

ADW is a fully managed database service that uses machine learning to optimize and manage itself, eliminating the need for manual tuning and maintenance. It includes support for in-database ML using SQL and Python.

As you can see, we’ve got a lot of exploring and learning to do in 2023, and I’m grateful I’m able to help some people get started, and maybe help those who have started look at it in a different way.

Monday, 2 January 2023

Evolution is a truth

“It is a truth universally acknowledged,..”

Jane Austen, Pride and Prejudice

2022 was a year to both remember and forget. It was a year where personal loss went very deep; with my father passing, it marked the end of a tough 4 years. In 2019 we lost mum, 2 years of the pandemic took their toll on all of us, then in 2022 my father…. So hopefully 2023 will be less painful in that sense.

2022 wasn’t all bad: we saw the return of in-person conferences – and it was so good to have coffee and catch up. I personally had the pleasure of attending the Skywalker Ranch in San Francisco, meeting new people, and exchanging new ideas. There was the UKOUG event at the end of the year, returning to Birmingham – home to a great Christmas market. On a personal level, football (soccer to the USA audience) was in full flight, with an amazing county cup win, and I continued to grow the football analytics into ML as well as Graph technology.

But this doesn’t answer the question: why are you blogging? Well, there was another great addition to 2022, and that was the return/fixing of the Oracle ACE program. It’s fair to say that the program itself has been on a journey, from very low points to the very high of it returning to its roots. There have been some changes along the way; some will argue they are for the better and others say they aren’t fair. Personally, I see them as the evolution of a program: like all technology and processes, it’s adapting, maturing, and evolving. One of the changes was that blogs written/posted on your company website won’t count for points – it’s a contentious change to some – but it did make me rethink. As it is, the blogs I write are often how-tos or look at the wider parts of ML/AI. So why not write these posts externally rather than as a company? Answer – there is no reason not to.

In 2019 I stopped blogging, mainly to have more time with my Mum and care for my father. Now that life has thrown its curve balls, it's time to get back to it. Plus, I'm conscious that some still can't travel to conferences, so even though my preferred approach to helping people is conferences, blogs absolutely have their place too.

So, that’s why these blogs have come about: time to help expand who reads about Oracle ML and AI. No hidden agendas, no company posts, just simply feeding back to the community, hoping to encourage more discussion and create new friendships.

Wish me luck in 2023, it’s hopefully better than 2022, but also a new approach, an evolution into the writing market rather than the presenting market….
