Wednesday 8 March 2023

Automating life, it can’t be that hard – surely….

“Machine intelligence is the last invention that humanity will ever need to make.”

Nick Bostrom


Automated Machine Learning (AutoML) is the process of automating the end-to-end process of applying Machine Learning to real-world problems. In a typical ML application, experts must apply the appropriate data pre-processing methods, feature engineering, feature extraction, and feature selection to make the data set most accessible for ML. Following these pre-processing steps, practitioners must then perform the algorithm selection and hyper-parameter optimization to maximize the predictive performance of the final ML model. Since many of these steps often go beyond the capabilities of laypersons, AutoML has been developed as an artificial intelligence-based solution to the ever-growing challenge of applying ML. Automating the end-to-end process of applying ML offers the benefits of producing more straightforward solutions, faster creation of these solutions, and models that often outperform hand-designed models.

Oracle AutoML UI


AutoML User Interface (AutoML UI) is an Oracle Machine Learning interface that provides you with no-code automated machine learning modelling. When you create and run an experiment in AutoML UI, it performs automated algorithm selection, feature selection, and model tuning, thereby enhancing productivity as well as potentially increasing model accuracy and performance.

The following steps comprise a machine learning modelling workflow and are automated by the AutoML user interface (a rough code sketch of what these stages do follows the list):

  • Algorithm Selection: Ranks the algorithms that are likely to produce a more accurate model, based on the dataset, its characteristics, and some predictive features of the dataset for each algorithm.
  • Adaptive Sampling: Finds an appropriate data sample. The goal of this stage is to speed up Feature Selection and Model Tuning stages without degrading the model quality.
  • Feature Selection: Selects a subset of features that are most predictive of the target. The goal of this stage is to reduce the number of features used in the later pipeline stages, especially during model tuning, to speed up the pipeline without degrading predictive accuracy.
  • Model Tuning: Aims at increasing individual algorithm model quality based on the selected metric for each of the shortlisted algorithms.
  • Feature Prediction Impact: This is the final stage in the AutoML UI pipeline. Here, the impact of each input column on the predictions of the final tuned model is computed. The computed prediction impact provides insights into the behaviour of the tuned AutoML model.
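
As a rough illustration of what these stages automate, here is a minimal sketch in Python using scikit-learn. It is purely illustrative and not Oracle's implementation; the dataset, the single algorithm, and the parameter grid are placeholder assumptions, and the AutoML UI performs the equivalent steps for you with no code at all.

# Hand-rolled, simplified version of what AutoML automates:
# feature selection followed by model (hyper-parameter) tuning.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),      # feature selection
    ("model", RandomForestClassifier(random_state=0)),  # one shortlisted algorithm
])

# Model tuning: search over a small, illustrative parameter grid
param_grid = {
    "select__k": [5, 10, 20],
    "model__n_estimators": [100, 300],
    "model__max_depth": [None, 5],
}

search = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=5)
search.fit(X_train, y_train)

print(search.best_params_)           # the tuned settings
print(search.score(X_test, y_test))  # accuracy on held-out data

Even this tiny hand-rolled version shows how much of the experimentation the AutoML UI takes off your hands.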

Tuesday 21 February 2023

Natural what now - NLP



        “A different language is a different vision of life.”

                                                                        Federico Fellini

 

I realised that in my last blog I didn't really provide enough of a description of what NLP is or its history. So here's a quick blog to fill readers in :)

Natural language processing (NLP) is a field of artificial intelligence and computer science that focuses on enabling computers to process, understand, and generate human language. It can be rule based or involve a number of complex algorithms. The tasks can range from translation to extracting specific items or grouping themes together. In the real world there are many applications of NLP, from chatbots to virtual assistants.

Here is a brief overview of the history of NLP:

  • 1950s and 1960s: The foundations of NLP are laid with the development of early computer programs that can process and analyse natural language data.
  • 1970s and 1980s: NLP begins to develop more advanced techniques, including the use of rule-based systems and the development of the first machine translation systems.
  • 1990s: The field of NLP sees significant growth and advances, including the development of statistical models and ML algorithms that can be used for NLP tasks.
  • 2000s: NLP continues to advance, with the development of new techniques such as deep learning.
  • 2010s and beyond: NLP continues to evolve, with the development of new techniques and the widespread use of NLP in a wide range of applications, including virtual assistants and chatbots.


Overall, the history of NLP reflects the evolution of artificial intelligence and computer science, and it has led to significant advances in the field that have had a wide-ranging impact on society.




Monday 20 February 2023

NLP: Where do I start

"Knowledge is power"

Francis Bacon

The first starting point for most analytics projects is descriptive analytics: basically answering the simple questions of 'What happened?', 'Where are we?' or 'When did it happen?'. In the same sense, we can get a feel for an unstructured document or text field.

For each sentence, we can start to understand the number of words, the length, the number of characters, etc.

Depending on your Oracle environment, you could use Python through the Oracle Data Science platform. Python has a number of language modules - we are going to get onto more of them shortly - but ultimately you could use something as simple as len.
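
For example, here is a minimal sketch (the example sentence is just an illustration) of what len and split already give us:

text = "The quick brown fox jumps over the lazy dog"
words = text.split()

print(len(text))                   # number of characters (including spaces)
print(len(words))                  # number of words
print(sum(len(w) for w in words))  # number of characters excluding spaces
print(sum(len(w) for w in words) / len(words))  # average word length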

If you are doing your analytics in SQL, perhaps on a text field inside the database (e.g. survey responses), then there are a number of string functions you can utilise, such as LENGTH on its own, or LENGTH combined with a regular expression (REGEXP_REPLACE) to strip out special characters or spaces first.


Stop Words

Once we understand the sentence at its highest level, we can start to pull apart what the sentence is actually made up of. Stop words are those words that enable the connection of other words. They are, in a sense, the fluffy stuff. They don't add much meaning to the sentence, but they make it nicer for us to read or hear.



In this sentence we have multiple stop words. Now there is a caveat, and it's the word "it". In lower case we read it as "it"; however, in upper case "IT" is usually a department in your organisation. So be careful about the order of the steps you take when it comes to handling stop words. If you do your cleaning/preparation first, such as lower-casing or removing punctuation, there is a chance that "IT" becomes "it" and is therefore treated as a stop word.

If you are in the Oracle Data Science platform, then packages such as NLTK have stop word lists built in and can remove them very quickly.
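
Here is a minimal sketch, assuming NLTK is installed and the stop words corpus has already been downloaded (nltk.download('stopwords')); the example sentence is made up, and it also demonstrates the "IT" caveat above:

from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))

sentence = "The IT team said it is happy to look at the server in the morning"

# Split first, then decide how to handle case: "IT" (the department) would
# become the stop word "it" if we lower-cased the whole sentence up front.
tokens = sentence.split()
kept = [t for t in tokens if t.isupper() or t.lower() not in stop_words]

print(kept)  # stop words removed, but the all-caps "IT" is preserved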



If you are in Oracle SQL, you'll need to build up your own list of stop words and then apply a function (for example REGEXP_REPLACE) to remove them.


In the next blog we'll start to look at Sentiment analysis.







Monday 23 January 2023

Data: What quantity beats quality?

 “Data is the new science. Big Data holds the answers.”

Pat Gelsinger

 

In general, both data quality and data quantity are important considerations for machine learning. The importance of each factor can vary depending on the specific task and the characteristics of the data.

 

Data quality refers to the accuracy, relevance, and completeness of the data that is used to train a ML model. High-quality data is essential for training accurate and reliable machine learning models, as the model's performance is directly influenced by the quality of the data.




On the other hand, data quantity refers to the amount of data that is available to train a ML model. In general, more data is better, as it allows the model to learn more about the patterns and relationships in the data. 

 

In practice, it is often desirable to have both high-quality data and a large quantity of data for ML. However, it is important to balance the need for high-quality data with the practical considerations of collecting and storing large amounts of data: where is the data coming from, and is it considered a reliable source? This is where the business is essential - they will understand the business processes that go into collecting that data. These little details can really mean the difference between a right and a wrong ML prediction.

 

The optimal balance between data quality and data quantity depends on the specific task and the characteristics of the data. It is important to carefully evaluate the trade-offs and ensure that the data used to train a ML model is of sufficient quality and quantity to support accurate and reliable predictions.

 

So how can we get a view of data quality from our data? Well, actually there are plenty of approaches in Oracle technology. Within the database, we can use some SQL to get overall descriptive analytics of our data.


select col.table_name, col.column_name, col.data_type,   -- example columns; adjust the select list as needed
       stat.num_distinct, stat.num_nulls, stat.avg_col_len
  from sys.all_tab_columns col
  join sys.all_tab_col_statistics stat
    on stat.owner = col.owner
   and stat.table_name = col.table_name
   and stat.column_name = col.column_name;




When we are using the Autonomous Database (ADW) and loading data, such as a CSV file, we can also get a description of the data through the evaluation of the data and the construction of data models. In the example below, we can pick out our fact table and the associated dimensions:









There is a Data Analysis tool within ADW; however, in my opinion it is a little limited:



If we are in Oracle Analytics Cloud (OAS has something similar too), then we have the following available once the data flow is created. This, in my opinion, is useful; however, what we don't have is the distribution curve of the data, so it doesn't draw the eye to what is biased within the data. Potentially a very easy and nice view for a domain user.



This first stage of data understanding enables the first complexity of ML to come to the surface, data (pre)processing:

 

Data (pre)processing: In many cases, raw data used to train a ML model is not in a usable form. It may be dirty, incomplete, or unstructured, and must be cleaned, transformed, and structured in a way that is suitable for machine learning. This process, known as data (pre)processing, can be complex and time-consuming. Understanding what is appropriate to change, remove or update, is a unique field in itself, and often requires domain expertise to get it right.
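
As a minimal sketch of what that preparation can look like in Python with pandas: the column names, example values, and cleaning rules below are assumptions for illustration only, and the 'right' rules always depend on the domain.

import pandas as pd

# A small, deliberately messy example data set
df = pd.DataFrame({
    "age":    [34, None, 29, 51, 200],                          # missing value and an implausible outlier
    "salary": ["30,000", "45,000", None, "52,000", "41,000"],   # numbers stored as text
    "city":   ["London", "london ", "LONDON", "Leeds", None],   # inconsistent text values
})

# Standardise the text field
df["city"] = df["city"].str.strip().str.title()

# Convert the 'dirty' numeric column stored as text
df["salary"] = pd.to_numeric(df["salary"].str.replace(",", ""), errors="coerce")

# Handle missing values and obvious outliers (domain expertise decides these rules)
df["salary"] = df["salary"].fillna(df["salary"].median())
df = df[df["age"].between(0, 120) | df["age"].isna()]
df["age"] = df["age"].fillna(df["age"].median())

print(df)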




Thursday 19 January 2023

We are very Diverse



 "The most interesting people you’ll find are ones that don’t fit into your average cardboard box. They’ll make what they need, they’ll make their own boxes"

Dr. Temple Grandin

A small divergence from the usual blogs, and one more personal to me. I've never stepped away from this one, and that is being an Aspie - more specifically, I'm a female adult who finds themselves on the autistic spectrum.

 

I know some people have often found it hard to understand, I can after all be found at numerous conferences and within groups of people. So, I thought I’d offer an insight to my world. 

 

Autism spectrum disorder (ASD) is a developmental disability caused by differences in the brain. Scientists believe there are multiple causes of ASD that act together to change the most common ways people develop. 

Adults with autism often have some of the following ‘signs’ as described by the NHS as: 

  • not understanding social "rules", such as not talking over people
  • avoiding eye contact
  • getting too close to other people, or getting very upset if someone touches or gets too close to you
  • noticing small details, patterns, smells or sounds that others do not
  • having a very keen interest in certain subjects or activities
  • liking to plan things carefully before doing them

 

 

With the above list, and with some help from the other half, here are examples of how my autism often shows itself.

 

I don't always know when to talk in social gatherings or meals – so I don't talk. I may appear shy and introverted, but I've learnt that I can't get in trouble if I don't talk.

 

Throughout a conference, I have a checklist of things to do while I'm at an event. For example: raise your head, try to make eye contact, or look at their forehead/over their head. Don't stay on the outside of the room all the time.

 

My biggest problem – noise! It hurts my head, and it's hard to explain the pain, so if you see me wearing headphones, then it is likely I'm playing white noise or familiar songs to essentially drown out the noise or distract me. I've met other people with autism and for them it's sometimes textures, so don't be offended if they don't want that free t-shirt.

 

I'm extremely keen on data and mathematical challenges, and if given an opportunity to do an escape room – you have very little chance of getting me off task! That's the other problem: if I'm on task, you'll be hard pressed to get me 'off task'.

 

Planning – I pack, repack, unpack, pack at least 2 weeks before the conference. I will have planned out my exact route with at least 3 alternatives in the event anything is cancelled or delayed. So should you ever get lost/ not know which train to catch – drop me a message I’ve probably got that route planned out!

 

Autism is just one of the many neurodivergences out there. (Neurodiversity describes the idea that people experience and interact with the world around them in many different ways.)

 

So why write this? Because whilst the world is different for those of us who are neurodiverse, we also bring a unique view to the world. We see patterns, details, or approaches that the typical brain doesn't see. We will likely test a system in a unique way or find a data enrichment that wasn't thought about before.

 

In the past, I have known several companies that would have overlooked the neurodiverse community. That is changing, and just like Women in Technology (WIT), I'm also supportive of neurodiversity in technology.

 

So, at your next meeting or conference, I wonder if you will see those who see the world differently, and maybe, if they are having a tough day, reach out a helping hand – because they might show you something very new and different. If you are in the UK, one way to recognise this is a lanyard that indicates the person wearing it has a hidden disability; I've found it very useful when travelling through busy airports and train stations.