So, what exactly do we mean by "good" data when it comes to AI?
It’s not about the traditional understanding of “good data”, which would have you adhere to specific quality metrics; it’s about ensuring that the data you want AI to work with is appropriate for your specific use case. For example, one key factor is timeliness: When was the data produced, and how relevant is it to your current needs? In some cases, older data might still hold value, while in others, more up-to-date information will be essential.
Building an AI capability
When embarking on an AI project, it may not be necessary to spend excessive time perfecting all your data before you get started. Instead, you should focus on identifying the most important AI use cases and ensuring that the data supporting those specific areas is as accurate and abundant as needed.
It’s more efficient to start by homing in on a narrow, well-defined use case. This allows you to concentrate on making sure the data associated with that use case is fit for purpose. Leverage the data capabilities and technologies within your organisation to develop a foundational data platform which can serve that fit-for-purpose data to your AI solutions. Once you achieve success with a focused initiative, it can serve as a blueprint for expanding into additional AI-driven use cases across your organisation.
What do we mean by ‘fit for purpose’?
When we say "fit for purpose," we mean that the data used should be appropriate for the intended outcome. For example, if you are deploying a model trained to diagnose patients based on symptoms, the data you provide must be correct and relevant to that purpose. If you mistakenly feed it inaccurate data, such as associating the wrong set of symptoms with a particular diagnosis, the model could produce poor results. In this case, accuracy is crucial. However, providing examples of misdiagnoses, clearly marked as incorrect, can be valuable, as it helps the model learn to differentiate errors from correct diagnoses. Here, the focus is on making sure the data is suitable for the task, rather than striving for perfection.
Another example of this is a chatbot providing expertise based on a body of articles. The relevance of those articles may diminish over time, particularly if they become outdated or disproved. For instance, an article might have been discredited in 2020. While it’s fine for the model to include articles that contain incorrect information, it becomes crucial to provide context – metadata like ‘Article X was discredited in 2020’ – so the model can decide what information to surface and what to avoid. The context helps the model determine the relevance and reliability of the data, ensuring that even imperfect sources can be used appropriately.
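The idea of attaching context like ‘Article X was discredited in 2020’ can be sketched in a few lines of Python. The `Article` structure and `build_context` helper below are hypothetical, purely for illustration: the point is that an imperfect source reaches the model together with a note about its status, rather than being silently included or excluded.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    """Hypothetical article record: the metadata dict carries the context."""
    title: str
    text: str
    metadata: dict = field(default_factory=dict)


def build_context(articles: list[Article]) -> str:
    """Pass every article to the model, annotated with its metadata,
    so the model can judge reliability for itself."""
    chunks = []
    for a in articles:
        notes = "; ".join(f"{k}: {v}" for k, v in a.metadata.items())
        header = f"{a.title} [{notes}]" if notes else a.title
        chunks.append(f"{header}\n{a.text}")
    return "\n\n".join(chunks)


articles = [
    Article("Article X", "Original claim...", {"status": "discredited in 2020"}),
    Article("Article Y", "Current guidance...", {"status": "peer-reviewed, 2023"}),
]
context = build_context(articles)
```

Neither article is thrown away: the discredited one stays in the context, but the annotation gives the model what it needs to decide whether to surface it.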
By focusing your efforts on areas of critical importance and ensuring the data quality is ‘good enough’ in those spaces, you can set a strong foundation for scaling AI capabilities effectively and efficiently.
Leveraging imperfect data in GPT
General-purpose models like GPT offer a great example of how even so-called "bad" data can be valuable, provided it is properly contextualised. In AI, data is only as good as the metadata surrounding it. Models like GPT are trained on vast datasets, encompassing everything from articles and comments to ratings and original content, good and bad alike. However, what truly enhances the data is the surrounding context: what people thought of it, how much it was consumed, and how it was interacted with.
GPT models are trained on data available on the public internet – and we all know that not all of that information is good or accurate. However, the internet also provides a wealth of context – trusted sources, ratings, number of visits, citations, and links – which helps AI determine which information is most likely to provide the right answers. This surrounding context plays a crucial role in helping models like GPT sift through the vast amount of content and make more informed decisions about which data to prioritise.
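As a toy illustration of how surrounding signals can be used to prioritise sources, the heuristic below combines citation counts, ratings, and visit numbers into a single score. The field names and weights are invented for the example and do not reflect how GPT models are actually built.

```python
def trust_score(source: dict) -> float:
    """Toy heuristic combining the kinds of signals the public web
    attaches to content. Weights are invented for illustration."""
    return (
        2.0 * source.get("citations", 0)
        + 1.0 * source.get("avg_rating", 0.0)
        + 0.001 * source.get("visits", 0)
    )


def prioritise(sources: list[dict]) -> list[dict]:
    """Order sources so the most trusted are considered first."""
    return sorted(sources, key=trust_score, reverse=True)


sources = [
    {"name": "anonymous_blog_post", "citations": 0, "avg_rating": 2.5, "visits": 300},
    {"name": "cited_reference_article", "citations": 40, "avg_rating": 4.8, "visits": 12000},
]
ranked = prioritise(sources)
```

The content itself is never inspected; the ranking comes entirely from the context around it, which is the point being made above.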
RAG approach vs. training AI from scratch
Many organisations are not building AI models entirely from scratch but are instead using a Retrieval-Augmented Generation (RAG) approach or fine-tuning existing models with their own data.
Building and training a model from the ground up requires massive, diverse datasets and considerable computational resources. In contrast, a RAG approach starts with a pre-trained model and augments it with a smaller, more targeted dataset specific to the organisation's needs. This approach is far simpler than training a model from scratch, yet can still deliver impressive results, making it a practical choice for many companies.
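A minimal sketch of the retrieval step in a RAG pipeline might look like the following. The keyword-overlap scoring is deliberately naive (real systems typically score with vector embeddings), and the final call to a pre-trained model is left out; the sketch only shows how a small, organisation-specific document set is folded into the prompt.

```python
def score(query: str, doc: str) -> float:
    """Fraction of query words that also appear in the document."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0


def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most relevant to the query."""
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]


def build_prompt(query: str, docs: list[str]) -> str:
    """Fold the retrieved documents into a prompt for a pre-trained model."""
    context = "\n---\n".join(retrieve(query, docs))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


# Example: a small, targeted document set standing in for company data.
docs = [
    "returns policy: items can be returned within 30 days of purchase",
    "shipping policy: orders ship within 2 business days",
]
prompt = build_prompt("can items be returned within 30 days", docs)
```

The pre-trained model does the heavy lifting; the organisation only has to curate the handful of documents being retrieved, which is why the data burden is so much lighter than training from scratch.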
When we see statements about AI failing due to poor data quality, it’s essential to remember that the term "quality" is incredibly broad. The standards of data required by a medical diagnostic solution differ vastly from those needed for an internal-facing 'expert' chatbot. The success of any AI initiative is highly dependent on the specific context and goals of that initiative, as is the data quality needed to make it successful.
Prompts are just as important as data
In a RAG approach, the prompts you give an AI model can be just as important as the data it processes. The questions you ask, and how you frame them, can significantly influence the outcome, helping the AI model arrive at the answers you need. It’s critical to use your data to generate effective prompts, and your data must be in good enough shape to support this process. This is an area where an expert data consultancy like Amplifi can help, transforming your data into the right prompts to drive the most relevant and accurate responses from RAG solutions.
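One way to see what “good enough shape to support this process” means in practice is to render prompts from a template and validate the underlying record first. The template, field names, and `render_prompt` helper below are hypothetical, sketched only to show the pattern.

```python
# Hypothetical template and field names, for illustration only.
TEMPLATE = (
    "You are a support assistant for {company}.\n"
    "Customer tier: {tier}\n"
    "Question: {question}\n"
    "Answer concisely, using only the policy excerpts provided."
)

REQUIRED_FIELDS = ("company", "tier", "question")


def render_prompt(record: dict) -> str:
    """Refuse to build a prompt from a record that is missing data,
    rather than sending the model an incomplete question."""
    missing = [f for f in REQUIRED_FIELDS if not record.get(f)]
    if missing:
        raise ValueError(f"record is missing fields: {missing}")
    return TEMPLATE.format(**record)
```

A record with a blank `tier` is rejected before it ever reaches the model, so data problems surface as validation errors rather than as vague or wrong answers.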
How Amplifi drives AI success with focused, feasible use cases
When working with clients, our goal is to ensure that AI is not just a buzzword but a practical and impactful tool that drives real business improvements. To achieve this, we guide our clients through a structured approach that begins with focusing on their core business use cases – those that we believe have significant potential to benefit from AI. Our process is broken down into three key stages:
A) Is it a feasible use case?
The first step is to evaluate whether the identified use case can practically be solved with the AI technologies available today. We’ll work closely with your organisation to assess whether the problem you want to solve can genuinely be addressed through AI, and whether the necessary data and infrastructure are in place. This phase is crucial because it avoids wasting time and resources on projects that aren’t a good fit for AI.
B) Do you have the right building blocks?
Once we confirm the feasibility of the use case, the next step is to assess whether you have the foundational elements needed to bring the use case to life. This includes looking at the quality and structure of your data and metadata, identifying any gaps, and deciding whether data enrichment or quality improvements are needed. Additionally, we evaluate your data capabilities to consider whether other processes or tools are needed to support the AI models you are working on today and how to scale these for future use cases.
C) Bringing AI expertise to the table
Finally, we apply our AI expertise to help you design and implement your solutions. From fine-tuning models to ensuring they remain aligned with business goals, we provide guidance on best practices and continuous improvement. Whether it’s through enriching datasets, improving data quality, or helping design processes that maintain data integrity, we ensure that any AI solution is effective, scalable, and sustainable for the long term.
If you’re looking to take the next step with AI, or you want to ensure your data is ready for your specific use case, Amplifi are well positioned to assist you at any stage of your journey. You can reach out to me, or any of our data experts, here. Or, if you’d like to hear more, download our guide 6 expert tips for driving value with AI below!