Generative AI validation

Reasons generative AI is still unreliable + questions to validate outputs

I’m building a new website for Edaith and although I’m using a template with pre-built elements I can choose from, there was something that I wanted to customise, which isn’t easy as a non-developer.

But with the help of generative AI, not being able to code was not a barrier.

I added a design customisation to my website by providing a code snippet to ChatGPT and asking for modifications and additions. Although it took a few tries to get the prompt right, the code it generated worked in the end. Generative AI enabled me to attain a nice-to-have feature despite my lack of technical skills. It’s a case in point for the idea that the technical capabilities and accessibility of generative AI could support more women to work in, or with, technology.

However, given the unreliability of chatbot outputs and the absence of oversight in their development at this stage, I still wouldn’t use generative AI for anything critical without a trustworthy human expert in the loop.

Tina

Generative AI validation

Generative AI products and services aren’t being effectively evaluated for responsible development or quality standards. Validating model appropriateness and generative AI outputs is part of a new skillset needed to work effectively with these technologies as they improve.

National government policies for ensuring the safety of advanced AI systems have included foundation model evaluations as a key method. Foundation models power prominent generative AI applications, such as the GPT models which enable ChatGPT. The EU’s AI Act requires these models to be evaluated for ‘systemic risks’. The UK and US have voluntary commitments from major AI companies to allow evaluations of their foundation models, as do France, Canada, Japan and others.

However, in practice, evaluating the capabilities, risks, performance, behaviour or broader social impacts of AI foundation models is not yet possible. Reasons include a lack of enforceability and common terminology, and the need for a range of associated mechanisms such as codes of practice, incident reporting and real-world monitoring (Jones et al 2024).

But the governance landscape of AI systems doesn’t explain why the latest generative AI products, with the reputations of major technology companies at stake, are still unreliable and untrustworthy. After billions in investment and years of refinement, they are still prone to making up information and failing at logic and reasoning.

Professor Gary Marcus, AI researcher and cognitive scientist at New York University, has been a prominent critic of generative AI progress over the past two years. He sees the development approach, which relies on large language models (LLMs), as fundamentally flawed (Marcus 2024):

My strong intuition, having studied neural networks for over 30 years (they were part of my dissertation) and LLMs since 2019, is that LLMs are simply never going to work reliably, at least not in the general form that so many people last year seemed to be hoping. Perhaps the deepest problem is that LLMs literally can’t sanity-check their own work.
 
Instead, they are simply and truly next-word predictors (or, as I once put it, “autocomplete on steroids”), with no inherent way of verifying whether their predictions are correct. The lack of sanity-checking leads them to flub arithmetic, to make boneheaded errors, to make stuff up, to occasionally defame people, and so on — over and over again, in literally every transformer that has been released, from GPT-2 to GPT-3 to GPT-4 to the latest system SearchGPT.
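
To make the “next-word predictor” point concrete, here is a small, purely illustrative sketch in Python (it is not Marcus’s code, and real LLMs are vastly more complex) of picking the most probable next word from a made-up probability table. Notice that nothing in the loop ever checks whether the finished sentence is actually true.

    # Toy illustration of next-word prediction. The probability table is
    # invented; real LLMs learn probabilities over huge vocabularies, but the
    # principle of choosing the likeliest continuation is the same.
    toy_model = {
        ("the", "capital"): {"of": 0.9, "city": 0.1},
        ("capital", "of"): {"australia": 0.6, "austria": 0.4},
        ("of", "australia"): {"is": 0.95, "has": 0.05},
        ("australia", "is"): {"sydney": 0.7, "canberra": 0.3},  # plausible but wrong
    }

    def generate(prompt_words, steps=4):
        words = list(prompt_words)
        for _ in range(steps):
            context = tuple(words[-2:])
            candidates = toy_model.get(context)
            if not candidates:
                break
            # Greedily pick the most probable next word. No step here verifies
            # whether the resulting statement is correct.
            next_word = max(candidates, key=candidates.get)
            words.append(next_word)
        return " ".join(words)

    print(generate(["the", "capital"]))
    # -> "the capital of australia is sydney" (fluent, confident, and wrong)

The example is contrived, but it mirrors the failure Marcus describes: each word is chosen for statistical plausibility, not for truth.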

Marcus predicts that the field of artificial intelligence will progress with other approaches, such as hybrid AI systems that couple generative AI’s ability to produce outputs quickly with AI systems that can check the generated outputs using logic and reasoning. This is similar to how our minds work, as explained by Nobel prize-winning economist Daniel Kahneman in Thinking, Fast and Slow. A recent breakthrough in AI mathematical problem solving, in which an AI entrant attained silver-medal standard at the International Mathematical Olympiad, was achieved with an approach in that direction, using two separate hybrid AI systems working in conjunction (Marcus 2024).
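
As a rough illustration of that “generate fast, then verify” pattern, here is a hedged sketch in which a stand-in generator proposes candidate answers quickly and a separate rule-based checker accepts only those that pass an explicit test. The generator, checker and arithmetic task are all invented for illustration; real neurosymbolic systems such as the Olympiad entrants are far more sophisticated.

    import random

    def fast_generator(question):
        # Stand-in for a generative model: quickly proposes plausible answers,
        # some of which are wrong.
        a, b = question
        return [a + b, a + b + random.choice([-1, 1]), a * b]

    def symbolic_checker(question, answer):
        # Stand-in for a logic/reasoning component: verifies a candidate
        # against an explicit, checkable rule rather than trusting fluency.
        a, b = question
        return answer == a + b

    def hybrid_solve(question):
        for candidate in fast_generator(question):
            if symbolic_checker(question, candidate):
                return candidate   # only verified answers are returned
        return None                # better to abstain than to guess

    print(hybrid_solve((17, 25)))  # -> 42, because it passed the check

The design choice worth noticing is that speed and correctness are handled by different components, so the generator’s output never has to be trusted on its own.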

Until we have generative AI technologies that can be relied upon, implementing them effectively requires us to develop the skills to understand when and how best to work with generative AI outputs. As part of this skillset, the following questions can serve as a protocol for determining validity and co-constructing knowledge when working with generative AI (Robertson et al 2024):

  1. Do I have a basic understanding of the data used to train the GenAI model?
  2. Is the output comparable to verified statistical patterns?
  3. Can I verify the reasoning or process used to get the output?
  4. Does this confirm what I thought before? If yes, is there an alternative perspective/view that can be considered?
  5. How has my prompt, and subsequent interactions, potentially influenced the GenAI’s response? 
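
For anyone who wants to apply the protocol systematically, one possible way to record the answers is as a simple checklist in code. The structure and field names below are my own illustration rather than part of Robertson et al’s paper, and the output is only a prompt for reflection, not a validated measure.

    from dataclasses import dataclass, fields

    @dataclass
    class GenAIValidationCheck:
        # One record per generative AI output, following the five questions.
        understand_training_data: bool   # Q1: basic understanding of the training data?
        matches_verified_patterns: bool  # Q2: comparable to verified statistical patterns?
        reasoning_verifiable: bool       # Q3: can the reasoning or process be verified?
        alternative_considered: bool     # Q4: if it confirms my view, did I consider alternatives?
        prompt_influence_reviewed: bool  # Q5: have I reflected on how my prompts shaped the response?

        def unresolved(self):
            # List the questions that still need attention.
            return [f.name for f in fields(self) if not getattr(self, f.name)]

    check = GenAIValidationCheck(
        understand_training_data=True,
        matches_verified_patterns=False,
        reasoning_verifiable=False,
        alternative_considered=True,
        prompt_influence_reviewed=True,
    )
    print(check.unresolved())  # -> ['matches_verified_patterns', 'reasoning_verifiable']

Anything left unresolved is a signal to bring in a trustworthy human expert before relying on the output.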


References

Jones, E., Hardalupas, M., Agnew, W. 2024. Under the radar. Examining the evaluation of foundation models. Ada Lovelace Institute.

Marcus, G. 2024. AlphaProof, AlphaGeometry, ChatGPT, and why the future of AI is neurosymbolic. What comes after chatbots? Marcus on AI (Substack).

Robertson, J., Ferreira, C., Botha, E., & Oosthuizen, K. 2024. Game changers: A generative AI prompt protocol to enhance human-AI knowledge co-construction. Business Horizons. In Press, Corrected Proof.

