The big AI story of 2025 was the rise of the reasoning models (e.g. GPT-5.X, Gemini 3.X, Opus 4.X). They have overturned much of the received wisdom about the kinds of tasks that Large Language Models (LLMs) can and can’t do, impressing us all with their ability to write code and solve advanced maths problems. In the AI team at Fletchers, we have also noticed a step change in their ability to solve complex legal problems out of the box.

As a no-win-no-fee law firm, one of the most important questions we can ask about one of our cases is: what is the probability that we will be able to settle this claim? This is important for forecasting revenue, but also for deciding which cases we are going to proceed with and which we will have to close. In this blogpost we discuss some of our recent attempts at trying to answer this question with GPT-5 (Spoiler alert: it doesn’t do a bad job!). Along the way we’ll confront some interesting difficulties, especially around how we might actually implement something like this in practice. We think our experience provides an interesting example of what doing data science could look like in the age of reasoning models.

Some background

In the early part of a clinical negligence case, we have two formal stages, called “risk assessments” where we make a decision about whether we will proceed with the case or not. We are already using AI to help with both of these stages. At the first risk assessment stage, all we have is our client’s account of what has happened, and we run a custom machine learning model to predict whether or not the claim is likely to succeed, and therefore whether we should accept the case. A human is still in the loop and all decisions are manually reviewed, but having the cases pre-sorted helps our legal experts to work more efficiently. At the second risk assessment stage, we have our client’s medical records, often thousands of pages in length, and these need to be reviewed by one of our nurse analysts. Again, we already use AI to help. We have built an AI powered interface called Mermaidd which pre-sorts the medical records and enables the nurse to navigate them more efficiently, to assist with their review.

We are very happy with these tools, but one thing that makes each of these tasks easier is that the inputs for each problem are very clear and well-defined. At the first stage, we are looking at the client’s account. At the second stage, we are looking at the client’s medical records. After this point, our cases progress in a much less predictable way, and the relevant information can be messier and spread through lots of separate documents. The idea of using AI to assess case prospects becomes much harder.

But we are now entering an era where we can give an AI agent access to all of the data from a particular legal case, in a similar format to how it would be presented to one of our lawyers, ask it a generic question, and it will go away and do a decent job of answering that question on a first attempt. We don’t even need to provide legal specific prompting. We can ask it to summarize the current state of a case, and it will begin by looking for information about interim payments and limitation dates, without us needing to introduce these concepts first. So this made us wonder: could we take one of our clinical negligence cases at an arbitrary point in time after the second risk assessment, and just ask GPT-5 how likely it is that the case will succeed? Would this work?

Our Approach

The latest models are so powerful that it can sometimes feel like there is not much left for us to do other than evaluate them. But from our perspective, there is still one big technical challenge on the implementation side: finding the right way to give an AI agent access to our case data. If we asked some of our top lawyers some questions about a legal case, but only gave them a dump of unstructured files and a basic vector-embedding powered search, then they probably wouldn’t be able to do a very good job. This is definitely not how our internal case management system works! And this means that an AI agent probably won’t be able to do a good job either, even if it had superhuman general intelligence. We’ve tried some off-the-shelf legal AI tools and found that they sometimes struggle to understand our cases properly, which we think is because of this problem.

So the challenge for us on the development side is how to build an interface for our AI agent that closely mimics the interface we have for our human lawyers, being careful not to overflow the agent’s “context window” with too much information at once. Without giving away all of our secrets, we think we now have a decent approach to this problem which gives effective results. Again, there has been a step change in the ability of the latest generation of reasoning models to understand and effectively make use of an interface like this.

How do we sample the dataset?

We have lots of data from past clinical negligence cases that we can use to evaluate the performance of GPT-5. But finding the right way to create a sample from this dataset is a more interesting and subtle problem than you might think.

We started out by asking if we could build something for use on an arbitrary open case at an arbitrary point in time. So which cases and which points in time should we pick to evaluate this question? What we would ideally like to do, because it most closely matches how this tool might be used in practice, is the following:

What we would like to do: Sample all cases that were open at some point in time (say 2 years ago) and take the state of their data at that point in time as our dataset. Try to predict case outcome.

But the problem is, some of those cases won’t have concluded yet, which means we don’t know the ground truth, and we can’t evaluate the AI prediction. If we just exclude those cases from our dataset, this could introduce significant bias. Cases which are successful will generally close later than cases which are unsuccessful. This means if a case was at an early stage at the time of our sample, then it would be more likely to make it into our sample if ultimately unsuccessful than if ultimately successful. This then introduces a spurious correlation between “case being at an early stage”, and “case being unsuccessful”.
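To see how big this bias can be, here is a toy simulation (every number here is invented for illustration) in which unsuccessful cases close faster than successful ones, so keeping only cases that have already closed depresses the measured success rate:

```python
import random

random.seed(0)

# Toy model -- all rates and durations invented for illustration.
# Successful cases stay open longer on average than unsuccessful ones.
def simulate_case():
    success = random.random() < 0.5
    mean_years = 3.0 if success else 1.5
    duration = random.expovariate(1 / mean_years)  # years the case stays open
    return success, duration

cases = [simulate_case() for _ in range(100_000)]

# Naive sampling: keep only cases that closed within 2 years of the snapshot,
# i.e. drop everything still open at evaluation time.
closed = [success for success, duration in cases if duration < 2.0]

base_rate = sum(success for success, _ in cases) / len(cases)
biased_rate = sum(closed) / len(closed)
print(f"true success rate:          {base_rate:.2f}")
print(f"success rate in our sample: {biased_rate:.2f}")
```

The biased sample under-represents the (longer-lived) successful cases, so the measured rate comes out noticeably below the true one.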

At this point you might say: so what? We aren’t training a model here that might learn the spurious correlation, we are simply asking GPT-5 what it thinks. It won’t learn from each example (continual learning being one of the main outstanding problems in cutting edge AI research) so there is no issue. But there are at least two reasons that we should still care:

  • As well as simply asking GPT-5 to spit out a probability, we might also want to try asking it to output a set of numbers capturing different factors that might affect case prospects, and then train a simpler machine learning model on those numbers (more on this approach below).
  • Even if we are just asking GPT-5 to spit out the probability, the sampling bias could still affect our evaluation of its performance. For example, if for some reason it was being unfairly harsh on early stage cases, this would look good on our sample, when it shouldn’t.

So what should we do instead? One solution would be to go so far back in time that almost no cases are still open. But the problem there is that you have to go a surprisingly long way back for that to be true (legal cases can last a very long time). Our business, and the quality of our data, has changed a lot in that time. It would be nice to be able to make use of more recent data. Is there a way of doing that?

As with all tricky stats problems, it is helpful to recall Bayes’ theorem.

We want our model to predict: \(P(\text{case success} \mid \text{case data at point in time})\)

But our model will actually learn (or be evaluated against):

\[P(\text{case success} \mid \text{case data at point in time} \cap \text{case time point included in sample})\]

which we can denote: \(P(S \mid X \cap I)\).

By Bayes’ theorem, this equals:

\[P(S \mid X) \frac{P(I \mid X \cap S)}{P(I \mid X)}\]

So to remove bias, we need the probability of inclusion of a given case moment in the sample to be independent of eventual case success (that way the numerator and denominator in the fraction cancel, and we are left with the probability we want).

Here is one approach which achieves this (at least under some assumptions):

  • Take all cases which closed in 2025 (successful and unsuccessful).
  • For each case, randomly pick a duration \(T\) uniformly between 0 and 3 years, and try to sample the case at the point in time \(T\) years after the second risk assessment.
  • If the case lasted \(\geqslant T\) years after the risk assessment, this succeeds, and we include that time point in our sample.
  • If the case lasted \(< T\) years after the risk assessment, then there is nothing for us to sample, and we exclude the case completely.
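A minimal sketch of this sampling rule (the function and field names are hypothetical, and the dates are invented for illustration):

```python
import random
from datetime import date, timedelta

random.seed(42)

def sample_case_moment(second_ra_date, closure_date, max_years=3):
    """Pick a uniform offset in [0, max_years] years after the second risk
    assessment; include that moment only if the case was still open then."""
    offset = timedelta(days=random.uniform(0, max_years * 365))
    moment = second_ra_date + offset
    if moment <= closure_date:
        return moment   # case still open at the sampled moment: include it
    return None         # case had already closed: exclude it entirely

# Hypothetical example: a case assessed in early 2022 that closed mid-2023.
moment = sample_case_moment(date(2022, 1, 10), date(2023, 6, 1))
```

A case that stayed open for one year after assessment is included roughly one time in three, regardless of its outcome, which is exactly the independence the argument above requires.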

The idea here is that each “case moment” from cases closed in 2025 has an equal chance of making it into our sample, independent of whether the case was successful or not.

This strategy is still far from perfect. We will still get some datapoints from a very long time ago. We are also excluding plenty of perfectly good cases from our dataset because of the result of a random number generator. But then sending whole cases to GPT-5 is expensive and slow, so we were never planning to use them all anyway!

The more serious problem is the assumption that cases closed in 2025 are representative, which we expect to be flawed in a number of ways. For example, the business is growing fast, so we expect a higher proportion of short cases in a sample of recently closed cases than in a sample of recently opened cases. This effect might also distort things in interesting ways.

The problem we have here is very similar to the kinds of problems that come up in medical research all of the time, looking at progression of disease or survival times, and there are a whole set of interesting statistical techniques that have been developed in that context. But we are just trying to do some quick tests here, so we have decided to stick with this sampling strategy for the time being!

How do we write our prompt?

When we first started looking at this, we were not expecting GPT-5 to do an amazing job at predicting case outcomes out of the box. Instead, we first compiled a list of all the reasons that our cases were closing unsuccessfully. We did this with input from our lawyers, and from, you guessed it, GPT-5, which we used to pull out and categorize the closure reasons from a set of recently closed cases. We then created a ‘scorecard’ based on these common closure reasons, which was reviewed and improved by our lawyers. Our plan was to ask GPT-5 to fill out this scorecard, to obtain not a single “chance of success” score, but a score against each of the many different reasons a case could fail. Finally, we could train a simple logistic regression model on top of these scores to predict case outcome, using the dataset generated above.

The first amazing result we’ve found is that doing all of the above seems completely unnecessary! The results we present below were produced with a far simpler prompt, that just asks GPT-5 to output the probability:

You are assessing the prospects of success on this UK clinical negligence case.

Use your own judgement. Don't just go with the file-handler's opinion if you disagree.

What is your all things considered assessment of the probability that this case will succeed? Give your answer as a percentage.

The big surprise to us was that this prompt performed no worse than the scorecard + logistic regression approach; in fact, as measured by AUC, it did very slightly better!

(Important caveat: our agents are also given several custom system prompts.)

Initial Results

We sampled 200 cases (3 were excluded due to processing errors, leaving 197) and ran the basic prompt above. Here are the results:

Prediction histogram

This doesn’t look too bad! It is not perfect, but it was never going to be. Predicting what will happen on one of these cases with certainty is impossible. But at the two extremes of GPT likelihood scores, it seems to be right almost all of the time!

We can show this more precisely by plotting a so-called ROC curve, which shows what our true positive and false positive rates would be if we turned our scores into a binary classifier, by picking some decision threshold. That is: we predict that every case above the threshold will succeed, and that every case below the threshold will not succeed. We can reduce the false positive rate by raising the threshold, but this comes with the cost of reducing our true positive rate (proportion of successful cases identified). The ROC curve shows this tradeoff. A classifier that spat out scores independently of case outcome would have a ROC curve along the 45 degree line. GPT-5 is doing a lot better than that! Area under the curve (AUC) on this plot is 0.868.
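The AUC has a direct probabilistic reading: it is the chance that a randomly chosen successful case outscores a randomly chosen unsuccessful one. A self-contained sketch of the computation (pure Python, with invented toy scores rather than our real data):

```python
def auc(scores, outcomes):
    """Area under the ROC curve via the Mann-Whitney U statistic:
    the probability that a random successful case outscores a random
    unsuccessful one (ties count half)."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))

# Invented toy scores: mostly, but not perfectly, ordered by outcome.
scores   = [0.95, 0.80, 0.70, 0.40, 0.30, 0.10]
outcomes = [1,    1,    0,    1,    0,    0]
print(auc(scores, outcomes))  # 8/9 ≈ 0.889
```

This pairwise-comparison form gives the same number as integrating the empirical ROC curve, and makes clear why AUC depends only on the ordering of the scores.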

ROC Curve

The AUC metric is only sensitive to the ordering of the GPT predicted scores. It tests that the successful cases are being given higher scores than the unsuccessful cases. But it does not tell us much about whether these scores correspond to well calibrated probabilities (i.e. that 60% predictions succeed 60% of the time). Depending on the use case, this might not matter, but it is definitely interesting to ask whether GPT’s probabilities are well calibrated as well. We can have a very rough stab at this question by taking GPT-5’s probabilities (between 0 and 1), applying the inverse logistic function to get a ‘logit’ in the range \(-\infty\) to \(+\infty\), and then training a logistic regression classifier on the results. This gives us a re-calibration of GPT-5’s scores to probabilities, which looks like this:
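The recalibration step can be sketched in pure Python. The gradient-descent fit and every number in the example below are our own illustration (a Platt-style scaling), not the production pipeline:

```python
import math

def logit(p, eps=1e-4):
    p = min(max(p, eps), 1 - eps)  # avoid infinite logits at 0 and 1
    return math.log(p / (1 - p))

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def recalibrate(raw_probs, outcomes, steps=20000, lr=0.5):
    """Fit p = sigmoid(w * logit(raw) + b) by gradient descent on the mean
    log loss: a one-feature logistic regression on the raw score's logit."""
    x = [logit(p) for p in raw_probs]
    n = len(x)
    w, b = 1.0, 0.0
    for _ in range(steps):
        gw = gb = 0.0
        for xi, yi in zip(x, outcomes):
            err = sigmoid(w * xi + b) - yi
            gw += err * xi / n
            gb += err / n
        w -= lr * gw
        b -= lr * gb
    return [sigmoid(w * xi + b) for xi in x]

# Invented data: an underconfident scorer whose 60% predictions win 90% of
# the time and whose 40% predictions win only 20% of the time.
raw = [0.6] * 10 + [0.4] * 10
won = [1] * 9 + [0] + [1] * 2 + [0] * 8
calibrated = recalibrate(raw, won)
```

On this toy data the fitted mapping stretches the raw 0.6s up towards 0.9 and the raw 0.4s down towards 0.2, which is the same "underconfidence correction" shape we see in our plot.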

Recalibrated scores

It looks like GPT-5 is underconfident, although it’s fairly impressive how well these two lines match up!

Not so fast: why these results might not be as impressive as they first seem

Whenever you get a strong positive result in data science, the first question you should always ask yourself is: What did I do wrong? Things are almost always more complicated than they first appear.

And the big problem here is that we started out looking at arbitrary points in time during the lifecycle of a case, but there is often a window of time close to closure when the case outcome is very easy to predict. For example, the defendant may admit liability a while before the case actually settles, if the negotiations over quantum take some time. “Predicting” that such a case will be successful before it settles is not very impressive. Conversely, for unsuccessful cases, the file handler may be discussing the closure and drafting the closure letter a few weeks before the closure date on file. When we saw the results above, with the impressive performance on high certainty cases, we quickly realised that the high-certainty predictions might all just come from cases in these categories. If so, the results seem much less impressive.

It is worth saying that even if all the high-certainty cases were in this category, it would not necessarily make these predictions completely useless. Liability admissions are supposed to be recorded in our system, but the other signals we could be picking up on here (like closure letter drafting) are not. So although there is nothing especially impressive about GPT-5’s forecasting abilities in these examples, it could still give us a useful view of our unstructured data that we would not otherwise be able to see (or at least not without checking every single case manually).

Nevertheless, in hindsight, our initial goal of running at arbitrary points in time may not have been the right thing to aim for, as we have now realised it becomes very hard to evaluate the impressiveness of the results. Doing better than chance is not really the best comparison.

Exploring the high certainty closures - not entirely unimpressive

So it is hard to evaluate how impressive we should find these results, but let’s have a go anyway!

When we look at the left hand tail of the histogram above, we find 14 of the 78 unsuccessful cases have a lower score than any successful case. So, at least on this dataset, with the right threshold we could have identified these 14 cases as losing cases without making any mistakes. How far were each of these cases from closure at the time they were sampled? How much time could we have saved?
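The count itself is straightforward to compute; a minimal sketch with invented scores (not our real data):

```python
def cleanly_identified_losers(scores, outcomes):
    """Count unsuccessful cases scoring strictly below every successful case,
    i.e. cases a suitable threshold would flag with no false positives."""
    min_success = min(s for s, y in zip(scores, outcomes) if y == 1)
    return sum(1 for s, y in zip(scores, outcomes) if y == 0 and s < min_success)

# Invented toy scores: two unsuccessful cases sit below every successful one.
scores   = [0.05, 0.10, 0.30, 0.35, 0.60, 0.90]
outcomes = [0,    0,    1,    0,    1,    1]
print(cleanly_identified_losers(scores, outcomes))  # -> 2
```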

It turns out that 6/14 were sampled within 35 days of closure. This is a high proportion, given the point in time was sampled over a period of 3 years, and shows that the problem we were concerned about is definitely happening. But that still leaves 8 that stayed open for well over a month, and 3 that stayed open for 200 days. On one of those 3 cases, we even obtained an additional report from a medical expert, which is an expensive thing to do on a case that is going to lose.

On the face of it, this looks like a potentially valuable use case for this approach. Could we have used GPT-5 to close that case earlier, saving the cost of that extra report?

But we don’t want to use AI to close more cases

As a business, we want to do the best job we can for each of our clients. Using AI to close cases earlier, saving a small amount of cost per case, is not really the kind of exciting AI use case that we had in mind! Interestingly, when a lawyer reviewed the case above, they both agreed with the AI score (the case was very likely to be unsuccessful) and agreed with the decision to proceed with the additional expert report. They felt there was still an additional avenue worth exploring for our client, even if it was unlikely to pay off.

Using AI to keep cases open is very hard to justify with past data - we never see the counterfactual

What we are really excited about is the idea of using AI to keep more cases open. For example, if a file handler wants to close a case, but the AI has not given it an especially low score, maybe this can be escalated and checked more carefully (on top of the reviews that are already in place)?

The big problem is, it is essentially impossible for us to find any examples in past data to justify using AI in this way. When we lose a case, this is almost never because we go to court and lose. It is because our lawyers have judged the prospects of success to be too low to enable us to continue. But this means we never find out what would have happened if we had continued! From a data science perspective, this makes our job really hard. We never see any of the mistakes that we are making in this direction!

What we want to have, is a case that was closed, where the AI gave it a high score, and where it would in fact have been successful had we kept the case open. But we never find out what would have happened if we had kept a case open, so these examples do not exist.

It is helpful to distinguish two probabilities:

  • What is the probability that Fletchers could settle this case if we chose to continue it no matter what?
  • What is the probability that Fletchers, following current processes, will eventually settle this case? (Likely without going to trial)

To inform closure decisions, we would really like to estimate the first, but all our data (including the results presented above) relates to the second.

Refining the use-case - Is there something we can actually use this for?

Our preliminary results suggested that GPT-5 was capable of assessing case prospects out of the box, which is very exciting! But we also had some big outstanding problems:

  • Running at arbitrary points in time makes it hard to know just how impressed we should be. A lot of its predictions were easy.
  • We can build evidence for closing cases based on low AI scores, and this occasionally identifies losing cases far in advance, but as a business we are not too comfortable with using AI in this way.
  • We can’t build evidence for keeping cases open based on AI scores. This is not a problem with our sampling, but a fundamental problem with the way our business works. We never see the counterfactual when we make a decision to close a case.

But we were still excited to do something with these results, so we came up with two ways forward to explore.

1 - Using AI to support closure overturns

When a file handler wants to close a case and this decision is being reviewed by a senior lawyer, can we provide an AI generated summary of the strengths and weaknesses, or even ask it to play Devil’s Advocate, with the aim of encouraging lawyers to overturn closure decisions more often, when appropriate? It is harder to quantitatively measure the value of this use case, but we can certainly work with some of our senior lawyers to evaluate the outputs qualitatively.

We have already received lots of positive feedback from lawyers who have reviewed GPT-5 generated summaries of the strengths and weaknesses of a case. We are excited to explore this approach further!

2 - Using AI at well-defined risk assessment stages in Personal Injury cases

If we abandon our goal of running at arbitrary points in time (which in hindsight may have been misguided) we can go back to evaluating performance at particular risk assessment stages. We do not currently use AI to inform risk assessment decisions on our Personal Injury cases, but maybe with GPT-5 we could start doing so. For example, could we bypass senior review of ‘proceed’ decisions if the AI gives a sufficiently high score, as a time saving measure? This could be particularly valuable for low value cases operating under a ‘fixed costs’ regime, where we may not be able to recover all of our time, even if successful. Using it this way also ensures we are only keeping more cases open, and never closing extra cases on the basis of AI output!

Results

Since our primary goal here is to save time, rather than to improve decision making, we can now simplify a lot of the sampling complications we had before. We just look at a set of cases that went through the relevant risk assessment stage at the same time, and we can compare GPT-5’s score to the final decision made at the time, which is always available, rather than to the final case outcome. Can GPT-5 reproduce the decisions that our lawyers are making?

It turned out that most of the cases which fail this risk assessment stage fail for reasons other than case prospects. If the quantum is too low and the case falls under the “small claims” track, then we are unable to accept it. And if certain procedural steps have not been completed correctly, then the case is sent back to the file handler to rectify, rather than being either proceeded with or rejected.

If we want to use AI to auto-approve file handler ‘proceed’ decisions, we will need to identify all of these things (and we plan to work on this), but for now, let’s exclude all cases which failed to proceed for these reasons, and focus on the remainder. We were left with 237 cases, of which only 12 were rejected (it turns out it is actually very unusual for us to reject cases based on low prospects at this particular risk assessment stage).

The results are below:

Histogram for PI cases at second risk assessment

ROC Curve for PI cases at second risk assessment

These results really are quite impressive. The large vertical jump at a false positive rate of 0.0 means that if we auto-proceeded with any case that had a GPT-5 score above a particular threshold (in this case 66%), then we would save senior review of half of the proceeds without making any mistakes (at least based on this sample), although the total number of rejects is very low here. And these predictions aren’t “easy”: the AI is being asked to carry out the same difficult prospects assessment that we ask of our lawyers, with the same information. It is also worth pointing out that the ‘reject’ decisions with higher AI prospects scores are not necessarily correct anyway! Again, we never see what would have happened had they proceeded.
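The auto-approval idea reduces to a simple threshold rule: set the bar just above the highest-scoring rejected case, and auto-approve every ‘proceed’ above it. A sketch with invented scores and decisions:

```python
def auto_proceed_fraction(scores, decisions):
    """Fraction of human 'proceed' decisions that could be auto-approved by
    thresholding just above the highest-scoring rejected case."""
    threshold = max(s for s, d in zip(scores, decisions) if d == "reject")
    proceeds = [s for s, d in zip(scores, decisions) if d == "proceed"]
    return sum(s > threshold for s in proceeds) / len(proceeds)

# Invented example: the top reject scores 0.66, so four of the five
# human 'proceed' decisions sit safely above the threshold.
scores    = [0.95, 0.90, 0.80, 0.70, 0.66, 0.50, 0.40]
decisions = ["proceed", "proceed", "proceed", "proceed",
             "reject", "proceed", "reject"]
print(auto_proceed_fraction(scores, decisions))  # -> 0.8
```

In practice the threshold would need to be fixed in advance and validated on held-out cases, rather than read off the same sample it is evaluated on.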

There is lots of work we need to do before we can use this in practice: most importantly, testing how well we can use AI to perform quantum assessments and to verify that procedural steps have been completed correctly. The early signs are that it can do a good job at these tasks too. We are confident that we can use AI to extract real value here; the challenge is finding out how to do this in the most effective way.

Conclusion

As a large law firm, we hold huge amounts of data, containing all sorts of valuable insights. But unfortunately, most of this data is unstructured, which makes it extremely challenging to understand at scale. For the last few years, we have tried using LLMs to make sense of this data in various ways, with some success. But this has needed a lot of careful prompting, guardrails, and generally a new bespoke pipeline to be built for each new task. Now, we are having to quickly get used to a new world where we can just point a generic agent at our data and ask it questions, and this will work! This has transformed our ability to unlock insights from our data at scale, and we feel like we are just beginning to scratch the surface of the potential applications.