Solving the AI slop problem
Some thoughts on "Deep research" mode on OpenAI/Grok/Perplexity/Gemini
I heard a great term this last week: “AI slop”. Here was the context:
“Did you read the proposal?”
“Yes, it was 100% AI slop. Couldn’t make heads or tails of it.”
In the last two years, I have seen a lot of AI slop, and I'm sure everyone reading this has too. In fact, if you're a stock picker, I'm sure in your history of Google searching you've read a truckload of articles from Benzinga, Yahoo Finance, more recently Stock Titan, and many others where it's clear programmatic rules or an LLM were used to author the article. You get a few hundred words summarizing press releases or quarterly earnings that amount to rehashing numbers with colorful verbs / adverbs ("significantly beat estimates") and adjectives ("enormous increase in earnings"). You get zero useful analysis.
ChatGPT / Perplexity / Gemini / Grok / others have lessened the slop a bit (and yes, sometimes you can get real value), but too often I end up reading something like the below:
While this isn't a bad start, probing for specifics often gets you nowhere:
Give me the last ten years of EPS, dollars spent on buybacks and shares outstanding at year end. For dollars spent on buybacks, include weighted average buy price and shares purchased, which should sum to the repurchase total for the year.
The first response tells me how to read a 10-K and the second gives me a data table missing nearly all of the useful data. Having actually read the 10-K, the second response already looks questionable, as much of the "Data not available" data clearly is available:
I understand prompting matters here, and a better prompt that tells the LLM to only use SEC filings to construct the table will get you further. That's exactly what I did, and to my disappointment I still got "Data not available" - FWIW, MGM has buybacks you can find in under 5 minutes going back years - see my post here if you're interested.
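It's worth noting how scriptable this data actually is. Below is a minimal sketch of pulling the buyback line straight from SEC EDGAR's XBRL API - the us-gaap tag (PaymentsForRepurchaseOfCommonStock) and MGM's CIK are my assumptions to verify, not something any of the models produced:

```python
# Sketch: pull MGM's reported buyback dollars from SEC EDGAR's XBRL API.
# Assumptions to verify: MGM's CIK (0000789570) and that MGM tags its
# cash-flow buyback line as us-gaap:PaymentsForRepurchaseOfCommonStock.
import requests

CIK = "0000789570"  # MGM Resorts International (verify on sec.gov)
URL = (
    "https://data.sec.gov/api/xbrl/companyconcept/"
    f"CIK{CIK}/us-gaap/PaymentsForRepurchaseOfCommonStock.json"
)

# The SEC requires a descriptive User-Agent with contact info.
resp = requests.get(URL, headers={"User-Agent": "research you@example.com"})
resp.raise_for_status()

# Keep one figure per fiscal year, taken from 10-K facts; later filings
# (which restate prior-year comparatives) overwrite earlier ones.
by_year = {}
for fact in resp.json()["units"]["USD"]:
    if fact.get("form") == "10-K":
        by_year[fact["end"][:4]] = fact["val"]

for year in sorted(by_year):
    print(year, f"${by_year[year] / 1e6:,.1f}m spent on buybacks")
```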
After some Googling, I found a lot of advice suggesting to use Deep Research, so I tried the same query there. Grok gave up after 29 minutes on this, and ChatGPT (I'm paying for Pro) still could not find many years of data. It is still pretty cool to see the thinking process - from OpenAI:
I just tried Gemini with:
For MGM, give me the last ten years of EPS, dollars spent on buybacks and shares outstanding at year end. For dollars spent on buybacks, include weighted average buy price and shares purchased, that should sum to the repurchase total. Use only SEC filings and include a column that sources the URL you used for the SEC filing that had the info.
Here is the plan Gemini gives back, which seems pretty reasonable:
And here is what I got back after 5 minutes; Gemini did the best here, but it’s still wrong:
Let's take 2018, the first year it can produce data (the company did buy back $327.5m of stock in 2017, although Gemini is right that buybacks were $0 in 2016 and 2015). Gemini is off by almost a factor of 10 here - the company bought back $1.28bn that year, not $138.9m. On my end, I'm simply looking at "Purchase of common stock" on the cash flow statement, so I told Gemini exactly that:
Only look at "Purchase of common stock" on the cash flow statement. For example, for 2018, this should be $1.28bn if you use this technique. Remake this table doing this.
It (kind of) corrected the one mistake I called out but still failed to get everything right:
And even the correction is off - this is me just Ctrl+F-ing for "Purchase of common stock" in the 2018 10-K - you can see it should get $1,283.3m (and $327.5m for the prior year, 2017):
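One thing you can always do, whichever model produced the table: the prompt itself defines an arithmetic identity (weighted average buy price × shares purchased should equal dollars spent), so any returned table can be machine-checked. A quick sketch with made-up placeholder rows, not real MGM figures:

```python
# Sanity-check a model-generated buyback table: for each year,
# weighted-average price * shares purchased should equal dollars spent.
# These rows are illustrative placeholders, not real MGM data.
rows = [
    # (year, dollars_spent, shares_purchased, weighted_avg_price)
    (2018, 1_283_300_000, 41_400_000, 31.00),
    (2019, 1_000_000_000, 35_000_000, 30.00),
]

for year, dollars, shares, avg_price in rows:
    implied = shares * avg_price
    # Allow ~1% slack for rounding in the filing or in the model's table.
    if abs(implied - dollars) / dollars > 0.01:
        print(f"{year}: FAIL - {shares:,} x ${avg_price} = ${implied:,.0f}, "
              f"but the table says ${dollars:,.0f}")
    else:
        print(f"{year}: OK")
```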
I do think Deep Research produces significantly better results for stock research than even the best prompts with the mode turned off. The advantage of Deep Research is that you at least see the source data and can confirm sec.gov is being used across the board (although is that data actually being used?).
I messed around with some other prompts yesterday for preferred stock and baby bond opportunities and found that, at a bare minimum, Deep Research is great for discovering new securities that meet specific criteria (e.g., over 8% coupon and trading under par where the common stock is being bought back). I don't mind that it sources data from Seeking Alpha and Motley Fool. There is often cogent analysis there (well, sometimes :-)) that I wouldn't have found on my own. I wish it used Substack more, but presumably there are a lot of Do Not Crawl / Block AI tags across that content. My big takeaways over the last few days on AI for stock research:
You need to be extremely specific with prompts: ask for cited sources, and if there's important accounting, tell the model where to find the data (see the sketch after this list).
You are going to have better luck with Deep Research, because my view is the LLMs need time to find the right data and actually look at it. If you expect sub-10-second responses, it's going to use data you don't want.
Even if you get as ultra-specific as you can with prompts, results are mixed. Vet everything.
At this point in time, all the big players gate Deep Research queries to a few per month. So use them wisely.
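To make the first takeaway concrete, here is roughly how I'd template those constraints if I were driving a model through an API instead of the chat box. The SDK usage and model name are assumptions (standard OpenAI Python client; swap in whatever you actually use) - the prompt scaffold is the point:

```python
# Sketch: bake "be specific, cite sources, say where to look" into a
# reusable prompt template. Model name and client usage are assumptions;
# the constraints in the template mirror the ones used in this post.
from openai import OpenAI

TEMPLATE = """For {ticker}, give me the last ten years of {metrics}.
Rules:
- Use ONLY SEC filings on sec.gov as sources.
- For each figure, cite the exact SEC filing URL it came from.
- Buyback dollars must come from "Purchase of common stock" on the
  consolidated statement of cash flows.
- If a figure is missing, name the filing you checked and explain why;
  do not substitute third-party data.
"""

prompt = TEMPLATE.format(
    ticker="MGM",
    metrics="EPS, dollars spent on buybacks, and year-end shares outstanding",
)

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
resp = client.chat.completions.create(
    model="gpt-4o",  # placeholder; swap for your model of choice
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)
```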
Deep Research can clearly string internet searches together. That is in some ways not so different from my own process: one search leads me to another, I read the docs, I aggregate the information and form an opinion. I am convinced that over a long enough time span AI will get this table right; the data is clearly there and it clearly has access to it - every 10-K Gemini cited is the correct 10-K for that year for MGM.
On Gemini vs. OpenAI vs. Grok vs. Perplexity: I've now used all four for a variety of tasks and don't see a clear winner. I also see no relationship between time spent on a task and quality beyond the first few minutes. Gemini takes a lot less time than the others and seems to produce results as good as or better than theirs in the MGM scenario.
So overall, I think AI slop can be partially fixed by a good prompter who goes over the responses with a fine-tooth comb. However, even the best prompts, ones that restrict sources to primary source data, clearly still produce slop.
For those of you thinking "Ben is clearly down on AI" - I'm not. I use Cursor pretty much every day for coding work and I love the "Big 4" above as a starting point for research. My point here is that "slop" is a problem and it's not always possible to prompt your way to a solution. We are in the early innings here and this could easily change by next year.
Last thing - it also became more evident to me working on this buyback table that prompts matter a whole lot. A longer prompt that spells out acceptable research methods and asks for more "guarantees of correctness" is going to do better than a broad one. Asking for explanations of why something is correct helps. So, as is the case for many things in life, knowing how to use the tools effectively often matters more than the tools themselves.
Coding, I think, is a much better use case for AI right now than financial analysis. With code, you can write unit and integration tests, and there are millions of examples in other codebases that are probably 90%+ similar to your desired solution for AI to train on. When I prompt Cursor well, it generally gets me what I want, whereas I have yet to see the Big 4 get "build me a basic buyback table and only use SEC filings" right. For code, I'm convinced at this point it's a necessary everyday tool that enhances productivity by an order of magnitude for some use cases.
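And that testability carries over: the buyback identity from earlier in this post drops straight into a test runner. A small pytest-style sketch, with the table rows again being placeholders rather than verified MGM data:

```python
# Sketch: the buyback identity as a pytest test. build_buyback_table() is a
# hypothetical stand-in for whatever (model or script) produced the table;
# the assertion encodes the rule from the prompt.
import pytest

def build_buyback_table():
    # Placeholder rows: (year, dollars_spent, shares_purchased, avg_price).
    return [(2018, 1_283_300_000, 41_400_000, 31.00)]

@pytest.mark.parametrize("year,dollars,shares,avg_price", build_buyback_table())
def test_avg_price_times_shares_matches_total(year, dollars, shares, avg_price):
    assert shares * avg_price == pytest.approx(dollars, rel=0.01)
```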