This is the second week of my course summaries from teaching AI in Finance at NYU Stern (lecture slides here; first week’s summary here). Last week I outlined three core principles for evaluating AI in finance: turn insight into action, fix the slow part, and follow the price.

This week we apply this to financial document intelligence. The main issue we wrestle with is that LLMs are great at processing text, but are inherently unreliable at telling the truth.

Documents All the Way Down

Finance is sometimes seen as mathematical in nature, but really it’s about documents. 10-Ks, earnings calls, prospectuses, credit agreements, regulatory documents and filings: the raw material financial analysts have to work with is text. A lot of text. Historically, turning this text into actionable insights has been hard and rate limited by human processing capacity.

This makes financial document intelligence the most immediate and obvious AI use case, and one which has already seen a lot of adoption. The workflow for anything in this space has a few basic steps. First you extract and reconstruct a document base (for an enterprise use case, this may consist of your own proprietary documents). Then you do semantic understanding, which maps the underlying document base into stable concepts.

Next comes the retrieval and reasoning part. This is typically done through something called Retrieval Augmented Generation (RAG). RAG is, at the moment, one of the main enterprise AI use cases. The value of all of this is unlocked by automating actions: triggering an investment review, flagging a covenant violation, or surfacing a fraud risk. For finance applications especially, there is also ideally a reliability and governance layer on top of this to track the provenance of claims and control deployment.

Customizing AI

To really build on these capabilities, we typically need to do some amount of customization. Pre-training entails building from scratch, and is now the domain of the major AI labs. BloombergGPT was a notable exception here, discussed in class as a case, and since then the industry has typically moved on to customizing general-purpose models rather than training domain-specific ones. The pace of advances of the frontier models has generally outpaced the benefits of training a model in a particular area.

That leaves fine-tuning a model, tailoring it for a specific task where you need a consistent style or where the base model lacks domain expertise. Or prompt engineering, the main AI customization tool used today. Prompting used to be a really important tool for unlocking LLM capabilities, but it has been growing less important over time. At this point in the cycle, the most important job is to provide the right context to the model for solving your problem.

The basic reason this works is that foundation models today already have a lot of financial knowledge baked into them, so most work in practice is about activating and directing that knowledge. This paper by Lopez-Lira and Tang shows how a basic prompt asking GPT-4 to classify headlines as good or bad news works well enough to capture about 90% of initial market reactions.
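A sketch of that setup helps make it concrete: build a prompt asking the model whether a headline is good or bad news, then map its answer to a score. The prompt wording below is illustrative rather than the paper's exact text, and the model call is replaced with a stand-in reply.

```python
# Headline-classification sketch in the spirit of Lopez-Lira & Tang.
# The prompt is a paraphrase, not the paper's exact wording, and the
# LLM call is mocked with a hard-coded reply for illustration.

def build_prompt(headline: str, firm: str) -> str:
    return (
        f"You are a financial expert. Is this headline good news, bad news, "
        f"or unknown for the stock price of {firm}? Answer YES, NO, or "
        f"UNKNOWN on the first line, then explain briefly.\n"
        f"Headline: {headline}"
    )

def parse_answer(reply: str) -> int:
    """Map the model's first word to a score: +1 good, -1 bad, 0 unknown."""
    first = reply.strip().split()[0].upper().rstrip(".:,")
    return {"YES": 1, "NO": -1}.get(first, 0)

# Example usage with a stand-in model reply:
prompt = build_prompt("Oracle beats quarterly earnings estimates", "Oracle")
score = parse_answer("YES. Beating estimates is typically positive news.")
print(score)  # 1
```

Scores like these can then be aggregated into a daily signal per firm, which is how the paper relates model output to subsequent returns.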

The Jagged Frontier

One of the critical features of AI capabilities is that they are uneven. Dell’Acqua et al. ran a classic field experiment with BCG consultants which introduced the evocative term “jagged frontier” to characterize AI abilities. As you’ve probably seen if you’ve played around with AI, it does really well at some tasks while totally missing on others.

The Jagged Frontier from Ethan Mollick

This has important implications for how humans use AI. The team found that consultants working on tasks inside the AI frontier saw large improvements in quality, with the gains strongest for bottom-half performers (i.e., AI levels performance, which is what most of this literature finds). For tasks that fell outside the frontier, AI users did worse than the control group.

Even when we are working with high-quality AI tools, we face important problems in deployment based on how humans interact with them. The Dell’Acqua “falling asleep at the wheel” paper suggests recruiters collaborating with higher-quality AI tools were less accurate than those with lower-quality AI, because they stopped engaging critically. Shen and Tamkin argue that developers using AI assistance complete tasks faster (though not that much faster), while scoring substantially worse on comprehension quizzes.

All of this suggests a few modes of interaction with AI. We have the distinction between “centaur” and “cyborg” forms of interactions. “Centaurs” divide tasks between human and machine. It’s important here for the human to set the context and let the AI execute with clear boundaries for what each side contributes. The “cyborg” model instead blends human and machine work in an iterative fashion, going back and forth. The basic challenge is that AI is going to be most valuable in places where we are already strong enough to spot mistakes, and risks harming our own learning when it skips a necessary cognitive struggle.

Ultimately, what we need to realize as a society is that mass deployment of AI has the capacity to dull the mind as much as processed foods and easy transportation led to an obesity epidemic. We’ll need to figure out the right heuristics, the mental equivalent of working out, to ensure cognitive discipline in a world of AI slop.

Truth and Fiction

This brings us to the core problem with AI deployment: hallucinations, and the challenge of separating truth from fiction.

Ultimately this gets to a philosophical question: do the scaling properties of AI development converge on a shared view of reality, or are the current problems with hallucinations going to be with us for a while because these tools are fundamentally stochastic parrots, doomed to remix their training dataset without converging on anything we would call true understanding?

This leads to one of the deeper lines of thought I’ve seen in the AI space: the Platonic Representation Hypothesis, which argues that models trained on different data and methods ultimately converge to a shared statistical model of reality which corresponds to the real world. There is some recent work which argues this is the pattern we see across scientific models in various domains.

The School of Athens by Raphael, with Plato at the center and most figures ambiguous, identifiable only through subtle allusions and contextual associations

It could be that further scaling is all we need to get accurate world models; or it could be that Yann LeCun is right and we need a radically different model architecture. For the time being, however, we are stuck with hallucinations and errors in using AI. Benchmarks on long-context retrieval show steep performance degradation as we expand the context, as is necessary to access a broad document corpus. When do these issues impact the ways we use AI?

My dean Bharat Anand (and his co-author Andy Wu) have a nice framework for analyzing this problem across two dimensions: whether the data is tacit or explicit, and whether the cost of errors is high or low. AI does best when the cost of errors is low and the data is pretty explicit. With a higher cost of errors, but still pretty explicit data, AI is best at producing work for human verification. If the cost of errors is low but tacit knowledge is high, it’s best used as a creative catalyst, with humans still selecting the final option. Humans shine most when the cost of errors is high and tacit information is also high. Of course, one long-term challenge for this framework is that our ability to write down tacit information and context so AI can access it appears to be growing over time.

I think here of the fictional essay by Scott Alexander which imagines the limits of the burden of knowledge. Scott describes a world in which scientists, at the limit, spend their whole lives just discovering the limits of knowledge, spending only a brief moment to expand those limits further. Does AI change this, as an ever-living entity unburdened by knowledge, and able to push ever further?

Grounding AI in Reality: RAG

I think so, one day. But we are a long way from that world, and our practical challenge is a lot more mundane. How do we actually solve the hallucination problem in practice, and get AI to give us accurate answers on a specific set of documents without making things up?

As alluded to above, the main solution to this today is Retrieval-Augmented Generation (RAG). The basic idea here is that LLMs working in document retrieval face a challenge of trying to find a needle in a haystack. So instead of letting the model perform next-token prediction using its entire training data, we are going to restrict the search space to a pre-specified corpus of documents and have the model generate answers grounded in specific references to that corpus.

The basic RAG pipeline works in two parts. First there is indexing, which entails collecting documents, processing and cleaning the output, and “chunking” them into manageable pieces for future retrieval. Then there is retrieval itself: when the user submits a question, you retrieve the most relevant chunks, pass the question and retrieved context to the LLM, and generate an answer grounded in those references.
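The two parts can be sketched in a few lines. This is a toy version: a real system would use a neural embedding model and a vector database, but here a bag-of-words vector stands in for the embedding so the mechanics are visible.

```python
import math
from collections import Counter

# Minimal RAG sketch: index a corpus as chunks, then retrieve the chunks
# closest to a question and pack them into a grounded prompt.
# Bag-of-words cosine similarity stands in for a neural embedding model.

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def chunk(document: str, size: int = 40) -> list[str]:
    """Indexing step: split a document into fixed-size word chunks."""
    words = document.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Retrieval step: rank chunks by similarity to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def grounded_prompt(question: str, chunks: list[str]) -> str:
    """Pass question plus retrieved context to the LLM for a grounded answer."""
    context = "\n---\n".join(retrieve(question, chunks))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

The key design choice is that generation is restricted to retrieved context, which is what lets answers carry references back to specific passages.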

An example of this is from my own research with Alex Bartik and Dan Milo. Our objective was to measure housing regulations at scale across thousands of US municipalities. This is exactly the kind of task which was infeasible at scale before LLMs, because you would need large teams of researchers reading each code individually.

Example of RAG pipeline, from Generative Regulatory Measurement

We built a RAG pipeline to process municipal codes, splitting each ordinance based on its own hierarchical structure. One thing we see here is that chunks of text within the same article tend to cluster together in embedding space, which suggests that geometric distance follows contextual meaning. This means the embedding model also places the questions we want to ask near the parts of the text relevant to answering them. This in turn enables drastically higher accuracy, at lower cost, when answering these regulatory questions at scale.

UMAP two-dimensional representation of the zoning code for Arlington, MA

The final accuracy we get to here is about 96% for binary questions and a 0.87 correlation on continuous measures, using the AI methods available at the time. Is that good or bad? For a developer, that’s probably not high enough that you can skip the final verification step. This maybe puts AI in the category where you follow up to verify the output. But for a researcher, this is great: we can take the data and go ahead and run regressions, since we naturally have statistical processes that tolerate error.

Where this is Going

The takeaway here is that financial document intelligence, especially through RAG, is already useful for text-heavy workflows: research synthesis, compliance review, client communications, document Q&A, etc. Tasks that entail working with the AI to draft reports, navigate documents, and explore data are “within the frontier” of AI capabilities. Whereas fully automated decisions which entail high-stakes autonomy, responding to a changing environment, and real-time trading decisions are still (for now) outside that domain.

Should that change your workflow? This brings us back to Amdahl’s law, which we discussed last time. If the bottleneck in your workflow is document processing itself (or you can adopt a new workflow built around document processing), then RAG-based tools can meaningfully help. If the bottleneck is elsewhere, then faster document processing may add costs and errors without solving your actual problem.
