This is the fourth installment of my course summaries from teaching AI in Finance at NYU Stern (lecture slides here; previous summaries for weeks one, two, and three). This week focuses on market intelligence: the process of turning unstructured information into actionable investment decisions.

AI and LLMs are disrupting this sector by processing text at a scale and speed that fundamentally shift the core economics of business analysis. Previously, this was a labor-intensive process bottlenecked by the speed of human reading. Now, some of the core analytic functions have become commodified thanks to the rapid pace of AI advances. At the same time, faster and cheaper information doesn't always help people make better investment decisions if the bottleneck shifts elsewhere. AI also enables completely new intelligence functions, in particular in silico agent simulation. But are these information tools accurate?

So the key questions this week are: what is going on with the quality of the information we summarize or simulate, and does it help us make better decisions? And even bigger picture: where does the alpha go if everyone has access to AI tools?

The Arms Race in Textual Analysis

The history of text analysis in finance is a good illustration of the "bitter lesson" of scale economies combined with the "follow the price" principle from Session 1. Each generation of tools commodifies one layer of analysis, pushing the alpha, or edge, further up the complexity stack.

The first generation was simple dictionary-based sentiment analysis. Tetlock's classic 2007 paper counted words in one WSJ column using the Harvard psychosocial dictionary, estimated a simple pessimism factor, and showed it predicted Dow Jones returns. This was a big advance at the time, even though it built on a pretty simple measure. As we discussed back in Session 1, further advances developed finance-specific dictionaries (Loughran and McDonald) and chained together word combinations via n-grams and bag-of-words models.
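To make the mechanics concrete, here is a minimal sketch of dictionary-based scoring in Python. The word lists are toy stand-ins, not the actual Harvard or Loughran-McDonald dictionaries, and the score is just negative-minus-positive word share.

```python
import re

# Toy word lists standing in for the Harvard / Loughran-McDonald dictionaries
NEGATIVE = {"loss", "decline", "weak", "litigation", "impairment"}
POSITIVE = {"gain", "growth", "strong", "record", "improve"}

def pessimism_score(text: str) -> float:
    """Negative-minus-positive word share, in the spirit of simple dictionary sentiment."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    neg = sum(t in NEGATIVE for t in tokens)
    pos = sum(t in POSITIVE for t in tokens)
    return (neg - pos) / len(tokens)

print(pessimism_score("Weak demand and a decline in margins"))  # ~0.29
```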

Then we get to LLMs. Lopez-Lira and Tang showed that GPT-4 can classify news headlines for stock market impact with pretty high accuracy (capturing 90% of the hit rate for the initial reaction). The really interesting result, though, was that the Sharpe ratio of the LLM classification trading strategy steadily declined over time alongside rising LLM adoption. The information edge from reading headlines was apparently real, but it got competed away and is now largely priced in.
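As a rough illustration of what headline classification looks like in practice, here is a minimal sketch assuming the OpenAI Python client; the prompt paraphrases the spirit of this setup rather than quoting the paper, and the model name is a placeholder.

```python
from openai import OpenAI  # assumes the openai package and an OPENAI_API_KEY in the environment

client = OpenAI()

PROMPT = (
    "You are a financial expert with stock recommendation experience. "
    "Answer GOOD if this headline is good news for the stock price of {company} "
    "in the short term, BAD if it is bad news, or UNKNOWN if uncertain.\n"
    "Headline: {headline}"
)

def classify_headline(company: str, headline: str) -> str:
    """Ask the model whether a headline is good, bad, or unknown news for the stock."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name; swap for whatever you use
        messages=[{"role": "user", "content": PROMPT.format(company=company, headline=headline)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

# classify_headline("Acme Corp", "Acme Corp beats earnings estimates and raises guidance")  -> "GOOD"
```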

As that initial reading layer is commodified, we can move up the complexity value chain to other analyses opened up by LLMs. Leland Bybee takes this in a really interesting direction by using LLMs to generate economic beliefs or expectations from historical headlines, allowing him to build 120 years' worth of economic sentiment data from newspapers. This also starts to move us in the direction of "representing" economic beliefs through their textual breadcrumbs.

From Bybee.

Another step up the complexity ladder is Hansen and Kazinnik, who show that ChatGPT can decipher "Fedspeak," the deliberately ambiguous or obfuscated language the Federal Reserve uses in its monetary policy communications. You see some interesting advances across models in this textual analysis: GPT-3 tends to analyze the literal text of the statement itself, while GPT-4 is able to correctly map some of the subtext of statements as well, with reasoning that lines up with that of a human analyst. The emerging model capacities here allow AI to escalate all the way to complex narrative interpretation (which starts to replicate sophisticated Romer-Romer-style readings of complex texts).

As an interlude, I also have to mention Joe Weisenthal's vibe-coded tool, Fedlock. Rather than direct classification, this tool runs an Elo-style competition in which pairs of Federal Reserve statements are evaluated by an LLM judge as to which statement is more dovish or hawkish, which produces a complete ranking of all statements. So we have two application types of LLMs: direct classification and pairwise Elo comparisons, which can be useful in different contexts (the Elo approach seems to produce a more continuous and smooth ranking, while direct classification has some tendency to produce extreme outcomes).
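Here is a minimal sketch of how pairwise LLM judgments could be folded into Elo-style dovishness ratings; `llm_judge_more_dovish` is a stubbed placeholder for whatever model call you would actually make, and the update rule is just the standard Elo formula.

```python
import itertools
import random

def update_elo(r_a: float, r_b: float, a_wins: bool, k: float = 32.0) -> tuple[float, float]:
    """Standard Elo update; the 'winner' is the statement judged more dovish."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    score_a = 1.0 if a_wins else 0.0
    return r_a + k * (score_a - expected_a), r_b + k * ((1.0 - score_a) - (1.0 - expected_a))

def llm_judge_more_dovish(stmt_a: str, stmt_b: str) -> bool:
    """Placeholder for an LLM-judge call; returns True if stmt_a reads as more dovish."""
    return random.random() < 0.5  # stub so the sketch runs end to end

def rank_statements(statements: dict[str, str], n_rounds: int = 5) -> dict[str, float]:
    """Run several rounds of all pairwise comparisons and return statements sorted by rating."""
    ratings = {name: 1000.0 for name in statements}
    pairs = list(itertools.combinations(statements, 2))
    for _ in range(n_rounds):
        random.shuffle(pairs)
        for a, b in pairs:
            a_wins = llm_judge_more_dovish(statements[a], statements[b])
            ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], a_wins)
    return dict(sorted(ratings.items(), key=lambda kv: kv[1], reverse=True))
```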

From here we can move on to analyzing equity analysts' valuations, in work by Bastianello, Decaire, and Guenzel. They draw on 2.1 million equity analyst reports and use LLM tools to diagnose the topics and mental models analysts use in their representations of firm valuation (a follow-up paper compares accuracy and complexity across DCF and multiples-based valuation models).

At the current frontier of AI representations, we have Suproteem Sarkar, who generates economic representations of entire firms. To do this, he estimates vector embeddings of firms from financial news discussions, quantifying the economic features and themes of each firm's coverage. Doing so requires solving an important look-ahead problem: the LLM has future knowledge, which potentially contaminates its assessment of historical data. There are two general ways to address this: "masking" historical data by swapping out names to de-contextualize the LLM, or training models on historical data alone, which is what Sarkar does here.
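As a toy illustration of the masking approach (not what Sarkar does, since he trains on historical data alone), here is a minimal sketch that swaps firm names for neutral placeholders before the text is scored or embedded.

```python
import re

def mask_entities(text: str, entities: list[str]) -> tuple[str, dict[str, str]]:
    """Replace firm names with neutral placeholders so a modern model cannot lean on
    what it already 'knows' about the company when reading historical text."""
    mapping = {}
    masked = text
    for i, name in enumerate(entities, start=1):
        placeholder = f"FIRM_{i}"
        mapping[placeholder] = name
        masked = re.sub(re.escape(name), placeholder, masked, flags=re.IGNORECASE)
    return masked, mapping

masked, mapping = mask_entities(
    "Apple's new handheld device could reshape the mobile market, analysts said.",
    ["Apple"],
)
print(masked)   # FIRM_1's new handheld device could reshape the mobile market, analysts said.
print(mapping)  # {'FIRM_1': 'Apple'}
```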

The output from all of this is a "map" or representation of firms that captures interesting relationships between them. Cross-sectional comparisons show that firms which are "similar" in this embedding space also tend to see similar stock price co-movement. Shifts in the embedding representation also predict stock market changes, suggesting that part of the variation in valuation comes from shifts in market perception of the firm's representation (including possible misperceptions, such as periods when attention-grabbing themes like "the internet" in the 90s or "AI" in the 2020s dominate the embedding location).
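The basic check described here is easy to sketch: compare the cosine similarity of two firms' embeddings with the correlation of their returns. The embeddings and return series below are made-up toy inputs, just to show the shape of the comparison.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy inputs: firm embeddings (however estimated) and short daily return series
emb = {"FIRM_A": np.array([0.9, 0.1, 0.3]),
       "FIRM_B": np.array([0.8, 0.2, 0.4]),
       "FIRM_C": np.array([0.1, 0.9, 0.2])}
rets = {"FIRM_A": np.array([0.010, -0.020, 0.015, 0.003]),
        "FIRM_B": np.array([0.012, -0.018, 0.011, 0.001]),
        "FIRM_C": np.array([-0.005, 0.020, -0.010, 0.004])}

for a, b in [("FIRM_A", "FIRM_B"), ("FIRM_A", "FIRM_C")]:
    sim = cosine_similarity(emb[a], emb[b])
    corr = float(np.corrcoef(rets[a], rets[b])[0, 1])
    print(f"{a} vs {b}: embedding similarity {sim:.2f}, return correlation {corr:.2f}")
```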

Beyond textual data, there is a whole alternative data ecosystem which represents another dimension of how AI has disrupted market intelligence. This includes everything from satellite imagery and web traffic to app downloads and credit card spending data.

The common pattern here is that every stage of LLM scaling speeds up one layer of textual analysis. Initially, this produces some alpha for people with access to the analysis tool, while also unlocking new layers of complexity for people to analyze. However, that next layer of output ultimately becomes the raw material for the subsequent LLM deployment, and so we get an escalating arms race of textual and market intelligence analysis.

Attention Reallocation

A deeper question here is what all of this data does to the structure of forecasting itself. A big challenge is that, for all the virtues of "big data," we will inevitably have more information about the cross-section than the time series (i.e., we can get a lot of information about conditions all over the country today, but are limited in our ability to analyze history from before we were collecting all of this alternative data).

Dessaint, Foucault, and Fresard analyze some of the implications of this bias for forecasting horizons. Their analysis suggests that analysts have grown more accurate at shorter horizons, where all of the alternative data helps generate accurate near-term forecasts, but have actually gotten worse at long-term forecasts.

We see similar trends in attention across stocks. Farboodi, Matray, Veldkamp, and Venkateswaran find that the improvements in forecasting accuracy are concentrated in large growth stocks. These are stocks whose value grew dramatically from the value of data and other intangible assets, and the market is increasingly focused on trying to understand the most valuable data. This leaves out small and value stocks, whose pricing is actually stagnating or falling behind.

An even broader version of this attention reallocation is discussed in this paper by Hao, Xu, Li, and Evans in Nature. They find that AI adoption in science expands the scope of individual scientists' impact but reduces the collective breadth of scientific focus. In the long run, AI advances might eventually extend our reach to currently data-poor environments, but in the short run they seem to contract our focus.

The implication for finance is that while AI might get very good at reading 10-Ks and earnings calls, and dilute the alpha from that information commons, there are likely to remain pockets of alpha from being able to mine hard-to-access information.

We see a version of this phenomenon in Kim, Muhn, and Nikolaev, who find that humans outcompete machine analysts in domains involving institutional knowledge, such as intangible assets or financial distress, but lose in domains characterized by broad information. This is consistent with many other results we look at in this course. The "bitter lesson" applies to domains with voluminous data that can be effectively mined for clear insights. But interpretation, judgement, and the ability to act on private or tacit information remain valuable, at least for now.

Simulating Your Customers

The discussion so far is about AI advancing its capacity to replicate or replace part of what humans typically do. But many of the most exciting AI applications are about using LLMs to do completely new things. One domain where this has really grown is the idea of having AI simulate economic agents. This idea has a few different intellectual grandfathers, but a very influential strain is attributable to John Horton and gets at the idea that because LLMs are trained on human-generated data, they contain implicit representations of human behavior. They can be considered a homo silicus: given endowments, preferences, and information, their behavior can be explored in simulations.

The results so far are very interesting, if early. Manning and Horton build "general social agents" combining social science theory and empirical data, and show that they match human play in novel games better than some standard models. Park et al. create generative agent simulations of 1,052 people by applying LLMs to qualitative interviews, and show these agents replicate participants' responses on the General Social Survey 85% as accurately as the participants replicate their own answers in a later follow-up (another framework here is the "digital twin" methodology).
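A minimal sketch of the basic mechanics: condition the model on a persona (background, endowments, preferences) and pose survey questions to it. The persona fields, prompt wording, and model name below are illustrative assumptions, and this again leans on the OpenAI Python client.

```python
from openai import OpenAI  # assumes the openai package and an OPENAI_API_KEY in the environment

client = OpenAI()

PERSONA_TEMPLATE = (
    "You are answering as the following person, not as an AI assistant.\n"
    "Age: {age}. Occupation: {occupation}. Region: {region}.\n"
    "In their own words: {background}\n"
    "Answer the question the way this person plausibly would, in one or two sentences."
)

def ask_agent(persona: dict, question: str) -> str:
    """Pose a survey question to a persona-conditioned 'homo silicus'."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": PERSONA_TEMPLATE.format(**persona)},
            {"role": "user", "content": question},
        ],
        temperature=0.7,  # some variability, since real respondents are not deterministic
    )
    return response.choices[0].message.content.strip()

persona = {"age": 46, "occupation": "retail store manager", "region": "Ohio",
           "background": "Worried about grocery prices; skeptical of new financial products."}
# ask_agent(persona, "Would you pay $15 a month for an AI budgeting assistant?")
```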

This technology, if it pans out, potentially unlocks a whole new layer of understanding of consumers, voters, firms, governments, and other agents. Rather than running expensive focus groups, we can trial in a digital laboratory how people would respond to new products, pricing, or marketing campaigns.

A lot of this literature has thought about the careful ways LLMs need to be set up to match human behavior, so that their responses to novel questions are in line with their responses to other questions. Gui and Toubia, for instance, find that variations in apparently unrelated context provided to the LLM tend to bias it, and the resulting demand curves don't match human demand curves well unless we fully "unblind" the model to the full experimental structure. This is a big problem: the whole point of running an experiment is that subjects shouldn't know the hypothesis, but LLMs may need that precise context to work well, and may be distracted by unrelated context.

Some things do seem to help: matching personas to the demographics of specific humans; fine-tuning machine responses on actual choice behavior, which seems to get us closer to realistic willingness-to-pay estimates for existing products and features (though it remains challenging to extrapolate these improvements to novel product categories or consumer segments); and translating machine output into free-text responses before mapping those onto Likert scales (a result a bit reminiscent of the Elo idea above: the LLM may be better at comparing two texts than at classifying them outright).
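That last step, eliciting a free-text answer and then collapsing it to a Likert score, is easy to sketch. As before, the prompt wording and model name are assumptions, not a documented recipe from these papers.

```python
from openai import OpenAI  # assumes the openai package and an OPENAI_API_KEY in the environment

client = OpenAI()

LIKERT_PROMPT = (
    "Here is a respondent's free-text answer to the question '{question}':\n"
    "\"{answer}\"\n"
    "On a 1-5 scale (1 = definitely would not, 5 = definitely would), which single number "
    "best summarizes this answer? Reply with only the number."
)

def free_text_to_likert(question: str, answer: str) -> int:
    """Map a free-text survey response onto a 1-5 Likert score via a second LLM call."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": LIKERT_PROMPT.format(question=question, answer=answer)}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())

# free_text_to_likert("Would you pay $15 a month for an AI budgeting assistant?",
#                     "Probably not right now, money is tight, but maybe if it really saved me money.")
```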

We're left with a deeper question of whether synthetic agents really represent the full structure of human beliefs, or whether the map here is a poor guide to the territory. Barrie and Cerina find that LLM personas are considerably more coherent in their beliefs than human agents are. People tend to hold a mix of somewhat contradictory beliefs (Walt Whitman: "Very well then I contradict myself, I am large, I contain multitudes"). The LLM agents sort of don't contain multitudes and are a bit too consistent, which of course is a natural limitation on their deployment to understand human ideology.

Where we stand today, I think, is that agents work best when we have natural scaffolds to help structure their behavior: either a lot of background text on the personas we want to match, actual choice behavior from those agents, or some sort of relevant theory of agent behavior. They also seem great to deploy for low-cost-of-error trialing. But we are probably not yet at the point where they substitute for experiments on actual humans, especially when we are thinking about novel scenarios that require extrapolation rather than interpolation, or about politically polarized populations (I've written more about this challenge of out-of-sample prediction here, and do think we can find more creative deployments in the future).

Where is the Moat?

If intelligence becomes cheap and broadly available, what remains scarce? What are the new bottlenecks, and where do rents flow to? I think there are three broad answers:

  1. The first, most obviously, is proprietary data. Data is key to the entire analytic platform here, and especially in finance contexts we care most about new data which is always arriving. Data becomes the scarce input from which new algorithms can squeeze fresh insights.

  2. Second is the speed of action, rather than the speed of reading. In finance applications, this is going to be your investment committee process or portfolio management decision, rather than your analyst reading time. If these remain your bottlenecks, then it doesn't really matter how fast document reading gets (Amdahl's law again; see the quick calculation after this list).

  3. Finally there is judgement under uncertainty. For now, humans seem to retain an edge in contexts of soft information, broader context, long-horizon prediction, novel situations, private information, and stochastic environments. This is going to be increasingly important when we bear in mind the adversarial response in finance: even if machine-parsed text is quite accurate now, you can bet that market participants will strategically alter their words to break these relationships. The inherently adversarial and zero-sum nature of financial markets means that we are looking at a moving target, which is going to be a challenge even as model capacities improve.
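On the Amdahl's law point: a quick back-of-the-envelope calculation, with made-up numbers, shows why speeding up only the reading stage buys surprisingly little.

```python
def amdahl_speedup(fraction_accelerated: float, speedup_factor: float) -> float:
    """Overall speedup when only a fraction of the workflow gets faster (Amdahl's law)."""
    return 1.0 / ((1.0 - fraction_accelerated) + fraction_accelerated / speedup_factor)

# If document reading is 20% of the end-to-end decision process and LLMs make it 100x faster,
# the whole process only speeds up by about 1.25x; the committee is still the bottleneck.
print(round(amdahl_speedup(0.20, 100.0), 2))  # 1.25
```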

A last image to leave you with: the Astronomer, by Vermeer. The astronomer has the heavens through the window, but is studying the globe: a map representation of the world. The question we’re always left with is whether this globe (or any map) is good enough to navigate by, or whether it misses something essential about the world.

The Astronomer by Vermeer

Cases
