This is the third installment of my course summaries from teaching AI in Finance at NYU Stern (see lecture slides here and last week’s summary here). We last left off discussing financial document intelligence and the problem of model accuracy. This week we turn to risk assessment, largely credit risk, which is a domain where AI has one of the longer track records of deployment in finance.
The key punchline from this session is that AI in risk management, to quote Homer Simpson, is "the cause of, and solution to," all of life's problems. Better models can improve prediction, expand credit access, lower losses, and help address operational challenges for financial institutions. But they also introduce new competitive dynamics, enable adversarial financial actors and cybercrime, and generally create new risks. Some of the best ways to tackle these risks, naturally enough, entail adopting yet more AI.
Trust and Verification
Credit scoring is a long-standing information problem in finance, and one that has seen substantial automation over time. Historically, credit access was deeply tied to one's reputation and social standing, as trust played a key role in repayment. As early as the 1840s in the US, this information started to get codified into ledgers. Bradstreet was founded in 1857, and in 1864 Dun (now Dun & Bradstreet) introduced an alphanumeric scoring system for commercial credit. Even at that point we had the basic ingredients of an algorithmic credit scoring model: some degree of private-sector surveillance to collect information, information sharing to make it accessible, and a ratings system to turn that information into actionable content.
Consumer credit took another half-century to develop, while computerized systems started to hit banks as early as the 1950s. Anti-discrimination laws in the 1970s (the Fair Credit Reporting Act in 1970, the Equal Credit Opportunity Act in 1974, and the Community Reinvestment Act in 1977) actually helped to boost credit score adoption. This legislation limited the scope for discrimination by lenders and required them to provide specific reasons for credit denial. Given the technology at the time, this was easiest to achieve through algorithmic approaches, which allowed lenders to make systematic lending decisions in a clearly observable way. The FICO score arrived in 1989, became used for mortgages in 1995, and VantageScore emerged as a competitor in 2006.
The standard credit model developed around this time was a logistic regression, or logit. This is the basic workhorse model of binary classification, here used to estimate whether a borrower will default on a loan. A sigmoid links the inputs to the probability of default, yielding a nicely interpretable model with clear coefficients that can be presented to bank managers, regulators, and customers.
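The sigmoid link at the heart of the logit fits in a few lines; here is a minimal sketch (the coefficient values are purely illustrative, not from any real credit model):

```python
import numpy as np

def default_probability(features, coefficients, intercept):
    """Logistic regression: squash a linear score through a sigmoid.

    features, coefficients: 1-D arrays of the same length.
    Returns P(default), a number strictly between 0 and 1.
    """
    linear_score = intercept + np.dot(features, coefficients)
    return 1.0 / (1.0 + np.exp(-linear_score))

# With a linear score of zero, the model is maximally uncertain: P = 0.5.
p = default_probability(np.array([0.0, 0.0]), np.array([1.2, -0.8]), 0.0)
```

The interpretability comes from the coefficients: each one tells you how a unit change in an input moves the log-odds of default, which is exactly the kind of statement a regulator or denied applicant can be given.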
A standard machine learning counterpart, by contrast, is the random forest, which segments the data through a series of decision trees. This introduces many more parameters, so the key trick is to avoid overfitting through a variety of design choices (such as measuring performance out of sample).
The basic tradeoff is then between a simple model, which is easy to implement and interpret, and a more complicated model, which has greater predictive power but is more of a "black box."
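The tradeoff can be sketched with scikit-learn on synthetic data. Here the true default risk depends on an interaction between two borrower features, which a plain logit (linear in its inputs) cannot represent but a forest can; all numbers are made up for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
score = rng.normal(size=n)   # stand-in for a credit score
income = rng.normal(size=n)  # stand-in for income

# True default risk driven by a nonlinear interaction of the two inputs.
true_logits = -1.0 + 1.5 * score * income
p_default = 1.0 / (1.0 + np.exp(-true_logits))
default = rng.binomial(1, p_default)

X = np.column_stack([score, income])
X_tr, X_te, y_tr, y_te = train_test_split(X, default, random_state=0)

logit = LogisticRegression().fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Out-of-sample AUC: the forest picks up the interaction, the logit cannot.
auc_logit = roc_auc_score(y_te, logit.predict_proba(X_te)[:, 1])
auc_forest = roc_auc_score(y_te, forest.predict_proba(X_te)[:, 1])
```

The out-of-sample evaluation is the design choice doing the work here: in-sample, a fully grown forest would look nearly perfect no matter what.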
Logit vs the Forest
Fuster, Goldsmith-Pinkham, Ramadorai, and Walther have a really nice paper, "Predictably Unequal," going through this comparison using US mortgage data. For the task of pure prediction, the ML random forest models deliver modest outperformance in terms of ROC (the Receiver Operating Characteristic curve, which plots the true positive rate against the false positive rate) and AUC (the Area Under that Curve).
The interesting part, however, shows up in contour plots of predicted default probabilities as a function of FICO score and income. The standard logit produces linear level sets: default risk trades off linearly between FICO and income. The random forest, however, produces nonlinear level sets, allowing a more complex representation of the underlying default risk. In particular, the model finds that borrowers with low credit scores but high incomes are actually less risky than expected, while borrowers with low incomes and high credit scores are riskier.
This matters for fairness because the impact on credit access of switching to a different statistical technology depends on how the model predicts risk as a function of borrower characteristics, and on how those characteristics are distributed across groups. The authors show that different racial groups cluster in different parts of the FICO-income space, so the nonlinear interactions the ML models pick up help some groups while harming others. While the ML model outperforms in general here, it actually performs worse on a measure of racial disparities in credit.
Other problems to worry about include the fact that backtests might look great on historical data, while a regime shift could drastically change the situation. This was a problem even before ML or AI, but the extent to which new AI methods gain their predictive edge from complex nonlinear interactions may make them brittle to structural shifts in how the world works.
Invisible Primes
So those are some of the downsides. What are the upsides of better model accuracy? A big category has been trying to expand the information set beyond traditional credit bureau files.
Berg, Burg, Gombović, and Puri have a paper on "digital footprints": the information users leave just by accessing a website. They find this simple, easy-to-verify information matches the information content of credit bureau scores. The difference in default rates between iOS and Android users, for example, is comparable to the difference between a median and an 80th-percentile credit score. Customers who arrive from a search engine (more likely to be impulse buyers) are more likely to default than those who arrive through price-comparison websites. So digital footprints can proxy for additional aspects of income, character, or reputation: the "soft information" that credit scoring has been trying to capture for a long time.
Di Maggio and Ratnadiwakara examine these trends more directly in fintech lending with alternative data. Their platform is able to find "invisible primes": consumers with thin credit files and low credit scores who are classified as high risk under traditional underwriting, but whom the model is able to classify as low risk. We see similar results in another paper examining LendingClub data. Other work suggests that improved algorithmic modeling might lower costs or reduce required regulatory capital.
We can push this further through AI by expanding the set of inputs to include unstructured text. Loan files, narratives, and conversations with customers are all fair game now to be processed through LLMs and fed into prediction models. One example analyzes CFPB consumer complaints using ChatGPT and finds that this produces valuable signals. For instance, complaints with higher "resolution expectations" are associated with higher deposit outflows, suggesting that AI can help triage which complaints signal bank operational problems.
It’s harder to see so far how AI will further change the underwriting process, but one interesting sign comes from this paper by Gambacorta, Sabatini, and Schiaffi, who explore the impact of AI investments among Italian banks. The key finding is that AI-adopting banks are able to lend without long relationship histories. Traditional banks rely on relationship history to build data and relationship value, while AI banks apparently do a sufficient job of extracting signals that they can give you a loan without the soft information from a long-standing relationship. These AI banks also sustained credit supply to firms through Covid, helping firms maintain investment and employment.
AI Generated Risks
The last example, comparing AI banks to non-AI banks, also brings up an important category of AI-generated risks: the competitive dynamics. If a bank has a better model, they can cherry-pick the best borrowers, leaving other banks stuck with lemons. This creates a Red Queen-style arms race problem: you have to keep running to stay in place. Lenders have to keep investing in model improvement, lest they be sniped by an entrant who uses a better model to undercut incumbents for the borrowers where the existing model is most wrong.
But if everyone adopts similar models, you might instead get herding behavior which can amplify systemic risk. This is amplified by recency bias: the models we currently have are disproportionately trained on recent data, and it’s not so clear how they might handle a novel downturn. And as mentioned above, the explainability problem is a real concern. More sophisticated models may perform better, but explaining to a customer or regulator why a loan was denied becomes much harder when the model involves thousands of interacting features.
One way to mitigate some of these issues is a "hybrid" approach: fit a complicated model first, then approximate the complex AI or ML application with a simpler rule. In the mortgage lending example above, we could add a nonlinear interaction of credit score and income to a simple model to match the complex model's predictions.
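This hybrid idea can be sketched as a distillation step: train the black-box model, then fit a simple logit, augmented with an explicit score-times-income interaction term, to the black box's own predictions. The data below is synthetic and all numbers are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 5000
score = rng.normal(size=n)   # stand-in for a credit score
income = rng.normal(size=n)  # stand-in for income

# True default risk depends on a score-income interaction.
true_p = 1.0 / (1.0 + np.exp(-(-1.0 + 1.5 * score * income)))
default = rng.binomial(1, true_p)

X = np.column_stack([score, income])

# Step 1: fit the complex "black box" model.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, default)
black_box_flag = forest.predict_proba(X)[:, 1] > 0.5

# Step 2: distill it into an explainable rule -- a logit whose inputs
# include the score x income interaction the forest implicitly uses.
X_simple = np.column_stack([score, income, score * income])
surrogate = LogisticRegression().fit(X_simple, black_box_flag.astype(int))

# How often does the simple, auditable rule agree with the black box?
agreement = np.mean(surrogate.predict(X_simple) == black_box_flag)
```

The surrogate's three coefficients can be shown to a regulator or a denied applicant, while the agreement rate quantifies how much fidelity to the complex model is lost in exchange.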
We also spent some time talking about SVB as a case study. Cookson, Fox, Gil-Bazo, Imbet, and Schiller show that during the SVB bank run, banks with more exposure to Twitter lost more market value. Interestingly, the issue wasn't negative sentiment on Twitter per se; it was attention on the platform, which created a coordination mechanism for depositors to flee. Other work has argued that misinformation can now be created even more cheaply, which might be the basis for an adversarial attack.
This points to a novel class of AI-accelerated risks: liquidity risks for banks through coordinated misinformation or adversarial attacks. Other AI risks include errors from black-box models that propagate through the system, concentration risk in third-party vendors, and the risk of data privacy leakage from model outputs. Naturally, back to Homer Simpson, AI itself is likely to be an important tool in diagnosing and mitigating these risks, for example by helping banks monitor social media chatter to check whether depositors appear unusually flighty.
Then we have Knight Capital. In 2012, a software flaw in their trading code led to roughly $7 billion in unintended purchases in the first hour of trading, losses that ultimately led to the failure of one of Wall Street's largest trading firms.
Thinking in General Equilibrium
It’s common when evaluating AI to think in partial equilibrium: hold fixed everyone else's technology, competitive responses, regulation, and prices, and imagine changing just one thing. But a world in which AI is freely available will see broad changes in general equilibrium: prices, competitors, and regulators all move too.
There is a wonderful graph of the length of the tax code, which appears to hit an upward inflection after the introduction of the typewriter. Making it cheaper to produce text, naturally enough, may have drastically increased the amount of regulatory documents out there. This same dynamic is likely to play out with AI in risk management, an application of the Jevons Paradox from our first session. The demands for documentation, model validation runs, compliance analysis, and stress testing are all likely to increase as the cost of producing them falls.
We are still at an early stage, so it's hard to assess the scope of these general equilibrium effects. We know that better models improve prediction, that alternative data can improve credit access, and that speed and automation can reduce costs. But the feedback loops (what happens to competitive dynamics, organizational responses, regulatory responses, and the novel systemic risks from common adoption of AI tools) are much harder to think through, and will likely account for many of the pain points in the coming years.
Readings
We have two cases this week: one on SVB and AI bank runs, and one on BlackRock's Aladdin product.