This is the fifth week of my course summaries from teaching AI in Finance at NYU Stern (see lecture slides here; and last week’s summary here). This week focuses on fraud detection and compliance, which are two critical growth areas for AI applications.

The Base Rate Problem

Caravaggio’s The Cardsharps

We start off this week with Caravaggio’s The Cardsharps, which depicts a young man being cheated at cards: an accomplice peers at his hand and signals to a partner who has a few cards tucked behind his belt. I like this painting because it features the everlasting elements of fraud: asymmetric information, coordination among bad actors, and an innocent mark. Fraud and financial crime have since shifted drastically in scale and ease of execution through technology. Consumers report $12.5 billion in fraud to the FTC; global reported payment card fraud is perhaps $30 billion a year; and global compliance costs on financial crimes are about $60 billion a year.

The core technical problem in fraud detection is the base rate problem. Despite the absolute magnitude of fraud, it is quite rare at the level of any given transaction: say, less than 0.1% of transactions.

This means that the most “accurate” fraud detection algorithm is always going to be the classifier which tags all transactions as legitimate. It’s going to be 99.9%+ accurate! But this would be a terrible algorithm to deploy because failing to tag anything as fraudulent will be an open invitation for more crimes.
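To make this accuracy paradox concrete, here is a minimal sketch with hypothetical numbers, assuming a 0.1% fraud base rate:

```python
# Base rate problem: with 0.1% fraud, the classifier that tags every
# transaction as legitimate is 99.9% accurate yet catches zero fraud.
# All numbers here are hypothetical.

n_total = 1_000_000
n_fraud = 1_000  # 0.1% base rate

# "All legitimate" classifier: correct on every non-fraud transaction,
# wrong on every fraud.
true_negatives = n_total - n_fraud
accuracy = true_negatives / n_total

print(f"accuracy: {accuracy:.4f}")       # 0.9990
print(f"fraud caught: 0 of {n_fraud}")
```

Accuracy alone is the wrong yardstick here, which is why fraud work leans on precision and recall instead.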

This is the fundamental tradeoff at the heart of fraud detection. Tightening the rules to capture more fraud (lowering false negatives, Type II errors) inevitably means flagging more legitimate transactions as fraudulent (raising false positives, Type I errors). These errors have complicated costs which vary by context. Customers really dislike false positives, so tagging more fraud may lead to more customer churn. But letting real fraud through can also lead to huge losses, regulatory penalties, and reputational costs.

Beneish and Vorst have a nice paper illustrating this tradeoff using a range of fraud prediction models. Even the best models they consider have false-positive-to-true-positive ratios in excess of 100:1, meaning they flag 100 legitimate cases as fraudulent for every true fraud they capture. These tradeoffs are so lopsided that they estimate it is generally not cost effective to adopt the fraud detection models at all.
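A quick back-of-the-envelope calculation shows why a 100:1 ratio is so punishing: the precision of the model's alerts falls below 1%, so nearly all review effort goes to false alarms.

```python
# Precision implied by a ~100:1 false-positive ratio:
# 100 legitimate cases flagged for every true fraud captured.

flagged_fraud = 1
flagged_legit = 100

precision = flagged_fraud / (flagged_fraud + flagged_legit)
print(f"alert precision: {precision:.2%}")  # about 0.99%
```

If each false alarm carries a review cost and some customer-churn risk, it is easy to see how the expected cost of alerts can swamp the fraud losses avoided.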

From Rules to Representation

The broader history of fraud detection is similar to other areas of AI deployment in finance, particularly risk management, which we discussed in week 3. Prior to the 1990s, we typically had manual review, which was expensive and slow to scale. From the 1990s and 2000s, we had rule-based systems, which were more scalable but rigid and easy to game. The 2010s brought supervised ML techniques, such as random forests, which used labeled data to let models learn fraud drivers in more adaptable ways.

The impact of AI here has been to move towards richer representations of transaction data in ways that allow for more sophisticated analysis of fraudulent patterns of behavior. Purda and Skillicorn, for instance, show how even a simple bag-of-words model applied to the management discussion sections of financial reports can distinguish fraudulent from truthful filings. This suggests that deceptive language carries detectable statistical signatures in word patterns which can be mined for detection.
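A toy sketch of the bag-of-words idea, where the filing snippets and the overlap scoring are invented for illustration (the actual paper uses far richer text and a trained classifier):

```python
# Bag-of-words sketch: represent each filing excerpt as word counts,
# then score a new excerpt by which class's vocabulary it overlaps more.
# All texts are invented; this is a nearest-vocabulary toy, not the
# paper's method.

from collections import Counter

truthful = "revenue grew from strong product demand and higher margins"
fraudulent = "restatement of prior period revenue following an internal review"

def bag(text):
    return Counter(text.lower().split())

def overlap(doc, reference):
    # count shared word occurrences between two bags of words
    return sum(min(doc[w], reference[w]) for w in doc)

new_filing = bag("internal review led to a restatement of revenue")
score_t = overlap(new_filing, bag(truthful))
score_f = overlap(new_filing, bag(fraudulent))
print("closer to fraudulent language" if score_f > score_t
      else "closer to truthful language")
```

Even this crude overlap picks up that the new excerpt shares more vocabulary with the fraudulent reference; real systems weight words (e.g., TF-IDF) and fit a classifier on top.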

Dal Pozzolo and coauthors, using data from a real-world credit card issuer with 75 million transactions, highlight some of the challenges here. You want to develop real-time fraud detection algorithms, but you face 1) concept drift (consumer and fraudster habits change over time), 2) class imbalance (the base rate problem discussed above), and 3) verification latency (a delayed feedback loop in verifying fraudulent status).

One interesting AI solution to this problem comes from Capital One. Bruss and colleagues developed DeepTrax, a graph embedding approach to financial transactions data. The idea is to treat sequences of merchant transactions like words in a sentence: merchants which appear in similar transaction contexts should be “close” in embedding space. Their derived embeddings show intuitive clusters: KFC sits near Taco Bell, Little Caesars, and Burger King, because consumers tend to purchase from these merchants in similar contexts. The embedding dimensions even encode attributes of branding or price point: among hotels there is a direction running from the Ritz-Carlton to the Fairfield Inn, and among retailers from Banana Republic to Old Navy along the same axis.
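The intuition can be sketched with a toy co-occurrence model rather than learned embeddings (the merchant histories here are invented; DeepTrax itself trains skip-gram-style embeddings on real transaction sequences):

```python
# Merchants that appear in similar transaction histories get similar
# vectors. Here each merchant's "vector" is simply its co-occurrence
# counts with other merchants, compared via cosine similarity.
# Histories are invented for illustration.

from collections import defaultdict
from math import sqrt

histories = [
    ["KFC", "Taco Bell", "Shell"],
    ["Taco Bell", "KFC", "Burger King"],
    ["Ritz-Carlton", "Delta", "Hertz"],
]

cooc = defaultdict(lambda: defaultdict(int))
for h in histories:
    for m in h:
        for n in h:
            if m != n:
                cooc[m][n] += 1

def cosine(a, b):
    keys = set(cooc[a]) | set(cooc[b])
    dot = sum(cooc[a][k] * cooc[b][k] for k in keys)
    na = sqrt(sum(v * v for v in cooc[a].values()))
    nb = sqrt(sum(v * v for v in cooc[b].values()))
    return dot / (na * nb) if na and nb else 0.0

# Fast-food merchants co-occur, so their vectors are closer than a
# fast-food / luxury-hotel pair.
print(cosine("KFC", "Taco Bell") > cosine("KFC", "Ritz-Carlton"))  # True
```

Learned embeddings compress these co-occurrence statistics into dense low-dimensional vectors, which is what makes directions like "price point" emerge.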

This is a good illustration of the fundamental objective of AI from session 1: generating maps or representations to simplify decision making. By doing this dimension reduction across firms, we can more easily figure out which purchases are likely to be anomalous. The authors suggest the embeddings improve the precision-recall AUC by about 1% over baseline, and also allow for simpler downstream ML implementations.

A natural question, given the succession from rules to ML to embeddings, is whether LLMs can also process fraud directly. Tan, Ma, and Zhang look at this in their FinFRE-RAG paper. They translate transaction features into language prompts, select the most important attributes of a transaction, and retrieve similar transactions (using RAG). Interestingly, this approach still lags behind purpose-built ML classifiers, and the raw LLM approach (without RAG) is quite bad. But LLM plus RAG has the added benefit of interpretability: it reasons about specific fraud patterns, and so can open up the dreaded “black box” of ML classifiers.
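A hypothetical sketch of the pipeline shape: serialize a transaction's features into text, retrieve the most similar labeled cases, and assemble a prompt. The feature names, labeled examples, and prompt format are all invented, and no actual LLM is called.

```python
# Sketch of a features-to-prompt RAG pipeline for fraud classification.
# Everything here (features, records, prompt wording) is invented;
# retrieval is a crude count of matching feature values rather than an
# embedding search.

def to_text(txn):
    return ", ".join(f"{k}={v}" for k, v in sorted(txn.items()))

labeled = [
    ({"amount": "high", "hour": "3am", "country": "new"}, "fraud"),
    ({"amount": "low", "hour": "noon", "country": "home"}, "legit"),
]

def retrieve(query, k=1):
    # rank labeled examples by number of matching feature values
    scored = sorted(
        labeled,
        key=lambda ex: -sum(query.get(f) == v for f, v in ex[0].items()),
    )
    return scored[:k]

query = {"amount": "high", "hour": "2am", "country": "new"}
examples = retrieve(query)
prompt = (
    f"Transaction: {to_text(query)}\n"
    + "\n".join(f"Similar case: {to_text(t)} -> {label}"
                for t, label in examples)
    + "\nIs this transaction fraudulent? Explain your reasoning."
)
print(prompt)
```

The interpretability benefit comes from the last line: the model must articulate which retrieved patterns drove its answer, rather than emit a bare score.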

Graphs and Dark Fleets

An important development in fraud detection methods has been the exploration of graphical methods. This takes advantage of the reality that fraud is fundamentally relational, not just transactional. Chang, Zou, Xiang, and Jiang have a good review of graph neural networks for fraud detection. The basic idea is that graphical methods can tag fraud where the transaction itself looks fine in isolation, but the broader pattern of who is transacting with whom, the device structure, and so on can surface novel fraudulent patterns. These methods operate at the node level (is the account fraudulent?), the edge level (is the transaction suspicious?), and the graph level (does this cluster of accounts form a fraud ring?).
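A minimal sketch of the relational idea at the node level: a one-step "guilt by association" score over invented accounts, not a trained GNN.

```python
# An account can look fine on its own, but if most of its counterparties
# are known-bad, flag it. This is one-step label propagation over an
# invented transaction graph, not a graph neural network.

from collections import defaultdict

edges = [("A", "X"), ("A", "Y"), ("B", "X"), ("C", "Z")]
known_fraud = {"X", "Y"}

neighbors = defaultdict(set)
for u, v in edges:
    neighbors[u].add(v)
    neighbors[v].add(u)

def suspicion(node):
    # share of a node's counterparties that are known fraudulent
    nbrs = neighbors[node]
    return len(nbrs & known_fraud) / len(nbrs) if nbrs else 0.0

print(suspicion("A"))  # 1.0: both counterparties are known-bad
print(suspicion("C"))  # 0.0
```

GNNs generalize this by learning what to aggregate from neighbors (amounts, devices, timing) over multiple hops, rather than a fixed one-step fraction.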

There are some striking examples of these methods in use. Fernández-Villaverde, Li, Xu, and Zanetti built a ship clustering model to detect “dark shipping”: ships which turn off their transponders to evade sanctions, and which are therefore challenging to measure. Their model incorporates vessel characteristics, behavior during signal gaps, port visits, and ship-to-ship transfers to assign each ship and trip a “dark score.” They estimate that dark ships transported almost 8 million metric tons of crude oil each month from 2017-2023, with China taking about 15% of the supply. Dark shipments grew dramatically, for instance, after Western sanctions on Russia.

Another grim but illuminating example is the work by John Griffin and Kevin Mei on pig butchering scams. These are crimes in which victims are gradually lured in through fake crypto investments which pay off in the short run before the victim is finally defrauded of large amounts. The authors measure the digital signatures of these crimes through blockchain flows, which allow them to tag entire networks and clusters of fraudulent actors. Criminal networks sent over 32,000 small trust-building payments to exchanges used by US and European investors before moving the amounts elsewhere and exiting, typically through Tether. Blockchain is interesting here in that the digital ledger provides both the anonymity to engage in these criminal activities and the digital breadcrumbs to reconstruct the crimes after the fact.

Opening the Black Box

While there is some promise to these newer fraud detection methods, they face serious gaps in deployment. Bhatt and others survey firms and find that most explainability tooling in deployed ML serves internal users, such as engineers debugging models, rather than the end users affected by model decisions. The lack of transparency and explainability behind complicated model decisions is a key barrier to their broader adoption and use.

We saw above that LLMs can help address one aspect of that explainability gap, through model rationalizations of classifications. Another important approach in practice is Shapley values, which decompose a model prediction into the contribution of each input variable. This helps generate feature-level explanations of what led to a credit card payment or money order being flagged as fraudulent.
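Here is a self-contained sketch computing exact Shapley values for a toy two-feature fraud score; the model and numbers are invented.

```python
# Exact Shapley values for a tiny two-feature "fraud score" model:
# average each feature's marginal contribution over all orderings.
# The scoring model and its numbers are invented for illustration.

from itertools import permutations
from math import factorial

features = ["large_amount", "new_device"]

def score(present):
    # toy fraud score: each feature adds risk, and the combination
    # adds an extra interaction bump
    s = 0.0
    if "large_amount" in present:
        s += 0.3
    if "new_device" in present:
        s += 0.2
    if {"large_amount", "new_device"} <= set(present):
        s += 0.1
    return s

def shapley(feature):
    total = 0.0
    for order in permutations(features):
        i = order.index(feature)
        before = order[:i]
        total += score(before + (feature,)) - score(before)
    return total / factorial(len(features))

contributions = {f: shapley(f) for f in features}
print(contributions)  # {'large_amount': 0.35, 'new_device': 0.25}
```

Note that the interaction term (0.1) is split evenly between the two features, and the contributions sum exactly to the full score (0.35 + 0.25 = 0.6), which is the property that makes Shapley decompositions useful for explaining a flagged decision.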

Addressing the Compliance Burden

Figuring out how to improve the deployment of these tools matters because fraud has costs which extend far beyond the actual dollar losses. Financial firms spend billions to comply with Anti-Money Laundering (AML) and Know-Your-Customer (KYC) regulations. The regulatory environment around finance generally tightened after the financial crisis, when real problems with fraudulent behavior led to a substantial regulatory response and to the exit of banks from mortgage lending as well as small business lending.

This leads to a situation I’ve discussed before: the advancement of technology in finance has been a disappointment along many dimensions. An easy way to see this is to simply look at mortgage rates and spreads over the last 26 years. Despite the rapid advancements in computing and technology, we’re really no better (arguably, we are actually worse) at pricing and originating mortgages.

From the Urban Institute.

This situation partially reflects the rising costs of regulatory compliance, which now exceed $10,000 per mortgage loan. Naturally enough, the real costs and risks of fraud have set off regulatory demands for tighter paperwork, which is costly to provide.

This is where AI has the potential for broader impact: through organizational and documentation changes which can automate KYC checks, accelerate regulatory document review, triage suspicious activity reports, and more broadly address the effective “tax” that compliance imposes on financial activities.

AI as the Sword and Shield

But we also have to be careful about the dangers here, following the “cause of, and solution to, all of life’s problems” dynamic. If the cost of meeting regulation falls, then we might expect regulators to come back with a whole new stack of regulations, leaving the net effect more uncertain. This is a general equilibrium question about what happens to the economy overall when the costs of some functions fall dramatically, and it is one we have yet to think through.

AI also creates new vulnerabilities even as it addresses old ones. Deepfakes and social engineering attacks are becoming extremely effective. The $25 million crime perpetrated in 2024, when an employee joined a seemingly routine video call populated with deepfake participants, was just an early signal that the costs of phishing and other attacks have been dramatically lowered by AI.

Fraudsters in fact have access to the same suite of AI technologies that banks do, resulting in another “arms race” dynamic. They can use the same embeddings, graph analysis, and language models to understand what detection systems are doing and figure out how to game them. The use of AI also creates a whole new set of attack vectors, from vibe-coded software that lacks critical safety checks to novel prompt engineering attacks on the AI systems themselves.

Whether defenders or attackers benefit more from the dramatically increased scale of AI remains, I think, very much an open question.
