"Nobody Knows Anything": Backtests are Hard

Plus! The Economics of Hacking; Lead Generation; Adverse Selection; Tip Economics; First and Last Resort; Diff Jobs

17th June 2024

Byrne Hobart

In this issue:

"Nobody Knows Anything": Backtests are Hard—You'd think a question like "do small-caps outperform large-caps" or "what happens if you buy X when Y happens" would have a simple answer, but actually implementing a backtest means making many more judgment calls and rough approximations than you'd think. And while there's a lot more historical data in finance than in other fields, there just isn't enough to definitively prove most of the interesting ideas out there.
The Economics of Hacking—There is at least one field where surprisingly young people can rapidly rise to the top of high-impact organizations. Unfortunately, we tend to find out about this when they get arrested.
Lead Generation—When is it raising from well-informed experts who provide good advice, and when is it a kickback?
Adverse Selection—Market a product to credit card rewards arbitrageurs, and you should expect it to get arbitraged.
Tip Economics—The economic case against taxing tips—and the case for why that's changing.
First and Last Resort—Nvidia and t-bills are complements.

"Nobody Knows Anything": Backtests are Hard

A while ago, I had what sounded like a simple question: how do stocks react to being added to or removed from major indices, and how has this changed over time? It seems straightforward: there are nicely-formatted press releases, and it's not hard to find historical data on prices. But then you run into weird issues, like: why is my backtest of a large-cap-focused strategy periodically buying into smaller stocks and flipping them in a few days? That turns out to be an artifact of how the indexes handle spinoffs: the spun-off company might qualify for index membership, but it gets kicked out in a few days. Is that something you'd want in your backtest, for completeness? Or is it something you'd remove because keeping it is slavish adherence to how you described the problem before you started working on it? And how many such judgment calls can you make before you run The Poor Man's Bonferroni Correction by concluding that you've hopelessly overfitted through how you went about your research process and are pretty much guaranteed to produce something spurious at the end?

This kind of question comes up a lot. Systematic investors are, of course, looking for repeatable market phenomena that they can exploit over and over again until they go away. Fundamental investors are doing so in a more implicit way, but they will sometimes take the general phenomenon they're looking at—buy underlevered companies when they finally start doing a buyback, invest in industries with a long capital cycle after a few years of low aggregate capex, always consider shorting whatever the most valuable company in Canada is, etc.—and test whether it does, in fact, produce the expected result.

Part of what makes backtesting hard is that it's straightforward to articulate what the portfolio you're thinking about would look like today, but there's an art to thinking about how to change it over time. Frequency of rebalancing is a factor with a surprising impact on backtest outcomes. Suppose you're testing some strategy like buying every stock in the 10th percentile or lower of price/book value, and shorting everything above the 90th percentile. How often do you check? You can imagine a range of possibilities, with some ridiculous extremes: you could rebalance on every tick, i.e. every time a company flips from just-below to just-above the threshold, a trade happens. This would be pretty ridiculous, and in effect it means testing a combination of a value investing strategy and a very short-term mean-reversion trading strategy. But if you run this strategy with sufficiently rare rebalancing, you're introducing even more new features: a strategy that never rebalances slowly evolves into a single-stock momentum bet on the long side and a single-stock mean-reversion bet on the short side, because eventually whatever stock compounded at the highest rate for longest is going to represent more of the portfolio.
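To make the never-rebalance extreme concrete, here is a minimal sketch of the long side only. The five growth rates are made up for illustration, and `drift_weights` is a hypothetical helper, not part of any backtesting library. It assumes each stock compounds at a constant annual rate and that the portfolio started equal-weighted.

```python
def drift_weights(annual_growth, years):
    """Weights of a portfolio that started equal-weighted and was never rebalanced.

    Assumes each position compounds at a constant annual rate (a simplification).
    """
    values = [(1.0 + g) ** years for g in annual_growth]
    total = sum(values)
    return [v / total for v in values]


# Hypothetical constant growth rates for five stocks (not from any dataset).
growth = [0.02, 0.05, 0.08, 0.10, 0.25]

for horizon in (0, 10, 30):
    weights = drift_weights(growth, horizon)
    # The last entry is the fastest compounder; its share only goes up.
    print(horizon, [round(w, 3) for w in weights])
```

Under these assumed rates the fastest compounder starts at a fifth of the portfolio and ends up as the overwhelming majority of it after a few decades, which is the "single-stock momentum bet" described above.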

And that's just what can go wrong with rebalancing in a backtest while assuming transaction costs to be zero. They're not! There are some systematic strategies that end up more or less measuring the implied cost of running whatever the strategy is. Something like the overnight anomaly, i.e. that stocks' returns mostly happen between the market close and the next day's open, disappears after accounting for the cost of buying stocks every day at 4pm and exiting the next morning. Some small-cap backtests have similar problems, especially if you're trying to run a small-cap strategy at scale—in a sense, the historical excess returns you get from buying and holding a portfolio of small-caps are just a reflection of the cost of realizing those returns.^[1] And plenty of other strategies achieve their backtested alpha from being tax-inefficient—you need higher returns if you're realizing short-term rather than long-term capital gains, so there's a richer set of trading opportunities in those short-term investments.
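A rough way to see how a daily strategy can be entirely a measure of its own trading costs: the sketch below compounds a daily edge net of a round-trip cost. The basis-point figures are placeholders chosen for illustration, not estimates of the actual overnight anomaly or of real trading costs, and `net_annual_return` is a hypothetical helper.

```python
def net_annual_return(gross_daily_bps, round_trip_cost_bps, trading_days=252):
    """Annualized return of a strategy that trades in and out once per day.

    Both inputs are in basis points per day; the numbers passed in below
    are illustrative assumptions.
    """
    net_daily = (gross_daily_bps - round_trip_cost_bps) / 10_000
    return (1.0 + net_daily) ** trading_days - 1.0


# Assumed 4 bps/day gross edge, evaluated at three assumed cost levels.
for cost_bps in (0, 4, 5):
    print(cost_bps, round(net_annual_return(4, cost_bps), 4))
```

With zero costs the assumed edge compounds into a healthy annual return. At a round-trip cost equal to the daily edge it is exactly zero, and one extra basis point of cost makes it negative.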

There's another element of transaction costs that's even harder to model, because the highest effective transaction cost comes from when you're running a strategy, and someone else is running a smarter version of that strategy. Suppose there's some signal that, as far as you know, leads to above-average returns 55% of the time. And suppose this signal is really made up of two signals, such that if they both fire the hit rate is 60% and if just one of them does, it's a coin toss. If you run the simpler version of this strategy, but a competitor runs the more complex version, what happens is that they routinely outbid you every time both signals say to buy, and you get to make the trade 100% of the time when returns are no better than chance. This is naturally hard to model because, if you knew exactly what you were looking for, you'd be running the smarter strategy.^[2]
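The arithmetic in this example can be checked directly. The numbers in the text imply that both signals fire together on half of all trades, since 0.5 × 60% + 0.5 × 50% = 55%. That 50/50 split is inferred rather than stated. The function below is a hypothetical sketch that computes your hit rate on the trades you actually get filled on, given the share of two-signal trades a rival takes from you.

```python
def realized_hit_rate(taken_by_rival, p_both=0.5, hit_both=0.60, hit_single=0.50):
    """Hit rate on the trades you actually get to make.

    taken_by_rival: fraction of the two-signal trades a smarter competitor
    outbids you on (0.0 = no competition, 1.0 = they take all of them).
    p_both=0.5 is inferred from the 55% blended hit rate in the text.
    """
    both_fills = p_both * (1.0 - taken_by_rival)
    single_fills = 1.0 - p_both
    fills = both_fills + single_fills
    return (both_fills * hit_both + single_fills * hit_single) / fills


for share in (0.0, 0.5, 1.0):
    print(share, round(realized_hit_rate(share), 4))
```

With no rival, the backtested 55% holds. With a rival taking every two-signal trade, the live hit rate falls to 50% even though the signal itself has not changed, which is why this cost does not show up in a backtest.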

And then you get into the philosophical challenge of backtests: we just don't have that much data, and the more we extend it the more we're adding datapoints from a different distribution. A sufficiently long backtest of the results of investing in stocks will periodically also include an all-financials portfolio, and will sometimes look at companies that were basically government-sponsored entities that were constantly engineering short squeezes ($, Diff). Go back a little over a hundred years, and US market performance is mostly about railroads, with a side of companies mostly selling to railroads and a handful of companies that were dependent on railroads.^[3] You can broaden your backtest to international stocks, but now you're adding a big dose of Cold War and onward geopolitics to the implicit assumptions underpinning your returns. China's rise as a manufacturer could only happen once—no other country was big enough and broke enough to make that big of a splash in industry. And the rise of East Asian manufacturing economies before that was also quite situational—the US was a lot less protectionist when it viewed rising wages in Japan and South Korea as a bulwark against communism.^[4]

There are some cases where you can achieve statistical significance in a backtest and know that you're looking at a real phenomenon that will probably persist. You can divide them into two different kinds of unattainability:

There are high-frequency strategies that make a lot of trades with a small edge, such that their sample size is pretty robust. You won't know for sure if this strategy loses money in extreme market environments, because the sophistication of counterparties tends to rise over time (especially because the worst-performing traders you were trading against in September 2008 or March 2020 are more likely to have gone out of business).
There are lower-frequency strategies like the typical ones pursued by multi-manager funds—picking lots of stocks in a way that isolates idiosyncratic returns, and turning over that portfolio every few weeks or months. This also accrues enough information that it enables the identification of skill—but the model also ensures that the people best-positioned to know how much skill there is, and who has it, are the ones running the fund.
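A back-of-the-envelope calculation shows why trade count matters so much here. The sketch below is not a method from the text. It assumes independent trades with symmetric payoffs and a normal approximation to the binomial, and asks how many trades it takes before a given hit rate sits `z` standard errors above a coin flip. `trades_needed` is a hypothetical helper.

```python
import math


def trades_needed(hit_rate, z=2.0):
    """Trades required for hit_rate to be z standard errors above 50%.

    Uses the standard error of a proportion under the null of a fair coin,
    sqrt(0.25 / n), and assumes trades are independent (often untrue).
    """
    edge = hit_rate - 0.5
    if edge <= 0:
        raise ValueError("hit_rate must exceed 0.5")
    return math.ceil((z * 0.5 / edge) ** 2)


for p in (0.55, 0.51, 0.501):
    print(p, trades_needed(p))
```

Under these assumptions a 55% hit rate needs about 400 trades, a 51% hit rate about 10,000, and a 50.1% hit rate about a million. A strategy that trades thousands of times a day can get there quickly, while one that makes a handful of decisions a year cannot.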

In principle, it's possible to run backtests that do produce valid signals that can be traded profitably. But for many of them, the signal needs to be anchored by some theory about why it exists, and that theory needs to be a theory of the average counterparty's laziness, sloppiness, institutional constraints, or willingness to offload risk. So most of the time, a backtest represents a partly-proven theory paired with a leap of faith.

[1] It would be a pain to implement, but a small-cap strategy that could be worth pursuing would be something like this: make a long, long list of tiny companies that have a viable business—a list that will be a smaller and smaller percentage of small-caps over time, as the good ones keep getting acquired while the bad ones limp along and opportunistically issue more stock. Then, set up a bunch of alerts to take advantage of fat-finger trades. Hold this portfolio, and exit as companies achieve a higher market cap and more volume. There will be plenty of garbage that gets picked up with this strategy, but that's exactly the risk you're being paid to take when you run it.
[2] In practice, it seems that firms combine so many alpha-generating and risk-mitigating strategies at once that a clean example like this doesn't apply, and everyone running profitable strategies is getting adversely-selected into some trades and picked-off from others, such that there are incremental gains from tweaking some core strategy.
[3] Local manufacturers generally weren't big enough to list, and the regional and national brands got that way because the US had the infrastructure necessary to ship whatever you ordered in the Sears Roebuck catalog to wherever you happened to be.
[4] This wasn't the only driver, and both countries took many specific steps to make that growth happen. But the US's muted reaction was partly driven by a simple calculation: a worker in Nagoya or Ulsan who got laid off from a job building cars might be tempted to join a communist-adjacent party. A worker in Detroit probably wouldn't. Hegemonic foreign policy always has a sort of prodigal son dynamic, where it's more important to be nice to the countries that are wavering than the ones whose support is dependable.

Diff Jobs

Companies in the Diff network are actively looking for talent. See a sampling of current open roles below:

A fintech company using AI to craft new investment strategies seeks a portfolio management associate with 2+ years of experience in trading or operations for equities or crypto. This is a technical role—FIX proficiency required, as well as Python, C#, and SQL. (NYC)
A blockchain-focused research and consulting firm is looking for an infrastructure engineer to secure their clients’ networks. Deep experience in DevOps, Linux systems, and IaC required; previous crypto experience preferred. (Remote)
A CRM-ingesting startup is on-boarding customers to its LLM-powered sales software, and is in need of a product engineer with a track record of building on their own. (NYC)
A well-funded seed-stage startup founded by former SpaceX engineers is looking for full-stack engineers previously employed by Anduril or Palantir. (LA)
A company building the new pension of the 21st century, along with universal basic capital, is looking for a GTM / growth lead. (NYC)

Even if you don't see an exact match for your skills and interests right now, we're happy to talk early so we can let you know if a good opportunity comes up.

If you’re at a company that's looking for talent, we should talk! Diff Jobs works with companies across fintech, hard tech, consumer software, enterprise software, and other areas—any company where finding unusually effective people is a top priority.

Elsewhere

The Economics of Hacking

The average age of prominent tech founders seems to have gone up a bit in the last decade. There are plenty of very prominent thirty-and-over founders, but fewer wunderkinder than in the early days of social. There is still one area where people in their early twenties can end up running large organizations: the head of the Scattered Spider hacking group was arrested, and turned out to be 22. That is at least mild evidence that an important part of cybersecurity is that most of the people who are capable of hacking big companies also have much more lucrative ways to spend their time. (And the countries that seem most competitive in committing cybercrime are the ones with a large number of software engineers and fairly low wages for them.) That's a good thing to know, but a dangerous thing to rely on.

Lead Generation

There's a hazy line between two ways a given early-stage investment might be a uniquely good fit for an investor and vice-versa:

That investor works in the same domain as the startup, and can provide valuable advice, some warm introductions to customers, recommendations for key hires, etc.
That investor works in the same domain, sometimes buying products from companies much like the one being invested in. If that investor owns, say, 1% of a company that's valued at 20x revenue, they’re effectively getting a 20% commission on any revenue they send to that company.
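The commission figure in the second case is just the ownership stake multiplied by the revenue multiple. The sketch below spells that out. It assumes the valuation multiple applies to each marginal dollar of revenue and that the stake is marked at that valuation, neither of which is guaranteed in practice. `effective_commission` is a hypothetical name.

```python
def effective_commission(ownership_stake, revenue_multiple):
    """Implied paper gain to the investor per dollar of revenue they refer.

    Assumes each extra dollar of revenue adds revenue_multiple dollars of
    valuation, and the investor's stake is marked at that valuation.
    """
    return ownership_stake * revenue_multiple


# The example from the text: a 1% stake in a company valued at 20x revenue.
print(effective_commission(0.01, 20))
```

One dollar of referred revenue adds $20 of valuation under that assumption, and 1% of $20 is $0.20, hence the 20% figure. The gain is on paper until the stake is sold.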

The latter model is what infosec-focused VC fund Cyberstarts is being accused of. To be fair, they are reasonably transparent about this model, but the incentives within cybersecurity are somewhat pathological: part of the point of having a CISO is that the work of both identifying risks and figuring out the best way to mitigate them is very specialized, and plenty of solutions providers are below the radar (as are the problems they're solving, hence the need for a startup to address this rather than a big company).

The tech industry is built on the constant simmer of low-level conflicts of interest: founders want early employees to work hard despite not getting the same equity stake, employees may dial back their at-the-office effort because they're planning a new startup themselves, VCs optimize for portfolio-level returns even though that raises the risk of founder failure or founder burnout, regulators know that there are many letters of the alphabet that aren't already taken for a Chief-Something-or-other-Officer at the companies they regulate. This usually works just fine, and the conflicts both motivate people and inculcate some healthy cynicism. But that same tolerance for moderate conflicts can turn into tolerance for more serious ones without anyone consciously choosing to operate that way.

Adverse Selection

Financial services often fit the same broad economic model as gyms, airlines, and mobile games: there's a typical pattern of behavior, and there's an outlier pattern with wildly different margins. For some of these, like airlines, the high-consumption customers are more profitable. For a gym, high-consumption customers are a drag on profitability since they increase the number of dumbbells and squat racks required to service a given number of members. (They do offset this slightly by being walking billboards—ironically, the best way for a gym to recover the margin hit from a member who actually works out a lot is to take another short-term margin hit by giving that customer a branded t-shirt.) A business perfectly optimized for annoying, adverse selection-driven losses might be a credit card that allows users to earn points on a transaction where they'd otherwise be charged a fee, with the plan of making that up on other transactions. Wells Fargo is apparently losing $10m/month through a partnership with Bilt, which offers such a card ($, WSJ). This probably looks worse than it is: the earliest adopters will be people who care about optimizing their loyalty points, and these people are going to look at the overall rewards for the card and only use it when it's better than the alternatives, which is another way of saying that they'll only use it for the massively negative margin loss-leader, instead of all the complements that it enables. The Bilt bet has to be that the product will expand to other consumers who are either less sophisticated or just less willing to spend their time optimizing credit card spend.

Tip Economics

The Diff has written a bit before about the rise of tipping, but hasn't addressed one topic: how should tips be taxed? There's a proposal to eliminate taxes on tips, which would likely reduce tax revenue by $15-25bn/year. One of the annoying questions you have to ask about taxes is: even if we did tax this, could we really collect? It's one reason financial transaction taxes tend to get shot down: they're popular, but it's straightforward to execute the same transaction in a different venue, so in practice they end up being a regressive tax on people who don't have good accountants and prime brokerage relationships—i.e. a majority of investors but a minority of transaction-volume-weighted investors. Interestingly enough, the ability to effectively tax tips is rising as more of them get paid through apps rather than in cash. One way to view the previous status quo is that tip-heavy jobs were a sort of onramp from the informal economy to the formal economy, where instead of getting paid entirely in cash and reporting none of it, someone gets paid partly in a way that's recorded and taxed, and partly in a tax-optional format. Economic efficiency dictates that we wouldn't want to exempt certain kinds of income from taxes (that's true even if this income accrues to lower earners—the question of maximally efficient taxation is different from the question of what the ultimate income distribution should look like, and these should be solved separately). What this ultimately illustrates is that optimal policy is a moving target, and technology pushes out the efficient frontier even when it comes to trivial things like whether a $1 tip for serving a cup of coffee ought to be taxed.

First and Last Resort

The US has received about a third of global capital flows since the pandemic. This is part of America's economic role: when there's a global demand shortfall, America is best positioned to make up the gap, and conveniently these demand shortfalls coincide with demand for dollars. A trade deficit means that the rest of the world accumulates claims on US assets, and the US also has multiple attractive assets to offer the rest of the world—on one end of the spectrum, the default safe assets are issued by the US treasury, but at the other end of the spectrum, the most conventional risk assets like large-cap growth stocks, PE, and hedge funds, are also disproportionately offered by the US. It's a surprisingly stable equilibrium, which stays in balance as long as the country issuing the global reserve currency also has a disproportionate share of the fastest-growing multinational firms.