Crypto bot performance metrics: how to read Sharpe, Sortino, max drawdown without lying to yourself

TL;DR: Every public crypto bot leaderboard sells you a single number: “Sharpe 3.4 over 6 months”. Pretty. Mostly useless. Sharpe collapses when returns are non-normal (every crypto strategy ever). Sortino is cleaner but trivially gameable by ignoring extreme losses. Max drawdown lies whenever the lookback window is too short. Calmar ratio is the closest thing to an honest synthesis but still hides the asymmetry of crypto returns. This guide unpacks each metric mathematically, shows the three most common ways bot vendors quietly inflate them (period selection, survivorship, return aggregation frequency), and gives you a six-question checklist to apply before trusting any backtest or live track record. No bot recommendations. Just the math.

Why performance metrics matter more for bots than for discretionary trading

A discretionary trader reads their own equity curve and adjusts. A bot does not. It executes the strategy you encoded, including its weaknesses, without flinching. The only signal you have to know whether the bot is doing what it claims is its performance track record, summarised by three or four headline numbers.

That is precisely why those numbers are gamed. Bot vendors compete on a leaderboard, so they pick whatever metric is most flattering, on whatever window is most flattering, computed with whatever convention is most flattering. The result is that two strategies with identical underlying risk can have headline Sharpe ratios that differ by a factor of two. The objective of this article is to give you a way to read those numbers like a pro, not like a marketing target.

Three foundational definitions before we go deeper:

Returns are the percentage changes in equity over a defined period.
Volatility is the standard deviation of those returns.
Drawdown is the percentage loss from a previous equity peak to a subsequent trough.

Everything else is a ratio combining these three things.

Sharpe ratio: the misunderstood industry default

The Sharpe ratio was introduced by William Sharpe in 1966. The formula is:

Sharpe = (R_p - R_f) / sigma_p

Where:

R_p is the average return of the strategy over the period,
R_f is the risk-free rate (T-Bill, OIS, or 0 if you assume an unfunded strategy),
sigma_p is the standard deviation of the strategy returns.

When the source data is daily returns, the convention is to annualise by multiplying the numerator by 365 (or 252 for traditional markets) and the denominator by the square root of the same factor:

Annualised Sharpe = (mean_daily_excess_return * 365) / (std_daily_return * sqrt(365)) = daily Sharpe * sqrt(365)

For crypto, the convention is usually 365 because crypto markets trade 24/7, including weekends.
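As a minimal sketch of the annualisation above, assuming simple daily returns expressed as fractions and a risk-free rate spread evenly over the year (the function name, parameters, and sample data are illustrative, not taken from any vendor's dashboard):

```python
import math
import statistics


def annualised_sharpe(daily_returns, risk_free_annual=0.0, periods_per_year=365):
    """Annualised Sharpe from simple daily returns given as fractions (0.01 = 1%)."""
    # Assumption: the annual risk-free rate is spread evenly over the year.
    rf_daily = risk_free_annual / periods_per_year
    excess = [r - rf_daily for r in daily_returns]
    mean_daily = statistics.fmean(excess)
    std_daily = statistics.stdev(excess)  # sample standard deviation (n - 1)
    if std_daily == 0:
        raise ValueError("zero volatility: Sharpe is undefined")
    return (mean_daily * periods_per_year) / (std_daily * math.sqrt(periods_per_year))


# Made-up daily returns, for illustration only.
sample_returns = [0.010, -0.005, 0.007, 0.002, -0.012, 0.004, 0.009, -0.003]
sharpe_365 = annualised_sharpe(sample_returns)
sharpe_252 = annualised_sharpe(sample_returns, periods_per_year=252)
print(f"365-day convention: {sharpe_365:.2f}")
print(f"252-day convention: {sharpe_252:.2f}")
```

The same return series produces a Sharpe about 20% higher under the 365-day convention than under the 252-day one (a factor of sqrt(365/252)), which is one reason to check which convention a dashboard uses before comparing numbers.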

What Sharpe assumes that crypto returns violate

Sharpe assumes returns are roughly normally distributed and that volatility is symmetric. Both assumptions break in crypto.

Fat tails. Crypto returns exhibit kurtosis far above 3 (the normal-distribution baseline). The March 2020 Covid crash, the May 2022 LUNA collapse, the FTX collapse in November 2022, and the March 2023 banking weekend all produced multi-sigma moves. Sharpe under-counts the risk of these events because it summarises risk with standard deviation alone, and standard deviation maps to tail probabilities only if returns fit a bell curve.

Skew. Many bot strategies are short-vol in disguise: small consistent gains punctuated by occasional large losses. Grid bots in trending markets, mean-reversion bots in liquid regimes, premium-harvesting strategies on perps. Sharpe treats a +5% day and a -5% day identically in the volatility term, even though the -5% day is what kills the account. Long-vol strategies (trend-following, breakout bots) show the opposite: many small losses and occasional large wins, which Sharpe also under-rewards.

Sample size. A Sharpe ratio computed on 6 months of daily data has approximately 180 observations. Under i.i.d. normal returns, the standard error of an annualised Sharpe estimated from half a year of daily data is about 1.4, so the 95% confidence interval around a sample Sharpe of 2.0 with n=180 spans roughly -0.8 to 4.8 even under perfect normality. In practice, with fat tails, the real interval is wider. A “Sharpe 3.4 over 6 months” is statistically indistinguishable from a true Sharpe of 1.5.
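One rough way to check how wide the interval around a sample Sharpe is: the sketch below uses the standard large-sample approximation for i.i.d. normal returns, SE = sqrt((1 + SR^2 / 2) / n) per period, rescaled to annual terms. The function name and defaults are illustrative, and the i.i.d. normal assumption is exactly the one crypto returns violate, so treat the result as a lower bound on the uncertainty.

```python
import math


def sharpe_confidence_interval(annual_sharpe, n_obs, periods_per_year=365, z=1.96):
    """Approximate CI for an annualised Sharpe, assuming i.i.d. normal returns."""
    per_period = annual_sharpe / math.sqrt(periods_per_year)
    # Large-sample standard error of the per-period Sharpe estimate.
    se_per_period = math.sqrt((1 + 0.5 * per_period ** 2) / n_obs)
    se_annual = se_per_period * math.sqrt(periods_per_year)
    return annual_sharpe - z * se_annual, annual_sharpe + z * se_annual, se_annual


low, high, se = sharpe_confidence_interval(2.0, n_obs=180)
print(f"standard error: {se:.2f}")
print(f"95% interval:   [{low:.2f}, {high:.2f}]")
```

Passing a smaller `n_obs` is a crude way to mimic the reduced effective sample size that fat tails and volatility clustering produce.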

When to trust a Sharpe

A Sharpe ratio becomes meaningful when computed over at least 2 years of out-of-sample data, with daily returns, and when the strategy has experienced at least one regime shift (bull-to-bear or vice versa). Anything shorter is curve-fitting until proven otherwise.

For deeper foundational reading, the CFA Institute maintains accessible reference material on portfolio metrics, and academic sources such as SSRN host the original Sharpe (1966) paper.

Sortino ratio: cleaner, but trivially gameable

The Sortino ratio addresses one Sharpe weakness: it penalises only downside volatility, not upside.

Sortino = (R_p - R_target) / sigma_downside

Where R_target is a minimum acceptable return (often 0% or the risk-free rate), and sigma_downside is the downside deviation: the square root of the mean squared shortfall below R_target, with returns above the target counted as zero shortfall.

Sortino is more aligned with how investors actually feel about risk: nobody loses sleep over big up days. A strategy with 70% winning days and a few moderate down days will show a much better Sortino than Sharpe, which is the correct intuition.
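A minimal sketch of the Sortino calculation, assuming daily returns as fractions and the common convention in which squared shortfalls are averaged over all observations (not only the losing days). The function names and sample data are illustrative.

```python
import math
import statistics


def downside_deviation(returns, target=0.0):
    """Root mean squared shortfall below target, averaged over ALL observations."""
    squared_shortfalls = [min(r - target, 0.0) ** 2 for r in returns]
    return math.sqrt(sum(squared_shortfalls) / len(returns))


def annualised_sortino(daily_returns, target_daily=0.0, periods_per_year=365):
    """Annualised Sortino from daily returns, using the same scaling as Sharpe."""
    dd = downside_deviation(daily_returns, target_daily)
    if dd == 0:
        raise ValueError("no returns below target: Sortino is undefined")
    mean_excess = statistics.fmean(daily_returns) - target_daily
    return (mean_excess * periods_per_year) / (dd * math.sqrt(periods_per_year))


# Made-up daily returns, for illustration only.
sample = [0.02, -0.01, 0.01, -0.02, 0.015, 0.005]
print(f"downside deviation: {downside_deviation(sample):.4f}")
print(f"annualised Sortino: {annualised_sortino(sample):.2f}")
```

Some dashboards divide by the number of losing days instead of all days, which makes the downside deviation larger and the Sortino smaller, so the convention is worth confirming before comparing two published figures.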

How Sortino is gamed

Sortino has two specific weaknesses bot vendors exploit:

Choice of the target return. Setting R_target = 0 versus R_target = risk-free rate versus R_target = daily average market return can shift Sortino by 30-50%. Most published Sortinos in crypto bot dashboards use R_target = 0, which is the most flattering option.

Truncation of the tail. A single -25% day on the strategy can completely dominate the downside deviation calculation. Some vendors quietly truncate or “smooth” that day in their dashboard (“outlier removed”). The Sortino jumps by 40-60% without any change in the underlying strategy. The honest practice is to publish raw Sortino, including all observed returns.
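Both moves are easy to reproduce on a toy series. The sketch below uses a made-up 60-day track record with a single -25% day and a compact per-period Sortino helper; the numbers are illustrative only and the size of the effect depends entirely on the data.

```python
import math


def sortino(returns, target=0.0):
    """Per-period (not annualised) Sortino with the given target return."""
    mean_excess = sum(returns) / len(returns) - target
    downside = math.sqrt(sum(min(r - target, 0.0) ** 2 for r in returns) / len(returns))
    return mean_excess / downside


# Made-up track record: 40 good days, 19 mildly bad days, one -25% day.
daily = [0.010] * 40 + [-0.004] * 19 + [-0.25]

raw = sortino(daily)                                   # target = 0, all days kept
with_target = sortino(daily, target=0.0002)            # small positive daily hurdle
truncated = sortino([r for r in daily if r > -0.10])   # "outlier removed"

print(f"raw Sortino (target 0):   {raw:.3f}")
print(f"with a positive target:   {with_target:.3f}")
print(f"worst day quietly removed: {truncated:.3f}")
```

Raising the target lowers the ratio, and dropping the single worst day raises it sharply, without any change to the strategy itself.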

How to use Sortino responsibly

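Compute both Sortino and Sharpe, and read them against the right baseline. With the standard downside deviation (squared shortfalls averaged over all observations), a target near zero, and a mean that is small relative to daily volatility, symmetric returns give a Sortino of roughly 1.4 times Sharpe (a factor of sqrt(2)), because only half of the variance sits below the target. If Sortino is well above that baseline (say 2.4 vs 1.2), the strategy has positive skew, which is generally desirable. If the ratio is close to 1.4 (say 1.8 vs 1.2), returns are roughly symmetric. If it is well below 1.4, and especially if Sortino is at or below Sharpe (rare), the strategy has hidden negative skew, which is a serious red flag.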

Maximum drawdown: the metric you can never compute “live”

Max drawdown is conceptually simple: the largest peak-to-trough decline in equity over the observation window.

MaxDD = max over t [(peak_equity_before_t - equity_t) / peak_equity_before_t]

In practice, MaxDD is reported as a positive percentage (e.g. “Max drawdown: 18%”), even though the equity move is negative.
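A minimal sketch of the calculation, assuming a list of strictly positive equity values in chronological order (the function name and sample series are illustrative):

```python
def max_drawdown(equity):
    """Largest peak-to-trough decline, returned as a positive fraction."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)                    # running high-water mark
        worst = max(worst, (peak - value) / peak)  # decline from that mark
    return worst


# Made-up equity curve: the worst decline is from 120 down to 90.
equity_curve = [100, 110, 105, 120, 90, 95, 130, 125]
mdd = max_drawdown(equity_curve)
print(f"Max drawdown: {mdd:.0%}")
```

The single pass tracks the running peak, so the result reflects the worst decline from any earlier high, not just from the starting value.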

The trap: drawdown is censored by the observation window

This is the single most abused metric in the crypto bot industry. The reported max drawdown is only the maximum drawdown observed during the window you chose. If your window starts after the last major correction, the published max drawdown systematically understates the true tail risk of the strategy.

Concrete example: a bot strategy backtested from January 2023 to December 2024 shows a max drawdown of 12%. The same strategy backtested from October 2021 to December 2024 shows a max drawdown of 38%. Same strategy, same code, different window. Which one is “true”? Both. But the longer window includes the May 2022 LUNA collapse and the November 2022 FTX collapse, which are exactly the events a strategy should be tested against.

Operational rule: any max drawdown number not computed on at least one full bull-bear cycle (roughly 3-4 years in crypto) is a marketing number, not a risk number.
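The window effect can be checked directly by recomputing max drawdown from several start points on the same series. The sketch below does this on a made-up equity curve; the helper names and data are illustrative.

```python
def max_drawdown(equity):
    """Largest peak-to-trough decline, returned as a positive fraction."""
    peak, worst = equity[0], 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


def max_drawdown_by_start(equity, start_indices):
    """Max drawdown of the same curve when the window begins at each start index."""
    return {start: max_drawdown(equity[start:]) for start in start_indices}


# Made-up curve: a crash from 140 to 70 early on, then a calm recovery.
equity_curve = [100, 140, 90, 70, 95, 110, 105, 125, 135]
by_start = max_drawdown_by_start(equity_curve, [0, 3])
for start, mdd in by_start.items():
    print(f"window starting at index {start}: max drawdown {mdd:.1%}")
```

Starting the window right after the crash turns a 50% drawdown into one under 5%, on identical data.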

Two companion measures put a max drawdown figure in context.

Time under water (TUW): how long the strategy spends in drawdown before recovering to a new peak. A strategy that drops 15% and recovers in two weeks is very different from one that drops 15% and stays under water for 9 months. TUW captures the psychological cost of the strategy.

Recovery factor: net profit divided by max drawdown. A recovery factor of 5 means the strategy has made five times its worst drawdown in net P&L over the window. Useful as a “is this strategy worth the risk it has historically taken” sanity check.
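Both measures can be sketched in a few lines. The version below assumes one equity value per day, counts time under water as consecutive observations below the running peak, and expresses the recovery factor in percentage terms (total return divided by max drawdown); computing it in currency terms is an equally common choice. Names and data are illustrative.

```python
def longest_time_under_water(equity):
    """Longest run of consecutive observations strictly below the running peak."""
    peak = equity[0]
    longest = current = 0
    for value in equity:
        if value >= peak:
            peak = value
            current = 0          # back at (or above) the high-water mark
        else:
            current += 1
            longest = max(longest, current)
    return longest


def recovery_factor(equity):
    """Total return over the window divided by max drawdown (both as fractions)."""
    peak, max_dd = equity[0], 0.0
    for value in equity:
        peak = max(peak, value)
        max_dd = max(max_dd, (peak - value) / peak)
    if max_dd == 0:
        raise ValueError("no drawdown in the window: recovery factor is undefined")
    return (equity[-1] / equity[0] - 1) / max_dd


equity_curve = [100, 110, 105, 120, 90, 95, 130, 125]  # made-up data
tuw = longest_time_under_water(equity_curve)
rf = recovery_factor(equity_curve)
print(f"longest time under water: {tuw} observations")
print(f"recovery factor: {rf:.2f}")
```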

Calmar ratio: the closest thing to an honest synthesis

The Calmar ratio combines annual return and max drawdown:

Calmar = annual_return / abs(MaxDD)

A Calmar of 1.0 means your annual return equals your worst historical drawdown. A Calmar of 0.5 means the worst drawdown was twice what the strategy earns in a year. A Calmar above 2.0 is rare and starts to look suspicious without a long, clean track record.

Calmar is harder to game than Sharpe or Sortino because:

It uses an asymmetric risk measure (MaxDD), which a strategy can never “fake” without changing the actual P&L curve.
It does not depend on the choice of risk-free rate or target return.
The denominator is conservative (worst observed drawdown), not an average.

The main weakness of Calmar is the same as max drawdown: it depends on the observation window. Computing a Calmar over 6 months produces a number that has almost no statistical meaning.

For a usable Calmar, use at least 3 years of data: divide the annualised return over that full window by the max drawdown over the same window. Compare only strategies measured over comparable windows.
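A minimal sketch of that calculation, assuming one equity value per day and using the compound annual growth rate as the annual return (the function name and the one-year sample series are illustrative; a real check would use several years of data):

```python
def calmar_ratio(equity, periods_per_year=365):
    """Annualised compound return divided by max drawdown over the same window."""
    years = (len(equity) - 1) / periods_per_year
    annual_return = (equity[-1] / equity[0]) ** (1 / years) - 1
    peak, max_dd = equity[0], 0.0
    for value in equity:
        peak = max(peak, value)
        max_dd = max(max_dd, (peak - value) / peak)
    if max_dd == 0:
        raise ValueError("no drawdown in the window: Calmar is undefined")
    return annual_return / max_dd


# Made-up one-year series: a 20% dip, then a finish 20% above the start.
equity_curve = [100.0, 80.0] + [120.0] * 364
calmar = calmar_ratio(equity_curve)
print(f"Calmar: {calmar:.2f}")
```

Here the annual return and the worst drawdown are both 20%, so the ratio comes out at 1.0.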

How vendors inflate metrics: the three classic moves

Three techniques cover almost all observed inflation patterns on public crypto bot dashboards.

1. Period selection (cherry-picking the window)

A 12-month window starting in late 2022, after the FTX collapse, looks dramatically better than a 12-month window starting in early 2022. The difference comes down to a few specific weeks, such as the LUNA collapse in May 2022 and the FTX collapse in November 2022. Vendors who let you select “since launch” without disclosing the launch date are choosing the best window by definition.

Defensive question: ask for the metric computed over a rolling window (12 months, rolled monthly, all observations). If the vendor cannot or will not provide it, the headline number is not trustworthy.
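If you have the daily return series, the rolling version is straightforward to compute yourself. The sketch below assumes daily returns as fractions, a 365-day window rolled every 30 days, and a zero risk-free rate; the function name, defaults, and synthetic data are illustrative.

```python
import math
import statistics


def rolling_sharpe(daily_returns, window=365, step=30, periods_per_year=365):
    """Annualised Sharpe (risk-free rate assumed 0) for each rolling window."""
    results = []
    for start in range(0, len(daily_returns) - window + 1, step):
        chunk = daily_returns[start:start + window]
        std = statistics.stdev(chunk)
        sharpe = float("nan") if std == 0 else (
            statistics.fmean(chunk) / std * math.sqrt(periods_per_year)
        )
        results.append((start, sharpe))
    return results


# Synthetic two-year series: a favourable regime followed by an unfavourable one.
good_regime = [0.004, -0.002] * 200   # 400 days drifting up
bad_regime = [0.002, -0.004] * 165    # 330 days drifting down
series = good_regime + bad_regime

windows = rolling_sharpe(series)
for start, sharpe in windows:
    print(f"window starting day {start:3d}: Sharpe {sharpe:6.2f}")
```

A headline figure that matches only the best of these windows is a period-selection artifact; the spread across windows is the more informative number.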

2. Survivorship in cross-bot comparisons

When a vendor publishes “average Sharpe across our strategies is 1.8”, they usually mean across the strategies still listed today. Failed strategies have been delisted. The reported average is therefore the survivors’ Sharpe, which is systematically higher than the true population average by a factor that can exceed 2.0 in crypto where strategy mortality is high.

Defensive question: how many strategies have been launched in total over the period, and how many of them are still listed on the dashboard? If the answer is an implausible “all of them” or “we don’t track that”, assume the published Sharpe is contaminated by survivorship bias.
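The arithmetic behind survivorship bias fits in a few lines. The strategy names, Sharpe values, and listing flags below are entirely made up for illustration; the point is only the gap between the two averages.

```python
import statistics

# Hypothetical vendor history: every strategy ever launched, with a listing flag.
strategies = [
    {"name": "alpha", "sharpe": 2.1, "listed": True},
    {"name": "bravo", "sharpe": 1.6, "listed": True},
    {"name": "charlie", "sharpe": 1.7, "listed": True},
    {"name": "delta", "sharpe": -0.4, "listed": False},
    {"name": "echo", "sharpe": 0.2, "listed": False},
    {"name": "foxtrot", "sharpe": -0.9, "listed": False},
    {"name": "golf", "sharpe": 0.5, "listed": False},
]

survivors_avg = statistics.fmean(s["sharpe"] for s in strategies if s["listed"])
population_avg = statistics.fmean(s["sharpe"] for s in strategies)

print(f"average Sharpe, listed strategies only: {survivors_avg:.2f}")
print(f"average Sharpe, all launched strategies: {population_avg:.2f}")
```

Only the first figure appears on a dashboard that drops delisted strategies, which is why the launch count matters as much as the average.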

3. Return aggregation frequency (daily vs trade-level)

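Some vendors compute Sharpe at the trade level rather than at the daily equity level. A bot with 100 trades a day on small moves will show extremely low per-trade volatility, and per-trade P&L ignores unrealised losses on positions that are still open. Annualising that per-trade ratio by the number of trades per year can inflate the headline Sharpe many times over.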

The standard convention is to compute returns on daily equity values, not on per-trade P&L. Defensive question: is the Sharpe computed on daily mark-to-market equity? If the answer is “trade-level” or “per-position”, the number is not comparable to any external benchmark.
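If all you are given is a trade log, a first step toward a comparable number is to aggregate trade P&L into daily returns on equity. The sketch below assumes each trade is a (date, realised P&L) pair with ISO-formatted dates and a known starting equity; those field choices are hypothetical. It uses realised P&L only, so it still misses unrealised losses on open positions and is not a substitute for true mark-to-market equity.

```python
from collections import defaultdict


def daily_returns_from_trades(trades, starting_equity):
    """Aggregate (iso_date, realised_pnl) trades into per-day returns on equity.

    Days without any closed trade are not filled in here; a full daily series
    would add them as zero-return days.
    """
    pnl_by_day = defaultdict(float)
    for day, pnl in trades:
        pnl_by_day[day] += pnl

    equity = starting_equity
    daily = []
    for day in sorted(pnl_by_day):        # ISO dates sort chronologically
        daily.append((day, pnl_by_day[day] / equity))
        equity += pnl_by_day[day]
    return daily


# Made-up trade log.
trade_log = [
    ("2024-01-01", 10.0),
    ("2024-01-01", -4.0),
    ("2024-01-02", -6.0),
]
daily = daily_returns_from_trades(trade_log, starting_equity=1000.0)
for day, ret in daily:
    print(day, f"{ret:+.4%}")
```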

Concrete example: the “Sharpe 3+ on 6 months” red flag

Suppose a bot dashboard displays:

Total return: +47% (last 6 months)
Sharpe ratio: 3.4
Max drawdown: 8%
Win rate: 71%

At face value, this looks excellent. Run the back-of-envelope check.

A Sharpe of 3.4 means the annualised excess return is 3.4 times the annualised volatility. On 6 months of daily data (n ~ 180), the standard error of a sample Sharpe under i.i.d. normal returns is approximately sqrt((1 + SR_daily^2 / 2) / n) per day, where SR_daily = 3.4 / sqrt(365), or about 0.18. That gives roughly 0.075 per day, or about 1.4 once annualised by sqrt(365). The 95% confidence interval is therefore roughly [0.6, 6.2] under perfect statistical conditions.

In reality, with crypto fat tails and volatility clustering, the effective sample size is smaller (perhaps 60-80 i.i.d. observations), the standard error rises to roughly 2.2-2.5, and the interval stretches from below zero to around 8. So even taking the headline at face value, the data cannot distinguish a “true” Sharpe of 3.4 from a true Sharpe of 1.0.

Next sanity check: a max drawdown of 8% over a window that includes only modest crypto volatility means the strategy has not been stress-tested by any significant regime change. If the 6 months selected do not contain a 30%+ Bitcoin pullback or a major liquidation event, the published 8% MaxDD is not informative about how the strategy behaves in a tail event.

Conclusion: “Sharpe 3.4 over 6 months” is not a lie, but it is not evidence of skill. It is evidence of a friendly window. Always demand longer track records, ideally including 2022 (LUNA, FTX) or earlier (March 2020 Covid crash) for any strategy claiming to manage tail risk.

A six-question checklist before trusting any bot performance dashboard

Before allocating capital to a bot based on its published metrics, run these six questions. If three or more receive unsatisfactory answers, treat the metrics as marketing.

Over what exact window are these metrics computed? Dates matter to the day. “Since launch” without a launch date is unacceptable.
What is the return aggregation frequency? Daily mark-to-market equity is the standard. Trade-level or weekly aggregation should raise a flag.
Are failed or delisted strategies included in cross-strategy averages? If not, the dashboard is showing survivors only.
Is the max drawdown computed over a window that includes at least one major regime shift? In crypto, that means at least one of Q2 2022 (LUNA), Q4 2022 (FTX), or Q1 2023 (banking weekend).
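What is the Sortino-to-Sharpe ratio? For symmetric returns it sits near 1.4 (sqrt(2)) under the standard downside-deviation convention. A ratio well above that indicates positive skew (good). A ratio well below it, and especially a Sortino lower than Sharpe, is rare and indicates hidden negative skew (bad).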
What is the Calmar ratio over the longest available window? A Calmar above 2.0 is exceptional and demands very strong evidence. A Calmar below 0.5 indicates the worst drawdown was more than twice what the strategy earns in a year, which is rarely justified.

FAQ: crypto bot performance metrics in 2026

1. Is annualised Sharpe of 1.5 to 2.0 realistic for a crypto bot over multiple years?

For a well-executed strategy with disciplined risk management, yes. Top-decile multi-strategy crypto funds report Sharpe ratios in the 1.5 to 2.5 range over 3-5 year windows. Anything above 3.0 sustained over 3+ years is exceptional and warrants close inspection of the methodology.

2. Why does Bitcoin buy-and-hold show a higher Sharpe than most bots over 2017-2024?

Because Bitcoin had a strong upward trend punctuated by deep but recoverable drawdowns. Sharpe favours strategies with strong directional returns and tolerates volatility as long as the trend persists. Most active crypto bots underperform buy-and-hold on Sharpe over very long windows precisely because they sacrifice trend capture for risk management or for fee generation.

3. Can I compute these metrics myself from a CSV of trades?

Yes, and you should. Export the bot’s full trade history and daily equity values to a spreadsheet or Python notebook. Compute daily returns, then annualised Sharpe and Sortino, then drawdowns. The discrepancies between vendor-published metrics and your own computation are revealing.
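A minimal starting point, assuming the export has one row per day with a date column in ISO format and an equity column. The column names `date` and `equity` are hypothetical, so adjust them to whatever your bot's export actually uses.

```python
import csv
import io


def daily_returns_from_csv(text, date_col="date", equity_col="equity"):
    """Parse a daily equity CSV and return simple day-over-day returns."""
    rows = sorted(csv.DictReader(io.StringIO(text)), key=lambda row: row[date_col])
    equity = [float(row[equity_col]) for row in rows]
    return [later / earlier - 1 for earlier, later in zip(equity, equity[1:])]


# Made-up export, deliberately out of order to show the sort.
sample_csv = """date,equity
2024-01-02,1010
2024-01-01,1000
2024-01-03,999.9
"""
returns = daily_returns_from_csv(sample_csv)
print([f"{r:+.2%}" for r in returns])
```

For a file on disk, read its contents with `open(path, newline="").read()` and pass the text in. The resulting list is the input every other metric in this article needs.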

4. What is a reasonable max drawdown to accept for a crypto bot?

There is no universal answer because it depends on the underlying volatility regime. As a heuristic, the baseline is a max drawdown lower than that of the underlying asset over the same window (Bitcoin drew down roughly 75% in 2021-2022, and ETH was similar). A bot that exhibits 50%+ drawdown over a window where Bitcoin drew down 30% is taking concentrated risk you should understand before committing capital.

5. Is “win rate” a useful performance metric?

Almost never on its own. A strategy with 95% win rate that captures 0.3% per win and loses 6% on the 5% of losing trades has a negative expectancy. Win rate must always be reported alongside average win, average loss, and expectancy. Bots that headline win rate without the other numbers are advertising a misleading metric.
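The arithmetic in that example is worth running once. The sketch below computes per-trade expectancy and the break-even win rate, with average win and average loss expressed as positive fractions of equity; the function names are illustrative.

```python
def expectancy(win_rate, avg_win, avg_loss):
    """Expected return per trade; avg_win and avg_loss are positive fractions."""
    return win_rate * avg_win - (1 - win_rate) * avg_loss


def breakeven_win_rate(avg_win, avg_loss):
    """Win rate at which expectancy is exactly zero."""
    return avg_loss / (avg_win + avg_loss)


# The example from the text: 95% win rate, +0.3% per win, -6% per loss.
edge = expectancy(0.95, 0.003, 0.06)
needed = breakeven_win_rate(0.003, 0.06)
print(f"expectancy per trade: {edge:+.4%}")
print(f"win rate needed to break even: {needed:.2%}")
```

With those payoffs the strategy needs to win slightly more than 95% of the time just to break even, before fees.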

6. Should I trust monthly returns more than annual returns?

Monthly returns over multiple years are useful for computing volatility and drawdowns. The headline annual return without the monthly history hides volatility entirely. Always demand monthly returns at minimum, ideally daily.

7. How do I compare a Grid bot and a DCA bot fairly?

They cannot be compared on Sharpe alone because their return profiles are structurally different. Compare them on Calmar ratio computed over the same window, plus time-under-water, plus maximum drawdown. A DCA bot will typically show lower Sharpe but lower drawdown than a Grid bot in trending markets, and the reverse in ranging markets.

Sources and official references

CFA Institute: portfolio metrics reference material on cfainstitute.org
William Sharpe (1966): original paper on the Sharpe ratio, available via SSRN
Frank Sortino (1991, 1994): original publications on the Sortino ratio
Chainalysis: annual crypto market reports for context on volatility regimes on chainalysis.com
ESMA: guidance on presentation of past performance in investment advice on esma.europa.eu

Summary: stop reading single numbers

Performance metrics in crypto bot dashboards exist to be read in context, not as standalone scores. Sharpe, Sortino and max drawdown, plus Calmar, computed over a window of at least 2 years (3 or more for the drawdown-based figures) that includes a regime shift, give you a defensible read on a strategy’s risk-adjusted profile. Anything shorter is either an early signal worth tracking or a marketing artifact, and you cannot tell which without the underlying data.

Three practical defaults to take away:

Demand the daily equity series, not just headline numbers. If the vendor refuses, the bot is opaque by design.
Compute Calmar yourself over the longest available window. If it diverges from the vendor’s number by more than 20%, ask why.
Beware Sharpe ratios above 3.0 on windows shorter than 2 years. They are almost always an artifact of period selection, return-frequency choice, or survivorship.

Crypto trading carries substantial risk of capital loss. Past performance, however measured, does not guarantee future results. The math in this article helps you read backwards more honestly. It does not predict anything forward.