MLFI cover

[Book] Commented summary of Machine Learning for Factor Investing by Guillaume Coqueret and Tony Guida

The book can be read online at

Quick opinion on the book:

Written in a very scholarly style, with many references to the literature (including very recent ones), which can be useful in itself. The authors are well aware of both the recent Computer Science / Machine Learning developments and the empirical asset pricing / factor investing papers. Not common.

The book is definitely better suited as material for a grad-level course than as a collection of working recipes for investing. Some chapters do not actually bring much practical insight but rather present a particular set of ML tools that could potentially be applied to factor investing (e.g. SVM, RL). Others do manage to provide a good mix of insights into the mechanics of the ML model and its application in factor investing (e.g. tree-based methods and interpretability), along with discussions of results and their limits.

I would recommend this to master's students from technical fields (e.g. applied math, electrical engineering, computer science) who are interested in quant investing. It can also be helpful for interviews, as these profiles are usually rather naive about what machine learning can do out of the box.

What follows contains excerpts that I found interesting or worth flagging: essentially high-level comments rather than technical parts.

Chapter 1 Preface

equity investment strategies that are built on firm characteristics

Factor investing is a subfield of a large discipline that encompasses asset allocation, quantitative trading and wealth management. Its premise is that differences in the returns of firms can be explained by the characteristics of these firms.

Not about and doesn’t cover:

  • fraud detection or credit scoring
  • use cases of alternative datasets
  • machine learning theory
  • natural language processing

Finally, a modern book:

Thank you for sending your feedback directly (via pull requests) on the book’s website which is hosted at

Chapter 2 Notations and data

Interesting data available to experiment with:

This dataset comprises information on 1,207 stocks listed in the US (possibly originating from Canada or Mexico). The time range starts in November 1998 and ends in March 2019. For each point in time, 93 characteristics describe the firms in the sample. These attributes cover a wide range of topics:

  • valuation (earning yields, accounting ratios);
  • profitability and quality (return on equity);
  • momentum and technical analysis (past returns, relative strength index);
  • risk (volatilities);
  • estimates (earnings-per-share);
  • volume and liquidity (share turnover).

Yet, this is still quite poor compared to what industry practitioners have access to.

The predictors have been uniformized, that is: for any given feature and time point, the distribution is uniform.

Basically, cross-sectional ranking.
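A minimal pandas sketch of this cross-sectional rank-uniformization (column and function names are my own, not the book's):

```python
import pandas as pd

def uniformize(df, feature_cols, date_col="date"):
    """Map each feature to (0, 1] by ranking within each date (cross-section)."""
    out = df.copy()
    for col in feature_cols:
        # rank(pct=True) yields rank / group size: uniform by construction
        out[col] = out.groupby(date_col)[col].rank(pct=True)
    return out

# Toy example: two dates, three stocks each
df = pd.DataFrame({
    "date": ["2019-03"] * 3 + ["2019-04"] * 3,
    "mkt_cap": [10.0, 200.0, 55.0, 12.0, 180.0, 60.0],
})
u = uniformize(df, ["mkt_cap"])
print(u["mkt_cap"].tolist())  # each date's values become 1/3, 2/3, 3/3
```

Whatever the raw scale of a characteristic, after this step every date's cross-section has the same uniform marginal distribution.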

Chapter 3 Introduction

To the best of our knowledge, the only consensus is that, on the x side, the features should include classical predictors reported in the literature: market capitalization, accounting ratios, risk measures, momentum proxies.

For the dependent variable, many researchers and practitioners work with monthly returns, but other maturities may perform better out-of-sample.

While it is tempting to believe that the most crucial part is the choice of f (it is the most sophisticated, mathematically), we believe that the choice and engineering of inputs, that is, the variables, are at least as important.

As a corollary: data is key. The inputs given to the models are probably much more important than the choice of the model itself.

Chapter 4 Factor investing and asset pricing anomalies

Some researchers document fading effects because of publication: once the anomaly becomes public, agents invest in it, which pushes prices up and the anomaly disappears. McLean and Pontiff (2016) document this effect in the US but Jacobs and Müller (2020) find that all other countries experience sustained post-publication factor returns.

4.2.1 Simple portfolio sorts

  1. rank firms according to a particular criterion (e.g., size, book-to-market ratio);
  2. form J≥2 portfolios (i.e. homogeneous groups) consisting of the same number of stocks according to the ranking (usually, J=2, J=3, J=5 or J=10 portfolios are built, based on the median, terciles, quintiles or deciles of the criterion);

I don’t get the need to always split into a discrete number of portfolios (quantiles). Why not just study the whole cross-section? Basically, visualize/study the empirical copula of (signal, future returns), which in practice is the joint distribution of (sorted firm characteristic, sorted future returns). This would avoid such obvious remarks:

A strong limitation of this approach is that the sorting criterion could have a non monotonic impact on returns and a test based on the two extreme portfolios would not detect it.

I see this number of quantiles (aka portfolios) as a parameter on which researchers overfit the findings they report in papers.
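The sorting procedure above can be sketched in a few lines (synthetic data; the function name and settings are my own illustration):

```python
import numpy as np

def quantile_sort_returns(signal, fwd_returns, n_portfolios=5):
    """Sort assets into quantile portfolios on `signal`; return the mean
    forward return of each (equal-weighted) portfolio, lowest quantile first."""
    order = np.argsort(signal)
    buckets = np.array_split(order, n_portfolios)  # ~equal-sized groups
    return np.array([fwd_returns[b].mean() for b in buckets])

rng = np.random.default_rng(0)
sig = rng.normal(size=1000)
ret = 0.05 * sig + rng.normal(scale=0.2, size=1000)  # monotone link + noise
perf = quantile_sort_returns(sig, ret, n_portfolios=5)
print(perf)                # roughly increasing across the five quintiles
print(perf[-1] - perf[0])  # the classic top-minus-bottom spread
```

Note that only the extreme-bucket spread is reported in the usual test, which is exactly why a non-monotonic signal can slip through undetected.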

4.2.2 Factors

For most anomalies, theoretical justifications have been brought forward, whether risk-based or behavioural. We list the most frequently cited factors below:

  • Size (SMB = small firms minus large firms): Banz (1981), Fama and French (1992), Fama and French (1993), Van Dijk (2011), Asness et al. (2018) and Astakhov, Havranek, and Novak (2019).
  • Value (HML = high minus low: undervalued minus `growth’ firms): Fama and French (1992), Fama and French (1993), C. S. Asness, Moskowitz, and Pedersen (2013).
  • Momentum (WML = winners minus losers): Jegadeesh and Titman (1993), Carhart (1997) and C. S. Asness, Moskowitz, and Pedersen (2013). The winners are the assets that have experienced the highest returns over the last year (sometimes the computation of the return is truncated to omit the last month). Cross-sectional momentum is linked, but not equivalent, to time-series momentum (trend following), see e.g., Moskowitz, Ooi, and Pedersen (2012) and Lempérière et al. (2014). Momentum is also related to contrarian movements that occur both at higher and lower frequencies (short-term and long-term reversals), see Luo, Subrahmanyam, and Titman (2020).
  • Profitability (RMW = robust minus weak profits): Fama and French (2015), Bouchaud et al. (2019). In the former reference, profitability is measured as (revenues - (cost and expenses))/equity.
  • Investment (CMA = conservative minus aggressive): Fama and French (2015), Hou, Xue, and Zhang (2015). Investment is measured via the growth of total assets (divided by total assets). Aggressive firms are those that experience the largest growth in assets.
  • Low `risk’ (sometimes: BAB = betting against beta): Ang et al. (2006), Baker, Bradley, and Wurgler (2011), Frazzini and Pedersen (2014), Boloorforoosh et al. (2020), Baker, Hoeyer, and Wurgler (2020) and Asness et al. (2020). In this case, the computation of risk changes from one article to the other (simple volatility, market beta, idiosyncratic volatility, etc.).

As is shown by Linnainmaa and Roberts (2018) and Hou, Xue, and Zhang (2020), many proclaimed factors are in fact very much data-dependent and often fail to deliver sustained profitability when the investment universe is altered or when the definition of the variable changes (Clifford Asness and Frazzini (2013)).

One reason why people are overly optimistic about anomalies they detect is the widespread reverse interpretation of the p-value. Often, it is thought of as the probability of one hypothesis (e.g., my anomaly exists) given the data. In fact, it is the opposite: the probability of observing data at least as extreme as the sample, under the null hypothesis that there is no anomaly.

Lastly, even the optimal number of factors is a subject of disagreement among conclusions of recent work. While the traditional literature focuses on a limited number (3-5) of factors, more recent research […] advocates the need to use at least 15 or more.

Green, Hand, and Zhang (2017) even find that the number of characteristics that help explain the cross-section of returns varies in time.

The evidence on the effectiveness of timing is diverse: positive for Greenwood and Hanson (2012), Hodges et al. (2017), Haddad, Kozak, and Santosh (2020) and Lioui and Tarelli (2020), negative for Asness et al. (2017) and mixed for Dichtl et al. (2019).

[…] its acceleration has prompted research about whether or not characteristics related to ESG criteria (environment, social, governance) are priced. Dozens and even possibly hundreds of papers have been devoted to this question, but no consensus has been reached.

We gather below a very short list of papers that suggest conflicting results:

  • favorable: ESG investing works (Kempf and Osthoff (2007)), can work (Nagy, Kassam, and Lee (2016)), or can at least be rendered efficient (Branch and Cai (2012)); a large meta-study reports overwhelmingly favorable results (Friede, Busch, and Bassen (2015)), but of course, these could well stem from the publication bias towards positive results.
  • unfavorable: Ethical investing is not profitable: Adler and Kritzman (2008), Blitz and Swinkels (2020). An ESG factor should be long unethical firms and short ethical ones (Lioui (2018)).
  • mixed: ESG investing may be beneficial globally but not locally (Chakrabarti and Sen (2020)). Results depend on whether to use E, S or G (Bruder et al. (2019)).

On top of these contradicting results, several articles point towards complexities in the measurement of ESG. Depending on the chosen criteria and on the data provider, results can change drastically (see Galema, Plantinga, and Scholtens (2008), Berg, Koelbel, and Rigobon (2019) and Atta-Darkua et al. (2020)).

Chapter 5 Data preprocessing

The first step is selection. Given a large set of predictors, it seems a sound idea to filter out unwanted or redundant exogenous variables. Heuristically, simple methods include:

  • computing the correlation matrix of all features and making sure that no (absolute) value is above a threshold (0.7 is a common value) so that redundant variables do not pollute the learning engine;
  • carrying out a linear regression and removing the non-significant variables (e.g., those with p-value above 0.05);
  • performing a clustering analysis over the set of features and retaining only one feature within each cluster (see Chapter 16).

All of these methods are somewhat reductive and overlook nonlinear relationships.

Disagree on the third point. There is no reason the clustering analysis should capture only linear relationships, as one can plug all sorts of distances into the clustering algorithm, cf. my work on Exploring and measuring non-linear correlations: Copulas, Lightspeed Transportation and Clustering. In this context, one would generate the bivariate copulas between all pairs of features, apply an optimal-transport-distance-based clustering, and explore the clusters looking for non-linear relationships.
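For reference, the first heuristic above (the 0.7 correlation threshold) can be sketched as follows; the greedy keep-first rule and all column names are my own choices:

```python
import numpy as np
import pandas as pd

def drop_correlated(features, threshold=0.7):
    """Greedily drop features so that no absolute pairwise correlation
    exceeds `threshold` (keeps the earlier column of each clashing pair)."""
    corr = features.corr().abs()
    keep = []
    for col in features.columns:
        if all(corr.loc[col, k] <= threshold for k in keep):
            keep.append(col)
    return features[keep]

rng = np.random.default_rng(1)
x = rng.normal(size=500)
df = pd.DataFrame({
    "mkt_cap": x,
    "total_assets": x + 0.1 * rng.normal(size=500),  # ~redundant with mkt_cap
    "vol_1y": rng.normal(size=500),                  # independent
})
print(drop_correlated(df).columns.tolist())  # ['mkt_cap', 'vol_1y']
```

The greedy rule makes the outcome order-dependent, which is one more illustration of how arbitrary this filtering step is.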

One problematic example is when the dataset is sampled at the monthly frequency (not unusual in the money management industry) with the labels being monthly returns and the features being risk-based or fundamental attributes. In this case, the label is very weakly autocorrelated, while the features are often highly autocorrelated. In this situation, most sophisticated forecasting tools will arbitrage between features which will probably result in a lot of noise. In linear predictive models, this configuration is known to generate bias in estimates (see the study of Stambaugh (1999) and the review by Gonzalo and Pitarakis (2018)).

In some cases (e.g., an insufficient number of features), it is possible to consider ratios or products between features. Accounting ratios like price-to-book, book-to-market and debt-to-equity are examples of functions of raw features that make sense. The gains brought by a larger spectrum of features are not obvious: the risk of overfitting increases, just as adding variables in a simple linear regression mechanically increases the R2. The choices must make sense, economically.

Another way to increase the feature space (mentioned above) is to consider variations. Variations in sentiment, variations in book-to-market ratio, etc., can be relevant predictors because sometimes, the change is more important than the level.

In classical financial terms, this means that a particular model is likely to depend on the overarching situation, which is often proxied by macro-economic indicators. One way to take this into account at the data level is simply to multiply the feature by an exogenous indicator $z_t$.
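Both augmentations, variations and macro-conditioning, are one-liners in pandas; all column names below (including `credit_spread` as the indicator $z_t$) are hypothetical:

```python
import pandas as pd

# One stock's monthly data; `credit_spread` plays the role of the indicator z_t
df = pd.DataFrame({
    "book_to_market": [0.8, 0.9, 0.85, 1.0],
    "sentiment":      [0.2, 0.1, 0.4, 0.3],
    "credit_spread":  [1.0, 1.2, 1.5, 1.1],
})

# Variation: sometimes the change matters more than the level
df["d_sentiment"] = df["sentiment"].diff()

# Conditioning: multiply a feature by the exogenous indicator z_t
df["btm_x_spread"] = df["book_to_market"] * df["credit_spread"]

print(df[["d_sentiment", "btm_x_spread"]])
```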

Chapter 6 Penalized regressions and sparse hedging for minimum variance portfolios

6.2 Sparse hedging for minimum variance portfolios

Interesting section, need to investigate more.

Chapter 7 Tree-based methods

One notable contribution is Bryzgalova, Pelger, and Zhu (2019) in which the authors create factors from trees by sorting portfolios via simple trees, which they call Asset Pricing Trees.

Paper: Forest Through the Trees: Building Cross-Sections of Stock Returns

Monotonicity constraints are another element that is featured both in xgboost and lightgbm. Sometimes, it is expected that one particular feature has a monotonic impact on the label. For instance, if one deeply believes in momentum, then past returns should have an increasing impact on future returns (in the cross-section of stocks).

Chapter 8 Neural networks

Often, training a NN requires several epochs and up to a few dozen.

Disagree with this statement. Or maybe it holds for the type of neural networks and data discussed in this book?

In ML-based asset pricing, the most notable application of GANs was introduced in Luyang Chen, Pelger, and Zhu (2020).

GANs can also be used to generate artificial financial data (see Efimov and Xu (2019), Marti (2019) and Wiese et al. (2020)), but this topic is outside the scope of the book.

8.6.3 A word on convolutional networks […] While this is clearly an interesting computer science exercise, the deep economic motivation behind this choice of architecture remains unclear.

CNNs can simulate discrete wavelet processing. To dig deeper into the analogy between CNNs on time series and these wavelets. Pretty sure Mallat’s work on his scattering networks illustrates the links between CNNs and wavelets.

Essentially, for returns-like time series, I think it captures hierarchical relationships: returns at different time scales.

This puzzle encouraged researchers to construct novel NN structures that are better suited to tabular databases. Examples include Arik and Pfister (2019) and Popov, Morozov, and Babenko (2019) but their ideas lie outside the scope of this book. Surprisingly, the reverse idea also exists: Nuti, Rugama, and Thommen (2019) try to adapt trees and random forests so that they behave more like neural networks. The interested reader can have a look at the original papers.

Chapter 9 Support vector machines

Just a textbook introduction to SVM.

Chapter 10 Bayesian methods

Good references for Bayesian analysis are Gelman et al. (2013) and Kruschke (2014). The latter, like the present book, illustrates the concepts with many lines of R code.

Bayesian additive regression trees (BARTs) are an ensemble technique that mixes Bayesian thinking and regression trees. In spirit, they are close to the tree ensembles seen in Chapter 7, but they differ greatly in their implementation. In BARTs like in Bayesian regressions, the regularization comes from the prior. The original article is Chipman, George, and McCulloch (2010) and the implementation (in R) follows Sparapani, Spanbauer, and McCulloch (2019).

Chapter 11 Validating and tuning

Before we outline common evaluation benchmarks, we mention the econometric approach of Li, Liao, and Quaedvlieg (2020). The authors propose to assess the performance of a forecasting method compared to a given benchmark, conditional on some external variable. This helps monitor under which (economic) conditions the model beats the benchmark.

11.3 The search for good hyperparameters […] The interested reader can have a look at Snoek, Larochelle, and Adams (2012) and Frazier (2018) for more details on the numerical facets of this method.

Chapter 12 Ensemble models

12.1 Linear ensembles

Overall, findings are mixed and the heuristic simple average is, as usual, hard to beat (see, e.g., Genre et al. (2013)).

Chapter 13 Portfolio backtesting

There are many ways that this signal can be integrated in an investment decision (see Snow (2020) for ways to integrate ML tools into this task).

Snow, Derek. 2020. “Machine Learning in Asset Management: Part 2: Portfolio Construction—Weight Optimization.” Journal of Financial Data Science Forthcoming.

13.2 Turning signals into portfolio weights

Either selection or optimization. Both can work if one is very confident in the signal (predictions).

The benefit of this second definition is that it takes the compounding of returns into account and hence compensates for volatility pumping. To see this, consider a very simple two period model with returns $-r$ and $+r$. The arithmetic average is zero, but the geometric one $\sqrt{1 - r^2} - 1$ is negative.
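The two-period claim is easy to verify numerically:

```python
import math

r = 0.2  # two-period returns: -r then +r
arithmetic = (-r + r) / 2
geometric = math.sqrt((1 - r) * (1 + r)) - 1  # equals sqrt(1 - r**2) - 1

print(arithmetic)  # 0.0
print(geometric)   # negative: compounding loses money despite a zero mean
```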

A meaningful hit ratio is the proportion of times that a strategy beats its benchmark. This is of course not sufficient, as many small gains can be offset by a few large losses.

Transaction costs are often overlooked in academic articles but can have a sizable impact in real life trading (see e.g., Novy-Marx and Velikov (2015)). Martin Utrera et al. (2020) show how to use factor investing (and exposures) to combine and offset positions and reduce overall fees.

In Bailey and Prado (2014), the authors even propose a statistical test for Sharpe ratios, provided that some metrics of all tested strategies are stored in memory.

I implemented this Deflated Sharpe Ratio some time ago. Interesting idea.
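From memory, the core ingredient of that test is the probabilistic Sharpe ratio; the sketch below reproduces the formula as I recall it (treat it as an assumption and check against the paper before use):

```python
import math

def probabilistic_sharpe(sr_hat, sr_star, n, skew=0.0, kurt=3.0):
    """Estimate P(true SR > sr_star) from an estimated SR over n observations,
    adjusting for the skewness and kurtosis of returns (formula from memory,
    after Bailey and Prado)."""
    denom = math.sqrt(1 - skew * sr_hat + (kurt - 1) / 4 * sr_hat ** 2)
    z = (sr_hat - sr_star) * math.sqrt(n - 1) / denom
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))  # standard normal CDF

# An SR of 0.1 per month over 120 months, against a benchmark SR of 0
print(probabilistic_sharpe(0.1, 0.0, 120))
```

The "deflated" version then raises `sr_star` above zero to account for the number of strategies tried, which is why the test needs metrics of all tested strategies to be stored.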

13.4 Common errors and issues

13.4.1 Forward looking data

13.4.2 Backtest overfitting

Stating the obvious for those who need it:

The careful reader must have noticed that throughout Chapters 6 to 12, the performance of ML engines is underwhelming. These disappointing results are there on purpose and highlight the crucial truth that machine learning is no panacea, no magic wand, no philosopher’s stone that can transform data into golden predictions. Most ML-based forecasts fail. This is in fact not only true for very enhanced and sophisticated techniques, but also for simpler econometric approaches (Dichtl et al. (2020)), which again underlines the need to replicate results to challenge their validity.

Chapter 14 Interpretability

14.1 Global interpretations

14.1.1 Simple models as surrogates

Let us start with the simplest example of all. In a linear model, the following elements are usually extracted from the estimation of the $\beta_k$:

  • the $R^2$, which appreciates the global fit of the model (possibly penalized to prevent overfitting with many regressors). The $R^2$ is usually computed in-sample;
  • the sign of the estimates $\hat{\beta_k}$, which indicates the direction of the impact of each feature $x^k$ on $y$;
  • the $t$-statistics $t_{\hat{\beta_k}}$, which evaluate the magnitude of this impact: regardless of its direction, large statistics in absolute value reveal prominent variables. Often, the t-statistics are translated into $p$-values which are computed under some suitable distributional assumptions.

The last two indicators are useful because they inform the user on which features matter the most and on the sign of the effect of each predictor. This gives a simplified view of how the model processes the features into the output. Most tools that aim to explain black boxes follow the same principles.

14.1.2 Variable importance (tree-based)

There are differences in the way the models rely on the features. For instance, the most important feature changes from one model to another: the simple tree model gives the most importance to the price-to-book ratio, while the random forest bets more on volatility and boosted trees give more weight to capitalization.

14.1.3 Variable importance (agnostic)

One way to track the added value of one particular feature is to look at what happens if its values inside the training set are entirely shuffled. If the original feature plays an important role in the explanation of the dependent variable, then the shuffled version of the feature will lead to a much higher loss.
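A minimal, model-agnostic sketch of this permutation importance; the "model" here is just the true linear function on synthetic data, but any fitted `predict` would do:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = 2.0 * X[:, 0] + 0.1 * X[:, 2] + rng.normal(scale=0.1, size=1000)

# Stand-in for any fitted model's predict(): the method only needs
# predictions, which is what makes it agnostic
predict = lambda Z: 2.0 * Z[:, 0] + 0.1 * Z[:, 2]
mse = lambda Z: np.mean((y - predict(Z)) ** 2)

base = mse(X)
imps = []
for j in range(3):
    Xs = X.copy()
    Xs[:, j] = rng.permutation(Xs[:, j])  # destroy this feature's information
    imps.append(mse(Xs) - base)           # loss increase = importance

print(imps)  # feature 0 dominates, feature 1 is useless, feature 2 is weak
```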

14.1.4 Partial dependence plot

Finally, we refer to Zhao and Hastie (2019) for a theoretical discussion on the causality property of PDPs. Indeed, a deep look at the construction of the PDPs suggests that they could be interpreted as a causal representation of the feature on the model’s output.

14.2 Local interpretations

14.2.1 LIME

14.2.2 Shapley values

14.2.3 Breakdown

Chapter 15 Two key concepts: causality and non-stationarity

training a computer vision algorithm to discriminate between cows and camels will lead the algorithm to focus on grass versus sand! This is because most camels are pictured in the desert while cows are shown in green fields of grass. Thus, a picture of a camel on grass will be classified as cow while a cow on sand would be labelled “camel”. It is only with pictures of these two animals in different contexts (environments) that the learner will end up truly finding what makes a cow and a camel. A camel will remain a camel no matter where it is pictured: it should be recognized as such by the learner. If so, the representation of the camel becomes invariant over all datasets and the learner has discovered causality, i.e., the true attributes that make a camel a camel.

In finance, it is not obvious that invariance may exist. Market conditions are known to be time-varying and the relationships between firm characteristics and returns also change from year to year.

In Chapter 13, we advocate to do that by updating models as frequently as possible with rolling training sets: this allows the predictions to be based on the most recent trends. In Section 15.2 below, we introduce other theoretical and practical options.

Recommend reading:

Pearl, Judea. 2009. Causality: Models, Reasoning and Inference. Second Edition. Vol. 29. Cambridge University Press.

which I did a couple of years ago. I would benefit from a more careful study of the book and its developments…

Common problems in machine learning (statistics):

  • covariate shift: $P_X$ changes but $P_{Y|X}$ does not: the features have a fluctuating distribution, but their relationship with $Y$ holds still;
  • concept drift: $P_{Y|X}$ changes but $P_X$ does not: feature distributions are stable, but their relation to $Y$ is altered.

In factor investing, the feature engineering process (see Section 5.4) is partly designed to bypass the risk of covariate shift. Uniformization guarantees that the marginals stay the same but correlations between features may of course change.

The main issue is probably concept drift when the way features explain the label changes through time.

In factor models, changes are presumably a combination of all four types: they can be abrupt during crashes, but most of the time they are progressive (gradual or incremental) and never ending (continuously recurring).

Naturally, if we acknowledge that the environment changes, it appears logical to adapt models accordingly, i.e., dynamically. This gives rise to the so-called stability-plasticity dilemma. This dilemma is a trade-off between model reactiveness (new instances have an important impact on updates) versus stability (these instances may not be representative of a slower trend and they may thus shift the model in a suboptimal direction).

15.2.2 Online learning

Online learning, combined with early stopping for neural networks, is applied to factor investing in Wong et al. (2020).

15.2.3 Homogeneous transfer learning

Koshiyama, Adriano, Sebastian Flennerhag, Stefano B Blumberg, Nick Firoozye, and Philip Treleaven. 2020. “QuantNet: Transferring Learning Across Systematic Trading Strategies.” arXiv Preprint, no. 2004.03445.

Chapter 16 Unsupervised learning

PCA, autoencoders, k-means, k-NN.

Chapter 17 Reinforcement learning

Introduction to RL. Not many insights for investing.

This curse of dimensionality is accompanied by fundamental question of training data. Two options are conceivable: market data versus simulations. Under a given controlled generator of samples, it is hard to imagine that the algorithm will beat the solution that maximizes a given utility function. If anything, it should converge towards the static optimal solution under a stationary data generating process (see, e.g. Chaouki et al. (2020) for trading tasks), which is by the way a very strong modelling assumption.

This leaves market data as a preferred solution but even with large datasets, there is little chance to cover all the (actions, states) combinations mentioned above. Characteristics-based datasets have depths that run through a few decades of monthly data, which means several hundreds of time-stamps at most. This is by far too limited to allow for a reliable learning process. It is always possible to generate synthetic data (as in Yu et al. (2019)), but it is unclear that this will solidly improve the performance of the algorithm.

I don’t think the RL field is ripe for applications in portfolio management and investing, and neither do my Google Brain friends who are RL specialists.

Stochastic control for market-making, sure, and there are already applications in production.

Chapter 18 (Data description) and Chapter 19 (Solutions to exercises) are for those who want, and have the time, to replicate the experiments in the book. Especially relevant for a student or a fresh grad preparing for interviews.