Which machine learning features most improve short-term stock return forecasts?

Short-term stock return forecasts improve most when machine learning models are fed features that capture immediate market dynamics, validated risk factors, and alternative signals that reflect investor behavior. Empirical finance and applied machine learning emphasize careful feature engineering, robust validation, and awareness of trading frictions to translate statistical gains into economic value.

Market microstructure and time-series features

Features derived from the limit order book, such as order imbalance, bid-ask spread, depth at the top levels, and short-horizon trade sign sequences, are consistently informative for intraday and next-day returns. Marcos López de Prado at Cornell University and Guggenheim Partners argues that microstructure-aware features and event-based labeling reduce look-ahead bias and improve real-world performance. Traditional time-series transformations remain essential, including short-window momentum and volatility measures, rolling return ranks, and decay-weighted averages that capture recent price pressure. Tree-based models and regularized linear models often rank these features as high-importance predictors, but the apparent edge can vanish once trading costs and market impact are accounted for.
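As a rough illustration of the features named above, the sketch below computes top-of-book order imbalance, relative bid-ask spread, a decay-weighted average, and trailing momentum. The `Quote` schema, the function names, and the half-life parameterization are assumptions made for this example, not a standard API; a production pipeline would more likely use a dataframe library and deeper book levels.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Quote:
    """Top-of-book snapshot (hypothetical schema for illustration)."""
    bid: float
    ask: float
    bid_size: float
    ask_size: float


def order_imbalance(q: Quote) -> float:
    """Signed imbalance in [-1, 1]; positive means more resting size on the bid."""
    total = q.bid_size + q.ask_size
    return 0.0 if total == 0 else (q.bid_size - q.ask_size) / total


def relative_spread(q: Quote) -> float:
    """Bid-ask spread divided by the mid price."""
    mid = (q.bid + q.ask) / 2.0
    return (q.ask - q.bid) / mid


def ewma(values: List[float], halflife: float) -> List[float]:
    """Decay-weighted average that uses only past and current values."""
    alpha = 1.0 - 0.5 ** (1.0 / halflife)
    out: List[float] = []
    avg: Optional[float] = None
    for v in values:
        avg = v if avg is None else alpha * v + (1.0 - alpha) * avg
        out.append(avg)
    return out


def momentum(prices: List[float], window: int) -> List[Optional[float]]:
    """Trailing return over `window` steps; None until enough history exists."""
    return [
        None if i < window else prices[i] / prices[i - window] - 1.0
        for i in range(len(prices))
    ]


if __name__ == "__main__":
    q = Quote(bid=100.0, ask=100.1, bid_size=300, ask_size=100)
    print(order_imbalance(q), relative_spread(q))
    print(ewma([0.0, 1.0, 1.0], halflife=1.0))
    print(momentum([100.0, 110.0, 121.0], window=1))
```

Every function here looks backward only, so the value at step t never depends on data after t, which is the property that keeps such features free of look-ahead.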

Alternative data, sentiment, and factor signals

Combining classic risk factors from Eugene Fama at the University of Chicago and Kenneth French at Dartmouth College with alternative signals such as news sentiment, search volume, and earnings surprises enhances short-horizon forecasts when each signal is aligned to the time the underlying event became public. Natural language processing embeddings of corporate announcements and social media sentiment provide incremental signal during volatile windows, but require domain-specific preprocessing to filter noise. Cultural and regional differences matter: language, reporting norms, and trading hours change the signal-to-noise ratio between markets, so models and features transfer poorly from one region to another.
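Aligning a signal to event timing can be sketched as a backward "as-of" lookup: at each decision time, use only the most recent score published strictly earlier. The function name `last_known_score` and the integer timestamps are hypothetical, and both timestamp lists are assumed to be sorted in ascending order.

```python
from bisect import bisect_left
from typing import List, Optional


def last_known_score(
    bar_times: List[int],
    event_times: List[int],
    scores: List[float],
) -> List[Optional[float]]:
    """For each bar time, return the latest score published strictly before it.

    `event_times` must be sorted ascending and aligned with `scores`.
    A score stamped exactly at the bar time is treated as not yet visible.
    """
    out: List[Optional[float]] = []
    for t in bar_times:
        k = bisect_left(event_times, t)  # count of events with time < t
        out.append(scores[k - 1] if k > 0 else None)
    return out


if __name__ == "__main__":
    bars = [10, 20, 30]
    news_times = [5, 20, 25]
    sentiment = [0.1, -0.4, 0.7]
    print(last_known_score(bars, news_times, sentiment))
```

The strict inequality is a deliberately conservative choice: an article stamped at the same instant as the bar is assumed to arrive too late to trade on, so it first appears in the features of the following bar.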

Feature engineering best practices emphasize cross-validated feature selection, meta-labeling (a secondary model that decides whether a primary signal is worth trading) to separate predictability from execution viability, and explainable importance measures that help expose spurious features produced by data snooping. Consequences of neglecting these practices include overfitting, false discovery of alpha, and deployment losses once slippage and capacity constraints are considered. There is also an environmental and ethical dimension: training ever-larger models increases energy use and concentrates advantages among firms that control granular market data. High-quality short-term forecasting therefore combines microstructure-aware features, validated factor exposures, and judicious use of alternative data, all evaluated under realistic trading constraints.
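Two of the safeguards above can be sketched in a few lines: a walk-forward split that leaves a gap between training and test samples so that overlapping labels do not leak, and a net-of-cost return calculation. This is a simplified stand-in, not the purged cross-validation scheme from the literature; the function names, the fold layout, and the proportional cost model are assumptions made for illustration.

```python
from typing import Iterator, List, Tuple


def walk_forward_splits(
    n_samples: int, n_folds: int, gap: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs in time order.

    Training data always precedes the test block, and the last `gap`
    samples before the test block are dropped from training.
    """
    fold = n_samples // (n_folds + 1)
    for k in range(1, n_folds + 1):
        test_start = k * fold
        test_end = n_samples if k == n_folds else test_start + fold
        train_end = max(0, test_start - gap)
        yield list(range(train_end)), list(range(test_start, test_end))


def net_returns(
    positions: List[float],
    asset_returns: List[float],
    cost_per_turnover: float,
) -> List[float]:
    """Per-period strategy return after a cost proportional to position change.

    positions[i] is the position held over period i; the book starts flat.
    """
    out: List[float] = []
    prev = 0.0
    for pos, r in zip(positions, asset_returns):
        out.append(pos * r - cost_per_turnover * abs(pos - prev))
        prev = pos
    return out


if __name__ == "__main__":
    for train, test in walk_forward_splits(n_samples=12, n_folds=3, gap=2):
        print(train, test)
    print(net_returns([1.0, 1.0, -1.0], [0.01, -0.02, 0.03], 0.001))
```

Scoring each candidate feature set on net returns inside the test folds, rather than on raw forecast accuracy, is one way to check whether a statistical gain survives slippage before any capital is committed.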