The clearest and most stable predictive evidence.
Project Overview
Machine Learning in ESG Investment Signal Generation
An HKU Final Year Project on how SEC 10-K and 10-Q filings can be used to build, test, and interpret ESG-related investment signals.
The project follows one practical question from start to finish: when does filing-based ESG text carry useful information, and how much of that signal survives once the setting moves from model validation to constrained portfolio use?
Tickers with both filing and price coverage.
The strongest standalone predictive block on the large sample.
Useful signal evidence, but not yet broad deployable alpha.
About the Project
This site is a guided view of the full research project.
The work sits between natural language processing, empirical asset pricing, and practical portfolio testing. It is not a pure text-classification exercise, and its final result is not a finished trading system. Instead, it asks a narrower and more useful question: what can be learned from SEC disclosure text, and where does that evidence hold up under stronger testing?
The website gives the short version. The full report, methods notes, and results summaries remain the detailed record behind it.
Project Info
- Title: Machine Learning in ESG Investment Signal Generation
- Institution: The University of Hong Kong
- Project Type: Final Year Project
- Supervisor: Prof. Zou Difan
- Core Data: SEC 10-K and 10-Q filings
What the Site Covers
- Why the project starts with SEC text rather than ESG ratings alone
- How the signal is built through sentiment, topic, and event features
- What the predictive results show across horizons and feature families
- Why portfolio conversion is harder than prediction alone
- Where the current evidence ends, and why that matters
Why SEC Text
Why start with SEC filings instead of ESG ratings alone?
Third-party ESG ratings are convenient, but they are also inconsistent across providers and difficult to audit as a clean signal source. SEC filings offer something better for this specific research question: standardized disclosure, exact filing dates, and a text source that can be aligned to post-filing returns without hand-waving.
That choice matters because the project is not trying to build a generic ESG dashboard. It is trying to test whether disclosure-based ESG information contains stable medium-to-long horizon signal content, and whether that signal can survive the move from prediction into constrained portfolio use.
Why not ratings alone?
Provider disagreement and opaque methodology make them a weak foundation for clean signal extraction.
Why filings?
They are time-stamped, auditable, and better suited to reproducible time-aligned modeling.
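As a minimal sketch of what time alignment can look like, the snippet below maps a filing date to the first trading day on which a position could be opened. It is an illustration under stated assumptions, not the project's own code: the function name `first_tradable_index`, the toy calendar, and the conservative rule of entering strictly after the filing date are all hypothetical choices.

```python
from bisect import bisect_right
from datetime import date
from typing import List, Optional


def first_tradable_index(trading_days: List[date], filing_date: date) -> Optional[int]:
    """Return the index of the first trading day strictly after the filing date.

    Assumption: entering strictly after the filing date avoids using a close
    price that may predate the filing's release. `trading_days` must be sorted.
    Returns None when no later trading day exists in the calendar.
    """
    i = bisect_right(trading_days, filing_date)
    return i if i < len(trading_days) else None


# Toy calendar: one trading week plus the following Monday (hypothetical dates).
calendar = [
    date(2024, 1, 2),
    date(2024, 1, 3),
    date(2024, 1, 4),
    date(2024, 1, 5),
    date(2024, 1, 8),
]

friday_filing = first_tradable_index(calendar, date(2024, 1, 5))
weekend_filing = first_tradable_index(calendar, date(2024, 1, 6))
last_day_filing = first_tradable_index(calendar, date(2024, 1, 8))
```

A filing on Friday and a filing on Saturday both resolve to the following Monday, and a filing on the last calendar day has no valid entry point, so it would be dropped rather than labeled.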
Framework
The project is built as one connected research loop.
Rather than a single model run, the pipeline cleans SEC filings into analysis-ready text, turns that text into multiple feature families, tests those features against several target forms, and pushes the results through a validation and portfolio-conversion layer before any economic claim is made.
Feature Family
Sentiment
FinBERT-based signals capture tone and directional pressure in filing language.
Feature Family
Topic
LDA-based features trace recurring disclosure themes that sentiment alone cannot summarize.
Feature Family
Event and Entity
Entity-linked events capture concrete governance, regulatory, safety, and environmental disclosures.
Data and Scope
From core sample to expanded universe
The project starts from a smaller validated sample, then reruns the main comparison logic on an expanded universe of roughly 200 tickers with filing and price coverage.
Horizon Design
Short horizon is a baseline, not the headline
21D is kept as a contrast case, while 126D and especially 252D carry the main empirical weight of the project.
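The three horizons can be read as forward-return windows measured in trading days. The sketch below shows one plausible way to label a filing at 21, 126, and 252 days; the helper names, the use of simple (not excess or log) returns, and the toy price path are assumptions, since the site does not specify the exact target definition.

```python
from typing import Dict, List, Optional

# Horizons named on this page, in trading days.
HORIZONS = (21, 126, 252)


def forward_return(prices: List[float], start: int, horizon: int) -> Optional[float]:
    """Simple return from prices[start] to prices[start + horizon].

    Returns None when the window runs past the available data, so that
    incomplete horizons are never silently filled.
    """
    end = start + horizon
    if start < 0 or end >= len(prices):
        return None
    return prices[end] / prices[start] - 1.0


def label_filing(prices: List[float], entry_index: int) -> Dict[int, Optional[float]]:
    """Forward returns at every horizon for a position opened at entry_index."""
    return {h: forward_return(prices, entry_index, h) for h in HORIZONS}


# Toy price path growing 0.1% per trading day (hypothetical data).
prices = [100.0 * (1.001 ** t) for t in range(300)]

early_labels = label_filing(prices, 10)   # all three windows fit
late_labels = label_filing(prices, 100)   # the 252D window does not fit
```

Returning `None` for an incomplete window keeps the longer horizons honest: a filing near the end of the sample still gets a 21D label but contributes nothing to the 252D evidence.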
Validation Logic
Evidence is judged in layers
The project moves from AUC and accuracy into IC, RankIC, stability, feature selection, neutralization, and then into constrained portfolio conversion.
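For readers unfamiliar with IC and RankIC, the sketch below uses their common definitions: IC as the Pearson correlation between a signal and subsequent returns across stocks, and RankIC as the same correlation computed on ranks (Spearman). This is a generic illustration in plain Python; the function names and toy numbers are hypothetical, and the project's own implementation may differ.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(vx * vy)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def information_coefficient(signal: Sequence[float], fwd: Sequence[float]) -> float:
    """IC: Pearson correlation between signal and forward returns."""
    return pearson(signal, fwd)


def rank_ic(signal: Sequence[float], fwd: Sequence[float]) -> float:
    """RankIC: Pearson correlation of the ranks (Spearman correlation)."""
    return pearson(average_ranks(signal), average_ranks(fwd))


# Toy cross-section of four stocks (hypothetical numbers).
signal = [0.1, 0.4, 0.2, 0.9]
fwd_returns = [0.01, 0.05, 0.02, 0.50]

ic = information_coefficient(signal, fwd_returns)
ric = rank_ic(signal, fwd_returns)
```

In this toy cross-section the signal orders the stocks perfectly, so RankIC is 1, while IC is lower because one outsized return bends the linear relationship. That gap is one reason rank-based measures are often reported alongside IC.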
Predictive Findings
The signal gets stronger as the horizon gets longer.
Short-horizon evidence is weak. Signal quality improves materially at 126 days and becomes clearest at 252 days. On the expanded universe, event-only is the strongest standalone predictive slice, while richer mixed stacks remain important once the question shifts from pure prediction toward broader validation and portfolio use.
Main Result
252D carries the cleanest signal
The project supports a medium-to-long horizon interpretation of SEC-text ESG information, not a short-term trading read.
Large-Sample Read
Event-only leads on standalone prediction
On the expanded universe, event-driven features produce the strongest standalone predictive slice at 252 days.
Interpretation
Prediction quality and portfolio usefulness split apart
The best predictive feature family is not always the one that survives best inside a constrained investment template.
Portfolio Conversion
The portfolio layer is where the project becomes more demanding.
This is also where the project becomes more honest. The first minimal backtest is weak, which shows that prediction alone is not enough. J2 recovers stronger economic value under sparse, high-conviction construction. J3 then adds more realistic constraints and keeps the story positive in absolute terms, while still leaving a meaningful gap to benchmark performance.
What the evidence supports
SEC-text signals can become economically visible
Especially under sparse, high-conviction extreme-bucket construction and balanced medium-horizon portfolio rules.
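One way to picture extreme-bucket construction is to hold only the highest-scored and lowest-scored names and leave everything in between at zero weight. The sketch below is a hedged illustration, not the project's J2 or J3 rules, which this page does not specify: the long-short layout, equal weighting, the `fraction` parameter, and the ticker names are all assumptions.

```python
from typing import Dict


def extreme_bucket_weights(scores: Dict[str, float], fraction: float = 0.1) -> Dict[str, float]:
    """Equal-weight long the top bucket and short the bottom bucket.

    Assumption: a dollar-neutral long-short layout. A long-only variant
    would keep only the positive weights. Names outside the two extreme
    buckets receive zero weight, which is what makes the portfolio sparse.
    """
    if not 0 < fraction <= 0.5:
        raise ValueError("fraction must be in (0, 0.5]")
    ranked = sorted(scores, key=lambda t: scores[t])  # ascending by score
    k = max(1, int(len(ranked) * fraction))
    if 2 * k > len(ranked):
        raise ValueError("not enough names to form two disjoint buckets")
    weights = {t: 0.0 for t in ranked}
    for t in ranked[-k:]:
        weights[t] = 1.0 / k    # long the highest scores
    for t in ranked[:k]:
        weights[t] = -1.0 / k   # short the lowest scores
    return weights


def portfolio_return(weights: Dict[str, float], returns: Dict[str, float]) -> float:
    """Weighted sum of realized returns over the holding period."""
    return sum(w * returns[t] for t, w in weights.items())


# Ten hypothetical tickers scored 0..9; realized returns rise with the score.
scores = {name: float(i) for i, name in enumerate("ABCDEFGHIJ")}
realized = {name: 0.01 * i for i, name in enumerate("ABCDEFGHIJ")}

weights = extreme_bucket_weights(scores, fraction=0.2)
held = [t for t, w in weights.items() if w != 0.0]
pnl = portfolio_return(weights, realized)
```

With ten names and a 20% bucket, only four positions are held. That sparsity is what makes the construction high-conviction, and it is also the source of the concentration concern raised under the limits of the evidence.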
What it does not support
Broad deployable alpha, yet
Benchmark lag, concentration, and the walk-forward hindsight gap still keep the current claim below a production-grade strategy statement.
Scope and Limits
This website presents the project at its real evidence boundary.
The validated core is the SEC-text pipeline. It supports a reproducible signal-extraction and validation framework, but it does not yet support a broad claim of benchmark-beating deployable alpha or a fully validated multi-source SEC plus news system.
What the project supports
- Reproducible ESG-related signal extraction from SEC filings
- Stronger medium-to-long horizon evidence, especially at 252D
- Economically visible signal under selective portfolio construction
What remains outside the main claim
- Broad deployable alpha under realistic benchmark comparison
- News as part of the main validated empirical pipeline
- Fully diversified live-ready strategy evidence
Deliverables
The site is the overview. The full record sits underneath it.
The project is deliberately documented as a reproducible research package. The website is the editorial layer; the report, methods notes, and results summaries remain the detailed record underneath it.
Method Notes
Methods
A condensed overview of the data pipeline, feature engineering stack, and validation logic.
Read methods
Results Summary
Results
The canonical summary of predictive, portfolio, and robustness findings across the project.
Read results
Scope Guardrail
Project Status
The single source of truth for what is implemented, validated, and still outside the main claim.
Check status
Repository Guide
README
The repo-level overview for setup, structure, and traceability across the research workflow.
Open README