Project Overview

Machine Learning in ESG Investment Signal Generation

An HKU Final Year Project on how SEC 10-K and 10-Q filings can be used to build, test, and interpret ESG-related investment signals.

The project follows one practical question from start to finish: when filing-based ESG text carries useful information, and how much of that signal survives once the setting moves from model validation to constrained portfolio use.

SEC 10-K / 10-Q · FinBERT / LDA / NER · Prediction to portfolio conversion


Primary horizon: 252D

The clearest and most stable predictive evidence.

Expanded universe: ~200

Tickers with both filing and price coverage.

Strongest standalone slice: Event-only

The strongest standalone predictive block on the large sample.

Final stance: Research-valid

Useful signal evidence, but not yet broad deployable alpha.

SEC TEXT

Project Arc

From filings to signal evidence

SEC text, machine learning, validation, and portfolio conversion in one research loop.

Sentiment

Topic

Event

About the Project

This site is a guided view of the full research project.

The work sits between natural language processing, empirical asset pricing, and practical portfolio testing. It does not treat the project as a pure text-classification exercise, and it does not present the final result as a finished trading system. Instead, it asks a narrower and more useful question: what can be learned from SEC disclosure text, and where does that evidence hold up under stronger testing?

The website gives the short version. The full report, methods notes, and results summaries remain the detailed record behind it.

Project Info

Title: Machine Learning in ESG Investment Signal Generation
Institution: The University of Hong Kong
Project Type: Final Year Project
Supervisor: Prof. Zou Difan
Core Data: SEC 10-K and 10-Q filings

What the Site Covers

Why the project starts with SEC text rather than ESG ratings alone
How the signal is built through sentiment, topic, and event features
What the predictive results show across horizons and feature families
Why portfolio conversion is harder than prediction alone
Where the current evidence ends, and why that matters

Why SEC Text

Why start with SEC filings instead of ESG ratings alone?

Third-party ESG ratings are convenient, but they are also inconsistent across providers and difficult to audit as a clean signal source. SEC filings offer something better for this specific research question: standardized disclosure, exact filing dates, and a text source that can be aligned to post-filing returns without hand-waving.

That choice matters because the project is not trying to build a generic ESG dashboard. It is trying to test whether disclosure-based ESG information contains stable medium-to-long horizon signal content, and whether that signal can survive the move from prediction into constrained portfolio use.

Why not ratings alone?

Provider disagreement and opaque methodology make them a weak foundation for clean signal extraction.

Why filings?

They are time-stamped, auditable, and better suited to reproducible time-aligned modeling.

Figure: Conceptual comparison between third-party ESG ratings and SEC filings, motivating the use of SEC 10-K and 10-Q filings as the core validated ESG text source.

Framework

The project is built as one connected research loop.

Rather than a single model run, the workflow has four stages: SEC filings are cleaned into analysis-ready text, turned into multiple feature families, tested against several target forms, and then pushed through a validation and portfolio-conversion layer before any economic claim is made.

Figure: Overall SEC-text ESG research framework used in the project.


Feature Family

Sentiment

FinBERT-based signals capture tone and directional pressure in filing language.


Feature Family

Topic

LDA-based features trace recurring disclosure themes that sentiment alone cannot summarize.


Feature Family

Event and Entity

Entity-linked events capture concrete governance, regulatory, safety, and environmental disclosures.

Data and Scope

From core sample to expanded universe

The project starts from a smaller validated sample, then reruns the main comparison logic on an expanded universe of roughly 200 tickers with filing and price coverage.

Horizon Design

Short horizon is a baseline, not the headline

21D is kept as a contrast case, while 126D and especially 252D carry the main empirical weight of the project.
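The horizon design can be pictured as a labeling step: each filing is assigned a forward return over 21, 126, and 252 trading days. The sketch below is illustrative only and is not the project's code. The function name `forward_returns`, the entry rule (first trading day strictly after the filing date), and the use of simple close-to-close returns are all assumptions.

```python
from bisect import bisect_right
from datetime import date

HORIZONS = (21, 126, 252)  # trading-day horizons named in the text


def forward_returns(trading_days, closes, filing_date, horizons=HORIZONS):
    """Map each horizon to a simple forward return for one filing.

    trading_days: sorted list of datetime.date objects, one per trading day.
    closes: closing prices aligned index-by-index with trading_days.
    Returns None for a horizon whose end date is beyond the price history.
    """
    # Assumed entry rule: first trading day strictly after the filing date,
    # so the label never uses prices from before the text was public.
    start = bisect_right(trading_days, filing_date)
    labels = {}
    for h in horizons:
        end = start + h
        if end < len(closes):
            labels[h] = closes[end] / closes[start] - 1.0
        else:
            labels[h] = None  # not enough future prices for this horizon
    return labels
```

Under these assumptions, a filing close to the end of the price history gets a 21D label but no 126D or 252D label, which is one reason the usable sample shrinks as the horizon grows.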

Validation Logic

Evidence is judged in layers

The project moves from AUC and accuracy into IC, RankIC, stability, feature selection, neutralization, and then into constrained portfolio conversion.
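For readers less familiar with IC and RankIC, the sketch below shows the usual definitions: IC as the Pearson correlation between model scores and realized forward returns in one cross-section, and RankIC as the same correlation computed on ranks (Spearman). This is a generic illustration with hypothetical function names, not the project's evaluation code, and it assumes both inputs are non-constant.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def average_ranks(values):
    """Ranks starting at 1; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def information_coefficient(scores, fwd_returns):
    """IC: linear correlation between scores and forward returns."""
    return pearson(scores, fwd_returns)


def rank_ic(scores, fwd_returns):
    """RankIC: correlation of ranks, robust to outlier returns."""
    return pearson(average_ranks(scores), average_ranks(fwd_returns))
```

RankIC only asks whether the ordering of scores matches the ordering of returns, so a single extreme return moves it far less than it moves IC.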

Predictive Findings

The signal gets stronger as the horizon gets longer.

Short-horizon evidence is weak. Signal quality improves materially at 126 days and becomes clearest at 252 days. On the expanded universe, event-only is the strongest standalone predictive slice, while richer mixed stacks remain important once the question shifts from pure prediction toward broader validation and portfolio use.

Figure: Combined horizon comparison and 252-day feature-family comparison. The main pattern is consistent: 252D is the strongest horizon, and the predictive winner is not automatically the full integrated stack.

Main Result

252D carries the cleanest signal

The project supports a medium-to-long horizon interpretation of SEC-text ESG information, not a short-term trading read.

Large-Sample Read

Event-only leads on standalone prediction

On the expanded universe, event-driven features produce the strongest standalone predictive slice at 252 days.

Interpretation

Prediction quality and portfolio usefulness split apart

The best predictive feature family is not always the one that survives best inside a constrained investment template.

Portfolio Conversion

The portfolio layer is where the project becomes more demanding.

It is also where the evidence is tested most honestly. The first minimal backtest is weak, which shows that prediction alone is not enough. The J2 variant recovers stronger economic value under sparse, high-conviction construction. The J3 variant then adds more realistic constraints and keeps results positive in absolute terms, while still leaving a meaningful gap to benchmark performance.
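As a rough illustration of what sparse, high-conviction construction can look like, the sketch below holds only the extreme buckets of a score ranking. It is not the J2 or J3 specification, which this page does not spell out. The equal weighting, the long-short form, the 10% default bucket size, and the function names are all assumptions made for the example.

```python
def extreme_bucket_weights(scores, fraction=0.1):
    """Equal-weight long the top bucket and short the bottom bucket.

    scores: dict mapping ticker -> model score for one rebalance date.
    fraction: share of the universe held on each side (assumed value).
    Tickers outside the two extreme buckets get zero weight.
    """
    ranked = sorted(scores, key=scores.get)  # lowest score first
    k = max(1, int(len(ranked) * fraction))
    if 2 * k > len(ranked):
        raise ValueError("universe too small for two disjoint buckets")
    weights = {ticker: 0.0 for ticker in ranked}
    for ticker in ranked[-k:]:  # highest scores
        weights[ticker] = 1.0 / k
    for ticker in ranked[:k]:  # lowest scores
        weights[ticker] = -1.0 / k
    return weights


def portfolio_return(weights, realized):
    """Weighted sum of realized returns over the holding period."""
    return sum(w * realized[ticker] for ticker, w in weights.items())
```

Because most of the universe carries zero weight, a portfolio like this can look strong on a handful of names, which is why concentration needs to be checked separately.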

Figure: Strategy progression from the weak minimal baseline toward the stronger J2 and J3 portfolio-conversion variants.

Figure: Prediction winner versus portfolio winner. The strongest predictive SEC slice and the strongest constrained portfolio-use signal are still not the same object.

Figure: Static best strategy versus walk-forward strategy. Walk-forward results remain positive, but give up a visible amount of return and Sharpe ratio relative to the static best row.

Figure: Concentration diagnostics for portfolio results. Stronger returns are still concentrated, which is why the final claim stays at a research-valid signal framework rather than deployable alpha.
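This page does not say which concentration measures the project uses. As a hedged illustration, two common choices are the share of positive return contribution coming from the top few names and a Herfindahl-style index over absolute contributions. The function names and the per-ticker contribution input below are hypothetical.

```python
def top_k_share(contributions, k=5):
    """Share of total positive contribution earned by the k largest winners.

    contributions: dict mapping ticker -> return contribution.
    """
    positives = sorted((c for c in contributions.values() if c > 0), reverse=True)
    total = sum(positives)
    if total == 0:
        return 0.0
    return sum(positives[:k]) / total


def herfindahl(contributions):
    """Sum of squared shares of absolute contribution.

    Equals 1/n when n names contribute equally and approaches 1 when
    a single name dominates.
    """
    total = sum(abs(c) for c in contributions.values())
    if total == 0:
        return 0.0
    return sum((abs(c) / total) ** 2 for c in contributions.values())
```

A result that depends on one or two names will show a high top-k share even when the headline return looks healthy.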

What the evidence supports

SEC-text signals can become economically visible

Especially under sparse, high-conviction extreme-bucket construction and balanced medium-horizon portfolio rules.

What it does not support

Broad deployable alpha, at least not yet

Benchmark lag, concentration, and the walk-forward hindsight gap still keep the current claim below a production-grade strategy statement.
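The hindsight gap arises because a static best row is chosen with knowledge of the full sample, while a walk-forward run only uses data available before each test window. The sketch below shows one generic way to generate such splits. It is an assumption-laden illustration, not the project's procedure: the rolling window, the parameter names, and the optional `embargo` gap (useful when long-horizon labels overlap the test window) are all hypothetical.

```python
def walk_forward_splits(n_periods, train_size, test_size, embargo=0):
    """Yield (train, test) index ranges that move forward through time.

    Each training window ends before its test window starts. An optional
    embargo skips periods in between so that forward-return labels from
    the training window do not overlap the test window.
    """
    start = 0
    while start + train_size + embargo + test_size <= n_periods:
        train = range(start, start + train_size)
        test_start = start + train_size + embargo
        test = range(test_start, test_start + test_size)
        yield train, test
        start += test_size  # roll the window forward by one test block
```

Any model or parameter choice made inside a split can only see its own training range, which is why walk-forward results typically sit below the best static result.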

Scope and Limits

This website presents the project at its real evidence boundary.

The validated core is the SEC-text pipeline. It supports a reproducible signal-extraction and validation framework, but it does not yet support a broad claim of benchmark-beating deployable alpha or a fully validated multi-source SEC plus news system.

What the project supports

Reproducible ESG-related signal extraction from SEC filings
Stronger medium-to-long horizon evidence, especially at 252D
Economically visible signal under selective portfolio construction

What remains outside the main claim

Broad deployable alpha under realistic benchmark comparison
News as part of the main validated empirical pipeline
Fully diversified live-ready strategy evidence

Deliverables

The site is the overview. The full record sits underneath it.

The project is deliberately documented as a reproducible research package: the website is the editorial layer, while the report, methods notes, and results summaries hold the detail.

Method Notes

Methods

A condensed overview of the data pipeline, feature engineering stack, and validation logic.


Results Summary

Results

The canonical summary of predictive, portfolio, and robustness findings across the project.


Scope Guardrail

Project Status

The single source of truth for what is implemented, validated, and still outside the main claim.


Repository Guide

README

The repo-level overview for setup, structure, and traceability across the research workflow.
