
Data as a Factor

One of the most important ideas in shadow-factor is that source data should already be usable as factor material.

What This Means

In many research stacks, users first export raw data, then hand-write separate transformation code, and only then obtain a factor.

shadow-factor shortens that path.

It treats base fields as inputs to a factor system from the beginning.

Examples of useful base fields include:

  • net_profit
  • operating_revenue
  • total_assets
  • total_equity
  • market_cap

These fields can be used directly or transformed through DSL operators.

Direct Field Access

A field can already behave like a factor-shaped series once it is evaluated over a set of trade dates and a stock universe.

Examples:

"net_profit_mrq_0"
"operating_revenue_mrq_0"
"total_assets_mrq_0"

This is useful when you want the latest visible reported value rather than a derived formula.
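To make "latest visible reported value" concrete, here is a minimal Python sketch of a point-in-time lookup. It is not shadow-factor's implementation: the `Report` class, the `latest_visible` function, and the sample figures are all hypothetical, and it assumes that "visible" means "announced on or before the trade date".

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Report:
    """Hypothetical record for one reported figure (not a shadow-factor type)."""
    period_end: date   # fiscal period the figure belongs to
    announced: date    # date the figure became public
    value: float


def latest_visible(reports: list[Report], trade_date: date) -> Optional[float]:
    """Return the value for the most recent fiscal period announced by trade_date."""
    # Assumption: a report is visible once its announcement date has passed.
    visible = [r for r in reports if r.announced <= trade_date]
    if not visible:
        return None  # nothing had been published yet
    return max(visible, key=lambda r: r.period_end).value


# Hypothetical net profit figures for one stock.
reports = [
    Report(date(2023, 12, 31), date(2024, 3, 28), 120.0),
    Report(date(2024, 3, 31), date(2024, 4, 29), 35.0),
]

print(latest_visible(reports, date(2024, 4, 15)))  # Q1 not yet announced: 120.0
print(latest_visible(reports, date(2024, 4, 30)))  # Q1 now visible: 35.0
```

Evaluating such a lookup for every stock on every trade date is what would give a raw field its factor-like shape.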

From Data to Factor

The next step is to treat those fields as ingredients for better research features.

Examples:

"TTM(net_profit)"
"YoY(TTM(operating_revenue))"
"SafeDiv(TTM(net_profit), total_assets)"

This lets you move smoothly through three stages:

  • raw field
  • time-aware transformed field
  • fully defined business factor
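The three stages can be illustrated with a small Python sketch for a single stock. The functions `ttm` and `safe_div` are hypothetical stand-ins for what the `TTM` and `SafeDiv` operators might compute; the actual DSL semantics may differ. The sketch assumes single-quarter (not cumulative) values and assumes that a zero or missing denominator produces NaN.

```python
import math


def ttm(quarterly: list[float]) -> float:
    """Trailing twelve months: sum of the four most recent single-quarter values."""
    if len(quarterly) < 4:
        return math.nan  # not enough history for a full year
    return sum(quarterly[-4:])


def safe_div(numerator: float, denominator: float) -> float:
    """Division that returns NaN instead of raising on a zero or missing input."""
    if denominator == 0 or math.isnan(numerator) or math.isnan(denominator):
        return math.nan
    return numerator / denominator


# Hypothetical data, ordered oldest to newest.
net_profit_q = [10.0, 12.0, 8.0, 15.0, 11.0]
total_assets = 400.0

raw_field = net_profit_q[-1]                        # raw field
transformed = ttm(net_profit_q)                     # time-aware transformed field
roa_like = safe_div(transformed, total_assets)      # fully defined business factor

print(raw_field, transformed, roa_like)
```

Each stage reuses the output of the previous one, which is the progression that an expression like "SafeDiv(TTM(net_profit), total_assets)" writes in a single line.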

Why This Matters

This approach is useful because it:

  • reduces manual preprocessing code
  • keeps the research path shorter and easier to audit
  • makes factor definitions easier to compare across strategies
  • keeps data semantics closer to the final factor logic

Good Examples

Profitability

"SafeDiv(TTM(net_profit), total_assets)"

Interpretation: use base accounting fields to create an ROA-like factor.

Growth

"YoY(TTM(net_profit))"

Interpretation: growth is not stored as a separate raw column; it is generated from the same underlying data.
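As a worked illustration of growth being derived rather than stored, the Python sketch below computes a year-over-year change on a rolling TTM series. The helpers `ttm_series` and `yoy` are hypothetical, and the conventions used (a four-quarter lag, division by the absolute base value, NaN when history is missing) are assumptions, not confirmed behavior of the `YoY` operator.

```python
import math


def ttm_series(quarterly: list[float]) -> list[float]:
    """Rolling four-quarter sums; the first three entries lack history and are NaN."""
    return [
        math.nan if i < 3 else sum(quarterly[i - 3 : i + 1])
        for i in range(len(quarterly))
    ]


def yoy(series: list[float], lag: int = 4) -> list[float]:
    """Relative change versus the value `lag` periods earlier."""
    out = []
    for i, current in enumerate(series):
        if i < lag:
            out.append(math.nan)  # no value from a year ago
            continue
        base = series[i - lag]
        if math.isnan(current) or math.isnan(base) or base == 0:
            out.append(math.nan)
        else:
            # Assumption: divide by |base| so a negative base keeps a sensible sign.
            out.append((current - base) / abs(base))
    return out


# Hypothetical single-quarter net profit, oldest to newest (two fiscal years).
net_profit_q = [10.0, 10.0, 10.0, 10.0, 11.0, 12.0, 13.0, 14.0]

ttm_values = ttm_series(net_profit_q)  # last value: 11 + 12 + 13 + 14 = 50
growth = yoy(ttm_values)               # last value: (50 - 40) / 40 = 0.25

print(ttm_values[-1], growth[-1])
```

Only the most recent quarter has a defined growth value here, because the TTM value from a year earlier needs four quarters of its own history.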

Efficiency

"SafeDiv(TTM(operating_profit), TTM(operating_revenue))"

Interpretation: margins are factor outputs built from reusable financial statement fields.

Practical Takeaway

In shadow-factor, the question is often not:

  • “where do I export the data first?”

It is:

  • “which field or expression should become the factor?”