Building the Ultimate Trading Data Pipeline — Part 6: Populating factors and indicators in columnar files

Xavier Escudero
4 min read · Jan 12, 2024

In this article, we will use PySpark to produce multiple Parquet files, one per ticker/symbol. These files will contain various technical indicators, such as the Simple Moving Average (SMA), as well as aggregations like the maximum closing value for each quarter and year, log returns, and more.
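As a rough illustration of that goal (a sketch, not the series' exact code), the snippet below uses PySpark window functions to compute a 20-day SMA and daily log returns per ticker, then writes the result partitioned by ticker. The input path, column names, and window length are assumptions.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("factors-pipeline").getOrCreate()

# Hypothetical input: daily OHLC bars with columns ticker, date, open, high, low, close
bars = spark.read.parquet("data/daily_bars")

# Per-ticker window ordered by date
w = Window.partitionBy("ticker").orderBy("date")

bars = (
    bars
    # 20-day Simple Moving Average of the close (current row plus the 19 preceding rows)
    .withColumn("sma_20", F.avg("close").over(w.rowsBetween(-19, 0)))
    # Daily log return: ln(close_t / close_{t-1})
    .withColumn("log_return", F.log(F.col("close") / F.lag("close").over(w)))
)

# Write one Parquet folder per ticker
bars.write.mode("overwrite").partitionBy("ticker").parquet("output/factors")
```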

In our previous articles, our primary choice for storing financial data has been TimescaleDB tables. However, for read-heavy workloads and extensive data analysis, columnar storage formats such as Parquet offer notable advantages: they scale horizontally, support straightforward schema evolution, and integrate well with big data processing frameworks.

Following the same approach as in our prior articles, we’ll break the procedure down into three distinct stages: Extract, Transform, and Load (ETL):

  1. Extract: Obtain the Open, High, Low, Close (OHLC) data from the daily_bars table, reading in parallel with partitions organized by date for better parallel processing efficiency (see the sketch after this list).
  2. Transform: Perform computations for factors and…
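Below is a minimal, hedged sketch of how the Extract and Transform steps could look: a parallel JDBC read of daily_bars from TimescaleDB, partitioned by date, followed by quarterly and yearly maximum-close aggregations. The connection URL, credentials, bounds, and column names are placeholders, not the article's actual configuration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("extract-daily-bars").getOrCreate()

# Extract: read the daily_bars table from TimescaleDB in parallel,
# splitting the JDBC reads into partitions by date (all connection
# details and bounds below are placeholder values)
daily_bars = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/trading")
    .option("dbtable", "daily_bars")
    .option("user", "postgres")
    .option("password", "password")
    .option("partitionColumn", "date")
    .option("lowerBound", "2010-01-01")
    .option("upperBound", "2024-01-01")
    .option("numPartitions", 8)
    .load()
)

# Transform: example aggregations — maximum close per ticker and quarter
quarterly_max = (
    daily_bars
    .withColumn("year", F.year("date"))
    .withColumn("quarter", F.quarter("date"))
    .groupBy("ticker", "year", "quarter")
    .agg(F.max("close").alias("max_close_quarter"))
)

# Maximum close per ticker and year
yearly_max = (
    daily_bars
    .withColumn("year", F.year("date"))
    .groupBy("ticker", "year")
    .agg(F.max("close").alias("max_close_year"))
)
```

Partitioning the JDBC read by date lets Spark issue several range queries against TimescaleDB concurrently instead of pulling the whole table through a single connection.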
