Building the Ultimate Trading Data Pipeline — Part 6: Populating factors and indicators in columnar files
In this write-up, we will use PySpark to produce one Parquet file per ticker/symbol. Each file will contain technical indicators such as the Simple Moving Average (SMA), along with aggregations such as the maximum closing price for each quarter and year, log returns, and more.
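As a preview of what the Transform and Load stages will produce, here is a minimal PySpark sketch. It assumes a `bars` DataFrame of daily OHLC rows with `symbol`, `date`, and `close` columns; the input path, the 20-day window length, and the output path are placeholders rather than the final pipeline.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("factors-preview").getOrCreate()

# Placeholder input: daily OHLC rows (the actual extract from daily_bars is covered below).
bars = spark.read.parquet("/data/daily_bars")

w = Window.partitionBy("symbol").orderBy("date")

enriched = (
    bars
    # 20-day Simple Moving Average of the close
    .withColumn("sma_20", F.avg("close").over(w.rowsBetween(-19, 0)))
    # Daily log return: ln(close_t / close_{t-1})
    .withColumn("log_return", F.log(F.col("close") / F.lag("close", 1).over(w)))
    # Year/quarter columns so we can aggregate per period
    .withColumn("year", F.year("date"))
    .withColumn("quarter", F.quarter("date"))
)

# Example aggregation: maximum closing price per symbol and quarter
max_close_by_quarter = (
    enriched.groupBy("symbol", "year", "quarter")
            .agg(F.max("close").alias("max_close"))
)

# One Parquet directory per ticker, suited to read-heavy analysis.
enriched.write.mode("overwrite").partitionBy("symbol").parquet("/data/factors")
```

Writing with `partitionBy("symbol")` yields one directory of Parquet files per ticker, which is a straightforward way to get per-symbol outputs that downstream tools can read selectively.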
In our previous articles, TimescaleDB tables have been our primary store for financial data. For read-heavy workloads and extensive data analysis, however, columnar storage formats such as Parquet offer notable advantages: they scale horizontally, support straightforward schema evolution, and integrate well with big data processing frameworks.
As in our prior articles, we'll break the procedure into three stages: Extract, Transform, and Load (ETL):
- Extract: Read the Open, High, Low, Close (OHLC) data from the `daily_bars` table, partitioning the read by date for efficient parallel processing (see the sketch after this list).
- Transform: Compute the factors and…