Member-only story
Trading Data Analytics — Part 1: First steps with DuckDB and Parquet Files

In this article we’ll integrate DuckDB to obtain the parquet files stored previously in MinIO and execute our first queries.
We’ll use DuckDB as the analytical execution engine that allows us to run SQL queries directly on our existing datasets such as Pandas DataFrames, Polars DataFrames, CSV files, parquet files or traditional databases as Postgres. With that in mind we can focus on using SQL queries to extract the data we want.
Install JupyterLab Desktop App for your operating system. We’ll use it for exploration and experimentation. After conducting experimentation and analysis within JupyterLab, we may transition it to building applications, using for example Streamlit.
Start JupyterLab and select File > New > Notebook.
In a new cell copy the next content:
import duckdb
bucket_name = 'screener'
conn = duckdb.connect()
conn.execute("INSTALL httpfs")
conn.execute("LOAD httpfs")
conn.execute("SET s3_region='us-east-1'")
conn.execute("SET s3_url_style='path'")
conn.execute("SET s3_endpoint='server:9000'")
conn.execute("SET s3_use_ssl=false")
conn.execute("SET s3_access_key_id='minio_user'")
conn.execute("SET s3_secret_access_key='minio_password'")
conn.execute(f"CREATE TABLE screener AS SELECT * FROM read_parquet('s3://{bucket_name}/screener.parquet/**/*.parquet')")
Invoking the duckdb.connect()
method without arguments we get a connection, to use an in-memory DuckDB database.
After that we install the HTTP File System (HTTPFS) extension for DuckDB, so we can interact with our S3 remote file system (MinIO), and we configure the connection environment variables. We deployed MinIO without SSL, so we need to set the connection s3_use_ssl
to false
.
Finally at conn.execute(f"CREATE TABLE screener AS SELECT * FROM read_parquet('s3://{bucket_name}/screener.parquet/**/*.parquet')")
we're creating a table in the DuckDB database by reading data from all parquet files located in the specified S3 bucket.
Now we can query DuckDB using SQL to check the number of rows:
total_rows_result = conn.execute("SELECT COUNT(*) AS total_rows FROM screener")
total_rows =…