Automating data loading from multiple files
In this example, we will show how to load data stored as multiple .parquet files in S3 storage into Tengri.
Using the Python boto3 module, we will list the paths of all .parquet files under a particular prefix:
import boto3

# Prefix under which the target .parquet files are stored
path = 'Stage/<lake_path>/DataLake/2049/'

s3_client = boto3.client('s3',
                         endpoint_url='***',
                         aws_access_key_id='***',
                         aws_secret_access_key='***',
                         region_name='us-east-1')

# Walk through all buckets and print every key that matches the prefix
response = s3_client.list_buckets()
for bucket in response['Buckets']:
    print(f'Bucket: {bucket["Name"]}')
    objects = s3_client.list_objects_v2(
        Bucket=bucket["Name"],
        Prefix='Stage')
    for content in objects.get('Contents', []):
        file_path = str(content['Key'])
        if path in file_path:
            print(file_path)
Bucket: prostore
Stage/<lake_path>/DataLake/2049/1.parquet
Stage/<lake_path>/DataLake/2049/2.parquet
Stage/<lake_path>/DataLake/2049/3.parquet
Stage/<lake_path>/DataLake/2049/4.parquet
Stage/<lake_path>/DataLake/2049/5.parquet
Bucket: s3bench-asusnode21
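Note that a single list_objects_v2 call returns at most 1000 keys. If a prefix may hold more files than that, a paginator handles the continuation tokens automatically; here is a minimal sketch that reuses the s3_client and path defined above:
# Collect every matching key, however many pages the listing spans
paginator = s3_client.get_paginator('list_objects_v2')
parquet_keys = []
for page in paginator.paginate(Bucket='prostore', Prefix=path):
    for content in page.get('Contents', []):
        if content['Key'].endswith('.parquet'):
            parquet_keys.append(content['Key'])
print(parquet_keys)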
Let’s run an SQL query against one of these files to make sure that the query is processed and the data is displayed in the output:
SELECT *
FROM read_parquet('s3://prostore/Stage/<lake_path>/DataLake/2049/1.parquet')
LIMIT 5;
+-----------------------------------------------------------------------------------------------------------------------------+-------+---------------------+----------------------------+----------+----------+----------------------------+-------------------------+
| f1 | f2 | f3 | f4 | jd1 | jd2 | command_load_datetime | command_load_session_id |
+-----------------------------------------------------------------------------------------------------------------------------+-------+---------------------+----------------------------+----------+----------+----------------------------+-------------------------+
| {"id": 673, "age": 77, "name": "d1b7d724ef3a8451189fc47da62b9888", "email": "b32c9c5d3b1768e7955144103d72c593@example.com"} | false | 0001-01-02 06:44:06 | <memory at 0x7f0bb8523880> | 7000-1-1 | 7000-2-1 | 2025-06-23 08:05:23.636063 | 195 |
+-----------------------------------------------------------------------------------------------------------------------------+-------+---------------------+----------------------------+----------+----------+----------------------------+-------------------------+
| {"id": 221, "age": 34, "name": "4de35c863a2df55018eca39f130c0738", "email": "3535ac8c3b04c33ac67ee70693707bdb@example.com"} | true | 0001-01-02 02:59:53 | <memory at 0x7f0bb85219c0> | 7000-1-1 | 7000-2-1 | 2025-06-23 08:05:23.636063 | 195 |
+-----------------------------------------------------------------------------------------------------------------------------+-------+---------------------+----------------------------+----------+----------+----------------------------+-------------------------+
| {"id": 612, "age": 27, "name": "94a027942dffccb4abe3923914592ba5", "email": "10178786f4378e98f62be257c4215056@example.com"} | false | 0001-01-02 11:53:53 | <memory at 0x7f0bb8522c80> | 7000-1-1 | 7000-2-1 | 2025-06-23 08:05:23.636063 | 195 |
+-----------------------------------------------------------------------------------------------------------------------------+-------+---------------------+----------------------------+----------+----------+----------------------------+-------------------------+
| {"id": 136, "age": 73, "name": "500522adeb682c1efcdce8af83610a27", "email": "de507ff06310d2b35ea65c40f26f6a94@example.com"} | false | 0001-01-02 00:56:43 | <memory at 0x7f0bb8521fc0> | 7000-1-1 | 7000-2-1 | 2025-06-23 08:05:23.636063 | 195 |
+-----------------------------------------------------------------------------------------------------------------------------+-------+---------------------+----------------------------+----------+----------+----------------------------+-------------------------+
| {"id": 150, "age": 67, "name": "7d283d7c78b5692f6162d8b65baee1ed", "email": "df78cebca3c40d06e8b43ad8a7e98f56@example.com"} | true | 0001-01-02 13:46:41 | <memory at 0x7f0bb8521600> | 7000-1-1 | 7000-2-1 | 2025-06-23 08:05:23.636063 | 195 |
+-----------------------------------------------------------------------------------------------------------------------------+-------+---------------------+----------------------------+----------+----------+----------------------------+-------------------------+
Now we need to load the data from all the .parquet files into one table. To do this, first initialise the table: create it without writing any data to it yet, so that we can populate it in a loop in the next step. To initialise the table, execute a CREATE TABLE command with a SELECT * subquery over one of the files and a LIMIT 0 clause:
CREATE OR REPLACE TABLE raw_dyntest AS
SELECT *
FROM read_parquet('s3://prostore/Stage/<lake_path>/DataLake/2049/1.parquet')
LIMIT 0;
Done in 22.2 sec
+--------+
| status |
+--------+
| CREATE |
+--------+
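The CREATE status confirms the statement succeeded. Optionally, you can check that the empty table picked up the expected columns and types; a quick way, assuming the tngri.sql function used below returns a printable frame, is:
import tngri
# An empty result still carries the column names and types
print(tngri.sql("SELECT * FROM raw_dyntest LIMIT 0"))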
The table now has the required set of columns and their types, and we can load data into it in a loop. In a Python cell, we loop over the names of all the files and on each iteration call the tngri.sql function to load the data from one .parquet file into the initialised raw_dyntest table. After each insert we print the number of rows in the table, which should grow with every iteration.
import tngri

# Load each file in turn, printing the growing row count after every insert
for i in range(1, 6):
    file_name = f"s3://prostore/Stage/<lake_path>/DataLake/2049/{i}.parquet"
    insert_sql = f"insert into raw_dyntest select * from read_parquet('{file_name}')"
    print(insert_sql)
    tngri.sql(insert_sql)
    print(tngri.sql("SELECT count(*) FROM raw_dyntest"))
insert into raw_dyntest select * from read_parquet('s3://prostore/Stage/<lake_path>/DataLake/2049/1.parquet')
shape: (1, 1)
+----------+
│ column_0 │
│ --- │
│ i64 │
+----------+
│ 10000000 │
+----------+
insert into raw_dyntest select * from read_parquet('s3://prostore/Stage/<lake_path>/DataLake/2049/2.parquet')
shape: (1, 1)
+----------+
│ column_0 │
│ --- │
│ i64 │
+----------+
│ 20000000 │
+----------+
insert into raw_dyntest select * from read_parquet('s3://prostore/Stage/<lake_path>/DataLake/2049/3.parquet')
shape: (1, 1)
+----------+
│ column_0 │
│ --- │
│ i64 │
+----------+
│ 30000000 │
+----------+
insert into raw_dyntest select * from read_parquet('s3://prostore/Stage/<lake_path>/DataLake/2049/4.parquet')
shape: (1, 1)
+----------+
│ column_0 │
│ --- │
│ i64 │
+----------+
│ 40000000 │
+----------+
insert into raw_dyntest select * from read_parquet('s3://prostore/Stage/<lake_path>/DataLake/2049/5.parquet')
shape: (1, 1)
+----------+
│ column_0 │
│ --- │
│ i64 │
+----------+
│ 50000000 │
+----------+
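The range(1, 6) loop relies on the files being numbered sequentially. A more general variant, sketched below, feeds the keys collected by the paginated boto3 listing above straight into the same INSERT statement:
import tngri
# parquet_keys is the list of object keys gathered by the boto3 listing earlier
for key in parquet_keys:
    tngri.sql(f"insert into raw_dyntest select * from read_parquet('s3://prostore/{key}')")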
Now let’s display the total number of rows in the table populated this way:
SELECT count(*) AS row_count
FROM raw_dyntest;
+-----------+
| row_count |
+-----------+
| 50000000 |
+-----------+
To check that queries against this volume of data (50 million rows) work properly, let’s display the distribution of true and false values in the f2 column:
SELECT f2, count(*) AS f2_cnt
FROM raw_dyntest
GROUP BY f2;
+-------+----------+
| f2 | f2_cnt |
+-------+----------+
| false | 24998577 |
+-------+----------+
| true | 25001423 |
+-------+----------+
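The two counts sum to exactly 50,000,000, matching the total row count, so the grouping accounts for every loaded row.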
One of the table’s columns, f1, contains data as a JSON structure, and one of the keys of this structure is age. Let’s build a table of the age value distribution using the json_extract function:
SELECT
json_extract(f1, ['age'])[1]::INT AS age,
count(*) AS age_count
FROM raw_dyntest
GROUP BY 1 ORDER BY 1;
Done in 9.9 sec
+-----+-----------+
| age | age_count |
+-----+-----------+
| 0 | 500228 |
+-----+-----------+
| 1 | 499629 |
+-----+-----------+
| 2 | 500222 |
+-----+-----------+
| 3 | 500622 |
+-----+-----------+
| 4 | 500373 |
+-----+-----------+
| 5 | 500410 |
+-----+-----------+
| 6 | 499876 |
+-----+-----------+
| 7 | 499276 |
+-----+-----------+
| 8 | 500566 |
+-----+-----------+
| 9 | 499314 |
+-----+-----------+
| ... | ... |
+-----+-----------+
Based on this table, let’s plot the age distribution:
import matplotlib

# cell_output: the previous cell's result, exposed by the notebook as a plottable frame
ax = cell_output.plot(
    title='Age distribution',
    y='age_count',
    ylabel='Count',
    x='age',
    xlabel='Age',
)
ax.get_figure()
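If the notebook does not expose cell_output, the same chart can be built explicitly. The shape: (1, 1) row-count outputs above suggest that tngri.sql returns a polars DataFrame; assuming that is the case, a minimal sketch converts the result to pandas and plots it:
import tngri
import matplotlib.pyplot as plt

# Assumption: tngri.sql returns a polars DataFrame, convertible via to_pandas()
age_df = tngri.sql("""
    SELECT json_extract(f1, ['age'])[1]::INT AS age, count(*) AS age_count
    FROM raw_dyntest
    GROUP BY 1 ORDER BY 1
""").to_pandas()

age_df.plot(x='age', y='age_count', title='Age distribution',
            xlabel='Age', ylabel='Count')
plt.show()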
