Analysing data from Instagram* posts

* Instagram belongs to Meta, a company recognised as extremist and banned in the Russian Federation.

In this example, we will demonstrate how we can analyse data from the "instagram ads" dataset at Kaggle. The data in this dataset are posts from the social network Instagram with all sorts of attributes (comments, likes, audio and video information, etc.).

Downloading of source data

First of all, let’s load the data. This can be done either manually via Upload Wizard, or by using, for example, this code at Python.

Here we set the URL of the dataset page to Kaggle, check that the downloaded file is a .zip file, and if it is, unzip its contents to a temporary directory, and then upload all the unzipped files to Tengri using the function tngri.upload_file. Output their information via print to get the names of the uploaded files for later use.

import urllib.request
import os
import zipfile
import tngri

zip_url = 'https://www.kaggle.com/api/v1/datasets/download/geraygench/instagram-ads'

urllib.request.urlretrieve(zip_url, 'data.zip')
temp_dir = 'temp'
try:
    with zipfile.ZipFile('data.zip', 'r') as zObject:
        zObject.extractall(path=temp_dir)
    for file in os.listdir(temp_dir):
        print(tngri.upload_file(os.path.join(temp_dir, file)))
except:
    print('Not a valid zip file')
<file_id>.json

Now let’s copy the name of the saved .json file to paste it into the next cell as an argument to the read_json function.

Let’s display the first 5 lines of the downloaded data to visually check that the download was successful. To do this, paste the file name copied in the previous step.

SELECT * FROM read_json("<file_id>.json")
    LIMIT 5
Done in 1.1 sec
+-----------------------------------------------+---------------------+---------+-------------+-----+
| inputUrl                                      | id                  | type    | shortCode   | ... |
+-----------------------------------------------+---------------------+---------+-------------+-----+
| https://www.instagram.com/fjallravenofficial/ | 3351813129396710000 | Video   | C6ECEhTLlj7 | ... |
+-----------------------------------------------+---------------------+---------+-------------+-----+
| https://www.instagram.com/fjallravenofficial/ | 3335943316421300000 | Video   | C5LpsWOIUc8 | ... |
+-----------------------------------------------+---------------------+---------+-------------+-----+
| https://www.instagram.com/fjallravenofficial/ | 3319217982042080000 | Sidecar | C4QOysvtPUU | ... |
+-----------------------------------------------+---------------------+---------+-------------+-----+
| https://www.instagram.com/fjallravenofficial/ | 3372181927023210000 | Sidecar | C7MZZiWNffn | ... |
+-----------------------------------------------+---------------------+---------+-------------+-----+
| https://www.instagram.com/fjallravenofficial/ | 3369947094264730000 | Video   | C7EdQcJgkwr | ... |
+-----------------------------------------------+---------------------+---------+-------------+-----+

Now let’s create a instagram_post table and load this data into it.

For those columns in which we found a nested JSON structure, we will specify the JSON type. For some of the columns, we will specify new names and data types that are convenient for us (for example, TIMESTAMP). Let’s set a restriction that the value of the identifier ig_post_id must not be empty to reject potential invalid data fragments.

CREATE OR REPLACE TABLE raw_instagram_post AS
SELECT
    inputUrl,
    id AS ig_post_id,
    type,
    shortCode, caption, hashtags, mentions, url,
    dimensionsHeight, dimensionsWidth,
    commentsCount ,	firstComment,
    displayUrl, images, videoUrl,
    alt, likesCount,
    videoViewCount, videoPlayCount,
    timestamp::TIMESTAMP AS post_ts,
    ownerFullName, ownerUsername, ownerId,
    productType,
    videoDuration,
    isSponsored, isPinned,
    locationName, locationId,
    error, description, paidPartnership,
    sponsors::JSON AS sponsors,
    taggedUsers::JSON AS taggedUsers,
    musicInfo::JSON AS musicInfo,
    coauthorProducers::JSON AS coauthorProducers,
    latestComments::JSON AS latestComments,
    childPosts::JSON AS childPosts
FROM read_json("<file_id>.json")
    WHERE ig_post_id IS NOT NULL;
Done in 1.7 sec
+--------+
| status |
+--------+
| CREATE |
+--------+

Let’s display the first 5 rows of the table to visually check that the data in it corresponds to our expectations.

SELECT * FROM raw_instagram_post
    LIMIT 5;
Done in 1.2 sec
+-----------------------------------------------+---------------------+---------+-------------+-----+
| inputurl                                      | ig_post_id          | type    | shortcode   | ... |
+-----------------------------------------------+---------------------+---------+-------------+-----+
| https://www.instagram.com/fjallravenofficial/ | 3351813129396710000 | Video   | C6ECEhTLlj7 | ... |
+-----------------------------------------------+---------------------+---------+-------------+-----+
| https://www.instagram.com/fjallravenofficial/ | 3335943316421300000 | Video   | C5LpsWOIUc8 | ... |
+-----------------------------------------------+---------------------+---------+-------------+-----+
| https://www.instagram.com/fjallravenofficial/ | 3319217982042080000 | Sidecar | C4QOysvtPUU | ... |
+-----------------------------------------------+---------------------+---------+-------------+-----+
| https://www.instagram.com/fjallravenofficial/ | 3372181927023210000 | Sidecar | C7MZZiWNffn | ... |
+-----------------------------------------------+---------------------+---------+-------------+-----+
| https://www.instagram.com/fjallravenofficial/ | 3369947094264730000 | Video   | C7EdQcJgkwr | ... |
+-----------------------------------------------+---------------------+---------+-------------+-----+

Data preparation

Now let’s analyse the nested JSON structure in the musicinfo column using the unnest and json_keys functions:

SELECT unnest(keys) AS musicinfo_keys, keys_count FROM
    (SELECT json_keys(musicinfo) AS keys, count(*) keys_count
        FROM raw_instagram_post WHERE musicinfo IS NOT NULL GROUP BY 1) t ;
Done in 1.1 sec
+--------------------------+------------+
| musicinfo_keys           | keys_count |
+--------------------------+------------+
| artist_name              | 1176       |
+--------------------------+------------+
| song_name                | 1176       |
+--------------------------+------------+
| uses_original_audio      | 1176       |
+--------------------------+------------+
| should_mute_audio        | 1176       |
+--------------------------+------------+
| should_mute_audio_reason | 1176       |
+--------------------------+------------+
| audio_id                 | 1176       |
+--------------------------+------------+

Let’s extract the data from the JSON structure using the function json_extract and write it into a new table silver_instagram_musicinfo, the columns of which are the keys from the original JSON structure (inside musicinfo) that we saw in the previous step, and in the column ig_post_id we will write the id from the original table instagram_post.

Note that we use the json_extract function here no more than once for one column of the source table. This saves time and computational power, because after a single extraction of this data, it will be accessed almost instantly.
CREATE OR REPLACE TABLE silver_instagram_musicinfo AS
SELECT DISTINCT ig_post_id,
    arr[1] AS 'artist_name',
    arr[2] AS 'song_name',
    arr[3]::bool AS 'uses_original_audio',
    arr[4]::bool AS 'should_mute_audio',
    arr[5] AS 'should_mute_audio_reason',
    arr[6]::bigint AS 'audio_id'
FROM (SELECT ig_post_id, musicinfo,
    json_extract(musicinfo, ['artist_name',
                            'song_name',
                            'uses_original_audio',
                            'should_mute_audio',
                            'should_mute_audio_reason',
                            'audio_id'])
                            arr
    FROM raw_instagram_post WHERE musicinfo IS NOT NULL
) e;
Done in 1.1 sec
+--------+
| status |
+--------+
| CREATE |
+--------+

Let’s display the first five rows from the resulting table to visually check the data in it.

SELECT * FROM silver_instagram_musicinfo
    LIMIT 5;
Done in 1 sec
+---------------------+----------------------+---------------------------+---------------------+-------------------+--------------------------+------------------+
| ig_post_id          | artist_name          | song_name                 | uses_original_audio | should_mute_audio | should_mute_audio_reason | audio_id         |
+---------------------+----------------------+---------------------------+---------------------+-------------------+--------------------------+------------------+
| 3361389062866375005 | "fjallravenofficial" | "Original audio"          | true                | false             | ""                       | 1091589181917200 |
+---------------------+----------------------+---------------------------+---------------------+-------------------+--------------------------+------------------+
| 3362166784222944940 | "Lyle Workman"       | "Sun Beyond the Darkness" | false               | false             | ""                       | 700744903595866  |
+---------------------+----------------------+---------------------------+---------------------+-------------------+--------------------------+------------------+
| 3370702174072339576 | "victoriassecret"    | "Original audio"          | true                | false             | ""                       | 969258887793092  |
+---------------------+----------------------+---------------------------+---------------------+-------------------+--------------------------+------------------+
| 3365659145996406849 | "victoriassecret"    | "Original audio"          | true                | false             | ""                       | 1856092344873079 |
+---------------------+----------------------+---------------------------+---------------------+-------------------+--------------------------+------------------+
| 3367926439711140761 | "asos"               | "Original audio"          | true                | false             | ""                       | 1428332318047391 |
+---------------------+----------------------+---------------------------+---------------------+-------------------+--------------------------+------------------+

Now let’s look at the column of the source table with information about tagged users taggedUsers. Note that the data in it is represented as arrays. Using the functions unnest and json_extract, create the table silver_instagram_taggeduser:

CREATE OR REPLACE TABLE silver_instagram_taggeduser AS
    SELECT DISTINCT ig_post_id,
        arr[1]::varchar AS 'full_name',
        arr[2]::bigint AS 'user_id',
        arr[3]::bool AS 'is_verified',
        arr[4] AS 'profile_pic_url',
        arr[5] AS 'username'
    FROM (
        SELECT ig_post_id,
        json_extract(users, ['full_name',
                            'id','is_verified',
                            'profile_pic_url',
                            'username'])
        AS arr FROM
            (SELECT ig_post_id, unnest(json_extract(taggedUsers, '[*]'))
            AS users FROM
                raw_instagram_post WHERE taggedUsers IS NOT NULL
            ) un
        ) t;
Done in 1 sec
+--------+
| status |
+--------+
| CREATE |
+--------+

Let’s display the first 5 rows of the created table silver_instagram_taggeduser and visually check the data in it.

SELECT * FROM silver_instagram_taggeduser
    LIMIT 5;
Done in 1 sec
+---------------------+-------------------+-----------+-------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+
| ig_post_id          | full_name         | user_id   | is_verified | profile_pic_url                                                                                                                                                                                                                                                                                                  | username           |
+---------------------+-------------------+-----------+-------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+
| 3368467612563246843 | "Candice"         | 43114731  | true        | "https://scontent.cdninstagram.com/v/t51.2885-19/436426783_716232097252223_2080875243016028171_n.jpg?stp=dst-jpg_s150x150&_nc_ht=scontent.cdninstagram.com&_nc_cat=1&_nc_ohc=VZz3Y6G6kCgQ7kNvgEfJgX3&edm=APs17CUBAAAA&ccb=7-5&oh=00_AYDeE3YwxCQxVEK-3DLy0AK8cSmaFu3SVuwsUluh0pQ2Hg&oe=66526CB7&_nc_sid=10d13b"   | "candiceswanepoel" |
+---------------------+-------------------+-----------+-------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+
| 3371457031159064630 | "Brittany Xavier" | 739181077 | true        | "https://scontent.cdninstagram.com/v/t51.2885-19/299498992_480162533527636_8112607220645623732_n.jpg?stp=dst-jpg_s150x150&_nc_ht=scontent.cdninstagram.com&_nc_cat=104&_nc_ohc=D83lh8W9o-EQ7kNvgEAZo1_&edm=APs17CUBAAAA&ccb=7-5&oh=00_AYCI_FerVMUvMEltLhAaANlfFdtezWw90dIOCff-oNuiLQ&oe=66526DD9&_nc_sid=10d13b" | "brittanyxavier"   |
+---------------------+-------------------+-----------+-------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+
| 3350535229969664632 | "dev scudds"      | 200671402 | false       | "https://scontent.cdninstagram.com/v/t51.2885-19/316332449_539815587596795_7700370950656909913_n.jpg?stp=dst-jpg_s150x150&_nc_ht=scontent.cdninstagram.com&_nc_cat=111&_nc_ohc=yt51b9U7YQcQ7kNvgGr0PqA&edm=APs17CUBAAAA&ccb=7-5&oh=00_AYC_DbQKZ56A_C_kK4oCamtB10dpgXWxNr9dupqkTHIKZA&oe=665259D4&_nc_sid=10d13b" | "devscudds"        |
+---------------------+-------------------+-----------+-------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+
| 3355576548572106408 | "Roxie Nafousi"   | 182064334 | true        | "https://scontent.cdninstagram.com/v/t51.2885-19/432688376_299228906523099_7608649815950644359_n.jpg?stp=dst-jpg_s150x150&_nc_ht=scontent.cdninstagram.com&_nc_cat=1&_nc_ohc=mNDTJLE0sakQ7kNvgFoinRd&edm=APs17CUBAAAA&ccb=7-5&oh=00_AYDyLDXyYkSqF7jeJ7SDKO-ATM7WbbH8OYqhw_GOjiGygQ&oe=66526921&_nc_sid=10d13b"   | "roxienafousi"     |
+---------------------+-------------------+-----------+-------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+
| 3369046462609537802 | "ASOS MAN"        | 640874843 | true        | "https://scontent.cdninstagram.com/v/t51.2885-19/244277018_902025494057602_3929487210907869376_n.jpg?stp=dst-jpg_s150x150&_nc_ht=scontent.cdninstagram.com&_nc_cat=110&_nc_ohc=5aRpL0ZaRW8Q7kNvgFG0_9g&edm=APs17CUBAAAA&ccb=7-5&oh=00_AYBmZ6qVQtwRziMSsl8tGtlX7u4-RnYXa-MxQIaIEKyc7Q&oe=66527A51&_nc_sid=10d13b" | "asos_man"         |
+---------------------+-------------------+-----------+-------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+

Now let’s look at the column of the source table with information about comments latestComments. In it, the data is represented as an array of comments. To find out what keys occur inside these arrays, let’s use the functions unnest, json_keys and json_extract and execute this query:

SELECT unnest(keys) AS comment_keys FROM
    (SELECT json_keys(comments) AS keys, count(*) FROM
        (SELECT ig_post_id, unnest(json_extract(latestComments, '[*]'))
        AS comments FROM
            raw_instagram_post WHERE latestComments IS NOT NULL) un GROUP BY 1) t;
Done in 1 sec
+--------------------+
| comment_keys       |
+--------------------+
| id                 |
+--------------------+
| text               |
+--------------------+
| ownerUsername      |
+--------------------+
| ownerProfilePicUrl |
+--------------------+
| timestamp          |
+--------------------+
| likesCount         |
+--------------------+
| repliesCount       |
+--------------------+
| replies            |
+--------------------+

Let’s extract the comments into the silver_instagram_comments table in the same way as we did above for users and artists.

Note that in this table likesCount and repliesCount are the number of likes and replies to this comment (not to the original post).

CREATE OR REPLACE TABLE silver_instagram_comments as
    SELECT DISTINCT ig_post_id,
        arr[1]::bigint AS comment_id,
        arr[2] AS 'text',
        arr[3] AS 'ownerUsername',
        arr[4] AS 'ownerProfilePicUrl',
        arr[5]::timestamp AS 'comment_ts',
        arr[6]::int AS 'likesCount',
        arr[7]::int AS 'repliesCount',
        arr[8] AS 'replies'
FROM (
SELECT ig_post_id,
    json_extract(comments, ['id','text','ownerUsername','ownerProfilePicUrl',
    'timestamp','likesCount','repliesCount','replies']) AS arr
    FROM (SELECT ig_post_id, unnest(json_extract(latestComments, '[*]')) AS comments
        FROM raw_instagram_post WHERE latestComments IS NOT NULL) un ) t;
Done in 1 sec
+--------+
| status |
+--------+
| CREATE |
+--------+

Let’s check the resulting table:

SELECT * FROM silver_instagram_comments
    LIMIT 5;
Done in 1 sec
+---------------------+-------------------+--------------------------------------------------------------------+----------------------------------+-----+
| ig_post_id          | comment_id        | text                                                               | ownerusername                    | ... |
+---------------------+-------------------+--------------------------------------------------------------------+----------------------------------+-----+
| 3351813129396705531 | 18025641284105656 | "Your work apron is cool..\nDo Fjalraven sell them to the public?" | "dai_stick"                      | ... |
+---------------------+-------------------+--------------------------------------------------------------------+----------------------------------+-----+
| 3367108632665594085 | 18010752380180096 | "Well done 👏"                                                      | "alpencore"                      | ... |
+---------------------+-------------------+--------------------------------------------------------------------+----------------------------------+-----+
| 3362086677098626407 | 18041534590828292 | "Still made in cheap labour countries"                             | "leineleineleineleineleineleine" | ... |
+---------------------+-------------------+--------------------------------------------------------------------+----------------------------------+-----+
| 3362086677098626407 | 18028710974031364 | "🔥🔥🙌"                                                              | "arifwidie26"                    | ... |
+---------------------+-------------------+--------------------------------------------------------------------+----------------------------------+-----+
| 3361389062866375005 | 18012200105461376 | "Classic description of “type 2 fun”"                              | "itskate_g"                      | ... |
+---------------------+-------------------+--------------------------------------------------------------------+----------------------------------+-----+

Result 1: Popularity of performers

We now use the data collected above to construct a table in which each row corresponds to one artist, and the columns represent different information that somehow indicates the popularity of that artist:

  • songs — the number of songs by this artist

  • ig_posts — the number of posts with songs by this artist

  • comments — number of comments in posts with songs by this artist

  • tagged_users — number of tagged users in posts with songs by this artist

  • likesCount — number of likes in posts with songs by this artist

  • repliesCount — number of replies in posts with songs by this artist

To do this, we use the expression LEFT JOIN.

Organise the final table by the third column (ig_posts) in descending order.

SELECT artist_name,
    count(distinct song_name) AS songs,
    count(distinct im.ig_post_id) AS ig_posts,
    count(distinct comment_id) AS comments,
    count(distinct user_id) AS tagged_users,
    sum(likesCount) AS likesCount,
    sum(repliesCount) AS repliesCount
FROM silver_instagram_musicinfo im
    LEFT JOIN silver_instagram_taggeduser tu ON tu.ig_post_id=im.ig_post_id
    LEFT JOIN silver_instagram_comments ic ON ic.ig_post_id =im.ig_post_id
GROUP BY 1 ORDER BY 3 DESC;
Done in 1.6 sec
+-------------------+-------+----------+----------+--------------+------------+--------------+
| artist_name       | songs | ig_posts | comments | tagged_users | likescount | repliescount |
+-------------------+-------+----------+----------+--------------+------------+--------------+
| "telfarglobal"    | 9     | 136      | 985      | 120          | 634        | 156          |
+-------------------+-------+----------+----------+--------------+------------+--------------+
| "everlane"        | 1     | 95       | 532      | 6            | 513        | 65           |
+-------------------+-------+----------+----------+--------------+------------+--------------+
| "asos"            | 3     | 93       | 503      | 60           | 583        | 85           |
+-------------------+-------+----------+----------+--------------+------------+--------------+
| "victoriassecret" | 1     | 83       | 686      | 49           | 185        | 16           |
+-------------------+-------+----------+----------+--------------+------------+--------------+
| "summerfridays"   | 2     | 65       | 574      | 48           | 173        | 18           |
+-------------------+-------+----------+----------+--------------+------------+--------------+
| ...               | ...   | ...      | ...      | ...          | ...        | ...          |
+-------------------+-------+----------+----------+--------------+------------+--------------+

This table can be saved in .xlsx format by clicking the Download output as XLSX button on this cell. After that, in any application for viewing `.xlsx' files, you can organise the collected data by any column, apply additional filters, copy the data or a part of it for inserting into analytical reports.

The file saved in this way can be download here.

Result 2: Distribution of comments and events in them over time

Let’s analyse the distribution of comments and events in them (likes and repins) depending on the time elapsed from the publication of a post to the publication of a given comment on it — the "freshness" of the comment in relation to the post to which it refers.

To do this, let’s build a table in which each row will correspond to all such pairs "post — comment to it" that 0 days, 1 day and so on up to 9 days have passed between them. And for all such pairs we will output them in columns:

  • comments — the number of such comments

  • likescount — the number of likes of such comments.

  • repliescount — the number of replays for such comments

SELECT
    (comment_ts::DATE - post_ts::DATE)::INT AS days_till_comment,
    count(DISTINCT comment_id) AS comments,
    sum(c.likesCount)::int AS likesCount,
    sum(c.repliesCount)::int AS repliesCount
FROM raw_instagram_post p
LEFT JOIN silver_instagram_comments c  on p.ig_post_id=c.ig_post_id
GROUP BY 1 ORDER BY 1 LIMIT 10;
Done in 1.7 sec
+-------------------+----------+------------+--------------+
| days_till_comment | comments | likescount | repliescount |
+-------------------+----------+------------+--------------+
| 0                 | 2876     | 1926       | 614          |
+-------------------+----------+------------+--------------+
| 1                 | 3288     | 1847       | 460          |
+-------------------+----------+------------+--------------+
| 2                 | 1980     | 830        | 189          |
+-------------------+----------+------------+--------------+
| 3                 | 1352     | 635        | 102          |
+-------------------+----------+------------+--------------+
| 4                 | 841      | 288        | 46           |
+-------------------+----------+------------+--------------+
| 5                 | 610      | 230        | 26           |
+-------------------+----------+------------+--------------+
| 6                 | 516      | 103        | 37           |
+-------------------+----------+------------+--------------+
| 7                 | 411      | 88         | 26           |
+-------------------+----------+------------+--------------+
| 8                 | 299      | 75         | 17           |
+-------------------+----------+------------+--------------+
| 9                 | 249      | 96         | 17           |
+-------------------+----------+------------+--------------+

Now in the cell of type Python we will plot the graph using the data from this table. To do this, we will use the Python matplotlib module. To take data from the previous cell, we use the cell_output variable.

import matplotlib

plt = cell_output.plot(
    title='Распределение комментариев и событий в них',
    y=[
        'comments',
        'likescount',
        'repliescount',
    ],
    label = ['Комментариев', 'Лайков к комментарию', 'Реплаев к комментарию'],
    ylabel='Количество',
    x='days_till_comment',
    xlabel = 'Дней между постом и комментарием',
    figsize=(10,5),
)
plt.set_xticks(cell_output['days_till_comment'])
plt.get_figure()
instagram example result 2

The graph clearly shows that the number of comments to a post increases on the first day and then gradually decreases. You can also see how the activity within the comments themselves (likes and repins to comments) decreases depending on the obsolescence of the comment relative to the date of the post.