Third Party Data Scraper (3PDS)

3PDS is a service that collects engagement data from social platforms for User-Generated Content (UGC) created by Just About (JA) users, for use in client reporting. Engagement data includes views, upvotes, shares, reposts, impressions, and similar metrics.

Internally, 3PDS consists of two parts: the orchestrator and the workers. Here's a general overview of how the system works:

               listUnlockedUGC every 5 minutes
Orchestrator -------> ugc <---* 
  |                            \
 push job(s)                    *-- ugc_links 
  v                                   |
BullMQ queue                          v
  ^                          references post & community
listens
  |       Platform Scraper
Worker ---- requests ------> YouTube Data API V3 ---> YouTube data
  |
 stores data in
  |
  v
ugc_data

UGC

When a user posts a link on Just About to a piece of original content hosted on a third-party platform, that link is inserted into the ugc table for future processing. The ugc table looks like this:

link : the link to the UGC on an external platform that the user created
iteration_index : how many times the UGC has had its engagement data scraped
iteration_started_at : when the scraping started for the previous iteration_index
iteration_ended_at : when the scraping ended for the previous iteration_index
next_iteration_date : when the data should next be scraped
status : dictates the state of the UGC
- pending : data is waiting to be fetched
- locked : data is currently being fetched
- errored : getting the data failed for whatever reason
- dead : link is dead, no statistics can be pulled
- skipped : link does not have a provider

3PDS Orchestrator

The 3PDS Orchestrator decides when each piece of UGC should have its data collected and dispatches jobs to a queue to be processed by workers. UGC viewership typically drops off exponentially over time, so the frequency of data collection should decrease in the same way to avoid needlessly spending API calls on third-party services.
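
One way to reflect that drop-off is to grow the interval between scrapes exponentially with each iteration. The sketch below is only an illustration: the base interval, growth factor, cap, and the function name computeNextIterationDate are assumptions, not values taken from 3PDS.

```javascript
// Hypothetical tuning values; the real 3PDS schedule is not specified here.
const BASE_INTERVAL_MS = 60 * 60 * 1000;          // first re-scrape after 1 hour
const GROWTH_FACTOR = 2;                          // interval doubles each iteration
const MAX_INTERVAL_MS = 30 * 24 * 60 * 60 * 1000; // never wait longer than 30 days

// Returns the Date at which a UGC row should next be scraped, given how many
// iterations have already completed and when the last one ended.
const computeNextIterationDate = (iterationIndex, iterationEndedAt) => {
  const interval = Math.min(
    BASE_INTERVAL_MS * GROWTH_FACTOR ** iterationIndex,
    MAX_INTERVAL_MS,
  );
  return new Date(iterationEndedAt.getTime() + interval);
};
```

With these assumed values, a row that has completed three iterations would wait eight hours before its next scrape, and the wait never exceeds thirty days.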

Every 5 minutes, the Orchestrator checks the ugc table for rows with a pending status and a next_iteration_date that has passed. For each returned row, a job is pushed onto a BullMQ queue (backed by Redis) to be processed by the 3PDS Workers.
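
A single orchestrator pass might look roughly like the sketch below. The data-access helpers (listUnlockedUGC, lockUGC), the job name, the job payload shape, and the decision to lock rows at enqueue time are all assumptions made for illustration; the only BullMQ call relied on is Queue#addBulk, and the queue is passed in so the function can be exercised with a stand-in.

```javascript
// One orchestrator pass: find due UGC and enqueue one job per row.
// `listUnlockedUGC` and `lockUGC` are hypothetical data-access helpers;
// `queue` is anything exposing BullMQ's Queue#addBulk(jobs) method.
const runOrchestratorTick = async ({ listUnlockedUGC, lockUGC, queue, now = new Date() }) => {
  // Expected to return pending rows whose next_iteration_date is <= now.
  const dueRows = await listUnlockedUGC(now);
  if (dueRows.length === 0) return 0;

  // Assumption: rows are locked before enqueueing so the next pass skips them.
  await lockUGC(dueRows.map(row => row.id));

  await queue.addBulk(
    dueRows.map(row => ({
      name: 'scrape-ugc', // hypothetical job name
      data: { ugcId: row.id, link: row.link, iterationIndex: row.iteration_index + 1 },
    })),
  );
  return dueRows.length;
};

const FIVE_MINUTES_MS = 5 * 60 * 1000;
// Example wiring, with `deps` supplied by the application:
// setInterval(() => runOrchestratorTick(deps).catch(console.error), FIVE_MINUTES_MS);
```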

3PDS is designed to be horizontally scalable from the start: workers can be scaled out as queue congestion dictates to handle a large volume of jobs.

3PDS Worker

The 3PDS worker takes an arbitrary link and, through Platform Scrapers (detailed below), collects data relating to that link and stores it in ugc_data.
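
The worker's job handling could be sketched as follows. This is a simplified illustration, not the actual implementation: the job payload shape, the scrapers list, and the saveUGCData and setUGCStatus helpers are hypothetical, and resetting the status to pending after a successful scrape is an assumption.

```javascript
// Processes one job: pick the first Platform Scraper whose matcher accepts the
// link, run its fetcher, and persist the result. `scrapers`, `saveUGCData`, and
// `setUGCStatus` are hypothetical; a BullMQ Worker processor would pass its job here.
const processUGCJob = async (job, { scrapers, saveUGCData, setUGCStatus }) => {
  const { ugcId, link, iterationIndex } = job.data;
  const scraper = scrapers.find(s => s.matcher(link));

  if (!scraper) {
    await setUGCStatus(ugcId, 'skipped'); // no provider for this link
    return null;
  }

  const collectedAt = new Date(); // when data collection started
  try {
    const engagement = await scraper.fetcher(link);
    const row = {
      ugc_id: ugcId,
      iteration_index: iterationIndex,
      collected_at: collectedAt,
      job_id: job.id,
      ...engagement,
    };
    await saveUGCData(row);
    await setUGCStatus(ugcId, 'pending'); // assumed: ready for the next iteration
    return row;
  } catch (err) {
    await setUGCStatus(ugcId, 'errored');
    throw err; // rethrow so the queue records the failure
  }
};
```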

The ugc_data table looks like this:

ugc_id : the ID of the row in the ugc table that this data is associated with
iteration_index : correlates to the iteration of the data scraping, 1 = 1st scraping, 2 = 2nd and so forth
collected_at : when the data collection started
job_id : the job from the queue that caused this data to be collected
views : correlates to a view / impression on a 3rd party platform
upvotes : correlates to an upvote / heart / like on a 3rd party platform
downvotes : correlates to a downvote / dislike on a 3rd party platform
shares : correlates to an external share on a 3rd party platform
reposts : correlates to a repost / re-tweet / reblog on a 3rd party platform
replies : correlates to a direct reply on a post on a 3rd party platform

Platform Scrapers

A Platform Scraper (PS) is composed of two parts: the matcher, which identifies whether a link can have its data scraped by this PS, and the fetcher, which holds the instructions for scraping data for that link.

Matcher

A matcher is a function that returns a boolean indicating whether a particular UGC link belongs to this platform, e.g.

const isYouTubeURL = url => url.includes('youtube.com');

// tumblr.com/video/1234    - false
// youtube.com/watch?v=1234 - true
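
A plain substring check also accepts links that merely mention youtube.com somewhere, such as in a query string. A stricter variant could compare the parsed hostname instead. The sketch below uses the standard URL class; the function name isYouTubeHost and the choice to treat scheme-less links as https are assumptions.

```javascript
// Returns true only when the link's hostname is youtube.com or a subdomain of it.
const isYouTubeHost = link => {
  let hostname;
  try {
    // Assumption: links stored without a scheme are treated as https.
    const withScheme = /^https?:\/\//i.test(link) ? link : `https://${link}`;
    hostname = new URL(withScheme).hostname.toLowerCase();
  } catch {
    return false; // not a parseable URL
  }
  return hostname === 'youtube.com' || hostname.endsWith('.youtube.com');
};

// isYouTubeHost('tumblr.com/video/1234')                - false
// isYouTubeHost('https://www.youtube.com/watch?v=1234') - true
// isYouTubeHost('https://example.com/?ref=youtube.com') - false
```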

Fetcher

A fetcher is a function that takes that same link, queries a third-party API for data, and returns some engagement data, e.g.

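// Illustrative only: the endpoint and response shape here are simplified placeholders.
const YouTubeDataAPIV3Fetcher = async url => {
  const videoID = extractVideoIDFromYoutubeURL(url);
  // Query by the extracted video ID rather than the full link.
  const response = await fetch(`https://api.youtube.com/video/${videoID}`).then(r => r.json());

  // Translate platform terminology into the shared engagement format.
  return {
    views: response.views,
    upvotes: response.likes,
  };
};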
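
For comparison, here is a sketch of a fetcher written against the public YouTube Data API v3 videos endpoint, which returns counts as strings under items[].statistics. The names extractVideoID and createYouTubeFetcher, the injectable fetchImpl parameter, and the mapping of commentCount to replies are assumptions for illustration; only youtube.com/watch?v=<id> links are handled.

```javascript
// Extracts the video ID from a youtube.com/watch?v=<id> link, or null if absent.
const extractVideoID = link => {
  try {
    return new URL(link).searchParams.get('v');
  } catch {
    return null; // not a parseable absolute URL
  }
};

// Builds a fetcher bound to an API key. `fetchImpl` is injectable so the
// function can be exercised without network access.
const createYouTubeFetcher = (apiKey, fetchImpl = fetch) => async link => {
  const videoID = extractVideoID(link);
  if (!videoID) throw new Error(`No video ID found in ${link}`);

  const endpoint = new URL('https://www.googleapis.com/youtube/v3/videos');
  endpoint.searchParams.set('part', 'statistics');
  endpoint.searchParams.set('id', videoID);
  endpoint.searchParams.set('key', apiKey);

  const response = await fetchImpl(endpoint);
  if (!response.ok) throw new Error(`YouTube API responded with ${response.status}`);

  const { items = [] } = await response.json();
  if (items.length === 0) throw new Error(`Video ${videoID} not found`);

  // Counts arrive as strings, and some may be missing (e.g. hidden likes).
  const toCount = value => (value === undefined ? null : Number(value));
  const { viewCount, likeCount, commentCount } = items[0].statistics;
  return {
    views: toCount(viewCount),
    upvotes: toCount(likeCount),
    replies: toCount(commentCount), // assumed mapping: comments count as replies
  };
};
```

An empty items array is surfaced as an error here; a worker could use that signal to mark the UGC row as dead rather than errored.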

Platforms do not share a common terminology: views and impressions, or upvotes, likes, and hearts, are different names for similar kinds of engagement. Each fetcher therefore translates its platform's metrics into a consistent format to simplify reporting.
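
That translation could be expressed as a lookup from platform-specific names to ugc_data columns. The alias table and the normalizeEngagement function below are hypothetical; the real mapping would be decided per platform.

```javascript
// Hypothetical mapping from platform-specific metric names to ugc_data columns.
const METRIC_ALIASES = new Map([
  ['views', 'views'], ['impressions', 'views'],
  ['upvotes', 'upvotes'], ['likes', 'upvotes'], ['hearts', 'upvotes'],
  ['downvotes', 'downvotes'], ['dislikes', 'downvotes'],
  ['shares', 'shares'],
  ['reposts', 'reposts'], ['retweets', 'reposts'], ['reblogs', 'reposts'],
  ['replies', 'replies'],
]);

// Translates a raw platform payload into the consistent engagement format.
// Unknown keys are dropped; if two aliases map to the same column, the first wins.
const normalizeEngagement = raw => {
  const normalized = {};
  for (const [name, value] of Object.entries(raw)) {
    const column = METRIC_ALIASES.get(name.toLowerCase());
    if (column && !(column in normalized)) normalized[column] = value;
  }
  return normalized;
};

// normalizeEngagement({ impressions: 500, hearts: 20, retweets: 3, followers: 9 })
// returns { views: 500, upvotes: 20, reposts: 3 }
```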

Linking UGC to posts, `ugc_links`

ugc_links provides the ability to filter UGC by post or community, which is helpful in reporting, e.g. answering the question "How many views were caused by UGC in community X over Y time period?" The ugc_links table looks like this:

post_id : the id of the post that this UGC exists in
community_id : the id of the community that this UGC's post was created in
ugc_id : the id of the ugc that this post relates to
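
To make the reporting question concrete, the sketch below answers it over in-memory rows shaped like ugc_links and ugc_data. It assumes that views is a running total at each scrape, so the views gained in a window are the difference between the latest snapshot inside the window and the latest snapshot before it. The function name viewsForCommunity is hypothetical, and a real report would more likely be a database query.

```javascript
// Views gained by a community's UGC between `from` and `to` (both Dates).
// `links` and `snapshots` are in-memory rows shaped like ugc_links and ugc_data.
const viewsForCommunity = (links, snapshots, communityId, from, to) => {
  // A Set avoids double counting UGC that is linked from several posts.
  const ugcIds = new Set(
    links.filter(l => l.community_id === communityId).map(l => l.ugc_id),
  );

  // Views from the most recent snapshot accepted by `isInRange`, or 0 if none.
  const latestViews = (ugcId, isInRange) => {
    const candidates = snapshots
      .filter(s => s.ugc_id === ugcId && isInRange(s.collected_at))
      .sort((a, b) => b.collected_at - a.collected_at);
    return candidates.length > 0 ? candidates[0].views : 0;
  };

  let total = 0;
  for (const ugcId of ugcIds) {
    const atEnd = latestViews(ugcId, t => t <= to);
    const beforeStart = latestViews(ugcId, t => t < from);
    total += atEnd - beforeStart;
  }
  return total;
};
```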