Third Party Data Scraper (3PDS)

3PDS is a service that collects engagement data from social platforms for User-Generated Content (UGC) created by JA users, for use in client reporting. Engagement data includes views, upvotes, shares, reposts, impressions, etc.

3PDS consists of two parts: the orchestrator and the workers. Here's a general overview of how the system works.

               listUnlockedUGC every 5 minutes
Orchestrator -------> ugc <---* 
  |                            \
 push job(s)                    *-- ugc_links 
  v                                   |
BullMQ queue                          v
  ^                          references post & community
listens
  |       Platform Scraper
Worker ---- requests ------> YouTube Data API V3 ---> YouTube data
  |
 stores data in
  |
  v
ugc_data

UGC

When a user posts a link on Just About to a piece of original content hosted on a third-party platform, that link is inserted into the ugc table for future processing. The ugc table looks like this:

3PDS Orchestrator

The 3PDS Orchestrator decides when each piece of UGC should have its data collected and dispatches jobs to a queue to be processed by workers. Viewership of UGC typically drops off exponentially over time, so the frequency of data collection should drop off in the same way, to avoid needlessly spending API calls on third-party services.
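
One way to make collection frequency follow that drop-off is to grow the gap between collections geometrically. The sketch below is an assumption, not the Orchestrator's actual schedule: the function name, the one-hour base interval, the doubling factor and the 30-day cap are all hypothetical.

```javascript
// Hypothetical schedule: wait baseMs after the first collection, then
// multiply the wait by `factor` on each later iteration, capped at maxMs.
const HOUR_MS = 60 * 60 * 1000;

function nextIterationDate(lastRun, iteration, opts = {}) {
  const { baseMs = HOUR_MS, factor = 2, maxMs = 30 * 24 * HOUR_MS } = opts;
  const delay = Math.min(baseMs * Math.pow(factor, iteration), maxMs);
  return new Date(lastRun.getTime() + delay);
}

const postedAt = new Date('2024-01-01T00:00:00Z');
console.log(nextIterationDate(postedAt, 0).toISOString()); // 2024-01-01T01:00:00.000Z
console.log(nextIterationDate(postedAt, 3).toISOString()); // 2024-01-01T08:00:00.000Z
```

With these values a link is checked often while it is fresh and at most once every 30 days once interest has tailed off.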

The Orchestrator checks every 5 minutes for rows in the ugc table with a pending status and a next_iteration_date that has passed. For each returned row, a job is pushed onto a Redis-backed BullMQ queue to be processed by the 3PDS Workers.

3PDS is designed to be horizontally scalable from the start: workers can be scaled out as queue congestion dictates to handle a large number of jobs.
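
The check-and-dispatch step can be sketched as below. This is a minimal illustration under stated assumptions: the id and url fields, the 'done' status, the 'scrape-ugc' job name and the function names are hypothetical, and the queue is an in-memory stand-in exposing only an async add(name, data) method (the same shape as BullMQ's Queue add call) so the example runs without Redis.

```javascript
// Selection step: rows that are pending and whose next_iteration_date has passed.
function selectDueUGC(rows, now) {
  return rows.filter(
    (row) => row.status === 'pending' && row.next_iteration_date <= now
  );
}

// Dispatch step: push one job per due row and report how many were queued.
async function dispatchDue(rows, now, queue) {
  const due = selectDueUGC(rows, now);
  for (const row of due) {
    await queue.add('scrape-ugc', { ugcId: row.id, url: row.url });
  }
  return due.length;
}

// In-memory stand-in for the Redis-backed queue (assumption, illustration only).
const fakeQueue = {
  jobs: [],
  async add(name, data) {
    this.jobs.push({ name, data });
  },
};

const now = new Date('2024-01-01T12:00:00Z');
const rows = [
  { id: 1, url: 'https://youtube.com/watch?v=abc', status: 'pending', next_iteration_date: new Date('2024-01-01T11:55:00Z') },
  { id: 2, url: 'https://youtube.com/watch?v=def', status: 'pending', next_iteration_date: new Date('2024-01-01T13:00:00Z') },
  { id: 3, url: 'https://youtube.com/watch?v=ghi', status: 'done', next_iteration_date: new Date('2024-01-01T10:00:00Z') },
];

dispatchDue(rows, now, fakeQueue).then((count) => {
  console.log(count, fakeQueue.jobs.map((job) => job.data.ugcId));
});
```

Only row 1 is both pending and due, so a single job is queued. Keeping the selection step pure makes it easy to test without a queue or a database.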

3PDS Worker

The 3PDS Worker takes an arbitrary link and, through Platform Scrapers (detailed below), collects data relating to the link and stores that data in ugc_data.

The ugc_data table looks like this:

Platform Scrapers

A Platform Scraper (PS) is composed of two parts: the matcher, which identifies whether a link can have its data scraped by this PS, and the fetcher, which contains the instructions for scraping data for that link.
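
To illustrate how the two parts fit together, here is a minimal sketch of a scraper list and the lookup a worker might perform. The object shape (name, matches, fetch), the function names and the canned return values are assumptions made for the example, not the actual implementation.

```javascript
// A Platform Scraper pairs a matcher with a fetcher. The fetcher here returns
// canned data so the sketch runs without network access.
const platformScrapers = [
  {
    name: 'youtube',
    matches: (url) => url.includes('youtube.com'),
    fetch: async (url) => ({ views: 100, upvotes: 7 }),
  },
];

// Return the first scraper whose matcher accepts the link, or null.
function findScraper(url, scrapers = platformScrapers) {
  return scrapers.find((scraper) => scraper.matches(url)) ?? null;
}

// Match, then fetch. Unsupported links yield null instead of throwing.
async function scrape(url, scrapers = platformScrapers) {
  const scraper = findScraper(url, scrapers);
  if (scraper === null) return null;
  const data = await scraper.fetch(url);
  return { platform: scraper.name, ...data };
}

scrape('https://youtube.com/watch?v=abc').then((result) => console.log(result));
```

Under this shape, supporting a new platform means appending one object to the list; the lookup and the calling code stay unchanged.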

Matcher

A matcher is a function that returns a boolean indicating whether a particular UGC link belongs to this platform, e.g.

const isYouTubeURL = url => url.includes('youtube.com');

// tumblr.com/video/1234 - false
// youtube.com?w=1234    - true
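
A substring check like the one above also accepts links that merely contain youtube.com somewhere, for example in a query string. A stricter variant, sketched here as an assumption rather than the production matcher, compares the parsed hostname instead. The function name isYouTubeHost is hypothetical, and short-link domains are deliberately left out.

```javascript
// Stricter variant: compare the parsed hostname so that 'youtube.com'
// appearing elsewhere in the URL does not count as a match.
function isYouTubeHost(link) {
  let parsed;
  try {
    // Accept scheme-less links such as 'youtube.com?w=1234'.
    parsed = new URL(/^https?:\/\//i.test(link) ? link : `https://${link}`);
  } catch {
    return false; // not parseable as a URL
  }
  const host = parsed.hostname.toLowerCase();
  return host === 'youtube.com' || host.endsWith('.youtube.com');
}

console.log(isYouTubeHost('youtube.com?w=1234'));                   // true
console.log(isYouTubeHost('tumblr.com/video/1234'));                // false
console.log(isYouTubeHost('https://example.com/?ref=youtube.com')); // false
```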

Fetcher

A fetcher is a function that takes that same link, queries a third-party API for data, and returns some engagement data, e.g.

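const YouTubeDataAPIV3Fetcher = async url => {
  const videoID = extractVideoIDFromYoutubeURL(url);
  // Placeholder endpoint for illustration; query by video ID, not the full URL.
  const response = await fetch(`api.youtube.com/video/${videoID}`).then(r => r.json());

  // Translate platform-specific field names into the shared format.
  return {
    views: response.views,
    upvotes: response.likes,
  };
};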

Platforms use different terms for similar kinds of engagement (views, impressions, upvotes, likes, hearts, etc.). As such, each form of engagement is translated into a consistent format to simplify reporting.
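
The translation step can be sketched as a per-platform lookup table. Everything here is hypothetical except the likes to upvotes mapping, which comes from the fetcher example above: the table name, the function name, the examplePlatform entry and its field names are assumptions for illustration.

```javascript
// Hypothetical per-platform maps from raw API field names to the shared
// vocabulary. Only the likes -> upvotes mapping comes from the fetcher example.
const FIELD_MAPS = {
  youtube: { views: 'views', likes: 'upvotes' },
  examplePlatform: { impressions: 'views', hearts: 'upvotes', reblogs: 'shares' },
};

// Keep only mapped numeric fields, renamed to their shared names.
function normalizeEngagement(platform, raw) {
  const map = FIELD_MAPS[platform];
  if (!map) throw new Error(`No field map for platform: ${platform}`);
  const normalized = {};
  for (const [rawKey, sharedKey] of Object.entries(map)) {
    if (typeof raw[rawKey] === 'number') {
      normalized[sharedKey] = raw[rawKey];
    }
  }
  return normalized;
}

console.log(normalizeEngagement('youtube', { views: 1200, likes: 45, comments: 3 }));
// { views: 1200, upvotes: 45 }
```

Unmapped fields such as comments are dropped, so reports only ever see the shared names.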

Linking UGC to posts, ugc_links

ugc_links provides the ability to filter UGC by post or community, which is helpful in reporting, e.g. answering the question "How many views were generated by UGC in community X over time period Y?". The ugc_links table looks like this: