Poking around the data science community on Medium.com - Part 1

Nancy Chelaru-Centea

May 22, 2019


The data science community on Medium.com has really exploded in the past few years, paralleling the data science boom. Browsing through the sheer deluge of data science articles published each day, I tend to cycle through intrigue, inadequacy, anxiety, boredom, clickbait fatigue, and some mixture of all of them. I suspect you have felt something similar.

Still, such a mountain of readily available data presents an opportunity for analysis, so I decided to dive in and do some independent data science of my own. It turns out this has been a great project for getting my feet wet with web scraping, data cleaning and natural language processing, none of which I had done before in any real sense. Plus, trying to sift through the Medium data science hive mind feels like something that a website called "Intelligence Refinery" should do.

In the first part of this series, I will go over the web scraping portion of the project, where I collected, among other things, the URL, title, author name, publication date, tags, number of comments and number of claps of every article tagged "Data Science" on Medium (the earliest of which was published in 2009).

Since I was completely new to web scraping, I looked around for existing and fairly recent scripts that scrape Medium articles. I found two that use Selenium (here and here), but given the amount of data to be scraped, I wanted to use Scrapy instead (see a comparison of the two here). Plus, Scrapy has a built-in selector system, which means I don't have to use BeautifulSoup to parse the HTML. I ended up using the Scrapy workflow by May Yeung (posted on Medium, of course) as a starting point and, after much trial and error, arrived at the script below.
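To illustrate what I mean by Scrapy's built-in selectors, here is a minimal, standalone sketch (the HTML snippet is made up) comparing a Scrapy CSS selector to the equivalent BeautifulSoup call I would otherwise need:

from scrapy.selector import Selector
from bs4 import BeautifulSoup

html = '<div class="postArticle"><h3>A made-up title</h3></div>'

# Scrapy: CSS (or XPath) selection is built right into the Selector/Response objects
print(Selector(text=html).css('div.postArticle h3::text').get())

# BeautifulSoup: a separate parsing step on the raw HTML
print(BeautifulSoup(html, 'html.parser').select_one('div.postArticle h3').get_text())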

Looking at the archive of all articles tagged with "Data Science", I saw that I could iterate over each year (2009 to present), each month (01-12) and each day (01-31) to reach the story cards of every article with that tag. Although each story card shows the title, author name, publication date, and the numbers of comments and claps, I needed to visit the actual article page to get the tags. So, the script below follows the article URL on each story card to the article page and scrapes all the desired elements there. This is the major departure from May Yeung's workflow, which scraped only the story cards.
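As an aside, the daily archive URLs follow a predictable pattern, so they can be generated up front. Here is a minimal sketch that builds the 2018 list while skipping dates that don't exist (e.g. February 30th), using Python's calendar module; the actual script below simply loops over days 01-31:

import calendar

year = 2018
archive_urls = []

for month in range(1, 13):
    # monthrange() returns (weekday of the 1st, number of days in the month)
    days_in_month = calendar.monthrange(year, month)[1]
    for day in range(1, days_in_month + 1):
        archive_urls.append(
            f"https://medium.com/tag/data-science/archive/{year}/{month:02}/{day:02}")

print(len(archive_urls))  # 365 daily archive pages for 2018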

As there are almost 20,000 articles published in the more recent years, I decided to divide up the scraping by year. As an example, here is the script that I used to get all the articles tagged with "Data Science" published in 2018:

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.log import configure_logging
import logging

import os
os.chdir('/Users/nancy/PycharmProjects/medium-ds-articles/data/raw/')

# Fields to scrape for each article
class Article(scrapy.Item):
    nameOfAuthor = scrapy.Field()
    linkOfAuthorProfile = scrapy.Field()
    NumOfComments = scrapy.Field()
    article = scrapy.Field()
    postingTime = scrapy.Field()
    NumOfClaps = scrapy.Field()
    articleURL = scrapy.Field()
    articleTags = scrapy.Field()
    readingTime = scrapy.Field()

logger = logging.getLogger('scrapylogger')


class MediumSpider(scrapy.Spider):
    name = "medium_spider"


    # Write Scrapy's log messages to a file instead of the console
    configure_logging(install_root_handler=False)
    logging.basicConfig(
        filename='medium_full_2018_log.txt',
        format='%(levelname)s: %(message)s',
        level=logging.INFO
    )

    # Export everything to CSV and auto-throttle requests to avoid hammering the site
    custom_settings = {
        'FEED_FORMAT': 'csv',
        'FEED_URI': 'medium_full_2018.csv',
        'AUTOTHROTTLE_ENABLED': True,
        'AUTOTHROTTLE_START_DELAY': 1,
        'AUTOTHROTTLE_MAX_DELAY': 3
    }


    def start_requests(self):
        urls = []

        # One daily archive page per day of 2018 (the archive is organized by year/month/day)
        for month in range(1, 13):
            for day in range(1, 32):
                urls.append(f"https://medium.com/tag/data-science/archive/2018/{month:02}/{day:02}")

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)



    def parse(self, response):
        item = Article()

        # Each story card on the archive page links to the full article via its
        # "read more" button; follow that link and scrape the article page itself
        for story in response.css('div.postArticle'):
            url = story.css('div.postArticle-readMore a::attr(href)').extract_first()
            if url is not None:
                yield scrapy.Request(url=url, callback=self.parse_full, meta={'item': item})

    def parse_full(self, response):
        # Pull the item passed along from parse() and fill in the article-level fields
        item = response.meta['item']
        item['articleURL'] = response.request.url
        # The post's title/lead text can appear in several different elements depending
        # on formatting, so take the first text match across this range of selectors
        item['article'] = response.css('div.postArticle-content section div.section-content div h1::text, \
                                        div.postArticle-content section div.section-content div h1 a::text, \
                                        div.postArticle-content section div.section-content div h1 strong::text,\
                                        div.postArticle-content section div.section-content div h1 em::text, \
                                        div.postArticle-content section div.section-content div h3::text, \
                                        div.postArticle-content section div.section-content div h4::text, \
                                        div.postArticle-content section div.section-content div p strong::text, \
                                        div.postArticle-content section div.section-content div p strong em::text, \
                                        div.postArticle-content section div.section-content div p::text').extract_first()

        # These elements are missing from some articles, so fall back to a blank value
        try:
            item['linkOfAuthorProfile'] = response.css('div.u-paddingBottom3 a').attrib['href']
        except KeyError:
            item['linkOfAuthorProfile'] = ' '

        try:
            item['readingTime'] = response.css('span.readingTime').attrib['title']
        except KeyError:
            item['readingTime'] = ' '


        item['nameOfAuthor'] = response.css('div.u-paddingBottom3 a::text').extract_first()
        item['postingTime'] = response.css('time::text').extract_first()
        item['articleTags'] = response.css('div.u-paddingBottom10 ul.tags--postTags li a::text').getall()
        item['NumOfComments'] = response.css(
            'div.buttonSet.u-flex0 button.button.button--chromeless.u-baseColor--buttonNormal.u-marginRight12::text').extract_first()
        item['NumOfClaps'] = response.xpath(
            '//div/main/article/footer/div[1]/div[3]/div/div[1]/div/span/button//text()').extract_first()


        yield item


# Run the spider directly from this script (rather than via the scrapy CLI)
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MediumSpider)
process.start()
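
Once the crawl finishes, the results land in medium_full_2018.csv, with one column per field defined in the Article item above. A quick sanity check of the output with pandas might look something like this (a minimal sketch; the real cleaning comes in part 2):

import pandas as pd

# Load the CSV produced by the spider's feed export
df = pd.read_csv('medium_full_2018.csv')

print(df.shape)  # number of scraped articles x number of fields
print(df[['nameOfAuthor', 'postingTime', 'articleTags', 'NumOfClaps']].head())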

Well, this is already quite lengthy for part 1. I will continue tomorrow with how I cleaned and parsed the data.

'Til then! :)