Poking around the data science community on Medium.com - Part 2

Nancy Dong

May 23, 2019


Where we last left off, I was attempting to scrape all the Medium articles tagged with "Data Science" published in 2018. The Scrapy crawler I wrote followed the link in each story card and accessed each article page to get the author name, tags, number of claps, etc. To check whether I was missing any articles for whatever reason, I wrote a second crawler that scraped only the article title and link from each story card.
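
For reference, the link-following logic in the Part 1 crawler looked roughly like this (a simplified sketch rather than the actual Part 1 code; parse_article is an illustrative name):

    def parse(self, response):
        # Follow the "read more" link on each story card to the article page
        for story in response.css('div.postArticle'):
            link = story.css('div.postArticle-readMore a::attr(href)').get()
            if link:
                yield response.follow(link, callback=self.parse_article)

    def parse_article(self, response):
        # Extract the author name, tags, claps, etc. from the article page
        ...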

Note: I realize there must be much simpler and more elegant ways of doing all of this (please feel free to let me know in the comments if you have better ideas!). However, this was my first attempt at web scraping, so I settled for the quickest way to get some data that I could analyze.

Here is the script for the simplified crawler:

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.log import configure_logging
import logging

# Item declaring the two fields to scrape this time around
class Article(scrapy.Item):
    article = scrapy.Field()
    articleURL = scrapy.Field()

logger = logging.getLogger('scrapylogger')

class MediumSpider(scrapy.Spider):
    name = "medium_spider" # Name of the scraper

    configure_logging(install_root_handler=False)
    logging.basicConfig(
        filename='./data/raw/medium_titles_2018_log.txt',
        format='%(levelname)s: %(message)s',
        level=logging.INFO
    )

    custom_settings = {
        'FEED_FORMAT': 'csv',
        'FEED_URI': './data/raw/medium_titles_2018.csv',
        'AUTOTHROTTLE_ENABLED': True,
        'AUTOTHROTTLE_START_DELAY': 1,
        'AUTOTHROTTLE_MAX_DELAY': 3
    }

    def start_requests(self):
        urls = []

        # Build one archive URL per calendar day of 2018 (this naively
        # includes nonexistent dates like February 30, which is harmless)
        for month in range(1, 13):
            for day in range(1, 32):
                urls.append(f"https://medium.com/tag/data-science/archive/2018/{month:02}/{day:02}")

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Each story card on the archive page is a div.postArticle
        for story in response.css('div.postArticle'):
            yield {
                # The title may be split across several text nodes
                'article': story.css(
                    'div.postArticle-content section div.section-content div h3::text').getall(),
                # The "read more" link points at the article itself
                'articleURL': story.css('div.postArticle-readMore a::attr(href)').get(),
            }


process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MediumSpider)
process.start()
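
A few notes on the settings: AUTOTHROTTLE_ENABLED makes Scrapy adapt the delay between requests to how quickly Medium responds, which keeps the crawler polite, and FEED_FORMAT/FEED_URI tell Scrapy to write every yielded item straight to the CSV file. Because CrawlerProcess starts Scrapy's reactor itself, the script can be run directly with python rather than through the scrapy command.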

Now that this is done, let's peek at the two data sets:

## Import libraries
import pandas as pd
pd.set_option('display.max_colwidth', 500)

## Import data sets
full = pd.read_csv("https://github.com/nd823/scrapy/raw/master/data/raw/medium_full_2018.csv")
titles = pd.read_csv("https://github.com/nd823/scrapy/raw/master/data/raw/medium_titles_2018.csv")
## Preview data
full.head()
NumOfClaps NumOfComments article articleTags articleURL linkOfAuthorProfile nameOfAuthor postingTime readingTime
0 NaN NaN The Woman Behind the Data Data,Datascience,Orlando https://medium.com/@datawonderment/the-woman-behind-the-data-c908cf4999f6?source=tag_archive---------25--------------------- https://medium.com/@datawonderment Data Wonderment Jan 1, 2018 2 min read
1 NaN NaN Precisely How Buzz Monitoring Can Be A Compelling Factor For An Organisation Big Data,Data Science,Data Analysis https://medium.com/@ankit.jain_86719/precisely-how-buzz-monitoring-can-be-a-compelling-factor-for-an-organisation-ea44e19dd756?source=tag_archive---------45--------------------- https://medium.com/@ankit.jain_86719 Canopus Infosystems Jan 1, 2018 2 min read
2 NaN NaN Transforming Sales and Marketing through Data Analytics Big Data,Data Analysis,Data Science https://medium.com/@ankit.jain_86719/transforming-sales-and-marketing-through-data-analytics-967bf2c027c8?source=tag_archive---------49--------------------- https://medium.com/@ankit.jain_86719 Canopus Infosystems Jan 3, 2018 2 min read
3 NaN NaN Data Analytics Growth in Indian Business Industries Big Data,Data Analysis,Data Science https://medium.com/@ankit.jain_86719/data-analytics-growth-in-indian-business-industries-4dc14371f9d1?source=tag_archive---------43--------------------- https://medium.com/@ankit.jain_86719 Canopus Infosystems Jan 4, 2018 2 min read
4 NaN NaN SO/IEC 27040:2015 — OVERVIEW AND SANITATION STANDARDS Security,Privacy,Cybersecurity,Compliance,Data Science https://medium.com/@cory_24274/so-iec-27040-2015-overview-and-sanitation-standards-964ad71b49c?source=tag_archive---------34--------------------- https://medium.com/@cory_24274 Clarabyte Jan 5, 2018 5 min read
## Preview data
titles.head()
article articleURL
0 AI and Machine Learning in Cyber Security https://towardsdatascience.com/ai-and-machine-learning-in-cyber-security-d6fbee480af0?source=tag_archive---------0---------------------
1 Redefining statistical significance: the statistical arguments https://medium.com/@richarddmorey/redefining-statistical-significance-the-statistical-arguments-ae9007bc1f91?source=tag_archive---------1---------------------
2 I do not understand t-SNE — Part 1 https://medium.com/@layog/i-dont-understand-t-sne-part-1-50f507acd4f9?source=tag_archive---------2---------------------
3 Statistical Analysis with Python: Pokémon https://medium.com/dataregressed/statistical-analysis-with-python-pok%C3%A9mon-1a72dd0451e1?source=tag_archive---------3---------------------
4 สอนให้เครื่องจักรเข้าใจภาษามนุษย์ภายใน code 3 บรรทัด (Python — Novice Level) https://medium.com/@dumpdatasci.th/%E0%B8%AA%E0%B8%AD%E0%B8%99%E0%B9%83%E0%B8%AB%E0%B9%89-%E0%B9%80%E0%B8%84%E0%B8%A3%E0%B8%B7%E0%B9%88%E0%B8%AD%E0%B8%87%E0%B8%88%E0%B8%B1%E0%B8%81%E0%B8%A3%E0%B9%80%E0%B8%82%E0%B9%89%E0%B8%B2%E0%B9%83%E0%B8%88%E0%B8%A0%E0%B8%B2%E0%B8%A9%E0%B8%B2%E0%B8%A1%E0%B8%99%E0%B8%B8%E0%B8%A9%E0%B8%A2%E0%B9%8C-code-python-3-%E0%B8%9A%E0%B8%A3%E0%B8%A3%E0%B8%97%E0%B8%B1%E0%B8%94-novice-level-12214ce838e4?source=tag_archive---------4---------------------

In theory, if my scraping were perfect in both cases, the two crawlers would return the same number of articles/links. However, a quick comparison of the two data sets, namely the "full-featured" article data from the crawler in Part 1 and the simplified "title-only" data from the crawler above, shows that they differ:

## Check dimensions of the two datasets
print("Full:", full.shape)
print("Titles:", titles.shape)
Full: (19698, 9)
Titles: (20127, 2)

Looking over the data sets, we see that there are duplicate links that differ only in the last part of the URL (namely, ?source=tag_archive---------0---------------------). As the links are still valid after removing everything from the ? onwards, I did just that and then deduplicated rows with identical article links.

## Clean full info dataset
## Strip the query string to get the canonical link
full['articleLink'] = full['articleURL'].str.split('?').str[0]

full.drop('articleURL', axis=1, inplace=True)

full.drop_duplicates(subset=['articleLink'], keep='first', inplace=True)

## Drop rows with missing links (dropna on the dataframe, not the
## column, so the rows are actually removed)
full.dropna(subset=['articleLink'], inplace=True)

full.shape
(19653, 9)
## Clean title info dataset
titles['articleLink'] = titles['articleURL'].str.split('?').str[0]

titles.drop('articleURL', axis=1, inplace=True)

titles.drop_duplicates(subset=['articleLink'], keep='first', inplace=True)

## Again, dropna at the dataframe level
titles.dropna(subset=['articleLink'], inplace=True)

titles.shape
(19678, 2)

Now we see that the numbers of articles/links in the two datasets are much closer.
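
As an aside, splitting on ? works here because these URLs have no fragments or other quirks; a slightly more defensive way to strip the query string is the standard library's urllib.parse, sketched here with an illustrative URL:

from urllib.parse import urlsplit, urlunsplit

def strip_query(url):
    ## Keep the scheme, host, and path; drop the query string and fragment
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))

strip_query('https://medium.com/@someone/some-article-123abc?source=tag_archive---------0---------------------')
## -> 'https://medium.com/@someone/some-article-123abc'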

At this point, I want to get data on the articles whose links appear in the "title-only" dataset but not in the "full" one, just to see what information I am missing. To find those links, I will use set operations:

set(titles['articleLink']) - set(full['articleLink'])
{'https://42hire.com/how-to-interview-a-data-scientist-and-know-if-theyre-good-18bd02ca43c3',
 'https://blog.color.com/leap-machine-learning-from-evidence-to-assess-pathogenicity-8bec2e0caa93',
 'https://blog.daylightdata.com/daylight-data-develops-launch-ready-analytics-for-neuroplus-d044e59f9ecd',
 'https://blog.daylightdata.com/duke-vs-unc-rivalry-visualized-5b3aac767bdc',
 'https://blog.daylightdata.com/hubdaylight-opens-providing-office-space-and-private-accelerator-14003fddffbf',
 'https://blog.daylightdata.com/the-gender-wage-gap-fact-vs-myth-5256f388118d',
 'https://blog.derniercri.io/julia-le-langage-qui-les-r%C3%A9unifiera-tous-3a274cb8794f',
 'https://blog.getnotion.com/build-interpret-and-protect-3-skill-sets-required-within-iot-68c1c154947b',
 'https://blog.getnotion.com/spotlight-series-data-science-69c8f35d8c8',
 'https://blog.impress.ai/ai-and-the-power-of-change-co-hosted-by-impress-ai-accenture-212a03e4191c',
 'https://blog.keen.io/order-and-limit-results-of-grouped-queries-hooray-83e768a97411',
 'https://blog.manifold.co/modeling-system-resource-usage-for-predictive-scheduling-738ca174cfe7',
 'https://blog.manifold.co/using-redis-streams-for-time-series-25de5b12bb46',
 'https://blog.monteirolima.adv.br/eu-tenho-o-direito-ao-esquecimento-e360d9fdf3a4',
 'https://blog.monteirolima.adv.br/h%C3%A1-um-direito-a-ser-deixado-em-paz-como-apagar-ou-desindexar-um-conte%C3%BAdo-na-internet-f5da7da1282b',
 'https://blog.monteirolima.adv.br/sou-candidato-como-me-prevenir-de-fakenews-nessas-elei%C3%A7%C3%B5es-6dcb0fe46b21',
 'https://blog.outsellinc.com/what-the-hell-is-blockchain-and-how-is-it-enabling-innovation-52e05b75888',
 'https://blog.rotageek.com/retail-2017-in-review-d9a277f3c360',
 'https://devup.co/flight-data-analysis-with-spark-ml-and-minio-fffe5808a36e',
 'https://engineering.olist.com/como-estruturar-uma-equipe-de-an%C3%A1lise-de-dados-em-uma-startup-parte-1-509a9c65cfe2',
 'https://insights.upfront.com/giving-time-back-to-our-doctors-and-nurses-to-deliver-patient-care-c24b43bfa0a2',
 'https://medium.com/@jiaheng_wei/data-scraping-and-data-cleaning-a4ca1aacbf9c',
 'https://medium.com/@kgstaub/unlock-the-true-value-of-your-datasets-4267d76741e6',
 'https://medium.com/@lucasoliveira_56505/estatistica-data-science-1d939b633d46',
 'https://medium.com/@will.mccaughey/meettheteam-giorgio-galvan-data-analyst-13afd7ee7d0b',
 'https://medium.reinvent.net/preparing-for-a-career-that-doesnt-exist-yet-ce4006bc5',
 'https://shortlythereafter.co.uk/accountants-vs-zombies-c90040f2193c',
 'https://tech.cars.com/shifting-into-data-science-2977c1fbf6c2',
 'https://tech.cars.com/whats-interesting-about-hotcars-b7eda5d14176',
 'https://theascent.pub/how-much-coffee-is-too-much-90fd61bd3325',
 'https://theascent.pub/what-i-learned-from-from-rick-rashid-founder-of-microsoft-research-20-965b3f465dcc',
 'https://thedatainart.com/comedy-central-presents-complete-episode-list-with-imdb-ratings-and-links-to-view-episode-1aa30481ff79',
 'https://towardsdatascience.com/fear-factor-guns-vs-terrorism-e6b92ebb576d'}

A cursory look through these 33 links shows that some lead to 404 errors, and for others the Scrapy log shows redirects that may have interfered with the scraping process. Again, as this is a quick proof-of-concept and hobby project, I'm not too concerned with capturing every last article. Though I will try my hardest. :)
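
If you want to triage such links yourself, a quick status check with the requests library does the job; this is just a sketch, with missing_links holding the set difference from above:

import requests

missing_links = set(titles['articleLink']) - set(full['articleLink'])

for link in sorted(missing_links):
    ## allow_redirects=False so redirects (3xx) show up as well as 404s
    resp = requests.head(link, allow_redirects=False, timeout=10)
    if resp.status_code != 200:
        print(resp.status_code, link)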

So, I fed these links back into the crawler shown in Part 1. One convenient way to do that is to pass the list of URLs straight to the spider, roughly as sketched below (MediumFullSpider is an illustrative stand-in for the Part 1 spider, and this assumes it uses Scrapy's default start_requests):
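
## Keyword arguments passed to crawl() become spider attributes, so the
## default start_requests() crawls exactly the URLs in start_urls
## (missing_links is the set difference computed above)
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(MediumFullSpider, start_urls=list(missing_links))
process.start()

This run seemed to be successful, returning article data for 18 of the 33 articles: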

additional = pd.read_csv('https://github.com/nd823/scrapy/raw/master/data/raw/medium_missing_2018.csv')
print(additional.shape)
additional.head()
(18, 9)
NumOfClaps NumOfComments article articleTags articleURL linkOfAuthorProfile nameOfAuthor postingTime readingTime
0 1 clap NaN Unlock the True Value of your Datasets Fintech,Big Data,Big Data Analytics,Data Science,Software Engineering https://medium.com/avionix-llc/unlock-the-true-value-of-your-datasets-4267d76741e6 https://medium.com/@kgstaub Kenneth Staub 25-Jul-18 3 min read
1 168 clap 1.0 NaN NaN https://upfront.com/thoughts/giving-time-back-to-our-doctors-and-nurses-to-deliver-patient-care-c24b43bfa0a2 Kevin Zhang 27-Mar-18 5 min read
2 50 claps NaN Data Scraping and Data Cleaning Data Science https://medium.com/unreal-madrid/data-scraping-and-data-cleaning-a4ca1aacbf9c https://medium.com/@jiaheng_wei Derrick Wei 23-Oct-18 4 min read
3 661 claps NaN NaN NaN https://www.manifold.co/blog/modeling-system-resource-usage-for-predictive-scheduling-738ca174cfe7 Jessie-Raye Bauer 14-Aug-18 10 min read
4 26 claps NaN How to interview a Data Scientist (and know if they’re good) Data Science,Human Resources,Recruiting,Interview,Hiring https://42hire.com/how-to-interview-a-data-scientist-and-know-if-theyre-good-18bd02ca43c3?gi=8c5a91ec22a0 https://42hire.com/@MurtazaBambot Murtaza Bambot 20-Aug-18 5 min read

Finally, I want to process the links in this dataset the same way as above and merge the result into the full dataframe:

## Clean up the dataframe
additional['articleLink'] = additional['articleURL'].str.split('?').str[0]
additional.drop('articleURL', axis=1, inplace=True)
additional.drop('article', axis=1, inplace=True)

## Merge the two dataframes (sort=True orders the combined columns alphabetically)
final_df = pd.concat([full, additional], sort=True)

It's good to check that there are no duplicate entries in the merged dataframe:

## Check for duplicate entries
final_df[final_df.duplicated()]
NumOfClaps NumOfComments article articleLink articleTags linkOfAuthorProfile nameOfAuthor postingTime readingTime
## Check the dataframe size
final_df.shape
(19671, 9)

Truthfully, scraping the story cards the second time didn't add much data to the final set. However, it was a good exercise in data manipulation and cleaning. Taking this dataset as a good-enough starting point, let's do some analysis!
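
To carry this dataset into Part 3 without re-scraping, I'll save it first (the path below is illustrative):

## Save the cleaned, merged dataset for Part 3
final_df.to_csv('./data/processed/medium_final_2018.csv', index=False)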

Stay tuned, Part 3 is coming right up.