Introduction to Web Crawling (with JavaScript and Node.js)

Ever since I started learning web development, I have been enthusiastic about web crawling. The technique goes by several names: "web crawling", "web scraping" or "web spidering". Going through the web and using its content for your own ideas seems like an awesome idea to me. That's why I gathered some information and examples to provide an introduction to the topic.


“If you decide that you’re going to do only the things you know are going to work, you’re going to leave a lot of opportunity on the table.” - Jeff Bezos


1. Frameworks and libraries

In the tutorial "Scraping the web with Node.js" by Scotch.io, the following frameworks and libraries are used to traverse a film review website:

  • NodeJS
  • ExpressJS: minimal and flexible Node.js web application framework with features for web and mobile applications
  • Request: helps make HTTP calls
  • Cheerio: Implementation of core jQuery specifically for the server (helps to traverse the DOM and extract data)

That's a very good example of how easy it can actually get. I list these because they are the ones used in most of the available tutorials.
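
To give a feel for what Request and Cheerio do together, here is a minimal, dependency-free sketch of the same fetch-then-extract flow. It is not the code from the Scotch.io tutorial: it assumes Node.js 18 or later (for the built-in `fetch`), and the helper names `extractLinks` and `scrapeLinks` are made up for illustration. The regular expression is only a rough stand-in for Cheerio's selectors and will miss unusual markup.

```javascript
// Hypothetical helper: collect the href of every <a> tag in an HTML string.
// A regex is a crude stand-in for a real DOM parser such as Cheerio.
function extractLinks(html, baseUrl) {
  const links = [];
  const anchorPattern = /<a\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1/gi;
  for (const match of html.matchAll(anchorPattern)) {
    try {
      // Resolve relative links against the page URL.
      links.push(new URL(match[2], baseUrl).href);
    } catch {
      // Skip hrefs that cannot be parsed as a URL.
    }
  }
  return links;
}

// Hypothetical helper: download a page (the job Request does in the tutorial)
// and hand the HTML to the extractor. Assumes Node.js 18+ for global fetch.
async function scrapeLinks(pageUrl) {
  const response = await fetch(pageUrl);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return extractLinks(await response.text(), pageUrl);
}
```

Calling `scrapeLinks('https://example.com/')` would resolve to an array of absolute link URLs; the URL here is just a placeholder.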

2. Examples

  • Francis Kim explains in his article how he scraped a certification website and automatically renders an up-to-date list of developers. He uses promises and MongoDB. It's amazing how he turns the callback-based native MongoDB driver for Node.js into a promise-based one. Definitely check it out.
  • Andrew Forth shows an alternative approach in his article. He combines Node.js with PhantomJS and Horseman: Node drives the headless WebKit browser PhantomJS through the Horseman API. As an example, he created a CLI micro-framework that crawls your GitHub repositories.
  • Stephen from netinstructions.com reveals the sheer simplicity of web scraping in his article. He crawls Reddit, Hacker News and BuzzFeed. His strategy is to identify the structure of the site he wants to crawl with the Chrome DevTools, grab elements with Cheerio and then write the scraped elements to a .txt file. In another post he also explains how to set up crawlers in Node.js.
  • The article "My open source Instagram bot got me 2,500 real followers" by TimG serves as a great example of using web crawling in Python with the Selenium framework for real-life purposes. Social media is steadily gaining importance in business marketing, and data gathered by bots can be a valuable input for executive decisions. His approach shows how effective simple programming can be. Definitely worth checking out!
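
The callback-to-promise conversion mentioned in the first example can be sketched in a few lines. This is not Francis Kim's code and it does not touch MongoDB: `findDeveloper` is a made-up callback-style function standing in for a driver method, and `promisify` is a hand-written helper (Node.js ships a similar one as `util.promisify`).

```javascript
// Wrap a Node-style function (last argument is callback(err, result))
// so that it returns a promise instead.
function promisify(fn) {
  return function (...args) {
    return new Promise((resolve, reject) => {
      fn.call(this, ...args, (err, result) => {
        if (err) reject(err);
        else resolve(result);
      });
    });
  };
}

// Hypothetical callback-based lookup, standing in for a database driver call.
function findDeveloper(name, callback) {
  setTimeout(() => {
    if (!name) callback(new Error('name is required'));
    else callback(null, { name, certified: true });
  }, 0);
}

const findDeveloperAsync = promisify(findDeveloper);
```

With the wrapper in place, `findDeveloperAsync('Ada').then((dev) => console.log(dev.name))` replaces the nested callback, and errors arrive through `.catch()`.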

These are just some examples! Check out this comprehensive collection from potentpages.com.
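
The crawling strategy from these articles boils down to a queue of pages to visit plus a set of pages already seen. The sketch below assumes a few things that are not in the original posts: the names `crawl`, `fetchPage` and `maxPages` are invented here, link extraction uses a crude regular expression instead of Cheerio, and the crawler stays on the start page's origin. Passing the page loader in as a function keeps the example independent of any HTTP library.

```javascript
// Breadth-first crawl starting at startUrl.
// fetchPage(url) must return a promise for the page's HTML (assumption:
// in a real script it could wrap fetch, Request or any other HTTP client).
async function crawl(startUrl, fetchPage, maxPages = 10) {
  const origin = new URL(startUrl).origin;
  const queue = [new URL(startUrl).href];
  const seen = new Set(queue);
  const visited = [];

  while (queue.length > 0 && visited.length < maxPages) {
    const url = queue.shift();
    let html;
    try {
      html = await fetchPage(url);
    } catch {
      continue; // Skip pages that fail to load.
    }
    visited.push(url);

    // Crude href extraction; a DOM parser is more robust.
    for (const match of html.matchAll(/<a\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1/gi)) {
      let next;
      try {
        next = new URL(match[2], url);
      } catch {
        continue; // Ignore malformed links.
      }
      next.hash = ''; // Treat page#a and page#b as the same page.
      if (next.origin === origin && !seen.has(next.href)) {
        seen.add(next.href);
        queue.push(next.href);
      }
    }
  }
  return visited;
}
```

The `seen` set is what keeps the crawler from looping forever when two pages link to each other, and `maxPages` is a simple safety limit so that a test run does not hammer a site.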

3. Scraping JavaScript rendered sites?

The discussion about the crawlability of JavaScript-rendered websites reaches back many years and is mostly held in terms of search engine optimization (SEO). If you write your own solution, the easy answer is an HTML rendering engine that lets your crawler act the same way as a normal browser. There are many tools that allow you to mimic such behavior; a practical example is a WebDriver as used by Selenium.
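
To see why such sites need a rendering engine, compare what a plain HTTP fetch receives from a server-rendered page and from a client-rendered one. The two HTML snippets and the `visibleText` helper below are invented for illustration; the helper strips tags with regular expressions, which is only good enough for a demo.

```javascript
// Made-up example pages: the first ships its content in the HTML,
// the second builds it in the browser with JavaScript.
const serverRendered = '<html><body><div id="app"><h1>Top films</h1></div></body></html>';
const clientRendered =
  '<html><body><div id="app"></div>' +
  '<script>document.getElementById("app").innerHTML = "<h1>Top films</h1>";</script>' +
  '</body></html>';

// Hypothetical helper: the text a scraper sees without executing any scripts.
function visibleText(html) {
  return html
    .replace(/<script\b[\s\S]*?<\/script>/gi, '') // Scripts are never run here.
    .replace(/<[^>]+>/g, ' ')                     // Drop the remaining tags.
    .replace(/\s+/g, ' ')
    .trim();
}

console.log(JSON.stringify(visibleText(serverRendered))); // "Top films"
console.log(JSON.stringify(visibleText(clientRendered))); // ""
```

A headless browser or a WebDriver session executes the script first, so the second page would then yield the same text as the first.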

Web scraping is an amazing way to gather a lot of data with comparatively little effort. Using and analyzing the collected data may provide a competitive advantage and gives great insights into how a platform behaves.

Terms of use

The first thing to look at is the terms of use. Some sites explicitly address whether their website may be used with scraping tools or APIs. Always take a look at these before you start.

Law in a wider sense

Copyright, privacy, competition and civil law may be violated, depending on the case. It's important to be aware of the differences between court rulings in different countries (especially the United States and Europe) and of legislation that is simply missing because "internet cases" have developed so quickly.

It's safe to say that if you have the feeling that some web scraping action is not legal, it probably isn't. Websites and databases are often protected by intellectual property law, which means that others are not allowed to use the data presented on the website. This makes perfect sense, because people put effort and knowledge into their online presence and the data they create.

This applies to social media platforms in particular. Using their data and creating automated bots violates their fundamental principle of human interaction. It's therefore safest to assume that any kind of bot there violates some applicable rule or law.

This article from 2013 shows the legal complexity in more detail.

They conclude:

Ultimately, while the claims and theories that may be advanced in connection with the use of web crawling and scraping tools for analytics purposes have yet to be deeply explored by courts, this is likely a temporary state of affairs. Rather, given the increasing number and availability of tools for aggregation and analysis of content in the Big Data era, courts will ultimately be required to address these complicated issues.

That said, be prepared to face the consequences if site operators ban or sue you for violating their rules.

5. Additional - a list of established Node.js crawlers on GitHub

Conclusion

As this article showed, it is actually really easy to build a web crawler or web scraper with JavaScript. It is also a good example of how programming lets you reach the same result in different ways. I recommend checking out the additional links, since some of them provide great inspiration!

With this introduction I really just scratched the surface. There is much more to discover! The reason this article ended up rather short is that I got lost in programming my own web crawler. Playing around with Promises distracted me from my main goal. So another takeaway for me as a programmer was: "Don't get distracted by other things! Stay focused on your objective and get it!"

If you gained something from this article let me know with a comment or heart. Make sure to follow for more :)
