Web Scraping 1100 Blog Posts from a Website

When you don’t have access to the database files for a WordPress website (or any website), use web scraping. Web scraping is programmatically pulling data off a website’s pages.

For this particular web scraping project, I had to scrape 1100 blog posts from a website. I had access to the WordPress website, but the exporter failed.

I found a JavaScript web scraper called X-ray, and so began my ten-hour journey. I started up my local NodeJS server and scraped my first web page. It was a momentous moment.

I came up with my battle plan (there’s a rough code sketch after the list).

  1. Get the data from the first blog post.
  2. Navigate to the second page.
  3. Get all the data from the second page.
  4. Continue looping until reaching the last blog post.
  5. Dump all the data into a file.
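
In code, the plan looked something like this. To be clear, grabPost here is a hypothetical stand-in for whatever actually fetches one post; only the outline matters:

const fs = require('fs');

// grabPost is a hypothetical, promise-returning helper that resolves
// with { data, nextUrl }, where nextUrl is null on the last post.
async function scrapeAll(firstUrl) {
  const posts = [];
  let url = firstUrl;
  while (url) {                                    // 4. keep looping
    const { data, nextUrl } = await grabPost(url); // 1 & 3. get the post's data
    posts.push(data);
    url = nextUrl;                                 // 2. move on to the next post
  }
  fs.writeFileSync('posts.json', JSON.stringify(posts, null, 2)); // 5. dump to a file
}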

When I started this web scraping endeavor, it was really late. I looked all over the place to see if X-ray could paginate, but I couldn’t find the answer. Here is the code at the top of the GitHub page:

var Xray = require('x-ray');
var x = Xray();

x('https://dribbble.com', 'li.group', [{
  title: '.dribbble-img strong',
  image: '.dribbble-img [data-src]@data-src',
}])
  .paginate('.next_page@href')
  .limit(3)
  .write('results.json')

How I missed the .paginate function, I’ll never know.

Since I had the XML export file from WordPress, I used it to get the links for every blog post. I had to use a PHP function called simplexml_load_file because the file was too big for any of the online tools. That’s what’s great about being a web developer: when everything else fails, you can fall back on the most basic, low-level functions.
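
If you’d rather stay in JavaScript, Cheerio (which shows up again in a minute) can do the same job in XML mode. Here’s a sketch, assuming the standard WordPress export format where every post is an <item> with a <link> child; the file name is a placeholder:

const fs = require('fs');
const cheerio = require('cheerio');

// Parse the WordPress export (WXR) file as XML rather than HTML.
const xml = fs.readFileSync('wordpress-export.xml', 'utf8');
const $ = cheerio.load(xml, { xmlMode: true });

// Every post in the export is an <item> element with a <link> child.
const links = $('item > link').map((i, el) => $(el).text()).get();
console.log(links.length + ' post URLs found');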

Now, with all the links, I was able to pass a new URL to X-ray on each loop. X-ray worked great at web scraping, but it only grabbed text; it couldn’t give me the HTML, and I needed the HTML. So I decided to use another JavaScript tool called Cheerio, paired with the Request library to fetch each page. I love lower-level libraries that actually give you some control.
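
The pattern looks something like this, with Request fetching the page and Cheerio parsing it. The .entry-content selector is an assumption; it’s just where many WordPress themes put the post body:

const request = require('request');
const cheerio = require('cheerio');

function scrapePost(url, callback) {
  request(url, (err, res, body) => {
    if (err) return callback(err);
    const $ = cheerio.load(body);
    // .entry-content is an assumption: it's the container many
    // WordPress themes use for the post body.
    callback(null, $('.entry-content').html());
  });
}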

With Cheerio, I was able to grab the HTML of each post, and that’s exactly what I wanted. Everything was ready: I had my loops, and all my JavaScript variables, arrays, and objects were set up. But when I ran the program, I started getting “hang up” and “ERRCON” errors.

The errors happened because the server sensed I was being too grabby; it wouldn’t let me pull all 1100 blog posts at once. There went the for loop. I contemplated making a button I could click 1100 times that would grab a post each time I clicked it. I also contemplated JavaScript’s setTimeout function, but setTimeout doesn’t pause a for loop: it returns immediately and just schedules a callback, so every request still fires at almost the same time.
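
To see why, picture the loop (urls is the list of links from the export, and handlePost is a hypothetical callback that deals with each result):

for (let i = 0; i < urls.length; i++) {
  // setTimeout returns immediately, so all of these timers are created
  // in the same instant; every request still fires about one second
  // from now, all at once.
  setTimeout(() => scrapePost(urls[i], handlePost), 1000);
}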

I decided to sleep on it, but then it hit me: I could use a recursive function!!! And that’s exactly what I used. I set the setTimeout delay to one second, and I didn’t get any more errors.
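
The trick is that the recursive version doesn’t schedule the next request until the current one has finished, so there’s a genuine one-second gap between posts. A sketch, reusing the scrapePost helper from above:

function scrapeNext(index) {
  if (index >= urls.length) return;    // finished the last post
  scrapePost(urls[index], (err, html) => {
    // (store the html somewhere here)
    // Only now, after this request has completed, schedule the next one.
    setTimeout(() => scrapeNext(index + 1), 1000);
  });
}

scrapeNext(0);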

The next morning, I discovered that more than half of the blog posts didn’t have pictures. Whoops. That’s when I looked at X-ray a little more closely and discovered it did exactly what I needed: it could paginate, limit, write, and delay!
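
Per the README, the whole chain would have looked something like this (the URL and selectors are placeholders):

var Xray = require('x-ray');
var x = Xray();

x('http://example.com/blog', '.post', [{
  image: 'img@src',
}])
  .paginate('.next@href')  // follow the "next page" link
  .limit(1100)             // stop after 1100 posts
  .delay(1000)             // wait a second between page requests
  .write('images.json')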

…Unfortunately, no matter how hard I tried, I couldn’t get it to write. Again, I love lower-level libraries that give you more control, so I fell back on the previous day’s solution and grabbed the images for the remaining posts.
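
The image grab was the same Request-plus-Cheerio pattern, just pulling img sources instead of the post body (same .entry-content assumption as before):

const request = require('request');
const cheerio = require('cheerio');

function scrapeImages(url, callback) {
  request(url, (err, res, body) => {
    if (err) return callback(err);
    const $ = cheerio.load(body);
    // Collect the src of every image inside the post body.
    const images = $('.entry-content img')
      .map((i, el) => $(el).attr('src'))
      .get();
    callback(null, images);
  });
}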

Summary of Web Scraping Website Project

I scraped 1100 blog posts from a website. I used two JavaScript tools called Cheerio and Request on a NodeJS server to do it.

  • I discovered a JavaScript tool called Cheerio, which is great for web scraping.
  • I created a JavaScript function on a NodeJS server with Cheerio.
  • I scraped 1100 blog posts from a website.