Blog of Raivo Laanemets

Stories about web development, consulting and personal computers.

Announcement: DOM-EEE

On 2016-12-17

DOM-EEE is a library for extracting structured JSON data from DOM trees. The EEE part of the name stands for Extraction Expression Evaluator. The library takes a specification in the form of a JSON document containing CSS selectors and extracts data from the page DOM tree. The output is also a JSON document.

I started developing the library while dealing with many web scraping projects. There have been huge differences in navigation logic, page fetch strategies, automatic proxying, and runtimes (Node.js, PhantomJS, browser userscripts), but the data extraction code has been similar. I tried to cover these similarities in this library while making it work in the following environments:

  • Browsers (including userscripts)
  • PhantomJS
  • Cheerio (Node.js)
  • jsdom (Node.js)
  • ES5 and ES6 runtimes

The library is a single file that is easy to inject into any of these environments. As the extraction expressions are kept in the JSON format, and the output is a JSON document, any programming platform supporting JSON and HTTP can be coupled to PhantomJS, a headless web browser with a built-in server, to drive the scraping process.
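To illustrate this coupling, here is a minimal sketch of the driving side in Node.js. It assumes a hypothetical PhantomJS-side script listening on localhost:8080 at /extract that accepts a JSON body of the form { url, spec } and answers with the extracted JSON. The endpoint, the payload shape and the function names are made up for this sketch and are not part of DOM-EEE.

```javascript
var http = require("http");

// Builds the JSON body for the hypothetical scraping endpoint.
// The { url, spec } shape is an assumption made for this sketch.
function buildRequestBody(url, spec) {
  return JSON.stringify({ url: url, spec: spec });
}

// Posts the page URL and the extraction spec, then passes the parsed
// JSON response to the callback. Host, port and path are placeholders.
function extract(url, spec, callback) {
  var body = buildRequestBody(url, spec);
  var req = http.request(
    {
      host: "localhost",
      port: 8080,
      path: "/extract",
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "Content-Length": Buffer.byteLength(body),
      },
    },
    function (res) {
      var chunks = [];
      res.on("data", function (chunk) {
        chunks.push(chunk);
      });
      res.on("end", function () {
        var parsed;
        try {
          parsed = JSON.parse(Buffer.concat(chunks).toString("utf8"));
        } catch (err) {
          return callback(err);
        }
        callback(null, parsed);
      });
    }
  );
  req.on("error", callback);
  req.end(body);
}
```

Since both the request and the response are plain JSON, the same exchange could be written in any other language with an HTTP client; nothing on the driving side needs to know about the DOM.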

Example usage

This example uses cheerio, a jQuery implementation for Node.js:

var cheerio = require("cheerio");
var eee = require("eee");
var html = "<ul><li>item1</li><li>item2 <span>with span</span></li></ul>";
var $ = cheerio.load(html);
var result = eee(
  $.root(),
  {
    items: {
      selector: "li",
      type: "collection",
      extract: { text: { selector: ":self" } },
      filter: { exists: "span" },
    },
  },
  { env: "cheerio", cheerio: $ }
);
console.log(result);

This code will print:

{ items: [ { text: 'item2 with span' } ] }
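To spell out what the specification above asks for, here is a hand-written equivalent over a toy tree of plain objects standing in for the DOM. This is not how DOM-EEE is implemented: the node shape and the helpers (el, textOf, findAll) are invented for illustration, and only tag-name selectors are handled.

```javascript
// Toy element: a tag name plus children, where a child is either
// another element or a plain string (a text node).
function el(tag, children) {
  return { tag: tag, children: children || [] };
}

// Concatenated text of a node and all of its descendants.
function textOf(node) {
  if (typeof node === "string") return node;
  return node.children.map(textOf).join("");
}

// All descendant elements with the given tag, in document order.
function findAll(node, tag) {
  var found = [];
  node.children.forEach(function (child) {
    if (typeof child === "string") return;
    if (child.tag === tag) found.push(child);
    found = found.concat(findAll(child, tag));
  });
  return found;
}

// <ul><li>item1</li><li>item2 <span>with span</span></li></ul>
var root = el("ul", [
  el("li", ["item1"]),
  el("li", ["item2 ", el("span", ["with span"])]),
]);

var items = findAll(root, "li") // selector: "li", type: "collection"
  .filter(function (li) {
    // filter: { exists: "span" }
    return findAll(li, "span").length > 0;
  })
  .map(function (li) {
    // extract: { text: { selector: ":self" } }
    return { text: textOf(li) };
  });

console.log({ items: items });
```

The first list item is dropped because it contains no span, which is why only one entry ends up in the result.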

Alternatives

There are a number of similar projects. Most of them assume a specific runtime environment or try to do too much to be portable. Some examples:

  • artoo.js (client side).
  • noodle (Node.js, not portable enough).
  • x-ray (not portable, coupled with HTTP and pagination and 100 other things).

Documentation

Full documentation of the JSON-based expression language and further examples can be found in the project's code repository.