Query

This article details the use of queries, which are the heart of WP Web Scraper. For parsing HTML, the plugin supports three types of queries: CSS Selectors, XPath and Regular Expression. WP Web Scraper uses selectors not only to query data from the source URL, but also to remove or replace content.

For scraping that deals with DOM documents (XML, HTML, etc.), CSS Selectors and XPath cover virtually all use cases. Regular Expression is provided as a query option for extreme edge cases or non-DOM content.

CSS Selectors

CSS selectors are patterns used to select the element(s) you want to style. CSS selectors are less powerful than XPath, but far easier to write, read and understand.

Many developers, particularly web developers, are more comfortable using CSS selectors to find elements. As well as working in stylesheets, CSS selectors are used in JavaScript with the querySelectorAll function and in popular JavaScript libraries such as jQuery, Prototype and MooTools.

The CSS Selector Reference on w3schools is recommended to get you started. You may also want to try the CSS Selector Tester from w3schools.

Internally, WP Web Scraper converts the CSS Selector into an XPath expression using Symfony’s CssSelector Component.
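To give a feel for what a CSS-to-XPath conversion involves, here is a minimal Python sketch. It is not Symfony's implementation and not the plugin's code: the function name simple_css_to_xpath is hypothetical, it handles only the forms tag, #id, .class, tag#id and tag.class, and it matches the class attribute exactly, whereas a full converter also has to cope with elements that carry several classes.

```python
import re
import xml.etree.ElementTree as ET


def simple_css_to_xpath(selector: str) -> str:
    """Translate a very small subset of CSS selectors to XPath (toy example)."""
    # Optional tag name, optionally followed by one "#id" or ".class" part.
    match = re.fullmatch(r"([a-zA-Z][\w-]*)?(?:([#.])([\w-]+))?", selector)
    if not selector or match is None:
        raise ValueError(f"unsupported selector: {selector!r}")
    tag, kind, name = match.groups()
    xpath = ".//" + (tag or "*")
    if kind == "#":
        xpath += f"[@id='{name}']"
    elif kind == ".":
        # Simplification: exact match on the whole class attribute.
        xpath += f"[@class='{name}']"
    return xpath


# Hypothetical markup standing in for a scraped page.
root = ET.fromstring(
    "<html><body><div id='prices'>"
    "<span class='price'>10.50</span><span class='price'>7.25</span>"
    "</div></body></html>"
)

xpath = simple_css_to_xpath("span.price")
texts = [el.text for el in root.findall(xpath)]
print(xpath)  # .//span[@class='price']
print(texts)  # ['10.50', '7.25']
```

The point of the sketch is only that a selector such as span.price and an XPath expression can describe the same set of nodes, which is why the plugin can accept either form.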

XPath

The XPath language is based on a tree representation of the XML document, and provides the ability to navigate around the tree, selecting nodes by a variety of criteria. In popular use (though not in the official specification), an XPath expression is often referred to simply as “an XPath”.

When you’re parsing an HTML or an XML document, by far the most powerful method is XPath. XPath expressions are incredibly flexible, so there is almost always an XPath expression that will find the element you need.

XPath Syntax and XPath Examples on w3schools are good starting points.

Internally, WP Web Scraper relies on PHP DOM and uses DOMXPath::query to evaluate XPath expressions.
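As a rough illustration of what XPath queries look like in practice, the sketch below uses Python's standard xml.etree.ElementTree module, which supports only a limited subset of XPath and is used here purely as a stand-in for PHP's DOMXPath::query. The markup, ids and class names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed markup standing in for a scraped page.
MARKUP = """
<html>
  <body>
    <div id="prices">
      <span class="price">10.50</span>
      <span class="price">7.25</span>
    </div>
    <div id="footer">
      <span class="note">Prices in USD</span>
    </div>
  </body>
</html>
""".strip()

root = ET.fromstring(MARKUP)

# Select every <span> whose class attribute is "price", anywhere in the tree.
prices = [el.text for el in root.findall(".//span[@class='price']")]

# Narrow by parent: only <span> children of the <div> whose id is "footer".
notes = [el.text for el in root.findall(".//div[@id='footer']/span")]

print(prices)  # ['10.50', '7.25']
print(notes)   # ['Prices in USD']
```

Both expressions combine a path through the tree with an attribute predicate in square brackets, which is the pattern most scraping queries follow.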

Regular Expression

A regular expression (abbreviated regex or regexp, and sometimes called a rational expression) is a sequence of characters that forms a search pattern, mainly for use in pattern matching with strings, or string matching, i.e. “find and replace”-like operations. Each character in a regular expression is either understood to be a metacharacter with its special meaning, or a regular character with its literal meaning.

This Introduction to Regex is a great place to start with Regex. You can also use tryphpregex.com to check your Regular Expressions.

Internally, WP Web Scraper relies on Regular Expressions (Perl-Compatible) and uses preg_match_all to perform a global regular expression match.
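For non-DOM content, a global regular expression match might look like the following sketch. It uses Python's re.findall as an analogue of preg_match_all. The sample text and pattern are hypothetical, and Python's regex flavour is close to, but not identical with, PCRE.

```python
import re

# Hypothetical non-DOM content, e.g. an inline JavaScript snippet.
TEXT = 'var items = [{sku: "A-100", qty: 3}, {sku: "B-220", qty: 12}];'

# One capture group for the SKU and one for the quantity.
PATTERN = re.compile(r'sku: "([A-Z]-\d+)", qty: (\d+)')

# findall returns every non-overlapping match; with two capture groups,
# each match is reported as a (sku, qty) tuple of strings.
matches = PATTERN.findall(TEXT)

print(matches)  # [('A-100', '3'), ('B-220', '12')]
```

Note that preg_match_all groups its results by capture group by default (PREG_PATTERN_ORDER), whereas findall returns one tuple per match, so the same data comes back in a different shape.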

Need Help with Queries?

Crafting the right, optimized query can be tricky at times. Try the paid support for help crafting a well-optimized web scrape.
