Arguments API

Unless noted otherwise, all of these arguments are available in both the template tag and the shortcode. Here is how they are used.

In shortcode:

[wpws url="https://www.yahoo.com/" query="ol.trendingnow_trend-list" output="text"]

In template tag:

<?php echo wpws_get_content("https://www.yahoo.com/", "ol.trendingnow_trend-list", array('output' => 'text')); ?>

For ease of reference, the arguments are grouped below as Request, Response, or Parsing arguments.

Request Arguments

This set of arguments deals with how requests are made to the source URL to fetch content.

url

Required. The complete URL which needs to be scraped. Dynamic URLs are also supported.

cache

Timeout interval of the cached data, in minutes. The right value depends on how frequently the content at your source URL is expected to change. If the content does not change often, it is recommended to keep this as high as possible to save external requests. If ignored, the default value specified in WP Web Scraper Settings will be used. It is strongly recommended to use a Persistent Cache Plugin for better caching performance.

Default: 120

useragent

The User-Agent header sent with the request. This string acts as your footprint while scraping data. If ignored, the default value specified in WP Web Scraper Settings will be used.

Default: WPWS bot (Site URL)

timeout

Request timeout in seconds. A higher value helps when scraping from slow source servers, but it also increases your page load time. Ideally this should not exceed 2. If ignored, the default value specified in WP Web Scraper Settings will be used.

Default: 2

headers

A string in query string format (like id=197&cat=5) of the POST arguments that you may want to pass on to the source URL. Note that GET arguments should be part of the URL itself; this argument should only be used for POST arguments.

urldecode

Only available in shortcode. Handy for URLs with characters (like [ or ]) that may interfere with the shortcode, since it lets you enter the URL as a urlencoded string. Values can be 1 or 0. Set to 1 to urldecode the URL before use (for URLs with special characters). Set to 0 to use the URL without modification.

Default: 1

querydecode

Only available in shortcode. Handy for queries with characters (like [ or ]) that may interfere with the shortcode, since it lets you enter the query as a urlencoded string. Strongly recommended if you are using xpath as query_type. Values can be 1 or 0. Set to 1 to urldecode the query before use (for queries with special characters). Set to 0 to use the query without modification.

Default: 0
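
As a rough sketch of how a urlencoded URL and query might be prepared for the shortcode, the snippet below uses PHP's standard urlencode() function. The URL and the XPath query are made-up values for illustration, and the exact decoding performed by the plugin is assumed to be the inverse of urlencode().

```php
<?php
// Hypothetical source URL and XPath query, both containing square
// brackets that could interfere with shortcode parsing.
$url   = 'https://example.com/list?filter[type]=news';
$query = '//div[@class="headline"]/a';

// urlencode() is a standard PHP function. The plugin is assumed to
// reverse this encoding when urldecode / querydecode are set to 1.
$encoded_url   = urlencode($url);
$encoded_query = urlencode($query);

// Build the shortcode text to paste into a post.
echo '[wpws url="' . $encoded_url . '" query="' . $encoded_query
    . '" query_type="xpath" urldecode="1" querydecode="1"]' . PHP_EOL;
```

The encoded values contain no [ , ] or double-quote characters, so they can sit safely inside the shortcode attributes.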

Response Arguments

This set of arguments deals with how the parsed response from the source URL is displayed.

output

Format of the output rendered by the selector. Values can be text or html. The text format strips all HTML tags and returns only text content. The html format retains the HTML tags in the output.

Default: html

on_error

Error handling options for the response. Values can be error_show, error_hide, or any other string. error_show displays the error; error_hide fails silently without displaying any error; any other string is printed as is. For instance, on_error="screwed!" will output "screwed!" if something goes wrong. If ignored, the default value specified in WP Web Scraper Settings will be used.

Default: error_show

debug

Display of debug information. Values can be 1 or 0. Set to 1 to output debug information as an HTML comment before the scraped output, or set to 0 to turn it off.

Default: 1

Parsing Arguments

This set of arguments provides options for parsing the content received from the source URL.

query

Query string to select the content to be scraped. The query can be of type cssselector, xpath, or regex. If query is empty, the complete response is returned without any querying. Read more about this in the Query documentation.

Default: (empty string)

query_type

Type of query. Values can be cssselector or xpath or regex. If query is blank, complete response will be returned without any querying irrespective of query_type. Read more about these query types in the Query documentation.

Default: cssselector

glue

String used to concatenate multiple results of a query. For example, if your (cssselector, xpath, or regex) query returns 5 <p> elements, this string is used to join all 5 of them.

Default: PHP_EOL
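
The effect of glue can be pictured with PHP's implode() function. This is only a sketch under the assumption that joining behaves like implode(); the $matches array is a made-up stand-in for the elements a query returns.

```php
<?php
// Hypothetical stand-in for three elements matched by a query.
$matches = array('<p>One</p>', '<p>Two</p>', '<p>Three</p>');

// The documented default glue is PHP_EOL.
$default_output = implode(PHP_EOL, $matches);

// A custom glue, as if glue="<hr />" had been passed.
$custom_output = implode('<hr />', $matches);

echo $custom_output . PHP_EOL;
// prints: <p>One</p><hr /><p>Two</p><hr /><p>Three</p>
```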

eq

Filter argument to reduce the set of matched elements to the one at the specified index. Values can be first or last or an integer to represent a 0 based index (similar to eq implementation of jQuery API).

If ignored: All elements are returned.

gt

Filter argument to select all elements at an index greater than the given index within the matched set. The value is an integer representing a 0-based index. All elements with indexes greater than this value are returned (similar to the gt implementation of the jQuery API).

If ignored: All elements are returned.

lt

Filter argument to select all elements at an index less than the given index within the matched set. The value is an integer representing a 0-based index. All elements with indexes less than this value are returned (similar to the lt implementation of the jQuery API).

If ignored: All elements are returned.
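
The eq, gt, and lt semantics described above can be sketched on a plain PHP array. The helper sketch_filter_matches() is hypothetical and not part of the plugin; it also assumes only one of the three filters is applied at a time, since how the plugin combines them is not stated here.

```php
<?php
// Hypothetical helper, not part of the plugin: mimics the documented
// eq / gt / lt semantics on a plain array of matched elements.
// Assumption: only the first filter found (eq, then gt, then lt) applies.
function sketch_filter_matches(array $matches, array $args) {
    if (isset($args['eq'])) {
        if ($args['eq'] === 'first') {
            $index = 0;
        } elseif ($args['eq'] === 'last') {
            $index = count($matches) - 1;
        } else {
            $index = (int) $args['eq'];
        }
        // Reduce the set to the single element at $index, if it exists.
        return isset($matches[$index]) ? array($matches[$index]) : array();
    } elseif (isset($args['gt'])) {
        // Keep elements whose index is strictly greater than gt.
        return array_slice($matches, (int) $args['gt'] + 1);
    } elseif (isset($args['lt'])) {
        // Keep elements whose index is strictly less than lt.
        return array_slice($matches, 0, (int) $args['lt']);
    }
    // No filter given: all elements are returned.
    return $matches;
}

$items = array('a', 'b', 'c', 'd', 'e');
echo implode(', ', sketch_filter_matches($items, array('eq' => 'last'))) . PHP_EOL; // e
echo implode(', ', sketch_filter_matches($items, array('gt' => 1))) . PHP_EOL;      // c, d, e
echo implode(', ', sketch_filter_matches($items, array('lt' => 2))) . PHP_EOL;      // a, b
```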

remove_query

Similar to query, except that this query is used to remove matched content from the output. Read more about this in the Query documentation.

If ignored: No content is removed.

remove_query_type

Type of query. Values can be cssselector or xpath or regex. If remove_query is blank, complete response will be returned without removing anything. Read more about this in the Query documentation.

Default: cssselector

replace_query

Similar to query, except that this query is used to replace matched content with the string specified in the replace_with argument. Read more about this in the Query documentation.

If replace_query_type is regex, this parameter can also be a serialized urlencoded array created like this:

urlencode(serialize($array))

That way, you can pass an array argument to the underlying preg_replace function.

If ignored: No content is replaced.

replace_query_type

Type of query. Values can be cssselector or xpath or regex. If replace_query is blank, complete response will be returned without replacing anything. Read more about this in the Query documentation.

Default: cssselector

replace_with

String to replace content matched by replace_query.

If replace_query_type is regex, this parameter can also be a serialized urlencoded array created like this:

urlencode(serialize($array))

That way, you can pass an array argument to the underlying preg_replace function.

If ignored: Content matched by replace_query will be replaced by an empty string (that is, removed).
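
A minimal sketch of the serialized-array approach for regex replacement follows. The patterns, replacements, and sample text are made up for illustration, and the decoding step is an assumption about what happens on the receiving side; only urlencode(), serialize(), and preg_replace() with array arguments come from the description above.

```php
<?php
// Hypothetical patterns and replacements for illustration.
$patterns     = array('/\s+/', '/foo/i');
$replacements = array(' ', 'bar');

// Values as they would be passed to replace_query and replace_with.
$replace_query = urlencode(serialize($patterns));
$replace_with  = urlencode(serialize($replacements));

// Assumed decoding on the receiving side.
$decoded_patterns     = unserialize(urldecode($replace_query));
$decoded_replacements = unserialize(urldecode($replace_with));

// preg_replace() applies each pattern in order with its replacement.
$result = preg_replace($decoded_patterns, $decoded_replacements, "Foo   and\nfoo");
echo $result . PHP_EOL; // bar and bar
```

The two arrays are paired by position, so the first pattern is replaced with the first replacement, and so on.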

basehref

Converts relative links in the scraped content to absolute links. This can be handy to keep relative links functional. Values can be 1, 0, or a specific URL to use as the base for conversion. Setting basehref to 1 uses the source URL itself as the base; setting it to 0 skips conversion; setting it to a specific URL uses that URL as the base. Note that a basehref URL needs to be complete (with http, hostname, path, etc.).

If ignored: basehref conversion will be skipped

a_target

Sets the specified target attribute on all links (a href). This can be handy to make sure external links open in a separate window. Values can be _blank, _self, _parent, _top, or your custom frame name. Note, however, that there is no validation: the argument value you provide is used as is.

If ignored: a target modification will be skipped

callback_raw

Callback function which will parse the scraped content in its most raw form. This callback function (if specified) is called before any of the above parsing arguments are applied. Handy to do some advanced parsing. Read more about this in the Callback documentation.

callback

Callback function which will parse the scraped content in its processed form. This callback function (if specified) is called after all of the above parsing arguments are applied. Handy to do some advanced parsing. Read more about this in the Callback documentation.
