How to use
CIS | Open Scraper v.1.3


OK, you need some structured data, but you are just discovering the marvelous world of scraping.
So let's see, step by step, how to do it with Open Scraper...



Summary

  • I/ Get prepared to scrape
  • II/ Define a datamodel
  • III/ Configure a scraper
  • IV/ Request your data



I/ Get prepared to scrape


If you are using a scraper for the very first time, we invite you to check our presentation: "An open source tool to scrap'em all. An introduction to scraping and to Open Scraper".

This presentation aims to summarize the main purposes of scraping:

  • What are data and a datamodel?
  • Why can scraping data from the internet be useful, and for whom?
  • What are the main problems when it comes to scraping several websites?
  • How does Open Scraper try to offer a solution?

Other notions to get used to are those related to XPath, the main language used by Open Scraper to select items while scraping.

To learn more about XPath, check this section.
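If you want to practice XPath outside Open Scraper first, here is a minimal, standalone sketch using the Python parsel library (the selector engine used by Scrapy). It is only a playground for XPath syntax, not Open Scraper's internal code:

    # XPath playground using the parsel library (pip install parsel).
    # This is only for practicing XPath, not Open Scraper's internal code.
    from parsel import Selector

    html = """
    <ul class="results">
      <li class="item"><a href="/org/1">Org One</a></li>
      <li class="item"><a href="/org/2">Org Two</a></li>
    </ul>
    """

    sel = Selector(text=html)

    # Select each item's block in the list (compare with item_xpath below)
    for item in sel.xpath('//li[@class="item"]'):
        name = item.xpath('.//a/text()').get()   # the text of the link
        link = item.xpath('.//a/@href').get()    # the link itself
        print(name, link)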



II/ Define a datamodel


The first thing to do before any scraping is to define the structure of the dataset you will work on.

By default, everybody can see the structure / datamodel, but not edit it: clicking on "List of the fields you want to extract" leads you to the "data structure overview".

If you are an admin or a member of the staff, you can click on "Edit your data model" from the fields list to make changes to your datamodel.


You can click on "add a new field". Every field is composed of (see the sketch after this list):

  • the name of the field;
  • the type of the field: url, email, tag, text, price, date, ...;
  • an option to see or delete the field;
  • the degree of openness of the field: private, collective, commons, open data.
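To picture what a field holds, here is a minimal sketch written as a Python dict; the key names below are illustrative only, not Open Scraper's actual internal schema:

    # Illustrative sketch only: these key names are NOT Open Scraper's
    # actual internal schema, just a way to picture what defines a field.
    field = {
        "name": "organization_website",   # the name of the field
        "type": "url",                    # url, email, tag, text, price, date, ...
        "openness": "open data",          # private, collective, commons, open data
    }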


By clicking on "Edit your data model" you can also edit any of the existing fields.


At the end of this process, you have to save your fields.




III/ Configure a scraper


Once you have defined your datamodel, you can configure your scrapers.

You can either access the list of existing scrapers by clicking on "List of all contributors", or directly add a new scraper by clicking on "Add a new contributor".


On the "List of all contributors" view you will have a table listing all spiders already configured in Open Scraper, but you will be able to modify only those you'd created yourself (except if you are an admin, then you'll be able to modify every spider)


There are two main sections to fill in to configure a new scraper:

the "global fields"

Define here the website you want to scrape: the name of the website, its root URL, whether you need to go from a list to a detailed page, whether the website is reactive or an API, ...

the "custom fields"

Define where you will find the value corresponding to each field of your datamodel, by adding the XPath of the targeted values in the form (see more about XPath here).



Open Scraper uses the XPath language to select blocks and values in an HTML page.



The "Global fields" describe the website you want to scrap from :

  • name : the name you want to give to your spider
  • licence : the licence the content of the website you will scrap is under
  • page_url : the shortest url route of the website (the domain name)
  • logo_url : an url ending by .jpg, .png, .svg to the logo of the website if any (use "copy the image link" on your browser and paste)
  • start_urls : the page(s) containing the list you want to scrap
  • item_xpath : the xpath selecting each item's block in the list (must give in a list of results)
  • next_page : the xpath selecting the button leading to the next page of results
  • parse_follow : describe here if all the data you need to scrap is present in the list or if you need to go to more detailed page
  • follow_xpath : the xpath selecting the link leading to the detailed page (must end by an @href)
  • parse_reactive : describe here if the url route is changing when you go to another page (not reactive) or if the url stays the same (reactive)
  • parse_api : mention here if the url you are scraping returns a normal page (no API) or a JSON (API Rest)
  • api_pagination_root : the url route used as a base for going to the next page, if the website has an API or/and if different than the page_url
  • follow_api : the url route used as a base for going to the detailed page based on an item_id, if the website has an API or/and if different than the page_url
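To make this concrete, here is what the global fields could look like for a fictitious directory website, written as a Python dict for readability. In Open Scraper you enter these values through the web form, and every value below is made up:

    # Made-up example values for the global fields of a fictitious website;
    # in Open Scraper you fill these in through the "Add a new contributor" form.
    global_fields = {
        "name": "example-directory",
        "licence": "CC-BY-SA",
        "page_url": "https://www.example.org",
        "logo_url": "https://www.example.org/logo.png",
        "start_urls": ["https://www.example.org/directory?page=1"],
        "item_xpath": '//li[@class="item"]',
        "next_page": '//a[@class="next-page"]/@href',
        "parse_follow": True,                      # the data lives on detailed pages
        "follow_xpath": './/a[@class="detail"]/@href',
        "parse_reactive": False,                   # the URL changes from page to page
        "parse_api": False,                        # normal HTML pages, no JSON API
        "api_pagination_root": "",                 # only needed for API websites
        "follow_api": "",                          # only needed for API websites
    }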



The "Custom fields" describe where you will find the values for each field of your data model and store :

  • < the name of your custom field > : the xpath selecting the value corresponding to the field. must end by either : text() if the value you need is some simple text, @href if the value you need is a link hidden in an "a" tag, @src or similar if the value you need is an image,
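For instance, assuming a datamodel with three hypothetical fields called name, website and logo, the custom fields could be filled with XPaths like these:

    # Hypothetical datamodel field names; note how each XPath ends with
    # text(), @href or @src depending on the kind of value to extract.
    custom_fields = {
        "name": './/h2/text()',     # simple text
        "website": './/a/@href',    # a link hidden in an "a" tag
        "logo": './/img/@src',      # an image
    }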




IV/ Request your data


You can preview the results either by going to "Overview of the complete dataset", or by clicking on the "view XX item" button for each spider in the "List of all contributors / websites" view.


You have two main options when it comes to getting back the data scraped with Open Scraper:

Get the data as a .CSV file
(from the "List of all contributors" view )

Use the API
(see also the documentation)
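As a rough sketch, and assuming your instance exposes a REST endpoint returning JSON (the base URL, route and response key below are placeholders; check the API documentation of your instance for the real ones), requesting the data from Python could look like:

    # Rough sketch: the base URL, the /api/items route and the "items" key
    # are placeholders; check your instance's API documentation.
    import requests

    BASE_URL = "https://your-openscraper-instance.org"

    response = requests.get(f"{BASE_URL}/api/items", params={"page": 1})
    response.raise_for_status()

    data = response.json()
    for item in data.get("items", []):
        print(item)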