For one of my web-based projects I needed to acquire some data. In the modern day and age, data is often a company's most valuable asset... and it got me thinking.

The Royal Mail here in the UK offer a Postcode Address File (PAF) to any entity that wishes to have it. The cost of this PAF file runs to many thousands of pounds, depending on exactly what you want and how you intend to use it.

Recently I was building a location-based product. Using the PAF database was in no way financially viable. Instead I did a lot of research into alternative location groupings for the United Kingdom and considered how I would acquire the data I needed to work with said groupings. I opted to use city, town, and village names grouped by their local government area, and I managed to acquire this data for a relatively insignificant sum.

My next requirement was to acquire data on the establishments that my website lists. My first approach was simply to input the data manually - this gave me my first appreciation for the value of data. Had I continued to input this data manually, my cost would have amounted to the value of my time. Using the British minimum wage and extremely conservative time estimates, a complete dataset would have cost me five figures.

Being an engineer, I quickly began investigating more intelligent ways of acquiring the required data. I ended up using Facebook's API and data scraping to begin the process of acquiring the data that I would need.

Facebook

Facebook seem to be pretty open with their data and their API. Whilst to some extent you can put a cost on data acquisition, Facebook are in a position whereby sharing data is not going to have much of a negative effect on them. It is actually more likely that Facebook will gain from my usage of a tiny subset of their data, given the number of Facebook integrations across the product.

I pulled data from Facebook, but in the interest of producing a quality product I built functionality to allow me to manually process this data. Facebook allowed me to spend less time collecting data and thus reduced my cost.
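
To give a rough idea of what that involved - the page ID, field list, and token below are placeholders rather than the ones I actually used - pulling a page's basic details from the Graph API with cURL looks something like this:

    <?php
    // Hedged sketch: fetch basic details for a Facebook page via the Graph API.
    // The page ID, field list, and access token are placeholders.
    $pageId      = 'some-page-id';
    $accessToken = 'YOUR_ACCESS_TOKEN';
    $fields      = 'name,location,website,phone';

    $url = 'https://graph.facebook.com/' . urlencode($pageId)
         . '?fields=' . urlencode($fields)
         . '&access_token=' . urlencode($accessToken);

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body rather than printing it
    $response = curl_exec($ch);
    curl_close($ch);

    // The Graph API responds with JSON, so json_decode gives a plain array to work with.
    $page = json_decode($response, true);
    if (isset($page['name'])) {
        echo $page['name'] . PHP_EOL;
    }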

Scraping

Having gradually integrated the data I had obtained from Facebook over a long period, I had the shocking realisation that the data was surprisingly incomplete. In this particular niche, many companies run a large number of establishments, so I decided to scrape further data from the websites of these companies to try to fill in the blanks.

What I noticed from this process is that in this particular niche, companies do not really value their data... and that is somewhat understandable. I want their data so that I can essentially market them. On reflection, I am surprised they did not make their data more easily accessible.

I found:

  • a few companies whose data was consistently formatted and easy to extract

  • a few companies who had spread the required data across multiple pages - annoying but manageable

  • a few companies using clean JSON backends

  • one company that really needs to hire a new web developer

My approach

Given that this was a one-time thing, I was not interested in writing perfect, clean, well-tested code. I built a basic scraper using PHP's cURL functions, opened up the various pages, and pulled out the data I needed from the source code.
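
The fetching side was little more than a thin wrapper around curl_exec, roughly along these lines (the URL is a stand-in):

    <?php
    // Minimal fetch helper: grab the raw HTML of a page with cURL.
    function fetchPage($url)
    {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the HTML rather than echoing it
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow simple redirects
        $html = curl_exec($ch);
        curl_close($ch);
        return $html;
    }

    $html = fetchPage('http://www.example.com/our-locations'); // stand-in URL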

Extracting the data essentially amounted to one of two things, sketched after the list:

  • analysing the source code and working with PHP's DOMDocument

  • calling json_decode.
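
To give a feel for both, here is roughly what the extraction looked like. The class name and JSON shape are invented for the example, since every site was different, and $html and $jsonBody stand for responses fetched with cURL as above:

    <?php
    // Sketch of the two extraction styles. The 'establishment' class and the
    // JSON structure are invented for illustration - each site needed its own variant.
    $names = array();

    // 1. HTML sites: load the source into DOMDocument and query it with XPath.
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // real-world HTML is rarely valid, so silence the warnings
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query("//div[contains(@class, 'establishment')]") as $node) {
        $names[] = trim($node->textContent);
    }

    // 2. Sites with a JSON backend: just decode the response body.
    $data = json_decode($jsonBody, true);
    foreach ($data['establishments'] as $establishment) {
        $names[] = $establishment['name'];
    }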

Now what

I have now collected enough data to make my product viable. It is useful, and the public seem to agree. My hope is that users of the product will now contribute to the continued growth and general improvement of the dataset.

Given that I have spent a large amount of time and effort collecting and moderating the dataset, I have to some extent come full circle. This dataset now adds significant value to my product. I have done a lot of the heavy lifting, so surely I now want to stop others from scraping my data? Nope.

You cannot stop scraping

The long and the short of it is that if you want to provide your data to a good person (a site visitor or a search engine), then you have to provide it to the bad people too. The web is inherently open and free - you can either share your data or take your website down.

I was able to scrape all the data that I wanted using cURL. It was somewhat tedious, but it was not hard. cURL is probably the simplest tool in a scraper's toolkit.

You can try to obfuscate your source code, hide your data behind complex authorization procedures, and so on, but this will only hurt you.

Google is a scraper... as is Bing. If you want your website to be search engine friendly then it is also going to be bad-person friendly.

Things like captchas and required logins ruin the user experience.

Headers can be faked, IPs bought, and so on.
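
Presenting a scraper as an ordinary browser, for instance, is a one-line change in cURL (the user agent string here is just an example):

    <?php
    // Send a browser-like User-Agent so the request looks like an ordinary visitor.
    $ch = curl_init('http://www.example.com/'); // stand-in URL
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36');
    $html = curl_exec($ch);
    curl_close($ch);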

Even considering the above, there are headless browsers like PhantomJS. Phantom Mochachino, a tool which I built for testing, demonstrates how powerful headless browsers are (#shamelessplug). A scraper can use a headless browser in exactly the same manner that a normal user uses their browser.

You can make things harder, but anyone who is committed enough can, and will, get your data.

It is for that reason that I opted not to attempt to prevent scraping. Rather, I made my API openly accessible to all.

Think about it

Wikipedia is a really large database, yet you don't see many Wikipedia clones ranking highly in Google. Even if they did, you would more than likely click on the Wikipedia link as opposed to the http://rubbish-wikipedia-clone.com link.

Likewise with Stack Overflow - in this case there are loads of clones, but again I don't think I have ever used one, and I highly doubt that they have any effect on Stack Overflow's visitor numbers.

Packaging

All things considered, whilst data is extremely valuable, the packaging of it is in many respects more important.

In my case, my product is premised on helping companies within the niche improve their visitor numbers whilst providing the general public with a useful and informative resource. I want to incentivize mutual cooperation and encourage users to contribute to the site. I believe that more people will contribute if they know what we are doing with their data - we are making it freely accessible to anyone who wants to use it and help the industry grow.

So, to answer the question... for the particular dataset that I am working with, both the data and its packaging have value. At least I hope they do.