Last weekend I attended the Open Data Day event in London to find out more about how open data has progressed in the last couple of years, and to get a feel for what these sorts of hack day events are like.
It was a really useful event for hearing about the various issues around open data, including missing data and discontinued releases by government bodies. James (@ ) gave a talk about open data, covering how departments have opened up their data and the use of open data licences, as well as poor examples of open data, such as unhelpful releases in the form of PDFs. A really important way to get better access to open data is to make Freedom of Information requests and to give feedback on existing data, relaying how it is used and what improvements could be made.
James also ran a short exercise on spotting bad open data, giving us an example of a data table with various errors to identify:
After that we did data expeditions, looking at various problems and trying to come up with a process to analyse and possibly resolve them.
The problem I looked at was the familiar one of delays in public transport over the weekend. This led me to investigate the various travel data sources, starting with London’s transport network, Transport for London. TfL has opened up access to data feeds for a range of transport modes: the Underground and extended train network, London roads, bicycle schemes and even parking spaces. The last time I looked at this API was for my group project on my MRes course two years ago, where we fed the bus timetable API into our augmented reality project UCLive.
Since then, the TfL APIs have significantly improved on the documentation side, and making API requests is much simpler, with no need for authorisation either. The site now has an option to “Generate URL” links based on filter criteria, which return XML/JSON data for you to parse.
To parse the data I used Omniscope to load it up as an XML/JSON text file. Then, by configuring various options, it can map out the structure of the data by assigning the various fields within it. The process is a little abstract and mostly done via trial and error to see what works.
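Outside Omniscope, the same flattening step can be sketched in a few lines of Python. This is a minimal sketch only: the payload shape and field names below are illustrative assumptions, not TfL’s actual schema.

```python
import json

# A hypothetical JSON payload, shaped like a typical transport status feed.
# The field names here are illustrative, not TfL's actual schema.
payload = """
{
  "lines": [
    {"name": "Central", "status": "Good Service", "delayMinutes": 0},
    {"name": "Jubilee", "status": "Minor Delays", "delayMinutes": 7}
  ]
}
"""

def flatten(feed_text):
    """Map the nested feed into flat rows, one per line, ready for tabular tools."""
    feed = json.loads(feed_text)
    return [
        (line["name"], line["status"], line["delayMinutes"])
        for line in feed["lines"]
    ]

rows = flatten(payload)
print(rows)
```

This is essentially what the Omniscope field-mapping step does interactively: pick out which nested elements become columns and which repeated elements become rows.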
In addition I looked for other public APIs – the UK’s data.gov.uk also provides APIs for travel and health. The health API was interesting: it is currently only in beta, but it is a live feed of locations for hospitals, GPs, pharmacies and other social care facilities.
There is also a dedicated API server for travel:
A great data set from the Higher Education Statistics Agency on available courses in the UK.
The data gives a fairly comprehensive list of all the available courses from higher education institutions such as universities, along with various attached data sets which link to the individual courses. Some examples of related data include National Student Survey results, the degree classifications obtained on the courses, and the salaries achieved by graduates.
I crunched the data and used a number of the available resources to produce this Omniscope visualisation – http://my.visokio.com/uploads/University/
I’ve included a handful of screenshots below:
There are interesting bits of information, such as the Royal Academy of Music producing many first class degrees – perhaps a sign of excellence?
The average composition of degree awards and list of individual institution achievements.
Some survey results of median salaries by course graduates, where over 50% of Furness College’s students made £32k+ salaries!
There are many more possible insights to be drawn from the data sets, including entrants’ qualifications (A levels, etc.), the number of students continuing per year and individual course tuition costs – see the possible variables here:
A related data set that would be nice to add is where students come from to go to university in the UK. It would be interesting to look at where international students come from; likewise, recent news regarding tuition fees suggests more UK students are going abroad to study, with countries like the Netherlands offering cheaper degrees than the UK.
Omniscope was particularly useful for merging and analysing the various individual data sets in the Unistats repository. The image shows the Data Manager in the background joining the data up nicely (basically using Omniscope instead of loading the data into a relational database!):
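The same joining step could be sketched in Python with pandas instead of a relational database. The table and column names below are made-up stand-ins for illustration, not the real Unistats schema:

```python
import pandas as pd

# Toy stand-ins for two Unistats-style tables; the column names are
# illustrative assumptions, not the actual Unistats field names.
courses = pd.DataFrame({
    "course_id": ["C1", "C2", "C3"],
    "institution": ["Uni A", "Uni A", "Uni B"],
    "title": ["Physics", "History", "Music"],
})
salaries = pd.DataFrame({
    "course_id": ["C1", "C2"],
    "median_salary": [28000, 22000],
})

# A left join keeps every course, even those with no salary record.
merged = courses.merge(salaries, on="course_id", how="left")
print(merged)
```

The left join matters here: related tables like the salary survey do not cover every course, and an inner join would silently drop the uncovered ones.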
I used nine of the available tables for the current data visualisation, but there are over ten more left!
This is partly a demo of how Omniscope can be used to look at Twitter data, and it is also of interest for how we develop and plan our cities. I spotted some interesting tweets under #futurecities, an event with speakers on the subject of making use of data and technology to better understand how cities work (Event Info). Unfortunately, just following tweets of an event is no replacement for actually attending, so you miss out on the real content. Still, reviewing others’ tweets brings some little nuggets of knowledge.
Using the current Twitter API, it is difficult to mine tweets in bulk due to the volume of tweets across the world. You need some serious hardware and automated algorithms to extract the data at set intervals throughout the day across multiple access accounts (though this is against the terms and conditions!). Alternatively you can use a licensed data supplier to do it instead.
The restrictions on the API create 15-minute windows in which you can only extract a certain amount of data. That is fine for following a particular trending topic or filtering for specific types of tweets, e.g. location-enabled tweets. Using Omniscope’s data API connector, I made calls to extract data every 5–10 minutes for the #futurecities tag.
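When polling every few minutes, overlapping batches will return some of the same tweets twice, so the collector needs to deduplicate by tweet ID. A minimal sketch of that merging step, with simplified tweet dictionaries standing in for the real Twitter API objects:

```python
def collect(batches):
    """Merge repeated polls into one deduplicated list, keyed by tweet id."""
    seen = {}
    for batch in batches:
        for tweet in batch:
            seen[tweet["id"]] = tweet  # later polls simply overwrite duplicates
    return list(seen.values())

# Two simulated 5-10 minute polls with one overlapping tweet between them.
poll_1 = [{"id": 1, "text": "#futurecities opening talk"},
          {"id": 2, "text": "RT @speaker: data for cities"}]
poll_2 = [{"id": 2, "text": "RT @speaker: data for cities"},
          {"id": 3, "text": "closing panel #futurecities"}]

tweets = collect([poll_1, poll_2])
print(len(tweets))  # 3 unique tweets
```

In practice the API’s `since_id`-style parameters reduce the overlap, but deduplicating on the collector side is a cheap safety net either way.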
I made a fairly basic Omniscope visualisation of the tweets and some screenshots below:
I extracted 179 tweets over the period 10am–5pm, covering the day of the event. Below is a graph showing the tweets over time, split into original tweets by attendees and retweets by others.
Taking this data, the most interesting aspect is to look at how tweets are passed around. I used the main tweet text to extract out @user_names and, taking the original poster, mapped the interactions in a network view diagram. Below is a visualisation with the two types of original tweets in blue, and retweets of those originals in red. The arrows point from the original poster towards the users mentioned in the tweet (in most cases a recipient who may be interested in it).
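The @user_name extraction can be sketched with a regular expression, turning each tweet into directed edges from the poster to every user mentioned. The author/text structure here is a simplified assumption about the tweet data:

```python
import re

# Twitter handles are word characters following an @ sign.
MENTION = re.compile(r"@(\w+)")

def edges(author, text):
    """Return directed (author -> mentioned user) pairs from one tweet."""
    return [(author, mentioned) for mentioned in MENTION.findall(text)]

print(edges("alice", "Great point @bob, also cc @carol #futurecities"))
```

Collecting these pairs across all tweets gives exactly the edge list a network view needs: source node, target node, one row per arrow.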
It’s always interesting to see how topics are passed around: we often see networks and clustering in tweets where the activity revolves around a dominant or popular user, with interactions spreading out from there. There also looks to be a self-reinforcing network where individuals repeat interactions with each other, whether directly or indirectly via retweets.
I took on this exercise to find some interesting data to work with and settled on the NHS prescriptions data, remembering that a number of my course mates from UCL looked at this for their dissertations and coursework. Initially I thought I would deal with as much as I could, but found that 10 million+ rows of data is pretty hefty to crunch with only 8GB of RAM, so I scaled it down a little by picking a list of ten common drugs to start off with. The original data set, here on the Health and Social Care Information Centre site, provides data for ALL NHS prescriptions. The data goes back to 2010, so there is quite a lot to look back over and even run time series analysis on; for now I’ll just look at October 2015, the most recent data.
Here is the first draft visualised in Omniscope for data from October 2015 (screenshots below).
So what can we tell from the data?
Firstly I created a couple of cost fields: a cost per quantity (CPQ) to get a rate for each individual practice. I then standardised that rate across all types of drugs by dividing the CPQ by the average cost for each individual BNF name. This generates a cost index where a value of 1 or below means the practice paid the average price or less, and a value greater than 1 means the practice spent more than the average. From there we can rank the practices which have relatively high costs compared to the national averages.
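The cost index calculation above can be sketched in pandas. The column names are assumptions standing in for the fields in the prescriptions file, and the numbers are toy values:

```python
import pandas as pd

# Toy rows standing in for the prescriptions data; column names are
# illustrative assumptions, not the exact HSCIC field names.
df = pd.DataFrame({
    "practice":    ["P1", "P2", "P1", "P2"],
    "bnf_name":    ["DrugA", "DrugA", "DrugB", "DrugB"],
    "actual_cost": [100.0, 300.0, 50.0, 50.0],
    "quantity":    [10, 10, 5, 5],
})

# Cost per quantity: the rate each practice actually paid per unit.
df["cpq"] = df["actual_cost"] / df["quantity"]

# Standardise: divide by the average CPQ for each drug (BNF name),
# so an index above 1 means the practice paid more than average.
df["cost_index"] = df["cpq"] / df.groupby("bnf_name")["cpq"].transform("mean")
print(df[["practice", "bnf_name", "cost_index"]])
```

Here P2 pays three times P1’s rate for DrugA, which comes out as indices of 1.5 versus 0.5, while both pay exactly the average for DrugB (index 1.0), so the ranking flags P2’s DrugA spend.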
As well as listing out all the individual practices and ranking their spend for each drug type against the national average, below is a graph of cost against quantity: those in red show where the price ratio is higher than average, and those in green are below it. The idea here is to drill down further into individual medicines for each practice, to help see where spend at the practice level can be optimised.
The whole purpose is to identify which trusts are getting the best value when purchasing their stocks and which are overspending compared to the average. This could help to answer questions like why certain trusts have higher costs. Above, Ealing Park Health Centre appears to be paying nearly 7.5 times more for its prescriptions (October 2015)!
Data errors found so far!
A couple of the practices have incorrect SHA codes. I will look into this and fix their address file; basically I will need to load it up and perhaps download the OS Code-Point Open data set to rematch the postcodes (but that’s overkill).
The NIC field does look a bit dubious to me: there were instances where the actual cost was lower than the production cost for the drug. I decided against trying to measure cost effectiveness using NIC, as there is an inherent issue when anything can apparently be bought for below cost, and there was a substantial number of such cases.
This is a data set I want to look into again in the next week, and I’ll look to slice the data by the various primary care trusts, SHAs and the new CCG areas. This is a spatial analysis blog, so I’ll add some nice thematic maps once I get to grips with the data some more and figure out how to properly segment it in MySQL.
If I can find a comprehensive list of drug uses by BNF code/name, or some way to match up the use of each drug quickly, I can segment the data into a drug catalogue grouping all medicines together (painkillers, cancer therapy drugs, etc.) for a more targeted look into groups of drugs. This would facilitate analysis similar to the statins news, when it was found that doctors were still prescribing the more expensive non-generic types (though since then more has been reported on the dangers of statins!).
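One quick way to build such a catalogue is to group on the BNF code prefix, since the leading characters of a BNF code identify the chapter. A sketch under the assumption that the first two characters give the chapter; the chapter labels in the lookup table are simplified, illustrative entries:

```python
# Illustrative chapter labels keyed by the first two characters of a BNF code.
# Assumption: the leading two digits identify the BNF chapter; the labels
# here are a simplified subset, not the full BNF classification.
CHAPTERS = {
    "02": "Cardiovascular",
    "04": "Central Nervous System",
}

def chapter(bnf_code):
    """Map a BNF code to a coarse drug group via its two-character prefix."""
    return CHAPTERS.get(bnf_code[:2], "Other")

codes = ["0212000B0", "0407010H0", "1304000X0"]
print([chapter(c) for c in codes])
```

With a fuller lookup table, this prefix mapping would let the prescriptions rows be rolled up per drug group rather than per individual medicine.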
Another aspect to look into is finding something to house the data in: at 10 million rows for a single month, this data set will need something stronger to query from. I processed the month with 8GB, but it took hours. Ways around this would be to use something like a Hadoop cluster of PCs to share out the workload, or to upload the data to a cloud database like Google BigQuery to do all the heavy lifting. I have access to neither, so I might have to bear the slow grind for now.
An update on Omniscope: my last review was a couple of years back for 2.8, and since then a number of changes have been made – http://www.visokio.com/download.
A recap though first:
Omniscope is an ETL, data visualisation and publishing tool all in one. Other products require additional tools: with Tableau, for example, you need other software like Alteryx, or some programming and scripting in Python/R, to manipulate and transform the data and so cover the ETL aspect. Omniscope is fairly simple to pick up and has an intuitive GUI for those not so keen on programming and scripting to create data flow processes (much like ArcGIS’s ModelBuilder). You publish the final reports in the .iok format, which anyone with the free edition of Omniscope installed can view.
Omniscope requires a yearly subscription, which varies with the number of keys bought for an organisation. There is also a server version which includes a scheduler tool, allowing users to automate reports created in Omniscope as well as data scraping and processing tasks, e.g. downloading data from a website or database and automating data outputs into CSVs/spreadsheets, distributed via email or uploaded to FTP dumps. The server option is £10k+, though, depending on how many server keys you want.
Bundles of regular desktop keys start from £2k each, and the per-key price drops the more you buy as a bundle. The pricing is more expensive per key than Tableau, but it also offers ETL capability, which would otherwise mean buying in another product such as Alteryx, with a starting price of $3,995 for three years, or £1.7k/year, for their base version, with more bolted on if you require further features or automation. It is difficult to compare pricing directly, though, as Alteryx has other features you may like, such as a cloud-based system and TomTom data, while Omniscope equally has a number of unique features.
Pros (Lots of Features):
- Flexible data feeds – has many data formats enabled so can read all common data and text formats.
- Connectors – can join up to a wide range of databases, google docs, twitter, facebook, ArcGIS, iGeolise to name a few.
- Connectors for ad servers – used by a number of well known digital media agencies to access DoubleClick, Flashtalking, Sizmek, Atlas, Criteo & Celtra.
- Use it to run R scripts for those more complicated processes.
- Interactive reports – delivers a report that the user can filter and examine the data visually.
- Wide variety of operations available for ETL, data filtering and cleaning.
- Inbuilt online mapping system and spatial analysis features for overlaying data on maps (Bing and Open Street Map). Can load ArcGIS online resources for boundaries/shapefiles/kml data.
- Huge number of functions available (geocoding included).
- Variety of views.
- Friendly to non-programmers with GUI approach in Data Manager.
- A data publishing tool.
Cons:
- Requires all data to be in a single matrix per report, so it is not optimal when dealing with multiple unrelated data sets.
- Preset graph/chart views are limited in appearance options – not as freely customisable as Tableau, and a smaller range. There is some scope for creating custom views via HTML, though this requires experience in web design.
- Limited customisation options in appearance of widgets and filters.
- Mobile reporting appearance is not identical to the desktop version of the report; it also requires an always-on machine or a server setup to host the .iok files.
- Java Web Start no longer works, as browser standards have changed to remove Java from all popular browsers.
Omniscope will likely undergo a lot of changes as it adapts in the near future, with 3.0 already in the works.
Also see my page here for some demo examples I have previously compiled:
Just for practice, I found a simple report by the Office of Rail and Road which gives estimated figures by station and the percentage of travellers on full, reduced or season rates (data here). First, a couple of gripes with the data! There is at least one incorrect easting/northing, for a Southend station. Secondly, the data is modelled on the principle that entries = exits at a station. It is practical for simple reporting to make this assumption, but in reality it will not hold, as exits may vary due to one-way journeys. This is difficult to model unless you have samples of data measuring the origin and destination of each individual journey, which is often hard to obtain: open-ended tickets by zone, predominantly outside of cities, let travellers exit without a predetermined destination, much like bus tickets. The ORR does do an origin–destination matrix study which may yield more interesting findings.
I’ve visualised the report in Omniscope and included screenshots below of various little insights gleaned from this data set; download the report here.
More stations, more travel?
Having the most stations does not mean having the most activity: Network Rail operates 20 stations but accounts for the lion’s share of journeys, with estimates of over 350 million journeys for 2014/15. The reason is likely that the major hub stations, such as London’s Waterloo and Liverpool Street, are operated by Network Rail. These are the focal exchange stations for most journeys, predominantly to and from London.
Regionally speaking, Londoners are the least likely to pay full fare. There are a couple of reasons for this: better transport links and more travel activity may mean more competitive pricing options are available, and, more importantly, commuters into central London are far more likely to buy season tickets.
Do train companies compete?
Looking at it by train company, Southern Rail has the fewest full price payers and Govia the most season tickets, while Virgin Trains East Coast has the most reduced price tickets. The Virgin East Coast service markets heavily on discounted journeys, having only taken up the contract at the end of 2014.
Destinations matters with pricing!
The stations in red are those with the fewest full price payers, whereas the green ones are those where travellers are most likely to pay full price. Gatwick and Stansted are perfect examples: you travel at a time determined by circumstances, which likely means you may not get a reduced price for that journey.
After working a couple more years with data visualisation and analysis, I’m now looking for something new to get into. I thought I should set down my thoughts here on this neglected blog for anyone who may be interested. Perhaps this might be useful for those thinking of a data visualisation job.
What do you do with data visualisation & how do you go about making a career from data visualisation?
Data visualization is the presentation of data in a pictorial or graphical format. For centuries, people have depended on visual representations such as charts and maps to understand information more easily and quickly.
Data visualisation isn’t so much a career path as a skill set comprising the analysis and presentation of results, bringing some form of insight or use to a bunch of numbers, letters or bytes of information. We use it every day and it permeates every level of our lives, showing up inconspicuously and influencing our every action. It pops up here and there: as the wall map at the train station, a table forecasting the weather temperatures in the newspaper article you read in the morning, or on the news channel when a reporter shows a bar chart of the number of crimes in your area.
The history of data visualisation could be said to have started from the early days of cavemen making etchings of their exploits on their dank cavern walls (Brief History Here Source: Jon Hazell @ Dundas.com).
The process has evolved to what we recognise as graphs, charts, diagrams, tables & maps which we use to present often complex information into a format which allows us to understand the data in a short amount of time.
To actually have a use for data visualisation, you need a question or problem to solve first! Some may be sceptical at the idea that a table or graph could solve anything, and some even believe them to be lies and propaganda! This is untrue: correctly implemented, data visualisation is a problem-solving tool that helps us make decisions on a daily basis and affects our behaviour, and the insight can be valuable.
So we use those daily data visualisations popping up around us: deciding how to get where we want to go from the map, checking the weather forecast to make sure we have enough layers of clothes on and whether to bring an umbrella, and gauging how aware of criminals we should be (the last is an exaggeration).
In terms of actually making a career out of data visualisation, it is rather open ended. The majority of roles will be analysts, data scientists, business information roles, researchers, etc. – any role which includes the word data. Here I’ve outlined some general areas and responsibilities where it is used:
- Reporting & Optimisation – Business activity needs to be recorded, whether it’s bills and receipts or data feeds of digital activity. Supermarkets like Sainsbury’s with its Nectar scheme track all their sales and have massive departments to aggregate them and produce analysis that helps drive more sales and identify best selling products. Google tracks all its web users’ activities, and its DoubleClick platform, which I have used over the last couple of years, records advertising activity for clients to create KPIs that measure performance and cost effectiveness.
- Planning & Forecasting – A wide range of fields make use of plans and maps to decide how to put up a building, such as an architect’s plans. We have forecast reports for the stock markets, such as Standard & Poor’s credit ratings, and many businesses produce some form of report with budget forecasts for the next period.
- Quality Assurance – Data visualisations are used to flag up issues or errors within a business or organisation which need to be attended to. Dashboards of hospital activities like here (Source: NHS Derbyshire) are used to monitor performance. TfL runs real time updates on its site for its services (Source: Transport for London).
- Insight & Analysis – Management and consultants are constantly looking at ways to interpret data and develop policies to help improve spending and productivity. Data gathered from receipts, clocked working hours or consumer research all needs to be digested, and thousands of companies have departments focused on analysing various parts of their businesses.
- Data Journalism – Weather forecasting, with its weather maps and weekly forecast tables, is a mainstay of any newspaper or television report. Reporters incorporate data visualisations to augment their articles with facts and figures, helping get their message across to readers. A well made visualisation is a great space saver, condensing a lot of information – a prime example being The Guardian, which even has a dedicated page for it!
- Mapping and Cartography – Where would we be without maps? Though a very specialised and often technically demanding area, maps are data visualisations which solve all sorts of location, logistics and routing problems.
- Academic & Scientific Research – Whenever researchers run an experiment, they record the results and collate the findings. Data visualisation is often the most efficient way to present the results, and more advanced graphs and diagrams can be used to identify and verify relationships between the data variables. Just look at all these graphs! It would be near impossible to represent the millions to billions of data records from the Hadron Collider in raw form – at a rate of 25GB/s, that’s the equivalent of five Blu-ray movies shown per second.
- Education & Training – This area is closely related to academic research, and is really about using the data visualisations that are easiest to understand to convey meaning and insight to students. You could even teach the data visualisation methods themselves and apply them to unique niche problems – see all these Udemy courses!
This is by no means an exhaustive list, just some general areas which might be of interest for those looking at going into data visualisation. HOWEVER! My opinion is that data visualisation is not a core skill which you can just pick up by itself and expect to get a job with straight away. It is a complementary skill, and works best when you have knowledge and experience within a particular field, which helps you apply the skills to better inform the problems and questions you are posed. Without some prior knowledge of the sector you wish to work in, you may not be able to translate its data or its problems so as to begin to solve them.