Thursday, 27 February 2014

Web Science Institute Research Week (Part I)

Historical Analysis of Government Websites

This week (24th - 28th February), the Web Science Institute at the University of Southampton has been hosting a research week for Web Science students and academics, industry contacts, and other interested academics in the University. We've split into groups around a list of topics and have been working in teams to develop research projects based on them. This post is about the Historical Analysis of Government Websites group, working with the National Archives, and is part 1 of a series detailing our motivations, developments and outputs. UPDATE: Part 2 is now available here

Step 1: Project Formation

After a series of introductory talks on Monday morning, we moved into our groups to begin tackling a set of research challenges that had been devised through collaboration between the Web Science CDT and our industry partners. Our group would be working with the National Archives to examine their collection of UK government websites dating back to 1996. This collection has the potential to be a gold mine of data, reflecting how the government's use of the Web has changed over time while also showing how societal matters and current events have been portrayed. The team consisted of 5 members:

Simon Demissie - The National Archives
Justin Murphy - University of Southampton (Politics) - Data retrieval and statistical analysis
Phil Waddell - University of Southampton (Web Science) - Research method and theoretical background
Chris Phethean - University of Southampton (Web Science) - Web based presentation/UX
Ian Brown - University of Southampton (Web Science) - Project manager and presenter

After discussing the potential options for this project, it became apparent that there was a huge variety of different approaches we could take. The data from the National Archives was available through several different methods, and we quickly decided that the most favourable would be a full-text search API that would allow us to query the entire archive by keyword and access every page that has contained those terms.

Our motivation for this came from a shared interest within the team in examining some sort of narrative on government webpages over a period of time. We were particularly interested in taking a specific theme - we chose the financial crisis - and looking at how the frequency of related key terms (such as unemployment, inflation and wage) changed over time, and how this correlated with what was being reported in the media. Most importantly, however, we were interested in producing a reusable, generic system for analysing this data, so that a similar search and analysis could be carried out regardless of topic or keyword. This would open up the data, and the tools we needed to develop for the project, to anyone, and the system could be hosted on the Web Observatory.
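
To make the idea of tracking key terms over time a little more concrete, here is a minimal sketch in Python. It is only an illustration and not the system we built: the function name term_frequency_by_year, the (year, text) input format and the sample pages are all hypothetical, and it assumes the page text has already been retrieved from the archive by some other means. It counts how often each key term appears per 1,000 words in each year, so that years with more archived text do not dominate.

```python
import re
from collections import Counter, defaultdict


def term_frequency_by_year(pages, terms):
    """Return {year: {term: occurrences per 1,000 words}}.

    pages: iterable of (year, text) pairs (hypothetical input format).
    terms: iterable of single-word key terms, matched case-insensitively.
    """
    wanted = {t.lower() for t in terms}
    counts = defaultdict(Counter)  # year -> term -> raw count
    totals = Counter()             # year -> total words seen

    for year, text in pages:
        words = re.findall(r"[a-z']+", text.lower())
        totals[year] += len(words)
        counts[year].update(w for w in words if w in wanted)

    result = {}
    for year in sorted(totals):
        total = totals[year]
        result[year] = {
            term: (1000 * counts[year][term] / total) if total else 0.0
            for term in sorted(wanted)
        }
    return result


# Made-up sample text, purely for illustration.
sample_pages = [
    (2007, "Unemployment is low. Inflation is low."),
    (2009, "Unemployment rose; unemployment and wage concerns grew."),
]
frequencies = term_frequency_by_year(
    sample_pages, ["unemployment", "inflation", "wage"]
)
for year, by_term in frequencies.items():
    print(year, by_term)
```

Normalising by the total word count is one possible design choice; raw counts, or the share of pages mentioning a term, would be equally reasonable starting points depending on the question being asked.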

We therefore set about looking at ways in which we could access the required data, process it and then visualise the results. The developments in this will be covered in the next post. 
