OER Visualisation Project: How is OER being shared – Quick look at Jorum and #ukoer Twitter archive [day 24]
This might be slightly off-topic for the OER Visualisation project, but I followed an idea, did – what I think are – some interesting things with an archive of tweets and thought I would share. This line of thought was triggered by a tweet from Terry McAndrew in which he asked:
@mhawksey Have you a visualisation planned of JORUM ‘customers’ of OER (and the rest of it for that matter).
Tracking who views or uses OER material can be tricky but not impossible; the main issue comes when someone else, like me, comes along and wants to see who else has viewed, used or remixed the resource. For example, with the Jorum ukoer resources the only usage data I could get was each resource’s page views, and even getting this required scraping over 8,000 pages. This is mentioned in day 18, but I realise I haven’t gone into any detail about how Google Refine was used to get this data – if someone wants me to I will reveal all.
A recent development in this area is the US Learning Registry project which is looking to maximise the use of activity data to support resource discovery and reuse. On Lorna Campbell’s CETIS blog there is a very useful introduction to the Learning Registry announcing JISC’s involvement in a UK node. The post includes the following use case which helps illustrate what the project is about:
“Let’s assume you found several animations on orbital mechanics. Can you tell which of these are right for your students (without having to preview each)? Is there any information about who else has used them and how effective they were? How can you provide your feedback about the resources you used, both to other teachers and to the organizations that published or curated them? Is there any way to aggregate this feedback to improve discoverability?
The Learning Registry is defining and building an infrastructure to help answer these questions. It provides a means for anyone to ‘publish’ information about learning resources. Beyond metadata and descriptions, this information includes usage data, feedback, rankings, likes, etc.; we call this ‘paradata’”
Plan A – Inside-out: Repository resource sharing data
All of this is a bit late for me but I thought I’d see what ‘paradata’ I could build around ukoer and see if there were any interesting stories to be told as part of the visualisation project. This is an area I’ve visited before with a series of ‘And the most socially engaging is …’ posts which started with And the most engaging JISC Project is… For this I used Google Spreadsheet to get social share counts (Facebook, Twitter and more) for a list of urls. One of the issues with this technique is that Google Spreadsheets time out after 5 minutes, so there is a limit to the number of links you can get through – this is, however, not a problem for Google Refine.
Taking 380 of the most viewed Jorum ukoer tagged resources (approximately 5% of the data) I used Google Refine to
- ‘Add column by fetching URL’ passing the resource link url into Yahel Carmon’s Shared Count API - using the expression
- Add column based on the results column parsing each of the counts – e.g. using expressions similar to
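Outside Refine the same two steps look something like this Python sketch – the Shared Count endpoint and JSON field names are my best recollection of the API at the time, so treat them as assumptions:

```python
import json
import urllib.parse

# Assumed SharedCount-style endpoint - the exact URL and field names are illustrative
API = "http://api.sharedcount.com/?url="

def shared_count_url(resource_url):
    """Build the request url for a resource (the 'Add column by fetching URL' step)."""
    return API + urllib.parse.quote(resource_url, safe="")

def parse_counts(response_text):
    """Pull share counts out of the JSON response (the 'Add column based on the results column' step)."""
    data = json.loads(response_text)
    return {
        "twitter": data.get("Twitter", 0),
        "facebook": data.get("Facebook", {}).get("total_count", 0),
    }

# A response shaped like the JSON the API returned at the time
sample = '{"Twitter": 7, "Facebook": {"total_count": 2}}'
print(parse_counts(sample))  # {'twitter': 7, 'facebook': 2}
```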
At this point I stopped parsing columns because it was clear that there was very little, if any, social sharing of Jorum ukoer resources (here is a spreadsheet of the data collected). In fact the most tweeted resource, which Twitter records as having 7 tweets, got most of these following my tweet about it being the most viewed resource.
So what might be going on here? Is this just an issue for national repositories? Are people consuming ukoer resources from other repositories? Are Jorum resources not ranking well in search engines? Are resources not being marketed enough?
I don’t have answers to any of those questions, and maybe this is just an isolated case, but the ‘marketing’ aspect interests me. When I publish a blog post I’ll push it into a number of communication streams including RSS, a couple of tweets, the occasional Google+. For posts I think are really worthwhile I’ll set up a Twitter search column in Tweetdeck with related keywords and proactively push posts to people I think might be interested (I picked up this idea from Tony Hirst’s Invisible Frictionless Tech Support). Are we doing something similar with our repositories?
Plan B – Outside-in: Community resource sharing data
There’s probably a lot more to be said about repositories or agencies promoting resources. To try and find some more answers, instead of looking from the repository perspective of what is being shared, I thought it’d be useful to look at what people are sharing. There are a number of ways you could do this like selecting and monitoring a group of staff. I don’t have time for that, so decided to use data from an existing community who are likely to be using or sharing resources aka the #ukoer Twitter hashtag community. [There are obvious issues with this approach, but I think it’s a useful starter for ten]
Having grabbed a copy of the #ukoer Twapper Keeper archive using Google Refine before it disappeared, I’ve got over 8,000 tweets from 8th March 2010 to 3rd January 2012. My plan was to extract all the links mentioned in these tweets and identify any patterns or particularly popular links.
Extracting links and expanding shortened urls
As most tweets now have their links replaced with t.co shortened urls, and given the general use of url shortening, the first step was to extract and expand all the links mentioned in tweets. Link expansion was achieved using the longurl.org API, which has the added bonus of returning meta description and keywords for target pages. Here’s a summary of the Google Refine actions I did (taken from the undo/redo history):
- Create new column links based on column item – title by filling 8197 rows with grel:filter(split(value, " "),v,startsWith(v,"http")).join("||")
- Split multi-valued cells in column links
- Create column long_url at index 4 by fetching URLs based on column links using expression grel:"http://ap
- Create new column url based on column long_url by filling 5463 rows with grel:parseJson(value)['long-url']
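For anyone who prefers to see the logic outside GREL, the extract-and-expand steps boil down to this Python sketch (the longurl.org response shape is assumed from the expression above):

```python
import json

def extract_links(tweet_text):
    """Mirror grel:filter(split(value, " "), v, startsWith(v, "http")) - keep tokens that look like urls."""
    return [t for t in tweet_text.split(" ") if t.startswith("http")]

def parse_long_url(response_text):
    """Mirror grel:parseJson(value)['long-url'] on the longurl.org JSON response."""
    return json.loads(response_text)["long-url"]

tweet = "New #ukoer resource http://bit.ly/example worth a look"
print(extract_links(tweet))             # ['http://bit.ly/example']
print("||".join(extract_links(tweet)))  # joined with '||' as in the Refine step
print(parse_long_url('{"long-url": "http://example.ac.uk/oer/1"}'))
```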
With long urls extracted the data was exported and uploaded to Google Spreadsheet so that a list of unique urls and their frequencies could be calculated. Here is the Spreadsheet of data. From the refined data there are 2,482 different links which appear 6,220 times in the #ukoer archive. Here is the searchable table of extracted links with frequencies (sorry for the small font – can’t seem to find a way to control column width using Google Visualisation API … anyone?).
Not surprisingly a number of domain root urls appear at the top of the list. More work needs to be done to match resource sharing to different OER sources, but you can start doing simple filtering to find out what’s there. Something from the initial analysis I find interesting is that the most common link in the archive is to Oxford’s Open Advent Calendar, which was a daily posting highlighting some of their OER resources. This could be interpreted as underlining the need for OER resources to be more effectively marketed. I’ll let you decide.
PS out of interest I put the list of 2,500-odd links back into Google Refine and extracted social share counts. I haven’t had a chance to look at the data closely, but if you want to play, a copy of it and a metadata-enhanced version of the #ukoer archive is here. Please share any findings ;)
In October last year the Sport Learning Technologists at Loughborough College successfully won funding from the LSIS Leadership in Technology (LIT) grant scheme for the Fast-tracking feedback using Google Scripts project. Here’s an extract from the project summary:
This project will effectively combine Google Apps for Education and Google Apps Script in order to create a tool which allows tutors to enter grades and feedback in a single spreadsheet which then automatically populates individual feedback proforma, simultaneously sharing these results with students, progress tutors, and administrators as appropriate.
The benefit will be an increase in the efficiency with which assessment feedback can be shared, improving the speed and quality of paper-less student feedback. A successful conclusion to this project will be demonstrated by reduced submission turnaround times and a reduction in the errors brought about by inconsistencies in data entry.
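The project itself uses Google Apps Script, but the core of the workflow – merge each grade row into an individual feedback proforma – can be sketched in a few lines of Python (the row fields and template here are invented for illustration; the real system writes Google Documents and handles sharing):

```python
# Invented grade rows as a tutor might enter them in the master spreadsheet
rows = [
    {"name": "A. Student", "grade": "B+", "feedback": "Good structure; expand the evaluation."},
    {"name": "B. Learner", "grade": "A", "feedback": "Excellent throughout."},
]

# Stand-in for the feedback proforma document
TEMPLATE = "Feedback for {name}\nGrade: {grade}\n\n{feedback}\n"

def make_proformas(rows):
    """Merge each row into an individual proforma (the Apps Script version populates Google Documents instead)."""
    return [TEMPLATE.format(**row) for row in rows]

docs = make_proformas(rows)
print(docs[0].splitlines()[0])  # Feedback for A. Student
```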
Project funding is not just for deploying technology but also for increasing capacity within the organisation at the operational level. With this in mind I have been working with Loughborough, helping them with the technical aspects of developing the Fast-Tracking Feedback System and also to learn about Google Apps Script via a series of workshops. Friday was the first of these and I thought I’d share the story so far.
The Loughborough group had already got off to a flying start, successfully modifying My #eas11 world premier: Creating personalised Google Documents from Form submissions. 5 months is a long time in Google Apps Script and since then not only is there some new functionality in Apps Script, but I’ve also picked up some new tips. My own understanding has come on a long way thanks to receiving a preview copy of Google Script: Enterprise Application Essentials by James Ferreira [due out 27th January]. I’ve been a regular visitor to James’ simpleappssolutions.com site and tutorials so wasn’t sure if his book would teach me much more, but how wrong I was. Part of the reason, I believe, is that the book is geared towards ‘enterprise applications’ so concentrates on documents and workflows, just as assessment in education (for better or worse) is concentrated on documents and workflows.
So below are two links to the current version of the Google Apps Script Spreadsheet and an example Document template, followed by a quick video to show how it is used. Obviously these are still work in progress as there are still 6 months to run on the project, but there’s already enough there for others to benefit from and perhaps feed back on the design.
- Google Spreadsheet Template (File > Make a copy to reuse) [Source code here]
- Google Document Example Feedback Template (File > Make a copy to reuse)
Stay tuned for more developments
To support my research in Twitter community visualisation I’ve updated my Twitter Archiving Google Spreadsheet (TAGS) [formerly called twitteralytics – I pre-emptively changed the name to play nice with Twitter ToS].
This new version has some coding improvements and new features including a dashboard summary and advanced tools for getting user profile information and friend/follower relationships for social network analysis.
You can get a copy by selecting one of the links below [make sure you are logged in to your Google account first]:
*** Twitter Archive Google Spreadsheet – TAGS v3.0 ***
[Please rate it in the Template Gallery]
[If the first link doesn’t work try making a copy from the Template Gallery
or Opening this Spreadsheet and File > Make a copy]
- Open the TAGS Google Spreadsheet and make a copy
- On the Readme/Settings sheet enter the following settings (starting in cell B9):
- Who are you = any web address that identifies you or your event
- Search term = what you are looking for eg #cetis12
- Period = default
- No. results = 1500 (this is the maximum twitter allows but without authenticated access you might get less. See the Advanced setup for info on configuration)
- Continuous/paged = continuous
- To configure the spreadsheet to automatically update select Tools > Script Editor … and then in the Script Editor window select Triggers > Current script’s triggers… and Add a new trigger. Select to run ‘collectTweets’ as a ‘Time-driven’ trigger, choosing a time period that suits your search (for unauthenticated access I collect 1,500 tweets every hour). Click ‘Save’
- The collection can be manually triggered via TAGS > Run Now! (results appear on the ‘Archive’ sheet)
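For the curious, behind the menu the script is just paging through Twitter’s search API. Here’s a rough Python sketch of how 1,500 results get assembled from 100-per-page requests (the endpoint is the pre-v1.1 search.twitter.com one of the time, and the function is my illustration, not the actual script code):

```python
import urllib.parse

def build_search_urls(term, max_results=1500, per_page=100):
    """Build the paged search requests needed to collect up to max_results tweets."""
    q = urllib.parse.quote(term)
    pages = max_results // per_page
    return [
        f"http://search.twitter.com/search.json?q={q}&rpp={per_page}&page={p}"
        for p in range(1, pages + 1)
    ]

urls = build_search_urls("#cetis12")
print(len(urls))  # 15
print(urls[0])
```

In practice the spreadsheet also de-duplicates against tweet ids already in the Archive sheet, which is why repeated runs extend rather than overwrite the archive.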
- Open the TAGS Google Spreadsheet and make a copy
- Register for an API key with Twitter at http://dev.twitter.com/apps/new. In the form these are the important bits:
- Application Website = anything you like
- Application Type = Browser
- Callback URL = https://spreadsheets.google.com/macros
- Default Access type = Read-only
- Once finished filling in the form and accepting Twitter’s terms and conditions you’ll see a summary page which includes a Consumer Key and Consumer Secret
- Back in the Google Spreadsheet select Twitter > API Authentication (you’ll need to select this option twice, the first time to authorise read/write access to the spreadsheet). Paste in your Consumer Key and Secret from the previous step and click ‘Save’ (if the Twitter menu is not visible click on the blue button to show it)
- From the spreadsheet select Tools > Script Editor … and then Run > authenticate and Authorize the script with Twitter using your Twitter account
- While still in the Script Editor window select Triggers > Current script’s triggers… and Add a new trigger. Select to run ‘collectTweets’ as a ‘Time-driven’ trigger, choosing a time period that suits your search (I usually collect 1,500 tweets once a day, but increase to hourly during busy periods eg during a conference). Click ‘Save’
- Now close the Script Editor window. Back in the main spreadsheet, on the Readme/Settings sheet enter the following settings (starting in cell B9):
- Who are you = any web address that identifies you or your event
- Search term = what you are looking for eg #cetis12
- Period = default
- No. results = 1500 (this is the maximum twitter allows)
- Continuous/paged = continuous
- Click TAGS > Run Now! to check you are collecting results into an ‘Archive’ sheet
- To allow the results to be visualised from the spreadsheet select File > Publish to the web… You can choose to publish All sheets or just the Archive sheet. Make sure ‘Automatically republish when changes are made’ is ticked and click ‘Start publishing’
Creating a public interactive visualisation of the archived conversation
- Copy the url of the published spreadsheet
- Visit http://hawksey.info/tagsexplorer and paste your spreadsheet url in the box, then click ‘get sheet names’
- When it loads the sheet names leave it on the default ‘Archive’ and click ‘go’
- You now have a visualisation of your spreadsheet archive (click on nodes to delve deeper)
- To share the visualisation, click ‘link for this’ at the top right, which is a permanent link (as your archive grows and the spreadsheet is republished this visualisation will automatically grow)
Quick way to display archive community (Links) data
- Run TAGS Advanced menu options (1-3)
- Copy the url of the published spreadsheet
- Visit http://hawksey.info/edgesexplorer and paste your spreadsheet url in the box, then click ‘get sheet names’
- When it loads the sheet names leave it on the default ‘Links’ and click ‘go’
Dear Diary, it is now day 20 of the OER Visualisation Project … One of the suggested outputs of this project was “collections mapped by geographical location of the host institution” and over the last couple of days I’ve experimented with different map outputs using different techniques. It’s not the first time I’ve looked at maps; as early as day 2 I used SPARQL to generate a project locations map. At the time I got some feedback questioning the usefulness of this type of data, but I was still interested in pursuing the idea as a way to provide an interface for users to navigate some of the OER Phase 1 & 2 information. This obsession shaped the way I approached refining the data, trying to present project/institution/location relationships, which in retrospect was a red herring. Fortunately the refined data I produced has helped generate a map which might be interesting (I thought there would be more from London), but I thought it would also be useful to document some of what I’m sure will end up on the cutting room floor.
Filling in the holes
One of the things the day 2 experiment showed was that it was difficult to use existing data sources (PROD and location data from the JISC Monitoring Unit) to resolve all host institution names. The main issue was HEA Subject Centres and partnered professional organisations. I’m sure there are other linked data sources I could have tapped into (maybe inst. > postcode > geo), but I opted for the quick and dirty route by:
- Creating a sheet in the PROD Linked Spreadsheet of all projects and partners currently filtered for Phase 1 and 2 projects. I did try to also pull location data using this query but it was missing data, so instead I created a separate location lookup sheet using the queries here. As this produced 130 institutions without geo-data (Column M) I cheated and created a list of unmatched OER institutions (Column Q) [File > Make a copy of the spreadsheet to see the formulas used, which include SQL-type QUERYs].
- Resolving geo data for the 57 unresolved Phase 1 & 2 projects was a 3 stage process:
- Use the Google Maps hack recently rediscovered by Tony Hirst to get co-ordinates from a search. You can see the remnants of this here in cell U56 (Google Spreadsheets only allows 50 importData calls per spreadsheet so it is necessary to Copy > Paste Special > As values only).
- For unmatched locations ‘Google’ Subject Centres to find their host institution and insert the name in the appropriate row in Column W – existing project locations are then used to get coordinates.
- For other institutions ‘google’ them in Google Maps (if that didn’t return anything conclusive then a web search for a postcode was used). To get the co-ordinate pasted in Column W I centred their location on Google Maps then used the modified bookmarklet
- The co-ordinates in Column S and T are generated using a conditional lookup of existing project leads (Column B) OR IF NOT partners (Column F) OR IF NOT entered/searched co-ordinates.
Satisfied that I had enough lookup data I created a sheet of OER Phase 1 & 2 project leads/partners (filtering Project_and_Partners and pasting the values in a new sheet). Locations are then resolved by looking up data from the InstLocationLookup sheet.
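In code terms the conditional lookup is just a chain of fallbacks. A Python sketch with invented lookup tables standing in for the spreadsheet columns:

```python
# Invented lookup tables standing in for the spreadsheet columns
lead_coords = {"Project A": (52.77, -1.21)}      # Column B: project leads
partner_coords = {"Project B": (55.95, -3.19)}   # Column F: partners
manual_coords = {"Project C": (50.15, -5.07)}    # Column W: hand-entered/searched co-ordinates

def resolve_coords(project):
    """Mirror the conditional lookup: lead, else partner, else manually entered co-ordinates."""
    for table in (lead_coords, partner_coords, manual_coords):
        if project in table:
            return table[project]
    return None

print(resolve_coords("Project B"))  # (55.95, -3.19)
```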
Map 1 – Using NodeXL to plot projects and partners
Exporting the OER Edge List as a csv allows it to be imported to the Social Network Analysis add-on for Excel (NodeXL). Using the geo-coordinates as layout coordinates gives:
The international partners mess with the scale. Here’s the data displayed in my online NodeXL viewer. I’m not sure much can be taken from this.
Map 2 – Generating a KML file using a Google Spreadsheet template for Google Maps
KML is an XML based format for geodata originally designed for Google Earth, but now used in Google Maps and other tools. Without a templating tool like Yahoo Pipes, which was used in day 2, generating KML can be very laborious. Fortunately the clever folks at Google have come up with a Google Spreadsheet template – Spreadsheet Mapper 2.0. The great thing about this template is you can download the generated KML file or host it in the cloud as part of the Google Spreadsheet.
The instructions for using the spreadsheet are very clear so I won’t go into details; you might, however, want to make a copy of the KML Spreadsheet for OER Phase 1 & 2 to see how data is being pulled from the PROD Spreadsheet. The results can be viewed in Google Maps (shown below), or viewed in Google Earth.
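Spreadsheet Mapper is essentially doing string templating over each row; here’s a minimal Python sketch of the kind of Placemark it generates (a simplification for illustration, not the template’s actual output – note KML wants longitude before latitude):

```python
def placemark(name, lat, lng, description=""):
    """Build a minimal KML Placemark; <coordinates> is lon,lat,alt order."""
    return (
        "<Placemark>"
        f"<name>{name}</name>"
        f"<description>{description}</description>"
        f"<Point><coordinates>{lng},{lat},0</coordinates></Point>"
        "</Placemark>"
    )

print(placemark("Example OER project", 52.77, -1.21))
```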
Map 3 – Customising the rendering of KML data using Google Maps API
Whilst digging around the Google Maps API for inspiration I came across this KML with features example (in the KML and GeoRSS Layers section). Out of interest I thought I’d use the KML link from Map 2 as the source, which gives this OER Phase 1 & 2 map. [If you are wondering about the map styling, I recently came across the Google Maps API Styled Map Wizard, which lets you customise the appearance of Google Maps, creating a snippet of code you can use with Google Maps API styling.]
Map 4 – Rendering co-ordinate data from a Google Spreadsheet
I haven’t even begun on Google Map Gadgets, so it looks like there are 101 ways to display geo data from a Google Spreadsheet. Although all of this data bashing was rewarding I didn’t feel I was getting any closer to something useful. At this point in a moment of clarity I realised I was chasing the wrong idea, that I’d made that schoolboy error of not reading the question properly.
Map 5 – Jorum UKOER records rendered as a heatmap in Google Fusion Tables
Having already extracted ‘ukoer’ records from Jorum and reconciled them against institution names in day 11, it didn’t take much to geo-encode the 9,000 records to resolve them to an institutional location (I basically imported a location lookup from the PROD Spreadsheet, did a VLOOKUP, then copy/pasted the values; the result is in this sheet).
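The VLOOKUP-and-count step amounts to this in Python (institutions, co-ordinates and counts here are invented samples, not the real figures):

```python
from collections import Counter

# Invented sample of records already reconciled to institutions (as in the day 11 sheet)
records = ["Staffordshire University"] * 3 + ["University of Cambridge"] * 2

# Location lookup imported from the PROD Spreadsheet (co-ordinates illustrative)
locations = {
    "Staffordshire University": (53.01, -2.18),
    "University of Cambridge": (52.20, 0.12),
}

# VLOOKUP equivalent: attach co-ordinates, then tally records per point for heatmap weighting
weights = Counter(locations[r] for r in records if r in locations)
print(weights[(53.01, -2.18)])  # 3
```

This tallying is also why the Fusion Tables heatmap disappoints: all the weight for an institution lands on a single point.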
For a quick plot of the data I thought I’d upload to Google Fusion Tables and render as a heatmap but all I got was a tiny green dot over Stoke-on-Trent for the 4000 records from Staffordshire University. Far from satisfying.
Map 6 – Jorum UKOER records rendered in Google Maps API with Marker Clusters
The final roll of the dice … for now anyway. MarkerClusterer is an open source library for Google Maps API which groups large numbers of closely located markers for speed and usability gains. I’d never used this library before but the Speed Test Example looked easy to modify. This has resulted in the example linked at the very beginning of this post mapping Jorum ukoer records.
This is still a prototype version with lots of tweaking/optimisation required, and the data file, which is a csv-to-json dump, has a lot of extra information that’s not needed (hence the slow load speed), but it’s probably the beginnings of the best solution for visualising this aspect of the OER programme.
So there you go. Two sets of data, 6 ways to turn it into a map and hopefully some useful methods for mashing data in between.
I don’t have a ending to this post, so this is it.
Last month I posted Free the tweets! Export TwapperKeeper archives using Google Spreadsheet, which was a response to the announcement that TwapperKeeper would be removing public access to archives on the 6th January. This solution was limited to archives smaller than 15,000 tweets (although minor tweaking could probably get more). Since then Tony Hirst has come up with a couple of other solutions:
- Rescuing Twapperkeeper Archives Before They Vanish « OUseful.Info
- Rescuing Twapperkeeper Archives Before They Vanish, Redux « OUseful.Info
- Python Script for Exporting (Large) Twapperkeeper Archives By User « OUseful.Info
One of the limits of these solutions is that they only collect the data stored on TwapperKeeper, missing lots of other juicy data like in_reply_to, location and retweet_count (here’s what a tweet used to look like; now there is more data). Whilst this data is probably of little interest to most people, for people like me it opens the opportunity to do other interesting stuff. So here’s a way you can make a copy of a Twapper Keeper archive and rebuild the data using Google Refine.
- You’re going to need a copy of Google Refine installed and running
- Visit the Twapper Keeper page of the archive you want. On the page copy the RSS Permalink into the URL box in Google Refine adding &l=50000 to the end e.g. the ukoer archive is http://twapperkeeper.com/rss.php?type=hashtag&name=ukoer&l=50000 and click Next.
- In the preview window that appears switch ‘Parse data as’ to XML files. Scroll the top pane down to hover over the ‘item’ tag and click it
- You should now have a preview with the data split in columns. Enter a Project name and click ‘Create Project’
- From the ‘item – title’ column dropdown select Edit column > Add column based on this column…
- In the dialog that opens enter a New column name ‘id_str’ and the Expression
smartSplit(value," ")[-1] (this splits the cell and returns the last group of text)
- From the new id_str column dropdown select Edit column > Add column by fetching URLs… and enter a name ‘meta’, change throttle delay to 500 and enter the expression
"https://api.twitter.com/1/statuses/show.json?id="+value+"&include_entities=true" (that’s with quotes), then click OK. [What we’ve done is extract a tweet id and then asked Refine to fetch details about it from the Twitter API]
- Wait a bit (it took about 4 hours to get data from 8,000 tweets)
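Putting the last few steps together outside Refine, the id extraction and API call look something like this Python sketch (the v1 endpoint shown is the one used above, which has since been retired; the title is an invented example):

```python
def tweet_id_from_title(title):
    """Mirror grel:smartSplit(value, " ")[-1] - the archive stores the status id as the last token of the title."""
    return title.split(" ")[-1]

def status_url(tweet_id):
    """Mirror the fetch expression: ask the (now retired) v1 statuses/show endpoint for the full tweet."""
    return f"https://api.twitter.com/1/statuses/show.json?id={tweet_id}&include_entities=true"

title = "mhawksey: Free the tweets! #ukoer 157496477181906944"
print(tweet_id_from_title(title))  # 157496477181906944
print(status_url(tweet_id_from_title(title)))
```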
Once you have the raw data you can use Refine to make new columns using the expression parseJson(value) and then navigating the object namespace. You might find it useful to paste a cell value into http://jsonviewer.stack.hu/ to see the structure. So to make a column which extracts the tweet author’s full name you’d use parseJson(value).user.name
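The same parseJson navigation in Python, using a much-trimmed sample of a tweet object (real responses carry many more fields):

```python
import json

# A much-trimmed tweet object of the shape returned by the v1 statuses/show endpoint
cell = '{"retweet_count": 3, "user": {"name": "Martin Hawksey", "screen_name": "mhawksey"}}'

tweet = json.loads(cell)        # Refine's parseJson(value)
print(tweet["user"]["name"])    # parseJson(value).user.name
print(tweet["retweet_count"])   # parseJson(value).retweet_count
```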
So don’t delay ‘free the tweets’
- 51 Phase 1&2 projects were led by 41 different institutions with 122 partners
- Projects were recorded as using 112 different technologies and 104 standards
- Jorum contains 9,502 unique records containing ‘ukoer’, of which 8,859 can be reconciled against institution names (counting duplicates for records with more than one subject category, there are 9,069)
- ukoer records are deposited in 29 subject categories by 77 different institutions
- ukoer resources on Jorum have been viewed over 5 million times and the average resource is viewed 570 times (the top resource is An Introduction to IPR for Educators submitted by Caroline Cash at Falmouth University College)
Most of the numbers above come from two spreadsheets: the CETIS PROD Spreadsheet and jorumUKOERReconciled – Issue 2. I’ve mentioned both of these spreadsheets before (day 8 | day 16), but you might like to File > Make a copy of these to play with the data yourself and see some of the formulas used. An additional note on the resource view counts: these were collected by fetching each resource page on Jorum using Google Refine and scraping the data (more details on using Refine to extract data in day 11).
[The additional processes were to extract a Jorum url by using the expression
filter(value.split("|"),v,contains(v,"jorum.ac.uk")) on the identifier column, fetching a new column based on the new url and then extracting a count using
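For reference, that filter expression behaves like this Python equivalent (the identifier value is an invented example):

```python
def jorum_urls(identifier_cell):
    """Mirror grel:filter(value.split("|"), v, contains(v, "jorum.ac.uk")) on the identifier column."""
    return [v for v in identifier_cell.split("|") if "jorum.ac.uk" in v]

cell = "http://dspace.jorum.ac.uk/xmlui/handle/123|http://example.org/other"
print(jorum_urls(cell))  # ['http://dspace.jorum.ac.uk/xmlui/handle/123']
```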
So I now have a decent amount of data (some of which might be dropped), next to communicate …
By my calculation it’s day 16 of the OER Visualisation Project. Since day 11 and the Jorum UKOER ‘snowflake’ visualisation I’ve been going back to the refined data and trying to validate the set and understand the information it contains better.
One of the things I did was upload the data from Google Refine to Google Spreadsheet (exported Refine as .xls and uploaded it to Docs). Here is a copy of the spreadsheet. Using the UNIQUE and COUNTIF formulas it’s very easy to build a summary of the top Jorum UKOER contributors and subject categorisation.
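The UNIQUE + COUNTIF combination is just the spreadsheet version of a frequency count; in Python terms (with an invented sample of the contributor column):

```python
from collections import Counter

# Invented sample of the 'contributor' column from the uploaded sheet
contributors = [
    "Staffordshire University", "University of Cambridge",
    "Staffordshire University", "Leeds Metropolitan University",
    "Staffordshire University",
]

# UNIQUE + COUNTIF in one pass, sorted by frequency
summary = Counter(contributors).most_common()
print(summary[0])  # ('Staffordshire University', 3)
```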
In the original OER funding call paragraph 19 states: “depositing resources funded through this call into JorumOpen will be mandatory”, so in theory all 51 Phase 1 and 2 OER projects should have records in Jorum. We can use this assumption to validate the refined dataset.
Using day 8’s CETIS PROD to Google Spreadsheet it’s easy for me to create a list of Phase 1 and 2 lead institutions (41 in total as some institutions from phase one were refunded). Using this list I was able to query the spreadsheet data and produce the table embedded below, which counts Jorum ‘ukoer’ records for each of the institutions:
You can see a number of institutions have zero record counts. These are mainly for the HEA Subject Centre projects which were not detected using the original extraction and reconciliation method, but as also noted, a number of these records are reconciled against other university names. Using this data the original extracted dataset was further refined and an additional 705 ukoer records were reconciled against institution names. A revised issue and summary of ukoer records is available here.
Data Driven Journalism
Most people are probably unfamiliar with the term ‘data driven journalism’ but would have seen some of the products of the process like The Guardian’s Interactive guide to every IED attack (Wikipedia has a useful definition and overview of data-driven journalism).
It’s been useful for me to consider the OER visualisation project like a data journalism assignment, using Paul Bradshaw’s The inverted pyramid of data journalism as a basic process to approach the Jorum data. For example, remembering the ‘context’ in which the Jorum data was collected (a mandatory task, which in some cases wasn’t fully automated) is a reminder that even after multiple refinements of the data it’s still not 100% complete and in parts may be unrepresentative.
Looking at a table of top Jorum UKOER contributors, for example, Staffordshire University accounts for almost 50% of the deposits almost all going in the HE – Creative Arts and Design subject area, while University College Falmouth have one Jorum entry for their entire ukoer work.
| Top UKOER Depositors | Records |
| --- | --- |
| University of Cambridge | 855 |
| Subject Centre for Information and Computer Sciences | 669 |
| Leeds Metropolitan University | 383 |

| Top Subject Categories | Records |
| --- | --- |
| HE – Creative Arts and Design | 4068 |
| HE – Engineering | 1227 |
| HE – Veterinary Sciences, Agriculture and related subjects | 874 |
| HE – Mathematical and Computer Sciences | 767 |
| HE – Physical Sciences | 454 |
Using data journalism processes should also be helpful when considering how the data is communicated, using techniques like interactive slideshows and providing brief narration to the data.
With this in mind it was useful to revisit Geoff McGhee’s Journalism in the Age of Data (embedded below)
A lot more to do in 2012