
Thursday, 7 August 2008

LHC goes live

Computing has a front page story on the LHC going live tomorrow and the implications for data management:
http://www.computing.co.uk/computing/news/2223424/grid-awaits-secrets-universe-4158895

Tuesday, 22 July 2008

More bits and pieces of news and stuff

Friday, 18 July 2008

Various news

I'm starting to catch up with reading - here's some of the news that's hit recently (ish!):
  • Microsoft buys up Powerset in its attempt to take on Google
  • HEFCE announces 22 pilot institutions to test the new REF (http://www.timeshighereducation.co.uk/story.asp?sectioncode=26&storycode=402609)
  • NHS Choices selects Capita as preferred bidder
  • Google is experimenting with a Digg-like interface
  • Amazon S3 experienced a service outage on 20 July - one of the risks of relying on the cloud, I guess
  • Encyclopaedia Britannica goes wiki
  • ProQuest to acquire Dialog business from Thomson Reuters
Some interesting articles came my way too...
  • Information: lifeblood or pollution? has some interesting thoughts about when information has value and when there is so much information it loses its value. Jakob Nielsen is quoted: 'Information pollution is information overload taken to the extreme. It is where it stops being a burden and becomes an impediment to your ability to get your work done.' Possible solutions are rating the integrity of information and clearer provenance.
  • International initiative licenses resources across 4 European countries is about a deal negotiated via the Knowledge Exchange with Multi-Science, ALPSP, BioOne, ScientificWorldJournal, and Wiley-Blackwell.
  • A fun way of describing the amount of data Google handles

Thursday, 17 July 2008

JISC Innovation Forum

Earlier this week, the JISC Innovation Forum took place, with the aim of bringing together projects and programmes to discuss cross-cutting themes and share experiences. I attended the theme on research data - 3 sessions in all, each focusing on a different aspect:

Session 1 - Legal and policy issues
This session followed the format of a debate, with Prof Charles Oppenheim arguing for the motion that institutions retain IPR and Mags McGinley arguing that IPR should be waived (with the disclaimer that neither presenter was necessarily representing their own or their institution's views).

Charles argued that institutional ownership encourages data sharing. Curation should be done by those with the necessary skills - curation involves copying and can only be done effectively where the curator knows they are not infringing copyright, so the IPR needs to be owned "nearby". He also explained how publishers are developing an interest in raw data repositories and wish to own the IPR on raw as well as published data. There is a real need to discourage authors from blindly handing over the IPR on raw data. He suggested a model where the author is licensed to use and manipulate the data (e.g. deposit it in a repository) and retains the right to intervene should they feel their reputation is under threat. The main argument focused on preventing unthinking assignment of rights to commercial publishers.

Mags suggested that curation is best done when no-one asserts IPR. There may in fact be no IPR to assert, and she explained that there is often over-assertion of rights. In general there is a lot of confusion and uncertainty around IPR, which leads to poor curation - Mags suggested the only way to prevent this confusion is to waive IPR altogether. Data is now more than ever the result of collaboration relying on multiple (and often international) sources, so unravelling the rights can be very difficult - there could be many owners, even hundreds, across many jurisdictions. Mags concluded with the argument that it is easier to share data which is unencumbered by IPR issues and quoted the examples of Science Commons and CC0.

A vote at this point resulted in: 5 for the motion supporting institutional ownership; 10 against; 7 abstaining.

A lively discussion followed - here are the highlights:
  • it's important to resolve IPR issues early
  • NERC model - researchers own IPR and NERC licenses it (grant T&Cs)
  • in order to waive your right, you have to assert it first
  • curation is more than just preservation - the whole point is reuse
  • funders have a greater interest in reuse than individual researchers - also have the resources to develop skills and negotiate T&Cs/contracts
  • not just a question of rights but responsibilities too
  • issues of long-term sustainability e.g. AHDS closure
  • incentives to curate - is attribution enough?
  • what is data? covered a range of data including primary data collected by the researcher, derived data, published results
  • are disciplines too different?
  • duty to place publicly funded research in the public domain? use of embargoes?
  • can we rely on researchers and institutions to curate?
  • "value" of data?
  • curation doesn't necessarily follow ownership - may outsource
  • proposal to change EU law on reuse of publicly funded research - HE now exempt - focuses on ability to commercially exploit - HEIs may have to hand over research data??
And finally, we voted again: this time, 6 for the motion; 14 against; 3 abstaining.

Session 2 - Capacity and skills issues
This session looked at 4 questions:
  1. What are the current data management skills deficits and capacity building possibilities?
  2. What are the longer term requirements and implications for the research community?
  3. What is the value of and possibilities for accrediting data management training programmes?
  4. How might formal education for data management be progressed?
Highlights of discussion:
  • who are we trying to train? How do we reach them? The need for training has to appear on their "radar" - the best way to reach researchers is via the lab, Vice-Chancellor, Head of School or funding source.
  • training should be badged e.g. "NERC data management training"
  • "JISC" and "DCC" less meaningful to researchers
  • a need to raise awareness of the problem first
  • domain specific vs generic training
  • need to target postgrads and even undergrads to embed good practice early on
  • need to cover entire research lifecycle in training materials
  • how is info literacy delivered in institutions now? can we use this as a vehicle for raising awareness or making early steps?
  • School of Chemistry in Southampton has accredited courses which postgrads must complete - these include an element of data management
  • lack of a career path for "data scientists" is a problem
  • employers increasingly looking for Masters graduates as perceived to be better at info handling
  • new generation of students - have a sharing ethic (web2.0) but not necessarily a sense of structured data management
  • small JISC-funded study to start soon on benefits of data management/sharing
  • can we tap into records management training? a role here for InfoNet?
  • can we learn from museums sector? libraries sector?
  • Centre for e-Research at King's is developing a "Digital Asset Management" course, to run in Autumn 09
  • UK Council of Research Repositories has a resource of job descriptions
  • role of data curators in knowledge transfer - amassing an evidence base for commercial exploitation
  • also a need for marketing data resources

Session 3 - Technical and infrastructure issues

This session explored the following questions:

  • what are the main infrastructure challenges in your area?
  • who is addressing them?
  • why are these bodies involved? might others do better?
  • what should be prioritised over the next 5 years?
One of the drivers for addressing technical and infrastructure issues is the sheer volume of data – instruments are generating more and more data – and the volume is growing exponentially. It must be remembered that this isn't just a problem for big science – small datasets need to be managed too, although there the problem is more to do with the variety of data (heterogeneous sources) than volume. It was argued that big science has always had the problem of too much data and has had to plan experiments to deal with it, e.g. the LHC at CERN disposes of a large percentage of the data collected during experiments. In some areas, e.g. geospatial, data standards have emerged, but it may be a while before other areas develop their own or existing standards become de facto standards.
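
As a concrete illustration of that disposal step, high-energy physics uses "trigger" systems that keep only a tiny fraction of events at source. Here's a minimal sketch of the idea in Python - the event fields and energy threshold are invented for illustration, not the LHC's actual trigger logic:

import random

# Hypothetical event filter, loosely inspired by the LHC trigger idea:
# most raw events are discarded at source, and only "interesting" ones
# (here, above an arbitrary energy threshold) are kept for curation.

ENERGY_THRESHOLD = 95.0  # invented cut-off, purely for illustration

def generate_events(n):
    """Simulate n raw detector events with a random 'energy' reading."""
    for i in range(n):
        yield {"id": i, "energy": random.uniform(0.0, 100.0)}

def trigger(events, threshold=ENERGY_THRESHOLD):
    """Keep only events at or above the threshold; drop the rest."""
    return [e for e in events if e["energy"] >= threshold]

if __name__ == "__main__":
    raw = list(generate_events(100_000))
    kept = trigger(raw)
    print(f"kept {len(kept)} of {len(raw)} events "
          f"({100 * len(kept) / len(raw):.1f}%)")

With this toy cut-off, roughly 5% of events survive - the point being that what to keep is decided before the data ever reaches a repository.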

Other areas touched on included:
  • the role of the academic and research library
  • roles and responsibilities for data curation
  • how can we anticipate which data will be useful in the future?
  • What is ‘just the right amount of effort’?
  • What are the selection criteria – what value might this data have in the future (who owns it, who’s going to pay for it), and how much effort and money would it take to regenerate this data (e.g. do you have the equipment and skills to replicate it)?
  • not all disciplines are the same therefore one size doesn't fit all
  • what should be kept? data, methodology, workflow, protocol, background info on researcher? How much context is needed?
  • how much of this context metadata can be sourced directly e.g. from proposal?
  • issues of ownership determine what is stored and how
  • what is the purpose of retaining data - reuse or long-term storage? Should a nearline/offline storage model be used? Infrastructure for reuse may be different from that for long-term storage?
  • Should we be supporting publication of open notebook science? (and publishing of failed experiments). What about reuse/sharing if there are commercial gains?
The summing up at the end identified 4 main priority areas for JISC:
  1. within a research environment – can we facilitate data curation using the carrot of sharing systems? (IT systems in the lab)
  2. additional context beyond the metadata
  3. how do we help institutions understand their infrastructural needs?
  4. what has to happen with the various dataset systems (Fedora etc.) to help them link with library and institutional systems?

Tuesday, 10 June 2008

Data librarians

Interesting article in CILIP Update:
http://www.cilip.org.uk/publications/updatemagazine/archive/archive2008/june/Interview+with+Macdonald+and+Martinez-Uribe.htm
which quotes:
"‘Recent research carried out by the Australian Department of Education, Science and Training3 has indicated that the amount of data generated in the next five years will surpass the volume of data ever created, and in a recent IDC White Paper4 it was reported that, between 2006 and 2010, the information added annually to the digital universe will increase more than six fold from 161 exabytes to 988 exabytes.’ "

Wednesday, 14 May 2008

Provenance theme at NeSC

A nice intro article to the new theme on provenance in the latest NeSC newsletter...
http://www.nesc.ac.uk/news/newsletter/May08.pdf
Also a helpful report from the "Marriage of Mercury and Philology" event, including a summary of the CLELIA project, which is looking at how to mark up and structure manuscripts to include all components of the text.

Wednesday, 16 April 2008

JISC conference

Yesterday, the annual JISC conference took place in Birmingham - as usual, a very busy day and although I caught up with lots of people, I still managed to miss some of the people I was hoping to catch up with.

3 of my projects gave demos - 3DVisA, NaCTeM and ASSERT - and it was great to see the interest from the people attending. I went along to two parallel sessions: one on the Strategic eContent Alliance and one on rapid community building. Here are my notes from both...

The Strategic eContent Alliance aims to build a common information environment and a UK Content Framework, and to gather case studies and exemplars. The UK Content Framework will be launched in March 2009 and will incorporate:
  • standards and good practice
  • advice, support, embedding
  • policy, procedures
  • service convergence modeling
  • audit and register
  • audience analysis and modeling
  • exchange (interoperability) model development
  • business models and sustainability strategies
There are a number of change agents to achieve the vision of the SCA...
  • common licensing platforms
  • common middleware
  • digital repositories
  • digitisation
  • devolved administrations
  • service convergence
  • UK government policy review
  • funding

Globally, there are other incentives e.g.
  • service oriented architecture
  • EU initiatives
  • Google and Microsoft initiatives
  • Open Content Alliance etc
The SCA has also engaged an IPR consultancy and Naomi Korn gave a brief overview of the issues of working in such a content-rich world. Naomi pointed out that it has never been easier to access content and referred to a number of key developments and standards to be aware of:
  • Science Commons
  • Digital Libraries i2010
  • PLUS
  • ACAP
  • SPECTRUM (collections management)
  • JISC registry of electronic licences
  • Open Access Licensing initiatives
Simon Delafond from the BBC talked about the Memoryshare project, which enables user-generated content to be recorded against a timeframe to create a national living archive. They plan to build on this project with the SCA to create Centuryshare, which will aggregate content and augment it with user-generated content - this will be a proof-of-concept project due to deliver in March 2009.
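
To make the "content recorded against a timeframe" idea concrete, here's a minimal sketch of how such a timeline might be modelled - the field names and example entries are invented, not the BBC's actual implementation:

from dataclasses import dataclass
from datetime import date

# Toy sketch of the Memoryshare idea: user-generated memories recorded
# against a date so they can be browsed as a shared timeline.
# All fields and examples are invented for illustration.

@dataclass
class Memory:
    author: str
    when: date    # the date the memory refers to
    text: str

archive = [
    Memory("alice", date(1969, 7, 20), "Watched the Moon landing on a neighbour's TV"),
    Memory("bob", date(1966, 7, 30), "England winning the World Cup"),
]

def timeline(memories, start, end):
    """Return memories falling within [start, end], oldest first."""
    hits = [m for m in memories if start <= m.when <= end]
    return sorted(hits, key=lambda m: m.when)

for m in timeline(archive, date(1960, 1, 1), date(1970, 1, 1)):
    print(m.when.isoformat(), m.author, "-", m.text)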

Meredith Quinn talked about the recent Ithaka report on sustainability. The paper tackles some of the cultural issues to be resolved to create the right environment for sustainability. Meredith outlined the 4 key lessons from this work:
  1. rapid cycles of innovation are needed - i.e. don't be afraid to try new ideas and to drop ideas which aren't working
  2. seek economies of scale - e.g. Time Inc required all their magazines to use the same platform - not such an easy task given the distributed nature of HE, but maybe this is where shared services come in
  3. understand your unique value to your user
  4. implement layered revenue streams
The rapid community building workshop focused on the Users and Innovations programme and the Emerge community, which has been set up to support the programme. Given the nature of the Web 2.0 and next-generation technologies this programme is dealing with, it was decided early on to adopt an agile and community-led approach. It was important to avoid imposing an understanding on the community and instead to build a shared understanding across it. So 80 institutions were brought together (some 200 individuals) face to face to start building a community of practice - from there, the community developed further in an online environment set up using Elgg.

The programme shared the success factors for community building:
  • bounded openness
  • heterogeneous homophily
  • mutable stability
  • sustainable development
  • adaptable model
  • structured freedom
  • multimodal identity
  • shared personal repertoires
  • serious fun
some of which are oxymorons! This is explained a little more at https://e-framework.usq.edu.au/users/wiki/UserCentredDevelopment. The approach is based on "appreciative enquiry", coined by Cooperrider and Srivastva in 1987.

It was interesting to hear their thoughts on benefits realisation which focuses on 3 strands:
  • synthesis (of learning etc)
  • capacity building
  • increased uptake
The programme is also planning to create an Emerge Bazaar where projects can "share their wares" and offer services. This will also promote a kind of IdeasForge to encourage new activities which might lead to new funded projects. The Emerge Online conference is next week from 23 to 25 April.

As for the keynote sessions, the key points from Lord Puttnam's speech were that we shouldn't try to solve problems with the same kind of thinking that caused them, and that we are only scratching the surface of what we can achieve with technologies - we should therefore be more ambitious and keep innovation high on the agenda.

It was good to hear Ron Cooke highlight the data problem: "...my nightmare is the “challenge of super-abundant data” - not just its life cycle, but its superfluity with the new, unprecedented increases of data through Web 2.0 and user-generated content, including academic publishing in real time, blogging without control, and the quality and reliability of data. I am also concerned about the demands of skills it places on us - critical assessment is needed to deal with this data."

I missed Angela Beesley from Wikia but am pleased to see someone has summarised the talk http://librariesofthefuture.jiscinvolve.org/2008/04/15/jisc-conference-closing-keynote-speech-angela-beesley/ :-)

The SCA team have blogged the conference (far better than I have!) which you can read at http://sca.jiscinvolve.org/2008/04/15/.

The conference also saw the launch of the Libraries of the Future campaign (http://www.jisc.ac.uk/whatwedo/campaigns/librariesofthefuture.aspx).

Friday, 28 March 2008

Projects addressing issues around research data

Yesterday, we had a meeting here at JISC to bring together current projects working in the field of research data. There's a lot happening, and it's going to be really interesting to see what comes out of these studies.

I already mentioned (http://ali-stuff.blogspot.com/2008/02/jisc-and-research-data.html) an article earlier this year in Inform. Of course, much of the work stems from Liz Lyon's report from last year, Dealing with Data (see earlier post at http://ali-stuff.blogspot.com/2007/11/data-sharing.html).

Tuesday, 18 March 2008

DCC Curation Lifecycle Model

This recently went to consultation - I'm not sure when the results of the consultation come out or how much the model will change as a result. But in the meantime, I want to keep track of the links:

model: http://www.dcc.ac.uk/events/dcc-2007/posters/DCC_Curation_Lifecycle_Model.pdf
background info: http://www.ijdc.net/ijdc/article/view/45/52

Thursday, 13 March 2008

UKDA-Store

"The UK Data Archive (UKDA) is launching UKDA-store, a new research output management tool, later this year. Used to submit data deposits into the UK Data Archive, UKDA-store is to be initially released to the social science research community with the intention of extending the system to other researchers. UKDA-store will enable researchers to submit a range of digital outputs to the self-archiving repository with the right to set permissions for individual and group access, so that data can remain private (on embargo) although metadata continues to be searchable. Furthermore, data that is judged to meet the UKDA’s acquisition criteria can be formally lodged for long-term central system preservation within the UK Data Archive. [...]
UKDA-store will be formally launched at the National Centre for Research Methods Festival on 30 June 2008 in Oxford."
http://www.jisc.ac.uk/news/stories/2008/02/ukdastore.aspx
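
The permissions model described there - data kept private under embargo while its metadata remains searchable - is worth sketching. Here's a minimal illustration in Python, with invented field names rather than the actual UKDA-store data model:

from dataclasses import dataclass
from datetime import date

# Toy sketch of an embargo-aware deposit: metadata is always searchable,
# but the data itself is only released once the embargo lapses.
# Field names are invented; this is not the real UKDA-store model.

@dataclass
class Deposit:
    title: str
    keywords: list
    data_path: str
    embargo_until: date = None  # None means openly available

    def metadata(self):
        """Metadata stays visible to search even under embargo."""
        return {"title": self.title, "keywords": self.keywords}

    def fetch_data(self, today=None):
        """Return the data location, unless the embargo is still in force."""
        today = today or date.today()
        if self.embargo_until and today < self.embargo_until:
            raise PermissionError("data under embargo until "
                                  f"{self.embargo_until.isoformat()}")
        return self.data_path

d = Deposit("Survey of X", ["social science"], "/store/x.csv",
            embargo_until=date(2009, 6, 30))
print(d.metadata())   # searchable regardless of embargo
# d.fetch_data()      # would raise PermissionError before the embargo date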

Wednesday, 20 February 2008

JISC and research data

Research data gets a mention in latest JISC Inform:
http://www.jisc.ac.uk/publications/publications/inform20.aspx#Turningthetide

A brief summary of recently commissioned work - hopefully to be followed up in more detail when the various studies start reporting...

Tuesday, 19 February 2008

News from RIN

News from RIN this month includes:

Monday, 18 February 2008

NeSC news Jan/Feb 08

Some interesting stories in the latest NeSC News (http://www.nesc.ac.uk/news/newsletter/January08.pdf):
  • an overview of trust and security (to introduce the NeSC theme) which gives a good plain English account of the key issues
  • an article on SciSpace (http://scispace.niees.group.cam.ac.uk/) a social networking application developed in Cambridge
  • a news item on the Protocol for Implementing Open Access Data

Good to see the viznet Showcase and JISC Conference getting a mention too ;-)

Open chemistry

I picked this up a few weeks ago but what with all that's been going on so far this year, it's taken me till now to take a look:
http://www.rsc.org/chemistryworld/News/2008/January/29010803.asp
"Most chemical information on the web is published in closed journals and databases which guarantee high quality but also require a subscription to view. Pre-print servers, collaborative documents, open databases, video sites, online lab notebooks and blogs provide other ways of communicating research. Combining the lot offers the enticing prospect of a vast, free-to-access repository. This could transform the sharing of scientific research if the disparate data sources were machine-readable, so that a search engine could automatically gather data about a particular molecule from a crystal structure, a movie, an online lab book, and an archived article, for example. "

The project will be using standards developed by the Open Archives Initiative Object Reuse and Exchange Project (OAI-ORE) - model protocols are expected to launch next month.
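
OAI-ORE's core notion is a "resource map" describing an aggregation of related web resources, which fits the scenario quoted above. As a rough sketch of what aggregating the scattered sources about one molecule might look like - the URIs, molecule and class names are all invented, and this models the general idea rather than any official ORE serialisation:

from dataclasses import dataclass, field

# A toy model of an OAI-ORE-style aggregation: a resource map that
# groups the scattered, machine-readable sources about one molecule.
# Everything here is invented purely for illustration.

@dataclass
class AggregatedResource:
    uri: str
    media_type: str
    description: str

@dataclass
class ResourceMap:
    uri: str                # identifies the resource map itself
    aggregation_uri: str    # identifies the aggregation it describes
    resources: list = field(default_factory=list)

    def add(self, uri, media_type, description):
        self.resources.append(AggregatedResource(uri, media_type, description))

caffeine = ResourceMap(
    uri="http://example.org/rem/caffeine",
    aggregation_uri="http://example.org/agg/caffeine",
)
caffeine.add("http://example.org/xtal/caffeine.cif",
             "chemical/x-cif", "crystal structure")
caffeine.add("http://example.org/video/caffeine-synthesis.mp4",
             "video/mp4", "synthesis movie")
caffeine.add("http://example.org/labbook/entry-42",
             "text/html", "online lab notebook entry")
caffeine.add("http://example.org/arxiv/0801.1234",
             "application/pdf", "archived article")

for r in caffeine.resources:
    print(r.media_type, "->", r.description)

A search engine that understood such maps could follow the aggregation to gather the crystal structure, movie, lab book and article in one pass - exactly the prospect the RSC piece describes.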

Tuesday, 29 January 2008

RIN issues stewardship principles

RIN has issued the Principles on Research Data Stewardship following the consultation which took place last year
http://www.rin.ac.uk/new-data-stewardship.

The principles address 5 areas:
  • roles and responsibilities
  • standards and quality assurance
  • access, usage and credit
  • benefits and cost-effectiveness
  • preservation and sustainability

Sunday, 20 January 2008

Google to offer data storage

"Google to Host Terabytes of Open-Source Science Data"
http://blog.wired.com/wiredscience/2008/01/google-to-provi.html

Monday, 14 January 2008

Cloud and grid

Bill St Arnaud pointed to this useful blog by Ian Foster:
http://ianfoster.typepad.com/blog/2008/01/theres-grid-in.html

It gives a good overview of some of the issues and adds to the debate about how new cloud computing actually is.

Open access: various news

European Research Council Guidelines for Open Access, Dec 07:
http://erc.europa.eu/pdf/ScC_Guidelines_Open_Access_revised_Dec07_FINAL.pdf

Science Commons announces the Protocol for Implementing Open Access Data
http://creativecommons.org/weblog/entry/7917

Information World Review, Dec 07: "Bush bombs open access plans in war on spending" covers Bush vetoing the bill to require research funded by the NIH to be made available on open access sites within 12 months of first publication. Work is now underway to override the veto, and if that doesn't work, it's thought some of the bill will still go through.

Information World Review, Jan 08: "BioMed adds DSpace internet distribution to Open Repository" covers the new open source features added to Open Repository to "make it easier for customers to browse and submit material to the hosted repository solution online".

Monday, 7 January 2008

Open science: implications for librarians

A recent (Nov 07) talk by Liz Lyon to the Assoc of Research Libraries gives a great overview of open science and how communication and collaborative tools are changing ways of working. The talk is directed at librarians, and encourages them to ask some difficult questions about how research data is being managed in their own institutions.

I like the take home messages:
  • Open science is driving transformational change in research practice: now
  • Curating open data requires strong Faculty links and multi-disciplinary teams: Library + IT + Faculty
  • Recognise and respect disciplinary differences: get to know the data centre people, new partnerships
  • Libraries have a lot to offer: build on your repository experience
  • Data underpins intellectual ideas: we must curate for the future