Let’s build a “Debian for Development Data”

I just returned from an intense week in the UK: an IKM Emergent workshop in Oxford and the Open Government Data Camp in London had me almost drowning in “open data” examples and conversations, with a particular angle on aid data and the perspectives of international development.

As a result, I think we’re ready for a “Debian for Development Data”: a collection of data sets, applications and documentation to serve community development, curated by a network of people and organisations who share crucial values on democratisation of information and empowerment of people.

“Open data” is mainstream newspaper content now

In mid-2009, after the 1%EVENT, a couple of innovative Dutch platforms came together to explore the opportunities of opening up: wouldn’t it be great if someone in an underdeveloped community had access to our combined set of services and information?

We had a hard time escaping new jargon (federated social networks, data portability, privacy commons, linked open data, the semantic web) and sketching what it would look like in five years. But then again, suppose it was five years earlier: in mid-2004, no one could have predicted what YouTube, Facebook and Twitter would look like today, even though many of us already felt the ground shaking.

  • The technical web was embracing the social web of human connections.

  • The social web pushed “literacy”: people wanted to participate and they learned how to do that.

A year and a half later, “open data” is catching up with us, and going through a similar evolution. Governments and institutions have started to release data sets (the Dutch government will too; on Friday, the UK released data on all spending over £25,000). So when will the social dimension be embraced in open data?

A week of open data for development

At an IKM Emergent workshop in Oxford, on Monday and Tuesday, around 25 people came together to talk about the impact of open data on international development cooperation. We discussed when we would consider “linked open data” a success for development. One key aspect was: getting more stakeholders involved.

Then at Open Government Data Camp (#OGDCamp) in London, on Thursday and Friday, around 250 people worked in sessions on all kinds of aspects of open data. Several speakers called for a stronger social component: both in the community of open data evangelists and in reaching out to those for whom we think open data will provide new opportunities for development.

At IKM, Pete Cranston described how his perception of access to information changed when a person approached him in a telecentre to ask how the price of silk had changed on the international market: he was a union representative, negotiating with a company that wanted to cut worker salaries because of a decline in the market price. Without access to the internet or the skills to use it, you don’t have the same confidence we have that such a question can be answered at all.

Then at OGDCamp, David Eaves reminded us that libraries were (partly) built before the majority of the population knew how to read, as an essential part of the infrastructure to promote literacy and culture [1].

Telecentres fulfil a role in underdeveloped communities as modern-day libraries, providing both access to information and communication tools via the internet and the skills to use them.

But we don’t have “open data libraries” or an infrastructure to promote “open data literacy” yet.

How open source software did it

It shouldn’t be necessary for people to become data managers just to benefit from open data sets. Intermediaries can develop applications and services to answer the needs of specific target groups based on linked open data, much as librarians help make information findable and accessible.

There are also parallels with open source software. Not every user needs to become a developer in order to use it. Although it is sometimes still tempting to think otherwise, the open source movement has managed to provide easier interfaces to the collective work of developers.

The open data movement can identify a few next steps by looking at how the open source movement evolved.

Open source: Software packages (operating systems, word processors, graphics editors, and so on) are developed independently. Each project can choose its own programming language, development tools, standards and best practices.

Open data: Data sets (budget overviews, maps, incident reports) are produced independently as well. The data formats and delivery methods can be chosen freely, and there are various emerging standards and best practices.

Open source: Communities around software packages usually set up mailing lists, chat channels and bug trackers for developers and users to inform each other about new releases, problems, and the roadmap for new versions. The mantra is “many eyes make all bugs shallow”: let more people study the behaviour or the code of software, and errors and mistakes will be found and repaired more easily.

Open data: Data sets are mostly just published. As Tim Davies noted in one of the conversations, there don’t seem to be mailing lists or release notes around data sets yet. To deliver on the promise of the “wisdom of the crowds”, users of data sets should have more and better ways to provide feedback and report errors.

Open source: Open source software is mostly used via distributions like Debian, Red Hat and Ubuntu, which separates producers from integrators. A distribution is a set of software packages, compiled and integrated in a way that makes them work well together, thereby lowering the barrier to entry for using the software. Distributions each have a different focus (free software, enterprise support, user-friendliness) and thus make different choices on quality, completeness, and interfaces.

Open data: Perhaps the current data sets released by governments could be considered “distributions”, although the producer (a department) and the integrator (the portal manager) usually work for the same institution. CKAN.net could be considered a distribution as well, although it does not (yet?) make clear choices on the type and the quality of data sets it accepts.

Open source: Software distributions make it possible to pool resources to make software interoperable, set up large-scale infrastructure, and streamline collaboration between “upstream” and “downstream”. The open character stimulates an ecosystem where volunteers and businesses can work together, which is essential for creating new business models.

Towards a “Debian for Development Data”

To sum up several concerns around open data for development:

  • Open data is currently advocated mainly by developers and policy makers, without strong involvement of other stakeholders (most notably: those we would like to benefit in underdeveloped communities). It tends to be driven mostly by web technology and is mostly focused on transparency of spending. It does not take into account the (political) choices behind why activities were chosen, and it records little about the results.

  • Data sets and ontologies are hard to find and not very well linked, with few generic applications working across data sets and few examples of good use of multiple data sets. Once you want to make data sets available, it is hard to promote the use of your data, provide feedback loops for improvements, administer dependencies, and keep track of what was changed along the way and why.

  • There are hardly any structural social components around current open data sets, repositories and registries.

So why don’t we start a “Debian for Development Data”?

  • A Social Contract and Open Data Guidelines like those for Debian can capture essential norms and values shared by community members, and inform decisions to be made. The contract can for instance value “actionable opportunities” over financial accountability. The Agile Manifesto is another example to draw from.

  • The community should set up basic communication facilities such as a mailing list, website, and issue tracker, to ease participation. Decision-making is essentially based on meritocracy: active participants choose who has the final say or how to reach consensus.

  • The data sets should be accompanied by software and documentation, to take away the problem of integration for most end users. Each data set and tool should have at least one “maintainer”, who keeps an eye on updates and quality, and is the liaison for “upstream” data set publishers, offering a feedback loop from end-users to producers.

  • The CKAN software (powering the CKAN.net website mentioned before) draws on the lessons from distributions like Debian for its mechanisms to keep track of dependencies between data sets, and has version control, providing some support to track changes.

  • Debian and Ubuntu divide packages into categories like “main”, “non-free” and “restricted” to deal with license issues, and to express the commitment of the community towards maintaining quality.
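To make the list above a little more concrete, here is a minimal sketch in Python (3.9 or later) of what a “data package” record could look like if it borrowed a few ideas from a distribution's package metadata: a maintainer, a license-related category, and dependencies on other data sets. Everything in it is an assumption for illustration: the `DataPackage` fields, the `resolve_order` helper, the data set names and the e-mail addresses are all hypothetical, and this is not how CKAN or Debian actually store their metadata.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter


@dataclass(frozen=True)
class DataPackage:
    """Hypothetical metadata record for one data set in the collection."""
    name: str
    version: str
    maintainer: str                 # liaison between end users and "upstream"
    component: str = "main"         # license category, borrowed from distributions
    depends: tuple[str, ...] = ()   # names of data sets this one builds on


def resolve_order(packages):
    """Return package names ordered so that dependencies come first."""
    index = {p.name: p for p in packages}
    # Refuse to continue if a data set depends on something nobody maintains.
    for p in packages:
        for dep in p.depends:
            if dep not in index:
                raise KeyError(f"{p.name} depends on unknown data set {dep!r}")
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    graph = {p.name: set(p.depends) for p in packages}
    return list(TopologicalSorter(graph).static_order())


# Invented example catalogue: a map application builds on two data sets.
catalogue = [
    DataPackage("aid-map", "0.1", "carol@example.org",
                component="restricted",
                depends=("aid-activities", "country-codes")),
    DataPackage("aid-activities", "0.3", "bob@example.org",
                depends=("country-codes",)),
    DataPackage("country-codes", "1.0", "alice@example.org"),
]

order = resolve_order(catalogue)
print(order)  # ['country-codes', 'aid-activities', 'aid-map']
```

The point of the sketch is the feedback loop rather than the code: every record names a maintainer to talk to, and an explicit dependency list makes it possible to see which applications are affected when an upstream data set changes.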

We stimulate the social component by providing more stakeholders a point of entry to get involved through socio-technical systems. We stimulate literacy by offering the stakeholders ways to get open data, publish their own, experiment with applications, and learn from each other. And we circumvent the tendency towards over-standardisation by doing this in parallel with other initiatives with sometimes overlapping goals and often different agendas.

[1] A quick check on Wikipedia indicates that this seems to have been the case mainly in North America, though.

