Archimedes' Lever: social database

This started out as a conversation on Twitter with @cdegger, @ehrenfoss, @p2173 and other folks following the #opendata, #socent or #10swf hash tags. Twitter is (in)famous for being limited to 140 characters per “tweet”; with the extra hash tags and all, that's reduced to 96. I wrote a reply and used a text editor to break it into "tweets"; by the time I got to “(part 8/nn),” I knew it was crazy to try and tweet an intelligible response.

So, folks, here's what I think; I hope it's more intelligible this way. Comments appreciated, here or on Twitter.

What I see #opendata doing for #socent is allowing individuals or groups to build on and share information about opportunities, needs, donors, etc. This collaboration would use open data formats and tools that iteratively improve the effectiveness of philanthropy.

Think of how a wiki enables text collaboration, shared learning and discovery, or how instant messaging allows both realtime and time-shifted conversation. Now expand that idea to a sort of "social database" that can be run like a less elitist Wikipedia mated with an RSS feed. Anybody can browse or search the information in the database (needs and offers). They can also get their own local copies of some/all data and have it be updated from the "upstream" source automatically. A smaller group of vetted people can update the "master" data which then gets pushed out to all viewers or subscribers.
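To make the "master plus subscribers" idea a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the class names `MasterDatabase` and `LocalCopy`, the append-only change log, and the revision counter are hypothetical, not part of any existing standard or tool.

```python
from dataclasses import dataclass, field


@dataclass
class MasterDatabase:
    """Hypothetical curated 'master' copy: anyone may read, only vetted editors may write."""
    vetted_editors: set
    changes: list = field(default_factory=list)  # append-only log of (record_id, record)

    def publish(self, editor: str, record_id: str, record: dict) -> int:
        # Curation step: reject updates from anyone outside the vetted group.
        if editor not in self.vetted_editors:
            raise PermissionError(f"{editor} is not a vetted editor")
        self.changes.append((record_id, record))
        return len(self.changes)  # the new revision number

    def changes_since(self, revision: int) -> list:
        # Everything a subscriber at `revision` has not seen yet.
        return self.changes[revision:]


@dataclass
class LocalCopy:
    """A subscriber's local copy, refreshed from the 'upstream' master."""
    records: dict = field(default_factory=dict)
    revision: int = 0

    def sync(self, upstream: MasterDatabase) -> int:
        new_changes = upstream.changes_since(self.revision)
        for record_id, record in new_changes:
            self.records[record_id] = record  # later changes overwrite earlier ones
        self.revision += len(new_changes)
        return len(new_changes)


master = MasterDatabase(vetted_editors={"alice"})
master.publish("alice", "need-1", {"kind": "need", "summary": "School books"})
mirror = LocalCopy()
applied = mirror.sync(master)  # pulls the one change published so far
```

A real system would push or poll over the network (the RSS comparison above); the sketch only shows the division of roles between curators and subscribers.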

This curation is needed to maintain data integrity and to (hopefully) eliminate attacks on or disruptions to the "social database" itself. The sad reality is that any public information on the Internet must have some sort of protection, or it will be damaged or destroyed. I see this as being especially true of a pioneering social-entrepreneurial system like this; there are enough people out there who have a vested interest in this sort of thing not working that security (authentication and validation) must be built in from the start. Otherwise, we will wind up with a situation akin to "spam" and "phishing" with email. Email standards were set in the early days, when the Internet was a primarily academic/scientific resource where all users could legitimately trust each other by default; the current state of the Net is far different. Any open data standards and protocols developed for the "social database" must take this into account.

These open data and protocol standards should be designed with the understanding that they are likely to change over time as the needs of users become better defined and as new opportunities to support those needs present themselves. The first version of a new system (like this) is almost never the simplest, nor will it be the most effective for its purpose. Lessons will be learned that should be folded back into revisions of the standards, in much the same way that later versions of standards like HTML built upon experience gained with earlier versions.

When evolving these data formats and protocols, it is vital that the process be fully transparent, with a balance between building progress and listening to the needs and concerns of the widest possible audience. It is entirely possible that no one standard in a given area will suit all stakeholders. In those instances, some sort of federation built on interchange of some common subset or intermediate format may be helpful. This should be seen as undesirable, however, as it limits the ability of casual or new users to make effective use of the entire distributed system.

The development, maintenance and ownership of standards developed for this project (including necessary legal protection such as copyrights) must be under the auspices of an organization with the visibility and stature to maintain control of the standards, lest they devolve into a balkanized mess that would be as unhelpful to the intended mission as not having any such standards at all. I would expect this organization to be a non-profit. Not only would this relieve the organization's officers of the fiduciary responsibility for monetizing investments made in the technology, but other non-profits/NGOs can be expected to cooperate more fully with the parent organization in developing, deploying and maintaining the standards, allowing them to remain open and unencumbered.

Finally, I think it's important to state that I don't see any one type of format as necessarily superior for developing this. I'm aware that there has been a lot of work done with various XML-based systems as part of the #socent build-out to date. After working with XML for approximately ten years now, I have seen some magnificent things done with it, and some absolutely misbegotten things done with it. I can think of several issues, particularly with regard to the authentication and validation concerns I mentioned earlier, and also with the sheer bulk and relative inefficiency of a large-scale XML data store. They're solvable, and they're by no means unique to XML, but they are issues that need to be thought about.

EDIT Sunday 18 April: I feel really strongly that one of the things our (distributed) information architecture is going to have to nail from the very beginning is the idea of authentication/verification: does a particular bit of information (what I'd been calling "opportunities, needs, [and] donors" earlier) really come from who it claims to come from? Otherwise we're just laying ourselves open to various black-hat cracking attacks as well as scams, for instance of the "Nigerian 419" variety. I think it's pretty obvious we're going to need some sort of vetting for posters and participants; this in turn implies some (loose or otherwise) organization with a necessary minimum amount of administrative/curatorial overhead to maintain public credibility and apparent control over our own resources.

Anybody would be allowed to read basic information, but I think we can all agree on the need for some sort of access control and/or obfuscation of data like individual contact information or some types of legal/financial information that gets tacked onto the "social database."

This could be pretty straightforward. One hypothetical approach might be as basic as having those who wish to publish information go through a simple registration process that issues them some piece of private data (possibly even the open standard OAuth authentication that Twitter and others use for applications hooking into their infrastructure). Either alternatively or in conjunction, a public-key cryptography system such as GNU Privacy Guard could be used to prove that data came from who it claims to. For instance, the data to be published could be enclosed in a ZIP file or other archive, along with a "signature" and the identification of a registered publisher. (There's no way that I'm aware of to actually embed the 'signature' itself into the data file: the signature depends on the exact content of the data, and by adding the signature, the content is changed.)
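As a rough sketch of the archive idea, the Python below packs the data, a detached signature and the publisher's ID into a ZIP file, then checks the signature on the receiving side. Two loud assumptions: the file names (`data.json`, `signature.txt`, `manifest.json`) and function names are made up for illustration, and the "signature" is an HMAC keyed with a secret issued at registration, standing in for a real public-key signature. A GnuPG-based system would sign with a private key and verify with the public one instead.

```python
import hashlib
import hmac
import io
import json
import zipfile


def make_bundle(publisher_id: str, secret: bytes, data: bytes) -> bytes:
    """Pack data, a detached signature and the publisher ID into a ZIP archive."""
    # ASSUMPTION: HMAC-SHA256 with a shared secret stands in for a
    # public-key signature such as one produced by GnuPG.
    signature = hmac.new(secret, data, hashlib.sha256).hexdigest()
    manifest = json.dumps({"publisher": publisher_id, "algorithm": "HMAC-SHA256"})
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("data.json", data)            # the payload, untouched
        archive.writestr("signature.txt", signature)   # kept separate from the payload
        archive.writestr("manifest.json", manifest)    # who claims to have published it
    return buffer.getvalue()


def verify_bundle(bundle: bytes, registered_secrets: dict) -> bytes:
    """Return the data if the signature matches the registered publisher, else raise."""
    with zipfile.ZipFile(io.BytesIO(bundle)) as archive:
        data = archive.read("data.json")
        signature = archive.read("signature.txt").decode("ascii")
        publisher = json.loads(archive.read("manifest.json"))["publisher"]
    secret = registered_secrets.get(publisher)
    if secret is None:
        raise ValueError(f"unknown publisher: {publisher}")
    expected = hmac.new(secret, data, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how much of the signature matched.
    if not hmac.compare_digest(expected, signature):
        raise ValueError("signature does not match data")
    return data


bundle = make_bundle("org-1", b"issued-at-registration", b'{"need": "school books"}')
data = verify_bundle(bundle, {"org-1": b"issued-at-registration"})
```

Keeping the signature in its own file inside the archive means the signed bytes are never modified by the act of signing them.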

To the non-technical user, the effects of such a system should be:

The 'Foundation' (for want of a better term) can use already-proven, open standards to enhance member and public confidence in the accuracy and transparency of the content of the "social database". Attempting to "reinvent the wheel" in this area is an oft-proven Bad Idea™;
The Foundation will be able to develop, deploy, manage and maintain a (potentially) widely-distributed philanthropic/social database architecture that can support a wide variety of organizational/use models;
Having this sort of authentication and validation will aid in the evolution of the technical and "business" architectures of the system; new services can be layered on top of existing ones by different users as needed.

For instance, if a particular member has announced that they will publish information in a specific version of the schema for the "social database" (say, during registration), any later information purportedly from that member published in an older format should raise warning flags, as it may be a sign of actual or attempted compromise in security and data integrity. A benign incident, such as the member inadvertently using a software tool that submits data in an inappropriate format, can be quickly identified, communicated and rectified.
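A check like that could be as small as the following Python sketch. The dotted version strings, the `declared_versions` registry and the status labels are all hypothetical; the only rule taken from the paragraph above is that a submission in an older format than the one the member declared should be flagged.

```python
def parse_version(text: str) -> tuple:
    """Turn a dotted version such as '1.10' into (1, 10) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def check_submission(member_id: str, submitted_version: str, declared_versions: dict) -> str:
    """Compare a submission's schema version with the one the member declared at registration."""
    declared = declared_versions.get(member_id)
    if declared is None:
        return "unknown-member"
    if parse_version(submitted_version) < parse_version(declared):
        # Could be a compromise attempt, or just an out-of-date tool; either way, look into it.
        return "flag-older-format"
    # ASSUMPTION: equal or newer versions pass; the policy for newer ones is left open here.
    return "ok"


declared_versions = {"org-1": "1.10"}  # hypothetical registry built during registration
status = check_submission("org-1", "1.2", declared_versions)
```

Comparing parsed tuples rather than raw strings matters: as plain text, "1.2" would sort after "1.10".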

This will be vital if we are to create a data system that publicly distributes data pertinent to specific members but outside those members' control. That data could include information that must not be altered by outside influences (such as, say, budget information) or information that, for general Internet-security reasons, should not be directly visible to all and sundry (for instance, contact information might be accessible to members but not to the casual browsing public).
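The "visible to members but not to the casual browsing public" part could be sketched as a per-field visibility table, as in the Python below. The field names and the two visibility levels are assumptions for illustration only; a real schema would define its own.

```python
PUBLIC, MEMBERS = "public", "members"

# Hypothetical visibility table: which fields anyone may see, and which only members may see.
FIELD_VISIBILITY = {
    "summary": PUBLIC,
    "budget": PUBLIC,          # readable by all; protected from alteration by other means
    "contact_email": MEMBERS,  # hidden from the casual browsing public
}


def view_record(record: dict, viewer_is_member: bool) -> dict:
    """Return only the fields this viewer is allowed to see."""
    visible = {}
    for name, value in record.items():
        # Fields missing from the table default to the restricted level.
        level = FIELD_VISIBILITY.get(name, MEMBERS)
        if level == PUBLIC or viewer_is_member:
            visible[name] = value
    return visible


record = {"summary": "School books", "budget": 5000, "contact_email": "info@example.org"}
public_view = view_record(record, viewer_is_member=False)
member_view = view_record(record, viewer_is_member=True)
```

Defaulting unknown fields to the restricted level means a newly added field stays hidden until somebody decides it is safe to publish.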


Friday, 16 April 2010

A Slight Detour: Musing on Open Data Standards as applied to Social Entrepreneurship and Philanthropy
