Friday 16 April 2010

A Slight Detour: Musing on Open Data Standards as applied to Social Entrepreneurship and Philanthropy

This started out as a conversation on Twitter with @cdegger, @ehrenfoss, @p2173 and other folks following the #opendata, #socent or #10swf hash tags. Twitter is (in)famous for being limited to 140 characters per “tweet”; with the extra hash tags and all, that's reduced to 96. I wrote a reply and used a text editor to break it into "tweets"; by the time I got to “(part 8/nn),” I knew it was crazy to try and tweet an intelligible response.

So, folks, here's what I think; I hope it's more intelligible this way. Comments appreciated, here or on Twitter.


What I see #opendata doing for #socent is to allow individuals or groups to build on and share information on opportunities, needs, donors, etc. This collaboration would use open data formats and tools that iteratively improve philanthropy effectiveness.

Think of how a wiki enables text collaboration, shared learning and discovery, or how instant messaging allows both realtime and time-shifted conversation. Now expand that idea to a sort of "social database" that can be run like a less elitist Wikipedia mated with an RSS feed. Anybody can browse or search the information in the database (needs and offers). They can also get their own local copies of some/all data and have it be updated from the "upstream" source automatically. A smaller group of vetted people can update the "master" data which then gets pushed out to all viewers or subscribers.
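To make the idea a little more concrete, here is a minimal sketch of that read-by-anyone, write-by-curators arrangement. Everything in it (the class names SocialDatabase and Replica, the curator "alice", the sample record) is hypothetical and invented for illustration; it is one possible shape for the idea, not a proposed design.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """A subscriber's local copy of some or all of the data (hypothetical)."""
    records: dict = field(default_factory=dict)

    def receive(self, key, value):
        # Called by the upstream source whenever the master data changes.
        self.records[key] = value


@dataclass
class SocialDatabase:
    """Hypothetical 'master' store: open to read, curated to write."""
    curators: set
    records: dict = field(default_factory=dict)
    subscribers: list = field(default_factory=list)

    def search(self, term):
        # Anybody can browse or search; no credentials are needed.
        return {k: v for k, v in self.records.items()
                if term.lower() in v.lower()}

    def subscribe(self, replica):
        # A new subscriber starts with a copy of the current data
        # and is then registered to receive later updates.
        replica.records.update(self.records)
        self.subscribers.append(replica)

    def update(self, user, key, value):
        # Only the smaller group of vetted people may change the master data.
        if user not in self.curators:
            raise PermissionError(f"{user} is not a vetted curator")
        self.records[key] = value
        # Push the change out to every subscriber's local copy.
        for replica in self.subscribers:
            replica.receive(key, value)


db = SocialDatabase(curators={"alice"})
local = Replica()
db.subscribe(local)
db.update("alice", "need-1", "School in Accra needs textbooks")
print(local.records)
```

A real system would push updates over a network (the RSS-like part) rather than by direct method calls, but the split between open reads and vetted writes would look much the same.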

This curation is needed to maintain data integrity and to (hopefully) eliminate attacks on or disruptions to the "social database" itself. The sad reality is that any public information on the Internet must have some sort of protection, or it will be damaged or destroyed. I see this as being especially true of a pioneering social-entrepreneurial system like this; there are enough people out there who have a vested interest in this sort of thing not working that security (authentication and validation) must be built in from the start. Otherwise, we will wind up with a situation akin to "spam" and "phishing" with email. Email standards were set in the early days, when the Internet was a primarily academic/scientific resource where all users could legitimately trust each other by default; the current state of the Net is far different. Any open data standards and protocols developed for the "social database" must take this into account.

These open data and protocol standards should be designed with the understanding that they are likely to change over time as the needs of users become better defined and as new opportunities to support those needs present themselves. The first version of a new system (like this) is almost never the simplest, nor will it be the most effective for its purpose. Lessons will be learned that should be folded back into revisions of the standards, in much the same way that later versions of standards like HTML built upon experience gained with earlier versions.

When evolving these data formats and protocols, it is vital that the process be fully transparent, with a balance between building progress and listening to the needs and concerns of the widest possible audience. It is entirely possible that no one standard in a given area will suit all stakeholders. In those instances, some sort of federation built on interchange of some common subset or intermediate format may be helpful. This should be seen as undesirable, however, as it limits the ability of casual or new users to make effective use of the entire distributed system.

The development, maintenance and ownership of standards developed for this project (including necessary legal protection such as copyrights) must be under the auspices of an organization with the visibility and stature to maintain control of the standards, lest they devolve into a balkanized mess that would be as unhelpful to the intended mission as not having any such standards at all. I would expect this organization to be a non-profit organization. Not only will this remove the fiduciary responsibility for monetizing investments made in the technology from the officers of the organization, but other non-profits/NGOs can be expected to cooperate more fully with the parent organization in developing, deploying and maintaining the standards – allowing them to remain open and unencumbered.

Finally, I think it's important to state that I don't see any one type of format as necessarily superior for developing this. I'm aware that there has been a lot of work done with various XML-based systems as part of the #socent build-out to date. After working with XML for approximately ten years now, I have seen some magnificent things done with it, and some absolutely misbegotten things done with it. I can think of several issues, particularly with regard to the authentication and validation concerns I mentioned earlier, and also with the sheer bulk and relative inefficiency of a large-scale XML data store. They're solvable, and they're by no means unique to XML, but they are issues that need to be thought about.

EDIT Sunday 18 April: I feel really strongly that one of the things our (distributed) information architecture is going to have to nail from the very beginning is the idea of authentication/verification: does a particular bit of information (what I'd been calling "opportunities, needs, [and] donors" earlier) really come from who it claims to come from? Otherwise we're just laying ourselves open to various black-hat cracking attacks as well as scams, for instance of the "Nigerian 419" variety. I think it's pretty obvious we're going to need some sort of vetting for posters and participants; this in turn implies some (loose or otherwise) organization with a necessary minimum amount of administrative/curative overhead to maintain public credibility and apparent control over our own resources. Anybody would be allowed to read basic information, but I think we can all agree on the need for some sort of access control and/or obfuscation of data like individual contact information or some types of legal/financial information that gets tacked into the "social database."

This could be pretty straightforward. One hypothetical approach might be as basic as having those who wish to publish information go through a simple registration process that issues them some piece of private data (possibly even the open standard OAuth authentication that Twitter and others use for applications hooking into their infrastructure). Either alternatively or in conjunction, a public-key cryptography system such as GNU Privacy Guard could be used to prove data came from who it claimed to come from. For instance, the data to be published could be enclosed in a ZIP file or other archive, along with a "signature" and the identification of a registered publisher. (There's no way that I'm aware of to actually embed the 'signature' itself into the data file: the signature depends on the exact content of the data, and by adding the signature, the content is changed.)
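As a rough sketch of the "archive plus signature plus publisher ID" idea, the snippet below bundles the three pieces into a ZIP file and checks them on receipt. Two loud assumptions: the file names (data.json, signature.txt, publisher.txt) and function names are made up for illustration, and the signature is an HMAC keyed with the private data issued at registration, used here as a stand-in for a real public-key signature such as one produced by GNU Privacy Guard. An HMAC only works when the verifier also holds the secret, so it cannot prove origin to third parties the way a public-key signature can.

```python
import hashlib
import hmac
import io
import json
import zipfile


def build_package(publisher_id, secret, data):
    """Bundle data, a detached signature and the publisher's ID into a ZIP."""
    payload = json.dumps(data, sort_keys=True).encode("utf-8")
    # The signature is computed over the exact bytes of the data file,
    # which is why it travels beside the data rather than inside it.
    signature = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("data.json", payload)
        archive.writestr("signature.txt", signature)
        archive.writestr("publisher.txt", publisher_id)
    return buffer.getvalue()


def verify_package(package, registered_secrets):
    """Return the data if the signature matches the registered publisher."""
    with zipfile.ZipFile(io.BytesIO(package)) as archive:
        payload = archive.read("data.json")
        signature = archive.read("signature.txt").decode("ascii")
        publisher_id = archive.read("publisher.txt").decode("utf-8")
    secret = registered_secrets.get(publisher_id)
    if secret is None:
        raise ValueError("unknown publisher")
    expected = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through timing differences.
    if not hmac.compare_digest(expected, signature):
        raise ValueError("signature does not match data")
    return json.loads(payload)
```

A publisher would call build_package with the secret issued at registration; the receiving side would call verify_package with its table of registered publishers and refuse anything that raises an error.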

To the non-technical user, the effects of such a system should be:

  • The 'Foundation' (for want of a better term) can use already-proven, open standards to enhance member and public confidence in the accuracy and transparency of the content of the "social database". Attempting to "reinvent the wheel" in this area is an oft-proven Bad Idea™;

  • The Foundation will be able to develop, deploy, manage and maintain a (potentially) widely-distributed philanthropic/social database architecture that can support a wide variety of organizational/use models;

  • Having this sort of authentication and validation will aid in the evolution of the technical and "business" architectures of the system; new services can be layered on top of existing ones by different users as needed.

For instance, if a particular member has announced that they will publish information in a specific version of the schema for the "social database" (say, during registration), any later information purportedly from that member published in an older format should raise warning flags, as it may be a sign of actual or attempted compromise in security and data integrity. A benign incident, such as the member inadvertently using a software tool that submits data in an inappropriate format, can be quickly identified, communicated and rectified.
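The check described above could be as small as the following sketch. The registry contents, the member ID "org-42" and the (major, minor) version tuples are all assumptions made up for illustration; the original idea only says that an older-than-announced format should raise a flag, so accepting anything that is not older is my own simplification.

```python
# Hypothetical registry: member ID -> schema version announced at registration.
REGISTERED_SCHEMA = {"org-42": (2, 1)}


def check_submission(member_id, submitted_version):
    """Compare a submission's schema version with what the member announced."""
    declared = REGISTERED_SCHEMA.get(member_id)
    if declared is None:
        return "reject: unknown member"
    # Tuples compare element by element, so (2, 0) < (2, 1) < (3, 0).
    if submitted_version < declared:
        return "flag: older format than announced; possible compromise or wrong tool"
    return "accept"
```

Either way the flag is only a prompt for a human to look: the benign case (a member using the wrong software tool) and the hostile case look identical at this level.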

This will be vital if we are to create a data system that publicly distributes data about specific members beyond those members' control. That data could include information that must not be altered by outside influences (such as, say, budget information) or information that, for general Internet-security reasons, should not be directly visible to all and sundry (for instance, contact information might be accessible to members but not the casual browsing public).
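The visibility half of that could be sketched as a simple per-field filter. The field names, the two access levels ("public" and "member") and the sample record are hypothetical; the one deliberate design choice is that a field with no declared visibility is treated as members-only, so forgetting to classify something fails closed rather than open.

```python
# Hypothetical visibility map: field name -> minimum level needed to see it.
VISIBILITY = {
    "name": "public",
    "need": "public",
    "budget": "public",          # visible to all, but must not be altered
    "contact_email": "member",   # hidden from the casual browsing public
}
LEVELS = {"public": 0, "member": 1}


def visible_fields(record, viewer_level):
    """Return only the fields the viewer's access level allows them to see."""
    rank = LEVELS[viewer_level]
    return {
        key: value
        for key, value in record.items()
        # Unlisted fields default to members-only.
        if LEVELS[VISIBILITY.get(key, "member")] <= rank
    }


record = {
    "name": "Example Org",
    "need": "textbooks",
    "budget": 5000,
    "contact_email": "someone@example.org",
    "notes": "unclassified field",
}
print(visible_fields(record, "public"))
```

Protecting the budget figure against alteration is a separate job for the signature check rather than for this filter.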

7 comments:

Unknown said...

Jeff,

Glad to see you are keeping the #opendata #10swf conversation going.

I found your post from a link provided by Christine Egger, of Social Actions, summarizing an emerging conversation prompted by Peter Deitz’s SWF-10 post http://bit.ly/cvW6rj

I thought I would share some brief thoughts that might be of interest.

The schema of Linked Data RDF, underpinnings of the emerging Semantic Web, provide a sound basis for the sorts of format & protocol standards you are calling for.

The vision of the Semantic Web is occasionally dismissed as long on dream and short on practice. However, early implementations are gaining significant traction and offering substantive value to early adopters. Thomson Reuters, no slouch in the business of information, was quite visionary in acquiring the now @OpenCalais service. The mostly free service is currently identifying semantic entities in over 5 million documents submitted to it per day. @zemanta is also finding a significant user base among bloggers for its related services.

I appreciated your noting “curation is needed to maintain data integrity.” A robust, trustworthy and cross-platform OpenID will contribute significantly to machine curation of user contributions to discussions: intelligent systems will track user identities and derive elaborating multi-spectral reputation ranking.

I think there are amazing opportunities in Twitter's new annotation feature.

I am awestruck by the potential of developers applying Linked Data / Semantic Web schema and using RDF in Twitter annotations. It would massively scale search effectiveness and distribution opportunities while allowing sophisticated on-the-fly analysis of Twitter's firehose & other feeds.

The Twitter ecosystem is particularly suited to applying principles of the Semantic Web in that machine interpolated meaning could be continually refined the more humans tweet & retweet about the same and related topics; coupled with author, location, hashes, platform, and temporal information.

Twitter annotations may well be a turning point in the early practical application of Tim Berners-Lee’s 1999 “dream for the Web [in which computers] become capable of analyzing all the data on the Web – the content, links, and transactions between people and computers.”

It is all about the metadata!

Roger
@r_macdonald

Unknown said...

Roger,

Count me underwhelmed rather than amazed by Twitter annotations. For a great explanation of why, read Eric James Woodward's post.

If Twitter had defined some basic structure and conventions for the metadata, it might have gone somewhere. As it is, I expect that particular venture will take a year or two to get its head out of its semantic posterior and develop (likely several mutually incompatible sets of) semi-formal conventions for what could have been evolved right from the get-go. In the meantime, a lot of developers (like EJW and myself) will drift off and do other things done by and with people more clueful.

I have learned, painfully, over the course of over three decades in this Craft, that (with very few exceptions) the amount of marketing hype associated with a product or service is inversely exponentially proportional to its actual technical value. I'm much more "awestruck" by people who get stuff done, preferably with reasonable community input, and then start tooting their horn. Doing the reverse is a historical near-guarantee that little or nothing of value will actually be built.

I've been involved for several years now in the building and use of various microformats, coupled with semantic, accessible, valid HTML markup, to bring enhanced usability to the Web. As you're aware, usability and usefulness of information on the Web delivers far more and far more lasting value to the overall Web-using community than any amount of marketese or hype. I see absolutely nothing, now or on the foreseeable horizon, to disabuse me of that notion.

One of the things that excites me about Junto, besides the people behind it, is that it's not a typical marketing campaign wound around a few PowerPoint decks or a proof-of-concept wireframe that's then huckstered around the 'community' to solicit funding (which invariably goes into more marketing). Rather, the existing alpha is actually being used for its stated purpose, while simultaneously undergoing furious brainstorming and development. That's closer to the ideals and history of the Internet that I first worked with back in the Olden Days when I actually had hair.

I'm glad to make your acquaintance — as well as anybody else plugged into the Social Actions/Junto ideaspace. Christine and Venessa are Good People™ whose expertise and perspectives I trust implicitly. I regret having had so little time to actually participate in the development up to now, and hope that that will change come May.

Too much of the history of the Net has been tied up in large companies/political orgs trying to impose a top-down Vision of the Meaning of Things on what, by design and nature, is an inherently bottom-up, self-organizing social conversation. To get back to the original point, I see that happening with Twitter now; Chirp was a desperate push by the top-down "suits" to gain control over what has become an open, self-organizing means of expression. The fact that it was a "developer conference" dominated by individuals who were not themselves primarily developers tells you all you need to know.

We deserve better. Junto is, I think and hope, part of how we'll get that 'better.'

Unknown said...

Jeff,

I appreciate your thoughtful skepticism on the likelihood of Twitter being open to incorporating principles of Linked Data into their new annotations feature.

I have read Eric James Woodward's post and understand his concerns that the metadata could be arbitrary and thus be next to useless without structure & standardization. He also expressed concerns that Twitter would just steal the fruit of innovative developers and make it their standard.

Some of Woodward’s concerns might be ameliorated if Twitter, its client developers and others were to consider the value of common metadata standards that closely graphed to those emerging from the elaboration of Linked Data ontologies.

Twitter is by no means sold on the value of Linked Data in their annotations and you may not be either. I suspect, for the next couple of weeks, Twitter will be genuinely open to input on what the annotation namespaces might contain, the payload length as well as considerations of the potential utility of incorporating RDF/OWL-type ontologies.

I take to heart that your doubt is derived from decades of frustration at over-hyped marketing about products that bear paltry fruit. Though “awestruck” with the possibilities of early semi-implementations of the Semantic Web through Twitter annotations, I take your counsel to heart in tempering my heated enthusiasm with the cold understanding of the prevalence of corporate greed and marketing departments’ excesses.

I liked your additional comments about the need for authentication/verification. Author reliability and trustworthiness will be key to the value of potential Linked Data implementations within Twitter annotations. Semantic entity & URI spamming, and worse, can be mitigated by the development of more robust OpenID schema and applying rigorous “Web Reputation” system algorithms. For more on OpenID see the work of Chris Messina http://factoryjoe.com/ and colleagues. Randall Farmer http://buildingreputation.com/ is an expert in Web Reputation systems; O’Reilly just published his book on same last month: http://oreilly.com/catalog/9780596159801


PS You might be interested in taking a look at a new map of a Twitter status object from @raffi - tech lead of @twitterapi http://post.ly/bQYz

PPS Like you, I respect what the newly alpha-launched Junto has achieved and their ambitions of empowering collaboration coupled with evolving means of facilitating activation.

Ehren said...

I'm glad I checked back on this discussion and found both of your thoughtful comments.

I've been doing web and software stuff for less than a third of 30 years, and I think my main point of wisdom is that the innovations here will be social, sociological, political, and practical rather than technological. I think the 'social database' described in the main post already exists, fractured and siloed, with uneven methods for curation. Like you, I'd like to see more collaboration between publishers and consumers of the data, such that the line between publisher and consumer gets a little more blurred.

Also, I clearly really need to go check out @OpenCalais and @zemanta - I have heard good things about them in too many places by now.

Unknown said...

Ehren,

I agree. The main technological impact will be in support of the social, sociological and practical changes that the initial application of technology enables. Open standards are so much a part of our lifestyle now that we take them as much for granted as clean water and air — even though billions of people don't have enough of either of those.

What I worry about, technologically, is that the steps toward an open, semantic, meaningful Web infrastructure which are part and parcel of that larger change can be so readily pre-empted by those who have a stake in the status quo. Look for example at the war being waged to turn the Web into just another top-down broadcast medium in the US and UK with legal blunt instruments like the DMCA and the Digital Economy Act. Tens or hundreds of millions of people access the Internet in just this way — passively viewing "content" on the Web and regularly being told not to stray to "dodgy" sites for fear of identity theft, malware and on and on and on. Those few of us who have made the Internet so much more than that, by building on top of the Web and by building communities around conversations and transactions, are a tiny fraction of the population, often painted as "nerdy" "outsiders" fighting a rearguard action against the forces of Order... whether that Order is benevolent or otherwise.

The more technically-minded among us tend to be self-isolating, content in tiny walled gardens filled with like-minded individuals. That online isolation reinforces, and is reinforced by, the offline isolation many of these individuals experience as a "normal" part of their lives. One of the things that has to happen if our "revolution" is to have any real chance of sustained success is for influencers in the technical and non-technical communities to become better at communicating with each other. This is the hard part... people tend to naturally seek out those who are most like themselves. But unless both communities start working together more effectively, taking action that reinforces rather than detracts from attainment of the goal, effectively countering the isolation and irrelevance that powerful outside interests would impose on us all... we will fail. Most likely, we will create something "magnificent" and "extraordinary" that nobody outside our own small circle will use. In doing so, we will be doing Big Content's work for them, more effectively than they ever could themselves.

That is what I fear... that our idealism and effort become completely marginalized and contained, and we see that as "victory."

Unknown said...

Jeff,

I thought you would be interested in some speculation today that supports your and Eric James Woodward's speculations on Twitter’s openness to Linked Data for their annotations and the communities they seek to serve.
From EJW: “Annotations and innovations based on them are actually only going to further reinforce the existing top 10 Twitter client so-called "ecosystem."”

ReadWriteWeb's Co-Editor, Marshall Kirkpatrick, suggests today that Twitter intends to leave the annotation classification system to be determined by the market. http://bit.ly/csK8Od

Although I appreciate that Twitter values keeping the annotation ecosystem open for innovation and adaptation, I hope the conversation on Linked Data metadata standards within Twitter annotations is just beginning.

It could be an historic lost opportunity if the hard-driving Twitter team doesn’t step back and consider soliciting the counsel of the W3C, Sir TB-L, Nigel Shadbolt and other thoughtful public interest-oriented folks in the Linked Data community. After all, Metaweb's Freebase team is just 3 blocks away from Twitter’s San Francisco headquarters.

I suspect the conversation on Linked Data metadata standards within Twitter annotations is just beginning.
