
Saturday, 8 May 2010

Making URL Shorteners Less "Evil"

The following is the text of a comment I attempted to post to an excellent post on visitmix.com discussing The Evils of URL Shorteners. I think Hans had some great points, and the comments afterward seem generally thoughtful.

This is a topic which I happen to think is extremely important, for both historical and Internet-governance reasons, and hope to see a real discussion and resolution committed to by the community. Thanks for reading.


I agree with the problem completely, if not with the solution. I was a long-time user and enthusiastic supporter of tr.im back in the day (up to what, a couple of months ago?). It was obvious they were running it more or less as a public service, not as a revenue-generating ad platform, and they were apparently independent of Twitter, Facebook and the other "social media" services (which is important; see below), among other points in their favour. Unfortunately, since the First Law of the InterWebs seems to be that "no good deed goes unpunished," they got completely hammered beyond any previously credible expectation and, after trying unsuccessfully to sell the service off, are in the process of pulling the plug.

I think it's absolutely essential that any link-shortening service be completely independent of the large social-media sites like Facebook and Twitter, specifically because of the kind of trust/benevolence issues raised in the earlier comments. We as users on both ends of the link-shortening equation might trust, say, Facebook because their policies at the time led us to believe that nothing dodgy would be done in the process. I think the events of the past few weeks, however, have conclusively proven how illusory and ill-advised that belief can be. Certainly, such a service would give its owner a wealth of valuable marketing data (starting with "here's how many unique visitors clicked through links to this URL, posted by this user"). They could even rather easily implement an obfuscation system whereby clicking through, say, a face.bk URL would never show the unaltered page, but would instead dynamically rewrite URLs from the target site so that the shortener-operator could have even MORE data to market ("x% of the users who clicked through the shortened URL to get to this site then clicked on this other link," for example). For a simple, benign demonstration of this, view any foreign-language page using Google Translate. (I'm not accusing Google of doing anything underhanded here; they're just the most common example in my usage of dynamic URL rewriting.)
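
To make the mechanism concrete, here is a minimal sketch in Python (the domain and the tracking endpoint are entirely made up) of how a shortener's intermediate proxy could rewrite every outbound link on the target page so that follow-on clicks also report back to the operator:

    # Hypothetical illustration only: rewrite every <a href> on a proxied page
    # so that each click first passes through the shortener's tracking endpoint.
    from html.parser import HTMLParser
    from urllib.parse import quote

    class LinkRewriter(HTMLParser):
        def __init__(self, tracker="https://w.eb/track?next="):
            super().__init__(convert_charrefs=True)
            self.tracker = tracker
            self.out = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                attrs = [("href", self.tracker + quote(value, safe=""))
                         if name == "href" else (name, value)
                         for name, value in attrs]
            rendered = "".join(f' {name}="{value}"' for name, value in attrs)
            self.out.append(f"<{tag}{rendered}>")

        def handle_endtag(self, tag):
            self.out.append(f"</{tag}>")

        def handle_data(self, data):
            self.out.append(data)

    page = '<p>See <a href="https://example.com/story">the story</a>.</p>'
    rewriter = LinkRewriter()
    rewriter.feed(page)
    print("".join(rewriter.out))
    # Every click on the rewritten page now reports back to w.eb
    # before the visitor ever reaches example.com.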

Another security catastrophe that URL shorteners make trivially easy is the man-in-the-middle exploit, either directly or by malware injected into the user's browser by the URL-shortener service. The source of such an attack can be camouflaged rather effectively by a number of means. (To those who would say "no company would knowingly distribute malware", I would remind you of the Sony rootkit debacle.)

So yeah, I resent the fact that I essentially must use a URL-shortener (now j.mp/bit.ly) whenever I send a URL via Twitter. I also really hate the way too many tweets now use Facebook as an intermediary; whenever I see a news item from a known news site or service that includes a Facebook link, I manually open the target site and search for the story there. That is an impediment to the normal usage flow, reducing the value of the original link.

Any URL-shortening service should be transparent and consistent with respect to its policies. I wouldn't even mind seeing some non-Flash ads on an intermediate page. ("In 3 seconds, you will be redirected to www.example.com/somepage, which you requested by clicking on w.eb/2f7gx; click this button or press the Escape key on your keyboard to stop the timer. If you click on the ad on this page, it will open in a new window or tab in your browser.")

Such a service would have to be independent of the Big Names to be trustworthy. It's not for nothing that "that zucks" is becoming a well-known phrase; the service must not offer even the potential for induced shadiness of behaviour.

I'd like to see some sort of non-profit federation or trade association built around the service; the idea being that 1) some minimal standards of behaviour and function could be self-enforced, and especially 2) that member services that fold would have some ability/obligation to have their shortened link targets preserved. This way, there would still be some way of continuing to use links generated from the now-defunct service.

Since the announcement that the Library of Congress will be archiving ALL tweets as an historical- and cultural-research resource, and contemplating a future in which it is expected that URL-shortening services will continue to fold or consolidate, the necessity and urgency of this discussion as an Internet-governance issue should have become clear to everyone. I hope that we can agree on and implement effective solutions before the situation degrades any further.

Friday, 16 April 2010

A Slight Detour: Musing on Open Data Standards as applied to Social Entrepreneurship and Philanthropy

This started out as a conversation on Twitter with @cdegger, @ehrenfoss, @p2173 and other folks following the #opendata, #socent or #10swf hash tags. Twitter is (in)famous for being limited to 140 characters per “tweet”; with the extra hash tags and all, that's reduced to 96. I wrote a reply and used a text editor to break it into "tweets"; by the time I got to “(part 8/nn),” I knew it was crazy to try and tweet an intelligible response.

So, folks, here's what I think; I hope it's more intelligible this way. Comments appreciated, here or on Twitter.


What I see #opendata doing for #socent is to allow individuals or groups to build on and share information on opportunities, needs, donors, etc. This collaboration would use open data formats and tools that iteratively improve the effectiveness of philanthropy.

Think of how a wiki enables text collaboration, shared learning and discovery, or how instant messaging allows both realtime and time-shifted conversation. Now expand that idea to a sort of "social database" that can be run like a less elitist Wikipedia mated with an RSS feed. Anybody can browse or search the information in the database (needs and offers). They can also get their own local copies of some/all data and have it be updated from the "upstream" source automatically. A smaller group of vetted people can update the "master" data which then gets pushed out to all viewers or subscribers.
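
As a rough illustration of what I mean (all names invented, and storage and transport glossed over entirely), here is a toy sketch in Python of a curated master store pushing updates out to read-only local mirrors:

    # Toy sketch of the "social database": vetted curators update the master
    # store; every subscriber's local mirror is refreshed automatically.
    from dataclasses import dataclass, field

    @dataclass
    class Entry:
        entry_id: str
        kind: str          # "need" or "offer"
        summary: str

    @dataclass
    class LocalMirror:
        entries: dict = field(default_factory=dict)

        def receive(self, entry: Entry) -> None:
            self.entries[entry.entry_id] = entry   # read-only local copy

    @dataclass
    class MasterStore:
        curators: set = field(default_factory=set)
        entries: dict = field(default_factory=dict)
        subscribers: list = field(default_factory=list)

        def publish(self, curator: str, entry: Entry) -> None:
            if curator not in self.curators:
                raise PermissionError(f"{curator} is not a vetted curator")
            self.entries[entry.entry_id] = entry
            for mirror in self.subscribers:        # push downstream
                mirror.receive(entry)

    master = MasterStore(curators={"alice"})
    mirror = LocalMirror()
    master.subscribers.append(mirror)
    master.publish("alice", Entry("n-001", "need", "School laptops, Nairobi"))
    print(mirror.entries["n-001"].summary)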

This curation is needed to maintain data integrity and to (hopefully) eliminate attacks on or disruptions to the "social database" itself. The sad reality is that any public information on the Internet must have some sort of protection, or it will be damaged or destroyed. I see this as being especially true of a pioneering social-entrepreneurial system like this; there are enough people out there who have a vested interest in this sort of thing not working that security (authentication and validation) must be built in from the start. Otherwise, we will wind up with a situation akin to "spam" and "phishing" with email. Email standards were set in the early days, when the Internet was a primarily academic/scientific resource where all users could legitimately trust each other by default; the current state of the Net is far different. Any open data standards and protocols developed for the "social database" must take this into account.

These open data and protocol standards should be designed with the understanding that they are likely to change over time as the needs of users become better defined and as new opportunities to support those needs present themselves. The first version of a new system (like this) is almost never the simplest, nor will it be the most effective for its purpose. Lessons will be learned that should be folded back into revisions of the standards, in much the same way that later versions of standards like HTML built upon experience gained with earlier versions.

When evolving these data formats and protocols, it is vital that the process be fully transparent, with a balance between building progress and listening to the needs and concerns of the widest possible audience. It is entirely possible that no one standard in a given area will suit all stakeholders. In those instances, some sort of federation built on interchange of some common subset or intermediate format may be helpful. This should be seen as undesirable, however, as it limits the ability of casual or new users to make effective use of the entire distributed system.

The development, maintenance and ownership of standards developed for this project (including necessary legal protection such as copyrights) must be under the auspices of an organization with the visibility and stature to maintain control of the standards, lest they devolve into a balkanized mess that would be as unhelpful to the intended mission as not having any such standards at all. I would expect this to be a non-profit organization. Not only would that remove from its officers the fiduciary responsibility of monetizing investments made in the technology, but other non-profits/NGOs could be expected to cooperate more fully with the parent organization in developing, deploying and maintaining the standards, allowing them to remain open and unencumbered.

Finally, I think it's important to state that I don't see any one type of format as necessarily superior for developing this. I'm aware that there has been a lot of work done with various XML-based systems as part of the #socent build-out to date. After working with XML for approximately ten years now, I have seen some magnificent things done with it, and some absolutely misbegotten things done with it. There are several issues I can think of, particularly around the authentication and validation concerns I mentioned earlier, and around the sheer bulk and relative inefficiency of a large-scale XML data store. They're solvable, and they're by no means unique to XML, but they are issues that need to be thought about.

EDIT Sunday 18 April: I feel really strongly that one of the things our (distributed) information architecture is going to have to nail from the very beginning is the idea of authentication/verification: can we confirm that a particular bit of information (what I'd been calling "opportunities, needs, [and] donors" earlier) actually comes from the source it claims to? Otherwise we're just laying ourselves open to various black-hat cracking attacks as well as scams, for instance of the "Nigerian 419" variety. I think it's pretty obvious we're going to need some sort of vetting for posters and participants; this in turn implies some (loose or otherwise) organization with a necessary minimum amount of administrative/curative overhead to maintain public credibility and apparent control over our own resources. Anybody would be allowed to read basic information, but I think we can all agree on the need for some sort of access control and/or obfuscation of data like individual contact information or some types of legal/financial information that gets tacked into the "social database."

This could be pretty straightforward. One hypothetical approach might be as basic as having those who wish to publish information go through a simple registration process that issues them some piece of private data (possibly even using the open OAuth authentication standard that Twitter and others use for applications hooking into their infrastructure). Either alternatively or in conjunction, a public-key cryptography system such as GNU Privacy Guard could be used to prove that data came from whom it claims to. For instance, the data to be published could be enclosed in a ZIP file or other archive, along with a "signature" and the identification of a registered publisher. (There's no way that I'm aware of to actually embed the 'signature' itself into the data file: the signature depends on the exact content of the data, and by adding the signature, the content is changed.)
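
To illustrate the detached-signature idea, here is a small Python sketch that uses an Ed25519 keypair from the third-party cryptography package as a stand-in for GPG; the point is simply that the signature travels alongside the data, and any change to either one breaks verification:

    # Stand-in for a GPG detached signature: the signature is derived from the
    # exact bytes of the data, so it must be shipped next to the data file
    # (e.g. inside the same ZIP archive), never embedded within it.
    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    data = b'{"kind": "need", "summary": "School laptops, Nairobi"}'

    # Publisher side: sign the exact bytes being published.
    private_key = Ed25519PrivateKey.generate()
    signature = private_key.sign(data)
    public_key = private_key.public_key()   # shared during registration

    # Subscriber side: any tampering with the data invalidates the signature.
    try:
        public_key.verify(signature, data)
        print("data verified against the registered publisher's key")
    except InvalidSignature:
        print("data or signature has been altered")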

To the non-technical user, the effects of such a system should be:

  • The 'Foundation' (for want of a better term) can use already-proven, open standards to enhance member and public confidence in the accuracy and transparency of the content of the "social database". Attempting to "reinvent the wheel" in this area is an oft-proven Bad Idea™;

  • The Foundation will be able to develop, deploy, manage and maintain a (potentially) widely-distributed philanthropic/social database architecture that can support a wide variety of organizational/use models;

  • Having this sort of authentication and validation will aid in the evolution of the technical and "business" architectures of the system; new services can be layered on top of existing ones by different users as needed.

For instance, if a particular member has announced that they will publish information in a specific version of the schema for the "social database" (say, during registration), any later information purportedly from that member published in an older format should raise warning flags, as it may be a sign of actual or attempted compromise in security and data integrity. A benign incident, such as the member inadvertently using a software tool that submits data in an inappropriate format, can be quickly identified, communicated and rectified.
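
A hypothetical helper (member identifiers and version numbers invented) makes that check concrete:

    # Flag submissions whose schema version is older than the version the
    # member declared at registration; this is the warning-flag case above.
    REGISTERED_SCHEMA = {"member-42": "2.1"}   # declared during registration

    def check_submission(member_id: str, schema_version: str) -> str:
        declared = REGISTERED_SCHEMA.get(member_id)
        if declared is None:
            return "reject: unknown member"
        if schema_version < declared:   # naive string comparison for the sketch
            return "warn: older schema than declared; possible compromise or tooling error"
        return "accept"

    print(check_submission("member-42", "1.4"))   # warn ...
    print(check_submission("member-42", "2.1"))   # accept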

This will be vital if we are to create a data system that publicly distributes data pertinent to specific members, outside those members' direct control. That data could include information that must not be altered by outside influences (such as, say, budget information) or information that, for general Internet-security reasons, should not be directly visible to all and sundry (for instance, contact information might be accessible to members but not to the casual browsing public).

Monday, 21 August 2006

Projects and Data Formats, or Scratching a Standard Itch

Fair warning: This post was written in bits and pieces over a week that I spent mostly on my back in bed; it hits two or three hot-button issues that I've been running up against. In the fullness of time, I may come back and break it up, or write follow-on entries pontificating on one point or another, but for the nonce, your patience — and comments! — are appreciated.

Real standards happen in one of two ways. One way is for an organisation like the World Wide Web Consortium (or W3C, as it is commonly known) to put together different committees and working groups; over the course of various meetings, seminars, forums, and other corporate expense-account sinkholes, massive sets of documents are ratified, and if we're lucky, somewhere within them will be nuggets of information and wisdom around which useful things can be accomplished. Successful examples of this include standards such as HTML, XHTML and CSS 2. Less successful examples include efforts such as WCAG 2. While it may safely be assumed that nothing in the new standard will disrupt the existing order of the Internet, the flip side is that there may be no actual working implementation of the new standard (to prove that such a thing is practical), and it may well be that the new standard is not the most efficient or elegant solution to the problem. This may be described as the "top-down" approach.

The other way that standards happen in the real world is for a developer, or more typically a small group of developers, to come up with something that works for them, open it up to community/public comment and collaboration, and eventually submit the standard definition (which by then has several working implementations) to standards bodies like the W3C or the Internet Engineering Task Force. This may be seen as the "bottom-up" approach. Its success is largely tied to how effectively it solves the problem it sets out to solve and, equally critically, whether it does so in a manner that doesn't convey an inherent advantage to a subset of its audience (such as the company employing the creators of the standard). Successful examples of this include vCard and its successor hCard.

I stumbled across the description of hCard (via An Angry Fix by Jeffrey Zeldman, a well-known figure in the online Web-design industry and community) just after I had been giving some thought to a problem I'd been having with contact information. Namely, that the information was scattered across various formats: one for my (ancient) PalmPilot, one for each of two different Nokia phones, another for my email package (Mozilla Thunderbird), and so on, and so on.... Keeping everything synchronised — the mundane necessity of ensuring that any given contact was in each of the needed places with the most recently updated information — is a burden sufficient to preclude any further effort, such as actually communicating anything useful or interesting to those contacts. (Maybe they read this blog...)

What's needed is a free, open source bit of software to take these various directories in varyingly historical formats, apply updates and changes to a single, current-technology directory around something like vCard (or, better, hCard), and then to spit out various dumps of this data to suit the different devices and their differing format requirements. If you think about this for a while, you can think of all sorts of ways that synchronisation could be a real pain...which update gets applied if you enter the same information two different ways on two devices? Suppose that I get energetic and add data to the "Custom Fields" in my Palm to represent data that has specific fields for the phone or Thunderbird — but since I add different data at different times, it's not always consistent? And on and on...
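
As a very rough sketch of the merge step such a tool would need (real device formats are glossed over, and "most recent update wins" is only one possible conflict policy), in Python:

    # Normalise each device's export to a common representation (think
    # vCard/hCard fields plus a modification timestamp), then let the newest
    # record for each contact win.
    from datetime import datetime

    def merge_contacts(*device_exports):
        merged = {}
        for export in device_exports:
            for name, record in export.items():
                current = merged.get(name)
                if current is None or record["updated"] > current["updated"]:
                    merged[name] = record
        return merged

    palm = {"Jeffrey": {"tel": "+1-555-0100", "updated": datetime(2006, 8, 14)}}
    phone = {"Jeffrey": {"tel": "+1-555-0199", "updated": datetime(2006, 8, 20)}}

    canonical = merge_contacts(palm, phone)
    print(canonical["Jeffrey"]["tel"])   # the newer number wins
    # From `canonical`, per-device dumps in each required format would be emitted.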

I'm going to keep one eye open over the next few weeks or months for something that does this relatively painlessly (and, of course, if anybody knows of any, please let me know). Otherwise, it's likely to become Item 374 in my medium-priority queue for Tools I Intend To Write (Someday).

Implicit in the first paragraph, and alluded to more directly in the third (see Zeldman et al.), is the fact that the W3C has spent the last 2-3 years making abundantly clear who its customers and stakeholders are, and telling those of us who are professionally tied to standard technologies but who are not ourselves multinational corporations flush with cash for endless junkets (and patent payoffs) to take a long walk off the shortest pier available. While this may be seen by some as an efficient use of resources, addressing the corporate sponsors who are the titans of the marketplace anyway, somebody made a good point along the way: the Microsofts and H-Ps and IBMs of the world started out as small shops that nobody had ever heard of. Had the standards of the day been defined less by what made sense from an engineering perspective than by a lock-out-the-small-guys marketing directive, the world would be a very different — and likely less advanced — place today. What goes around, comes around — and the W3C in particular is building up a lot of bad blood with the vitally "interested parties" who don't happen to (presently) be among the 200 or so largest corporations on the planet.

What will happen? On the one hand, we'll likely wind up with lots of easily available but proprietary "standards" like Adobe PDF; the word processors I've used for the last four years have supported publishing in PDF without Adobe asking for a dime. On the other hand, we'll have highly marketed, widely Diggable, proprietary-means-you-only-get-it-from-us packages. These may have lively add-on Astroturfed communities, but they won't deliver the business benefits of truly open software: you can't fork the product, you can't completely support yourself, and every use you make of the product or technology, in perpetuity, will be subject to the dictates and whims of the company that owns the product. Well and good, you say; they do, in fact, own their product, and have a right to do whatever they like with it. True, but where does that leave customers who incorporate that product into critical business processes? A year or so ago, an American friend of mine told me of one of his clients, who had a hard drive in their accounting server fail. They swapped out the drive, restored from backups, and found that reinstalling the order-management package they used — to generate and track every single order from the day the company was founded right up to the guy who just got off the phone — required a license key. Fine; they call the vendor's toll-free phone number, expecting to be back in business (literally) in a few minutes. Oops. The vendor had been bought out by a much larger firm; their version was now three versions out of date, and the (new) vendor required them to buy an upgrade — at retail — to access the data they had just restored from tape.

When people ask me what the business benefits of open systems are, they don't want to hear a Stallmanesque sermon on the virtues of individual liberties, real though they may be, or the geek chic of cool code, or the cheapskate appeal of "it doesn't cost a thing". It does cost — in the time and effort to convert and adopt within the enterprise. But what you get from it at the end of the day is control over your own business processes; you can keep running a ten-year-old word processor if you choose to, or have your accounting package customised just so, or whatever else you can create a business justification for — and it's going to be much easier to cost-justify relatively audacious projects because there are no hidden surprises. Transparency, auditability, control, economy: those may not be terribly high on the Digg word list, and they may not have dozens of ...For Dummies-style books in your local chain bookshop, but people who make their living, and their employees' living, by making the numbers come out right every quarter should understand what I'm talking about. It's about time.

A side note: Anyone who is considering setting up business in Malaysia rather than other nearby countries (Thailand, Vietnam) may well want to consider the level of technical efficiency, customer support, and attitude towards service of the local telecom quasi-monopoly. For most people and businesses, Telekom Malaysia is the only game in town. As one of the subscribers/victims of their Streamyx ADSL "service" for the last three years, I have watched connection speed and reliability plummet as thousands of new subscribers are pushed onto steadily lower and lower tiers of service. I believe, for example, that they now offer a "broadband" connection at 128 Kbps, roughly twice as fast as a standard dialup modem. I am paying for a 2 Mbps — 2,000 Kbps — connection; in the last two months, I have never witnessed transfer rates higher than 400 Kbps, and for the last week never higher than 80 Kbps. If I were living in a capitalist system with competitive markets, I would have choices. In a functionally Stalinist economy where competition against government-linked companies is tightly controlled, I have no usable choices. It has taken me well over two weeks of trying to post this blog entry. Selamat datang ke Malaysia! (Welcome to Malaysia!)

Wednesday, 28 June 2006

Promises Kept, Credibility Gaps, and Microsoft: Are we Customers or Consumers?

As reported on Slashdot, quoting Quentin Clark's WinFS team blog (which spun the item mercilessly), and commented on widely, particularly by rjdohnert and Kamal:

WinFS is dead. The name has been understood for a decade or so to refer to a "Windows File System", recently rechristened in Microsoftspeak as "Windows Future Storage" (to imply a lack of commitment to a product, or in fact to anything specific at all). In any form recognisable as the product/technology that Microsoft has hyped unrelentingly whenever they needed something to keep users (and developers) committed to the Next Windows Version, the plug has been pulled for what promises to be the very last time. This could be viewed in a number of ways; the least uncharitable explanation that conceivably touches upon our shared reality is the subject of the remainder of this item.

Yet another case of Microsoft overpromising and underdelivering? Since they really don't care about providing great software to consumers — either end users or developers — there is no real penalty for failing to keep promises (though they do, in true Rove/O'Reilly fashion, try to spin the sucker positive as hard as they can, just to keep the yokels giving the slack-jawed "wow... they say it's cool" and, as Michalski originally wrote, crapping cash).

There is absolutely no reason to keep waiting for a relational file store in Windows or any product except SQL Server (and possibly some future version of Office that requires SQL Server). There is no reason whatever to believe Microsoft will keep ANY promise made to developers or end users, now or in the future. There is absolutely no reason to believe that any gee-whiz "technology preview" given by Microsoft will ever turn into a real, stable, usable product unless that product is announced (with a ship date) at the show or conference where the demo is made. Stability and usability of said product will, as with all previous Microsoft releases, have to wait for the second service pack.

What this boils down to, in other words, is a matter of trust, and commitment, and honesty, and all the values that a company which values its customers (and workers) is expected to incorporate into its ethos. That Microsoft deliberately chooses not to do this, as it has proven on numerous occasions, shows its complete and consistent contempt for those poor schmucks it sees as consumers, not customers.

We, as developers and users, have two choices. We can either continue to prove Microsoft right, gulping whatever product they deign to deliver, crapping out whatever cash they choose to take, abjectly powerless to exert any change over their behaviour. Or, we can refuse to play their game any more. There are other tools to develop products for Windows. Most of these have the additional benefit of being cross-platform.

"Cross-platform". There's a quaintly radical word in these times. The idea that people could use a variety of systems, tools, applications, to get their work done. Companies don't have to pay US$600 to buy an office "suite" with a heavy-duty word processor, spreadsheet, and yadda yadda for a manager whose work is primarily limited to short memos? Revolutionary. Selecting tools based on the needs of the user rather than the "default" "choice" for the entire organisation? If one choice of office layout doesn't fit everybody from the managing director to the secretarial pool, then by what logic should they use the same software tools to do their work? How ma many users of, say, Microsoft Word use more than a tiny percentage (say, 5%) of the "features" in the product? (According to surveys dating back to 2000, roughly 5%). By looking at the situation as a need to give each user tools appropriate for the task at hand, rather than imposing a uniform "solution" and adapting the task to the "solution"?

This whole WinFS affair is yet another bit of weight pushing the Good Ship Microsoft towards (or past, in some opinions) the tipping point. Those already on board might do well to examine their options; those considering extending their 'booking' may wish to reconsider. The main arguments that no 'realistic' options exist have been marketing-driven, rather than technically or business-driven. Consumers blindly take whatever they're given; customers demand products that meet their needs. It is high time that those who purchase and use business computer software systems, and the tools to work with them, avail themselves of their options.

Tuesday, 18 October 2005

About me and my work at Cilix

I'm working on a lot of things for my work at Cilix, an engineering firm in Kuala Lumpur, Malaysia. First off, let me be clear on one thing: this blog is not officially sanctioned in any way by Cilix; this is 'just me'.

We call ourselves "A Knowledge Company". What that means, at least in my understanding, is that we apply professional knowledge and experience, augmented heavily by technology, to solve customers' knowledge-management and IT challenges. As such, we do a lot of writing — documents, Web pages, software, ad (nearly) infinitum.

We're a small shop as these things go, and our competition comes from much larger organisations with instant multinational name-brand recognition. Like any small firm, we have to win our first projects with a given client by promising — and delivering — a better value proposition than our competition. Where we get repeat business — again, like any similar firm — is by being agile, efficient, and above all, competent to the point of being unquestionably the least risky vendor for a particular solution.

Those attributes, in turn, lead us to consider issues like process, quality, and superlative knowledge of everything we are about. These issues, and how we as an organisation work through them, were what originally attracted me to the Company when I was approached and offered a position here. These issues are also the foci of what I expect to accomplish with this blog and the related collaboration tools (such as the Wiki).


I am also trying to evangelise and lead the implementation of open documentation and data-format standards at Cilix. This involves, among other things, migrating away from proprietary, binary formats like Microsoft Office documents to open, preferably text-based formats. As it happens, many of these open, text-based formats are based on XML vocabularies, such as Docbook and SVG.

Why are text-based formats preferable? Lots of reasons:

  • They are usually much more compact (and compressible) than comparable binary formats. Converting mostly-text Microsoft Word documents to Docbook equivalents often yields size reductions of 80% or more (think how much more convenient email attachments would be);
  • They are usable with a wider variety of tools. I can throw a text file on my PalmPilot and fiddle with it far easier than a Microsoft Word document, for instance;
  • They are more amenable to most version-control systems, particularly cvs and subversion. Instead of making copies of each version of a binary file, all that is required is to store the difference between two versions of a text file — a much easier and more reliable operation. I have seen version control systems of all flavours — SourceSafe, cvs, Atria/Rational ClearCase — irretrievably corrupt binary files when insufficient care was taken by the configuration manager;
  • They are more amenable to being stored in databases. Many databases (such as MySQL) can return result sets packaged as XML fragments; this, combined with an XSLT parser and stylesheet, opens the door to some truly compelling presentation capabilities (see the sketch after this list).
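
As a sketch of that last point, here is a minimal Python example (using the third-party lxml package; the fragment and stylesheet are invented) that runs an XML result fragment through an XSLT stylesheet to produce presentation markup:

    # Transform an XML fragment, of the sort a database query might return,
    # into an HTML list via an XSLT stylesheet.
    from lxml import etree

    xml_fragment = etree.XML("""
    <contacts>
      <contact><name>Jeffrey</name><tel>+1-555-0100</tel></contact>
    </contacts>
    """)

    stylesheet = etree.XML("""
    <xsl:stylesheet version="1.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:template match="/contacts">
        <ul>
          <xsl:for-each select="contact">
            <li><xsl:value-of select="name"/>: <xsl:value-of select="tel"/></li>
          </xsl:for-each>
        </ul>
      </xsl:template>
    </xsl:stylesheet>
    """)

    transform = etree.XSLT(stylesheet)
    print(str(transform(xml_fragment)))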

By taking advantage of these capabilities, we should be able to create better products with more predictable (and shorter) schedules without either greatly expanding the development team or pushing the present staff to the point of burnout. There is a saying in Silicon Valley in California, only partly tongue-in-cheek:

It isn't a startup until somebody dies
Here's hoping that's one "tradition" that's not exported anywhere outside the Valley.