Tuesday, 28 October 2008

RDFa - so, WTF?!

As I write, I'm in my hotel room at the International Semantic Web Conference 2008 in Karlsruhe. We're only halfway through, but it's already been a thought-provoking and eye-opening two days. One of the big topics I've been bottoming out is what exactly RDFa is, what it can do for us/you, and what its problems are. The thoughts in my head were all started by an excellent tutorial on Sunday by Michael Hausenblas. The answers to all three aren't easy. So I thought I'd try and make sense of it - for my benefit and yours - in this blog post.

Adding Semantic Meaning to Pages

We all want to get more from our web pages. As documents, they're pretty good, but we want to extract more meaning from them. This would allow loads more interactivity and interlinking than is currently possible. For example, one-click adding of info to address books and calendars, showing more in-depth information alongside articles, and automatic related linking. And these are the really simple basic ambitions.

Good web developers should already be using semantic HTML - where, as far as possible, the use of tags matches the real structure of the data you're representing. But this isn't enough. We can see it's a list, but a list of what?!

So, there are two core technologies which are both attempting to solve this problem. The first is Microformats, the second is RDFa.

Microformats make use of existing HTML tags and attributes to assign extra meaning to HTML documents. This generally means adding special class names to class attributes to allow a Microformat parser to understand the content of the tags. There are several common Microformats for marking up data such as calendar events, contact information, geographical information, social relationships, copyright information, reviews etc.

Yahoo (via Peter Mika of Yahoo SearchMonkey) says that around 2% of pages on the web contain Microformats. That's pretty good for an emerging technology. They even have a nice logo - one of which adorns the lid of my laptop. But Microformats have their issues.

Firstly, there's a fairly limited (although useful) set of things which you can mark up with Microformats. If you want to mark up something new, you have to put forward a proposal, and have it approved and ratified by the Microformats community, before having the parsers implement it - often slightly differently.

Secondly, Microformats have some minor accessibility issues. They make use of something they call the abbr design pattern, where the title attribute of an abbr tag is often used to convey non-human-readable date and time information. This can be read out by some screenreaders, appear in tool tips, and generally misuses the abbr tag. Not ideal.

Another Solution?

So how to get round these issues? Another solution is RDFa. RDFa is "RDF in Attributes". As the name suggests, RDFa adds more attributes to XHTML, and these attributes are designed perfectly to hold real, proper RDF data. Parsers then have a really simple job to distil pure RDF straight out of your otherwise beautiful XHTML document.

As RDFa is built on the foundations of RDF, you can use any RDF ontology in your document, and if there's not one which suits, you can make your own. It's naturally extensible without having to ask the permission of a central community, and hope parser makers follow your recommendation.

Hang on Hang on! RDF..what now?

To understand RDFa, you need to know what RDF is. I'll over-simplify deliberately here, but RDF is a way of describing relationships between things, or adding properties to things, and it is most commonly seen written as XML (RDF/XML). RDF documents contain what are called Triples. A Triple is a single statement made up of three parts, which together tell a little story.

  • Dave likes Cats,
  • Simon's nickname is "Si",
  • Chris Martin is married to Gwyneth Paltrow,
  • U2 released the album Achtung Baby
Sounds pretty simple eh? Well it kinda is. The three parts of a Triple are called the Subject (what it's about), the Predicate (the kind of data we're adding to the Subject) and the Object (the actual data). Ideally Subjects will always be URIs, preferably URLs, which means they're addressable on the web. Predicates should also be URIs. This is really important. Predicates are actually defined in Ontologies. This means for every RDF Triple, we have a deep, unambiguous meaning for the relationship described in it. Finally, the Object is the last part. It can be a string literal - i.e. "Si" in the Triple "Simon's nickname is Si" - or it can also be a URI - so in the Triple "Chris Martin is married to Gwyneth Paltrow", both Chris Martin and Gwyneth Paltrow should be represented as URIs - perhaps their Wikipedia entries.
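To make the Subject / Predicate / Object idea concrete, here is a minimal Python sketch. It is only an illustration: the example.org URIs are made up, the `URI` and `Triple` classes are hypothetical helpers rather than part of any RDF library, and the output is a simplified N-Triples-style line with no escaping. The FOAF predicates `nick` and `interest` are real terms from the FOAF vocabulary.

```python
from typing import NamedTuple, Union


class URI(str):
    """Marks a string as a URI so it can be told apart from a plain literal."""


class Triple(NamedTuple):
    subject: URI
    predicate: URI
    obj: Union[URI, str]  # either another URI or a string literal


FOAF = "http://xmlns.com/foaf/0.1/"

# The example.org URIs below are hypothetical placeholders.
simon = URI("http://example.org/people/simon")
dave = URI("http://example.org/people/dave")

triples = [
    # "Simon's nickname is Si": the Object is a string literal.
    Triple(simon, URI(FOAF + "nick"), "Si"),
    # "Dave likes Cats": the Object is itself a URI.
    Triple(dave, URI(FOAF + "interest"), URI("http://example.org/topics/cats")),
]


def to_ntriples(t: Triple) -> str:
    """Render one triple as a simplified N-Triples-style line (no escaping)."""
    obj = f"<{t.obj}>" if isinstance(t.obj, URI) else f'"{t.obj}"'
    return f"<{t.subject}> <{t.predicate}> {obj} ."


for t in triples:
    print(to_ntriples(t))
```

The point to notice is that the Subject and Predicate are always URIs, while the Object switches between a quoted literal and a URI depending on what is being said.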

Right, that's enough RDF theory. If you're interested in it, go read more on the web.

Back to RDFa

To recap, RDFa lets you embed real RDF directly inside XHTML documents in new and extended attributes. But, wait a sec - you can't just make up new attributes, or move attributes onto tags which can't support them. Correct. That's why RDFa can currently only be used in XHTML documents. For a page which contains RDFa to validate, it has to use the XHTML+RDFa 1.0 doctype, which is built on XHTML 1.1. This new doctype...

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN" "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">

is the clever bit that means we can use the 'resource', 'property' and 'about' attributes on more tags than was possible before. The second part is that because we're in XHTML, we can add XML namespaces to our document. Just like in pure RDF (i.e. RDF/XML), this is where we import the Ontologies we want to use to describe the things in our document.
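As a rough illustration of how the namespaces and the 'about' and 'property' attributes fit together, here is a small Python sketch using only the standard library. It is not a conformant RDFa parser: the `RDFaPropertyScanner` class is a hypothetical helper, the sample page and its example.org URIs are invented, prefixes are treated as global, and relative URIs are not resolved. The Dublin Core namespace URI is the real one.

```python
from html.parser import HTMLParser


class RDFaPropertyScanner(HTMLParser):
    """Tiny, non-conformant scanner for 'about' and 'property' attributes."""

    def __init__(self, base):
        super().__init__()
        self.prefixes = {}        # prefix -> namespace URI, from xmlns:* attributes
        self.subjects = [base]    # stack: the subject in force for each open element
        self.open_literal = None  # (depth, subject, predicate, text parts)
        self.triples = []

    def expand(self, curie):
        """Turn 'dc:title' into a full URI using the declared prefixes."""
        prefix, _, local = curie.partition(":")
        return self.prefixes.get(prefix, prefix + ":") + local

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        for name, value in attrs.items():
            if name.startswith("xmlns:"):
                self.prefixes[name[len("xmlns:"):]] = value
        # An 'about' attribute sets a new subject; otherwise inherit the parent's.
        subject = attrs.get("about", self.subjects[-1])
        self.subjects.append(subject)
        if "property" in attrs:
            predicate = self.expand(attrs["property"])
            if "content" in attrs:
                # Machine-readable value supplied directly in the attribute.
                self.triples.append((subject, predicate, attrs["content"]))
            else:
                # Otherwise the element's text becomes the literal.
                self.open_literal = (len(self.subjects), subject, predicate, [])

    def handle_data(self, data):
        if self.open_literal:
            self.open_literal[3].append(data)

    def handle_endtag(self, tag):
        if self.open_literal and self.open_literal[0] == len(self.subjects):
            _, subject, predicate, parts = self.open_literal
            self.triples.append((subject, predicate, "".join(parts).strip()))
            self.open_literal = None
        self.subjects.pop()


# Hypothetical page fragment; the example.org URIs are placeholders.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/elements/1.1/">
  <body>
    <div about="http://example.org/posts/rdfa">
      <h1 property="dc:title">RDFa - so, WTF?!</h1>
      <span property="dc:date" content="2008-10-28">Tuesday, 28 October 2008</span>
    </div>
  </body>
</html>"""

scanner = RDFaPropertyScanner(base="http://example.org/")
scanner.feed(PAGE)
scanner.close()
for triple in scanner.triples:
    print(triple)
```

The 'about' attribute supplies the Subject, the prefixed 'property' value expands to the Predicate URI via the namespace declaration, and the Object comes from either the 'content' attribute or the visible text.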



I don't want to go into too much detail of RDFa syntax here - there are plenty of examples on the web - I'll just point to the best ones. You can learn RDFa for yourself.
Good. Lots of bedtime reading for you to do! Seriously, the best way to get to grips with RDFa is to have a go at marking up a document yourself. It's not that hard once you get used to the syntax, and learn the common RDF predicates.

Instead of teaching the world RDFa, I want to concentrate now on the practicalities of RDFa: testing, validation and data extraction.

Serving and validating RDFa

Let's deal with the utopian ideals first. If you're going to put RDFa into your document, it should be served by your server with the mime type "application/xhtml+xml". Also, you should use the XHTML+RDFa doctype mentioned earlier. You should also ensure you put a version attribute of "XHTML+RDFa 1.0" on your root html node.

Ok ok ok. If you do all those things, your page with RDFa in it will validate. But we all know that serving pages with 'application/xhtml+xml' causes issues with older browsers. Is this an issue? Well, according to Michael Hausenblas, RDFa interpreters are beginning to understand this, and should now attempt to parse documents containing RDFa even if the mime-type smells a bit wrong - 'text/html' for example. This should mean that you can just start embedding RDFa into your page, serving it under an old mime-type to keep IE6 and below happy, while still allowing people to distil juicy semantic goodness from your documents. However, if you're a bit cleverer, you could use content negotiation to serve your page to older browsers using the 'text/html' mime-type, and 'application/xhtml+xml' to user agents, like shiny new browsers, which can support it.
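The content negotiation idea could look something like the following Python sketch. The `choose_mime_type` function is a hypothetical helper, not part of any framework, and it takes a deliberately simple view: it serves 'application/xhtml+xml' only when the client's Accept header lists it explicitly with a non-zero q-value, and falls back to 'text/html' otherwise. It does not rank media types by their relative q-values.

```python
def choose_mime_type(accept_header):
    """Pick a mime-type for an XHTML+RDFa page from an HTTP Accept header.

    Returns 'application/xhtml+xml' only if the client lists it explicitly
    with q > 0; a bare '*/*' is not treated as real XHTML support.
    """
    for item in accept_header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        q = 1.0  # the default quality when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unparseable q-value as "not acceptable"
        if media_type == "application/xhtml+xml" and q > 0:
            return "application/xhtml+xml"
    return "text/html"


# A header of the kind a newer browser might send.
print(choose_mime_type("text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"))
# A header of the kind an older browser might send.
print(choose_mime_type("image/gif, image/jpeg, */*"))
```

If you vary the response like this, it is worth also sending a "Vary: Accept" response header so caches don't hand the XHTML version to a browser that asked for 'text/html'.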

The mime-type is one thing, but what about that crazy ass new doctype! Well, firstly you'll notice it's based on XHTML 1.1. Since we're at v1.1, there's no such thing as strict, transitional and frameset. It's all strict baby. This means if you want to do RDFa properly, all your documents need to be properly XHTML 1.1 compliant. That's not such a big ask if you've been doing your job properly for the last two years or so.

Well, once again, all is not lost. It depends on whether or not you care about validation. The only way to validate a page with RDFa in it is to use the correct, new doctype. But standards-compliant browsers should just ignore any attributes they find which they don't understand, and just get on with things. So, the theory is that you can just go ahead and stick RDFa into your pages, and nothing will break - visually at least. The only thing that will stop working is validation.

Certainly I've tried RDFa in an HTML 4.01 Strict document, and while it no longer validated ("there is no attribute 'ABOUT'" etc), it still displayed perfectly in Firefox 3 (as you'd hope) and the Operator plugin managed to get all that juicy data without a hitch.

So again, the theory is, just use RDFa loosely, and if you don't care about it causing errors in your documents, you'll be fine. Michael Hausenblas again says he speaks to many organisations whose pages have 200 errors, so another 30 don't really matter.

I disagree with this. I think validation is important. I also think that if you get used to having errors due to RDFa then you'll be blind to the errors you really ought to fix. However, with bleeding-edge technologies like this, you have to be pragmatic. In this case I think it's probably such a benefit to have RDFa, that validation errors should be stomached for now. HOWEVER, if possible you should provide a programmatic way to remove all the RDFa statements such that the page will otherwise validate.

I'm going to try and do this in my apps using helpers in my MVC Views which will output nothing if a no-RDFa flag is set, perhaps in the query string, but otherwise will put RDFa into the page. That way, you can have a core page which validates against your older but widely supported doctype, and you only break it with the RDFa bits. Just make sure all your links to validators have the ?rdfa=false flag set! This is untried and untested in practice, but seems like an acceptable solution to allow us to begin using RDFa in older documents now.
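Here is a minimal Python sketch of what such a view helper might look like. It is untested in a real app, just like the idea itself. The function names `rdfa_enabled` and `rdfa_attrs` are hypothetical, and the query string is assumed to arrive without its leading "?".

```python
from html import escape
from urllib.parse import parse_qs


def rdfa_enabled(query_string):
    """RDFa is on by default and switched off only by rdfa=false."""
    values = parse_qs(query_string).get("rdfa", [])
    return not (values and values[-1].lower() == "false")


def rdfa_attrs(enabled, **attrs):
    """Render RDFa attributes for a tag, or nothing if RDFa is switched off."""
    if not enabled:
        return ""
    return "".join(
        f' {name}="{escape(value, quote=True)}"' for name, value in attrs.items()
    )


def render_title(query_string, title):
    """Example view fragment: the markup is the same apart from the RDFa attributes."""
    on = rdfa_enabled(query_string)
    return f'<h1{rdfa_attrs(on, property="dc:title")}>{escape(title)}</h1>'


print(render_title("", "RDFa - so, WTF?!"))            # with RDFa
print(render_title("rdfa=false", "RDFa - so, WTF?!"))  # plain, for the validator
```

Because the helper only ever adds attributes, the page structure is identical in both modes, which is what lets the stripped version validate against the older doctype.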

Developing and Testing RDFa

It's one thing to talk about it, and do some demos, but what about developing in the real world? Well, it's kinda easy with RDFa, just like it is with RDF!

First, get yourself Firefox and the Operator plugin. This baby will light up when it finds microformats or RDFa in the page, and allow you to inspect it. Think of it as Firebug for Semantic Data.

Then, start using the W3C's Validator. If you're using RDFa in an old doctype, you can force it to use the new XHTML + RDFa doctype, which at least will show up any errors in your RDF syntax, if not the extracted meaning.

Finally, you'll be wanting the W3C's RDFa distiller. This will parse your document, and extract pure RDF/XML from it. It's brilliant, and will show you the power of the monster you've created. Forget Microformats and GRDDL and creating XSL for every Microformat you use. As we're building on the extensibility and structure of RDF, the parser has all the info it needs to make full sense of your data - all on its own!

Reading and using RDFa

Now you've put all this in your document, you want to ensure you can get it out. There are a number of parsers for most popular languages - but the most interesting for me is JavaScript.

The key benefit of in-page semantic markup is that, yes, machines can read it - but that machines can read it for the direct benefit of the user. What I mean is that suddenly the page should come alive with calls-to-action, extra data and interactivity which just wasn't there before. Now only 20% of the world use Firefox - and very very few of them have Operator installed. MS IE8 will have Accelerators which use a modified set of Microformats to provide follow-on actions for the user. This is a clear user benefit, and one we need to use RDFa to achieve if it's to have any real success.

My proposed solution is a JavaScript-based toolset for exposing and using semantically embedded data on a page. Basically the same as Operator, but in JS, cross-browser and with incredible user interface elements. Site owners could then include this in their page. When a page loads it would quietly interrogate the DOM for Microformats and RDFa. When it found some, a button somewhere in the header might begin to glow. When clicked, it might either open a panel and show you all the data in one place, with relevant onward actions - OR it might make the semantically marked up parts of the page glow, and offer on-hover contextual user actions.
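The real thing would be JavaScript running against the live DOM, but the detection step itself is simple enough to sketch. The Python below is only an illustration of that step, run against a string of markup. The `SemanticDetector` class and the sample fragment are hypothetical, and the lists of Microformat root class names and RDFa attributes are a small, incomplete selection.

```python
from html.parser import HTMLParser

# Root class names of a few common Microformats (not a complete list).
MICROFORMAT_ROOTS = {"vcard": "hCard", "vevent": "hCalendar", "hreview": "hReview"}
# Attributes that strongly suggest RDFa is present (not a complete list).
RDFA_ATTRIBUTES = {"about", "property", "typeof", "resource"}


class SemanticDetector(HTMLParser):
    """Records which elements carry Microformat classes or RDFa attributes."""

    def __init__(self):
        super().__init__()
        self.found = []  # list of (kind, tag) pairs in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        for class_name in (attrs.get("class") or "").split():
            if class_name in MICROFORMAT_ROOTS:
                self.found.append((MICROFORMAT_ROOTS[class_name], tag))
        if RDFA_ATTRIBUTES.intersection(attrs):
            self.found.append(("RDFa", tag))


# Hypothetical page fragment with one hCard and one RDFa-annotated element.
FRAGMENT = (
    '<div class="vcard"><span class="fn">Simon</span></div>'
    '<p about="http://example.org/post" property="dc:title">Hi</p>'
)

detector = SemanticDetector()
detector.feed(FRAGMENT)
detector.close()

# The "glowing button" decision: is there anything worth showing the user?
print(bool(detector.found), detector.found)
```

A non-empty `found` list is the cue to light up the button; the (kind, tag) pairs are what a panel or on-hover action would then be built from.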

If we do this - suddenly there's a clear user benefit - and our bosses will suddenly take more interest in this Semantic data thing. As a side effect, machines across the world suddenly get access to all this juicy data which has been hidden from them for so long.

What goes in RDFa and what doesn't?

In theory, RDFa allows you to take any RDF triple and embed it contextually in a page. But do we really want to do this?
  • It makes the page bigger - bad for mobile devices which are unlikely to want it
  • The more RDFa you use, the more likely your HTML structure might be compromised by bad markup
  • And isn't that what we have full-fat RDF for?!
On the latter point, YES! We still have full RDF. Also, unless you're insane, you'll subscribe to the software principle of Don't Repeat Yourself (DRY). So there's a tension here. We don't want hardcore RDF in our HTML pages, and we don't want to repeat everything. What to do?

After talking a lot about this at ISWC2008, we think there are some guidelines to follow.
  1. You should use RDFa for simple things that a machine and a user might want to know
  2. You don't put complex RDF in RDFa - you still use RDF/XML for that.
  3. You SHOULD repeat yourself for the simple data you put in RDFa - DO replicate this in RDF/XML. We don't want machines to have to look in two places - but this is the only time you violate DRY
  4. Acceptable RDFa things are from ontologies like Dublin Core, FOAF, SIOC, and Geo - plus the simple bits from more complex Ontologies like Music, Programmes etc.
  5. Focus on things like document metadata (creator, relationships, document meaning etc) not on hardcore URI to URI mappings. Do that in pure RDF/XML.
Phew. It's too early for that to be a tested definitive list - but it feels right. RDFa is more useful when you think of it like a highly structured Microformat rather than actual RDF.

Conclusions

RDFa is now an official W3C recommendation. This means it's time to play. The biggest problem is it's never been tried and tested. It's been used on the common browsers, but we're messing with HTML here, and how older devices and browsers might handle it is anyone's guess. The biggest unknown is mobile phones and other portable devices.

Only one way to find out though - GET PLAYING! Start using really simple Dublin Core metadata in your page and blog to mark up titles, descriptions, people, meanings, tags etc - and see what happens - see if anyone complains, and try and fix the problems as they arise.

This is cool and new - but with RDFa and Microformats taking off (we hope) the practical web is going to be a much better place!

Sunday, 26 October 2008

Excellent preso on Webapp Security by Simon Willison

Web Security Horror Stories