Tuesday 28 October 2008

RDFa - so, WTF?!

As I write, I'm in my hotel room at the International Semantic Web Conference 2008 in Karlsruhe. We're only halfway through, but it's already been a thought-provoking and eye-opening two days. One of the big topics I've been bottoming out is what exactly RDFa is, what it can do for us/you, and what its problems are. The thoughts in my head were all started by an excellent tutorial on Sunday by Michael Hausenblas. The answers to all three aren't easy. So I thought I'd try and make sense of it - for my benefit and yours - in this blog post.

Adding Semantic Meaning to Pages

We all want to get more from our web pages. As documents, they're pretty good, but we want to extract more meaning from them. This would allow loads more interactivity and interlinking than is currently possible. For example, one-click adding of info to address books and calendars, showing more in-depth information alongside articles, and automatic related linking. And these are the really simple basic ambitions.

Good web developers should already be using semantic HTML - where, as far as possible, the use of tags matches the real structure of the data you're representing. But this isn't enough. We can see it's a list, but a list of what?!

So, there are two core technologies which are both attempting to solve this problem. The first is Microformats, the second is RDFa.

Microformats make use of existing HTML tags and attributes to assign extra meaning to HTML documents. This generally means adding special class names to class attributes to allow a Microformat parser to understand the content of the tags. There are several common Microformats for marking up data such as calendar events, contact information, geographical information, social relationships, copyright information, reviews etc.

Yahoo (via Peter Mika of Yahoo SearchMonkey) says that around 2% of pages on the web contain Microformats. That's pretty good for an emerging technology. They even have a nice logo - one of which adorns the lid of my laptop. But Microformats have their issues.

Firstly, there's a fairly limited (although useful) set of things which you can mark up with Microformats. If you want to mark up something new, you have to put forward a proposal and have it approved and ratified by the Microformats community, before having the parsers implement it - often slightly differently.

Secondly, Microformats have some minor accessibility issues. They make use of something they call the abbr design pattern, where the title attribute of an abbr tag is often used to convey non-human-readable date-time information. This can be read out by some screenreaders, appear in tool tips, and generally misuses the abbr tag. Not ideal.

Another Solution?

So how to get round these issues? Another solution is RDFa. RDFa is "RDF in Attributes". As the name suggests, RDFa adds more attributes to XHTML, and these attributes are designed perfectly to hold real, proper RDF data. Parsers then have a really simple job to distil pure RDF straight out of your otherwise beautiful XHTML document.

As RDFa is built on the foundations of RDF, you can use any RDF ontology in your document, and if there's not one which suits, you can make your own. It's naturally extensible without having to ask the permission of a central community, and hope parser makers follow your recommendation.

Hang on Hang on! RDF..what now?

To understand RDFa, you need to know what RDF is. I'll over-simplify deliberately here, but RDF is most commonly seen as an XML format which is used to describe relationships between things, or add properties to things. RDF documents most often look like XML, and contain what are called Triples. A triple is a single statement in three parts that tells a little story.

  • Dave likes Cats,
  • Simon's nickname is "Si",
  • Chris Martin is married to Gwyneth Paltrow,
  • U2 released the album Achtung Baby
Sounds pretty simple eh? Well it kinda is. The three parts of a Triple are called the Subject (what it's about), the Predicate (the kind of data we're adding to the Subject) and the Object (the actual data). Ideally Subjects will always be URIs, preferably URLs, which means they're addressable on the web. Predicates should also be URIs. This is really important. Predicates are actually defined in Ontologies. This means for every RDF Triple, we have a deep, unambiguous meaning for the relationship described in it. Finally, the Object is the last part. It can be a string literal - i.e. "Si" in the Triple "Simon's nickname is Si" - or it can also be a URI - so in the Triple "Chris Martin is married to Gwyneth Paltrow", both Chris Martin and Gwyneth Paltrow should be represented as URIs - perhaps their Wikipedia entries.
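To make the Subject / Predicate / Object idea a bit more concrete, here's a rough Python sketch of two of the Triples above as plain tuples. The example.org URIs (including the "marriedTo" predicate) are made up purely for illustration; foaf:nick is a real FOAF property, and the Wikipedia URLs follow the suggestion above.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str    # what the statement is about (a URI)
    predicate: str  # the kind of relationship (a URI defined in an ontology)
    obj: str        # the actual data: another URI, or a plain string literal


FOAF_NICK = "http://xmlns.com/foaf/0.1/nick"       # real FOAF property
MARRIED_TO = "http://example.org/terms/marriedTo"  # hypothetical predicate

triples = [
    # "Simon's nickname is Si" - the Object is a string literal
    Triple("http://example.org/people/simon", FOAF_NICK, "Si"),
    # "Chris Martin is married to Gwyneth Paltrow" - the Object is a URI
    Triple("http://en.wikipedia.org/wiki/Chris_Martin", MARRIED_TO,
           "http://en.wikipedia.org/wiki/Gwyneth_Paltrow"),
]


def is_uri(value: str) -> bool:
    """Crude check, good enough for this sketch."""
    return value.startswith(("http://", "https://"))


for t in triples:
    kind = "URI" if is_uri(t.obj) else "literal"
    print(f"{t.subject} --{t.predicate}--> {t.obj} ({kind})")
```

Real RDF libraries model this far more carefully (typed literals, blank nodes and so on), but the three-part shape is the same.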

Right, that's enough RDF theory. If you're interested in it, go read more on the web.

Back to RDFa

To recap, RDFa lets you embed real RDF directly inside XHTML documents in new and extended attributes. But, wait a sec - you can't just make up new attributes, or move attributes onto tags which can't support them. Correct. That's why RDFa can currently only be used in XHTML 1.1 documents. For a page which contains RDFa to validate, it has to use the XHTML+RDFa 1.0 doctype, which is built on XHTML 1.1. This new doctype...

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN" "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">

is the clever bit that means we can use the 'resource', 'property' and 'about' attributes on more tags than was possible before. The second part is that because we're in XHTML, we can add XML namespaces to our document. Just like in pure RDF (i.e. RDF/XML), this is where we import the Ontologies we want to use to describe the things in our document.
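Just to give a flavour of how the namespace and the attributes fit together, here's a small Python sketch. The page fragment, its URL and the author name are hypothetical; the Dublin Core namespace is the real one. The little reader class is a toy which only understands 'about' and 'property' with plain text content, so treat it as an illustration and not a conforming RDFa parser.

```python
from html.parser import HTMLParser

# Hypothetical fragment: Dublin Core title and creator for one document.
PAGE = """
<div xmlns:dc="http://purl.org/dc/elements/1.1/"
     about="http://www.example.com/posts/rdfa">
  <h1 property="dc:title">RDFa - so, WTF?!</h1>
  <span property="dc:creator">Simon</span>
</div>
"""


class TinyRDFaReader(HTMLParser):
    """Collects (subject, predicate, literal) triples from 'about' and 'property'.

    Ignores 'resource', 'rel', 'content', datatypes and the real nesting
    rules, so it only copes with very simple markup like PAGE above.
    """

    def __init__(self):
        super().__init__()
        self.prefixes = {}    # namespace prefix -> ontology URI
        self.subject = None   # most recent 'about' value
        self.pending = None   # predicate waiting for its text content
        self.triples = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        for name, value in attrs.items():
            if name.startswith("xmlns:"):
                self.prefixes[name[len("xmlns:"):]] = value
        if "about" in attrs:
            self.subject = attrs["about"]
        if "property" in attrs:
            prefix, _, local = attrs["property"].partition(":")
            # Expand dc:title to the full predicate URI when the prefix is known.
            self.pending = self.prefixes.get(prefix, prefix + ":") + local

    def handle_data(self, data):
        if self.pending and data.strip():
            self.triples.append((self.subject, self.pending, data.strip()))
            self.pending = None


reader = TinyRDFaReader()
reader.feed(PAGE)
reader.close()
for triple in reader.triples:
    print(triple)
```

The point to notice is that the prefix declared with xmlns:dc is what turns the short 'dc:title' into a full, unambiguous predicate URI.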



I don't want to go into too much detail of RDFa syntax here - there are plenty of examples on the web - I'll just point to the best ones. You can learn RDFa for yourself.
Good. Lots of bedtime reading for you to do! Seriously, the best way to get to grips with RDFa is to have a go at marking up a document yourself. It's not that hard once you get used to the syntax, and learn the common RDF predicates.

Instead of teaching the world RDFa, I want to concentrate now on the practicalities of RDFa: testing, validation and data extraction.

Serving and validating RDFa

Let's deal with the utopian ideals first. If you're going to put RDFa into your document, it should be served by your server with the mime type "application/xhtml+xml". Also, you should use the XHTML+RDFa doctype mentioned earlier. You should also ensure you put a version attribute of "XHTML+RDFa 1.0" on your root html node.

Ok ok ok. If you do all those things, your page with RDFa in it will validate. But we all know that serving pages with 'application/xhtml+xml' causes issues with older browsers. Is this an issue? Well, according to Michael Hausenblas, RDFa interpreters are beginning to understand this, and should now attempt to parse documents containing RDFa even if the mime-type smells a bit wrong - 'text/html' for example. This should mean that you can just start embedding RDFa into your page, serving it under an old mime-type to keep IE6 and below happy, while still allowing people to distill juicy semantic goodness from your documents. However, if you're a bit cleverer, you could use content negotiation to serve your page to older browsers using the 'text/html' mime-type, and 'application/xhtml+xml' for user agents like shiny new browsers which can support it.
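The content negotiation idea boils down to looking at the request's Accept header. Here's a minimal Python sketch of that decision; the function name is my own invention, and it's deliberately simplified: it only checks whether application/xhtml+xml is listed with a non-zero q-value, and doesn't rank it against text/html the way a full implementation would.

```python
def pick_content_type(accept_header: str) -> str:
    """Return the mime-type to serve, based on the request's Accept header."""
    for part in accept_header.split(","):
        fields = [f.strip() for f in part.split(";")]
        if fields[0].lower() != "application/xhtml+xml":
            continue
        q = 1.0  # no q parameter means full preference
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # unparseable q-value: play safe
        if q > 0:
            return "application/xhtml+xml"
    # Anything that doesn't explicitly ask for XHTML gets plain old HTML.
    return "text/html"


# A browser that advertises XHTML support
print(pick_content_type("text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"))
# A browser that only sends image types and a wildcard
print(pick_content_type("image/gif, image/jpeg, */*"))
```

If you do this for real, you'd also want to send a "Vary: Accept" response header so caches don't hand the wrong version to the wrong browser.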

The mime-type is one thing, but what about that crazy-ass new doctype! Well, firstly you'll notice it's XHTML 1.1. Since we're at v1.1, there's no such thing as strict, transitional and frameset. It's all strict, baby. This means if you want to do RDFa properly, all your documents need to be properly XHTML 1.1 compliant. That's not such a big ask if you've been doing your job properly for the last two years or so.

Well, once again, all is not lost. It depends on whether or not you care about validation. The only way to validate a page with RDFa in it is to use the correct, new doctype. But standards-compliant browsers should just ignore any attributes they find which they don't understand, and just get on with things. So, the theory is that you can just go ahead and stick RDFa into your pages, and nothing will break - visually at least. The only thing that will stop working is validation.

Certainly I've tried RDFa in an HTML 4.01 Strict document, and while it no longer validated ("there is no attribute 'ABOUT'" etc), it still displayed perfectly in Firefox 3 (as you'd hope) and the Operator plugin managed to get all that juicy data without a hitch.

So again, the theory is, just use RDFa loosely, and if you don't care about it causing errors in your documents, you'll be fine. Michael Hausenblas again says he speaks to many organisations whose pages have 200 errors, so another 30 don't really matter.

I disagree with this. I think validation is important. I also think that if you get used to having errors due to RDFa then you'll be blind to the errors you really ought to fix. However, with bleeding-edge technologies like this, you have to be pragmatic. In this case I think it's probably such a benefit to have RDFa, that validation errors should be stomached for now. HOWEVER, if possible you should provide a programmatic way to remove all the RDFa statements such that the page will otherwise validate.

I'm going to try and do this in my apps using helpers in my MVC Views which will output nothing if a no-rdfa flag is set, perhaps in the query string, but otherwise will put RDFa into the page. That way, you can have a core page which validates against your older but widely supported doctype, and you only break it with the RDFa bits. Just make sure all your links to validators have the ?rdfa=false flag set! This is untried and untested in practice, but seems like an acceptable solution to allow us to begin using RDFa in older documents now.
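As I say, this is untested, but the helper idea could look something like the following Python sketch. The function names and the shape of the query dictionary are assumptions for illustration; your MVC framework will have its own way of writing view helpers and reading the query string.

```python
from html import escape


def rdfa_enabled(query: dict) -> bool:
    """RDFa is on unless the query string carries rdfa=false."""
    return query.get("rdfa", "true").lower() != "false"


def rdfa_attrs(enabled: bool, **attrs: str) -> str:
    """Render RDFa attributes for a tag, or nothing when RDFa is switched off."""
    if not enabled:
        return ""
    return "".join(
        f' {name}="{escape(value, quote=True)}"' for name, value in attrs.items()
    )


# Normal visitor: attributes are written into the tag
on = rdfa_enabled({})
print(f'<h1{rdfa_attrs(on, property="dc:title")}>My post</h1>')

# Validator link with ?rdfa=false: the tag comes out clean
on = rdfa_enabled({"rdfa": "false"})
print(f'<h1{rdfa_attrs(on, property="dc:title")}>My post</h1>')
```

Because the helper returns either a complete attribute string or an empty one, the template around it stays identical in both modes.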

Developing and Testing RDFa

It's one thing to talk about it, and do some demos, but what about developing in the real world? Well, it's kinda easy with RDFa, just like it is with RDF!

First, get yourself Firefox and the Operator plugin. This baby will light up when it finds microformats or RDFa in the page, and allow you to inspect it. Think of it as Firebug for Semantic Data.

Then, start using the W3C's Validator. If you're using RDFa in an old doctype, you can force it to use the new XHTML + RDFa doctype, which at least will show up any errors in your RDF syntax, if not the extracted meaning.

Finally, you'll be wanting the W3C's RDFa Distiller. This will parse your document, and extract pure RDF/XML from it. It's brilliant, and will show you the power of the monster you've created. Forget Microformats and GRDDL and creating XSL for every Microformat you use: as we're building on the extensibility and structure of RDF, the parser has all the info it needs to make full sense of your data - all on its own!

Reading and using RDFa

Now you've put all this in your document, you want to ensure you get it out. There are a number of parsers for most popular languages - but the most interesting for me is Javascript.

The key benefit of in-page semantic markup is not just that machines can read it, but that machines can read it for the direct benefit of the user. What I mean is that suddenly the page should come alive with calls-to-action, extra data and interactivity which just wasn't there before. Now only 20% of the world use Firefox - and very very few of them have Operator installed. MS IE8 will have Accelerators which use a modified set of Microformats to provide follow-on actions for the user. This is a clear user benefit, and one we need to use RDFa to achieve if it's to have any real success.

My proposed solution is a JavaScript-based toolset for exposing and using semantically embedded data on a page. Basically the same as Operator, but in JS, cross-browser and with incredible user interface elements. Site owners could then include this in their page. When a page loads it would quietly interrogate the DOM for Microformats and RDFa. When it found some, a button somewhere in the header might begin to glow. When clicked, it might either open a panel and show you all the data in one place, with relevant onward actions - OR it might make the semantically marked up parts of the page glow, and offer on-hover contextual user actions.

If we do this - suddenly there's a clear user benefit - and our bosses will suddenly take more interest in this Semantic data thing. As a side effect, machines across the world suddenly get access to all this juicy data which has been hidden from them for so long.
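The "quietly interrogate the page" step is the easy part. Here's the detection logic sketched in Python rather than JavaScript, purely to show the idea; the class and function names are hypothetical, and the lists of Microformat root classes and RDFa attributes are a small, non-exhaustive selection.

```python
from html.parser import HTMLParser

# Root class names of a few common Microformats (hCard, hCalendar, hReview, geo)
MICROFORMAT_CLASSES = {"vcard", "vevent", "hreview", "geo"}
# RDFa attributes that plain HTML doesn't already use for other purposes
RDFA_ATTRIBUTES = {"about", "property", "resource", "typeof"}


class SemanticSniffer(HTMLParser):
    """Counts elements that look like Microformats or carry RDFa attributes."""

    def __init__(self):
        super().__init__()
        self.microformats = 0
        self.rdfa = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        if classes & MICROFORMAT_CLASSES:
            self.microformats += 1
        if RDFA_ATTRIBUTES.intersection(attrs):
            self.rdfa += 1


def has_semantic_data(html: str) -> bool:
    """True if the page contains anything worth lighting the button up for."""
    sniffer = SemanticSniffer()
    sniffer.feed(html)
    sniffer.close()
    return sniffer.microformats > 0 or sniffer.rdfa > 0


print(has_semantic_data('<div class="vcard"><span class="fn">Si</span></div>'))
print(has_semantic_data("<ul><li>just a list</li></ul>"))
```

The hard (and interesting) part is everything after detection: deciding which onward actions to offer for each kind of data.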

What goes in RDFa and what doesn't?

In theory, RDFa allows you to take any RDF triple and embed it contextually in a page. But do we really want to do this?
  • It makes the page bigger - bad for mobile devices which are unlikely to want it
  • The more RDFa you use, the more likely your HTML structure might be compromised by bad markup
  • And isn't that what we have full-fat RDF for?!
On the latter point, YES! We still have full RDF. Also, unless you're insane, you'll subscribe to the software principle of Don't Repeat Yourself (DRY). So there's a tension here. We don't want hardcore RDF in our HTML pages, and we don't want to repeat everything. What to do?

After talking a lot about this at ISWC2008, we think there are some guidelines to follow.
  1. You should use RDFa for simple things that a machine and a user might want to know
  2. You don't put complex RDF in RDFa - you still use RDF/XML for that.
  3. You SHOULD repeat yourself for the simple data you put in RDFa - DO replicate this in RDF/XML. We don't want machines to have to look in two places - but this is the only time you violate DRY
  4. Acceptable RDFa things are from ontologies like Dublin Core, FOAF, SIOC, and Geo - plus the simple bits from more complex Ontologies like Music, Programmes etc.
  5. Focus on things like document metadata (creator, relationships, document meaning etc) not on hardcore URI to URI mappings. Do that in pure RDF/XML.
Phew. It's too early for that to be a tested definitive list - but it feels right. RDFa is more useful when you think of it like a highly structured Microformat rather than actual RDF.

Conclusions

RDFa is now an official W3C recommendation. This means it's time to play. The biggest problem is it's never been tried and tested. It's been used on the common browsers, but we're messing with HTML here, and how older devices and browsers might handle it is anyone's guess - the biggest unknown being mobile phones and other portable devices.

Only one way to find out though - GET PLAYING! Start using really simple Dublin Core metadata in your page and blog to mark up titles, descriptions, people, meanings, tags etc - and see what happens - see if anyone complains, and try and fix the problems as they arise.

This is cool and new - but with RDFa and Microformats taking off (we hope) the practical web is going to be a much better place!

Sunday 26 October 2008

Excellent preso on Webapp Security by Simon Willison

Web Security Horror Stories

Monday 29 September 2008

CSS Systems

Here's an excellent presentation from Natalie Downe which she gave at BarCampLondon08 which I sadly had to miss.

I love the way she's really thinking about how to structure her CSS - it's something some of us do, but not all of us, and not enough.


CSS Systems

I'm in a Band!

Check us out! Then come see our gigs!

Wednesday 3 September 2008

is exploring all the cool places he's going to visit in NZ on Google Earth. It's like being here. But more rubbish.

Tuesday 5 August 2008

My preso to the BBC's Semantic Web interest group


Honeypot to Semantic Web interest group at the BBC

From: sicross, 1 day ago

Installing the Facebook Open Platform

Facebook is cool. It's a little past its best now (it may be seen as 2007's YoYo) but the technology underneath it is proven, scalable and solves many of the problems faced by any site wanting to introduce some 'magic social dust'.

Now, Facebook have released their Facebook Open Platform. This looks interesting. Not sure what I can do with it yet, but it's a nice thing to start messing around with - particularly the FBML and FBJS implementations - FBML especially, as it's a lovely thing which we can use for all kinds of simple site layout things.

It wasn't totally easy getting it up and running, but here's my method which you're welcome to follow if you'd like.

Background

I'm running Apache on CentOS with PHP 5.1.6 already installed and working with Apache. In this tutorial, I'll assume you have:
  • Apache 2+
  • PHP 5+ working with Apache
  • CentOS
  • FTP/sFTP access to your server
  • command line access with sudo permissions where needed (for restarting Apache and editing httpd.conf for example)
  • MySQL 5+ working with PHP

Getting Started

  1. Grab the code from Facebook: http://developer.facebook.com/fbopen/
  2. Unzip
  3. FTP to your server and create a folder in your webroot like 'fbop'. This will be where your instance of the platform lives, i.e. http://www.example.com/fbop/
  4. Upload the contents of the 'html' folder in the unzipped fb-open-platform folder into your new 'fbop' folder on your web server. If you now go to http://www.example.com/fbop/fbopentest/fbml.php you'll see a ton of errors. That's okay, we'll fix them in a bit.
  5. Now, choose where you want your facebook open platform libraries to live. Sensibly, this will be away from your webroot, but with all your other PHP libraries. I've chosen /var/www/fblib/lib. Go ahead and create this folder and upload the contents of the 'lib' folder from your unzipped fb-open-platform folder.
You should now have the following files uploaded to your webserver in roughly the following ways.

  • /var/www/fblib/lib/
    • api/
    • common.php
    • core/
    • display/
    • .....
  • /var/www/html/fbop/
    • api/
    • canvas.php
    • common/
    • fbml/
    • fbopentest/
    • js/
So far so good. But we still have all those errors at http://www.example.com/fbop/fbopentest/fbml.php

Configuring Apache

Facebook have been a bit weird and, rather than telling PHP where to look for includes like most apps, they've set a custom Apache variable which PHP then uses to find some files. This isn't easy info to find out by Googling, but it's very simple to implement.
  1. Open your httpd.conf file.
  2. At the bottom, or near your other non-standard config changes, add the line...
    SetEnv PHP_ROOT "/var/www/fblib"
  3. This sets a new environment variable in Apache called 'PHP_ROOT'. From within PHP this is now accessible using the $_SERVER["PHP_ROOT"] variable. The lack of this variable was the cause of all those errors we've been seeing.
  4. Restart Apache. For me on CentOS, this means running: sudo /sbin/service httpd restart
  5. You should now be able to test if this has worked by creating a new PHP file and adding (inside your PHP tags) this line: print_r($_SERVER); Opening that in a browser will list all the server variables available to PHP. Magically, one of them should be PHP_ROOT with the value you specified earlier. This is good.
Now, going to http://www.example.com/fbop/fbopentest/fbml.php will give us no errors! Sadly it won't do much else.

Configuring the database

Just like Facebook itself, running apps on the Platform requires the Platform to keep a list of all the applications running on it. In the Open Platform, this is held in a MySQL database. Helpfully, they provide you with enough data to get you started, you just have to load it into MySQL.
  1. Connect to MySQL using your favourite client - I'll use phpMyAdmin
  2. Create a user which we'll give to the Platform so it can access the database
  3. In the SQL Tab, copy and paste the contents of the file /fbopen_data_dump where fb-open-platform is the archive you downloaded and unzipped from Facebook
  4. Run the Query
You should see a database called 'fbopentest' has been created with 12 tables. If you can't see this, the query may have errored. Have a look at your debug and see why. In my case, phpMyAdmin threw an error at the comments which had three dashes preceding them '---' instead of the usual two '--'. Odd, but that's textual SQL for you. It never works first time. Changing those in the three lines in which it appeared got me up and running.
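I fixed those three lines by hand, but if you'd rather not, the same clean-up could be scripted. This is a small Python sketch which assumes the offending comment lines start with three or more dashes; the function name and the sample SQL are made up for illustration, and you'd feed it the text of the dump file before pasting the result into the SQL tab.

```python
import re


def fix_comment_dashes(sql: str) -> str:
    """Rewrite lines starting with three or more dashes as '-- ' comments.

    MySQL expects a comment to start with two dashes followed by a space,
    so '--- text' becomes '-- text'. Other lines are left untouched.
    """
    return re.sub(r"^([ \t]*)-{3,}[ \t]*", r"\1-- ", sql, flags=re.MULTILINE)


# Made-up sample in the same style as the problem lines
sample = (
    "--- Table structure\n"
    "CREATE TABLE t (id INT);\n"
    "-- this comment is already fine\n"
)
print(fix_comment_dashes(sample))
```

It only touches dashes at the start of a line, so dashes inside your data or statements are left alone.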

Now we have to configure the Platform to talk to the database.
  • On your web server, edit a file in the fb open platform lib directory - in our example, I mean this file: /var/www/fblib/lib/core/init.php
  • Change $DB_USERNAME to match the user you created. In my case it was 'fbopenplatform'
  • Change $DB_IP to match IP address of your database server. In most cases, this will just be 'localhost'
  • Change $DB_PASSWORD to match the user's password you created earlier. I'm not going to tell you what mine was ;-)
  • Save the file.
To test if this has been successful,

Hold up. You with me?

Before we go any further, let's test that the basics of the facebook open platform are up and running. In a browser, go to your http://www.example.com/fbop/fbopentest/ folder. You should see a nice list of test files we can use. At present, the most interesting one is http://www.example.com/fbop/fbopentest/fbml.php. You'll see that it loads, and you can see all the code if you view source, but it's not parsing it into HTML from FBML. That's because we haven't installed the FBML parsing libraries yet!

This is a little more tricky for the novice as we're not messing about with PHP anymore, but with proper Unix and C++ stuff. Still, let's have a go. I would, however, ensure you have a backup of your server or virtual machine. If this screws up, it has the capacity to ruin your install of a number of other things, as Sean B discovered.

Installing the FBML Parser


Back soon to finish this off.