Tuesday, June 20, 2006, 10:26 PM
- PhD
This is the first real result from Smeagol, where it actually makes a plan to learn and does succeed! 
I realise that this diagram is probably completely incomprehensible to anyone but me, so here is a quick explanation:
1. The "top-level goal" given in this case is to answer the query:
[] ql:select ( ?s ) ;
ql:where {
?s a bibtex:InProceedings ;
   foaf:maker ?a .
?a pers:expertIn wiki:Semantic_Web
} ;
ql:results ?r .
i.e. give all papers written by Semantic Web Experts.
2. Smeagol first makes the trivial plan to read some data (there are many issues about this, I skip them all), and perform the query.
3. The trivial plan fails (otherwise this would be boring), because there are no people who are Semantic Web experts who have written any papers (in my extremely artificial hand-made dataset). Papers written by other people DO exist...
4. Several plans are made that may introduce more triples matching any of the patterns evaluated in the query before it failed. In this particular case my heuristic reordered the query to be [(?s a InProceedings), (?a expertIn SemanticWeb), (?s maker ?a)], i.e. find all the people and papers first and do the join later. It was the very last pattern that failed to match, so further results for any of these patterns could be useful.
5. The easiest plan (the actions have weights) is chosen first: read the ontology for the bibtex classes and attempt to find more "things" that are "InProceedings" based on RDFS inference; maybe they are only explicitly declared to belong to a subclass of InProceedings, for example. Unfortunately, this is not the case, and this plan fails.
6. Smeagol returns to the second easiest plan, which involves actual learning: find the set of things that are already of type InProceedings => attempt to use ILP to learn a description of this set => use this description to classify further instances as InProceedings. Alas, this also fails, in this case because it could not find any good negative examples, but I won't go into that here.
7. Smeagol then moves on to the third easiest plan, which uses the same pattern as the previous plan but starts from the set of people who are Semantic Web experts instead. In this case the description learning DOES work and we learn the rule:
{ ?A <http://purl.org/swrc#conferenceChair> <http://iswc2004.org/> }
log:implies
{ ?A rabbi:learnedCategory "blah" }.
8. This rule is then used to classify more instances as Semantic Web experts, and since I very carefully constructed this example it finds one! :)
9. After the sub-plan has succeeded, Smeagol returns to the previous failed plan and re-tries the failed action, and now the query works. Hurrah!
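For anyone who prefers code to diagrams, here is a minimal Python sketch of the control flow in steps 2 to 9. It is not Smeagol's code: the function names, the tuple-based triples and the sample data are all made up for illustration, only two of the three repair plans are shown, and the "learned" rule is simply hard-coded rather than produced by ILP.

```python
# Toy illustration only: every name and the sample data are hypothetical,
# none of this is taken from Smeagol itself.

def query(triples):
    """Papers written by Semantic Web experts: join three triple patterns."""
    papers = {s for (s, p, o) in triples if p == "type" and o == "InProceedings"}
    experts = {s for (s, p, o) in triples if p == "expertIn" and o == "Semantic_Web"}
    return sorted(s for (s, p, o) in triples
                  if p == "maker" and s in papers and o in experts)

def rdfs_subclass_plan(triples):
    """Cheap plan: type things as InProceedings via subClassOf statements."""
    subclasses = {s for (s, p, o) in triples
                  if p == "subClassOf" and o == "InProceedings"}
    new = {(s, "type", "InProceedings") for (s, p, o) in triples
           if p == "type" and o in subclasses}
    return new - triples

def learned_rule_plan(triples):
    """Costlier plan: apply the rule 'chaired ISWC 2004 => expert' (hard-coded here)."""
    new = {(s, "expertIn", "Semantic_Web") for (s, p, o) in triples
           if p == "conferenceChair" and o == "iswc2004"}
    return new - triples

def run_with_repair(triples, action, repair_plans):
    """Try the action; while it fails, run repair plans cheapest first and retry."""
    triples = set(triples)          # work on a copy
    log = []
    result = action(triples)
    for weight, name, plan in sorted(repair_plans, key=lambda p: p[0]):
        if result:
            break
        new = plan(triples)
        log.append((name, bool(new)))   # a plan "succeeds" if it adds triples
        if new:
            triples |= new
            result = action(triples)    # re-try the failed action
    return result, log

data = {
    ("paper1", "type", "InProceedings"),
    ("paper1", "maker", "alice"),
    ("alice", "conferenceChair", "iswc2004"),
    ("bob", "expertIn", "Semantic_Web"),
    ("bob", "conferenceChair", "iswc2004"),
}
plans = [(2, "learn expert rule", learned_rule_plan),
         (1, "rdfs subclass inference", rdfs_subclass_plan)]
result, log = run_with_repair(data, query, plans)
print(result, log)
```

In this toy dataset the query fails at first, the RDFS plan adds nothing, the rule plan classifies alice as an expert, and the retried query then returns her paper.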
All in all this took a ridiculous number of lines of code, many hours of debugging, and the result still isn't very impressive, but hopefully I have ironed out most planning bugs now and I can get to work on creating the final set of examples that will finish off this PhD! :)




Sunday, June 18, 2006, 08:28 PM
- Semantic Web
ESWC is now over and it was great fun. I might write something later about interesting talks I saw and people I met, but here is something about a quick hack I did during some less interesting talk :) (This was also partly written while in Montenegro.)
Since Tom Heath kindly made the effort to eat some dog food and produced lots of RDF data about ESWC, I thought I should be a good semantic webber and try to do something with this data. (I won't try to extend the dog-food analogy... )
To get all the data I started with this url and crawled seeAlso links from there. This made sure I would also get some FOAF data for individual users where their profile was linked. When I got tired of watching it crawl I had crawled a few hundred RDF files and some 10,000 triples.
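The crawl itself is nothing more than a breadth-first walk over seeAlso links. A rough Python sketch, not the script I actually ran: `get_see_also` is a hypothetical callable that returns the seeAlso targets found in one document, and here it is faked with a dictionary so the example runs without a network or an RDF parser.

```python
from collections import deque

def crawl(start_url, get_see_also, max_docs=300):
    """Breadth-first crawl following seeAlso links, visiting each URL once."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_docs:   # max_docs = "when I got tired"
        url = queue.popleft()
        visited.append(url)
        for link in get_see_also(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

# Made-up link structure standing in for real RDF documents.
links = {
    "eswc": ["tom", "paper1"],
    "tom": ["eswc", "foaf-tom"],
    "paper1": ["tom"],
    "foaf-tom": [],
}
order = crawl("eswc", lambda url: links.get(url, []))
print(order)
```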
So, since I always do the same thing, I applied my trusty clustering and then ILP to learn descriptions of the clusters. At first I tried clustering the people, but got nothing interesting: I could recreate the organising committee and have all the other delegates cluster together, or I could do just the delegates and get one big cluster with most of them and a few people in clusters by themselves. Quickly giving this up, I tried to cluster the papers instead. This wasn't wildly successful either, but at least I recreated the demo session, poster session and main conference paper groups as clusters, excluding a few papers, which were clustered on their own:
- WikiFactory: a web ontology-based application for creating domain-oriented wikis
- Using Semantics to Enhance the Blogging Experience
- Towards A Complete OWL Ontology Benchmark
If anyone knows why these are 'magic' please let me know :)
If you really care you can also look at the full results here
After looking at the properties of the papers in Tom's data, it was obvious that it would be hard to get anything more sensible out of this, so I gave this up as well and went back to paying attention to the talks.
The only good bits to come out of this were that I streamlined getting 'nice' html output from the command line, so I can easily do it later, and that the lack of "topic" meta-data made me think of the application that in the end won me the iPod Nano in the ESWC Design Challenge! Yay!




Saturday, May 20, 2006, 03:15 PM
- Python
Today's pointless hack was brought on by the fact that I happened to be in possession of the complete dilbert comics up until 2005. Often I remember some particular strip, but have no way to sensibly search thousands of GIF files. (I *could* join comics.com, but that would be less fun, and I already have nearly all the dilbert books, so this info is sort of 'mine' already, ahem.) Also inspired by my recent re-reading of Hofstadter's Letter Spirit, I set to work. (Note that there are good commercial solutions for this, and probably lots of well-known algorithms etc., but I wanted to discover this myself. I did a quick google for tutorials on OCR, but nothing sensible came up.)
So, looking at some strips, the dilbert text seems to have many nice features for automatic extraction:
- The font is always the same size/type
- The letters are all capitals
- The lines of text are always straight
- There is always white-space behind the text
- There are very few punctuation marks or digits
There are some exceptions to all of these, but those bits I can live without:

So with the gimp, python, ImageMagick, Numeric and PIL I set to work. First I cut out one of each letter, and auto-cropped in the gimp:

The first plan was to scan every pixel of the big image for a match with every letter, a match being defined as the sum of the errors for the overlapping pixels being below a certain threshold. Since most dilberts (apart from sunday strips) are black and white anyway, I did everything in gray-scale mode and the pixel values are simply byte values. The threshold for a "match" was set at 40 after some trial and error.
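In code, the brute-force scan amounts to something like the sketch below. It is a reconstruction and not the original script: it assumes the image is a flat list of grey values indexed as y*w+x, that the "error" is the absolute difference per pixel, and the function names are made up. The threshold of 40 is the one mentioned above.

```python
def match_error(image, img_w, letter, let_w, let_h, x0, y0):
    """Sum of absolute pixel differences with the letter placed at (x0, y0)."""
    total = 0
    for y in range(let_h):
        for x in range(let_w):
            total += abs(image[(y0 + y) * img_w + (x0 + x)] - letter[y * let_w + x])
    return total

def find_matches(image, img_w, img_h, letter, let_w, let_h, threshold=40):
    """Return every (x, y) where the letter fits with an error below threshold."""
    hits = []
    for y0 in range(img_h - let_h + 1):
        for x0 in range(img_w - let_w + 1):
            if match_error(image, img_w, letter, let_w, let_h, x0, y0) < threshold:
                hits.append((x0, y0))
    return hits

# Tiny test image: 5x4 white pixels (255) with a black vertical bar at x=2, y=1..3.
W, H = 5, 4
image = [255] * (W * H)
for y in range(1, 4):
    image[y * W + 2] = 0

# A 2x3 "letter": one black column followed by one white column.
letter = [0, 255,
          0, 255,
          0, 255]
hits = find_matches(image, W, H, letter, 2, 3)
print(hits)
```

Every candidate position costs one pass over all the letter's pixels, which is why scanning a whole strip for every letter gets slow so quickly.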
This process was slow as cancer, even when I changed from python lists to Numeric arrays. (Later it turned out that using a flat list and computing the offset as y*w+x is actually quicker than Numeric arrays, even without psyco, odd...) The first attempt, for a single letter only, looked like this:

Tweaking the numbers for the "match" threshold and removing the testing for overlapping matches etc. sped it up a bit, and I soon discovered another problem, illustrated here by looking for matches for I's in the first frame:

The problem was that the tightly cropped I image was matching lots of sub-sections of other letters. Since the thing was still horribly slow I decided to try a slightly different approach: I would detect the base-lines of each line of text in the picture. Again, this should be pretty easy apart from a few comics where there is content next to the text, but I would worry about that later. First attempt at finding lines that were largely empty:

then group together blocks of lines and remove the ones too close to the top to fit a whole letter:
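A rough Python sketch of these two steps, written from the description and not copied from my script. It assumes the same flat grey-scale list as before; the function names, the `white` cut-off of 200 and the `max_dark` tolerance are all made-up illustration values.

```python
def empty_rows(image, w, h, white=200, max_dark=0):
    """Flag each pixel row that has at most max_dark pixels darker than `white`."""
    flags = []
    for y in range(h):
        dark = sum(1 for x in range(w) if image[y * w + x] < white)
        flags.append(dark <= max_dark)
    return flags

def empty_blocks(flags):
    """Group consecutive empty rows into (first_row, last_row) blocks."""
    blocks, start = [], None
    for y, is_empty in enumerate(flags + [False]):   # sentinel closes the last block
        if is_empty and start is None:
            start = y
        elif not is_empty and start is not None:
            blocks.append((start, y - 1))
            start = None
    return blocks

def candidate_baselines(blocks, letter_h):
    """Keep the top row of each block that leaves room for a whole letter above it."""
    return [top for top, _ in blocks if top >= letter_h]

# 4x10 test image: text (one dark pixel per row) on rows 2-4 and 6-8.
W, H = 4, 10
image = [255] * (W * H)
for y in (2, 3, 4, 6, 7, 8):
    image[y * W + 1] = 0

flags = empty_rows(image, W, H)
blocks = empty_blocks(flags)
baselines = candidate_baselines(blocks, letter_h=3)
print(blocks, baselines)
```

Raising `max_dark` is one way to make "largely empty" more forgiving when a speech bubble border or some drawing crosses the gap between two lines of text.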

Now I retried the above matching of letters, trying each letter on every possible X position along each line, then ordering the matches by X value.
This produces the first real result, i.e. it spat out some text:
vititingacuitmer
visiting a customer
odurodftfitoas]
our office was
iodesitgneodwithltfhe
designed with the
citenceocgepfecghuitl
science of feng shui
The correct answer follows each line of output. Interestingly this produces more letters than in the original :) So I tried grouping the letters that are duplicates, i.e. both were detected in the same place:
v<it><it>ing a cu<i t>mer
<od>ur <od>f<tf><it>o as
<iod>es<it>gne<od> w<it>h lt<fh>e
c<it>ence <ocg><ep> fe<cg> hu<itl>
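The grouping is easy to sketch in Python. This is an illustration and not the original code: the detections are assumed to be (x, letter) pairs for one line of text, the `tolerance` of 2 pixels is a made-up value, and word spacing is ignored.

```python
def group_detections(detections, tolerance=2):
    """Merge letters detected at (nearly) the same x into one ambiguous group."""
    groups = []                      # list of (x, [letters])
    for x, letter in sorted(detections):
        if groups and x - groups[-1][0] <= tolerance:
            groups[-1][1].append(letter)
        else:
            groups.append((x, [letter]))
    return groups

def render(groups):
    """Print single letters as-is and ambiguous groups as <...>."""
    return "".join(ls[0] if len(ls) == 1 else "<" + "".join(ls) + ">"
                   for _, ls in groups)

# Made-up x positions for the start of the first line of output.
detections = [(0, "v"), (8, "i"), (9, "t"), (16, "i"), (17, "t"),
              (24, "i"), (32, "n"), (40, "g")]
text = render(group_detections(detections))
print(text)
```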
It's not quite useful yet... even if I generated all possible words for each ambiguous character I wouldn't get anything sensible.
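To see why, here is a small Python sketch that expands every ambiguous group into all possible strings and checks them against a word list. The word list and the helper names are made up for illustration; the groups are the ones from "v<it><it>ing" above.

```python
from itertools import product

def candidate_words(groups):
    """All strings obtainable by picking one letter from each group."""
    return ["".join(choice) for choice in product(*groups)]

def plausible(groups, vocabulary):
    """Keep only the candidates that are real words."""
    return [w for w in candidate_words(groups) if w in vocabulary]

groups = ["v", "it", "it", "i", "n", "g"]      # one string of options per position
words = candidate_words(groups)
matches = plausible(groups, {"visiting", "customer"})
print(words, matches)
```

None of the four candidates is "visiting", because the S was never detected in the first place, so no amount of dictionary lookup can rescue it.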
Also, there is clearly a big problem in recognising S's and in telling Is and Ts apart. Maybe I shouldn't disallow overlapping boxes... this made it slightly slower again, but didn't improve results.
So after two evenings of hacking I now had some random text that was completely useless. My instinct told me I had probably taken the simple sum-of-error approach as far as it would go, and now - there is a fork in the path:
1. Keep considering the letters independently, but let the program learn, i.e. sit through a few sessions of : "i reckon this is an 'S', no that's a 'Z', try harder.... "
2. Make the matching pattern of each letter more aware of the special features of each letter, i.e. ignore the bits the I and T have in common, focus on the difference. Not sure how to do this when 3 (or more) letters match the same thing... I would probably just try a dodgy hack and see where I get to.
3. Neural networks do option 2 much better than I can ever hope to.
I've done some more work on this now, and I will shortly follow it up with a chapter II! :)
I was a bit late in typing up this part I, and I needed a Saturday to find the time (the need to prepare 150 slides on Semantic Web Services for Monday also made it easy to want to do other things...)
PS: Leo was convinced I was wasting my time (BUT IT'S THE JOURNEY NOT THE DESTINATION!), and tried OmniPage Pro on some dilbert cartoons, and it sucked! Ha! This isn't completely pointless after all!




Wednesday, May 17, 2006, 07:32 PM
- Gnowsis, Python, Javascript
This is kinda lame since every sensible web-application has this feature, but I just finished coding javascript autocompletion for my rewrite of the gnowsis GUI as a glorious python web.py HTML hack (which I will release and blog about shortly, when it's slightly nicer). Lovely screenshot here:
Most of this was stolen from gadgetopia, but I adapted it to keep track of URIs as well as labels.




Tuesday, May 9, 2006, 12:36 PM
- Everything Else
So my scheduling life is nearly complete: I have all my calendars available everywhere and I have the quick add extension for adding events at any time. Now for the problems:
- Quick Add ONLY adds to my default calendar; I want to be able to enter: "DFKI: Gnowsis Meeting 15pm" and have it added to my DFKI calendar.
- The "German Holiday" from google sucks - it doesn't have any of the days I want (AND the description says 2005). Also, I cannot get google to read this.
- Adding the calendars from google-calendar to ical works fine, BUT they are read only - why can't I use webdav and edit them with my google username/password?
- Now, most importantly: Quick add is great, and close to perfection, but I've now booked a ryanair flight, and I get this in an email:
From Hahn Frankfurt(HHN) to Torp Oslo(TRF)
Sat, 03Jun06 Flight FR9822 Depart HHN at 07:05 and arrive TRF at 09:00
Putting this into quick-add as is gives me an event TODAY. Not good. Inserting some spaces to make it "03 Jun 06" gets the date right, but ignores the time and the year, leaving these in the description and creating an all-day event.
Having tried tens of different formats I finally got the right thing to happen by changing "Depart HHN at 07:05 and arrive TRF at 09:00" to 7:05-9:00 and moving it in front of the date. That is, unfortunately, more hassle than entering the event by hand.
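One way around the hassle would be to pre-process the email line myself before it goes anywhere near a calendar. A hedged Python sketch: the regular expression is fitted to the single email line quoted above, so other airlines or layouts will not match, the helper names are made up, and it assumes arrival is on the same day as departure and ignores time zones.

```python
import re
from datetime import datetime

# Month names spelled out here so the parse does not depend on the locale.
MONTHS = {m: i for i, m in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

FLIGHT_RE = re.compile(
    r"(?P<day>\d{2})(?P<mon>[A-Za-z]{3})(?P<year>\d{2})\s+"
    r"Flight\s+(?P<flight>\w+)\s+"
    r"Depart\s+(?P<src>[A-Z]{3})\s+at\s+(?P<dep>\d{2}:\d{2})\s+"
    r"and\s+arrive\s+(?P<dst>[A-Z]{3})\s+at\s+(?P<arr>\d{2}:\d{2})")

def parse_flight(line):
    """Turn one confirmation line into a title plus start/end datetimes."""
    m = FLIGHT_RE.search(line)
    if m is None:
        return None
    year = 2000 + int(m.group("year"))          # two-digit year, assumed 20xx
    month = MONTHS[m.group("mon").capitalize()]
    day = int(m.group("day"))

    def at(hhmm):
        hour, minute = map(int, hhmm.split(":"))
        return datetime(year, month, day, hour, minute)

    title = "%s %s-%s" % (m.group("flight"), m.group("src"), m.group("dst"))
    return {"title": title, "start": at(m.group("dep")), "end": at(m.group("arr"))}

event = parse_flight(
    "Sat, 03Jun06 Flight FR9822 Depart HHN at 07:05 and arrive TRF at 09:00")
print(event)
```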
(Doing all this I noticed another feature that would have been nice: if I could click the "your event has been created" message and go straight to that event)







