Saturday, May 20, 2006, 03:15 PM
- Python
Today's pointless hack was brought on by the fact that I happened to be in possession of the complete dilbert comics up until 2005. Often I remember some particular strip, but have no way to sensibly search thousands of GIF files. (I *could* join comics.com, but that would be less fun, and I already have nearly all the dilbert books, so this info is sort of 'mine' already, ahem.) Also inspired by my recent re-reading of Hofstadter's Letter Spirit, I set to work. (Note that there are good commercial solutions for this, and probably lots of well-known algorithms etc., but I wanted to discover this myself. I did a quick google for tutorials on OCR, but nothing sensible came up.)
So, looking at some strips, the dilbert text seems to have many nice features for automatic extraction:
- The font is always the same size/type
- the letters are all capitals
- the lines of text are always straight
- there is always white-space behind the text
- There is very little punctuation or digits.
There are some exceptions to all of these, but these bits I can live without.

So with the gimp, python, ImageMagick, Numeric and PIL I set to work. First I cut out one of each letter, and auto-cropped in the gimp:

The first plan was to scan every pixel of the big image for a match with every letter, a match being defined as the sum of the errors for the overlapping pixels being below a certain threshold. Since most dilberts (apart from sunday strips) are black and white anyway, I did everything in gray-scale mode and the pixel values are simply byte values. The threshold for a "match" was set at 40 after some trial and error.
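A minimal sketch of this kind of sum-of-error matching, not the original code: it assumes the error is the absolute difference per pixel and that both images are flat, row-major lists of gray byte values. The names `match_error` and `find_matches` are made up for illustration; only the threshold of 40 comes from the text above.

```python
def match_error(image, img_w, template, tpl_w, tpl_h, x0, y0):
    """Sum of absolute pixel differences with the template's top-left at (x0, y0)."""
    total = 0
    for ty in range(tpl_h):
        for tx in range(tpl_w):
            page_pixel = image[(y0 + ty) * img_w + (x0 + tx)]
            total += abs(page_pixel - template[ty * tpl_w + tx])
    return total


def find_matches(image, img_w, img_h, template, tpl_w, tpl_h, threshold=40):
    """Return every (x, y) position whose summed error is below the threshold."""
    hits = []
    for y0 in range(img_h - tpl_h + 1):
        for x0 in range(img_w - tpl_w + 1):
            if match_error(image, img_w, template, tpl_w, tpl_h, x0, y0) < threshold:
                hits.append((x0, y0))
    return hits


# Toy example: a 2x2 black square on a 5x4 white page.
WHITE, BLACK = 255, 0
page_w, page_h = 5, 4
page = [WHITE] * (page_w * page_h)
for y in (1, 2):
    for x in (2, 3):
        page[y * page_w + x] = BLACK
square = [BLACK] * 4

print(find_matches(page, page_w, page_h, square, 2, 2))  # [(2, 1)]
```

Every candidate position costs one pass over the whole template, so the work grows with page size times template size times the number of letters.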
This process was slow as cancer, even when I changed from python lists to Numeric arrays. (Later it turned out that using a flat list and computing the offset as y*w+x is actually quicker than Numeric arrays, even without psyco, odd...) The first attempt, for a single letter only, looked like this:

Tweaking the numbers for the "match" threshold, removing the test for overlapping matches etc. sped it up a bit, and I soon discovered another problem, illustrated here by looking for matches for I's in the first frame:

The problem was that the tightly cropped I image was matching lots of sub-sections of other letters. Since the thing was still horribly slow I decided to try a slightly different approach: I would detect the base-line of each line of text in the picture. Again, this should be pretty easy apart from a few comics where there is content next to the text, but I would worry about that later. First attempt at finding lines that were largely empty:

Then group together blocks of lines and remove the ones too close to the top to fit a whole letter:
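A rough sketch of these two steps, under my own reading of them rather than the original code: a row counts as "empty" when it has at most `max_ink` non-white pixels, consecutive empty rows are grouped into runs, and a run is kept as a base-line only if a letter of height `letter_h` fits above it. All names and the toy image are made up for illustration.

```python
def empty_rows(image, w, h, white=255, max_ink=0):
    """Indices of rows containing at most max_ink non-white pixels."""
    rows = []
    for y in range(h):
        ink = sum(1 for x in range(w) if image[y * w + x] != white)
        if ink <= max_ink:
            rows.append(y)
    return rows


def group_runs(rows):
    """Group sorted row indices into (first, last) runs of consecutive rows."""
    runs = []
    for y in rows:
        if runs and y == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], y)
        else:
            runs.append((y, y))
    return runs


def baselines(image, w, h, letter_h):
    """Top row of each empty run that leaves room for a whole letter above it."""
    runs = group_runs(empty_rows(image, w, h))
    return [first for first, _last in runs if first >= letter_h]


# Toy page, 3 pixels wide and 8 rows tall: ink on rows 1-2 and 5-6.
strip_w, strip_h = 3, 8
strip = [255] * (strip_w * strip_h)
for y in (1, 2, 5, 6):
    for x in range(strip_w):
        strip[y * strip_w + x] = 0

print(baselines(strip, strip_w, strip_h, letter_h=2))  # [3, 7]
```

The empty run at row 0 is dropped because no letter of height 2 fits above it.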

Now I retried the above matching of letters, trying each letter at every possible X position along each line, then ordering the matches by X value.
This produced the first real result, i.e. it spat out some text:
vititingacuitmer
(visiting a customer)
odurodftfitoas]
(our office was)
iodesitgneodwithltfhe
(designed with the)
citenceocgepfecghuitl
(science of feng shui)
The correct answer follows each line in parentheses. Interestingly this produces more letters than in the original :) So I tried grouping the letters that are duplicates, i.e. both were detected in the same place:
v<it><it>ing a cu<i t>mer
<od>ur <od>f<tf><it>o as
<iod>es<it>gne<od> w<it>h lt<fh>e
c<it>ence <ocg><ep> fe<cg> hu<itl>
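The grouping can be sketched roughly like this. It assumes each detection is an `(x, letter)` pair and that "the same place" means within a couple of pixels of the first hit in the group; the `tolerance` value and the sample hits are invented, and word gaps are ignored.

```python
def group_detections(detections, tolerance=2):
    """Group (x, letter) hits whose x lies within tolerance of the group's first x."""
    groups = []
    for x, letter in sorted(detections):
        if groups and x - groups[-1][0] <= tolerance:
            groups[-1][1].append(letter)
        else:
            groups.append((x, [letter]))
    return groups


def render(groups):
    """Write a lone letter as itself and an ambiguous group as <candidates>."""
    parts = []
    for _x, letters in groups:
        if len(letters) == 1:
            parts.append(letters[0])
        else:
            parts.append("<" + "".join(letters) + ">")
    return "".join(parts)


# Invented hits for the start of "visiting", with I/T detected twice.
hits = [(0, "v"), (10, "i"), (11, "t"), (20, "i"), (20, "t"),
        (30, "i"), (40, "n"), (50, "g")]

print(render(group_detections(hits)))  # v<it><it>ing
```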
It's not quite useful yet... even if I generated all possible words for each ambiguous character I wouldn't get anything sensible.
Also, there is clearly a big problem in recognising S's and in telling Is and Ts apart. Maybe I shouldn't disallow overlapping boxes... this made it slightly slower again, but didn't improve results.
So after two evenings of hacking I now had some random text that was completely useless. My instinct told me I had probably taken the simple sum-of-error approach as far as it would go, and now - there is a fork in the path:
1. Keep considering the letters independently, but let the program learn, i.e. sit through a few sessions of: "I reckon this is an 'S', no that's a 'Z', try harder..."
2. Make the matching pattern of each letter more aware of the special features of each letter, i.e. ignore the bits the I and T have in common, focus on the difference. Not sure how to do this when 3 (or more) letters match the same thing... I would probably just try a dodgy hack and see where I get to.
3. Neural networks do 2 much better than I can ever hope to.
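For option 2, one possible dodgy hack is a difference mask. This is only a sketch of the idea, not something the program above does: it assumes the two templates have been padded to the same size (which the tightly cropped letters are not), and all names and the 3x3 toy letters are made up.

```python
def difference_mask(tpl_a, tpl_b, min_diff=128):
    """Pixel indices where two same-sized templates disagree strongly."""
    return [i for i, (a, b) in enumerate(zip(tpl_a, tpl_b)) if abs(a - b) >= min_diff]


def masked_error(patch, template, mask):
    """Sum of absolute errors over the masked pixels only."""
    return sum(abs(patch[i] - template[i]) for i in mask)


def pick(patch, tpl_a, tpl_b, name_a, name_b):
    """Choose between two candidates using only the pixels where they differ."""
    mask = difference_mask(tpl_a, tpl_b)
    err_a = masked_error(patch, tpl_a, mask)
    err_b = masked_error(patch, tpl_b, mask)
    return name_a if err_a <= err_b else name_b


# Toy 3x3 letters as flat lists: 0 is ink, 255 is paper.
LETTER_I = [255, 0, 255,
            255, 0, 255,
            255, 0, 255]
LETTER_T = [0,   0, 0,
            255, 0, 255,
            255, 0, 255]
noisy_t = [10,  0,  20,
           255, 30, 255,
           240, 0,  255]

print(difference_mask(LETTER_I, LETTER_T))             # [0, 2]
print(pick(noisy_t, LETTER_I, LETTER_T, "I", "T"))     # T
```

The shared stem contributes nothing to the decision; only the two crossbar pixels count. With three or more candidates, each pair would need its own mask.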
I've done some more work on this now, and I will shortly follow it up with a chapter II! :)
I was a bit late in typing up this part I: I needed a Saturday to find the time (and the need to prepare 150 slides on Semantic Web Services for Monday made it easy to want to do other things...)
PS: Leo was convinced I was wasting my time (BUT IT'S THE JOURNEY NOT THE DESTINATION!), and tried OmniPage Pro on some dilbert cartoons, and it sucked! Ha! This isn't completely pointless after all!




Wednesday, May 17, 2006, 07:32 PM
- Gnowsis, Python, Javascript
This is kinda lame since every sensible web-application has this feature, but I just finished coding javascript autocompletion for my rewrite of the gnowsis GUI as a glorious python web.py HTML hack (which I will release and blog shortly, when it's slightly nicer). Lovely screenshot here: 
Most of this was stolen from gadgetopia, but I adapted it to keep track of URIs as well as labels.




Tuesday, May 9, 2006, 12:36 PM
- Everything Else
So my scheduling life is nearly complete: I have all my calendars available everywhere and I have the quick add extension for adding events at any time. Now for the problems:
- Quick Add ONLY adds to my default calendar, I want to be able to enter: "DFKI: Gnowsis Meeting 15pm" and have it added to my DFKI calendar.
- The "German Holiday" from google sucks - it doesn't have any of the days I want (AND the description says 2005). Also, I cannot get google to read this
- Adding the calendars from google-calendar to ical works fine, BUT they are read only - why can't I use webdav and edit them with my google username/password?
- Now, most importantly: Quick add is great, and close to perfection, but I've now booked a ryanair flight, and I get this in an email:
From Hahn Frankfurt(HHN) to Torp Oslo(TRF)
Sat, 03Jun06 Flight FR9822 Depart HHN at 07:05 and arrive TRF at 09:00
Putting this into quick-add as is gives me an event TODAY. Not good. Inserting some spaces to make it "03 Jun 06" gets the date right, but ignores the time and the year, leaving these in the description and creating an all-day event.
Having tried tens of different formats I finally got the right thing to happen by changing "Depart HHN at 07:05 and arrive TRF at 09:00" to 7:05-9:00 and moving it in front of the date. That is, unfortunately, more hassle than entering the event by hand.
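The rewrite could be scripted, at least for this one email layout. A sketch only: the regular expression is fitted to the single ryanair line quoted above, the exact output layout is my guess at "time range in front of the spaced-out date", and whether Quick Add accepts precisely this string is untested. `to_quick_add` and `short_time` are made-up names.

```python
import re

# Fitted to the one sample line above; other airlines or layouts will not match.
FLIGHT_LINE = re.compile(
    r"\w{3}, (?P<day>\d{2})(?P<mon>[A-Za-z]{3})(?P<yr>\d{2}) "
    r"Flight (?P<flight>\w+) Depart (?P<src>\w{3}) at (?P<dep>\d{2}:\d{2}) "
    r"and arrive (?P<dst>\w{3}) at (?P<arr>\d{2}:\d{2})"
)


def short_time(hhmm):
    """Turn '07:05' into '7:05' without mangling times like '00:30'."""
    hours, minutes = hhmm.split(":")
    return "%d:%s" % (int(hours), minutes)


def to_quick_add(line):
    """Rewrite a flight line with the time range before the date, or return None."""
    m = FLIGHT_LINE.search(line)
    if m is None:
        return None
    return "Flight %s %s to %s %s-%s %s %s %s" % (
        m.group("flight"), m.group("src"), m.group("dst"),
        short_time(m.group("dep")), short_time(m.group("arr")),
        m.group("day"), m.group("mon"), m.group("yr"),
    )


email_line = "Sat, 03Jun06 Flight FR9822 Depart HHN at 07:05 and arrive TRF at 09:00"
print(to_quick_add(email_line))  # Flight FR9822 HHN to TRF 7:05-9:00 03 Jun 06
```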
(Doing all this I noticed another feature that would have been nice: if I could click the "your event has been created" message and go straight to that event)




Saturday, May 6, 2006, 06:10 PM
- CaseMod

The start of this project was finding this old "portable" potentiometer which the chemistry department in Aberdeen threw out (at least I hope they meant to - otherwise I stole it). The portable'ness of the thing meant you could lift it, and it had room for 4 single A batteries; however, it weighed a ton. Among its other interesting features, this box was "open-source", i.e. the wiring diagram and printing instructions were printed on the inside of the cover.

With a mix of brute-force, power-tools, manual sawing and a lot of un-screwing I got most of the original components out, with only a few months of waiting while I ordered imperial Allen keys on ebay and had them shipped to a friend in the UK, who then forgot to send them to me for a long time. Getting all the old junk out made room inside for my VIA EPIA ML6000 fanless Mini-ITX motherboard and the tiny AC-DC converter.
The goal of this whole operation was to create a silent PC that could always be on and could keep all my mp3s and movies, so it was necessary to put two 250GB harddisks in as well. Luckily the harddisks just about fit in, although there isn't much room for the IDE cables; I think I shall try cutting the second connector off an IDE cable to see if it still works. I also hope the harddisks won't get too hot. I was really hoping to keep the whole setup fanless, but if that fails I can always have a half-speed fan at the back, since the old battery compartment leaves a hole for ventilation.

The other problem was that the power-supply had only 1 harddisk/cdrom power connector (i.e. the 12v 4-pin things), which I found odd since the 80W AC-DC adaptor claimed it could power a HD and a full-size CDROM. Since I wanted this to work today there was no time to order a splitter, so I cut a bit off an old normal-size power-supply, and compensated for the lack of a female plug (well, the plastic bit is female, the pins male) with creative soldering and four nails. When finished it looked slightly like a torture instrument... alas 12V is gonna make anyone scream though.

Finally, I installed debian. I thought I might have some problems with the strange hardware, but everything worked straight out of the box and the whole process took less than 30 minutes. Then I apt-got samba and was listening to my mp3s from my laptop in less than 1 hour! (if it wasn't for MacOSX than http://www.macworld.com/forums/ubbthrea ... )

Now all I need is a sensible backup solution. Burning DVDs is really not an option. Anyone?




Tuesday, April 4, 2006, 11:32 PM
- PhD, Machine Learning
Since 3 is a magic number I'd really like to have 3 different learning algorithms used in Smeagol. Currently I have ILP and HAC-clustering, both applied in several different ways. Sequence/Basket analysis seems like a good candidate for the third algorithm, since it's the only area of ML not covered yet. (ILP covering classification and more...) Sequence analysis would of course require a time dimension to the data, which I'd really rather not get into, AND it was probably covered pretty well by Heather Maclaren.
Basket analysis is left, and my first attempt was quickly hacked up using Orange. The things in my baskets are predicate-value pairs, and each person becomes a basket on their own. I tried this on several data-sets I had lying around; here are some quick and dirty results:
A small subset of my IMDB Data (3534 triples) gave me:
rdf#type IMDB#Movie -> IMDB#languages English
and
rdf#type IMDB#Movie -> IMDB#country_USA
My email from the last 5 years as crawled by aperture (127615 triples) gave me the fascinating rule:
aperture:mimeType message/rfc822 -> rdf#type imap/Message
A subset of some old FOAF crawl stolen from JibberJim years ago gave me:
jim#isKnownBy norman.walsh#norman-walsh -> rdf#type foaf/Person
yes - fascinating indeed. I also found the Norman Walsh rule using ILP years ago, at least running this one was pretty fast.
I'm not sure what to conclude from this - none of the rules are groundbreaking OR that interesting. Maybe I can tweak the way items are represented, using just values or just predicates for example. I'll see tomorrow.
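For the record, rules of this shape fall out of plain support and confidence counting. The sketch below is not the Orange code used above: it is a hand-rolled one-item-to-one-item rule miner, the thresholds are arbitrary, and the four tiny baskets are invented to mimic the IMDB case.

```python
from itertools import permutations


def mine_rules(baskets, min_support=0.5, min_confidence=0.9):
    """Return (antecedent, consequent, support, confidence) for single-item rules."""
    n = len(baskets)
    items = sorted(set().union(*baskets))
    rules = []
    for a, b in permutations(items, 2):
        with_a = sum(1 for basket in baskets if a in basket)
        with_both = sum(1 for basket in baskets if a in basket and b in basket)
        if with_a == 0:
            continue
        support = with_both / n          # share of all baskets holding both items
        confidence = with_both / with_a  # share of a-baskets that also hold b
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return rules


# Invented baskets of (predicate, value) pairs, one basket per resource.
MOVIE = ("rdf#type", "IMDB#Movie")
ENGLISH = ("IMDB#languages", "English")
GERMAN = ("IMDB#languages", "German")
toy_baskets = [{MOVIE, ENGLISH}, {MOVIE, ENGLISH}, {MOVIE, GERMAN}, {MOVIE, ENGLISH}]

for a, b, support, confidence in mine_rules(toy_baskets, 0.5, 0.7):
    print("%s %s -> %s %s (support %.2f, confidence %.2f)" % (a + b + (support, confidence)))
```

When nearly every basket contains the same type triple, rules pointing at or away from it will always score well, which may be part of why the results above look so obvious.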
I also had a brain-storming session with myself and some gin'n'tonic today, and if I don't finish this PhD it's because the table wasn't big enough: