UIMA introduction and reactions

Right now I’m working on learning more about UIMA. What is UIMA, and why would you want to learn it? Well, if you work in the computational linguistics field, would you be surprised to know that UIMA is what powered Watson on Jeopardy? The head guy over at IBM in charge of Watson development is also the head guy of UIMA development, so it’s no shocker that they ended up using it. UIMA stands for “Unstructured Information Management Architecture”, and it is basically a “standard” way to create pipelines for data to travel through.

For example, let’s say you want to do proper name classification, but your final algorithm relies on POS tags and syntax trees. How do you go from raw data to data that your classifier can train on and classify? When I was working on my undergrad degree I took a senior CS course that used an XML-driven framework that I think came from Berkeley. The framework allowed us to define in XML which classes to instantiate to handle the data in whatever form it was in. For example, there would be a “reader” that would read in a corpus, create the syntax tree objects and then pass them off to the classifier or “trainer” or what have you. The framework forced us to code to a standard, so all our models and classifiers were interchangeable in each other’s XML files. It also allowed us to use data that was either structured or unstructured. UIMA appears to be something like this.
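To make the XML-driven idea concrete, here is a minimal Python sketch of that kind of framework. It is not the course framework and it is not UIMA: the `<pipeline>`/`<stage>` element names, the three stage classes and the capitalization "tagger" are all made up for illustration. The only point is that the XML names the classes and the runner wires them together in order.

```python
import xml.etree.ElementTree as ET

# Hypothetical pipeline description; the element and attribute names are invented.
PIPELINE_XML = """
<pipeline>
  <stage class="WhitespaceReader"/>
  <stage class="ToyPosTagger"/>
  <stage class="ProperNameClassifier"/>
</pipeline>
"""


class WhitespaceReader:
    """Reader: turns raw text into a list of token records."""

    def process(self, data):
        return [{"token": tok} for tok in data.split()]


class ToyPosTagger:
    """Stand-in for a real POS tagger: capitalized tokens become NNP, the rest X."""

    def process(self, data):
        for rec in data:
            rec["pos"] = "NNP" if rec["token"][:1].isupper() else "X"
        return data


class ProperNameClassifier:
    """Final stage: relies on the POS tags produced upstream."""

    def process(self, data):
        return [rec["token"] for rec in data if rec["pos"] == "NNP"]


# Registry mapping the class names used in the XML to actual classes.
REGISTRY = {
    cls.__name__: cls
    for cls in (WhitespaceReader, ToyPosTagger, ProperNameClassifier)
}


def build_pipeline(xml_text):
    """Instantiate one stage object per <stage> element, in document order."""
    root = ET.fromstring(xml_text)
    return [REGISTRY[stage.get("class")]() for stage in root.findall("stage")]


def run_pipeline(stages, raw_text):
    """Feed the output of each stage into the next one."""
    data = raw_text
    for stage in stages:
        data = stage.process(data)
    return data


if __name__ == "__main__":
    stages = build_pipeline(PIPELINE_XML)
    print(run_pipeline(stages, "yesterday Watson beat Ken on television"))
```

Because every stage exposes the same `process` method, swapping in a different tagger or classifier only means changing a class name in the XML, which is the interchangeability I liked about that framework.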

I’m still learning, so I may have some concepts wrong, but I hope to walk through learning UIMA with you. I started by reading the first page at http://uima.apache.org/ which describes the project like this:

“The frameworks support configuring and running pipelines of Annotator components. These components do the actual work of analyzing the unstructured information. Users can write their own annotators, or configure and use pre-existing annotators. Some annotators are available as part of this project; others are contained in various repositories on the Internet.”
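Here is how I picture a pipeline of annotators, as a rough Python sketch. This is my own toy model and not the UIMA API: `Document`, `Annotation`, `TokenAnnotator` and `CapitalizedAnnotator` are hypothetical names, and the real framework has its own data structures and descriptors that this does not reproduce. The idea it shows is that each annotator adds labeled spans over the same unchanged text, and later annotators can build on what earlier ones added.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    kind: str   # label such as "Token" (hypothetical type names)
    begin: int  # start offset into the document text
    end: int    # end offset (exclusive)


@dataclass
class Document:
    text: str
    annotations: list = field(default_factory=list)

    def covered_text(self, ann):
        """Return the slice of text an annotation points at."""
        return self.text[ann.begin:ann.end]

    def select(self, kind):
        """Return a new list of all annotations with the given label."""
        return [a for a in self.annotations if a.kind == kind]


class TokenAnnotator:
    """Marks every run of word characters as a Token."""

    def process(self, doc):
        for match in re.finditer(r"\w+", doc.text):
            doc.annotations.append(Annotation("Token", match.start(), match.end()))


class CapitalizedAnnotator:
    """Builds on Token annotations: marks tokens that start with an uppercase letter."""

    def process(self, doc):
        for tok in doc.select("Token"):
            if doc.covered_text(tok)[0].isupper():
                doc.annotations.append(Annotation("Capitalized", tok.begin, tok.end))


def run(annotators, text):
    """Run each annotator in order over one shared document."""
    doc = Document(text)
    for annotator in annotators:
        annotator.process(doc)
    return doc


if __name__ == "__main__":
    doc = run([TokenAnnotator(), CapitalizedAnnotator()], "IBM built Watson")
    for ann in doc.select("Capitalized"):
        print(ann.begin, ann.end, doc.covered_text(ann))
```

Keeping the annotations as offsets into the text, rather than rewriting the text itself, is what lets independently written annotators be mixed and matched in this sketch.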

UIMA looks like it has expanded on the idea of an XML framework to include lots of other options, along with specific Java and C++ frameworks for writing your code. Something that has me really excited is that UIMA also has a server component that allows REST-based annotation. Oh, the possibilities! To get started with UIMA, I began going through their documentation.

The two tutorials / documentation pages that I found most useful were the overview and setup guide and the tutorial on creating your own annotator. It took me a few hours to go through them, but I felt they got me started really quickly with both the background / theory of UIMA and how to write a simple annotator. KEY NOTE: as of this writing, the actual UIMA update site is http://www.apache.org/dist/uima/eclipse-update-site/ and not the one given in their tutorials.

UIMA is the real deal: a way to create a pipeline of annotation and analysis from start to finish. While I have personal concerns about the whole “Watson” incident (I don’t think it is fair to compare humans and computers on speed, and I found the whole thing a little silly), I have a growing respect for the team that developed UIMA in conjunction with Watson. There are even tools to package your analysis engines and distribute them so that they just plug into another person’s UIMA pipeline. I have to admit I’m sort of geeking out about this.

For a special research project during my undergrad I wrote a sentence analysis tool that combined part-of-speech tagging, syntactic parsing, semantic parsing, morphological parsing and a feature analysis tool that used all of those to discover grammatical features of sentences. I had to tie together multiple different engines, create readers and basically do everything that UIMA would have helped me do in a more efficient and portable way. Granted, when I did this project IBM had just barely started working on UIMA, but still. The idea that it exists now fills my mind with possibilities.

How would you use UIMA? Do you use it? If so, how?
