UIMA stands for Unstructured Information Management Architecture. It’s an Apache technology that provides a framework and standard for building text analytics applications. I’ve mentioned it before.
In this post, I want to talk about an area of UIMA that isn’t covered well in the documentation: running UIMA-AS annotators in parallel.
I couldn’t find practical getting-started instructions for this, so I’ll discuss why you might want to do it and share some simple sample code to show how.
Background – the UIMA pipeline
UIMA provides a framework for managing a text analytics application. You break up the analytics functionality into discrete pieces called annotators. UIMA takes care of moving a text document through an analysis engine: a pipeline containing a series of annotators.
A document goes in one end of the pipeline, passes through a number of annotators, each of which adds some metadata to the document. What comes out the other side of the pipeline is an annotated copy of the document.
By default, UIMA runs these annotators one at a time, one after another.
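To make the sequential model concrete, here is a minimal conceptual sketch in plain Java. It is not the UIMA API: the names Doc, Annotator and run are hypothetical stand-ins I've made up for illustration, and the "annotations" are just strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Conceptual sketch only: plain Java, not the UIMA API.
// Doc, Annotator and run are hypothetical names used for illustration.
public class SequentialPipelineSketch {

    // Stand-in for a document travelling through the pipeline.
    static final class Doc {
        final String text;
        final List<String> annotations = new ArrayList<>();

        Doc(String text) {
            this.text = text;
        }
    }

    // Stand-in for an annotator: it adds metadata to the document.
    interface Annotator {
        void process(Doc doc);
    }

    // Runs the annotators one after another, in list order.
    static Doc run(String text, List<Annotator> pipeline) {
        Doc doc = new Doc(text);
        for (Annotator annotator : pipeline) {
            annotator.process(doc);
        }
        return doc;
    }

    public static void main(String[] args) {
        Annotator wordCounter = d ->
                d.annotations.add("words=" + d.text.trim().split("\\s+").length);
        // Because the run is sequential, this annotator can see earlier output.
        Annotator summary = d ->
                d.annotations.add("annotationsSoFar=" + d.annotations.size());

        Doc result = run("UIMA moves documents through annotators",
                Arrays.asList(wordCounter, summary));
        System.out.println(result.annotations);
    }
}
```

The second annotator reads what the first one added. That kind of dependency is only safe because the annotators run in a fixed order.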
Background – annotators in parallel
What if your annotators are quite slow – perhaps they take several seconds to run?
If some or all of your annotators don’t depend on each other’s output, then running them one at a time may not be the most efficient approach.
You can run all of them at the same time, in parallel. UIMA will merge the output from all of the annotators into a single annotated document.
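The idea can be sketched in plain Java with a thread pool. This is an illustration of the concept only, not how UIMA-AS implements it, and it does not use the UIMA API: Annotator, runInParallel and slow are hypothetical names, and the 200 ms sleep is an assumed stand-in for slow analytics.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Conceptual sketch only: a thread pool standing in for parallel annotators.
// This is not the UIMA-AS API; all names here are hypothetical.
public class ParallelAnnotatorsSketch {

    // Stand-in for an independent annotator: it reads the text and
    // returns its own annotations without looking at anyone else's.
    interface Annotator {
        List<String> annotate(String text) throws Exception;
    }

    // Runs every annotator at the same time, then merges the results.
    // Assumes the list of annotators is not empty.
    static List<String> runInParallel(String text, List<Annotator> annotators)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(annotators.size());
        try {
            List<Callable<List<String>>> tasks = new ArrayList<>();
            for (Annotator annotator : annotators) {
                tasks.add(() -> annotator.annotate(text));
            }
            // invokeAll waits for all tasks and returns futures in task order.
            List<String> merged = new ArrayList<>();
            for (Future<List<String>> future : pool.invokeAll(tasks)) {
                merged.addAll(future.get());
            }
            return merged;
        } finally {
            pool.shutdown();
        }
    }

    // A deliberately slow annotator (assumed 200 ms of "work").
    static Annotator slow(String label) {
        return text -> {
            Thread.sleep(200);
            return Arrays.asList(label + ":" + text.length());
        };
    }

    public static void main(String[] args) throws Exception {
        long start = System.nanoTime();
        List<String> merged = runInParallel("some document text",
                Arrays.asList(slow("first"), slow("second"), slow("third")));
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(merged + " in about " + elapsedMs + " ms");
    }
}
```

With three annotators that each sleep for 200 ms, the parallel run should take roughly as long as the slowest one, rather than the sum of all three. This only works because none of the annotators needs another's output.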
My sample code
I’ve written two sample UIMA apps. Each demonstrates one of these approaches, so that you can compare them.
They are divided into three Eclipse projects, which you can import into an Eclipse IDE.
The UIMA Eclipse plugins are very helpful if you want to make changes to the XML configuration files, but they’re not essential. If you want them, there are instructions on how to install them at uima.apache.org.
I’ve added comments to the sample code to explain how the apps work, but I’ll give an overview here.