{"id":3073,"date":"2014-04-13T21:41:14","date_gmt":"2014-04-13T21:41:14","guid":{"rendered":"http:\/\/dalelane.co.uk\/blog\/?p=3073"},"modified":"2014-06-25T22:19:04","modified_gmt":"2014-06-25T22:19:04","slug":"text-analytics-in-bluemix-using-uima","status":"publish","type":"post","link":"https:\/\/dalelane.co.uk\/blog\/?p=3073","title":{"rendered":"Text analytics in BlueMix using UIMA"},"content":{"rendered":"<p>In this post, I want to explain <strong>how to create a text analytics application in BlueMix using UIMA<\/strong>, and share <a href=\"http:\/\/uimahelloworld.mybluemix.net\/\">sample code to show how to get started<\/a>.  <\/p>\n<p>First, some background if you&#8217;re unfamiliar with the jargon. <\/p>\n<p><strong>What is UIMA?<\/strong><\/p>\n<p><a href=\"http:\/\/uima.apache.org\/\">UIMA<\/a> (Unstructured Information Management Architecture) is an Apache framework for building analytics applications for unstructured information and the OASIS standard for content analytics. <\/p>\n<p>I&#8217;ve <a href=\"http:\/\/dalelane.co.uk\/blog\/?tag=uima\">written about it before<\/a>, having used it on <a href=\"http:\/\/cv.dalelane.co.uk\/index.php\/employment\/ets\/projects\/\">a few projects when I was in ETS<\/a>, and on other side projects since such as building <a href=\"http:\/\/dalelane.co.uk\/blog\/?p=2277\">a conversational interface to web pages<\/a>. <\/p>\n<p>It&#8217;s perhaps better known for providing the architecture for the question answering system <a href=\"http:\/\/ibmwatson.com\/\">IBM Watson<\/a>. <\/p>\n<p><strong>What is BlueMix?<\/strong><\/p>\n<p><img decoding=\"async\" src=\"http:\/\/dalelane.co.uk\/blog\/post-images\/140413-bluemixlogo.jpg\" hspace=10 vspace=10 border=0 align=\"left\"\/><a href=\"http:\/\/bluemix.net\/\">BlueMix<\/a> is IBM&#8217;s new Platform-as-a-Service (PaaS) offering, built on top of <a href=\"http:\/\/cloudfoundry.org\/\">Cloud Foundry<\/a> to provide a cloud development platform. 
<\/p>\n<p>It&#8217;s in open beta at the moment, so you can sign up and have a play. <\/p>\n<p>I&#8217;ve never used BlueMix before, or Cloud Foundry for that matter, so this was a chance for me to write my first app for it. <\/p>\n<p><strong>A UIMA &#8220;Hello World&#8221; for BlueMix<\/strong><\/p>\n<p>I&#8217;ve written a small sample to show how UIMA and BlueMix can work together. It provides a REST API: you submit text to it, and get back a JSON response listing some attributes found in the text (long words, capitalised words, and strings that look like email addresses). <\/p>\n<p>The &#8220;analytics&#8221; that the app is doing is trivial at best, but this is just a Hello World. For now my aim isn&#8217;t to produce a useful analytics solution, but to walk through the configuration needed to define a UIMA analytics pipeline, wrap it in a REST API using <a href=\"http:\/\/wink.apache.org\/\">Wink<\/a>, and deploy it as a BlueMix application. <\/p>\n<p>When I get a chance, I&#8217;ll write a follow-up post on making something more useful. <\/p>\n<p>You can try out the sample on BlueMix, where it&#8217;s deployed at <a href=\"http:\/\/uimahelloworld.mybluemix.net\/\">uimahelloworld.mybluemix.net<\/a>. <\/p>\n<p>The source is on GitHub at <a href=\"https:\/\/github.com\/dalelane\/bluemixuima\">github.com\/dalelane\/bluemixuima<\/a>. <\/p>\n<p>In the rest of this post, I&#8217;ll walk through some of the implementation details. <\/p>\n<p><!--more--><strong>Runtimes and services<\/strong><\/p>\n<p>Creating an application in BlueMix is already <a href=\"http:\/\/bluemix.net\/docs\">well documented<\/a> so I won&#8217;t reiterate those steps, other than to say that as Apache UIMA is a Java SDK and framework, I use the <strong>Liberty for Java<\/strong> runtime. <\/p>\n<p><img decoding=\"async\" src=\"http:\/\/dalelane.co.uk\/blog\/post-images\/140413-bluemixruntime.png\" border=0 vspace=5\/><\/p>\n<p>I&#8217;m not using any of the services in this simple sample. 
<\/p>\n<p><strong>Manifest<\/strong><\/p>\n<p>The app is bundled up in a war file, which is what we deploy. The path to that war file is specified in <a href=\"https:\/\/github.com\/dalelane\/bluemixuima\/blob\/master\/manifest.yml\">manifest.yml<\/a>.<\/p>\n<p><strong>Building<\/strong><\/p>\n<p>The war file is built by an <a href=\"https:\/\/github.com\/dalelane\/bluemixuima\/blob\/master\/build.xml\">Ant task<\/a> which has to <a href=\"https:\/\/github.com\/dalelane\/bluemixuima\/blob\/master\/build.xml#L16\">include the UIMA jar in the classpath<\/a>, and <a href=\"https:\/\/github.com\/dalelane\/bluemixuima\/blob\/master\/build.xml#L49\">copy my UIMA descriptor XML files into the war<\/a>. <\/p>\n<p>I&#8217;m developing in Eclipse, so I set up an <a href=\"https:\/\/github.com\/dalelane\/bluemixuima\/blob\/master\/.externalToolBuilders\/BlueMix%20builder.launch\">Ant builder to run the build<\/a>, and <a href=\"https:\/\/github.com\/dalelane\/bluemixuima\/blob\/master\/.project\">configured the project to do it automatically<\/a>. <\/p>\n<p>I&#8217;m deploying from Eclipse, too, using the <a href=\"http:\/\/dist.springsource.com\/release\/TOOLS\/cloudfoundry\">Cloud Foundry plugins for Eclipse<\/a>. <\/p>\n<p><strong>XML descriptors<\/strong><\/p>\n<p>The type system is defined in <a href=\"https:\/\/github.com\/dalelane\/bluemixuima\/blob\/master\/com.dalelane.uima.bluemixhelloworld\/typeSystemDescriptor.xml\">an XML descriptor file<\/a>, which specifies the different annotations that can be created by this pipeline, and the attributes that they have. <\/p>\n<p>Running JCasGen in Eclipse on that descriptor generates <a href=\"https:\/\/github.com\/dalelane\/bluemixuima\/tree\/master\/src\/com\/dalelane\/uima\/bluemixhelloworld\/annotations\">Java classes representing those types<\/a>. 
<\/p>\n<p>The pipeline is also defined in XML descriptors: one <a href=\"https:\/\/github.com\/dalelane\/bluemixuima\/blob\/master\/com.dalelane.uima.bluemixhelloworld\/aggregateAnalysisEngineDescriptor.xml\">overall aggregate descriptor<\/a> which imports three primitive descriptors, one for each of the three annotators in my sample pipeline: <a href=\"https:\/\/github.com\/dalelane\/bluemixuima\/blob\/master\/com.dalelane.uima.bluemixhelloworld\/primitiveAeDescriptorEmailAddresses.xml\">one to find email addresses<\/a>, <a href=\"https:\/\/github.com\/dalelane\/bluemixuima\/blob\/master\/com.dalelane.uima.bluemixhelloworld\/primitiveAeDescriptorWordCase.xml\">one to find capitalised words<\/a> and <a href=\"https:\/\/github.com\/dalelane\/bluemixuima\/blob\/master\/com.dalelane.uima.bluemixhelloworld\/primitiveAeDescriptorWordLengths.xml\">one to find long words<\/a>. <\/p>\n<p>Note that <a href=\"https:\/\/github.com\/dalelane\/bluemixuima\/blob\/master\/com.dalelane.uima.bluemixhelloworld\/aggregateAnalysisEngineDescriptor.xml#L7\">the imports in the aggregate descriptor<\/a> need to be relative so that they keep working once you deploy to BlueMix. <\/p>\n<p>These XML descriptor files are all added to the war file by <a href=\"https:\/\/github.com\/dalelane\/bluemixuima\/blob\/master\/build.xml#L49\">a fileset include in build.xml<\/a>. <\/p>\n<p><strong>Annotators<\/strong><\/p>\n<p>Each of the primitive descriptor files <a href=\"https:\/\/github.com\/dalelane\/bluemixuima\/blob\/master\/com.dalelane.uima.bluemixhelloworld\/primitiveAeDescriptorEmailAddresses.xml#L5\">specifies the fully qualified class name<\/a> for the Java implementation of the annotator. <\/p>\n<p>There are <a href=\"https:\/\/github.com\/dalelane\/bluemixuima\/tree\/master\/src\/com\/dalelane\/uima\/bluemixhelloworld\/annotators\">three annotators in this sample<\/a>. 
(Their descriptors are the XML files with names starting &#8220;primitiveAeDescriptor&#8221;.)<\/p>\n<p>Each one is <a href=\"https:\/\/github.com\/dalelane\/bluemixuima\/tree\/master\/src\/com\/dalelane\/uima\/bluemixhelloworld\/annotators\">implemented by a Java class<\/a> that extends <a href=\"http:\/\/uima.apache.org\/d\/uimaj-2.5.0\/apidocs\/org\/apache\/uima\/analysis_component\/JCasAnnotator_ImplBase.html\">JCasAnnotator_ImplBase<\/a>.<\/p>\n<p>Each uses a regular expression to find things to annotate in the text. This isn&#8217;t a recommendation for how annotators should be written; regexes just make for a simple, stateless demonstration without any additional dependencies. <\/p>\n<p>The simplest is the regex used to find capitalised words in <a href=\"https:\/\/github.com\/dalelane\/bluemixuima\/blob\/master\/src\/com\/dalelane\/uima\/bluemixhelloworld\/annotators\/WordCaseAnnotator.java#L31\">WordCaseAnnotator<\/a> and the most complex is the ridiculously painful one used to find email addresses in <a href=\"https:\/\/github.com\/dalelane\/bluemixuima\/blob\/master\/src\/com\/dalelane\/uima\/bluemixhelloworld\/annotators\/EmailAnnotator.java#L33\">EmailAnnotator<\/a>. <\/p>\n<p>Note that the regexes are prepared once in the <a href=\"http:\/\/uima.apache.org\/d\/uimaj-2.5.0\/apidocs\/org\/apache\/uima\/analysis_component\/AnalysisComponent_ImplBase.html#initialize(org.apache.uima.UimaContext)\">annotator initializer<\/a>, and reused for each <a href=\"http:\/\/uima.apache.org\/d\/uimaj-2.5.0\/apidocs\/org\/apache\/uima\/analysis_component\/JCasAnnotator_ImplBase.html#process(org.apache.uima.jcas.JCas)\">new CAS to process<\/a>, to improve performance. <\/p>\n<p><strong>UIMA pipeline<\/strong><\/p>\n<p>The UIMA pipeline is defined in <a href=\"https:\/\/github.com\/dalelane\/bluemixuima\/blob\/master\/src\/com\/dalelane\/uima\/bluemixhelloworld\/Pipeline.java\">a single Java class<\/a>. 
<\/p>\n<p>It finds <a href=\"https:\/\/github.com\/dalelane\/bluemixuima\/blob\/master\/com.dalelane.uima.bluemixhelloworld\/aggregateAnalysisEngineDescriptor.xml\">the XML descriptor for the pipeline<\/a> by <a href=\"https:\/\/github.com\/dalelane\/bluemixuima\/blob\/master\/src\/com\/dalelane\/uima\/bluemixhelloworld\/Pipeline.java#L67\">looking in the location where BlueMix will unpack the war<\/a>. <\/p>\n<p>It <a href=\"https:\/\/github.com\/dalelane\/bluemixuima\/blob\/master\/src\/com\/dalelane\/uima\/bluemixhelloworld\/Pipeline.java#L80\">creates a CAS pool<\/a> to make it easier to handle multiple concurrent requests, and avoid the overhead of creating a CAS for every request.  <\/p>\n<p>Once the pipeline is <a href=\"https:\/\/github.com\/dalelane\/bluemixuima\/blob\/master\/src\/com\/dalelane\/uima\/bluemixhelloworld\/Pipeline.java#L83\">initialised<\/a>, it is ready to <a href=\"https:\/\/github.com\/dalelane\/bluemixuima\/blob\/master\/src\/com\/dalelane\/uima\/bluemixhelloworld\/Pipeline.java#L108\">handle incoming analysis requests<\/a>. <\/p>\n<p>Once the CAS has passed through the pipeline, <a href=\"https:\/\/github.com\/dalelane\/bluemixuima\/blob\/master\/src\/com\/dalelane\/uima\/bluemixhelloworld\/Pipeline.java#L150\">the annotations are immediately copied out of the CAS into a POJO<\/a>, so that the CAS can be returned to the pool. <\/p>\n<p><strong>REST API<\/strong><\/p>\n<p>The war file deployed to BlueMix contains a <a href=\"https:\/\/github.com\/dalelane\/bluemixuima\/blob\/master\/WebContent\/WEB-INF\/web.xml#L7\">web.xml which specifies the servlet<\/a> that implements the REST API.<\/p>\n<p>I&#8217;m using <a href=\"http:\/\/wink.apache.org\/\">Wink<\/a> to implement the API. 
The servlet definition in the web.xml specifies <a href=\"https:\/\/github.com\/dalelane\/bluemixuima\/blob\/master\/WebContent\/WEB-INF\/web.xml#L12\">where to find the list of API endpoints<\/a> and the <a href=\"https:\/\/github.com\/dalelane\/bluemixuima\/blob\/master\/WebContent\/WEB-INF\/web.xml#L18\">URL where the API should be<\/a>.<\/p>\n<p>The <a href=\"https:\/\/github.com\/dalelane\/bluemixuima\/blob\/master\/WebContent\/WEB-INF\/resources\">list of API endpoints<\/a> is a list of classes that Wink uses. There is only one API endpoint, so only one class listed. <\/p>\n<p>The <a href=\"https:\/\/github.com\/dalelane\/bluemixuima\/blob\/master\/src\/com\/dalelane\/uima\/bluemixhelloworld\/api\/UimaResource.java\">API implementation<\/a> is a very thin wrapper around the <a href=\"https:\/\/github.com\/dalelane\/bluemixuima\/blob\/master\/src\/com\/dalelane\/uima\/bluemixhelloworld\/Pipeline.java\">Pipeline class<\/a>. <\/p>\n<p>Everything is defined using <a href=\"https:\/\/github.com\/dalelane\/bluemixuima\/blob\/master\/src\/com\/dalelane\/uima\/bluemixhelloworld\/api\/UimaResource.java#L28\">annotations<\/a>, and Wink handles turning the response into a JSON payload. <\/p>\n<p><strong>That&#8217;s it<\/strong><\/p>\n<p>I think that&#8217;s pretty much it. <\/p>\n<p><a href=\"http:\/\/uimahelloworld.mybluemix.net\/\"><img decoding=\"async\" src=\"http:\/\/dalelane.co.uk\/blog\/post-images\/140413-samplescreenshot.png\" border=0 vspace=5\/><\/a><\/p>\n<p>I&#8217;ve added a <a href=\"https:\/\/github.com\/dalelane\/bluemixuima\/blob\/master\/WebContent\/index.html\">simple front-end webpage<\/a>, with <a href=\"https:\/\/github.com\/dalelane\/bluemixuima\/blob\/master\/WebContent\/apitester.js\">a script to submit API requests<\/a> for people who don&#8217;t want to do it with something like curl. 
<\/p>\n<p>It&#8217;s live at <a href=\"http:\/\/uimahelloworld.mybluemix.net\/\">uimahelloworld.mybluemix.net<\/a>.<\/p>\n<p>Like I said, it&#8217;s very simple. The Java itself isn&#8217;t particularly complex. My reason for sharing it was to provide a boilerplate config for defining a UIMA analytics pipeline, wrapping it in a REST API, and deploying it to BlueMix. <\/p>\n<p>Once you&#8217;ve got that working, you can do text analytics in BlueMix as complex as whatever you can dream up for your annotators. <\/p>\n<p>When I get time, I&#8217;ll write a follow-up post sharing what that could look like. <\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this post, I want to explain how to create a text analytics application in BlueMix using UIMA, and share sample code to show how to get started. First, some background if you&#8217;re unfamiliar with the jargon. What is UIMA? UIMA (Unstructured Information Management Architecture) is an Apache framework for building analytics applications for unstructured 
[&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[7],"tags":[560,561,512,533],"class_list":["post-3073","post","type-post","status-publish","format-standard","hentry","category-code","tag-bluemix","tag-cloudfoundry","tag-eightbar","tag-uima"],"_links":{"self":[{"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/posts\/3073","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=3073"}],"version-history":[{"count":0,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/posts\/3073\/revisions"}],"wp:attachment":[{"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=3073"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=3073"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=3073"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}