Azure Search – Indexing documents using Tika


This post offers some insights on how to index popular document formats such as Word, PDF and JPG when using the Azure Search offering. Azure Search is currently in preview and therefore evolving rapidly. For now, however, indexing documents isn’t supported. Hence my search for alternative ways of extracting document data and metadata.

Something about Azure Search….

As you might know, I’m currently exploring the world of Azure Search and have to say that it’s been an interesting journey so far. (If you haven’t caught up on Azure Search, I’ve collected some helpful resources that will get you up to speed, which can be found here.)

The search offering isn’t anything like SQL Server’s Full-Text Search, because it’s built on a different product known as Elasticsearch. To keep things simple: Elasticsearch (http://www.elasticsearch.org/) is more or less responsible for the server-side plumbing (scaling, multi-tenancy, insights, exposing the data etc.) and internally uses Lucene (http://lucene.apache.org), which is responsible for full-text indexing. It’s important to know this because it may give some hints on how the product will evolve.

Actually, I’m quite positive about this change. However, I’ve worked with Lucene in depth, so I’m biased. The current preview offering is somewhat limited and we just have to wait and see how much value the Azure team can add. Nevertheless, I’m eager to give it a go.

Interesting read

Elasticsearch – the definitive guide: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/index.html

Indexing documents

Elasticsearch supports indexing documents of various formats through its attachment mapper plugin, which utilizes the Apache Tika toolkit (http://tika.apache.org/). This toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).

NOTE: As you have probably noticed by now, we are digging into the open source realm again, which makes some of the .NET dev/Windows IT guys uncomfortable. Don’t be: first, this is Microsoft’s new direction, and second, you would be missing out on some great solutions.

Unfortunately, Azure Search doesn’t support indexing documents at the moment, so we need to look for alternative ways of extracting the data so it can be fed into the index. The good news is that there are plenty of solutions available, both commercial and open source, including native .NET ones in case that is important to you. I prefer sticking to the Tika toolkit for several good reasons:

  • If Azure Search adds a way of indexing documents, it will most likely be based on Tika.
  • Tika is widely used and has proven to be a solid solution.
  • Prior positive experiences working with the Tika community. (Had a critical bug fixed within 3 hours.)
  • Many of the native .NET implementations aren’t up to par and sometimes even depend on abandoned libraries.

The only challenge is how to get Tika running quickly within Azure. At this point there are a few options available:

1 – Apache Tika Server

The first option is hosting Apache Tika Server (http://wiki.apache.org/tika/TikaJAXRS) within an Azure VM. This isn’t as hard as it sounds thanks to Docker Hub. However, you will still need to read up on the REST API and manage the Linux-based VMs.

Docker Hub: https://registry.hub.docker.com/search?q=Tika&searchfield=
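To give an idea of what talking to Tika Server looks like from .NET, here is a minimal sketch. It assumes a Tika Server instance is reachable at http://localhost:9998 (9998 is the server's default port; your VM address will differ) and uses the server's /tika resource, which accepts the raw document via PUT and returns the extracted text. The class and method names are made up for this example.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

// Hypothetical helper, not part of any library.
static class TikaServerClient
{
    // Assumption: Tika Server runs locally on its default port.
    static readonly Uri TikaEndpoint = new Uri("http://localhost:9998/tika");

    public static async Task<string> ExtractTextAsync(string path)
    {
        using (var client = new HttpClient())
        using (var file = File.OpenRead(path))
        using (var request = new HttpRequestMessage(HttpMethod.Put, TikaEndpoint))
        {
            // The document is sent as the raw request body.
            request.Content = new StreamContent(file);
            // Ask for plain text back.
            request.Headers.Accept.Add(new MediaTypeWithQualityHeaderValue("text/plain"));

            using (var response = await client.SendAsync(request))
            {
                response.EnsureSuccessStatusCode();
                return await response.Content.ReadAsStringAsync();
            }
        }
    }
}
```

Calling `await TikaServerClient.ExtractTextAsync("sample.pdf")` should return the document text. The server also exposes a /meta resource for the metadata, which can be called the same way.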

2 – Azure SDK for Java

Another option is to include the Tika library within a Worker Role application. The benefit is that you can easily access Azure storage from within the same process. In addition, scaling and managing an application is more straightforward than with a VM.

3 – Using IKVM.NET / TikaOnDotNet

IKVM.NET (http://weblog.ikvm.net/) is a Java Virtual Machine (JVM) for the .NET and Mono runtimes, which basically allows Java code to be used by .NET applications. The drawback is that you will need to generate .NET libraries for all Java packages, which then need to be referenced within your Visual Studio project. It’s not that complicated; however, downloading and building the Tika source requires some time and knowledge of Maven 2.

Luckily, someone already took the time to wrap all of this in a NuGet package. If you’re interested in the details, make sure to visit http://kevm.github.io/tikaondotnet/. Details about the package can be found at https://www.nuget.org/packages/TikaOnDotNet/, or just install the package using the following command in the Package Manager Console:

PM> Install-Package TikaOnDotNet

The API is simple, but if you need samples, just open the Test project on GitHub.

const string url = "http://download.microsoft.com/download/E/7/B/E7B25440-1569-40B5-989E-3951FC178214/Microsoft_Press_eBook_Introducing_HDInsight_PDF.pdf";
var textExtractionResult = new TextExtractor().Extract(new Uri(url));

The textExtractionResult object contains the extracted metadata and text.
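For those who want to inspect the result without a debugger, a small console sketch could look like this. It assumes the result exposes Text, ContentType and Metadata (a string-to-string dictionary), which is how I understand the package. The namespace is also an assumption; newer releases use TikaOnDotNet.TextExtraction instead.

```csharp
using System;
using TikaOnDotNet; // assumption: newer package versions use TikaOnDotNet.TextExtraction

class Program
{
    static void Main()
    {
        const string url = "http://download.microsoft.com/download/E/7/B/E7B25440-1569-40B5-989E-3951FC178214/Microsoft_Press_eBook_Introducing_HDInsight_PDF.pdf";
        var result = new TextExtractor().Extract(new Uri(url));

        // Content type as detected by Tika.
        Console.WriteLine("Content type: " + result.ContentType);

        // Metadata is a set of key/value pairs (author, page count, ...).
        foreach (var entry in result.Metadata)
        {
            Console.WriteLine(entry.Key + " = " + entry.Value);
        }

        // Print only the first 200 characters of the extracted text.
        var text = result.Text ?? string.Empty;
        Console.WriteLine(text.Substring(0, Math.Min(200, text.Length)));
    }
}
```

Which metadata keys you get back depends on the document type, so it's worth dumping them once before deciding which ones to map to index fields.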

Conclusion

We have covered three different ways of extracting metadata and text using Tika. Which solution you select and how you glue things together depends on your requirements. In this post I just wanted to show that there are ways of getting the job done.
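As an illustration of the gluing part, here is a rough sketch of how extracted text could be pushed into an Azure Search index through the REST API. Everything specific is an assumption: the field names "id" and "content" must match your own index schema, the api-version value depends on the service version you target, and the helper itself is hypothetical. It uses System.Text.Json, so it targets a recent .NET runtime.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Text;
using System.Text.Json;

// Hypothetical helper, not part of any SDK.
static class AzureSearchUpload
{
    public static HttpRequestMessage BuildUploadRequest(
        string serviceName, string indexName, string apiVersion, string adminKey,
        string documentId, string extractedText)
    {
        // One document; "id" and "content" are assumed field names.
        var document = new Dictionary<string, object>
        {
            ["@search.action"] = "upload",
            ["id"] = documentId,
            ["content"] = extractedText
        };

        // The indexing endpoint expects a batch wrapped in a "value" array.
        var body = new Dictionary<string, object> { ["value"] = new[] { document } };

        var url = "https://" + serviceName + ".search.windows.net/indexes/" + indexName
                  + "/docs/index?api-version=" + apiVersion;

        var request = new HttpRequestMessage(HttpMethod.Post, url)
        {
            Content = new StringContent(JsonSerializer.Serialize(body), Encoding.UTF8, "application/json")
        };
        // The admin key of the search service goes into the api-key header.
        request.Headers.Add("api-key", adminKey);
        return request;
    }
}
```

The returned request can be sent with `HttpClient.SendAsync`. Building the request separately from sending it makes it easy to inspect the JSON payload before anything hits the service.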
