This post offers some insights on how to index popular document formats such as Word, PDF, and JPG when using the Azure Search offering. Azure Search is currently in preview and therefore evolving rapidly. For now, however, indexing documents isn’t supported, hence my search for alternative ways of extracting document data and metadata.
Something about Azure Search.
As you might know, I’m currently exploring the world of Azure Search, and I have to say that it’s been an interesting journey so far. (If you haven’t caught up on Azure Search, I’ve collected some helpful resources that will get you up to speed, which can be found here.)
The search offering is nothing like SQL Server’s Full-Text Search solution because it’s built on a different product known as Elasticsearch. To keep things simple, Elasticsearch (http://www.elasticsearch.org/) is more or less responsible for the server-side plumbing (scaling, multi-tenancy, insights, exposing the data, etc.) and internally uses Lucene (http://lucene.apache.org), which is responsible for full-text indexing. It’s important to know this because it gives some hints about how the product may evolve.
I’m quite positive about this change; however, I’ve worked with Lucene in depth, so I’m biased. The current preview offering is somewhat limited, and we have to wait and see how much value the Azure team can add. Nevertheless, I’m eager to give it a go.
Elasticsearch – the definitive guide: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/index.html
Elasticsearch supports indexing documents of various formats through its attachment mapper plugin, which relies on the Apache Tika toolkit (http://tika.apache.org/). This toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).
NOTE: As you have probably noticed by now, we are digging into the open source realm again, which makes some .NET developers and Windows IT folks uncomfortable. There is no need to be: firstly, this is Microsoft’s new direction, and secondly, you would be missing out on some great solutions.
Unfortunately, Azure Search doesn’t support indexing documents at the moment, so we have to look for alternative ways of extracting the data before it can be fed into the index. The good news is that there are plenty of solutions available, both commercial and open source, including native .NET ones in case that is important to you. I prefer sticking to the Tika toolkit for several good reasons:
- If Azure Search ever includes a way of indexing documents, it will most likely be based on Tika.
- Tika is widely used and has proven to be a reliable solution.
- I have had positive prior experiences working with the Tika community. (I once had a critical bug fixed within 3 hours.)
- Many of the native .NET implementations aren’t up to par and sometimes even depend on abandoned libraries.
The only challenge is how to get Tika running quickly within Azure. At this point there are a few options available:
1 – Apache Tika Server
Host Apache Tika Server (http://wiki.apache.org/tika/TikaJAXRS) within an Azure VM. This isn’t as hard as it sounds, thanks to Docker Hub. However, you will still need to read up on the REST API and manage the Linux-based VMs.
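To give an idea of what talking to that REST API looks like from .NET, here is a minimal sketch using HttpClient. It assumes a Tika Server instance is already running and reachable at http://localhost:9998/ (9998 is the server's default port; the host is purely my assumption), and that a PUT to the tika resource with the raw file as the request body returns the extracted text. The class and method names are made up for illustration.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

public static class TikaServerClient
{
    // Assumption: adjust this to wherever your own Tika Server is listening.
    private static readonly HttpClient Http = new HttpClient
    {
        BaseAddress = new Uri("http://localhost:9998/")
    };

    // Builds the PUT request for the "tika" resource.
    // Kept separate so the request can be inspected without a running server.
    public static HttpRequestMessage BuildExtractRequest(byte[] document)
    {
        var request = new HttpRequestMessage(HttpMethod.Put, "tika")
        {
            Content = new ByteArrayContent(document)
        };

        // Ask the server to return the extracted content as plain text.
        request.Headers.Accept.Add(new MediaTypeWithQualityHeaderValue("text/plain"));
        return request;
    }

    // Sends the file to Tika Server and returns the extracted text.
    public static async Task<string> ExtractTextAsync(string filePath)
    {
        var bytes = File.ReadAllBytes(filePath);

        using (var request = BuildExtractRequest(bytes))
        using (var response = await Http.SendAsync(request))
        {
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
    }
}
```

Calling TikaServerClient.ExtractTextAsync with the path of a local PDF should then hand back its text. The server exposes a similar meta resource for the metadata, but check the REST API page linked above for the exact resources your server version supports.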
2 – Azure SDK for Java
Another option is to include the Tika library within a Worker Role application. The benefit is that you can easily access Azure storage from within the same process. In addition, scaling and managing such an application is more straightforward than with a VM.
3 – Using IKVM.NET / TikaOnDotNet
IKVM.NET (http://weblog.ikvm.net/) is a Java Virtual Machine (JVM) for the .NET and Mono runtimes that basically allows Java code to be used by .NET applications. The drawback is that you will need to generate .NET libraries for all Java packages, which then need to be referenced within your Visual Studio project. It’s not that complicated; however, downloading and building the Tika source requires some time and knowledge of Maven 2.
Luckily, someone already took the time to wrap all of this in a NuGet package. If you’re interested in the details, make sure to visit the following page: http://kevm.github.io/tikaondotnet/ Details about the package can be found here: https://www.nuget.org/packages/TikaOnDotNet/ Alternatively, install the package using the following command in the Package Manager Console:
PM> Install-Package TikaOnDotNet
The API is simple, but if you need samples, open the Test project on GitHub.
const string url = "http://download.microsoft.com/download/E/7/B/E7B25440-1569-40B5-989E-3951FC178214/Microsoft_Press_eBook_Introducing_HDInsight_PDF.pdf";
var textExtractionResult = new TextExtractor().Extract(new Uri(url));
The textExtractionResult variable now contains the extracted metadata and text.
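To make that a bit more concrete, the sketch below dumps the metadata and shapes the result into a simple dictionary that could later be serialized and fed into an index. Treat it as a rough outline rather than the way to do it: the field names (id, contentType, title, content) are hypothetical and have to match your own index schema, the "title" metadata key is an assumption because the keys Tika returns differ per file type, and the namespace may differ depending on the package version you installed.

```csharp
using System;
using System.Collections.Generic;
using TikaOnDotNet; // assumption: some package versions use TikaOnDotNet.TextExtraction instead

public static class ExtractionDemo
{
    // Maps the extracted values onto a flat document.
    // The field names are hypothetical; use the ones your index defines.
    public static Dictionary<string, object> ToIndexDocument(
        string id, string contentType, string text, IDictionary<string, string> metadata)
    {
        // Assumption: "title" is present for this file type; fall back to empty.
        string title;
        if (!metadata.TryGetValue("title", out title))
        {
            title = string.Empty;
        }

        return new Dictionary<string, object>
        {
            { "id", id },
            { "contentType", contentType },
            { "title", title },
            { "content", text }
        };
    }

    public static void Main(string[] args)
    {
        // Pass the document URL as the first command-line argument.
        var result = new TextExtractor().Extract(new Uri(args[0]));

        // List every metadata entry so you can see which keys are available.
        foreach (var pair in result.Metadata)
        {
            Console.WriteLine("{0} = {1}", pair.Key, pair.Value);
        }

        var document = ToIndexDocument(
            Guid.NewGuid().ToString("N"), result.ContentType, result.Text, result.Metadata);

        Console.WriteLine("Prepared {0} fields, {1} characters of text.",
            document.Count, result.Text.Length);
    }
}
```

Printing the metadata first is a cheap way to discover which keys a given file type actually produces before you decide which ones deserve a field in the index.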
We have covered three different ways of extracting metadata and text using Tika. Which solution you select and how you glue things together depends on your requirements. In this post, I just wanted to show that there are ways of getting the job done.