This blog post was inspired by Dot Net Rocks episode 1292. The episode was as informative as all the others; however, the first part, in which Carl and Richard covered scanr.xyz (an OCR as a service offering), really piqued my interest.
OCR as a Service
When I think of OCR, I think of the struggles I had with my first flatbed scanner back in 1999, which shipped with horrible scanning software and poor-quality OCR. Because of this, I never tried any optical character recognition software again. So when I heard about scanr.xyz and its OCR as a service offering, I was keen to give it a try.
Fortunately, scanr.xyz has a very friendly subscription model that lets you convert your first 100 documents for free! In addition, they support a large number of file types, including PDFs (covered in detail later). They currently lack a simple web interface for quick tests, but that makes sense given their focus on providing a developer-friendly experience.
The REST interface isn’t very elegant, but it is very simple to consume. Currently, there are two ways of providing your documents to scanr.xyz: uploading the content within your request, or providing the URL of a publicly available location on the web.
If you are using Node.js or Ruby, you can simply install scanr and have everything up and running in no time. Given that I wanted to integrate scanr within a Logic Apps workflow, I decided to build a small API wrapper using C#. If you are only interested in testing a service like this, a tool like Postman is all you need.
Just another .NET API Wrapper
If you are interested in using my API wrapper, just run the following command in the Package Manager Console:
Note: This project is available on GitHub at: https://github.com/Kevin-Bronsdijk/scanr-net
The calls are very simple and self-explanatory. Given that the underlying API is somewhat inconsistent, I’ve decided to have a separate call for PDF-based documents. Extracting the text of a local image can be done as shown below (a byte array works as well). Just make sure to provide your personal token as found within the management portal.
Extract text from PDFs
There are some things I would like to cover when it comes to converting PDF documents. First of all, based on the error messages, I’m sure that PDF documents are converted to images first. This might be okay when your PDFs are based on images. In other cases, it would not make much sense to run OCR on them, unless you are somehow restricted from copying the content.
I’ve never researched PDF text extraction tools, but I know that they are available on the market.
I’m still encountering many issues when submitting PDFs. Sometimes the service complains about complexity, sometimes about size (even for files under a megabyte), and sometimes about other issues.
Behind the covers
After running some tests, I got interested in knowing which OCR engine this was built on, and therefore decided to do some research and comparison work. Not long after, I ran into Tesseract, an open source OCR engine (https://github.com/tesseract-ocr), and a .NET wrapper for tesseract-ocr located here: https://github.com/charlesw/tesseract.
NuGet: Install-Package Tesseract
After running some comparison tests, I can conclude that scanr is based on the same engine and does not provide any additional value apart from hosting tesseract-ocr and making it available via APIs. There are even some interesting features missing, like the content iterator, which might be helpful if you’re looking for simple ways to navigate through the extracted text.
I still like the service because of its pricing model. But if your documents exceed the 5-megabyte limit, or if you are working with confidential material or restrictive company policies, you could create something on your own.
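To give an idea of what hosting the engine yourself looks like, here is a minimal sketch using the Tesseract NuGet package mentioned above. It assumes a `./tessdata` folder containing `eng.traineddata` and an image path passed on the command line (the `scan.png` fallback is just a placeholder); both are my assumptions for illustration, not something scanr prescribes. It prints the full text and then walks the result line by line with the iterator.

```csharp
using System;
using Tesseract;

public static class TesseractDemo
{
    public static void Main(string[] args)
    {
        // Assumption: first argument is the image to read; "scan.png" is a placeholder.
        string imagePath = args.Length > 0 ? args[0] : "scan.png";

        // Assumption: ./tessdata contains the English language data (eng.traineddata).
        using (var engine = new TesseractEngine(@"./tessdata", "eng", EngineMode.Default))
        using (var image = Pix.LoadFromFile(imagePath))
        using (var page = engine.Process(image))
        {
            // Whole-page text plus the engine's own confidence estimate.
            Console.WriteLine("Mean confidence: {0:P0}", page.GetMeanConfidence());
            Console.WriteLine(page.GetText());

            // Walk the recognized text one line at a time.
            using (var iterator = page.GetIterator())
            {
                iterator.Begin();
                do
                {
                    string line = iterator.GetText(PageIteratorLevel.TextLine);
                    if (!string.IsNullOrWhiteSpace(line))
                    {
                        Console.WriteLine("Line: " + line.Trim());
                    }
                } while (iterator.Next(PageIteratorLevel.TextLine));
            }
        }
    }
}
```

Swapping `PageIteratorLevel.TextLine` for `PageIteratorLevel.Word` should give you word-level navigation instead.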
The number plate test
If you want to do more than just scan dull documents, for example read number plates, then there are some options available. First of all, the source needs to be of very high quality. This might be a problem given that scanr has a 5-megabyte limitation. In addition, you need to adjust the image before extracting the text, for example by converting it to black and white and normalizing luminosity and contrast.
I was able to get some decent results, and I’m sure it’s possible to get even better results by spending more time on automating the image adjustments before extracting the text.
Shortly after I published this article, I ran into Emgu CV, a cross-platform .NET wrapper for the OpenCV image processing library. They have some License Plate Recognition samples as well, and it works in conjunction with Tesseract. Additional details can be found right here: http://www.emgu.com/wiki/index.php/License_Plate_Recognition_in_CSharp
Scanr’s OCR as a service is definitely fun to play with, and it converted high-quality scanned documents with great results. The API is simple and the service isn’t expensive. There are also other players on the market, and most of them expose richer APIs, support bigger files, etc. But you need to be willing to pay a bit more as well.
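The adjustments described above can be sketched without any imaging library. The snippet below is only an illustration of the idea, not the exact steps I used: it assumes the image is already available as raw interleaved 8-bit RGB bytes, and the class name `OcrPreprocess`, the luma weights and the fixed threshold of 128 are my own choices.

```csharp
using System;

public static class OcrPreprocess
{
    // Convert interleaved 8-bit RGB pixels to grayscale using common luma weights.
    public static byte[] ToGrayscale(byte[] rgb)
    {
        if (rgb.Length % 3 != 0)
        {
            throw new ArgumentException("Expected 3 bytes per pixel.", nameof(rgb));
        }

        var gray = new byte[rgb.Length / 3];
        for (int i = 0; i < gray.Length; i++)
        {
            double luma = 0.299 * rgb[3 * i] + 0.587 * rgb[3 * i + 1] + 0.114 * rgb[3 * i + 2];
            gray[i] = (byte)Math.Round(luma);
        }
        return gray;
    }

    // Contrast normalization: stretch the darkest pixel to 0 and the brightest to 255.
    public static byte[] StretchContrast(byte[] gray)
    {
        var result = new byte[gray.Length];
        if (gray.Length == 0)
        {
            return result;
        }

        byte min = 255, max = 0;
        foreach (byte value in gray)
        {
            if (value < min) min = value;
            if (value > max) max = value;
        }

        if (max == min)
        {
            // Flat image: nothing to stretch, return an unchanged copy.
            Array.Copy(gray, result, gray.Length);
            return result;
        }

        for (int i = 0; i < gray.Length; i++)
        {
            result[i] = (byte)((gray[i] - min) * 255 / (max - min));
        }
        return result;
    }

    // Black-and-white conversion with a fixed threshold (128 is an arbitrary default).
    public static byte[] Binarize(byte[] gray, byte threshold = 128)
    {
        var result = new byte[gray.Length];
        for (int i = 0; i < gray.Length; i++)
        {
            result[i] = gray[i] >= threshold ? (byte)255 : (byte)0;
        }
        return result;
    }
}
```

Chaining the three calls (`Binarize(StretchContrast(ToGrayscale(pixels)))`) gives a black-and-white buffer you can encode back into an image before handing it to the OCR engine. A fixed threshold is the simplest option; unevenly lit photos will probably need an adaptive one.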
If you need more control, then there are options available just by hosting the OCR yourself.