How we went from humans to machine learning to make the impossible into reality in RealtimeCRM

A few years ago, when RealtimeCRM was getting off the ground, we had a lot of big ideas and we experimented a lot.

Today, we’re going to give you a behind-the-scenes look at one of those ideas: why it didn’t work at first, and how it finally did once the world changed and the near impossible became possible.

Just look at this xkcd joke from a few years ago. It doesn’t work anymore:

Our business card reader (Pre machine learning)

So it’s still a few years ago, and in the mobile space apps that can read photos start appearing. It piques the interest of some of our users, and it piques our interest too.

We’re obsessed with data entry, or rather with making it as easy as possible for our users. So integrating a business card reader into RealtimeCRM was a no-brainer: it reads a business card and puts the content into RealtimeCRM to create a new contact record. It had a core utility that our users would want, and we didn’t see it anywhere else.

The problem we faced, though, was that we didn’t have optical character recognition (OCR) technology in-house and we were not about to develop it. We didn’t want to rely on desktop plugins (we’re a web product at heart), so we set about looking for APIs that could do the job for us.

We found some computer-vision-based APIs but struggled to find something affordable with good accuracy. We did, however, find some wrappers around Amazon’s Mechanical Turk service.

What is Amazon Mechanical Turk (MTurk)?

MTurk, as we’ll refer to it from here on, allows you to insert human intelligence into your applications. You create a job and pay a set amount for each job completed by a real person on the other side. It’s essentially a marketplace for human intelligence, used to complete tasks computers would struggle with, such as transcribing the details from a business card.

The name comes from an elaborate 18th-century hoax known as the Mechanical Turk. It was presented as a chess-playing machine and a marvel of automation, and it beat both Napoleon Bonaparte and Benjamin Franklin at chess.

This would be pretty impressive today, let alone in the 18th century - too impressive.

Well, it turned out to be a mechanical illusion that let a human chess master hide inside the machine and operate it. So the “machine” was using human intelligence to beat other human intelligences, like the guy who conquered Europe but was defeated by Britain, and some guy who was not quite defeated by Britain in some backwater somewhere.

Getting back to the first version of our business card reader: we ended up using MTurk. We sent an image of the business card to a person, who would transcribe it, and we’d pay them.

The prototype worked fine except for a few critical problems. The latency was bad, really bad: it could take anywhere from 20 seconds to several minutes to get the job done. In fact, in writing this article we found a funny email from a few years ago in which our Managing Director, Philip, said, “I love the business card reader but why does it take 15 minutes to work?”

Then there was the cost. This was a really expensive way to get the information we needed out of a business card and into RealtimeCRM: we’re talking around $0.10 per call. It was just not viable long term, so we were forced to abandon the whole thing.

Our business card reader (After machine learning)

So fast forward a couple of years, and machine learning APIs have come a long way. We can now check whether a photo is of a bird or not.

So we decided to try to build our business card reader again using the vision APIs from Google.

A brief description of the solution (don’t worry, it’s really not complicated)

The Google Cloud AI platform offers a range of tools to inject machine learning into your products to improve them. For us, there were two in particular that were of interest in solving the problem:

The Google Cloud Vision API deals with the “seeing” problem. It enables RealtimeCRM to understand the content of an image, including objects such as logos and faces, and, most pertinently for us, it offers Optical Character Recognition (OCR): the ability to extract text from an image.
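To make the “seeing” step a little more concrete, here is a minimal sketch of what an OCR request to the Vision API’s REST endpoint might look like, and how the text could be pulled out of the reply. Treat it as an illustration rather than our production code: the helper names build_ocr_request and extract_text and the sample response are made up for this example, and the network call itself (which needs an API key or OAuth token) is left out.

```python
import base64
import json

# Public REST endpoint for Cloud Vision image annotation.
VISION_URL = "https://vision.googleapis.com/v1/images:annotate"


def build_ocr_request(image_bytes: bytes) -> str:
    """Return the JSON body for a single TEXT_DETECTION request."""
    body = {
        "requests": [
            {
                # Image bytes are sent base64-encoded inside the JSON body.
                "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
                "features": [{"type": "TEXT_DETECTION"}],
            }
        ]
    }
    return json.dumps(body)


def extract_text(response: dict) -> str:
    """Pull the recognised text out of a parsed annotate response."""
    results = response.get("responses", [])
    if not results:
        return ""
    annotations = results[0].get("textAnnotations", [])
    # The first annotation holds the whole text; later ones are single words.
    return annotations[0]["description"] if annotations else ""


# Hypothetical response, trimmed to the fields used above.
sample_response = {
    "responses": [
        {
            "textAnnotations": [
                {"description": "Joe Example\nGoogle\nMountain View"},
                {"description": "Joe"},
            ]
        }
    ]
}
card_text = extract_text(sample_response)
```

In a real integration the string returned by build_ocr_request would be POSTed to VISION_URL, and the parsed JSON reply would be passed to extract_text.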

So we’ve got the seeing part solved. Next, we need to be able to “provide context” for the information we’ve pulled out. That’s where the Google Cloud Natural Language API comes into play. It can recognise entities such as a person, location or organisation, as well as the overall sentiment of a block of text.

As you can see, it recognised that “Google” is an organisation and that “Mountain View” is a location.

So now we can provide context for the text we’ve extracted with the Vision API, and that context was the missing piece of the puzzle.
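As a rough sketch of the “context” step, the snippet below builds an entity-analysis request for the Natural Language API’s REST endpoint and groups the entities in a reply by type. The helper names build_entities_request and entities_by_type and the sample response are our own inventions for illustration, and the authenticated HTTP call is omitted.

```python
import json

# Public REST endpoint for Natural Language entity analysis.
LANGUAGE_URL = "https://language.googleapis.com/v1/documents:analyzeEntities"


def build_entities_request(text: str) -> str:
    """Return the JSON body for an analyzeEntities request on plain text."""
    body = {
        "document": {"type": "PLAIN_TEXT", "content": text},
        "encodingType": "UTF8",
    }
    return json.dumps(body)


def entities_by_type(response: dict) -> dict:
    """Group entity names by reported type, e.g. PERSON or ORGANIZATION."""
    grouped = {}
    for entity in response.get("entities", []):
        grouped.setdefault(entity["type"], []).append(entity["name"])
    return grouped


# Hypothetical response, trimmed to the two fields used above.
sample_response = {
    "entities": [
        {"name": "Google", "type": "ORGANIZATION"},
        {"name": "Mountain View", "type": "LOCATION"},
    ]
}
grouped = entities_by_type(sample_response)
```

Grouping by type keeps the next step simple: each type can be looked up directly when deciding which field a piece of text belongs in.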

What we now need to do is combine the Vision API and the Natural Language API so that we can do something useful, i.e. create new contact records in RealtimeCRM from business cards, as the above flow chart demonstrates.

So we know that “Joe Example” is a text string and that it is a name; therefore it should go into the name field of the contact record inside RealtimeCRM. The other information on the business card follows the same flow.
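The mapping from recognised entities to contact fields could be sketched roughly as follows. This is a simplified illustration, not RealtimeCRM’s actual code: the function build_contact, the field names (name, company, location) and the sample entities are all hypothetical, and the rule of keeping the first entity of each type is just one possible choice.

```python
# Assumed mapping from Natural Language entity types to contact fields.
# The field names are hypothetical, not RealtimeCRM's real schema.
FIELD_FOR_TYPE = {
    "PERSON": "name",
    "ORGANIZATION": "company",
    "LOCATION": "location",
}


def build_contact(entities: list) -> dict:
    """Fill a contact record from a list of {"name", "type"} entities."""
    contact = {field: None for field in FIELD_FOR_TYPE.values()}
    for entity in entities:
        field = FIELD_FOR_TYPE.get(entity["type"])
        # Keep the first entity seen for each field; ignore unmapped types.
        if field is not None and contact[field] is None:
            contact[field] = entity["name"]
    return contact


# Hypothetical entities, as they might come back for a scanned card.
entities = [
    {"name": "Joe Example", "type": "PERSON"},
    {"name": "Google", "type": "ORGANIZATION"},
    {"name": "Mountain View", "type": "LOCATION"},
    {"name": "Search", "type": "OTHER"},
]
contact = build_contact(entities)
```

Here “Joe Example” lands in the name field because it was tagged as a person, and the entity with an unmapped type is simply skipped.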

It took us an hour to build our first prototype business card reader using the Google Cloud APIs. We didn’t have the latency issue, and the cost was roughly 40 times lower than our previous attempt using MTurk: we use two API calls (one Vision and one Natural Language) per lookup, which comes to about $0.0025 per scan, and we get around 1000 free per month.
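To put those numbers side by side, here is a quick back-of-the-envelope calculation. The per-scan prices are the ones quoted in this article; the monthly volume of 10,000 scans is a made-up figure, and we assume the free allowance applies per scan.

```python
MTURK_COST_PER_CARD = 0.10    # "around $0.10 per call" with MTurk
CLOUD_COST_PER_CARD = 0.0025  # one Vision call plus one Natural Language call
FREE_SCANS_PER_MONTH = 1000   # assumption: the free allowance counts scans


def monthly_cost(scans: int, cost_per_card: float, free_scans: int = 0) -> float:
    """Cost in dollars for a month, after subtracting any free scans."""
    billable = max(scans - free_scans, 0)
    return round(billable * cost_per_card, 2)


scans = 10_000  # hypothetical monthly volume
mturk_bill = monthly_cost(scans, MTURK_COST_PER_CARD)
cloud_bill = monthly_cost(scans, CLOUD_COST_PER_CARD, FREE_SCANS_PER_MONTH)

# Per-scan price ratio between the two approaches (about 40).
ratio = MTURK_COST_PER_CARD / CLOUD_COST_PER_CARD
```

At that hypothetical volume the MTurk approach would cost about $1,000 a month, against about $22.50 for the two API calls.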

Our business card reader is no longer a prototype; it’s an integral part of RealtimeCRM and our users take advantage of it every day.

The underlying point is that advances in machine learning algorithms have enabled us to do things that only a few years ago were impossible and, more importantly, to make certain tasks easier for our users, which provides real value to them.

We’re now seeing the rise of voice assistants. It’s all part of the same trend towards simplicity for the end user and towards reducing human-to-machine latency: the business card reader can create contact records faster than a human can, and with voice assistants you can speak faster than you can type. That’s why over 100 million Alexa devices have been sold as of writing - this is the bleeding edge where the value is today, in the same way that a decade ago it was in smartphones.