Spell-A-Bee: Interactive Visual Learning

Spell-A-Bee is a hackathon project built by our team, "Detectors", during our Work Together Week at WeWorks, Pune. The hackathon consisted of 7 teams competing over a span of 4 days. The aim of our project was to explore the usage and implementation of object detection and OCR (Optical Character Recognition) in React Native.

Link to the project repository - https://github.com/TrueSparrowSystems/spell-a-bee/

Project Roadmap

We entered the hackathon with an initial idea: a multiplayer experience in which users would join a private room and be given tasks like “Grab a book” or “Get a glass of water”. Each user would perform the task in real life and then point the application’s camera at the object. The app would recognize the object and award a point to the user who completed the task first.

Day 1

We created a working prototype with basic OCR and object detection that could read printed as well as handwritten text and detect the objects in an image. Additionally, we explored how private multiplayer sessions can be created in React Native with Google Firebase Realtime Database. By the end of the day, we realized that the results provided by Google Cloud’s Vision API were neither fast nor accurate enough to be used in a gaming application.

Day 2

We spent half the day exploring various other libraries for object detection. A summary of our findings is included later in this article. By lunchtime, we decided to scrap the gaming idea as none of the libraries were suitable for our desired implementation. We then decided to keep OCR on hold and focus on object detection first. We came up with an app design involving a simple camera UI that would mark the detected objects on the captured image and clicking on the marker would open a dialog displaying the object’s name.

Day 3

We started the day with a brainstorming session and discussed various product ideas that could be implemented on top of the object detection POC. We eventually settled on an education-focused product for children involving translations and text-to-speech. Two team members worked on the design of the application and the rest of the members began exploring libraries for implementing translation and text-to-speech in React Native. By the end of the day, the design mocks for the application were ready along with a POC with all the features implemented.

Day 4

We primarily focused on implementing the UI and polishing the experience of the application. One of the biggest issues since day 1 was the response time taken by the Google Vision API for every request sent. On this day, we observed that the images being sent were very heavy, contributing majorly to the high load times. The solution was to compress the images after capturing them. This brought the loading time from 10-15 seconds down to 2-5 seconds. Moreover, to improve the app experience and to make it more kid-friendly, we added a few onboarding screens to explain the working of the application on the first start. This marked the completion of the product which finally adopted the name “Spell-A-Bee”.

App Design

From the design perspective, this was a perfect example of technology-driven design. Being asked to turn the efforts of the development team into a viable proof of concept was something we hadn’t done before. We wanted to explore the diverse uses of OCR technology and come up with ideas that go beyond a PDF scanner. To narrow our research, we chose kids as a focus group because a lot of them are getting comfortable with operating their parents’ phones these days. Our goal was to enhance the child’s audio and visual learning capabilities and encourage them to learn a new language by themselves (although the app can also be used by adults). After gaining some insight into the capabilities and limitations of the technology, we came up with the final solution, “Spell-A-Bee”, and created a few versions of low-fidelity wireframes. Finally, we created the visual designs and polished the prototype in Figma.

Spell-A-Bee

"Spell-A-Bee" is a cross-platform application built to help your child (and you) learn different languages easily on the go. Just capture an image of nearby objects and learn what they are called in different languages along with their audio pronunciations.

Features

Detects various objects present in the captured image.
Translates the object's name to various languages.
Also provides pronunciation of the object's name in English and other selected languages.

How does it work?

The user is first prompted to capture a picture of nearby objects. This image is converted to the base64 format and then sent to the Google Cloud Vision API. The API response provides the following information:

name of the object detected
confidence percentage, and
the coordinates of the bounding box.

Using this information, we plot the markers for the detected objects. The object names are passed to the translation API to get the translated text based on the target language selected by the user. This response is then used by the device's Text-to-Speech Service for the pronunciation of the translated text in different languages.
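As a rough illustration of the marker-plotting step, the sketch below converts detected objects into pixel positions on the captured image. It assumes the response follows the Vision API's object localization shape (`localizedObjectAnnotations` entries with `name`, `score`, and `boundingPoly.normalizedVertices`). The `toMarkers` helper and the marker fields are hypothetical names chosen for this example, not taken from the project.

```javascript
// Convert Vision API object annotations into pixel-space markers.
// `annotations` is assumed to be responses[0].localizedObjectAnnotations.
function toMarkers(annotations, imageWidth, imageHeight) {
  return (annotations || [])
    // Skip entries that carry no bounding box at all.
    .filter(a => a.boundingPoly?.normalizedVertices?.length)
    .map(a => {
      const vertices = a.boundingPoly.normalizedVertices;
      // Vertices are normalized to [0, 1]; a coordinate equal to 0 may be
      // omitted from the JSON, so default missing values to 0.
      const xs = vertices.map(v => (v.x ?? 0) * imageWidth);
      const ys = vertices.map(v => (v.y ?? 0) * imageHeight);
      return {
        name: a.name,
        confidence: Math.round((a.score ?? 0) * 100), // percentage
        // Place the marker at the center of the bounding box.
        x: (Math.min(...xs) + Math.max(...xs)) / 2,
        y: (Math.min(...ys) + Math.max(...ys)) / 2,
      };
    });
}
```

Each marker can then be rendered as an absolutely positioned element over the image, with the name passed on to the translation step when the marker is tapped.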

Screenshots

Splash Screen	Language Selection	Language Selection
Loading Screen	Result Screen	Result Screen

Demo

Technologies

This project is developed using React Native Framework.

React Native - Hybrid mobile app development framework.
Google Cloud's Vision API - Derive insights from your images in the cloud or at the edge with AutoML.
Deep Translate API - Deep Translate provides a simple API for translating plain text and HTML documents between any of 100+ supported languages.
react-native-tts - Library that provides an API for using the device's Text-to-Speech service.

Object Detection in React Native

When it comes to object detection, there are several directions a developer can take in order to achieve this in React Native. One would typically need to use an image recognition service that would detect objects, scenes, faces, text, etc. in images. We will be discussing a few of such services below.

Google TensorFlow

TensorFlow is an end-to-end open-source platform for machine learning. It has a comprehensive, flexible ecosystem of tools, libraries, and community resources that lets researchers push the state-of-the-art in ML and developers easily build and deploy ML-powered applications.

tfjs-react-native is a library that provides a TensorFlow.js platform adapter for React Native, making it possible to perform ML operations using Google’s pre-trained models. This is the most commonly used library for implementing object detection in React Native applications.

In our project, we made a demo application using this library. The response for every image request was very fast but pretty inaccurate. Our use case required specific descriptions of the objects in focus (keys, table, book, etc.), but the results this library provided were not suitable for our application (keys were being recognized as ‘security’). Another drawback of this library was that it requires a trained model to be installed on the device for quick results. This causes a spike in the app size and in the startup time, since the model has to be loaded before use. These shortcomings were a deal-breaker for our project and we were forced to move on to a more suitable option.

Amazon Rekognition

This is a very powerful service that offers pre-trained and customizable computer vision (CV) capabilities to extract information and insights from images and videos. In a React Native application, integrating this service requires integrating the AWS Amplify Framework as well. This might make the application unnecessarily heavy if its primary focus is not object detection. Once the Amplify framework is integrated, the Rekognition API can be used by sending the image with a POST request.

In our project, the integration of this service was becoming a time-consuming task and since there was little documentation and support available for React Native, we considered it an unsuitable option for this hackathon.

Google Firebase ML Kit

This library is provided by Google and offers a solution very similar to that of TensorFlow. Hence, it shares TensorFlow's shortcomings, the biggest being that an ML model has to be stored and run on the device. In addition, a Firebase project needs to be created and configured in the application in order to use this library, which makes the application heavy.

Clarifai

This solution is provided by Clarifai, a service that uses pre-trained models to return detected objects in response to a POST request. It is very easy to use in React Native as it is provided as an npm package.

In our project, this library was a great contender as a possible implementation and a prototype was created using this library. But, we observed that the results were very generalized for the images being sent, for example, a table was recognized as a 'table top'. We tried sending high-quality images of larger sizes hoping for better results but failed, which forced us to move on.

Google Cloud's Vision API

Google Cloud's Vision API allows developers to easily integrate vision detection features, including image labeling, face and landmark detection, optical character recognition (OCR), and tagging of explicit content, within applications. The API supports the detection and classification of multiple objects, including the location of each object within the image.

The advantage of this library is that it is extremely easy to integrate and use. It requires an initial project setup on the Google Cloud console along with billing information. After that, a POST request can be sent from the application along with a Base64 encoded image. The API response is quick and it includes comprehensive information about all the objects that are recognized along with their coordinates in the image, probability of the prediction, etc. This information is helpful while plotting the objects when the response is received. Moreover, this implementation kept the app light and fast.
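A minimal sketch of such a request is shown below. It assumes the public REST endpoint `images:annotate` with the `OBJECT_LOCALIZATION` feature and API-key authentication through the `key` query parameter; the article does not say which authentication method the project used, so treat that part as an assumption. The helper names `buildVisionRequest` and `detectObjects` are hypothetical.

```javascript
const VISION_ENDPOINT = 'https://vision.googleapis.com/v1/images:annotate';

// Build the JSON payload for a single Base64 encoded image.
function buildVisionRequest(base64Image, maxResults = 10) {
  return {
    requests: [
      {
        image: {content: base64Image},
        features: [{type: 'OBJECT_LOCALIZATION', maxResults}],
      },
    ],
  };
}

// Send the image and return the list of detected objects (possibly empty).
// Assumption: `apiKey` is a Google Cloud API key with the Vision API enabled.
async function detectObjects(base64Image, apiKey) {
  const url = `${VISION_ENDPOINT}?key=${encodeURIComponent(apiKey)}`;
  const response = await fetch(url, {
    method: 'POST',
    headers: {'Content-Type': 'application/json'},
    body: JSON.stringify(buildVisionRequest(base64Image)),
  });
  if (!response.ok) {
    throw new Error(`Vision API request failed with status ${response.status}`);
  }
  const json = await response.json();
  // Each entry carries a name, a score and a boundingPoly.
  return json.responses?.[0]?.localizedObjectAnnotations ?? [];
}
```

Keeping the payload builder separate from the network call makes it easy to check the request shape without spending API quota.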

Considering these differences and the advantages and disadvantages of each option for our use case, we decided to use the object localization feature of Google Cloud's Vision API.

Translation API

To make our content and app multilingual with fast, dynamic machine translation, we have used a translation API that uses neural machine translation technology to translate texts into more than a hundred languages. It is highly responsive, so applications can integrate this translation API for fast, dynamic translation of text from a source language to a target language.

We initially started with Google Translate by Google Cloud, but ran into a few issues while integrating it with React Native. Additionally, the service seemed costly. So, we settled on the Deep Translate API, which offers translation quality similar to Google Translate but integrates easily with React Native apps and is significantly more affordable.

To get the API key and subscription, we used Rapid API, which makes it easier to find, connect to, and manage APIs across multiple cloud environments. We followed these steps to get an API key.

To translate the text, we need to provide a string to the API along with the source language and target language. The API then responds with the translated text.

Code for this API call:

const options = {
  method: 'POST',
  headers: {
    'content-type': 'application/json',
    'X-RapidAPI-Host': 'deep-translate1.p.rapidapi.com',
    'X-RapidAPI-Key': TRANSLATION_API_KEY,
  },
  // Text to translate, with the source and target language codes.
  body: JSON.stringify({q: 'Hello World!', source: 'en', target: 'hi'}),
};

fetch('https://deep-translate1.p.rapidapi.com/language/translate/v2', options)
  .then(response => response.json())
  // Resolve with the translated string (undefined if the field is missing).
  .then(jsonResponse => jsonResponse?.data?.translation?.translatedText)
  // Resolve with the error object instead of rejecting.
  .catch(err => err);
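Since the app translates each object name into several languages, the request options can be generated per target language. The sketch below is one possible way to do this, not the project's actual code: `buildTranslateOptions` and `buildRequestsForLanguages` are hypothetical helpers, and the language codes are the ISO 639-1 codes for Hindi, French, Spanish, and German.

```javascript
const TRANSLATE_URL =
  'https://deep-translate1.p.rapidapi.com/language/translate/v2';

// Build fetch options for translating `text` from `source` into `target`.
function buildTranslateOptions(text, target, apiKey, source = 'en') {
  return {
    method: 'POST',
    headers: {
      'content-type': 'application/json',
      'X-RapidAPI-Host': 'deep-translate1.p.rapidapi.com',
      'X-RapidAPI-Key': apiKey,
    },
    // JSON.stringify escapes quotes and other special characters in `text`.
    body: JSON.stringify({q: text, source, target}),
  };
}

// One request description per target language.
function buildRequestsForLanguages(text, targets, apiKey) {
  return targets.map(target => ({
    target,
    url: TRANSLATE_URL,
    options: buildTranslateOptions(text, target, apiKey),
  }));
}

// Usage: pass each entry to fetch(entry.url, entry.options).
const requests = buildRequestsForLanguages(
  'book',
  ['hi', 'fr', 'es', 'de'],
  'YOUR_RAPIDAPI_KEY', // placeholder, replace with a real key
);
```

Each entry can be sent with `fetch` and handled exactly like the single call shown earlier.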

Benefits:

Easy to configure and use.
Fast and dynamic translations.
Support of 100+ languages.
Cost-Efficient

Limitations:

Due to lack of context, the translation result might be inconsistent/incorrect.

Text to speech

Text to speech is a service that powers applications to read/speak out the text on screen with multi-language support.

There are many solutions like Google Text-to-Speech, Amazon AWS Polly, and Microsoft Azure Text-to-Speech. All of them have official iOS and Android SDKs, but it is difficult to find an SDK for React Native. In this project, we chose the react-native-tts library. It is very easy to set up this library and we can easily configure items like voice, language, and speech rate, which were required for our project.

Example

Language:

Tts.setDefaultLanguage('en-IE');

Voice:

Tts.setDefaultVoice('com.apple.ttsbundle.Moira-compact');

Speech Rate:

Tts.setDefaultRate(0.6);
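Putting these settings together, a translated word could be spoken roughly as follows. This is a sketch under assumptions: the `tts` argument is expected to be the default export of react-native-tts (passed in so the helper can be exercised without a device), and `speakIn` is a hypothetical helper name.

```javascript
// Speak `text` in the given language, e.g. speakIn(Tts, 'Buch', 'de-DE').
// `tts` is assumed to expose stop, setDefaultLanguage, setDefaultRate and
// speak, as react-native-tts does.
async function speakIn(tts, text, language, rate = 0.6) {
  // Cancel anything that is still being spoken.
  await tts.stop();
  // Switch the engine to the language of the translated text.
  await tts.setDefaultLanguage(language);
  await tts.setDefaultRate(rate);
  tts.speak(text);
}
```

On a device where the requested language is not installed, `setDefaultLanguage` may reject, so a caller would typically wrap this in a try/catch and fall back to English.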

Advantages

Easy setup of the library.
Native text-to-speech engines on both platforms make the feature easy to use in your app.
Some languages are available without downloading.

Disadvantages

Lack of consistency.
Android and iOS have different engines so the selection of voice is different.
A list of downloaded languages is not available on some devices
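One way to soften the voice differences between platforms is to choose a voice at runtime from whatever the device reports. The sketch below assumes the voice objects have the shape returned by `Tts.voices()` in react-native-tts (`id`, `language`, and the optional flags `networkConnectionRequired` and `notInstalled`); `pickVoice` is a hypothetical helper.

```javascript
// Pick an installed, offline-capable voice for a language tag like 'fr-FR'.
// Falls back to any voice with the same base language ('fr'), else null.
function pickVoice(voices, language) {
  const usable = (voices || []).filter(
    v => !v.networkConnectionRequired && !v.notInstalled,
  );
  const exact = usable.find(v => v.language === language);
  if (exact) {
    return exact;
  }
  const base = language.split('-')[0];
  return usable.find(v => (v.language || '').split('-')[0] === base) ?? null;
}
```

The `id` of the selected voice can be passed to `Tts.setDefaultVoice`. When the function returns null, the app could prompt the user to install the language instead of failing silently.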

Conclusion

The application is able to detect various objects in an image and translate their names into 4 languages, namely, Hindi, French, Spanish, and German. The user can listen to the translated text using their device's text-to-speech services.

Since this competition was a first-time experience competing in a hackathon for most of the team members, it helped us learn several new technologies, planning, and collaborative skills.

Future Scope

Improving ML model accuracy
Increasing language support
Improving user experience
Implementing OCR in the application such that the text read from an image can also be translated and read.
Detecting objects or providing translations of the text in real time, without the need to capture a picture.