Custom People Detection on an Android Device

As data scientists, we develop software that leverages machine learning.  The following is a use case that applies machine learning to identify whether a shot has correctly hit its target in a war-game scenario. In the first phase, we focused on building an ML model that finds a bounding box around a person in an image. Below are the challenges, solution, process, and results.

Challenges

  1. OpenCV and pre-trained TensorFlow object detection models were too general to deliver high accuracy.
  2. Some people were partially hidden behind bushes.
  3. There was a set of infrared images that the pre-trained models barely picked up at all.
  4. The model also needed to be fast and lightweight enough to run on a mobile device such as a phone or tablet.
[Image: a person hidden behind a bush, which pre-trained models fail to pick up]
[Image: a person under night vision, not always picked up by the generalized models released in Google's model zoo]

Solution

Using the TensorFlow Object Detection API, we trained two separate models to identify people in camouflage and under night vision.  We also built customized visualizations to inspect the models' results and developed a “confusion matrix” to quantify them.

Technology Stack

Coding Language:

  • Python 3 (TensorFlow)
  • Java (Android Studio)

Python Packages:  

  • OpenCV
  • Tensorflow
  • PIL

Google Cloud Platform Tools:

  • Cloud Storage (with the gsutil command-line tool)
  • Compute Engine (with a Tesla P100 GPU)

Process

Since there were roughly 20,000 images, the data was too large to send as a zip file over the internet. The process was to:

  1. Have the client upload the data to Cloud Storage, then pull it all down with the gsutil command-line tool.
  2. Split the data into an 80/20 train/test split.
  3. Convert the dataset into TFRecords (a sketch of steps 2 and 3 follows this list).
  4. Given the amount of training needed and the size of the dataset, train on Compute Engine within Google Cloud Platform.  With a Tesla P100 GPU, we were able to train relatively quickly, as opposed to weeks on a local machine.
  5. Benchmark the model and deploy it to a mobile device to demo for the client.
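
As an illustration, here is a minimal sketch of steps 2 and 3. The file layout, the image_paths and annotations variables, and the single "person" class are assumptions for illustration; the Object Detection API's own create_*_tf_record scripts show the full recipe.

    # Step 1 (command line): pull the client's upload down from Cloud Storage.
    #   gsutil -m cp -r gs://<bucket>/images ./data

    import io
    import random
    import tensorflow as tf
    from PIL import Image

    def create_example(image_path, boxes):
        # boxes: list of (xmin, ymin, xmax, ymax) pixel coordinates for class 'person'.
        with tf.gfile.GFile(image_path, 'rb') as f:
            encoded = f.read()
        width, height = Image.open(io.BytesIO(encoded)).size
        def floats(vals):
            return tf.train.Feature(float_list=tf.train.FloatList(value=vals))
        def bytes_(vals):
            return tf.train.Feature(bytes_list=tf.train.BytesList(value=vals))
        def ints(vals):
            return tf.train.Feature(int64_list=tf.train.Int64List(value=vals))
        feature = {
            'image/encoded': bytes_([encoded]),
            'image/format': bytes_([b'jpeg']),
            'image/width': ints([width]),
            'image/height': ints([height]),
            # The API expects box coordinates normalized to [0, 1].
            'image/object/bbox/xmin': floats([b[0] / width for b in boxes]),
            'image/object/bbox/ymin': floats([b[1] / height for b in boxes]),
            'image/object/bbox/xmax': floats([b[2] / width for b in boxes]),
            'image/object/bbox/ymax': floats([b[3] / height for b in boxes]),
            'image/object/class/text': bytes_([b'person'] * len(boxes)),
            'image/object/class/label': ints([1] * len(boxes)),
        }
        return tf.train.Example(features=tf.train.Features(feature=feature))

    # Steps 2 and 3: 80/20 split, then one TFRecord file per split.
    random.shuffle(image_paths)          # image_paths, annotations: hypothetical inputs
    cut = int(0.8 * len(image_paths))
    for name, paths in [('train', image_paths[:cut]), ('test', image_paths[cut:])]:
        with tf.python_io.TFRecordWriter(name + '.record') as writer:
            for path in paths:
                writer.write(create_example(path, annotations[path]).SerializeToString())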

Evaluation

To see how the model was doing, we overlaid the detected results on the annotated ground truth.  Initially, the model results looked something like this:

[Image: evaluation overlay, where red is ground truth and green is detected by the trained model]

This means that the model needed more training to draw better bounding boxes around the region of interest (ROI).
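
The overlay itself is simple to produce. Here is a minimal sketch with OpenCV, assuming boxes come as (xmin, ymin, xmax, ymax) pixel tuples; our actual visualization code differed in the details.

    import cv2

    def draw_overlay(image_path, ground_truth, detections, out_path):
        # Ground truth in red, model detections in green (OpenCV uses BGR order).
        img = cv2.imread(image_path)
        for (xmin, ymin, xmax, ymax) in ground_truth:
            cv2.rectangle(img, (xmin, ymin), (xmax, ymax), (0, 0, 255), 2)
        for (xmin, ymin, xmax, ymax) in detections:
            cv2.rectangle(img, (xmin, ymin), (xmax, ymax), (0, 255, 0), 2)
        cv2.imwrite(out_path, img)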

Over many iterations, the results looked more like this:

[Image: evaluation overlay, where the detected boxes sit on top of the ground-truth boxes]

Models like these are typically evaluated by mAP, but for easier interpretation we created an internal “confusion matrix” to fill that void.  Here is an explanation of it in one of our previous posts on Detecting Retail Objects.
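
In outline, the matrix matches detections to ground-truth boxes by intersection over union (IoU) and counts true positives, false positives, and false negatives. A simplified sketch; the 0.5 threshold and the greedy matching are illustrative rather than our exact rules:

    def iou(a, b):
        # Intersection over union of two (xmin, ymin, xmax, ymax) boxes.
        iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
        ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = iw * ih
        union = ((a[2] - a[0]) * (a[3] - a[1]) +
                 (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / float(union) if union else 0.0

    def confusion_counts(ground_truth, detections, threshold=0.5):
        # Greedily match each detection to the best unmatched ground-truth box.
        tp, matched = 0, set()
        for det in detections:
            scores = [(iou(det, gt), i) for i, gt in enumerate(ground_truth)
                      if i not in matched]
            if scores:
                best_iou, best_i = max(scores)
                if best_iou >= threshold:
                    matched.add(best_i)
                    tp += 1
        fp = len(detections) - tp    # detections with no matching person
        fn = len(ground_truth) - tp  # people the model missed
        return tp, fp, fn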

Model Benchmarks

Faster RCNN Inception on Visual

Model Iterations: 38515
mAP = 0.80
True Positive = 1455
False Positive = 6
False Negative = 7
Total = 1493

Faster RCNN Inception on Infrared

Model Iterations: 77241
mAP = 0.71
True Positive = 1026
False Positive = 38
False Negative = 38
Total = 1128

The infrared dataset was more challenging due to the nature of the images and the difficulty of distinguishing a person from other gray parts of the image.  Training Faster RCNN demonstrated the capabilities of object detection to the client, and what we learned from the dataset helped us clean it up for mobile detection.

SSDLite MobileNet v2 Benchmarks on Visual

Model Iterations: 183371
True Positive = 1378
False Positive = 0
False Negative = 41
Total = 1492
mAP = 0.78
Accuracy = tp / (total + fp + fn) = 1378 / (1492 + 0 + 41) = 0.898

SSDLite MobileNet v2 Benchmarks on Infrared

Model Iterations: 121159
True Positive = 837
False Positive = 6
False Negative = 157
Total = 1053
mAP = 0.63
Accuracy = tp / (total + fp + fn) = 837 / (1053 + 6 + 157) = 0.688
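
The custom accuracy above is just arithmetic over those counts:

    def accuracy(tp, total, fp, fn):
        # The internal metric used above: tp / (total + fp + fn).
        return tp / float(total + fp + fn)

    print(accuracy(1378, 1492, 0, 41))    # ~0.898, SSDLite on visual
    print(accuracy(837, 1053, 6, 157))    # ~0.688, SSDLite on infrared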

As you can see, high accuracy scores in the confusion matrix correlate with high mAP.  But due to the nature of the SSD model, some detections were missed in exchange for speed. The infrared dataset in particular did not do well with the SSD model.

TensorFlow to Mobile

We then pivoted to SSDLite MobileNet V2 for its speed and compatibility with mobile devices, without sacrificing too much accuracy.  To try to get close to what FRCNN achieves, we referred to this blog post about the speed/accuracy trade-offs of today's modern object detection architectures.  Basically, to get close to FRCNN's accuracy, two or three times more training would be needed.  At first, we were stuck in the high-70s to low-80s accuracy range, with many missed detections.  We believe this is because SSD models have a difficult time adjusting to objects of different sizes.  What made a difference was trying the image augmentation methods found here to randomly resize the image, along with other methods that adjust lighting, hue, contrast, saturation, and so on, to make the model more robust.  After that discovery, we were able to increase the accuracy to just over 90%.
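
In the TensorFlow Object Detection API, these augmentations live in the pipeline.config used for training. A sketch of the kind of options we mean, with illustrative values rather than our exact settings:

    train_config {
      # ... batch size, optimizer, and other settings ...
      data_augmentation_options { random_horizontal_flip {} }
      data_augmentation_options {
        # Randomly resize the image so the model sees objects at varied scales.
        random_image_scale { min_scale_ratio: 0.5  max_scale_ratio: 2.0 }
      }
      data_augmentation_options { random_adjust_brightness {} }
      data_augmentation_options { random_adjust_contrast {} }
      data_augmentation_options { random_adjust_hue {} }
      data_augmentation_options { random_adjust_saturation {} }
    }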

We also learned a great deal from this blog post on Pikachu Detection about how to deploy a model to a mobile device.
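
For reference, the usual TF 1.x route onto a device is to export a TFLite-compatible frozen graph and then convert it. A sketch, assuming an SSD pipeline and illustrative paths (the checkpoint number here matches our visual model's iteration count):

    # Export a TFLite-compatible frozen graph from the trained SSD checkpoint
    # (this script ships with the TF Object Detection API).
    python object_detection/export_tflite_ssd_graph.py \
        --pipeline_config_path=training/pipeline.config \
        --trained_checkpoint_prefix=training/model.ckpt-183371 \
        --output_directory=tflite_export \
        --add_postprocessing_op=true

    # Convert the frozen graph to a .tflite file for the Android app.
    tflite_convert \
        --graph_def_file=tflite_export/tflite_graph.pb \
        --output_file=detect.tflite \
        --input_shapes=1,300,300,3 \
        --input_arrays=normalized_input_image_tensor \
        --output_arrays=TFLite_Detection_PostProcess,TFLite_Detection_PostProcess:1,TFLite_Detection_PostProcess:2,TFLite_Detection_PostProcess:3 \
        --inference_type=FLOAT \
        --allow_custom_ops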

Mobile Benchmarks

Faster RCNN Inception

Speed: ~ 3000 ms

Pros:

  • High Accuracy
  • Much more adaptable to similar-looking images pulled from the internet

Cons:

  • Inference is far slower than the client's goal of 150 ms

SSDLite MobileNet V2

Speed: ~ 200 ms

Pros:

  • Fast and mobile-friendly.  

Cons:

  • Not as adaptable to different images outside of the dataset

What We Learned

1) For RCNN models, the deeper layers seem to sort themselves out even with mislabeled images, but fast, shallow mobile models such as SSD MobileNet V2 are not as resilient to mislabels as the deeper models are.

For example, the person behind the bush is annotated differently from image to image:

[Images: the person behind the bush annotated inconsistently]

Which leads to situations like this:

[Images: the model confused about how to detect the person behind the bush]

FRCNN was also able to overcome low-resolution images and small objects, while SSD did not do so well in that regard.

2) There are better options than the default parameters given in the config files.  For example, using the Adam optimizer shown here, the model's loss improved, achieving higher accuracy.  We also lowered the batch size to allow more iterations.
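
In pipeline.config terms, swapping the default optimizer for Adam looks roughly like this (the learning-rate schedule values are illustrative, not our exact settings):

    train_config {
      optimizer {
        adam_optimizer {
          learning_rate {
            exponential_decay_learning_rate {
              initial_learning_rate: 0.0002
              decay_steps: 10000
              decay_factor: 0.95
            }
          }
        }
      }
    }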

3) With a powerful GPU in the cloud, which GCP offers, it was easy to overtrain the model, and after a certain point more training time led to decreased performance. This could be prevented with some early-stopping measures in place, as sketched below.
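
We did not have this in place at the time, but the idea is simple: track the evaluation metric across checkpoints and stop once it stops improving. A generic sketch:

    class EarlyStopper(object):
        # Stop training once eval mAP has not improved for `patience` evaluations.
        def __init__(self, patience=5):
            self.patience = patience
            self.best_map = 0.0
            self.evals_since_best = 0

        def should_stop(self, eval_map):
            if eval_map > self.best_map:
                self.best_map = eval_map      # new best checkpoint: keep this one
                self.evals_since_best = 0
            else:
                self.evals_since_best += 1
            return self.evals_since_best >= self.patience

    # After each periodic evaluation:
    #   if stopper.should_stop(current_map): stop training, restore the best checkpoint.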

Results

In the end, we developed a fast, lightweight model that runs on an Android device.  After training a custom model, we were able to identify people behind bushes and people who were only partially visible.

[Images: detection results on visual imagery, including people behind bushes and partially visible people]

With a separate custom model, we were also able to detect people in infrared:

[Images: detection results on infrared imagery]

Using Android Studio, we were able to embed a trained model in an Android device as shown above.  There are some examples of this in the TensorFlow repo.  Even though that is the official repo, what we found most useful was this Object-Detection-Android-Example, which focuses solely on the object detection app, while the standard TensorFlow code builds several apps in its package.  The code base is Java, but we were able to understand it even though we are predominantly Python developers on the data science team.

Here is an example of object detection running on my Android phone:

[Images: detection running in the emulator]

Proposed Future Improvements

  • More data would be added and the models retrained.
  • Different models would be created for different war-gaming conditions.
  • The mobile app would be customized to the customer's needs.
  • Different techniques such as quantization and hardware acceleration could be tried to speed up detection on the mobile device.

Conclusion

This was a very interesting project that challenged our skills as data scientists and pushed us outside of our comfort zone to try different technologies and stacks.  In the end, we were able to create customized object detection models, train them on GCP, and deploy them to an Android device. Along the way, we touched on a variety of different tools to complete the project within the constraints of the project budget.