Core ML is a great way to add a pre-trained model to your app. But one thing that nagged me after trying my hands on Core ML was: how can I train my own model and integrate it into my apps using Core ML? After doing some homework, a lot of light dawned on me about the brilliant possibilities for achieving this. To be honest, most of the ways require you to understand and know your math really well! While I was on this roller coaster ride, I came across Custom Vision. What a brilliant relief for developers who want to start right away with training models and manifesting their machine learning ideas into the reality of mobile apps, without diving too deep into the waters of machine learning.
Microsoft’s Custom Vision allows you to upload images with tags, train your model to classify images into these tags, and export the trained model in a format of your choice (we will focus primarily on the Core ML format in this blog). Along with this, Custom Vision gives you a dashboard of your trained model’s performance, gauging it on precision and recall percentages. You can even test your trained model using their interface.
The free trial lets you create two projects and use them to train your models; anything beyond this requires a paid plan. It’s a great start for trying your hand at self-training a machine learning model.
I will walk you through a basic hand sign detector, which recognises ROCK, PAPER, SCISSORS! By that I mean it recognises a closed fist, an open palm and a victory sign. This could even be taken forward to build a sign language interpreter. So here goes! Bon Voyage!
Create a new Custom Vision Project
- Log in to Custom Vision or sign up if you don’t have a Microsoft account already.
- Once you sign in, add a new project and give it a name and description.
- Next, select the project type. Ours will be Classification, as we are building our own classification model. The other option, Object Detection, is for building a custom object detection model.
- The classification type to choose is use-case dependent: it depends on how many predictions you want from one input image. For instance, do you want to input a picture of a person and predict their gender, emotion and age, or just their gender? If you wish your model to predict just one thing per input image, choose Multiclass (Single tag per image); otherwise choose Multilabel (Multiple tags per image). We will choose the former for this example.
- Finally, choose General (compact) as the domain. This gives a compact model suitable for mobile and lets you export it directly as a Core ML model.
Figure 1: Create Project Dialogue Box on Custom Vision
Train your image classifier model
Once the project has been created, you will need a lot of pictures! And by a lot, I mean a lot! Here’s the deal: for each type of prediction you want your model to make, you need to train it with a bunch of images, telling it what is called what. It is like training a child to associate a word with anything the child perceives, but just faster! 😛 For my model, I trained it with around 60-70 images each of a fist, an open palm, a victory sign and no hand. Below is a list of things I did to train my model:
- Collect images! Diversity is the key in this activity. I collected images of all three signs in various lighting conditions, at various positions on the phone screen, with different backgrounds and a lot of different hands. Thanks to all my awesome hand modelling volunteers (a.k.a. family and friends)! You could do the same. This is the most fun part of the voyage.
Here are the snapshots of my memories from the voyage! More about them in the following points.
- Each set of images needs to be tagged with a label, which will be the output of your model after prediction. So we need to add these tags first. Add your tags with the ‘+’ button near the ‘Tags’ label. I added four such tags: FistHand, FiveHand, NoHand, VictoryHand.
- Once you are done adding your tags, it is time to upload your images and tag them! Click the Add Images option at the top of the screen and upload all images belonging to one group together, along with the tag for that group.
Quick Tip: Resize all your images to a smaller dimension so that the model size isn’t too big. The Preview app can easily do this for all your images in one go.
- All image sets have been added and tagged. Now, to train your model, all you need to do is press a button! Click the ‘Train’ button at the top, and Custom Vision’s machine learning engine trains a model for you using the images you fed it.
- After the engine finishes training your model, it opens the Performance tab showing how your model performs. Here you see the overall precision and recall percentages of your model, along with precision and recall for each tag. Based on how satisfied you are with the performance, you can use the model as is or improve it further. To re-train your model, add a few more variants for each tag and hit the ‘Train’ button again.
Figure 2: Performance Tab on Custom Vision dashboard for trained model
- You can test your model by clicking the Quick Test button next to the Train button at the top. Here you can upload new pictures and test your model’s classification.
Using your trained model in your iOS app
Once you have a trained model you are satisfied with, you have two options.
- You can use the prediction endpoint provided by Microsoft: send it an image over the network and it sends back the prediction. To view the endpoint details, go to the ‘Prediction’ tab and hit the ‘View Endpoint’ button; here you will find all the details of the API endpoint. It works with either an image URL or an actual image file.
Figure 3: Prediction API for trained model
- The other, faster and more secure path is the Core ML way. You can export your trained model as a Core ML model from the Performance tab: hit the Export button and select iOS – Core ML as the export type. Ta-Da! You have your .mlmodel file ready to be integrated into your iOS project. You will have something that does the following:
Figure 4: Input and ideal output of the HandSigns.mlmodel
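If you go the endpoint route instead, a prediction call is just an HTTPS POST with your prediction key in a header. Here is a minimal sketch using URLSession; the URL placeholders, key and JSON field names are assumptions to be replaced with the exact values shown in the ‘View Endpoint’ dialog:

```swift
import Foundation

// Placeholders — copy the real values from the 'View Endpoint' dialog.
let endpointString = "https://<region>.api.cognitive.microsoft.com/customvision/v3.0/Prediction/<projectId>/classify/iterations/<iteration>/image"
let predictionKey = "<your-prediction-key>"

func classify(imageData: Data, completion: @escaping (String?) -> Void) {
    guard let url = URL(string: endpointString) else { return completion(nil) }
    var request = URLRequest(url: url)
    request.httpMethod = "POST"
    request.setValue(predictionKey, forHTTPHeaderField: "Prediction-Key")
    request.setValue("application/octet-stream", forHTTPHeaderField: "Content-Type")
    request.httpBody = imageData

    URLSession.shared.dataTask(with: request) { data, _, _ in
        // The response JSON carries a "predictions" array of tags with probabilities.
        guard let data = data,
              let json = try? JSONSerialization.jsonObject(with: data) as? [String: Any],
              let predictions = json["predictions"] as? [[String: Any]],
              let topTag = predictions.first?["tagName"] as? String else {
            return completion(nil)
        }
        completion(topTag)
    }.resume()
}
```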
Setup a live feed capture from phone camera
In this example, a live feed capture is sent to the Core ML model, which gives out its prediction. So we need a setup that starts the camera, begins live capture, and feeds the sample buffers to our prediction model. This is a fairly straightforward bit of code once you understand what is happening.
- For this example, we take a Single View App project. It can be any other type depending on your requirements.
- Once in the project, let us build a method to configure our camera, which will be called from viewDidLoad(). We will do this with the help of AVCaptureSession, so you will have to import AVKit.
- As we are setting the video output buffer delegate to our view controller, it must conform to AVCaptureVideoDataOutputSampleBufferDelegate in order to implement the captureOutput(_:didOutput:from:) method, which catches the sample buffers delivered by the AVCaptureConnection.
- The last thing remaining for this setup is listing the camera-usage permission in Info.plist. Add the ‘Privacy – Camera Usage Description’ key and give it a string value, something like ‘App needs camera for detection’. Your setup is in place now! Go ahead and run it on a device; the app should open up the camera as it launches.
Figure 5: Info.plist with Camera Usage Permission
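Putting the camera steps above together, a minimal configureCamera() might look like the sketch below. The preview layer and the queue label are my own choices for illustration, not requirements:

```swift
import UIKit
import AVKit

class ViewController: UIViewController, AVCaptureVideoDataOutputSampleBufferDelegate {

    let session = AVCaptureSession()

    func configureCamera() {
        session.sessionPreset = .photo

        // Use the default video camera as input.
        guard let camera = AVCaptureDevice.default(for: .video),
              let input = try? AVCaptureDeviceInput(device: camera) else { return }
        session.addInput(input)

        // Show the live feed on screen.
        let previewLayer = AVCaptureVideoPreviewLayer(session: session)
        previewLayer.frame = view.frame
        view.layer.addSublayer(previewLayer)

        // Deliver frames to this view controller as sample buffers,
        // off the main thread on a dedicated queue.
        let output = AVCaptureVideoDataOutput()
        output.setSampleBufferDelegate(self, queue: DispatchQueue(label: "videoQueue"))
        session.addOutput(output)

        session.startRunning()
    }

    override func viewDidLoad() {
        super.viewDidLoad()
        configureCamera()
    }
}
```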
Integrate your Core ML model in your iOS project
The real fun, for which you have been taking all these efforts, begins now. Including a Core ML model in an iOS project is as simple as dragging and dropping it into your project structure in Xcode. Once you add your downloaded/exported Core ML model, you can inspect it by clicking on it and checking the generated Swift file for your model. Mine looks like this:
Figure 6: HandSign.mlmodel overview
The steps below will guide you through this:
- First things first, import the Vision and CoreML frameworks for this image classifier example. We will need them while initialising the model and using the Core ML functionality in our app.
- Now we will make an enum for our prediction labels/tags.
- When our model outputs a result, we reduce it to a string. You will need a UI component to display it, so we will add a UILabel to our ViewController through the storyboard file, with constraints pinning it to the bottom of the screen.
Figure 7: UILabel for displaying prediction
- Create an outlet for the UILabel in your view controller. I have named it predictionLabel.
- Once we have all of that in place, we can initialise our Core ML model – HandSignsModel.mlmodel – extract the sample buffer from the AVCaptureConnection, feed it as input to our hand sign detector model, and utilise the output of its prediction. To do so, we implement the captureOutput(_:didOutput:from:) method of AVCaptureVideoDataOutputSampleBufferDelegate. A detailed step-wise explanation of everything happening in this method is in the inline code comments; it seemed like the best way to put it up.
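The steps above can be sketched as follows. I am assuming the generated model class is called HandSigns (it takes its name from your .mlmodel file, so adjust accordingly) and that the UILabel outlet is named predictionLabel; the enum cases mirror the four tags we created earlier:

```swift
import UIKit
import AVKit
import Vision

// Mirrors the tags we added on Custom Vision.
enum HandSign: String {
    case fistHand = "FistHand"
    case fiveHand = "FiveHand"
    case noHand = "NoHand"
    case victoryHand = "VictoryHand"
}

extension ViewController {
    func captureOutput(_ output: AVCaptureOutput,
                       didOutput sampleBuffer: CMSampleBuffer,
                       from connection: AVCaptureConnection) {
        // Grab the pixel buffer for the current frame.
        guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }

        // Wrap the generated Core ML model in a Vision model.
        guard let model = try? VNCoreMLModel(for: HandSigns().model) else { return }

        // Ask Vision to run the classifier on this frame.
        let request = VNCoreMLRequest(model: model) { finishedRequest, _ in
            guard let results = finishedRequest.results as? [VNClassificationObservation],
                  let topResult = results.first,
                  let sign = HandSign(rawValue: topResult.identifier) else { return }

            // UI updates must happen on the main thread.
            DispatchQueue.main.async {
                self.predictionLabel.text = "\(sign.rawValue) \(Int(topResult.confidence * 100))%"
            }
        }

        // Hand the frame to Vision for prediction.
        try? VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: [:]).perform([request])
    }
}
```

Note that Vision takes care of scaling and cropping the pixel buffer to the input size the model expects, which is why we can feed it raw camera frames.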
Congratulations, you have built your very own machine learning model and integrated it into an iOS app. You can find the entire project with the model and implementation here.
Here is a working demo of the app we have referred to through this tutorial: