Machine learning – how to
Machine learning is widely and successfully used nowadays. We tend to think that computers know things, but actually they do not know – they compute. All they do is operate on numbers. We teach them to “know” by telling them what to compute.
Still wondering what that machine learning thing actually is? It’s a system built around an algorithm (a function) that transforms (classifies) a given input into a corresponding label/class (the output).
How does classification work internally? When the ML system gets an input sample, it extracts a feature vector (a vector of numbers) from it and checks which of the defined classes (labels) this vector belongs to.
I would like to give you an overview of how the process of developing a machine learning (ML) system looks, or may look, with a particular focus on supervised learning – just enough ideas to let you start playing around with ML.
Let’s begin with some basic definitions.
Learning (training) set – a set of artefacts (images, sound samples, statistics – whatever you would like to run ML on) containing both true positive and true negative examples. This set is used to “train” the system by providing the feature vector (described below) of each artefact together with the expected output, so that the system may learn which feature values correspond to which output.
Validation set – a set of artefacts, again containing true positive and true negative examples, used to tune the hyperparameters of the trained model. It’s not needed for some methods.
Testing set – a set of artefacts containing true positive and true negative examples. This set is used to verify how well the system has been trained, by asking it to classify each artefact and checking whether its output is correct.
Feature vector – a vector of an artefact’s features: numbers or values that describe the artefact.
Feature descriptor (or feature extractor) – a method that extracts features from artefacts; that is, a method that describes an artefact as a feature vector – it translates the artefact into a vector.
Classification algorithm – an algorithm that determines which feature vector corresponds to which label (output class).
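To make the three sets above concrete, here is a minimal sketch (assuming NumPy and a hypothetical dataset of random feature vectors) of carving one pool of samples into training, validation and testing sets:

```python
import numpy as np

# Hypothetical dataset: 100 feature vectors (4 features each) with binary labels.
rng = np.random.default_rng(seed=42)
features = rng.normal(size=(100, 4))
labels = rng.integers(0, 2, size=100)

# Shuffle once, then carve out 70% / 15% / 15% splits.
indices = rng.permutation(len(features))
n_train = int(0.70 * len(features))  # 70 samples for training
n_val = int(0.15 * len(features))    # 15 samples for validation

train_idx = indices[:n_train]
val_idx = indices[n_train:n_train + n_val]
test_idx = indices[n_train + n_val:]

X_train, y_train = features[train_idx], labels[train_idx]
X_val, y_val = features[val_idx], labels[val_idx]
X_test, y_test = features[test_idx], labels[test_idx]

print(len(X_train), len(X_val), len(X_test))  # 70 15 15
```

The key point is that the three sets are disjoint: a sample used for training must never be reused for validation or testing.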
At the very beginning, you need to gather appropriate samples for the learning and test sets. It’s important that the samples are as close as possible to the artefacts the production system will see. If you would like to “teach” your system to recognize speech coming from telephone calls, record samples from telephone calls. If you would like to create a system that detects certain behaviours in groups of people, record the desired behaviours within a group of people (instead of a one-man show). Of course, you need samples that contain the thing you would like to detect as well as samples that do not contain it.
After gathering samples, you can start processing them. This is the “fun” part. What is that “fun” about? Testing, discovering, and trying different combinations of descriptors and algorithms until they give you the results you want. You need to discover your own way to get there. You may find papers that describe different approaches and solutions, but you will need to adjust them to your specific case.
The first processing step: find, count and gather different features. There is a huge variety of features that may be used in ML, starting with very simple ones (e.g. for images):
- number of black and white pixels
- black to white pixels ratio
- number of corners
- edges length
- maximum/minimum/average width/height
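Several of the simple features above can be computed in a few lines. A minimal sketch, assuming NumPy and a toy binary image in place of a real one:

```python
import numpy as np

# Toy 8x8 binary image (1 = white, 0 = black); in practice you'd load a real image.
image = np.zeros((8, 8), dtype=np.uint8)
image[2:6, 2:6] = 1  # a 4x4 white square

n_white = int(np.sum(image == 1))   # white pixel count: 16
n_black = int(np.sum(image == 0))   # black pixel count: 48
bw_ratio = n_black / n_white        # black-to-white ratio: 3.0

# Width/height of the white region (bounding box of non-zero pixels).
rows, cols = np.nonzero(image)
height = int(rows.max() - rows.min() + 1)  # 4
width = int(cols.max() - cols.min() + 1)   # 4

# Pack the measurements into a single feature vector.
feature_vector = np.array([n_white, n_black, bw_ratio, width, height])
print(feature_vector)
```

Each image then becomes one such vector, and those vectors (not the raw pixels) are what the classifier sees.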
There are also more complex feature descriptors such as HOG (histogram of oriented gradients), LBP (local binary patterns), GLCM (gray level co-occurrence matrix) and many, many more. Which to choose depends on how complex your problem is. Of course, you may combine several features into one vector – it’s just a matter of results and computation time.
Before extracting features from samples, you may add a preprocessing step such as noise reduction (for sound) or conversion to grayscale (for images). It’s always up to you whether to prepare each sample somehow or not.
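As an example of such a preprocessing step, grayscale conversion is just a weighted sum over the colour channels. A minimal sketch with NumPy, using the common ITU-R BT.601 luma weights:

```python
import numpy as np

def to_grayscale(rgb):
    """Convert an H x W x 3 RGB image (values in [0, 1]) to grayscale
    using ITU-R BT.601 luma weights."""
    return rgb @ np.array([0.299, 0.587, 0.114])

# Toy 2x2 RGB image.
rgb = np.zeros((2, 2, 3))
rgb[0, 0] = [1.0, 1.0, 1.0]  # white pixel  -> gray value 1.0
rgb[1, 1] = [0.0, 1.0, 0.0]  # pure green   -> gray value 0.587

gray = to_grayscale(rgb)
print(gray.shape)  # (2, 2) -- one value per pixel instead of three
```

Working on one channel instead of three shrinks the data and removes colour variation that your features may not care about.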
A classification algorithm is needed in combination with the descriptors. Some of them are simple (such as K Nearest Neighbours), others very complex (like Artificial Neural Networks or Deep Convolutional Neural Networks). Check, try and see which one fits your case best. Here are some examples you can start with:
- Support Vector Machine (read more)
- K Nearest Neighbours (read more)
- Artificial Neural Network (read more)
- Independent Component Analysis (read more)
- Ensemble Methods (read more)
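To show how simple the simplest of these is, here is a toy K Nearest Neighbours classifier written from scratch in NumPy (an illustration of the idea, not a production implementation – in practice you would use a library):

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3):
    """Classify each query vector by majority vote among its k nearest
    training vectors (Euclidean distance)."""
    predictions = []
    for q in X_query:
        dists = np.linalg.norm(X_train - q, axis=1)   # distance to every training point
        nearest = np.argsort(dists)[:k]               # indices of the k closest
        votes = y_train[nearest]
        predictions.append(np.bincount(votes).argmax())  # majority label
    return np.array(predictions)

# Two well-separated clusters as a toy training set.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])

queries = np.array([[0.3, 0.3], [5.0, 4.8]])
print(knn_predict(X_train, y_train, queries))  # [0 1]
```

The number of neighbours k is a hyperparameter – exactly the kind of knob the validation set from the definitions above is meant to tune.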
It’s also worth mentioning that a properly developed model should not achieve 100% correct detections on the test data. There should always be some incorrect detections. If it’s 100% correct, I would investigate whether the development approach was a proper one, because it’s a sign of overfitting.
The term overfitting describes the situation when a model fits perfectly to the dataset that was used to train it. Unfortunately, the same model will perform much worse on data it has never classified before. Learn more
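A toy illustration of the effect, assuming NumPy: a 1-nearest-neighbour model simply memorizes its training set, so it scores 100% on the data it trained on, but drops on fresh data drawn the same way (both sets contain some deliberately mislabelled points, simulating label noise):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def make_set(n_per_class, noisy=4):
    """Two Gaussian clusters; the 'true' class is the cluster a point came from.
    A few class-0 points are mislabelled to simulate label noise."""
    X = np.vstack([rng.normal(0.0, 0.5, size=(n_per_class, 2)),
                   rng.normal(3.0, 0.5, size=(n_per_class, 2))])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    y[:noisy] = 1  # mislabel a few points
    return X, y

X_train, y_train = make_set(20)
X_test, y_test = make_set(20)

def nn_predict(X_query):
    """1-nearest-neighbour: copy the label of the closest training point."""
    return np.array([y_train[np.argmin(np.linalg.norm(X_train - q, axis=1))]
                     for q in X_query])

train_acc = np.mean(nn_predict(X_train) == y_train)  # 1.0 -- every point is its own neighbour
test_acc = np.mean(nn_predict(X_test) == y_test)     # lower -- the memorized noise doesn't generalize
print(train_acc, test_acc)
```

The 100% training score is not a sign of a good model here – it only shows the model has memorized the training set, noise included.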
Ready, steady, go!
It’s not that hard to start. There are several frameworks that support developing ML systems, which you can download and work with.
TensorFlow (https://www.tensorflow.org/) – one of the best frameworks for machine learning. It’s open source and supports C++, Python and R. It comes with TensorBoard, a visualization tool for ML development.
Caffe (http://caffe.berkeleyvision.org/) – a popular framework for vision systems that use Convolutional Neural Networks. It provides interfaces for C, C++, Python and Matlab. Its huge advantage is that users may access pre-trained networks (the model zoo), which makes development easier.
Deeplearning4j (https://deeplearning4j.org/) – a JVM-based ML platform. It supports multiple classification algorithms based on Neural Networks. Similarly to Caffe, Deeplearning4j provides a model zoo, which gives you an easier start.
Machine Learning is a powerful toolset that can help make decisions or automate many processes. It’s important to remember that the quality of an ML system’s performance depends on the quality of the training data and on choosing proper algorithms.