Machine Learning Without a PhD?


I am fascinated by Machine Learning (ML). Technology and computer science have evolved to the point where ML models can be used to improve human life. Cancer cells can be identified by ML models more accurately than by human scientists. Health care companies can use ML to predict disease in patients based on the history of similar patients, so that they can act early.

My day job is to understand advances in technology and to help my customers apply the correct technologies to support digital business models. I find that I am most helpful when I have first-hand, hands-on knowledge of these advances. I am a "keyboard guy". What I am not is a trained data scientist. This post is written from a non-data scientist's perspective.


My path toward understanding Machine Learning led me to three hands-on methods:
  1. Traditional: Python Programming + Understanding of Neural Networks
  2. Uber Ludwig: No Programming + Understanding of Neural Networks + Config Files
  3. Google AutoML: No Programming + No Understanding of Neural Networks

I used the same use case in each instance - build a model that could recognize photos of chickens and then see how confident the model was at recognizing a photo of me with one of our six chickens, Zaja.


Traditional (Python + Understanding of Neural Networks)


One of my recent blog posts covered this method in detail. Basically, I used someone else's machine learning Python code with my photos of chickens. 

Steps in the Traditional method:
  • Arrange training images in folders where the folder name is the category name
  • Get the retrain.py script
    • !wget https://raw.githubusercontent.com/tensorflow/tensorflow/c565660e008cf666c582668cb0d0937ca86e71fb/tensorflow/examples/image_retraining/retrain.py
  • Execute retrain.py, which downloads the pre-built Inception image recognition model and then extends the model using the images and categories provided.
  • Execute the label_image.py script to test the new model against an image not in the training set.
    • !python /content/label_image.py /content/gdrive/My\ Drive/Personal/TensorFlow/d_chicken_2.jpg
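
The first step above depends on the folder layout being right, since each folder name becomes a category name. As a minimal sketch (not part of the original scripts), the following helper counts the images in each category folder so the layout can be checked before training. The function name and the list of image extensions are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

# Assumed set of image extensions; adjust to match your own photos.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_images_per_category(image_dir):
    """Return {category_name: image_count} for each subfolder of image_dir."""
    counts = {}
    for folder in sorted(Path(image_dir).iterdir()):
        if folder.is_dir():
            counts[folder.name] = sum(
                1
                for f in folder.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts


# Build a small throwaway layout to show the expected structure.
with tempfile.TemporaryDirectory() as root:
    layout = {"chicken": ["a.jpg", "b.jpg"], "not_chicken": ["c.jpg"]}
    for category, names in layout.items():
        (Path(root) / category).mkdir()
        for name in names:
            (Path(root) / category / name).write_bytes(b"")
    example_counts = count_images_per_category(root)
    print(example_counts)
```

A category with zero or very few images is worth fixing before running the retraining step.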

Summary:

  • Was it easy? 
    • Following the blog instructions was easy. Getting Python running was not. This was my first time installing Python, and the installation was not without issues and errors. This xkcd strip illustrates that. I finally found Google Colab, which is a pre-built, working Python notebook environment with multiple runtimes.
  • Did I understand what I was doing?
    • Not really. I Googled each step to understand what was happening. At the end of the day, the code downloads a pre-existing image recognition model and adds new images and classifications to the model. This saves time, but you are not really building a new model based on just your images.
  • Did I need to have a deep understanding of Data Science?
    • To build this Python code from scratch, yes. If following the instructions in the Medium post did not work, I would not know how to fix the errors.
  • Did the model work?
    • Yes, the photo of Zaja and me was recognized as a chicken with 92% confidence.
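
The idea of extending a pre-existing model can be sketched in plain Python. This is a toy illustration, not what retrain.py actually does: a fixed function stands in for the pre-trained network, two-number points stand in for images, and only a small new final layer is trained on top. Every name and number below is made up for the example.

```python
import math


def frozen_features(point):
    """Stand-in for the pre-trained network: fixed and never updated."""
    x0, x1 = point
    return [math.tanh(x0), math.tanh(x1)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_new_layer(points, labels, epochs=200, learning_rate=0.5):
    """Fit only the new final layer (weights and bias) by gradient descent."""
    features = [frozen_features(p) for p in points]
    weights, bias = [0.0, 0.0], 0.0
    n = len(points)
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for f, y in zip(features, labels):
            error = sigmoid(weights[0] * f[0] + weights[1] * f[1] + bias) - y
            grad_w[0] += error * f[0]
            grad_w[1] += error * f[1]
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def confidence(point, weights, bias):
    """Confidence (0 to 1) that the point belongs to the positive class."""
    f = frozen_features(point)
    return sigmoid(weights[0] * f[0] + weights[1] * f[1] + bias)


# Made-up training data: label 1 plays the role of "chicken".
points = [(2, 1), (1, 2), (3, 3), (1.5, 0.5),
          (-2, -1), (-1, -2), (-3, -3), (-0.5, -1.5)]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

weights, bias = train_new_layer(points, labels)
print(round(confidence((2.5, 2.0), weights, bias), 2))
```

The point of the sketch is that frozen_features never changes during training. Only the few numbers in the final layer are learned, which is why this approach is fast and also why the result is not a new model built from just your images.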

Uber Ludwig (No Programming + Understanding of Neural Networks + Config Files)


Uber recently open-sourced its internal "no programming" machine learning tool, Ludwig. Uber describes Ludwig as "a toolbox that allows [one] to train and test deep learning models without the need to write code." (Sounds good to me.) All you need to do is provide a CSV of training data and a YAML text file describing the data and the experiment. Ludwig will train a model, make predictions with that model, and visualize the results of training and prediction. You can use this page to get started with Ludwig.

Steps in the Ludwig method:
  • Install the latest Ludwig code
    • !pip install https://github.com/uber/ludwig/archive/master.zip 
  • Create a CSV file that includes the path of your training images and categories for each image. Example "image_class.csv":
    • image_path,class
    • chickens/10538543_brown_hen_isolated_on_white_studio_shot_.jpg,chicken
    • chickens/1200px_Mother_hen_with_chicks02.jpg,chicken
    • chickens/126523696_56a885a65f9b58b7d0f30ca9.jpg,chicken
    • chickens/1304236.jpg,chicken
    • chickens/137318_004_A879596D.jpg,chicken
    • chickens/2017_11_largeimg14_Tuesday_2017_164436973.jpg,chicken
    • chickens/2Hen12_430.jpg,chicken
    • chickens/3163854817.jpg,chicken
    • chickens/378px_Hen_with_chicks__Raisen_district__MP__India.jpg,chicken
    • chickens/3a7e0342a3d2915f239900fba3321841.jpg,chicken
  • Create a YAML file that describes your training data and, optionally, describes how to build the model. Example "image_class.yaml":
    • input_features:
    •     -
    •         name: image_path
    •         type: image

    • output_features:
    •     -
    •         name: class
    •         type: category
  • Create the model
    • !ludwig experiment --data_csv image_class.csv --model_definition_file image_class.yaml
  • Create a CSV file of prediction images. Example "pred_class.csv":
    • image_path,class
    • Test_Chicken_640,
  • Run the prediction to see if your model predicts the new image accurately
    • !ludwig predict --data_csv pred_class.csv --model_path results/experiment_run_0/model
  • Display the results of the prediction (85% confident this is a chicken)
    • !cat results_0/*.csv
    • chicken
    • 0.85
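
Typing the training CSV by hand gets tedious with more than a few images. As a minimal sketch, assuming the images sit in one folder per category, the following script writes a file in the same image_path,class layout shown above. The function name, the .jpg-only filter, and the second folder ("other") with its label are assumptions made for illustration.

```python
import csv
import tempfile
from pathlib import Path


def write_image_class_csv(root, folder_to_class, csv_path):
    """Write one image_path,class row per .jpg file; return the row count."""
    rows = []
    for folder, label in folder_to_class.items():
        for image in sorted((Path(root) / folder).glob("*.jpg")):
            # Paths are written relative to root, with forward slashes.
            rows.append((image.relative_to(root).as_posix(), label))
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["image_path", "class"])
        writer.writerows(rows)
    return len(rows)


# Throwaway demo: two folders of empty placeholder files.
with tempfile.TemporaryDirectory() as root:
    for folder, names in {"chickens": ["hen1.jpg", "hen2.jpg"], "other": ["cat.jpg"]}.items():
        (Path(root) / folder).mkdir()
        for name in names:
            (Path(root) / folder / name).write_bytes(b"")
    csv_file = Path(root) / "image_class.csv"
    row_count = write_image_class_csv(
        root, {"chickens": "chicken", "other": "not_chicken"}, csv_file
    )
    demo_lines = csv_file.read_text().splitlines()
    print("\n".join(demo_lines))
```

Run it from the folder that contains the image folders so that the relative paths in the CSV match what the training command will see.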

Summary:

  • Was it easy? 
    • Yes and no. Yes, I quickly understood how to prepare the CSV files and the YAML file. No, using a very simple YAML file always led to inaccurate models for me. The developers supporting Ludwig on the GitHub page are very helpful and pointed out that, to get accurate models, one needs to customize the YAML file to the experiment being run. In other words, one really needs a background in data science and machine learning. The extensive list of YAML options can be found in the Ludwig User's Guide. As an example, the "ludwig experiment" command above displays a long list of default YAML parameters when it runs. Ideally, one would understand when to modify the default for each of these options.

  • Did I understand what I was doing?
    • Yes to the mechanics, no to the data science. There are still very few public examples of how to use Ludwig and why and when to override defaults in the YAML file. More tutorials and Google Colab notebooks would be helpful. 
  • Did I need to have a deep understanding of Data Science?
    • As it turns out, yes. For instance, data scientists have the experience to know which encoder to use (embed, parallel_cnn, stacked_cnn, stacked_parallel_cnn, rnn and cnnrnn) to improve accuracy, or which tricks to use to normalize images for greater speed.
  • Did the model work?
    • No. Even though the nice Ludwig support folks reproduced my experiment with 85% confidence, I was never able to duplicate their results.

Google AutoML (No Programming + No Understanding of Neural Networks)



Google announced AutoML at their I/O 2018 Conference. Google describes AutoML as the ability to "train high-quality custom machine learning models with minimum effort and machine learning expertise." Google removes the need for machine learning expertise by having neural networks build other, custom neural networks using Google's massive computational capacity. Google's claims held true for me: I was able to build a chicken recognition model without programming and without turning any machine learning knobs that I did not understand. The only limitation of AutoML is that, currently, only image recognition, natural language (message context), and language translation models are supported.

Steps in this method:

  • From the GCP Console, Create a Project
  • Give your project a name, a billing account and an organization and then click Create
  • Visit the Google Cloud Vision Page
  • Click Get started with AutoML. Make sure you are using the correct project in the upper right and then complete the Billing and API steps.
  • When complete, you will be directed to the Dataset page. Everything from this point on is as easy as working with albums in photos.google.com.  

  • Click New Dataset
  • Give your dataset a name, click Select Files to upload a batch of files, then click Create Dataset
  • Your images have been uploaded, but need to be labeled.
  • Click Add Label, type a label and hit Enter
  • Click Select all images and label them with the correct label
  • Once you have uploaded and labeled at least two different sets of images, you are ready to train your model. Click the Train tab
  • Google will display how your images will be split into training and test sets. Click Start Training and wait. Your first node-hour of training time is free. You will receive an email when training is complete.

  • Click the Evaluate tab to see how confident Google is with the new model (pretty confident)
  • Click the Predict tab to test your model against an image it has never seen before. (This page also displays the REST API or Python code you can use with your new model.)
  • Click Upload Images and upload some new images to test the model against. My model is confident that this photo of me with Zaja is a photo of a chicken.
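
AutoML does the splitting for you, but the idea behind it is simple: some images are held back so the model can be scored on photos it did not train on. The following is a rough sketch of such a split, not Google's actual procedure. The function name and the 80/10/10 fractions are assumptions for illustration.

```python
import random


def split_dataset(items, train_fraction=0.8, validation_fraction=0.1, seed=0):
    """Shuffle items and split them into (train, validation, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = int(len(shuffled) * train_fraction)
    n_validation = int(len(shuffled) * validation_fraction)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]
    return train, validation, test


# Hypothetical file names standing in for uploaded photos.
photos = [f"chicken_{i}.jpg" for i in range(10)]
train, validation, test = split_dataset(photos)
print(len(train), len(validation), len(test))
```

The test images play the same role as the new photo uploaded on the Predict tab: the model has never seen them, so its results on them are a fairer measure of how well it works.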

Summary:

  • Was it easy? 
    • Yes. Upload images, label the images, click train.
  • Did I understand what I was doing?
    • Yes. There is a nice instructional video that covers the process as well.
  • Did I need to have a deep understanding of Data Science?
    • Not at all.
  • Did the model work?
    • Yes, the photo of Zaja and me was recognized as a chicken with 84% confidence.

Conclusion

I hope you have found this post helpful. For the non-data scientist, Google AutoML is definitely the easiest path to follow. If you are a data scientist, you may find Uber Ludwig quicker and more flexible. If you are a data scientist who is also a Python programmer, Python plus TensorFlow may be the quickest path for you.

As always, I welcome your feedback and suggestions.

    Comments

    Dan Sheldon said…
    Excellent write-up Dennis! I may have to dig into Google AutoML for a project with my beehives!
    Dennis Faucher said…
    Great idea. I want to see that.
    Nice article. Clearly, PhD is somewhat over-rated and you proved it. I know plenty of persons who outperform PhDs, not holding PhD titles. The title, my finding, and after having thought about it for years from various angles, including my own path, is just a key to academia levels.