Machine Learning Benchmarks and AI Self-Driving Cars


By Lance Eliot, the AI Trends Insider

My computer is faster than your computer.

How do I know this? Presumably, we would compare the specifications of your computer with the specifications of mine, and by doing so try to ascertain which one is faster. This is not as easy as it seems, and the result still might not convince you that my computer is faster than your computer.

Time for a bake-off!

We could have your computer and my computer go head-to-head on some task and see which one finishes sooner. The question then arises as to what task we would use. Maybe I secretly know that my computer is very fast at making in-depth calculations, so I propose that we have the computers compete to see which one can produce the most digits of pi within a time limit of, say, five minutes. Is this fair? You might know that your computer is optimized for doing text-based searches, and so you propose that instead of calculating pi, the two computers compete by doing a text search in an encyclopedia to see which one is fastest at finding a set of words that we come up with jointly.

If we cannot agree on what task to use, we’ll be hopelessly deadlocked as to how to determine which computer is the fastest. You are likely to keep pushing for tasks that suit what your computer is fastest at, and I’ll keep pushing for tasks that suit what my computer is fastest at. We could try to find a middle ground, but this might be hard to do, and either one of us might think the other is somehow “cheating” with the task choice.

It would be handy if there were some kind of standard benchmark that we could use: something chosen without regard for any particular computer, which we could use to gauge how fast our respective computers are. Plus, if it were a published standard, we could compare our results to the results of others who had also performed the task with their computers. There might already be performance scores showing how fast various computers are at the task, and so after we run the task on our respective computers, we could compare our run times with each other and with what the leaderboard says too.

For those of you that have been in the computer field for any length of time, you likely already know that this interest in computer benchmarks has been around for ages. Groups such as the Business Applications Performance Corporation (BAPCo), Embedded Microprocessor Benchmark Consortium (EEMBC), Standard Performance Evaluation Corporation (SPEC), and the Transaction Processing Performance Council (TPC) have promulgated all kinds of benchmarks. I remember fondly the Dhrystone, a benchmark developed to measure integer performance and named somewhat jokingly after another benchmark called the Whetstone.

Besides allowing comparison across different computers, these benchmarks can also be handy when dealing with the sometimes wild claims that are made by this vendor or that vendor. You’ve probably seen an announcement from time to time by some vendor that says their new processor can go faster than anyone else’s processor. It’s a claim that is easy to make. The proof comes from having them run some standardized benchmark and then tell us how their processor performed. This helps others, whether academics pushing the boundaries of processors or companies wanting the fastest new processors, since rather than simply believing the vendor’s bold claims, they would have benchmark results to support (or refute) those claims.

Without solid benchmarks, there can also be confusion, which at times stymies progress in the computer field. If someone is thinking of buying a new computer and they hear that a new processor is coming out next month, they might be tempted to delay making their purchase. If the new processor offers 10x the performance, it probably makes sense to wait. If the new processor is about the same in performance, perhaps it makes sense to go ahead with the existing purchase. With benchmarks, you could find out how the new processor fared and have a clearer indication of whether to proceed or not.

Similarly, if a company or a researcher is developing a new processor, they would want to know what the marketplace already has. When working on pushing the boundaries of the hardware, maybe their research is going to just skyrocket performance and so they know that they’ll have a home run hit on their hands. Or, maybe they realize by benchmarking that their new processor is just barely better than before, and so it might mean it’s time to get back to the drawing board to rethink what they are working on.

Machine Learning Gets a Benchmark

My artificial neural network is faster than your artificial neural network.

How do I know this? Right now, it would be hard to prove it. We each might be making dramatically different assumptions about our neural networks. I could be running my neural network on a specialized hardware system that is optimized for neural networks and has 100 processors, meanwhile you might be running your neural network on your PC that has maybe four processors.

Besides the hardware that runs the model, there’s also the data used to train the model and then to exercise the neural network to see how it performs. Suppose that I have used a dataset with 10 million training examples, and you’ve used a dataset that has only 10,000 training examples. Without some kind of agreement about the models we’re both using, and some kind of agreement about the data that we’re using, it’s going to be dicey to make any kind of sensible comparison.

In the AI field of Machine Learning (ML), to date there hasn’t been an accepted standard for how to benchmark an ML system. This lack of a commonly accepted benchmark has had all the same downsides that I mentioned earlier for computers without a performance benchmark. Researchers can’t readily compare their neural networks. Advances in neural networks cannot be readily compared. Companies that want to adopt neural networks are bombarded with claims from this vendor or that vendor about how fast their neural network is. And so on.

I am pleased to say that there is now a stake in the ground for an ML benchmark. Called MLPerf, it is touted as “a common benchmark suite for training and inference for systems from workstations through large-scale servers. In addition to ML metrics like quality and accuracy, MLPerf evaluate metrics such as execution time, power, and cost to run the suite.” (See Fig. 1.)


Finally! We needed this. Everyone needs this. If you are an AI developer, you should be thankful that this now exists. If you are someone who relies upon an AI system using Machine Learning, this will be of benefit to you too. With a behind-the-scenes effort by high-tech ML-pursuing firms and universities, including Google, Intel, AMD, Baidu, Stanford, Harvard, and other entities, the MLPerf was officially released in May 2018. This is very exciting because from now on, whenever some new hardware is announced for Machine Learning, you can ask the vendor how their fancy machine scored on the MLPerf. We seem to see announcements daily from Nvidia and Intel about their upcoming ML hardware, and henceforth we can ask for, and expect to see, MLPerf scores to go with them.

For those of you that want to dig into the ML performance benchmark, it’s all there on GitHub. Easy to find, easy to apply. No one can particularly complain that they couldn’t find the benchmark, or that they couldn’t get it to work. There have been occasions with other benchmarks where people voiced such complaints and avoided using the benchmark because of it, but in this case it is hard to imagine anyone seriously trying that line.

When you consider the range of different ML kinds of needs and capabilities, formulating a benchmark is somewhat complicated since you want to cover as much ground as possible. So far, the MLPerf consists of 7 Machine Learning models and 6 ML-related datasets. They include ML for image classification, and ML for object detection, and ML for speech recognition, and so on. The MLPerf is certainly going to expand over time. Indeed, you are encouraged to consider providing additional benchmarks (more on this in a moment).

Here are the ML models and datasets, categorized by Area and by Problem Type. Each also has a Quality Target that anyone using the benchmark should be seeking:

Area: Vision
Problem Type: Image Classification
ML Model: Resnet-50
Dataset: ImageNet
Quality Target: 74.90% classification

Area: Vision
Problem Type: Object Detection
ML Model: Mask R-CNN
Dataset: COCO
Quality Target: 0.377 Box min AP and 0.339 Mask min AP

Area: Language
Problem Type: Translation
ML Model: Transformer
Dataset: WMT English-German
Quality Target: 25.00 BLEU

Area: Language
Problem Type: Speech Recognition
ML Model: DeepSpeech2
Dataset: Librispeech
Quality Target: 23.00 WER

Area: Commerce
Problem Type: Recommendation
ML Model: Neural Collaborative Filtering
Dataset: MovieLens-20M
Quality Target: 0.9652 HR@10

Area: Commerce
Problem Type: Sentiment Analysis
ML Model: Seq-CNN
Dataset: IMDB
Quality Target: 90.60% accuracy

Area: General
Problem Type: Reinforcement Learning
ML Model: MiniGo
Dataset: n/a
Quality Target: 40.00% pro move prediction
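To make the idea of a Quality Target concrete, here is a minimal Python sketch of checking whether a run reached its target. The target values are copied from the list above; the function names, the dictionary layout, and the achieved scores at the bottom are made up for illustration and are not part of MLPerf. The sketch also assumes that WER, being an error rate, is a lower-is-better metric, while every other metric listed is treated as higher-is-better.

```python
# Quality targets copied from the list above, keyed by (model, metric).
QUALITY_TARGETS = {
    ("Resnet-50", "% classification"): 74.90,
    ("Mask R-CNN", "Box min AP"): 0.377,
    ("Mask R-CNN", "Mask min AP"): 0.339,
    ("Transformer", "BLEU"): 25.00,
    ("DeepSpeech2", "WER"): 23.00,
    ("Neural Collaborative Filtering", "HR@10"): 0.9652,
    ("Seq-CNN", "% accuracy"): 90.60,
    ("MiniGo", "% pro move prediction"): 40.00,
}

# Assumption: WER is an error rate, so a lower value is better.
# All other metrics here are treated as higher-is-better.
LOWER_IS_BETTER = {"WER"}


def meets_target(model, metric, achieved):
    """Return True if one achieved metric value reaches its target."""
    target = QUALITY_TARGETS[(model, metric)]
    if metric in LOWER_IS_BETTER:
        return achieved <= target
    return achieved >= target


def run_qualifies(model, achieved_by_metric):
    """Return True only if every target listed for the model is met."""
    required = [metric for (name, metric) in QUALITY_TARGETS if name == model]
    if not required:
        raise KeyError(f"no quality target listed for {model!r}")
    return all(
        metric in achieved_by_metric
        and meets_target(model, metric, achieved_by_metric[metric])
        for metric in required
    )


# Hypothetical achieved scores, invented for this example.
print(run_qualifies("Mask R-CNN", {"Box min AP": 0.380, "Mask min AP": 0.330}))
print(run_qualifies("DeepSpeech2", {"WER": 22.5}))
```

In this sketch the Mask R-CNN run does not qualify, because it clears the box target but misses the mask target, while the DeepSpeech2 run qualifies because its error rate is below the listed WER.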

For those of you that are relatively new to Machine Learning, here’s a handy tip – you ought to know each of the above ML models, and you ought to know the datasets that are being used with those models. When I interview to hire an ML specialist, these are the kinds of ML models and datasets that I expect them to walk in the door knowing (plus, other ML models and datasets as befits their particular specialty or expertise).

For those of you not familiar with ML, I’d like to point out that those datasets don’t have to be paired with those ML models; it’s just that those datasets are often used with those ML models. I say this because I am urging you to learn about each of the models, and also, separately, learn about each of the datasets. Keep in mind that they are distinct from each other.

If I were teaching a class on ML this coming year, I’d have the students become familiar with each of these, and I’d want them to be able to say what each does, its limitations, etc.:

  • ML Model: Resnet-50
  • ML Model: Mask R-CNN
  • ML Model: Transformer
  • ML Model: DeepSpeech2
  • ML Model: Neural Collaborative Filtering
  • ML Model: Seq-CNN
  • ML Model: MiniGo

I’d also want them to be familiar with these commonly used datasets, in terms of what the data contains, how it is structured, how it is used for ML models, and so on:

  • Dataset: ImageNet
  • Dataset: COCO
  • Dataset: WMT English-German
  • Dataset: Librispeech
  • Dataset: MovieLens-20M
  • Dataset: IMDB

GitHub contains the code for each of the ML models chosen for MLPerf, along with scripts to download the datasets, info about the quality metrics, and other characteristics to make things as transparent as possible. The ML models and datasets have so far been tested on a configuration consisting of 16 CPUs, one Nvidia P100, Ubuntu 16.04, 600GB of disk, and other related elements.

I recommend that you first read the MLPerf User Guide and then jump into the rest of the artifacts.

In the MLPerf User Guide, they point out that “a benchmark result is the median of five run results normalized to the reference result for that benchmark. Normalization is of the form (reference result / benchmark result) such that better benchmark result produces a higher number.” And that “a reference result is the median of five run results provided by the MLPerf organization for each reference implementation.”

This is important because it says you need to do at least five runs before you can state your benchmark result. Why do multiple runs? With a single run you might get lucky, or there might be a fluke, so it is helpful to have multiple runs. Once you have multiple runs, the question arises as to whether you keep all the runs and average them, toss out the high and the low, or take some other approach to summarizing them. In this case, the MLPerf standard says to use the median.
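The rule quoted from the User Guide can be sketched in a few lines of Python. This is only an illustration of the arithmetic, assuming exactly five runs on each side: the function name and the run times are invented, and actual reference results come from the MLPerf organization, not from numbers like these.

```python
from statistics import median


def benchmark_result(run_times, reference_times):
    """Median of five runs, normalized as (reference result / benchmark result)."""
    if len(run_times) != 5 or len(reference_times) != 5:
        raise ValueError("expected exactly five run results on each side")
    # Faster runs (smaller times) yield a larger normalized number.
    return median(reference_times) / median(run_times)


# Hypothetical times-to-train in minutes, made up for illustration.
reference_runs = [100.0, 104.0, 98.0, 101.0, 250.0]  # includes one fluke run
my_runs = [52.0, 49.0, 50.0, 51.0, 48.0]

score = benchmark_result(my_runs, reference_runs)
print(f"normalized result: {score:.2f}")
```

Note how the median handles the fluke: the 250-minute reference run does not move the reference median, which stays at 101 minutes, whereas a simple average would have been dragged well above that.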

Another important part of neural networks and ML models is the use of things like random numbers, a source of what is more formally called non-determinism. My run of a neural network and your run of a neural network could each be materially impacted by initial seed values or by other aspects that rely upon random numbers. The MLPerf says that “the only forms of acceptable non-determinism are: Floating point operation order, Random initialization of the weights and/or biases, Random traversal of the inputs, and Reinforcement learning exploration decisions.” They also indicate that “all random numbers must be drawn from the framework’s stock random number generator.” This attempts to level the playing field.
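To see why seeding matters, here is a small Python sketch of two of the permitted sources of non-determinism, random weight initialization and random traversal of the inputs. It uses Python’s standard random module purely as a stand-in for a framework’s stock generator; the function name and the sizes are hypothetical and have nothing to do with the MLPerf code itself.

```python
import random


def init_and_shuffle(seed, n_weights=4, n_inputs=6):
    """Draw initial weights and an input traversal order from one seeded generator."""
    rng = random.Random(seed)  # stand-in for a framework's stock generator
    weights = [rng.uniform(-0.1, 0.1) for _ in range(n_weights)]
    order = list(range(n_inputs))
    rng.shuffle(order)
    return weights, order


run_a = init_and_shuffle(seed=1)
run_b = init_and_shuffle(seed=1)
run_c = init_and_shuffle(seed=2)

print(run_a == run_b)  # same seed: same weights and same traversal order
print(run_a == run_c)  # different seed: a different starting point
```

Two runs that share a seed start from the same place, while a different seed changes both the initial weights and the order in which the inputs are visited, which is one reason a single run is not a reliable basis for comparison.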

For training purposes, they do allow that “hyperparameters (e.g., batch size, learning rate) may be selected to best utilize the framework and system being tested.” This gives the benchmark runners some reasonable flexibility. It could upset the playing field, but hopefully not (as I urge next).

I mention the above salient aspects because I want to bring up something now that we hopefully will all curtail, namely, there might be some that try to “trick the benchmark.”

If you are determined to get a really high score on your new hardware or software for Machine Learning, you might be tempted to try and find loopholes in the MLPerf. You might figure that perhaps there’s a means to get a higher score, even if your system is not really as good as the normal approach would depict. Therefore, you go through the MLPerf standard with a fine-tooth comb, scrutinizing every word, every rule, every nuance, and try to find some minuscule rule or exception that you can exploit to your advantage.

That’s not the spirit of things.

The MLPerf tries to prevent this kind of trickery from happening by saying this: “Benchmarking should be conducted to measure the framework and system performance as fairly as possible. Ethics and reputation matter.”

It’s difficult to write a standard that will cover all avenues of sneakiness. I am sure there are inadvertent loopholes in the MLPerf. The guide would likely have to be hundreds of pages long to lay out every conceivable rule. If you find loopholes, it is hoped you’ll let others know, so that in the end we can plug them up or at least know they are there.

Machine Learning for AI Self-Driving Cars

What does all of this have to do with AI self-driving cars?

At the Cybernetic Self-Driving Car Institute, we are developing AI systems for self-driving cars that use Machine Learning, and we also aid firms in assessing and improving their ML systems. Having a Machine Learning benchmark is going to be handy.

The existing MLPerf does not directly address AI self-driving cars per se, though it does touch upon it. Yes, there is image classification, which is an important part of the sensor data analysis on an AI self-driving car. Object detection is another important aspect of an AI self-driving car such as finding a pedestrian in an image of a street scene. And so on.

What we need to do is have all of us in the AI self-driving car field provide added ML Models or added datasets into the MLPerf standard that are directly related to AI self-driving cars. The MLPerf welcomes submissions for expansion. They recognize that this initial release is nascent. The more, the merrier, assuming that whatever is added has value and provides a bona fide contribution towards the goal of having a solid base of ML models and datasets for performance benchmarking.

In terms of our framework for AI self-driving cars, it would be helpful to have proposed submissions into MLPerf for these major aspects:

  • Sensor Data Analysis for Self-Driving Cars
  • Sensor Fusion Analysis for Self-Driving Cars
  • Virtual World Model Analysis for Self-Driving Cars
  • AI Action Planning for Self-Driving Cars
  • Controls Activation Commands for Self-Driving Cars
  • Strategic AI for Self-Driving Cars
  • Tactical AI for Self-Driving Cars
  • Self-Aware AI for Self-Driving Cars

See my article on our framework for further info.

In a movie starring Steve Martin, there’s a famous set of lines spoken by him as the character Navin and another character named Harry, in which Navin exclaims: “The new phone book is here! The new phone book is here!” And Harry says, “Well I wish I could get so excited about nothing.” Navin then replies: “Nothing? Are you kidding?!”

I mention this because the release of the MLPerf is something to get quite excited about. We can rejoice that an ML benchmark of sufficient quality and attention has been brought forth to the world. You can argue that maybe it’s not complete, or maybe it needs tuning, or has other rough edges. Sure, that’s all assumed. Let’s at least move forward with it. Furthermore, I call upon my fellow AI self-driving car industry colleagues to find a means to add to the MLPerf with Machine Learning models and datasets that are specific to self-driving cars. I am betting it would be a welcome contribution.

Copyright 2018 Dr. Lance Eliot

This content is originally posted on AI Trends.

Dr. Lance Eliot can be reached directly on self-driving car topics at