Reinforcement Learning – a Moonshot or Today’s Most Underhyped Technology?


Reinforcement learning is gaining attention as the “next step in AI”, but there are very few business use cases of this technology. So is reinforcement learning a moonshot or an underhyped game-changer?

Artificial intelligence is one of the most dynamic fields of research and development. It allows companies to solve new classes of problems and effectively tackle challenges where creativity and flexibility are required.

Unlike “machine learning” and “deep learning”, “reinforcement learning” is not a commonly used buzzword and requires some explanation.

What is reinforcement learning?

Data is fundamental to all forms of machine learning. Whatever model is being used, it needs to get and process data to be able to perform a task. In traditional machine learning and deep learning, the data usually come in the form of a closed dataset consisting of more or less structured and homogeneous information. The dataset may be a set of images, a sorted spreadsheet or a customer database.

In traditional machine learning, the machines are given the ability to progressively improve their performance on a given task. This can be done with or without supervision:

  • Supervised machine learning – when the ground truth labels for training input are at least partially available to the system. Using these data, the machine is able to learn the desired mapping and make predictions about events still to come. A good example is predicting seismic events in coal mines.
  • Unsupervised machine learning – if the learning algorithm is provided only data without any labels, and tasked with finding the hidden structure and relationships within that data. This can be used to better understand and visualize data, or detect anomalies.
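The difference between the two modes can be sketched in a few lines of Python. This is a minimal illustration, not a production technique: the readings, the labels "calm" and "tremor", and the function names are all hypothetical. The supervised part learns one centroid per label, while the unsupervised part runs a two-cluster k-means on the same numbers without ever seeing the labels.

```python
from statistics import mean

# Hypothetical 1-D sensor readings; the labels are available only in the supervised case.
readings = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
labels = ["calm", "calm", "calm", "tremor", "tremor", "tremor"]

def fit_supervised(xs, ys):
    """Learn one centroid per label from labeled examples."""
    return {label: mean(x for x, y in zip(xs, ys) if y == label) for label in set(ys)}

def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))

def fit_unsupervised(xs, iterations=10):
    """Two-cluster k-means on unlabeled data (assumes at least two distinct values)."""
    low, high = min(xs), max(xs)
    for _ in range(iterations):
        near_low = [x for x in xs if abs(x - low) <= abs(x - high)]
        near_high = [x for x in xs if abs(x - low) > abs(x - high)]
        low, high = mean(near_low), mean(near_high)
    return low, high

centroids = fit_supervised(readings, labels)
print(predict(centroids, 4.9))      # label predicted for a new reading
print(fit_unsupervised(readings))   # two cluster centers found without labels
```

The unsupervised function recovers two groups in the data but cannot name them. Attaching meaning to each cluster still requires labels or a human.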

A great example of supervised machine learning (and deep learning) is the whale recognition model developed for the US National Oceanic and Atmospheric Administration, an American scientific agency protecting endangered ocean species. The challenge was to help track the shrinking population of North Atlantic right whales by recognizing and distinguishing particular whales from aerial photographs. The model used the same mechanism as facial recognition, analyzing the photograph to look for specific patterns which it learned from the data provided.

The reward hunter

In reinforcement learning, data are generated when an agent explores the environment. But the agent’s behavior is shaped by a system of rewards and punishments. Building the artificial intelligence that controls an autonomous car is a great example – the agent is going to control the machine in a rapidly changing, unpredictable environment.

The system of rewards is a key to success – the agent gets points for safe driving and sticking to the rules. Crashing, running over pedestrians and speeding all result in penalties. These rewards serve as feedback the algorithm uses to get better at performing the task. They are somewhat analogous to ground truth labels in traditional supervised learning, yet this time they may be served in a much less friendly way, e.g., delayed, sparse, or short-term. Reinforcement learning agents are usually trained in a simulated environment. After all, crashing a few dozen cars to teach an agent how to brake would be prohibitively expensive.
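A reward scheme like this can be written down as a small table plus a rule for adding rewards up over an episode. The sketch below is an illustration only: the reward values, the event names and the discount factor `gamma` are assumptions, since the article gives no numbers. It uses the standard discounted return, in which a reward received at step t is weighted by `gamma ** t`.

```python
# Hypothetical reward values; the article does not specify any numbers.
REWARDS = {
    "safe_step": 1.0,          # one time step of rule-abiding driving
    "speeding": -5.0,
    "crash": -100.0,
    "hit_pedestrian": -1000.0,
}

def discounted_return(events, gamma=0.99):
    """Sum the rewards of an episode, weighting step t by gamma ** t."""
    return sum(REWARDS[event] * gamma ** t for t, event in enumerate(events))

careful = ["safe_step"] * 10
reckless = ["speeding"] * 3 + ["crash"]   # the large penalty arrives late

print(discounted_return(careful))
print(discounted_return(reckless))
```

In the reckless episode the decisive penalty arrives only at the last step, which is a small example of delayed feedback: the agent has to work out that the earlier speeding led to the crash.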

So when is reinforcement learning actually useful and more effective than other AI techniques?

When to use reinforcement learning

Reinforcement learning solves a particular class of problems, especially ones that require exploring an environment or a set of possible actions. It deals with problems that traditional machine learning handles poorly:

  • Open – challenges that allow multiple solutions. While traditional machine learning is effective in solving closed problems like image, speech or text recognition, reinforcement learning solves problems in the open world, where there are no correct answers.
    • Example – there is no “proper route” a car should take from point A to point B. The algorithm may look for the fastest, the safest or the most visually pleasing way to go, but this is not a 0-1 problem.
  • Possible to simulate properly – reinforcement learning explores an environment, and builds the agent’s skills by doing so. This means that using a proper simulator is crucial to eventual success.
    • Example – In the “Learning to run” project, muscles and bones were simulated thanks to the OpenSim simulation environment used in medicine. So crucial was that environment that the project would have made no sense without it.

  • Highly unpredictable – when it comes to image recognition or text processing, the level of unpredictability is limited. The neural network is trained to recognize a particular pattern and it classifies the input into one of the categories. On the other hand, there are countless situations on the road or in other open environments that are impossible to predict, so hard-coding the conditions is virtually impossible. Worse, each action of the algorithm may change the environment, making anticipation much harder.
    • Example – There is no way to predict what may happen on a road. Thus, the algorithm needs to be flexible and adaptable to react properly to unexpected conditions.
  • Overwhelmingly complex – building software to control robotic arms or autonomous cars, while possible, is a complex undertaking that will be fraught with problems. By forcing the machine to develop desired patterns of behavior by itself, the developers save time and effort.
    • Example – A robot’s behavior can be hard coded, but the process is complicated and full of potential mistakes. Forcing the machine to learn by itself makes the process less troublesome, as the machine is given a goal to achieve rather than instructions for how to do it.
  • Creative – Traditionally, machines have approached challenges with neither creativity nor flexibility. That’s why projects involving playing chess or go are so “hot” now – the machine needs to adapt and react to its opponent’s strategy. Solving problems by developing a strategy is the highest level of creativity machines are able to achieve.
    • Example – flexibility is sometimes about breaking the rules. Reshaping the system of rewards may be used to build the machine that is able to respond appropriately to even the most unexpected situations. For example, in emergency situations speeding may be acceptable, but running over a pedestrian will never be.
  • Sequential – many problems can be solved by a series of actions that need to be performed one by one in a particular order. Reinforcement learning handles such tasks with ease, while automating the sequences still requires a lot of work from programmers.
    • Example – In one particular case, a robotic arm learned to take a can of coke from a refrigerator. It needed to open the door, find the can on a shelf, grab it and close the door. In this scenario, there was a sequence of actions to be performed.
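The sequential case can be sketched with tabular Q-learning on a toy version of the refrigerator task. Everything here is an assumption made for illustration: the four states, the three action names, the reward values and the hyperparameters are hypothetical, and a real robotic arm would need a far richer state representation. The point is only that the agent is never told the order of the actions and discovers it from the rewards.

```python
import random

ACTIONS = ["open_door", "grab_can", "close_door"]
GOAL = 3  # door closed again, can in hand

def step(state, action):
    """Hypothetical toy dynamics.
    States: 0 = door closed, no can; 1 = door open, no can;
            2 = door open, can in hand; 3 = goal."""
    if state == 0 and action == "open_door":
        next_state = 1
    elif state == 1 and action == "grab_can":
        next_state = 2
    elif state == 1 and action == "close_door":
        next_state = 0
    elif state == 2 and action == "close_door":
        next_state = 3
    else:
        next_state = state  # the action has no effect in this state
    reward = 10.0 if next_state == GOAL else -1.0
    return next_state, reward

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(4) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        for _ in range(20):  # cap the episode length
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            next_state, reward = step(state, action)
            best_next = 0.0 if next_state == GOAL else max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
            if state == GOAL:
                break
    return q

def greedy_plan(q):
    """Follow the highest-valued action from the start state."""
    state, plan = 0, []
    while state != GOAL and len(plan) < 10:
        action = max(ACTIONS, key=lambda a: q[(state, a)])
        plan.append(action)
        state, _ = step(state, action)
    return plan

q_table = train()
print(greedy_plan(q_table))
```

The small penalty on every step is a design choice: it makes shorter sequences more valuable, so the agent prefers the three-step solution over wandering between states.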

Building a flexible and creative machine is an achievement in itself, and reinforcement learning is widely perceived as an important next step in the development of machine learning. Of course, being “the next step” doesn’t imply that other techniques are “of a lower generation”. They just work for different classes of problems.

The agents trained to walk like a human or play Space Invaders go a long way towards illustrating the terms above. There is no single correct way to solve these problems, the world involved is open, and the challenge is too complex to hard-code a solution.

The reinforcement learning superpowers

By combining interactions with the environment and machine learning, reinforcement learning provides interesting ways of development:

  • By combining reinforcement learning and traditional supervised machine learning, it is possible to augment machine learning with expert knowledge. The neural network is fed with expert input, and reinforcement learning provides the flexibility to use the knowledge in the new context. What’s more, the expert needn’t be a neural network – it may be a human or other, non-AI-powered machine. Teaching an agent to play Montezuma’s Revenge was one of the first examples of these two types of learning working together.
  • It is possible to use reinforcement learning to build a neural network that mimics the surrounding world. It may be used to provide a cheaper training environment for the agent – instead of running the full simulation environment, the data science team needs to feed one neural network with the outputs of a second one. The details of the experiment and how it can be used may be found on’s blog.
  • Reinforcement learning agents are able to outperform humans in many games, from chess to Atari classics. Thus, reinforcement learning is suitable for every situation where the goal is to win a game and the ability to react quickly to rapid changes is needed.
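The idea of a learned model standing in for an expensive simulator can be sketched as follows. This is a simplified illustration under stated assumptions: the one-dimensional "walker" environment and all function names are hypothetical, and a plain lookup table is swapped in where the article describes a neural network. The structure is the same, though: record transitions from the real simulator once, fit a model to them, then let the agent practise inside the model.

```python
import random

def real_simulator(position, action):
    """Hypothetical 'expensive' environment: a walker on positions 0..4."""
    return max(0, min(4, position + action))

def collect_transitions(samples=200, seed=0):
    """Query the real simulator with random states and actions."""
    rng = random.Random(seed)
    data = []
    for _ in range(samples):
        position = rng.randrange(5)
        action = rng.choice([-1, 1])
        data.append((position, action, real_simulator(position, action)))
    return data

def fit_world_model(transitions):
    """Memorize observed transitions; a lookup table stands in for the neural network."""
    model = {}
    for position, action, next_position in transitions:
        model[(position, action)] = next_position
    return model

def imagined_rollout(model, start, actions):
    """Roll out actions inside the learned model without calling the simulator."""
    position, path = start, [start]
    for action in actions:
        # For a pair never seen in the data, assume the position does not change.
        position = model.get((position, action), position)
        path.append(position)
    return path

model = fit_world_model(collect_transitions())
print(imagined_rollout(model, 0, [1, 1, -1, 1]))
```

A lookup table only works because this toy world is tiny and deterministic. In a realistic setting the model has to generalize to states it has never seen, which is where a neural network comes in.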

[Figure panels: Initial performance | After 15 mins training | After 30 mins training]



The downsides of reinforcement learning

There are classes of problems where the reinforcement learning agent will not be as effective as traditional machine learning techniques. Reinforcement learning is also expensive and challenging to use, so applying it to problems that can be solved effectively with other means makes little sense. What’s more, in a static environment the adaptability that reinforcement learning agents offer goes unused. Further, training a reinforcement learning agent requires gobs of computational power, both for the agent itself and for the environment it is trained in. Lastly, finding specialists who know how to train reinforcement learning agents is no easy task, as the discipline is only now gaining steam.


Reinforcement learning is currently the most promising means of training machines to react to a changing environment and take innovative actions while following particular rules or a policy. This makes it ideal for building the AI for autonomous machines. According to a PwC study, up to 40% of the mileage driven could be done in autonomous vehicles in 2030. Thus, the era of reinforcement learning has already begun in earnest, even if the work itself remains somewhat shrouded in mystery.

This is the second of a four-part series on machine learning and deep learning written for AI Trends by Robert Bogucki,‘s CTO. Join him during the Seminar: Enterprise Machine Learning & Deep Learning on day 1: