Reinforcement Learning: an intuitive overview

Yedukrishnan R
5 min read · Dec 23, 2020

Reinforcement learning, in formal terms, is a method of machine learning wherein a software agent learns to perform actions in an environment so as to maximize its reward. It does so by balancing exploration of new actions with exploitation of the knowledge it gains over repeated trials.

How does it compare with other ML techniques?

Reinforcement Learning (RL) is a type of machine learning technique that enables an agent to learn in an interactive environment by trial and error, using feedback from its own actions and experiences.

Though both supervised and reinforcement learning map inputs to outputs, they differ in the feedback they give the agent: in supervised learning the feedback is the correct set of actions for performing a task, whereas reinforcement learning uses rewards and punishments as signals for positive and negative behavior.

Compared to unsupervised learning, reinforcement learning differs in its goals. While the goal in unsupervised learning is to find similarities and differences between data points, in reinforcement learning the goal is to find a suitable action model that maximizes the total cumulative reward of the agent. A generic RL model is built from the following components, which together form an action-reward feedback loop (a minimal interaction loop is sketched after the list).

  1. Agent — the learner and the decision maker.
  2. Environment — where the agent learns and decides what actions to perform.
  3. Action — the set of moves the agent can choose from.
  4. State — the current situation of the agent in the environment.
  5. Reward — the feedback, usually a scalar value, that the environment provides for each action the agent selects.
  6. Policy — the decision-making function (control strategy) of the agent, which maps situations to actions.
  7. Value function — a mapping from states to real numbers, where the value of a state is the long-term reward achieved by starting from that state and executing a particular policy.
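To make these terms concrete, here is a minimal sketch of the agent-environment loop in Python. The `RandomWalkEnv` environment and the random policy are invented for illustration; any environment exposing `reset` and `step` would fit the same loop.

```python
import random

class RandomWalkEnv:
    """Toy environment: walk along positions 0..10; reaching 10 pays a reward."""
    def reset(self):
        self.state = 5          # start in the middle
        return self.state

    def step(self, action):
        self.state += 1 if action == "right" else -1
        done = self.state in (0, 10)                 # episode ends at either edge
        reward = 1.0 if self.state == 10 else 0.0    # reward signal from the environment
        return self.state, reward, done

def random_policy(state):
    """A (deliberately naive) policy: map any state to a random action."""
    return random.choice(["left", "right"])

env = RandomWalkEnv()
state, done, total_reward = env.reset(), False, 0.0
while not done:
    action = random_policy(state)              # the policy picks an action
    state, reward, done = env.step(action)     # the environment returns next state and reward
    total_reward += reward
print("episode return:", total_reward)
```

Learning, in this picture, means replacing `random_policy` with something that improves from the rewards it observes.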

AlphaGo: a perfect example of Reinforcement Learning

Go has been around for more than 2,500 years, with humans leading the pack in terms of skill. That was the case until AlphaGo came along and laid down a new battlefield for humans versus computers.

October 2015 marked the first time an AI beat a Go professional. The three-time European champion, Mr Fan Hui, was defeated 5–0 by the first version, named AlphaGo Fan. In March 2016, the legendary Lee Sedol, 18-time international title winner and 9-dan Go player, was defeated by AlphaGo 4–1. They called it AlphaGo Lee, as it had defeated Lee Sedol; creative, right 😉.

In January 2017, DeepMind revealed a version called AlphaGo Master. Playing online, it won 60 straight games against top international players. It is clear that Google's AI company DeepMind, creator of the many versions of AlphaGo, loves the game of Go. But the team couldn't stop there. They had to make a newer and better version called AlphaGo Zero. This new version beat AlphaGo Lee 100–0 and is, without debate, the best Go player in the entire world 😲.

What are some of the most used Reinforcement Learning algorithms?

Q-learning and SARSA (State-Action-Reward-State-Action) are two commonly used model-free RL algorithms. They differ in their exploration strategies, while their exploitation strategies are similar. Q-learning is an off-policy method: the agent learns the value based on the greedy action a* derived from another policy than the one it follows. SARSA is an on-policy method: it learns the value based on its current action a, derived from its current policy. Both methods are simple to implement but lack generality, as they have no way to estimate values for unseen states. The sketch below shows how close their update rules actually are.
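Here is a minimal tabular sketch of the two update rules using NumPy. The state and action counts, the learning rate `alpha`, and the discount `gamma` are placeholder values for this example; the only real difference between the two functions is the bootstrap target.

```python
import numpy as np

n_states, n_actions = 16, 4
alpha, gamma = 0.1, 0.99        # learning rate and discount factor (placeholder values)
Q = np.zeros((n_states, n_actions))

def q_learning_update(Q, s, a, r, s_next):
    # Off-policy: bootstrap from the greedy (max-valued) action in the next state,
    # regardless of which action the behavior policy actually takes next.
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next):
    # On-policy: bootstrap from the action the current policy actually took next.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
```

Because the table has one entry per (state, action) pair, neither method can generalize to a state it has never visited, which is the limitation mentioned above.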

This can be overcome by more advanced algorithms such as Deep Q-Networks (DQNs), which use neural networks to estimate Q-values instead of a table. However, DQNs can only handle discrete, low-dimensional action spaces.
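As a sketch of the idea (assuming PyTorch; the layer sizes here are arbitrary), a DQN replaces the Q-table with a network that maps a state vector to one Q-value per discrete action:

```python
import torch
import torch.nn as nn

state_dim, n_actions = 8, 4     # placeholder dimensions

# The network plays the role of the Q-table: input a state, output Q(s, a) for every action.
q_net = nn.Sequential(
    nn.Linear(state_dim, 128),
    nn.ReLU(),
    nn.Linear(128, n_actions),
)

def act(state, epsilon=0.1):
    # Epsilon-greedy: explore with probability epsilon, otherwise exploit the network.
    if torch.rand(1).item() < epsilon:
        return torch.randint(n_actions, (1,)).item()
    with torch.no_grad():
        return q_net(state).argmax().item()

state = torch.randn(state_dim)  # a made-up state vector
print("chosen action:", act(state))
```

Note that the output layer has one unit per action, which is exactly why this design breaks down when the action space is continuous or very large.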

Deep Deterministic Policy Gradient (DDPG) is a model-free, off-policy, actor-critic algorithm that tackles this problem by learning policies in high-dimensional, continuous action spaces. In the actor-critic architecture, the actor maps states to (continuous) actions, while the critic estimates the value of the state-action pair the actor chose. A minimal sketch of the pair is given below.
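Here is a minimal sketch of the two networks (again assuming PyTorch; the sizes and the tanh action bound are placeholders). The actor outputs a deterministic continuous action; the critic scores state-action pairs:

```python
import torch
import torch.nn as nn

state_dim, action_dim = 8, 2    # placeholder dimensions

# Actor: deterministic policy mapping a state to a continuous action in [-1, 1].
actor = nn.Sequential(
    nn.Linear(state_dim, 128),
    nn.ReLU(),
    nn.Linear(128, action_dim),
    nn.Tanh(),
)

# Critic: Q-function taking the state and the actor's action, returning a scalar value.
critic = nn.Sequential(
    nn.Linear(state_dim + action_dim, 128),
    nn.ReLU(),
    nn.Linear(128, 1),
)

state = torch.randn(1, state_dim)
action = actor(state)
q_value = critic(torch.cat([state, action], dim=1))   # critic evaluates the actor's choice
```

Because the actor emits the action directly rather than a score per action, there is no max over actions to compute, which is what makes continuous control tractable.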

Conclusion

Reinforcement learning is becoming more popular today due to its broad applicability to problems arising in real-world scenarios. It has found significant applications in fields such as:

  • Game Theory and Multi-Agent Interaction — reinforcement learning has been used extensively to enable game playing by software. A recent example is Google's DeepMind, which defeated the world's highest-ranked Go player and later, with AlphaZero, the top chess engine Stockfish.
  • Robotics — robots have often relied on reinforcement learning to perform better in the environment they are presented with. Reinforcement learning has the benefit of being a deploy-and-forget solution for robots that may face unknown or continually changing environments. One well-known example is the Learning Robots project by Google X.
  • Industrial Logistics — industrial tasks are often automated with the help of reinforcement learning, and the software agent facilitating them gets better at its task as time passes. Bonsai is a startup working to bring such AI to industry.

