Modern Reinforcement Learning approach and its applications

Nowadays, Artificial Intelligence is more present in our lives than ever. Everything thanks to more computational power that resulted in the popularity and affordability of Deep Learning.

DL allowed us to leverage a lot of fields of study like Computer Vision, Natural Language Processing, Optimal Control, Timeseries Forecasting or Generating Synthetic Data. There are various methods that can help solve these tasks but among them is one extraordinary algorithms family that can be applied to pretty much every problem – (Deep) Reinforcement Learning.

What is Reinforcement Learning?

Reinforcement Learning is the field of study in Artificial Intelligence that is trying to find optimal action/sequence of actions in any environment based on trial and error. An example of such would be a robot learning to play with the cart pole.

A short history of Reinforcement Learning

Origins of Reinforcement Learning come from the 1950s when researchers tried to solve the problem of finding optimal actions that will maximize certain objectives (like trying to win a game of chess). Three main threads of interest were:

Learning through trial and error e.g., animal learning (involves learning) – exploring the environment and trying to find out what is beneficial.
Finding optimal control in a known environment e.g., dynamic programming, Model Predictive Control (didn’t involve learning) – based on pieces of information about environment’s dynamics/changes finding best actions.
Temporal-difference learning (involves learning) – in every learning step changing strategy slightly (accordingly to present results) in order to converge to true optimality.

The first thing that we can notice here is that in the first case it’s more about psychology while the second and third case applies to calculations/programming. At that time there was not a lot of connection between the three of them but with the rise of Q-Learning (1989) it changes.

Q-Learning made it through thanks to some clever conceptions:

Agent assessed the current situation, and possible actions and then chose the action with the best-predicted outcome.
After trying action in a certain situation, assessment could be reinforced or weakened based on the outcome (sometimes the outcome could come even after several other actions).

These two simple rules could bring together the theory behind these three early Reinforcement Learning threads. Elements of these are listed below:

Learning through trial and error – changing preferences of actions in certain situations based on the outcome. To find good actions, the agent must explore the environment by executing sequences of actions using certain strategies.
Finding optimal control – thanks to updates of preteritions agent learns to maximize certain objectives.
Temporal-difference learning – agent is changing preferences only a bit because the outcome of action might come from a bad later decision.

An extension of this was Deep Q-Learning created by DeepMind (2013) which thanks to Deep Learning could improve its efficiency by a lot. After that year more Reinforcement Learning methods based on neural networks emerged.

The theory behind Reinforcement Learning

There are three most important conceptions in Reinforcement Learning:

Markov Decision Process
Q-value
Bellman equation
Markov Decision Process is a graph where each node is a state of the system, and each action can move an agent between nodes. A good example is imagining a simplified graph of the day at work:

We have three states, and, in each state, we can choose one of three actions. Every action in a certain state is associated with some reward. Our goal is to get as much reward as possible. It might be visible that working for the entire day might not be the optimal solution, sometimes we also must take a break and even talk with people!

Q-value is the value associated with pair state-action that connects immediate rewards and future ones. In other words, the same actions in different states can have different values based on the present and future profits that executing them gives. When we would like to introduce Q-values in the above graph we should add one more variable that using Bellman equation would converge to some true values.
Bellman equation is something that can help us find an optimal working day schedule. We can present it as:

Q(s, a) = reward + γ*Q(s’, a’)

Where:
S – current state
S’ - next state
A - Action
A’ - next action
Q(s, a) – q-value, it is used to estimate profitability of certain action in certain situations.
Reward – a reward for taking certain action in a certain state.
γ – discount factor (future rewards are uncertain things so their impact is lowered).

The idea standing behind this equation is quite easy. The value of actions depends also on future possibilities that executing them might bring to us. We simply connect all actions all together to make our values represent the present and future. The discount factor is introduced to make the future’ less important. Of course, depending on the situation γ can vary.

In simple cases, we can just use dynamic programming to find true q-values by creating a 2D array of states’ transitions when each entry will be a q-value. Then we can use some strategy or randomly move across the next states and apply Bellman equation to update entries.

Harder tasks like when we have thousands of states and thousands of actions require the use of Neural Networks.

Methods and application in real life

Probably the most well-known chess algorithm besides Stockfish is AlphaZero which is a Deep Reinforcement Learning algorithm. By playing with itself it could achieve super-human effectiveness that beat the best chess engines of that time but what is actually the secret of this agent?

In AlphaZero there are two agents: one is choosing actions using a certain strategy and the second one is assessing their’ Q(s, a). The agent is simulating a few paths (possible game continuations) using a certain strategy and then gives them a score using Q(s, a). Updating evaluations depended on whether the algorithm could win or not. Thanks to this combination it could achieve superior results in only 9 hours of learning!

Fig. 2 The exploration on the game graph

DeepMind in their paper also provided results of comparing with different state-of-the-art algorithms in the game of chess, Shogi and Go:

More information you can find in Original AlphaZero blog post. Another interesting example is shown by researchers in the paper is that they managed to learn real life robots to walk on 4 legs.

The algorithm they used is called Soft-Actor-Critic and besides trying to find true Q(s,a) function it adds bonus for entropy during decision that helps with exploration. Higher entropy means higher indecision of actions.

Paper claims that agent trained in that way could deal with unknown situation like having obstacles in their way or slippery path.

The results are shown as photos:

Fig. 4 The result of teaching robots to walk on 4 legs

Besides these two examples RL there are various others like NLP (Natural Language Processing) where it can be used in text summarization, machine translation, or chat bots.

Next usage can be found in trading and finance for predicting time series.

Also, for recommendation algorithms we can use RL instead of simple clustering. In that case rewards are reviews and likes over some content so this task fits Reinforcement Learning framework perfectly.

A very recent study also showed that we can apply these algorithms for video compression where we can beat human-made algorithms by 4%! For interested there is great explanation on YouTube:

For people that want to know more, there are great blog posts on Neptune and Oreilly.

Summary

Slowly we can see more and more Reinforcement Learning emerging in our daily life. Every aspect of technology can be improved by using Reinforcement Learning. Trial and error framework is an enormously powerful tool that can help in every task where we can assess the final results.

***

If you are interested in Artificial Intelligence, read other articles from our experts: Specialized Semantic Search Engine, RPA and intelligent document processing, and AI-powered Search Engine.

5/5

About the author

Karol Szymczyk

Inżynier ds. oprogramowania z 2-letnim doświadczeniem w projektach komercyjnych związanych z systemami wbudowanymi i sztuczną inteligencją. Matematyk-hobbysta poszukujący ciekawych tematów na temat użycia matematyki w informatyce. Prywatnie, jazzowy amator grający na akordeonie, fortepianie i perkusji.

All articles written by the author

Offices in Poland

Sii Sweden

Sii Ukraine

Sii India

Automating financial analysis with AI

Berlingske Media enters the digital-first era – platform modernization with Sii Poland

Quality Control Center for ABB – gaining control over IT system quality

E-commerce modernization to support growth and smooth shopping experience

Adobe Summit 2026: from Generative AI to AI Agents that get work done

Sii Poland and Microsoft showcase practical AI applications for Polish universities

Medical device design: Challenges faced by engineering teams

Sii Poland named the Best Employer among large companies – 1st place in Great Place to Work once again

The AI-Powered Ecosystem: Scaling Marketing Analytics

Mobile WMS for Process Manufacturing. Powered by Business Central.

From demo to systems you can trust: how real AI works in business

Multi‑Agent AI for Embedded Software Development

Why SAP S/4HANA releases still fail – even when everything was "tested"

Building an RTK Rover: Web-controlled precision navigation

HashiCorp Vault as a central hub for key and certificate rotation

Blazing fast websites and how Adobe Experience Manager can help achieve that

Modern Reinforcement Learning approach and its applications

What is Reinforcement Learning?

A short history of Reinforcement Learning

The theory behind Reinforcement Learning

Methods and application in real life

Summary

About the author

Leave a comment

Cancel reply

Join our team

Embedded Linux Expert (f/m/x)

Embedded Firmware Developer (f/m/x)

You might also like

Building an RTK Rover: Web-controlled precision navigation

HashiCorp Vault as a central hub for key and certificate rotation

Blazing fast websites and how Adobe Experience Manager can help achieve that

What does the code from Sii Testing Lab look like? Repositories deep-dive

How to set up branch policies in Azure DevOps: Complete guide for PR validation

From automation to autonomy: The evolution towards Agentic Testing, illustrated by mabl

Practical application of AI models in programming. A comparison of GPT-5.4 and Claude Opus 4.7

AI at work for a Test Developer – what Codex, Claude Design, Claude Opus 4.7, Codex Security, and Claude Mythos really change

AEM sites configuration with Sling Context-Aware Configuration

Want to track how fast and reliable your internet is? SpeedTest Tracker on Raspberry Pi

Spring x MonoDB – how to secure your transactions

Hierarchy view in Canvas apps

SUBSCRIBE AND DON'T FALL BEHIND

Join our team

Embedded Linux Expert (f/m/x)

Embedded Firmware Developer (f/m/x)

Processing...

What we do angle-down

Industries angle-down

Who we are angle-down

Career angle-down

Trainings angle-down

News angle-down

Contact angle-down

Ta treść jest dostępna tylko w jednej wersji językowej. Nastąpi przekierowanie do strony głównej.

What we do

Industries

Who we are

Career

Trainings

News

Contact

Ta treść jest dostępna tylko w jednej wersji językowej.
Nastąpi przekierowanie do strony głównej.