08.04.2026

Sii Testing Lab Study: Exploring the “AI Boom” in Test Automation


AI has firmly become part of everyday work in IT teams – including software testing. Not as a one-time breakthrough, but as a shift that has quickly become the standard.

Over the past few years, tools based on large language models have started supporting an increasing number of work areas. Today, they are widely used by developers, testers, analysts, and product managers, and are increasingly treated as a natural part of the work environment rather than an experiment.

In software testing, the impact of AI is particularly visible. Support in creating test cases, structuring requirements, analyzing documentation, summarizing tickets, or suggesting manual test scenarios is no longer surprising. We know that AI genuinely helps and accelerates work in these areas – this is not a controversial claim, but an established practice already adopted by many teams.

That is why this study does not attempt to measure the overall impact of AI on testing. Instead, we narrow our focus to one specific area: automated test creation. We are not interested in whether AI helps with documentation or manual test preparation – benefits in those areas are already intuitive and widely observed. What we want to examine is the scale of improvement where a tester must sit down and deliver working code.

This area is particularly interesting because test automation is not just about “writing something that works.” The quality of the solution also matters: readability, maintainability, stability, and adherence to best practices. Even if AI enables faster development, an important question remains – how much faster, and whether that gain comes at the expense of quality.

The challenge is that despite the massive AI boom, we still have surprisingly little comparative data that shows the real scale of this change. Many opinions are based on subjective impressions:

“Work is faster,” “It’s easier to get started,” and “AI helps me get unstuck on difficult tasks.”

While these statements sound convincing, they do not answer how AI performs under controlled conditions on identical tasks. That is precisely what we aim to verify. An additional challenge is the “moving target” effect – we are evaluating models that evolve rapidly, which in extreme cases may make findings outdated with each new model version.

Objective

Our starting point is simple: we assume that AI accelerates the work of automation testers. We see it daily, teams see it, and the market confirms it.

The real question is therefore not “Does AI help?”, but rather:

How significant is the actual productivity gain – and what happens to the quality of the delivered solution?

This study aims to answer that question by measuring the impact of LLMs on automation testing in two dimensions:

  • Quantitative efficiency – how many scenarios were completed, the average score, and the distribution of results
  • Qualitative efficiency – whether the delivered solutions follow best practices, are maintainable, and whether increased speed affects quality

We are not trying to prove that AI helps. We aim to quantify the scale of its advantage under identical conditions and tasks.

Methodology

The study was designed as a controlled comparison of two approaches to building test automation solutions. Ten two-person teams participated, each composed to match the competency level of two mid-level automation testers. The teams were divided into two groups:

  • AI group – teams using LLM-based tools
  • Oldschool group – teams working without AI

Each team received its own identical environment, built using reproducible cloud-based solutions. No framework was pre-installed – the participants worked in a greenfield setup, building solutions from scratch without any existing code, architecture, or legacy constraints.

Participants had no prior knowledge of the system – they were only informed that they would be working with an e-commerce platform. Documentation (in the form of test cases), evaluation criteria, and full access to the application were provided only at the start of the experiment, limiting the possibility of prior preparation.

The study took the form of a classic hackathon – an eight-hour working session. Teams were placed in separate rooms to minimize information exchange and reduce interference.

A task was considered complete when the test:

  • executed real user steps
  • included assertions verifying the outcome
  • could be executed from the command line

This setup allowed us to compare actual outcomes rather than declarations: how many tests each team delivered, and at what quality.
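As an illustration, a test meeting all three completion criteria could look like the minimal sketch below. The study does not disclose the framework or the application under test, so Playwright with TypeScript, the shop URL, and the selectors are assumptions made for this example only.

```typescript
// Hypothetical example – the framework, URL, and selectors are assumed,
// not taken from the study.
import { test, expect } from '@playwright/test';

test('customer can add a product to the cart', async ({ page }) => {
  // Real user steps: open the shop and add a product to the cart
  await page.goto('https://shop.example.com');
  await page.getByRole('link', { name: 'Hoodie' }).click();
  await page.getByRole('button', { name: 'Add to cart' }).click();

  // Assertion verifying the outcome, not merely that the clicks happened
  await expect(page.getByTestId('cart-count')).toHaveText('1');
});
```

A test like this also satisfies the third criterion, since it can be run from the command line (in Playwright's case, with `npx playwright test`).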

Experiment

General observations

Already during the experiment, it became clear that AI not only accelerates work but also changes its nature. Oldschool teams spent most of their time on manual coding, debugging, and step-by-step problem solving. When more complex issues arose – such as XML API integration or deeper technical challenges – progress slowed significantly. Some teams became stuck on a single problem, directly limiting the number of tasks they could complete.

In the AI group, the workflow looked different. Technical challenges still occurred, but teams moved from blockers to new solution attempts much faster. AI shortened the time needed to identify the right direction, suggested implementation options, and helped teams quickly get back on track. In practice, AI teams got stuck less often and were able to complete more tasks.

Experience and AI usage patterns

A clear difference emerged in how teams used AI. The best results were achieved by teams working iteratively: prompt, validation, correction, next iteration. In this model, AI acted as an accelerator of engineering work. Teams attempting to generate large chunks of code in a single step more often lost time correcting overly broad or mismatched outputs. Access to AI alone did not guarantee success – the key was the ability to work with the model effectively and guide its outputs.

The highest-performing AI teams included experienced engineers and architects. They were able to quickly assess generated solutions, reject incorrect directions, and steer the model toward better outcomes. AI did not eliminate the importance of expertise – it amplified it.

Importantly, variation between teams was much greater in the AI group than in the Oldschool group. AI does not reduce skill gaps – quite the opposite, it makes the gap more visible between teams that use it consciously and those that treat it solely as a code generator.

Model limitations

An interesting observation emerged around specific technical challenges. AI teams struggled with selectors for dynamic buttons. The model attempted to solve the issue but sometimes followed incorrect paths and reinforced flawed patterns. This reflects the nature of LLMs, which rely entirely on the provided context.

A human would likely switch approaches more quickly – and this is exactly what more experienced teams did. This highlights both the limitations of the model and its susceptibility to “context poisoning.”

Interestingly, Oldschool teams did not encounter this issue in the same way – they selected locators manually and more directly. Conversely, XML API challenges that slowed Oldschool teams had virtually no impact on AI teams.
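To make the selector trap more concrete – hedged, since the teams' actual selectors are not published – the sketch below contrasts a layout-dependent locator with locators anchored to stable attributes, using assumed Playwright syntax:

```typescript
// Illustration only – the dynamic buttons from the study are not described
// in detail, so these selectors are hypothetical.
import { type Page, type Locator } from '@playwright/test';

// Fragile: tied to layout structure and a generated class name – the kind
// of locator an LLM may keep "repairing" without leaving the flawed pattern
function buyButtonFragile(page: Page): Locator {
  return page.locator('div.product-grid > div:nth-child(3) > button.btn-x7f3');
}

// More resilient: anchored to the accessible role and name, or a dedicated
// test id, both of which survive re-renders of dynamic buttons
function buyButtonResilient(page: Page): Locator {
  return page.getByRole('button', { name: 'Buy now' });
  // or: return page.getByTestId('buy-now');
}
```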

Work organization and fatigue

Differences were also visible in team energy levels and work organization. By the end of the day, Oldschool teams were noticeably more fatigued, as most effort was spent on manual coding and debugging.

AI teams worked more conceptually and iteratively. Less energy was spent on repetitive tasks, and more on decision-making, evaluation, and guiding the process. This suggests that AI accelerates work and shifts the engineer’s role – from executor to process operator and solution reviewer.

Results

Quantitative efficiency

The most tangible outcome is the number of delivered tests:

ID   Group       Number of tests
A1   AI          196
A2   AI          37
A3   AI          60
A4   AI          9
A5   AI          5
O1   Oldschool   8
O2   Oldschool   5
O3   Oldschool   5
O4   Oldschool   5
O5   Oldschool   5

Teams using AI delivered between 5 and nearly 200 tests, while teams working without AI delivered between 5 and 8. The scale of this difference shows that AI provides more than a marginal productivity improvement – it can multiply the number of completed scenarios within the same time frame.

In the Oldschool group, the results were almost identical, which suggests that without AI support, the pace of work was limited by similar technical barriers and the cost of manual implementation. The only noticeable deviation was a team that included an architect – this confirms that technical experience still has a real impact on outcomes, even without AI support.

In the AI group, the variation in results was significantly greater. On the one hand, this highlights the enormous potential of LLM-based tools; on the other, it clearly shows that access to AI alone is not sufficient. Teams that were able to work with the model iteratively and consciously achieved very high results; the remaining teams also gained an advantage, but a much smaller one. Quantitative efficiency in an AI-driven environment therefore depends not only on traditional technical competencies but also on the ability to work effectively with the model.

From a numerical perspective, AI not only increased the speed of test delivery but also significantly reduced the time lost on local blockers. Teams without AI more often “burned” a large portion of the day on single technical problems, while AI teams were able to resolve issues faster and move on. This mechanism appears to be one of the main reasons behind the substantial difference in the number of completed scenarios.

Qualitative efficiency

The evaluation of solutions was not based solely on the number of delivered tests. Equally important was how they were designed and implemented. In other words, not only “how many,” but above all “how well.”

Therefore, we defined eight quality criteria that allow us to assess solutions from the perspective of real engineering work:

  • K1 – Alignment with business goals and requirement coverage. Does the test verify the essence of the scenario, rather than just a sequence of actions? Are success and failure criteria clearly defined?
  • K2 – Test data and state preparation. Independence of the test from manual environment setup, approach to data creation, their uniqueness, determinism, and cleanup after execution.
  • K3 – Solution stability. Test resilience to slower environments, animations, delays, and DOM state; use of appropriate waiting mechanisms instead of arbitrary delays (a short sketch follows this list).
  • K4 – Quality of selectors and locators. Stability, consistency within the project, and avoidance of fragile locators overly dependent on layout structure.
  • K5 – Test architecture and patterns. Appropriate use of Page Object Model, UI components, helpers, fixtures, and setup layers.
  • K6 – Assertion quality. Whether assertions appear at key points of the scenario, verify real business outcomes, and remain resistant to irrelevant UI changes.
  • K7 – Diagnostics and error handling. Usefulness of logs, screenshots, and artifacts supporting failure analysis.
  • K8 – Engineering quality. Code readability, stylistic consistency, naming, project structure, avoidance of duplication, and overall solution maturity.
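The sketch below makes K3 (and part of K6) concrete. It is an assumed example – the evaluated repositories are not public – contrasting an arbitrary delay with a waiting assertion tied to the business outcome, in Playwright:

```typescript
// Assumed example – Playwright, the URL, and the selectors are not from the study.
import { test, expect } from '@playwright/test';

test('placing an order shows a confirmation', async ({ page }) => {
  await page.goto('https://shop.example.com/checkout');
  await page.getByRole('button', { name: 'Place order' }).click();

  // Penalized under K3: an arbitrary delay that is both slow and flaky
  // await page.waitForTimeout(5000);

  // Rewarded under K3/K6: a web-first assertion that waits only as long
  // as needed and verifies the real business outcome
  await expect(
    page.getByRole('heading', { name: 'Thank you for your order' })
  ).toBeVisible();
});
```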

Each repository was evaluated on a five-point scale for each criterion, and then an average quality score was calculated. Due to the limited time available to the teams, the assessment focused primarily on the correctness and maturity of completed elements. It was assumed that separating environment addresses, API URLs, or admin panel endpoints into a dedicated configuration file is a correct and desirable practice, while the presence of secrets, tokens, and user data in the repository was evaluated negatively.
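Under the configuration rule above, a repository that scores well might separate endpoints from secrets along these lines – a sketch with assumed file and variable names, not taken from any evaluated solution:

```typescript
// config.ts – safe to commit: environment addresses and endpoints live here
export const config = {
  baseUrl: process.env.BASE_URL ?? 'https://shop.example.com',
  apiUrl: process.env.API_URL ?? 'https://shop.example.com/api',
  adminPanelPath: '/admin',

  // Evaluated negatively if hardcoded: tokens and credentials must come
  // from the environment (or a secret store), never from the repository
  adminToken: process.env.ADMIN_TOKEN ?? '',
};
```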

ID   Group       K1  K2  K3  K4  K5  K6  K7  K8  Avg.
A1   AI          3   4   4   3   4   3   5   3   3.6
A2   AI          4   3   4   3   5   4   4   5   4.0
A3   AI          3   5   4   4   5   3   4   3   3.9
A4   AI          2   3   3   3   4   2   4   2   2.9
A5   AI          3   4   3   3   4   2   3   3   3.1
O1   Oldschool   4   5   4   4   5   3   4   5   4.3
O2   Oldschool   3   3   3   3   4   2   2   3   2.9
O3   Oldschool   2   3   2   2   3   1   1   2   2.0
O4   Oldschool   3   3   4   3   3   2   4   2   3.0
O5   Oldschool   3   2   4   3   4   2   2   3   2.9

Analysis of qualitative results

The average quality score for the AI group was 3.53/5, while for the Oldschool group it was 3.02/5. The difference is not as spectacular as in the number of tests, but it remains clear and consistent.

The advantage of the AI teams stemmed primarily from better preparation of test data and setup, stronger framework architecture, more frequent use of structured design patterns, and better failure diagnostics. AI repositories more often included separate fixtures, helpers, page object layers, unique test data, and artifacts supporting error analysis: screenshots, traces, and extended logging.
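As a hedged illustration of the layering described above – the repositories themselves are not public, so the names and structure below are assumed – a page object in this style might look like the following:

```typescript
// pages/cart.page.ts – assumed example of a page object layer (Playwright)
import { type Page, type Locator, expect } from '@playwright/test';

export class CartPage {
  private readonly checkoutButton: Locator;

  constructor(private readonly page: Page) {
    // Locators are defined once here, not scattered across test files
    this.checkoutButton = page.getByRole('button', { name: 'Checkout' });
  }

  async open(): Promise<void> {
    await this.page.goto('/cart');
  }

  async startCheckout(): Promise<void> {
    await this.checkoutButton.click();
  }

  // Business-level assertion exposed as a readable method
  async expectItemCount(count: number): Promise<void> {
    await expect(this.page.getByTestId('cart-count')).toHaveText(String(count));
  }
}
```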

In other words: AI not only accelerated test writing, but often led to more structured and more “engineering-grade” solutions.

This does not mean there were no problems. Some repositories showed uneven assertion quality, secrets committed to configuration, and signs of hurried, not fully finished implementation. Speed still came at a cost – although not as high as one might expect.

In the Oldschool group, the picture was more varied. The best single attempt – the O1 repository – achieved a score of 4.3/5 and was among the strongest technical solutions in the entire study. O1 stood out with very good layer separation, sensible API-based data setup, cleanup, and a high engineering culture. The remaining Oldschool repositories more often revealed weaknesses: shallow assertions, poor diagnostics, the presence of secrets, or lower maturity of the test infrastructure.

Ultimately, the advantage of AI was therefore not limited solely to the number of delivered tests – it also included the quality of work organization and the construction of the framework itself.


Conclusions

  1. AI significantly and measurably increases productivity. The difference between 5–8 tests and 5–200 tests does not allow us to speak of a marginal improvement – it represents a shift in the level of work efficiency. Importantly, this higher productivity did not come at the expense of quality. The AI group demonstrated greater consistency in terms of data preparation, architecture, and code maintainability – this was particularly visible in the data preparation strategy, approach to selectors, and consistent handling of assertions.
  2. Effective use of AI is a competency in itself. The best-performing teams worked with the model iteratively, in small steps, with continuous validation and refinement. The model does not replace the engineering process – it enhances it, but only when the team is able to guide it consciously.
  3. Experience plays a key role. The best results were achieved by teams that included architects, confirming that AI does not eliminate the importance of expert knowledge. On the contrary – it increases the return on it.
  4. AI significantly reduces the time needed to overcome technical blockers. Where Oldschool teams would get stuck and spend time manually working toward a solution, AI teams were able to regain momentum more quickly. The ability to resolve blockers rapidly appears to be one of the key sources of advantage.
  5. Models have their limitations. The issue with selectors for dynamic buttons showed that AI can enter an incorrect path and reinforce it for some time instead of exiting it. Even with a significant increase in productivity, human technical judgment remains essential. AI accelerates work, but does not replace thinking.
  6. AI also changes the nature of engineering work. In the Oldschool model, manual execution dominated: writing, fixing, debugging. In the AI model, greater emphasis was placed on decision-making, iteration, quality assessment, and selecting next steps. Thanks to easy access to large amounts of structured information, models supported more thoughtful and consistent decisions. In practice, experienced automation engineers will less often focus solely on writing code and more often take on the role of architects of interaction with models – defining problems, validating outputs, and deciding what enters the system. This shift can be compared to paradigm changes where similar differences were observed between engineers relying on book-based knowledge and those who began searching for answers on platforms such as Stack Overflow. The participants themselves noted that during the study they developed a better understanding of how to communicate effectively with language models.
  7. The qualitative advantage reinforces the quantitative picture. The average score for the AI group was 3.53/5 compared to 3.02/5 for the Oldschool group. AI repositories more often featured better data setup, more structured architecture, clearer separation of layers, and richer diagnostics. However, this advantage is not absolute – the best single result in the Oldschool group reached 4.3/5, which shows that solution quality depends more strongly on the maturity and experience of a given team. AI, however, more often helped achieve a higher level of solution maturity within the same limited time.

What’s next? Continuation of the experiment

The first edition made one thing very clear: AI can provide a significant advantage, but the outcome depends on how it is used. The differences between teams were large enough to show that it was not the technology itself but the approach to working with it that proved decisive – and that is the key takeaway from the study.

That is why the next edition of Testing Lab will focus on identifying the most effective ways of using AI – we want to verify which methods, tools, and approaches truly accelerate work, make it possible to maintain high code quality, and at the same time reduce the costs resulting from the use of models. If the first edition answered the question “Does AI work?”, the next one will answer a much more important one: how to use AI to maximize the outcome without compromising quality and cost.

Meet the Testing Lab – AI Edition Jury

  • Krzysztof Bednarski – Solution Architect at Sii Polska. In the study, he focused on assessing the engineering maturity of solutions, with particular emphasis on the quality of framework architecture, their scalability, and the impact of LLM-based tools on the daily work of experienced test engineers.
  • Tomasz Kuran – Solution Architect at Sii Polska. In his analysis, he concentrated on identifying risks associated with the use of AI and evaluating the long-term maintainability of solutions. He paid special attention to the evolution of architectural standards in the context of working with language models.
  • Remigiusz Bednarczyk – Senior Test & Analysis Engineer at Sii Polska. He analyzed situations in which the use of AI provides teams with a real business advantage, while also identifying the limits of efficiency and areas where excessive reliance on LLMs may reduce quality or control over the solution.

Have questions? Get in touch with us 😊


About the author

Krzysztof Bednarski

Solution Architect. At Sii, he designs and implements comprehensive testing solutions for applications and systems. He is a leader in the Testing Community, where he shares ideas and innovative solutions developed for clients. When he’s not working with code, he supports new project acquisition and client meetings from a technical perspective. After work, he relaxes by watching a good TV series, playing a game, or creating models for 3D printing.


About the author

Remigiusz Bednarczyk

He works at Sii as a Test Development Engineer. He serves as the Test Lead on two ETL processing projects. On a daily basis, he collaborates with teams based in the U.S., Switzerland, and Poland, which brings challenges in terms of both communication and heightened client expectations. Software testing is also his passion, which he continually develops and shares with others as a technical trainer, e-learning course developer, and recruiter. In his free time, he helps develop the Sii running group in Lublin and trains with the Hardstyle Kettlebell method, following Pavel Tsatsouline’s approach.


About the author

Tomasz Kuran

Senior Quality Assurance Automation Engineer with a demonstrated history of working in the information technology and services industry. Skilled in Test Automation, Test Management, Software Testing, Java, and Test Planning. Strong quality assurance professional with a BS in Electronics and Biomedical Engineering from Politechnika Warszawska.
