AI in digital accessibility: what we worked on in the report for the Ministry of Digital Affairs

Artificial intelligence is now present in almost every conversation about technology. A great deal of attention is paid to generating new text, images, or video. Much less is said about improving the quality of existing content — and, in particular, adapting it to the needs of people who use assistive technologies.

This was precisely the focus of the report prepared jointly by Sages and the Institute of Computer Science of the Polish Academy of Sciences for the Ministry of Digital Affairs.

From our perspective, this topic is particularly interesting because it brings together large language models, NLP, and chatbots with the responsible design of digital services for education, culture, and public administration, among other fields. At Sages, we have devoted a great deal of attention to these areas in recent years.

Digital accessibility is not about simply “ticking off” a few technical requirements. Solutions that work on paper do not always deliver the intended results in real-life conditions. The real goal is to ensure that no message is lost to a person using assistive technologies, or to someone with different needs or limitations.

As early as the executive summary of the report, we pointed out that traditional automatic methods for checking and improving accessibility have significant limitations, and that there is still room for further automation with the help of artificial intelligence.

We did not start from the assumption that AI would solve everything

One of the things that mattered to us from the beginning was avoiding naïve enthusiasm. It would have been very easy to write that AI can simply “receive a file, generate an audit, and suggest fixes”. But that would not have been true. Checking a document for accessibility requires multi-layered and multimodal analysis. It also involves different components depending on whether we are talking about websites, texts, images, tables, audio, multimedia, or other types of content.

That is why the report analyses the potential use of AI separately for each area of digital accessibility in the broad sense: HTML code analysis, language correction, translation, summarisation, OCR, generation of alternative image descriptions, contrast checking, document layout analysis, speech synthesis, audio and video transcription, audio description and audio text, as well as the ability to explain legal regulations and technical aspects of accessibility based on source documents.

This variety of tasks was crucial for us. The same model may produce promising results in one use case, while performing very poorly in another, or even proving technically unsuitable. If AI is to be used responsibly, it needs to be evaluated task by task, not at a general level.

The report also addressed very practical issues: costs and licences, which ultimately determine whether a given solution is suitable for real-world use. In line with the preferences of the Ministry of Digital Affairs, we therefore focused on open-source models, although in justified cases we also included commercial solutions.

Where does AI already show real value today?

The main part of the report contains numerous tables comparing individual models according to the metrics they achieve in a variety of tests and benchmarks — both those published in other sources and those carried out by us specifically for the purposes of this analysis.

Given the enormous number of models and possible ways of using them, it is impossible to bring them all down to a single common denominator and summarise them comprehensively — especially as new models, new versions, and new test sets appear from month to month. Analysing specific parts of this complex reality makes it possible to identify general trends and patterns. We present some of our observations below.

Improving HTML code

One of the strongest results in the entire study concerned the automatic repair of accessibility errors in HTML code. In tests based on examples from the AccessGuru dataset, the highest effectiveness was achieved by the Qwen2.5:32b model, which reduced violations across the entire dataset by 80% and removed all errors present in a given example in 62% of test cases. It was followed closely by the Polish model Bielik-11B-v2.6-Instruct, with a 76% overall reduction and complete repair in 59% of test cases.

This complex task can be broken down into smaller component problems. Results by error type reveal both weak points — such as colour contrast or the use of ARIA attributes — and areas where effectiveness is highest, such as completing important metadata defining the document title and language. Unlike traditional validators, which only check the completeness of metadata, AI can not only identify the problem but also successfully fix it.

ARIA attributes, on the other hand, are safest to verify and improve through manual expert work. Any AI-based analysis in this area should be treated only as an optional indication.
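
To make the metadata case discussed above concrete, here is a minimal sketch of the kind of completeness check a traditional validator performs. It uses only Python's standard html.parser module. The names MetadataCheck and missing_metadata are our own illustrations and do not come from the report. Note that the check can only report that a title or language declaration is missing. Proposing a sensible value is the step where a language model may help.

```python
from html.parser import HTMLParser


class MetadataCheck(HTMLParser):
    """Records whether a document declares a language and a non-empty title."""

    def __init__(self):
        super().__init__()
        self.lang = None
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "html":
            # attrs is a list of (name, value) pairs
            self.lang = dict(attrs).get("lang")
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def missing_metadata(html: str) -> list[str]:
    """Return a list of detected metadata problems (empty if none)."""
    checker = MetadataCheck()
    checker.feed(html)
    problems = []
    if not checker.lang:
        problems.append("html element has no lang attribute")
    if not checker.title.strip():
        problems.append("document has no title")
    return problems


print(missing_metadata("<html><head></head><body><p>Hi</p></body></html>"))
print(missing_metadata('<html lang="pl"><head><title>Raport</title></head></html>'))
```

The first call reports both problems and the second reports none. A real audit would of course cover far more rules than these two.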

However, this is only half of the picture. We also checked whether the models introduced new errors in the process. This is where a very important practical lesson emerges: automatically removing errors does not always mean returning fully correct code, because unnecessary or even harmful modifications may also be introduced in other parts of the document.

In this respect, Bielik-11B-v2.6-Instruct performed best: only 4.6% of the errors found in the documents it had processed were new, that is, absent from the original test case. However, the comparison did not cover modifications that introduced no accessibility errors but could still change the appearance or content of the page in unintended ways, which is also potentially undesirable.
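
The two perspectives described above, errors removed and errors newly introduced, can be sketched as simple set arithmetic. This is only an illustration under our own assumptions: violations are represented as hypothetical string identifiers combining a rule and an element, and the exact metric definitions used in the report may differ.

```python
def repair_metrics(before: set[str], after: set[str]) -> dict[str, float]:
    """Compare violations found before and after an automatic repair.

    Both arguments are sets of violation identifiers (a hypothetical
    "rule:element" format chosen for this example).
    """
    fixed = before - after        # present originally, gone after repair
    introduced = after - before   # absent originally, present after repair
    return {
        # share of the original violations that were removed
        "reduction": len(fixed) / len(before) if before else 0.0,
        # share of the remaining violations that are new
        "introduced_share": len(introduced) / len(after) if after else 0.0,
    }


before = {"image-alt:#logo", "html-has-lang", "document-title",
          "label:#email", "color-contrast:.btn"}
after = {"color-contrast:.btn", "aria-valid-attr:#menu"}

metrics = repair_metrics(before, after)
print(metrics)
```

In this made-up example, four of the five original violations disappear, but one of the two violations left in the output was not there before. Looking at the reduction alone would hide that.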

In the target solution, it will be necessary to identify additional methods for limiting such unintended modifications. When implementing AI — especially for complex tasks such as code modification — what matters is not only whether the instruction is carried out successfully, but also the stability of the process and the predictability of the overall result.

Contrast: not every task should be entrusted to LLMs

As for the contrast issue mentioned above, we analysed it further in a separate subsection. These were probably the most disappointing results of all those covered in the report. Importantly, reliable tools already exist for analysing contrast arithmetically, without using AI, and these tools are best suited to making assessment faster and more objective, or even to automating corrections.
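
As an illustration of what "arithmetically" means here, the contrast ratio defined in WCAG 2.x can be computed in a few lines without any model. The sketch below follows the published WCAG formulas for relative luminance and contrast ratio. The function names are our own, and the input is assumed to be a six-digit hexadecimal colour.

```python
def _linear(channel: int) -> float:
    """Convert an 8-bit sRGB channel to its linear value (WCAG 2.x definition)."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_colour: str) -> float:
    """Relative luminance of a colour given as '#rrggbb'."""
    h = hex_colour.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linear(r) + 0.7152 * _linear(g) + 0.0722 * _linear(b)


def contrast_ratio(foreground: str, background: str) -> float:
    """WCAG contrast ratio, from 1 (no contrast) to 21 (black on white)."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)


# WCAG level AA requires at least 4.5:1 for normal-size text.
for colour in ("#000000", "#767676", "#777777"):
    ratio = contrast_ratio(colour, "#ffffff")
    print(colour, round(ratio, 2), "pass" if ratio >= 4.5 else "fail")
```

The result is deterministic and repeatable, which is exactly what a language model cannot guarantee for this task. The difficult part in practice is establishing which foreground and background colours are actually rendered, not the calculation itself.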

The conclusion is one of the recurring themes of the entire report: AI can provide very meaningful support in improving accessibility in selected areas, but it can rarely operate without additional validation — and it certainly cannot fully replace human expertise and intuition.

In practice, the best systems supporting digital accessibility will not be based on a single approach. Where rules and calculations work best, we should use rules and calculations. Where semantic, linguistic, or multimodal analysis is needed, AI is worth adding.

OCR: reading text from photos and scans

One area that has long made successful use of artificial intelligence methods is OCR, or the recognition of written characters in image files. This is a key aspect of accessibility, because if a document exists only as a scan or a photo of printed pages, it cannot be read aloud by a speech synthesiser or shown on a refreshable Braille display.

When it comes to the usefulness of the latest AI techniques, the range of results is very wide, depending on the model, the type of files in the test set, and the resolution of page images. Among open models, Qwen2.5-VL-72B-Instruct performs best in the reVISION benchmark, with accuracy above 60%, while in the olmOCR-Bench benchmark the strongest results are achieved by olmOCR 2 with 82% accuracy, PaddleOCR-VL with 80%, and dots.OCR with 79%.

The similarity between the name of the second benchmark and the name of its leading model is no coincidence, so the 82% result should probably be approached with a certain degree of caution. As always, we emphasise that the best evaluations are those performed on a representative set of documents for a specific task, although this is not always possible.

New models can also be used for post-correction of OCR results, an area that until now has been dominated by manual work. In this task, Bielik-11B-v2.6-Instruct performs best, achieving a word error rate (WER) of just 0.26%. Taken at face value, this means that after correction, fewer than 3 words in every 1,000 differed from the source document on average. This is a very strong result, especially when we remember that people also make mistakes, particularly in tasks that are tedious and repetitive.
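
For readers unfamiliar with the metric, WER is conventionally defined as the word-level edit distance between the system output and the reference (substitutions, deletions, and insertions) divided by the number of words in the reference. The sketch below is a minimal implementation of that standard definition, assuming simple whitespace tokenisation. It is not the evaluation code used in the report, and the example sentences are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the distance between the reference words processed
    # so far and the first j words of the hypothesis
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


reference = "ala ma kota i psa"
print(word_error_rate(reference, "ala ma kola i psa"))  # one substitution in five words
print(word_error_rate(reference, "ala ma kota"))        # two deleted words out of five
```

Because insertions count as errors too, WER can exceed 100% when the output is much longer than the reference.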

Translation as an element of information accessibility

Translation is rarely mentioned as part of digital accessibility, but it is an important element of information and communication accessibility, including online. A large share of valuable content is still published in English, which in itself can be a real access barrier.

Machine translation is also a way to make better use of models whose quality varies significantly depending on the language. In some tasks, such as generating alternative descriptions, it may sometimes be better to translate the output returned by a strong English-language model than to settle for a weaker result from a multilingual model.

For this reason, we independently analysed the effectiveness of open models in translation from English into Polish. Tests on the WikiMatrix dataset showed that Bielik-11B-v2.6-Instruct performed best among the open-source models tested. Interestingly, Bielik’s result was significantly higher than the scores achieved in a similar study carried out in the same year for foreign commercial models.

As we can see, price and quality do not always increase proportionally, although the difference in results may also be partly related to a different selection of test cases.

Alternative descriptions: an area where both potential and complexity are clearly visible

One of the most obvious associations with AI and accessibility today is alternative descriptions. This is understandable: if a multimodal model generates text based on an image, it should be able to describe that image to a blind person without difficulty. In practice, however, this area turned out to be somewhat more complex.

During the tests, we took into account the fact that the same image may require a different description in a news article than in a situation where it serves an identification function. In the analysis, we discussed, for example, the coats of arms of Polish voivodeships, whose descriptions should differ between Wikipedia and a regional assembly website because the image performs a different communication function in each context. This is very important, because it shows that simply “recognising objects in a photo” does not always solve the problem of image accessibility.

There is also the broader issue of adapting the language and content to the image’s function. In large, non-specialist datasets, the boundary between captions and true text alternatives becomes blurred. As a result, models may add unnecessary encyclopaedic facts or opinions, for example. Another separate problem is keeping descriptions to a sensible length — some models, such as those from the Llama family, tend to be overly verbose regardless of the specified character limit.

This is where the difference between just any image-to-text model and AI that genuinely supports accessibility becomes especially clear. In the latter case, what matters is not only linguistic correctness, but also function, context, and usefulness for the recipient. At Sages, we are currently conducting more detailed research into the use of LLMs in this area.

Language matters more than it may seem

One of the themes running through the entire report is the perspective of the Polish language. Many international benchmarks and studies focus on English or Chinese. However, we cannot assume that a “multilingual” model with strong results in one language will work equally well for Polish. To examine and improve the usefulness of models in local contexts, we also need our own test cases, training datasets, and fine-tuning for specialist tasks.

Poland already has reasons to be proud in the field of natural language processing, as shown, for example, by the strong results achieved by Bielik in selected tasks and by the existence of language-specific benchmarks such as BIGOS or the Polish benchmark of linguistic and cultural competence. These are not merely technological curiosities. Further development of local resources may have a tangible impact on digital accessibility in Poland.

We are keeping our fingers crossed for the continued development of local language technologies, because the need for them is clear to us.

The key strategic conclusion: an ecosystem, not a single model

At the technological level, the most important conclusion of the entire report concerns the architecture of future solutions. In the section devoted to further model development, we compared an approach based on a single multi-task model with one based on a combination of multiple specialised models. The conclusion was clear: the second approach is better suited to the pace of change in the AI market and to the need for flexible updates.

This means that a well-designed tool supporting digital accessibility should consist of several cooperating components. In the recommended “maxi” variant, the report points to components including the bge-multilingual-gemma2 model for source retrieval in a RAG architecture, Qwen2.5:32b for HTML correction and text analysis, Bielik for language correction and translation, Qwen2.5-VL-72B-Instruct for OCR, and mistral-small3.2 for generating alternative descriptions.

Why do we propose so many models? As the old saying goes, a tool that is meant for everything is good for nothing. For less computationally intensive or narrowly specialised tasks, it is worth involving smaller dedicated models, while the largest solutions, which are costly to train and use, should be applied only where they genuinely improve results.

In addition, incorporating modularity from the very beginning of system design makes it possible to “swap” models easily whenever there is a reason to do so.
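
One way to picture this modularity is a small registry that maps each task to a replaceable backend. The sketch below is purely illustrative: the class AccessibilityToolkit, the task names, and the stub_backend helper are our own inventions. The stubs only tag their input with a model name instead of calling a real model. The model names themselves are those listed above.

```python
from typing import Callable

# A backend takes some input (HTML, text, an image path) and returns text.
Backend = Callable[[str], str]


class AccessibilityToolkit:
    """Routes each task to whichever backend is currently registered for it."""

    def __init__(self) -> None:
        self._backends: dict[str, Backend] = {}

    def register(self, task: str, backend: Backend) -> None:
        # Registering a task again replaces the previous backend.
        self._backends[task] = backend

    def run(self, task: str, payload: str) -> str:
        if task not in self._backends:
            raise KeyError(f"no backend registered for task {task!r}")
        return self._backends[task](payload)


def stub_backend(model_name: str) -> Backend:
    """Placeholder for a real model call; it only labels the input."""
    return lambda payload: f"[{model_name}] {payload}"


toolkit = AccessibilityToolkit()
toolkit.register("html_repair", stub_backend("Qwen2.5:32b"))
toolkit.register("translation", stub_backend("Bielik"))
toolkit.register("ocr", stub_backend("Qwen2.5-VL-72B-Instruct"))
toolkit.register("alt_text", stub_backend("mistral-small3.2"))

print(toolkit.run("translation", "Digital accessibility"))

# Swapping a model later touches one line, not the callers.
toolkit.register("ocr", stub_backend("hypothetical-newer-ocr-model"))
print(toolkit.run("ocr", "scan_001.png"))
```

The callers depend only on task names, so replacing a model with a newer or cheaper one does not require changes elsewhere in the system.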

Open technology and responsible implementation

The report ends with a strong strategic emphasis. In our recommendations, we indicated that from the public-sector perspective, it is worth investing in dedicated tools based on open technologies. This direction helps maintain independence from external providers, control costs, and take environmental and social aspects into account.

At the same time, we emphasise that a system supporting accessibility should be designed around real user needs and preceded by documentation of violations that are actually being reported.

This also matters to us for practical reasons. In the field of digital accessibility, the point is not superficial innovation. What is needed is trust, and trust can only be built on predictability and usefulness.

What we took away from this project

Because we spent six months examining topic after topic in detail, we were able to organise the subject without falling into either excessive enthusiasm or scepticism. Although we have long worked with both accessibility and AI, we are leaving this project with much broader knowledge — especially in areas that had previously remained on the margins of our interests, such as multimedia accessibility.

We also confirmed certain intuitive observations that had already emerged in our day-to-day work and that were reflected in methodical analyses of the literature, leaderboards, and the results of our own tests.

We can now see clearly that AI can already provide real support for digital accessibility today, but also that it does not replace expert knowledge and should not be treated as an automatic answer to every problem.

And this is what seems most valuable to us. The goal is not to automate everything while boasting about using the largest models and the most expensive hardware. Instead, we want to use technology wisely: combining the latest solutions with older, proven machine methods, without forgetting common sense, and keeping people at the centre — together with their knowledge, experience, skills, imagination, and needs.

The full report, “Analysis of the possibilities of using artificial intelligence in the area of digital accessibility assessment”, is available on the gov.pl website: https://www.gov.pl/web/dostepnosc-cyfrowa/raport—analiza-mozliwosci-wykorzystania-sztucznej-inteligencji-w-obszarze-badania-dostepnosci-cyfrowej