March 10, 2026
Evaluating the Quality of AI-Generated Alternative Descriptions and Accessible Diagram Representations
Educational and science communication materials often rely on visual content such as illustrations, photographs, diagrams, formulas, and charts. They help explain complex ideas and organize knowledge, but they are not equally accessible to everyone.
If visual materials do not include a properly prepared accessible alternative, such as a text description, people with visual impairments are left without access to valuable information and context. Creating such alternatives takes a great deal of time, especially in collections containing large amounts of specialist material, such as university course content or museum holdings. That is why Sages is carrying out an R&D project aimed at automating the creation of alternative descriptions for specialist graphics.
The project spans many months, with different tasks being carried out in parallel. Recently, two topics have been at the center of our work: evaluating the quality of descriptions generated by models for artworks and museum objects, and preparing the groundwork for work on diagrams.
Evaluating AI-generated descriptions: general and detailed perspectives, and the pursuit of automation
1. General comparison of generated and reference descriptions
This part of the work aimed to answer the following question: which models and approaches can generate descriptions at least as good as those written by humans, and which fall short?
Both people and AI models make, and always will make, different kinds of mistakes when creating alternative descriptions. Since ideal solutions do not exist, a realistic benchmark for automation is achieving results that are no worse than manual work.
How did the comparison work?
- For a random set of museum objects and artworks, we collected alternative descriptions generated by individual AI models as well as descriptions written by humans.
- We paired the descriptions of each object in the test set in an all-against-all (round-robin) scheme.
- Annotators experienced in describing similar objects selected the winner in each pair.
The assessment took into account two criteria separately: content and form. According to the first criterion, the winning description is the one that conveys as much information as possible without including distortions. According to the second, the winning description is the one that is linguistically correct, concise, stylistically neutral, and easy to read. The same description may win in one category but lose in the other.
Annotators never assessed objects whose descriptions they had previously worked on. They also did not know which descriptions were generated by a model and which were written by another team member.
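To give a sense of how such pairwise judgments can be turned into a ranking, here is a minimal sketch in Python. The source names and judgment data are hypothetical, and the actual analysis may use more elaborate rating methods (for example Bradley-Terry or Elo); win rates are just the simplest summary.

```python
from collections import Counter

# Hypothetical pairwise outcomes for one criterion: (winner, loser).
# "human" is the reference description; the others are model outputs.
judgments = [
    ("human", "model_a"),
    ("model_b", "human"),
    ("model_b", "model_a"),
    ("human", "model_b"),
]

def win_rates(judgments):
    """Fraction of pairwise comparisons each source won."""
    wins, games = Counter(), Counter()
    for winner, loser in judgments:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    return {source: wins[source] / games[source] for source in games}

print(win_rates(judgments))
# -> {'human': 0.67, 'model_b': 0.67, 'model_a': 0.0} (rounded)
```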
For now, human-written texts still rank highly, but we can already identify models that match human performance in at least one of the two dimensions: style or factual correctness.
2. Qualitative assessment
In addition to the better/similar/worse scale, generated descriptions were subjected to a detailed qualitative assessment that flagged fragments:
- containing false information,
- with content that was difficult to understand or potentially misleading,
- repeating information already included elsewhere in the description,
- unnecessarily introducing external knowledge.
This makes it much clearer which problems occur most often in descriptions generated using a given approach. It also allows us to monitor in detail how the choice of model and instruction prompt affects the real results.
This matters because our goal is not to find one model for everything, but to build a solution that ultimately delivers predictable quality.
For fully automatic use cases, a better choice may be a model that tends to repeat information but rarely distorts reality. In turn, for descriptions that are meant to be verified and corrected by humans, the most attractive option may be a model that avoids non-obvious interpretation errors, because obvious mistakes are much easier for a reviewer to notice.
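For illustration, the error categories above could be encoded as a small annotation schema. This is a hypothetical Python sketch, not our actual tooling:

```python
from dataclasses import dataclass
from enum import Enum, auto

class ErrorType(Enum):
    """Categories of problems flagged in generated descriptions."""
    FALSE_INFORMATION = auto()      # fragment states something untrue
    UNCLEAR_OR_MISLEADING = auto()  # hard to understand or potentially misleading
    REPETITION = auto()             # repeats information given elsewhere
    EXTERNAL_KNOWLEDGE = auto()     # unnecessarily introduces external knowledge

@dataclass
class ErrorAnnotation:
    description_id: str    # which generated description is being annotated
    span: tuple[int, int]  # character offsets of the flagged fragment
    error_type: ErrorType
    note: str = ""         # optional free-text comment from the annotator

# Example: flagging a repeated sentence in a generated description.
annotation = ErrorAnnotation("obj-042/model_a", (120, 185), ErrorType.REPETITION)
```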
By dividing the dataset into different categories corresponding to object types, we can also indicate which kinds of collections are most difficult for individual models.
3. Can evaluation be automated? The LLM-as-a-judge approach
Human evaluation is the best point of reference, but in the long run it is difficult to scale. That is why we are testing the LLM-as-a-judge approach, using a large language model to decide which description a human would choose as better.
How do we choose the judge?
- Several different models assess descriptions in terms of content and style.
- The models evaluate the same pairs for which we had previously collected assessments from trained annotators.
- We use statistical agreement measures to determine which model's assessments are closest to human judgment (sketched below).
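A minimal sketch of that selection step, using Cohen's kappa as the agreement measure. The verdict data is hypothetical; 'a' or 'b' records which description in a pair was chosen as the winner:

```python
from sklearn.metrics import cohen_kappa_score

# Human verdicts for a set of description pairs ('a' or 'b' won).
human = ["a", "b", "b", "a", "a", "b", "a", "b"]

# Verdicts from candidate judge models on the same pairs (hypothetical data).
judges = {
    "judge_1": ["a", "b", "b", "a", "b", "b", "a", "b"],
    "judge_2": ["b", "b", "a", "a", "a", "a", "a", "b"],
}

# Pick the judge whose verdicts agree most strongly with human annotators;
# kappa corrects for the agreement expected by chance alone.
scores = {name: cohen_kappa_score(human, verdicts) for name, verdicts in judges.items()}
best = max(scores, key=scores.get)
print(scores, "-> best judge:", best)
```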
If we find a model with high reliability and strong agreement with expert evaluation, we will be able to:
- use it as a tool for automating evaluation in the next project tasks, which will speed up further work iterations,
- ultimately apply this mechanism in automatic accessibility validation tools, where qualitative evaluation of alternative descriptions is currently missing.
Work on diagrams: mapping the field, alternative forms, consistency of models
Diagrams are a highly heterogeneous group of content, varying not only in subject matter but also in type and level of complexity. We started by organizing the basics: which diagrams are most common and how they can be meaningfully translated into an accessible form.
1. Which types of diagrams are most commonly used?
At the beginning of this work, we examined which diagrams are most likely to appear in real-world use cases. The analysis was based primarily on practical sources such as reviews of popular diagramming tools and guides to data visualization, as well as theoretical studies. Among many specific categories, we identified nine broader types that share similar functional elements.
This allows us to better plan the next steps, apply different approaches to different types of content, and monitor results more effectively.
2. Which alternative forms are we considering?
For many diagrams, the key challenge is finding a form of description that is not merely a short image caption, but one that captures the information structure in a way that makes the best possible use of screen reader capabilities.
That is why we analyze possible forms of alternative presentation:
- lists, which are a good reflection of diagrams showing hierarchies or set elements (see the sketch after this list),
- tables, which clearly group analogous, related data,
- long descriptions, when the data cannot be presented in a more structured form,
- mixed forms that combine these options, for example a list together with a short summary of other important information resulting from the diagram.
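As an example of the first option, a simple hierarchy extracted from a diagram can be rendered as an indented list that a screen reader can traverse level by level. The organizational structure below is hypothetical:

```python
# Hypothetical hierarchy extracted from an organizational chart.
org = {
    "Director": {
        "Head of Research": {"Research Team": {}},
        "Head of Operations": {"HR": {}, "Finance": {}},
    }
}

def to_nested_list(tree, depth=0):
    """Render a hierarchy as an indented text list, one node per line."""
    lines = []
    for name, children in tree.items():
        lines.append("  " * depth + "- " + name)
        lines.extend(to_nested_list(children, depth + 1))
    return lines

print("\n".join(to_nested_list(org)))
# - Director
#   - Head of Research
#     - Research Team
#   - Head of Operations
#     - HR
#     - Finance
```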
3. Consistency test: organizational charts in many visual variants
In the next part of the work, we prepared test sets based on organizational charts, each set containing a dozen or so visual variants of the same content. The variants differ in orientation, font size, the style of lines and block elements, and the presence of decorative elements.
What does this help us achieve?
- It removes the random factor introduced by differences in how diagrams look in real examples, making the evaluation more reliable.
- It allows us to identify not only the best approach, but also the most consistent one, meaning an approach whose results are less dependent on the appearance of the diagram.
- It helps us identify whether there are factors that make diagram processing difficult regardless of the approach used.
This approach is especially important if we are thinking about a tool that is meant to work in the real world, where the same data can be presented in many different ways.
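One simple way to quantify that consistency is to look at the spread of quality scores an approach achieves across the visual variants of a single diagram. A minimal sketch with hypothetical per-variant scores:

```python
from statistics import mean, stdev

# Hypothetical quality scores (0-1) for two approaches, measured across
# visual variants of the same organizational chart.
scores_by_approach = {
    "approach_a": [0.82, 0.79, 0.84, 0.46, 0.81],  # one variant trips it up
    "approach_b": [0.74, 0.72, 0.75, 0.73, 0.71],  # lower mean, more stable
}

for name, scores in scores_by_approach.items():
    # A good approach combines a high mean with a low spread across variants.
    print(f"{name}: mean={mean(scores):.2f}, stdev={stdev(scores):.2f}")
```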
Next steps in the project
In the next stage, we plan to:
- deepen our tests of description quality assessment and identify the method that best reflects human judgment,
- test different methods for generating alternative forms for individual classes of diagrams,
- begin work on methods for generating alternative forms for charts.
Ultimately, in line with the project assumptions, the results are to be packaged into a technical solution (an API) and implemented in a tool supporting WCAG compliance, together with tests and iterative improvements.
If you would like a summary of the earlier stages of the project, we invite you to read the previous post, “AI for Accessible Alternative Descriptions of Artworks and Specialist Graphics,” where we describe the project assumptions and the results of the first phase.
The project entitled “Development of specialized algorithms that automatically create accessible alternative versions of specialist graphics for people with visual impairments in order to automatically ensure the accessibility of digital educational materials” is co-financed by the European Regional Development Fund under Priority I “European Funds for a more competitive and smarter Mazovia”, Measure 1.1 “Research, development and innovation of enterprises”, Project type “Research and development infrastructure of enterprises” of the European Funds for Mazovia 2021-2027 program.
Project value: PLN 4,887,450.00. European Funds contribution: PLN 2,432,225.00.