
Judge LLM
We collaborated with IBM to study the domain of LLM evals and discovered that professionals evaluating LLMs need a more effective way to assess extensive sets of outputs across multiple criteria and dimensions. Judge LLM aims to provide a flexible, scalable and easily comprehensible interface for enabling this process.
Role: Design & Research
Team: 2
Themes: Industry Project, Large Language Models, Machine Learning, Comparative Analysis, Contextual Inquiry, Co-Design, Functional Prototyping
Timespan: 8 months
Research
Problem Space
To make the project scope easy to follow, here are definitions of terms that will be used repeatedly throughout this case study.
LLM : Large Language Model, an AI model that specializes in natural language.
Dataset : A tabulated collection of LLM-generated text responses.
Criteria : Qualitative dimensions used to judge any piece of text, such as clarity.
Evaluation : The process of judging each response in the dataset against the qualitative criteria.
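To make these terms concrete, here is a minimal, hypothetical sketch of how a dataset row and a criterion might be represented in code; the class and field names are illustrative assumptions, not the actual schema used by Judge LLM.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    """A qualitative dimension used to judge a piece of text."""
    name: str         # e.g. "Clarity"
    description: str  # what the judge LLM should look for

@dataclass
class DatasetRow:
    """One LLM-generated response to be evaluated."""
    prompt: str    # the input that produced the response
    response: str  # the LLM-generated text being judged

# Illustrative example values (not real project data)
clarity = Criterion(
    name="Clarity",
    description="Information is presented in an easily understandable manner.",
)
row = DatasetRow(
    prompt="My flight was cancelled. What are my options?",
    response="I'm sorry about the cancellation. You can rebook for free or request a refund.",
)
```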
Problem Statement
Professionals* evaluating LLMs need a more effective way to assess extensive sets of outputs across multiple criteria and dimensions. Current evaluation tools struggle to efficiently handle large-scale assessments and lack the capability to meaningfully aggregate results from diverse criteria. These limitations hinder the ability to gain comprehensive insights and make informed decisions about model performance. An improved approach is needed to address both the scale of evaluation and the complexity of synthesizing multi-dimensional results.
*Professionals = AI Developers, Data Scientists, Consultants

After probing the problem space with this problem statement in mind, we arrived at a 'How Might We?' statement that condensed it into a task-oriented question from which our research questions could be derived.
How Might We?
How might we help professionals* efficiently evaluate large sets of language model outputs across multiple criteria and dimensions by providing a system that both handles large-scale assessments and effectively aggregates complex, multi-faceted results?
*Professionals = AI developers, Data scientists, Consultants...
Example of a Use Case
You are an AI developer at an airline company preparing to release a new customer service chatbot powered by an LLM. Before letting customers use it, you need to evaluate a dataset of responses generated by your LLM chatbot during testing.
This evaluation helps you identify patterns of issues and improve the chatbot before real customers interact with the system. So you decide to evaluate the outputs based on the following qualitative criteria.
Empathy: Connecting with the customer's particular situation.
Politeness: Respectful language conveying courtesy throughout the interaction.
Clarity: Information presented in an easily understandable manner.
What are your options? How will you go about this evaluation? You can choose from these:
1. Go through all responses manually, or hire people to do so. This way of evaluating is very resource- and time-intensive.
2. Develop programmatic evaluation frameworks. This method requires technical skills and expertise.
3. Use a tool that allows you to define the criteria and upload the dataset to be evaluated by another LLM (sketched below). This method is:
- Less time-intensive
- Less resource-intensive
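As a rough illustration of the third option, the sketch below shows an "LLM-as-a-judge" loop: each response is sent to an evaluating model together with one criterion, and a score is parsed from the reply. This is a generic, hypothetical sketch, using the openai client as an example backend and a made-up 1–5 scale; it is not the pipeline used by Judge LLM or IBM's tooling.

```python
# Hypothetical LLM-as-a-judge loop: score each response against each criterion.
# Assumes an OpenAI-compatible API; the prompt format and 1-5 scale are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = (
    "You are evaluating a customer-service chatbot response.\n"
    "Criterion: {name} - {description}\n"
    "Response to evaluate:\n{response}\n\n"
    "Rate the response on this criterion from 1 (poor) to 5 (excellent). "
    "Reply with the number only."
)

def judge(response_text: str, name: str, description: str) -> int:
    """Ask the judge model for a 1-5 score on a single criterion."""
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # example judge model, not the project's choice
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            name=name, description=description, response=response_text)}],
    )
    return int(reply.choices[0].message.content.strip())

criteria = {
    "Empathy": "Connects with the customer's particular situation.",
    "Politeness": "Respectful language conveying courtesy throughout the interaction.",
    "Clarity": "Information presented in an easily understandable manner.",
}
responses = ["I'm sorry your flight was delayed. You can rebook at no charge."]

scores = {
    (i, name): judge(text, name, desc)
    for i, text in enumerate(responses)
    for name, desc in criteria.items()
}
print(scores)
```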
As organizations increasingly depend on LLMs, tools that let users define criteria and upload a dataset to be evaluated by another LLM are becoming popular and useful. We then formulated research questions to uncover more about current tools and their shortcomings.
Research Questions
1. What are the challenges when evaluating data from large datasets?
2. What are the challenges when evaluating LLM outputs against diverse, user-provided criteria?
3. How can aggregated results from multi-dimensional LLM output evaluations be effectively visualized and interpreted?
4. What key features in large-scale LLM output evaluation tools would enhance result reliability and transparency?
To answer these research questions, we planned a thorough literature review, to establish a solid understanding of the problem space and existing tools, followed by a contextual inquiry using a tool that stood out to us. Our research plan spanned four months, from August to November 2024:
Literature Review
- Crucial for getting acquainted with the problem space.
- Understanding the popular tools used for Large Language Model evaluations.
- Followed by a comparative analysis of the most popular tools.
Contextual Inquiry
- Offered rich, qualitative insights into real-world evaluation environments.
- Observations revealed subtle pain points and unmet needs when conducting Large Language Model evaluations.
- These observations were followed by a discussion.
The research was structured to produce findings that correspond to our research questions; the two broader research activities, the literature review and the contextual inquiry, each had segments that led us toward our research goals.

Literature Review Insights
We studied papers written by the creators of evaluation tools around the world and did a task analysis for different tools. Learning about popular tools and their limitations and affordances helped us create a comparative matrix outlining the features and their usefulness across tools. We also created a feature matrix that compared the tools in terms of which of them provide features relevant to our research questions.
Comparative Matrix
We built the comparative matrix around the following criteria for comparison:
Scalability: Ability to spot trends across large sets of outputs efficiently.
Usability: User-friendliness and visual quality of the tool's interface.
Multi-Criteria Support: Ability to assess outputs across multiple evaluation criteria and dimensions.
Customization: Ability to adapt to different evaluation needs.
Aggregate Mechanisms: Ability to meaningfully aggregate results from diverse criteria.
Integration: Compatibility with existing workflows.
Task Analysis
We plotted task analysis diagrams for some tools that we studied. This helped us greatly in understanding the data pipeline of IBM's tool and inspired the hierarchical task analysis of Judge LLM.
Feature Matrix
The feature matrix compared several tools based on whether they provide specific features.
Legend: feature is supported; feature is partially supported or requires manual effort; feature is not supported (blank).
Features compared: User Interface, Criteria Customization, AI-Generated Criteria, Compatibility with Large Datasets, Interactive Criteria Refinement, Filtering & Exploration, Quantitative & Qualitative Analysis, Performance Tracking, Reliability Analysis.
Tools compared: EvalLM, Constitution Maker, LLM Comparator, EvalGen, Deepchecks.
This comparative analysis allowed us to identify an accessible tool, with features aligned with our research questions, to use for contextual inquiry. EvalLM, an open-source LLM evaluation tool developed by researchers at KAIST, was selected. The tool offers flexible criteria management and returns a combination of qualitative and quantitative evaluation results; although it does not handle large datasets, using it gave us many insights.
Contextual Inquiry
We recruited professionals working with LLMs and evals, as well as PhD scholars researching the field, to take part in our contextual inquiry. The process involved introducing each participant to our problem space and topic area, after which they were given a scenario that required them to carry out an evaluation using EvalLM.
Participants were asked to think aloud as they carried out the task and were assisted with the tool only when absolutely required. This was followed by a discussion about the tool and ideas for how it could have been better.
We conducted six contextual inquiries. In an interpretation session afterwards, the notes and quotes from the six sessions were put into a spreadsheet for refinement and further insight generation.
The refined quotes, notes, and insights were then arranged on an affinity diagram, which allowed us to organize the CI output into categories and uncover insights relevant to our research questions.
Findings
We then organized the findings from both the literature review and the contextual inquiry in relation to the four research questions.
RQ 1 - What are the challenges when evaluating data from large datasets?
Scalability & Navigation Issues: Some tools offer strong filtering and navigation for large datasets; others are limited.
Reliability at Scale: Few tools emphasize reliability analysis features to assess consistency across large volumes of data.
Aggregation Techniques: Some tools handle complex, multi-dimensional data well; others don’t present results effectively.
RQ 2 - What are the challenges when evaluating LLM outputs against diverse, user-provided criteria?
Customizable Criteria: Some tools allow defining custom evaluation dimensions while others rely on fixed criteria.
Criteria Selection: Some tools allow selection from a pre-defined library, but it’s unclear how well these suggestions align with business goals.
Iterative Refinement: Some tools support refining criteria based on feedback, enabling more relevant and context-specific evaluation frameworks.
RQ 3 - How can aggregated results from multi-dimensional LLM output evaluations be effectively visualized and interpreted?
Clearer Metrics: Users find percentage scores vague and want more interpretable measures, like confidence intervals and clearer labels (a small sketch follows this list).
High Level Overviews: Users like visual summaries that highlight key insights, outliers, and similar results.
Diverse Visual Formats: Users want various visualization types—like side-by-side comparisons and trend tracking—to grasp complex data better.
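As a small illustration of the "clearer metrics" point, the sketch below turns a criterion's raw pass rate into a 95% confidence interval using a normal approximation. The scores and the pass threshold are made up for the example and are not data from our study.

```python
# Hypothetical example: report a criterion's pass rate with a 95% confidence
# interval instead of a bare percentage. Scores and the pass threshold (>= 4
# on a 1-5 scale) are illustrative assumptions.
import math

def pass_rate_with_ci(scores: list[int], threshold: int = 4, z: float = 1.96):
    """Return (pass_rate, lower, upper) using a normal-approximation interval."""
    n = len(scores)
    p = sum(s >= threshold for s in scores) / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - margin), min(1.0, p + margin)

clarity_scores = [5, 4, 4, 3, 5, 2, 4, 4, 5, 3]  # made-up judge scores
p, low, high = pass_rate_with_ci(clarity_scores)
print(f"Clarity pass rate: {p:.0%} (95% CI {low:.0%}-{high:.0%}, n={len(clarity_scores)})")
```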
RQ 4 - What key features in large-scale LLM output evaluation tools would enhance result reliability and transparency?
Reliability & Consistency Checks: Some tools support reliability analysis, but users need clearer ways to confirm results aren't random (see the sketch after this list).
Historical Performance Tracking: Monitoring trends over time helps users understand model changes and trustworthiness.
Automated Insight Generation: Some tools offer automated summaries, though their effect on trust and clarity remains uncertain.
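To illustrate the reliability point, one simple check is to run the judge more than once on the same items and report how often the repeated verdicts agree. The sketch below assumes scores from two hypothetical judging runs are already available; the numbers are invented for the example.

```python
# Hypothetical consistency check: compare two judging runs over the same items
# and report exact-agreement and within-one-point agreement rates.
run_1 = [5, 4, 4, 3, 5, 2, 4]  # made-up scores from the first judging pass
run_2 = [5, 4, 3, 3, 5, 2, 5]  # made-up scores from a repeated pass

pairs = list(zip(run_1, run_2))
exact = sum(a == b for a, b in pairs) / len(pairs)
within_one = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)

print(f"Exact agreement: {exact:.0%}")            # identical scores on both runs
print(f"Agreement within 1 point: {within_one:.0%}")
```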
These insights were then condensed into four user needs, which formed the basis of our approach to designing Judge LLM.
Design
The design activities following research involved:
- Condensing research insights into design implications.
- Developing initial wireframes for the co-design activity.
- Conducting the co-design activity and iterating on the wireframes.
- Developing mock-ups using the IBM Carbon Design System.
- Developing the prototype.
These were planned over roughly four months, from January to April 2025:
- Design Implications
- Wire-framing
- Co-Design Sessions & Iterations
- Mockups & Prototype Development
User Needs & Design Implications
Our research highlighted four user needs, each of which branched out into two design implications, giving eight design implications in total.
User Need 1: Users want to fluidly explore massive datasets with advanced filtering and drill-downs.
- Implication 1: Cluster and filter results by shared characteristics, then let users focus on meaningful subsets.
- Implication 2: Offer side-by-side or layered hierarchical views so users can jump from a bird's-eye overview into detailed per-item inspection.
User Need 2: Users need multi-level aggregated insights to quickly spot patterns, outliers, or trends.
- Implication 3: Let users view summarized results with statistical datapoints like mean, median, and standard deviation (see the sketch below).
- Implication 4: Provide charts and tables that let users easily toggle between high-level statistics and more detailed breakdowns.
User Need 3: The solution should support flexible, customizable evaluation criteria that align with evaluation goals and context.
- Implication 5: Let users define, prioritize, and refine criteria with workspace tools and optional LLM-based suggestions.
- Implication 6: Ensure criteria align with external policies through prompted reviews and stakeholder cross-checks.
User Need 4: Users value clear, interactive visuals and transparent scoring processes that foster trust and confidence.
- Implication 7: Blend visualizations with quantitative and qualitative feedback to offer meaningful insights.
- Implication 8: Reveal score calculations, data sources, and limitations to build trust and understanding.
Given the time constraints, and drawing on the insights and feedback from the co-design activity, we addressed seven of the eight design implications; the exception was design implication 8.
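As a rough sketch of design implications 1 through 4 (clustering and filtering, plus multi-level summaries), the snippet below aggregates per-item judge scores into per-criterion summary statistics and then drills down into one low-scoring subset. The table layout and column names are assumptions for illustration, not Judge LLM's schema.

```python
# Hypothetical aggregation and drill-down over judge scores using pandas.
# Column names (criterion, score, topic) are illustrative, not the tool's data model.
import pandas as pd

results = pd.DataFrame({
    "response_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "topic":       ["refund"] * 3 + ["baggage"] * 3 + ["refund"] * 3,
    "criterion":   ["Empathy", "Politeness", "Clarity"] * 3,
    "score":       [4, 5, 3, 2, 4, 4, 5, 5, 2],
})

# High-level view: summary statistics per criterion.
summary = results.groupby("criterion")["score"].agg(["mean", "median", "std", "count"])
print(summary)

# Drill-down: focus on a meaningful subset, e.g. low Clarity scores by topic.
low_clarity = results[(results["criterion"] == "Clarity") & (results["score"] <= 3)]
print(low_clarity.groupby("topic")["score"].agg(["mean", "count"]))
```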
User Flow
Our user flow was inspired by a data pipeline similar to the tool described in this paper. We plotted the user flow, and it looked like this:
Part 1: Setting Up Evaluation
The user uploads the dataset to be evaluated and enters the LLM API key to use for the evaluation.

Hierarchical Task Analysis
A more detailed series of steps can be understood through the following hierarchical task analysis:
1. Start Evaluation: enter the evaluation title, enter the evaluation goal, select the LLM model, and enter the API key.
2. Start Criteria Definition: enter criteria manually, choose AI-suggested criteria, or use the AI criteria library.
3. Refine Criteria: refine using AI or edit manually.
4. Run Evaluation.
5. View Results.
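To ground the task analysis, here is a minimal sketch of the setup information the flow collects before an evaluation can run; the class and field names are hypothetical, not Judge LLM's actual data model.

```python
# Hypothetical configuration object mirroring the hierarchical task analysis:
# title, goal, model choice, API key, and a set of criteria gathered before
# the evaluation is run. All names are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class EvaluationSetup:
    title: str
    goal: str                       # gives the judge LLM context for suggesting criteria
    model: str                      # identifier of the evaluating LLM
    api_key: str
    criteria: dict[str, str] = field(default_factory=dict)  # name -> description

    def ready(self) -> bool:
        """Evaluation can only start once at least one criterion is defined."""
        return bool(self.criteria)

setup = EvaluationSetup(
    title="Airline chatbot QA",
    goal="Assess customer-service responses before launch",
    model="example-judge-model",
    api_key="sk-...",  # placeholder
)
setup.criteria["Clarity"] = "Information presented in an easily understandable manner."
assert setup.ready()
```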
Co-Design Activity
Co-design sessions were carried out with three engineers and one PhD scholar in the field of ML. We explained the context to the participants and gave them a wireframe walkthrough, followed by a discussion of which features and visualizations would be helpful for evaluation.
The wireframe design was thus quite fluid and kept changing after each session; for the last two sessions we incorporated the IBM Carbon Design System, after participants noted a need for color schemes. In this way, our wireframes evolved into high-fidelity mockups.
IBM Carbon Design System
We adopted the IBM Carbon Design System, an open-source design system by IBM. Annotated elements of the mockups include:
- File options such as saving and renaming.
- Defining the evaluation goal, which gives the LLM model context for suggesting criteria.
- Uploading the dataset to evaluate.
- Selecting the evaluating LLM model and entering the API key to use it.
Prototype
Design & Testing
After further contemplation and re-design of the results dashboard in Figma, we began creating and testing a functional prototype.
The prototype was created using React and was tested with three UX designers and one ML professional against standard usability heuristics.
Following is a video of the prototype demo from the presentation that we gave for our keystone project.
Reflection
This project opened my eyes to the world of LLM evaluation and what it takes to build robust and reliable language models. I am a language enthusiast, always learning new languages and interacting with people. My curiosity knew no bounds when LLMs became a thing, and through this project, I have learned a lot about how we can make computers better at conversation.
Collaborating with IBM was an amazing experience; I gained many insights into how to approach research, how to choose methods, how research insights translate into design initiatives, and how to make the most of user tests.
I am especially proud of this project because our research was selected for the Human-centered Evaluation and Auditing of Language Models workshop at ACM CHI 2025 in Japan. Working in a team of two was an interesting experience where I learned a lot about sharing responsibility and leveraging AI for efficiency. You can read the paper we submitted using the link below.