
Judge LLM
We collaborated with IBM to study the domain of LLM evals and discovered that professionals evaluating LLMs need a more effective way to assess extensive sets of outputs across multiple criteria and dimensions. Judge LLM aims to provide a flexible, scalable and easily comprehensible interface for enabling this process.
Role: Design & Research
Team: 2
Themes: Industry Project, Large Language Models, Machine Learning, Comparative Analysis, Contextual Inquiry, Co-Design, Functional Prototyping
Timespan: 8 months
Research
Problem Space
To make the project scope easy to follow, here are definitions of terms that will be used repeatedly throughout this case study.
LLM : Large Language Model, an AI model that specializes in natural language.
Dataset : A tabulated collection of LLM-generated text responses.
Criteria : Qualitative dimensions used to judge any piece of text, such as clarity.
Evaluation : The process of judging each response in the dataset against the qualitative criteria.
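To make these terms concrete, here is a minimal, hypothetical sketch of how a dataset row and a criterion might be represented in code; the class and field names are illustrative assumptions, not the actual schema used by Judge LLM.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    """A qualitative dimension used to judge a piece of text."""
    name: str         # e.g. "Clarity"
    description: str  # what the judge LLM should look for

@dataclass
class DatasetRow:
    """One LLM-generated response to be evaluated."""
    prompt: str    # the input that produced the response
    response: str  # the LLM-generated text being judged

# Illustrative example values (not real project data)
clarity = Criterion(
    name="Clarity",
    description="Information is presented in an easily understandable manner.",
)
row = DatasetRow(
    prompt="My flight was cancelled. What are my options?",
    response="I'm sorry about the cancellation. You can rebook for free or request a refund.",
)
```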
Problem Statement
Professionals* evaluating LLMs need a more effective way to assess extensive sets of outputs across multiple criteria and dimensions. Current evaluation tools struggle to efficiently handle large-scale assessments and lack the capability to meaningfully aggregate results from diverse criteria. These limitations hinder the ability to gain comprehensive insights and make informed decisions about model performance. An improved approach is needed to address both the scale of evaluation and the complexity of synthesizing multi-dimensional results.
*Professionals = AI Developers, Data Scientists, Consultants

After probing the problem space with this problem statement in mind, we arrived at a 'How Might We?' statement that condensed it into a task-oriented question from which our research questions could be derived.
How Might We?
How might we help professionals* efficiently evaluate large sets of language model outputs across multiple criteria and dimensions by providing a system that both handles large-scale assessments and effectively aggregates complex, multi-faceted results?
*Professionals = AI developers, Data scientists, Consultants...
Example of a Use Case
You are an AI developer at an airline company preparing to release a new customer service chatbot powered by an LLM. Before letting customers use it, you need to evaluate a dataset of responses generated by your LLM chatbot during testing.
This evaluation helps you identify patterns of issues and improve the chatbot before real customers interact with the system. So you decide to evaluate the outputs based on the following qualitative criteria.
Empathy: Connecting with the customer's particular situation.
Politeness: Respectful language conveying courtesy throughout the interaction.
Clarity: Information presented in an easily understandable manner.
What are your options? How will you go about this evaluation? You can choose from these:
1. Go through all responses manually, or hire people to do so. This way of evaluating is very resource- and time-intensive.
2. Develop programmatic evaluation frameworks. This method requires technical skills and expertise.
3. Use a tool that allows you to define the criteria and upload the dataset to be evaluated by another LLM (sketched below). This method is:
- Less time-intensive
- Less resource-intensive
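As a rough illustration of the third option, the sketch below shows an "LLM-as-a-judge" loop: each response is sent to an evaluating model together with one criterion, and a score is parsed from the reply. This is a generic, hypothetical sketch, using the openai client as an example backend and a made-up 1–5 scale; it is not the pipeline used by Judge LLM or IBM's tooling.

```python
# Hypothetical LLM-as-a-judge loop: score each response against each criterion.
# Assumes an OpenAI-compatible API; the prompt format and 1-5 scale are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = (
    "You are evaluating a customer-service chatbot response.\n"
    "Criterion: {name} - {description}\n"
    "Response to evaluate:\n{response}\n\n"
    "Rate the response on this criterion from 1 (poor) to 5 (excellent). "
    "Reply with the number only."
)

def judge(response_text: str, name: str, description: str) -> int:
    """Ask the judge model for a 1-5 score on a single criterion."""
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # example judge model, not the project's choice
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            name=name, description=description, response=response_text)}],
    )
    return int(reply.choices[0].message.content.strip())

criteria = {
    "Empathy": "Connects with the customer's particular situation.",
    "Politeness": "Respectful language conveying courtesy throughout the interaction.",
    "Clarity": "Information presented in an easily understandable manner.",
}
responses = ["I'm sorry your flight was delayed. You can rebook at no charge."]

scores = {
    (i, name): judge(text, name, desc)
    for i, text in enumerate(responses)
    for name, desc in criteria.items()
}
print(scores)
```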
As organizations increasingly depend on LLMs, tools that let users define criteria and upload a dataset to be evaluated by another LLM are becoming popular and useful. We then formulated research questions to uncover more about current tools and their shortcomings.
Research Questions
1. What are the challenges when evaluating data from large datasets?
2. What are the challenges when evaluating LLM outputs against diverse, user-provided criteria?
3. How can aggregated results from multi-dimensional LLM output evaluations be effectively visualized and interpreted?
4. What key features in large-scale LLM output evaluation tools would enhance result reliability and transparency?
To answer these research questions, we planned a thorough literature review, to establish a solid understanding of the problem space and existing tools, followed by a contextual inquiry using a tool that stood out to us. Our research plan spanned four months, from August to November 2024:
Literature Review
- Crucial for getting acquainted with the problem space.
- Understanding the popular tools used for Large Language Model evaluations.
- Followed by a comparative analysis of the most popular tools.
Contextual Inquiry
- Offered rich, qualitative insights into real-world evaluation environments.
- Observations revealed subtle pain points and unmet needs when conducting Large Language Model evaluations.
- These observations were followed by a discussion.
The research was structured to produce findings that correspond to our research questions; the two broader research activities, the literature review and the contextual inquiry, each had segments that led us toward our research goals.

Literature Review Insights
We studied papers written by the creators of evaluation tools around the world and did a task analysis for different tools. Learning about popular tools and their limitations and affordances helped us create a comparative matrix outlining the features and their usefulness across tools. We also created a feature matrix that compared the tools in terms of which of them provide features relevant to our research questions.
Comparative Matrix
We built the comparative matrix around the following criteria for comparison:
Scalability: Ability to spot trends across large sets of outputs efficiently.
Usability: User-friendliness and visual quality of the tool's interface.
Multi-Criteria Support: Ability to assess outputs across multiple evaluation criteria and dimensions.
Customization: Ability to adapt to different evaluation needs.
Aggregate Mechanisms: Ability to meaningfully aggregate results from diverse criteria.
Integration: Compatibility with existing workflows.
Task Analysis
We plotted task analysis diagrams for some tools that we studied. This helped us greatly in understanding the data pipeline of IBM's tool and inspired the hierarchical task analysis of Judge LLM.
Feature Matrix
The feature matrix compared several tools based on whether they provide specific features.
Legend: feature is supported; feature is partially supported or requires manual effort; feature is not supported (blank).
Features compared: User Interface, Criteria Customization, AI-Generated Criteria, Compatibility with Large Datasets, Interactive Criteria Refinement, Filtering & Exploration, Quantitative & Qualitative Analysis, Performance Tracking, Reliability Analysis.
Tools compared: EvalLM, Constitution Maker, LLM Comparator, EvalGen, Deepchecks.
This comparative analysis allowed us to identify an accessible tool, with features aligned with our research questions, to use for contextual inquiry. EvalLM, an open-source LLM evaluation tool developed by researchers at KAIST, was selected. The tool offers flexible criteria management and returns a combination of qualitative and quantitative evaluation results; although it does not handle large datasets, using it gave us many insights.
Contextual Inquiry
We recruited professionals working with LLMs and evals, as well as PhD scholars researching the field, to take part in our contextual inquiry. The process involved introducing each participant to our problem space and topic area, after which they were given a scenario that required them to carry out an evaluation using EvalLM.
Participants were asked to think aloud as they carried out the task and were assisted with the tool only when absolutely required. This was followed by a discussion about the tool and ideas for how it could have been better.
We conducted six contextual inquiries. In an interpretation session afterwards, the notes and quotes from the six sessions were put into a spreadsheet for refinement and further insight generation.
The refined quotes, notes, and insights were then arranged on an affinity diagram, which allowed us to organize the CI output into categories and uncover insights relevant to our research questions.
Findings
We then organized the findings from both the literature review and the contextual inquiry in relation to the four research questions.
RQ 1 - What are the challenges when evaluating data from large datasets?
Scalability & Navigation Issues: Some tools offer strong filtering and navigation for large datasets; others are limited.
Reliability at Scale: Few tools emphasize reliability analysis features to assess consistency across large volumes of data.
Aggregation Techniques: Some tools handle complex, multi-dimensional data well; others don’t present results effectively.
RQ 2 - What are the challenges when evaluating LLM outputs against diverse, user-provided criteria?
Customizable Criteria: Some tools allow defining custom evaluation dimensions while others rely on fixed criteria.
Criteria Selection: Some tools allow selection from a pre-defined library, but it’s unclear how well these suggestions align with business goals.
Iterative Refinement: Some tools support refining criteria based on feedback, enabling more relevant and context-specific evaluation frameworks.
RQ 3 - How can aggregated results from multi-dimensional LLM output evaluations be effectively visualized and interpreted?
Clearer Metrics: Users find percentage scores vague and want more interpretable measures, like confidence intervals and clearer labels (a small sketch follows this list).
High Level Overviews: Users like visual summaries that highlight key insights, outliers, and similar results.
Diverse Visual Formats: Users want various visualization types—like side-by-side comparisons and trend tracking—to grasp complex data better.
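As a small illustration of the "clearer metrics" point, the sketch below turns a criterion's raw pass rate into a 95% confidence interval using a normal approximation. The scores and the pass threshold are made up for the example and are not data from our study.

```python
# Hypothetical example: report a criterion's pass rate with a 95% confidence
# interval instead of a bare percentage. Scores and the pass threshold (>= 4
# on a 1-5 scale) are illustrative assumptions.
import math

def pass_rate_with_ci(scores: list[int], threshold: int = 4, z: float = 1.96):
    """Return (pass_rate, lower, upper) using a normal-approximation interval."""
    n = len(scores)
    p = sum(s >= threshold for s in scores) / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - margin), min(1.0, p + margin)

clarity_scores = [5, 4, 4, 3, 5, 2, 4, 4, 5, 3]  # made-up judge scores
p, low, high = pass_rate_with_ci(clarity_scores)
print(f"Clarity pass rate: {p:.0%} (95% CI {low:.0%}-{high:.0%}, n={len(clarity_scores)})")
```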
RQ 4 - What key features in large-scale LLM output evaluation tools would enhance result reliability and transparency?
Reliability & Consistency Checks: Some tools support reliability analysis, but users need clearer ways to confirm results aren't random (see the sketch after this list).
Historical Performance Tracking: Monitoring trends over time helps users understand model changes and trustworthiness.
Automated Insight Generation: Some tools offer automated summaries, though their effect on trust and clarity remains uncertain.
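To illustrate the reliability point, one simple check is to run the judge more than once on the same items and report how often the repeated verdicts agree. The sketch below assumes scores from two hypothetical judging runs are already available; the numbers are invented for the example.

```python
# Hypothetical consistency check: compare two judging runs over the same items
# and report exact-agreement and within-one-point agreement rates.
run_1 = [5, 4, 4, 3, 5, 2, 4]  # made-up scores from the first judging pass
run_2 = [5, 4, 3, 3, 5, 2, 5]  # made-up scores from a repeated pass

pairs = list(zip(run_1, run_2))
exact = sum(a == b for a, b in pairs) / len(pairs)
within_one = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)

print(f"Exact agreement: {exact:.0%}")            # identical scores on both runs
print(f"Agreement within 1 point: {within_one:.0%}")
```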
These insights were then condensed into four user needs, which formed the basis of our approach to designing Judge LLM.
Design
The design activities following research involved:
- Condensing research insights into design implications.
- Developing initial wireframes for the co-design activity.
- Conducting the co-design activity and iterating on the wireframes.
- Developing mock-ups using the IBM Carbon Design System.
- Developing the prototype.
These were planned over roughly four months, from January to April 2025:
- Design Implications
- Wire-framing
- Co-Design Sessions & Iterations
- Mockups & Prototype Development
User Needs & Design Implications
Our research highlighted four user needs, each of which branched out into two design implications, giving eight design implications in total.
User Need 1: Users want to fluidly explore massive datasets with advanced filtering and drill-downs.
- Implication 1: Cluster and filter results by shared characteristics, then let users focus on meaningful subsets.
- Implication 2: Offer side-by-side or layered hierarchical views so users can jump from a bird's-eye overview into detailed per-item inspection.
User Need 2: Users need multi-level aggregated insights to quickly spot patterns, outliers, or trends.
- Implication 3: Let users view summarized results with statistical datapoints like mean, median, and standard deviation (see the sketch below).
- Implication 4: Provide charts and tables that let users easily toggle between high-level statistics and more detailed breakdowns.
User Need 3: The solution should support flexible, customizable evaluation criteria that align with evaluation goals and context.
- Implication 5: Let users define, prioritize, and refine criteria with workspace tools and optional LLM-based suggestions.
- Implication 6: Ensure criteria align with external policies through prompted reviews and stakeholder cross-checks.
User Need 4: Users value clear, interactive visuals and transparent scoring processes that foster trust and confidence.
- Implication 7: Blend visualizations with quantitative and qualitative feedback to offer meaningful insights.
- Implication 8: Reveal score calculations, data sources, and limitations to build trust and understanding.
Given the time constraints, and drawing on the insights and feedback from the co-design activity, we addressed seven of the eight design implications; the exception was design implication 8.
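As a rough sketch of design implications 1 through 4 (clustering and filtering, plus multi-level summaries), the snippet below aggregates per-item judge scores into per-criterion summary statistics and then drills down into one low-scoring subset. The table layout and column names are assumptions for illustration, not Judge LLM's schema.

```python
# Hypothetical aggregation and drill-down over judge scores using pandas.
# Column names (criterion, score, topic) are illustrative, not the tool's data model.
import pandas as pd

results = pd.DataFrame({
    "response_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "topic":       ["refund"] * 3 + ["baggage"] * 3 + ["refund"] * 3,
    "criterion":   ["Empathy", "Politeness", "Clarity"] * 3,
    "score":       [4, 5, 3, 2, 4, 4, 5, 5, 2],
})

# High-level view: summary statistics per criterion.
summary = results.groupby("criterion")["score"].agg(["mean", "median", "std", "count"])
print(summary)

# Drill-down: focus on a meaningful subset, e.g. low Clarity scores by topic.
low_clarity = results[(results["criterion"] == "Clarity") & (results["score"] <= 3)]
print(low_clarity.groupby("topic")["score"].agg(["mean", "count"]))
```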
User Flow
Our user flow was inspired by a data pipeline similar to the tool described in this paper. We plotted the user flow, and it looked like this:
Part 1: Setting Up Evaluation
The user uploads the dataset to be evaluated and enters the LLM API key to use for the evaluation.

Hierarchical Task Analysis
A more detailed series of steps can be understood through the following hierarchical task analysis:
1. Start Evaluation: enter the evaluation title, enter the evaluation goal, select the LLM model, and enter the API key.
2. Start Criteria Definition: enter criteria manually, choose AI-suggested criteria, or use the AI criteria library.
3. Refine Criteria: refine using AI or edit manually.
4. Run Evaluation.
5. View Results.
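To ground the task analysis, here is a minimal sketch of the setup information the flow collects before an evaluation can run; the class and field names are hypothetical, not Judge LLM's actual data model.

```python
# Hypothetical configuration object mirroring the hierarchical task analysis:
# title, goal, model choice, API key, and a set of criteria gathered before
# the evaluation is run. All names are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class EvaluationSetup:
    title: str
    goal: str                       # gives the judge LLM context for suggesting criteria
    model: str                      # identifier of the evaluating LLM
    api_key: str
    criteria: dict[str, str] = field(default_factory=dict)  # name -> description

    def ready(self) -> bool:
        """Evaluation can only start once at least one criterion is defined."""
        return bool(self.criteria)

setup = EvaluationSetup(
    title="Airline chatbot QA",
    goal="Assess customer-service responses before launch",
    model="example-judge-model",
    api_key="sk-...",  # placeholder
)
setup.criteria["Clarity"] = "Information presented in an easily understandable manner."
assert setup.ready()
```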
Co-Design Activity
Co-design sessions were carried out with three engineers and one PhD scholar in the field of ML. We explained the context to the participants and gave them a wireframe walkthrough, followed by a discussion of which features and visualizations would be helpful for evaluation.
The wireframe design was thus quite fluid and kept changing after each session; for the last two sessions we incorporated the IBM Carbon Design System, after participants noted a need for color schemes. In this way, our wireframes evolved into high-fidelity mockups.
IBM Carbon Design System
We adopted the IBM Carbon Design System, an open-source design system by IBM. Annotated elements of the mockups include:
- File options such as saving and renaming.
- Defining the evaluation goal, which gives the LLM model context for suggesting criteria.
- Uploading the dataset to evaluate.
- Selecting the evaluating LLM model and entering the API key to use it.
Prototype
Design & Testing
After further contemplation and re-design of the results dashboard in Figma, we began creating and testing a functional prototype.
The prototype was created using React and was tested with three UX designers and one ML professional against standard usability heuristics.
Following is a video of the prototype demo from the presentation that we gave for our keystone project.
Reflection
This project opened my eyes to the world of LLM evaluation and what it takes to build robust and reliable language models. I am a language enthusiast, always learning new languages and interacting with people. My curiosity knew no bounds when LLMs became a thing, and through this project, I have learned a lot about how we can make computers better at conversation.
Collaborating with IBM was an amazing experience; I gained many insights into how to approach research, how to choose methods, how research insights translate into design initiatives, and how to make the most of user tests.
I am especially proud of this project because our research was selected for the Human-centered Evaluation and Auditing of Language Models workshop at ACM CHI 2025 in Japan. Working in a team of two was an interesting experience where I learned a lot about sharing responsibility and leveraging AI for efficiency. You can read the paper we submitted using the link below.