Create a tunable scoring system that can combine natural language and code-based criteria for your AI application
1. Work with Pi's copilot to build your scoring system
2. Use Pi's scoring system to evaluate anything across your stack
Why score with Pi?
We turn your evaluations into precise, user-calibrated, and cost-effective signals for use anywhere in your stack.
Transform data into metrics
Not sure what to measure? Pi figures it out for you. Feed it any or all of your prompts, PRDs, and user feedback, or just sit down and chat with it, and it will help you find the best-calibrated metrics for your application.
Quick, deterministic scores
Our foundation model, Pi Scorer, scores more accurately than DeepSeek and GPT-4.1 but runs at the size and speed of GPT Mini and Gemini Flash. You can score 20+ custom dimensions in under 100 ms; it’s that fast.
Framework agnostic
A single Pi Scorer can be used in every part of your AI stack and existing tools: offline evals, online observability, training data quality, model optimization, agent control flows, and more. Easily plug Pi into Google Sheets, Promptfoo, CrewAI, or any other tool you might be using.
A foundation model designed for scoring
We train our models to understand principles, not mimic content. We continuously monitor performance to improve quality with each release.
Aligned with your users & experts.
The best metrics align with human judgment. You can continuously improve your Pi
scoring system by calibrating it on your own labels, preferences, and user data,
adjusting to match your team's expertise and actual user behavior in a virtuous
feedback loop.
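One simple form of calibration can be sketched in plain Python: given scores and human thumbs-up labels for the same examples, pick the pass/fail threshold that agrees with the labels most often. This is an illustrative sketch only, not Pi's calibration method; the function names and the sample data are hypothetical.

```python
from typing import Sequence, Tuple


def agreement(scores: Sequence[float], labels: Sequence[bool], threshold: float) -> float:
    """Fraction of examples where (score >= threshold) matches the human label."""
    matches = sum((s >= threshold) == y for s, y in zip(scores, labels))
    return matches / len(scores)


def calibrate_threshold(scores: Sequence[float], labels: Sequence[bool]) -> Tuple[float, float]:
    """Return the candidate threshold with the highest agreement, plus that agreement."""
    if not scores or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and the same length")
    # Every observed score is a candidate cut-off.
    candidates = sorted(set(scores))
    best = max(candidates, key=lambda t: agreement(scores, labels, t))
    return best, agreement(scores, labels, best)


# Hypothetical data: scorer outputs and matching human thumbs-up labels.
scores = [0.20, 0.35, 0.55, 0.60, 0.80, 0.90]
labels = [False, False, False, True, True, True]

threshold, acc = calibrate_threshold(scores, labels)
print(threshold, acc)
```

As more labels arrive, rerunning the same selection keeps the threshold in step with how your users actually judge outputs.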
Fully captures correctness and taste.
Pi’s scoring system combines soft measures like natural language quality, hard
measures like code correctness, and trained measures like thumbs-up prediction. This
comprehensiveness gives you the highest quality evals, reward models, ranking
functions, and agent decision nodes.
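The idea of mixing soft and hard measures in one score can be sketched as a weighted average over criterion functions. This is a minimal illustration, not Pi's implementation: the `Criterion` class, the weights, and both checks are hypothetical, and a keyword heuristic stands in for what would really be a model's natural language judgment.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Criterion:
    name: str
    weight: float
    score_fn: Callable[[str], float]  # returns a value in [0, 1]


def has_call_to_action(text: str) -> float:
    """Stand-in for a soft, natural language judgment (keyword heuristic only)."""
    keywords = ("today", "now", "try", "sign up")
    return 1.0 if any(k in text.lower() for k in keywords) else 0.0


def within_length(text: str, limit: int = 80) -> float:
    """Hard, code-based check: is the output within a character limit?"""
    return 1.0 if len(text) <= limit else 0.0


def total_score(text: str, criteria: List[Criterion]) -> float:
    """Weighted average of the individual criterion scores."""
    weight_sum = sum(c.weight for c in criteria)
    if weight_sum == 0:
        raise ValueError("at least one criterion needs a non-zero weight")
    return sum(c.weight * c.score_fn(text) for c in criteria) / weight_sum


criteria = [
    Criterion("call_to_action", weight=2.0, score_fn=has_call_to_action),
    Criterion("length", weight=1.0, score_fn=within_length),
]

print(total_score("Score anything with Pi Labs today!", criteria))
```

Keeping each criterion as its own function means a weight can be tuned, or a check swapped out, without touching the rest of the system.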
5x cheaper than LLM judges.
Matching the performance of a large model at a smaller size means you can afford to
measure all that matters to you without running up a massive bill. You can reinvest your
savings to measure even more dimensions, more frequently, across your workflows.
Built by veterans of Google Search
For decades, we've transformed breakthrough research into industry-leading AI and search engines. Today, we're making that expertise accessible to you.
David Karam
Founder & CEO
Previously, as Director of Product at Google, David led a product management team working alongside a 200+ engineer organization to develop AI, LLM, and search platforms, collaborating across teams in Search, Shopping, and Geo to drive innovation in search products.
Achint Srivastava
Founder & CTO
Prior to Pi Labs, Achint was a Principal Software Engineer at Google, leading the technical vision for a 250+ person team. Achint conceptualized and built AI and Search platforms, including the GenAI platforms that power features like Search Generative Experience and Google Cloud Search.
from withpi import PiClient

pi = PiClient()

# Score one input/output pair against a natural language question.
scores = pi.scoring_system.score(
    llm_input="Pi Labs",
    llm_output="Score anything with Pi Labs today!",
    scoring_spec=[{"question": "Is there a strong call to action?"}],
)

# Print the overall score for the output.
print(scores.total_score)