Large Language Models excel at intuition—the ability to understand and generate content without explicit reasoning. They're masters at "filling in the gaps," predicting the next token based on context with remarkable accuracy. This intuitive capability allows them to tackle complex problems and even ace challenging exams.
However, intuition alone has its limits, especially for tasks with multiple unique or conflicting constraints, the kind of scenarios that sit "out of distribution" relative to typical training data.
Consider this example where we want to generate travel itineraries with specific requirements:
prompt = """
Generate interesting trip plans that are short, but also descriptive.
"""
The LLM might produce responses like this:
Portland Weekend Plan: "Explore indie bookstores with coffee in hand, then chase food carts and hidden speakeasies through the drizzle. End your night with vinyl playing in a candlelit bar as the city glows softly through misted windows."
While atmospheric, this example lacks the actionable details we need. It's vague about specific locations, timings, or concrete activities—failing to meet our "descriptive" constraint despite being appropriately "short."
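The rest of this post assumes a generate_trip_plan helper that turns the prompt into a draft like the one above. Here's a minimal sketch, assuming an OpenAI-style chat client; the model name and temperature are placeholders, so substitute whatever provider and settings you actually use:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_trip_plan(prompt: str) -> str:
    """Ask the model for a single trip plan draft."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,  # keep sampling varied so retries produce different drafts
    )
    return response.choices[0].message.content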
To achieve consistent, high-quality outputs that adhere to all constraints, we need deterministic evaluation. This is where principled scoring comes in.
Using Pi's scoring system, we can create precise evaluation criteria:
from withpi import PiClient
# Initialize Pi client
pi = PiClient()
def score_trip_plan(llm_input, llm_output) -> float:
    scoring_spec = [
        {
            "label": "Engaging Content",
            "question": "Does the trip plan include engaging descriptions of destinations or activities?",
            "weight": 0.5
        },
        {
            "label": "Destination Highlights",
            "question": "Does the trip plan highlight unique or interesting aspects of the destinations?",
            "weight": 0.5
        },
        {
            "label": "Actionable Details",
            "question": "Does the trip plan provide actionable details such as timings, locations, or costs?",
            "weight": 0.5
        },
        {
            "label": "Activity Variety",
            "question": "Does the trip plan include a variety of activities to cater to different interests?",
            "weight": 0.3
        },
        {
            "label": "Cultural Insights",
            "question": "Does the trip plan provide cultural insights or tips for the destinations?",
            "weight": 0.3
        },
        {
            "label": "Concise Length",
            "question": "Is the trip plan concise and free from unnecessary details?",
            "weight": 0.1
        }
    ]
    return pi.scoring_system.score(
        llm_input=llm_input,
        llm_output=llm_output,
        scoring_spec=scoring_spec
    ).total_score
Using Pi, we can generate a scorer tailored to our use case. Each question is evaluated quickly and accurately by Pi's foundation model scorer.
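As a quick sanity check, we can run the scorer on the Portland example from earlier. This is just a sketch; the numeric result comes entirely from Pi's scorer, not from anything hard-coded here:

portland_plan = (
    "Explore indie bookstores with coffee in hand, then chase food carts and "
    "hidden speakeasies through the drizzle. End your night with vinyl playing "
    "in a candlelit bar as the city glows softly through misted windows."
)

score = score_trip_plan(
    llm_input="Generate interesting trip plans that are short, but also descriptive.",
    llm_output=portland_plan,
)
print(f"Total score: {score:.2f}")  # likely penalized on Actionable Details, per the critique above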
Now we can combine intuitive generation with principled evaluation:
def get_high_quality_trip_plan(prompt: str, quality_threshold: float = 0.7) -> str:
    """Generate trip plans until one meets our quality standards."""
    max_attempts = 10  # Prevent infinite loops
    best_plan, best_score = "", float("-inf")
    for _ in range(max_attempts):
        plan = generate_trip_plan(prompt)
        score = score_trip_plan(prompt, plan)
        if score >= quality_threshold:
            return plan
        # Remember the strongest candidate seen so far
        if score > best_score:
            best_plan, best_score = plan, score
    # Fallback: return the best attempt if the threshold is never met
    return best_plan
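Putting the pieces together is then a single call; this sketch simply reuses the prompt and helpers defined above:

prompt = """
Generate interesting trip plans that are short, but also descriptive.
"""

plan = get_high_quality_trip_plan(prompt, quality_threshold=0.7)
print(plan)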
This approach transforms unreliable "vibing" into systematic quality assurance. Instead of hoping the LLM gets it right, we:
- Generate candidates using the model's intuitive strengths.
- Score each candidate against explicit, weighted criteria.
- Accept only outputs that clear the quality threshold, retrying (up to a cap) when they don't.
This pattern applies to any LLM application with specific quality requirements, not just travel itineraries.
The future of LLM applications isn't just about better prompts—it's about combining the intuitive power of large models with principled evaluation systems. By building applications that generate iteratively and score deterministically, we can achieve the consistency and reliability that production systems demand.
When intuition meets principles, that's where the real magic happens.