Filtering With a Scorer: What Worked and What I Learned

LLMs are great at brainstorming, but they can also easily generate a lot of undesirable content, especially when you feed them multiple conflicting instructions at once.


For example, if I’m brainstorming fantasy words for a writing project, I may give an LLM the prompt:


"Generate a list of fake words similar to līyap, izhingau, aukdje, dūhjūroi, vakernā, āgū, īwæv, sabajtsū, sabah, zhahxeu, toiājagej, fū’ai’, imaudet, kūthwohi, tau’au', ngī!awa, i!voth"


The LLM will generally output something along the lines of:

"krellā, zythos, ooblei, dree’va, vornak, ēthū, jix'ar, florgath, shreep, quux'ai"


This mostly works if you’re generating a small handful of words. But scale it up, and things get really inconvenient really fast.

You end up wading through dozens of examples that don't fit your vision. In my case, that meant words that barely differed from my examples, repeats of past words, and words that didn’t even remotely fit the patterns I was looking for.

A much better prompt might be something like:


" Generate a list of words for my fantasy language.

The list should be a range of different length words.

There should be no repeat words and no words that are too close to any of the provided examples.

The words should follow these rules:

1.  They should fit the feel of other words in this language (līyap, izhingau, aukdje, dūhjūroi, vakernā, āgū, īwæv, sabajtsū, sabah, zhahxeu, toiājagej, fū’ai’, imaudet, kūthwohi, tau’au', ngī!awa, i!voth)

2.  They should only use letters from the alphabet: ī, ei, ā, o, ū, i, e, æ, ou, oi, iū, au, a, ø, iā, ai, b, w, y, h, p, t, d, ch, dj, k, g, f, v, th, kh, s, z, sh, zh, r, l, m, n, ng, !, j, x, ', ʰ

3.  They should be pronounceable within basic phonetic rules

4.  They should avoid anything that resembles offensive language

5.  They shouldn’t have consonant or vowel clusters larger than two letters; with defined clusters in the alphabet such as “ia” counting as just one letter

Just provide the list of words nothing else. "


This prompt may produce output such as:

"zh, khø, sūt, eijau, ākh, wæzhei, thoi!ai, dzhauwai, ia'ai, khutoi"


This matches the feel of what I want for my output much better; however, it still falls short in some ways.

Ideally, the LLM would handle everything with no filtering needed, but when you stack on too many constraints like this, things get messy. In my case, I noticed that adding more parameters often led to less variation in word length, with the model defaulting to short, repetitive patterns.

This is where the scorer, working alongside rejection sampling, helps filter out the undesirable content. Some rules can be handled with basic logic (such as using a specific alphabet), but others (like avoiding offensive connotations or matching the "feel" of sample words) aren’t so easily defined. These vaguer judgments are exactly where the scorer shines.
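
To make that concrete, here is a minimal sketch of how the pieces might fit together: a hard, rule-based alphabet check runs first, and the scorer handles the fuzzier judgments inside a rejection-sampling loop. The names `generate_words` and `score_word`, the 0.7 threshold, and the round limit are placeholder assumptions for illustration, not the exact code from my project.

```python
# Minimal rejection-sampling sketch; `generate_words`, `score_word`, the threshold,
# and `max_rounds` are illustrative placeholders, not the project's actual code.
ALLOWED = {"ī", "ei", "ā", "o", "ū", "i", "e", "æ", "ou", "oi", "iū", "au", "a", "ø",
           "iā", "ai", "b", "w", "y", "h", "p", "t", "d", "ch", "dj", "k", "g", "f",
           "v", "th", "kh", "s", "z", "sh", "zh", "r", "l", "m", "n", "ng", "!", "j",
           "x", "'", "ʰ"}

def uses_allowed_alphabet(word: str) -> bool:
    """Hard rule: greedily consume the word using only letters/clusters from the alphabet."""
    i = 0
    while i < len(word):
        for size in (2, 1):  # try two-character clusters like "zh" or "iā" first
            if word[i:i + size] in ALLOWED:
                i += size
                break
        else:
            return False
    return True

def sample_words(prompt, generate_words, score_word, needed=50, threshold=0.7, max_rounds=20):
    """Rejection sampling: keep only candidates that pass the hard rule and the scorer."""
    kept = []
    for _ in range(max_rounds):
        for word in generate_words(prompt):            # one LLM call returning candidate words
            if word in kept:                           # cheap checks first: repeats...
                continue
            if not uses_allowed_alphabet(word):        # ...then the alphabet rule
                continue
            if score_word(prompt, word) >= threshold:  # the scorer handles the fuzzier judgments
                kept.append(word)
        if len(kept) >= needed:
            break
    return kept[:needed]
```

The cheap, deterministic checks run before the scorer so you only spend scoring calls on words that could actually be kept.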


If you'd like to view the building process for my project, you can find it here.

The rest of this piece focuses on what I learned in the process, and things that might help if you're building something similar.

My Key Takeaways

I went through several iterations of this project, experimenting with different prompts and strategies. Here are some of my main takeaways:


  1. Don't feel the need to score everything at once

At first, I built a single scorer to handle both the overall output and the parameters that applied to each individual word. It seemed efficient, but it backfired: I couldn’t easily flag specific problem words, and the broader context made it harder to apply the more specific rules.

Splitting the task made much more sense: I was able to be more specific with both the individual-word scores and the overall output scores.

For example, when you try to score everything all at once, the parameters that focus more on the overall output may do fine, but the total score is brought down by the scorer misunderstanding the more specific prompts that focus on individual words:

```python
def score(llm_input, llm_output) -> float:
    scoring_spec = [
        {
            "label": "Offensive Language",
            "question": "Does the generated word avoid offensive or inappropriate language, including sounds or phonetic combinations that may resemble offensive terms in real languages?",
            "weight": 0.5
        },
        {
            "label": "Stylistic Fit",
            "question": "Does the word match the stylistic patterns of the fantasy language as seen in example words like: dūhjūroi, vakernā, sabajtsū, toiājagej, fū'ai', imaudet, kūthwohi, tau'au', ngī!awa, i!voth, yū’ei!ou, so'ia, 'ia, niashlij, 'īcheim, eyārī, eixoi’!ou, dikiā, jīkīja, nethu, sha'alash, xauki, selyu, yū’wal?",
            "weight": 0.5
        },
        {
            "label": "Phonetic Plausibility",
            "question": "Does the word avoid letter sequences that are unlikely or difficult to pronounce, assuming ' represents a glottal stop and ! a click?",
            "weight": 0.5
        },
        {
            "label": "Unique Generation",
            "question": "Are the words unique, not a direct copy, and not overly similar in sound or pattern to the provided examples (e.g., avoiding cases like generating 'iwav' when 'īwæv' is an example)?",
            "weight": 0.5
        },
        {
            "label": "Length Diversity",
            "question": "Does the set of generated words include a range of different lengths (from lengths like 'ia to dūhjūroi)?",
            "weight": 0.5
        }
    ]
    return pi.scoring_system.score(llm_input=llm_input, llm_output=llm_output, scoring_spec=scoring_spec)
```

I ran this using the sample output "thaikha, foi!au, i!voth, gimū, eihadzh, ro’a, di’ij, wagā, ngshāwsh, 'ia, ekoukaihu, dikiā, trouneshwā, so'ia, īrūgūdzhī", and the model scored as follows:

Total Score: 0.226, Offensive Language: 0.123, Stylistic Fit: 0.183, Phonetic Plausibility: 0.047, Unique Generation: 0.723, Length Diversity: 0.789


When you split it into two separate scorers, not only are the general output scores not weighed down by the specific ones, but the more specific scores also generally perform better:

```python
# A scorer that's tailored to score just the overall list
def score(llm_input, llm_output) -> float:
    scoring_spec = [
        {
            "label": "Unique Generation",
            "question": "Are the words unique, not a direct copy, and not overly similar in sound or pattern to the provided examples (e.g., avoiding cases like generating 'iwav' when 'īwæv' is an example)?",
            "weight": 0.5
        },
        {
            "label": "Length Diversity",
            "question": "Does the set of generated words include a range of different lengths (from lengths like 'ia to dūhjūroi)?",
            "weight": 0.5
        }
    ]
    return pi.scoring_system.score(llm_input=llm_input, llm_output=llm_output, scoring_spec=scoring_spec)
```

```python
# A (simplified) scorer that scores the individual words
def score(llm_input, llm_output) -> float:
    scoring_spec = [
        {
            "label": "Offensive Language",
            "question": "Does the generated word avoid offensive or inappropriate language, including sounds or phonetic combinations that may resemble offensive terms in real languages?",
            "weight": 0.5
        },
        {
            "label": "Stylistic Fit",
            "question": "Does the word match the stylistic patterns of the language as seen in example words like: dūhjūroi, vakernā, sabajtsū, toiājagej, fū'ai', imaudet, kūthwohi, tau'au', ngī!awa, i!voth, yū’ei!ou, so'ia, 'ia, niashlij, 'īcheim, eyārī, eixoi’!ou, dikiā, jīkīja, nethu, sha'alash, xauki, selyu, yū’wal?",
            "weight": 0.5
        },
        {
            "label": "Phonetic Plausibility",
            "question": "Does the word avoid letter sequences that are unlikely or difficult to pronounce, assuming ' represents a glottal stop and ! a click?",
            "weight": 0.5
        }
    ]
    return pi.scoring_system.score(llm_input=llm_input, llm_output=llm_output, scoring_spec=scoring_spec)
```

Now, running these scorers separately gives the scores:

Total Score: 0.751, Unique Generation: 0.723, Length Diversity: 0.789

for the overall list, and:

Total Score: 0.965, Offensive Language: 0.435, Stylistic Fit: 0.047, Phonetic Plausibility: 0.867

for one individual word from the output (in this case, the word "dikiā").
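
As a rough sketch of how the two scorers can then work together, the list-level scorer can gate the whole batch while the word-level scorer filters individual words. Here I refer to the two `score` functions above as `score_list` and `score_word`, treat each as returning its total score as a float (per the `-> float` annotation), and pick thresholds and a comma-separated output format purely for illustration:

```python
# Rough sketch combining the two scorers; the thresholds and the comma-separated
# output format are illustrative assumptions.
LIST_THRESHOLD = 0.7   # minimum list-level total score to accept a batch at all
WORD_THRESHOLD = 0.6   # minimum word-level total score to keep an individual word

def filter_batch(prompt: str, llm_output: str):
    """Return the accepted words, or None if the whole batch should be regenerated."""
    if score_list(prompt, llm_output) < LIST_THRESHOLD:
        return None  # e.g., poor length diversity or too many near-repeats

    words = [w.strip() for w in llm_output.split(",")]
    return [w for w in words if score_word(prompt, w) >= WORD_THRESHOLD]
```

With a setup like this, a rejected batch goes back for another generation pass, while individually rejected words are simply dropped.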

  2. Be specific: your scorer needs context.

If you're using example words or defined patterns, include them directly in your scoring prompts. Don’t just refer back to the original instruction — restate what matters.

Instead of asking, "Does this match the feel of the language?", try: "Does this match the feel of the language, for example: līyap, izhingau, aukdje…?"


So, using the "Stylistic Fit" prompt from earlier:

```python
# Just scores the stylistic fit
# One prompt is specific and one is overly vague
def score(llm_input, llm_output) -> float:
    scoring_spec = [
        {
            "label": "Specific Prompt",
            "question": "Does the word match the stylistic patterns of the fantasy language as seen in example words like: dūhjūroi, vakernā, sabajtsū, toiājagej, fū'ai', imaudet, kūthwohi, tau'au', ngī!awa, i!voth, yū’ei!ou, so'ia, 'ia, niashlij, 'īcheim, eyārī, eixoi’!ou, dikiā, jīkīja, nethu, sha'alash, xauki, selyu, yū’wal?",
            "weight": 0.5
        },
        {
            "label": "Vague Prompt",
            "question": "Does the word match the stylistic patterns of the fantasy language?",
            "weight": 0.5
        }
    ]
    return pi.scoring_system.score(llm_input=llm_input, llm_output=llm_output, scoring_spec=scoring_spec)
```

When I ran the scorer against the word "dikiā", the scores were as follows:

Specific Prompt: 0.443
Vague Prompt: 0.091


As you can see, the vague prompt, which wasn’t clear about which “stylistic patterns” I was referring to, performed much worse when evaluating a word that should have fit reasonably well within the language.

In general, avoid relying solely on references to your original prompt. The scorer works better when given clear, specific parameters up front.


  3. There's a balance between too vague and too specific

Being too vague is a problem, but so is being too specific. Initially, I overloaded my scorer prompts with too many rules and examples. The result? Confusion, over-filtering, and a model that got overly critical about things that should’ve passed.


Strike a balance: enough detail to guide, but not so much that it overwhelms.

```python
# The Pi Scorer scoring the exact same thing twice (how well the style of the word fits the language)
# The second question is overly wordy and tends to get a worse score
def score(llm_input, llm_output) -> float:
    scoring_spec = [
        {
            "label": "Simple but Specific Prompt",
            "question": "Does the word match the stylistic patterns of the fantasy language as seen in example words like: dūhjūroi, vakernā, sabajtsū, toiājagej, fū'ai', imaudet, kūthwohi, tau'au', ngī!awa, i!voth, yū’ei!ou, so'ia, 'ia, niashlij, 'īcheim, eyārī, eixoi’!ou, dikiā, jīkīja, nethu, sha'alash, xauki, selyu, yū’wal?",
            "weight": 0.5
        },
        {
            "label": "Overly Wordy Prompt",
            "question": "Do the words match the phonetic patterns, length distribution, and structural elements (such as prefixes, suffixes, and syllable patterns) of these example words: dūhjūroi, vakernā, sabajtsū, toiājagej, fū'ai', imaudet, kūthwohi, tau'au', ngī!awa, i!voth, yū’ei!ou, so'ia, 'ia, niashlij, 'īcheim, eyārī, eixoi’!ou, dikiā, jīkīja, nethu, sha'alash, xauki, selyu, yū’wal?",
            "weight": 0.5
        }
    ]
    return pi.scoring_system.score(llm_input=llm_input, llm_output=llm_output, scoring_spec=scoring_spec)
```

The scores for the example "dikiā" were as follows:

Simple but Specific Prompt: 0.448
Overly Wordy Prompt: 0.094


As shown above, the prompt that included overly complex and somewhat conflicting information scored much worse, even though it was evaluating essentially the same thing.

If a question feels overloaded with information, either split it into smaller, simpler ones, or rephrase it more clearly like I did in the example.
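
For example, the overly wordy question above could be split into smaller, focused questions within the same scoring_spec. The wording and weights below are my own illustrative choices, not the questions I actually shipped:

```python
# Illustrative split of the overly wordy question into smaller, focused ones;
# the wording and weights here are assumptions, not the project's final spec.
def score(llm_input, llm_output) -> float:
    scoring_spec = [
        {
            "label": "Phonetic Fit",
            "question": "Does the word use sound combinations similar to example words like dūhjūroi, vakernā, sabajtsū, ngī!awa, i!voth?",
            "weight": 0.5
        },
        {
            "label": "Length Fit",
            "question": "Is the word's length within the range of example words like 'ia, nethu, xauki, toiājagej, dūhjūroi?",
            "weight": 0.5
        },
        {
            "label": "Structural Fit",
            "question": "Does the word use prefixes, suffixes, or syllable patterns similar to example words like imaudet, kūthwohi, sha'alash, eyārī, jīkīja?",
            "weight": 0.5
        }
    ]
    return pi.scoring_system.score(llm_input=llm_input, llm_output=llm_output, scoring_spec=scoring_spec)
```

Each question still restates a few example words, in keeping with the earlier takeaway about giving the scorer context.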


Final Thoughts

The Pi Scorer can be a powerful tool for filtering LLM outputs, but only when the prompts behind it are solid. Focus on writing questions that are simple and concise, but not vague, and don’t hesitate to break more complex ideas into smaller parts (or even separate scorers).

Also, the copilot can be a great resource for brainstorming more targeted questions — don’t be afraid to use it.

If nothing else, I hope these basic tips help you dodge some of the rough patches I experienced, and set your project up for success.
