Why Your Prompts Are Failing in Kaggle Competitions
12 mins read

I remember staring at the leaderboard of the “LLM Science Exam” competition a few years ago, completely baffled. My model had a decent architecture and my RAG pipeline was retrieving relevant context, yet my score was stuck. I was losing to people who were simply asking the model better questions. Fast forward to December 2025, and the game has changed entirely. If you are still treating prompt engineering as an art form based on vibes and “magic words,” you are going to lose. It’s an engineering discipline now.

Earlier this year, we saw a massive shift in how major players like Google DeepMind and OpenAI document their prompting strategies. The documentation released back in early 2025—which felt like a masterclass in itself—validated what many top Kagglers had suspected: structure beats creativity. I’ve spent the last six months refactoring my entire approach to NLP competitions based on these formalized engineering principles. I’m not just guessing anymore; I’m compiling prompts.

In this article, I want to walk you through the specific prompting workflows I use right now. I’ll show you how I structure inputs for stability, how I evaluate prompt performance (because if you aren’t measuring it, it doesn’t exist), and how this applies to the current landscape of Kaggle News and competitions.

The Death of “Act as a…”

Let’s get one thing straight: starting your prompt with “Act as a helpful assistant” is wasted tokens. In the high-stakes environment of a Kaggle kernel, where we are often constrained by GPU hours and inference latency, every token needs to fight for its existence. The “persona” pattern was useful in 2023, but modern models from Anthropic or Google’s latest Gemini iterations don’t need the role-play to get the format right. They need constraints.

When I look at the discussion threads for recent Kaggle competitions, the winners aren’t sharing clever phrases. They are sharing XML tags and structural delimiters. The big realization from the technical papers released this year is that LLMs are incredibly sensitive to data separation: if you mix your instructions with your data, the model gets confused.

Here is how I structure a standard classification prompt now. I use XML-style tagging because models trained on code (which is most of them) understand opening and closing tags implicitly. It separates the “system” logic from the “user” noise.

def construct_robust_prompt(instruction, context, query):
    """
    Constructs a prompt using strict delimiters to prevent instruction drift.
    """
    prompt = f"""
    <system_instruction>
    {instruction}
    Please output your reasoning in <thought> tags, followed by the final answer in <answer> tags.
    </system_instruction>

    <context_data>
    {context}
    </context_data>

    <user_query>
    {query}
    </user_query>
    """
    return prompt.strip()

# Example Usage
instruction_text = "Classify the sentiment of the query based on the context. Options: [POSITIVE, NEGATIVE, NEUTRAL]."
context_text = "The user has experienced repeated login failures due to server timeout."
query_text = "I am extremely frustrated that I cannot access my account right now."

print(construct_robust_prompt(instruction_text, context_text, query_text))

This looks simple, but it solves the biggest issue I see raised in competition discussions: hallucination due to context bleeding. By wrapping the context, I tell the model exactly where the “truth” lives. If you read Google DeepMind’s prompting documentation, you’ll see it emphasizes this “containerization” of information heavily.

Few-Shotting is an Algorithm, Not a Suggestion

I used to hand-pick examples for my few-shot prompts. I’d think, “This one looks like a good example.” That was a mistake. In 2025, few-shot selection is a retrieval task. If I have a dataset of 10,000 training examples, I am not going to hardcode three of them into my prompt.

I use a dynamic selector. I embed my training set with a lightweight model (usually a recent Sentence Transformers checkpoint), store the vectors in an index, and then, for every single inference call, pull the top-3 most semantically similar examples into the prompt.

This dynamic few-shotting is what separates a Bronze medal from a Gold. It adapts the “teaching” material to the specific test question. Here is a simplified version of my workflow using brute-force cosine similarity; a FAISS-backed variant, which is standard practice if you follow the FAISS or Milvus ecosystems, follows right after.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

class DynamicFewShot:
    def __init__(self, examples, embeddings):
        """
        examples: list of dicts {'input': str, 'output': str}
        embeddings: numpy array of shape (N, D)
        """
        self.examples = examples
        self.embeddings = embeddings
    
    def get_prompt_examples(self, query_embedding, k=3):
        # Calculate similarity
        sims = cosine_similarity(query_embedding.reshape(1, -1), self.embeddings)
        
        # Get top k indices
        top_k_indices = np.argsort(sims[0])[-k:][::-1]
        
        selected_examples = ""
        for idx in top_k_indices:
            ex = self.examples[idx]
            selected_examples += f"<example>\nInput: {ex['input']}\nOutput: {ex['output']}\n</example>\n"
            
        return selected_examples

# Mock data workflow
# In a real Kaggle scenario, I'd load these from a Hugging Face dataset
my_examples = [
    {"input": "Login failed", "output": "Technical Issue"},
    {"input": "Love this product", "output": "Positive Feedback"}
]
# Assume we have pre-computed embeddings (e.g. from a Sentence Transformers model)
mock_embeddings = np.random.rand(2, 384)

# Initialize the selector and pull the closest examples for a new query embedding
selector = DynamicFewShot(my_examples, mock_embeddings)
query_embedding = np.random.rand(384)
print(selector.get_prompt_examples(query_embedding, k=2))
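
When the training set is large enough that a brute-force similarity matrix becomes the bottleneck, I swap it for a FAISS index. Below is a minimal sketch under a couple of assumptions: faiss-cpu is installed, and the vectors are L2-normalized so that inner product equals cosine similarity. It reuses the mock_embeddings, query_embedding, and my_examples defined above.

import faiss
import numpy as np

# Build an inner-product index; after L2 normalization this is cosine similarity
xb = np.ascontiguousarray(mock_embeddings, dtype="float32")
faiss.normalize_L2(xb)
index = faiss.IndexFlatIP(xb.shape[1])
index.add(xb)

# Normalize the query the same way, then take the top-k neighbours
xq = np.ascontiguousarray(query_embedding.reshape(1, -1), dtype="float32")
faiss.normalize_L2(xq)
scores, indices = index.search(xq, 2)

few_shot_block = "".join(
    f"<example>\nInput: {my_examples[i]['input']}\nOutput: {my_examples[i]['output']}\n</example>\n"
    for i in indices[0]
)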

I’ve found that libraries like LangChain or LlamaIndex are great for production, but in a Kaggle kernel I prefer writing lightweight classes like this. It reduces dependency overhead and gives me total control over memory, which is critical when you are pushing the limits of Colab’s free tier or Kaggle’s P100s.

The Rise of “Compiled” Prompts

One of the most interesting trends I’ve tracked on Hugging Face this year is the concept of prompt optimization as a training step. We aren’t writing prompts anymore; we are training them. Tools like DSPy (which gathered massive momentum in late 2024 and throughout 2025) let us treat prompts as weights.

I recently worked on a project involving a Mistral AI model where manual prompting was failing. The model was too small (a quantized 7B) to follow nuanced instructions. I switched to an optimizer approach: I defined a metric (exact-match accuracy on a validation set) and let a script iterate through variations of the prompt instructions until it found the wording that maximized the score.
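
Here is the shape of that script. This is a minimal sketch rather than DSPy itself: call_model is a hypothetical wrapper around the local model, and validation_set is assumed to be a list of {"input", "label"} dicts held out from training.

# Candidate instruction variants; in practice I generate these with a larger model
CANDIDATES = [
    "Classify the sentiment. Respond with exactly one word: POSITIVE, NEGATIVE, or NEUTRAL.",
    "Read the query and output POSITIVE, NEGATIVE, or NEUTRAL. No explanation.",
    "Sentiment label only (POSITIVE/NEGATIVE/NEUTRAL):",
]

def exact_match_accuracy(instruction, validation_set, call_model):
    # call_model(prompt) -> str is a placeholder for the 7B inference call
    hits = 0
    for example in validation_set:
        prediction = call_model(f"{instruction}\n\n{example['input']}").strip().upper()
        hits += int(prediction == example['label'])
    return hits / len(validation_set)

def compile_prompt(validation_set, call_model):
    # Keep whichever wording maximizes the metric, however odd it reads to a human
    scored = [(exact_match_accuracy(c, validation_set, call_model), c) for c in CANDIDATES]
    return max(scored)[1]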

It turns out the “best” prompt for a specific model often looks weird to humans. It might repeat certain words or use strange punctuation. But who cares? If it boosts my leaderboard score by 0.02, I’m using it. This aligns with the broader push toward wiring LLM calls directly into optimization loops.

Evaluating the Un-evaluable

You cannot improve what you cannot measure. This is the cardinal rule of data science, yet so many people ignore it when it comes to GenAI. They change a prompt, look at one output, say “that looks better,” and commit the code. That is not science; that is gambling.

My workflow involves a rigorous evaluation pipeline. Whenever I change a prompt, I run it against a “Golden Set” of 50 difficult examples. I don’t just check for exact matches. I use a secondary, stronger model (often an OpenAI or Cohere API) to grade the responses of my smaller, competition-submission model.

Here is a snippet of how I use a “Judge” model to evaluate my experiments. This is crucial when tuning prompts for Kaggle shared tasks.

def evaluate_response(question, ground_truth, model_output):
    """
    Uses a larger model to grade the output 1-5.
    """
    grading_prompt = f"""
    You are an impartial judge. 
    Question: {question}
    Ground Truth: {ground_truth}
    Student Answer: {model_output}
    
    Rate the Student Answer from 1 to 5 based on accuracy relative to the Ground Truth. 
    Return ONLY the number.
    """
    
    # In practice, this calls an API or a high-quality local model and
    # parses the single digit out of the reply, e.g.:
    # score = int(call_llm_api(grading_prompt).strip())

    # Placeholder for the article
    score = 4
    return score

# My experimentation loop.
# golden_set is the 50-example "Golden Set" described above, and
# my_competition_model is the local model that actually gets submitted.
results = []
for test_case in golden_set:
    output = my_competition_model.generate(test_case['question'])
    score = evaluate_response(test_case['question'], test_case['answer'], output)
    results.append(score)

print(f"Average Experiment Score: {sum(results)/len(results):.2f}")

I log these experiments with tools like Weights & Biases or MLflow. Seeing a graph of your “Prompt Performance” over time is incredibly satisfying, and it prevents regressions.
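
For reference, the logging step is only a few lines with Weights & Biases; the project name and config keys below are placeholders from my own setup, and results comes from the loop above.

import wandb

# Log one prompt experiment so regressions show up as a dip on the dashboard
run = wandb.init(
    project="kaggle-llm-prompts",  # placeholder project name
    config={"prompt_version": "v12_xml_tags", "few_shot_k": 3},
)
wandb.log({"avg_judge_score": sum(results) / len(results)})
run.finish()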

Handling the Context Window

Despite Gemini and Claude pushing context windows into the millions, local compute on Kaggle is still limited. We usually have to fit everything into an 8k or 32k context window if we are running locally with vLLM or Ollama. This means we can’t just dump documents into the prompt.

I’ve started using a technique called “Chain of Density” for summarization before RAG. Instead of retrieving chunks and feeding them in raw, I compress them first, which ensures the prompt contains high-density information. It’s a strategy I picked up from the Salesforce Research paper on the topic.
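
The compression prompt I use looks roughly like this. The wording is my own paraphrase of the Chain of Density idea, not the paper’s verbatim template.

def build_chain_of_density_prompt(document, rounds=3, word_limit=80):
    """Ask the model to iteratively rewrite a summary, packing in more entities each pass."""
    return f"""
<document>
{document}
</document>

Write a summary of the document in at most {word_limit} words.
Then repeat the following {rounds} times:
1. Identify 1-3 informative entities from the document that are missing from your latest summary.
2. Rewrite the summary at the same length, adding those entities without dropping existing ones.
Return only the final, densest summary inside <summary> tags.
""".strip()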

Furthermore, managing KV-cache (key-value cache) memory is vital. If you aren’t paying attention to PagedAttention and the other memory optimizations built into modern inference servers, you are likely running out of CUDA memory halfway through your inference loop. I always set strict limits on my sequence lengths so I don’t hit OOM errors nine hours into a notebook run.
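
Concretely, here is how I cap lengths when serving a local model through vLLM in a notebook. The checkpoint name is just an example and the numbers are what I typically budget for a 16 GB card; it reuses construct_robust_prompt and the example inputs from earlier.

from vllm import LLM, SamplingParams

# Cap the context so the KV cache fits on the GPU alongside the weights
llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.3",  # example checkpoint
    max_model_len=8192,               # hard limit on prompt + completion tokens
    gpu_memory_utilization=0.90,      # leave a little headroom for other allocations
    dtype="half",
)

# Keep completions short and deterministic for classification-style tasks
params = SamplingParams(max_tokens=128, temperature=0.0)
outputs = llm.generate(
    [construct_robust_prompt(instruction_text, context_text, query_text)],
    params,
)
print(outputs[0].outputs[0].text)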

The Tooling Landscape in Late 2025

The ecosystem has matured enormously. Back in the day, we hacked scripts together. Now, looking at tools like LangSmith or Chainlit, the observability is incredible. For Kaggle specifically, however, I tend to stick to lighter frameworks.

I am currently heavily invested in LLaMA-Factory for fine-tuning. It abstracts away so much of the pain of QLoRA. Often, the best “prompt engineering” is actually just lightweight fine-tuning. If I can fine-tune an adapter for two hours on a T4 GPU to learn the output format, I don’t need to waste 500 tokens in my prompt explaining that format. I just say “Output in JSON” and the adapter knows exactly what that means for this specific domain.
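
For context, the setup LLaMA-Factory automates is roughly the following transformers + PEFT sketch. This is an illustration of the QLoRA recipe under my usual assumptions (a 4-bit Mistral 7B base, LoRA on the attention projections), not LLaMA-Factory’s actual internals.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model in 4-bit so it fits on a single T4 (QLoRA-style)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",  # example checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)

# A small LoRA adapter that learns the output format (e.g. strict JSON)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the base weights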

I also keep a close eye on RunPod and Modal. While we can’t use external compute for the final submission in code competitions, I use them extensively for the “compile” phase of my prompts and for generating synthetic data to train my local models.

Why This Matters Now

The barrier to entry for Kaggle competitions involving LLMs has risen significantly. You can’t just fork a public notebook, change the seed, and hope for silver. The winners are those who treat prompts as code modules: versioned, tested, and optimized.

If you are looking for resources, don’t just look for “prompting guides.” Look for engineering documentation. Read the technical reports from Google DeepMind or the system cards from Meta AI. That is where the real alpha is. They tell you exactly how they broke the task down to hit their SOTA benchmarks. Copy their architecture, not just their words.

I expect that by mid-2026, we won’t even be writing prompts manually anymore. We will likely be defining objective functions and letting agents write the prompts for us. But until then, mastering the structure, the XML delimiters, and the dynamic few-shot retrieval is your best bet for climbing that leaderboard.

So, stop asking the model to be nice. Start architecting your inputs. The results on the leaderboard will speak for themselves.