Stop scraping your Notion manually: A real sync pipeline to Pinecone
I spent last Tuesday night staring at a 429 Too Many Requests error from the Notion API, wondering why I ever decided to build my own “second brain” AI. The promise is always the same: take your messy Notion workspace, dump it into a vector database like Pinecone, and suddenly you have a magical chatbot that knows everything you’ve ever written.
If only it were that simple.
But the reality of making Notion data “AI-ready” is a lot messier than the tutorials suggest. I’ve been working on a TypeScript pipeline to automate this, and let me tell you, the Notion API is not your friend here. It’s a block-based nightmare that treats every paragraph as a distinct database entity. Still, after a few pots of coffee and some aggressive refactoring, I finally got a reliable sync working.
The “Block” Problem
When I first tried to ingest my engineering wiki, I just grabbed the page content. The result? Garbage. I missed all the nested toggles, the child databases, and the context inside columns. If you just grab the plain_text property, you lose the semantic structure that makes the data useful for an LLM.
I had to write a recursive function to walk the block tree. It’s slow, it’s painful, but it’s necessary.
import { Client } from "@notionhq/client";

// Running this with @notionhq/client v2.2.15 on Node 23.4.0
const notion = new Client({ auth: process.env.NOTION_KEY });

async function getBlockChildren(blockId: string): Promise<string> {
  let content = "";
  let hasMore = true;
  let nextCursor: string | undefined = undefined;

  while (hasMore) {
    const response = await notion.blocks.children.list({
      block_id: blockId,
      start_cursor: nextCursor,
    });

    for (const block of response.results) {
      // This is where the magic (and pain) happens
      if ("type" in block) {
        const text = extractTextFromBlock(block); // Helper to parse rich text
        content += text + "\n";

        // Recursion for nested blocks (toggles, columns, child pages...)
        if (block.has_children) {
          content += await getBlockChildren(block.id);
        }
      }
    }

    hasMore = response.has_more;
    nextCursor = response.next_cursor ?? undefined;
  }

  return content;
}
See that if (block.has_children) check? That little line is responsible for doubling my sync time. But without it, you lose half your data.
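The extractTextFromBlock helper I hand-wave at in that comment doesn’t need to be clever. A minimal sketch of one way to write it, assuming you only care about text-bearing block types:

import type { BlockObjectResponse } from "@notionhq/client/build/src/api-endpoints";

// Minimal sketch: most text-bearing block types (paragraph, heading_1,
// bulleted_list_item, toggle...) keep their content under
// block[block.type].rich_text, so join the plain_text of each segment.
function extractTextFromBlock(block: BlockObjectResponse): string {
  const payload = (block as any)[block.type];
  if (payload && Array.isArray(payload.rich_text)) {
    return payload.rich_text
      .map((rt: { plain_text: string }) => rt.plain_text)
      .join("");
  }
  return ""; // images, dividers, embeds, etc. contribute nothing here
}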
Chunking Strategy: Where I Screwed Up
My first instinct was to embed every Notion block as a separate vector. “Granularity is good,” I told myself. But when I ran queries against Pinecone, I was getting back individual bullet points without any context. A query for “deployment process” would return a chunk that just said “Run the script,” but wouldn’t tell me which script or which project, because that info was in the parent header block three levels up.
The Fix: I switched to a markdown-based approach. I convert the entire Notion page (and its children) into a single Markdown string first, then I use a sliding window chunker. The difference was night and day – 85% retrieval accuracy compared to just 12% with the block-based approach.
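The chunker itself is nothing exotic. Here’s a rough sketch of the sliding-window idea over the assembled Markdown string; the window and overlap sizes below are placeholder values, not the exact numbers from my pipeline, so tune them for your own data:

// Rough sketch of a sliding-window chunker over a Markdown string.
// Window and overlap sizes are placeholders, not tuned values.
function chunkMarkdown(markdown: string, windowSize = 1000, overlap = 200): string[] {
  const chunks: string[] = [];
  let start = 0;

  while (start < markdown.length) {
    const end = Math.min(start + windowSize, markdown.length);
    chunks.push(markdown.slice(start, end));
    if (end === markdown.length) break;
    start = end - overlap; // slide back by the overlap so context carries over between chunks
  }

  return chunks;
}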
The Sync Engine (Don’t use Cron)
You can’t just run this script on your laptop whenever you remember. Data goes stale instantly. I ended up moving this logic into a background job framework (something like Trigger.dev or Inngest works well here). The key is to decouple the fetching from the embedding.
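To make the decoupling concrete, here’s roughly what that shape could look like with Inngest (a Trigger.dev version would be analogous). findChangedPages and embedAndUpsert are stand-ins for your own helpers, getBlockChildren is the walker from earlier, and the cron schedule is arbitrary:

import { Inngest } from "inngest";

const inngest = new Inngest({ id: "notion-pinecone-sync" });

// Job 1: cheap polling. Finds pages edited since the last sync and fans out one event per page.
export const pollNotion = inngest.createFunction(
  { id: "poll-notion" },
  { cron: "*/15 * * * *" }, // arbitrary schedule
  async ({ step }) => {
    const pageIds = await step.run("find-changed-pages", () => findChangedPages()); // stand-in helper
    await step.sendEvent(
      "fan-out",
      pageIds.map((pageId) => ({ name: "notion/page.changed", data: { pageId } }))
    );
  }
);

// Job 2: the expensive part. Fetches blocks, chunks, embeds, and upserts one page at a time,
// so a single bad page can fail and retry without blocking the rest of the sync.
export const embedPage = inngest.createFunction(
  { id: "embed-page" },
  { event: "notion/page.changed" },
  async ({ event, step }) => {
    const markdown = await step.run("fetch-page", () => getBlockChildren(event.data.pageId));
    await step.run("embed-and-upsert", () => embedAndUpsert(event.data.pageId, markdown)); // stand-in helper
  }
);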
Pinecone Upserting: Metadata is King
When you finally push to Pinecone, don’t be stingy with metadata. I dump everything I can grab into the metadata object, right down to the author ID. Why the author ID? Because sometimes I want to ask my AI, “What did Dave write about the database migration?” If you don’t index that metadata now, you can’t filter by it later.
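As a sketch (the index name, field names, and exact metadata shape here are my own assumptions, not gospel), the upsert looks something like this with the @pinecone-database/pinecone SDK:

import { Pinecone } from "@pinecone-database/pinecone";

const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const index = pinecone.index("notion-brain"); // hypothetical index name

interface EmbeddedChunk {
  text: string;
  embedding: number[];
}

interface NotionPageMeta {
  id: string;
  url: string;
  authorId: string;       // page.created_by.id from the Notion API
  lastEditedTime: string; // page.last_edited_time, ISO 8601
}

async function upsertPageChunks(page: NotionPageMeta, chunks: EmbeddedChunk[]) {
  await index.upsert(
    chunks.map((chunk, i) => ({
      id: `${page.id}-chunk-${i}`,
      values: chunk.embedding,
      metadata: {
        pageId: page.id,
        url: page.url,
        authorId: page.authorId,             // so "what did Dave write?" is a filter, not a guess
        lastEditedTime: page.lastEditedTime, // handy for incremental syncs and debugging
        text: chunk.text,                    // keep the raw text so the LLM prompt can use it
      },
    }))
  );
}

// Later, at query time, that metadata becomes a filter:
// index.query({ vector, topK: 5, includeMetadata: true, filter: { authorId: { $eq: daveId } } });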
A Warning on Costs
One thing nobody mentions is the embedding cost loop. If you aren’t careful with your last_edited_time checks, you’ll end up re-embedding the same pages over and over. I accidentally left my poller running without a proper “since” timestamp filter, and I burned through about $40 of OpenAI credits in a weekend just re-processing static pages.
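Here’s roughly what that “since” check could look like using the Notion database query API’s last_edited_time timestamp filter; it’s also a plausible body for the findChangedPages stand-in from the sync sketch above. How you persist the watermark (a Postgres row, a KV entry, a file) is up to you:

import { Client } from "@notionhq/client";

const notion = new Client({ auth: process.env.NOTION_KEY });

// Sketch of an incremental check: only return pages edited on or after the last sync watermark.
async function findChangedPages(databaseId: string, since: string): Promise<string[]> {
  const changed: string[] = [];
  let cursor: string | undefined;

  do {
    const response = await notion.databases.query({
      database_id: databaseId,
      start_cursor: cursor,
      filter: {
        timestamp: "last_edited_time",
        last_edited_time: { on_or_after: since },
      },
    });
    changed.push(...response.results.map((page) => page.id));
    cursor = response.next_cursor ?? undefined;
  } while (cursor);

  return changed;
}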
Getting Notion data into Pinecone isn’t just about API calls; it’s about reconstructing the logic of your documents so a machine can understand them. It’s messy work, but once it’s running smoothly, having an AI that actually knows what’s in your docs is pretty sweet.
