RAG text chunker

Split a long document into overlapping chunks for retrieval and embeddings. Choose tokens, characters, words or sentences, set the size and overlap, and copy the chunks, all in your browser.

RAG Text Chunker splits a long document into smaller overlapping pieces for retrieval-augmented generation (RAG) and embeddings. Choose whether to split by tokens, characters, words or sentences, then set the chunk size and how much each chunk overlaps the previous one. The chunks are produced live, each shown with its size. Token splitting uses the GPT-4o tokenizer, which gives a close approximation of the counts for other models. You can copy every chunk as a JSON array or as separated text. Everything runs in your browser with nothing uploaded.

Frequently asked questions

Is my text uploaded anywhere?

No. The text is split into chunks entirely in your browser. Nothing you paste is sent anywhere or stored.

What does chunking do and why overlap?

It breaks a long document into smaller pieces for retrieval-augmented generation, so each piece can be embedded and searched on its own. Overlap repeats the end of the previous chunk at the start of the next, so a sentence that straddles a boundary is more likely to appear whole in at least one chunk.
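
As a rough illustration of how overlap works, here is a minimal Python sketch of a sliding window over words. The function name `chunk_words` and its parameters are hypothetical and are not the tool's actual code; the sketch assumes the overlap is smaller than the chunk size.

```python
def chunk_words(text: str, size: int, overlap: int) -> list[str]:
    """Split text into chunks of up to `size` words, where each chunk
    starts with the last `overlap` words of the previous chunk."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

With a size of 4 and an overlap of 2, `chunk_words("a b c d e f g", 4, 2)` returns `["a b c d", "c d e f", "e f g"]`: every chunk after the first begins with the last two words of the one before it.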

What can I split by?

Tokens, characters, words or sentences. Tokens are usually best for fitting an embedding model's limit, while sentences keep chunks readable and avoid cutting mid-sentence.
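
For sentence mode, one possible approach is sketched below in Python. The sentence rule here (a sentence ends at `.`, `!` or `?` followed by whitespace) is a deliberately naive assumption, and the names `split_sentences` and `chunk_sentences` are hypothetical; the tool's own sentence detection may differ.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_sentences(text: str, size: int, overlap: int = 0) -> list[str]:
    """Group whole sentences into chunks of up to `size` sentences,
    repeating the last `overlap` sentences at the start of the next chunk."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    sentences = split_sentences(text)
    step = size - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + size]))
        if start + size >= len(sentences):
            break  # the last group already includes the final sentence
    return chunks
```

Because the window moves over whole sentences, no chunk ever starts or ends mid-sentence. A rule this simple will mis-split abbreviations such as "e.g. this", which is one reason real splitters are more involved.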

How accurate is the token splitting?

It uses the GPT-4o tokenizer, which is a close approximation for most other models; Claude and Gemini, for example, have no public tokenizer to use instead. Exact counts for those models will differ somewhat, but the estimate is close enough for planning chunk sizes.
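
Token splitting can be sketched the same way as the other modes, with the window moving over token IDs instead of words. The Python below is only an illustration: `chunk_tokens` is a hypothetical name, and the tokenizer is passed in as a pair of `encode` and `decode` functions so the sketch does not depend on any particular library.

```python
from typing import Callable, Sequence


def chunk_tokens(
    text: str,
    size: int,
    overlap: int,
    encode: Callable[[str], Sequence[int]],
    decode: Callable[[Sequence[int]], str],
) -> list[str]:
    """Encode text to token IDs, slide a window of `size` tokens that
    advances by `size - overlap`, and decode each window back to text."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    tokens = list(encode(text))
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(decode(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reached the last token
    return chunks
```

If the `tiktoken` package is installed, `tiktoken.get_encoding("o200k_base")` returns the encoding used by GPT-4o, and its `encode` and `decode` methods could be passed in here. Whether the tool itself works this way is an assumption.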

How do I get the chunks out?

Copy all as JSON gives an array of strings ready to drop into code, and copy all as text gives the chunks separated by a divider. Both include every chunk, not just the ones shown.
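
To show what the two export formats might look like, here is a small Python sketch. The function names and the `---` divider are assumptions made for illustration; the tool's actual divider may be different.

```python
import json


def chunks_to_json(chunks: list[str]) -> str:
    """Serialize chunks as a JSON array of strings."""
    return json.dumps(chunks, ensure_ascii=False, indent=2)


def chunks_to_text(chunks: list[str], divider: str = "\n\n---\n\n") -> str:
    """Join chunks into one string with a divider between them.
    The default divider is an assumed placeholder."""
    return divider.join(chunks)
```

The JSON form round-trips cleanly: passing the output of `chunks_to_json` to `json.loads` gives back the original list, so it can be loaded straight into an embedding script.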