N-gram Viewer & Phrase Frames

Free online n-gram viewer and phrase frame analyzer. Paste any text to count word sequences (bigrams, trigrams, 4-grams, 5-grams), find collocations, and extract phrase frames — n-grams with one variable slot that group templatic patterns like (intrat|impus) pe piața din românia into a single entry. Runs entirely in your browser.


What this tool does

This is a free online n-gram viewer and phrase frame analyzer. Paste any text — an article, a book chapter, a corpus of tweets, a transcript, a scraped dataset — and it does two things. First, it counts every n-gram (every sequence of N consecutive words) in the text and shows you the most frequent ones. Second, it extracts phrase frames: n-grams with one variable slot, so you can see templatic patterns that plain frequency counts miss. Everything runs locally in your browser. Nothing is uploaded, logged, or sent to any server.

Unlike the Google Books Ngram Viewer, which charts word frequencies over time across a fixed corpus, this tool analyzes whatever text you paste. That makes it useful for your own writing, your own scraped data, your own research corpus, or any other body of text that isn't in Google's index.

What is an n-gram?

An n-gram is just a sequence of N consecutive words from a text. A 1-gram (unigram) is a single word. A 2-gram (bigram) is a pair of adjacent words. A 3-gram (trigram) is three in a row. And so on. Take the sentence "the quick brown fox jumps". Its bigrams are the quick, quick brown, brown fox, fox jumps. Its trigrams are the quick brown, quick brown fox, brown fox jumps.
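The sliding-window idea can be sketched in a few lines of JavaScript (the language this tool runs in); `extractNgrams` is an illustrative helper, not the tool's own code:

```javascript
// Slide a window of size n across a token list and collect every n-gram.
// extractNgrams is a hypothetical name used only for this sketch.
function extractNgrams(tokens, n) {
  const grams = [];
  for (let i = 0; i + n <= tokens.length; i++) {
    grams.push(tokens.slice(i, i + n).join(" "));
  }
  return grams;
}

const tokens = "the quick brown fox jumps".split(" ");
console.log(extractNgrams(tokens, 2));
// → ["the quick", "quick brown", "brown fox", "fox jumps"]
```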

Counting n-grams across a large chunk of text surfaces repeated phrases. It's the foundation of a lot of corpus linguistics work, stylometric analysis, phrase mining, keyword extraction, and cliché-hunting.

Phrase frames — a core with its left and right context

Plain n-gram counts have a weakness. Consider these two Romanian phrases: "a intrat pe piața din România" and "s-a impus pe piața din România". They're clearly the same underlying pattern ("entered / established itself on the Romanian market") but each surface form may only appear once or twice, so neither rises in a plain frequency list. The shared core pe piața din românia shows up but the verb that sits in front of it is lost.

A phrase frame fixes this. In this tool, a phrase frame is a core n-gram shown together with the distribution of words that appear immediately to its left and immediately to its right across the entire text. So pe piața din românia becomes (intrat|impus|pătruns|extins) pe piața din românia (datorită|prin|la), collapsing every leading verb and every trailing connector into a single row, with counts for each filler.

This is how you find templatic expressions, collocations, construction patterns, and formulaic writing that plain frequency counts can't see because each surface variant is rare on its own but the underlying template is common.

Controls

  • N — the size of the core, from 1 (single words) up to 7. In the n-grams view this is just the n-gram length; in the frames view it's the length of the fixed core, with left and right context shown around it.
  • Min count — hide n-grams or cores that occur fewer than this many times.
  • Min variants (frames only) — only show frames where at least one side (left or right) has this many distinct filler words. Higher values surface more productive templates and filter out cores where the surrounding words never repeat.
  • Filter — text search across the visible n-grams or frames. In the frame view it matches against the fillers too.
  • Case sensitive — by default tokens are lowercased so The and the merge. Turn this on for stylometric work where casing matters.
  • Export CSV — download whatever is currently shown for further analysis in a spreadsheet.

Who uses n-gram analysis?

  • Writers and editors hunting their own crutch phrases, clichés, and repetitive sentence openers. Paste a draft and the 3-grams you use too often will float to the top.
  • SEO and content teams mining competitor content for repeated keyword phrases, title patterns, and formulaic copy that signals what a niche rewards.
  • Corpus linguists and researchers studying collocations, lexical bundles, and phraseology. Phrase frames are the standard way to find productive templates in a corpus.
  • Language learners spotting set expressions and idiomatic templates by pasting native-speaker text.
  • NLP and ML engineers doing quick exploratory analysis on a dataset before committing to a preprocessing pipeline.
  • Prompt engineers auditing LLM output for repetitive phrasing, templated transitions, or the kind of stock language that makes AI-generated text recognizable.
  • Journalists and forensic linguists comparing writing styles, detecting plagiarism, or attributing authorship.

How it works

The input text is first split into segments at hard boundaries: newlines, markdown markers (*, _), sentence-ending punctuation (. ! ? ; :), brackets, backticks, pipes, slashes, and quote marks. No n-gram can span across one of these boundaries, so phrases from two unrelated sentences — or two unrelated URL path segments — never get glued together into a ghost n-gram. Commas and internal hyphens or apostrophes stay inside a segment because real phrases legitimately cross those.
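A rough sketch of that segmentation step, assuming a boundary character set like the one described above (the exact set in this sketch is a guess, not the tool's source):

```javascript
// Split text into segments at hard boundaries: newlines, sentence
// punctuation, markdown markers, brackets, backticks, pipes, slashes,
// and curly quotes. Commas, hyphens, and apostrophes are NOT boundaries,
// so real phrases that cross them stay intact.
const hardBoundary = /[\n.!?;:()\[\]{}`|\/\\"“”‘’*_]+/u;

function segments(text) {
  return text.split(hardBoundary).map(s => s.trim()).filter(Boolean);
}

segments("He left. She stayed, quietly.");
// → ["He left", "She stayed, quietly"]
```

Note how the URL-style input from the FAQ below shatters into single-token segments, which is exactly what prevents ghost n-grams.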

Each segment is then tokenized with a Unicode-aware regex that keeps letters, digits, and internal hyphens or apostrophes. Tokens are lowercased by default. The tool slides a window of size N across each segment's token stream and counts every distinct n-gram it sees.
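A minimal sketch of the tokenize-and-count pass, assuming a Unicode-aware token regex that keeps letters, digits, and internal hyphens or apostrophes as described; `countNgrams` is an illustrative name:

```javascript
// Tokens: runs of Unicode letters/digits, optionally joined by internal
// hyphens or apostrophes (so "s-a" and "don't" stay single tokens).
const tokenRe = /[\p{L}\p{N}]+(?:['’-][\p{L}\p{N}]+)*/gu;

function countNgrams(segs, n, caseSensitive = false) {
  const counts = new Map();
  for (const seg of segs) {
    const tokens = (seg.match(tokenRe) || [])
      .map(t => (caseSensitive ? t : t.toLowerCase()));
    // Slide a window of size n over this segment's token stream.
    for (let i = 0; i + n <= tokens.length; i++) {
      const gram = tokens.slice(i, i + n).join(" ");
      counts.set(gram, (counts.get(gram) || 0) + 1);
    }
  }
  return counts;
}

countNgrams(["The quick fox", "the quick dog"], 2).get("the quick"); // → 2
```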

For phrase frames, every occurrence of a core n-gram records the token that appeared immediately before it and the token that appeared immediately after it, staying inside segment boundaries. After scanning the whole text each core has a left-context distribution and a right-context distribution. Cores where neither side has any repeats are discarded — they're just regular n-grams. Cores with productive slots on one or both sides are what the frame view shows.
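The frame-building pass might look like this sketch, operating on already-tokenized segments; `buildFrames` and the record shape are illustrative only:

```javascript
// For every occurrence of a core n-gram, record the token immediately
// before and immediately after it, never reaching across a segment edge.
function buildFrames(tokenizedSegments, n) {
  const frames = new Map();
  for (const tokens of tokenizedSegments) {
    for (let i = 0; i + n <= tokens.length; i++) {
      const core = tokens.slice(i, i + n).join(" ");
      let f = frames.get(core);
      if (!f) frames.set(core, (f = { count: 0, left: new Map(), right: new Map() }));
      f.count++;
      if (i > 0) {
        const w = tokens[i - 1]; // left filler, same segment only
        f.left.set(w, (f.left.get(w) || 0) + 1);
      }
      if (i + n < tokens.length) {
        const w = tokens[i + n]; // right filler, same segment only
        f.right.set(w, (f.right.get(w) || 0) + 1);
      }
    }
  }
  return frames;
}
```

The real tool additionally discards cores where neither side's fillers ever repeat; that filter is omitted here for brevity.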

FAQ

Is my text sent to a server?

No. All processing happens in your browser with JavaScript. There's no backend. You can paste confidential drafts, unpublished manuscripts, or client data and nothing leaves the page. Disconnect from the internet after loading the page and it still works.

What's the maximum text size?

There's no hard limit, but the n-gram and frame computation runs synchronously, so extremely large inputs (hundreds of thousands of words) may freeze the tab briefly. For typical use — articles, essays, chapters, scraped pages, subtitle files — it's instant.

Does it handle non-English text?

Yes. The tokenizer uses Unicode letter classes, so Romanian, French, German, Spanish, Russian, Greek, Arabic, Chinese, Japanese, Korean and other scripts all work. Diacritics are preserved. Any language that separates words with whitespace or punctuation is supported.

What's the difference between an n-gram and a phrase frame?

An n-gram is a fixed sequence of N words. A phrase frame in this tool is a core n-gram together with the distribution of words that appear just before and just after it everywhere it shows up in the text. Plain n-grams miss templatic patterns where each variant is rare; phrase frames catch them by collapsing every lead-in and every follow-on into one row.

Why am I seeing phrases that aren't really in my text?

You shouldn't — the tokenizer treats sentence-ending punctuation, markdown markers, brackets, slashes, and newlines as hard boundaries, so n-grams can't span them. If you paste something like site.com/us/en/about, each of site, com, us, en, about lives in its own segment and no multi-word n-gram will be built across them. If you see a phantom phrase anyway, it means two tokens really are adjacent in the source with only a space or comma between them.

Can I exclude punctuation, numbers, or stopwords?

Punctuation is already excluded by the tokenizer. Numbers are kept as tokens. There's no stopword filter — stopword lists are always language-specific, often wrong for your domain, and tend to hide the exact function-word patterns that phrase frames are good at surfacing. If you really need to remove common words, strip them from your text before pasting.

Tips

  • Use larger N (4–6) with phrase frames to find long templatic expressions. Small N tends to surface only generic high-frequency word combinations.
  • Click all fillers on any frame row to see every word that appeared on the left or right side of the core and how often.
  • Tokenization is Unicode-aware — Romanian diacritics, Cyrillic, Greek, CJK and other scripts all work.