All posts
Feature12 min read

Ask Your CSV Data Questions — AI Analysis That Runs in Your Browser

The Ask your data panel lets you query any CSV in plain English — filters, aggregates, group-bys, derived columns, and string transforms — without sending a single row to the LLM. Examples for analysts, finance, HR, ops, data entry, and more.

The viewer has a chat panel. Press Ask in the toolbar, type a question in plain English, and get a filtered table, a bar chart, or a scalar stat back — all computed locally in your browser. The LLM authors a query plan; a deterministic JavaScript executor runs it against your data. No SQL. No Python. No upload. This post walks through the full capability set with real examples for analysts, finance teams, HR, operations, data entry staff, and developers.

Your row data never leaves your browser. The model sees only column names and a small sample of values — enough to understand the schema, not enough to read your data. Every calculation in this post runs locally. Nothing is uploaded.

How it works under the hood

When you type a question, the panel sends the LLM a schema digest — column names, their inferred type (number, text, date, boolean), a null-fraction estimate, and up to five sample values per column. For numeric columns it also includes min, max, and mean. That is the entire payload. A 500,000-row file produces exactly the same schema digest as a 50-row file. Your data stays where it is.

The LLM returns a JSON plan — a small pipeline of typed steps. A plan for "which region has the most orders" might look like: filter where order_date is after 2024-01-01, then group by Region and count rows, then sort descending. The local executor runs those steps against your data using the same deterministic JavaScript that powers the rest of the viewer. If the plan references a column that does not exist, the validator catches it before any execution and returns a readable error instead of silently wrong numbers.

The plan is always visible. Click "Show plan" beneath any result and you see the exact steps the model generated. This matters for regulated environments — you are not trusting a black box, you are trusting a transparent, inspectable pipeline. You can also re-ask with corrections ("use Revenue instead of Sales") and watch the plan change.

Ask your data — how the plan is built
Sent to LLMSchema digest only
"Which category has the highest average margin?"
Producttexte.g. Green Protect Insect Spray 500ml
Categorytexte.g. Pest Control, Garden, Household
Stocknumber0 – 847
Sale pricenumber€4.25 – €129.50
Margin %number51.35 – 52.17
✕ Not included— your actual row data
LLM returnsJSON plan
GROUP BY
Group by Category
agg: avg_margin = avg(Margin %)
SORT
Sort avg_margin descending
Executed locally — 1,065 rows in your browser
▬ Bar chart
The full pipeline for "which category has the highest average margin?". Left side shows exactly what is sent to the LLM — column names, types, and sample values, never your actual rows. Right side shows the plan the LLM returns and the local execution step.

Pinpoint the exact column with @ tagging

When a file has multiple columns with similar names — "Sale price", "List price", "Cost price" — a plain question like "what is the average price?" leaves the model guessing. The @ column picker removes that ambiguity. Type @ in the chat input and a fuzzy-search dropdown appears showing every column in your file. Arrow-key through the matches, press Enter or Tab, and the column name is inserted as a highlighted chip. Multi-word names are automatically quoted.

Ask your data — column picker
1 — Type @ to pick a column
What is the average @Sale
@Sale price↵ select
@Sale date
@Salesperson
2 — Column resolved as chip in your question
What is the average @Sale price by region?
Typing @Sale surfaces matching columns. Selecting "Sale price" inserts it as a chip — the question is now unambiguous before it ever reaches the model.

The chip also appears in your conversation history so you can see at a glance which column each question targeted. When you re-ask or follow up, you can @ a different column without retyping the whole question. This makes exploratory analysis faster — iterate through "Sale price" vs "List price" vs "Cost price" in three keystrokes each.

For analysts — aggregates, group-bys, and ranked bar charts

Group-by questions are the bread and butter for anyone doing data analysis. "Which category has the highest average margin?" generates a group_by step with an avg aggregate, followed by a sort. The result renders as a ranked bar chart with the group labels on the left and the computed value on the right. Clicking any bar drills down into the source rows that make it up, applying a live filter to the main table so you can keep investigating from there.

Ask your dataGemini
Which category has the highest average margin?
Show plan
group_by by=["Category"] agg={avg_margin: avg(Margin %)}
→ sort by=[avg_margin desc]
Garden has the highest average margin at 52.3%, followed by Pest Control at 51.9%.
4 Categorys by avg_margin
Garden
52.3
Pest Control
51.9
Household
51.4
Outdoor
50.8
Follow up
Sort by avg_margin ascending insteadShow only the top 3 categoriesBreak that down by supplier
A group-by question: the LLM emits a plan with a group_by step and an avg aggregate, then sorts descending. The executor renders a ranked bar chart. Clicking any bar filters the main table to that category's rows.

Five aggregate functions are supported: count, sum, avg, min, and max. The model picks the right one from context — "how many" produces count, "total" produces sum, "average" produces avg. You can stack multiple aggregates in a single question: "what is the min, max, and average price per category" generates a group_by with three computed columns in one plan. The result renders as a table with each aggregate in its own column, exportable as TSV or CSV.

After each answer the panel surfaces up to three follow-up suggestions computed locally — no extra LLM call. A group-by result suggests sorting the other direction, showing only the top five, or breaking down by a second categorical column. A filter result suggests aggregating the filtered rows or grouping them. Clicking any suggestion fires it as the next question, so a line of investigation feels more like a conversation than a series of disconnected queries.

For finance and accounting — totals, cost analysis, and period filters

Finance teams spend a lot of time answering the same question in different slices: total cost this month, total cost by supplier, cost variance versus last quarter. These map directly onto the aggregate and group-by operations. Scalar questions — "what is the total purchase cost?" — render as a stat card. The card picks up the currency symbol from the source column and formats the result accordingly, so a column full of "€6.26" values produces a Euro-formatted total, not a raw number.

Ask your dataGemini
What is the total purchase cost across all products?
Show plan
aggregate total_cost=sum(Purchase Price)
1 result rowTSV · CSV · ↓CSV
total_cost
#€62,847.30
Follow up
What is the total purchase cost by supplier?What is the average purchase price per category?
A scalar aggregate over a currency column. The executor detects the € symbol from the sample values and formats the result. No manual formatting needed.

Period filters pair with aggregates naturally. "What is the total revenue for orders placed between January and March?" generates a filter on the date column followed by an aggregate. The date comparison uses locale-aware parsing — ISO dates, European dd/mm/yyyy formats, and month-name formats all work. You do not need to know what format your date column uses; the executor figures it out.

For cost variance work, the derive step covers the gap. "Add a column showing the difference between sale price and purchase price" generates a subtract expression. "Add a column showing VAT amount at 21%" generates a multiply. Once added to the table, the new column can be summed, averaged, filtered, and exported exactly like any source column. It is the spreadsheet formula workflow, minus the spreadsheet.

For HR and people analytics — headcount, salary bands, and attrition

HR CSV exports from payroll systems, ATS tools, or HRIS platforms are often the messiest data an analyst encounters — inconsistent department names, salary stored as text with currency symbols, hire dates in four different formats. The Ask panel handles all of these gracefully because the schema digest captures samples, not assumptions. If the model sees "€45,000" as a sample value for the Salary column, it knows the column needs currency stripping before arithmetic.

Ask your dataGemini
How many employees are in each department and what is their average salary?
Show plan
group_by by=["Department"] agg={headcount: count(*), avg_salary: avg(Salary)}
→ sort by=[headcount desc]
Engineering is the largest department with 47 employees and an average salary of €68,400.
5 Departments by headcount
Engineering
47
Sales
31
Operations
28
Marketing
19
Finance
14
Follow up
Which department has the highest average salary?How many employees joined in the last 12 months?Break engineering down by role
A headcount and average salary breakdown by department. Group-by with two aggregates in one question — headcount (count) and average salary (avg) — rendered as a bar chart ranked by headcount.

Attrition analysis is a filter-then-count workflow: "how many employees left in 2024?" filters on the termination date column (isNotEmpty plus dateAfter), then counts. "What is the attrition rate by department?" adds a group_by on top of that filter. The plan is transparent — you can verify what "left" means before trusting the number.

Salary banding is a derive use case. "Add a column showing whether each employee is above or below the department average salary" requires a two-step question — first ask for the average per department, note the values, then ask for the derive column. The current derive step supports arithmetic expressions, so "is salary above 1.2 times the company average" becomes a single divide and compare. The result can be added to the table and used as a filter for subsequent questions.

For operations and logistics — inventory alerts and stock analysis

Operations teams are usually looking for exceptions — products below reorder level, shipments delayed past a threshold, suppliers missing delivery windows. These are filter questions with numeric or date comparisons. "Find products where stock is below 10" generates a single filter step. "Find products where stock is below 10 and lead time is more than 14 days" generates a two-condition AND filter. Results land in a mini table in the chat panel; you can click the drill-down arrow on any row to jump to that row in the main table.

Ask your dataGemini
Find products where stock is below 10 and lead time is more than 14 days
Show plan
filter where Stock < 10 AND Lead_Time_Days > 14
4 rows
TSV · CSV
ProductCategoryStockLead_Time_Days
COMPO BIO Granuplant 10LGarden921
Weitech Pest Repeller UltraPest Control318
Catchmaster Pantry Pest GlueHousehold716
Green Protect Trap RefillPest Control228
Follow up
What is the total stock value for these products?Which supplier covers most of these products?
A two-condition filter for low-stock, long-lead-time products. The result is a mini table with one row per matching product. Clicking the arrow on any row jumps to it in the main viewer.

Inventory valuation is a derive question. "Add a column showing stock value as stock multiplied by purchase price" generates a multiply expression and runs it over every row. Once added to the table you can immediately ask "what is the total stock value by category?" and get a breakdown across the newly added column. Questions chain naturally — the schema digest rebuilds whenever you add a column, so derived columns are immediately available for the next question.

For data entry and QA — finding gaps, fixing formats, and cleaning text

Data entry work is mostly about finding what is wrong before it causes a problem downstream. Missing values, inconsistent capitalisation, barcodes in the wrong format, duplicate rows. The Ask panel handles all of these without requiring anyone to know SQL or write a script. The isEmpty and isNotEmpty operators find blanks; contains and regex_replace handle format normalisation; the derive step with string functions covers cleanup transforms.

Ask your dataGemini
Show me all products that are missing a barcode
Show plan
filter where Barcode isEmpty
→ select cols=["Product", "Category", "SKU", "Barcode"]
3 rows
TSV · CSV
ProductCategorySKUBarcode
COMPO SANA Cactus Soil 5LGardenCS-5L
Luxan ECO Ant Powder 100gPest ControlLA-100
Pest Stop Rodent Bait StationPest ControlPS-RBS
Follow up
How many products are missing barcodes per category?Show products missing both barcode and SKU
A QA question targeting missing barcodes. The isEmpty operator filters to rows where the Barcode column has no value — zero knowledge of SQL required.

Text cleanup is where the string functions shine for data entry teams. "Add a column with the product name in title case and trimmed of extra spaces" generates a derive step with trim and a case conversion in sequence. "Extract the size from the product name — the part in parentheses" becomes a regex_replace that captures the parenthetical segment. "Combine SKU and barcode into a single identifier separated by a dash" becomes a concat. Each cleaned column can be added to the table and exported, replacing the original messy column in your downstream system.

Ask your dataGemini
Add a column that combines SKU and Barcode into a single product code separated by a dash
Show plan
derive Product_Code = concat(SKU, "-", Barcode)
3 rows
Add col
TSV · CSV
ProductSKUBarcodeProduct_CodeNEW
Green Protect Insect Spray 500mlGP-5005060525035562GP-500-5060525035562
Pest Stop Quick-Snap Rat TrapPS-QS5014055040222PS-QS-5014055040222
COMPO BIO Granuplant 10LCB-10L4003949000247CB-10L-4003949000247
Follow up
Trim whitespace from the SKU columnConvert all product names to uppercase
A string derive question that combines two columns. The concat function joins SKU and Barcode with a dash separator. Add col writes the result into the table as a new column ready for export.

For sales and marketing — rankings, top-N, and conversion analysis

Sales data questions tend to fall into two shapes: who is performing best (top-N ranking) and where are deals coming from (group-by breakdown). Top-N is a group-by followed by sort and limit. "Show me the top 10 products by total sales" generates a group_by with sum, a sort descending, and a limit 10. The result renders as a bar chart so the ranking is immediately visual. Hovering a bar shows the exact value and its share of the total.

Ask your dataGemini
Show the top 5 products by total sale value
Show plan
group_by by=["Product"] agg={total_sales: sum(Sale price)}
→ sort by=[total_sales desc]
→ limit n=5
5 Products by total_sales
COMPO BIO Granuplant 10L
14,820
Green Protect Insect Spray 500ml
11,204
Weitech Telescopic Fly Swatter
9,870
Pest Stop Electronic Fly Killer
8,340
Catchmaster Catch Zone Roll
7,125
Follow up
Show the bottom 5 products by total sale valueWhat is the average sale value per category?Which product has the highest margin among these?
A top-5 ranking query: group_by + sum + sort desc + limit 5. The bar chart makes the ranking immediately visual. Each bar is clickable to drill into that product's underlying orders.

Marketing questions often involve date filtering combined with group-by: "how many leads came in per channel last month?" filters on a date range, then groups by the channel column, then counts. "What is the conversion rate per campaign?" requires a derive step — divide converted by total — followed by a group-by if the campaign column is present. The key insight is that these questions are just combinations of the same small set of operations; the model handles the composition automatically.

For data scientists — distributions, distinct values, and custom derivations

Data scientists tend to use the Ask panel for quick exploration before writing proper analysis code. "What are the distinct values in the Status column?" runs a distinct step and returns a deduplicated list — useful for spotting unexpected categories before you build a pivot table. "How many rows per status?" chains distinct into a group-by count. "What percentage of rows have a null email?" uses an aggregate with an isEmpty filter to get the count, then divides.

Derived columns replace the usual "add a quick formula column in Excel, check the distribution, delete it" loop. "Add a column that flags rows where margin is more than two standard deviations from the mean" is a more complex derive — you would ask for the mean and standard deviation first (two separate aggregate questions), note the values, then ask for the flag column using arithmetic on those thresholds. Each step is transparent and the plan shows the exact expression.

Ask your dataGemini
Add a column that flags products where purchase price is more than 1.5 times the average purchase price (average is €18.40)
Show plan
derive Price_Flag = "Purchase Price" / 18.40
4 rows
Add col
TSV · CSV
ProductPurchase PricePrice_FlagNEW
COMPO BIO Granuplant 10L€62.563.40
Weitech Pest Repeller Ultra€38.202.08
Catchmaster Flying Insect Trap€5.370.29
Green Protect Insect Spray 500ml€6.260.34
Follow up
How many products have a price flag above 1.5?What is the average margin for high-price products?
An outlier-detection derive step: rows where a numeric value exceeds 1.5× the column mean are flagged. The new Flag column is added to the table and can immediately be used as a filter.

The column stats bar integrates automatically with the chat panel. When a question involves an aggregate over a column — "average margin", "total stock" — the stats bar jumps to that column and shows count, distinct, sum, avg, min, max, and median without a separate click. When you click a drill-down arrow on a bar chart row, the stats bar follows to the most relevant column. You can override it manually from the dropdown at any point.

Viewer · footer
Σ
count1,065distinct48sum55,049avg51.69min51.35max52.17median51.7
After "which category has the highest average margin?" the stats bar automatically jumps to the Margin % column, surfacing the distribution at a glance.

Filters and comparisons — the full operator set

The executor supports fourteen comparison operators across text, numeric, and date columns. Multiple conditions combine with AND by default; ask for OR explicitly. Here is the full list of what you can express in a plain English question:

  • eq / neq — exact match or exclusion ("where status is active", "where region is not EMEA")
  • contains / startsWith / endsWith — substring matching ("where email contains @gmail")
  • in — membership check ("where country is in UK, France, Germany")
  • gt / gte / lt / lte — numeric comparisons ("where price is greater than 100")
  • between — range check on numbers or dates ("where order date is between Jan 1 and Mar 31")
  • isEmpty / isNotEmpty — null and blank detection ("where barcode is missing")
  • dateBefore / dateAfter / dateBetween — date-aware comparisons with locale-tolerant parsing

String functions — the complete list

String functions are used in derive steps to transform text columns. Describe the transformation; the model picks the right function and arguments. Every derived column can be added to the table with Add col and then filtered, sorted, and exported.

  • replace(col, old, new) — exact string substitution ("replace all occurrences of GmbH with Ltd")
  • regex_replace(col, pattern, replacement) — regex-based substitution ("extract the numeric part of the SKU")
  • trim(col) — strip leading and trailing whitespace ("clean up the product name column")
  • upper(col) / lower(col) — case conversion ("convert country codes to uppercase")
  • concat(col1, separator, col2) — join two columns ("combine first name and last name with a space")
  • split(col, delimiter, index) — extract a segment ("take the first part of the email address before the @")

Bring your own API key

The feature requires a Google Gemini key or an OpenRouter key. Paste it into the AI settings panel (the gear icon in the chat header); it is stored only in your browser's local storage and never sent to csvdiff.app. Requests go directly from your browser to Google or OpenRouter. We never see your key, your questions, or your data. This matters for regulated industries — finance, healthcare, legal — where even metadata about what you are querying needs to stay inside your perimeter.

Gemini Flash is the default on the Gemini path — fast and cheap for this workload. On OpenRouter you can pick any model that supports JSON-mode output. Most questions consume fewer than 1,000 input tokens because the schema digest is compact. A session of twenty questions on a fifty-column file typically costs a few cents at current model pricing. The session history is preserved in browser storage so you can close the panel, do other work, and continue the conversation when you come back.

Honest limitations

A few things to set expectations correctly. The feature does not write back to your file — derive results have to be explicitly added with Add col, and filter results do not modify the underlying data. It does not join across multiple files in a single question (open two tabs and use the diff tool for cross-file comparisons). Very ambiguous column names can trip the model; the UI surfaces a "Used X — switch to Y?" chip when it detects an auto-pick among multiple plausible columns, so you can correct a wrong guess with one click. And it will not hallucinate columns it cannot see — if you reference a column that does not exist, the validator rejects the plan before any execution and returns a specific error message.

Complex statistical operations — standard deviation, percentile ranking, rolling averages, cohort analysis — are not supported yet. The operation set is intentionally small: filter, aggregate, group_by, sort, limit, distinct, select, and derive. Everything the executor does is deterministic and inspectable, which is a deliberate trade-off against supporting an unbounded set of operations that would be harder to validate and explain.

To try it: open any CSV in the viewer at csvdiff.app/view, press Ask in the toolbar, add your Gemini or OpenRouter key in the settings panel, and type any question from this post. The schema digest is built from your file locally — nothing is uploaded before you even ask.

Try it yourself

Ready to diff your files?

Upload two CSV files and see the differences in seconds. 100% client-side — your data never leaves the browser.

Start comparing →

Ready to diff?

Drop your files, see the deltas, export the merge. Takes 30 seconds.