Best AI Prompts for Data Science (2025)

Production-quality prompts for data scientists and analysts. Each prompt has slots for your schema, constraints, and expected output format, which is often the difference between code that runs and code that needs three rounds of fixes.

Works with Claude, ChatGPT, Gemini, or any LLM. Copy, fill in the brackets, paste.

🐍 Python & pandas Code Generation

Schema-first code request

Gives Claude your exact schema so it writes pandas against your real column names and dtypes instead of guessing.

I have a pandas DataFrame with the following schema:

[Paste the output of df.dtypes and df.head(3) here]

Task: [describe exactly what you want to calculate or transform]

Requirements:
- Use only pandas and numpy (no additional libraries)
- Return a function called `transform(df)` that takes the DataFrame and returns [describe expected output]
- Add a docstring with parameter types and return type
- Handle NaN values explicitly

Do not include example usage — just the function.
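
One way to produce the schema text this prompt asks for is a small helper that formats `df.dtypes` and `df.head(3)` as a single string. The sketch below is an illustration only and not part of the prompt: the helper name `schema_snippet` and the example columns are assumptions.

```python
import pandas as pd


def schema_snippet(df: pd.DataFrame, n_rows: int = 3) -> str:
    """Return df.dtypes and the first n_rows as one pasteable text block."""
    parts = [
        "df.dtypes:",
        df.dtypes.to_string(),
        "",
        f"df.head({n_rows}):",
        df.head(n_rows).to_string(),
    ]
    return "\n".join(parts)


# Hypothetical example data, only to show the shape of the output.
example = pd.DataFrame(
    {
        "customer_id": [101, 102, 103, 104],
        "amount": [19.99, None, 5.00, 42.50],
        "purchased_at": pd.to_datetime(
            ["2025-01-03", "2025-01-04", "2025-01-04", "2025-01-07"]
        ),
    }
)

snippet = schema_snippet(example)
print(snippet)
```

Paste the printed text into the first bracket of the prompt. If the sample rows contain sensitive values, replace them with representative fakes before sharing.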

Debugging data pipelines

Supplies the full traceback, the code, and the DataFrame state so Claude can pinpoint the root cause.

My pandas pipeline is throwing this error:

```
[paste the full traceback here]
```

Pipeline code:
```python
[paste your code here]
```

DataFrame info at the point of failure:
[paste df.info() and df.head(3) output]

Identify the root cause and provide the corrected code. Explain why the error occurs in one sentence.

Performance optimization

Rewrites slow loops and apply() calls into vectorized pandas.

Optimize this pandas code for performance. It currently processes [N] rows in [X] seconds.

```python
[paste your slow code here]
```

Schema: [describe relevant columns and types]

Rewrite using vectorized operations. If the operation cannot be vectorized, explain why and suggest the next best approach (e.g., Cython, Dask, chunked processing). Show the before/after with estimated speedup.
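
For reference, this is the kind of before/after rewrite the prompt is meant to elicit. It is a minimal sketch on made-up data: the column names, the discount rule, and the function name `line_total_slow` are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical order data; column names are illustrative only.
orders = pd.DataFrame(
    {
        "price": [10.0, 25.0, 8.0, 40.0],
        "quantity": [1, 12, 3, 10],
    }
)


def line_total_slow(row: pd.Series) -> float:
    """Row-wise version: 10% discount on lines with 10 or more units."""
    total = row["price"] * row["quantity"]
    return total * 0.9 if row["quantity"] >= 10 else total


# Before: apply() calls a Python function once per row.
slow = orders.apply(line_total_slow, axis=1)

# After: the same rule expressed on whole columns at once.
gross = orders["price"] * orders["quantity"]
fast = pd.Series(
    np.where(orders["quantity"] >= 10, gross * 0.9, gross),
    index=orders.index,
)

# Confirm the rewrite did not change the results.
assert np.allclose(slow, fast)
```

Keeping an equality check like the final line next to any optimized version is a cheap way to confirm that a suggested rewrite preserves behavior before you trust its speedup.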

🔍 Exploratory Data Analysis

Data quality audit

Generates a full quality-check script from your schema alone.

Generate a data quality audit script for a pandas DataFrame with this schema:

```
[paste df.dtypes output]
```

Business context: [e.g., "customer transaction data, one row per purchase"]

The script should check for and report:
1. Missing values (count + %) per column
2. Duplicate rows (exact and near-duplicate by key columns)
3. Outliers in numeric columns (IQR method)
4. Date range validity for timestamp columns
5. Cardinality issues (columns that should be categorical but have high cardinality)
6. Foreign key integrity for ID columns [list expected key relationships if any]

Output a single Python function `audit(df)` that returns a dict of findings.
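
To make the expected output concrete, here is a partial sketch of what an `audit(df)` function might look like. It covers only checks 1 to 3 (missing values, exact duplicates, IQR outliers); the date, cardinality, and key checks depend on your schema and are left out. The sample data is hypothetical.

```python
import pandas as pd


def audit(df: pd.DataFrame) -> dict:
    """Partial audit: missing values, exact duplicates, IQR outliers."""
    findings = {}
    n_rows = max(len(df), 1)  # avoid division by zero on an empty frame

    # 1. Missing values per column as count and percentage.
    missing = df.isna().sum()
    findings["missing"] = {
        col: {"count": int(n), "pct": float(round(100 * n / n_rows, 2))}
        for col, n in missing.items()
        if n > 0
    }

    # 2. Exact duplicate rows (every column identical to an earlier row).
    findings["duplicate_rows"] = int(df.duplicated().sum())

    # 3. Outliers in numeric columns: outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
    outliers = {}
    for col in df.select_dtypes(include="number").columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)
        outliers[col] = int(mask.sum())
    findings["iqr_outliers"] = outliers
    return findings


# Hypothetical sample: one missing amount, one repeated row, one extreme value.
sample = pd.DataFrame(
    {
        "order_id": [1, 2, 3, 4, 5, 6, 7, 7],
        "amount": [10.0, 12.0, 11.0, None, 13.0, 500.0, 12.0, 12.0],
    }
)
report = audit(sample)
print(report)
```

Returning a plain dict keeps the findings easy to log, diff between runs, or serialize to JSON.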

Targeted EDA visualization

Gets the 3 most informative plots for your specific column types.

I am doing EDA on a dataset with these columns:
- Target variable: [name, type, e.g. "churn: boolean"]
- Key features: [list 3-5 columns with types]
- Row count: [N rows]
- Domain: [e.g. "e-commerce transactions"]

For each of the following pairs, write seaborn/matplotlib code for the single most informative plot:
1. [feature_1] vs [target]
2. [feature_2] vs [feature_3]
3. [feature_4] distribution overall

Each code block should be standalone and include `plt.tight_layout()` and `plt.title()`. No prose — just code blocks.

Hypothesis generation from stats

Turns your describe() output into ranked hypotheses.

Here are summary statistics for my dataset:

```
[paste df.describe() output, plus df["column"].value_counts() output for each key categorical column]
```

Business context: [1-2 sentences on what this data is and the main question you are trying to answer]

Generate 5 data-driven hypotheses worth testing, ranked by likely business impact. For each hypothesis:
- State it as a falsifiable claim
- Identify the columns and statistical test to use
- Estimate the sample size needed for 80% power at α=0.05, and state the effect size you assumed
- Flag any data quality issues that could invalidate the test
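
A sample-size estimate is only meaningful for a stated effect size, so it helps to sanity-check the model's numbers yourself. The sketch below uses the standard normal approximation for a two-sided, two-sample comparison of means with a standardized effect size (Cohen's d). The function name `n_per_group` is an assumption, and exact t-based calculations give slightly larger values.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-group sample size for a two-sided two-sample test.

    Uses n = 2 * (z_{1 - alpha/2} + z_{power})^2 / d^2, rounded up.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)


# Smaller effects need far more data: halving d roughly quadruples n.
for d in (0.8, 0.5, 0.2):
    print(f"d={d}: about {n_per_group(d)} rows per group")
```

If a hypothesis needs more rows per group than your dataset has, that is a reason to rank it lower regardless of its business impact.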

🤖 Machine Learning

Model result interpretation

Turns raw metrics into actionable improvement suggestions.

I trained a [model type, e.g. "XGBoost classifier"] for [task, e.g. "predicting 30-day churn"].

Results on the held-out test set:
```
[paste classification_report(), confusion_matrix(), and AUC score]
```

Dataset characteristics:
- Train size: [N], Test size: [M]
- Class distribution: [e.g. "89% negative, 11% positive"]
- Feature count: [N features]

Answer:
1. Is this performance good for this type of problem? Give a benchmark.
2. What is the model's most likely failure mode given these metrics?
3. List the top 3 concrete improvements to try, ordered by expected impact.
4. Is there evidence of data leakage? What would I check?
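
Before asking whether performance is "good", it is worth computing the trivial baselines yourself, since accuracy on an imbalanced test set can look strong while adding little. The following sketch derives the usual metrics from confusion-matrix counts. The counts and the function name `summarize_confusion` are hypothetical.

```python
def summarize_confusion(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Derive headline metrics and a majority-class baseline from counts.

    Argument order matches sklearn's confusion_matrix(...).ravel()
    for a binary problem: tn, fp, fn, tp.
    """
    total = tn + fp + fn + tp
    positives = tp + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / positives if positives else 0.0
    return {
        "accuracy": (tp + tn) / total,
        # Accuracy of always predicting the more common class.
        "majority_baseline_accuracy": max(positives, total - positives) / total,
        "precision": precision,
        # Precision of flagging rows at random equals the positive rate.
        "positive_rate": positives / total,
        "recall": recall,
    }


# Hypothetical counts: 1,000 test rows, 10% positive.
metrics = summarize_confusion(tn=862, fp=38, fn=40, tp=60)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this made-up case accuracy is only about two points above the majority baseline, while precision is roughly six times the positive rate. Including both comparisons in the prompt gives the model a better basis for judging the result.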

Feature engineering brainstorm

Generates domain-specific feature ideas from your column list.

I am building a model to predict [target variable] using data from [domain].

Raw features available:
[list your current columns with brief descriptions]

Generate 10 engineered features that could improve model performance. For each:
- Feature name (snake_case)
- Computation (pandas code or formula)
- Hypothesis for why it would be predictive
- Whether it requires target encoding, one-hot encoding, or can be used as-is

Prioritize features that capture [recency/frequency/interaction/seasonality — pick what's relevant] effects.
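
As an example of the "Computation" field, recency and frequency features usually reduce to a groupby over an event table. This sketch assumes a hypothetical transactions table and a fixed snapshot date; the table, column names, and feature names are illustrative.

```python
import pandas as pd

# Hypothetical transaction log, one row per purchase.
tx = pd.DataFrame(
    {
        "customer_id": [1, 1, 2, 2, 2, 3],
        "purchased_at": pd.to_datetime(
            [
                "2025-01-10", "2025-01-25",
                "2025-01-05", "2025-01-20", "2025-01-30",
                "2024-12-15",
            ]
        ),
        "amount": [20.0, 40.0, 10.0, 10.0, 40.0, 15.0],
    }
)

# Features are computed "as of" this date, using only earlier events.
snapshot = pd.Timestamp("2025-02-01")
history = tx[tx["purchased_at"] < snapshot]
grouped = history.groupby("customer_id")

features = pd.DataFrame(
    {
        # Recency: days between the snapshot and the last purchase.
        "days_since_last_purchase": (snapshot - grouped["purchased_at"].max()).dt.days,
        # Frequency: number of purchases before the snapshot.
        "purchase_count": grouped["purchased_at"].count(),
        # Monetary: mean purchase amount.
        "avg_order_value": grouped["amount"].mean(),
    }
)
print(features)
```

Filtering to events before the snapshot date matters: features built from events that happen after the prediction point are a common source of target leakage.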

Sklearn pipeline builder

Generates a production-ready sklearn Pipeline from your spec.

Build a scikit-learn Pipeline for this task:

Goal: [e.g. "binary classification on tabular data"]

Features:
- Numeric columns: [list them] — strategy: [e.g. "StandardScaler + median imputation"]
- Categorical columns (low cardinality): [list them] — strategy: OneHotEncoder
- Categorical columns (high cardinality): [list them] — strategy: [e.g. "OrdinalEncoder or target encoding"]
- Date columns: [list them] — extract: [e.g. "day_of_week, month, days_since_event"]

Model: [e.g. "start with LogisticRegression as baseline, then try LightGBM"]

Include:
- ColumnTransformer for preprocessing
- Cross-validation with StratifiedKFold(5)
- GridSearchCV or RandomizedSearchCV for the top 3 hyperparameters
- Joblib serialization at the end

Return the full pipeline code as a single Python file.
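
For orientation, a stripped-down version of the structure this prompt requests might look like the sketch below. It covers the `ColumnTransformer` and `StratifiedKFold` items only; the hyperparameter search and joblib serialization are omitted for brevity. The data is synthetic and every column name is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data with hypothetical column names, for illustration only.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame(
    {
        "tenure_months": rng.integers(1, 60, size=n).astype(float),
        "monthly_spend": rng.normal(50, 15, size=n),
        "plan": rng.choice(["basic", "pro", "team"], size=n),
    }
)
X.loc[::25, "monthly_spend"] = np.nan  # inject a few missing values
y = (X["tenure_months"] + rng.normal(0, 10, size=n) < 20).astype(int)

numeric_cols = ["tenure_months", "monthly_spend"]
categorical_cols = ["plan"]

# Numeric columns: median imputation, then scaling.
numeric_steps = Pipeline(
    [
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]
)

# Each column group gets its own preprocessing branch.
preprocess = ColumnTransformer(
    [
        ("num", numeric_steps, numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ]
)

# Preprocessing lives inside the pipeline so it is refit on each CV fold.
model = Pipeline(
    [
        ("preprocess", preprocess),
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print("ROC AUC per fold:", scores.round(3))
```

Putting imputation and scaling inside the `Pipeline`, rather than applying them to the full dataset first, keeps statistics from the validation folds out of training.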

🗄️ SQL & Data Querying

Query optimization request

Gets specific rewrites for slow queries with your actual schema.

Optimize this SQL query. It currently takes [X seconds] on a table with [N rows].

```sql
[paste your query]
```

Table schemas:
```sql
[paste CREATE TABLE statements or describe column names + types + approximate cardinality]
```

Existing indexes: [list them or say "none"]

EXPLAIN output (if available):
```
[paste EXPLAIN or EXPLAIN ANALYZE output]
```

Return: (1) The optimized query, (2) What changed and why, (3) What index(es) to add if any.
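
If you do not have plan output yet, most databases can produce it in one statement. The sketch below uses SQLite through Python's built-in `sqlite3` module purely because it needs no setup. The table, index name, and query are hypothetical, and plan syntax differs in PostgreSQL, MySQL, BigQuery, and Snowflake.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, event_type TEXT)"
)
conn.executemany(
    "INSERT INTO events (user_id, event_type) VALUES (?, ?)",
    [(i % 100, "purchase") for i in range(1000)],
)

query = "SELECT * FROM events WHERE user_id = 42"


def plan(sql: str) -> str:
    """Return the EXPLAIN QUERY PLAN detail text for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # Each row is (id, parent, notused, detail); keep the readable part.
    return "\n".join(row[3] for row in rows)


plan_before = plan(query)  # no index on user_id yet: full table scan
conn.execute("CREATE INDEX idx_events_user_id ON events (user_id)")
plan_after = plan(query)  # the planner can now search via the index

print("before:", plan_before)
print("after: ", plan_after)
```

Pasting both plans into the prompt, along with the index definitions, lets the model reason about what the database actually does rather than what the query text suggests.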

Cohort analysis query

Generates week-over-week or month-over-month retention SQL.

Write a SQL query for cohort retention analysis.

Table: `[table name]`
Columns:
- User ID: `[column name]` (type: [int/varchar])
- Event timestamp: `[column name]` (type: [timestamp/date])
- Event type: `[column name]` — filter to: [e.g. "'purchase'"]

Database dialect: [PostgreSQL / BigQuery / MySQL / Snowflake]

Cohort definition: users grouped by their [week/month] of first [event type]
Retention definition: a user is "retained" in period N if they performed the event at least once in that [week/month]

Output: a cohort × period matrix with absolute user counts and retention percentage.
Include comments explaining each CTE.
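
To show the shape of query this prompt should return, here is a small weekly-cohort sketch run against an in-memory SQLite database through Python's `sqlite3` module. SQLite is used only so the example is self-contained; the date functions are SQLite-specific and will differ in the dialects listed above. The table name, columns, and rows are hypothetical, and the result is in long form (one row per cohort and week) rather than a pivoted matrix.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, event_ts TEXT, event_type TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (1, "2025-01-06", "purchase"),
        (1, "2025-01-14", "purchase"),
        (2, "2025-01-08", "purchase"),
        (2, "2025-01-09", "purchase"),
        (2, "2025-01-16", "view"),  # not a purchase, so it is ignored
        (3, "2025-01-15", "purchase"),
        (3, "2025-01-22", "purchase"),
    ],
)

RETENTION_SQL = """
WITH purchases AS (
    -- Keep purchases only and truncate each date to its Monday.
    SELECT user_id, date(event_ts, 'weekday 0', '-6 days') AS event_week
    FROM events
    WHERE event_type = 'purchase'
),
cohorts AS (
    -- A user's cohort is the week of their first purchase.
    SELECT user_id, MIN(event_week) AS cohort_week
    FROM purchases
    GROUP BY user_id
),
cohort_sizes AS (
    -- Denominator for the retention percentage.
    SELECT cohort_week, COUNT(*) AS cohort_size
    FROM cohorts
    GROUP BY cohort_week
),
activity AS (
    -- One row per user per active week, as an offset from the cohort week.
    SELECT DISTINCT
        c.cohort_week,
        c.user_id,
        CAST((julianday(p.event_week) - julianday(c.cohort_week)) / 7 AS INTEGER)
            AS week_number
    FROM purchases AS p
    JOIN cohorts AS c ON c.user_id = p.user_id
)
SELECT
    a.cohort_week,
    a.week_number,
    COUNT(*) AS users,
    ROUND(100.0 * COUNT(*) / s.cohort_size, 1) AS retention_pct
FROM activity AS a
JOIN cohort_sizes AS s ON s.cohort_week = a.cohort_week
GROUP BY a.cohort_week, a.week_number, s.cohort_size
ORDER BY a.cohort_week, a.week_number
"""

rows = conn.execute(RETENTION_SQL).fetchall()
for row in rows:
    print(row)
```

Running a generated query against a handful of hand-checked rows like these is a quick way to verify the cohort and retention definitions before pointing it at production data.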

Frequently Asked Questions

What are the best AI prompts for data science?

The best data science prompts share four traits: (1) They specify your data schema upfront (column names, types, and sample rows) so the AI writes code against your real columns instead of guessing. (2) They ask for a brief explanation where you need one, such as the root cause of a bug, so you understand and can maintain what gets generated. (3) They constrain the output format: "return only a Python function, no prose" avoids lengthy explanations you would otherwise have to strip. (4) They reference real libraries (pandas, scikit-learn, seaborn) rather than abstract concepts, keeping suggestions grounded in runnable code. Vague prompts like "analyze my data" produce generic answers; schema-specific prompts produce code that is much closer to copy-paste-ready.

How do I use AI to help write Python code for data analysis?

For Python data analysis, always include: (1) Your DataFrame schema — paste df.dtypes and df.head(3) output, or describe columns explicitly. (2) Your exact goal — "calculate 30-day rolling average of column X grouped by column Y" is far more useful than "analyze the trend". (3) Constraints — "use only pandas and numpy, no additional libraries" prevents dependency creep. (4) Expected output format — "return a DataFrame with columns [a, b, c]" vs "print to console". With this context, Claude can write production-quality pandas/numpy/sklearn code instead of pseudocode.

Can AI help me interpret data science results?

Yes, AI is excellent at interpreting statistical output and model results when you provide the raw numbers. Paste your model metrics, confusion matrix, or statistical test output and ask specific questions: "This logistic regression has AUC=0.72 and precision=0.61 on a 90/10 class split — is this good? What are the top 3 ways to improve it?" or "My residual plot shows a fan shape — what does this indicate and how do I fix it?" The key is giving the AI concrete output to reason about rather than asking abstract questions about techniques.

What prompts work best for exploratory data analysis (EDA)?

The most effective EDA prompts: (1) Schema-first audit — "Here is my DataFrame schema: [paste dtypes]. What data quality issues should I check for before analysis, and give me the pandas code to check each one." (2) Targeted visualization — "I have columns X (continuous) and Y (categorical) with Z rows. Write seaborn code for the 3 most informative plots for understanding their relationship." (3) Hypothesis generation — "Here are summary statistics for my dataset: [paste]. List 5 hypotheses worth testing, ordered by likely business impact." EDA prompts that include actual data stats consistently outperform vague requests.

How do I use AI to improve my SQL queries?

For SQL improvement, always paste: (1) Your current query, (2) The table schemas (CREATE TABLE statements or column names + types), (3) What the query is supposed to return, and (4) What is wrong or slow about it. Then ask specifically: "Rewrite this query to avoid the correlated subquery in the WHERE clause" or "This query scans 50M rows — what indexes or query rewrites would reduce that?" Generic "make my SQL better" prompts miss the context needed for meaningful improvements. Include EXPLAIN output if you have it.
