# Usage and Token Budgeting
## How Tokens Are Spent
Tokens are consumed based on input length, output length, and tool usage. Long prompts and repeated context increase usage quickly.
### Simple Example
If you paste a large file and ask for a rewrite, you pay for:
1. The pasted file (input tokens)
2. The model reasoning about it
3. The full rewritten output (output tokens)
If you do that several times in a row, you can burn a large portion of your daily or monthly allowance quickly.
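The arithmetic above can be sketched in a few lines. This is an illustrative estimate only: the ~4 characters-per-token ratio is a common rule of thumb, not a real tokenizer, and all function names and sizes here are made up for the example.

```python
# Rough token-cost sketch for a paste-and-rewrite request.
# Assumption: ~4 characters per token (rule of thumb, not a tokenizer).

CHARS_PER_TOKEN = 4

def estimate_tokens(text: str) -> int:
    """Very rough token estimate from character count."""
    return max(1, len(text) // CHARS_PER_TOKEN)

def rewrite_cost(file_text: str, rewritten_text: str) -> int:
    """Input tokens (the pasted file) + output tokens (the full rewrite)."""
    return estimate_tokens(file_text) + estimate_tokens(rewritten_text)

pasted = "x" * 40_000      # a ~40 KB file: roughly 10,000 input tokens
rewritten = "y" * 40_000   # a full rewrite of similar size

one_pass = rewrite_cost(pasted, rewritten)
print(one_pass)            # ~20,000 tokens for a single pass
print(one_pass * 5)        # five retries in a row: ~100,000 tokens
```

Note that reasoning tokens (step 2 in the list) come on top of this, so real usage is higher than the input-plus-output floor sketched here.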
## Chat Context (Why New Chats Matter)
Each chat keeps a running memory of what you said. That memory grows over time and gets sent back to the model, which costs more tokens and can slow responses.
### Best Practice
- Start a new chat for each topic or task.
- Do not keep one chat open for weeks and switch subjects.
### Why This Matters
- Bigger context = more tokens used per message.
- Larger contexts can slow response time.
- Old context can confuse the model and reduce answer quality.
### Use Summaries To Reset Context
When a chat gets long, ask for a short summary and start a new chat using that summary. This keeps context small and saves tokens.
#### Example: Resetting A Long Chat
1. In the long chat, ask: "Summarize the current state and decisions in 8 bullets."
2. Copy the summary into a new chat.
3. Continue from there with a smaller context.
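The savings from this reset pattern can be shown with a small simulation. The per-message token counts and the assumption that each turn resends the full chat history are simplifications for illustration.

```python
# Why long chats cost more: each new message resends the accumulated
# context. All token counts are illustrative.

def chat_cost(message_tokens):
    """Total input tokens when each turn resends the full history."""
    total = history = 0
    for m in message_tokens:
        history += m      # context grows with every message
        total += history  # and the whole history is sent each turn
    return total

# 20 messages of ~500 tokens in one long chat:
one_long_chat = chat_cost([500] * 20)

# Same work split into two chats, joined by a 200-token summary:
with_reset = chat_cost([500] * 10) + chat_cost([200] + [500] * 10)

print(one_long_chat, with_reset)  # 105000 vs 57200
```

In this toy example the reset cuts total input tokens roughly in half, because the cost of a chat grows with the square of its length when the full history is resent every turn.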
## Model Choice Matters (Plain Language)
Think of each model as a different "speed and cost" setting. Some models are cheap and fast. Some are smarter but cost more for the same question. If you pick a higher-cost model, you can burn through your daily or monthly allowance much faster.
### Models Available (From The Copilot Picker)
- Auto (10% discount)
- GPT-4.1 (0x)
- GPT-4o (0x)
- GPT-5 mini (0x)
- Grok Code Fast 1
- Claude Haiku 4.5 (0.33x)
- Claude Opus 4.5 (3x)
- Claude Opus 4.6 (3x)
- Claude Sonnet 4 (1x)
- Claude Sonnet 4.5 (1x)
- Gemini 2.5 Pro (1x)
- Gemini 3 Flash (Preview) (0.33x)
- Gemini 3 Pro (Preview) (1x)
- GPT-5 (1x)
- GPT-5-Codex (Preview) (1x)
- GPT-5.1 (1x)
- GPT-5.1-Codex (1x)
- GPT-5.1-Codex-Max (1x)
- GPT-5.1-Codex-Mini (Preview) (0.33x)
- GPT-5.2 (1x)
- GPT-5.2-Codex (1x)
### Practical Guidance (Plain Language)
- Use cheaper models for summaries, quick questions, and small edits.
- Use expensive models only when the task is truly complex or high-stakes.
- If you are unsure, start with Auto or a 0.33x or 1x option, then move up only if needed.
#### Example: Choosing A Model
- Task: "Summarize this file in 5 bullets." Use a 0.33x or 1x model.
- Task: "Refactor three files and update tests." Start with a 1x model. Move to 3x only if the 1x model fails.
- Task: "Explain a confusing production issue with lots of context." Start with 1x, and only move up if needed.
### Quick Glossary
- Model: The "brain" Copilot uses to answer your question.
- Multiplier: A cost factor applied to each request. A higher multiplier burns your allowance faster for the same amount of work.
- Tokens: The units that count your AI usage (roughly input + output size).
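The multiplier idea can be made concrete with a small sketch. The model names mirror the picker list above, but the per-task token count is an assumed illustration, not real pricing.

```python
# How multipliers scale usage: the same task burns more allowance
# on a higher-multiplier model. Token count is illustrative.

MULTIPLIERS = {
    "Claude Haiku 4.5": 0.33,
    "Claude Sonnet 4.5": 1.0,
    "Claude Opus 4.6": 3.0,
}

def effective_usage(tokens: int, model: str) -> float:
    """Raw tokens scaled by the model's cost multiplier."""
    return tokens * MULTIPLIERS[model]

task = 10_000  # tokens for one refactor request
for model, mult in MULTIPLIERS.items():
    print(f"{model} ({mult}x): {effective_usage(task, model):,.0f}")
# The 3x model burns ~9x more allowance than the 0.33x model
# for the exact same request.
```

This is why "start cheap, move up only if needed" is the default advice: a failed attempt on a 0.33x model costs far less than the same failure on a 3x model.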
## Best Practices to Reduce Usage
- Use clear, bounded requests with specific goals.
- Prefer targeted edits over full rewrites.
- Reuse context by referencing earlier outputs instead of re-pasting.
- Ask for summaries before requesting changes.
### Before And After Example
Bad: "Rewrite this entire module and update all tests."
Better: "Only refactor the validation functions in this module. Keep existing behavior. List tests to update."
## Examples of Efficient Prompts
- "Summarize this file in 5 bullets. Then propose a refactor plan."
- "Update only the functions in this file that handle validation."
- "List risks in this change and suggest tests to add."
## Daily and Monthly Budgeting Tips
- Batch related questions in a single prompt.
- Timebox explorations and stop when enough info is gathered.
- Avoid repeated retries without changing the prompt.
### Example: Timeboxed Session
1. Ask for a 5-step plan.
2. Approve or adjust.
3. Ask for just step 1 or 2.
4. Stop and summarize before moving on.
## Budgeting Routine
- Start with a plan-first request for large tasks.
- Limit each request to one output type.
- End sessions with a short summary for easy follow-up.
### Example: One Output Type
Instead of: "Refactor the file, explain it, and add tests."
Use: "Refactor the file only. Do not explain or add tests."
Then follow up with a separate request if needed.
## Red Flags That Burn Tokens Quickly
- Large file pastes with no clear ask.
- Multiple full rewrites in one session.
- Repeated "start over" requests.
## How You Can Burn A Full Day Fast (Example Scenarios)
- You paste multiple large files and ask a 3x model to rewrite everything plus tests.
- You keep asking a high-cost model to "start over" with a new approach.
- You do a long debugging session on a big codebase using a 3x model for every step.
- You ask for full architecture diagrams and long explanations from a high-cost model in one session.
### Realistic "New User" Scenario
You open a single chat and do this all day:
1. Paste a large file and ask for a full rewrite.
2. Ask for a different rewrite using another approach.
3. Ask for full tests.
4. Ask for a full explanation of the changes.
5. Repeat with another file.
If each step uses a 3x model and the chat context keeps growing, your token use can spike quickly and responses slow down as the context gets resent with every request.
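The scenario above can be put into rough numbers. Everything here is an assumption made for illustration: the daily allowance figure, the per-step token counts, and the simplified model where each request resends the full accumulated context.

```python
# Rough sketch of how the day-long scenario drains an allowance.
# The allowance and all token counts are illustrative assumptions.

DAILY_ALLOWANCE = 300_000  # assumed effective-token budget for one day
MULTIPLIER = 3.0           # a 3x model used for every step

# rewrite, alternate rewrite, full tests, full explanation:
steps = [20_000, 20_000, 15_000, 10_000]

context = 0
spent = 0.0
for step in steps * 2:     # repeat the whole cycle for a second file
    context += step        # the single chat's context keeps growing
    spent += (step + context) * MULTIPLIER

print(f"{spent:,.0f} effective tokens; over budget: {spent > DAILY_ALLOWANCE}")
```

Under these toy assumptions the session lands well past the assumed daily budget, and most of the cost comes from the multiplier and the ever-growing context rather than from any single request.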
## Team Habits That Help
- Capture reusable prompts in a shared doc.
- Standardize request templates.
- Agree on when to use agents vs chat.
## Quick Checklist
- Is the request specific and scoped?
- Do I need the whole file or just a section?
- Can I ask for a plan first?