The README That Saved My Job
A year ago, I got called at 2 AM. Production was down. The engineer who built the deployment system had left the company six months earlier. No documentation. No comments. Just 800 lines of Bash scripts that nobody understood.
I spent 4 hours that night guessing, googling, and praying. I got the system back up. Then I spent the next two days writing documentation so this would never happen again.
Two months later, the same issue came back. The new DevOps engineer asked, "How does your infrastructure work?" I sent them the docs.
That documentation made the onboarding noticeably smoother.
Good documentation isn't just nice to have. It's a time saver.
Why Most Documentation Fails
Let me guess what your documentation looks like:
- A README that says "Install dependencies and run the app"
- A wiki with 47 pages, 40 of which are outdated
- Comments that say things like `// Fix bug` (what bug?!)
- Architecture diagrams drawn 3 years ago that no longer match reality
I know because I've written all of these. Here's why they fail:
- They're written once and never updated (documentation rots faster than food)
- They explain "what" but not "why" (I can read the code to see what it does)
- They're in the wrong place (if I have to search for 10 minutes, I won't read it)
- They're written for the author (not for the confused junior at 11 PM)
The Documentation Hierarchy
After years of trial and error, here's what actually works. I call it the "Four Levels of Documentation."
Level 1: Code Comments (The Why, Not the What)
Bad comment:
// Loop through users
for (const user of users) {
  // Delete the user
  await deleteUser(user.id);
}
Thanks, I can read JavaScript.
Good comment:
// We delete users one at a time (not in bulk) because our audit system
// needs to track each deletion separately for compliance.
// Batch deletes failed audit in 2023. See TICKET-1234.
for (const user of users) {
  await deleteUser(user.id);
}
Now I know WHY it's slow, and I know not to "optimize" it without checking compliance requirements.
What to comment:
- Weird code that looks wrong but isn't ("This looks like a race condition, but it's actually fine because X")
- Performance decisions ("We use a Set instead of Array because with 10K items, lookup is 100x faster")
- Business rules ("Refunds are only allowed within 30 days per CEO decision, Jan 2024")
- Bugs you're working around ("This API returns null instead of 404. Reported to vendor, waiting on fix")
What NOT to comment:
- Things the code already says (`i++; // Increment i`)
- Function names that are self-explanatory (`getUserById()` doesn't need a comment)
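Here's a minimal sketch of the "bug you're working around" kind of comment. The vendor client `fetchVendorUser`, the `NotFoundError` class, and `loadUser` are hypothetical names I made up for illustration; only the commenting pattern comes from the list above.

```javascript
// Hypothetical error type, used only for this illustration.
class NotFoundError extends Error {}

// Hypothetical stand-in for a vendor API client that returns null
// for unknown ids instead of signalling "not found".
function fetchVendorUser(id) {
  const users = { 1: { id: 1, name: "Ada" } };
  return users[id] ?? null;
}

function loadUser(id) {
  const user = fetchVendorUser(id);
  // WHY: the vendor API returns null instead of a 404 for unknown ids.
  // Reported to the vendor, waiting on a fix. Until then we turn null into
  // an explicit error so callers cannot mistake "missing" for "empty".
  if (user === null) {
    throw new NotFoundError(`User ${id} not found`);
  }
  return user;
}
```

The comment says nothing about what the `if` does. It records why the check exists and when it can be removed.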
Level 2: README Files (Getting Started)
Your README should answer these questions in order:
- What is this? (One sentence. Pretend I'm a new hire on day 1.)
- How do I run it locally? (Step by step. Assume I've never seen this before.)
- How do I run tests?
- How do I deploy?
- Where do I get help? (Slack channel, wiki link, person to ask)
Real example from our repo:
# Payment Service
Handles all payment processing, refunds, and subscription billing.
## Quick Start
1. Install dependencies: `npm install`
2. Copy `.env.example` to `.env` and add your Stripe test key
3. Run migrations: `npm run migrate`
4. Start the server: `npm start`
5. Visit http://localhost:3000
## Running Tests
`npm test` — runs all tests
`npm run test:watch` — runs tests in watch mode
## Deployment
Push to `main` → auto-deploys to staging
Tag a release → auto-deploys to production
See [deployment guide](./docs/deployment.md) for details
## Common Issues
**"Cannot connect to database"** → Make sure Docker is running: `docker-compose up`
**"Stripe API key invalid"** → Check your `.env` file has `STRIPE_TEST_KEY=sk_test_...`
## Need Help?
#payments Slack channel or message @alice (tech lead)
This takes 10 minutes to write. It saves 2 hours for every new person who touches this code.
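If you want a cheap guard against a README losing its key sections, something like the sketch below could run in CI. It is an assumption-heavy illustration: the function name `missingReadmeSections` is made up, and the required headings simply mirror the example README above, so adjust them to your own.

```javascript
// Headings a README is expected to contain.
// These mirror the example README above; they are not a standard.
const REQUIRED_SECTIONS = [
  "Quick Start",
  "Running Tests",
  "Deployment",
  "Need Help?",
];

// Returns the required "## " headings that are missing from the README text.
function missingReadmeSections(readmeText) {
  const headings = readmeText
    .split("\n")
    .filter((line) => line.startsWith("## "))
    .map((line) => line.slice(3).trim());
  return REQUIRED_SECTIONS.filter((section) => !headings.includes(section));
}

const sampleReadme = [
  "# Payment Service",
  "Handles all payment processing, refunds, and subscription billing.",
  "## Quick Start",
  "## Running Tests",
].join("\n");

console.log(missingReadmeSections(sampleReadme)); // [ 'Deployment', 'Need Help?' ]
```

A check like this only proves the headings exist. Whether the steps under them still work is something a human has to verify.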
Level 3: Architecture Docs (The Big Picture)
This is for understanding how the whole system fits together. I used to draw fancy diagrams in Lucidchart. They were outdated in a week.
What works better: simple text files in the repo.
I create a `docs/architecture.md` file with:
# System Architecture
## How a payment flows through our system
1. User clicks "Buy Now" on frontend
2. Frontend calls `POST /api/payments` (our API gateway)
3. API gateway validates request, calls Payment Service
4. Payment Service calls Stripe API
5. Stripe charges card, returns success/failure
6. Payment Service records result in database
7. Payment Service publishes event to message queue
8. Email Service picks up event, sends receipt email
9. Analytics Service picks up event, updates revenue dashboard
## Key Services
- **Payment Service** (Node.js) — processes payments via Stripe
- **Email Service** (Python) — sends transactional emails
- **Analytics Service** (Go) — aggregates data for dashboards
- **API Gateway** (Node.js) — authentication, rate limiting, routing
## Databases
- **PostgreSQL** (main DB) — users, payments, subscriptions
- **Redis** (cache) — session data, rate limiting
- **MongoDB** (logs) — application logs, audit trails
## Message Queue
We use RabbitMQ. Events are published to these topics:
- `payment.success` — payment succeeded
- `payment.failed` — payment failed
- `subscription.created` — new subscription
- `subscription.cancelled` — subscription cancelled
## Where to look when things break
- Payments failing? Check Payment Service logs + Stripe dashboard
- Emails not sending? Check Email Service logs + RabbitMQ queue depth
- Slow API responses? Check Redis hit rate + PostgreSQL slow query log
This is text, so it's easy to update. It's in the repo, so it's versioned with the code. It answers the question "how does this all work?" without making people read 10,000 lines of code.
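For readers who think in code, steps 6 to 9 of the payment flow can be sketched roughly like this. It is a toy: the queue is an in-memory `Map` standing in for RabbitMQ, the "database" is an array, and every function name here (`subscribe`, `publish`, `handleChargeResult`) is invented for illustration, not taken from our services.

```javascript
// In-memory stand-in for the message queue (NOT RabbitMQ; illustration only).
const subscribers = new Map(); // topic -> array of handler functions

function subscribe(topic, handler) {
  if (!subscribers.has(topic)) subscribers.set(topic, []);
  subscribers.get(topic).push(handler);
}

function publish(topic, event) {
  for (const handler of subscribers.get(topic) ?? []) handler(event);
}

const paymentsTable = []; // stand-in for the PostgreSQL payments table
const log = [];

// Steps 8 and 9: downstream services react to the same event independently.
subscribe("payment.success", (e) => log.push(`email: receipt for ${e.paymentId}`));
subscribe("payment.success", (e) => log.push(`analytics: +${e.amount} revenue`));

// Steps 6 and 7: record the result first, then publish the matching event.
function handleChargeResult(paymentId, amount, succeeded) {
  paymentsTable.push({ paymentId, amount, succeeded });
  publish(succeeded ? "payment.success" : "payment.failed", { paymentId, amount });
}

handleChargeResult("pay_1", 50, true);
handleChargeResult("pay_2", 20, false); // nothing subscribes to payment.failed here
console.log(log);
```

The point of the sketch is the shape: the Payment Service never calls the Email or Analytics services directly, which is why "check the queue depth" shows up in the troubleshooting list.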
Level 4: Decision Records (Why We Made This Choice)
This is my secret weapon. Every significant decision gets documented in `docs/decisions/`.
Example: `docs/decisions/2024-01-15-why-we-use-rabbitmq.md`
# Why We Use RabbitMQ for Message Queue
**Date:** 2024-01-15
**Status:** Accepted
**Deciders:** Alice (tech lead), Bob (senior eng), Carol (architect)
## Context
We need a message queue for async processing (emails, analytics, etc.). We're currently using a homegrown Redis-based queue, but it's unreliable (loses messages when Redis restarts).
## Options Considered
1. **Keep Redis queue, add persistence**
2. **Use RabbitMQ**
3. **Use AWS SQS**
4. **Use Kafka**
## Decision
We chose RabbitMQ.
## Reasons
- **Reliability:** Built-in message persistence and acknowledgment
- **Familiar:** Team has RabbitMQ experience from previous jobs
- **Self-hosted:** We already run our own infrastructure; SQS would add an AWS dependency
- **Right size:** Kafka is overkill for our volume (1000 messages/day)
## Consequences
- **Positive:** No more lost messages, better monitoring, team knows how to operate it
- **Negative:** One more service to maintain, slightly more complex than Redis
- **Neutral:** We'll need to migrate existing Redis queue consumers (estimated 2 weeks)
## References
- [RabbitMQ vs SQS comparison](internal-link)
- [Spike: RabbitMQ proof of concept](ticket-link)
Why this matters: Six months later, someone asks "Why are we using RabbitMQ instead of SQS?" I send them this doc. Discussion over. No need to re-litigate decisions.
Also, when we DO want to switch technologies, we have a record of why we chose what we chose. Context is preserved.
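To lower the friction of starting a new record, a small helper can generate the file name and skeleton. This is a sketch under my own assumptions: `decisionRecordPath` and `decisionRecordTemplate` are hypothetical names, and the default "Proposed" status is a guess at a sensible starting value (the example above only shows "Accepted").

```javascript
// Builds a path like docs/decisions/2024-01-15-why-we-use-rabbitmq.md
function decisionRecordPath(date, title) {
  const slug = title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse anything non-alphanumeric into one dash
    .replace(/^-+|-+$/g, ""); // trim leading and trailing dashes
  return `docs/decisions/${date}-${slug}.md`;
}

// Skeleton with the same headings as the example record above.
function decisionRecordTemplate(date, title, deciders) {
  return [
    `# ${title}`,
    `**Date:** ${date}`,
    "**Status:** Proposed", // assumption: new records start as Proposed
    `**Deciders:** ${deciders.join(", ")}`,
    "## Context",
    "## Options Considered",
    "## Decision",
    "## Reasons",
    "## Consequences",
    "## References",
  ].join("\n");
}

console.log(decisionRecordPath("2024-01-15", "Why We Use RabbitMQ"));
// docs/decisions/2024-01-15-why-we-use-rabbitmq.md
```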
Documentation That Stays Fresh
The hardest part of documentation isn't writing it. It's keeping it updated.
Strategy 1: Documentation in the Repo
If docs live in Confluence or Notion, they drift out of date fast, because nobody sees them while changing the code. If docs live in the same repo as the code, they can be updated in the same PR as the code.
We have a rule: any PR that changes behavior must update the relevant README. Reviewers check for this.
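Reviewers are the real enforcement, but a small automated nudge can help. The sketch below assumes a layout where source code lives under `src/` and docs are `README.md` files or anything under `docs/`; both the layout and the function names are illustrative assumptions.

```javascript
// Assumption: docs are README.md files or anything under docs/.
function isDocFile(path) {
  return path.startsWith("docs/") || path.endsWith("README.md");
}

// Assumption: source code lives under src/.
function isSourceFile(path) {
  return path.startsWith("src/") && !isDocFile(path);
}

// Returns true when a change set touches source code but no documentation.
function missingDocsUpdate(changedFiles) {
  const touchesSource = changedFiles.some(isSourceFile);
  const touchesDocs = changedFiles.some(isDocFile);
  return touchesSource && !touchesDocs;
}

console.log(missingDocsUpdate(["src/refunds.js"])); // true
console.log(missingDocsUpdate(["src/refunds.js", "README.md"])); // false
console.log(missingDocsUpdate(["docs/deployment.md"])); // false
```

The list of changed files could come from `git diff --name-only` in CI. A check like this can only flag that no docs were touched; deciding whether the PR actually changed behavior still takes a human.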
Strategy 2: The "Obvious Location" Principle
Documentation should be where people expect it.
- How to run the app? → README.md in the root
- How a service works? → README.md in that service's folder
- How to deploy? → docs/deployment.md
- Why we made a decision? → docs/decisions/
If I have to search for docs, I won't find them.
Strategy 3: Delete Outdated Docs
Last year, we deleted 60% of our wiki. It was all outdated. Nobody maintained it. It caused more confusion than help.
Rule: If you can't update it, delete it.
Better to have no docs than wrong docs.
Strategy 4: Doc Debt in the Backlog
We track "doc debt" just like tech debt. When we ship a feature without updating docs, we create a ticket: "Update docs for feature X."
Once per month, we spend a half-day on doc debt. Not glamorous, but it keeps docs from rotting.
How to Document for Your Future Self
I think of documentation as a message to my future self (or to the poor person who inherits my code).
Questions I ask when writing docs:
- If I looked at this code in 6 months, what would I forget?
- What did I struggle to understand when I started?
- What would I need to know if I got paged at 2 AM?
- What will the next person ask me about this code?
Real example: I wrote a caching layer. I documented:
- Why we cache (API calls cost money, rate limits at 1000/hour)
- What we cache (user profiles, not payment data)
- How long cache lives (5 minutes)
- How to invalidate cache (call `clearUserCache(userId)`)
- What happens if Redis is down (we skip cache, hit API directly, log warning)
A few months later, someone asked, "Can we cache this for 1 hour instead of 5 minutes?" I'd forgotten why it was 5 minutes. I checked the docs. They said: "5 minutes because user profile changes (name, avatar) should appear quickly. Tested with users: 5 min was the max they'd tolerate." Mystery solved.
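Here's roughly what those five documented points look like when they sit next to the code. This is a sketch, not our actual caching layer: `TtlCache` is an in-memory stand-in for Redis, `fetchProfile` is a hypothetical function that calls the paid API, and only `clearUserCache(userId)` and the 5-minute lifetime come from the docs described above.

```javascript
const FIVE_MINUTES_MS = 5 * 60 * 1000;

// In-memory stand-in for Redis (illustration only).
// `now` is injectable so expiry can be checked without waiting.
class TtlCache {
  constructor(ttlMs, now = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
    this.entries = new Map(); // key -> { value, expiresAt }
  }
  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }
  set(key, value) {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
  delete(key) {
    this.entries.delete(key);
  }
}

// WHY 5 minutes: profile changes (name, avatar) should appear quickly.
// Tested with users; 5 min was the max they'd tolerate.
// WHAT: user profiles only, never payment data.
const userCache = new TtlCache(FIVE_MINUTES_MS);

// HOW TO INVALIDATE: call this after a profile update.
function clearUserCache(userId) {
  userCache.delete(`user:${userId}`);
}

// `fetchProfile` is a hypothetical async function that calls the paid API.
async function getUserProfile(userId, fetchProfile) {
  const key = `user:${userId}`;
  try {
    const cached = userCache.get(key);
    if (cached !== undefined) return cached;
  } catch (err) {
    // IF THE CACHE IS DOWN: skip it, hit the API directly, log a warning.
    console.warn("cache read failed, falling back to API:", err.message);
    return fetchProfile(userId);
  }
  const profile = await fetchProfile(userId);
  try {
    userCache.set(key, profile);
  } catch (err) {
    console.warn("cache write failed:", err.message);
  }
  return profile;
}
```

With the reasons written right above the constant, the "can we make it 1 hour?" question answers itself before anyone has to ask.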
Documentation for Different Audiences
Different people need different docs:
- New hires: Need getting-started guides, architecture overview, "how we work" docs
- Junior engineers: Need "why" explanations, decision context, patterns we follow
- Senior engineers: Need system design, scaling considerations, known limitations
- On-call engineers: Need troubleshooting guides, runbooks, "what to do when X breaks"
- Product/Design: Need API docs, capability lists, technical constraints
I organize our docs by audience, not by topic.
docs/
  onboarding/     ← For new hires
  runbooks/       ← For on-call
  architecture/   ← For senior engineers
  decisions/      ← For everyone
  api/            ← For product/design/frontend
The "Docs-First" Culture Shift
Here's what changed documentation from "nice to have" to "must have" on our team:
1. Lead by Example
I started documenting everything I touched. Not as a chore, but as part of the work. After a few months, others followed.
2. Praise Good Docs Publicly
When someone writes great docs, I call it out in team meetings. "Shoutout to Bob—his deployment runbook saved us 3 hours last night."
3. Make It Easy
We have templates:
- README template
- Decision record template
- Runbook template
Copy, fill in the blanks, done. Removes the "I don't know how to start" excuse.
4. Include Docs in Definition of Done
A PR isn't "done" until docs are updated. Reviewers check. If docs are missing, the PR doesn't get approved.
Sounds strict, but it works. Docs get written.
What to Do This Week
- Pick your worst-documented system. Spend 30 minutes writing a basic README. Just answer: What is this? How do I run it? Who do I ask for help?
- Add one "why" comment to code that's confusing. Explain the business reason or the technical constraint.
- Document one recent decision. Use the template above. File it in `docs/decisions/`. Takes 15 minutes.
- Delete one outdated doc. Check your wiki. Find something that's wrong. Delete it. Feels good.
- Add "docs updated?" to your PR checklist. Make it a required checkbox. Watch docs improve.
The ROI of Documentation
Let me give you numbers from our team:
- Before: new engineers took 3 months to become effective
- After: new engineers take 3 weeks to become effective
- Before: on-call incidents averaged 4 hours to resolve
- After: on-call incidents average 45 minutes to resolve
- Before: "How does X work?" questions in Slack: 20-30 per week
- After: "How does X work?" questions in Slack: 3-5 per week (and we answer with a docs link)
- Time spent writing docs: ~2 hours per week per engineer
- Time saved by having docs: ~5 hours per week per engineer
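To spell out the arithmetic behind those last two figures, here is a tiny sketch. The two constants are the numbers above; the team size is a hypothetical parameter, since I haven't said how big our team is.

```javascript
// Figures from the list above (approximate, per engineer per week).
const HOURS_WRITING_PER_WEEK = 2;
const HOURS_SAVED_PER_WEEK = 5;

// teamSize is a hypothetical input; plug in your own.
function netHoursSavedPerWeek(teamSize) {
  return (HOURS_SAVED_PER_WEEK - HOURS_WRITING_PER_WEEK) * teamSize;
}

console.log(netHoursSavedPerWeek(1)); // 3
console.log(netHoursSavedPerWeek(8)); // 24
console.log(HOURS_SAVED_PER_WEEK / HOURS_WRITING_PER_WEEK); // 2.5
```

So each hour spent writing docs comes back as roughly 2.5 hours saved, or a net gain of about 3 hours per engineer per week.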
Documentation isn't overhead. It's an investment that pays compound interest.
The Truth About Documentation
Here's what nobody tells you: you'll never have perfect documentation.
Some docs will get outdated. Some gaps will remain. Some things will be under-documented.
That's okay.
The goal isn't perfection. The goal is: can someone figure this out without asking me?
If the answer is "yes, with these docs," you've won.
Start small. Document one thing today. Tomorrow, document another. In six months, you'll have a well-documented codebase.
And when you get called at 2 AM, you'll thank yourself.