The Rewrite That Never Shipped
In 2021, I watched a team spend 14 months rewriting their monolith into microservices. They had all the buzzwords: event-driven architecture, service mesh, Kubernetes, the works.
The rewrite never shipped. The company ran out of runway and got acquired. The monolith is still running in production today.
Meanwhile, at my current company, we've been gradually modernizing our 8-year-old monolith for 3 years. We ship to production every day. We've extracted 12 services. Revenue is up 3x. The monolith is still the core, and that's fine.
Here's what I learned: you don't need to kill the monolith. You need to make it better.
Why Your Monolith Isn't Actually the Problem
Let me guess what you hate about your monolith:
- Takes 20 minutes to run tests
- Deploys are scary and take 2 hours
- One bug in the reporting module takes down checkout
- New engineers take 3 months to be productive
- You can't scale the parts that need scaling without scaling everything
Here's the thing: microservices won't fix any of those problems if you don't fix them first.
Slow tests? You'll have slow tests in 15 services instead of 1.
Scary deploys? Now you have 15 deployment pipelines to be scared of.
Tight coupling? That just becomes network calls that are even harder to debug.
I know because I've made these mistakes. We extracted a service before fixing the underlying problems. It was just a distributed monolith with extra latency.
The Strangler Fig Pattern (That Actually Works)
You've probably heard of the Strangler Fig pattern. The metaphor is cool: a fig tree grows around an old tree, eventually replacing it.
What the blog posts don't tell you: you might not want to strangle the whole tree.
Here's our approach over the past 3 years:
Year 1: Make the Monolith Modular
Before extracting anything, we organized the monolith into clear modules:
```
/src
  /users      (authentication, profiles, permissions)
  /billing    (payments, subscriptions, invoicing)
  /products   (catalog, inventory, pricing)
  /analytics  (reporting, dashboards)
  /core       (shared utilities, database)
```
Rules we enforced:
- Users module can only import from core, not from billing
- Billing can import from users (needs auth) but not from analytics
- Analytics can import from everything (read-only)
We added linting rules to enforce this. PRs that violated module boundaries got rejected automatically.
This took 6 months. Zero new features during this time. Just reorganization. It was painful, but it worked.
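The boundary linting can be sketched as a small check that scans a module's source for constants belonging to modules it isn't allowed to import. This is an illustrative sketch, not our actual linter: the `ALLOWED` map mirrors the rules above, but the `Module::` constant heuristic is simplified (in a real Rails codebase, a tool like Packwerk does this properly).

```ruby
# Hypothetical boundary checker. Each module lists the modules it may
# import; everything else is a violation.
ALLOWED = {
  "users"     => ["core"],
  "billing"   => ["core", "users"],
  "products"  => ["core"],
  "analytics" => ["core", "users", "billing", "products"], # read-only consumer
  "core"      => [],
}.freeze

# Given a module name and the source of one of its files, return the
# forbidden modules that the source references (via Module:: constants).
def boundary_violations(mod, source)
  ALLOWED.keys
         .reject { |m| m == mod || ALLOWED[mod].include?(m) }
         .select { |m| source.include?("#{m.capitalize}::") }
end
```

Wired into CI, a non-empty result fails the build, which is what rejected the boundary-violating PRs automatically.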
Year 2: Extract the Obvious Wins
Once we had clear modules, we looked for services that were:
- Independent: Doesn't need much from other modules
- Scalable: Needs different scaling characteristics than the main app
- Stable: Not changing every week
We extracted:
- Email service: It was sending 10M emails/day while the monolith served 100K requests/day. Different scaling needs. Easy to extract because it was already just a queue consumer.
- Image processing: CPU-intensive, independent, clear API (upload image → get URL back).
- Analytics/Reporting: Read-only, could have its own database replica, didn't need real-time consistency.
We did NOT extract:
- User authentication: Too critical, touches everything, needs real-time consistency
- Billing: Too complex, constantly changing, too risky
- Core product logic: Too coupled to everything else
Year 3: Optimize the Monolith
Plot twist: after extracting a few services, we stopped extracting and focused on making the monolith better.
We:
- Upgraded Rails 5 → Rails 7 (40% faster)
- Optimized database queries (cut API response time in half)
- Implemented proper caching (Redis FTW)
- Split the database reads/writes to replicas
- Added comprehensive monitoring
Result: the monolith now handles 5x more traffic than when we started. We don't need to extract more services because the monolith is fast enough.
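The read/write split can lean on Rails' built-in multiple-database support (Rails 6+). A sketch assuming database configs named `primary` and `primary_replica` — the names and the 2-second delay are illustrative, not our actual config:

```ruby
# app/models/application_record.rb — route writes to the primary and
# reads to a replica (connection names are illustrative).
class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true

  connects_to database: { writing: :primary, reading: :primary_replica }
end

# config/environments/production.rb — let Rails switch roles automatically:
# GET/HEAD requests read from the replica, unless the same session wrote
# within the last 2 seconds (to avoid reading stale data after a write).
Rails.application.configure do
  config.active_record.database_selector = { delay: 2.seconds }
  config.active_record.database_resolver =
    ActiveRecord::Middleware::DatabaseSelector::Resolver
  config.active_record.database_resolver_context =
    ActiveRecord::Middleware::DatabaseSelector::Resolver::Session
end
```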
How to Decide What to Extract
Use this decision tree (I literally have this printed on my wall):
Extract If:
- Different scaling needs: Email service needs to send 10M/day, main app needs 100K req/day → extract
- Different technology requirements: Need Python for ML, main app is Ruby → extract
- Different team ownership: Separate team that needs to deploy independently → extract
- Clear, stable interface: "Upload file → get URL back" with no plans to change → safe to extract
- High cost of coupling: Mobile app can't release features because they need backend deploys first → extract the mobile API
Don't Extract If:
- It shares a database table with other code: You'll just create a distributed monolith
- The interface isn't clear: "It does... stuff with users?" → not ready
- It changes frequently: Deploying 2 services every time you change something is worse than deploying 1
- It's tightly coupled: If extracting it means 1000 network calls, don't do it yet
- You're doing it because "microservices are best practice": Please don't
The Extraction Process That Worked
When we extracted our email service, here's what we did:
Step 1: Create Internal API Boundaries (in the monolith)
First, we wrapped the email code in a clean interface inside the monolith:
```ruby
class EmailService
  def send_email(to:, subject:, body:, template:)
    # All email logic here
  end
end

# Everywhere in the codebase
EmailService.send_email(to: user.email, subject: "Welcome!", ...)
```
Took 2 weeks. Zero functionality changed. Just created a clear boundary.
Step 2: Add a Feature Flag
```ruby
class EmailService
  def send_email(...)
    if FeatureFlag.enabled?(:external_email_service)
      EmailAPI.send(...) # Call external service
    else
      # Old code in monolith
    end
  end
end
```
Now we could route some traffic to the new service, some to the old code, and switch back instantly if needed.
Step 3: Build the Service (While the Old Code Still Runs)
We built the new email service over 6 weeks. During this time, 100% of production traffic still used the monolith. No pressure.
Step 4: Gradual Rollout
- Week 1: 1% of traffic to new service
- Week 2: 10%
- Week 3: 50%
- Week 4: 100%
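The weekly percentages can be driven by a deterministic hash of a stable key (the recipient's email, say), so each user consistently lands in or out of the new-service bucket as the dial turns up. A sketch — `rollout?` and the keying choice are illustrative, not our actual flag implementation:

```ruby
require "zlib"

# Hypothetical percentage rollout: hash a stable key so the same user
# always gets the same answer for a given percentage. CRC32 mod 100
# buckets keys into 0..99; a key is "in" if its bucket is below the dial.
def rollout?(key, percent)
  Zlib.crc32(key) % 100 < percent
end

# Week 2: roughly 10% of users route to the new email service,
# and it's the SAME 10% every time — no flapping between code paths.
rollout?("user-42@example.com", 10)
```

The determinism is what makes the rollback story clean: turning the dial from 10 back to 0 affects exactly the users who were on the new path, and nobody else.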
We found 3 bugs during this rollout. Because we could instantly switch back to the monolith, users barely noticed.
Step 5: Delete the Old Code
This is the step everyone forgets. We waited 2 months after hitting 100%, then deleted the email code from the monolith.
Total time: 4 months from "let's extract this" to "old code deleted."
Total bugs in production: 3 (minor, caught within hours)
Total downtime: 0 seconds
Mistakes I've Made (Learn From My Pain)
Mistake 1: Extracting Too Early
We extracted a "notifications service" when we had 3 types of notifications. Then we added 12 more types. We ended up deploying the notifications service every single day, which defeated the whole purpose.
Lesson: Wait until the interface is stable. If you're still figuring out what this code should do, keep it in the monolith.
Mistake 2: Shared Database
We extracted a "reports service," but it queried the main database directly. Now we couldn't change the database schema without coordinating deploys across both services.
Lesson: Services should own their data. Use APIs or events, not direct database access.
Mistake 3: Wrong Boundaries
We split "user management" from "authentication." Made sense on paper. In practice, every auth change required a user management change. We merged them back into one service after 6 months.
Lesson: Boundaries should follow business domains, not technical layers.
Mistake 4: Not Having Rollback Plans
We extracted a service and deleted the monolith code immediately. The service had a memory leak. It crashed after 6 hours. We had no fallback. That was a bad Saturday.
Lesson: Keep the old code for at least a month. Feature flags are your friend.
Living With the Monolith
Here's my controversial take: monoliths are fine, actually.
Shopify is largely a monolith. GitHub is largely a monolith. Basecamp is literally famous for being a monolith.
The key word is "modular monolith."
What makes a monolith good:
- Clear module boundaries enforced by tooling
- Fast tests (< 10 minutes for full suite)
- Fast deploys (< 10 minutes from commit to production)
- Good monitoring and observability
- Easy to run locally
If you have those things, you don't need microservices. If you don't have those things, microservices won't save you.
Your Action Plan This Week
- Map your monolith: Draw boxes around major functional areas. Where are the natural boundaries?
- Identify one module boundary violation: Where is code reaching across modules when it shouldn't? Fix that one case.
- Measure your deployment time: How long from merge to production? If > 1 hour, that's your first optimization target (not extraction).
- Run your tests: How long? If > 30 minutes, spend a week optimizing tests before you extract anything.
- Ask: what's the real problem? If the answer is "monolith is bad," dig deeper. The real problem is probably "deploys are scary" or "tests are slow." Fix those.
The Real Goal
The goal isn't microservices. The goal isn't even killing the monolith.
The goal is: can your team ship features quickly and confidently?
If you can ship to production 10 times a day with your monolith, you're doing better than teams with 50 microservices deploying once a week.
Our monolith is 8 years old. It's still growing. We've extracted the pieces that needed extracting. The rest? It's fast, well-tested, and it makes us money.
That's good enough.