Good Retry, Bad Retry: An Incident Story

guideamigo_com 5 hours ago

I never get this desire for microservices. Your IDE can help if there are 500 functions, but nothing will help you if you have 500 microservices. Almost no one fully understands such a system. It is hard to tell which parts of the code are unused. And large-scale refactoring is impossible.

The upside seems to be some mythical infinite scalability which will collapse under such positive feedback loops.

morningsam 4 hours ago

>The upside seems to be some mythical infinite scalability which will collapse under such positive feedback loops.
Unless I misunderstand something here, they say pretty early in the article that they didn't have autoscaling configured for the service in question and there is no indication they scaled up the number of replicas manually after the downtime to account for the accumulated backlog of requests. So, in my mind, of course there can be no infinite, or really any, scalability if the service isn't allowed to scale...
FooBarWidget 4 hours ago

The point of microservices is not technical, it's so that the deployment- and repository ownership structure matches your organization structure, and that clear lines are drawn between responsibilities.
- sim7c00 4 hours ago
  
  it's also easier to find devs who have the skills to create and maintain thin services than a large complicated monolith, despite the difficulties of debugging a constellation of microservices during a crisis.
  - phil21 2 hours ago
    
    For the folks who downvoted this - why? I hire developers and this is the absolute truth of the matter.
    You can get away with hiring devs able to only debug their little micro empire so long as you can retain some super senior rockstar level folks able to see the big picture when it inevitably breaks down in production under load. These skills are becoming rarer by the day, when they used to be nearly table stakes for a “senior” dev.
    Microservices have their place, but many times you can see that it’s simply developers saying “not my problem” to the actual hard business case things.
    
    pards 2 hours ago
    
    > retain some super senior rockstar level folks able to see the big picture
    This is the critical piece that many organisations miss.
    Microservices are the bricks; but the customer needs those assembled into a house.
delusional 5 hours ago

I think the dream is that you can reason locally. I'm not convinced that it actually helps any, but the dream is that by having everything as services, complete with external boundaries and enforced constraints, you're able to more accurately reason about the orchestration of services. It's hard to reason about your order flow if half of it depends on some implicit procedure that's part of your shopping cart.
The business I'm part of isn't really after "scalable" technology, so that might color my opinion, but a lot of the arguments for microservices I hear from my colleagues are actually benefits of modular programs. Those two have just become synonyms in their minds.
- klabb3 3 hours ago
  
  > […] the dream is that having everything as services, […], you're able to more accurately reason about the orchestration of services.
  Well.. I mean that’s an entirely circular point. Maybe you mean something else? That you can individually deploy and roll back different functionality that belongs to a team? There’s some appeal for operations, yeah.
  > but a lot of the arguments for microservices I hear from my colleagues are actually benefits of modular programs
  Yes, I mean from a development perspective a library call is far, far superior to an HTTP call. It is much more performant and orders of magnitude easier to reason about, since the caller and callee are running the same version of the code. That means that a breaking change is a refactor and a single commit, whereas with a service boundary you need a whole migration.
  You can’t avoid services altogether, like say external services like a payment portal by a completely different company. But to deliberately create more of these expensive boundaries for no reason, within the same small org or team, is madness, imo.

davedx 5 hours ago

This is the kind of well written, in depth technical narrative I visit HN for. I definitely learned from it. Thanks for posting!

chipdart 5 hours ago

I agree. What a treat. One of the best submissions gracing HN in months.

Rygian 4 hours ago

Reading this excellent article put me in the mind of wondering if job interviews for developer positions include enough questions about queue management.

"Ben" developed retries without exponential back-off, and only learned about that concept in code review. Exponential back-off should be part of any basic developer curriculum (except if that curriculum does not mention networks of any sort at all).
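For anyone who hasn't met the concept, here is a minimal sketch of retries with exponential back-off and jitter. It is not the article's code; the names `backoff_delays` and `call_with_retries` and all default values are assumptions chosen for illustration. Each retry waits a random time below a ceiling that doubles per attempt, up to a cap.

```python
import random
import time


def backoff_delays(base=0.1, factor=2.0, cap=5.0, attempts=5, rng=random.random):
    """Yield one sleep duration per retry.

    The ceiling grows as base * factor**attempt, is clamped to `cap`,
    and the actual delay is a random fraction of it ("full jitter").
    All parameter values here are illustrative assumptions.
    """
    for attempt in range(attempts):
        ceiling = min(cap, base * factor ** attempt)
        yield rng() * ceiling


def call_with_retries(operation, attempts=5, sleep=time.sleep, **backoff_kwargs):
    """Call `operation` up to `attempts` times, backing off between failures."""
    last_error = None
    # N attempts need at most N - 1 waits in between.
    delays = backoff_delays(attempts=attempts - 1, **backoff_kwargs)
    for _ in range(attempts):
        try:
            return operation()
        except Exception as exc:  # a real client would retry only transient errors
            last_error = exc
            delay = next(delays, None)
            if delay is None:
                break  # no attempts left, give up
            sleep(delay)
    raise last_error
```

The jitter matters as much as the exponent: without it, clients that failed together retry together and hit the recovering service in synchronized waves.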

sim7c00 4 hours ago

if you ask too many deeper questions you rule out a lot of eager juniors who can learn and grow on the job. it's a fine balance though, and looking at the article, ben's taking his lessons and growing. that's more important i think than having someone who's some guru from the get-go. everyone has things they are better or worse at, and it's really a team effort to do everything right. presumably someone reviewed and accepted his code, and that person also didn't catch it... there's no developer who knows everything and writes perfect code and designs. it's a well-balanced team that can help go in that direction
- Rygian 2 hours ago
  
  I wholeheartedly agree, and realize my comment was not really clear.
  Any training curriculum needs to include exponential back-off as a core concept of any system-to-system interaction.
  Ben was let out of school without proper training. Kudos to the employer for finishing up the training that was missed earlier on.

duffmancd 5 hours ago

I missed it on the first read-through but there is a link to the code used to run the simulations in the first appendix.

Homegrown python code (i.e. not a library), very nicely laid out. And would form a good basis for more experiments for anyone interested. I think I'll have a play around later and try and train my intuition.

azlev 5 hours ago

Good reading.

In my last job, the service mesh was responsible for doing retries. It was a startup and the system was changing every day.

After a while, we suspected that some services were not reliable enough and that retries were hiding this fact. Turning off retries exposed that quality had in fact gone down.

In the end, we put retries in just some services.

I have never tested either retry budgets or deadline propagation. I will suggest them in the future.
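As a rough sketch of what those two ideas could look like in client code (this is not from the article or from any service mesh; `RetryBudget`, `call_downstream`, and the parameter values are hypothetical names and assumptions): the budget lets retries through only as a fraction of regular traffic, and the deadline is passed along as an absolute time so that no attempt starts once the caller has stopped waiting.

```python
import time


class RetryBudget:
    """Token bucket: every first attempt earns `ratio` tokens, every retry costs one.

    With ratio=0.1, retries are limited to roughly 10% of regular traffic
    (approximately, since the tokens are floats). Values are assumptions.
    """

    def __init__(self, ratio=0.1, max_tokens=10.0):
        self.ratio = ratio
        self.max_tokens = max_tokens
        self.tokens = 0.0

    def record_request(self):
        self.tokens = min(self.max_tokens, self.tokens + self.ratio)

    def try_spend_retry(self):
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def call_downstream(send, deadline, budget, max_attempts=3, now=time.monotonic):
    """Call `send(timeout)` while respecting a propagated deadline and a retry budget.

    `deadline` is an absolute time on the `now` clock, received from the caller.
    `send` is a hypothetical function that performs the request with the given
    timeout and raises ConnectionError on a transient failure.
    """
    for attempt in range(max_attempts):
        timeout = deadline - now()
        if timeout <= 0:
            # The caller has already given up, so doing the work would be wasted load.
            raise TimeoutError("deadline exceeded; not calling downstream")
        if attempt == 0:
            budget.record_request()
        elif not budget.try_spend_retry():
            raise RuntimeError("retry budget exhausted")
        try:
            return send(timeout)
        except ConnectionError:
            continue
    raise RuntimeError("all attempts failed")
```

In this sketch one `RetryBudget` instance would be shared by all calls to the same downstream service, so a burst of failures drains it quickly and the retries stop amplifying load.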

k3vinw 3 hours ago

Great food for thought! I’m currently on an endeavor at work to stabilize some pre-existing rest service integration tests executed in parallel.

easylion 5 hours ago

Really good article about retries, their consequences, and how load amplification works. Loved it.

sim7c00 6 hours ago

very nice read with lots of interesting points and examples / examination. very thorough imo. I'm not a microservices guy, but it gives a lot of general concepts that are also applicable outside of that domain. very good, thanks!