Site icon patrick's blog

The misery of positive feedback loops

At work we perform jobs off a queue. Sometimes duplicates get enqueued, and we don’t want them to run. So we did this.

Sidekiq::Queue.new.select do |job|
  (job.args[0]["job_class"] == "OurSpecialClass") &&
  (job.args[0]["arguments"][0]["_aj_globalid"] == "gid://our_id") &&
  (job.args[0]["arguments"][1] == event_type)
end.any?

This was one of those issues that come in as “not working”, with a screenshot carefully cropped to omit any useful information. I saw in NewRelic that we had a bunch of llen and lrange operations, which was very strange to me as I didn’t think we used those.

I was lucky enough to discover MONITOR in Redis and saw that the parameters tended to be real close to each other, the second one higher. Clearly, we were doing something dumb.

1524865607.137354  "lrange" "sidekiq:queue:default" "2003" "2052"
1524865607.137651  "lrange" "sidekiq:queue:default" "3103" "3152"
1524865607.138266  "lrange" "sidekiq:queue:default" "10230" "10279"
1524865607.139222  "llen" "sidekiq:queue:default"
1524865607.139582  "lrange" "sidekiq:queue:default" "11833" "11882"
1524865607.139956  "llen" "sidekiq:queue:default"
1524865607.140247  "lrange" "sidekiq:queue:default" "10080" "10129"
1524865607.140613  "llen" "sidekiq:queue:default"
1524865607.140937  "llen" "sidekiq:queue:default"
1524865607.141833  "lrange" "sidekiq:queue:default" "10533" "10582"
1524865607.143325  "llen" "sidekiq:queue:default"
1524865607.143396  "llen" "sidekiq:queue:default"
1524865607.143667  "lrange" "sidekiq:queue:default" "9971" "10020"
1524865607.144039  "lrange" "sidekiq:queue:default" "850" "899"
1524865607.144669  "llen" "sidekiq:queue:default"
1524865607.145019  "lrange" "sidekiq:queue:default" "10132" "10181"
1524865607.145395  "lrange" "sidekiq:queue:default" "6571" "6620"
1524865607.146262  "llen" "sidekiq:queue:default"
1524865607.146604  "llen" "sidekiq:queue:default"
1524865607.146976  "llen" "sidekiq:queue:default"
1524865607.147054  "lrange" "sidekiq:queue:default" "11530" "11579"
1524865607.147622  "llen" "sidekiq:queue:default"
1524865607.149180  "llen" "sidekiq:queue:default"
1524865607.149649  "lrange" "sidekiq:queue:default" "4452" "4501"
1524865607.149827  "lrange" "sidekiq:queue:default" "3153" "3202"
1524865607.150086  "lrange" "sidekiq:queue:default" "11130" "11179"

The problem is that as the queue grows longer, this operation takes more and more time. In turn, this makes the queue get longer. The operations are fast on their own, but when it’s running on a queue of 200k, those times add up. The fix is to keep a set, add to it when enqueued, and remove from it when finished.

That was all well and good and on its way out the door. Unfortunately, the operation was a success, but the patient died. Redis evicted the queue before the solution could be deployed and all the jobs were lost.

 

Exit mobile version