Rails, callbacks, workers, and the race you never expected to lose

August 21, 2012 § 38 Comments

Consider a fairly normal pattern in a user model in Rails, sending a welcome email when a user signs up. If you’re interested in keeping your controller actions quick whilst you do this, you would probably queue the email to be sent later, like this:

class User < ActiveRecord::Base
  
  after_save :queue_welcome_email, :on => :create
  private
  def queue_welcome_email
    Resque.enqueue(WelcomeEmailJob, self.id)
  end
end

class WelcomeEmailJob
  @queue = :normal

  def self.perform(user_id)
    user = User.find(user_id)
    UserMailer.welcome_email(user).deliver
  end
end

Straightforward, right? You’d think so, but if you implemented this and ran it in a live system, you would find a small number of ActiveRecord::RecordNotFound errors cropping up in Resque.

What’s going on?

This has bitten me on more than one occasion, which is why I’m taking the time to write it down now. The problem is that your eager worker has snatched the job off the queue and tried to process it before your database transaction has completed. So when it calls to the database to find your new record, it won’t actually be there, even though you said “after save”!

So how do we sort this out? Well, since Rails 3 there has been one last callback beyond after_save: after_commit.

Problem solved, new problems introduced

So, all we need to do is replace any after_save that places jobs on a queue with after_commit? Sort of, this does actually raise some other problems for us. Firstly, if you are using transactional fixtures, your tests, like the one below, will now fail.

class UserTest < ActiveSupport::TestCase

  def test_should_queue_email_job
    flexmock(Resque).should_receive(:enqueue).once
    User.create(:email => '')
  end

end

Transactional fixtures stop us having to teardown and recreate our database every test by just running every test within a transaction and rolling back at the end. This presents a problem; if the after_commit callback only runs once the transaction is complete, then it will never run in our test.

To deal with this, you can either give up on transactional fixtures, which will likely make your tests slow down, or, my preference, create tests that avoid transactions purely for this use case.

class UserAfterCommitTest < ActiveSupport::TestCase

  self.use_transactional_fixtures = false

  def test_should_queue_email_job
    flexmock(Resque).should_receive(:enqueue).once
    User.create(:email => '')
  end

end

Dirty Attributes

Our initial example is now solved, we no longer have an issue with the database and the workers racing against each other to save or retrieve our record. There is one other scenario I want to mention here though. Imagine you are using a 3rd party email newsletter service and you need to keep your users’ emails up to date with the service. You might have something like this:

class User < ActiveRecord::Base

  after_save :update_external_email, :if => :email_changed?

  private

  def update_external_email
    Resque.enqueue(ExternalEmailJob, self.id)
  end
end

We are suffering with the problem we first discussed here. But, if you just change after_save to after_commit you will never update those emails externally. Why? Because by the time the after_commit callback is triggered, the record’s dirty state has been reset and email_changed? will return false.

Thankfully, the Rails team had thought of this too. Available on every object is a method called previous_changes which contains a hash of attributes that changed the last time the object was saved. Sadly, with this we don’t get all our convenience methods like #{attribute}_changed? (although we could write them ourselves) this still tells us what happened to the attributes, so we can use after_commit conditionally.

So, to update our example, we can now write:

class User < ActiveRecord::Base

  after_commit :update_external_email, :if => :email_previously_changed?

  private

  def update_external_email
    Resque.enqueue(ExternalEmailJob, self.id)
  end

  def email_previously_changed?
    previous_changes[:keys].include?('email')
  end
end

after_commit and previous_changes save the day

The race condition between database and workers can be hard to spot and it has caught me on more than one occasion. Thankfully, once you realise you are heading for this sort of issue, or once you catch it, these quick changes will sort you out.

Tagged: callbacks, rails, ruby, workers

§ 38 Responses to Rails, callbacks, workers, and the race you never expected to lose

kitopastorino says:

August 21, 2012 at 12:03 pm

You also got to be worry with after_commit in that example about sending the email on user update.
IIRC, after_commit triggers both in save and update, so that callback will run if a user updates his password or any other info.

Reply
- Phil Nash says:
  
  August 21, 2012 at 1:05 pm
  
  You can choose to trigger after_commit on save, create or update. By default it is on save (i.e. every time) but if you pass :on => :create or :on => :update as arguments to the callback, you can restrict that.
  
  Reply
anamika says:

August 21, 2012 at 12:03 pm

Too much work. Why don’t add it in controller or use a service.

Reply
- Phil Nash says:
  
  August 21, 2012 at 1:06 pm
  
  What do you mean by too much work? This was just a simple example to illustrate the issue, things may not be as simple as an update in one controller.
  
  Reply
Alex Heaton says:

August 21, 2012 at 12:46 pm

Seems like a good argument to avoid callbacks altogether.

I’ve been using a service object recently to handle the saving of the data object and the sending these emails instead.
Avoids this weird race condition and awkward tests.

Reply
- Phil Nash says:
  
  August 21, 2012 at 1:18 pm
  
  Perhaps this is the way forward, in this case. I picked welcome emails as a simple example of the potential issue. I’m sure there is a use for callbacks/observers out there.
  
  Reply
DHH says:

August 21, 2012 at 12:49 pm

Do you have a special situation that prevents you from making the call to generate the welcome email from the controller?

In my book, emails are like async views. The model should not be responsible for rendering views. That’s the responsibility of the controller. Which is why Action Mailer itself is modelled on Action Controller, uses Action View, and so forth.

Reply
- Phil Nash says:
  
  August 21, 2012 at 1:23 pm
  
  That’s a good point, and, whilst we do use callbacks for sending emails at times, I see your view that it is more like a controller action. In the case of this post, I used emails as an example as it was easy to understand and illustrate the solution.
  
  The actual use case that caused me to write up the issue was geocoding an address once it had been saved/updated. Same issues with the worker missing the address or getting an old one had it been updated. Perhaps I should have just stuck with that example, but it was more complicated in practice.
  
  Reply
webster132 says:

August 21, 2012 at 12:50 pm

Do you have a special situation that prevents you from making the call to generate the welcome email from the controller?

In my book, emails are like async views. The model should not be responsible for rendering views. That\’s the responsibility of the controller. Which is why Action Mailer itself is modelled on Action Controller, uses Action View, and so forth.

Reply
Mike says:

August 21, 2012 at 1:04 pm

Would using an observer sort this?

Reply
- Phil Nash says:
  
  August 21, 2012 at 1:25 pm
  
  Sadly observers fall victim to the same issues. If you were using an observer for this, you would still have to use the after_commit callback and the integration test between the model and observer would have to avoid transactional fixtures.
  
  Reply
Michael Sullivan says:

August 21, 2012 at 3:05 pm

Thanks for this! We ran into this problem a few weeks ago, and, while we solved it another way, it drove me a little bit crazy in the mean-time.

Reply
- Phil Nash says:
  
  August 21, 2012 at 3:19 pm
  
  How did you solve it? I once tried fixing this issue by making Resque keep trying again until it found the record, but that really wasn’t the right way!
  
  Reply
says:

August 21, 2012 at 4:27 pm

Re: above comments…

I don’t like this soft of action in a controller. It doesn’t fit there either. Remember fat model, skinny controller. But I understand where they are coming from.

I just think this action, sending emails on a condition of a model, is exactly what observers were created for. Let’s be honest, they are just extension of the model itself, a handy place to keep non-data modify actions.

I use DJ and it will retry a job if the ActiveRecord::RecordNotFound exception happens and simply retries the job in a few seconds/minutes, when it will complete successfully. I guess that is why I have never observed this happening, but secretly I new it could. All that being said, I really should implement my email conditions in an after_commit callback. But I probably won’t, until it becomes an _observed_ problem.

Great article. I’m sure a lot of programmers new to rails may not consider the race condition in callbacks.

Reply
- heeton says:
  
  August 21, 2012 at 5:33 pm
  
  Fat model has shown itself to be quite naive beyond anything simple. (Even in the original example it’s causing problems).
  
  Would also avoid observers and callbacks completely in favour of something like a Services object.
  
  Controller: handles incoming request, delegates to some service action, renders the result. Doesn’t care about what happens in the service (single responsibility)
  
  ActiveRecord Model: Persists data and relationships, no logic (skinny). Nothing else (again, single responsibility)
  
  Service object: manipulates various objects to achieve a task. doesn’t care about requests or database persistance.
  
  It might look something like this instead https://gist.github.com/3417559
  
  Reply
Jon says:

August 21, 2012 at 5:51 pm

Or you could just schedule your jobs after a delay.

Reply
says:

August 21, 2012 at 5:57 pm

Another possible issue in this realm is database replication lag. We write to a master db, and read from slaves. Occasionally our jobs run before new data has replicated to a slave, which is problematic in various ways.

Reply
Rodrigo Rosenfeld Rosas says:

August 22, 2012 at 2:43 pm

Once more I’m feeling so happy that I use Sequel instead of AR… :)

Reply
Roman says:

August 22, 2012 at 3:07 pm

Thank you for article!
I’d like to suggest one more way to solve problem. Sometimes it just enough to have prepared strings instead of AR model. Example:

““
class User :create
private
def queue_welcome_email
Resque.enqueue(WelcomeEmailJob, Marshal.
dump({ :email => self.email, :full_name => “#{self.first_name} #{self.last_name}” }))
end
end

class WelcomeEmailJob
@queue = :normal

def self.perform(user_data)
UserMailer.welcome_email(Marshal.load(user_data)).deliver
end
end
““
Accordingly welcome_email should be able to accept hash of strings instead of user. One more advantage of such example is that you have one less query to database.
It’s not suitable for all cases but _sometimes_ could be useful (for example when your task sends http request to third party service, so Hash easily transforms to request parameters).

Reply
- gromans says:
  
  August 22, 2012 at 3:17 pm
  
  Sorry for that mess. There is a link to gist: https://gist.github.com/3426669
  
  Reply
says:

October 6, 2012 at 4:20 pm

@Roman, I’ve always tried to pass the parameters to the bg job explicitly, not only it helps w. race bugs, but also respects the Smyth’s Law of Demeter (http://goo.gl/u62dk). What is your reason to use Marshal.dump though?

Reply
says:

December 11, 2012 at 7:34 pm

I just wanted to say thank for writing this article. I’ve also encountered this issue a few times and the after_commit callbacks work well enough.

Scheduling the asynchronous task to run at some arbitrary delayed interval (whether it be 10 seconds or 1 minute, etc), may be a quick pragmatic fix, but it is still prone to failure. What if your DB is pegged and is unable to commit your transaction in < 10 seconds? Using the 'after_commit' callback allows us listen for the correct event to be captured, thus taking out the guessing.

Reply
- says:
  
  December 11, 2012 at 8:37 pm
  
  One more thing:
  The gem ‘test_after_commit’ forces ActiveRecord to trigger ‘after_commit’ callbacks even when transactional_fixtures are set to true. I found it to be useful in testing (using RSpec)
  
  https://github.com/grosser/test_after_commit
  
  Reply
Al Davidson says:

January 25, 2013 at 11:26 am

Another way round this is to use a DB table (say, “transactional_jobs”) as an intermediate queue. Your app code (in whatever form) saves the email into this table, and then a separate resque job periodically pulls transactional_jobs out of this table, deletes the record, and finally sends the actual job to resque (probably in an after_commit)

That way, job is “scheduled” in the same transaction as the update that causes it, so it’s only ever visible after the transaction commits. If the transaction rolls back, so does the job scheduling.

Of course, it’s not so much solving the problem as shifting it around a bit, but it does de-couple the problem from your app code

Reply
Abdullah says:

April 28, 2013 at 12:04 am

Hey thank you for this example!

I had a quick question,

You are using previous_changes, which is defined as follows:
####File activemodel/lib/active_model/dirty.rb, line 125
def previous_changes
@previously_changed
end

as you can see, the changes are stored in memory.

So what happens if the worker somehow dies between getting the answer for this line:
after_commit :update_external_email, :if => :email_previously_changed?

So the process/worker restarts, since the previous_changes was held in memory, it will now reset to an empty hash {}, so now when the worker runs :if => :email_previously_changed? it will return false, so now the job will never be enqueued again..

Is my understanding of this correct?

Reply
- Phil Nash says:
  
  April 28, 2013 at 2:20 pm
  
  The after_commit hook is run within the original process, not the worker, so this is not an issue.
  
  Reply
  - abdullahali1990 says:
    
    April 28, 2013 at 2:35 pm
    
    Hello Phil,
    
    Thank you for the reply.
    
    Is it still not an issue when deploying to the server? ie: killing/restarting the application? When deploying, the original process will be killed and hence when the application is restarted the previous_changes method would return an empty hash, masking the if statement fail?
  - Phil Nash says:
    
    April 28, 2013 at 2:45 pm
    
    As far as I’m aware, if you kill a Rails process part way through, it’s not going to restart and try to finish what it was doing. So that would be a problem regardless.
  - abdullahali1990 says:
    
    April 28, 2013 at 3:04 pm
    
    Could the following approaches work in this scenario:
    
    1) when deploying, don’t kill the process directly, but tell it to stop accepting new requests (and load balancing the traffic to another sever), an only when there is nothing to process, then the application can be killed. This way nothing will be killed half way through? Would this solve this issue?
    
    2) if the previous changes are stored and updated in the DB (so it is more prsistent when the process restarts) everytime an enqueue action is triggered, when the rails process/workers are restarted, then it will look into the DB and figure out that changes did occur, and hence it can use this to enqueue the record to the new queue? Would this solve this issue?
  - abdullahali1990 says:
    
    April 28, 2013 at 9:11 pm
    
    Hello Phil,
    
    In your previous comment you said that the after_commit is run within the original process.
    
    But what about the previous_chnages method? Is it run within the main process? Or is it run within the instance of the application loaded by the resque worker?
  - Phil Nash says:
    
    April 29, 2013 at 5:28 pm
    
    All rails callbacks happen inline, in process so the after_commit hook is run after the data is saved to the database. When that hook runs, in this instance, it checks whether the email had changed, as you can see in the conditional gate to the hook “:if => :email_previously_change”.
    
    In the final example in the post, the only part of the code that is run by the resque worker would be ExternalEmailJob#perform. The ExternalEmailJob class is not actually shown in the post, but it would look very similar to the WelcomeEmailJob class.
    
    This post doesn’t cover making this process fault tolerant, you might want to investigate this yourself and release a post with your own findings. I’d be very interested to hear the best way to deal with it.
says:

December 12, 2013 at 1:22 pm

How do I disable transactional_fixtures for a single functional ActionController TestCase?
I need to test logic in after_commit callback in a functional test

Reply
- Phil Nash says:
  
  December 12, 2013 at 6:14 pm
  
  You can do so by calling
  
  self.use_transactional_fixtures = false
  
  in the test in which you need transactional fixtures disabled.
  
  Reply
  - says:
    
    December 13, 2013 at 5:39 am
    
    I tried it. But I need to disable it for a single test.
    For the case of unit tests, I scoped it inside a describe block
    e.g.
    describe “disabling transactional fixtures” do
    self.use_transactional_fixtures = false
    def test_unit_update
    …
    end
    end
    
    Is there any similar alternative for functional tests?
    there is no describe for ActionController::TestCase
  - Phil Nash says:
    
    December 13, 2013 at 10:31 am
    
    As I describe towards the end of the blog post, I actually create separate test files for testing anything that requires transactional fixtures to be disabled, that way you do not need to use a describe block at all if you don’t want to/can’t.
  - says:
    
    December 13, 2013 at 1:09 pm
    
    Thanks Phil for the quick response.
    I have only a few tests per controller which will get affected. I didn’t want to move them to separate files for the sake of code organization. But looks like I am left with no other option.
says:

June 13, 2014 at 11:01 am

I know this was written ages ago, but you just saved me an enormous headache trying to figure out what was going on with a worker throwing exceptions. Thanks!

Reply
- Phil Nash says:
  
  June 17, 2014 at 10:06 pm
  
  Aimee, thanks! I’m glad that I wrote this too, because it still crops up in various places in work we do too and it’s nice to always have an explanation on hand.
  
  Reply