Yesterday I was doing some testing with Bodhi's staging environment to try to reproduce a bug and test a proposed fix for it. The bug was in a library that Bodhi's repository composer uses to test whether the composed repository meets basic sanity checks. The composer runs this at the end of the job, so I needed to run a compose with the known bad version of the library.
I used bodhi-push to fire off a compose, and I saw Pungi running and everything seemed fine. I went off to do some other things for a while (composes take a long time), and when I came back I was surprised to find that Pungi had finished before the updateinfo generation step. This is highly unusual, as the updateinfo step usually takes a few minutes, while Pungi usually takes an hour or three (depending on the size of the repo).
Furthermore, htop and dstat both seemed to show the system as having no noticeable load or activity. The process was hung.
I started off by adding a few debug log statements to try to find out where in my code it was hanging. It turned out that it was always on the same line. Aha! The database must be getting stuck!
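Concretely, the kind of breadcrumb logging I mean looks something like the sketch below; the function and variable names are illustrative stand-ins, not Bodhi's actual code, though the query line matches the one the logs kept pointing at:

```python
# Illustrative sketch of breadcrumb logging around the suspect loop; the
# function and names are stand-ins, not Bodhi's actual code.
import logging

log = logging.getLogger(__name__)


def fetch_builds(db, nvrs):
    """Look up each build by NVR, logging before and after each query."""
    builds = []
    for nvr in nvrs:
        log.debug('Querying for build %s', nvr)
        # db is assumed to be a SQLAlchemy session and Build the mapped model.
        build = db.query(Build).filter_by(nvr=nvr).first()
        log.debug('Finished querying for build %s', nvr)  # never logged for the hung build
        builds.append(build)
    return builds
```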
Oddly, running that particular query by hand seemed to work just fine. Even more oddly, at first it seemed to always hang on the same build (note that we are looping over builds here), but I later noticed that it occasionally got hung on the build just prior.
I am not a Fedora administrator, so I asked for help in #fedora-apps, and nirik helped me check some things in the server logs. There were a couple of error messages in there about text being too long for a column, but we were doing a read query here, so that wasn't relevant. nirik then noted that when Bodhi was hung, it was actually not in the middle of a query at all. That was surprising, since we were on a line that calls into sqlalchemy to get the result of a query.
I decided to use pdb to get into the code and see if I could find where inside sqlalchemy the hang was happening. Of course, I expected it to be on some IO call that was not returning, but I ended up tracing it to here. How could the code that turns our query into text hang?
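For anyone who has not used it, the technique is just to plant a breakpoint right before the suspect line and then step down into the library frames. A minimal sketch, with a stand-in function rather than Bodhi's real code:

```python
# Minimal pdb illustration (stand-in code, not Bodhi's): pause just before a
# suspect call, then use "s" (step) to descend into library frames, "n"
# (next) to stay at the current level, and "bt" to print the current stack.
import pdb


def suspect_call():
    # Stand-in for the SQLAlchemy query line that appeared to hang.
    return sum(range(10))


pdb.set_trace()  # drops into the interactive debugger prompt
result = suspect_call()
print(result)
```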
There's obviously something deeper going on, so with some assistance from jcline I was able to get gdb to show me the full Python stack trace:
```
$ sudo gdb -p
(gdb) thread apply all py-bt

Thread 16 (Thread 0x7fcc3177a700 (LWP 1734)):
Traceback (most recent call first):
  File "/usr/lib64/python3.7/site-packages/zmq/sugar/context.py", line 46, in __del__
    self.term()
  Garbage-collecting
  File "/usr/lib64/python3.7/site-packages/sqlalchemy/sql/elements.py", line 4298, in _expand_cloned
    return itertools.chain(*[x._cloned_set for x in elements])
  File "/usr/lib64/python3.7/site-packages/sqlalchemy/sql/selectable.py", line 637, in corresponding_column
    expanded_proxy_set = set(_expand_cloned(c.proxy_set))
  File "/usr/lib64/python3.7/site-packages/sqlalchemy/sql/util.py", line 682, in _corresponding_column
    col, require_embedded=require_embedded
  File "/usr/lib64/python3.7/site-packages/sqlalchemy/sql/util.py", line 709, in replace
    return self._corresponding_column(col, True)
  File "/usr/lib64/python3.7/site-packages/sqlalchemy/sql/visitors.py", line 214, in replace
    e = v.replace(elem)
  File "/usr/lib64/python3.7/site-packages/sqlalchemy/sql/visitors.py", line 329, in clone
    newelem = replace(elem)
  File "/usr/lib64/python3.7/site-packages/sqlalchemy/sql/visitors.py", line 340, in replacement_traverse
    obj = clone(obj, **opts)
  File "/usr/lib64/python3.7/site-packages/sqlalchemy/sql/visitors.py", line 218, in traverse
    return replacement_traverse(obj, self.__traverse_options__, replace)
  File "/usr/lib64/python3.7/site-packages/sqlalchemy/orm/relationships.py", line 3061, in join_targets
    primaryjoin = primary_aliasizer.traverse(primaryjoin)
  File "/usr/lib64/python3.7/site-packages/sqlalchemy/orm/relationships.py", line 2081, in _create_joins
    source_selectable, dest_selectable, aliased, single_crit
  File "/usr/lib64/python3.7/site-packages/sqlalchemy/orm/util.py", line 975, in __init__
    of_type=right_info.mapper,
  File "/usr/lib64/python3.7/site-packages/sqlalchemy/orm/strategies.py", line 1776, in _create_eager_join
    _right_memo=self.mapper,
  File "/usr/lib64/python3.7/site-packages/sqlalchemy/orm/query.py", line 3599, in _compile_context
    strategy(*rec[1:])
  File "/usr/lib64/python3.7/site-packages/sqlalchemy/orm/query.py", line 3077, in __iter__
    context = self._compile_context()
  File "/usr/lib64/python3.7/site-packages/sqlalchemy/orm/query.py", line 2771, in __getitem__
    return list(res)
  File "/usr/lib64/python3.7/site-packages/sqlalchemy/orm/query.py", line 2979, in first
    ret = list(self[0:1])
  File "/usr/lib/python3.7/site-packages/bodhi/server/metadata.py", line 172, in _fetch_updates
    build_obj = self.db.query(Build).filter_by(nvr=nvr).first()
  File "/usr/lib/python3.7/site-packages/bodhi/server/metadata.py", line 136, in __init__
    self._fetch_updates()
  File "/usr/lib/python3.7/site-packages/bodhi/server/consumers/masher.py", line 1020, in _generate_updateinfo
    self.db, self.mash_dir)
  File "/usr/lib/python3.7/site-packages/bodhi/server/consumers/masher.py", line 951, in _compose_updates
    uinfo = self._generate_updateinfo()
  File "/usr/lib/python3.7/site-packages/bodhi/server/consumers/masher.py", line 415, in work
    self._compose_updates()
```
Now we're getting somewhere: it's the garbage collector! It's calling __del__ on an object in zmq, and that call is not returning. It had nothing to do with our database query at all, even though the hang reliably happened in the same place every time I tried.
After some research, I discovered in this comment that it is not OK to destroy a zmq context while other threads have sockets open.
fedmsg-hub uses threads to launch the various listeners and does seem to open sockets in multiple threads. Bodhi itself also launches a few more threads to run the composer, and those threads also open sockets to send messages.
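To see why that matters, here is a standalone sketch (not fedmsg's code) of the behaviour: terminating a context, which is what Context.__del__ does when the garbage collector runs, blocks until every socket created from that context has been closed.

```python
# Standalone sketch of the zmq behaviour described above (not fedmsg's code):
# ctx.term() blocks until every socket created from the context is closed,
# so a forgotten socket in another thread hangs the terminating thread.
import threading
import time

import zmq

ctx = zmq.Context()


def worker():
    sock = ctx.socket(zmq.PUSH)
    sock.connect("tcp://127.0.0.1:5555")  # arbitrary endpoint for illustration
    time.sleep(3600)  # the socket is never closed


threading.Thread(target=worker, daemon=True).start()
time.sleep(1)  # give the worker a chance to open its socket

# Context.__del__ calls term() during garbage collection; this call will not
# return while the worker's socket is still open.
ctx.term()
print("only reached once every socket on ctx has been closed")
```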
One mystery still remained: why is this happening now? I've been working on Bodhi for almost three years and hadn't seen this problem in staging or production prior to this incident. It was reliably reproducible on staging, but production worked fine. jcline suggested that perhaps something was blocking the socket that was getting garbage collected, as part of a policy Fedora has to firewall staging from production (we generally don't want our staging systems to connect to production).
While starting fedmsg-hub-3 a few more times, I happened to notice this fly by in the logs:
```
Traceback (most recent call last):
  File "/usr/lib/python3.7/site-packages/moksha/hub/hub.py", line 455, in __init_producers
    producer_obj = producer_class(self)
  File "/usr/lib/python3.7/site-packages/moksha/hub/monitoring.py", line 47, in __init__
    self.socket.bind(endpoint)
  File "zmq/backend/cython/socket.pyx", line 547, in zmq.backend.cython.socket.Socket.bind
  File "zmq/backend/cython/checkrc.pxd", line 25, in zmq.backend.cython.checkrc._check_rc
zmq.error.ZMQError: Permission denied
```
Excellent: now we see something that seems pretty relevant. So zmq is getting a permission denied while binding to a socket, and some time later the garbage collector comes along to clean up the context, and since other threads are using sockets when that happens, the whole process hangs. Also, this message happened a long time before the hang, so it wasn't near any of the log statements I had been focusing on, which is why I hadn't noticed it until now.
Unfortunately, neither fedmsg nor zmq told me which socket it was getting permission denied on. One lesson here is that good error messages will go a long way towards helping people debug code! Fortunately, another process was running that was emitting an error message I had not been seeing, because until now I had been filtering the logs for only messages from fedmsg-hub-3. Since fedmsg-hub-3 wasn't telling me the information I wanted, I hoped that perhaps it was an SELinux issue or something along those lines, and that looking at the general system logs would reveal something useful. And it did:
```
collectd[788]: exec plugin: Unable to parse command, ignoring line: "UNKNOWN - /var/run/fedmsg/monitoring-fedmsg-hub.socket does not exist"
```
So some other process was expecting this socket to be there, and I'm seeing a permissions error on binding to a socket. Coincidence?
It turns out that rebooting bodhi-backend breaks fedmsg-hub-3. Since /var/run is a tmpfs, its permissions revert to the defaults upon boot. Unfortunately, the default permissions do not allow the fedmsg-hub-3 process to create that socket as the user it runs as. Also unfortunately, this problem has been worked around by writing a task in the Fedora Infra Ansible playbook to set the permissions on /var/run/fedmsg such that fedmsg-hub-3 can open its socket. This is fragile, as we've seen here, and the problem is not obvious when the playbook is not run after a reboot.
After running the playbook to reset the permissions on that folder, the hang went away.
Lessons
There are a lot of lessons in this incident:
- Make sure you handle error conditions correctly in your code. fedmsg is not using zmq correctly here, because it failed to catch and handle the exception raised from zmq when it was unable to open the socket. This led to the garbage collector eventually trying to clean up the context, which caused a hang due to other threads having open sockets, which led me down a long and dark rabbit hole.
- If you are receiving a permission denied error, it is helpful to the user if you log what you were denied access to. Fortunately in this case, there was another process that logged that it was having trouble finding the path, and it just so happened to be the one that was getting permission denied, but your users won't always be so lucky. I filed zmq#1287 to request this. (A sketch of both of these ideas follows this list.)
- Read the documentation for the libraries you are using, and make sure you pay particular attention to thread safety when you are using threads. pyzmq does document that a context must not be destroyed if there are other threads with open sockets. If fedmsg had taken note of this, it could have avoided this problem. zmq recommends using a single context for the entire process, which fedmsg is not doing here.
- Servers should be able to be rebooted and still work.
- Sometimes the log message that is most helpful isn't at the timestamp you might be expecting, or from the process you might be expecting.
- gdb is awesome.
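To make the first two lessons concrete, here is an illustrative sketch. It is not moksha's or fedmsg's actual code, and the PUB socket type and function name are assumptions; the idea is simply to report which endpoint failed and to close the socket on failure so that a later context teardown cannot hang on it:

```python
# Illustrative sketch, not moksha/fedmsg code: report *which* endpoint failed
# and close the socket on failure so the context can be terminated cleanly.
import logging

import zmq

log = logging.getLogger(__name__)


def bind_monitoring_socket(context, endpoint):
    """Bind a PUB socket (socket type assumed) to the given endpoint."""
    socket = context.socket(zmq.PUB)
    try:
        socket.bind(endpoint)
    except zmq.ZMQError:
        # Include the endpoint so "Permission denied" is actionable.
        log.exception("Unable to bind monitoring socket to %s", endpoint)
        socket.close()  # without this, a later context.term() could hang
        raise
    return socket
```

With something along these lines, the log would have pointed straight at /var/run/fedmsg/monitoring-fedmsg-hub.socket instead of a bare "Permission denied".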
Looking forward
Fortunately, Bodhi will soon stop using fedmsg, starting with its upcoming 4.0.0 release. We are migrating to fedora-messaging, which provides reliable delivery, schemas, a broker, and many other exciting benefits to us. fedmsg is deprecated and being retired (otherwise, I'd file tickets about these problems).