Yesterday I was doing some testing with Bodhi's staging environment to try to reproduce a bug and test a proposed fix for it. The bug was in a library that Bodhi's repository composer uses to test whether the composed repository meets basic sanity checks. The composer runs this at the end of the job, so I needed to run a compose with the known bad version of the library.
I used bodhi-push to fire off a compose, and I saw Pungi running and everything seemed fine. I went off to do some other things for a while (composes take a long time), and when I came back I was surprised to find that Pungi had finished before the updateinfo generation step. This is highly unusual, as the updateinfo step usually takes a few minutes, while Pungi usually takes an hour or three (depending on the size of the repo).
Furthermore, htop and dstat both seemed to show the system as having no noticeable load or activity. The process was hung.
I started off by adding a few debug log statements to try to find out where in my code it was hanging. It turned out that it was always on the same line. Aha! The database must be getting stuck!
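Concretely, the kind of breadcrumb logging I mean looks something like the sketch below; the function and variable names are illustrative stand-ins, not Bodhi's actual code, though the query line matches the one the logs kept pointing at:

```python
# Illustrative sketch of breadcrumb logging around the suspect loop; the
# function and names are stand-ins, not Bodhi's actual code.
import logging

log = logging.getLogger(__name__)


def fetch_builds(db, nvrs):
    """Look up each build by NVR, logging before and after each query."""
    builds = []
    for nvr in nvrs:
        log.debug('Querying for build %s', nvr)
        # db is assumed to be a SQLAlchemy session and Build the mapped model.
        build = db.query(Build).filter_by(nvr=nvr).first()
        log.debug('Finished querying for build %s', nvr)  # never logged for the hung build
        builds.append(build)
    return builds
```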
Oddly, running that particular query by hand seemed to work just fine. Even more oddly, at first it seemed to always hang on the same build (note that we are looping over builds here), but I later noticed that it occasionally got hung on the build just prior.
I am not a Fedora administrator, so I asked for help in #fedora-apps, and nirik helped me check some things in the server logs. There were a couple of error messages in there about text being too long for a column, but we were doing a read query here, so that wasn't relevant. nirik then noted that when Bodhi was hung, it was actually not in the middle of a query at all. That was surprising, since we were on a line that calls into sqlalchemy to get the result of a query.
I decided to use pdb to get into the code and see if I could find where inside sqlalchemy the hang was happening. Of course, I expected it to be on some IO call that was not returning, but I ended up tracing it to here. How could the code that turns our query into text hang?
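For anyone who has not used it, the technique is just to plant a breakpoint right before the suspect line and then step down into the library frames. A minimal sketch, with a stand-in function rather than Bodhi's real code:

```python
# Minimal pdb illustration (stand-in code, not Bodhi's): pause just before a
# suspect call, then use "s" (step) to descend into library frames, "n"
# (next) to stay at the current level, and "bt" to print the current stack.
import pdb


def suspect_call():
    # Stand-in for the SQLAlchemy query line that appeared to hang.
    return sum(range(10))


pdb.set_trace()  # drops into the interactive debugger prompt
result = suspect_call()
print(result)
```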
There's obviously something deeper going on, so with some assistance from jcline I was able to get gdb to show me the full Python stack trace:
```
$ sudo gdb -p
(gdb) thread apply all py-bt

Thread 16 (Thread 0x7fcc3177a700 (LWP 1734)):
Traceback (most recent call first):
  File "/usr/lib64/python3.7/site-packages/zmq/sugar/context.py", line 46, in __del__
    self.term()
  Garbage-collecting
  File "/usr/lib64/python3.7/site-packages/sqlalchemy/sql/elements.py", line 4298, in _expand_cloned
    return itertools.chain(*[x._cloned_set for x in elements])
  File "/usr/lib64/python3.7/site-packages/sqlalchemy/sql/selectable.py", line 637, in corresponding_column
    expanded_proxy_set = set(_expand_cloned(c.proxy_set))
  File "/usr/lib64/python3.7/site-packages/sqlalchemy/sql/util.py", line 682, in _corresponding_column
    col, require_embedded=require_embedded
  File "/usr/lib64/python3.7/site-packages/sqlalchemy/sql/util.py", line 709, in replace
    return self._corresponding_column(col, True)
  File "/usr/lib64/python3.7/site-packages/sqlalchemy/sql/visitors.py", line 214, in replace
    e = v.replace(elem)
  File "/usr/lib64/python3.7/site-packages/sqlalchemy/sql/visitors.py", line 329, in clone
    newelem = replace(elem)
  File "/usr/lib64/python3.7/site-packages/sqlalchemy/sql/visitors.py", line 340, in replacement_traverse
    obj = clone(obj, **opts)
  File "/usr/lib64/python3.7/site-packages/sqlalchemy/sql/visitors.py", line 218, in traverse
    return replacement_traverse(obj, self.__traverse_options__, replace)
  File "/usr/lib64/python3.7/site-packages/sqlalchemy/orm/relationships.py", line 3061, in join_targets
    primaryjoin = primary_aliasizer.traverse(primaryjoin)
  File "/usr/lib64/python3.7/site-packages/sqlalchemy/orm/relationships.py", line 2081, in _create_joins
    source_selectable, dest_selectable, aliased, single_crit
  File "/usr/lib64/python3.7/site-packages/sqlalchemy/orm/util.py", line 975, in __init__
    of_type=right_info.mapper,
  File "/usr/lib64/python3.7/site-packages/sqlalchemy/orm/strategies.py", line 1776, in _create_eager_join
    _right_memo=self.mapper,
  File "/usr/lib64/python3.7/site-packages/sqlalchemy/orm/query.py", line 3599, in _compile_context
    strategy(*rec[1:])
  File "/usr/lib64/python3.7/site-packages/sqlalchemy/orm/query.py", line 3077, in __iter__
    context = self._compile_context()
  File "/usr/lib64/python3.7/site-packages/sqlalchemy/orm/query.py", line 2771, in __getitem__
    return list(res)
  File "/usr/lib64/python3.7/site-packages/sqlalchemy/orm/query.py", line 2979, in first
    ret = list(self[0:1])
  File "/usr/lib/python3.7/site-packages/bodhi/server/metadata.py", line 172, in _fetch_updates
    build_obj = self.db.query(Build).filter_by(nvr=nvr).first()
  File "/usr/lib/python3.7/site-packages/bodhi/server/metadata.py", line 136, in __init__
    self._fetch_updates()
  File "/usr/lib/python3.7/site-packages/bodhi/server/consumers/masher.py", line 1020, in _generate_updateinfo
    self.db, self.mash_dir)
  File "/usr/lib/python3.7/site-packages/bodhi/server/consumers/masher.py", line 951, in _compose_updates
    uinfo = self._generate_updateinfo()
  File "/usr/lib/python3.7/site-packages/bodhi/server/consumers/masher.py", line 415, in work
    self._compose_updates()
```
Now we're getting somewhere: it's the garbage collector! It's calling __del__ on an object in zmq, and that call is not returning. It had nothing to do with our database query at all, even though the hang reliably happened in the same place every time I tried.
After some research, I discovered in this comment that it is not OK to destroy a zmq context while other threads have sockets open.
fedmsg-hub uses threads to launch the various listeners and does seem to open sockets in multiple threads. Bodhi itself also launches a few more threads to run the composer, and those threads also open sockets to send messages.
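To see why that matters, here is a standalone sketch (not fedmsg's code) of the behaviour: terminating a context, which is what Context.__del__ does when the garbage collector runs, blocks until every socket created from that context has been closed.

```python
# Standalone sketch of the zmq behaviour described above (not fedmsg's code):
# ctx.term() blocks until every socket created from the context is closed,
# so a forgotten socket in another thread hangs the terminating thread.
import threading
import time

import zmq

ctx = zmq.Context()


def worker():
    sock = ctx.socket(zmq.PUSH)
    sock.connect("tcp://127.0.0.1:5555")  # arbitrary endpoint for illustration
    time.sleep(3600)  # the socket is never closed


threading.Thread(target=worker, daemon=True).start()
time.sleep(1)  # give the worker a chance to open its socket

# Context.__del__ calls term() during garbage collection; this call will not
# return while the worker's socket is still open.
ctx.term()
print("only reached once every socket on ctx has been closed")
```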
One mystery still remained: why is this happening now? I've been working on Bodhi for almost three years and hadn't seen this problem in staging or production prior to this incident. It was reliably reproducible on staging, but production worked fine. jcline suggested that perhaps something was blocking the socket that was getting garbage collected, as part of a policy Fedora has to firewall staging from production (we generally don't want our staging systems to connect to production).
While starting fedmsg-hub-3 a few more times, I happened to notice this fly by in the logs:
```
Traceback (most recent call last):
  File "/usr/lib/python3.7/site-packages/moksha/hub/hub.py", line 455, in __init_producers
    producer_obj = producer_class(self)
  File "/usr/lib/python3.7/site-packages/moksha/hub/monitoring.py", line 47, in __init__
    self.socket.bind(endpoint)
  File "zmq/backend/cython/socket.pyx", line 547, in zmq.backend.cython.socket.Socket.bind
  File "zmq/backend/cython/checkrc.pxd", line 25, in zmq.backend.cython.checkrc._check_rc
zmq.error.ZMQError: Permission denied
```
Excellent: now we see something that seems pretty relevant. So zmq is getting a permission denied while binding to a socket, and some time later the garbage collector comes along to clean up the context, and since other threads are using sockets when that happens, the whole process hangs. Also, this message happened a long time before the hang, so it wasn't near any of the log statements I had been focusing on, which is why I hadn't noticed it until now.
Unfortunately, neither fedmsg nor zmq told me which socket it was getting permission denied on. One lesson here is that good error messages will go a long way towards helping people debug code! Fortunately, another process was running that was emitting an error message I had not been seeing, because until now I had been filtering the logs for only messages from fedmsg-hub-3. Since fedmsg-hub-3 wasn't telling me the information I wanted, I hoped that perhaps it was an SELinux issue or something along those lines, and that looking at the general system logs would reveal something useful. And it did:
```
collectd[788]: exec plugin: Unable to parse command, ignoring line: "UNKNOWN - /var/run/fedmsg/monitoring-fedmsg-hub.socket does not exist"
```
So some other process was expecting this socket to be there, and I'm seeing a permissions error on binding to a socket. Coincidence?
It turns out that rebooting bodhi-backend breaks fedmsg-hub-3. Since /var/run is a tmpfs, its permissions revert to the defaults upon boot. Unfortunately, the default permissions do not allow the fedmsg-hub-3 process to create that socket as the user it runs as. Also unfortunately, this problem has been worked around by writing a task in the Fedora Infra Ansible playbook to set the permissions on /var/run/fedmsg such that fedmsg-hub-3 can open its socket. This is fragile, as we've seen here, and the problem is not obvious when the playbook is not run after a reboot.
After running the playbook to reset the permissions on that folder, the hang went away.
Lessons
There are a lot of lessons in this incident:
- Make sure you handle error conditions correctly in your code. fedmsg is not using zmq correctly here, because it failed to catch and handle the exception raised from zmq when it was unable to open the socket. This led to the garbage collector eventually trying to clean up the context, which caused a hang due to other threads having open sockets, which led me down a long and dark rabbit hole.
- If you are receiving a permission denied error, it is helpful to the user if you log what you were denied access to. Fortunately in this case, there was another process that logged that it was having trouble finding the path, and it just so happened to be the one that was getting permission denied, but your users won't always be so lucky. I filed zmq#1287 to request this. (A sketch of both of these ideas follows this list.)
- Read the documentation for the libraries you are using, and make sure you pay particular attention to thread safety when you are using threads. pyzmq does document that a context must not be destroyed if there are other threads with open sockets. If fedmsg had taken note of this, it could have avoided this problem. zmq recommends using a single context for the entire process, which fedmsg is not doing here.
- Servers should be able to be rebooted and still work.
- Sometimes the log message that is most helpful isn't at the timestamp you might be expecting, or from the process you might be expecting.
- gdb is awesome.
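To make the first two lessons concrete, here is an illustrative sketch. It is not moksha's or fedmsg's actual code, and the PUB socket type and function name are assumptions; the idea is simply to report which endpoint failed and to close the socket on failure so that a later context teardown cannot hang on it:

```python
# Illustrative sketch, not moksha/fedmsg code: report *which* endpoint failed
# and close the socket on failure so the context can be terminated cleanly.
import logging

import zmq

log = logging.getLogger(__name__)


def bind_monitoring_socket(context, endpoint):
    """Bind a PUB socket (socket type assumed) to the given endpoint."""
    socket = context.socket(zmq.PUB)
    try:
        socket.bind(endpoint)
    except zmq.ZMQError:
        # Include the endpoint so "Permission denied" is actionable.
        log.exception("Unable to bind monitoring socket to %s", endpoint)
        socket.close()  # without this, a later context.term() could hang
        raise
    return socket
```

With something along these lines, the log would have pointed straight at /var/run/fedmsg/monitoring-fedmsg-hub.socket instead of a bare "Permission denied".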
Looking forward
Fortunately, Bodhi will soon stop using fedmsg, starting with its upcoming 4.0.0 release. We are migrating to fedora-messaging, which provides reliable delivery, schemas, a broker, and many other exciting benefits to us. fedmsg is deprecated and being retired (otherwise, I'd file tickets about these problems).