On Wed, 2015-04-08 at 15:08 +0200, Johannes Berg wrote:
> On Wed, 2015-04-08 at 13:03 +0100, David Woodhouse wrote:
>
> > I'm not sure if this is entirely fixed. In Fedora 22 (4.0.0-rc5-git4)
> > I'm occasionally seeing glibc deadlock in __check_pf() on a netlink
> > recvmsg(), here:
> > https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/check_pf.c;h=162606d7;hb=glibc-2.21#l166
> >
> > As I understand it, this shouldn't happen. Even if messages are
> > dropped (which surely shouldn't happen as often as I'm seeing this),
> > glibc should get ENOBUFS from the recvmsg() call.
> >
> > https://bugzilla.redhat.com/show_bug.cgi?id=1209433
> >
> > I haven't bisected and proved that it *was* this commit which
> > introduced the problem, as it only happens after a day or two of
> > running Evolution and I haven't managed to trigger it more reliably.
>
> I don't see the connection to this change.
>
> The issue with my patch was that some code for NLM_F_DUMP would have
> this pattern:
>
> int fill_function(...)
> {
>         ...
>         return nlmsg_end(...);
> }
>
> loop (...) {
>         if (fill_function() <= 0)
>                 break; /* continue in next dump */
> }
>
> and that all had to be converted to be just "< 0" now.
>
> Additionally, the failure mode of this was the process running out of
> memory due to receiving the same results over and over again - does that
> happen for you? It seems it was stuck in recvmsg(), but that may just be
> a side effect of happening to interrupt at that point?
>

I don't think the problem was introduced by your change. At
https://github.com/nahi/httpclient/issues/232 it seems to have been
observed even in November of last year.

I've added some debugging, and it seems that when it deadlocks, glibc
doesn't get *any* response to its RTM_GETADDR request. I know we'd get
ENOBUFS if a *response* was dropped... but what about when the request
itself is dropped? Does userspace get any hint of that?
Is this purely a glibc bug, for assuming its request got delivered and
unconditionally waiting for a response?

I don't know why it suddenly started happening to me in the 4.0 kernel
when I'd never seen it before, but it's still happening. I've put a
poll() in the glibc code (referenced above), and made it fail after a
5-second timeout. That will at least prevent me from throwing my
computer out the window for the time being...

-- 
dwmw2
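
For illustration, a minimal sketch of the kind of poll() guard described
above. This is not the actual glibc change: the helper name
recv_with_timeout() and its error convention (returning -1 with errno set
to ETIMEDOUT when nothing arrives) are assumptions made for the example.
The idea is simply to never call recvmsg() on the socket until poll() has
reported it readable within a bounded time.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Hypothetical helper (not part of glibc): wait up to timeout_ms for the
 * socket to become readable, then read one message with recvmsg().
 * Returns the number of bytes read, or -1 with errno set; errno is
 * ETIMEDOUT if no reply arrived in time.
 */
static ssize_t recv_with_timeout(int fd, void *buf, size_t len, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };
    int ready;

    /* A negative timeout would make poll() block forever, which is
     * exactly the behaviour this guard is meant to avoid. */
    assert(timeout_ms >= 0);

    ready = poll(&pfd, 1, timeout_ms);
    if (ready < 0)
        return -1;              /* poll() failed; errno already set */
    if (ready == 0) {
        errno = ETIMEDOUT;      /* no reply within the timeout */
        return -1;
    }

    /* Readable (or in an error state): recvmsg() will not block now and
     * reports conditions such as ENOBUFS through errno as usual. */
    return recvmsg(fd, &msg, 0);
}
```

A caller would use this in place of the bare recvmsg() on the netlink
socket, e.g. recv_with_timeout(fd, buf, sizeof(buf), 5000) for a 5-second
limit, and treat ETIMEDOUT as "the request apparently got no answer"
rather than waiting indefinitely.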