* strange crashes in tcp_poll() via epoll_wait
From: Eric Dumazet @ 2013-07-19 16:24 UTC (permalink / raw)
To: Al Viro; +Cc: netdev, linux-kernel
Hi Al
I tried to debug strange crashes in tcp_poll() called from
sys_epoll_wait() -> sock_poll()
The symptom is that sock->sk is NULL and we therefore dereference a NULL
pointer.
These crashes are really rare, but it would still be nice to understand
where the bug is. Presumably the latest kernels would crash in
sock_poll() instead, because of the sk_can_busy_loop(sock->sk) call.
We do test for sock->sk being NULL in sock_fasync(), but shouldn't epoll
be safe because of the existing synchronization (epmutex)?
Any idea?
Thanks!
* Re: strange crashes in tcp_poll() via epoll_wait
From: Eric Wong @ 2013-07-19 23:50 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Al Viro, netdev, linux-kernel
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Hi Al
>
> I tried to debug strange crashes in tcp_poll() called from
> sys_epoll_wait() -> sock_poll()
>
> The symptom is that sock->sk is NULL and we therefore dereference a NULL
> pointer.
>
> These crashes are really rare, but it would still be nice to understand
> where the bug is. Presumably the latest kernels would crash in
> sock_poll() instead, because of the sk_can_busy_loop(sock->sk) call.
>
> We do test for sock->sk being NULL in sock_fasync(), but shouldn't epoll
> be safe because of the existing synchronization (epmutex)?
It should be safe because of ep->mtx, actually, as epmutex is not taken
in sys_epoll_wait.
I took a look at this but have not found anything. I've yet to see this
on my machines.
When did you start noticing this?
* Re: strange crashes in tcp_poll() via epoll_wait
From: Eric Dumazet @ 2013-07-20 0:10 UTC (permalink / raw)
To: Eric Wong; +Cc: Al Viro, netdev, linux-kernel
On Fri, 2013-07-19 at 23:50 +0000, Eric Wong wrote:
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > Hi Al
> >
> > I tried to debug strange crashes in tcp_poll() called from
> > sys_epoll_wait() -> sock_poll()
> >
> > The symptom is that sock->sk is NULL and we therefore dereference a NULL
> > pointer.
> >
> > These crashes are really rare, but it would still be nice to understand
> > where the bug is. Presumably the latest kernels would crash in
> > sock_poll() instead, because of the sk_can_busy_loop(sock->sk) call.
> >
> > We do test for sock->sk being NULL in sock_fasync(), but shouldn't epoll
> > be safe because of the existing synchronization (epmutex)?
>
> It should be safe because of ep->mtx, actually, as epmutex is not taken
> in sys_epoll_wait.
Hmm, it might be more complex than that for multi-threaded programs:
eventpoll_release_file()
The problem might be that a thread closes a socket while an event
is still queued for it.
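This hypothesis can be exercised from user space with a stress loop: one thread sits in epoll_wait() on a socket that already has an event queued, while another thread closes that socket. This is a sketch under stated assumptions, not a known reproducer. It uses an AF_UNIX socketpair rather than TCP, so it reaches unix_poll() rather than tcp_poll(). The helper names `epoll_waiter_thread()` and `stress_close_vs_epoll_wait()` and the 10 ms timeout are hypothetical choices.

```c
#include <pthread.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: waits briefly for one event on the epoll fd.
 * The 10 ms timeout (an arbitrary choice) keeps the thread from hanging
 * if the watched fd was closed and dropped from the set first. */
static void *epoll_waiter_thread(void *p)
{
    int epfd = *(int *)p;
    struct epoll_event ev;
    (void)epoll_wait(epfd, &ev, 1, 10);
    return NULL;
}

/* Hypothetical stress loop: each round queues a ready event on sv[0],
 * then closes sv[0] while another thread may be in epoll_wait().
 * Returns the number of rounds completed, or -1 on a setup failure. */
int stress_close_vs_epoll_wait(int rounds)
{
    int done = 0;
    for (int r = 0; r < rounds; r++) {
        int sv[2];
        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
            return -1;
        int epfd = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = sv[0] };
        /* Register sv[0], then make it readable so an event is queued. */
        if (epfd < 0 || epoll_ctl(epfd, EPOLL_CTL_ADD, sv[0], &ev) != 0 ||
            write(sv[1], "x", 1) != 1) {
            if (epfd >= 0)
                close(epfd);
            close(sv[0]);
            close(sv[1]);
            return -1;
        }
        pthread_t t;
        if (pthread_create(&t, NULL, epoll_waiter_thread, &epfd) != 0) {
            close(epfd);
            close(sv[0]);
            close(sv[1]);
            return -1;
        }
        close(sv[0]);          /* close the watched socket mid-wait */
        pthread_join(t, NULL); /* epfd stays valid until the join */
        close(sv[1]);
        close(epfd);
        done++;
    }
    return done;
}
```

On a kernel without the race, every round simply completes. Since the reported crash rate is very low, a real attempt would need a very large round count and several CPUs running the loop concurrently.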
>
> I took a look at this but have not found anything. I've yet to see this
> on my machines.
>
> When did you start noticing this?
Hard to say, but we have seen these crashes on a 3.3+ based kernel.
The probability of these crashes is very, very low.
* Re: strange crashes in tcp_poll() via epoll_wait
From: Eric Wong @ 2013-07-20 2:03 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Al Viro, netdev, linux-kernel
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2013-07-19 at 23:50 +0000, Eric Wong wrote:
> > Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > Hi Al
> > >
> > > I tried to debug strange crashes in tcp_poll() called from
> > > sys_epoll_wait() -> sock_poll()
> > >
> > > The symptom is that sock->sk is NULL and we therefore dereference a NULL
> > > pointer.
> > >
> > > These crashes are really rare, but it would still be nice to understand
> > > where the bug is. Presumably the latest kernels would crash in
> > > sock_poll() instead, because of the sk_can_busy_loop(sock->sk) call.
> > >
> > > We do test for sock->sk being NULL in sock_fasync(), but shouldn't epoll
> > > be safe because of the existing synchronization (epmutex)?
> >
> > It should be safe because of ep->mtx, actually, as epmutex is not taken
> > in sys_epoll_wait.
>
> Hmm, it might be more complex than that for multi-threaded programs:
>
> eventpoll_release_file()
>
> The problem might be that a thread closes a socket while an event
> is still queued for it.
But ep->mtx is also held when traversing the ready list with
ep_send_events_proc.
Can sock->sk somehow be NULL before hitting eventpoll_release_file?
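The locking argument above can be checked against a small user-space model. This is a toy analogy, not kernel code. The names `toy_ep`, `toy_socket`, `toy_send_events()`, `toy_close()` and `toy_run_race()` are hypothetical. A pthread mutex stands in for ep->mtx, and a one-slot pointer stands in for the ready list. The model assumes that the close path unlinks the item under that same mutex before sock->sk is cleared. Under that assumption, a poller that only inspects ready items while holding the mutex can never see a NULL sk.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; none of these are kernel APIs. */
struct toy_sock { int state; };
struct toy_socket { struct toy_sock *sk; };  /* models struct socket */

struct toy_ep {
    pthread_mutex_t mtx;       /* models ep->mtx */
    struct toy_socket *ready;  /* one-slot stand-in for the ready list */
};

/* Models the ready-list walk: inspect ready items only under mtx.
 * Returns 1 if it would have dereferenced a NULL sk. */
static int toy_send_events(struct toy_ep *ep)
{
    int null_seen = 0;
    pthread_mutex_lock(&ep->mtx);
    if (ep->ready != NULL && ep->ready->sk == NULL)
        null_seen = 1;
    pthread_mutex_unlock(&ep->mtx);
    return null_seen;
}

/* Models the close path ASSUMED here: unlink under mtx first,
 * and only afterwards tear the socket down (sk = NULL). */
static void toy_close(struct toy_ep *ep, struct toy_socket *sock)
{
    pthread_mutex_lock(&ep->mtx);
    if (ep->ready == sock)
        ep->ready = NULL;
    pthread_mutex_unlock(&ep->mtx);
    sock->sk = NULL;
}

struct toy_args { struct toy_ep *ep; int null_seen; };

static void *toy_waiter(void *p)
{
    struct toy_args *a = p;
    for (int i = 0; i < 1000; i++)
        a->null_seen |= toy_send_events(a->ep);
    return NULL;
}

/* Runs `rounds` close-vs-poll races; returns how many rounds observed
 * a NULL sk, or -1 if a thread could not be created. */
int toy_run_race(int rounds)
{
    static struct toy_sock sk = { 1 };
    int bad = 0;
    for (int r = 0; r < rounds; r++) {
        struct toy_socket sock = { &sk };
        struct toy_ep ep;
        pthread_mutex_init(&ep.mtx, NULL);
        ep.ready = &sock;
        struct toy_args a = { &ep, 0 };
        pthread_t t;
        if (pthread_create(&t, NULL, toy_waiter, &a) != 0) {
            pthread_mutex_destroy(&ep.mtx);
            return -1;
        }
        toy_close(&ep, &sock);
        pthread_join(t, NULL);
        pthread_mutex_destroy(&ep.mtx);
        bad += a.null_seen;
    }
    return bad;
}
```

Moving `sock->sk = NULL` above the locked section in `toy_close()` would remove that guarantee and introduce a data race. That is the situation the question about sk becoming NULL before eventpoll_release_file() is probing.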
> > I took a look at this but have not found anything. I've yet to see this
> > on my machines.
> >
> > When did you start noticing this?
>
> Hard to say, but we have seen these crashes on a 3.3+ based kernel.
So I don't think any of my epoll changes caused it. Phew!
> The probability of these crashes is very, very low.
This still worries me since I rely heavily on multi-threaded epoll. I
don't have a lot of cores/CPUs, though, so maybe it's harder to trigger
any potential race as a result...