All of lore.kernel.org
 help / color / mirror / Atom feed
From: Soheil Hassas Yeganeh <soheil@google.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Eric Dumazet <edumazet@google.com>,
	Guantao Liu <guantaol@google.com>,
	Khazhismel Kumykov <khazhy@google.com>,
	Linux-MM <linux-mm@kvack.org>,
	mm-commits@vger.kernel.org, Al Viro <viro@zeniv.linux.org.uk>,
	Willem de Bruijn <willemb@google.com>
Subject: Re: [patch 13/15] epoll: check ep_events_available() upon timeout
Date: Mon, 2 Nov 2020 14:54:37 -0500	[thread overview]
Message-ID: <CACSApvZiWLAzSpkkBSo9=U_Fpx85_xCryaDh+sFsr7Uw8-RjUg@mail.gmail.com> (raw)
In-Reply-To: <CAHk-=wh3FcWMRAt-WGYcKP-YxLDBkpbNtVzLrm+=t6xixV+A9w@mail.gmail.com>

On Mon, Nov 2, 2020 at 1:51 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, Nov 2, 2020 at 9:49 AM Soheil Hassas Yeganeh <soheil@google.com> wrote:
> >
> > Thank you for the suggestion. That was the first version I tried, and
> > I can confirm it fixes the race because we call ep_send_events() once
> > more before returning.  Though, I believe, due to time_out=1, we won't
> > goto fetch_events to call ep_events_available():
> >
> > if (!res && eavail &&
> >    !(res = ep_send_events(ep, events, maxevents)) && !timed_out)
> >  goto fetch_events;
>
> Right. We won't be repeating the loop, but we will do one final send_events.
>
> Which I think is really the point, no?

Yes, absolutely, that's the point.

> > You're spot on that the patch is more complicated than your
> > suggestion.  However, the downside I observed was a performance
> > regression for the non-racy case: Suppose there are a few threads with
> > a similar non-zero timeout and no ready event. They will all
> > experience a noticeable contention in ep_scan_ready_list, by
> > unconditionally calling ep_send_events(). The contention was large
> > because there will be 2 write locks on ep->lock and one mutex lock on
> > ep->mtx with a large critical section.
>
> Ugh. I really detest the eventpoll code. Afaik, it has no normal users
> anywhere, so it gets no real coverage except for the odd cases
> (presumably inside google?)
>
> A lot of the work over the last couple of years has been to try to
> simplify the code and streamline it, and fix bugs due to recursion.
>
> I really wish we would continue that pattern of trying to simplify
> this code rather than add more complexity on top of it, which is why I
> reacted to strongly to that patch.
>
> And that whole ep_poll() function is written in just about the most
> confusing way possible, with code that looks like loops but aren't
> ("while (0)") and goto's that _are_ loops ("goto fetch_events").

I wholeheartedly agree. Due to its cryptic implementation, It was
really difficult to think about the correctness of the fixes.

On Mon, Nov 2, 2020 at 2:39 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, Nov 2, 2020 at 10:51 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > I'll go stare at it some more.
>
> That code is fundamentally broken in so many ways.
>
> Look at how  ep_poll() does that ep_events_available() without
> actually holding the ep->lock (or the ep->mtx) in half the cases.
>
> End result: it works in 99.9% of all cases, but there's a race with
> somebody else doing
>
>         WRITE_ONCE(ep->ovflist, EP_UNACTIVE_PTR);
>         /*
>          * Quickly re-inject items left on "txlist".
>          */
>         list_splice(&txlist, &ep->rdllist);
>
> when that code can see an empty rdllist and an inactive ovflist and
> thus decide that there are no events available.
>
> I think the "Quickly re-inject" comment may be because some people
> knew of that race.
>
> The signal handling is also odd and looks broken.
>
> The "short circuit on fatal signals" means that ep_send_events() isn't
> actually done on a SIGKILL, but the code also used an exclusive wait,
> so nobody else will be woken up either.
>
> Admittedly you can steal wakeups other ways, by simply not caring
> about the end result, so maybe that's all just inherent in epoll
> anyway. But it looks strange, and it seems pointless: the right thing
> to do would seem to be simply to have a regular check for
> signal_pending(), and returning -EINTR if rather than looping.
>
> And that do { } while (0) is entirely pointless. It seems to exist in
> order to use "break" instead of the goto that everything else does,
> which I guess is nice, except the whole need for that comes from how
> oddly the code is written.
>
> Why doesn't this all do something like the attached instead?

This looks really great to me! Thank you!  I'll give it a try and get
back to you soon.

> NOTE! I did not bother to fix that ep_events_available() race.

Given that you're calling ep_events_available() under lock, I think
this should address the inefficiency for the non-racy timeout case,  I
mentioned above. The remaining races are preexisting and all result in
spurious events, which should be fine.

Thanks again!
Soheil

>                       Linus



> I'll go stare at it some more.
>
>                    Linus

  parent reply	other threads:[~2020-11-02 19:58 UTC|newest]

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-11-02  1:06 incoming Andrew Morton
2020-11-02  1:07 ` [patch 01/15] mm/mremap_pages: fix static key devmap_managed_key updates Andrew Morton
2020-11-02  1:07 ` [patch 02/15] hugetlb_cgroup: fix reservation accounting Andrew Morton
2020-11-02  1:07 ` [patch 03/15] mm: memcontrol: correct the NR_ANON_THPS counter of hierarchical memcg Andrew Morton
2020-11-02  1:07 ` [patch 04/15] mm: memcg: link page counters to root if use_hierarchy is false Andrew Morton
2020-11-02  1:07   ` [LTP] " Andrew Morton
2020-11-02  1:07 ` [patch 05/15] kasan: adopt KUNIT tests to SW_TAGS mode Andrew Morton
2020-11-02  1:07 ` [patch 06/15] mm: mempolicy: fix potential pte_unmap_unlock pte error Andrew Morton
2020-11-02  1:07 ` [patch 07/15] ptrace: fix task_join_group_stop() for the case when current is traced Andrew Morton
2020-11-02  1:07 ` [patch 08/15] lib/crc32test: remove extra local_irq_disable/enable Andrew Morton
2020-11-02  1:07 ` [patch 09/15] mm/truncate.c: make __invalidate_mapping_pages() static Andrew Morton
2020-11-02  1:07 ` [patch 10/15] kthread_worker: prevent queuing delayed work from timer_fn when it is being canceled Andrew Morton
2020-11-02  1:07 ` [patch 11/15] mm, oom: keep oom_adj under or at upper limit when printing Andrew Morton
2020-11-02  1:08 ` [patch 12/15] mm: always have io_remap_pfn_range() set pgprot_decrypted() Andrew Morton
2020-11-02  1:08 ` [patch 13/15] epoll: check ep_events_available() upon timeout Andrew Morton
2020-11-02 17:08   ` Linus Torvalds
2020-11-02 17:08     ` Linus Torvalds
2020-11-02 17:48     ` Soheil Hassas Yeganeh
2020-11-02 17:48       ` Soheil Hassas Yeganeh
2020-11-02 18:51       ` Linus Torvalds
2020-11-02 18:51         ` Linus Torvalds
2020-11-02 19:38         ` Linus Torvalds
2020-11-02 19:38           ` Linus Torvalds
2020-11-02 19:54         ` Soheil Hassas Yeganeh [this message]
2020-11-02 19:54           ` Soheil Hassas Yeganeh
2020-11-02 20:12           ` Linus Torvalds
2020-11-02 20:12             ` Linus Torvalds
2020-11-02  1:08 ` [patch 14/15] epoll: add a selftest for epoll timeout race Andrew Morton
2020-11-02  1:08 ` [patch 15/15] kernel/hung_task.c: make type annotations consistent Andrew Morton

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CACSApvZiWLAzSpkkBSo9=U_Fpx85_xCryaDh+sFsr7Uw8-RjUg@mail.gmail.com' \
    --to=soheil@google.com \
    --cc=akpm@linux-foundation.org \
    --cc=dave@stgolabs.net \
    --cc=edumazet@google.com \
    --cc=guantaol@google.com \
    --cc=khazhy@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mm-commits@vger.kernel.org \
    --cc=torvalds@linux-foundation.org \
    --cc=viro@zeniv.linux.org.uk \
    --cc=willemb@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.