mm-commits.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Soheil Hassas Yeganeh <soheil@google.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Eric Dumazet <edumazet@google.com>,
	Guantao Liu <guantaol@google.com>,
	Khazhismel Kumykov <khazhy@google.com>,
	Linux-MM <linux-mm@kvack.org>,
	mm-commits@vger.kernel.org, Al Viro <viro@zeniv.linux.org.uk>,
	Willem de Bruijn <willemb@google.com>
Subject: Re: [patch 13/15] epoll: check ep_events_available() upon timeout
Date: Mon, 2 Nov 2020 14:54:37 -0500	[thread overview]
Message-ID: <CACSApvZiWLAzSpkkBSo9=U_Fpx85_xCryaDh+sFsr7Uw8-RjUg@mail.gmail.com> (raw)
In-Reply-To: <CAHk-=wh3FcWMRAt-WGYcKP-YxLDBkpbNtVzLrm+=t6xixV+A9w@mail.gmail.com>

On Mon, Nov 2, 2020 at 1:51 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, Nov 2, 2020 at 9:49 AM Soheil Hassas Yeganeh <soheil@google.com> wrote:
> >
> > Thank you for the suggestion. That was the first version I tried, and
> > I can confirm it fixes the race because we call ep_send_events() once
> > more before returning.  Though, I believe, due to time_out=1, we won't
> > goto fetch_events to call ep_events_available():
> >
> > if (!res && eavail &&
> >    !(res = ep_send_events(ep, events, maxevents)) && !timed_out)
> >  goto fetch_events;
>
> Right. We won't be repeating the loop, but we will do one final send_events.
>
> Which I think is really the point, no?

Yes, absolutely, that's the point.

> > You're spot on that the patch is more complicated than your
> > suggestion.  However, the downside I observed was a performance
> > regression for the non-racy case: Suppose there are a few threads with
> > a similar non-zero timeout and no ready event. They will all
> > experience a noticeable contention in ep_scan_ready_list, by
> > unconditionally calling ep_send_events(). The contention was large
> > because there will be 2 write locks on ep->lock and one mutex lock on
> > ep->mtx with a large critical section.
>
> Ugh. I really detest the eventpoll code. Afaik, it has no normal users
> anywhere, so it gets no real coverage except for the odd cases
> (presumably inside google?)
>
> A lot of the work over the last couple of years has been to try to
> simplify the code and streamline it, and fix bugs due to recursion.
>
> I really wish we would continue that pattern of trying to simplify
> this code rather than add more complexity on top of it, which is why I
> reacted to strongly to that patch.
>
> And that whole ep_poll() function is written in just about the most
> confusing way possible, with code that looks like loops but aren't
> ("while (0)") and goto's that _are_ loops ("goto fetch_events").

I wholeheartedly agree. Due to its cryptic implementation, It was
really difficult to think about the correctness of the fixes.

On Mon, Nov 2, 2020 at 2:39 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, Nov 2, 2020 at 10:51 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > I'll go stare at it some more.
>
> That code is fundamentally broken in so many ways.
>
> Look at how  ep_poll() does that ep_events_available() without
> actually holding the ep->lock (or the ep->mtx) in half the cases.
>
> End result: it works in 99.9% of all cases, but there's a race with
> somebody else doing
>
>         WRITE_ONCE(ep->ovflist, EP_UNACTIVE_PTR);
>         /*
>          * Quickly re-inject items left on "txlist".
>          */
>         list_splice(&txlist, &ep->rdllist);
>
> when that code can see an empty rdllist and an inactive ovflist and
> thus decide that there are no events available.
>
> I think the "Quickly re-inject" comment may be because some people
> knew of that race.
>
> The signal handling is also odd and looks broken.
>
> The "short circuit on fatal signals" means that ep_send_events() isn't
> actually done on a SIGKILL, but the code also used an exclusive wait,
> so nobody else will be woken up either.
>
> Admittedly you can steal wakeups other ways, by simply not caring
> about the end result, so maybe that's all just inherent in epoll
> anyway. But it looks strange, and it seems pointless: the right thing
> to do would seem to be simply to have a regular check for
> signal_pending(), and returning -EINTR if rather than looping.
>
> And that do { } while (0) is entirely pointless. It seems to exist in
> order to use "break" instead of the goto that everything else does,
> which I guess is nice, except the whole need for that comes from how
> oddly the code is written.
>
> Why doesn't this all do something like the attached instead?

This looks really great to me! Thank you!  I'll give it a try and get
back to you soon.

> NOTE! I did not bother to fix that ep_events_available() race.

Given that you're calling ep_events_available() under lock, I think
this should address the inefficiency for the non-racy timeout case,  I
mentioned above. The remaining races are preexisting and all result in
spurious events, which should be fine.

Thanks again!
Soheil

>                       Linus



> I'll go stare at it some more.
>
>                    Linus

  parent reply	other threads:[~2020-11-02 19:58 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-11-02  1:06 incoming Andrew Morton
2020-11-02  1:07 ` [patch 01/15] mm/mremap_pages: fix static key devmap_managed_key updates Andrew Morton
2020-11-02  1:07 ` [patch 02/15] hugetlb_cgroup: fix reservation accounting Andrew Morton
2020-11-02  1:07 ` [patch 03/15] mm: memcontrol: correct the NR_ANON_THPS counter of hierarchical memcg Andrew Morton
2020-11-02  1:07 ` [patch 04/15] mm: memcg: link page counters to root if use_hierarchy is false Andrew Morton
2020-11-02  1:07 ` [patch 05/15] kasan: adopt KUNIT tests to SW_TAGS mode Andrew Morton
2020-11-02  1:07 ` [patch 06/15] mm: mempolicy: fix potential pte_unmap_unlock pte error Andrew Morton
2020-11-02  1:07 ` [patch 07/15] ptrace: fix task_join_group_stop() for the case when current is traced Andrew Morton
2020-11-02  1:07 ` [patch 08/15] lib/crc32test: remove extra local_irq_disable/enable Andrew Morton
2020-11-02  1:07 ` [patch 09/15] mm/truncate.c: make __invalidate_mapping_pages() static Andrew Morton
2020-11-02  1:07 ` [patch 10/15] kthread_worker: prevent queuing delayed work from timer_fn when it is being canceled Andrew Morton
2020-11-02  1:07 ` [patch 11/15] mm, oom: keep oom_adj under or at upper limit when printing Andrew Morton
2020-11-02  1:08 ` [patch 12/15] mm: always have io_remap_pfn_range() set pgprot_decrypted() Andrew Morton
2020-11-02  1:08 ` [patch 13/15] epoll: check ep_events_available() upon timeout Andrew Morton
2020-11-02 17:08   ` Linus Torvalds
2020-11-02 17:48     ` Soheil Hassas Yeganeh
2020-11-02 18:51       ` Linus Torvalds
2020-11-02 19:38         ` Linus Torvalds
2020-11-02 19:54         ` Soheil Hassas Yeganeh [this message]
2020-11-02 20:12           ` Linus Torvalds
2020-11-02  1:08 ` [patch 14/15] epoll: add a selftest for epoll timeout race Andrew Morton
2020-11-02  1:08 ` [patch 15/15] kernel/hung_task.c: make type annotations consistent Andrew Morton

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CACSApvZiWLAzSpkkBSo9=U_Fpx85_xCryaDh+sFsr7Uw8-RjUg@mail.gmail.com' \
    --to=soheil@google.com \
    --cc=akpm@linux-foundation.org \
    --cc=dave@stgolabs.net \
    --cc=edumazet@google.com \
    --cc=guantaol@google.com \
    --cc=khazhy@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mm-commits@vger.kernel.org \
    --cc=torvalds@linux-foundation.org \
    --cc=viro@zeniv.linux.org.uk \
    --cc=willemb@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).