linux-mm.kvack.org archive mirror
From: Linus Torvalds <torvalds@linux-foundation.org>
To: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Davidlohr Bueso <dave@stgolabs.net>,
	 Eric Dumazet <edumazet@google.com>,
	Guantao Liu <guantaol@google.com>,
	 Khazhismel Kumykov <khazhy@google.com>,
	Linux-MM <linux-mm@kvack.org>,
	mm-commits@vger.kernel.org,  Al Viro <viro@zeniv.linux.org.uk>,
	Willem de Bruijn <willemb@google.com>
Subject: Re: [patch 13/15] epoll: check ep_events_available() upon timeout
Date: Mon, 2 Nov 2020 11:38:53 -0800
Message-ID: <CAHk-=wizk=OxUyQPbO8MS41w2Pag1kniUV5WdD5qWL-gq1kjDA@mail.gmail.com>
In-Reply-To: <CAHk-=wh3FcWMRAt-WGYcKP-YxLDBkpbNtVzLrm+=t6xixV+A9w@mail.gmail.com>

On Mon, Nov 2, 2020 at 10:51 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I'll go stare at it some more.

That code is fundamentally broken in so many ways.

Look at how ep_poll() calls ep_events_available() without
actually holding the ep->lock (or the ep->mtx) in half the cases.

End result: it works in 99.9% of all cases, but there's a race with
somebody else doing

        WRITE_ONCE(ep->ovflist, EP_UNACTIVE_PTR);
        /*
         * Quickly re-inject items left on "txlist".
         */
        list_splice(&txlist, &ep->rdllist);

when that code can see an empty rdllist and an inactive ovflist and
thus decide that there are no events available.
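
A toy userspace model of that window may make it easier to see. This is
only a sketch, not kernel code: a counter and a flag stand in for the real
lists, every toy_* name is made up for illustration, and
toy_events_available() just encodes the rule described above (empty
rdllist plus inactive ovflist means "no events").

```c
#include <stdbool.h>

/* Toy stand-in for struct eventpoll; not the real layout. */
struct toy_ep {
	int rdllist_len;	/* number of items on the ready list */
	bool ovflist_active;	/* true while a scan owns the ready list */
};

/* What a lockless observer sees at two points in the scan tail. */
struct toy_observation {
	bool mid_scan;		/* between the two statements quoted above */
	bool after_scan;	/* once the splice has happened */
};

/* Assumed rule: events exist if rdllist is non-empty or ovflist is active. */
static bool toy_events_available(const struct toy_ep *ep)
{
	return ep->rdllist_len > 0 || ep->ovflist_active;
}

/* Replays the two quoted statements in order, with a check in between. */
static struct toy_observation toy_scan_tail(void)
{
	/* A scan is in flight and one undelivered item sits on txlist. */
	struct toy_ep ep = { .rdllist_len = 0, .ovflist_active = true };
	int txlist_len = 1;
	struct toy_observation obs;

	ep.ovflist_active = false;	/* WRITE_ONCE(ep->ovflist, EP_UNACTIVE_PTR) */
	obs.mid_scan = toy_events_available(&ep);	/* unlocked check lands here */

	ep.rdllist_len += txlist_len;	/* list_splice(&txlist, &ep->rdllist) */
	obs.after_scan = toy_events_available(&ep);

	return obs;
}
```

In this model toy_scan_tail().mid_scan is false even though an item is
still pending on txlist, and after_scan is true again one statement later.
Holding the lock across both statements and across the check is what
closes the window.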

I think the "Quickly re-inject" comment may be because some people
knew of that race.

The signal handling is also odd and looks broken.

The "short circuit on fatal signals" means that ep_send_events() isn't
actually done on a SIGKILL, but the code also used an exclusive wait,
so nobody else will be woken up either.
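
As a rough illustration of that concern, here is a toy model (userspace C,
not kernel code, all toy_* names hypothetical) with two waiters queued
exclusively and one event ready. It assumes the woken waiter bails out on
a fatal signal before sending anything, as described above.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy waiter: either still asleep or finished with a result. */
struct toy_waiter {
	bool fatal_signal;	/* stands in for fatal_signal_pending(current) */
	bool asleep;
	int result;		/* events delivered, or -EINTR */
};

struct toy_epoll {
	int ready;		/* events waiting to be sent */
};

/* A woken waiter short-circuits on a fatal signal before sending events. */
static void toy_waiter_run(struct toy_epoll *ep, struct toy_waiter *w)
{
	w->asleep = false;
	if (w->fatal_signal) {
		w->result = -EINTR;	/* the event stays on the ready list */
		return;
	}
	w->result = ep->ready;
	ep->ready = 0;
}

/* Exclusive wakeup: only the first sleeping waiter is woken. */
static void toy_wake_exclusive(struct toy_epoll *ep,
			       struct toy_waiter *ws, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (ws[i].asleep) {
			toy_waiter_run(ep, &ws[i]);
			return;
		}
	}
}

/* Returns how many events are left with only a sleeping waiter around. */
static int toy_stranded_events(void)
{
	struct toy_epoll ep = { .ready = 1 };
	struct toy_waiter ws[2] = {
		{ .fatal_signal = true,  .asleep = true, .result = 0 },
		{ .fatal_signal = false, .asleep = true, .result = 0 },
	};

	toy_wake_exclusive(&ep, ws, 2);

	/* Waiter 0 took the wakeup and left; waiter 1 never heard about it. */
	return (ws[0].result == -EINTR && ws[1].asleep) ? ep.ready : 0;
}
```

In this model toy_stranded_events() returns 1: the single wakeup went to
the waiter that was about to die, and the healthy waiter keeps sleeping
until something else wakes it.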

Admittedly you can steal wakeups other ways, by simply not caring
about the end result, so maybe that's all just inherent in epoll
anyway. But it looks strange, and it seems pointless: the right thing
to do would seem to be simply to have a regular check for
signal_pending(), and to return -EINTR if one is pending rather than looping.

And that do { } while (0) is entirely pointless. It seems to exist in
order to use "break" instead of the goto that everything else does,
which I guess is nice, except the whole need for that comes from how
oddly the code is written.
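
To make the point about the loop concrete, here is a stripped-down
userspace sketch of the two shapes. The function names are invented for
illustration, and the `woken` flag is an assumption standing in for a
non-zero return from schedule_hrtimeout_range().

```c
#include <stdbool.h>

/* Old shape: a do { } while (0) used only so that "break" can exit early. */
static int toy_wait_with_break(bool eavail, bool woken)
{
	int timed_out = 0;

	do {
		if (eavail)
			break;
		if (!woken) {
			timed_out = 1;
			break;
		}
	} while (0);

	return timed_out;
}

/* Straight-line shape: the same decision without the pseudo-loop. */
static int toy_wait_straight(bool eavail, bool woken)
{
	int timed_out = 0;

	if (!eavail)
		timed_out = !woken;

	return timed_out;
}
```

Both functions return the same value for all four input combinations, so
the loop construct adds nothing beyond a place to hang the "break".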

Why doesn't this all do something like the attached instead?

NOTE! I did not bother to fix that ep_events_available() race.

                      Linus

[-- Attachment #2: patch --]
[-- Type: application/octet-stream, Size: 4551 bytes --]

 fs/eventpoll.c | 114 +++++++++++++++++++++++++++------------------------------
 1 file changed, 54 insertions(+), 60 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 4df61129566d..d4732327f57a 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1864,85 +1864,79 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 	 */
 	ep_reset_busy_poll_napi_id(ep);
 
-	do {
-		/*
-		 * Internally init_wait() uses autoremove_wake_function(),
-		 * thus wait entry is removed from the wait queue on each
-		 * wakeup. Why it is important? In case of several waiters
-		 * each new wakeup will hit the next waiter, giving it the
-		 * chance to harvest new event. Otherwise wakeup can be
-		 * lost. This is also good performance-wise, because on
-		 * normal wakeup path no need to call __remove_wait_queue()
-		 * explicitly, thus ep->lock is not taken, which halts the
-		 * event delivery.
-		 */
-		init_wait(&wait);
+	if (signal_pending(current))
+		return -EINTR;
 
-		write_lock_irq(&ep->lock);
-		/*
-		 * Barrierless variant, waitqueue_active() is called under
-		 * the same lock on wakeup ep_poll_callback() side, so it
-		 * is safe to avoid an explicit barrier.
-		 */
-		__set_current_state(TASK_INTERRUPTIBLE);
-
-		/*
-		 * Do the final check under the lock. ep_scan_ready_list()
-		 * plays with two lists (->rdllist and ->ovflist) and there
-		 * is always a race when both lists are empty for short
-		 * period of time although events are pending, so lock is
-		 * important.
-		 */
-		eavail = ep_events_available(ep);
-		if (!eavail) {
-			if (signal_pending(current))
-				res = -EINTR;
-			else
-				__add_wait_queue_exclusive(&ep->wq, &wait);
-		}
-		write_unlock_irq(&ep->lock);
-
-		if (eavail || res)
-			break;
+	/*
+	 * Internally init_wait() uses autoremove_wake_function(),
+	 * thus wait entry is removed from the wait queue on each
+	 * wakeup. Why it is important? In case of several waiters
+	 * each new wakeup will hit the next waiter, giving it the
+	 * chance to harvest new event. Otherwise wakeup can be
+	 * lost. This is also good performance-wise, because on
+	 * normal wakeup path no need to call __remove_wait_queue()
+	 * explicitly, thus ep->lock is not taken, which halts the
+	 * event delivery.
+	 */
+	init_wait(&wait);
 
-		if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS)) {
-			timed_out = 1;
-			break;
-		}
+	write_lock_irq(&ep->lock);
+	/*
+	 * Barrierless variant, waitqueue_active() is called under
+	 * the same lock on wakeup ep_poll_callback() side, so it
+	 * is safe to avoid an explicit barrier.
+	 */
+	__set_current_state(TASK_INTERRUPTIBLE);
 
-		/* We were woken up, thus go and try to harvest some events */
-		eavail = 1;
+	/*
+	 * Do the final check under the lock. ep_scan_ready_list()
+	 * plays with two lists (->rdllist and ->ovflist) and there
+	 * is always a race when both lists are empty for short
+	 * period of time although events are pending, so lock is
+	 * important.
+	 */
+	eavail = ep_events_available(ep);
+	if (!eavail)
+		__add_wait_queue_exclusive(&ep->wq, &wait);
+	write_unlock_irq(&ep->lock);
 
-	} while (0);
+	if (!eavail)
+		timed_out = !schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS);
 
 	__set_current_state(TASK_RUNNING);
 
+	/*
+	 * If we were woken up, assume there's something available, otherwise
+	 * remove ourselves from the wait queue and check carefully (since we
+	 * hold the lock anyway).
+	 */
+	eavail = 1;
 	if (!list_empty_careful(&wait.entry)) {
 		write_lock_irq(&ep->lock);
 		__remove_wait_queue(&ep->wq, &wait);
+		eavail = ep_events_available(ep);
 		write_unlock_irq(&ep->lock);
 	}
 
 send_events:
-	if (fatal_signal_pending(current)) {
-		/*
-		 * Always short-circuit for fatal signals to allow
-		 * threads to make a timely exit without the chance of
-		 * finding more events available and fetching
-		 * repeatedly.
-		 */
-		res = -EINTR;
+	if (res)
+		return res;
+	if (eavail) {
+		res = ep_send_events(ep, events, maxevents);
+		if (res)
+			return res;
 	}
+	if (signal_pending(current))
+		return -EINTR;
+
 	/*
-	 * Try to transfer events to user space. In case we get 0 events and
-	 * there's still timeout left over, we go trying again in search of
-	 * more luck.
+	 * In case we get 0 events and there's still timeout left over, we
+	 * go trying again in search of more luck.
 	 */
-	if (!res && eavail &&
-	    !(res = ep_send_events(ep, events, maxevents)) && !timed_out)
+	if (!timed_out)
 		goto fetch_events;
 
-	return res;
+	return 0;
 }
 
 /**
