From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Roman Penyaev <rpenyaev@suse.de>,
	Andrew Morton <akpm@linux-foundation.org>,
	Jason Baron <jbaron@akamai.com>,
	Khazhismel Kumykov <khazhy@google.com>,
	Alexander Viro <viro@zeniv.linux.org.uk>, Heiher <r@hev.cc>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: [PATCH 5.4 58/90] epoll: atomically remove wait entry on wake up
Date: Wed, 13 May 2020 11:44:54 +0200
Message-ID: <20200513094415.888707435@linuxfoundation.org>
In-Reply-To: <20200513094408.810028856@linuxfoundation.org>

From: Roman Penyaev <rpenyaev@suse.de>

commit 412895f03cbf9633298111cb4dfde13b7720e2c5 upstream.

This patch does two things:

 - fixes a lost wakeup introduced by commit 339ddb53d373 ("fs/epoll:
   remove unnecessary wakeups of nested epoll")

 - improves event delivery performance.

The problem is the following: if N (>1) threads are waiting on ep->wq
for new events and M (>1) events arrive, it is quite likely that more
than one wakeup hits the same wait queue entry, because there is a
large window between the __add_wait_queue_exclusive() call and the
subsequent __remove_wait_queue() call in ep_poll().

This can lead to lost wakeups, because the thread that was woken up may
not handle all the events in ->rdllist.  (The problem is described in
more detail here: https://lkml.org/lkml/2019/10/7/905)

The idea of this patch is to use init_wait() instead of
init_waitqueue_entry().

Internally, init_wait() sets autoremove_wake_function() as the wake
callback, which removes the wait entry from the list atomically (under
the wait queue lock).  The next wakeup therefore hits the next wait
entry in the wait queue, which prevents lost wakeups.

The problem is reliably reproduced by the epoll60 test case [1].

Removing the wait entry on wakeup also has performance benefits, because
there is no need to take ep->lock and remove the wait entry from the
queue after a successful wakeup.  Here is the timing output of the
epoll60 test case:

  With explicit wakeup from ep_scan_ready_list() (the state of the
  code prior to 339ddb53d373):

    real    0m6.970s
    user    0m49.786s
    sys     0m0.113s

  After this patch:

    real    0m5.220s
    user    0m36.879s
    sys     0m0.019s

The other test case is stress-epoll [2], where one thread consumes all
the events and the other threads produce many events:

  With explicit wakeup from ep_scan_ready_list() (the state of the
  code prior to 339ddb53d373):

    threads  events/ms  run-time ms
          8       5427         1474
         16       6163         2596
         32       6824         4689
         64       7060         9064
        128       6991        18309

  After this patch:

    threads  events/ms  run-time ms
          8       5598         1429
         16       7073         2262
         32       7502         4265
         64       7640         8376
        128       7634        16767

  (the "events/ms" column is the event bandwidth, so higher is
   better; the "run-time ms" column is the overall time spent
   running the benchmark, so lower is better)

[1] tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c
[2] https://github.com/rouming/test-tools/blob/master/stress-epoll.c

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jason Baron <jbaron@akamai.com>
Cc: Khazhismel Kumykov <khazhy@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Heiher <r@hev.cc>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200430130326.1368509-2-rpenyaev@suse.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 fs/eventpoll.c |   43 ++++++++++++++++++++++++-------------------
 1 file changed, 24 insertions(+), 19 deletions(-)

--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1827,7 +1827,6 @@ static int ep_poll(struct eventpoll *ep,
 {
 	int res = 0, eavail, timed_out = 0;
 	u64 slack = 0;
-	bool waiter = false;
 	wait_queue_entry_t wait;
 	ktime_t expires, *to = NULL;
 
@@ -1872,21 +1871,23 @@ fetch_events:
 	 */
 	ep_reset_busy_poll_napi_id(ep);
 
-	/*
-	 * We don't have any available event to return to the caller.  We need
-	 * to sleep here, and we will be woken by ep_poll_callback() when events
-	 * become available.
-	 */
-	if (!waiter) {
-		waiter = true;
-		init_waitqueue_entry(&wait, current);
-
+	do {
+		/*
+		 * Internally init_wait() uses autoremove_wake_function(),
+		 * thus wait entry is removed from the wait queue on each
+		 * wakeup. Why it is important? In case of several waiters
+		 * each new wakeup will hit the next waiter, giving it the
+		 * chance to harvest new event. Otherwise wakeup can be
+		 * lost. This is also good performance-wise, because on
+		 * normal wakeup path no need to call __remove_wait_queue()
+		 * explicitly, thus ep->lock is not taken, which halts the
+		 * event delivery.
+		 */
+		init_wait(&wait);
 		write_lock_irq(&ep->lock);
 		__add_wait_queue_exclusive(&ep->wq, &wait);
 		write_unlock_irq(&ep->lock);
-	}
 
-	for (;;) {
 		/*
 		 * We don't want to sleep if the ep_poll_callback() sends us
 		 * a wakeup in between. That's why we set the task state
@@ -1916,10 +1917,20 @@ fetch_events:
 			timed_out = 1;
 			break;
 		}
-	}
+
+		/* We were woken up, thus go and try to harvest some events */
+		eavail = 1;
+
+	} while (0);
 
 	__set_current_state(TASK_RUNNING);
 
+	if (!list_empty_careful(&wait.entry)) {
+		write_lock_irq(&ep->lock);
+		__remove_wait_queue(&ep->wq, &wait);
+		write_unlock_irq(&ep->lock);
+	}
+
 send_events:
 	/*
 	 * Try to transfer events to user space. In case we get 0 events and
@@ -1930,12 +1941,6 @@ send_events:
 	    !(res = ep_send_events(ep, events, maxevents)) && !timed_out)
 		goto fetch_events;
 
-	if (waiter) {
-		write_lock_irq(&ep->lock);
-		__remove_wait_queue(&ep->wq, &wait);
-		write_unlock_irq(&ep->lock);
-	}
-
 	return res;
 }
 


