linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: He Zhe <zhe.he@windriver.com>
To: Jens Axboe <axboe@kernel.dk>,
	viro@zeniv.linux.org.uk, bcrl@kvack.org,
	linux-fsdevel@vger.kernel.org, linux-aio@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH 1/2] eventfd: Make wake counter work for single fd instead of all
Date: Thu, 9 Apr 2020 18:37:14 +0800	[thread overview]
Message-ID: <c4984e43-480c-3f9a-2316-249c61507bf2@windriver.com> (raw)
In-Reply-To: <3f395813-a497-aa25-71cc-8aed345b9f75@kernel.dk>



On 4/8/20 4:06 AM, Jens Axboe wrote:
> On 4/7/20 3:59 AM, zhe.he@windriver.com wrote:
>> From: He Zhe <zhe.he@windriver.com>
>>
>> commit b5e683d5cab8 ("eventfd: track eventfd_signal() recursion depth")
>> introduces a percpu counter that tracks the percpu recursion depth and
>> warn if it greater than one, to avoid potential deadlock and stack
>> overflow.
>>
>> However sometimes different eventfds may be used in parallel.
>> Specifically, when high network load goes through kvm and vhost, working
>> as below, it would trigger the following call trace.
>>
>> -  100.00%
>>    - 66.51%
>>         ret_from_fork
>>         kthread
>>       - vhost_worker
>>          - 33.47% handle_tx_kick
>>               handle_tx
>>               handle_tx_copy
>>               vhost_tx_batch.isra.0
>>               vhost_add_used_and_signal_n
>>               eventfd_signal
>>          - 33.05% handle_rx_net
>>               handle_rx
>>               vhost_add_used_and_signal_n
>>               eventfd_signal
>>    - 33.49%
>>         ioctl
>>         entry_SYSCALL_64_after_hwframe
>>         do_syscall_64
>>         __x64_sys_ioctl
>>         ksys_ioctl
>>         do_vfs_ioctl
>>         kvm_vcpu_ioctl
>>         kvm_arch_vcpu_ioctl_run
>>         vmx_handle_exit
>>         handle_ept_misconfig
>>         kvm_io_bus_write
>>         __kvm_io_bus_write
>>         eventfd_signal
>>
>> 001: WARNING: CPU: 1 PID: 1503 at fs/eventfd.c:73 eventfd_signal+0x85/0xa0
>> ---- snip ----
>> 001: Call Trace:
>> 001:  vhost_signal+0x15e/0x1b0 [vhost]
>> 001:  vhost_add_used_and_signal_n+0x2b/0x40 [vhost]
>> 001:  handle_rx+0xb9/0x900 [vhost_net]
>> 001:  handle_rx_net+0x15/0x20 [vhost_net]
>> 001:  vhost_worker+0xbe/0x120 [vhost]
>> 001:  kthread+0x106/0x140
>> 001:  ? log_used.part.0+0x20/0x20 [vhost]
>> 001:  ? kthread_park+0x90/0x90
>> 001:  ret_from_fork+0x35/0x40
>> 001: ---[ end trace 0000000000000003 ]---
>>
>> This patch moves the percpu counter into eventfd control structure and
>> does the clean-ups, so that eventfd can still be protected from deadlock
>> while allowing different ones to work in parallel.
>>
>> As to potential stack overflow, we might want to figure out a better
>> solution in the future to warn when the stack is about to overflow so it
>> can be better utilized, rather than break the working flow when just the
>> second one comes.
> This doesn't work for the infinite recursion case, the state has to be
> global, or per thread.

Thanks, but I'm not very clear about why the counter has to be global or per
thread.

If the recursion happens on the same eventfd, the attempt to re-grab the same
ctx->wqh.lock would be blocked by the fd-specific counter in this patch.

If the recursion happens with a chain of different eventfds, that might lead to
a stack overflow issue. The issue should be handled but it seems unnecessary to
stop the just the second ring(when the counter is going to be 2) of the chain.

Specifically in the vhost case, it runs very likely with heavy network load
which generates loads of eventfd_signal. Delaying the eventfd_signal to worker
threads will still end up violating the global counter later and failing as
above.

So we might want to take care of the potential overflow later, hopefully with a
measurement that can tell us if it's about to overflow.

Zhe

>


  reply	other threads:[~2020-04-09 10:37 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-04-07 10:59 [PATCH 1/2] eventfd: Make wake counter work for single fd instead of all zhe.he
2020-04-07 10:59 ` [PATCH 2/2] aio: Update calling to eventfd_signal_count with righ parameter zhe.he
2020-04-07 20:06 ` [PATCH 1/2] eventfd: Make wake counter work for single fd instead of all Jens Axboe
2020-04-09 10:37   ` He Zhe [this message]
2020-04-09 15:44     ` Jens Axboe
2020-04-10 11:46       ` He Zhe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=c4984e43-480c-3f9a-2316-249c61507bf2@windriver.com \
    --to=zhe.he@windriver.com \
    --cc=axboe@kernel.dk \
    --cc=bcrl@kvack.org \
    --cc=linux-aio@kvack.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).