Date: Tue, 16 May 2023 18:44:27 +0200
From: Christian Brauner
To: Mike Christie
Cc: Linus Torvalds, Thorsten Leemhuis, nicolas.dichtel@6wind.com,
	Linux kernel regressions list, hch@infradead.org,
	stefanha@redhat.com, jasowang@redhat.com, mst@redhat.com,
	sgarzare@redhat.com, virtualization@lists.linux-foundation.org,
	ebiederm@xmission.com, konrad.wilk@oracle.com,
	linux-kernel@vger.kernel.org, Jens Axboe
Subject: Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
Message-ID: <20230516-summieren-bankintern-7def19d6ec45@brauner>
In-Reply-To: <683d0c30-4aeb-3547-49a9-90ce6eb1a7d0@oracle.com>

On Tue, May 16, 2023 at 11:24:48AM -0500, Mike Christie wrote:
> On 5/16/23 3:39 AM, Christian Brauner wrote:
> > On Mon, May 15, 2023 at 05:23:12PM -0500, Mike Christie wrote:
> >> On 5/15/23 10:44 AM, Linus Torvalds wrote:
> >>> On Mon, May 15, 2023 at 7:23 AM Christian Brauner wrote:
> >>>>
> >>>> So I think we will be able to address (1) and (2) by making vhost
> >>>> tasks proper threads and blocking every signal except for SIGKILL
> >>>> and SIGSTOP and then having vhost handle get_signal() - as you
> >>>> mentioned - the same way io_uring already does. We should also
> >>>> remove the ignore_signals thing completely imho. I don't think we
> >>>> ever want to do this with user workers.
> >>>
> >>> Right. That's what IO_URING does:
> >>>
> >>>         if (args->io_thread) {
> >>>                 /*
> >>>                  * Mark us an IO worker, and block any signal that isn't
> >>>                  * fatal or STOP
> >>>                  */
> >>>                 p->flags |= PF_IO_WORKER;
> >>>                 siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
> >>>         }
> >>>
> >>> and I really think that vhost should basically do exactly what
> >>> io_uring does.
> >>>
> >>> Not because io_uring fundamentally got this right - but simply
> >>> because io_uring had almost all the same bugs (and then some), and
> >>> what the io_uring worker threads ended up doing was to basically
> >>> zoom in on "this works".
> >>>
> >>> And it zoomed in on it largely by just going for "make it look as
> >>> much as possible as a real user thread", because every time the
> >>> kernel thread did something different, it just caused problems.
> >>>
> >>> So I think the patch should just look something like the attached.
> >>> Mike, can you test this on whatever vhost test-suite?
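For reference, here's roughly how the same masking could look when
applied to vhost's user workers in the same creation path - a completely
untested sketch, and the user_worker branch here is only illustrative,
not actual kernel code:

	if (args->user_worker) {
		/*
		 * Sketch only: treat vhost user workers like io_uring's
		 * io threads and leave only the fatal/STOP signals
		 * deliverable, so SIGKILL can actually reach the worker.
		 */
		p->flags |= PF_USER_WORKER;
		siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
	}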
> >>
> >> I tried that approach already and it doesn't work, because io_uring
> >> and vhost differ in that vhost drivers implement a device where each
> >> device has a vhost_task and the drivers have a file_operations for
> >> the device. When the vhost_task's parent gets a signal like SIGKILL,
> >> then it will exit and call into the vhost driver's
> >> file_operations->release function. At this time, we need to do cleanup
> >
> > But that's no reason why the vhost worker couldn't just be allowed to
> > exit on SIGKILL cleanly, similar to io_uring. That's just describing
> > the current architecture, which isn't a necessity afaict. And the
> > helper thread could, e.g., crash.
> >
> >> like flush the device which uses the vhost_task. There is also the
> >> case where, if the vhost_task gets a SIGKILL, we can just exit from
> >> under the vhost layer.
> >
> > In a way I really don't like the patch below, because this should be
> > solvable by adapting vhost workers. Right now, vhost is coming from a
> > kthread model and we ported it to a user worker model, and the whole
> > point of this exercise has been that the workers behave more like
> > regular userspace processes. So my tendency is to not massage kernel
> > signal handling to now also include a special case for user workers
> > in addition to kthreads. That's just the wrong way around, and then
> > vhost could've just stuck with kthreads in the first place.
>
> I would have preferred that :) Maybe let's take a step back and revisit
> that decision to make sure it was right. The vhost layer wants to:
>
> 1. inherit cgroups.
> 2. share the mm.
> 3. receive no signals.
> 4. not show up as an extra process like in Nicolas's bug.
> 5. have its worker threads counted under its parent's nproc limit.
>
> We can do 1 - 4 today with kthreads. Can we do #5 with kthreads? My
> first attempt, which passed around the creds to use for kthreads or
> exported a helper for the nproc accounting, was not liked, and we
> eventually ended up here.
>
> Is this hybrid user/kernel thread/task still the right way to go, or is
> it better to use kthreads and add some way to handle #5?

I think the io_uring model makes a lot more sense for vhost than the
current approach.

> >
> > So I'm fine with skipping over the freezing case for now, but SIGKILL
> > should be handled imho. Only init and kthreads should get the luxury
> > of ignoring SIGKILL.
> >
> > So, I'm afraid I'm asking some work here of you, but how feasible
> > would a model be where vhost_worker(), similar to io_wq_worker(),
> > gracefully handles SIGKILL? Yes, I see there's
> >
> > net.c:   .release = vhost_net_release
> > scsi.c:  .release = vhost_scsi_release
> > test.c:  .release = vhost_test_release
> > vdpa.c:  .release = vhost_vdpa_release
> > vsock.c: .release = virtio_transport_release
> > vsock.c: .release = vhost_vsock_dev_release
> >
> > but that means you have all the basic logic in place, and all of those
> > drivers also support the VHOST_RESET_OWNER ioctl, which also stops the
> > vhost worker. I'm confident that a lot of this can be leveraged to
> > just clean up on SIGKILL.
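To sketch what I mean - this just transplants the io_wq_worker() pattern
into a vhost_worker()-shaped function, untested and with the actual work
loop elided:

	static int vhost_worker(void *data)
	{
		struct vhost_dev *dev = data;
		struct ksignal ksig;

		for (;;) {
			if (signal_pending(current) && get_signal(&ksig)) {
				/*
				 * Fatal signal (SIGKILL): stop picking up
				 * new work and fall through to the same
				 * teardown that VHOST_RESET_OWNER and
				 * ->release already perform.
				 */
				break;
			}

			/* ... dequeue and run dev's work items as today ... */
		}

		return 0;
	}

The point being that the worker would then exit through teardown paths
the drivers already have, instead of ignoring the signal entirely.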
> We can do this, but the issue I'm worried about is that right now, if
> there is queued/running IO and userspace escalates to SIGKILL, then the
> vhost layer will still complete those IOs. If we now allow SIGKILL on
> the vhost thread, then those IOs might fail.
>
> If we get a SIGKILL, I can modify vhost_worker() so that it temporarily
> ignores the signal and allows IO/flushes/whatever-operations to complete
> at that level. However, we could hit issues where, when vhost_worker()

It's really not that different from io_uring though, which also flushes
out remaining io, no? This seems to basically line up with what
io_wq_worker() does.

> calls into the drivers listed above, and those drivers call into
> whatever kernel layer they use, that layer might do
>
> 	if (signal_pending(current))
> 		return failure;
>
> and we now fail.
>
> If we say that failing is acceptable behavior since we got a SIGKILL,
> then I can code what you are requesting.

I think this is fine, but I don't maintain vhost and we'd need their
opinion.
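FWIW, the "temporarily ignore it" variant you describe could look
something like this inside vhost_worker() - again just an untested
sketch of the idea, not a patch:

	if (signal_pending(current)) {
		struct ksignal ksig;

		/*
		 * Sketch: acknowledge the fatal signal, then clear the
		 * pending flag so that signal_pending() checks in the
		 * layers below don't fail the final flush.
		 */
		if (get_signal(&ksig))
			clear_thread_flag(TIF_SIGPENDING);
	}

	/* now let the queued IO/flushes complete before exiting */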