From: Michal Hocko <mhocko@kernel.org>
To: Coly Li <colyli@suse.de>
Cc: axboe@kernel.dk, linux-bcache@vger.kernel.org,
	linux-block@vger.kernel.org, hare@suse.de, mkoutny@suse.com,
	Oleg Nesterov <oleg@redhat.com>, Christoph Hellwig <hch@lst.de>
Subject: Re: [PATCH 1/2] bcache: ignore pending signals in bcache_device_init()
Date: Mon, 2 Mar 2020 18:28:40 +0100	[thread overview]
Message-ID: <20200302172840.GQ4380@dhcp22.suse.cz> (raw)
In-Reply-To: <cc759569-e79e-a1a0-3019-0e07dd6957cb@suse.de>

On Tue 03-03-20 01:06:38, Coly Li wrote:
> On 2020/3/2 9:40 PM, Michal Hocko wrote:
[...]
> > I cannot really comment on the systemd part, but it is quite unexpected
> > for it to have signals ignored completely.
> > 
> 
> I see. But so far I don't have a better solution for this problem.
> Asking people to do extra configuration of udev rules is very tricky;
> most common users will be scared off. I need to get it fixed without
> extra configuration.

Well, I believe that having explicit documentation that a large cache
requires more time to initialize is quite a reasonable way forward. We
already have a similar situation with memory hotplug on extra large
machines with a small block size.
 
> >>> Is there any problem with simply increasing the timeout on the system
> >>> which uses a large bcache?
> >>>
> >>
> >> At this moment, this is a workaround. Christoph Hellwig also suggests
> >> fixing kthread_run()/kthread_create(). Now I am looking for a method
> >> to distinguish, in kthread_run()/kthread_create(), whether the parent
> >> process was killed by the OOM killer and not by other processes, but
> >> the solution is not clear to me yet.
> > 
> > It is really hard to comment on this because I do not have sufficient
> > insight, but in general: the oom victim context can be checked by
> > tsk_is_oom_victim, but kernel threads are not subject to the oom killer
> > because they do not own any address space. I also suspect that none of
> > the data you allocate for the cache is accounted to any specific
> > process.
> 
> You are right, the cached data is not bound to a process, it is bound to
> specific backing block devices.
> 
> In my case, kthread_run()/kthread_create() is called in the context of
> the registration process (/lib/udev/bcache-register), so it is
> unnecessary to worry about the kthread address space. So maybe I can
> check tsk_is_oom_victim to judge whether the current process is being
> killed by the OOM killer, rather than simply calling signal_pending()?

What exactly are you going to do about this situation? If you want to
bail out, then you should simply do that on any pending fatal signal.
Why is OOM special?
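
For illustration only (an untested sketch against in-tree helpers, not
the actual patch under discussion; registration_checkpoint() is a
made-up name for wherever such a check would live), the two checks being
debated would look roughly like this in a long-running registration
path:

#include <linux/errno.h>
#include <linux/oom.h>		/* tsk_is_oom_victim() */
#include <linux/sched/signal.h>	/* fatal_signal_pending() */

static int registration_checkpoint(void)
{
	/*
	 * Option A: bail out on any pending fatal signal.  The OOM killer
	 * delivers SIGKILL, so this already covers the OOM victim case.
	 */
	if (fatal_signal_pending(current))
		return -EINTR;

	/*
	 * Option B: special-case only the OOM victim context, i.e. the
	 * narrower check asked about above.
	 */
	if (tsk_is_oom_victim(current))
		return -EINTR;

	return 0;
}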

> >> When the meta-data size is around 40GB, registering the cache device
> >> will take around 55 minutes on my machine with the current Linux
> >> kernel. I have a patch to reduce the time to around 7~8 minutes, but
> >> that is still too long. I may add a timeout to the bcache udev rule,
> >> for example 10 minutes, but as cache devices get larger and larger,
> >> the timeout will eventually not be enough.
> >>
> >> As I mentioned, this is a workaround to fix the problem for now.
> >> Fixing kthread_run()/kthread_create() may take a longer time for me.
> >> If there is a hint on how to do it, please let me know.
> > 
> > My main question is why there is any need to touch the kernel code. You
> > can still update the systemd/udev timeout AFAIK. This would be the
> > proper workaround from my (admittedly limited) POV.
> > 
> 
> I see your concern. But the udev timeout is global for all udev rules,
> and I am not sure whether changing it to a very long time is good ...
> (indeed I tried to add event_timeout=3600 but I can still see the signal
> received).

All processes which can finish within the default timeout are not going
to regress when the forceful termination happens later. I cannot really
comment on the systemd part because I am not familiar with its internals,
but the mere existence of the timeout suggests that the workflow should
be prepared for longer timeouts, because initialization taking a long
time is something we have to live with. Some things just take too long
to initialize.
 
> Ignoring the pending signal in the bcache registering code is the only
> method I currently know of to avoid bcache auto-register failure at boot
> time. If there is another way I can achieve the same goal, I'd like to
> try it.

The problem is that you are effectively violating signal delivery
semantics, because you are ignoring a user signal without userspace ever
noticing. That to me sounds like a free ticket to all sorts of
hard-to-debug problems. What if userspace expects the signal to be
handled? Really, you want to increase the existing timeout to work
around this. If you can make the initialization process swifter, even
better.
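
To make that concrete, the kind of in-kernel "swallow the signal"
approach being debated would be along these lines (an untested sketch
with a made-up helper name, not a recommendation and not the actual
patch); whoever sent the signal, whether udevd's timeout kill or a
user's kill(1), never finds out it was dropped:

#include <linux/sched/signal.h>	/* signal_pending(), flush_signals() */

static void swallow_pending_signals(void)
{
	/*
	 * Discarding queued signals here means the registering process
	 * (e.g. the udev worker) never observes the signal it was sent,
	 * which is exactly the semantic violation described above.
	 */
	if (signal_pending(current))
		flush_signals(current);
}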

> BTW, in the meantime, I am still looking for the reason why
> event_timeout=3600 in /etc/udev/udev.conf does not take effect...

I have a vague recollection that there are multiple timeouts and that
setting only some of them is not sufficient.
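
For reference, the timeouts that typically matter here (option names
taken from the systemd-udevd and systemd documentation; availability may
vary by distribution) are along these lines:

# /etc/udev/udev.conf: udevd kills a worker whose event runs longer
# than this many seconds
event_timeout=3600

# The same knob can be passed on the kernel command line:
#   udev.event_timeout=3600

# Independently, systemd itself may give up waiting for the device,
# e.g. via JobTimeoutSec= on the .device unit or the
# x-systemd.device-timeout= mount option, so raising only udev's
# event timeout may not be enough.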
-- 
Michal Hocko
SUSE Labs


Thread overview: 34+ messages
2020-03-02  9:34 [PATCH 0/2] bcache patches for Linux v5.6-rc5 Coly Li
2020-03-02  9:34 ` [PATCH 1/2] bcache: ignore pending signals in bcache_device_init() Coly Li
2020-03-02 12:27   ` Michal Hocko
2020-03-02 13:29     ` Coly Li
2020-03-02 13:40       ` Michal Hocko
2020-03-02 17:06         ` Coly Li
2020-03-02 17:28           ` Michal Hocko [this message]
2020-03-02 17:47             ` Coly Li
2020-03-03  1:22               ` Guoqing Jiang
2020-03-03  1:30                 ` Coly Li
2020-03-03  6:58           ` Сорокин Артем Сергеевич
2020-04-13  8:17             ` Coly Li
2020-03-02 13:49     ` Oleg Nesterov
2020-03-02 17:16       ` Coly Li
2020-03-02 17:19         ` Jens Axboe
2020-03-02 17:32           ` Coly Li
2020-03-02 20:33             ` Jens Axboe
2020-03-03  1:08               ` Coly Li
2020-03-03  7:22             ` Hannes Reinecke
2020-03-03  8:05       ` Michal Hocko
2020-03-03 12:19         ` Oleg Nesterov
2020-03-03 16:03           ` Michal Hocko
2020-03-04 11:36             ` Oleg Nesterov
2020-03-04 11:53               ` Oleg Nesterov
2020-03-04 18:42                 ` Jens Axboe
2020-03-04 11:57               ` Michal Hocko
2020-03-04 12:13                 ` Oleg Nesterov
2020-03-04 12:22                   ` Michal Hocko
2020-03-04 12:33                     ` Oleg Nesterov
2020-03-04 12:41                       ` Michal Hocko
2020-03-04 13:02                         ` Oleg Nesterov
2020-03-04 13:21                           ` Michal Hocko
2020-03-02 15:01     ` Jens Axboe
2020-03-02  9:34 ` [PATCH 2/2] bcache: fix code comments for ignore pending signals Coly Li
