Subject: Re: Some persistently stuck cache devices after backing device failures
From: Coly Li
Date: Fri, 29 Jul 2022 00:24:40 +0800
To: eddie@ehuk.net
Cc: linux-bcache@vger.kernel.org

> On 29 Jul 2022, at 00:15, Eddie Chapman wrote:
> 
> On 28/07/2022 16:40, Coly Li wrote:
>>> On 28 Jul 2022, at 21:14, Eddie Chapman wrote:
>>> 
>>> On 28/07/2022 13:37, Coly Li wrote:
>>>>> On 22 Jul 2022, at 00:18, Eddie Chapman wrote:
>>>>> 
>>>>> Hello,
>>>>> 
>>>>> I've been using bcache for several years and have learned how to resolve various problems. However, I now have a situation on a live server where I've tried everything, but it looks like I might have to reboot the kernel to resolve it. This will be a big pain for this server, so I thought I'd check here in case anyone knows something else I could try.
>>>>> 
>>>>> Let me try to describe the situation:
>>>>> 
>>>>> - The system is running vanilla stable kernel 5.10.107 and has been up for 100+ days.
>>>>> - There's a single nvme device with 15 partitions.
>>>>> - Each nvme partition is a bcache cache device in writeback mode, attached to a backing device of course (there are various types of backing devices).
>>>>> - Now a SATA controller has completely failed and taken out 3 spinning SATA disks, leading to several, but not all, of the backing devices failing.
>>>>> - The nvme device is fine and still acting as cache for some remaining, working backing devices, which have nothing to do with the failed disks.
>>>>> - The bcache backing devices that failed and their corresponding caches were in a mixed state immediately after the disks failed: some gone, some not.
>>>>> - I don't care about any of the failed data; I just want to clean up the mess and re-use the nvme partitions that had backing device failures with some other working spinning disks on the system.
>>>>> - So far I've been able to unregister, free, wipe and re-use MOST of the cache devices that had failed backing devices.
>>>>> - For the cache devices I did unregister, I was also able to clean up their failed backing devices completely (both the bcacheN device and the failed backing device are completely gone from /sys/block).
>>>>> - However, there are 3 cache devices, with failed backing devices, which completely refuse to either stop or unregister. I have written 1 many times to the "stop" and "unregister" files, but nothing changes and nothing is logged in the kernel log.
>>>>> - There has not been any crash or warning emitted by bcache in the kernel log, just the normal messages you would expect to see when backing devices fail.
>>>>> 
>>>>> For each of these 3 "stuck" cache devices:
>>>>> 
>>>>> - The cache directory still exists in /sys/fs/bcache/.
>>>>> - Inside the cache directory there is no bdev0 symlink anymore. There *is* the cache0 symlink pointing to the nvme partition.
>>>>> - The corresponding /sys/block/bcacheN directory does still exist.
>>>>> - There is nothing in the /sys/block/bcacheN/holders/ directory. Above the bcache device there were LVM devices, but I was able to remove all of them successfully with dmsetup remove. There is definitely nothing above it still holding the bcacheN device open.
>>>>> - The failed backing device, which is an md raid0, still exists on the system, but it is not possible to stop it with mdadm; it is also "stuck" (I've tried to "fail" and "remove" it normally, as well as using mdadm's special "failed" and "detached" keywords). It still shows up in /proc/mdstat. The raid members are the SATA disks which have now disappeared. mdadm -D says the array is in the state "broken, FAILED" and its disks are all "missing".
>>>>> - Inside the /sys/block/mdN/ directory there is NO "bcache" subdirectory.
>>>>> - Inside the /sys/block/mdN/holders/ directory there is still a working "bcacheN" symlink to /sys/block/bcacheN.
>>>>> - Inside the /sys/block/bcacheN directory there is a broken "bcache" symlink pointing to the now-disappeared /sys/block/mdN/bcache.
>>>>> 
>>>>> As I said, I don't care about any of the failed data; I only care about freeing the nvme cache devices. But I have 2 problems:
>>>>> 
>>>>> 1. Something in the kernel is constantly reading hundreds of Mbytes/s of data from the 3 "stuck" nvme partitions (but not writing anything, at least not at any significant rate). I can see it using atop. I'm wondering if bcache is stuck in a loop trying to read something from them, and that is why it will not let go of these remaining 3 nvme partitions. I believe this is affecting the performance of the nvme device as a whole for the other working bcache devices using it; load average is continually quite high as a result.
>>>>> 
>>>>> 2. I cannot re-use these 3 nvme partitions while they are "stuck", though this is less of a problem than 1.
>>>>> 
>>>>> I'm hoping there might be a simple way to force bcache to "let go" of these 3 cache devices without rebooting? I don't care if the leftover parts of the md backing devices don't go away; they are not doing any harm that I can tell, as long as I can successfully unregister the cache devices. Or at least a way to stop bcache from constantly reading hundreds of Mbyte/s; I could live with just that.
>>>>> 
>>>>> I see there is an "io_disable" file in the "internal" subdirectory. What does that do? Maybe it is the solution, but I dare not try it as I don't know what it does and don't want to risk crashing the kernel.
>>>>> 
>>>> The io_disable option might be helpful, but I don't recommend it. Setting it to 1 will directly reject all external and internal I/Os inside the bcache driver. Most of the time it just triggers errors and forces the cache or backing device to stop, but sometimes, if the I/O error is unrecoverable in a critical I/O path, it may trigger a kernel panic in upper layer code.
>>>> 
>>>> BTW, does it help a bit if you write 1 to /sys/fs/bcache/pendings_cleanup?
>>>> 
>>>> Coly Li
>>>> 
>>> 
>>> Hi Coly,
>>> 
>>> Really appreciate you looking at this. The server is still running in this state and I'm still hoping I can free these cache devices somehow.
>>> 
>>> I should have mentioned in my original email that when I was originally searching for solutions I came across the original mailing list posting for your patch that added /sys/fs/bcache/pendings_cleanup and thought "yes, that sounds like it will do it!". But I tried it and unfortunately nothing happened.
>> 
>> Copied. Then there is no luck with this...
>> 
>>> Thanks for confirming that io_disable is too risky, I'm glad I followed my instinct not to try that :-)
>>> 
>>> One interesting thing to add is that every 2 or 3 days bcache logs about 50 or so lines of exactly the same message:
>>> 
>>> bcache: bch_count_backing_io_errors() mdX: IO error on backing device, unrecoverable
>> 
>> This is exactly the I/O error returned from the backing device. Since it is an md raid0, one of the component disks might be failing soon.
> 
> This is why this situation is so puzzling; the bcache device is completely gone (the /sys/block/bcacheN directory is gone). The backing device itself still exists, but it is a "brain dead", non-functioning md device with no members, and it has no /sys/block/mdN/bcache directory. The only thing left is the cache device, which refuses to stop or unregister and is still somehow "stuck" to the backing device, even though bcache has "let go" of the failed backing device.
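
Just to restate, in shell terms, what you have already confirmed (the names below are placeholders, not taken from your system, so substitute the real <cset-uuid>, bcacheN and mdN):

  ls -l /sys/fs/bcache/<cset-uuid>/            # cache0 symlink present, bdev0 gone
  readlink /sys/fs/bcache/<cset-uuid>/cache0   # the nvme partition still being pinned
  ls -l /sys/block/bcacheN/holders/            # empty, nothing on top of bcacheN
  readlink /sys/block/bcacheN/bcache           # dangling link to the dead /sys/block/mdN/bcache
  ls -l /sys/block/mdN/holders/                # yet mdN still lists bcacheN as a holder

The last two entries disagree with each other, which is consistent with a reference being leaked somewhere between bcache and md.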
> 
> There surely must be a bug here somewhere, because the cache device should not still be holding onto a device which bcache has already "ejected" from itself when it failed.

I feel it is some refcount-related issue, either in bcache or in the code layer below bcache. Some refcount is not 0 after the failure happens, so the cache device cannot be retired.

>>> With mdX being one of the failed md raid0 backing devices, it can be any of the 3 devices, but then all the lines logged are for just that one device (i.e. not a mix of the 3 devices).
>>> 
>>> It's interesting that it happens sporadically every few days and then bcache goes completely quiet again. It made me wonder if perhaps the cache device is still stuck trying to write old data to the backing devices over and over, and every now and then some very long timeout kicks in for just a subset of I/Os, and then it moves on to another subset.
>>> 
>> It is possible that bcache was trying to write back, hit the error location on the backing device, and will try it again in the next run. In the bcache code, if the I/O errors don't happen frequently within a period, it may continue to work. If there are too many I/O errors from the backing device, then the bcache device will stop (io_disable is set within the bcache code) and disappear.
> 
> Yes, this has already happened when the backing device completely failed about 2 weeks ago.
> 
> Note, I don't care about any of the data here, neither on the cache device nor on the backing device. The problem is that bcache won't let go of the nvme cache device when it should, and worse, it is constantly reading many hundreds of Mbyte/s from the nvme partition it is hanging onto, affecting its performance.
> 
>>> So that made me wonder whether there is some configurable timeout setting somewhere in the cache device settings that I can reduce, which would then result in bcache giving up and letting go of these devices sooner, e.g. in some days instead of months or years?
>> In the backing device sysfs directory, you may find a file named io_errors, which counts the I/O errors of this backing device. And you may find another file, io_error_limit; this is the threshold which stops the bcache device when the I/O error count reaches it. The default io_error_limit is 64 if I remember correctly; you can modify it to a much smaller number to trigger backing device failure earlier.
> 
> The /sys/block/mdN/bcache directory is gone, so those files are gone unfortunately. The /sys/block/mdN directory is still there; the md device still exists, but in a "brain dead" state.
> 
>> For the cache device, there are io_errors_halftime and io_error_limit files as well. You may also modify io_error_limit to a smaller value to trigger cache failure earlier.
> 
> I do have these files, since the cache device still exists and I cannot stop it. io_error_halflife has 0 and io_error_limit has 8.
> 
>> When io errors reach io_error_limit, the io_disable flag will be set on the cache or backing device (depending on where the I/O errors come from), and bcache will reject all external and internal I/O requests. If some upper layer code encounters an I/O error that it cannot handle, a kernel panic is possible. So setting the io_disable file is the same as waiting for io errors to reach io_error_limit.
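
For completeness, all of the above are plain sysfs reads and writes; roughly something like this (paths partly from memory and the names in angle brackets are placeholders, so please double-check them on your side before touching a production box):

  # backing device side, only usable while /sys/block/mdN/bcache still exists
  cat /sys/block/mdN/bcache/io_errors            # I/O errors counted so far
  echo 8 > /sys/block/mdN/bcache/io_error_limit  # trip the failure path earlier

  # cache side, in the cache set directory
  cat /sys/fs/bcache/<cset-uuid>/io_error_limit
  cat /sys/fs/bcache/<cset-uuid>/io_error_halflife

In your case the backing-side files are already gone, so this is mostly for anyone else reading the thread later.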
> 
> I suspect this must have already happened, as /sys/block/bcacheN is gone?

Yes, you are right.

> 
>> This is as-designed behavior: avoiding data corruption is the highest priority, and rejecting I/O to avoid data corruption is more important than avoiding a system outage.
> 
> Right. Yes, this is very good and desirable, and it already happened here when the backing devices failed at the start.
> 
>>> If there is any further info I can pull from the system that might be useful, please let me know. Ultimately, if there's nothing I can do, I'll just have to bite the bullet and arrange a reboot.
>> I am not able to provide a more useful suggestion for the current situation; arranging a reboot might be the ideal plan.
> 
> Yes, I guess I'll have to do this unfortunately. I had hoped there might be a way to force bcache to release the nvme device like it should, and like it usually does in a situation like this, as rebooting this particular server is a big pain. But if not, is there anything I can do to try to get some data that might give an idea of what has gone wrong here in bcache? It could possibly lead to a fix, because this is a very bad bug for anyone who might get hit by it in the future. I guess it could also be a bug in the md raid driver, in which case I could report it there if that turns out to be the case.

It might be a reference-counter-related issue somewhere in the I/O stack; maybe it is in bcache, or maybe in a lower layer below bcache. Device failure testing is hard to cover fully, and software-emulated failures are sometimes far from real hardware failures. Is it possible for you to trigger a crash dump file? Then I can look inside to find some clue if I am lucky, although it cannot help your current situation…

Thanks.

Coly Li
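
P.S. In case it helps with capturing the dump: one common way, assuming a kdump/crashkernel setup is already in place on the machine (the packages and the dump path vary by distribution, so treat this only as a rough sketch), is:

  cat /sys/kernel/kexec_crash_loaded   # should print 1 when a crash kernel is loaded
  echo 1 > /proc/sys/kernel/sysrq      # make sure sysrq is enabled
  echo c > /proc/sysrq-trigger         # deliberately crash the box; kdump then boots
                                       # the crash kernel and saves a vmcore (often
                                       # under /var/crash/) before the normal reboot

This obviously takes the server down, so it only makes sense to do it as part of the reboot you are already planning.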