* Detecting new partitions fails after "btrfs device scan --forget"
@ 2020-09-13 18:47 Jonas Zeiger
  2020-09-14  0:36 ` Anand Jain
  0 siblings, 1 reply; 3+ messages in thread
From: Jonas Zeiger @ 2020-09-13 18:47 UTC (permalink / raw)
  To: linux-btrfs

Hi all,

Our automated Linux deployments on x86_64 stopped working after switching 
from kernel 5.8.7 to 5.8.8 or 5.8.9 (both tested).

We perform approximately the following routine on PXE-booted servers with 
EMPTY disks (no I/O has ever been done on them):

# [Step 1] Wipe disks
# $d == /dev/sda, /dev/sdb, and so on
btrfs device scan --forget              # drop stale btrfs devices from kernel memory
wipefs --all $d                         # erase filesystem/partition-table signatures
dd if=/dev/zero of=$d bs=32M count=1    # zero the first 32 MiB
sync
sleep 1
hdparm -z $d                            # ask the kernel to re-read the partition table

# [Step 2] Partitioning
parted -a optimal -s $d \
	mklabel gpt \
	mkpart primary 0% 4M \
	mkpart primary 4M 10% \
	name 2 "system" \
	mkpart primary 10% 100% \
	name 3 "ceph" \
	quit
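
On a working kernel the new partitions appear immediately; a quick check 
(illustrative, not part of the original routine):

lsblk -o NAME,SIZE,PARTLABEL $d    # expect partitions 1-3 with labels "system" and "ceph"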

[Step 2] fails on kernels 5.8.8 and 5.8.9 with:
Error: Partition(s) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 
16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 
34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 
52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 
70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 
88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 
105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 
119, 120, 121, 122, 123, 124, 125, 126, 127, 128 on /dev/sda have been 
written, but we have been unable to inform the kernel of the change, 
probably because it/they are in use.  As a result, the old partition(s) 
will remain in use.  You should reboot now before making further 
changes.

The partitions do not become visible so the deployment can't continue.

I logged into the system at this point and verified that no filesystem 
was mounted and that no relevant messages appeared on the console.
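
Roughly, the checks looked like this (reconstructed for illustration; 
these exact commands are not from the original report):

grep /dev/sd /proc/mounts    # prints nothing: no disk-backed filesystem mounted
dmesg | tail -n 50           # no relevant kernel messages around the failure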

The only weird thing I noticed was a lot of these:
[26523.729131] ata3.00: Enabling discard_zeroes_data
[26524.737705] ata3.00: Enabling discard_zeroes_data
[26524.783084] ata7.00: Enabling discard_zeroes_data
[26524.788256] ata7.00: Enabling discard_zeroes_data
[26524.877407] ata7.00: Enabling discard_zeroes_data
[26525.885710] ata7.00: Enabling discard_zeroes_data
[26525.931513] ata4.00: Enabling discard_zeroes_data
[26525.936719] ata4.00: Enabling discard_zeroes_data
[26526.026196] ata4.00: Enabling discard_zeroes_data
[26527.034256] ata4.00: Enabling discard_zeroes_data
[26527.079552] ata8.00: Enabling discard_zeroes_data
...

But that may not have anything to do with this.

Switching back to a 5.8.7 kernel makes the problem go away.

I am filing this under filesystems (btrfs) because I noticed a few 
relevant commits in the changelog that mention the word "lock", but I may 
be completely wrong and another subsystem/commit may be causing this.

I also created a bugzilla entry tracking this issue:

   https://bugzilla.kernel.org/show_bug.cgi?id=209221


* Re: Detecting new partitions fails after "btrfs device scan --forget"
  2020-09-13 18:47 Detecting new partitions fails after "btrfs device scan --forget" Jonas Zeiger
@ 2020-09-14  0:36 ` Anand Jain
       [not found]   ` <e8bdb0c13f5f91b90e75b1a218ded2cb@talpidae.net>
  0 siblings, 1 reply; 3+ messages in thread
From: Anand Jain @ 2020-09-14  0:36 UTC (permalink / raw)
  To: Jonas Zeiger, linux-btrfs



> /dev/sda have been 
> written, but we have been unable to inform the kernel of the change, 
> probably because it/they are in use.  As a result, the old partition(s) 
> will remain in use.  You should reboot now before making further changes.
> 
> The partitions do not become visible so the deployment can't continue.

The forget subcommand does not touch mounted devices; it frees only 
unmounted or unopened btrfs devices from kernel memory.
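
For reference, both invocation forms of the subcommand (a sketch; the 
device name is only an example):

btrfs device scan --forget            # forget all stale, unmounted btrfs devices
btrfs device scan --forget /dev/sda   # forget one specific unmounted device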

The error here suggests that some device is being kept open. Can you find 
out which process did not close it?
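
A few ways to look for a holder (a sketch; which tools are available in 
the initramfs may vary):

fuser -v /dev/sda    # list PIDs that hold the device open, if any
lsof /dev/sda        # the same information via lsof

# Fallback without extra tools: scan /proc for open fds on the device.
for p in /proc/[0-9]*; do
	ls -l "$p"/fd 2>/dev/null | grep -q 'dev/sda' && echo "holder: $p"
done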


* Re: Detecting new partitions fails after "btrfs device scan --forget"
       [not found]   ` <e8bdb0c13f5f91b90e75b1a218ded2cb@talpidae.net>
@ 2020-09-18 11:05     ` Jonas Zeiger
  0 siblings, 0 replies; 3+ messages in thread
From: Jonas Zeiger @ 2020-09-18 11:05 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs

The following commit, included in 5.8.10, seems to fix the issue:

commit b730cc810f71f7b2126d390b63b981e744777c35
Author: Christoph Hellwig <hch@lst.de>
Date:   Tue Sep 8 16:15:06 2020 +0200

     block: restore a specific error code in bdev_del_partition

     [ Upstream commit 88ce2a530cc9865a894454b2e40eba5957a60e1a ]

     mdadm relies on the fact that deleting an invalid partition returns
     -ENXIO or -ENOTTY to detect if a block device is a partition or a
     whole device.

     Fixes: 08fc1ab6d748 ("block: fix locking in bdev_del_partition")
     Reported-by: kernel test robot <rong.a.chen@intel.com>
     Signed-off-by: Christoph Hellwig <hch@lst.de>
     Signed-off-by: Jens Axboe <axboe@kernel.dk>
     Signed-off-by: Sasha Levin <sashal@kernel.org>
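
This fits the ENOMEM observation in point 6 of my earlier mail quoted 
below: partprobe deletes any stale partitions via the BLKPG ioctl before 
re-reading the table, and an unexpected error code there is treated as 
"partition in use". One way to inspect the errno on a given kernel (a 
sketch, assuming strace is available):

strace -e trace=ioctl partprobe /dev/sdd 2>&1 | grep BLKPG
# 5.8.8/5.8.9: BLKPG_DEL_PARTITION fails with ENOMEM (unexpected)
# 5.8.10: nonexistent partitions fail with ENXIO again, which is ignored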

On 2020-09-14 13:10, Jonas Zeiger wrote:
> Hi Anand,
> 
> Thank you for your assistance.
> 
> I enabled lock-stats in the kernel and stopped directly after
> initramfs boot from PXE:
> 
>  1. "partprobe" already fails even directly after boot, so my initial
> hunch that "btrfs device scan" is causing it is likely wrong.
> 
>  2. "lsof /dev/sdd" doesn't return anything and there is no mention of
> the device in "lsof" output.
> 
>  3. Nothing from disks is mounted, running only from tmpfs (unpacked
> from initramfs), almost no daemons (even killed smartd).
> 
>  4. "strace partprobe /dev/sdd" STDERR output is attached
> (strace-partprobe-dev-sdd.txt).
> 
>  5. I could not tell from "/proc/lock_stat" which lock could be
> responsible, if any, so I attached the file (lock_stat.txt).
> 
>  6. Partprobe's "ioctl(3, BLKPG, {op=BLKPG_DEL_PARTITION..." returns
> ENOMEM. Maybe I should mention that my kernel doesn't have swap
> (CONFIG_SWAP=n), but I guess ENOMEM means something different in this
> context. Also, why would it reliably work on 5.8.7?
> 
> I have set one host aside to help debug this issue as it prevents us
> from updating the kernel.
> 
> How can I help to further analyze this regression?
> 
> 
> On 2020-09-14 02:36, Anand Jain wrote:
>>> /dev/sda have been written, but we have been unable to inform the 
>>> kernel of the change, probably because it/they are in use.  As a 
>>> result, the old partition(s) will remain in use.  You should reboot 
>>> now before making further changes.
>>> 
>>> The partitions do not become visible so the deployment can't 
>>> continue.
>> 
>> The forget subcommand does not touch mounted devices; it frees only
>> unmounted or unopened btrfs devices from kernel memory.
>> 
>> The error here suggests that some device is being kept open. Can you
>> find out which process did not close it?

