Linux-BTRFS Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH] btrfs: volumes: Allow missing devices to be writeable
@ 2019-08-29  7:17 Qu Wenruo
  2019-09-11 17:17 ` David Sterba
  2019-09-12 10:27 ` Anand Jain
  0 siblings, 2 replies; 5+ messages in thread
From: Qu Wenruo @ 2019-08-29  7:17 UTC (permalink / raw)
  To: linux-btrfs

[BUG]
There is a long existing bug that degraded mounted btrfs can allocate new
SINGLE/DUP chunks on a RAID1 fs:
  #!/bin/bash

  dev1=/dev/test/scratch1
  dev2=/dev/test/scratch2
  mnt=/mnt/btrfs

  umount $mnt &> /dev/null
  umount $dev1 &> /dev/null
  umount $dev2 &> /dev/null

  dmesg -C
  mkfs.btrfs -f -m raid1 -d raid1 $dev1 $dev2

  wipefs -fa $dev2

  mount -o degraded $dev1 $mnt
  btrfs balance start --full $mnt
  umount $mnt
  echo "=== chunk after degraded mount ==="
  btrfs ins dump-tree -t chunk $dev1 | grep stripe_len.*type

The result fs will have chunks with SINGLE and DUP only:
  === chunk after degraded mount ===
                  length 33554432 owner 2 stripe_len 65536 type SYSTEM
                  length 1073741824 owner 2 stripe_len 65536 type DATA
                  length 1073741824 owner 2 stripe_len 65536 type DATA|DUP
                  length 219676672 owner 2 stripe_len 65536 type METADATA|DUP
                  length 33554432 owner 2 stripe_len 65536 type SYSTEM|DUP

This behavior greatly breaks the RAID1 tolerance.

Even with missing device replaced, if the device with DUP/SINGLE chunks
on them get missing, the whole fs can't be mounted RW any more.
And we already have reports that user even can't mount the fs as some
essential tree blocks got written to those DUP chunks.

[CAUSE]
The cause is pretty simple, we treat missing devices as non-writable.
Thus when we need to allocate chunks, we can only fall back to single
device profiles (SINGLE and DUP).

[FIX]
Just consider the missing devices as WRITABLE, so we allocate new chunks
on them to maintain old profiles.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/volumes.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 56f751192a6c..cc30b1fa9306 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -7002,6 +7002,18 @@ static int read_one_dev(struct extent_buffer *leaf,
 
 	fill_device_from_item(leaf, dev_item, device);
 	set_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
+
+	/*
+	 * We treat missing devices as writable, so that we can maintain
+	 * the existing profiles without degrading to DUP/SINGLE.
+	 */
+	if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state)) {
+		set_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);
+		list_add(&device->dev_alloc_list,
+			 &fs_devices->alloc_list);
+		fs_devices->rw_devices++;
+	}
+
 	if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) &&
 	   !test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state)) {
 		device->fs_devices->total_rw_bytes += device->total_bytes;
-- 
2.23.0


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] btrfs: volumes: Allow missing devices to be writeable
  2019-08-29  7:17 [PATCH] btrfs: volumes: Allow missing devices to be writeable Qu Wenruo
@ 2019-09-11 17:17 ` David Sterba
  2019-09-11 23:39   ` Qu Wenruo
  2019-09-12 10:27 ` Anand Jain
  1 sibling, 1 reply; 5+ messages in thread
From: David Sterba @ 2019-09-11 17:17 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, Aug 29, 2019 at 03:17:31PM +0800, Qu Wenruo wrote:
> [BUG]
> There is a long existing bug that degraded mounted btrfs can allocate new
> SINGLE/DUP chunks on a RAID1 fs:
>   #!/bin/bash
> 
>   dev1=/dev/test/scratch1
>   dev2=/dev/test/scratch2
>   mnt=/mnt/btrfs
> 
>   umount $mnt &> /dev/null
>   umount $dev1 &> /dev/null
>   umount $dev2 &> /dev/null
> 
>   dmesg -C
>   mkfs.btrfs -f -m raid1 -d raid1 $dev1 $dev2
> 
>   wipefs -fa $dev2
> 
>   mount -o degraded $dev1 $mnt
>   btrfs balance start --full $mnt
>   umount $mnt
>   echo "=== chunk after degraded mount ==="
>   btrfs ins dump-tree -t chunk $dev1 | grep stripe_len.*type
> 
> The result fs will have chunks with SINGLE and DUP only:
>   === chunk after degraded mount ===
>                   length 33554432 owner 2 stripe_len 65536 type SYSTEM
>                   length 1073741824 owner 2 stripe_len 65536 type DATA
>                   length 1073741824 owner 2 stripe_len 65536 type DATA|DUP
>                   length 219676672 owner 2 stripe_len 65536 type METADATA|DUP
>                   length 33554432 owner 2 stripe_len 65536 type SYSTEM|DUP
> 
> This behavior greatly breaks the RAID1 tolerance.
> 
> Even with missing device replaced, if the device with DUP/SINGLE chunks
> on them get missing, the whole fs can't be mounted RW any more.
> And we already have reports that user even can't mount the fs as some
> essential tree blocks got written to those DUP chunks.
> 
> [CAUSE]
> The cause is pretty simple, we treat missing devices as non-writable.
> Thus when we need to allocate chunks, we can only fall back to single
> device profiles (SINGLE and DUP).
> 
> [FIX]
> Just consider the missing devices as WRITABLE, so we allocate new chunks
> on them to maintain old profiles.

I'm not sure this is the best way to fix it, it makes the meaning of
rw_devices ambiguous. A missing device is by definition not readable nor
writeable.

This should be tracked separatelly, ie. counting real devices that can
be written and devices that can be considered for allocation (with a
documented meaning that even missing devices are included).

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] btrfs: volumes: Allow missing devices to be writeable
  2019-09-11 17:17 ` David Sterba
@ 2019-09-11 23:39   ` Qu Wenruo
  0 siblings, 0 replies; 5+ messages in thread
From: Qu Wenruo @ 2019-09-11 23:39 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs

[-- Attachment #1.1: Type: text/plain, Size: 2368 bytes --]



On 2019/9/12 上午1:17, David Sterba wrote:
> On Thu, Aug 29, 2019 at 03:17:31PM +0800, Qu Wenruo wrote:
>> [BUG]
>> There is a long existing bug that degraded mounted btrfs can allocate new
>> SINGLE/DUP chunks on a RAID1 fs:
>>   #!/bin/bash
>>
>>   dev1=/dev/test/scratch1
>>   dev2=/dev/test/scratch2
>>   mnt=/mnt/btrfs
>>
>>   umount $mnt &> /dev/null
>>   umount $dev1 &> /dev/null
>>   umount $dev2 &> /dev/null
>>
>>   dmesg -C
>>   mkfs.btrfs -f -m raid1 -d raid1 $dev1 $dev2
>>
>>   wipefs -fa $dev2
>>
>>   mount -o degraded $dev1 $mnt
>>   btrfs balance start --full $mnt
>>   umount $mnt
>>   echo "=== chunk after degraded mount ==="
>>   btrfs ins dump-tree -t chunk $dev1 | grep stripe_len.*type
>>
>> The result fs will have chunks with SINGLE and DUP only:
>>   === chunk after degraded mount ===
>>                   length 33554432 owner 2 stripe_len 65536 type SYSTEM
>>                   length 1073741824 owner 2 stripe_len 65536 type DATA
>>                   length 1073741824 owner 2 stripe_len 65536 type DATA|DUP
>>                   length 219676672 owner 2 stripe_len 65536 type METADATA|DUP
>>                   length 33554432 owner 2 stripe_len 65536 type SYSTEM|DUP
>>
>> This behavior greatly breaks the RAID1 tolerance.
>>
>> Even with missing device replaced, if the device with DUP/SINGLE chunks
>> on them get missing, the whole fs can't be mounted RW any more.
>> And we already have reports that user even can't mount the fs as some
>> essential tree blocks got written to those DUP chunks.
>>
>> [CAUSE]
>> The cause is pretty simple, we treat missing devices as non-writable.
>> Thus when we need to allocate chunks, we can only fall back to single
>> device profiles (SINGLE and DUP).
>>
>> [FIX]
>> Just consider the missing devices as WRITABLE, so we allocate new chunks
>> on them to maintain old profiles.
> 
> I'm not sure this is the best way to fix it, it makes the meaning of
> rw_devices ambiguous. A missing device is by definition not readable nor
> writeable.
> 
> This should be tracked separatelly, ie. counting real devices that can
> be written and devices that can be considered for allocation (with a
> documented meaning that even missing devices are included).
> 
Indeed this sounds much better.

I'd go that direction.

Thanks,
Qu


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] btrfs: volumes: Allow missing devices to be writeable
  2019-08-29  7:17 [PATCH] btrfs: volumes: Allow missing devices to be writeable Qu Wenruo
  2019-09-11 17:17 ` David Sterba
@ 2019-09-12 10:27 ` Anand Jain
  2019-09-12 10:32   ` WenRuo Qu
  1 sibling, 1 reply; 5+ messages in thread
From: Anand Jain @ 2019-09-12 10:27 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs



There is previous work [1]

[1]
https://lore.kernel.org/linux-btrfs/1461812780-538-1-git-send-email-anand.jain@oracle.com/


I guess it was on purpose that missing device is not part of
alloc chunk, so to have lesser impact due to writehole bug.
My target is to fix the writehole first, and then this and
other bugs.

> [FIX]
> Just consider the missing devices as WRITABLE, so we allocate new chunks
> on them to maintain old profiles.

  IMO. In a 3-disks raid1 when one of the disk fails, we still
  need the _new writes_ not to be degraded. Just use two available
  disks. This fix fails that idea which is being followed now.

Thanks, Anand

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] btrfs: volumes: Allow missing devices to be writeable
  2019-09-12 10:27 ` Anand Jain
@ 2019-09-12 10:32   ` WenRuo Qu
  0 siblings, 0 replies; 5+ messages in thread
From: WenRuo Qu @ 2019-09-12 10:32 UTC (permalink / raw)
  To: Anand Jain, linux-btrfs



On 2019/9/12 下午6:27, Anand Jain wrote:
> 
> 
> There is previous work [1]
> 
> [1]
> https://lore.kernel.org/linux-btrfs/1461812780-538-1-git-send-email-anand.jain@oracle.com/
> 
> 
> 
> I guess it was on purpose that missing device is not part of
> alloc chunk, so to have lesser impact due to writehole bug.
> My target is to fix the writehole first, and then this and
> other bugs.
> 
>> [FIX]
>> Just consider the missing devices as WRITABLE, so we allocate new chunks
>> on them to maintain old profiles.
> 
>  IMO. In a 3-disks raid1 when one of the disk fails, we still
>  need the _new writes_ not to be degraded. Just use two available
>  disks. This fix fails that idea which is being followed now.

So priority comes, use existing, and only when it's impossible, consider
degraded/missing device?

> 
> Thanks, Anand
> 

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, back to index

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-29  7:17 [PATCH] btrfs: volumes: Allow missing devices to be writeable Qu Wenruo
2019-09-11 17:17 ` David Sterba
2019-09-11 23:39   ` Qu Wenruo
2019-09-12 10:27 ` Anand Jain
2019-09-12 10:32   ` WenRuo Qu

Linux-BTRFS Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-btrfs/0 linux-btrfs/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-btrfs linux-btrfs/ https://lore.kernel.org/linux-btrfs \
		linux-btrfs@vger.kernel.org linux-btrfs@archiver.kernel.org
	public-inbox-index linux-btrfs


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-btrfs


AGPL code for this site: git clone https://public-inbox.org/ public-inbox