Linux-BTRFS Archive on lore.kernel.org
* "kernel BUG" and segmentation fault with "device delete"
@ 2019-07-05  4:39 Vladimir Panteleev
  2019-07-05  7:01 ` Vladimir Panteleev
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Vladimir Panteleev @ 2019-07-05  4:39 UTC (permalink / raw)
  To: linux-btrfs

Hi,

I'm trying to convert a data=RAID10,metadata=RAID1 (4 disks) array to 
RAID1 (2 disks). The array was less than half full, and I disconnected 
two parity drives, leaving two that contained one copy of all data.

After stubbing out btrfs_check_rw_degradable (because btrfs currently 
can't realize when it has all drives needed for RAID10), I've 
successfully mounted rw+degraded, balance-converted all RAID10 data to 
RAID1, and then btrfs-device-delete-d one of the missing drives. It 
fails at deleting the second.

The process reached a point where the last missing device shows as 
containing 20 GB of RAID1 metadata. At this point, attempting to delete 
the device causes the operation to shortly fail with "No space left", 
followed by a "kernel BUG at fs/btrfs/relocation.c:2499!", and the 
"btrfs device delete" command to crash with a segmentation fault.

Here is the information about the filesystem:

https://dump.thecybershadow.net/55d558b4d0a59643e24c6b4ee9019dca/04%3A28%3A23-upload.txt

And here is the dmesg output (with enospc_debug):

https://dump.thecybershadow.net/9d3811b85d078908141a30886df8894c/04%3A28%3A53-upload.txt

Attempting to unmount the filesystem causes another warning:

https://dump.thecybershadow.net/6d6f2353cd07cd8464ece7e4df90816e/04%3A30%3A30-upload.txt

The umount command then hangs indefinitely.

Linux 5.1.15-arch1-1-ARCH, btrfs-progs v5.1.1

-- 
Best regards,
  Vladimir


* Re: "kernel BUG" and segmentation fault with "device delete"
  2019-07-05  4:39 "kernel BUG" and segmentation fault with "device delete" Vladimir Panteleev
@ 2019-07-05  7:01 ` Vladimir Panteleev
  2019-07-05  9:42 ` Andrei Borzenkov
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 17+ messages in thread
From: Vladimir Panteleev @ 2019-07-05  7:01 UTC (permalink / raw)
  To: linux-btrfs

On 05/07/2019 04.39, Vladimir Panteleev wrote:
> The process reached a point where the last missing device shows as 
> containing 20 GB of RAID1 metadata. At this point, attempting to delete 
> the device causes the operation to shortly fail with "No space left", 
> followed by a "kernel BUG at fs/btrfs/relocation.c:2499!", and the 
> "btrfs device delete" command to crash with a segmentation fault.

Same effect if I try to use btrfs-replace on the missing device (which 
works) and then try to delete it (which fails in the same way).

Also same effect if I try to balance the metadata (balance start -v -m 
/mnt/a).

At this point this doesn't look like it is at all related to RAID10 or 
btrfs_check_rw_degradable, just a bug somewhere with handling something 
weird in the filesystem.

I'm out of ideas, so suggestions welcome. :)

-- 
Best regards,
  Vladimir


* Re: "kernel BUG" and segmentation fault with "device delete"
  2019-07-05  4:39 "kernel BUG" and segmentation fault with "device delete" Vladimir Panteleev
  2019-07-05  7:01 ` Vladimir Panteleev
@ 2019-07-05  9:42 ` Andrei Borzenkov
  2019-07-05 10:20   ` Vladimir Panteleev
  2019-07-05 21:43 ` Chris Murphy
  2019-07-06  5:01 ` Qu Wenruo
  3 siblings, 1 reply; 17+ messages in thread
From: Andrei Borzenkov @ 2019-07-05  9:42 UTC (permalink / raw)
  To: Vladimir Panteleev; +Cc: Btrfs BTRFS

On Fri, Jul 5, 2019 at 7:45 AM Vladimir Panteleev
<thecybershadow@gmail.com> wrote:
>
> Hi,
>
> I'm trying to convert a data=RAID10,metadata=RAID1 (4 disks) array to
> RAID1 (2 disks). The array was less than half full, and I disconnected
> two parity drives,

btrfs does not have dedicated parity drives; it is quite possible that
some chunks had their mirror pieces on these two drives, meaning you
effectively induced data loss. You had to perform "btrfs device
delete" *first*, then disconnect unused drive after this process has
completed.

I suspect at this point the only possibility to salvage data is "btrfs restore".

> leaving two that contained one copy of all data.
>
> After stubbing out btrfs_check_rw_degradable (because btrfs currently
> can't realize when it has all drives needed for RAID10), I've
> successfully mounted rw+degraded, balance-converted all RAID10 data to
> RAID1, and then btrfs-device-delete-d one of the missing drives. It
> fails at deleting the second.
>
> The process reached a point where the last missing device shows as
> containing 20 GB of RAID1 metadata. At this point, attempting to delete
> the device causes the operation to shortly fail with "No space left",
> followed by a "kernel BUG at fs/btrfs/relocation.c:2499!", and the
> "btrfs device delete" command to crash with a segmentation fault.
>
> Here is the information about the filesystem:
>
> https://dump.thecybershadow.net/55d558b4d0a59643e24c6b4ee9019dca/04%3A28%3A23-upload.txt
>
> And here is the dmesg output (with enospc_debug):
>
> https://dump.thecybershadow.net/9d3811b85d078908141a30886df8894c/04%3A28%3A53-upload.txt
>
> Attempting to unmount the filesystem causes another warning:
>
> https://dump.thecybershadow.net/6d6f2353cd07cd8464ece7e4df90816e/04%3A30%3A30-upload.txt
>
> The umount command then hangs indefinitely.
>
> Linux 5.1.15-arch1-1-ARCH, btrfs-progs v5.1.1
>
> --
> Best regards,
>   Vladimir


* Re: "kernel BUG" and segmentation fault with "device delete"
  2019-07-05  9:42 ` Andrei Borzenkov
@ 2019-07-05 10:20   ` Vladimir Panteleev
  2019-07-05 21:48     ` Chris Murphy
  0 siblings, 1 reply; 17+ messages in thread
From: Vladimir Panteleev @ 2019-07-05 10:20 UTC (permalink / raw)
  To: Andrei Borzenkov; +Cc: Btrfs BTRFS

On 05/07/2019 09.42, Andrei Borzenkov wrote:
> On Fri, Jul 5, 2019 at 7:45 AM Vladimir Panteleev
> <thecybershadow@gmail.com> wrote:
>>
>> Hi,
>>
>> I'm trying to convert a data=RAID10,metadata=RAID1 (4 disks) array to
>> RAID1 (2 disks). The array was less than half full, and I disconnected
>> two parity drives,
> 
> btrfs does not have dedicated parity drives; it is quite possible that
> some chunks had their mirror pieces on these two drives, meaning you
> effectively induced data loss. You had to perform "btrfs device
> delete" *first*, then disconnect unused drive after this process has
> completed.

Hi Andrei,

Thank you for replying. However, I'm pretty sure this is not the case as 
you describe it, and in fact, unrelated to the actual problem I'm having.

- I can access all the data on the volumes just fine.

- All the RAID10 block profiles had been successfully converted to 
RAID1. Currently, there are no RAID10 blocks left anywhere on the 
filesystem.

- Only the data was in the RAID10 profile. Metadata was and is in RAID1. 
It is also metadata which btrfs cannot move away from the missing device.

If you can propose a test to verify your hypothesis, I'd be happy to 
check. But, as far as my understanding of btrfs allows me to see, your 
conclusion rests on a bad assumption.

Also, IIRC, your suggestion is not applicable. btrfs refuses to remove a 
device from a 4-device filesystem with RAID10 blocks, as that would put 
it under the minimum number of devices for RAID10 blocks. I think the 
"correct" approach would be first to convert all RAID10 blocks to RAID1 
and only then remove the devices, however, this was not an option for me 
due to other constraints I was working under at the time.

-- 
Best regards,
  Vladimir


* Re: "kernel BUG" and segmentation fault with "device delete"
  2019-07-05  4:39 "kernel BUG" and segmentation fault with "device delete" Vladimir Panteleev
  2019-07-05  7:01 ` Vladimir Panteleev
  2019-07-05  9:42 ` Andrei Borzenkov
@ 2019-07-05 21:43 ` Chris Murphy
  2019-07-06  0:05   ` Vladimir Panteleev
  2019-07-06  5:01 ` Qu Wenruo
  3 siblings, 1 reply; 17+ messages in thread
From: Chris Murphy @ 2019-07-05 21:43 UTC (permalink / raw)
  To: Vladimir Panteleev; +Cc: Btrfs BTRFS

On Thu, Jul 4, 2019 at 10:39 PM Vladimir Panteleev
<thecybershadow@gmail.com> wrote:
>
> Hi,
>
> I'm trying to convert a data=RAID10,metadata=RAID1 (4 disks) array to
> RAID1 (2 disks). The array was less than half full, and I disconnected
> two parity drives, leaving two that contained one copy of all data.

There's no parity on either raid10 or raid1. But I can't tell from the
above exactly when each drive was disconnected. In this scenario you
need to convert to raid1 first, wait for that to complete successfully
before you can do a device remove. That's clear.  Also clear is you
must use 'btrfs device remove' and it must complete before that device
is disconnected.

What I've never tried, but the man page implies, is you can specify
two devices at one time for 'btrfs device remove' if the profile and
the number of devices permits it. So the exact order and commands
you used are really important for understanding the problem and the
solution, including whether there might be a bug.


>
> After stubbing out btrfs_check_rw_degradable (because btrfs currently
> can't realize when it has all drives needed for RAID10),

Uhh? This implies it was still raid10 when you disconnected two drives
of a four drive raid10. That's definitely data loss territory.
However, your 'btrfs fi us' command suggests only raid1 chunks. What
I'm suspicious of is this:

>>Data,RAID1: Size:2.66TiB, Used:2.66TiB
>>  /dev/sdd1   2.66TiB
>>  /dev/sdf1   2.66TiB

All data block groups are only on sdf1 and sdd1.

>>Metadata,RAID1: Size:57.00GiB, Used:52.58GiB
>>   /dev/sdd1  57.00GiB
>>  /dev/sdf1  37.00GiB
>>   missing  20.00GiB

There's metadata still on one of the missing devices. You need to
physically reconnect this device. The device removal did not complete
before this device was physically disconnected.

>> System,RAID1: Size:8.00MiB, Used:416.00KiB
>>   /dev/sdd1   8.00MiB
>>   missing   8.00MiB

This is actually worse, potentially because it means there's only one
copy of the system chunk on sdd1. It has not been replicated to sdf1,
but is on the missing device. So it definitely sounds like the missing
device was physically removed before the 'device remove' command finished.
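
A minimal sketch of how the allocation listing quoted above could be
checked mechanically for chunks that still sit on a missing device. It
assumes the text layout shown in the 'btrfs fi us' excerpt; the function
names are hypothetical and not part of btrfs-progs.

```python
import re

# Sample text copied from the 'btrfs fi us' excerpt quoted in this thread.
USAGE = """\
Data,RAID1: Size:2.66TiB, Used:2.66TiB
   /dev/sdd1   2.66TiB
   /dev/sdf1   2.66TiB

Metadata,RAID1: Size:57.00GiB, Used:52.58GiB
   /dev/sdd1  57.00GiB
   /dev/sdf1  37.00GiB
   missing  20.00GiB

System,RAID1: Size:8.00MiB, Used:416.00KiB
   /dev/sdd1   8.00MiB
   missing   8.00MiB
"""

HEADER = re.compile(r"^(\w+),(\w+): Size:")   # e.g. "Metadata,RAID1: Size:..."
DEVICE = re.compile(r"^\s+(\S+)\s+(\S+)$")    # e.g. "   /dev/sdd1  57.00GiB"


def allocations(text):
    """Map 'Type,Profile' to {device: allocated size string}."""
    result, current = {}, None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            key = f"{header.group(1)},{header.group(2)}"
            current = result.setdefault(key, {})
            continue
        device = DEVICE.match(line)
        if device and current is not None:
            current[device.group(1)] = device.group(2)
    return result


def on_missing(text):
    """Return the block group types that still have space on 'missing'."""
    return {
        key: devs["missing"]
        for key, devs in allocations(text).items()
        if "missing" in devs
    }


print(on_missing(USAGE))
```

For the excerpt above this reports the Metadata and System block groups,
and not Data, as still allocated on the missing device.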

Depending on degraded operation for this task is the wrong strategy.
You needed to 'btrfs device delete/remove' before physically
disconnecting these drives.


> I've
> successfully mounted rw+degraded, balance-converted all RAID10 data to
> RAID1, and then btrfs-device-delete-d one of the missing drives. It
> fails at deleting the second.

OK you definitely did this incorrectly if you're expecting to
disconnect two devices at the same time, and then "btrfs device delete
missing" instead of explicitly deleting drives by ID before you
physically disconnect them.

It sounds to me like you had a successful conversion from 4 disk
raid10 to a 4 disk raid1. But then you're assuming there are
sufficient copies of all data and metadata on each drive. That is not
the case with Btrfs. The drives are not mirrored. The block groups are
mirrored. Btrfs raid1 tolerates exactly 1 device loss. Not two.
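
To illustrate the point that block groups, not drives, are mirrored,
here is a small sketch with a made-up chunk layout on four devices. The
device names and the layout are hypothetical; a real filesystem decides
placement per chunk at allocation time.

```python
from itertools import combinations

# Hypothetical layout: every RAID1 chunk keeps its two copies on two
# distinct devices, and different chunks may use different device pairs.
chunks = {
    "chunk-A": {"sda", "sdb"},
    "chunk-B": {"sdc", "sdd"},
    "chunk-C": {"sda", "sdc"},
    "chunk-D": {"sdb", "sdd"},
}


def lost_chunks(chunks, removed):
    """Return the chunks whose every copy lives on a removed device."""
    removed = set(removed)
    return sorted(name for name, devs in chunks.items() if devs <= removed)


devices = sorted(set().union(*chunks.values()))

# Losing any single device always leaves one copy of every chunk.
assert all(not lost_chunks(chunks, [dev]) for dev in devices)

# Losing two devices is only safe if no chunk used exactly that pair.
for pair in combinations(devices, 2):
    print(pair, "->", lost_chunks(chunks, pair) or "all chunks still readable")
```

In this made-up layout some two-device losses destroy a chunk and others
happen to leave everything readable, which is why surviving such a loss
is a matter of luck rather than a guarantee.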


-- 
Chris Murphy


* Re: "kernel BUG" and segmentation fault with "device delete"
  2019-07-05 10:20   ` Vladimir Panteleev
@ 2019-07-05 21:48     ` Chris Murphy
  2019-07-05 22:04       ` Chris Murphy
  0 siblings, 1 reply; 17+ messages in thread
From: Chris Murphy @ 2019-07-05 21:48 UTC (permalink / raw)
  To: Vladimir Panteleev; +Cc: Andrei Borzenkov, Btrfs BTRFS

On Fri, Jul 5, 2019 at 4:20 AM Vladimir Panteleev
<thecybershadow@gmail.com> wrote:
>
> On 05/07/2019 09.42, Andrei Borzenkov wrote:
> > On Fri, Jul 5, 2019 at 7:45 AM Vladimir Panteleev
> > <thecybershadow@gmail.com> wrote:
> >>
> >> Hi,
> >>
> >> I'm trying to convert a data=RAID10,metadata=RAID1 (4 disks) array to
> >> RAID1 (2 disks). The array was less than half full, and I disconnected
> >> two parity drives,
> >
> > btrfs does not have dedicated parity drives; it is quite possible that
> > some chunks had their mirror pieces on these two drives, meaning you
> > effectively induced data loss. You had to perform "btrfs device
> > delete" *first*, then disconnect unused drive after this process has
> > completed.
>
> Hi Andrei,
>
> Thank you for replying. However, I'm pretty sure this is not the case as
> you describe it, and in fact, unrelated to the actual problem I'm having.
>
> - I can access all the data on the volumes just fine.
>
> - All the RAID10 block profiles had been successfully converted to
> RAID1. Currently, there are no RAID10 blocks left anywhere on the
> filesystem.
>
> - Only the data was in the RAID10 profile. Metadata was and is in RAID1.
> It is also metadata which btrfs cannot move away from the missing device.
>
> If you can propose a test to verify your hypothesis, I'd be happy to
> check. But, as far as my understanding of btrfs allows me to see, your
> conclusion rests on a bad assumption.
>
> Also, IIRC, your suggestion is not applicable. btrfs refuses to remove a
> device from a 4-device filesystem with RAID10 blocks, as that would put
> it under the minimum number of devices for RAID10 blocks. I think the
> "correct" approach would be first to convert all RAID10 blocks to RAID1
> and only then remove the devices, however, this was not an option for me
> due to other constraints I was working under at the time.

We need to see a list of commands issued in order, along with the
physical connected state of each drive. I thought I understood what
you did from the previous email, but this paragraph contradicts my
understanding, especially when you say "correct approach would be
first to convert all RAID 10 to RAID1 and then remove devices but that
wasn't an option"

OK so what did you do, in order, each command, interleaving the
physical device removals.


Thanks,


-- 
Chris Murphy


* Re: "kernel BUG" and segmentation fault with "device delete"
  2019-07-05 21:48     ` Chris Murphy
@ 2019-07-05 22:04       ` Chris Murphy
  0 siblings, 0 replies; 17+ messages in thread
From: Chris Murphy @ 2019-07-05 22:04 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Vladimir Panteleev, Andrei Borzenkov, Btrfs BTRFS

On Fri, Jul 5, 2019 at 3:48 PM Chris Murphy <lists@colorremedies.com> wrote:
>
> We need to see a list of commands issued in order, along with the
> physical connected state of each drive. I thought I understood what
> you did from the previous email, but this paragraph contradicts my
> understanding, especially when you say "correct approach would be
> first to convert all RAID 10 to RAID1 and then remove devices but that
> wasn't an option"
>
> OK so what did you do, in order, each command, interleaving the
> physical device removals.


To put a really fine point on this, in my estimation the data on sdf
and sdd are hanging by a thread. It's likely you have partial data
loss because clearly some Btrfs metadata is missing and there are no
other copies of it. For sure you do not want to try to repair the file
system, or do anything that will cause any writes to happen. Any
changes to the file system now will almost certainly make problems
worse, and the chance of recovery will be reduced.

#1: Reconnect the last missing device. Obviously Btrfs wants it
because there's necessary metadata on that drive that doesn't exist on
sdf or sdd.
#2: Only mount the volume ro from here on out. Even a rw mount might
make things worse.
#3: Rsync copy everything you care about off this volume onto a new
volume. At some point this rsync will fail when Btrfs discovers the
missing metadata, which is why you really need to follow step #1. But
if that missing drive is already wiped and retasked for some other
purpose, you're looking at a data recovery operation. Not a file
system repair operation.

Your best chance to avoid total data loss is to get as much off the
volume as you can while you still can. And then after that see if you
can get more off of it - which I think is doubtful.


-- 
Chris Murphy


* Re: "kernel BUG" and segmentation fault with "device delete"
  2019-07-05 21:43 ` Chris Murphy
@ 2019-07-06  0:05   ` Vladimir Panteleev
  2019-07-06  2:38     ` Chris Murphy
  0 siblings, 1 reply; 17+ messages in thread
From: Vladimir Panteleev @ 2019-07-06  0:05 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

Hi Chris,

First, thank you very much for taking the time to reply. I greatly 
appreciate it.

On 05/07/2019 21.43, Chris Murphy wrote:
> There's no parity on either raid10 or raid1.

Right, thank you for the correction. Of course, I meant the duplicate 
copies of the RAID1 data.

> But I can't tell from the
> above exactly when each drive was disconnected. In this scenario you
> need to convert to raid1 first, wait for that to complete successfully
> before you can do a device remove. That's clear.  Also clear is you
> must use 'btrfs device remove' and it must complete before that device
> is disconnected.

Unfortunately as mentioned before that wasn't an option. I was 
performing this operation on a DM snapshot target backed by a file that 
certainly could not fit the result of a RAID10-to-RAID1 rebalance.

> What I've never tried, but the man page implies, is you can specify
> two devices at one time for 'btrfs device remove' if the profile and
> the number of devices permits it.

What I found surprising, was that "btrfs device delete missing" deletes 
exactly one device, instead of all missing devices. But, that might be 
simply because a device with RAID10 blocks should not have been 
mountable rw with two missing drives in the first place.

>> After stubbing out btrfs_check_rw_degradable (because btrfs currently
>> can't realize when it has all drives needed for RAID10),
> 
> Uhh? This implies it was still raid10 when you disconnected two drives
> of a four drive raid10. That's definitely data loss territory.
> However, your 'btrfs fi us' command suggests only raid1 chunks. What
> I'm suspicious of is this:
> 
>>> Data,RAID1: Size:2.66TiB, Used:2.66TiB
>>>   /dev/sdd1   2.66TiB
>>>   /dev/sdf1   2.66TiB
> 
> All data block groups are only on sdf1 and sdd1.
> 
>>> Metadata,RAID1: Size:57.00GiB, Used:52.58GiB
>>>    /dev/sdd1  57.00GiB
>>>   /dev/sdf1  37.00GiB
>>>    missing  20.00GiB
> 
> There's metadata still on one of the missing devices. You need to
> physically reconnect this device. The device removal did not complete
> before this device was physically disconnected.
> 
>>> System,RAID1: Size:8.00MiB, Used:416.00KiB
>>>    /dev/sdd1   8.00MiB
>>>    missing   8.00MiB
> 
> This is actually worse, potentially because it means there's only one
> copy of the system chunk on sdd1. It has not been replicated to sdf1,
> but is on the missing device.

I'm sorry, but that's not right. As I mentioned in my second email, if I 
use btrfs device replace, then it successfully rebuilds all missing 
data. So, there is no lost data with no remaining copies; btrfs is 
simply having some trouble moving it off of that device.

Here is the filesystem info with a loop device replacing the missing drive:

https://dump.thecybershadow.net/9a0c88c3720c55bcf7fee98630c2a8e1/00%3A02%3A17-upload.txt

> Depending on degraded operation for this task is the wrong strategy.
> You needed to 'btrfs device delete/remove' before physically
> disconnecting these drives.
> 
> OK you definitely did this incorrectly if you're expecting to
> disconnect two devices at the same time, and then "btrfs device delete
> missing" instead of explicitly deleting drives by ID before you
> physically disconnect them.

I don't disagree in general, however, I did make sure that all data was 
accessible with two devices before proceeding with this endeavor.

> It sounds to me like you had a successful conversion from 4 disk
> raid10 to a 4 disk raid1. But then you're assuming there are
> sufficient copies of all data and metadata on each drive. That is not
> the case with Btrfs. The drives are not mirrored. The block groups are
> mirrored. Btrfs raid1 tolerates exactly 1 device loss. Not two.

Whether it was through dumb luck or just due to the series of steps I've 
happened to have taken, it doesn't look like I've lost any data so far. 
But thank you for the correction regarding how RAID1 works in btrfs, 
I've indeed been misunderstanding it.

>> We need to see a list of commands issued in order, along with the
>> physical connected state of each drive. I thought I understood what
>> you did from the previous email, but this paragraph contradicts my
>> understanding, especially when you say "correct approach would be
>> first to convert all RAID 10 to RAID1 and then remove devices but that
>> wasn't an option"
>>
>> OK so what did you do, in order, each command, interleaving the
>> physical device removals.

Well, at this point, I'm still quite confident that the BTRFS kernel bug 
is unrelated to this entire RAID10 thing, but I'll do so if you like. 
Unfortunately I do not have an exact record of this, but I can do my 
best to reconstruct it from memory.

The reason I'm doing this in the first place is that I'm trying to split 
a 4-drive RAID10 array that was getting full. The goal was to move some 
data off of it to a new array, then delete it from its original 
location. I couldn't use rsync because most of the data was in 
snapshots, and I couldn't use btrfs send/receive because it bugs out 
with the old "chown oXXX-XXXXXXX-0 failed: No such file or directory" 
bug. So, my idea was:

1. Use device mapper to create a COW copy of all four devices, and 
operate on those (make the SATA devices read-only to ensure they're not 
touched)
2. Use btrfstune to change the UUID of the new filesystem
3. Delete 75%-ish of data off of the COW copy
4. Somehow convert the 4-disk RAID10 to 2-disk RAID1 without incurring a 
ton of writes to the COW copies
5. dd the contents of the COW copies to two new real disks
6. After ensuring the remaining data is safe on the new disks, delete it 
from the original array.

For steps 2 and 3, I needed to specify the exact devices to work with. 
It's possible to specify the device list when mounting with -o device=, 
but for btrfstune, I had to bind-mount a fake partitions file over 
/proc/partitions. I can share the scripts I used for all this if you like.

Anyway, step 4 is the one I was having trouble with. After a few failed 
approaches, what I did was:
1. Find a pair of disks that had a copy of all data (there was no 
guarantee that this was possible, but it did happen to be for me)
2. dd them to real disks
3. mount them as rw+degraded with a kernel patch
4. btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/a
5. sudo btrfs device delete missing /mnt/a
6. sudo btrfs device delete missing /mnt/a

and that's when the kernel bug happens.
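
The "find a pair of disks that had a copy of all data" step can be
sketched as follows. This is only a model: the chunk map is made up, the
function names are hypothetical, and it assumes that a btrfs RAID10 chunk
on four devices consists of two mirror groups of two devices each, with
the device order free to differ from chunk to chunk.

```python
from itertools import combinations

# Hypothetical chunk map. Each chunk is a list of mirror groups, and each
# group names the two devices that hold identical stripes (assumption).
chunks = [
    [("sda", "sdb"), ("sdc", "sdd")],
    [("sdb", "sda"), ("sdd", "sdc")],
    [("sda", "sdc"), ("sdb", "sdd")],
]


def holds_full_copy(chunks, present):
    """True if every mirror group of every chunk has a device in 'present'."""
    present = set(present)
    return all(
        any(dev in present for dev in group)
        for chunk in chunks
        for group in chunk
    )


def complete_pairs(chunks):
    """List every two-device subset that still holds a copy of all data."""
    devices = sorted({dev for chunk in chunks for group in chunk for dev in group})
    return [pair for pair in combinations(devices, 2) if holds_full_copy(chunks, pair)]


print(complete_pairs(chunks))
```

With this layout only some pairs qualify, and with a less fortunate
layout the list can be empty, which matches the caveat in step 1 that
such a pair is not guaranteed to exist.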

> To put a really fine point on this, in my estimation the data on sdf
> and sdd are hanging by a thread. It's likely you have partial data
> loss because clearly some Btrfs metadata is missing and there are no
> other copies of it. For sure you do not want to try to repair the file
> system, or do anything that will cause any writes to happen. Any
> changes to the file system now will almost certainly make problems
> worse, and the chance of recovery will be reduced.

Thank you for the concern. As mentioned above, I'm pretty sure all the 
data is safe and accessible. Also, I have a copy of all data of the 
filesystem still.

Have you had a chance to look at the kernel stack trace yet? It looks 
like it's running out of temporary space to perform a relocation. I 
think that is where we should be concentrating on.

-- 
Best regards,
  Vladimir


* Re: "kernel BUG" and segmentation fault with "device delete"
  2019-07-06  0:05   ` Vladimir Panteleev
@ 2019-07-06  2:38     ` Chris Murphy
  2019-07-06  3:37       ` Vladimir Panteleev
  0 siblings, 1 reply; 17+ messages in thread
From: Chris Murphy @ 2019-07-06  2:38 UTC (permalink / raw)
  To: Vladimir Panteleev; +Cc: Btrfs BTRFS, Qu Wenruo

On Fri, Jul 5, 2019 at 6:05 PM Vladimir Panteleev
<thecybershadow@gmail.com> wrote:

> On 05/07/2019 21.43, Chris Murphy wrote:

> > But I can't tell from the
> > above exactly when each drive was disconnected. In this scenario you
> > need to convert to raid1 first, wait for that to complete successfully
> > before you can do a device remove. That's clear.  Also clear is you
> > must use 'btrfs device remove' and it must complete before that device
> > is disconnected.
>
> Unfortunately as mentioned before that wasn't an option. I was
> performing this operation on a DM snapshot target backed by a file that
> certainly could not fit the result of a RAID10-to-RAID1 rebalance.

Then the total operation isn't possible. Maybe you could have made the
volume a seed, and then create a single device sprout on a new single
target, and later convert that sprout to raid1. But I'm not sure of
the state of multiple device seeds.


>
> > What I've never tried, but the man page implies, is you can specify
> > two devices at one time for 'btrfs device remove' if the profile and
> > the number of devices permits it.
>
> What I found surprising, was that "btrfs device delete missing" deletes
> exactly one device, instead of all missing devices. But, that might be
> simply because a device with RAID10 blocks should not have been
> mountable rw with two missing drives in the first place.

It's a really good question for developers if there is a good reason
to permit rw mount of a volume that's missing two or more devices for
raid 1, 10, or 5; and missing three or more for raid6. I cannot think
of a good reason to allow degraded,rw mounts for a raid10 missing two
devices.
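
The limits mentioned above can be written down as a simple count-based
check. This is a simplified model for illustration, assuming the
tolerances stated in this message; it is not the kernel's implementation
of btrfs_check_rw_degradable, and the names are hypothetical.

```python
# Assumption: the maximum number of missing devices each profile tolerates,
# taken from the message above (one for raid1, raid10 and raid5; two for raid6).
TOLERATED_MISSING = {"raid1": 1, "raid10": 1, "raid5": 1, "raid6": 2}


def rw_degraded_allowed(chunks):
    """chunks: iterable of (profile, number of that chunk's devices missing).

    A purely count-based rule: refuse a writable degraded mount as soon as
    any chunk is missing more devices than its profile tolerates.
    """
    return all(missing <= TOLERATED_MISSING[profile] for profile, missing in chunks)


# One device gone from a raid10 data / raid1 metadata filesystem.
print(rw_degraded_allowed([("raid10", 1), ("raid1", 1)]))

# Two devices gone: refused by the count, even if the two surviving devices
# happen to hold one copy of every stripe.
print(rw_degraded_allowed([("raid10", 2), ("raid1", 1)]))
```

A count-based rule like this cannot tell a lucky two-device loss from a
fatal one, which is the situation described earlier in the thread.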


> > This is actually worse, potentially because it means there's only one
> > copy of the system chunk on sdd1. It has not been replicated to sdf1,
> > but is on the missing device.
>
> I'm sorry, but that's not right. As I mentioned in my second email, if I
> use btrfs device replace, then it successfully rebuilds all missing
> data. So, there is no lost data with no remaining copies; btrfs is
> simply having some trouble moving it off of that device.
>
> Here is the filesystem info with a loop device replacing the missing drive:
>
> https://dump.thecybershadow.net/9a0c88c3720c55bcf7fee98630c2a8e1/00%3A02%3A17-upload.txt

Wow that's really interesting. So you did 'btrfs replace start' for
one of the missing drive devid's, with a loop device as the
replacement, and that worked and finished?!

Does this three device volume mount rw and not degraded? I guess it
must have because 'btrfs fi us' worked on it.

        devid    1 size 7.28TiB used 2.71TiB path /dev/sdd1
        devid    2 size 7.28TiB used 22.01GiB path /dev/loop0
        devid    3 size 7.28TiB used 2.69TiB path /dev/sdf1

OK so what happens now if you try to 'btrfs device remove /dev/loop0' ?


>
> > Depending on degraded operation for this task is the wrong strategy.
> > You needed to 'btrfs device delete/remove' before physically
> > disconnecting these drives.
> >
> > OK you definitely did this incorrectly if you're expecting to
> > disconnect two devices at the same time, and then "btrfs device delete
> > missing" instead of explicitly deleting drives by ID before you
> > physically disconnect them.
>
> I don't disagree in general, however, I did make sure that all data was
> accessible with two devices before proceeding with this endeavor.

Well there's definitely something screwy if Btrfs needs something on a
missing drive, which is indicated by its refusal to remove it from the
volume, and yet at the same time it's possible to e.g. rsync every file
to /dev/null without any errors. That's a bug somewhere.


> >> OK so what did you do, in order, each command, interleaving the
> >> physical device removals.
>
> Well, at this point, I'm still quite confident that the BTRFS kernel bug
> is unrelated to this entire RAID10 thing, but I'll do so if you like.
> Unfortunately I do not have an exact record of this, but I can do my
> best to reconstruct it from memory.

I'm not a developer but a dev very well might need to have a simple
reproducer for this in order to locate the problem. But the call trace
might tell them what they need to know. I'm not sure.


>
> The reason I'm doing this in the first place is that I'm trying to split
> a 4-drive RAID10 array that was getting full. The goal was to move some
> data off of it to a new array, then delete it from its original
> location. I couldn't use rsync because most of the data was in
> snapshots, and I couldn't use btrfs send/receive because it bugs out
> with the old "chown oXXX-XXXXXXX-0 failed: No such file or directory"
> bug. So, my idea was:

I'm not familiar with that bug. That sounds like a receive side bug
not a send side bug. I wonder if receive will continue if you use the
-E 0 option, and the result will just be wrong owner on a few files.


>
> 1. Use device mapper to create a COW copy of all four devices, and
> operate on those (make the SATA devices read-only to ensure they're not
> touched)
> 2. Use btrfstune to change the UUID of the new filesystem
> 3. Delete 75%-ish of data off of the COW copy
> 4. Somehow convert the 4-disk RAID10 to 2-disk RAID1 without incurring a
> ton of writes to the COW copies
> 5. dd the contents of the COW copies to two new real disks
> 6. After ensuring the remaining data is safe on the new disks, delete it
> from the original array.
>
> For steps 2 and 3, I needed to specify the exact devices to work with.
> It's possible to specify the device list when mounting with -o device=,
> but for btrfstune, I had to bind-mount a fake partitions file over
> /proc/partitions. I can share the scripts I used for all this if you like.

No, it's fine.

> Have you had a chance to look at the kernel stack trace yet? It looks
> like it's running out of temporary space to perform a relocation. I
> think that is where we should be concentrating on.

I've looked at it but I can't really follow it. The comments in the
code don't really tell me much either other than Btrfs is confused,
and so you're seeing the warning and then error -28. It may really be
running out of global reserve for this operation, I can't really tell.

Qu will understand this better.

-- 
Chris Murphy


* Re: "kernel BUG" and segmentation fault with "device delete"
  2019-07-06  2:38     ` Chris Murphy
@ 2019-07-06  3:37       ` Vladimir Panteleev
  2019-07-06 17:36         ` Chris Murphy
  0 siblings, 1 reply; 17+ messages in thread
From: Vladimir Panteleev @ 2019-07-06  3:37 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS, Qu Wenruo

On 06/07/2019 02.38, Chris Murphy wrote:
> On Fri, Jul 5, 2019 at 6:05 PM Vladimir Panteleev
> <thecybershadow@gmail.com> wrote:
>> Unfortunately as mentioned before that wasn't an option. I was
>> performing this operation on a DM snapshot target backed by a file that
>> certainly could not fit the result of a RAID10-to-RAID1 rebalance.
> 
> Then the total operation isn't possible. Maybe you could have made the
> volume a seed, and then create a single device sprout on a new single
> target, and later convert that sprout to raid1. But I'm not sure of
> the state of multiple device seeds.

That's an interesting idea, thanks; I'll be sure to explore it if I run 
into this situation again.

>> What I found surprising, was that "btrfs device delete missing" deletes
>> exactly one device, instead of all missing devices. But, that might be
>> simply because a device with RAID10 blocks should not have been
>> mountable rw with two missing drives in the first place.
> 
> It's a really good question for developers if there is a good reason
> to permit rw mount of a volume that's missing two or more devices for
> raid 1, 10, or 5; and missing three or more for raid6. I cannot think
> of a good reason to allow degraded,rw mounts for a raid10 missing two
> devices.

Sorry, the code currently indeed does not permit mounting a RAID10 
filesystem with more than one missing device in rw. I needed to patch my 
kernel to force it to allow it, as I was working on the assumption that 
the two remaining drives contained a copy of all data (which turned out 
to be true).

> Wow that's really interesting. So you did 'btrfs replace start' for
> one of the missing drive devid's, with a loop device as the
> replacement, and that worked and finished?!

Yes, that's right.

> Does this three device volume mount rw and not degraded? I guess it
> must have because 'btrfs fi us' worked on it.
> 
>          devid    1 size 7.28TiB used 2.71TiB path /dev/sdd1
>          devid    2 size 7.28TiB used 22.01GiB path /dev/loop0
>          devid    3 size 7.28TiB used 2.69TiB path /dev/sdf1

Indeed - with the loop device attached, I can mount the filesystem rw 
just fine without any mount flags, with a stock kernel.

> OK so what happens now if you try to 'btrfs device remove /dev/loop0' ?

Unfortunately it fails in the same way (warning followed by "kernel 
BUG"). The same thing happens if I try to rebalance the metadata.

> Well there's definitely something screwy if Btrfs needs something on a
> missing drive, which is indicated by its refusal to remove it from the
> volume, and yet at same time it's possible to e.g. rsync every file to
> /dev/null without any errors. That's a bug somewhere.

As I understand, I don't think it actually "needs" any data from that 
device, it's just having trouble updating some metadata as it tries to 
move one redundant copy of the data from there to somewhere else. It's 
not refusing to remove the device either, rather it tries and fails at 
doing so.

> I'm not a developer but a dev very well might need to have a simple
> reproducer for this in order to locate the problem. But the call trace
> might tell them what they need to know. I'm not sure.

What I'm going to try to do next is to create another COW layer on top 
of the three devices I have, attach them to a virtual machine, and boot 
that (as it's not fun to reboot the physical machine each time the code 
crashes). Then I could maybe poke the related kernel code to try to 
understand the problem better.

-- 
Best regards,
  Vladimir


* Re: "kernel BUG" and segmentation fault with "device delete"
  2019-07-05  4:39 "kernel BUG" and segmentation fault with "device delete" Vladimir Panteleev
                   ` (2 preceding siblings ...)
  2019-07-05 21:43 ` Chris Murphy
@ 2019-07-06  5:01 ` Qu Wenruo
  2019-07-06  5:13   ` Vladimir Panteleev
  3 siblings, 1 reply; 17+ messages in thread
From: Qu Wenruo @ 2019-07-06  5:01 UTC (permalink / raw)
  To: Vladimir Panteleev, linux-btrfs




On 2019/7/5 12:39 PM, Vladimir Panteleev wrote:
> Hi,
> 
> I'm trying to convert a data=RAID10,metadata=RAID1 (4 disks) array to
> RAID1 (2 disks). The array was less than half full, and I disconnected
> two parity drives, leaving two that contained one copy of all data.

Definitely not something I would even try.

I'd convert first, then delete one device, then delete the other. It's
slower, but it should work properly.

> 
> After stubbing out btrfs_check_rw_degradable (because btrfs currently
> can't realize when it has all drives needed for RAID10),

The point is, btrfs_check_rw_degradable() already does per-chunk
rw-degradable checking.

I would strongly recommend not commenting out the function completely.
It has been a long (well, not that long) way from the old fs-level
tolerance check to the current per-chunk tolerance check.

I totally understand that RAID10 can lose at most half of its stripes,
as long as one device remains for each substripe.
If you really want RAID10 to tolerate more missing devices, please
implement a proper per-chunk stripe check.
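
To make the condition above concrete, here is a minimal sketch in
Python of a per-chunk check for RAID10. It is only a model, not the
kernel's btrfs_check_rw_degradable(): the function names, the
representation of a chunk as an ordered list of devids, and the
assumption that each run of sub_stripes consecutive stripes mirrors the
same data are all illustrative.

```python
from typing import Iterable, List, Set


def raid10_chunk_readable(stripe_devids: List[int],
                          missing: Set[int],
                          sub_stripes: int = 2) -> bool:
    """Return True if every mirror group of one chunk keeps a present device.

    Hypothetical model: stripe_devids lists the devid backing each stripe,
    and each run of `sub_stripes` consecutive entries holds the same data.
    """
    if sub_stripes <= 0 or len(stripe_devids) % sub_stripes != 0:
        raise ValueError("stripe count must be a multiple of sub_stripes")
    for start in range(0, len(stripe_devids), sub_stripes):
        group = stripe_devids[start:start + sub_stripes]
        if all(devid in missing for devid in group):
            return False  # every copy of this substripe is gone
    return True


def fs_rw_degradable(chunks: Iterable[List[int]], missing: Set[int]) -> bool:
    """The filesystem is usable only if every single chunk passes the check."""
    return all(raid10_chunk_readable(chunk, missing) for chunk in chunks)
```

With devids 2 and 4 missing, a chunk laid out as [1, 2, 3, 4] passes in
this model, while one laid out as [1, 3, 2, 4] does not. That is why the
decision has to be made per chunk rather than per filesystem.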

> I've
> successfully mounted rw+degraded, balance-converted all RAID10 data to
> RAID1, and then btrfs-device-delete-d one of the missing drives. It
> fails at deleting the second.
> 
> The process reached a point where the last missing device shows as
> containing 20 GB of RAID1 metadata. At this point, attempting to delete
> the device causes the operation to shortly fail with "No space left",
> followed by a "kernel BUG at fs/btrfs/relocation.c:2499!", and the
> "btrfs device delete" command to crash with a segmentation fault.
> 
> Here is the information about the filesystem:
> 
> https://dump.thecybershadow.net/55d558b4d0a59643e24c6b4ee9019dca/04%3A28%3A23-upload.txt

The fs should have enough space to allocate a new metadata chunk (it's
the metadata chunks that lack space and cause the ENOSPC).

I'm not sure whether the degraded mount is causing the problem, as the
enospc_debug output looks like reserved/pinned/over-reserved space has
taken up all the space, while no new chunk gets allocated.

Would you please try to balance metadata to see if the ENOSPC still happens?

Thanks,
Qu

> 
> 
> And here is the dmesg output (with enospc_debug):
> 
> https://dump.thecybershadow.net/9d3811b85d078908141a30886df8894c/04%3A28%3A53-upload.txt
> 
> 
> Attempting to unmount the filesystem causes another warning:
> 
> https://dump.thecybershadow.net/6d6f2353cd07cd8464ece7e4df90816e/04%3A30%3A30-upload.txt
> 
> 
> The umount command then hangs indefinitely.
> 
> Linux 5.1.15-arch1-1-ARCH, btrfs-progs v5.1.1
> 



* Re: "kernel BUG" and segmentation fault with "device delete"
  2019-07-06  5:01 ` Qu Wenruo
@ 2019-07-06  5:13   ` Vladimir Panteleev
  2019-07-06  5:51     ` Qu Wenruo
  0 siblings, 1 reply; 17+ messages in thread
From: Vladimir Panteleev @ 2019-07-06  5:13 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs


On 06/07/2019 05.01, Qu Wenruo wrote:
>> After stubbing out btrfs_check_rw_degradable (because btrfs currently
>> can't realize when it has all drives needed for RAID10),
> 
> The point is, btrfs_check_rw_degradable() already does per-chunk
> rw-degradable checking.
> 
> I would strongly recommend not commenting out the function completely.
> It has been a long (well, not that long) way from the old fs-level
> tolerance check to the current per-chunk tolerance check.

Very grateful for this :)

> I totally understand that RAID10 can lose at most half of its stripes,
> as long as one device remains for each substripe.
> If you really want RAID10 to tolerate more missing devices, please
> implement a proper per-chunk stripe check.

This was my understanding of the situation as well; in any case, it was 
a temporary patch just so I could rebalance the RAID10 blocks to RAID1.

> The fs should have enough space to allocate a new metadata chunk (it's
> the metadata chunks that lack space and cause the ENOSPC).
> 
> I'm not sure whether the degraded mount is causing the problem, as the
> enospc_debug output looks like reserved/pinned/over-reserved space has
> taken up all the space, while no new chunk gets allocated.

The problem happens after replace-ing the missing device (which succeeds 
in full) and then attempting to remove it, i.e. without a degraded mount.

> Would you please try to balance metadata to see if the ENOSPC still happens?

The problem also manifests when attempting to rebalance the metadata.

Thanks!

-- 
Best regards,
  Vladimir



* Re: "kernel BUG" and segmentation fault with "device delete"
  2019-07-06  5:13   ` Vladimir Panteleev
@ 2019-07-06  5:51     ` Qu Wenruo
  2019-07-06 15:09       ` Vladimir Panteleev
  2019-07-20 10:59       ` Vladimir Panteleev
  0 siblings, 2 replies; 17+ messages in thread
From: Qu Wenruo @ 2019-07-06  5:51 UTC (permalink / raw)
  To: Vladimir Panteleev, linux-btrfs




On 2019/7/6 1:13 PM, Vladimir Panteleev wrote:
[...]
>> I'm not sure whether the degraded mount is causing the problem, as the
>> enospc_debug output looks like reserved/pinned/over-reserved space has
>> taken up all the space, while no new chunk gets allocated.
> 
> The problem happens after replace-ing the missing device (which succeeds
> in full) and then attempting to remove it, i.e. without a degraded mount.
> 
>> Would you please try to balance metadata to see if the ENOSPC still
>> happens?
> 
> The problem also manifests when attempting to rebalance the metadata.

Have you tried to balance just one or two metadata block groups?
E.g using -mdevid or -mvrange?

And did the problem always happen at the same block group?

Thanks,
Qu
> 
> Thanks!
> 



* Re: "kernel BUG" and segmentation fault with "device delete"
  2019-07-06  5:51     ` Qu Wenruo
@ 2019-07-06 15:09       ` Vladimir Panteleev
  2019-07-20 10:59       ` Vladimir Panteleev
  1 sibling, 0 replies; 17+ messages in thread
From: Vladimir Panteleev @ 2019-07-06 15:09 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs


On 06/07/2019 05.51, Qu Wenruo wrote:
>> The problem also manifests when attempting to rebalance the metadata.
> 
> Have you tried to balance just one or two metadata block groups?
> E.g using -mdevid or -mvrange?

If I use -mdevid with the device ID of the device I'm trying to remove 
(2), I see the crash.

If I use -mvrange with a range covering one byte past a problematic 
virtual address, I see the crash.

Not sure if you had anything else / specific in mind. Here is a log of 
my experiments so far:

https://dump.thecybershadow.net/da241fb4b6e743b01a7e9f8734f70d6e/scratch.txt

> And did the problem always happen at the same block group?

Upon reviewing my logs, it looks like for "device remove" it always 
tries to move block group 1998263943168 first, upon which it crashes.

For "balance", it seems to vary - looks like there is at least one other 
problematic block group at 48009543942144.
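
For anyone repeating these experiments, the two filtered balances can be
sketched as a small Python helper that only builds the command line and
does not run anything. The helper name and the mount point are
hypothetical. The devid and vrange filters are the ones documented for
"btrfs balance start"; passing start..start+1 is my reading of "a range
covering one byte past" the address, not something taken from the logs.

```python
from typing import List, Optional


def balance_metadata_cmd(mountpoint: str,
                         devid: Optional[int] = None,
                         vrange_start: Optional[int] = None,
                         vrange_len: int = 1) -> List[str]:
    """Build (but do not run) a filtered `btrfs balance start` command.

    devid        -> -mdevid=<id>: metadata block groups with a stripe on
                    that device
    vrange_start -> -mvrange=<start>..<end>: metadata block groups that
                    overlap this range of the logical address space
    """
    filters = []
    if devid is not None:
        filters.append(f"devid={devid}")
    if vrange_start is not None:
        # Assumption: a one-byte-wide range is enough to select the block
        # group that contains vrange_start.
        filters.append(f"vrange={vrange_start}..{vrange_start + vrange_len}")
    if not filters:
        raise ValueError("refusing to build an unfiltered metadata balance")
    return ["btrfs", "balance", "start", "-m" + ",".join(filters), mountpoint]


# Hypothetical mount point; the block group address is the one named above.
cmd = balance_metadata_cmd("/mnt/archive", vrange_start=1998263943168)
print(" ".join(cmd))
```

Keeping each run down to a single block group makes it easier to tell
whether the crash follows one particular block group or varies between
runs.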

Happy to do more experiments or test kernel patches. I have a VM set up 
with a COW view of the devices, so I can do destructive tests too.

-- 
Best regards,
  Vladimir



* Re: "kernel BUG" and segmentation fault with "device delete"
  2019-07-06  3:37       ` Vladimir Panteleev
@ 2019-07-06 17:36         ` Chris Murphy
  0 siblings, 0 replies; 17+ messages in thread
From: Chris Murphy @ 2019-07-06 17:36 UTC (permalink / raw)
  To: Vladimir Panteleev; +Cc: Btrfs BTRFS, Qu Wenruo

On Fri, Jul 5, 2019 at 9:38 PM Vladimir Panteleev
<thecybershadow@gmail.com> wrote:
>
> On 06/07/2019 02.38, Chris Murphy wrote:
> > It's a really good question for developers if there is a good reason
> > to permit rw mount of a volume that's missing two or more devices for
> > raid 1, 10, or 5; and missing three or more for raid6. I cannot think
> > of a good reason to allow degraded,rw mounts for a raid10 missing two
> > devices.
>
> Sorry, the code currently indeed does not permit mounting a RAID10
> filesystem with more than one missing device in rw. I needed to patch my
> kernel to force it to allow it, as I was working on the assumption that
> the two remaining drives contained a copy of all data (which turned out
> to be true).

Oh gotcha. I glossed over that. Ahh yeah, so we're kinda back to end
user sabotage in that case. :-)

The thing about Btrfs is that it has very little predefined on-disk
layout. The only things explicitly assigned locations are the
superblocks. The super points to the start of the root tree and chunk
tree, and those can start literally anywhere. When block groups are
mirrored, which devices they appear on, and the physical location on
each device, are also not consistent.

In other words, you could do this test a bunch of times and get
different results, and as the file system ages the layout becomes even
more non-deterministic. The likelihood of some data loss when losing
two devices on a raid10 very quickly approaches 100%.
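
A rough way to see why the odds get so bad is the toy model below, in
Python. It assumes, purely for illustration, that every RAID10 chunk on
a four-device volume pairs its devices into mirrors independently and
uniformly at random. The real allocator does not work that way, so the
numbers only show the trend. All names here are hypothetical.

```python
def pairings(devices):
    """Yield every way to split an even-sized device list into mirror pairs."""
    if not devices:
        yield []
        return
    first, rest = devices[0], devices[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for tail in pairings(remaining):
            yield [(first, partner)] + tail


def chunk_survival_probability(devices, lost):
    """Fraction of pairings in which no mirror pair is lost entirely."""
    lost = set(lost)
    all_pairings = list(pairings(list(devices)))
    surviving = sum(1 for pairing in all_pairings
                    if not any(set(pair) <= lost for pair in pairing))
    return surviving / len(all_pairings)


def fs_survival_probability(n_chunks, devices=(1, 2, 3, 4), lost=(2, 4)):
    """Assumes each chunk picks its pairing independently and uniformly."""
    return chunk_survival_probability(devices, lost) ** n_chunks
```

Under this model a single chunk survives the loss of two devices in two
of the three possible pairings, and the chance that every chunk survives
shrinks geometrically with the number of chunks.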



>
> > Wow that's really interesting. So you did 'btrfs replace start' for
> > one of the missing drive devid's, with a loop device as the
> > replacement, and that worked and finished?!
>
> Yes, that's right.

I suspect you got lucky. There's every reason to believe that in a
repeat scenario you could end up with raid1 block groups that exist
only on the two missing devices.


>
> > Does this three device volume mount rw and not degraded? I guess it
> > must have because 'btrfs fi us' worked on it.
> >
> >          devid    1 size 7.28TiB used 2.71TiB path /dev/sdd1
> >          devid    2 size 7.28TiB used 22.01GiB path /dev/loop0
> >          devid    3 size 7.28TiB used 2.69TiB path /dev/sdf1
>
> Indeed - with the loop device attached, I can mount the filesystem rw
> just fine without any mount flags, with a stock kernel.
>
> > OK so what happens now if you try to 'btrfs device remove /dev/loop0' ?
>
> Unfortunately it fails in the same way (warning followed by "kernel
> BUG"). The same thing happens if I try to rebalance the metadata.

That seems like a legitimate bug even if the way you got to this point
is sorta screwy and definitely an edge case.


>
> > Well there's definitely something screwy if Btrfs needs something on a
> > missing drive, which is indicated by its refusal to remove it from the
> > volume, and yet at same time it's possible to e.g. rsync every file to
> > /dev/null without any errors. That's a bug somewhere.
>
> As I understand, I don't think it actually "needs" any data from that
> device, it's just having trouble updating some metadata as it tries to
> move one redundant copy of the data from there to somewhere else. It's
> not refusing to remove the device either, rather it tries and fails at
> doing so.

I think the developers would say anytime the user space tools permit
an action that results in a kernel warning, it's a bug. The priority
of fixing that bug will of course depend on the likelihood of users
running into it, and the scope of the fix, and the resources required.



>
> > I'm not a developer but a dev very well might need to have a simple
> > reproducer for this in order to locate the problem. But the call trace
> > might tell them what they need to know. I'm not sure.
>
> What I'm going to try to do next is to create another COW layer on top
> of the three devices I have, attach them to a virtual machine, and boot
> that (as it's not fun to reboot the physical machine each time the code
> crashes). Then I could maybe poke the related kernel code to try to
> understand the problem better.

I don't really understand the code, but then also I don't know what's
happening as it tries to remove the device and what logical problems
Btrfs is running into that eventually causes the warning. It might be
there's already confusion with on-disk metadata.

Btrfs debugging isn't enabled in default kernels; it's vaguely
possible that enabling it would reveal more information. The integrity
checker can be incredibly verbose, so verbose that you definitely do
not want to be writing a persistent kernel message log to the same
Btrfs file system you're checking. The integrity checker also isn't
enabled in distro kernels. It's both a compile-time option and a
mount-time option (separate for metadata only and with data
checking). But I can't give any advice on which mask options might
help reveal what's going on and where Btrfs gets tripped up. It does
look like it's related to the global reserve, which is something of a
misnomer. It's not some separate thing; it's really space within a
metadata block group.

What still would be interesting is if there's a way to reproduce this
layout, where user space tools permit device removal but then the
kernel splats with this warning.

-- 
Chris Murphy


* Re: "kernel BUG" and segmentation fault with "device delete"
  2019-07-06  5:51     ` Qu Wenruo
  2019-07-06 15:09       ` Vladimir Panteleev
@ 2019-07-20 10:59       ` Vladimir Panteleev
  2019-08-08 20:40         ` Vladimir Panteleev
  1 sibling, 1 reply; 17+ messages in thread
From: Vladimir Panteleev @ 2019-07-20 10:59 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

Hi,

I've done a few experiments and here are my findings.

First I probably should describe the filesystem: it is a snapshot 
archive, containing a lot of snapshots for 4 subvolumes, totaling 2487 
subvolumes/snapshots. There are also a few files (inside the snapshots) 
that are probably very fragmented. This is probably what causes the bug.

Observations:

- If I delete all snapshots, the bug disappears (device delete succeeds).
- If I delete all but any single subvolume's snapshots, the bug disappears.
- If I delete the snapshots of either of two particular subvolumes, the
bug disappears; if I instead delete the snapshots of either of the other
two subvolumes, the bug remains.

So it looks like data in the snapshots of two specific subvolumes
participates in causing the bug.

In theory, I guess it would be possible to reduce the filesystem to the 
minimal one causing the bug by iteratively deleting snapshots / files 
and checking if the bug manifests, but it would be extremely 
time-consuming, probably requiring weeks.
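
The iterative reduction described above can be sketched in Python as a
greedy loop. This is only an outline of the idea: bug_reproduces is a
hypothetical callback that would have to reset the COW layer, delete
everything except the given snapshots, and retry the device removal. The
loop also assumes the bug is monotonic, i.e. that it still reproduces
whenever all of the required snapshots are present.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def reduce_to_minimal(items: Sequence[T],
                      bug_reproduces: Callable[[List[T]], bool]) -> List[T]:
    """Greedy one-at-a-time reduction: drop an item if the bug still shows."""
    kept = list(items)
    if not bug_reproduces(kept):
        raise ValueError("bug does not reproduce on the full set")
    i = 0
    while i < len(kept):
        candidate = kept[:i] + kept[i + 1:]
        if bug_reproduces(candidate):
            kept = candidate  # item i was not needed for the bug
        else:
            i += 1            # item i is needed; keep it and move on
    return kept
```

This needs one test run per item plus one for the full set, so with a
couple of thousand snapshots and a slow test it would indeed take a long
time. A bisection-style variant that drops large groups of snapshots at
once could cut the number of runs when most snapshots are irrelevant.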

Anything else I can do to help diagnose / fix it? Or should I just order 
more HDDs and clone the RAID10 the right way?

On 06/07/2019 05.51, Qu Wenruo wrote:
> 
> 
> On 2019/7/6 1:13 PM, Vladimir Panteleev wrote:
> [...]
>>> I'm not sure whether the degraded mount is causing the problem, as the
>>> enospc_debug output looks like reserved/pinned/over-reserved space has
>>> taken up all the space, while no new chunk gets allocated.
>>
>> The problem happens after replace-ing the missing device (which succeeds
>> in full) and then attempting to remove it, i.e. without a degraded mount.
>>
>>> Would you please try to balance metadata to see if the ENOSPC still
>>> happens?
>>
>> The problem also manifests when attempting to rebalance the metadata.
> 
> Have you tried to balance just one or two metadata block groups?
> E.g using -mdevid or -mvrange?
> 
> And did the problem always happen at the same block group?
> 
> Thanks,
> Qu
>>
>> Thanks!
>>
> 

-- 
Best regards,
  Vladimir


* Re: "kernel BUG" and segmentation fault with "device delete"
  2019-07-20 10:59       ` Vladimir Panteleev
@ 2019-08-08 20:40         ` Vladimir Panteleev
  0 siblings, 0 replies; 17+ messages in thread
From: Vladimir Panteleev @ 2019-08-08 20:40 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

Did more digging today. Here is where the -ENOSPC is coming from:

btrfs_run_delayed_refs ->          // WARN here
__btrfs_run_delayed_refs ->
btrfs_run_delayed_refs_for_head ->
run_one_delayed_ref ->
run_delayed_data_ref ->
__btrfs_inc_extent_ref ->
insert_extent_backref ->
insert_extent_data_ref ->
btrfs_insert_empty_item ->
btrfs_insert_empty_items ->
btrfs_search_slot ->
split_leaf ->
alloc_tree_block_no_bg_flush ->
btrfs_alloc_tree_block ->
use_block_rsv ->
block_rsv_use_bytes / reserve_metadata_bytes

In use_block_rsv, first block_rsv_use_bytes (with the 
BTRFS_BLOCK_RSV_DELREFS one) fails, then reserve_metadata_bytes fails, 
then block_rsv_use_bytes with global_rsv fails again.
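
To make that fallback order easier to follow, here is a toy model of it
in Python. It mirrors only the three steps listed above, in that order.
The classes, fields and byte counts are invented for illustration and
are not the kernel's data structures.

```python
import errno
from dataclasses import dataclass


@dataclass
class BlockRsv:
    name: str
    reserved: int = 0

    def use_bytes(self, num_bytes: int) -> bool:
        """Model of block_rsv_use_bytes: take from what is already reserved."""
        if self.reserved < num_bytes:
            return False
        self.reserved -= num_bytes
        return True


@dataclass
class SpaceInfo:
    free: int = 0

    def reserve_metadata_bytes(self, num_bytes: int) -> bool:
        """Model of reserve_metadata_bytes: try to reserve fresh space."""
        if self.free < num_bytes:
            return False
        self.free -= num_bytes
        return True


def use_block_rsv(block_rsv: BlockRsv, global_rsv: BlockRsv,
                  space_info: SpaceInfo, blocksize: int) -> BlockRsv:
    """Return the reservation that paid for the new tree block."""
    if block_rsv.use_bytes(blocksize):                # 1. the caller's own rsv
        return block_rsv
    if space_info.reserve_metadata_bytes(blocksize):  # 2. a new reservation
        return block_rsv
    if global_rsv.use_bytes(blocksize):               # 3. the global reserve
        return global_rsv
    raise OSError(errno.ENOSPC, "no reservation can cover a new tree block")
```

When all three sources come up short, as in the trace above, the model
raises ENOSPC, which corresponds to the point where the real code
returns the error that triggers the warning.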

My understanding of this in plain English is as follows: btrfs attempted 
to finalize a transaction and add the queued backreferences. When doing 
so, it ran out of space in a B-tree, and attempted to allocate a new 
tree block; however, in doing so, it hit the limit it reserved for 
itself for how much space it was going to use during that operation, so 
it gave up on the whole thing, which led everything to go downhill from 
there. Is this anywhere close to being accurate?

BTW, the DELREFS rsv is 0 / 7GB reserved/free. So, it looks like it 
didn't expect to allocate the new tree node at all? Perhaps it should be 
using some other rsv for those?

Am I on the right track, or should I be discussing this elsewhere / with 
someone else?

On 20/07/2019 10.59, Vladimir Panteleev wrote:
> Hi,
> 
> I've done a few experiments and here are my findings.
> 
> First I probably should describe the filesystem: it is a snapshot 
> archive, containing a lot of snapshots for 4 subvolumes, totaling 2487 
> subvolumes/snapshots. There are also a few files (inside the snapshots) 
> that are probably very fragmented. This is probably what causes the bug.
> 
> Observations:
> 
> - If I delete all snapshots, the bug disappears (device delete succeeds).
> - If I delete all but any single subvolume's snapshots, the bug disappears.
> - If I delete the snapshots of either of two particular subvolumes, the
> bug disappears; if I instead delete the snapshots of either of the other
> two subvolumes, the bug remains.
> 
> So it looks like data in the snapshots of two specific subvolumes
> participates in causing the bug.
> 
> In theory, I guess it would be possible to reduce the filesystem to the 
> minimal one causing the bug by iteratively deleting snapshots / files 
> and checking if the bug manifests, but it would be extremely 
> time-consuming, probably requiring weeks.
> 
> Anything else I can do to help diagnose / fix it? Or should I just order 
> more HDDs and clone the RAID10 the right way?
> 
> On 06/07/2019 05.51, Qu Wenruo wrote:
>>
>>
>> On 2019/7/6 1:13 PM, Vladimir Panteleev wrote:
>> [...]
>>>> I'm not sure whether the degraded mount is causing the problem, as the
>>>> enospc_debug output looks like reserved/pinned/over-reserved space has
>>>> taken up all the space, while no new chunk gets allocated.
>>>
>>> The problem happens after replace-ing the missing device (which succeeds
>>> in full) and then attempting to remove it, i.e. without a degraded 
>>> mount.
>>>
>>>> Would you please try to balance metadata to see if the ENOSPC still
>>>> happens?
>>>
>>> The problem also manifests when attempting to rebalance the metadata.
>>
>> Have you tried to balance just one or two metadata block groups?
>> E.g using -mdevid or -mvrange?
>>
>> And did the problem always happen at the same block group?
>>
>> Thanks,
>> Qu
>>>
>>> Thanks!
>>>
>>
> 

-- 
Best regards,
  Vladimir


Thread overview: 17+ messages
2019-07-05  4:39 "kernel BUG" and segmentation fault with "device delete" Vladimir Panteleev
2019-07-05  7:01 ` Vladimir Panteleev
2019-07-05  9:42 ` Andrei Borzenkov
2019-07-05 10:20   ` Vladimir Panteleev
2019-07-05 21:48     ` Chris Murphy
2019-07-05 22:04       ` Chris Murphy
2019-07-05 21:43 ` Chris Murphy
2019-07-06  0:05   ` Vladimir Panteleev
2019-07-06  2:38     ` Chris Murphy
2019-07-06  3:37       ` Vladimir Panteleev
2019-07-06 17:36         ` Chris Murphy
2019-07-06  5:01 ` Qu Wenruo
2019-07-06  5:13   ` Vladimir Panteleev
2019-07-06  5:51     ` Qu Wenruo
2019-07-06 15:09       ` Vladimir Panteleev
2019-07-20 10:59       ` Vladimir Panteleev
2019-08-08 20:40         ` Vladimir Panteleev
