* Removing a failed device - stuck in a loop or normal?
From: Steven Fosdick @ 2019-06-13 23:17 UTC
  To: linux-btrfs

I have a BTRFS volume with four devices, one of which has failed
and is no longer present in the machine.  The volume is mounted in
degraded mode.  I am trying to remove the failed device with:

btrfs device remove missing /data

There should be enough space to consolidate the data onto the three
remaining discs before adding a fourth.  The first few attempts have
failed with errors of the form:

Jun 12 14:54:36 meije kernel: BTRFS info (device sda): relocating
block group 10436799889408 flags data|raid5
Jun 12 14:54:41 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 272 off 519241728 csum 0x9cb8912f expected csum 0x73ba6e2a
mirror 2
Jun 12 14:54:41 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 272 off 519245824 csum 0x98f94189 expected csum 0x4ab823e6
mirror 2
Jun 12 14:54:41 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 272 off 519254016 csum 0xd3f53909 expected csum 0x94ab4db4
mirror 2
Jun 12 14:54:41 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 272 off 519249920 csum 0xcb29eade expected csum 0x65d28b9e
mirror 2
Jun 12 14:54:41 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 272 off 519258112 csum 0x714821f5 expected csum 0xeed771e2
mirror 2
Jun 12 14:54:41 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 272 off 519262208 csum 0x574f1bdc expected csum 0x5a78e046
mirror 2
Jun 12 14:54:41 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 272 off 519266304 csum 0x63ec8641 expected csum 0xcee67afe
mirror 2
Jun 12 14:54:41 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 272 off 519270400 csum 0xb3d8a215 expected csum 0x39db0f0a
mirror 2
Jun 12 14:54:41 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 272 off 519274496 csum 0x910dd641 expected csum 0x3599ad7d
mirror 2
Jun 12 14:54:41 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 272 off 519278592 csum 0xe6ca8bc2 expected csum 0x413d5da7
mirror 2
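
To see which byte range of the inode the warnings above cover, the messages can be parsed and the offsets compared. The following is only a sketch: the function names are hypothetical, the regular expression is written against the message format shown in this thread (other kernel versions may word it differently), and the 4096-byte block size is an assumption about the filesystem's sector size.

```python
import re

# Matches one "csum failed" warning as it appears in this thread. \s+ between
# fields tolerates the line wrapping introduced by the mail client.
CSUM_RE = re.compile(
    r"csum failed\s+root\s+(-?\d+)\s+ino\s+(\d+)\s+off\s+(\d+)\s+"
    r"csum\s+(0x[0-9a-fA-F]+)\s+expected csum\s+(0x[0-9a-fA-F]+)\s+"
    r"mirror\s+(\d+)"
)


def parse_csum_failures(log_text):
    """Return one dict per 'csum failed' warning found in log_text."""
    failures = []
    for m in CSUM_RE.finditer(log_text):
        root, ino, off, got, expected, mirror = m.groups()
        failures.append({
            "root": int(root),
            "ino": int(ino),
            "off": int(off),
            "csum": got,
            "expected": expected,
            "mirror": int(mirror),
        })
    return failures


def summarise(failures, block_size=4096):
    """Report whether the failing offsets form one contiguous run of blocks.

    block_size is an assumed sector size, not something the log states.
    """
    offsets = sorted({f["off"] for f in failures})
    contiguous = all(b - a == block_size for a, b in zip(offsets, offsets[1:]))
    return {
        "blocks": len(offsets),
        "first_off": offsets[0] if offsets else None,
        "last_off": offsets[-1] if offsets else None,
        "contiguous": contiguous,
    }
```

Feeding the journal text to `parse_csum_failures` and then `summarise` shows whether the failures are scattered or confined to one contiguous region of the inode.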

Deleting the files concerned allows it to progress further and the
device remove has been logging messages of the form:

Jun 13 21:14:01 meije kernel: BTRFS info (device sda): relocating
block group 7956456275968 flags data|raid5
Jun 13 21:14:36 meije kernel: BTRFS info (device sda): found 785 extents
Jun 13 21:14:46 meije kernel: BTRFS info (device sda): found 785 extents

The numbers obviously vary but the pattern of those three lines, with
the block group and two identical "found extents" lines, has been
repeating for several hours and the amount of data reported by:

btrfs fi usage /data

as being on the missing disc has been gradually reducing and the
amount on the other three gradually increasing just as I would expect.
Now, however, there is a new pattern:

Jun 13 21:14:54 meije kernel: BTRFS info (device sda): relocating
block group 7955382534144 flags metadata|raid1
Jun 13 21:18:51 meije kernel: BTRFS info (device sda): found 51353 extents
Jun 13 21:19:18 meije kernel: BTRFS info (device sda): found 1 extents
Jun 13 21:19:23 meije kernel: BTRFS info (device sda): found 1 extents
Jun 13 21:19:27 meije kernel: BTRFS info (device sda): found 1 extents
Jun 13 21:19:32 meije kernel: BTRFS info (device sda): found 1 extents
Jun 13 21:19:36 meije kernel: BTRFS info (device sda): found 1 extents
Jun 13 21:19:40 meije kernel: BTRFS info (device sda): found 1 extents
Jun 13 21:19:44 meije kernel: BTRFS info (device sda): found 1 extents
Jun 13 21:19:48 meije kernel: BTRFS info (device sda): found 1 extents
Jun 13 21:19:52 meije kernel: BTRFS info (device sda): found 1 extents

With the last line repeating.  So far there have been 9,347 of the
"found 1 extents" messages with no other BTRFS messages in between.
The amount of data on the missing disc does not seem to be decreasing
now.
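
The pattern described above can be checked mechanically rather than by eye. This is a rough sketch, not a diagnostic tool: the function names and the threshold of 100 are hypothetical, and the patterns are taken from the log lines quoted in this message. It looks at the most recent "relocating block group" line and counts how many identical "found N extents" lines have followed it without a new block group starting.

```python
import re

# Patterns based on the messages quoted in this thread; \s+ tolerates the
# line wrapping of the mail client.
RELOC_RE = re.compile(r"relocating\s+block group\s+(\d+)\s+flags\s+(\S+)")
FOUND_RE = re.compile(r"found\s+(\d+)\s+extents")


def relocation_progress(log_text):
    """Return (block_group, flags, counts) for the latest relocation, or None.

    counts lists the N of every 'found N extents' line logged after the
    most recent 'relocating block group' line.
    """
    last = None
    for m in RELOC_RE.finditer(log_text):
        last = m
    if last is None:
        return None
    counts = [int(n) for n in FOUND_RE.findall(log_text[last.end():])]
    return int(last.group(1)), last.group(2), counts


def trailing_repeats(counts):
    """Length of the run of identical values at the end of counts."""
    run = 0
    for n in reversed(counts):
        if n != counts[-1]:
            break
        run += 1
    return run


def looks_stuck(counts, threshold=100):
    """Heuristic only: many identical trailing counts suggest no progress.

    threshold is an arbitrary assumption; the healthy block groups in this
    thread logged just two 'found' lines each.
    """
    return trailing_repeats(counts) >= threshold
```

A long trailing run for one block group, combined with `btrfs fi usage` no longer changing, is what distinguishes the second pattern from the first.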

Does this seem like normal behaviour, or has it got stuck in an
infinite loop, i.e. is it finding the same extent over and over again?
 What should I do?

Regards,
Steve.


* Re: Removing a failed device - stuck in a loop or normal?
From: Qu Wenruo @ 2019-06-13 23:41 UTC
  To: Steven Fosdick, linux-btrfs


On 2019/6/14 7:17 AM, Steven Fosdick wrote:
> I have a BTRFS volume with four devices, one of which has failed
> and is no longer present in the machine.  The volume is mounted in
> degraded mode.  I am trying to remove the failed device with:
> 
> btrfs device remove missing /data
> 
> There should be enough space to consolidate the data onto the three
> remaining discs before adding a fourth.  The first few attempts have
> failed with errors of the form:
> 
> Jun 12 14:54:36 meije kernel: BTRFS info (device sda): relocating
> block group 10436799889408 flags data|raid5
> Jun 12 14:54:41 meije kernel: BTRFS warning (device sda): csum failed
> root -9 ino 272 off 519241728 csum 0x9cb8912f expected csum 0x73ba6e2a
> mirror 2

That's common if the device is really failing.
Raid5 should re-build the corrupted blocks.

> Jun 12 14:54:41 meije kernel: BTRFS warning (device sda): csum failed
> root -9 ino 272 off 519245824 csum 0x98f94189 expected csum 0x4ab823e6
> mirror 2
> Jun 12 14:54:41 meije kernel: BTRFS warning (device sda): csum failed
> root -9 ino 272 off 519254016 csum 0xd3f53909 expected csum 0x94ab4db4
> mirror 2
> Jun 12 14:54:41 meije kernel: BTRFS warning (device sda): csum failed
> root -9 ino 272 off 519249920 csum 0xcb29eade expected csum 0x65d28b9e
> mirror 2
> Jun 12 14:54:41 meije kernel: BTRFS warning (device sda): csum failed
> root -9 ino 272 off 519258112 csum 0x714821f5 expected csum 0xeed771e2
> mirror 2
> Jun 12 14:54:41 meije kernel: BTRFS warning (device sda): csum failed
> root -9 ino 272 off 519262208 csum 0x574f1bdc expected csum 0x5a78e046
> mirror 2
> Jun 12 14:54:41 meije kernel: BTRFS warning (device sda): csum failed
> root -9 ino 272 off 519266304 csum 0x63ec8641 expected csum 0xcee67afe
> mirror 2
> Jun 12 14:54:41 meije kernel: BTRFS warning (device sda): csum failed
> root -9 ino 272 off 519270400 csum 0xb3d8a215 expected csum 0x39db0f0a
> mirror 2
> Jun 12 14:54:41 meije kernel: BTRFS warning (device sda): csum failed
> root -9 ino 272 off 519274496 csum 0x910dd641 expected csum 0x3599ad7d
> mirror 2
> Jun 12 14:54:41 meije kernel: BTRFS warning (device sda): csum failed
> root -9 ino 272 off 519278592 csum 0xe6ca8bc2 expected csum 0x413d5da7
> mirror 2
> 
> Deleting the files concerned allows it to progress further and the

The corruption is reported against the data reloc tree, so I'm not sure
deleting files would really help.

> device remove has been logging messages of the form:
> 
> Jun 13 21:14:01 meije kernel: BTRFS info (device sda): relocating
> block group 7956456275968 flags data|raid5
> Jun 13 21:14:36 meije kernel: BTRFS info (device sda): found 785 extents
> Jun 13 21:14:46 meije kernel: BTRFS info (device sda): found 785 extents
> 
> The numbers obviously vary but the pattern of those three lines, with
> the block group and two identical "found extents" lines, has been
> repeating for several hours and the amount of data reported by:
> 
> btrfs fi usage /data
> 
> as being on the missing disc has been gradually reducing and the
> amount on the other three gradually increasing just as I would expect.
> Now, however, there is a new pattern:
> 
> Jun 13 21:14:54 meije kernel: BTRFS info (device sda): relocating
> block group 7955382534144 flags metadata|raid1
> Jun 13 21:18:51 meije kernel: BTRFS info (device sda): found 51353 extents
> Jun 13 21:19:18 meije kernel: BTRFS info (device sda): found 1 extents
> Jun 13 21:19:23 meije kernel: BTRFS info (device sda): found 1 extents
> Jun 13 21:19:27 meije kernel: BTRFS info (device sda): found 1 extents
> Jun 13 21:19:32 meije kernel: BTRFS info (device sda): found 1 extents
> Jun 13 21:19:36 meije kernel: BTRFS info (device sda): found 1 extents
> Jun 13 21:19:40 meije kernel: BTRFS info (device sda): found 1 extents
> Jun 13 21:19:44 meije kernel: BTRFS info (device sda): found 1 extents
> Jun 13 21:19:48 meije kernel: BTRFS info (device sda): found 1 extents
> Jun 13 21:19:52 meije kernel: BTRFS info (device sda): found 1 extents
> 
> With the last line repeating.  So far there have been 9,347 of the
> "found 1 extents" messages with no other BTRFS messages in between.
> The amount of data on the missing disc does not seem to be decreasing
> now.
> 
> Does this seem like normal behaviour, or has it got stuck in an
> infinite loop, i.e. is it finding the same extent over and over again?

Looks like a dead loop.

Would you please provide the kernel version?

And have you tried cancelling the current balance and starting a new one?

Thanks,
Qu

>  What should I do?
> 
> Regards,
> Steve.
> 



* Fwd: Removing a failed device - stuck in a loop or normal?
From: Steven Fosdick @ 2019-06-14  0:18 UTC
  To: linux-btrfs

On Fri, 14 Jun 2019 at 00:41, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> That's common if the device is really failing.
> Raid5 should re-build the corrupted blocks.

That's strange then, because there are a few files that are
unreadable, i.e. attempting to read them with normal programs like
'cp' gives I/O error.

> Looks like a dead loop.
>
> Would you please provide the kernel version?

Linux meije 5.1.9-arch1-1-ARCH #1 SMP PREEMPT

> And have you tried cancelling the current balance and starting a new one?

How do I cancel it?  I started "btrfs device remove missing /data"
with nohup.  That ssh session is no longer active and issuing SIGTERM
to that process doesn't cause it to die, which is not surprising since
looking at the source it seems to execute the whole device remove from
within the kernel during a single ioctl call.
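
As general Linux background, a signal is only acted on when the task returns to user space or when the kernel code it is running checks for pending signals, so a process inside a long-running ioctl can appear to ignore SIGTERM. One way to look at what such a process is doing is to read its state from /proc. The sketch below is Linux-only and the function names are hypothetical; it shows how to read the state field, and says nothing about which state the `btrfs` process in this thread was actually in.

```python
from pathlib import Path


def parse_stat_line(line):
    """Split a /proc/<pid>/stat line into (pid, comm, state).

    comm is wrapped in parentheses and may itself contain spaces or
    parentheses, so split on the LAST ')' rather than on whitespace.
    """
    open_paren = line.index("(")
    close_paren = line.rindex(")")
    pid = int(line[:open_paren].strip())
    comm = line[open_paren + 1:close_paren]
    # The first field after comm is the one-letter state, e.g.
    # R = running, S = interruptible sleep, D = uninterruptible sleep.
    state = line[close_paren + 1:].split()[0]
    return pid, comm, state


def read_state(pid):
    """Read and parse /proc/<pid>/stat for a live process (Linux only)."""
    return parse_stat_line(Path(f"/proc/{pid}/stat").read_text())
```

Calling `read_state(pid)` with the PID of the `btrfs device remove` process returns its current state letter; sampling it a few times gives a rough idea of whether the task is running or sleeping in the kernel.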

Thanks for such a quick response, too.

