* Rebalancing RAID1
@ 2013-02-12 23:01 Fredrik Tolf
  2013-02-13  0:58 ` Chris Murphy
  2013-02-14 14:44 ` Martin Steigerwald
  0 siblings, 2 replies; 26+ messages in thread
From: Fredrik Tolf @ 2013-02-12 23:01 UTC (permalink / raw)
  To: linux-btrfs

Dear list,

I'm sorry if this is a dumb n3wb question, but I couldn't find anything 
about it, so please bear with me.

I just decided to try BtrFS for the first time, to replace an old ReiserFS 
data partition currently on a mdadm mirror. To do so, I'm using two 3 TB 
disks that were initially detected as sdd and sde, on which I have a 
single large GPT partition, so the devices I'm using for btrfs are sdd1 
and sde1.

I created a filesystem on them using RAID1 from the start (mkfs.btrfs -d 
raid -m raid1 /dev/sd{d,e}1), and started copying the data from the old 
partition onto it during the night. As it happened, I immediately got 
reason to try out BtrFS recovery because sometime during the copying 
operation /dev/sdd had some kind of cable failure and was removed from the 
system. A while later, however, it was apparently auto-redetected, this 
time as /dev/sdi, and BtrFS seems to have inserted it back into the 
filesystem somehow.

The current situation looks like this:

> $ sudo ./btrfs fi show
> Label: none  uuid: 40d346bb-2c77-4a78-8803-1e441bf0aff7
>         Total devices 2 FS bytes used 1.64TB
>         devid    1 size 2.73TB used 1.64TB path /dev/sdi1
>         devid    2 size 2.73TB used 2.67TB path /dev/sde1
> 
> Btrfs v0.20-rc1-56-g6cd836d

As you can see, /dev/sdi1 has much less space used, which I can only 
assume is because extents weren't allocated on it while it was off-line. 
I'm now trying to remedy this, but I'm not sure if I'm doing it right.

What I'm doing is to run "btrfs fi bal start /mnt &", and it gives me a 
ton of kernel messages that look like this:

Feb 12 22:57:16 nerv kernel: [59596.948464] btrfs: relocating block group 2879804932096 flags 17
Feb 12 22:57:45 nerv kernel: [59626.618280] btrfs_end_buffer_write_sync: 8 callbacks suppressed
Feb 12 22:57:45 nerv kernel: [59626.621893] lost page write due to I/O error on /dev/sdd1
Feb 12 22:57:45 nerv kernel: [59626.621893] btrfs_dev_stat_print_on_error: 8 callbacks suppressed
Feb 12 22:57:45 nerv kernel: [59626.621893] btrfs: bdev /dev/sdd1 errs: wr 66339, rd 26, flush 1, corrupt 0, gen 0
Feb 12 22:57:45 nerv kernel: [59626.644110] lost page write due to I/O error on /dev/sdd1
[Lots of the above, and occasionally a couple of lines like these]
Feb 12 22:57:48 nerv kernel: [59629.569278] btrfs: found 46 extents
Feb 12 22:57:50 nerv kernel: [59631.685067] btrfs_dev_stat_print_on_error: 5 callbacks suppressed

This barrage of messages, combined with the fact that the rebalance is 
going quite slowly (btrfs fi bal stat indicates about 1 extent per minute, 
where an extent seems to be about 1 GB; several times slower than the 
original copy onto the filesystem), leads me to think that something is 
wrong. Is it, or should I just wait 2 days for it to complete, ignoring 
the errors?
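[For readers following along: a hedged sketch of how the balance and the per-device error counters can be watched from another terminal. The mountpoint and subcommand spellings are assumptions; the v0.20-era tools in this thread spelled it "btrfs fi bal stat", newer btrfs-progs use "btrfs balance status".]

```shell
#!/bin/sh
# Hedged sketch: watch a running balance and the per-device error
# counters. Assumes btrfs-progs is installed and $1 is the mountpoint.
watch_balance() {
    mnt=$1
    if ! command -v btrfs >/dev/null 2>&1 || [ ! -d "$mnt" ]; then
        echo "skipping: need btrfs-progs and a mountpoint at $mnt"
        return 0
    fi
    btrfs balance status "$mnt" || true   # newer spelling of "btrfs fi bal stat"
    btrfs device stats "$mnt" || true     # wr/rd/flush/corrupt/gen per device
}
watch_balance /mnt
```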

Also, why does it say that the errors are occurring on /dev/sdd1? Is it 
just remembering the whole filesystem by that name since that's how I 
mounted it, or is it still trying to access the old removed instance of 
that disk and is that, then, why it's giving all these errors?

Thanks for reading!

--

Fredrik Tolf

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Rebalancing RAID1
  2013-02-12 23:01 Rebalancing RAID1 Fredrik Tolf
@ 2013-02-13  0:58 ` Chris Murphy
  2013-02-13  6:18   ` Fredrik Tolf
  2013-02-14 14:44 ` Martin Steigerwald
  1 sibling, 1 reply; 26+ messages in thread
From: Chris Murphy @ 2013-02-13  0:58 UTC (permalink / raw)
  To: Fredrik Tolf; +Cc: linux-btrfs


On Feb 12, 2013, at 4:01 PM, Fredrik Tolf <fredrik@dolda2000.com> wrote:
> 
> mkfs.btrfs -d raid -m raid1 /dev/sd{d,e}1

Is that a typo? -d raid isn't valid.

What do you get for:
btrfs fi df /mnt

Please report the result for each drive:
smartctl -a /dev/sdX
smartctl -l scterc /dev/sdX

> 
> Also, why does it say that the errors are occurring on /dev/sdd1? Is it just remembering the whole filesystem by that name since that's how I mounted it, or is it still trying to access the old removed instance of that disk and is that, then, why it's giving all these errors?

I suspect bad sectors at the moment. But it could be other things too. What kernel version?


Chris Murphy


* Re: Rebalancing RAID1
  2013-02-13  0:58 ` Chris Murphy
@ 2013-02-13  6:18   ` Fredrik Tolf
  2013-02-13  8:10     ` Chris Murphy
  0 siblings, 1 reply; 26+ messages in thread
From: Fredrik Tolf @ 2013-02-13  6:18 UTC (permalink / raw)
  To: Chris Murphy; +Cc: linux-btrfs

On Tue, 12 Feb 2013, Chris Murphy wrote:
>
> On Feb 12, 2013, at 4:01 PM, Fredrik Tolf <fredrik@dolda2000.com> wrote:
>>
>> mkfs.btrfs -d raid -m raid1 /dev/sd{d,e}1
>
> Is that a typo? -d raid isn't valid.

Ah yes, sorry. That was a typo.

> What do you get for:
> btrfs fi df /mnt

$ sudo ./btrfs fi df /mnt
Data, RAID1: total=2.66TB, used=2.66TB
Data: total=8.00MB, used=0.00
System, RAID1: total=8.00MB, used=388.00KB
System: total=4.00MB, used=0.00
Metadata, RAID1: total=4.00GB, used=3.66GB
Metadata: total=8.00MB, used=0.00

> Please report the result for each drive:
> smartctl -a /dev/sdX

They're a bit long for mail, so see here:
<http://www.dolda2000.com/~fredrik/tmp/smart-hde>
<http://www.dolda2000.com/~fredrik/tmp/smart-hdi>

There's not a whole lot to see, though.

> smartctl -l scterc /dev/sdX

"Warning: device does not support SCT Error Recovery Control command"

>> Also, why does it say that the errors are occurring on /dev/sdd1? Is it just remembering the whole filesystem by that name since that's how I mounted it, or is it still trying to access the old removed instance of that disk and is that, then, why it's giving all these errors?
>
> I suspect bad sectors at the moment.

Doesn't seem that way to me; partly because of the SMART data, and partly 
because of the errors that were logged as the drive failed:

Feb 12 16:36:49 nerv kernel: [36769.546522] ata6.00: Ata error. fis:0x21
Feb 12 16:36:49 nerv kernel: [36769.550454] ata6: SError: { Handshk }
Feb 12 16:36:51 nerv kernel: [36769.554129] ata6.00: failed command: WRITE FPDMA QUEUED
Feb 12 16:36:51 nerv kernel: [36769.559375] ata6.00: cmd 61/00:00:00:ec:2e/04:00:cd:00:00/40 tag 0 ncq 524288 out
Feb 12 16:36:51 nerv kernel: [36769.559375]          res 41/84:d0:00:98:2e/84:00:cd:00:00/40 Emask 0x10 (ATA bus error)
Feb 12 16:36:51 nerv kernel: [36769.574831] ata6.00: status: { DRDY ERR }
Feb 12 16:36:52 nerv kernel: [36769.578867] ata6.00: error: { ICRC ABRT }

That's not typical for actual media problems, in my experience. :)

> What kernel version?

Oh, sorry, it's 3.7.1. The system is otherwise a pretty much vanilla 
Debian Squeeze (current stable) that I've just compiled a newer kernel 
(and btrfs-tools) for.

Thanks for replying!

--

Fredrik Tolf


* Re: Rebalancing RAID1
  2013-02-13  6:18   ` Fredrik Tolf
@ 2013-02-13  8:10     ` Chris Murphy
  2013-02-14  6:42       ` Fredrik Tolf
  0 siblings, 1 reply; 26+ messages in thread
From: Chris Murphy @ 2013-02-13  8:10 UTC (permalink / raw)
  To: Fredrik Tolf; +Cc: linux-btrfs


On Feb 12, 2013, at 11:18 PM, Fredrik Tolf <fredrik@dolda2000.com> wrote:
> 
> 
>> smartctl -l scterc /dev/sdX
> 
> "Warning: device does not support SCT Error Recovery Control command"
> 
> Doesn't seem that way to me; partly because of the SMART data, and partly because of the errors that were logged as the drive failed:
> 
> Feb 12 16:36:49 nerv kernel: [36769.546522] ata6.00: Ata error. fis:0x21
> Feb 12 16:36:49 nerv kernel: [36769.550454] ata6: SError: { Handshk }
> Feb 12 16:36:51 nerv kernel: [36769.554129] ata6.00: failed command: WRITE FPDMA QUEUED
> Feb 12 16:36:51 nerv kernel: [36769.559375] ata6.00: cmd 61/00:00:00:ec:2e/04:00:cd:00:00/40 tag 0 ncq 524288 out
> Feb 12 16:36:51 nerv kernel: [36769.559375]          res 41/84:d0:00:98:2e/84:00:cd:00:00/40 Emask 0x10 (ATA bus error)
> Feb 12 16:36:51 nerv kernel: [36769.574831] ata6.00: status: { DRDY ERR }
> Feb 12 16:36:52 nerv kernel: [36769.578867] ata6.00: error: { ICRC ABRT }
> 
> That's not typical for actual media problems, in my experience. :)

Quite typical, because these drives don't support SCT ERC, which almost certainly means their error-recovery timeouts are well above the Linux SCSI layer's command timer of 30 seconds. The drives' own timeouts are likely around 2 minutes. So in fact they never report back a URE, because the command timer expires first and the kernel resets the drive.
https://access.redhat.com/knowledge/docs/en-US/Red_Hat_Enterprise_Linux/5/html/Online_Storage_Reconfiguration_Guide/task_controlling-scsi-command-timer-onlining-devices.html
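[The 30-second timer mentioned above is visible, and tunable, through sysfs; a hedged sketch follows. The disk name is a placeholder, and the 180-second value is only an example of giving a consumer drive room to finish its own recovery.]

```shell
#!/bin/sh
# Hedged sketch: inspect the kernel's SCSI command timer for a disk.
# The 30 s default is what fires long before a consumer drive's
# ~2 minute ECC retries finish.
scsi_timeout() {
    disk=$1
    f=/sys/block/$disk/device/timeout
    if [ -r "$f" ]; then
        echo "$disk: $(cat "$f") seconds"
    else
        echo "no timeout attribute for $disk"
    fi
}
scsi_timeout sdd
# Raising it (root required) so the drive can finish its own recovery:
#   echo 180 > /sys/block/sdd/device/timeout
```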

For your use case, I'd reject these drives and get WDC Reds; reportedly even the Hitachi Deskstars still have a settable SCT ERC. Set it to something like 70 deciseconds. Then, if a drive's ECC hasn't recovered a sector in 7 seconds, it will give up and report a read error with the problem LBA. Either btrfs or md can then recover the data from the other drive and rewrite the bad sector.
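[Setting the 70-decisecond ERC suggested above looks like this with smartmontools; the device path is a placeholder, and on most drives the setting does not survive a power cycle, so it needs reapplying at boot.]

```shell
#!/bin/sh
# Hedged sketch: set SCT ERC to 7 s (70 deciseconds) for reads and
# writes. /dev/sdX is a placeholder; reapply after every power cycle
# (e.g. from a boot script or udev rule).
set_erc() {
    dev=$1
    if ! command -v smartctl >/dev/null 2>&1 || [ ! -e "$dev" ]; then
        echo "skipping $dev: need smartmontools and the device present"
        return 0
    fi
    smartctl -l scterc,70,70 "$dev"   # read,write timeouts in deciseconds
}
set_erc /dev/sdX
```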

However, in your case, with both the kernel message ICRC ABRT and the following SMART entry, this is your cable problem. The ICRC and UDMA_CRC errors are the same problem reported by the actors at each end of the cable.

/dev/hdi
Serial Number:    WD-WMC1T1679668
199 UDMA_CRC_Error_Count    0x0032   200   192   000    Old_age   Always       -       91


So the question is whether the cable problem has actually been fixed, and if you're still getting ICRC errors from the kernel. As this is hdi, I'm wondering how many drives are connected, and if this could be power induced rather than just cable induced. Once that's solved, you should do a scrub, rather than a rebalance.
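[A scrub, unlike a balance, reads every copy, verifies checksums, and repairs a bad copy from the good mirror. A hedged sketch of the commands, with the mountpoint assumed from the thread:]

```shell
#!/bin/sh
# Hedged sketch: start a scrub and poll its progress. Assumes a btrfs
# filesystem is mounted at the given path.
run_scrub() {
    mnt=$1
    if ! command -v btrfs >/dev/null 2>&1 || [ ! -d "$mnt" ]; then
        echo "skipping: need btrfs-progs and a mountpoint at $mnt"
        return 0
    fi
    btrfs scrub start "$mnt" || true   # runs in the background by default
    btrfs scrub status "$mnt" || true  # bytes scrubbed and errors found
}
run_scrub /mnt
```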

Chris Murphy


* Re: Rebalancing RAID1
  2013-02-13  8:10     ` Chris Murphy
@ 2013-02-14  6:42       ` Fredrik Tolf
  2013-02-14  7:27         ` Chris Murphy
  2013-02-14  8:01         ` Chris Murphy
  0 siblings, 2 replies; 26+ messages in thread
From: Fredrik Tolf @ 2013-02-14  6:42 UTC (permalink / raw)
  To: Chris Murphy; +Cc: linux-btrfs

On Wed, 13 Feb 2013, Chris Murphy wrote:
> On Feb 12, 2013, at 11:18 PM, Fredrik Tolf <fredrik@dolda2000.com> wrote:
>> That's not typical for actual media problems, in my experience. :)
>
> Quite typical, because these drives don't support SCTERC which almost 
> certainly means their error timeouts are well above that of the linux 
> SCSI layer which is 30 seconds. Their timeouts are likely around 2 
> minutes. So in fact they never report back a URE because the command 
> timer times out and resets the drive.

That's interesting to read. I haven't ever actually experienced missing a 
bad sector reported by a hard drive, though; and not for a lack of 
experience with bad sectors.

Either way, though, with the assumption that it actually was a cable 
problem rather than bad medium...

> However, in your case, with both the kernel message ICRC ABRT, and the 
> following SMART entry, this is your cable problem.

... I'd still like to solve the problem as it is, so that I know what to 
do the next time I get some device error.

> So the question is whether the cable problem has actually been fixed, 
> and if you're still getting ICRC errors from the kernel.

I'm not getting any block-layer errors from the kernel. The errors I 
posted originally are the only ones I'm getting.

> As this is hdi, I'm wondering how many drives are connected, and if this 
> could be power induced rather than just cable induced.

With the general change, I actually decreased the number of drives in the 
system from 10 to 8, so unless the new drives are incredibly more 
power-hungry than the old ones, that shouldn't be a problem.

> Once that's solved, you should do a scrub, rather than a rebalance.

Oh, will scrubbing actually rebalance the array? I was under the 
impression that it only checked for bad checksums.

I'm still wondering what those errors actually mean, though. I'm still 
getting them occasionally, even when I'm not rebalancing (just not as 
often). I'm also very curious about what it means that it's still 
complaining about sdd rather than sdi.

It's worth noting that I still haven't un- and remounted the filesystem 
since the drive disconnected. I assumed that I shouldn't need to and that 
the multiple-device layer of btrfs should handle the situation correctly. 
Is that assumption correct?

--

Fredrik Tolf


* Re: Rebalancing RAID1
  2013-02-14  6:42       ` Fredrik Tolf
@ 2013-02-14  7:27         ` Chris Murphy
  2013-02-14  7:58           ` Fredrik Tolf
  2013-02-14  8:01         ` Chris Murphy
  1 sibling, 1 reply; 26+ messages in thread
From: Chris Murphy @ 2013-02-14  7:27 UTC (permalink / raw)
  To: Fredrik Tolf; +Cc: linux-btrfs


On Feb 13, 2013, at 11:42 PM, Fredrik Tolf <fredrik@dolda2000.com> wrote:

> 
> That's interesting to read. I haven't ever actually experienced missing a bad sector reported by a hard drive, though; and not for a lack of experience with bad sectors.

That experience is consistent with a consumer drive whose ECC timeout is longer than Linux's: well before the drive gives up, Linux does (by default, anyway).

>> However, in your case, with both the kernel message ICRC ABRT, and the following SMART entry, this is your cable problem.
> 
> ... I'd still like to solve the problem as it is, so that I know what to do the next time I get some device error.

It depends on the error, but top of that list would be to stop writing to the disk. The last thing I'd do is a rebalance.

> 
>> So the question is whether the cable problem has actually been fixed, and if you're still getting ICRC errors from the kernel.
> 
> I'm not getting any block-layer errors from the kernel. The errors I posted originally are the only ones I'm getting.

Previously you reported:
Feb 12 16:36:51 nerv kernel: [36769.574831] ata6.00: status: { DRDY ERR }
Feb 12 16:36:52 nerv kernel: [36769.578867] ata6.00: error: { ICRC ABRT }

These are not block errors. You should not proceed until you're certain this isn't still intermittently occurring.


> With the general change, I actually decreased the number of drives in the system from 10 to 8, so unless the new drives are incredibly more power-hungry than the old ones, that shouldn't be a problem.

I'd find out and be certain. Low power is one possible cause of that ICRC error, not just a cable problem.


>> Once that's solved, you should do a scrub, rather than a rebalance.
> 
> Oh, will scrubbing actually rebalance the array? I was under the impression that it only checked for bad checksums.

Scrubbing does not balance the volume. Based on the information you supplied I don't really see the reason for a rebalance.

What you do next depends on what your goal is for this data, on these two disks, using btrfs. If the idea is to trust the data on the volume, then, since you still have the source data, I'd mkfs.btrfs on the disks and start over. If the idea is to experiment and learn, you might want to do a btrfsck, followed by a scrub.
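[A hedged sketch of that experiment-and-learn path. /dev/sde1 and /mnt are the names from this thread; btrfsck (btrfs check) is read-only by default and wants the filesystem unmounted.]

```shell
#!/bin/sh
# Hedged sketch: offline consistency check, then an online scrub.
check_then_scrub() {
    dev=$1; mnt=$2
    if ! command -v btrfs >/dev/null 2>&1 || [ ! -e "$dev" ]; then
        echo "skipping: need btrfs-progs and device $dev"
        return 0
    fi
    btrfsck "$dev" || true                # read-only check, unmounted fs
    mount "$dev" "$mnt" || return 0
    btrfs scrub start -B "$mnt" || true   # -B: run in the foreground
}
check_then_scrub /dev/sde1 /mnt
```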


> I'm still wondering what those errors actually mean, though. I'm still getting them occasionally, even when I'm not rebalancing (just not as often). I'm also very curious about what it means that it's still complaining about sdd rather than sdi.

I have no idea what errors you're still getting, or in what context. This:

Feb 12 22:57:45 nerv kernel: [59626.644110] lost page write due to I/O error on /dev/sdd1

This:
Feb 12 16:36:51 nerv kernel: [36769.574831] ata6.00: status: { DRDY ERR }
Feb 12 16:36:52 nerv kernel: [36769.578867] ata6.00: error: { ICRC ABRT }

are not btrfs errors. So if you're still getting them, you still have hardware problems to figure out.


> 
> It's worth noting that I still haven't un- and remounted the filesystem since the drive disconnected. I assumed that I shouldn't need to and that the multiple-device layer of btrfs should handle the situation correctly. Is that assumption correct?

Btrfs is stable on stable hardware. Your hardware most definitely was not stable during a series of writes. So I'd say all bets are off. That doesn't mean it can't be fixed, but the very fact you're still getting errors indicates something is still wrong.


Chris Murphy


* Re: Rebalancing RAID1
  2013-02-14  7:27         ` Chris Murphy
@ 2013-02-14  7:58           ` Fredrik Tolf
  2013-02-14  8:41             ` Chris Murphy
  0 siblings, 1 reply; 26+ messages in thread
From: Fredrik Tolf @ 2013-02-14  7:58 UTC (permalink / raw)
  To: Chris Murphy; +Cc: linux-btrfs

On Thu, 14 Feb 2013, Chris Murphy wrote:
>>> So the question is whether the cable problem has actually been fixed, and if you're still getting ICRC errors from the kernel.
>>
>> I'm not getting any block-layer errors from the kernel. The errors I posted originally are the only ones I'm getting.
>
> Previously you reported:
> Feb 12 16:36:51 nerv kernel: [36769.574831] ata6.00: status: { DRDY ERR }
> Feb 12 16:36:52 nerv kernel: [36769.578867] ata6.00: error: { ICRC ABRT }
>
> These are not block errors. You should not proceed until you're certain this isn't still intermittently occurring.

Sorry for being unclear. By "block-layer errors" I meant hardware/driver 
errors, as opposed to filesystem errors, but I guess that's not the 
vernacular use of the term.

To try to be clearer, then:

I am not getting ICRC errors anymore, or any driver-related errors 
whatsoever. I was only getting them when sdd was originally lost, and have 
not been getting any of them since.

The errors I am currently getting, and the ones I was getting during the 
rebalance, are those I reported in the original mail; that is:

Feb 14 08:32:30 nerv kernel: [180511.760850] lost page write due to I/O error on /dev/sdd1
Feb 14 08:32:30 nerv kernel: [180511.764690] btrfs: bdev /dev/sdd1 errs: wr 288650, rd 26, flush 1, corrupt 0, gen 0

I am only getting those messages from the kernel, and nothing else. 
Currently, those two messages are the only ones I'm getting at all (except 
with slightly different numeric parameters, of course); while I was trying 
to rebalance, I also got messages looking like this:

Feb 12 22:57:16 nerv kernel: [59596.948464] btrfs: relocating block group 2879804932096 flags 17
Feb 12 22:57:45 nerv kernel: [59626.618280] btrfs_end_buffer_write_sync: 8 callbacks suppressed
Feb 12 22:57:45 nerv kernel: [59626.621893] btrfs_dev_stat_print_on_error: 8 callbacks suppressed
Feb 12 22:57:48 nerv kernel: [59629.569278] btrfs: found 46 extents

I hope that clears it up.

>>> Once that's solved, you should do a scrub, rather than a rebalance.
>>
>> Oh, will scrubbing actually rebalance the array? I was under the impression that it only checked for bad checksums.
>
> Scrubbing does not balance the volume. Based on the information you 
> supplied I don't really see the reason for a rebalance.

Maybe my terminology is wrong again, then, because I do see a reason to 
get the data properly replicated across the drives, which it doesn't seem 
to be now. That's what I meant by "rebalancing".

> What you do next depends on what your goal is for this data, on these 
> two disks, using btrfs. If the idea is to trust the data on the volume; 
> you still have the source data so I'd mkfs.btrfs on the disks and start 
> over. If the idea is to experiment and learn, you might want to do a 
> btrfsck, followed by a scrub.

I'm still keeping the original data just in case, of course. However, my 
primary goal right now is to learn how to manage redundancy reliably with 
btrfs. I mean, with md, I can easily handle a device failure and fix it up 
without having to remount or reboot; and I've assumed that I should be 
able to do that with btrfs as well (please correct me if that assumption 
is invalid, though).

> Btrfs is stable on stable hardware. Your hardware most definitely was 
> not stable during a series of writes. So I'd say all bets are off. That 
> doesn't mean it can't be fixed, but the very fact you're still getting 
> errors indicates something is still wrong.

Isn't btrfs' RAID1 supposed to be stable as long as only one disk fails, 
though?

> This:
> Feb 12 22:57:45 nerv kernel: [59626.644110] lost page write due to I/O error on /dev/sdd1
> Are not btrfs errors.

I see; I thought that was a btrfs error, but I was wrong then. Since I'm 
not actually getting any driver errors, though, and it's referring to 
sdd, doesn't that just mean, as I suspect, that btrfs is still trying to 
use the old defunct sdd instead of sdi, which is what the drive became 
named after it was redetected?

> This:
> Feb 12 16:36:51 nerv kernel: [36769.574831] ata6.00: status: { DRDY ERR }
> Feb 12 16:36:52 nerv kernel: [36769.578867] ata6.00: error: { ICRC ABRT }

Just to be overly redundant: I'm not getting those anymore, and I only 
ever got them before the drive was redetected as sdi.

--

Fredrik Tolf


* Re: Rebalancing RAID1
  2013-02-14  6:42       ` Fredrik Tolf
  2013-02-14  7:27         ` Chris Murphy
@ 2013-02-14  8:01         ` Chris Murphy
  2013-02-15  4:06           ` Fredrik Tolf
  1 sibling, 1 reply; 26+ messages in thread
From: Chris Murphy @ 2013-02-14  8:01 UTC (permalink / raw)
  To: Fredrik Tolf; +Cc: linux-btrfs


On Feb 13, 2013, at 11:42 PM, Fredrik Tolf <fredrik@dolda2000.com> wrote:

> It's worth noting that I still haven't un- and remounted the filesystem since the drive disconnected. 

I suggest capturing the current dmesg, reboot, and see if the btrfs volume will mount read-only without complaints in dmesg.
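[A hedged sketch of that procedure; paths are placeholders, and the read-only mount keeps btrfs from writing anything while the log is inspected.]

```shell
#!/bin/sh
# Hedged sketch: save the kernel log, then (after rebooting) try a
# read-only mount and look at what the kernel says about it.
capture_then_mount_ro() {
    dev=$1; mnt=$2
    if [ ! -e "$dev" ]; then
        echo "skipping: $dev not present"
        return 0
    fi
    dmesg > /tmp/dmesg-before-reboot.txt || true   # capture first
    # ... reboot here ...
    mount -o ro "$dev" "$mnt" || { echo "ro mount failed"; return 0; }
    dmesg | tail -n 50    # look for btrfs complaints during the mount
}
capture_then_mount_ro /dev/sde1 /mnt
```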

Also, is a virtual machine being used in any of this, either as host or guest?

Chris Murphy



* Re: Rebalancing RAID1
  2013-02-14  7:58           ` Fredrik Tolf
@ 2013-02-14  8:41             ` Chris Murphy
  2013-02-14  8:59               ` Hugo Mills
  0 siblings, 1 reply; 26+ messages in thread
From: Chris Murphy @ 2013-02-14  8:41 UTC (permalink / raw)
  To: Fredrik Tolf; +Cc: linux-btrfs


On Feb 14, 2013, at 12:58 AM, Fredrik Tolf <fredrik@dolda2000.com> wrote:
> 
> Feb 14 08:32:30 nerv kernel: [180511.760850] lost page write due to I/O error on /dev/sdd1

Well, someone else might comment on what that is exactly; I'm not getting conclusive Google hits on it. Sometimes it's fixed by going to a newer kernel, sometimes it's bad hardware. But it's apparently not a btrfs error, though it is causing subsequent errors which are. So whatever it is, btrfs doesn't like it.


> Feb 14 08:32:30 nerv kernel: [180511.764690] btrfs: bdev /dev/sdd1 errs: wr 288650, rd 26, flush 1, corrupt 0, gen 0

So there continue to be write errors. Unsurprising as sdd1 seems to be dropping pages.

>> 
>> Scrubbing does not balance the volume. Based on the information you supplied I don't really see the reason for a rebalance.
> 
> Maybe my terminology is wrong again, then, because I do see a reason to get the data properly replicated across the drives, which it doesn't seem to be now. That's what I meant by "rebalancing".

How much data was copied to the drives? I'm continuously confused by how btrfs reports data usage. What I have is this from fi show and fi df:

Data, RAID1: total=2.66TB, used=2.66TB
Total devices 2 FS bytes used 1.64TB
devid    1 size 2.73TB used 1.64TB path /dev/sdi1
devid    2 size 2.73TB used 2.67TB path /dev/sde1

So I can't tell if it's ~1.64TB copied or 2.6TB.

At this point I wish these two could just report the GB/TB equivalent of non-free LBAs in use for the file system and on each device.

I can see why you want to rebalance but you have WRITE ERRORS still occurring. That needs to get figured out before you expect a rebalance to work.


> I mean, with md, I can easily handle a device failure and fix it up without having to remount or reboot;

That's speculative. You have continuing write failures for one drive for unknown reasons. md will kick devices out of an array in such a case; so it too gets out of sync and needs resyncing.

What's not clear right now is why you keep getting this kernel I/O error. It's not unreasonable to power off the computer, check all the cables, and that the controller is seated - after all you did get multiple indications that a UDMA CRC error occurred. And that's a hardware error. I don't know how well the kernel or the hardware recovers from this. That's a separate question from btrfs recovering from such a problem.

> 
>> Btrfs is stable on stable hardware. Your hardware most definitely was not stable during a series of writes. So I'd say all bets are off. That doesn't mean it can't be fixed, but the very fact you're still getting errors indicates something is still wrong.
> 
> Isn't btrfs' RAID1 supposed to be stable as long as only one disk fails, though?

OK but this isn't a drive failure. You have other problems occurring.


> 
>> This:
>> Feb 12 22:57:45 nerv kernel: [59626.644110] lost page write due to I/O error on /dev/sdd1
>> Are not btrfs errors.
> 
> I see. I thought that was a btrfs error, but I was wrong then. Since I'm not actually getting any driver errors, though, and it's referring to sdd, doesn't that just mean, as I suspect, that btrfs is still trying to use the old defunct sdd instead of sdi as the drive became named after it was redetected?

Btrfs uses UUIDs to identify drives, not /dev/sdX. So if a particular drive vanished for a bit, came back and was assigned a new /dev/sdX letter, but has the UUID btrfs is expecting, it very well may re-add it. It's an open question if it should do that in the face of hardware problems that it's not designed to manage.
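[The UUID-based matching described above can be seen from userspace too; a hedged sketch follows. The blkid invocation is illustrative, and "btrfs device scan" is what re-registers a member that has reappeared under a new /dev name.]

```shell
#!/bin/sh
# Hedged sketch: every member device of one btrfs filesystem carries
# the same filesystem UUID, which is how btrfs re-matches a drive that
# comes back under a new /dev name (sdd -> sdi in this thread).
list_btrfs_members() {
    if ! command -v blkid >/dev/null 2>&1; then
        echo "no blkid available"
        return 0
    fi
    blkid -t TYPE=btrfs || true   # same UUID= on every member partition
    # After a device reappears, this re-registers it with the kernel:
    #   btrfs device scan
    return 0
}
list_btrfs_members
```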


> 
>> This:
>> Feb 12 16:36:51 nerv kernel: [36769.574831] ata6.00: status: { DRDY ERR }
>> Feb 12 16:36:52 nerv kernel: [36769.578867] ata6.00: error: { ICRC ABRT }
> 
> Just to be overly redundant: I'm not getting those anymore, and I only ever got them before the drive was redetected as sdi.

I'd poweroff, check things, power back up. Seems to me either the hardware is confused, the kernel is confused, or both. I'm not sure why.


Chris Murphy


* Re: Rebalancing RAID1
  2013-02-14  8:41             ` Chris Murphy
@ 2013-02-14  8:59               ` Hugo Mills
  2013-02-14 18:05                 ` Chris Murphy
  0 siblings, 1 reply; 26+ messages in thread
From: Hugo Mills @ 2013-02-14  8:59 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Fredrik Tolf, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 2265 bytes --]

On Thu, Feb 14, 2013 at 01:41:04AM -0700, Chris Murphy wrote:
> 
> On Feb 14, 2013, at 12:58 AM, Fredrik Tolf <fredrik@dolda2000.com> wrote:
> > 
> > Feb 14 08:32:30 nerv kernel: [180511.760850] lost page write due to I/O error on /dev/sdd1
> 
> Well, someone else might comment on what that is exactly, I'm not getting conclusive google hits on this. Sometimes it's fixed by going to a newer kernel. Sometimes it's bad hardware. But it's apparently not a btrfs error. But it's causing subsequent errors which are btrfs errors. So whatever it is, it seems like btrfs doesn't like it.
> 
> 
> > Feb 14 08:32:30 nerv kernel: [180511.764690] btrfs: bdev /dev/sdd1 errs: wr 288650, rd 26, flush 1, corrupt 0, gen 0
> 
> So there continue to be write errors. Unsurprising as sdd1 seems to be dropping pages.
> 
> >> 
> >> Scrubbing does not balance the volume. Based on the information you supplied I don't really see the reason for a rebalance.
> > 
> > Maybe my terminology is wrong again, then, because I do see a reason to get the data properly replicated across the drives, which it doesn't seem to be now. That's what I meant by "rebalancing".
> 
> How much data was copied to the drives? I'm continuously confused by how btrfs reports data usage. What I have is this from fi show and fi df:
> 
> Data, RAID1: total=2.66TB, used=2.66TB

   This is the amount of actual useful data (i.e. what you see with du
or ls -l). Double this (because it's RAID-1) to get the number of
bytes of raw storage used.

> Total devices 2 FS bytes used 1.64TB
> devid    1 size 2.73TB used 1.64TB path /dev/sdi1
> devid    2 size 2.73TB used 2.67TB path /dev/sde1

   This is the amount of raw disk space allocated. The total of used
here should add up to twice the "total" values above (for
Data+Metadata+System).

> So I can't tell if it's ~1.64TB copied or 2.6TB.

   Looks like /dev/sdi1 isn't actually being written to -- it should
be the same allocation as /dev/sde1.
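[This accounting can be sanity-checked with a back-of-envelope calculation using the TB figures quoted from fi df and fi show above:]

```shell
#!/bin/sh
# Back-of-envelope check of the RAID1 accounting in this thread.
raw_expected=$(awk 'BEGIN { printf "%.2f", 2 * 2.66 }')     # Data total, doubled
raw_allocated=$(awk 'BEGIN { printf "%.2f", 1.64 + 2.67 }') # per-device "used" summed
echo "expected ~${raw_expected} TB raw, allocated ${raw_allocated} TB"
# 5.32 vs 4.31: roughly 1 TB of data is missing its second copy, which
# matches the observation that /dev/sdi1 isn't being written to.
```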

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- Alert status "mauve ocelot": Slight chance of brimstone. Be ---   
                   prepared to make a nice cup of tea.                   

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 828 bytes --]


* Re: Rebalancing RAID1
  2013-02-12 23:01 Rebalancing RAID1 Fredrik Tolf
  2013-02-13  0:58 ` Chris Murphy
@ 2013-02-14 14:44 ` Martin Steigerwald
  2013-02-14 18:45   ` Chris Murphy
  2013-02-15  3:44   ` Fredrik Tolf
  1 sibling, 2 replies; 26+ messages in thread
From: Martin Steigerwald @ 2013-02-14 14:44 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Fredrik Tolf

On Wednesday, 13 February 2013, Fredrik Tolf wrote:
> Dear list,

Hi Fredrik,

> I'm sorry if this is a dumb n3wb question, but I couldn't find anything 
> about it, so please bear with me.
> 
> I just decided to try BtrFS for the first time, to replace an old ReiserFS 
> data partition currently on a mdadm mirror. To do so, I'm using two 3 TB 
> disks that were initially detected as sdd and sde, on which I have a 
> single large GPT partition, so the devices I'm using for btrfs are sdd1 
> and sde1.
> 
> I created a filesystem on them using RAID1 from the start (mkfs.btrfs -d 
> raid -m raid1 /dev/sd{d,e}1), and started copying the data from the old 
> partition onto it during the night. As it happened, I immediately got 
> reason to try out BtrFS recovery because sometime during the copying 
> operation /dev/sdd had some kind of cable failure and was removed from the 
> system. A while later, however, it was apparently auto-redetected, this 
> time as /dev/sdi, and BtrFS seems to have inserted it back into the 
> filesystem somehow.
> 
> The current situation looks like this:
> 
> > $ sudo ./btrfs fi show
> > Label: none  uuid: 40d346bb-2c77-4a78-8803-1e441bf0aff7
> >         Total devices 2 FS bytes used 1.64TB
> >         devid    1 size 2.73TB used 1.64TB path /dev/sdi1
> >         devid    2 size 2.73TB used 2.67TB path /dev/sde1
> > 
> > Btrfs v0.20-rc1-56-g6cd836d
> 
> As you can see, /dev/sdi1 has much less space used, which I can only 
> assume is because extents weren't allocated on it while it was off-line. 
> I'm now trying to remedy this, but I'm not sure if I'm doing it right.
> 
> What I'm doing is to run "btrfs fi bal start /mnt &", and it gives me a 
> ton of kernel messages that look like this:
> 
> Feb 12 22:57:16 nerv kernel: [59596.948464] btrfs: relocating block group 2879804932096 flags 17
> Feb 12 22:57:45 nerv kernel: [59626.618280] btrfs_end_buffer_write_sync: 8 callbacks suppressed
> Feb 12 22:57:45 nerv kernel: [59626.621893] lost page write due to I/O error on /dev/sdd1
> Feb 12 22:57:45 nerv kernel: [59626.621893] btrfs_dev_stat_print_on_error: 8 callbacks suppressed
> Feb 12 22:57:45 nerv kernel: [59626.621893] btrfs: bdev /dev/sdd1 errs: wr 66339, rd 26, flush 1, corrupt 0, gen 0
> Feb 12 22:57:45 nerv kernel: [59626.644110] lost page write due to I/O error on /dev/sdd1
> [Lots of the above, and occasionally a couple of lines like these]
> Feb 12 22:57:48 nerv kernel: [59629.569278] btrfs: found 46 extents
> Feb 12 22:57:50 nerv kernel: [59631.685067] btrfs_dev_stat_print_on_error: 5 callbacks suppressed
[…]
> Also, why does it say that the errors are occurring on /dev/sdd1? Is it 
> just remembering the whole filesystem by that name since that's how I 
> mounted it, or is it still trying to access the old removed instance of 
> that disk, and is that, then, why it's giving all these errors?

You started the balance after the btrfs fi show command above?

Then it's obvious to me:

For some reason BTRFS is still trying to write to /dev/sdd, which isn't
there anymore. That perfectly explains those lost page writes for me. If
that is the case, this seems to me like a serious bug in BTRFS.

Hugo's observations also point in that direction. For now I would take
those log messages literally.

There is a chance that BTRFS still displays /dev/sdd while actually writing
to /dev/sdi, but I doubt it. I think it's possible to find this out by
using iostat -x 1 or atop or something like that. And if it does write to
the correct device file, I think it makes sense to update and fix those
log messages.

I'd restart the machine, see that BTRFS is using both devices again and
then try the balance again.

I'd do this while still having a backup on the ReiserFS volume or another
backup drive. After this I'd do a btrfs scrub start to see whether BTRFS
is happy with all the data on the drives.

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Rebalancing RAID1
  2013-02-14  8:59               ` Hugo Mills
@ 2013-02-14 18:05                 ` Chris Murphy
  2013-02-14 20:56                   ` Hugo Mills
  2013-02-15  3:50                   ` Fredrik Tolf
  0 siblings, 2 replies; 26+ messages in thread
From: Chris Murphy @ 2013-02-14 18:05 UTC (permalink / raw)
  To: Hugo Mills; +Cc: Fredrik Tolf, linux-btrfs


On Feb 14, 2013, at 1:59 AM, Hugo Mills <hugo@carfax.org.uk> wrote:
>> 
>> Data, RAID1: total=2.66TB, used=2.66TB
> 
>   This is the amount of actual useful data (i.e. what you see with du
> or ls -l). Double this (because it's RAID-1) to get the number of
> bytes or raw storage used.

Right, the decoder ring. Effectively no outsiders will understand this. It contradicts the behavior of conventional df with btrfs volumes. And it becomes untenable with per subvolume profiles.


>> Total devices 2 FS bytes used 1.64TB
>> devid    1 size 2.73TB used 1.64TB path /dev/sdi1
>> devid    2 size 2.73TB used 2.67TB path /dev/sde1
> 
>   This is the amount of raw disk space allocated. The total of used
> here should add up to twice the "total" values above (for
> Data+Metadata+System).

I'm mostly complaining about the first line. If 2.67TB of writes to sde1 are successful enough to be stated as "used" on that device, then FS bytes used should be at least 2.67TB.

> 
>> So I can't tell if it's ~1.64TB copied or 2.6TB.
> 
>   Looks like /dev/sdi1 isn't actually being written to -- it should
> be the same allocation as /dev/sde1.

Yeah he's getting a lot of these, and I don't know what it is:

> Feb 14 08:32:30 nerv kernel: [180511.760850] lost page write due to I/O error on /dev/sdd1

It's not tied to btrfs or libata so I don't think it's the drive itself reporting the write error. I think maybe the kernel has become confused as a result of the original ICRC ABRT, and the subsequent change from sdd to sdi. 

Chris Murphy

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Rebalancing RAID1
  2013-02-14 14:44 ` Martin Steigerwald
@ 2013-02-14 18:45   ` Chris Murphy
  2013-02-15  3:44   ` Fredrik Tolf
  1 sibling, 0 replies; 26+ messages in thread
From: Chris Murphy @ 2013-02-14 18:45 UTC (permalink / raw)
  To: Martin Steigerwald
  Cc: linux-btrfs@vger.kernel.org BTRFS, Fredrik Tolf, Hugo Mills


On Feb 14, 2013, at 7:44 AM, Martin Steigerwald <Martin@lichtvoll.de> wrote:

> For some reason BTRFS is still trying to write to /dev/sdd, which isn't
> there anymore. That perfectly explains those lost page writes for me. If
> that is the case, this seems to me like a serious bug in BTRFS.

Following the ICRC ABRT error, /dev/sdd becomes /dev/sdi. Btrfs-progs recognizes this by only listing /dev/sdi and /dev/sde as devices in the volume. But the btrfs kernel space code continues to try to write to /dev/sdd, while /dev/sdi isn't getting any writes (at least, it's not filling up with data).

Btrfs kernel space code is apparently unaware that /dev/sdd is gone. That seems to be the primary problem.

A question is: if the kernel space code were aware of a member device vanishing and then reappearing, whether under the same or a different block device designation, should it automatically re-add the device to the volume? The re-added device would be out of sync, which raises a follow-up question: should it be auto-scrubbed to fix this? And yet another follow-up: does the filesystem metadata contain information that could be used similarly to the md write-intent bitmap, reducing the time needed to catch the drive up and avoiding a full scrub?
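
The write-intent bitmap idea can be sketched in a few lines. This is a toy illustration of the concept only; the chunk size and the class are inventions for the example, not btrfs or md code:

```python
# Toy write-intent bitmap: one bit per fixed-size region of the device.
# While a mirror is absent, writes mark regions dirty; when the device
# is re-added, only dirty regions need resyncing instead of a full scrub.
CHUNK = 64 * 1024 * 1024  # 64 MiB per bit; an arbitrary choice here

class WriteIntentBitmap:
    def __init__(self, device_size):
        self.bits = [False] * ((device_size + CHUNK - 1) // CHUNK)

    def mark_dirty(self, offset, length):
        first = offset // CHUNK
        last = (offset + length - 1) // CHUNK
        for i in range(first, last + 1):
            self.bits[i] = True

    def regions_to_resync(self):
        return [i for i, dirty in enumerate(self.bits) if dirty]

bm = WriteIntentBitmap(3 * 10**12)          # ~3 TB member device
bm.mark_dirty(2_879_804_932_096, 1 << 30)   # 1 GiB written while degraded
print(len(bm.regions_to_resync()), "regions dirty out of", len(bm.bits))
```

Only the handful of regions touched while degraded would need rewriting, which is why md resyncs after --re-add are so much faster than a full pass.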


Chris Murphy

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Rebalancing RAID1
  2013-02-14 18:05                 ` Chris Murphy
@ 2013-02-14 20:56                   ` Hugo Mills
  2013-02-14 22:11                     ` Chris Murphy
  2013-02-15  3:50                   ` Fredrik Tolf
  1 sibling, 1 reply; 26+ messages in thread
From: Hugo Mills @ 2013-02-14 20:56 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Fredrik Tolf, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 3511 bytes --]

On Thu, Feb 14, 2013 at 11:05:39AM -0700, Chris Murphy wrote:
> 
> On Feb 14, 2013, at 1:59 AM, Hugo Mills <hugo@carfax.org.uk> wrote:
> >> 
> >> Data, RAID1: total=2.66TB, used=2.66TB
> > 
> >   This is the amount of actual useful data (i.e. what you see with du
> > or ls -l). Double this (because it's RAID-1) to get the number of
> > bytes or raw storage used.
> 
> Right, the decoder ring. Effectively no outsiders will understand
> this. It contradicts the behavior of conventional df with btrfs
> volumes. And it becomes untenable with per subvolume profiles.

   Correct, but *all* other single-value (or small-number-of-values)
displays of space usage fail in similar ways. We've(*) had this
discussion out on this mailing list many times before. All "simple"
displays of disk usage will cause someone to misinterpret something at
some point, and get cross.

(*) For non-"you" values of "we".

   If you want a display of "raw bytes used/free", then someone will
complain that they had 20GB free, wrote a 10GB file, and it's all
gone. If you want a display of "usable data used/free", then we can't
predict the "free" part. There is no single set of values that will
make this simple.

> >> Total devices 2 FS bytes used 1.64TB
> >> devid    1 size 2.73TB used 1.64TB path /dev/sdi1
> >> devid    2 size 2.73TB used 2.67TB path /dev/sde1
> > 
> >   This is the amount of raw disk space allocated. The total of used
> > here should add up to twice the "total" values above (for
> > Data+Metadata+System).
> 
> I'm mostly complaining about the first line. If 2.67TB of writes to sde1 are successful enough to be stated as "used" on that device, then FS bytes used should be at least 2.67TB.

   The values shown above are for bytes *allocated* -- i.e. the
"total" values shown in btrfs fi df. You haven't added in the
metadata, which I'm willing to bet is another 100 GiB or so allocated
space, bringing you up to the 2.67 TiB.
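
   The arithmetic can be sanity-checked in a few lines. This is a hedged
sketch using the figures quoted in the thread; the metadata allocation
value is an assumption chosen to illustrate the reconciliation, not a
value btrfs reported:

```python
# Reconciling per-device "used" (btrfs fi show) with the chunk totals
# (btrfs fi df) on a two-device RAID1 volume. The metadata figure below
# is an assumption for illustration, not a reported value.
data_total = 2.66       # "Data, RAID1: total=2.66" (really TiB)
metadata_total = 0.01   # assumed System+Metadata chunk allocation (TiB)

# With exactly two devices in RAID1, each device holds one copy of every
# chunk, so per-device allocation equals the sum of the chunk totals:
per_device = data_total + metadata_total

# Raw space consumed across the whole volume covers both copies:
raw_total = 2 * (data_total + metadata_total)

print(f"per-device allocation: {per_device:.2f} TiB")  # 2.67, matching sde1
print(f"raw allocation total:  {raw_total:.2f} TiB")
```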

   (There's another problem with this display, which is that it's
actually showing TiB, not TB. There have been patches for this, but I
don't know if any are current).

> > 
> >> So I can't tell if it's ~1.64TB copied or 2.6TB.

   2.66 TiB. The 1.64TiB is clearly wrong, given all the other values.
Hence my conclusion below.

> >   Looks like /dev/sdi1 isn't actually being written to -- it should
> > be the same allocation as /dev/sde1.
> 
> Yeah he's getting a lot of these, and I don't know what it is:
> 
> > Feb 14 08:32:30 nerv kernel: [180511.760850] lost page write due to I/O error on /dev/sdd1
> 
> It's not tied to btrfs or libata so I don't think it's the drive itself reporting the write error. I think maybe the kernel has become confused as a result of the original ICRC ABRT, and the subsequent change from sdd to sdi. 

   That would be my conclusion, too. But with the newly-appeared
/dev/sdi1, btrfs fi show picks it up as belonging to the FS (because
it's got the same UUID), but it's not been picked up by the kernel, so
the kernel's not trying to write to it, and it's therefore massively
out of date.

   I think the solution, if it's certain that the drive is now
behaving sensibly again, is one of:

 * unmount, btrfs dev scan, remount, scrub
or
 * btrfs dev delete missing, add /dev/sdi1 to the FS, and balance
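
   For the record, the first option might look like the following dry-run
sketch. The device names are the ones from this thread, and the `run`
wrapper only echoes each step, so nothing is executed until you remove it:

```shell
# Dry-run of the first recovery option above. Replace the `run` wrapper
# with direct execution only after confirming the device names.
run() { echo "+ $*"; }

run umount /mnt
run btrfs device scan          # let the kernel pick up /dev/sdi1 again
run mount /dev/sde1 /mnt
run btrfs scrub start /mnt     # rewrite stale copies on the returned disk
```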

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
           --- I must be musical:  I've got *loads* of CDs ---           

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Rebalancing RAID1
  2013-02-14 20:56                   ` Hugo Mills
@ 2013-02-14 22:11                     ` Chris Murphy
  0 siblings, 0 replies; 26+ messages in thread
From: Chris Murphy @ 2013-02-14 22:11 UTC (permalink / raw)
  To: Hugo Mills; +Cc: Fredrik Tolf, linux-btrfs


On Feb 14, 2013, at 1:56 PM, Hugo Mills <hugo@carfax.org.uk> wrote:

>> 
> 
>   Correct, but *all* other single-value (or small-number-of-values)
> displays of space usage fail in similar ways. We've(*) had this
> discussion out on this mailing list many times before. All "simple"
> displays of disk usage will cause someone to misinterpret something at
> some point, and get cross.

The decoder ring method causes misinterpretation.

I reject the premise that there isn't a way to at least be consistent, and to use switches for alternate presentations.

>   If you want a display of "raw bytes used/free", then someone will
> complain that they had 20GB free, wrote a 10GB file, and it's all
> gone. If you want a display of "usable data used/free", then we can't
> predict the "free" part. There is no single set of values that will
> make this simple.

This is exactly how (conventional) df -h works now. And it causes exactly the problem you describe. The df -h size and available numbers are double that of btrfs fi df/show. Not ok. Not consistent. Either df needs to change (likely) or btrfs fi needs to change.

2x 80GB array, btrfs

/dev/sdb        160G  112K  158G   1% /mnt

2x 80GB array, md raid1 xfs

/dev/md0         80G   33M   80G   1% /mnt

And I think it's (regular) df that needs to change the most. btrfs fi df contains 50% superfluous information as far as I can tell:

[root@f18v ~]# btrfs fi df /mnt
Data, RAID1: total=1.00GB, used=0.00
*Data: total=8.00MB, used=0.00
*System, RAID1: total=8.00MB, used=8.00KB
*System: total=4.00MB, used=0.00
Metadata, RAID1: total=1.00GB, used=48.00KB
*Metadata: total=8.00MB, used=0.00

The lines marked * convey zero useful information that I can see. And fi show:

[root@f18v ~]# btrfs fi show
Label: 'hello'  uuid: d5517733-7c9f-458a-9e99-5b832b8776b2
	Total devices 2 FS bytes used 56.00KB
	devid    2 size 80.00GB used 2.01GB path /dev/sdc
	devid    1 size 80.00GB used 2.03GB path /dev/sdb

I don't know why I should care about allocated chunks, but if that's what "used" means in this case, it should say that, rather than "used". I'm sort of annoyed that the same words, total and used, have different meanings depending on their position, without other qualifiers. It's like being in school, when the teacher would get pissed at students who wouldn't specify units or label axes; now I'm one of those types. What do these numbers mean? If I have to infer it, then they're obscure, so why should I care about them?

And what I could get from btrfs fi df that it doesn't indicate at all, and that could be more useful than regular df (which simply has no room for it), is a:

Free Space Estimate: min - max
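
For illustration, such a min-max estimate could be computed along these lines; this is a sketch using the 2x 80GB example above with a made-up overhead bound, not btrfs's actual accounting:

```python
# Toy "free space estimate: min - max" for a two-device RAID1 volume,
# using the 2x 80 GB example above. The 5% metadata-overhead bound is
# an assumption for illustration.
device_sizes = [80.0, 80.0]     # GB per device
allocated    = [2.03, 2.01]     # GB of chunks already allocated

unallocated = sum(s - a for s, a in zip(device_sizes, allocated))

# RAID1 stores two copies, so at most half the unallocated raw space
# can hold new data; the low bound reserves 5% for future metadata.
max_free = unallocated / 2
min_free = max_free * 0.95

print(f"Free Space Estimate: {min_free:.1f} - {max_free:.1f} GB")
```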


>   I think the solution, if it's certain that the drive is now
> behaving sensibly again, is one of:
> 
> * unmount, btrfs dev scan, remount, scrub
> or
> * btrfs dev delete missing, add /dev/sdi1 to the FS, and balance

The second won't work because the user space tools don't consider there to be a missing device.

So back to the question of how btrfs should behave in such a case. md would have tossed the drive and, as far as I know, doesn't automatically re-add it if it reappears as either the same or a different block device. And when the user runs --re-add, there's a resync.


Chris Murphy

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Rebalancing RAID1
  2013-02-14 14:44 ` Martin Steigerwald
  2013-02-14 18:45   ` Chris Murphy
@ 2013-02-15  3:44   ` Fredrik Tolf
  2013-02-15  5:49     ` Sander
  2013-02-15  9:05     ` Martin Steigerwald
  1 sibling, 2 replies; 26+ messages in thread
From: Fredrik Tolf @ 2013-02-15  3:44 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: linux-btrfs

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2494 bytes --]

On Thu, 14 Feb 2013, Martin Steigerwald wrote:
> Am Mittwoch, 13. Februar 2013 schrieb Fredrik Tolf:
> You started the balance after above btrfs fi show command?

I did.

> Then its obvious to me:
>
> For some reason BTRFS is still trying to write to /dev/sdd, which isn't
> there anymore. That perfectly explains those lost page writes for me. If
> that is the case, this seems to me like a serious bug in BTRFS.

Now I have simply disconnected the drive entirely, so that I can try to 
do what I would have to do if the drive really had failed completely 
and I had gotten a replacement in its stead. Neither any sdd nor any 
sdi is seen by the system anymore. However, I'm still getting 
kernel messages about being unable to write to sdd:

Feb 15 04:37:41 nerv kernel: [252822.640560] lost page write due to I/O error on /dev/sdd1
Feb 15 04:37:41 nerv kernel: [252822.644531] btrfs: bdev /dev/sdd1 errs: wr 362195, rd 26, flush 1, corrupt 0, gen 0

I can't say I know what conclusions that leads to with regard to your 
observations.

> I´d restart the machine, see that BTRFS is using both devices again and
> then try the balance again.

I mentioned it in another mail, but I'd very much prefer not to do that. 
I'd like to try and solve this as I normally should when a drive fails.

When I'm running btrfs fi show, this is what I'm getting now:

> $ sudo ./btrfs fi show
> Label: none  uuid: 40d346bb-2c77-4a78-8803-1e441bf0aff7
>         Total devices 2 FS bytes used 2.66TB
>         devid    2 size 2.73TB used 2.67TB path /dev/sde1
>         *** Some devices missing

So that's what it should look like when a drive fails, right?

At this point, I'm trying to remove the missing device from the filesystem 
as the Wiki indicates that I should be able to, but alas:

> $ sudo ./btrfs device delete missing /mnt
> ERROR: error removing the device 'missing' - Invalid argument

The dmesg tells me this:

> Feb 15 04:42:22 nerv kernel: [253103.799201] btrfs: unable to go below two devices on raid1

How do I remove the conception of the missing device so that I can replace 
it? Should I simply add the replacement first, and only after that remove 
the missing device?

If the latter, how can I "scratch" the previous btrfs metadata from this 
"replacement" drive so that it doesn't get auto-reinserted into the 
filesystem when it is detected? I assume it won't be enough to just 
zero the first few sectors of the drive, right?

Thanks for replying!

--

Fredrik Tolf

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Rebalancing RAID1
  2013-02-14 18:05                 ` Chris Murphy
  2013-02-14 20:56                   ` Hugo Mills
@ 2013-02-15  3:50                   ` Fredrik Tolf
  2013-02-15  3:55                     ` Chris Murphy
  1 sibling, 1 reply; 26+ messages in thread
From: Fredrik Tolf @ 2013-02-15  3:50 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Hugo Mills, linux-btrfs

On Thu, 14 Feb 2013, Chris Murphy wrote:
> Yeah he's getting a lot of these, and I don't know what it is:
>
>> Feb 14 08:32:30 nerv kernel: [180511.760850] lost page write due to I/O error on /dev/sdd1
>
> It's not tied to btrfs or libata so I don't think it's the drive itself 
> reporting the write error.

Actually, it appears it might just be from btrfs:

> $ grep -rlI 'lost page write' /usr/local/src/linux-3.7.1/fs
> /usr/local/src/linux-3.7.1/fs/btrfs/disk-io.c
> /usr/local/src/linux-3.7.1/fs/buffer.c

And at btrfs/disk-io.c:2711 in this 3.7.1 source:

> printk_ratelimited_in_rcu(KERN_WARNING "lost page write due to "
> 			  "I/O error on %s\n",
> 			  rcu_str_deref(device->name));

So it's either from btrfs, or from the buffer cache, and seeing as how it 
appears with other btrfs messages, I'd be willing to bet on the former. :)

--

Fredrik Tolf

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Rebalancing RAID1
  2013-02-15  3:50                   ` Fredrik Tolf
@ 2013-02-15  3:55                     ` Chris Murphy
  2013-02-15  3:56                       ` Fredrik Tolf
  0 siblings, 1 reply; 26+ messages in thread
From: Chris Murphy @ 2013-02-15  3:55 UTC (permalink / raw)
  To: Fredrik Tolf; +Cc: Hugo Mills, linux-btrfs


On Feb 14, 2013, at 8:50 PM, Fredrik Tolf <fredrik@dolda2000.com> wrote:

> 
> So it's either from btrfs, or from the buffer cache, and seeing as how it appears with other btrfs messages, I'd be willing to bet on the former. :)

Unclear. Google searches reveal that this identical error comes up in non-btrfs contexts. I'd like to think that if it were kernel-btrfs related, it would get tagged as "kernel: blah btrfs: lost page blah".

I could be wrong but a debug kernel might offer more information on this.


Chris Murphy

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Rebalancing RAID1
  2013-02-15  3:55                     ` Chris Murphy
@ 2013-02-15  3:56                       ` Fredrik Tolf
  2013-02-15  4:03                         ` Chris Murphy
  0 siblings, 1 reply; 26+ messages in thread
From: Fredrik Tolf @ 2013-02-15  3:56 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Hugo Mills, linux-btrfs

On Thu, 14 Feb 2013, Chris Murphy wrote:
> On Feb 14, 2013, at 8:50 PM, Fredrik Tolf <fredrik@dolda2000.com> wrote:
>
>>
>> So it's either from btrfs, or from the buffer cache, and seeing as how it appears with other btrfs messages, I'd be willing to bet on the former. :)
>
> Unclear. Google searches reveal that this identical error comes up in non-btrfs contexts. I'd like to think that if it were kernel-btrfs related, it would get tagged as "kernel: blah btrfs: lost page blah".

It appears to me that the bulk of the containing function has been 
copy-pasted from buffer.c, so that's probably why the messages are 
identical.

As you could see from the source code I quoted, the message was, in fact, 
not prepended with "btrfs: ".

--

Fredrik Tolf

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Rebalancing RAID1
  2013-02-15  3:56                       ` Fredrik Tolf
@ 2013-02-15  4:03                         ` Chris Murphy
  0 siblings, 0 replies; 26+ messages in thread
From: Chris Murphy @ 2013-02-15  4:03 UTC (permalink / raw)
  To: Fredrik Tolf; +Cc: Hugo Mills, linux-btrfs


On Feb 14, 2013, at 8:56 PM, Fredrik Tolf <fredrik@dolda2000.com> wrote:

> 
> As you could see from the source code I quoted, the message was, in fact, not prepended with "btrfs: ".

Yep. That's why I'm not so far convinced it's btrfs induced. But clearly btrfs is adversely impacted, and tentatively I'm unsure if the resulting behavior is reasonable.

Chris Murphy

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Rebalancing RAID1
  2013-02-14  8:01         ` Chris Murphy
@ 2013-02-15  4:06           ` Fredrik Tolf
  0 siblings, 0 replies; 26+ messages in thread
From: Fredrik Tolf @ 2013-02-15  4:06 UTC (permalink / raw)
  To: Chris Murphy; +Cc: linux-btrfs

On Thu, 14 Feb 2013, Chris Murphy wrote:
> Also, is a virtual machine being used in any of this, either as host or guest?

Nope.

--

Fredrik Tolf

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Rebalancing RAID1
  2013-02-15  3:44   ` Fredrik Tolf
@ 2013-02-15  5:49     ` Sander
  2013-02-15  9:05     ` Martin Steigerwald
  1 sibling, 0 replies; 26+ messages in thread
From: Sander @ 2013-02-15  5:49 UTC (permalink / raw)
  To: Fredrik Tolf; +Cc: Martin Steigerwald, linux-btrfs

Fredrik Tolf wrote (ao):
> How do I remove the conception of the missing device so that I can
> replace it? Should I simply add the replacement first, and only
> after that remove the missing device?
> 
> If the latter, how can I "scratch" the previous btrfs metadata from
> this "replacement" drive so that it doesn't try to autoreinsert it
> into the filesystem when it is detected? I assume it won't be enough
> be just zeroing the first few sectors of the drive, right?

You could use wipefs.
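
A dry-run sketch of that suggestion (the device name is the one from the
thread, and the `run` wrapper only echoes; verify the target and drop the
wrapper before wiping anything for real):

```shell
# Dry-run of wiping filesystem signatures with wipefs (util-linux).
run() { echo "+ $*"; }

run wipefs /dev/sdi1          # first: list detected signatures, read-only
run wipefs --all /dev/sdi1    # then: erase them so auto-detection stops
```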

	Sander

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Rebalancing RAID1
  2013-02-15  3:44   ` Fredrik Tolf
  2013-02-15  5:49     ` Sander
@ 2013-02-15  9:05     ` Martin Steigerwald
  2013-02-15 21:56       ` Fredrik Tolf
  1 sibling, 1 reply; 26+ messages in thread
From: Martin Steigerwald @ 2013-02-15  9:05 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Fredrik Tolf

Am Freitag, 15. Februar 2013 schrieb Fredrik Tolf:
> On Thu, 14 Feb 2013, Martin Steigerwald wrote:
[…]
> > I´d restart the machine, see that BTRFS is using both devices again and
> > then try the balance again.
> 
> I mentioned it in another mail, but I'd very much prefer not to do that.
> I'd like to try and solve this as I normally should when a drive fails.

Well, if Hugo's solution of unmounting the FS and running btrfs dev scan 
does not work, then my suggestion to reboot makes sense.

I have nothing more to add to my analysis.

If any BTRFS developer or expert knows another solution that works during 
runtime of the system, feel free :)

Either way, I think a kernel bug is involved here. And I think I remember 
having seen something like this during a balance attempt myself already, 
but it was just a test BTRFS and I was not sure of it.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Rebalancing RAID1
  2013-02-15  9:05     ` Martin Steigerwald
@ 2013-02-15 21:56       ` Fredrik Tolf
  2013-02-18 15:29         ` Stefan Behrens
  0 siblings, 1 reply; 26+ messages in thread
From: Fredrik Tolf @ 2013-02-15 21:56 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: linux-btrfs

On Fri, 15 Feb 2013, Martin Steigerwald wrote:
> Either way, I think a kernel bug is involved here.

Well, *some* kernel bug is certainly involved. :)

I did wipe the filesystem off the device and reinserted it as a new device 
into the filesystem. After that, "btrfs fi show" gave me the following:

> $ sudo ./btrfs fi show
> Label: none  uuid: 40d346bb-2c77-4a78-8803-1e441bf0aff7
>         Total devices 3 FS bytes used 2.66TB
>         devid    3 size 2.73TB used 0.00 path /dev/sdi1
>         devid    2 size 2.73TB used 2.67TB path /dev/sde1
>         *** Some devices missing

I then proceeded to try to remove the missing devices with "btrfs dev del 
missing /mnt", but it made no difference whatever, with the kernel saying 
the following:

Feb 15 07:12:29 nerv kernel: [262110.799823] btrfs: no missing devices found to remove

This seems odd, seeing as how "btrfs fi show" says there are 
missing devices while the kernel contradicts that.

Either way, I tried to start a scrub on the filesystem, too, seeing if 
that would make a difference, but that oopsed the kernel. :)

The oops cut can be found here: <http://www.dolda2000.com/~fredrik/tmp/btrfs-oops>

So with that, I'm certainly going to reboot the machine. :)

--

Fredrik Tolf

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Rebalancing RAID1
  2013-02-15 21:56       ` Fredrik Tolf
@ 2013-02-18 15:29         ` Stefan Behrens
  2013-02-23  0:36           ` Fredrik Tolf
  0 siblings, 1 reply; 26+ messages in thread
From: Stefan Behrens @ 2013-02-18 15:29 UTC (permalink / raw)
  To: Fredrik Tolf; +Cc: Martin Steigerwald, linux-btrfs

On Fri, 15 Feb 2013 22:56:19 +0100 (CET), Fredrik Tolf wrote:
> The oops cut can be found here:
> <http://www.dolda2000.com/~fredrik/tmp/btrfs-oops>

This scrub issue is fixed since Linux 3.8-rc1 with commit
4ded4f6 Btrfs: fix BUG() in scrub when first superblock reading gives EIO


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Rebalancing RAID1
  2013-02-18 15:29         ` Stefan Behrens
@ 2013-02-23  0:36           ` Fredrik Tolf
  0 siblings, 0 replies; 26+ messages in thread
From: Fredrik Tolf @ 2013-02-23  0:36 UTC (permalink / raw)
  To: Stefan Behrens; +Cc: Martin Steigerwald, linux-btrfs

On Mon, 18 Feb 2013, Stefan Behrens wrote:
> On Fri, 15 Feb 2013 22:56:19 +0100 (CET), Fredrik Tolf wrote:
>> The oops cut can be found here:
>> <http://www.dolda2000.com/~fredrik/tmp/btrfs-oops>
>
> This scrub issue is fixed since Linux 3.8-rc1 with commit
> 4ded4f6 Btrfs: fix BUG() in scrub when first superblock reading gives EIO

I see, thanks!

Rebooting the system did get me running again, allowing me to remove the 
missing device from filesystem. However, I encountered a couple of 
somewhat strange happenings as I did that. I don't know if they're 
considered bugs or not, but I thought I had best report them.

To begin with, the act of removing the missing device from the filesystem 
itself caused the resynchronization to the "new" device to happen in 
blocking mode, so the "btrfs device delete missing" operation took about a 
day to finish. My expectation was that the device removal would be a fast 
operation and that I would then have to scrub the filesystem or something 
in order to resynchronize, but I can see how this might be intended 
behavior.

However, what's weirder is that while the resynchronization was underway, 
I couldn't mount subvolumes on other mountpoints. The mount commands 
blocked (disk-slept) until the entire synchronization was done, and I 
don't think this was intended behavior, because I had the kernel saying 
the following while it happened:

Feb 16 06:01:27 nerv kernel: [ 3482.512106] INFO: task mount:3525 blocked for more than 120 seconds.
Feb 16 06:01:28 nerv kernel: [ 3482.518484] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 16 06:01:28 nerv kernel: [ 3482.526324] mount           D ffff88003e220e40     0  3525   3524 0x00000000
Feb 16 06:01:28 nerv kernel: [ 3482.533587]  ffff88003e220e40 0000000000000082 ffffffffa0067470 ffff88003e2300c0
Feb 16 06:01:28 nerv kernel: [ 3482.541088]  0000000000013b40 ffff88001126dfd8 0000000000013b40 ffff88001126dfd8
Feb 16 06:01:28 nerv kernel: [ 3482.548584]  0000000000013b40 ffff88003e220e40 0000000000013b40 ffff88001126c010
Feb 16 06:01:28 nerv kernel: [ 3482.556280] Call Trace:
Feb 16 06:01:28 nerv kernel: [ 3482.558776]  [<ffffffff81396132>] ? __mutex_lock_common+0x10d/0x175
Feb 16 06:01:28 nerv kernel: [ 3482.565078]  [<ffffffff81396260>] ? mutex_lock+0x1a/0x2c
Feb 16 06:01:28 nerv kernel: [ 3482.570661]  [<ffffffffa05a38c2>] ? btrfs_scan_one_device+0x40/0x133 [btrfs]
Feb 16 06:01:28 nerv kernel: [ 3482.577752]  [<ffffffffa0564e8b>] ? btrfs_mount+0x1c4/0x4d8 [btrfs]
Feb 16 06:01:28 nerv kernel: [ 3482.584080]  [<ffffffff810e56cb>] ? pcpu_next_pop+0x37/0x43
Feb 16 06:01:28 nerv kernel: [ 3482.589709]  [<ffffffff810e52c0>] ? cpumask_next+0x18/0x1a
Feb 16 06:01:28 nerv kernel: [ 3482.595226]  [<ffffffff811012aa>] ? alloc_pages_current+0xbb/0xd8
Feb 16 06:01:28 nerv kernel: [ 3482.601345]  [<ffffffff81113778>] ? mount_fs+0x6c/0x149
Feb 16 06:01:28 nerv kernel: [ 3482.606595]  [<ffffffff811291f7>] ? vfs_kern_mount+0x67/0xdd
Feb 16 06:01:28 nerv kernel: [ 3482.612292]  [<ffffffffa056516b>] ? btrfs_mount+0x4a4/0x4d8 [btrfs]
Feb 16 06:01:28 nerv kernel: [ 3482.618673]  [<ffffffff810e52c0>] ? cpumask_next+0x18/0x1a
Feb 16 06:01:28 nerv kernel: [ 3482.624178]  [<ffffffff811012aa>] ? alloc_pages_current+0xbb/0xd8
Feb 16 06:01:28 nerv kernel: [ 3482.630347]  [<ffffffff81113778>] ? mount_fs+0x6c/0x149
Feb 16 06:01:28 nerv kernel: [ 3482.635580]  [<ffffffff811291f7>] ? vfs_kern_mount+0x67/0xdd
Feb 16 06:01:28 nerv kernel: [ 3482.641258]  [<ffffffff811292e0>] ? do_kern_mount+0x49/0xd6
Feb 16 06:01:29 nerv kernel: [ 3482.646855]  [<ffffffff81129a98>] ? do_mount+0x72b/0x791
Feb 16 06:01:29 nerv kernel: [ 3482.652186]  [<ffffffff81129b86>] ? sys_mount+0x88/0xc3
Feb 16 06:01:29 nerv kernel: [ 3482.657464]  [<ffffffff8139d229>] ? system_call_fastpath+0x16/0x1b

Furthermore, it struck me that the consequences of having to mount a 
filesystem with missing devices with -o degraded can be a bit strange. I 
realize what the intention of the behavior is, of course, but I think it 
might cause quite some difficulty when trying to mount a degraded btrfs 
filesystem as root on a system that you don't have physical access to, 
like a hosted server, because it might be hard to manipulate the boot 
process so as to pass that mount flag to the initrd. Note that this is not 
a problem with md-raid; it will simply assemble its arrays in degraded 
mode automatically, without intervention. I'm not necessarily saying 
that's better, but I thought I should bring up the point.
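
For what it's worth, a hypothetical boot entry for such a hosted server
might look like the following; the device path and kernel version are
assumptions, and rootflags passes mount options to the initial root mount:

```
# Hypothetical GRUB boot entry for a degraded btrfs RAID1 root:
linux /vmlinuz-3.7.1 root=/dev/sde1 rootflags=degraded ro
```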

--

Fredrik Tolf

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2013-02-23  0:36 UTC | newest]

Thread overview: 26+ messages
2013-02-12 23:01 Rebalancing RAID1 Fredrik Tolf
2013-02-13  0:58 ` Chris Murphy
2013-02-13  6:18   ` Fredrik Tolf
2013-02-13  8:10     ` Chris Murphy
2013-02-14  6:42       ` Fredrik Tolf
2013-02-14  7:27         ` Chris Murphy
2013-02-14  7:58           ` Fredrik Tolf
2013-02-14  8:41             ` Chris Murphy
2013-02-14  8:59               ` Hugo Mills
2013-02-14 18:05                 ` Chris Murphy
2013-02-14 20:56                   ` Hugo Mills
2013-02-14 22:11                     ` Chris Murphy
2013-02-15  3:50                   ` Fredrik Tolf
2013-02-15  3:55                     ` Chris Murphy
2013-02-15  3:56                       ` Fredrik Tolf
2013-02-15  4:03                         ` Chris Murphy
2013-02-14  8:01         ` Chris Murphy
2013-02-15  4:06           ` Fredrik Tolf
2013-02-14 14:44 ` Martin Steigerwald
2013-02-14 18:45   ` Chris Murphy
2013-02-15  3:44   ` Fredrik Tolf
2013-02-15  5:49     ` Sander
2013-02-15  9:05     ` Martin Steigerwald
2013-02-15 21:56       ` Fredrik Tolf
2013-02-18 15:29         ` Stefan Behrens
2013-02-23  0:36           ` Fredrik Tolf
