* BTRFS RAID1 behavior after one drive temporal disconection
@ 2015-10-05 20:26 Pavel Pisa
  2015-10-08  8:28 ` Pavel Pisa
  0 siblings, 1 reply; 8+ messages in thread
From: Pavel Pisa @ 2015-10-05 20:26 UTC (permalink / raw)
  To: linux-btrfs

Hello everybody,

The SATA connection or firmware of one of my drives (ST3000VN000-1H4167)
failed. The disk did not respond to hdparm or smartctl, and neither a
software reset nor a SATA controller rescan changed the situation.

I was able to restore communication by brute force: unplugging and
reconnecting the power cable connector. After that I was able to
rescan the device and its partitions.

The start of the problem very probably coincides in time with the
following SMART report:

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 71 09 a9 00 80 40  Device Fault; Error: ABRT at LBA = 0x008000a9 = 8388777

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 18 00 09 01 46 00   4d+15:27:59.335  WRITE FPDMA QUEUED
  61 00 80 80 08 01 46 00   4d+15:27:59.335  WRITE FPDMA QUEUED
  61 00 80 00 08 01 46 00   4d+15:27:59.335  WRITE FPDMA QUEUED
  61 00 80 80 07 01 46 00   4d+15:27:59.335  WRITE FPDMA QUEUED
  61 00 68 18 07 01 46 00   4d+15:27:59.335  WRITE FPDMA QUEUED
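
The LBA printed in the error line can be recomputed from the register
dump. The sketch below assumes the standard 28-bit ATA layout (SN, CL and
CH hold LBA bits 0-23, the low nibble of DH holds bits 24-27); the function
name is made up for illustration.

```python
def lba28_from_registers(sn: int, cl: int, ch: int, dh: int) -> int:
    """Combine ATA task-file registers into a 28-bit LBA (assumed layout)."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Register values from the error line above: SN=a9 CL=00 CH=80 DH=40
lba = lba28_from_registers(0xA9, 0x00, 0x80, 0x40)
print(hex(lba), lba)  # 0x8000a9 8388777
```

The result matches the "LBA = 0x008000a9 = 8388777" reported by smartctl.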

The disk seems to be undamaged. smartctl -t long finished without
any error logged or reported. A backup ext4 partition can be mounted
and is writable.

BTRFS has recognized the reappearance of its partition (even though it
changed from sdb5 to sde5 when the disk was "hotplugged" again).
But it seems that the RAID1 components are not in sync, and BTRFS
continues to report

BTRFS: lost page write due to I/O error on /dev/sde5
BTRFS: bdev /dev/sde5 errs: wr 11021805, rd 8526080, flush 29099, corrupt 0, gen 

I have tried to find the best way to resync the RAID1 BTRFS partitions.
The problem is that the filesystem is the root filesystem of the system,
so a reboot to some rescue media is required to run btrfsck --repair,
which is intended for unmounted devices.

What is the behavior of BTRFS in this situation?
Is BTRFS able to use data from the out-of-date partition in cases
where the data in the respective files have not been modified?
The main reason for the question is whether such (stable) data can be
backed up by the out-of-sync partition if some random block wears
out on the other device. Or is this situation equivalent to running
with only one disk?

Is there some parameter or command (scrub, balance) which brings the
devices back in sync without an unmount or reboot?

I believe that attaching one more drive and running "btrfs replace"
would solve the described situation. But is there some equivalent that
runs the operation "in place"?

Thanks for reply,

                Pavel Pisa
    e-mail:     pisa@cmp.felk.cvut.cz
    www:        http://cmp.felk.cvut.cz/~pisa
    university: http://dce.fel.cvut.cz/
    company:    http://www.pikron.com/


* Re: BTRFS RAID1 behavior after one drive temporal disconection
  2015-10-05 20:26 BTRFS RAID1 behavior after one drive temporal disconection Pavel Pisa
@ 2015-10-08  8:28 ` Pavel Pisa
  2015-10-08 11:47   ` Austin S Hemmelgarn
  0 siblings, 1 reply; 8+ messages in thread
From: Pavel Pisa @ 2015-10-08  8:28 UTC (permalink / raw)
  To: linux-btrfs

Hello everybody,

On Monday 05 of October 2015 22:26:46 Pavel Pisa wrote:
> Hello everybody,
...
> BTRFS has recognized appearance of its partition (even that hanged
> from sdb5 to sde5 when disk "hotplugged" again).
> But it seems that RAID1 components are not in sync and BTRFS
> continues to report
>
> BTRFS: lost page write due to I/O error on /dev/sde5
> BTRFS: bdev /dev/sde5 errs: wr 11021805, rd 8526080, flush 29099, corrupt
> 0, gen
>
> I have tried to find the best way to resync RAID1 BTRFS partitions.
> But problem is that filesystem is the root one of the system.
> So reboot to some rescue media is required to run btrfsck --repair
> which is intended for unmounted devices.
>
> What is behavior of BTRFS in this situation?
> Is BTRFS able to use data from not up to date partition in these
> cases where data in respective files have not been modified?
> The main reason for question is if such (stable) data can be backuped
> by out of sync partition in the case of some random block is wear
> out on another device. Or is this situation equivalent to running
> with only one disk?
>
> Are there some parameters/solution to run some command
> (scrub balance) which makes devices to be in the sync again
> without unmount or reboot?
>
> I believe than attaching one more drive and running "btrfs replace"
> would solve described situation. But is there some equivalent to
> run operation "inplace".

It seems that the SATA controller is not able to activate a link which
was not connected at BIOS POST time. This means that I cannot add a new drive
without a reboot.

Before the reboot, the server was flooded with messages

BTRFS: bdev /dev/sde5 errs: wr 11715459, rd 8526080, flush 29099, corrupt 0, gen 0
BTRFS: lost page write due to I/O error on /dev/sde5
BTRFS: bdev /dev/sde5 errs: wr 11715460, rd 8526080, flush 29099, corrupt 0, gen 0
BTRFS: lost page write due to I/O error on /dev/sde5

which changed to the following messages after the reboot

Btrfs loaded
BTRFS: device label riki-pool devid 1 transid 282383 /dev/sda3
BTRFS: device label riki-pool devid 2 transid 249562 /dev/sdb5
BTRFS info (device sda3): disk space caching is enabled
BTRFS (device sda3): parent transid verify failed on 44623216640 wanted 263476 found 212766
BTRFS (device sda3): parent transid verify failed on 45201899520 wanted 282383 found 246891
BTRFS (device sda3): parent transid verify failed on 45202571264 wanted 282383 found 246890
BTRFS (device sda3): parent transid verify failed on 45201965056 wanted 282383 found 246889
BTRFS (device sda3): parent transid verify failed on 45202505728 wanted 282383 found 246890
BTRFS (device sda3): parent transid verify failed on 45202866176 wanted 282383 found 246890
BTRFS (device sda3): parent transid verify failed on 45207126016 wanted 282383 found 246894
BTRFS (device sda3): parent transid verify failed on 45202522112 wanted 282383 found 246890
BTRFS: bdev /dev/disk/by-uuid/1627e557-d063-40b6-9450-3694dd1fd1ba errs: wr 11723314, rd 8526080, flush 2
BTRFS (device sda3): parent transid verify failed on 45206945792 wanted 282383 found 67960
BTRFS (device sda3): parent transid verify failed on 45204471808 wanted 282382 found 67960

which looks really frightening to me. The temporarily disconnected drive has an old
transid at the start (OK). But what do the rest of the lines mean? If they mean that
files with an older transaction ID are used from the temporarily disconnected drive
(now /dev/sdb5) and newer versions from /dev/sda3 are ignored and reported as invalid,
then this means severe data loss, and possibly a mismatch, because all transactions
after the disk disconnect are lost (i.e. the FS root has been taken from the
misbehaving drive at its old version).
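
One way to cross-check the log above: each "parent transid verify failed"
line names a metadata block whose stored generation ("found") differs from
the one its parent expects ("wanted"). The sketch below tests whether every
"found" value is no newer than the transid of devid 2 (249562), which would
be consistent with stale copies on the reconnected drive being detected and
rejected rather than used. That reading is an assumption, not something
confirmed at this point in the thread, and the helper name is hypothetical.

```python
import re

TRANSID_RE = re.compile(
    r"parent transid verify failed on (?P<bytenr>\d+) "
    r"wanted (?P<wanted>\d+) found (?P<found>\d+)"
)

# Generations printed at device scan time in the log above.
CURRENT_TRANSID = 282383  # devid 1, /dev/sda3
STALE_TRANSID = 249562    # devid 2, /dev/sdb5

# A few lines copied from the boot log above.
log = """\
BTRFS (device sda3): parent transid verify failed on 44623216640 wanted 263476 found 212766
BTRFS (device sda3): parent transid verify failed on 45201899520 wanted 282383 found 246891
BTRFS (device sda3): parent transid verify failed on 45207126016 wanted 282383 found 246894
BTRFS (device sda3): parent transid verify failed on 45206945792 wanted 282383 found 67960
"""


def parse_transid_failures(text):
    """Return one dict of integers per 'parent transid verify failed' line."""
    return [
        {key: int(value) for key, value in m.groupdict().items()}
        for m in TRANSID_RE.finditer(text)
    ]


blocks = parse_transid_failures(log)
found_is_stale = all(b["found"] <= STALE_TRANSID for b in blocks)
wanted_is_valid = all(b["wanted"] <= CURRENT_TRANSID for b in blocks)
print(len(blocks), found_is_stale, wanted_is_valid)  # 4 True True
```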

BTRFS does not even fall back to read-only/degraded mode after the system restart.

On the other hand, judging from the logs (all stored on the possibly damaged root FS),
there are no missing messages from the days when the disks were out of sync,
so it looks like all data are OK. So should I expect that BTRFS handled the problems
well and all data are consistent?

I am going to use "btrfs replace" because there has not been any reply to my in-place
correction question. But I expect that a clarification of whether and how RAID1 can be
resynced after one drive temporarily disappears is really important to many BTRFS users.

I am now at a place where all my connection to the Internet goes through the endangered
server/router/container server, so I hope not to lose the connection.

Thanks for BTRFS work,

                        Pavel






* Re: BTRFS RAID1 behavior after one drive temporal disconection
  2015-10-08  8:28 ` Pavel Pisa
@ 2015-10-08 11:47   ` Austin S Hemmelgarn
  2015-10-08 16:40     ` Pavel Pisa
  2015-10-08 21:13     ` Hugo Mills
  0 siblings, 2 replies; 8+ messages in thread
From: Austin S Hemmelgarn @ 2015-10-08 11:47 UTC (permalink / raw)
  To: Pavel Pisa, linux-btrfs


On 2015-10-08 04:28, Pavel Pisa wrote:
> Hello everybody,
>
> On Monday 05 of October 2015 22:26:46 Pavel Pisa wrote:
>> Hello everybody,
> ...
>> BTRFS has recognized appearance of its partition (even that hanged
>> from sdb5 to sde5 when disk "hotplugged" again).
>> But it seems that RAID1 components are not in sync and BTRFS
>> continues to report
>>
>> BTRFS: lost page write due to I/O error on /dev/sde5
>> BTRFS: bdev /dev/sde5 errs: wr 11021805, rd 8526080, flush 29099, corrupt
>> 0, gen
>>
>> I have tried to find the best way to resync RAID1 BTRFS partitions.
>> But problem is that filesystem is the root one of the system.
>> So reboot to some rescue media is required to run btrfsck --repair
>> which is intended for unmounted devices.
>>
>> What is behavior of BTRFS in this situation?
>> Is BTRFS able to use data from not up to date partition in these
>> cases where data in respective files have not been modified?
>> The main reason for question is if such (stable) data can be backuped
>> by out of sync partition in the case of some random block is wear
>> out on another device. Or is this situation equivalent to running
>> with only one disk?
>>
>> Are there some parameters/solution to run some command
>> (scrub balance) which makes devices to be in the sync again
>> without unmount or reboot?
>>
>> I believe than attaching one more drive and running "btrfs replace"
>> would solve described situation. But is there some equivalent to
>> run operation "inplace".
>
> It seems that SATA controller is not able to activate link which
> has not been connected at BIOS POST time. This means that I cannot add new drive
> without reboot.
Check your BIOS options, there should be some option to set SATA ports 
as either 'Hot-Plug' or 'External', which should allow you to hot-plug 
drives without needing a reboot (unless it's a Dell system, they have 
never properly implemented the SATA standard on their desktops).
>
> Before reboot, the server bleeds with messages
>
> BTRFS: bdev /dev/sde5 errs: wr 11715459, rd 8526080, flush 29099, corrupt 0, gen 0
> BTRFS: lost page write due to I/O error on /dev/sde5
> BTRFS: bdev /dev/sde5 errs: wr 11715460, rd 8526080, flush 29099, corrupt 0, gen 0
> BTRFS: lost page write due to I/O error on /dev/sde5
Even aside from the below mentioned issues, if your disk is showing that 
many errors, you should probably run a SMART self-test routine on it to 
determine whether this is just a transient issue or an indication of an 
impending disk failure.  The commands I'd suggest are:
smartctl -t short /dev/sde
That will tell you how long to wait for the test to complete; after
waiting that long, run:
smartctl -H /dev/sde
If that says the health check failed, replace the disk as soon as 
possible, and don't use it for storing any data you can't afford to lose.
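
To see which of the per-device counters in the quoted lines is actually
still growing, the lines can be parsed and compared. This is only a sketch:
the regular expression is written against the message format shown in this
thread, and the function name is hypothetical.

```python
import re

# Matches the per-device counter lines as they appear in this thread.
ERRS_RE = re.compile(
    r"BTRFS: bdev (?P<dev>\S+) errs: wr (?P<wr>\d+), rd (?P<rd>\d+), "
    r"flush (?P<flush>\d+), corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)


def parse_errs(line):
    """Return (device, counters) for a counter line, or None otherwise."""
    m = ERRS_RE.search(line)
    if m is None:
        return None
    counters = {k: int(v) for k, v in m.groupdict().items() if k != "dev"}
    return m.group("dev"), counters


lines = [
    "BTRFS: bdev /dev/sde5 errs: wr 11715459, rd 8526080, flush 29099, corrupt 0, gen 0",
    "BTRFS: lost page write due to I/O error on /dev/sde5",
    "BTRFS: bdev /dev/sde5 errs: wr 11715460, rd 8526080, flush 29099, corrupt 0, gen 0",
]

samples = [parsed for parsed in map(parse_errs, lines) if parsed is not None]
first, last = samples[0][1], samples[-1][1]
growing = [name for name in first if last[name] > first[name]]
print(growing)  # ['wr']
```

In these two samples only the write counter increases; the read and flush
counters stay where they were.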
>
> that changed to next mesages after reboot
>
> Btrfs loaded
> BTRFS: device label riki-pool devid 1 transid 282383 /dev/sda3
> BTRFS: device label riki-pool devid 2 transid 249562 /dev/sdb5
> BTRFS info (device sda3): disk space caching is enabled
> BTRFS (device sda3): parent transid verify failed on 44623216640 wanted 263476 found 212766
> BTRFS (device sda3): parent transid verify failed on 45201899520 wanted 282383 found 246891
> BTRFS (device sda3): parent transid verify failed on 45202571264 wanted 282383 found 246890
> BTRFS (device sda3): parent transid verify failed on 45201965056 wanted 282383 found 246889
> BTRFS (device sda3): parent transid verify failed on 45202505728 wanted 282383 found 246890
> BTRFS (device sda3): parent transid verify failed on 45202866176 wanted 282383 found 246890
> BTRFS (device sda3): parent transid verify failed on 45207126016 wanted 282383 found 246894
> BTRFS (device sda3): parent transid verify failed on 45202522112 wanted 282383 found 246890
> BTRFS: bdev /dev/disk/by-uuid/1627e557-d063-40b6-9450-3694dd1fd1ba errs: wr 11723314, rd 8526080, flush 2
> BTRFS (device sda3): parent transid verify failed on 45206945792 wanted 282383 found 67960
> BTRFS (device sda3): parent transid verify failed on 45204471808 wanted 282382 found 67960
>
> which looks really frightening to me. Temporary disconnected drive has old transid
> at start (OK). But what means the rest of the lines. If it means that files with
> older transaction ID are used from temporary disconnected drive (now /dev/sdb5)
> and newer versions from /dev/sda3 are ignored and reported as invalid then this means
> severe data lost and may it be mitchmatch because all transactions after disk disconnect
> are lost (i.e. FS root has been taken from misbehaving drive at old version).
>
> BTRFS does not fall even to red-only/degraded mode after system restart.
This actually surprises me.
>
> On the other hand, from logs (all stored on the possibly damaged root FS) it seems
> that there there are not missing messages from days when discs has been out of sync,
> so it looks like all data are OK. So should I expect that BTRFS managed problems
> well and all data are consistent?
I would be very careful in that situation, you may still have issues, at 
the very least, make a backup of the system as soon as possible.
>
> I go to use "btrfs replace" because there has not been any reply to my inplace correction
> question. But I expect that clarification if possible/how to resync RAID1 after one
> drive temporal disappear is really important to many of BTRFS users.
As of right now, there is no way that I know of to safely re-sync a 
drive that's been disconnected for a while.  The best bet is probably to 
use replace, but for that to work reliably, you would need to tell it to 
ignore the now stale drive when trying to read each chunk.

It is theoretically possible to wipe the FS signature on the out-of sync 
drive, run a device scan, then run 'replace missing' pointing at the now 
'blank' device, although going that route is really risky.



* Re: BTRFS RAID1 behavior after one drive temporal disconection
  2015-10-08 11:47   ` Austin S Hemmelgarn
@ 2015-10-08 16:40     ` Pavel Pisa
  2015-10-08 21:13     ` Hugo Mills
  1 sibling, 0 replies; 8+ messages in thread
From: Pavel Pisa @ 2015-10-08 16:40 UTC (permalink / raw)
  To: Austin S Hemmelgarn, linux-btrfs

Hello Austin,

thanks for the reply.

On Thursday 08 of October 2015 13:47:33 Austin S Hemmelgarn wrote:
> On 2015-10-08 04:28, Pavel Pisa wrote:
> > Hello everybody,
...
> > It seems that SATA controller is not able to activate link which
> > has not been connected at BIOS POST time. This means that I cannot add
> > new drive without reboot.
>
> Check your BIOS options, there should be some option to set SATA ports
> as either 'Hot-Plug' or 'External', which should allow you to hot-plug
> drives without needing a reboot (unless it's a Dell system, they have
> never properly implemented the SATA standard on their desktops).
>
> > Before reboot, the server bleeds with messages
> >
> > BTRFS: bdev /dev/sde5 errs: wr 11715459, rd 8526080, flush 29099, corrupt
> > 0, gen 0 BTRFS: lost page write due to I/O error on /dev/sde5
> > BTRFS: bdev /dev/sde5 errs: wr 11715460, rd 8526080, flush 29099, corrupt
> > 0, gen 0 BTRFS: lost page write due to I/O error on /dev/sde5
>
> Even aside from the below mentioned issues, if your disk is showing that
> many errors, you should probably run a SMART self-test routine on it to
> determine whether this is just a transient issue or an indication of an
> impending disk failure.  The commands I'd suggest are:
> smartctl -t short /dev/sde

Yes, I have even run the long test, as reported in the first message.
No problem was found. The cause was a sudden stop of the disk's SATA
communication after several months of uninterrupted service. Once the
connection was restored by disconnecting and reconnecting the HDD power
cable, the disk was OK: no SMART problems, and no problem reading from
or writing to other filesystems.

So it seems to be an internal BTRFS safeguard which prevents writes to
that portion of the FS (the whole block device?) on the temporarily
disconnected drive where the transid does not match. The situation changed
after the reboot (the only way to get a new mount), when BTRFS somehow
resumed operation.

> That will tell you some time to wait for the test to complete, after
> waiting  that long, run:
> smartctl -H /dev/sde
> If that says the health check failed, replace the disk as soon as
> possible, and don't use it for storing any data you can't afford to lose.
>
> > that changed to next mesages after reboot
> >
> > Btrfs loaded
> > BTRFS: device label riki-pool devid 1 transid 282383 /dev/sda3
> > BTRFS: device label riki-pool devid 2 transid 249562 /dev/sdb5
> > BTRFS info (device sda3): disk space caching is enabled
> > BTRFS (device sda3): parent transid verify failed on 44623216640 wanted
> > 263476 found 212766 BTRFS (device sda3): parent transid verify failed on
> > 45201899520 wanted 282383 found 246891 BTRFS (device sda3): parent
> > transid verify failed on 45202571264 wanted 282383 found 246890 BTRFS
> > (device sda3): parent transid verify failed on 45201965056 wanted 282383
> > found 246889 BTRFS (device sda3): parent transid verify failed on
> > 45202505728 wanted 282383 found 246890 BTRFS (device sda3): parent
> > transid verify failed on 45202866176 wanted 282383 found 246890 BTRFS
> > (device sda3): parent transid verify failed on 45207126016 wanted 282383
> > found 246894 BTRFS (device sda3): parent transid verify failed on
> > 45202522112 wanted 282383 found 246890 BTRFS: bdev
> > /dev/disk/by-uuid/1627e557-d063-40b6-9450-3694dd1fd1ba errs: wr 11723314,
> > rd 8526080, flush 2 BTRFS (device sda3): parent transid verify failed on
> > 45206945792 wanted 282383 found 67960 BTRFS (device sda3): parent transid
> > verify failed on 45204471808 wanted 282382 found 67960
> >
> > which looks really frightening to me. Temporary disconnected drive has
> > old transid at start (OK). But what means the rest of the lines. If it
> > means that files with older transaction ID are used from temporary
> > disconnected drive (now /dev/sdb5) and newer versions from /dev/sda3 are
> > ignored and reported as invalid then this means severe data lost and may
> > it be mitchmatch because all transactions after disk disconnect are lost
> > (i.e. FS root has been taken from misbehaving drive at old version).
> >
> > BTRFS does not fall even to red-only/degraded mode after system restart.
>
> This actually surprises me.

Both drives were present the whole time, except that for about one week
one drive (in fact the corresponding SATA controller) reported a permanent
error for each access.

> > On the other hand, from logs (all stored on the possibly damaged root FS)
> > it seems that there there are not missing messages from days when discs
> > has been out of sync, so it looks like all data are OK. So should I
> > expect that BTRFS managed problems well and all data are consistent?
>
> I would be very careful in that situation, you may still have issues, at
> the very least, make a backup of the system as soon as possible.

I made a backup to an external drive before the attempts to reconnect
the failed drive.

I have run btrfs replace from the temporarily failed HDD to a newly bought
HDD. I had planned to replace the old drive anyway (the one which did not
experience problems but reports some relocated sectors). So I then replaced
that surviving drive with the drive which had been disconnected, in the
hope that it was a single-event problem. I keep all data/system/metadata
in RAID1 anyway, so I hope to be able to keep the data healthy. In fact,
the whole event in the end proves the quality of BTRFS.

I have untarred all backups and run a complete recursive diff -r -u
of the root fs and the containers fs (via secondary mounts, to eliminate
/proc etc.) against the backups.

It seems that all is correct. The only differences seen are in the files
and logs which should differ since the start of the problem investigation
and rescue. The same holds for missing files: only the expected ones were
missing (media files excluded from the backup, new files, and broken
symlinks, which are broken in both the backup and the live data).

It seems that all git repo changes and other activity which happened
on the system between the disk disconnection and the backup are there.
Even git repo changes made after the reconnection, before the backups,
compare correctly.

So it seems that, even though some messages look really strange,
BTRFS selected the right copy and generation during all phases
of operation and maintenance.

> > I go to use "btrfs replace" because there has not been any reply to my
> > inplace correction question. But I expect that clarification if
> > possible/how to resync RAID1 after one drive temporal disappear is really
> > important to many of BTRFS users.
>
> As of right now, there is no way that I know of to safely re-sync a
> drive that's been disconnected for a while.  The best bet is probably to
> use replace, but for that to work reliably, you would need to tell it to
> ignore the now stale drive when trying to read each chunk.
>
> It is theoretically possible to wipe the FS signature on the out-of sync
> drive, run a device scan, then run 'replace missing' pointing at the now
> 'blank' device, although going that route is really risky.

Yes, I have been considering this, but if the other drive
has real data errors it would lead to data loss (and in my case the
drive with the current data is the worse one according to SMART).

So it seems that the best recommendation is to run replace from the
problematic partition to another drive, or even to another partition on
the same drive, and then run replace again to copy the already synchronized
data back to the original place. There would be at least two physical
drives, each with at least one data copy, in play for the whole operation,
so even after a subsequent failure to read some data there is a high
chance of still having one healthy copy.
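
The two-step replace described above could be planned roughly as follows.
This is a dry-run sketch only: it builds and prints the commands and runs
nothing. The device names are placeholders (/dev/sdc1 in particular is
hypothetical), the function name is made up, and the use of -r (read from
the source device only if no other good mirror exists) and -f (overwrite a
target that still looks like a btrfs device) is an assumption about what
this situation needs, to be checked against the btrfs-replace manual.

```python
import shlex


def replace_round_trip(stale_dev, spare_dev, mountpoint):
    """Build the argv lists for replace-away and replace-back (not executed)."""
    return [
        # Step 1: move the stale device's role to the spare, preferring
        # the up-to-date mirror as the data source (-r).
        ["btrfs", "replace", "start", "-r", stale_dev, spare_dev, mountpoint],
        ["btrfs", "replace", "status", mountpoint],
        # Step 2: copy the now synchronized data back to the original place.
        # -f is assumed to be needed if an old btrfs signature is still there.
        ["btrfs", "replace", "start", "-f", spare_dev, stale_dev, mountpoint],
        ["btrfs", "replace", "status", mountpoint],
    ]


plan = replace_round_trip("/dev/sdb5", "/dev/sdc1", "/")
for argv in plan:
    print(shlex.join(argv))
```

Printing the plan first makes it easy to double-check which device is the
source and which is the target before anything is overwritten.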

Anyway, a confirmation that the correct behavior in my case was a result
of the design and not only of my luck would be nice. An option to run
in-place synchronization instead of shuffling all the data would be yet
another nice feature to have.

Thanks,

            Pavel





* Re: BTRFS RAID1 behavior after one drive temporal disconection
  2015-10-08 11:47   ` Austin S Hemmelgarn
  2015-10-08 16:40     ` Pavel Pisa
@ 2015-10-08 21:13     ` Hugo Mills
  2015-10-08 22:16       ` Pavel Pisa
  1 sibling, 1 reply; 8+ messages in thread
From: Hugo Mills @ 2015-10-08 21:13 UTC (permalink / raw)
  To: Austin S Hemmelgarn; +Cc: Pavel Pisa, linux-btrfs


On Thu, Oct 08, 2015 at 07:47:33AM -0400, Austin S Hemmelgarn wrote:
> On 2015-10-08 04:28, Pavel Pisa wrote:
> >I go to use "btrfs replace" because there has not been any reply to my inplace correction
> >question. But I expect that clarification if possible/how to resync RAID1 after one
> >drive temporal disappear is really important to many of BTRFS users.
> As of right now, there is no way that I know of to safely re-sync a
> drive that's been disconnected for a while.  The best bet is
> probably to use replace, but for that to work reliably, you would
> need to tell it to ignore the now stale drive when trying to read
> each chunk.

   Scrub is officially what you need there. I can confirm that it
works correctly, having used it myself after accidentally unplugging
the wrong drive.
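
   A minimal sketch of what such a scrub run might look like, wrapped in
Python so it can be dry-run first. It assumes the filesystem is mounted
at / and that the btrfs-progs subcommands "device stats" and
"scrub start -B -d" are available; the helper names are hypothetical.

```python
import shlex
import subprocess


def scrub_plan(mountpoint):
    """Commands to show the error counters, scrub in the foreground, show them again."""
    return [
        ["btrfs", "device", "stats", mountpoint],
        # -B: stay in the foreground, -d: print statistics per device
        ["btrfs", "scrub", "start", "-B", "-d", mountpoint],
        ["btrfs", "device", "stats", mountpoint],
    ]


def run_plan(plan, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    return_codes = []
    for argv in plan:
        print(shlex.join(argv))
        if not dry_run:
            # check=False: keep going even if a command reports an error,
            # so the final counters are still shown.
            return_codes.append(subprocess.run(argv, check=False).returncode)
    return return_codes


run_plan(scrub_plan("/"))  # dry run: only prints the three commands
```

   Comparing the counters before and after the scrub shows whether new
write errors are still being added while it runs.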

   Hugo.

(Sorry for the delay, I wrote this earlier, but had trouble sending it)

> It is theoretically possible to wipe the FS signature on the out-of
> sync drive, run a device scan, then run 'replace missing' pointing
> at the now 'blank' device, although going that route is really
> risky.
> 



-- 
Hugo Mills             | Gomez, darling, don't torture yourself.
hugo@... carfax.org.uk | That's my job.
http://carfax.org.uk/  |
PGP: E2AB1DE4          |                                       Morticia Addams


* Re: BTRFS RAID1 behavior after one drive temporal disconection
  2015-10-08 21:13     ` Hugo Mills
@ 2015-10-08 22:16       ` Pavel Pisa
  2015-10-08 22:22         ` Hugo Mills
  0 siblings, 1 reply; 8+ messages in thread
From: Pavel Pisa @ 2015-10-08 22:16 UTC (permalink / raw)
  To: Hugo Mills; +Cc: Austin S Hemmelgarn, linux-btrfs

Hello Hugo,

On Thursday 08 of October 2015 23:13:52 Hugo Mills wrote:
> On Thu, Oct 08, 2015 at 07:47:33AM -0400, Austin S Hemmelgarn wrote:
> > On 2015-10-08 04:28, Pavel Pisa wrote:
> > >I go to use "btrfs replace" because there has not been any reply to my
> > > inplace correction question. But I expect that clarification if
> > > possible/how to resync RAID1 after one drive temporal disappear is
> > > really important to many of BTRFS users.
> >
> > As of right now, there is no way that I know of to safely re-sync a
> > drive that's been disconnected for a while.  The best bet is
> > probably to use replace, but for that to work reliably, you would
> > need to tell it to ignore the now stale drive when trying to read
> > each chunk.
>
>    Scrub is officially what you need there. I can confirm that it
> works correctly, having used it myself after accidentally unplugging
> the wrong drive.
>

Thanks for the reply.

I tried to run scrub after the reconnect, but it counted errors in
its console output and the kernel logged write errors like crazy.
I have to admit I did not wait for it to finish because I did not have
a good feeling about it.
Maybe it was the result of a not fully correct reconnect.
But the other partition, with ext4, had no problems with writes.

But if mount/unmount (in my case requiring a reboot) followed by scrub
had worked, it would have been much simpler than the series of replaces.

I hope I will not need that (at least not soon, or for years), but I
will give scrub a try again.

Maybe the problem is the old version of my btrfs tools on the server:
the Wheezy Btrfs v3.17 backport. The kernel is Linux 4.1.2 #1 SMP PREEMPT.

Thanks,

         Pavel


* Re: BTRFS RAID1 behavior after one drive temporal disconection
  2015-10-08 22:16       ` Pavel Pisa
@ 2015-10-08 22:22         ` Hugo Mills
  2015-10-09 11:13           ` Austin S Hemmelgarn
  0 siblings, 1 reply; 8+ messages in thread
From: Hugo Mills @ 2015-10-08 22:22 UTC (permalink / raw)
  To: Pavel Pisa; +Cc: Austin S Hemmelgarn, linux-btrfs


On Fri, Oct 09, 2015 at 12:16:43AM +0200, Pavel Pisa wrote:
> Hello Hugo,
> 
> On Thursday 08 of October 2015 23:13:52 Hugo Mills wrote:
> > On Thu, Oct 08, 2015 at 07:47:33AM -0400, Austin S Hemmelgarn wrote:
> > > On 2015-10-08 04:28, Pavel Pisa wrote:
> > > >I go to use "btrfs replace" because there has not been any reply to my
> > > > inplace correction question. But I expect that clarification if
> > > > possible/how to resync RAID1 after one drive temporal disappear is
> > > > really important to many of BTRFS users.
> > >
> > > As of right now, there is no way that I know of to safely re-sync a
> > > drive that's been disconnected for a while.  The best bet is
> > > probably to use replace, but for that to work reliably, you would
> > > need to tell it to ignore the now stale drive when trying to read
> > > each chunk.
> >
> >    Scrub is officially what you need there. I can confirm that it
> > works correctly, having used it myself after accidentally unplugging
> > the wrong drive.
> >
> 
> Thanks for the reply.
> 
> I have tried to run scrub after reconnect but it counted errors in
> its console output and write errors has been logged by kernel as crazy.
> I have to admit I have not wait to finish it because I have not good
> feeling from it.

   If the scrub works OK, you will still get lots of scary-looking
errors in the logs, but they'll usually say it's repaired the problem.

   Getting write errors at this point indicates that you have hardware
problems of some kind, and (usually) that device needs to be replaced.
(Or the controller, or the cabling).

> May it be it was result of not fully correct reconnect.
> But other partition worked with ext4 has no problems to write.
> 
> But if mount/unmount (in my case requiring reboot) and then scrub
> worked it would be much simpler than replaces series.
> 
> I hope I would not need that (at least soon/in years) but I
> give try to scrub again.
> 
> May it be problem is my btrfs tools old version on the server --
> Wheezy Btrfs v3.17 backport. Kernel is Linux 4.1.2 #1 SMP PREEMPT.

   No, the version of the tools has no effect on any of this. It
really sounds like you have hardware issues.

   Hugo.

-- 
Hugo Mills             | ©1973 Unclear Research Ltd
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4          |


* Re: BTRFS RAID1 behavior after one drive temporal disconection
  2015-10-08 22:22         ` Hugo Mills
@ 2015-10-09 11:13           ` Austin S Hemmelgarn
  0 siblings, 0 replies; 8+ messages in thread
From: Austin S Hemmelgarn @ 2015-10-09 11:13 UTC (permalink / raw)
  To: Hugo Mills, Pavel Pisa, linux-btrfs


On 2015-10-08 18:22, Hugo Mills wrote:
> On Fri, Oct 09, 2015 at 12:16:43AM +0200, Pavel Pisa wrote:
>> Hello Hugo,
>>
>> On Thursday 08 of October 2015 23:13:52 Hugo Mills wrote:
>>> On Thu, Oct 08, 2015 at 07:47:33AM -0400, Austin S Hemmelgarn wrote:
>>>> On 2015-10-08 04:28, Pavel Pisa wrote:
>>>>> I go to use "btrfs replace" because there has not been any reply to my
>>>>> inplace correction question. But I expect that clarification if
>>>>> possible/how to resync RAID1 after one drive temporal disappear is
>>>>> really important to many of BTRFS users.
>>>>
>>>> As of right now, there is no way that I know of to safely re-sync a
>>>> drive that's been disconnected for a while.  The best bet is
>>>> probably to use replace, but for that to work reliably, you would
>>>> need to tell it to ignore the now stale drive when trying to read
>>>> each chunk.
>>>
>>>     Scrub is officially what you need there. I can confirm that it
>>> works correctly, having used it myself after accidentally unplugging
>>> the wrong drive.
>>>
>>
>> Thanks for the reply.
>>
>> I have tried to run scrub after reconnect but it counted errors in
>> its console output and write errors has been logged by kernel as crazy.
>> I have to admit I have not wait to finish it because I have not good
>> feeling from it.
>
>     If the scrub works OK, you will still get lots of scary-looking
> errors in the logs, but they'll usually say it's repaired the problem.
>
>     Getting write errors at this point indicates that you have hardware
> problems of some kind, and (usually) that device needs to be replaced.
> (Or the controller, or the cabling).
>
>> May it be it was result of not fully correct reconnect.
>> But other partition worked with ext4 has no problems to write.
>>
>> But if mount/unmount (in my case requiring reboot) and then scrub
>> worked it would be much simpler than replaces series.
>>
>> I hope I would not need that (at least soon/in years) but I
>> give try to scrub again.
>>
>> May it be problem is my btrfs tools old version on the server --
>> Wheezy Btrfs v3.17 backport. Kernel is Linux 4.1.2 #1 SMP PREEMPT.
>
>     No, the version of the tools has no effect on any of this. It
> really sounds like you have hardware issues.
I have to agree with Hugo here, it really sounds like you have hardware 
issues.  If the new HDD doesn't completely fix things, try replacing the 
cabling (because that's the easy to fix without spending a lot of money, 
and having spare cables is usually a good thing), then try the 
controller or RAM.


