All of lore.kernel.org
 help / color / mirror / Atom feed
* Global hotspare functionality
@ 2016-03-18 19:39 Yauhen Kharuzhy
  2016-03-19  1:17 ` Yauhen Kharuzhy
  2016-03-29 14:41 ` Anand Jain
  0 siblings, 2 replies; 18+ messages in thread
From: Yauhen Kharuzhy @ 2016-03-18 19:39 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Anand Jain

Hi all,

I try to get Anand's patchset for global hotspare functionality working.

Now it's working for me but I have met number of issues while applying
and patches testing.

I took latest versions of patchset and its dependencies (latest at two
weeks ago):

1) Anand's hotspare patchset:
http://thread.gmane.org/gmane.comp.file-systems.btrfs/49985
2) Device delete by id series:
http://thread.gmane.org/gmane.comp.file-systems.btrfs/53208
3) Two Anand's patches about sysfs attributes (hotspare series seems to be
depended on it):
http://thread.gmane.org/gmane.comp.file-systems.btrfs/48943

My kernel is 4.4.5 stable version (I had tried integration-4.6 branch
of btrfs-next first and had same troubles as for 4.4.5).

So, good result: hotspare functionality works!
Bad result: it works for me after some patching only :)

General notice: we are definitely need FS-specific hotspares, because
common case is to have few RAID with different drives size (system root and
data RAIDs, for instance).

I have published my git tree with working set of patches here:
https://bitbucket.org/jekhor/linux-btrfs/branch/4.4.5%2Bhotspare-without_degradable_check
And corresponding btrfs-progs tree:
https://bitbucket.org/jekhor/btrfs-progs/commits/branch/devel-hotspare

This trees contain some RAID state monitoring related changes, just ignore
them (I am going to start another discussion about of RAID status monitoring
soon).


Issue 1.

First, kernel oopsed at FS mounting after unmounting. Unfortunately, I
don't have saved logs for this. I found that fsid_kobj was corrupted (has
NULL ktype field) before invocation of btrfs_sysfs_add_fsid(). I cannot
found the source of corruption – no 'kobject release' events before,
state_initialized field remains true, ktype just is cleaned
(btrfs_ktype.release() wasn't called before this too).

My printk-based trace looks like this but exactly place of value changing
was not permanent, so this is can be some kind of race condition:

Mar 11 01:07:31 grack12 kernel: [   33.694074] btrfs_commit_transaction:2133: fsid_kobj=ffff88001f020cd8, ktype=ffffffffa0219840
Mar 11 01:07:31 grack12 kernel: [   33.697967] btrfs_commit_transaction:2142: fsid_kobj=ffff88001f020cd8, ktype=ffffffffa0219840
Mar 11 01:07:31 grack12 kernel: [   33.697972] write_all_supers:3672: fsid_kobj=ffff88001f020cd8, ktype=ffffffffa0219840
Mar 11 01:07:31 grack12 kernel: [   33.697973] write_all_supers:3677: fsid_kobj=ffff88001f020cd8, ktype=ffffffffa0219840
Mar 11 01:07:31 grack12 kernel: [   33.697974] write_all_supers:3679: fsid_kobj=ffff88001f020cd8, ktype=ffffffffa0219840
Mar 11 01:07:31 grack12 kernel: [   33.702881] write_all_supers:3690: fsid_kobj=ffff88001f020cd8, ktype=          (null)
Mar 11 01:07:31 grack12 kernel: [   33.702884] write_all_supers:3699: fsid_kobj=ffff88001f020cd8, ktype=          (null)
Mar 11 01:07:31 grack12 kernel: [   33.702885] write_all_supers:3701: fsid_kobj=ffff88001f020cd8, ktype=          (null)

Bisecting pointed me to simple commit 'b0f398c btrfs: optimize
btrfs_check_degradable() for calls outside of barrier' but I have no idea how
it may cause or trigger this issue...

So, after spending some time for debugging, I decided to remove second
patchset entirely except of 'btrfs: create a helper function to read the disk
super' commit and problem had gone out.


Issue 2.
At start of autoreplacig drive by hotspare, kernel craches in transaction
handling code (inside of btrfs_commit_transaction() called by autoreplace initiating
routines). I 'fixed' this by removing of closing of bdev in btrfs_close_one_device_dont_free(), see
https://bitbucket.org/jekhor/linux-btrfs/commits/dfa441c9ec7b3833f6a5e4d0b6f8c678faea29bb?at=master
(oops text is attached also). Bdev is closed after replacing by
btrfs_dev_replace_finishing(), so this is safe but doesn't seem
to be right way.

Issue 3.
btrfs_auto_replace_start() doesn't check and doesn't set the
fs_info->mutually_exclusive_operation_running flag as ioctl handler for
DEV_REPLACE_START does, this cause race conditions in some cases, see
https://bitbucket.org/jekhor/linux-btrfs/commits/834bebb96a2f6b5ef5856836839e5ce7830ec745?at=master

Issue 4.
Autoreplacement code doesn't start replacing at mounting in degraded mode,
even if hotspare exists. We need this feature, so I added check for missing
drives also, not only for failed, to checking if replacement needed.
See
https://bitbucket.org/jekhor/linux-btrfs/commits/4c9ddb58d979ae5a232aeaa1fbe3d26373210768?at=master
and
https://bitbucket.org/jekhor/linux-btrfs/commits/be5e2524c10f2b4047da80f9f85b54c6006d4273?at=master

-- 
Yauhen Kharuzhy

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Global hotspare functionality
  2016-03-18 19:39 Global hotspare functionality Yauhen Kharuzhy
@ 2016-03-19  1:17 ` Yauhen Kharuzhy
  2016-03-29 14:43   ` Anand Jain
  2016-03-29 14:41 ` Anand Jain
  1 sibling, 1 reply; 18+ messages in thread
From: Yauhen Kharuzhy @ 2016-03-19  1:17 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Anand Jain

Issue 5:

Race between close_ctree() and casualty_kthread():

close_ctree():
        if (fs_info->casualty_kthread)
                kthread_stop(fs_info->casualty_kthread);

casualty_kthread():
out:
        fs_info->casualty_kthread = NULL;


At SMP system, kthread_stop() argument can be changed to NULL after
check in if (). Some kind of synchronization is needed.

Oops:

[48878.674314] BUG: unable to handle kernel paging request at 0000000000001347
[48878.686252] IP: [<ffffffff810a73a4>] kthread_stop+0x54/0x280
[48878.687530] PGD 5f2c6067 PUD 62c63067 PMD 0
[48878.688780] Oops: 0002 [#1] SMP
[48878.689701] Modules linked in: cpuid cpufreq_powersave
cpufreq_stats cpufreq_userspace cpufreq_conservative softdog nfsd
auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc
ipmi_devintf ipmi_msghandler iosf_mbi crct10dif_pclmul crc32_pclmul
sha256_ssse3 sha256_generic hmac drbg ansi_cprng iTCO_wdt
iTCO_vendor_support snd_pcm snd_timer snd soundcore aesni_intel
aes_x86_64 lrw gf128mul glue_helper ablk_helper lpc_ich cryptd
mfd_core psmouse serio_raw 8250_fintek parport_pc evdev parport video
pcspkr ac battery acpi_cpufreq tpm_tis tpm processor i2c_piix4
rng_core button btrfs xor raid6_pq dm_mod raid1 md_mod sg sd_mod ahci
libahci libata pcnet32 crc32c_intel scsi_mod mii
[48879.282797] CPU: 1 PID: 24369 Comm: umount Not tainted 4.4.5-scst31x+ #16
[48879.283976] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS
VirtualBox 12/01/2006
[48879.285532] task: ffff880036438440 ti: ffff88006242c000 task.ti:
ffff88006242c000
[48879.288022] RIP: 0010:[<ffffffff810a73a4>]  [<ffffffff810a73a4>]
kthread_stop+0x54/0x280
[48879.289681] RSP: 0018:ffff88006242fd78  EFLAGS: 00010206
[48879.290678] RAX: 0000000000000001 RBX: 00000000000013af RCX: 0000000000000003
[48879.291889] RDX: 0000000000000000 RSI: 0000000000001347 RDI: 0000000000000246
[48879.293096] RBP: ffff88006242fd90 R08: ffff88006242fd38 R09: 0000000000000000
[48879.449616] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88002642e840
[48879.494970] R13: ffff880036438440 R14: 0000000000000000 R15: ffff880048b7f000
[48879.496219] FS:  00007f95236c27e0(0000) GS:ffff880066b00000(0000)
knlGS:0000000000000000
[48879.497767] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[48879.498806] CR2: 0000000000001347 CR3: 000000005ed74000 CR4: 00000000000406e0
[48879.500027] Stack:
[48879.500603]  ffff880027f4a2d0 ffff880027f48000 ffff880036438440
ffff88006242fdf0
[48879.502516]  ffffffffa016713d ffff88006242fe00 ffffffff81239ff5
ffff880039cf2000
[48879.505658]  ffff880039cf2a40 ffff880062430000 ffff880039cf2000
ffffffffa01edae0
[48879.507584] Call Trace:
[48879.508239]  [<ffffffffa016713d>] close_ctree+0x13d/0x390 [btrfs]
[48879.509328]  [<ffffffff81239ff5>] ? evict_inodes+0x165/0x180
[48879.510360]  [<ffffffffa0135f99>] btrfs_put_super+0x19/0x20 [btrfs]
[48879.511467]  [<ffffffff8121d0ff>] generic_shutdown_super+0x6f/0xf0
[48879.512570]  [<ffffffff8121d3f2>] kill_anon_super+0x12/0x20
[48879.594205]  [<ffffffffa0136a88>] btrfs_kill_super+0x18/0x120 [btrfs]
[48879.714563]  [<ffffffff8121d66e>] deactivate_locked_super+0x3e/0x70
[48879.715691]  [<ffffffff8121dabc>] deactivate_super+0x5c/0x60
[48879.716720]  [<ffffffff8123e07f>] cleanup_mnt+0x3f/0x90
[48879.717695]  [<ffffffff8123e112>] __cleanup_mnt+0x12/0x20
[48879.718695]  [<ffffffff810a5423>] task_work_run+0x73/0xa0
[48879.719692]  [<ffffffff810032ac>] exit_to_usermode_loop+0xcc/0xd0
[48879.720773]  [<ffffffff81003e0c>] syscall_return_slowpath+0xcc/0xe0
[48879.721880]  [<ffffffff81634022>] int_ret_from_sys_call+0x25/0x9f
[48879.722962] Code: 85 c0 0f 85 90 01 00 00 65 ff 0d 48 4b f6 7e f0
41 ff 44 24 10 49 8b 9c 24 38 04 00 00 48 85 db 74 26 48 89 de 48 83
ee 68 74 1d <f0> 80 4b 98 02 4c 89 e7 e8 ff fe ff ff 4c 89 e7 e8 a7 ca
00 00
[48879.891283] RIP  [<ffffffff810a73a4>] kthread_stop+0x54/0x280
[48880.019550]  RSP <ffff88006242fd78>
[48880.091281] CR2: 0000000000001347
[48880.133312] ---[ end trace 8f0228d92754702e ]---

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Global hotspare functionality
  2016-03-18 19:39 Global hotspare functionality Yauhen Kharuzhy
  2016-03-19  1:17 ` Yauhen Kharuzhy
@ 2016-03-29 14:41 ` Anand Jain
  2016-03-29 19:24   ` Yauhen Kharuzhy
                     ` (2 more replies)
  1 sibling, 3 replies; 18+ messages in thread
From: Anand Jain @ 2016-03-29 14:41 UTC (permalink / raw)
  To: Yauhen Kharuzhy, linux-btrfs


Hi Yauhen,

  Thanks ! more below..

On 03/19/2016 03:39 AM, Yauhen Kharuzhy wrote:
> Hi all,
>
> I try to get Anand's patchset for global hotspare functionality working.

> Now it's working for me but I have met number of issues while applying
> and patches testing.
>
> I took latest versions of patchset and its dependencies (latest at two
> weeks ago):
>
> 1) Anand's hotspare patchset:
> http://thread.gmane.org/gmane.comp.file-systems.btrfs/49985
> 2) Device delete by id series:
> http://thread.gmane.org/gmane.comp.file-systems.btrfs/53208

  V2 is sent out based on cleanups-4.6 branch. Plus preparatory
  patch found in the ML.

fc72066b27f3 btrfs: refactor btrfs_dev_replace_start for reuse
0b2322126a95 btrfs: keep sysfs target add in the last
3ecbc05149e0 btrfs: use fs_info directly


> 3) Two Anand's patches about sysfs attributes (hotspare series seems to be
> depended on it):
> http://thread.gmane.org/gmane.comp.file-systems.btrfs/48943

  The sysfs patches we need it only to see the device state or some
  enterprise scripts may need it. But auto replace hot spare as such
  don't depend on that.

> My kernel is 4.4.5 stable version (I had tried integration-4.6 branch
> of btrfs-next first and had same troubles as for 4.4.5).

  Wiki needs an update, pls don't use btrfs-next.

> So, good result: hotspare functionality works!

  Thanks for testing.

> Bad result: it works for me after some patching only :)

  Thanks for working on it. Let me review.

> General notice: we are definitely need FS-specific hotspares, because
> common case is to have few RAID with different drives size (system root and
> data RAIDs, for instance).

  Yep.

> I have published my git tree with working set of patches here:
> https://bitbucket.org/jekhor/linux-btrfs/branch/4.4.5%2Bhotspare-without_degradable_check
> And corresponding btrfs-progs tree:
> https://bitbucket.org/jekhor/btrfs-progs/commits/branch/devel-hotspare

  Hm, generally posting the independent patches to the ML will help.

> This trees contain some RAID state monitoring related changes, just ignore
> them (I am going to start another discussion about of RAID status monitoring
> soon).
>
>
> Issue 1.
>
> First, kernel oopsed at FS mounting after unmounting. Unfortunately, I
> don't have saved logs for this. I found that fsid_kobj was corrupted (has
> NULL ktype field) before invocation of btrfs_sysfs_add_fsid(). I cannot
> found the source of corruption – no 'kobject release' events before,
> state_initialized field remains true, ktype just is cleaned
> (btrfs_ktype.release() wasn't called before this too).
>
> My printk-based trace looks like this but exactly place of value changing
> was not permanent, so this is can be some kind of race condition:
>
> Mar 11 01:07:31 grack12 kernel: [   33.694074] btrfs_commit_transaction:2133: fsid_kobj=ffff88001f020cd8, ktype=ffffffffa0219840
> Mar 11 01:07:31 grack12 kernel: [   33.697967] btrfs_commit_transaction:2142: fsid_kobj=ffff88001f020cd8, ktype=ffffffffa0219840
> Mar 11 01:07:31 grack12 kernel: [   33.697972] write_all_supers:3672: fsid_kobj=ffff88001f020cd8, ktype=ffffffffa0219840
> Mar 11 01:07:31 grack12 kernel: [   33.697973] write_all_supers:3677: fsid_kobj=ffff88001f020cd8, ktype=ffffffffa0219840
> Mar 11 01:07:31 grack12 kernel: [   33.697974] write_all_supers:3679: fsid_kobj=ffff88001f020cd8, ktype=ffffffffa0219840
> Mar 11 01:07:31 grack12 kernel: [   33.702881] write_all_supers:3690: fsid_kobj=ffff88001f020cd8, ktype=          (null)
> Mar 11 01:07:31 grack12 kernel: [   33.702884] write_all_supers:3699: fsid_kobj=ffff88001f020cd8, ktype=          (null)
> Mar 11 01:07:31 grack12 kernel: [   33.702885] write_all_supers:3701: fsid_kobj=ffff88001f020cd8, ktype=          (null)
>
> Bisecting pointed me to simple commit 'b0f398c btrfs: optimize
> btrfs_check_degradable() for calls outside of barrier' but I have no idea how
> it may cause or trigger this issue...

  dev is stale here, my bad, that was a crap patch. Also we don't need
  this patch as part of hot spare / auto replace code. I have removed it.

> So, after spending some time for debugging, I decided to remove second
> patchset entirely except of 'btrfs: create a helper function to read the disk
> super' commit and problem had gone out.
>
>
> Issue 2.
> At start of autoreplacig drive by hotspare, kernel craches in transaction
> handling code (inside of btrfs_commit_transaction() called by autoreplace initiating
> routines). I 'fixed' this by removing of closing of bdev in btrfs_close_one_device_dont_free(), see
> https://bitbucket.org/jekhor/linux-btrfs/commits/dfa441c9ec7b3833f6a5e4d0b6f8c678faea29bb?at=master
> (oops text is attached also). Bdev is closed after replacing by
> btrfs_dev_replace_finishing(), so this is safe but doesn't seem
> to be right way.

  I have sent out V2. I don't see that issue with this,
  could you pls try ?

> Issue 3.
> btrfs_auto_replace_start() doesn't check and doesn't set the
> fs_info->mutually_exclusive_operation_running flag as ioctl handler for
> DEV_REPLACE_START does, this cause race conditions in some cases, see
> https://bitbucket.org/jekhor/linux-btrfs/commits/834bebb96a2f6b5ef5856836839e5ce7830ec745?at=master

  There were some fixes to the main btrfs_auto_replace_start() before,
  (not the v2). So to avoid such a disconnect, I have sent out a patch
  set which shall not v2 the function, instead it re-factors the original

     btrfs: refactor btrfs_dev_replace_start for reuse

  With this the hot spare V2 will apply nicely, and I have found it
  to be stable.

> Issue 4.
> Autoreplacement code doesn't start replacing at mounting in degraded mode,
> even if hotspare exists. We need this feature, so I added check for missing
> drives also, not only for failed, to checking if replacement needed.

  No. No. No please don't do that, it would lead to trouble in handing
  slow devices. I purposely didn't do it.

  Also kindly note that, in volume manage / storage context things
  should continue to work in degraded mode automatically, and it
  shouldn't wait for user's opinion. If it don't do that, then
  there is no point in having a volume manager. But as of now btrfs has
  already made degraded as non default choice. There is something else
  new which is needed and it can be a separate RFC, not part of this
  patch set.


  Please try. V2 sent out.

Thanks, Anand

> See
> https://bitbucket.org/jekhor/linux-btrfs/commits/4c9ddb58d979ae5a232aeaa1fbe3d26373210768?at=master
> and
> https://bitbucket.org/jekhor/linux-btrfs/commits/be5e2524c10f2b4047da80f9f85b54c6006d4273?at=master





^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Global hotspare functionality
  2016-03-19  1:17 ` Yauhen Kharuzhy
@ 2016-03-29 14:43   ` Anand Jain
  0 siblings, 0 replies; 18+ messages in thread
From: Anand Jain @ 2016-03-29 14:43 UTC (permalink / raw)
  To: Yauhen Kharuzhy, linux-btrfs



On 03/19/2016 09:17 AM, Yauhen Kharuzhy wrote:
> Issue 5:
>
> Race between close_ctree() and casualty_kthread():

  This is fixed in V2, thanks for the report.

- Anand

> close_ctree():
>          if (fs_info->casualty_kthread)
>                  kthread_stop(fs_info->casualty_kthread);
>
> casualty_kthread():
> out:
>          fs_info->casualty_kthread = NULL;
>
>
> At SMP system, kthread_stop() argument can be changed to NULL after
> check in if (). Some kind of synchronization is needed.
>
> Oops:
>
> [48878.674314] BUG: unable to handle kernel paging request at 0000000000001347
> [48878.686252] IP: [<ffffffff810a73a4>] kthread_stop+0x54/0x280
> [48878.687530] PGD 5f2c6067 PUD 62c63067 PMD 0
> [48878.688780] Oops: 0002 [#1] SMP
> [48878.689701] Modules linked in: cpuid cpufreq_powersave
> cpufreq_stats cpufreq_userspace cpufreq_conservative softdog nfsd
> auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc
> ipmi_devintf ipmi_msghandler iosf_mbi crct10dif_pclmul crc32_pclmul
> sha256_ssse3 sha256_generic hmac drbg ansi_cprng iTCO_wdt
> iTCO_vendor_support snd_pcm snd_timer snd soundcore aesni_intel
> aes_x86_64 lrw gf128mul glue_helper ablk_helper lpc_ich cryptd
> mfd_core psmouse serio_raw 8250_fintek parport_pc evdev parport video
> pcspkr ac battery acpi_cpufreq tpm_tis tpm processor i2c_piix4
> rng_core button btrfs xor raid6_pq dm_mod raid1 md_mod sg sd_mod ahci
> libahci libata pcnet32 crc32c_intel scsi_mod mii
> [48879.282797] CPU: 1 PID: 24369 Comm: umount Not tainted 4.4.5-scst31x+ #16
> [48879.283976] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS
> VirtualBox 12/01/2006
> [48879.285532] task: ffff880036438440 ti: ffff88006242c000 task.ti:
> ffff88006242c000
> [48879.288022] RIP: 0010:[<ffffffff810a73a4>]  [<ffffffff810a73a4>]
> kthread_stop+0x54/0x280
> [48879.289681] RSP: 0018:ffff88006242fd78  EFLAGS: 00010206
> [48879.290678] RAX: 0000000000000001 RBX: 00000000000013af RCX: 0000000000000003
> [48879.291889] RDX: 0000000000000000 RSI: 0000000000001347 RDI: 0000000000000246
> [48879.293096] RBP: ffff88006242fd90 R08: ffff88006242fd38 R09: 0000000000000000
> [48879.449616] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88002642e840
> [48879.494970] R13: ffff880036438440 R14: 0000000000000000 R15: ffff880048b7f000
> [48879.496219] FS:  00007f95236c27e0(0000) GS:ffff880066b00000(0000)
> knlGS:0000000000000000
> [48879.497767] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [48879.498806] CR2: 0000000000001347 CR3: 000000005ed74000 CR4: 00000000000406e0
> [48879.500027] Stack:
> [48879.500603]  ffff880027f4a2d0 ffff880027f48000 ffff880036438440
> ffff88006242fdf0
> [48879.502516]  ffffffffa016713d ffff88006242fe00 ffffffff81239ff5
> ffff880039cf2000
> [48879.505658]  ffff880039cf2a40 ffff880062430000 ffff880039cf2000
> ffffffffa01edae0
> [48879.507584] Call Trace:
> [48879.508239]  [<ffffffffa016713d>] close_ctree+0x13d/0x390 [btrfs]
> [48879.509328]  [<ffffffff81239ff5>] ? evict_inodes+0x165/0x180
> [48879.510360]  [<ffffffffa0135f99>] btrfs_put_super+0x19/0x20 [btrfs]
> [48879.511467]  [<ffffffff8121d0ff>] generic_shutdown_super+0x6f/0xf0
> [48879.512570]  [<ffffffff8121d3f2>] kill_anon_super+0x12/0x20
> [48879.594205]  [<ffffffffa0136a88>] btrfs_kill_super+0x18/0x120 [btrfs]
> [48879.714563]  [<ffffffff8121d66e>] deactivate_locked_super+0x3e/0x70
> [48879.715691]  [<ffffffff8121dabc>] deactivate_super+0x5c/0x60
> [48879.716720]  [<ffffffff8123e07f>] cleanup_mnt+0x3f/0x90
> [48879.717695]  [<ffffffff8123e112>] __cleanup_mnt+0x12/0x20
> [48879.718695]  [<ffffffff810a5423>] task_work_run+0x73/0xa0
> [48879.719692]  [<ffffffff810032ac>] exit_to_usermode_loop+0xcc/0xd0
> [48879.720773]  [<ffffffff81003e0c>] syscall_return_slowpath+0xcc/0xe0
> [48879.721880]  [<ffffffff81634022>] int_ret_from_sys_call+0x25/0x9f
> [48879.722962] Code: 85 c0 0f 85 90 01 00 00 65 ff 0d 48 4b f6 7e f0
> 41 ff 44 24 10 49 8b 9c 24 38 04 00 00 48 85 db 74 26 48 89 de 48 83
> ee 68 74 1d <f0> 80 4b 98 02 4c 89 e7 e8 ff fe ff ff 4c 89 e7 e8 a7 ca
> 00 00
> [48879.891283] RIP  [<ffffffff810a73a4>] kthread_stop+0x54/0x280
> [48880.019550]  RSP <ffff88006242fd78>
> [48880.091281] CR2: 0000000000001347
> [48880.133312] ---[ end trace 8f0228d92754702e ]---
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Global hotspare functionality
  2016-03-29 14:41 ` Anand Jain
@ 2016-03-29 19:24   ` Yauhen Kharuzhy
  2016-03-29 19:59     ` Austin S. Hemmelgarn
  2016-03-29 19:40   ` Yauhen Kharuzhy
  2016-03-29 19:47   ` Yauhen Kharuzhy
  2 siblings, 1 reply; 18+ messages in thread
From: Yauhen Kharuzhy @ 2016-03-29 19:24 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs

On Tue, Mar 29, 2016 at 10:41:36PM +0800, Anand Jain wrote:
> 
> >Issue 2.
> >At start of autoreplacig drive by hotspare, kernel craches in transaction
> >handling code (inside of btrfs_commit_transaction() called by autoreplace initiating
> >routines). I 'fixed' this by removing of closing of bdev in btrfs_close_one_device_dont_free(), see
> >https://bitbucket.org/jekhor/linux-btrfs/commits/dfa441c9ec7b3833f6a5e4d0b6f8c678faea29bb?at=master
> >(oops text is attached also). Bdev is closed after replacing by
> >btrfs_dev_replace_finishing(), so this is safe but doesn't seem
> >to be right way.
> 
>  I have sent out V2. I don't see that issue with this,
>  could you pls try ?

Sure. In progresss now.

 
> >Issue 4.
> >Autoreplacement code doesn't start replacing at mounting in degraded mode,
> >even if hotspare exists. We need this feature, so I added check for missing
> >drives also, not only for failed, to checking if replacement needed.
> 
>  No. No. No please don't do that, it would lead to trouble in handing
>  slow devices. I purposely didn't do it.

Hmm. Can you explain please? Sometimes admins may want to have
autoreplacement working automatically if drive was failed and removed
before unmounting and remounting again. The simplest way to achieve this —
add spare and always mount FS with 'degraded' option (we need to use
this option in any case if we have root fs on RAID, for instance, to
avoiding non-bootable state). So, if the autoreplacement code will check for
missing drives also, this will working without user intervention. To
allow user to decide if he wants autoreplacement, we can add mount
option like '(no)hotspare' (I have done this already for our project and
will send patch after rebasing onto your new series). Yes, there are
side effects exists if you want to make some experiments with missing
drives in FS, but you can disable autoreplacement for such case.

If you know about any pitfalls in such scenarios, please point me to
them, I am newbie in FS-related kernel things.


>  Also kindly note that, in volume manage / storage context things
>  should continue to work in degraded mode automatically, and it
>  shouldn't wait for user's opinion. If it don't do that, then
>  there is no point in having a volume manager. But as of now btrfs has
>  already made degraded as non default choice. There is something else
>  new which is needed and it can be a separate RFC, not part of this
>  patch set.
> 
> 
>  Please try. V2 sent out.

OK, I am going to testing now. 

-- 
Yauhen Kharuzhy

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Global hotspare functionality
  2016-03-29 14:41 ` Anand Jain
  2016-03-29 19:24   ` Yauhen Kharuzhy
@ 2016-03-29 19:40   ` Yauhen Kharuzhy
  2016-03-30 22:17     ` Yauhen Kharuzhy
  2016-03-29 19:47   ` Yauhen Kharuzhy
  2 siblings, 1 reply; 18+ messages in thread
From: Yauhen Kharuzhy @ 2016-03-29 19:40 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs

Hi.

I am testing hotspare v2 on kernel v4.4.5 (I will try latest Chris' tree later)
now with lockdep debugging enabled. At starting of replacement, lockdep warning is displayed,
because kstrdup() is called with GFP_NOFS inside of rcu_read_lock/unlock()
block (GFP_NOFS can sleep).

[ 1463.470875] BTRFS critical (device sdc): Fatal error on device /dev/sdf
[ 1463.499091] BTRFS warning (device sdc): device /dev/sdf failed, chunks degraded
[ 1463.501394] BTRFS info (device sdc): num_devices 4 rw_devices 3 degraded-option: unset
[ 1463.506221] 
[ 1463.508572] ===============================
[ 1463.512989] [ INFO: suspicious RCU usage. ]
[ 1463.514194] 4.4.5-scst31x+ #20 Not tainted
[ 1463.515191] -------------------------------
[ 1463.516162] include/linux/rcupdate.h:560 Illegal context switch in RCU read-side critical section!
[ 1463.519943] 
[ 1463.519943] other info that might help us debug this:
[ 1463.519943] 
[ 1463.531515] 
[ 1463.531515] rcu_scheduler_active = 1, debug_locks = 0
[ 1463.541056] 3 locks held by btrfs-casualty/4702:
[ 1463.542151]  #0:  (&fs_info->casualty_mutex){+.+...}, at: [<ffffffffa018325a>] casualty_kthread+0x6a/0x340 [btrfs]
[ 1463.563978]  #1:  (uuid_mutex){+.+.+.}, at: [<ffffffffa01bfc65>] btrfs_get_spare_device+0x25/0x1c0 [btrfs]
[ 1463.587380]  #2:  (rcu_read_lock){......}, at: [<ffffffffa01bfcac>] btrfs_get_spare_device+0x6c/0x1c0 [btrfs]
[ 1463.606124] 
[ 1463.606124] stack backtrace:
[ 1463.612596] CPU: 0 PID: 4702 Comm: btrfs-casualty Not tainted 4.4.5-scst31x+ #20
[ 1463.630796] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 1463.649423]  0000000000000000 ffff88005e65fc60 ffffffff813529e3 ffff88005e658580
[ 1463.685195]  0000000000000001 ffff88005e65fc90 ffffffff810d6407 0000000000000000
[ 1463.719570]  ffffffff81a270d7 0000000000000b2d 0000000002400040 ffff88005e65fcb8
[ 1463.735510] Call Trace:
[ 1463.741172]  [<ffffffff813529e3>] dump_stack+0x85/0xc2
[ 1463.751346]  [<ffffffff810d6407>] lockdep_rcu_suspicious+0xd7/0x110
[ 1463.776423]  [<ffffffff810add27>] ___might_sleep+0xa7/0x230
[ 1463.792409]  [<ffffffff810adef9>] __might_sleep+0x49/0x80
[ 1463.798818]  [<ffffffffa01bfd0a>] ? btrfs_get_spare_device+0xca/0x1c0 [btrfs]
[ 1463.814369]  [<ffffffff81200ec7>] __kmalloc_track_caller+0x227/0x5e0
[ 1463.828477]  [<ffffffff811c0f81>] kstrdup+0x31/0x60
[ 1463.836761]  [<ffffffffa01bfd0a>] btrfs_get_spare_device+0xca/0x1c0 [btrfs]
[ 1463.851841]  [<ffffffffa01bfcac>] ? btrfs_get_spare_device+0x6c/0x1c0 [btrfs]
[ 1463.867022]  [<ffffffffa02038a3>] btrfs_auto_replace_start+0x23/0xd0 [btrfs]
[ 1463.874201]  [<ffffffffa01834ad>] casualty_kthread+0x2bd/0x340 [btrfs]
[ 1463.891812]  [<ffffffffa01833d1>] ? casualty_kthread+0x1e1/0x340 [btrfs]
[ 1463.908948]  [<ffffffffa01831f0>] ? btrfs_check_devices+0x1f0/0x1f0 [btrfs]
[ 1463.920370]  [<ffffffff810a70df>] kthread+0xef/0x110
[ 1463.936421]  [<ffffffff810dc081>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
[ 1463.949868]  [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200
[ 1463.963096]  [<ffffffff81637c2f>] ret_from_fork+0x3f/0x70
[ 1463.973053]  [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200
[ 1463.976256] BUG: sleeping function called from invalid context at mm/slab.c:2861
[ 1463.979936] in_atomic(): 1, irqs_disabled(): 0, pid: 4702, name: btrfs-casualty
[ 1463.981868] INFO: lockdep is turned off.
[ 1463.987513] CPU: 0 PID: 4702 Comm: btrfs-casualty Not tainted 4.4.5-scst31x+ #20
[ 1463.999678] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 1464.002555]  0000000000000000 ffff88005e65fc90 ffffffff813529e3 ffff88005e658580
[ 1464.022926]  ffffffff81a270d7 ffff88005e65fcb8 ffffffff810addf9 ffffffff81a270d7
[ 1464.055231]  0000000000000b2d 0000000000000000 ffff88005e65fce0 ffffffff810adef9
[ 1464.066771] Call Trace:
[ 1464.069301]  [<ffffffff813529e3>] dump_stack+0x85/0xc2
[ 1464.074030]  [<ffffffff810addf9>] ___might_sleep+0x179/0x230
[ 1464.076525]  [<ffffffff810adef9>] __might_sleep+0x49/0x80
[ 1464.078372]  [<ffffffffa01bfd0a>] ? btrfs_get_spare_device+0xca/0x1c0 [btrfs]
[ 1464.080315]  [<ffffffff81200ec7>] __kmalloc_track_caller+0x227/0x5e0
[ 1464.083712]  [<ffffffff811c0f81>] kstrdup+0x31/0x60
[ 1464.099224]  [<ffffffffa01bfd0a>] btrfs_get_spare_device+0xca/0x1c0 [btrfs]
[ 1464.102570]  [<ffffffffa01bfcac>] ? btrfs_get_spare_device+0x6c/0x1c0 [btrfs]
[ 1464.106557]  [<ffffffffa02038a3>] btrfs_auto_replace_start+0x23/0xd0 [btrfs]
[ 1464.113009]  [<ffffffffa01834ad>] casualty_kthread+0x2bd/0x340 [btrfs]
[ 1464.115574]  [<ffffffffa01833d1>] ? casualty_kthread+0x1e1/0x340 [btrfs]
[ 1464.131337]  [<ffffffffa01831f0>] ? btrfs_check_devices+0x1f0/0x1f0 [btrfs]
[ 1464.139141]  [<ffffffff810a70df>] kthread+0xef/0x110
[ 1464.151499]  [<ffffffff810dc081>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
[ 1464.175156]  [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200
[ 1464.194352]  [<ffffffff81637c2f>] ret_from_fork+0x3f/0x70
[ 1464.207239]  [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200
[ 1464.232552] BTRFS info (device sdc): dev_replace from <missing disk> (devid 4) to /dev/sdg started


-- 
Yauhen Kharuzhy

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Global hotspare functionality
  2016-03-29 14:41 ` Anand Jain
  2016-03-29 19:24   ` Yauhen Kharuzhy
  2016-03-29 19:40   ` Yauhen Kharuzhy
@ 2016-03-29 19:47   ` Yauhen Kharuzhy
  2016-03-29 23:18     ` Yauhen Kharuzhy
  2016-04-02  1:15     ` Anand Jain
  2 siblings, 2 replies; 18+ messages in thread
From: Yauhen Kharuzhy @ 2016-03-29 19:47 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs

On Tue, Mar 29, 2016 at 10:41:36PM +0800, Anand Jain wrote:
> 
> Hi Yauhen,
> 

> >
> >Issue 2.
> >At start of autoreplacig drive by hotspare, kernel craches in transaction
> >handling code (inside of btrfs_commit_transaction() called by autoreplace initiating
> >routines). I 'fixed' this by removing of closing of bdev in btrfs_close_one_device_dont_free(), see
> >https://bitbucket.org/jekhor/linux-btrfs/commits/dfa441c9ec7b3833f6a5e4d0b6f8c678faea29bb?at=master
> >(oops text is attached also). Bdev is closed after replacing by
> >btrfs_dev_replace_finishing(), so this is safe but doesn't seem
> >to be right way.
> 
>  I have sent out V2. I don't see that issue with this,
>  could you pls try ?

Yes, it reproduced on v4.4.5 kernel. I will try with current
'for-linus-4.6' Chris' tree soon.

To emulate a drive failure, I disconnect the drive in VirtualBox, so bdev
can be freed by kernel after releasing of all references to it.

[ 1464.232552] BTRFS info (device sdc): dev_replace from <missing disk> (devid 4) to /dev/sdg started
[ 1464.255824] BUG: unable to handle kernel NULL pointer dereference at 0000000000000548
[ 1464.291760] IP: [<ffffffff8131d58d>] generic_make_request_checks+0x4d/0x910
[ 1464.309746] PGD 5c668067 PUD 5b841067 PMD 0 
[ 1464.326143] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC 
[ 1464.340474] Modules linked in: cpufreq_powersave cpufreq_stats cpufreq_userspace cpufreq_conservative softdog nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc ipmi_devintf ipmi_msghandler iosf_mbi crct10dif_pclmul crc32_pclmul sha256_ssse3 sha256_generic hmac drbg iTCO_wdt ansi_cprng iTCO_vendor_support snd_pcm snd_timer aesni_intel snd soundcore psmouse aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd evdev serio_raw pcspkr battery acpi_cpufreq 8250_fintek parport_pc video lpc_ich parport mfd_core tpm_tis tpm ac rng_core processor button i2c_piix4 btrfs xor raid6_pq dm_mod raid1 md_mod sg sd_mod ahci libahci libata pcnet32 crc32c_intel scsi_mod mii
[ 1464.483244] CPU: 0 PID: 4702 Comm: btrfs-casualty Not tainted 4.4.5-scst31x+ #20
[ 1464.511300] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 1464.518035] task: ffff88005e658580 ti: ffff88005e65c000 task.ti: ffff88005e65c000
[ 1464.543072] RIP: 0010:[<ffffffff8131d58d>]  [<ffffffff8131d58d>] generic_make_request_checks+0x4d/0x910
[ 1464.579027] RSP: 0018:ffff88005e65f498  EFLAGS: 00010283
[ 1464.604774] RAX: 0000000000000000 RBX: ffff88005b919f28 RCX: 0000000000030b00
[ 1464.629544] RDX: 0000000000000080 RSI: 0000000000000781 RDI: ffff88004ecd5ac0
[ 1464.652763] RBP: ffff88005e65f500 R08: ffff88005b130ff0 R09: 0000000000010000
[ 1464.674939] R10: ffff88005e674f28 R11: 0000000000000000 R12: 0000000000000080
[ 1464.691478] R13: 0000000000000004 R14: ffff88004e48de00 R15: 0000000000000010
[ 1464.714115] FS:  0000000000000000(0000) GS:ffff880066600000(0000) knlGS:0000000000000000
[ 1464.737302] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1464.766380] CR2: 0000000000000548 CR3: 000000005723f000 CR4: 00000000000406f0
[ 1464.804808] Stack:
[ 1464.814950]  ffffffff813184ae 0000000000000246 0000000000000082 0000000000000000
[ 1464.847217]  0000000000000000 0000000000000092 0000000000000000 ffff88005e65f540
[ 1464.879147]  ffff88005b919f28 00000000ffffffff 0000000000000004 ffff88004e48de00
[ 1464.907440] Call Trace:
[ 1464.919293]  [<ffffffff813184ae>] ? bvec_alloc+0x5e/0x100
[ 1464.939019]  [<ffffffff813213a4>] generic_make_request+0x24/0x290
[ 1464.961775]  [<ffffffff81321677>] submit_bio+0x67/0x140
[ 1464.971842]  [<ffffffffa02051b9>] finish_rmw+0x409/0x570 [btrfs]
[ 1464.983700]  [<ffffffffa02053c5>] full_stripe_write+0xa5/0xb0 [btrfs]
[ 1464.996554]  [<ffffffffa0206d05>] raid56_parity_write+0xf5/0x180 [btrfs]
[ 1465.012560]  [<ffffffffa01bab95>] btrfs_map_bio+0x105/0x300 [btrfs]
[ 1465.046907]  [<ffffffffa018f8b3>] ? btrfs_get_extent+0x83/0xb20 [btrfs]
[ 1465.052462]  [<ffffffffa018d175>] btrfs_submit_bio_hook+0xe5/0x1b0 [btrfs]
[ 1465.069342]  [<ffffffff810dc081>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
[ 1465.091031]  [<ffffffffa01a917d>] submit_one_bio+0x6d/0xa0 [btrfs]
[ 1465.111233]  [<ffffffffa01ae06e>] submit_extent_page+0xee/0x230 [btrfs]
[ 1465.126076]  [<ffffffffa01ae7f4>] __extent_writepage_io+0x444/0x490 [btrfs]
[ 1465.132550]  [<ffffffffa01ade10>] ? end_extent_writepage+0x80/0x80 [btrfs]
[ 1465.145490]  [<ffffffffa01aeaa5>] __extent_writepage+0x265/0x3e0 [btrfs]
[ 1465.168445]  [<ffffffffa01aef1b>] extent_write_cache_pages.isra.32.constprop.49+0x2fb/0x3d0 [btrfs]
[ 1465.204094]  [<ffffffffa01b009d>] extent_writepages+0x4d/0x70 [btrfs]
[ 1465.229627]  [<ffffffffa018f830>] ? btrfs_real_readdir+0x5c0/0x5c0 [btrfs]
[ 1465.250927]  [<ffffffffa018d5e8>] btrfs_writepages+0x28/0x30 [btrfs]
[ 1465.274099]  [<ffffffff811afcb1>] do_writepages+0x21/0x30
[ 1465.298275]  [<ffffffff811a10ea>] __filemap_fdatawrite_range+0xaa/0xf0
[ 1465.324278]  [<ffffffff811a1203>] filemap_fdatawrite_range+0x13/0x20
[ 1465.341055]  [<ffffffffa01a1490>] btrfs_fdatawrite_range+0x20/0x50 [btrfs]
[ 1465.378952]  [<ffffffffa01d377a>] __btrfs_write_out_cache.isra.27+0x3ea/0x430 [btrfs]
[ 1465.405760]  [<ffffffffa01d4a7f>] btrfs_write_out_cache+0x8f/0x110 [btrfs]
[ 1465.428091]  [<ffffffffa0176128>] btrfs_write_dirty_block_groups+0x228/0x290 [btrfs]
[ 1465.458865]  [<ffffffffa0208d6a>] commit_cowonly_roots+0x1f8/0x283 [btrfs]
[ 1465.480450]  [<ffffffffa018b0d7>] btrfs_commit_transaction+0x577/0xb60 [btrfs]
[ 1465.512410]  [<ffffffffa0202cf3>] btrfs_dev_replace_start+0x2e3/0x520 [btrfs]
[ 1465.535358]  [<ffffffffa0202b6e>] ? btrfs_dev_replace_start+0x15e/0x520 [btrfs]
[ 1465.548049]  [<ffffffffa02038d8>] btrfs_auto_replace_start+0x58/0xd0 [btrfs]
[ 1465.551787]  [<ffffffffa01834ad>] casualty_kthread+0x2bd/0x340 [btrfs]
[ 1465.561195]  [<ffffffffa01833d1>] ? casualty_kthread+0x1e1/0x340 [btrfs]
[ 1465.573308]  [<ffffffffa01831f0>] ? btrfs_check_devices+0x1f0/0x1f0 [btrfs]
[ 1465.610840]  [<ffffffff810a70df>] kthread+0xef/0x110
[ 1465.629472]  [<ffffffff810dc081>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
[ 1465.648678]  [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200
[ 1465.660686]  [<ffffffff81637c2f>] ret_from_fork+0x3f/0x70
[ 1465.667065]  [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200
[ 1465.676861] Code: 67 28 48 c7 c7 6b f8 a3 81 e8 40 09 d9 ff e8 3b 43 31 00 41 c1 ec 09 48 8b 7b 08 45 85 e4 0f 85 13 01 00 00 48 8b 87 f0 00 00 00 <4c> 8b b8 48 05 00 00 4d 85 ff 0f 84 d5 01 00 00 4c 8b af e0 00 
[ 1465.750135] RIP  [<ffffffff8131d58d>] generic_make_request_checks+0x4d/0x910
[ 1465.776005]  RSP <ffff88005e65f498>
[ 1465.790848] CR2: 0000000000000548
[ 1465.797370] ---[ end trace 45545495cd54e799 ]---



-- 
Yauhen Kharuzhy

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Global hotspare functionality
  2016-03-29 19:24   ` Yauhen Kharuzhy
@ 2016-03-29 19:59     ` Austin S. Hemmelgarn
  2016-03-29 20:26       ` Chris Murphy
  0 siblings, 1 reply; 18+ messages in thread
From: Austin S. Hemmelgarn @ 2016-03-29 19:59 UTC (permalink / raw)
  To: Yauhen Kharuzhy, Anand Jain; +Cc: linux-btrfs

On 2016-03-29 15:24, Yauhen Kharuzhy wrote:
> On Tue, Mar 29, 2016 at 10:41:36PM +0800, Anand Jain wrote:
>>
>>   No. No. No please don't do that, it would lead to trouble in handing
>>   slow devices. I purposely didn't do it.
>
> Hmm. Can you explain please? Sometimes admins may want to have
> autoreplacement working automatically if drive was failed and removed
> before unmounting and remounting again. The simplest way to achieve this —
> add spare and always mount FS with 'degraded' option (we need to use
> this option in any case if we have root fs on RAID, for instance, to
> avoiding non-bootable state). So, if the autoreplacement code will check for
> missing drives also, this will working without user intervention. To
> allow user to decide if he wants autoreplacement, we can add mount
> option like '(no)hotspare' (I have done this already for our project and
> will send patch after rebasing onto your new series). Yes, there are
> side effects exists if you want to make some experiments with missing
> drives in FS, but you can disable autoreplacement for such case.
>
> If you know about any pitfalls in such scenarios, please point me to
> them, I am newbie in FS-related kernel things.
If a disk is particularly slow to start up for some reason (maybe it's 
going bad, maybe it's just got a slow interconnect (think SD cards), 
maybe it's just really cold so the bearings seizing up), then this would 
potentially force it out of the array when it shouldn't be.

That said, having things set to always allow degraded mounts is 
_extremely dangerous_.  If the user does not know anything failed, they 
also can't know they need to get anything fixed.  While notification 
could be used, it also introduces a period of time where the user is at 
risk of data loss without them having explicitly agreed to this risk (by 
manually telling it to mount degraded).

I could possibly understand doing this for something that needs to be 
guaranteed to come on line when powered on,  but **only** if it notifies 
responsible parties that there was a problem **and** it is explicitly 
documented, and even then I'd be wary of doing this unless there was 
something in place to handle the possibility of false positives (yes, 
they do happen), and to make certain that the failed hardware got 
replaced as soon as possible.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Global hotspare functionality
  2016-03-29 19:59     ` Austin S. Hemmelgarn
@ 2016-03-29 20:26       ` Chris Murphy
  2016-03-30 11:26         ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 18+ messages in thread
From: Chris Murphy @ 2016-03-29 20:26 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Yauhen Kharuzhy, Anand Jain, Btrfs BTRFS

On Tue, Mar 29, 2016 at 1:59 PM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2016-03-29 15:24, Yauhen Kharuzhy wrote:
>>
>> On Tue, Mar 29, 2016 at 10:41:36PM +0800, Anand Jain wrote:
>>>
>>>
>>>   No. No. No please don't do that, it would lead to trouble in handing
>>>   slow devices. I purposely didn't do it.
>>
>>
>> Hmm. Can you explain please? Sometimes admins may want to have
>> autoreplacement working automatically if drive was failed and removed
>> before unmounting and remounting again. The simplest way to achieve this —
>> add spare and always mount FS with 'degraded' option (we need to use
>> this option in any case if we have root fs on RAID, for instance, to
>> avoiding non-bootable state). So, if the autoreplacement code will check
>> for
>> missing drives also, this will working without user intervention. To
>> allow user to decide if he wants autoreplacement, we can add mount
>> option like '(no)hotspare' (I have done this already for our project and
>> will send patch after rebasing onto your new series). Yes, there are
>> side effects exists if you want to make some experiments with missing
>> drives in FS, but you can disable autoreplacement for such case.
>>
>> If you know about any pitfalls in such scenarios, please point me to
>> them, I am newbie in FS-related kernel things.
>
> If a disk is particularly slow to start up for some reason (maybe it's going
> bad, maybe it's just got a slow interconnect (think SD cards), maybe it's
> just really cold so the bearings seizing up), then this would potentially
> force it out of the array when it shouldn't be.
>
> That said, having things set to always allow degraded mounts is _extremely
> dangerous_.  If the user does not know anything failed, they also can't know
> they need to get anything fixed.  While notification could be used, it also
> introduces a period of time where the user is at risk of data loss without
> them having explicitly agreed to this risk (by manually telling it to mount
> degraded).

I agree, certainly replace should not be automatic by default. And I'm
unconvinced this belongs in kernel code anyway because it's a matter
of policy. Policy stuff goes in user space, where capability to
achieve the policy goes in the kernel.

A reasonable exception is bad device ejection (e.g. mdadm faulty).

Considering spinning devices take a long time to rebuild already and
this probably won't change, a policy I'd like to see upon a drive
going bad (totally vanishing, or producing many read or write errors):
1. Bad device is ejected, volume is degraded.
2. Consider chunks with one remaining stripe (one copy) as degraded.
3. Degraded chunks are read only, so COW changes to non-degraded chunks.
4. Degraded metadata chunks are replicated elsewhere, happens right away.
5. Implied by 4, degraded data chunks aren't immediately replicated
but any change are, via COW.
6. Option, by policy, to immediately start replicating degraded data
chunks - either with existing storage or hot spare, which is also a
policy choice.

In particular, I'd like to see the single stripe metadata chunks
replicated soon so in case there's another device failure the entire
volume doesn't implode. Yes there's some data loss, still better than
100% data loss.


> I could possibly understand doing this for something that needs to be
> guaranteed to come on line when powered on,  but **only** if it notifies
> responsible parties that there was a problem **and** it is explicitly
> documented, and even then I'd be wary of doing this unless there was
> something in place to handle the possibility of false positives (yes, they
> do happen), and to make certain that the failed hardware got replaced as
> soon as possible.

Exactly. And I think it's safer to be more aggressive with (fairly)
immediate metadata replication to remaining devices, than it is with
data.

I'm considering this behavior for both single volume setups, as well
as multiple bricks in a cluster. And admittedly it's probably
cheaper/easier to just get n-way copies of metadata than the above
scheme I've written.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Global hotspare functionality
  2016-03-29 19:47   ` Yauhen Kharuzhy
@ 2016-03-29 23:18     ` Yauhen Kharuzhy
  2016-04-02  1:15     ` Anand Jain
  1 sibling, 0 replies; 18+ messages in thread
From: Yauhen Kharuzhy @ 2016-03-29 23:18 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs

Reproduced with mason's for-linus-4.6 branch also.

2016-03-29 12:47 GMT-07:00 Yauhen Kharuzhy <yauhen.kharuzhy@zavadatar.com>:
> On Tue, Mar 29, 2016 at 10:41:36PM +0800, Anand Jain wrote:
>>
>> Hi Yauhen,
>>
>
>> >
>> >Issue 2.
>> >At start of autoreplacig drive by hotspare, kernel craches in transaction
>> >handling code (inside of btrfs_commit_transaction() called by autoreplace initiating
>> >routines). I 'fixed' this by removing of closing of bdev in btrfs_close_one_device_dont_free(), see
>> >https://bitbucket.org/jekhor/linux-btrfs/commits/dfa441c9ec7b3833f6a5e4d0b6f8c678faea29bb?at=master
>> >(oops text is attached also). Bdev is closed after replacing by
>> >btrfs_dev_replace_finishing(), so this is safe but doesn't seem
>> >to be right way.
>>
>>  I have sent out V2. I don't see that issue with this,
>>  could you pls try ?
>
> Yes, it reproduced on v4.4.5 kernel. I will try with current
> 'for-linus-4.6' Chris' tree soon.
>
> To emulate a drive failure, I disconnect the drive in VirtualBox, so bdev
> can be freed by kernel after releasing of all references to it.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Global hotspare functionality
  2016-03-29 20:26       ` Chris Murphy
@ 2016-03-30 11:26         ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 18+ messages in thread
From: Austin S. Hemmelgarn @ 2016-03-30 11:26 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Yauhen Kharuzhy, Anand Jain, Btrfs BTRFS

On 2016-03-29 16:26, Chris Murphy wrote:
> On Tue, Mar 29, 2016 at 1:59 PM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>> On 2016-03-29 15:24, Yauhen Kharuzhy wrote:
>>>
>>> On Tue, Mar 29, 2016 at 10:41:36PM +0800, Anand Jain wrote:
>>>>
>>>>
>>>>    No. No. No please don't do that, it would lead to trouble in handing
>>>>    slow devices. I purposely didn't do it.
>>>
>>>
>>> Hmm. Can you explain please? Sometimes admins may want to have
>>> autoreplacement working automatically if drive was failed and removed
>>> before unmounting and remounting again. The simplest way to achieve this —
>>> add spare and always mount FS with 'degraded' option (we need to use
>>> this option in any case if we have root fs on RAID, for instance, to
>>> avoiding non-bootable state). So, if the autoreplacement code will check
>>> for
>>> missing drives also, this will working without user intervention. To
>>> allow user to decide if he wants autoreplacement, we can add mount
>>> option like '(no)hotspare' (I have done this already for our project and
>>> will send patch after rebasing onto your new series). Yes, there are
>>> side effects exists if you want to make some experiments with missing
>>> drives in FS, but you can disable autoreplacement for such case.
>>>
>>> If you know about any pitfalls in such scenarios, please point me to
>>> them, I am newbie in FS-related kernel things.
>>
>> If a disk is particularly slow to start up for some reason (maybe it's going
>> bad, maybe it's just got a slow interconnect (think SD cards), maybe it's
>> just really cold so the bearings seizing up), then this would potentially
>> force it out of the array when it shouldn't be.
>>
>> That said, having things set to always allow degraded mounts is _extremely
>> dangerous_.  If the user does not know anything failed, they also can't know
>> they need to get anything fixed.  While notification could be used, it also
>> introduces a period of time where the user is at risk of data loss without
>> them having explicitly agreed to this risk (by manually telling it to mount
>> degraded).
>
> I agree, certainly replace should not be automatic by default. And I'm
> unconvinced this belongs in kernel code anyway because it's a matter
> of policy. Policy stuff goes in user space, where capability to
> achieve the policy goes in the kernel.
>
> A reasonable exception is bad device ejection (e.g. mdadm faulty).
>
> Considering spinning devices take a long time to rebuild already and
> this probably won't change, a policy I'd like to see upon a drive
> going bad (totally vanishing, or producing many read or write errors):
> 1. Bad device is ejected, volume is degraded.
> 2. Consider chunks with one remaining stripe (one copy) as degraded.
> 3. Degraded chunks are read only, so COW changes to non-degraded chunks.
> 4. Degraded metadata chunks are replicated elsewhere, happens right away.
> 5. Implied by 4, degraded data chunks aren't immediately replicated
> but any change are, via COW.
> 6. Option, by policy, to immediately start replicating degraded data
> chunks - either with existing storage or hot spare, which is also a
> policy choice.
>
> In particular, I'd like to see the single stripe metadata chunks
> replicated soon so in case there's another device failure the entire
> volume doesn't implode. Yes there's some data loss, still better than
> 100% data loss.
I've actually considered multiple times writing a daemon in Python to do 
this.  In general, I agree that it's mostly policy, and thus should be 
in userspace.  At the very least though, we really should have something 
in the kernel that we can watch from userspace (be it with select, 
epoll, inotify, fanotify, or something else) to tell us when a state 
change happens or the filesystem, as right now the only way I can see to 
do this is to poll the mount options.
>
>> I could possibly understand doing this for something that needs to be
>> guaranteed to come on line when powered on,  but **only** if it notifies
>> responsible parties that there was a problem **and** it is explicitly
>> documented, and even then I'd be wary of doing this unless there was
>> something in place to handle the possibility of false positives (yes, they
>> do happen), and to make certain that the failed hardware got replaced as
>> soon as possible.
>
> Exactly. And I think it's safer to be more aggressive with (fairly)
> immediate metadata replication to remaining devices, than it is with
> data.
>
> I'm considering this behavior for both single volume setups, as well
> as multiple bricks in a cluster. And admittedly it's probably
> cheaper/easier to just get n-way copies of metadata than the above
> scheme I've written.
>
And even then, you would still have people with big arrays who would 
want the metadata re-striped immediately on a device failure.  I will, 
however, be extremely happy when n-way replication hits, as I then will 
not need to stack BTRFS raid1 on top of LVM RAID1 to get higher order 
replication levels.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Global hotspare functionality
  2016-03-29 19:40   ` Yauhen Kharuzhy
@ 2016-03-30 22:17     ` Yauhen Kharuzhy
  2016-04-02  1:17       ` Anand Jain
  0 siblings, 1 reply; 18+ messages in thread
From: Yauhen Kharuzhy @ 2016-03-30 22:17 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs

On Tue, Mar 29, 2016 at 10:40:40PM +0300, Yauhen Kharuzhy wrote:
> Hi.
> 
> I am testing hotspare v2 on kernel v4.4.5 (I will try latest Chris' tree later)
> now with lockdep debugging enabled. At starting of replacement, lockdep warning is displayed,
> because kstrdup() is called with GFP_NOFS inside of rcu_read_lock/unlock()
> block (GFP_NOFS can sleep).

Similar thing in the btrfs_auto_replace_start(): rcu_str_deref() without
rcu_read_lock():

int btrfs_auto_replace_start(struct btrfs_root *root,
                                struct btrfs_device *src_device)
{
        int ret;
        char *tgt_path;

        if (btrfs_get_spare_device(&tgt_path)) {
                btrfs_err(root->fs_info,
                        "No spare device found/configured in the kernel");
                return -EINVAL;
        }

        ret = btrfs_dev_replace_start(root, tgt_path,
                                        src_device->devid,
                                        rcu_str_deref(src_device->name),
                BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_AVOID);
        if (ret)
                btrfs_put_spare_device(tgt_path);

        kfree(tgt_path);

        return 0;
}

[  156.168133] ===============================
[  156.168963] [ INFO: suspicious RCU usage. ]
[  156.169822] 4.4.5-scst31x+ #20 Not tainted
[  156.170656] -------------------------------
[  156.171488] fs/btrfs/dev-replace.c:990 suspicious rcu_dereference_check() usage!
[  156.172920] 
[  156.172920] other info that might help us debug this:
[  156.172920] 
[  156.174825] 
[  156.174825] rcu_scheduler_active = 1, debug_locks = 0
[  156.176152] 1 lock held by btrfs-casualty/4807:
[  156.181917]  #0:  (&fs_info->casualty_mutex){+.+...}, at: [<ffffffffa0165364>] casualty_kthread+0x64/0x390 [btrfs]
[  156.193511] 
[  156.193511] stack backtrace:
[  156.194680] CPU: 0 PID: 4807 Comm: btrfs-casualty Not tainted 4.4.5-scst31x+ #20
[  156.201650] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[  156.219100]  0000000000000000 ffff88005d79fda0 ffffffff813529e3 ffff88005e19c600
[  156.221216]  0000000000000001 ffff88005d79fdd0 ffffffff810d6407 0000000000000000
[  156.224287]  0000000000000000 ffff88005f4a0c00 ffff88005da36000 ffff88005d79fe08
[  156.226375] Call Trace:
[  156.227078]  [<ffffffff813529e3>] dump_stack+0x85/0xc2
[  156.228152]  [<ffffffff810d6407>] lockdep_rcu_suspicious+0xd7/0x110
[  156.229418]  [<ffffffffa01e6236>] btrfs_auto_replace_start+0xa6/0xd0 [btrfs]
[  156.230714]  [<ffffffffa01655c4>] casualty_kthread+0x2c4/0x390 [btrfs]
[  156.231915]  [<ffffffffa016549c>] ? casualty_kthread+0x19c/0x390 [btrfs]
[  156.233105]  [<ffffffffa0165300>] ? btrfs_check_devices+0x200/0x200 [btrfs]
[  156.234339]  [<ffffffff810a70df>] kthread+0xef/0x110
[  156.235309]  [<ffffffff810dc081>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
[  156.236940]  [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200
[  156.239489]  [<ffffffff81637c2f>] ret_from_fork+0x3f/0x70
[  156.240533]  [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200


-- 
Yauhen Kharuzhy

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Global hotspare functionality
  2016-03-29 19:47   ` Yauhen Kharuzhy
  2016-03-29 23:18     ` Yauhen Kharuzhy
@ 2016-04-02  1:15     ` Anand Jain
  2016-04-02  1:33       ` Yauhen Kharuzhy
  2016-04-04 19:32       ` Yauhen Kharuzhy
  1 sibling, 2 replies; 18+ messages in thread
From: Anand Jain @ 2016-04-02  1:15 UTC (permalink / raw)
  To: Yauhen Kharuzhy; +Cc: linux-btrfs



On 03/30/2016 03:47 AM, Yauhen Kharuzhy wrote:
> On Tue, Mar 29, 2016 at 10:41:36PM +0800, Anand Jain wrote:
>>
>> Hi Yauhen,
>>
>
>>>
>>> Issue 2.
>>> At start of autoreplacig drive by hotspare, kernel craches in transaction
>>> handling code (inside of btrfs_commit_transaction() called by autoreplace initiating
>>> routines). I 'fixed' this by removing of closing of bdev in btrfs_close_one_device_dont_free(), see
>>> https://bitbucket.org/jekhor/linux-btrfs/commits/dfa441c9ec7b3833f6a5e4d0b6f8c678faea29bb?at=master
>>> (oops text is attached also). Bdev is closed after replacing by
>>> btrfs_dev_replace_finishing(), so this is safe but doesn't seem
>>> to be right way.
>>
>>   I have sent out V2. I don't see that issue with this,
>>   could you pls try ?
>
> Yes, it reproduced on v4.4.5 kernel. I will try with current
> 'for-linus-4.6' Chris' tree soon.
>
> To emulate a drive failure, I disconnect the drive in VirtualBox, so bdev
> can be freed by kernel after releasing of all references to it.

   So far the raid group profile would adapt to lower suitable
   group profile when device is missing/failed. This appears to
   be not happening with RAID56 OR there are stale IO which wasn't
   flushed out. Anyway to have this fixed I am moving the patch
    btrfs: introduce device dynamic state transition to offline or failed
   to the top in v3 for any potential changes.
   But firstly we need a reliable test case, or a very carefully
   crafted test case which can create this situation

   Here below is the dm-error that I am using for testing, which
   apparently doesn't report this issue. Could you please try on V3. ?
   (pls note the device names are hard coded in the test script
   sorry about that) This would eventually be fstests script.


----
# cat util
run()
{
	local ret

	echo -- ${*} --
	echo ${*} | bash
	ret=$?
	if [ $ret -ne 0 ]; then
		echo
		echo "###### FAILED: RET $ret #####"
		echo
		exit
	fi
	echo
	#echo "OK?"; read
}

runnt()
{
	local ret

	echo -- ${*} --
	echo ${*} | bash
	ret=$?
	echo
	#echo "OK?"; read
}

wipeall()
{
	runnt "wipefs -a /dev/sd[c-h] > /dev/null"
}

create_err_dev_raid1()
{
	dm_backing_dev="/dev/sdd"
	blk_dev_size=`blockdev --getsz $dm_backing_dev`
	dmerror_dev="/dev/mapper/dm-sdd"
	dmlinear_table="0 $blk_dev_size linear $dm_backing_dev 0"
	dmerror_table="0 $blk_dev_size error $dm_backing_dev 0"

	echo -e dm_backing_dev'\t'= $dm_backing_dev
	echo -e blk_dev_size'\t'= $blk_dev_size
	echo -e dmerror_dev'\t'= $dmerror_dev
	echo -e dmlinear_table'\t'= $dmlinear_table
	echo -e dmerror_table'\t'= $dmerror_table
	echo

	runnt "dmsetup remove dm-sdd > /dev/null 2>&1"
	run "dmsetup create dm-sdd --table '${dmlinear_table}'"

	run "mkfs.btrfs -f -draid1 -mraid1 /dev/sdc $dmerror_dev > /dev/null 2>&1"
	run mount /dev/sdc /btrfs
	run "fillfs /btrfs 1000 > /dev/null 2>&1"
	run "dd if=/dev/zero of=/btrfs/tf1 bs=4096 count=100 > /dev/null 2>&1"

	run btrfs fi show

#	run sleep 32

	run dmsetup suspend dm-sdd
	run "dmsetup load dm-sdd --table '$dmerror_table'"
	run dmsetup resume dm-sdd
	run "dd if=/dev/zero of=/btrfs/tf1 bs=4096 count=100 > /dev/null 2>&1"

	run btrfs fi show
}

create_err_dev_raid56()
{
	dm_backing_dev="/dev/sdd"
	blk_dev_size=`blockdev --getsz $dm_backing_dev`
	dmerror_dev="/dev/mapper/dm-sdd"
	dmlinear_table="0 $blk_dev_size linear $dm_backing_dev 0"
	dmerror_table="0 $blk_dev_size error $dm_backing_dev 0"

	echo -e dm_backing_dev'\t'= $dm_backing_dev
	echo -e blk_dev_size'\t'= $blk_dev_size
	echo -e dmerror_dev'\t'= $dmerror_dev
	echo -e dmlinear_table'\t'= $dmlinear_table
	echo -e dmerror_table'\t'= $dmerror_table
	echo

	runnt "dmsetup remove dm-sdd > /dev/null 2>&1"
	run "dmsetup create dm-sdd --table '${dmlinear_table}'"

	run "mkfs.btrfs -f -draid5 -mraid5 /dev/sdc /dev/sdf $dmerror_dev > 
/dev/null 2>&1"
	run mount /dev/sdc /btrfs
	run "fillfs /btrfs 1000 > /dev/null 2>&1"
	run "dd if=/dev/zero of=/btrfs/tf1 bs=4096 count=100 > /dev/null 2>&1"

	run btrfs fi show

#	run sleep 32

	run dmsetup suspend dm-sdd
	run "dmsetup load dm-sdd --table '$dmerror_table'"
	run dmsetup resume dm-sdd
	run "dd if=/dev/zero of=/btrfs/tf1 bs=4096 count=100 > /dev/null 2>&1"

	run btrfs fi show
}

# cat auto-replace-test56
source $(dirname $0)/util

wipeall

run btrfs spare add /dev/sde

#run cat /proc/fs/btrfs/devlist

create_err_dev_raid56
------


Thanks, Anand



> [ 1464.232552] BTRFS info (device sdc): dev_replace from <missing disk> (devid 4) to /dev/sdg started
> [ 1464.255824] BUG: unable to handle kernel NULL pointer dereference at 0000000000000548
> [ 1464.291760] IP: [<ffffffff8131d58d>] generic_make_request_checks+0x4d/0x910
> [ 1464.309746] PGD 5c668067 PUD 5b841067 PMD 0
> [ 1464.326143] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
> [ 1464.340474] Modules linked in: cpufreq_powersave cpufreq_stats cpufreq_userspace cpufreq_conservative softdog nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc ipmi_devintf ipmi_msghandler iosf_mbi crct10dif_pclmul crc32_pclmul sha256_ssse3 sha256_generic hmac drbg iTCO_wdt ansi_cprng iTCO_vendor_support snd_pcm snd_timer aesni_intel snd soundcore psmouse aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd evdev serio_raw pcspkr battery acpi_cpufreq 8250_fintek parport_pc video lpc_ich parport mfd_core tpm_tis tpm ac rng_core processor button i2c_piix4 btrfs xor raid6_pq dm_mod raid1 md_mod sg sd_mod ahci libahci libata pcnet32 crc32c_intel scsi_mod mii
> [ 1464.483244] CPU: 0 PID: 4702 Comm: btrfs-casualty Not tainted 4.4.5-scst31x+ #20
> [ 1464.511300] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
> [ 1464.518035] task: ffff88005e658580 ti: ffff88005e65c000 task.ti: ffff88005e65c000
> [ 1464.543072] RIP: 0010:[<ffffffff8131d58d>]  [<ffffffff8131d58d>] generic_make_request_checks+0x4d/0x910
> [ 1464.579027] RSP: 0018:ffff88005e65f498  EFLAGS: 00010283
> [ 1464.604774] RAX: 0000000000000000 RBX: ffff88005b919f28 RCX: 0000000000030b00
> [ 1464.629544] RDX: 0000000000000080 RSI: 0000000000000781 RDI: ffff88004ecd5ac0
> [ 1464.652763] RBP: ffff88005e65f500 R08: ffff88005b130ff0 R09: 0000000000010000
> [ 1464.674939] R10: ffff88005e674f28 R11: 0000000000000000 R12: 0000000000000080
> [ 1464.691478] R13: 0000000000000004 R14: ffff88004e48de00 R15: 0000000000000010
> [ 1464.714115] FS:  0000000000000000(0000) GS:ffff880066600000(0000) knlGS:0000000000000000
> [ 1464.737302] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 1464.766380] CR2: 0000000000000548 CR3: 000000005723f000 CR4: 00000000000406f0
> [ 1464.804808] Stack:
> [ 1464.814950]  ffffffff813184ae 0000000000000246 0000000000000082 0000000000000000
> [ 1464.847217]  0000000000000000 0000000000000092 0000000000000000 ffff88005e65f540
> [ 1464.879147]  ffff88005b919f28 00000000ffffffff 0000000000000004 ffff88004e48de00
> [ 1464.907440] Call Trace:
> [ 1464.919293]  [<ffffffff813184ae>] ? bvec_alloc+0x5e/0x100
> [ 1464.939019]  [<ffffffff813213a4>] generic_make_request+0x24/0x290
> [ 1464.961775]  [<ffffffff81321677>] submit_bio+0x67/0x140
> [ 1464.971842]  [<ffffffffa02051b9>] finish_rmw+0x409/0x570 [btrfs]
> [ 1464.983700]  [<ffffffffa02053c5>] full_stripe_write+0xa5/0xb0 [btrfs]
> [ 1464.996554]  [<ffffffffa0206d05>] raid56_parity_write+0xf5/0x180 [btrfs]
> [ 1465.012560]  [<ffffffffa01bab95>] btrfs_map_bio+0x105/0x300 [btrfs]
> [ 1465.046907]  [<ffffffffa018f8b3>] ? btrfs_get_extent+0x83/0xb20 [btrfs]
> [ 1465.052462]  [<ffffffffa018d175>] btrfs_submit_bio_hook+0xe5/0x1b0 [btrfs]
> [ 1465.069342]  [<ffffffff810dc081>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
> [ 1465.091031]  [<ffffffffa01a917d>] submit_one_bio+0x6d/0xa0 [btrfs]
> [ 1465.111233]  [<ffffffffa01ae06e>] submit_extent_page+0xee/0x230 [btrfs]
> [ 1465.126076]  [<ffffffffa01ae7f4>] __extent_writepage_io+0x444/0x490 [btrfs]
> [ 1465.132550]  [<ffffffffa01ade10>] ? end_extent_writepage+0x80/0x80 [btrfs]
> [ 1465.145490]  [<ffffffffa01aeaa5>] __extent_writepage+0x265/0x3e0 [btrfs]
> [ 1465.168445]  [<ffffffffa01aef1b>] extent_write_cache_pages.isra.32.constprop.49+0x2fb/0x3d0 [btrfs]
> [ 1465.204094]  [<ffffffffa01b009d>] extent_writepages+0x4d/0x70 [btrfs]
> [ 1465.229627]  [<ffffffffa018f830>] ? btrfs_real_readdir+0x5c0/0x5c0 [btrfs]
> [ 1465.250927]  [<ffffffffa018d5e8>] btrfs_writepages+0x28/0x30 [btrfs]
> [ 1465.274099]  [<ffffffff811afcb1>] do_writepages+0x21/0x30
> [ 1465.298275]  [<ffffffff811a10ea>] __filemap_fdatawrite_range+0xaa/0xf0
> [ 1465.324278]  [<ffffffff811a1203>] filemap_fdatawrite_range+0x13/0x20
> [ 1465.341055]  [<ffffffffa01a1490>] btrfs_fdatawrite_range+0x20/0x50 [btrfs]
> [ 1465.378952]  [<ffffffffa01d377a>] __btrfs_write_out_cache.isra.27+0x3ea/0x430 [btrfs]
> [ 1465.405760]  [<ffffffffa01d4a7f>] btrfs_write_out_cache+0x8f/0x110 [btrfs]
> [ 1465.428091]  [<ffffffffa0176128>] btrfs_write_dirty_block_groups+0x228/0x290 [btrfs]
> [ 1465.458865]  [<ffffffffa0208d6a>] commit_cowonly_roots+0x1f8/0x283 [btrfs]
> [ 1465.480450]  [<ffffffffa018b0d7>] btrfs_commit_transaction+0x577/0xb60 [btrfs]
> [ 1465.512410]  [<ffffffffa0202cf3>] btrfs_dev_replace_start+0x2e3/0x520 [btrfs]
> [ 1465.535358]  [<ffffffffa0202b6e>] ? btrfs_dev_replace_start+0x15e/0x520 [btrfs]
> [ 1465.548049]  [<ffffffffa02038d8>] btrfs_auto_replace_start+0x58/0xd0 [btrfs]
> [ 1465.551787]  [<ffffffffa01834ad>] casualty_kthread+0x2bd/0x340 [btrfs]
> [ 1465.561195]  [<ffffffffa01833d1>] ? casualty_kthread+0x1e1/0x340 [btrfs]
> [ 1465.573308]  [<ffffffffa01831f0>] ? btrfs_check_devices+0x1f0/0x1f0 [btrfs]
> [ 1465.610840]  [<ffffffff810a70df>] kthread+0xef/0x110
> [ 1465.629472]  [<ffffffff810dc081>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
> [ 1465.648678]  [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200
> [ 1465.660686]  [<ffffffff81637c2f>] ret_from_fork+0x3f/0x70
> [ 1465.667065]  [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200
> [ 1465.676861] Code: 67 28 48 c7 c7 6b f8 a3 81 e8 40 09 d9 ff e8 3b 43 31 00 41 c1 ec 09 48 8b 7b 08 45 85 e4 0f 85 13 01 00 00 48 8b 87 f0 00 00 00 <4c> 8b b8 48 05 00 00 4d 85 ff 0f 84 d5 01 00 00 4c 8b af e0 00
> [ 1465.750135] RIP  [<ffffffff8131d58d>] generic_make_request_checks+0x4d/0x910
> [ 1465.776005]  RSP <ffff88005e65f498>
> [ 1465.790848] CR2: 0000000000000548
> [ 1465.797370] ---[ end trace 45545495cd54e799 ]---
>
>
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Global hotspare functionality
  2016-03-30 22:17     ` Yauhen Kharuzhy
@ 2016-04-02  1:17       ` Anand Jain
  0 siblings, 0 replies; 18+ messages in thread
From: Anand Jain @ 2016-04-02  1:17 UTC (permalink / raw)
  To: Yauhen Kharuzhy; +Cc: linux-btrfs



On 03/31/2016 06:17 AM, Yauhen Kharuzhy wrote:
> On Tue, Mar 29, 2016 at 10:40:40PM +0300, Yauhen Kharuzhy wrote:
>> Hi.
>>
>> I am testing hotspare v2 on kernel v4.4.5 (I will try latest Chris' tree later)
>> now with lockdep debugging enabled. At starting of replacement, lockdep warning is displayed,
>> because kstrdup() is called with GFP_NOFS inside of rcu_read_lock/unlock()
>> block (GFP_NOFS can sleep).
>
> Similar thing in the btrfs_auto_replace_start(): rcu_str_deref() without
> rcu_read_lock():
>
> int btrfs_auto_replace_start(struct btrfs_root *root,
>                                  struct btrfs_device *src_device)
> {
>          int ret;
>          char *tgt_path;
>
>          if (btrfs_get_spare_device(&tgt_path)) {
>                  btrfs_err(root->fs_info,
>                          "No spare device found/configured in the kernel");
>                  return -EINVAL;
>          }
>
>          ret = btrfs_dev_replace_start(root, tgt_path,
>                                          src_device->devid,
>                                          rcu_str_deref(src_device->name),

This is fixed in V3.

Thanks, Anand


>                  BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_AVOID);
>          if (ret)
>                  btrfs_put_spare_device(tgt_path);
>
>          kfree(tgt_path);
>
>          return 0;
> }
>
> [  156.168133] ===============================
> [  156.168963] [ INFO: suspicious RCU usage. ]
> [  156.169822] 4.4.5-scst31x+ #20 Not tainted
> [  156.170656] -------------------------------
> [  156.171488] fs/btrfs/dev-replace.c:990 suspicious rcu_dereference_check() usage!
> [  156.172920]
> [  156.172920] other info that might help us debug this:
> [  156.172920]
> [  156.174825]
> [  156.174825] rcu_scheduler_active = 1, debug_locks = 0
> [  156.176152] 1 lock held by btrfs-casualty/4807:
> [  156.181917]  #0:  (&fs_info->casualty_mutex){+.+...}, at: [<ffffffffa0165364>] casualty_kthread+0x64/0x390 [btrfs]
> [  156.193511]
> [  156.193511] stack backtrace:
> [  156.194680] CPU: 0 PID: 4807 Comm: btrfs-casualty Not tainted 4.4.5-scst31x+ #20
> [  156.201650] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
> [  156.219100]  0000000000000000 ffff88005d79fda0 ffffffff813529e3 ffff88005e19c600
> [  156.221216]  0000000000000001 ffff88005d79fdd0 ffffffff810d6407 0000000000000000
> [  156.224287]  0000000000000000 ffff88005f4a0c00 ffff88005da36000 ffff88005d79fe08
> [  156.226375] Call Trace:
> [  156.227078]  [<ffffffff813529e3>] dump_stack+0x85/0xc2
> [  156.228152]  [<ffffffff810d6407>] lockdep_rcu_suspicious+0xd7/0x110
> [  156.229418]  [<ffffffffa01e6236>] btrfs_auto_replace_start+0xa6/0xd0 [btrfs]
> [  156.230714]  [<ffffffffa01655c4>] casualty_kthread+0x2c4/0x390 [btrfs]
> [  156.231915]  [<ffffffffa016549c>] ? casualty_kthread+0x19c/0x390 [btrfs]
> [  156.233105]  [<ffffffffa0165300>] ? btrfs_check_devices+0x200/0x200 [btrfs]
> [  156.234339]  [<ffffffff810a70df>] kthread+0xef/0x110
> [  156.235309]  [<ffffffff810dc081>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
> [  156.236940]  [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200
> [  156.239489]  [<ffffffff81637c2f>] ret_from_fork+0x3f/0x70
> [  156.240533]  [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200
>
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Global hotspare functionality
  2016-04-02  1:15     ` Anand Jain
@ 2016-04-02  1:33       ` Yauhen Kharuzhy
  2016-04-02  1:38         ` Anand Jain
  2016-04-04 19:32       ` Yauhen Kharuzhy
  1 sibling, 1 reply; 18+ messages in thread
From: Yauhen Kharuzhy @ 2016-04-02  1:33 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs

On Sat, Apr 02, 2016 at 09:15:56AM +0800, Anand Jain wrote:
> 
> 
> On 03/30/2016 03:47 AM, Yauhen Kharuzhy wrote:
> >On Tue, Mar 29, 2016 at 10:41:36PM +0800, Anand Jain wrote:
> >>
> >>Hi Yauhen,
> >>
> >
> >>>
> >>>Issue 2.
> >>>At start of autoreplacig drive by hotspare, kernel craches in transaction
> >>>handling code (inside of btrfs_commit_transaction() called by autoreplace initiating
> >>>routines). I 'fixed' this by removing of closing of bdev in btrfs_close_one_device_dont_free(), see
> >>>https://bitbucket.org/jekhor/linux-btrfs/commits/dfa441c9ec7b3833f6a5e4d0b6f8c678faea29bb?at=master
> >>>(oops text is attached also). Bdev is closed after replacing by
> >>>btrfs_dev_replace_finishing(), so this is safe but doesn't seem
> >>>to be right way.
> >>
> >>  I have sent out V2. I don't see that issue with this,
> >>  could you pls try ?
> >
> >Yes, it reproduced on v4.4.5 kernel. I will try with current
> >'for-linus-4.6' Chris' tree soon.
> >
> >To emulate a drive failure, I disconnect the drive in VirtualBox, so bdev
> >can be freed by kernel after releasing of all references to it.
> 
>   So far the raid group profile would adapt to lower suitable
>   group profile when device is missing/failed. This appears to
>   be not happening with RAID56 OR there are stale IO which wasn't
>   flushed out. Anyway to have this fixed I am moving the patch
>    btrfs: introduce device dynamic state transition to offline or failed
>   to the top in v3 for any potential changes.
>   But firstly we need a reliable test case, or a very carefully
>   crafted test case which can create this situation
> 
>   Here below is the dm-error that I am using for testing, which
>   apparently doesn't report this issue. Could you please try on V3. ?
>   (pls note the device names are hard coded in the test script
>   sorry about that) This would eventually be fstests script.

Sure. But I don't see any V3 patches in the list. Are you still
preparing to send them or I missed something?


-- 
Yauhen Kharuzhy

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Global hotspare functionality
  2016-04-02  1:33       ` Yauhen Kharuzhy
@ 2016-04-02  1:38         ` Anand Jain
  0 siblings, 0 replies; 18+ messages in thread
From: Anand Jain @ 2016-04-02  1:38 UTC (permalink / raw)
  To: Yauhen Kharuzhy; +Cc: linux-btrfs



On 04/02/2016 09:33 AM, Yauhen Kharuzhy wrote:
> On Sat, Apr 02, 2016 at 09:15:56AM +0800, Anand Jain wrote:
>>
>>
>> On 03/30/2016 03:47 AM, Yauhen Kharuzhy wrote:
>>> On Tue, Mar 29, 2016 at 10:41:36PM +0800, Anand Jain wrote:
>>>>
>>>> Hi Yauhen,
>>>>
>>>
>>>>>
>>>>> Issue 2.
>>>>> At start of autoreplacig drive by hotspare, kernel craches in transaction
>>>>> handling code (inside of btrfs_commit_transaction() called by autoreplace initiating
>>>>> routines). I 'fixed' this by removing of closing of bdev in btrfs_close_one_device_dont_free(), see
>>>>> https://bitbucket.org/jekhor/linux-btrfs/commits/dfa441c9ec7b3833f6a5e4d0b6f8c678faea29bb?at=master
>>>>> (oops text is attached also). Bdev is closed after replacing by
>>>>> btrfs_dev_replace_finishing(), so this is safe but doesn't seem
>>>>> to be right way.
>>>>
>>>>   I have sent out V2. I don't see that issue with this,
>>>>   could you pls try ?
>>>
>>> Yes, it reproduced on v4.4.5 kernel. I will try with current
>>> 'for-linus-4.6' Chris' tree soon.
>>>
>>> To emulate a drive failure, I disconnect the drive in VirtualBox, so bdev
>>> can be freed by kernel after releasing of all references to it.
>>
>>    So far the raid group profile would adapt to lower suitable
>>    group profile when device is missing/failed. This appears to
>>    be not happening with RAID56 OR there are stale IO which wasn't
>>    flushed out. Anyway to have this fixed I am moving the patch
>>     btrfs: introduce device dynamic state transition to offline or failed
>>    to the top in v3 for any potential changes.
>>    But firstly we need a reliable test case, or a very carefully
>>    crafted test case which can create this situation
>>
>>    Here below is the dm-error that I am using for testing, which
>>    apparently doesn't report this issue. Could you please try on V3. ?
>>    (pls note the device names are hard coded in the test script
>>    sorry about that) This would eventually be fstests script.
>
> Sure. But I don't see any V3 patches in the list. Are you still
> preparing to send them or I missed something?

  Its out now. There was a little distraction when I was about to send it.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Global hotspare functionality
  2016-04-02  1:15     ` Anand Jain
  2016-04-02  1:33       ` Yauhen Kharuzhy
@ 2016-04-04 19:32       ` Yauhen Kharuzhy
  2016-04-12 14:16         ` Anand Jain
  1 sibling, 1 reply; 18+ messages in thread
From: Yauhen Kharuzhy @ 2016-04-04 19:32 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 1950 bytes --]

2016-04-01 18:15 GMT-07:00 Anand Jain <anand.jain@oracle.com>:
>>>> Issue 2.
>>>> At start of autoreplacig drive by hotspare, kernel craches in
>>>> transaction
>>>> handling code (inside of btrfs_commit_transaction() called by
>>>> autoreplace initiating
>>>> routines). I 'fixed' this by removing of closing of bdev in
>>>> btrfs_close_one_device_dont_free(), see
>>>>
>>>> https://bitbucket.org/jekhor/linux-btrfs/commits/dfa441c9ec7b3833f6a5e4d0b6f8c678faea29bb?at=master
>>>> (oops text is attached also). Bdev is closed after replacing by
>>>> btrfs_dev_replace_finishing(), so this is safe but doesn't seem
>>>> to be right way.
>>>
>>>
>>>   I have sent out V2. I don't see that issue with this,
>>>   could you pls try ?
>>
>>
>> Yes, it reproduced on v4.4.5 kernel. I will try with current
>> 'for-linus-4.6' Chris' tree soon.
>>
>> To emulate a drive failure, I disconnect the drive in VirtualBox, so bdev
>> can be freed by kernel after releasing of all references to it.
>
>
>   So far the raid group profile would adapt to lower suitable
>   group profile when device is missing/failed. This appears to
>   be not happening with RAID56 OR there are stale IO which wasn't
>   flushed out. Anyway to have this fixed I am moving the patch
>    btrfs: introduce device dynamic state transition to offline or failed
>   to the top in v3 for any potential changes.
>   But firstly we need a reliable test case, or a very carefully
>   crafted test case which can create this situation
>
>   Here below is the dm-error that I am using for testing, which
>   apparently doesn't report this issue. Could you please try on V3. ?
>   (pls note the device names are hard coded in the test script
>   sorry about that) This would eventually be fstests script.

Hi,

I have reproduced this oops with attached script. I don't use any dm
layer, but just detach drive at scsi layer as xfstests do (device
management functions were copy-pasted from it).

[-- Attachment #2: test-autoreplace2-mainline.sh --]
[-- Type: application/x-sh, Size: 2565 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Global hotspare functionality
  2016-04-04 19:32       ` Yauhen Kharuzhy
@ 2016-04-12 14:16         ` Anand Jain
  0 siblings, 0 replies; 18+ messages in thread
From: Anand Jain @ 2016-04-12 14:16 UTC (permalink / raw)
  To: Yauhen Kharuzhy; +Cc: linux-btrfs



On 04/05/2016 03:32 AM, Yauhen Kharuzhy wrote:
> 2016-04-01 18:15 GMT-07:00 Anand Jain <anand.jain@oracle.com>:
>>>>> Issue 2.
>>>>> At start of autoreplacig drive by hotspare, kernel craches in
>>>>> transaction
>>>>> handling code (inside of btrfs_commit_transaction() called by
>>>>> autoreplace initiating
>>>>> routines). I 'fixed' this by removing of closing of bdev in
>>>>> btrfs_close_one_device_dont_free(), see
>>>>>
>>>>> https://bitbucket.org/jekhor/linux-btrfs/commits/dfa441c9ec7b3833f6a5e4d0b6f8c678faea29bb?at=master
>>>>> (oops text is attached also). Bdev is closed after replacing by
>>>>> btrfs_dev_replace_finishing(), so this is safe but doesn't seem
>>>>> to be right way.
>>>>
>>>>
>>>>    I have sent out V2. I don't see that issue with this,
>>>>    could you pls try ?
>>>
>>>
>>> Yes, it reproduced on v4.4.5 kernel. I will try with current
>>> 'for-linus-4.6' Chris' tree soon.
>>>
>>> To emulate a drive failure, I disconnect the drive in VirtualBox, so bdev
>>> can be freed by kernel after releasing of all references to it.
>>
>>
>>    So far the raid group profile would adapt to lower suitable
>>    group profile when device is missing/failed. This appears to
>>    be not happening with RAID56 OR there are stale IO which wasn't
>>    flushed out. Anyway to have this fixed I am moving the patch
>>     btrfs: introduce device dynamic state transition to offline or failed
>>    to the top in v3 for any potential changes.
>>    But firstly we need a reliable test case, or a very carefully
>>    crafted test case which can create this situation
>>
>>    Here below is the dm-error that I am using for testing, which
>>    apparently doesn't report this issue. Could you please try on V3. ?
>>    (pls note the device names are hard coded in the test script
>>    sorry about that) This would eventually be fstests script.
>
> Hi,
>
> I have reproduced this oops with attached script. I don't use any dm
> layer, but just detach drive at scsi layer as xfstests do (device
> management functions were copy-pasted from it).

  Nice. I was able reproduce this (also found lock dep issue when running
  this, since it was in the original code a separate patch was sent
  ou). The issue was due to that bdev wasn't null, to fix this the
  btrfs_device_enforce_state() is changed quite a bit. V4 is out.

Thanks, Anand


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2016-04-12 14:16 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-18 19:39 Global hotspare functionality Yauhen Kharuzhy
2016-03-19  1:17 ` Yauhen Kharuzhy
2016-03-29 14:43   ` Anand Jain
2016-03-29 14:41 ` Anand Jain
2016-03-29 19:24   ` Yauhen Kharuzhy
2016-03-29 19:59     ` Austin S. Hemmelgarn
2016-03-29 20:26       ` Chris Murphy
2016-03-30 11:26         ` Austin S. Hemmelgarn
2016-03-29 19:40   ` Yauhen Kharuzhy
2016-03-30 22:17     ` Yauhen Kharuzhy
2016-04-02  1:17       ` Anand Jain
2016-03-29 19:47   ` Yauhen Kharuzhy
2016-03-29 23:18     ` Yauhen Kharuzhy
2016-04-02  1:15     ` Anand Jain
2016-04-02  1:33       ` Yauhen Kharuzhy
2016-04-02  1:38         ` Anand Jain
2016-04-04 19:32       ` Yauhen Kharuzhy
2016-04-12 14:16         ` Anand Jain

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.