From: Marc MERLIN <marc@merlins.org>
To: Chris Murphy <lists@colorremedies.com>
Cc: Btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: How to handle a RAID5 arrawy with a failing drive?
Date: Mon, 17 Mar 2014 09:13:07 -0700
Message-ID: <20140317161307.GJ6143@merlins.org>
In-Reply-To: <B593C5F2-5CE0-428C-97FB-A75AB0B11254@colorremedies.com>

On Sun, Mar 16, 2014 at 11:12:43PM -0600, Chris Murphy wrote:
> 
> On Mar 16, 2014, at 9:44 PM, Marc MERLIN <marc@merlins.org> wrote:
> 
> > On Sun, Mar 16, 2014 at 08:56:35PM -0600, Chris Murphy wrote:
> > 
> >>> If I add a device, isn't it going to grow my raid to make it bigger instead
> >>> of trying to replace the bad device?
> >> 
> >> Yes if it's successful. No if it fails which is the problem I'm having.
> > 
> > That's where I don't follow you.
> > You just agreed that it will grow my raid.
> > So right now it's 4.5TB with 10 drives, if I add one drive, it will grow to
> > 5TB with 11 drives.
> > How does that help?
> 
> If you swap the faulty drive for a good drive, I'm thinking then you'll be able to device delete the bad device, which ought to be "missing" at that point; or if that fails you should be able to do a balance, and then be able to device delete the faulty drive.
> 
> The problem I'm having is that when I detach one device out of a 3 device raid5, btrfs fi show doesn't list it as missing. It's listed without the /dev/sdd designation it had when attached, but now it's just blank.

Ok, I tried unmounting and remounting degraded this morning:

polgara:~# mount -v -t btrfs -o compress=zlib,space_cache,noatime,degraded LABEL=backupcopy /mnt/btrfs_backupcopy
Mar 17 08:57:35 polgara kernel: [123824.344085] BTRFS: device label backupcopy devid 9 transid 3837 /dev/mapper/crypt_sdk1
Mar 17 08:57:35 polgara kernel: [123824.454641] BTRFS info (device dm-9): allowing degraded mounts
Mar 17 08:57:35 polgara kernel: [123824.454978] BTRFS info (device dm-9): disk space caching is enabled
Mar 17 08:57:35 polgara kernel: [123824.497437] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 3888, rd 321927975, flush 0, corrupt 0, gen 0
/dev/mapper/crypt_sdk1 on /mnt/btrfs_backupcopy type btrfs (rw,noatime,compress=zlib,space_cache,degraded)

What's confusing is that mounting in degraded mode shows all devices:
polgara:~# btrfs fi show
Label: backupcopy  uuid: 7d8e1197-69e4-40d8-8d86-278d275af896
        Total devices 10 FS bytes used 376.27GiB
        devid    1 size 465.76GiB used 42.42GiB path /dev/dm-0
        devid    2 size 465.76GiB used 42.40GiB path /dev/dm-1
        devid    3 size 465.75GiB used 42.40GiB path /dev/mapper/crypt_sde1 << this is missing
        devid    4 size 465.76GiB used 42.40GiB path /dev/dm-3
        devid    5 size 465.76GiB used 42.40GiB path /dev/dm-4
        devid    6 size 465.76GiB used 42.40GiB path /dev/dm-5
        devid    7 size 465.76GiB used 42.40GiB path /dev/dm-6
        devid    8 size 465.76GiB used 42.40GiB path /dev/mapper/crypt_sdj1
        devid    9 size 465.76GiB used 42.40GiB path /dev/mapper/crypt_sdk1
        devid    10 size 465.76GiB used 42.40GiB path /dev/dm-8

Ok, so mount in degraded mode works.
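
The listing above mixes /dev/dm-N and /dev/mapper/crypt_* paths, which makes
it hard to match kernel messages to drives. A minimal sketch for printing the
mapping, assuming the entries under /dev/mapper are udev-managed symlinks to
../dm-N (on some setups they are plain block nodes instead, and this prints
nothing for them). The function name list_dm_names is made up for
illustration:

```shell
#!/bin/sh
# Print "mapper-name -> dm-N" for every symlink in a device-mapper directory.
# The directory defaults to /dev/mapper; the optional argument only exists so
# the function can be pointed somewhere else (e.g. a scratch directory).
list_dm_names() {
    dir="${1:-/dev/mapper}"
    for link in "$dir"/*; do
        # Skip anything that is not a symlink, such as /dev/mapper/control.
        [ -L "$link" ] || continue
        target=$(readlink "$link")          # e.g. ../dm-2
        printf '%s -> %s\n' "$(basename "$link")" "$(basename "$target")"
    done
}
```

Something like `list_dm_names | grep crypt_sde1` would then show which dm-N
node backs a given mapper name.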

Adding a new device failed though:
polgara:~# btrfs device add /dev/mapper/crypt_sdm1 /mnt/btrfs_backupcopy
BTRFS: bad tree block start 852309604880683448 156237824
------------[ cut here ]------------
WARNING: CPU: 0 PID: 1963 at fs/btrfs/super.c:257 __btrfs_abort_transaction+0x50/0x100()
BTRFS: Transaction aborted (error -5)
Modules linked in: xts gf128mul ipt_MASQUERADE ipt_REJECT xt_tcpudp xt_conntrack xt_LOG iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables cpufreq_userspace cpufreq_powersave cpufreq_conservative cpufreq_stats ppdev rfcomm bnep autofs4 binfmt_misc uinput nfsd auth_rpcgss nfs_acl nfs lockd fscache sunrpc fuse dm_crypt dm_mod configs parport_pc lp parport input_polldev loop firewire_sbp2 firewire_core crc_itu_t ecryptfs btusb bluetooth 6lowpan_iphc rfkill usbkbd usbmouse joydev hid_generic usbhid hid iTCO_wdt iTCO_vendor_support gpio_ich coretemp kvm_intel kvm microcode snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec pcspkr snd_hwdep i2c_i801 snd_pcm_oss snd_mixer_oss lpc_ich snd_pcm snd_seq_midi snd_seq_midi_event sg sr_mod cdrom snd_rawmidi snd_seq snd_seq_device snd_timer atl1 mii mvsas snd nouveau libsas scsi_transport_
soundcore ttm ehci_pci asus_atk0110 floppy uhci_hcd ehci_hcd usbcore acpi_cpufreq usb_common processor evdev
CPU: 0 PID: 1963 Comm: btrfs Tainted: G        W    3.14.0-rc5-amd64-i915-preempt-20140216c #1
Hardware name: System manufacturer P5KC/P5KC, BIOS 0502    05/24/2007
 0000000000000000 ffff88004b5c9988 ffffffff816090b3 ffff88004b5c99d0
 ffff88004b5c99c0 ffffffff81050025 ffffffff8120913a 00000000fffffffb
 ffff8800144d5800 ffff88007bd3ba00 ffffffff81839280 ffff88004b5c9a20
Call Trace:
 [<ffffffff816090b3>] dump_stack+0x4e/0x7a
 [<ffffffff81050025>] warn_slowpath_common+0x7f/0x98
 [<ffffffff8120913a>] ? __btrfs_abort_transaction+0x50/0x100
 [<ffffffff8105008a>] warn_slowpath_fmt+0x4c/0x4e
 [<ffffffff8120913a>] __btrfs_abort_transaction+0x50/0x100
 [<ffffffff81216fed>] __btrfs_free_extent+0x6ce/0x712
 [<ffffffff8121bc89>] __btrfs_run_delayed_refs+0x939/0xbdf
 [<ffffffff8121dac8>] btrfs_run_delayed_refs+0x81/0x18f
 [<ffffffff8122aeb2>] btrfs_commit_transaction+0xeb/0x849
 [<ffffffff8124e777>] btrfs_init_new_device+0x9a1/0xc00
 [<ffffffff8114069b>] ? ____cache_alloc+0x1c/0x29b
 [<ffffffff81129d3e>] ? mem_cgroup_end_update_page_stat+0x17/0x26
 [<ffffffff8125570f>] ? btrfs_ioctl+0x989/0x24b1
 [<ffffffff81141096>] ? __kmalloc_track_caller+0x130/0x144
 [<ffffffff8125570f>] ? btrfs_ioctl+0x989/0x24b1
 [<ffffffff81255730>] btrfs_ioctl+0x9aa/0x24b1
 [<ffffffff81611e15>] ? __do_page_fault+0x330/0x3df
 [<ffffffff8116da43>] ? mntput_no_expire+0x33/0x12b
 [<ffffffff81163b16>] do_vfs_ioctl+0x3d2/0x41d
 [<ffffffff8115676b>] ? ____fput+0xe/0x10
 [<ffffffff8106973a>] ? task_work_run+0x87/0x98
 [<ffffffff81163bb8>] SyS_ioctl+0x57/0x82
 [<ffffffff81611ed2>] ? do_page_fault+0xe/0x10
 [<ffffffff816154ad>] system_call_fastpath+0x1a/0x1f
---[ end trace 7d08b9b7f2f17b38 ]---
BTRFS: error (device dm-9) in __btrfs_free_extent:5755: errno=-5 IO failure
BTRFS info (device dm-9): forced readonly
ERROR: error adding the device '/dev/mapper/crypt_sdm1' - Input/output error
polgara:~# Mar 17 09:07:14 polgara kernel: [124403.240880] BTRFS: error (device dm-9) in btrfs_run_delayed_refs:2713: errno=-5 IO failure

Mmmh, dm-9 is another device, although it seems to work:
polgara:~# dd if=/dev/dm-9 of=/dev/null bs=1M
^C1255+0 records in
1254+0 records out
1314914304 bytes (1.3 GB) copied, 15.169 s, 86.7 MB/s

polgara:~# btrfs device stats /dev/dm-9
[/dev/mapper/crypt_sdk1].write_io_errs   0
[/dev/mapper/crypt_sdk1].read_io_errs    0
[/dev/mapper/crypt_sdk1].flush_io_errs   0
[/dev/mapper/crypt_sdk1].corruption_errs 0
[/dev/mapper/crypt_sdk1].generation_errs 0
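
With ten devices it is easier to look only at counters that are not zero.
A minimal sketch, assuming the two-column "[device].counter value" layout
shown above; the function name nonzero_stats is made up for illustration:

```shell
#!/bin/sh
# Read `btrfs device stats` output on stdin and print only the counters
# whose value is non-zero, as "[device].counter value".
nonzero_stats() {
    # NF == 2 skips blank or unexpected lines; "$2 + 0" forces a numeric test.
    awk 'NF == 2 && $2 + 0 != 0 { print $1, $2 }'
}
```

For example: `btrfs device stats /mnt/btrfs_backupcopy | nonzero_stats`.
No output means every counter on every device is still zero.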


I also started getting errors on the failing device after hours of use last
night (pasted below). I'm not sure whether I really have a 2nd device problem
or not.

Note that /dev/mapper/crypt_sde1 is dm-2:

BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1
BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1
BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1
BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1
BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1
BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1
BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1
BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1
BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1
quiet_error: 123 callbacks suppressed
Buffer I/O error on device dm-2, logical block 16
Buffer I/O error on device dm-2, logical block 16384
Buffer I/O error on device dm-2, logical block 67108864
Buffer I/O error on device dm-2, logical block 16
Buffer I/O error on device dm-2, logical block 16384
Buffer I/O error on device dm-2, logical block 67108864
BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1
BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1
BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1
Buffer I/O error on device dm-2, logical block 0
Buffer I/O error on device dm-2, logical block 1
Buffer I/O error on device dm-2, logical block 2
Buffer I/O error on device dm-2, logical block 3
Buffer I/O error on device dm-2, logical block 0
Buffer I/O error on device dm-2, logical block 122095101
Buffer I/O error on device dm-2, logical block 122095101
Buffer I/O error on device dm-2, logical block 0
Buffer I/O error on device dm-2, logical block 0
btrfs_dev_stat_print_on_error: 366 callbacks suppressed
btrfs_dev_stat_print_on_error: 346 callbacks suppressed
btrfs_dev_stat_print_on_error: 606 callbacks suppressed
btrfs_dev_stat_print_on_error: 276 callbacks suppressed
BTRFS: bad tree block start 16817792799093053571 2701656064
BTRFS: bad tree block start 16817792799093053571 2701656064
BTRFS: bad tree block start 16817792799093053571 2701656064
BTRFS: bad tree block start 16817792799093053571 2701656064
BTRFS: bad tree block start 16817792799093053571 2701656064
BTRFS: bad tree block start 16817792799093053571 2701656064
BTRFS: bad tree block start 16817792799093053571 2701656064
BTRFS: bad tree block start 16817792799093053571 2701656064
BTRFS: bad tree block start 16817792799093053571 2701656064
BTRFS: bad tree block start 16817792799093053571 2701656064
btrfs_dev_stat_print_on_error: 11469 callbacks suppressed
btree_readpage_end_io_hook: 31227 callbacks suppressed
BTRFS: bad tree block start 16817792799093053571 2701656064
BTRFS: bad tree block start 16817792799093053571 2701656064
BTRFS: bad tree block start 16817792799093053571 2701656064

Eventually it turned into:
BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 3891, rd 321927996, flush 0, corrupt 0, gen 0
BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 3891, rd 321927997, flush 0, corrupt 0, gen 0
BTRFS: bad tree block start 17271740454546054736 1265680384
------------[ cut here ]------------
WARNING: CPU: 1 PID: 10414 at fs/btrfs/super.c:257 __btrfs_abort_transaction+0x50/0x100()
BTRFS: Transaction aborted (error -5)
Modules linked in: xts gf128mul ipt_MASQUERADE ipt_REJECT xt_tcpudp xt_conntrack xt_LOG iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables cpufreq_userspace cpufreq_powersave cpufreq_conservative cpufreq_stats ppdev rfcomm bnep autofs4 binfmt_misc uinput nfsd auth_rpcgss nfs_acl nfs lockd fscache sunrpc fuse dm_crypt dm_mod configs parport_pc lp parport input_polldev loop firewire_sbp2 firewire_core crc_itu_t ecryptfs btusb bluetooth 6lowpan_iphc rfkill usbkbd usbmouse joydev hid_generic usbhid hid iTCO_wdt iTCO_vendor_support gpio_ich coretemp kvm_intel kvm microcode snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec pcspkr snd_hwdep i2c_i801 snd_pcm_oss snd_mixer_oss lpc_ich snd_pcm snd_seq_midi snd_seq_midi_event sg sr_mod cdrom snd_rawmidi snd_seq snd_seq_device snd_timer atl1 mii mvsas snd nouveau libsas scsi_transport_
soundcore ttm ehci_pci asus_atk0110 floppy uhci_hcd ehci_hcd usbcore acpi_cpufreq usb_common processor evdev
CPU: 1 PID: 10414 Comm: btrfs-transacti Not tainted 3.14.0-rc5-amd64-i915-preempt-20140216c #1
Hardware name: System manufacturer P5KC/P5KC, BIOS 0502    05/24/2007
 0000000000000000 ffff88004ae4fb30 ffffffff816090b3 ffff88004ae4fb78
 ffff88004ae4fb68 ffffffff81050025 ffffffff8120913a 00000000fffffffb
 ffff88004f2e7800 ffff8800603804c0 ffffffff81839280 ffff88004ae4fbc8
Call Trace:
 [<ffffffff816090b3>] dump_stack+0x4e/0x7a
 [<ffffffff81050025>] warn_slowpath_common+0x7f/0x98
 [<ffffffff8120913a>] ? __btrfs_abort_transaction+0x50/0x100
 [<ffffffff8105008a>] warn_slowpath_fmt+0x4c/0x4e
 [<ffffffff8120913a>] __btrfs_abort_transaction+0x50/0x100
 [<ffffffff81216fed>] __btrfs_free_extent+0x6ce/0x712
 [<ffffffff8121bc89>] __btrfs_run_delayed_refs+0x939/0xbdf
 [<ffffffff8121dac8>] btrfs_run_delayed_refs+0x81/0x18f
 [<ffffffff8122ae40>] btrfs_commit_transaction+0x79/0x849
 [<ffffffff812277ca>] transaction_kthread+0xf8/0x1ab
 [<ffffffff812276d2>] ? btrfs_cleanup_transaction+0x43f/0x43f
 [<ffffffff8106bc56>] kthread+0xae/0xb6
 [<ffffffff8106bba8>] ? __kthread_parkme+0x61/0x61
 [<ffffffff816153fc>] ret_from_fork+0x7c/0xb0
 [<ffffffff8106bba8>] ? __kthread_parkme+0x61/0x61
---[ end trace 7d08b9b7f2f17b35 ]---
BTRFS: error (device dm-9) in __btrfs_free_extent:5755: errno=-5 IO failure
BTRFS info (device dm-9): forced readonly
BTRFS: error (device dm-9) in btrfs_run_delayed_refs:2713: errno=-5 IO failure
------------[ cut here ]------------
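
The "BTRFS: bdev ... errs:" lines above carry running write and read error
counters per device. A minimal sketch for pulling them out of a kernel log to
see how fast they climb, assuming the message layout shown in this mail; the
function name bdev_errs is made up for illustration:

```shell
#!/bin/sh
# Read kernel log text on stdin and print "device wr rd" for every
# "BTRFS: bdev <device> errs: wr N, rd N, ..." line.
bdev_errs() {
    awk '/BTRFS: bdev .* errs:/ {
        dev = ""; wr = ""; rd = ""
        for (i = 1; i < NF; i++) {
            if ($i == "bdev") dev = $(i + 1)
            if ($i == "wr")   wr  = $(i + 1)
            if ($i == "rd")   rd  = $(i + 1)
        }
        # Counter values are followed by a comma in the log; strip it.
        sub(/,$/, "", wr)
        sub(/,$/, "", rd)
        print dev, wr, rd
    }'
}
```

For example: `dmesg | bdev_errs`. Lines that do not match the pattern,
such as the stack traces, are ignored.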


-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  
