From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-oi0-f47.google.com ([209.85.218.47]:34563 "EHLO mail-oi0-f47.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932604AbcJYB0Q (ORCPT ); Mon, 24 Oct 2016 21:26:16 -0400
Received: by mail-oi0-f47.google.com with SMTP id t73so67847189oie.1 for ; Mon, 24 Oct 2016 18:26:15 -0700 (PDT)
Date: Mon, 24 Oct 2016 21:26:06 -0400
From: Nicholas Steeves
To: Stefan Malte Schumacher
Cc: Btrfs BTRFS
Subject: Re: Scrubbing Errors after restoring backup
Message-ID: <20161025012606.GA6752@DigitalMercury.dynalias.net>
References:
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha512; protocol="application/pgp-signature"; boundary="RnlQjJ0d97Da+TV1"
In-Reply-To:
Sender: linux-btrfs-owner@vger.kernel.org
List-ID:

--RnlQjJ0d97Da+TV1
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On 24 October 2016 at 17:53, Stefan Malte Schumacher wrote:
> Hello
>
> For reference please see this post.
> https://mail-archive.com/linux-btrfs@vger.kernel.org/msg58461.html
> Please note that I downgraded to btrfs-progs 4.6.1 as advised.
>
> After exchanging the malfunctioning drive I re-created the filesystem
> and restored the backup from my NAS. (I didnt entirely trust the
> filesystem after so many errors) On completing the restoration I
> manually started scrubbing, which ended with hundreds of checksum and
> read errors on /dev/sda.
> The drive checks out fine in smart and passed through all scheduled
> SMART Self-Tests. The model is not identical to the two drives
> recently added to the system - the new drives are WD Blue, the four
> original ones are WD Greens.
>
> I have resetted the output from btrfs dev stats and restarted the
> scrubbing process. I am unsure how to interpret or explain the errors
> of the last scrub run. I scrubbed regularly each month for nearly
> three years and never had any errors.
> I would be grateful for any
> advice how to proceed.
>
> Yours sincerely
> Stefan

Hi Stefan,

What kernel version are you using? Was the backup a file-level archive
or a btrfs send stream?

I'm confused about the evolution of your hardware. Did you originally
have a four-disk raid1, or a six-disk raid1? The one that failed was
/dev/sdf, which seems to suggest:

/dev/sdc - WD green
/dev/sdd - WD green
/dev/sde - WD green
/dev/sdf - WD green <- failed

I would expect that the new volume is something like:

/dev/sdc - New unnamed model or 3 year old WD Green?
/dev/sdd - New unnamed model or 3 year old WD Green?
/dev/sde - New WD Blue
/dev/sdf - New WD Blue

Did you move the sata cables to use:

/dev/sda - Unknown. New disk or 3 year old disk?
/dev/sdb - Unknown. New disk or 3 year old disk?
/dev/sdc - New WD Blue
/dev/sdd - New WD Blue

And this is a freshly-created btrfs volume?

When you restored from backup, your hard drive firmware should have
detected any bad sectors and relocated the write to a reserve sector,
and I'm assuming none of the logs have anything in them that would
indicate a failed write. If sda is from the 3 year old batch of WD
greens I would distrust it.

Frequent culprits of similar problems are flaky sata cables or a flaky
PSU. In the case of flaky sata cables, dmesg (usually?) shows PHY and
"hard resetting link" errors. I also wonder if the sata0 port on your
motherboard might be bad. The only reason I mention this is because
I've seen two H67/P67 cougarpoint chipset motherboards lose their sata0
channel. It also happens with other brands' chipsets...
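A minimal sketch of how the dmesg check mentioned above could be automated: it counts "hard resetting link" lines per ata port in kernel log text that you pass in (for example, saved dmesg output). The function name is made up for illustration, and the pattern assumes the port is printed as "ataN: " directly before the message; adjust it if your kernel formats these lines differently.

```python
import re

# Assumption: link resets are logged as "ataN: hard resetting link".
RESET_RE = re.compile(r"\b(ata\d+): hard resetting link")

def count_link_resets(log_text):
    """Return {port: number of 'hard resetting link' lines} for the given log text."""
    counts = {}
    for line in log_text.splitlines():
        match = RESET_RE.search(line)
        if match:
            port = match.group(1)
            counts[port] = counts.get(port, 0) + 1
    return counts

if __name__ == "__main__":
    import sys
    # Usage (hypothetical file name): dmesg > dmesg.txt; python3 resets.py dmesg.txt
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for port, n in sorted(count_link_resets(handle.read()).items()):
            print(port, n)
```

A port that accumulates resets while the others stay at zero points at that port's cable or connector rather than at the PSU, which is what the cable-shuffling tests are meant to narrow down.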
Whatever the case, when stuff like this happened to me I've always used
something like a combination of a cpuburnP6 per logical CPU, memtester
(in Linux; do this after a clean 24h memtest86+ run), a huge and
intense bonnie++ run, with as many things plugged into the USB ports as
possible--including charging at least one high-power device--while
burning a DVD and/or running something that stresses the GPU...to try
to shake down potential PSU issues. Maybe passmark (under Linux) has
similar functionality with an easier interface?

I've also used parallel diskscan (https://github.com/baruch/diskscan)
runs to test old disks and to check for statistical anomalies. If you
do:

1. use tape to number your cables; record which drives are connected
   into which sata ports with which cables. Do simultaneous runs of
   diskscan on /dev/disk/by-id/$relevant_disks, check dmesg, and record
   the results.
2. unplug sata cables from drives and shuffle; document specifics and
   test.
3. unplug sata cables from motherboard and shuffle; document specifics
   and test.

Given how little new sata cables cost, you might as well just buy new
ones, because then these tests can also be used to check for bad ones
among the new cables. It's a better use of time, because it's possible
that you'll detect a bad cable, replace it, test the new cable, and
find out that the new cable is defective. Fountain of Bad Luck™ <- If
something can fail, it will fail when I use it ;-)

That said, I've never tested a WD green drive...the reds' performance
smoothly decreases towards the end of the drive (outer tracks are quite
a bit faster than inner tracks). For all I know the greens have erratic
performance baked into their power-saving design... If a particular
drive's test consistently shows a latency spike at the same location,
that can indicate a relocated bad sector. Does anyone know if this
method reliably indicates when a drive is lying about its SMART 5
Reallocated_Sector_Ct report?
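To make the "latency spike at the same location" idea concrete, here is a rough sketch, not diskscan's own logic or output format. It assumes you have already reduced each scan run to a list of (location, latency_ms) pairs for one drive. The function name and the threshold factor of 5 are arbitrary choices for illustration. It reports only the locations that are slow in every run, since a one-off spike is more likely noise.

```python
from statistics import median

def recurring_spikes(runs, factor=5.0):
    """Return sorted locations whose latency exceeds factor * that run's median in every run.

    runs: list of non-empty runs, each a list of (location, latency_ms) pairs.
    """
    common = None
    for run in runs:
        baseline = median(latency for _, latency in run)
        spikes = {loc for loc, latency in run if latency > factor * baseline}
        # Keep only locations that also spiked in all earlier runs.
        common = spikes if common is None else common & spikes
    return sorted(common or [])

if __name__ == "__main__":
    # Made-up numbers purely to show the call shape.
    run_a = [(0, 10.0), (1, 11.0), (2, 90.0), (3, 10.0), (4, 12.0)]
    run_b = [(0, 10.0), (1, 70.0), (2, 95.0), (3, 11.0), (4, 10.0)]
    print(recurring_spikes([run_a, run_b]))
```

Comparing against each run's own median is meant to cope with drives that are simply slower overall. It does not account for the smooth outer-to-inner slowdown mentioned above, so a per-region baseline would be a reasonable refinement.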
Cheers,
Nicholas

--RnlQjJ0d97Da+TV1
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: Digital signature

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQIcBAEBCgAGBQJYDrSuAAoJEFqIMEdffRBh/V4P/0pL23ffhuCauQfsVc/2RUx6
EYP+Ft4YfiG1oCD6Ht6OrWwEKdEZXCnoNWTDBb89ZBnPYBYlwIJEf3jEIthfeF/U
obIMl+AmVKg82igSqaTk3wnKGsVOIg0zSJw39OixD05A/lzHio0jqz4+qbYNvs7y
/YJ7ikTEXnp2jCgqG376wRZAoHM9pjy2KoB5d54s/BvkO15ClpTkhE49NJgpEopm
BRRcF2ASZ7l5QHL0aoPnoXR5IMEZlRy2daA6gvTuiltdQs3/O4gZnHscpjRBvLSF
BjsTsULu9fnBveqr7D1TVjcJIm2EXVcE3kwG94ayktzc5sH719OdM4yr9ZNGKXiw
AtfpUqXXYlfqtdrFJqKrom4xEAqSkEw/4h/UK4tXmujDJLY9m2yTzFCv/h0iknbX
2jAB0h76HmgknDXh/69UypWczhqqYROOAjRMe1tpEKVHWZuVZCyxCB/BjmANUCH9
Hlx1EJAjfaAiS7LBEeFhSCQs1B+yhg0WzfPeECora+jZYOjH7kYF3oP3bbB2bEbE
yxZ7H/tijmSeXTzt//WkJ6ixKnVrBmsVfqTrULok4Yne5oCCtrfMD7CcV1I+RW/p
GS5uEdzygp/5+Pm32ngkzMpxqo//+bDzNPlJWKXr90mYsJfOVtcMq3TW1E0b5c69
v9ENYrx8PLlqbYik9uIQ
=reJY
-----END PGP SIGNATURE-----

--RnlQjJ0d97Da+TV1--