From: "Austin S. Hemmelgarn" <ahferroin7@gmail.com>
To: Christian Pernegger <pernegger@gmail.com>
Cc: Qu Wenruo <quwenruo.btrfs@gmx.com>,
	linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: first it froze, now the (btrfs) root fs won't mount ...
Date: Mon, 21 Oct 2019 10:02:18 -0400
Message-ID: <db934877-4168-81b4-689a-7de1fc34cace@gmail.com>
In-Reply-To: <CAKbQEqFdY8hSko2jW=3BzpiZ6H4EV9yhncozoy=Kzroh3KfD5g@mail.gmail.com>

On 2019-10-21 09:02, Christian Pernegger wrote:
> [Please CC me, I'm not on the list.]
> 
> On Mon, 21 Oct 2019 at 13:47, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>> I've [worked with fs clones] like this dozens of times on single-device volumes with exactly zero issues.
> 
> Thank you, I have taken precautions, but it does seem to work fine.
> 
>> There are actually two possible ways I can think of a buggy GPU driver causing this type of issue: [snip]
> 
> Interesting and plausible, but ...
> 
>> Your best option for mitigation [...] is to ensure that your hardware has an IOMMU [...] and ensure it's enabled in firmware.
> 
> It has and it is. (The machine's been specced so GPU pass-through is
> an option, should it be required. I haven't gotten around to setting
> that up yet, haven't even gotten a second GPU, but I have laid the
> groundwork, the IOMMU is enabled and, as far as one can tell from logs
> and such, working.)
> 
>> However, there's also the possibility that you may have hardware issues.
> 
> Don't I know it ... The problem is, if there are hardware issues,
> that's the first I've seen of them, and while I didn't run torture
> tests, there was quite a lot of benchmarking when it was new. Needle
> in a haystack. Some memory testing can't hurt, I suppose. Any other
> ideas (for hardware testing)?
The power supply would be the other big one I'd suggest testing, as a
bad PSU can cause all kinds of odd intermittent issues. Just like with
RAM, you can't easily cover everything, but you can check a few things
that have a very low false positive rate: if one of them shows a
problem, there almost certainly is one.

Typical procedure I use is:

1. Completely disconnect the PSU from _everything_ inside the computer. 
(If you're really paranoid, you can completely remove the PSU from the 
case too, though that won't really make the testing more reliable or safer).
2. Make sure the PSU itself is plugged in to mains power, with the 
switch on the back (if it has one) turned on.
3. Connect a good multimeter to the 24-pin main power connector, with 
the positive probe on pin 8 and the negative probe on pin 7, set to 
measure DC voltages in the double-digit range with the highest precision 
possible.
4. Short pins 15 and 16 of the 24-pin main power connector using a short 
piece of solid copper wire. At this point, if the PSU has a fan, the fan 
should turn on. The multimeter should read +5 volts within half a second 
or less.
5. Check the voltage of each of the power rails relative to ground. Make
sure to check each one for a couple of seconds to watch for any
fluctuations, and make a point of checking _each_ set of wires coming
off of the PSU separately (as well as checking each wire in each
connector independently, even if they're supposed to be tied together
internally).
6. Check the +5V standby power by hooking up the multimeter to that and
a ground pin, then disconnecting the copper wire mentioned in step 4.
It should maintain its voltage while you're disconnecting the wire and
afterwards, even once the fan stops.

You can find the respective pinouts online in many places (for example, 
[1]).  Tolerances are +/- 5% on everything except the negative voltages 
which are +/- 10%. The -5V pin may show nothing, which is normal (modern 
systems do not use -5V for anything, and actually most don't use -12V 
anymore either, though that's still provided). This won't confirm that 
the PSU isn't suspect (it could still have issues under load), but if 
any of this testing fails, you can be 100% certain you have either a bad 
PSU, or that your mains power is suspect (usually the issue there is 
very high line noise, though you'll need special equipment to test for 
that).
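
If it helps to keep track of the readings, the tolerance check above
can be sketched in a few lines of Python. This is only a sketch: the
function names are made up for illustration, the rail list is the
usual ATX set (with -5V left out since it may legitimately read
nothing), and the sample readings in the usage note are invented.

```python
# Nominal rail voltages. Tolerances as quoted above:
# +/- 5% on positive rails, +/- 10% on negative rails.
NOMINAL_RAILS = {
    "+3.3V": 3.3,
    "+5V": 5.0,
    "+12V": 12.0,
    "+5VSB": 5.0,   # +5V standby
    "-12V": -12.0,
}


def tolerance(nominal):
    """Return the allowed fractional deviation for a rail."""
    return 0.10 if nominal < 0 else 0.05


def check_rail(name, measured):
    """Return True if a measured voltage is within tolerance for the rail."""
    nominal = NOMINAL_RAILS[name]
    allowed = abs(nominal) * tolerance(nominal)
    return abs(measured - nominal) <= allowed


def failing_rails(readings):
    """Given (rail name, measured volts) pairs, list the rails out of range."""
    return [name for name, volts in readings if not check_rail(name, volts)]
```

For example, `failing_rails([("+5V", 5.1), ("+12V", 12.9)])` flags only
the +12V reading. Passing this check says nothing about behavior under
load, for the same reason the manual test doesn't.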
> 
> Back on the topic of TRIM: I'm 99 % certain discard wasn't set on the
> mount (not by me, in any case), but I think Mint runs fstrim
> periodically by default. Just to be sure, should any form of TRIM be
> disabled?
The issue with TRIM is that it drops old copies of the on-disk data 
structures used by BTRFS, which can make recovery more difficult in the 
event of a crash. Running `fstrim` at regular intervals is not as much 
of an issue as inline discard, but still drops the old trees, so there's 
a window of time right after it gets run when you are more vulnerable.
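
To settle the "was discard set on the mount" question, you can look at
the mount options the kernel reports. Here is a minimal Python sketch,
assuming the usual /proc/mounts layout (device, mount point, fs type,
comma-separated options); the function name is made up. It only covers
inline discard. Whether `fstrim` is being run periodically is a
separate question (on systemd-based distributions that is typically
handled by a timer unit).

```python
def mounts_with_discard(mounts_text, fstype="btrfs"):
    """Return mount points of the given fs type mounted with inline discard.

    mounts_text is expected to be in /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] != fstype:
            continue
        options = fields[3].split(",")
        # Match "discard" or "discard=<mode>", but not "nodiscard".
        if any(o == "discard" or o.startswith("discard=") for o in options):
            hits.append(fields[1])
    return hits
```

Feeding it the contents of /proc/mounts (for example via
`open("/proc/mounts").read()`) gives an empty list if no BTRFS volume
is mounted with inline discard.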

Additionally, some SSDs have had issues with TRIM causing data
corruption elsewhere on the disk, but it's been years since I've seen a
report of such issues, and I don't think a Samsung device as recent as
yours is likely to have such problems.

> The only other idea I've got is Timeshift's hourly snapshots. (How)
> would btrfs deal with a crash during snapshot creation?
It should have no issues whatsoever most of the time.  The only case I 
can think of where it might is if you're snapshotting a subvolume that's 
being written to at the same time. Snapshots on BTRFS are only truly 
atomic if none of the data being snapshotted is being written to at the 
same time. If there are pending writes, there are some indeterminate 
states involved, and crashing then might produce a corrupted snapshot, 
but shouldn't cause any other issues.
> 
> 
> In other news, I've still not quite given up, mainly because the fs
> doesn't look all that broken. The output of btrfs inspect-internal
> dump-tree (incl. options), for instance, looks like gibberish to me of
> course, but it looks sane, doesn't spew warnings, doesn't error out or
> crash. Also plain btrfs check --init-extent-tree errored out, same
> with -s0, but with -s1 it's now chugging along. (BTW, is there a
> hierarchy among the super block slots, a best or newest one?)
AIUI, when they get updated, they get written out in the order they
occur on disk, but other than that they're supposed to always be in
sync. So if you have an issue while the first one is being written out,
you can often recover by using the second or a later one.
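
If you want to see which copy is newest, each super block carries a
generation number you can compare. Below is a rough Python sketch,
assuming the standard on-disk layout as I understand it: copies at
64 KiB, 64 MiB and 256 GiB (the indices that -s takes), with the magic
string at byte 64 and the generation at byte 72 of each copy. The
function names are made up, it does not verify checksums, and a copy is
simply skipped if the device is too small to hold it.

```python
import struct

BTRFS_MAGIC = b"_BHRfS_M"
SUPER_INFO_SIZE = 4096


def super_offset(mirror):
    """Byte offset of super block copy 0, 1 or 2 on the device."""
    if mirror == 0:
        return 64 * 1024
    return (16 * 1024) << (12 * mirror)


def parse_super(block):
    """Extract bytenr and generation, or None if the magic is missing."""
    if len(block) < 80 or block[64:72] != BTRFS_MAGIC:
        return None
    # Layout assumed: csum[32], fsid[16], bytenr, flags, magic, generation.
    (bytenr,) = struct.unpack_from("<Q", block, 48)
    (generation,) = struct.unpack_from("<Q", block, 72)
    return {"bytenr": bytenr, "generation": generation}


def read_super(dev, mirror):
    """Read one copy from a binary file object opened on the device."""
    dev.seek(super_offset(mirror))
    return parse_super(dev.read(SUPER_INFO_SIZE))


def newest_copy(supers):
    """Index of the valid copy with the highest generation, or None."""
    best = None
    for index, sb in enumerate(supers):
        if sb is None:
            continue
        if best is None or sb["generation"] > supers[best]["generation"]:
            best = index
    return best
```

Something like `newest_copy([read_super(dev, i) for i in range(3)])`
on a device opened read-only with `open(path, "rb")` would then give
the index to try first. On a healthy filesystem all copies should
report the same generation.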
> 
> Will keep you posted.
> 
> Cheers,
> C.
> 

