Two persistent problems

* Two persistent problems
@ 2014-11-14 21:51 Hugo Mills
  2014-11-14 22:00 ` Josef Bacik
  0 siblings, 1 reply; 6+ messages in thread
From: Hugo Mills @ 2014-11-14 21:51 UTC (permalink / raw)
  To: Btrfs mailing list; +Cc: Chris Mason, Josef Bacik

[-- Attachment #1: Type: text/plain, Size: 2328 bytes --]

   Chris, Josef, anyone else who's interested,

   On IRC, I've been seeing reports of two persistent unsolved
problems. Neither is showing up very often, but both have turned up
often enough to indicate that there's something specific going on
worthy of investigation.

   One of them is definitely a btrfs problem. The other may be btrfs,
or something in the block layer, or just broken hardware; it's hard to
tell from where I sit.

Problem 1: ENOSPC on balance

   This has been going on since about March this year. I can
reasonably certainly recall 8-10 cases, possibly a number more. When
running a balance, the operation fails with ENOSPC when there's plenty
of space remaining unallocated. This happens on full balance, filtered
balance, and device delete. Other than the ENOSPC on balance, the FS
seems to work OK. It seems to be more prevalent on filesystems
converted from ext*. The first few or more reports of this didn't make
it to bugzilla, but a few of them since then have gone in.

Problem 2: Unexplained zeroes

   Failure to mount. Transid failure, "expected xyz, have 0". Chris
looked at an early one of these (for Ke, on IRC) back in September
(the 27th -- sadly, the public IRC logs aren't there for it, but I can
supply a copy of the private log). He rapidly came to the conclusion
that it was something bad going on with TRIM, replacing some blocks
with zeroes. Since then, I've seen a bunch of these coming past on
IRC. It seems to be a 3.17 thing. I can successfully predict the
presence of an SSD and -odiscard from the "have 0". I've successfully
persuaded several people to put this into bugzilla and capture
btrfs-images.  btrfs recover doesn't generally seem to be helpful in
recovering data.

   I think Josef had problem 1 in his sights, but I don't know if
additional images or reports are helpful at this point. For problem 2,
there's obviously something bad going on, but there's not much else to
go on -- and the inability to recover data isn't good.

   For each of these, what more information should I be trying to
collect from any future reporters?

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
            --- Great films about cricket:  Forrest Stump ---            

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread