linux-bcache.vger.kernel.org archive mirror
* SSD failure modes
@ 2013-01-23  2:56 James Harper
From: James Harper @ 2013-01-23  2:56 UTC (permalink / raw)
  To: linux-bcache-u79uwXL29TY76Z2rM5mHXA

What is the expected behaviour of bcache when an SSD wears out? Do SSDs internally do a verify after write to ensure that the data has made it to the 'media' correctly, and report a failure if it has not? Does (or can) bcache do a verify itself?

And what about an SSD that fails hard (e.g. Linux detects an unplug)? A system crash is acceptable in such a case if the cache was in write-back mode, but what are the chances of rebooting successfully with the outstanding cached writes now lost?

Thanks

James


* Re: SSD failure modes
@ 2013-01-24 22:52   ` Kent Overstreet
From: Kent Overstreet @ 2013-01-24 22:52 UTC (permalink / raw)
  To: James Harper; +Cc: linux-bcache-u79uwXL29TY76Z2rM5mHXA

On Wed, Jan 23, 2013 at 02:56:40AM +0000, James Harper wrote:
> What is the expected behaviour of bcache when an SSD wears out? Do SSDs internally do a verify after write to ensure that the data has made it to the 'media' correctly, and report a failure if it has not? Does (or can) bcache do a verify itself?

I've never heard of SSDs doing that, AFAIK they rely more on strong ECC.
Bcache does not itself do that kind of verify, though I think it'd be
pretty easy to implement (and you'd only need it for dirty data and
metadata).
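
As a rough illustration of what a verify after write amounts to, here is a
user-space sketch in Python. It is not bcache code and not how any SSD
firmware works; the names write_and_verify and demo are made up for this
example. It also assumes a POSIX system, and the read back may be served
from the page cache rather than the media, so a real implementation would
have to bypass the cache (or live in the block layer).

```python
import os
import tempfile
import zlib


def write_and_verify(fd: int, offset: int, data: bytes) -> bool:
    """Write data at offset, flush it, read it back and compare checksums.

    Hypothetical helper for illustration only. Returns True when the
    bytes read back match what was written.
    """
    expected = zlib.crc32(data)

    # A short write already counts as a failed write.
    written = os.pwrite(fd, data, offset)
    if written != len(data):
        return False

    # Ask the kernel to push the data to the device before reading back.
    os.fsync(fd)

    # NOTE: without O_DIRECT this read can come from the page cache,
    # so it does not prove the data reached the media.
    readback = os.pread(fd, len(data), offset)
    return len(readback) == len(data) and zlib.crc32(readback) == expected


def demo() -> bool:
    """Run the check against a temporary file standing in for the cache."""
    with tempfile.TemporaryFile() as f:
        return write_and_verify(f.fileno(), 4096, b"dirty data" * 100)
```

Only dirty data and metadata would need this treatment, since clean cached
data can always be fetched again from the backing device.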

> And what about an SSD that fails hard (e.g. Linux detects an unplug)? A system crash is acceptable in such a case if the cache was in write-back mode, but what are the chances of rebooting successfully with the outstanding cached writes now lost?

I just wrote some documentation about error handling - tell me if that
helps:
http://atlas.evilpiepirate.org/git/linux-bcache.git/tree/Documentation/bcache.txt?h=bcache-dev

Not quite sure I get the scenario you're describing though - you unplug
the SSD, then reboot?

The reboot itself is not a problem, nor is unplugging the SSD -
unplugging the SSD from bcache's perspective looks like a crash when
things come up again (at some point writes just stopped, but whatever's
on the SSD will still be consistent).

However, if you unplug the SSD in writeback mode, run for a bit, and
then reboot - after the SSD is unplugged all the writeback writes are
going to error. We could retry those writes as writes that bypass the
cache (in case it was just a random IO error), although we don't do that
yet - but if metadata writes fail in writeback mode we may want to just
panic the kernel.
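
The policy being floated here, which bcache does not implement yet, could
be sketched roughly as below. This is a toy model in Python rather than
kernel code; Action, handle_write and failing_cache_write are hypothetical
names, and the two callables stand in for writes to the cache device and
to the backing device.

```python
from enum import Enum
from typing import Callable


class Action(Enum):
    DONE = "done"                      # cache write succeeded
    RETRIED_BYPASS = "retried_bypass"  # cache write failed, data sent to backing device
    FATAL = "fatal"                    # metadata write failed, caller should stop


def handle_write(kind: str,
                 write_cache: Callable[[], None],
                 write_backing: Callable[[], None]) -> Action:
    """Try the cache first; on error, bypass for data and give up for metadata.

    kind is "data" or "metadata" (an assumption made for this sketch).
    """
    try:
        write_cache()
        return Action.DONE
    except OSError:
        if kind == "metadata":
            # Metadata lives only on the cache device, so there is
            # nowhere to retry; the kernel analogue would be a panic.
            return Action.FATAL
        # Data can still go straight to the backing device. If this
        # write fails too, the OSError propagates to the caller.
        write_backing()
        return Action.RETRIED_BYPASS


def failing_cache_write() -> None:
    """Stand-in for a write to an SSD that has gone away."""
    raise OSError(5, "Input/output error")
```

With a cache device that has disappeared, handle_write("data",
failing_cache_write, lambda: None) returns Action.RETRIED_BYPASS, while the
same call with "metadata" returns Action.FATAL.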

Hrm.


* RE: SSD failure modes
@ 2013-01-25  2:26       ` James Harper
From: James Harper @ 2013-01-25  2:26 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcache-u79uwXL29TY76Z2rM5mHXA

> I just wrote some documentation about error handling - tell me if that
> helps:
> http://atlas.evilpiepirate.org/git/linux-bcache.git/tree/Documentation/bcache.txt?h=bcache-dev
> 
> Not quite sure I get the scenario you're describing though - you unplug
> the SSD, then reboot?
> 

No, what I was describing is the case where the SSD fails completely and, as far as Linux is concerned, is just gone. I've seen a few hard disks fail like that just recently (heatwave over here in AU): the computer seems to be working okay, then things just freeze up, and after a reboot there is no hard disk anymore. I haven't used SSDs enough to have seen something like that yet, but if it's a controller failure then it's possible.

I was just wondering how data integrity would fare in such a case. I guess it's the same as if there was a power failure and then you removed the SSD before booting up again, except that I believe bcache can bypass the cache for certain types of streaming writes so there is the potential for things to get out of sync, maybe?

James
