* Race to power off harming SATA SSDs
@ 2017-04-10 23:21 Henrique de Moraes Holschuh
  2017-04-10 23:34 ` Bart Van Assche
                   ` (2 more replies)
  0 siblings, 3 replies; 47+ messages in thread
From: Henrique de Moraes Holschuh @ 2017-04-10 23:21 UTC (permalink / raw)
  To: linux-kernel, linux-scsi, linux-ide; +Cc: Hans de Goede, Tejun Heo

Summary:

Linux properly issues the prepare-to-poweroff command to SATA SSDs,
but it does not wait long enough to ensure the SSD has carried it
through.

This causes a race between the platform power-off path and the SSD
device.  When the SSD loses the race, its power is cut while it is still
doing its final book-keeping for poweroff.  This is known to be harmful
to most SSDs, and there is a non-zero chance of it even bricking the
device.

Apparently, it is enough to wait a few seconds before powering off the
platform to give the SSDs enough time to fully carry out the STANDBY
IMMEDIATE command.

This issue was verified to exist on SATA SSDs made by at least Crucial
(and thus likely also Micron), Intel, and Samsung.  It was verified to
exist on several 3.x to 4.9 kernels, both distro (Debian) and upstream
stable/longterm kernels from kernel.org.  Only x86-64 was tested.

A proof-of-concept patch is attached, which was sufficient to
*completely* avoid the issue on the test set over a period of six to
eight weeks of testing.


Details and hypothesis:

For a long while I have noticed that the S.M.A.R.T.-provided "unit was
powered off unexpectedly" attributes of the SSDs under my care were
rising on several boxes, without any unexpected power cuts being
accounted for.

This has been going on for a *long* time (several years, since the
first SSD I got), but it was too rare an event for me to try to track
down the root cause...  until a friend reported that his SSD was already
showing several hundred unclean power-offs on his laptop.  That made it
much easier to track down.

Per spec (and device manuals), SCSI, SATA and ATA-attached SSDs must be
informed of an imminent poweroff to checkpoint background tasks, flush
RAM caches and close logs.  For SCSI SSDs, you must issue a
START_STOP_UNIT (stop) command.  For SATA, you must issue a STANDBY
IMMEDIATE command.  I haven't checked ATA, but it should be the same as
SATA.

In order to comply with this requirement, the Linux SCSI "sd" device
driver issues a START_STOP_UNIT command when the device is shutdown[1].
For SATA SSD devices, the SCSI START_STOP_UNIT command is properly
translated by the kernel SAT layer to STANDBY IMMEDIATE for SSDs.

After issuing the command, the kernel properly waits for the device to
report that the command has been completed before it proceeds.
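
For reference, the relevant path in drivers/scsi/sd.c looks roughly like
the following (a hand-trimmed paraphrase of the 4.x sources, so details
may differ slightly from any given kernel version):

static void sd_shutdown(struct device *dev)
{
	struct scsi_disk *sdkp = dev_get_drvdata(dev);

	if (!sdkp)
		return;

	if (sdkp->WCE && sdkp->media_present) {
		sd_printk(KERN_NOTICE, sdkp, "Synchronizing SCSI cache\n");
		sd_sync_cache(sdkp);		/* SYNCHRONIZE CACHE */
	}

	if (system_state != SYSTEM_RESTART &&
	    sdkp->device->manage_start_stop) {
		sd_printk(KERN_NOTICE, sdkp, "Stopping disk\n");
		sd_start_stop_device(sdkp, 0);	/* START STOP UNIT (stop) */
	}
}

sd_start_stop_device() builds the START STOP UNIT CDB, issues it, and
returns as soon as the device reports command completion -- which is
exactly where the problem described below begins.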

However, *IN PRACTICE*, SATA STANDBY IMMEDIATE command completion
[often?] only indicates that the device is now switching to the target
power management state, not that it has reached the target state.  Any
further device status inquiries would return that it is in STANDBY mode,
even if it is still entering that state.

The kernel then continues the shutdown path while the SSD is still
preparing itself to be powered off, and it becomes a race.  When the
kernel + firmware wins, platform power is cut before the SSD has
finished (i.e. the SSD is subject to an unclean power-off).

Evidently, how often the SSD will lose the race depends on a platform
and SSD combination, and also on how often the system is powered off.
A sluggish firmware that takes its time to cut power can save the day...


Observing the effects:

An unclean SSD power-off will be signaled by the SSD device through an
increase on a specific S.M.A.R.T attribute.  These SMART attributes can
be read using the smartmontools package from www.smartmontools.org,
which should be available in just about every Linux distro.

	smartctl -A /dev/sd#

The SMART attribute related to unclean power-off is vendor-specific, so
one might have to track down the SSD datasheet to know which attribute a
particular SSD uses.  The naming of the attribute also varies.

For a Crucial M500 SSD with up-to-date firmware, this would be attribute
174 "Unexpect_Power_Loss_Ct", for example.

NOTE: unclean SSD power-offs are dangerous and may brick the device in
the worst case, or otherwise harm it (reduce longevity, damage flash
blocks).  It is also not impossible to get data corruption.


Testing, and working around the issue:

I've asked several Debian developers to test a patch (attached) in
any of their boxes that had SSDs complaining of unclean poweroffs.  This
gave us a test corpus of Intel, Crucial and Samsung SSDs, on laptops,
desktops, and a few workstations.

The proof-of-concept patch adds a delay of one second to the SD-device
shutdown path.

Previously, the more sensitive devices/platforms in the test set would
report at least one or two unclean SSD power-offs a month.  With the
patch, there was NOT a single increase reported after several weeks of
testing.

This is obviously not a test with 100% confidence, but it indicates very
strongly that the above analysis was correct, and that an added delay
was enough to work around the issue in the entire test set.



Fixing the issue properly:

The proof of concept patch works fine, but it "punishes" the system with
too much delay.  Also, if sd device shutdown is serialized, it will
punish systems with many /dev/sd devices severely.

1. The delay needs to happen only once right before powering down for
   hibernation/suspend/power-off.  There is no need to delay per-device
   for platform power off/suspend/hibernate.

2. A per-device delay needs to happen before signaling that a device
   can be safely removed when doing controlled hotswap (e.g. when
   deleting the SD device due to a sysfs command).

I am unsure how much *total* delay would be enough.  Two seconds seems
like a safe bet.

Any comments?  Any clues on how to make the delay "smarter" to trigger
only once during platform shutdown, but still trigger per-device when
doing per-device hotswapping ?
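
To make (1) a bit more concrete, below is a very rough, untested sketch
of the kind of thing I have in mind: a single shared "settle" delay that
runs after device_shutdown() has already stopped every disk, instead of
an msleep() per device.  Note that sd_last_stop_jiffies is a
hypothetical hook that sd_start_stop_device() would have to update after
a successful stop (none of this exists today), it only covers the
power-off path, and suspend/hibernate as well as the hot-removal case
(2) would still need their own handling.

#include <linux/delay.h>
#include <linux/jiffies.h>
#include <linux/kernel.h>
#include <linux/syscore_ops.h>

#define SD_SETTLE_MS	2000

/* Hypothetical: set to jiffies by sd_start_stop_device() whenever it
 * successfully stops a disk. */
unsigned long sd_last_stop_jiffies;

/*
 * syscore_shutdown() runs after device_shutdown() in kernel_power_off(),
 * so by the time this callback runs every sd device has already been
 * sent its STOP / STANDBY IMMEDIATE.
 */
static void sd_settle_shutdown(void)
{
	unsigned long deadline;

	/* Disks keep power across a reboot; only wait for a real power-off. */
	if (system_state != SYSTEM_POWER_OFF || !sd_last_stop_jiffies)
		return;

	deadline = sd_last_stop_jiffies + msecs_to_jiffies(SD_SETTLE_MS);
	if (time_before(jiffies, deadline))
		msleep(jiffies_to_msecs(deadline - jiffies));
}

static struct syscore_ops sd_settle_syscore_ops = {
	.shutdown	= sd_settle_shutdown,
};

/* ...plus a register_syscore_ops(&sd_settle_syscore_ops) in sd init. */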


[1] In ancient times, it didn't, or at least the ATA/SATA side didn't.
It has been fixed for at least a decade, refer to "manage_start_stop", a
deprecated sysfs node that should have been removed in y2008 :-)

-- 
  Henrique Holschuh

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Race to power off harming SATA SSDs
  2017-04-10 23:21 Race to power off harming SATA SSDs Henrique de Moraes Holschuh
@ 2017-04-10 23:34 ` Bart Van Assche
  2017-04-10 23:50   ` Henrique de Moraes Holschuh
  2017-04-10 23:49 ` sd: wait for slow devices on shutdown path Henrique de Moraes Holschuh
  2017-04-10 23:52 ` Race to power off harming SATA SSDs Tejun Heo
  2 siblings, 1 reply; 47+ messages in thread
From: Bart Van Assche @ 2017-04-10 23:34 UTC (permalink / raw)
  To: linux-scsi, linux-kernel, hmh, linux-ide; +Cc: hdegoede, tj

On Mon, 2017-04-10 at 20:21 -0300, Henrique de Moraes Holschuh wrote:
> A proof of concept patch is attached

Thank you for the very detailed write-up. Sorry but no patch was attached
to the e-mail I received from you ...

Bart.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* sd: wait for slow devices on shutdown path
  2017-04-10 23:21 Race to power off harming SATA SSDs Henrique de Moraes Holschuh
  2017-04-10 23:34 ` Bart Van Assche
@ 2017-04-10 23:49 ` Henrique de Moraes Holschuh
  2017-04-10 23:52 ` Race to power off harming SATA SSDs Tejun Heo
  2 siblings, 0 replies; 47+ messages in thread
From: Henrique de Moraes Holschuh @ 2017-04-10 23:49 UTC (permalink / raw)
  To: linux-kernel, linux-scsi, linux-ide; +Cc: Hans de Goede, Tejun Heo

Author: Henrique de Moraes Holschuh <hmh@debian.org>
Date:   Wed Feb 1 20:42:02 2017 -0200

    sd: wait for slow devices on shutdown path
    
    Wait 1s during suspend/shutdown for the device to settle after
    we issue the STOP command.
    
    Otherwise we race ATA SSDs to powerdown, possibly causing damage to
    FLASH/data and even bricking the device.
    
    This is an experimental patch, there are likely better ways of doing
    this that don't punish non-SSDs.
    
    Signed-off-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br>

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 4e08d1cd..3c6d5d3 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3230,6 +3230,38 @@ static int sd_start_stop_device(struct scsi_disk *sdkp, int start)
 			res = 0;
 	}
 
+	/*
+	 * Wait for slow devices that signal that they have fully
+	 * entered the stopped state before they actually have.
+	 *
+	 * This behavior is apparently allowed per-spec for ATA
+	 * devices, and our SAT layer does not account for it.
+	 * Thus, on return, the device might still be in the process
+	 * of entering STANDBY state.
+	 *
+	 * Worse, apparently the ATA spec also says the unit should
+	 * return that it is already in STANDBY state *while still
+	 * entering that state*.
+	 *
+	 * SSDs absolutely depend on receiving a STANDBY IMMEDIATE
+	 * command prior to power off for a clean shutdown (and
+	 * likely we don't want to send them *anything else* in-
+	 * between either, to be on the safe side).
+	 *
+	 * As things stand, we are racing the SSD's firmware.  If it
+	 * finishes first, nothing bad happens.  If it doesn't, we
+	 * cut power while it is still saving metadata.  Not only will
+	 * this cause extra FLASH wear (and maybe even damage some
+	 * cells), it also has a non-zero chance of bricking the
+	 * SSD.
+	 *
+	 * Issue reported on Intel, Crucial and Micron SSDs.
+	 * Issue can be detected by S.M.A.R.T. signaling unexpected
+	 * power cuts.
+	 */
+	if (!res && !start)
+		msleep(1000);
+
 	/* SCSI error codes must not go to the generic layer */
 	if (res)
 		return -EIO;

-- 
  Henrique Holschuh

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: Race to power off harming SATA SSDs
  2017-04-10 23:34 ` Bart Van Assche
@ 2017-04-10 23:50   ` Henrique de Moraes Holschuh
  0 siblings, 0 replies; 47+ messages in thread
From: Henrique de Moraes Holschuh @ 2017-04-10 23:50 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-scsi, linux-kernel, linux-ide, hdegoede, tj

On Mon, 10 Apr 2017, Bart Van Assche wrote:
> On Mon, 2017-04-10 at 20:21 -0300, Henrique de Moraes Holschuh wrote:
> > A proof of concept patch is attached
> 
> Thank you for the very detailed write-up. Sorry but no patch was attached
> to the e-mail I received from you ...

Indeed.  It should arrive shortly, hopefully undamaged.  Otherwise I
will resend with git-send-email :p

-- 
  Henrique Holschuh

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Race to power off harming SATA SSDs
  2017-04-10 23:21 Race to power off harming SATA SSDs Henrique de Moraes Holschuh
  2017-04-10 23:34 ` Bart Van Assche
  2017-04-10 23:49 ` sd: wait for slow devices on shutdown path Henrique de Moraes Holschuh
@ 2017-04-10 23:52 ` Tejun Heo
  2017-04-10 23:57   ` James Bottomley
                     ` (3 more replies)
  2 siblings, 4 replies; 47+ messages in thread
From: Tejun Heo @ 2017-04-10 23:52 UTC (permalink / raw)
  To: Henrique de Moraes Holschuh
  Cc: linux-kernel, linux-scsi, linux-ide, Hans de Goede

Hello,

On Mon, Apr 10, 2017 at 08:21:19PM -0300, Henrique de Moraes Holschuh wrote:
...
> Per spec (and device manuals), SCSI, SATA and ATA-attached SSDs must be
> informed of an imminent poweroff to checkpoint background tasks, flush
> RAM caches and close logs.  For SCSI SSDs, you must issue a
> START_STOP_UNIT (stop) command.  For SATA, you must issue a STANDBY
> IMMEDIATE command.  I haven't checked ATA, but it should be the same as
> SATA.

Yeah, it's the same.  Even hard drives are expected to survive a lot
of unexpected power losses tho.  They have to do emergency head
unloads but they're designed to withstand a healthy number of them.

> In order to comply with this requirement, the Linux SCSI "sd" device
> driver issues a START_STOP_UNIT command when the device is shutdown[1].
> For SATA SSD devices, the SCSI START_STOP_UNIT command is properly
> translated by the kernel SAT layer to STANDBY IMMEDIATE for SSDs.
> 
> After issuing the command, the kernel properly waits for the device to
> report that the command has been completed before it proceeds.
> 
> However, *IN PRACTICE*, SATA STANDBY IMMEDIATE command completion
> [often?] only indicates that the device is now switching to the target
> power management state, not that it has reached the target state.  Any
> further device status inquiries would return that it is in STANDBY mode,
> even if it is still entering that state.
> 
> The kernel then continues the shutdown path while the SSD is still
> preparing itself to be powered off, and it becomes a race.  When the
> kernel + firmware wins, platform power is cut before the SSD has
> finished (i.e. the SSD is subject to an unclean power-off).

At that point, the device is fully flushed and in terms of data
integrity should be fine with losing power at any point anyway.

> Evidently, how often the SSD will lose the race depends on a platform
> and SSD combination, and also on how often the system is powered off.
> A sluggish firmware that takes its time to cut power can save the day...
> 
> 
> Observing the effects:
> 
> An unclean SSD power-off will be signaled by the SSD device through an
> increase on a specific S.M.A.R.T attribute.  These SMART attributes can
> be read using the smartmontools package from www.smartmontools.org,
> which should be available in just about every Linux distro.
> 
> 	smartctl -A /dev/sd#
> 
> The SMART attribute related to unclean power-off is vendor-specific, so
> one might have to track down the SSD datasheet to know which attribute a
> particular SSD uses.  The naming of the attribute also varies.
> 
> For a Crucial M500 SSD with up-to-date firmware, this would be attribute
> 174 "Unexpect_Power_Loss_Ct", for example.
> 
> NOTE: unclean SSD power-offs are dangerous and may brick the device in
> the worst case, or otherwise harm it (reduce longevity, damage flash
> blocks).  It is also not impossible to get data corruption.

I get that the incrementing counters might not be pretty but I'm a bit
skeptical about this being an actual issue.  Because if that were
true, the device would be bricking itself from any sort of power
losses be that an actual power loss, battery rundown or hard power off
after crash.

> Testing, and working around the issue:
> 
> I've asked several Debian developers to test a patch (attached) in
> any of their boxes that had SSDs complaining of unclean poweroffs.  This
> gave us a test corpus of Intel, Crucial and Samsung SSDs, on laptops,
> desktops, and a few workstations.
> 
> The proof-of-concept patch adds a delay of one second to the SD-device
> shutdown path.
> 
> Previously, the more sensitive devices/platforms in the test set would
> report at least one or two unclean SSD power-offs a month.  With the
> patch, there was NOT a single increase reported after several weeks of
> testing.
> 
> This is obviously not a test with 100% confidence, but it indicates very
> strongly that the above analysis was correct, and that an added delay
> was enough to work around the issue in the entire test set.
> 
> 
> 
> Fixing the issue properly:
> 
> The proof of concept patch works fine, but it "punishes" the system with
> too much delay.  Also, if sd device shutdown is serialized, it will
> punish systems with many /dev/sd devices severely.
> 
> 1. The delay needs to happen only once right before powering down for
>    hibernation/suspend/power-off.  There is no need to delay per-device
>    for platform power off/suspend/hibernate.
> 
> 2. A per-device delay needs to happen before signaling that a device
>    can be safely removed when doing controlled hotswap (e.g. when
>    deleting the SD device due to a sysfs command).
> 
> I am unsure how much *total* delay would be enough.  Two seconds seems
> like a safe bet.
> 
> Any comments?  Any clues on how to make the delay "smarter" to trigger
> only once during platform shutdown, but still trigger per-device when
> doing per-device hotswapping ?

So, if this is actually an issue, sure, we can try to work around;
however, can we first confirm that this has any other consequences
than a SMART counter being bumped up?  I'm not sure how meaningful
that is in itself.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Race to power off harming SATA SSDs
  2017-04-10 23:52 ` Race to power off harming SATA SSDs Tejun Heo
@ 2017-04-10 23:57   ` James Bottomley
  2017-04-11  2:02     ` Henrique de Moraes Holschuh
  2017-04-11  1:26   ` Henrique de Moraes Holschuh
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 47+ messages in thread
From: James Bottomley @ 2017-04-10 23:57 UTC (permalink / raw)
  To: Tejun Heo, Henrique de Moraes Holschuh
  Cc: linux-kernel, linux-scsi, linux-ide, Hans de Goede

On Tue, 2017-04-11 at 08:52 +0900, Tejun Heo wrote:
[...]
> > Any comments?  Any clues on how to make the delay "smarter" to 
> > trigger only once during platform shutdown, but still trigger per
> > -device when doing per-device hotswapping ?
> 
> So, if this is actually an issue, sure, we can try to work around;
> however, can we first confirm that this has any other consequences
> than a SMART counter being bumped up?  I'm not sure how meaningful
> that is in itself.

Seconded; especially as the proposed patch is way too invasive: we run
single threaded on shutdown and making every disk wait 1s is going to
drive enterprises crazy.  I'm with Tejun: If the device replies GOOD to
SYNCHRONIZE CACHE, that means we're entitled to assume all written data
is safely on non-volatile media and any "essential housekeeping" can be
redone if the power goes away.

James

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Race to power off harming SATA SSDs
  2017-04-10 23:52 ` Race to power off harming SATA SSDs Tejun Heo
  2017-04-10 23:57   ` James Bottomley
@ 2017-04-11  1:26   ` Henrique de Moraes Holschuh
  2017-04-11 10:37   ` Martin Steigerwald
  2017-05-07 20:40     ` Pavel Machek
  3 siblings, 0 replies; 47+ messages in thread
From: Henrique de Moraes Holschuh @ 2017-04-11  1:26 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, linux-scsi, linux-ide, Hans de Goede

On Tue, 11 Apr 2017, Tejun Heo wrote:
> > The kernel then continues the shutdown path while the SSD is still
> > preparing itself to be powered off, and it becomes a race.  When the
> > kernel + firmware wins, platform power is cut before the SSD has
> > finished (i.e. the SSD is subject to an unclean power-off).
> 
> At that point, the device is fully flushed and in terms of data
> integrity should be fine with losing power at any point anyway.

All bets are off at this point, really.

We issued a command that explicitly orders the SSD to checkpoint and
stop all background tasks, and flush *everything* including invisible
state (device data, stats, logs, translation tables, flash metadata,
etc)...  and then cut its power before it finished.

> > NOTE: unclean SSD power-offs are dangerous and may brick the device in
> > the worst case, or otherwise harm it (reduce longevity, damage flash
> > blocks).  It is also not impossible to get data corruption.
> 
> I get that the incrementing counters might not be pretty but I'm a bit
> skeptical about this being an actual issue.  Because if that were

As an *example* I know of because I tracked it personally, Crucial SSD
models from a few years ago were known to eventually brick on any
platform where they were subjected to repeated unclean shutdowns,
*Windows included*.  There are some threads on their forums about it.
Firmware revisions made it harder to happen, but still...

> true, the device would be bricking itself from any sort of power
> losses be that an actual power loss, battery rundown or hard power off
> after crash.

Bricking is a worst-case, really.  I guess they learned to keep the
device always in a will-not-brick state using append-only logs for
critical state or something, so it really takes very nasty flash damage
to exactly the wrong place to render it unusable.

> > Fixing the issue properly:
> > 
> > The proof of concept patch works fine, but it "punishes" the system with
> > too much delay.  Also, if sd device shutdown is serialized, it will
> > punish systems with many /dev/sd devices severely.
> > 
> > 1. The delay needs to happen only once right before powering down for
> >    hibernation/suspend/power-off.  There is no need to delay per-device
> >    for platform power off/suspend/hibernate.
> > 
> > 2. A per-device delay needs to happen before signaling that a device
> >    can be safely removed when doing controlled hotswap (e.g. when
> >    deleting the SD device due to a sysfs command).
> > 
> > I am unsure how much *total* delay would be enough.  Two seconds seems
> > like a safe bet.
> > 
> > Any comments?  Any clues on how to make the delay "smarter" to trigger
> > only once during platform shutdown, but still trigger per-device when
> > doing per-device hotswapping ?
> 
> So, if this is actually an issue, sure, we can try to work around;
> however, can we first confirm that this has any other consequences
> than a SMART counter being bumped up?  I'm not sure how meaningful
> that is in itself.

I have no idea how to confirm whether an SSD is damaged less, or more,
by "STANDBY IMMEDIATE and cut power too early" when compared with a
"sudden power cut".  At least not without actually damaging SSDs in
three test groups (normal power cuts, STANDBY IMMEDIATE + early power
cut, and a control group).

An "SSD power cut test" search on duckduckgo shows several papers and
testing reports on the first results page.  I don't think there is any
doubt whatsoever that your typical consumer SSD *can* get damaged by a
"sudden power cut" so badly that it is actually noticed by the user.

That FLASH itself gets damaged or can have stored data corrupted by
power cuts at bad times is quite clear:

http://cseweb.ucsd.edu/users/swanson/papers/DAC2011PowerCut.pdf

SSDs do a lot of work to recover from that without data loss.  You won't
notice it easily unless that recovery work *fails*.

-- 
  Henrique Holschuh

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Race to power off harming SATA SSDs
  2017-04-10 23:57   ` James Bottomley
@ 2017-04-11  2:02     ` Henrique de Moraes Holschuh
  0 siblings, 0 replies; 47+ messages in thread
From: Henrique de Moraes Holschuh @ 2017-04-11  2:02 UTC (permalink / raw)
  To: James Bottomley
  Cc: Tejun Heo, linux-kernel, linux-scsi, linux-ide, Hans de Goede

On Mon, 10 Apr 2017, James Bottomley wrote:
> On Tue, 2017-04-11 at 08:52 +0900, Tejun Heo wrote:
> [...]
> > > Any comments?  Any clues on how to make the delay "smarter" to 
> > > trigger only once during platform shutdown, but still trigger per
> > > -device when doing per-device hotswapping ?
> > 
> > So, if this is actually an issue, sure, we can try to work around;
> > however, can we first confirm that this has any other consequences
> > than a SMART counter being bumped up?  I'm not sure how meaningful
> > that is in itself.
> 
> Seconded; especially as the proposed patch is way too invasive: we run

It is a proof of concept thing.  It even says so in the patch commit
log, and in the cover text.

I don't want a one-second delay per device.  I never proposed that,
either.  In fact, I *specifically* asked for something else in the
paragraph you quoted.

I would much prefer a one- or two-second delay per platform *power
off*.  And that's for platforms that do ACPI-like heavy-duty S3/S4/S5,
like x86/x86-64.  Opportunistic high-frequency suspend on mobile likely
requires no such handling.

The per-device delay would be needed only for hotplug removal (device
delete), and that's just because some hardware powers down bays (like
older thinkpads with ATA-compatible bays, and some industrial systems).

-- 
  Henrique Holschuh

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Race to power off harming SATA SSDs
  2017-04-10 23:52 ` Race to power off harming SATA SSDs Tejun Heo
  2017-04-10 23:57   ` James Bottomley
  2017-04-11  1:26   ` Henrique de Moraes Holschuh
@ 2017-04-11 10:37   ` Martin Steigerwald
  2017-04-11 14:31     ` Henrique de Moraes Holschuh
  2017-05-07 20:40     ` Pavel Machek
  3 siblings, 1 reply; 47+ messages in thread
From: Martin Steigerwald @ 2017-04-11 10:37 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Henrique de Moraes Holschuh, linux-kernel, linux-scsi, linux-ide,
	Hans de Goede

Am Dienstag, 11. April 2017, 08:52:06 CEST schrieb Tejun Heo:
> > Evidently, how often the SSD will lose the race depends on a platform
> > and SSD combination, and also on how often the system is powered off.
> > A sluggish firmware that takes its time to cut power can save the day...
> > 
> > 
> > Observing the effects:
> > 
> > An unclean SSD power-off will be signaled by the SSD device through an
> > increase on a specific S.M.A.R.T attribute.  These SMART attributes can
> > be read using the smartmontools package from www.smartmontools.org,
> > which should be available in just about every Linux distro.
> > 
> > smartctl -A /dev/sd#
> > 
> > The SMART attribute related to unclean power-off is vendor-specific, so
> > one might have to track down the SSD datasheet to know which attribute a
> > particular SSD uses.  The naming of the attribute also varies.
> > 
> > For a Crucial M500 SSD with up-to-date firmware, this would be attribute
> > 174 "Unexpect_Power_Loss_Ct", for example.
> > 
> > NOTE: unclean SSD power-offs are dangerous and may brick the device in
> > the worst case, or otherwise harm it (reduce longevity, damage flash
> > blocks).  It is also not impossible to get data corruption.
> 
> I get that the incrementing counters might not be pretty but I'm a bit
> skeptical about this being an actual issue.  Because if that were
> true, the device would be bricking itself from any sort of power
> losses be that an actual power loss, battery rundown or hard power off
> after crash.

The write-up by Henrique has been a very informative and interesting read for
me. I wondered about the same question, though.

I do have a Crucial M500 and I do have an increase of that counter:

martin@merkaba:~[…]/Crucial-M500> grep "^174" smartctl-a-201*
smartctl-a-2014-03-05.txt:174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       1
smartctl-a-2014-10-11-nach-prüfsummenfehlern.txt:174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       67
smartctl-a-2015-05-01.txt:174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       105
smartctl-a-2016-02-06.txt:174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       148
smartctl-a-2016-07-08-unreadable-sector.txt:174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       201
smartctl-a-2017-04-11.txt:174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       272


I mostly didn't notice anything, except for one time where I indeed had a
BTRFS checksum error, luckily within a BTRFS RAID 1 with an Intel SSD (which
also has an unclean-shutdown attribute that is increasing).

I blogged about this in German quite some time ago:

https://blog.teamix.de/2015/01/19/btrfs-raid-1-selbstheilung-in-aktion/

(I think it's easy enough to get the point of the blog post even without
understanding German.)

Result of scrub:

   scrub started at Thu Oct  9 15:52:00 2014 and finished after 564 seconds
        total bytes scrubbed: 268.36GiB with 60 errors
        error details: csum=60
        corrected errors: 60, uncorrectable errors: 0, unverified errors: 0

Device errors were on:

merkaba:~> btrfs device stats /home
[/dev/mapper/msata-home].write_io_errs   0
[/dev/mapper/msata-home].read_io_errs    0
[/dev/mapper/msata-home].flush_io_errs   0
[/dev/mapper/msata-home].corruption_errs 60
[/dev/mapper/msata-home].generation_errs 0
[…]

(thats the Crucial m500)


I didn't have any explanation for this, but I suspected some unclean shutdown,
even though I remembered no unclean shutdown. I take good care to always have a
battery in this ThinkPad T520, due to unclean shutdown issues with the Intel SSD
320 (bricked device which reports 8 MiB as capacity, probably fixed by the
firmware update I applied back then).

Henrique's write-up gave me the idea that maybe it wasn't a user-triggered
unclean shutdown that caused the issue, but an unclean shutdown triggered by
the Linux kernel's SSD shutdown procedure.

Of course, I don't know whether this is the case, and I think there is no way
to prove or falsify it years after it happened. I never had this happen
again.

Thanks,
-- 
Martin

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Race to power off harming SATA SSDs
  2017-04-11 10:37   ` Martin Steigerwald
@ 2017-04-11 14:31     ` Henrique de Moraes Holschuh
  2017-04-12  7:47       ` Martin Steigerwald
  0 siblings, 1 reply; 47+ messages in thread
From: Henrique de Moraes Holschuh @ 2017-04-11 14:31 UTC (permalink / raw)
  To: Martin Steigerwald
  Cc: Tejun Heo, linux-kernel, linux-scsi, linux-ide, Hans de Goede

On Tue, 11 Apr 2017, Martin Steigerwald wrote:
> I do have a Crucial M500 and I do have an increase of that counter:
> 
> martin@merkaba:~[…]/Crucial-M500> grep "^174" smartctl-a-201*   
> smartctl-a-2014-03-05.txt:174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    
> Old_age   Always       -       1
> smartctl-a-2014-10-11-nach-prüfsummenfehlern.txt:174 Unexpect_Power_Loss_Ct  
> 0x0032   100   100   000    Old_age   Always       -       67
> smartctl-a-2015-05-01.txt:174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    
> Old_age   Always       -       105
> smartctl-a-2016-02-06.txt:174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    
> Old_age   Always       -       148
> smartctl-a-2016-07-08-unreadable-sector.txt:174 Unexpect_Power_Loss_Ct  0x0032   
> 100   100   000    Old_age   Always       -       201
> smartctl-a-2017-04-11.txt:174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    
> Old_age   Always       -       272
> 
> 
> I mostly didn't notice anything, except for one time where I indeed had a
> BTRFS checksum error, luckily within a BTRFS RAID 1 with an Intel SSD (which
> also has an unclean-shutdown attribute that is increasing).

The Crucial M500 has something called "RAIN" which it got unmodified
from its Micron datacenter siblings of the time, along with a large
amount of flash overprovisioning.  Too bad it lost the overprovisioned
supercapacitor bank present on the Microns.

RAIN does block-level N+1 RAID5-like parity across the flash chips on
top of the usual block-based ECC, and the SSD has a background scrubber
task that repairs any blocks that fail ECC correction using the RAIN
parity information.

On such an SSD, you really need multi-chip flash corruption beyond what
ECC can fix to even get the operating system/filesystem to notice any
damage, unless you are paying attention to its SMART attributes (it
counts the number of blocks that required RAIN recovery -- which implies
ECC failed to correct that block in the first place), etc.

Unfortunately, I do not have correlation data to know whether there is
an increase on RAIN-corrected or ECC-corrected blocks during the 24h
after an unclean poweroff right after STANDBY IMMEDIATE on a Crucial
M500 SSD.

> Henrique's write-up gave me the idea that maybe it wasn't a user-triggered
> unclean shutdown that caused the issue, but an unclean shutdown triggered by
> the Linux kernel's SSD shutdown procedure.

Maybe.  But that corruption could easily have been caused by something
else.  There is no shortage of possible culprits.

I expect most damage caused by unclean SSD power-offs to be hidden from
the user/operating system/filesystem by the extensive recovery
facilities present on most SSDs.

Note that the fact that data was transparently (and successfully)
recovered doesn't mean damage did not happen, or that the unit was not
harmed by it: it likely got some extra flash wear at the very least.

BTW, for the record, Windows 7 also appears to have had (and may still
have) this issue as far as I can tell.  Almost every user report of
excessive unclean power off alerts (and also of SSD bricking) to be
found on SSD vendor forums come from Windows users.

-- 
  Henrique Holschuh

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Race to power off harming SATA SSDs
  2017-04-11 14:31     ` Henrique de Moraes Holschuh
@ 2017-04-12  7:47       ` Martin Steigerwald
  0 siblings, 0 replies; 47+ messages in thread
From: Martin Steigerwald @ 2017-04-12  7:47 UTC (permalink / raw)
  To: Henrique de Moraes Holschuh
  Cc: Tejun Heo, linux-kernel, linux-scsi, linux-ide, Hans de Goede

Am Dienstag, 11. April 2017, 11:31:29 CEST schrieb Henrique de Moraes 
Holschuh:
> On Tue, 11 Apr 2017, Martin Steigerwald wrote:
> > I do have a Crucial M500 and I do have an increase of that counter:
> > 
> > martin@merkaba:~[…]/Crucial-M500> grep "^174" smartctl-a-201*
> > smartctl-a-2014-03-05.txt:174 Unexpect_Power_Loss_Ct  0x0032   100   100  
> > 000 Old_age   Always       -       1
> > smartctl-a-2014-10-11-nach-prüfsummenfehlern.txt:174
> > Unexpect_Power_Loss_Ct
> > 0x0032   100   100   000    Old_age   Always       -       67
> > smartctl-a-2015-05-01.txt:174 Unexpect_Power_Loss_Ct  0x0032   100   100  
> > 000 Old_age   Always       -       105
> > smartctl-a-2016-02-06.txt:174 Unexpect_Power_Loss_Ct  0x0032   100   100  
> > 000 Old_age   Always       -       148
> > smartctl-a-2016-07-08-unreadable-sector.txt:174 Unexpect_Power_Loss_Ct 
> > 0x0032 100   100   000    Old_age   Always       -       201
> > smartctl-a-2017-04-11.txt:174 Unexpect_Power_Loss_Ct  0x0032   100   100  
> > 000 Old_age   Always       -       272
> > 
> > 
> > I mostly didn't notice anything, except for one time where I indeed had a
> > BTRFS checksum error, luckily within a BTRFS RAID 1 with an Intel SSD
> > (which also has an unclean-shutdown attribute that is increasing).
> 
> The Crucial M500 has something called "RAIN" which it got unmodified
> from its Micron datacenter siblings of the time, along with a large
> amount of flash overprovisioning.  Too bad it lost the overprovisioned
> supercapacitor bank present on the Microns.

I think I read about this some time ago. I decided on a Crucial M500 because
in tests it wasn't the fastest, but there were hints that it might be one of
the most reliable mSATA SSDs of that time.

[… RAIN explanation …]

> > Henrique's write-up gave me the idea that maybe it wasn't a user-triggered
> > unclean shutdown that caused the issue, but an unclean shutdown triggered
> > by the Linux kernel's SSD shutdown procedure.
> 
> Maybe.  But that corruption could easily have been caused by something
> else.  There is no shortage of possible culprits.

Yes.

> I expect most damage caused by unclean SSD power-offs to be hidden from
> the user/operating system/filesystem by the extensive recovery
> facilities present on most SSDs.
> 
> Note that the fact that data was transparently (and sucessfully)
> recovered doesn't mean damage did not happen, or that the unit was not
> harmed by it: it likely got some extra flash wear at the very least.

Okay, I understand.

Well, my guess back then (I didn't fully elaborate on it in the initial mail,
but did so in the blog post) was exactly that I didn't see any capacitor on
the mSATA SSD board. But I know the Intel SSD 320 has capacitors. So I thought,
okay, maybe there really was a sudden power loss due to me trying to exchange
the battery during suspend to RAM / standby, without me remembering the event.
And I thought, okay, without a capacitor the SSD then didn't get a chance to
write some of the data. But again, this is just a guess.

I can provide the SMART data files to you in case you want to have a look at them.

> BTW, for the record, Windows 7 also appears to have had (and may still
> have) this issue as far as I can tell.  Almost every user report of
> excessive unclean power off alerts (and also of SSD bricking) to be
> found on SSD vendor forums come from Windows users.

Interesting.

Thanks,
-- 
Martin

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Race to power off harming SATA SSDs
  2017-04-10 23:52 ` Race to power off harming SATA SSDs Tejun Heo
@ 2017-05-07 20:40     ` Pavel Machek
  2017-04-11  1:26   ` Henrique de Moraes Holschuh
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 47+ messages in thread
From: Pavel Machek @ 2017-05-07 20:40 UTC (permalink / raw)
  To: Tejun Heo
  Cc: boris.brezillon, linux-scsi, Hans de Goede, linux-kernel,
	linux-ide, linux-mtd, Henrique de Moraes Holschuh, dwmw2

Hi!

> > However, *IN PRACTICE*, SATA STANDBY IMMEDIATE command completion
> > [often?] only indicates that the device is now switching to the target
> > power management state, not that it has reached the target state.  Any
> > further device status inquiries would return that it is in STANDBY mode,
> > even if it is still entering that state.
> > 
> > The kernel then continues the shutdown path while the SSD is still
> > preparing itself to be powered off, and it becomes a race.  When the
> > kernel + firmware wins, platform power is cut before the SSD has
> > finished (i.e. the SSD is subject to an unclean power-off).
> 
> At that point, the device is fully flushed and in terms of data
> integrity should be fine with losing power at any point anyway.

Actually, no, that is not how it works.

"Fully flushed" is one thing, surviving power loss is
different. Explanation below.

> > NOTE: unclean SSD power-offs are dangerous and may brick the device in
> > the worst case, or otherwise harm it (reduce longevity, damage flash
> > blocks).  It is also not impossible to get data corruption.
> 
> I get that the incrementing counters might not be pretty but I'm a bit
> skeptical about this being an actual issue.  Because if that were
> true, the device would be bricking itself from any sort of power
> losses be that an actual power loss, battery rundown or hard power off
> after crash.

And that's exactly what users see. If you do enough power failures on
an SSD, you usually brick it; some die sooner than others. There were
some test results published, some are here
http://lkcl.net/reports/ssd_analysis.html, and I believe I've seen some
others too.

It is very hard for NAND to work reliably in the face of power
failures. In fact, not even Linux MTD + UBIFS works well in that
regard. See
http://www.linux-mtd.infradead.org/faq/ubi.html. (Unfortunately, it's
down now?!). If we can't get it right, do you believe SSD manufacturers
do?

[The issue is, if you power down during an erase, you get a "weakly
erased" page, which will contain the expected 0xff's, but you'll get
bitflips there quickly. A similar issue exists for writes. It is
solvable in software, just hard and slow... and we don't do it.]
									
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Race to power off harming SATA SSDs
  2017-05-07 20:40     ` Pavel Machek
  (?)
@ 2017-05-08  7:21     ` David Woodhouse
  2017-05-08  7:38         ` Ricard Wanderlof
  2017-05-08  9:28       ` Pavel Machek
  -1 siblings, 2 replies; 47+ messages in thread
From: David Woodhouse @ 2017-05-08  7:21 UTC (permalink / raw)
  To: Pavel Machek, Tejun Heo
  Cc: Henrique de Moraes Holschuh, linux-kernel, linux-scsi, linux-ide,
	Hans de Goede, boris.brezillon, linux-mtd

[-- Attachment #1: Type: text/plain, Size: 1874 bytes --]

On Sun, 2017-05-07 at 22:40 +0200, Pavel Machek wrote:
> > > NOTE: unclean SSD power-offs are dangerous and may brick the device in
> > > the worst case, or otherwise harm it (reduce longevity, damage flash
> > > blocks).  It is also not impossible to get data corruption.
>
> > I get that the incrementing counters might not be pretty but I'm a bit
> > skeptical about this being an actual issue.  Because if that were
> > true, the device would be bricking itself from any sort of power
> > losses be that an actual power loss, battery rundown or hard power off
> > after crash.
>
> And that's exactly what users see. If you do enough power fails on a
> SSD, you usually brick it, some die sooner than others. There was some
> test results published, some are here
> http://lkcl.net/reports/ssd_analysis.html, I believe I seen some
> others too.
> 
> It is very hard for a NAND to work reliably in face of power
> failures. In fact, not even Linux MTD + UBIFS works well in that
> regards. See
> http://www.linux-mtd.infradead.org/faq/ubi.html. (Unfortunately, its
> down now?!). If we can't get it right, do you believe SSD manufactures
> do?
> 
> [Issue is, if you powerdown during erase, you get "weakly erased"
> page, which will contain expected 0xff's, but you'll get bitflips
> there quickly. Similar issue exists for writes. It is solveable in
> software, just hard and slow... and we don't do it.]

It's not that hard. We certainly do it in JFFS2. I was fairly sure that
it was also part of the design considerations for UBI — it really ought
to be right there too. I'm less sure about UBIFS but I would have
expected it to be OK.
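
Roughly, the way this is handled (in JFFS2, and the same idea applies
elsewhere) is a "clean marker": a small marker is written to an
eraseblock only after its erase is known to have completed, so on the
next mount any block that reads as erased but carries no marker is
treated as a possibly interrupted erase and erased again.  Sketch only,
with made-up helper names rather than the real code:

for_each_eraseblock(c, blk) {
	if (block_reads_empty(blk)) {
		if (!has_cleanmarker(blk)) {
			/* The erase may have been cut short by a power
			 * loss: the block reads 0xff now but may develop
			 * bitflips later.  Erase it again. */
			schedule_erase(c, blk);
		} else {
			/* Known-good, fully erased free space. */
			add_to_free_list(c, blk);
		}
	}
}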

SSDs however are often crap; power fail those at your peril. And of
course there's nothing you can do when they do fail, whereas we accept
patches for things which are implemented in Linux.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 4938 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Race to power off harming SATA SSDs
  2017-05-08  7:21     ` David Woodhouse
@ 2017-05-08  7:38         ` Ricard Wanderlof
  2017-05-08  9:28       ` Pavel Machek
  1 sibling, 0 replies; 47+ messages in thread
From: Ricard Wanderlof @ 2017-05-08  7:38 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Pavel Machek, Tejun Heo, boris.brezillon, linux-scsi,
	Hans de Goede, linux-kernel, linux-ide, linux-mtd,
	Henrique de Moraes Holschuh


On Mon, 8 May 2017, David Woodhouse wrote:

> > [Issue is, if you powerdown during erase, you get "weakly erased"
> > page, which will contain expected 0xff's, but you'll get bitflips
> > there quickly. Similar issue exists for writes. It is solveable in
> > software, just hard and slow... and we don't do it.]
> 
> It's not that hard. We certainly do it in JFFS2. I was fairly sure that
> it was also part of the design considerations for UBI ? it really ought
> to be right there too. I'm less sure about UBIFS but I would have
> expected it to be OK.

I've got a problem with the underlying mechanism. How long does it take to
erase a NAND block? A couple of milliseconds. That means that for an erase
to be "weak" due to a power fail, the host CPU must issue an erase command,
and then the power to the NAND must drop within those milliseconds.
However, in most systems there will be a power monitor which will
essentially reset the CPU as soon as the power starts dropping. So in
practice, by the time the voltage is too low to successfully supply the
NAND chip, the CPU has already been reset, and hence no erase command will
have been given by the time the NAND runs out of steam.

Sure, with switchmode power supplies, we don't have those large capacitors 
in the power supply which can keep the power going for a second or more, 
but still, I would think that the power wouldn't die fast enough for this 
to be an issue.

But I could very well be wrong and I haven't had experience with that many 
NAND flash systems. But then please tell me where the above reasoning is 
flawed.

/Ricard
-- 
Ricard Wolf Wanderlöf                           ricardw(at)axis.com
Axis Communications AB, Lund, Sweden            www.axis.com
Phone +46 46 272 2016                           Fax +46 46 13 61 30

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Race to power off harming SATA SSDs
  2017-05-08  7:38         ` Ricard Wanderlof
@ 2017-05-08  8:13           ` David Woodhouse
  -1 siblings, 0 replies; 47+ messages in thread
From: David Woodhouse @ 2017-05-08  8:13 UTC (permalink / raw)
  To: Ricard Wanderlof
  Cc: Pavel Machek, Tejun Heo, boris.brezillon, linux-scsi,
	Hans de Goede, linux-kernel, linux-ide, linux-mtd,
	Henrique de Moraes Holschuh

[-- Attachment #1: Type: text/plain, Size: 1823 bytes --]

On Mon, 2017-05-08 at 09:38 +0200, Ricard Wanderlof wrote:
> On Mon, 8 May 2017, David Woodhouse wrote:
> 
> > 
> > > 
> > > [Issue is, if you powerdown during erase, you get "weakly erased"
> > > page, which will contain expected 0xff's, but you'll get bitflips
> > > there quickly. Similar issue exists for writes. It is solveable in
> > > software, just hard and slow... and we don't do it.]
> > It's not that hard. We certainly do it in JFFS2. I was fairly sure that
> > it was also part of the design considerations for UBI ? it really ought
> > to be right there too. I'm less sure about UBIFS but I would have
> > expected it to be OK.
> I've got a problem with the underlying mechanism. How long does it take to 
> erase a NAND block? A couple of milliseconds. That means that for an erase 
> to be "weak" du to a power fail, the host CPU must issue an erase command, 
> and then the power to the NAND must drop within those milliseconds. 
> However, in most systems there will be a power monitor which will 
> essentially reset the CPU as soon as the power starts dropping. So in 
> practice, by the time the voltage is too low to successfully supply the 
> NAND chip, the CPU has already been reset, hence, no reset command will 
> have been given by the time NAND runs out of steam.
> 
> Sure, with switchmode power supplies, we don't have those large capacitors 
> in the power supply which can keep the power going for a second or more, 
> but still, I would think that the power wouldn't die fast enough for this 
> to be an issue.
> 
> But I could very well be wrong and I haven't had experience with that many 
> NAND flash systems. But then please tell me where the above reasoning is 
> flawed.

Our empirical testing trumps your "can never happen" theory :)

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 4938 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Race to power off harming SATA SSDs
  2017-05-08  8:13           ` David Woodhouse
@ 2017-05-08  8:36             ` Ricard Wanderlof
  -1 siblings, 0 replies; 47+ messages in thread
From: Ricard Wanderlof @ 2017-05-08  8:36 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Pavel Machek, Tejun Heo, boris.brezillon, linux-scsi,
	Hans de Goede, linux-kernel, linux-ide, linux-mtd,
	Henrique de Moraes Holschuh


On Mon, 8 May 2017, David Woodhouse wrote:

> > I've got a problem with the underlying mechanism. How long does it take to 
> > erase a NAND block? A couple of milliseconds. That means that for an erase 
> > to be "weak" du to a power fail, the host CPU must issue an erase command, 
> > and then the power to the NAND must drop within those milliseconds. 
> > However, in most systems there will be a power monitor which will 
> > essentially reset the CPU as soon as the power starts dropping. So in 
> > practice, by the time the voltage is too low to successfully supply the 
> > NAND chip, the CPU has already been reset, hence, no reset command will 
> > have been given by the time NAND runs out of steam.
> > 
> > Sure, with switchmode power supplies, we don't have those large capacitors 
> > in the power supply which can keep the power going for a second or more, 
> > but still, I would think that the power wouldn't die fast enough for this 
> > to be an issue.
> > 
> Our empirical testing trumps your "can never happen" theory :)

I'm sure it does. But what is the explanation then? Has anyone analyzed 
what is going on using an oscilloscope to verify the relationship between
the erase command and the supply voltage drop?

/Ricard
-- 
Ricard Wolf Wanderlöf                           ricardw(at)axis.com
Axis Communications AB, Lund, Sweden            www.axis.com
Phone +46 46 272 2016                           Fax +46 46 13 61 30

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Race to power off harming SATA SSDs
  2017-05-08  8:36             ` Ricard Wanderlof
@ 2017-05-08  8:54               ` David Woodhouse
  -1 siblings, 0 replies; 47+ messages in thread
From: David Woodhouse @ 2017-05-08  8:54 UTC (permalink / raw)
  To: Ricard Wanderlof
  Cc: Pavel Machek, Tejun Heo, boris.brezillon, linux-scsi,
	Hans de Goede, linux-kernel, linux-ide, linux-mtd,
	Henrique de Moraes Holschuh

[-- Attachment #1: Type: text/plain, Size: 794 bytes --]

On Mon, 2017-05-08 at 10:36 +0200, Ricard Wanderlof wrote:
> On Mon, 8 May 2017, David Woodhouse wrote:
> > Our empirical testing trumps your "can never happen" theory :)
>
> I'm sure it does. But what is the explanation then? Has anyone analyzed 
> what is going on using an oscilloscope to verify relationship between 
> erase command and supply voltage drop?

Not that I'm aware of. Once we have reached the "it does happen and we
have to cope" there was not a lot of point in working out *why* it
happened.

In fact, the only examples I *personally* remember were on NOR flash,
which takes longer to erase. So it's vaguely possible that it doesn't
happen on NAND. But really, it's not something we should be depending
on and the software mechanisms have to remain in place.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Race to power off harming SATA SSDs
  2017-05-08  8:54               ` David Woodhouse
@ 2017-05-08  9:06                 ` Ricard Wanderlof
  -1 siblings, 0 replies; 47+ messages in thread
From: Ricard Wanderlof @ 2017-05-08  9:06 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Pavel Machek, Tejun Heo, boris.brezillon, linux-scsi,
	Hans de Goede, linux-kernel, linux-ide, linux-mtd,
	Henrique de Moraes Holschuh


On Mon, 8 May 2017, David Woodhouse wrote:

> > On Mon, 8 May 2017, David Woodhouse wrote:
> > > Our empirical testing trumps your "can never happen" theory :)
> >
> > I'm sure it does. But what is the explanation then? Has anyone analyzed 
> > what is going on using an oscilloscope to verify relationship between 
> > erase command and supply voltage drop?
> 
> Not that I'm aware of. Once we have reached the "it does happen and we
> have to cope" there was not a lot of point in working out *why* it
> happened.
> 
> In fact, the only examples I *personally* remember were on NOR flash,
> which takes longer to erase. So it's vaguely possible that it doesn't
> happen on NAND. But really, it's not something we should be depending
> on and the software mechanisms have to remain in place.

My point is really this: say that the problem is in fact not that the erase
is cut short due to the power fail, but that the software issues a second
command before the first erase command has completed, for instance, or
some other situation. Then we'd have a concrete situation which we can
resolve (i.e., fix the bug), rather than assuming that it's the hardware's
fault and implementing various software workarounds.

On the other hand, making the software resilient to erase problems 
essentially makes the system more robust in any case, so it's not a bad 
thing of course.

It's just that I've seen this "we're software guys, and it must be the
hardware's fault" (and vice versa) enough times to cause a small warning
bell to go off here.

/Ricard
-- 
Ricard Wolf Wanderlöf                           ricardw(at)axis.com
Axis Communications AB, Lund, Sweden            www.axis.com
Phone +46 46 272 2016                           Fax +46 46 13 61 30

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Race to power off harming SATA SSDs
  2017-05-08  9:06                 ` Ricard Wanderlof
  (?)
@ 2017-05-08  9:09                 ` Hans de Goede
  2017-05-08 10:13                   ` David Woodhouse
  -1 siblings, 1 reply; 47+ messages in thread
From: Hans de Goede @ 2017-05-08  9:09 UTC (permalink / raw)
  To: Ricard Wanderlof, David Woodhouse
  Cc: Pavel Machek, Tejun Heo, boris.brezillon, linux-scsi,
	linux-kernel, linux-ide, linux-mtd, Henrique de Moraes Holschuh

Hi,

On 08-05-17 11:06, Ricard Wanderlof wrote:
> 
> On Mon, 8 May 2017, David Woodhouse wrote:
> 
>>> On Mon, 8 May 2017, David Woodhouse wrote:
>>>> Our empirical testing trumps your "can never happen" theory :)
>>>
>>> I'm sure it does. But what is the explanation then? Has anyone analyzed
>>> what is going on using an oscilloscope to verify relationship between
>>> erase command and supply voltage drop?
>>
>> Not that I'm aware of. Once we have reached the "it does happen and we
>> have to cope" there was not a lot of point in working out *why* it
>> happened.
>>
>> In fact, the only examples I *personally* remember were on NOR flash,
>> which takes longer to erase. So it's vaguely possible that it doesn't
>> happen on NAND. But really, it's not something we should be depending
>> on and the software mechanisms have to remain in place.
> 
> My point is really this: say that the problem is in fact not that the erase
> is cut short due to the power fail, but that the software issues a second
> command before the first erase command has completed, for instance, or
> some other situation. Then we'd have a concrete situation which we can
> resolve (i.e., fix the bug), rather than assuming that it's the hardware's
> fault and implementing various software workarounds.

You're forgetting that the SSD itself (this thread is about SSDs) also has
a major software component which is doing housekeeping all the time, so even
if the main CPU gets reset the SSD's controller may still happily be erasing
blocks.

Regards,

Hans

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Race to power off harming SATA SSDs
  2017-05-08  7:21     ` David Woodhouse
  2017-05-08  7:38         ` Ricard Wanderlof
@ 2017-05-08  9:28       ` Pavel Machek
  2017-05-08  9:34         ` David Woodhouse
  2017-05-08  9:51         ` Richard Weinberger
  1 sibling, 2 replies; 47+ messages in thread
From: Pavel Machek @ 2017-05-08  9:28 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Tejun Heo, Henrique de Moraes Holschuh, linux-kernel, linux-scsi,
	linux-ide, Hans de Goede, boris.brezillon, linux-mtd

On Mon 2017-05-08 08:21:34, David Woodhouse wrote:
> On Sun, 2017-05-07 at 22:40 +0200, Pavel Machek wrote:
> > > > NOTE: unclean SSD power-offs are dangerous and may brick the device in
> > > > the worst case, or otherwise harm it (reduce longevity, damage flash
> > > > blocks).  It is also not impossible to get data corruption.
> >
> > > I get that the incrementing counters might not be pretty but I'm a bit
> > > skeptical about this being an actual issue.  Because if that were
> > > true, the device would be bricking itself from any sort of power
> > > losses be that an actual power loss, battery rundown or hard power off
> > > after crash.
> >
> > And that's exactly what users see. If you do enough power fails on a
> > SSD, you usually brick it, some die sooner than others. There was some
> > test results published, some are here
> > http://lkcl.net/reports/ssd_analysis.html, I believe I seen some
> > others too.
> > 
> > It is very hard for a NAND to work reliably in the face of power
> > failures. In fact, not even Linux MTD + UBIFS works well in that
> > regard. See
> > http://www.linux-mtd.infradead.org/faq/ubi.html. (Unfortunately, it's
> > down now?!). If we can't get it right, do you believe SSD manufacturers
> > do?
> > 
> > [Issue is, if you power down during erase, you get a "weakly erased"
> > page, which will contain the expected 0xff's, but you'll get bitflips
> > there quickly. A similar issue exists for writes. It is solvable in
> > software, just hard and slow... and we don't do it.]
> 
> It's not that hard. We certainly do it in JFFS2. I was fairly sure that
> it was also part of the design considerations for UBI — it really ought
> to be right there too. I'm less sure about UBIFS but I would have
> expected it to be OK.

Are you sure you have it right in JFFS2? Do you journal block erases?
Apparently, that was pretty much non-issue on older flashes.

https://web-beta.archive.org/web/20160923094716/http://www.linux-mtd.infradead.org:80/doc/ubifs.html#L_unstable_bits

> SSDs however are often crap; power fail those at your peril. And of
> course there's nothing you can do when they do fail, whereas we accept
> patches for things which are implemented in Linux.

Agreed. If the SSD indicates an unexpected powerdown, it is a problem
and we need to fix it.

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Race to power off harming SATA SSDs
  2017-05-08  9:28       ` Pavel Machek
@ 2017-05-08  9:34         ` David Woodhouse
  2017-05-08 10:49           ` Pavel Machek
  2017-05-08  9:51         ` Richard Weinberger
  1 sibling, 1 reply; 47+ messages in thread
From: David Woodhouse @ 2017-05-08  9:34 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Tejun Heo, Henrique de Moraes Holschuh, linux-kernel, linux-scsi,
	linux-ide, Hans de Goede, boris.brezillon, linux-mtd

[-- Attachment #1: Type: text/plain, Size: 809 bytes --]

On Mon, 2017-05-08 at 11:28 +0200, Pavel Machek wrote:
> 
> Are you sure you have it right in JFFS2? Do you journal block erases?
> Apparently, that was pretty much non-issue on older flashes.

It isn't necessary in JFFS2. It is a *purely* log-structured file
system (which is why it doesn't scale well past the 1GiB or so that we
made it handle for OLPC).

So we don't erase a block until all its contents are obsolete. And if
we fail to complete the erase... well the contents are either going to
fail a CRC check, or... still be obsoleted by later entries elsewhere.

And even if it *looks* like an erase has completed and the block is all
0xFF, we erase it again and write a 'clean marker' to it to indicate
that the erase was completed successfully. Because otherwise it can't
be trusted.
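
(To make that rule concrete: a minimal user-space sketch of the mount-time
check, with hypothetical types and helpers standing in for the flash
plumbing -- not the actual JFFS2 code.)

#include <stdbool.h>

/*
 * Sketch only: an apparently empty (all-0xFF) block without a clean
 * marker cannot be trusted, because the erase that produced the 0xFF
 * pattern may have been interrupted.  Re-erase it before reuse.
 */
struct flash_block {
        bool all_ff;           /* block reads back as all 0xFF        */
        bool has_cleanmarker;  /* clean marker node found in block    */
};

static void erase_block(struct flash_block *b)
{
        b->all_ff = true;              /* stand-in for the real erase */
}

static void write_cleanmarker(struct flash_block *b)
{
        b->has_cleanmarker = true;
}

/* Called for each block that looks empty when the filesystem mounts. */
static void check_apparently_empty_block(struct flash_block *b)
{
        if (b->has_cleanmarker)
                return;                /* erase known to have completed */

        if (b->all_ff) {
                erase_block(b);        /* re-erase, just in case        */
                write_cleanmarker(b);  /* now the block can be trusted  */
        }
}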

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Race to power off harming SATA SSDs
  2017-05-08  9:28       ` Pavel Machek
  2017-05-08  9:34         ` David Woodhouse
@ 2017-05-08  9:51         ` Richard Weinberger
  1 sibling, 0 replies; 47+ messages in thread
From: Richard Weinberger @ 2017-05-08  9:51 UTC (permalink / raw)
  To: Pavel Machek
  Cc: David Woodhouse, Tejun Heo, Henrique de Moraes Holschuh, LKML,
	linux-scsi, linux-ide, Hans de Goede, Boris Brezillon, linux-mtd

Pavel,

On Mon, May 8, 2017 at 11:28 AM, Pavel Machek <pavel@ucw.cz> wrote:
> Are you sure you have it right in JFFS2? Do you journal block erases?
> Apparently, that was pretty much non-issue on older flashes.

This is what the website says, yes. Do you have hardware where you can
trigger it?
If so, I'd love to get access to it.

So far I have never seen the issue; sometimes people claim to suffer from
it, but when I inspect the problems in detail it is always something else.

-- 
Thanks,
//richard

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Race to power off harming SATA SSDs
  2017-05-08  9:06                 ` Ricard Wanderlof
  (?)
@ 2017-05-08 10:12                   ` David Woodhouse
  -1 siblings, 0 replies; 47+ messages in thread
From: David Woodhouse @ 2017-05-08 10:12 UTC (permalink / raw)
  To: Ricard Wanderlof
  Cc: boris.brezillon, Henrique de Moraes Holschuh, linux-scsi,
	linux-ide, linux-kernel, Hans de Goede, linux-mtd, Pavel Machek,
	Tejun Heo


[-- Attachment #1.1: Type: text/plain, Size: 1046 bytes --]

On Mon, 2017-05-08 at 11:06 +0200, Ricard Wanderlof wrote:
> 
> My point is really this: say that the problem is in fact not that the erase
> is cut short due to the power fail, but that the software issues a second
> command before the first erase command has completed, for instance, or
> some other situation. Then we'd have a concrete situation which we can
> resolve (i.e., fix the bug), rather than assuming that it's the hardware's
> fault and implementing various software workarounds.

On NOR flash we have *definitely* seen it during powerfail testing.

A block looks like it's all 0xFF when you read it back on mount, but if
you read it repeatedly, you may see bit flips because it wasn't
completely erased. And even if you read it ten times and 'trust' that
it's properly erased, it could start to show those bit flips when you
start to program it.

It was very repeatable, and that's when we implemented the 'clean
markers' written after a successful erase, rather than trusting a block
that "looks empty".

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Race to power off harming SATA SSDs
  2017-05-08  9:09                 ` Hans de Goede
@ 2017-05-08 10:13                   ` David Woodhouse
  2017-05-08 11:50                     ` Boris Brezillon
  0 siblings, 1 reply; 47+ messages in thread
From: David Woodhouse @ 2017-05-08 10:13 UTC (permalink / raw)
  To: Hans de Goede, Ricard Wanderlof
  Cc: Pavel Machek, Tejun Heo, boris.brezillon, linux-scsi,
	linux-kernel, linux-ide, linux-mtd, Henrique de Moraes Holschuh

[-- Attachment #1: Type: text/plain, Size: 425 bytes --]

On Mon, 2017-05-08 at 11:09 +0200, Hans de Goede wrote:
> You're forgetting that the SSD itself (this thread is about SSDs) also has
> a major software component which is doing housekeeping all the time, so even
> if the main CPU gets reset the SSD's controller may still happily be erasing
> blocks.

We're not really talking about SSDs at all any more; we're talking
about real flash with real maintainable software.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Race to power off harming SATA SSDs
  2017-05-08  9:34         ` David Woodhouse
@ 2017-05-08 10:49           ` Pavel Machek
  2017-05-08 11:06             ` Richard Weinberger
  2017-05-08 11:09             ` David Woodhouse
  0 siblings, 2 replies; 47+ messages in thread
From: Pavel Machek @ 2017-05-08 10:49 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Tejun Heo, Henrique de Moraes Holschuh, linux-kernel, linux-scsi,
	linux-ide, Hans de Goede, boris.brezillon, linux-mtd

On Mon 2017-05-08 10:34:08, David Woodhouse wrote:
> On Mon, 2017-05-08 at 11:28 +0200, Pavel Machek wrote:
> > 
> > Are you sure you have it right in JFFS2? Do you journal block erases?
> > Apparently, that was pretty much non-issue on older flashes.
> 
> It isn't necessary in JFFS2. It is a *purely* log-structured file
> system (which is why it doesn't scale well past the 1GiB or so that we
> made it handle for OLPC).
> 
> So we don't erase a block until all its contents are obsolete. And if
> we fail to complete the erase... well the contents are either going to
> fail a CRC check, or... still be obsoleted by later entries elsewhere.
> 
> And even if it *looks* like an erase has completed and the block is all
> 0xFF, we erase it again and write a 'clean marker' to it to indicate
> that the erase was completed successfully. Because otherwise it can't
> be trusted.

Aha, nice, so it looks like ubifs is a step back here.

'clean marker' is a good idea... empty pages have plenty of space.

How do you handle the issue during regular write? Always ignore last
successfully written block?

Do you handle "paired pages" problem on MLC?

Best regards,
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Race to power off harming SATA SSDs
  2017-05-08 10:49           ` Pavel Machek
@ 2017-05-08 11:06             ` Richard Weinberger
  2017-05-08 11:48               ` Boris Brezillon
  2017-05-08 11:09             ` David Woodhouse
  1 sibling, 1 reply; 47+ messages in thread
From: Richard Weinberger @ 2017-05-08 11:06 UTC (permalink / raw)
  To: Pavel Machek
  Cc: David Woodhouse, Boris Brezillon, linux-scsi, Hans de Goede,
	LKML, linux-ide, linux-mtd, Henrique de Moraes Holschuh,
	Tejun Heo

On Mon, May 8, 2017 at 12:49 PM, Pavel Machek <pavel@ucw.cz> wrote:
> Aha, nice, so it looks like ubifs is a step back here.
>
> 'clean marker' is a good idea... empty pages have plenty of space.

If UBI (not UBIFS) faces an empty block, it also re-erases it.
The EC header is used as a clean marker.

> How do you handle the issue during regular write? Always ignore last
> successfully written block?

The last page of a block is inspected and allowed to be corrupted.

> Do you handle "paired pages" problem on MLC?

Nope, no MLC support in mainline so far.

-- 
Thanks,
//richard

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Race to power off harming SATA SSDs
  2017-05-08 10:49           ` Pavel Machek
  2017-05-08 11:06             ` Richard Weinberger
@ 2017-05-08 11:09             ` David Woodhouse
  2017-05-08 12:32               ` Pavel Machek
  1 sibling, 1 reply; 47+ messages in thread
From: David Woodhouse @ 2017-05-08 11:09 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Tejun Heo, Henrique de Moraes Holschuh, linux-kernel, linux-scsi,
	linux-ide, Hans de Goede, boris.brezillon, linux-mtd

[-- Attachment #1: Type: text/plain, Size: 2072 bytes --]

On Mon, 2017-05-08 at 12:49 +0200, Pavel Machek wrote:
> On Mon 2017-05-08 10:34:08, David Woodhouse wrote:
> > 
> > On Mon, 2017-05-08 at 11:28 +0200, Pavel Machek wrote:
> > > 
> > > 
> > > Are you sure you have it right in JFFS2? Do you journal block erases?
> > > Apparently, that was pretty much non-issue on older flashes.
> > It isn't necessary in JFFS2. It is a *purely* log-structured file
> > system (which is why it doesn't scale well past the 1GiB or so that we
> > made it handle for OLPC).
> > 
> > So we don't erase a block until all its contents are obsolete. And if
> > we fail to complete the erase... well the contents are either going to
> > fail a CRC check, or... still be obsoleted by later entries elsewhere.
> > 
> > And even if it *looks* like an erase has completed and the block is all
> > 0xFF, we erase it again and write a 'clean marker' to it to indicate
> > that the erase was completed successfully. Because otherwise it can't
> > be trusted.
> Aha, nice, so it looks like ubifs is a step back here.
> 
> 'clean marker' is a good idea... empty pages have plenty of space.

Well... you lose that space permanently. Although I suppose you could
do things differently and erase a block immediately prior to using it.
But in that case why ever write the cleanmarker? Just maintain a set of
blocks that you *will* erase and re-use.

> How do you handle the issue during regular write? Always ignore last
> successfully written block?

Log nodes have a CRC. If you get interrupted during a write, that CRC
should fail.

> Do you handle "paired pages" problem on MLC?

No. It would theoretically be possible, by not considering a write to
the first page "committed" until the second page of the pair is also
written. Essentially, it's not far off expanding the existing 'wbuf'
which we use to gather writes into full pages for NAND, to cover the
*whole* of the set of pages which are affected by MLC.

But we mostly consider JFFS2 to be obsolete these days, in favour of
UBI/UBIFS or other approaches.
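
(For illustration only: a rough sketch of that "not committed until the
pair is written" idea, using hypothetical structures rather than JFFS2's
real wbuf code.)

#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch only: on MLC NAND the lower and upper page of a pair share the
 * same cells, so an interrupted program of the upper page can corrupt
 * data already written to the lower page.  A node is therefore kept
 * "uncommitted" (older copies elsewhere are not obsoleted) until both
 * pages of the pair have been programmed.
 */
struct pending_node {
        uint32_t lower_page;    /* page the node was written to         */
        uint32_t upper_page;    /* the MLC page paired with lower_page  */
        bool     upper_written; /* has the paired page been programmed? */
        bool     committed;     /* safe to obsolete the previous copy?  */
};

static void page_programmed(struct pending_node *n, uint32_t page)
{
        if (page == n->upper_page)
                n->upper_written = true;

        /*
         * Only once the whole pair is on flash does the new copy become
         * the one relied upon; until then the old copy must stay valid.
         */
        if (n->upper_written)
                n->committed = true;
}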

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Race to power off harming SATA SSDs
  2017-05-08 11:06             ` Richard Weinberger
@ 2017-05-08 11:48               ` Boris Brezillon
  2017-05-08 11:55                 ` Boris Brezillon
  2017-05-08 12:13                 ` Richard Weinberger
  0 siblings, 2 replies; 47+ messages in thread
From: Boris Brezillon @ 2017-05-08 11:48 UTC (permalink / raw)
  To: Richard Weinberger
  Cc: Pavel Machek, David Woodhouse, linux-scsi, Hans de Goede, LKML,
	linux-ide, linux-mtd, Henrique de Moraes Holschuh, Tejun Heo

On Mon, 8 May 2017 13:06:17 +0200
Richard Weinberger <richard.weinberger@gmail.com> wrote:

> On Mon, May 8, 2017 at 12:49 PM, Pavel Machek <pavel@ucw.cz> wrote:
> > Aha, nice, so it looks like ubifs is a step back here.
> >
> > 'clean marker' is a good idea... empty pages have plenty of space.  
> 
> If UBI (not UBIFS) faces an empty block, it also re-erases it.

Unfortunately, that's not the case, though UBI can easily be patched
to do that (see below).

> The EC header is used as a clean marker.

That is true. If the EC header has been written to a block, that means
this block has been correctly erased.

> 
> > How do you handle the issue during regular write? Always ignore last
> > successfully written block?  

I guess UBIFS can know what was written last, because of the log-based
approach + the seqnum stored along with FS nodes, but I'm pretty sure
UBIFS does not re-write the last written block in case of an unclean
mount. Richard, am I wrong?

> 
> The last page of a block is inspected and allowed to be corrupted.

Actually, it's not really about corrupted pages, it's about pages that
might become unreadable after a few reads.

> 
> > Do you handle "paired pages" problem on MLC?  
> 
> Nope, no MLC support in mainline so far.

Richard and I have put a lot of effort into reliably supporting MLC NANDs
in mainline; unfortunately this project has been paused. You can access
the last version of our work here [1] if you're interested (it's
clearly not in a shippable state ;-)).

[1]https://github.com/bbrezillon/linux-sunxi/commits/bb/4.7/ubi-mlc

--->8---
diff --git a/drivers/mtd/ubi/attach.c b/drivers/mtd/ubi/attach.c
index 93ceea4f27d5..3d76941c9570 100644
--- a/drivers/mtd/ubi/attach.c
+++ b/drivers/mtd/ubi/attach.c
@@ -1121,21 +1121,20 @@ static int scan_peb(struct ubi_device *ubi, struct ubi_attach_info *ai,
                        return err;
                goto adjust_mean_ec;
        case UBI_IO_FF_BITFLIPS:
+       case UBI_IO_FF:
+               /*
+                * Always erase the block if the EC header is empty, even if
+                * no bitflips were reported because otherwise we might
+                * expose ourselves to the 'unstable bits' issue described
+                * here:
+                *
+                * http://www.linux-mtd.infradead.org/doc/ubifs.html#L_unstable_bits
+                */
                err = add_to_list(ai, pnum, UBI_UNKNOWN, UBI_UNKNOWN,
                                  ec, 1, &ai->erase);
                if (err)
                        return err;
                goto adjust_mean_ec;
-       case UBI_IO_FF:
-               if (ec_err || bitflips)
-                       err = add_to_list(ai, pnum, UBI_UNKNOWN,
-                                         UBI_UNKNOWN, ec, 1, &ai->erase);
-               else
-                       err = add_to_list(ai, pnum, UBI_UNKNOWN,
-                                         UBI_UNKNOWN, ec, 0, &ai->free);
-               if (err)
-                       return err;
-               goto adjust_mean_ec;
        default:
                ubi_err(ubi, "'ubi_io_read_vid_hdr()' returned unknown code %d",
                        err);

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: Race to power off harming SATA SSDs
  2017-05-08 10:13                   ` David Woodhouse
@ 2017-05-08 11:50                     ` Boris Brezillon
  2017-05-08 15:40                       ` David Woodhouse
  2017-05-08 16:43                       ` Pavel Machek
  0 siblings, 2 replies; 47+ messages in thread
From: Boris Brezillon @ 2017-05-08 11:50 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Hans de Goede, Ricard Wanderlof, Pavel Machek, Tejun Heo,
	linux-scsi, linux-kernel, linux-ide, linux-mtd,
	Henrique de Moraes Holschuh

On Mon, 08 May 2017 11:13:10 +0100
David Woodhouse <dwmw2@infradead.org> wrote:

> On Mon, 2017-05-08 at 11:09 +0200, Hans de Goede wrote:
> > You're forgetting that the SSD itself (this thread is about SSDs) also has
> > a major software component which is doing housekeeping all the time, so even
> > if the main CPU gets reset the SSD's controller may still happily be erasing
> > blocks.  
> 
> We're not really talking about SSDs at all any more; we're talking
> about real flash with real maintainable software.

It's probably a good sign that this new discussion should take place in
a different thread :-).

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Race to power off harming SATA SSDs
  2017-05-08 11:48               ` Boris Brezillon
@ 2017-05-08 11:55                 ` Boris Brezillon
  2017-05-08 12:13                 ` Richard Weinberger
  1 sibling, 0 replies; 47+ messages in thread
From: Boris Brezillon @ 2017-05-08 11:55 UTC (permalink / raw)
  To: Richard Weinberger
  Cc: Pavel Machek, David Woodhouse, linux-scsi, Hans de Goede, LKML,
	linux-ide, linux-mtd, Henrique de Moraes Holschuh, Tejun Heo

On Mon, 8 May 2017 13:48:07 +0200
Boris Brezillon <boris.brezillon@free-electrons.com> wrote:

> On Mon, 8 May 2017 13:06:17 +0200
> Richard Weinberger <richard.weinberger@gmail.com> wrote:
> 
> > On Mon, May 8, 2017 at 12:49 PM, Pavel Machek <pavel@ucw.cz> wrote:  
> > > Aha, nice, so it looks like ubifs is a step back here.
> > >
> > > 'clean marker' is a good idea... empty pages have plenty of space.    
> > 
> > If UBI (not UBIFS) faces an empty block, it also re-erases it.  
> 
> Unfortunately, that's not the case, though UBI can easily be patched
> to do that (see below).

Sorry for the noise, I was wrong, UBI already re-erases empty blocks
[1].

[1]http://elixir.free-electrons.com/linux/latest/source/drivers/mtd/ubi/attach.c#L983

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Race to power off harming SATA SSDs
  2017-05-08 11:48               ` Boris Brezillon
  2017-05-08 11:55                 ` Boris Brezillon
@ 2017-05-08 12:13                 ` Richard Weinberger
  1 sibling, 0 replies; 47+ messages in thread
From: Richard Weinberger @ 2017-05-08 12:13 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Pavel Machek, David Woodhouse, linux-scsi, Hans de Goede, LKML,
	linux-ide, linux-mtd, Henrique de Moraes Holschuh, Tejun Heo

Boris,

Am 08.05.2017 um 13:48 schrieb Boris Brezillon:
>>> How do you handle the issue during regular write? Always ignore last
>>> successfully written block?  
> 
> I guess UBIFS can know what was written last, because of the log-based
> approach + the seqnum stored along with FS nodes, but I'm pretty sure
> UBIFS does not re-write the last written block in case of an unclean
> mount. Richard, am I wrong?

Yes. UBIFS has the machinery but uses it differently. When it faces ECC
errors while replaying the journal it can recover good data from the LEB.
It assumes that an interrupted write always leads to ECC errors.

>>
>> The last page of a block is inspected and allowed to be corrupted.
> 
> Actually, it's not really about corrupted pages, it's about pages that
> might become unreadable after a few reads.

As stated before, it assumes an ECC error from an interrupted write.

We could automatically re-write everything in UBIFS that was written last,
but we don't have this information for data UBI itself wrote, since UBI has
no journal.

If unstable bits can be triggered on current systems, we can think of a
clever trick to deal with that. So far nobody has been able to show me the
problem.

Thanks,
//richard

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Race to power off harming SATA SSDs
  2017-05-08 11:09             ` David Woodhouse
@ 2017-05-08 12:32               ` Pavel Machek
  0 siblings, 0 replies; 47+ messages in thread
From: Pavel Machek @ 2017-05-08 12:32 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Tejun Heo, Henrique de Moraes Holschuh, linux-kernel, linux-scsi,
	linux-ide, Hans de Goede, boris.brezillon, linux-mtd

Hi!

> > 'clean marker' is a good idea... empty pages have plenty of space.
> 
> Well... you lose that space permanently. Although I suppose you could
> do things differently and erase a block immediately prior to using it.
> But in that case why ever write the cleanmarker? Just maintain a set of
> blocks that you *will* erase and re-use.

Yes, but erase is slow so that would hurt performance...?

> > How do you handle the issue during regular write? Always ignore last
> > successfully written block?
> 
> Log nodes have a CRC. If you get interrupted during a write, that CRC
> should fail.

Umm. That is not what "unstable bits" issue is about, right?

If you are interrupted during write, you can get into state where
readback will be correct on next boot (CRC, ECC ok), but then the bits
will go back few hours after that. You can't rely on checksums to
detect that.. because the bits will have the right values -- for a while.

> > Do you handle "paired pages" problem on MLC?
> 
> No. It would theoretically be possible, by not considering a write to
> the first page "committed" until the second page of the pair is also
> written. Essentially, it's not far off expanding the existing 'wbuf'
> which we use to gather writes into full pages for NAND, to cover the
> *whole* of the set of pages which are affected by MLC.
> 
> But we mostly consider JFFS2 to be obsolete these days, in favour of
> UBI/UBIFS or other approaches.

Yes, I guess MLC NAND chips are mostly too big for JFFS2.
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Race to power off harming SATA SSDs
  2017-05-08 11:50                     ` Boris Brezillon
@ 2017-05-08 15:40                       ` David Woodhouse
  2017-05-08 21:36                         ` Pavel Machek
  2017-05-08 16:43                       ` Pavel Machek
  1 sibling, 1 reply; 47+ messages in thread
From: David Woodhouse @ 2017-05-08 15:40 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Hans de Goede, Ricard Wanderlof, Pavel Machek, Tejun Heo,
	linux-scsi, linux-kernel, linux-ide, linux-mtd,
	Henrique de Moraes Holschuh

[-- Attachment #1: Type: text/plain, Size: 830 bytes --]

On Mon, 2017-05-08 at 13:50 +0200, Boris Brezillon wrote:
> On Mon, 08 May 2017 11:13:10 +0100
> David Woodhouse <dwmw2@infradead.org> wrote:
> 
> > 
> > On Mon, 2017-05-08 at 11:09 +0200, Hans de Goede wrote:
> > > 
> > > You're forgetting that the SSD itself (this thread is about SSDs) also has
> > > a major software component which is doing housekeeping all the time, so even
> > > if the main CPU gets reset the SSD's controller may still happily be erasing
> > > blocks.  
> > We're not really talking about SSDs at all any more; we're talking
> > about real flash with real maintainable software.
>
> It's probably a good sign that this new discussion should take place in
> a different thread :-).

Well, maybe. But it was a silly thread in the first place. SATA SSDs
aren't *expected* to be reliable.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Race to power off harming SATA SSDs
  2017-05-08 11:50                     ` Boris Brezillon
  2017-05-08 15:40                       ` David Woodhouse
@ 2017-05-08 16:43                       ` Pavel Machek
  2017-05-08 17:43                         ` Tejun Heo
  2017-05-08 18:29                         ` Atlant Schmidt
  1 sibling, 2 replies; 47+ messages in thread
From: Pavel Machek @ 2017-05-08 16:43 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: David Woodhouse, Hans de Goede, Ricard Wanderlof, Tejun Heo,
	linux-scsi, linux-kernel, linux-ide, linux-mtd,
	Henrique de Moraes Holschuh

[-- Attachment #1: Type: text/plain, Size: 1227 bytes --]

On Mon 2017-05-08 13:50:05, Boris Brezillon wrote:
> On Mon, 08 May 2017 11:13:10 +0100
> David Woodhouse <dwmw2@infradead.org> wrote:
> 
> > On Mon, 2017-05-08 at 11:09 +0200, Hans de Goede wrote:
> > > You're forgetting that the SSD itself (this thread is about SSDs) also has
> > > a major software component which is doing housekeeping all the time, so even
> > > if the main CPU gets reset the SSD's controller may still happily be erasing
> > > blocks.  
> > 
> > We're not really talking about SSDs at all any more; we're talking
> > about real flash with real maintainable software.
> 
> It's probably a good sign that this new discussion should take place in
> a different thread :-).

Well, you are right.. and I'm responsible.

What I was trying to point out was that storage people try to treat
SSDs as HDDs... and SSDs are very different. Hard drives mostly survive
power failures (with emergency parking), while it is very, very difficult
to make an SSD survive a random power failure, and we have to make sure we
always power down SSDs "cleanly".

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Race to power off harming SATA SSDs
  2017-05-08 16:43                       ` Pavel Machek
@ 2017-05-08 17:43                         ` Tejun Heo
  2017-05-08 18:56                           ` Pavel Machek
  2017-05-08 18:29                         ` Atlant Schmidt
  1 sibling, 1 reply; 47+ messages in thread
From: Tejun Heo @ 2017-05-08 17:43 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Boris Brezillon, David Woodhouse, Hans de Goede,
	Ricard Wanderlof, linux-scsi, linux-kernel, linux-ide, linux-mtd,
	Henrique de Moraes Holschuh

Hello,

On Mon, May 08, 2017 at 06:43:22PM +0200, Pavel Machek wrote:
> What I was trying to point out was that storage people try to treat
> SSDs as HDDs... and SSDs are very different. Hard drives mostly survive
> power failures (with emergency parking), while it is very, very difficult
> to make an SSD survive a random power failure, and we have to make sure we
> always power down SSDs "cleanly".

We do.

The issue raised is that some SSDs still increment the unexpected
power loss count even after clean shutdown sequence and that the
kernel should wait for some secs before powering off.

We can do that for select devices but I want something more than "this
SMART counter is getting incremented" before doing that.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: Race to power off harming SATA SSDs
  2017-05-08 16:43                       ` Pavel Machek
  2017-05-08 17:43                         ` Tejun Heo
@ 2017-05-08 18:29                         ` Atlant Schmidt
  1 sibling, 0 replies; 47+ messages in thread
From: Atlant Schmidt @ 2017-05-08 18:29 UTC (permalink / raw)
  To: Pavel Machek, Boris Brezillon
  Cc: Ricard Wanderlof, linux-scsi, linux-ide, linux-kernel,
	Hans de Goede, linux-mtd, Henrique de Moraes Holschuh, Tejun Heo,
	David Woodhouse

> Well, you are right.. and I'm responsible.
>
> What I was trying to point out was that storage people try to treat SSDs as HDDs...
> and SSDs are very different. Hard drives mostly survive power failures (with emergency
> parking), while it is very, very difficult to make an SSD survive a random power failure,
> and we have to make sure we always power down SSDs "cleanly".

  It all depends on the class of SSD that we're discussing.
  "Enterprise class" SSDs will often use either ultracapacitors
  or batteries to allow them to successfully complete all of
  the necessary operations upon a power cut.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Race to power off harming SATA SSDs
  2017-05-08 17:43                         ` Tejun Heo
@ 2017-05-08 18:56                           ` Pavel Machek
  2017-05-08 19:04                             ` Tejun Heo
  0 siblings, 1 reply; 47+ messages in thread
From: Pavel Machek @ 2017-05-08 18:56 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Boris Brezillon, David Woodhouse, Hans de Goede,
	Ricard Wanderlof, linux-scsi, linux-kernel, linux-ide, linux-mtd,
	Henrique de Moraes Holschuh

[-- Attachment #1: Type: text/plain, Size: 1448 bytes --]

On Mon 2017-05-08 13:43:03, Tejun Heo wrote:
> Hello,
> 
> On Mon, May 08, 2017 at 06:43:22PM +0200, Pavel Machek wrote:
> > What I was trying to point out was that storage people try to treat
> > SSDs as HDDs... and SSDs are very different. Harddrives mostly survive
> > powerfails (with emergency parking), while it is very, very difficult
> > to make SSD survive random powerfail, and we have to make sure we
> > always powerdown SSDs "cleanly".
> 
> We do.
> 
> The issue raised is that some SSDs still increment the unexpected
> power loss count even after clean shutdown sequence and that the
> kernel should wait for some secs before powering off.
> 
> We can do that for select devices but I want something more than "this
> SMART counter is getting incremented" before doing that.

Well... the SMART counter tells us that the device was not shut down
correctly. Do we have reason to believe that it is _not_ telling us the
truth? It is more than one device.

SSDs die when you power them without warning:
http://lkcl.net/reports/ssd_analysis.html

What kind of data would you like to see? "I have been using linux and
my SSD died"? We have had such reports. "I have killed 10 SSDs in a
week then I added one second delay, and this SSD survived 6 months"?


									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Race to power off harming SATA SSDs
  2017-05-08 18:56                           ` Pavel Machek
@ 2017-05-08 19:04                             ` Tejun Heo
  0 siblings, 0 replies; 47+ messages in thread
From: Tejun Heo @ 2017-05-08 19:04 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Boris Brezillon, David Woodhouse, Hans de Goede,
	Ricard Wanderlof, linux-scsi, linux-kernel, linux-ide, linux-mtd,
	Henrique de Moraes Holschuh

Hello,

On Mon, May 08, 2017 at 08:56:15PM +0200, Pavel Machek wrote:
> Well... the SMART counter tells us that the device was not shut down
> correctly. Do we have reason to believe that it is _not_ telling us the
> truth? It is more than one device.

It also finished the power-off command successfully.

> SSDs die when you power them without warning:
> http://lkcl.net/reports/ssd_analysis.html
> 
> What kind of data would you like to see? "I have been using linux and
> my SSD died"? We have had such reports. "I have killed 10 SSDs in a
> week then I added one second delay, and this SSD survived 6 months"?

Repeating shutdown cycles and showing that the device actually is in
trouble would be great.  It doesn't have to reach full-on device
failure.  Showing some sign of corruption would be enough - increase
in CRC failure counts, bad block counts (a lot of devices report
remaining reserve or lifetime in one way or the other) and so on.
Right now, it might as well be just the SMART counter being funky.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Race to power off harming SATA SSDs
  2017-05-08 15:40                       ` David Woodhouse
@ 2017-05-08 21:36                         ` Pavel Machek
  0 siblings, 0 replies; 47+ messages in thread
From: Pavel Machek @ 2017-05-08 21:36 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Boris Brezillon, Hans de Goede, Ricard Wanderlof, Tejun Heo,
	linux-scsi, linux-kernel, linux-ide, linux-mtd,
	Henrique de Moraes Holschuh

On Mon 2017-05-08 16:40:11, David Woodhouse wrote:
> On Mon, 2017-05-08 at 13:50 +0200, Boris Brezillon wrote:
> > On Mon, 08 May 2017 11:13:10 +0100
> > David Woodhouse <dwmw2@infradead.org> wrote:
> > 
> > > 
> > > On Mon, 2017-05-08 at 11:09 +0200, Hans de Goede wrote:
> > > > 
> > > > You're forgetting that the SSD itself (this thread is about SSDs) also has
> > > > a major software component which is doing housekeeping all the time, so even
> > > > if the main CPU gets reset the SSD's controller may still happily be erasing
> > > > blocks.  
> > > We're not really talking about SSDs at all any more; we're talking
> > > about real flash with real maintainable software.
> >
> > It's probably a good sign that this new discussion should take place in
> > a different thread :-).
> 
> Well, maybe. But it was a silly thread in the first place. SATA SSDs
> aren't *expected* to be reliable.

Citation needed?

I'm pretty sure SATA SSDs are expected to be reliable, up to the maximum
number of gigabytes written (specified by the manufacturer), as long as
you don't cut power without warning.

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2017-05-08 21:36 UTC | newest]

Thread overview: 47+ messages
2017-04-10 23:21 Race to power off harming SATA SSDs Henrique de Moraes Holschuh
2017-04-10 23:34 ` Bart Van Assche
2017-04-10 23:50   ` Henrique de Moraes Holschuh
2017-04-10 23:49 ` sd: wait for slow devices on shutdown path Henrique de Moraes Holschuh
2017-04-10 23:52 ` Race to power off harming SATA SSDs Tejun Heo
2017-04-10 23:57   ` James Bottomley
2017-04-11  2:02     ` Henrique de Moraes Holschuh
2017-04-11  1:26   ` Henrique de Moraes Holschuh
2017-04-11 10:37   ` Martin Steigerwald
2017-04-11 14:31     ` Henrique de Moraes Holschuh
2017-04-12  7:47       ` Martin Steigerwald
2017-05-07 20:40   ` Pavel Machek
2017-05-07 20:40     ` Pavel Machek
2017-05-08  7:21     ` David Woodhouse
2017-05-08  7:38       ` Ricard Wanderlof
2017-05-08  7:38         ` Ricard Wanderlof
2017-05-08  8:13         ` David Woodhouse
2017-05-08  8:13           ` David Woodhouse
2017-05-08  8:36           ` Ricard Wanderlof
2017-05-08  8:36             ` Ricard Wanderlof
2017-05-08  8:54             ` David Woodhouse
2017-05-08  8:54               ` David Woodhouse
2017-05-08  9:06               ` Ricard Wanderlof
2017-05-08  9:06                 ` Ricard Wanderlof
2017-05-08  9:09                 ` Hans de Goede
2017-05-08 10:13                   ` David Woodhouse
2017-05-08 11:50                     ` Boris Brezillon
2017-05-08 15:40                       ` David Woodhouse
2017-05-08 21:36                         ` Pavel Machek
2017-05-08 16:43                       ` Pavel Machek
2017-05-08 17:43                         ` Tejun Heo
2017-05-08 18:56                           ` Pavel Machek
2017-05-08 19:04                             ` Tejun Heo
2017-05-08 18:29                         ` Atlant Schmidt
2017-05-08 10:12                 ` David Woodhouse
2017-05-08 10:12                   ` David Woodhouse
2017-05-08 10:12                   ` David Woodhouse
2017-05-08  9:28       ` Pavel Machek
2017-05-08  9:34         ` David Woodhouse
2017-05-08 10:49           ` Pavel Machek
2017-05-08 11:06             ` Richard Weinberger
2017-05-08 11:48               ` Boris Brezillon
2017-05-08 11:55                 ` Boris Brezillon
2017-05-08 12:13                 ` Richard Weinberger
2017-05-08 11:09             ` David Woodhouse
2017-05-08 12:32               ` Pavel Machek
2017-05-08  9:51         ` Richard Weinberger
