* Western Digital Red's SMR and btrfs?
From: Rich Rauenzahn @ 2020-05-02  5:24 UTC (permalink / raw)
  To: Btrfs BTRFS

Has there been any btrfs discussion off the list (I haven't seen any
SMR/shingled mails in the archive since 2016 or so) regarding the news
that WD's Red drives are actually SMR?

I'm using these Reds in my btrfs setup (which is 2-3 drives in a RAID1
configuration, not parity-based RAID).  I had noticed that adding a
new drive took a long time, but other than that, I haven't had any
issues that I know of.  They've lasted quite a long time, although I
think my NAS would be considered more of a cold storage/archival
workload: photos and videos.

Is btrfs raid1 going to be the sweet spot on these drives?

If I start swapping these out -- is there a recommended low-power
drive?  I'd buy the Red Pros, but they spin faster and produce more
heat and noise.

Rich


* Re: Western Digital Red's SMR and btrfs?
From: Zygo Blaxell @ 2020-05-04 23:08 UTC (permalink / raw)
  To: Rich Rauenzahn; +Cc: Btrfs BTRFS

On Fri, May 01, 2020 at 10:24:57PM -0700, Rich Rauenzahn wrote:
> Has there been any btrfs discussion off the list (I haven't seen any
> SMR/shingled mails in the archive since 2016 or so) regarding the news
> that WD's Red drives are actually SMR?
> 
> I'm using these Reds in my btrfs setup (which is 2-3 drives in a RAID1
> configuration, not parity-based RAID).  I had noticed that adding a
> new drive took a long time, but other than that, I haven't had any
> issues that I know of.  They've lasted quite a long time, although I
> think my NAS would be considered more of a cold storage/archival
> workload: photos and videos.

The basic problem with DM-SMR drives is that they cache writes in CMR
zones for a while, but they need significant idle periods (no read or
write commands from the host) to move the data back to SMR zones, or
they run out of CMR space and throttle writes from the host.

Some kinds of RAID rebuild don't provide sufficient idle time to complete
the CMR-to-SMR writeback, so the host gets throttled.  If the drive slows
down too much, the kernel times out on IO, and reports that the drive
has failed.  The RAID system running on top thinks the drive is faulty
(a false positive failure) and the fun begins (hope you don't have two
of these drives in the same array!).

NAS CMR drives in redundant RAID arrays should be configured to fail
fast--complete IOs within 7 seconds.  This is the smartctl scterc command
that you may have seen in various RAID admin guides.  The default IO
timeout for the Linux kernel is 30 seconds, so NAS drives work fine.
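
For reference, the usual incantation looks something like this (sdX is
a placeholder, and the drive has to actually support SCT ERC):

	# show the current SCT Error Recovery Control settings
	smartctl -l scterc /dev/sdX

	# cap read and write error recovery at 7.0 seconds (units of 100 ms)
	smartctl -l scterc,70,70 /dev/sdX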

Desktop CMR drives (which are not good in RAID arrays but people use
them anyway) have firmware hardcoded to retry reads for about 120
seconds before giving up.  To use desktop CMR drives in RAID arrays,
you must increase the Linux kernel IO timeout to 180 seconds or risk
false positive rejections (i.e. multi-disk failures) from RAID arrays.
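
In practice that means something like this at boot (sdX is a placeholder;
the setting is per device and not persistent, so it has to be reapplied
after every reboot, e.g. from a udev rule or an init script):

	# default is 30 seconds
	cat /sys/block/sdX/device/timeout
	# raise it for drives that can't be told to fail fast
	echo 180 > /sys/block/sdX/device/timeout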

Note that both desktop and NAS CMR drives have similar expected write
latencies in non-error cases, both on the order of a few milliseconds.
We only see the multi-minute latencies in error cases, e.g. if there's
a bad sector or similar drive failure, and those are rare events.

Now here is the problem:  DM-SMR drives have write latencies of up to 300
seconds in *non-error* cases.  They are up to 10,000 times slower than
CMR in the worst case.  Assume that there's an additional 120 seconds
for error recovery on top of the non-error write latency, and add the
extra 50% for safety, and the SMR drive should be configured with a
630 second timeout (10.5 minutes) in the Linux kernel to avoid false
positive failures.

Similarly, if you're serving network clients, their timeouts have to be
increased as well, usually to many times larger values, because the host
IO queues to these very slow drives are going to be full.  It means a
desktop client user on your file server could be presented with an
hourglass for an hour when they click on a folder, or, more likely,
just an error.

> Is btrfs raid1 going to be the sweet spot on these drives?

It depends.  You can probably use it normally and run scrubs on it.
Replace probably works OK if the drive firmware is sane.  You may have
problems with remove, resize and balance operations especially on metadata
block groups.  Definitely set the timeouts to nice high values (I'd use
15 minutes just to be sure) and be prepared to ride out some epic delays.
The array may be theoretically working, but unusable in practice.

> If I start swapping these out -- is there a recommended low-power
> drive?  I'd buy the Red Pros, but they spin faster and produce more
> heat and noise.

I've tested several low-power drives but can't recommend any of them
for NAS use (no SCTERC, short warranty, firmware bugs, and/or high
failure rate).  Red Pro, Gold, Ultrastar, and Ironwolf have been OK so
far, but as you point out, they're all 7200 rpm class drives.

> Rich


* Re: Western Digital Red's SMR and btrfs?
From: Chris Murphy @ 2020-05-04 23:24 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Rich Rauenzahn, Btrfs BTRFS

On Mon, May 4, 2020 at 5:09 PM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:

> Some kinds of RAID rebuild don't provide sufficient idle time to complete
> the CMR-to-SMR writeback, so the host gets throttled.  If the drive slows
> down too much, the kernel times out on IO, and reports that the drive
> has failed.  The RAID system running on top thinks the drive is faulty
> (a false positive failure) and the fun begins (hope you don't have two
> of these drives in the same array!).

This came up on linux-raid@ list today also, and someone posted this
smartmontools bug.
https://www.smartmontools.org/ticket/1313

It notes in part this error, which is not a time out.

[20809.396284] blk_update_request: I/O error, dev sdd, sector
3484334688 op 0x1:(WRITE) flags 0x700 phys_seg 2 prio class 0

An explicit write error means a defective drive. But even slowdowns
resulting in link resets are a defect. The marketing of DM-SMR says
it's suitable without having to apply local customizations to account
for the drive being SMR.


> Desktop CMR drives (which are not good in RAID arrays but people use
> them anyway) have firmware hardcoded to retry reads for about 120
> seconds before giving up.  To use desktop CMR drives in RAID arrays,
> you must increase the Linux kernel IO timeout to 180 seconds or risk
> false positive rejections (i.e. multi-disk failures) from RAID arrays.

I think we're way past the time when all desktop-oriented Linux
installations should have overridden the kernel default, using
180-second timeouts instead. Even in the single-disk case. The system
is better off failing safe to a slow response, rather than link resets
and a subsequent face plant. But these days most every laptop and
desktop's sysroot is on an SSD of some kind.


> Now here is the problem:  DM-SMR drives have write latencies of up to 300
> seconds in *non-error* cases.  They are up to 10,000 times slower than
> CMR in the worst case.  Assume that there's an additional 120 seconds
> for error recovery on top of the non-error write latency, and add the
> extra 50% for safety, and the SMR drive should be configured with a
> 630 second timeout (10.5 minutes) in the Linux kernel to avoid false
> positive failures.

Incredible.


-- 
Chris Murphy


* Re: Western Digital Red's SMR and btrfs?
From: Zygo Blaxell @ 2020-05-05  2:00 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Rich Rauenzahn, Btrfs BTRFS

On Mon, May 04, 2020 at 05:24:11PM -0600, Chris Murphy wrote:
> On Mon, May 4, 2020 at 5:09 PM Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> wrote:
> 
> > Some kinds of RAID rebuild don't provide sufficient idle time to complete
> > the CMR-to-SMR writeback, so the host gets throttled.  If the drive slows
> > down too much, the kernel times out on IO, and reports that the drive
> > has failed.  The RAID system running on top thinks the drive is faulty
> > (a false positive failure) and the fun begins (hope you don't have two
> > of these drives in the same array!).
> 
> This came up on linux-raid@ list today also, and someone posted this
> smartmontools bug.
> https://www.smartmontools.org/ticket/1313
> 
> It notes in part this error, which is not a time out.

Uhhh...wow.  If that's not an individual broken disk, but the programmed
behavior of the firmware, that would mean the drive model is not usable
at all.

> [20809.396284] blk_update_request: I/O error, dev sdd, sector
> 3484334688 op 0x1:(WRITE) flags 0x700 phys_seg 2 prio class 0
> 
> An explicit write error means a defective drive. But even slowdowns
> resulting in link resets are a defect. The marketing of DM-SMR says
> it's suitable without having to apply local customizations to account
> for the drive being SMR.
> 
> 
> > Desktop CMR drives (which are not good in RAID arrays but people use
> > them anyway) have firmware hardcoded to retry reads for about 120
> > seconds before giving up.  To use desktop CMR drives in RAID arrays,
> > you must increase the Linux kernel IO timeout to 180 seconds or risk
> > false positive rejections (i.e. multi-disk failures) from RAID arrays.
> 
> I think we're way past the time when all desktop-oriented Linux
> installations should have overridden the kernel default, using
> 180-second timeouts instead. Even in the single-disk case. The system
> is better off failing safe to a slow response, rather than link resets
> and a subsequent face plant. But these days most every laptop and
> desktop's sysroot is on an SSD of some kind.
> 
> 
> > Now here is the problem:  DM-SMR drives have write latencies of up to 300
> > seconds in *non-error* cases.  They are up to 10,000 times slower than
> > CMR in the worst case.  Assume that there's an additional 120 seconds
> > for error recovery on top of the non-error write latency, and add the
> > extra 50% for safety, and the SMR drive should be configured with a
> > 630 second timeout (10.5 minutes) in the Linux kernel to avoid false
> > positive failures.
> 
> Incredible.
> 
> 
> -- 
> Chris Murphy


* Re: Western Digital Red's SMR and btrfs?
From: Chris Murphy @ 2020-05-05  2:22 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Chris Murphy, Rich Rauenzahn, Btrfs BTRFS

On Mon, May 4, 2020 at 8:00 PM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> On Mon, May 04, 2020 at 05:24:11PM -0600, Chris Murphy wrote:
> > On Mon, May 4, 2020 at 5:09 PM Zygo Blaxell
> > <ce3g8jdj@umail.furryterror.org> wrote:
> >
> > > Some kinds of RAID rebuild don't provide sufficient idle time to complete
> > > the CMR-to-SMR writeback, so the host gets throttled.  If the drive slows
> > > down too much, the kernel times out on IO, and reports that the drive
> > > has failed.  The RAID system running on top thinks the drive is faulty
> > > (a false positive failure) and the fun begins (hope you don't have two
> > > of these drives in the same array!).
> >
> > This came up on linux-raid@ list today also, and someone posted this
> > smartmontools bug.
> > https://www.smartmontools.org/ticket/1313
> >
> > It notes in part this error, which is not a time out.
>
> Uhhh...wow.  If that's not an individual broken disk, but the programmed
> behavior of the firmware, that would mean the drive model is not usable
> at all.

I haven't gone looking for a spec, but "sector ID not found" makes me
think of a trim/remap related failure, which, yeah it's gotta be a
firmware bug. This can't be "works as designed".


-- 
Chris Murphy


* Re: Western Digital Red's SMR and btrfs?
From: Zygo Blaxell @ 2020-05-05  3:26 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Rich Rauenzahn, Btrfs BTRFS

On Mon, May 04, 2020 at 08:22:24PM -0600, Chris Murphy wrote:
> On Mon, May 4, 2020 at 8:00 PM Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> wrote:
> >
> > On Mon, May 04, 2020 at 05:24:11PM -0600, Chris Murphy wrote:
> > > On Mon, May 4, 2020 at 5:09 PM Zygo Blaxell
> > > <ce3g8jdj@umail.furryterror.org> wrote:
> > >
> > > > Some kinds of RAID rebuild don't provide sufficient idle time to complete
> > > > the CMR-to-SMR writeback, so the host gets throttled.  If the drive slows
> > > > down too much, the kernel times out on IO, and reports that the drive
> > > > has failed.  The RAID system running on top thinks the drive is faulty
> > > > (a false positive failure) and the fun begins (hope you don't have two
> > > > of these drives in the same array!).
> > >
> > > This came up on linux-raid@ list today also, and someone posted this
> > > smartmontools bug.
> > > https://www.smartmontools.org/ticket/1313
> > >
> > > It notes in part this error, which is not a time out.
> >
> > Uhhh...wow.  If that's not an individual broken disk, but the programmed
> > behavior of the firmware, that would mean the drive model is not usable
> > at all.
> 
> I haven't gone looking for a spec, but "sector ID not found" makes me
> think of a trim/remap related failure, which, yeah it's gotta be a
> firmware bug. This can't be "works as designed".

Usually IDNF is "I was looking for a sector, but I couldn't figure out
where on the disk it was," i.e. head positioning error or damage to the
metadata on a cylinder or sector header.  Though there may be some drives
that return IDNF instead of ABRT when they get a request for a sector
outside of the drive's legal LBA range.

The "didn't find a sector" variant usually indicates non-trivial damage
(impact on platter vs. bit fade), but could also be due to too much
vibration and a short read error timeout.  Also a small fraction of
bit errors will land on sector headers and produce IDNF without
other damage.
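
If the kernel log doesn't make it obvious, the drive's own error log
usually records which case it was--something along these lines (sdX is
a placeholder):

	# recent drive-side errors (IDNF, UNC, ABRT, ...) with the LBAs involved
	smartctl -l error /dev/sdX
	# or the extended report
	smartctl -x /dev/sdX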

> 
> -- 
> Chris Murphy


* Re: Western Digital Red's SMR and btrfs?
From: Dan van der Ster @ 2020-05-05  9:30 UTC (permalink / raw)
  To: Rich Rauenzahn; +Cc: Btrfs BTRFS

FWIW, I've written a little tool to help incrementally, slowly,
balance an array with SMR drives:

   https://gist.github.com/dvanders/c15d490ae380bcf4220a437b18a32f04

It balances 2 data chunks per iteration, and if an iteration takes longer
than some threshold (e.g. 60s), it injects an increasingly larger sleep
between subsequent iterations (a rough sketch of the idea is below).
I'm just getting started with DM-SMR drives in my home array (3x 8TB
Seagates), but this script seems to be much more usable than a
one-shot full balance, which became ultra slow and made little
progress after the CMR cache filled up.
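
In case the gist moves, the core of it is roughly this (a simplified
sketch, not the actual script; the mount point, iteration count and
thresholds are just illustrative):

	#!/bin/bash
	# Balance two data chunks per iteration and back off with a growing
	# sleep whenever an iteration runs long (a hint that the drive's CMR
	# cache has filled up and it needs idle time to clean).
	MNT=${1:-/mnt/array}   # btrfs mount point (placeholder)
	ITER=${2:-100}         # how many iterations to run (simplification)
	THRESH=60              # "slow iteration" threshold, in seconds
	PAUSE=0
	for (( i = 0; i < ITER; i++ )); do
	    start=$SECONDS
	    btrfs balance start -dlimit=2 "$MNT" || break
	    if (( SECONDS - start > THRESH )); then
	        PAUSE=$(( PAUSE + 60 ))   # drive is struggling, wait longer
	    else
	        PAUSE=0
	    fi
	    sleep "$PAUSE"
	done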

And my 2 cents: the RAID1 is quite usable for my media storage
use-case; outside of balancing I don't notice any slowness (and in
fact it may be quicker than usual, due to the CMR cache, which
sequentializes up to several gigabytes of random writes).

Cheers, Dan

On Sat, May 2, 2020 at 7:25 AM Rich Rauenzahn <rrauenza@gmail.com> wrote:
>
> Has there been any btrfs discussion off the list (I haven't seen any
> SMR/shingled mails in the archive since 2016 or so) regarding the news
> that WD's Red drives are actually SMR?
>
> I'm using these Reds in my btrfs setup (which is 2-3 drives in a RAID1
> configuration, not parity-based RAID).  I had noticed that adding a
> new drive took a long time, but other than that, I haven't had any
> issues that I know of.  They've lasted quite a long time, although I
> think my NAS would be considered more of a cold storage/archival
> workload: photos and videos.
>
> Is btrfs raid1 going to be the sweet spot on these drives?
>
> If I start swapping these out -- is there a recommended low-power
> drive?  I'd buy the Red Pros, but they spin faster and produce more
> heat and noise.
>
> Rich


* Re: Western Digital Red's SMR and btrfs?
From: Phil Karn @ 2020-05-09 21:00 UTC (permalink / raw)
  To: Zygo Blaxell, Rich Rauenzahn; +Cc: Btrfs BTRFS

On 5/4/20 16:08, Zygo Blaxell wrote:
> The basic problem with DM-SMR drives is that they cache writes in CMR
> zones for a while, but they need significant idle periods (no read or
> write commands from the host) to move the data back to SMR zones, or
> they run out of CMR space and throttle writes from the host.

Does anybody know where the drive keeps all that metadata? On rotating
disk, or in flash somewhere?

Just wondering what happens when power suddenly fails during these
rewrite operations.

>
> Some kinds of RAID rebuild don't provide sufficient idle time to complete
> the CMR-to-SMR writeback, so the host gets throttled.  If the drive slows

My understanding is that large sequential writes can go directly to the
SMR areas, which is an argument for a more conventional RAID array. How
hard does btrfs try to do large sequential writes?







* Re: Western Digital Red's SMR and btrfs?
From: Steven Fosdick @ 2020-05-09 21:46 UTC (permalink / raw)
  To: Phil Karn, Btrfs BTRFS; +Cc: Zygo Blaxell, Rich Rauenzahn

On Sat, 9 May 2020 at 22:02, Phil Karn <karn@ka9q.net> wrote:
> My understanding is that large sequential writes can go directly to the
> SMR areas, which is an argument for a more conventional RAID array. How
> hard does btrfs try to do large sequential writes?

Ok, so I had not heard of SMR before it was mentioned here and
immediately read the links.  It did occur to me that large sequential
writes could, in theory, go straight to SMR zones, but it also occurred
to me that it isn't completely straightforward.

1. If the drive firmware is not declaring that the drive uses SMR, and
therefore the host doesn't send a specific command to begin a
sequential write, how many sectors in a row does the drive wait to
receive before concluding this is a large sequential operation?

2. What happens if the sequential operation does not begin at the start
of an SMR zone?

The only thing that would make it easy is if the drive had a
battery-backed RAM cache at least as big as an SMR zone, ideally about
twice as big, so it could accumulate the data for one zone and then
start writing that while accepting data for the next.  As I have no
idea how big these zones are I have no idea how feasible that is.


* Re: Western Digital Red's SMR and btrfs?
From: Damien Le Moal @ 2020-05-11  4:06 UTC (permalink / raw)
  To: Phil Karn, Zygo Blaxell, Rich Rauenzahn; +Cc: Btrfs BTRFS

On 2020/05/10 6:01, Phil Karn wrote:
> On 5/4/20 16:08, Zygo Blaxell wrote:
>> The basic problem with DM-SMR drives is that they cache writes in CMR
>> zones for a while, but they need significant idle periods (no read or
>> write commands from the host) to move the data back to SMR zones, or
>> they run out of CMR space and throttle writes from the host.
> 
> Does anybody know where the drive keeps all that metadata? On rotating
> disk, or in flash somewhere?

This is drive implementation dependent. That is not something defined by
standards. Differences will exist between vendors and models.

> Just wondering what happens when power suddenly fails during these
> rewrite operations.

The drive FW saves whatever information is needed, consistent with the drive
write cache flush state. Exactly like an SSD would do too.


>> Some kinds of RAID rebuild don't provide sufficient idle time to complete
>> the CMR-to-SMR writeback, so the host gets throttled.  If the drive slows
> 
> My understanding is that large sequential writes can go directly to the
> SMR areas, which is an argument for a more conventional RAID array. How
> hard does btrfs try to do large sequential writes?

"large" is not a sufficient parameter to conclude/guess on any specific
behavior. Alignment (start LBA) of the write command, sectors already written or
not, drive write cache on or off, drive write cache full or not, drive
implementation differences, etc. There are a lot more parameters influencing how
the drive will process writes. There is no simple statement that can be made
about how these drive work internally. This is completely vendor & model
dependent, exactly like SSDs FTL implementations.


-- 
Damien Le Moal
Western Digital Research


* Re: Western Digital Red's SMR and btrfs?
From: Zygo Blaxell @ 2020-05-11  5:06 UTC (permalink / raw)
  To: Steven Fosdick; +Cc: Phil Karn, Btrfs BTRFS, Rich Rauenzahn

On Sat, May 09, 2020 at 10:46:27PM +0100, Steven Fosdick wrote:
> On Sat, 9 May 2020 at 22:02, Phil Karn <karn@ka9q.net> wrote:
> > My understanding is that large sequential writes can go directly to the
> > SMR areas, which is an argument for a more conventional RAID array. How
> > hard does btrfs try to do large sequential writes?
> 
> Ok, so I had not heard of SMR before it was mentioned here and
> immediately read the links.  It did occur to me that large sequential
> writes could, in theory, go straight to SMR zones, but it also occurred
> to me that it isn't completely straightforward.

This is a nice overview:

	https://www.snia.org/sites/default/files/Dunn-Feldman_SNIA_Tutorial_Shingled_Magnetic_Recording-r7_Final.pdf

> 1. If the drive firmware is not declaring that the drive uses SMR, and
> therefore the host doesn't send a specific command to begin a
> sequential write, how many sectors in a row does the drive wait to
> receive before concluding this is a large sequential operation?
> 
> 2. What happens if the sequential operation does not begin at the start
> of an SMR zone?

In the event of a non-append write, a RMW operation is performed on the
entire zone.

The exceptions would be data extents that are explicitly deleted
(TRIM command), and it looks like a sequential overwrite at the _end_
of a zone (i.e. starting in the middle on a sector boundary and writing
sequentially to the end of the zone without writing elsewhere in between)
can be executed without having to rewrite the entire zone (zones can be
appended at any time, the head erases data forward of the write location).
I don't know if any drives implement that.

In order to get conventional flush semantics to work, the drive has
to write everything twice:  once to a log zone (which is either CMR
or SMR), then copy from there back to the SMR zone to which it belongs
("cleaning").  There is necessarily a seek in between, as the log zone
and SMR data zones cannot coexist within a track.

DM-SMR drives usually have smaller zones than HA-SMR drives, but we can
only guess (or run a timing attack to find out).  This would allow the
drive to track a few zones in the typical 256MB RAM cache size for the
submarined SMR drives.

This source reports zone sizes of 15-40MB for DM-SMR and 256MB for HA-SMR,
with cache CMR sizes not exceeding 0.2% of capacity:

	https://www.usenix.org/system/files/conference/hotstorage16/hotstorage16_wu.pdf

btrfs should do OK as long as you use space_cache=v2--space cache v1
would force the drive into slow RMW operations every 30 seconds, as it
would be forcing the drive to complete cleaning operations in multiple
zones.  Nobody should be using space_cache=v1 any more, and this is
just yet another reason.

Superblock updates would keep 2 zones updated all the time, effectively
reducing the number of usable open zones in the drive permanently by 2.
Longer commit intervals may help.
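
Concretely, something like this (the mount point and the 120-second
commit interval are just an illustration):

	# free space tree instead of the v1 space cache, longer commit interval
	mount -o space_cache=v2,commit=120 /dev/sdX /mnt/array
	# or the /etc/fstab equivalent:
	# UUID=...  /mnt/array  btrfs  space_cache=v2,commit=120  0  0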

> The only thing that would make it easy is if the drive had a
> battery-backed RAM cache at least as big as an SMR zone, ideally about
> twice as big, so it could accumulate the data for one zone and then
> start writing that while accepting data for the next.  As I have no
> idea how big these zones are I have no idea how feasible that is.

Batteries and flash are expensive, so you can assume the drive has neither
unless they are prominently featured in the marketing docs to explain
the costs that are passed on to the customer.  All of the metadata and
caches are stored on the spinning platters.


* Re: Western Digital Red's SMR and btrfs?
From: Phil Karn @ 2020-05-11 20:35 UTC (permalink / raw)
  To: Zygo Blaxell, Steven Fosdick; +Cc: Btrfs BTRFS, Rich Rauenzahn

On 5/10/20 22:06, Zygo Blaxell wrote:
>
> The exceptions would be data extents that are explicitly deleted
> (TRIM command), and it looks like a sequential overwrite at the _end_
> of a zone (i.e. starting in the middle on a sector boundary and writing


Do these SMR drives generally support TRIM? What other spinning drives
support it?

I was surprised to recently discover a spinning drive that supports
TRIM. It's a HGST Z5K1 2.5" 5400 RPM 1TB OEM drive I pulled from an ASUS
laptop to replace with a SSD. TRIM support is verified by hdparm and by
running the fstrim command. There's nothing in the literature about this
being a hybrid drive.
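
For anyone who wants to check their own drives, the checks were roughly
these (sdX and the mount point are placeholders):

	# "Data Set Management TRIM supported" shows up here if the drive has it
	hdparm -I /dev/sdX | grep -i trim
	# non-zero DISC-GRAN/DISC-MAX means the kernel will pass discards through
	lsblk --discard /dev/sdX
	# and an actual discard pass on a mounted filesystem
	fstrim -v /mnt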

Doesn't seem likely, but could it be shingled?

Phil





* Re: Western Digital Red's SMR and btrfs?
From: Alberto Bursi @ 2020-05-11 21:13 UTC (permalink / raw)
  To: Phil Karn, Zygo Blaxell, Steven Fosdick; +Cc: Btrfs BTRFS, Rich Rauenzahn



On 11/05/20 22:35, Phil Karn wrote:
> On 5/10/20 22:06, Zygo Blaxell wrote:
>>
>> The exceptions would be data extents that are explicitly deleted
>> (TRIM command), and it looks like a sequential overwrite at the _end_
>> of a zone (i.e. starting in the middle on a sector boundary and writing
> 
> 
> Do these SMR drives generally support TRIM? What other spinning drives
> support it?
> 
> I was surprised to recently discover a spinning drive that supports
> TRIM. It's a HGST Z5K1 2.5" 5400 RPM 1TB OEM drive I pulled from an ASUS
> laptop to replace with a SSD. TRIM support is verified by hdparm and by
> running the fstrim command. There's nothing in the literature about this
> being a hybrid drive.
> 
> Doesn't seem likely, but could it be shingled?
> 
> Phil
> 
> 
> 

Afaik drive-managed SMR drives (i.e. all drives that disguise themselves
as non-SMR) act like an SSD, writing to empty "zones" first and
then running garbage collection later to consolidate the data. TRIM is
used for the same reasons SSDs also use it.
This is how they work around the performance penalty of SMR,
as it's the same limitation NAND flash also has (you can only write a
full cell at a time).

See here for example https://support-en.wd.com/app/answers/detail/a_id/25185

-Alberto


* Re: Western Digital Red's SMR and btrfs?
From: Phil Karn @ 2020-05-11 22:42 UTC (permalink / raw)
  To: Alberto Bursi, Zygo Blaxell, Steven Fosdick; +Cc: Btrfs BTRFS, Rich Rauenzahn

On 5/11/20 14:13, Alberto Bursi wrote:
>
> Afaik drive-managed SMR drives (i.e. all drives that disguise
> themselves as non-SMR) act like an SSD, writing to empty "zones"
> first and then running garbage collection later to consolidate the
> data. TRIM is used for the same reasons SSDs also use it.
> This is how they work around the performance penalty of
> SMR, as it's the same limitation NAND flash also has (you can only
> write a full cell at a time).
>
> See here for example
> https://support-en.wd.com/app/answers/detail/a_id/25185
>
> -Alberto

Right, I understand that (some?) SMR drives support TRIM for the same
reason that SSDs do (well, a very similar reason). My question was
whether there'd be any reason for a NON-SMR drive to support TRIM, or if
TRIM support necessarily implies shingled recording. I didn't know
shingled recording was in any general purpose 2.5" spinning laptop
drives like mine, and there's no mention of SMR in the HGST manual.

Phil





* Re: Western Digital Red's SMR and btrfs?
From: Zygo Blaxell @ 2020-05-12  0:12 UTC (permalink / raw)
  To: Phil Karn; +Cc: Alberto Bursi, Steven Fosdick, Btrfs BTRFS, Rich Rauenzahn

On Mon, May 11, 2020 at 03:42:44PM -0700, Phil Karn wrote:
> On 5/11/20 14:13, Alberto Bursi wrote:
> >
> > Afaik drive-managed SMR drives (i.e. all drives that disguise
> > themselves as non-SMR) act like an SSD, writing to empty "zones"
> > first and then running garbage collection later to consolidate the
> > data. TRIM is used for the same reasons SSDs also use it.
> > This is how they work around the performance penalty of
> > SMR, as it's the same limitation NAND flash also has (you can only
> > write a full cell at a time).
> >
> > See here for example
> > https://support-en.wd.com/app/answers/detail/a_id/25185
> >
> > -Alberto
> 
> Right, I understand that (some?) SMR drives support TRIM for the same
> reason that SSDs do (well, a very similar reason). My question was
> whether there'd be any reason for a NON-SMR drive to support TRIM, or if
> TRIM support necessarily implies shingled recording. I didn't know
> shingled recording was in any general purpose 2.5" spinning laptop
> drives like mine, and there's no mention of SMR in the HGST manual.

According to

	https://hddscan.com/blog/2020/hdd-wd-smr.html

2.5" SMR drives appeared in 2016.

> Phil
> 
> 
> 


* Re: Western Digital Red's SMR and btrfs?
From: Alberto Bursi @ 2020-05-12  2:17 UTC (permalink / raw)
  To: Phil Karn, Zygo Blaxell, Steven Fosdick; +Cc: Btrfs BTRFS, Rich Rauenzahn



On 12/05/20 00:42, Phil Karn wrote:
> On 5/11/20 14:13, Alberto Bursi wrote:
>>
>> Afaik drive-managed SMR drives (i.e. all drives that disguise
>> themselves as non-SMR) act like an SSD, writing to empty "zones"
>> first and then running garbage collection later to consolidate the
>> data. TRIM is used for the same reasons SSDs also use it.
>> This is how they work around the performance penalty of
>> SMR, as it's the same limitation NAND flash also has (you can only
>> write a full cell at a time).
>>
>> See here for example
>> https://support-en.wd.com/app/answers/detail/a_id/25185
>>
>> -Alberto
> 
> Right, I understand that (some?) SMR drives support TRIM for the same
> reason that SSDs do (well, a very similar reason). My question was
> whether there'd be any reason for a NON-SMR drive to support TRIM, or if
> TRIM support necessarily implies shingled recording. I didn't know
> shingled recording was in any general purpose 2.5" spinning laptop
> drives like mine, and there's no mention of SMR in the HGST manual.
> 
> Phil
> 
> 
> 


Afaik there is no good reason for a normal hard drive to have TRIM
support, as normal drives don't need to care about garbage collection;
they can just overwrite freely.

I would say that TRIM implies either SMR or a flash cache of some kind.
Lack of TRIM isn't a guarantee though: some SMR drives (identified by
their performance when benchmarked) did not report TRIM support.

It seems all three HDD manufacturers (WD, Toshiba and Seagate) just lied
to everyone about the use of SMR in their drives for years, and this was
only discovered when it went into NAS-oriented drives that
(unsurprisingly) blew up RAID arrays.

I would not trust the manual or official info from the pre-debacle 
period that much.

-Alberto


* Re: Western Digital Red's SMR and btrfs?
From: Torstein Eide @ 2020-05-02 12:26 UTC (permalink / raw)
  To: rrauenza; +Cc: linux-btrfs

I recommend reading this paper and this article:
https://www.toshiba.co.jp/tech/review/en/01_02/pdf/a08.pdf
https://www.servethehome.com/surreptitiously-swapping-smr-into-hard-drives-must-end/

I think it is very bad that WD did not declare that these disks are SMR.

The SMR code that has been written expects the drive to inform the host
about its status: host-managed SMR and host-aware SMR.
The type WD Red uses is drive-managed SMR, and our machines are unaware
of its SMR usage.

As far as I understand the problem that has been described by others,
it is not the SMR itself that is the problem.
The problem is that a user expects to be able to do random writes as
normal, like on the old WD Red drives. But while rebuilding a RAID is
a sequential operation, when paired with other writes it becomes random
writes.

So my understanding is that maybe SMR can be okay for setups with a cache,
or setups with long idle periods, so the system is able to rebuild
without user writes. I am still looking for documentation to verify
where the break point is, i.e. how much other write traffic is acceptable
during a rebuild before the Linux kernel/FS will mark it as a bad
drive.

According to this test:
https://www.youtube.com/watch?v=JDYEG4X_LCg
the WD Red with SMR has better sequential read/write during a RAID
build of an empty array.

According to this test:
https://www.youtube.com/watch?v=0PhvXPVH-qE
the WD Red with SMR has slightly slower sequential read/write during a
RAID build of an empty array.

So what can btrfs do?
I think this is something that primarily needs to be handled at the
kernel level, not the filesystem level.
One solution could be to temporarily slow the write speed to that
particular disk, to keep writes below the drive's rated level.
Another solution could be to pause the rebuild if write throughput falls
below a level, for some duration, to allow the disk to move some of the
data in the media cache area over to the SMR area.
But primarily there needs to be a way to mark a DM-SMR disk with a
"this is an SMR disk" flag, similar to host-managed and host-aware SMR,
so the kernel and/or filesystem can do something with it.
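
For what it's worth, the kernel already exposes the zone model for drives
that declare one; the problem is that a DM-SMR drive that hides it will
just report "none" here (sdX is a placeholder):

	# "host-aware" or "host-managed" for drives that declare SMR,
	# "none" for conventional drives and for DM-SMR drives that hide it
	cat /sys/block/sdX/queue/zoned
	lsblk -o NAME,ZONED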

-- 
Torstein Eide
Torsteine@gmail.com

