* Western Digital Red's SMR and btrfs? @ 2020-05-02 5:24 Rich Rauenzahn 2020-05-04 23:08 ` Zygo Blaxell 2020-05-05 9:30 ` Dan van der Ster 0 siblings, 2 replies; 17+ messages in thread From: Rich Rauenzahn @ 2020-05-02 5:24 UTC (permalink / raw) To: Btrfs BTRFS Has there been any btrfs discussion off the list (I haven't seen any SMR/shingled mails in the archive since 2016 or so) regarding the news that WD's Red drives are actually SMR? I'm using these reds in my btrfs setup (which is 2-3 drives in RAID1 configuration, not parity based RAIDs.) I had noticed that adding a new drive took a long time, but other than that, I haven't had any issues that I know of. They've lasted quite a long time, although I think my NAS would be considered more of a cold storage/archival. Photos and Videos. Is btrfs raid1 going to be the sweet spot on these drives? If I start swapping these out -- is there a recommended low power drive? I'd buy the Red Pros, but they spin faster and produce more heat and noise. Rich ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Western Digital Red's SMR and btrfs? 2020-05-02 5:24 Western Digital Red's SMR and btrfs? Rich Rauenzahn @ 2020-05-04 23:08 ` Zygo Blaxell 2020-05-04 23:24 ` Chris Murphy 2020-05-09 21:00 ` Phil Karn 2020-05-05 9:30 ` Dan van der Ster 1 sibling, 2 replies; 17+ messages in thread From: Zygo Blaxell @ 2020-05-04 23:08 UTC (permalink / raw) To: Rich Rauenzahn; +Cc: Btrfs BTRFS On Fri, May 01, 2020 at 10:24:57PM -0700, Rich Rauenzahn wrote: > Has there been any btrfs discussion off the list (I haven't seen any > SMR/shingled mails in the archive since 2016 or so) regarding the news > that WD's Red drives are actually SMR? > > I'm using these reds in my btrfs setup (which is 2-3 drives in RAID1 > configuration, not parity based RAIDs.) I had noticed that adding a > new drive took a long time, but other than that, I haven't had any > issues that I know of. They've lasted quite a long time, although I > think my NAS would be considered more of a cold storage/archival. > Photos and Videos. The basic problem with DM-SMR drives is that they cache writes in CMR zones for a while, but they need significant idle periods (no read or write commands from the host) to move the data back to SMR zones, or they run out of CMR space and throttle writes from the host. Some kinds of RAID rebuild don't provide sufficient idle time to complete the CMR-to-SMR writeback, so the host gets throttled. If the drive slows down too much, the kernel times out on IO, and reports that the drive has failed. The RAID system running on top thinks the drive is faulty (a false positive failure) and the fun begins (hope you don't have two of these drives in the same array!). NAS CMR drives in redundant RAID arrays should be configured to fail fast--complete iops within 7 seconds. This is the smartctl scterc command that you may have seen in various RAID admin guides. The default IO timeout for the Linux kernel is 30 seconds, so NAS drives work fine. 
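For readers following along, the SCT ERC tuning Zygo describes can be sketched with smartctl and sysfs. This is a hedged example of the usual admin commands; /dev/sdX stands in for the real device:

```shell
# Query the drive's current SCT ERC (error recovery control) settings:
smartctl -l scterc /dev/sdX
# Cap read and write error recovery at 7.0 seconds
# (the values are in tenths of a second):
smartctl -l scterc,70,70 /dev/sdX
# The kernel's per-command IO timeout for this disk (default: 30 seconds):
cat /sys/block/sdX/device/timeout
```

On many drives the scterc setting is volatile and has to be reapplied after each power cycle, typically from a boot script or udev rule.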
Desktop CMR drives (which are not good in RAID arrays but people use them anyway) have firmware hardcoded to retry reads for about 120 seconds before giving up. To use desktop CMR drives in RAID arrays, you must increase the Linux kernel IO timeout to 180 seconds or risk false positive rejections (i.e. multi-disk failures) from RAID arrays. Note that both desktop and NAS CMR drives have similar expected write latencies in non-error cases, both on the order of a few milliseconds. We only see the multi-minute latencies in error cases, e.g. if there's a bad sector or similar drive failure, and those are rare events. Now here is the problem: DM-SMR drives have write latencies of up to 300 seconds in *non-error* cases. They are up to 10,000 times slower than CMR in the worst case. Assume that there's an additional 120 seconds for error recovery on top of the non-error write latency, and add an extra 50% for safety, and the SMR drive should be configured with a 630 second timeout (10.5 minutes) in the Linux kernel to avoid false positive failures. Similarly, if you're serving network clients, their timeouts have to be increased as well, usually to many times larger values, because there are going to be full host IO queues to these very slow drives. It means a desktop client user on your file server could be presented with an hourglass for an hour when they click on a folder, or, more likely, just an error. > Is btrfs raid1 going to be the sweet spot on these drives? It depends. You can probably use it normally and run scrubs on it. Replace probably works OK if the drive firmware is sane. You may have problems with remove, resize and balance operations, especially on metadata block groups. Definitely set the timeouts to nice high values (I'd use 15 minutes just to be sure) and be prepared to ride out some epic delays. The array may be theoretically working, but unusable in practice. > If I start swapping these out -- is there a recommended low power > drive? 
I'd buy the Red Pros, but they spin faster and produce more > heat and noise. I've tested several low-power drives but can't recommend any of them for NAS use (no SCTERC, short warranty, firmware bugs, and/or high failure rate). Red Pro, Gold, Ultrastar, and Ironwolf have been OK so far, but as you point out, they're all 7200 rpm class drives. > Rich ^ permalink raw reply [flat|nested] 17+ messages in thread
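Zygo's 630-second figure above follows from simple arithmetic. A minimal sketch of the calculation and of how one might apply it (sdX is a placeholder device):

```shell
# 300 s worst-case non-error SMR write latency, plus 120 s for error
# recovery, plus a 50% safety margin:
smr_latency=300
error_recovery=120
timeout=$(( (smr_latency + error_recovery) * 3 / 2 ))
echo "$timeout"   # 630 seconds, i.e. 10.5 minutes
# To apply it, one would write the value to sysfs (placeholder device):
#   echo "$timeout" > /sys/block/sdX/device/timeout
```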
* Re: Western Digital Red's SMR and btrfs? 2020-05-04 23:08 ` Zygo Blaxell @ 2020-05-04 23:24 ` Chris Murphy 2020-05-05 2:00 ` Zygo Blaxell 2020-05-09 21:00 ` Phil Karn 1 sibling, 1 reply; 17+ messages in thread From: Chris Murphy @ 2020-05-04 23:24 UTC (permalink / raw) To: Zygo Blaxell; +Cc: Rich Rauenzahn, Btrfs BTRFS On Mon, May 4, 2020 at 5:09 PM Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote: > Some kinds of RAID rebuild don't provide sufficient idle time to complete > the CMR-to-SMR writeback, so the host gets throttled. If the drive slows > down too much, the kernel times out on IO, and reports that the drive > has failed. The RAID system running on top thinks the drive is faulty > (a false positive failure) and the fun begins (hope you don't have two > of these drives in the same array!). This came up on the linux-raid@ list today also, and someone posted this smartmontools bug. https://www.smartmontools.org/ticket/1313 It notes in part this error, which is not a timeout. [20809.396284] blk_update_request: I/O error, dev sdd, sector 3484334688 op 0x1:(WRITE) flags 0x700 phys_seg 2 prio class 0 An explicit write error means a defective drive. But even slowdowns resulting in link resets indicate a defect. The marketing of DM-SMR says it's suitable without having to apply local customizations accounting for the drive being SMR. > Desktop CMR drives (which are not good in RAID arrays but people use > them anyway) have firmware hardcoded to retry reads for about 120 > seconds before giving up. To use desktop CMR drives in RAID arrays, > you must increase the Linux kernel IO timeout to 180 seconds or risk > false positive rejections (i.e. multi-disk failures) from RAID arrays. I think we're way past the time when all desktop-oriented Linux installations should have overridden the kernel default, using 180 second timeouts instead. Even in the single disk case. The system is better off failing safe to slow response, rather than link resets and subsequent face plant. 
But these days most every laptop and desktop's sysroot is on an SSD of some kind. > Now here is the problem: DM-SMR drives have write latencies of up to 300 > seconds in *non-error* cases. They are up to 10,000 times slower than > CMR in the worst case. Assume that there's an additional 120 seconds > for error recovery on top of the non-error write latency, and add the > extra 50% for safety, and the SMR drive should be configured with a > 630 second timeout (10.5 minutes) in the Linux kernel to avoid false > positive failures. Incredible. -- Chris Murphy ^ permalink raw reply [flat|nested] 17+ messages in thread
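Chris's suggestion of a 180-second default could be implemented with a udev rule along these lines. This is a hedged config sketch: the file path and the choice to key on rotational media are assumptions for illustration, not an existing distro convention:

```shell
# Illustrative rule file; raises the command timeout to 180 s for all
# rotational SCSI/SATA disks at device add time:
cat > /etc/udev/rules.d/60-io-timeout.rules <<'EOF'
ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{device/timeout}="180"
EOF
```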
* Re: Western Digital Red's SMR and btrfs? 2020-05-04 23:24 ` Chris Murphy @ 2020-05-05 2:00 ` Zygo Blaxell 2020-05-05 2:22 ` Chris Murphy 0 siblings, 1 reply; 17+ messages in thread From: Zygo Blaxell @ 2020-05-05 2:00 UTC (permalink / raw) To: Chris Murphy; +Cc: Rich Rauenzahn, Btrfs BTRFS On Mon, May 04, 2020 at 05:24:11PM -0600, Chris Murphy wrote: > On Mon, May 4, 2020 at 5:09 PM Zygo Blaxell > <ce3g8jdj@umail.furryterror.org> wrote: > > > Some kinds of RAID rebuild don't provide sufficient idle time to complete > > the CMR-to-SMR writeback, so the host gets throttled. If the drive slows > > down too much, the kernel times out on IO, and reports that the drive > > has failed. The RAID system running on top thinks the drive is faulty > > (a false positive failure) and the fun begins (hope you don't have two > > of these drives in the same array!). > > This came up on linux-raid@ list today also, and someone posted this > smartmontools bug. > https://www.smartmontools.org/ticket/1313 > > It notes in part this error, which is not a time out. Uhhh...wow. If that's not an individual broken disk, but the programmed behavior of the firmware, that would mean the drive model is not usable at all. > [20809.396284] blk_update_request: I/O error, dev sdd, sector > 3484334688 op 0x1:(WRITE) flags 0x700 phys_seg 2 prio class 0 > > An explicit write error is a defective drive. But even slow downs > resulting in link resets is defective. The marketing of DM-SMR says > it's suitable without having to apply local customizations accounting > for the drive being SMR. > > > > Desktop CMR drives (which are not good in RAID arrays but people use > > them anyway) have firmware hardcoded to retry reads for about 120 > > seconds before giving up. To use desktop CMR drives in RAID arrays, > > you must increase the Linux kernel IO timeout to 180 seconds or risk > > false positive rejections (i.e. multi-disk failures) from RAID arrays. 
> > I think we're way past the time when all desktop oriented Linux > installations should have overridden the kernel default, using 180 > second timeouts instead. Even in the single disk case. The system is > better off failing safe to slow response, rather than link resets and > subsequent face plant. But these days most every laptop and desktop's > sysroot is on an SSD of some kind. > > > > Now here is the problem: DM-SMR drives have write latencies of up to 300 > > seconds in *non-error* cases. They are up to 10,000 times slower than > > CMR in the worst case. Assume that there's an additional 120 seconds > > for error recovery on top of the non-error write latency, and add the > > extra 50% for safety, and the SMR drive should be configured with a > > 630 second timeout (10.5 minutes) in the Linux kernel to avoid false > > positive failures. > > Incredible. > > > -- > Chris Murphy ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Western Digital Red's SMR and btrfs? 2020-05-05 2:00 ` Zygo Blaxell @ 2020-05-05 2:22 ` Chris Murphy 2020-05-05 3:26 ` Zygo Blaxell 0 siblings, 1 reply; 17+ messages in thread From: Chris Murphy @ 2020-05-05 2:22 UTC (permalink / raw) To: Zygo Blaxell; +Cc: Chris Murphy, Rich Rauenzahn, Btrfs BTRFS On Mon, May 4, 2020 at 8:00 PM Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote: > > On Mon, May 04, 2020 at 05:24:11PM -0600, Chris Murphy wrote: > > On Mon, May 4, 2020 at 5:09 PM Zygo Blaxell > > <ce3g8jdj@umail.furryterror.org> wrote: > > > > > Some kinds of RAID rebuild don't provide sufficient idle time to complete > > > the CMR-to-SMR writeback, so the host gets throttled. If the drive slows > > > down too much, the kernel times out on IO, and reports that the drive > > > has failed. The RAID system running on top thinks the drive is faulty > > > (a false positive failure) and the fun begins (hope you don't have two > > > of these drives in the same array!). > > > > This came up on linux-raid@ list today also, and someone posted this > > smartmontools bug. > > https://www.smartmontools.org/ticket/1313 > > > > It notes in part this error, which is not a time out. > > Uhhh...wow. If that's not an individual broken disk, but the programmed > behavior of the firmware, that would mean the drive model is not usable > at all. I haven't gone looking for a spec, but "sector ID not found" makes me think of a trim/remap related failure, which, yeah it's gotta be a firmware bug. This can't be "works as designed". -- Chris Murphy ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Western Digital Red's SMR and btrfs? 2020-05-05 2:22 ` Chris Murphy @ 2020-05-05 3:26 ` Zygo Blaxell 0 siblings, 0 replies; 17+ messages in thread From: Zygo Blaxell @ 2020-05-05 3:26 UTC (permalink / raw) To: Chris Murphy; +Cc: Rich Rauenzahn, Btrfs BTRFS On Mon, May 04, 2020 at 08:22:24PM -0600, Chris Murphy wrote: > On Mon, May 4, 2020 at 8:00 PM Zygo Blaxell > <ce3g8jdj@umail.furryterror.org> wrote: > > > > On Mon, May 04, 2020 at 05:24:11PM -0600, Chris Murphy wrote: > > > On Mon, May 4, 2020 at 5:09 PM Zygo Blaxell > > > <ce3g8jdj@umail.furryterror.org> wrote: > > > > > > > Some kinds of RAID rebuild don't provide sufficient idle time to complete > > > > the CMR-to-SMR writeback, so the host gets throttled. If the drive slows > > > > down too much, the kernel times out on IO, and reports that the drive > > > > has failed. The RAID system running on top thinks the drive is faulty > > > > (a false positive failure) and the fun begins (hope you don't have two > > > > of these drives in the same array!). > > > > > > This came up on linux-raid@ list today also, and someone posted this > > > smartmontools bug. > > > https://www.smartmontools.org/ticket/1313 > > > > > > It notes in part this error, which is not a time out. > > > > Uhhh...wow. If that's not an individual broken disk, but the programmed > > behavior of the firmware, that would mean the drive model is not usable > > at all. > > I haven't gone looking for a spec, but "sector ID not found" makes me > think of a trim/remap related failure, which, yeah it's gotta be a > firmware bug. This can't be "works as designed". Usually IDNF is "I was looking for a sector, but I couldn't figure out where on the disk it was," i.e. head positioning error or damage to the metadata on a cylinder or sector header. Though there are maybe some that return IDNF instead of ABRT when they get a request for a sector outside of the drive's legal LBA range. 
The "didn't find a sector" variant usually indicates non-trivial damage (impact on platter vs. bit fade), but could also be due to too much vibration and a short read error timeout. Also a small fraction of bit errors will land on sector headers and produce IDNF without other damage. > > -- > Chris Murphy ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Western Digital Red's SMR and btrfs? 2020-05-04 23:08 ` Zygo Blaxell 2020-05-04 23:24 ` Chris Murphy @ 2020-05-09 21:00 ` Phil Karn 2020-05-09 21:46 ` Steven Fosdick 2020-05-11 4:06 ` Damien Le Moal 1 sibling, 2 replies; 17+ messages in thread From: Phil Karn @ 2020-05-09 21:00 UTC (permalink / raw) To: Zygo Blaxell, Rich Rauenzahn; +Cc: Btrfs BTRFS On 5/4/20 16:08, Zygo Blaxell wrote: > The basic problem with DM-SMR drives is that they cache writes in CMR > zones for a while, but they need significant idle periods (no read or > write commands from the host) to move the data back to SMR zones, or > they run out of CMR space and throttle writes from the host. Does anybody know where the drive keeps all that metadata? On rotating disk, or in flash somewhere? Just wondering what happens when power suddenly fails during these rewrite operations. > > Some kinds of RAID rebuild don't provide sufficient idle time to complete > the CMR-to-SMR writeback, so the host gets throttled. If the drive slows My understanding is that large sequential writes can go directly to the SMR areas, which is an argument for a more conventional RAID array. How hard does btrfs try to do large sequential writes? ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Western Digital Red's SMR and btrfs? 2020-05-09 21:00 ` Phil Karn @ 2020-05-09 21:46 ` Steven Fosdick 2020-05-11 5:06 ` Zygo Blaxell 2020-05-11 4:06 ` Damien Le Moal 1 sibling, 1 reply; 17+ messages in thread From: Steven Fosdick @ 2020-05-09 21:46 UTC (permalink / raw) To: Phil Karn, Btrfs BTRFS; +Cc: Zygo Blaxell, Rich Rauenzahn On Sat, 9 May 2020 at 22:02, Phil Karn <karn@ka9q.net> wrote: > My understanding is that large sequential writes can go directly to the > SMR areas, which is an argument for a more conventional RAID array. How > hard does btrfs try to do large sequential writes? Ok, so I had not heard of SMR before it was mentioned here and immediately read the links. It did occur to me that large sequential writes could, in theory, go straight to SMR zones, but it also occurred to me that it isn't completely straightforward. 1. If the drive firmware is not declaring that the drive uses SMR, and therefore the host doesn't send a specific command to begin a sequential write, how many sectors in a row does the drive wait to receive before concluding this is a large sequential operation? 2. What happens if the sequential operation does not begin at the start of an SMR zone? The only thing that would make it easy is if the drive had a battery-backed RAM cache at least as big as an SMR zone, ideally about twice as big, so it could accumulate the data for one zone and then start writing that while accepting data for the next. As I have no idea how big these zones are I have no idea how feasible that is. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Western Digital Red's SMR and btrfs? 2020-05-09 21:46 ` Steven Fosdick @ 2020-05-11 5:06 ` Zygo Blaxell 2020-05-11 20:35 ` Phil Karn 0 siblings, 1 reply; 17+ messages in thread From: Zygo Blaxell @ 2020-05-11 5:06 UTC (permalink / raw) To: Steven Fosdick; +Cc: Phil Karn, Btrfs BTRFS, Rich Rauenzahn On Sat, May 09, 2020 at 10:46:27PM +0100, Steven Fosdick wrote: > On Sat, 9 May 2020 at 22:02, Phil Karn <karn@ka9q.net> wrote: > > My understanding is that large sequential writes can go directly to the > > SMR areas, which is an argument for a more conventional RAID array. How > > hard does btrfs try to do large sequential writes? > > Ok, so I had not heard of SMR before it was mentioned here and > immediately read the links. It did occur to me that large sequential > writes could, in theory, go straight to SMR zones, but it also occurred > to me that it isn't completely straightforward. This is a nice overview: https://www.snia.org/sites/default/files/Dunn-Feldman_SNIA_Tutorial_Shingled_Magnetic_Recording-r7_Final.pdf > 1. If the drive firmware is not declaring that the drive uses SMR, and > therefore the host doesn't send a specific command to begin a > sequential write, how many sectors in a row does the drive wait to > receive before concluding this is a large sequential operation? > > 2. What happens if the sequential operation does not begin at the start > of an SMR zone? In the event of a non-append write, an RMW operation is performed on the entire zone. The exceptions would be data extents that are explicitly deleted (TRIM command), and it looks like a sequential overwrite at the _end_ of a zone (i.e. starting in the middle on a sector boundary and writing sequentially to the end of the zone without writing elsewhere in between) can be executed without having to rewrite the entire zone (zones can be appended at any time, the head erases data forward of the write location). I don't know if any drives implement that. 
In order to get conventional flush semantics to work, the drive has to write everything twice: once to a log zone (which is either CMR or SMR), then copy from there back to the SMR zone to which it belongs ("cleaning"). There is necessarily a seek in between, as the log zone and SMR data zones cannot coexist within a track. DM-SMR drives usually have smaller zones than HA-SMR drives, but we can only guess (or run a timing attack to find out). This would allow the drive to track a few zones in the typical 256MB RAM cache size for the submarined SMR drives. This source reports zone sizes of 15-40MB for DM-SMR and 256MB for HA-SMR, with cache CMR sizes not exceeding 0.2% of capacity: https://www.usenix.org/system/files/conference/hotstorage16/hotstorage16_wu.pdf btrfs should do OK as long as you use space_cache=v2--space cache v1 would force the drive into slow RMW operations every 30 seconds, as it would be forcing the drive to complete cleaning operations in multiple zones. Nobody should be using space_cache=v1 any more, and this is just yet another reason. Superblock updates would keep 2 zones updated all the time, effectively reducing the number of usable open zones in the drive permanently by 2. Longer commit intervals may help. > The only thing that would make it easy is if the drive had a > battery-backed RAM cache at least as big as an SMR zone, ideally about > twice as big, so it could accumulate the data for one zone and then > start writing that while accepting data for the next. As I have no > idea how big these zones are I have no idea how feasible that is. Batteries and flash are expensive, so you can assume the drive has neither unless they are prominently featured in the marketing docs to explain the costs that are passed on to the customer. All of the metadata and caches are stored on the spinning platters. ^ permalink raw reply [flat|nested] 17+ messages in thread
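A hedged sketch of the space_cache=v2 switch Zygo recommends (device and mountpoint are placeholders); the free-space tree is created on the first mount with the option and persists afterwards:

```shell
# Enable the free-space tree (v2 cache) when mounting:
mount -o space_cache=v2 /dev/sdX /mnt
# Confirm which space cache implementation the filesystem is using:
grep space_cache /proc/mounts
```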
* Re: Western Digital Red's SMR and btrfs? 2020-05-11 5:06 ` Zygo Blaxell @ 2020-05-11 20:35 ` Phil Karn 2020-05-11 21:13 ` Alberto Bursi 0 siblings, 1 reply; 17+ messages in thread From: Phil Karn @ 2020-05-11 20:35 UTC (permalink / raw) To: Zygo Blaxell, Steven Fosdick; +Cc: Btrfs BTRFS, Rich Rauenzahn On 5/10/20 22:06, Zygo Blaxell wrote: > > The exceptions would be data extents that are explicitly deleted > (TRIM command), and it looks like a sequential overwrite at the _end_ > of a zone (i.e. starting in the middle on a sector boundary and writing Do these SMR drives generally support TRIM? What other spinning drives support it? I was surprised to recently discover a spinning drive that supports TRIM. It's a HGST Z5K1 2.5" 5400 RPM 1TB OEM drive I pulled from an ASUS laptop to replace with a SSD. TRIM support is verified by hdparm and by running the fstrim command. There's nothing in the literature about this being a hybrid drive. Doesn't seem likely, but could it be shingled? Phil ^ permalink raw reply [flat|nested] 17+ messages in thread
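The TRIM checks Phil mentions (hdparm and fstrim) look roughly like this; sdX and /mnt are placeholders for the actual device and mountpoint:

```shell
# Does the drive advertise TRIM in its ATA identify data?
hdparm -I /dev/sdX | grep -i trim
# What discard granularity does the kernel report for the device?
lsblk --discard /dev/sdX
# Ask a mounted filesystem to discard its unused blocks (verbose):
fstrim -v /mnt
```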
* Re: Western Digital Red's SMR and btrfs? 2020-05-11 20:35 ` Phil Karn @ 2020-05-11 21:13 ` Alberto Bursi 2020-05-11 22:42 ` Phil Karn 0 siblings, 1 reply; 17+ messages in thread From: Alberto Bursi @ 2020-05-11 21:13 UTC (permalink / raw) To: Phil Karn, Zygo Blaxell, Steven Fosdick; +Cc: Btrfs BTRFS, Rich Rauenzahn On 11/05/20 22:35, Phil Karn wrote: > On 5/10/20 22:06, Zygo Blaxell wrote: >> >> The exceptions would be data extents that are explicitly deleted >> (TRIM command), and it looks like a sequential overwrite at the _end_ >> of a zone (i.e. starting in the middle on a sector boundary and writing > > > Do these SMR drives generally support TRIM? What other spinning drives > support it? > > I was surprised to recently discover a spinning drive that supports > TRIM. It's a HGST Z5K1 2.5" 5400 RPM 1TB OEM drive I pulled from an ASUS > laptop to replace with a SSD. TRIM support is verified by hdparm and by > running the fstrim command. There's nothing in the literature about this > being a hybrid drive. > > Doesn't seem likely, but could it be shingled? > > Phil > > > Afaik drive-managed SMR drives (i.e. all drives that disguise themselves as non-SMR) are acting like a SSD, writing in empty "zones" first and then running garbage collection later to consolidate the data. TRIM is used for the same reasons SSDs also use it. This is the way they are working around the performance penalty of SMR, as it's the same limitation NAND flash also has (you can write only a full cell at a time). See here for example https://support-en.wd.com/app/answers/detail/a_id/25185 -Alberto ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Western Digital Red's SMR and btrfs? 2020-05-11 21:13 ` Alberto Bursi @ 2020-05-11 22:42 ` Phil Karn 2020-05-12 0:12 ` Zygo Blaxell 2020-05-12 2:17 ` Alberto Bursi 0 siblings, 2 replies; 17+ messages in thread From: Phil Karn @ 2020-05-11 22:42 UTC (permalink / raw) To: Alberto Bursi, Zygo Blaxell, Steven Fosdick; +Cc: Btrfs BTRFS, Rich Rauenzahn On 5/11/20 14:13, Alberto Bursi wrote: > > Afaik drive-managed SMR drives (i.e. all drives that disguise > themselves as non-SMR) are acting like a SSD, writing in empty "zones" > first and then running garbage collection later to consolidate the > data. TRIM is used for the same reasons SSDs also use it. > This is the way they are working around the performance penalty of > SMR, as it's the same limitation NAND flash also has (you can write > only a full cell at a time). > > See here for example > https://support-en.wd.com/app/answers/detail/a_id/25185 > > -Alberto Right, I understand that (some?) SMR drives support TRIM for the same reason that SSDs do (well, a very similar reason). My question was whether there'd be any reason for a NON-SMR drive to support TRIM, or if TRIM support necessarily implies shingled recording. I didn't know shingled recording was in any general purpose 2.5" spinning laptop drives like mine, and there's no mention of SMR in the HGST manual. Phil ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Western Digital Red's SMR and btrfs? 2020-05-11 22:42 ` Phil Karn @ 2020-05-12 0:12 ` Zygo Blaxell 2020-05-12 2:17 ` Alberto Bursi 1 sibling, 0 replies; 17+ messages in thread From: Zygo Blaxell @ 2020-05-12 0:12 UTC (permalink / raw) To: Phil Karn; +Cc: Alberto Bursi, Steven Fosdick, Btrfs BTRFS, Rich Rauenzahn On Mon, May 11, 2020 at 03:42:44PM -0700, Phil Karn wrote: > On 5/11/20 14:13, Alberto Bursi wrote: > > > > Afaik drive-managed SMR drives (i.e. all drives that disguise > > themselves as non-SMR) are acting like a SSD, writing in empty "zones" > > first and then running garbage collection later to consolidate the > > data. TRIM is used for the same reasons SSDs also use it. > > This is the way they are working around the performance penalty of > > SMR, as it's the same limitation NAND flash also has (you can write > > only a full cell at a time). > > > > See here for example > > https://support-en.wd.com/app/answers/detail/a_id/25185 > > > > -Alberto > > Right, I understand that (some?) SMR drives support TRIM for the same > reason that SSDs do (well, a very similar reason). My question was > whether there'd be any reason for a NON-SMR drive to support TRIM, or if > TRIM support necessarily implies shingled recording. I didn't know > shingled recording was in any general purpose 2.5" spinning laptop > drives like mine, and there's no mention of SMR in the HGST manual. According to https://hddscan.com/blog/2020/hdd-wd-smr.html 2.5" SMR drives appeared in 2016. > Phil > > > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Western Digital Red's SMR and btrfs? 2020-05-11 22:42 ` Phil Karn 2020-05-12 0:12 ` Zygo Blaxell @ 2020-05-12 2:17 ` Alberto Bursi 1 sibling, 0 replies; 17+ messages in thread From: Alberto Bursi @ 2020-05-12 2:17 UTC (permalink / raw) To: Phil Karn, Zygo Blaxell, Steven Fosdick; +Cc: Btrfs BTRFS, Rich Rauenzahn On 12/05/20 00:42, Phil Karn wrote: > On 5/11/20 14:13, Alberto Bursi wrote: >> >> Afaik drive-managed SMR drives (i.e. all drives that disguise >> themselves as non-SMR) are acting like a SSD, writing in empty "zones" >> first and then running garbage collection later to consolidate the >> data. TRIM is used for the same reasons SSDs also use it. >> This is the way they are working around the performance penalty of >> SMR, as it's the same limitation NAND flash also has (you can write >> only a full cell at a time). >> >> See here for example >> https://support-en.wd.com/app/answers/detail/a_id/25185 >> >> -Alberto > > Right, I understand that (some?) SMR drives support TRIM for the same > reason that SSDs do (well, a very similar reason). My question was > whether there'd be any reason for a NON-SMR drive to support TRIM, or if > TRIM support necessarily implies shingled recording. I didn't know > shingled recording was in any general purpose 2.5" spinning laptop > drives like mine, and there's no mention of SMR in the HGST manual. > > Phil > > > Afaik there is no good reason for a normal hard drive to have TRIM support, as normal drives don't need to care about garbage collection, they can just overwrite freely. I would say that TRIM implies either SMR or flash cache of some kind. Lack of TRIM isn't a guarantee though, some SMR drives (identified by their performance when benchmarked) were not reporting TRIM support. 
It seems all three HDD manufacturers (WD, Toshiba and Seagate) just lied to everyone about the use of SMR in their drives for years and this was only discovered when this went into NAS-oriented drives that (unsurprisingly) blew up RAID arrays. I would not trust the manual or official info from the pre-debacle period that much. -Alberto ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Western Digital Red's SMR and btrfs? 2020-05-09 21:00 ` Phil Karn 2020-05-09 21:46 ` Steven Fosdick @ 2020-05-11 4:06 ` Damien Le Moal 1 sibling, 0 replies; 17+ messages in thread From: Damien Le Moal @ 2020-05-11 4:06 UTC (permalink / raw) To: Phil Karn, Zygo Blaxell, Rich Rauenzahn; +Cc: Btrfs BTRFS On 2020/05/10 6:01, Phil Karn wrote: > On 5/4/20 16:08, Zygo Blaxell wrote: >> The basic problem with DM-SMR drives is that they cache writes in CMR >> zones for a while, but they need significant idle periods (no read or >> write commands from the host) to move the data back to SMR zones, or >> they run out of CMR space and throttle writes from the host. > > Does anybody know where the drive keeps all that metadata? On rotating > disk, or in flash somewhere? This is drive implementation dependent. That is not something defined by standards. Differences will exist between vendors and models. > Just wondering what happens when power suddenly fails during these > rewrite operations. The drive FW saves whatever information is needed, consistent with the drive write cache flush state. Exactly like an SSD would do too. >> Some kinds of RAID rebuild don't provide sufficient idle time to complete >> the CMR-to-SMR writeback, so the host gets throttled. If the drive slows > > My understanding is that large sequential writes can go directly to the > SMR areas, which is an argument for a more conventional RAID array. How > hard does btrfs try to do large sequential writes? "large" is not a sufficient parameter to conclude/guess on any specific behavior. Alignment (start LBA) of the write command, sectors already written or not, drive write cache on or off, drive write cache full or not, drive implementation differences, etc. There are a lot more parameters influencing how the drive will process writes. There is no simple statement that can be made about how these drive work internally. This is completely vendor & model dependent, exactly like SSDs FTL implementations. 
-- Damien Le Moal Western Digital Research ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Western Digital Red's SMR and btrfs? 2020-05-02 5:24 Western Digital Red's SMR and btrfs? Rich Rauenzahn 2020-05-04 23:08 ` Zygo Blaxell @ 2020-05-05 9:30 ` Dan van der Ster 1 sibling, 0 replies; 17+ messages in thread From: Dan van der Ster @ 2020-05-05 9:30 UTC (permalink / raw) To: Rich Rauenzahn; +Cc: Btrfs BTRFS FWIW, I've written a little tool to help incrementally, slowly, balance an array with SMR drives: https://gist.github.com/dvanders/c15d490ae380bcf4220a437b18a32f04 It balances 2 data chunks per iteration, and if that took longer than some threshold (e.g. 60s), it injects an increasingly larger sleep between subsequent iterations. I'm just getting started with DM-SMR drives in my home array (3x 8TB Seagates), but this script seems to be much more usable than a one-shot full balance, which became ultra slow and made little progress after the CMR cache filled up. And my 2 cents: the RAID1 is quite usable for my media storage use-case; outside of balancing I don't notice any slowness (and in fact it may be quicker than usual, due to the CMR cache, which sequentializes up to several gigabytes of random writes) Cheers, Dan On Sat, May 2, 2020 at 7:25 AM Rich Rauenzahn <rrauenza@gmail.com> wrote: > > Has there been any btrfs discussion off the list (I haven't seen any > SMR/shingled mails in the archive since 2016 or so) regarding the news > that WD's Red drives are actually SMR? > > I'm using these reds in my btrfs setup (which is 2-3 drives in RAID1 > configuration, not parity based RAIDs.) I had noticed that adding a > new drive took a long time, but other than that, I haven't had any > issues that I know of. They've lasted quite a long time, although I > think my NAS would be considered more of a cold storage/archival. > Photos and Videos. > > Is btrfs raid1 going to be the sweet spot on these drives? > > If I start swapping these out -- is there a recommended low power > drive? 
I'd buy the red pro's, but they spin faster and produce more > heat and noise. > > Rich
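[Editor's note: Dan's actual gist is at the URL above and is not reproduced here. The loop he describes -- two data chunks per pass, with an increasingly larger sleep once passes exceed a threshold -- can be sketched roughly as follows. This is an illustrative reimplementation, not his code; it assumes the `btrfs balance start -dlimit=2` limit filter and the "had to relocate N out of M chunks" completion message printed by the btrfs CLI.]

```python
import re
import subprocess
import time

def balance_two_chunks(mountpoint):
    """Run one balance pass limited to 2 data chunks; return the number of
    chunks actually relocated (0 means the balance has nothing left to do)."""
    out = subprocess.run(
        ["btrfs", "balance", "start", "-dlimit=2", mountpoint],
        capture_output=True, text=True).stdout
    m = re.search(r"had to relocate (\d+)", out)
    return int(m.group(1)) if m else 0

def next_pause(elapsed, threshold=60.0, current=0.0, step=30.0):
    """Grow the sleep between passes when the last pass was slow (the drive's
    CMR cache is likely full); drop it to zero once passes are fast again."""
    return current + step if elapsed > threshold else 0.0

def incremental_balance(mountpoint, threshold=60.0, one_pass=balance_two_chunks):
    """Balance 2 chunks at a time, pausing between passes so a DM-SMR drive
    gets the idle time it needs to drain its CMR cache."""
    pause = 0.0
    while True:
        t0 = time.monotonic()
        if one_pass(mountpoint) == 0:
            break
        pause = next_pause(time.monotonic() - t0, threshold, pause)
        time.sleep(pause)
```

The key design point is the idle injection: a DM-SMR drive only does CMR-to-SMR writeback when the host stops sending commands, so slow passes are treated as a signal to back off rather than push harder.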
* Re: Western Digital Red's SMR and btrfs? @ 2020-05-02 12:26 Torstein Eide 0 siblings, 0 replies; 17+ messages in thread From: Torstein Eide @ 2020-05-02 12:26 UTC (permalink / raw) To: rrauenza; +Cc: linux-btrfs

I recommend reading this paper: https://www.toshiba.co.jp/tech/review/en/01_02/pdf/a08.pdf https://www.servethehome.com/surreptitiously-swapping-smr-into-hard-drives-must-end/

I think it is very bad that WD did not declare that these disks are SMR. The existing SMR support code expects the drive to inform the host about its status: host-managed SMR and host-aware SMR. The type the WD Red uses is drive-managed SMR (DM-SMR), so our machines are unaware of its SMR behavior.

As far as I understand the problems described by others, it is not SMR itself that is the problem. The problem is that users expect to be able to do random writes, as with the old WD Red drives. A RAID rebuild by itself is a sequential operation, but when paired with other writes the combined workload becomes random writes. So my understanding is that SMR may be okay for setups with a cache, or setups with long idle periods, so the system can rebuild without concurrent user writes. I am still looking for documentation on the breaking point: how much concurrent write traffic is acceptable during a rebuild before the Linux kernel/filesystem marks the drive as bad.

According to this test: https://www.youtube.com/watch?v=JDYEG4X_LCg the SMR WD Red has better sequential read/write during a RAID build onto an empty array. According to this test: https://www.youtube.com/watch?v=0PhvXPVH-qE the SMR WD Red has slightly slower sequential read/write during a RAID build onto an empty array.

So what can btrfs do? I think this primarily needs to be handled at the kernel level, not the filesystem level. One solution could be to temporarily throttle the write speed to the individual disk, keeping writes below the drive's sustained rate.
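[Editor's note: no such per-disk SMR-aware throttle exists in the kernel today, but a rough userspace approximation of the idea is possible with the cgroup v2 io.max controller, which caps per-device bandwidth for a process tree. A hedged sketch; the device major:minor pair, bandwidth, and cgroup name are illustrative, and the `cgroup_root` parameter exists only so the function can be exercised against a fake directory tree.]

```python
import os

def throttle_writes(pid, device="8:16", wbps=40 * 1024 * 1024,
                    cgroup_root="/sys/fs/cgroup"):
    """Cap write bandwidth to one block device for a process tree using the
    cgroup v2 io.max controller.

    `device` is the MAJOR:MINOR pair of the disk (see /proc/partitions);
    `wbps` is the write limit in bytes per second.
    """
    cg = os.path.join(cgroup_root, "smr-throttle")
    os.makedirs(cg, exist_ok=True)
    # io.max takes lines of the form "MAJ:MIN wbps=<bytes/s>".
    with open(os.path.join(cg, "io.max"), "w") as f:
        f.write(f"{device} wbps={wbps}\n")
    # Move the writing process (e.g. the rebuild) into the cgroup.
    with open(os.path.join(cg, "cgroup.procs"), "w") as f:
        f.write(str(pid))
```

This is blunt compared to a kernel-side fix, since a fixed cap throttles even when the drive's CMR cache is empty, but it illustrates "keep writes below the drive's sustained rate" with mechanisms that exist now.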
Another solution could be to pause the rebuild for some duration when write throughput falls below a certain level, to allow the disk to move data from its media-cache area over to the SMR area.

But primarily there needs to be a way to mark a DM-SMR disk with a "this is an SMR disk" flag, similar to host-managed and host-aware SMR, so the kernel and/or filesystem can do something with it. -- Torstein Eide Torsteine@gmail.com
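[Editor's note: the kernel already exposes a zone-model attribute per block device in sysfs (queue/zoned), but it only helps for declared SMR drives, which is exactly the gap Torstein describes. A small sketch to inspect it; the `sysfs` parameter exists only so the function can be tested against a fake directory tree.]

```python
from pathlib import Path

def zone_models(sysfs="/sys/block"):
    """Report the zone model the kernel sees for each disk.

    Declared SMR drives show up as "host-aware" or "host-managed", but
    both conventional drives and drive-managed SMR drives (like the
    affected WD Reds) report "none" -- the kernel cannot tell those
    two apart, so DM-SMR gets no special handling.
    """
    return {p.parts[-3]: p.read_text().strip()
            for p in sorted(Path(sysfs).glob("*/queue/zoned"))}
```

Running `zone_models()` on a system with one of the affected Reds would show it as "none", indistinguishable from a CMR drive, which is why detection would have to come from a new drive-supplied marker rather than the existing zone-model reporting.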
end of thread, other threads:[~2020-05-12 2:17 UTC | newest] Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2020-05-02 5:24 Western Digital Red's SMR and btrfs? Rich Rauenzahn 2020-05-04 23:08 ` Zygo Blaxell 2020-05-04 23:24 ` Chris Murphy 2020-05-05 2:00 ` Zygo Blaxell 2020-05-05 2:22 ` Chris Murphy 2020-05-05 3:26 ` Zygo Blaxell 2020-05-09 21:00 ` Phil Karn 2020-05-09 21:46 ` Steven Fosdick 2020-05-11 5:06 ` Zygo Blaxell 2020-05-11 20:35 ` Phil Karn 2020-05-11 21:13 ` Alberto Bursi 2020-05-11 22:42 ` Phil Karn 2020-05-12 0:12 ` Zygo Blaxell 2020-05-12 2:17 ` Alberto Bursi 2020-05-11 4:06 ` Damien Le Moal 2020-05-05 9:30 ` Dan van der Ster 2020-05-02 12:26 Torstein Eide