* Status of SMR with BTRFS
@ 2016-07-15 18:29 Hendrik Friedel
  2016-07-15 22:15 ` Tomasz Kusmierz
  0 siblings, 1 reply; 24+ messages in thread
From: Hendrik Friedel @ 2016-07-15 18:29 UTC (permalink / raw)
  To: Btrfs BTRFS

Hello,

I have a 5TB Seagate drive that uses SMR.

I was wondering whether BTRFS is usable with this hard drive technology.
So first I searched the BTRFS wiki and found nothing. Then Google.

* I found this: https://bbs.archlinux.org/viewtopic.php?id=203696
But this turned out to be an issue not related to BTRFS.

* Then this:
http://www.snia.org/sites/default/files/SDC15_presentations/smr/HannesReinecke_Strategies_for_running_unmodified_FS_SMR.pdf
   "BTRFS operation matches SMR parameters very closely [...]

      High number of misaligned write accesses; points to an issue with
      btrfs itself"


* Then this: 
http://superuser.com/questions/962257/fastest-linux-filesystem-on-shingled-disks
The BTRFS performance seemed good.


* Finally this: http://www.spinics.net/lists/linux-btrfs/msg48072.html
"So you can get mixed results when trying to use the SMR devices but I'd
say it will mostly not work.
But, btrfs has all the fundamental features in place, we'd have to make
adjustments to follow the SMR constraints:
[...]
I have some notes at
https://github.com/kdave/drafts/blob/master/btrfs/smr-mode.txt"


So now I am wondering what the state is today. "We" (I am happy to do
that, but I am not sure about access rights) should also summarize this in
the wiki.
My use case, by the way, is backups. I am thinking of using some of the
interesting BTRFS features for this (send/receive, deduplication).

Greetings,
Hendrik


---
This e-mail has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus



* Re: Status of SMR with BTRFS
  2016-07-15 18:29 Status of SMR with BTRFS Hendrik Friedel
@ 2016-07-15 22:15 ` Tomasz Kusmierz
  2016-07-16 10:29   ` Hendrik Friedel
  0 siblings, 1 reply; 24+ messages in thread
From: Tomasz Kusmierz @ 2016-07-15 22:15 UTC (permalink / raw)
  To: Hendrik Friedel; +Cc: Btrfs BTRFS

Though I’m not a hardcore storage system professional:

What disk are you using? There are two types:
1. SMR managed by device firmware. BTRFS sees that as a normal block device … any problems you get are not related to BTRFS itself …
2. SMR managed by the host system. BTRFS still sees this as a block device … just emulated by the host system to look normal.
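
A minimal sketch of how one might check which of the two types a drive reports. It assumes a Linux kernel new enough to expose the `zoned` attribute under /sys/block/<dev>/queue/ (roughly 4.10 and later, so newer than this thread). The values "host-aware" and "host-managed" mean the drive exposes its zones to the host. The value "none" covers both conventional drives and drive-managed SMR, so it cannot rule out type 1. The names `zone_model`, `describe` and `sysfs_root` are made up for this illustration.

```python
from pathlib import Path


def zone_model(device: str, sysfs_root: str = "/sys/block") -> str:
    """Return the zone model the kernel reports for a block device.

    Returns "unknown" when the attribute cannot be read, for example
    on an older kernel or when the device does not exist.
    """
    attr = Path(sysfs_root) / device / "queue" / "zoned"
    try:
        return attr.read_text().strip()
    except OSError:
        return "unknown"


def describe(model: str) -> str:
    """Turn the raw attribute value into a short explanation."""
    if model in ("host-aware", "host-managed"):
        return "zoned device, zones visible to the host (" + model + ")"
    if model == "none":
        # Drive-managed SMR hides its zones, so it also lands here.
        return "conventional or drive-managed SMR (indistinguishable here)"
    return "no zone information available"
```

For example, `print(describe(zone_model("sdb")))` prints the explanation for /dev/sdb, where "sdb" is only a placeholder device name.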

With funky technologies like that, I would research how exactly data is stored in terms of “BAND” and experiment with setting the leaf & sector size to match a band, then create a btrfs on this device.
Run stress.sh on it for a couple of days.
If you get errors, set up a raid1 btrfs file system on two standard disks and
run stress.sh on it to see whether you get errors there too. This rules out the possibility that your system itself is generating the errors.

Then come back and we will see what’s going on :)



* Re: Status of SMR with BTRFS
  2016-07-15 22:15 ` Tomasz Kusmierz
@ 2016-07-16 10:29   ` Hendrik Friedel
  2016-07-17  3:09     ` Tomasz Kusmierz
  0 siblings, 1 reply; 24+ messages in thread
From: Hendrik Friedel @ 2016-07-16 10:29 UTC (permalink / raw)
  To: Tomasz Kusmierz; +Cc: Btrfs BTRFS

Hello Tomasz,

thanks for your reply.
> What disk are you using ?

It's a Seagate Expansion Desktop 5TB (USB3). It is probably a ST5000DM000.

> There are two types:
> 1. SMR managed by device firmware. BTRFS sees that as a normal block device … problems you get are not related to BTRFS it self …
That's for sure. But the way BTRFS uses/writes data could still cause
problems in conjunction with these devices, no?

> 2. SMR managed by host system, BTRFS still does see this as a block device … just emulated by host system to look normal.
I am not sure which of the two I am using. How can I find out?

> In case of funky technologies like that I would research how exactly data is stored in terms of “BAND” and experiment with setting leaf & sector size to match a band,
Sorry, but I have no idea where to start.

It seems to me that, although the drive is a pure consumer drive, SMR is a
'pro' feature and I should avoid it with BTRFS. I am just surprised that
there is no hint in the wiki in that regard.

Greetings,
Hendrik



* Re: Status of SMR with BTRFS
  2016-07-16 10:29   ` Hendrik Friedel
@ 2016-07-17  3:09     ` Tomasz Kusmierz
  2016-07-17  9:08       ` Hendrik Friedel
  0 siblings, 1 reply; 24+ messages in thread
From: Tomasz Kusmierz @ 2016-07-17  3:09 UTC (permalink / raw)
  To: Hendrik Friedel; +Cc: Btrfs BTRFS

Please don't take this as nitpicking or anything:

> It's a Seagate Expansion Desktop 5TB (USB3). It is probably a ST5000DM000.

This is a TGMR disk, not an SMR disk:
http://www.seagate.com/www-content/product-content/desktop-hdd-fam/en-us/docs/100743772a.pdf
So it still conforms to the standard recording strategy ...


>> There are two types:
>> 1. SMR managed by device firmware. BTRFS sees that as a normal block
>> device … problems you get are not related to BTRFS it self …
>
> That for sure. But the way BTRFS uses/writes data could cause problems in
> conjunction with these devices still, no?
I'm sorry, but I'm confused now: what "magical way of using/writing
data" do you actually mean? AFAIK btrfs sees the disk as a block device
... for example, devices have a widely varying sector size, e.g. 512
bytes + some CRC + maybe ECC ... btrfs does not access this data, the
drive does ... to be honest, drives tend to lie to you continuously!
They use this ECC to magically bail out of a bad sector, give you the data
and silently switch to a spare sector ...

Now think slowly and thoroughly about it: who would write (and
maintain) code for a file system that accesses device-specific data for X
vendors, each having Y model-specific
configurations/caveats/firmwares/protocols ... S.M.A.R.T. emerged to
give a unified interface to device statistics ... this is how bad it
was ...


FYI,
in 2009 I was creating a Linux-based product that booted from a
flash-based FS ... some people required that after 20 years the data would
boot up unchanged ... my answer was: "HOW?" Yes, I could ensure the
integrity of certain files on readout by checking md5, but I could not
guarantee the integrity of the whole FS ... especially at a time when JFFS2
was the only option on flash memories (yeah, it had to be RW as well @#$*@#$)
... so btrfs comes along and takes away most of the problems ... if you
care about your data, do some research ... if not ... maybe ReiserFS
is for you :)


* Re: Status of SMR with BTRFS
  2016-07-17  3:09     ` Tomasz Kusmierz
@ 2016-07-17  9:08       ` Hendrik Friedel
  2016-07-17 20:48         ` Henk Slager
                           ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Hendrik Friedel @ 2016-07-17  9:08 UTC (permalink / raw)
  To: Tomasz Kusmierz; +Cc: Btrfs BTRFS, dave

Hi Tomasz,

@Dave I have added you to the conversation, as I refer to your notes 
(https://github.com/kdave/drafts/blob/master/btrfs/smr-mode.txt)

thanks for your reply!

>> It's a Seagate Expansion Desktop 5TB (USB3). It is probably a ST5000DM000.
>
> this is TGMR not SMR disk:
> http://www.seagate.com/www-content/product-content/desktop-hdd-fam/en-us/docs/100743772a.pdf
> So it still confirms to standard record strategy ...

I am not convinced. I had not heard of TGMR before, but as far as I can
tell, TGMR is a technology for the head.
https://pics.computerbase.de/4/0/3/4/4/29-1080.455720475.jpg

In any case, the drive behaves like an SMR drive: I ran a benchmark on it
that reached up to 200MB/s.
When I copied a file onto the drive in parallel, the rate in the benchmark
dropped to 7MB/s, while that particular file was copied at 40MB/s.

>
>>> There are two types:
>>> 1. SMR managed by device firmware. BTRFS sees that as a normal block
>>> device … problems you get are not related to BTRFS it self …
>>
>> That for sure. But the way BTRFS uses/writes data could cause problems in
>> conjunction with these devices still, no?
> I'm sorry but I'm confused now, what "magical way of using/writing
> data" you actually mean ? AFAIK btrfs sees the disk as a block device

Well, btrfs writes data very differently from many other file systems.
On every write the file is copied to another place, even if just one bit
is changed. That's special, and I am wondering whether that could cause
problems.

> Now think slowly and thoroughly about it: who would write a code (and
> maintain it) for a file system that access device specific data for X
> amount of vendors with each having Y amount of model specific
> configurations/caveats/firmwares/protocols ... S.M.A.R.T. emerged to
> give a unifying interface to device statistics ... this is how bad it
> was ...

Well, I'm no pro. But I found this:
https://github.com/kdave/drafts/blob/master/btrfs/smr-mode.txt
And this does sound like improvements to BTRFS for SMR can be made in a
generic, not vendor- or device-specific, manner.

And I am wondering:
a) whether it is advisable to use BTRFS on these drives before these
improvements have been made
   i) if not: are there specific btrfs features that should be avoided,
or should btrfs be avoided in general?
b) whether these improvements have already been made

> care about your data, do some research ... if not ... maybe raiserFS
> is for you :)

You are right, for sure. And that's what I am doing here. But I am far
from being able to judge this myself, so I rely on support.

Greetings,
Hendrik



* Re: Status of SMR with BTRFS
  2016-07-17  9:08       ` Hendrik Friedel
@ 2016-07-17 20:48         ` Henk Slager
  2016-07-18 11:22         ` Austin S. Hemmelgarn
  2016-07-20 19:58         ` Chris Murphy
  2 siblings, 0 replies; 24+ messages in thread
From: Henk Slager @ 2016-07-17 20:48 UTC (permalink / raw)
  To: Btrfs BTRFS

>>> It's a Seagate Expansion Desktop 5TB (USB3). It is probably a
>>> ST5000DM000.
>>
>>
>> this is TGMR not SMR disk:
>>
>> http://www.seagate.com/www-content/product-content/desktop-hdd-fam/en-us/docs/100743772a.pdf
>> So it still confirms to standard record strategy ...
>
>
> I am not convinced. I had not heared TGMR before. But I find TGMR as a
> technology for the head.
> https://pics.computerbase.de/4/0/3/4/4/29-1080.455720475.jpg
>
> In any case: the drive behaves like a SMR drive: I ran a benchmark on it
> with up to 200MB/s.
> When copying a file onto the drive in parallel the rate in the benchmark
> dropped to 7MB/s, while that particular file was copied at 40MB/s.

It is quite possible to get this kind of behaviour from a normal drive of
4TB or so. Suppose you have 2 tasks, one writing with a 4k block size to a
1G file at the beginning of the disk and the other writing with a 4k block
size to a 1G file at the end of the disk. At the beginning you
get sustained ~150MB/s, at the end ~75MB/s. Between every 4k write (or
read) you move the head(s), so ~4ms is lost.
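
The arithmetic behind this can be sketched as follows. This is a rough model, not a measurement: it assumes the head really alternates between the two files after every block, and it ignores the page cache and I/O scheduler, which batch writes in practice. The figures (4k blocks, ~4ms per head move, ~150MB/s and ~75MB/s) are the ones given above. The function name and the 1 MiB batch size are chosen only for illustration.

```python
MB = 1_000_000  # decimal megabyte, as used in drive throughput figures


def interleaved_throughput(block_bytes, seek_s, rate_a, rate_b):
    """Per-task throughput in bytes/s when the head alternates between
    two regions, writing one block in each region per cycle."""
    # One cycle: seek to region A, write a block, seek to region B, write a block.
    cycle_s = 2 * seek_s + block_bytes / rate_a + block_bytes / rate_b
    return block_bytes / cycle_s


# Worst case from the text: a head move between every 4k write.
per_task_4k = interleaved_throughput(4096, 0.004, 150 * MB, 75 * MB)

# Same disk, but with writes batched into 1 MiB runs before each head move.
per_task_1m = interleaved_throughput(1024 * 1024, 0.004, 150 * MB, 75 * MB)

print(f"4k blocks:     {per_task_4k / MB:.2f} MB/s per task")
print(f"1 MiB batches: {per_task_1m / MB:.2f} MB/s per task")
```

With these assumptions the seek time dominates completely for 4k blocks (roughly 0.5 MB/s per task), while batching brings the per-task rate back into the tens of MB/s.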

I was wondering how big the zones etc are and hopefully this is still true:
http://blog.schmorp.de/data/smr/fast15-paper-aghayev.pdf


> https://github.com/kdave/drafts/blob/master/btrfs/smr-mode.txt
> And this does sound like improvements to BTRFS can be done for SMR in a
> generic, not vendor/device specific manner.

Maybe have a look at recent patches from Hannes R from SUSE (to 4.7
kernel AFAIK) and see what will be possible with Btrfs once this
'zone-handling' is all working on the lower layers. Currently, there
is nothing special in Btrfs for SMR drives in recent kernels, but in
my experience it works, if you keep device-managed SMR
characteristics/limitations in mind. Maybe like a tape-archive or
dvd-burner.


* Re: Status of SMR with BTRFS
  2016-07-17  9:08       ` Hendrik Friedel
  2016-07-17 20:48         ` Henk Slager
@ 2016-07-18 11:22         ` Austin S. Hemmelgarn
  2016-07-18 18:31           ` Hendrik Friedel
  2016-07-20 19:58         ` Chris Murphy
  2 siblings, 1 reply; 24+ messages in thread
From: Austin S. Hemmelgarn @ 2016-07-18 11:22 UTC (permalink / raw)
  To: Hendrik Friedel, Tomasz Kusmierz; +Cc: Btrfs BTRFS, dave

On 2016-07-17 05:08, Hendrik Friedel wrote:
> Hi Thomasz,
>
> @Dave I have added you to the conversation, as I refer to your notes
> (https://github.com/kdave/drafts/blob/master/btrfs/smr-mode.txt)
>
> thanks for your reply!
>
>>> It's a Seagate Expansion Desktop 5TB (USB3). It is probably a
>>> ST5000DM000.
>>
>> this is TGMR not SMR disk:
>> http://www.seagate.com/www-content/product-content/desktop-hdd-fam/en-us/docs/100743772a.pdf
>>
>> So it still confirms to standard record strategy ...
>
> I am not convinced. I had not heared TGMR before. But I find TGMR as a
> technology for the head.
> https://pics.computerbase.de/4/0/3/4/4/29-1080.455720475.jpg
TGMR is a derivative of giant magneto-resistance, and is what's been 
used in hard disk drives for decades now.  With limited exceptions in 
recent years and in ancient embedded systems, all modern hard drives are 
TGMR based.
> In any case: the drive behaves like a SMR drive: I ran a benchmark on it
> with up to 200MB/s.
> When copying a file onto the drive in parallel the rate in the benchmark
> dropped to 7MB/s, while that particular file was copied at 40MB/s.
This type of performance degradation is actually not unexpected 
depending on the physical location of the files on disk.  Based on the 
numbers you gave, the datasheet for the disk itself, and some basic 
math, I'd guess that the two files ended up on opposite edges of a 
single platter, which means that the head is seeking back and forth 
almost constantly, which is what's killing your performance.
>
>>
>>>> There are two types:
>>>> 1. SMR managed by device firmware. BTRFS sees that as a normal block
>>>> device … problems you get are not related to BTRFS it self …
>>>
>>> That for sure. But the way BTRFS uses/writes data could cause
>>> problems in
>>> conjunction with these devices still, no?
>> I'm sorry but I'm confused now, what "magical way of using/writing
>> data" you actually mean ? AFAIK btrfs sees the disk as a block device
>
> Well, btrfs does write data very different to many other file systems.
> On every write the file is copied to another place, even if just one bit
> is changed. That's special and I am wondering whether that could cause
> problems.
There's two things that should be clarified here:
1. BTRFS only copies part of the file, not the whole file.  At most, it 
copies out a single block, which on reasonably sized filesystems is 16k 
these days.
2. Copy-on-write semantics are not as special as you might think, ZFS is 
also a COW filesystem, as are all log-structured filesystems (NILFS2, 
LogFS, LFS, WAFL, and a couple of others), and a number of filesystems 
support it to some degree (OCFS2, XFS, and even NTFS).
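
Point 1 can be illustrated with a toy model. This is a sketch of block-granular copy-on-write in general, not of how btrfs is implemented: a file is a list of references to immutable blocks, and changing one byte replaces only the block that contains it. The 16k block size is the figure from above. The class and method names are invented for this example.

```python
BLOCK_SIZE = 16 * 1024  # block size mentioned above; illustrative only


class CowFile:
    """Toy copy-on-write file: a list of references to immutable blocks."""

    def __init__(self, data: bytes):
        self.blocks = [data[i:i + BLOCK_SIZE]
                       for i in range(0, len(data), BLOCK_SIZE)]

    def snapshot(self) -> "CowFile":
        """Share every block with the original; no block data is copied."""
        snap = CowFile(b"")
        snap.blocks = list(self.blocks)  # copies references only
        return snap

    def write_byte(self, offset: int, value: int) -> None:
        """Rewrite only the block that contains `offset`."""
        idx, pos = divmod(offset, BLOCK_SIZE)
        old = self.blocks[idx]
        # A new block is built; the old one stays untouched for snapshots.
        self.blocks[idx] = old[:pos] + bytes([value]) + old[pos + 1:]

    def read(self) -> bytes:
        return b"".join(self.blocks)


original = CowFile(bytes(4 * BLOCK_SIZE))   # 64 KiB of zeros in 4 blocks
snap = original.snapshot()
original.write_byte(BLOCK_SIZE + 5, 0xFF)   # modifies block 1 only

shared = sum(a is b for a, b in zip(original.blocks, snap.blocks))
print(f"{shared} of {len(snap.blocks)} blocks still shared with the snapshot")
```

The snapshot keeps reading the old contents, and only one of the four blocks had to be written to a new place.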
>
>> Now think slowly and thoroughly about it: who would write a code (and
>> maintain it) for a file system that access device specific data for X
>> amount of vendors with each having Y amount of model specific
>> configurations/caveats/firmwares/protocols ... S.M.A.R.T. emerged to
>> give a unifying interface to device statistics ... this is how bad it
>> was ...
>
> Well, I'm no pro. But I found this:
> https://github.com/kdave/drafts/blob/master/btrfs/smr-mode.txt
> And this does sound like improvements to BTRFS can be done for SMR in a
> generic, not vendor/device specific manner.
>
> And I am wondering:
> a) whether it is advisable to use BTRFS on these drives before these
> improvements have been made already
>   i) if not: Are there specific btrfs features that should be avoided,
> or btrfs in general?
It's fine to use BTRFS, you just have to give the drive time to finish
its internal bookkeeping before you power it off.  The same is true
with any filesystem, it's just that COW filesystems that aren't log 
structured tend to require a lot more internal bookkeeping on the drive 
than traditional filesystems.  On a typical non-COW filesystem, this 
takes maybe a minute or two depending on the amount of data written.  On 
a log structured filesystem, it may take only 15-30 seconds (log 
structured filesystems are more similar to the internal layout of an SMR 
drive than most other filesystems).  For a COW filesystem like BTRFS, it 
may take 10-30 minutes depending on how much data got moved because of 
the COW semantics.  It's worth pointing out that these numbers are for 
archival usage (write a few big files infrequently), they go up pretty 
steeply for more traditional usage such as for a root filesystem.
> b) whether these improvements have been made already
Not yet.
>
>> care about your data, do some research ... if not ... maybe raiserFS
>> is for you :)
>
> You are right for sure. And that's what I do here. But I am far away
> from being able to judge myself, so I rely on support.
The fact that you're doing research at all is still better than many 
users I see.



* Re: Status of SMR with BTRFS
  2016-07-18 11:22         ` Austin S. Hemmelgarn
@ 2016-07-18 18:31           ` Hendrik Friedel
  2016-07-18 18:44             ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 24+ messages in thread
From: Hendrik Friedel @ 2016-07-18 18:31 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, Tomasz Kusmierz; +Cc: Btrfs BTRFS, dave

Hello and thanks for your replies,

>>>> It's a Seagate Expansion Desktop 5TB (USB3). It is probably a
>>>> ST5000DM000.
>>>
>>> this is TGMR not SMR disk:
> TGMR is a derivative of giant magneto-resistance, and is what's been
> used in hard disk drives for decades now.  With limited exceptions in
> recent years and in ancient embedded systems, all modern hard drives are
> TGMR based.

Ok, thanks. So TGMR does not say whether or not the device is SMR,
right?
While the Data-Sheet does not mention SMR and the 'Desktop' in the name 
rather than 'Archive' would indicate no SMR, some reviews indicate SMR 
(http://www.legitreviews.com/seagate-barracuda-st5000dm000-5tb-desktop-hard-drive-review_161241)

>> In any case: the drive behaves like a SMR drive: I ran a benchmark on it
>> with up to 200MB/s.
>> When copying a file onto the drive in parallel the rate in the benchmark
>> dropped to 7MB/s, while that particular file was copied at 40MB/s.
> This type of performance degradation is actually not unexpected

Ok. I was not aware. I expected some, but less impact.


> There's two things that should be clarified here:
[...]
Thanks for clarifying.

>> Well, I'm no pro. But I found this:
>> https://github.com/kdave/drafts/blob/master/btrfs/smr-mode.txt
>> And this does sound like improvements to BTRFS can be done for SMR in a
>> generic, not vendor/device specific manner.
>>
>> And I am wondering:
[...]
>> b) whether these improvements have been made already
> Not yet.

Ok, thanks.
So I conclude that on SMR drives, BTRFS has all the benefits that it has on
all other devices, and there are no BTRFS-related disadvantages specific to
SMR. Nevertheless, some improvements can be made to BTRFS in order to
work better with these drives.

Greetings,
Hendrik


* Re: Status of SMR with BTRFS
  2016-07-18 18:31           ` Hendrik Friedel
@ 2016-07-18 18:44             ` Austin S. Hemmelgarn
  2016-07-18 19:05               ` Hendrik Friedel
  0 siblings, 1 reply; 24+ messages in thread
From: Austin S. Hemmelgarn @ 2016-07-18 18:44 UTC (permalink / raw)
  To: Hendrik Friedel, Tomasz Kusmierz; +Cc: Btrfs BTRFS, dave

On 2016-07-18 14:31, Hendrik Friedel wrote:
> Hello and thanks for your replies,
>
>>>>> It's a Seagate Expansion Desktop 5TB (USB3). It is probably a
>>>>> ST5000DM000.
>>>>
>>>> this is TGMR not SMR disk:
>> TGMR is a derivative of giant magneto-resistance, and is what's been
>> used in hard disk drives for decades now.  With limited exceptions in
>> recent years and in ancient embedded systems, all modern hard drives are
>> TGMR based.
>
> Ok, thanks; So, TGMR does not say whether or not the Device is SMR or
> not, right?
I'm not 100% certain about that.  Technically, the only non-firmware 
difference is in the read head and the tracking.  If it were me, I'd be 
listing SMR instead of TGMR on the data sheet, but I'd be more than 
willing to bet that many drive manufacturers won't think like that.
> While the Data-Sheet does not mention SMR and the 'Desktop' in the name
> rather than 'Archive' would indicate no SMR, some reviews indicate SMR
> (http://www.legitreviews.com/seagate-barracuda-st5000dm000-5tb-desktop-hard-drive-review_161241)
I know for a fact that at least through 2015, all 'Desktop' branded 3.5 
inch Seagate hard drives up through 4TB in capacity used TGMR with 1TB 
platters (500GB per side of the platter).  I've got one of their 5TB 
external drives at work (with 'Expansion' branding) which uses a 3.5 
inch disk which based on testing I've done appears to be traditional 
TGMR with either thinner platters (and thus more of them) or some other 
method of improving areal storage density.  Beyond that, I'm not sure, 
but I believe that their 'Desktop' branding still means it's TGMR and 
not SMR.
>
>
>>> In any case: the drive behaves like a SMR drive: I ran a benchmark on it
>>> with up to 200MB/s.
>>> When copying a file onto the drive in parallel the rate in the benchmark
>>> dropped to 7MB/s, while that particular file was copied at 40MB/s.
>> This type of performance degradation is actually not unexpected
>
> Ok. I was not aware. I expected some, but less impact.
There are a number of factors that contribute to this, I see less 
degradation on enterprise disks than regular retail units, but it's 
still there.  Even with the insane precision they already have using 
voice-coils for actuation of the head, there's a functional upper limit 
on how fast it can move without risking damaging anything.
>
>
>> There's two things that should be clarified here:
> [...]
> Thanks for clarifying.
>
>>> Well, I'm no pro. But I found this:
>>> https://github.com/kdave/drafts/blob/master/btrfs/smr-mode.txt
>>> And this does sound like improvements to BTRFS can be done for SMR in a
>>> generic, not vendor/device specific manner.
>>>
>>> And I am wondering:
> [...]
>>> b) whether these improvements have been made already
>> Not yet.
>
> Ok, thanks.
> So I conclude that on SMR Drives, BTRFS has all benefits that it has on
> all other devices and there are no BTRFS related disadvantages in
> relation with BTRFS. Nevertheless, some improvements to BTRFS can be
> made in order to improve BTRFS with these drives.
I've come to pretty much the same conclusion in my usage.  That said, 
quite a few improvements could be made to BTRFS in general, not just 
with respect to SMR drives.

I'd very much suggest avoiding USB connected SMR drives though, USB is 
already poorly designed for storage devices (even with USB attached 
SCSI), and most of the filesystem issues I see personally (not just with 
BTRFS, but any other filesystem as well) are on USB connected storage, 
so I'd be very wary of adding all the potential issues with SMR drives 
on top of that as well.


* Re: Status of SMR with BTRFS
  2016-07-18 18:44             ` Austin S. Hemmelgarn
@ 2016-07-18 19:05               ` Hendrik Friedel
  2016-07-18 19:30                 ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 24+ messages in thread
From: Hendrik Friedel @ 2016-07-18 19:05 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, Tomasz Kusmierz; +Cc: Btrfs BTRFS, dave

Hello Austin,

thanks for your reply.

>> Ok, thanks; So, TGMR does not say whether or not the Device is SMR or
>> not, right?
> I'm not 100% certain about that.  Technically, the only non-firmware
> difference is in the read head and the tracking.  If it were me, I'd be
> listing SMR instead of TGMR on the data sheet, but I'd be more than
> willing to bet that many drive manufacturers won't think like that.
>> While the Data-Sheet does not mention SMR and the 'Desktop' in the name
>> rather than 'Archive' would indicate no SMR, some reviews indicate SMR
>> (http://www.legitreviews.com/seagate-barracuda-st5000dm000-5tb-desktop-hard-drive-review_161241)
>>
>  Beyond that, I'm not sure,
> but I believe that their 'Desktop' branding still means it's TGMR and
> not SMR.

... which in the Seagate nomenclature might not exclude each other (TGMR 
could still be SMR). I will just ask them...
How did you find out on your drives whether they use SMR?

> I'd very much suggest avoiding USB connected SMR drives though, USB is
> already poorly designed for storage devices (even with USB attached
> SCSI), and most of the filesystem issues I see personally (not just with
> BTRFS, but any other filesystem as well) are on USB connected storage,
> so I'd be very wary of adding all the potential issues with SMR drives
> on top of that as well.

Understood. But I use this drive for backups. The drive must not be
connected to the system except while doing a backup. Otherwise a virus, or
just a power issue (a surge due to a lightning strike), might destroy
both the source data and the backup at once (single point of failure).

Greetings,
Hendrik


* Re: Status of SMR with BTRFS
  2016-07-18 19:05               ` Hendrik Friedel
@ 2016-07-18 19:30                 ` Austin S. Hemmelgarn
  2016-07-18 22:29                   ` Tomasz Kusmierz
  0 siblings, 1 reply; 24+ messages in thread
From: Austin S. Hemmelgarn @ 2016-07-18 19:30 UTC (permalink / raw)
  To: Hendrik Friedel, Tomasz Kusmierz; +Cc: Btrfs BTRFS, dave

On 2016-07-18 15:05, Hendrik Friedel wrote:
> Hello Austin,
>
> thanks for your reply.
>
>>> Ok, thanks; So, TGMR does not say whether or not the Device is SMR or
>>> not, right?
>> I'm not 100% certain about that.  Technically, the only non-firmware
>> difference is in the read head and the tracking.  If it were me, I'd be
>> listing SMR instead of TGMR on the data sheet, but I'd be more than
>> willing to bet that many drive manufacturers won't think like that.
>>> While the Data-Sheet does not mention SMR and the 'Desktop' in the name
>>> rather than 'Archive' would indicate no SMR, some reviews indicate SMR
>>> (http://www.legitreviews.com/seagate-barracuda-st5000dm000-5tb-desktop-hard-drive-review_161241)
>>>
>>>
>>  Beyond that, I'm not sure,
>> but I believe that their 'Desktop' branding still means it's TGMR and
>> not SMR.
>
> ... which in the Seagate nomenclature might not exclude each other (TGMR
> could still be SMR). I will just ask them...
> How did you find out on your drives whether they use SMR?
I've actually manually deconstructed quite a few hard disks for work. 
We have to wipe old disks before they leave the building, and if we 
can't do it in software because of something like a head crash, we have 
to take them apart and physically destroy the platters.  There are 
actually visible differences in regular TGMR and SMR heads, so if you 
know what to look for, you can tell by the write heads (it's also a dead 
giveaway when a drive has significantly fewer platters than TB's of 
space).  There's also measurable performance differences for certain 
access patterns, but that doesn't work as reliably as it sounds like it 
should (the actual physical data layout on disk these days may look 
nothing like what software sees).
>
>> I'd very much suggest avoiding USB connected SMR drives though, USB is
>> already poorly designed for storage devices (even with USB attached
>> SCSI), and most of the filesystem issues I see personally (not just with
>> BTRFS, but any other filesystem as well) are on USB connected storage,
>> so I'd be very wary of adding all the potential issues with SMR drives
>> on top of that as well.
>
> Understood. But I use this drive as a Backup. The Drive must not be
> connected to the System unless doing a backup. Otherwise a Virus, or
> just an issue with the power (peak due to lightning strike) might vanish
> both the Source data and Backup at once (single point of failure).
In that case, I'd be very careful to wait for a while after finishing a 
backup before disconnecting the disk (and make sure to unmount the 
filesystem before waiting so that everything gets flushed to disk), and 
make sure to validate your backups as well.  Once the disk has been 
safely disconnected, you're fine though, the issues arise if the device 
suddenly disappears from the bus or loses power (which is of course an 
issue for regular drives, the filesystem damage just tends to be much 
worse with SMR drives).  For what it's worth, I've seen fewer issues 
with (recent) USB 3.0 devices than USB 2.0, but even just bumping the 
cable can cause issues sometimes.
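
A minimal sketch of the "validate your backups" step, assuming the
backup is a plain file tree that mirrors the source. The names
`file_digest` and `verify_backup` are hypothetical helpers, not part of
any backup tool. To be sure the data is really read from the disk and
not from the page cache, it would make sense to run this only after
unmounting and remounting the backup filesystem.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_backup(source_root, backup_root):
    """Return sorted relative paths that are missing or differ in the backup."""
    problems = []
    for dirpath, _dirnames, filenames in os.walk(source_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, source_root)
            dst = os.path.join(backup_root, rel)
            # A file counts as a problem if it is absent or its content differs.
            if not os.path.isfile(dst) or file_digest(src) != file_digest(dst):
                problems.append(rel)
    return sorted(problems)
```

An empty list means every source file has an identical copy in the
backup; files that exist only in the backup are deliberately ignored.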

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Status of SMR with BTRFS
  2016-07-18 19:30                 ` Austin S. Hemmelgarn
@ 2016-07-18 22:29                   ` Tomasz Kusmierz
  0 siblings, 0 replies; 24+ messages in thread
From: Tomasz Kusmierz @ 2016-07-18 22:29 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Hendrik Friedel, Btrfs BTRFS, dave

Sorry for the late reply; there was a lot of traffic in this thread, so:

1. I do apologize, but I got the wrong end of the stick. I was
convinced that btrfs was causing corruption on your disk, because some
of the links in your original post pointed to topics about
corruption, but you are concerned about performance, right?

2. I'm still not convinced that Seagate would leave out such a feature
as SMR and mistakenly call it TGMR ... a lot of money is involved
in storing data, and an egg-on-the-face moment can cost them a lot ...
ALSO it's called "Barracuda"; historically this meant disks with good
IO performance, and I can't imagine that somebody at Seagate would put
the Barracuda label on stinking SMR (yes, SMR stinks! but more on that
later). Though I remember how Micron used to have a complete mess in
their data sheets :/

3. f2fs, as well as jffs2 and logfs, will give tremendous performance on
spinners :D Those file systems are meant for tiny flash devices with
minimal IO, minimal available power and minimal erases, and to fit
that market they try to minimise fragmentation of the flash, to
serialise writes, and to avoid jumping around the flash so that it
wear-levels itself. The result on a spinner is that you get very
static and sequential IO when writing data, but your reads will be crap
... and like every single "developed for flash" filesystem, it will
expect your device to have a 100% functional TRIM that resets
a block (usually 128kB) to 0xFF ... spinners don't do that
... and this is where you will see corruption. Also, a lot of those
file systems require direct access to the device rather than block
device emulation. Also, on a flash device you can go in and alter a
bit in a byte as long as you change it from a physical "1" to "0"; on
a spinner you need to rewrite a sector and its associated CRC and
correction data etc.; on SMR you will have to rewrite a whole BAND ...
FUN !! So every time your filesystem marks something for a future
TRIM it will try to set a single bit in the block's associated data
(hahaha your band needs to get rewritten) and this is how you will
effectively kill sectors (bands) on your disk !

4. SMR stinks ... yes it does ... it's a recipe for disaster; slight
modifications cause a lot of workload on the drive ... if you modify a
"sector", most of a band needs to get rewritten ... this is where
corruption creeps in, where the disk surface wears. I understand how
the NSA may have a use case for that - Google shifts data between
server farms in the US and abroad, then they send a copy to the NSA
(yes they do), the NSA stores it, but they don't care about single-bit
rot or minor defects; the data does not get modified, just analysed in
bulk and discarded, and the whole array gets written with fresh data
... Amazon on the other hand gets paid for not having your data
corrupted ... so they won't fancy SMR that much (maybe for Glacier).
See a pattern here? For a user to have that type of use case is just
weird; if you want to back up your data then you care about it ... so
I'm not convinced that SMR is truly for you. Also a 5TB device
connected to USB3 used as a backup :O :O :O :O :O :O I wouldn't keep
my "just in case internet was down backup of pornhub" on that setup :)
And I'm not picking on you here ... I personally used a far better
backup than that and it still failed, and people still pointed out
bluntly how pathetic it was ... and they were right !


In terms of SMR those are my brutal opinions ... and nothing more. I
accept that most likely I'm wrong. Hell, I've been wrong most of my
life; it's just that after 10 years of engineering embedded devices for
various applications I'm veeeeeerrrrrrrrryyyy cautious, due to
experiences with a lot of "some bright spark" (a clueless guy who
wanted to feel more intelligent than the engineers) "decided to use
this revolutionary thing" (wanted to prove himself and based everything
on luck) "and created a valid learning experience for the whole
development team" (all the engineers wanted to kill him) "and we all
came out of that stronger and with more experience" (he/she got fired).

On 18 July 2016 at 20:30, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote:
> On 2016-07-18 15:05, Hendrik Friedel wrote:
>>
>> Hello Austin,
>>
>> thanks for your reply.
>>
>>>> Ok, thanks; So, TGMR does not say whether or not the Device is SMR or
>>>> not, right?
>>>
>>> I'm not 100% certain about that.  Technically, the only non-firmware
>>> difference is in the read head and the tracking.  If it were me, I'd be
>>> listing SMR instead of TGMR on the data sheet, but I'd be more than
>>> willing to bet that many drive manufacturers won't think like that.
>>>>
>>>> While the Data-Sheet does not mention SMR and the 'Desktop' in the name
>>>> rather than 'Archive' would indicate no SMR, some reviews indicate SMR
>>>>
>>>> (http://www.legitreviews.com/seagate-barracuda-st5000dm000-5tb-desktop-hard-drive-review_161241)
>>>>
>>>>
>>>  Beyond that, I'm not sure,
>>> but I believe that their 'Desktop' branding still means it's TGMR and
>>> not SMR.
>>
>>
>> ... which in the Seagate nomenclature might not exclude each other (TGMR
>> could still be SMR). I will just ask them...
>> How did you find out on your drives whether they use SMR?
>
> I've actually manually deconstructed quite a few hard disks for work. We
> have to wipe old disks before they leave the building, and if we can't do it
> in software because of something like a head crash, we have to take them
> apart and physically destroy the platters.  There are actually visible
> differences in regular TGMR and SMR heads, so if you know what to look for,
> you can tell by the write heads (it's also a dead giveaway when a drive has
> significantly fewer platters than TB's of space).  There's also measurable
> performance differences for certain access patterns, but that doesn't work
> as reliably as it sounds like it should (the actual physical data layout on
> disk these days may look nothing like what software sees).
>>
>>
>>> I'd very much suggest avoiding USB connected SMR drives though, USB is
>>> already poorly designed for storage devices (even with USB attached
>>> SCSI), and most of the filesystem issues I see personally (not just with
>>> BTRFS, but any other filesystem as well) are on USB connected storage,
>>> so I'd be very wary of adding all the potential issues with SMR drives
>>> on top of that as well.
>>
>>
>> Understood. But I use this drive as a Backup. The Drive must not be
>> connected to the System unless doing a backup. Otherwise a Virus, or
>> just an issue with the power (peak due to lightning strike) might vanish
>> both the Source data and Backup at once (single point of failure).
>
> In that case, I'd be very careful to wait for a while after finishing a
> backup before disconnecting the disk (and make sure to unmount the
> filesystem before waiting so that everything gets flushed to disk), and make
> sure to validate your backups as well.  Once the disk has been safely
> disconnected, you're fine though, the issues arise if the device suddenly
> disappears from the bus or loses power (which is of course an issue for
> regular drives, the filesystem damage just tends to be much worse with SMR
> drives).  For what it's worth, I've seen fewer issues with (recent) USB 3.0
> devices than USB 2.0, but even just bumping the cable can cause issues
> sometimes.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Status of SMR with BTRFS
  2016-07-17  9:08       ` Hendrik Friedel
  2016-07-17 20:48         ` Henk Slager
  2016-07-18 11:22         ` Austin S. Hemmelgarn
@ 2016-07-20 19:58         ` Chris Murphy
  2016-07-21 12:46           ` Austin S. Hemmelgarn
  2 siblings, 1 reply; 24+ messages in thread
From: Chris Murphy @ 2016-07-20 19:58 UTC (permalink / raw)
  To: Hendrik Friedel; +Cc: Tomasz Kusmierz, Btrfs BTRFS, dave

On Sun, Jul 17, 2016 at 3:08 AM, Hendrik Friedel <hendrik@friedels.name> wrote:

> Well, btrfs does write data very different to many other file systems. On
> every write the file is copied to another place, even if just one bit is
> changed. That's special and I am wondering whether that could cause
> problems.

It depends on the application. In practice, the program most
responsible for writing the file often does a faux-COW by writing a
whole new (temporary) file somewhere; when that operation completes,
it deletes the original and move+renames the temporary one into
the place where the original one was, doing fsync in between each of
those operations. I think some of this is done via VFS also. It's all
much more metadata centric than what Btrfs would do on its own.

I'd expect the write pattern of Btrfs to be similar to f2fs, with
respect to sequentiality of new writes.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Status of SMR with BTRFS
  2016-07-20 19:58         ` Chris Murphy
@ 2016-07-21 12:46           ` Austin S. Hemmelgarn
  2016-07-21 13:34             ` Chris Murphy
  0 siblings, 1 reply; 24+ messages in thread
From: Austin S. Hemmelgarn @ 2016-07-21 12:46 UTC (permalink / raw)
  To: Chris Murphy, Hendrik Friedel; +Cc: Tomasz Kusmierz, Btrfs BTRFS, dave

On 2016-07-20 15:58, Chris Murphy wrote:
> On Sun, Jul 17, 2016 at 3:08 AM, Hendrik Friedel <hendrik@friedels.name> wrote:
>
>> Well, btrfs does write data very different to many other file systems. On
>> every write the file is copied to another place, even if just one bit is
>> changed. That's special and I am wondering whether that could cause
>> problems.
>
> It depends on the application. In practice, the program most
> responsible for writing the file often does a faux-COW by writing a
> whole new (temporary) file somewhere, when that operation completes,
> it then deletes the original, and move+renames the temporary one into
> place where the original one, doing fsync in between each of those
> operations. I think some of this is done via VFS also. It's all much
> more metadata centric than what Btrfs would do on its own.
I'm pretty certain that the VFS itself does not do replace by rename 
type stuff.  BTRFS by nature technically does though, it's the same idea 
as a COW update, just at a higher level, so we're technically doing the 
same thing for every single block that changes.  The only issue I can 
think of in this context with a replace by rename is that you end up 
hitting the metadata trees twice.
>
> I'd expect the write pattern of Btrfs to be similar to f2fs, with
> respect to sequentiality of new writes.
Not necessarily, F2FS is log structured, and while not as much like 
traditional log structured filesystems, it still has a similar long-term 
write pattern to stuff like NILFS2 or LFS.  I've not done as much with 
F2FS specifically, but I can say based on comparison to other log 
structured filesystems that outside of WORM write patterns in userspace, 
BTRFS does not have a similar write pattern to a log structured 
filesystem.  We try to pack stuff into existing allocations pretty 
aggressively, so we end up with most of our writes condensed in a small 
area of the disk.  The only cases I've seen where we get long sequential 
writes are when writing out single files one by one, without having 
anything else running at the same time on the FS.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Status of SMR with BTRFS
  2016-07-21 12:46           ` Austin S. Hemmelgarn
@ 2016-07-21 13:34             ` Chris Murphy
  2016-07-21 14:02               ` Andrei Borzenkov
                                 ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Chris Murphy @ 2016-07-21 13:34 UTC (permalink / raw)
  To: Austin S. Hemmelgarn
  Cc: Chris Murphy, Hendrik Friedel, Tomasz Kusmierz, Btrfs BTRFS, dave

On Thu, Jul 21, 2016 at 6:46 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2016-07-20 15:58, Chris Murphy wrote:
>>
>> On Sun, Jul 17, 2016 at 3:08 AM, Hendrik Friedel <hendrik@friedels.name>
>> wrote:
>>
>>> Well, btrfs does write data very different to many other file systems. On
>>> every write the file is copied to another place, even if just one bit is
>>> changed. That's special and I am wondering whether that could cause
>>> problems.
>>
>>
>> It depends on the application. In practice, the program most
>> responsible for writing the file often does a faux-COW by writing a
>> whole new (temporary) file somewhere, when that operation completes,
>> it then deletes the original, and move+renames the temporary one into
>> place where the original one, doing fsync in between each of those
>> operations. I think some of this is done via VFS also. It's all much
>> more metadata centric than what Btrfs would do on its own.
>
> I'm pretty certain that the VFS itself does not do replace by rename type
> stuff.

I can't tell what does it. But so far every program I've tried: vi,
gedit, GIMP, writes out a new file - as in, it has a different inode
number and every extent has a different address. That every program
reimplements this faux-COW would kinda surprise me rather than just
letting the VFS do it for everyone. I think since ancient times
literally overwriting files is just a bad idea that pretty much
guarantees data loss of old and new data if something interrupts that
overwrite.


> BTRFS by nature technically does though, it's the same idea as a COW
> update, just at a higher level, so we're technically doing the same thing
> for every single block that changes.  The only issue I can think of in this
> context with a replace by rename is that you end up hitting the metadata
> trees twice.

Do programs have a way to communicate what portion of a data file is
modified, so that only changed blocks are COW'd? When I change a
single pixel in a 400MiB image and do a save (to overwrite the
original file), it takes just as long to overwrite as to write it out
as a new file. It'd be neat if that could be optimized but I don't see
it being the case at the moment.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Status of SMR with BTRFS
  2016-07-21 13:34             ` Chris Murphy
@ 2016-07-21 14:02               ` Andrei Borzenkov
  2016-07-21 14:12               ` Austin S. Hemmelgarn
  2016-07-21 15:35               ` Patrik Lundquist
  2 siblings, 0 replies; 24+ messages in thread
From: Andrei Borzenkov @ 2016-07-21 14:02 UTC (permalink / raw)
  To: Chris Murphy
  Cc: Austin S. Hemmelgarn, Hendrik Friedel, Tomasz Kusmierz,
	Btrfs BTRFS, dave

On Thu, Jul 21, 2016 at 4:34 PM, Chris Murphy <lists@colorremedies.com> wrote:

>
> Do programs have a way to communicate what portion of a data file is
> modified, so that only changed blocks are COW'd? When I change a
> single pixel in a 400MiB image and do a save (to overwrite the
> original file), it takes just as long to overwrite as to write it out
> as a new file. It'd be neat if that could be optimized but I don't see
> it being the case at the moment.
>

NetApp has an option to do it for CIFS connections. It literally
compares old and new files on renames and discards duplicates. It is
off by default. I am not aware of anyone using it :)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Status of SMR with BTRFS
  2016-07-21 13:34             ` Chris Murphy
  2016-07-21 14:02               ` Andrei Borzenkov
@ 2016-07-21 14:12               ` Austin S. Hemmelgarn
  2016-07-21 14:31                 ` Chris Murphy
  2016-07-21 15:35               ` Patrik Lundquist
  2 siblings, 1 reply; 24+ messages in thread
From: Austin S. Hemmelgarn @ 2016-07-21 14:12 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Hendrik Friedel, Tomasz Kusmierz, Btrfs BTRFS, dave

On 2016-07-21 09:34, Chris Murphy wrote:
> On Thu, Jul 21, 2016 at 6:46 AM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>> On 2016-07-20 15:58, Chris Murphy wrote:
>>>
>>> On Sun, Jul 17, 2016 at 3:08 AM, Hendrik Friedel <hendrik@friedels.name>
>>> wrote:
>>>
>>>> Well, btrfs does write data very different to many other file systems. On
>>>> every write the file is copied to another place, even if just one bit is
>>>> changed. That's special and I am wondering whether that could cause
>>>> problems.
>>>
>>>
>>> It depends on the application. In practice, the program most
>>> responsible for writing the file often does a faux-COW by writing a
>>> whole new (temporary) file somewhere, when that operation completes,
>>> it then deletes the original, and move+renames the temporary one into
>>> place where the original one, doing fsync in between each of those
>>> operations. I think some of this is done via VFS also. It's all much
>>> more metadata centric than what Btrfs would do on its own.
>>
>> I'm pretty certain that the VFS itself does not do replace by rename type
>> stuff.
>
> I can't tell what does it. But so far every program I've tried: vi,
> gedit, GIMP, writes out a new file - as in, it has a different inode
> number and every extent has a different address. That every program
> reimplements this faux-COW would kinda surprise me rather than just
> letting the VFS do it for everyone. I think since ancient times
> literally overwriting files is just a bad idea that pretty much
> guarantees data loss of old and new data if something interrupts that
> overwrite.
This really isn't fake COW, it's COW, just at a higher level than most 
programmers would think of it.  The rename to replace is the pointer 
update, and the copy granularity is variable based on the size of the file.

The whole practice is used by just about everything, and dates back to 
before even SVR4, because traditional filesystems will corrupt files if 
they're being written when a power loss or crash occurs.  It's also 
popular because it breaks hard links, which have often been used as a poor 
man's form of deduplication.  Even on newer journaled filesystems, 
things aren't always safe across a power loss if you don't do this.  It 
can't be done legitimately in the VFS though, because POSIX requires 
that the inode not change if the file is just overwritten or rewritten 
in place.  Vi (which is probably vim on your system, although all other 
implementations I know of do likewise) does this by itself.  Most 
graphical applications have it happen through libraries they link to (I 
know for a fact that Qt has an option to do this, and I'm pretty certain 
Glib does too, but I don't know if they do by default or not).  In 
general though, it's really not all that much duplicated code, maybe 20 
lines tops, assuming they don't use predictable file names and open code 
the temporary name generation.
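
The replace-by-rename pattern described above can be sketched roughly
as follows. This is a minimal illustration for POSIX systems, not the
code of vim, Qt or Glib; `atomic_write` is a hypothetical helper name.
The temporary file is created in the same directory as the target so
that the rename stays within one filesystem.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace the file at `path` with `data` (bytes) via write-new-then-rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # mkstemp picks an unpredictable name, avoiding open-coded name generation.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # new data must be on disk before the rename
        os.replace(tmp_path, path)  # the "pointer update": atomic on POSIX
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Flush the directory entry as well, so the rename survives a power loss.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Because the temporary file exists alongside the original until the
rename, the target ends up with a different inode number, and any other
hard link to the old file keeps the old content.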
>
>> BTRFS by nature technically does though, it's the same idea as a COW
>> update, just at a higher level, so we're technically doing the same thing
>> for every single block that changes.  The only issue I can think of in this
>> context with a replace by rename is that you end up hitting the metadata
>> trees twice.
>
> Do programs have a way to communicate what portion of a data file is
> modified, so that only changed blocks are COW'd? When I change a
> single pixel in a 400MiB image and do a save (to overwrite the
> original file), it takes just as long to overwrite as to write it out
> as a new file. It'd be neat if that could be optimized but I don't see
> it being the case at the moment.
AFAIUI, in BTRFS (and also ZFS), whatever blocks get rewritten get 
COW'ed.  So, rewriting the whole file will COW the whole file, not just 
the blocks that are different.  Trying to check in the FS itself what 
changed is actually rather inefficient (you will almost always spend 
more time comparing data than you will save by writing it all out if 
you're using fast storage, and every write potentially implies a huge 
number of reads), and relying on the application to tell us is 
dangerous.  That said, most of the required infrastructure is already 
present in the in-band deduplication stuff, and in fact, it may do this 
for files that get rewritten frequently enough that they don't get 
pushed out of its cache (I haven't tested for this, and I don't have 
the time or expertise to read through the code to see if it will, but 
based on my current understanding of how it works, it should do this 
implicitly).  The whole thing is a trade off though, because only 
COW'ing the parts that changed leads to higher levels of fragmentation, 
and that's part of why database and disk image files have such issues 
with fragmentation and making them NOCOW helps with these issues, they 
only get spot rewrites.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Status of SMR with BTRFS
  2016-07-21 14:12               ` Austin S. Hemmelgarn
@ 2016-07-21 14:31                 ` Chris Murphy
  0 siblings, 0 replies; 24+ messages in thread
From: Chris Murphy @ 2016-07-21 14:31 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Btrfs BTRFS

On Thu, Jul 21, 2016 at 8:12 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:

> This really isn't fake COW, it's COW, just at a higher level than most
> programmers would think of it.

Alright I'll stop calling it that. Thanks.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Status of SMR with BTRFS
  2016-07-21 13:34             ` Chris Murphy
  2016-07-21 14:02               ` Andrei Borzenkov
  2016-07-21 14:12               ` Austin S. Hemmelgarn
@ 2016-07-21 15:35               ` Patrik Lundquist
  2 siblings, 0 replies; 24+ messages in thread
From: Patrik Lundquist @ 2016-07-21 15:35 UTC (permalink / raw)
  To: Chris Murphy
  Cc: Austin S. Hemmelgarn, Hendrik Friedel, Tomasz Kusmierz,
	Btrfs BTRFS, dave

On 21 July 2016 at 15:34, Chris Murphy <lists@colorremedies.com> wrote:
>
> Do programs have a way to communicate what portion of a data file is
> modified, so that only changed blocks are COW'd? When I change a
> single pixel in a 400MiB image and do a save (to overwrite the
> original file), it takes just as long to overwrite as to write it out
> as a new file. It'd be neat if that could be optimized but I don't see
> it being the case at the moment.

Programs can choose to seek within a file and only overwrite changed
parts, like BitTorrent (use NOCOW or defrag files like that).

Paint programs usually compress the changed image on save, so most of
the file is changed anyway. But if it's a raw image file just writing
the changed pixels should work, but that would require a comparison
with the original image (or a per-pixel change history) so I doubt
anyone cares to implement it at the application level.
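
For illustration, seeking within a file and overwriting only the
changed parts could look roughly like this. It is a sketch under the
assumption that the whole new content is available in memory;
`overwrite_changed_blocks` and the 4096-byte block size are hypothetical
choices, not taken from any real program. Note that it has to read the
old data to compare it, and that an interrupted run leaves a mix of old
and new blocks.

```python
def overwrite_changed_blocks(path, new_data, block_size=4096):
    """Rewrite in place only the blocks of `path` that differ from `new_data`.

    Returns the number of blocks that were actually written.
    """
    written = 0
    with open(path, "r+b") as f:
        offset = 0
        while offset < len(new_data):
            new_block = new_data[offset:offset + block_size]
            f.seek(offset)
            old_block = f.read(len(new_block))
            if old_block != new_block:
                f.seek(offset)
                f.write(new_block)
                written += 1
            offset += block_size
        # Drop any tail left over if the new content is shorter.
        f.truncate(len(new_data))
    return written
```

Changing a single byte of a 16 KiB file this way writes one block and
keeps the inode, which is the spot-rewrite pattern that leads to
fragmentation on a COW filesystem.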

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Status of SMR with BTRFS
@ 2016-07-21 12:22 Matthias Prager
  0 siblings, 0 replies; 24+ messages in thread
From: Matthias Prager @ 2016-07-21 12:22 UTC (permalink / raw)
  To: linux-btrfs, Chris Murphy; +Cc: Matthias Prager

> I'd expect the write pattern of Btrfs to be similar to f2fs, with
> respect to sequentiality of new writes.
Ideally yes - though my tests with a Seagate SMR drive suggest
otherwise. Optimizing the write behavior would probably lead to speed
improvements for btrfs on spinning disks.

---
Matthias Prager

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Status of SMR with BTRFS
  2016-07-17 21:44   ` Matthias Prager
@ 2016-07-18 18:49     ` Jukka Larja
  0 siblings, 0 replies; 24+ messages in thread
From: Jukka Larja @ 2016-07-18 18:49 UTC (permalink / raw)
  To: Matthias Prager, Henk Slager; +Cc: linux-btrfs

On 18.7.2016 at 0.44, Matthias Prager wrote:

> the Seagate SMR drives are fast enough to handle Gbit-LAN
> speeds if they are served mostly large sequential chunks by the file
> system, which f2fs actually manages to do (cold storage in my scenario
> too). Btrfs does too many scattered writes for this to work without
> bandages (i.e. caching or snapshotting) (although I do see the advantage
> in having checksums for data which you write once and then read like
> once every year).

I have two 8 TB Seagate Archive drives (which did cause me lots of extra 
work, before I got everything patched and/or hardware to work correctly) in 
Btrfs raid1[0] configuration. They serve as backup for DVR (and for testing, 
as a backup of a backup of my Windows laptop) with around 4 TB of 
recordings. Original rsync ran at about network speed. Both drives were 
writing a bit over 100 MB/s for the whole time.

Backups are updated once per day, few GBs of recordings and around 5000 
small files from laptop changed, removed or created. I'm sort of waiting for 
something to explode, but so far it seems to be working fine. Maybe I'll put 
some real backups there some day.

[0] No need to tell me this shouldn't be done. It's for testing purposes.

-- 
      ...Activity alien to life...
     Jukka Larja, Roskakori@aarghimedes.fi

"Later subverted by himself; when he tried to march through the Makran 
desert *SPOILER ALERT*. Ouch..."
"Buddy, Alexander is literally ancient history. There are limits to the 
spoiler tag."
- TV Tropes Wiki, Never Tell Me The Odds, on Alexander the Great -

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Status of SMR with BTRFS
  2016-07-17 20:10 ` Henk Slager
@ 2016-07-17 21:44   ` Matthias Prager
  2016-07-18 18:49     ` Jukka Larja
  0 siblings, 1 reply; 24+ messages in thread
From: Matthias Prager @ 2016-07-17 21:44 UTC (permalink / raw)
  To: Henk Slager; +Cc: linux-btrfs, Matthias Prager

On 17.07.2016 at 22:10, Henk Slager wrote:
> What kernel (version) did you use ?
> I hope it included:
> http://git.kernel.org/cgit/linux/kernel/git/mkp/linux.git/commit/?h=bugzilla-93581&id=7c4fbd50bfece00abf529bc96ac989dd2bb83ca4
> 
> so >= 4.4, as without this patch, it is quite problematic, if not
> impossible, to use this 8TB Seagate SMR drive with linux without doing
> other patches or setting/module changes.
Thanks for that pointer. I tested kernels 3.18.28, 4.1.[17+19] and
4.5.0. I had seen task aborts on the drive when io-stressing it with kernels
3.18 and 4.1 (and ext4), but I never figured out the exact reason. Since
I'm currently stuck at kernel 4.1.x, I did not research this any further
(kernels >=4.2 aren't usable in esxi-guests when using pass-through
devices due to irq handling issues which lead to driver inits failing -
I'm told vmware is still sitting on a fix).


> Since this patch, I have been using the drive for cold storage
> archiving, connected to a Baytrail SoC SATA port. I use bcache
> (writethrough or writearound) on an 8TB GPT partition that has a LUKS
> container that is Btrfs m-dup, d-single formatted and mounted
> compress=lzo,noatine,nossd. It is only powered on once a month for a
> day or so and then it receives incremental snapshots mostly or some
> SSD or flash images of 10-50G.
> I have more or less kept all the snapshots sofar, so chunks keep being
> added to previously unwritten space, so as sequential as possible.
Mhh, see that would be one too many layers of complexity for my taste in
such a setup - the Seagate SMR drives are fast enough to handle Gbit-LAN
speeds if they are served mostly large sequential chunks by the file
system, which f2fs actually manages to do (cold storage in my scenario
too). Btrfs does too many scattered writes for this to work without
bandages (i.e. caching or snapshotting) (although I do see the advantage
in having checksums for data which you write once and then read like
once every year).


> If free space would be heavily fragmented and also files would be
> heavily fragmented and the disk would be very full, adding new files
> or modifying would be very slow. You see than many seconds that the
> drive is active but no traffic on the SATA link. Also then there is
> the risk that the default '/sys/block/$(kerneldevname)/device/timeout'
> of 30 secs is too low, and that the kernel might reset the SATA link.
> A SATA link still happened 2x the last 1/2 year, I haven't really
> looked at the details sofar, just rebooted at some point in time
> later, but I will set the timeout at least higher, e.g. 180, and then
> see if ata errors/resets still occur. It might be FW crashes as well.
As far as I've tested, f2fs never backed the SMR drive into a corner,
which is probably due to its sequential write pattern as a
log-structured file system and its background garbage collection (i.e.
defragmentation) - even in a full state. I imagine this will probably
not work out for hot data though.


> 
> At least this SMR drive is not advised to use in raid setups. As
> not-so-active array it might work if you use the right timeouts and
> scterc etc, but if have seen how long the wait on the SATA link can be
> and that makes me realize that the stamp 'Archive Drive' done by
> Seagate has a clear reason.
Agreed, these drives do need special handling. For archival workloads
with cold data they can be used if the file system is kind enough. I
wouldn't be comfortable using these drives in any scenario where they
might be backed into a corner, in which case the wait times are far too
unpredictable for my taste.


---
Matthias

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Status of SMR with BTRFS
  2016-07-17  8:26 Matthias Prager
@ 2016-07-17 20:10 ` Henk Slager
  2016-07-17 21:44   ` Matthias Prager
  0 siblings, 1 reply; 24+ messages in thread
From: Henk Slager @ 2016-07-17 20:10 UTC (permalink / raw)
  To: Matthias Prager; +Cc: linux-btrfs

On Sun, Jul 17, 2016 at 10:26 AM, Matthias Prager
<linux@matthiasprager.de> wrote:

> from my experience btrfs does work as badly with SMR drives (I only had
> the opportunity to test on a 8TB Seagate device-managed drive) as ext4.
> The initial performance is fine (for a few gigabytes / minutes), but
> drops of a cliff as soon as the internal buffer-region for
> non-sequential writes fills up (even though I tested large file SMB
> transfers).

What kernel (version) did you use ?
I hope it included:
http://git.kernel.org/cgit/linux/kernel/git/mkp/linux.git/commit/?h=bugzilla-93581&id=7c4fbd50bfece00abf529bc96ac989dd2bb83ca4

so >= 4.4, as without this patch it is quite problematic, if not
impossible, to use this 8TB Seagate SMR drive with Linux without applying
other patches or changing settings/modules.
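
A quick way to check the ">= 4.4" condition on a given machine could look
like the sketch below. The function name kernel_at_least is made up for
illustration, and the check is only a proxy: a distribution kernel older
than 4.4 may carry the patch as a backport, which a version comparison
cannot see.

```shell
#!/bin/sh
# Sketch: succeed if the running (or given) kernel is at least MAJOR.MINOR.
# kernel_at_least is a hypothetical helper, not part of any tool.
# Usage: kernel_at_least 4.4            (checks "uname -r")
#        kernel_at_least 4.4 4.3.6      (checks the given version string)
kernel_at_least() {
    want=$1
    have=${2:-$(uname -r)}

    # Split "4.4" into major=4, minor=4.
    want_major=${want%%.*}
    rest=${want#*.}
    want_minor=${rest%%[!0-9]*}

    # Split e.g. "4.4.0-31-generic" into major=4, minor=4.
    have_major=${have%%.*}
    rest=${have#*.}
    have_minor=${rest%%[!0-9]*}

    [ "$have_major" -gt "$want_major" ] ||
        { [ "$have_major" -eq "$want_major" ] &&
          [ "$have_minor" -ge "$want_minor" ]; }
}
```

For example, "kernel_at_least 4.4 && echo ok" prints "ok" only on a
kernel whose version number is 4.4 or newer.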

Since this patch, I have been using the drive for cold storage
archiving, connected to a Baytrail SoC SATA port. I use bcache
(writethrough or writearound) on an 8TB GPT partition that has a LUKS
container that is Btrfs m-dup, d-single formatted and mounted with
compress=lzo,noatime,nossd. It is only powered on once a month for a
day or so, and then it mostly receives incremental snapshots or some
SSD or flash images of 10-50G.
I have more or less kept all the snapshots so far, so chunks keep being
added to previously unwritten space, which keeps writes as sequential as
possible.
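
For readers who want to reproduce the Btrfs part of such a setup, the
profile and mount options described above map onto mkfs.btrfs and mount
roughly as in this dry-run sketch. It only prints the commands. The
function name and the device and mount point in the usage line are
placeholders, and the bcache and LUKS layers are deliberately left out.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) the commands for a Btrfs file
# system with duplicated metadata and single data, mounted with the
# options mentioned in the mail. btrfs_cold_storage_cmds is a
# hypothetical helper; both arguments are placeholders.
btrfs_cold_storage_cmds() {
    dev=$1   # block device to format, e.g. an opened LUKS mapping
    mnt=$2   # mount point
    echo "mkfs.btrfs -m dup -d single $dev"
    echo "mount -o compress=lzo,noatime,nossd $dev $mnt"
}
```

Calling "btrfs_cold_storage_cmds /dev/mapper/archive /mnt/archive"
prints the two commands for review. Running the printed mkfs.btrfs line
destroys any data on the device.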

If free space were heavily fragmented, files were also heavily
fragmented and the disk were very full, adding new files or modifying
existing ones would be very slow. You then see the drive being active
for many seconds with no traffic on the SATA link. There is also
the risk that the default '/sys/block/$(kerneldevname)/device/timeout'
of 30 secs is too low, and that the kernel might reset the SATA link.
A SATA link reset still happened twice in the last half year. I haven't
really looked at the details so far, just rebooted at some point in time
later, but I will set the timeout higher, e.g. to 180, and then
see if ata errors/resets still occur. It might be FW crashes as well.
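
Raising that timeout can be sketched as below. Only the sysfs path and
the values 30 and 180 come from the text above. The function name and
the optional third argument (an alternative sysfs root, handy for trying
the function without touching a real device) are assumptions made for
illustration.

```shell
#!/bin/sh
# Sketch: raise the SCSI command timeout of one block device.
# set_scsi_timeout is a hypothetical helper, not an existing tool.
# Usage (as root): set_scsi_timeout sdb 180
set_scsi_timeout() {
    dev=$1                 # kernel device name, e.g. sdb (not /dev/sdb)
    secs=$2                # new timeout in seconds
    sysroot=${3:-/sys}     # assumption: overridable root for dry runs
    f="$sysroot/block/$dev/device/timeout"

    # Accept whole numbers only.
    case $secs in
        ''|*[!0-9]*)
            echo "timeout must be a whole number of seconds" >&2
            return 2 ;;
    esac

    if [ ! -w "$f" ]; then
        echo "cannot write $f (wrong device name, or not root?)" >&2
        return 1
    fi

    old=$(cat "$f")
    echo "$secs" > "$f" || return 1
    echo "$dev: timeout $old -> $secs"
}
```

A value written this way does not survive a reboot or a re-plug of the
drive, so it would have to be reapplied, for instance from a boot script
or a udev rule.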

> The only file system that worked really well with the 8TB Seagate SMR
> drive was f2fs. I used 'mkfs.f2fs -o 0 -a 0 -s 9 /dev/sdx' to create one
> and mounted it with noatime. -o 0 means no additional over provisioning
> (the 5% default is a lot of wasted space on an 8TB drive), -a 0 tells
> f2fs not to use separate areas on the disks at the same time (which
> performs well only on ssds, not on hdds) and finally -s 9 tells f2fs to
> lay out the file system in 1GB chunks.
> I hammered this file system for some days (via SMB and via shred-script)
> and it worked really well (performance and stability wise).

Interesting that f2fs works well, although now that I think about it a
bit, I am not so surprised that it works better than ext4.

> I am considering using SMR drives for the next upgrades in my storage
> server in the basement - the only things missing in f2fs are checksums
> and raid1 support. But in my current setup (md-raid1+ext4) I don't get
> checksums either so f2fs+smr is still on my road-map. Long term, I would
> really like to switch to btrfs with its built-in checksumming (which
> unfortunately does not work with NOCOW) and raid1. But some of the file
> systems are almost 100% filled and I'm not trusting btrfs's stability
> yet (and the manageability / handling of btrfs lags behind compared to
> say zfs).

At least this SMR drive is not advisable to use in raid setups. As a
not-so-active array it might work if you use the right timeouts and
scterc etc, but I have seen how long the wait on the SATA link can be
and that makes me realize that the stamp 'Archive Drive' done by
Seagate has a clear reason.


* Re: Status of SMR with BTRFS
@ 2016-07-17  8:26 Matthias Prager
  2016-07-17 20:10 ` Henk Slager
  0 siblings, 1 reply; 24+ messages in thread
From: Matthias Prager @ 2016-07-17  8:26 UTC (permalink / raw)
  To: linux-btrfs

Hello Hendrik,

from my experience btrfs works as badly with SMR drives (I only had
the opportunity to test on an 8TB Seagate device-managed drive) as ext4.
The initial performance is fine (for a few gigabytes / minutes), but
drops off a cliff as soon as the internal buffer-region for
non-sequential writes fills up (even though I tested large file SMB
transfers).

The only file system that worked really well with the 8TB Seagate SMR
drive was f2fs. I used 'mkfs.f2fs -o 0 -a 0 -s 9 /dev/sdx' to create one
and mounted it with noatime. -o 0 means no additional over provisioning
(the 5% default is a lot of wasted space on an 8TB drive), -a 0 tells
f2fs not to use separate areas on the disks at the same time (which
performs well only on ssds, not on hdds) and finally -s 9 tells f2fs to
lay out the file system in 1GB chunks.
I hammered this file system for some days (via SMB and via shred-script)
and it worked really well (performance and stability wise).

I am considering using SMR drives for the next upgrades in my storage
server in the basement - the only things missing in f2fs are checksums
and raid1 support. But in my current setup (md-raid1+ext4) I don't get
checksums either so f2fs+smr is still on my road-map. Long term, I would
really like to switch to btrfs with its built-in checksumming (which
unfortunately does not work with NOCOW) and raid1. But some of the file
systems are almost 100% filled and I'm not trusting btrfs's stability
yet (and the manageability / handling of btrfs lags behind compared to
say zfs).


I realize this mail sounds very negative for btrfs. I'm sorry, that was
not my intention. I'm actually a big fan of btrfs and am already running
it on my test-server, but I fear it still needs quite some time to mature.
That's why I really appreciate all the hard work of the btrfs-devs!


Kind regards
Matthias


End of thread (newest message: ~2016-07-21 15:35 UTC)

Thread overview: 24+ messages
2016-07-15 18:29 Status of SMR with BTRFS Hendrik Friedel
2016-07-15 22:15 ` Tomasz Kusmierz
2016-07-16 10:29   ` Hendrik Friedel
2016-07-17  3:09     ` Tomasz Kusmierz
2016-07-17  9:08       ` Hendrik Friedel
2016-07-17 20:48         ` Henk Slager
2016-07-18 11:22         ` Austin S. Hemmelgarn
2016-07-18 18:31           ` Hendrik Friedel
2016-07-18 18:44             ` Austin S. Hemmelgarn
2016-07-18 19:05               ` Hendrik Friedel
2016-07-18 19:30                 ` Austin S. Hemmelgarn
2016-07-18 22:29                   ` Tomasz Kusmierz
2016-07-20 19:58         ` Chris Murphy
2016-07-21 12:46           ` Austin S. Hemmelgarn
2016-07-21 13:34             ` Chris Murphy
2016-07-21 14:02               ` Andrei Borzenkov
2016-07-21 14:12               ` Austin S. Hemmelgarn
2016-07-21 14:31                 ` Chris Murphy
2016-07-21 15:35               ` Patrik Lundquist
2016-07-17  8:26 Matthias Prager
2016-07-17 20:10 ` Henk Slager
2016-07-17 21:44   ` Matthias Prager
2016-07-18 18:49     ` Jukka Larja
2016-07-21 12:22 Matthias Prager
