Linux-BTRFS Archive on lore.kernel.org
* Q: what exactly does SSD mode still do?
@ 2020-03-26 18:16 Holger Hoffstätte
  2020-03-26 22:21 ` Hans van Kranenburg
  0 siblings, 1 reply; 5+ messages in thread
From: Holger Hoffstätte @ 2020-03-26 18:16 UTC
  To: linux-btrfs


Hi,

could someone explain what SSD mode *actually* still does? Not ssd_spread,
that's clear and unrelated. A recent commit removed the thread-offloaded
bio submission (avoiding context switches etc.) - which I thought was the
reason for SSD mode? - and looking through the code I couldn't find any
bits that helped clarify the difference.

Thanks!
Holger


* Re: Q: what exactly does SSD mode still do?
  2020-03-26 18:16 Q: what exactly does SSD mode still do? Holger Hoffstätte
@ 2020-03-26 22:21 ` Hans van Kranenburg
  2020-03-27 10:29   ` Holger Hoffstätte
  0 siblings, 1 reply; 5+ messages in thread
From: Hans van Kranenburg @ 2020-03-26 22:21 UTC
  To: Holger Hoffstätte, linux-btrfs

Hi!

On 3/26/20 7:16 PM, Holger Hoffstätte wrote:
> 
> could someone explain what SSD mode *actually* still does? Not ssd_spread,
> that's clear and unrelated. A recent commit removed the thread-offloaded
> bio submission (avoiding context switches etc.)

Can you share the commit id?

> - which I thought was the
> reason for SSD mode? - and looking through the code I couldn't find any
> bits that helped clarify the difference.

Since the change in 2017 that made the extent allocator for data in ssd
mode behave the way nossd already did, there are two differences left
between ssd and nossd:

1) This if statement in tree-log.c:

cd354ad613a39 (Chris Mason  2011-10-20 15:45:37 -0400 3042)
   /* when we're on an ssd, just kick the log commit out */
0b246afa62b0c (Jeff Mahoney 2016-06-22 18:54:23 -0400 3043)
   if (!btrfs_test_opt(fs_info, SSD) &&

2) Metadata "cluster allocator" write behavior:

*empty_cluster = SZ_64K  # nossd
*empty_cluster = SZ_2M  # ssd

This happens in extent-tree.c.

For 1) I guess this is ok if you can do "seek free writes"?

For 2) I initially wanted to research the behavioral difference further,
but when upgrading from Linux 4.9 to 4.19, the majority of the problems
with exploding extent tree metadata writes were already gone (in ssd
mode), so that research never happened. So, those two hard-coded values
are still there, without any proper recent explanation of why they should
have those particular values.
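
The choice described in 2) can be modelled in a few lines of userspace C.
This is only a sketch of the decision as described above, not the kernel
source: pick_metadata_empty_cluster() and its `ssd` parameter are names
made up for illustration, and only the two sizes (64 KiB and 2 MiB) come
from this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sizes named in the thread; the kernel spells them SZ_64K and SZ_2M. */
#define SZ_64K (64ULL * 1024)
#define SZ_2M  (2ULL * 1024 * 1024)

/*
 * Hypothetical helper (not a kernel function): returns the metadata
 * empty_cluster size for a filesystem mounted with the ssd option
 * (ssd == true) or with nossd (ssd == false).
 */
uint64_t pick_metadata_empty_cluster(bool ssd)
{
	return ssd ? SZ_2M : SZ_64K;
}
```

So under this model the only thing the mount option changes for metadata
is a 32x difference in the size of the free region the cluster allocator
looks for.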

Hans


* Re: Q: what exactly does SSD mode still do?
  2020-03-26 22:21 ` Hans van Kranenburg
@ 2020-03-27 10:29   ` Holger Hoffstätte
  2020-03-28 19:35     ` Zygo Blaxell
  0 siblings, 1 reply; 5+ messages in thread
From: Holger Hoffstätte @ 2020-03-27 10:29 UTC
  To: Hans van Kranenburg, linux-btrfs

On 3/26/20 11:21 PM, Hans van Kranenburg wrote:
> Hi!
> 
> On 3/26/20 7:16 PM, Holger Hoffstätte wrote:
>>
>> could someone explain what SSD mode *actually* still does? Not ssd_spread,
>> that's clear and unrelated. A recent commit removed the thread-offloaded
>> bio submission (avoiding context switches etc.)
> 
> Can you share the commit id?

[1] followed by [2].

>> - which I thought was the
>> reason for SSD mode? - and looking through the code I couldn't find any
>> bits that helped clarify the difference.
> 
> After the change in 2017 to change the extent allocator in ssd mode for
> data to behave like nossd already did before, there are two differences
> between ssd and nossd left:
> 
> 1) This if statement in tree-log.c:
> 
> cd354ad613a39 (Chris Mason  2011-10-20 15:45:37 -0400 3042)
>     /* when we're on an ssd, just kick the log commit out */
> 0b246afa62b0c (Jeff Mahoney 2016-06-22 18:54:23 -0400 3043)
>     if (!btrfs_test_opt(fs_info, SSD) &&

Ah yes, multi-writer batching - a common DB optimization technique.
I wonder how much of a difference that actually still makes, but
it sounds like a good idea.
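
A rough model of what that condition decides, as a userspace C sketch.
The function and parameter names are invented for illustration, and
`rest_of_condition` merely stands in for whatever follows the `&&` in the
excerpt, which is cut off there and is not reconstructed here.

```c
#include <stdbool.h>

/*
 * Hypothetical model of the check quoted from tree-log.c: per the code
 * comment, on an ssd the log commit is kicked out right away; without
 * ssd the commit may be held back when the rest of the condition holds,
 * presumably so that other writers can be batched with it.
 */
bool should_delay_log_commit(bool ssd, bool rest_of_condition)
{
	return !ssd && rest_of_condition;
}
```

With ssd set, the function never asks for a delay, whatever the second
argument is.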

> 2) Metadata "cluster allocator" write behavior:
> 
> *empty_cluster = SZ_64K  # nossd
> *empty_cluster = SZ_2M  # ssd
> 
> This happens in extent-tree.c.

2M used to be a common erase block size on SSDs. Or maybe it's just
a nice round number..  ¯\(ツ)/¯

cheers,
Holger

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=08635bae0b4ceb08fe4c156a11c83baec397d36d

[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ba8a9d07954397f0645cf62bcc1ef536e8e7ba24



* Re: Q: what exactly does SSD mode still do?
  2020-03-27 10:29   ` Holger Hoffstätte
@ 2020-03-28 19:35     ` Zygo Blaxell
  2020-03-28 21:31       ` Hans van Kranenburg
  0 siblings, 1 reply; 5+ messages in thread
From: Zygo Blaxell @ 2020-03-28 19:35 UTC
  To: Holger Hoffstätte; +Cc: Hans van Kranenburg, linux-btrfs

On Fri, Mar 27, 2020 at 11:29:52AM +0100, Holger Hoffstätte wrote:
> On 3/26/20 11:21 PM, Hans van Kranenburg wrote:
> > 2) Metadata "cluster allocator" write behavior:
> > 
> > *empty_cluster = SZ_64K  # nossd
> > *empty_cluster = SZ_2M  # ssd
> > 
> > This happens in extent-tree.c.
> 
> 2M used to be a common erase block size on SSDs. Or maybe it's just
> a nice round number..  ¯\(ツ)/¯

As a side-effect, 2M write clusters close the write hole on raid5/6 if you
have an array that is a power of 2 data disks wide.  This capability is
wasted when it's only available through the 'ssd' mount option.

The behavior could be quite useful if it was properly integrated with
the raid5/6 stuff:  set *empty_cluster = block group data width, make
sure it's aligned to raid5/6 stripe boundaries, and use it for both data
and metadata.

It works by effectively making partially-filled clusters read-only.
If we can guarantee that clusters are aligned to raid5/6 data/parity block
boundaries, then btrfs can't allocate new data in partially filled raid5/6
stripes, so it won't break the parity relation and won't have a write hole.
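
The power-of-2 remark can be checked with a little arithmetic. The sketch
below assumes a 64 KiB per-device stripe element (an assumption, not
stated in this thread) and uses made-up helper names; it only tests
whether a 2 MiB cluster covers a whole number of full data stripes.

```c
#include <stdbool.h>
#include <stdint.h>

#define STRIPE_ELEMENT (64ULL * 1024)        /* assumed per-device stripe size */
#define CLUSTER_SIZE   (2ULL * 1024 * 1024)  /* the 2M empty_cluster value */

/* Bytes of data in one full stripe across `data_disks` data devices. */
uint64_t full_stripe_data_width(unsigned int data_disks)
{
	return (uint64_t)data_disks * STRIPE_ELEMENT;
}

/* True if a 2 MiB cluster is an exact multiple of the full stripe width. */
bool cluster_covers_whole_stripes(unsigned int data_disks)
{
	uint64_t width = full_stripe_data_width(data_disks);

	return width != 0 && CLUSTER_SIZE % width == 0;
}
```

Under these assumptions, 4 or 8 data disks pass the check, while 3 or 5 do
not. Alignment of the cluster start to a stripe boundary is a separate
requirement that this sketch does not model.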

> cheers,
> Holger
> 
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=08635bae0b4ceb08fe4c156a11c83baec397d36d
> 
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ba8a9d07954397f0645cf62bcc1ef536e8e7ba24
> 


* Re: Q: what exactly does SSD mode still do?
  2020-03-28 19:35     ` Zygo Blaxell
@ 2020-03-28 21:31       ` Hans van Kranenburg
  0 siblings, 0 replies; 5+ messages in thread
From: Hans van Kranenburg @ 2020-03-28 21:31 UTC
  To: Zygo Blaxell, Holger Hoffstätte; +Cc: linux-btrfs

On 3/28/20 8:35 PM, Zygo Blaxell wrote:
> On Fri, Mar 27, 2020 at 11:29:52AM +0100, Holger Hoffstätte wrote:
>> On 3/26/20 11:21 PM, Hans van Kranenburg wrote:
>>> 2) Metadata "cluster allocator" write behavior:
>>>
>>> *empty_cluster = SZ_64K  # nossd
>>> *empty_cluster = SZ_2M  # ssd
>>>
>>> This happens in extent-tree.c.
>>
>> 2M used to be a common erase block size on SSDs. Or maybe it's just
>> a nice round number..  ¯\(ツ)/¯
> 
> As a side-effect, 2M write clusters close the write hole on raid5/6 if you
> have an array that is a power of 2 data disks wide.  This capability is
> wasted when it's only available through the 'ssd' mount option.

Search for SSD_SPREAD in free-space-cache.c. There's this cont1_bytes
which is a fallback, so you'll have to run full SSD_SPREAD mode for this
to happen IINM.

https://www.spinics.net/lists/linux-btrfs/msg70624.html for a huge braindump

While running Linux 4.9 back then, I actually had to use 'ssd_spread' for
metadata (not for data, which was possible thanks to that 'bug') to
prevent metadata writes from running around in circles while writing the
extent tree. With 4.19, I can just use 'ssd', and TBH I have no idea which
change in between got rid of that insane amount of write overhead. So, I
never continued researching the behavior of the different options
(empty_cluster, cont1_bytes combinations).

> The behavior could be quite useful if it was properly integrated with
> the raid5/6 stuff:  set *empty_cluster = block group data width, make
> sure it's aligned to raid5/6 stripe boundaries, and use it for both data
> and metadata.
> 
> It works by effectively making partially-filled clusters read-only.
> If we can guarantee that clusters are aligned to raid5/6 data/parity block
> boundaries, then btrfs can't allocate new data in partially filled raid5/6
> stripes, so it won't break the parity relation and won't have write hole.
> 
>> cheers,
>> Holger
>>
>> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=08635bae0b4ceb08fe4c156a11c83baec397d36d
>>
>> [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ba8a9d07954397f0645cf62bcc1ef536e8e7ba24
>>

K

