* [XFS SUMMIT] SSD optimised allocation policy
@ 2020-05-14 10:34 Dave Chinner
  2020-05-19  6:32 ` Darrick J. Wong
  0 siblings, 1 reply; 3+ messages in thread
From: Dave Chinner @ 2020-05-14 10:34 UTC (permalink / raw)
  To: linux-xfs


Topic:	SSD Optimised allocation policies

Scope:
	Performance
	Storage efficiency

Proposal:

Non-rotational storage is typically very fast. Our allocation
policies are all, fundamentally, based on very slow storage which
has extremely high latency between IO to different LBA regions. We
burn CPU to optimise for minimal seeks to minimise the expensive
physical movement of disk heads and platter rotation.

We know when the underlying storage is solid state - there's a
"non-rotational" field in the block device config that tells us the
storage doesn't need physical seek optimisation. We should make use
of that.
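
As a trivial illustration, that flag is visible from userspace too, via
sysfs (device name below is just an example):

/* rotational.c - report whether a block device claims to be rotational */
#include <stdio.h>

int main(int argc, char **argv)
{
	const char *dev = argc > 1 ? argv[1] : "sda";	/* example device */
	char path[256];
	FILE *f;
	int rot;

	snprintf(path, sizeof(path), "/sys/block/%s/queue/rotational", dev);
	f = fopen(path, "r");
	if (!f || fscanf(f, "%d", &rot) != 1) {
		perror(path);
		return 1;
	}
	fclose(f);
	printf("%s: %s\n", dev, rot ? "rotational" : "non-rotational");
	return 0;
}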

My proposal is that we look towards arranging the filesystem
allocation policies into CPU-optimised silos. We start by making
filesystems on SSDs with AG counts that are multiples of the CPU
count in the system (e.g. 4x the number of CPUs) to drive
parallelism at the allocation level, and then associate allocation
groups with specific CPUs in the system. Hence each CPU has a set of
allocation groups it selects from for the operations that are run
on it. Hence allocation is typically local to a specific CPU.
Optimisation proceeds from the basis of CPU locality optimisation,
not storage locality optimisation.
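
Roughly, the sort of mapping I mean (a sketch only - made-up names and
numbers, not existing XFS code):

/*
 * Sketch: agcount is sized at mkfs time as a multiple of the CPU count
 * (AGS_PER_CPU, e.g. 4x), so each CPU owns a private slice of AGs and
 * rotates through it for successive allocations.
 */
#define AGS_PER_CPU	4

static unsigned int pick_ag(unsigned int cpu, unsigned int agcount,
			    unsigned int *rotor)
{
	unsigned int base = (cpu * AGS_PER_CPU) % agcount;

	/* rotate within this CPU's private set of AGs */
	*rotor = (*rotor + 1) % AGS_PER_CPU;
	return base + *rotor;
}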

What this allows is for processes on different CPUs to never contend for
allocation resources. Locality of objects just doesn't matter for
solid state storage, so we gain nothing by trying to group inodes,
directories, their metadata and data physically close together. We
want writes that happen at the same time to be physically close
together so we aggregate them into larger IOs, but we really
don't care about optimising write locality for best read performance
(i.e. must be contiguous for sequential access) for this storage.

Further, we can look at faster allocation strategies - we don't need
to find the "nearest" if we don't have a contiguous free extent to
allocate into, we just want the one that costs the least CPU to
find. This is because solid state storage is so fast that filesystem
performance is CPU limited, not storage limited. Hence we need to
think about allocation policies differently and start optimising
them for minimum CPU expenditure rather than best layout.
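
e.g. a pure by-size first fit needs nothing more than a bisect over
free extents sorted by length - no locality scanning at all. A sketch
(not the real free space btree code):

struct free_extent {
	unsigned long	start;		/* first block of the extent */
	unsigned long	len;		/* length in blocks */
};

/*
 * Sketch: extents[] is sorted by ascending length. Return the index of
 * the first extent large enough for the request, or -1 if none fits.
 * No attempt is made to find the candidate "nearest" any target block.
 */
static int first_fit_by_size(const struct free_extent *extents, int nr,
			     unsigned long want)
{
	int lo = 0, hi = nr;

	while (lo < hi) {
		int mid = lo + (hi - lo) / 2;

		if (extents[mid].len < want)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo < nr ? lo : -1;
}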

Other things to discuss include:
	- how do we convert metadata structures to write-once style
	  behaviour rather than overwrite in place?
	- extremely large block sizes for metadata (e.g. 4MB) to
	  align better with SSD erase block sizes
	- what parts of the allocation algorithms don't we need
	- are we better off with huge numbers of small AGs rather
	  than fewer large AGs?

-- 
Dave Chinner
david@fromorbit.com


* Re: [XFS SUMMIT] SSD optimised allocation policy
  2020-05-14 10:34 [XFS SUMMIT] SSD optimised allocation policy Dave Chinner
@ 2020-05-19  6:32 ` Darrick J. Wong
  2020-05-20  1:46   ` Dave Chinner
  0 siblings, 1 reply; 3+ messages in thread
From: Darrick J. Wong @ 2020-05-19  6:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, May 14, 2020 at 08:34:54PM +1000, Dave Chinner wrote:
> 
> Topic:	SSD Optimised allocation policies
> 
> Scope:
> 	Performance
> 	Storage efficiency
> 
> Proposal:
> 
> Non-rotational storage is typically very fast. Our allocation
> policies are all, fundamentally, based on very slow storage which
> has extremely high latency between IO to different LBA regions. We
> burn CPU to optimise for minimal seeks to minimise the expensive
> physical movement of disk heads and platter rotation.
> 
> We know when the underlying storage is solid state - there's a
> "non-rotational" field in the block device config that tells us the
> storage doesn't need physical seek optimisation. We should make use
> of that.
> 
> My proposal is that we look towards arranging the filesystem
> allocation policies into CPU-optimised silos. We start by making
> filesystems on SSDs with AG counts that are multiples of the CPU
> count in the system (e.g. 4x the number of CPUs) to drive

I guess you and I have been doing this for years with seemingly few ill
effects. ;)

That said, I did encounter a wackass system with 104 CPUs, a 1.4T RAID
array of spinning disks, 229 AGs sized ~6.5GB each, and a 50M log.  The
~900 IO writers were sinking the system, so clearly some people are still
getting it wrong even with traditional storage. :(

> parallelism at the allocation level, and then associate allocation
> groups with specific CPUs in the system. Hence each CPU has a set of
> allocation groups it selects from for the operations that are run
> on it. Hence allocation is typically local to a specific CPU.
> Optimisation proceeds from the basis of CPU locality optimisation,
> not storage locality optimisation.

I wonder how hard it would be to compile a locality map for storage and
CPUs from whatever numa and bus topology information the kernel already
knows about?
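
(A fair chunk of that is already exposed through sysfs; something like
the sketch below could glue it together from userspace. I'm assuming
the device exposes a numa_node attribute under /sys/block/<dev>/device,
which isn't true for every bus, and the device name is just an
example.)

/* numa_map.c - map a block device to its NUMA node and local CPUs */
#include <stdio.h>

int main(int argc, char **argv)
{
	const char *dev = argc > 1 ? argv[1] : "nvme0n1";
	char path[256], cpulist[256];
	FILE *f;
	int node;

	/* assumed path; numa_node reads -1 if there's no affinity */
	snprintf(path, sizeof(path), "/sys/block/%s/device/numa_node", dev);
	f = fopen(path, "r");
	if (!f || fscanf(f, "%d", &node) != 1 || node < 0)
		return 1;
	fclose(f);

	snprintf(path, sizeof(path),
		 "/sys/devices/system/node/node%d/cpulist", node);
	f = fopen(path, "r");
	if (!f || !fgets(cpulist, sizeof(cpulist), f))
		return 1;
	fclose(f);

	printf("%s: node %d, cpus %s", dev, node, cpulist);
	return 0;
}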

> What this allows is for processes on different CPUs to never contend for
> allocation resources. Locality of objects just doesn't matter for
> solid state storage, so we gain nothing by trying to group inodes,
> directories, their metadata and data physically close together. We
> want writes that happen at the same time to be physically close
> together so we aggregate them into larger IOs, but we really
> don't care about optimising write locality for best read performance
> (i.e. must be contiguous for sequential access) for this storage.
> 
> Further, we can look at faster allocation strategies - we don't need
> to find the "nearest" if we don't have a contiguous free extent to
> allocate into, we just want the one that costs the least CPU to
> find. This is because solid state storage is so fast that filesystem
> performance is CPU limited, not storage limited. Hence we need to
> think about allocation policies differently and start optimising
> them for minimum CPU expenditure rather than best layout.
> 
> Other things to discuss include:
> 	- how do we convert metadata structures to write-once style
> 	  behaviour rather than overwrite in place?

(Hm?)

> 	- extremely large block sizes for metadata (e.g. 4MB) to
> 	  align better with SSD erase block sizes

If we had metadata blocks that size, I'd advocate for studying how we
could restructure the btree to log updates in the slack space and only
checkpoint lower in the tree when necessary.

> 	- what parts of the allocation algorithms don't we need

Brian reworked part of the allocator a couple of cycles ago to reduce
the long tail latency of chasing through one free space btree when the
other one would have given it a quick answer; how beneficial has that
been?  Could it be more aggressive?

(Will have to ponder allocation issues in more depth when I'm more
awake..)

> 	- are we better off with huge numbers of small AGs rather
> 	  than fewer large AGs?

There's probably some point of diminishing returns, but this seems
likely.  Has anyone studied this recently?

--D

> 
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [XFS SUMMIT] SSD optimised allocation policy
  2020-05-19  6:32 ` Darrick J. Wong
@ 2020-05-20  1:46   ` Dave Chinner
  0 siblings, 0 replies; 3+ messages in thread
From: Dave Chinner @ 2020-05-20  1:46 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Mon, May 18, 2020 at 11:32:04PM -0700, Darrick J. Wong wrote:
> On Thu, May 14, 2020 at 08:34:54PM +1000, Dave Chinner wrote:
> > 
> > Topic:	SSD Optimised allocation policies
> > 
> > Scope:
> > 	Performance
> > 	Storage efficiency
> > 
> > Proposal:
> > 
> > Non-rotational storage is typically very fast. Our allocation
> > policies are all, fundamentally, based on very slow storage which
> > has extremely high latency between IO to different LBA regions. We
> > burn CPU to optimise for minimal seeks to minimise the expensive
> > physical movement of disk heads and platter rotation.
> > 
> > We know when the underlying storage is solid state - there's a
> > "non-rotational" field in the block device config that tells us the
> > storage doesn't need physical seek optimisation. We should make use
> > of that.
> > 
> > My proposal is that we look towards arranging the filesystem
> > allocation policies into CPU-optimised silos. We start by making
> > filesystems on SSDs with AG counts that are multiples of the CPU
> > count in the system (e.g. 4x the number of CPUs) to drive
> 
> I guess you and I have been doing this for years with seemingly few ill
> effects. ;)
> 
> That said, I did encounter a wackass system with 104 CPUs, a 1.4T RAID
> array of spinning disks, 229 AGs sized ~6.5GB each, and a 50M log.  The
> ~900 IO writers were sinking the system, so clearly some people are still
> getting it wrong even with traditional storage. :(

Unfortunately, we can't prevent people from screwing things up by
not understanding what adverse impact twiddling mkfs knobs will
have. And, realistically, that's way outside the scope of this
topic...

> > parallelism at the allocation level, and then associate allocation
> > groups with specific CPUs in the system. Hence each CPU has a set of
> > allocation groups it selects from for the operations that are run
> > on it. Hence allocation is typically local to a specific CPU.
> > Optimisation proceeds from the basis of CPU locality optimisation,
> > not storage locality optimisation.
> 
> I wonder how hard it would be to compile a locality map for storage and
> CPUs from whatever numa and bus topology information the kernel already
> knows about?

We don't even need to do that. Just assigning specific AGs to
specific CPUs will give per-cpu locality. And the thing is that this
mapping can change from mount to mount without adversely impacting
performance, because storage locality is largely irrelevant to
performance.

> > What this allows is for processes on different CPUs to never contend for
> > allocation resources. Locality of objects just doesn't matter for
> > solid state storage, so we gain nothing by trying to group inodes,
> > directories, their metadata and data physically close together. We
> > want writes that happen at the same time to be physically close
> > together so we aggregate them into larger IOs, but we really
> > don't care about optimising write locality for best read performance
> > (i.e. must be contiguous for sequential access) for this storage.
> > 
> > Further, we can look at faster allocation strategies - we don't need
> > to find the "nearest" if we don't have a contiguous free extent to
> > allocate into, we just want the one that costs the least CPU to
> > find. This is because solid state storage is so fast that filesystem
> > performance is CPU limited, not storage limited. Hence we need to
> > think about allocation policies differently and start optimising
> > them for minimum CPU expenditure rather than best layout.
> > 
> > Other things to discuss include:
> > 	- how do we convert metadata structures to write-once style
> > 	  behaviour rather than overwrite in place?
> 
> (Hm?)

So we don't have to overwrite metadata in place. Overwrite in place
is not nice to SSDs because they can't overwrite in place and this
leads to write amplification in the device.

e.g. I was wondering if there were ways we could look at changing the
individual btrees to use something like a log structured merge
tree...
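
Very roughly, the shape I mean is something like this toy (nothing to
do with existing code, and run merging/compaction is omitted): deltas
pile up in memory and get written out as new immutable sorted runs, so
nothing on disk is ever overwritten in place.

#include <stdlib.h>
#include <string.h>

struct kv { unsigned long key, val; };

struct lsm_run {		/* one immutable, sorted, write-once run */
	struct kv	*recs;
	int		nr;
};

struct lsm {
	struct kv	mem[128];	/* recent in-memory deltas */
	int		nr_mem;
	struct lsm_run	*runs;		/* newest last; merging not shown */
	int		nr_runs;
};

static int kv_cmp(const void *a, const void *b)
{
	const struct kv *x = a, *y = b;

	return (x->key > y->key) - (x->key < y->key);
}

/* a full memtable is sorted and appended whole as a brand new run */
static void lsm_flush(struct lsm *t)
{
	struct lsm_run r;

	r.nr = t->nr_mem;
	r.recs = malloc(r.nr * sizeof(*r.recs));
	memcpy(r.recs, t->mem, r.nr * sizeof(*r.recs));
	qsort(r.recs, r.nr, sizeof(*r.recs), kv_cmp);

	t->runs = realloc(t->runs, (t->nr_runs + 1) * sizeof(*t->runs));
	t->runs[t->nr_runs++] = r;
	t->nr_mem = 0;
}

static void lsm_insert(struct lsm *t, unsigned long key, unsigned long val)
{
	struct kv rec = { key, val };

	t->mem[t->nr_mem++] = rec;
	if (t->nr_mem == 128)
		lsm_flush(t);
}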

> > 	- extremely large block sizes for metadata (e.g. 4MB) to
> > 	  align better with SSD erase block sizes
> 
> If we had metadata blocks that size, I'd advocate for studying how we
> could restructure the btree to log updates in the slack space and only
> checkpoint lower in the tree when necessary.

Well, I was kinda thinking that the "write once style" algorithms would
fit much better with large blocks than update-in-place, especially
with the way we log buffers right now.

Blocks that large would allow storing deltas in the block, then when
the block is full rewriting the entire block with all the deltas
applied. And that's where write-once algorithms become interesting;
we compress a portion of the tree and the new index into the
block(s) and rewrite the root pointer...
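
Something like this toy, conceptually (tiny capacities standing in for
a 4MB block, and a real version would split rather than assume the base
area always has room):

#include <stdlib.h>

#define BASE_MAX	64	/* stand-in for a 4MB block's capacity */
#define DELTA_MAX	16

struct rec { unsigned long key, val; };

struct big_block {
	struct rec	base[BASE_MAX];		/* as of the last rewrite */
	int		nr_base;
	struct rec	delta[DELTA_MAX];	/* updates since then */
	int		nr_delta;
};

/* fold the deltas into the base and return a freshly written block */
static struct big_block *rewrite_block(struct big_block *old)
{
	struct big_block *blk = malloc(sizeof(*blk));
	int i, j;

	*blk = *old;
	for (i = 0; i < blk->nr_delta; i++) {
		for (j = 0; j < blk->nr_base; j++)
			if (blk->base[j].key == blk->delta[i].key)
				break;
		if (j == blk->nr_base)
			blk->nr_base++;		/* new key (assumes room) */
		blk->base[j] = blk->delta[i];
	}
	blk->nr_delta = 0;
	free(old);		/* the old copy just becomes free space */
	return blk;
}

/* append a delta; only a full delta area forces a whole-block rewrite */
static struct big_block *block_update(struct big_block *blk, struct rec r)
{
	blk->delta[blk->nr_delta++] = r;
	if (blk->nr_delta == DELTA_MAX)
		blk = rewrite_block(blk);	/* caller repoints the parent */
	return blk;
}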

> > 	- what parts of the allocation algorithms don't we need
> 
> Brian reworked part of the allocator a couple of cycles ago to reduce
> the long tail latency of chasing through one free space btree when the
> other one would have given it a quick answer; how beneficial has that
> been?  Could it be more aggressive?

Yeah, that's what I'm getting at. All the "near/exact/size" stuff
has a lot of overhead to find the best target. For SSDs "near"
doesn't matter, so if we want a specific size we can just take the
first fit. "exact" is only used to do contiguous allocation - that
still has benefit from a filesystem perspective (fewer extents in
data files) even if it doesn't improve IO performance, so I think
there are whole reams of selection logic we can completely ignore for
SSDs...

> > 	- are we better off with huge numbers of small AGs rather
> > 	  than fewer large AGs?
> 
> There's probably some point of diminishing returns, but this seems
> likely.  Has anyone studied this recently?

Not beyond "I need 100+ AGs on this 400GB SSD for allocation
parallelism". i.e. I've typically just increased the number of AGs
until nothing blocks on AGI or AGF headers, and ignored everything
else.

We probably need to look at how to handle thousands of AGs
effectively. That is two-fold: a) managing the memory overhead of
having that many active AGs and b) how we select AGs to allocate
from.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

