* agcount for 2TB, 4TB and 8TB drives
@ 2017-10-06  8:46 Gandalf Corvotempesta
  2017-10-06 15:38 ` Darrick J. Wong
  0 siblings, 1 reply; 18+ messages in thread
From: Gandalf Corvotempesta @ 2017-10-06  8:46 UTC (permalink / raw)
  To: linux-xfs

Hi to all,
I'm new to XFS.

What is the proper agcount for 2TB, 4TB and 8TB drives (not part of any RAID)?

mkfs.xfs automatically chose 4 AGs. Isn't this too low?


* Re: agcount for 2TB, 4TB and 8TB drives
  2017-10-06  8:46 agcount for 2TB, 4TB and 8TB drives Gandalf Corvotempesta
@ 2017-10-06 15:38 ` Darrick J. Wong
  2017-10-06 16:18   ` Eric Sandeen
  0 siblings, 1 reply; 18+ messages in thread
From: Darrick J. Wong @ 2017-10-06 15:38 UTC (permalink / raw)
  To: Gandalf Corvotempesta; +Cc: linux-xfs

On Fri, Oct 06, 2017 at 10:46:20AM +0200, Gandalf Corvotempesta wrote:
> Hi to all,
> I'm new to XFS.
> 
> What is the proper agcount for 2TB, 4TB and 8TB drives (not part of any RAID)?
> 
> mkfs.xfs automatically chose 4 AGs. Isn't this too low?

No.  Have a look at calc_default_ag_geometry in libxcmd/topology.c for
how we calculate the default AG count / size.  4TB single-disks and
smaller get 4 AGs; larger than that get 1AG per TB.  RAID arrays are
different.

Semirelated question: for a solid state disk on a machine with high CPU
counts do we prefer agcount == cpucount to take advantage of the
high(er) iops and lack of seek time to increase parallelism?

(Not that I've studied that in depth.)

--D

> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: agcount for 2TB, 4TB and 8TB drives
  2017-10-06 15:38 ` Darrick J. Wong
@ 2017-10-06 16:18   ` Eric Sandeen
  2017-10-06 22:20     ` Dave Chinner
  0 siblings, 1 reply; 18+ messages in thread
From: Eric Sandeen @ 2017-10-06 16:18 UTC (permalink / raw)
  To: Darrick J. Wong, Gandalf Corvotempesta; +Cc: linux-xfs

On 10/6/17 10:38 AM, Darrick J. Wong wrote:
> On Fri, Oct 06, 2017 at 10:46:20AM +0200, Gandalf Corvotempesta wrote:
>> Hi to all,
>> I'm new to XFS.
>>
>> What is the proper agcount for 2TB, 4TB and 8TB drives (not part of any RAID)?
>>
>> mkfs.xfs automatically chose 4 AGs. Isn't this too low?
> 
> No.  Have a look at calc_default_ag_geometry in libxcmd/topology.c for
> how we calculate the default AG count / size.  4TB single-disks and
> smaller get 4 AGs; larger than that get 1AG per TB.  RAID arrays are
> different.

Right; max AG size is 1T (for a default mkfs):

        /*
         * For a single underlying storage device over 4TB in size
         * use the maximum AG size.  Between 128MB and 4TB, just use
         * 4 AGs and scale up smoothly between min/max AG sizes.
         */

But if there is a stripe unit, it goes into multi-disk mode, assumes
you have more parallelism than a single spindle, and makes more AGs.

        /*
         * For the multidisk configs we choose an AG count based on the number
         * of data blocks available, trying to keep the number of AGs higher
         * than the single disk configurations. This makes the assumption that
         * larger filesystems have more parallelism available to them.
         */

If you have a single 8T disk with only a handful of heads, you won't
benefit from more AGs.
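
To make that concrete, the single-disk rule above works out to roughly
the following (a simplified sketch for illustration only, not the actual
calc_default_ag_geometry() code, and it ignores the very small
filesystem and multidisk cases):

#include <stdint.h>

#define TERABYTES(x)    ((uint64_t)(x) << 40)

/*
 * Sketch of the default AG count for a single-device filesystem:
 * up to 4TB you get 4 AGs; above that the AG size is capped at 1TB,
 * i.e. roughly one AG per TB of capacity (rounded up).
 */
static uint64_t default_agcount_single_disk(uint64_t dev_size_bytes)
{
        if (dev_size_bytes <= TERABYTES(4))
                return 4;
        return (dev_size_bytes + TERABYTES(1) - 1) / TERABYTES(1);
}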

> Semirelated question: for a solid state disk on a machine with high CPU
> counts do we prefer agcount == cpucount to take advantage of the
> high(er) iops and lack of seek time to increase parallelism?
> 
> (Not that I've studied that in depth.)

Interesting question.  :)  Maybe harder to answer for SSD black boxes?

-Eric


* Re: agcount for 2TB, 4TB and 8TB drives
  2017-10-06 16:18   ` Eric Sandeen
@ 2017-10-06 22:20     ` Dave Chinner
  2017-10-06 22:21       ` Eric Sandeen
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2017-10-06 22:20 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Darrick J. Wong, Gandalf Corvotempesta, linux-xfs

On Fri, Oct 06, 2017 at 11:18:39AM -0500, Eric Sandeen wrote:
> On 10/6/17 10:38 AM, Darrick J. Wong wrote:
> > On Fri, Oct 06, 2017 at 10:46:20AM +0200, Gandalf Corvotempesta wrote:
> > Semirelated question: for a solid state disk on a machine with high CPU
> > counts do we prefer agcount == cpucount to take advantage of the
> > high(er) iops and lack of seek time to increase parallelism?
> > 
> > (Not that I've studied that in depth.)
> 
> Interesting question.  :)  Maybe harder to answer for SSD black boxes?

Easy: switch to multidisk mode if /sys/block/<dev>/queue/rotational
is zero after doing all the other checks. Then SSDs will get larger
AG counts automatically.
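
For reference, the check described above is just a sysfs read; in
userspace it would look something like this (a sketch, assuming the
whole-disk name such as "sda" or "nvme0n1" is already known):

#include <stdio.h>

/*
 * Return 1 if the device reports itself as non-rotational (SSD),
 * 0 if rotational, -1 if the sysfs attribute could not be read.
 */
static int device_is_nonrotational(const char *dev)
{
        char path[256];
        FILE *f;
        int rot = -1;

        snprintf(path, sizeof(path), "/sys/block/%s/queue/rotational", dev);
        f = fopen(path, "r");
        if (!f)
                return -1;
        if (fscanf(f, "%d", &rot) != 1)
                rot = -1;
        fclose(f);
        return rot < 0 ? -1 : rot == 0;
}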

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: agcount for 2TB, 4TB and 8TB drives
  2017-10-06 22:20     ` Dave Chinner
@ 2017-10-06 22:21       ` Eric Sandeen
  2017-10-09  8:05         ` Avi Kivity
  0 siblings, 1 reply; 18+ messages in thread
From: Eric Sandeen @ 2017-10-06 22:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, Gandalf Corvotempesta, linux-xfs



On 10/6/17 5:20 PM, Dave Chinner wrote:
> On Fri, Oct 06, 2017 at 11:18:39AM -0500, Eric Sandeen wrote:
>> On 10/6/17 10:38 AM, Darrick J. Wong wrote:
>>> On Fri, Oct 06, 2017 at 10:46:20AM +0200, Gandalf Corvotempesta wrote:
>>> Semirelated question: for a solid state disk on a machine with high CPU
>>> counts do we prefer agcount == cpucount to take advantage of the
>>> high(er) iops and lack of seek time to increase parallelism?
>>>
>>> (Not that I've studied that in depth.)
>>
>> Interesting question.  :)  Maybe harder to answer for SSD black boxes?
> 
> Easy: switch to multidisk mode if /sys/block/<dev>/queue/rotational
> is zero after doing all the other checks. Then SSDs will get larger
> AG counts automatically.

The "hard part" was knowing just how much parallelism is actually inside
the black box.  But "multidisk mode" doesn't go too overboard, so yeah
that's probably fine.

-Eric
 
> Cheers,
> 
> Dave.
> 


* Re: agcount for 2TB, 4TB and 8TB drives
  2017-10-06 22:21       ` Eric Sandeen
@ 2017-10-09  8:05         ` Avi Kivity
  2017-10-09 11:23           ` Dave Chinner
  0 siblings, 1 reply; 18+ messages in thread
From: Avi Kivity @ 2017-10-09  8:05 UTC (permalink / raw)
  To: Eric Sandeen, Dave Chinner
  Cc: Darrick J. Wong, Gandalf Corvotempesta, linux-xfs



On 10/07/2017 01:21 AM, Eric Sandeen wrote:
>
> On 10/6/17 5:20 PM, Dave Chinner wrote:
>> On Fri, Oct 06, 2017 at 11:18:39AM -0500, Eric Sandeen wrote:
>>> On 10/6/17 10:38 AM, Darrick J. Wong wrote:
>>>> On Fri, Oct 06, 2017 at 10:46:20AM +0200, Gandalf Corvotempesta wrote:
>>>> Semirelated question: for a solid state disk on a machine with high CPU
>>>> counts do we prefer agcount == cpucount to take advantage of the
>>>> high(er) iops and lack of seek time to increase parallelism?
>>>>
>>>> (Not that I've studied that in depth.)
>>> Interesting question.  :)  Maybe harder to answer for SSD black boxes?
>> Easy: switch to multidisk mode if /sys/block/<dev>/queue/rotational
>> is zero after doing all the other checks. Then SSDs will get larger
>> AG counts automatically.
> The "hard part" was knowing just how much parallelism is actually inside
> the black box.

It's often > 100.

>    But "multidisk mode" doesn't go too overboard, so yeah
> that's probably fine.
>


Is there a penalty associated with having too many allocation groups?


* Re: agcount for 2TB, 4TB and 8TB drives
  2017-10-09  8:05         ` Avi Kivity
@ 2017-10-09 11:23           ` Dave Chinner
  2017-10-09 15:46             ` Avi Kivity
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2017-10-09 11:23 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Eric Sandeen, Darrick J. Wong, Gandalf Corvotempesta, linux-xfs

On Mon, Oct 09, 2017 at 11:05:56AM +0300, Avi Kivity wrote:
> On 10/07/2017 01:21 AM, Eric Sandeen wrote:
> >On 10/6/17 5:20 PM, Dave Chinner wrote:
> >>On Fri, Oct 06, 2017 at 11:18:39AM -0500, Eric Sandeen wrote:
> >>>On 10/6/17 10:38 AM, Darrick J. Wong wrote:
> >>>>On Fri, Oct 06, 2017 at 10:46:20AM +0200, Gandalf Corvotempesta wrote:
> >>>>Semirelated question: for a solid state disk on a machine with high CPU
> >>>>counts do we prefer agcount == cpucount to take advantage of the
> >>>>high(er) iops and lack of seek time to increase parallelism?
> >>>>
> >>>>(Not that I've studied that in depth.)
> >>>Interesting question.  :)  Maybe harder to answer for SSD black boxes?
> >>Easy: switch to multidisk mode if /sys/block/<dev>/queue/rotational
> >>is zero after doing all the other checks. Then SSDs will get larger
> >>AG counts automatically.
> >The "hard part" was knowing just how much parallelism is actually inside
> >the black box.
> 
> It's often > 100.

Sure, that might be the IO concurrency the SSD sees and handles, but
you very rarely require that much allocation parallelism in the
workload. Only a small amount of the IO submission path is actually
allocation work, so a single AG can provide plenty of async IO
parallelism before an AG is the limiting factor.

i.e. A single AG can typically support tens of thousands of free
space manipulations per second before the AG locks become the
bottleneck. Hence by the time you get to 16 AGs there's concurrency
available for (runs a concurrent workload and measures) at least
350,000 allocation transactions per second on relatively slow 5 year
old 8-core server CPUs. And that's CPU bound (16 CPUs all at >95%),
so faster, more recent CPUs will run much higher numbers.

IOws, don't confuse allocation concurrency with IO concurrency or
application concurrency. It's not the same thing and it is rarely a
limiting factor for most workloads, even the most IO intensive
ones...

> >   But "multidisk mode" doesn't go too overboard, so yeah
> >that's probably fine.
> 
> Is there a penalty associated with having too many allocation groups?

Yes. You break up the large contiguous free spaces into many smaller
free spaces and so can induce premature onset of filesystem aging
related performance degradations. And for spinning disks, more than
4-8AGs per spindle causes excessive seeks in mixed workloads and
degrades performance that way....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: agcount for 2TB, 4TB and 8TB drives
  2017-10-09 11:23           ` Dave Chinner
@ 2017-10-09 15:46             ` Avi Kivity
  2017-10-09 22:03               ` Dave Chinner
  0 siblings, 1 reply; 18+ messages in thread
From: Avi Kivity @ 2017-10-09 15:46 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Eric Sandeen, Darrick J. Wong, Gandalf Corvotempesta, linux-xfs



On 10/09/2017 02:23 PM, Dave Chinner wrote:
> On Mon, Oct 09, 2017 at 11:05:56AM +0300, Avi Kivity wrote:
>> On 10/07/2017 01:21 AM, Eric Sandeen wrote:
>>> On 10/6/17 5:20 PM, Dave Chinner wrote:
>>>> On Fri, Oct 06, 2017 at 11:18:39AM -0500, Eric Sandeen wrote:
>>>>> On 10/6/17 10:38 AM, Darrick J. Wong wrote:
>>>>>> On Fri, Oct 06, 2017 at 10:46:20AM +0200, Gandalf Corvotempesta wrote:
>>>>>> Semirelated question: for a solid state disk on a machine with high CPU
>>>>>> counts do we prefer agcount == cpucount to take advantage of the
>>>>>> high(er) iops and lack of seek time to increase parallelism?
>>>>>>
>>>>>> (Not that I've studied that in depth.)
>>>>> Interesting question.  :)  Maybe harder to answer for SSD black boxes?
>>>> Easy: switch to multidisk mode if /sys/block/<dev>/queue/rotational
>>>> is zero after doing all the other checks. Then SSDs will get larger
>>>> AG counts automatically.
>>> The "hard part" was knowing just how much parallelism is actually inside
>>> the black box.
>> It's often > 100.
> Sure, that might be the IO concurrency the SSD sees and handles, but
> you very rarely require that much allocation parallelism in the
> workload. Only a small amount of the IO submission path is actually
> allocation work, so a single AG can provide plenty of async IO
> parallelism before an AG is the limiting factor.

Sure. Can a single AG issue multiple I/Os, or is it single-threaded?

I understand that XFS_XFLAG_EXTSIZE and XFS_IOC_FSSETXATTR can reduce 
the AG's load. Is there a downside? for example, when I truncate + close 
the file, will the preallocated data still remain allocated? Do I need 
to return it with an fallocate()?
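
For context, setting such an extent size hint looks roughly like the
sketch below (it uses the FSGETXATTR/FSSETXATTR ioctls from the xfsprogs
headers; recent kernels expose the same interface generically as
FS_IOC_FSSETXATTR/FS_XFLAG_EXTSIZE in <linux/fs.h>):

#include <sys/ioctl.h>
#include <xfs/xfs.h>    /* struct fsxattr, XFS_IOC_FSGETXATTR, ... */

/* Set an extent size hint (e.g. 32MB) on an already-open file. */
static int set_extsize_hint(int fd, unsigned int extsize_bytes)
{
        struct fsxattr fsx;

        if (ioctl(fd, XFS_IOC_FSGETXATTR, &fsx) < 0)
                return -1;
        fsx.fsx_xflags |= XFS_XFLAG_EXTSIZE;
        fsx.fsx_extsize = extsize_bytes;        /* e.g. 32 << 20 */
        return ioctl(fd, XFS_IOC_FSSETXATTR, &fsx);
}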

>
> i.e. A single AG can typically support tens of thousands of free
> space manipulations per second before the AG locks become the
> bottleneck. Hence by the time you get to 16 AGs there's concurrency
> available for (runs a concurrent workload and measures) at least
> 350,000 allocation transactions per second on relatively slow 5 year
> old 8-core server CPUs. And that's CPU bound (16 CPUs all at >95%),
> so faster, more recent CPUs will run much higher numbers.
>
> IOws, don't confuse allocation concurrency with IO concurrency or
> application concurrency. It's not the same thing and it is rarely a
> limiting factor for most workloads, even the most IO intensive
> ones...

In my load, the allocation load is not very high, but the impact of 
iowait is. So if I can reduce the chance of io_submit() blocking because 
of AG contention, then I'm happy to increase the number of AGs even if 
it hurts other things.

>
>>>    But "multidisk mode" doesn't go too overboard, so yeah
>>> that's probably fine.
>> Is there a penalty associated with having too many allocation groups?
> Yes. You break up the large contiguous free spaces into many smaller
> free spaces and so can induce premature onset of filesystem aging
> related performance degradations. And for spinning disks, more than
> 4-8AGs per spindle causes excessive seeks in mixed workloads and
> degrades performance that way....

For an SSD, would an AG per 10GB be reasonable? per 100GB?

Machines with 60-100 logical cores and low-tens of terabytes of SSD are 
becoming common.  How many AGs would work for such a machine? Again the 
allocation load is not very high (allocating a few GB/s with 32MB hints, 
so < 100 allocs/sec), but the penalty for contention is pretty high.

Thanks for the info!

> Cheers,
>
> Dave.



* Re: agcount for 2TB, 4TB and 8TB drives
  2017-10-09 15:46             ` Avi Kivity
@ 2017-10-09 22:03               ` Dave Chinner
  2017-10-10  9:07                 ` Avi Kivity
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2017-10-09 22:03 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Eric Sandeen, Darrick J. Wong, Gandalf Corvotempesta, linux-xfs

On Mon, Oct 09, 2017 at 06:46:41PM +0300, Avi Kivity wrote:
> 
> 
> On 10/09/2017 02:23 PM, Dave Chinner wrote:
> >On Mon, Oct 09, 2017 at 11:05:56AM +0300, Avi Kivity wrote:
> >>On 10/07/2017 01:21 AM, Eric Sandeen wrote:
> >>>On 10/6/17 5:20 PM, Dave Chinner wrote:
> >>>>On Fri, Oct 06, 2017 at 11:18:39AM -0500, Eric Sandeen wrote:
> >>>>>On 10/6/17 10:38 AM, Darrick J. Wong wrote:
> >>>>>>On Fri, Oct 06, 2017 at 10:46:20AM +0200, Gandalf Corvotempesta wrote:
> >>>>>>Semirelated question: for a solid state disk on a machine with high CPU
> >>>>>>counts do we prefer agcount == cpucount to take advantage of the
> >>>>>>high(er) iops and lack of seek time to increase parallelism?
> >>>>>>
> >>>>>>(Not that I've studied that in depth.)
> >>>>>Interesting question.  :)  Maybe harder to answer for SSD black boxes?
> >>>>Easy: switch to multidisk mode if /sys/block/<dev>/queue/rotational
> >>>>is zero after doing all the other checks. Then SSDs will get larger
> >>>>AG counts automatically.
> >>>The "hard part" was knowing just how much parallelism is actually inside
> >>>the black box.
> >>It's often > 100.
> >Sure, that might be the IO concurrency the SSD sees and handles, but
> >you very rarely require that much allocation parallelism in the
> >workload. Only a small amount of the IO submission path is actually
> >allocation work, so a single AG can provide plenty of async IO
> >parallelism before an AG is the limiting factor.
> 
> Sure. Can a single AG issue multiple I/Os, or is it single-threaded?

AGs don't issue IO. Applications issue IO, the filesystem allocates
space from AGs according to the write IO that passes through it.

i.e. when you don't do allocation in the write IO path or you are
doing read IOs, then the number of AGs is /completely irrelevant/.
In those cases a single AG can "support" the entire IO load your
application and storage subsystem can handle.

The only time an AG lock is taken in the IO path is during extent
allocation (i.e. writes). And, as I've already said, a single AG can
easily handle tens of thousands of allocation transactions a second
before it becomes a bottleneck.

IOWs, the worst case is that you'll get tens of thousands of IOs per
second through an AG.

> I understand that XFS_XFLAG_EXTSIZE and XFS_IOC_FSSETXATTR can
> reduce the AG's load.

Not really. They change the allocation pattern on the inode. This
changes how the inode data is laid out on disk, but it doesn't
necessarily change the allocation overhead of the write IO path.
That's all dependent on what the application IO patterns are and how
they match the extent size hints.

In general, nobody ever notices what the "load" on an AG is and
that's because almost no-one ever drives AGs to their limits.  The
mkfs defaults and the allocation policies keep the load distributed
across the filesystem and so storage subsystems almost always run
out of IO and/or seek capability before the filesystem runs out of
allocation concurrency. And, in general, most machines run out of
CPU power before they drive enough concurrency and load through the
filesystem that it starts contending on internal locks.

Sure, I have plenty of artificial workloads that drive this sort
contention, but no-one has a production workload that requires those
sorts of behaviours or creates the same level of lock contention
that these artificial workloads drive.

> Is there a downside? for example, when I
> truncate + close the file, will the preallocated data still remain
> allocated? Do I need to return it with an fallocate()?

No. Yes.

> >space manipulations per second before the AG locks become the
> >bottleneck. Hence by the time you get to 16 AGs there's concurrency
> >available for (runs a concurrent workload and measures) at least
> >350,000 allocation transactions per second on relatively slow 5 year
> >old 8-core server CPUs. And that's CPU bound (16 CPUs all at >95%),
> >so faster, more recent CPUs will run much higher numbers.
> >
> >IOws, don't confuse allocation concurrency with IO concurrency or
> >application concurrency. It's not the same thing and it is rarely a
> >limiting factor for most workloads, even the most IO intensive
> >ones...
> 
> In my load, the allocation load is not very high, but the impact of
> iowait is. So if I can reduce the chance of io_submit() blocking
> because of AG contention, then I'm happy to increase the number of
> AGs even if it hurts other things.

That's what RWF_NOWAIT is for. It pushes any write IO that requires
allocation into a thread rather than possibly blocking the submitting
thread on any lock or IO in the allocation path.
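
The submission-side pattern is then roughly the sketch below (shown with
the synchronous pwritev2() form of the flag; the same RWF_NOWAIT value
can be set in an AIO iocb's aio_rw_flags field on kernels that support
it, and EAGAIN means "would have blocked, resubmit from a thread that is
allowed to block"):

#define _GNU_SOURCE
#include <errno.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Try a write without blocking; returns bytes written, -1 on error,
 * or 0 when the kernel reports EAGAIN and the caller should hand the
 * IO off to a worker thread.
 */
static ssize_t try_nowait_write(int fd, const void *buf, size_t len,
                                off_t off)
{
        struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
        ssize_t ret = pwritev2(fd, &iov, 1, off, RWF_NOWAIT);

        if (ret < 0 && errno == EAGAIN)
                return 0;
        return ret;
}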

> >>>   But "multidisk mode" doesn't go too overboard, so yeah
> >>>that's probably fine.
> >>Is there a penalty associated with having too many allocation groups?
> >Yes. You break up the large contiguous free spaces into many smaller
> >free spaces and so can induce premature onset of filesystem aging
> >related performance degradations. And for spinning disks, more than
> >4-8AGs per spindle causes excessive seeks in mixed workloads and
> >degrades performance that way....
> 
> For an SSD, would an AG per 10GB be reasonable? per 100GB?

No. Maybe.

Like I said, we can use the multi-disk mode in mkfs for this - it
already selects an appropriate number of AGs according to the size
of the filesystem.

> Machines with 60-100 logical cores and low-tens of terabytes of SSD
> are becoming common.  How many AGs would work for such a machine?

Multidisk default, which will be 32 AGs for anything in the 1->32TB
range. And over 32TB, you get 1 AG per TB...

> Again the allocation load is not very high (allocating a few GB/s
> with 32MB hints, so < 100 allocs/sec), but the penalty for
> contention is pretty high.

I think you're worrying about a non-problem. Use RWF_NOWAIT for your
AIO, and most of your existing IO submission blocking problems will
go away.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: agcount for 2TB, 4TB and 8TB drives
  2017-10-09 22:03               ` Dave Chinner
@ 2017-10-10  9:07                 ` Avi Kivity
  2017-10-10 22:55                   ` Dave Chinner
  0 siblings, 1 reply; 18+ messages in thread
From: Avi Kivity @ 2017-10-10  9:07 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Eric Sandeen, Darrick J. Wong, Gandalf Corvotempesta, linux-xfs



On 10/10/2017 01:03 AM, Dave Chinner wrote:
> On Mon, Oct 09, 2017 at 06:46:41PM +0300, Avi Kivity wrote:
>>
>> On 10/09/2017 02:23 PM, Dave Chinner wrote:
>>> On Mon, Oct 09, 2017 at 11:05:56AM +0300, Avi Kivity wrote:
>>>> On 10/07/2017 01:21 AM, Eric Sandeen wrote:
>>>>> On 10/6/17 5:20 PM, Dave Chinner wrote:
>>>>>> On Fri, Oct 06, 2017 at 11:18:39AM -0500, Eric Sandeen wrote:
>>>>>>> On 10/6/17 10:38 AM, Darrick J. Wong wrote:
>>>>>>>> On Fri, Oct 06, 2017 at 10:46:20AM +0200, Gandalf Corvotempesta wrote:
>>>>>>>> Semirelated question: for a solid state disk on a machine with high CPU
>>>>>>>> counts do we prefer agcount == cpucount to take advantage of the
>>>>>>>> high(er) iops and lack of seek time to increase parallelism?
>>>>>>>>
>>>>>>>> (Not that I've studied that in depth.)
>>>>>>> Interesting question.  :)  Maybe harder to answer for SSD black boxes?
>>>>>> Easy: switch to multidisk mode if /sys/block/<dev>/queue/rotational
>>>>>> is zero after doing all the other checks. Then SSDs will get larger
>>>>>> AG counts automatically.
>>>>> The "hard part" was knowing just how much parallelism is actually inside
>>>>> the black box.
>>>> It's often > 100.
>>> Sure, that might be the IO concurrency the SSD sees and handles, but
>>> you very rarely require that much allocation parallelism in the
>>> workload. Only a small amount of the IO submission path is actually
>>> allocation work, so a single AG can provide plenty of async IO
>>> parallelism before an AG is the limiting factor.
>> Sure. Can a single AG issue multiple I/Os, or is it single-threaded?
> AGs don't issue IO. Applications issue IO, the filesystem allocates
> space from AGs according to the write IO that passes through it.

What I meant was I/O in order to satisfy an allocation (read from the 
free extent btree or whatever), not the application's I/O.

>
> i.e. when you don't do allocation in the write IO path or you are
> doing read IOs, then the number of AGs is /completely irrelevant/.
> In those cases a single AG can "support" the entire IO load your
> application and storage subsystem can handle.
>
> The only time an AG lock is taken in the IO path is during extent
> allocation (i.e. writes). And, as I've already said, a single AG can
> easily handle tens of thousands of allocation transactions a second
> before it becomes a bottleneck.

Well, my own workload has at most a hundred allocations per second (32MB 
hints, 3GB/s writes)*, so I'm asking more to increase my understanding 
of XFS. But for me locks become a problem a lot sooner than they become 
a bottleneck, because I am using AIO and blocking in io_submit() 
destroys performance for me.

*Below I see that this may be wrong, so perhaps I have about 23k allocs/sec 
(128k buffers, 3GB/s writes).

>
> IOWs, the worst case is that you'll get tens of thousands of IOs per
> second through an AG.

For me, the worst case is worse. If io_submit() blocks, then there is
nothing to utilize the processor core, and thus nothing to generate more
I/Os that could have utilized the disk (for example, reads that don't
need that lock). My use case is much more sensitive to lock contention.


Does the new RWF_NOWAIT goodness extend to AG locks? In that case I'll 
punt the io_submit to a worker thread that can block.

Ah, below you say it does.

>
>> I understand that XFS_XFLAG_EXTSIZE and XFS_IOC_FSSETXATTR can
>> reduce the AG's load.
> Not really. They change the allocation pattern on the inode. This
> changes how the inode data is laid out on disk, but it doesn't
> necessarily change the allocation overhead of the write IO path.
> That's all dependent on what the application IO patterns are and how
> they match the extent size hints.

I write 128k naturally-aligned writes using aio, so I expect it will 
match. Will every write go into the AG allocator, or just writes that 
cross a 32MB boundary?


>
> In general, nobody ever notices what the "load" on an AG is and
> that's because almost no-one ever drives AGs to their limits.  The
> mkfs defaults and the allocation policies keep the load distributed
> across the filesystem and so storage subsystems almost always run
> out of IO and/or seek capability before the filesystem runs out of
> allocation concurrency. And, in general, most machines run out of
> CPU power before they drive enough concurrency and load through the
> filesystem that it starts contending on internal locks.
>
> Sure, I have plenty of artificial workloads that drive this sort
> contention, but no-one has a production workload that requires those
> sorts of behaviours or creates the same level of lock contention
> that these artificial workloads drive.

I've certainly seen lock contention in XFS; there was a recent thread 
(started by Tomasz) where a filesystem that was close to full was almost 
completely degraded for us.

Again we are more sensitive to contention than other workloads, because 
contention for us doesn't just block the work downstream from lock 
acquisition, it blocks all other work on that core for the duration.

>
>> Is there a downside? for example, when I
>> truncate + close the file, will the preallocated data still remain
>> allocated? Do I need to return it with an fallocate()?
> No. Yes.

Thanks. Most of my files are much larger, so the waste isn't too high, 
but it's still waste.

>
>>> space manipulations per second before the AG locks become the
>>> bottleneck. Hence by the time you get to 16 AGs there's concurrency
>>> available for (runs a concurrent workload and measures) at least
>>> 350,000 allocation transactions per second on relatively slow 5 year
>>> old 8-core server CPUs. And that's CPU bound (16 CPUs all at >95%),
>>> so faster, more recent CPUs will run much higher numbers.
>>>
>>> IOws, don't confuse allocation concurrency with IO concurrency or
>>> application concurrency. It's not the same thing and it is rarely a
>>> limiting factor for most workloads, even the most IO intensive
>>> ones...
>> In my load, the allocation load is not very high, but the impact of
>> iowait is. So if I can reduce the chance of io_submit() blocking
>> because of AG contention, then I'm happy to increase the number of
>> AGs even if it hurts other things.
> That's what RWF_NOWAIT is for. It pushes any write IO that requires
> allocation into a thread rather than possibly blocking the submitting
> thread on any lock or IO in the allocation path.

Excellent, we'll use that, although it will be years before our users 
see the benefit.

>>>>>    But "multidisk mode" doesn't go too overboard, so yeah
>>>>> that's probably fine.
>>>> Is there a penalty associated with having too many allocation groups?
>>> Yes. You break up the large contiguous free spaces into many smaller
>>> free spaces and so can induce premature onset of filesystem aging
>>> related performance degradations. And for spinning disks, more than
>>> 4-8AGs per spindle causes excessive seeks in mixed workloads and
>>> degrades performance that way....
>> For an SSD, would an AG per 10GB be reasonable? per 100GB?
> No. Maybe.
>
> Like I said, we can use the multi-disk mode in mkfs for this - it
> already selects an appropriate number of AGs according to the size
> of the filesystem.
>
>> Machines with 60-100 logical cores and low-tens of terabytes of SSD
>> are becoming common.  How many AGs would work for such a machine?
> Multidisk default, which will be 32 AGs for anything in the 1->32TB
> range. And over 32TB, you get 1 AG per TB...


Ok. Then doubling it so that each logical core has an AG wouldn't be 
such a big change.

>
>> Again the allocation load is not very high (allocating a few GB/s
>> with 32MB hints, so < 100 allocs/sec), but the penalty for
>> contention is pretty high.
> I think you're worrying about a non-problem. Use RWF_NOWAIT for your
> AIO, and most of your existing IO submission blocking problems will
> go away.
>

We'll start using RWF_NOWAIT, but many of our users are on a 3.10 
derivative kernel and won't install 4.14-rc6 on their production 
clusters. If a mkfs tweak can help them, then I'll happily do it.

I don't have direct proof that too few AGs are causing problems for me, 
but I've seen many traces showing XFS blocking, and like I said, it's a 
disaster for us. Unfortunately these problems are hard to reproduce and 
are expensive to test.




* Re: agcount for 2TB, 4TB and 8TB drives
  2017-10-10  9:07                 ` Avi Kivity
@ 2017-10-10 22:55                   ` Dave Chinner
  2017-10-13  8:13                     ` Avi Kivity
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2017-10-10 22:55 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Eric Sandeen, Darrick J. Wong, Gandalf Corvotempesta, linux-xfs

On Tue, Oct 10, 2017 at 12:07:42PM +0300, Avi Kivity wrote:
> On 10/10/2017 01:03 AM, Dave Chinner wrote:
> >>On 10/09/2017 02:23 PM, Dave Chinner wrote:
> >>>On Mon, Oct 09, 2017 at 11:05:56AM +0300, Avi Kivity wrote:
> >>>Sure, that might be the IO concurrency the SSD sees and handles, but
> >>>you very rarely require that much allocation parallelism in the
> >>>workload. Only a small amount of the IO submission path is actually
> >>>allocation work, so a single AG can provide plenty of async IO
> >>>parallelism before an AG is the limiting factor.
> >>Sure. Can a single AG issue multiple I/Os, or is it single-threaded?
> >AGs don't issue IO. Applications issue IO, the filesystem allocates
> >space from AGs according to the write IO that passes through it.
> 
> What I meant was I/O in order to satisfy an allocation (read from
> the free extent btree or whatever), not the application's I/O.

Once you're in the per-AG allocator context, it is single threaded
until the allocation is complete. We do things like btree block
readahead to minimise IO wait times, but we can't completely hide
things like metadata read Io wait time when it is required to make
progress.

> >>I understand that XFS_XFLAG_EXTSIZE and XFS_IOC_FSSETXATTR can
> >>reduce the AG's load.
> >Not really. They change the allocation pattern on the inode. This
> >changes how the inode data is laid out on disk, but it doesn't
> >necessarily change the allocation overhead of the write IO path.
> >That's all dependent on what the application IO patterns are and how
> >they match the extent size hints.
> 
> I write 128k naturally-aligned writes using aio, so I expect it will
> match. Will every write go into the AG allocator, or just writes
> that cross a 32MB boundary?

It enters an allocation only when an allocation is required. i.e.
only when the write lands in a hole. If you're doing sequential 128k
writes and using 32MB extent size hints, then it only allocates once
every 32768/128 = 256 writes. If you are doing random IO into a
sparse file, then all bets are off.

> >That's what RWF_NOWAIT is for. It pushes any write IO that requires
> >allocation into a thread rather possibly blocking the submitting
> >thread on any lock or IO in the allocation path.
> 
> Excellent, we'll use that, although it will be years before our
> users see the benefit.

Well, that's really in your control, not mine.

The disconnect between upstream progress and LTS production
systems is not something upstream can do anything about. Often the
problems LTS production systems see are already solved upstream and
so the only answer we can really give you here is "upgrade, backport
features your customers need yourself, or pay someone else to
maintain a backport with the features you need".

> >>Machines with 60-100 logical cores and low-tens of terabytes of SSD
> >>are becoming common.  How many AGs would work for such a machine?
> >Multidisk default, which will be 32 AGs for anything in the 1->32TB
> >range. And over 32TB, you get 1 AG per TB...
> 
> 
> Ok. Then doubling it so that each logical core has an AG wouldn't be
> such a big change.

But it won't make any difference to your workload because there's no
relationship between CPU cores and the AG selected for allocation.
The AG selection is based on filesystem relationships (e.g. local to
parent directory inode), and so if you have two files in the same
directory they will start trying to allocate from the same AG even
thought hey get written from different cores concurrently. The only
time they'll get moved into different AGs is if there is allocation
contention.

Yes, the allocator algorithms detect AG contention internally and
switch to uncontended AGs rather than blocking. There's /lots/ of
stuff inside the allocators to minimise blocking - that's one of the
reasons you see less submission blocking problems on XFS than other
filesytsems. If you're not getting threads blocking waiting to get
AGF locks, then you most certainly don't have allocator contention.
Even if you do have threads blocking on AGF locks, that could simply
be a sign you are running too close to ENOSPC, not contention...

The reality is, however, that even an uncontended AG can block if
the necessary metadata isn't in memory, or the log is full, or
memory cannot be immediately allocated, etc. RWF_NOWAIT avoids the
whole class of "allocator can block" problem...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: agcount for 2TB, 4TB and 8TB drives
  2017-10-10 22:55                   ` Dave Chinner
@ 2017-10-13  8:13                     ` Avi Kivity
  2017-10-14 22:42                       ` Dave Chinner
  0 siblings, 1 reply; 18+ messages in thread
From: Avi Kivity @ 2017-10-13  8:13 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Eric Sandeen, Darrick J. Wong, Gandalf Corvotempesta, linux-xfs

On 10/11/2017 01:55 AM, Dave Chinner wrote:
> On Tue, Oct 10, 2017 at 12:07:42PM +0300, Avi Kivity wrote:
>> On 10/10/2017 01:03 AM, Dave Chinner wrote:
>>>> On 10/09/2017 02:23 PM, Dave Chinner wrote:
>>>>> On Mon, Oct 09, 2017 at 11:05:56AM +0300, Avi Kivity wrote:
>>>>> Sure, that might be the IO concurrency the SSD sees and handles, but
>>>>> you very rarely require that much allocation parallelism in the
>>>>> workload. Only a small amount of the IO submission path is actually
>>>>> allocation work, so a single AG can provide plenty of async IO
>>>>> parallelism before an AG is the limiting factor.
>>>> Sure. Can a single AG issue multiple I/Os, or is it single-threaded?
>>> AGs don't issue IO. Applications issue IO, the filesystem allocates
>>> space from AGs according to the write IO that passes through it.
>> What I meant was I/O in order to satisfy an allocation (read from
>> the free extent btree or whatever), not the application's I/O.
> Once you're in the per-AG allocator context, it is single threaded
> until the allocation is complete. We do things like btree block
> readahead to minimise IO wait times, but we can't completely hide
> things like metadata read Io wait time when it is required to make
> progress.

I see, thanks. Will RWF_NOWAIT detect the need to do I/O for the free 
space btree, or just contention? (I expect the latter from the patches 
I've seen, but perhaps I missed something).

I imagine I'll have a lot of amortization there: if a 32MB allocation 
fails, the subsequent 32MB allocation for the same file will likely hit 
the same location and be satisfied from cache. My workload is pure 
O_DIRECT so no memory pressure in the kernel.

>>>> I understand that XFS_XFLAG_EXTSIZE and XFS_IOC_FSSETXATTR can
>>>> reduce the AG's load.
>>> Not really. They change the allocation pattern on the inode. This
>>> changes how the inode data is laid out on disk, but it doesn't
>>> necessarily change the allocation overhead of the write IO path.
>>> That's all dependent on what the application IO patterns are and how
>>> they match the extent size hints.
>> I write 128k naturally-aligned writes using aio, so I expect it will
>> match. Will every write go into the AG allocator, or just writes
>> that cross a 32MB boundary?
> It enters an allocation only when an allocation is required. i.e.
> only when the write lands in a hole. If you're doing sequential 128k
> writes and using 32MB extent size hints, then it only allocates once
> every 32768/128 = 256 writes. If you are doing random IO into a
> sparse file, then all bets are off.

Pure sequential writes.


>
>>> That's what RWF_NOWAIT is for. It pushes any write IO that requires
>>> allocation into a thread rather than possibly blocking the submitting
>>> thread on any lock or IO in the allocation path.
>> Excellent, we'll use that, although it will be years before our
>> users see the benefit.
> Well, that's really in your control, not mine.
>
> The disconnect between upstream progress and LTS production
> systems is not something upstream can do anything about. Often the
> problems LTS production systems see are already solved upstream and
> so the only answer we can really give you here is "upgrade, backport
> features your customers need yourself, or pay someone else to
> maintain a backport with the features you need".

I understand the situation. This was to explain why I'm looking for 
workarounds in deployed code when fixes in new code are available. My 
users/customers don't run kernels provided by me.

>>>> Machines with 60-100 logical cores and low-tens of terabytes of SSD
>>>> are becoming common.  How many AGs would work for such a machine?
>>> Multidisk default, which will be 32 AGs for anything in the 1->32TB
>>> range. And over 32TB, you get 1 AG per TB...
>>
>> Ok. Then doubling it so that each logical core has an AG wouldn't be
>> such a big change.
> But it won't make any difference to your workload because there's no
> relationship between CPU cores and the AG selected for allocation.
> The AG selection is based on filesystem relationships (e.g. local to
> parent directory inode), and so if you have two files in the same
> directory they will start trying to allocate from the same AG even
> though they get written from different cores concurrently. The only
> time they'll get moved into different AGs is if there is allocation
> contention.

Unfortunately, all cores writing files in the same directory is exactly 
my workload. I can change it, but there is a backwards compatibility 
cost to that change. I can probably also trick XFS by creating the file 
in a dedicated subdirectory and rename()ing it later.
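
Roughly this kind of thing (hypothetical paths; whether it actually
spreads files across AGs would have to be measured, since it relies on
inode placement following the parent directory):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Create the file under a per-shard scratch directory so its inode
 * (and hence the AG its allocations start in) is chosen relative to
 * that directory, write it, then rename it into the shared data
 * directory.
 */
static int create_spread(int shard, const char *final_path)
{
        char tmp[512];
        int fd;

        snprintf(tmp, sizeof(tmp), "/data/scratch/%d/newfile.tmp", shard);
        fd = open(tmp, O_CREAT | O_WRONLY | O_DIRECT, 0644);
        if (fd < 0)
                return -1;
        /* ... write with aio/O_DIRECT as usual ... */
        if (rename(tmp, final_path) < 0) {
                close(fd);
                return -1;
        }
        return fd;
}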

>
> Yes, the allocator algorithms detect AG contention internally and
> switch to uncontended AGs rather than blocking. There's /lots/ of
> stuff inside the allocators to minimise blocking - that's one of the
> reasons you see less submission blocking problems on XFS than other
> filesytsems. If you're not getting threads blocking waiting to get
> AGF locks, then you most certainly don't have allocator contention.
> Even if you do have threads blocking on AGF locks, that could simply
> be a sign you are running too close to ENOSPC, not contention...
>
> The reality is, however, that even an uncontended AG can block if
> the necessary metadata isn't in memory, or the log is full, or
> memory cannot be immediately allocated, etc. RWF_NOWAIT avoids the
> whole class of "allocator can block" problem...


Thanks. I do have blocks from time to time, but we were not able to 
pinpoint the cause as I don't own those systems (and also lack knowledge 
about the internals). At least one issue _was_ related to free space 
running out, so that fits.

The vast majority of the time XFS AIO works very well. The problem is 
that when problems do happen, performance drops off sharply, and it's 
often in a situation that's hard to debug.



* Re: agcount for 2TB, 4TB and 8TB drives
  2017-10-13  8:13                     ` Avi Kivity
@ 2017-10-14 22:42                       ` Dave Chinner
  2017-10-15  9:36                         ` Avi Kivity
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2017-10-14 22:42 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Eric Sandeen, Darrick J. Wong, Gandalf Corvotempesta, linux-xfs

On Fri, Oct 13, 2017 at 11:13:24AM +0300, Avi Kivity wrote:
> On 10/11/2017 01:55 AM, Dave Chinner wrote:
> >On Tue, Oct 10, 2017 at 12:07:42PM +0300, Avi Kivity wrote:
> >>On 10/10/2017 01:03 AM, Dave Chinner wrote:
> >>>>On 10/09/2017 02:23 PM, Dave Chinner wrote:
> >>>>>On Mon, Oct 09, 2017 at 11:05:56AM +0300, Avi Kivity wrote:
> >>>>>Sure, that might be the IO concurrency the SSD sees and handles, but
> >>>>>you very rarely require that much allocation parallelism in the
> >>>>>workload. Only a small amount of the IO submission path is actually
> >>>>>allocation work, so a single AG can provide plenty of async IO
> >>>>>parallelism before an AG is the limiting factor.
> >>>>Sure. Can a single AG issue multiple I/Os, or is it single-threaded?
> >>>AGs don't issue IO. Applications issue IO, the filesystem allocates
> >>>space from AGs according to the write IO that passes through it.
> >>What I meant was I/O in order to satisfy an allocation (read from
> >>the free extent btree or whatever), not the application's I/O.
> >Once you're in the per-AG allocator context, it is single threaded
> >until the allocation is complete. We do things like btree block
> >readahead to minimise IO wait times, but we can't completely hide
> >things like metadata read Io wait time when it is required to make
> >progress.
> 
> I see, thanks. Will RWF_NOWAIT detect the need to do I/O for the
> free space btree, or just contention? (I expect the latter from the
> patches I've seen, but perhaps I missed something).

No, it checks at a high level whether allocation is needed (i.e. IO
into a hole) and if allocation is needed, it punts the IO
immediately to the background thread and returns to userspace. i.e.
it never gets near the allocator to begin with....

Like I said before, RWF_NOWAIT prevents entire classes of
AIO submission blocking issues from occurring. Use it and almost all
filesystem blocking concerns go away....

> The vast majority of the time XFS AIO works very well. The problem
> is that when problems do happen, performance drops off sharply, and
> it's often in a situation that's hard to debug.

Yes, and that's made worse by there being relatively few people
around with the knowledge to be able to find the the root cause when
it does happen. :/

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: agcount for 2TB, 4TB and 8TB drives
  2017-10-14 22:42                       ` Dave Chinner
@ 2017-10-15  9:36                         ` Avi Kivity
  2017-10-15 22:00                           ` Dave Chinner
  0 siblings, 1 reply; 18+ messages in thread
From: Avi Kivity @ 2017-10-15  9:36 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Eric Sandeen, Darrick J. Wong, Gandalf Corvotempesta, linux-xfs



On 10/15/2017 01:42 AM, Dave Chinner wrote:
> On Fri, Oct 13, 2017 at 11:13:24AM +0300, Avi Kivity wrote:
>> On 10/11/2017 01:55 AM, Dave Chinner wrote:
>>> On Tue, Oct 10, 2017 at 12:07:42PM +0300, Avi Kivity wrote:
>>>> On 10/10/2017 01:03 AM, Dave Chinner wrote:
>>>>>> On 10/09/2017 02:23 PM, Dave Chinner wrote:
>>>>>>> On Mon, Oct 09, 2017 at 11:05:56AM +0300, Avi Kivity wrote:
>>>>>>> Sure, that might be the IO concurrency the SSD sees and handles, but
>>>>>>> you very rarely require that much allocation parallelism in the
>>>>>>> workload. Only a small amount of the IO submission path is actually
>>>>>>> allocation work, so a single AG can provide plenty of async IO
>>>>>>> parallelism before an AG is the limiting factor.
>>>>>> Sure. Can a single AG issue multiple I/Os, or is it single-threaded?
>>>>> AGs don't issue IO. Applications issue IO, the filesystem allocates
>>>>> space from AGs according to the write IO that passes through it.
>>>> What I meant was I/O in order to satisfy an allocation (read from
>>>> the free extent btree or whatever), not the application's I/O.
>>> Once you're in the per-AG allocator context, it is single threaded
>>> until the allocation is complete. We do things like btree block
>>> readahead to minimise IO wait times, but we can't completely hide
>>> things like metadata read Io wait time when it is required to make
>>> progress.
>> I see, thanks. Will RWF_NOWAIT detect the need to do I/O for the
>> free space btree, or just contention? (I expect the latter from the
>> patches I've seen, but perhaps I missed something).
> No, it checks at a high level whether allocation is needed (i.e. IO
> into a hole) and if allocation is needed, it punts the IO
> immediately to the background thread and returns to userspace. i.e.
> it never gets near the allocator to begin with....


Interesting, that's both good and bad. Good, because we avoided a 
potential stall. Bad, because if the stall would not actually have 
happened (lock not contended, btree nodes cached) then we got punted to 
the helper thread which is a more expensive path.

In fact we don't even need to try the write, we know that every 
32MB/128k = 256 writes we will hit an allocation. Perhaps we can 
fallocate() the next 32MB chunk while writing to the previous one. If 
fallocate() is fast enough, writes will never block or fail. If it's 
not, then we'll block/fail, but the likelihood is reduced. We can even 
increase the chunk size if we see we're getting blocked.

Even better would be if XFS would detect the sequential write and start 
allocating ahead of it.

>
> Like I said before, RWF_NOWAIT prevents entire classes of
> AIO submission blocking issues from occurring. Use it and almost all
> filesystem blocking concerns go away....

I will indeed.




* Re: agcount for 2TB, 4TB and 8TB drives
  2017-10-15  9:36                         ` Avi Kivity
@ 2017-10-15 22:00                           ` Dave Chinner
  2017-10-16 10:00                             ` Avi Kivity
  2017-10-18  7:31                             ` Christoph Hellwig
  0 siblings, 2 replies; 18+ messages in thread
From: Dave Chinner @ 2017-10-15 22:00 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Eric Sandeen, Darrick J. Wong, Gandalf Corvotempesta, linux-xfs

On Sun, Oct 15, 2017 at 12:36:03PM +0300, Avi Kivity wrote:
> 
> 
> On 10/15/2017 01:42 AM, Dave Chinner wrote:
> >On Fri, Oct 13, 2017 at 11:13:24AM +0300, Avi Kivity wrote:
> >>On 10/11/2017 01:55 AM, Dave Chinner wrote:
> >>>On Tue, Oct 10, 2017 at 12:07:42PM +0300, Avi Kivity wrote:
> >>>>On 10/10/2017 01:03 AM, Dave Chinner wrote:
> >>>>>>On 10/09/2017 02:23 PM, Dave Chinner wrote:
> >>>>>>>On Mon, Oct 09, 2017 at 11:05:56AM +0300, Avi Kivity wrote:
> >>>>>>>Sure, that might be the IO concurrency the SSD sees and handles, but
> >>>>>>>you very rarely require that much allocation parallelism in the
> >>>>>>>workload. Only a small amount of the IO submission path is actually
> >>>>>>>allocation work, so a single AG can provide plenty of async IO
> >>>>>>>parallelism before an AG is the limiting factor.
> >>>>>>Sure. Can a single AG issue multiple I/Os, or is it single-threaded?
> >>>>>AGs don't issue IO. Applications issue IO, the filesystem allocates
> >>>>>space from AGs according to the write IO that passes through it.
> >>>>What I meant was I/O in order to satisfy an allocation (read from
> >>>>the free extent btree or whatever), not the application's I/O.
> >>>Once you're in the per-AG allocator context, it is single threaded
> >>>until the allocation is complete. We do things like btree block
> >>>readahead to minimise IO wait times, but we can't completely hide
> >>>things like metadata read Io wait time when it is required to make
> >>>progress.
> >>I see, thanks. Will RWF_NOWAIT detect the need to do I/O for the
> >>free space btree, or just contention? (I expect the latter from the
> >>patches I've seen, but perhaps I missed something).
> >No, it checks at a high level whether allocation is needed (i.e. IO
> >into a hole) and if allocation is needed, it punts the IO
> >immediately to the background thread and returns to userspace. i.e.
> >it never gets near the allocator to begin with....
> 
> Interesting, that's both good and bad. Good, because we avoided a
> potential stall. Bad, because if the stall would not actually have
> happened (lock not contended, btree nodes cached) then we got punted
> to the helper thread which is a more expensive path.

Avoiding latency has costs in complexity, resources and CPU time.
That's why we've never ended up with a fully generic async syscall
interface in the kernel - every time someone tries, it dies the
death of complexity.

RWF_NOWAIT is simple, easy to maintain and has, in most cases, no
observable overhead.

> In fact we don't even need to try the write, we know that every
> 32MB/128k = 256 writes we will hit an allocation. Perhaps we can
> fallocate() the next 32MB chunk while writing to the previous one.

fallocate will block *all* IO and mmap faults on that file, not just
the ones that require allocation. fallocate creates a complete IO
submission pipeline stall, punting all new IO submissions to the
background worker where they will block until fallocate completes.

IOWs, in terms of overhead, IO submission efficiency and IO pipeline
bubbles, fallocate is close to the worst thing you can possibly do.
Extent size hints are far more efficient and less intrusive than
manually using fallocate from userspace.

> If fallocate() is fast enough, writes will both never block/fail. If
> it's not, then we'll block/fail, but the likelihood is reduced. We
> can even increase the chunk size if we see we're getting blocked.

If you call fallocate, other AIO writes will always get blocked
because fallocate creates an IO submission barrier. fallocate might
be fast, but it's also a total IO submission serialisation point and
so has a much more significant effect on IO submission latency when
compared to doing allocation directly in the IO path via extent size
hints...

> Even better would be if XFS would detect the sequential write and
> start allocating ahead of it.

That's what delayed allocation does with buffered IO. We
specifically do not do that with direct IO because it's direct IO
and we only do exactly what the IO the user submits requires us to
do.

As it is, I'm not sure that it would gain us anything over extent
size hints because they are effectively doing exactly the same thing
(i.e.  allocate ahead) on every write that hits a hole beyond
EOF when extending the file....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: agcount for 2TB, 4TB and 8TB drives
  2017-10-15 22:00                           ` Dave Chinner
@ 2017-10-16 10:00                             ` Avi Kivity
  2017-10-16 22:31                               ` Dave Chinner
  2017-10-18  7:31                             ` Christoph Hellwig
  1 sibling, 1 reply; 18+ messages in thread
From: Avi Kivity @ 2017-10-16 10:00 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Eric Sandeen, Darrick J. Wong, Gandalf Corvotempesta, linux-xfs



On 10/16/2017 01:00 AM, Dave Chinner wrote:
> On Sun, Oct 15, 2017 at 12:36:03PM +0300, Avi Kivity wrote:
>>
>> On 10/15/2017 01:42 AM, Dave Chinner wrote:
>>> On Fri, Oct 13, 2017 at 11:13:24AM +0300, Avi Kivity wrote:
>>>> On 10/11/2017 01:55 AM, Dave Chinner wrote:
>>>>> On Tue, Oct 10, 2017 at 12:07:42PM +0300, Avi Kivity wrote:
>>>>>> On 10/10/2017 01:03 AM, Dave Chinner wrote:
>>>>>>>> On 10/09/2017 02:23 PM, Dave Chinner wrote:
>>>>>>>>> On Mon, Oct 09, 2017 at 11:05:56AM +0300, Avi Kivity wrote:
>>>>>>>>> Sure, that might be the IO concurrency the SSD sees and handles, but
>>>>>>>>> you very rarely require that much allocation parallelism in the
>>>>>>>>> workload. Only a small amount of the IO submission path is actually
>>>>>>>>> allocation work, so a single AG can provide plenty of async IO
>>>>>>>>> parallelism before an AG is the limiting factor.
>>>>>>>> Sure. Can a single AG issue multiple I/Os, or is it single-threaded?
>>>>>>> AGs don't issue IO. Applications issue IO, the filesystem allocates
>>>>>>> space from AGs according to the write IO that passes through it.
>>>>>> What I meant was I/O in order to satisfy an allocation (read from
>>>>>> the free extent btree or whatever), not the application's I/O.
>>>>> Once you're in the per-AG allocator context, it is single threaded
>>>>> until the allocation is complete. We do things like btree block
>>>>> readahead to minimise IO wait times, but we can't completely hide
>>>>> things like metadata read Io wait time when it is required to make
>>>>> progress.
>>>> I see, thanks. Will RWF_NOWAIT detect the need to do I/O for the
>>>> free space btree, or just contention? (I expect the latter from the
>>>> patches I've seen, but perhaps I missed something).
>>> No, it checks at a high level whether allocation is needed (i.e. IO
>>> into a hole) and if allocation is needed, it punts the IO
>>> immediately to the background thread and returns to userspace. i.e.
>>> it never gets near the allocator to begin with....
>> Interesting, that's both good and bad. Good, because we avoided a
>> potential stall. Bad, because if the stall would not actually have
>> happened (lock not contended, btree nodes cached) then we got punted
>> to the helper thread which is a more expensive path.
> Avoiding latency has costs in complexity, resources and CPU time.
> That's why we've never ended up with a fully generic async syscall
> interface in the kernel - every time someone tries, it dies the
> death of complexity.
>
> RWF_NOWAIT is simple, easy to maintain and has, in most cases, no
> observable overhead.

There is no observable overhead in the kernel, but there will be some 
for the application. As soon as we cross a hint boundary writes start to 
fail, and the application needs to move them to a helper thread and 
re-submit them. These duplicate submissions happen until the helper 
thread is able to respond, and the first write manages to allocate the 
space.

Without RWF_NOWAIT, there are two possibilities: either you get lucky 
and the first write to cross the boundary doesn't block, or you get  
unlucky and you stall. There's no doubt that RWF_NOWAIT is a lot better, 
but it does cause the system to do some more work. I guess it can be 
amortized away with larger hints.

>> In fact we don't even need to try the write, we know that every
>> 32MB/128k = 256 writes we will hit an allocation. Perhaps we can
>> fallocate() the next 32MB chunk while writing to the previous one.
> fallocate will block *all* IO and mmap faults on that file, not just
> the ones that require allocation. fallocate creates a complete IO
> submission pipeline stall, punting all new IO submissions to the
> background worker where they will block until fallocate completes.

Ok, I'll stay away from it, except during close time to remove unused 
extents.

> IOWs, in terms of overhead, IO submission efficiency and IO pipeline
> bubbles, fallocate is close to the worst thing you can possibly do.
> Extent size hints are far more efficient and less intrusive than
> manually using fallocate from userspace.
>
>> If fallocate() is fast enough, writes will never block or fail. If
>> it's not, then we'll block/fail, but the likelihood is reduced. We
>> can even increase the chunk size if we see we're getting blocked.
> If you call fallocate, other AIO writes will always get blocked
> because fallocate creates an IO submission barrier. fallocate might
> be fast, but it's also a total IO submission serialisation point and
> so has a much more significant effect on IO submission latency when
> compared to doing allocation directly in the IO path via extent size
> hints...

Got it.

>> Even better would be if XFS would detect the sequential write and
>> start allocating ahead of it.
> That's what delayed allocation does with buffered IO. We
> specifically do not do that with direct IO because it's direct IO
> and we only do exactly what the IO the user submits requires us to
> do.
>
> As it is, I'm not sure that it would gain us anything over extent
> size hints because they are effectively doing exactly the same thing
> (i.e.  allocate ahead) on every write that hits a hole beyond
> EOF when extending the file....

If I understand correctly, you do get momentary serialization when you 
cross a hint boundary, while with allocate ahead, you would not.




* Re: agcount for 2TB, 4TB and 8TB drives
  2017-10-16 10:00                             ` Avi Kivity
@ 2017-10-16 22:31                               ` Dave Chinner
  0 siblings, 0 replies; 18+ messages in thread
From: Dave Chinner @ 2017-10-16 22:31 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Eric Sandeen, Darrick J. Wong, Gandalf Corvotempesta, linux-xfs

On Mon, Oct 16, 2017 at 01:00:32PM +0300, Avi Kivity wrote:
> On 10/16/2017 01:00 AM, Dave Chinner wrote:
> >On Sun, Oct 15, 2017 at 12:36:03PM +0300, Avi Kivity wrote:
> >>On 10/15/2017 01:42 AM, Dave Chinner wrote:
> >>>On Fri, Oct 13, 2017 at 11:13:24AM +0300, Avi Kivity wrote:
> >>Even better would be if XFS would detect the sequential write and
> >>start allocating ahead of it.
> >That's what delayed allocation does with buffered IO. We
> >specifically do not do that with direct IO because it's direct IO
> >and we only do exactly what the IO the user submits requires us to
> >do.
> >
> >As it is, I'm not sure that it would gain us anything over extent
> >size hints because they are effectively doing exactly the same thing
> >(i.e.  allocate ahead) on every write that hits a hole beyond
> >EOF when extending the file....
> 
> If I understand correctly, you do get momentary serialization when
> you cross a hint boundary, while with allocate ahead, you would not.

Allocate ahead still requires a threshold to be crossed to trigger
allocation. So it doesn't get rid of allocation, it just changes
what IO triggers it.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: agcount for 2TB, 4TB and 8TB drives
  2017-10-15 22:00                           ` Dave Chinner
  2017-10-16 10:00                             ` Avi Kivity
@ 2017-10-18  7:31                             ` Christoph Hellwig
  1 sibling, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2017-10-18  7:31 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Avi Kivity, Eric Sandeen, Darrick J. Wong, Gandalf Corvotempesta,
	linux-xfs

On Mon, Oct 16, 2017 at 09:00:19AM +1100, Dave Chinner wrote:
> fallocate will block *all* IO and mmap faults on that file, not just
> the ones that require allocation. fallocate creates a complete IO
> submission pipeline stall, punting all new IO submissions to the
> background worker where they will block until fallocate completes.

Not sure if it helps Avi's case, but I think we could relax the
fallocate exclusive i_rwsem requirement a bit, similar to direct I/O
writes.  Basically as long as your fallocate is block aligned we
should be fine with a shared iolock for the "normal" allocating
fallocate (discounting things like hole punches and extent shifts
of course).

