From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from ipmail07.adl2.internode.on.net ([150.101.137.131]:32113
        "EHLO ipmail07.adl2.internode.on.net" rhost-flags-OK-OK-OK-OK)
        by vger.kernel.org with ESMTP id S1751184AbdJIWE7 (ORCPT );
        Mon, 9 Oct 2017 18:04:59 -0400
Date: Tue, 10 Oct 2017 09:03:32 +1100
From: Dave Chinner
Subject: Re: agcount for 2TB, 4TB and 8TB drives
Message-ID: <20171009220332.GP3666@dastard>
References: <20171006153803.GI7122@magnolia>
        <8e6fd742-8767-e786-746d-2b9f2929b98c@sandeen.net>
        <20171006222031.GU3666@dastard>
        <1b5b6410-b1d9-8519-7032-8ea0ca46f5b5@scylladb.com>
        <20171009112306.GM3666@dastard>
        <89a7ae9a-9960-ae37-f6ca-0c1f2e33f65f@scylladb.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <89a7ae9a-9960-ae37-f6ca-0c1f2e33f65f@scylladb.com>
Sender: linux-xfs-owner@vger.kernel.org
List-ID:
List-Id: xfs
To: Avi Kivity
Cc: Eric Sandeen, "Darrick J. Wong", Gandalf Corvotempesta,
        linux-xfs@vger.kernel.org

On Mon, Oct 09, 2017 at 06:46:41PM +0300, Avi Kivity wrote:
> 
> 
> On 10/09/2017 02:23 PM, Dave Chinner wrote:
> >On Mon, Oct 09, 2017 at 11:05:56AM +0300, Avi Kivity wrote:
> >>On 10/07/2017 01:21 AM, Eric Sandeen wrote:
> >>>On 10/6/17 5:20 PM, Dave Chinner wrote:
> >>>>On Fri, Oct 06, 2017 at 11:18:39AM -0500, Eric Sandeen wrote:
> >>>>>On 10/6/17 10:38 AM, Darrick J. Wong wrote:
> >>>>>>On Fri, Oct 06, 2017 at 10:46:20AM +0200, Gandalf Corvotempesta wrote:
> >>>>>>Semirelated question: for a solid state disk on a machine with high CPU
> >>>>>>counts do we prefer agcount == cpucount to take advantage of the
> >>>>>>high(er) iops and lack of seek time to increase parallelism?
> >>>>>>
> >>>>>>(Not that I've studied that in depth.)
> >>>>>Interesting question. :) Maybe harder to answer for SSD black boxes?
> >>>>Easy: switch to multidisk mode if /sys/block//queue/rotational
> >>>>is zero after doing all the other checks.
> >>>>Then SSDs will get larger
> >>>>AG counts automatically.
> >>>The "hard part" was knowing just how much parallelism is actually inside
> >>>the black box.
> >>It's often > 100.
> >Sure, that might be the IO concurrency the SSD sees and handles, but
> >you very rarely require that much allocation parallelism in the
> >workload. Only a small amount of the IO submission path is actually
> >allocation work, so a single AG can provide plenty of async IO
> >parallelism before an AG is the limiting factor.
> 
> Sure. Can a single AG issue multiple I/Os, or is it single-threaded?

AGs don't issue IO. Applications issue IO, and the filesystem
allocates space from AGs according to the write IO that passes
through it.

i.e. when you don't do allocation in the write IO path or you are
doing read IOs, then the number of AGs is /completely irrelevant/.
In those cases a single AG can "support" the entire IO load your
application and storage subsystem can handle.

The only time an AG lock is taken in the IO path is during extent
allocation (i.e. writes). And, as I've already said, a single AG can
easily handle tens of thousands of allocation transactions a second
before it becomes a bottleneck.

IOWs, the worst case is that you'll get tens of thousands of IOs per
second through an AG.

> I understand that XFS_XFLAG_EXTSIZE and XFS_IOC_FSSETXATTR can
> reduce the AG's load.

Not really. They change the allocation pattern on the inode. This
changes how the inode data is laid out on disk, but it doesn't
necessarily change the allocation overhead of the write IO path.
That's all dependent on what the application IO patterns are and how
they match the extent size hints.

In general, nobody ever notices what the "load" on an AG is and
that's because almost no-one ever drives AGs to their limits.
The mkfs defaults and the allocation policies keep the load
distributed across the filesystem and so storage subsystems almost
always run out of IO and/or seek capability before the filesystem
runs out of allocation concurrency. And, in general, most machines
run out of CPU power before they drive enough concurrency and load
through the filesystem that it starts contending on internal locks.

Sure, I have plenty of artificial workloads that drive this sort of
contention, but no-one has a production workload that requires those
sorts of behaviours or creates the same level of lock contention
that these artificial workloads drive.

> Is there a downside? for example, when I
> truncate + close the file, will the preallocated data still remain
> allocated? Do I need to return it with an fallocate()?

No. Yes.

> >space manipulations per second before the AG locks become the
> >bottleneck. Hence by the time you get to 16 AGs there's concurrency
> >available for (runs a concurrent workload and measures) at least
> >350,000 allocation transactions per second on relatively slow 5 year
> >old 8-core server CPUs. And that's CPU bound (16 CPUs all at >95%),
> >so faster, more recent CPUs will run much higher numbers.
> >
> >IOws, don't confuse allocation concurrency with IO concurrency or
> >application concurrency. It's not the same thing and it is rarely a
> >limiting factor for most workloads, even the most IO intensive
> >ones...
> 
> In my load, the allocation load is not very high, but the impact of
> iowait is. So if I can reduce the chance of io_submit() blocking
> because of AG contention, then I'm happy to increase the number of
> AGs even if it hurts other things.

That's what RWF_NOWAIT is for. It pushes any write IO that requires
allocation into a thread rather than possibly blocking the
submitting thread on any lock or IO in the allocation path.

> >>> But "multidisk mode" doesn't go too overboard, so yeah
> >>>that's probably fine.
> >>Is there a penalty associated with having too many allocation groups?
> >Yes. You break up the large contiguous free spaces into many smaller
> >free spaces and so can induce premature onset of filesystem aging
> >related performance degradations. And for spinning disks, more than
> >4-8AGs per spindle causes excessive seeks in mixed workloads and
> >degrades performance that way....
> 
> For an SSD, would an AG per 10GB be reasonable? per 100GB?

No. Maybe.

Like I said, we can use the multi-disk mode in mkfs for this - it
already selects an appropriate number of AGs according to the size
of the filesystem.

> Machines with 60-100 logical cores and low-tens of terabytes of SSD
> are becoming common. How many AGs would work for such a machine?

Multidisk default, which will be 32 AGs for anything in the 1->32TB
range. And over 32TB, you get 1 AG per TB...

> Again the allocation load is not very high (allocating a few GB/s
> with 32MB hints, so < 100 allocs/sec), but the penalty for
> contention is pretty high.

I think you're worrying about a non-problem. Use RWF_NOWAIT for your
AIO, and most of your existing IO submission blocking problems will
go away.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
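
A minimal sketch of the multidisk AG-count rule as stated in this
message (32 AGs for anything from 1TB to 32TB, then 1 AG per TB).
The function name multidisk_agcount is hypothetical and this is not
mkfs.xfs code. Rounding fractional sizes up to a whole AG is an
assumption, and sizes below 1TB are not covered in this message, so
the sketch rejects them.

```python
import math


def multidisk_agcount(size_tb: float) -> int:
    """Return the AG count for a filesystem of size_tb terabytes.

    Hypothetical helper that only encodes the rule quoted in this
    thread; it is not taken from mkfs.xfs.
    """
    if size_tb < 1:
        # The thread only describes the 1TB-and-up behaviour.
        raise ValueError("sizes below 1TB are not covered by this sketch")
    if size_tb <= 32:
        # "32 AGs for anything in the 1->32TB range"
        return 32
    # "over 32TB, you get 1 AG per TB"
    # Assumption: a fractional remainder rounds up to one more AG.
    return math.ceil(size_tb)


if __name__ == "__main__":
    for tb in (2, 4, 8, 48):
        print(f"{tb}TB -> {multidisk_agcount(tb)} AGs")
```

Under this rule the 2TB, 4TB and 8TB drives in the subject line would
each get 32 AGs, provided mkfs actually selects multidisk mode for
them.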