Subject: Re: agcount for 2TB, 4TB and 8TB drives
From: Avi Kivity
Date: Mon, 16 Oct 2017 13:00:32 +0300
To: Dave Chinner
Cc: Eric Sandeen, "Darrick J. Wong", Gandalf Corvotempesta, linux-xfs@vger.kernel.org

On 10/16/2017 01:00 AM, Dave Chinner wrote:
> On Sun, Oct 15, 2017 at 12:36:03PM +0300, Avi Kivity wrote:
>>
>> On 10/15/2017 01:42 AM, Dave Chinner wrote:
>>> On Fri, Oct 13, 2017 at 11:13:24AM +0300, Avi Kivity wrote:
>>>> On 10/11/2017 01:55 AM, Dave Chinner wrote:
>>>>> On Tue, Oct 10, 2017 at 12:07:42PM +0300, Avi Kivity wrote:
>>>>>> On 10/10/2017 01:03 AM, Dave Chinner wrote:
>>>>>>>> On 10/09/2017 02:23 PM, Dave Chinner wrote:
>>>>>>>>> On Mon, Oct 09, 2017 at 11:05:56AM +0300, Avi Kivity wrote:
>>>>>>>>> Sure, that might be the IO concurrency the SSD sees and handles, but
>>>>>>>>> you very rarely require that much allocation parallelism in the
>>>>>>>>> workload. Only a small amount of the IO submission path is actually
>>>>>>>>> allocation work, so a single AG can provide plenty of async IO
>>>>>>>>> parallelism before an AG is the limiting factor.
>>>>>>>> Sure. Can a single AG issue multiple I/Os, or is it single-threaded?
>>>>>>> AGs don't issue IO. Applications issue IO, the filesystem allocates
>>>>>>> space from AGs according to the write IO that passes through it.
>>>>>> What I meant was I/O in order to satisfy an allocation (read from
>>>>>> the free extent btree or whatever), not the application's I/O.
>>>>> Once you're in the per-AG allocator context, it is single threaded
>>>>> until the allocation is complete. We do things like btree block
>>>>> readahead to minimise IO wait times, but we can't completely hide
>>>>> things like metadata read IO wait time when it is required to make
>>>>> progress.
>>>> I see, thanks. Will RWF_NOWAIT detect the need to do I/O for the
>>>> free space btree, or just contention? (I expect the latter from the
>>>> patches I've seen, but perhaps I missed something.)
>>> No, it checks at a high level whether allocation is needed (i.e. IO
>>> into a hole) and if allocation is needed, it punts the IO
>>> immediately to the background thread and returns to userspace. i.e.
>>> it never gets near the allocator to begin with....
>> Interesting, that's both good and bad. Good, because we avoided a
>> potential stall. Bad, because if the stall would not actually have
>> happened (lock not contended, btree nodes cached), then we got punted
>> to the helper thread, which is a more expensive path.
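To make the fast path/slow path split concrete, the submission sequence
I have in mind is roughly the sketch below. queue_to_helper_thread() is
just a made-up name for our slow path, and with AIO the same flag would
be set in the iocb's aio_rw_flags rather than passed to pwritev2();
this also assumes a kernel/glibc new enough to expose RWF_NOWAIT:

/* Sketch: try the non-blocking fast path first, punt on EAGAIN. */
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>
#include <errno.h>

/* Hypothetical application function: re-submits the same write from a
 * worker thread that is allowed to block (no RWF_NOWAIT there). */
extern void queue_to_helper_thread(int fd, const struct iovec *iov,
                                   int iovcnt, off_t off);

static ssize_t submit_write(int fd, const struct iovec *iov, int iovcnt,
                            off_t off)
{
        /* Fails fast with EAGAIN instead of blocking if the write
         * would need allocation or otherwise have to wait. */
        ssize_t ret = pwritev2(fd, iov, iovcnt, off, RWF_NOWAIT);

        if (ret < 0 && errno == EAGAIN) {
                queue_to_helper_thread(fd, iov, iovcnt, off);
                return 0;       /* treated as queued, not failed */
        }
        return ret;
}
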
> Avoiding latency has costs in complexity, resources and CPU time.
> That's why we've never ended up with a fully generic async syscall
> interface in the kernel - every time someone tries, it dies the
> death of complexity.
>
> RWF_NOWAIT is simple, easy to maintain and has, in most cases, no
> observable overhead.

There is no observable overhead in the kernel, but there will be some
for the application. As soon as we cross a hint boundary, writes start
to fail, and the application needs to move them to a helper thread and
re-submit them. These duplicate submissions keep happening until the
helper thread gets to run and the first write manages to allocate the
space.

Without RWF_NOWAIT, there are two possibilities: either you get lucky
and the first write to cross the boundary doesn't block, or you get
unlucky and you stall. There's no doubt that RWF_NOWAIT is a lot
better, but it does cause the system to do some more work. I guess it
can be amortized away with larger hints.

>> In fact we don't even need to try the write, we know that every
>> 32MB/128k = 256 writes we will hit an allocation. Perhaps we can
>> fallocate() the next 32MB chunk while writing to the previous one.
> fallocate will block *all* IO and mmap faults on that file, not just
> the ones that require allocation. fallocate creates a complete IO
> submission pipeline stall, punting all new IO submissions to the
> background worker where they will block until fallocate completes.

Ok, I'll stay away from it, except at close time to remove unused
extents.

> IOWs, in terms of overhead, IO submission efficiency and IO pipeline
> bubbles, fallocate is close to the worst thing you can possibly do.
> Extent size hints are far more efficient and less intrusive than
> manually using fallocate from userspace.
>
>> If fallocate() is fast enough, writes will never block or fail. If
>> it's not, then we'll block/fail, but the likelihood is reduced. We
>> can even increase the chunk size if we see we're getting blocked.
> If you call fallocate, other AIO writes will always get blocked
> because fallocate creates an IO submission barrier. fallocate might
> be fast, but it's also a total IO submission serialisation point and
> so has a much more significant effect on IO submission latency when
> compared to doing allocation directly in the IO path via extent size
> hints...

Got it.

>> Even better would be if XFS would detect the sequential write and
>> start allocating ahead of it.
> That's what delayed allocation does with buffered IO. We
> specifically do not do that with direct IO because it's direct IO
> and we only do exactly what the IO the user submits requires us to
> do.
>
> As it is, I'm not sure that it would gain us anything over extent
> size hints because they are effectively doing exactly the same thing
> (i.e. allocate ahead) on every write that hits a hole beyond
> EOF when extending the file....

If I understand correctly, with extent size hints you still get a
momentary serialization when a write crosses a hint boundary, whereas
with allocate-ahead you would not.
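For reference, setting the 32MB hint from the application would look
roughly like the sketch below. set_extsize_hint() is a made-up helper,
it assumes Linux >= 4.5 headers for struct fsxattr and FS_XFLAG_EXTSIZE,
and xfs_io -c "extsize 32m" on the file does the same thing from the
command line:

/* Sketch: ask XFS to allocate in fixed-size chunks (the extent size
 * hint) instead of doing a small allocation per write. */
#include <linux/fs.h>
#include <sys/ioctl.h>

int set_extsize_hint(int fd, unsigned int bytes)
{
        struct fsxattr fsx;

        /* Read the current attributes so we only change the hint. */
        if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
                return -1;

        fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;     /* enable per-file hint */
        fsx.fsx_extsize = bytes;                /* hint size in bytes */

        return ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
}

/* e.g. right after creating the data file, before the first write:
 *      set_extsize_hint(fd, 32u << 20);        32MB hint
 */

We'd call it right after creating the file, before the first write, so
every allocation after that comes in 32MB chunks and only one write in
256 crosses a boundary.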