From: Dave Chinner <david@fromorbit.com>
To: Avi Kivity <avi@scylladb.com>
Cc: Eric Sandeen <sandeen@sandeen.net>,
	"Darrick J. Wong" <darrick.wong@oracle.com>,
	Gandalf Corvotempesta <gandalf.corvotempesta@gmail.com>,
	linux-xfs@vger.kernel.org
Subject: Re: agcount for 2TB, 4TB and 8TB drives
Date: Mon, 16 Oct 2017 09:00:19 +1100	[thread overview]
Message-ID: <20171015220018.GG3666@dastard> (raw)
In-Reply-To: <db0ca95f-ce16-4b2e-7d69-52f3552a6004@scylladb.com>

On Sun, Oct 15, 2017 at 12:36:03PM +0300, Avi Kivity wrote:
> 
> 
> On 10/15/2017 01:42 AM, Dave Chinner wrote:
> >On Fri, Oct 13, 2017 at 11:13:24AM +0300, Avi Kivity wrote:
> >>On 10/11/2017 01:55 AM, Dave Chinner wrote:
> >>>On Tue, Oct 10, 2017 at 12:07:42PM +0300, Avi Kivity wrote:
> >>>>On 10/10/2017 01:03 AM, Dave Chinner wrote:
> >>>>>>On 10/09/2017 02:23 PM, Dave Chinner wrote:
> >>>>>>>On Mon, Oct 09, 2017 at 11:05:56AM +0300, Avi Kivity wrote:
> >>>>>>>Sure, that might be the IO concurrency the SSD sees and handles, but
> >>>>>>>you very rarely require that much allocation parallelism in the
> >>>>>>>workload. Only a small amount of the IO submission path is actually
> >>>>>>>allocation work, so a single AG can provide plenty of async IO
> >>>>>>>parallelism before an AG is the limiting factor.
> >>>>>>Sure. Can a single AG issue multiple I/Os, or is it single-threaded?
> >>>>>AGs don't issue IO. Applications issue IO, the filesystem allocates
> >>>>>space from AGs according to the write IO that passes through it.
> >>>>What I meant was I/O in order to satisfy an allocation (read from
> >>>>the free extent btree or whatever), not the application's I/O.
> >>>Once you're in the per-AG allocator context, it is single threaded
> >>>until the allocation is complete. We do things like btree block
> >>>readahead to minimise IO wait times, but we can't completely hide
> >>>things like metadata read IO wait time when it is required to make
> >>>progress.
> >>I see, thanks. Will RWF_NOWAIT detect the need to do I/O for the
> >>free space btree, or just contention? (I expect the latter from the
> >>patches I've seen, but perhaps I missed something).
> >No, it checks at a high level whether allocation is needed (i.e. IO
> >into a hole) and if allocation is needed, it punts the IO
> >immediately to the background thread and returns to userspace. i.e.
> >it never gets near the allocator to begin with....
> 
> Interesting, that's both good and bad. Good, because we avoided a
> potential stall. Bad, because if the stall would not actually have
> happened (lock not contended, btree nodes cached) then we got punted
> to the helper thread which is a more expensive path.

Avoiding latency has costs in complexity, resources and CPU time.
That's why we've never ended up with a fully generic async syscall
interface in the kernel - every time someone tries, it dies the
death of complexity.

RWF_NOWAIT is simple, easy to maintain and has, in most cases, no
observable overhead.

> In fact we don't even need to try the write, we know that every
> 32MB/128k = 256 writes we will hit an allocation. Perhaps we can
> fallocate() the next 32MB chunk while writing to the previous one.

fallocate will block *all* IO and mmap faults on that file, not just
the ones that require allocation. fallocate creates a complete IO
submission pipeline stall, punting all new IO submissions to the
background worker where they will block until fallocate completes.

IOWs, in terms of overhead, IO submission efficiency and IO pipeline
bubbles, fallocate is close to the worst thing you can possibly do.
Extent size hints are far more efficient and less intrusive than
manually using fallocate from userspace.

> If fallocate() is fast enough, writes will never block or fail. If
> it's not, then we'll block/fail, but the likelihood is reduced. We
> can even increase the chunk size if we see we're getting blocked.

If you call fallocate, other AIO writes will always get blocked
because fallocate creates an IO submission barrier. fallocate might
be fast, but it's also a total IO submission serialisation point and
so has a much more significant effect on IO submission latency when
compared to doing allocation directly in the IO path via extent size
hints...

> Even better would be if XFS would detect the sequential write and
> start allocating ahead of it.

That's what delayed allocation does with buffered IO. We
specifically do not do that with direct IO because it's direct IO
and we do only exactly what the user's submitted IO requires us to
do.

As it is, I'm not sure that it would gain us anything over extent
size hints because they are effectively doing exactly the same thing
(i.e.  allocate ahead) on every write that hits a hole beyond
EOF when extending the file....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

Thread overview: 18+ messages
2017-10-06  8:46 agcount for 2TB, 4TB and 8TB drives Gandalf Corvotempesta
2017-10-06 15:38 ` Darrick J. Wong
2017-10-06 16:18   ` Eric Sandeen
2017-10-06 22:20     ` Dave Chinner
2017-10-06 22:21       ` Eric Sandeen
2017-10-09  8:05         ` Avi Kivity
2017-10-09 11:23           ` Dave Chinner
2017-10-09 15:46             ` Avi Kivity
2017-10-09 22:03               ` Dave Chinner
2017-10-10  9:07                 ` Avi Kivity
2017-10-10 22:55                   ` Dave Chinner
2017-10-13  8:13                     ` Avi Kivity
2017-10-14 22:42                       ` Dave Chinner
2017-10-15  9:36                         ` Avi Kivity
2017-10-15 22:00                           ` Dave Chinner [this message]
2017-10-16 10:00                             ` Avi Kivity
2017-10-16 22:31                               ` Dave Chinner
2017-10-18  7:31                             ` Christoph Hellwig
