All of lore.kernel.org
 help / color / mirror / Atom feed
From: Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org>
To: Christoph Hellwig <hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-man-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	xfs-VZNHf3L845pBDgjK7y7TUQ@public.gmane.org,
	axboe-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org
Subject: Re: [PATCH] [RFC] xfs: wire up aio_fsync method
Date: Mon, 16 Jun 2014 12:00:30 +1000	[thread overview]
Message-ID: <20140616020030.GC9508@dastard> (raw)
In-Reply-To: <20140615223323.GB9508@dastard>

On Mon, Jun 16, 2014 at 08:33:23AM +1000, Dave Chinner wrote:
> On Fri, Jun 13, 2014 at 09:23:52AM -0700, Christoph Hellwig wrote:
> > On Fri, Jun 13, 2014 at 09:44:41AM +1000, Dave Chinner wrote:
> > > On Thu, Jun 12, 2014 at 07:13:29AM -0700, Christoph Hellwig wrote:
> > > > There doesn't really seem anything XFS specific here, so instead
> > > > of wiring up ->aio_fsync I'd implement IOCB_CMD_FSYNC in fs/aio.c
> > > > based on the workqueue and ->fsync.
> > > 
> > > I really don't know whether the other ->fsync methods in other
> > > filesystems can stand alone like that. I also don't have the
> > > time to test that it works properly on all filesystems right now.
> > 
> > Of course they can, as shown by various calls to vfs_fsync_range that
> > is nothing but a small wrapper around ->fsync.
> 
> Sure, but that's not getting 10,000 concurrent callers, is it? And
> some fsync methods require journal credits, and others serialise
> completely, and so on.
> 
> Besides, putting an *unbound, highly concurrent* aio queue into the
> kernel for an operation that can serialise the entire filesystem
> seems like a pretty nasty user-level DOS vector to me.

FWIW, the non-linear system CPU overhead of a fs_mark test I've been
running isn't anything related to XFS.  The async fsync workqueue
results in several thousand worker threads dispatching IO
concurrently across 16 CPUs:

$ ps -ef |grep kworker |wc -l
4693
$

Profiles from 3.15 + xfs for-next + xfs aio_fsync show:

-  51.33%  [kernel]            [k] percpu_ida_alloc
   - percpu_ida_alloc
      + 85.73% blk_mq_wait_for_tags
      + 14.23% blk_mq_get_tag
-  14.25%  [kernel]            [k] _raw_spin_unlock_irqrestore
   - _raw_spin_unlock_irqrestore
      - 66.26% virtio_queue_rq
         - __blk_mq_run_hw_queue
            - 99.65% blk_mq_run_hw_queue
               + 99.47% blk_mq_insert_requests
               + 0.53% blk_mq_insert_request
.....
-   7.91%  [kernel]            [k] _raw_spin_unlock_irq
   - _raw_spin_unlock_irq
      - 69.59% __schedule
         - 86.49% schedule
            + 47.72% percpu_ida_alloc
            + 21.75% worker_thread
            + 19.12% schedule_timeout
....
      + 18.06% blk_mq_make_request

Runtime:

real    4m1.243s
user    0m47.724s
sys     11m56.724s

Most of the excessive CPU usage is coming from the blk-mq layer, and
XFS is barely showing up in the profiles at all - the IDA tag
allocator is burning 8 CPUs at about 60,000 write IOPS....

I know that the tag allocator has been rewritten, so I tested
against a current a current Linus kernel with the XFS aio-fsync
patch. The results are all over the place - from several sequential
runs of the same test (removing the files in between so each tests
starts from an empty fs):

Wall time	sys time	IOPS	 files/s
4m58.151s	11m12.648s	30,000	 13,500
4m35.075s	12m45.900s	45,000	 15,000
3m10.665s	11m15.804s	65,000	 21,000
3m27.384s	11m54.723s	85,000	 20,000
3m59.574s	11m12.012s	50,000	 16,500
4m12.704s	12m15.720s	50,000	 17,000

The 3.15 based kernel was pretty consistent around the 4m10 mark,
generally only +/-10s in runtime and not much change in system time.
The files/s rate reported by fs_mark doesn't vary that much, either.
So the new tag allocator seems to be no better in terms of IO
dispatch scalability, yet adds significant variability to IO
performance.

What I noticed is a massive jump in context switch overhead: from
around 250,000/s to over 800,000/s and the CPU profiles show that
this comes from the new tag allocator:

-  34.62%  [kernel]  [k] _raw_spin_unlock_irqrestore
   - _raw_spin_unlock_irqrestore
      - 58.22% prepare_to_wait
           100.00% bt_get
              blk_mq_get_tag
              __blk_mq_alloc_request
              blk_mq_map_request
              blk_sq_make_request
              generic_make_request
      - 22.51% virtio_queue_rq
           __blk_mq_run_hw_queue
....
-  21.56%  [kernel]  [k] _raw_spin_unlock_irq
   - _raw_spin_unlock_irq
      - 58.73% __schedule
         - 53.42% io_schedule
              99.88% bt_get
                 blk_mq_get_tag
                 __blk_mq_alloc_request
                 blk_mq_map_request
                 blk_sq_make_request
                 generic_make_request
         - 35.58% schedule
            + 49.31% worker_thread
            + 32.45% schedule_timeout
            + 10.35% _xfs_log_force_lsn
            + 3.10% xlog_cil_force_lsn
....

The new block-mq tag allocator is hammering the waitqueues and
that's generating a large amount of lock contention. It looks like
the new allocator replaces CPU burn doing work in the IDA allocator
with the same amount of CPU burn from extra context switch
overhead....

Oh, OH. Now I understand!

# echo 4 > /sys/block/vdc/queue/nr_requests

<run workload>

80.56%  [kernel]  [k] _raw_spin_unlock_irqrestore
   - _raw_spin_unlock_irqrestore
      - 98.49% prepare_to_wait
           bt_get
           blk_mq_get_tag
           __blk_mq_alloc_request
           blk_mq_map_request
           blk_sq_make_request
           generic_make_request
         + submit_bio
      + 1.07% finish_wait
+  13.63%  [kernel]  [k] _raw_spin_unlock_irq
...

It's context switch bound at 800,000 context switches/s, burning all
16 CPUs waking up and going to sleep and doing very little real
work. How little real work? About 3000 IOPS for 2MB/s of IO. That
amount of IO should only take a single digit CPU percentage of one
CPU.

This seems like bad behaviour to have on a congested block device,
even a high performance one....

Cheers,

Dave.
-- 
Dave Chinner
david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

WARNING: multiple messages have this Message-ID (diff)
From: Dave Chinner <david@fromorbit.com>
To: Christoph Hellwig <hch@infradead.org>
Cc: linux-fsdevel@vger.kernel.org, linux-man@vger.kernel.org,
	axboe@kernel.dk, xfs@oss.sgi.com
Subject: Re: [PATCH] [RFC] xfs: wire up aio_fsync method
Date: Mon, 16 Jun 2014 12:00:30 +1000	[thread overview]
Message-ID: <20140616020030.GC9508@dastard> (raw)
In-Reply-To: <20140615223323.GB9508@dastard>

On Mon, Jun 16, 2014 at 08:33:23AM +1000, Dave Chinner wrote:
> On Fri, Jun 13, 2014 at 09:23:52AM -0700, Christoph Hellwig wrote:
> > On Fri, Jun 13, 2014 at 09:44:41AM +1000, Dave Chinner wrote:
> > > On Thu, Jun 12, 2014 at 07:13:29AM -0700, Christoph Hellwig wrote:
> > > > There doesn't really seem anything XFS specific here, so instead
> > > > of wiring up ->aio_fsync I'd implement IOCB_CMD_FSYNC in fs/aio.c
> > > > based on the workqueue and ->fsync.
> > > 
> > > I really don't know whether the other ->fsync methods in other
> > > filesystems can stand alone like that. I also don't have the
> > > time to test that it works properly on all filesystems right now.
> > 
> > Of course they can, as shown by various calls to vfs_fsync_range that
> > is nothing but a small wrapper around ->fsync.
> 
> Sure, but that's not getting 10,000 concurrent callers, is it? And
> some fsync methods require journal credits, and others serialise
> completely, and so on.
> 
> Besides, putting an *unbound, highly concurrent* aio queue into the
> kernel for an operation that can serialise the entire filesystem
> seems like a pretty nasty user-level DOS vector to me.

FWIW, the non-linear system CPU overhead of a fs_mark test I've been
running isn't anything related to XFS.  The async fsync workqueue
results in several thousand worker threads dispatching IO
concurrently across 16 CPUs:

$ ps -ef |grep kworker |wc -l
4693
$

Profiles from 3.15 + xfs for-next + xfs aio_fsync show:

-  51.33%  [kernel]            [k] percpu_ida_alloc
   - percpu_ida_alloc
      + 85.73% blk_mq_wait_for_tags
      + 14.23% blk_mq_get_tag
-  14.25%  [kernel]            [k] _raw_spin_unlock_irqrestore
   - _raw_spin_unlock_irqrestore
      - 66.26% virtio_queue_rq
         - __blk_mq_run_hw_queue
            - 99.65% blk_mq_run_hw_queue
               + 99.47% blk_mq_insert_requests
               + 0.53% blk_mq_insert_request
.....
-   7.91%  [kernel]            [k] _raw_spin_unlock_irq
   - _raw_spin_unlock_irq
      - 69.59% __schedule
         - 86.49% schedule
            + 47.72% percpu_ida_alloc
            + 21.75% worker_thread
            + 19.12% schedule_timeout
....
      + 18.06% blk_mq_make_request

Runtime:

real    4m1.243s
user    0m47.724s
sys     11m56.724s

Most of the excessive CPU usage is coming from the blk-mq layer, and
XFS is barely showing up in the profiles at all - the IDA tag
allocator is burning 8 CPUs at about 60,000 write IOPS....

I know that the tag allocator has been rewritten, so I tested
against a current a current Linus kernel with the XFS aio-fsync
patch. The results are all over the place - from several sequential
runs of the same test (removing the files in between so each tests
starts from an empty fs):

Wall time	sys time	IOPS	 files/s
4m58.151s	11m12.648s	30,000	 13,500
4m35.075s	12m45.900s	45,000	 15,000
3m10.665s	11m15.804s	65,000	 21,000
3m27.384s	11m54.723s	85,000	 20,000
3m59.574s	11m12.012s	50,000	 16,500
4m12.704s	12m15.720s	50,000	 17,000

The 3.15 based kernel was pretty consistent around the 4m10 mark,
generally only +/-10s in runtime and not much change in system time.
The files/s rate reported by fs_mark doesn't vary that much, either.
So the new tag allocator seems to be no better in terms of IO
dispatch scalability, yet adds significant variability to IO
performance.

What I noticed is a massive jump in context switch overhead: from
around 250,000/s to over 800,000/s and the CPU profiles show that
this comes from the new tag allocator:

-  34.62%  [kernel]  [k] _raw_spin_unlock_irqrestore
   - _raw_spin_unlock_irqrestore
      - 58.22% prepare_to_wait
           100.00% bt_get
              blk_mq_get_tag
              __blk_mq_alloc_request
              blk_mq_map_request
              blk_sq_make_request
              generic_make_request
      - 22.51% virtio_queue_rq
           __blk_mq_run_hw_queue
....
-  21.56%  [kernel]  [k] _raw_spin_unlock_irq
   - _raw_spin_unlock_irq
      - 58.73% __schedule
         - 53.42% io_schedule
              99.88% bt_get
                 blk_mq_get_tag
                 __blk_mq_alloc_request
                 blk_mq_map_request
                 blk_sq_make_request
                 generic_make_request
         - 35.58% schedule
            + 49.31% worker_thread
            + 32.45% schedule_timeout
            + 10.35% _xfs_log_force_lsn
            + 3.10% xlog_cil_force_lsn
....

The new block-mq tag allocator is hammering the waitqueues and
that's generating a large amount of lock contention. It looks like
the new allocator replaces CPU burn doing work in the IDA allocator
with the same amount of CPU burn from extra context switch
overhead....

Oh, OH. Now I understand!

# echo 4 > /sys/block/vdc/queue/nr_requests

<run workload>

80.56%  [kernel]  [k] _raw_spin_unlock_irqrestore
   - _raw_spin_unlock_irqrestore
      - 98.49% prepare_to_wait
           bt_get
           blk_mq_get_tag
           __blk_mq_alloc_request
           blk_mq_map_request
           blk_sq_make_request
           generic_make_request
         + submit_bio
      + 1.07% finish_wait
+  13.63%  [kernel]  [k] _raw_spin_unlock_irq
...

It's context switch bound at 800,000 context switches/s, burning all
16 CPUs waking up and going to sleep and doing very little real
work. How little real work? About 3000 IOPS for 2MB/s of IO. That
amount of IO should only take a single digit CPU percentage of one
CPU.

This seems like bad behaviour to have on a congested block device,
even a high performance one....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

  reply	other threads:[~2014-06-16  2:00 UTC|newest]

Thread overview: 39+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-06-12  8:34 [PATCH] [RFC] xfs: wire up aio_fsync method Dave Chinner
2014-06-12  8:56 ` [PATCH] [RFC] fs_mark: add asynchronous fsync Dave Chinner
2014-06-12 14:13 ` [PATCH] [RFC] xfs: wire up aio_fsync method Christoph Hellwig
2014-06-12 23:44   ` Dave Chinner
2014-06-13 16:23     ` Christoph Hellwig
2014-06-15 22:33       ` Dave Chinner
2014-06-15 22:33         ` Dave Chinner
2014-06-16  2:00         ` Dave Chinner [this message]
2014-06-16  2:00           ` Dave Chinner
2014-06-16  2:58           ` Jens Axboe
2014-06-16  2:58             ` Jens Axboe
     [not found]             ` <539E5D66.8040605-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
2014-06-16  7:19               ` Dave Chinner
2014-06-16  7:19                 ` Dave Chinner
2014-06-16 19:30                 ` Jens Axboe
2014-06-16 19:30                   ` Jens Axboe
2014-06-16 22:27                   ` Dave Chinner
2014-06-17 13:23                     ` Jens Axboe
2014-06-17 13:23                       ` Jens Axboe
2014-06-18  0:28                       ` Dave Chinner
2014-06-18  0:28                         ` Dave Chinner
2014-06-18  0:28                         ` Dave Chinner
2014-06-18  2:24                         ` Jens Axboe
2014-06-18  2:24                           ` Jens Axboe
2014-06-18  2:24                           ` Jens Axboe
2014-06-18  3:13                           ` Dave Chinner
2014-06-18  3:13                             ` Dave Chinner
2014-06-18  3:20                             ` Jens Axboe
2014-06-18  3:20                               ` Jens Axboe
2014-06-18  3:20                               ` Jens Axboe
     [not found]                               ` <53A10597.6020707-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
2014-06-18  5:02                                 ` Dave Chinner
2014-06-18  5:02                                   ` Dave Chinner
2014-06-18  5:02                                   ` Dave Chinner
2014-06-18  6:13                                   ` Dave Chinner
2014-06-18  6:13                                     ` Dave Chinner
     [not found]       ` <20140613162352.GB23394-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
2014-06-16 21:06         ` Michael Kerrisk (man-pages)
2014-06-16 21:06           ` Michael Kerrisk (man-pages)
2014-06-17 14:01           ` Christoph Hellwig
2014-06-12 15:24 ` Brian Foster
2014-06-12 23:57   ` Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140616020030.GC9508@dastard \
    --to=david-fqsqvqoi3ljby3ivrkzq2a@public.gmane.org \
    --cc=axboe-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org \
    --cc=hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org \
    --cc=linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
    --cc=linux-man-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
    --cc=xfs-VZNHf3L845pBDgjK7y7TUQ@public.gmane.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.