On 06/16/2014 01:19 AM, Dave Chinner wrote: > On Sun, Jun 15, 2014 at 08:58:46PM -0600, Jens Axboe wrote: >> On 2014-06-15 20:00, Dave Chinner wrote: >>> On Mon, Jun 16, 2014 at 08:33:23AM +1000, Dave Chinner wrote: >>>> On Fri, Jun 13, 2014 at 09:23:52AM -0700, Christoph Hellwig wrote: >>>>> On Fri, Jun 13, 2014 at 09:44:41AM +1000, Dave Chinner wrote: >>>>>> On Thu, Jun 12, 2014 at 07:13:29AM -0700, Christoph Hellwig wrote: >>>>>>> There doesn't really seem anything XFS specific here, so instead >>>>>>> of wiring up ->aio_fsync I'd implement IOCB_CMD_FSYNC in fs/aio.c >>>>>>> based on the workqueue and ->fsync. >>>>>> >>>>>> I really don't know whether the other ->fsync methods in other >>>>>> filesystems can stand alone like that. I also don't have the >>>>>> time to test that it works properly on all filesystems right now. >>>>> >>>>> Of course they can, as shown by various calls to vfs_fsync_range that >>>>> is nothing but a small wrapper around ->fsync. >>>> >>>> Sure, but that's not getting 10,000 concurrent callers, is it? And >>>> some fsync methods require journal credits, and others serialise >>>> completely, and so on. >>>> >>>> Besides, putting an *unbound, highly concurrent* aio queue into the >>>> kernel for an operation that can serialise the entire filesystem >>>> seems like a pretty nasty user-level DOS vector to me. >>> >>> FWIW, the non-linear system CPU overhead of a fs_mark test I've been >>> running isn't anything related to XFS. The async fsync workqueue >>> results in several thousand worker threads dispatching IO >>> concurrently across 16 CPUs: >>> >>> $ ps -ef |grep kworker |wc -l >>> 4693 >>> $ >>> >>> Profiles from 3.15 + xfs for-next + xfs aio_fsync show: >>> >>> - 51.33% [kernel] [k] percpu_ida_alloc >>> - percpu_ida_alloc >>> + 85.73% blk_mq_wait_for_tags >>> + 14.23% blk_mq_get_tag >>> - 14.25% [kernel] [k] _raw_spin_unlock_irqrestore >>> - _raw_spin_unlock_irqrestore >>> - 66.26% virtio_queue_rq >>> - __blk_mq_run_hw_queue >>> - 99.65% blk_mq_run_hw_queue >>> + 99.47% blk_mq_insert_requests >>> + 0.53% blk_mq_insert_request >>> ..... >>> - 7.91% [kernel] [k] _raw_spin_unlock_irq >>> - _raw_spin_unlock_irq >>> - 69.59% __schedule >>> - 86.49% schedule >>> + 47.72% percpu_ida_alloc >>> + 21.75% worker_thread >>> + 19.12% schedule_timeout >>> .... >>> + 18.06% blk_mq_make_request >>> >>> Runtime: >>> >>> real 4m1.243s >>> user 0m47.724s >>> sys 11m56.724s >>> >>> Most of the excessive CPU usage is coming from the blk-mq layer, and >>> XFS is barely showing up in the profiles at all - the IDA tag >>> allocator is burning 8 CPUs at about 60,000 write IOPS.... >>> >>> I know that the tag allocator has been rewritten, so I tested >>> against a current a current Linus kernel with the XFS aio-fsync >>> patch. The results are all over the place - from several sequential >>> runs of the same test (removing the files in between so each tests >>> starts from an empty fs): >>> >>> Wall time sys time IOPS files/s >>> 4m58.151s 11m12.648s 30,000 13,500 >>> 4m35.075s 12m45.900s 45,000 15,000 >>> 3m10.665s 11m15.804s 65,000 21,000 >>> 3m27.384s 11m54.723s 85,000 20,000 >>> 3m59.574s 11m12.012s 50,000 16,500 >>> 4m12.704s 12m15.720s 50,000 17,000 >>> >>> The 3.15 based kernel was pretty consistent around the 4m10 mark, >>> generally only +/-10s in runtime and not much change in system time. >>> The files/s rate reported by fs_mark doesn't vary that much, either. >>> So the new tag allocator seems to be no better in terms of IO >>> dispatch scalability, yet adds significant variability to IO >>> performance. >>> >>> What I noticed is a massive jump in context switch overhead: from >>> around 250,000/s to over 800,000/s and the CPU profiles show that >>> this comes from the new tag allocator: >>> >>> - 34.62% [kernel] [k] _raw_spin_unlock_irqrestore >>> - _raw_spin_unlock_irqrestore >>> - 58.22% prepare_to_wait >>> 100.00% bt_get >>> blk_mq_get_tag >>> __blk_mq_alloc_request >>> blk_mq_map_request >>> blk_sq_make_request >>> generic_make_request >>> - 22.51% virtio_queue_rq >>> __blk_mq_run_hw_queue >>> .... >>> - 21.56% [kernel] [k] _raw_spin_unlock_irq >>> - _raw_spin_unlock_irq >>> - 58.73% __schedule >>> - 53.42% io_schedule >>> 99.88% bt_get >>> blk_mq_get_tag >>> __blk_mq_alloc_request >>> blk_mq_map_request >>> blk_sq_make_request >>> generic_make_request >>> - 35.58% schedule >>> + 49.31% worker_thread >>> + 32.45% schedule_timeout >>> + 10.35% _xfs_log_force_lsn >>> + 3.10% xlog_cil_force_lsn >>> .... >>> >>> The new block-mq tag allocator is hammering the waitqueues and >>> that's generating a large amount of lock contention. It looks like >>> the new allocator replaces CPU burn doing work in the IDA allocator >>> with the same amount of CPU burn from extra context switch >>> overhead.... >>> >>> Oh, OH. Now I understand! >>> >>> # echo 4 > /sys/block/vdc/queue/nr_requests >>> >>> >>> >>> 80.56% [kernel] [k] _raw_spin_unlock_irqrestore >>> - _raw_spin_unlock_irqrestore >>> - 98.49% prepare_to_wait >>> bt_get >>> blk_mq_get_tag >>> __blk_mq_alloc_request >>> blk_mq_map_request >>> blk_sq_make_request >>> generic_make_request >>> + submit_bio >>> + 1.07% finish_wait >>> + 13.63% [kernel] [k] _raw_spin_unlock_irq >>> ... >>> >>> It's context switch bound at 800,000 context switches/s, burning all >>> 16 CPUs waking up and going to sleep and doing very little real >>> work. How little real work? About 3000 IOPS for 2MB/s of IO. That >>> amount of IO should only take a single digit CPU percentage of one >>> CPU. >> >> With thousands of threads? I think not. Sanely submitted 3000 IOPS, >> correct, I would agree with you. >> >>> This seems like bad behaviour to have on a congested block device, >>> even a high performance one.... >> >> That is pretty much the suck. How do I reproduce this (eg what are >> you running, and what are the xfs aio fsync patches)? Even if > > http://oss.sgi.com/pipermail/xfs/2014-June/036773.html > >> dispatching thousands of threads to do IO is a bad idea (it very >> much is), gracefully handling is a must. I haven't seen any bad >> behavior from the new allocator, it seems to be well behaved (for >> most normal cases, anyway). I'd like to take a stab at ensuring this >> works, too. >> >> If you tell me exactly what you are running, I'll reproduce and get >> this fixed up tomorrow. > > Test case - take fs_mark: > > git://oss.sgi.com/dgc/fs_mark > > Apply the patch for aio fsync support: > > http://oss.sgi.com/pipermail/xfs/2014-June/036774.html > > Run this test: > > $ time ./fs_mark -D 10000 -S5 -n 50000 -s 4096 -L 5 -A \ > -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 \ > -d /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 \ > -d /mnt/scratch/6 -d /mnt/scratch/7 -d /mnt/scratch/8 \ > -d /mnt/scratch/9 -d /mnt/scratch/10 -d /mnt/scratch/11 \ > -d /mnt/scratch/12 -d /mnt/scratch/13 -d /mnt/scratch/14 \ > -d /mnt/scratch/15 > > Drop the "-A" if you want to use normal fsync (but then you won't > see the problem). > > Use a XFS filesystem that has at least 32 AGs (I'm using > a sparse 500TB fs image on a virtio device). I'm also using mkfs > options of "-m crc=1,finobt=1", but to use that last one you'll need > a mkfs built from the xfsprogs git tree. It shouldn't make any > difference to the result, though. I'm running on a VM with 16 CPUs > and 16GB RAM, using fakenuma=4. Can you try with this patch? -- Jens Axboe