All of lore.kernel.org
 help / color / mirror / Atom feed
From: Dennis Zhou <dennis@kernel.org>
To: Mike Snitzer <snitzer@kernel.org>
Cc: tj@kernel.org, axboe@kernel.dk, linux-block@vger.kernel.org,
	dm-devel@redhat.com
Subject: Re: can we reduce bio_set_dev overhead due to bio_associate_blkg?
Date: Wed, 30 Mar 2022 08:28:28 -0400	[thread overview]
Message-ID: <YkRM7Iyp8m6A1BCl@fedora> (raw)
In-Reply-To: <YkSK6mU1fja2OykG@redhat.com>

Hi Mike,

On Wed, Mar 30, 2022 at 12:52:58PM -0400, Mike Snitzer wrote:
> Hey Tejun and Dennis,
> 
> I recently found that due to bio_set_dev()'s call to
> bio_associate_blkg(), bio_set_dev() needs much more cpu than ideal;
> especially when doing 4K IOs via io_uring's HIPRI bio-polling.
> 
> I'm very naive about blk-cgroups.. so I'm hopeful you or others can
> help me cut through this to understand what the ideal outcome should
> be for DM's bio clone + remap heavy use-case as it relates to
> bio_associate_blkg.
> 
> If I hack dm-linear with a local __bio_set_dev that simply removes
> the call to bio_associate_blkg() my IOPS go from ~980K to 995K.
> 
> Looking at what is happening a bit, relative to this DM bio cloning
> usecase, it seems __bio_clone() calls bio_clone_blkg_association() to
> clone the blkg from DM device, then dm-linear.c:linear_map's call
> to bio_set_dev() will cause bio_associate_blkg(bio) to reuse the css
> but then it triggers an update because the bdev is being remapped in
> the bio (due to linear_map sending the IO to the real underlying
> device). End result _seems_ like collective wasteful effort to get the
> blk-cgroup resources setup properly in the face of a simple remap.
> 
> Seems the current DM pattern is causing repeat blkg work for _every_
> remapped bio?  Do you see a way to speed up repeat calls to
> bio_associate_blkg()?
> 

I must admit I wrote this with limited knowledge of bio cloning at the
time. I can fill in the thought process here.

The idea was every bio should have a blkg associated with it for io
accounting and things like blk-iolatency and blk-iocost. The device
abstraction I believe means we can set limits here as well on submission
rate to the md device.

I think cloning is a special case that I might have gotten wrong. If
there is a bio_set_dev() call after each clone(), then the
bio_clone_blkg_association() is excess work. We'd need to audit how
bio_alloc_clone() is being used to be safe. Alternatively, we could opt
for a bio_alloc_clone_noblkg(), but that's a little bit uglier.

1. bio_set_dev() above md <- needed so we can do throttling on the md.
2. bio_alloc_clone() <- doesn't need to clone the blkg() info.
3. bio_set_dev() in md <- sets the right underlying device association.

Thanks,
Dennis

> Test kernel is my latest dm-5.19 branch (though latest Linus 5.18-rc0
> kernel should be fine too):
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-5.19
> 
> I'm using dm-linear ontop on a 16G blk-mq null_blk device:
> 
> modprobe null_blk queue_mode=2 poll_queues=2 bs=4096 gb=16
> SIZE=`blockdev --getsz /dev/nullb0`
> echo "0 $SIZE linear /dev/nullb0 0" | dmsetup create linear
> 
> And running the workload with fio using this wrapper script:
> io_uring.sh 20 1 /dev/mapper/linear 4096
> 
> #!/bin/bash
> 
> RTIME=$1
> JOBS=$2
> DEV=$3
> BS=$4
> 
> QD=64
> BATCH=16
> HI=1
> 
> fio --bs=$BS --ioengine=io_uring --fixedbufs --registerfiles --hipri=$HI \
>         --iodepth=$QD \
>         --iodepth_batch_submit=$BATCH \
>         --iodepth_batch_complete_min=$BATCH \
>         --filename=$DEV \
>         --direct=1 --runtime=$RTIME --numjobs=$JOBS --rw=randread \
>         --name=test --group_reporting

WARNING: multiple messages have this Message-ID (diff)
From: Dennis Zhou <dennis@kernel.org>
To: Mike Snitzer <snitzer@kernel.org>
Cc: tj@kernel.org, axboe@kernel.dk, dm-devel@redhat.com,
	linux-block@vger.kernel.org
Subject: Re: [dm-devel] can we reduce bio_set_dev overhead due to bio_associate_blkg?
Date: Wed, 30 Mar 2022 08:28:28 -0400	[thread overview]
Message-ID: <YkRM7Iyp8m6A1BCl@fedora> (raw)
In-Reply-To: <YkSK6mU1fja2OykG@redhat.com>

Hi Mike,

On Wed, Mar 30, 2022 at 12:52:58PM -0400, Mike Snitzer wrote:
> Hey Tejun and Dennis,
> 
> I recently found that due to bio_set_dev()'s call to
> bio_associate_blkg(), bio_set_dev() needs much more cpu than ideal;
> especially when doing 4K IOs via io_uring's HIPRI bio-polling.
> 
> I'm very naive about blk-cgroups.. so I'm hopeful you or others can
> help me cut through this to understand what the ideal outcome should
> be for DM's bio clone + remap heavy use-case as it relates to
> bio_associate_blkg.
> 
> If I hack dm-linear with a local __bio_set_dev that simply removes
> the call to bio_associate_blkg() my IOPS go from ~980K to 995K.
> 
> Looking at what is happening a bit, relative to this DM bio cloning
> usecase, it seems __bio_clone() calls bio_clone_blkg_association() to
> clone the blkg from DM device, then dm-linear.c:linear_map's call
> to bio_set_dev() will cause bio_associate_blkg(bio) to reuse the css
> but then it triggers an update because the bdev is being remapped in
> the bio (due to linear_map sending the IO to the real underlying
> device). End result _seems_ like collective wasteful effort to get the
> blk-cgroup resources setup properly in the face of a simple remap.
> 
> Seems the current DM pattern is causing repeat blkg work for _every_
> remapped bio?  Do you see a way to speed up repeat calls to
> bio_associate_blkg()?
> 

I must admit I wrote this with limited knowledge of bio cloning at the
time. I can fill in the thought process here.

The idea was every bio should have a blkg associated with it for io
accounting and things like blk-iolatency and blk-iocost. The device
abstraction I believe means we can set limits here as well on submission
rate to the md device.

I think cloning is a special case that I might have gotten wrong. If
there is a bio_set_dev() call after each clone(), then the
bio_clone_blkg_association() is excess work. We'd need to audit how
bio_alloc_clone() is being used to be safe. Alternatively, we could opt
for a bio_alloc_clone_noblkg(), but that's a little bit uglier.

1. bio_set_dev() above md <- needed so we can do throttling on the md.
2. bio_alloc_clone() <- doesn't need to clone the blkg() info.
3. bio_set_dev() in md <- sets the right underlying device association.

Thanks,
Dennis

> Test kernel is my latest dm-5.19 branch (though latest Linus 5.18-rc0
> kernel should be fine too):
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-5.19
> 
> I'm using dm-linear ontop on a 16G blk-mq null_blk device:
> 
> modprobe null_blk queue_mode=2 poll_queues=2 bs=4096 gb=16
> SIZE=`blockdev --getsz /dev/nullb0`
> echo "0 $SIZE linear /dev/nullb0 0" | dmsetup create linear
> 
> And running the workload with fio using this wrapper script:
> io_uring.sh 20 1 /dev/mapper/linear 4096
> 
> #!/bin/bash
> 
> RTIME=$1
> JOBS=$2
> DEV=$3
> BS=$4
> 
> QD=64
> BATCH=16
> HI=1
> 
> fio --bs=$BS --ioengine=io_uring --fixedbufs --registerfiles --hipri=$HI \
>         --iodepth=$QD \
>         --iodepth_batch_submit=$BATCH \
>         --iodepth_batch_complete_min=$BATCH \
>         --filename=$DEV \
>         --direct=1 --runtime=$RTIME --numjobs=$JOBS --rw=randread \
>         --name=test --group_reporting

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


  reply	other threads:[~2022-03-30 19:23 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-03-30 16:52 can we reduce bio_set_dev overhead due to bio_associate_blkg? Mike Snitzer
2022-03-30 16:52 ` [dm-devel] " Mike Snitzer
2022-03-30 12:28 ` Dennis Zhou [this message]
2022-03-30 12:28   ` Dennis Zhou
2022-03-31  4:39   ` Christoph Hellwig
2022-03-31  4:39     ` [dm-devel] " Christoph Hellwig
2022-03-31  5:52     ` Dennis Zhou
2022-03-31  5:52       ` [dm-devel] " Dennis Zhou
2022-03-31  9:15       ` Christoph Hellwig
2022-03-31  9:15         ` [dm-devel] " Christoph Hellwig
2022-04-08 15:42         ` Mike Snitzer
2022-04-08 15:42           ` [dm-devel] " Mike Snitzer
2022-04-09  5:15           ` Christoph Hellwig
2022-04-09  5:15             ` [dm-devel] " Christoph Hellwig
2022-04-11 16:58             ` Mike Snitzer
2022-04-11 16:58               ` [dm-devel] " Mike Snitzer
2022-04-11 17:16               ` Mike Snitzer
2022-04-11 17:16                 ` [dm-devel] " Mike Snitzer
2022-04-11 17:33                 ` [PATCH] block: remove redundant blk-cgroup init from __bio_clone Mike Snitzer
2022-04-11 17:33                   ` [dm-devel] " Mike Snitzer
2022-04-12  5:27                   ` Christoph Hellwig
2022-04-12  5:27                     ` [dm-devel] " Christoph Hellwig
2022-04-12  7:52                     ` Dennis Zhou
2022-04-12  7:52                       ` Dennis Zhou
2022-04-23 16:55                   ` Christoph Hellwig
2022-04-23 16:55                     ` [dm-devel] " Christoph Hellwig
2022-04-26 17:30                     ` Mike Snitzer
2022-04-26 17:30                       ` [dm-devel] " Mike Snitzer

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YkRM7Iyp8m6A1BCl@fedora \
    --to=dennis@kernel.org \
    --cc=axboe@kernel.dk \
    --cc=dm-devel@redhat.com \
    --cc=linux-block@vger.kernel.org \
    --cc=snitzer@kernel.org \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.