* Re: [dm-devel] dm-crypt performance
[not found] <Pine.LNX.4.64.1303252051520.9745@file.rdu.redhat.com>

From: Alasdair G Kergon @ 2013-03-26 12:27 UTC (permalink / raw)
To: Mikulas Patocka
Cc: Mike Snitzer, dm-devel, Andi Kleen, dm-crypt, Milan Broz, linux-kernel, Christoph Hellwig, Christian Schmidt

[Adding dm-crypt + linux-kernel]

On Mon, Mar 25, 2013 at 11:47:22PM -0400, Mikulas Patocka wrote:
> I performed some dm-crypt performance tests as Mike suggested.
>
> It turns out that unbound workqueue performance has improved somewhere
> between kernel 3.2 (when I made the dm-crypt patches) and 3.8, so the
> patches for hand-built dispatch are no longer needed.
>
> For RAID-0 composed of two disks with total throughput 260MB/s, the
> unbound workqueue performs as well as the hand-built dispatch (both
> sustain the 260MB/s transfer rate).
>
> For a ramdisk, the unbound workqueue performs better than hand-built
> dispatch (620MB/s vs 400MB/s). The unbound workqueue with the patch
> that Mike suggested (git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git)
> improves performance slightly on the ramdisk compared to 3.8
> (700MB/s vs. 620MB/s).
>
> However, there is still the problem with request ordering. Milan found
> out that under some circumstances parallel dm-crypt has worse
> performance than the previous dm-crypt code. I found out that this is
> not caused by deficiencies in the code that distributes work to
> individual processors. The performance drop is caused by the fact that
> distributing write bios to multiple processors causes the encryption
> to finish out of order, and the I/O scheduler is unable to merge these
> out-of-order bios.
>
> The deadline and noop schedulers perform better (only 50% slowdown
> compared to the old dm-crypt); CFQ performs very badly (8 times
> slowdown).
>
> If I sort the requests in dm-crypt to come out in the same order as
> they were received, there is no longer any slowdown; the new crypt
> performs as well as the old crypt. But the last time I submitted the
> patches, people objected to sorting requests in dm-crypt, saying that
> the I/O scheduler should sort them. But it doesn't. This problem still
> persists in current kernels.
>
> For best performance we could use the unbound workqueue implementation
> with request sorting, if people don't object to the request sorting
> being done in dm-crypt.

On Tue, Mar 26, 2013 at 02:52:29AM -0400, Christoph Hellwig wrote:
> FYI, XFS also does its own request ordering for the metadata buffers,
> because it knows the needed ordering and has a bigger view than
> especially CFQ. You at least have a precedent in a widely used
> subsystem for this code.

So please post this updated version of the patches for a wider group of
people to try out.

Alasdair

^ permalink raw reply [flat|nested] 45+ messages in thread
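Mikulas' observation above — that scattering encryption across cores makes contiguous write bios complete out of order, so the elevator can no longer merge them — can be illustrated with a toy model. The sketch below (Python; all names and the merge rule are illustrative, not the kernel's actual merge logic) counts how many adjacent merges are possible for a sequential stream before and after a random shuffle simulating out-of-order completion.

```python
import random

def count_merges(bios):
    """Count adjacent merge opportunities: a (sector, length) bio can be
    merged with the previous one when it starts where that one ended."""
    merges = 0
    for prev, cur in zip(bios, bios[1:]):
        if prev[0] + prev[1] == cur[0]:
            merges += 1
    return merges

# A 256 KiB direct write split into 64 contiguous 4 KiB bios (8 sectors each).
inorder = [(8 * i, 8) for i in range(64)]

# Encryption finishing on many cores delivers the same bios in arbitrary order.
random.seed(0)
scattered = random.sample(inorder, len(inorder))

print(count_merges(inorder))    # 63: everything folds into one big request
print(count_merges(scattered))  # almost nothing is mergeable any more
```

The same data either collapses into one large sequential request or degenerates into random 4K writes, which matches the 8x CFQ slowdown described above.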
* Re: [dm-devel] dm-crypt performance
From: Milan Broz @ 2013-03-26 20:05 UTC
To: Mikulas Patocka
Cc: Mike Snitzer, dm-devel, Andi Kleen, dm-crypt, Milan Broz, linux-kernel, Christoph Hellwig, Christian Schmidt

On 26.3.2013 13:27, Alasdair G Kergon wrote:
> [Adding dm-crypt + linux-kernel]

Thanks.

> On Mon, Mar 25, 2013 at 11:47:22PM -0400, Mikulas Patocka wrote:
>> I performed some dm-crypt performance tests as Mike suggested.
>>
>> It turns out that unbound workqueue performance has improved somewhere
>> between kernel 3.2 (when I made the dm-crypt patches) and 3.8, so the
>> patches for hand-built dispatch are no longer needed.
>>
>> For RAID-0 composed of two disks with total throughput 260MB/s, the
>> unbound workqueue performs as well as the hand-built dispatch (both
>> sustain the 260MB/s transfer rate).
>>
>> For a ramdisk, the unbound workqueue performs better than hand-built
>> dispatch (620MB/s vs 400MB/s). The unbound workqueue with the patch
>> that Mike suggested (git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git)
>> improves performance slightly on the ramdisk compared to 3.8
>> (700MB/s vs. 620MB/s).

I found that ramdisk tests are usually quite misleading for dm-crypt.
Better to use some fast SSD, ideally in RAID-0 (so you get >500MB/s or
so). Also be sure you compare recent machines which use AES-NI.

For reference, a null cipher (no crypt, data copy only) works as well,
but this is not a real-world scenario.

After introducing Andi's patches, we created a performance regression
for people who had built "RAID over several dm-crypt devices" (all IOs
were processed by one core). A rarely used setup, but several people
complained.

But most people reported that the current approach works much better
(even with a stupid dd test - I think it is because the page cache
submits requests from different CPUs, so it in fact runs in parallel).

But using dd with direct-io is a trivial way to simulate the "problem".
(I guess we all like using dd for performance testing... :-])

>> However, there is still the problem with request ordering. Milan found
>> out that under some circumstances parallel dm-crypt has worse
>> performance than the previous dm-crypt code. I found out that this is
>> not caused by deficiencies in the code that distributes work to
>> individual processors. The performance drop is caused by the fact that
>> distributing write bios to multiple processors causes the encryption
>> to finish out of order, and the I/O scheduler is unable to merge these
>> out-of-order bios.

If the IO scheduler is unable to merge these requests because of
out-of-order bios, please try to FIX the IO scheduler and do not invent
workarounds in dm-crypt. (With recent accelerated crypto this should
not happen so often, btw.)

I know it is not easy, but I really do not like that the "little-walled
device-mapper garden" again contains something that should be done on a
different layer.

>> The deadline and noop schedulers perform better (only 50% slowdown
>> compared to the old dm-crypt); CFQ performs very badly (8 times
>> slowdown).
>>
>> If I sort the requests in dm-crypt to come out in the same order as
>> they were received, there is no longer any slowdown; the new crypt
>> performs as well as the old crypt. But the last time I submitted the
>> patches, people objected to sorting requests in dm-crypt, saying that
>> the I/O scheduler should sort them. But it doesn't. This problem still
>> persists in current kernels.

I probably have no vote here anymore, but for the record: I am strictly
against any sorting of requests in dm-crypt.

My reasons are:

- dm-crypt should be a simple transparent layer (doing one thing -
  encryption); sorting of requests was always primarily the IO
  scheduler's domain (which already has well-known knobs to control it)

- Are we sure we are not introducing another side channel in disk
  encryption? (An unprivileged user can measure timing here.)
  (Perhaps a stupid reason, but please do not prefer performance over
  security in encryption. We have enough timing attacks on AES
  implementations already...)

- In my testing (several months ago) the output was very unstable - in
  some situations it helped, in some it was worse. I no longer have
  hard data, but some test output was sent to Alasdair.

>> For best performance we could use the unbound workqueue implementation
>> with request sorting, if people don't object to the request sorting
>> being done in dm-crypt.

So again:

- Why is the IO scheduler not working properly here? Does it need some
  extensions? If fixed, it could help even with some other non-dm-crypt
  IO patterns. (I mean dm-crypt could automagically set some special
  parameter on the underlying device queue to fine-tune sorting
  parameters.)

- Can we have some cpu-bound workqueue which automatically switches to
  unbound (relocates work to another cpu) if it detects some saturation
  watermark etc.? (Again, this could be used in other code.
  http://www.redhat.com/archives/dm-devel/2012-August/msg00288.html
  Yes, I see skepticism there :-)

> On Tue, Mar 26, 2013 at 02:52:29AM -0400, Christoph Hellwig wrote:
>> FYI, XFS also does its own request ordering for the metadata buffers,
>> because it knows the needed ordering and has a bigger view than
>> especially CFQ. You at least have a precedent in a widely used
>> subsystem for this code.

Nice. But XFS is a much more complex system.

Isn't it enough that multipath uses its own IO queue (so we have one IO
scheduler on top of another), and now we have metadata IO sorting in
XFS on top of it, and are planning one more in dm-crypt? Is it really a
good approach?

Milan
* Re: dm-crypt performance
From: Mike Snitzer @ 2013-03-26 20:28 UTC
To: Milan Broz
Cc: Mikulas Patocka, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt, tj

On Tue, Mar 26 2013 at 4:05pm -0400,
Milan Broz <gmazyland@gmail.com> wrote:

> > On Mon, Mar 25, 2013 at 11:47:22PM -0400, Mikulas Patocka wrote:
> >
> >> For best performance we could use the unbound workqueue
> >> implementation with request sorting, if people don't object to the
> >> request sorting being done in dm-crypt.
>
> So again:
>
> - Why is the IO scheduler not working properly here? Does it need some
>   extensions? If fixed, it could help even with some other
>   non-dm-crypt IO patterns. (I mean dm-crypt could automagically set
>   some special parameter on the underlying device queue to fine-tune
>   sorting parameters.)

Not sure, but IO scheduler changes are fairly slow to materialize given
the potential for adverse side-effects. Are you so surprised that a
shotgun blast of IOs might make the IO scheduler less optimal than if
some basic sorting were done at the layer above?

> - Can we have some cpu-bound workqueue which automatically switches to
>   unbound (relocates work to another cpu) if it detects some
>   saturation watermark etc.? (Again, this could be used in other code.
>   http://www.redhat.com/archives/dm-devel/2012-August/msg00288.html
>   Yes, I see skepticism there :-)

Question for Tejun? (now cc'd).

> > On Tue, Mar 26, 2013 at 02:52:29AM -0400, Christoph Hellwig wrote:
> >> FYI, XFS also does its own request ordering for the metadata
> >> buffers, because it knows the needed ordering and has a bigger view
> >> than especially CFQ. You at least have a precedent in a widely used
> >> subsystem for this code.
>
> Nice. But XFS is a much more complex system.
> Isn't it enough that multipath uses its own IO queue (so we have one
> IO scheduler on top of another), and now we have metadata IO sorting
> in XFS on top of it, and are planning one more in dm-crypt? Is it
> really a good approach?

Multipath's request_queue is the only one with an active IO scheduler;
the requests are dispatched directly to the underlying devices' queues
without any IO scheduling.

As for dm-crypt: as you know, it is bio-based, so it is already dealing
with out-of-order IOs (no benefit from an upper-level IO scheduler).
It seems relatively clear, from Mikulas' results, that maybe you're
hoping for a bit too much magic from the IO scheduler gnomes that lurk
on LKML.

BTW, pretty sure btrfs takes care to maintain some IO dispatch ordering
too.

Mike
* Re: dm-crypt performance
From: Milan Broz @ 2013-03-26 20:58 UTC
To: Mike Snitzer
Cc: Milan Broz, Mikulas Patocka, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt, tj

On 26.3.2013 21:28, Mike Snitzer wrote:
> On Tue, Mar 26 2013 at 4:05pm -0400,
> Milan Broz <gmazyland@gmail.com> wrote:
>
>>> On Mon, Mar 25, 2013 at 11:47:22PM -0400, Mikulas Patocka wrote:
>>>
>>>> For best performance we could use the unbound workqueue
>>>> implementation with request sorting, if people don't object to the
>>>> request sorting being done in dm-crypt.
>>
>> So again:
>>
>> - Why is the IO scheduler not working properly here? Does it need
>>   some extensions? If fixed, it could help even with some other
>>   non-dm-crypt IO patterns. (I mean dm-crypt could automagically set
>>   some special parameter on the underlying device queue to fine-tune
>>   sorting parameters.)
>
> Not sure, but IO scheduler changes are fairly slow to materialize
> given the potential for adverse side-effects. Are you so surprised
> that a shotgun blast of IOs might make the IO scheduler less optimal
> than if some basic sorting were done at the layer above?

All I said is that I think the problems should be solved on the proper
layer, where all the mechanisms to control this already exist. Only if
that is not possible should we use such workarounds.

CPU-bound IO in dm-crypt has been in the kernel for more than two years
and I know of just a few cases where it caused real problems. Maybe I
am mistaken - then now is the ideal time for people to complain :)

Anyway, are we talking about the same Mikulas patch I tested months
ago, or do you have something new? I mean this part of the series of
dm-crypt patches:
http://mbroz.fedorapeople.org/dm-crypt/3.6-rc/dm-crypt-25-sort-writes.patch

Milan
* Re: dm-crypt performance
From: Tejun Heo @ 2013-03-28 18:53 UTC
To: Mike Snitzer
Cc: Milan Broz, Mikulas Patocka, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt, Vivek Goyal, Jens Axboe

Hello,

(cc'ing Vivek and Jens for the iosched-related bits)

On Tue, Mar 26, 2013 at 04:28:38PM -0400, Mike Snitzer wrote:
> On Tue, Mar 26 2013 at 4:05pm -0400,
> Milan Broz <gmazyland@gmail.com> wrote:
>
> > > On Mon, Mar 25, 2013 at 11:47:22PM -0400, Mikulas Patocka wrote:
> > >
> > > > For best performance we could use the unbound workqueue
> > > > implementation with request sorting, if people don't object to
> > > > the request sorting being done in dm-crypt.
> >
> > So again:
> >
> > - Why is the IO scheduler not working properly here? Does it need
> >   some extensions? If fixed, it could help even with some other
> >   non-dm-crypt IO patterns. (I mean dm-crypt could automagically set
> >   some special parameter on the underlying device queue to fine-tune
> >   sorting parameters.)
>
> Not sure, but IO scheduler changes are fairly slow to materialize
> given the potential for adverse side-effects. Are you so surprised
> that a shotgun blast of IOs might make the IO scheduler less optimal
> than if some basic sorting were done at the layer above?

My memory is already pretty hazy, but Vivek should be able to correct
me if I say something nonsensical. The thing is, the order and timings
of IOs coming down from the upper layers have certain meanings to
ioscheds, and they exploit those patterns to do better scheduling.

Reordering IOs randomly actually loses certain information about the IO
stream and makes ioscheds mis-classify it - e.g. what could have been
classified as "mostly consecutive streaming IO" could, after such
reordering, fail to be detected as such.

Sure, ioscheds can probably be improved to compensate for such
temporary localized reorderings, but nothing is free, and given that
most of the upper stacks already do a pretty good job of issuing IOs
orderly when possible, it would be a bit silly to do more than usually
necessary in ioscheds.

So, no, I don't think maintaining IO order in stacking drivers is a bad
idea. I actually think all stacking drivers should do that; otherwise,
they really are destroying actually useful side-band information.

> > - Can we have some cpu-bound workqueue which automatically switches
> >   to unbound (relocates work to another cpu) if it detects some
> >   saturation watermark etc.? (Again, this could be used in other
> >   code.
> >   http://www.redhat.com/archives/dm-devel/2012-August/msg00288.html
> >   Yes, I see skepticism there :-)
>
> Question for Tejun? (now cc'd).

Unbound workqueues went through quite a bit of improvement lately and
are currently growing NUMA affinity support. Once merged, all unbound
work items issued on a NUMA node will be processed on the same NUMA
node, which should mitigate some, though unfortunately not all, of the
disadvantages compared to per-cpu ones.

Mikulas, can you share more about your test setup? Was it a NUMA
machine? Which wq branch did you use?

The NUMA affinity support would have a less severe but similar issue as
per-cpu: if all IOs are being issued from one node while the other
nodes are idle, that specific node can get saturated. NUMA affinity is
adjustable both from inside the kernel and from userland via sysfs, so
there are control knobs for corner cases.

As for maintaining CPU or NUMA affinity until the CPU / node is
saturated and spilling to other CPUs/nodes beyond that - yeah, an
interesting idea. It's non-trivial and would have to incorporate a lot
of notions of "load", similar to the scheduler. It really becomes a
generic load-balancing problem, as it'd be pointless and actually
harmful to, say, spill work items back and forth between two saturated
NUMA nodes.

So, if the brunt of scattering the workload across random CPUs can be
avoided by NUMA affinity, that could be a reasonable tradeoff, I think.

Thanks.

--
tejun
* Re: dm-crypt performance
From: Vivek Goyal @ 2013-03-28 19:33 UTC
To: Tejun Heo
Cc: Mike Snitzer, Milan Broz, Mikulas Patocka, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt, Jens Axboe

On Thu, Mar 28, 2013 at 11:53:27AM -0700, Tejun Heo wrote:
> Hello,
>
> (cc'ing Vivek and Jens for the iosched-related bits)
>
> My memory is already pretty hazy, but Vivek should be able to correct
> me if I say something nonsensical. The thing is, the order and timings
> of IOs coming down from the upper layers have certain meanings to
> ioscheds, and they exploit those patterns to do better scheduling.
>
> Reordering IOs randomly actually loses certain information about the
> IO stream and makes ioscheds mis-classify it - e.g. what could have
> been classified as "mostly consecutive streaming IO" could, after
> such reordering, fail to be detected as such. Sure, ioscheds can
> probably be improved to compensate for such temporary localized
> reorderings, but nothing is free, and given that most of the upper
> stacks already do a pretty good job of issuing IOs orderly when
> possible, it would be a bit silly to do more than usually necessary
> in ioscheds.
>
> So, no, I don't think maintaining IO order in stacking drivers is a
> bad idea. I actually think all stacking drivers should do that;
> otherwise, they really are destroying actually useful side-band
> information.

I am curious why out-of-order bios are a problem. Doesn't the elevator
already merge bios with existing requests, and if merging does not
happen, aren't the requests sorted into order anyway? So why is the
ordering not happening properly with dm-crypt? What additional
information does dm-crypt have that lets it do better ordering than the
IO scheduler?

CFQ might be seeing a bigger performance hit because we maintain
per-process queues and the kernel threads might not be sharing the IO
context (I am not sure). So if all the crypto threads can share the IO
context, at least that will make sure all IO from them goes into a
single queue.

So it would help if somebody could explain why the existing merging and
sorting logic is not working well with dm-crypt, and what additional
information dm-crypt has which can help it do a better job.

Thanks
Vivek
* Re: dm-crypt performance
From: Tejun Heo @ 2013-03-28 19:44 UTC
To: Vivek Goyal
Cc: Mike Snitzer, Milan Broz, Mikulas Patocka, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt, Jens Axboe

Hello,

On Thu, Mar 28, 2013 at 03:33:43PM -0400, Vivek Goyal wrote:
> I am curious why out-of-order bios are a problem. Doesn't the elevator
> already merge bios with existing requests, and if merging does not
> happen, aren't the requests sorted into order anyway? So why is the
> ordering not happening properly with dm-crypt? What additional
> information does dm-crypt have that lets it do better ordering than
> the IO scheduler?

Hmmm... well, for one, it doesn't only change the ordering. It also
changes the timings. Before, the iosched would get a contiguous stream
of IOs when the queue gets unplugged. (BTW, how does dm-crypt handle
plugging? If not handled properly, it could definitely affect a lot of
things.) With multiple threads doing encryption in the middle, the
iosched could get scattered IOs which could easily span multiple
milliseconds. Even if context tagging were done properly, it could
easily lead to much less efficient IO patterns to the hardware.

Keeping IO order combined with proper plug handling would keep not only
the ordering but also the relative timing of events constant, which is
an important factor when scheduling IOs.

> CFQ might be seeing a bigger performance hit because we maintain
> per-process queues and the kernel threads might not be sharing the IO
> context (I am not sure). So if all the crypto threads can share the
> IO context, at least that will make sure all IO from them goes into a
> single queue.

Right, this is important too, although I fail to see how workqueue vs.
custom dispatch would make any difference here. dm-crypt should
definitely be using bio_associate_current().

Thanks.

--
tejun
* Re: dm-crypt performance
From: Vivek Goyal @ 2013-03-28 20:38 UTC
To: Tejun Heo
Cc: Mike Snitzer, Milan Broz, Mikulas Patocka, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt, Jens Axboe

On Thu, Mar 28, 2013 at 12:44:43PM -0700, Tejun Heo wrote:
> Hmmm... well, for one, it doesn't only change the ordering. It also
> changes the timings. Before, the iosched would get a contiguous
> stream of IOs when the queue gets unplugged. (BTW, how does dm-crypt
> handle plugging? If not handled properly, it could definitely affect
> a lot of things.) With multiple threads doing encryption in the
> middle, the iosched could get scattered IOs which could easily span
> multiple milliseconds. Even if context tagging were done properly, it
> could easily lead to much less efficient IO patterns to the hardware.
>
> Keeping IO order combined with proper plug handling would keep not
> only the ordering but also the relative timing of events constant,
> which is an important factor when scheduling IOs.

If the timing of unordered IO is an issue, then dm-crypt can try to
batch IO submission using blk_start_plug()/blk_finish_plug(). That way
dm-crypt can batch bios and control submission, and there should not be
a need to put specific ordering logic in dm-crypt. So if there are
multiple threads doing crypto, they either end up submitting bios after
the crypto operation, or queue them somewhere and there is a single
submitter thread.

If completion of the crypto is an issue, then I think it is very hard
to determine whether extra waiting helps throughput or hurts it. If
dm-crypt can decide that somehow, then I guess it can just try to do
batch submission of IO from the various crypto threads and see if it
helps performance. (At some point, the submitter thread will become a
bottleneck.)

> > CFQ might be seeing a bigger performance hit because we maintain
> > per-process queues and the kernel threads might not be sharing the
> > IO context (I am not sure). So if all the crypto threads can share
> > the IO context, at least that will make sure all IO from them goes
> > into a single queue.
>
> Right, this is important too, although I fail to see how workqueue
> vs. custom dispatch would make any difference here. dm-crypt should
> definitely be using bio_associate_current().

Agreed. bio_associate_current() will at least help keep all bios of a
single context in a single queue and promote more merging (if
submission happens in the right time frame).

Thanks
Vivek
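Vivek's batching suggestion can be modeled in miniature. The sketch below (Python; the `Plug` class is a hypothetical stand-in for the behavior of the kernel's blk_start_plug()/blk_finish_plug(), not the real API) shows the effect being discussed: bios queued under a plug reach the lower layer as one batch instead of a dribble of single submissions.

```python
class Plug:
    """Hypothetical stand-in for plugging: bios submitted while the
    plug is held are held back and dispatched as one batch on finish."""
    def __init__(self):
        self.pending = []
        self.batches = []   # each entry: one batch handed down at once

    def submit(self, bio):
        self.pending.append(bio)

    def finish(self):
        if self.pending:
            self.batches.append(list(self.pending))
            self.pending.clear()

plug = Plug()
for sector in range(0, 512, 8):     # 64 encrypted 4 KiB bios
    plug.submit((sector, 8))
plug.finish()

print(len(plug.batches))        # 1 batch instead of 64 separate dispatches
print(len(plug.batches[0]))     # 64
```

A batch that arrives together gives the elevator the contiguous view it needs for merging, which is the point of propagating plugging boundaries through a stacking driver.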
* Re: dm-crypt performance
From: Tejun Heo @ 2013-03-28 20:45 UTC
To: Vivek Goyal
Cc: Mike Snitzer, Milan Broz, Mikulas Patocka, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt, Jens Axboe

Hello,

On Thu, Mar 28, 2013 at 04:38:08PM -0400, Vivek Goyal wrote:
> If the timing of unordered IO is an issue, then dm-crypt can try to
> batch IO submission using blk_start_plug()/blk_finish_plug(). That
> way dm-crypt can batch bios and control submission, and there should
> not be a need to put specific ordering logic in dm-crypt.

Yes, it has to preserve and propagate the plugging boundaries, and if
you think about the implementation, maintaining issue order doesn't
really require "sorting" per se. Just keep the list of bios received -
but still going through encryption - in the received order, with a
counter of in-progress bios in the plugging boundary. Link the outputs
to the source bios somehow, and when the counter hits zero, issue them
in the same order.

While keeping the specific order itself might not be essential, it's
not gonna add any significant complexity or runtime overhead, and I
think it generally is a good idea for stacking drivers to preserve as
much information and context as possible.

Thanks.

--
tejun
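The scheme Tejun sketches - no sector sorting, just releasing encrypted bios in the order they were received - amounts to draining the ready prefix of the receive queue whenever an encryption completes. A toy model of that idea (Python; names are illustrative, and real code would track a per-plug in-flight counter rather than a set):

```python
from collections import deque

def reorder_completions(received, completion_order):
    """Release bios in the order they were received, even though
    encryption finishes out of order: each completion marks its bio
    done, then the longest done prefix of the receive queue is issued."""
    done = set()
    issued = []
    queue = deque(received)
    for bio in completion_order:
        done.add(bio)
        while queue and queue[0] in done:
            issued.append(queue.popleft())
    return issued

received = list(range(8))                 # bios in submission order
completions = [3, 0, 1, 5, 2, 4, 7, 6]    # encryption finishes out of order
print(reorder_completions(received, completions))   # [0, 1, 2, 3, 4, 5, 6, 7]
```

The lower layer thus sees exactly the original issue order (and, with plug boundaries preserved, roughly the original timing), without any rb-tree or sector comparison.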
* dm-crypt parallelization patches
From: Mikulas Patocka @ 2013-04-09 17:51 UTC
To: Jens Axboe
Cc: Vivek Goyal, Tejun Heo, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt

Hi

I placed the dm-crypt parallelization patches at:
http://people.redhat.com/~mpatocka/patches/kernel/dm-crypt-paralelizace/current/

The patches parallelize dm-crypt and make it possible to use all
processor cores.

The patch dm-crypt-remove-percpu.patch removes some percpu variables
and replaces them with per-request variables.

The patch dm-crypt-unbound-workqueue.patch sets WQ_UNBOUND on the
encryption workqueue, allowing the encryption to be distributed to all
CPUs in the system.

The patch dm-crypt-offload-writes-to-thread.patch moves submission of
all write requests to a single thread.

The patch dm-crypt-sort-requests.patch sorts the write requests
submitted by that single thread. The requests are sorted according to
the sector number; an rb-tree is used for efficient sorting.

Some usage notes:

* Turn off automatic CPU frequency scaling (or set it to the
  "performance" governor) - cpufreq doesn't recognize the encryption
  workload correctly; sometimes it underclocks all the CPU cores when
  there is encryption work to do, resulting in bad performance.

* When using a filesystem on an encrypted dm-crypt device, reduce the
  maximum request size via "/sys/block/dm-2/queue/max_sectors_kb"
  (substitute "dm-2" with the real name of your dm-crypt device). Note
  that too-big requests mean there is a small number of requests which
  cannot be distributed to all available processors in parallel,
  resulting in worse performance; too-small requests result in high
  per-request overhead and also reduced performance. So you must find
  the optimal request size for your system and workload. For me, when
  testing this on a ramdisk, the optimum is 8KiB.

---

Now, the problem with the I/O scheduler: when doing performance
testing, it turns out that the parallel version is sometimes worse than
the previous implementation.

When I create a 4.3GiB dm-crypt device on top of dm-loop on top of an
ext2 filesystem on a 15k SCSI disk and run this command

time fio --rw=randrw --size=64M --bs=256k --filename=/dev/mapper/crypt
--direct=1 --name=job1 --name=job2 --name=job3 --name=job4 --name=job5
--name=job6 --name=job7 --name=job8 --name=job9 --name=job10
--name=job11 --name=job12

the results are this:

CFQ scheduler:
--------------
no patches:                            21.9s
patch 1:                               21.7s
patches 1,2:                           2:33s
patches 1,2 (+ nr_requests=1280000):   2:18s
patches 1,2,3:                         20.7s
patches 1,2,3,4:                       20.7s

deadline scheduler:
-------------------
no patches:                            27.4s
patch 1:                               27.4s
patches 1,2:                           27.8s
patches 1,2,3:                         29.6s
patches 1,2,3,4:                       29.6s

We can see that CFQ performs badly with patch 2, but improves with
patch 3. All that patch 3 does is move the write requests from the
encryption threads to a separate thread. So it seems that CFQ has some
deficiency in that it cannot merge adjacent requests submitted by
different processes.

The problem is this:
- we have a 256k direct-I/O write request
- it is broken into 4k bios (because we run on dm-loop on a filesystem
  with 4k block size)
- encryption of these 4k bios is distributed to 12 processes on a
  12-core machine
- encryption finishes out of order, and the 4k bios with encrypted data
  are submitted to CFQ from different processes
- CFQ doesn't merge them
- the disk is flooded with random 4k write requests and performs much
  worse than with 256k requests

Increasing nr_requests to 1280000 helps a little, but not much - it is
still an order of magnitude slower.

I'd like to ask if someone who knows the CFQ scheduler (Jens?) could
look at it and find out why it doesn't merge requests from different
processes. Why do I have to perform a seemingly senseless operation
(hand the write requests over to a separate thread) in patch 3 to
improve performance?

Mikulas
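The dm-crypt-sort-requests.patch idea described above - a single dispatch thread emitting encrypted writes in sector order - can be approximated with any ordered structure. A sketch (Python; a heap stands in for the kernel patch's rb-tree, and all names are illustrative):

```python
import heapq

class WriteSorter:
    """Toy model of the sort-writes idea: encrypted write bios arrive
    out of order from the worker threads; the single dispatch thread
    emits them sorted by starting sector."""
    def __init__(self):
        self._heap = []

    def add(self, sector, bio):
        # O(log n) insertion keyed on sector, like an rb-tree insert.
        heapq.heappush(self._heap, (sector, bio))

    def drain(self):
        # Pop in ascending sector order, like an in-order rb-tree walk.
        out = []
        while self._heap:
            out.append(heapq.heappop(self._heap))
        return out

sorter = WriteSorter()
for sector in [56, 8, 40, 0, 24, 48, 16, 32]:   # arrival (completion) order
    sorter.add(sector, f"bio@{sector}")

print([sec for sec, _ in sorter.drain()])   # [0, 8, 16, 24, 32, 40, 48, 56]
```

One caveat this model glosses over: a heap only gives ordered drain, while the kernel patch's rb-tree also lets new bios be inserted between dispatches; either way, the dispatch thread presents the elevator with a sector-ascending, merge-friendly stream.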
* Re: dm-crypt parallelization patches
From: Tejun Heo @ 2013-04-09 17:57 UTC
To: Mikulas Patocka
Cc: Jens Axboe, Vivek Goyal, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt

On Tue, Apr 09, 2013 at 01:51:43PM -0400, Mikulas Patocka wrote:
> The patch dm-crypt-sort-requests.patch sorts the write requests
> submitted by that single thread. The requests are sorted according to
> the sector number; an rb-tree is used for efficient sorting.

Hmmm? Why not just keep the issuing order along with plugging
boundaries?

> So it seems that CFQ has some deficiency in that it cannot merge
> adjacent requests submitted by different processes.

As I wrote before, please use bio_associate_current(). Currently,
dm-crypt is completely messing up all the context information that CFQ
depends on to schedule IOs. Of course it doesn't perform well.

Thanks.

--
tejun
* Re: dm-crypt parallelization patches 2013-04-09 17:57 ` Tejun Heo @ 2013-04-09 18:08 ` Mikulas Patocka 2013-04-09 18:10 ` Tejun Heo 0 siblings, 1 reply; 45+ messages in thread From: Mikulas Patocka @ 2013-04-09 18:08 UTC (permalink / raw) To: Tejun Heo Cc: Jens Axboe, Vivek Goyal, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt On Tue, 9 Apr 2013, Tejun Heo wrote: > On Tue, Apr 09, 2013 at 01:51:43PM -0400, Mikulas Patocka wrote: > > The patch dm-crypt-sort-requests.patch sorts write requests submitted by a > > single thread. The requests are sorted according to the sector number, > > rb-tree is used for efficient sorting. > > Hmmm? Why not just keep the issuing order along with plugging > boundaries? What do you mean? I used to have a patch that keeps order of requests as they were introduced, but sorting the requests according to sector number is a bit simpler. > > So it seems that CFQ has some deficiency that it cannot merge adjacent > > requests done by different processes. > > As I wrote before, please use bio_associate_current(). Currently, > dm-crypt is completely messing up all the context information that cfq > depends on to schedule IOs. Of course, it doesn't perform well. bio_associate_current() is only valid on a system with cgroups and there are no cgroups on the kernel where I tested it. It is an empty function: static inline int bio_associate_current(struct bio *bio) { return -ENOENT; } Mikulas > Thanks. > > -- > tejun > ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: dm-crypt parallelization patches 2013-04-09 18:08 ` Mikulas Patocka @ 2013-04-09 18:10 ` Tejun Heo 2013-04-09 18:42 ` Vivek Goyal 2013-04-09 19:42 ` Mikulas Patocka 0 siblings, 2 replies; 45+ messages in thread From: Tejun Heo @ 2013-04-09 18:10 UTC (permalink / raw) To: Mikulas Patocka Cc: Jens Axboe, Vivek Goyal, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt Hey, On Tue, Apr 09, 2013 at 02:08:06PM -0400, Mikulas Patocka wrote: > > Hmmm? Why not just keep the issuing order along with plugging > > boundaries? > > What do you mean? > > I used to have a patch that keeps order of requests as they were > introduced, but sorting the requests according to sector number is a bit > simpler. You're still destroying the context information. Please just keep the issuing order along with plugging boundaries. > > As I wrote before, please use bio_associate_current(). Currently, > > dm-crypt is completely messing up all the context information that cfq > > depends on to schedule IOs. Of course, it doesn't perform well. > > bio_associate_current() is only valid on a system with cgroups and there > are no cgroups on the kernel where I tested it. It is an empty function: > > static inline int bio_associate_current(struct bio *bio) { return -ENOENT; } Yeah, because blkcg was the only user. Please feel free to drop the ifdefs. It covers both iocontext and cgroup association. Thanks. -- tejun ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: dm-crypt parallelization patches 2013-04-09 18:10 ` Tejun Heo @ 2013-04-09 18:42 ` Vivek Goyal 2013-04-09 18:57 ` Tejun Heo 2013-04-09 19:42 ` Mikulas Patocka 1 sibling, 1 reply; 45+ messages in thread From: Vivek Goyal @ 2013-04-09 18:42 UTC (permalink / raw) To: Tejun Heo Cc: Mikulas Patocka, Jens Axboe, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt On Tue, Apr 09, 2013 at 11:10:31AM -0700, Tejun Heo wrote: > Hey, > > On Tue, Apr 09, 2013 at 02:08:06PM -0400, Mikulas Patocka wrote: > > > Hmmm? Why not just keep the issuing order along with plugging > > > boundaries? > > > > What do you mean? > > > > I used to have a patch that keeps order of requests as they were > > introduced, but sorting the requests according to sector number is a bit > > simpler. > > You're still destroying the context information. Please just keep the > issuing order along with plugging boundaries. I guess the plugging boundary is more important than issuing order, as the block layer should take care of merging the bios and putting them in the right order (attempt_plug_merge()). But to make use of the plugging boundary, one would probably still need submission using a single thread. And if one is using a single thread for submission, one will still get good performance (even if you are not using bio_associate_current()), as by default all bios will go to the submitting thread's context. Thanks Vivek ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: dm-crypt parallelization patches 2013-04-09 18:42 ` Vivek Goyal @ 2013-04-09 18:57 ` Tejun Heo 2013-04-09 19:13 ` Vivek Goyal 0 siblings, 1 reply; 45+ messages in thread From: Tejun Heo @ 2013-04-09 18:57 UTC (permalink / raw) To: Vivek Goyal Cc: Mikulas Patocka, Jens Axboe, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt Hello, On Tue, Apr 09, 2013 at 02:42:48PM -0400, Vivek Goyal wrote: > I guess the plugging boundary is more important than issuing order, as > the block layer should take care of merging the bios and putting them in > the right order (attempt_plug_merge()). Yeah, the exact order probably doesn't affect things too much but it's just a nice design principle to follow - if you're gonna step into the middle and meddle with requests, preserve as much context as reasonably possible, and it's not like preserving that order is difficult. > But to make use of the plugging boundary, one would probably still need > submission using a single thread. It doesn't have to be a specific task. Whoever finishes the last bio / segment / whatever in the plugging domain can issue all of them. I probably am missing details but the overall mechanism can be pretty simple. Just keep the bios from the same plugging domain in the received order along with an atomic counter and issue them all when the counter hits zero. No need to fiddle with sorting or whatever. > And if one is using a single thread for submission, one will still get > good performance (even if you are not using bio_associate_current()), as > by default all bios will go to the submitting thread's context. And destroy all per-ioc and cgroup logics in block layer in the process. Thanks. -- tejun ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: dm-crypt parallelization patches 2013-04-09 18:57 ` Tejun Heo @ 2013-04-09 19:13 ` Vivek Goyal 0 siblings, 0 replies; 45+ messages in thread From: Vivek Goyal @ 2013-04-09 19:13 UTC (permalink / raw) To: Tejun Heo Cc: Mikulas Patocka, Jens Axboe, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt On Tue, Apr 09, 2013 at 11:57:21AM -0700, Tejun Heo wrote: [..] > And destroy all per-ioc and cgroup logics in block layer in the > process. Oh, I am in no way suggesting don't use bio_associate_current(). I am just trying to analyze the performance issue right now and saying that as far as performance is concerned, one will get it back even if one does not use bio_associate_current(). Yes, but one needs to use bio_associate_current() to make sure bios are attributed to the right cgroup and to associate the bio with the right task. This should help solve the long standing issue of a task losing its ioprio if a dm-crypt target is in the stack. Thanks Vivek ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: dm-crypt parallelization patches 2013-04-09 18:10 ` Tejun Heo 2013-04-09 18:42 ` Vivek Goyal @ 2013-04-09 19:42 ` Mikulas Patocka 2013-04-09 19:52 ` Tejun Heo 1 sibling, 1 reply; 45+ messages in thread From: Mikulas Patocka @ 2013-04-09 19:42 UTC (permalink / raw) To: Tejun Heo Cc: Jens Axboe, Vivek Goyal, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt On Tue, 9 Apr 2013, Tejun Heo wrote: > Hey, > > On Tue, Apr 09, 2013 at 02:08:06PM -0400, Mikulas Patocka wrote: > > > Hmmm? Why not just keep the issuing order along with plugging > > > boundaries? > > > > What do you mean? > > > > I used to have a patch that keeps order of requests as they were > > introduced, but sorting the requests according to sector number is a bit > > simpler. > > You're still destroying the context information. Please just keep the > issuing order along with plugging boundaries. > > > > As I wrote before, please use bio_associate_current(). Currently, > > > dm-crypt is completely messing up all the context information that cfq > > > depends on to schedule IOs. Of course, it doesn't perform well. > > > > bio_associate_current() is only valid on a system with cgroups and there > > are no cgroups on the kernel where I tested it. It is an empty function: > > > > static inline int bio_associate_current(struct bio *bio) { return -ENOENT; } > > Yeah, because blkcg was the only user. Please feel free to drop the > ifdefs. It covers both iocontext and cgroup association. > > Thanks. If I drop ifdefs, it doesn't compile (because other cgroup stuff is missing). So I enabled bio cgroups. bio_associate_current can't be used, because by the time we allocate the outgoing write bio, we are no longer in the process that submitted the original bio. 
Anyway, I tried to reproduce in dm-crypt what bio_associate_current does - in the submitting process I record "ioc" and "css" fields in "dm_crypt_io" structure and set these fields on all outgoing bios. It has no effect on performance, it is as bad as if I hadn't done it. Mikulas --- (this is the patch that I used, to be applied after dm-crypt-unbound-workqueue.patch) --- drivers/md/dm-crypt.c | 41 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 41 insertions(+) Index: linux-3.8.6-fast/drivers/md/dm-crypt.c =================================================================== --- linux-3.8.6-fast.orig/drivers/md/dm-crypt.c 2013-04-09 20:32:41.000000000 +0200 +++ linux-3.8.6-fast/drivers/md/dm-crypt.c 2013-04-09 21:29:12.000000000 +0200 @@ -20,6 +20,7 @@ #include <linux/backing-dev.h> #include <linux/atomic.h> #include <linux/scatterlist.h> +#include <linux/cgroup.h> #include <asm/page.h> #include <asm/unaligned.h> #include <crypto/hash.h> @@ -60,6 +61,9 @@ struct dm_crypt_io { int error; sector_t sector; struct dm_crypt_io *base_io; + + struct io_context *ioc; + struct cgroup_subsys_state *css; }; struct dm_crypt_request { @@ -797,6 +801,14 @@ static struct bio *crypt_alloc_buffer(st if (!clone) return NULL; + if (unlikely(io->base_io != NULL)) { + clone->bi_ioc = io->base_io->ioc; + clone->bi_css = io->base_io->css; + } else { + clone->bi_ioc = io->ioc; + clone->bi_css = io->css; + } + clone_init(io, clone); *out_of_pages = 0; @@ -859,6 +871,9 @@ static struct dm_crypt_io *crypt_io_allo io->ctx.req = NULL; atomic_set(&io->io_pending, 0); + io->ioc = NULL; + io->css = NULL; + return io; } @@ -884,6 +899,14 @@ static void crypt_dec_pending(struct dm_ if (io->ctx.req) mempool_free(io->ctx.req, cc->req_pool); + + if (io->ioc) { + put_io_context(io->ioc); + } + if (io->css) { + css_put(io->css); + } + mempool_free(io, cc->io_pool); if (likely(!base_io)) @@ -927,6 +950,9 @@ static void crypt_endio(struct bio *clon if (rw == WRITE) crypt_free_buffer_pages(cc, 
clone); + clone->bi_ioc = NULL; + clone->bi_css = NULL; + bio_put(clone); if (rw == READ && !error) { @@ -1658,6 +1684,21 @@ static int crypt_map(struct dm_target *t io = crypt_io_alloc(cc, bio, dm_target_offset(ti, bio->bi_sector)); + if (current->io_context) { + struct io_context *ioc = current->io_context; + struct cgroup_subsys_state *css; + + get_io_context_active(ioc); + io->ioc = ioc; + + /* associate blkcg if exists */ + rcu_read_lock(); + css = task_subsys_state(current, blkio_subsys_id); + if (css && css_tryget(css)) + io->css = css; + rcu_read_unlock(); + } + if (bio_data_dir(io->base_bio) == READ) { if (kcryptd_io_read(io, GFP_NOWAIT)) kcryptd_queue_io(io); ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: dm-crypt parallelization patches 2013-04-09 19:42 ` Mikulas Patocka @ 2013-04-09 19:52 ` Tejun Heo 2013-04-09 20:32 ` Mikulas Patocka 0 siblings, 1 reply; 45+ messages in thread From: Tejun Heo @ 2013-04-09 19:52 UTC (permalink / raw) To: Mikulas Patocka Cc: Jens Axboe, Vivek Goyal, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt On Tue, Apr 09, 2013 at 03:42:16PM -0400, Mikulas Patocka wrote: > If I drop ifdefs, it doesn't compile (because other cgroup stuff is > missing). > > So I enabled bio cgroups. > > bio_associate_current can't be used, because by the time we allocate the > outgoing write bio, we are no longer in the process that submitted the > original bio. Oh, I suppose it'd need some massaging to selectively turn off the cgroup part. > Anyway, I tried to reproduce in dm-crypt what bio_associate_current does - and we probably need to change that to bio_associate_task(). > in the submitting process I record "ioc" and "css" fields in "dm_crypt_io" > structure and set these fields on all outgoing bios. It has no effect on > performance, it is as bad as if I hadn't done it. A good way to verify that the tagging is correct would be configuring io limits in block cgroup and see whether the limits are correctly applied when going through dm-crypt (please test with direct-io or reads, writeback is horribly broken, sorry). If that's working correctly, maybe plugging is the overriding factor? Thanks. -- tejun ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: dm-crypt parallelization patches 2013-04-09 19:52 ` Tejun Heo @ 2013-04-09 20:32 ` Mikulas Patocka 2013-04-09 21:02 ` Tejun Heo 2013-04-09 21:07 ` Vivek Goyal 0 siblings, 2 replies; 45+ messages in thread From: Mikulas Patocka @ 2013-04-09 20:32 UTC (permalink / raw) To: Tejun Heo Cc: Jens Axboe, Vivek Goyal, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt On Tue, 9 Apr 2013, Tejun Heo wrote: > On Tue, Apr 09, 2013 at 03:42:16PM -0400, Mikulas Patocka wrote: > > If I drop ifdefs, it doesn't compile (because other cgroup stuff is > > missing). > > > > So I enabled bio cgroups. > > > > bio_associate_current can't be used, because by the time we allocate the > > outgoing write bio, we are no longer in the process that submitted the > > original bio. > > Oh, I suppose it'd need some massaging to selectively turn off the > cgroup part. > > > Anyway, I tried to reproduce in dm-crypt what bio_associate_current does - > > and we probably need to change that to bio_associate_task(). Generally, we shouldn't associate bios with the "current" task in device mapper targets. For example suppose that we have two stacked dm-crypt targets: The "current" process pointer in the lower dm-crypt target's request function always points to the workqueue of the upper dm-crypt target that submits the bios. So if we associate the bio with "current" in the lower target, we are associating it with a preallocated workqueue and we already lost the information about who submitted it. You should associate a bio with a task when you create the bio, and the "md" and "dm" midlayers should just forward this association to lower layer bios. > > in the submitting process I record "ioc" and "css" fields in "dm_crypt_io" > > structure and set these fields on all outgoing bios. It has no effect on > > performance, it is as bad as if I hadn't done it. 
> > A good way to verify that the tagging is correct would be configuring > io limits in block cgroup and see whether the limits are correctly > applied when going through dm-crypt (please test with direct-io or > reads, writeback is horribly broken, sorry). If that's working correctly, maybe > plugging is the overriding factor? > > Thanks. It doesn't work because device mapper on the underlying layers ignores bi_ioc and bi_css. If I make device mapper forward bi_ioc and bi_css to outgoing bios, it improves performance (from 2:30 to 1:30), but it is still far from perfect. Mikulas --- dm: forward cgroup context This patch makes dm forward associated cgroup context to cloned bios. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> --- drivers/md/dm.c | 9 +++++++++ fs/bio.c | 2 ++ 2 files changed, 11 insertions(+) Index: linux-3.8.6-fast/drivers/md/dm.c =================================================================== --- linux-3.8.6-fast.orig/drivers/md/dm.c 2013-04-09 22:00:36.000000000 +0200 +++ linux-3.8.6-fast/drivers/md/dm.c 2013-04-09 22:19:40.000000000 +0200 @@ -453,6 +453,10 @@ static void free_io(struct mapped_device static void free_tio(struct mapped_device *md, struct dm_target_io *tio) { +#ifdef CONFIG_BLK_CGROUP + tio->clone.bi_ioc = NULL; + tio->clone.bi_css = NULL; +#endif bio_put(&tio->clone); } @@ -1124,6 +1128,11 @@ static struct dm_target_io *alloc_tio(st clone = bio_alloc_bioset(GFP_NOIO, nr_iovecs, ci->md->bs); tio = container_of(clone, struct dm_target_io, clone); +#ifdef CONFIG_BLK_CGROUP + tio->clone.bi_ioc = ci->bio->bi_ioc; + tio->clone.bi_css = ci->bio->bi_css; +#endif + tio->io = ci->io; tio->ti = ti; memset(&tio->info, 0, sizeof(tio->info)); ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: dm-crypt parallelization patches 2013-04-09 20:32 ` Mikulas Patocka @ 2013-04-09 21:02 ` Tejun Heo 2013-04-09 21:03 ` Tejun Heo 0 siblings, 1 reply; 45+ messages in thread From: Tejun Heo @ 2013-04-09 21:02 UTC (permalink / raw) To: Mikulas Patocka Cc: Jens Axboe, Vivek Goyal, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt Hey, On Tue, Apr 09, 2013 at 04:32:28PM -0400, Mikulas Patocka wrote: > > and we probably need to change that to bio_associate_task(). > > Generally, we shouldn't associate bios with "current" task in device > mapper targets. For example suppose that we have two stacked dm-crypt > targets: It only follows the first association so it doesn't matter how many layers it goes through. That said, yeah, there could be situations where @task is available but the bio's already in the hands of a different task. If that's the case, change it to associate_task(@task). > It doesn't work because device mapper on the underlying layers ignores > bi_ioc and bi_css. > > If I make device mapper forward bi_ioc and bi_css to outgoing bios, it > improves performance (from 2:30 to 1:30), but it is still far from > perfect. For testing, copying bi_ioc and bi_css directly is fine but please add another interface to copy those for the actual code. Say, bio_copy_association(@to_bio, @from_bio) or whatever. As for the performance loss, I'm somewhat confident in saying the remaining difference would be from ignoring plugging boundaries. Thanks. -- tejun ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: dm-crypt parallelization patches 2013-04-09 21:02 ` Tejun Heo @ 2013-04-09 21:03 ` Tejun Heo 0 siblings, 0 replies; 45+ messages in thread From: Tejun Heo @ 2013-04-09 21:03 UTC (permalink / raw) To: Mikulas Patocka Cc: Jens Axboe, Vivek Goyal, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt On Tue, Apr 09, 2013 at 02:02:01PM -0700, Tejun Heo wrote: > For testing, copying bi_ioc and bi_css directly is fine but please add > another interface to copy those for the actual code. Say, > bio_copy_association(@to_bio, @from_bio) or whatever. Another and probably better possibility is just remembering the issuing task (you would of course need to hold an extra ref as long as you wanna use it) and use bio_associate_task() on it when creating new bios. Thanks. -- tejun ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: dm-crypt parallelization patches 2013-04-09 20:32 ` Mikulas Patocka 2013-04-09 21:02 ` Tejun Heo @ 2013-04-09 21:07 ` Vivek Goyal 2013-04-09 21:18 ` Mikulas Patocka 1 sibling, 1 reply; 45+ messages in thread From: Vivek Goyal @ 2013-04-09 21:07 UTC (permalink / raw) To: Mikulas Patocka Cc: Tejun Heo, Jens Axboe, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt On Tue, Apr 09, 2013 at 04:32:28PM -0400, Mikulas Patocka wrote: [..] > Generally, we shouldn't associate bios with the "current" task in device > mapper targets. For example suppose that we have two stacked dm-crypt > targets: > > The "current" process pointer in the lower dm-crypt target's request > function always points to the workqueue of the upper dm-crypt target that > submits the bios. So if we associate the bio with "current" in the lower > target, we are associating it with a preallocated workqueue and we already > lost the information about who submitted it. > > You should associate a bio with a task when you create the bio, and the "md" > and "dm" midlayers should just forward this association to lower layer > bios. bio_associate_current() returns -EBUSY if the bio has already been associated with an io context. So in a stack if every driver calls bio_associate_current(), upon bio submission, it will automatically make sure the bio gets associated with the submitter task in the top level device and calls by lower level devices will be ignored. Lower level devices I think just need to make sure this context info is propagated to cloned bios. [..] > +#ifdef CONFIG_BLK_CGROUP > + tio->clone.bi_ioc = ci->bio->bi_ioc; > + tio->clone.bi_css = ci->bio->bi_css; You also need to take references to ioc and css objects. I guess a helper function will be better. Maybe something like: bio_associate_bio_context(bio1, bio2) And this initializes bio2's context with bio1's context. Thanks Vivek ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: dm-crypt parallelization patches 2013-04-09 21:07 ` Vivek Goyal @ 2013-04-09 21:18 ` Mikulas Patocka 2013-04-10 19:24 ` Vivek Goyal 0 siblings, 1 reply; 45+ messages in thread From: Mikulas Patocka @ 2013-04-09 21:18 UTC (permalink / raw) To: Vivek Goyal Cc: Tejun Heo, Jens Axboe, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt On Tue, 9 Apr 2013, Vivek Goyal wrote: > On Tue, Apr 09, 2013 at 04:32:28PM -0400, Mikulas Patocka wrote: > > [..] > > Generally, we shouldn't associate bios with the "current" task in device > > mapper targets. For example suppose that we have two stacked dm-crypt > > targets: > > > > The "current" process pointer in the lower dm-crypt target's request > > function always points to the workqueue of the upper dm-crypt target that > > submits the bios. So if we associate the bio with "current" in the lower > > target, we are associating it with a preallocated workqueue and we already > > lost the information about who submitted it. > > > > You should associate a bio with a task when you create the bio, and the "md" > > and "dm" midlayers should just forward this association to lower layer > > bios. > > bio_associate_current() returns -EBUSY if the bio has already been associated > with an io context. > > So in a stack if every driver calls bio_associate_current(), upon bio > submission, it will automatically make sure the bio gets associated with > the submitter task in the top level device and calls by lower level devices > will be ignored. The stacking drivers do not pass the same bio to each other. The stacking driver receives a bio, allocates zero or more new bios and sends these new bios to the lower layer. So you need to propagate ownership from the received bio to the newly allocated bios; you don't want to associate a newly allocated bio with the "current" process. > Lower level devices I think just need to make sure this context > info is propagated to cloned bios. > > > > [..] 
> > +#ifdef CONFIG_BLK_CGROUP > > + tio->clone.bi_ioc = ci->bio->bi_ioc; > > + tio->clone.bi_css = ci->bio->bi_css; > > You also need to take references to ioc and css objects. I guess a helper > function will be better. Maybe something like: The lifetime of the "tio" structure is shorter than the lifetime of "ci->bio". So we don't need to increment the reference. We only need to increment the reference if we copy ownership to a new bio that could have a longer lifetime than the original bio. But this situation is very rare - in most stacking drivers the newly allocated bio has a shorter lifetime than the original one. > bio_associate_bio_context(bio1, bio2) > > And this initializes bio2's context with bio1's context. Yes, that would be ok. > Thanks > Vivek Mikulas ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: dm-crypt parallelization patches 2013-04-09 21:18 ` Mikulas Patocka @ 2013-04-10 19:24 ` Vivek Goyal 2013-04-10 23:42 ` [PATCH] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches) Mikulas Patocka 0 siblings, 1 reply; 45+ messages in thread From: Vivek Goyal @ 2013-04-10 19:24 UTC (permalink / raw) To: Mikulas Patocka Cc: Tejun Heo, Jens Axboe, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt On Tue, Apr 09, 2013 at 05:18:25PM -0400, Mikulas Patocka wrote: [..] > > bio_associate_current() returns -EBUSY if the bio has already been associated > > with an io context. > > > > So in a stack if every driver calls bio_associate_current(), upon bio > > submission, it will automatically make sure the bio gets associated with > > the submitter task in the top level device and calls by lower level devices > > will be ignored. > > The stacking drivers do not pass the same bio to each other. > > The stacking driver receives a bio, allocates zero or more new bios and > sends these new bios to the lower layer. So you need to propagate > ownership from the received bio to the newly allocated bios; you don't want to > associate a newly allocated bio with the "current" process. Actually I was asking to call bio_associate_current() for the incoming bio in the driver and not for the bios newly created by the driver. For any newly created bios on behalf of this incoming bio, we need to copy the context so that this context info can be propagated down the stack. > > > Lower level devices I think just need to make sure this context > > info is propagated to cloned bios. > > > > > > [..] 
> > The lifetime of the "tio" structure is shorter than the lifetime of > "ci->bio". So we don't need to increment the reference. > > We only need to increment the reference if we copy ownership to a new bio that > could have a longer lifetime than the original bio. But this situation > is very rare - in most stacking drivers the newly allocated bio has > a shorter lifetime than the original one. I think it is not a good idea to rely on the fact that cloned bios or newly created bios will have a shorter lifetime than the original bio. In fact, when the bio completes and you free it up, bio_disassociate_task() will try to put the io context and blkcg references. So it is important to take a reference if you are copying context info into any newly created bio. Thanks Vivek ^ permalink raw reply [flat|nested] 45+ messages in thread
* [PATCH] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches) 2013-04-10 19:24 ` Vivek Goyal @ 2013-04-10 23:42 ` Mikulas Patocka 2013-04-10 23:50 ` Tejun Heo 0 siblings, 1 reply; 45+ messages in thread From: Mikulas Patocka @ 2013-04-10 23:42 UTC (permalink / raw) To: Vivek Goyal Cc: Tejun Heo, Jens Axboe, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt On Wed, 10 Apr 2013, Vivek Goyal wrote: > On Tue, Apr 09, 2013 at 05:18:25PM -0400, Mikulas Patocka wrote: > > [..] > > > bio_associate_current() returns -EBUSY if the bio has already been associated > > > with an io context. > > > > > > So in a stack if every driver calls bio_associate_current(), upon bio > > > submission, it will automatically make sure the bio gets associated with > > > the submitter task in the top level device and calls by lower level devices > > > will be ignored. > > > > The stacking drivers do not pass the same bio to each other. > > > > The stacking driver receives a bio, allocates zero or more new bios and > > sends these new bios to the lower layer. So you need to propagate > > ownership from the received bio to the newly allocated bios; you don't want to > > associate a newly allocated bio with the "current" process. > > Actually I was asking to call bio_associate_current() for the incoming > bio in the driver and not for the bios newly created by the driver. Yes, I think it's better to call it in the driver than in the upper layer, because if the driver doesn't forward the bio to a worker thread, we don't have to call bio_associate_current() and we save a few atomic instructions. > For any newly created bios on behalf of this incoming bio, we need to > copy the context so that this context info can be propagated down the > stack. See this patch. It implements cgroup associations for dm core and the dm-crypt target. Do you think the interface is correct? (i.e. can I start modifying more dm targets to use it?) 
> > We only need to increment the reference if we copy ownership to a new bio that > > could have a longer lifetime than the original bio. But this situation > > is very rare - in most stacking drivers the newly allocated bio has > > a shorter lifetime than the original one. > > I think it is not a good idea to rely on the fact that cloned bios or > newly created bios will have a shorter lifetime than the original bio. In fact, > when the bio completes and you free it up, bio_disassociate_task() will try > to put the io context and blkcg references. So it is important to take > a reference if you are copying context info into any newly created bio. We clear the association with bio_clear_context before freeing the bio. > Thanks > Vivek --- dm: retain cgroup context This patch makes dm and the dm-crypt target retain cgroup context. New functions bio_clone_context and bio_clear_context are introduced. bio_associate_current and bio_disassociate_task are exported to modules. dm core is changed so that it copies the context to cloned bios. dm associates the bio with the current process if it is going to offload the bio to a thread. dm-crypt copies the context to outgoing bios and associates the bio with the current process. 
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> --- drivers/md/dm-crypt.c | 17 ++++++++++++++--- drivers/md/dm.c | 5 +++++ fs/bio.c | 2 ++ include/linux/bio.h | 39 +++++++++++++++++++++++++++++++++++++++ 4 files changed, 60 insertions(+), 3 deletions(-) Index: linux-3.8.6-fast/drivers/md/dm.c =================================================================== --- linux-3.8.6-fast.orig/drivers/md/dm.c 2013-04-10 14:39:28.000000000 +0200 +++ linux-3.8.6-fast/drivers/md/dm.c 2013-04-10 20:09:52.000000000 +0200 @@ -453,6 +453,7 @@ static void free_io(struct mapped_device static void free_tio(struct mapped_device *md, struct dm_target_io *tio) { + bio_clear_context(&tio->clone); bio_put(&tio->clone); } @@ -521,6 +522,8 @@ static void queue_io(struct mapped_devic { unsigned long flags; + bio_associate_current(bio); + spin_lock_irqsave(&md->deferred_lock, flags); bio_list_add(&md->deferred, bio); spin_unlock_irqrestore(&md->deferred_lock, flags); @@ -1124,6 +1127,8 @@ static struct dm_target_io *alloc_tio(st clone = bio_alloc_bioset(GFP_NOIO, nr_iovecs, ci->md->bs); tio = container_of(clone, struct dm_target_io, clone); + bio_clone_context(ci->bio, &tio->clone); + tio->io = ci->io; tio->ti = ti; memset(&tio->info, 0, sizeof(tio->info)); Index: linux-3.8.6-fast/include/linux/bio.h =================================================================== --- linux-3.8.6-fast.orig/include/linux/bio.h 2013-04-10 14:38:56.000000000 +0200 +++ linux-3.8.6-fast/include/linux/bio.h 2013-04-10 20:14:08.000000000 +0200 @@ -291,6 +291,15 @@ extern void bvec_free_bs(struct bio_set extern unsigned int bvec_nr_vecs(unsigned short idx); #ifdef CONFIG_BLK_CGROUP +/* + * bio_associate_current associates the bio with the current process. It must be + * called by any block device driver that passes the bio to a different process + * to be processed. It must be called in the original process. + * bio_associate_current does nothing if the bio is already associated. 
+ * + * bio_disassociate_task disassociates the bio from the task. It is called + * automatically at bio destruction. + */ int bio_associate_current(struct bio *bio); void bio_disassociate_task(struct bio *bio); #else /* CONFIG_BLK_CGROUP */ @@ -299,6 +308,36 @@ static inline void bio_disassociate_task #endif /* CONFIG_BLK_CGROUP */ /* + * bio_clone_context copies cgroup context from the original bio to the new bio. + * It is used by bio midlayer drivers that create new bio based on an original + * bio and forward it to the lower layer. + * + * No reference counts are incremented - it is assumed that the lifetime of the + * new bio is shorter than the lifetime of the original bio. If the new bio can + * outlive the old bio, the caller must increment the reference counts. + * + * Before freeing the new bio, the caller must clear the context with + * the bio_clear_context function. If bio_clear_context were not called, the + * reference counts would be decremented on both new and original bio, resulting + * in a crash due to reference count underflow. + */ +static inline void bio_clone_context(struct bio *orig, struct bio *new) +{ +#ifdef CONFIG_BLK_CGROUP + new->bi_ioc = orig->bi_ioc; + new->bi_css = orig->bi_css; +#endif +} + +static inline void bio_clear_context(struct bio *bio) +{ +#ifdef CONFIG_BLK_CGROUP + bio->bi_ioc = NULL; + bio->bi_css = NULL; +#endif +} + +/* * bio_set is used to allow other portions of the IO system to * allocate their own private memory pools for bio and iovec structures. 
* These memory pools in turn all allocate from the bio_slab Index: linux-3.8.6-fast/drivers/md/dm-crypt.c =================================================================== --- linux-3.8.6-fast.orig/drivers/md/dm-crypt.c 2013-04-10 14:38:56.000000000 +0200 +++ linux-3.8.6-fast/drivers/md/dm-crypt.c 2013-04-10 19:52:56.000000000 +0200 @@ -181,6 +181,7 @@ struct crypt_config { static struct kmem_cache *_crypt_io_pool; static void clone_init(struct dm_crypt_io *, struct bio *); +static void clone_free(struct bio *); static void kcryptd_queue_crypt(struct dm_crypt_io *io); static u8 *iv_of_dmreq(struct crypt_config *cc, struct dm_crypt_request *dmreq); @@ -846,7 +847,7 @@ static struct bio *crypt_alloc_buffer(st } if (!clone->bi_size) { - bio_put(clone); + clone_free(clone); return NULL; } @@ -945,7 +946,7 @@ static void crypt_endio(struct bio *clon if (rw == WRITE) crypt_free_buffer_pages(cc, clone); - bio_put(clone); + clone_free(clone); if (rw == READ && !error) { kcryptd_queue_crypt(io); @@ -966,6 +967,14 @@ static void clone_init(struct dm_crypt_i clone->bi_end_io = crypt_endio; clone->bi_bdev = cc->dev->bdev; clone->bi_rw = io->base_bio->bi_rw; + + bio_clone_context(io->base_bio, clone); +} + +static void clone_free(struct bio *clone) +{ + bio_clear_context(clone); + bio_put(clone); } static int kcryptd_io_read(struct dm_crypt_io *io, gfp_t gfp) @@ -1026,7 +1035,7 @@ static void kcryptd_crypt_write_io_submi if (unlikely(io->error < 0)) { crypt_free_buffer_pages(cc, clone); - bio_put(clone); + clone_free(clone); crypt_dec_pending(io); return; } @@ -1692,6 +1701,8 @@ static int crypt_map(struct dm_target *t return DM_MAPIO_REMAPPED; } + bio_associate_current(bio); + io = crypt_io_alloc(cc, bio, dm_target_offset(ti, bio->bi_sector)); if (bio_data_dir(io->base_bio) == READ) { Index: linux-3.8.6-fast/fs/bio.c =================================================================== --- linux-3.8.6-fast.orig/fs/bio.c 2013-04-10 19:49:13.000000000 +0200 +++ 
linux-3.8.6-fast/fs/bio.c 2013-04-10 19:50:10.000000000 +0200 @@ -1703,6 +1703,7 @@ int bio_associate_current(struct bio *bi return 0; } +EXPORT_SYMBOL(bio_associate_current); /** * bio_disassociate_task - undo bio_associate_current() @@ -1719,6 +1720,7 @@ void bio_disassociate_task(struct bio *b bio->bi_css = NULL; } } +EXPORT_SYMBOL(bio_disassociate_task); #endif /* CONFIG_BLK_CGROUP */ ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches) 2013-04-10 23:42 ` [PATCH] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches) Mikulas Patocka @ 2013-04-10 23:50 ` Tejun Heo 2013-04-11 19:49 ` [PATCH v2] " Mikulas Patocka 0 siblings, 1 reply; 45+ messages in thread From: Tejun Heo @ 2013-04-10 23:50 UTC (permalink / raw) To: Mikulas Patocka Cc: Vivek Goyal, Jens Axboe, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt On Wed, Apr 10, 2013 at 07:42:59PM -0400, Mikulas Patocka wrote: > /* > + * bio_clone_context copies cgroup context from the original bio to the new bio. > + * It is used by bio midlayer drivers that create new bio based on an original > + * bio and forward it to the lower layer. > + * > + * No reference counts are incremented - it is assumed that the lifestime of the > + * new bio is shorter than the lifetime of the original bio. If the new bio can > + * outlive the old bio, the caller must increment the reference counts. > + * > + * Before freeing the new bio, the caller must clear the context with > + * bio_clear_context function. If bio_clear_context were not called, the > + * reference counts would be decremented on both new and original bio, resulting > + * in crash due to reference count underflow. > + */ > +static inline void bio_clone_context(struct bio *orig, struct bio *new) > +{ > +#ifdef CONFIG_BLK_CGROUP > + new->bi_ioc = orig->bi_ioc; > + new->bi_css = orig->bi_css; Hmmm... Let's not do this. Sure, you'd be saving several instructions but the gain is unlikely to be significant given that those cachelines are likely to be hot anyway. Also, please name it bio_copy_association(). Thanks. -- tejun ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches) 2013-04-10 23:50 ` Tejun Heo @ 2013-04-11 19:49 ` Mikulas Patocka 2013-04-11 19:52 ` Tejun Heo 0 siblings, 1 reply; 45+ messages in thread From: Mikulas Patocka @ 2013-04-11 19:49 UTC (permalink / raw) To: Tejun Heo Cc: Vivek Goyal, Jens Axboe, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt, Alasdair G. Kergon On Wed, 10 Apr 2013, Tejun Heo wrote: > On Wed, Apr 10, 2013 at 07:42:59PM -0400, Mikulas Patocka wrote: > > /* > > + * bio_clone_context copies cgroup context from the original bio to the new bio. > > + * It is used by bio midlayer drivers that create new bio based on an original > > + * bio and forward it to the lower layer. > > + * > > + * No reference counts are incremented - it is assumed that the lifestime of the > > + * new bio is shorter than the lifetime of the original bio. If the new bio can > > + * outlive the old bio, the caller must increment the reference counts. > > + * > > + * Before freeing the new bio, the caller must clear the context with > > + * bio_clear_context function. If bio_clear_context were not called, the > > + * reference counts would be decremented on both new and original bio, resulting > > + * in crash due to reference count underflow. > > + */ > > +static inline void bio_clone_context(struct bio *orig, struct bio *new) > > +{ > > +#ifdef CONFIG_BLK_CGROUP > > + new->bi_ioc = orig->bi_ioc; > > + new->bi_css = orig->bi_css; > > Hmmm... Let's not do this. Sure, you'd be saving several instructions > but the gain is unlikely to be significant given that those cachelines > are likely to be hot anyway. Also, please name it > bio_copy_association(). > > Thanks. > > -- > tejun If the bi_css pointer points to a structure that is shared between processes, using atomic instructions causes cache line bouncing - it doesn't cost a few instructions, it costs 2-3 hundred cycles. 
I modified the patch to use new flag BIO_DROP_CGROUP_REFCOUNT to note that the refcount must be decremented - if the flag is set, refcounts must be decremented when bio is destroyed, if it is not set, references are borrowed from upper layer bio. It is less bug-prone than the previous patch. Mikulas --- dm: retain cgroup context This patch makes dm and dm-crypt target retain cgroup context. New function bio_clone_context is introduced. It copies cgroup context from one bio to another without incrementing the reference count. A new bio flag BIO_DROP_CGROUP_REFCOUNT specifies that cgroup refcounts should be decremented when the bio is freed. bio_associate_current and bio_disassociate_task are exported to modules. dm core is changed so that it copies the context to cloned bio. dm associates the bio with current process if it is going to offload bio to a thread. dm-crypt associates the bio with the current process and copies the context to outgoing bios. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> --- drivers/md/dm-crypt.c | 4 ++++ drivers/md/dm.c | 7 +++++-- fs/bio.c | 11 +++++++++-- include/linux/bio.h | 27 +++++++++++++++++++++++++++ include/linux/blk_types.h | 3 +++ 5 files changed, 48 insertions(+), 4 deletions(-) Index: linux-3.8.6-fast/drivers/md/dm.c =================================================================== --- linux-3.8.6-fast.orig/drivers/md/dm.c 2013-04-11 17:29:09.000000000 +0200 +++ linux-3.8.6-fast/drivers/md/dm.c 2013-04-11 19:33:47.000000000 +0200 @@ -1124,6 +1124,8 @@ static struct dm_target_io *alloc_tio(st clone = bio_alloc_bioset(GFP_NOIO, nr_iovecs, ci->md->bs); tio = container_of(clone, struct dm_target_io, clone); + bio_clone_context(&tio->clone, ci->bio); + tio->io = ci->io; tio->ti = ti; memset(&tio->info, 0, sizeof(tio->info)); @@ -1469,9 +1471,10 @@ static void _dm_request(struct request_q if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags))) { dm_put_live_table(md, srcu_idx); - if (bio_rw(bio) != READA) + if 
(bio_rw(bio) != READA) { + bio_associate_current(bio); queue_io(md, bio); - else + } else bio_io_error(bio); return; } Index: linux-3.8.6-fast/include/linux/bio.h =================================================================== --- linux-3.8.6-fast.orig/include/linux/bio.h 2013-04-11 17:29:07.000000000 +0200 +++ linux-3.8.6-fast/include/linux/bio.h 2013-04-11 19:34:11.000000000 +0200 @@ -291,6 +291,15 @@ extern void bvec_free_bs(struct bio_set extern unsigned int bvec_nr_vecs(unsigned short idx); #ifdef CONFIG_BLK_CGROUP +/* + * bio_associate_current associates the bio with the current process. It should + * be called by any block device driver that passes the bio to a different + * process to be processed. It must be called in the original process. + * bio_associate_current does nothing if the bio is already associated. + * + * bio_disassociate_task disassociates the bio from the task. It is called + * automatically at bio destruction. + */ int bio_associate_current(struct bio *bio); void bio_disassociate_task(struct bio *bio); #else /* CONFIG_BLK_CGROUP */ @@ -299,6 +308,24 @@ static inline void bio_disassociate_task #endif /* CONFIG_BLK_CGROUP */ /* + * bio_clone_context copies cgroup context from the original bio to the new bio. + * It is used by bio midlayer drivers that create new bio based on an original + * bio and forward it to the lower layer. + * + * No reference counts are incremented - it is assumed that the lifetime of the + * new bio is shorter than the lifetime of the original bio. If the new bio can + * outlive the old bio, the caller must increment the reference counts. + */ +static inline void bio_clone_context(struct bio *bio, struct bio *bio_src) +{ +#ifdef CONFIG_BLK_CGROUP + BUG_ON(bio->bi_ioc != NULL); + bio->bi_ioc = bio_src->bi_ioc; + bio->bi_css = bio_src->bi_css; +#endif +} + +/* * bio_set is used to allow other portions of the IO system to * allocate their own private memory pools for bio and iovec structures. 
* These memory pools in turn all allocate from the bio_slab Index: linux-3.8.6-fast/drivers/md/dm-crypt.c =================================================================== --- linux-3.8.6-fast.orig/drivers/md/dm-crypt.c 2013-04-11 17:29:07.000000000 +0200 +++ linux-3.8.6-fast/drivers/md/dm-crypt.c 2013-04-11 19:33:48.000000000 +0200 @@ -966,6 +966,8 @@ static void clone_init(struct dm_crypt_i clone->bi_end_io = crypt_endio; clone->bi_bdev = cc->dev->bdev; clone->bi_rw = io->base_bio->bi_rw; + + bio_clone_context(clone, io->base_bio); } static int kcryptd_io_read(struct dm_crypt_io *io, gfp_t gfp) @@ -1692,6 +1694,8 @@ static int crypt_map(struct dm_target *t return DM_MAPIO_REMAPPED; } + bio_associate_current(bio); + io = crypt_io_alloc(cc, bio, dm_target_offset(ti, bio->bi_sector)); if (bio_data_dir(io->base_bio) == READ) { Index: linux-3.8.6-fast/fs/bio.c =================================================================== --- linux-3.8.6-fast.orig/fs/bio.c 2013-04-11 17:29:07.000000000 +0200 +++ linux-3.8.6-fast/fs/bio.c 2013-04-11 19:30:58.000000000 +0200 @@ -1690,6 +1690,8 @@ int bio_associate_current(struct bio *bi if (!ioc) return -ENOENT; + bio->bi_flags |= 1UL << BIO_DROP_CGROUP_REFCOUNT; + /* acquire active ref on @ioc and associate */ get_io_context_active(ioc); bio->bi_ioc = ioc; @@ -1703,6 +1705,7 @@ int bio_associate_current(struct bio *bi return 0; } +EXPORT_SYMBOL(bio_associate_current); /** * bio_disassociate_task - undo bio_associate_current() @@ -1711,14 +1714,18 @@ int bio_associate_current(struct bio *bi void bio_disassociate_task(struct bio *bio) { if (bio->bi_ioc) { - put_io_context(bio->bi_ioc); + if (bio_flagged(bio, BIO_DROP_CGROUP_REFCOUNT)) + put_io_context(bio->bi_ioc); bio->bi_ioc = NULL; } if (bio->bi_css) { - css_put(bio->bi_css); + if (bio_flagged(bio, BIO_DROP_CGROUP_REFCOUNT)) + css_put(bio->bi_css); bio->bi_css = NULL; } + bio->bi_flags &= ~(1UL << BIO_DROP_CGROUP_REFCOUNT); } +EXPORT_SYMBOL(bio_disassociate_task); #endif /* 
CONFIG_BLK_CGROUP */ Index: linux-3.8.6-fast/include/linux/blk_types.h =================================================================== --- linux-3.8.6-fast.orig/include/linux/blk_types.h 2013-04-11 17:29:07.000000000 +0200 +++ linux-3.8.6-fast/include/linux/blk_types.h 2013-04-11 17:29:10.000000000 +0200 @@ -114,6 +114,9 @@ struct bio { #define BIO_FS_INTEGRITY 9 /* fs owns integrity data, not block layer */ #define BIO_QUIET 10 /* Make BIO Quiet */ #define BIO_MAPPED_INTEGRITY 11/* integrity metadata has been remapped */ +#ifdef CONFIG_BLK_CGROUP +#define BIO_DROP_CGROUP_REFCOUNT 12 /* decrement cgroup context refcount */ +#endif /* * Flags starting here get preserved by bio_reset() - this includes ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches) 2013-04-11 19:49 ` [PATCH v2] " Mikulas Patocka @ 2013-04-11 19:52 ` Tejun Heo 2013-04-11 20:00 ` Tejun Heo 0 siblings, 1 reply; 45+ messages in thread From: Tejun Heo @ 2013-04-11 19:52 UTC (permalink / raw) To: Mikulas Patocka Cc: Vivek Goyal, Jens Axboe, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt, Alasdair G. Kergon On Thu, Apr 11, 2013 at 03:49:20PM -0400, Mikulas Patocka wrote: > If the bi_css pointer points to a structure that is shared between > processes, using atomic instruction causes cache line boucing - it doesn't > cost a few instructions, it costs 2-3 hundreds cycles. > > I modified the patch to use new flag BIO_DROP_CGROUP_REFCOUNT to note that > the refcount must be decremented - if the flag is set, refcounts must be > decremented when bio is destroyed, if it is not set, references are > borrowed from upper layer bio. > > It is less bug-prone than the previous patch. If this becomes an actual bottleneck, the right thing to do is making css ref per-cpu. Please stop messing around with refcounting. NACK. -- tejun ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches) 2013-04-11 19:52 ` Tejun Heo @ 2013-04-11 20:00 ` Tejun Heo 2013-04-12 0:06 ` Mikulas Patocka 2013-04-12 18:01 ` Mikulas Patocka 0 siblings, 2 replies; 45+ messages in thread From: Tejun Heo @ 2013-04-11 20:00 UTC (permalink / raw) To: Mikulas Patocka Cc: Vivek Goyal, Jens Axboe, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt, Alasdair G. Kergon On Thu, Apr 11, 2013 at 12:52:03PM -0700, Tejun Heo wrote: > If this becomes an actual bottleneck, the right thing to do is making > css ref per-cpu. Please stop messing around with refcounting. If you think this kind of hackery is acceptable, you really need to re-evaluate your priorities in making engineering decisions. In tightly coupled code, maybe, but you're trying to introduce an utterly broken, error-prone thing as a generic block layer API. I mean, are you for real? -- tejun ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches) 2013-04-11 20:00 ` Tejun Heo @ 2013-04-12 0:06 ` Mikulas Patocka 2013-04-12 0:22 ` Tejun Heo 2013-04-12 18:01 ` Mikulas Patocka 1 sibling, 1 reply; 45+ messages in thread From: Mikulas Patocka @ 2013-04-12 0:06 UTC (permalink / raw) To: Tejun Heo Cc: Vivek Goyal, Jens Axboe, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt, Alasdair G. Kergon On Thu, 11 Apr 2013, Tejun Heo wrote: > On Thu, Apr 11, 2013 at 12:52:03PM -0700, Tejun Heo wrote: > > If this becomes an actual bottleneck, the right thing to do is making > > css ref per-cpu. Please stop messing around with refcounting. > > If you think this kind of hackery is acceptable, you really need to > re-evaluate your priorities in making engineering decisions. In > tightly coupled code, maybe, but you're trying to introduce utterly > broken error-prone thing as a generic block layer API. I mean, are > you for real? > > -- > tejun All that I can tell you is that adding an empty atomic operation "cmpxchg(&bio->bi_css->refcnt, bio->bi_css->refcnt, bio->bi_css->refcnt);" to bio_clone_context and bio_disassociate_task increases the time to run a benchmark from 23 to 40 seconds. Every single atomic reference in the block layer is measurable. How did I measure it: (1) use dm SRCU patches (http://people.redhat.com/~mpatocka/patches/kernel/dm-lock-optimization/) that replace some atomic accesses in device mapper with SRCU. The patches will likely be included in the kernel to improve performance. 
(2) use the patch v2 that I posted in this thread (3) add bio_associate_current(bio) to _dm_request (so that each bio is associated with a process even if it is not offloaded to a workqueue) (4) change bio_clone_context to actually increase reference counts: static inline void bio_clone_context(struct bio *bio, struct bio *bio_src) { #ifdef CONFIG_BLK_CGROUP BUG_ON(bio->bi_ioc != NULL); if (bio_src->bi_ioc) { get_io_context_active(bio_src->bi_ioc); bio->bi_ioc = bio_src->bi_ioc; if (bio_src->bi_css && css_tryget(bio_src->bi_css)) { bio->bi_css = bio_src->bi_css; } bio->bi_flags |= 1UL << BIO_DROP_CGROUP_REFCOUNT; } #endif } (5) add "cmpxchg(&bio->bi_css->refcnt, bio->bi_css->refcnt, bio->bi_css->refcnt)" to bio_clone_context and bio_disassociate_task Now, measuring: - create 4GiB ramdisk, fill it with dd so that it is allocated - create 5 nested device mapper linear targets on it - run "time fio --rw=randrw --size=1G --bs=512 --filename=/dev/mapper/linear5 --direct=1 --name=job1 --name=job2 --name=job3 --name=job4 --name=job5 --name=job6 --name=job7 --name=job8 --name=job9 --name=job10 --name=job11 --name=job12" (it was run on a 12-core machine, so there are 12 concurrent jobs) If I measure kernel (4), the benchmark takes 23 seconds. For kernel (5) it takes 40 seconds. Mikulas ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches) 2013-04-12 0:06 ` Mikulas Patocka @ 2013-04-12 0:22 ` Tejun Heo 2013-04-12 5:59 ` [PATCH v2] make dm and dm-crypt forward cgroup context Milan Broz 2013-04-12 18:17 ` [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches) Mikulas Patocka 0 siblings, 2 replies; 45+ messages in thread From: Tejun Heo @ 2013-04-12 0:22 UTC (permalink / raw) To: Mikulas Patocka Cc: Vivek Goyal, Jens Axboe, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt, Alasdair G. Kergon On Thu, Apr 11, 2013 at 08:06:10PM -0400, Mikulas Patocka wrote: > All that I can tell you is that adding an empty atomic operation > "cmpxchg(&bio->bi_css->refcnt, bio->bi_css->refcnt, bio->bi_css->refcnt);" > to bio_clone_context and bio_disassociate_task increases the time to run a > benchmark from 23 to 40 seconds. Right, linear target on ramdisk, very realistic, and you know what, hell with dm, let's just hand code everything into submit_bio(). I'm sure it will speed up your test case significantly. If this actually matters, improve it in *sane* way. Make the refcnts per-cpu and not use atomic ops. In fact, we already have proposed implementation of percpu refcnt which is being used by aio restructure patches and likely to be included in some form. It's not quite ready yet, so please work on something useful like that instead of continuing this non-sense. -- tejun ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v2] make dm and dm-crypt forward cgroup context 2013-04-12 0:22 ` Tejun Heo @ 2013-04-12 5:59 ` Milan Broz 2013-04-12 18:17 ` [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches) Mikulas Patocka 1 sibling, 0 replies; 45+ messages in thread From: Milan Broz @ 2013-04-12 5:59 UTC (permalink / raw) To: Mikulas Patocka Cc: Tejun Heo, Vivek Goyal, Jens Axboe, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt, Alasdair G. Kergon On 12.4.2013 2:22, Tejun Heo wrote: > On Thu, Apr 11, 2013 at 08:06:10PM -0400, Mikulas Patocka wrote: >> All that I can tell you is that adding an empty atomic operation >> "cmpxchg(&bio->bi_css->refcnt, bio->bi_css->refcnt, bio->bi_css->refcnt);" >> to bio_clone_context and bio_disassociate_task increases the time to run a >> benchmark from 23 to 40 seconds. > > Right, linear target on ramdisk, very realistic, and you know what, > hell with dm, let's just hand code everything into submit_bio(). I'm > sure it will speed up your test case significantly. > > If this actually matters, improve it in *sane* way. Make the refcnts > per-cpu and not use atomic ops. In fact, we already have proposed > implementation of percpu refcnt which is being used by aio restructure > patches and likely to be included in some form. It's not quite ready > yet, so please work on something useful like that instead of > continuing this non-sense. Hey, what's going on here? Seems the dmcrypt problem transformed into a block level refcount flame :) Mikulas, please, can you talk to Tejun and find some better way to handle the DM & block level bio contexts here? (Ideally on some realistic scenario, you have enough hw in Red Hat to try, some raid0 ssds with linear on top should be a good example) and later (when agreed) implement it on dmcrypt? 
I definitely do not want dmcrypt to become a guinea pig here; it should remain as simple as possible and should do transparent _encryption_, not any inline device-mapper super-optimizing games. Thanks, Milan ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches) 2013-04-12 0:22 ` Tejun Heo 2013-04-12 5:59 ` [PATCH v2] make dm and dm-crypt forward cgroup context Milan Broz @ 2013-04-12 18:17 ` Mikulas Patocka 1 sibling, 0 replies; 45+ messages in thread From: Mikulas Patocka @ 2013-04-12 18:17 UTC (permalink / raw) To: Tejun Heo Cc: Vivek Goyal, Jens Axboe, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt, Alasdair G. Kergon On Thu, 11 Apr 2013, Tejun Heo wrote: > On Thu, Apr 11, 2013 at 08:06:10PM -0400, Mikulas Patocka wrote: > > All that I can tell you is that adding an empty atomic operation > > "cmpxchg(&bio->bi_css->refcnt, bio->bi_css->refcnt, bio->bi_css->refcnt);" > > to bio_clone_context and bio_disassociate_task increases the time to run a > > benchmark from 23 to 40 seconds. > > Right, linear target on ramdisk, very realistic, and you know what, > hell with dm, let's just hand code everything into submit_bio(). I'm > sure it will speed up your test case significantly. The purpose of this benchmarking is not to optimize the linear target on ramdisk. The purpose is to optimize the kernel for upcoming massively parallel servers, with possibly hundreds of energy-efficient cores or so. On these systems every single atomic reference really becomes a bottleneck. And since I don't have such a massively parallel server, I am testing it on a 12-core machine with a ramdisk - the performance drop due to atomic accesses can be measured even on that. I already eliminated most of the atomic operations with this patch http://people.redhat.com/~mpatocka/patches/kernel/dm-lock-optimization/dm-optimize.patch And I don't see the sense in adding more for cgroup, especially if it can be easily avoided. Mikulas ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches) 2013-04-11 20:00 ` Tejun Heo 2013-04-12 0:06 ` Mikulas Patocka @ 2013-04-12 18:01 ` Mikulas Patocka 2013-04-12 18:29 ` Tejun Heo 1 sibling, 1 reply; 45+ messages in thread From: Mikulas Patocka @ 2013-04-12 18:01 UTC (permalink / raw) To: Tejun Heo Cc: Vivek Goyal, Jens Axboe, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt, Alasdair G. Kergon On Thu, 11 Apr 2013, Tejun Heo wrote: > On Thu, Apr 11, 2013 at 12:52:03PM -0700, Tejun Heo wrote: > > If this becomes an actual bottleneck, the right thing to do is making > > css ref per-cpu. Please stop messing around with refcounting. > > If you think this kind of hackery is acceptable, you really need to > re-evaluate your priorities in making engineering decisions. In > tightly coupled code, maybe, but you're trying to introduce utterly > broken error-prone thing as a generic block layer API. I mean, are > you for real? > > -- > tejun Please describe what is wrong with the code? Why do you call it hackery? When device mapper is creating a cloned bio for the lower layer, it is already assumed that the cloned bio has shorter lifetime than the original bio it was created from. The device mapper copies a part of the bio vector from the original bio to the cloned bio, it copies pointers to pages without increasing reference counts on those pages. As long as the original bio is not returned with bio_endio, the pages must exist, so there is no need to increase their reference counts. Now, if copying pointers without increasing reference counts is OK for pages, why do you think it is not OK for cgroup context? Why do you call this bug-prone? - how do you think a bug could happen? 
If someone in device mapper erroneously ends the master bio while the cloned bio is still in progress, there is already a memory corruption bug (the cloned bio vector points to potentially freed pages) and safeguarding the cgroup pointers won't fix it. So if you think that reference counts should be incremented by every clone of the original bio, what kind of bug should it protect against? If we don't increment reference counts for pages, why should we do it for cgroup pointers? Mikulas ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches) 2013-04-12 18:01 ` Mikulas Patocka @ 2013-04-12 18:29 ` Tejun Heo 2013-04-15 13:02 ` Mikulas Patocka 0 siblings, 1 reply; 45+ messages in thread From: Tejun Heo @ 2013-04-12 18:29 UTC (permalink / raw) To: Mikulas Patocka Cc: Vivek Goyal, Jens Axboe, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt, Alasdair G. Kergon On Fri, Apr 12, 2013 at 02:01:08PM -0400, Mikulas Patocka wrote: > So if you think that reference counts should be incremented by every clone > of the original bio, what kind of bug should it protect against? If we > don't increment reference counts for pages, why should we do it for cgroup > pointers? These things are called trade-offs. You look at the overhead of the things and how complex / fragile things get when certain shortcuts are taken and how well contained and easy to verify / debug when things go wrong and then make your choice. Do the two really look the same to you? The page refs are much more expensive, mostly contained in and the main focus of dm. ioc/css refs aren't that expensive to begin with, css refcnting is widely scattered across the kernel, the association interface is likely to be used by any entity issuing IOs asynchronously soonish, and there is much saner way to improve it - which would be beneficial not only to block / dm but everyone else using it. Something being done in one place doesn't automatically make it okay everywhere else. We can and do use hackery but *with* discretion. If you still can't understand, I'm not sure what more I can tell you. -- tejun ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches) 2013-04-12 18:29 ` Tejun Heo @ 2013-04-15 13:02 ` Mikulas Patocka 2013-04-16 17:24 ` Tejun Heo 0 siblings, 1 reply; 45+ messages in thread From: Mikulas Patocka @ 2013-04-15 13:02 UTC (permalink / raw) To: Tejun Heo Cc: Vivek Goyal, Jens Axboe, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt, Alasdair G. Kergon On Fri, 12 Apr 2013, Tejun Heo wrote: > On Fri, Apr 12, 2013 at 02:01:08PM -0400, Mikulas Patocka wrote: > > So if you think that reference counts should be incremented by every clone > > of the original bio, what kind of bug should it protect against? If we > > don't increment reference counts for pages, why should we do it for cgroup > > pointers? > > These things are called trade-offs. You look at the overhead of the > things and how complex / fragile things get when certain shortcuts are > taken and how well contained and easy to verify / debug when things go > wrong and then make your choice. So what are we trading here, and for what? The patch eliminates the atomic references when passing the bio down the stack of bio drivers. The patch adds a flag BIO_DROP_CGROUP_REFCOUNT, modifies it in two places and tests it in two places - that is not big. The flag BIO_DROP_CGROUP_REFCOUNT is never supposed to be used outside bio cgroup functions, so it doesn't complicate the interface to other subsystems. The patch is not bug-prone, because we already must make sure that the cloned bio has shorter lifetime than the master bio - so the patch doesn't introduce any new possibilities to make bugs. > Do the two really look the same to you? The page refs are much more > expensive, mostly contained in and the main focus of dm. 
ioc/css refs > aren't that expensive to begin with, css refcnting is widely scattered ioc is per-task, so it is likely to be cached (but there are processors that have slow atomic operations even on cached data - on Pentium 4 it takes about 100 cycles). But css is shared between tasks and produces the cache ping-pong effect. > across the kernel, the association interface is likely to be used by > any entity issuing IOs asynchronously soonish, and there is much saner > way to improve it - which would be beneficial not only to block / dm > but everyone else using it. > > Something being done in one place doesn't automatically make it okay > everywhere else. We can and do use hackery but *with* discretion. > > If you still can't understand, I'm not sure what more I can tell you. > > -- > tejun I don't know what's wrong with 4 lines of code to manipulate a flag. I understand that you don't want to do something complicated and bug-prone to improve performance. But the patch is neither complicated nor bug-prone. Mikulas ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)

From: Tejun Heo @ 2013-04-16 17:24 UTC (permalink / raw)
To: Mikulas Patocka
Cc: Vivek Goyal, Jens Axboe, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt, Alasdair G. Kergon

Hey,

On Mon, Apr 15, 2013 at 09:02:06AM -0400, Mikulas Patocka wrote:
> The patch is not bug-prone, because we already must make sure that the
> cloned bio has a shorter lifetime than the master bio - so the patch
> doesn't introduce any new possibilities for bugs.

The whole world isn't composed of only your code. As I said repeatedly, you're introducing an API which is misleading and can easily cause subtle bugs which are very difficult to reproduce.

Imagine it being used to tag a metadata or checksum update bio being sent down while processing another bio, and used to "clone" the context of the original bio. It'll work most of the time even if the original bio gets completed first, but it'll break when it gets really unlucky - e.g. racing with other operations which can put the base css ref - and it'll be hellish to reproduce, and everyone would have to pay for your silly hack.

> > Do the two really look the same to you? The page refs are much more
> > expensive, mostly contained in and the main focus of dm. ioc/css refs
> > aren't that expensive to begin with, css refcnting is widely scattered
>
> ioc is per-task, so it is likely to be cached (but there are processors
> that have slow atomic operations even on cached data - on Pentium 4 it
> takes about 100 cycles). But css is shared between tasks and produces
> a cache ping-pong effect.

For $DEITY's sake, how many times do I have to tell you to use a per-cpu reference count? Why do I have to repeat the same story over and over again? What part of "make the reference count per-cpu" don't you get? It's not a complicated message.

At this point, I can't even understand why or what the hell you're arguing. There's a clearly better way to do it and you're just repeating yourself like a broken record that your hack in itself isn't broken.

So, if you wanna continue that way for whatever reason, you have my firm nack and I'm outta this thread.

Bye bye.

--
tejun
* Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)

From: Mikulas Patocka @ 2013-04-16 19:41 UTC (permalink / raw)
To: Tejun Heo
Cc: Vivek Goyal, Jens Axboe, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt, Alasdair G. Kergon

On Tue, 16 Apr 2013, Tejun Heo wrote:

> Hey,
>
> On Mon, Apr 15, 2013 at 09:02:06AM -0400, Mikulas Patocka wrote:
> > The patch is not bug-prone, because we already must make sure that the
> > cloned bio has a shorter lifetime than the master bio - so the patch
> > doesn't introduce any new possibilities for bugs.
>
> The whole world isn't composed of only your code. As I said
> repeatedly, you're introducing an API which is misleading and can
> easily cause subtle bugs which are very difficult to reproduce.
>
> Imagine it being used to tag a metadata or checksum update bio being
> sent down while processing another bio, and used to "clone" the context
> of the original bio. It'll work most of the time even if the original
> bio gets completed first, but it'll break when it gets really unlucky -
> e.g. racing with other operations which can put the base css ref - and
> it'll be hellish to reproduce, and everyone would have to pay for your
> silly hack.

That's why the comment at the function says: "it is assumed that the lifetime of the new bio is shorter than the lifetime of the original bio. If the new bio can outlive the old bio, the caller must increment the reference counts." - do you think it is so bad that someone will use the function without reading the comment?

Anyway, the situation that you describe could only happen in dm-bufio or dm-kcopyd, so it's easy to control and increment the reference counts there. There are no other places in device mapper where we create bios that live longer than the original one.

Mikulas
* Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)

From: Mike Snitzer @ 2013-04-18 16:47 UTC (permalink / raw)
To: Tejun Heo
Cc: Mikulas Patocka, Jens Axboe, dm-crypt, Christian Schmidt, linux-kernel, Christoph Hellwig, dm-devel, Andi Kleen, Alasdair G. Kergon, Milan Broz, Vivek Goyal

On Tue, Apr 16 2013 at 1:24pm -0400, Tejun Heo <tj@kernel.org> wrote:

> Hey,
>
> On Mon, Apr 15, 2013 at 09:02:06AM -0400, Mikulas Patocka wrote:
> > The patch is not bug-prone, because we already must make sure that the
> > cloned bio has a shorter lifetime than the master bio - so the patch
> > doesn't introduce any new possibilities for bugs.
>
> The whole world isn't composed of only your code. As I said
> repeatedly, you're introducing an API which is misleading and can
> easily cause subtle bugs which are very difficult to reproduce.
>
> Imagine it being used to tag a metadata or checksum update bio being
> sent down while processing another bio, and used to "clone" the context
> of the original bio. It'll work most of the time even if the original
> bio gets completed first, but it'll break when it gets really unlucky -
> e.g. racing with other operations which can put the base css ref - and
> it'll be hellish to reproduce, and everyone would have to pay for your
> silly hack.
>
> > > Do the two really look the same to you? The page refs are much more
> > > expensive, mostly contained in and the main focus of dm. ioc/css refs
> > > aren't that expensive to begin with, css refcnting is widely scattered
> >
> > ioc is per-task, so it is likely to be cached (but there are processors
> > that have slow atomic operations even on cached data - on Pentium 4 it
> > takes about 100 cycles). But css is shared between tasks and produces
> > a cache ping-pong effect.
>
> For $DEITY's sake, how many times do I have to tell you to use a per-cpu
> reference count? Why do I have to repeat the same story over and over
> again? What part of "make the reference count per-cpu" don't you get?
> It's not a complicated message.
>
> At this point, I can't even understand why or what the hell you're
> arguing. There's a clearly better way to do it and you're just
> repeating yourself like a broken record that your hack in itself isn't
> broken.
>
> So, if you wanna continue that way for whatever reason, you have my
> firm nack and I'm outta this thread.
>
> Bye bye.

Hey Tejun,

I see your nack and raise you with: please reconsider in the near term.

Your point about not wanting to introduce a generic block interface that isn't "safe" for all users is noted. But as Mikulas has repeatedly said, DM does _not_ ever need to do the refcounting. So it seems a bit absurd to introduce the requirement that DM stand up an interface that uses per-cpu refcounting. That is a fair amount of churn that DM will never have a need to take advantage of.

So why not introduce __bio_copy_association(bio1, bio2) and add a BUG_ON in it if bio2 isn't a clone of bio1? When there is a need for async IO to have more scalable refcounting, that would be the time to introduce a bio_copy_association that uses per-cpu refcounting (and yes, we could then even nuke __bio_copy_association).

It just seems to me a bit burdensome to ask Mikulas to add this infrastructure when DM really doesn't need it at all. But again, I do understand your desire for others to be steering the kernel where it needs to be to benefit future use-cases. But I think in general it is best to introduce complexity when there is an actual need.

Your insights are amazingly helpful, and I think it is unfortunate that this refcounting issue overshadowed the positive advancements of dm-crypt scaling. I'm just looking to see if we can carry on with a temporary intermediate step with __bio_copy_association.

Thanks,
Mike
* Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)

From: Tejun Heo @ 2013-04-18 17:03 UTC (permalink / raw)
To: Mike Snitzer
Cc: Mikulas Patocka, Jens Axboe, dm-crypt, Christian Schmidt, linux-kernel, Christoph Hellwig, dm-devel, Andi Kleen, Alasdair G. Kergon, Milan Broz, Vivek Goyal

Hello, Mike.

On Thu, Apr 18, 2013 at 12:47:42PM -0400, Mike Snitzer wrote:
> I see your nack and raise you with: please reconsider in the near term.

The thing is that percpu-refcnting is already in mostly-ready form, so unless this dm series is planned to be merged for v3.10-rc1, I don't see the need for a separate near-term solution. Mikulas can just do the normal refcnting, and cgroup will most likely adopt the per-cpu ref anyway once percpu-refcnting is in, so it should all fall into place pretty quickly. For devel / testing / whatever, Mikulas can surely turn off cgroup, right?

Thanks.

--
tejun
* Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)

From: Mike Snitzer @ 2013-05-22 18:50 UTC (permalink / raw)
To: Tejun Heo
Cc: Jens Axboe, dm-crypt, Christian Schmidt, linux-kernel, Christoph Hellwig, dm-devel, Andi Kleen, Mikulas Patocka, Vivek Goyal, Milan Broz, Alasdair G. Kergon

On Thu, Apr 18 2013 at 1:03pm -0400, Tejun Heo <tj@kernel.org> wrote:

> Hello, Mike.
>
> On Thu, Apr 18, 2013 at 12:47:42PM -0400, Mike Snitzer wrote:
> > I see your nack and raise you with: please reconsider in the near term.
>
> The thing is that percpu-refcnting is already in mostly-ready form, so
> unless this dm series is planned to be merged for v3.10-rc1, I don't
> see the need for a separate near-term solution. Mikulas can just do
> the normal refcnting, and cgroup will most likely adopt the per-cpu ref
> anyway once percpu-refcnting is in, so it should all fall into place
> pretty quickly. For devel / testing / whatever, Mikulas can surely
> turn off cgroup, right?

Hey Tejun,

Was wondering: how is percpu-refcounting coming along? Do you have a pointer to the code that can be pulled in for use by Mikulas' dm-crypt changes?

Would be nice to get this stuff sorted out for the 3.11 merge window.

Mike
* Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)

From: Tejun Heo @ 2013-05-22 19:48 UTC (permalink / raw)
To: Mike Snitzer
Cc: Jens Axboe, dm-crypt, Christian Schmidt, linux-kernel, Christoph Hellwig, dm-devel, Andi Kleen, Mikulas Patocka, Vivek Goyal, Milan Broz, Alasdair G. Kergon

Hey,

On Wed, May 22, 2013 at 02:50:14PM -0400, Mike Snitzer wrote:
> Was wondering: how is percpu-refcounting coming along? Do you have a
> pointer to the code that can be pulled in for use by Mikulas' dm-crypt
> changes?
>
> Would be nice to get this stuff sorted out for the 3.11 merge window.

Still in progress. Waiting for Kent's next round, and yeah, I do hope so too, as I really wanna convert css refcnting to it regardless of the dm-crypt work. The thread is at

  https://patchwork.kernel.org/patch/2562111/

Thanks.

--
tejun
* Re: dm-crypt parallelization patches

From: Vivek Goyal @ 2013-04-09 18:36 UTC (permalink / raw)
To: Mikulas Patocka
Cc: Jens Axboe, Tejun Heo, Mike Snitzer, Milan Broz, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt

On Tue, Apr 09, 2013 at 01:51:43PM -0400, Mikulas Patocka wrote:
> Hi
>
> I placed the dm-crypt parallelization patches at:
> http://people.redhat.com/~mpatocka/patches/kernel/dm-crypt-paralelizace/current/
>
> The patches parallelize dm-crypt and make it possible to use all
> processor cores.
>
> The patch dm-crypt-remove-percpu.patch removes some percpu variables
> and replaces them with per-request variables.
>
> The patch dm-crypt-unbound-workqueue.patch sets WQ_UNBOUND on the
> encryption workqueue, allowing the encryption to be distributed to all
> CPUs in the system.
>
> The patch dm-crypt-offload-writes-to-thread.patch moves submission of
> all write requests to a single thread.
>
> The patch dm-crypt-sort-requests.patch sorts write requests submitted
> by a single thread. The requests are sorted according to the sector
> number; an rb-tree is used for efficient sorting.
>
> Some usage notes:
>
> * turn off automatic cpu frequency scaling (or set it to the
>   "performance" governor) - cpufreq doesn't recognize the encryption
>   workload correctly; sometimes it underclocks all the CPU cores when
>   there is some encryption work to do, resulting in bad performance
>
> * when using a filesystem on an encrypted dm-crypt device, reduce the
>   maximum request size with "/sys/block/dm-2/queue/max_sectors_kb"
>   (substitute "dm-2" with the real name of your dm-crypt device). Note
>   that having too big requests means there is a small number of
>   requests and they cannot be distributed to all available processors
>   in parallel - this results in worse performance. Having too small
>   requests results in high request overhead and also reduced
>   performance. So you must find the optimal request size for your
>   system and workload. For me, when testing this on ramdisk, the
>   optimum is 8KiB.
>
> ---
>
> Now, the problem with the I/O scheduler: when doing performance
> testing, it turns out that the parallel version is sometimes worse
> than the previous implementation.
>
> When I create a 4.3GiB dm-crypt device on top of dm-loop on top of an
> ext2 filesystem on a 15k SCSI disk and run this command
>
> time fio --rw=randrw --size=64M --bs=256k --filename=/dev/mapper/crypt
> --direct=1 --name=job1 --name=job2 --name=job3 --name=job4 --name=job5
> --name=job6 --name=job7 --name=job8 --name=job9 --name=job10
> --name=job11 --name=job12
>
> the results are this:
>
> CFQ scheduler:
> --------------
> no patches:                              21.9s
> patch 1:                                 21.7s
> patches 1,2:                             2:33s
> patches 1,2 (+ nr_requests = 1280000):   2:18s
> patches 1,2,3:                           20.7s
> patches 1,2,3,4:                         20.7s
>
> deadline scheduler:
> -------------------
> no patches:                              27.4s
> patch 1:                                 27.4s
> patches 1,2:                             27.8s
> patches 1,2,3:                           29.6s
> patches 1,2,3,4:                         29.6s
>
> We can see that CFQ performs badly with patch 2, but improves with
> patch 3. All that patch 3 does is move write requests from the
> encryption threads to a separate thread.
>
> So it seems that CFQ has some deficiency in that it cannot merge
> adjacent requests done by different processes.

CFQ does not merge requests across different cfq queues (cfqq). Each queue is associated with one iocontext. So in this case each worker thread is submitting its own bio, and each 4K bio must be going into a separate cfqq, hence no merging is taking place.

The moment you applied patch 3, where a single thread submitted the bios, each bio went into a single queue and possibly got merged.

So either use a single thread to submit bios or, better, use bio_associate_current() (as Tejun suggested) on the original 256K bio. (Hopefully the bio iocontext association information is retained when you split the bios into smaller pieces.)

> The problem is this:
> - we have a 256k write direct-i/o request
> - it is broken into 4k bios (because we run on dm-loop on a filesystem
>   with a 4k block size)
> - encryption of these 4k bios is distributed to 12 processes on a
>   12-core machine
> - encryption finishes out of order and in different processes; 4k bios
>   with encrypted data are submitted to CFQ
> - CFQ doesn't merge them
> - the disk is flooded with random 4k write requests, and performs much
>   worse than with 256k requests

Thanks
Vivek
* Re: [dm-devel] dm-crypt performance

From: Mikulas Patocka @ 2013-04-09 18:08 UTC (permalink / raw)
To: Milan Broz
Cc: Mike Snitzer, dm-devel, Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig, Christian Schmidt

On Tue, 26 Mar 2013, Milan Broz wrote:

> - Are we sure we are not introducing another side channel in disk
>   encryption? (An unprivileged user can measure timing here.)
>   (Perhaps a stupid reason, but please do not prefer performance to
>   security in encryption. We have enough timing attacks on AES
>   implementations already...)

So use serpent - it is implemented without any data-dependent lookup tables, so it has no timing attacks.

AES uses data-dependent lookup tables; on a CPU with hyperthreading, the second thread can observe the L1 cache footprint left by the first thread and get some information about the data being encrypted...

Mikulas
* Re: [dm-crypt] [dm-devel] dm-crypt performance

From: Milan Broz @ 2013-04-09 18:59 UTC (permalink / raw)
To: Mikulas Patocka
Cc: Milan Broz, Mike Snitzer, dm-crypt, Christian Schmidt, linux-kernel, Christoph Hellwig, dm-devel, Andi Kleen

On 9.4.2013 20:08, Mikulas Patocka wrote:

> On Tue, 26 Mar 2013, Milan Broz wrote:
>
>> - Are we sure we are not introducing another side channel in disk
>>   encryption? (An unprivileged user can measure timing here.)
>>   (Perhaps a stupid reason, but please do not prefer performance to
>>   security in encryption. We have enough timing attacks on AES
>>   implementations already...)
>
> So use serpent - it is implemented without any data-dependent lookup
> tables, so it has no timing attacks.

I wish using something other than AES were just such a simple technical issue for many people. But e.g. just try it in FIPS mode, where AES is the only option :-)

Anyway, using bio_associate_current() seems to be the right way to try now...

Milan
end of thread, other threads: [~2013-05-22 19:48 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --

[not found] <Pine.LNX.4.64.1303252051520.9745@file.rdu.redhat.com>
2013-03-26 12:27 ` [dm-devel] dm-crypt performance Alasdair G Kergon
2013-03-26 20:05 ` Milan Broz
2013-03-26 20:28 ` Mike Snitzer
2013-03-26 20:58 ` Milan Broz
2013-03-28 18:53 ` Tejun Heo
2013-03-28 19:33 ` Vivek Goyal
2013-03-28 19:44 ` Tejun Heo
2013-03-28 20:38 ` Vivek Goyal
2013-03-28 20:45 ` Tejun Heo
2013-04-09 17:51 ` dm-crypt parallelization patches Mikulas Patocka
2013-04-09 17:57 ` Tejun Heo
2013-04-09 18:08 ` Mikulas Patocka
2013-04-09 18:10 ` Tejun Heo
2013-04-09 18:42 ` Vivek Goyal
2013-04-09 18:57 ` Tejun Heo
2013-04-09 19:13 ` Vivek Goyal
2013-04-09 19:42 ` Mikulas Patocka
2013-04-09 19:52 ` Tejun Heo
2013-04-09 20:32 ` Mikulas Patocka
2013-04-09 21:02 ` Tejun Heo
2013-04-09 21:03 ` Tejun Heo
2013-04-09 21:07 ` Vivek Goyal
2013-04-09 21:18 ` Mikulas Patocka
2013-04-10 19:24 ` Vivek Goyal
2013-04-10 23:42 ` [PATCH] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches) Mikulas Patocka
2013-04-10 23:50 ` Tejun Heo
2013-04-11 19:49 ` [PATCH v2] " Mikulas Patocka
2013-04-11 19:52 ` Tejun Heo
2013-04-11 20:00 ` Tejun Heo
2013-04-12  0:06 ` Mikulas Patocka
2013-04-12  0:22 ` Tejun Heo
2013-04-12  5:59 ` [PATCH v2] make dm and dm-crypt forward cgroup context Milan Broz
2013-04-12 18:17 ` [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches) Mikulas Patocka
2013-04-12 18:01 ` Mikulas Patocka
2013-04-12 18:29 ` Tejun Heo
2013-04-15 13:02 ` Mikulas Patocka
2013-04-16 17:24 ` Tejun Heo
2013-04-16 19:41 ` Mikulas Patocka
2013-04-18 16:47 ` Mike Snitzer
2013-04-18 17:03 ` Tejun Heo
2013-05-22 18:50 ` Mike Snitzer
2013-05-22 19:48 ` Tejun Heo
2013-04-09 18:36 ` dm-crypt parallelization patches Vivek Goyal
2013-04-09 18:08 ` [dm-devel] dm-crypt performance Mikulas Patocka
2013-04-09 18:59 ` [dm-crypt] " Milan Broz