On Thu, Jun 29, 2017 at 11:58 PM, Jens Axboe wrote:
> On 06/29/2017 02:40 AM, Ming Lei wrote:
>> On Thu, Jun 29, 2017 at 5:49 AM, Jens Axboe wrote:
>>> On 06/28/2017 03:12 PM, Brian King wrote:
>>>> This patch converts the in_flight counter in struct hd_struct from a
>>>> pair of atomics to a pair of percpu counters. This eliminates a couple
>>>> of atomics from the hot path. When running this on a Power system,
>>>> against a single null_blk device with 80 submission queues, irq mode 0,
>>>> with 80 fio jobs, I saw IOPS go from 1.5M IO/s to 11.4M IO/s.
>>>
>>> This has been done before, but I've never really liked it. The reason is
>>> that it means that reading the part stat inflight count now has to
>>> iterate over every possible CPU. Did you use partitions in your testing?
>>> How many CPUs were configured? When I last tested this a few years ago
>>> on even a quad core Nehalem (which is notoriously shitty for cross-node
>>> latencies), it was a net loss.
>>
>> One year ago, I saw null_blk's IOPS decrease to 10% of the
>> non-RQF_IO_STAT case on a dual socket ARM64 (each CPU has
>> 96 cores, and dual NUMA nodes) too. The performance could be
>> basically recovered if a per-NUMA-node counter was introduced and
>> used in this case, but the patch was never posted out.
>> If anyone is interested in that, I can rebase the patch on the current
>> block tree and post it out. I guess the performance issue might be
>> related to the system's cache coherency implementation more or less.
>> This issue on ARM64 can be observed with the following userspace
>> atomic counting test too:
>>
>>        http://kernel.ubuntu.com/~ming/test/cache/
>
> How well did the per-node thing work? Doesn't seem to me like it would

Last time, on ARM64, I remember that the IOPS was basically recovered,
but now I don't have such a machine to test on.

Could Brian test the attached patch to see if it works on a big Power
machine?

And the idea is simple: just make the atomic counter per-node (a rough
sketch of what that could look like is below).

> go far enough. And per CPU is too much. One potential improvement would
> be to change the part_stat_read() to just loop online CPUs, instead of
> all possible CPUs. When CPUs go on/offline, use that as the slow path to
> ensure the stats are sane. Often there's a huge difference between
> NR_CPUS configured and what the system has. As Brian states, RH ships
> with 2048, while I doubt a lot of customers actually run that...

One observation I saw on an arm64 dual socket system before is that
atomic inc/dec on a counter stored in the local NUMA node is much cheaper
than on a remote one, which is why I tried the per-node counter. And wrt.
the in-flight atomic counter, both inc and dec should happen on CPUs
belonging to the same NUMA node in the case of blk-mq.

>
> Outside of coming up with a more clever data structure that is fully
> CPU topology aware, one thing that could work is just having X cache
> line separated read/write inflight counters per node, where X is some
> suitable value (like 4). That prevents us from having cross node
> traffic, and it also keeps the cross cpu traffic fairly low. That should
> provide a nice balance between cost of incrementing the inflight
> counting, and the cost of looping for reading it.
>
> And that brings me to the next part...
>
>>> I do agree that we should do something about it, and it's one of those
>>> items I've highlighted in talks about blk-mq on pending issues to fix
>>> up. It's just not great as it currently stands, but I don't think per
>>> CPU counters is the right way to fix it, at least not for the inflight
>>> counter.
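To make the per-node idea concrete, here is a minimal sketch, and only a
sketch: it is not the attached patch, and struct part_in_flight, the field
layout, and the helper names are all made up for illustration. Each NUMA
node gets its own cacheline-aligned read/write counter pair, so inc/dec
stays node-local and the read side loops nr_node_ids entries instead of
every possible CPU.

#include <linux/atomic.h>
#include <linux/cache.h>
#include <linux/numa.h>
#include <linux/nodemask.h>
#include <linux/topology.h>

/*
 * One read/write counter pair per NUMA node, each pair on its own
 * cache line so inc/dec never bounces a line across sockets.
 */
struct node_in_flight {
	atomic_t counter[2];		/* [READ] and [WRITE] */
} ____cacheline_aligned_in_smp;

struct part_in_flight {
	struct node_in_flight node[MAX_NUMNODES];
};

static inline void part_inc_in_flight(struct part_in_flight *inf, int rw)
{
	atomic_inc(&inf->node[numa_node_id()].counter[rw]);
}

static inline void part_dec_in_flight(struct part_in_flight *inf, int rw)
{
	atomic_dec(&inf->node[numa_node_id()].counter[rw]);
}

/*
 * A node-local counter can go negative when the dec runs on a different
 * node than the inc did, but the sum across all nodes stays correct.
 */
static inline int part_in_flight_read(struct part_in_flight *inf, int rw)
{
	int node, sum = 0;

	for (node = 0; node < nr_node_ids; node++)
		sum += atomic_read(&inf->node[node].counter[rw]);
	return sum;
}

Jens' "X cacheline-separated counters per node" variant would just widen
node[] into a small per-node array (indexed by e.g. smp_processor_id() % X),
trading a slightly longer read loop for less contention within a node.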
>>
>> Yeah, it won't be an issue for the non-mq path, and for the blk-mq path,
>> maybe we can use some blk-mq knowledge (tagset?) to figure out the
>> 'in_flight' counter. I thought about it before, but never got a
>> perfect solution, and it looks a bit hard, :-)
>
> The tags are already a bit spread out, so it's worth a shot. That would
> remove the need to do anything in the inc/dec path, as the tags already
> do that. The inflight count could be easily retrieved with
> sbitmap_weight(). The only issue here is that we need separate read and
> write counters, and the weight would obviously only get us the total
> count. But we can have a slower path for that, just iterate the tags and
> count them. The fast path only cares about the total count.
>
> Let me try that out real quick.
>
> --
> Jens Axboe

Thanks,
Ming Lei
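As a concrete reading of the tag-based idea above, here is a hedged sketch
of the fast/slow split on a v4.12-era tree. It is not Jens' eventual patch:
part_in_flight_total(), count_rw(), and part_in_flight_rw() are made-up
names, and the void-returning busy_tag_iter_fn callback signature of that
era is assumed. Only sbitmap_weight() and blk_mq_tagset_busy_iter() are
the existing helpers referenced in the mail.

/*
 * This would live inside block/, since struct blk_mq_tags is private
 * to block/blk-mq-tag.h.
 */
#include <linux/blkdev.h>
#include <linux/blk-mq.h>
#include <linux/sbitmap.h>
#include "blk-mq-tag.h"

/*
 * Fast path: total in-flight count for one hardware queue, read straight
 * from the tag bitmap. Nothing is done in the submit/complete path at
 * all, since allocating or freeing a tag already flips the bit.
 */
static unsigned int part_in_flight_total(struct blk_mq_tags *tags)
{
	return sbitmap_weight(&tags->bitmap_tags.sb);
}

/*
 * Slow path: when a separate read/write split is needed, walk the busy
 * tags and bucket each request by its data direction.
 */
static void count_rw(struct request *rq, void *priv, bool reserved)
{
	unsigned int *inflight = priv;

	inflight[rq_data_dir(rq)]++;
}

static void part_in_flight_rw(struct blk_mq_tag_set *set,
			      unsigned int inflight[2])
{
	inflight[READ] = 0;
	inflight[WRITE] = 0;
	blk_mq_tagset_busy_iter(set, count_rw, inflight);
}

Note that blk_mq_tagset_busy_iter() walks every started request in the
whole tag set, so a per-partition breakdown would additionally need to
check rq->part in the callback; the fast path above still keeps the
common total-count read free of any per-IO accounting.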