Subject: Re: [dm-devel] [PATCH 1/1] block: Convert hd_struct in_flight from atomic to percpu
To: Ming Lei, Brian King
Cc: linux-block, "open list:DEVICE-MAPPER (LVM)", Alasdair Kergon, Mike Snitzer
References: <20170628211010.4C8C9124035@b01ledav002.gho.pok.ibm.com> <7f0a852e-5f90-4c63-9a43-a4180557530c@kernel.dk> <07ba10a8-6369-c1bc-dc9a-b550d9394c22@kernel.dk> <8f4ff428-e158-0df5-cf54-ae3cdea7ad1f@kernel.dk> <3729482c-fcca-2af5-4d05-7e44bcd71159@linux.vnet.ibm.com> <0759ff58-caa0-9e55-b5ac-6324d9ba521b@kernel.dk>
From: Jens Axboe
Message-ID: <599ba934-902d-d6ce-5a5a-9b32657b4a08@kernel.dk>
Date: Fri, 30 Jun 2017 17:26:53 -0600

On 06/30/2017 05:23 PM, Ming Lei wrote:
> Hi Brian,
>
> On Sat, Jul 1, 2017 at 2:33 AM, Brian King wrote:
>> On 06/30/2017 09:08 AM, Jens Axboe wrote:
>>>>>> Compared with the totally percpu approach, this way might help 1:M or
>>>>>> N:M mappings, but won't help the 1:1 mapping (NVMe), where an hctx is
>>>>>> mapped to each CPU (especially when there are huge numbers of hw queues
>>>>>> on a big system), :-(
>>>>>
>>>>> Not disagreeing with that, without having some mechanism to only loop
>>>>> over queues that have pending requests. That would be similar to the
>>>>> ctx_map for sw to hw queues. But I don't think that would be worthwhile
>>>>> doing, I like your pnode approach better. However, I'm still not fully
>>>>> convinced that one per node is enough to get the scalability we need.
>>>>>
>>>>> It would be great if Brian could re-test with your updated patch, so we
>>>>> know how it works for him at least.
>>>>
>>>> I'll try running with both approaches today and see how they compare.
>>>
>>> Focus on Ming's; a variant of that is the most likely path forward,
>>> imho. It'd be great to do a quick run on mine as well, just to establish
>>> how it compares to mainline, though.
>>
>> On my initial runs, the one from you, Jens, appears to perform a bit
>> better, although both are a huge improvement over what I was seeing
>> before.
>>
>> I ran 4k random reads using fio to null_blk in two configurations on my
>> 20 core system with 4 NUMA nodes and 4-way SMT, so 80 logical CPUs. I ran
>> both 80 threads to a single null_blk device as well as 80 threads to 80
>> null_blk devices, so one thread
>
> Could you share what the '80 null_blk devices' means? Do you create 80
> null_blk devices? Or do you create one null_blk device and set its number
> of hw queues to 80 via the "submit_queues" module parameter?

That's a valid question, I was going to ask that too. But I assumed that
Brian used submit_queues to set as many queues as he has logical CPUs in
the system.

>
> I guess we should focus on the multi-queue case, since that is the normal
> way for NVMe.
>
>> per null_blk device. This is what I saw on this machine:
>>
>> Using the per-node atomic change from Ming Lei:
>> 1 null_blk, 80 threads
>> iops=9376.5K
>>
>> 80 null_blk, 1 thread
>> iops=9523.5K
>>
>>
>> Using the alternate patch from Jens using the tags:
>> 1 null_blk, 80 threads
>> iops=9725.8K
>>
>> 80 null_blk, 1 thread
>> iops=9569.4K
>
> If 1 thread means a single fio job, the number looks far too high; that
> would mean one random IO completes in about 0.1us (100ns) on a single
> CPU, and I'm not sure that is possible, :-)

It means either 1 null_blk device with 80 threads running IO to it, or 80
null_blk devices, each with one thread running IO to it. See above, he
details that it's 80 threads on 80 devices for that case.

-- 
Jens Axboe
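
For readers following the thread, the trade-off being benchmarked above is
between a single shared in-flight counter (contended by every submitting CPU)
and a per-CPU or per-node counter that is only summed when the total is
actually needed. The following is a minimal userspace sketch of that idea, not
the kernel patch itself; the counter layout, thread count, and function names
are illustrative assumptions only, with pthreads standing in for CPUs.

/*
 * Sketch: each "CPU" (thread) updates only its own cache-line-padded
 * counter on the IO hot path; the aggregate in-flight count is computed
 * by summing all counters on the (rare) read side.
 *
 * Build with: cc -O2 -pthread inflight.c
 */
#include <pthread.h>
#include <stdio.h>

#define NR_COUNTERS    80        /* stand-in for the number of CPUs   */
#define IOS_PER_THREAD 1000000   /* fake IOs issued by each submitter */

struct inflight_counter {
	long count;
	char pad[64 - sizeof(long)]; /* avoid false sharing between "CPUs" */
};

static struct inflight_counter counters[NR_COUNTERS];

/* Hot path: touch only the local counter, never a shared cache line. */
static void io_start(int cpu)
{
	__atomic_fetch_add(&counters[cpu].count, 1, __ATOMIC_RELAXED);
}

static void io_done(int cpu)
{
	__atomic_fetch_sub(&counters[cpu].count, 1, __ATOMIC_RELAXED);
}

/* Slow path: sum everything, e.g. when stats are read. */
static long in_flight_total(void)
{
	long sum = 0;

	for (int i = 0; i < NR_COUNTERS; i++)
		sum += __atomic_load_n(&counters[i].count, __ATOMIC_RELAXED);
	return sum;
}

static void *submitter(void *arg)
{
	int cpu = (int)(long)arg;

	for (int i = 0; i < IOS_PER_THREAD; i++) {
		io_start(cpu);
		/* ... submit and complete the IO ... */
		io_done(cpu);
	}
	return NULL;
}

int main(void)
{
	pthread_t threads[NR_COUNTERS];

	for (long i = 0; i < NR_COUNTERS; i++)
		pthread_create(&threads[i], NULL, submitter, (void *)i);
	for (int i = 0; i < NR_COUNTERS; i++)
		pthread_join(threads[i], NULL);

	printf("in flight after all IO completed: %ld\n", in_flight_total());
	return 0;
}

A per-node variant, as in Ming's patch, would simply size the counter array by
NUMA node rather than by CPU, trading some cross-CPU contention within a node
for a cheaper summation on the read side.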