From mboxrd@z Thu Jan 1 00:00:00 1970 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965187AbeALTSb (ORCPT + 1 other); Fri, 12 Jan 2018 14:18:31 -0500 Received: from mx0a-001b2d01.pphosted.com ([148.163.156.1]:55646 "EHLO mx0a-001b2d01.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S964998AbeALTS3 (ORCPT ); Fri, 12 Jan 2018 14:18:29 -0500 Date: Fri, 12 Jan 2018 11:18:25 -0800 From: "Paul E. McKenney" To: Mikulas Patocka Cc: Mike Snitzer , Zdenek Kabelac , "Alasdair G. Kergon" , dm-devel@redhat.com, linux-kernel@vger.kernel.org Subject: Re: SRCU's apparent use of NR_CPUS? [was: re: dm: allocate struct mapped_device with kvzalloc] Reply-To: paulmck@linux.vnet.ibm.com References: <20171101154844.GA25792@redhat.com> <20171101162306.GU3659@linux.vnet.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) X-TM-AS-GCONF: 00 x-cbid: 18011219-0040-0000-0000-000003E0F771 X-IBM-SpamModules-Scores: X-IBM-SpamModules-Versions: BY=3.00008366; HX=3.00000241; KW=3.00000007; PH=3.00000004; SC=3.00000245; SDB=6.00974056; UDB=6.00493603; IPR=6.00754002; BA=6.00005775; NDR=6.00000001; ZLA=6.00000005; ZF=6.00000009; ZB=6.00000000; ZP=6.00000000; ZH=6.00000000; ZU=6.00000002; MB=3.00019006; XFM=3.00000015; UTC=2018-01-12 19:18:25 X-IBM-AV-DETECTION: SAVI=unused REMOTE=unused XFE=unused x-cbparentid: 18011219-0041-0000-0000-000007D6536B Message-Id: <20180112191825.GB9671@linux.vnet.ibm.com> X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:,, definitions=2018-01-12_10:,, signatures=0 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 priorityscore=1501 malwarescore=0 suspectscore=2 phishscore=0 bulkscore=0 spamscore=0 clxscore=1015 lowpriorityscore=0 impostorscore=0 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1709140000 definitions=main-1801120259 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Return-Path: On Fri, Nov 03, 2017 at 04:10:50PM -0400, Mikulas Patocka wrote: > > > On Wed, 1 Nov 2017, Paul E. McKenney wrote: > > > On Wed, Nov 01, 2017 at 11:48:44AM -0400, Mike Snitzer wrote: > > > [cc'ing Paul, and LKML, to get his/others' take on SRCU cpu scaling] > > > > > > On Tue, Oct 31 2017 at 7:33pm -0400, > > > Mikulas Patocka wrote: > > > > > > > The structure srcu_struct can be very big, its size is proportional to the > > > > value CONFIG_NR_CPUS. The Fedora kernel has CONFIG_NR_CPUS 8192, the field > > > > io_barrier in the struct mapped_device has 84kB in the debugging kernel > > > > and 50kB in the non-debugging kernel. The large size may result in failure > > > > of the function kzalloc_node. > > > > > > > > In order to avoid the allocation failure, we use the function > > > > kvzalloc_node, this function falls back to vmalloc if a large contiguous > > > > chunk of memory is not available. This patch also moves the field > > > > io_barrier to the last position of struct mapped_device - the reason is > > > > that on many processor architectures, short memory offsets result in > > > > smaller code than long memory offsets - on x86-64 it reduces code size by > > > > 320 bytes. > > > > > > > > Note to stable kernel maintainers - the kernels 4.11 and older don't have > > > > the function kvzalloc_node, you can use the function vzalloc_node instead. > > > > > > > > Signed-off-by: Mikulas Patocka > > > > Cc: stable@vger.kernel.org > > > > > > This looks reasonable as a near-term workaround.. BUT: > > > Paul has there been any discussion about how to make SRCU support > > > dynamically scaling up to NR_CPUS maximum as 'nr_cpus' changes (rather > > > than accounting for worst case of NR_CPUS up-front)? > > > > This is the first I have heard of this being a problem. > > > > For static instances of srcu_struct, life is hard. > > > > But it should not be all that difficult for SRCU to provide an allocator > > for the dynamic cases, which given your kzalloc_node() above is the case > > you are worried about, at least assuming that these allocations happen > > after rcu_init() is invoked (which is pretty early). > > > > My approach would be to move the srcu_struct ->node[] array to its > > own structure, with a pointer from srcu_struct, allowing short-sized > > allocations to be used. (But I do need to check to make sure that there > > are no gotchas, and with RCU there usually are a few.) Obviously some > > -serious- testing would be required -- do you have a range of systems > > to test on? > > > > However, you would still have your potential failure case for systems > > that really did have large numbers of CPUs, some of which really do > > exist in the wild. > > So - you can allocate srcu_struct->node with kvmalloc - it will > automatically fallback to vmalloc if the allocation is too large for > kmalloc. And after looking more closely at this, I am now leaning more towards providing a function to run-time allocate an srcu_struct. I would move the ->node[] array to the end to allow variable-length allocations. This would mean that your structures would contain a pointer to the srcu_struct rather than the srcu_struct itself. Would that work for you? Thanx, Paul > Mikulas > > > > (But I had a quick look at scrutree.h and I'm not seeing explicit use of > > > NR_CPUS, so it is likely occuring via implicit percpu through some > > > member of 'struct srcu_struct', e.g. 'sda'?) > > > > The srcu_struct structure sees NR_CPUS via include/linux/rcu_node_tree.h, > > which sizes the srcu_node array at build time. > > > > The sda pointer references a per-CPU allocation, which I believe already > > is sized to the actual system rather than to NR_CPUS. > > > > Thanx, Paul > > > > > Thanks, > > > Mike > > > > > > > > > > > --- > > > > drivers/md/dm-core.h | 3 ++- > > > > drivers/md/dm.c | 6 +++--- > > > > 2 files changed, 5 insertions(+), 4 deletions(-) > > > > > > > > Index: linux-2.6/drivers/md/dm-core.h > > > > =================================================================== > > > > --- linux-2.6.orig/drivers/md/dm-core.h > > > > +++ linux-2.6/drivers/md/dm-core.h > > > > @@ -29,7 +29,6 @@ struct dm_kobject_holder { > > > > * DM targets must _not_ deference a mapped_device to directly access its members! > > > > */ > > > > struct mapped_device { > > > > - struct srcu_struct io_barrier; > > > > struct mutex suspend_lock; > > > > > > > > /* > > > > @@ -127,6 +126,8 @@ struct mapped_device { > > > > struct blk_mq_tag_set *tag_set; > > > > bool use_blk_mq:1; > > > > bool init_tio_pdu:1; > > > > + > > > > + struct srcu_struct io_barrier; > > > > }; > > > > > > > > void dm_init_md_queue(struct mapped_device *md); > > > > Index: linux-2.6/drivers/md/dm.c > > > > =================================================================== > > > > --- linux-2.6.orig/drivers/md/dm.c > > > > +++ linux-2.6/drivers/md/dm.c > > > > @@ -1695,7 +1695,7 @@ static struct mapped_device *alloc_dev(i > > > > struct mapped_device *md; > > > > void *old_md; > > > > > > > > - md = kzalloc_node(sizeof(*md), GFP_KERNEL, numa_node_id); > > > > + md = kvzalloc_node(sizeof(*md), GFP_KERNEL, numa_node_id); > > > > if (!md) { > > > > DMWARN("unable to allocate device, out of memory."); > > > > return NULL; > > > > @@ -1795,7 +1795,7 @@ bad_io_barrier: > > > > bad_minor: > > > > module_put(THIS_MODULE); > > > > bad_module_get: > > > > - kfree(md); > > > > + kvfree(md); > > > > return NULL; > > > > } > > > > > > > > @@ -1814,7 +1814,7 @@ static void free_dev(struct mapped_devic > > > > free_minor(minor); > > > > > > > > module_put(THIS_MODULE); > > > > - kfree(md); > > > > + kvfree(md); > > > > } > > > > > > > > static void __bind_mempools(struct mapped_device *md, struct dm_table *t) > > > > > >