Date: Tue, 12 Jul 2016 10:27:27 -0400
From: Tejun Heo
To: Waiman Long
Cc: Alexander Viro, Jan Kara, Jeff Layton, "J. Bruce Fields",
	Christoph Lameter, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Dave Chinner, Boqun Feng, Scott J Norton,
	Douglas Hatch
Subject: Re: [RFC PATCH v2 6/7] lib/persubnode: Introducing a simple per-subnode APIs
Message-ID: <20160712142727.GA3190@htj.duckdns.org>
References: <1468258332-61537-1-git-send-email-Waiman.Long@hpe.com>
	<1468258332-61537-7-git-send-email-Waiman.Long@hpe.com>
In-Reply-To: <1468258332-61537-7-git-send-email-Waiman.Long@hpe.com>

Hello,

On Mon, Jul 11, 2016 at 01:32:11PM -0400, Waiman Long wrote:
> The percpu APIs are extensively used in the Linux kernel to reduce
> cacheline contention and improve performance. For some use cases, the
> percpu APIs may be too fine-grained for distributed resources, whereas
> a per-node allocation may be too coarse, as some high-end systems can
> have dozens of CPUs in a single NUMA node.
>
> This patch introduces a simple per-subnode API where each of the
> distributed resources is shared by only a handful of CPUs within
> a NUMA node. The per-subnode APIs are built on top of the percpu APIs
> and hence require the same amount of memory as if the percpu APIs
> were used.
> However, they help to reduce the total number of separate
> resources that need to be managed. As a result, they can speed up code
> that needs to iterate over all the resources compared with using the
> percpu APIs. Cacheline contention, however, will increase slightly as
> each resource is shared by more than one CPU. As long as the number of
> CPUs in each subnode is small, the performance impact won't be
> significant.
>
> In this patch, at most 2 sibling groups can be put into a subnode. For
> an x86-64 CPU, at most 4 CPUs will be in a subnode when HT is enabled
> and 2 when it is not.

I understand that there's a trade-off between local access and global
traversal and you're trying to find a sweet spot between the two, but
this seems pretty arbitrary. What's the use case? What are the numbers?
Why are global traversals frequent enough to matter so much?

Thanks.

-- 
tejun