From mboxrd@z Thu Jan 1 00:00:00 1970
From: Neil Horman
Subject: Re: [RFC] eal: add cgroup-aware resource self discovery
Date: Tue, 26 Jan 2016 09:19:07 -0500
Message-ID: <20160126141907.GA20685@hmsreliant.think-freely.org>
References: <1453661393-85704-1-git-send-email-jianfeng.tan@intel.com>
 <20160125134636.GA29690@hmsreliant.think-freely.org>
 <56A6D85A.6030400@intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: dev@dpdk.org, yuanhan.liu@intel.com
To: "Tan, Jianfeng"
Received: from smtp.tuxdriver.com (charlotte.tuxdriver.com [70.61.120.58])
 by dpdk.org (Postfix) with ESMTP id 8D5AA8E85
 for ; Tue, 26 Jan 2016 15:19:18 +0100 (CET)
Content-Disposition: inline
In-Reply-To: <56A6D85A.6030400@intel.com>
List-Id: patches and discussions about DPDK
Errors-To: dev-bounces@dpdk.org
Sender: "dev"

On Tue, Jan 26, 2016 at 10:22:18AM +0800, Tan, Jianfeng wrote:
>
> Hi Neil,
>
> On 1/25/2016 9:46 PM, Neil Horman wrote:
> >On Mon, Jan 25, 2016 at 02:49:53AM +0800, Jianfeng Tan wrote:
> ...
> >>--
> >>2.1.4
> >>
> >>
> >
> >This doesn't make a whole lot of sense, for several reasons:
> >
> >1) Applications, as a general rule shouldn't be interrogating the cgroups
> >interface at all.
>
> The main reason to do this in DPDK is that DPDK obtains resource information
> from sysfs and proc, which are not well containerized so far. And DPDK
> pre-allocates resource instead of on-demand gradual allocating.
>
Not disagreeing with this, just suggesting that:
1) Interrogating cgroups really isn't the best way to collect that information
2) Pre-allocating those resources isn't particularly wise without some mechanism
to reallocate them, as resource constraints can change (consider your cpuset
getting rewritten)

> >
> >2) Cgroups aren't the only way in which a cpuset or memoryset can be restricted
> >(the isolcpus command line argument, or a taskset on a parent process for
> >instance, but there are several others).
>
> Yes, I agree. To enable that, I'd like design the new API for resource self
> discovery in a flexible way. A parameter "type" is used to specify the
> solution to discovery way. In addition, I'm considering to add a callback
> function pointer so that users can write their own resource discovery
> functions.
>
Why? You don't need an API for this, or if you really want one, it can be very
generic if you use POSIX APIs to gather the information. What you have here is
going to be very Linux specific, and will need reimplementing for BSD or other
operating systems. To use the cpuset example, instead of reading and parsing the
mask files in the cgroup filesystem to find your task and its corresponding
mask, just call sched_setaffinity with an all f's mask, then call
sched_getaffinity. The returned mask will be all the cpus your process is
allowed to execute on, taking into account every limiting filter the system you
are running on offers.

There are similar OS level POSIX APIs for most resources out there. You really
don't need to dig through cgroups just to learn what some of those resources
are.

> >
> >Instead of trying to figure out what cpuset is valid for your process by
> >interrogating the cgroups hierarchy, instead you should follow the prescribed
> >method of calling sched_getaffinity after calling sched_setaffinity.
> >That will give you the canonical cpuset that you are executing on, taking all
> >cpuset filters into account (including cgroups and any other restrictions).
> >It's far simpler as well, as it doesn't require a ton of file/string
> >processing.
>
> Yes, this way is much better for cpuset discovery. But is there such a
> syscall for hugepages?
>
In what capacity? Interrogating how many hugepages you have, or which node they
are affined to? Capacity would require reading the requisite proc file, as
there's no POSIX API for this resource. Node affinity can be implied by setting
the NUMA policy of the DPDK process and then writing to
/proc/sys/vm/nr_hugepages, as the kernel will attempt to distribute hugepages
evenly across the nodes in the task's NUMA policy configuration.

That said, I would advise that you strongly consider not exporting hugepages as
a resource, as:

a) Applications generally don't need to know that they are using hugepages, and
so they don't need to know where said hugepages live. They just allocate memory
via your allocation API and you give them something appropriate.

b) Hugepages are a resource that is very specific to Linux, and to X86 Linux at
that. Some OSes implement similar resources, but they may have very different
semantics. And other arches may or may not implement various forms of compound
paging at all. As the DPDK expands to support more OSes and arches, it would be
nice to ensure that the programming surfaces that you expose have a broader
level of support.

Neil

> Thanks,
> Jianfeng
>
> >
> >Neil
> >
> >
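
For reference, the sched_setaffinity/sched_getaffinity sequence described in
the reply above could look roughly like the sketch below. It assumes Linux with
glibc, where both calls and the CPU_* macros are exposed by _GNU_SOURCE as
Linux-specific extensions. The helper name discover_allowed_cpus is
hypothetical and does not come from the thread or from DPDK. Treating a failed
widening step as non-fatal is also an assumption of this sketch, not something
the thread specifies.

```c
#define _GNU_SOURCE /* exposes sched_*affinity and the CPU_* macros in glibc */
#include <assert.h>
#include <sched.h>
#include <stdio.h>

/*
 * Hypothetical helper (name is illustrative only): fill *out with the CPUs
 * the calling thread is allowed to run on. Returns the number of CPUs in
 * the set, or -1 if the mask could not be read back.
 */
static int discover_allowed_cpus(cpu_set_t *out)
{
    cpu_set_t all;
    int cpu;

    assert(out != NULL);

    /* Build the "all f's" mask: every CPU a fixed-size cpu_set_t can name. */
    CPU_ZERO(&all);
    for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
        CPU_SET(cpu, &all);

    /*
     * Ask for everything; pid 0 means the calling thread. The kernel trims
     * the request to what the thread is actually permitted to use. Assumed
     * policy for this sketch: a failure here is reported but not fatal,
     * and the current mask is read back regardless.
     */
    if (sched_setaffinity(0, sizeof(all), &all) != 0)
        perror("sched_setaffinity");

    /* Read back the mask that is now in effect. */
    CPU_ZERO(out);
    if (sched_getaffinity(0, sizeof(*out), out) != 0) {
        perror("sched_getaffinity");
        return -1;
    }

    return CPU_COUNT(out);
}
```

A caller would declare a cpu_set_t, pass its address to the helper, and then
test individual cpus with CPU_ISSET. Two caveats: the first call changes the
calling thread's affinity mask as a side effect, and a fixed-size cpu_set_t
only covers CPU_SETSIZE cpus, so very large machines would need the
dynamically sized CPU_ALLOC variants instead.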