From mboxrd@z Thu Jan  1 00:00:00 1970
From: Neil Horman <nhorman@tuxdriver.com>
Subject: Re: [RFC] eal: add cgroup-aware resource self discovery
Date: Tue, 26 Jan 2016 09:19:07 -0500
Message-ID: <20160126141907.GA20685@hmsreliant.think-freely.org>
References: <1453661393-85704-1-git-send-email-jianfeng.tan@intel.com>
 <20160125134636.GA29690@hmsreliant.think-freely.org>
 <56A6D85A.6030400@intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: dev@dpdk.org, yuanhan.liu@intel.com
To: "Tan, Jianfeng" <jianfeng.tan@intel.com>
Return-path: <dev-bounces@dpdk.org>
Received: from smtp.tuxdriver.com (charlotte.tuxdriver.com [70.61.120.58])
 by dpdk.org (Postfix) with ESMTP id 8D5AA8E85
 for <dev@dpdk.org>; Tue, 26 Jan 2016 15:19:18 +0100 (CET)
Content-Disposition: inline
In-Reply-To: <56A6D85A.6030400@intel.com>
List-Id: patches and discussions about DPDK <dev.dpdk.org>
List-Unsubscribe: <http://dpdk.org/ml/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://dpdk.org/ml/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <http://dpdk.org/ml/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
Errors-To: dev-bounces@dpdk.org
Sender: "dev" <dev-bounces@dpdk.org>

On Tue, Jan 26, 2016 at 10:22:18AM +0800, Tan, Jianfeng wrote:
> 
> Hi Neil,
> 
> On 1/25/2016 9:46 PM, Neil Horman wrote:
> >On Mon, Jan 25, 2016 at 02:49:53AM +0800, Jianfeng Tan wrote:
> ...
> >>-- 
> >>2.1.4
> >>
> >>
> >
> >This doesn't make a whole lot of sense, for several reasons:
> >
> >1) Applications, as a general rule, shouldn't be interrogating the cgroups
> >interface at all.
> 
> The main reason to do this in DPDK is that DPDK obtains resource information
> from sysfs and proc, which are not well containerized so far. And DPDK
> pre-allocates resources instead of allocating them gradually on demand.
> 
Not disagreeing with this, just suggesting that:

1) Interrogating cgroups really isn't the best way to collect that information
2) Pre-allocating those resources isn't particularly wise without some mechanism
to reallocate them, as resource constraints can change (consider your cpuset
getting rewritten)

> >
> >2) Cgroups aren't the only way in which a cpuset or memoryset can be restricted
> >(the isolcpus command line argument, or a taskset on a parent process for
> >instance, but there are several others).
> 
> Yes, I agree. To enable that, I'd like to design the new API for resource self
> discovery in a flexible way. A parameter "type" is used to select the
> discovery method. In addition, I'm considering adding a callback
> function pointer so that users can write their own resource discovery
> functions.
> 
Why?  You don't need an API for this, or if you really want one, it can be very
generic if you use OS-level APIs (POSIX ones where they exist) to gather the
information.  What you have here is going to be very Linux specific, and will
need reimplementing for BSD or other operating systems.  To use the cpuset
example, instead of reading and parsing the mask files in the cgroup filesystem
to find your task and corresponding mask, just call sched_setaffinity with an
all f's mask, then call sched_getaffinity.  The returned mask will be all the
cpus your process is allowed to execute on, taking into account every limiting
filter the system you are running on offers.
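
A minimal sketch of that set-then-get sequence, assuming Linux with glibc (the
function name discover_allowed_cpus and its widen_first flag are made up for
illustration, they are not part of any existing DPDK API).  Note that the set
step really does change the calling thread's affinity, so it is optional here:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  /* exposes cpu_set_t and the CPU_* macros in glibc */
#endif
#include <sched.h>
#include <string.h>

/*
 * Hypothetical helper: returns the number of CPUs the calling thread may
 * run on, or -1 on error.  If 'allowed' is non-NULL the mask is copied out.
 * If 'widen_first' is non-zero, an all-ones mask is requested first; the
 * kernel is expected to trim it to what the thread is permitted to use.
 */
static int
discover_allowed_cpus(cpu_set_t *allowed, int widen_first)
{
	cpu_set_t mask;

	if (widen_first) {
		/* "All f's" request; best effort, so the result is ignored. */
		memset(&mask, 0xff, sizeof(mask));
		(void)sched_setaffinity(0, sizeof(mask), &mask);
	}

	/* pid 0 means the calling thread.  A fixed-size cpu_set_t can be too
	 * small on very large machines, in which case this call fails. */
	CPU_ZERO(&mask);
	if (sched_getaffinity(0, sizeof(mask), &mask) != 0)
		return -1;

	if (allowed != NULL)
		*allowed = mask;
	return CPU_COUNT(&mask);
}
```

A caller would then walk the returned set with CPU_ISSET() to decide which
lcores to use, with no cgroup file parsing involved.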

There are similar OS-level APIs for most resources out there.  You really
don't need to dig through cgroups just to learn what some of those resources are.

> >
> >Instead of trying to figure out what cpuset is valid for your process by
> >interrogating the cgroups hierarchy, you should follow the prescribed
> >method of calling sched_getaffinity after calling sched_setaffinity.  That will
> >give you the canonical cpuset that you are executing on, taking all cpuset
> >filters into account (including cgroups and any other restrictions).  It's far
> >simpler as well, as it doesn't require a ton of file/string processing.
> 
> Yes, this way is much better for cpuset discovery. But is there such a
> syscall for hugepages?
> 
In what capacity?  Interrogating how many hugepages you have, or which node
they are affined to?  Capacity would require reading the requisite proc file, as
there's no POSIX API for this resource.  Node affinity can be implied by setting
the NUMA policy of the DPDK process and then writing to
/proc/sys/vm/nr_hugepages, as the kernel will attempt to distribute hugepages
evenly among the nodes in the task's NUMA policy configuration.
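
For the capacity side, a rough sketch of reading the requisite proc file,
assuming Linux and the usual "Key:   value" layout of /proc/meminfo (the helper
name meminfo_value is hypothetical, and the HugePages_Total / HugePages_Free
fields are only present on kernels built with hugetlb support):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: look up one numeric field in /proc/meminfo, e.g.
 * "HugePages_Free".  Returns 0 and stores the value in *out on success,
 * -1 if the file cannot be opened or the field is not present.
 */
static int
meminfo_value(const char *key, long *out)
{
	char line[256];
	size_t klen = strlen(key);
	FILE *f = fopen("/proc/meminfo", "r");
	int ret = -1;

	if (f == NULL)
		return -1;

	while (fgets(line, sizeof(line), f) != NULL) {
		/* Match "key:" exactly so a key that is only a prefix of a
		 * longer field name does not match that longer field. */
		if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
			*out = strtol(line + klen + 1, NULL, 10);
			ret = 0;
			break;
		}
	}
	fclose(f);
	return ret;
}
```

Calling meminfo_value("HugePages_Free", &n) gives the number of free default
sized hugepages where the field exists; it says nothing about which node they
sit on.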

That said, I would advise that you strongly consider not exporting hugepages as
a resource, as:

a) Applications generally don't need to know that they are using hugepages, and
so they don't need to know where said hugepages live; they just allocate memory
via your allocation API and you give them something appropriate

b) Hugepages are a resource that is very specific to Linux, and to x86 Linux at
that.  Some OSes implement similar resources, but they may have very different
semantics.  And other arches may or may not implement various forms of compound
paging at all.  As the DPDK expands to support more OSes and arches, it would
be nice to ensure that the programming surfaces that you expose have a
broader level of support.

Neil

> Thanks,
> Jianfeng
> 
> >
> >Neil
> >
> 
>