From: NeilBrown
Date: Thu, 10 Jan 2019 12:36:33 +1100
Subject: [lustre-devel] [PATCH v2 33/33] lustre: update version to 2.9.99
In-Reply-To: <53F17B0D-5FB3-4B69-B483-7AA4FBCA259B@whamcloud.com>
References: <1546812868-11794-1-git-send-email-jsimmons@infradead.org>
 <1546812868-11794-34-git-send-email-jsimmons@infradead.org>
 <874lakj5ck.fsf@notabene.neil.brown.name>
 <45806D2E-8AAD-48ED-8B14-6D5CC11D824E@whamcloud.com>
 <9BDCA6A9-A826-49A7-9126-BC1DCC96AC1D@whamcloud.com>
 <53F17B0D-5FB3-4B69-B483-7AA4FBCA259B@whamcloud.com>
Message-ID: <87k1jdi99a.fsf@notabene.neil.brown.name>
To: lustre-devel@lists.lustre.org

On Wed, Jan 09 2019, Andreas Dilger wrote:

> On Jan 9, 2019, at 11:28, James Simmons wrote:
>>
>>>>> This might be because the upstream Lustre doesn't allow setting per-process
>>>>> JobID via environment variable, only as a single per-node value.  The real
>>>>> unfortunate part is that the "get JobID from environment" actually works for
>>>>> every reasonable architecture (even the one which was originally broken
>>>>> fixed it), but it got yanked anyway.  This is actually one of the features
>>>>> of Lustre that lots of HPC sites like to use, since it allows them to track
>>>>> on the servers which users/jobs/processes on the client are doing IO.
>>>>
>>>> To give background for Neil, see this thread:
>>>>
>>>> https://lore.kernel.org/patchwork/patch/416846
>>>>
>>>> In this case I do agree with Greg.  The latest JobID code does implement an
>>>> upcall, and upcalls don't play nicely with containers.  There is also the
>>>> namespace issue pointed out.  I think the namespace issue might be fixed
>>>> in the latest OpenSFS code.
>>>
>>> I'm not sure what you mean?  AFAIK, there is no upcall for JobID, except
>>> maybe in the kernel client where we weren't allowed to parse the process
>>> environment directly.  I agree an upcall is problematic with namespaces,
>>> in addition to being less functional (only a JobID per node instead of
>>> per process), which is why direct access to JOBENV is better IMHO.
>>
>> I have some evil ideas about this.  Need to think about it some more since
>> this is a more complex problem.
>
> Since the kernel manages the environment variables via getenv() and setenv(),
> I honestly don't see why accessing them directly is a huge issue?

This is, at best, an over-simplification.  The kernel doesn't "manage"
the environment variables.  When a process calls execve() (or similar),
a collection of strings called "arguments" and another collection of
strings called "environment" are extracted from the process's VM and
used to initialize part of the newly created VM.  That is all the
kernel does with either (except for providing /proc/*/cmdline and
/proc/*/environ, which are best-effort).  getenv() and setenv() are
implemented entirely in user-space.

It is quite possible for a process to mess up its args or environment
in a way that will make /proc/*/{cmdline,environ} fail to return
anything useful.  It is quite possible for the memory storing args and
env to be swapped out.  If a driver tried to access either, it might
trigger page-in of that part of the address space, which would probably
work but might not be a good idea.

As I understand it, the goal here is to have a cluster-wide identifier
that can be attached to groups of processes on different nodes, so that
stats relating to all of those processes can be collected together.
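For what it's worth, "direct access" to another process's environment
from user-space looks roughly like the untested sketch below.  It only
sees the best-effort snapshot that /proc/<pid>/environ exposes, and the
variable name "LUSTRE_JOBID" is just an example, not anything the
current code actually uses:

    /* Untested sketch: print one variable from another process's
     * environment, as seen through /proc/<pid>/environ.  The kernel
     * only exposes a best-effort snapshot of the region that held the
     * environment at execve() time, so this can legitimately find
     * nothing even when the variable is set.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>

    static int print_env_var(pid_t pid, const char *name)
    {
            char path[64], buf[65536];
            size_t namelen = strlen(name), n;
            char *p, *end;
            FILE *f;

            snprintf(path, sizeof(path), "/proc/%d/environ", (int)pid);
            f = fopen(path, "r");
            if (!f)
                    return -1;
            n = fread(buf, 1, sizeof(buf) - 1, f);
            fclose(f);
            buf[n] = '\0';
            end = buf + n;

            /* entries are NUL-separated "NAME=value" strings */
            for (p = buf; p < end; p += strlen(p) + 1) {
                    if (strncmp(p, name, namelen) == 0 && p[namelen] == '=') {
                            printf("%s\n", p + namelen + 1);
                            return 0;
                    }
            }
            return -1;
    }

    int main(int argc, char **argv)
    {
            if (argc != 2)
                    return 1;
            return print_env_var(atoi(argv[1]), "LUSTRE_JOBID") ? 1 : 0;
    }

Nothing there is hard, but as above there is no guarantee the snapshot
still says anything sensible, which is part of why doing the equivalent
from inside a filesystem driver is contentious.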
If I didn't think that control groups were an abomination, I would
probably suggest using them to define a group of processes and then
attaching a tag to that group.  Both the net_cls and net_prio cgroups
do exactly this.  perf_event does as well, and even uses the tag
precisely for collecting performance data for a set of processes.
Maybe we could try to champion an fs_event control group?  However, it
is a long time since I've looked at control groups, and they might have
moved on a bit.

But as I do think that control groups are an abomination, I couldn't
possibly suggest any such thing.  Unix already has a perfectly good
grouping abstraction - process groups (unfortunately there are about
three sorts of these, but that needn't be a big problem).  Stats can be
collected based on pgid, and a mapping from client+pgid->jobid can be
communicated to whatever collects the statistics ... somehow.

NeilBrown
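PS: to make the pgid idea slightly more concrete, here is an untested
sketch of a trivial job-launcher wrapper.  The map-file path and the
use of SLURM_JOB_ID are purely illustrative - the real transport for
the client+pgid->jobid mapping is the "... somehow" above:

    /* Untested sketch: put the job into its own process group (which
     * all its children inherit) and publish the pgid->jobid mapping
     * for whatever collects the statistics.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/types.h>

    int main(int argc, char **argv)
    {
            const char *jobid = getenv("SLURM_JOB_ID");
            FILE *f;

            if (argc < 2 || !jobid)
                    return 1;

            /* become leader of a new process group; children inherit it */
            if (setpgid(0, 0) < 0)
                    return 1;

            /* publish this node's pgid -> cluster-wide jobid mapping
             * (hypothetical location; a real collector would define one) */
            f = fopen("/tmp/pgid_to_jobid.map", "a");
            if (f) {
                    fprintf(f, "%d %s\n", (int)getpgid(0), jobid);
                    fclose(f);
            }

            execvp(argv[1], argv + 1);      /* run the actual job command */
            return 1;
    }

The kernel already maintains the pgid for every task, so tagging IO
with it needs no help from user-space at all; only the mapping to a
cluster-wide jobid has to be published somewhere.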