* [lustre-devel] Proposal for JobID caching
@ 2017-01-18 20:08 Ben Evans
  2017-01-18 20:39 ` Oleg Drokin
  0 siblings, 1 reply; 14+ messages in thread
From: Ben Evans @ 2017-01-18 20:08 UTC (permalink / raw)
  To: lustre-devel

Overview
            The Lustre filesystem added the ability to track the I/O performance of a job across a cluster.  The initial algorithm was relatively simplistic: for every I/O, look up the job ID of the process and include it in the RPC sent to the server.  This imposed a non-trivial impact on client I/O performance.
            An additional algorithm was introduced to handle the single job per node case: instead of looking up the job ID of the process, Lustre simply reads the value of a variable set through the proc interface.  This improved performance greatly, but only functions when a single job is being run.
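
            For reference, the single-job mode is driven through lctl, roughly as follows (the jobid_var/jobid_name parameter names match Lustre releases of this era; the JobID value is illustrative):

                lctl set_param jobid_var=nodelocal
                lctl set_param jobid_name=job_1234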
            A new approach is needed for multiple job per node systems.

Proposed Solution
            The proposed solution is to create a small PID->JobID table in kernel memory.  When a process performs an I/O, a lookup is done in the table for its PID; if a JobID exists for that PID, it is used.  Otherwise, the JobID is retrieved via the same methods as the original jobstats algorithm and, once located, stored in the PID->JobID table.  The existing cfs_hash_table structure and functions will be used to implement the table.
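
            As a rough illustration of the intended lookup flow, here is a minimal sketch using the pid_to_jobid structure defined under New Data Structures below.  It uses the stock Linux hashtable API and a single global lock for brevity; the actual implementation would use cfs_hash, and the helper name is hypothetical:

#include <linux/hashtable.h>
#include <linux/spinlock.h>
#include <linux/string.h>

#define JOBID_HASH_BITS 8

/* Sketch only: a small PID->JobID cache keyed by PID. */
static DEFINE_HASHTABLE(jobid_cache, JOBID_HASH_BITS);
static DEFINE_SPINLOCK(jobid_cache_lock);       /* protects the table */

static bool jobid_cache_lookup(u64 pid, char *jobid, size_t len)
{
        struct pid_to_jobid *entry;
        bool found = false;

        spin_lock(&jobid_cache_lock);
        /* Walk only the bucket this PID hashes to. */
        hash_for_each_possible(jobid_cache, entry, pj_hash, pid) {
                if (entry->pj_pid == pid) {
                        strlcpy(jobid, entry->pj_jobid, len);
                        found = true;
                        break;
                }
        }
        spin_unlock(&jobid_cache_lock);
        return found;
}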

Rationale
            This reduces the number of lookups of the process's userspace environment, minimizing the time taken on each I/O.  It also easily supports multiple job per node scenarios and, like other proposed solutions, has no issue with multiple jobs performing I/O on the same file at the same time.

Requirements

*      Performance cannot significantly detract from baseline performance without jobstats

*      Supports multiple jobs per node

*      Coordination with the scheduler is not required, but interfaces may be provided

*      Supports multiple PIDs per job

New Data Structures
            struct pid_to_jobid {
                        struct hlist_node pj_hash;
                        u64 pj_pid;
                        char pj_jobid[LUSTRE_JOBID_SIZE];
                        spinlock_t pj_lock;
                        time_t pj_time;
            };
Proc Variables
Writing a JobID to /proc/fs/lustre/jobid_name while not in "nodelocal" mode will cause all cache entries for that JobID to be removed.
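
For example, a job epilog could clear a finished job's cache entries (hypothetical JobID shown; assumes jobid_var is not set to "nodelocal"):

    echo "job_1234" > /proc/fs/lustre/jobid_name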

Populating the Cache
            When lustre_get_jobid is called in cached mode, a check is first done in the cache for a valid PID to JobID mapping.  If none exists, the JobID is retrieved via the same mechanisms as the original jobstats algorithm and the appropriate PID to JobID entry is populated.
If a lookup finds a PID to JobID mapping that is more than 30 seconds old, the JobID is refreshed.
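
            A sketch of this cached path, continuing the earlier illustration.  jobid_cache_find(), jobid_from_environ() and jobid_cache_insert() are hypothetical helpers standing in for the hash lookup, the original environment scan, and the table insert; get_seconds() is the kernel time accessor of this era, and the 30-second threshold is the one proposed here:

/* Sketch only, not the actual Lustre implementation. */
int lustre_get_jobid(char *jobid)
{
        struct pid_to_jobid *entry = jobid_cache_find(current->pid);

        /* Fresh cache hit: copy the stored JobID and return. */
        if (entry && get_seconds() - entry->pj_time <= 30) {
                spin_lock(&entry->pj_lock);
                strlcpy(jobid, entry->pj_jobid, LUSTRE_JOBID_SIZE);
                spin_unlock(&entry->pj_lock);
                return 0;
        }

        /* Miss or stale entry: fall back to the original jobstats
         * lookup, then (re)populate the PID to JobID mapping. */
        if (jobid_from_environ(current, jobid) == 0) {
                jobid_cache_insert(current->pid, jobid);
                return 0;
        }
        return -ENOENT;
}
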
Purging the Cache
            The cache can be purged of a specific job by writing the JobID to the jobid_name proc file.  Any entries in the cache that are more than 300 seconds old will also be purged at this time.
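
            A sketch of the purge pass, again using the stock hashtable API from the earlier illustration (the 300-second threshold is the one proposed here):

/* Sketch only: called when a JobID is written to jobid_name outside
 * "nodelocal" mode.  Drops entries for that JobID, plus any entry
 * more than 300 seconds old. */
static void jobid_cache_purge(const char *jobid)
{
        struct pid_to_jobid *entry;
        struct hlist_node *tmp;
        int bkt;

        spin_lock(&jobid_cache_lock);
        hash_for_each_safe(jobid_cache, bkt, tmp, entry, pj_hash) {
                if (strcmp(entry->pj_jobid, jobid) == 0 ||
                    get_seconds() - entry->pj_time > 300) {
                        hash_del(&entry->pj_hash);
                        kfree(entry);   /* needs <linux/slab.h> */
                }
        }
        spin_unlock(&jobid_cache_lock);
}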


* [lustre-devel] Proposal for JobID caching
  2017-01-18 20:08 [lustre-devel] Proposal for JobID caching Ben Evans
@ 2017-01-18 20:39 ` Oleg Drokin
  2017-01-18 22:35   ` Ben Evans
  2017-01-20 21:50   ` Dilger, Andreas
  0 siblings, 2 replies; 14+ messages in thread
From: Oleg Drokin @ 2017-01-18 20:39 UTC (permalink / raw)
  To: lustre-devel


On Jan 18, 2017, at 3:08 PM, Ben Evans wrote:

> Overview
>             The Lustre filesystem added the ability to track I/O performance of a job across a cluster.  The initial algorithm was relatively simplistic:  for every I/O, look up the job ID of the process and include it in the RPC being sent to the server.  This imposed a non-trivial performance impact on client I/O performance.
>             An additional algorithm was introduced to handle the single job per node case, where instead of looking up the job ID of the process, Lustre simply accesses the value of a variable set through the proc interface.  This improved performance greatly, but only functions when a single job is being run.
>             A new approach is needed for multiple job per node systems.
>  
> Proposed Solution
>             The proposed solution to this is to create a small PID->JobID table in kernel memory.  When a process performs an IO, a lookup is done in the table for the PID, if a JobID exists for that PID, it is used, otherwise it is retrieved via the same methods as the original Jobstats algorithm.  Once located the JobID is stored in a PID/JobID table in memory. The existing cfs_hash_table structure and functions will be used to implement the table.
>  
> Rationale
>             This reduces the number of calls into userspace, minimizing the time taken on each I/O.  It also easily supports multiple job per node scenarios, and like other proposed solutions has no issue with multiple jobs performing I/O on the same file at the same time.
>  
> Requirements
> *      Performance cannot significantly detract from baseline performance without jobstats
> *      Supports multiple jobs per node
> *      Coordination with the scheduler is not required, but interfaces may be provided
> *      Supports multiple PIDs per job
>              
> New Data Structures
>             pid_to_jobid {
>                         struct hlist_node pj_hash;
>                         u64 pj_pid;
>                         char pj_jobid[LUSTRE_JOBID_SIZE];
> spinlock_t jp_lock;
>                         time_t jp_time;
> }
> Proc Variables
> Writing to /proc/fs/lustre/jobid_name while not in "nodelocal" mode will cause all entries in the cache for that jobID to be removed from the cache
>  
> Populating the Cache
>             When lustre_get_jobid is called in cached mode, a check will first be done in the cache for a valid PID to JobID mapping.  If none exists, it uses the same mechanisms to get the JobID and populates the appropriate PID to JobID map.
> If a lookup is performed and the PID to JobID mapping exists, but is more than 30 seconds old, the JobID is refreshed.
> Purging the Cache
>             The cache can be purged of a specific job by writing the JobID to the jobid_name proc file.  Any items in the cache that are more than 300 seconds out of date will also be purged at this time.


I'd much rather prefer that you go to a table that's populated from outside the
kernel somehow.
Let's be realistic: poking around in userspace process environments for random
strings is not such a great idea at all, even though it did look like a good idea
in the past for simplicity reasons.
Similar to nodelocal, we could probably just switch to a method where you call a
particular lctl command that would mark the whole session as belonging
to some job.  This might take several forms; e.g., nodelocal itself could
be extended to only apply to the current namespace/container.
But if you really do run different jobs in the global namespace, we could
probably just make lctl spawn a shell whose commands would all
be marked as a particular job.  Or we could probably trace the parent of lctl and
mark it so that all its children become somehow marked too.

Bye,
    Oleg


* [lustre-devel] Proposal for JobID caching
  2017-01-18 20:39 ` Oleg Drokin
@ 2017-01-18 22:35   ` Ben Evans
  2017-01-18 22:56     ` Oleg Drokin
  2017-01-20 21:50   ` Dilger, Andreas
  1 sibling, 1 reply; 14+ messages in thread
From: Ben Evans @ 2017-01-18 22:35 UTC (permalink / raw)
  To: lustre-devel



On 1/18/17, 3:39 PM, "Oleg Drokin" <oleg.drokin@intel.com> wrote:

>
>On Jan 18, 2017, at 3:08 PM, Ben Evans wrote:
>
>> Overview
>>             The Lustre filesystem added the ability to track I/O
>>performance of a job across a cluster.  The initial algorithm was
>>relatively simplistic:  for every I/O, look up the job ID of the process
>>and include it in the RPC being sent to the server.  This imposed a
>>non-trivial performance impact on client I/O performance.
>>             An additional algorithm was introduced to handle the single
>>job per node case, where instead of looking up the job ID of the
>>process, Lustre simply accesses the value of a variable set through the
>>proc interface.  This improved performance greatly, but only functions
>>when a single job is being run.
>>             A new approach is needed for multiple job per node systems.
>>  
>> Proposed Solution
>>             The proposed solution to this is to create a small
>>PID->JobID table in kernel memory.  When a process performs an IO, a
>>lookup is done in the table for the PID, if a JobID exists for that PID,
>>it is used, otherwise it is retrieved via the same methods as the
>>original Jobstats algorithm.  Once located the JobID is stored in a
>>PID/JobID table in memory. The existing cfs_hash_table structure and
>>functions will be used to implement the table.
>>  
>> Rationale
>>             This reduces the number of calls into userspace, minimizing
>>the time taken on each I/O.  It also easily supports multiple job per
>>node scenarios, and like other proposed solutions has no issue with
>>multiple jobs performing I/O on the same file at the same time.
>>  
>> Requirements
>> *      Performance cannot significantly detract from baseline
>>performance without jobstats
>> *      Supports multiple jobs per node
>> *      Coordination with the scheduler is not required, but interfaces
>>may be provided
>> *      Supports multiple PIDs per job
>>              
>> New Data Structures
>>             pid_to_jobid {
>>                         struct hlist_node pj_hash;
>>                         u64 pj_pid;
>>                         char pj_jobid[LUSTRE_JOBID_SIZE];
>> spinlock_t jp_lock;
>>                         time_t jp_time;
>> }
>> Proc Variables
>> Writing to /proc/fs/lustre/jobid_name while not in "nodelocal" mode
>>will cause all entries in the cache for that jobID to be removed from
>>the cache
>>  
>> Populating the Cache
>>             When lustre_get_jobid is called in cached mode, a check
>>will first be done in the cache for a valid PID to
>>JobID mapping.  If none exists, it uses the same mechanisms to get the
>>JobID and populates the appropriate PID to JobID map.
>> If a lookup is performed and the PID to JobID mapping exists, but is
>>more than 30 seconds old, the JobID is refreshed.
>> Purging the Cache
>>             The cache can be purged of a specific job by writing the
>>JobID to the jobid_name proc file.  Any items in the cache that are more
>>than 300 seconds out of date will also be purged at this time.
>
>
>I'd much rather prefer you go to the table that's populated outside of
>the kernel
>somehow.
>Let's be realistic, poking around in userspace process environments for
>random
>strings is not such a great idea at all even though it did look like a
>good idea
>in the past for simplicity reasons.

On the upside, there's far less of that going on now, since the results
are cached by PID.  I'm unaware of a table that exists in userspace that
maps PIDs to jobs.

>Similar to nodelocal, we probably just switch to a method where you call a
>particular lctl command that would mark the whole session as belonging
>to some job. This might take several forms, e.g. nodelocal itself could
>be extended to only apply to a current namespace/container

That would make sense, but would require that each job have
its own namespace/container.

>But if you do really run different jobs in the global namespace, we
>probably can
>probably just make the lctl to spawn a shell with commands that all would
>be marked as a particular job? Or we can probably trace the parent of
>lctl and
>mark that so that all its children become somehow marked too.

One of the things that came up during this is how do you handle a random
user who logs into a compute node and runs something like rsync?  The more
conditions we place around getting jobstats to function properly, the
harder these types of behaviors are to track down.  One thing I was
thinking was that if jobstats is enabled, the fallback when no JobID
can be found is to simply use the procname_uid method, so an admin would
see rsync.1234 pop up on the monitoring dashboard.
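
Roughly, that fallback amounts to something like this (a sketch, not the
actual Lustre code; jobid is the output buffer):

/* Format "<command>.<uid>", e.g. "rsync.1234", from the current task. */
snprintf(jobid, LUSTRE_JOBID_SIZE, "%s.%u", current->comm,
         from_kuid(&init_user_ns, current_uid()));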

-Ben


* [lustre-devel] Proposal for JobID caching
  2017-01-18 22:35   ` Ben Evans
@ 2017-01-18 22:56     ` Oleg Drokin
  2017-01-19 15:19       ` Ben Evans
  0 siblings, 1 reply; 14+ messages in thread
From: Oleg Drokin @ 2017-01-18 22:56 UTC (permalink / raw)
  To: lustre-devel


On Jan 18, 2017, at 5:35 PM, Ben Evans wrote:

> 
> 
> On 1/18/17, 3:39 PM, "Oleg Drokin" <oleg.drokin@intel.com> wrote:
> 
>> 
>> On Jan 18, 2017, at 3:08 PM, Ben Evans wrote:
>> 
>>> Overview
>>>            The Lustre filesystem added the ability to track I/O
>>> performance of a job across a cluster.  The initial algorithm was
>>> relatively simplistic:  for every I/O, look up the job ID of the process
>>> and include it in the RPC being sent to the server.  This imposed a
>>> non-trivial performance impact on client I/O performance.
>>>            An additional algorithm was introduced to handle the single
>>> job per node case, where instead of looking up the job ID of the
>>> process, Lustre simply accesses the value of a variable set through the
>>> proc interface.  This improved performance greatly, but only functions
>>> when a single job is being run.
>>>            A new approach is needed for multiple job per node systems.
>>> 
>>> Proposed Solution
>>>            The proposed solution to this is to create a small
>>> PID->JobID table in kernel memory.  When a process performs an IO, a
>>> lookup is done in the table for the PID, if a JobID exists for that PID,
>>> it is used, otherwise it is retrieved via the same methods as the
>>> original Jobstats algorithm.  Once located the JobID is stored in a
>>> PID/JobID table in memory. The existing cfs_hash_table structure and
>>> functions will be used to implement the table.
>>> 
>>> Rationale
>>>            This reduces the number of calls into userspace, minimizing
>>> the time taken on each I/O.  It also easily supports multiple job per
>>> node scenarios, and like other proposed solutions has no issue with
>>> multiple jobs performing I/O on the same file at the same time.
>>> 
>>> Requirements
>>> *      Performance cannot significantly detract from baseline
>>> performance without jobstats
>>> *      Supports multiple jobs per node
>>> *      Coordination with the scheduler is not required, but interfaces
>>> may be provided
>>> *      Supports multiple PIDs per job
>>> 
>>> New Data Structures
>>>            pid_to_jobid {
>>>                        struct hlist_node pj_hash;
>>>                        u64 pj_pid;
>>>                        char pj_jobid[LUSTRE_JOBID_SIZE];
>>> spinlock_t jp_lock;
>>>                        time_t jp_time;
>>> }
>>> Proc Variables
>>> Writing to /proc/fs/lustre/jobid_name while not in "nodelocal" mode
>>> will cause all entries in the cache for that jobID to be removed from
>>> the cache
>>> 
>>> Populating the Cache
>>>            When lustre_get_jobid is called in cached mode, a check
>>> will first be done in the cache for a valid PID to
>>> JobID mapping.  If none exists, it uses the same mechanisms to get the
>>> JobID and populates the appropriate PID to JobID map.
>>> If a lookup is performed and the PID to JobID mapping exists, but is
>>> more than 30 seconds old, the JobID is refreshed.
>>> Purging the Cache
>>>            The cache can be purged of a specific job by writing the
>>> JobID to the jobid_name proc file.  Any items in the cache that are more
>>> than 300 seconds out of date will also be purged at this time.
>> 
>> 
>> I'd much rather prefer you go to the table that's populated outside of
>> the kernel
>> somehow.
>> Let's be realistic, poking around in userspace process environments for
>> random
>> strings is not such a great idea at all even though it did look like a
>> good idea
>> in the past for simplicity reasons.
> 
> On the upside, there's far less of that going on now, since the results
> are cached via pid.  I'm unaware of a table that exists in userspace that
> maps PIDs to Jobs.

there is not.

>> Similar to nodelocal, we probably just switch to a method where you call a
>> particular lctl command that would mark the whole session as belonging
>> to some job. This might take several forms, e.g. nodelocal itself could
>> be extended to only apply to a current namespace/container
> 
> That would make sense, but would require that each job have
> its own namespace/container.

Only if you run multiple jobs per node at the same time;
otherwise just do nodelocal for the global root namespace.

>> But if you do really run different jobs in the global namespace, we
>> probably can
>> probably just make the lctl to spawn a shell with commands that all would
>> be marked as a particular job? Or we can probably trace the parent of
>> lctl and
>> mark that so that all its children become somehow marked too.
> 
> One of the things that came up during this is how do you handle a random
> user who logs into a compute node and runs something like rsync?  The more

The current scheme does not handle it either, unless you use nodelocal, and then their
actions would be attributed to the job currently running (not super ideal as well).
I imagine there's a legitimate reason for users to log into nodes running
unrelated jobs?

> conditions we place around getting jobstats to function properly, the
> harder these types of behaviors are to track down.  One thing I was
> thinking was that if jobstats is enabled, that the fallback if no JobID
> can be found is to simply use the procname_uid method, so an admin would
> see rsync.1234 pop up on your monitoring dashboard.

If you have every job in its own container, then the global namespace could
be set to "unscheduledcommand-$hostname" or some such, and every container
would get its own jobid.

This does require containers, of course.  Or if we set the id based on the
process group, then again they would get that, and anything outside would get
some default, which helps you.


* [lustre-devel] Proposal for JobID caching
  2017-01-18 22:56     ` Oleg Drokin
@ 2017-01-19 15:19       ` Ben Evans
  2017-01-19 16:28         ` Oleg Drokin
  0 siblings, 1 reply; 14+ messages in thread
From: Ben Evans @ 2017-01-19 15:19 UTC (permalink / raw)
  To: lustre-devel



On 1/18/17, 5:56 PM, "Oleg Drokin" <oleg.drokin@intel.com> wrote:

>
>On Jan 18, 2017, at 5:35 PM, Ben Evans wrote:
>
>> 
>> 
>> On 1/18/17, 3:39 PM, "Oleg Drokin" <oleg.drokin@intel.com> wrote:
>> 
>>> 
>>> On Jan 18, 2017, at 3:08 PM, Ben Evans wrote:
>>> 
>>>> Overview
>>>>            The Lustre filesystem added the ability to track I/O
>>>> performance of a job across a cluster.  The initial algorithm was
>>>> relatively simplistic:  for every I/O, look up the job ID of the
>>>>process
>>>> and include it in the RPC being sent to the server.  This imposed a
>>>> non-trivial performance impact on client I/O performance.
>>>>            An additional algorithm was introduced to handle the single
>>>> job per node case, where instead of looking up the job ID of the
>>>> process, Lustre simply accesses the value of a variable set through
>>>>the
>>>> proc interface.  This improved performance greatly, but only functions
>>>> when a single job is being run.
>>>>            A new approach is needed for multiple job per node systems.
>>>> 
>>>> Proposed Solution
>>>>            The proposed solution to this is to create a small
>>>> PID->JobID table in kernel memory.  When a process performs an IO, a
>>>> lookup is done in the table for the PID, if a JobID exists for that
>>>>PID,
>>>> it is used, otherwise it is retrieved via the same methods as the
>>>> original Jobstats algorithm.  Once located the JobID is stored in a
>>>> PID/JobID table in memory. The existing cfs_hash_table structure and
>>>> functions will be used to implement the table.
>>>> 
>>>> Rationale
>>>>            This reduces the number of calls into userspace, minimizing
>>>> the time taken on each I/O.  It also easily supports multiple job per
>>>> node scenarios, and like other proposed solutions has no issue with
>>>> multiple jobs performing I/O on the same file at the same time.
>>>> 
>>>> Requirements
>>>> *      Performance cannot significantly detract from baseline
>>>> performance without jobstats
>>>> *      Supports multiple jobs per node
>>>> *      Coordination with the scheduler is not required, but interfaces
>>>> may be provided
>>>> *      Supports multiple PIDs per job
>>>> 
>>>> New Data Structures
>>>>            pid_to_jobid {
>>>>                        struct hlist_node pj_hash;
>>>>                        u64 pj_pid;
>>>>                        char pj_jobid[LUSTRE_JOBID_SIZE];
>>>> spinlock_t jp_lock;
>>>>                        time_t jp_time;
>>>> }
>>>> Proc Variables
>>>> Writing to /proc/fs/lustre/jobid_name while not in "nodelocal" mode
>>>> will cause all entries in the cache for that jobID to be removed from
>>>> the cache
>>>> 
>>>> Populating the Cache
>>>>            When lustre_get_jobid is called in cached mode, a check
>>>> will first be done in the cache for a valid PID
>>>>to
>>>> JobID mapping.  If none exists, it uses the same mechanisms to get the
>>>> JobID and populates the appropriate PID to JobID map.
>>>> If a lookup is performed and the PID to JobID mapping exists, but is
>>>> more than 30 seconds old, the JobID is refreshed.
>>>> Purging the Cache
>>>>            The cache can be purged of a specific job by writing the
>>>> JobID to the jobid_name proc file.  Any items in the cache that are
>>>>more
>>>> than 300 seconds out of date will also be purged at this time.
>>> 
>>> 
>>> I'd much rather prefer you go to the table that's populated outside of
>>> the kernel
>>> somehow.
>>> Let's be realistic, poking around in userspace process environments for
>>> random
>>> strings is not such a great idea at all even though it did look like a
>>> good idea
>>> in the past for simplicity reasons.
>> 
>> On the upside, there's far less of that going on now, since the results
>> are cached via pid.  I'm unaware of a table that exists in userspace
>>that
>> maps PIDs to Jobs.
>
>there is not.
>
>>> Similar to nodelocal, we probably just switch to a method where you
>>>call a
>>> particular lctl command that would mark the whole session as belonging
>>> to some job. This might take several forms, e.g. nodelocal itself could
>>> be extended to only apply to a current namespace/container
>> 
>> That would make sense, but would require that each job have
>> its own namespace/container.
>
>Only if you run multiple jobs per node at the same time,
>otherwise just do nodelocal for the global root namespace.

Agreed, this is supposed to handle the multiple jobs per node case.

>
>>> But if you do really run different jobs in the global namespace, we
>>> probably can
>>> probably just make the lctl to spawn a shell with commands that all
>>>would
>>> be marked as a particular job? Or we can probably trace the parent of
>>> lctl and
>>> mark that so that all its children become somehow marked too.
>> 
>> One of the things that came up during this is how do you handle a random
>> user who logs into a compute node and runs something like rsync?  The
>>more
>
>The current scheme does not handle it either, unless you use nodelocal and
>then their
>actions would be attributed to the job currently running (not super ideal as
>well),
>I imagine there's a legitimate reason for users to log into the nodes
>running
>unrelated jobs?

The current scheme does handle it, if you use the procname_uid setting.

>
>> conditions we place around getting jobstats to function properly, the
>> harder these types of behaviors are to track down.  One thing I was
>> thinking was that if jobstats is enabled, that the fallback if no JobID
>> can be found is to simply use the procname_uid method, so an admin would
>> see rsync.1234 pop up on your monitoring dashboard.
>
>If you have every job in its own container, then the global namespace
>could
>be set to "unscheduledcommand-$hostname" or some such and every container
>would get its own jobid.

or simply default to the existing procname_uid setting.

>
>This does require containers of course. Or if we set the id based on the
>process group,
>then again they would get that and anything outside would get something
>default helping you.
>


* [lustre-devel] Proposal for JobID caching
  2017-01-19 15:19       ` Ben Evans
@ 2017-01-19 16:28         ` Oleg Drokin
  0 siblings, 0 replies; 14+ messages in thread
From: Oleg Drokin @ 2017-01-19 16:28 UTC (permalink / raw)
  To: lustre-devel


On Jan 19, 2017, at 10:19 AM, Ben Evans wrote:

> 
> 
> On 1/18/17, 5:56 PM, "Oleg Drokin" <oleg.drokin@intel.com> wrote:
> 
>> 
>> On Jan 18, 2017, at 5:35 PM, Ben Evans wrote:
>> 
>>>> But if you do really run different jobs in the global namespace, we
>>>> probably can
>>>> probably just make the lctl to spawn a shell with commands that all
>>>> would
>>>> be marked as a particular job? Or we can probably trace the parent of
>>>> lctl and
>>>> mark that so that all its children become somehow marked too.
>>> 
>>> One of the things that came up during this is how do you handle a random
>>> user who logs into a compute node and runs something like rsync?  The
>>> more
>> 
>> The current scheme does not handle it either, unless you use nodelocal and
>> then their
>> actions would be attributed to the job currently running (not super ideal as
>> well),
>> I imagine there's a legitimate reason for users to log into the nodes
>> running
>> unrelated jobs?
> 
> The current scheme does handle it, if you use the procname_uid setting.

But then that's the only thing it handles; you don't get the actual jobid
this way.
What you are looking for is a fallback if a command is not actually part
of any known job, I thought, and otherwise to use the jobid that was somehow
detected.
Or were you thinking of somehow mapping this on management nodes, then,
from node+pid into known jobs?

>>> conditions we place around getting jobstats to function properly, the
>>> harder these types of behaviors are to track down.  One thing I was
>>> thinking was that if jobstats is enabled, that the fallback if no JobID
>>> can be found is to simply use the procname_uid method, so an admin would
>>> see rsync.1234 pop up on your monitoring dashboard.
>> 
>> If you have every job in its own container, then the global namespace
>> could
>> be set to "unscheduledcommand-$hostname" or some such and every container
>> would get its own jobid.
> 
> or simply default to the existing procname_uid setting.

Yes, that too.


* [lustre-devel] Proposal for JobID caching
  2017-01-18 20:39 ` Oleg Drokin
  2017-01-18 22:35   ` Ben Evans
@ 2017-01-20 21:50   ` Dilger, Andreas
  2017-01-20 22:00     ` Ben Evans
  1 sibling, 1 reply; 14+ messages in thread
From: Dilger, Andreas @ 2017-01-20 21:50 UTC (permalink / raw)
  To: lustre-devel

On Jan 18, 2017, at 13:39, Oleg Drokin <oleg.drokin@intel.com> wrote:
> 
> 
> On Jan 18, 2017, at 3:08 PM, Ben Evans wrote:
> 
>> Overview
>>            The Lustre filesystem added the ability to track I/O performance of a job across a cluster.  The initial algorithm was relatively simplistic:  for every I/O, look up the job ID of the process and include it in the RPC being sent to the server.  This imposed a non-trivial performance impact on client I/O performance.
>>            An additional algorithm was introduced to handle the single job per node case, where instead of looking up the job ID of the process, Lustre simply accesses the value of a variable set through the proc interface.  This improved performance greatly, but only functions when a single job is being run.
>>            A new approach is needed for multiple job per node systems.
>> 
>> Proposed Solution
>>            The proposed solution to this is to create a small PID->JobID table in kernel memory.  When a process performs an IO, a lookup is done in the table for the PID, if a JobID exists for that PID, it is used, otherwise it is retrieved via the same methods as the original Jobstats algorithm.  Once located the JobID is stored in a PID/JobID table in memory. The existing cfs_hash_table structure and functions will be used to implement the table.
>> 
>> Rationale
>>            This reduces the number of calls into userspace, minimizing the time taken on each I/O.  It also easily supports multiple job per node scenarios, and like other proposed solutions has no issue with multiple jobs performing I/O on the same file at the same time.
>> 
>> Requirements
>> *      Performance cannot significantly detract from baseline performance without jobstats
>> *      Supports multiple jobs per node
>> *      Coordination with the scheduler is not required, but interfaces may be provided
>> *      Supports multiple PIDs per job
>> 
>> New Data Structures
>>            pid_to_jobid {
>>                        struct hlist_node pj_hash;
>>                        u64 pj_pid;
>>                        char pj_jobid[LUSTRE_JOBID_SIZE];
>> spinlock_t jp_lock;
>>                        time_t jp_time;
>> }
>> Proc Variables
>> Writing to /proc/fs/lustre/jobid_name while not in "nodelocal" mode will cause all entries in the cache for that jobID to be removed from the cache
>> 
>> Populating the Cache
>>            When lustre_get_jobid is called in cached mode, a check will first be done in the cache for a valid PID to JobID mapping.  If none exists, it uses the same mechanisms to get the JobID and populates the appropriate PID to JobID map.
>> If a lookup is performed and the PID to JobID mapping exists, but is more than 30 seconds old, the JobID is refreshed.
>> Purging the Cache
>>            The cache can be purged of a specific job by writing the JobID to the jobid_name proc file.  Any items in the cache that are more than 300 seconds out of date will also be purged at this time.
> 
> 
> I'd much rather prefer you go to the table that's populated outside of the kernel
> somehow.
> Let's be realistic, poking around in userspace process environments for random
> strings is not such a great idea at all even though it did look like a good idea
> in the past for simplicity reasons.
> Similar to nodelocal, we probably just switch to a method where you call a
> particular lctl command that would mark the whole session as belonging
> to some job. This might take several forms, e.g. nodelocal itself could
> be extended to only apply to a current namespace/container
> But if you do really run different jobs in the global namespace, we probably can
> probably just make the lctl to spawn a shell with commands that all would
> be marked as a particular job? Or we can probably trace the parent of lctl and
> mark that so that all its children become somehow marked too.

Having lctl spawn a shell or requiring everything to run in a container is impractical for users, and will just make it harder to use JobID, IMHO.  The job scheduler is _already_ storing the JobID in the process environment so that it is available to all of the threads running as part of the job.  The question is how the job prolog script can communicate the JobID directly to Lustre without using a global /proc file?  Doing an upcall to userspace per JobID lookup is going to be *worse* for performance than the current searching through the process environment.

I'm not against Ben's proposal to implement a cache in the kernel for different processes.  It is unfortunate that we can't have proper thread-local storage for Lustre, so a hash table is probably reasonable for this (there may be thousands of threads involved).  I don't think the cl_env struct would be useful, since it is not tied to a specific thread (AFAIK), but rather assigned as different threads enter/exit kernel context.  Note that we already have similar time-limited caches for the identity upcall and FMD (lustre/ofd/ofd_fmd.c), so it may be useful to see whether the code can be shared.

Another (not very nice) option to avoid looking through the environment variables (which IMHO isn't so bad, even though the upstream folks don't like it) is to associate the JobID set via /proc with a process group internally and look the PGID up in the kernel to find the JobID.  That can be repeated each time a new JobID is set via /proc, since the PGID would stick around for each new job/shell/process created under the PGID.  It won't be as robust as looking up the JobID in the environment, but probably good enough for most uses.
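
As a sketch of that PGID association (helper names are hypothetical; the
PGID table would be a small kernel-side map like the PID cache in Ben's
proposal):

/* In the /proc write handler: remember the writer's process group. */
pid_t pgid = pid_nr(task_pgrp(current));
jobid_pgid_insert(pgid, jobid);                         /* hypothetical */

/* Later, on each I/O: resolve the caller through its PGID. */
jobid = jobid_pgid_find(pid_nr(task_pgrp(current)));    /* hypothetical */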

I would definitely also be in favor of having some way to fall back to procname_uid if the PGID cannot be found, the job environment variable is not available, and there is nothing in nodelocal.

Cheers, Andreas


* [lustre-devel] Proposal for JobID caching
  2017-01-20 21:50   ` Dilger, Andreas
@ 2017-01-20 22:00     ` Ben Evans
  2017-02-02 15:20       ` Ben Evans
  0 siblings, 1 reply; 14+ messages in thread
From: Ben Evans @ 2017-01-20 22:00 UTC (permalink / raw)
  To: lustre-devel



On 1/20/17, 4:50 PM, "Dilger, Andreas" <andreas.dilger@intel.com> wrote:

>On Jan 18, 2017, at 13:39, Oleg Drokin <oleg.drokin@intel.com> wrote:
>> 
>> 
>> On Jan 18, 2017, at 3:08 PM, Ben Evans wrote:
>> 
>>> Overview
>>>            The Lustre filesystem added the ability to track I/O
>>>performance of a job across a cluster.  The initial algorithm was
>>>relatively simplistic:  for every I/O, look up the job ID of the
>>>process and include it in the RPC being sent to the server.  This
>>>imposed a non-trivial performance impact on client I/O performance.
>>>            An additional algorithm was introduced to handle the single
>>>job per node case, where instead of looking up the job ID of the
>>>process, Lustre simply accesses the value of a variable set through the
>>>proc interface.  This improved performance greatly, but only functions
>>>when a single job is being run.
>>>            A new approach is needed for multiple job per node systems.
>>> 
>>> Proposed Solution
>>>            The proposed solution to this is to create a small
>>>PID->JobID table in kernel memory.  When a process performs an IO, a
>>>lookup is done in the table for the PID, if a JobID exists for that
>>>PID, it is used, otherwise it is retrieved via the same methods as the
>>>original Jobstats algorithm.  Once located the JobID is stored in a
>>>PID/JobID table in memory. The existing cfs_hash_table structure and
>>>functions will be used to implement the table.
>>> 
>>> Rationale
>>>            This reduces the number of calls into userspace, minimizing
>>>the time taken on each I/O.  It also easily supports multiple job per
>>>node scenarios, and like other proposed solutions has no issue with
>>>multiple jobs performing I/O on the same file at the same time.
>>> 
>>> Requirements
>>> *      Performance cannot significantly detract from baseline
>>>performance without jobstats
>>> *      Supports multiple jobs per node
>>> *      Coordination with the scheduler is not required, but interfaces
>>>may be provided
>>> *      Supports multiple PIDs per job
>>> 
>>> New Data Structures
>>>            pid_to_jobid {
>>>                        struct hlist_node pj_hash;
>>>                        u64 pj_pid;
>>>                        char pj_jobid[LUSTRE_JOBID_SIZE];
>>> spinlock_t jp_lock;
>>>                        time_t jp_time;
>>> }
>>> Proc Variables
>>> Writing to /proc/fs/lustre/jobid_name while not in "nodelocal" mode
>>>will cause all entries in the cache for that jobID to be removed from
>>>the cache
>>> 
>>> Populating the Cache
>>>            When lustre_get_jobid is called in cached mode, a check
>>>will first be done in the cache for a valid PID to
>>>JobID mapping.  If none exists, it uses the same mechanisms to get the
>>>JobID and populates the appropriate PID to JobID map.
>>> If a lookup is performed and the PID to JobID mapping exists, but is
>>>more than 30 seconds old, the JobID is refreshed.
>>> Purging the Cache
>>>            The cache can be purged of a specific job by writing the
>>>JobID to the jobid_name proc file.  Any items in the cache that are
>>>more than 300 seconds out of date will also be purged at this time.
>> 
>> 
>> I'd much rather prefer you go to the table that's populated outside of
>>the kernel
>> somehow.
>> Let's be realistic, poking around in userspace process environments for
>>random
>> strings is not such a great idea at all even though it did look like a
>>good idea
>> in the past for simplicity reasons.
>> Similar to nodelocal, we probably just switch to a method where you
>>call a
>> particular lctl command that would mark the whole session as belonging
>> to some job. This might take several forms, e.g. nodelocal itself could
>> be extended to only apply to a current namespace/container
>> But if you do really run different jobs in the global namespace, we
>>probably can
>> probably just make the lctl to spawn a shell with commands that all
>>would
>> be marked as a particular job? Or we can probably trace the parent of
>>lctl and
>> mark that so that all its children become somehow marked too.
>
>Having lctl spawn a shell or requiring everything to run in a container
>is impractical for users, and will just make it harder to use JobID,
>IMHO.  The job scheduler is _already_ storing the JobID in the process
>environment so that it is available to all of the threads running as part
>of the job.  The question is how the job prolog script can communicate
>the JobID directly to Lustre without using a global /proc file?  Doing an
>upcall to userspace per JobID lookup is going to be *worse* for
>performance than the current searching through the process environment.
>
>I'm not against Ben's proposal to implement a cache in the kernel for
>different processes.  It is unfortunate that we can't have proper
>thread-local storage for Lustre, so a hash table is probably reasonable
>for this (there may be thousands of threads involved).  I don't think the
>cl_env struct would be useful, since it is not tied to a specific thread
>(AFAIK), but rather assigned as different threads enter/exit kernel
>context.  Note that we already have similar time-limited caches for the
>identity upcall and FMD (lustre/ofd/ofd_fmd.c), so it may be useful to
>see whether the code can be shared.

I'll take a look at those, but implementing the hash table was a pretty
simple solution.  I need to work out a few kinks with memory leaks before
doing real performance tests on it to make sure it performs similarly to
nodelocal.

>Another (not very nice) option to avoid looking through the environment
>variables (which IMHO isn't so bad, even though the upstream folks don't
>like it) is to associate the JobID set via /proc with a process group
>internally and look the PGID up in the kernel to find the JobID.  That
>can be repeated each time a new JobID is set via /proc, since the PGID
>would stick around for each new job/shell/process created under the PGID.
> It won't be as robust as looking up the JobID in the environment, but
>probably good enough for most uses.
>
>I would definitely also be in favor of having some way to fall back to
>procname_uid if the PGID cannot be found, the job environment variable is
>not available, and there is nothing in nodelocal.

That's simple enough.


* [lustre-devel] Proposal for JobID caching
  2017-01-20 22:00     ` Ben Evans
@ 2017-02-02 15:20       ` Ben Evans
  2017-02-07 23:01         ` Dilger, Andreas
  0 siblings, 1 reply; 14+ messages in thread
From: Ben Evans @ 2017-02-02 15:20 UTC (permalink / raw)
  To: lustre-devel

https://review.whamcloud.com/#/c/25208/ is a working version of what I had
proposed, including the suggested changes to default to procname_uid.
This is not perfect, but the performance is much improved over the current
methods, and unlike inode-based caching, metadata performance isn't
negatively affected.  Multiple simultaneous jobs can be run on the same
file and get appropriate metrics.


-Ben

On 1/20/17, 5:00 PM, "Ben Evans" <bevans@cray.com> wrote:

>
>
>On 1/20/17, 4:50 PM, "Dilger, Andreas" <andreas.dilger@intel.com> wrote:
>
>>On Jan 18, 2017, at 13:39, Oleg Drokin <oleg.drokin@intel.com> wrote:
>>> 
>>> 
>>> On Jan 18, 2017, at 3:08 PM, Ben Evans wrote:
>>> 
>>>> Overview
>>>>            The Lustre filesystem added the ability to track I/O
>>>>performance of a job across a cluster.  The initial algorithm was
>>>>relatively simplistic:  for every I/O, look up the job ID of the
>>>>process and include it in the RPC being sent to the server.  This
>>>>imposed a non-trivial performance impact on client I/O performance.
>>>>            An additional algorithm was introduced to handle the single
>>>>job per node case, where instead of looking up the job ID of the
>>>>process, Lustre simply accesses the value of a variable set through the
>>>>proc interface.  This improved performance greatly, but only functions
>>>>when a single job is being run.
>>>>            A new approach is needed for multiple job per node systems.
>>>> 
>>>> Proposed Solution
>>>>            The proposed solution to this is to create a small
>>>>PID->JobID table in kernel memory.  When a process performs an IO, a
>>>>lookup is done in the table for the PID, if a JobID exists for that
>>>>PID, it is used, otherwise it is retrieved via the same methods as the
>>>>original Jobstats algorithm.  Once located the JobID is stored in a
>>>>PID/JobID table in memory. The existing cfs_hash_table structure and
>>>>functions will be used to implement the table.
>>>> 
>>>> Rationale
>>>>            This reduces the number of calls into userspace, minimizing
>>>>the time taken on each I/O.  It also easily supports multiple job per
>>>>node scenarios, and like other proposed solutions has no issue with
>>>>multiple jobs performing I/O on the same file at the same time.
>>>> 
>>>> Requirements
>>>> *      Performance cannot significantly detract from baseline
>>>>performance without jobstats
>>>> *      Supports multiple jobs per node
>>>> *      Coordination with the scheduler is not required, but interfaces
>>>>may be provided
>>>> *      Supports multiple PIDs per job
>>>> 
>>>> New Data Structures
>>>>            pid_to_jobid {
>>>>                        struct hlist_node pj_hash;
>>>>                        u64 pj_pid;
>>>>                        char pj_jobid[LUSTRE_JOBID_SIZE];
>>>> spinlock_t jp_lock;
>>>>                        time_t jp_time;
>>>> }
>>>> Proc Variables
>>>> Writing to /proc/fs/lustre/jobid_name while not in "nodelocal" mode
>>>>will cause all entries in the cache for that jobID to be removed from
>>>>the cache
>>>> 
>>>> Populating the Cache
>>>>            When lustre_get_jobid is called in cached mode, a check
>>>>will first be done in the cache for a valid PID to
>>>>JobID mapping.  If none exists, it uses the same mechanisms to get the
>>>>JobID and populates the appropriate PID to JobID map.
>>>> If a lookup is performed and the PID to JobID mapping exists, but is
>>>>more than 30 seconds old, the JobID is refreshed.
>>>> Purging the Cache
>>>>            The cache can be purged of a specific job by writing the
>>>>JobID to the jobid_name proc file.  Any items in the cache that are
>>>>more than 300 seconds out of date will also be purged at this time.
>>> 
>>> 
>>> I'd much rather prefer you go to the table that's populated outside of
>>>the kernel
>>> somehow.
>>> Let's be realistic, poking around in userspace process environments for
>>>random
>>> strings is not such a great idea at all even though it did look like a
>>>good idea
>>> in the past for simplicity reasons.
>>> Similar to nodelocal, we probably just switch to a method where you
>>>call a
>>> particular lctl command that would mark the whole session as belonging
>>> to some job. This might take several forms, e.g. nodelocal itself could
>>> be extended to only apply to a current namespace/container
>>> But if you do really run different jobs in the global namespace, we
>>>probably can
>>> probably just make the lctl to spawn a shell with commands that all
>>>would
>>> be marked as a particular job? Or we can probably trace the parent of
>>>lctl and
>>> mark that so that all its children become somehow marked too.
>>
>>Having lctl spawn a shell or requiring everything to run in a container
>>is impractical for users, and will just make it harder to use JobID,
>>IMHO.  The job scheduler is _already_ storing the JobID in the process
>>environment so that it is available to all of the threads running as part
>>of the job.  The question is how the job prolog script can communicate
>>the JobID directly to Lustre without using a global /proc file?  Doing an
>>upcall to userspace per JobID lookup is going to be *worse* for
>>performance than the current searching through the process environment.
>>
>>I'm not against Ben's proposal to implement a cache in the kernel for
>>different processes.  It is unfortunate that we can't have proper
>>thread-local storage for Lustre, so a hash table is probably reasonable
>>for this (there may be thousands of threads involved).  I don't think the
>>cl_env struct would be useful, since it is not tied to a specific thread
>>(AFAIK), but rather assigned as different threads enter/exit kernel
>>context.  Note that we already have similar time-limited caches for the
>>identity upcall and FMD (lustre/ofd/ofd_fmd.c), so it may be useful to
>>see whether the code can be shared.
>
>I'll take a look at those, but implementing the hash table was a pretty
>simple solution, I need to work out a few kinks with memory leaks before
>doing real performance tests on it to make sure it performs similarly to
>nodelocal.
>
>>Another (not very nice) option to avoid looking through the environment
>>variables (which IMHO isn't so bad, even though the upstream folks don't
>>like it) is to associate the JobID set via /proc with a process group
>>internally and look the PGID up in the kernel to find the JobID.  That
>>can be repeated each time a new JobID is set via /proc, since the PGID
>>would stick around for each new job/shell/process created under the PGID.
>> It won't be as robust as looking up the JobID in the environment, but
>>probably good enough for most uses.
>>
>>I would definitely also be in favor of having some way to fall back to
>>procname_uid if the PGID cannot be found, the job environment variable is
>>not available, and there is nothing in nodelocal.
>
>That's simple enough.
>


* [lustre-devel] Proposal for JobID caching
  2017-02-02 15:20       ` Ben Evans
@ 2017-02-07 23:01         ` Dilger, Andreas
  2017-02-16 14:36           ` Ben Evans
  0 siblings, 1 reply; 14+ messages in thread
From: Dilger, Andreas @ 2017-02-07 23:01 UTC (permalink / raw)
  To: lustre-devel

On Feb 2, 2017, at 08:20, Ben Evans <bevans@cray.com> wrote:
> 
> https://review.whamcloud.com/#/c/25208/ is a working version of what I had
> proposed, including the suggested changes to default to procname_uid.
> This is not perfect, but the performance is much improved over the current
> methods, and unlike inode-based caching Metadata performance isn't
> negatively affected.  Multiple simultaneous jobs can be run on the same
> file, and get appropriate metrics.

I reviewed the patch, and one question that I had is whether you've tested
if the JobID is correct when read/write RPCs are generated by readahead or
ptlrpcd?  That may be more relevant once the async readahead threads are
implemented by Dmitry.  With an inode-based JobID cache then the JobID can
(usually) be correctly determined even if the RPC is not generated in the
context of the user process.

I don't think that is necessarily a fault in your patch, but it may be that
the JobID determination hasn't kept pace with other changes in the code.  It
would be great if you would verify (possibly with a test attached to your
patch) that JobID is assigned to all the RPCs that need it.

Cheers, Andreas

> On 1/20/17, 5:00 PM, "Ben Evans" <bevans@cray.com> wrote:
> 
>> 
>> 
>> On 1/20/17, 4:50 PM, "Dilger, Andreas" <andreas.dilger@intel.com> wrote:
>> 
>>> On Jan 18, 2017, at 13:39, Oleg Drokin <oleg.drokin@intel.com> wrote:
>>>> 
>>>> 
>>>> On Jan 18, 2017, at 3:08 PM, Ben Evans wrote:
>>>> 
>>>>> Overview
>>>>>           The Lustre filesystem added the ability to track I/O
>>>>> performance of a job across a cluster.  The initial algorithm was
>>>>> relatively simplistic:  for every I/O, look up the job ID of the
>>>>> process and include it in the RPC being sent to the server.  This
>>>>> imposed a non-trivial performance impact on client I/O performance.
>>>>>           An additional algorithm was introduced to handle the single
>>>>> job per node case, where instead of looking up the job ID of the
>>>>> process, Lustre simply accesses the value of a variable set through the
>>>>> proc interface.  This improved performance greatly, but only functions
>>>>> when a single job is being run.
>>>>>           A new approach is needed for multiple job per node systems.
>>>>> 
>>>>> Proposed Solution
>>>>>           The proposed solution to this is to create a small
>>>>> PID->JobID table in kernel memory.  When a process performs an IO, a
>>>>> lookup is done in the table for the PID, if a JobID exists for that
>>>>> PID, it is used, otherwise it is retrieved via the same methods as the
>>>>> original Jobstats algorithm.  Once located the JobID is stored in a
>>>>> PID/JobID table in memory. The existing cfs_hash_table structure and
>>>>> functions will be used to implement the table.
>>>>> 
>>>>> Rationale
>>>>>           This reduces the number of calls into userspace, minimizing
>>>>> the time taken on each I/O.  It also easily supports multiple job per
>>>>> node scenarios, and like other proposed solutions has no issue with
>>>>> multiple jobs performing I/O on the same file at the same time.
>>>>> 
>>>>> Requirements
>>>>> *      Performance cannot significantly detract from baseline
>>>>> performance without jobstats
>>>>> *      Supports multiple jobs per node
>>>>> *      Coordination with the scheduler is not required, but interfaces
>>>>> may be provided
>>>>> *      Supports multiple PIDs per job
>>>>> 
>>>>> New Data Structures
>>>>>           pid_to_jobid {
>>>>>                       struct hlist_node pj_hash;
>>>>>                       u64 pj_pid;
>>>>>                       char pj_jobid[LUSTRE_JOBID_SIZE];
>>>>> spinlock_t jp_lock;
>>>>>                       time_t jp_time;
>>>>> }
>>>>> Proc Variables
>>>>> Writing to /proc/fs/lustre/jobid_name while not in "nodelocal" mode
>>>>> will cause all entries in the cache for that jobID to be removed from
>>>>> the cache
>>>>> 
>>>>> Populating the Cache
>>>>>           When lustre_get_jobid is called in cached mode, a check
>>>>> will first be done in the cache for a valid PID to
>>>>> JobID mapping.  If none exists, it uses the same mechanisms to get the
>>>>> JobID and populates the appropriate PID to JobID map.
>>>>> If a lookup is performed and the PID to JobID mapping exists, but is
>>>>> more than 30 seconds old, the JobID is refreshed.
>>>>> Purging the Cache
>>>>>           The cache can be purged of a specific job by writing the
>>>>> JobID to the jobid_name proc file.  Any items in the cache that are
>>>>> more than 300 seconds out of date will also be purged at this time.
>>>> 
>>>> 
>>>> I'd much rather prefer you go to the table that's populated outside of
>>>> the kernel
>>>> somehow.
>>>> Let's be realistic, poking around in userspace process environments for
>>>> random
>>>> strings is not such a great idea at all even though it did look like a
>>>> good idea
>>>> in the past for simplicity reasons.
>>>> Similar to nodelocal, we probably just switch to a method where you
>>>> call a
>>>> particular lctl command that would mark the whole session as belonging
>>>> to some job. This might take several forms, e.g. nodelocal itself could
>>>> be extended to only apply to a current namespace/container
>>>> But if you do really run different jobs in the global namespace, we
>>>> probably can
>>>> probably just make the lctl to spawn a shell with commands that all
>>>> would
>>>> be marked as a particular job? Or we can probably trace the parent of
>>>> lctl and
>>>> mark that so that all its children become somehow marked too.
>>> 
>>> Having lctl spawn a shell or requiring everything to run in a container
>>> is impractical for users, and will just make it harder to use JobID,
>>> IMHO.  The job scheduler is _already_ storing the JobID in the process
>>> environment so that it is available to all of the threads running as part
>>> of the job.  The question is how the job prolog script can communicate
>>> the JobID directly to Lustre without using a global /proc file?  Doing an
>>> upcall to userspace per JobID lookup is going to be *worse* for
>>> performance than the current searching through the process environment.
>>> 
>>> I'm not against Ben's proposal to implement a cache in the kernel for
>>> different processes.  It is unfortunate that we can't have proper
>>> thread-local storage for Lustre, so a hash table is probably reasonable
>>> for this (there may be thousands of threads involved).  I don't think the
>>> cl_env struct would be useful, since it is not tied to a specific thread
>>> (AFAIK), but rather assigned as different threads enter/exit kernel
>>> context.  Note that we already have similar time-limited caches for the
>>> identity upcall and FMD (lustre/ofd/ofd_fmd.c), so it may be useful to
>>> see whether the code can be shared.
>> 
>> I'll take a look at those, but implementing the hash table was a pretty
>> simple solution, I need to work out a few kinks with memory leaks before
>> doing real performance tests on it to make sure it performs similarly to
>> nodelocal.
>> 
>>> Another (not very nice) option to avoid looking through the environment
>>> variables (which IMHO isn't so bad, even though the upstream folks don't
>>> like it) is to associate the JobID set via /proc with a process group
>>> internally and look the PGID up in the kernel to find the JobID.  That
>>> can be repeated each time a new JobID is set via /proc, since the PGID
>>> would stick around for each new job/shell/process created under the PGID.
>>> It won't be as robust as looking up the JobID in the environment, but
>>> probably good enough for most uses.
>>> 
>>> I would definitely also be in favor of having some way to fall back to
>>> procname_uid if the PGID cannot be found, the job environment variable is
>>> not available, and there is nothing in nodelocal.
>> 
>> That's simple enough.
>> 
> 

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation


* [lustre-devel] Proposal for JobID caching
  2017-02-07 23:01         ` Dilger, Andreas
@ 2017-02-16 14:36           ` Ben Evans
  2017-02-16 22:30             ` Dilger, Andreas
  0 siblings, 1 reply; 14+ messages in thread
From: Ben Evans @ 2017-02-16 14:36 UTC (permalink / raw)
  To: lustre-devel



On 2/7/17, 6:01 PM, "Dilger, Andreas" <andreas.dilger@intel.com> wrote:

>On Feb 2, 2017, at 08:20, Ben Evans <bevans@cray.com> wrote:
>> 
>> https://review.whamcloud.com/#/c/25208/ is a working version of what I
>> had proposed, including the suggested changes to default to procname_uid.
>> This is not perfect, but the performance is much improved over the
>> current methods, and unlike inode-based caching, metadata performance
>> isn't negatively affected.  Multiple simultaneous jobs can be run on the
>> same file, and get appropriate metrics.
>
>I reviewed the patch, and one question that I had is whether you've tested
>if the JobID is correct when read/write RPCs are generated by readahead or
>ptlrpcd?  That may be more relevant once the async readahead threads are
>implemented by Dmitry.  With an inode-based JobID cache then the JobID can
>(usually) be correctly determined even if the RPC is not generated in the
>context of the user process.
>
>I don't think that is necessarily a fault in your patch, but it may be
>that the JobID determination hasn't kept pace with other changes in the
>code.  It would be great if you would verify (possibly with a test
>attached to your patch) that JobID is assigned to all the RPCs that need
>it.

I've seen some Lustre thread names pop into the JobID under the
procname_uid scheme when doing something like a dd test.  Filtering them
out would be relatively straightforward (a sketch follows below), and
keeping the old JobID (if available) in the lookup table would be the way
to get the most reliable info.  There shouldn't be a difference with the
current behavior in this regard.
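
A rough idea of the filtering, as a sketch only -- the helper name and the
thread-name list here are made up for illustration, not existing Lustre
API:

        /* sketch: treat JobIDs that are really Lustre kthread names as
         * "reserved" so a cached value can be used instead */
        static bool jobid_is_reserved(const char *jobid)
        {
                static const char *reserved[] = {
                        "ptlrpcd", "ldlm", "ll_sa",     /* illustrative */
                };
                size_t i;

                for (i = 0; i < ARRAY_SIZE(reserved); i++)
                        if (strncmp(jobid, reserved[i],
                                    strlen(reserved[i])) == 0)
                                return true;
                return false;
        }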

My issue with putting the information in the inode stems from two cases.
The first is RobinHood, which stats *everything*.  In the proposed
solution, one lookup would be done every 30 seconds; storing the JobID in
the inode, a lookup would happen for every stat and the result would never
be used again.

The other case is less probable, but still out there: in an environment
with multiple jobs per node, you may be running two different jobs on the
same input set, which would corrupt the counting.

-Ben


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [lustre-devel] Proposal for JobID caching
  2017-02-16 14:36           ` Ben Evans
@ 2017-02-16 22:30             ` Dilger, Andreas
  2017-02-28 16:23               ` Ben Evans
  0 siblings, 1 reply; 14+ messages in thread
From: Dilger, Andreas @ 2017-02-16 22:30 UTC (permalink / raw)
  To: lustre-devel

On Feb 16, 2017, at 07:36, Ben Evans <bevans@cray.com> wrote:
> 
> On 2/7/17, 6:01 PM, "Dilger, Andreas" <andreas.dilger@intel.com> wrote:
> 
>> [...]
> 
> I've seen some Lustre thread names pop into the JobID under the
> procname_uid scheme when doing something like a dd test.  Filtering them
> out would be relatively straightforward, and keeping the old JobID (if
> available) in the lookup table would be the way to get the most reliable
> info.  There shouldn't be a difference with the current behavior in this
> regard.
> 
> My issue with putting the information in the inode stems from two cases.
> The first is RobinHood, which stats *everything*.  In the proposed
> solution, one lookup would be done every 30 seconds; storing the JobID in
> the inode, a lookup would happen for every stat and the result would
> never be used again.
> 
> The other case is less probable, but still out there: in an environment
> with multiple jobs per node, you may be running two different jobs on the
> same input set, which would corrupt the counting.

If there are two jobs using the same input files, I suspect the second one
would get the data from the client cache, and not log anything on the server
at all.  In any case, I don't think that would be any different from the two
jobs randomly interleaving their access to the same files on the server.

Conversely, having "ptlrpcd/0" appear in the jobstats doesn't really help
anyone figure out which user/job is causing IO traffic on the server.  If
RPCs generated by ptlrpcd, statahead, and other service threads that do work
on behalf of user processes (including readahead in the near future) have the
proper JobID then that would be much more useful.

Some suggestions on how to handle this, off the top of my head:
- blacklist service thread PIDs at startup in the JobID hash and have them
  get the JobID by some other method (e.g. inode, DLM lock/resource, other)
- store the JobID explicitly with the IO request when it is being put into
  a cache/queue and use this when submitting the RPC if present, otherwise get
  it from the hash

The latter may be preferable, since it doesn't need to do anything for sync
RPCs generated in process context, and avoids an extra lookup when processing
the RPC; a rough sketch follows below.  You might consider the first method
for debugging when/where such RPCs are generated, and have the blacklisted
threads dump a stack once if they are being looked up in the JobID hash.
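
A minimal sketch of the second option, assuming the client keeps some
per-request state when the IO is queued.  The struct and helper names here
are made up for illustration; lustre_get_jobid() is the existing lookup:

        /* capture the JobID in process context at queue time, so ptlrpcd
         * does not have to resolve it later */
        struct queued_io {
                struct list_head        qi_list;
                char                    qi_jobid[LUSTRE_JOBID_SIZE];
                /* ... the rest of the cached IO state ... */
        };

        static void queued_io_prep(struct queued_io *qio)
        {
                /* runs in the user process, so the PID->JobID hash works */
                lustre_get_jobid(qio->qi_jobid);
        }

        static void rpc_fill_jobid(struct queued_io *qio, char *jobid)
        {
                if (qio->qi_jobid[0] != '\0')
                        /* use the JobID captured at queue time */
                        strlcpy(jobid, qio->qi_jobid, LUSTRE_JOBID_SIZE);
                else
                        /* sync RPC in process context: hash lookup works */
                        lustre_get_jobid(jobid);
        }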

Cheers, Andreas

--
Andreas Dilger
Lustre Principal Architect
Intel Corporation

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [lustre-devel] Proposal for JobID caching
  2017-02-16 22:30             ` Dilger, Andreas
@ 2017-02-28 16:23               ` Ben Evans
  2017-02-28 21:17                 ` Dilger, Andreas
  0 siblings, 1 reply; 14+ messages in thread
From: Ben Evans @ 2017-02-28 16:23 UTC (permalink / raw)
  To: lustre-devel



On 2/16/17, 5:30 PM, "Dilger, Andreas" <andreas.dilger@intel.com> wrote:

>On Feb 16, 2017, at 07:36, Ben Evans <bevans@cray.com> wrote:
>
>> [...]
>
>If there are two jobs using the same input files, I suspect the second one
>would get the data from the client cache, and not log anything on the
>server at all.  In any case, I don't think that would be any different
>from the two jobs randomly interleaving their access to the same files on
>the server.
>
>Conversely, having "ptlrpcd/0" appear in the jobstats doesn't really help
>anyone figure out which user/job is causing IO traffic on the server.  If
>RPCs generated by ptlrpcd, statahead, and other service threads that do
>work on behalf of user processes (including readahead in the near future)
>have the proper JobID then that would be much more useful.
>
>Some suggestions on how to handle this, off the top of my head:
>- blacklist service thread PIDs at startup in the JobID hash and have them
>  get the JobID by some other method (e.g. inode, DLM lock/resource, other)
>- store the JobID explicitly with the IO request when it is being put into
>  a cache/queue and use this when submitting the RPC if present, otherwise
>  get it from the hash
>
>The latter may be preferable, since it doesn't need to do anything for
>sync RPCs generated in process context, and avoids an extra lookup when
>processing the RPC.  You might consider the first method for debugging
>when/where such RPCs are generated, and have the blacklisted threads dump
>a stack once if they are being looked up in the JobID hash.
>
>Cheers, Andreas

I'm thinking a combination of approaches: use the hash as the primary
source, but populate the inode with the data as well, and use it when one
of the "reserved" names pops up as the JobID.

For any file access, the open would trigger a JobID lookup, which would
put the correct info into the hash, and then into the inode.  As the JobID
is updated, the inode's store would also be updated.

For a lookup, if the table returns ptlrpcd or any of the other Lustre
threads, then the inode cache would be used; a rough sketch of that
fallback is below.

This way, we're doing as few userspace lookups as possible, fixing the
readahead hole that currently exists, and not having an issue with
processes like find or RobinHood that touch a lot of files.
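
Roughly the lookup order I have in mind, as a sketch only.  The helper
names are made up: jobid_is_reserved() would be the thread-name filter
mentioned earlier in the thread, and jobid_from_inode() the hypothetical
inode-side cache:

        /* resolve the JobID for an RPC: PID hash first, inode fallback */
        static int jobid_for_rpc(struct inode *inode, char *jobid)
        {
                int rc;

                rc = lustre_get_jobid(jobid);
                if (rc == 0 && !jobid_is_reserved(jobid))
                        return 0;

                /* running in ptlrpcd/readahead/etc. context: use the JobID
                 * stashed in the inode at open/IO time instead */
                return jobid_from_inode(inode, jobid);
        }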

-Ben

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [lustre-devel] Proposal for JobID caching
  2017-02-28 16:23               ` Ben Evans
@ 2017-02-28 21:17                 ` Dilger, Andreas
  0 siblings, 0 replies; 14+ messages in thread
From: Dilger, Andreas @ 2017-02-28 21:17 UTC (permalink / raw)
  To: lustre-devel

On Feb 28, 2017, at 09:23, Ben Evans <bevans@cray.com> wrote:
> 
> On 2/16/17, 5:30 PM, "Dilger, Andreas" <andreas.dilger@intel.com> wrote:
> 
>> [...]
> 
> I'm thinking a combination of approaches: use the hash as the primary
> source, but populate the inode with the data as well, and use it when one
> of the "reserved" names pops up as the JobID.
> 
> For any file access, the open would trigger a JobID lookup, which would
> put the correct info into the hash, and then into the inode.  As the
> JobID is updated, the inode's store would also be updated.
> 
> For a lookup, if the table returns ptlrpcd or any of the other Lustre
> threads, then the inode cache would be used.
> 
> This way, we're doing as few userspace lookups as possible, fixing the
> readahead hole that currently exists, and not having an issue with
> processes like find or RobinHood that touch a lot of files.

Yes, this sounds the same as what I was thinking.  It should be possible to
"blacklist" the client threads (ptlrpcd, statahead, ll_ping, wherever we use
kthread_run() on the client).
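
For example (sketch only; jobid_blacklist_add() is a made-up helper, not
existing code), each client kthread could register itself at startup:

        static int client_service_thread(void *arg)
        {
                /* hypothetical: exempt this PID from JobID hash lookups */
                jobid_blacklist_add(current->pid);

                while (!kthread_should_stop())
                        /* service loop: any RPC issued here gets its JobID
                         * from the request or the inode, not this PID */
                        schedule_timeout_interruptible(HZ);

                return 0;
        }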

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2017-02-28 21:17 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-18 20:08 [lustre-devel] Proposal for JobID caching Ben Evans
2017-01-18 20:39 ` Oleg Drokin
2017-01-18 22:35   ` Ben Evans
2017-01-18 22:56     ` Oleg Drokin
2017-01-19 15:19       ` Ben Evans
2017-01-19 16:28         ` Oleg Drokin
2017-01-20 21:50   ` Dilger, Andreas
2017-01-20 22:00     ` Ben Evans
2017-02-02 15:20       ` Ben Evans
2017-02-07 23:01         ` Dilger, Andreas
2017-02-16 14:36           ` Ben Evans
2017-02-16 22:30             ` Dilger, Andreas
2017-02-28 16:23               ` Ben Evans
2017-02-28 21:17                 ` Dilger, Andreas
