Linux-mm Archive on lore.kernel.org
From: Daniel Jordan <daniel.m.jordan@oracle.com>
To: Michal Hocko <mhocko@kernel.org>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	aaron.lu@intel.com, akpm@linux-foundation.org,
	dave.hansen@linux.intel.com, mgorman@techsingularity.net,
	mike.kravetz@oracle.com, pasha.tatashin@oracle.com,
	steven.sistare@oracle.com, tim.c.chen@intel.com
Subject: Re: [RFC PATCH v3 1/7] ktask: add documentation
Date: Wed, 6 Dec 2017 15:32:48 -0500
Message-ID: <d8323ee9-eb99-7f55-50c6-c71f4986cf06@oracle.com> (raw)
In-Reply-To: <20171206143509.GG7515@dhcp22.suse.cz>

On 12/06/2017 09:35 AM, Michal Hocko wrote:
> Please note that I haven't checked any code in this patch series. I've
> just started here to see how the thing is supposed to work and what is
> the overall design

Thanks for taking a look, Michal.

> 
> On Tue 05-12-17 14:52:14, Daniel Jordan wrote:
> [...]
>> +Resource Limits and Auto-Tuning
>> +===============================
>> +
>> +ktask has resource limits on the number of workqueue items it queues.  In
>> +ktask, a workqueue item is a thread that runs chunks of the task until the task
>> +is finished.
>> +
>> +These limits support the different ways ktask uses workqueues:
>> + - ktask_run to run threads on the calling thread's node.
>> + - ktask_run_numa to run threads on the node(s) specified.
>> + - ktask_run_numa with nid=NUMA_NO_NODE to run threads on any node in the
>> +   system.
>> +
>> +To support these different ways of queueing work while maintaining an efficient
>> +concurrency level, we need both system-wide and per-node limits on the number
>> +of threads.  Without per-node limits, a node might become oversubscribed
>> +despite ktask staying within the system-wide limit, and without a system-wide
>> +limit, we can't properly account for work that can run on any node.
>> +
>> +The system-wide limit is based on the total number of CPUs, and the per-node
>> +limit on the CPU count for each node.  A per-node work item counts against the
>> +system-wide limit.  Workqueue's max_active can't accommodate both types of
>> +limit, no matter how many workqueues are used, so ktask implements its own.
>> +
>> +If a per-node limit is reached, the work item is allowed to run anywhere on the
>> +machine to avoid overwhelming the node.  If the global limit is also reached,
>> +ktask won't queue additional work items until we fall below the limit again.
>> +
>> +These limits apply only to workqueue items--that is, additional threads beyond
>> +the one starting the task.  That way, one thread per task is always allowed to
>> +run.
>> +
>> +Within the resource limits, ktask uses a default maximum number of threads per
>> +task to avoid disturbing other processes on the system.  Callers can change the
>> +limit with ktask_ctl_set_max_threads.  For example, this might be used to raise
>> +the maximum number of threads for a boot-time initialization task when more
>> +CPUs than usual are idle.
> 
> The last time something like this came up (maybe even this specific
> approach - I do not remember), the main objection was the auto-tuning.
> Unless I've missed anything here, all the tuning is based on counters
> rather than the _current_ system utilization.

That's right, as it's written now, it's just counters.
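For concreteness, the counting scheme the documentation above describes 
amounts to roughly the following.  This is a userspace model with 
made-up names and numbers (ktask_charge, the limit values), not the 
actual ktask internals:

```c
/*
 * Userspace model of the counter-based limits described in the doc
 * above.  Names and numbers are illustrative, not the real ktask code.
 */
#include <assert.h>
#include <stdbool.h>

#define NR_NODES	2

static int system_limit = 8;			/* from total CPU count */
static int node_limit[NR_NODES] = { 4, 4 };	/* from per-node CPU count */

static int system_active;
static int node_active[NR_NODES];

/*
 * Try to charge one more helper thread against the limits.  A per-node
 * thread (nid >= 0) also counts against the system-wide limit, and
 * nid < 0 models NUMA_NO_NODE.  If the node's limit is reached but the
 * system-wide one isn't, the thread is still allowed but runs unbound
 * (anywhere on the machine).  Returns false once the system-wide limit
 * is hit, i.e. don't queue more work items.
 */
static bool ktask_charge(int nid, bool *unbound)
{
	if (system_active >= system_limit)
		return false;

	*unbound = (nid < 0 || node_active[nid] >= node_limit[nid]);
	if (!*unbound)
		node_active[nid]++;
	system_active++;
	return true;
}
```

The thread that starts the task is never charged here, matching the doc: 
one thread per task always runs.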

> There is also no mention about other
> characteristics (e.g. power management), resource isolation etc.  So
> let me ask again.  How do you control that the parallelized operation
> doesn't run outside of the limit imposed to the calling context?

The current code doesn't do this, and the answer is the same for the 
rest of your questions.

For resource isolation, I'll experiment with moving ktask threads into 
and out of the cgroup of the calling thread.

Do any resources not covered by cgroup come to mind?  I'm trying to 
think if I've left anything out.

> How
> do you control whether a larger number of workers should be fired when
> the system is idle but we want to keep many cpus idle due to power
> constraints?

For power management, I'm going to look into how ktask can use the 
current cpufreq settings and the scheduler hooks called by cpufreq.

We could make decisions about starting additional threads (if any) based 
on the CPU frequency range or policy.
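As a strawman, such a decision could scale the helper count by frequency 
headroom.  This is a userspace toy; in the kernel the cur_khz/max_khz 
inputs would come from cpufreq, and the policy itself is a placeholder, 
not a proposal for the real heuristic:

```c
#include <assert.h>

/*
 * Toy policy: the lower the CPUs are clocked relative to their max,
 * the more idle headroom we assume, so the more helper threads we
 * allow.  Purely illustrative.
 */
static int scale_threads(int max_threads, unsigned int cur_khz,
			 unsigned int max_khz)
{
	int nthreads;

	if (max_khz == 0)
		return 1;

	nthreads = (int)((long long)max_threads *
			 (max_khz - cur_khz) / max_khz);
	if (nthreads < 1)
		nthreads = 1;	/* the starting thread always runs */
	return nthreads;
}
```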

> How do you control how many workers are fired based on
> cpu utilization?  Do you talk to the scheduler to see overall/per-node
> utilization?

We'd have to go off of past and present scheduler data to predict the 
future.  Even the best heuristic might get it wrong, but heuristics 
could be better than nothing.  I'll look into what data the scheduler 
exports.
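For instance, even something as coarse as the load average gives a rough 
signal.  A userspace sketch using getloadavg(3) - the in-kernel 
equivalent would read the scheduler's own counters, and pick_nthreads is 
a made-up name:

```c
#include <stdlib.h>
#include <unistd.h>

/*
 * Rough utilization-based sizing: one helper per idle CPU, as
 * estimated from the 1-minute load average.  Purely illustrative;
 * load average is a lagging indicator, which is exactly the
 * "predicting the future" problem above.
 */
static int pick_nthreads(int max_threads)
{
	double load;
	long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
	int idle;

	if (ncpus < 1 || getloadavg(&load, 1) != 1)
		return 1;	/* no data: fall back to a single thread */

	idle = (int)(ncpus - load);
	if (idle < 1)
		idle = 1;	/* the starting thread always runs */
	return idle < max_threads ? idle : max_threads;
}
```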


Anyway, I think scalability bottlenecks should be weighed along with the 
rest of this.  It seems wrong for the kernel to always assume that one 
thread is enough to free all of a process's memory or evict all the 
pages of a file system, no matter how much work there is to do.

Daniel

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

Thread overview: 17+ messages
2017-12-05 19:52 [RFC PATCH v3 0/7] ktask: multithread CPU-intensive kernel work Daniel Jordan
2017-12-05 19:52 ` [RFC PATCH v3 1/7] ktask: add documentation Daniel Jordan
2017-12-05 20:59   ` Daniel Jordan
2017-12-06 14:35   ` Michal Hocko
2017-12-06 20:32     ` Daniel Jordan [this message]
2017-12-08 12:43       ` Michal Hocko
2017-12-08 13:46         ` Daniel Jordan
2017-12-05 19:52 ` [RFC PATCH v3 2/7] ktask: multithread CPU-intensive kernel work Daniel Jordan
2017-12-05 22:21   ` Andrew Morton
2017-12-06 14:21     ` Daniel Jordan
2017-12-05 19:52 ` [RFC PATCH v3 3/7] ktask: add /proc/sys/debug/ktask_max_threads Daniel Jordan
2017-12-05 19:52 ` [RFC PATCH v3 4/7] mm: enlarge type of offset argument in mem_map_offset and mem_map_next Daniel Jordan
2017-12-05 19:52 ` [RFC PATCH v3 5/7] mm: parallelize clear_gigantic_page Daniel Jordan
2017-12-05 19:52 ` [RFC PATCH v3 6/7] hugetlbfs: parallelize hugetlbfs_fallocate with ktask Daniel Jordan
2017-12-05 19:52 ` [RFC PATCH v3 7/7] mm: parallelize deferred struct page initialization within each node Daniel Jordan
2017-12-05 22:23 ` [RFC PATCH v3 0/7] ktask: multithread CPU-intensive kernel work Andrew Morton
2017-12-06 14:21   ` Daniel Jordan
