From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35])
	by kanga.kvack.org (Postfix) with ESMTP id BA863900087
	for ; Thu, 14 Apr 2011 13:38:14 -0400 (EDT)
Received: from kpbe16.cbf.corp.google.com (kpbe16.cbf.corp.google.com [172.25.105.80])
	by smtp-out.google.com with ESMTP id p3EHc9LG000628
	for ; Thu, 14 Apr 2011 10:38:12 -0700
Received: from gxk8 (gxk8.prod.google.com [10.202.11.8])
	by kpbe16.cbf.corp.google.com with ESMTP id p3EHWvcR021679
	(version=TLSv1/SSLv3 cipher=RC4-SHA bits=128 verify=NOT)
	for ; Thu, 14 Apr 2011 10:38:08 -0700
Received: by gxk8 with SMTP id 8so2241062gxk.37
	for ; Thu, 14 Apr 2011 10:38:08 -0700 (PDT)
MIME-Version: 1.0
In-Reply-To: <20110414091435.0fc6f74c.kamezawa.hiroyu@jp.fujitsu.com>
References: <1302678187-24154-1-git-send-email-yinghan@google.com>
	<20110413164747.0d4076d1.kamezawa.hiroyu@jp.fujitsu.com>
	<20110414091435.0fc6f74c.kamezawa.hiroyu@jp.fujitsu.com>
Date: Thu, 14 Apr 2011 10:38:07 -0700
Message-ID:
Subject: Re: [PATCH V3 0/7] memcg: per cgroup background reclaim
From: Ying Han
Content-Type: multipart/alternative; boundary=000e0cd37a7e66f9fc04a0e46305
Sender: owner-linux-mm@kvack.org
List-ID:
To: KAMEZAWA Hiroyuki
Cc: Pavel Emelyanov, Balbir Singh, Daisuke Nishimura, Li Zefan,
	Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
	Hugh Dickins, KOSAKI Motohiro, Tejun Heo, Michal Hocko,
	Andrew Morton, Dave Hansen, linux-mm@kvack.org

--000e0cd37a7e66f9fc04a0e46305
Content-Type: text/plain; charset=ISO-8859-1

On Wed, Apr 13, 2011 at 5:14 PM, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Wed, 13 Apr 2011 10:53:19 -0700
> Ying Han <yinghan@google.com> wrote:
>
> > On Wed, Apr 13, 2011 at 12:47 AM, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >
> > > On Wed, 13 Apr 2011 00:03:00 -0700
> > > Ying Han <yinghan@google.com> wrote:
> > >
> > > > The current implementation of memcg supports targeting reclaim when the
> > > > cgroup is reaching its hard_limit, and we do direct reclaim per cgroup.
> > > > Per cgroup background reclaim is needed, which helps to spread out memory
> > > > pressure over a longer period of time and smooths out the cgroup
> > > > performance.
> > > >
> > > > If the cgroup is configured to use per cgroup background reclaim, a kswapd
> > > > thread is created which only scans the per-memcg LRU list. Two watermarks
> > > > ("high_wmark", "low_wmark") are added to trigger the background reclaim and
> > > > stop it. The watermarks are calculated based on the cgroup's
> > > > limit_in_bytes.
> > > >
> > > > I ran a dd test on a large file and then cat the file. Then I compared
> > > > the reclaim-related stats in memory.stat.
> > > >
> > > > Step 1: Create a cgroup with 500M memory_limit.
> > > > $ mkdir /dev/cgroup/memory/A
> > > > $ echo 500m >/dev/cgroup/memory/A/memory.limit_in_bytes
> > > > $ echo $$ >/dev/cgroup/memory/A/tasks
> > > >
> > > > Step 2: Test and set the wmarks.
> > > > $ cat /dev/cgroup/memory/A/memory.wmark_ratio
> > > > 0
> > > >
> > > > $ cat /dev/cgroup/memory/A/memory.reclaim_wmarks
> > > > low_wmark 524288000
> > > > high_wmark 524288000
> > > >
> > > > $ echo 90 >/dev/cgroup/memory/A/memory.wmark_ratio
> > > >
> > > > $ cat /dev/cgroup/memory/A/memory.reclaim_wmarks
> > > > low_wmark 471859200
> > > > high_wmark 470016000
> > > >
> > > > $ ps -ef | grep memcg
> > > > root     18126     2  0 22:43 ?        00:00:00 [memcg_3]
> > > > root     18129  7999  0 22:44 pts/1    00:00:00 grep memcg
> > > >
> > > > Step 3: Dirty the pages by creating a 20g file on hard drive.
> > > > $ ddtest -D /export/hdc3/dd -b 1024 -n 20971520 -t 1
> > > >
> > > > Here is the memory.stat with vs. without the per-memcg reclaim. Previously
> > > > all the pages were reclaimed by direct reclaim; now some of the pages are
> > > > also reclaimed in the background.
> > > >
> > > > Only direct reclaim            With background reclaim:
> > > >
> > > > pgpgin 5248668                 pgpgin 5248347
> > > > pgpgout 5120678                pgpgout 5133505
> > > > kswapd_steal 0                 kswapd_steal 1476614
> > > > pg_pgsteal 5120578             pg_pgsteal 3656868
> > > > kswapd_pgscan 0                kswapd_pgscan 3137098
> > > > pg_scan 10861956               pg_scan 6848006
> > > > pgrefill 271174                pgrefill 290441
> > > > pgoutrun 0                     pgoutrun 18047
> > > > allocstall 131689              allocstall 100179
> > > >
> > > > real    7m42.702s              real 7m42.323s
> > > > user    0m0.763s               user 0m0.748s
> > > > sys     0m58.785s              sys  0m52.123s
> > > >
> > > > throughput is 44.33 MB/sec     throughput is 44.23 MB/sec
> > > >
> > > > Step 4: Cleanup
> > > > $ echo $$ >/dev/cgroup/memory/tasks
> > > > $ echo 1 > /dev/cgroup/memory/A/memory.force_empty
> > > > $ rmdir /dev/cgroup/memory/A
> > > > $ echo 3 >/proc/sys/vm/drop_caches
> > > >
> > > > Step 5: Create the same cgroup and read the 20g file into pagecache.
> > > > $ cat /export/hdc3/dd/tf0 > /dev/zero
> > > >
> > > > With the per cgroup reclaim, all the pages are reclaimed in the background
> > > > instead of by direct reclaim.
> > > >
> > > > Only direct reclaim            With background reclaim:
> > > > pgpgin 5248668                 pgpgin 5248114
> > > > pgpgout 5120678                pgpgout 5133480
> > > > kswapd_steal 0                 kswapd_steal 5133397
> > > > pg_pgsteal 5120578             pg_pgsteal 0
> > > > kswapd_pgscan 0                kswapd_pgscan 5133400
> > > > pg_scan 10861956               pg_scan 0
> > > > pgrefill 271174                pgrefill 0
> > > > pgoutrun 0                     pgoutrun 40535
> > > > allocstall 131689              allocstall 0
> > > >
> > > > real    7m42.702s              real 6m20.439s
> > > > user    0m0.763s               user 0m0.169s
> > > > sys     0m58.785s              sys  0m26.574s
> > > >
> > > > Note:
> > > > This is the first effort at extending the target reclaim into memcg. Here
> > > > are the existing known issues and our plan:
> > > >
> > > > 1. There is one kswapd thread per cgroup. The thread is created when the
> > > > cgroup changes its limit_in_bytes and is deleted when the cgroup is being
> > > > removed. In some environments where thousands of cgroups are configured on
> > > > a single host, we will have thousands of kswapd threads. The memory
> > > > consumption would be 8k*1000 = 8M. We don't see a big issue for now if the
> > > > host can host that many cgroups.
> > > >
> > >
> > > What's bad with using workqueue ?
> > >
> > > Pros.
> > >  - we don't have to keep our own thread pool.
> > >  - we don't have to see 'ps -elf' is filled by kswapd...
> > > Cons.
> > >  - because threads are shared, we can't put kthread to cpu cgroup.
> > >
> >
> > I did some study on workqueue after posting V2. There was a comment suggesting
> > workqueue instead of a per-memcg kswapd thread, since it will cut the number
> > of kernel threads being created on a host with lots of cgroups. Each kernel
> > thread allocates about 8K of stack, so 8M in total with a thousand cgroups.
> >
> > The current workqueue model, merged in the 2.6.36 kernel, is called
> > "concurrency managed workqueue (cmwq)", which is intended to provide flexible
> > concurrency without wasting resources. I studied it a bit and here is what I
> > found:
> >
> > 1. The workqueue is complicated and we need to be very careful with work
> > items in the workqueue. We have seen cases where one work item gets stuck and
> > the rest of the work items cannot proceed. For example, in dirty page
> > writeback, one heavy-writer cgroup could starve the other cgroups from
> > flushing dirty pages to the same disk. In the kswapd case, I can imagine we
> > might have a similar scenario.
> >
> > 2. How to prioritize the work items is another problem. The order in which
> > work items are added to the queue determines the order in which cgroups are
> > reclaimed. We don't have that restriction currently; we rely on the cpu
> > scheduler to put kswapd on the right cpu-core to run. We "might" introduce
> > priority for reclaim later, and it is not clear how we would deal with that.
> >
> > 3. Based on what I observed, not many callers have migrated to cmwq, and I
> > don't have much data on how well it works.
> >
> > > Regardless of workqueue, can't we have moderate numbers of threads ?
> > >
> > > I really don't like to have too many threads, and I think
> > > one-thread-per-memcg is big enough to cause lock contention problems.
> > >
> >
> > Back to the current model: on a machine with a thousand cgroups, it will take
> > 8M in total for a thousand kswapd threads (8K stack for each thread). We are
> > running systems with fakenuma, where each numa node has a kswapd. So far we
> > haven't noticed any issue caused by "lots of" kswapd threads. Also, there
> > shouldn't be any performance overhead for a kernel thread if it is not
> > running.
> >
> > Based on the complexity of workqueue and the benefit it provides, I would
> > like to stick to the current model first. After we get the basic stuff in
> > and other targeting reclaim improvements, we can come back to this. What do
> > you think?
> >
>
> Okay, fair enough. kthread_run() will win.
>
> Then, I have another request. I'd like to put kswapd-for-memcg into some cpu
> cgroup to limit cpu usage.
>
> - Could you show thread ID somewhere ? and
>   confirm we can put it to some cpu cgroup ?
>   (creating an auto cpu cgroup for memcg kswapd is a choice, I think.)
>
>   BTW, when kthread_run() creates a kthread, which cgroup will it be under ?
>   If it will be under the cgroup of whoever calls kthread_run(), per-memcg
>   kswapd will implicitly go under the cgroup where the user sets hi/low wmark.
>   I don't think this is very bad. But it's better to mention the behavior
>   somewhere because memcg tends to be used with cpu cgroup.
>   Could you check and add some doc ?
>
> And
> - Could you drop PF_MEMALLOC ? (for now.) (in patch 4)
>

Hmm, do you mean to drop it for per-memcg kswapd?

> - Could you check PF_KSWAPD doesn't do anything bad ?
>

There are eight places where current_is_kswapd() is called. Five of them
only update counters, and the remaining three look fine to me:

1. too_many_isolated()
    returns false if kswapd

2. should_reclaim_stall()
    returns false if kswapd

3. nfs_commit_inode()
    may_wait = NULL if kswapd

--Ying

>
> Thanks,
> -Kame
>

--000e0cd37a7e66f9fc04a0e46305
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

--000e0cd37a7e66f9fc04a0e46305--

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
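
As a cross-check of the numbers quoted in this thread, here is a rough shell sketch of the arithmetic. The function names are hypothetical and not part of the patch set. The low_wmark rule (limit times wmark_ratio over 100) and the high_wmark rule (low_wmark minus 1/256 of it) are only inferred from the memory.reclaim_wmarks output shown in the thread; the message itself does not state either formula, so treat both as assumptions that happen to reproduce those values.

```shell
#!/bin/sh
# Sketch only: reproduces figures quoted in the thread.
# All function names are hypothetical, not from the patch set.

# low_wmark in bytes for a given limit_in_bytes and wmark_ratio.
# A ratio of 0 is treated as "background reclaim disabled": the thread
# shows both watermarks equal to the limit in that case.
low_wmark_bytes() {
    if [ "$2" -eq 0 ]; then
        echo "$1"
    else
        echo $(( $1 * $2 / 100 ))
    fi
}

# high_wmark in bytes. ASSUMPTION: low_wmark minus low_wmark/256,
# inferred from the quoted output (471859200 -> 470016000).
high_wmark_bytes() {
    low=$(low_wmark_bytes "$1" "$2")
    if [ "$2" -eq 0 ]; then
        echo "$low"
    else
        echo $(( low - low / 256 ))
    fi
}

# Total kernel stack, in KiB, for a number of per-memcg kswapd threads
# at a given stack size in KiB per thread (8 in the thread's estimate).
kswapd_stack_kb() {
    echo $(( $1 * $2 ))
}
```

For the 500M cgroup in Step 2, `low_wmark_bytes 524288000 90` gives 471859200 and `high_wmark_bytes 524288000 90` gives 470016000, matching the quoted output. `kswapd_stack_kb 1000 8` gives 8000, which is the "8M for a thousand kswapd threads" estimate discussed above.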