On Thu, Apr 14, 2011 at 10:38 AM, Ying Han <yinghan@google.com> wrote:

>
>
> On Wed, Apr 13, 2011 at 5:14 PM, KAMEZAWA Hiroyuki <
> kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
>> On Wed, 13 Apr 2011 10:53:19 -0700
>> Ying Han <yinghan@google.com> wrote:
>>
>> > On Wed, Apr 13, 2011 at 12:47 AM, KAMEZAWA Hiroyuki <
>> > kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> >
>> > > On Wed, 13 Apr 2011 00:03:00 -0700
>> > > Ying Han <yinghan@google.com> wrote:
>> > >
>> > > > The current implementation of memcg supports targeting reclaim when
>> the
>> > > > cgroup is reaching its hard_limit and we do direct reclaim per
>> cgroup.
>> > > > Per cgroup background reclaim is needed which helps to spread out
>> memory
>> > > > pressure over longer period of time and smoothes out the cgroup
>> > > performance.
>> > > >
>> > > > If the cgroup is configured to use per cgroup background reclaim, a
>> > > kswapd
>> > > > thread is created which only scans the per-memcg LRU list. Two
>> watermarks
>> > > > ("high_wmark", "low_wmark") are added to trigger the background
>> reclaim
>> > > and
>> > > > stop it. The watermarks are calculated based on the cgroup's
>> > > limit_in_bytes.
>> > > >
>> > > > I run through dd test on large file and then cat the file. Then I
>> > > compared
>> > > > the reclaim related stats in memory.stat.
>> > > >
>> > > > Step1: Create a cgroup with 500M memory_limit.
>> > > > $ mkdir /dev/cgroup/memory/A
>> > > > $ echo 500m >/dev/cgroup/memory/A/memory.limit_in_bytes
>> > > > $ echo $$ >/dev/cgroup/memory/A/tasks
>> > > >
>> > > > Step2: Test and set the wmarks.
>> > > > $ cat /dev/cgroup/memory/A/memory.wmark_ratio
>> > > > 0
>> > > >
>> > > > $ cat /dev/cgroup/memory/A/memory.reclaim_wmarks
>> > > > low_wmark 524288000
>> > > > high_wmark 524288000
>> > > >
>> > > > $ echo 90 >/dev/cgroup/memory/A/memory.wmark_ratio
>> > > >
>> > > > $ cat /dev/cgroup/memory/A/memory.reclaim_wmarks
>> > > > low_wmark 471859200
>> > > > high_wmark 470016000
>> > > >
>> > > > $ ps -ef | grep memcg
>> > > > root     18126     2  0 22:43 ?        00:00:00 [memcg_3]
>> > > > root     18129  7999  0 22:44 pts/1    00:00:00 grep memcg
>> > > >
>> > > > Step3: Dirty the pages by creating a 20g file on hard drive.
>> > > > $ ddtest -D /export/hdc3/dd -b 1024 -n 20971520 -t 1
>> > > >
>> > > > Here is memory.stat with vs without the per-memcg reclaim. Previously all
>> > > > the pages were reclaimed by direct reclaim; now some of the pages are
>> > > > also reclaimed in the background.
>> > > >
>> > > > Only direct reclaim                       With background reclaim:
>> > > >
>> > > > pgpgin 5248668                            pgpgin 5248347
>> > > > pgpgout 5120678                           pgpgout 5133505
>> > > > kswapd_steal 0                            kswapd_steal 1476614
>> > > > pg_pgsteal 5120578                        pg_pgsteal 3656868
>> > > > kswapd_pgscan 0                           kswapd_pgscan 3137098
>> > > > pg_scan 10861956                          pg_scan 6848006
>> > > > pgrefill 271174                           pgrefill 290441
>> > > > pgoutrun 0                                pgoutrun 18047
>> > > > allocstall 131689                         allocstall 100179
>> > > >
>> > > > real    7m42.702s                         real 7m42.323s
>> > > > user    0m0.763s                          user 0m0.748s
>> > > > sys     0m58.785s                         sys  0m52.123s
>> > > >
>> > > > throughput is 44.33 MB/sec                throughput is 44.23 MB/sec
>> > > >
>> > > > Step 4: Cleanup
>> > > > $ echo $$ >/dev/cgroup/memory/tasks
>> > > > $ echo 1 > /dev/cgroup/memory/A/memory.force_empty
>> > > > $ rmdir /dev/cgroup/memory/A
>> > > > $ echo 3 >/proc/sys/vm/drop_caches
>> > > >
>> > > > Step 5: Create the same cgroup and read the 20g file into pagecache.
>> > > > $ cat /export/hdc3/dd/tf0 > /dev/zero
>> > > >
>> > > > All the pages are reclaimed from background instead of direct
>> reclaim
>> > > with
>> > > > the per cgroup reclaim.
>> > > >
>> > > > Only direct reclaim                       With background reclaim:
>> > > > pgpgin 5248668                            pgpgin 5248114
>> > > > pgpgout 5120678                           pgpgout 5133480
>> > > > kswapd_steal 0                            kswapd_steal 5133397
>> > > > pg_pgsteal 5120578                        pg_pgsteal 0
>> > > > kswapd_pgscan 0                           kswapd_pgscan 5133400
>> > > > pg_scan 10861956                          pg_scan 0
>> > > > pgrefill 271174                           pgrefill 0
>> > > > pgoutrun 0                                pgoutrun 40535
>> > > > allocstall 131689                         allocstall 0
>> > > >
>> > > > real    7m42.702s                         real 6m20.439s
>> > > > user    0m0.763s                          user 0m0.169s
>> > > > sys     0m58.785s                         sys  0m26.574s
>> > > >
>> > > > Note:
>> > > > This is the first effort of enhancing the target reclaim into memcg.
>> Here
>> > > are
>> > > > the existing known issues and our plan:
>> > > >
>> > > > 1. There is one kswapd thread per cgroup. The thread is created when the
>> > > > cgroup changes its limit_in_bytes and is deleted when the cgroup is being
>> > > > removed. In some environments where thousands of cgroups are configured on
>> > > > a single host, we will have thousands of kswapd threads. The memory
>> > > > consumption would be 8k*1000 = 8M. We don't see a big issue for now if the
>> > > > host can host that many cgroups.
>> > > >
>> > >
>> > > What's bad with using workqueue ?
>> > >
>> > > Pros.
>> > >  - we don't have to keep our own thread pool.
>> > >  - we don't have to see 'ps -elf' is filled by kswapd...
>> > > Cons.
>> > >  - because threads are shared, we can't put kthread to cpu cgroup.
>> > >
>> >
>> > I did some study on workqueue after posting V2. There was a comment
>> > suggesting workqueue instead of a per-memcg kswapd thread, since it will cut
>> > the number of kernel threads created on a host with lots of cgroups. Each
>> > kernel thread allocates about 8K of stack, so 8M in total with a thousand
>> > cgroups.
>> >
>> > The current workqueue model, merged in the 2.6.36 kernel, is called
>> > "concurrency managed workqueue (cmwq)", which is intended to provide flexible
>> > concurrency without wasting resources. I studied it a bit and here is what I
>> > found:
>> >
>> > 1. The workqueue is complicated and we need to be very careful with the work
>> > items in it. We've seen cases where one work item gets stuck and the rest of
>> > the work items can't proceed. For example in dirty page writeback, one
>> > heavy-writer cgroup could starve the other cgroups from flushing dirty
>> > pages to the same disk. In the kswapd case, I can imagine we might have a
>> > similar scenario.
>> >
>> > 2. How to prioritize the work items is another problem. The order in which
>> > work items are added to the queue dictates the order in which cgroups are
>> > reclaimed. We don't have that restriction currently; we rely on the cpu
>> > scheduler to put kswapd on the right cpu core to run. We "might" introduce
>> > priority later for reclaim, and it is unclear how we would deal with that.
>> >
>> > 3. Based on what I observed, not many callers have migrated to cmwq and I
>> > don't have much data on how good it is.
>> >
>> >
>> > Regardless of workqueue, can't we have moderate numbers of threads ?
>> > >
>> > > I really don't like having too many threads, and I think
>> > > one-thread-per-memcg is big enough to cause lock contention problems.
>> > >
>> >
>> > Back to the current model: on a machine with thousands of cgroups, a
>> > thousand kswapd threads will take 8M in total (8K of stack for each
>> > thread). We are running systems with fakenuma, where each numa node has a
>> > kswapd. So far we haven't noticed any issue caused by "lots of" kswapd
>> > threads. Also, there shouldn't be any performance overhead for a kernel
>> > thread if it is not running.
>> >
>> > Based on the complexity of workqueue and the benefit it provides, I
>> would
>> > like to stick to the current model first. After we get the basic stuff
>> in
>> > and other targeting reclaim improvement, we can come back to this. What
>> do
>> > you think?
>> >
>>
>> Okay, fair enough. kthread_run() will win.
>>
>> Then, I have another request. I'd like to put kswapd-for-memcg into some cpu
>> cgroup to limit cpu usage.
>>
>> - Could you show thread ID somewhere ?
>
I added a patch which exports the per-memcg kswapd pid. This is necessary to
later link the kswapd thread to the memcg owner from userspace.

$ cat /dev/cgroup/memory/A/memory.kswapd_pid
memcg_3 6727


> and confirm we can put it to some cpu cgroup ?
>>
>
I tested it by echoing the pid of the memcg kswapd thread into a cpu cgroup
with some cpu shares.


>  (creating a auto cpu cgroup for memcg kswapd is a choice, I think.)
>>
>>  BTW, when kthread_run() creates a kthread, which cgroup it will be under
>> ?
>>
>
By default, it runs under the root cgroup. If there is a need to put the kswapd
thread into a cpu cgroup, userspace can make that change by reading the pid
from the new API and echoing it into the cpu cgroup.
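
A minimal sketch of that userspace step, assuming memory.kswapd_pid prints
"<name> <pid>" as in the output above. The function names and the cpu cgroup
path in the usage line are hypothetical; only the memory.kswapd_pid file and
the tasks file come from this thread.

```shell
# Print the pid field of a memory.kswapd_pid file (format: "<name> <pid>").
kswapd_pid_of() {
    awk '{ print $2; exit }' "$1"
}

# Move the kswapd thread of a memcg into a cpu cgroup.
# $1: memcg directory, $2: cpu cgroup directory (hypothetical layout)
move_kswapd_to_cpu_cgroup() {
    pid=$(kswapd_pid_of "$1/memory.kswapd_pid") || return 1
    # Refuse to write an empty pid if the file could not be parsed.
    if [ -z "$pid" ]; then
        return 1
    fi
    echo "$pid" > "$2/tasks"
}

# Usage (paths are examples only):
#   move_kswapd_to_cpu_cgroup /dev/cgroup/memory/A /dev/cgroup/cpu/reclaim
```

Parsing is kept separate from the write so the pid can be inspected before
anything is moved.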


>  If it will be under a cgroup who calls kthread_run(), per-memcg kswapd
>> will
>>  go under cgroup where the user sets hi/low wmark, implicitly.
>>  I don't think this is very bad. But it's better to mention the behavior
>>  somewhere because memcg tends to be used with cpu cgroup.
>>  Could you check and add some doc ?
>>
>
It makes sense to constrain the cpu usage of the per-memcg kswapd thread as
part of the memcg. However, I see more problems than benefits in doing so.

pros:
1. It is good for isolation: it prevents one cgroup's heavy reclaim from
stealing cpu cycles from other cgroups.

cons:
1. Constraining background reclaim adds more pressure on direct reclaim. That
is bad for process performance, especially on machines that have spare cpu
cycles most of the time.
2. There is a danger of priority inversion if the kswapd thread is preempted.
On a non-preemptive kernel we should be ok, but on a preemptive kernel we might
get priority inversion by preempting kswapd while it holds a mutex.
3. When users configure the cpu cgroup and the memcg, they need to make the
cpu reservation proportional to the memcg size.
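
To illustrate con 3, here is a rough sketch of sizing a cpu share in
proportion to the memcg limit. The 1024 baseline is the default cpu.shares
value; the proportional rule itself, the function name, the floor of 2, and
the 4G host size in the usage line are assumptions for illustration, not
something the patch does.

```shell
# Hypothetical helper: scale the default cpu.shares (1024) by the fraction
# of total memory that the memcg limit represents.
# $1: memcg limit in bytes, $2: total memory in bytes
proportional_cpu_shares() {
    shares=$(( 1024 * $1 / $2 ))
    # Keep a small floor so tiny memcgs still get a nonzero share
    # (the value 2 is an assumption).
    if [ "$shares" -lt 2 ]; then
        shares=2
    fi
    echo "$shares"
}

# Usage: the 500M memcg from the test above on a host with 4G of memory
#   proportional_cpu_shares 524288000 4294967296
```

The point is only that the two controllers have to be tuned together; any
real policy would need to account for how much reclaim work a memcg of a given
size actually generates.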

--Ying


>
>> And
>> - Could you drop PF_MEMALLOC ? (for now.) (in patch 4)
>>
> Hmm, do you mean to drop it for per-memcg kswapd?
>

Ok, I dropped the flag for per-memcg kswapd and also made the comment.

>
>
>> - Could you check PF_KSWAPD doesn't do anything bad ?
>>
>
>  There are eight places where current_is_kswapd() is called. Five of
> them only update counters, and the rest look good to me.
>
> 1. too_many_isolated()
>     returns false if kswapd
>
> 2. should_reclaim_stall()
>     returns false if kswapd
>
> 3.  nfs_commit_inode()
>    may_wait = NULL if kswapd
>
> --Ying
>
>
>>
>> Thanks,
>> -Kame
>>
>>
>>
>>
>>
>