* [LSF/MM TOPIC] memory control groups
@ 2011-01-17 19:14 Johannes Weiner
  2011-01-18  1:10 ` KAMEZAWA Hiroyuki
                   ` (4 more replies)
  0 siblings, 5 replies; 13+ messages in thread
From: Johannes Weiner @ 2011-01-17 19:14 UTC (permalink / raw)
  To: linux-mm, lsf-pc
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Greg Thelen,
	Ying Han, Michel Lespinasse

Hello,

At the MM summit, I would like to talk about the current state of
memory control groups, the features and extensions that are currently
being developed for it, and what their status is.

I am especially interested in talking about the current runtime memory
overhead memcg comes with (1% of ram) and what we can do to shrink it.

In comparison to how efficiently struct page is packed, and given that
distro kernels come with memcg enabled by default, I think we should
put a bit more thought into how struct page_cgroup (which exists for
every page in the system as well) is organized.

I have a patch series that removes the page backpointer from struct
page_cgroup by storing a node ID (or section ID, depending on whether
sparsemem is configured) in the free bits of pc->flags.
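
To make the idea concrete, here is a minimal userspace sketch, not the
actual patch series: the names pc_sketch, pc_init(), pfn_to_pc() and
pc_to_pfn(), the 8 flag bits, and the toy section size are all
assumptions chosen for illustration. It keeps a section ID in the upper
bits of the flags word and recovers the page frame number from the
descriptor's position in its section's array, so no backpointer has to
be stored.

```c
#include <assert.h>
#include <stddef.h>

#define NR_FLAG_BITS   8UL   /* assumed; not the real number of PCG flags */
#define PAGES_PER_SEC  4UL   /* toy section size, for illustration only */
#define NR_SECTIONS    3UL

struct pc_sketch {
	unsigned long flags; /* low bits: state flags, high bits: section ID */
};

/* One descriptor array per section, similar in spirit to sparsemem. */
static struct pc_sketch pc_table[NR_SECTIONS][PAGES_PER_SEC];

/* Stamp every descriptor with the ID of the section it belongs to. */
static void pc_init(void)
{
	for (unsigned long sec = 0; sec < NR_SECTIONS; sec++)
		for (unsigned long i = 0; i < PAGES_PER_SEC; i++)
			pc_table[sec][i].flags = sec << NR_FLAG_BITS;
}

/* Forward lookup: page frame number -> descriptor. */
static struct pc_sketch *pfn_to_pc(unsigned long pfn)
{
	assert(pfn < NR_SECTIONS * PAGES_PER_SEC);
	return &pc_table[pfn / PAGES_PER_SEC][pfn % PAGES_PER_SEC];
}

/* Reverse lookup without a backpointer: descriptor -> page frame number. */
static unsigned long pc_to_pfn(const struct pc_sketch *pc)
{
	unsigned long sec = pc->flags >> NR_FLAG_BITS;
	size_t idx = (size_t)(pc - pc_table[sec]);

	return sec * PAGES_PER_SEC + idx;
}
```

The reverse lookup costs a shift, a pointer subtraction and a multiply,
and it keeps working when the low flag bits change.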

I also plan on replacing the pc->mem_cgroup pointer with an ID
(KAMEZAWA-san has patches for that), and moving it to pc->flags too.
Every flag not used doubles the number of possible control groups,
so I have patches that get rid of some of the currently allocated
flags, including PCG_CACHE, PCG_ACCT_LRU, and PCG_MIGRATION.
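
The doubling follows from simple bit arithmetic. A small sketch, where
the helper max_group_ids() and all bit counts are hypothetical (the real
layout of pc->flags is not spelled out here):

```c
#include <assert.h>
#include <limits.h>

/*
 * Number of distinct group IDs that fit into a flags word of word_bits
 * bits once flag_bits state flags and section_bits node/section ID bits
 * are taken. All three widths are illustrative assumptions.
 */
static unsigned long max_group_ids(unsigned int word_bits,
				   unsigned int flag_bits,
				   unsigned int section_bits)
{
	unsigned int id_bits;

	assert(flag_bits + section_bits < word_bits);
	id_bits = word_bits - flag_bits - section_bits;
	assert(id_bits < sizeof(unsigned long) * CHAR_BIT);

	return 1UL << id_bits;
}
```

With an assumed 32-bit word, 16 section bits and 8 flags, 8 bits remain
for 256 group IDs; dropping a single flag leaves 9 bits, or 512 IDs.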

[ I meant to send those out much earlier already, but a bug in the
migration rework was not responding to my yelling 'Marco', and now my
changes collide horribly with THP, so it will take another rebase. ]

The per-memcg dirty accounting work, for example, allocates a bunch of
new bits in pc->flags, and I'd like to hash out whether this leaves
enough room for the structure packing I described, or whether we can
come up with a different way of tracking state.

Would other people be interested in discussing this?

	Hannes

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: [LSF/MM TOPIC] memory control groups
  2011-01-17 19:14 [LSF/MM TOPIC] memory control groups Johannes Weiner
@ 2011-01-18  1:10 ` KAMEZAWA Hiroyuki
  2011-01-18  8:40   ` Johannes Weiner
  2011-01-18  8:17 ` Michel Lespinasse
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 13+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-01-18  1:10 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, lsf-pc, Daisuke Nishimura, Balbir Singh, Greg Thelen,
	Ying Han, Michel Lespinasse

On Mon, 17 Jan 2011 20:14:00 +0100
Johannes Weiner <hannes@cmpxchg.org> wrote:

> Hello,
> 
> on the MM summit, I would like to talk about the current state of
> memory control groups, the features and extensions that are currently
> being developed for it, and what their status is.
> 
> I am especially interested in talking about the current runtime memory
> overhead memcg comes with (1% of ram) and what we can do to shrink it.
> 
> In comparison to how efficiently struct page is packed, and given that
> distro kernels come with memcg enabled per default, I think we should
> put a bit more thought into how struct page_cgroup (which exists for
> every page in the system as well) is organized.
> 
> I have a patch series that removes the page backpointer from struct
> page_cgroup by storing a node ID (or section ID, depending on whether
> sparsemem is configured) in the free bits of pc->flags.
> 
> I also plan on replacing the pc->mem_cgroup pointer with an ID
> (KAMEZAWA-san has patches for that), and move it to pc->flags too.
> Every flag not used means doubling the amount of possible control
> groups, so I have patches that get rid of some flags currently
> allocated, including PCG_CACHE, PCG_ACCT_LRU, and PCG_MIGRATION.
> 
> [ I meant to send those out much earlier already, but a bug in the
> migration rework was not responding to my yelling 'Marco', and now my
> changes collide horribly with THP, so it will take another rebase. ]
> 
> The per-memcg dirty accounting work e.g. allocates a bunch of new bits
> in pc->flags and I'd like to hash out if this leaves enough room for
> the structure packing I described, or whether we can come up with a
> different way of tracking state.
> 

I see that there are requests for shrinking page_cgroup. And yes, I think
we should do so. I think there is a trade-off between performance and
memory usage. So, could you show the numbers when we discuss it?

BTW, I think we can...

- Drop the PCG_ACCT_LRU bit. (I think list_empty(&pc->lru) can be used
  instead. The ROOT cgroup will not be a problem.)
- Replace pc->mem_cgroup with an ID.
  But moving it into the flags field seems difficult because of races.
- Replace pc->page with some lookup routine.
  But the section bit encoding may be somewhat obscure, and the lookup
  cost will be a problem.
- Drop the PCG_CACHE bit. It duplicates information in 'page', so we can
  use PageAnon().
- I'm not sure about PCG_MIGRATION. It's there to avoid races.

Note: we'll need to use 16 bits for blkio tracking.

Another idea is dynamic allocation of page_cgroup. It may help in a THP
environment but will not work well (it just adds overhead) for file cache
workloads.

Anyway, my development priorities for memcg this year are:

 1. Dirty ratio support.
 2. Background reclaim (kswapd).
 3. blkio tracking.

Slimming down page_cgroup should be done step by step. We've seen many
regressions when a new feature was added to the memory cgroup.

Thanks,
-Kame


* Re: [LSF/MM TOPIC] memory control groups
  2011-01-17 19:14 [LSF/MM TOPIC] memory control groups Johannes Weiner
  2011-01-18  1:10 ` KAMEZAWA Hiroyuki
@ 2011-01-18  8:17 ` Michel Lespinasse
  2011-01-18  8:45   ` KAMEZAWA Hiroyuki
  2011-01-18  8:53 ` CAI Qian
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 13+ messages in thread
From: Michel Lespinasse @ 2011-01-18  8:17 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, lsf-pc, KAMEZAWA Hiroyuki, Daisuke Nishimura,
	Balbir Singh, Greg Thelen, Ying Han

On Mon, Jan 17, 2011 at 11:14 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> on the MM summit, I would like to talk about the current state of
> memory control groups, the features and extensions that are currently
> being developed for it, and what their status is.

+1 - there is a lot to discuss about memcg...

> I am especially interested in talking about the current runtime memory
> overhead memcg comes with (1% of ram) and what we can do to shrink it.
>
> In comparison to how efficiently struct page is packed, and given that
> distro kernels come with memcg enabled per default, I think we should
> put a bit more thought into how struct page_cgroup (which exists for
> every page in the system as well) is organized.
>
> I have a patch series that removes the page backpointer from struct
> page_cgroup by storing a node ID (or section ID, depending on whether
> sparsemem is configured) in the free bits of pc->flags.
>
> I also plan on replacing the pc->mem_cgroup pointer with an ID
> (KAMEZAWA-san has patches for that), and move it to pc->flags too.
> Every flag not used means doubling the amount of possible control
> groups, so I have patches that get rid of some flags currently
> allocated, including PCG_CACHE, PCG_ACCT_LRU, and PCG_MIGRATION.
>
> [ I meant to send those out much earlier already, but a bug in the
> migration rework was not responding to my yelling 'Marco', and now my
> changes collide horribly with THP, so it will take another rebase. ]
>
> The per-memcg dirty accounting work e.g. allocates a bunch of new bits
> in pc->flags and I'd like to hash out if this leaves enough room for
> the structure packing I described, or whether we can come up with a
> different way of tracking state.

This is probably longer term, but I would love to get rid of the
duplication between global LRU and per-cgroup LRU. Global LRU could be
approximated by scanning all per-cgroup LRU lists (in amounts
proportional to the list lengths).

> Would other people be interested in discussing this?

Definitely.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


* Re: [LSF/MM TOPIC] memory control groups
  2011-01-18  1:10 ` KAMEZAWA Hiroyuki
@ 2011-01-18  8:40   ` Johannes Weiner
  2011-01-18  9:17     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 13+ messages in thread
From: Johannes Weiner @ 2011-01-18  8:40 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, lsf-pc, Daisuke Nishimura, Balbir Singh, Greg Thelen,
	Ying Han, Michel Lespinasse

On Tue, Jan 18, 2011 at 10:10:57AM +0900, KAMEZAWA Hiroyuki wrote:
> On Mon, 17 Jan 2011 20:14:00 +0100
> Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > Hello,
> > 
> > on the MM summit, I would like to talk about the current state of
> > memory control groups, the features and extensions that are currently
> > being developed for it, and what their status is.
> > 
> > I am especially interested in talking about the current runtime memory
> > overhead memcg comes with (1% of ram) and what we can do to shrink it.
> > 
> > In comparison to how efficiently struct page is packed, and given that
> > distro kernels come with memcg enabled per default, I think we should
> > put a bit more thought into how struct page_cgroup (which exists for
> > every page in the system as well) is organized.
> > 
> > I have a patch series that removes the page backpointer from struct
> > page_cgroup by storing a node ID (or section ID, depending on whether
> > sparsemem is configured) in the free bits of pc->flags.
> > 
> > I also plan on replacing the pc->mem_cgroup pointer with an ID
> > (KAMEZAWA-san has patches for that), and move it to pc->flags too.
> > Every flag not used means doubling the amount of possible control
> > groups, so I have patches that get rid of some flags currently
> > allocated, including PCG_CACHE, PCG_ACCT_LRU, and PCG_MIGRATION.
> > 
> > [ I meant to send those out much earlier already, but a bug in the
> > migration rework was not responding to my yelling 'Marco', and now my
> > changes collide horribly with THP, so it will take another rebase. ]
> > 
> > The per-memcg dirty accounting work e.g. allocates a bunch of new bits
> > in pc->flags and I'd like to hash out if this leaves enough room for
> > the structure packing I described, or whether we can come up with a
> > different way of tracking state.
> > 
> 
> I see that there are requests for shrinking page_cgroup. And yes, I think
> we should do so. I think there are trade-off between performance v.s.
> memory usage. So, could you show the numbers when we discuss it ?

Yep, I will prepare them anyway for submission.

> BTW, I think we can...
> 
> - PCG_ACCT_LRU bit can be dropped.(I think list_empty(&pc->lru) can be used.
>                 ROOT cgroup will not be problem.)

Yes, that's what I did.  Should be protected by the lru lock and root
cgroup pages can easily be marked so that list_empty() works on them.
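
A rough userspace sketch of the idea, assuming a minimal
re-implementation of the kernel's list helpers; pc_lru_sketch and
pc_on_lru() are hypothetical names, and the real code would run under
the lru lock:

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal doubly linked list in the style of the kernel's list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static bool list_empty(const struct list_head *head)
{
	return head->next == head;
}

static void list_add(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Unlink the entry and leave it pointing at itself again. */
static void list_del_init(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	INIT_LIST_HEAD(entry);
}

struct pc_lru_sketch {
	struct list_head lru;
};

/* Stands in for a dedicated "accounted on LRU" flag bit. */
static bool pc_on_lru(const struct pc_lru_sketch *pc)
{
	return !list_empty(&pc->lru);
}
```

This only works if every removal re-initializes the entry, which is why
the sketch uses list_del_init() rather than a plain unlink.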

> - pc->mem_cgroup can be replaced with ID.
>   But move it into flags field seems difficult because of races.
> - pc->page can be replaced with some lookup routine.
>   But Section bit encoding may be something mysterious and look up cost
>   will be problem.

Why is that?

The lookup is actually straightforward, like lookup_page_cgroup().
And we only need it when coming from the per-cgroup LRU, i.e. in
reclaim and force_empty.

> - PCG_CACHE bit is a duplicate of information of 'page'. So, we can use PageAnon()

I did that, too.  But for this to work, we need to make sure that
pages are always rmapped when they are charged and uncharged.  This is
one point where I collide with THP.  It's also why I complained that
migration clears page->mapping of replaced anonymous pages :)

> - I'm not sure PCG_MIGRATION. It's for avoiding races.

That's also a scary patch...  Yeah, it's to prevent uncharging of
oldpage in case migration fails and it has to be reused.  I changed
the migration sequence for memcg a bit so that we don't have to do
that anymore.  It survived basic testing.

> Note: we'll need to use 16bits for blkio tracking.
> 
> Another idea is dynamic allocation of page_cgroup. It may be able to be a help
> for THP environment but will not work well (just adds overhead) against file cache
> workload.
> 
> Anyway, my priority of development for memcg this year is:
> 
>  1. dirty ratio support.
>  2. Background reclaim (kswapd)
>  3. blkio tracking.
> 
> Diet of page_cgroup should be done in step by step. We've seen many level down
> when some new feature comes to memory cgroup.

Yes, and that's what I'm afraid of.  We would never be able to add a
side feature that grows struct page by an arbitrary amount.

If the feature is sufficiently important and there is no other way, it
should of course be an option.  But it should not be done carelessly.

E.g. I have a suspicion that we might be able to do dirty accounting
without all the flags (we have them in the page anyway!) but use
proportionals instead.  It's not page-accurate, but I think the
fundamental problem is solved: when the dirty ratio is exceeded,
throttle the cgroup with the biggest dirty share.

But yes, that's sort of what I want to discuss :)

	Hannes


* Re: [LSF/MM TOPIC] memory control groups
  2011-01-18  8:17 ` Michel Lespinasse
@ 2011-01-18  8:45   ` KAMEZAWA Hiroyuki
  2011-02-07  5:27     ` Balbir Singh
  0 siblings, 1 reply; 13+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-01-18  8:45 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Johannes Weiner, linux-mm, lsf-pc, Daisuke Nishimura,
	Balbir Singh, Greg Thelen, Ying Han

On Tue, 18 Jan 2011 00:17:53 -0800
Michel Lespinasse <walken@google.com> wrote:


> > The per-memcg dirty accounting work e.g. allocates a bunch of new bits
> > in pc->flags and I'd like to hash out if this leaves enough room for
> > the structure packing I described, or whether we can come up with a
> > different way of tracking state.
> 
> This is probably longer term, but I would love to get rid of the
> duplication between global LRU and per-cgroup LRU. Global LRU could be
> approximated by scanning all per-cgroup LRU lists (in amounts
> proportional to the list lengths).
> 

I can't say why the first implementation chose a design in which memcg's
per-page metadata has its own LRU rather than reusing page->lru, because
I was not involved when memcg was created. Does anyone remember the
reason or the discussion?

For my part, I review memcg patches from this viewpoint:
"Will this patch affect the global LRU, and is it certain never to break
 the global LRU's page reclaim algorithm?"

Thanks,
-Kame



* Re: [LSF/MM TOPIC] memory control groups
  2011-01-17 19:14 [LSF/MM TOPIC] memory control groups Johannes Weiner
  2011-01-18  1:10 ` KAMEZAWA Hiroyuki
  2011-01-18  8:17 ` Michel Lespinasse
@ 2011-01-18  8:53 ` CAI Qian
  2011-01-20 10:18 ` Balbir Singh
  2011-02-06 15:45 ` Michel Lespinasse
  4 siblings, 0 replies; 13+ messages in thread
From: CAI Qian @ 2011-01-18  8:53 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Greg Thelen,
	Ying Han, Michel Lespinasse, linux-mm, lsf-pc



----- Original Message -----
> Hello,
> 
> on the MM summit, I would like to talk about the current state of
> memory control groups, the features and extensions that are currently
> being developed for it, and what their status is.
> 
> I am especially interested in talking about the current runtime memory
> overhead memcg comes with (1% of ram) and what we can do to shrink it.
> 
> In comparison to how efficiently struct page is packed, and given that
> distro kernels come with memcg enabled per default, I think we should
> put a bit more thought into how struct page_cgroup (which exists for
> every page in the system as well) is organized.
> 
> I have a patch series that removes the page backpointer from struct
> page_cgroup by storing a node ID (or section ID, depending on whether
> sparsemem is configured) in the free bits of pc->flags.
> 
> I also plan on replacing the pc->mem_cgroup pointer with an ID
> (KAMEZAWA-san has patches for that), and move it to pc->flags too.
> Every flag not used means doubling the amount of possible control
> groups, so I have patches that get rid of some flags currently
> allocated, including PCG_CACHE, PCG_ACCT_LRU, and PCG_MIGRATION.
> 
> [ I meant to send those out much earlier already, but a bug in the
> migration rework was not responding to my yelling 'Marco', and now my
> changes collide horribly with THP, so it will take another rebase. ]
> 
> The per-memcg dirty accounting work e.g. allocates a bunch of new bits
> in pc->flags and I'd like to hash out if this leaves enough room for
> the structure packing I described, or whether we can come up with a
> different way of tracking state.
> 
> Would other people be interested in discussing this?
I would love to present the testing we have done here at work, and
to gather some ideas from the testing angle as a QE engineer, if there is
an invitation that lets me obtain a visa, travel budget, etc.

CAI Qian


* Re: [LSF/MM TOPIC] memory control groups
  2011-01-18  8:40   ` Johannes Weiner
@ 2011-01-18  9:17     ` KAMEZAWA Hiroyuki
  2011-01-18 10:20       ` Johannes Weiner
  0 siblings, 1 reply; 13+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-01-18  9:17 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, lsf-pc, Daisuke Nishimura, Balbir Singh, Greg Thelen,
	Ying Han, Michel Lespinasse

On Tue, 18 Jan 2011 09:40:13 +0100
Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Tue, Jan 18, 2011 at 10:10:57AM +0900, KAMEZAWA Hiroyuki wrote:
> > On Mon, 17 Jan 2011 20:14:00 +0100
> > Johannes Weiner <hannes@cmpxchg.org> wrote:

> > - pc->mem_cgroup can be replaced with ID.
> >   But move it into flags field seems difficult because of races.
> > - pc->page can be replaced with some lookup routine.
> >   But Section bit encoding may be something mysterious and look up cost
> >   will be problem.
> 
> Why is that?
> 
> The lookup is actually straight-forward, like lookup_page_cgroup().
> And we only need it when coming from the per-cgroup LRU, i.e. in
> reclaim and force_empty.
>  

I see that pc->page is not used very frequently. But I wonder whether we
should revisit the performance of lookup_page_cgroup() before adding more
weight to it.


> > - PCG_CACHE bit is a duplicate of information of 'page'. So, we can use PageAnon()
> 
> I did that, too.  But for this to work, we need to make sure that
> pages are always rmapped when they are charged and uncharged.  This is
> one point where I collide with THP.  It's also why I complained that
> migration clears page->mapping of replaced anonymous pages :)
> 
> > - I'm not sure PCG_MIGRATION. It's for avoiding races.
> 
> That's also a scary patch...  Yeah, it's to prevent uncharging of
> oldpage in case migration fails and it has to be reused.  I changed
> the migration sequence for memcg a bit so that we don't have to do
> that anymore.  It survived basic testing.
> 

Hmm. I have seen migration under memcg regress several times, so I don't
want to modify working code without good reason.
I guess all SECTION_BITS can be encoded in pc->flags without trimming any
flags.


> > 
> > Another idea is dynamic allocation of page_cgroup. It may be able to be a help
> > for THP environment but will not work well (just adds overhead) against file cache
> > workload.
> > 
> > Anyway, my priority of development for memcg this year is:
> > 
> >  1. dirty ratio support.
> >  2. Background reclaim (kswapd)
> >  3. blkio tracking.
> > 
> > Diet of page_cgroup should be done in step by step. We've seen many level down
> > when some new feature comes to memory cgroup.
> 
> Yes, and that's what I'm afraid of.  We would never be able to add a
> side-feature that makes struct page increase in arbitrary size.
> 
> If the feature is sufficiently important and there is no other way, it
> should of course be an option.  But it should not be done careless.
> 
> E.g. I have a suspicion that we might be able to do dirty accounting
> without all the flags (we have them in the page anyway!) but use
> proportionals instead.  It's not page-accurate, but I think the
> fundamental problem is solved: when the dirty ratio is exceeded,
> throttle the cgroup with the biggest dirty share.
> 
> But yes, that's sort of what I want to discuss :)
> 

Using proportionals is one option. But, IIUC, users of memcg want
something like /proc/meminfo, and proportionals don't match that.
If I were a container user, I would want information like /proc/meminfo
for my container.

Anyway, if the kernel merges IO-less page reclaim, dirty ratio support
is the first thing we have to implement.
Without it, a memcg will easily hit OOM.

Thanks,
-Kame


* Re: [LSF/MM TOPIC] memory control groups
  2011-01-18  9:17     ` KAMEZAWA Hiroyuki
@ 2011-01-18 10:20       ` Johannes Weiner
  2011-01-19  0:14         ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 13+ messages in thread
From: Johannes Weiner @ 2011-01-18 10:20 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, lsf-pc, Daisuke Nishimura, Balbir Singh, Greg Thelen,
	Ying Han, Michel Lespinasse

On Tue, Jan 18, 2011 at 06:17:57PM +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 18 Jan 2011 09:40:13 +0100
> Johannes Weiner <hannes@cmpxchg.org> wrote:
> > On Tue, Jan 18, 2011 at 10:10:57AM +0900, KAMEZAWA Hiroyuki wrote:
> > > - pc->page can be replaced with some lookup routine.
> > >   But Section bit encoding may be something mysterious and look up cost
> > >   will be problem.
> > 
> > Why is that?
> > 
> > The lookup is actually straight-forward, like lookup_page_cgroup().
> > And we only need it when coming from the per-cgroup LRU, i.e. in
> > reclaim and force_empty.
> >  
> 
> I see usage of pc->page is not very frequent. But I wonder we should
> revisit performance of lookup_page_cgroup() before adding new weight.

I think those are two different things to tackle.  But I will make
sure to check for performance overhead when removing pc->page.

> > > - I'm not sure PCG_MIGRATION. It's for avoiding races.
> > 
> > That's also a scary patch...  Yeah, it's to prevent uncharging of
> > oldpage in case migration fails and it has to be reused.  I changed
> > the migration sequence for memcg a bit so that we don't have to do
> > that anymore.  It survived basic testing.
> > 
> 
> Hmm. I saw level down of migration under memcg several times. So, I don't
> want to modify running one without enough reason.
> I guess all SECTION_BITS can be encoded to pc->flags without diet of flags.

That's true, there is enough room for that.

I only wrote those flag reduction patches in order to also pack the
pc->mem_cgroup ID into pc->flags, but these are two independent problems.

I would not have finished the patch only for that one tiny flag, but
it actually saved code and made it IMO a bit easier to understand.  I
consider this a serious upside of code that has a history of breaking.

But one thing at a time: first I will finish testing and benchmarking
the pc->page removal.

> > E.g. I have a suspicion that we might be able to do dirty accounting
> > without all the flags (we have them in the page anyway!) but use
> > proportionals instead.  It's not page-accurate, but I think the
> > fundamental problem is solved: when the dirty ratio is exceeded,
> > throttle the cgroup with the biggest dirty share.
> 
> Using proportionals is a choice. But, IIUC, users of memcg wants 
> something like /proc/meminfo. It doesn't match.
> If I'm an user of container, I want an information like /proc/meminfo for
> container.

I totally agree that this is information that needs exporting.

But you can easily calculate an absolute number of bytes by applying a
memcg's relative proportion to the absolute amount of dirty pages for
example.  The only difference is that it probably won't be 100%
accurate, but a few pages difference should really not matter for
user-visible statistics.

No?
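
To spell out the arithmetic, a minimal sketch; the helper
memcg_dirty_bytes_estimate() and the representation of the share as a
numerator/denominator pair are assumptions for illustration, not an
existing interface:

```c
#include <assert.h>

/*
 * Estimate a group's dirty bytes from the global dirty page count and
 * the group's relative share (share_num / share_den). Integer division
 * rounds down, so the result can be off by up to one page.
 */
static unsigned long long
memcg_dirty_bytes_estimate(unsigned long long global_dirty_pages,
			   unsigned long long share_num,
			   unsigned long long share_den,
			   unsigned long long page_size)
{
	assert(share_den != 0 && share_num <= share_den);

	return global_dirty_pages * share_num / share_den * page_size;
}
```

For example, a group holding a quarter of the dirty share, with 1000
dirty pages globally and 4096-byte pages, would be reported as 250
pages, i.e. 1024000 bytes.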

> Anyway, if the kernel goes to merge IO-less page reclaim, dirty ratio
> support is the 1st thing we have to implement.
> Without that, memcg will easily OOM.

Agreed.  I am not saying that my memory footprint concerns should
stand in the way of merging important infrastructure.  This is work
that can still be done even after dirty accounting is merged.

Thanks,
	Hannes


* Re: [LSF/MM TOPIC] memory control groups
  2011-01-18 10:20       ` Johannes Weiner
@ 2011-01-19  0:14         ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 13+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-01-19  0:14 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, lsf-pc, Daisuke Nishimura, Balbir Singh, Greg Thelen,
	Ying Han, Michel Lespinasse

On Tue, 18 Jan 2011 11:20:06 +0100
Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Tue, Jan 18, 2011 at 06:17:57PM +0900, KAMEZAWA Hiroyuki wrote: 
> > > > - I'm not sure PCG_MIGRATION. It's for avoiding races.
> > > 
> > > That's also a scary patch...  Yeah, it's to prevent uncharging of
> > > oldpage in case migration fails and it has to be reused.  I changed
> > > the migration sequence for memcg a bit so that we don't have to do
> > > that anymore.  It survived basic testing.
> > > 
> > 
> > Hmm. I saw level down of migration under memcg several times. So, I don't
> > want to modify running one without enough reason.
> > I guess all SECTION_BITS can be encoded to pc->flags without diet of flags.
> 
> That's true, there is enough room for that.
> 
> Those reduction patches I only wrote to also pack the pc->mem_cgroup
> ID into pc->flags, but these are two independent problems.
> 

That packing is dangerous because we have a lock bit in pc->flags and
some accesses to pc->mem_cgroup are lockless. IIUC, it's difficult to
avoid races with modifications of pc->mem_cgroup.
Hm, if we remove PCG_ACCT_LRU, it may be possible, but I'm not sure
whether FILESTAT etc. would be safe.


> I would not have finished the patch only for that one tiny flag, but
> it actually saved code and made it IMO a bit easier to understand.  I
> consider this a serious upside of code that has a history of breaking.
> 
> But one at the time, first I will finish testing and benchmarking the
> pc->page removal.
> 
Sure.

> > > E.g. I have a suspicion that we might be able to do dirty accounting
> > > without all the flags (we have them in the page anyway!) but use
> > > proportionals instead.  It's not page-accurate, but I think the
> > > fundamental problem is solved: when the dirty ratio is exceeded,
> > > throttle the cgroup with the biggest dirty share.
> > 
> > Using proportionals is a choice. But, IIUC, users of memcg wants 
> > something like /proc/meminfo. It doesn't match.
> > If I'm an user of container, I want an information like /proc/meminfo for
> > container.
> 
> I totally agree that this is information that needs exporting.
> 
> But you can easily calculate an absolute number of bytes by applying a
> memcg's relative proportion to the absolute amount of dirty pages for
> example.  The only difference is that it probably won't be 100%
> accurate, but a few pages difference should really not matter for
> user-visible statistics.
> 
> No?
> 
With proportionals, we can't handle moving accounting between cgroups.
That means rmdir, force_empty, and task moves can turn the dirty
statistics into a mess.

Thanks,
-Kame


* Re: [LSF/MM TOPIC] memory control groups
  2011-01-17 19:14 [LSF/MM TOPIC] memory control groups Johannes Weiner
                   ` (2 preceding siblings ...)
  2011-01-18  8:53 ` CAI Qian
@ 2011-01-20 10:18 ` Balbir Singh
  2011-02-06 15:45 ` Michel Lespinasse
  4 siblings, 0 replies; 13+ messages in thread
From: Balbir Singh @ 2011-01-20 10:18 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, lsf-pc, KAMEZAWA Hiroyuki, Daisuke Nishimura,
	Greg Thelen, Ying Han, Michel Lespinasse

* Johannes Weiner <hannes@cmpxchg.org> [2011-01-17 20:14:00]:

> Hello,
> 
> on the MM summit, I would like to talk about the current state of
> memory control groups, the features and extensions that are currently
> being developed for it, and what their status is.
> 
> I am especially interested in talking about the current runtime memory
> overhead memcg comes with (1% of ram) and what we can do to shrink it.
> 
> In comparison to how efficiently struct page is packed, and given that
> distro kernels come with memcg enabled per default, I think we should
> put a bit more thought into how struct page_cgroup (which exists for
> every page in the system as well) is organized.
> 
> I have a patch series that removes the page backpointer from struct
> page_cgroup by storing a node ID (or section ID, depending on whether
> sparsemem is configured) in the free bits of pc->flags.
> 
> I also plan on replacing the pc->mem_cgroup pointer with an ID
> (KAMEZAWA-san has patches for that), and move it to pc->flags too.
> Every flag not used means doubling the amount of possible control
> groups, so I have patches that get rid of some flags currently
> allocated, including PCG_CACHE, PCG_ACCT_LRU, and PCG_MIGRATION.
> 
> [ I meant to send those out much earlier already, but a bug in the
> migration rework was not responding to my yelling 'Marco', and now my
> changes collide horribly with THP, so it will take another rebase. ]
> 
> The per-memcg dirty accounting work e.g. allocates a bunch of new bits
> in pc->flags and I'd like to hash out if this leaves enough room for
> the structure packing I described, or whether we can come up with a
> different way of tracking state.
> 
> Would other people be interested in discussing this?
>
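
The flag-packing idea quoted above can be illustrated with a minimal
sketch. It assumes a 32-bit flags word in which the low bits hold state
flags and all remaining bits hold a cgroup ID. The names PC_WORD_BITS,
PC_NR_FLAGS, pc_id_capacity(), pc_set_id() and pc_get_id() are
hypothetical and are not taken from the kernel or from the patches under
discussion; the node/section ID bits mentioned in the proposal are left
out for brevity.

```c
#include <stdint.h>

/* Assumed layout: [ cgroup ID | PC_NR_FLAGS state bits ] in 32 bits. */
#define PC_WORD_BITS 32u
#define PC_NR_FLAGS  6u	/* hypothetical number of state flags */
#define PC_FLAG_MASK ((UINT32_C(1) << PC_NR_FLAGS) - 1)

/* Number of distinct IDs that fit next to nr_flags state bits. */
uint64_t pc_id_capacity(unsigned int nr_flags)
{
	return UINT64_C(1) << (PC_WORD_BITS - nr_flags);
}

/* Store id in the upper bits while preserving the state flags. */
uint32_t pc_set_id(uint32_t flags, uint32_t id)
{
	return (flags & PC_FLAG_MASK) | (id << PC_NR_FLAGS);
}

/* Recover the ID from the upper bits. */
uint32_t pc_get_id(uint32_t flags)
{
	return flags >> PC_NR_FLAGS;
}
```

Under this assumed layout, each flag bit that is freed goes to the ID
field, so pc_id_capacity(5) is twice pc_id_capacity(6).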

I would definitely be, if I am invited to the LSF/MM summit. Even
otherwise, we should discuss this over email.

-- 
	Three Cheers,
	Balbir


* Re: [LSF/MM TOPIC] memory control groups
  2011-01-17 19:14 [LSF/MM TOPIC] memory control groups Johannes Weiner
                   ` (3 preceding siblings ...)
  2011-01-20 10:18 ` Balbir Singh
@ 2011-02-06 15:45 ` Michel Lespinasse
  2011-02-07  5:26   ` Balbir Singh
  4 siblings, 1 reply; 13+ messages in thread
From: Michel Lespinasse @ 2011-02-06 15:45 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, lsf-pc, KAMEZAWA Hiroyuki, Daisuke Nishimura,
	Balbir Singh, Greg Thelen, Ying Han

On Mon, Jan 17, 2011 at 11:14 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> on the MM summit, I would like to talk about the current state of
> memory control groups, the features and extensions that are currently
> being developed for it, and what their status is.
>
> I am especially interested in talking about the current runtime memory
> overhead memcg comes with (1% of ram) and what we can do to shrink it.
> [...]
> Would other people be interested in discussing this?

Well, YES :)

In addition to what you mentioned, I believe it would be possible to
avoid the duplication of global vs per-cgroup LRU lists. global
scanning would translate into proportional scanning of all per-cgroup
lists. If we could get that done, it would IMO become reasonable to
integrate back the remaining few page_cgroup fields into struct page
itself...
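
A minimal sketch of the proportional scanning described above, assuming
each cgroup exposes only the length of its LRU list. struct memcg_lru
and distribute_scan() are hypothetical names used for illustration and
do not correspond to existing kernel code.

```c
#include <stddef.h>

/* Hypothetical stand-in for a per-cgroup LRU; only its length matters. */
struct memcg_lru {
	unsigned long nr_pages;		/* pages on this cgroup's LRU */
	unsigned long nr_to_scan;	/* output: share of the global target */
};

/*
 * Split a global scan target across per-cgroup LRUs in proportion to
 * their lengths. Returns the number of pages assigned, which can be
 * slightly below nr_global because of integer rounding.
 */
unsigned long distribute_scan(struct memcg_lru *lrus, size_t n,
			      unsigned long nr_global)
{
	unsigned long long total = 0;
	unsigned long assigned = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += lrus[i].nr_pages;

	for (i = 0; i < n; i++) {
		/* With nothing on any list there is nothing to scan. */
		lrus[i].nr_to_scan = total ?
			(unsigned long)((unsigned long long)lrus[i].nr_pages *
					nr_global / total) : 0;
		assigned += lrus[i].nr_to_scan;
	}
	return assigned;
}
```

For lists of 100, 300 and 600 pages and a global target of 100 pages,
the shares come out as 10, 30 and 60. Note that the sketch visits every
cgroup on each global scan.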

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


* Re: [LSF/MM TOPIC] memory control groups
  2011-02-06 15:45 ` Michel Lespinasse
@ 2011-02-07  5:26   ` Balbir Singh
  0 siblings, 0 replies; 13+ messages in thread
From: Balbir Singh @ 2011-02-07  5:26 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Johannes Weiner, linux-mm, lsf-pc, KAMEZAWA Hiroyuki,
	Daisuke Nishimura, Greg Thelen, Ying Han

* Michel Lespinasse <walken@google.com> [2011-02-06 07:45:05]:

> On Mon, Jan 17, 2011 at 11:14 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > on the MM summit, I would like to talk about the current state of
> > memory control groups, the features and extensions that are currently
> > being developed for it, and what their status is.
> >
> > I am especially interested in talking about the current runtime memory
> > overhead memcg comes with (1% of ram) and what we can do to shrink it.
> > [...]
> > Would other people be interested in discussing this?
> 
> Well, YES :)
> 
> In addition to what you mentioned, I believe it would be possible to
> avoid the duplication of global vs per-cgroup LRU lists. global
> scanning would translate into proportional scanning of all per-cgroup
> lists. If we could get that done, it would IMO become reasonable to
> integrate back the remaining few page_cgroup fields into struct page
> itself...
>

We thought about the duplication and proportional scanning quite a bit
prior to the final design and integration, but it does not scale well
as cgroups increase in number. I would also like to discuss things
like accounting shared pages, etc.

-- 
	Three Cheers,
	Balbir


* Re: [LSF/MM TOPIC] memory control groups
  2011-01-18  8:45   ` KAMEZAWA Hiroyuki
@ 2011-02-07  5:27     ` Balbir Singh
  0 siblings, 0 replies; 13+ messages in thread
From: Balbir Singh @ 2011-02-07  5:27 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Michel Lespinasse, Johannes Weiner, linux-mm, lsf-pc,
	Daisuke Nishimura, Greg Thelen, Ying Han

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2011-01-18 17:45:23]:

> On Tue, 18 Jan 2011 00:17:53 -0800
> Michel Lespinasse <walken@google.com> wrote:
> 
> 
> > > The per-memcg dirty accounting work e.g. allocates a bunch of new bits
> > > in pc->flags and I'd like to hash out if this leaves enough room for
> > > the structure packing I described, or whether we can come up with a
> > > different way of tracking state.
> > 
> > This is probably longer term, but I would love to get rid of the
> > duplication between global LRU and per-cgroup LRU. Global LRU could be
> > > approximated by scanning all per-cgroup LRU lists (in amounts
> > proportional to the list lengths).
> > 
> 
> I can't answer why the design in which the memory cgroup's meta-page has its
> own LRU, rather than reusing page->lru, was selected for the first implementation,
> because I wasn't around for the birth of memcg. Does anyone remember the reason or discussion?
>

The discussions can be found on LKML; some happened during OLS.
Keeping both a local LRU and the global LRU was very important because
we wanted to make sure global reclaim was not broken or affected. We
can discuss this further.

-- 
	Three Cheers,
	Balbir


Thread overview: 13+ messages
2011-01-17 19:14 [LSF/MM TOPIC] memory control groups Johannes Weiner
2011-01-18  1:10 ` KAMEZAWA Hiroyuki
2011-01-18  8:40   ` Johannes Weiner
2011-01-18  9:17     ` KAMEZAWA Hiroyuki
2011-01-18 10:20       ` Johannes Weiner
2011-01-19  0:14         ` KAMEZAWA Hiroyuki
2011-01-18  8:17 ` Michel Lespinasse
2011-01-18  8:45   ` KAMEZAWA Hiroyuki
2011-02-07  5:27     ` Balbir Singh
2011-01-18  8:53 ` CAI Qian
2011-01-20 10:18 ` Balbir Singh
2011-02-06 15:45 ` Michel Lespinasse
2011-02-07  5:26   ` Balbir Singh
