linux-kernel.vger.kernel.org archive mirror
* Control page reclaim granularity
@ 2012-03-08  7:34 Zheng Liu
  2012-03-08  8:39 ` Greg Thelen
  2012-03-08  9:35 ` Minchan Kim
  0 siblings, 2 replies; 32+ messages in thread
From: Zheng Liu @ 2012-03-08  7:34 UTC (permalink / raw)
  To: linux-mm, linux-kernel; +Cc: Konstantin Khlebnikov

Hi list,

Recently we encountered a problem with page reclaim, which I summarize here.
There are two different file types: a small index file and a large data
file.  The index file is mmapped into memory, and the application hopes that
its pages can be kept in memory and are not reclaimed too frequently.  The
data file is manipulated by read/write, and its pages should be reclaimed
more frequently than those of the index file.

As discussed previously [1], Konstantin suggested that I mmap the index file
with the PROT_EXEC flag.  He also provided a patch that sets a flag in
mm_flags to increase the priority of mmapped file pages.  However, these
solutions are not perfect.  I reviewed the related patches (8cab4754 and
c909e993) and I think that mmapping the index file with PROT_EXEC is too
tricky.  From the application programmer's point of view, the index file is
a regular file that stores some data, so it should be mapped with
PROT_READ | PROT_WRITE rather than with PROT_EXEC.  As the commit log of
8cab4754 says, the purpose of that patch is to keep executable code in
memory to improve application responsiveness.  In addition, Konstantin's
patch requires changes to the application program.  In some cases we cannot
touch the application code, and then this patch is useless.
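
For illustration only, the mapping choice discussed above could look like
the following user-space sketch.  The helper name map_index_file and its
use_exec_hint parameter are made up for this example; only open(2),
fstat(2), mmap(2) and close(2) are real interfaces.  The file is mapped
read-only to keep the sketch short; an application that updates the index
would open it O_RDWR and add PROT_WRITE.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Map a whole file read-only and shared.  If use_exec_hint is non-zero,
 * PROT_EXEC is added as well, which is the workaround discussed in this
 * thread (hypothetical helper, not an existing API).
 * Returns MAP_FAILED on error; on success *len_out holds the mapping length.
 */
void *map_index_file(const char *path, int use_exec_hint, size_t *len_out)
{
    assert(len_out != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return MAP_FAILED;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size <= 0) {
        close(fd);
        return MAP_FAILED;
    }

    int prot = PROT_READ;
    if (use_exec_hint)
        prot |= PROT_EXEC;   /* may be refused, e.g. on noexec mounts */

    void *addr = mmap(NULL, (size_t)st.st_size, prot, MAP_SHARED, fd, 0);
    close(fd);               /* the mapping stays valid after close(2) */
    if (addr != MAP_FAILED)
        *len_out = (size_t)st.st_size;
    return addr;
}
```

A caller would use map_index_file(path, 0, &len) for the plain mapping and
map_index_file(path, 1, &len) for the PROT_EXEC variant, and release either
one with munmap(addr, len).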

I have discussed this problem with Konstantin, and we think the kernel
should perhaps provide some mechanism.  For example, the user could set
memory pressure priorities per vma or inode, or mmapped pages and file pages
could be reclaimed separately.  If someone has thought about this, please
let me know.  Any feedback is welcome.  Thank you.

Previous discussion:
1. http://marc.info/?l=linux-mm&m=132947026019538&w=2

Regards,
Zheng

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Control page reclaim granularity
  2012-03-08  7:34 Control page reclaim granularity Zheng Liu
@ 2012-03-08  8:39 ` Greg Thelen
  2012-03-08 16:13   ` Zheng Liu
  2012-03-08  9:35 ` Minchan Kim
  1 sibling, 1 reply; 32+ messages in thread
From: Greg Thelen @ 2012-03-08  8:39 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, Konstantin Khlebnikov

Zheng Liu <gnehzuil.liu@gmail.com> writes:
> Hi list,
>
> Recently we encounter a problem about page reclaim.  I abstract it in here.
> The problem is that there are two different file types.  One is small index
> file, and another is large data file.  The index file is mmaped into memory,
> and application hope that they can be kept in memory and don't be reclaimed
> too frequently.  The data file is manipulted by read/write, and they should
> be reclaimed more frequently than the index file.
>
> As previously discussion [1], Konstantin suggest me to mmap index file with
> PROT_EXEC flag.  Meanwhile he provides a patch to set a flag in mm_flags to
> increase the priority of mmaped file pages.  However, these solutions are
> not perfect.  I review the related patches (8cab4754 and c909e993) and I
> think that mmaped index file with PROT_EXEC flag is too tricky.  From the
> view of applicaton programmer, index file is a regular file that stores
> some data.  So they should be mmap with PROT_READ | PROT_WRITE rather than
> with PROT_EXEC.  As commit log said (8cab4754), the purpose of this patch
> is to keep executable code in memory to improve the response of application.
> In addition, Kongstantin's patch needs to adjust the application program.
> So in some cases, we cannot touch the code of application, and this patch is
> useless.
>
> I have discussed with Kongstantin about this problem and we think maybe
> kernel should provide some mechanism.  For example, user can set memory
> pressure priorities for vma or inode, or mmaped pages and file pages can be
> reclaimed separately.  If someone has thought about it, please let me know.
> Any feedbacks are welcomed.  Thank you.
>
> Previously discussion:
> 1. http://marc.info/?l=linux-mm&m=132947026019538&w=2
>
> Regards,
> Zheng

It's not exactly the same approach, but we have toyed with the idea of
charging different inodes to different cgroups.  Each cgroup would have
different soft/hard limits to allow for different cache behavior.

http://www.spinics.net/lists/linux-mm/msg06006.html

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Control page reclaim granularity
  2012-03-08  7:34 Control page reclaim granularity Zheng Liu
  2012-03-08  8:39 ` Greg Thelen
@ 2012-03-08  9:35 ` Minchan Kim
  2012-03-08 16:54   ` Zheng Liu
  2012-03-12 14:55   ` Rik van Riel
  1 sibling, 2 replies; 32+ messages in thread
From: Minchan Kim @ 2012-03-08  9:35 UTC (permalink / raw)
  To: linux-mm, linux-kernel, Konstantin Khlebnikov; +Cc: riel, kosaki.motohiro

On Thu, Mar 08, 2012 at 03:34:13PM +0800, Zheng Liu wrote:
> Hi list,
> 
> Recently we encounter a problem about page reclaim.  I abstract it in here.
> The problem is that there are two different file types.  One is small index
> file, and another is large data file.  The index file is mmaped into memory,
> and application hope that they can be kept in memory and don't be reclaimed
> too frequently.  The data file is manipulted by read/write, and they should
> be reclaimed more frequently than the index file.
> 
> As previously discussion [1], Konstantin suggest me to mmap index file with
> PROT_EXEC flag.  Meanwhile he provides a patch to set a flag in mm_flags to
> increase the priority of mmaped file pages.  However, these solutions are
> not perfect.  I review the related patches (8cab4754 and c909e993) and I
> think that mmaped index file with PROT_EXEC flag is too tricky.  From the
> view of applicaton programmer, index file is a regular file that stores
> some data.  So they should be mmap with PROT_READ | PROT_WRITE rather than
> with PROT_EXEC.  As commit log said (8cab4754), the purpose of this patch
> is to keep executable code in memory to improve the response of application.
> In addition, Kongstantin's patch needs to adjust the application program.
> So in some cases, we cannot touch the code of application, and this patch is
> useless.
> 
> I have discussed with Kongstantin about this problem and we think maybe
> kernel should provide some mechanism.  For example, user can set memory
> pressure priorities for vma or inode, or mmaped pages and file pages can be
> reclaimed separately.  If someone has thought about it, please let me know.
> Any feedbacks are welcomed.  Thank you.
> 
> Previously discussion:
> 1. http://marc.info/?l=linux-mm&m=132947026019538&w=2
> 
> Regards,
> Zheng

I think it's a regression since 2.6.28.
Before that, we tried to keep mapped pages in memory (see calc_reclaim_mapped),
but we removed that routine when we applied the split LRU page replacement.
Rik, KOSAKI, what was the rationale?
We have to decide whether to restore that routine or to create new logic to keep
mapped pages in memory.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Control page reclaim granularity
  2012-03-08  8:39 ` Greg Thelen
@ 2012-03-08 16:13   ` Zheng Liu
  2012-03-14  7:19     ` Greg Thelen
  0 siblings, 1 reply; 32+ messages in thread
From: Zheng Liu @ 2012-03-08 16:13 UTC (permalink / raw)
  To: Greg Thelen; +Cc: linux-mm, linux-kernel, Konstantin Khlebnikov

Hi Greg,

Sorry, I forgot to say that I am not subscribed to the linux-mm and
linux-kernel mailing lists, so please Cc me.

I am glad to receive your reply and I am very interested in your
approach.  Actually, I am not very familiar with cgroups.  Would you
please send me your patch if you can?  Thank you all the same.

Regards,
Zheng

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Control page reclaim granularity
  2012-03-08  9:35 ` Minchan Kim
@ 2012-03-08 16:54   ` Zheng Liu
  2012-03-12  0:28     ` Minchan Kim
  2012-03-12 14:55   ` Rik van Riel
  1 sibling, 1 reply; 32+ messages in thread
From: Zheng Liu @ 2012-03-08 16:54 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-mm, linux-kernel, Konstantin Khlebnikov, riel, kosaki.motohiro

Hi Minchan,

Sorry, I forgot to say that I am not subscribed to the linux-mm and
linux-kernel mailing lists, so please Cc me.

IMHO, maybe we should rethink how users use mmap(2).  Let me describe the
cases I know of in our product system.  They fall into two categories.
In the first, the application mmaps all data files into memory and
sometimes uses write(2) to append data.  In the second, it uses
mmap(2)/munmap(2) and read(2)/write(2) to manipulate the files.  In the
second case, the application wants to keep the mmapped pages in memory and
let the file pages be reclaimed first.  So, IMO, when an application uses
mmap(2) to manipulate files, this may imply that it wants these mmapped
pages to stay in memory and not be reclaimed, or at least not be reclaimed
earlier than file pages.  I think that maybe we can restore that routine
and provide a sysctl parameter that lets the user set the ratio between
mmapped pages and file pages.

Regards,
Zheng

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Control page reclaim granularity
  2012-03-08 16:54   ` Zheng Liu
@ 2012-03-12  0:28     ` Minchan Kim
  2012-03-12  2:06       ` Fwd: " Zheng Liu
  0 siblings, 1 reply; 32+ messages in thread
From: Minchan Kim @ 2012-03-12  0:28 UTC (permalink / raw)
  To: Minchan Kim, linux-mm, linux-kernel, Konstantin Khlebnikov, riel,
	kosaki.motohiro

On Fri, Mar 09, 2012 at 12:54:03AM +0800, Zheng Liu wrote:
> Hi Minchan,
> 
> Sorry, I forgot to say that I don't subscribe linux-mm and linux-kernel
> mailing list.  So please Cc me.
> 
> IMHO, maybe we should re-think about how does user use mmap(2).  I
> describe the cases I known in our product system.  They can be
> categorized into two cases.  One is mmaped all data files into memory
> and sometime it uses write(2) to append some data, and another uses
> mmap(2)/munmap(2) and read(2)/write(2) to manipulate the files.  In the
> second case,  the application wants to keep mmaped page into memory and
> let file pages to be reclaimed firstly.  So, IMO, when application uses
> mmap(2) to manipulate files, it is possible to imply that it wants keep
> these mmaped pages into memory and do not be reclaimed.  At least these
> pages do not be reclaimed early than file pages.  I think that maybe we
> can recover that routine and provide a sysctl parameter to let the user
> to set this ratio between mmaped pages and file pages.

I am not convinced that we should handle mapped pages specially.
Sometimes, someone may use mmap just to reduce buffer copies compared to the read system call.
So I think we can't be sure that mmapped pages should always win.

My suggestion is that it would be better for the user to declare this explicitly.
I think we can implement it with the WILLNEED option of madvise and fadvise.
The current implementation just does readahead if a page isn't in memory, but I think
we can promote a page from the inactive to the active list if it is already in
memory.

This is clearer, and it wouldn't be affected by changes to the kernel page reclaim
algorithm like this one.
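
As a reference point, this is roughly how the existing WILLNEED hints are
issued from user space today.  It is only a sketch: the helper names
hint_path_willneed and hint_mapping_willneed are invented for this example,
and the promotion to the active list suggested above is a proposal, so
nothing here assumes the kernel does it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Ask the kernel to start reading the whole file into the page cache.
 * Returns 0 on success or an error number.  Note that posix_fadvise()
 * returns the error directly instead of setting errno.
 */
int hint_path_willneed(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return errno;

    /* A length of 0 means "from offset to the end of the file". */
    int rc = posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
    close(fd);
    return rc;
}

/*
 * Same hint for an existing mapping.  addr must be page-aligned.
 * Returns 0 on success or an error number taken from errno.
 */
int hint_mapping_willneed(void *addr, size_t len)
{
    return madvise(addr, len, MADV_WILLNEED) == 0 ? 0 : errno;
}
```

Both calls are advisory: a return value of 0 only means the hint was
accepted, not that the pages are resident or will stay resident.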

> 
> Regards,
> Zheng

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Fwd: Control page reclaim granularity
  2012-03-12  0:28     ` Minchan Kim
@ 2012-03-12  2:06       ` Zheng Liu
  2012-03-12  5:19         ` Minchan Kim
  0 siblings, 1 reply; 32+ messages in thread
From: Zheng Liu @ 2012-03-12  2:06 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-mm, linux-kernel, Konstantin Khlebnikov, riel, kosaki.motohiro

On Mon, Mar 12, 2012 at 09:29:34AM +0900, Minchan Kim wrote:
> I forgot to Ccing you.
> Sorry.
> 
> ---------- Forwarded message ----------
> From: Minchan Kim <minchan@kernel.org>
> Date: Mon, Mar 12, 2012 at 9:28 AM
> Subject: Re: Control page reclaim granularity
> To: Minchan Kim <minchan@kernel.org>, linux-mm <linux-mm@kvack.org>,
> linux-kernel <linux-kernel@vger.kernel.org>, Konstantin Khlebnikov <
> khlebnikov@openvz.org>, riel@redhat.com, kosaki.motohiro@jp.fujitsu.com
> 
> 
> On Fri, Mar 09, 2012 at 12:54:03AM +0800, Zheng Liu wrote:
> > Hi Minchan,
> >
> > Sorry, I forgot to say that I don't subscribe linux-mm and linux-kernel
> > mailing list.  So please Cc me.
> >
> > IMHO, maybe we should re-think about how does user use mmap(2).  I
> > describe the cases I known in our product system.  They can be
> > categorized into two cases.  One is mmaped all data files into memory
> > and sometime it uses write(2) to append some data, and another uses
> > mmap(2)/munmap(2) and read(2)/write(2) to manipulate the files.  In the
> > second case,  the application wants to keep mmaped page into memory and
> > let file pages to be reclaimed firstly.  So, IMO, when application uses
> > mmap(2) to manipulate files, it is possible to imply that it wants keep
> > these mmaped pages into memory and do not be reclaimed.  At least these
> > pages do not be reclaimed early than file pages.  I think that maybe we
> > can recover that routine and provide a sysctl parameter to let the user
> > to set this ratio between mmaped pages and file pages.
> 
> I am not convinced why we should handle mapped page specially.
> Sometimem, someone may use mmap by reducing buffer copy compared to read
> system call.
> So I think we can't make sure mmaped pages are always win.
> 
> My suggestion is that it would be better to declare by user explicitly.
> I think we can implement it by madvise and fadvise's WILLNEED option.
> Current implementation is just readahead if there isn't a page in memory
> but I think
> we can promote from inactive to active if there is already a page in
> memory.
> 
> It's more clear and it couldn't be affected by kernel page reclaim
> algorithm change
> like this.

Thank you for your advice.  But I still have a question about this
solution.  If we extend the WILLNEED option of madvise(2) and fadvise(2),
it will cause an inconsistent state for the pages that are manipulated by
madvise(2) and/or fadvise(2).  For example, when I call madvise with the
WILLNEED flag, pages that are already in memory will be moved to the active
list, while the other pages will be read into memory and placed on the
inactive list.  The pages on the inactive list can then be reclaimed.  So
from the user's point of view it is inconsistent, because some pages stay
in memory and some are reclaimed, while the user actually hopes that all
of the pages can be kept in memory.  IMHO, this inconsistency is weird and
puzzles users.
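
The partial residency described above can be observed from user space with
mincore(2).  The following is a minimal sketch under that assumption; the
helper name count_resident_pages is invented for this example, and the
result is only a snapshot that may be stale as soon as the call returns.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Count how many pages of the mapping [addr, addr + len) are currently
 * resident in memory.  addr must be page-aligned.
 * Returns the number of resident pages, or -1 on error.
 */
long count_resident_pages(void *addr, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || len == 0)
        return -1;

    size_t npages = (len + (size_t)page - 1) / (size_t)page;
    unsigned char *vec = malloc(npages);
    if (vec == NULL)
        return -1;

    if (mincore(addr, len, vec) != 0) {
        free(vec);
        return -1;
    }

    long resident = 0;
    for (size_t i = 0; i < npages; i++)
        resident += vec[i] & 1;   /* bit 0 set means the page is resident */

    free(vec);
    return resident;
}
```

Calling this before and after memory pressure on a mapped index file would
show whether only part of the range survived reclaim.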

Regards,
Zheng

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Fwd: Control page reclaim granularity
  2012-03-12  2:06       ` Fwd: " Zheng Liu
@ 2012-03-12  5:19         ` Minchan Kim
  2012-03-12  6:20           ` Konstantin Khlebnikov
  0 siblings, 1 reply; 32+ messages in thread
From: Minchan Kim @ 2012-03-12  5:19 UTC (permalink / raw)
  To: Minchan Kim, linux-mm, linux-kernel, Konstantin Khlebnikov, riel,
	kosaki.motohiro

On Mon, Mar 12, 2012 at 10:06:09AM +0800, Zheng Liu wrote:
> On Mon, Mar 12, 2012 at 09:29:34AM +0900, Minchan Kim wrote:
> > I forgot to Ccing you.
> > Sorry.
> > 
> > ---------- Forwarded message ----------
> > From: Minchan Kim <minchan@kernel.org>
> > Date: Mon, Mar 12, 2012 at 9:28 AM
> > Subject: Re: Control page reclaim granularity
> > To: Minchan Kim <minchan@kernel.org>, linux-mm <linux-mm@kvack.org>,
> > linux-kernel <linux-kernel@vger.kernel.org>, Konstantin Khlebnikov <
> > khlebnikov@openvz.org>, riel@redhat.com, kosaki.motohiro@jp.fujitsu.com
> > 
> > 
> > On Fri, Mar 09, 2012 at 12:54:03AM +0800, Zheng Liu wrote:
> > > Hi Minchan,
> > >
> > > Sorry, I forgot to say that I don't subscribe linux-mm and linux-kernel
> > > mailing list.  So please Cc me.
> > >
> > > IMHO, maybe we should re-think about how does user use mmap(2).  I
> > > describe the cases I known in our product system.  They can be
> > > categorized into two cases.  One is mmaped all data files into memory
> > > and sometime it uses write(2) to append some data, and another uses
> > > mmap(2)/munmap(2) and read(2)/write(2) to manipulate the files.  In the
> > > second case,  the application wants to keep mmaped page into memory and
> > > let file pages to be reclaimed firstly.  So, IMO, when application uses
> > > mmap(2) to manipulate files, it is possible to imply that it wants keep
> > > these mmaped pages into memory and do not be reclaimed.  At least these
> > > pages do not be reclaimed early than file pages.  I think that maybe we
> > > can recover that routine and provide a sysctl parameter to let the user
> > > to set this ratio between mmaped pages and file pages.
> > 
> > I am not convinced why we should handle mapped page specially.
> > Sometimem, someone may use mmap by reducing buffer copy compared to read
> > system call.
> > So I think we can't make sure mmaped pages are always win.
> > 
> > My suggestion is that it would be better to declare by user explicitly.
> > I think we can implement it by madvise and fadvise's WILLNEED option.
> > Current implementation is just readahead if there isn't a page in memory
> > but I think
> > we can promote from inactive to active if there is already a page in
> > memory.
> > 
> > It's more clear and it couldn't be affected by kernel page reclaim
> > algorithm change
> > like this.
> 
> Thank you for your advice.  But I still have question about this
> solution.  If we improve the madvise(2) and fadvise(2)'s WILLNEED
> option,  it will cause an inconsistently status for pages that be
> manipulated by madvise(2) and/or fadvise(2).  For example, when I call
> madvise with WILLNEED flag, some pages will be moved into active list if
> they already have been in memory, and other pages will be read into
> memory and be saved in inactive list if they don't be in memory.  Then
> pages that are in inactive list are possible to be reclaim.  So from the
> view of users, it is inconsistent because some pages are in memory and
> some pages are reclaimed.  But actually the user hopes that all of pages
> can be kept in memory.  IMHO, this inconsistency is weird and makes users
> puzzled.

Now the problem is that

1. The user wants to keep pages that are used only once in a while in memory.
2. The kernel wants to reclaim them because, from the LRU point of view, they
   are clearly reclaim targets.

The most desirable approach is for the user to use mlock to guarantee that
they stay in memory. But mlock has too much overhead, and the user doesn't want
to bring all pages into memory at once (i.e., he wants demand paging when he needs a page).
Right?
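
For reference, the mlock approach mentioned above looks roughly like this
from user space.  This is only a sketch: pin_range and unpin_range are
invented names, and the calls can fail when RLIMIT_MEMLOCK is too small.
mlock(2) makes the whole range resident before it returns, which is the
"all pages at once" cost described above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Try to pin [addr, addr + len) in RAM so that reclaim cannot evict it.
 * Returns 0 on success, or the errno value from mlock(2) on failure
 * (commonly ENOMEM or EPERM when the locked-memory limit is exceeded).
 */
int pin_range(const void *addr, size_t len)
{
    assert(addr != NULL);
    return mlock(addr, len) == 0 ? 0 : errno;
}

/* Undo pin_range().  Returns 0 on success or an errno value. */
int unpin_range(const void *addr, size_t len)
{
    assert(addr != NULL);
    return munlock(addr, len) == 0 ? 0 : errno;
}
```

An application would call pin_range() on the mapped index region once and
unpin_range() before unmapping it.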

madvise is just a hint to the kernel, and the kernel doesn't need to guarantee madvise's behavior.
From that point of view, such inconsistency might not be a big problem.

The big problem, I think, is that the user would have to call madvise(WILLNEED) periodically, because
the activation happens only once, when the user calls madvise. If the user doesn't use the page frequently
afterwards, it ends up moving to the inactive list and could even be reclaimed.
It's not good. :-(

Okay. How about adding a new VM_WORKINGSET flag?
The reclaimer would then give the page one more round trip through the active/inactive list when reclaim
happens if the page is referenced.

Sigh. We have no room for a new VM_FLAG in 32 bits.

> 
> Regards,
> Zheng

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Fwd: Control page reclaim granularity
  2012-03-12  5:19         ` Minchan Kim
@ 2012-03-12  6:20           ` Konstantin Khlebnikov
  2012-03-12  8:14             ` Zheng Liu
  0 siblings, 1 reply; 32+ messages in thread
From: Konstantin Khlebnikov @ 2012-03-12  6:20 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-mm, linux-kernel, riel, kosaki.motohiro

Minchan Kim wrote:
> On Mon, Mar 12, 2012 at 10:06:09AM +0800, Zheng Liu wrote:
>> On Mon, Mar 12, 2012 at 09:29:34AM +0900, Minchan Kim wrote:
>>> I forgot to Ccing you.
>>> Sorry.
>>>
>>> ---------- Forwarded message ----------
>>> From: Minchan Kim<minchan@kernel.org>
>>> Date: Mon, Mar 12, 2012 at 9:28 AM
>>> Subject: Re: Control page reclaim granularity
>>> To: Minchan Kim<minchan@kernel.org>, linux-mm<linux-mm@kvack.org>,
>>> linux-kernel<linux-kernel@vger.kernel.org>, Konstantin Khlebnikov<
>>> khlebnikov@openvz.org>, riel@redhat.com, kosaki.motohiro@jp.fujitsu.com
>>>
>>>
>>> On Fri, Mar 09, 2012 at 12:54:03AM +0800, Zheng Liu wrote:
>>>> Hi Minchan,
>>>>
>>>> Sorry, I forgot to say that I don't subscribe linux-mm and linux-kernel
>>>> mailing list.  So please Cc me.
>>>>
>>>> IMHO, maybe we should re-think about how does user use mmap(2).  I
>>>> describe the cases I known in our product system.  They can be
>>>> categorized into two cases.  One is mmaped all data files into memory
>>>> and sometime it uses write(2) to append some data, and another uses
>>>> mmap(2)/munmap(2) and read(2)/write(2) to manipulate the files.  In the
>>>> second case,  the application wants to keep mmaped page into memory and
>>>> let file pages to be reclaimed firstly.  So, IMO, when application uses
>>>> mmap(2) to manipulate files, it is possible to imply that it wants keep
>>>> these mmaped pages into memory and do not be reclaimed.  At least these
>>>> pages do not be reclaimed early than file pages.  I think that maybe we
>>>> can recover that routine and provide a sysctl parameter to let the user
>>>> to set this ratio between mmaped pages and file pages.
>>>
>>> I am not convinced why we should handle mapped page specially.
>>> Sometimem, someone may use mmap by reducing buffer copy compared to read
>>> system call.
>>> So I think we can't make sure mmaped pages are always win.
>>>
>>> My suggestion is that it would be better to declare by user explicitly.
>>> I think we can implement it by madvise and fadvise's WILLNEED option.
>>> Current implementation is just readahead if there isn't a page in memory
>>> but I think
>>> we can promote from inactive to active if there is already a page in
>>> memory.
>>>
>>> It's more clear and it couldn't be affected by kernel page reclaim
>>> algorithm change
>>> like this.
>>
>> Thank you for your advice.  But I still have question about this
>> solution.  If we improve the madvise(2) and fadvise(2)'s WILLNEED
>> option,  it will cause an inconsistently status for pages that be
>> manipulated by madvise(2) and/or fadvise(2).  For example, when I call
>> madvise with WILLNEED flag, some pages will be moved into active list if
>> they already have been in memory, and other pages will be read into
>> memory and be saved in inactive list if they don't be in memory.  Then
>> pages that are in inactive list are possible to be reclaim.  So from the
>> view of users, it is inconsistent because some pages are in memory and
>> some pages are reclaimed.  But actually the user hopes that all of pages
>> can be kept in memory.  IMHO, this inconsistency is weird and makes users
>> puzzled.
>
> Now problem is that
>
> 1. User want to keep pages which are used once in a while in memory.
> 2. Kernel want to reclaim them because they are surely reclaim target
>     pages in point of view by LRU.
>
> The most desriable approach is that user should use mlock to guarantee
> them in memory. But mlock is too big overhead and user doesn't want to keep
> memory all pages all at once.(Ie, he want demand paging when he need the page)
> Right?
>
> madvise, it's a just hint for kernel and kernel doesn't need to make sure madvise's behavior.
> In point of view, such inconsistency might not be a big problem.
>
> Big problem I think now is that user should use madvise(WILLNEED) periodically because such
> activation happens once when user calls madvise. If user doesn't use page frequently after
> user calls it, it ends up moving into inactive list and even could be reclaimed.
> It's not good. :-(
>
> Okay. How about adding new VM_WORKINGSET?
> And reclaimer would give one more round trip in active/inactive list when reclaim happens
> if the page is referenced.
>
> Sigh. We have no room for new VM_FLAG in 32 bit.

It would be nice to mark struct address_space with this flag and export AS_UNEVICTABLE somehow.
Maybe we can reuse the file-locking engine for managing these bits =)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Fwd: Control page reclaim granularity
  2012-03-12  6:20           ` Konstantin Khlebnikov
@ 2012-03-12  8:14             ` Zheng Liu
  2012-03-12 13:42               ` Minchan Kim
  0 siblings, 1 reply; 32+ messages in thread
From: Zheng Liu @ 2012-03-12  8:14 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-mm, linux-kernel, Minchan Kim, riel, kosaki.motohiro

On 03/12/2012 02:20 PM, Konstantin Khlebnikov wrote:
> Minchan Kim wrote:
>> On Mon, Mar 12, 2012 at 10:06:09AM +0800, Zheng Liu wrote:
>>> On Mon, Mar 12, 2012 at 09:29:34AM +0900, Minchan Kim wrote:
>>>> I forgot to Ccing you.
>>>> Sorry.
>>>>
>>>> ---------- Forwarded message ----------
>>>> From: Minchan Kim<minchan@kernel.org>
>>>> Date: Mon, Mar 12, 2012 at 9:28 AM
>>>> Subject: Re: Control page reclaim granularity
>>>> To: Minchan Kim<minchan@kernel.org>, linux-mm<linux-mm@kvack.org>,
>>>> linux-kernel<linux-kernel@vger.kernel.org>, Konstantin Khlebnikov<
>>>> khlebnikov@openvz.org>, riel@redhat.com, kosaki.motohiro@jp.fujitsu.com
>>>>
>>>>
>>>> On Fri, Mar 09, 2012 at 12:54:03AM +0800, Zheng Liu wrote:
>>>>> Hi Minchan,
>>>>>
>>>>> Sorry, I forgot to say that I don't subscribe linux-mm and
>>>>> linux-kernel
>>>>> mailing list.  So please Cc me.
>>>>>
>>>>> IMHO, maybe we should re-think about how does user use mmap(2).  I
>>>>> describe the cases I known in our product system.  They can be
>>>>> categorized into two cases.  One is mmaped all data files into memory
>>>>> and sometime it uses write(2) to append some data, and another uses
>>>>> mmap(2)/munmap(2) and read(2)/write(2) to manipulate the files.  In
>>>>> the
>>>>> second case,  the application wants to keep mmaped page into memory
>>>>> and
>>>>> let file pages to be reclaimed firstly.  So, IMO, when application
>>>>> uses
>>>>> mmap(2) to manipulate files, it is possible to imply that it wants
>>>>> keep
>>>>> these mmaped pages into memory and do not be reclaimed.  At least
>>>>> these
>>>>> pages do not be reclaimed early than file pages.  I think that
>>>>> maybe we
>>>>> can recover that routine and provide a sysctl parameter to let the
>>>>> user
>>>>> to set this ratio between mmaped pages and file pages.
>>>>
>>>> I am not convinced why we should handle mapped page specially.
>>>> Sometimem, someone may use mmap by reducing buffer copy compared to
>>>> read
>>>> system call.
>>>> So I think we can't make sure mmaped pages are always win.
>>>>
>>>> My suggestion is that it would be better to declare by user explicitly.
>>>> I think we can implement it by madvise and fadvise's WILLNEED option.
>>>> Current implementation is just readahead if there isn't a page in
>>>> memory
>>>> but I think
>>>> we can promote from inactive to active if there is already a page in
>>>> memory.
>>>>
>>>> It's more clear and it couldn't be affected by kernel page reclaim
>>>> algorithm change
>>>> like this.
>>>
>>> Thank you for your advice.  But I still have question about this
>>> solution.  If we improve the madvise(2) and fadvise(2)'s WILLNEED
>>> option,  it will cause an inconsistently status for pages that be
>>> manipulated by madvise(2) and/or fadvise(2).  For example, when I call
>>> madvise with WILLNEED flag, some pages will be moved into active list if
>>> they already have been in memory, and other pages will be read into
>>> memory and be saved in inactive list if they don't be in memory.  Then
>>> pages that are in inactive list are possible to be reclaim.  So from the
>>> view of users, it is inconsistent because some pages are in memory and
>>> some pages are reclaimed.  But actually the user hopes that all of pages
>>> can be kept in memory.  IMHO, this inconsistency is weird and makes
>>> users
>>> puzzled.
>>
>> Now problem is that
>>
>> 1. User want to keep pages which are used once in a while in memory.
>> 2. Kernel want to reclaim them because they are surely reclaim target
>>     pages in point of view by LRU.
>>
>> The most desriable approach is that user should use mlock to guarantee
>> them in memory. But mlock is too big overhead and user doesn't want to
>> keep
>> memory all pages all at once.(Ie, he want demand paging when he need
>> the page)
>> Right?
>>
>> madvise, it's a just hint for kernel and kernel doesn't need to make
>> sure madvise's behavior.
>> In point of view, such inconsistency might not be a big problem.
>>
>> Big problem I think now is that user should use madvise(WILLNEED)
>> periodically because such
>> activation happens once when user calls madvise. If user doesn't use
>> page frequently after
>> user calls it, it ends up moving into inactive list and even could be
>> reclaimed.
>> It's not good. :-(
>>
>> Okay. How about adding new VM_WORKINGSET?
>> And reclaimer would give one more round trip in active/inactive list
>> when reclaim happens
>> if the page is referenced.
>>
>> Sigh. We have no room for new VM_FLAG in 32 bit.
> 
> It would be nice to mark struct address_space with this flag and export
> AS_UNEVICTABLE somehow.
> Maybe we can reuse file-locking engine for managing these bits =)

Makes sense to me.  We can set this flag in struct address_space and check
it in page_referenced_file().  If the flag is set, it is cleared and the
function returns referenced > 1, so the page can be promoted to the
active list.  But I prefer to set/clear this flag in madvise.
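
To make the idea concrete, here is a toy user-space model of the
test-and-clear logic described above.  It is not kernel code: every name in
it (toy_mapping, TOY_AS_WORKINGSET, toy_page_referenced,
toy_should_activate) is hypothetical, and the real struct address_space and
page_referenced_file() look different.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical one-shot "keep this file's pages" flag bit. */
#define TOY_AS_WORKINGSET 0x1u

/* Toy stand-in for the per-file mapping object that carries the flag. */
struct toy_mapping {
    unsigned int flags;
};

/*
 * Toy stand-in for the reference check done during reclaim.
 * pte_refs is the number of references found for the page.  If the
 * mapping carries the flag, the flag is consumed and the result is
 * forced above 1.
 */
int toy_page_referenced(struct toy_mapping *m, int pte_refs)
{
    assert(m != NULL);

    if (m->flags & TOY_AS_WORKINGSET) {
        m->flags &= ~TOY_AS_WORKINGSET;   /* cleared once it has been used */
        return pte_refs + 2;              /* guarantees a value > 1 */
    }
    return pte_refs;
}

/* In this toy model a page is promoted when referenced > 1. */
bool toy_should_activate(int referenced)
{
    return referenced > 1;
}
```

Because the flag is cleared on use, a page gets only one extra promotion per
time the flag is set, which is why something such as madvise would have to
set it again.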

PS, I have subscribed to the linux-mm mailing list. :-)

Regards,
Zheng

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Fwd: Control page reclaim granularity
  2012-03-12  8:14             ` Zheng Liu
@ 2012-03-12 13:42               ` Minchan Kim
  2012-03-12 14:18                 ` Konstantin Khlebnikov
  2012-03-12 15:15                 ` Zheng Liu
  0 siblings, 2 replies; 32+ messages in thread
From: Minchan Kim @ 2012-03-12 13:42 UTC (permalink / raw)
  To: Konstantin Khlebnikov, linux-mm, linux-kernel, Minchan Kim, riel,
	kosaki.motohiro

On Mon, Mar 12, 2012 at 04:14:14PM +0800, Zheng Liu wrote:
> On 03/12/2012 02:20 PM, Konstantin Khlebnikov wrote:
> > Minchan Kim wrote:
> >> On Mon, Mar 12, 2012 at 10:06:09AM +0800, Zheng Liu wrote:
> >>> On Mon, Mar 12, 2012 at 09:29:34AM +0900, Minchan Kim wrote:
> >>>> I forgot to Ccing you.
> >>>> Sorry.
> >>>>
> >>>> ---------- Forwarded message ----------
> >>>> From: Minchan Kim<minchan@kernel.org>
> >>>> Date: Mon, Mar 12, 2012 at 9:28 AM
> >>>> Subject: Re: Control page reclaim granularity
> >>>> To: Minchan Kim<minchan@kernel.org>, linux-mm<linux-mm@kvack.org>,
> >>>> linux-kernel<linux-kernel@vger.kernel.org>, Konstantin Khlebnikov<
> >>>> khlebnikov@openvz.org>, riel@redhat.com, kosaki.motohiro@jp.fujitsu.com
> >>>>
> >>>>
> >>>> On Fri, Mar 09, 2012 at 12:54:03AM +0800, Zheng Liu wrote:
> >>>>> Hi Minchan,
> >>>>>
> >>>>> Sorry, I forgot to say that I don't subscribe linux-mm and
> >>>>> linux-kernel
> >>>>> mailing list.  So please Cc me.
> >>>>>
> >>>>> IMHO, maybe we should re-think about how does user use mmap(2).  I
> >>>>> describe the cases I known in our product system.  They can be
> >>>>> categorized into two cases.  One is mmaped all data files into memory
> >>>>> and sometime it uses write(2) to append some data, and another uses
> >>>>> mmap(2)/munmap(2) and read(2)/write(2) to manipulate the files.  In
> >>>>> the
> >>>>> second case,  the application wants to keep mmaped page into memory
> >>>>> and
> >>>>> let file pages to be reclaimed firstly.  So, IMO, when application
> >>>>> uses
> >>>>> mmap(2) to manipulate files, it is possible to imply that it wants
> >>>>> keep
> >>>>> these mmaped pages into memory and do not be reclaimed.  At least
> >>>>> these
> >>>>> pages do not be reclaimed early than file pages.  I think that
> >>>>> maybe we
> >>>>> can recover that routine and provide a sysctl parameter to let the
> >>>>> user
> >>>>> to set this ratio between mmaped pages and file pages.
> >>>>
> >>>> I am not convinced why we should handle mapped page specially.
> >>>> Sometimem, someone may use mmap by reducing buffer copy compared to
> >>>> read
> >>>> system call.
> >>>> So I think we can't make sure mmaped pages are always win.
> >>>>
> >>>> My suggestion is that it would be better to declare by user explicitly.
> >>>> I think we can implement it by madvise and fadvise's WILLNEED option.
> >>>> Current implementation is just readahead if there isn't a page in
> >>>> memory
> >>>> but I think
> >>>> we can promote from inactive to active if there is already a page in
> >>>> memory.
> >>>>
> >>>> It's more clear and it couldn't be affected by kernel page reclaim
> >>>> algorithm change
> >>>> like this.
> >>>
> >>> Thank you for your advice.  But I still have question about this
> >>> solution.  If we improve the madvise(2) and fadvise(2)'s WILLNEED
> >>> option,  it will cause an inconsistently status for pages that be
> >>> manipulated by madvise(2) and/or fadvise(2).  For example, when I call
> >>> madvise with WILLNEED flag, some pages will be moved into active list if
> >>> they already have been in memory, and other pages will be read into
> >>> memory and be saved in inactive list if they don't be in memory.  Then
> >>> pages that are in inactive list are possible to be reclaim.  So from the
> >>> view of users, it is inconsistent because some pages are in memory and
> >>> some pages are reclaimed.  But actually the user hopes that all of pages
> >>> can be kept in memory.  IMHO, this inconsistency is weird and makes
> >>> users
> >>> puzzled.
> >>
> >> Now problem is that
> >>
> >> 1. User want to keep pages which are used once in a while in memory.
> >> 2. Kernel want to reclaim them because they are surely reclaim target
> >>     pages in point of view by LRU.
> >>
> >> The most desriable approach is that user should use mlock to guarantee
> >> them in memory. But mlock is too big overhead and user doesn't want to
> >> keep
> >> memory all pages all at once.(Ie, he want demand paging when he need
> >> the page)
> >> Right?
> >>
> >> madvise, it's a just hint for kernel and kernel doesn't need to make
> >> sure madvise's behavior.
> >> In point of view, such inconsistency might not be a big problem.
> >>
> >> Big problem I think now is that user should use madvise(WILLNEED)
> >> periodically because such
> >> activation happens once when user calls madvise. If user doesn't use
> >> page frequently after
> >> user calls it, it ends up moving into inactive list and even could be
> >> reclaimed.
> >> It's not good. :-(
> >>
> >> Okay. How about adding new VM_WORKINGSET?
> >> And reclaimer would give one more round trip in active/inactive list
> >> erwhen reclaim happens
> >> if the page is referenced.
> >>
> >> Sigh. We have no room for new VM_FLAG in 32 bit.
> > p
> > It would be nice to mark struct address_space with this flag and export
> > AS_UNEVICTABLE somehow.
> > Maybe we can reuse file-locking engine for managing these bits =)
> 
> Make sense to me.  We can mark this flag in struct address_space and check
> it in page_refereneced_file().  If this flag is set, it will be cleard and

The disadvantage is that we could then set reclaim granularity only per-inode.
I want to set it per-vma, not per-inode.

> the function returns referenced > 1.  Then this page can be promoted into
> activate list.  But I prefer to set/clear this flag in madvise.

Hmm, my idea is as follows.
If we can set a new VM flag on the VMA (or something similar), the reclaimer can
check it in shrink_[in]active_list and avoid deactivating/reclaiming a page if the
page is in a VMA marked with the new flag and has been referenced at least once recently.
That gives the page one more round trip in its current list (i.e., the active or
inactive list) rather than activating it, so the page becomes less reclaimable.
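
The "one more round trip" idea above can be sketched as a tiny userspace
model.  This is only an illustration under stated assumptions: the names
(model_list, model_next_list) are made up for the example, no such VM flag
exists, and the default path is heavily simplified compared to the real
shrink_active_list/shrink_inactive_list code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of where a scanned page ends up. */
enum model_list {
	MODEL_INACTIVE,
	MODEL_ACTIVE,
	MODEL_RECLAIMED,
};

/*
 * cur:         list the page is on when the reclaimer scans it
 * vma_flagged: page belongs to a VMA carrying the proposed (assumed) flag
 * referenced:  page was referenced at least once since the last scan
 */
static enum model_list model_next_list(enum model_list cur,
				       bool vma_flagged, bool referenced)
{
	assert(cur != MODEL_RECLAIMED);

	/* Flagged and referenced: stay put for one more round trip. */
	if (vma_flagged && referenced)
		return cur;

	/* Simplified default: active pages age down, inactive ones go. */
	return cur == MODEL_ACTIVE ? MODEL_INACTIVE : MODEL_RECLAIMED;
}
```

Note that the page is never moved up to the active list here; it only
avoids moving down, which is the difference from activation.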

> 
> PS, I have subscribed linux-mm mailing list. :-)

Congratulations! :)

> 
> Regards,
> Zheng

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Fwd: Control page reclaim granularity
  2012-03-12 13:42               ` Minchan Kim
@ 2012-03-12 14:18                 ` Konstantin Khlebnikov
  2012-03-13  2:48                   ` Minchan Kim
  2012-03-12 15:15                 ` Zheng Liu
  1 sibling, 1 reply; 32+ messages in thread
From: Konstantin Khlebnikov @ 2012-03-12 14:18 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-mm, linux-kernel, riel, kosaki.motohiro

[-- Attachment #1: Type: text/plain, Size: 6473 bytes --]

Minchan Kim wrote:
> On Mon, Mar 12, 2012 at 04:14:14PM +0800, Zheng Liu wrote:
>> On 03/12/2012 02:20 PM, Konstantin Khlebnikov wrote:
>>> Minchan Kim wrote:
>>>> On Mon, Mar 12, 2012 at 10:06:09AM +0800, Zheng Liu wrote:
>>>>> On Mon, Mar 12, 2012 at 09:29:34AM +0900, Minchan Kim wrote:
>>>>>> I forgot to Ccing you.
>>>>>> Sorry.
>>>>>>
>>>>>> ---------- Forwarded message ----------
>>>>>> From: Minchan Kim<minchan@kernel.org>
>>>>>> Date: Mon, Mar 12, 2012 at 9:28 AM
>>>>>> Subject: Re: Control page reclaim granularity
>>>>>> To: Minchan Kim<minchan@kernel.org>, linux-mm<linux-mm@kvack.org>,
>>>>>> linux-kernel<linux-kernel@vger.kernel.org>, Konstantin Khlebnikov<
>>>>>> khlebnikov@openvz.org>, riel@redhat.com, kosaki.motohiro@jp.fujitsu.com
>>>>>>
>>>>>>
>>>>>> On Fri, Mar 09, 2012 at 12:54:03AM +0800, Zheng Liu wrote:
>>>>>>> Hi Minchan,
>>>>>>>
>>>>>>> Sorry, I forgot to say that I don't subscribe linux-mm and
>>>>>>> linux-kernel
>>>>>>> mailing list.  So please Cc me.
>>>>>>>
>>>>>>> IMHO, maybe we should re-think about how does user use mmap(2).  I
>>>>>>> describe the cases I known in our product system.  They can be
>>>>>>> categorized into two cases.  One is mmaped all data files into memory
>>>>>>> and sometime it uses write(2) to append some data, and another uses
>>>>>>> mmap(2)/munmap(2) and read(2)/write(2) to manipulate the files.  In
>>>>>>> the
>>>>>>> second case,  the application wants to keep mmaped page into memory
>>>>>>> and
>>>>>>> let file pages to be reclaimed firstly.  So, IMO, when application
>>>>>>> uses
>>>>>>> mmap(2) to manipulate files, it is possible to imply that it wants
>>>>>>> keep
>>>>>>> these mmaped pages into memory and do not be reclaimed.  At least
>>>>>>> these
>>>>>>> pages do not be reclaimed early than file pages.  I think that
>>>>>>> maybe we
>>>>>>> can recover that routine and provide a sysctl parameter to let the
>>>>>>> user
>>>>>>> to set this ratio between mmaped pages and file pages.
>>>>>>
>>>>>> I am not convinced why we should handle mapped page specially.
>>>>>> Sometimem, someone may use mmap by reducing buffer copy compared to
>>>>>> read
>>>>>> system call.
>>>>>> So I think we can't make sure mmaped pages are always win.
>>>>>>
>>>>>> My suggestion is that it would be better to declare by user explicitly.
>>>>>> I think we can implement it by madvise and fadvise's WILLNEED option.
>>>>>> Current implementation is just readahead if there isn't a page in
>>>>>> memory
>>>>>> but I think
>>>>>> we can promote from inactive to active if there is already a page in
>>>>>> memory.
>>>>>>
>>>>>> It's more clear and it couldn't be affected by kernel page reclaim
>>>>>> algorithm change
>>>>>> like this.
>>>>>
>>>>> Thank you for your advice.  But I still have question about this
>>>>> solution.  If we improve the madvise(2) and fadvise(2)'s WILLNEED
>>>>> option,  it will cause an inconsistently status for pages that be
>>>>> manipulated by madvise(2) and/or fadvise(2).  For example, when I call
>>>>> madvise with WILLNEED flag, some pages will be moved into active list if
>>>>> they already have been in memory, and other pages will be read into
>>>>> memory and be saved in inactive list if they don't be in memory.  Then
>>>>> pages that are in inactive list are possible to be reclaim.  So from the
>>>>> view of users, it is inconsistent because some pages are in memory and
>>>>> some pages are reclaimed.  But actually the user hopes that all of pages
>>>>> can be kept in memory.  IMHO, this inconsistency is weird and makes
>>>>> users
>>>>> puzzled.
>>>>
>>>> Now problem is that
>>>>
>>>> 1. User want to keep pages which are used once in a while in memory.
>>>> 2. Kernel want to reclaim them because they are surely reclaim target
>>>>      pages in point of view by LRU.
>>>>
>>>> The most desriable approach is that user should use mlock to guarantee
>>>> them in memory. But mlock is too big overhead and user doesn't want to
>>>> keep
>>>> memory all pages all at once.(Ie, he want demand paging when he need
>>>> the page)
>>>> Right?
>>>>
>>>> madvise, it's a just hint for kernel and kernel doesn't need to make
>>>> sure madvise's behavior.
>>>> In point of view, such inconsistency might not be a big problem.
>>>>
>>>> Big problem I think now is that user should use madvise(WILLNEED)
>>>> periodically because such
>>>> activation happens once when user calls madvise. If user doesn't use
>>>> page frequently after
>>>> user calls it, it ends up moving into inactive list and even could be
>>>> reclaimed.
>>>> It's not good. :-(
>>>>
>>>> Okay. How about adding new VM_WORKINGSET?
>>>> And reclaimer would give one more round trip in active/inactive list
>>>> erwhen reclaim happens
>>>> if the page is referenced.
>>>>
>>>> Sigh. We have no room for new VM_FLAG in 32 bit.
>>> p
>>> It would be nice to mark struct address_space with this flag and export
>>> AS_UNEVICTABLE somehow.
>>> Maybe we can reuse file-locking engine for managing these bits =)
>>
>> Make sense to me.  We can mark this flag in struct address_space and check
>> it in page_refereneced_file().  If this flag is set, it will be cleard and
>
> Disadvantage is that we could set reclaim granularity as per-inode.
> I want to set it as per-vma, not per-inode.

But with a per-inode flag we can tune all files, not only memory-mapped ones.
See the attached patch. Currently I am thinking about the code for managing the
flag; the file-locking engine really fits perfectly =)

>
>> the function returns referenced>  1.  Then this page can be promoted into
>> activate list.  But I prefer to set/clear this flag in madvise.
>
> Hmm, My idea is following as,
> If we can set new VM flag into VMA or something, reclaimer can check it when shrink_[in]active_list
> and he can prevent to deactivate/reclaim if he takes a look the page is in VMA which
> are set by new VM flag and the page is referenced recently at least once.
> It means it gives one more round trip in his list(ie, active/inactive list)
> rather than activation so that the page would become less reclaimable.
>
>>
>> PS, I have subscribed linux-mm mailing list. :-)
>
> Congratulations! :)
>
>>
>> Regards,
>> Zheng
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
> Don't email:<a href=mailto:"dont@kvack.org">  email@kvack.org</a>


[-- Attachment #2: mm-introduce-mapping-as_workingset-flag --]
[-- Type: text/plain, Size: 3095 bytes --]

mm: introduce mapping AS_WORKINGSET flag

From: Konstantin Khlebnikov <khlebnikov@openvz.org>

This patch introduces a new flag, AS_WORKINGSET, in mapping->flags.
If it is set, the reclaimer activates all pages of this inode after first usage.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
---
 include/linux/pagemap.h |   16 ++++++++++++++++
 mm/vmscan.c             |   15 ++++++++++++---
 2 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index cfaaa69..c15fc17 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -24,6 +24,7 @@ enum mapping_flags {
 	AS_ENOSPC	= __GFP_BITS_SHIFT + 1,	/* ENOSPC on async write */
 	AS_MM_ALL_LOCKS	= __GFP_BITS_SHIFT + 2,	/* under mm_take_all_locks() */
 	AS_UNEVICTABLE	= __GFP_BITS_SHIFT + 3,	/* e.g., ramdisk, SHM_LOCK */
+	AS_WORKINGSET	= __GFP_BITS_SHIFT + 4,	/* promote pages activation */
 };
 
 static inline void mapping_set_error(struct address_space *mapping, int error)
@@ -53,6 +54,21 @@ static inline int mapping_unevictable(struct address_space *mapping)
 	return !!mapping;
 }
 
+static inline void mapping_set_workingset(struct address_space *mapping)
+{
+	set_bit(AS_WORKINGSET, &mapping->flags);
+}
+
+static inline void mapping_clear_workingset(struct address_space *mapping)
+{
+	clear_bit(AS_WORKINGSET, &mapping->flags);
+}
+
+static inline int mapping_test_workingset(struct address_space *mapping)
+{
+	return mapping && test_bit(AS_WORKINGSET, &mapping->flags);
+}
+
 static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
 {
 	return (__force gfp_t)mapping->flags & __GFP_BITS_MASK;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 57b9658..5ccbe8c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -701,6 +701,7 @@ enum page_references {
 };
 
 static enum page_references page_check_references(struct page *page,
+						  struct address_space *mapping,
 						  struct mem_cgroup_zone *mz,
 						  struct scan_control *sc)
 {
@@ -721,6 +722,13 @@ static enum page_references page_check_references(struct page *page,
 	if (vm_flags & VM_LOCKED)
 		return PAGEREF_RECLAIM;
 
+	/*
+	 * Activate workingset page if referenced at least once.
+	 */
+	if (mapping_test_workingset(mapping) &&
+	    (referenced_ptes || referenced_page))
+		return PAGEREF_ACTIVATE;
+
 	if (referenced_ptes) {
 		if (PageAnon(page))
 			return PAGEREF_ACTIVATE;
@@ -828,7 +836,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			}
 		}
 
-		references = page_check_references(page, mz, sc);
+		mapping = page_mapping(page);
+
+		references = page_check_references(page, mapping, mz, sc);
 		switch (references) {
 		case PAGEREF_ACTIVATE:
 			goto activate_locked;
@@ -848,11 +858,10 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 				goto keep_locked;
 			if (!add_to_swap(page))
 				goto activate_locked;
+			mapping = &swapper_space;
 			may_enter_fs = 1;
 		}
 
-		mapping = page_mapping(page);
-
 		/*
 		 * The page is mapped into the page tables of one or more
 		 * processes. Try to unmap it here.
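
The decision order that the patch gives page_check_references() can be
sketched as a standalone userspace model.  This is an illustration, not the
kernel function: the model_* names and the flattened input struct are
assumptions made for the example, and everything after the PageAnon() check
is a simplification, since the hunk above does not show the rest of the
function.

```c
#include <assert.h>
#include <stdbool.h>

/* Names mirror the kernel enum, but this is a standalone model. */
enum model_pageref {
	MODEL_PAGEREF_RECLAIM,
	MODEL_PAGEREF_KEEP,
	MODEL_PAGEREF_ACTIVATE,
};

/* Hypothetical flattened view of what the real function looks at. */
struct model_input {
	bool vm_locked;       /* vm_flags & VM_LOCKED */
	bool workingset;      /* mapping has AS_WORKINGSET set */
	bool anon;            /* PageAnon(page) */
	int referenced_ptes;  /* number of ptes that referenced the page */
	bool referenced_page; /* page's referenced bit was set */
};

static enum model_pageref model_check_references(const struct model_input *in)
{
	assert(in);

	if (in->vm_locked)
		return MODEL_PAGEREF_RECLAIM;

	/* Added by the patch: a single reference of any kind is enough. */
	if (in->workingset && (in->referenced_ptes || in->referenced_page))
		return MODEL_PAGEREF_ACTIVATE;

	if (in->referenced_ptes) {
		if (in->anon)
			return MODEL_PAGEREF_ACTIVATE;
		/* Simplified tail: mapped file pages need a second reference. */
		return in->referenced_page ? MODEL_PAGEREF_ACTIVATE
					   : MODEL_PAGEREF_KEEP;
	}

	return MODEL_PAGEREF_RECLAIM;
}
```

In this model a page that was only touched through read/write
(referenced_page set, no referencing ptes) is reclaimed without the flag and
activated with it, which is the "tune all files, not only memory-mapped"
effect of a per-inode flag.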

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: Control page reclaim granularity
  2012-03-08  9:35 ` Minchan Kim
  2012-03-08 16:54   ` Zheng Liu
@ 2012-03-12 14:55   ` Rik van Riel
  2012-03-13  2:57     ` Minchan Kim
  1 sibling, 1 reply; 32+ messages in thread
From: Rik van Riel @ 2012-03-12 14:55 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-mm, linux-kernel, Konstantin Khlebnikov, kosaki.motohiro

On 03/08/2012 04:35 AM, Minchan Kim wrote:
> On Thu, Mar 08, 2012 at 03:34:13PM +0800, Zheng Liu wrote:
>> Hi list,
>>
>> Recently we encounter a problem about page reclaim.  I abstract it in here.
>> The problem is that there are two different file types.  One is small index
>> file, and another is large data file.  The index file is mmaped into memory,
>> and application hope that they can be kept in memory and don't be reclaimed
>> too frequently.  The data file is manipulted by read/write, and they should
>> be reclaimed more frequently than the index file.

They should indeed be.  The data pages should not get promoted
to the active list unless they get referenced twice while on
the inactive list.

Mmaped pages, on the other hand, get promoted to the active
list after just one reference.

Also, as long as the inactive file list is larger than the
active file list, we do not reclaim active file pages at
all.
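
The three rules above can be written down as a small userspace model.  It is
only a sketch: the struct and function names (model_page, should_promote,
may_reclaim_active_file) are made up for the example, and the real logic in
mm/vmscan.c has more cases than this.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a file page sitting on the inactive list. */
struct model_page {
	bool mapped;        /* page is mmaped into some process */
	unsigned int refs;  /* references seen while on the inactive list */
};

/*
 * Promotion rule as described above: a mapped page is promoted after
 * one reference, a page used only through read/write needs two.
 */
static bool should_promote(const struct model_page *page)
{
	unsigned int needed;

	assert(page);
	needed = page->mapped ? 1 : 2;

	return page->refs >= needed;
}

/*
 * Active file pages are left alone as long as the inactive file list
 * is larger than the active file list.
 */
static bool may_reclaim_active_file(unsigned long nr_inactive,
				    unsigned long nr_active)
{
	return nr_inactive <= nr_active;
}
```

With these rules, a mapped index page touched once is promoted, while a data
page read once stays on the inactive list until it is referenced again.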

> I  think it's a regression since 2.6.28.
> Before we were trying to keep mapped pages in memory(See calc_reclaim_mapped).
> But we removed that routine when we applied split lru page replacement.
> Rik, KOSAKI. What's the rationale?

One main reason is scalability.  We have to treat pages
in such a way that we do not have to search through
gigabytes of memory to find a few eviction candidates
to place on the inactive list - where they could get
reused and stopped from eviction again.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Fwd: Control page reclaim granularity
  2012-03-12 13:42               ` Minchan Kim
  2012-03-12 14:18                 ` Konstantin Khlebnikov
@ 2012-03-12 15:15                 ` Zheng Liu
  2012-03-13  2:51                   ` Minchan Kim
  1 sibling, 1 reply; 32+ messages in thread
From: Zheng Liu @ 2012-03-12 15:15 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Konstantin Khlebnikov, linux-mm, linux-kernel, riel, kosaki.motohiro

On Mon, Mar 12, 2012 at 10:42:26PM +0900, Minchan Kim wrote:
> On Mon, Mar 12, 2012 at 04:14:14PM +0800, Zheng Liu wrote:
> > On 03/12/2012 02:20 PM, Konstantin Khlebnikov wrote:
> > > Minchan Kim wrote:
> > >> On Mon, Mar 12, 2012 at 10:06:09AM +0800, Zheng Liu wrote:
> > >>> On Mon, Mar 12, 2012 at 09:29:34AM +0900, Minchan Kim wrote:
> > >>>> I forgot to Ccing you.
> > >>>> Sorry.
> > >>>>
> > >>>> ---------- Forwarded message ----------
> > >>>> From: Minchan Kim<minchan@kernel.org>
> > >>>> Date: Mon, Mar 12, 2012 at 9:28 AM
> > >>>> Subject: Re: Control page reclaim granularity
> > >>>> To: Minchan Kim<minchan@kernel.org>, linux-mm<linux-mm@kvack.org>,
> > >>>> linux-kernel<linux-kernel@vger.kernel.org>, Konstantin Khlebnikov<
> > >>>> khlebnikov@openvz.org>, riel@redhat.com, kosaki.motohiro@jp.fujitsu.com
> > >>>>
> > >>>>
> > >>>> On Fri, Mar 09, 2012 at 12:54:03AM +0800, Zheng Liu wrote:
> > >>>>> Hi Minchan,
> > >>>>>
> > >>>>> Sorry, I forgot to say that I don't subscribe linux-mm and
> > >>>>> linux-kernel
> > >>>>> mailing list.  So please Cc me.
> > >>>>>
> > >>>>> IMHO, maybe we should re-think about how does user use mmap(2).  I
> > >>>>> describe the cases I known in our product system.  They can be
> > >>>>> categorized into two cases.  One is mmaped all data files into memory
> > >>>>> and sometime it uses write(2) to append some data, and another uses
> > >>>>> mmap(2)/munmap(2) and read(2)/write(2) to manipulate the files.  In
> > >>>>> the
> > >>>>> second case,  the application wants to keep mmaped page into memory
> > >>>>> and
> > >>>>> let file pages to be reclaimed firstly.  So, IMO, when application
> > >>>>> uses
> > >>>>> mmap(2) to manipulate files, it is possible to imply that it wants
> > >>>>> keep
> > >>>>> these mmaped pages into memory and do not be reclaimed.  At least
> > >>>>> these
> > >>>>> pages do not be reclaimed early than file pages.  I think that
> > >>>>> maybe we
> > >>>>> can recover that routine and provide a sysctl parameter to let the
> > >>>>> user
> > >>>>> to set this ratio between mmaped pages and file pages.
> > >>>>
> > >>>> I am not convinced why we should handle mapped page specially.
> > >>>> Sometimem, someone may use mmap by reducing buffer copy compared to
> > >>>> read
> > >>>> system call.
> > >>>> So I think we can't make sure mmaped pages are always win.
> > >>>>
> > >>>> My suggestion is that it would be better to declare by user explicitly.
> > >>>> I think we can implement it by madvise and fadvise's WILLNEED option.
> > >>>> Current implementation is just readahead if there isn't a page in
> > >>>> memory
> > >>>> but I think
> > >>>> we can promote from inactive to active if there is already a page in
> > >>>> memory.
> > >>>>
> > >>>> It's more clear and it couldn't be affected by kernel page reclaim
> > >>>> algorithm change
> > >>>> like this.
> > >>>
> > >>> Thank you for your advice.  But I still have question about this
> > >>> solution.  If we improve the madvise(2) and fadvise(2)'s WILLNEED
> > >>> option,  it will cause an inconsistently status for pages that be
> > >>> manipulated by madvise(2) and/or fadvise(2).  For example, when I call
> > >>> madvise with WILLNEED flag, some pages will be moved into active list if
> > >>> they already have been in memory, and other pages will be read into
> > >>> memory and be saved in inactive list if they don't be in memory.  Then
> > >>> pages that are in inactive list are possible to be reclaim.  So from the
> > >>> view of users, it is inconsistent because some pages are in memory and
> > >>> some pages are reclaimed.  But actually the user hopes that all of pages
> > >>> can be kept in memory.  IMHO, this inconsistency is weird and makes
> > >>> users
> > >>> puzzled.
> > >>
> > >> Now problem is that
> > >>
> > >> 1. User want to keep pages which are used once in a while in memory.
> > >> 2. Kernel want to reclaim them because they are surely reclaim target
> > >>     pages in point of view by LRU.
> > >>
> > >> The most desriable approach is that user should use mlock to guarantee
> > >> them in memory. But mlock is too big overhead and user doesn't want to
> > >> keep
> > >> memory all pages all at once.(Ie, he want demand paging when he need
> > >> the page)
> > >> Right?
> > >>
> > >> madvise, it's a just hint for kernel and kernel doesn't need to make
> > >> sure madvise's behavior.
> > >> In point of view, such inconsistency might not be a big problem.
> > >>
> > >> Big problem I think now is that user should use madvise(WILLNEED)
> > >> periodically because such
> > >> activation happens once when user calls madvise. If user doesn't use
> > >> page frequently after
> > >> user calls it, it ends up moving into inactive list and even could be
> > >> reclaimed.
> > >> It's not good. :-(
> > >>
> > >> Okay. How about adding new VM_WORKINGSET?
> > >> And reclaimer would give one more round trip in active/inactive list
> > >> erwhen reclaim happens
> > >> if the page is referenced.
> > >>
> > >> Sigh. We have no room for new VM_FLAG in 32 bit.
> > > p
> > > It would be nice to mark struct address_space with this flag and export
> > > AS_UNEVICTABLE somehow.
> > > Maybe we can reuse file-locking engine for managing these bits =)
> > 
> > Make sense to me.  We can mark this flag in struct address_space and check
> > it in page_refereneced_file().  If this flag is set, it will be cleard and
> 
> Disadvantage is that we could set reclaim granularity as per-inode.
> I want to set it as per-vma, not per-inode.

I don't think this is a disadvantage.  Per-inode reclaim granularity is
useful for us.  Actually, I have thought about implementing a per-inode
memcg so that different sets of files can be reclaimed separately.  So
maybe we can provide both mechanisms and let the user choose which one
to use.

> 
> > the function returns referenced > 1.  Then this page can be promoted into
> > activate list.  But I prefer to set/clear this flag in madvise.
> 
> Hmm, My idea is following as,
> If we can set new VM flag into VMA or something, reclaimer can check it when shrink_[in]active_list
> and he can prevent to deactivate/reclaim if he takes a look the page is in VMA which
> are set by new VM flag and the page is referenced recently at least once.
> It means it gives one more round trip in his list(ie, active/inactive list)
> rather than activation so that the page would become less reclaimable.

Whether the page is given one more round trip or is promoted into the
active list, either can satisfy our current requirement.  So now the
question is which is better.  If we add a new VM flag, as you said
before, vma->vm_flags has no room for it on 32-bit.  I have noticed that
this topic has been discussed [1] and the result is that vm_flags is
still an unsigned long.  So we would need a tricky technique to work
around it.  If we add a new flag in struct address_space, it might be
easier to implement.

1. http://lkml.indiana.edu/hypermail/linux/kernel/1104.1/00975.html

Regards,
Zheng

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Fwd: Control page reclaim granularity
  2012-03-12 14:18                 ` Konstantin Khlebnikov
@ 2012-03-13  2:48                   ` Minchan Kim
  2012-03-13  4:37                     ` Konstantin Khlebnikov
  2012-03-13  6:30                     ` Zheng Liu
  0 siblings, 2 replies; 32+ messages in thread
From: Minchan Kim @ 2012-03-13  2:48 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Minchan Kim, linux-mm, linux-kernel, riel, kosaki.motohiro

On Mon, Mar 12, 2012 at 06:18:21PM +0400, Konstantin Khlebnikov wrote:
> Minchan Kim wrote:
> >On Mon, Mar 12, 2012 at 04:14:14PM +0800, Zheng Liu wrote:
> >>On 03/12/2012 02:20 PM, Konstantin Khlebnikov wrote:
> >>>Minchan Kim wrote:
> >>>>On Mon, Mar 12, 2012 at 10:06:09AM +0800, Zheng Liu wrote:
> >>>>>On Mon, Mar 12, 2012 at 09:29:34AM +0900, Minchan Kim wrote:
> >>>>>>I forgot to Ccing you.
> >>>>>>Sorry.
> >>>>>>
> >>>>>>---------- Forwarded message ----------
> >>>>>>From: Minchan Kim<minchan@kernel.org>
> >>>>>>Date: Mon, Mar 12, 2012 at 9:28 AM
> >>>>>>Subject: Re: Control page reclaim granularity
> >>>>>>To: Minchan Kim<minchan@kernel.org>, linux-mm<linux-mm@kvack.org>,
> >>>>>>linux-kernel<linux-kernel@vger.kernel.org>, Konstantin Khlebnikov<
> >>>>>>khlebnikov@openvz.org>, riel@redhat.com, kosaki.motohiro@jp.fujitsu.com
> >>>>>>
> >>>>>>
> >>>>>>On Fri, Mar 09, 2012 at 12:54:03AM +0800, Zheng Liu wrote:
> >>>>>>>Hi Minchan,
> >>>>>>>
> >>>>>>>Sorry, I forgot to say that I don't subscribe linux-mm and
> >>>>>>>linux-kernel
> >>>>>>>mailing list.  So please Cc me.
> >>>>>>>
> >>>>>>>IMHO, maybe we should re-think about how does user use mmap(2).  I
> >>>>>>>describe the cases I known in our product system.  They can be
> >>>>>>>categorized into two cases.  One is mmaped all data files into memory
> >>>>>>>and sometime it uses write(2) to append some data, and another uses
> >>>>>>>mmap(2)/munmap(2) and read(2)/write(2) to manipulate the files.  In
> >>>>>>>the
> >>>>>>>second case,  the application wants to keep mmaped page into memory
> >>>>>>>and
> >>>>>>>let file pages to be reclaimed firstly.  So, IMO, when application
> >>>>>>>uses
> >>>>>>>mmap(2) to manipulate files, it is possible to imply that it wants
> >>>>>>>keep
> >>>>>>>these mmaped pages into memory and do not be reclaimed.  At least
> >>>>>>>these
> >>>>>>>pages do not be reclaimed early than file pages.  I think that
> >>>>>>>maybe we
> >>>>>>>can recover that routine and provide a sysctl parameter to let the
> >>>>>>>user
> >>>>>>>to set this ratio between mmaped pages and file pages.
> >>>>>>
> >>>>>>I am not convinced why we should handle mapped page specially.
> >>>>>>Sometimem, someone may use mmap by reducing buffer copy compared to
> >>>>>>read
> >>>>>>system call.
> >>>>>>So I think we can't make sure mmaped pages are always win.
> >>>>>>
> >>>>>>My suggestion is that it would be better to declare by user explicitly.
> >>>>>>I think we can implement it by madvise and fadvise's WILLNEED option.
> >>>>>>Current implementation is just readahead if there isn't a page in
> >>>>>>memory
> >>>>>>but I think
> >>>>>>we can promote from inactive to active if there is already a page in
> >>>>>>memory.
> >>>>>>
> >>>>>>It's more clear and it couldn't be affected by kernel page reclaim
> >>>>>>algorithm change
> >>>>>>like this.
> >>>>>
> >>>>>Thank you for your advice.  But I still have question about this
> >>>>>solution.  If we improve the madvise(2) and fadvise(2)'s WILLNEED
> >>>>>option,  it will cause an inconsistently status for pages that be
> >>>>>manipulated by madvise(2) and/or fadvise(2).  For example, when I call
> >>>>>madvise with WILLNEED flag, some pages will be moved into active list if
> >>>>>they already have been in memory, and other pages will be read into
> >>>>>memory and be saved in inactive list if they don't be in memory.  Then
> >>>>>pages that are in inactive list are possible to be reclaim.  So from the
> >>>>>view of users, it is inconsistent because some pages are in memory and
> >>>>>some pages are reclaimed.  But actually the user hopes that all of pages
> >>>>>can be kept in memory.  IMHO, this inconsistency is weird and makes
> >>>>>users
> >>>>>puzzled.
> >>>>
> >>>>Now problem is that
> >>>>
> >>>>1. User want to keep pages which are used once in a while in memory.
> >>>>2. Kernel want to reclaim them because they are surely reclaim target
> >>>>     pages in point of view by LRU.
> >>>>
> >>>>The most desriable approach is that user should use mlock to guarantee
> >>>>them in memory. But mlock is too big overhead and user doesn't want to
> >>>>keep
> >>>>memory all pages all at once.(Ie, he want demand paging when he need
> >>>>the page)
> >>>>Right?
> >>>>
> >>>>madvise, it's a just hint for kernel and kernel doesn't need to make
> >>>>sure madvise's behavior.
> >>>>In point of view, such inconsistency might not be a big problem.
> >>>>
> >>>>Big problem I think now is that user should use madvise(WILLNEED)
> >>>>periodically because such
> >>>>activation happens once when user calls madvise. If user doesn't use
> >>>>page frequently after
> >>>>user calls it, it ends up moving into inactive list and even could be
> >>>>reclaimed.
> >>>>It's not good. :-(
> >>>>
> >>>>Okay. How about adding new VM_WORKINGSET?
> >>>>And reclaimer would give one more round trip in active/inactive list
> >>>>erwhen reclaim happens
> >>>>if the page is referenced.
> >>>>
> >>>>Sigh. We have no room for new VM_FLAG in 32 bit.
> >>>p
> >>>It would be nice to mark struct address_space with this flag and export
> >>>AS_UNEVICTABLE somehow.
> >>>Maybe we can reuse file-locking engine for managing these bits =)
> >>
> >>Make sense to me.  We can mark this flag in struct address_space and check
> >>it in page_refereneced_file().  If this flag is set, it will be cleard and
> >
> >Disadvantage is that we could set reclaim granularity as per-inode.
> >I want to set it as per-vma, not per-inode.
> 
> But with per-inode flag we can tune all files, not only memory-mapped.

I don't oppose a per-inode setting, but I believe we still need a file range or an
mmapped vma.  One file may have parts with different characteristics: some part is
working set and some part is streaming.
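
For comparison, this is roughly how an application can hint per range with
the interfaces that exist today.  It is a sketch under assumptions: the
helper name hint_file_parts and its parameters are made up for the example,
and these calls are only hints (as noted earlier in the thread, WILLNEED just
triggers readahead), so they do not provide the retention discussed here.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: treat [ws_off, ws_off + ws_len) of an existing
 * mapping as working set, and the file range
 * [stream_off, stream_off + stream_len) as already-consumed streaming
 * data.  ws_off must be a multiple of the page size because madvise()
 * requires a page-aligned address.  Returns 0 on success, -1 on failure.
 */
static int hint_file_parts(int fd, void *map, size_t ws_off, size_t ws_len,
			   off_t stream_off, off_t stream_len)
{
	assert(map != MAP_FAILED);
	assert(ws_off % (size_t)sysconf(_SC_PAGESIZE) == 0);

	/* Ask for the working-set part to be read in ahead of use. */
	if (madvise((char *)map + ws_off, ws_len, MADV_WILLNEED) != 0)
		return -1;

	/* Tell the kernel the streamed part will not be needed again. */
	if (posix_fadvise(fd, stream_off, stream_len, POSIX_FADV_DONTNEED) != 0)
		return -1;

	return 0;
}
```

Note that posix_fadvise() returns an error number directly rather than
setting errno, which is why its result is compared against 0 instead of -1.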

> See, attached patch. Currently I thinking about managing code,
> file-locking engine really fits perfectly =)

The file-locking engine?
Are you considering fcntl as the interface for it?
What do you mean?

> 
> >
> >>the function returns referenced>  1.  Then this page can be promoted into
> >>activate list.  But I prefer to set/clear this flag in madvise.
> >
> >Hmm, My idea is following as,
> >If we can set new VM flag into VMA or something, reclaimer can check it when shrink_[in]active_list
> >and he can prevent to deactivate/reclaim if he takes a look the page is in VMA which
> >are set by new VM flag and the page is referenced recently at least once.
> >It means it gives one more round trip in his list(ie, active/inactive list)
> >rather than activation so that the page would become less reclaimable.
> >
> >>
> >>PS, I have subscribed linux-mm mailing list. :-)
> >
> >Congratulations! :)
> >
> >>
> >>Regards,
> >>Zheng
> >
> 

> mm: introduce mapping AS_WORKINGSET flag
> 
> From: Konstantin Khlebnikov <khlebnikov@openvz.org>
> 
> This patch introduces new flag AS_WORKINGSET in mapping->flags.
> If it set reclaimer will activates all pages for this inode after first usage.
> 
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
> ---
>  include/linux/pagemap.h |   16 ++++++++++++++++
>  mm/vmscan.c             |   15 ++++++++++++---
>  2 files changed, 28 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index cfaaa69..c15fc17 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -24,6 +24,7 @@ enum mapping_flags {
>  	AS_ENOSPC	= __GFP_BITS_SHIFT + 1,	/* ENOSPC on async write */
>  	AS_MM_ALL_LOCKS	= __GFP_BITS_SHIFT + 2,	/* under mm_take_all_locks() */
>  	AS_UNEVICTABLE	= __GFP_BITS_SHIFT + 3,	/* e.g., ramdisk, SHM_LOCK */
> +	AS_WORKINGSET	= __GFP_BITS_SHIFT + 4,	/* promote pages activation */
>  };
>  
>  static inline void mapping_set_error(struct address_space *mapping, int error)
> @@ -53,6 +54,21 @@ static inline int mapping_unevictable(struct address_space *mapping)
>  	return !!mapping;
>  }
>  
> +static inline void mapping_set_workingset(struct address_space *mapping)
> +{
> +	set_bit(AS_WORKINGSET, &mapping->flags);
> +}
> +
> +static inline void mapping_clear_workingset(struct address_space *mapping)
> +{
> +	clear_bit(AS_WORKINGSET, &mapping->flags);
> +}
> +
> +static inline int mapping_test_workingset(struct address_space *mapping)
> +{
> +	return mapping && test_bit(AS_WORKINGSET, &mapping->flags);
> +}
> +
>  static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
>  {
>  	return (__force gfp_t)mapping->flags & __GFP_BITS_MASK;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 57b9658..5ccbe8c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -701,6 +701,7 @@ enum page_references {
>  };
>  
>  static enum page_references page_check_references(struct page *page,
> +						  struct address_space *mapping,
>  						  struct mem_cgroup_zone *mz,
>  						  struct scan_control *sc)
>  {
> @@ -721,6 +722,13 @@ static enum page_references page_check_references(struct page *page,
>  	if (vm_flags & VM_LOCKED)
>  		return PAGEREF_RECLAIM;
>  
> +	/*
> +	 * Activate workingset page if referenced at least once.
> +	 */
> +	if (mapping_test_workingset(mapping) &&
> +	    (referenced_ptes || referenced_page))
> +		return PAGEREF_ACTIVATE;
> +
>  	if (referenced_ptes) {
>  		if (PageAnon(page))
>  			return PAGEREF_ACTIVATE;
> @@ -828,7 +836,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  			}
>  		}
>  
> -		references = page_check_references(page, mz, sc);
> +		mapping = page_mapping(page);
> +
> +		references = page_check_references(page, mapping, mz, sc);
>  		switch (references) {
>  		case PAGEREF_ACTIVATE:
>  			goto activate_locked;
> @@ -848,11 +858,10 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  				goto keep_locked;
>  			if (!add_to_swap(page))
>  				goto activate_locked;
> +			mapping = &swapper_space;
>  			may_enter_fs = 1;
>  		}
>  
> -		mapping = page_mapping(page);
> -
>  		/*
>  		 * The page is mapped into the page tables of one or more
>  		 * processes. Try to unmap it here.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Fwd: Control page reclaim granularity
  2012-03-12 15:15                 ` Zheng Liu
@ 2012-03-13  2:51                   ` Minchan Kim
  0 siblings, 0 replies; 32+ messages in thread
From: Minchan Kim @ 2012-03-13  2:51 UTC (permalink / raw)
  To: Minchan Kim, Konstantin Khlebnikov, linux-mm, linux-kernel, riel,
	kosaki.motohiro

On Mon, Mar 12, 2012 at 11:15:43PM +0800, Zheng Liu wrote:
> On Mon, Mar 12, 2012 at 10:42:26PM +0900, Minchan Kim wrote:
> > On Mon, Mar 12, 2012 at 04:14:14PM +0800, Zheng Liu wrote:
> > > On 03/12/2012 02:20 PM, Konstantin Khlebnikov wrote:
> > > > Minchan Kim wrote:
> > > >> On Mon, Mar 12, 2012 at 10:06:09AM +0800, Zheng Liu wrote:
> > > >>> On Mon, Mar 12, 2012 at 09:29:34AM +0900, Minchan Kim wrote:
> > > >>>> I forgot to Ccing you.
> > > >>>> Sorry.
> > > >>>>
> > > >>>> ---------- Forwarded message ----------
> > > >>>> From: Minchan Kim<minchan@kernel.org>
> > > >>>> Date: Mon, Mar 12, 2012 at 9:28 AM
> > > >>>> Subject: Re: Control page reclaim granularity
> > > >>>> To: Minchan Kim<minchan@kernel.org>, linux-mm<linux-mm@kvack.org>,
> > > >>>> linux-kernel<linux-kernel@vger.kernel.org>, Konstantin Khlebnikov<
> > > >>>> khlebnikov@openvz.org>, riel@redhat.com, kosaki.motohiro@jp.fujitsu.com
> > > >>>>
> > > >>>>
> > > >>>> On Fri, Mar 09, 2012 at 12:54:03AM +0800, Zheng Liu wrote:
> > > >>>>> Hi Minchan,
> > > >>>>>
> > > >>>>> Sorry, I forgot to say that I don't subscribe linux-mm and
> > > >>>>> linux-kernel
> > > >>>>> mailing list.  So please Cc me.
> > > >>>>>
> > > >>>>> IMHO, maybe we should re-think about how does user use mmap(2).  I
> > > >>>>> describe the cases I known in our product system.  They can be
> > > >>>>> categorized into two cases.  One is mmaped all data files into memory
> > > >>>>> and sometime it uses write(2) to append some data, and another uses
> > > >>>>> mmap(2)/munmap(2) and read(2)/write(2) to manipulate the files.  In
> > > >>>>> the
> > > >>>>> second case,  the application wants to keep mmaped page into memory
> > > >>>>> and
> > > >>>>> let file pages to be reclaimed firstly.  So, IMO, when application
> > > >>>>> uses
> > > >>>>> mmap(2) to manipulate files, it is possible to imply that it wants
> > > >>>>> keep
> > > >>>>> these mmaped pages into memory and do not be reclaimed.  At least
> > > >>>>> these
> > > >>>>> pages do not be reclaimed early than file pages.  I think that
> > > >>>>> maybe we
> > > >>>>> can recover that routine and provide a sysctl parameter to let the
> > > >>>>> user
> > > >>>>> to set this ratio between mmaped pages and file pages.
> > > >>>>
> > > >>>> I am not convinced why we should handle mapped page specially.
> > > >>>> Sometimem, someone may use mmap by reducing buffer copy compared to
> > > >>>> read
> > > >>>> system call.
> > > >>>> So I think we can't make sure mmaped pages are always win.
> > > >>>>
> > > >>>> My suggestion is that it would be better to declare by user explicitly.
> > > >>>> I think we can implement it by madvise and fadvise's WILLNEED option.
> > > >>>> Current implementation is just readahead if there isn't a page in
> > > >>>> memory
> > > >>>> but I think
> > > >>>> we can promote from inactive to active if there is already a page in
> > > >>>> memory.
> > > >>>>
> > > >>>> It's more clear and it couldn't be affected by kernel page reclaim
> > > >>>> algorithm change
> > > >>>> like this.
> > > >>>
> > > >>> Thank you for your advice.  But I still have question about this
> > > >>> solution.  If we improve the madvise(2) and fadvise(2)'s WILLNEED
> > > >>> option,  it will cause an inconsistently status for pages that be
> > > >>> manipulated by madvise(2) and/or fadvise(2).  For example, when I call
> > > >>> madvise with WILLNEED flag, some pages will be moved into active list if
> > > >>> they already have been in memory, and other pages will be read into
> > > >>> memory and be saved in inactive list if they don't be in memory.  Then
> > > >>> pages that are in inactive list are possible to be reclaim.  So from the
> > > >>> view of users, it is inconsistent because some pages are in memory and
> > > >>> some pages are reclaimed.  But actually the user hopes that all of pages
> > > >>> can be kept in memory.  IMHO, this inconsistency is weird and makes
> > > >>> users
> > > >>> puzzled.
> > > >>
> > > >> Now problem is that
> > > >>
> > > >> 1. User want to keep pages which are used once in a while in memory.
> > > >> 2. Kernel want to reclaim them because they are surely reclaim target
> > > >>     pages in point of view by LRU.
> > > >>
> > > >> The most desriable approach is that user should use mlock to guarantee
> > > >> them in memory. But mlock is too big overhead and user doesn't want to
> > > >> keep
> > > >> memory all pages all at once.(Ie, he want demand paging when he need
> > > >> the page)
> > > >> Right?
> > > >>
> > > >> madvise, it's a just hint for kernel and kernel doesn't need to make
> > > >> sure madvise's behavior.
> > > >> In point of view, such inconsistency might not be a big problem.
> > > >>
> > > >> Big problem I think now is that user should use madvise(WILLNEED)
> > > >> periodically because such
> > > >> activation happens once when user calls madvise. If user doesn't use
> > > >> page frequently after
> > > >> user calls it, it ends up moving into inactive list and even could be
> > > >> reclaimed.
> > > >> It's not good. :-(
> > > >>
> > > >> Okay. How about adding new VM_WORKINGSET?
> > > >> And reclaimer would give one more round trip in active/inactive list
> > > >> when reclaim happens
> > > >> if the page is referenced.
> > > >>
> > > >> Sigh. We have no room for new VM_FLAG in 32 bit.
> > > >
> > > > It would be nice to mark struct address_space with this flag and export
> > > > AS_UNEVICTABLE somehow.
> > > > Maybe we can reuse file-locking engine for managing these bits =)
> > > 
> > > Make sense to me.  We can mark this flag in struct address_space and check
> > > it in page_refereneced_file().  If this flag is set, it will be cleard and
> > 
> > Disadvantage is that we could set reclaim granularity as per-inode.
> > I want to set it as per-vma, not per-inode.
> 
> I don't think this is a disadvantage.  This per-inode reclaim
> granularity is useful for us.  Actually I have thought to implement a
> per-inode memcg to let different file sets to be reclaimed separately.
> So maybe we can provide two mechanisms to let the user to choose how to
> use them.

I don't oppose supporting both mechanisms, but I don't want to provide only the
per-inode approach.

> 
> > 
> > > the function returns referenced > 1.  Then this page can be promoted into
> > > activate list.  But I prefer to set/clear this flag in madvise.
> > 
> > Hmm, My idea is following as,
> > If we can set new VM flag into VMA or something, reclaimer can check it when shrink_[in]active_list
> > and he can prevent to deactivate/reclaim if he takes a look the page is in VMA which
> > are set by new VM flag and the page is referenced recently at least once.
> > It means it gives one more round trip in his list(ie, active/inactive list)
> > rather than activation so that the page would become less reclaimable.
> 
> No matter what the page is given one more round trip or is promoted into
> active list, it can satisfy our current requirement.  So now the
> question is which is better.  If we add a new VM flag, as you said
> before, vma->vm_flags has no room for it in 32 bit.  I have noticed that
> this topic has been discussed [1] and the result is that vm_flags is
> still a unsigned long type.  So we need to use a tricky technique to solve
> it.  If we add a new flag in struct addpress_space, it might be easy to
> implement it.

For the per-inode case that is fine, but it doesn't work for per-vma or file-range granularity.

> 
> 1. http://lkml.indiana.edu/hypermail/linux/kernel/1104.1/00975.html
> 
> Regards,
> Zheng

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Control page reclaim granularity
  2012-03-12 14:55   ` Rik van Riel
@ 2012-03-13  2:57     ` Minchan Kim
  2012-03-13 14:57       ` Rik van Riel
  0 siblings, 1 reply; 32+ messages in thread
From: Minchan Kim @ 2012-03-13  2:57 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Minchan Kim, linux-mm, linux-kernel, Konstantin Khlebnikov,
	kosaki.motohiro

On Mon, Mar 12, 2012 at 10:55:24AM -0400, Rik van Riel wrote:
> On 03/08/2012 04:35 AM, Minchan Kim wrote:
> >On Thu, Mar 08, 2012 at 03:34:13PM +0800, Zheng Liu wrote:
> >>Hi list,
> >>
> >>Recently we encounter a problem about page reclaim.  I abstract it in here.
> >>The problem is that there are two different file types.  One is small index
> >>file, and another is large data file.  The index file is mmaped into memory,
> >>and application hope that they can be kept in memory and don't be reclaimed
> >>too frequently.  The data file is manipulted by read/write, and they should
> >>be reclaimed more frequently than the index file.
> 
> They should indeed be.  The data pages should not get promoted
> to the active list unless they get referenced twice while on
> the inactive list.
> 
> Mmaped pages, on the other hand, get promoted to the active
> list after just one reference.

As I read the code, an mmaped page doesn't get promoted after one reference.
It gets promoted on its second round trip, or on the first round trip if it was
touched through several mappings.

	if (referenced_page || referenced_ptes > 1)
		return PAGEREF_ACTIVATE;
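
To make the round-trip behaviour concrete, here is a rough user-space model of
that check. It is only a sketch, not the kernel code: struct model_page and
model_check_references() are hypothetical names, lumpy reclaim and VM_LOCKED
are left out, and "anon" is used as a stand-in for the swap-backed test.

```c
#include <assert.h>
#include <stdbool.h>

/* Verdicts, named after the kernel's enum page_references. */
enum page_references {
	PAGEREF_RECLAIM,
	PAGEREF_RECLAIM_CLEAN,
	PAGEREF_KEEP,
	PAGEREF_ACTIVATE,
};

/* Hypothetical stand-in for the state the reclaimer inspects. */
struct model_page {
	int referenced_ptes;	/* ptes found with the accessed bit set */
	bool page_referenced;	/* PG_referenced from an earlier round trip */
	bool anon;		/* anonymous page (assumed swap-backed) */
	bool vm_exec;		/* mapped by at least one VM_EXEC vma */
};

static enum page_references model_check_references(struct model_page *p)
{
	bool referenced_page = p->page_referenced;

	assert(p->referenced_ptes >= 0);
	p->page_referenced = false;		/* like TestClearPageReferenced() */

	if (p->referenced_ptes) {
		if (p->anon)
			return PAGEREF_ACTIVATE;
		p->page_referenced = true;	/* like SetPageReferenced() */

		/* Second round trip, or several mappings on the first one. */
		if (referenced_page || p->referenced_ptes > 1)
			return PAGEREF_ACTIVATE;
		/* Executable file pages are activated after first usage. */
		if (p->vm_exec)
			return PAGEREF_ACTIVATE;
		return PAGEREF_KEEP;
	}

	/* Unmapped reference only: reclaim if clean. */
	if (referenced_page && !p->anon)
		return PAGEREF_RECLAIM_CLEAN;
	return PAGEREF_RECLAIM;
}
```

In this model a file page referenced through a single pte gets PAGEREF_KEEP on
the first scan and PAGEREF_ACTIVATE only if it is found referenced again on the
next scan, which is the "second round trip" described above.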

> 
> Also, as long as the inactive file list is larger than the
> active file list, we do not reclaim active file pages at
> all.

True.

> 
> >I  think it's a regression since 2.6.28.
> >Before we were trying to keep mapped pages in memory(See calc_reclaim_mapped).
> >But we removed that routine when we applied split lru page replacement.
> >Rik, KOSAKI. What's the rationale?
> 
> One main reason is scalability.  We have to treat pages
> in such a way that we do not have to search through
> gigabytes of memory to find a few eviction candidates
> to place on the inactive list - where they could get
> reused and stopped from eviction again.

Okay. Thanks, Rik.
Then, another question:
why did we handle mmaped pages specially at that time?
Just out of curiosity.

> 
> -- 
> All rights reversed

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Fwd: Control page reclaim granularity
  2012-03-13  2:48                   ` Minchan Kim
@ 2012-03-13  4:37                     ` Konstantin Khlebnikov
  2012-03-13  5:00                       ` Konstantin Khlebnikov
  2012-03-13  6:30                     ` Zheng Liu
  1 sibling, 1 reply; 32+ messages in thread
From: Konstantin Khlebnikov @ 2012-03-13  4:37 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-mm, linux-kernel, riel, kosaki.motohiro

Minchan Kim wrote:
> On Mon, Mar 12, 2012 at 06:18:21PM +0400, Konstantin Khlebnikov wrote:
>> Minchan Kim wrote:
>>> On Mon, Mar 12, 2012 at 04:14:14PM +0800, Zheng Liu wrote:
>>>> On 03/12/2012 02:20 PM, Konstantin Khlebnikov wrote:
>>>>> Minchan Kim wrote:
>>>>>> On Mon, Mar 12, 2012 at 10:06:09AM +0800, Zheng Liu wrote:
<CUT>
>>>>>>
>>>>>> Now problem is that
>>>>>>
>>>>>> 1. User want to keep pages which are used once in a while in memory.
>>>>>> 2. Kernel want to reclaim them because they are surely reclaim target
>>>>>>      pages in point of view by LRU.
>>>>>>
>>>>>> The most desriable approach is that user should use mlock to guarantee
>>>>>> them in memory. But mlock is too big overhead and user doesn't want to
>>>>>> keep
>>>>>> memory all pages all at once.(Ie, he want demand paging when he need
>>>>>> the page)
>>>>>> Right?
>>>>>>
>>>>>> madvise, it's a just hint for kernel and kernel doesn't need to make
>>>>>> sure madvise's behavior.
>>>>>> In point of view, such inconsistency might not be a big problem.
>>>>>>
>>>>>> Big problem I think now is that user should use madvise(WILLNEED)
>>>>>> periodically because such
>>>>>> activation happens once when user calls madvise. If user doesn't use
>>>>>> page frequently after
>>>>>> user calls it, it ends up moving into inactive list and even could be
>>>>>> reclaimed.
>>>>>> It's not good. :-(
>>>>>>
>>>>>> Okay. How about adding new VM_WORKINGSET?
>>>>>> And reclaimer would give one more round trip in active/inactive list
>>>>>> when reclaim happens
>>>>>> if the page is referenced.
>>>>>>
>>>>>> Sigh. We have no room for new VM_FLAG in 32 bit.
>>>>>
>>>>> It would be nice to mark struct address_space with this flag and export
>>>>> AS_UNEVICTABLE somehow.
>>>>> Maybe we can reuse file-locking engine for managing these bits =)
>>>>
>>>> Make sense to me.  We can mark this flag in struct address_space and check
>>>> it in page_refereneced_file().  If this flag is set, it will be cleard and
>>>
>>> Disadvantage is that we could set reclaim granularity as per-inode.
>>> I want to set it as per-vma, not per-inode.
>>
>> But with per-inode flag we can tune all files, not only memory-mapped.
>
> I don't oppose per-inode setting but I believe we need file range or mmapped vma,
> still. One file may have different characteristic part, something is working set
> something is streaming part.
>
>> See, attached patch. Currently I thinking about managing code,
>> file-locking engine really fits perfectly =)
>
> file-locking engine?
> You consider fcntl as interface for it?
> What do you mean?
>

If we set bits on the inode, we must somehow account for its users and clear AS_WORKINGSET and AS_UNEVICTABLE
at the last file close. We can use the file-locking engine to lock inodes in memory: a file lock automatically
releases the inode at the last fput(). Maybe that is too tricky and we should instead add a couple of simple atomic
counters to the generic struct inode (like i_writecount/i_readcount), but in that case we would add new code on the fast path.
So inventing a new kind of struct file_lock looks like the best approach.
I don't want to implement range locking for now, but I can do it if somebody really wants it.
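
The accounting itself is simple. Below is a minimal user-space sketch of the
counter variant, only to show the intended lifetime of the bit. The names
ws_mapping, ws_hint_get() and ws_hint_put() are made up for illustration, and
the model is single-threaded; real code would need atomics or a lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct address_space plus a user count. */
struct ws_mapping {
	bool workingset;	/* models the AS_WORKINGSET bit */
	int hint_users;		/* open files that asked for the hint */
};

/* An open file requests the hint; the first user sets the bit. */
static void ws_hint_get(struct ws_mapping *m)
{
	if (m->hint_users++ == 0)
		m->workingset = true;
}

/* Called when such a file goes away; the last user clears the bit. */
static void ws_hint_put(struct ws_mapping *m)
{
	assert(m->hint_users > 0);
	if (--m->hint_users == 0)
		m->workingset = false;
}
```

With two users the bit survives the first ws_hint_put() and is cleared by the
second one. A file_lock based version would get the same "release at last
fput()" behaviour without adding a counter to struct inode.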

Yes, we can use fcntl(), but fadvise() is much better.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Fwd: Control page reclaim granularity
  2012-03-13  4:37                     ` Konstantin Khlebnikov
@ 2012-03-13  5:00                       ` Konstantin Khlebnikov
  0 siblings, 0 replies; 32+ messages in thread
From: Konstantin Khlebnikov @ 2012-03-13  5:00 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-mm, linux-kernel, riel, kosaki.motohiro

Konstantin Khlebnikov wrote:
> Minchan Kim wrote:
>> On Mon, Mar 12, 2012 at 06:18:21PM +0400, Konstantin Khlebnikov wrote:
>>> Minchan Kim wrote:
>>>> On Mon, Mar 12, 2012 at 04:14:14PM +0800, Zheng Liu wrote:
>>>>> On 03/12/2012 02:20 PM, Konstantin Khlebnikov wrote:
>>>>>> Minchan Kim wrote:
>>>>>>> On Mon, Mar 12, 2012 at 10:06:09AM +0800, Zheng Liu wrote:
> <CUT>
>>>>>>>
>>>>>>> Now problem is that
>>>>>>>
>>>>>>> 1. User want to keep pages which are used once in a while in memory.
>>>>>>> 2. Kernel want to reclaim them because they are surely reclaim target
>>>>>>> pages in point of view by LRU.
>>>>>>>
>>>>>>> The most desriable approach is that user should use mlock to guarantee
>>>>>>> them in memory. But mlock is too big overhead and user doesn't want to
>>>>>>> keep
>>>>>>> memory all pages all at once.(Ie, he want demand paging when he need
>>>>>>> the page)
>>>>>>> Right?
>>>>>>>
>>>>>>> madvise, it's a just hint for kernel and kernel doesn't need to make
>>>>>>> sure madvise's behavior.
>>>>>>> In point of view, such inconsistency might not be a big problem.
>>>>>>>
>>>>>>> Big problem I think now is that user should use madvise(WILLNEED)
>>>>>>> periodically because such
>>>>>>> activation happens once when user calls madvise. If user doesn't use
>>>>>>> page frequently after
>>>>>>> user calls it, it ends up moving into inactive list and even could be
>>>>>>> reclaimed.
>>>>>>> It's not good. :-(
>>>>>>>
>>>>>>> Okay. How about adding new VM_WORKINGSET?
>>>>>>> And reclaimer would give one more round trip in active/inactive list
>>>>>>> when reclaim happens
>>>>>>> if the page is referenced.
>>>>>>>
>>>>>>> Sigh. We have no room for new VM_FLAG in 32 bit.
>>>>>>
>>>>>> It would be nice to mark struct address_space with this flag and export
>>>>>> AS_UNEVICTABLE somehow.
>>>>>> Maybe we can reuse file-locking engine for managing these bits =)
>>>>>
>>>>> Make sense to me. We can mark this flag in struct address_space and check
>>>>> it in page_refereneced_file(). If this flag is set, it will be cleard and
>>>>
>>>> Disadvantage is that we could set reclaim granularity as per-inode.
>>>> I want to set it as per-vma, not per-inode.
>>>
>>> But with per-inode flag we can tune all files, not only memory-mapped.
>>
>> I don't oppose per-inode setting but I believe we need file range or mmapped vma,
>> still. One file may have different characteristic part, something is working set
>> something is streaming part.
>>
>>> See, attached patch. Currently I thinking about managing code,
>>> file-locking engine really fits perfectly =)
>>
>> file-locking engine?
>> You consider fcntl as interface for it?
>> What do you mean?
>>
>
> If we set bits on inode we somehow account its users and clear AS_WORKINGSET and AS_UNEVICTABLE
> at last file close. We can use file-locking engine for locking inodes in memory -- file lock automatically
> release inode at last fput(). Maybe it's too tricky and we should add couple simple atomic counters to
> generic strict inode (like i_writecount/i_readcount) but in this case we will add new code on fast-path.
> So, looks like invention new kind of struct file_lock is best approach.
> I don't want implement range-locking for now, but I can do it if somebody really wants this.
>
> Yes, we can use fcntl(), but fadvise() is much better.

Another mad idea: if we mark the vma, then we can add a fake vma (belonging to init_mm, for example) to
the inode's rmap to lock a range of the inode's pages in memory without actually mapping the file.
In page_referenced_one() we would have to handle this fake vma differently,
because page_check_address() will always fail for it.
That way we could effectively implement AS_WORKINGSET and AS_UNEVICTABLE for arbitrary page ranges.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Fwd: Control page reclaim granularity
  2012-03-13  2:48                   ` Minchan Kim
  2012-03-13  4:37                     ` Konstantin Khlebnikov
@ 2012-03-13  6:30                     ` Zheng Liu
  2012-03-13  6:48                       ` Zheng Liu
  1 sibling, 1 reply; 32+ messages in thread
From: Zheng Liu @ 2012-03-13  6:30 UTC (permalink / raw)
  To: minchan
  Cc: khlebnikov, linux-mm, linux-kernel, riel, kosaki.motohiro, Zheng Liu

This is only a first trivial try.  If this flag is set, the reclaimer just gives
the page one more round trip rather than promoting it into the active list.  Any
comments or advice are welcome.

Regards,
Zheng

[PATCH] mm: per-inode mmaped page reclaim

From: Zheng Liu <wenqing.lz@taobao.com>

In some cases, the user wants to control mmaped page reclaim granularity.  A new
flag, AS_WORKINGSET, is added to struct address_space to give the page one more
round trip.  The flag cannot be added to vma->vm_flags because vm_flags has no
room for a new flag on 32-bit.  Now the user can call madvise(2) to set this flag
for a file.  If this flag is set, all pages will be given one more round trip
when the reclaimer tries to shrink pages.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
---
 include/asm-generic/mman-common.h |    2 ++
 include/linux/pagemap.h           |   16 ++++++++++++++++
 mm/madvise.c                      |    8 ++++++++
 mm/vmscan.c                       |   15 +++++++++++++++
 4 files changed, 41 insertions(+), 0 deletions(-)

diff --git a/include/asm-generic/mman-common.h b/include/asm-generic/mman-common.h
index 787abbb..7d26c9b 100644
--- a/include/asm-generic/mman-common.h
+++ b/include/asm-generic/mman-common.h
@@ -48,6 +48,8 @@
 #define MADV_HUGEPAGE	14		/* Worth backing with hugepages */
 #define MADV_NOHUGEPAGE	15		/* Not worth backing with hugepages */
 
+#define MADV_WORKINGSET 16		/* give one more round trip */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index cfaaa69..80532a0 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -24,6 +24,7 @@ enum mapping_flags {
 	AS_ENOSPC	= __GFP_BITS_SHIFT + 1,	/* ENOSPC on async write */
 	AS_MM_ALL_LOCKS	= __GFP_BITS_SHIFT + 2,	/* under mm_take_all_locks() */
 	AS_UNEVICTABLE	= __GFP_BITS_SHIFT + 3,	/* e.g., ramdisk, SHM_LOCK */
+	AS_WORKINGSET	= __GFP_BITS_SHIFT + 4, /* give one more round trip */
 };
 
 static inline void mapping_set_error(struct address_space *mapping, int error)
@@ -36,6 +37,21 @@ static inline void mapping_set_error(struct address_space *mapping, int error)
 	}
 }
 
+static inline void mapping_set_workingset(struct address_space *mapping)
+{
+	set_bit(AS_WORKINGSET, &mapping->flags);
+}
+
+static inline void mapping_clear_workingset(struct address_space *mapping)
+{
+	clear_bit(AS_WORKINGSET, &mapping->flags);
+}
+
+static inline int mapping_test_workingset(struct address_space *mapping)
+{
+	return mapping && test_bit(AS_WORKINGSET, &mapping->flags);
+}
+
 static inline void mapping_set_unevictable(struct address_space *mapping)
 {
 	set_bit(AS_UNEVICTABLE, &mapping->flags);
diff --git a/mm/madvise.c b/mm/madvise.c
index 74bf193..8ca6c9b 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -77,6 +77,14 @@ static long madvise_behavior(struct vm_area_struct * vma,
 		if (error)
 			goto out;
 		break;
+	case MADV_WORKINGSET:
+		if (vma->vm_file && vma->vm_file->f_mapping) {
+			mapping_set_workingset(vma->vm_file->f_mapping);
+		} else {
+			error = -EPERM;
+			goto out;
+		}
+		break;
 	}
 
 	if (new_flags == vma->vm_flags) {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c52b235..51f745b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -721,6 +721,15 @@ static enum page_references page_check_references(struct page *page,
 	if (vm_flags & VM_LOCKED)
 		return PAGEREF_RECLAIM;
 
+	/*
+	 * give this page one more round trip because workingset
+	 * flag is set.
+	 */
+	if (mapping_test_workingset(page_mapping(page))) {
+		mapping_clear_workingset(page_mapping(page));
+		return PAGEREF_KEEP;
+	}
+
 	if (referenced_ptes) {
 		if (PageAnon(page))
 			return PAGEREF_ACTIVATE;
@@ -1737,6 +1746,12 @@ static void shrink_active_list(unsigned long nr_to_scan,
 			continue;
 		}
 
+		if (mapping_test_workingset(page_mapping(page))) {
+			mapping_clear_workingset(page_mapping(page));
+			list_add(&page->lru, &l_active);
+			continue;
+		}
+
 		if (page_referenced(page, 0, mz->mem_cgroup, &vm_flags)) {
 			nr_rotated += hpage_nr_pages(page);
 			/*
-- 
1.7.4.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: Fwd: Control page reclaim granularity
  2012-03-13  6:30                     ` Zheng Liu
@ 2012-03-13  6:48                       ` Zheng Liu
  2012-03-13  7:21                         ` Konstantin Khlebnikov
  0 siblings, 1 reply; 32+ messages in thread
From: Zheng Liu @ 2012-03-13  6:48 UTC (permalink / raw)
  To: minchan
  Cc: khlebnikov, linux-mm, linux-kernel, riel, kosaki.motohiro, Zheng Liu

Sorry, please forgive me.  This patch has a defect.  When one page is
scanned and the flag is cleared, it is cleared for all other pages of
the same mapping too.
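
A small user-space model shows the effect. It is only a sketch: demo_mapping,
demo_page and the two helpers are hypothetical names that mimic the
test-and-clear hunk in page_check_references(), nothing more.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One flag per file, shared by every page of that file. */
struct demo_mapping { bool workingset; };
struct demo_page { struct demo_mapping *mapping; };

/* Mimics the hunk: test the per-mapping flag, clear it, keep the page. */
static bool demo_gets_extra_round(struct demo_page *page)
{
	if (page->mapping && page->mapping->workingset) {
		page->mapping->workingset = false;	/* cleared for the whole file */
		return true;
	}
	return false;
}

/* Scan n pages and count how many got the extra round trip. */
static size_t demo_scan(struct demo_page *pages, size_t n)
{
	size_t kept = 0;

	assert(pages != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (demo_gets_extra_round(&pages[i]))
			kept++;
	return kept;
}
```

Scanning four pages that share one mapping keeps only the first of them; the
other three see the flag already cleared.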

Regards,
Zheng

On Tue, Mar 13, 2012 at 02:30:14PM +0800, Zheng Liu wrote:
> This only a first trivial try.  If this flag is set, reclaimer just give this
> page one more round trip rather than promote it into active list.  Any comments
> or advices are welcomed.
> 
> Regards,
> Zheng
> 
> [PATCH] mm: per-inode mmaped page reclaim
> 
> From: Zheng Liu <wenqing.lz@taobao.com>
> 
> In some cases, user wants to control mmaped page reclaim granularity.  A new
> flag is added into struct address_space to give the page one more round trip.
> AS_WORKINGSET flag cannot be added in vma->vm_flags because this flag has no
> room for a new flag in 32 bit.  Now user can call madvise(2) to set this flag
> for a file.  If this flag is set, all pages will be given one more round trip
> when reclaimer tries to shrink pages.
> 
> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> ---
>  include/asm-generic/mman-common.h |    2 ++
>  include/linux/pagemap.h           |   16 ++++++++++++++++
>  mm/madvise.c                      |    8 ++++++++
>  mm/vmscan.c                       |   15 +++++++++++++++
>  4 files changed, 41 insertions(+), 0 deletions(-)
> 
> diff --git a/include/asm-generic/mman-common.h b/include/asm-generic/mman-common.h
> index 787abbb..7d26c9b 100644
> --- a/include/asm-generic/mman-common.h
> +++ b/include/asm-generic/mman-common.h
> @@ -48,6 +48,8 @@
>  #define MADV_HUGEPAGE	14		/* Worth backing with hugepages */
>  #define MADV_NOHUGEPAGE	15		/* Not worth backing with hugepages */
>  
> +#define MADV_WORKINGSET 16		/* give one more round trip */
> +
>  /* compatibility flags */
>  #define MAP_FILE	0
>  
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index cfaaa69..80532a0 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -24,6 +24,7 @@ enum mapping_flags {
>  	AS_ENOSPC	= __GFP_BITS_SHIFT + 1,	/* ENOSPC on async write */
>  	AS_MM_ALL_LOCKS	= __GFP_BITS_SHIFT + 2,	/* under mm_take_all_locks() */
>  	AS_UNEVICTABLE	= __GFP_BITS_SHIFT + 3,	/* e.g., ramdisk, SHM_LOCK */
> +	AS_WORKINGSET	= __GFP_BITS_SHIFT + 4, /* give one more round trip */
>  };
>  
>  static inline void mapping_set_error(struct address_space *mapping, int error)
> @@ -36,6 +37,21 @@ static inline void mapping_set_error(struct address_space *mapping, int error)
>  	}
>  }
>  
> +static inline void mapping_set_workingset(struct address_space *mapping)
> +{
> +	set_bit(AS_WORKINGSET, &mapping->flags);
> +}
> +
> +static inline void mapping_clear_workingset(struct address_space *mapping)
> +{
> +	clear_bit(AS_WORKINGSET, &mapping->flags);
> +}
> +
> +static inline int mapping_test_workingset(struct address_space *mapping)
> +{
> +	return mapping && test_bit(AS_WORKINGSET, &mapping->flags);
> +}
> +
>  static inline void mapping_set_unevictable(struct address_space *mapping)
>  {
>  	set_bit(AS_UNEVICTABLE, &mapping->flags);
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 74bf193..8ca6c9b 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -77,6 +77,14 @@ static long madvise_behavior(struct vm_area_struct * vma,
>  		if (error)
>  			goto out;
>  		break;
> +	case MADV_WORKINGSET:
> +		if (vma->vm_file && vma->vm_file->f_mapping) {
> +			mapping_set_workingset(vma->vm_file->f_mapping);
> +		} else {
> +			error = -EPERM;
> +			goto out;
> +		}
> +		break;
>  	}
>  
>  	if (new_flags == vma->vm_flags) {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c52b235..51f745b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -721,6 +721,15 @@ static enum page_references page_check_references(struct page *page,
>  	if (vm_flags & VM_LOCKED)
>  		return PAGEREF_RECLAIM;
>  
> +	/*
> +	 * give this page one more round trip because workingset
> +	 * flag is set.
> +	 */
> +	if (mapping_test_workingset(page_mapping(page))) {
> +		mapping_clear_workingset(page_mapping(page));
> +		return PAGEREF_KEEP;
> +	}
> +
>  	if (referenced_ptes) {
>  		if (PageAnon(page))
>  			return PAGEREF_ACTIVATE;
> @@ -1737,6 +1746,12 @@ static void shrink_active_list(unsigned long nr_to_scan,
>  			continue;
>  		}
>  
> +		if (mapping_test_workingset(page_mapping(page))) {
> +			mapping_clear_workingset(page_mapping(page));
> +			list_add(&page->lru, &l_active);
> +			continue;
> +		}
> +
>  		if (page_referenced(page, 0, mz->mem_cgroup, &vm_flags)) {
>  			nr_rotated += hpage_nr_pages(page);
>  			/*
> -- 
> 1.7.4.1
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Fwd: Control page reclaim granularity
  2012-03-13  6:48                       ` Zheng Liu
@ 2012-03-13  7:21                         ` Konstantin Khlebnikov
  2012-03-13  7:43                           ` Kautuk Consul
  0 siblings, 1 reply; 32+ messages in thread
From: Konstantin Khlebnikov @ 2012-03-13  7:21 UTC (permalink / raw)
  To: minchan, linux-mm, linux-kernel, riel, kosaki.motohiro, Zheng Liu

Zheng Liu wrote:
> Sorry, please forgive me.  This patch has a defect.  When one page is
> scaned and flag is clear, all other's flags also are clear too.

Yeah, funny patch =)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Fwd: Control page reclaim granularity
  2012-03-13  7:21                         ` Konstantin Khlebnikov
@ 2012-03-13  7:43                           ` Kautuk Consul
  2012-03-13  7:47                             ` Kautuk Consul
  0 siblings, 1 reply; 32+ messages in thread
From: Kautuk Consul @ 2012-03-13  7:43 UTC (permalink / raw)
  To: minchan, riel, kosaki.motohiro, Zheng Liu, linux-mm; +Cc: linux-kernel

Hi,

I noticed this discussion and decided to pitch in one small idea from my side.

It would be nice to range lock an inode's pages by storing those
ranges which would be locked.
This could also add some good routines for the kernel in terms of
range locking for a single inode.
However, wouldn't this add some overhead to shrink_page_list() since
that code would need to go through
all these ranges while trying to reclaim a single page ?

One small suggestion from my side is:
Why don't we implement something like : "Complete page-cache reclaim
control from usermode"?
In this, we can set/unset the mapping to AS_UNEVICTABLE (as Konstantin
mentioned) for a file's
inode from usermode by using ioctl or fcntl or maybe even go as far as
implementing an O_NORECL
option to the open system call.

After setting the AS_UNEVICTABLE, the usermode application can choose
to keep and remove pages by
using the fadvise(WILLNEED) and fadvise(DONTNEED).

( I think maybe the presence of any VMA might not really be
required for this idea. )

Thanks,
Kautuk.
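
The fadvise half of this proposal already exists in userspace as
posix_fadvise().  The AS_UNEVICTABLE control through ioctl, fcntl or
O_NORECL is only proposed in this thread and is not shown here.  A minimal
sketch, assuming a regular file; cache_prefetch and cache_drop are
illustrative names, and the kernel treats both calls as hints.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Ask the kernel to start reading the whole file into the page cache.
 * Offset 0 with length 0 means "from the start to the end of the file".
 * Returns 0 on success or an error number.
 */
static int cache_prefetch(int fd)
{
	assert(fd >= 0);
	return posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
}

/*
 * Tell the kernel the cached pages of the file are no longer needed.
 * Dirty pages cannot be dropped, so flush them to storage first.
 * Returns 0 on success or an error number.
 */
static int cache_drop(int fd)
{
	assert(fd >= 0);
	if (fsync(fd) != 0)
		return errno;
	return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
}
```

Note that posix_fadvise() returns the error number directly instead of
setting errno, so both helpers can pass its result straight through.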

* Re: Fwd: Control page reclaim granularity
  2012-03-13  7:43                           ` Kautuk Consul
@ 2012-03-13  7:47                             ` Kautuk Consul
  2012-03-13  8:05                               ` Zheng Liu
  0 siblings, 1 reply; 32+ messages in thread
From: Kautuk Consul @ 2012-03-13  7:47 UTC (permalink / raw)
  To: minchan, riel, kosaki.motohiro, Zheng Liu, linux-mm; +Cc: linux-kernel

On Tue, Mar 13, 2012 at 1:13 PM, Kautuk Consul <consul.kautuk@gmail.com> wrote:
> Hi,
>
> I noticed this discussion and decided to pitch in one small idea from my side.
>
> It would be nice to range lock an inode's pages by storing those
> ranges which would be locked.
> This could also add some good routines for the kernel in terms of
> range locking for a single inode.
> However, wouldn't this add some overhead to shrink_page_list() since
> that code would need to go through
> all these ranges while trying to reclaim a single page ?
>
> One small suggestion from my side is:
> Why don't we implement something like : "Complete page-cache reclaim
> control from usermode"?
> In this, we can set/unset the mapping to AS_UNEVICTABLE (as Konstantin
> mentioned) for a file's
> inode from usermode by using ioctl or fcntl or maybe even go as far as
> implementing an O_NORECL
> option to the open system call.
>

Of course, only an application executing with root privileges should
be allowed to set the inode's
mapping flags in this manner.


> After setting the AS_UNEVICTABLE, the usermode application can choose
> to keep and remove pages by
> using the fadvise(WILLNEED) and fadvise(DONTNEED).
>
> ( I think maybe the presence of any VMA might not really be
> required for this idea. )
>
> Thanks,
> Kautuk.

* Re: Fwd: Control page reclaim granularity
  2012-03-13  8:05                               ` Zheng Liu
@ 2012-03-13  8:04                                 ` Kautuk Consul
  2012-03-13  8:08                                   ` Kautuk Consul
  0 siblings, 1 reply; 32+ messages in thread
From: Kautuk Consul @ 2012-03-13  8:04 UTC (permalink / raw)
  To: Kautuk Consul, minchan, riel, kosaki.motohiro, Zheng Liu,
	linux-mm, linux-kernel

>
> Hi Kautuk,
>
> IMHO, running an application with root privileges is too dangerous.  We
> should avoid it.
>

I agree, but that's not my point.

All I'm saying is that we probably don't want to give normal
unprivileged usermode apps
the capability to set the mapping to AS_UNEVICTABLE as anyone can then
write an application
that hogs memory without allowing the kernel to free it through memory reclaim.

* Re: Fwd: Control page reclaim granularity
  2012-03-13  7:47                             ` Kautuk Consul
@ 2012-03-13  8:05                               ` Zheng Liu
  2012-03-13  8:04                                 ` Kautuk Consul
  0 siblings, 1 reply; 32+ messages in thread
From: Zheng Liu @ 2012-03-13  8:05 UTC (permalink / raw)
  To: Kautuk Consul
  Cc: minchan, riel, kosaki.motohiro, Zheng Liu, linux-mm, linux-kernel

On Tue, Mar 13, 2012 at 01:17:41PM +0530, Kautuk Consul wrote:
> On Tue, Mar 13, 2012 at 1:13 PM, Kautuk Consul <consul.kautuk@gmail.com> wrote:
> > Hi,
> >
> > I noticed this discussion and decided to pitch in one small idea from my side.
> >
> > It would be nice to range lock an inode's pages by storing those
> > ranges which would be locked.
> > This could also add some good routines for the kernel in terms of
> > range locking for a single inode.
> > However, wouldn't this add some overhead to shrink_page_list() since
> > that code would need to go through
> > all these ranges while trying to reclaim a single page ?
> >
> > One small suggestion from my side is:
> > Why don't we implement something like : "Complete page-cache reclaim
> > control from usermode"?
> > In this, we can set/unset the mapping to AS_UNEVICTABLE (as Konstantin
> > mentioned) for a file's
> > inode from usermode by using ioctl or fcntl or maybe even go as far as
> > implementing an O_NORECL
> > option to the open system call.
> >
> 
> Of course, only an application executing with root privileges should
> be allowed to set the inode's
> mapping flags in this manner.

Hi Kautuk,

IMHO, running an application with root privileges is too dangerous.  We
should avoid it.

Regards,
Zheng

> 
> 
> > After setting the AS_UNEVICTABLE, the usermode application can choose
> > to keep and remove pages by
> > using the fadvise(WILLNEED) and fadvise(DONTNEED).
> >
> > ( I think maybe the presence of any VMA might not really be
> > required for this idea. )
> >
> > Thanks,
> > Kautuk.

* Re: Fwd: Control page reclaim granularity
  2012-03-13  8:04                                 ` Kautuk Consul
@ 2012-03-13  8:08                                   ` Kautuk Consul
  2012-03-13  8:28                                     ` Zheng Liu
  0 siblings, 1 reply; 32+ messages in thread
From: Kautuk Consul @ 2012-03-13  8:08 UTC (permalink / raw)
  To: Kautuk Consul, minchan, riel, kosaki.motohiro, Zheng Liu,
	linux-mm, linux-kernel

>
> I agree, but that's not my point.
>
> All I'm saying is that we probably don't want to give normal
> unprivileged usermode apps
> the capability to set the mapping to AS_UNEVICTABLE as anyone can then
> write an application
> that hogs memory without allowing the kernel to free it through memory reclaim.

Sorry, I mean :
"... that hogs kernel unmapped page-cache memory without allowing the
kernel to free it through memory reclaim."

* Re: Fwd: Control page reclaim granularity
  2012-03-13  8:08                                   ` Kautuk Consul
@ 2012-03-13  8:28                                     ` Zheng Liu
  2012-03-13  8:36                                       ` Kautuk Consul
  0 siblings, 1 reply; 32+ messages in thread
From: Zheng Liu @ 2012-03-13  8:28 UTC (permalink / raw)
  To: Kautuk Consul
  Cc: minchan, riel, kosaki.motohiro, Zheng Liu, linux-mm, linux-kernel

On Tue, Mar 13, 2012 at 01:38:56PM +0530, Kautuk Consul wrote:
> >
> > I agree, but that's not my point.
> >
> > All I'm saying is that we probably don't want to give normal
> > unprivileged usermode apps
> > the capability to set the mapping to AS_UNEVICTABLE as anyone can then
> > write an application
> > that hogs memory without allowing the kernel to free it through memory reclaim.

Yes, I think so.  But it seems that some existing code can already be
abused.  For example, as I said previously, applications can mmap a
normal data file with the PROT_EXEC flag.  Then this file gets a high
priority to stay in memory (commit: 8cab4754).  So my point is that we
cannot control how applications use these mechanisms.  We just provide
them and let applications choose how to use them.
:-)

Regards,
Zheng

> 
> Sorry, I mean :
> "... that hogs kernel unmapped page-cache memory without allowing the
> kernel to free it through memory reclaim."
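
For reference, the PROT_EXEC trick discussed here looks roughly like this
from an application.  This is a sketch only: map_index_exec is a
hypothetical helper, the data is never executed, and whether the mapping
gets any reclaim preference depends on the kernel in use (the thread
attributes that behavior to commit 8cab4754).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Map a whole data file read-only, but also request PROT_EXEC.
 * On success returns the mapping and stores its length in *len_out.
 * On failure returns MAP_FAILED with errno set, for example EPERM
 * when the file lives on a filesystem mounted noexec.
 */
static void *map_index_exec(const char *path, size_t *len_out)
{
	struct stat st;
	void *p;
	int saved;
	int fd;

	assert(path != NULL && len_out != NULL);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return MAP_FAILED;

	if (fstat(fd, &st) != 0) {
		saved = errno;
		close(fd);
		errno = saved;
		return MAP_FAILED;
	}
	if (st.st_size <= 0) {
		close(fd);
		errno = EINVAL;	/* mmap() rejects zero-length mappings */
		return MAP_FAILED;
	}

	p = mmap(NULL, (size_t)st.st_size, PROT_READ | PROT_EXEC,
		 MAP_SHARED, fd, 0);
	saved = errno;
	close(fd);	/* the mapping stays valid after the fd is closed */
	errno = saved;

	if (p != MAP_FAILED)
		*len_out = (size_t)st.st_size;
	return p;
}
```

Nothing stops an unprivileged process from doing this with any readable
file on an exec-capable mount, which is the abuse potential Zheng points at.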

* Re: Fwd: Control page reclaim granularity
  2012-03-13  8:28                                     ` Zheng Liu
@ 2012-03-13  8:36                                       ` Kautuk Consul
  2012-03-13  9:03                                         ` Kautuk Consul
  0 siblings, 1 reply; 32+ messages in thread
From: Kautuk Consul @ 2012-03-13  8:36 UTC (permalink / raw)
  To: minchan, riel, kosaki.motohiro, Zheng Liu, linux-mm, linux-kernel

>
> Yes, I think so.  But it seems that some existing code can already be
> abused.  For example, as I said previously, applications can mmap a
> normal data file with the PROT_EXEC flag.  Then this file gets a high
> priority to stay in memory (commit: 8cab4754).  So my point is that we
> cannot control how applications use these mechanisms.  We just provide
> them and let applications choose how to use them.
> :-)
>

That's true, but we are not talking about higher priority here,
because in extreme memory reclaim case
even PROT_EXEC pages will be reclaimed.

But I understand your point. It might be okay to have this for
applications at all privilege levels.

The only problem that might happen might be in OOM because we will
have to include selection points for
these page-cache pages (proportionately) while finding the most
expensive process to kill.
( I'm talking about the page-cache pages which are not mapped to
usermode page-tables at all. )

If any usermode application reads in an extremely huge file, whose
inode has been set to AS_UNEVICTABLE,
we might want to kill those applications that read in those
pages(proportionately) so that the guilty application
can be killed.

* Re: Fwd: Control page reclaim granularity
  2012-03-13  8:36                                       ` Kautuk Consul
@ 2012-03-13  9:03                                         ` Kautuk Consul
  0 siblings, 0 replies; 32+ messages in thread
From: Kautuk Consul @ 2012-03-13  9:03 UTC (permalink / raw)
  To: minchan, riel, kosaki.motohiro, Zheng Liu, linux-mm, linux-kernel

> The only problem that might happen might be in OOM because we will
> have to include selection points for
> these page-cache pages (proportionately) while finding the most
> expensive process to kill.
> ( I'm talking about the page-cache pages which are not mapped to
> usermode page-tables at all. )
>
> If any usermode application reads in an extremely huge file, whose
> inode has been set to AS_UNEVICTABLE,
> we might want to kill those applications that read in those
> pages(proportionately) so that the guilty application
> can be killed.

On some more thought, I guess for OOM and proportionate working set accounting,
the approach mentioned by Konstantin (with fake VMA) should work fine
with respect to the
way oom_kill.c accounts for virtual address size of kill candidates.

So, I now think that the best way might indeed be to have a fake VMA
to account for the
page-cache pages not mapped to usermode.
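
The accounting concern can be illustrated with a toy victim selection.
Everything below is hypothetical: toy_task, toy_badness and
toy_select_victim are invented for illustration and do not reflect how
oom_kill.c actually scores processes.  The sketch only shows why pinned,
unmapped page-cache pages have to be charged to someone (for example
through a fake VMA) for the guilty process to be picked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-task counters, in pages. */
struct toy_task {
	const char *name;
	unsigned long rss_pages;		/* mapped into page tables */
	unsigned long pinned_cache_pages;	/* unmapped, unevictable cache */
};

/* Score a task; optionally include the page cache it has pinned. */
static unsigned long toy_badness(const struct toy_task *t, bool count_pinned)
{
	unsigned long points = t->rss_pages;

	if (count_pinned)
		points += t->pinned_cache_pages;
	return points;
}

/* Return the task with the highest score, or NULL if there are none. */
static const struct toy_task *toy_select_victim(const struct toy_task *tasks,
						size_t n, bool count_pinned)
{
	const struct toy_task *victim = NULL;
	unsigned long worst = 0;

	assert(tasks != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		unsigned long points = toy_badness(&tasks[i], count_pinned);

		if (victim == NULL || points > worst) {
			victim = &tasks[i];
			worst = points;
		}
	}
	return victim;
}
```

With a task that has a small RSS but a huge pinned file, the selection
only lands on that task when the pinned pages are counted.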

* Re: Control page reclaim granularity
  2012-03-13  2:57     ` Minchan Kim
@ 2012-03-13 14:57       ` Rik van Riel
  0 siblings, 0 replies; 32+ messages in thread
From: Rik van Riel @ 2012-03-13 14:57 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-mm, linux-kernel, Konstantin Khlebnikov, kosaki.motohiro

On 03/12/2012 10:57 PM, Minchan Kim wrote:
> On Mon, Mar 12, 2012 at 10:55:24AM -0400, Rik van Riel wrote:
>> On 03/08/2012 04:35 AM, Minchan Kim wrote:

>>> Before we were trying to keep mapped pages in memory(See calc_reclaim_mapped).
>>> But we removed that routine when we applied split lru page replacement.
>>> Rik, KOSAKI. What's the rationale?
>>
>> One main reason is scalability.  We have to treat pages
>> in such a way that we do not have to search through
>> gigabytes of memory to find a few eviction candidates
>> to place on the inactive list - where they could get
>> reused and stopped from eviction again.
>
> Okay. Thanks, Rik.
> Then, another question.
> Why did we handle mmaped page specially at that time?
> Just out of curiosity.

We had to, because we had only one set of LRU lists.

Something had to be done to keep streaming IO from pushing
other things out of memory.

-- 
All rights reversed

* Re: Control page reclaim granularity
  2012-03-08 16:13   ` Zheng Liu
@ 2012-03-14  7:19     ` Greg Thelen
  0 siblings, 0 replies; 32+ messages in thread
From: Greg Thelen @ 2012-03-14  7:19 UTC (permalink / raw)
  To: linux-mm, Zheng Liu; +Cc: linux-kernel, Konstantin Khlebnikov

Zheng Liu <gnehzuil.liu@gmail.com> writes:
> Hi Greg,
>
> Sorry, I forgot to say that I am not subscribed to the linux-mm and
> linux-kernel mailing lists.  So please Cc me.
>
> I am glad to receive your reply and I am very interested in your
> approach.  Actually I am not very familiar with CGroup.  So would you
> please send your patch to me if you can?  Thank you all the same.
>
> Regards,
> Zheng

Sorry for the delay, I had trouble finding my old prototype patch.  The
patch below is based on v2.6.34.  The patch is just an idea, not a
complete solution.

From b1b127e0e1443446d51353b0d7a776bddc046009 Mon Sep 17 00:00:00 2001
From: Greg Thelen <gthelen@google.com>
Date: Sat, 5 Jun 2010 17:26:06 -0700
Subject: [PATCH] memcg: prototype of dentry/cgroup binding.

JUST A PROTOTYPE: DO NOT SUBMIT

This creates a /dev/cgroup/memory/X/memory.dir_roots file which one can
use to register directory file descriptors.  The idea is that future
charges to registered directories, including child inodes, will be
billed to memcg X rather than whatever memcg the faulting process runs
within.
---
 fs/dcache.c                |    4 +++
 include/linux/dcache.h     |    1 +
 include/linux/memcontrol.h |    2 +-
 mm/filemap.c               |    3 ++
 mm/memcontrol.c            |   64 ++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 73 insertions(+), 1 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index f1358e5..dda48d7 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -70,6 +70,7 @@ struct dentry_stat_t dentry_stat = {
 static void __d_free(struct dentry *dentry)
 {
 	WARN_ON(!list_empty(&dentry->d_alias));
+	BUG_ON(dentry->d_mem);
 	if (dname_external(dentry))
 		kfree(dentry->d_name.name);
 	kmem_cache_free(dentry_cache, dentry); 
@@ -172,6 +173,7 @@ static struct dentry *d_kill(struct dentry *dentry)
 	struct dentry *parent;
 
 	list_del(&dentry->d_u.d_child);
+	mem_cgroup_disassociate_from_dentry(dentry);
 	dentry_stat.nr_dentry--;	/* For d_free, below */
 	/*drops the locks, at that point nobody can reach this dentry */
 	dentry_iput(dentry);
@@ -953,6 +955,7 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
 	dentry->d_inode = NULL;
 	dentry->d_parent = NULL;
 	dentry->d_sb = NULL;
+	dentry->d_mem = NULL;
 	dentry->d_op = NULL;
 	dentry->d_fsdata = NULL;
 	dentry->d_mounted = 0;
@@ -964,6 +967,7 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
 	if (parent) {
 		dentry->d_parent = dget(parent);
 		dentry->d_sb = parent->d_sb;
+		dentry->d_mem = parent->d_mem;
 	} else {
 		INIT_LIST_HEAD(&dentry->d_u.d_child);
 	}
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index eebb617..523d58b 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -114,6 +114,7 @@ struct dentry {
 	unsigned long d_time;		/* used by d_revalidate */
 	const struct dentry_operations *d_op;
 	struct super_block *d_sb;	/* The root of the dentry tree */
+	struct mem_cgroup *d_mem;	/* Optional memcg */
 	void *d_fsdata;			/* fs-specific data */
 
 	unsigned char d_iname[DNAME_INLINE_LEN_MIN];	/* small names */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 44301c6..a8b54f9 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -71,6 +71,7 @@ extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
 					struct mem_cgroup *mem_cont,
 					int active, int file);
 extern void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask);
+void mem_cgroup_disassociate_from_dentry(struct dentry *dentry);
 int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
 
 extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
@@ -309,4 +310,3 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #endif /* _LINUX_MEMCONTROL_H */
-
diff --git a/mm/filemap.c b/mm/filemap.c
index 140ebda..a9a525b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -400,8 +400,11 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 
 	VM_BUG_ON(!PageLocked(page));
 
+	VM_BUG_ON(page->mapping != NULL);
+	page->mapping = mapping; /* XXX: hack? */
 	error = mem_cgroup_cache_charge(page, current->mm,
 					gfp_mask & GFP_RECLAIM_MASK);
+	page->mapping = NULL; /* XXX: hack? */
 	if (error)
 		goto out;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8a79a6f..de9f150 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -793,6 +793,23 @@ void mem_cgroup_move_lists(struct page *page,
 	mem_cgroup_add_lru_list(page, to);
 }
 
+static void mem_cgroup_associate_dentry(struct mem_cgroup *mem,
+					struct dentry *dentry)
+{
+	css_get(&mem->css);
+	BUG_ON(dentry->d_mem);
+	dentry->d_mem = mem;
+}
+
+void mem_cgroup_disassociate_from_dentry(struct dentry *dentry)
+{
+	if (!dentry->d_mem)
+		return;
+
+	css_put(&dentry->d_mem->css);
+	dentry->d_mem = NULL;
+}
+
 int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem)
 {
 	int ret;
@@ -1914,6 +1931,29 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
 		return 0;
 	prefetchw(pc);
 
+	/*
+	 * If the page is inode and related dentry indicates a cgroup, then
+	 * charge that cgroup.  Otherwise fallback on the mm's cgroup.
+	 *
+	 * TODO(gthelen): this needs more thought.
+	 */
+	if ((memcg == NULL) && !PageAnon(page)) {
+		struct address_space *as;
+		struct inode *inode;
+		struct dentry *dentry;
+
+		/* what kind of locking is needed to walk this?  dcache_lock (gulp)? */
+		as = (struct address_space *)page_rmapping(page);
+		if (as != NULL) {
+			inode = as->host;
+			BUG_ON(inode == NULL);
+			list_for_each_entry(dentry, &inode->i_dentry, d_alias) {
+				memcg = dentry->d_mem;
+				break;
+			}
+		}
+	}
+
 	mem = memcg;
 	ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true);
 	if (ret || !mem)
@@ -3539,6 +3579,26 @@ unlock:
 	return ret;
 }
 
+static int mem_cgroup_dir_roots_write(struct cgroup *cgrp, struct cftype *cft,
+				      u64 dir_fd)
+{
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+	struct file *dir;
+	int status = 0;
+
+	dir = fget(dir_fd);
+	if (!dir)
+		return -EINVAL;
+
+	if (dir->f_dentry->d_mem)
+		status = -EINVAL;
+	else
+		mem_cgroup_associate_dentry(mem, dir->f_dentry);
+
+	fput(dir);
+	return status;
+}
+
 static struct cftype mem_cgroup_files[] = {
 	{
 		.name = "usage_in_bytes",
@@ -3594,6 +3654,10 @@ static struct cftype mem_cgroup_files[] = {
 		.read_u64 = mem_cgroup_move_charge_read,
 		.write_u64 = mem_cgroup_move_charge_write,
 	},
+	{
+		.name = "dir_roots",
+		.write_u64  = mem_cgroup_dir_roots_write,
+	},
 };
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
-- 
1.7.7.3


end of thread, other threads:[~2012-03-14  7:20 UTC | newest]

Thread overview: 32+ messages
2012-03-08  7:34 Control page reclaim granularity Zheng Liu
2012-03-08  8:39 ` Greg Thelen
2012-03-08 16:13   ` Zheng Liu
2012-03-14  7:19     ` Greg Thelen
2012-03-08  9:35 ` Minchan Kim
2012-03-08 16:54   ` Zheng Liu
2012-03-12  0:28     ` Minchan Kim
2012-03-12  2:06       ` Fwd: " Zheng Liu
2012-03-12  5:19         ` Minchan Kim
2012-03-12  6:20           ` Konstantin Khlebnikov
2012-03-12  8:14             ` Zheng Liu
2012-03-12 13:42               ` Minchan Kim
2012-03-12 14:18                 ` Konstantin Khlebnikov
2012-03-13  2:48                   ` Minchan Kim
2012-03-13  4:37                     ` Konstantin Khlebnikov
2012-03-13  5:00                       ` Konstantin Khlebnikov
2012-03-13  6:30                     ` Zheng Liu
2012-03-13  6:48                       ` Zheng Liu
2012-03-13  7:21                         ` Konstantin Khlebnikov
2012-03-13  7:43                           ` Kautuk Consul
2012-03-13  7:47                             ` Kautuk Consul
2012-03-13  8:05                               ` Zheng Liu
2012-03-13  8:04                                 ` Kautuk Consul
2012-03-13  8:08                                   ` Kautuk Consul
2012-03-13  8:28                                     ` Zheng Liu
2012-03-13  8:36                                       ` Kautuk Consul
2012-03-13  9:03                                         ` Kautuk Consul
2012-03-12 15:15                 ` Zheng Liu
2012-03-13  2:51                   ` Minchan Kim
2012-03-12 14:55   ` Rik van Riel
2012-03-13  2:57     ` Minchan Kim
2012-03-13 14:57       ` Rik van Riel
