linux-kernel.vger.kernel.org archive mirror
* hugetlb pages not accounted for in rss
@ 2015-07-27 23:26 Mike Kravetz
  2015-07-28 18:32 ` Jörn Engel
  0 siblings, 1 reply; 71+ messages in thread
From: Mike Kravetz @ 2015-07-27 23:26 UTC (permalink / raw)
  To: linux-mm, linux-kernel; +Cc: Jörn Engel

I started looking at the hugetlb self tests.  The test hugetlbfstest
expects hugetlb pages to be accounted for in rss.  However, there is
no code in the kernel to do this accounting.

It looks like there was an effort to add the accounting back in 2013.
The test program made it into the tree, but the accounting code did not.

The easiest way to resolve this issue would be to remove the test and
perhaps document that hugetlb pages are not accounted for in rss.
However, it does seem like a big oversight that hugetlb pages are not
accounted for in rss.  From a quick scan of the code it appears THP
pages are properly accounted for.
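
For reference, the THP fault path charges the whole huge page to the rss
counters in small-page units, roughly like this (from memory, so treat it
as a sketch rather than the exact code):

	/* anonymous THP fault: charge the whole huge page to rss */
	add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);

	/* and the zap/munmap path drops it again */
	add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);

Nothing equivalent happens when a hugetlb page is faulted in or unmapped.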

Thoughts?
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: hugetlb pages not accounted for in rss
  2015-07-27 23:26 hugetlb pages not accounted for in rss Mike Kravetz
@ 2015-07-28 18:32 ` Jörn Engel
  2015-07-28 21:15   ` Mike Kravetz
  0 siblings, 1 reply; 71+ messages in thread
From: Jörn Engel @ 2015-07-28 18:32 UTC (permalink / raw)
  To: Mike Kravetz; +Cc: linux-mm, linux-kernel

On Mon, Jul 27, 2015 at 04:26:47PM -0700, Mike Kravetz wrote:
> I started looking at the hugetlb self tests.  The test hugetlbfstest
> expects hugetlb pages to be accounted for in rss.  However, there is
> no code in the kernel to do this accounting.
> 
> It looks like there was an effort to add the accounting back in 2013.
> The test program made it into the tree, but the accounting code did not.

My apologies.  Upstream work always gets axed first when I run out of
time - which happens more often than not.

> The easiest way to resolve this issue would be to remove the test and
> perhaps document that hugetlb pages are not accounted for in rss.
> However, it does seem like a big oversight that hugetlb pages are not
> accounted for in rss.  From a quick scan of the code it appears THP
> pages are properly accounted for.
> 
> Thoughts?

Unsurprisingly I agree that hugepages should count towards rss.  Keeping
the test in keeps us honest.  Actually fixing the issue would make us
honest and correct.

Increasingly we have tiny processes (by rss) that actually consume large
fractions of total memory.  Makes rss somewhat useless as a measure of
anything.

Jörn

--
Consensus is no proof!
-- John Naisbitt

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: hugetlb pages not accounted for in rss
  2015-07-28 18:32 ` Jörn Engel
@ 2015-07-28 21:15   ` Mike Kravetz
  2015-07-28 22:15     ` David Rientjes
  0 siblings, 1 reply; 71+ messages in thread
From: Mike Kravetz @ 2015-07-28 21:15 UTC (permalink / raw)
  To: Jörn Engel; +Cc: linux-mm, linux-kernel

On 07/28/2015 11:32 AM, Jörn Engel wrote:
> On Mon, Jul 27, 2015 at 04:26:47PM -0700, Mike Kravetz wrote:
>> I started looking at the hugetlb self tests.  The test hugetlbfstest
>> expects hugetlb pages to be accounted for in rss.  However, there is
>> no code in the kernel to do this accounting.
>>
>> It looks like there was an effort to add the accounting back in 2013.
>> The test program made it into the tree, but the accounting code did not.
>
> My apologies.  Upstream work always gets axed first when I run out of
> time - which happens more often than not.

No worries, I just noticed the inconsistency between the test program and
the lack of supporting code in the kernel.

>> The easiest way to resolve this issue would be to remove the test and
>> perhaps document that hugetlb pages are not accounted for in rss.
>> However, it does seem like a big oversight that hugetlb pages are not
>> accounted for in rss.  From a quick scan of the code it appears THP
>> pages are properly accounted for.
>>
>> Thoughts?
>
> Unsurprisingly I agree that hugepages should count towards rss.  Keeping
> the test in keeps us honest.  Actually fixing the issue would make us
> honest and correct.
>
> Increasingly we have tiny processes (by rss) that actually consume large
> fractions of total memory.  Makes rss somewhat useless as a measure of
> anything.

I'll take a look at what it would take to get the accounting in place.
-- 
Mike Kravetz

>
> Jörn
>
> --
> Consensus is no proof!
> -- John Naisbitt
>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: hugetlb pages not accounted for in rss
  2015-07-28 21:15   ` Mike Kravetz
@ 2015-07-28 22:15     ` David Rientjes
  2015-07-28 22:26       ` Jörn Engel
  0 siblings, 1 reply; 71+ messages in thread
From: David Rientjes @ 2015-07-28 22:15 UTC (permalink / raw)
  To: Mike Kravetz; +Cc: Jörn Engel, linux-mm, linux-kernel

On Tue, 28 Jul 2015, Mike Kravetz wrote:

> > > The easiest way to resolve this issue would be to remove the test and
> > > perhaps document that hugetlb pages are not accounted for in rss.
> > > However, it does seem like a big oversight that hugetlb pages are not
> > > accounted for in rss.  From a quick scan of the code it appears THP
> > > pages are properly accounted for.
> > > 
> > > Thoughts?
> > 
> > Unsurprisingly I agree that hugepages should count towards rss.  Keeping
> > the test in keeps us honest.  Actually fixing the issue would make us
> > honest and correct.
> > 
> > Increasingly we have tiny processes (by rss) that actually consume large
> > fractions of total memory.  Makes rss somewhat useless as a measure of
> > anything.
> 
> I'll take a look at what it would take to get the accounting in place.

I'm not sure that I would agree that accounting hugetlb pages in rss would 
always be appropriate.

For reserved hugetlb pages, not surplus, the hugetlb pages are always 
resident even when unmapped.  Unmapping the memory is not going to cause 
them to be freed.  That's different from thp where the hugepages are 
actually freed when you do munmap().

The oom killer looks at rss as the metric to determine which process to 
kill that will result in a large amount of memory freeing.  If hugetlb 
pages are accounted in rss, this may lead to unnecessary killing since 
little memory may be freed as a result.
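
(For context, oom_badness() scores candidates roughly like this - quoting
from memory, so the details may be slightly off:

	points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
		 atomic_long_read(&p->mm->nr_ptes) + mm_nr_pmds(p->mm);

so counting hugetlb pages in rss would inflate the score without any
corresponding memory actually being freed by the kill.)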

For that reason, we've added hugetlb statistics to the oom killer output 
since we've been left wondering in the past where all the memory on the 
system went :)

We also have a separate hugetlb cgroup that tracks hugetlb memory usage 
rather than memcg.

Starting to account hugetlb pages in rss may lead to breakage in userspace 
and I would agree with your earlier suggestion that just removing any test 
for rss would be appropriate.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: hugetlb pages not accounted for in rss
  2015-07-28 22:15     ` David Rientjes
@ 2015-07-28 22:26       ` Jörn Engel
  2015-07-28 23:30         ` David Rientjes
  0 siblings, 1 reply; 71+ messages in thread
From: Jörn Engel @ 2015-07-28 22:26 UTC (permalink / raw)
  To: David Rientjes; +Cc: Mike Kravetz, linux-mm, linux-kernel

On Tue, Jul 28, 2015 at 03:15:17PM -0700, David Rientjes wrote:
> 
> Starting to account hugetlb pages in rss may lead to breakage in userspace 
> and I would agree with your earlier suggestion that just removing any test 
> for rss would be appropriate.

What would you propose for me then?  I have 80% RAM or more in reserved
hugepages.  OOM-killer is not a concern, as it panics the system - the
alternatives were almost universally silly and we didn't want to deal
with a system in unpredictable states.  But knowing how much memory is
used by which process is a concern.  And if you only tell me about the
small (and continuously shrinking) portion, I essentially fly blind.

That is not a case of "may lead to breakage", it _is_ broken.

Ideally we would have fixed this in 2002 when hugetlbfs was introduced.
By now we might have to introduce a new field, rss_including_hugepages
or whatever.  Then we have to update tools like top etc. to use the new
field when appropriate.  No fun, but might be necessary.

If we can get away with including hugepages in rss and fixing the OOM
killer to be less silly, I would strongly prefer that.  But I don't know
how much of a mess we are already in.

Jörn

--
Time? What's that? Time is only worth what you do with it.
-- Theo de Raadt

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: hugetlb pages not accounted for in rss
  2015-07-28 22:26       ` Jörn Engel
@ 2015-07-28 23:30         ` David Rientjes
  2015-07-29  0:53           ` Jörn Engel
  0 siblings, 1 reply; 71+ messages in thread
From: David Rientjes @ 2015-07-28 23:30 UTC (permalink / raw)
  To: Jörn Engel; +Cc: Mike Kravetz, linux-mm, linux-kernel

On Tue, 28 Jul 2015, Jörn Engel wrote:

> What would you propose for me then?  I have 80% RAM or more in reserved
> hugepages.  OOM-killer is not a concern, as it panics the system - the
> alternatives were almost universally silly and we didn't want to deal
> with a system in unpredictable states.  But knowing how much memory is
> used by which process is a concern.  And if you only tell me about the
> small (and continuously shrinking) portion, I essentially fly blind.
> 
> That is not a case of "may lead to breakage", it _is_ broken.
> 
> Ideally we would have fixed this in 2002 when hugetlbfs was introduced.
> By now we might have to introduce a new field, rss_including_hugepages
> or whatever.  Then we have to update tools like top etc. to use the new
> field when appropriate.  No fun, but might be necessary.
> 
> If we can get away with including hugepages in rss and fixing the OOM
> killer to be less silly, I would strongly prefer that.  But I don't know
> how much of a mess we are already in.
> 

It's not only the oom killer, I don't believe hugetlb pages are accounted
to the "rss" in memcg.  They use the hugetlb_cgroup for that.  Starting to 
account for them in existing memcg deployments would cause them to hit 
their memory limits much earlier.  The "rss_huge" field in memcg only 
represents transparent hugepages.

I agree with your comment that having done this when hugetlbfs was 
introduced would have been optimal.

It's always difficult to add a new class of memory to an existing metric 
("new" here because it's currently unaccounted).

If we can add yet another process metric to track hugetlbfs memory mapped, 
then the test could be converted to use that.  I'm not sure if the 
justification would be strong enough, but you could try.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: hugetlb pages not accounted for in rss
  2015-07-28 23:30         ` David Rientjes
@ 2015-07-29  0:53           ` Jörn Engel
  2015-07-29 19:08             ` David Rientjes
  0 siblings, 1 reply; 71+ messages in thread
From: Jörn Engel @ 2015-07-29  0:53 UTC (permalink / raw)
  To: David Rientjes; +Cc: Mike Kravetz, linux-mm, linux-kernel

On Tue, Jul 28, 2015 at 04:30:19PM -0700, David Rientjes wrote:
> 
> It's not only the oom killer, I don't believe hugetlb pages are accounted
> to the "rss" in memcg.  They use the hugetlb_cgroup for that.  Starting to 
> account for them in existing memcg deployments would cause them to hit 
> their memory limits much earlier.  The "rss_huge" field in memcg only 
> represents transparent hugepages.
> 
> I agree with your comment that having done this when hugetlbfs was 
> introduced would have been optimal.
> 
> It's always difficult to add a new class of memory to an existing metric 
> ("new" here because it's currently unaccounted).
> 
> If we can add yet another process metric to track hugetlbfs memory mapped, 
> then the test could be converted to use that.  I'm not sure if the 
> justification would be strong enough, but you could try.

Well, we definitely need something.  Having a 100GB process show 3GB of
rss is not very useful.  How would we notice a memory leak if it only
affects hugepages, for example?

Jörn

--
The object-oriented version of 'Spaghetti code' is, of course, 'Lasagna code'.
(Too many layers).
-- Roberto Waltman.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: hugetlb pages not accounted for in rss
  2015-07-29  0:53           ` Jörn Engel
@ 2015-07-29 19:08             ` David Rientjes
  2015-07-29 23:20               ` Mike Kravetz
  0 siblings, 1 reply; 71+ messages in thread
From: David Rientjes @ 2015-07-29 19:08 UTC (permalink / raw)
  To: Jörn Engel; +Cc: Mike Kravetz, linux-mm, linux-kernel

On Tue, 28 Jul 2015, Jörn Engel wrote:

> Well, we definitely need something.  Having a 100GB process show 3GB of
> rss is not very useful.  How would we notice a memory leak if it only
> affects hugepages, for example?
> 

Since the hugetlb pool is a global resource, it would also be helpful to  
determine if a process is mapping more than expected.  You can't do that  
just by adding a huge rss metric, however: if you have 2MB and 1GB
hugepages configured you wouldn't know if a process was mapping 512 2MB   
hugepages or 1 1GB hugepage.
  
That's the purpose of hugetlb_cgroup, after all, and it supports usage 
counters for all hstates.  The test could be converted to use that to 
measure usage if configured in the kernel.
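
Reading those per-hstate counters from userspace is trivial; a rough,
untested sketch (the cgroup mount point and group name below are made up
for the example):

	/* print 2MB and 1GB hugetlb usage for one hugetlb_cgroup */
	#include <stdio.h>

	static long read_counter(const char *path)
	{
		long val = -1;
		FILE *f = fopen(path, "r");

		if (f) {
			if (fscanf(f, "%ld", &val) != 1)
				val = -1;
			fclose(f);
		}
		return val;
	}

	int main(void)
	{
		printf("2MB: %ld bytes\n", read_counter(
			"/sys/fs/cgroup/hugetlb/test/hugetlb.2MB.usage_in_bytes"));
		printf("1GB: %ld bytes\n", read_counter(
			"/sys/fs/cgroup/hugetlb/test/hugetlb.1GB.usage_in_bytes"));
		return 0;
	}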

Beyond that, I'm not sure how a per-hstate rss metric would be exported to 
userspace in a clean way and other ways of obtaining the same data are 
possible with hugetlb_cgroup.  I'm not sure how successful you'd be in 
arguing that we need separate rss counters for it.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: hugetlb pages not accounted for in rss
  2015-07-29 19:08             ` David Rientjes
@ 2015-07-29 23:20               ` Mike Kravetz
  2015-07-30 21:34                 ` Jörn Engel
  2015-08-04  2:55                 ` Naoya Horiguchi
  0 siblings, 2 replies; 71+ messages in thread
From: Mike Kravetz @ 2015-07-29 23:20 UTC (permalink / raw)
  To: David Rientjes, Jörn Engel; +Cc: linux-mm, linux-kernel

On 07/29/2015 12:08 PM, David Rientjes wrote:
> On Tue, 28 Jul 2015, Jörn Engel wrote:
>
>> Well, we definitely need something.  Having a 100GB process show 3GB of
>> rss is not very useful.  How would we notice a memory leak if it only
>> affects hugepages, for example?
>>
>
> Since the hugetlb pool is a global resource, it would also be helpful to
> determine if a process is mapping more than expected.  You can't do that
> just by adding a huge rss metric, however: if you have 2MB and 1GB
> hugepages configured you wouldn't know if a process was mapping 512 2MB
> hugepages or 1 1GB hugepage.
>
> That's the purpose of hugetlb_cgroup, after all, and it supports usage
> counters for all hstates.  The test could be converted to use that to
> measure usage if configured in the kernel.
>
> Beyond that, I'm not sure how a per-hstate rss metric would be exported to
> userspace in a clean way and other ways of obtaining the same data are
> possible with hugetlb_cgroup.  I'm not sure how successful you'd be in
> arguing that we need separate rss counters for it.

If I want to track hugetlb usage on a per-task basis, do I then need to
create one cgroup per task?

For example, suppose I have many tasks using hugetlb and the global pool
is getting low on free pages.  It might be useful to know which tasks are
using hugetlb pages, and how many they are using.

I don't actually have this need (I think), but it appears to be what
Jörn is asking for.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: hugetlb pages not accounted for in rss
  2015-07-29 23:20               ` Mike Kravetz
@ 2015-07-30 21:34                 ` Jörn Engel
  2015-07-31 21:09                   ` David Rientjes
  2015-08-04  2:55                 ` Naoya Horiguchi
  1 sibling, 1 reply; 71+ messages in thread
From: Jörn Engel @ 2015-07-30 21:34 UTC (permalink / raw)
  To: Mike Kravetz; +Cc: David Rientjes, linux-mm, linux-kernel

On Wed, Jul 29, 2015 at 04:20:59PM -0700, Mike Kravetz wrote:
> >
> >Since the hugetlb pool is a global resource, it would also be helpful to
> >determine if a process is mapping more than expected.  You can't do that
> >just by adding a huge rss metric, however: if you have 2MB and 1GB
> >hugepages configured you wouldn't know if a process was mapping 512 2MB
> >hugepages or 1 1GB hugepage.

Fair, although I believe 1GB hugepages are overrated.  If you assume
that per-page overhead is independent of page size (not quite true, but
close enough), going from 1% small pages to 0.8% small pages will
improve performance as much as going from 99% 2MB pages to 99% 1GB
pages.
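
Back-of-the-envelope, assuming 100GB of RAM: 1% small pages is ~262k 4kB
pages plus ~51k 2MB pages for the rest, ~313k pages to manage in total.
Dropping to 0.8% small pages removes ~52k of them; converting the other
99% from 2MB to 1GB pages removes ~50k.  Roughly the same win either way.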

> >That's the purpose of hugetlb_cgroup, after all, and it supports usage
> >counters for all hstates.  The test could be converted to use that to
> >measure usage if configured in the kernel.
> >
> >Beyond that, I'm not sure how a per-hstate rss metric would be exported to
> >userspace in a clean way and other ways of obtaining the same data are
> >possible with hugetlb_cgroup.  I'm not sure how successful you'd be in
> >arguing that we need separate rss counters for it.
> 
> If I want to track hugetlb usage on a per-task basis, do I then need to
> create one cgroup per task?
> 
> For example, suppose I have many tasks using hugetlb and the global pool
> is getting low on free pages.  It might be useful to know which tasks are
> using hugetlb pages, and how many they are using.
> 
> I don't actually have this need (I think), but it appears to be what
> Jörn is asking for.

Maybe some background is useful.  I would absolutely love to use
transparent hugepages.  They are absolutely perfect in every respect,
except for performance.  With transparent hugepages we get higher
latencies.  Small pages are unacceptable, so we are forced to use
non-transparent hugepages.

The part of our system that uses small pages is pretty much constant,
while total system memory follows Moore's law.  When possible we even
try to shrink that part.  Hugepages already dominate today and things
will get worse.

But otherwise we have all the problems that others also have.  There are
memory leaks and we would like to know how much memory each process
actually uses.  Most people use rss, while we have nothing good.  And I
am not sure if cgroup is the correct answer for essentially fixing a
regression introduced in 2002.

Jörn

--
You cannot suppose that Moliere ever troubled himself to be original in the
matter of ideas. You cannot suppose that the stories he tells in his plays
have never been told before. They were culled, as you very well know.
-- Andre-Louis Moreau in Scarabouche

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: hugetlb pages not accounted for in rss
  2015-07-30 21:34                 ` Jörn Engel
@ 2015-07-31 21:09                   ` David Rientjes
  0 siblings, 0 replies; 71+ messages in thread
From: David Rientjes @ 2015-07-31 21:09 UTC (permalink / raw)
  To: Jörn Engel; +Cc: Mike Kravetz, linux-mm, linux-kernel

On Thu, 30 Jul 2015, Jörn Engel wrote:

> > If I want to track hugetlb usage on a per-task basis, do I then need to
> > create one cgroup per task?
> > 

I think this would only be used for debugging or testing, but if you have 
root and are trying to organize processes into a hugetlb_cgroup hierarchy, 
presumably you would just look at smaps and find each thread's hugetlb 
memory usage and not bother.

> Maybe some background is useful.  I would absolutely love to use
> transparent hugepages.  They are absolutely perfect in every respect,
> except for performance.  With transparent hugepages we get higher
> latencies.  Small pages are unacceptable, so we are forced to use
> non-transparent hugepages.
> 

Believe me, we are on the same page that way :)  We still deploy 
configurations with hugetlb memory because we need to meet certain 
allocation requirements and it is only possible to do at boot.

With regard to the performance of thp, I can think of two things that are 
affecting you:

 - allocation cost

   Async memory compaction in the page fault path for thp memory is very
   lightweight and it happily falls back to using small pages instead.
   Memory compaction is always being improved upon and there is on-going
   work to do memory compaction both periodically and in the background to
   keep fragmentation low.  The ultimate goal would be to remove async
   compaction entirely from the thp page fault path and rely on 
   improvements to memory compaction such that we have a great allocation
   success rate and less cost when we fail.

 - NUMA cost

   Until very recently, thp pages could easily be allocated remotely
   instead of small pages locally.  That has since been improved and we
   only allocate thp locally and then fallback to small pages locally
   first.  Khugepaged can still migrate memory remotely, but it will
   allocate the hugepage on the node where the majority of smallpages
   are from.

> The part of our system that uses small pages is pretty much constant,
> while total system memory follows Moore's law.  When possible we even
> try to shrink that part.  Hugepages already dominate today and things
> will get worse.
> 

I wrote a patchset, hugepages overcommit, that allows unmapped hugetlb 
pages to be freed in oom conditions, up to a certain threshold, before 
calling the oom killer, and then kicks off a background thread to try to 
reallocate them.  The idea is to keep the hugetlb pool as large as 
possible up to oom, reclaim only what is needed, and then try to 
reallocate it.  Not sure if it would help your particular usecase or 
not.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: hugetlb pages not accounted for in rss
  2015-07-29 23:20               ` Mike Kravetz
  2015-07-30 21:34                 ` Jörn Engel
@ 2015-08-04  2:55                 ` Naoya Horiguchi
  2015-08-04  5:13                   ` [PATCH] smaps: fill missing fields for vma(VM_HUGETLB) Naoya Horiguchi
  1 sibling, 1 reply; 71+ messages in thread
From: Naoya Horiguchi @ 2015-08-04  2:55 UTC (permalink / raw)
  To: Mike Kravetz; +Cc: David Rientjes, Jörn Engel, linux-mm, linux-kernel

On Wed, Jul 29, 2015 at 04:20:59PM -0700, Mike Kravetz wrote:
> On 07/29/2015 12:08 PM, David Rientjes wrote:
> >On Tue, 28 Jul 2015, Jörn Engel wrote:
> >
> >>Well, we definitely need something.  Having a 100GB process show 3GB of
> >>rss is not very useful.  How would we notice a memory leak if it only
> >>affects hugepages, for example?
> >>
> >
> >Since the hugetlb pool is a global resource, it would also be helpful to
> >determine if a process is mapping more than expected.  You can't do that
> >just by adding a huge rss metric, however: if you have 2MB and 1GB
> >hugepages configured you wouldn't know if a process was mapping 512 2MB
> >hugepages or 1 1GB hugepage.
> >
> >That's the purpose of hugetlb_cgroup, after all, and it supports usage
> >counters for all hstates.  The test could be converted to use that to
> >measure usage if configured in the kernel.
> >
> >Beyond that, I'm not sure how a per-hstate rss metric would be exported to
> >userspace in a clean way and other ways of obtaining the same data are
> >possible with hugetlb_cgroup.  I'm not sure how successful you'd be in
> >arguing that we need separate rss counters for it.
>
> If I want to track hugetlb usage on a per-task basis, do I then need to
> create one cgroup per task?
>
> For example, suppose I have many tasks using hugetlb and the global pool
> is getting low on free pages.  It might be useful to know which tasks are
> using hugetlb pages, and how many they are using.
>
> I don't actually have this need (I think), but it appears to be what
> Jörn is asking for.

One possible way to get a hugetlb metric on a per-task basis is to walk the
page table via /proc/pid/pagemap, counting page flags for each mapped page
(we can easily do this with tools/vm/page-types.c, e.g. "page-types -p <PID>
-b huge"). This is obviously slower than storing the counter as in-kernel
data and exporting it, but might be useful in some situations.
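
A bare-bones version of that walk, untested and with the address range
passed in by hand (page-types.c does this properly, including parsing
/proc/pid/maps; reading PFNs out of pagemap needs root):

	/* count hugetlb-backed 4kB slots in [start, end) of process <pid> */
	#include <stdio.h>
	#include <stdint.h>
	#include <stdlib.h>
	#include <fcntl.h>
	#include <unistd.h>

	#define KPF_HUGE	17	/* hugetlbfs page in /proc/kpageflags */

	int main(int argc, char **argv)
	{
		char path[64];
		uint64_t start, end, va, nr_huge = 0;
		long pagesize = sysconf(_SC_PAGESIZE);
		int pagemap, kpageflags;

		if (argc != 4)
			return 1;
		snprintf(path, sizeof(path), "/proc/%s/pagemap", argv[1]);
		pagemap = open(path, O_RDONLY);
		kpageflags = open("/proc/kpageflags", O_RDONLY);
		if (pagemap < 0 || kpageflags < 0)
			return 1;
		start = strtoull(argv[2], NULL, 0);
		end = strtoull(argv[3], NULL, 0);

		for (va = start; va < end; va += pagesize) {
			uint64_t ent, flags, pfn;

			if (pread(pagemap, &ent, 8, va / pagesize * 8) != 8)
				break;
			if (!(ent & (1ULL << 63)))	/* not present */
				continue;
			pfn = ent & ((1ULL << 55) - 1);	/* bits 0-54 */
			if (pread(kpageflags, &flags, 8, pfn * 8) != 8)
				continue;
			if (flags & (1ULL << KPF_HUGE))
				nr_huge++;
		}
		printf("hugetlb-backed 4kB slots: %llu\n",
		       (unsigned long long)nr_huge);
		return 0;
	}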

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH] smaps: fill missing fields for vma(VM_HUGETLB)
  2015-08-04  2:55                 ` Naoya Horiguchi
@ 2015-08-04  5:13                   ` Naoya Horiguchi
  2015-08-04 18:21                     ` Jörn Engel
  0 siblings, 1 reply; 71+ messages in thread
From: Naoya Horiguchi @ 2015-08-04  5:13 UTC (permalink / raw)
  To: Mike Kravetz, David Rientjes, Jörn Engel, Andrew Morton
  Cc: linux-mm, linux-kernel, Naoya Horiguchi

On Tue, Aug 04, 2015 at 02:55:30AM +0000, Naoya Horiguchi wrote:
> On Wed, Jul 29, 2015 at 04:20:59PM -0700, Mike Kravetz wrote:
> > On 07/29/2015 12:08 PM, David Rientjes wrote:
> > >On Tue, 28 Jul 2015, Jörn Engel wrote:
> > >
> > >>Well, we definitely need something.  Having a 100GB process show 3GB of
> > >>rss is not very useful.  How would we notice a memory leak if it only
> > >>affects hugepages, for example?
> > >>
> > >
> > >Since the hugetlb pool is a global resource, it would also be helpful to
> > >determine if a process is mapping more than expected.  You can't do that
> > >just by adding a huge rss metric, however: if you have 2MB and 1GB
> > >hugepages configured you wouldn't know if a process was mapping 512 2MB
> > >hugepages or 1 1GB hugepage.
> > >
> > >That's the purpose of hugetlb_cgroup, after all, and it supports usage
> > >counters for all hstates.  The test could be converted to use that to
> > >measure usage if configured in the kernel.
> > >
> > >Beyond that, I'm not sure how a per-hstate rss metric would be exported to
> > >userspace in a clean way and other ways of obtaining the same data are
> > >possible with hugetlb_cgroup.  I'm not sure how successful you'd be in
> > >arguing that we need separate rss counters for it.
> >
> > If I want to track hugetlb usage on a per-task basis, do I then need to
> > create one cgroup per task?
> >
> > For example, suppose I have many tasks using hugetlb and the global pool
> > is getting low on free pages.  It might be useful to know which tasks are
> > using hugetlb pages, and how many they are using.
> >
> > I don't actually have this need (I think), but it appears to be what
> > Jörn is asking for.
> 
> One possible way to get a hugetlb metric on a per-task basis is to walk the
> page table via /proc/pid/pagemap, counting page flags for each mapped page
> (we can easily do this with tools/vm/page-types.c, e.g. "page-types -p <PID>
> -b huge"). This is obviously slower than storing the counter as in-kernel
> data and exporting it, but might be useful in some situations.

BTW, currently smaps doesn't report any meaningful info for vma(VM_HUGETLB).
I wrote the following patch, which hopefully is helpful for your purpose.

Thanks,
Naoya Horiguchi

---
From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Subject: [PATCH] smaps: fill missing fields for vma(VM_HUGETLB)

Currently smaps reports many zero fields for vma(VM_HUGETLB), which is
inconvenient when we want to know per-task or per-vma base hugetlb usage.
This patch enables these fields by introducing smaps_hugetlb_range().

before patch:

  Size:              20480 kB
  Rss:                   0 kB
  Pss:                   0 kB
  Shared_Clean:          0 kB
  Shared_Dirty:          0 kB
  Private_Clean:         0 kB
  Private_Dirty:         0 kB
  Referenced:            0 kB
  Anonymous:             0 kB
  AnonHugePages:         0 kB
  Swap:                  0 kB
  KernelPageSize:     2048 kB
  MMUPageSize:        2048 kB
  Locked:                0 kB
  VmFlags: rd wr mr mw me de ht

after patch:

  Size:              20480 kB
  Rss:               18432 kB
  Pss:               18432 kB
  Shared_Clean:          0 kB
  Shared_Dirty:          0 kB
  Private_Clean:         0 kB
  Private_Dirty:     18432 kB
  Referenced:        18432 kB
  Anonymous:         18432 kB
  AnonHugePages:         0 kB
  Swap:                  0 kB
  KernelPageSize:     2048 kB
  MMUPageSize:        2048 kB
  Locked:                0 kB
  VmFlags: rd wr mr mw me de ht

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 fs/proc/task_mmu.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index ca1e091881d4..c7218603306d 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -610,12 +610,39 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 	seq_putc(m, '\n');
 }
 
+#ifdef CONFIG_HUGETLB_PAGE
+static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
+				 unsigned long addr, unsigned long end,
+				 struct mm_walk *walk)
+{
+	struct mem_size_stats *mss = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	struct page *page = NULL;
+
+	if (pte_present(*pte)) {
+		page = vm_normal_page(vma, addr, *pte);
+	} else if (is_swap_pte(*pte)) {
+		swp_entry_t swpent = pte_to_swp_entry(*pte);
+
+		if (is_migration_entry(swpent))
+			page = migration_entry_to_page(swpent);
+	}
+	if (page)
+		smaps_account(mss, page, huge_page_size(hstate_vma(vma)),
+			      pte_young(*pte), pte_dirty(*pte));
+	return 0;
+}
+#endif /* HUGETLB_PAGE */
+
 static int show_smap(struct seq_file *m, void *v, int is_pid)
 {
 	struct vm_area_struct *vma = v;
 	struct mem_size_stats mss;
 	struct mm_walk smaps_walk = {
 		.pmd_entry = smaps_pte_range,
+#ifdef CONFIG_HUGETLB_PAGE
+		.hugetlb_entry = smaps_hugetlb_range,
+#endif
 		.mm = vma->vm_mm,
 		.private = &mss,
 	};
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [PATCH] smaps: fill missing fields for vma(VM_HUGETLB)
  2015-08-04  5:13                   ` [PATCH] smaps: fill missing fields for vma(VM_HUGETLB) Naoya Horiguchi
@ 2015-08-04 18:21                     ` Jörn Engel
  2015-08-06  2:18                       ` David Rientjes
  0 siblings, 1 reply; 71+ messages in thread
From: Jörn Engel @ 2015-08-04 18:21 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Mike Kravetz, David Rientjes, Andrew Morton, linux-mm, linux-kernel

On Tue, Aug 04, 2015 at 05:13:39AM +0000, Naoya Horiguchi wrote:
> On Tue, Aug 04, 2015 at 02:55:30AM +0000, Naoya Horiguchi wrote:
> > 
> > One possible way to get a hugetlb metric on a per-task basis is to walk the
> > page table via /proc/pid/pagemap, counting page flags for each mapped page
> > (we can easily do this with tools/vm/page-types.c, e.g. "page-types -p <PID>
> > -b huge"). This is obviously slower than storing the counter as in-kernel
> > data and exporting it, but might be useful in some situations.

Maybe.  The current situation is a mess and I don't know the best way
out of it yet.

> BTW, currently smaps doesn't report any meaningful info for vma(VM_HUGETLB).
> I wrote the following patch, which hopefully is helpful for your purpose.
> 
> Thanks,
> Naoya Horiguchi
> 
> ---
> From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Subject: [PATCH] smaps: fill missing fields for vma(VM_HUGETLB)
> 
> Currently smaps reports many zero fields for vma(VM_HUGETLB), which is
> inconvenient when we want to know per-task or per-vma base hugetlb usage.
> This patch enables these fields by introducing smaps_hugetlb_range().
> 
> before patch:
> 
>   Size:              20480 kB
>   Rss:                   0 kB
>   Pss:                   0 kB
>   Shared_Clean:          0 kB
>   Shared_Dirty:          0 kB
>   Private_Clean:         0 kB
>   Private_Dirty:         0 kB
>   Referenced:            0 kB
>   Anonymous:             0 kB
>   AnonHugePages:         0 kB
>   Swap:                  0 kB
>   KernelPageSize:     2048 kB
>   MMUPageSize:        2048 kB
>   Locked:                0 kB
>   VmFlags: rd wr mr mw me de ht
> 
> after patch:
> 
>   Size:              20480 kB
>   Rss:               18432 kB
>   Pss:               18432 kB
>   Shared_Clean:          0 kB
>   Shared_Dirty:          0 kB
>   Private_Clean:         0 kB
>   Private_Dirty:     18432 kB
>   Referenced:        18432 kB
>   Anonymous:         18432 kB
>   AnonHugePages:         0 kB
>   Swap:                  0 kB
>   KernelPageSize:     2048 kB
>   MMUPageSize:        2048 kB
>   Locked:                0 kB
>   VmFlags: rd wr mr mw me de ht

Nice!

> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> ---
>  fs/proc/task_mmu.c | 27 +++++++++++++++++++++++++++
>  1 file changed, 27 insertions(+)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index ca1e091881d4..c7218603306d 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -610,12 +610,39 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
>  	seq_putc(m, '\n');
>  }
>  
> +#ifdef CONFIG_HUGETLB_PAGE
> +static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
> +				 unsigned long addr, unsigned long end,
> +				 struct mm_walk *walk)
> +{
> +	struct mem_size_stats *mss = walk->private;
> +	struct vm_area_struct *vma = walk->vma;
> +	struct page *page = NULL;
> +
> +	if (pte_present(*pte)) {
> +		page = vm_normal_page(vma, addr, *pte);
> +	} else if (is_swap_pte(*pte)) {
> +		swp_entry_t swpent = pte_to_swp_entry(*pte);
> +
> +		if (is_migration_entry(swpent))
> +			page = migration_entry_to_page(swpent);
> +	}
> +	if (page)
> +		smaps_account(mss, page, huge_page_size(hstate_vma(vma)),
> +			      pte_young(*pte), pte_dirty(*pte));
> +	return 0;
> +}
> +#endif /* HUGETLB_PAGE */
> +
>  static int show_smap(struct seq_file *m, void *v, int is_pid)
>  {
>  	struct vm_area_struct *vma = v;
>  	struct mem_size_stats mss;
>  	struct mm_walk smaps_walk = {
>  		.pmd_entry = smaps_pte_range,
> +#ifdef CONFIG_HUGETLB_PAGE
> +		.hugetlb_entry = smaps_hugetlb_range,
> +#endif

Not too fond of the #ifdef.  But I won't blame you, as there already is
an example of the same and - worse - a contradicting example that
unconditionally assigns and moves the #ifdef elsewhere.
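
IIRC the contradicting one is the numa_maps walker in the same file, which
does roughly this (quoting from memory):

	/* assign unconditionally ... */
	struct mm_walk walk = {
		.hugetlb_entry = gather_hugetlb_stats,
		.pmd_entry = gather_pte_stats,
		.private = md,
		.mm = mm,
	};

	/* ... and keep a stub next to the real function instead */
	#else
	static int gather_hugetlb_stats(pte_t *pte, unsigned long hmask,
			unsigned long addr, unsigned long end, struct mm_walk *walk)
	{
		return 0;
	}
	#endif /* CONFIG_HUGETLB_PAGE */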

Hugetlb is the unloved stepchild with 13 years of neglect and
half-measures.  It shows.

Patch looks good to me.

Acked-by: Jörn Engel <joern@logfs.org>

Jörn

--
Functionality is an asset, but code is a liability.
--Ted Dziuba

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH] smaps: fill missing fields for vma(VM_HUGETLB)
  2015-08-04 18:21                     ` Jörn Engel
@ 2015-08-06  2:18                       ` David Rientjes
  2015-08-06  7:44                         ` Naoya Horiguchi
  0 siblings, 1 reply; 71+ messages in thread
From: David Rientjes @ 2015-08-06  2:18 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Naoya Horiguchi, Mike Kravetz, Andrew Morton, linux-mm, linux-kernel

On Tue, 4 Aug 2015, Jörn Engel wrote:

> > From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > Subject: [PATCH] smaps: fill missing fields for vma(VM_HUGETLB)
> > 
> > Currently smaps reports many zero fields for vma(VM_HUGETLB), which is
> > inconvenient when we want to know per-task or per-vma base hugetlb usage.
> > This patch enables these fields by introducing smaps_hugetlb_range().
> > 
> > before patch:
> > 
> >   Size:              20480 kB
> >   Rss:                   0 kB
> >   Pss:                   0 kB
> >   Shared_Clean:          0 kB
> >   Shared_Dirty:          0 kB
> >   Private_Clean:         0 kB
> >   Private_Dirty:         0 kB
> >   Referenced:            0 kB
> >   Anonymous:             0 kB
> >   AnonHugePages:         0 kB
> >   Swap:                  0 kB
> >   KernelPageSize:     2048 kB
> >   MMUPageSize:        2048 kB
> >   Locked:                0 kB
> >   VmFlags: rd wr mr mw me de ht
> > 
> > after patch:
> > 
> >   Size:              20480 kB
> >   Rss:               18432 kB
> >   Pss:               18432 kB
> >   Shared_Clean:          0 kB
> >   Shared_Dirty:          0 kB
> >   Private_Clean:         0 kB
> >   Private_Dirty:     18432 kB
> >   Referenced:        18432 kB
> >   Anonymous:         18432 kB
> >   AnonHugePages:         0 kB
> >   Swap:                  0 kB
> >   KernelPageSize:     2048 kB
> >   MMUPageSize:        2048 kB
> >   Locked:                0 kB
> >   VmFlags: rd wr mr mw me de ht
> 
> Nice!
> 

Hmm, wouldn't this be confusing since VmRSS in /proc/pid/status doesn't 
match the rss shown in smaps, since hugetlb mappings aren't accounted in 
get_mm_rss()?
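
(For reference, get_mm_rss() currently only sums the file and anon
counters - roughly:

	static inline unsigned long get_mm_rss(struct mm_struct *mm)
	{
		return get_mm_counter(mm, MM_FILEPAGES) +
			get_mm_counter(mm, MM_ANONPAGES);
	}

so hugetlb mappings never contribute to VmRSS.)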

Not sure this is a good idea, I think consistency amongst rss values would 
be more important.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH] smaps: fill missing fields for vma(VM_HUGETLB)
  2015-08-06  2:18                       ` David Rientjes
@ 2015-08-06  7:44                         ` Naoya Horiguchi
  2015-08-07  7:24                           ` [PATCH v2 0/2] hugetlb: display per-process/per-vma usage Naoya Horiguchi
  0 siblings, 1 reply; 71+ messages in thread
From: Naoya Horiguchi @ 2015-08-06  7:44 UTC (permalink / raw)
  To: David Rientjes
  Cc: Jörn Engel, Mike Kravetz, Andrew Morton, linux-mm, linux-kernel

On Wed, Aug 05, 2015 at 07:18:44PM -0700, David Rientjes wrote:
...
> Hmm, wouldn't this be confusing since VmRSS in /proc/pid/status doesn't 
> match the rss shown in smaps, since hugetlb mappings aren't accounted in 
> get_mm_rss()?
> 
> Not sure this is a good idea, I think consistency amongst rss values would 
> be more important.

Right, so one option is making get_mm_rss() count hugetlb, but that could
make oom/memcg less efficient or broken as you stated in a previous email.
So another option is to add a "VmHugetlbRSS:" field in /proc/pid/status?

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH v2 0/2] hugetlb: display per-process/per-vma usage
  2015-08-06  7:44                         ` Naoya Horiguchi
@ 2015-08-07  7:24                           ` Naoya Horiguchi
  2015-08-07  7:24                             ` [PATCH v2 2/2] mm: hugetlb: add VmHugetlbRSS: field in /proc/pid/status Naoya Horiguchi
  2015-08-07  7:24                             ` [PATCH v2 1/2] smaps: fill missing fields for vma(VM_HUGETLB) Naoya Horiguchi
  0 siblings, 2 replies; 71+ messages in thread
From: Naoya Horiguchi @ 2015-08-07  7:24 UTC (permalink / raw)
  To: Andrew Morton, David Rientjes, Jörn Engel
  Cc: Mike Kravetz, linux-mm, linux-kernel, Naoya Horiguchi, Naoya Horiguchi

I wrote patches to export hugetlb usage info via /proc/pid/{smaps,status}.
In this version, I added patch 2 for /proc/pid/status to deal with the
inconsistency concern from David (thanks for the comment).

Thanks,
Naoya Horiguchi
---
Summary:

Naoya Horiguchi (2):
      smaps: fill missing fields for vma(VM_HUGETLB)
      mm: hugetlb: add VmHugetlbRSS: field in /proc/pid/status

 fs/proc/task_mmu.c       | 32 +++++++++++++++++++++++++++++++-
 include/linux/hugetlb.h  | 18 ++++++++++++++++++
 include/linux/mm.h       |  3 +++
 include/linux/mm_types.h |  3 +++
 mm/hugetlb.c             |  9 +++++++++
 mm/memory.c              |  4 ++--
 mm/rmap.c                |  4 +++-
 7 files changed, 69 insertions(+), 4 deletions(-)

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH v2 1/2] smaps: fill missing fields for vma(VM_HUGETLB)
  2015-08-07  7:24                           ` [PATCH v2 0/2] hugetlb: display per-process/per-vma usage Naoya Horiguchi
  2015-08-07  7:24                             ` [PATCH v2 2/2] mm: hugetlb: add VmHugetlbRSS: field in /proc/pid/status Naoya Horiguchi
@ 2015-08-07  7:24                             ` Naoya Horiguchi
  2015-08-07 22:50                               ` Andrew Morton
  2015-08-11  0:37                               ` David Rientjes
  1 sibling, 2 replies; 71+ messages in thread
From: Naoya Horiguchi @ 2015-08-07  7:24 UTC (permalink / raw)
  To: Andrew Morton, David Rientjes, Jörn Engel
  Cc: Mike Kravetz, linux-mm, linux-kernel, Naoya Horiguchi, Naoya Horiguchi

Currently smaps reports many zero fields for vma(VM_HUGETLB), which is
inconvenient when we want to know per-task or per-vma base hugetlb usage.
This patch enables these fields by introducing smaps_hugetlb_range().

before patch:

  Size:              20480 kB
  Rss:                   0 kB
  Pss:                   0 kB
  Shared_Clean:          0 kB
  Shared_Dirty:          0 kB
  Private_Clean:         0 kB
  Private_Dirty:         0 kB
  Referenced:            0 kB
  Anonymous:             0 kB
  AnonHugePages:         0 kB
  Swap:                  0 kB
  KernelPageSize:     2048 kB
  MMUPageSize:        2048 kB
  Locked:                0 kB
  VmFlags: rd wr mr mw me de ht

after patch:

  Size:              20480 kB
  Rss:               18432 kB
  Pss:               18432 kB
  Shared_Clean:          0 kB
  Shared_Dirty:          0 kB
  Private_Clean:         0 kB
  Private_Dirty:     18432 kB
  Referenced:        18432 kB
  Anonymous:         18432 kB
  AnonHugePages:         0 kB
  Swap:                  0 kB
  KernelPageSize:     2048 kB
  MMUPageSize:        2048 kB
  Locked:                0 kB
  VmFlags: rd wr mr mw me de ht

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Jörn Engel <joern@logfs.org>
---
 fs/proc/task_mmu.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git v4.2-rc4.orig/fs/proc/task_mmu.c v4.2-rc4/fs/proc/task_mmu.c
index ca1e091881d4..c7218603306d 100644
--- v4.2-rc4.orig/fs/proc/task_mmu.c
+++ v4.2-rc4/fs/proc/task_mmu.c
@@ -610,12 +610,39 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 	seq_putc(m, '\n');
 }
 
+#ifdef CONFIG_HUGETLB_PAGE
+static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
+				 unsigned long addr, unsigned long end,
+				 struct mm_walk *walk)
+{
+	struct mem_size_stats *mss = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	struct page *page = NULL;
+
+	if (pte_present(*pte)) {
+		page = vm_normal_page(vma, addr, *pte);
+	} else if (is_swap_pte(*pte)) {
+		swp_entry_t swpent = pte_to_swp_entry(*pte);
+
+		if (is_migration_entry(swpent))
+			page = migration_entry_to_page(swpent);
+	}
+	if (page)
+		smaps_account(mss, page, huge_page_size(hstate_vma(vma)),
+			      pte_young(*pte), pte_dirty(*pte));
+	return 0;
+}
+#endif /* HUGETLB_PAGE */
+
 static int show_smap(struct seq_file *m, void *v, int is_pid)
 {
 	struct vm_area_struct *vma = v;
 	struct mem_size_stats mss;
 	struct mm_walk smaps_walk = {
 		.pmd_entry = smaps_pte_range,
+#ifdef CONFIG_HUGETLB_PAGE
+		.hugetlb_entry = smaps_hugetlb_range,
+#endif
 		.mm = vma->vm_mm,
 		.private = &mss,
 	};
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v2 2/2] mm: hugetlb: add VmHugetlbRSS: field in /proc/pid/status
  2015-08-07  7:24                           ` [PATCH v2 0/2] hugetlb: display per-process/per-vma usage Naoya Horiguchi
@ 2015-08-07  7:24                             ` Naoya Horiguchi
  2015-08-07 22:55                               ` Andrew Morton
  2015-08-07  7:24                             ` [PATCH v2 1/2] smaps: fill missing fields for vma(VM_HUGETLB) Naoya Horiguchi
  1 sibling, 1 reply; 71+ messages in thread
From: Naoya Horiguchi @ 2015-08-07  7:24 UTC (permalink / raw)
  To: Andrew Morton, David Rientjes, Jörn Engel
  Cc: Mike Kravetz, linux-mm, linux-kernel, Naoya Horiguchi, Naoya Horiguchi

Currently there's no easy way to get per-process usage of hugetlb pages, which
is inconvenient because applications which use hugetlb typically want to control
their processes on the basis of how much memory (including hugetlb) they use.
So this patch simply provides easy access to the info via /proc/pid/status.

This patch shouldn't change the OOM behavior (so hugetlb usage is ignored
as it is now), which I guess is fine until we have some strong reason to
change it.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 fs/proc/task_mmu.c       |  5 ++++-
 include/linux/hugetlb.h  | 18 ++++++++++++++++++
 include/linux/mm.h       |  3 +++
 include/linux/mm_types.h |  3 +++
 mm/hugetlb.c             |  9 +++++++++
 mm/memory.c              |  4 ++--
 mm/rmap.c                |  4 +++-
 7 files changed, 42 insertions(+), 4 deletions(-)

diff --git v4.2-rc4.orig/fs/proc/task_mmu.c v4.2-rc4/fs/proc/task_mmu.c
index c7218603306d..f181f56fcce2 100644
--- v4.2-rc4.orig/fs/proc/task_mmu.c
+++ v4.2-rc4/fs/proc/task_mmu.c
@@ -22,7 +22,7 @@
 void task_mem(struct seq_file *m, struct mm_struct *mm)
 {
 	unsigned long data, text, lib, swap, ptes, pmds;
-	unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;
+	unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss, hugetlb_rss;
 
 	/*
 	 * Note: to minimize their overhead, mm maintains hiwater_vm and
@@ -37,6 +37,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
 	hiwater_rss = total_rss = get_mm_rss(mm);
 	if (hiwater_rss < mm->hiwater_rss)
 		hiwater_rss = mm->hiwater_rss;
+	hugetlb_rss = get_hugetlb_rss(mm);
 
 	data = mm->total_vm - mm->shared_vm - mm->stack_vm;
 	text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK)) >> 10;
@@ -51,6 +52,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
 		"VmPin:\t%8lu kB\n"
 		"VmHWM:\t%8lu kB\n"
 		"VmRSS:\t%8lu kB\n"
+		"VmHugetlbRSS:\t%8lu kB\n"
 		"VmData:\t%8lu kB\n"
 		"VmStk:\t%8lu kB\n"
 		"VmExe:\t%8lu kB\n"
@@ -64,6 +66,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
 		mm->pinned_vm << (PAGE_SHIFT-10),
 		hiwater_rss << (PAGE_SHIFT-10),
 		total_rss << (PAGE_SHIFT-10),
+		hugetlb_rss << (PAGE_SHIFT-10),
 		data << (PAGE_SHIFT-10),
 		mm->stack_vm << (PAGE_SHIFT-10), text, lib,
 		ptes >> 10,
diff --git v4.2-rc4.orig/include/linux/hugetlb.h v4.2-rc4/include/linux/hugetlb.h
index d891f949466a..6319df124e68 100644
--- v4.2-rc4.orig/include/linux/hugetlb.h
+++ v4.2-rc4/include/linux/hugetlb.h
@@ -469,6 +469,21 @@ static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
 #define hugepages_supported() (HPAGE_SHIFT != 0)
 #endif
 
+/*
+ * These simple wrappers hide MM_HUGETLBPAGES from outside the hugetlbfs
+ * subsystem. The counter MM_HUGETLBPAGES is maintained in small page units,
+ * so it changes by 512 for example if a 2MB hugepage is mapped or unmapped.
+ */
+static inline int get_hugetlb_rss(struct mm_struct *mm)
+{
+	return get_mm_counter(mm, MM_HUGETLBPAGES);
+}
+
+static inline void mod_hugetlb_rss(struct mm_struct *mm, long value)
+{
+	add_mm_counter(mm, MM_HUGETLBPAGES, value);
+}
+
 #else	/* CONFIG_HUGETLB_PAGE */
 struct hstate {};
 #define alloc_huge_page_node(h, nid) NULL
@@ -504,6 +519,9 @@ static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
 {
 	return &mm->page_table_lock;
 }
+
+#define get_hugetlb_rss(mm)	0
+#define mod_hugetlb_rss(mm, value)	do {} while (0)
 #endif	/* CONFIG_HUGETLB_PAGE */
 
 static inline spinlock_t *huge_pte_lock(struct hstate *h,
diff --git v4.2-rc4.orig/include/linux/mm.h v4.2-rc4/include/linux/mm.h
index 2e872f92dbac..9218a8856483 100644
--- v4.2-rc4.orig/include/linux/mm.h
+++ v4.2-rc4/include/linux/mm.h
@@ -1355,6 +1355,9 @@ static inline void sync_mm_rss(struct mm_struct *mm)
 }
 #endif
 
+extern inline void init_rss_vec(int *rss);
+extern inline void add_mm_rss_vec(struct mm_struct *mm, int *rss);
+
 int vma_wants_writenotify(struct vm_area_struct *vma);
 
 extern pte_t *__get_locked_pte(struct mm_struct *mm, unsigned long addr,
diff --git v4.2-rc4.orig/include/linux/mm_types.h v4.2-rc4/include/linux/mm_types.h
index 0038ac7466fd..887b43ba5a18 100644
--- v4.2-rc4.orig/include/linux/mm_types.h
+++ v4.2-rc4/include/linux/mm_types.h
@@ -348,6 +348,9 @@ enum {
 	MM_FILEPAGES,
 	MM_ANONPAGES,
 	MM_SWAPENTS,
+#ifdef CONFIG_HUGETLB_PAGE
+	MM_HUGETLBPAGES,
+#endif
 	NR_MM_COUNTERS
 };
 
diff --git v4.2-rc4.orig/mm/hugetlb.c v4.2-rc4/mm/hugetlb.c
index a8c3087089d8..12e5e7d3b60f 100644
--- v4.2-rc4.orig/mm/hugetlb.c
+++ v4.2-rc4/mm/hugetlb.c
@@ -2743,9 +2743,11 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
 	int ret = 0;
+	int rss[NR_MM_COUNTERS];
 
 	cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
 
+	init_rss_vec(rss);
 	mmun_start = vma->vm_start;
 	mmun_end = vma->vm_end;
 	if (cow)
@@ -2797,6 +2799,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			get_page(ptepage);
 			page_dup_rmap(ptepage);
 			set_huge_pte_at(dst, addr, dst_pte, entry);
+			rss[MM_HUGETLBPAGES] += pages_per_huge_page(h);
 		}
 		spin_unlock(src_ptl);
 		spin_unlock(dst_ptl);
@@ -2805,6 +2808,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	if (cow)
 		mmu_notifier_invalidate_range_end(src, mmun_start, mmun_end);
 
+	add_mm_rss_vec(dst, rss);
 	return ret;
 }
 
@@ -2823,6 +2827,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	unsigned long sz = huge_page_size(h);
 	const unsigned long mmun_start = start;	/* For mmu_notifiers */
 	const unsigned long mmun_end   = end;	/* For mmu_notifiers */
+	int rss[NR_MM_COUNTERS];
 
 	WARN_ON(!is_vm_hugetlb_page(vma));
 	BUG_ON(start & ~huge_page_mask(h));
@@ -2832,6 +2837,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
 	address = start;
 again:
+	init_rss_vec(rss);
 	for (; address < end; address += sz) {
 		ptep = huge_pte_offset(mm, address);
 		if (!ptep)
@@ -2877,6 +2883,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		if (huge_pte_dirty(pte))
 			set_page_dirty(page);
 
+		rss[MM_HUGETLBPAGES] -= pages_per_huge_page(h);
 		page_remove_rmap(page);
 		force_flush = !__tlb_remove_page(tlb, page);
 		if (force_flush) {
@@ -2892,6 +2899,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 unlock:
 		spin_unlock(ptl);
 	}
+	add_mm_rss_vec(mm, rss);
 	/*
 	 * mmu_gather ran out of room to batch pages, we break out of
 	 * the PTE lock to avoid doing the potential expensive TLB invalidate
@@ -3261,6 +3269,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 				&& (vma->vm_flags & VM_SHARED)));
 	set_huge_pte_at(mm, address, ptep, new_pte);
 
+	mod_hugetlb_rss(mm, pages_per_huge_page(h));
 	if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
 		/* Optimization, do the COW without a second fault */
 		ret = hugetlb_cow(mm, vma, address, ptep, new_pte, page, ptl);
diff --git v4.2-rc4.orig/mm/memory.c v4.2-rc4/mm/memory.c
index 388dcf9aa283..e09b53da2733 100644
--- v4.2-rc4.orig/mm/memory.c
+++ v4.2-rc4/mm/memory.c
@@ -620,12 +620,12 @@ int __pte_alloc_kernel(pmd_t *pmd, unsigned long address)
 	return 0;
 }
 
-static inline void init_rss_vec(int *rss)
+inline void init_rss_vec(int *rss)
 {
 	memset(rss, 0, sizeof(int) * NR_MM_COUNTERS);
 }
 
-static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
+inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
 {
 	int i;
 
diff --git v4.2-rc4.orig/mm/rmap.c v4.2-rc4/mm/rmap.c
index 171b68768df1..78e77b0ea3c3 100644
--- v4.2-rc4.orig/mm/rmap.c
+++ v4.2-rc4/mm/rmap.c
@@ -1230,7 +1230,9 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	update_hiwater_rss(mm);
 
 	if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
-		if (!PageHuge(page)) {
+		if (PageHuge(page)) {
+			mod_hugetlb_rss(mm, -(1 << compound_order(page)));
+		} else {
 			if (PageAnon(page))
 				dec_mm_counter(mm, MM_ANONPAGES);
 			else
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 1/2] smaps: fill missing fields for vma(VM_HUGETLB)
  2015-08-07  7:24                             ` [PATCH v2 1/2] smaps: fill missing fields for vma(VM_HUGETLB) Naoya Horiguchi
@ 2015-08-07 22:50                               ` Andrew Morton
  2015-08-11  0:37                               ` David Rientjes
  1 sibling, 0 replies; 71+ messages in thread
From: Andrew Morton @ 2015-08-07 22:50 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: David Rientjes, Jörn Engel, Mike Kravetz, linux-mm,
	linux-kernel, Naoya Horiguchi

On Fri, 7 Aug 2015 07:24:50 +0000 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:

> Currently smaps reports many zero fields for vma(VM_HUGETLB), which is
> inconvenient when we want to know per-task or per-vma base hugetlb usage.
> This patch enables these fields by introducing smaps_hugetlb_range().
> 
> before patch:
> 
>   Size:              20480 kB
>   Rss:                   0 kB
>   Pss:                   0 kB
>   Shared_Clean:          0 kB
>   Shared_Dirty:          0 kB
>   Private_Clean:         0 kB
>   Private_Dirty:         0 kB
>   Referenced:            0 kB
>   Anonymous:             0 kB
>   AnonHugePages:         0 kB
>   Swap:                  0 kB
>   KernelPageSize:     2048 kB
>   MMUPageSize:        2048 kB
>   Locked:                0 kB
>   VmFlags: rd wr mr mw me de ht
> 
> after patch:
> 
>   Size:              20480 kB
>   Rss:               18432 kB
>   Pss:               18432 kB
>   Shared_Clean:          0 kB
>   Shared_Dirty:          0 kB
>   Private_Clean:         0 kB
>   Private_Dirty:     18432 kB
>   Referenced:        18432 kB
>   Anonymous:         18432 kB
>   AnonHugePages:         0 kB
>   Swap:                  0 kB
>   KernelPageSize:     2048 kB
>   MMUPageSize:        2048 kB
>   Locked:                0 kB
>   VmFlags: rd wr mr mw me de ht
> 

Is there something we can say about this in
Documentation/filesystems/proc.txt?


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 2/2] mm: hugetlb: add VmHugetlbRSS: field in /proc/pid/status
  2015-08-07  7:24                             ` [PATCH v2 2/2] mm: hugetlb: add VmHugetlbRSS: field in /proc/pid/status Naoya Horiguchi
@ 2015-08-07 22:55                               ` Andrew Morton
  2015-08-10  0:47                                 ` Naoya Horiguchi
  0 siblings, 1 reply; 71+ messages in thread
From: Andrew Morton @ 2015-08-07 22:55 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: David Rientjes, Jörn Engel, Mike Kravetz, linux-mm,
	linux-kernel, Naoya Horiguchi

On Fri, 7 Aug 2015 07:24:50 +0000 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:

> Currently there's no easy way to get per-process usage of hugetlb pages, which
> is inconvenient because applications which use hugetlb typically want to control
> their processes on the basis of how much memory (including hugetlb) they use.
> So this patch simply provides easy access to the info via /proc/pid/status.
> 
> This patch shouldn't change the OOM behavior (so hugetlb usage is ignored
> as it is now), which I guess is fine until we have some strong reason to
> change it.
> 

A procfs change triggers a documentation change.  Always, please. 
Documentation/filesystems/proc.txt is the place.

>
> ...
>
> @@ -504,6 +519,9 @@ static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
>  {
>  	return &mm->page_table_lock;
>  }
> +
> +#define get_hugetlb_rss(mm)	0
> +#define mod_hugetlb_rss(mm, value)	do {} while (0)

I don't think these have to be macros?  inline functions are nicer in
several ways: more readable, more likely to be documented, can prevent
unused variable warnings.
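
Concretely, the suggestion amounts to something like the sketch below for the
!CONFIG_HUGETLB_PAGE stubs; this is effectively what the v3 respin later in
this thread does, and it is shown here only for illustration:

static inline int get_hugetlb_rss(struct mm_struct *mm)
{
	return 0;
}

static inline void mod_hugetlb_rss(struct mm_struct *mm, long value)
{
}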

>  #endif	/* CONFIG_HUGETLB_PAGE */
>  
>  static inline spinlock_t *huge_pte_lock(struct hstate *h,
>
> ...
>
> --- v4.2-rc4.orig/mm/memory.c
> +++ v4.2-rc4/mm/memory.c
> @@ -620,12 +620,12 @@ int __pte_alloc_kernel(pmd_t *pmd, unsigned long address)
>  	return 0;
>  }
>  
> -static inline void init_rss_vec(int *rss)
> +inline void init_rss_vec(int *rss)
>  {
>  	memset(rss, 0, sizeof(int) * NR_MM_COUNTERS);
>  }
>  
> -static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
> +inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
>  {
>  	int i;

The inlines are a bit odd, but this does save ~10 bytes in memory.o for
some reason.



^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH v3 1/3] smaps: fill missing fields for vma(VM_HUGETLB)
  2015-08-10  0:47                                 ` Naoya Horiguchi
@ 2015-08-10  0:47                                   ` Naoya Horiguchi
  2015-08-10  0:47                                   ` [PATCH v3 3/3] Documentation/filesystems/proc.txt: document hugetlb RSS Naoya Horiguchi
  2015-08-10  0:47                                   ` [PATCH v3 2/3] mm: hugetlb: add VmHugetlbRSS: field in /proc/pid/status Naoya Horiguchi
  2 siblings, 0 replies; 71+ messages in thread
From: Naoya Horiguchi @ 2015-08-10  0:47 UTC (permalink / raw)
  To: Andrew Morton, David Rientjes, Jörn Engel
  Cc: Mike Kravetz, linux-mm, linux-kernel, Naoya Horiguchi, Naoya Horiguchi


Currently smaps reports many zero fields for vma(VM_HUGETLB), which is
inconvenient when we want to know per-task or per-vma base hugetlb usage.
This patch enables these fields by introducing smaps_hugetlb_range().

before patch:

  Size:              20480 kB
  Rss:                   0 kB
  Pss:                   0 kB
  Shared_Clean:          0 kB
  Shared_Dirty:          0 kB
  Private_Clean:         0 kB
  Private_Dirty:         0 kB
  Referenced:            0 kB
  Anonymous:             0 kB
  AnonHugePages:         0 kB
  Swap:                  0 kB
  KernelPageSize:     2048 kB
  MMUPageSize:        2048 kB
  Locked:                0 kB
  VmFlags: rd wr mr mw me de ht

after patch:

  Size:              20480 kB
  Rss:               18432 kB
  Pss:               18432 kB
  Shared_Clean:          0 kB
  Shared_Dirty:          0 kB
  Private_Clean:         0 kB
  Private_Dirty:     18432 kB
  Referenced:        18432 kB
  Anonymous:         18432 kB
  AnonHugePages:         0 kB
  Swap:                  0 kB
  KernelPageSize:     2048 kB
  MMUPageSize:        2048 kB
  Locked:                0 kB
  VmFlags: rd wr mr mw me de ht

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Jörn Engel <joern@logfs.org>
---
 fs/proc/task_mmu.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git v4.2-rc4.orig/fs/proc/task_mmu.c v4.2-rc4/fs/proc/task_mmu.c
index ca1e091881d4..c7218603306d 100644
--- v4.2-rc4.orig/fs/proc/task_mmu.c
+++ v4.2-rc4/fs/proc/task_mmu.c
@@ -610,12 +610,39 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 	seq_putc(m, '\n');
 }
 
+#ifdef CONFIG_HUGETLB_PAGE
+static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
+				 unsigned long addr, unsigned long end,
+				 struct mm_walk *walk)
+{
+	struct mem_size_stats *mss = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	struct page *page = NULL;
+
+	if (pte_present(*pte)) {
+		page = vm_normal_page(vma, addr, *pte);
+	} else if (is_swap_pte(*pte)) {
+		swp_entry_t swpent = pte_to_swp_entry(*pte);
+
+		if (is_migration_entry(swpent))
+			page = migration_entry_to_page(swpent);
+	}
+	if (page)
+		smaps_account(mss, page, huge_page_size(hstate_vma(vma)),
+			      pte_young(*pte), pte_dirty(*pte));
+	return 0;
+}
+#endif /* HUGETLB_PAGE */
+
 static int show_smap(struct seq_file *m, void *v, int is_pid)
 {
 	struct vm_area_struct *vma = v;
 	struct mem_size_stats mss;
 	struct mm_walk smaps_walk = {
 		.pmd_entry = smaps_pte_range,
+#ifdef CONFIG_HUGETLB_PAGE
+		.hugetlb_entry = smaps_hugetlb_range,
+#endif
 		.mm = vma->vm_mm,
 		.private = &mss,
 	};
-- 
2.4.3
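
For anyone who wants to eyeball the new output, a rough standalone check (not
part of the patch; it assumes MAP_HUGETLB is available, the default huge page
size is 2MB, and at least one huge page is reserved in the default pool):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define LEN (2UL * 1024 * 1024)	/* one default-sized (2MB) huge page */

int main(void)
{
	char line[256];
	FILE *f;
	char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap(MAP_HUGETLB)");
		return 1;
	}
	memset(p, 0, LEN);	/* fault the huge page in so it shows up in smaps */

	f = fopen("/proc/self/smaps", "r");
	if (!f) {
		perror("fopen");
		return 1;
	}
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);	/* the hugetlb VMA carries the "ht" VmFlag */
	fclose(f);
	munmap(p, LEN);
	return 0;
}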

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 2/2] mm: hugetlb: add VmHugetlbRSS: field in /proc/pid/status
  2015-08-07 22:55                               ` Andrew Morton
@ 2015-08-10  0:47                                 ` Naoya Horiguchi
  2015-08-10  0:47                                   ` [PATCH v3 1/3] smaps: fill missing fields for vma(VM_HUGETLB) Naoya Horiguchi
                                                     ` (2 more replies)
  0 siblings, 3 replies; 71+ messages in thread
From: Naoya Horiguchi @ 2015-08-10  0:47 UTC (permalink / raw)
  To: Andrew Morton, David Rientjes, Jörn Engel
  Cc: Mike Kravetz, linux-mm, linux-kernel, Naoya Horiguchi, Naoya Horiguchi


On Fri, Aug 07, 2015 at 03:55:37PM -0700, Andrew Morton wrote:
> On Fri, 7 Aug 2015 07:24:50 +0000 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:
> 
> > Currently there's no easy way to get per-process usage of hugetlb pages, which
> > is inconvenient because applications which use hugetlb typically want to control
> > their processes on the basis of how much memory (including hugetlb) they use.
> > So this patch simply provides easy access to the info via /proc/pid/status.
> > 
> > This patch shouldn't change the OOM behavior (so hugetlb usage is ignored as
> > is now,) which I guess is fine until we have some strong reason to do it.
> > 
> 
> A procfs change triggers a documentation change.  Always, please. 
> Documentation/filesystems/proc.txt is the place.

OK, I'll do this.

> >
> > ...
> >
> > @@ -504,6 +519,9 @@ static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
> >  {
> >  	return &mm->page_table_lock;
> >  }
> > +
> > +#define get_hugetlb_rss(mm)	0
> > +#define mod_hugetlb_rss(mm, value)	do {} while (0)
> 
> I don't think these have to be macros?  inline functions are nicer in
> several ways: more readable, more likely to be documented, can prevent
> unused variable warnings.

Right, I'll use inline functions.

> >  #endif	/* CONFIG_HUGETLB_PAGE */
> >  
> >  static inline spinlock_t *huge_pte_lock(struct hstate *h,
> >
> > ...
> >
> > --- v4.2-rc4.orig/mm/memory.c
> > +++ v4.2-rc4/mm/memory.c
> > @@ -620,12 +620,12 @@ int __pte_alloc_kernel(pmd_t *pmd, unsigned long address)
> >  	return 0;
> >  }
> >  
> > -static inline void init_rss_vec(int *rss)
> > +inline void init_rss_vec(int *rss)
> >  {
> >  	memset(rss, 0, sizeof(int) * NR_MM_COUNTERS);
> >  }
> >  
> > -static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
> > +inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
> >  {
> >  	int i;
> 
> The inlines are a bit odd, but this does save ~10 bytes in memory.o for
> some reason.

so I'll keep going with this.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH v3 3/3] Documentation/filesystems/proc.txt: document hugetlb RSS
  2015-08-10  0:47                                 ` Naoya Horiguchi
  2015-08-10  0:47                                   ` [PATCH v3 1/3] smaps: fill missing fields for vma(VM_HUGETLB) Naoya Horiguchi
@ 2015-08-10  0:47                                   ` Naoya Horiguchi
  2015-08-11  0:44                                     ` David Rientjes
  2015-08-10  0:47                                   ` [PATCH v3 2/3] mm: hugetlb: add VmHugetlbRSS: field in /proc/pid/status Naoya Horiguchi
  2 siblings, 1 reply; 71+ messages in thread
From: Naoya Horiguchi @ 2015-08-10  0:47 UTC (permalink / raw)
  To: Andrew Morton, David Rientjes, Jörn Engel
  Cc: Mike Kravetz, linux-mm, linux-kernel, Naoya Horiguchi, Naoya Horiguchi


/proc/PID/{status,smaps} is aware of hugetlb RSS now, so let's document it.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 Documentation/filesystems/proc.txt | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git v4.2-rc4.orig/Documentation/filesystems/proc.txt v4.2-rc4/Documentation/filesystems/proc.txt
index 6f7fafde0884..cb8565e150ed 100644
--- v4.2-rc4.orig/Documentation/filesystems/proc.txt
+++ v4.2-rc4/Documentation/filesystems/proc.txt
@@ -168,6 +168,7 @@ For example, to get the status information of a process, all you have to do is
   VmLck:         0 kB
   VmHWM:       476 kB
   VmRSS:       476 kB
+  VmHugetlbRSS:  0 kB
   VmData:      156 kB
   VmStk:        88 kB
   VmExe:        68 kB
@@ -230,6 +231,7 @@ Table 1-2: Contents of the status files (as of 4.1)
  VmLck                       locked memory size
  VmHWM                       peak resident set size ("high water mark")
  VmRSS                       size of memory portions
+ VmHugetlbRSS                size of hugetlb memory portions
  VmData                      size of data, stack, and text segments
  VmStk                       size of data, stack, and text segments
  VmExe                       size of text segment
@@ -440,8 +442,12 @@ indicates the amount of memory currently marked as referenced or accessed.
 "Anonymous" shows the amount of memory that does not belong to any file.  Even
 a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE
 and a page is modified, the file page is replaced by a private anonymous copy.
-"Swap" shows how much would-be-anonymous memory is also used, but out on
-swap.
+"Swap" shows how much would-be-anonymous memory is also used, but out on swap.
+Since 4.3, "RSS" contains the amount of mappings for hugetlb pages. Although
+RSS of hugetlb mappings is maintained separately from normal mappings
+(displayed in "VmHugetlbRSS" field of /proc/PID/status,) /proc/PID/smaps shows
+both mappings in "RSS" field. Userspace applications can distinguish the
+type of mapping with the 'ht' flag in "VmFlags" field.
 
 "VmFlags" field deserves a separate description. This member represents the kernel
 flags associated with the particular virtual memory area in two letter encoded
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v3 2/3] mm: hugetlb: add VmHugetlbRSS: field in /proc/pid/status
  2015-08-10  0:47                                 ` Naoya Horiguchi
  2015-08-10  0:47                                   ` [PATCH v3 1/3] smaps: fill missing fields for vma(VM_HUGETLB) Naoya Horiguchi
  2015-08-10  0:47                                   ` [PATCH v3 3/3] Documentation/filesystems/proc.txt: document hugetlb RSS Naoya Horiguchi
@ 2015-08-10  0:47                                   ` Naoya Horiguchi
  2015-08-10  1:16                                     ` Naoya Horiguchi
  2 siblings, 1 reply; 71+ messages in thread
From: Naoya Horiguchi @ 2015-08-10  0:47 UTC (permalink / raw)
  To: Andrew Morton, David Rientjes, Jörn Engel
  Cc: Mike Kravetz, linux-mm, linux-kernel, Naoya Horiguchi, Naoya Horiguchi


Currently there's no easy way to get per-process usage of hugetlb pages, which
is inconvenient because applications which use hugetlb typically want to control
their processes on the basis of how much memory (including hugetlb) they use.
So this patch simply provides easy access to the info via /proc/pid/status.

This patch shouldn't change the OOM behavior (so hugetlb usage is ignored as
is now,) which I guess is fine until we have some strong reason to do it.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
v2 -> v3:
- use inline functions instead of macros for !CONFIG_HUGETLB_PAGE
---
 fs/proc/task_mmu.c       |  5 ++++-
 include/linux/hugetlb.h  | 24 ++++++++++++++++++++++++
 include/linux/mm.h       |  3 +++
 include/linux/mm_types.h |  3 +++
 mm/hugetlb.c             |  9 +++++++++
 mm/memory.c              |  4 ++--
 mm/rmap.c                |  4 +++-
 7 files changed, 48 insertions(+), 4 deletions(-)

diff --git v4.2-rc4.orig/fs/proc/task_mmu.c v4.2-rc4/fs/proc/task_mmu.c
index c7218603306d..f181f56fcce2 100644
--- v4.2-rc4.orig/fs/proc/task_mmu.c
+++ v4.2-rc4/fs/proc/task_mmu.c
@@ -22,7 +22,7 @@
 void task_mem(struct seq_file *m, struct mm_struct *mm)
 {
 	unsigned long data, text, lib, swap, ptes, pmds;
-	unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;
+	unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss, hugetlb_rss;
 
 	/*
 	 * Note: to minimize their overhead, mm maintains hiwater_vm and
@@ -37,6 +37,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
 	hiwater_rss = total_rss = get_mm_rss(mm);
 	if (hiwater_rss < mm->hiwater_rss)
 		hiwater_rss = mm->hiwater_rss;
+	hugetlb_rss = get_hugetlb_rss(mm);
 
 	data = mm->total_vm - mm->shared_vm - mm->stack_vm;
 	text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK)) >> 10;
@@ -51,6 +52,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
 		"VmPin:\t%8lu kB\n"
 		"VmHWM:\t%8lu kB\n"
 		"VmRSS:\t%8lu kB\n"
+		"VmHugetlbRSS:\t%8lu kB\n"
 		"VmData:\t%8lu kB\n"
 		"VmStk:\t%8lu kB\n"
 		"VmExe:\t%8lu kB\n"
@@ -64,6 +66,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
 		mm->pinned_vm << (PAGE_SHIFT-10),
 		hiwater_rss << (PAGE_SHIFT-10),
 		total_rss << (PAGE_SHIFT-10),
+		hugetlb_rss << (PAGE_SHIFT-10),
 		data << (PAGE_SHIFT-10),
 		mm->stack_vm << (PAGE_SHIFT-10), text, lib,
 		ptes >> 10,
diff --git v4.2-rc4.orig/include/linux/hugetlb.h v4.2-rc4/include/linux/hugetlb.h
index d891f949466a..fa956d94cacb 100644
--- v4.2-rc4.orig/include/linux/hugetlb.h
+++ v4.2-rc4/include/linux/hugetlb.h
@@ -469,6 +469,21 @@ static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
 #define hugepages_supported() (HPAGE_SHIFT != 0)
 #endif
 
+/*
+ * These simple wrappers are to hide MM_HUGETLBPAGES from outside the hugetlbfs
+ * subsystem. The counter MM_HUGETLBPAGES is maintained on a base-page basis,
+ * so it changes by 512, for example, when a 2MB hugepage is mapped or unmapped.
+ */
+static inline int get_hugetlb_rss(struct mm_struct *mm)
+{
+	return get_mm_counter(mm, MM_HUGETLBPAGES);
+}
+
+static inline void mod_hugetlb_rss(struct mm_struct *mm, long value)
+{
+	add_mm_counter(mm, MM_HUGETLBPAGES, value);
+}
+
 #else	/* CONFIG_HUGETLB_PAGE */
 struct hstate {};
 #define alloc_huge_page_node(h, nid) NULL
@@ -504,6 +519,15 @@ static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
 {
 	return &mm->page_table_lock;
 }
+
+static inline get_hugetlb_rss(struct mm_struct *mm)
+{
+	return 0;
+}
+
+static inline mod_hugetlb_rss(struct mm_struct *mm, long value)
+{
+}
 #endif	/* CONFIG_HUGETLB_PAGE */
 
 static inline spinlock_t *huge_pte_lock(struct hstate *h,
diff --git v4.2-rc4.orig/include/linux/mm.h v4.2-rc4/include/linux/mm.h
index 2e872f92dbac..9218a8856483 100644
--- v4.2-rc4.orig/include/linux/mm.h
+++ v4.2-rc4/include/linux/mm.h
@@ -1355,6 +1355,9 @@ static inline void sync_mm_rss(struct mm_struct *mm)
 }
 #endif
 
+extern inline void init_rss_vec(int *rss);
+extern inline void add_mm_rss_vec(struct mm_struct *mm, int *rss);
+
 int vma_wants_writenotify(struct vm_area_struct *vma);
 
 extern pte_t *__get_locked_pte(struct mm_struct *mm, unsigned long addr,
diff --git v4.2-rc4.orig/include/linux/mm_types.h v4.2-rc4/include/linux/mm_types.h
index 0038ac7466fd..887b43ba5a18 100644
--- v4.2-rc4.orig/include/linux/mm_types.h
+++ v4.2-rc4/include/linux/mm_types.h
@@ -348,6 +348,9 @@ enum {
 	MM_FILEPAGES,
 	MM_ANONPAGES,
 	MM_SWAPENTS,
+#ifdef CONFIG_HUGETLB_PAGE
+	MM_HUGETLBPAGES,
+#endif
 	NR_MM_COUNTERS
 };
 
diff --git v4.2-rc4.orig/mm/hugetlb.c v4.2-rc4/mm/hugetlb.c
index a8c3087089d8..12e5e7d3b60f 100644
--- v4.2-rc4.orig/mm/hugetlb.c
+++ v4.2-rc4/mm/hugetlb.c
@@ -2743,9 +2743,11 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
 	int ret = 0;
+	int rss[NR_MM_COUNTERS];
 
 	cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
 
+	init_rss_vec(rss);
 	mmun_start = vma->vm_start;
 	mmun_end = vma->vm_end;
 	if (cow)
@@ -2797,6 +2799,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			get_page(ptepage);
 			page_dup_rmap(ptepage);
 			set_huge_pte_at(dst, addr, dst_pte, entry);
+			rss[MM_HUGETLBPAGES] += pages_per_huge_page(h);
 		}
 		spin_unlock(src_ptl);
 		spin_unlock(dst_ptl);
@@ -2805,6 +2808,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	if (cow)
 		mmu_notifier_invalidate_range_end(src, mmun_start, mmun_end);
 
+	add_mm_rss_vec(dst, rss);
 	return ret;
 }
 
@@ -2823,6 +2827,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	unsigned long sz = huge_page_size(h);
 	const unsigned long mmun_start = start;	/* For mmu_notifiers */
 	const unsigned long mmun_end   = end;	/* For mmu_notifiers */
+	int rss[NR_MM_COUNTERS];
 
 	WARN_ON(!is_vm_hugetlb_page(vma));
 	BUG_ON(start & ~huge_page_mask(h));
@@ -2832,6 +2837,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
 	address = start;
 again:
+	init_rss_vec(rss);
 	for (; address < end; address += sz) {
 		ptep = huge_pte_offset(mm, address);
 		if (!ptep)
@@ -2877,6 +2883,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		if (huge_pte_dirty(pte))
 			set_page_dirty(page);
 
+		rss[MM_HUGETLBPAGES] -= pages_per_huge_page(h);
 		page_remove_rmap(page);
 		force_flush = !__tlb_remove_page(tlb, page);
 		if (force_flush) {
@@ -2892,6 +2899,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 unlock:
 		spin_unlock(ptl);
 	}
+	add_mm_rss_vec(mm, rss);
 	/*
 	 * mmu_gather ran out of room to batch pages, we break out of
 	 * the PTE lock to avoid doing the potential expensive TLB invalidate
@@ -3261,6 +3269,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 				&& (vma->vm_flags & VM_SHARED)));
 	set_huge_pte_at(mm, address, ptep, new_pte);
 
+	mod_hugetlb_rss(mm, pages_per_huge_page(h));
 	if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
 		/* Optimization, do the COW without a second fault */
 		ret = hugetlb_cow(mm, vma, address, ptep, new_pte, page, ptl);
diff --git v4.2-rc4.orig/mm/memory.c v4.2-rc4/mm/memory.c
index 388dcf9aa283..e09b53da2733 100644
--- v4.2-rc4.orig/mm/memory.c
+++ v4.2-rc4/mm/memory.c
@@ -620,12 +620,12 @@ int __pte_alloc_kernel(pmd_t *pmd, unsigned long address)
 	return 0;
 }
 
-static inline void init_rss_vec(int *rss)
+inline void init_rss_vec(int *rss)
 {
 	memset(rss, 0, sizeof(int) * NR_MM_COUNTERS);
 }
 
-static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
+inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
 {
 	int i;
 
diff --git v4.2-rc4.orig/mm/rmap.c v4.2-rc4/mm/rmap.c
index 171b68768df1..78e77b0ea3c3 100644
--- v4.2-rc4.orig/mm/rmap.c
+++ v4.2-rc4/mm/rmap.c
@@ -1230,7 +1230,9 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	update_hiwater_rss(mm);
 
 	if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
-		if (!PageHuge(page)) {
+		if (PageHuge(page)) {
+			mod_hugetlb_rss(mm, -(1 << compound_order(page)));
+		} else {
 			if (PageAnon(page))
 				dec_mm_counter(mm, MM_ANONPAGES);
 			else
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [PATCH v3 2/3] mm: hugetlb: add VmHugetlbRSS: field in /proc/pid/status
  2015-08-10  0:47                                   ` [PATCH v3 2/3] mm: hugetlb: add VmHugetlbRSS: field in /proc/pid/status Naoya Horiguchi
@ 2015-08-10  1:16                                     ` Naoya Horiguchi
  0 siblings, 0 replies; 71+ messages in thread
From: Naoya Horiguchi @ 2015-08-10  1:16 UTC (permalink / raw)
  To: Andrew Morton, David Rientjes, Jörn Engel
  Cc: Mike Kravetz, linux-mm, linux-kernel, Naoya Horiguchi


> @@ -504,6 +519,15 @@ static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
>  {
>  	return &mm->page_table_lock;
>  }
> +
> +static inline get_hugetlb_rss(struct mm_struct *mm)
> +{
> +	return 0;
> +}
> +
> +static inline mod_hugetlb_rss(struct mm_struct *mm, long value)
> +{
> +}
>  #endif	/* CONFIG_HUGETLB_PAGE */
>  
>  static inline spinlock_t *huge_pte_lock(struct hstate *h,

sorry for my silly mistake, this doesn't build.
Let me resend this.

Thanks,
Naoya Horiguchi
---
From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Subject: [PATCH] mm: hugetlb: add VmHugetlbRSS: field in /proc/pid/status

Currently there's no easy way to get per-process usage of hugetlb pages, which
is inconvenient because applications which use hugetlb typically want to control
their processes on the basis of how much memory (including hugetlb) they use.
So this patch simply provides easy access to the info via /proc/pid/status.

This patch shouldn't change the OOM behavior (so hugetlb usage is ignored as
is now,) which I guess is fine until we have some strong reason to do it.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
v2 -> v3:
- use inline functions instead of macros for !CONFIG_HUGETLB_PAGE
---
 fs/proc/task_mmu.c       |  5 ++++-
 include/linux/hugetlb.h  | 24 ++++++++++++++++++++++++
 include/linux/mm.h       |  3 +++
 include/linux/mm_types.h |  3 +++
 mm/hugetlb.c             |  9 +++++++++
 mm/memory.c              |  4 ++--
 mm/rmap.c                |  4 +++-
 7 files changed, 48 insertions(+), 4 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index c7218603306d..f181f56fcce2 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -22,7 +22,7 @@
 void task_mem(struct seq_file *m, struct mm_struct *mm)
 {
 	unsigned long data, text, lib, swap, ptes, pmds;
-	unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;
+	unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss, hugetlb_rss;
 
 	/*
 	 * Note: to minimize their overhead, mm maintains hiwater_vm and
@@ -37,6 +37,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
 	hiwater_rss = total_rss = get_mm_rss(mm);
 	if (hiwater_rss < mm->hiwater_rss)
 		hiwater_rss = mm->hiwater_rss;
+	hugetlb_rss = get_hugetlb_rss(mm);
 
 	data = mm->total_vm - mm->shared_vm - mm->stack_vm;
 	text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK)) >> 10;
@@ -51,6 +52,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
 		"VmPin:\t%8lu kB\n"
 		"VmHWM:\t%8lu kB\n"
 		"VmRSS:\t%8lu kB\n"
+		"VmHugetlbRSS:\t%8lu kB\n"
 		"VmData:\t%8lu kB\n"
 		"VmStk:\t%8lu kB\n"
 		"VmExe:\t%8lu kB\n"
@@ -64,6 +66,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
 		mm->pinned_vm << (PAGE_SHIFT-10),
 		hiwater_rss << (PAGE_SHIFT-10),
 		total_rss << (PAGE_SHIFT-10),
+		hugetlb_rss << (PAGE_SHIFT-10),
 		data << (PAGE_SHIFT-10),
 		mm->stack_vm << (PAGE_SHIFT-10), text, lib,
 		ptes >> 10,
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d891f949466a..8c72e935bda8 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -469,6 +469,21 @@ static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
 #define hugepages_supported() (HPAGE_SHIFT != 0)
 #endif
 
+/*
+ * These simple wrappers are to hide MM_HUGETLBPAGES from outside the hugetlbfs
+ * subsystem. The counter MM_HUGETLBPAGES is maintained on a base-page basis,
+ * so it changes by 512, for example, when a 2MB hugepage is mapped or unmapped.
+ */
+static inline int get_hugetlb_rss(struct mm_struct *mm)
+{
+	return get_mm_counter(mm, MM_HUGETLBPAGES);
+}
+
+static inline void mod_hugetlb_rss(struct mm_struct *mm, long value)
+{
+	add_mm_counter(mm, MM_HUGETLBPAGES, value);
+}
+
 #else	/* CONFIG_HUGETLB_PAGE */
 struct hstate {};
 #define alloc_huge_page_node(h, nid) NULL
@@ -504,6 +519,15 @@ static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
 {
 	return &mm->page_table_lock;
 }
+
+static inline int get_hugetlb_rss(struct mm_struct *mm)
+{
+	return 0;
+}
+
+static inline void mod_hugetlb_rss(struct mm_struct *mm, long value)
+{
+}
 #endif	/* CONFIG_HUGETLB_PAGE */
 
 static inline spinlock_t *huge_pte_lock(struct hstate *h,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2e872f92dbac..9218a8856483 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1355,6 +1355,9 @@ static inline void sync_mm_rss(struct mm_struct *mm)
 }
 #endif
 
+extern inline void init_rss_vec(int *rss);
+extern inline void add_mm_rss_vec(struct mm_struct *mm, int *rss);
+
 int vma_wants_writenotify(struct vm_area_struct *vma);
 
 extern pte_t *__get_locked_pte(struct mm_struct *mm, unsigned long addr,
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0038ac7466fd..887b43ba5a18 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -348,6 +348,9 @@ enum {
 	MM_FILEPAGES,
 	MM_ANONPAGES,
 	MM_SWAPENTS,
+#ifdef CONFIG_HUGETLB_PAGE
+	MM_HUGETLBPAGES,
+#endif
 	NR_MM_COUNTERS
 };
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a8c3087089d8..12e5e7d3b60f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2743,9 +2743,11 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
 	int ret = 0;
+	int rss[NR_MM_COUNTERS];
 
 	cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
 
+	init_rss_vec(rss);
 	mmun_start = vma->vm_start;
 	mmun_end = vma->vm_end;
 	if (cow)
@@ -2797,6 +2799,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			get_page(ptepage);
 			page_dup_rmap(ptepage);
 			set_huge_pte_at(dst, addr, dst_pte, entry);
+			rss[MM_HUGETLBPAGES] += pages_per_huge_page(h);
 		}
 		spin_unlock(src_ptl);
 		spin_unlock(dst_ptl);
@@ -2805,6 +2808,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	if (cow)
 		mmu_notifier_invalidate_range_end(src, mmun_start, mmun_end);
 
+	add_mm_rss_vec(dst, rss);
 	return ret;
 }
 
@@ -2823,6 +2827,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	unsigned long sz = huge_page_size(h);
 	const unsigned long mmun_start = start;	/* For mmu_notifiers */
 	const unsigned long mmun_end   = end;	/* For mmu_notifiers */
+	int rss[NR_MM_COUNTERS];
 
 	WARN_ON(!is_vm_hugetlb_page(vma));
 	BUG_ON(start & ~huge_page_mask(h));
@@ -2832,6 +2837,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
 	address = start;
 again:
+	init_rss_vec(rss);
 	for (; address < end; address += sz) {
 		ptep = huge_pte_offset(mm, address);
 		if (!ptep)
@@ -2877,6 +2883,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		if (huge_pte_dirty(pte))
 			set_page_dirty(page);
 
+		rss[MM_HUGETLBPAGES] -= pages_per_huge_page(h);
 		page_remove_rmap(page);
 		force_flush = !__tlb_remove_page(tlb, page);
 		if (force_flush) {
@@ -2892,6 +2899,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 unlock:
 		spin_unlock(ptl);
 	}
+	add_mm_rss_vec(mm, rss);
 	/*
 	 * mmu_gather ran out of room to batch pages, we break out of
 	 * the PTE lock to avoid doing the potential expensive TLB invalidate
@@ -3261,6 +3269,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 				&& (vma->vm_flags & VM_SHARED)));
 	set_huge_pte_at(mm, address, ptep, new_pte);
 
+	mod_hugetlb_rss(mm, pages_per_huge_page(h));
 	if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
 		/* Optimization, do the COW without a second fault */
 		ret = hugetlb_cow(mm, vma, address, ptep, new_pte, page, ptl);
diff --git a/mm/memory.c b/mm/memory.c
index 388dcf9aa283..e09b53da2733 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -620,12 +620,12 @@ int __pte_alloc_kernel(pmd_t *pmd, unsigned long address)
 	return 0;
 }
 
-static inline void init_rss_vec(int *rss)
+inline void init_rss_vec(int *rss)
 {
 	memset(rss, 0, sizeof(int) * NR_MM_COUNTERS);
 }
 
-static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
+inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
 {
 	int i;
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 171b68768df1..78e77b0ea3c3 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1230,7 +1230,9 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	update_hiwater_rss(mm);
 
 	if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
-		if (!PageHuge(page)) {
+		if (PageHuge(page)) {
+			mod_hugetlb_rss(mm, -(1 << compound_order(page)));
+		} else {
 			if (PageAnon(page))
 				dec_mm_counter(mm, MM_ANONPAGES);
 			else
-- 
2.4.3



^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 1/2] smaps: fill missing fields for vma(VM_HUGETLB)
  2015-08-07  7:24                             ` [PATCH v2 1/2] smaps: fill missing fields for vma(VM_HUGETLB) Naoya Horiguchi
  2015-08-07 22:50                               ` Andrew Morton
@ 2015-08-11  0:37                               ` David Rientjes
  2015-08-11 23:32                                 ` Naoya Horiguchi
  1 sibling, 1 reply; 71+ messages in thread
From: David Rientjes @ 2015-08-11  0:37 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, Jörn Engel, Mike Kravetz, linux-mm,
	linux-kernel, Naoya Horiguchi

On Fri, 7 Aug 2015, Naoya Horiguchi wrote:

> Currently smaps reports many zero fields for vma(VM_HUGETLB), which is
> inconvenient when we want to know per-task or per-vma base hugetlb usage.
> This patch enables these fields by introducing smaps_hugetlb_range().
> 
> before patch:
> 
>   Size:              20480 kB
>   Rss:                   0 kB
>   Pss:                   0 kB
>   Shared_Clean:          0 kB
>   Shared_Dirty:          0 kB
>   Private_Clean:         0 kB
>   Private_Dirty:         0 kB
>   Referenced:            0 kB
>   Anonymous:             0 kB
>   AnonHugePages:         0 kB
>   Swap:                  0 kB
>   KernelPageSize:     2048 kB
>   MMUPageSize:        2048 kB
>   Locked:                0 kB
>   VmFlags: rd wr mr mw me de ht
> 
> after patch:
> 
>   Size:              20480 kB
>   Rss:               18432 kB
>   Pss:               18432 kB
>   Shared_Clean:          0 kB
>   Shared_Dirty:          0 kB
>   Private_Clean:         0 kB
>   Private_Dirty:     18432 kB
>   Referenced:        18432 kB
>   Anonymous:         18432 kB
>   AnonHugePages:         0 kB
>   Swap:                  0 kB
>   KernelPageSize:     2048 kB
>   MMUPageSize:        2048 kB
>   Locked:                0 kB
>   VmFlags: rd wr mr mw me de ht
> 

I think this will lead to breakage, unfortunately, specifically for users 
who are concerned with resource management.

An example: we use memcg hierarchies to charge memory for individual jobs, 
specific users, and system overhead.  Memcg is a cgroup, so this is done 
for an aggregate of processes, and we often have to monitor their memory 
usage.  Each process isn't assigned to its own memcg, and I don't believe 
common users of memcg assign individual processes to their own memcgs.  

When a memcg is out of memory, we need to track the memory usage of 
processes attached to its memcg hierarchy to determine what is unexpected, 
either as a result of a new rollout or because of a memory leak.  To do 
that, we use the rss exported by smaps that is now changed with this 
patch.  By using smaps rather than /proc/pid/status, we can report where 
memory usage is unexpected.

This would cause our process that manages all memcgs on our systems to 
break.  Perhaps I haven't been as convincing in my previous messages of 
this, but it's quite an obvious userspace regression.
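
As an illustrative aside (not the actual tooling referred to above), this kind
of monitoring boils down to summing the per-VMA "Rss:" lines of
/proc/<pid>/smaps, so any change to what "Rss" covers changes what such a tool
reports:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Sum the per-VMA "Rss:" lines of /proc/<pid>/smaps, in kB. */
static long smaps_rss_kb(int pid)
{
	char path[64], line[256];
	long kb, total = 0;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/smaps", pid);
	f = fopen(path, "r");
	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "Rss: %ld kB", &kb) == 1)
			total += kb;
	fclose(f);
	return total;
}

int main(int argc, char **argv)
{
	int pid = argc > 1 ? atoi(argv[1]) : getpid();

	printf("pid %d: Rss %ld kB\n", pid, smaps_rss_kb(pid));
	return 0;
}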

This memory was not included in rss originally because memory in the 
hugetlb persistent pool is always resident.  Unmapping the memory does not 
free memory.  For this reason, hugetlb memory has always been treated as 
its own type of memory.

It would have been arguable back when hugetlbfs was introduced whether it 
should be included.  I'm afraid the ship has sailed on that since a decade 
has passed and it would cause userspace to break if existing metrics are
used that already have clearly defined semantics.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v3 3/3] Documentation/filesystems/proc.txt: document hugetlb RSS
  2015-08-10  0:47                                   ` [PATCH v3 3/3] Documentation/filesystems/proc.txt: document hugetlb RSS Naoya Horiguchi
@ 2015-08-11  0:44                                     ` David Rientjes
  2015-08-12  0:03                                       ` Naoya Horiguchi
  0 siblings, 1 reply; 71+ messages in thread
From: David Rientjes @ 2015-08-11  0:44 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, Jörn Engel, Mike Kravetz, linux-mm,
	linux-kernel, Naoya Horiguchi

On Mon, 10 Aug 2015, Naoya Horiguchi wrote:

> diff --git v4.2-rc4.orig/Documentation/filesystems/proc.txt v4.2-rc4/Documentation/filesystems/proc.txt
> index 6f7fafde0884..cb8565e150ed 100644
> --- v4.2-rc4.orig/Documentation/filesystems/proc.txt
> +++ v4.2-rc4/Documentation/filesystems/proc.txt
> @@ -168,6 +168,7 @@ For example, to get the status information of a process, all you have to do is
>    VmLck:         0 kB
>    VmHWM:       476 kB
>    VmRSS:       476 kB
> +  VmHugetlbRSS:  0 kB
>    VmData:      156 kB
>    VmStk:        88 kB
>    VmExe:        68 kB
> @@ -230,6 +231,7 @@ Table 1-2: Contents of the status files (as of 4.1)
>   VmLck                       locked memory size
>   VmHWM                       peak resident set size ("high water mark")
>   VmRSS                       size of memory portions
> + VmHugetlbRSS                size of hugetlb memory portions
>   VmData                      size of data, stack, and text segments
>   VmStk                       size of data, stack, and text segments
>   VmExe                       size of text segment
> @@ -440,8 +442,12 @@ indicates the amount of memory currently marked as referenced or accessed.
>  "Anonymous" shows the amount of memory that does not belong to any file.  Even
>  a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE
>  and a page is modified, the file page is replaced by a private anonymous copy.
> -"Swap" shows how much would-be-anonymous memory is also used, but out on
> -swap.
> +"Swap" shows how much would-be-anonymous memory is also used, but out on swap.
> +Since 4.3, "RSS" contains the amount of mappings for hugetlb pages. Although
> +RSS of hugetlb mappings is maintained separately from normal mappings
> +(displayed in "VmHugetlbRSS" field of /proc/PID/status,) /proc/PID/smaps shows
> > +both mappings in "RSS" field. Userspace applications can distinguish the
> > +type of mapping with the 'ht' flag in "VmFlags" field.
>  
>  "VmFlags" field deserves a separate description. This member represents the kernel
>  flags associated with the particular virtual memory area in two letter encoded

My objection to adding hugetlb memory to the RSS field of /proc/pid/smaps 
still stands and can be addressed in the thread of the first patch.  Since 
this includes wording that describes that change, then the objection would 
also cover that.

With regard to adding VmHugetlbRSS, I think the change is fine, and I 
appreciate that you call it VmHugetlbRSS and not VmHugeRSS since that 
would be confused with thp.

My only concern regarding VmHugetlbRSS would be extendability and whether 
we will eventually, or even today, want to differentiate between various 
hugetlb page sizes.  For example, if 1GB hugetlb pages on x86 are a 
precious resource, then how do I determine which process has mapped it 
rather than 512 2MB hugetlb pages?

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 1/2] smaps: fill missing fields for vma(VM_HUGETLB)
  2015-08-11  0:37                               ` David Rientjes
@ 2015-08-11 23:32                                 ` Naoya Horiguchi
  2015-08-11 23:48                                   ` David Rientjes
  0 siblings, 1 reply; 71+ messages in thread
From: Naoya Horiguchi @ 2015-08-11 23:32 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Jörn Engel, Mike Kravetz, linux-mm,
	linux-kernel, Naoya Horiguchi


On Mon, Aug 10, 2015 at 05:37:54PM -0700, David Rientjes wrote:
> On Fri, 7 Aug 2015, Naoya Horiguchi wrote:
> 
> > Currently smaps reports many zero fields for vma(VM_HUGETLB), which is
> > inconvenient when we want to know per-task or per-vma base hugetlb usage.
> > This patch enables these fields by introducing smaps_hugetlb_range().
> > 
> > before patch:
> > 
> >   Size:              20480 kB
> >   Rss:                   0 kB
> >   Pss:                   0 kB
> >   Shared_Clean:          0 kB
> >   Shared_Dirty:          0 kB
> >   Private_Clean:         0 kB
> >   Private_Dirty:         0 kB
> >   Referenced:            0 kB
> >   Anonymous:             0 kB
> >   AnonHugePages:         0 kB
> >   Swap:                  0 kB
> >   KernelPageSize:     2048 kB
> >   MMUPageSize:        2048 kB
> >   Locked:                0 kB
> >   VmFlags: rd wr mr mw me de ht
> > 
> > after patch:
> > 
> >   Size:              20480 kB
> >   Rss:               18432 kB
> >   Pss:               18432 kB
> >   Shared_Clean:          0 kB
> >   Shared_Dirty:          0 kB
> >   Private_Clean:         0 kB
> >   Private_Dirty:     18432 kB
> >   Referenced:        18432 kB
> >   Anonymous:         18432 kB
> >   AnonHugePages:         0 kB
> >   Swap:                  0 kB
> >   KernelPageSize:     2048 kB
> >   MMUPageSize:        2048 kB
> >   Locked:                0 kB
> >   VmFlags: rd wr mr mw me de ht
> > 
> 
> I think this will lead to breakage, unfortunately, specifically for users 
> who are concerned with resource management.
> 
> An example: we use memcg hierarchies to charge memory for individual jobs, 
> specific users, and system overhead.  Memcg is a cgroup, so this is done 
> for an aggregate of processes, and we often have to monitor their memory 
> usage.  Each process isn't assigned to its own memcg, and I don't believe 
> common users of memcg assign individual processes to their own memcgs.  
> 
> When a memcg is out of memory, we need to track the memory usage of 
> processes attached to its memcg hierarchy to determine what is unexpected, 
> either as a result of a new rollout or because of a memory leak.  To do 
> that, we use the rss exported by smaps that is now changed with this 
> patch.  By using smaps rather than /proc/pid/status, we can report where 
> memory usage is unexpected.
> 
> This would cause our process that manages all memcgs on our systems to 
> break.  Perhaps I haven't been as convincing in my previous messages of 
> this, but it's quite an obvious userspace regression.

OK, this version assumes that userspace distinguishes vma(VM_HUGETLB) with
"VmFlags" field, which is unrealistic. So I'll keep all existing fields
untouched by introducing hugetlb usage info.

> This memory was not included in rss originally because memory in the 
> hugetlb persistent pool is always resident.  Unmapping the memory does not 
> free memory.  For this reason, hugetlb memory has always been treated as 
> its own type of memory.

Right, so it might be better not to use the word "RSS" for hugetlb, maybe
something like "HugetlbPages:" seems better to me.

Thanks,
Naoya Horiguchi

> It would have been arguable back when hugetlbfs was introduced whether it 
> should be included.  I'm afraid the ship has sailed on that since a decade 
> has past and it would cause userspace to break if existing metrics are 
> used that already have clearly defined semantics.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 1/2] smaps: fill missing fields for vma(VM_HUGETLB)
  2015-08-11 23:32                                 ` Naoya Horiguchi
@ 2015-08-11 23:48                                   ` David Rientjes
  0 siblings, 0 replies; 71+ messages in thread
From: David Rientjes @ 2015-08-11 23:48 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, Jörn Engel, Mike Kravetz, linux-mm,
	linux-kernel, Naoya Horiguchi

On Tue, 11 Aug 2015, Naoya Horiguchi wrote:

> > This memory was not included in rss originally because memory in the 
> > hugetlb persistent pool is always resident.  Unmapping the memory does not 
> > free memory.  For this reason, hugetlb memory has always been treated as 
> > its own type of memory.
> 
> Right, so it might be better not to use the word "RSS" for hugetlb, maybe
> something like "HugetlbPages:" seems better to me.
> 

When the pagesize is also specified, as it is in smaps, I think this would 
be the best option.  Note that we can't distinguish between variable 
hugetlb sizes with VmFlags alone.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v3 3/3] Documentation/filesystems/proc.txt: document hugetlb RSS
  2015-08-11  0:44                                     ` David Rientjes
@ 2015-08-12  0:03                                       ` Naoya Horiguchi
  2015-08-12  7:45                                         ` [PATCH v4 1/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/smaps Naoya Horiguchi
                                                           ` (2 more replies)
  0 siblings, 3 replies; 71+ messages in thread
From: Naoya Horiguchi @ 2015-08-12  0:03 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Jörn Engel, Mike Kravetz, linux-mm,
	linux-kernel, Naoya Horiguchi


On Mon, Aug 10, 2015 at 05:44:54PM -0700, David Rientjes wrote:
> On Mon, 10 Aug 2015, Naoya Horiguchi wrote:
> 
> > diff --git v4.2-rc4.orig/Documentation/filesystems/proc.txt v4.2-rc4/Documentation/filesystems/proc.txt
> > index 6f7fafde0884..cb8565e150ed 100644
> > --- v4.2-rc4.orig/Documentation/filesystems/proc.txt
> > +++ v4.2-rc4/Documentation/filesystems/proc.txt
> > @@ -168,6 +168,7 @@ For example, to get the status information of a process, all you have to do is
> >    VmLck:         0 kB
> >    VmHWM:       476 kB
> >    VmRSS:       476 kB
> > +  VmHugetlbRSS:  0 kB
> >    VmData:      156 kB
> >    VmStk:        88 kB
> >    VmExe:        68 kB
> > @@ -230,6 +231,7 @@ Table 1-2: Contents of the status files (as of 4.1)
> >   VmLck                       locked memory size
> >   VmHWM                       peak resident set size ("high water mark")
> >   VmRSS                       size of memory portions
> > + VmHugetlbRSS                size of hugetlb memory portions
> >   VmData                      size of data, stack, and text segments
> >   VmStk                       size of data, stack, and text segments
> >   VmExe                       size of text segment
> > @@ -440,8 +442,12 @@ indicates the amount of memory currently marked as referenced or accessed.
> >  "Anonymous" shows the amount of memory that does not belong to any file.  Even
> >  a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE
> >  and a page is modified, the file page is replaced by a private anonymous copy.
> > -"Swap" shows how much would-be-anonymous memory is also used, but out on
> > -swap.
> > +"Swap" shows how much would-be-anonymous memory is also used, but out on swap.
> > +Since 4.3, "RSS" contains the amount of mappings for hugetlb pages. Although
> > +RSS of hugetlb mappings is maintained separately from normal mappings
> > +(displayed in "VmHugetlbRSS" field of /proc/PID/status,) /proc/PID/smaps shows
> > +both mappings in "RSS" field. Userspace applications can distinguish the
> > +type of mapping with the 'ht' flag in "VmFlags" field.
> >  
> >  "VmFlags" field deserves a separate description. This member represents the kernel
> >  flags associated with the particular virtual memory area in two letter encoded
> 
> My objection to adding hugetlb memory to the RSS field of /proc/pid/smaps 
> still stands and can be addressed in the thread of the first patch.  Since 
> this includes wording that describes that change, then the objection would 
> also cover that.

OK, I'll update this in accordance with the change on the first patch.

> With regard to adding VmHugetlbRSS, I think the change is fine, and I 
> appreciate that you call it VmHugetlbRSS and not VmHugeRSS since that 
> would be confused with thp.

I plan to rename the field, and the new name will/should avoid confusion
between thp and hugetlb.

> My only concern regarding VmHugetlbRSS would be extendability and whether 
> we will eventually, or even today, want to differentiate between various 
> hugetlb page sizes.  For example, if 1GB hugetlb pages on x86 are a 
> precious resource, then how do I determine which process has mapped it 
> rather than 512 2MB hugetlb pages?

"KernelPageSize" field in /proc/PID/smaps is aware of hugetlb page sizes,
so I expected userspace to detect the size itself. But /proc/PID/status shows
only proccess-wide info, so userspace applications must read both of these
files to know the usage per hugepage size, which might be inconvenient.

One idea is to show the new field like "VmHugetlbRSS: 2x2048kB 1x1GB" for
both of /proc/PID/{status,smaps}, which passes the full hugetlb info in a
single line, so it is easier to parse and process. Or, since some other fields
are shown in "kB", "VmHugetlbRSS: 1052672 kB (2x2048kB 1x1GB)" is possible for
human readability.
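
A userspace parser for such "<count>x<size>" pairs could be as small as the
sketch below. This is illustrative only: the field name and exact layout are
still a proposal at this point in the thread (v4 of the patch ends up printing
"HugetlbPages: <n> kB (<count>x<size>kB ...)" in /proc/PID/status):

#include <stdio.h>
#include <string.h>

int main(void)
{
	const char *example = "HugetlbPages:    1069056 kB (1x1048576kB 10x2048kB)";
	unsigned long total_kb, count, size_kb;
	const char *p;
	int n;

	if (sscanf(example, "HugetlbPages: %lu kB", &total_kb) != 1)
		return 1;
	printf("total: %lu kB\n", total_kb);

	/* walk the "(<count>x<size>kB ...)" list, one pair at a time */
	p = strchr(example, '(');
	if (!p)
		return 0;
	p++;
	while (sscanf(p, "%lux%lukB%n", &count, &size_kb, &n) == 2) {
		printf("  %lu page(s) of %lu kB\n", count, size_kb);
		p += n;
		while (*p == ' ')
			p++;
	}
	return 0;
}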

Thank you very much for the feedback, I'll repost soon, but any additional
comment is appreciated.

Naoya

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH v4 1/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/smaps
  2015-08-12  0:03                                       ` Naoya Horiguchi
@ 2015-08-12  7:45                                         ` Naoya Horiguchi
  2015-08-12  7:45                                           ` [PATCH v4 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status Naoya Horiguchi
  2015-08-12 20:25                                           ` [PATCH v4 1/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/smaps David Rientjes
  2015-08-20  8:26                                         ` [PATCH v5 0/2] hugetlb: display per-process/per-vma usage Naoya Horiguchi
  2015-09-16  0:21                                         ` [PATCH v1] mm: migrate: hugetlb: putback destination hugepage to active list Naoya Horiguchi
  2 siblings, 2 replies; 71+ messages in thread
From: Naoya Horiguchi @ 2015-08-12  7:45 UTC (permalink / raw)
  To: Andrew Morton, David Rientjes, Jörn Engel
  Cc: Mike Kravetz, linux-mm, linux-kernel, Naoya Horiguchi, Naoya Horiguchi


Currently /proc/PID/smaps provides no usage info for vma(VM_HUGETLB), which
is inconvenient when we want to know per-task or per-vma base hugetlb usage.
To solve this, this patch adds a new line for hugetlb usage like below:

  Size:              20480 kB
  Rss:                   0 kB
  Pss:                   0 kB
  Shared_Clean:          0 kB
  Shared_Dirty:          0 kB
  Private_Clean:         0 kB
  Private_Dirty:         0 kB
  Referenced:            0 kB
  Anonymous:             0 kB
  AnonHugePages:         0 kB
  HugetlbPages:      18432 kB
  Swap:                  0 kB
  KernelPageSize:     2048 kB
  MMUPageSize:        2048 kB
  Locked:                0 kB
  VmFlags: rd wr mr mw me de ht

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
v3 -> v4:
- suspend Acked-by tag because v3->v4 change is not trivial
- I stated in previous discussion that HugetlbPages line can contain page
  size info, but that's not necessary because we already have KernelPageSize
  info.
- merged documentation update, where the current documentation doesn't mention
  AnonHugePages, so it's also added.
---
 Documentation/filesystems/proc.txt |  7 +++++--
 fs/proc/task_mmu.c                 | 29 +++++++++++++++++++++++++++++
 2 files changed, 34 insertions(+), 2 deletions(-)

diff --git v4.2-rc4.orig/Documentation/filesystems/proc.txt v4.2-rc4/Documentation/filesystems/proc.txt
index 6f7fafde0884..22e40211ef64 100644
--- v4.2-rc4.orig/Documentation/filesystems/proc.txt
+++ v4.2-rc4/Documentation/filesystems/proc.txt
@@ -423,6 +423,8 @@ Private_Clean:         0 kB
 Private_Dirty:         0 kB
 Referenced:          892 kB
 Anonymous:             0 kB
+AnonHugePages:         0 kB
+HugetlbPages:          0 kB
 Swap:                  0 kB
 KernelPageSize:        4 kB
 MMUPageSize:           4 kB
@@ -440,8 +442,9 @@ indicates the amount of memory currently marked as referenced or accessed.
 "Anonymous" shows the amount of memory that does not belong to any file.  Even
 a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE
 and a page is modified, the file page is replaced by a private anonymous copy.
-"Swap" shows how much would-be-anonymous memory is also used, but out on
-swap.
+"AnonHugePages" shows the amount of memory backed by transparent hugepages.
+"HugetlbPages" shows the amount of memory backed by hugetlbfs pages.
+"Swap" shows how much would-be-anonymous memory is also used, but out on swap.
 
 "VmFlags" field deserves a separate description. This member represents the kernel
 flags associated with the particular virtual memory area in two letter encoded
diff --git v4.2-rc4.orig/fs/proc/task_mmu.c v4.2-rc4/fs/proc/task_mmu.c
index ca1e091881d4..2c37938b82ee 100644
--- v4.2-rc4.orig/fs/proc/task_mmu.c
+++ v4.2-rc4/fs/proc/task_mmu.c
@@ -445,6 +445,7 @@ struct mem_size_stats {
 	unsigned long anonymous;
 	unsigned long anonymous_thp;
 	unsigned long swap;
+	unsigned long hugetlb;
 	u64 pss;
 };
 
@@ -610,12 +611,38 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 	seq_putc(m, '\n');
 }
 
+#ifdef CONFIG_HUGETLB_PAGE
+static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
+				 unsigned long addr, unsigned long end,
+				 struct mm_walk *walk)
+{
+	struct mem_size_stats *mss = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	struct page *page = NULL;
+
+	if (pte_present(*pte)) {
+		page = vm_normal_page(vma, addr, *pte);
+	} else if (is_swap_pte(*pte)) {
+		swp_entry_t swpent = pte_to_swp_entry(*pte);
+
+		if (is_migration_entry(swpent))
+			page = migration_entry_to_page(swpent);
+	}
+	if (page)
+		mss->hugetlb += huge_page_size(hstate_vma(vma));
+	return 0;
+}
+#endif /* HUGETLB_PAGE */
+
 static int show_smap(struct seq_file *m, void *v, int is_pid)
 {
 	struct vm_area_struct *vma = v;
 	struct mem_size_stats mss;
 	struct mm_walk smaps_walk = {
 		.pmd_entry = smaps_pte_range,
+#ifdef CONFIG_HUGETLB_PAGE
+		.hugetlb_entry = smaps_hugetlb_range,
+#endif
 		.mm = vma->vm_mm,
 		.private = &mss,
 	};
@@ -637,6 +664,7 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
 		   "Referenced:     %8lu kB\n"
 		   "Anonymous:      %8lu kB\n"
 		   "AnonHugePages:  %8lu kB\n"
+		   "HugetlbPages:   %8lu kB\n"
 		   "Swap:           %8lu kB\n"
 		   "KernelPageSize: %8lu kB\n"
 		   "MMUPageSize:    %8lu kB\n"
@@ -651,6 +679,7 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
 		   mss.referenced >> 10,
 		   mss.anonymous >> 10,
 		   mss.anonymous_thp >> 10,
+		   mss.hugetlb >> 10,
 		   mss.swap >> 10,
 		   vma_kernel_pagesize(vma) >> 10,
 		   vma_mmu_pagesize(vma) >> 10,
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v4 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status
  2015-08-12  7:45                                         ` [PATCH v4 1/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/smaps Naoya Horiguchi
@ 2015-08-12  7:45                                           ` Naoya Horiguchi
  2015-08-12 20:30                                             ` David Rientjes
  2015-08-12 20:25                                           ` [PATCH v4 1/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/smaps David Rientjes
  1 sibling, 1 reply; 71+ messages in thread
From: Naoya Horiguchi @ 2015-08-12  7:45 UTC (permalink / raw)
  To: Andrew Morton, David Rientjes, Jörn Engel
  Cc: Mike Kravetz, linux-mm, linux-kernel, Naoya Horiguchi, Naoya Horiguchi


Currently there's no easy way to get per-process usage of hugetlb pages, which
is inconvenient because userspace applications which use hugetlb typically want
to control their processes on the basis of how much memory (including hugetlb)
they use. So this patch simply provides easy access to the info via
/proc/PID/status.

With this patch, for example, /proc/PID/status shows a line like this:

  HugetlbPages:      20480 kB (10x2048kB)

If your system supports and enables multiple hugepage sizes, the line looks
like this:

  HugetlbPages:    1069056 kB (1x1048576kB 10x2048kB)

, so you can easily see how many hugepages of each page size a process uses.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
v3 -> v4:
- rename field (VmHugetlbRSS is not the best name)
- introduce struct hugetlb_usage in struct mm_struct (no invasion to struct
  mm_rss_stat)
- introduce hugetlb_report_usage()
- merged documentation update

v2 -> v3:
- use inline functions instead of macros for !CONFIG_HUGETLB_PAGE
---
 Documentation/filesystems/proc.txt |  3 +++
 fs/proc/task_mmu.c                 |  1 +
 include/linux/hugetlb.h            | 20 ++++++++++++++++++++
 include/linux/mm_types.h           | 10 ++++++++++
 mm/hugetlb.c                       | 27 +++++++++++++++++++++++++++
 mm/rmap.c                          |  4 +++-
 6 files changed, 64 insertions(+), 1 deletion(-)

diff --git v4.2-rc4.orig/Documentation/filesystems/proc.txt v4.2-rc4/Documentation/filesystems/proc.txt
index 22e40211ef64..e92a4a91fc99 100644
--- v4.2-rc4.orig/Documentation/filesystems/proc.txt
+++ v4.2-rc4/Documentation/filesystems/proc.txt
@@ -174,6 +174,7 @@ For example, to get the status information of a process, all you have to do is
   VmLib:      1412 kB
   VmPTE:        20 kb
   VmSwap:        0 kB
+  HugetlbPages:          0 kB (0x2048kB)
   Threads:        1
   SigQ:   0/28578
   SigPnd: 0000000000000000
@@ -237,6 +238,8 @@ Table 1-2: Contents of the status files (as of 4.1)
  VmPTE                       size of page table entries
  VmPMD                       size of second level page tables
  VmSwap                      size of swap usage (the number of referred swapents)
+ HugetlbPages                size of hugetlb memory portions (with additional info
+                             about number of mapped hugepages for each page size)
  Threads                     number of threads
  SigQ                        number of signals queued/max. number for queue
  SigPnd                      bitmap of pending signals for the thread
diff --git v4.2-rc4.orig/fs/proc/task_mmu.c v4.2-rc4/fs/proc/task_mmu.c
index 2c37938b82ee..b3cf7fa9ef6c 100644
--- v4.2-rc4.orig/fs/proc/task_mmu.c
+++ v4.2-rc4/fs/proc/task_mmu.c
@@ -69,6 +69,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
 		ptes >> 10,
 		pmds >> 10,
 		swap << (PAGE_SHIFT-10));
+	hugetlb_report_usage(m, mm);
 }
 
 unsigned long task_vsize(struct mm_struct *mm)
diff --git v4.2-rc4.orig/include/linux/hugetlb.h v4.2-rc4/include/linux/hugetlb.h
index d891f949466a..64aa4db01f48 100644
--- v4.2-rc4.orig/include/linux/hugetlb.h
+++ v4.2-rc4/include/linux/hugetlb.h
@@ -469,6 +469,18 @@ static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
 #define hugepages_supported() (HPAGE_SHIFT != 0)
 #endif
 
+void hugetlb_report_usage(struct seq_file *m, struct mm_struct *mm);
+
+static inline void inc_hugetlb_count(struct mm_struct *mm, struct hstate *h)
+{
+	atomic_long_inc(&mm->hugetlb_usage.count[hstate_index(h)]);
+}
+
+static inline void dec_hugetlb_count(struct mm_struct *mm, struct hstate *h)
+{
+	atomic_long_dec(&mm->hugetlb_usage.count[hstate_index(h)]);
+}
+
 #else	/* CONFIG_HUGETLB_PAGE */
 struct hstate {};
 #define alloc_huge_page_node(h, nid) NULL
@@ -504,6 +516,14 @@ static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
 {
 	return &mm->page_table_lock;
 }
+
+static inline void hugetlb_report_usage(struct seq_file *f, struct mm_struct *m)
+{
+}
+
+static inline void dec_hugetlb_count(struct mm_struct *mm, struct hstate *h)
+{
+}
 #endif	/* CONFIG_HUGETLB_PAGE */
 
 static inline spinlock_t *huge_pte_lock(struct hstate *h,
diff --git v4.2-rc4.orig/include/linux/mm_types.h v4.2-rc4/include/linux/mm_types.h
index 0038ac7466fd..e95c5fe1eb7d 100644
--- v4.2-rc4.orig/include/linux/mm_types.h
+++ v4.2-rc4/include/linux/mm_types.h
@@ -364,6 +364,12 @@ struct mm_rss_stat {
 	atomic_long_t count[NR_MM_COUNTERS];
 };
 
+#ifdef CONFIG_HUGETLB_PAGE
+struct hugetlb_usage {
+	atomic_long_t count[HUGE_MAX_HSTATE];
+};
+#endif
+
 struct kioctx_table;
 struct mm_struct {
 	struct vm_area_struct *mmap;		/* list of VMAs */
@@ -484,6 +490,10 @@ struct mm_struct {
 	/* address of the bounds directory */
 	void __user *bd_addr;
 #endif
+
+#ifdef CONFIG_HUGETLB_PAGE
+	struct hugetlb_usage hugetlb_usage;
+#endif
 };
 
 static inline void mm_init_cpumask(struct mm_struct *mm)
diff --git v4.2-rc4.orig/mm/hugetlb.c v4.2-rc4/mm/hugetlb.c
index a8c3087089d8..b890431d0763 100644
--- v4.2-rc4.orig/mm/hugetlb.c
+++ v4.2-rc4/mm/hugetlb.c
@@ -2562,6 +2562,30 @@ void hugetlb_show_meminfo(void)
 				1UL << (huge_page_order(h) + PAGE_SHIFT - 10));
 }
 
+void hugetlb_report_usage(struct seq_file *m, struct mm_struct *mm)
+{
+	int i;
+	unsigned long total_usage = 0;
+
+	for (i = 0; i < HUGE_MAX_HSTATE; i++) {
+		total_usage += atomic_long_read(&mm->hugetlb_usage.count[i]) *
+			(huge_page_size(&hstates[i]) >> 10);
+	}
+
+	seq_printf(m, "HugetlbPages:\t%8lu kB (", total_usage);
+	for (i = 0; i < HUGE_MAX_HSTATE; i++) {
+		if (huge_page_order(&hstates[i]) == 0)
+			break;
+		if (i > 0)
+			seq_puts(m, " ");
+
+		seq_printf(m, "%ldx%dkB",
+			atomic_long_read(&mm->hugetlb_usage.count[i]),
+			huge_page_size(&hstates[i]) >> 10);
+	}
+	seq_puts(m, ")\n");
+}
+
 /* Return the number pages of memory we physically have, in PAGE_SIZE units. */
 unsigned long hugetlb_total_pages(void)
 {
@@ -2797,6 +2821,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			get_page(ptepage);
 			page_dup_rmap(ptepage);
 			set_huge_pte_at(dst, addr, dst_pte, entry);
+			inc_hugetlb_count(dst, h);
 		}
 		spin_unlock(src_ptl);
 		spin_unlock(dst_ptl);
@@ -2877,6 +2902,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		if (huge_pte_dirty(pte))
 			set_page_dirty(page);
 
+		dec_hugetlb_count(mm, h);
 		page_remove_rmap(page);
 		force_flush = !__tlb_remove_page(tlb, page);
 		if (force_flush) {
@@ -3261,6 +3287,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 				&& (vma->vm_flags & VM_SHARED)));
 	set_huge_pte_at(mm, address, ptep, new_pte);
 
+	inc_hugetlb_count(mm, h);
 	if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
 		/* Optimization, do the COW without a second fault */
 		ret = hugetlb_cow(mm, vma, address, ptep, new_pte, page, ptl);
diff --git v4.2-rc4.orig/mm/rmap.c v4.2-rc4/mm/rmap.c
index 171b68768df1..b33278bc4ddb 100644
--- v4.2-rc4.orig/mm/rmap.c
+++ v4.2-rc4/mm/rmap.c
@@ -1230,7 +1230,9 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	update_hiwater_rss(mm);
 
 	if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
-		if (!PageHuge(page)) {
+		if (PageHuge(page)) {
+			dec_hugetlb_count(mm, page_hstate(page));
+		} else {
 			if (PageAnon(page))
 				dec_mm_counter(mm, MM_ANONPAGES);
 			else
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [PATCH v4 1/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/smaps
  2015-08-12  7:45                                         ` [PATCH v4 1/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/smaps Naoya Horiguchi
  2015-08-12  7:45                                           ` [PATCH v4 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status Naoya Horiguchi
@ 2015-08-12 20:25                                           ` David Rientjes
  2015-08-13 21:14                                             ` Jörn Engel
  1 sibling, 1 reply; 71+ messages in thread
From: David Rientjes @ 2015-08-12 20:25 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, Jörn Engel, Mike Kravetz, linux-mm,
	linux-kernel, Naoya Horiguchi

On Wed, 12 Aug 2015, Naoya Horiguchi wrote:

> Currently /proc/PID/smaps provides no usage info for vma(VM_HUGETLB), which
> is inconvenient when we want to know per-task or per-vma base hugetlb usage.
> To solve this, this patch adds a new line for hugetlb usage like below:
> 
>   Size:              20480 kB
>   Rss:                   0 kB
>   Pss:                   0 kB
>   Shared_Clean:          0 kB
>   Shared_Dirty:          0 kB
>   Private_Clean:         0 kB
>   Private_Dirty:         0 kB
>   Referenced:            0 kB
>   Anonymous:             0 kB
>   AnonHugePages:         0 kB
>   HugetlbPages:      18432 kB
>   Swap:                  0 kB
>   KernelPageSize:     2048 kB
>   MMUPageSize:        2048 kB
>   Locked:                0 kB
>   VmFlags: rd wr mr mw me de ht
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

Acked-by: David Rientjes <rientjes@google.com>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v4 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status
  2015-08-12  7:45                                           ` [PATCH v4 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status Naoya Horiguchi
@ 2015-08-12 20:30                                             ` David Rientjes
  2015-08-13  0:45                                               ` Naoya Horiguchi
                                                                 ` (2 more replies)
  0 siblings, 3 replies; 71+ messages in thread
From: David Rientjes @ 2015-08-12 20:30 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, Jörn Engel, Mike Kravetz, linux-mm,
	linux-kernel, Naoya Horiguchi

On Wed, 12 Aug 2015, Naoya Horiguchi wrote:

> Currently there's no easy way to get per-process usage of hugetlb pages, which
> is inconvenient because userspace applications which use hugetlb typically want
> to control their processes on the basis of how much memory (including hugetlb)
> they use. So this patch simply provides easy access to the info via
> /proc/PID/status.
> 
> With this patch, for example, /proc/PID/status shows a line like this:
> 
>   HugetlbPages:      20480 kB (10x2048kB)
> 
> If your system supports and enables multiple hugepage sizes, the line looks
> like this:
> 
>   HugetlbPages:    1069056 kB (1x1048576kB 10x2048kB)
> 
> , so you can easily know how many hugepages in which pagesize are used by a
> process.
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

I'm happy with this and thanks very much for going the extra mile and 
breaking the usage down by hstate size.

I'd be interested in the comments of others, though, to see if there is 
any reservation about the hstate size breakdown.  It may actually find no 
current customer who is interested in parsing it.  (If we keep it, I would 
suggest the 'x' change to '*' similar to per-order breakdowns in 
show_mem()).  It may also be possible to add it later if a definitive 
usecase is presented.

But overall I'm very happy with the new addition and think it's a good 
solution to the problem.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v4 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status
  2015-08-12 20:30                                             ` David Rientjes
@ 2015-08-13  0:45                                               ` Naoya Horiguchi
  2015-08-13 21:14                                                 ` Jörn Engel
  2015-08-13 21:13                                               ` Jörn Engel
  2015-08-17 21:28                                               ` David Rientjes
  2 siblings, 1 reply; 71+ messages in thread
From: Naoya Horiguchi @ 2015-08-13  0:45 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Jörn Engel, Mike Kravetz, linux-mm,
	linux-kernel, Naoya Horiguchi

On Wed, Aug 12, 2015 at 01:30:27PM -0700, David Rientjes wrote:
> On Wed, 12 Aug 2015, Naoya Horiguchi wrote:
> 
> > Currently there's no easy way to get per-process usage of hugetlb pages, which
> > is inconvenient because userspace applications which use hugetlb typically want
> > to control their processes on the basis of how much memory (including hugetlb)
> > they use. So this patch simply provides easy access to the info via
> > /proc/PID/status.
> > 
> > With this patch, for example, /proc/PID/status shows a line like this:
> > 
> >   HugetlbPages:      20480 kB (10x2048kB)
> > 
> > If your system supports and enables multiple hugepage sizes, the line looks
> > like this:
> > 
> >   HugetlbPages:    1069056 kB (1x1048576kB 10x2048kB)
> > 
> > , so you can easily know how many hugepages in which pagesize are used by a
> > process.
> > 
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> 
> I'm happy with this and thanks very much for going the extra mile and 
> breaking the usage down by hstate size.

Great to hear that.

> I'd be interested in the comments of others, though, to see if there is 
> any reservation about the hstate size breakdown.  It may actually find no 
> current customer who is interested in parsing it.  (If we keep it, I would 
> suggest the 'x' change to '*' similar to per-order breakdowns in 
> show_mem()).  It may also be possible to add it later if a definitive 
> usecase is presented.

I'm fine to change to '*'.

Thanks,
Naoya Horiguchi
---
From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Subject: [PATCH] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status

Currently there's no easy way to get per-process usage of hugetlb pages, which
is inconvenient because userspace applications which use hugetlb typically want
to control their processes on the basis of how much memory (including hugetlb)
they use. So this patch simply provides easy access to the info via
/proc/PID/status.

With this patch, for example, /proc/PID/status shows a line like this:

  HugetlbPages:      20480 kB (10*2048kB)

If your system supports and enables multiple hugepage sizes, the line looks
like this:

  HugetlbPages:    1069056 kB (1*1048576kB 10*2048kB)

This way you can easily see how many hugepages of each page size a process is using.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 Documentation/filesystems/proc.txt |  3 +++
 fs/proc/task_mmu.c                 |  1 +
 include/linux/hugetlb.h            | 20 ++++++++++++++++++++
 include/linux/mm_types.h           | 10 ++++++++++
 mm/hugetlb.c                       | 27 +++++++++++++++++++++++++++
 mm/rmap.c                          |  4 +++-
 6 files changed, 64 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 22e40211ef64..f561fc46e41b 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -174,6 +174,7 @@ For example, to get the status information of a process, all you have to do is
   VmLib:      1412 kB
   VmPTE:        20 kb
   VmSwap:        0 kB
+  HugetlbPages:          0 kB (0*2048kB)
   Threads:        1
   SigQ:   0/28578
   SigPnd: 0000000000000000
@@ -237,6 +238,8 @@ Table 1-2: Contents of the status files (as of 4.1)
  VmPTE                       size of page table entries
  VmPMD                       size of second level page tables
  VmSwap                      size of swap usage (the number of referred swapents)
+ HugetlbPages                size of hugetlb memory portions (with additional info
+                             about number of mapped hugepages for each page size)
  Threads                     number of threads
  SigQ                        number of signals queued/max. number for queue
  SigPnd                      bitmap of pending signals for the thread
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 2c37938b82ee..b3cf7fa9ef6c 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -69,6 +69,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
 		ptes >> 10,
 		pmds >> 10,
 		swap << (PAGE_SHIFT-10));
+	hugetlb_report_usage(m, mm);
 }
 
 unsigned long task_vsize(struct mm_struct *mm)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d891f949466a..64aa4db01f48 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -469,6 +469,18 @@ static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
 #define hugepages_supported() (HPAGE_SHIFT != 0)
 #endif
 
+void hugetlb_report_usage(struct seq_file *m, struct mm_struct *mm);
+
+static inline void inc_hugetlb_count(struct mm_struct *mm, struct hstate *h)
+{
+	atomic_long_inc(&mm->hugetlb_usage.count[hstate_index(h)]);
+}
+
+static inline void dec_hugetlb_count(struct mm_struct *mm, struct hstate *h)
+{
+	atomic_long_dec(&mm->hugetlb_usage.count[hstate_index(h)]);
+}
+
 #else	/* CONFIG_HUGETLB_PAGE */
 struct hstate {};
 #define alloc_huge_page_node(h, nid) NULL
@@ -504,6 +516,14 @@ static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
 {
 	return &mm->page_table_lock;
 }
+
+static inline void hugetlb_report_usage(struct seq_file *f, struct mm_struct *m)
+{
+}
+
+static inline void dec_hugetlb_count(struct mm_struct *mm, struct hstate *h)
+{
+}
 #endif	/* CONFIG_HUGETLB_PAGE */
 
 static inline spinlock_t *huge_pte_lock(struct hstate *h,
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0038ac7466fd..e95c5fe1eb7d 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -364,6 +364,12 @@ struct mm_rss_stat {
 	atomic_long_t count[NR_MM_COUNTERS];
 };
 
+#ifdef CONFIG_HUGETLB_PAGE
+struct hugetlb_usage {
+	atomic_long_t count[HUGE_MAX_HSTATE];
+};
+#endif
+
 struct kioctx_table;
 struct mm_struct {
 	struct vm_area_struct *mmap;		/* list of VMAs */
@@ -484,6 +490,10 @@ struct mm_struct {
 	/* address of the bounds directory */
 	void __user *bd_addr;
 #endif
+
+#ifdef CONFIG_HUGETLB_PAGE
+	struct hugetlb_usage hugetlb_usage;
+#endif
 };
 
 static inline void mm_init_cpumask(struct mm_struct *mm)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a8c3087089d8..2338c9713b7a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2562,6 +2562,30 @@ void hugetlb_show_meminfo(void)
 				1UL << (huge_page_order(h) + PAGE_SHIFT - 10));
 }
 
+void hugetlb_report_usage(struct seq_file *m, struct mm_struct *mm)
+{
+	int i;
+	unsigned long total_usage = 0;
+
+	for (i = 0; i < HUGE_MAX_HSTATE; i++) {
+		total_usage += atomic_long_read(&mm->hugetlb_usage.count[i]) *
+			(huge_page_size(&hstates[i]) >> 10);
+	}
+
+	seq_printf(m, "HugetlbPages:\t%8lu kB (", total_usage);
+	for (i = 0; i < HUGE_MAX_HSTATE; i++) {
+		if (huge_page_order(&hstates[i]) == 0)
+			break;
+		if (i > 0)
+			seq_puts(m, " ");
+
+		seq_printf(m, "%ld*%dkB",
+			atomic_long_read(&mm->hugetlb_usage.count[i]),
+			huge_page_size(&hstates[i]) >> 10);
+	}
+	seq_puts(m, ")\n");
+}
+
 /* Return the number pages of memory we physically have, in PAGE_SIZE units. */
 unsigned long hugetlb_total_pages(void)
 {
@@ -2797,6 +2821,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			get_page(ptepage);
 			page_dup_rmap(ptepage);
 			set_huge_pte_at(dst, addr, dst_pte, entry);
+			inc_hugetlb_count(dst, h);
 		}
 		spin_unlock(src_ptl);
 		spin_unlock(dst_ptl);
@@ -2877,6 +2902,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		if (huge_pte_dirty(pte))
 			set_page_dirty(page);
 
+		dec_hugetlb_count(mm, h);
 		page_remove_rmap(page);
 		force_flush = !__tlb_remove_page(tlb, page);
 		if (force_flush) {
@@ -3261,6 +3287,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 				&& (vma->vm_flags & VM_SHARED)));
 	set_huge_pte_at(mm, address, ptep, new_pte);
 
+	inc_hugetlb_count(mm, h);
 	if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
 		/* Optimization, do the COW without a second fault */
 		ret = hugetlb_cow(mm, vma, address, ptep, new_pte, page, ptl);
diff --git a/mm/rmap.c b/mm/rmap.c
index 171b68768df1..b33278bc4ddb 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1230,7 +1230,9 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	update_hiwater_rss(mm);
 
 	if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
-		if (!PageHuge(page)) {
+		if (PageHuge(page)) {
+			dec_hugetlb_count(mm, page_hstate(page));
+		} else {
 			if (PageAnon(page))
 				dec_mm_counter(mm, MM_ANONPAGES);
 			else
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [PATCH v4 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status
  2015-08-12 20:30                                             ` David Rientjes
  2015-08-13  0:45                                               ` Naoya Horiguchi
@ 2015-08-13 21:13                                               ` Jörn Engel
  2015-08-17 21:28                                               ` David Rientjes
  2 siblings, 0 replies; 71+ messages in thread
From: Jörn Engel @ 2015-08-13 21:13 UTC (permalink / raw)
  To: David Rientjes
  Cc: Naoya Horiguchi, Andrew Morton, Mike Kravetz, linux-mm,
	linux-kernel, Naoya Horiguchi

On Wed, Aug 12, 2015 at 01:30:27PM -0700, David Rientjes wrote:
> 
> I'd be interested in the comments of others, though, to see if there is 
> any reservation about the hstate size breakdown.  It may actually find no 
> current customer who is interested in parsing it.  (If we keep it, I would 
> suggest the 'x' change to '*' similar to per-order breakdowns in 
> show_mem()).  It may also be possible to add it later if a definitive 
> usecase is presented.

I have no interest in parsing the size breakdown today.  I might change
my mind tomorrow and having the extra information hurts very little, so
I won't argue against it either.

> But overall I'm very happy with the new addition and think it's a good 
> solution to the problem.

Agreed.  Thank you!

Jörn

--
One of the painful things about our time is that those who feel certainty
are stupid, and those with any imagination and understanding are filled
with doubt and indecision.
-- Bertrand Russell

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v4 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status
  2015-08-13  0:45                                               ` Naoya Horiguchi
@ 2015-08-13 21:14                                                 ` Jörn Engel
  0 siblings, 0 replies; 71+ messages in thread
From: Jörn Engel @ 2015-08-13 21:14 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: David Rientjes, Andrew Morton, Mike Kravetz, linux-mm,
	linux-kernel, Naoya Horiguchi

On Thu, Aug 13, 2015 at 12:45:33AM +0000, Naoya Horiguchi wrote:
> From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Subject: [PATCH] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status
> 
> Currently there's no easy way to get per-process usage of hugetlb pages, which
> is inconvenient because userspace applications which use hugetlb typically want
> to control their processes on the basis of how much memory (including hugetlb)
> they use. So this patch simply provides easy access to the info via
> /proc/PID/status.
> 
> With this patch, for example, /proc/PID/status shows a line like this:
> 
>   HugetlbPages:      20480 kB (10*2048kB)
> 
> If your system supports and enables multiple hugepage sizes, the line looks
> like this:
> 
>   HugetlbPages:    1069056 kB (1*1048576kB 10*2048kB)
> 
> , so you can easily know how many hugepages in which pagesize are used by a
> process.
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

Acked-by: Joern Engel <joern@logfs.org>

Jörn

--
Computer system analysis is like child-rearing; you can do grievous damage,
but you cannot ensure success."
-- Tom DeMarco

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v4 1/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/smaps
  2015-08-12 20:25                                           ` [PATCH v4 1/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/smaps David Rientjes
@ 2015-08-13 21:14                                             ` Jörn Engel
  0 siblings, 0 replies; 71+ messages in thread
From: Jörn Engel @ 2015-08-13 21:14 UTC (permalink / raw)
  To: David Rientjes
  Cc: Naoya Horiguchi, Andrew Morton, Mike Kravetz, linux-mm,
	linux-kernel, Naoya Horiguchi

On Wed, Aug 12, 2015 at 01:25:59PM -0700, David Rientjes wrote:
> On Wed, 12 Aug 2015, Naoya Horiguchi wrote:
> 
> > Currently /proc/PID/smaps provides no usage info for vma(VM_HUGETLB), which
> > is inconvenient when we want to know per-task or per-vma base hugetlb usage.
> > To solve this, this patch adds a new line for hugetlb usage like below:
> > 
> >   Size:              20480 kB
> >   Rss:                   0 kB
> >   Pss:                   0 kB
> >   Shared_Clean:          0 kB
> >   Shared_Dirty:          0 kB
> >   Private_Clean:         0 kB
> >   Private_Dirty:         0 kB
> >   Referenced:            0 kB
> >   Anonymous:             0 kB
> >   AnonHugePages:         0 kB
> >   HugetlbPages:      18432 kB
> >   Swap:                  0 kB
> >   KernelPageSize:     2048 kB
> >   MMUPageSize:        2048 kB
> >   Locked:                0 kB
> >   VmFlags: rd wr mr mw me de ht
> > 
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> 
> Acked-by: David Rientjes <rientjes@google.com>

Acked-by: Joern Engel <joern@logfs.org>

Jörn

--
One of my most productive days was throwing away 1000 lines of code.
-- Ken Thompson.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v4 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status
  2015-08-12 20:30                                             ` David Rientjes
  2015-08-13  0:45                                               ` Naoya Horiguchi
  2015-08-13 21:13                                               ` Jörn Engel
@ 2015-08-17 21:28                                               ` David Rientjes
  2 siblings, 0 replies; 71+ messages in thread
From: David Rientjes @ 2015-08-17 21:28 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, Jörn Engel, Mike Kravetz, linux-mm,
	linux-kernel, Naoya Horiguchi

On Wed, 12 Aug 2015, David Rientjes wrote:

> I'm happy with this and thanks very much for going the extra mile and 
> breaking the usage down by hstate size.
> 
> I'd be interested in the comments of others, though, to see if there is 
> any reservation about the hstate size breakdown.  It may actually find no 
> current customer who is interested in parsing it.  (If we keep it, I would 
> suggest the 'x' change to '*' similar to per-order breakdowns in 
> show_mem()).  It may also be possible to add it later if a definitive 
> usecase is presented.
> 
> But overall I'm very happy with the new addition and think it's a good 
> solution to the problem.
> 

No objections from anybody else, so

Acked-by: David Rientjes <rientjes@google.com>

Thanks Naoya!

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH v5 0/2] hugetlb: display per-process/per-vma usage
  2015-08-12  0:03                                       ` Naoya Horiguchi
  2015-08-12  7:45                                         ` [PATCH v4 1/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/smaps Naoya Horiguchi
@ 2015-08-20  8:26                                         ` Naoya Horiguchi
  2015-08-20  8:26                                           ` [PATCH v5 1/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/smaps Naoya Horiguchi
  2015-08-20  8:26                                           ` [PATCH v5 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status Naoya Horiguchi
  2015-09-16  0:21                                         ` [PATCH v1] mm: migrate: hugetlb: putback destination hugepage to active list Naoya Horiguchi
  2 siblings, 2 replies; 71+ messages in thread
From: Naoya Horiguchi @ 2015-08-20  8:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Jörn Engel, Mike Kravetz, linux-mm,
	linux-kernel, Naoya Horiguchi, Naoya Horiguchi

The previous version had build issues on some architectures, because it
required moving the definition of HUGE_MAX_HSTATE across header files in
order to embed a new data structure, struct hugetlb_usage, into struct
mm_struct. That was hard to solve cleanly, so I took another approach in
this version: add just a pointer (struct hugetlb_usage *) to struct
mm_struct and allocate and link the structure dynamically.
This makes the changes larger, but avoids the build issues.

Thanks,
Naoya Horiguchi
---
Summary:

Naoya Horiguchi (2):
      mm: hugetlb: proc: add HugetlbPages field to /proc/PID/smaps
      mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status

 Documentation/filesystems/proc.txt | 10 +++++++--
 fs/hugetlbfs/inode.c               | 12 ++++++++++
 fs/proc/task_mmu.c                 | 30 +++++++++++++++++++++++++
 include/linux/hugetlb.h            | 36 +++++++++++++++++++++++++++++
 include/linux/mm_types.h           |  7 ++++++
 kernel/fork.c                      |  3 +++
 mm/hugetlb.c                       | 46 ++++++++++++++++++++++++++++++++++++++
 mm/mmap.c                          |  1 +
 mm/rmap.c                          |  4 +++-
 9 files changed, 146 insertions(+), 3 deletions(-)

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH v5 1/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/smaps
  2015-08-20  8:26                                         ` [PATCH v5 0/2] hugetlb: display per-process/per-vma usage Naoya Horiguchi
@ 2015-08-20  8:26                                           ` Naoya Horiguchi
  2015-08-20 10:49                                             ` Michal Hocko
                                                               ` (2 more replies)
  2015-08-20  8:26                                           ` [PATCH v5 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status Naoya Horiguchi
  1 sibling, 3 replies; 71+ messages in thread
From: Naoya Horiguchi @ 2015-08-20  8:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Jörn Engel, Mike Kravetz, linux-mm,
	linux-kernel, Naoya Horiguchi, Naoya Horiguchi

Currently /proc/PID/smaps provides no usage info for vma(VM_HUGETLB), which is
inconvenient when we want to know hugetlb usage on a per-task or per-vma basis.
To solve this, this patch adds a new line for hugetlb usage, as shown below:

  Size:              20480 kB
  Rss:                   0 kB
  Pss:                   0 kB
  Shared_Clean:          0 kB
  Shared_Dirty:          0 kB
  Private_Clean:         0 kB
  Private_Dirty:         0 kB
  Referenced:            0 kB
  Anonymous:             0 kB
  AnonHugePages:         0 kB
  HugetlbPages:      18432 kB
  Swap:                  0 kB
  KernelPageSize:     2048 kB
  MMUPageSize:        2048 kB
  Locked:                0 kB
  VmFlags: rd wr mr mw me de ht

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Joern Engel <joern@logfs.org>
Acked-by: David Rientjes <rientjes@google.com>
---
v3 -> v4:
- suspend Acked-by tag because v3->v4 change is not trivial
- I stated in the previous discussion that the HugetlbPages line could contain
  page size info, but that's not necessary because we already have the
  KernelPageSize info.
- merged the documentation update; the current documentation doesn't mention
  AnonHugePages, so that is added as well.
---
 Documentation/filesystems/proc.txt |  7 +++++--
 fs/proc/task_mmu.c                 | 29 +++++++++++++++++++++++++++++
 2 files changed, 34 insertions(+), 2 deletions(-)

diff --git v4.2-rc4/Documentation/filesystems/proc.txt v4.2-rc4_patched/Documentation/filesystems/proc.txt
index 6f7fafde0884..22e40211ef64 100644
--- v4.2-rc4/Documentation/filesystems/proc.txt
+++ v4.2-rc4_patched/Documentation/filesystems/proc.txt
@@ -423,6 +423,8 @@ Private_Clean:         0 kB
 Private_Dirty:         0 kB
 Referenced:          892 kB
 Anonymous:             0 kB
+AnonHugePages:         0 kB
+HugetlbPages:          0 kB
 Swap:                  0 kB
 KernelPageSize:        4 kB
 MMUPageSize:           4 kB
@@ -440,8 +442,9 @@ indicates the amount of memory currently marked as referenced or accessed.
 "Anonymous" shows the amount of memory that does not belong to any file.  Even
 a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE
 and a page is modified, the file page is replaced by a private anonymous copy.
-"Swap" shows how much would-be-anonymous memory is also used, but out on
-swap.
+"AnonHugePages" shows the ammount of memory backed by transparent hugepage.
+"HugetlbPages" shows the ammount of memory backed by hugetlbfs page.
+"Swap" shows how much would-be-anonymous memory is also used, but out on swap.
 
 "VmFlags" field deserves a separate description. This member represents the kernel
 flags associated with the particular virtual memory area in two letter encoded
diff --git v4.2-rc4/fs/proc/task_mmu.c v4.2-rc4_patched/fs/proc/task_mmu.c
index ca1e091881d4..2c37938b82ee 100644
--- v4.2-rc4/fs/proc/task_mmu.c
+++ v4.2-rc4_patched/fs/proc/task_mmu.c
@@ -445,6 +445,7 @@ struct mem_size_stats {
 	unsigned long anonymous;
 	unsigned long anonymous_thp;
 	unsigned long swap;
+	unsigned long hugetlb;
 	u64 pss;
 };
 
@@ -610,12 +611,38 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 	seq_putc(m, '\n');
 }
 
+#ifdef CONFIG_HUGETLB_PAGE
+static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
+				 unsigned long addr, unsigned long end,
+				 struct mm_walk *walk)
+{
+	struct mem_size_stats *mss = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	struct page *page = NULL;
+
+	if (pte_present(*pte)) {
+		page = vm_normal_page(vma, addr, *pte);
+	} else if (is_swap_pte(*pte)) {
+		swp_entry_t swpent = pte_to_swp_entry(*pte);
+
+		if (is_migration_entry(swpent))
+			page = migration_entry_to_page(swpent);
+	}
+	if (page)
+		mss->hugetlb += huge_page_size(hstate_vma(vma));
+	return 0;
+}
+#endif /* HUGETLB_PAGE */
+
 static int show_smap(struct seq_file *m, void *v, int is_pid)
 {
 	struct vm_area_struct *vma = v;
 	struct mem_size_stats mss;
 	struct mm_walk smaps_walk = {
 		.pmd_entry = smaps_pte_range,
+#ifdef CONFIG_HUGETLB_PAGE
+		.hugetlb_entry = smaps_hugetlb_range,
+#endif
 		.mm = vma->vm_mm,
 		.private = &mss,
 	};
@@ -637,6 +664,7 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
 		   "Referenced:     %8lu kB\n"
 		   "Anonymous:      %8lu kB\n"
 		   "AnonHugePages:  %8lu kB\n"
+		   "HugetlbPages:   %8lu kB\n"
 		   "Swap:           %8lu kB\n"
 		   "KernelPageSize: %8lu kB\n"
 		   "MMUPageSize:    %8lu kB\n"
@@ -651,6 +679,7 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
 		   mss.referenced >> 10,
 		   mss.anonymous >> 10,
 		   mss.anonymous_thp >> 10,
+		   mss.hugetlb >> 10,
 		   mss.swap >> 10,
 		   vma_kernel_pagesize(vma) >> 10,
 		   vma_mmu_pagesize(vma) >> 10,
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v5 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status
  2015-08-20  8:26                                         ` [PATCH v5 0/2] hugetlb: display per-process/per-vma usage Naoya Horiguchi
  2015-08-20  8:26                                           ` [PATCH v5 1/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/smaps Naoya Horiguchi
@ 2015-08-20  8:26                                           ` Naoya Horiguchi
  2015-08-20 11:00                                             ` Michal Hocko
  1 sibling, 1 reply; 71+ messages in thread
From: Naoya Horiguchi @ 2015-08-20  8:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Jörn Engel, Mike Kravetz, linux-mm,
	linux-kernel, Naoya Horiguchi, Naoya Horiguchi

Currently there's no easy way to get per-process usage of hugetlb pages, which
is inconvenient because userspace applications which use hugetlb typically want
to control their processes on the basis of how much memory (including hugetlb)
they use. So this patch simply provides easy access to the info via
/proc/PID/status.

With this patch, for example, /proc/PID/status shows a line like this:

  HugetlbPages:      20480 kB (10*2048kB)

If your system supports and enables multiple hugepage sizes, the line looks
like this:

  HugetlbPages:    1069056 kB (1*1048576kB 10*2048kB)

This way you can easily see how many hugepages of each page size a process is using.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Joern Engel <joern@logfs.org>
Acked-by: David Rientjes <rientjes@google.com>
---
v4 -> v5:
- add (struct hugetlb_usage *) to struct mm_struct
- use %lu instead of %d for seq_printf()
- introduce hugetlb_fork

v3 -> v4:
- rename field (VmHugetlbRSS is not the best name)
- introduce struct hugetlb_usage in struct mm_struct (no invasion to struct
  mm_rss_stat)
- introduce hugetlb_report_usage()
- merged documentation update

v2 -> v3:
- use inline functions instead of macros for !CONFIG_HUGETLB_PAGE
---
 Documentation/filesystems/proc.txt |  3 +++
 fs/hugetlbfs/inode.c               | 12 ++++++++++
 fs/proc/task_mmu.c                 |  1 +
 include/linux/hugetlb.h            | 36 +++++++++++++++++++++++++++++
 include/linux/mm_types.h           |  7 ++++++
 kernel/fork.c                      |  3 +++
 mm/hugetlb.c                       | 46 ++++++++++++++++++++++++++++++++++++++
 mm/mmap.c                          |  1 +
 mm/rmap.c                          |  4 +++-
 9 files changed, 112 insertions(+), 1 deletion(-)

diff --git v4.2-rc4/Documentation/filesystems/proc.txt v4.2-rc4_patched/Documentation/filesystems/proc.txt
index 22e40211ef64..f561fc46e41b 100644
--- v4.2-rc4/Documentation/filesystems/proc.txt
+++ v4.2-rc4_patched/Documentation/filesystems/proc.txt
@@ -174,6 +174,7 @@ For example, to get the status information of a process, all you have to do is
   VmLib:      1412 kB
   VmPTE:        20 kb
   VmSwap:        0 kB
+  HugetlbPages:          0 kB (0*2048kB)
   Threads:        1
   SigQ:   0/28578
   SigPnd: 0000000000000000
@@ -237,6 +238,8 @@ Table 1-2: Contents of the status files (as of 4.1)
  VmPTE                       size of page table entries
  VmPMD                       size of second level page tables
  VmSwap                      size of swap usage (the number of referred swapents)
+ HugetlbPages                size of hugetlb memory portions (with additional info
+                             about number of mapped hugepages for each page size)
  Threads                     number of threads
  SigQ                        number of signals queued/max. number for queue
  SigPnd                      bitmap of pending signals for the thread
diff --git v4.2-rc4/fs/hugetlbfs/inode.c v4.2-rc4_patched/fs/hugetlbfs/inode.c
index 0cf74df68617..bf6ea2645d35 100644
--- v4.2-rc4/fs/hugetlbfs/inode.c
+++ v4.2-rc4_patched/fs/hugetlbfs/inode.c
@@ -115,6 +115,13 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
 	if (vma->vm_pgoff & (~huge_page_mask(h) >> PAGE_SHIFT))
 		return -EINVAL;
 
+	if (!vma->vm_mm->hugetlb_usage) {
+		vma->vm_mm->hugetlb_usage = kzalloc(sizeof(struct hugetlb_usage),
+							GFP_KERNEL);
+		if (!vma->vm_mm->hugetlb_usage)
+			return -ENOMEM;
+	}
+
 	vma_len = (loff_t)(vma->vm_end - vma->vm_start);
 
 	mutex_lock(&inode->i_mutex);
@@ -138,6 +145,11 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
 	return ret;
 }
 
+void exit_hugetlb_mmap(struct mm_struct *mm)
+{
+	kfree(mm->hugetlb_usage);
+}
+
 /*
  * Called under down_write(mmap_sem).
  */
diff --git v4.2-rc4/fs/proc/task_mmu.c v4.2-rc4_patched/fs/proc/task_mmu.c
index 2c37938b82ee..b3cf7fa9ef6c 100644
--- v4.2-rc4/fs/proc/task_mmu.c
+++ v4.2-rc4_patched/fs/proc/task_mmu.c
@@ -69,6 +69,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
 		ptes >> 10,
 		pmds >> 10,
 		swap << (PAGE_SHIFT-10));
+	hugetlb_report_usage(m, mm);
 }
 
 unsigned long task_vsize(struct mm_struct *mm)
diff --git v4.2-rc4/include/linux/hugetlb.h v4.2-rc4_patched/include/linux/hugetlb.h
index d891f949466a..db642ad0b847 100644
--- v4.2-rc4/include/linux/hugetlb.h
+++ v4.2-rc4_patched/include/linux/hugetlb.h
@@ -469,6 +469,25 @@ static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
 #define hugepages_supported() (HPAGE_SHIFT != 0)
 #endif
 
+struct hugetlb_usage {
+	atomic_long_t count[HUGE_MAX_HSTATE];
+};
+
+void hugetlb_report_usage(struct seq_file *m, struct mm_struct *mm);
+void exit_hugetlb_mmap(struct mm_struct *mm);
+int hugetlb_fork(struct mm_struct *new, struct mm_struct *old);
+
+static inline void inc_hugetlb_count(struct mm_struct *mm, struct hstate *h)
+{
+	VM_BUG_ON_MM(!mm->hugetlb_usage, mm);
+	atomic_long_inc(&mm->hugetlb_usage->count[hstate_index(h)]);
+}
+
+static inline void dec_hugetlb_count(struct mm_struct *mm, struct hstate *h)
+{
+	VM_BUG_ON_MM(!mm->hugetlb_usage, mm);
+	atomic_long_dec(&mm->hugetlb_usage->count[hstate_index(h)]);
+}
 #else	/* CONFIG_HUGETLB_PAGE */
 struct hstate {};
 #define alloc_huge_page_node(h, nid) NULL
@@ -504,6 +523,23 @@ static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
 {
 	return &mm->page_table_lock;
 }
+
+static inline void hugetlb_report_usage(struct seq_file *f, struct mm_struct *m)
+{
+}
+
+static inline void exit_hugetlb_mmap(struct mm_struct *mm)
+{
+}
+
+static inline int hugetlb_fork(struct mm_struct *new, struct mm_struct *old)
+{
+	return 0;
+}
+
+static inline void dec_hugetlb_count(struct mm_struct *mm, struct hstate *h)
+{
+}
 #endif	/* CONFIG_HUGETLB_PAGE */
 
 static inline spinlock_t *huge_pte_lock(struct hstate *h,
diff --git v4.2-rc4/include/linux/mm_types.h v4.2-rc4_patched/include/linux/mm_types.h
index 0038ac7466fd..851e964ee8d6 100644
--- v4.2-rc4/include/linux/mm_types.h
+++ v4.2-rc4_patched/include/linux/mm_types.h
@@ -364,6 +364,10 @@ struct mm_rss_stat {
 	atomic_long_t count[NR_MM_COUNTERS];
 };
 
+#ifdef CONFIG_HUGETLB_PAGE
+struct hugetlb_usage;
+#endif
+
 struct kioctx_table;
 struct mm_struct {
 	struct vm_area_struct *mmap;		/* list of VMAs */
@@ -484,6 +488,9 @@ struct mm_struct {
 	/* address of the bounds directory */
 	void __user *bd_addr;
 #endif
+#ifdef CONFIG_HUGETLB_PAGE
+	struct hugetlb_usage *hugetlb_usage;
+#endif
 };
 
 static inline void mm_init_cpumask(struct mm_struct *mm)
diff --git v4.2-rc4/kernel/fork.c v4.2-rc4_patched/kernel/fork.c
index dbd9b8d7b7cc..d43baa91d48c 100644
--- v4.2-rc4/kernel/fork.c
+++ v4.2-rc4_patched/kernel/fork.c
@@ -425,6 +425,9 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 	retval = khugepaged_fork(mm, oldmm);
 	if (retval)
 		goto out;
+	retval = hugetlb_fork(mm, oldmm);
+	if (retval)
+		goto out;
 
 	prev = NULL;
 	for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next) {
diff --git v4.2-rc4/mm/hugetlb.c v4.2-rc4_patched/mm/hugetlb.c
index a8c3087089d8..3aa8c7919364 100644
--- v4.2-rc4/mm/hugetlb.c
+++ v4.2-rc4_patched/mm/hugetlb.c
@@ -2562,6 +2562,49 @@ void hugetlb_show_meminfo(void)
 				1UL << (huge_page_order(h) + PAGE_SHIFT - 10));
 }
 
+static unsigned long mm_hstate_usage(struct mm_struct *mm, int hs_idx)
+{
+	if (!mm->hugetlb_usage)
+		return 0;
+	return atomic_long_read(&mm->hugetlb_usage->count[hs_idx]);
+}
+
+void hugetlb_report_usage(struct seq_file *m, struct mm_struct *mm)
+{
+	int i;
+	unsigned long total_usage = 0;
+
+	for (i = 0; i < HUGE_MAX_HSTATE; i++) {
+		total_usage += mm_hstate_usage(mm, i) *
+			(huge_page_size(&hstates[i]) >> 10);
+	}
+
+	seq_printf(m, "HugetlbPages:\t%8lu kB (", total_usage);
+	for (i = 0; i < HUGE_MAX_HSTATE; i++) {
+		if (huge_page_order(&hstates[i]) == 0)
+			break;
+		if (i > 0)
+			seq_puts(m, " ");
+
+		seq_printf(m, "%ld*%lukB", mm_hstate_usage(mm, i),
+			huge_page_size(&hstates[i]) >> 10);
+	}
+	seq_puts(m, ")\n");
+}
+
+int hugetlb_fork(struct mm_struct *new, struct mm_struct *old)
+{
+	if (old->hugetlb_usage) {
+		new->hugetlb_usage = kmalloc(sizeof(struct hugetlb_usage),
+							GFP_KERNEL);
+		if (!new->hugetlb_usage)
+			return -ENOMEM;
+		memcpy(new->hugetlb_usage, old->hugetlb_usage,
+			sizeof(struct hugetlb_usage));
+	}
+	return 0;
+}
+
 /* Return the number pages of memory we physically have, in PAGE_SIZE units. */
 unsigned long hugetlb_total_pages(void)
 {
@@ -2797,6 +2840,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			get_page(ptepage);
 			page_dup_rmap(ptepage);
 			set_huge_pte_at(dst, addr, dst_pte, entry);
+			inc_hugetlb_count(dst, h);
 		}
 		spin_unlock(src_ptl);
 		spin_unlock(dst_ptl);
@@ -2877,6 +2921,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		if (huge_pte_dirty(pte))
 			set_page_dirty(page);
 
+		dec_hugetlb_count(mm, h);
 		page_remove_rmap(page);
 		force_flush = !__tlb_remove_page(tlb, page);
 		if (force_flush) {
@@ -3261,6 +3306,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 				&& (vma->vm_flags & VM_SHARED)));
 	set_huge_pte_at(mm, address, ptep, new_pte);
 
+	inc_hugetlb_count(mm, h);
 	if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
 		/* Optimization, do the COW without a second fault */
 		ret = hugetlb_cow(mm, vma, address, ptep, new_pte, page, ptl);
diff --git v4.2-rc4/mm/mmap.c v4.2-rc4_patched/mm/mmap.c
index aa632ade2be7..9d9562bc79a8 100644
--- v4.2-rc4/mm/mmap.c
+++ v4.2-rc4_patched/mm/mmap.c
@@ -2847,6 +2847,7 @@ void exit_mmap(struct mm_struct *mm)
 			nr_accounted += vma_pages(vma);
 		vma = remove_vma(vma);
 	}
+	exit_hugetlb_mmap(mm);
 	vm_unacct_memory(nr_accounted);
 }
 
diff --git v4.2-rc4/mm/rmap.c v4.2-rc4_patched/mm/rmap.c
index 171b68768df1..b33278bc4ddb 100644
--- v4.2-rc4/mm/rmap.c
+++ v4.2-rc4_patched/mm/rmap.c
@@ -1230,7 +1230,9 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	update_hiwater_rss(mm);
 
 	if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
-		if (!PageHuge(page)) {
+		if (PageHuge(page)) {
+			dec_hugetlb_count(mm, page_hstate(page));
+		} else {
 			if (PageAnon(page))
 				dec_mm_counter(mm, MM_ANONPAGES);
 			else
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [PATCH v5 1/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/smaps
  2015-08-20  8:26                                           ` [PATCH v5 1/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/smaps Naoya Horiguchi
@ 2015-08-20 10:49                                             ` Michal Hocko
  2015-08-20 23:20                                               ` Naoya Horiguchi
  2015-09-07  1:29                                             ` Pádraig Brady
  2015-09-09 15:12                                             ` Vlastimil Babka
  2 siblings, 1 reply; 71+ messages in thread
From: Michal Hocko @ 2015-08-20 10:49 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, David Rientjes, Jörn Engel, Mike Kravetz,
	linux-mm, linux-kernel, Naoya Horiguchi

On Thu 20-08-15 08:26:26, Naoya Horiguchi wrote:
> Currently /proc/PID/smaps provides no usage info for vma(VM_HUGETLB), which
> is inconvenient when we want to know per-task or per-vma base hugetlb usage.
> To solve this, this patch adds a new line for hugetlb usage like below:
> 
>   Size:              20480 kB
>   Rss:                   0 kB
>   Pss:                   0 kB
>   Shared_Clean:          0 kB
>   Shared_Dirty:          0 kB
>   Private_Clean:         0 kB
>   Private_Dirty:         0 kB
>   Referenced:            0 kB
>   Anonymous:             0 kB
>   AnonHugePages:         0 kB
>   HugetlbPages:      18432 kB
>   Swap:                  0 kB
>   KernelPageSize:     2048 kB
>   MMUPageSize:        2048 kB
>   Locked:                0 kB
>   VmFlags: rd wr mr mw me de ht

I have only now got to this thread. This is indeed very helpful. I would
just suggest updating Documentation/filesystems/proc.txt to be explicit
that Rss: doesn't count hugetlb pages for historical reasons.
 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Acked-by: Joern Engel <joern@logfs.org>
> Acked-by: David Rientjes <rientjes@google.com>

Acked-by: Michal Hocko <mhocko@suse.cz>

> ---
> v3 -> v4:
> - suspend Acked-by tag because v3->v4 change is not trivial
> - I stated in previous discussion that HugetlbPages line can contain page
>   size info, but that's not necessary because we already have KernelPageSize
>   info.
> - merged documentation update, where the current documentation doesn't mention
>   AnonHugePages, so it's also added.
> ---
>  Documentation/filesystems/proc.txt |  7 +++++--
>  fs/proc/task_mmu.c                 | 29 +++++++++++++++++++++++++++++
>  2 files changed, 34 insertions(+), 2 deletions(-)
> 
> diff --git v4.2-rc4/Documentation/filesystems/proc.txt v4.2-rc4_patched/Documentation/filesystems/proc.txt
> index 6f7fafde0884..22e40211ef64 100644
> --- v4.2-rc4/Documentation/filesystems/proc.txt
> +++ v4.2-rc4_patched/Documentation/filesystems/proc.txt
> @@ -423,6 +423,8 @@ Private_Clean:         0 kB
>  Private_Dirty:         0 kB
>  Referenced:          892 kB
>  Anonymous:             0 kB
> +AnonHugePages:         0 kB
> +HugetlbPages:          0 kB
>  Swap:                  0 kB
>  KernelPageSize:        4 kB
>  MMUPageSize:           4 kB
> @@ -440,8 +442,9 @@ indicates the amount of memory currently marked as referenced or accessed.
>  "Anonymous" shows the amount of memory that does not belong to any file.  Even
>  a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE
>  and a page is modified, the file page is replaced by a private anonymous copy.
> -"Swap" shows how much would-be-anonymous memory is also used, but out on
> -swap.
> +"AnonHugePages" shows the ammount of memory backed by transparent hugepage.
> +"HugetlbPages" shows the ammount of memory backed by hugetlbfs page.
> +"Swap" shows how much would-be-anonymous memory is also used, but out on swap.
>  
>  "VmFlags" field deserves a separate description. This member represents the kernel
>  flags associated with the particular virtual memory area in two letter encoded
> diff --git v4.2-rc4/fs/proc/task_mmu.c v4.2-rc4_patched/fs/proc/task_mmu.c
> index ca1e091881d4..2c37938b82ee 100644
> --- v4.2-rc4/fs/proc/task_mmu.c
> +++ v4.2-rc4_patched/fs/proc/task_mmu.c
> @@ -445,6 +445,7 @@ struct mem_size_stats {
>  	unsigned long anonymous;
>  	unsigned long anonymous_thp;
>  	unsigned long swap;
> +	unsigned long hugetlb;
>  	u64 pss;
>  };
>  
> @@ -610,12 +611,38 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
>  	seq_putc(m, '\n');
>  }
>  
> +#ifdef CONFIG_HUGETLB_PAGE
> +static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
> +				 unsigned long addr, unsigned long end,
> +				 struct mm_walk *walk)
> +{
> +	struct mem_size_stats *mss = walk->private;
> +	struct vm_area_struct *vma = walk->vma;
> +	struct page *page = NULL;
> +
> +	if (pte_present(*pte)) {
> +		page = vm_normal_page(vma, addr, *pte);
> +	} else if (is_swap_pte(*pte)) {
> +		swp_entry_t swpent = pte_to_swp_entry(*pte);
> +
> +		if (is_migration_entry(swpent))
> +			page = migration_entry_to_page(swpent);
> +	}
> +	if (page)
> +		mss->hugetlb += huge_page_size(hstate_vma(vma));
> +	return 0;
> +}
> +#endif /* HUGETLB_PAGE */
> +
>  static int show_smap(struct seq_file *m, void *v, int is_pid)
>  {
>  	struct vm_area_struct *vma = v;
>  	struct mem_size_stats mss;
>  	struct mm_walk smaps_walk = {
>  		.pmd_entry = smaps_pte_range,
> +#ifdef CONFIG_HUGETLB_PAGE
> +		.hugetlb_entry = smaps_hugetlb_range,
> +#endif
>  		.mm = vma->vm_mm,
>  		.private = &mss,
>  	};
> @@ -637,6 +664,7 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
>  		   "Referenced:     %8lu kB\n"
>  		   "Anonymous:      %8lu kB\n"
>  		   "AnonHugePages:  %8lu kB\n"
> +		   "HugetlbPages:   %8lu kB\n"
>  		   "Swap:           %8lu kB\n"
>  		   "KernelPageSize: %8lu kB\n"
>  		   "MMUPageSize:    %8lu kB\n"
> @@ -651,6 +679,7 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
>  		   mss.referenced >> 10,
>  		   mss.anonymous >> 10,
>  		   mss.anonymous_thp >> 10,
> +		   mss.hugetlb >> 10,
>  		   mss.swap >> 10,
>  		   vma_kernel_pagesize(vma) >> 10,
>  		   vma_mmu_pagesize(vma) >> 10,
> -- 
> 2.4.3

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v5 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status
  2015-08-20  8:26                                           ` [PATCH v5 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status Naoya Horiguchi
@ 2015-08-20 11:00                                             ` Michal Hocko
  2015-08-20 19:49                                               ` David Rientjes
  2015-08-20 23:34                                               ` Naoya Horiguchi
  0 siblings, 2 replies; 71+ messages in thread
From: Michal Hocko @ 2015-08-20 11:00 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, David Rientjes, Jörn Engel, Mike Kravetz,
	linux-mm, linux-kernel, Naoya Horiguchi

On Thu 20-08-15 08:26:27, Naoya Horiguchi wrote:
> Currently there's no easy way to get per-process usage of hugetlb pages,

Is this really the case after your previous patch? You have both 
HugetlbPages and KernelPageSize which should be sufficient no?

Reading a single file is, of course, easier but is it really worth the
additional code? I haven't really looked at the patch so I might be
missing something but what would be an advantage over reading
/proc/<pid>/smaps and extracting the information from there?
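
For reference, with patch 1/2 applied the smaps route would amount to summing
the per-vma lines, roughly as in the untested sketch below (the helper name and
the minimal error handling are illustrative only):

  #include <stdio.h>

  /* Sum the per-vma HugetlbPages: lines from /proc/<pid>/smaps, in kB. */
  static long smaps_hugetlb_kb(const char *pid)
  {
          char path[64], line[256];
          long total = 0, kb;
          FILE *f;

          snprintf(path, sizeof(path), "/proc/%s/smaps", pid);
          f = fopen(path, "r");
          if (!f)
                  return -1;
          while (fgets(line, sizeof(line), f))
                  if (sscanf(line, "HugetlbPages: %ld kB", &kb) == 1)
                          total += kb;
          fclose(f);
          return total;
  }

  int main(int argc, char **argv)
  {
          printf("%ld kB\n", smaps_hugetlb_kb(argc > 1 ? argv[1] : "self"));
          return 0;
  }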

[...]
>  Documentation/filesystems/proc.txt |  3 +++
>  fs/hugetlbfs/inode.c               | 12 ++++++++++
>  fs/proc/task_mmu.c                 |  1 +
>  include/linux/hugetlb.h            | 36 +++++++++++++++++++++++++++++
>  include/linux/mm_types.h           |  7 ++++++
>  kernel/fork.c                      |  3 +++
>  mm/hugetlb.c                       | 46 ++++++++++++++++++++++++++++++++++++++
>  mm/mmap.c                          |  1 +
>  mm/rmap.c                          |  4 +++-
>  9 files changed, 112 insertions(+), 1 deletion(-)
[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v5 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status
  2015-08-20 11:00                                             ` Michal Hocko
@ 2015-08-20 19:49                                               ` David Rientjes
  2015-08-21  6:32                                                 ` Michal Hocko
  2015-08-20 23:34                                               ` Naoya Horiguchi
  1 sibling, 1 reply; 71+ messages in thread
From: David Rientjes @ 2015-08-20 19:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Naoya Horiguchi, Andrew Morton, Jörn Engel, Mike Kravetz,
	linux-mm, linux-kernel, Naoya Horiguchi

On Thu, 20 Aug 2015, Michal Hocko wrote:

> On Thu 20-08-15 08:26:27, Naoya Horiguchi wrote:
> > Currently there's no easy way to get per-process usage of hugetlb pages,
> 
> Is this really the case after your previous patch? You have both 
> HugetlbPages and KernelPageSize which should be sufficient no?
> 
> Reading a single file is, of course, easier but is it really worth the
> additional code? I haven't really looked at the patch so I might be
> missing something but what would be an advantage over reading
> /proc/<pid>/smaps and extracting the information from there?
> 

/proc/pid/smaps requires root, /proc/pid/status doesn't.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v5 1/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/smaps
  2015-08-20 10:49                                             ` Michal Hocko
@ 2015-08-20 23:20                                               ` Naoya Horiguchi
  2015-08-21  6:33                                                 ` Michal Hocko
  0 siblings, 1 reply; 71+ messages in thread
From: Naoya Horiguchi @ 2015-08-20 23:20 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, David Rientjes, Jörn Engel, Mike Kravetz,
	linux-mm, linux-kernel, Naoya Horiguchi

On Thu, Aug 20, 2015 at 12:49:29PM +0200, Michal Hocko wrote:
> On Thu 20-08-15 08:26:26, Naoya Horiguchi wrote:
> > Currently /proc/PID/smaps provides no usage info for vma(VM_HUGETLB), which
> > is inconvenient when we want to know per-task or per-vma base hugetlb usage.
> > To solve this, this patch adds a new line for hugetlb usage like below:
> > 
> >   Size:              20480 kB
> >   Rss:                   0 kB
> >   Pss:                   0 kB
> >   Shared_Clean:          0 kB
> >   Shared_Dirty:          0 kB
> >   Private_Clean:         0 kB
> >   Private_Dirty:         0 kB
> >   Referenced:            0 kB
> >   Anonymous:             0 kB
> >   AnonHugePages:         0 kB
> >   HugetlbPages:      18432 kB
> >   Swap:                  0 kB
> >   KernelPageSize:     2048 kB
> >   MMUPageSize:        2048 kB
> >   Locked:                0 kB
> >   VmFlags: rd wr mr mw me de ht
> 
> I have only now got to this thread. This is indeed very helpful. I would
> just suggest to update Documentation/filesystems/proc.txt to be explicit
> that Rss: doesn't count hugetlb pages for historical reasons.

I agree, and I want the following diff to be folded into this patch.

>  
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > Acked-by: Joern Engel <joern@logfs.org>
> > Acked-by: David Rientjes <rientjes@google.com>
> 
> Acked-by: Michal Hocko <mhocko@suse.cz>

Thank you.
Naoya Horiguchi
---
From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Date: Fri, 21 Aug 2015 08:13:31 +0900
Subject: [PATCH] Documentation/filesystems/proc.txt: give additional comment
 about hugetlb usage

---
 Documentation/filesystems/proc.txt | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index f561fc46e41b..b775b6faaeda 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -446,7 +446,8 @@ indicates the amount of memory currently marked as referenced or accessed.
 a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE
 and a page is modified, the file page is replaced by a private anonymous copy.
 "AnonHugePages" shows the ammount of memory backed by transparent hugepage.
-"HugetlbPages" shows the ammount of memory backed by hugetlbfs page.
+"HugetlbPages" shows the ammount of memory backed by hugetlbfs page (which is
+not counted in "Rss" or "Pss" field for historical reasons.)
 "Swap" shows how much would-be-anonymous memory is also used, but out on swap.
 
 "VmFlags" field deserves a separate description. This member represents the kernel
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [PATCH v5 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status
  2015-08-20 11:00                                             ` Michal Hocko
  2015-08-20 19:49                                               ` David Rientjes
@ 2015-08-20 23:34                                               ` Naoya Horiguchi
  2015-08-21  6:53                                                 ` Michal Hocko
  1 sibling, 1 reply; 71+ messages in thread
From: Naoya Horiguchi @ 2015-08-20 23:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, David Rientjes, Jörn Engel, Mike Kravetz,
	linux-mm, linux-kernel, Naoya Horiguchi

On Thu, Aug 20, 2015 at 01:00:05PM +0200, Michal Hocko wrote:
> On Thu 20-08-15 08:26:27, Naoya Horiguchi wrote:
> > Currently there's no easy way to get per-process usage of hugetlb pages,
> 
> Is this really the case after your previous patch? You have both 
> HugetlbPages and KernelPageSize which should be sufficient no?

We can calculate it from that information, so saying "no easy way" was incorrect :(

> Reading a single file is, of course, easier but is it really worth the
> additional code? I haven't really looked at the patch so I might be
> missing something but what would be an advantage over reading
> /proc/<pid>/smaps and extracting the information from there?

My first idea was just "users should feel it useful", but the permission issue
David commented on sounds like a good technical reason to me.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v5 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status
  2015-08-20 19:49                                               ` David Rientjes
@ 2015-08-21  6:32                                                 ` Michal Hocko
  2015-08-21 16:38                                                   ` Jörn Engel
  0 siblings, 1 reply; 71+ messages in thread
From: Michal Hocko @ 2015-08-21  6:32 UTC (permalink / raw)
  To: David Rientjes
  Cc: Naoya Horiguchi, Andrew Morton, Jörn Engel, Mike Kravetz,
	linux-mm, linux-kernel, Naoya Horiguchi

On Thu 20-08-15 12:49:59, David Rientjes wrote:
> On Thu, 20 Aug 2015, Michal Hocko wrote:
> 
> > On Thu 20-08-15 08:26:27, Naoya Horiguchi wrote:
> > > Currently there's no easy way to get per-process usage of hugetlb pages,
> > 
> > Is this really the case after your previous patch? You have both 
> > HugetlbPages and KernelPageSize which should be sufficient no?
> > 
> > Reading a single file is, of course, easier but is it really worth the
> > additional code? I haven't really looked at the patch so I might be
> > missing something but what would be an advantage over reading
> > /proc/<pid>/smaps and extracting the information from there?
> > 
> 
> /proc/pid/smaps requires root, /proc/pid/status doesn't.

Both mmotm and linus tree have
        REG("smaps",      S_IRUGO, proc_pid_smaps_operations),

and opening the file requires PTRACE_MODE_READ. So I do not see any
requirement for root here. Or did you mean that you need root to examine
all processes? That would be true, but I am wondering why a regular user
would be interested in these break-out numbers. Hugetlb management sounds
pretty much like an administrative or very specialized thing.

From my understanding of the discussion there is no use case to have this
information world-readable. Is this correct?
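
For reference, the difference is easy to probe from userspace; a minimal
sketch (not part of this series) that simply tries to open and read both
files for a given pid, where the outcome for an unrelated process depends
on the ptrace access check and on any distribution hardening of smaps:

#include <errno.h>
#include <stdio.h>
#include <string.h>

static void probe(const char *pid, const char *name)
{
	char path[64], buf[128];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%s/%s", pid, name);
	f = fopen(path, "r");
	if (!f) {
		printf("%-8s open failed: %s\n", name, strerror(errno));
		return;
	}
	if (fgets(buf, sizeof(buf), f))
		printf("%-8s readable\n", name);
	else
		printf("%-8s opened, but read failed or empty\n", name);
	fclose(f);
}

int main(int argc, char **argv)
{
	if (argc != 2)
		return 1;
	probe(argv[1], "status");
	probe(argv[1], "smaps");
	return 0;
}
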
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v5 1/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/smaps
  2015-08-20 23:20                                               ` Naoya Horiguchi
@ 2015-08-21  6:33                                                 ` Michal Hocko
  0 siblings, 0 replies; 71+ messages in thread
From: Michal Hocko @ 2015-08-21  6:33 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, David Rientjes, Jörn Engel, Mike Kravetz,
	linux-mm, linux-kernel, Naoya Horiguchi

On Thu 20-08-15 23:20:12, Naoya Horiguchi wrote:
[...]
> From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Date: Fri, 21 Aug 2015 08:13:31 +0900
> Subject: [PATCH] Documentation/filesystems/proc.txt: give additional comment
>  about hugetlb usage
> 
> ---
>  Documentation/filesystems/proc.txt | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> index f561fc46e41b..b775b6faaeda 100644
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -446,7 +446,8 @@ indicates the amount of memory currently marked as referenced or accessed.
>  a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE
>  and a page is modified, the file page is replaced by a private anonymous copy.
>  "AnonHugePages" shows the ammount of memory backed by transparent hugepage.
> -"HugetlbPages" shows the ammount of memory backed by hugetlbfs page.
> +"HugetlbPages" shows the ammount of memory backed by hugetlbfs page (which is
> +not counted in "Rss" or "Pss" field for historical reasons.)
>  "Swap" shows how much would-be-anonymous memory is also used, but out on swap.
>  
>  "VmFlags" field deserves a separate description. This member represents the kernel

Thank you!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v5 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status
  2015-08-20 23:34                                               ` Naoya Horiguchi
@ 2015-08-21  6:53                                                 ` Michal Hocko
  2015-08-21 16:30                                                   ` Jörn Engel
  0 siblings, 1 reply; 71+ messages in thread
From: Michal Hocko @ 2015-08-21  6:53 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, David Rientjes, Jörn Engel, Mike Kravetz,
	linux-mm, linux-kernel, Naoya Horiguchi

On Thu 20-08-15 23:34:51, Naoya Horiguchi wrote:
[...]
> > Reading a single file is, of course, easier but is it really worth the
> > additional code? I haven't really looked at the patch so I might be
> > missing something but what would be an advantage over reading
> > /proc/<pid>/smaps and extracting the information from there?
> 
> My first idea was just "users should feel it useful", but permission as David
> commented sounds a good technical reason to me.

9 files changed, 112 insertions(+), 1 deletion(-)

is quite a lot, especially when it touches hot paths like fork, so it
had better have a good use case. I have already asked in the other
email, but is anybody actually requesting this? Nice-to-have is not
a good justification IMO.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v5 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status
  2015-08-21  6:53                                                 ` Michal Hocko
@ 2015-08-21 16:30                                                   ` Jörn Engel
  2015-08-24  8:51                                                     ` Michal Hocko
  0 siblings, 1 reply; 71+ messages in thread
From: Jörn Engel @ 2015-08-21 16:30 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Naoya Horiguchi, Andrew Morton, David Rientjes, Mike Kravetz,
	linux-mm, linux-kernel, Naoya Horiguchi

On Fri, Aug 21, 2015 at 08:53:21AM +0200, Michal Hocko wrote:
> On Thu 20-08-15 23:34:51, Naoya Horiguchi wrote:
> [...]
> > > Reading a single file is, of course, easier but is it really worth the
> > > additional code? I haven't really looked at the patch so I might be
> > > missing something but what would be an advantage over reading
> > > /proc/<pid>/smaps and extracting the information from there?
> > 
> > My first idea was just "users should feel it useful", but permission as David
> > commented sounds a good technical reason to me.
> 
> 9 files changed, 112 insertions(+), 1 deletion(-)
> 
> is quite a lot especially when it touches hot paths like fork so it
> better should have a good usecase. I have already asked in the other
> email but is actually anybody requesting this? Nice to have is not
> a good justification IMO.

I need some way to judge the real rss of a process, including huge
pages.  No strong opinion on implementation details, but something is
clearly needed.

If you have processes with 99% huge pages, you are currently reduced to
guesswork.

Jörn

--
Journalism is printing what someone else does not want printed;
everything else is public relations.
-- George Orwell

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v5 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status
  2015-08-21  6:32                                                 ` Michal Hocko
@ 2015-08-21 16:38                                                   ` Jörn Engel
  0 siblings, 0 replies; 71+ messages in thread
From: Jörn Engel @ 2015-08-21 16:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Rientjes, Naoya Horiguchi, Andrew Morton, Mike Kravetz,
	linux-mm, linux-kernel, Naoya Horiguchi

On Fri, Aug 21, 2015 at 08:32:33AM +0200, Michal Hocko wrote:
> 
> Both mmotm and linus tree have
>         REG("smaps",      S_IRUGO, proc_pid_smaps_operations),
> 
> and opening the file requires PTRACE_MODE_READ. So I do not see any
> requirement for root here. Or did you mean that you need root to examine
> all processes? That would be true but I am wondering why would be a regular
> user interested in this break out numbers. Hugetlb management sounds
> pretty much like an administrative or very specialized thing.
> 
> From my understanding of the discussion there is no usecase to have this
> information world readable. Is this correct?

Well, tools like top currently display rss.  Once we have some
interface, I would like a version of top that displays the true rss
including hugepages (hrss maybe?).

If we make such a tool impossible today, someone will complain about it
in the future and we will have created a new mess for ourselves.  I think it
is trouble enough to deal with the old one.
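
To make the intent concrete, a minimal sketch of such a reader, assuming
the HugetlbPages field proposed in patch 2 ends up in /proc/<pid>/status
(it is not an existing field), could be:

#include <stdio.h>

int main(int argc, char **argv)
{
	char path[64], line[256];
	unsigned long rss = 0, huge = 0, kb;
	FILE *f;

	if (argc != 2)
		return 1;
	snprintf(path, sizeof(path), "/proc/%s/status", argv[1]);
	f = fopen(path, "r");
	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "VmRSS: %lu kB", &kb) == 1)
			rss = kb;
		else if (sscanf(line, "HugetlbPages: %lu kB", &kb) == 1)
			huge = kb;	/* field proposed by patch 2 */
	}
	fclose(f);
	printf("hrss: %lu kB (VmRSS %lu + HugetlbPages %lu)\n",
	       rss + huge, rss, huge);
	return 0;
}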

Jörn

--
Denying any reality for any laudable political goal is a bad strategy.
When the facts come out, the discovery of the facts will undermine the
laudable political goals.
-- Jared Diamond

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v5 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status
  2015-08-21 16:30                                                   ` Jörn Engel
@ 2015-08-24  8:51                                                     ` Michal Hocko
  2015-08-25 23:23                                                       ` David Rientjes
  0 siblings, 1 reply; 71+ messages in thread
From: Michal Hocko @ 2015-08-24  8:51 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Naoya Horiguchi, Andrew Morton, David Rientjes, Mike Kravetz,
	linux-mm, linux-kernel, Naoya Horiguchi

On Fri 21-08-15 09:30:33, Jörn Engel wrote:
> On Fri, Aug 21, 2015 at 08:53:21AM +0200, Michal Hocko wrote:
> > On Thu 20-08-15 23:34:51, Naoya Horiguchi wrote:
> > [...]
> > > > Reading a single file is, of course, easier but is it really worth the
> > > > additional code? I haven't really looked at the patch so I might be
> > > > missing something but what would be an advantage over reading
> > > > /proc/<pid>/smaps and extracting the information from there?
> > > 
> > > My first idea was just "users should feel it useful", but permission as David
> > > commented sounds a good technical reason to me.
> > 
> > 9 files changed, 112 insertions(+), 1 deletion(-)
> > 
> > is quite a lot especially when it touches hot paths like fork so it
> > better should have a good usecase. I have already asked in the other
> > email but is actually anybody requesting this? Nice to have is not
> > a good justification IMO.
> 
> I need some way to judge the real rss of a process, including huge
> pages.  No strong opinion on implementation details, but something is
> clearly needed.

The current implementation worries me. Is the per-hstate breakdown
really needed? The implementation would be much easier without it.

> If you have processes with 99% huge pages, you are currently reduced to
> guesswork.

If you have 99% of hugetlb pages then your load is rather specific and I
would argue that /proc/<pid>/smaps (after patch 1) is a much better way to
get what you want.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v5 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status
  2015-08-24  8:51                                                     ` Michal Hocko
@ 2015-08-25 23:23                                                       ` David Rientjes
  2015-08-26  6:38                                                         ` Michal Hocko
  0 siblings, 1 reply; 71+ messages in thread
From: David Rientjes @ 2015-08-25 23:23 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jörn Engel, Naoya Horiguchi, Andrew Morton, Mike Kravetz,
	linux-mm, linux-kernel, Naoya Horiguchi

On Mon, 24 Aug 2015, Michal Hocko wrote:

> The current implementation makes me worry. Is the per hstate break down
> really needed? The implementation would be much more easier without it.
> 

Yes, it's needed.  It provides a complete picture of which statically
reserved hugepages are in use, and we're not going to have to change the
implementation later, when it becomes necessary to differentiate between the
various hugetlb page sizes, at the risk of breaking existing userspace parsers.

> If you have 99% of hugetlb pages then your load is rather specific and I
> would argue that /proc/<pid>/smaps (after patch 1) is a much better way to
> get what you want.
> 

Some distributions change the permissions of smaps, as already stated, for 
pretty clear security reasons since it can be used to defeat existing 
protection.  There's no reason why hugetlb page usage should not be 
exported in the same manner and location as memory usage.

Sheesh.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v5 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status
  2015-08-25 23:23                                                       ` David Rientjes
@ 2015-08-26  6:38                                                         ` Michal Hocko
  2015-08-26 22:02                                                           ` David Rientjes
  0 siblings, 1 reply; 71+ messages in thread
From: Michal Hocko @ 2015-08-26  6:38 UTC (permalink / raw)
  To: David Rientjes
  Cc: Jörn Engel, Naoya Horiguchi, Andrew Morton, Mike Kravetz,
	linux-mm, linux-kernel, Naoya Horiguchi

On Tue 25-08-15 16:23:34, David Rientjes wrote:
> On Mon, 24 Aug 2015, Michal Hocko wrote:
> 
> > The current implementation makes me worry. Is the per hstate break down
> > really needed? The implementation would be much more easier without it.
> > 
> 
> Yes, it's needed.  It provides a complete picture of what statically 
> reserved hugepages are in use and we're not going to change the 
> implementation when it is needed to differentiate between variable hugetlb 
> page sizes that risk breaking existing userspace parsers.

I thought the purpose was to give the amount of hugetlb-based
resident memory. At least this is what Jörn was asking for, AFAIU.
/proc/<pid>/status should be as lightweight as possible, and the current
implementation is quite heavy, as already pointed out. So I am really
curious whether this is _really_ needed. I haven't heard about a real
use case except for top displaying HRss, which doesn't need the
breakdown values. You have brought that up already
http://marc.info/?l=linux-mm&m=143941143109335&w=2 and nobody actually
asked for it. "I do not mind having it" is not an argument for inclusion,
especially when the implementation is more costly and touches hot paths.

> > If you have 99% of hugetlb pages then your load is rather specific and I
> > would argue that /proc/<pid>/smaps (after patch 1) is a much better way to
> > get what you want.
> 
> Some distributions change the permissions of smaps, as already stated, for 
> pretty clear security reasons since it can be used to defeat existing 
> protection.  There's no reason why hugetlb page usage should not be 
> exported in the same manner and location as memory usage.

/proc/<pid>/status provides only per-memory-type breakdown information
(locked, data, stack, etc.). Different hugetlb sizes are still
hugetlb memory. So I am not sure I understand your argument here.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v5 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status
  2015-08-26  6:38                                                         ` Michal Hocko
@ 2015-08-26 22:02                                                           ` David Rientjes
  2015-08-27  6:48                                                             ` Michal Hocko
  0 siblings, 1 reply; 71+ messages in thread
From: David Rientjes @ 2015-08-26 22:02 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jörn Engel, Naoya Horiguchi, Andrew Morton, Mike Kravetz,
	linux-mm, linux-kernel, Naoya Horiguchi

On Wed, 26 Aug 2015, Michal Hocko wrote:

> I thought the purpose was to give the amount of hugetlb based
> resident memory.

Persistent hugetlb memory is always resident; the goal is to show what is 
currently mapped.

> At least this is what Jörn was asking for AFAIU.
> /proc/<pid>/status should be as lightweight as possible. The current
> implementation is quite heavy as already pointed out. So I am really
> curious whether this is _really_ needed. I haven't heard about a real
> usecase except for top displaying HRss which doesn't need the break
> down values. You have brought that up already
> http://marc.info/?l=linux-mm&m=143941143109335&w=2 and nobody actually
> asked for it. "I do not mind having it" is not an argument for inclusion
> especially when the implementation is more costly and touches hot paths.
> 

It iterates over HUGE_MAX_HSTATE and reads atomic usage counters twice.  
On x86, HUGE_MAX_HSTATE == 2.  I don't consider that to be expensive.

If you are concerned about the memory allocation of struct hugetlb_usage, 
it could easily be embedded directly in struct mm_struct.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v5 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status
  2015-08-26 22:02                                                           ` David Rientjes
@ 2015-08-27  6:48                                                             ` Michal Hocko
  2015-08-27 17:23                                                               ` Jörn Engel
  0 siblings, 1 reply; 71+ messages in thread
From: Michal Hocko @ 2015-08-27  6:48 UTC (permalink / raw)
  To: David Rientjes
  Cc: Jörn Engel, Naoya Horiguchi, Andrew Morton, Mike Kravetz,
	linux-mm, linux-kernel, Naoya Horiguchi

On Wed 26-08-15 15:02:49, David Rientjes wrote:
> On Wed, 26 Aug 2015, Michal Hocko wrote:
> 
> > I thought the purpose was to give the amount of hugetlb based
> > resident memory.
> 
> Persistent hugetlb memory is always resident, the goal is to show what is 
> currently mapped.
> 
> > At least this is what Jörn was asking for AFAIU.
> > /proc/<pid>/status should be as lightweight as possible. The current
> > implementation is quite heavy as already pointed out. So I am really
> > curious whether this is _really_ needed. I haven't heard about a real
> > usecase except for top displaying HRss which doesn't need the break
> > down values. You have brought that up already
> > http://marc.info/?l=linux-mm&m=143941143109335&w=2 and nobody actually
> > asked for it. "I do not mind having it" is not an argument for inclusion
> > especially when the implementation is more costly and touches hot paths.
> > 
> 
> It iterates over HUGE_MAX_HSTATE and reads atomic usage counters twice.  

I am not worried about the /proc/<pid>/status read path. That one is indeed
trivial.

> On x86, HUGE_MAX_HSTATE == 2.  I don't consider that to be expensive.
> 
> If you are concerned about the memory allocation of struct hugetlb_usage, 
> it could easily be embedded directly in struct mm_struct.

Yes, I am concerned about that, and
9 files changed, 112 insertions(+), 1 deletion(-)
for something that is not even clearly required. And I still
haven't heard any strong use case to justify it.

Can we go with the single and much simpler cumulative number first and
only add the breakdown list if it is _really_ required? We can even
document that a future version of /proc/<pid>/status might add
additional information, to prepare all the parsers to be more careful.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v5 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status
  2015-08-27  6:48                                                             ` Michal Hocko
@ 2015-08-27 17:23                                                               ` Jörn Engel
  2015-08-27 20:44                                                                 ` David Rientjes
  2015-08-31  9:12                                                                 ` Michal Hocko
  0 siblings, 2 replies; 71+ messages in thread
From: Jörn Engel @ 2015-08-27 17:23 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Rientjes, Naoya Horiguchi, Andrew Morton, Mike Kravetz,
	linux-mm, linux-kernel, Naoya Horiguchi

On Thu, Aug 27, 2015 at 08:48:18AM +0200, Michal Hocko wrote:
> 
> > On x86, HUGE_MAX_HSTATE == 2.  I don't consider that to be expensive.
> > 
> > If you are concerned about the memory allocation of struct hugetlb_usage, 
> > it could easily be embedded directly in struct mm_struct.
> 
> Yes I am concerned about that and
> 9 files changed, 112 insertions(+), 1 deletion(-)
> for something that is even not clear to be really required. And I still
> haven't heard any strong usecase to justify it.
> 
> Can we go with the single and much simpler cumulative number first and
> only add the break down list if it is _really_ required? We can even
> document that the future version of /proc/<pid>/status might add an
> additional information to prepare all the parsers to be more careful.

I don't care much which way we decide.  But I find your reasoning a bit
worrying.  If someone asks for a by-size breakup of hugepages in a few
years, you might have existing binaries that depend on the _absence_ of
those extra characters on the line.

Compare:
  HugetlbPages:      18432 kB
  HugetlbPages:    1069056 kB (1*1048576kB 10*2048kB)

Once someone has written a script that greps for 'HugetlbPages:.*kB$',
you have lost the option of adding anything else to the line.  You have
created yet another ABI compatibility headache today in order to save
112 lines of code.

That may be a worthwhile tradeoff, I don't know.  But at least I realize
there is a cost, while you seem to ignore that component.  There is
value in not painting yourself into a corner.

Jörn

--
A quarrel is quickly settled when deserted by one party; there is
no battle unless there be two.
-- Seneca

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v5 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status
  2015-08-27 17:23                                                               ` Jörn Engel
@ 2015-08-27 20:44                                                                 ` David Rientjes
  2015-08-31  9:12                                                                 ` Michal Hocko
  1 sibling, 0 replies; 71+ messages in thread
From: David Rientjes @ 2015-08-27 20:44 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Michal Hocko, Naoya Horiguchi, Andrew Morton, Mike Kravetz,
	linux-mm, linux-kernel, Naoya Horiguchi

On Thu, 27 Aug 2015, Jörn Engel wrote:

> On Thu, Aug 27, 2015 at 08:48:18AM +0200, Michal Hocko wrote:
> > Can we go with the single and much simpler cumulative number first and
> > only add the break down list if it is _really_ required? We can even
> > document that the future version of /proc/<pid>/status might add an
> > additional information to prepare all the parsers to be more careful.
> 
> I don't care much which way we decide.  But I find your reasoning a bit
> worrying.  If someone asks for a by-size breakup of hugepages in a few
> years, you might have existing binaries that depend on the _absence_ of
> those extra characters on the line.
> 
> Compare:
>   HugetlbPages:      18432 kB
>   HugetlbPages:    1069056 kB (1*1048576kB 10*2048kB)
> 
> Once someone has written a script that greps for 'HugetlbPages:.*kB$',
> you have lost the option of adding anything else to the line.  You have
> created yet another ABI compatibility headache today in order to save
> 112 lines of code.
> 

This is exactly the concern that I have brought up in this thread.  We 
have no other way to sanely export the breakdown by hugepage size without 
new fields being added later with the hstate size embedded in the 
name itself.

I agree with the code as it stands in -mm today and I'm thankful to Naoya 
that a long-term maintainable API has been established.  Respectfully, I 
have no idea why we are still talking about this and I'm not going to be 
responding further unless something changes in -mm.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v5 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status
  2015-08-27 17:23                                                               ` Jörn Engel
  2015-08-27 20:44                                                                 ` David Rientjes
@ 2015-08-31  9:12                                                                 ` Michal Hocko
  1 sibling, 0 replies; 71+ messages in thread
From: Michal Hocko @ 2015-08-31  9:12 UTC (permalink / raw)
  To: Jörn Engel
  Cc: David Rientjes, Naoya Horiguchi, Andrew Morton, Mike Kravetz,
	linux-mm, linux-kernel, Naoya Horiguchi

On Thu 27-08-15 10:23:51, Jörn Engel wrote:
> On Thu, Aug 27, 2015 at 08:48:18AM +0200, Michal Hocko wrote:
> > 
> > > On x86, HUGE_MAX_HSTATE == 2.  I don't consider that to be expensive.
> > > 
> > > If you are concerned about the memory allocation of struct hugetlb_usage, 
> > > it could easily be embedded directly in struct mm_struct.
> > 
> > Yes I am concerned about that and
> > 9 files changed, 112 insertions(+), 1 deletion(-)
> > for something that is even not clear to be really required. And I still
> > haven't heard any strong usecase to justify it.
> > 
> > Can we go with the single and much simpler cumulative number first and
> > only add the break down list if it is _really_ required? We can even
> > document that the future version of /proc/<pid>/status might add an
> > additional information to prepare all the parsers to be more careful.
> 
> I don't care much which way we decide.  But I find your reasoning a bit
> worrying.  If someone asks for a by-size breakup of hugepages in a few
> years, you might have existing binaries that depend on the _absence_ of
> those extra characters on the line.
> 
> Compare:
>   HugetlbPages:      18432 kB
>   HugetlbPages:    1069056 kB (1*1048576kB 10*2048kB)
> 
> Once someone has written a script that greps for 'HugetlbPages:.*kB$',
> you have lost the option of adding anything else to the line. 

If you think that an explicit note in the documentation is
not sufficient, then I believe we can still handle it in a
backward-compatible way, e.g. with separate entries for each existing
hugetlb page size:
HugetlbPages:	     1069056 kB
Hugetlb2MPages:	     20480 kB
Hugetlb1GPages:	     1048576 kB

or something similar. I would even argue this would be slightly easier
to parse. So it is not like we would be locked into anything.
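
For what it is worth, a parser that keys on the exact field name and
ignores lines it does not understand copes with all of these shapes; a
small self-contained demonstration (the values and the per-size name
below are only examples of the suggestion above, not existing fields):

#include <stdio.h>

int main(void)
{
	/* the three shapes discussed in this sub-thread */
	static const char *lines[] = {
		"HugetlbPages:    1069056 kB",
		"HugetlbPages:    1069056 kB (1*1048576kB 10*2048kB)",
		"Hugetlb2MPages:     20480 kB",	/* hypothetical per-size line */
	};
	unsigned long kb;
	int i;

	for (i = 0; i < 3; i++) {
		if (sscanf(lines[i], "HugetlbPages: %lu kB", &kb) == 1)
			printf("total %lu kB\n", kb);
		else
			printf("ignored: %s\n", lines[i]);
	}
	/* a grep anchored on 'kB$', by contrast, misses the parenthesized form */
	return 0;
}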

> You have
> created yet another ABI compatibility headache today in order to save
> 112 lines of code.
> 
> That may be a worthwhile tradeoff, I don't know.  But at least I realize
> there is a cost, while you seem to ignore that component.  There is
> value in not painting yourself into a corner.

My primary point was that we are adding code for a feature nobody
actually asked for, just because somebody might ask for it in the future.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v5 1/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/smaps
  2015-08-20  8:26                                           ` [PATCH v5 1/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/smaps Naoya Horiguchi
  2015-08-20 10:49                                             ` Michal Hocko
@ 2015-09-07  1:29                                             ` Pádraig Brady
  2015-09-07  2:23                                               ` Naoya Horiguchi
  2015-09-09 15:12                                             ` Vlastimil Babka
  2 siblings, 1 reply; 71+ messages in thread
From: Pádraig Brady @ 2015-09-07  1:29 UTC (permalink / raw)
  To: Naoya Horiguchi, Andrew Morton
  Cc: David Rientjes, Jörn Engel, Mike Kravetz, linux-mm,
	linux-kernel, Naoya Horiguchi

On 20/08/15 09:26, Naoya Horiguchi wrote:
> Currently /proc/PID/smaps provides no usage info for vma(VM_HUGETLB), which
> is inconvenient when we want to know per-task or per-vma base hugetlb usage.
> To solve this, this patch adds a new line for hugetlb usage like below:
> 
>   Size:              20480 kB
>   Rss:                   0 kB
>   Pss:                   0 kB
>   Shared_Clean:          0 kB
>   Shared_Dirty:          0 kB
>   Private_Clean:         0 kB
>   Private_Dirty:         0 kB
>   Referenced:            0 kB
>   Anonymous:             0 kB
>   AnonHugePages:         0 kB
>   HugetlbPages:      18432 kB
>   Swap:                  0 kB
>   KernelPageSize:     2048 kB
>   MMUPageSize:        2048 kB
>   Locked:                0 kB
>   VmFlags: rd wr mr mw me de ht
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Acked-by: Joern Engel <joern@logfs.org>
> Acked-by: David Rientjes <rientjes@google.com>
> ---
> v3 -> v4:
> - suspend Acked-by tag because v3->v4 change is not trivial
> - I stated in previous discussion that HugetlbPages line can contain page
>   size info, but that's not necessary because we already have KernelPageSize
>   info.
> - merged documentation update, where the current documentation doesn't mention
>   AnonHugePages, so it's also added.
> ---
>  Documentation/filesystems/proc.txt |  7 +++++--
>  fs/proc/task_mmu.c                 | 29 +++++++++++++++++++++++++++++
>  2 files changed, 34 insertions(+), 2 deletions(-)
> 
> diff --git v4.2-rc4/Documentation/filesystems/proc.txt v4.2-rc4_patched/Documentation/filesystems/proc.txt
> index 6f7fafde0884..22e40211ef64 100644
> --- v4.2-rc4/Documentation/filesystems/proc.txt
> +++ v4.2-rc4_patched/Documentation/filesystems/proc.txt
> @@ -423,6 +423,8 @@ Private_Clean:         0 kB
>  Private_Dirty:         0 kB
>  Referenced:          892 kB
>  Anonymous:             0 kB
> +AnonHugePages:         0 kB
> +HugetlbPages:          0 kB
>  Swap:                  0 kB
>  KernelPageSize:        4 kB
>  MMUPageSize:           4 kB
> @@ -440,8 +442,9 @@ indicates the amount of memory currently marked as referenced or accessed.
>  "Anonymous" shows the amount of memory that does not belong to any file.  Even
>  a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE
>  and a page is modified, the file page is replaced by a private anonymous copy.
> -"Swap" shows how much would-be-anonymous memory is also used, but out on
> -swap.
> +"AnonHugePages" shows the ammount of memory backed by transparent hugepage.
> +"HugetlbPages" shows the ammount of memory backed by hugetlbfs page.
> +"Swap" shows how much would-be-anonymous memory is also used, but out on swap.

There is no distinction between "private" and "shared" in this "huge page" accounting, right?
Would it be possible to account for the huge pages in the {Private,Shared}_{Clean,Dirty} fields?
Or otherwise split the huge page accounting into shared/private?

thanks!
Pádraig.


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v5 1/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/smaps
  2015-09-07  1:29                                             ` Pádraig Brady
@ 2015-09-07  2:23                                               ` Naoya Horiguchi
  2015-09-07  6:46                                                 ` Naoya Horiguchi
  0 siblings, 1 reply; 71+ messages in thread
From: Naoya Horiguchi @ 2015-09-07  2:23 UTC (permalink / raw)
  To: Pádraig Brady
  Cc: Andrew Morton, David Rientjes, Jörn Engel, Mike Kravetz,
	linux-mm, linux-kernel, Naoya Horiguchi

On Mon, Sep 07, 2015 at 02:29:53AM +0100, Pádraig Brady wrote:
> On 20/08/15 09:26, Naoya Horiguchi wrote:
> > Currently /proc/PID/smaps provides no usage info for vma(VM_HUGETLB), which
> > is inconvenient when we want to know per-task or per-vma base hugetlb usage.
> > To solve this, this patch adds a new line for hugetlb usage like below:
> > 
> >   Size:              20480 kB
> >   Rss:                   0 kB
> >   Pss:                   0 kB
> >   Shared_Clean:          0 kB
> >   Shared_Dirty:          0 kB
> >   Private_Clean:         0 kB
> >   Private_Dirty:         0 kB
> >   Referenced:            0 kB
> >   Anonymous:             0 kB
> >   AnonHugePages:         0 kB
> >   HugetlbPages:      18432 kB
> >   Swap:                  0 kB
> >   KernelPageSize:     2048 kB
> >   MMUPageSize:        2048 kB
> >   Locked:                0 kB
> >   VmFlags: rd wr mr mw me de ht
> > 
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > Acked-by: Joern Engel <joern@logfs.org>
> > Acked-by: David Rientjes <rientjes@google.com>
> > ---
> > v3 -> v4:
> > - suspend Acked-by tag because v3->v4 change is not trivial
> > - I stated in previous discussion that HugetlbPages line can contain page
> >   size info, but that's not necessary because we already have KernelPageSize
> >   info.
> > - merged documentation update, where the current documentation doesn't mention
> >   AnonHugePages, so it's also added.
> > ---
> >  Documentation/filesystems/proc.txt |  7 +++++--
> >  fs/proc/task_mmu.c                 | 29 +++++++++++++++++++++++++++++
> >  2 files changed, 34 insertions(+), 2 deletions(-)
> > 
> > diff --git v4.2-rc4/Documentation/filesystems/proc.txt v4.2-rc4_patched/Documentation/filesystems/proc.txt
> > index 6f7fafde0884..22e40211ef64 100644
> > --- v4.2-rc4/Documentation/filesystems/proc.txt
> > +++ v4.2-rc4_patched/Documentation/filesystems/proc.txt
> > @@ -423,6 +423,8 @@ Private_Clean:         0 kB
> >  Private_Dirty:         0 kB
> >  Referenced:          892 kB
> >  Anonymous:             0 kB
> > +AnonHugePages:         0 kB
> > +HugetlbPages:          0 kB
> >  Swap:                  0 kB
> >  KernelPageSize:        4 kB
> >  MMUPageSize:           4 kB
> > @@ -440,8 +442,9 @@ indicates the amount of memory currently marked as referenced or accessed.
> >  "Anonymous" shows the amount of memory that does not belong to any file.  Even
> >  a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE
> >  and a page is modified, the file page is replaced by a private anonymous copy.
> > -"Swap" shows how much would-be-anonymous memory is also used, but out on
> > -swap.
> > +"AnonHugePages" shows the ammount of memory backed by transparent hugepage.
> > +"HugetlbPages" shows the ammount of memory backed by hugetlbfs page.
> > +"Swap" shows how much would-be-anonymous memory is also used, but out on swap.
> 
> There is no distinction between "private" and "shared" in this "huge page" accounting right?

Right for the current version. And I think that a private/shared distinction
would give some help.

> Would it be possible to account for the huge pages in the {Private,Shared}_{Clean,Dirty} fields?
> Or otherwise split the huge page accounting into shared/private?

As for the clean/dirty distinction, I'm not sure how worthwhile it is, because
hugetlb pages are always in memory and never swapped out (does userspace even
care about the dirtiness of hugetlb?).

According to the commit log of commit b4d1d99fdd8b ("hugetlb: handle updating
of ACCESSED and DIRTY in hugetlb_fault()"), the dirty bit of hugetlb is maintained
to make arch-specific TLB handling convenient. It looks purely kernel-internal,
so I think we don't have to expose it.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v5 1/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/smaps
  2015-09-07  2:23                                               ` Naoya Horiguchi
@ 2015-09-07  6:46                                                 ` Naoya Horiguchi
  2015-09-07  9:52                                                   ` Pádraig Brady
  0 siblings, 1 reply; 71+ messages in thread
From: Naoya Horiguchi @ 2015-09-07  6:46 UTC (permalink / raw)
  To: Pádraig Brady
  Cc: Andrew Morton, David Rientjes, Jörn Engel, Mike Kravetz,
	linux-mm, linux-kernel, Naoya Horiguchi

On Mon, Sep 07, 2015 at 02:23:44AM +0000, Horiguchi Naoya(堀口 直也) wrote:
> On Mon, Sep 07, 2015 at 02:29:53AM +0100, Pádraig Brady wrote:
> > On 20/08/15 09:26, Naoya Horiguchi wrote:
> > > Currently /proc/PID/smaps provides no usage info for vma(VM_HUGETLB), which
> > > is inconvenient when we want to know per-task or per-vma base hugetlb usage.
> > > To solve this, this patch adds a new line for hugetlb usage like below:
> > > 
> > >   Size:              20480 kB
> > >   Rss:                   0 kB
> > >   Pss:                   0 kB
> > >   Shared_Clean:          0 kB
> > >   Shared_Dirty:          0 kB
> > >   Private_Clean:         0 kB
> > >   Private_Dirty:         0 kB
> > >   Referenced:            0 kB
> > >   Anonymous:             0 kB
> > >   AnonHugePages:         0 kB
> > >   HugetlbPages:      18432 kB
> > >   Swap:                  0 kB
> > >   KernelPageSize:     2048 kB
> > >   MMUPageSize:        2048 kB
> > >   Locked:                0 kB
> > >   VmFlags: rd wr mr mw me de ht
> > > 
> > > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > > Acked-by: Joern Engel <joern@logfs.org>
> > > Acked-by: David Rientjes <rientjes@google.com>
> > > ---
> > > v3 -> v4:
> > > - suspend Acked-by tag because v3->v4 change is not trivial
> > > - I stated in previous discussion that HugetlbPages line can contain page
> > >   size info, but that's not necessary because we already have KernelPageSize
> > >   info.
> > > - merged documentation update, where the current documentation doesn't mention
> > >   AnonHugePages, so it's also added.
> > > ---
> > >  Documentation/filesystems/proc.txt |  7 +++++--
> > >  fs/proc/task_mmu.c                 | 29 +++++++++++++++++++++++++++++
> > >  2 files changed, 34 insertions(+), 2 deletions(-)
> > > 
> > > diff --git v4.2-rc4/Documentation/filesystems/proc.txt v4.2-rc4_patched/Documentation/filesystems/proc.txt
> > > index 6f7fafde0884..22e40211ef64 100644
> > > --- v4.2-rc4/Documentation/filesystems/proc.txt
> > > +++ v4.2-rc4_patched/Documentation/filesystems/proc.txt
> > > @@ -423,6 +423,8 @@ Private_Clean:         0 kB
> > >  Private_Dirty:         0 kB
> > >  Referenced:          892 kB
> > >  Anonymous:             0 kB
> > > +AnonHugePages:         0 kB
> > > +HugetlbPages:          0 kB
> > >  Swap:                  0 kB
> > >  KernelPageSize:        4 kB
> > >  MMUPageSize:           4 kB
> > > @@ -440,8 +442,9 @@ indicates the amount of memory currently marked as referenced or accessed.
> > >  "Anonymous" shows the amount of memory that does not belong to any file.  Even
> > >  a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE
> > >  and a page is modified, the file page is replaced by a private anonymous copy.
> > > -"Swap" shows how much would-be-anonymous memory is also used, but out on
> > > -swap.
> > > +"AnonHugePages" shows the ammount of memory backed by transparent hugepage.
> > > +"HugetlbPages" shows the ammount of memory backed by hugetlbfs page.
> > > +"Swap" shows how much would-be-anonymous memory is also used, but out on swap.
> > 
> > There is no distinction between "private" and "shared" in this "huge page" accounting right?
> 
> Right for current version. And I think that private/shared distinction
> gives some help.
> 
> > Would it be possible to account for the huge pages in the {Private,Shared}_{Clean,Dirty} fields?
> > Or otherwise split the huge page accounting into shared/private?

Sorry, I didn't catch your point properly.
I think that accounting for hugetlb pages should be done only in HugetlbPages
or other new hugetlb-specific fields, in order not to break the behavior of existing
fields. So splitting HugetlbPages into shared/private looks good to me.

Thanks,
Naoya Horiguchi

> As for clean/dirty distinction, I'm not sure how it's worthwhile because
> hugetlb pages are always on memory and never swapped out (userspace doesn't
> care about dirtiness of hugetlb?).
> 
> According to commit log of commit b4d1d99fdd8b ("hugetlb: handle updating
> of ACCESSED and DIRTY in hugetlb_fault()"), dirty bit of hugetlb is maintained
> to make arch-specific TLB handling convenient. It looks purely kernel-internal,
> so I think we don't have to expose it.
> 
> Thanks,
> Naoya Horiguchi

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v5 1/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/smaps
  2015-09-07  6:46                                                 ` Naoya Horiguchi
@ 2015-09-07  9:52                                                   ` Pádraig Brady
  2015-09-07 10:52                                                     ` Pádraig Brady
  0 siblings, 1 reply; 71+ messages in thread
From: Pádraig Brady @ 2015-09-07  9:52 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, David Rientjes, Jörn Engel, Mike Kravetz,
	linux-mm, linux-kernel, Naoya Horiguchi

On 07/09/15 07:46, Naoya Horiguchi wrote:
> On Mon, Sep 07, 2015 at 02:23:44AM +0000, Horiguchi Naoya(堀口 直也) wrote:
>> On Mon, Sep 07, 2015 at 02:29:53AM +0100, Pádraig Brady wrote:
>>> On 20/08/15 09:26, Naoya Horiguchi wrote:
>>>> Currently /proc/PID/smaps provides no usage info for vma(VM_HUGETLB), which
>>>> is inconvenient when we want to know per-task or per-vma base hugetlb usage.
>>>> To solve this, this patch adds a new line for hugetlb usage like below:
>>>>
>>>>   Size:              20480 kB
>>>>   Rss:                   0 kB
>>>>   Pss:                   0 kB
>>>>   Shared_Clean:          0 kB
>>>>   Shared_Dirty:          0 kB
>>>>   Private_Clean:         0 kB
>>>>   Private_Dirty:         0 kB
>>>>   Referenced:            0 kB
>>>>   Anonymous:             0 kB
>>>>   AnonHugePages:         0 kB
>>>>   HugetlbPages:      18432 kB
>>>>   Swap:                  0 kB
>>>>   KernelPageSize:     2048 kB
>>>>   MMUPageSize:        2048 kB
>>>>   Locked:                0 kB
>>>>   VmFlags: rd wr mr mw me de ht
>>>>
>>>> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>>>> Acked-by: Joern Engel <joern@logfs.org>
>>>> Acked-by: David Rientjes <rientjes@google.com>
>>>> ---
>>>> v3 -> v4:
>>>> - suspend Acked-by tag because v3->v4 change is not trivial
>>>> - I stated in previous discussion that HugetlbPages line can contain page
>>>>   size info, but that's not necessary because we already have KernelPageSize
>>>>   info.
>>>> - merged documentation update, where the current documentation doesn't mention
>>>>   AnonHugePages, so it's also added.
>>>> ---
>>>>  Documentation/filesystems/proc.txt |  7 +++++--
>>>>  fs/proc/task_mmu.c                 | 29 +++++++++++++++++++++++++++++
>>>>  2 files changed, 34 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git v4.2-rc4/Documentation/filesystems/proc.txt v4.2-rc4_patched/Documentation/filesystems/proc.txt
>>>> index 6f7fafde0884..22e40211ef64 100644
>>>> --- v4.2-rc4/Documentation/filesystems/proc.txt
>>>> +++ v4.2-rc4_patched/Documentation/filesystems/proc.txt
>>>> @@ -423,6 +423,8 @@ Private_Clean:         0 kB
>>>>  Private_Dirty:         0 kB
>>>>  Referenced:          892 kB
>>>>  Anonymous:             0 kB
>>>> +AnonHugePages:         0 kB
>>>> +HugetlbPages:          0 kB
>>>>  Swap:                  0 kB
>>>>  KernelPageSize:        4 kB
>>>>  MMUPageSize:           4 kB
>>>> @@ -440,8 +442,9 @@ indicates the amount of memory currently marked as referenced or accessed.
>>>>  "Anonymous" shows the amount of memory that does not belong to any file.  Even
>>>>  a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE
>>>>  and a page is modified, the file page is replaced by a private anonymous copy.
>>>> -"Swap" shows how much would-be-anonymous memory is also used, but out on
>>>> -swap.
>>>> +"AnonHugePages" shows the ammount of memory backed by transparent hugepage.
>>>> +"HugetlbPages" shows the ammount of memory backed by hugetlbfs page.
>>>> +"Swap" shows how much would-be-anonymous memory is also used, but out on swap.
>>>
>>> There is no distinction between "private" and "shared" in this "huge page" accounting right?
>>
>> Right for current version. And I think that private/shared distinction
>> gives some help.
>>
>>> Would it be possible to account for the huge pages in the {Private,Shared}_{Clean,Dirty} fields?
>>> Or otherwise split the huge page accounting into shared/private?
> 
> Sorry, I didn't catch you properly.
> I think that accounting for hugetlb pages should be done only with HugetlbPages
> or any other new field for hugetlb, in order not to break the behavior of existing
> fields. 

On a more general note I'd be inclined to just account
for hugetlb pages in Rss and {Private,Shared}_Dirty
and fix any tools that double count.

> So splitting HugetlbPages into shared/private looks good to me.

Yes, this is the most compatible solution,
and it will allow one to accurately determine
how much core memory a process is using.

thanks!
Pádraig.


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v5 1/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/smaps
  2015-09-07  9:52                                                   ` Pádraig Brady
@ 2015-09-07 10:52                                                     ` Pádraig Brady
  2015-09-17  9:39                                                       ` Naoya Horiguchi
  0 siblings, 1 reply; 71+ messages in thread
From: Pádraig Brady @ 2015-09-07 10:52 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, David Rientjes, Jörn Engel, Mike Kravetz,
	linux-mm, linux-kernel, Naoya Horiguchi

On 07/09/15 10:52, Pádraig Brady wrote:
> On 07/09/15 07:46, Naoya Horiguchi wrote:
>> On Mon, Sep 07, 2015 at 02:23:44AM +0000, Horiguchi Naoya(堀口 直也) wrote:
>>> On Mon, Sep 07, 2015 at 02:29:53AM +0100, Pádraig Brady wrote:
>>>> On 20/08/15 09:26, Naoya Horiguchi wrote:
>>>>> Currently /proc/PID/smaps provides no usage info for vma(VM_HUGETLB), which
>>>>> is inconvenient when we want to know per-task or per-vma base hugetlb usage.
>>>>> To solve this, this patch adds a new line for hugetlb usage like below:
>>>>>
>>>>>   Size:              20480 kB
>>>>>   Rss:                   0 kB
>>>>>   Pss:                   0 kB
>>>>>   Shared_Clean:          0 kB
>>>>>   Shared_Dirty:          0 kB
>>>>>   Private_Clean:         0 kB
>>>>>   Private_Dirty:         0 kB
>>>>>   Referenced:            0 kB
>>>>>   Anonymous:             0 kB
>>>>>   AnonHugePages:         0 kB
>>>>>   HugetlbPages:      18432 kB
>>>>>   Swap:                  0 kB
>>>>>   KernelPageSize:     2048 kB
>>>>>   MMUPageSize:        2048 kB
>>>>>   Locked:                0 kB
>>>>>   VmFlags: rd wr mr mw me de ht
>>>>>
>>>>> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>>>>> Acked-by: Joern Engel <joern@logfs.org>
>>>>> Acked-by: David Rientjes <rientjes@google.com>
>>>>> ---
>>>>> v3 -> v4:
>>>>> - suspend Acked-by tag because v3->v4 change is not trivial
>>>>> - I stated in previous discussion that HugetlbPages line can contain page
>>>>>   size info, but that's not necessary because we already have KernelPageSize
>>>>>   info.
>>>>> - merged documentation update, where the current documentation doesn't mention
>>>>>   AnonHugePages, so it's also added.
>>>>> ---
>>>>>  Documentation/filesystems/proc.txt |  7 +++++--
>>>>>  fs/proc/task_mmu.c                 | 29 +++++++++++++++++++++++++++++
>>>>>  2 files changed, 34 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git v4.2-rc4/Documentation/filesystems/proc.txt v4.2-rc4_patched/Documentation/filesystems/proc.txt
>>>>> index 6f7fafde0884..22e40211ef64 100644
>>>>> --- v4.2-rc4/Documentation/filesystems/proc.txt
>>>>> +++ v4.2-rc4_patched/Documentation/filesystems/proc.txt
>>>>> @@ -423,6 +423,8 @@ Private_Clean:         0 kB
>>>>>  Private_Dirty:         0 kB
>>>>>  Referenced:          892 kB
>>>>>  Anonymous:             0 kB
>>>>> +AnonHugePages:         0 kB
>>>>> +HugetlbPages:          0 kB
>>>>>  Swap:                  0 kB
>>>>>  KernelPageSize:        4 kB
>>>>>  MMUPageSize:           4 kB
>>>>> @@ -440,8 +442,9 @@ indicates the amount of memory currently marked as referenced or accessed.
>>>>>  "Anonymous" shows the amount of memory that does not belong to any file.  Even
>>>>>  a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE
>>>>>  and a page is modified, the file page is replaced by a private anonymous copy.
>>>>> -"Swap" shows how much would-be-anonymous memory is also used, but out on
>>>>> -swap.
>>>>> +"AnonHugePages" shows the ammount of memory backed by transparent hugepage.
>>>>> +"HugetlbPages" shows the ammount of memory backed by hugetlbfs page.
>>>>> +"Swap" shows how much would-be-anonymous memory is also used, but out on swap.
>>>>
>>>> There is no distinction between "private" and "shared" in this "huge page" accounting right?
>>>
>>> Right for current version. And I think that private/shared distinction
>>> gives some help.
>>>
>>>> Would it be possible to account for the huge pages in the {Private,Shared}_{Clean,Dirty} fields?
>>>> Or otherwise split the huge page accounting into shared/private?
>>
>> Sorry, I didn't catch you properly.
>> I think that accounting for hugetlb pages should be done only with HugetlbPages
>> or any other new field for hugetlb, in order not to break the behavior of existing
>> fields. 
> 
> On a more general note I'd be inclined to just account
> for hugetlb pages in Rss and {Private,Shared}_Dirty
> and fix any tools that double count.

By the same argument, I presume the existing THP "AnonHugePages" smaps field
is not accounted for in the {Private,Shared}_... fields?
I.e. AnonHugePages may also benefit from splitting into Private/Shared?

thanks,
Pádraig.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v5 1/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/smaps
  2015-08-20  8:26                                           ` [PATCH v5 1/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/smaps Naoya Horiguchi
  2015-08-20 10:49                                             ` Michal Hocko
  2015-09-07  1:29                                             ` Pádraig Brady
@ 2015-09-09 15:12                                             ` Vlastimil Babka
  2015-09-09 22:14                                               ` David Rientjes
  2 siblings, 1 reply; 71+ messages in thread
From: Vlastimil Babka @ 2015-09-09 15:12 UTC (permalink / raw)
  To: Naoya Horiguchi, Andrew Morton
  Cc: David Rientjes, Jörn Engel, Mike Kravetz, linux-mm,
	linux-kernel, Naoya Horiguchi

On 08/20/2015 10:26 AM, Naoya Horiguchi wrote:
> Currently /proc/PID/smaps provides no usage info for vma(VM_HUGETLB), which
> is inconvenient when we want to know per-task or per-vma base hugetlb usage.
> To solve this, this patch adds a new line for hugetlb usage like below:
>
>    Size:              20480 kB
>    Rss:                   0 kB
>    Pss:                   0 kB
>    Shared_Clean:          0 kB
>    Shared_Dirty:          0 kB
>    Private_Clean:         0 kB
>    Private_Dirty:         0 kB
>    Referenced:            0 kB
>    Anonymous:             0 kB
>    AnonHugePages:         0 kB
>    HugetlbPages:      18432 kB
>    Swap:                  0 kB
>    KernelPageSize:     2048 kB
>    MMUPageSize:        2048 kB
>    Locked:                0 kB
>    VmFlags: rd wr mr mw me de ht
>
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Acked-by: Joern Engel <joern@logfs.org>
> Acked-by: David Rientjes <rientjes@google.com>

Sorry for coming late to this thread. It's a nice improvement, but I 
find it somewhat illogical that the per-process stats (status) are more 
detailed than the per-mapping stats (smaps) with respect to the size 
breakdown. I would expect it to be the other way around. That would 
simplify the per-process accounting (I realize this has been a hot topic 
already), and allow those who really care to look at smaps.

I'm just not sure about the format in that case. In smaps, a line per 
size would probably make more sense. Even in status, the extra 
information in parentheses looks somewhat out of place. But of course, 
adding a shared/private breakdown as suggested would lead to an explosion 
in the number of lines in that case...

> ---
> v3 -> v4:
> - suspend Acked-by tag because v3->v4 change is not trivial
> - I stated in previous discussion that HugetlbPages line can contain page
>    size info, but that's not necessary because we already have KernelPageSize
>    info.
> - merged documentation update, where the current documentation doesn't mention
>    AnonHugePages, so it's also added.
> ---
>   Documentation/filesystems/proc.txt |  7 +++++--
>   fs/proc/task_mmu.c                 | 29 +++++++++++++++++++++++++++++
>   2 files changed, 34 insertions(+), 2 deletions(-)
>
> diff --git v4.2-rc4/Documentation/filesystems/proc.txt v4.2-rc4_patched/Documentation/filesystems/proc.txt
> index 6f7fafde0884..22e40211ef64 100644
> --- v4.2-rc4/Documentation/filesystems/proc.txt
> +++ v4.2-rc4_patched/Documentation/filesystems/proc.txt
> @@ -423,6 +423,8 @@ Private_Clean:         0 kB
>   Private_Dirty:         0 kB
>   Referenced:          892 kB
>   Anonymous:             0 kB
> +AnonHugePages:         0 kB
> +HugetlbPages:          0 kB
>   Swap:                  0 kB
>   KernelPageSize:        4 kB
>   MMUPageSize:           4 kB
> @@ -440,8 +442,9 @@ indicates the amount of memory currently marked as referenced or accessed.
>   "Anonymous" shows the amount of memory that does not belong to any file.  Even
>   a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE
>   and a page is modified, the file page is replaced by a private anonymous copy.
> -"Swap" shows how much would-be-anonymous memory is also used, but out on
> -swap.
> +"AnonHugePages" shows the amount of memory backed by transparent hugepages.
> +"HugetlbPages" shows the amount of memory backed by hugetlbfs pages.
> +"Swap" shows how much would-be-anonymous memory is also used, but out on swap.
>
>   "VmFlags" field deserves a separate description. This member represents the kernel
>   flags associated with the particular virtual memory area in two letter encoded
> diff --git v4.2-rc4/fs/proc/task_mmu.c v4.2-rc4_patched/fs/proc/task_mmu.c
> index ca1e091881d4..2c37938b82ee 100644
> --- v4.2-rc4/fs/proc/task_mmu.c
> +++ v4.2-rc4_patched/fs/proc/task_mmu.c
> @@ -445,6 +445,7 @@ struct mem_size_stats {
>   	unsigned long anonymous;
>   	unsigned long anonymous_thp;
>   	unsigned long swap;
> +	unsigned long hugetlb;
>   	u64 pss;
>   };
>
> @@ -610,12 +611,38 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
>   	seq_putc(m, '\n');
>   }
>
> +#ifdef CONFIG_HUGETLB_PAGE
> +static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
> +				 unsigned long addr, unsigned long end,
> +				 struct mm_walk *walk)
> +{
> +	struct mem_size_stats *mss = walk->private;
> +	struct vm_area_struct *vma = walk->vma;
> +	struct page *page = NULL;
> +
> +	if (pte_present(*pte)) {
> +		page = vm_normal_page(vma, addr, *pte);
> +	} else if (is_swap_pte(*pte)) {
> +		swp_entry_t swpent = pte_to_swp_entry(*pte);
> +
> +		if (is_migration_entry(swpent))
> +			page = migration_entry_to_page(swpent);
> +	}
> +	if (page)
> +		mss->hugetlb += huge_page_size(hstate_vma(vma));
> +	return 0;
> +}
> +#endif /* HUGETLB_PAGE */
> +
>   static int show_smap(struct seq_file *m, void *v, int is_pid)
>   {
>   	struct vm_area_struct *vma = v;
>   	struct mem_size_stats mss;
>   	struct mm_walk smaps_walk = {
>   		.pmd_entry = smaps_pte_range,
> +#ifdef CONFIG_HUGETLB_PAGE
> +		.hugetlb_entry = smaps_hugetlb_range,
> +#endif
>   		.mm = vma->vm_mm,
>   		.private = &mss,
>   	};
> @@ -637,6 +664,7 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
>   		   "Referenced:     %8lu kB\n"
>   		   "Anonymous:      %8lu kB\n"
>   		   "AnonHugePages:  %8lu kB\n"
> +		   "HugetlbPages:   %8lu kB\n"
>   		   "Swap:           %8lu kB\n"
>   		   "KernelPageSize: %8lu kB\n"
>   		   "MMUPageSize:    %8lu kB\n"
> @@ -651,6 +679,7 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
>   		   mss.referenced >> 10,
>   		   mss.anonymous >> 10,
>   		   mss.anonymous_thp >> 10,
> +		   mss.hugetlb >> 10,
>   		   mss.swap >> 10,
>   		   vma_kernel_pagesize(vma) >> 10,
>   		   vma_mmu_pagesize(vma) >> 10,
>


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v5 1/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/smaps
  2015-09-09 15:12                                             ` Vlastimil Babka
@ 2015-09-09 22:14                                               ` David Rientjes
  0 siblings, 0 replies; 71+ messages in thread
From: David Rientjes @ 2015-09-09 22:14 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Naoya Horiguchi, Andrew Morton, Jörn Engel, Mike Kravetz,
	linux-mm, linux-kernel, Naoya Horiguchi

On Wed, 9 Sep 2015, Vlastimil Babka wrote:

> On 08/20/2015 10:26 AM, Naoya Horiguchi wrote:
> > Currently /proc/PID/smaps provides no usage info for vma(VM_HUGETLB), which
> > is inconvenient when we want to know hugetlb usage on a per-task or per-vma
> > basis. To solve this, this patch adds a new line for hugetlb usage like below:
> > 
> >    Size:              20480 kB
> >    Rss:                   0 kB
> >    Pss:                   0 kB
> >    Shared_Clean:          0 kB
> >    Shared_Dirty:          0 kB
> >    Private_Clean:         0 kB
> >    Private_Dirty:         0 kB
> >    Referenced:            0 kB
> >    Anonymous:             0 kB
> >    AnonHugePages:         0 kB
> >    HugetlbPages:      18432 kB
> >    Swap:                  0 kB
> >    KernelPageSize:     2048 kB
> >    MMUPageSize:        2048 kB
> >    Locked:                0 kB
> >    VmFlags: rd wr mr mw me de ht
> > 
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > Acked-by: Joern Engel <joern@logfs.org>
> > Acked-by: David Rientjes <rientjes@google.com>
> 
> Sorry for coming late to this thread. It's a nice improvement, but I find it
> somewhat illogical that the per-process stats (status) are more detailed than
> the per-mapping stats (smaps) with respect to the size breakdown. I would
> expect it to be the other way around. That would simplify the per-process
> accounting (I realize this has been a hot topic already), and allow those who
> really care to look at smaps.
> 

Smaps shows the pagesize for the hugepage of the vma and the rss, so I
believe you have all the information needed.  Some distributions also
change smaps to be readable only by the owner or root for security
reasons.
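
For example, a monitoring tool that wanted a per-process hugetlb total once
this patch lands could simply sum the new HugetlbPages lines from smaps.  A
minimal userspace sketch (not part of the patch; it assumes the field name and
"<n> kB" format proposed in this thread):

	/* hugetlb_smaps.c: sum "HugetlbPages:" lines from /proc/<pid>/smaps.
	 * Assumes the output format proposed in this thread; adjust the
	 * sscanf() pattern if the final field name differs.
	 */
	#include <stdio.h>

	int main(int argc, char **argv)
	{
		char path[64], line[256];
		unsigned long kb, total_kb = 0;
		FILE *f;

		snprintf(path, sizeof(path), "/proc/%s/smaps",
			 argc > 1 ? argv[1] : "self");
		f = fopen(path, "r");
		if (!f) {
			perror(path);
			return 1;
		}
		while (fgets(line, sizeof(line), f))
			if (sscanf(line, "HugetlbPages: %lu kB", &kb) == 1)
				total_kb += kb;
		fclose(f);
		printf("hugetlb mapped: %lu kB\n", total_kb);
		return 0;
	}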

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH v1] mm: migrate: hugetlb: putback destination hugepage to active list
  2015-08-12  0:03                                       ` Naoya Horiguchi
  2015-08-12  7:45                                         ` [PATCH v4 1/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/smaps Naoya Horiguchi
  2015-08-20  8:26                                         ` [PATCH v5 0/2] hugetlb: display per-process/per-vma usage Naoya Horiguchi
@ 2015-09-16  0:21                                         ` Naoya Horiguchi
  2015-09-16  2:53                                           ` Naoya Horiguchi
  2 siblings, 1 reply; 71+ messages in thread
From: Naoya Horiguchi @ 2015-09-16  0:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Andi Kleen, Hugh Dickins, linux-mm, linux-kernel,
	Naoya Horiguchi

Since commit bcc54222309c ("mm: hugetlb: introduce page_huge_active")
each hugetlb page maintains an active flag to avoid a race condition between
multiple calls of isolate_huge_page(), but the current kernel doesn't set the
flag on a hugepage allocated by migration because the proper putback routine
isn't called. This means that users could still encounter the race referred to
by bcc54222309c in this special case, so this patch fixes it.

Fixes: bcc54222309c ("mm: hugetlb: introduce page_huge_active")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org>  #4.1
---
 mm/migrate.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git v4.3-rc1/mm/migrate.c v4.3-rc1_patched/mm/migrate.c
index c3cb566af3e2..7452a00bbb50 100644
--- v4.3-rc1/mm/migrate.c
+++ v4.3-rc1_patched/mm/migrate.c
@@ -1075,7 +1075,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 	if (rc != MIGRATEPAGE_SUCCESS && put_new_page)
 		put_new_page(new_hpage, private);
 	else
-		put_page(new_hpage);
+		putback_active_hugepage(new_hpage);
 
 	if (result) {
 		if (rc)
-- 
2.4.3
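
For context on why the one-line change matters: put_page() only drops the
reference, while the hugetlb putback helper also re-marks the page active and
returns it to the hstate's active list, which is what closes the race with
isolate_huge_page().  A simplified sketch of the v4.1-era helper in
mm/hugetlb.c (condensed, not a verbatim copy):

	/* Simplified sketch of putback_active_hugepage(): unlike a bare
	 * put_page(), it sets the active flag introduced by bcc54222309c
	 * and moves the page back onto the hstate's active list before
	 * dropping the reference.
	 */
	void putback_active_hugepage(struct page *page)
	{
		spin_lock(&hugetlb_lock);
		set_page_huge_active(page);
		list_move_tail(&page->lru,
			       &(page_hstate(page))->hugepage_activelist);
		spin_unlock(&hugetlb_lock);
		put_page(page);
	}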

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [PATCH v1] mm: migrate: hugetlb: putback destination hugepage to active list
  2015-09-16  0:21                                         ` [PATCH v1] mm: migrate: hugetlb: putback destination hugepage to active list Naoya Horiguchi
@ 2015-09-16  2:53                                           ` Naoya Horiguchi
  0 siblings, 0 replies; 71+ messages in thread
From: Naoya Horiguchi @ 2015-09-16  2:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Andi Kleen, Hugh Dickins, linux-mm, linux-kernel

My bad, this patch is totally unrelated to the thread the previous email
replied to. I just mishandled my script wrapping git-send-email, sorry.

But simply resending the patch seems noisy, so I'd like it to be reviewed
in this email.
If this is an inconvenient or uncommon way of submission, please let me know
and I'll resend in a new thread.

Thanks,
Naoya Horiguchi

On Wed, Sep 16, 2015 at 12:21:04AM +0000, Naoya Horiguchi wrote:
> Since commit bcc54222309c ("mm: hugetlb: introduce page_huge_active")
> each hugetlb page maintains an active flag to avoid a race condition between
> multiple calls of isolate_huge_page(), but the current kernel doesn't set the
> flag on a hugepage allocated by migration because the proper putback routine
> isn't called. This means that users could still encounter the race referred to
> by bcc54222309c in this special case, so this patch fixes it.
> 
> Fixes: bcc54222309c ("mm: hugetlb: introduce page_huge_active")
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: <stable@vger.kernel.org>  #4.1
> ---
>  mm/migrate.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git v4.3-rc1/mm/migrate.c v4.3-rc1_patched/mm/migrate.c
> index c3cb566af3e2..7452a00bbb50 100644
> --- v4.3-rc1/mm/migrate.c
> +++ v4.3-rc1_patched/mm/migrate.c
> @@ -1075,7 +1075,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
>  	if (rc != MIGRATEPAGE_SUCCESS && put_new_page)
>  		put_new_page(new_hpage, private);
>  	else
> -		put_page(new_hpage);
> +		putback_active_hugepage(new_hpage);
>  
>  	if (result) {
>  		if (rc)
> -- 
> 2.4.3
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v5 1/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/smaps
  2015-09-07 10:52                                                     ` Pádraig Brady
@ 2015-09-17  9:39                                                       ` Naoya Horiguchi
  0 siblings, 0 replies; 71+ messages in thread
From: Naoya Horiguchi @ 2015-09-17  9:39 UTC (permalink / raw)
  To: Pádraig Brady
  Cc: Andrew Morton, David Rientjes, Jörn Engel, Mike Kravetz,
	linux-mm, linux-kernel, Naoya Horiguchi

On Mon, Sep 07, 2015 at 11:52:41AM +0100, Pádraig Brady wrote:
...
> 
> By the same argument I presume the existing THP "AnonHugePages" smaps field
> is not accounted for in the {Private,Shared}_... fields?
> I.E. AnonHugePages may also benefit from splitting to Private/Shared?

smaps_pmd_entry() not only increments mss->anonymous_thp, but also calls
smaps_account(), which updates mss->anonymous, mss->referenced and
mss->{shared,private}_{clean,dirty}, so a THP's shared/private characteristics
are already included in the other existing fields.
I think that even if we knew the THP-specific shared/private profile, it
would be hard to do anything beneficial with that information, so I feel
keeping this field as-is is OK for now.

Thanks,
Naoya Horiguchi
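
For readers following along, the shared/private split described above is
driven by the page's mapcount inside smaps_account().  A condensed sketch of
the v4.2-era logic in fs/proc/task_mmu.c (not a verbatim copy; hugetlb pages
never reach this path, which is why the proposed HugetlbPages field stays
separate):

	/* Condensed sketch of smaps_account(): how regular and THP pages are
	 * split into the Shared/Private and Pss fields by mapcount.
	 */
	static void smaps_account(struct mem_size_stats *mss, struct page *page,
				  unsigned long size, bool young, bool dirty)
	{
		int mapcount = page_mapcount(page);

		if (PageAnon(page))
			mss->anonymous += size;

		if (mapcount >= 2) {		/* mapped by more than one process */
			u64 pss_delta = (u64)size << PSS_SHIFT;

			if (dirty || PageDirty(page))
				mss->shared_dirty += size;
			else
				mss->shared_clean += size;
			do_div(pss_delta, mapcount);
			mss->pss += pss_delta;
		} else {			/* mapped exactly once */
			if (dirty || PageDirty(page))
				mss->private_dirty += size;
			else
				mss->private_clean += size;
			mss->pss += (u64)size << PSS_SHIFT;
		}
	}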

^ permalink raw reply	[flat|nested] 71+ messages in thread

end of thread, other threads:[~2015-09-17  9:51 UTC | newest]

Thread overview: 71+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-07-27 23:26 hugetlb pages not accounted for in rss Mike Kravetz
2015-07-28 18:32 ` Jörn Engel
2015-07-28 21:15   ` Mike Kravetz
2015-07-28 22:15     ` David Rientjes
2015-07-28 22:26       ` Jörn Engel
2015-07-28 23:30         ` David Rientjes
2015-07-29  0:53           ` Jörn Engel
2015-07-29 19:08             ` David Rientjes
2015-07-29 23:20               ` Mike Kravetz
2015-07-30 21:34                 ` Jörn Engel
2015-07-31 21:09                   ` David Rientjes
2015-08-04  2:55                 ` Naoya Horiguchi
2015-08-04  5:13                   ` [PATCH] smaps: fill missing fields for vma(VM_HUGETLB) Naoya Horiguchi
2015-08-04 18:21                     ` Jörn Engel
2015-08-06  2:18                       ` David Rientjes
2015-08-06  7:44                         ` Naoya Horiguchi
2015-08-07  7:24                           ` [PATCH v2 0/2] hugetlb: display per-process/per-vma usage Naoya Horiguchi
2015-08-07  7:24                             ` [PATCH v2 2/2] mm: hugetlb: add VmHugetlbRSS: field in /proc/pid/status Naoya Horiguchi
2015-08-07 22:55                               ` Andrew Morton
2015-08-10  0:47                                 ` Naoya Horiguchi
2015-08-10  0:47                                   ` [PATCH v3 1/3] smaps: fill missing fields for vma(VM_HUGETLB) Naoya Horiguchi
2015-08-10  0:47                                   ` [PATCH v3 3/3] Documentation/filesystems/proc.txt: document hugetlb RSS Naoya Horiguchi
2015-08-11  0:44                                     ` David Rientjes
2015-08-12  0:03                                       ` Naoya Horiguchi
2015-08-12  7:45                                         ` [PATCH v4 1/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/smaps Naoya Horiguchi
2015-08-12  7:45                                           ` [PATCH v4 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status Naoya Horiguchi
2015-08-12 20:30                                             ` David Rientjes
2015-08-13  0:45                                               ` Naoya Horiguchi
2015-08-13 21:14                                                 ` Jörn Engel
2015-08-13 21:13                                               ` Jörn Engel
2015-08-17 21:28                                               ` David Rientjes
2015-08-12 20:25                                           ` [PATCH v4 1/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/smaps David Rientjes
2015-08-13 21:14                                             ` Jörn Engel
2015-08-20  8:26                                         ` [PATCH v5 0/2] hugetlb: display per-process/per-vma usage Naoya Horiguchi
2015-08-20  8:26                                           ` [PATCH v5 1/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/smaps Naoya Horiguchi
2015-08-20 10:49                                             ` Michal Hocko
2015-08-20 23:20                                               ` Naoya Horiguchi
2015-08-21  6:33                                                 ` Michal Hocko
2015-09-07  1:29                                             ` Pádraig Brady
2015-09-07  2:23                                               ` Naoya Horiguchi
2015-09-07  6:46                                                 ` Naoya Horiguchi
2015-09-07  9:52                                                   ` Pádraig Brady
2015-09-07 10:52                                                     ` Pádraig Brady
2015-09-17  9:39                                                       ` Naoya Horiguchi
2015-09-09 15:12                                             ` Vlastimil Babka
2015-09-09 22:14                                               ` David Rientjes
2015-08-20  8:26                                           ` [PATCH v5 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status Naoya Horiguchi
2015-08-20 11:00                                             ` Michal Hocko
2015-08-20 19:49                                               ` David Rientjes
2015-08-21  6:32                                                 ` Michal Hocko
2015-08-21 16:38                                                   ` Jörn Engel
2015-08-20 23:34                                               ` Naoya Horiguchi
2015-08-21  6:53                                                 ` Michal Hocko
2015-08-21 16:30                                                   ` Jörn Engel
2015-08-24  8:51                                                     ` Michal Hocko
2015-08-25 23:23                                                       ` David Rientjes
2015-08-26  6:38                                                         ` Michal Hocko
2015-08-26 22:02                                                           ` David Rientjes
2015-08-27  6:48                                                             ` Michal Hocko
2015-08-27 17:23                                                               ` Jörn Engel
2015-08-27 20:44                                                                 ` David Rientjes
2015-08-31  9:12                                                                 ` Michal Hocko
2015-09-16  0:21                                         ` [PATCH v1] mm: migrate: hugetlb: putback destination hugepage to active list Naoya Horiguchi
2015-09-16  2:53                                           ` Naoya Horiguchi
2015-08-10  0:47                                   ` [PATCH v3 2/3] mm: hugetlb: add VmHugetlbRSS: field in /proc/pid/status Naoya Horiguchi
2015-08-10  1:16                                     ` Naoya Horiguchi
2015-08-07  7:24                             ` [PATCH v2 1/2] smaps: fill missing fields for vma(VM_HUGETLB) Naoya Horiguchi
2015-08-07 22:50                               ` Andrew Morton
2015-08-11  0:37                               ` David Rientjes
2015-08-11 23:32                                 ` Naoya Horiguchi
2015-08-11 23:48                                   ` David Rientjes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).