linux-kernel.vger.kernel.org archive mirror
* [PATCH 1/2] proc: report file/anon bit in /proc/pid/pagemap
       [not found] <4F91BC8A.9020503@parallels.com>
@ 2012-04-27 12:39 ` Konstantin Khlebnikov
  2012-04-27 13:37   ` Pavel Emelyanov
  2012-04-28 13:32   ` KOSAKI Motohiro
  2012-04-27 12:39 ` [PATCH 2/2] proc: report page->index instead of pfn for non-linear mappings " Konstantin Khlebnikov
  2012-04-30 10:48 ` [PATCH v3] proc: report file/anon bit " Konstantin Khlebnikov
  2 siblings, 2 replies; 9+ messages in thread
From: Konstantin Khlebnikov @ 2012-04-27 12:39 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Rik van Riel, Pavel Emelyanov

This is an implementation of Andrew's proposal to extend the pagemap file
bits to report what is missing about tasks' working set.

The problem with the working set detection is multilateral. In the criu
(checkpoint/restore) project we dump the tasks' memory into image files
and to do it properly we need to detect which pages inside mappings are
really in use. The mincore syscall, which I thought could help with this,
did not. First, it doesn't report swapped pages, thus we cannot find out
which parts of anonymous mappings to dump. Next, it does report pages from
page cache as present even if they are not mapped, and it doesn't
distinguish between private pages that have been cow-ed and private pages
that have not.

Note that the issue with swap pages is critical -- we must dump swap pages
to the image file. But the issues with file pages are an optimization -- we
can take all file pages into the image, and this would be correct, but if we
know that a page is not mapped or not cow-ed, we can remove it from the dump
file. The dump would still be self-consistent, though significantly smaller
in size (up to 10 times smaller on real apps).

Andrew noticed that the proc pagemap file solves 2 of the 3 issues above --
it reports whether a page is present or swapped and it doesn't report
unmapped page cache pages. But it doesn't distinguish cow-ed file pages
from non-cow-ed ones.

I would like to make the last unused bit in this file report whether the
page mapped into the respective pte is PageAnon or not.
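
To illustrate how a dumper could use the proposed bit, here is a minimal
userspace sketch. The PME_* names and page_needs_dump() are hypothetical and
are not taken from criu or the kernel. It assumes a private mapping whose
unmodified file pages can be restored from the file itself:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for the three status bits of a pagemap entry. */
#define PME_PRESENT (UINT64_C(1) << 63)
#define PME_SWAPPED (UINT64_C(1) << 62)
#define PME_FILE    (UINT64_C(1) << 61)	/* the bit proposed here */

/*
 * Decide whether a page of a private mapping has to go into the image.
 * Assumption: a file-backed page that was never cow-ed can be re-read
 * from the file on restore, so only anonymous contents need saving.
 */
static bool page_needs_dump(uint64_t pme)
{
	if (pme & PME_SWAPPED)
		return true;		/* swapped-out data must be saved */
	if (!(pme & PME_PRESENT))
		return false;		/* nothing mapped at this address */
	return !(pme & PME_FILE);	/* present: keep only anon (cow-ed) pages */
}
```

Treating every swapped entry as "dump" is deliberately conservative: it may
save a page that could have been skipped, but it never drops data.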

v2:
* Rebase to up-to-date kernel
* Fix file/anon bit reporting for migration entries
* Fix frame bits interval comment: it uses the 55 lower bits (64 - 3 - 6)
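
The arithmetic behind the last item can be checked mechanically. This is a
sketch in userspace C; the PME_* macros and pme_*() helpers are names made up
for illustration and are not the kernel's PM_* macros:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace names mirroring the documented layout. */
#define PME_STATUS_BITS 3	/* bits 61-63: file, swapped, present */
#define PME_PSHIFT_BITS 6	/* bits 55-60: page shift */
#define PME_FRAME_BITS  (64 - PME_STATUS_BITS - PME_PSHIFT_BITS)

/* 64 - 3 - 6 = 55, so the frame field is bits 0-54, not 0-55. */
static_assert(PME_FRAME_BITS == 55, "frame field covers bits 0-54");

#define PME_FRAME_MASK  ((UINT64_C(1) << PME_FRAME_BITS) - 1)
#define PME_PSHIFT_MASK ((UINT64_C(1) << PME_PSHIFT_BITS) - 1)

/* PFN, or swap type plus offset, depending on the status bits. */
static inline uint64_t pme_frame(uint64_t e)
{
	return e & PME_FRAME_MASK;
}

/* Page shift field; page size is 1 << pme_pshift(e). */
static inline unsigned pme_pshift(uint64_t e)
{
	return (unsigned)((e >> PME_FRAME_BITS) & PME_PSHIFT_MASK);
}

/* Status triple: 4 = present, 2 = swapped, 1 = file. */
static inline unsigned pme_status(uint64_t e)
{
	return (unsigned)(e >> (64 - PME_STATUS_BITS));
}
```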

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
---
 Documentation/vm/pagemap.txt |    2 +-
 fs/proc/task_mmu.c           |   44 +++++++++++++++++++++++++++---------------
 2 files changed, 29 insertions(+), 17 deletions(-)

diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index 4600cbe..7587493 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -16,7 +16,7 @@ There are three components to pagemap:
     * Bits 0-4   swap type if swapped
     * Bits 5-54  swap offset if swapped
     * Bits 55-60 page shift (page size = 1<<page shift)
-    * Bit  61    reserved for future use
+    * Bit  61    page is file-page or shared-anon
     * Bit  62    page swapped
     * Bit  63    page present
 
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 2d60492..bc3df31 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -700,6 +700,7 @@ struct pagemapread {
 
 #define PM_PRESENT          PM_STATUS(4LL)
 #define PM_SWAP             PM_STATUS(2LL)
+#define PM_FILE             PM_STATUS(1LL)
 #define PM_NOT_PRESENT      PM_PSHIFT(PAGE_SHIFT)
 #define PM_END_OF_BUFFER    1
 
@@ -733,20 +734,31 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,
 	return err;
 }
 
-static u64 swap_pte_to_pagemap_entry(pte_t pte)
+static void pte_to_pagemap_entry(pagemap_entry_t *pme,
+		struct vm_area_struct *vma, unsigned long addr, pte_t pte)
 {
-	swp_entry_t e = pte_to_swp_entry(pte);
-	return swp_type(e) | (swp_offset(e) << MAX_SWAPFILES_SHIFT);
-}
+	u64 frame, flags;
+	struct page *page = NULL;
+
+	if (pte_present(pte)) {
+		frame = pte_pfn(pte);
+		flags = PM_PRESENT;
+		page = vm_normal_page(vma, addr, pte);
+	} if (is_swap_pte(pte)) {
+		swp_entry_t entry = pte_to_swp_entry(pte);
+
+		frame = swp_type(entry) |
+			(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
+		flags = PM_SWAP;
+		if (is_migration_entry(entry))
+			page = migration_entry_to_page(entry);
+	} else
+		return;
 
-static void pte_to_pagemap_entry(pagemap_entry_t *pme, pte_t pte)
-{
-	if (is_swap_pte(pte))
-		*pme = make_pme(PM_PFRAME(swap_pte_to_pagemap_entry(pte))
-				| PM_PSHIFT(PAGE_SHIFT) | PM_SWAP);
-	else if (pte_present(pte))
-		*pme = make_pme(PM_PFRAME(pte_pfn(pte))
-				| PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT);
+	if (page && !PageAnon(page))
+		flags |= PM_FILE;
+
+	*pme = make_pme(PM_PFRAME(frame) | PM_PSHIFT(PAGE_SHIFT) | flags);
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
@@ -809,7 +821,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 		if (vma && (vma->vm_start <= addr) &&
 		    !is_vm_hugetlb_page(vma)) {
 			pte = pte_offset_map(pmd, addr);
-			pte_to_pagemap_entry(&pme, *pte);
+			pte_to_pagemap_entry(&pme, vma, addr, *pte);
 			/* unmap before userspace copy */
 			pte_unmap(pte);
 		}
@@ -861,11 +873,11 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask,
  * For each page in the address space, this file contains one 64-bit entry
  * consisting of the following:
  *
- * Bits 0-55  page frame number (PFN) if present
+ * Bits 0-54  page frame number (PFN) if present
  * Bits 0-4   swap type if swapped
- * Bits 5-55  swap offset if swapped
+ * Bits 5-54  swap offset if swapped
  * Bits 55-60 page shift (page size = 1<<page shift)
- * Bit  61    reserved for future use
+ * Bit  61    page is file-page or shared-anon
  * Bit  62    page swapped
  * Bit  63    page present
  *



* [PATCH 2/2] proc: report page->index instead of pfn for non-linear mappings in /proc/pid/pagemap
       [not found] <4F91BC8A.9020503@parallels.com>
  2012-04-27 12:39 ` [PATCH 1/2] proc: report file/anon bit in /proc/pid/pagemap Konstantin Khlebnikov
@ 2012-04-27 12:39 ` Konstantin Khlebnikov
  2012-04-27 13:37   ` Pavel Emelyanov
  2012-04-30 10:48 ` [PATCH v3] proc: report file/anon bit " Konstantin Khlebnikov
  2 siblings, 1 reply; 9+ messages in thread
From: Konstantin Khlebnikov @ 2012-04-27 12:39 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Rik van Riel, Pavel Emelyanov

Currently there is no way to find out the current layout of a non-linear mapping.
There is also no way to distinguish an ordinary file mapping from a non-linear one.

With this patch, a non-linear pte can be recognized in pagemap as present swapped
file-backed, or as non-present non-swapped file-backed for a non-present
non-linear file-pte:

    present swapped file    data        description
    0       0       0       null        non-present
    0       0       1       page-index  non-linear file-pte
    0       1       0       swap-entry  anon-page in swap, migration or hwpoison
    0       1       1       swap-entry  file-page in migration or hwpoison
    1       0       0       page-pfn    present private-anon or special page
    1       0       1       page-pfn    present file or shared-anon page
    1       1       0       none        impossible combination
    1       1       1       page-index  non-linear file-page

[ the last unused combination 1-1-0 can be used for special pages, if anyone wants this ]
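
Because bits 63/62/61 hold present/swapped/file in that order, the table above
can be indexed directly by the top three bits of an entry. A minimal userspace
sketch of the encoding proposed in this patch; pme_describe() is a
hypothetical helper, not part of the patch:

```c
#include <stdint.h>

/*
 * Describe an entry under the encoding proposed in this patch.
 * The index is (present << 2) | (swapped << 1) | file, i.e. pme >> 61.
 */
static const char *pme_describe(uint64_t pme)
{
	static const char *const desc[8] = {
		[0] = "non-present",
		[1] = "non-linear file-pte (data: page index)",
		[2] = "anon page in swap, migration or hwpoison (data: swap entry)",
		[3] = "file page in migration or hwpoison (data: swap entry)",
		[4] = "present private-anon or special page (data: pfn)",
		[5] = "present file or shared-anon page (data: pfn)",
		[6] = "impossible combination",
		[7] = "non-linear file page (data: page index)",
	};

	return desc[pme >> 61];
}
```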

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
---
 Documentation/vm/pagemap.txt |   15 +++++++++++++++
 fs/proc/task_mmu.c           |   13 +++++++++++--
 2 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index 7587493..6800dda 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -13,6 +13,7 @@ There are three components to pagemap:
    fs/proc/task_mmu.c, above pagemap_read):
 
     * Bits 0-54  page frame number (PFN) if present
+    * Bits 0-54  page index for non-linear mappings
     * Bits 0-4   swap type if swapped
     * Bits 5-54  swap offset if swapped
     * Bits 55-60 page shift (page size = 1<<page shift)
@@ -26,6 +27,20 @@ There are three components to pagemap:
    precisely which pages are mapped (or in swap) and comparing mapped
    pages between processes.
 
+   For non-linear file mappings page index is reported instead of PFN.
+   Non-linear pte can be recognized as present swapped file-backed or
+   non-present non-swapped file-backed.
+
+    present swapped file    data	description
+    0       0       0       null	non-present
+    0       0       1       page-index	non-linear file-pte
+    0       1       0       swap-entry	anon-page in swap, migration or hwpoison
+    0       1       1       swap-entry	file-page in migration or hwpoison
+    1       0       0       page-pfn	present private-anon or special page
+    1       0       1       page-pfn	present file or shared-anon page
+    1       1       0       none	impossible combination
+    1       1       1       page-index	non-linear file-page
+
    Efficient users of this interface will use /proc/pid/maps to
    determine which areas of memory are actually mapped and llseek to
    skip over unmapped regions.
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index bc3df31..fcc802f 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -744,6 +744,9 @@ static void pte_to_pagemap_entry(pagemap_entry_t *pme,
 		frame = pte_pfn(pte);
 		flags = PM_PRESENT;
 		page = vm_normal_page(vma, addr, pte);
+	} if (pte_file(pte)) {
+		frame = pte_to_pgoff(pte);
+		flags = PM_FILE;
 	} if (is_swap_pte(pte)) {
 		swp_entry_t entry = pte_to_swp_entry(pte);
 
@@ -755,8 +758,13 @@ static void pte_to_pagemap_entry(pagemap_entry_t *pme,
 	} else
 		return;
 
-	if (page && !PageAnon(page))
-		flags |= PM_FILE;
+	if (page) {
+		if (vma->vm_flags & VM_NONLINEAR) {
+			frame = page->index;
+			flags = PM_FILE | PM_SWAP | PM_PRESENT;
+		} else if (!PageAnon(page))
+			flags |= PM_FILE;
+	}
 
 	*pme = make_pme(PM_PFRAME(frame) | PM_PSHIFT(PAGE_SHIFT) | flags);
 }
@@ -874,6 +882,7 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask,
  * consisting of the following:
  *
  * Bits 0-54  page frame number (PFN) if present
+ * Bits 0-54  page index for non-linear mappings
  * Bits 0-4   swap type if swapped
  * Bits 5-54  swap offset if swapped
  * Bits 55-60 page shift (page size = 1<<page shift)



* Re: [PATCH 2/2] proc: report page->index instead of pfn for non-linear mappings in /proc/pid/pagemap
  2012-04-27 12:39 ` [PATCH 2/2] proc: report page->index instead of pfn for non-linear mappings " Konstantin Khlebnikov
@ 2012-04-27 13:37   ` Pavel Emelyanov
  0 siblings, 0 replies; 9+ messages in thread
From: Pavel Emelyanov @ 2012-04-27 13:37 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-mm, linux-kernel, Andrew Morton, Hugh Dickins, Rik van Riel

On 04/27/2012 04:39 PM, Konstantin Khlebnikov wrote:
> Currently there is no way to find out the current layout of a non-linear mapping.
> There is also no way to distinguish an ordinary file mapping from a non-linear one.
> 
> With this patch, a non-linear pte can be recognized in pagemap as present swapped
> file-backed, or as non-present non-swapped file-backed for a non-present
> non-linear file-pte:
> 
>     present swapped file    data        description
>     0       0       0       null        non-present
>     0       0       1       page-index  non-linear file-pte
>     0       1       0       swap-entry  anon-page in swap, migration or hwpoison
>     0       1       1       swap-entry  file-page in migration or hwpoison
>     1       0       0       page-pfn    present private-anon or special page
>     1       0       1       page-pfn    present file or shared-anon page
>     1       1       0       none        impossible combination
>     1       1       1       page-index  non-linear file-page
> 
> [ the last unused combination 1-1-0 can be used for special pages, if anyone wants this ]

This means that

a) Any application doing if (pme & PAGE_IS_XXX) checks will get ... broken
b) In order to determine that a mapping is non-linear we'll have to scan ALL
   of it and check. Currently in CRIU we just don't read the pagemap for shared
   file maps, but we would have to. This is far from optimal. I'd prefer having
   this linear/nonlinear info in /proc/pid/smaps or something like that.
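
To make point (a) concrete, here is a small userspace sketch. The PME_* names
and both helpers are hypothetical; they stand in for whatever single-bit test
an existing tool performs:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for the three status bits of a pagemap entry. */
#define PME_PRESENT (UINT64_C(1) << 63)
#define PME_SWAPPED (UINT64_C(1) << 62)
#define PME_FILE    (UINT64_C(1) << 61)

/* How an existing tool might test for swap: one bit, nothing else. */
static bool is_swapped_single_bit(uint64_t pme)
{
	return (pme & PME_SWAPPED) != 0;
}

/* What the tool would need under the 2/2 encoding: compare all three bits. */
static bool is_swapped_full_decode(uint64_t pme)
{
	uint64_t st = pme & (PME_PRESENT | PME_SWAPPED | PME_FILE);

	return st == PME_SWAPPED || st == (PME_SWAPPED | PME_FILE);
}
```

With the 1-1-1 combination used for non-linear file pages, the single-bit test
reports "swapped" for a page that is actually present, while the full decode
does not.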

> Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
> Cc: Pavel Emelyanov <xemul@parallels.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Rik van Riel <riel@redhat.com>


* Re: [PATCH 1/2] proc: report file/anon bit in /proc/pid/pagemap
  2012-04-27 12:39 ` [PATCH 1/2] proc: report file/anon bit in /proc/pid/pagemap Konstantin Khlebnikov
@ 2012-04-27 13:37   ` Pavel Emelyanov
  2012-04-27 23:40     ` Andrew Morton
  2012-04-28 13:32   ` KOSAKI Motohiro
  1 sibling, 1 reply; 9+ messages in thread
From: Pavel Emelyanov @ 2012-04-27 13:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Konstantin Khlebnikov, linux-mm, linux-kernel, Hugh Dickins,
	Rik van Riel

On 04/27/2012 04:39 PM, Konstantin Khlebnikov wrote:
> This is an implementation of Andrew's proposal to extend the pagemap file
> bits to report what is missing about tasks' working set.
> 
> The problem with the working set detection is multilateral. In the criu
> (checkpoint/restore) project we dump the tasks' memory into image files
> and to do it properly we need to detect which pages inside mappings are
> really in use. The mincore syscall, which I thought could help with this,
> did not. First, it doesn't report swapped pages, thus we cannot find out
> which parts of anonymous mappings to dump. Next, it does report pages from
> page cache as present even if they are not mapped, and it doesn't
> distinguish between private pages that have been cow-ed and private pages
> that have not.
> 
> Note that the issue with swap pages is critical -- we must dump swap pages
> to the image file. But the issues with file pages are an optimization -- we
> can take all file pages into the image, and this would be correct, but if we
> know that a page is not mapped or not cow-ed, we can remove it from the dump
> file. The dump would still be self-consistent, though significantly smaller
> in size (up to 10 times smaller on real apps).
> 
> Andrew noticed that the proc pagemap file solves 2 of the 3 issues above --
> it reports whether a page is present or swapped and it doesn't report
> unmapped page cache pages. But it doesn't distinguish cow-ed file pages
> from non-cow-ed ones.
> 
> I would like to make the last unused bit in this file report whether the
> page mapped into the respective pte is PageAnon or not.
> 
> v2:
> * Rebase to up-to-date kernel
> * Fix file/anon bit reporting for migration entries
> * Fix frame bits interval comment: it uses the 55 lower bits (64 - 3 - 6)
> 
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
> Cc: Pavel Emelyanov <xemul@parallels.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Rik van Riel <riel@redhat.com>

Acked-by: Pavel Emelyanov <xemul@parallels.com>



* Re: [PATCH 1/2] proc: report file/anon bit in /proc/pid/pagemap
  2012-04-27 13:37   ` Pavel Emelyanov
@ 2012-04-27 23:40     ` Andrew Morton
  0 siblings, 0 replies; 9+ messages in thread
From: Andrew Morton @ 2012-04-27 23:40 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Konstantin Khlebnikov, linux-mm, linux-kernel, Hugh Dickins,
	Rik van Riel, Matt Mackall

On Fri, 27 Apr 2012 17:37:34 +0400
Pavel Emelyanov <xemul@parallels.com> wrote:

> > Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
> > Cc: Pavel Emelyanov <xemul@parallels.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Hugh Dickins <hughd@google.com>
> > Cc: Rik van Riel <riel@redhat.com>
> 
> Acked-by: Pavel Emelyanov <xemul@parallels.com>

hm, I'd have thought this should be From:Pavel and certainly
Signed-off-by:Pavel, but I'll let you guys decide.

Rik acked the earlier version and that isn't reflected here.  I never
know what to do about this.  I usually play it safe and assume that a
change in the patch erases the ack.

Please cc the original pagemap author (Matt Mackall <mpm@selenic.com>)
on these patches.  He's sometimes useful ;)

The patches looked nice to me, but as it appears that Pavel is unhappy
with [2/2] I shall tip this patchset into my bitbucket and shall await
the next rev, thanks.


* Re: [PATCH 1/2] proc: report file/anon bit in /proc/pid/pagemap
  2012-04-27 12:39 ` [PATCH 1/2] proc: report file/anon bit in /proc/pid/pagemap Konstantin Khlebnikov
  2012-04-27 13:37   ` Pavel Emelyanov
@ 2012-04-28 13:32   ` KOSAKI Motohiro
  2012-04-29  8:28     ` Konstantin Khlebnikov
  1 sibling, 1 reply; 9+ messages in thread
From: KOSAKI Motohiro @ 2012-04-28 13:32 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-mm, linux-kernel, Andrew Morton, Hugh Dickins,
	Rik van Riel, Pavel Emelyanov

> diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
> index 4600cbe..7587493 100644
> --- a/Documentation/vm/pagemap.txt
> +++ b/Documentation/vm/pagemap.txt
> @@ -16,7 +16,7 @@ There are three components to pagemap:
>     * Bits 0-4   swap type if swapped
>     * Bits 5-54  swap offset if swapped
>     * Bits 55-60 page shift (page size = 1<<page shift)
> -    * Bit  61    reserved for future use
> +    * Bit  61    page is file-page or shared-anon
>     * Bit  62    page swapped
>     * Bit  63    page present

Hmm..
This says file or shmem.


> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 2d60492..bc3df31 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -700,6 +700,7 @@ struct pagemapread {
>
>  #define PM_PRESENT          PM_STATUS(4LL)
>  #define PM_SWAP             PM_STATUS(2LL)
> +#define PM_FILE             PM_STATUS(1LL)
>  #define PM_NOT_PRESENT      PM_PSHIFT(PAGE_SHIFT)
>  #define PM_END_OF_BUFFER    1

But this macro says it's file. It seems a bit misleading. ;-)


* Re: [PATCH 1/2] proc: report file/anon bit in /proc/pid/pagemap
  2012-04-28 13:32   ` KOSAKI Motohiro
@ 2012-04-29  8:28     ` Konstantin Khlebnikov
  0 siblings, 0 replies; 9+ messages in thread
From: Konstantin Khlebnikov @ 2012-04-29  8:28 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-mm, linux-kernel, Andrew Morton, Hugh Dickins,
	Rik van Riel, Pavel Emelianov

KOSAKI Motohiro wrote:
>> diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
>> index 4600cbe..7587493 100644
>> --- a/Documentation/vm/pagemap.txt
>> +++ b/Documentation/vm/pagemap.txt
>> @@ -16,7 +16,7 @@ There are three components to pagemap:
>>      * Bits 0-4   swap type if swapped
>>      * Bits 5-54  swap offset if swapped
>>      * Bits 55-60 page shift (page size = 1<<page shift)
>> -    * Bit  61    reserved for future use
>> +    * Bit  61    page is file-page or shared-anon
>>      * Bit  62    page swapped
>>      * Bit  63    page present
>
> Hmm..
> This says file or shmem.
>
>
>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> index 2d60492..bc3df31 100644
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -700,6 +700,7 @@ struct pagemapread {
>>
>>   #define PM_PRESENT          PM_STATUS(4LL)
>>   #define PM_SWAP             PM_STATUS(2LL)
>> +#define PM_FILE             PM_STATUS(1LL)
>>   #define PM_NOT_PRESENT      PM_PSHIFT(PAGE_SHIFT)
>>   #define PM_END_OF_BUFFER    1
>
> But this macro says it's file. It seems a bit misleading. ;-)

Well... you know, shmem/shared-anon actually lives on tmpfs, so these really are file pages.




* [PATCH v3] proc: report file/anon bit in /proc/pid/pagemap
       [not found] <4F91BC8A.9020503@parallels.com>
  2012-04-27 12:39 ` [PATCH 1/2] proc: report file/anon bit in /proc/pid/pagemap Konstantin Khlebnikov
  2012-04-27 12:39 ` [PATCH 2/2] proc: report page->index instead of pfn for non-linear mappings " Konstantin Khlebnikov
@ 2012-04-30 10:48 ` Konstantin Khlebnikov
  2012-04-30 20:32   ` KOSAKI Motohiro
  2 siblings, 1 reply; 9+ messages in thread
From: Konstantin Khlebnikov @ 2012-04-30 10:48 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Rik van Riel, Pavel Emelyanov, Hugh Dickins, Matt Mackall,
	KOSAKI Motohiro, Andrew Morton

This is an implementation of Andrew's proposal to extend the pagemap file
bits to report what is missing about tasks' working set.

The problem with the working set detection is multilateral. In the criu
(checkpoint/restore) project we dump the tasks' memory into image files
and to do it properly we need to detect which pages inside mappings are
really in use. The mincore syscall, which I thought could help with this,
did not. First, it doesn't report swapped pages, thus we cannot find out
which parts of anonymous mappings to dump. Next, it does report pages from
page cache as present even if they are not mapped, and it doesn't
distinguish between private pages that have been cow-ed and private pages
that have not.

Note that the issue with swap pages is critical -- we must dump swap pages
to the image file. But the issues with file pages are an optimization -- we
can take all file pages into the image, and this would be correct, but if we
know that a page is not mapped or not cow-ed, we can remove it from the dump
file. The dump would still be self-consistent, though significantly smaller
in size (up to 10 times smaller on real apps).

Andrew noticed that the proc pagemap file solves 2 of the 3 issues above --
it reports whether a page is present or swapped and it doesn't report
unmapped page cache pages. But it doesn't distinguish cow-ed file pages
from non-cow-ed ones.

I would like to make the last unused bit in this file report whether the
page mapped into the respective pte is PageAnon or not.

[comment stolen from Pavel Emelyanov's v1 patch]

v2:
* Rebase to up-to-date kernel
* Fix file/anon bit reporting for migration entries
* Fix frame bits interval comment: it uses the 55 lower bits (64 - 3 - 6)

v3:
* fix stupid misprint s/if/else if/
* rebase on top of "[PATCH bugfix] proc/pagemap: correctly report non-present
  ptes and holes between vmas"
* second patch (with indexes for nonlinear mappings) was dropped.
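
For readers wondering why the s/if/else if/ misprint mattered: in v2 the swap
test started a new if statement, so its trailing else also caught present
ptes and returned before the entry was filled in. A simplified model of the
two control flows; the enum and function names are invented for illustration,
and the values 4 and 2 mirror the PM_PRESENT and PM_SWAP status codes:

```c
enum pte_kind { PTE_NONE, PTE_PRESENT, PTE_SWAP };

/* v2 shape: "} if" begins a second, independent if/else statement. */
static int v2_flags(enum pte_kind pte)
{
	int flags = 0;

	if (pte == PTE_PRESENT) {
		flags = 4;		/* PM_PRESENT */
	} if (pte == PTE_SWAP) {
		flags = 2;		/* PM_SWAP */
	} else
		return -1;		/* present ptes land here too */
	return flags;
}

/* v3 shape: a single if / else if / else chain. */
static int v3_flags(enum pte_kind pte)
{
	int flags = 0;

	if (pte == PTE_PRESENT) {
		flags = 4;		/* PM_PRESENT */
	} else if (pte == PTE_SWAP) {
		flags = 2;		/* PM_SWAP */
	} else
		return -1;		/* neither present nor swap */
	return flags;
}
```

Here -1 stands for "returned early, entry not written": the v2 shape gives
that result for a present pte, the v3 shape gives 4.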

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
---
 Documentation/vm/pagemap.txt |    2 +-
 fs/proc/task_mmu.c           |   48 ++++++++++++++++++++++++++----------------
 2 files changed, 31 insertions(+), 19 deletions(-)

diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index 4600cbe..7587493 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -16,7 +16,7 @@ There are three components to pagemap:
     * Bits 0-4   swap type if swapped
     * Bits 5-54  swap offset if swapped
     * Bits 55-60 page shift (page size = 1<<page shift)
-    * Bit  61    reserved for future use
+    * Bit  61    page is file-page or shared-anon
     * Bit  62    page swapped
     * Bit  63    page present
 
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 9f9c033..b073971 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -700,6 +700,7 @@ struct pagemapread {
 
 #define PM_PRESENT          PM_STATUS(4LL)
 #define PM_SWAP             PM_STATUS(2LL)
+#define PM_FILE             PM_STATUS(1LL)
 #define PM_NOT_PRESENT      PM_PSHIFT(PAGE_SHIFT)
 #define PM_END_OF_BUFFER    1
 
@@ -733,22 +734,33 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,
 	return err;
 }
 
-static u64 swap_pte_to_pagemap_entry(pte_t pte)
+static void pte_to_pagemap_entry(pagemap_entry_t *pme,
+		struct vm_area_struct *vma, unsigned long addr, pte_t pte)
 {
-	swp_entry_t e = pte_to_swp_entry(pte);
-	return swp_type(e) | (swp_offset(e) << MAX_SWAPFILES_SHIFT);
-}
-
-static void pte_to_pagemap_entry(pagemap_entry_t *pme, pte_t pte)
-{
-	if (is_swap_pte(pte))
-		*pme = make_pme(PM_PFRAME(swap_pte_to_pagemap_entry(pte))
-				| PM_PSHIFT(PAGE_SHIFT) | PM_SWAP);
-	else if (pte_present(pte))
-		*pme = make_pme(PM_PFRAME(pte_pfn(pte))
-				| PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT);
-	else
+	u64 frame, flags;
+	struct page *page = NULL;
+
+	if (pte_present(pte)) {
+		frame = pte_pfn(pte);
+		flags = PM_PRESENT;
+		page = vm_normal_page(vma, addr, pte);
+	} else if (is_swap_pte(pte)) {
+		swp_entry_t entry = pte_to_swp_entry(pte);
+
+		frame = swp_type(entry) |
+			(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
+		flags = PM_SWAP;
+		if (is_migration_entry(entry))
+			page = migration_entry_to_page(entry);
+	} else {
 		*pme = make_pme(PM_NOT_PRESENT);
+		return;
+	}
+
+	if (page && !PageAnon(page))
+		flags |= PM_FILE;
+
+	*pme = make_pme(PM_PFRAME(frame) | PM_PSHIFT(PAGE_SHIFT) | flags);
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
@@ -815,7 +827,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 		if (vma && (vma->vm_start <= addr) &&
 		    !is_vm_hugetlb_page(vma)) {
 			pte = pte_offset_map(pmd, addr);
-			pte_to_pagemap_entry(&pme, *pte);
+			pte_to_pagemap_entry(&pme, vma, addr, *pte);
 			/* unmap before userspace copy */
 			pte_unmap(pte);
 		}
@@ -869,11 +881,11 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask,
  * For each page in the address space, this file contains one 64-bit entry
  * consisting of the following:
  *
- * Bits 0-55  page frame number (PFN) if present
+ * Bits 0-54  page frame number (PFN) if present
  * Bits 0-4   swap type if swapped
- * Bits 5-55  swap offset if swapped
+ * Bits 5-54  swap offset if swapped
  * Bits 55-60 page shift (page size = 1<<page shift)
- * Bit  61    reserved for future use
+ * Bit  61    page is file-page or shared-anon
  * Bit  62    page swapped
  * Bit  63    page present
  *



* Re: [PATCH v3] proc: report file/anon bit in /proc/pid/pagemap
  2012-04-30 10:48 ` [PATCH v3] proc: report file/anon bit " Konstantin Khlebnikov
@ 2012-04-30 20:32   ` KOSAKI Motohiro
  0 siblings, 0 replies; 9+ messages in thread
From: KOSAKI Motohiro @ 2012-04-30 20:32 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-mm, linux-kernel, Rik van Riel, Pavel Emelyanov,
	Hugh Dickins, Matt Mackall, Andrew Morton

On Mon, Apr 30, 2012 at 6:48 AM, Konstantin Khlebnikov
<khlebnikov@openvz.org> wrote:
> This is an implementation of Andrew's proposal to extend the pagemap file
> bits to report what is missing about tasks' working set.
>
> The problem with the working set detection is multilateral. In the criu
> (checkpoint/restore) project we dump the tasks' memory into image files
> and to do it properly we need to detect which pages inside mappings are
> really in use. The mincore syscall, which I thought could help with this,
> did not. First, it doesn't report swapped pages, thus we cannot find out
> which parts of anonymous mappings to dump. Next, it does report pages from
> page cache as present even if they are not mapped, and it doesn't
> distinguish between private pages that have been cow-ed and private pages
> that have not.
>
> Note that the issue with swap pages is critical -- we must dump swap pages
> to the image file. But the issues with file pages are an optimization -- we
> can take all file pages into the image, and this would be correct, but if we
> know that a page is not mapped or not cow-ed, we can remove it from the dump
> file. The dump would still be self-consistent, though significantly smaller
> in size (up to 10 times smaller on real apps).
>
> Andrew noticed that the proc pagemap file solves 2 of the 3 issues above --
> it reports whether a page is present or swapped and it doesn't report
> unmapped page cache pages. But it doesn't distinguish cow-ed file pages
> from non-cow-ed ones.
>
> I would like to make the last unused bit in this file report whether the
> page mapped into the respective pte is PageAnon or not.
>
> [comment stolen from Pavel Emelyanov's v1 patch]
>
> v2:
> * Rebase to up-to-date kernel
> * Fix file/anon bit reporting for migration entries
> * Fix frame bits interval comment: it uses the 55 lower bits (64 - 3 - 6)
>
> v3:
> * fix stupid misprint s/if/else if/
> * rebase on top of "[PATCH bugfix] proc/pagemap: correctly report non-present
>   ptes and holes between vmas"
> * second patch (with indexes for nonlinear mappings) was dropped.
>
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
> Cc: Pavel Emelyanov <xemul@parallels.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Matt Mackall <mpm@selenic.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>

I don't like exporting a naive kernel internal, but unfortunately I
have no alternative idea.
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>


Thread overview: 9+ messages
     [not found] <4F91BC8A.9020503@parallels.com>
2012-04-27 12:39 ` [PATCH 1/2] proc: report file/anon bit in /proc/pid/pagemap Konstantin Khlebnikov
2012-04-27 13:37   ` Pavel Emelyanov
2012-04-27 23:40     ` Andrew Morton
2012-04-28 13:32   ` KOSAKI Motohiro
2012-04-29  8:28     ` Konstantin Khlebnikov
2012-04-27 12:39 ` [PATCH 2/2] proc: report page->index instead of pfn for non-linear mappings " Konstantin Khlebnikov
2012-04-27 13:37   ` Pavel Emelyanov
2012-04-30 10:48 ` [PATCH v3] proc: report file/anon bit " Konstantin Khlebnikov
2012-04-30 20:32   ` KOSAKI Motohiro
