* [PATCH -mm v8 0/7] idle memory tracking
From: Vladimir Davydov @ 2015-07-15 13:54 UTC
  To: Andrew Morton
  Cc: Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T,
	Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse,
	David Rientjes, Pavel Emelyanov, Cyrill Gorcunov,
	Jonathan Corbet, linux-api, linux-doc, linux-mm, cgroups,
	linux-kernel

Hi,

This patch set introduces a new user API for tracking user memory pages
that have not been used for a given period of time. The purpose is to
provide userspace with a means of tracking a workload's working set,
i.e. the set of pages the workload actively uses. Knowing the working
set size can be useful for partitioning the system more efficiently,
e.g. by tuning memory cgroup limits appropriately, or for job placement
within a compute cluster.

It is based on top of v4.2-rc1-mmotm-2015-07-06-16-25.

---- USE CASES ----

The unified cgroup hierarchy has memory.low and memory.high knobs, which
are defined as the low and high boundaries for the workload's working
set size. However, the working set size of a workload may be unknown or
may change over time. With this patch set, one can periodically estimate
the amount of memory unused by each cgroup and tune its memory.low and
memory.high parameters accordingly, thereby optimizing overall memory
utilization.
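
For instance, a minimal sketch of the tuning step. It assumes the
unified (v2) hierarchy is mounted at /sys/fs/cgroup and exposes the
memory.current and memory.high files, that an idle-byte estimate has
already been obtained (e.g. with the script attached below), and an
arbitrary 10% safety margin; the helper name is illustrative:

    import os

    def tune_memory_high(cgroup_path, idle_bytes, margin=0.1):
        # Read current usage and lower memory.high by the estimated idle
        # amount, keeping a safety margin above the working set.
        with open(os.path.join(cgroup_path, "memory.current")) as f:
            current = int(f.read())
        new_high = max(current - int(idle_bytes * (1 - margin)), 0)
        with open(os.path.join(cgroup_path, "memory.high"), "w") as f:
            f.write(str(new_high))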

Another use case is balancing workloads within a compute cluster.
Knowing how much memory is not really used by a workload unit may help
make a better decision when considering migrating the unit to another
node within the cluster.

Also, as noted by Minchan, this would be useful for per-process reclaim
(https://lwn.net/Articles/545668/): with idle tracking, a smart
userspace memory manager could reclaim only the pages that are actually
idle.

---- USER API ----

The user API consists of two new proc files:

 * /proc/kpageidle.  This file implements a bitmap where each bit corresponds
   to a page, indexed by PFN. When the bit is set, the corresponding page is
   idle. A page is considered idle if it has not been accessed since it was
   marked idle. To mark a page idle, one should set the bit corresponding to
   the page by writing to the file. A value written to the file is OR-ed with
   the current bitmap value. Only user memory pages can be marked idle; for
   other page types the input is silently ignored. Writing to this file
   beyond the maximum PFN results in ENXIO. Only available when
   CONFIG_IDLE_PAGE_TRACKING is set.

   This file can be used to estimate the number of pages that are not
   used by a particular workload as follows (see the sketch below):

   1. mark all pages of interest idle by setting corresponding bits in the
      /proc/kpageidle bitmap
   2. wait until the workload accesses its working set
   3. read /proc/kpageidle and count the number of bits set
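
   For instance, a minimal sketch of marking a single page idle and
   later testing it. It assumes each 8-byte word of the bitmap covers 64
   pages in native u64 bit order and that seeking within the file works;
   the helper names are illustrative:

       import struct

       def mark_page_idle(pfn):
           with open("/proc/kpageidle", "r+b") as f:
               f.seek((pfn // 64) * 8)
               # Writes are OR-ed in, so only the target bit needs setting.
               f.write(struct.pack("Q", 1 << (pfn % 64)))

       def page_is_idle(pfn):
           with open("/proc/kpageidle", "rb") as f:
               f.seek((pfn // 64) * 8)
               word, = struct.unpack("Q", f.read(8))
           return bool(word & (1 << (pfn % 64)))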

 * /proc/kpagecgroup.  This file contains the 64-bit inode number of the
   memory cgroup each page is charged to, indexed by PFN. Only available when
   CONFIG_MEMCG is set.

   This file can be used to find all pages (including unmapped file
   pages) accounted to a particular cgroup. Using /proc/kpageidle, one
   can then estimate the cgroup working set size.

For an example of using these files to estimate the number of unused
memory pages per memory cgroup, please see the script attached below.
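
For a per-process estimate, one could instead walk /proc/PID/pagemap to
obtain the PFNs backing a process's address space and check or set only
the corresponding bits in /proc/kpageidle (e.g. with the helpers
sketched above). A rough sketch, assuming the usual pagemap entry
layout (bit 63 = present, bits 0-54 = PFN); the helper name is
illustrative:

    import struct

    def process_pfns(pid, start_vaddr, end_vaddr, page_size=4096):
        # Yield the PFN backing each present page in [start_vaddr, end_vaddr).
        with open("/proc/%d/pagemap" % pid, "rb") as f:
            vaddr = start_vaddr
            while vaddr < end_vaddr:
                f.seek(vaddr // page_size * 8)
                entry, = struct.unpack("Q", f.read(8))
                if entry & (1 << 63):              # bit 63: page present
                    yield entry & ((1 << 55) - 1)  # bits 0-54: PFN
                vaddr += page_size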

---- REASONING ----

The reason for introducing a new user API instead of using
/proc/PID/{clear_refs,smaps} is that the latter has two serious
drawbacks:

 - it does not count unmapped file pages
 - it affects the reclaimer logic

The new API attempts to overcome them both. For more details on how it
is achieved, please see the comment to patch 5.

---- CHANGE LOG ----

Changes in v8:

 - clear the referenced/accessed bit in secondary PTEs while accessing
   /proc/kpageidle; this is required to estimate the working set size of
   KVM VMs (Andres)
 - check the young flag when collapsing a huge page
 - copy idle/young flags on page migration

Changes in v7:

This iteration addresses Andres's comments to v6:

 - do not reuse page_referenced for clearing idle flag, introduce a
   separate function instead; this way we won't issue expensive tlb
   flushes on /proc/kpageidle read/write
 - propagate young/idle flags from head to tail pages on thp split
 - skip compound tail pages while reading/writing /proc/kpageidle
 - cleanup page_referenced_one

Changes in v6:

 - Split the patch introducing page_cgroup_ino helper to ease review.
 - Rebase on top of v4.1-rc7-mmotm-2015-06-09-16-55

Changes in v5:

 - Fix possible race between kpageidle_clear_pte_refs() and
   __page_set_anon_rmap() by checking that a page is on an LRU list
   under zone->lru_lock (Minchan).
 - Export idle flag via /proc/kpageflags (Minchan).
 - Rebase on top of 4.1-rc3.

Changes in v4:

This iteration primarily addresses Minchan's comments to v3:

 - Implement /proc/kpageidle as a bitmap instead of using a u64 per page,
   because there do not seem to be any future uses for the other 63 bits.
 - Do not double-increase pra->referenced in page_referenced_one() if the page
   was young and referenced recently.
 - Remove the pointless (page_count == 0) check from kpageidle_get_page().
 - Rename kpageidle_clear_refs() to kpageidle_clear_pte_refs().
 - Improve comments to kpageidle-related functions.
 - Rebase on top of 4.1-rc2.

Note that it does not address Minchan's concern about a possible
__page_set_anon_rmap vs page_referenced race (see
https://lkml.org/lkml/2015/5/3/220), since it is still unclear whether
this race can really happen (see https://lkml.org/lkml/2015/5/4/160).

Changes in v3:

 - Enable CONFIG_IDLE_PAGE_TRACKING for 32 bit. Since this feature
   requires two extra page flags and there is no space for them on 32
   bit, page ext is used (thanks to Minchan Kim).
 - Minor code cleanups and comments improved.
 - Rebase on top of 4.1-rc1.

Changes in v2:

 - The main difference from v1 is the API change. In v1 the user could
   only set the idle flag for all pages at once, and /proc/PID/clear_refs
   had to be used for clearing the Idle flag on pages accessed via page
   tables.
   The main drawback of the v1 approach, as noted by Minchan, is that on
   big machines setting the idle flag for each page can result in CPU
   bursts, which would be especially frustrating if the user only wanted
   to estimate the number of idle pages for a particular process or VMA.
   With the new API a more fine-grained approach is possible: one can
   read a process's /proc/PID/pagemap and set/check the Idle flag only
   for those pages of the process's address space that are of interest.
   Another good point about the v2 API is that it is possible to limit
   the /proc/kpage* scanning rate when the user wants to estimate the
   total number of idle pages, which is unachievable with the v1 approach.
 - Make /proc/kpagecgroup return the ino of the closest online ancestor
   in case the cgroup a page is charged to is offline.
 - Fix /proc/PID/clear_refs not clearing Young page flag.
 - Rebase on top of v4.0-rc6-mmotm-2015-04-01-14-54

v7: https://lkml.org/lkml/2015/7/11/119
v6: https://lkml.org/lkml/2015/6/12/301
v5: https://lkml.org/lkml/2015/5/12/449
v4: https://lkml.org/lkml/2015/5/7/580
v3: https://lkml.org/lkml/2015/4/28/224
v2: https://lkml.org/lkml/2015/4/7/260
v1: https://lkml.org/lkml/2015/3/18/794

---- PATCH SET STRUCTURE ----

The patch set is organized as follows:

 - patch 1 adds the page_cgroup_ino() helper for the sake of
   /proc/kpagecgroup, and patches 2-3 do related cleanup
 - patch 4 adds /proc/kpagecgroup, which reports the inode number of the
   cgroup each page is charged to
 - patch 5 introduces a new mmu notifier callback, clear_young, which is
   a lightweight version of clear_flush_young; it is used in patch 6
 - patch 6 implements the idle page tracking feature, including the
   userspace API, /proc/kpageidle
 - patch 7 exports the idle flag via /proc/kpageflags

---- SIMILAR WORKS ----

A patch for tracking idle memory was originally proposed back in 2011
by Michel Lespinasse (see http://lwn.net/Articles/459269/). The main
difference between Michel's patch and this one is that Michel
implemented a kernel-space daemon for estimating idle memory size per
cgroup, while this patch set only provides userspace with the minimal
API for doing the job, leaving the rest up to userspace. However, they
both share the same idea of using Idle/Young page flags to avoid
affecting the reclaimer logic.

---- SCRIPT FOR COUNTING IDLE PAGES PER CGROUP ----
#! /usr/bin/python
#

import os
import stat
import errno
import struct

CGROUP_MOUNT = "/sys/fs/cgroup/memory"
BUFSIZE = 8 * 1024  # must be multiple of 8


def get_hugepage_size():
    with open("/proc/meminfo", "r") as f:
        for s in f:
            k, v = s.split(":")
            if k == "Hugepagesize":
                return int(v.split()[0]) * 1024

PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")
HUGEPAGE_SIZE = get_hugepage_size()


def set_idle():
    # Set the idle bit for every page; writing past the last PFN fails
    # with ENXIO, which terminates the loop.
    f = open("/proc/kpageidle", "wb", BUFSIZE)
    while True:
        try:
            f.write(struct.pack("Q", pow(2, 64) - 1))
        except IOError as err:
            if err.errno == errno.ENXIO:
                break
            raise
    f.close()


def count_idle():
    f_flags = open("/proc/kpageflags", "rb", BUFSIZE)
    f_cgroup = open("/proc/kpagecgroup", "rb", BUFSIZE)

    with open("/proc/kpageidle", "rb", BUFSIZE) as f:
        while f.read(BUFSIZE): pass  # update idle flag

    idlememsz = {}
    while True:
        s1, s2 = f_flags.read(8), f_cgroup.read(8)
        if not s1 or not s2:
            break

        flags, = struct.unpack('Q', s1)
        cgino, = struct.unpack('Q', s2)

        unevictable = (flags >> 18) & 1  # KPF_UNEVICTABLE
        huge = (flags >> 22) & 1         # KPF_THP
        idle = (flags >> 25) & 1         # KPF_IDLE (added by this series)

        if idle and not unevictable:
            idlememsz[cgino] = idlememsz.get(cgino, 0) + \
                (HUGEPAGE_SIZE if huge else PAGE_SIZE)

    f_flags.close()
    f_cgroup.close()
    return idlememsz


if __name__ == "__main__":
    print "Setting the idle flag for each page..."
    set_idle()

    raw_input("Wait until the workload accesses its working set, "
              "then press Enter")

    print "Counting idle pages..."
    idlememsz = count_idle()

    for dir, subdirs, files in os.walk(CGROUP_MOUNT):
        ino = os.stat(dir)[stat.ST_INO]
        print dir + ": " + str(idlememsz.get(ino, 0) / 1024) + " kB"
---- END SCRIPT ----

Comments are more than welcome.

Thanks,

Vladimir Davydov (7):
  memcg: add page_cgroup_ino helper
  hwpoison: use page_cgroup_ino for filtering by memcg
  memcg: zap try_get_mem_cgroup_from_page
  proc: add kpagecgroup file
  mmu-notifier: add clear_young callback
  proc: add kpageidle file
  proc: export idle flag via kpageflags

 Documentation/vm/pagemap.txt           |  22 ++-
 fs/proc/page.c                         | 274 +++++++++++++++++++++++++++++++++
 fs/proc/task_mmu.c                     |   4 +-
 include/linux/memcontrol.h             |   7 +-
 include/linux/mm.h                     |  98 ++++++++++++
 include/linux/mmu_notifier.h           |  44 ++++++
 include/linux/page-flags.h             |  11 ++
 include/linux/page_ext.h               |   4 +
 include/uapi/linux/kernel-page-flags.h |   1 +
 mm/Kconfig                             |  12 ++
 mm/debug.c                             |   4 +
 mm/huge_memory.c                       |  11 +-
 mm/hwpoison-inject.c                   |   5 +-
 mm/memcontrol.c                        |  71 +++++----
 mm/memory-failure.c                    |  16 +-
 mm/migrate.c                           |   5 +
 mm/mmu_notifier.c                      |  17 ++
 mm/page_ext.c                          |   3 +
 mm/rmap.c                              |   5 +
 mm/swap.c                              |   2 +
 virt/kvm/kvm_main.c                    |  18 +++
 21 files changed, 570 insertions(+), 64 deletions(-)

-- 
2.1.4


* [PATCH -mm v8 1/7] memcg: add page_cgroup_ino helper
From: Vladimir Davydov @ 2015-07-15 13:54 UTC
  To: Andrew Morton
  Cc: Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T,
	Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse,
	David Rientjes, Pavel Emelyanov, Cyrill Gorcunov,
	Jonathan Corbet, linux-api, linux-doc, linux-mm, cgroups,
	linux-kernel

This function returns the inode number of the closest online ancestor of
the memory cgroup a page is charged to. It is required for exporting
information to userspace about which cgroup each page is charged to,
which will be introduced by a following patch.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 include/linux/memcontrol.h |  1 +
 mm/memcontrol.c            | 23 +++++++++++++++++++++++
 2 files changed, 24 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 73b02b0a8f60..50069abebc3c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -116,6 +116,7 @@ static inline bool mm_match_cgroup(struct mm_struct *mm,
 
 extern struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *memcg);
 extern struct cgroup_subsys_state *mem_cgroup_css_from_page(struct page *page);
+extern unsigned long page_cgroup_ino(struct page *page);
 
 struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
 				   struct mem_cgroup *,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index acb93c554f6e..894dc2169979 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -631,6 +631,29 @@ struct cgroup_subsys_state *mem_cgroup_css_from_page(struct page *page)
 	return &memcg->css;
 }
 
+/**
+ * page_cgroup_ino - return inode number of the memcg a page is charged to
+ * @page: the page
+ *
+ * Look up the closest online ancestor of the memory cgroup @page is charged to
+ * and return its inode number or 0 if @page is not charged to any cgroup. It
+ * is safe to call this function without holding a reference to @page.
+ */
+unsigned long page_cgroup_ino(struct page *page)
+{
+	struct mem_cgroup *memcg;
+	unsigned long ino = 0;
+
+	rcu_read_lock();
+	memcg = READ_ONCE(page->mem_cgroup);
+	while (memcg && !(memcg->css.flags & CSS_ONLINE))
+		memcg = parent_mem_cgroup(memcg);
+	if (memcg)
+		ino = cgroup_ino(memcg->css.cgroup);
+	rcu_read_unlock();
+	return ino;
+}
+
 static struct mem_cgroup_per_zone *
 mem_cgroup_page_zoneinfo(struct mem_cgroup *memcg, struct page *page)
 {
-- 
2.1.4


* [PATCH -mm v8 2/7] hwpoison: use page_cgroup_ino for filtering by memcg
From: Vladimir Davydov @ 2015-07-15 13:54 UTC
  To: Andrew Morton
  Cc: Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T,
	Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse,
	David Rientjes, Pavel Emelyanov, Cyrill Gorcunov,
	Jonathan Corbet, linux-api, linux-doc, linux-mm, cgroups,
	linux-kernel

Hwpoison allows filtering pages by memory cgroup inode number.
Currently, it calls try_get_mem_cgroup_from_page to obtain the cgroup
from a page and then gets its ino using cgroup_ino, but now we have a
better-suited helper for that, page_cgroup_ino, so use it instead.
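
For reference, a rough sketch of how the filter is driven from
userspace via the hwpoison-inject debugfs knobs. It assumes debugfs is
mounted at /sys/kernel/debug and that the corrupt-filter-enable and
corrupt-pfn files created by mm/hwpoison-inject.c are available; the
helper name is illustrative:

    import os
    import stat

    def inject_poison_for_memcg(cgroup_dir, pfn):
        # The filter matches on the memcg's cgroup inode number.
        ino = os.stat(cgroup_dir)[stat.ST_INO]
        base = "/sys/kernel/debug/hwpoison"
        with open(os.path.join(base, "corrupt-filter-memcg"), "w") as f:
            f.write(str(ino))
        with open(os.path.join(base, "corrupt-filter-enable"), "w") as f:
            f.write("1")
        with open(os.path.join(base, "corrupt-pfn"), "w") as f:
            f.write(str(pfn))  # inject at this PFN; filtered by memcg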

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 mm/hwpoison-inject.c |  5 +----
 mm/memory-failure.c  | 16 ++--------------
 2 files changed, 3 insertions(+), 18 deletions(-)

diff --git a/mm/hwpoison-inject.c b/mm/hwpoison-inject.c
index bf73ac17dad4..5015679014c1 100644
--- a/mm/hwpoison-inject.c
+++ b/mm/hwpoison-inject.c
@@ -45,12 +45,9 @@ static int hwpoison_inject(void *data, u64 val)
 	/*
 	 * do a racy check with elevated page count, to make sure PG_hwpoison
 	 * will only be set for the targeted owner (or on a free page).
-	 * We temporarily take page lock for try_get_mem_cgroup_from_page().
 	 * memory_failure() will redo the check reliably inside page lock.
 	 */
-	lock_page(hpage);
 	err = hwpoison_filter(hpage);
-	unlock_page(hpage);
 	if (err)
 		goto put_out;
 
@@ -126,7 +123,7 @@ static int pfn_inject_init(void)
 	if (!dentry)
 		goto fail;
 
-#ifdef CONFIG_MEMCG_SWAP
+#ifdef CONFIG_MEMCG
 	dentry = debugfs_create_u64("corrupt-filter-memcg", 0600,
 				    hwpoison_dir, &hwpoison_filter_memcg);
 	if (!dentry)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 1cf7f2988422..97005396a507 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -130,27 +130,15 @@ static int hwpoison_filter_flags(struct page *p)
  * can only guarantee that the page either belongs to the memcg tasks, or is
  * a freed page.
  */
-#ifdef	CONFIG_MEMCG_SWAP
+#ifdef CONFIG_MEMCG
 u64 hwpoison_filter_memcg;
 EXPORT_SYMBOL_GPL(hwpoison_filter_memcg);
 static int hwpoison_filter_task(struct page *p)
 {
-	struct mem_cgroup *mem;
-	struct cgroup_subsys_state *css;
-	unsigned long ino;
-
 	if (!hwpoison_filter_memcg)
 		return 0;
 
-	mem = try_get_mem_cgroup_from_page(p);
-	if (!mem)
-		return -EINVAL;
-
-	css = mem_cgroup_css(mem);
-	ino = cgroup_ino(css->cgroup);
-	css_put(css);
-
-	if (ino != hwpoison_filter_memcg)
+	if (page_cgroup_ino(p) != hwpoison_filter_memcg)
 		return -EINVAL;
 
 	return 0;
-- 
2.1.4


* [PATCH -mm v8 3/7] memcg: zap try_get_mem_cgroup_from_page
From: Vladimir Davydov @ 2015-07-15 13:54 UTC
  To: Andrew Morton
  Cc: Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T,
	Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse,
	David Rientjes, Pavel Emelyanov, Cyrill Gorcunov,
	Jonathan Corbet, linux-api, linux-doc, linux-mm, cgroups,
	linux-kernel

It is only used in mem_cgroup_try_charge, so fold it in and zap it.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 include/linux/memcontrol.h |  6 ------
 mm/memcontrol.c            | 48 ++++++++++++----------------------------------
 2 files changed, 12 insertions(+), 42 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 50069abebc3c..635edfe06bac 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -94,7 +94,6 @@ bool mem_cgroup_is_descendant(struct mem_cgroup *memcg,
 			      struct mem_cgroup *root);
 bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);
 
-extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
 extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
 
 extern struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg);
@@ -259,11 +258,6 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
 	return &zone->lruvec;
 }
 
-static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
-{
-	return NULL;
-}
-
 static inline bool mm_match_cgroup(struct mm_struct *mm,
 		struct mem_cgroup *memcg)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 894dc2169979..fa1447fcba33 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2378,40 +2378,6 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
 	css_put_many(&memcg->css, nr_pages);
 }
 
-/*
- * try_get_mem_cgroup_from_page - look up page's memcg association
- * @page: the page
- *
- * Look up, get a css reference, and return the memcg that owns @page.
- *
- * The page must be locked to prevent racing with swap-in and page
- * cache charges.  If coming from an unlocked page table, the caller
- * must ensure the page is on the LRU or this can race with charging.
- */
-struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
-{
-	struct mem_cgroup *memcg;
-	unsigned short id;
-	swp_entry_t ent;
-
-	VM_BUG_ON_PAGE(!PageLocked(page), page);
-
-	memcg = page->mem_cgroup;
-	if (memcg) {
-		if (!css_tryget_online(&memcg->css))
-			memcg = NULL;
-	} else if (PageSwapCache(page)) {
-		ent.val = page_private(page);
-		id = lookup_swap_cgroup_id(ent);
-		rcu_read_lock();
-		memcg = mem_cgroup_from_id(id);
-		if (memcg && !css_tryget_online(&memcg->css))
-			memcg = NULL;
-		rcu_read_unlock();
-	}
-	return memcg;
-}
-
 static void lock_page_lru(struct page *page, int *isolated)
 {
 	struct zone *zone = page_zone(page);
@@ -5628,8 +5594,20 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 		 * the page lock, which serializes swap cache removal, which
 		 * in turn serializes uncharging.
 		 */
+		VM_BUG_ON_PAGE(!PageLocked(page), page);
 		if (page->mem_cgroup)
 			goto out;
+
+		if (do_swap_account) {
+			swp_entry_t ent = { .val = page_private(page), };
+			unsigned short id = lookup_swap_cgroup_id(ent);
+
+			rcu_read_lock();
+			memcg = mem_cgroup_from_id(id);
+			if (memcg && !css_tryget_online(&memcg->css))
+				memcg = NULL;
+			rcu_read_unlock();
+		}
 	}
 
 	if (PageTransHuge(page)) {
@@ -5637,8 +5615,6 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 	}
 
-	if (do_swap_account && PageSwapCache(page))
-		memcg = try_get_mem_cgroup_from_page(page);
 	if (!memcg)
 		memcg = get_mem_cgroup_from_mm(mm);
 
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH -mm v8 4/7] proc: add kpagecgroup file
  2015-07-15 13:54 ` Vladimir Davydov
@ 2015-07-15 13:54   ` Vladimir Davydov
  -1 siblings, 0 replies; 52+ messages in thread
From: Vladimir Davydov @ 2015-07-15 13:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T,
	Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse,
	David Rientjes, Pavel Emelyanov, Cyrill Gorcunov,
	Jonathan Corbet, linux-api, linux-doc, linux-mm, cgroups,
	linux-kernel

/proc/kpagecgroup contains the 64-bit inode number of the memory cgroup
each page is charged to, indexed by PFN. Having this information is
useful for estimating a cgroup's working set size.

The file is present if CONFIG_PROC_PAGE_MONITOR && CONFIG_MEMCG.
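
For illustration only (not part of this patch), a userspace reader could
look up the cgroup inode for a given PFN roughly as sketched below; the
helper name is made up and error handling is mostly omitted:

  #include <stdint.h>
  #include <fcntl.h>
  #include <unistd.h>

  /* Return the memory cgroup inode a PFN is charged to, or 0 if the
   * entry cannot be read or the page is not charged to any cgroup. */
  static uint64_t cgroup_ino_of_pfn(uint64_t pfn)
  {
  	uint64_t ino = 0;
  	int fd = open("/proc/kpagecgroup", O_RDONLY);

  	if (fd >= 0) {
  		/* one 8-byte entry per PFN => byte offset is pfn * 8 */
  		if (pread(fd, &ino, sizeof(ino), pfn * sizeof(ino)) != sizeof(ino))
  			ino = 0;
  		close(fd);
  	}
  	return ino;
  }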

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 Documentation/vm/pagemap.txt |  6 ++++-
 fs/proc/page.c               | 53 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index 6bfbc172cdb9..a9b7afc8fbc6 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -5,7 +5,7 @@ pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow
 userspace programs to examine the page tables and related information by
 reading files in /proc.
 
-There are three components to pagemap:
+There are four components to pagemap:
 
  * /proc/pid/pagemap.  This file lets a userspace process find out which
    physical frame each virtual page is mapped to.  It contains one 64-bit
@@ -65,6 +65,10 @@ There are three components to pagemap:
     23. BALLOON
     24. ZERO_PAGE
 
+ * /proc/kpagecgroup.  This file contains a 64-bit inode number of the
+   memory cgroup each page is charged to, indexed by PFN. Only available when
+   CONFIG_MEMCG is set.
+
 Short descriptions to the page flags:
 
  0. LOCKED
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 7eee2d8b97d9..70d23245dd43 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -9,6 +9,7 @@
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
 #include <linux/hugetlb.h>
+#include <linux/memcontrol.h>
 #include <linux/kernel-page-flags.h>
 #include <asm/uaccess.h>
 #include "internal.h"
@@ -225,10 +226,62 @@ static const struct file_operations proc_kpageflags_operations = {
 	.read = kpageflags_read,
 };
 
+#ifdef CONFIG_MEMCG
+static ssize_t kpagecgroup_read(struct file *file, char __user *buf,
+				size_t count, loff_t *ppos)
+{
+	u64 __user *out = (u64 __user *)buf;
+	struct page *ppage;
+	unsigned long src = *ppos;
+	unsigned long pfn;
+	ssize_t ret = 0;
+	u64 ino;
+
+	pfn = src / KPMSIZE;
+	count = min_t(unsigned long, count, (max_pfn * KPMSIZE) - src);
+	if (src & KPMMASK || count & KPMMASK)
+		return -EINVAL;
+
+	while (count > 0) {
+		if (pfn_valid(pfn))
+			ppage = pfn_to_page(pfn);
+		else
+			ppage = NULL;
+
+		if (ppage)
+			ino = page_cgroup_ino(ppage);
+		else
+			ino = 0;
+
+		if (put_user(ino, out)) {
+			ret = -EFAULT;
+			break;
+		}
+
+		pfn++;
+		out++;
+		count -= KPMSIZE;
+	}
+
+	*ppos += (char __user *)out - buf;
+	if (!ret)
+		ret = (char __user *)out - buf;
+	return ret;
+}
+
+static const struct file_operations proc_kpagecgroup_operations = {
+	.llseek = mem_lseek,
+	.read = kpagecgroup_read,
+};
+#endif /* CONFIG_MEMCG */
+
 static int __init proc_page_init(void)
 {
 	proc_create("kpagecount", S_IRUSR, NULL, &proc_kpagecount_operations);
 	proc_create("kpageflags", S_IRUSR, NULL, &proc_kpageflags_operations);
+#ifdef CONFIG_MEMCG
+	proc_create("kpagecgroup", S_IRUSR, NULL, &proc_kpagecgroup_operations);
+#endif
 	return 0;
 }
 fs_initcall(proc_page_init);
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH -mm v8 5/7] mmu-notifier: add clear_young callback
  2015-07-15 13:54 ` Vladimir Davydov
@ 2015-07-15 13:54   ` Vladimir Davydov
  -1 siblings, 0 replies; 52+ messages in thread
From: Vladimir Davydov @ 2015-07-15 13:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T,
	Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse,
	David Rientjes, Pavel Emelyanov, Cyrill Gorcunov,
	Jonathan Corbet, linux-api, linux-doc, linux-mm, cgroups,
	linux-kernel

For the idle memory tracking feature, which is introduced by the
following patch, we need to clear the referenced/accessed bit not only
in primary, but also in secondary ptes. The latter is required in order
to estimate the working set size of KVM VMs. At the same time we want
to avoid flushing the TLB, because it is quite expensive and skipping it
does not really affect the final result.

Currently, there is no function for clearing the pte young bit that
would meet our requirements, so this patch introduces one. To achieve
that, we have to add a new mmu-notifier callback, clear_young, since
there is no method for testing and clearing a secondary pte without
flushing the TLB. The new callback is not mandatory and is currently
only implemented by KVM.
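
Purely as a sketch (not part of this patch), another driver that shadows
user page tables could opt in to the new callback much like KVM does in
the hunk below; all mydrv_* names here are placeholders:

  #include <linux/mmu_notifier.h>

  /* Test-and-clear the accessed bits for [start, end) in the driver's
   * shadow mappings without flushing its TLB; return non-zero if any
   * entry was young. */
  static int mydrv_clear_young(struct mmu_notifier *mn, struct mm_struct *mm,
  			     unsigned long start, unsigned long end)
  {
  	int young = 0;

  	/* walk the driver's shadow page tables for [start, end) here */
  	return young;
  }

  static const struct mmu_notifier_ops mydrv_mmu_notifier_ops = {
  	.clear_young	= mydrv_clear_young,
  	/* other callbacks omitted; clear_young is optional */
  };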

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 include/linux/mmu_notifier.h | 44 ++++++++++++++++++++++++++++++++++++++++++++
 mm/mmu_notifier.c            | 17 +++++++++++++++++
 virt/kvm/kvm_main.c          | 18 ++++++++++++++++++
 3 files changed, 79 insertions(+)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 61cd67f4d788..a5b17137c683 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -66,6 +66,16 @@ struct mmu_notifier_ops {
 				 unsigned long end);
 
 	/*
+	 * clear_young is a lightweight version of clear_flush_young. Like the
+	 * latter, it is supposed to test-and-clear the young/accessed bitflag
+	 * in the secondary pte, but it may omit flushing the secondary tlb.
+	 */
+	int (*clear_young)(struct mmu_notifier *mn,
+			   struct mm_struct *mm,
+			   unsigned long start,
+			   unsigned long end);
+
+	/*
 	 * test_young is called to check the young/accessed bitflag in
 	 * the secondary pte. This is used to know if the page is
 	 * frequently used without actually clearing the flag or tearing
@@ -203,6 +213,9 @@ extern void __mmu_notifier_release(struct mm_struct *mm);
 extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
 					  unsigned long start,
 					  unsigned long end);
+extern int __mmu_notifier_clear_young(struct mm_struct *mm,
+				      unsigned long start,
+				      unsigned long end);
 extern int __mmu_notifier_test_young(struct mm_struct *mm,
 				     unsigned long address);
 extern void __mmu_notifier_change_pte(struct mm_struct *mm,
@@ -231,6 +244,15 @@ static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm,
 	return 0;
 }
 
+static inline int mmu_notifier_clear_young(struct mm_struct *mm,
+					   unsigned long start,
+					   unsigned long end)
+{
+	if (mm_has_notifiers(mm))
+		return __mmu_notifier_clear_young(mm, start, end);
+	return 0;
+}
+
 static inline int mmu_notifier_test_young(struct mm_struct *mm,
 					  unsigned long address)
 {
@@ -311,6 +333,28 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
 	__young;							\
 })
 
+#define ptep_clear_young_notify(__vma, __address, __ptep)		\
+({									\
+	int __young;							\
+	struct vm_area_struct *___vma = __vma;				\
+	unsigned long ___address = __address;				\
+	__young = ptep_test_and_clear_young(___vma, ___address, __ptep);\
+	__young |= mmu_notifier_clear_young(___vma->vm_mm, ___address,	\
+					    ___address + PAGE_SIZE);	\
+	__young;							\
+})
+
+#define pmdp_clear_young_notify(__vma, __address, __pmdp)		\
+({									\
+	int __young;							\
+	struct vm_area_struct *___vma = __vma;				\
+	unsigned long ___address = __address;				\
+	__young = pmdp_test_and_clear_young(___vma, ___address, __pmdp);\
+	__young |= mmu_notifier_clear_young(___vma->vm_mm, ___address,	\
+					    ___address + PMD_SIZE);	\
+	__young;							\
+})
+
 #define	ptep_clear_flush_notify(__vma, __address, __ptep)		\
 ({									\
 	unsigned long ___addr = __address & PAGE_MASK;			\
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 3b9b3d0741b2..5fbdd367bbed 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -123,6 +123,23 @@ int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
 	return young;
 }
 
+int __mmu_notifier_clear_young(struct mm_struct *mm,
+			       unsigned long start,
+			       unsigned long end)
+{
+	struct mmu_notifier *mn;
+	int young = 0, id;
+
+	id = srcu_read_lock(&srcu);
+	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
+		if (mn->ops->clear_young)
+			young |= mn->ops->clear_young(mn, mm, start, end);
+	}
+	srcu_read_unlock(&srcu, id);
+
+	return young;
+}
+
 int __mmu_notifier_test_young(struct mm_struct *mm,
 			      unsigned long address)
 {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 05148a43ef9c..61500cb028a3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -388,6 +388,23 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
 	return young;
 }
 
+static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
+					struct mm_struct *mm,
+					unsigned long start,
+					unsigned long end)
+{
+	struct kvm *kvm = mmu_notifier_to_kvm(mn);
+	int young, idx;
+
+	idx = srcu_read_lock(&kvm->srcu);
+	spin_lock(&kvm->mmu_lock);
+	young = kvm_age_hva(kvm, start, end);
+	spin_unlock(&kvm->mmu_lock);
+	srcu_read_unlock(&kvm->srcu, idx);
+
+	return young;
+}
+
 static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
 				       unsigned long address)
@@ -420,6 +437,7 @@ static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
 	.invalidate_range_start	= kvm_mmu_notifier_invalidate_range_start,
 	.invalidate_range_end	= kvm_mmu_notifier_invalidate_range_end,
 	.clear_flush_young	= kvm_mmu_notifier_clear_flush_young,
+	.clear_young		= kvm_mmu_notifier_clear_young,
 	.test_young		= kvm_mmu_notifier_test_young,
 	.change_pte		= kvm_mmu_notifier_change_pte,
 	.release		= kvm_mmu_notifier_release,
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH -mm v8 6/7] proc: add kpageidle file
  2015-07-15 13:54 ` Vladimir Davydov
  (?)
@ 2015-07-15 13:54   ` Vladimir Davydov
  -1 siblings, 0 replies; 52+ messages in thread
From: Vladimir Davydov @ 2015-07-15 13:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T,
	Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse,
	David Rientjes, Pavel Emelyanov, Cyrill Gorcunov,
	Jonathan Corbet, linux-api, linux-doc, linux-mm, cgroups,
	linux-kernel

Knowing the portion of memory that is not used by a certain application
or memory cgroup (idle memory) can be useful for partitioning the system
efficiently, e.g. by setting memory cgroup limits appropriately.
Currently, the only means the kernel provides for estimating the amount
of idle memory is /proc/PID/{clear_refs,smaps}: the user can clear the
access bit for all pages mapped to a particular process by writing 1 to
clear_refs, wait for some time, and then count smaps:Referenced.
However, this method has two serious shortcomings:

 - it does not count unmapped file pages
 - it affects the reclaimer logic

To overcome these drawbacks, this patch introduces two new page flags,
Idle and Young, and a new proc file, /proc/kpageidle. A page's Idle flag
can only be set from userspace, by setting the bit in /proc/kpageidle at
the offset corresponding to the page, and it is cleared whenever the page is
accessed either through page tables (it is cleared in page_referenced()
in this case) or using the read(2) system call (mark_page_accessed()).
Thus by setting the Idle flag for pages of a particular workload, which
can be found e.g. by reading /proc/PID/pagemap, waiting for some time to
let the workload access its working set, and then reading the kpageidle
file, one can estimate the number of pages that are not used by the
workload.
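
A minimal userspace sketch of that procedure (not part of this patch; it
assumes a starting PFN that is a multiple of 64 and omits error
handling):

  #include <stdint.h>
  #include <fcntl.h>
  #include <unistd.h>

  /* Each 64-bit word in /proc/kpageidle covers 64 PFNs, so the byte
   * offset for PFN p is p / 8 (p must be a multiple of 64 here). */
  static void mark_idle64(uint64_t pfn)
  {
  	uint64_t mask = ~0ULL;
  	int fd = open("/proc/kpageidle", O_WRONLY);

  	pwrite(fd, &mask, sizeof(mask), pfn / 8);
  	close(fd);
  }

  static uint64_t still_idle64(uint64_t pfn)
  {
  	uint64_t bits = 0;
  	int fd = open("/proc/kpageidle", O_RDONLY);

  	pread(fd, &bits, sizeof(bits), pfn / 8);
  	close(fd);
  	return bits;	/* bit i set => PFN pfn + i has not been accessed */
  }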

The Young page flag is used to avoid interference with the memory
reclaimer. A page's Young flag is set whenever the Access bit of a page
table entry pointing to the page is cleared by writing to kpageidle. If
page_referenced() is called on a Young page, it will add 1 to its return
value, therefore concealing the fact that the Access bit was cleared.

Note, since there is no room for extra page flags on 32 bit, this
feature uses extended page flags when compiled on 32 bit.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 Documentation/vm/pagemap.txt |  12 ++-
 fs/proc/page.c               | 218 +++++++++++++++++++++++++++++++++++++++++++
 fs/proc/task_mmu.c           |   4 +-
 include/linux/mm.h           |  98 +++++++++++++++++++
 include/linux/page-flags.h   |  11 +++
 include/linux/page_ext.h     |   4 +
 mm/Kconfig                   |  12 +++
 mm/debug.c                   |   4 +
 mm/huge_memory.c             |  11 ++-
 mm/migrate.c                 |   5 +
 mm/page_ext.c                |   3 +
 mm/rmap.c                    |   5 +
 mm/swap.c                    |   2 +
 13 files changed, 385 insertions(+), 4 deletions(-)

diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index a9b7afc8fbc6..c9266340852c 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -5,7 +5,7 @@ pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow
 userspace programs to examine the page tables and related information by
 reading files in /proc.
 
-There are four components to pagemap:
+There are five components to pagemap:
 
  * /proc/pid/pagemap.  This file lets a userspace process find out which
    physical frame each virtual page is mapped to.  It contains one 64-bit
@@ -69,6 +69,16 @@ There are four components to pagemap:
    memory cgroup each page is charged to, indexed by PFN. Only available when
    CONFIG_MEMCG is set.
 
+ * /proc/kpageidle.  This file implements a bitmap where each bit corresponds
+   to a page, indexed by PFN. When the bit is set, the corresponding page is
+   idle. A page is considered idle if it has not been accessed since it was
+   marked idle. To mark a page idle one should set the bit corresponding to the
+   page by writing to the file. A value written to the file is OR-ed with the
+   current bitmap value. Only user memory pages can be marked idle, for other
+   page types input is silently ignored. Writing to this file beyond max PFN
+   results in the ENXIO error. Only available when CONFIG_IDLE_PAGE_TRACKING is
+   set.
+
 Short descriptions to the page flags:
 
  0. LOCKED
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 70d23245dd43..273537885ab4 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -5,6 +5,8 @@
 #include <linux/ksm.h>
 #include <linux/mm.h>
 #include <linux/mmzone.h>
+#include <linux/rmap.h>
+#include <linux/mmu_notifier.h>
 #include <linux/huge_mm.h>
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
@@ -16,6 +18,7 @@
 
 #define KPMSIZE sizeof(u64)
 #define KPMMASK (KPMSIZE - 1)
+#define KPMBITS (KPMSIZE * BITS_PER_BYTE)
 
 /* /proc/kpagecount - an array exposing page counts
  *
@@ -275,6 +278,217 @@ static const struct file_operations proc_kpagecgroup_operations = {
 };
 #endif /* CONFIG_MEMCG */
 
+#ifdef CONFIG_IDLE_PAGE_TRACKING
+/*
+ * Idle page tracking only considers user memory pages, for other types of
+ * pages the idle flag is always unset and an attempt to set it is silently
+ * ignored.
+ *
+ * We treat a page as a user memory page if it is on an LRU list, because it is
+ * always safe to pass such a page to rmap_walk(), which is essential for idle
+ * page tracking. With such an indicator of user pages we can skip isolated
+ * pages, but since there are not usually many of them, it will hardly affect
+ * the overall result.
+ *
+ * This function tries to get a user memory page by pfn as described above.
+ */
+static struct page *kpageidle_get_page(unsigned long pfn)
+{
+	struct page *page;
+	struct zone *zone;
+
+	if (!pfn_valid(pfn))
+		return NULL;
+
+	page = pfn_to_page(pfn);
+	if (!page || !PageLRU(page) ||
+	    !get_page_unless_zero(page))
+		return NULL;
+
+	zone = page_zone(page);
+	spin_lock_irq(&zone->lru_lock);
+	if (unlikely(!PageLRU(page))) {
+		put_page(page);
+		page = NULL;
+	}
+	spin_unlock_irq(&zone->lru_lock);
+	return page;
+}
+
+static int kpageidle_clear_pte_refs_one(struct page *page,
+					struct vm_area_struct *vma,
+					unsigned long addr, void *arg)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	spinlock_t *ptl;
+	pmd_t *pmd;
+	pte_t *pte;
+	bool referenced = false;
+
+	if (unlikely(PageTransHuge(page))) {
+		pmd = page_check_address_pmd(page, mm, addr,
+					     PAGE_CHECK_ADDRESS_PMD_FLAG, &ptl);
+		if (pmd) {
+			referenced = pmdp_clear_young_notify(vma, addr, pmd);
+			spin_unlock(ptl);
+		}
+	} else {
+		pte = page_check_address(page, mm, addr, &ptl, 0);
+		if (pte) {
+			referenced = ptep_clear_young_notify(vma, addr, pte);
+			pte_unmap_unlock(pte, ptl);
+		}
+	}
+	if (referenced) {
+		clear_page_idle(page);
+		/*
+		 * We cleared the referenced bit in a mapping to this page. To
+		 * avoid interference with page reclaim, mark it young so that
+		 * page_referenced() will return > 0.
+		 */
+		set_page_young(page);
+	}
+	return SWAP_AGAIN;
+}
+
+static void kpageidle_clear_pte_refs(struct page *page)
+{
+	struct rmap_walk_control rwc = {
+		.rmap_one = kpageidle_clear_pte_refs_one,
+		.anon_lock = page_lock_anon_vma_read,
+	};
+	bool need_lock;
+
+	if (!page_mapped(page) ||
+	    !page_rmapping(page))
+		return;
+
+	need_lock = !PageAnon(page) || PageKsm(page);
+	if (need_lock && !trylock_page(page))
+		return;
+
+	rmap_walk(page, &rwc);
+
+	if (need_lock)
+		unlock_page(page);
+}
+
+static ssize_t kpageidle_read(struct file *file, char __user *buf,
+			      size_t count, loff_t *ppos)
+{
+	u64 __user *out = (u64 __user *)buf;
+	struct page *page;
+	unsigned long pfn, end_pfn;
+	ssize_t ret = 0;
+	u64 idle_bitmap = 0;
+	int bit;
+
+	if (*ppos & KPMMASK || count & KPMMASK)
+		return -EINVAL;
+
+	pfn = *ppos * BITS_PER_BYTE;
+	if (pfn >= max_pfn)
+		return 0;
+
+	end_pfn = pfn + count * BITS_PER_BYTE;
+	if (end_pfn > max_pfn)
+		end_pfn = ALIGN(max_pfn, KPMBITS);
+
+	for (; pfn < end_pfn; pfn++) {
+		bit = pfn % KPMBITS;
+		page = kpageidle_get_page(pfn);
+		if (page) {
+			if (page_is_idle(page)) {
+				/*
+				 * The page might have been referenced via a
+				 * pte, in which case it is not idle. Clear
+				 * refs and recheck.
+				 */
+				kpageidle_clear_pte_refs(page);
+				if (page_is_idle(page))
+					idle_bitmap |= 1ULL << bit;
+			}
+			put_page(page);
+		}
+		if (bit == KPMBITS - 1) {
+			if (put_user(idle_bitmap, out)) {
+				ret = -EFAULT;
+				break;
+			}
+			idle_bitmap = 0;
+			out++;
+		}
+	}
+
+	*ppos += (char __user *)out - buf;
+	if (!ret)
+		ret = (char __user *)out - buf;
+	return ret;
+}
+
+static ssize_t kpageidle_write(struct file *file, const char __user *buf,
+			       size_t count, loff_t *ppos)
+{
+	const u64 __user *in = (const u64 __user *)buf;
+	struct page *page;
+	unsigned long pfn, end_pfn;
+	ssize_t ret = 0;
+	u64 idle_bitmap = 0;
+	int bit;
+
+	if (*ppos & KPMMASK || count & KPMMASK)
+		return -EINVAL;
+
+	pfn = *ppos * BITS_PER_BYTE;
+	if (pfn >= max_pfn)
+		return -ENXIO;
+
+	end_pfn = pfn + count * BITS_PER_BYTE;
+	if (end_pfn > max_pfn)
+		end_pfn = ALIGN(max_pfn, KPMBITS);
+
+	for (; pfn < end_pfn; pfn++) {
+		bit = pfn % KPMBITS;
+		if (bit == 0) {
+			if (get_user(idle_bitmap, in)) {
+				ret = -EFAULT;
+				break;
+			}
+			in++;
+		}
+		if (idle_bitmap >> bit & 1) {
+			page = kpageidle_get_page(pfn);
+			if (page) {
+				kpageidle_clear_pte_refs(page);
+				set_page_idle(page);
+				put_page(page);
+			}
+		}
+	}
+
+	*ppos += (const char __user *)in - buf;
+	if (!ret)
+		ret = (const char __user *)in - buf;
+	return ret;
+}
+
+static const struct file_operations proc_kpageidle_operations = {
+	.llseek = mem_lseek,
+	.read = kpageidle_read,
+	.write = kpageidle_write,
+};
+
+#ifndef CONFIG_64BIT
+static bool need_page_idle(void)
+{
+	return true;
+}
+struct page_ext_operations page_idle_ops = {
+	.need = need_page_idle,
+};
+#endif
+#endif /* CONFIG_IDLE_PAGE_TRACKING */
+
 static int __init proc_page_init(void)
 {
 	proc_create("kpagecount", S_IRUSR, NULL, &proc_kpagecount_operations);
@@ -282,6 +496,10 @@ static int __init proc_page_init(void)
 #ifdef CONFIG_MEMCG
 	proc_create("kpagecgroup", S_IRUSR, NULL, &proc_kpagecgroup_operations);
 #endif
+#ifdef CONFIG_IDLE_PAGE_TRACKING
+	proc_create("kpageidle", S_IRUSR | S_IWUSR, NULL,
+		    &proc_kpageidle_operations);
+#endif
 	return 0;
 }
 fs_initcall(proc_page_init);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 3b4d8255e806..3efd7f641f92 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -458,7 +458,7 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,
 
 	mss->resident += size;
 	/* Accumulate the size in pages that have been accessed. */
-	if (young || PageReferenced(page))
+	if (young || page_is_young(page) || PageReferenced(page))
 		mss->referenced += size;
 	mapcount = page_mapcount(page);
 	if (mapcount >= 2) {
@@ -810,6 +810,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 
 		/* Clear accessed and referenced bits. */
 		pmdp_test_and_clear_young(vma, addr, pmd);
+		test_and_clear_page_young(page);
 		ClearPageReferenced(page);
 out:
 		spin_unlock(ptl);
@@ -837,6 +838,7 @@ out:
 
 		/* Clear accessed and referenced bits. */
 		ptep_test_and_clear_young(vma, addr, pte);
+		test_and_clear_page_young(page);
 		ClearPageReferenced(page);
 	}
 	pte_unmap_unlock(pte - 1, ptl);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7f471789781a..de450c1191b9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2205,5 +2205,103 @@ void __init setup_nr_node_ids(void);
 static inline void setup_nr_node_ids(void) {}
 #endif
 
+#ifdef CONFIG_IDLE_PAGE_TRACKING
+#ifdef CONFIG_64BIT
+static inline bool page_is_young(struct page *page)
+{
+	return PageYoung(page);
+}
+
+static inline void set_page_young(struct page *page)
+{
+	SetPageYoung(page);
+}
+
+static inline bool test_and_clear_page_young(struct page *page)
+{
+	return TestClearPageYoung(page);
+}
+
+static inline bool page_is_idle(struct page *page)
+{
+	return PageIdle(page);
+}
+
+static inline void set_page_idle(struct page *page)
+{
+	SetPageIdle(page);
+}
+
+static inline void clear_page_idle(struct page *page)
+{
+	ClearPageIdle(page);
+}
+#else /* !CONFIG_64BIT */
+/*
+ * If there is not enough space to store Idle and Young bits in page flags, use
+ * page ext flags instead.
+ */
+extern struct page_ext_operations page_idle_ops;
+
+static inline bool page_is_young(struct page *page)
+{
+	return test_bit(PAGE_EXT_YOUNG, &lookup_page_ext(page)->flags);
+}
+
+static inline void set_page_young(struct page *page)
+{
+	set_bit(PAGE_EXT_YOUNG, &lookup_page_ext(page)->flags);
+}
+
+static inline bool test_and_clear_page_young(struct page *page)
+{
+	return test_and_clear_bit(PAGE_EXT_YOUNG,
+				  &lookup_page_ext(page)->flags);
+}
+
+static inline bool page_is_idle(struct page *page)
+{
+	return test_bit(PAGE_EXT_IDLE, &lookup_page_ext(page)->flags);
+}
+
+static inline void set_page_idle(struct page *page)
+{
+	set_bit(PAGE_EXT_IDLE, &lookup_page_ext(page)->flags);
+}
+
+static inline void clear_page_idle(struct page *page)
+{
+	clear_bit(PAGE_EXT_IDLE, &lookup_page_ext(page)->flags);
+}
+#endif /* CONFIG_64BIT */
+#else /* !CONFIG_IDLE_PAGE_TRACKING */
+static inline bool page_is_young(struct page *page)
+{
+	return false;
+}
+
+static inline void set_page_young(struct page *page)
+{
+}
+
+static inline bool test_and_clear_page_young(struct page *page)
+{
+	return false;
+}
+
+static inline bool page_is_idle(struct page *page)
+{
+	return false;
+}
+
+static inline void set_page_idle(struct page *page)
+{
+}
+
+static inline void clear_page_idle(struct page *page)
+{
+}
+#endif /* CONFIG_IDLE_PAGE_TRACKING */
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 91b7f9b2b774..478f2241f284 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -109,6 +109,10 @@ enum pageflags {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	PG_compound_lock,
 #endif
+#if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
+	PG_young,
+	PG_idle,
+#endif
 	__NR_PAGEFLAGS,
 
 	/* Filesystems */
@@ -363,6 +367,13 @@ PAGEFLAG_FALSE(HWPoison)
 #define __PG_HWPOISON 0
 #endif
 
+#if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
+TESTPAGEFLAG(Young, young, PF_ANY)
+SETPAGEFLAG(Young, young, PF_ANY)
+TESTCLEARFLAG(Young, young, PF_ANY)
+PAGEFLAG(Idle, idle, PF_ANY)
+#endif
+
 /*
  * On an anonymous page mapped into a user virtual memory area,
  * page->mapping points to its anon_vma, not to a struct address_space;
diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
index c42981cd99aa..17f118a82854 100644
--- a/include/linux/page_ext.h
+++ b/include/linux/page_ext.h
@@ -26,6 +26,10 @@ enum page_ext_flags {
 	PAGE_EXT_DEBUG_POISON,		/* Page is poisoned */
 	PAGE_EXT_DEBUG_GUARD,
 	PAGE_EXT_OWNER,
+#if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
+	PAGE_EXT_YOUNG,
+	PAGE_EXT_IDLE,
+#endif
 };
 
 /*
diff --git a/mm/Kconfig b/mm/Kconfig
index e79de2bd12cd..db817e2c2ec8 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -654,3 +654,15 @@ config DEFERRED_STRUCT_PAGE_INIT
 	  when kswapd starts. This has a potential performance impact on
 	  processes running early in the lifetime of the systemm until kswapd
 	  finishes the initialisation.
+
+config IDLE_PAGE_TRACKING
+	bool "Enable idle page tracking"
+	select PROC_PAGE_MONITOR
+	select PAGE_EXTENSION if !64BIT
+	help
+	  This feature allows to estimate the amount of user pages that have
+	  not been touched during a given period of time. This information can
+	  be useful to tune memory cgroup limits and/or for job placement
+	  within a compute cluster.
+
+	  See Documentation/vm/pagemap.txt for more details.
diff --git a/mm/debug.c b/mm/debug.c
index 76089ddf99ea..6c1b3ea61bfd 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -48,6 +48,10 @@ static const struct trace_print_flags pageflag_names[] = {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	{1UL << PG_compound_lock,	"compound_lock"	},
 #endif
+#if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
+	{1UL << PG_young,		"young"		},
+	{1UL << PG_idle,		"idle"		},
+#endif
 };
 
 static void dump_flags(unsigned long flags,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9671f51e954d..bb6d2ec1f268 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1754,6 +1754,11 @@ static void __split_huge_page_refcount(struct page *page,
 		/* clear PageTail before overwriting first_page */
 		smp_wmb();
 
+		if (page_is_young(page))
+			set_page_young(page_tail);
+		if (page_is_idle(page))
+			set_page_idle(page_tail);
+
 		/*
 		 * __split_huge_page_splitting() already set the
 		 * splitting bit in all pmd that could map this
@@ -2259,7 +2264,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 
 		/* If there is no mapped pte young don't collapse the page */
-		if (pte_young(pteval) || PageReferenced(page) ||
+		if (pte_young(pteval) ||
+		    page_is_young(page) || PageReferenced(page) ||
 		    mmu_notifier_test_young(vma->vm_mm, address))
 			referenced = true;
 	}
@@ -2686,7 +2692,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 		 */
 		if (page_count(page) != 1 + !!PageSwapCache(page))
 			goto out_unmap;
-		if (pte_young(pteval) || PageReferenced(page) ||
+		if (pte_young(pteval) ||
+		    page_is_young(page) || PageReferenced(page) ||
 		    mmu_notifier_test_young(vma->vm_mm, address))
 			referenced = true;
 	}
diff --git a/mm/migrate.c b/mm/migrate.c
index 236ee25e79d9..3e7bb4f2b51c 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -524,6 +524,11 @@ void migrate_page_copy(struct page *newpage, struct page *page)
 			__set_page_dirty_nobuffers(newpage);
  	}
 
+	if (page_is_young(page))
+		set_page_young(newpage);
+	if (page_is_idle(page))
+		set_page_idle(newpage);
+
 	/*
 	 * Copy NUMA information to the new page, to prevent over-eager
 	 * future migrations of this same page.
diff --git a/mm/page_ext.c b/mm/page_ext.c
index d86fd2f5353f..e4b3af054bf2 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -59,6 +59,9 @@ static struct page_ext_operations *page_ext_ops[] = {
 #ifdef CONFIG_PAGE_OWNER
 	&page_owner_ops,
 #endif
+#if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
+	&page_idle_ops,
+#endif
 };
 
 static unsigned long total_usage;
diff --git a/mm/rmap.c b/mm/rmap.c
index 49b244b1f18c..c96677ade3d1 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -798,6 +798,11 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 		pte_unmap_unlock(pte, ptl);
 	}
 
+	if (referenced)
+		clear_page_idle(page);
+	if (test_and_clear_page_young(page))
+		referenced++;
+
 	if (referenced) {
 		pra->referenced++;
 		pra->vm_flags |= vma->vm_flags;
diff --git a/mm/swap.c b/mm/swap.c
index ab7c338eda87..db43c9b4891d 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -623,6 +623,8 @@ void mark_page_accessed(struct page *page)
 	} else if (!PageReferenced(page)) {
 		SetPageReferenced(page);
 	}
+	if (page_is_idle(page))
+		clear_page_idle(page);
 }
 EXPORT_SYMBOL(mark_page_accessed);
 
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH -mm v8 6/7] proc: add kpageidle file
@ 2015-07-15 13:54   ` Vladimir Davydov
  0 siblings, 0 replies; 52+ messages in thread
From: Vladimir Davydov @ 2015-07-15 13:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T,
	Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse,
	David Rientjes, Pavel Emelyanov, Cyrill Gorcunov,
	Jonathan Corbet, linux-api, linux-doc, linux-mm, cgroups,
	linux-kernel

Knowing the portion of memory that is not used by a certain application
or memory cgroup (idle memory) can be useful for partitioning the system
efficiently, e.g. by setting memory cgroup limits appropriately.
Currently, the only means the kernel provides for estimating the amount
of idle memory is /proc/PID/{clear_refs,smaps}: the user can clear the
access bit for all pages mapped to a particular process by writing 1 to
clear_refs, wait for some time, and then count smaps:Referenced.
However, this method has two serious shortcomings:

 - it does not count unmapped file pages
 - it affects the reclaimer logic

To overcome these drawbacks, this patch introduces two new page flags,
Idle and Young, and a new proc file, /proc/kpageidle. A page's Idle flag
can only be set from userspace, by setting the bit in /proc/kpageidle at
the offset corresponding to the page, and it is cleared whenever the page is
accessed either through page tables (it is cleared in page_referenced()
in this case) or using the read(2) system call (mark_page_accessed()).
Thus by setting the Idle flag for pages of a particular workload, which
can be found e.g. by reading /proc/PID/pagemap, waiting for some time to
let the workload access its working set, and then reading the kpageidle
file, one can estimate the number of pages that are not used by the
workload.

The Young page flag is used to avoid interference with the memory
reclaimer. A page's Young flag is set whenever the Access bit of a page
table entry pointing to the page is cleared by writing to kpageidle. If
page_referenced() is called on a Young page, it will add 1 to its return
value, therefore concealing the fact that the Access bit was cleared.

Note, since there is no room for extra page flags on 32 bit, this
feature uses extended page flags when compiled on 32 bit.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 Documentation/vm/pagemap.txt |  12 ++-
 fs/proc/page.c               | 218 +++++++++++++++++++++++++++++++++++++++++++
 fs/proc/task_mmu.c           |   4 +-
 include/linux/mm.h           |  98 +++++++++++++++++++
 include/linux/page-flags.h   |  11 +++
 include/linux/page_ext.h     |   4 +
 mm/Kconfig                   |  12 +++
 mm/debug.c                   |   4 +
 mm/huge_memory.c             |  11 ++-
 mm/migrate.c                 |   5 +
 mm/page_ext.c                |   3 +
 mm/rmap.c                    |   5 +
 mm/swap.c                    |   2 +
 13 files changed, 385 insertions(+), 4 deletions(-)

diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index a9b7afc8fbc6..c9266340852c 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -5,7 +5,7 @@ pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow
 userspace programs to examine the page tables and related information by
 reading files in /proc.
 
-There are four components to pagemap:
+There are five components to pagemap:
 
  * /proc/pid/pagemap.  This file lets a userspace process find out which
    physical frame each virtual page is mapped to.  It contains one 64-bit
@@ -69,6 +69,16 @@ There are four components to pagemap:
    memory cgroup each page is charged to, indexed by PFN. Only available when
    CONFIG_MEMCG is set.
 
+ * /proc/kpageidle.  This file implements a bitmap where each bit corresponds
+   to a page, indexed by PFN. When the bit is set, the corresponding page is
+   idle. A page is considered idle if it has not been accessed since it was
+   marked idle. To mark a page idle one should set the bit corresponding to the
+   page by writing to the file. A value written to the file is OR-ed with the
+   current bitmap value. Only user memory pages can be marked idle, for other
+   page types input is silently ignored. Writing to this file beyond max PFN
+   results in the ENXIO error. Only available when CONFIG_IDLE_PAGE_TRACKING is
+   set.
+
 Short descriptions to the page flags:
 
  0. LOCKED
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 70d23245dd43..273537885ab4 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -5,6 +5,8 @@
 #include <linux/ksm.h>
 #include <linux/mm.h>
 #include <linux/mmzone.h>
+#include <linux/rmap.h>
+#include <linux/mmu_notifier.h>
 #include <linux/huge_mm.h>
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
@@ -16,6 +18,7 @@
 
 #define KPMSIZE sizeof(u64)
 #define KPMMASK (KPMSIZE - 1)
+#define KPMBITS (KPMSIZE * BITS_PER_BYTE)
 
 /* /proc/kpagecount - an array exposing page counts
  *
@@ -275,6 +278,217 @@ static const struct file_operations proc_kpagecgroup_operations = {
 };
 #endif /* CONFIG_MEMCG */
 
+#ifdef CONFIG_IDLE_PAGE_TRACKING
+/*
+ * Idle page tracking only considers user memory pages, for other types of
+ * pages the idle flag is always unset and an attempt to set it is silently
+ * ignored.
+ *
+ * We treat a page as a user memory page if it is on an LRU list, because it is
+ * always safe to pass such a page to rmap_walk(), which is essential for idle
+ * page tracking. With such an indicator of user pages we can skip isolated
+ * pages, but since there are not usually many of them, it will hardly affect
+ * the overall result.
+ *
+ * This function tries to get a user memory page by pfn as described above.
+ */
+static struct page *kpageidle_get_page(unsigned long pfn)
+{
+	struct page *page;
+	struct zone *zone;
+
+	if (!pfn_valid(pfn))
+		return NULL;
+
+	page = pfn_to_page(pfn);
+	if (!page || !PageLRU(page) ||
+	    !get_page_unless_zero(page))
+		return NULL;
+
+	zone = page_zone(page);
+	spin_lock_irq(&zone->lru_lock);
+	if (unlikely(!PageLRU(page))) {
+		put_page(page);
+		page = NULL;
+	}
+	spin_unlock_irq(&zone->lru_lock);
+	return page;
+}
+
+static int kpageidle_clear_pte_refs_one(struct page *page,
+					struct vm_area_struct *vma,
+					unsigned long addr, void *arg)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	spinlock_t *ptl;
+	pmd_t *pmd;
+	pte_t *pte;
+	bool referenced = false;
+
+	if (unlikely(PageTransHuge(page))) {
+		pmd = page_check_address_pmd(page, mm, addr,
+					     PAGE_CHECK_ADDRESS_PMD_FLAG, &ptl);
+		if (pmd) {
+			referenced = pmdp_clear_young_notify(vma, addr, pmd);
+			spin_unlock(ptl);
+		}
+	} else {
+		pte = page_check_address(page, mm, addr, &ptl, 0);
+		if (pte) {
+			referenced = ptep_clear_young_notify(vma, addr, pte);
+			pte_unmap_unlock(pte, ptl);
+		}
+	}
+	if (referenced) {
+		clear_page_idle(page);
+		/*
+		 * We cleared the referenced bit in a mapping to this page. To
+		 * avoid interference with page reclaim, mark it young so that
+		 * page_referenced() will return > 0.
+		 */
+		set_page_young(page);
+	}
+	return SWAP_AGAIN;
+}
+
+static void kpageidle_clear_pte_refs(struct page *page)
+{
+	struct rmap_walk_control rwc = {
+		.rmap_one = kpageidle_clear_pte_refs_one,
+		.anon_lock = page_lock_anon_vma_read,
+	};
+	bool need_lock;
+
+	if (!page_mapped(page) ||
+	    !page_rmapping(page))
+		return;
+
+	need_lock = !PageAnon(page) || PageKsm(page);
+	if (need_lock && !trylock_page(page))
+		return;
+
+	rmap_walk(page, &rwc);
+
+	if (need_lock)
+		unlock_page(page);
+}
+
+static ssize_t kpageidle_read(struct file *file, char __user *buf,
+			      size_t count, loff_t *ppos)
+{
+	u64 __user *out = (u64 __user *)buf;
+	struct page *page;
+	unsigned long pfn, end_pfn;
+	ssize_t ret = 0;
+	u64 idle_bitmap = 0;
+	int bit;
+
+	if (*ppos & KPMMASK || count & KPMMASK)
+		return -EINVAL;
+
+	pfn = *ppos * BITS_PER_BYTE;
+	if (pfn >= max_pfn)
+		return 0;
+
+	end_pfn = pfn + count * BITS_PER_BYTE;
+	if (end_pfn > max_pfn)
+		end_pfn = ALIGN(max_pfn, KPMBITS);
+
+	for (; pfn < end_pfn; pfn++) {
+		bit = pfn % KPMBITS;
+		page = kpageidle_get_page(pfn);
+		if (page) {
+			if (page_is_idle(page)) {
+				/*
+				 * The page might have been referenced via a
+				 * pte, in which case it is not idle. Clear
+				 * refs and recheck.
+				 */
+				kpageidle_clear_pte_refs(page);
+				if (page_is_idle(page))
+					idle_bitmap |= 1ULL << bit;
+			}
+			put_page(page);
+		}
+		if (bit == KPMBITS - 1) {
+			if (put_user(idle_bitmap, out)) {
+				ret = -EFAULT;
+				break;
+			}
+			idle_bitmap = 0;
+			out++;
+		}
+	}
+
+	*ppos += (char __user *)out - buf;
+	if (!ret)
+		ret = (char __user *)out - buf;
+	return ret;
+}
+
+static ssize_t kpageidle_write(struct file *file, const char __user *buf,
+			       size_t count, loff_t *ppos)
+{
+	const u64 __user *in = (const u64 __user *)buf;
+	struct page *page;
+	unsigned long pfn, end_pfn;
+	ssize_t ret = 0;
+	u64 idle_bitmap = 0;
+	int bit;
+
+	if (*ppos & KPMMASK || count & KPMMASK)
+		return -EINVAL;
+
+	pfn = *ppos * BITS_PER_BYTE;
+	if (pfn >= max_pfn)
+		return -ENXIO;
+
+	end_pfn = pfn + count * BITS_PER_BYTE;
+	if (end_pfn > max_pfn)
+		end_pfn = ALIGN(max_pfn, KPMBITS);
+
+	for (; pfn < end_pfn; pfn++) {
+		bit = pfn % KPMBITS;
+		if (bit == 0) {
+			if (get_user(idle_bitmap, in)) {
+				ret = -EFAULT;
+				break;
+			}
+			in++;
+		}
+		if (idle_bitmap >> bit & 1) {
+			page = kpageidle_get_page(pfn);
+			if (page) {
+				kpageidle_clear_pte_refs(page);
+				set_page_idle(page);
+				put_page(page);
+			}
+		}
+	}
+
+	*ppos += (const char __user *)in - buf;
+	if (!ret)
+		ret = (const char __user *)in - buf;
+	return ret;
+}
+
+static const struct file_operations proc_kpageidle_operations = {
+	.llseek = mem_lseek,
+	.read = kpageidle_read,
+	.write = kpageidle_write,
+};
+
+#ifndef CONFIG_64BIT
+static bool need_page_idle(void)
+{
+	return true;
+}
+struct page_ext_operations page_idle_ops = {
+	.need = need_page_idle,
+};
+#endif
+#endif /* CONFIG_IDLE_PAGE_TRACKING */
+
 static int __init proc_page_init(void)
 {
 	proc_create("kpagecount", S_IRUSR, NULL, &proc_kpagecount_operations);
@@ -282,6 +496,10 @@ static int __init proc_page_init(void)
 #ifdef CONFIG_MEMCG
 	proc_create("kpagecgroup", S_IRUSR, NULL, &proc_kpagecgroup_operations);
 #endif
+#ifdef CONFIG_IDLE_PAGE_TRACKING
+	proc_create("kpageidle", S_IRUSR | S_IWUSR, NULL,
+		    &proc_kpageidle_operations);
+#endif
 	return 0;
 }
 fs_initcall(proc_page_init);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 3b4d8255e806..3efd7f641f92 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -458,7 +458,7 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,
 
 	mss->resident += size;
 	/* Accumulate the size in pages that have been accessed. */
-	if (young || PageReferenced(page))
+	if (young || page_is_young(page) || PageReferenced(page))
 		mss->referenced += size;
 	mapcount = page_mapcount(page);
 	if (mapcount >= 2) {
@@ -810,6 +810,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 
 		/* Clear accessed and referenced bits. */
 		pmdp_test_and_clear_young(vma, addr, pmd);
+		test_and_clear_page_young(page);
 		ClearPageReferenced(page);
 out:
 		spin_unlock(ptl);
@@ -837,6 +838,7 @@ out:
 
 		/* Clear accessed and referenced bits. */
 		ptep_test_and_clear_young(vma, addr, pte);
+		test_and_clear_page_young(page);
 		ClearPageReferenced(page);
 	}
 	pte_unmap_unlock(pte - 1, ptl);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7f471789781a..de450c1191b9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2205,5 +2205,103 @@ void __init setup_nr_node_ids(void);
 static inline void setup_nr_node_ids(void) {}
 #endif
 
+#ifdef CONFIG_IDLE_PAGE_TRACKING
+#ifdef CONFIG_64BIT
+static inline bool page_is_young(struct page *page)
+{
+	return PageYoung(page);
+}
+
+static inline void set_page_young(struct page *page)
+{
+	SetPageYoung(page);
+}
+
+static inline bool test_and_clear_page_young(struct page *page)
+{
+	return TestClearPageYoung(page);
+}
+
+static inline bool page_is_idle(struct page *page)
+{
+	return PageIdle(page);
+}
+
+static inline void set_page_idle(struct page *page)
+{
+	SetPageIdle(page);
+}
+
+static inline void clear_page_idle(struct page *page)
+{
+	ClearPageIdle(page);
+}
+#else /* !CONFIG_64BIT */
+/*
+ * If there is not enough space to store Idle and Young bits in page flags, use
+ * page ext flags instead.
+ */
+extern struct page_ext_operations page_idle_ops;
+
+static inline bool page_is_young(struct page *page)
+{
+	return test_bit(PAGE_EXT_YOUNG, &lookup_page_ext(page)->flags);
+}
+
+static inline void set_page_young(struct page *page)
+{
+	set_bit(PAGE_EXT_YOUNG, &lookup_page_ext(page)->flags);
+}
+
+static inline bool test_and_clear_page_young(struct page *page)
+{
+	return test_and_clear_bit(PAGE_EXT_YOUNG,
+				  &lookup_page_ext(page)->flags);
+}
+
+static inline bool page_is_idle(struct page *page)
+{
+	return test_bit(PAGE_EXT_IDLE, &lookup_page_ext(page)->flags);
+}
+
+static inline void set_page_idle(struct page *page)
+{
+	set_bit(PAGE_EXT_IDLE, &lookup_page_ext(page)->flags);
+}
+
+static inline void clear_page_idle(struct page *page)
+{
+	clear_bit(PAGE_EXT_IDLE, &lookup_page_ext(page)->flags);
+}
+#endif /* CONFIG_64BIT */
+#else /* !CONFIG_IDLE_PAGE_TRACKING */
+static inline bool page_is_young(struct page *page)
+{
+	return false;
+}
+
+static inline void set_page_young(struct page *page)
+{
+}
+
+static inline bool test_and_clear_page_young(struct page *page)
+{
+	return false;
+}
+
+static inline bool page_is_idle(struct page *page)
+{
+	return false;
+}
+
+static inline void set_page_idle(struct page *page)
+{
+}
+
+static inline void clear_page_idle(struct page *page)
+{
+}
+#endif /* CONFIG_IDLE_PAGE_TRACKING */
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 91b7f9b2b774..478f2241f284 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -109,6 +109,10 @@ enum pageflags {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	PG_compound_lock,
 #endif
+#if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
+	PG_young,
+	PG_idle,
+#endif
 	__NR_PAGEFLAGS,
 
 	/* Filesystems */
@@ -363,6 +367,13 @@ PAGEFLAG_FALSE(HWPoison)
 #define __PG_HWPOISON 0
 #endif
 
+#if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
+TESTPAGEFLAG(Young, young, PF_ANY)
+SETPAGEFLAG(Young, young, PF_ANY)
+TESTCLEARFLAG(Young, young, PF_ANY)
+PAGEFLAG(Idle, idle, PF_ANY)
+#endif
+
 /*
  * On an anonymous page mapped into a user virtual memory area,
  * page->mapping points to its anon_vma, not to a struct address_space;
diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
index c42981cd99aa..17f118a82854 100644
--- a/include/linux/page_ext.h
+++ b/include/linux/page_ext.h
@@ -26,6 +26,10 @@ enum page_ext_flags {
 	PAGE_EXT_DEBUG_POISON,		/* Page is poisoned */
 	PAGE_EXT_DEBUG_GUARD,
 	PAGE_EXT_OWNER,
+#if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
+	PAGE_EXT_YOUNG,
+	PAGE_EXT_IDLE,
+#endif
 };
 
 /*
diff --git a/mm/Kconfig b/mm/Kconfig
index e79de2bd12cd..db817e2c2ec8 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -654,3 +654,15 @@ config DEFERRED_STRUCT_PAGE_INIT
 	  when kswapd starts. This has a potential performance impact on
 	  processes running early in the lifetime of the systemm until kswapd
 	  finishes the initialisation.
+
+config IDLE_PAGE_TRACKING
+	bool "Enable idle page tracking"
+	select PROC_PAGE_MONITOR
+	select PAGE_EXTENSION if !64BIT
+	help
+	  This feature allows to estimate the amount of user pages that have
+	  not been touched during a given period of time. This information can
+	  be useful to tune memory cgroup limits and/or for job placement
+	  within a compute cluster.
+
+	  See Documentation/vm/pagemap.txt for more details.
diff --git a/mm/debug.c b/mm/debug.c
index 76089ddf99ea..6c1b3ea61bfd 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -48,6 +48,10 @@ static const struct trace_print_flags pageflag_names[] = {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	{1UL << PG_compound_lock,	"compound_lock"	},
 #endif
+#if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
+	{1UL << PG_young,		"young"		},
+	{1UL << PG_idle,		"idle"		},
+#endif
 };
 
 static void dump_flags(unsigned long flags,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9671f51e954d..bb6d2ec1f268 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1754,6 +1754,11 @@ static void __split_huge_page_refcount(struct page *page,
 		/* clear PageTail before overwriting first_page */
 		smp_wmb();
 
+		if (page_is_young(page))
+			set_page_young(page_tail);
+		if (page_is_idle(page))
+			set_page_idle(page_tail);
+
 		/*
 		 * __split_huge_page_splitting() already set the
 		 * splitting bit in all pmd that could map this
@@ -2259,7 +2264,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 
 		/* If there is no mapped pte young don't collapse the page */
-		if (pte_young(pteval) || PageReferenced(page) ||
+		if (pte_young(pteval) ||
+		    page_is_young(page) || PageReferenced(page) ||
 		    mmu_notifier_test_young(vma->vm_mm, address))
 			referenced = true;
 	}
@@ -2686,7 +2692,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 		 */
 		if (page_count(page) != 1 + !!PageSwapCache(page))
 			goto out_unmap;
-		if (pte_young(pteval) || PageReferenced(page) ||
+		if (pte_young(pteval) ||
+		    page_is_young(page) || PageReferenced(page) ||
 		    mmu_notifier_test_young(vma->vm_mm, address))
 			referenced = true;
 	}
diff --git a/mm/migrate.c b/mm/migrate.c
index 236ee25e79d9..3e7bb4f2b51c 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -524,6 +524,11 @@ void migrate_page_copy(struct page *newpage, struct page *page)
 			__set_page_dirty_nobuffers(newpage);
  	}
 
+	if (page_is_young(page))
+		set_page_young(newpage);
+	if (page_is_idle(page))
+		set_page_idle(newpage);
+
 	/*
 	 * Copy NUMA information to the new page, to prevent over-eager
 	 * future migrations of this same page.
diff --git a/mm/page_ext.c b/mm/page_ext.c
index d86fd2f5353f..e4b3af054bf2 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -59,6 +59,9 @@ static struct page_ext_operations *page_ext_ops[] = {
 #ifdef CONFIG_PAGE_OWNER
 	&page_owner_ops,
 #endif
+#if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
+	&page_idle_ops,
+#endif
 };
 
 static unsigned long total_usage;
diff --git a/mm/rmap.c b/mm/rmap.c
index 49b244b1f18c..c96677ade3d1 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -798,6 +798,11 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 		pte_unmap_unlock(pte, ptl);
 	}
 
+	if (referenced)
+		clear_page_idle(page);
+	if (test_and_clear_page_young(page))
+		referenced++;
+
 	if (referenced) {
 		pra->referenced++;
 		pra->vm_flags |= vma->vm_flags;
diff --git a/mm/swap.c b/mm/swap.c
index ab7c338eda87..db43c9b4891d 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -623,6 +623,8 @@ void mark_page_accessed(struct page *page)
 	} else if (!PageReferenced(page)) {
 		SetPageReferenced(page);
 	}
+	if (page_is_idle(page))
+		clear_page_idle(page);
 }
 EXPORT_SYMBOL(mark_page_accessed);
 
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH -mm v8 6/7] proc: add kpageidle file
@ 2015-07-15 13:54   ` Vladimir Davydov
  0 siblings, 0 replies; 52+ messages in thread
From: Vladimir Davydov @ 2015-07-15 13:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T,
	Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse,
	David Rientjes, Pavel Emelyanov, Cyrill Gorcunov,
	Jonathan Corbet, linux-api, linux-doc, linux-mm, cgroups,
	linux-kernel

Knowing the portion of memory that is not used by a certain application
or memory cgroup (idle memory) can be useful for partitioning the system
efficiently, e.g. by setting memory cgroup limits appropriately.
Currently, the only means the kernel provides for estimating the amount
of idle memory is /proc/PID/{clear_refs,smaps}: the user can clear the
access bit for all pages mapped to a particular process by writing 1 to
clear_refs, wait for some time, and then count smaps:Referenced.
However, this method has two serious shortcomings:

 - it does not count unmapped file pages
 - it affects the reclaimer logic

To overcome these drawbacks, this patch introduces two new page flags,
Idle and Young, and a new proc file, /proc/kpageidle. A page's Idle flag
can only be set from userspace by setting the bit in /proc/kpageidle at the
offset corresponding to the page, and it is cleared whenever the page is
accessed either through page tables (it is cleared in page_referenced()
in this case) or using the read(2) system call (mark_page_accessed()).
Thus by setting the Idle flag for pages of a particular workload, which
can be found e.g. by reading /proc/PID/pagemap, waiting for some time to
let the workload access its working set, and then reading the kpageidle
file, one can estimate the number of pages that are not used by the
workload.
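
For illustration only, a minimal userspace sketch of this
mark/wait/recheck cycle could look like the following (the helper is
purely hypothetical, the PFN range and sleep interval are arbitrary,
both ends of the range are assumed to be multiples of 64 so that
accesses stay u64-aligned, and error handling is mostly omitted):

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Mark pfns [start, end) idle, give the workload some time to touch its
 * working set, then count how many of those pages are still idle.
 * Requires root, since /proc/kpageidle is only accessible by its owner.
 */
static long count_idle_pages(unsigned long start, unsigned long end,
                             unsigned int secs)
{
    int fd = open("/proc/kpageidle", O_RDWR);
    uint64_t bits;
    unsigned long pfn;
    long idle = 0;

    if (fd < 0)
        return -1;

    /* Each u64 of the bitmap covers 64 pages, so the offset is pfn / 8. */
    for (pfn = start; pfn < end; pfn += 64) {
        bits = ~0ULL;                 /* mark these 64 pages idle */
        pwrite(fd, &bits, sizeof(bits), pfn / 8);
    }

    sleep(secs);                      /* let the workload run */

    for (pfn = start; pfn < end; pfn += 64) {
        if (pread(fd, &bits, sizeof(bits), pfn / 8) != sizeof(bits))
            break;
        idle += __builtin_popcountll(bits);
    }

    close(fd);
    return idle;
}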

The Young page flag is used to avoid interference with the memory
reclaimer. A page's Young flag is set whenever the Access bit of a page
table entry pointing to the page is cleared by writing to kpageidle. If
page_referenced() is called on a Young page, it will add 1 to its return
value, therefore concealing the fact that the Access bit was cleared.

Note that, since there is no room for extra page flags on 32 bit, this
feature uses extended page flags when compiled for 32 bit.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 Documentation/vm/pagemap.txt |  12 ++-
 fs/proc/page.c               | 218 +++++++++++++++++++++++++++++++++++++++++++
 fs/proc/task_mmu.c           |   4 +-
 include/linux/mm.h           |  98 +++++++++++++++++++
 include/linux/page-flags.h   |  11 +++
 include/linux/page_ext.h     |   4 +
 mm/Kconfig                   |  12 +++
 mm/debug.c                   |   4 +
 mm/huge_memory.c             |  11 ++-
 mm/migrate.c                 |   5 +
 mm/page_ext.c                |   3 +
 mm/rmap.c                    |   5 +
 mm/swap.c                    |   2 +
 13 files changed, 385 insertions(+), 4 deletions(-)

diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index a9b7afc8fbc6..c9266340852c 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -5,7 +5,7 @@ pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow
 userspace programs to examine the page tables and related information by
 reading files in /proc.
 
-There are four components to pagemap:
+There are five components to pagemap:
 
  * /proc/pid/pagemap.  This file lets a userspace process find out which
    physical frame each virtual page is mapped to.  It contains one 64-bit
@@ -69,6 +69,16 @@ There are four components to pagemap:
    memory cgroup each page is charged to, indexed by PFN. Only available when
    CONFIG_MEMCG is set.
 
+ * /proc/kpageidle.  This file implements a bitmap where each bit corresponds
+   to a page, indexed by PFN. When the bit is set, the corresponding page is
+   idle. A page is considered idle if it has not been accessed since it was
+   marked idle. To mark a page idle one should set the bit corresponding to the
+   page by writing to the file. A value written to the file is OR-ed with the
+   current bitmap value. Only user memory pages can be marked idle, for other
+   page types input is silently ignored. Writing to this file beyond max PFN
+   results in the ENXIO error. Only available when CONFIG_IDLE_PAGE_TRACKING is
+   set.
+
 Short descriptions to the page flags:
 
  0. LOCKED
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 70d23245dd43..273537885ab4 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -5,6 +5,8 @@
 #include <linux/ksm.h>
 #include <linux/mm.h>
 #include <linux/mmzone.h>
+#include <linux/rmap.h>
+#include <linux/mmu_notifier.h>
 #include <linux/huge_mm.h>
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
@@ -16,6 +18,7 @@
 
 #define KPMSIZE sizeof(u64)
 #define KPMMASK (KPMSIZE - 1)
+#define KPMBITS (KPMSIZE * BITS_PER_BYTE)
 
 /* /proc/kpagecount - an array exposing page counts
  *
@@ -275,6 +278,217 @@ static const struct file_operations proc_kpagecgroup_operations = {
 };
 #endif /* CONFIG_MEMCG */
 
+#ifdef CONFIG_IDLE_PAGE_TRACKING
+/*
+ * Idle page tracking only considers user memory pages, for other types of
+ * pages the idle flag is always unset and an attempt to set it is silently
+ * ignored.
+ *
+ * We treat a page as a user memory page if it is on an LRU list, because it is
+ * always safe to pass such a page to rmap_walk(), which is essential for idle
+ * page tracking. With such an indicator of user pages we can skip isolated
+ * pages, but since there are not usually many of them, it will hardly affect
+ * the overall result.
+ *
+ * This function tries to get a user memory page by pfn as described above.
+ */
+static struct page *kpageidle_get_page(unsigned long pfn)
+{
+	struct page *page;
+	struct zone *zone;
+
+	if (!pfn_valid(pfn))
+		return NULL;
+
+	page = pfn_to_page(pfn);
+	if (!page || !PageLRU(page) ||
+	    !get_page_unless_zero(page))
+		return NULL;
+
+	zone = page_zone(page);
+	spin_lock_irq(&zone->lru_lock);
+	if (unlikely(!PageLRU(page))) {
+		put_page(page);
+		page = NULL;
+	}
+	spin_unlock_irq(&zone->lru_lock);
+	return page;
+}
+
+static int kpageidle_clear_pte_refs_one(struct page *page,
+					struct vm_area_struct *vma,
+					unsigned long addr, void *arg)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	spinlock_t *ptl;
+	pmd_t *pmd;
+	pte_t *pte;
+	bool referenced = false;
+
+	if (unlikely(PageTransHuge(page))) {
+		pmd = page_check_address_pmd(page, mm, addr,
+					     PAGE_CHECK_ADDRESS_PMD_FLAG, &ptl);
+		if (pmd) {
+			referenced = pmdp_clear_young_notify(vma, addr, pmd);
+			spin_unlock(ptl);
+		}
+	} else {
+		pte = page_check_address(page, mm, addr, &ptl, 0);
+		if (pte) {
+			referenced = ptep_clear_young_notify(vma, addr, pte);
+			pte_unmap_unlock(pte, ptl);
+		}
+	}
+	if (referenced) {
+		clear_page_idle(page);
+		/*
+		 * We cleared the referenced bit in a mapping to this page. To
+		 * avoid interference with page reclaim, mark it young so that
+		 * page_referenced() will return > 0.
+		 */
+		set_page_young(page);
+	}
+	return SWAP_AGAIN;
+}
+
+static void kpageidle_clear_pte_refs(struct page *page)
+{
+	struct rmap_walk_control rwc = {
+		.rmap_one = kpageidle_clear_pte_refs_one,
+		.anon_lock = page_lock_anon_vma_read,
+	};
+	bool need_lock;
+
+	if (!page_mapped(page) ||
+	    !page_rmapping(page))
+		return;
+
+	need_lock = !PageAnon(page) || PageKsm(page);
+	if (need_lock && !trylock_page(page))
+		return;
+
+	rmap_walk(page, &rwc);
+
+	if (need_lock)
+		unlock_page(page);
+}
+
+static ssize_t kpageidle_read(struct file *file, char __user *buf,
+			      size_t count, loff_t *ppos)
+{
+	u64 __user *out = (u64 __user *)buf;
+	struct page *page;
+	unsigned long pfn, end_pfn;
+	ssize_t ret = 0;
+	u64 idle_bitmap = 0;
+	int bit;
+
+	if (*ppos & KPMMASK || count & KPMMASK)
+		return -EINVAL;
+
+	pfn = *ppos * BITS_PER_BYTE;
+	if (pfn >= max_pfn)
+		return 0;
+
+	end_pfn = pfn + count * BITS_PER_BYTE;
+	if (end_pfn > max_pfn)
+		end_pfn = ALIGN(max_pfn, KPMBITS);
+
+	for (; pfn < end_pfn; pfn++) {
+		bit = pfn % KPMBITS;
+		page = kpageidle_get_page(pfn);
+		if (page) {
+			if (page_is_idle(page)) {
+				/*
+				 * The page might have been referenced via a
+				 * pte, in which case it is not idle. Clear
+				 * refs and recheck.
+				 */
+				kpageidle_clear_pte_refs(page);
+				if (page_is_idle(page))
+					idle_bitmap |= 1ULL << bit;
+			}
+			put_page(page);
+		}
+		if (bit == KPMBITS - 1) {
+			if (put_user(idle_bitmap, out)) {
+				ret = -EFAULT;
+				break;
+			}
+			idle_bitmap = 0;
+			out++;
+		}
+	}
+
+	*ppos += (char __user *)out - buf;
+	if (!ret)
+		ret = (char __user *)out - buf;
+	return ret;
+}
+
+static ssize_t kpageidle_write(struct file *file, const char __user *buf,
+			       size_t count, loff_t *ppos)
+{
+	const u64 __user *in = (const u64 __user *)buf;
+	struct page *page;
+	unsigned long pfn, end_pfn;
+	ssize_t ret = 0;
+	u64 idle_bitmap = 0;
+	int bit;
+
+	if (*ppos & KPMMASK || count & KPMMASK)
+		return -EINVAL;
+
+	pfn = *ppos * BITS_PER_BYTE;
+	if (pfn >= max_pfn)
+		return -ENXIO;
+
+	end_pfn = pfn + count * BITS_PER_BYTE;
+	if (end_pfn > max_pfn)
+		end_pfn = ALIGN(max_pfn, KPMBITS);
+
+	for (; pfn < end_pfn; pfn++) {
+		bit = pfn % KPMBITS;
+		if (bit == 0) {
+			if (get_user(idle_bitmap, in)) {
+				ret = -EFAULT;
+				break;
+			}
+			in++;
+		}
+		if (idle_bitmap >> bit & 1) {
+			page = kpageidle_get_page(pfn);
+			if (page) {
+				kpageidle_clear_pte_refs(page);
+				set_page_idle(page);
+				put_page(page);
+			}
+		}
+	}
+
+	*ppos += (const char __user *)in - buf;
+	if (!ret)
+		ret = (const char __user *)in - buf;
+	return ret;
+}
+
+static const struct file_operations proc_kpageidle_operations = {
+	.llseek = mem_lseek,
+	.read = kpageidle_read,
+	.write = kpageidle_write,
+};
+
+#ifndef CONFIG_64BIT
+static bool need_page_idle(void)
+{
+	return true;
+}
+struct page_ext_operations page_idle_ops = {
+	.need = need_page_idle,
+};
+#endif
+#endif /* CONFIG_IDLE_PAGE_TRACKING */
+
 static int __init proc_page_init(void)
 {
 	proc_create("kpagecount", S_IRUSR, NULL, &proc_kpagecount_operations);
@@ -282,6 +496,10 @@ static int __init proc_page_init(void)
 #ifdef CONFIG_MEMCG
 	proc_create("kpagecgroup", S_IRUSR, NULL, &proc_kpagecgroup_operations);
 #endif
+#ifdef CONFIG_IDLE_PAGE_TRACKING
+	proc_create("kpageidle", S_IRUSR | S_IWUSR, NULL,
+		    &proc_kpageidle_operations);
+#endif
 	return 0;
 }
 fs_initcall(proc_page_init);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 3b4d8255e806..3efd7f641f92 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -458,7 +458,7 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,
 
 	mss->resident += size;
 	/* Accumulate the size in pages that have been accessed. */
-	if (young || PageReferenced(page))
+	if (young || page_is_young(page) || PageReferenced(page))
 		mss->referenced += size;
 	mapcount = page_mapcount(page);
 	if (mapcount >= 2) {
@@ -810,6 +810,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 
 		/* Clear accessed and referenced bits. */
 		pmdp_test_and_clear_young(vma, addr, pmd);
+		test_and_clear_page_young(page);
 		ClearPageReferenced(page);
 out:
 		spin_unlock(ptl);
@@ -837,6 +838,7 @@ out:
 
 		/* Clear accessed and referenced bits. */
 		ptep_test_and_clear_young(vma, addr, pte);
+		test_and_clear_page_young(page);
 		ClearPageReferenced(page);
 	}
 	pte_unmap_unlock(pte - 1, ptl);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7f471789781a..de450c1191b9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2205,5 +2205,103 @@ void __init setup_nr_node_ids(void);
 static inline void setup_nr_node_ids(void) {}
 #endif
 
+#ifdef CONFIG_IDLE_PAGE_TRACKING
+#ifdef CONFIG_64BIT
+static inline bool page_is_young(struct page *page)
+{
+	return PageYoung(page);
+}
+
+static inline void set_page_young(struct page *page)
+{
+	SetPageYoung(page);
+}
+
+static inline bool test_and_clear_page_young(struct page *page)
+{
+	return TestClearPageYoung(page);
+}
+
+static inline bool page_is_idle(struct page *page)
+{
+	return PageIdle(page);
+}
+
+static inline void set_page_idle(struct page *page)
+{
+	SetPageIdle(page);
+}
+
+static inline void clear_page_idle(struct page *page)
+{
+	ClearPageIdle(page);
+}
+#else /* !CONFIG_64BIT */
+/*
+ * If there is not enough space to store Idle and Young bits in page flags, use
+ * page ext flags instead.
+ */
+extern struct page_ext_operations page_idle_ops;
+
+static inline bool page_is_young(struct page *page)
+{
+	return test_bit(PAGE_EXT_YOUNG, &lookup_page_ext(page)->flags);
+}
+
+static inline void set_page_young(struct page *page)
+{
+	set_bit(PAGE_EXT_YOUNG, &lookup_page_ext(page)->flags);
+}
+
+static inline bool test_and_clear_page_young(struct page *page)
+{
+	return test_and_clear_bit(PAGE_EXT_YOUNG,
+				  &lookup_page_ext(page)->flags);
+}
+
+static inline bool page_is_idle(struct page *page)
+{
+	return test_bit(PAGE_EXT_IDLE, &lookup_page_ext(page)->flags);
+}
+
+static inline void set_page_idle(struct page *page)
+{
+	set_bit(PAGE_EXT_IDLE, &lookup_page_ext(page)->flags);
+}
+
+static inline void clear_page_idle(struct page *page)
+{
+	clear_bit(PAGE_EXT_IDLE, &lookup_page_ext(page)->flags);
+}
+#endif /* CONFIG_64BIT */
+#else /* !CONFIG_IDLE_PAGE_TRACKING */
+static inline bool page_is_young(struct page *page)
+{
+	return false;
+}
+
+static inline void set_page_young(struct page *page)
+{
+}
+
+static inline bool test_and_clear_page_young(struct page *page)
+{
+	return false;
+}
+
+static inline bool page_is_idle(struct page *page)
+{
+	return false;
+}
+
+static inline void set_page_idle(struct page *page)
+{
+}
+
+static inline void clear_page_idle(struct page *page)
+{
+}
+#endif /* CONFIG_IDLE_PAGE_TRACKING */
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 91b7f9b2b774..478f2241f284 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -109,6 +109,10 @@ enum pageflags {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	PG_compound_lock,
 #endif
+#if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
+	PG_young,
+	PG_idle,
+#endif
 	__NR_PAGEFLAGS,
 
 	/* Filesystems */
@@ -363,6 +367,13 @@ PAGEFLAG_FALSE(HWPoison)
 #define __PG_HWPOISON 0
 #endif
 
+#if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
+TESTPAGEFLAG(Young, young, PF_ANY)
+SETPAGEFLAG(Young, young, PF_ANY)
+TESTCLEARFLAG(Young, young, PF_ANY)
+PAGEFLAG(Idle, idle, PF_ANY)
+#endif
+
 /*
  * On an anonymous page mapped into a user virtual memory area,
  * page->mapping points to its anon_vma, not to a struct address_space;
diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
index c42981cd99aa..17f118a82854 100644
--- a/include/linux/page_ext.h
+++ b/include/linux/page_ext.h
@@ -26,6 +26,10 @@ enum page_ext_flags {
 	PAGE_EXT_DEBUG_POISON,		/* Page is poisoned */
 	PAGE_EXT_DEBUG_GUARD,
 	PAGE_EXT_OWNER,
+#if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
+	PAGE_EXT_YOUNG,
+	PAGE_EXT_IDLE,
+#endif
 };
 
 /*
diff --git a/mm/Kconfig b/mm/Kconfig
index e79de2bd12cd..db817e2c2ec8 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -654,3 +654,15 @@ config DEFERRED_STRUCT_PAGE_INIT
 	  when kswapd starts. This has a potential performance impact on
 	  processes running early in the lifetime of the systemm until kswapd
 	  finishes the initialisation.
+
+config IDLE_PAGE_TRACKING
+	bool "Enable idle page tracking"
+	select PROC_PAGE_MONITOR
+	select PAGE_EXTENSION if !64BIT
+	help
+	  This feature allows to estimate the amount of user pages that have
+	  not been touched during a given period of time. This information can
+	  be useful to tune memory cgroup limits and/or for job placement
+	  within a compute cluster.
+
+	  See Documentation/vm/pagemap.txt for more details.
diff --git a/mm/debug.c b/mm/debug.c
index 76089ddf99ea..6c1b3ea61bfd 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -48,6 +48,10 @@ static const struct trace_print_flags pageflag_names[] = {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	{1UL << PG_compound_lock,	"compound_lock"	},
 #endif
+#if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
+	{1UL << PG_young,		"young"		},
+	{1UL << PG_idle,		"idle"		},
+#endif
 };
 
 static void dump_flags(unsigned long flags,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9671f51e954d..bb6d2ec1f268 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1754,6 +1754,11 @@ static void __split_huge_page_refcount(struct page *page,
 		/* clear PageTail before overwriting first_page */
 		smp_wmb();
 
+		if (page_is_young(page))
+			set_page_young(page_tail);
+		if (page_is_idle(page))
+			set_page_idle(page_tail);
+
 		/*
 		 * __split_huge_page_splitting() already set the
 		 * splitting bit in all pmd that could map this
@@ -2259,7 +2264,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 
 		/* If there is no mapped pte young don't collapse the page */
-		if (pte_young(pteval) || PageReferenced(page) ||
+		if (pte_young(pteval) ||
+		    page_is_young(page) || PageReferenced(page) ||
 		    mmu_notifier_test_young(vma->vm_mm, address))
 			referenced = true;
 	}
@@ -2686,7 +2692,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 		 */
 		if (page_count(page) != 1 + !!PageSwapCache(page))
 			goto out_unmap;
-		if (pte_young(pteval) || PageReferenced(page) ||
+		if (pte_young(pteval) ||
+		    page_is_young(page) || PageReferenced(page) ||
 		    mmu_notifier_test_young(vma->vm_mm, address))
 			referenced = true;
 	}
diff --git a/mm/migrate.c b/mm/migrate.c
index 236ee25e79d9..3e7bb4f2b51c 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -524,6 +524,11 @@ void migrate_page_copy(struct page *newpage, struct page *page)
 			__set_page_dirty_nobuffers(newpage);
  	}
 
+	if (page_is_young(page))
+		set_page_young(newpage);
+	if (page_is_idle(page))
+		set_page_idle(newpage);
+
 	/*
 	 * Copy NUMA information to the new page, to prevent over-eager
 	 * future migrations of this same page.
diff --git a/mm/page_ext.c b/mm/page_ext.c
index d86fd2f5353f..e4b3af054bf2 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -59,6 +59,9 @@ static struct page_ext_operations *page_ext_ops[] = {
 #ifdef CONFIG_PAGE_OWNER
 	&page_owner_ops,
 #endif
+#if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
+	&page_idle_ops,
+#endif
 };
 
 static unsigned long total_usage;
diff --git a/mm/rmap.c b/mm/rmap.c
index 49b244b1f18c..c96677ade3d1 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -798,6 +798,11 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 		pte_unmap_unlock(pte, ptl);
 	}
 
+	if (referenced)
+		clear_page_idle(page);
+	if (test_and_clear_page_young(page))
+		referenced++;
+
 	if (referenced) {
 		pra->referenced++;
 		pra->vm_flags |= vma->vm_flags;
diff --git a/mm/swap.c b/mm/swap.c
index ab7c338eda87..db43c9b4891d 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -623,6 +623,8 @@ void mark_page_accessed(struct page *page)
 	} else if (!PageReferenced(page)) {
 		SetPageReferenced(page);
 	}
+	if (page_is_idle(page))
+		clear_page_idle(page);
 }
 EXPORT_SYMBOL(mark_page_accessed);
 
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH -mm v8 7/7] proc: export idle flag via kpageflags
  2015-07-15 13:54 ` Vladimir Davydov
@ 2015-07-15 13:54   ` Vladimir Davydov
  -1 siblings, 0 replies; 52+ messages in thread
From: Vladimir Davydov @ 2015-07-15 13:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T,
	Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse,
	David Rientjes, Pavel Emelyanov, Cyrill Gorcunov,
	Jonathan Corbet, linux-api, linux-doc, linux-mm, cgroups,
	linux-kernel

As noted by Minchan, a benefit of reading the idle flag from
/proc/kpageflags is that one can easily filter out dirty and/or
unevictable pages while estimating the size of unused memory.

Note that the idle flag read from /proc/kpageflags may be stale if the
page was accessed via a PTE, because it would be too costly to iterate
over all page mappings on each /proc/kpageflags read to provide an
up-to-date value. To make sure the flag is up to date, one has to read
/proc/kpageidle first.
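
For instance, a minimal (purely illustrative) sketch of such a check for
a single page could look like the following, with KPF_IDLE being bit 25
as defined below, a hypothetical helper name, and error handling mostly
omitted:

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

#define KPF_IDLE 25

/* Return 1 if @pfn is idle, 0 if it is not, -1 on error. */
static int pfn_is_idle(unsigned long pfn)
{
    int idle_fd = open("/proc/kpageidle", O_RDONLY);
    int flags_fd = open("/proc/kpageflags", O_RDONLY);
    uint64_t bits, flags;
    int ret = -1;

    if (idle_fd < 0 || flags_fd < 0)
        goto out;

    /*
     * Reading the u64 of /proc/kpageidle that covers @pfn rechecks the
     * page table references, so the flag read below is up to date.
     */
    if (pread(idle_fd, &bits, sizeof(bits), pfn / 64 * 8) != sizeof(bits))
        goto out;

    /* /proc/kpageflags contains one u64 of flags per page. */
    if (pread(flags_fd, &flags, sizeof(flags), pfn * 8) != sizeof(flags))
        goto out;

    ret = flags >> KPF_IDLE & 1;
out:
    if (idle_fd >= 0)
        close(idle_fd);
    if (flags_fd >= 0)
        close(flags_fd);
    return ret;
}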

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 Documentation/vm/pagemap.txt           | 6 ++++++
 fs/proc/page.c                         | 3 +++
 include/uapi/linux/kernel-page-flags.h | 1 +
 3 files changed, 10 insertions(+)

diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index c9266340852c..5896b7d7fd74 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -64,6 +64,7 @@ There are five components to pagemap:
     22. THP
     23. BALLOON
     24. ZERO_PAGE
+    25. IDLE
 
  * /proc/kpagecgroup.  This file contains a 64-bit inode number of the
    memory cgroup each page is charged to, indexed by PFN. Only available when
@@ -124,6 +125,11 @@ Short descriptions to the page flags:
 24. ZERO_PAGE
     zero page for pfn_zero or huge_zero page
 
+25. IDLE
+    page has not been accessed since it was marked idle (see /proc/kpageidle)
+    Note that this flag may be stale in case the page was accessed via a PTE.
+    To make sure the flag is up-to-date one has to read /proc/kpageidle first.
+
     [IO related page flags]
  1. ERROR     IO error occurred
  3. UPTODATE  page has up-to-date data
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 273537885ab4..13dcb823fe4e 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -150,6 +150,9 @@ u64 stable_page_flags(struct page *page)
 	if (PageBalloon(page))
 		u |= 1 << KPF_BALLOON;
 
+	if (page_is_idle(page))
+		u |= 1 << KPF_IDLE;
+
 	u |= kpf_copy_bit(k, KPF_LOCKED,	PG_locked);
 
 	u |= kpf_copy_bit(k, KPF_SLAB,		PG_slab);
diff --git a/include/uapi/linux/kernel-page-flags.h b/include/uapi/linux/kernel-page-flags.h
index a6c4962e5d46..5da5f8751ce7 100644
--- a/include/uapi/linux/kernel-page-flags.h
+++ b/include/uapi/linux/kernel-page-flags.h
@@ -33,6 +33,7 @@
 #define KPF_THP			22
 #define KPF_BALLOON		23
 #define KPF_ZERO_PAGE		24
+#define KPF_IDLE		25
 
 
 #endif /* _UAPILINUX_KERNEL_PAGE_FLAGS_H */
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH -mm v8 1/7] memcg: add page_cgroup_ino helper
@ 2015-07-15 18:59     ` Andres Lagar-Cavilla
  0 siblings, 0 replies; 52+ messages in thread
From: Andres Lagar-Cavilla @ 2015-07-15 18:59 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, Minchan Kim, Raghavendra K T, Johannes Weiner,
	Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes,
	Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linux-api,
	linux-doc, linux-mm, cgroups, linux-kernel

On Wed, Jul 15, 2015 at 6:54 AM, Vladimir Davydov
<vdavydov@parallels.com> wrote:
> This function returns the inode number of the closest online ancestor of
> the memory cgroup a page is charged to. It is required for exporting
> information about which page is charged to which cgroup to userspace,
> which will be introduced by a following patch.
>
> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>

Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>

> ---
>  include/linux/memcontrol.h |  1 +
>  mm/memcontrol.c            | 23 +++++++++++++++++++++++
>  2 files changed, 24 insertions(+)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 73b02b0a8f60..50069abebc3c 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -116,6 +116,7 @@ static inline bool mm_match_cgroup(struct mm_struct *mm,
>
>  extern struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *memcg);
>  extern struct cgroup_subsys_state *mem_cgroup_css_from_page(struct page *page);
> +extern unsigned long page_cgroup_ino(struct page *page);
>
>  struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
>                                    struct mem_cgroup *,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index acb93c554f6e..894dc2169979 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -631,6 +631,29 @@ struct cgroup_subsys_state *mem_cgroup_css_from_page(struct page *page)
>         return &memcg->css;
>  }
>
> +/**
> + * page_cgroup_ino - return inode number of the memcg a page is charged to
> + * @page: the page
> + *
> + * Look up the closest online ancestor of the memory cgroup @page is charged to
> + * and return its inode number or 0 if @page is not charged to any cgroup. It
> + * is safe to call this function without holding a reference to @page.
> + */
> +unsigned long page_cgroup_ino(struct page *page)
> +{
> +       struct mem_cgroup *memcg;
> +       unsigned long ino = 0;
> +
> +       rcu_read_lock();
> +       memcg = READ_ONCE(page->mem_cgroup);
> +       while (memcg && !(memcg->css.flags & CSS_ONLINE))
> +               memcg = parent_mem_cgroup(memcg);
> +       if (memcg)
> +               ino = cgroup_ino(memcg->css.cgroup);
> +       rcu_read_unlock();
> +       return ino;
> +}
> +
>  static struct mem_cgroup_per_zone *
>  mem_cgroup_page_zoneinfo(struct mem_cgroup *memcg, struct page *page)
>  {
> --
> 2.1.4
>



-- 
Andres Lagar-Cavilla | Google Kernel Team | andreslc@google.com

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH -mm v8 2/7] hwpoison: use page_cgroup_ino for filtering by memcg
@ 2015-07-15 19:00     ` Andres Lagar-Cavilla
  0 siblings, 0 replies; 52+ messages in thread
From: Andres Lagar-Cavilla @ 2015-07-15 19:00 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, Minchan Kim, Raghavendra K T, Johannes Weiner,
	Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes,
	Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linux-api,
	linux-doc, linux-mm, cgroups, linux-kernel

On Wed, Jul 15, 2015 at 6:54 AM, Vladimir Davydov
<vdavydov@parallels.com> wrote:
> Hwpoison allows to filter pages by memory cgroup ino. Currently, it
> calls try_get_mem_cgroup_from_page to obtain the cgroup from a page and
> then its ino using cgroup_ino, but now we have an apter method for that,
> page_cgroup_ino, so use it instead.
>
> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>

Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>

> ---
>  mm/hwpoison-inject.c |  5 +----
>  mm/memory-failure.c  | 16 ++--------------
>  2 files changed, 3 insertions(+), 18 deletions(-)
>
> diff --git a/mm/hwpoison-inject.c b/mm/hwpoison-inject.c
> index bf73ac17dad4..5015679014c1 100644
> --- a/mm/hwpoison-inject.c
> +++ b/mm/hwpoison-inject.c
> @@ -45,12 +45,9 @@ static int hwpoison_inject(void *data, u64 val)
>         /*
>          * do a racy check with elevated page count, to make sure PG_hwpoison
>          * will only be set for the targeted owner (or on a free page).
> -        * We temporarily take page lock for try_get_mem_cgroup_from_page().
>          * memory_failure() will redo the check reliably inside page lock.
>          */
> -       lock_page(hpage);
>         err = hwpoison_filter(hpage);
> -       unlock_page(hpage);
>         if (err)
>                 goto put_out;
>
> @@ -126,7 +123,7 @@ static int pfn_inject_init(void)
>         if (!dentry)
>                 goto fail;
>
> -#ifdef CONFIG_MEMCG_SWAP
> +#ifdef CONFIG_MEMCG
>         dentry = debugfs_create_u64("corrupt-filter-memcg", 0600,
>                                     hwpoison_dir, &hwpoison_filter_memcg);
>         if (!dentry)
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 1cf7f2988422..97005396a507 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -130,27 +130,15 @@ static int hwpoison_filter_flags(struct page *p)
>   * can only guarantee that the page either belongs to the memcg tasks, or is
>   * a freed page.
>   */
> -#ifdef CONFIG_MEMCG_SWAP
> +#ifdef CONFIG_MEMCG
>  u64 hwpoison_filter_memcg;
>  EXPORT_SYMBOL_GPL(hwpoison_filter_memcg);
>  static int hwpoison_filter_task(struct page *p)
>  {
> -       struct mem_cgroup *mem;
> -       struct cgroup_subsys_state *css;
> -       unsigned long ino;
> -
>         if (!hwpoison_filter_memcg)
>                 return 0;
>
> -       mem = try_get_mem_cgroup_from_page(p);
> -       if (!mem)
> -               return -EINVAL;
> -
> -       css = mem_cgroup_css(mem);
> -       ino = cgroup_ino(css->cgroup);
> -       css_put(css);
> -
> -       if (ino != hwpoison_filter_memcg)
> +       if (page_cgroup_ino(p) != hwpoison_filter_memcg)
>                 return -EINVAL;
>
>         return 0;
> --
> 2.1.4
>



-- 
Andres Lagar-Cavilla | Google Kernel Team | andreslc@google.com

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH -mm v8 4/7] proc: add kpagecgroup file
  2015-07-15 13:54   ` Vladimir Davydov
@ 2015-07-15 19:03     ` Andres Lagar-Cavilla
  -1 siblings, 0 replies; 52+ messages in thread
From: Andres Lagar-Cavilla @ 2015-07-15 19:03 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, Minchan Kim, Raghavendra K T, Johannes Weiner,
	Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes,
	Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linux-api,
	linux-doc, linux-mm, cgroups, linux-kernel

On Wed, Jul 15, 2015 at 6:54 AM, Vladimir Davydov
<vdavydov@parallels.com> wrote:
> /proc/kpagecgroup contains a 64-bit inode number of the memory cgroup
> each page is charged to, indexed by PFN. Having this information is
> useful for estimating a cgroup working set size.
>
> The file is present if CONFIG_PROC_PAGE_MONITOR && CONFIG_MEMCG.
>
> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
> ---
>  Documentation/vm/pagemap.txt |  6 ++++-
>  fs/proc/page.c               | 53 ++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 58 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
> index 6bfbc172cdb9..a9b7afc8fbc6 100644
> --- a/Documentation/vm/pagemap.txt
> +++ b/Documentation/vm/pagemap.txt
> @@ -5,7 +5,7 @@ pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow
>  userspace programs to examine the page tables and related information by
>  reading files in /proc.
>
> -There are three components to pagemap:
> +There are four components to pagemap:
>
>   * /proc/pid/pagemap.  This file lets a userspace process find out which
>     physical frame each virtual page is mapped to.  It contains one 64-bit
> @@ -65,6 +65,10 @@ There are three components to pagemap:
>      23. BALLOON
>      24. ZERO_PAGE
>
> + * /proc/kpagecgroup.  This file contains a 64-bit inode number of the
> +   memory cgroup each page is charged to, indexed by PFN. Only available when
> +   CONFIG_MEMCG is set.
> +
>  Short descriptions to the page flags:
>
>   0. LOCKED
> diff --git a/fs/proc/page.c b/fs/proc/page.c
> index 7eee2d8b97d9..70d23245dd43 100644
> --- a/fs/proc/page.c
> +++ b/fs/proc/page.c
> @@ -9,6 +9,7 @@
>  #include <linux/proc_fs.h>
>  #include <linux/seq_file.h>
>  #include <linux/hugetlb.h>
> +#include <linux/memcontrol.h>
>  #include <linux/kernel-page-flags.h>
>  #include <asm/uaccess.h>
>  #include "internal.h"
> @@ -225,10 +226,62 @@ static const struct file_operations proc_kpageflags_operations = {
>         .read = kpageflags_read,
>  };
>
> +#ifdef CONFIG_MEMCG
> +static ssize_t kpagecgroup_read(struct file *file, char __user *buf,
> +                               size_t count, loff_t *ppos)
> +{
> +       u64 __user *out = (u64 __user *)buf;
> +       struct page *ppage;
> +       unsigned long src = *ppos;
> +       unsigned long pfn;
> +       ssize_t ret = 0;
> +       u64 ino;
> +
> +       pfn = src / KPMSIZE;
> +       count = min_t(unsigned long, count, (max_pfn * KPMSIZE) - src);
> +       if (src & KPMMASK || count & KPMMASK)
> +               return -EINVAL;
> +
> +       while (count > 0) {
> +               if (pfn_valid(pfn))
> +                       ppage = pfn_to_page(pfn);
> +               else
> +                       ppage = NULL;
> +
> +               if (ppage)
> +                       ino = page_cgroup_ino(ppage);
> +               else
> +                       ino = 0;
> +

For both /proc/kpage* interfaces you add (and more critically for the
rmap-causing one, kpageidle):

It's a good idea to do cond_resched(). Whether after each pfn, each Nth
pfn, or each put_user, I leave to you, but a reasonable cadence is
needed, because user-space can call this on the entire physical
address space, and that's a lot of work to do without rescheduling.
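
A minimal sketch of one possible cadence inside the copy loop (the
batch of 1024 pfns is an arbitrary choice for illustration, not
something this patch prescribes):

        /* yield once in a while; a scan may cover the whole physical space */
        if ((pfn & 1023) == 0)
                cond_resched();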

Andres

> +               if (put_user(ino, out)) {
> +                       ret = -EFAULT;
> +                       break;
> +               }
> +
> +               pfn++;
> +               out++;
> +               count -= KPMSIZE;
> +       }
> +
> +       *ppos += (char __user *)out - buf;
> +       if (!ret)
> +               ret = (char __user *)out - buf;
> +       return ret;
> +}
> +
> +static const struct file_operations proc_kpagecgroup_operations = {
> +       .llseek = mem_lseek,
> +       .read = kpagecgroup_read,
> +};
> +#endif /* CONFIG_MEMCG */
> +
>  static int __init proc_page_init(void)
>  {
>         proc_create("kpagecount", S_IRUSR, NULL, &proc_kpagecount_operations);
>         proc_create("kpageflags", S_IRUSR, NULL, &proc_kpageflags_operations);
> +#ifdef CONFIG_MEMCG
> +       proc_create("kpagecgroup", S_IRUSR, NULL, &proc_kpagecgroup_operations);
> +#endif
>         return 0;
>  }
>  fs_initcall(proc_page_init);
> --
> 2.1.4
>



-- 
Andres Lagar-Cavilla | Google Kernel Team | andreslc@google.com

^ permalink raw reply	[flat|nested] 52+ messages in thread


* Re: [PATCH -mm v8 5/7] mmu-notifier: add clear_young callback
  2015-07-15 13:54   ` Vladimir Davydov
@ 2015-07-15 19:16     ` Andres Lagar-Cavilla
  -1 siblings, 0 replies; 52+ messages in thread
From: Andres Lagar-Cavilla @ 2015-07-15 19:16 UTC (permalink / raw)
  To: Vladimir Davydov, Paolo Bonzini, kvm, Eric Northup
  Cc: Andrew Morton, Minchan Kim, Raghavendra K T, Johannes Weiner,
	Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes,
	Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linux-api,
	linux-doc, linux-mm, cgroups, linux-kernel

On Wed, Jul 15, 2015 at 6:54 AM, Vladimir Davydov
<vdavydov@parallels.com> wrote:
> In the scope of the idle memory tracking feature, which is introduced by
> the following patch, we need to clear the referenced/accessed bit not
> only in primary, but also in secondary ptes. The latter is required in
> order to estimate the working set size of KVM VMs. At the same time we
> want to avoid flushing the tlb, because it is quite expensive and it
> won't really affect the final result.
>
> Currently, there is no function for clearing pte young bit that would
> meet our requirements, so this patch introduces one. To achieve that we
> have to add a new mmu-notifier callback, clear_young, since there is no
> method for testing-and-clearing a secondary pte w/o flushing tlb. The
> new method is not mandatory and currently only implemented by KVM.
>
> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>

Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>

Added Paolo Bonzini, kvm list, Eric Northup.

> ---
>  include/linux/mmu_notifier.h | 44 ++++++++++++++++++++++++++++++++++++++++++++
>  mm/mmu_notifier.c            | 17 +++++++++++++++++
>  virt/kvm/kvm_main.c          | 18 ++++++++++++++++++
>  3 files changed, 79 insertions(+)
>
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 61cd67f4d788..a5b17137c683 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -66,6 +66,16 @@ struct mmu_notifier_ops {
>                                  unsigned long end);
>
>         /*
> +        * clear_young is a lightweight version of clear_flush_young. Like the
> +        * latter, it is supposed to test-and-clear the young/accessed bitflag
> +        * in the secondary pte, but it may omit flushing the secondary tlb.
> +        */
> +       int (*clear_young)(struct mmu_notifier *mn,
> +                          struct mm_struct *mm,
> +                          unsigned long start,
> +                          unsigned long end);
> +
> +       /*
>          * test_young is called to check the young/accessed bitflag in
>          * the secondary pte. This is used to know if the page is
>          * frequently used without actually clearing the flag or tearing
> @@ -203,6 +213,9 @@ extern void __mmu_notifier_release(struct mm_struct *mm);
>  extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
>                                           unsigned long start,
>                                           unsigned long end);
> +extern int __mmu_notifier_clear_young(struct mm_struct *mm,
> +                                     unsigned long start,
> +                                     unsigned long end);
>  extern int __mmu_notifier_test_young(struct mm_struct *mm,
>                                      unsigned long address);
>  extern void __mmu_notifier_change_pte(struct mm_struct *mm,
> @@ -231,6 +244,15 @@ static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm,
>         return 0;
>  }
>
> +static inline int mmu_notifier_clear_young(struct mm_struct *mm,
> +                                          unsigned long start,
> +                                          unsigned long end)
> +{
> +       if (mm_has_notifiers(mm))
> +               return __mmu_notifier_clear_young(mm, start, end);
> +       return 0;
> +}
> +
>  static inline int mmu_notifier_test_young(struct mm_struct *mm,
>                                           unsigned long address)
>  {
> @@ -311,6 +333,28 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
>         __young;                                                        \
>  })
>
> +#define ptep_clear_young_notify(__vma, __address, __ptep)              \
> +({                                                                     \
> +       int __young;                                                    \
> +       struct vm_area_struct *___vma = __vma;                          \
> +       unsigned long ___address = __address;                           \
> +       __young = ptep_test_and_clear_young(___vma, ___address, __ptep);\
> +       __young |= mmu_notifier_clear_young(___vma->vm_mm, ___address,  \
> +                                           ___address + PAGE_SIZE);    \
> +       __young;                                                        \
> +})
> +
> +#define pmdp_clear_young_notify(__vma, __address, __pmdp)              \
> +({                                                                     \
> +       int __young;                                                    \
> +       struct vm_area_struct *___vma = __vma;                          \
> +       unsigned long ___address = __address;                           \
> +       __young = pmdp_test_and_clear_young(___vma, ___address, __pmdp);\
> +       __young |= mmu_notifier_clear_young(___vma->vm_mm, ___address,  \
> +                                           ___address + PMD_SIZE);     \
> +       __young;                                                        \
> +})
> +
>  #define        ptep_clear_flush_notify(__vma, __address, __ptep)               \
>  ({                                                                     \
>         unsigned long ___addr = __address & PAGE_MASK;                  \
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index 3b9b3d0741b2..5fbdd367bbed 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -123,6 +123,23 @@ int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
>         return young;
>  }
>
> +int __mmu_notifier_clear_young(struct mm_struct *mm,
> +                              unsigned long start,
> +                              unsigned long end)
> +{
> +       struct mmu_notifier *mn;
> +       int young = 0, id;
> +
> +       id = srcu_read_lock(&srcu);
> +       hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
> +               if (mn->ops->clear_young)
> +                       young |= mn->ops->clear_young(mn, mm, start, end);
> +       }
> +       srcu_read_unlock(&srcu, id);
> +
> +       return young;
> +}
> +
>  int __mmu_notifier_test_young(struct mm_struct *mm,
>                               unsigned long address)
>  {
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 05148a43ef9c..61500cb028a3 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -388,6 +388,23 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
>         return young;
>  }
>
> +static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
> +                                       struct mm_struct *mm,
> +                                       unsigned long start,
> +                                       unsigned long end)
> +{
> +       struct kvm *kvm = mmu_notifier_to_kvm(mn);
> +       int young, idx;

For reclaim, the clear_flush_young notifier may blow up the secondary
pte to estimate the access pattern, depending on hardware support (EPT
access bits available in Haswell onwards, not sure about AMD, PPC,
etc).

This is ok, because it's reclaim, we need to know the access pattern,
chances are the page is a goner anyway.

However, not so sure about that cost in this context. Depending on
user-space, this will periodically tear down all EPT tables in the
system. That's tricky.

So please add a note to that effect, so in the fullness of time kvm
may be able to refuse enacting this notifier based on performance/VM
priority/foo concerns.

> +
> +       idx = srcu_read_lock(&kvm->srcu);
> +       spin_lock(&kvm->mmu_lock);
> +       young = kvm_age_hva(kvm, start, end);

Also please add a comment along the lines of no one really knowing
when and if to flush the secondary tlb.

We might come up with a heuristic later, or leave it up to the regular
system cadence. We just don't know at the moment.
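
Something along these lines in kvm_mmu_notifier_clear_young() would
capture both points (the wording is only a sketch of the requested
notes, not final text):

        /*
         * Clearing the young bit without hardware access-bit support
         * means tearing down secondary (e.g. EPT) entries, which can be
         * expensive if user space scans all of memory frequently; kvm
         * may later want to refuse this notifier based on performance
         * or VM priority concerns.
         *
         * We also do not flush the secondary tlb here: nobody really
         * knows yet when, or whether, such a flush is worthwhile. A
         * heuristic may be added later, or it may be left to the
         * regular system cadence.
         */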

Andres

> +       spin_unlock(&kvm->mmu_lock);
> +       srcu_read_unlock(&kvm->srcu, idx);
> +
> +       return young;
> +}
> +
>  static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
>                                        struct mm_struct *mm,
>                                        unsigned long address)
> @@ -420,6 +437,7 @@ static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
>         .invalidate_range_start = kvm_mmu_notifier_invalidate_range_start,
>         .invalidate_range_end   = kvm_mmu_notifier_invalidate_range_end,
>         .clear_flush_young      = kvm_mmu_notifier_clear_flush_young,
> +       .clear_young            = kvm_mmu_notifier_clear_young,
>         .test_young             = kvm_mmu_notifier_test_young,
>         .change_pte             = kvm_mmu_notifier_change_pte,
>         .release                = kvm_mmu_notifier_release,
> --
> 2.1.4
>



-- 
Andres Lagar-Cavilla | Google Kernel Team | andreslc@google.com

^ permalink raw reply	[flat|nested] 52+ messages in thread


* Re: [PATCH -mm v8 7/7] proc: export idle flag via kpageflags
@ 2015-07-15 19:17     ` Andres Lagar-Cavilla
  0 siblings, 0 replies; 52+ messages in thread
From: Andres Lagar-Cavilla @ 2015-07-15 19:17 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, Minchan Kim, Raghavendra K T, Johannes Weiner,
	Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes,
	Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linux-api,
	linux-doc, linux-mm, cgroups, linux-kernel

On Wed, Jul 15, 2015 at 6:54 AM, Vladimir Davydov
<vdavydov@parallels.com> wrote:
> As noted by Minchan, a benefit of reading the idle flag from
> /proc/kpageflags is that one can easily filter dirty and/or unevictable
> pages while estimating the size of unused memory.
>
> Note that the idle flag read from /proc/kpageflags may be stale in case
> the page was accessed via a PTE, because it would be too costly to iterate
> over all page mappings on each /proc/kpageflags read to provide an
> up-to-date value. To make sure the flag is up-to-date one has to read
> /proc/kpageidle first.
>
> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>

Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
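
A minimal userspace sketch of the intended read order (the fd and
function names are made up for illustration; the offsets follow the
one-64-bit-word-per-pfn layout of kpageflags and the one-bit-per-pfn
layout of kpageidle, and KPF_IDLE is the bit added by this series):

        #include <stdint.h>
        #include <unistd.h>

        #define KPF_IDLE 25

        /* returns 1 if the page at pfn is idle, 0 if not, -1 on error */
        static int pfn_is_idle(int kpageidle_fd, int kpageflags_fd,
                               unsigned long pfn)
        {
                uint64_t bits, flags;

                /* read /proc/kpageidle first so pte-mapped pages get rechecked */
                if (pread(kpageidle_fd, &bits, sizeof(bits),
                          pfn / 64 * 8) != sizeof(bits))
                        return -1;
                if (pread(kpageflags_fd, &flags, sizeof(flags),
                          pfn * 8) != sizeof(flags))
                        return -1;
                return !!(flags & (1ULL << KPF_IDLE));
        }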

> ---
>  Documentation/vm/pagemap.txt           | 6 ++++++
>  fs/proc/page.c                         | 3 +++
>  include/uapi/linux/kernel-page-flags.h | 1 +
>  3 files changed, 10 insertions(+)
>
> diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
> index c9266340852c..5896b7d7fd74 100644
> --- a/Documentation/vm/pagemap.txt
> +++ b/Documentation/vm/pagemap.txt
> @@ -64,6 +64,7 @@ There are five components to pagemap:
>      22. THP
>      23. BALLOON
>      24. ZERO_PAGE
> +    25. IDLE
>
>   * /proc/kpagecgroup.  This file contains a 64-bit inode number of the
>     memory cgroup each page is charged to, indexed by PFN. Only available when
> @@ -124,6 +125,11 @@ Short descriptions to the page flags:
>  24. ZERO_PAGE
>      zero page for pfn_zero or huge_zero page
>
> +25. IDLE
> +    page has not been accessed since it was marked idle (see /proc/kpageidle)
> +    Note that this flag may be stale in case the page was accessed via a PTE.
> +    To make sure the flag is up-to-date one has to read /proc/kpageidle first.
> +
>      [IO related page flags]
>   1. ERROR     IO error occurred
>   3. UPTODATE  page has up-to-date data
> diff --git a/fs/proc/page.c b/fs/proc/page.c
> index 273537885ab4..13dcb823fe4e 100644
> --- a/fs/proc/page.c
> +++ b/fs/proc/page.c
> @@ -150,6 +150,9 @@ u64 stable_page_flags(struct page *page)
>         if (PageBalloon(page))
>                 u |= 1 << KPF_BALLOON;
>
> +       if (page_is_idle(page))
> +               u |= 1 << KPF_IDLE;
> +
>         u |= kpf_copy_bit(k, KPF_LOCKED,        PG_locked);
>
>         u |= kpf_copy_bit(k, KPF_SLAB,          PG_slab);
> diff --git a/include/uapi/linux/kernel-page-flags.h b/include/uapi/linux/kernel-page-flags.h
> index a6c4962e5d46..5da5f8751ce7 100644
> --- a/include/uapi/linux/kernel-page-flags.h
> +++ b/include/uapi/linux/kernel-page-flags.h
> @@ -33,6 +33,7 @@
>  #define KPF_THP                        22
>  #define KPF_BALLOON            23
>  #define KPF_ZERO_PAGE          24
> +#define KPF_IDLE               25
>
>
>  #endif /* _UAPILINUX_KERNEL_PAGE_FLAGS_H */
> --
> 2.1.4
>



-- 
Andres Lagar-Cavilla | Google Kernel Team | andreslc@google.com

^ permalink raw reply	[flat|nested] 52+ messages in thread


* Re: [PATCH -mm v8 6/7] proc: add kpageidle file
@ 2015-07-15 19:42     ` Andres Lagar-Cavilla
  0 siblings, 0 replies; 52+ messages in thread
From: Andres Lagar-Cavilla @ 2015-07-15 19:42 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, Minchan Kim, Raghavendra K T, Johannes Weiner,
	Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes,
	Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linux-api,
	linux-doc, linux-mm, cgroups, linux-kernel

On Wed, Jul 15, 2015 at 6:54 AM, Vladimir Davydov
<vdavydov@parallels.com> wrote:
> Knowing the portion of memory that is not used by a certain application
> or memory cgroup (idle memory) can be useful for partitioning the system
> efficiently, e.g. by setting memory cgroup limits appropriately.
> Currently, the only means to estimate the amount of idle memory provided
> by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
> access bit for all pages mapped to a particular process by writing 1 to
> clear_refs, wait for some time, and then count smaps:Referenced.
> However, this method has two serious shortcomings:
>
>  - it does not count unmapped file pages
>  - it affects the reclaimer logic
>
> To overcome these drawbacks, this patch introduces two new page flags,
> Idle and Young, and a new proc file, /proc/kpageidle. A page's Idle flag
> can only be set from userspace by setting the bit in /proc/kpageidle at the
> offset corresponding to the page, and it is cleared whenever the page is
> accessed either through page tables (it is cleared in page_referenced()
> in this case) or using the read(2) system call (mark_page_accessed()).
> Thus by setting the Idle flag for pages of a particular workload, which
> can be found e.g. by reading /proc/PID/pagemap, waiting for some time to
> let the workload access its working set, and then reading the kpageidle
> file, one can estimate the amount of pages that are not used by the
> workload.
>
> The Young page flag is used to avoid interference with the memory
> reclaimer. A page's Young flag is set whenever the Access bit of a page
> table entry pointing to the page is cleared by writing to kpageidle. If
> page_referenced() is called on a Young page, it will add 1 to its return
> value, therefore concealing the fact that the Access bit was cleared.
>
> Note, since there is no room for extra page flags on 32 bit, this
> feature uses extended page flags when compiled on 32 bit.
>
> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
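
For reference, a minimal userspace sketch of the marking step (the fd
and function names are made up; writes are OR-ed into the bitmap and
each 64-bit word covers 64 pfns, so this marks whole words rather than
an exact pfn range):

        #include <stdint.h>
        #include <unistd.h>

        static void mark_pfns_idle(int kpageidle_fd, unsigned long start_pfn,
                                   unsigned long nr_pfns)
        {
                uint64_t all_ones = ~0ULL;
                unsigned long off = start_pfn / 64 * 8;
                unsigned long end = (start_pfn + nr_pfns + 63) / 64 * 8;

                for (; off < end; off += 8)
                        pwrite(kpageidle_fd, &all_ones, sizeof(all_ones), off);
        }

After the workload has had time to touch its working set, reading the
same offsets back and counting the set bits gives the number of pages
that stayed idle.
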
> ---
>  Documentation/vm/pagemap.txt |  12 ++-
>  fs/proc/page.c               | 218 +++++++++++++++++++++++++++++++++++++++++++
>  fs/proc/task_mmu.c           |   4 +-
>  include/linux/mm.h           |  98 +++++++++++++++++++
>  include/linux/page-flags.h   |  11 +++
>  include/linux/page_ext.h     |   4 +
>  mm/Kconfig                   |  12 +++
>  mm/debug.c                   |   4 +
>  mm/huge_memory.c             |  11 ++-
>  mm/migrate.c                 |   5 +
>  mm/page_ext.c                |   3 +
>  mm/rmap.c                    |   5 +
>  mm/swap.c                    |   2 +
>  13 files changed, 385 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
> index a9b7afc8fbc6..c9266340852c 100644
> --- a/Documentation/vm/pagemap.txt
> +++ b/Documentation/vm/pagemap.txt
> @@ -5,7 +5,7 @@ pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow
>  userspace programs to examine the page tables and related information by
>  reading files in /proc.
>
> -There are four components to pagemap:
> +There are five components to pagemap:
>
>   * /proc/pid/pagemap.  This file lets a userspace process find out which
>     physical frame each virtual page is mapped to.  It contains one 64-bit
> @@ -69,6 +69,16 @@ There are four components to pagemap:
>     memory cgroup each page is charged to, indexed by PFN. Only available when
>     CONFIG_MEMCG is set.
>
> + * /proc/kpageidle.  This file implements a bitmap where each bit corresponds
> +   to a page, indexed by PFN. When the bit is set, the corresponding page is
> +   idle. A page is considered idle if it has not been accessed since it was
> +   marked idle. To mark a page idle one should set the bit corresponding to the
> +   page by writing to the file. A value written to the file is OR-ed with the
> +   current bitmap value. Only user memory pages can be marked idle, for other
> +   page types input is silently ignored. Writing to this file beyond max PFN
> +   results in the ENXIO error. Only available when CONFIG_IDLE_PAGE_TRACKING is
> +   set.
> +
>  Short descriptions to the page flags:
>
>   0. LOCKED
> diff --git a/fs/proc/page.c b/fs/proc/page.c
> index 70d23245dd43..273537885ab4 100644
> --- a/fs/proc/page.c
> +++ b/fs/proc/page.c
> @@ -5,6 +5,8 @@
>  #include <linux/ksm.h>
>  #include <linux/mm.h>
>  #include <linux/mmzone.h>
> +#include <linux/rmap.h>
> +#include <linux/mmu_notifier.h>
>  #include <linux/huge_mm.h>
>  #include <linux/proc_fs.h>
>  #include <linux/seq_file.h>
> @@ -16,6 +18,7 @@
>
>  #define KPMSIZE sizeof(u64)
>  #define KPMMASK (KPMSIZE - 1)
> +#define KPMBITS (KPMSIZE * BITS_PER_BYTE)
>
>  /* /proc/kpagecount - an array exposing page counts
>   *
> @@ -275,6 +278,217 @@ static const struct file_operations proc_kpagecgroup_operations = {
>  };
>  #endif /* CONFIG_MEMCG */
>
> +#ifdef CONFIG_IDLE_PAGE_TRACKING
> +/*
> + * Idle page tracking only considers user memory pages, for other types of
> + * pages the idle flag is always unset and an attempt to set it is silently
> + * ignored.
> + *
> + * We treat a page as a user memory page if it is on an LRU list, because it is
> + * always safe to pass such a page to rmap_walk(), which is essential for idle
> + * page tracking. With such an indicator of user pages we can skip isolated
> + * pages, but since there are not usually many of them, it will hardly affect
> + * the overall result.
> + *
> + * This function tries to get a user memory page by pfn as described above.
> + */
> +static struct page *kpageidle_get_page(unsigned long pfn)
> +{
> +       struct page *page;
> +       struct zone *zone;
> +
> +       if (!pfn_valid(pfn))
> +               return NULL;
> +
> +       page = pfn_to_page(pfn);
> +       if (!page || !PageLRU(page) ||
> +           !get_page_unless_zero(page))
> +               return NULL;
> +
> +       zone = page_zone(page);
> +       spin_lock_irq(&zone->lru_lock);
> +       if (unlikely(!PageLRU(page))) {
> +               put_page(page);
> +               page = NULL;
> +       }
> +       spin_unlock_irq(&zone->lru_lock);
> +       return page;
> +}
> +
> +static int kpageidle_clear_pte_refs_one(struct page *page,
> +                                       struct vm_area_struct *vma,
> +                                       unsigned long addr, void *arg)
> +{
> +       struct mm_struct *mm = vma->vm_mm;
> +       spinlock_t *ptl;
> +       pmd_t *pmd;
> +       pte_t *pte;
> +       bool referenced = false;
> +
> +       if (unlikely(PageTransHuge(page))) {
> +               pmd = page_check_address_pmd(page, mm, addr,
> +                                            PAGE_CHECK_ADDRESS_PMD_FLAG, &ptl);
> +               if (pmd) {
> +                       referenced = pmdp_clear_young_notify(vma, addr, pmd);
> +                       spin_unlock(ptl);
> +               }
> +       } else {
> +               pte = page_check_address(page, mm, addr, &ptl, 0);
> +               if (pte) {
> +                       referenced = ptep_clear_young_notify(vma, addr, pte);
> +                       pte_unmap_unlock(pte, ptl);
> +               }
> +       }
> +       if (referenced) {
> +               clear_page_idle(page);
> +               /*
> +                * We cleared the referenced bit in a mapping to this page. To
> +                * avoid interference with page reclaim, mark it young so that
> +                * page_referenced() will return > 0.
> +                */
> +               set_page_young(page);
> +       }
> +       return SWAP_AGAIN;
> +}
> +
> +static void kpageidle_clear_pte_refs(struct page *page)
> +{
> +       struct rmap_walk_control rwc = {
> +               .rmap_one = kpageidle_clear_pte_refs_one,
> +               .anon_lock = page_lock_anon_vma_read,
> +       };
> +       bool need_lock;
> +
> +       if (!page_mapped(page) ||

Question: what about mlocked pages? Is there any point in calculating
their idleness?

> +           !page_rmapping(page))

Not sure: does this skip SwapCache pages? Is there any point in
calculating their idleness?

> +               return;
> +
> +       need_lock = !PageAnon(page) || PageKsm(page);
> +       if (need_lock && !trylock_page(page))
> +               return;
> +
> +       rmap_walk(page, &rwc);
> +
> +       if (need_lock)
> +               unlock_page(page);
> +}
> +
> +static ssize_t kpageidle_read(struct file *file, char __user *buf,
> +                             size_t count, loff_t *ppos)
> +{
> +       u64 __user *out = (u64 __user *)buf;
> +       struct page *page;
> +       unsigned long pfn, end_pfn;
> +       ssize_t ret = 0;
> +       u64 idle_bitmap = 0;
> +       int bit;
> +
> +       if (*ppos & KPMMASK || count & KPMMASK)
> +               return -EINVAL;
> +
> +       pfn = *ppos * BITS_PER_BYTE;
> +       if (pfn >= max_pfn)
> +               return 0;
> +
> +       end_pfn = pfn + count * BITS_PER_BYTE;
> +       if (end_pfn > max_pfn)
> +               end_pfn = ALIGN(max_pfn, KPMBITS);
> +
> +       for (; pfn < end_pfn; pfn++) {
> +               bit = pfn % KPMBITS;
> +               page = kpageidle_get_page(pfn);
> +               if (page) {
> +                       if (page_is_idle(page)) {
> +                               /*
> +                                * The page might have been referenced via a
> +                                * pte, in which case it is not idle. Clear
> +                                * refs and recheck.
> +                                */
> +                               kpageidle_clear_pte_refs(page);
> +                               if (page_is_idle(page))
> +                                       idle_bitmap |= 1ULL << bit;
> +                       }
> +                       put_page(page);
> +               }
> +               if (bit == KPMBITS - 1) {

Reminder to add cond_resched() or similar at some regular cadence.

> +                       if (put_user(idle_bitmap, out)) {
> +                               ret = -EFAULT;
> +                               break;
> +                       }
> +                       idle_bitmap = 0;
> +                       out++;
> +               }
> +       }
> +
> +       *ppos += (char __user *)out - buf;
> +       if (!ret)
> +               ret = (char __user *)out - buf;
> +       return ret;
> +}
> +
> +static ssize_t kpageidle_write(struct file *file, const char __user *buf,
> +                              size_t count, loff_t *ppos)
> +{
> +       const u64 __user *in = (const u64 __user *)buf;
> +       struct page *page;
> +       unsigned long pfn, end_pfn;
> +       ssize_t ret = 0;
> +       u64 idle_bitmap = 0;
> +       int bit;
> +
> +       if (*ppos & KPMMASK || count & KPMMASK)
> +               return -EINVAL;
> +
> +       pfn = *ppos * BITS_PER_BYTE;
> +       if (pfn >= max_pfn)
> +               return -ENXIO;
> +
> +       end_pfn = pfn + count * BITS_PER_BYTE;
> +       if (end_pfn > max_pfn)
> +               end_pfn = ALIGN(max_pfn, KPMBITS);
> +
> +       for (; pfn < end_pfn; pfn++) {
> +               bit = pfn % KPMBITS;
> +               if (bit == 0) {
> +                       if (get_user(idle_bitmap, in)) {

Same...

> +                               ret = -EFAULT;
> +                               break;
> +                       }
> +                       in++;
> +               }
> +               if (idle_bitmap >> bit & 1) {
> +                       page = kpageidle_get_page(pfn);
> +                       if (page) {
> +                               kpageidle_clear_pte_refs(page);
> +                               set_page_idle(page);
> +                               put_page(page);
> +                       }
> +               }
> +       }
> +
> +       *ppos += (const char __user *)in - buf;
> +       if (!ret)
> +               ret = (const char __user *)in - buf;
> +       return ret;
> +}
> +
> +static const struct file_operations proc_kpageidle_operations = {
> +       .llseek = mem_lseek,
> +       .read = kpageidle_read,
> +       .write = kpageidle_write,
> +};
> +
> +#ifndef CONFIG_64BIT
> +static bool need_page_idle(void)
> +{
> +       return true;
> +}
> +struct page_ext_operations page_idle_ops = {
> +       .need = need_page_idle,
> +};
> +#endif
> +#endif /* CONFIG_IDLE_PAGE_TRACKING */
> +
>  static int __init proc_page_init(void)
>  {
>         proc_create("kpagecount", S_IRUSR, NULL, &proc_kpagecount_operations);
> @@ -282,6 +496,10 @@ static int __init proc_page_init(void)
>  #ifdef CONFIG_MEMCG
>         proc_create("kpagecgroup", S_IRUSR, NULL, &proc_kpagecgroup_operations);
>  #endif
> +#ifdef CONFIG_IDLE_PAGE_TRACKING
> +       proc_create("kpageidle", S_IRUSR | S_IWUSR, NULL,
> +                   &proc_kpageidle_operations);
> +#endif
>         return 0;
>  }
>  fs_initcall(proc_page_init);
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 3b4d8255e806..3efd7f641f92 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -458,7 +458,7 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,
>
>         mss->resident += size;
>         /* Accumulate the size in pages that have been accessed. */
> -       if (young || PageReferenced(page))
> +       if (young || page_is_young(page) || PageReferenced(page))
>                 mss->referenced += size;
>         mapcount = page_mapcount(page);
>         if (mapcount >= 2) {
> @@ -810,6 +810,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
>
>                 /* Clear accessed and referenced bits. */
>                 pmdp_test_and_clear_young(vma, addr, pmd);
> +               test_and_clear_page_young(page);
>                 ClearPageReferenced(page);
>  out:
>                 spin_unlock(ptl);
> @@ -837,6 +838,7 @@ out:
>
>                 /* Clear accessed and referenced bits. */
>                 ptep_test_and_clear_young(vma, addr, pte);
> +               test_and_clear_page_young(page);
>                 ClearPageReferenced(page);
>         }
>         pte_unmap_unlock(pte - 1, ptl);
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 7f471789781a..de450c1191b9 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2205,5 +2205,103 @@ void __init setup_nr_node_ids(void);
>  static inline void setup_nr_node_ids(void) {}
>  #endif
>
> +#ifdef CONFIG_IDLE_PAGE_TRACKING
> +#ifdef CONFIG_64BIT
> +static inline bool page_is_young(struct page *page)
> +{
> +       return PageYoung(page);
> +}
> +
> +static inline void set_page_young(struct page *page)
> +{
> +       SetPageYoung(page);
> +}
> +
> +static inline bool test_and_clear_page_young(struct page *page)
> +{
> +       return TestClearPageYoung(page);
> +}
> +
> +static inline bool page_is_idle(struct page *page)
> +{
> +       return PageIdle(page);
> +}
> +
> +static inline void set_page_idle(struct page *page)
> +{
> +       SetPageIdle(page);
> +}
> +
> +static inline void clear_page_idle(struct page *page)
> +{
> +       ClearPageIdle(page);
> +}
> +#else /* !CONFIG_64BIT */
> +/*
> + * If there is not enough space to store Idle and Young bits in page flags, use
> + * page ext flags instead.
> + */
> +extern struct page_ext_operations page_idle_ops;
> +
> +static inline bool page_is_young(struct page *page)
> +{
> +       return test_bit(PAGE_EXT_YOUNG, &lookup_page_ext(page)->flags);
> +}
> +
> +static inline void set_page_young(struct page *page)
> +{
> +       set_bit(PAGE_EXT_YOUNG, &lookup_page_ext(page)->flags);
> +}
> +
> +static inline bool test_and_clear_page_young(struct page *page)
> +{
> +       return test_and_clear_bit(PAGE_EXT_YOUNG,
> +                                 &lookup_page_ext(page)->flags);
> +}
> +
> +static inline bool page_is_idle(struct page *page)
> +{
> +       return test_bit(PAGE_EXT_IDLE, &lookup_page_ext(page)->flags);
> +}
> +
> +static inline void set_page_idle(struct page *page)
> +{
> +       set_bit(PAGE_EXT_IDLE, &lookup_page_ext(page)->flags);
> +}
> +
> +static inline void clear_page_idle(struct page *page)
> +{
> +       clear_bit(PAGE_EXT_IDLE, &lookup_page_ext(page)->flags);
> +}
> +#endif /* CONFIG_64BIT */
> +#else /* !CONFIG_IDLE_PAGE_TRACKING */
> +static inline bool page_is_young(struct page *page)
> +{
> +       return false;
> +}
> +
> +static inline void set_page_young(struct page *page)
> +{
> +}
> +
> +static inline bool test_and_clear_page_young(struct page *page)
> +{
> +       return false;
> +}
> +
> +static inline bool page_is_idle(struct page *page)
> +{
> +       return false;
> +}
> +
> +static inline void set_page_idle(struct page *page)
> +{
> +}
> +
> +static inline void clear_page_idle(struct page *page)
> +{
> +}
> +#endif /* CONFIG_IDLE_PAGE_TRACKING */
> +
>  #endif /* __KERNEL__ */
>  #endif /* _LINUX_MM_H */
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 91b7f9b2b774..478f2241f284 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -109,6 +109,10 @@ enum pageflags {
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>         PG_compound_lock,
>  #endif
> +#if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
> +       PG_young,
> +       PG_idle,
> +#endif
>         __NR_PAGEFLAGS,
>
>         /* Filesystems */
> @@ -363,6 +367,13 @@ PAGEFLAG_FALSE(HWPoison)
>  #define __PG_HWPOISON 0
>  #endif
>
> +#if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
> +TESTPAGEFLAG(Young, young, PF_ANY)
> +SETPAGEFLAG(Young, young, PF_ANY)
> +TESTCLEARFLAG(Young, young, PF_ANY)
> +PAGEFLAG(Idle, idle, PF_ANY)
> +#endif
> +
>  /*
>   * On an anonymous page mapped into a user virtual memory area,
>   * page->mapping points to its anon_vma, not to a struct address_space;
> diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
> index c42981cd99aa..17f118a82854 100644
> --- a/include/linux/page_ext.h
> +++ b/include/linux/page_ext.h
> @@ -26,6 +26,10 @@ enum page_ext_flags {
>         PAGE_EXT_DEBUG_POISON,          /* Page is poisoned */
>         PAGE_EXT_DEBUG_GUARD,
>         PAGE_EXT_OWNER,
> +#if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
> +       PAGE_EXT_YOUNG,
> +       PAGE_EXT_IDLE,
> +#endif
>  };
>
>  /*
> diff --git a/mm/Kconfig b/mm/Kconfig
> index e79de2bd12cd..db817e2c2ec8 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -654,3 +654,15 @@ config DEFERRED_STRUCT_PAGE_INIT
>           when kswapd starts. This has a potential performance impact on
>           processes running early in the lifetime of the systemm until kswapd
>           finishes the initialisation.
> +
> +config IDLE_PAGE_TRACKING
> +       bool "Enable idle page tracking"
> +       select PROC_PAGE_MONITOR
> +       select PAGE_EXTENSION if !64BIT
> +       help
> +         This feature allows to estimate the amount of user pages that have
> +         not been touched during a given period of time. This information can
> +         be useful to tune memory cgroup limits and/or for job placement
> +         within a compute cluster.
> +
> +         See Documentation/vm/pagemap.txt for more details.
> diff --git a/mm/debug.c b/mm/debug.c
> index 76089ddf99ea..6c1b3ea61bfd 100644
> --- a/mm/debug.c
> +++ b/mm/debug.c
> @@ -48,6 +48,10 @@ static const struct trace_print_flags pageflag_names[] = {
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>         {1UL << PG_compound_lock,       "compound_lock" },
>  #endif
> +#if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
> +       {1UL << PG_young,               "young"         },
> +       {1UL << PG_idle,                "idle"          },
> +#endif
>  };
>
>  static void dump_flags(unsigned long flags,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 9671f51e954d..bb6d2ec1f268 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1754,6 +1754,11 @@ static void __split_huge_page_refcount(struct page *page,
>                 /* clear PageTail before overwriting first_page */
>                 smp_wmb();
>
> +               if (page_is_young(page))
> +                       set_page_young(page_tail);
> +               if (page_is_idle(page))
> +                       set_page_idle(page_tail);
> +

Why not in the block above?

page_tail->flags |= (page->flags &
...
#ifdef CONFIG_WHATEVER_IT_WAS
1 << PG_idle
1 << PG_young
#endif
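
Spelled out, the suggestion appears to be to fold the two bits into the
existing tail-flags copy, roughly along these lines (an illustrative sketch
only: the real mask in __split_huge_page_refcount() carries more flags, and
this form could only cover the 64-bit case, where the bits live in
page->flags rather than in page_ext):

		/*
		 * Hypothetical variant of the suggestion above; the flag
		 * list is abridged, only the placement of the new bits
		 * matters here.
		 */
		page_tail->flags |= (page->flags &
				     ((1L << PG_referenced) |
				      (1L << PG_uptodate) |
				      (1L << PG_active) |
#if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
				      (1L << PG_young) |
				      (1L << PG_idle) |
#endif
				      (1L << PG_unevictable)));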


>                 /*
>                  * __split_huge_page_splitting() already set the
>                  * splitting bit in all pmd that could map this
> @@ -2259,7 +2264,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>                 VM_BUG_ON_PAGE(PageLRU(page), page);
>
>                 /* If there is no mapped pte young don't collapse the page */
> -               if (pte_young(pteval) || PageReferenced(page) ||
> +               if (pte_young(pteval) ||
> +                   page_is_young(page) || PageReferenced(page) ||
>                     mmu_notifier_test_young(vma->vm_mm, address))
>                         referenced = true;
>         }
> @@ -2686,7 +2692,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>                  */
>                 if (page_count(page) != 1 + !!PageSwapCache(page))
>                         goto out_unmap;
> -               if (pte_young(pteval) || PageReferenced(page) ||
> +               if (pte_young(pteval) ||
> +                   page_is_young(page) || PageReferenced(page) ||
>                     mmu_notifier_test_young(vma->vm_mm, address))
>                         referenced = true;
>         }

Cool finds, thanks for the thoroughness

Andres
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 236ee25e79d9..3e7bb4f2b51c 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -524,6 +524,11 @@ void migrate_page_copy(struct page *newpage, struct page *page)
>                         __set_page_dirty_nobuffers(newpage);
>         }
>
> +       if (page_is_young(page))
> +               set_page_young(newpage);
> +       if (page_is_idle(page))
> +               set_page_idle(newpage);
> +
>         /*
>          * Copy NUMA information to the new page, to prevent over-eager
>          * future migrations of this same page.
> diff --git a/mm/page_ext.c b/mm/page_ext.c
> index d86fd2f5353f..e4b3af054bf2 100644
> --- a/mm/page_ext.c
> +++ b/mm/page_ext.c
> @@ -59,6 +59,9 @@ static struct page_ext_operations *page_ext_ops[] = {
>  #ifdef CONFIG_PAGE_OWNER
>         &page_owner_ops,
>  #endif
> +#if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
> +       &page_idle_ops,
> +#endif
>  };
>
>  static unsigned long total_usage;
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 49b244b1f18c..c96677ade3d1 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -798,6 +798,11 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
>                 pte_unmap_unlock(pte, ptl);
>         }
>
> +       if (referenced)
> +               clear_page_idle(page);
> +       if (test_and_clear_page_young(page))
> +               referenced++;
> +
>         if (referenced) {
>                 pra->referenced++;
>                 pra->vm_flags |= vma->vm_flags;
> diff --git a/mm/swap.c b/mm/swap.c
> index ab7c338eda87..db43c9b4891d 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -623,6 +623,8 @@ void mark_page_accessed(struct page *page)
>         } else if (!PageReferenced(page)) {
>                 SetPageReferenced(page);
>         }
> +       if (page_is_idle(page))
> +               clear_page_idle(page);
>  }
>  EXPORT_SYMBOL(mark_page_accessed);
>
> --
> 2.1.4
>



-- 
Andres Lagar-Cavilla | Google Kernel Team | andreslc@google.com
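
As a usage illustration for the interface reviewed above, marking every
currently mapped page of one process idle could look roughly like the
helper below (hypothetical, not part of the series; it assumes only the
/proc/kpageidle semantics from this patch and the documented
/proc/PID/pagemap entry format, and needs root):

/*
 * mark_idle.c - hypothetical helper: mark every currently mapped page of
 * a given PID idle.  kpageidle is a PFN-indexed bitmap accessed in u64
 * chunks; writes are OR-ed in, so setting one bit at a time leaves
 * neighbouring pages alone.
 */
#define _FILE_OFFSET_BITS 64
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

#define PM_PFN_MASK	((1ULL << 55) - 1)	/* pagemap bits 0-54: PFN */
#define PM_PRESENT	(1ULL << 63)		/* pagemap bit 63: present */

int main(int argc, char **argv)
{
	unsigned long start, end, page_size = sysconf(_SC_PAGESIZE);
	char path[64];
	FILE *maps;
	int pm, idle;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}

	snprintf(path, sizeof(path), "/proc/%s/maps", argv[1]);
	maps = fopen(path, "r");
	snprintf(path, sizeof(path), "/proc/%s/pagemap", argv[1]);
	pm = open(path, O_RDONLY);
	idle = open("/proc/kpageidle", O_WRONLY);
	if (!maps || pm < 0 || idle < 0) {
		perror("open");
		return 1;
	}

	/* For each VMA, find the PFN backing every present page and set
	 * the matching bit in the kpageidle bitmap. */
	while (fscanf(maps, "%lx-%lx%*[^\n]", &start, &end) == 2) {
		for (unsigned long va = start; va < end; va += page_size) {
			uint64_t ent, bit;

			if (pread(pm, &ent, sizeof(ent),
				  va / page_size * sizeof(ent)) != sizeof(ent))
				continue;
			if (!(ent & PM_PRESENT))
				continue;
			bit = 1ULL << ((ent & PM_PFN_MASK) % 64);
			pwrite(idle, &bit, sizeof(bit),
			       (ent & PM_PFN_MASK) / 64 * sizeof(bit));
		}
	}
	return 0;
}

Re-reading the same PFNs from /proc/kpageidle once the workload has run
for a while then yields the per-process idle page count.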

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH -mm v8 0/7] idle memory tracking
  2015-07-15 13:54 ` Vladimir Davydov
@ 2015-07-15 20:47   ` Andres Lagar-Cavilla
  -1 siblings, 0 replies; 52+ messages in thread
From: Andres Lagar-Cavilla @ 2015-07-15 20:47 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, Minchan Kim, Raghavendra K T, Johannes Weiner,
	Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes,
	Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linux-api,
	linux-doc, linux-mm, cgroups, linux-kernel

On Wed, Jul 15, 2015 at 6:54 AM, Vladimir Davydov
<vdavydov@parallels.com> wrote:
> Hi,
>
> This patch set introduces a new user API for tracking user memory pages
> that have not been used for a given period of time. The purpose of this
> is to provide the userspace with the means of tracking a workload's
> working set, i.e. the set of pages that are actively used by the
> workload. Knowing the working set size can be useful for partitioning
> the system more efficiently, e.g. by tuning memory cgroup limits
> appropriately, or for job placement within a compute cluster.
>
> It is based on top of v4.2-rc1-mmotm-2015-07-06-16-25
>
> ---- USE CASES ----
>
> The unified cgroup hierarchy has memory.low and memory.high knobs, which
> are defined as the low and high boundaries for the workload working set
> size. However, the working set size of a workload may be unknown or
> change in time. With this patch set, one can periodically estimate the
> amount of memory unused by each cgroup and tune their memory.low and
> memory.high parameters accordingly, therefore optimizing the overall
> memory utilization.
>
> Another use case is balancing workloads within a compute cluster.
> Knowing how much memory is not really used by a workload unit may help
> take a more optimal decision when considering migrating the unit to
> another node within the cluster.
>
> Also, as noted by Minchan, this would be useful for per-process reclaim
> (https://lwn.net/Articles/545668/). With idle tracking, we could reclaim idle
> pages only by smart user memory manager.
>
> ---- USER API ----
>
> The user API consists of two new proc files:
>
>  * /proc/kpageidle.  This file implements a bitmap where each bit corresponds
>    to a page, indexed by PFN. When the bit is set, the corresponding page is
>    idle. A page is considered idle if it has not been accessed since it was
>    marked idle. To mark a page idle one should set the bit corresponding to the
>    page by writing to the file. A value written to the file is OR-ed with the
>    current bitmap value. Only user memory pages can be marked idle, for other
>    page types input is silently ignored. Writing to this file beyond max PFN
>    results in the ENXIO error. Only available when CONFIG_IDLE_PAGE_TRACKING is
>    set.
>
>    This file can be used to estimate the amount of pages that are not
>    used by a particular workload as follows:
>
>    1. mark all pages of interest idle by setting corresponding bits in the
>       /proc/kpageidle bitmap
>    2. wait until the workload accesses its working set
>    3. read /proc/kpageidle and count the number of bits set
>
>  * /proc/kpagecgroup.  This file contains a 64-bit inode number of the
>    memory cgroup each page is charged to, indexed by PFN. Only available when
>    CONFIG_MEMCG is set.
>
>    This file can be used to find all pages (including unmapped file
>    pages) accounted to a particular cgroup. Using /proc/kpageidle, one
>    can then estimate the cgroup working set size.
>
> For an example of using these files for estimating the amount of unused
> memory pages per each memory cgroup, please see the script attached
> below.
>
> ---- REASONING ----
>
> The reason to introduce the new user API instead of using
> /proc/PID/{clear_refs,smaps} is that the latter has two serious
> drawbacks:
>
>  - it does not count unmapped file pages
>  - it affects the reclaimer logic
>
> The new API attempts to overcome them both. For more details on how it
> is achieved, please see the comment to patch 5.
>
> ---- CHANGE LOG ----
>
> Changes in v8:
>
>  - clear referenced/accessed bit in secondary ptes while accessing
>    /proc/kpageidle; this is required to estimate wss of KVM VMs (Andres)
>  - check the young flag when collapsing a huge page
>  - copy idle/young flags on page migration

Both good catches, thanks!

I think the remaining question here is performance.

Have you conducted any studies where
- there is a workload
- a daemon is poking kpageidle every N seconds/minutes
- what is the daemon cpu consumption?
- what is the workload degradation if any?

N candidates include 30 seconds, 1 minute, 2 minutes, 5 minutes....

Workload candidates include TPC, memory-intensive SPECint things like
429.mcf, and STREAM (http://www.cs.virginia.edu/stream/, "sustainable
memory bandwidth" vs floating-point performance).

I'm not asking for a research paper, but if, say, a daemon with a
2-minute period introduces no degradation and adds up to a minute of
CPU per hour, then we're golden.
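
For concreteness, a rough sketch of the kind of poking daemon I have in
mind (hypothetical, not part of this series; it reuses the set_idle()
helper from the script attached to the cover letter and only measures the
daemon's own CPU time per scan, not workload degradation):

#!/usr/bin/python
# Hypothetical measurement loop: mark every page idle once per PERIOD
# seconds and log how much CPU time the poking daemon itself burns.

import os
import time
import errno
import struct

PERIOD = 120            # seconds between /proc/kpageidle scans
BUFSIZE = 8 * 1024      # must be a multiple of 8

def set_idle():
    f = open("/proc/kpageidle", "wb", BUFSIZE)
    while True:
        try:
            f.write(struct.pack("Q", pow(2, 64) - 1))
        except IOError as err:
            if err.errno == errno.ENXIO:
                break
            raise
    f.close()

if __name__ == "__main__":
    while True:
        before = os.times()     # (utime, stime, ...) of this process
        set_idle()
        after = os.times()
        print "scan: %.2fs user, %.2fs sys" % \
            (after[0] - before[0], after[1] - before[1])
        time.sleep(PERIOD)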

Andres

>
> Changes in v7:
>
> This iteration addresses Andres's comments to v6:
>
>  - do not reuse page_referenced for clearing idle flag, introduce a
>    separate function instead; this way we won't issue expensive tlb
>    flushes on /proc/kpageidle read/write
>  - propagate young/idle flags from head to tail pages on thp split
>  - skip compound tail pages while reading/writing /proc/kpageidle
>  - cleanup page_referenced_one
>
> Changes in v6:
>
>  - Split the patch introducing page_cgroup_ino helper to ease review.
>  - Rebase on top of v4.1-rc7-mmotm-2015-06-09-16-55
>
> Changes in v5:
>
>  - Fix possible race between kpageidle_clear_pte_refs() and
>    __page_set_anon_rmap() by checking that a page is on an LRU list
>    under zone->lru_lock (Minchan).
>  - Export idle flag via /proc/kpageflags (Minchan).
>  - Rebase on top of 4.1-rc3.
>
> Changes in v4:
>
> This iteration primarily addresses Minchan's comments to v3:
>
>  - Implement /proc/kpageidle as a bitmap instead of using u64 per each page,
>    because there does not seem to be any future uses for the other 63 bits.
>  - Do not double-increase pra->referenced in page_referenced_one() if the page
>    was young and referenced recently.
>  - Remove the pointless (page_count == 0) check from kpageidle_get_page().
>  - Rename kpageidle_clear_refs() to kpageidle_clear_pte_refs().
>  - Improve comments to kpageidle-related functions.
>  - Rebase on top of 4.1-rc2.
>
> Note it does not address Minchan's concern of possible __page_set_anon_rmap vs
> page_referenced race (see https://lkml.org/lkml/2015/5/3/220) since it is still
> unclear if this race can really happen (see https://lkml.org/lkml/2015/5/4/160)
>
> Changes in v3:
>
>  - Enable CONFIG_IDLE_PAGE_TRACKING for 32 bit. Since this feature
>    requires two extra page flags and there is no space for them on 32
>    bit, page ext is used (thanks to Minchan Kim).
>  - Minor code cleanups and comments improved.
>  - Rebase on top of 4.1-rc1.
>
> Changes in v2:
>
>  - The main difference from v1 is the API change. In v1 the user can
>    only set the idle flag for all pages at once, and for clearing the
>    Idle flag on pages accessed via page tables /proc/PID/clear_refs
>    should be used.
>    The main drawback of the v1 approach, as noted by Minchan, is that on
>    big machines setting the idle flag for each pages can result in CPU
>    bursts, which would be especially frustrating if the user only wanted
>    to estimate the amount of idle pages for a particular process or VMA.
>    With the new API a more fine-grained approach is possible: one can
>    read a process's /proc/PID/pagemap and set/check the Idle flag only
>    for those pages of the process's address space he or she is
>    interested in.
>    Another good point about the v2 API is that it is possible to limit
>    /proc/kpage* scanning rate when the user wants to estimate the total
>    number of idle pages, which is unachievable with the v1 approach.
>  - Make /proc/kpagecgroup return the ino of the closest online ancestor
>    in case the cgroup a page is charged to is offline.
>  - Fix /proc/PID/clear_refs not clearing Young page flag.
>  - Rebase on top of v4.0-rc6-mmotm-2015-04-01-14-54
>
> v7: https://lkml.org/lkml/2015/7/11/119
> v6: https://lkml.org/lkml/2015/6/12/301
> v5: https://lkml.org/lkml/2015/5/12/449
> v4: https://lkml.org/lkml/2015/5/7/580
> v3: https://lkml.org/lkml/2015/4/28/224
> v2: https://lkml.org/lkml/2015/4/7/260
> v1: https://lkml.org/lkml/2015/3/18/794
>
> ---- PATCH SET STRUCTURE ----
>
> The patch set is organized as follows:
>
>  - patch 1 adds page_cgroup_ino() helper for the sake of
>    /proc/kpagecgroup and patches 2-3 do related cleanup
>  - patch 4 adds /proc/kpagecgroup, which reports cgroup ino each page is
>    charged to
>  - patch 5 introduces a new mmu notifier callback, clear_young, which is
>    a lightweight version of clear_flush_young; it is used in patch 6
>  - patch 6 implements the idle page tracking feature, including the
>    userspace API, /proc/kpageidle
>  - patch 7 exports idle flag via /proc/kpageflags
>
> ---- SIMILAR WORKS ----
>
> Originally, the patch for tracking idle memory was proposed back in 2011
> by Michel Lespinasse (see http://lwn.net/Articles/459269/). The main
> difference between Michel's patch and this one is that Michel
> implemented a kernel space daemon for estimating idle memory size per
> cgroup while this patch only provides the userspace with the minimal API
> for doing the job, leaving the rest up to the userspace. However, they
> both share the same idea of Idle/Young page flags to avoid affecting the
> reclaimer logic.
>
> ---- SCRIPT FOR COUNTING IDLE PAGES PER CGROUP ----
> #! /usr/bin/python
> #
>
> import os
> import stat
> import errno
> import struct
>
> CGROUP_MOUNT = "/sys/fs/cgroup/memory"
> BUFSIZE = 8 * 1024  # must be multiple of 8
>
>
> def get_hugepage_size():
>     with open("/proc/meminfo", "r") as f:
>         for s in f:
>             k, v = s.split(":")
>             if k == "Hugepagesize":
>                 return int(v.split()[0]) * 1024
>
> PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")
> HUGEPAGE_SIZE = get_hugepage_size()
>
>
> def set_idle():
>     f = open("/proc/kpageidle", "wb", BUFSIZE)
>     while True:
>         try:
>             f.write(struct.pack("Q", pow(2, 64) - 1))
>         except IOError as err:
>             if err.errno == errno.ENXIO:
>                 break
>             raise
>     f.close()
>
>
> def count_idle():
>     f_flags = open("/proc/kpageflags", "rb", BUFSIZE)
>     f_cgroup = open("/proc/kpagecgroup", "rb", BUFSIZE)
>
>     with open("/proc/kpageidle", "rb", BUFSIZE) as f:
>         while f.read(BUFSIZE): pass  # update idle flag
>
>     idlememsz = {}
>     while True:
>         s1, s2 = f_flags.read(8), f_cgroup.read(8)
>         if not s1 or not s2:
>             break
>
>         flags, = struct.unpack('Q', s1)
>         cgino, = struct.unpack('Q', s2)
>
>         unevictable = (flags >> 18) & 1
>         huge = (flags >> 22) & 1
>         idle = (flags >> 25) & 1
>
>         if idle and not unevictable:
>             idlememsz[cgino] = idlememsz.get(cgino, 0) + \
>                 (HUGEPAGE_SIZE if huge else PAGE_SIZE)
>
>     f_flags.close()
>     f_cgroup.close()
>     return idlememsz
>
>
> if __name__ == "__main__":
>     print "Setting the idle flag for each page..."
>     set_idle()
>
>     raw_input("Wait until the workload accesses its working set, "
>               "then press Enter")
>
>     print "Counting idle pages..."
>     idlememsz = count_idle()
>
>     for dir, subdirs, files in os.walk(CGROUP_MOUNT):
>         ino = os.stat(dir)[stat.ST_INO]
>         print dir + ": " + str(idlememsz.get(ino, 0) / 1024) + " kB"
> ---- END SCRIPT ----
>
> Comments are more than welcome.
>
> Thanks,
>
> Vladimir Davydov (7):
>   memcg: add page_cgroup_ino helper
>   hwpoison: use page_cgroup_ino for filtering by memcg
>   memcg: zap try_get_mem_cgroup_from_page
>   proc: add kpagecgroup file
>   mmu-notifier: add clear_young callback
>   proc: add kpageidle file
>   proc: export idle flag via kpageflags
>
>  Documentation/vm/pagemap.txt           |  22 ++-
>  fs/proc/page.c                         | 274 +++++++++++++++++++++++++++++++++
>  fs/proc/task_mmu.c                     |   4 +-
>  include/linux/memcontrol.h             |   7 +-
>  include/linux/mm.h                     |  98 ++++++++++++
>  include/linux/mmu_notifier.h           |  44 ++++++
>  include/linux/page-flags.h             |  11 ++
>  include/linux/page_ext.h               |   4 +
>  include/uapi/linux/kernel-page-flags.h |   1 +
>  mm/Kconfig                             |  12 ++
>  mm/debug.c                             |   4 +
>  mm/huge_memory.c                       |  11 +-
>  mm/hwpoison-inject.c                   |   5 +-
>  mm/memcontrol.c                        |  71 +++++----
>  mm/memory-failure.c                    |  16 +-
>  mm/migrate.c                           |   5 +
>  mm/mmu_notifier.c                      |  17 ++
>  mm/page_ext.c                          |   3 +
>  mm/rmap.c                              |   5 +
>  mm/swap.c                              |   2 +
>  virt/kvm/kvm_main.c                    |  18 +++
>  21 files changed, 570 insertions(+), 64 deletions(-)
>
> --
> 2.1.4
>



-- 
Andres Lagar-Cavilla | Google Kernel Team | andreslc@google.com

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH -mm v8 4/7] proc: add kpagecgroup file
@ 2015-07-16  9:28       ` Vladimir Davydov
  0 siblings, 0 replies; 52+ messages in thread
From: Vladimir Davydov @ 2015-07-16  9:28 UTC (permalink / raw)
  To: Andres Lagar-Cavilla
  Cc: Andrew Morton, Minchan Kim, Raghavendra K T, Johannes Weiner,
	Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes,
	Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linux-api,
	linux-doc, linux-mm, cgroups, linux-kernel

On Wed, Jul 15, 2015 at 12:03:18PM -0700, Andres Lagar-Cavilla wrote:
> For both /proc/kpage* interfaces you add (and more critically for the
> rmap-causing one, kpageidle):
> 
> > It's a good idea to do cond_resched(). Whether after each pfn, each Nth
> pfn, each put_user, I leave to you, but a reasonable cadence is
> needed, because user-space can call this on the entire physical
> address space, and that's a lot of work to do without re-scheduling.

I really don't think it's necessary. These files can only be
read/written by root, who has plenty of ways to kill the system anyway.
The program that is allowed to read/write these files must be careful
and do it in batches of reasonable size. AFAICS the same reasoning
already lies behind /proc/kpagecount and /proc/kpageflags, which also
do not force the "right" batch size on their readers.
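
Just to illustrate the point, a rough sketch (hypothetical, not part of
the series) of a reader that picks both its batch size and its scanning
rate entirely in userspace:

#!/usr/bin/python
# Hypothetical sketch: scan /proc/kpageidle in fixed-size batches and
# sleep between batches, so the reader, not the kernel, decides how
# much work gets done per time slice.

import time
import struct

BATCH = 8 * 1024        # bytes per read, must be a multiple of 8
DELAY = 0.01            # seconds to sleep between batches

def count_idle_pages():
    total = 0
    with open("/proc/kpageidle", "rb") as f:
        while True:
            buf = f.read(BATCH)
            if not buf:
                break
            for i in range(0, len(buf), 8):
                word, = struct.unpack("Q", buf[i:i + 8])
                total += bin(word).count("1")
            time.sleep(DELAY)   # yield the CPU between batches
    return total

if __name__ == "__main__":
    print "idle pages:", count_idle_pages()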

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH -mm v8 6/7] proc: add kpageidle file
  2015-07-15 19:42     ` Andres Lagar-Cavilla
@ 2015-07-16  9:53       ` Vladimir Davydov
  -1 siblings, 0 replies; 52+ messages in thread
From: Vladimir Davydov @ 2015-07-16  9:53 UTC (permalink / raw)
  To: Andres Lagar-Cavilla
  Cc: Andrew Morton, Minchan Kim, Raghavendra K T, Johannes Weiner,
	Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes,
	Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linux-api,
	linux-doc, linux-mm, cgroups, linux-kernel

On Wed, Jul 15, 2015 at 12:42:28PM -0700, Andres Lagar-Cavilla wrote:
> On Wed, Jul 15, 2015 at 6:54 AM, Vladimir Davydov
> <vdavydov@parallels.com> wrote:
[...]
> > +static void kpageidle_clear_pte_refs(struct page *page)
> > +{
> > +       struct rmap_walk_control rwc = {
> > +               .rmap_one = kpageidle_clear_pte_refs_one,
> > +               .anon_lock = page_lock_anon_vma_read,
> > +       };
> > +       bool need_lock;
> > +
> > +       if (!page_mapped(page) ||
> 
> Question: what about mlocked pages? Is there any point in calculating
> their idleness?

Those can be filtered out with the aid of /proc/kpageflags (this is what
the script attached to patch #0 of the series actually does). We have to
read the latter anyway in order to get information about THP. That said,
I prefer not to introduce any artificial checks for locked memory. Who
knows, maybe one day somebody will use this API to track the access
pattern of an mlocked area.
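
For illustration, a rough sketch of that kind of filtering (hypothetical,
not part of the series; it assumes KPF_UNEVICTABLE is bit 18 and
KPF_MLOCKED is bit 33 in /proc/kpageflags):

#!/usr/bin/python
# Hypothetical sketch: count idle pages while skipping unevictable and
# mlocked ones, using only the flags exported via /proc/kpageflags.
# Assumes KPF_UNEVICTABLE is bit 18 and KPF_MLOCKED is bit 33.

import struct

SKIP_MASK = (1 << 18) | (1 << 33)   # unevictable | mlocked

def count_idle_excluding_mlocked():
    idle = 0
    f_idle = open("/proc/kpageidle", "rb")
    f_flags = open("/proc/kpageflags", "rb")
    word = 0    # index of the current 64-page bitmap word
    while True:
        s = f_idle.read(8)
        if not s:
            break
        bits, = struct.unpack("Q", s)
        for i in range(64):
            if not (bits >> i) & 1:
                continue
            f_flags.seek((word * 64 + i) * 8)
            data = f_flags.read(8)
            if len(data) < 8:
                continue
            flags, = struct.unpack("Q", data)
            if not flags & SKIP_MASK:
                idle += 1
        word += 1
    f_idle.close()
    f_flags.close()
    return idle

if __name__ == "__main__":
    print "idle pages excluding mlocked:", count_idle_excluding_mlocked()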

> 
> > +           !page_rmapping(page))
> 
> Not sure, does this skip SwapCache pages? Is there any point in
> calculating their idleness?

A SwapCache page may be mapped, and if it is we should not skip it. If
it is unmapped, we have nothing to do.

Regarding idleness of SwapCache pages, I think we shouldn't
differentiate them from other user pages here, because a shmem/anon page
can migrate to and from the swap cache occasionally during a
memory-active workload, and we don't want to lose its idle status then.

> 
> > +               return;
> > +
> > +       need_lock = !PageAnon(page) || PageKsm(page);
> > +       if (need_lock && !trylock_page(page))
> > +               return;
> > +
> > +       rmap_walk(page, &rwc);
> > +
> > +       if (need_lock)
> > +               unlock_page(page);
> > +}
[...]
> > @@ -1754,6 +1754,11 @@ static void __split_huge_page_refcount(struct page *page,
> >                 /* clear PageTail before overwriting first_page */
> >                 smp_wmb();
> >
> > +               if (page_is_young(page))
> > +                       set_page_young(page_tail);
> > +               if (page_is_idle(page))
> > +                       set_page_idle(page_tail);
> > +
> 
> Why not in the block above?
> 
> page_tail->flags |= (page->flags &
> ...
> #ifdef CONFIG_WHATEVER_IT_WAS
> 1 << PG_idle
> 1 << PG_young
> #endif

Too many ifdef's :-/ Note, the flags can be in page_ext, which means we
would have to add something like this

#if defined(CONFIG_WHATEVER_IT_WAS) && defined(CONFIG_64BIT)
1 << PG_idle
1 << PG_young
#endif
<...>
#ifndef CONFIG_64BIT
if (page_is_young(page))
	set_page_young(page_tail);
if (page_is_idle(page))
	set_page_idle(page_tail);
#endif

which IMO looks less readable than what we have now.

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH -mm v8 0/7] idle memory tracking
@ 2015-07-16 10:02     ` Vladimir Davydov
  0 siblings, 0 replies; 52+ messages in thread
From: Vladimir Davydov @ 2015-07-16 10:02 UTC (permalink / raw)
  To: Andres Lagar-Cavilla
  Cc: Andrew Morton, Minchan Kim, Raghavendra K T, Johannes Weiner,
	Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes,
	Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linux-api,
	linux-doc, linux-mm, cgroups, linux-kernel

On Wed, Jul 15, 2015 at 01:47:15PM -0700, Andres Lagar-Cavilla wrote:
> I think the remaining question here is performance.
> 
> Have you conducted any studies where
> - there is a workload
> - a daemon is poking kpageidle every N seconds/minutes
> - what is the daemon cpu consumption?
> - what is the workload degradation if any?
> 
> N candidates include 30 seconds, 1 minute, 2 minutes, 5 minutes....
> 
> Workload candidates include TPC, memory-intensive SPECint things like
> 429.mcf, and STREAM (http://www.cs.virginia.edu/stream/, "sustainable
> memory bandwidth" vs floating-point performance).
> 
> I'm not asking for a research paper, but if, say, a daemon with a
> 2-minute period introduces no degradation and adds up to a minute of
> CPU per hour, then we're golden.

Fair enough. Will do that soon and report back.

Thanks a lot for the review, it was really helpful!

Vladimir

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH -mm v8 5/7] mmu-notifier: add clear_young callback
  2015-07-15 19:16     ` Andres Lagar-Cavilla
@ 2015-07-16 11:35       ` Paolo Bonzini
  -1 siblings, 0 replies; 52+ messages in thread
From: Paolo Bonzini @ 2015-07-16 11:35 UTC (permalink / raw)
  To: Andres Lagar-Cavilla, Vladimir Davydov, kvm, Eric Northup
  Cc: Andrew Morton, Minchan Kim, Raghavendra K T, Johannes Weiner,
	Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes,
	Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linux-api,
	linux-doc, linux-mm, cgroups, linux-kernel



On 15/07/2015 21:16, Andres Lagar-Cavilla wrote:
>> > +static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
>> > +                                       struct mm_struct *mm,
>> > +                                       unsigned long start,
>> > +                                       unsigned long end)
>> > +{
>> > +       struct kvm *kvm = mmu_notifier_to_kvm(mn);
>> > +       int young, idx;
> For reclaim, the clear_flush_young notifier may blow up the secondary
> pte to estimate the access pattern, depending on hardware support (EPT
> access bits available in Haswell onwards, not sure about AMD, PPC,
> etc).

It seems like this problem is limited to pre-Haswell EPT.

I'm okay with the patch.  If we find problems later we can always add a
parameter to kvm_age_hva so that it effectively doesn't do anything on
clear_young.

Acked-by: Paolo Bonzini <pbonzini@redhat.com>

Paolo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH -mm v8 4/7] proc: add kpagecgroup file
  2015-07-16  9:28       ` Vladimir Davydov
  (?)
  (?)
@ 2015-07-16 19:04       ` Andres Lagar-Cavilla
  2015-07-17  9:27           ` Vladimir Davydov
  -1 siblings, 1 reply; 52+ messages in thread
From: Andres Lagar-Cavilla @ 2015-07-16 19:04 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, Minchan Kim, Raghavendra K T, Johannes Weiner,
	Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes,
	Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linux-api,
	linux-doc, linux-mm, cgroups, linux-kernel

On Thu, Jul 16, 2015 at 2:28 AM, Vladimir Davydov <vdavydov@parallels.com>
wrote:

> On Wed, Jul 15, 2015 at 12:03:18PM -0700, Andres Lagar-Cavilla wrote:
> > For both /proc/kpage* interfaces you add (and more critically for the
> > rmap-causing one, kpageidle):
> >
> > It's a good idea to do cond_resched(). Whether after each pfn, each Nth
> > pfn, each put_user, I leave to you, but a reasonable cadence is
> > needed, because user-space can call this on the entire physical
> > address space, and that's a lot of work to do without re-scheduling.
>
> I really don't think it's necessary. These files can only be
> read/written by root, who has plenty of ways to kill the system anyway.
> The program that is allowed to read/write these files must be careful
> and do it in batches of reasonable size. AFAICS the same reasoning
> already lies behind /proc/kpagecount and /proc/kpageflags, which also
> do not force the "right" batch size on their readers.
>

Beg to disagree. You're conflating intended use with system health. A
cond_resched() is a one-liner.

Andres

>
> Thanks,
> Vladimir
>



-- 
Andres Lagar-Cavilla | Google Kernel Team | andreslc@google.com

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH -mm v8 4/7] proc: add kpagecgroup file
@ 2015-07-17  9:27           ` Vladimir Davydov
  0 siblings, 0 replies; 52+ messages in thread
From: Vladimir Davydov @ 2015-07-17  9:27 UTC (permalink / raw)
  To: Andres Lagar-Cavilla
  Cc: Andrew Morton, Minchan Kim, Raghavendra K T, Johannes Weiner,
	Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes,
	Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linux-api,
	linux-doc, linux-mm, cgroups, linux-kernel

On Thu, Jul 16, 2015 at 12:04:59PM -0700, Andres Lagar-Cavilla wrote:
> On Thu, Jul 16, 2015 at 2:28 AM, Vladimir Davydov <vdavydov@parallels.com>
> wrote:
> 
> > On Wed, Jul 15, 2015 at 12:03:18PM -0700, Andres Lagar-Cavilla wrote:
> > > For both /proc/kpage* interfaces you add (and more critically for the
> > > rmap-causing one, kpageidle):
> > >
> > > It's a good idea to do cond_resched(). Whether after each pfn, each Nth
> > > pfn, each put_user, I leave to you, but a reasonable cadence is
> > > needed, because user-space can call this on the entire physical
> > > address space, and that's a lot of work to do without re-scheduling.
> >
> > I really don't think it's necessary. These files can only be
> > read/written by root, who has plenty of ways to kill the system anyway.
> > The program that is allowed to read/write these files must be careful
> > and do it in batches of reasonable size. AFAICS the same reasoning
> > already lies behind /proc/kpagecount and /proc/kpageflags, which also
> > do not force the "right" batch size on their readers.
> >
> 
> Beg to disagree. You're conflating intended use with system health. A
> cond_resched() is a one-liner.

I would still prefer not to clutter the code with cond_resched's, but I
don't think it's a matter worth arguing over, so I'll prepare a patch
that makes all /proc/kpage* files issue cond_resched periodically and
leave it up to Andrew to decide whether it should be applied.

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2015-07-17  9:28 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-07-15 13:54 [PATCH -mm v8 0/7] idle memory tracking Vladimir Davydov
2015-07-15 13:54 ` [PATCH -mm v8 1/7] memcg: add page_cgroup_ino helper Vladimir Davydov
2015-07-15 18:59   ` Andres Lagar-Cavilla
2015-07-15 13:54 ` [PATCH -mm v8 2/7] hwpoison: use page_cgroup_ino for filtering by memcg Vladimir Davydov
2015-07-15 19:00   ` Andres Lagar-Cavilla
2015-07-15 13:54 ` [PATCH -mm v8 3/7] memcg: zap try_get_mem_cgroup_from_page Vladimir Davydov
2015-07-15 13:54 ` [PATCH -mm v8 4/7] proc: add kpagecgroup file Vladimir Davydov
2015-07-15 19:03   ` Andres Lagar-Cavilla
2015-07-16  9:28     ` Vladimir Davydov
2015-07-16 19:04       ` Andres Lagar-Cavilla
2015-07-17  9:27         ` Vladimir Davydov
2015-07-15 13:54 ` [PATCH -mm v8 5/7] mmu-notifier: add clear_young callback Vladimir Davydov
2015-07-15 19:16   ` Andres Lagar-Cavilla
2015-07-16 11:35     ` Paolo Bonzini
2015-07-15 13:54 ` [PATCH -mm v8 6/7] proc: add kpageidle file Vladimir Davydov
2015-07-15 19:42   ` Andres Lagar-Cavilla
2015-07-16  9:53     ` Vladimir Davydov
2015-07-15 13:54 ` [PATCH -mm v8 7/7] proc: export idle flag via kpageflags Vladimir Davydov
2015-07-15 19:17   ` Andres Lagar-Cavilla
2015-07-15 20:47 ` [PATCH -mm v8 0/7] idle memory tracking Andres Lagar-Cavilla
2015-07-16 10:02   ` Vladimir Davydov
