From: James Hogan <james.hogan@imgtec.com>
To: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Linux-MM <linux-mm@kvack.org>, Rik van Riel <riel@surriel.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Minchan Kim <minchan@kernel.org>,
	Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	LKML <linux-kernel@vger.kernel.org>,
	metag <linux-metag@vger.kernel.org>
Subject: Re: [PATCH 03/34] mm, vmscan: move LRU lists to node
Date: Thu, 4 Aug 2016 21:59:17 +0100
Message-ID: <CAAG0J9_k3edxDzqpEjt2BqqZXMW4PVj7BNUBAk6TWtw3Zh_oMg@mail.gmail.com>
In-Reply-To: <1467970510-21195-4-git-send-email-mgorman@techsingularity.net>

On 8 July 2016 at 10:34, Mel Gorman <mgorman@techsingularity.net> wrote:
> This moves the LRU lists from the zone to the node and related data
> such as counters, tracing, congestion tracking and writeback tracking.
> Unfortunately, due to reclaim and compaction retry logic, it is necessary
> to account for the number of LRU pages on both zone and node logic.
> Most reclaim logic is based on the node counters but the retry logic uses
> the zone counters which do not distinguish inactive and active sizes.
> It would be possible to leave the LRU counters on a per-zone basis but
> it's a heavier calculation across multiple cache lines that is much more
> frequent than the retry checks.
>
> Other than the LRU counters, this is mostly a mechanical patch but note
> that it introduces a number of anomalies.  For example, the scans are
> per-zone but using per-node counters.  We also mark a node as congested
> when a zone is congested.  This causes weird problems that are fixed later
> but is easier to review.
>
> In the event that there is excessive overhead on 32-bit systems due to
> the nodes being on LRU then there are two potential solutions
>
> 1. Long-term isolation of highmem pages when reclaim is lowmem
>
>    When pages are skipped, they are immediately added back onto the LRU
>    list. If lowmem reclaim persisted for long periods of time, the same
>    highmem pages get continually scanned. The idea would be that lowmem
>    keeps those pages on a separate list until a reclaim for highmem pages
>    arrives that splices the highmem pages back onto the LRU. It potentially
>    could be implemented similar to the UNEVICTABLE list.
>
>    That would reduce the skip rate, with the potential corner case being that
>    highmem pages have to be scanned and reclaimed to free lowmem slab pages.
>
> 2. Linear scan lowmem pages if the initial LRU shrink fails
>
>    This will break LRU ordering but may be preferable and faster during
>    memory pressure than skipping LRU pages.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

This breaks boot on the metag architecture:
Oops: err 0007 (Data access general read/write fault) addr 00233008 [#1]

It appears to be in node_page_state_snapshot() (via
pgdat_reclaimable()), and to have come via mm_init(). Here's the
relevant bit of the backtrace:

    node_page_state_snapshot@0x4009c884(enum node_stat_item item = ???, struct pglist_data * pgdat = ???) + 0x48
    pgdat_reclaimable(struct pglist_data * pgdat = 0x402517a0)
    show_free_areas(unsigned int filter = 0) + 0x2cc
    show_mem(unsigned int filter = 0) + 0x18
    mm_init@0x4025c3d4()
    start_kernel() + 0x204

__per_cpu_offset[0] == 0x233000 (close to the bad address),
pgdat->per_cpu_nodestats == NULL, and setup_per_cpu_pageset()
definitely hasn't been called yet (mm_init() is called before
setup_per_cpu_pageset()).
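
If I've followed it right, with pgdat->per_cpu_nodestats still NULL at
that point, the SMP loop in node_page_state_snapshot() ends up doing
roughly this (hand-wavy sketch of my reading, not a proposed fix):

    /* pgdat->per_cpu_nodestats == NULL this early in boot */
    struct per_cpu_nodestat *p = per_cpu_ptr(pgdat->per_cpu_nodestats, cpu);
    /* ~= NULL + __per_cpu_offset[cpu], i.e. 0x233000 for cpu 0 here */
    x += p->vm_node_stat_diff[item];    /* faults just past 0x233000 */

which would explain the fault address. Something like the guard below
(completely untested) would presumably dodge the NULL dereference, but
it feels like it would just paper over an init-ordering problem:

    if (pgdat->per_cpu_nodestats) {
        for_each_online_cpu(cpu)
            x += per_cpu_ptr(pgdat->per_cpu_nodestats, cpu)->vm_node_stat_diff[item];
    }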

Any ideas what the correct solution is (and why, presumably, others
haven't seen the same issue on other architectures)?

Thanks
James

> ---
>  arch/tile/mm/pgtable.c                    |   8 +-
>  drivers/base/node.c                       |  19 +--
>  drivers/staging/android/lowmemorykiller.c |   8 +-
>  include/linux/backing-dev.h               |   2 +-
>  include/linux/memcontrol.h                |  18 +--
>  include/linux/mm_inline.h                 |  21 ++-
>  include/linux/mmzone.h                    |  68 +++++----
>  include/linux/swap.h                      |   1 +
>  include/linux/vm_event_item.h             |  10 +-
>  include/linux/vmstat.h                    |  17 +++
>  include/trace/events/vmscan.h             |  12 +-
>  kernel/power/snapshot.c                   |  10 +-
>  mm/backing-dev.c                          |  15 +-
>  mm/compaction.c                           |  18 +--
>  mm/huge_memory.c                          |   2 +-
>  mm/internal.h                             |   2 +-
>  mm/khugepaged.c                           |   4 +-
>  mm/memcontrol.c                           |  17 +--
>  mm/memory-failure.c                       |   4 +-
>  mm/memory_hotplug.c                       |   2 +-
>  mm/mempolicy.c                            |   2 +-
>  mm/migrate.c                              |  21 +--
>  mm/mlock.c                                |   2 +-
>  mm/page-writeback.c                       |   8 +-
>  mm/page_alloc.c                           |  68 +++++----
>  mm/swap.c                                 |  50 +++----
>  mm/vmscan.c                               | 226 +++++++++++++++++-------------
>  mm/vmstat.c                               |  47 ++++---
>  mm/workingset.c                           |   4 +-
>  29 files changed, 386 insertions(+), 300 deletions(-)
>
> diff --git a/arch/tile/mm/pgtable.c b/arch/tile/mm/pgtable.c
> index c4d5bf841a7f..9e389213580d 100644
> --- a/arch/tile/mm/pgtable.c
> +++ b/arch/tile/mm/pgtable.c
> @@ -45,10 +45,10 @@ void show_mem(unsigned int filter)
>         struct zone *zone;
>
>         pr_err("Active:%lu inactive:%lu dirty:%lu writeback:%lu unstable:%lu free:%lu\n slab:%lu mapped:%lu pagetables:%lu bounce:%lu pagecache:%lu swap:%lu\n",
> -              (global_page_state(NR_ACTIVE_ANON) +
> -               global_page_state(NR_ACTIVE_FILE)),
> -              (global_page_state(NR_INACTIVE_ANON) +
> -               global_page_state(NR_INACTIVE_FILE)),
> +              (global_node_page_state(NR_ACTIVE_ANON) +
> +               global_node_page_state(NR_ACTIVE_FILE)),
> +              (global_node_page_state(NR_INACTIVE_ANON) +
> +               global_node_page_state(NR_INACTIVE_FILE)),
>                global_page_state(NR_FILE_DIRTY),
>                global_page_state(NR_WRITEBACK),
>                global_page_state(NR_UNSTABLE_NFS),
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 92d8e090c5b3..b7f01a4a642d 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -56,6 +56,7 @@ static ssize_t node_read_meminfo(struct device *dev,
>  {
>         int n;
>         int nid = dev->id;
> +       struct pglist_data *pgdat = NODE_DATA(nid);
>         struct sysinfo i;
>
>         si_meminfo_node(&i, nid);
> @@ -74,15 +75,15 @@ static ssize_t node_read_meminfo(struct device *dev,
>                        nid, K(i.totalram),
>                        nid, K(i.freeram),
>                        nid, K(i.totalram - i.freeram),
> -                      nid, K(sum_zone_node_page_state(nid, NR_ACTIVE_ANON) +
> -                               sum_zone_node_page_state(nid, NR_ACTIVE_FILE)),
> -                      nid, K(sum_zone_node_page_state(nid, NR_INACTIVE_ANON) +
> -                               sum_zone_node_page_state(nid, NR_INACTIVE_FILE)),
> -                      nid, K(sum_zone_node_page_state(nid, NR_ACTIVE_ANON)),
> -                      nid, K(sum_zone_node_page_state(nid, NR_INACTIVE_ANON)),
> -                      nid, K(sum_zone_node_page_state(nid, NR_ACTIVE_FILE)),
> -                      nid, K(sum_zone_node_page_state(nid, NR_INACTIVE_FILE)),
> -                      nid, K(sum_zone_node_page_state(nid, NR_UNEVICTABLE)),
> +                      nid, K(node_page_state(pgdat, NR_ACTIVE_ANON) +
> +                               node_page_state(pgdat, NR_ACTIVE_FILE)),
> +                      nid, K(node_page_state(pgdat, NR_INACTIVE_ANON) +
> +                               node_page_state(pgdat, NR_INACTIVE_FILE)),
> +                      nid, K(node_page_state(pgdat, NR_ACTIVE_ANON)),
> +                      nid, K(node_page_state(pgdat, NR_INACTIVE_ANON)),
> +                      nid, K(node_page_state(pgdat, NR_ACTIVE_FILE)),
> +                      nid, K(node_page_state(pgdat, NR_INACTIVE_FILE)),
> +                      nid, K(node_page_state(pgdat, NR_UNEVICTABLE)),
>                        nid, K(sum_zone_node_page_state(nid, NR_MLOCK)));
>
>  #ifdef CONFIG_HIGHMEM
> diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
> index 24d2745e9437..93dbcc38eb0f 100644
> --- a/drivers/staging/android/lowmemorykiller.c
> +++ b/drivers/staging/android/lowmemorykiller.c
> @@ -72,10 +72,10 @@ static unsigned long lowmem_deathpending_timeout;
>  static unsigned long lowmem_count(struct shrinker *s,
>                                   struct shrink_control *sc)
>  {
> -       return global_page_state(NR_ACTIVE_ANON) +
> -               global_page_state(NR_ACTIVE_FILE) +
> -               global_page_state(NR_INACTIVE_ANON) +
> -               global_page_state(NR_INACTIVE_FILE);
> +       return global_node_page_state(NR_ACTIVE_ANON) +
> +               global_node_page_state(NR_ACTIVE_FILE) +
> +               global_node_page_state(NR_INACTIVE_ANON) +
> +               global_node_page_state(NR_INACTIVE_FILE);
>  }
>
>  static unsigned long lowmem_scan(struct shrinker *s, struct shrink_control *sc)
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index c82794f20110..491a91717788 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -197,7 +197,7 @@ static inline int wb_congested(struct bdi_writeback *wb, int cong_bits)
>  }
>
>  long congestion_wait(int sync, long timeout);
> -long wait_iff_congested(struct zone *zone, int sync, long timeout);
> +long wait_iff_congested(struct pglist_data *pgdat, int sync, long timeout);
>  int pdflush_proc_obsolete(struct ctl_table *table, int write,
>                 void __user *buffer, size_t *lenp, loff_t *ppos);
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 104efa6874db..68f1121c8fe7 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -340,7 +340,7 @@ static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
>         struct lruvec *lruvec;
>
>         if (mem_cgroup_disabled()) {
> -               lruvec = &zone->lruvec;
> +               lruvec = zone_lruvec(zone);
>                 goto out;
>         }
>
> @@ -349,15 +349,15 @@ static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
>  out:
>         /*
>          * Since a node can be onlined after the mem_cgroup was created,
> -        * we have to be prepared to initialize lruvec->zone here;
> +        * we have to be prepared to initialize lruvec->pgdat here;
>          * and if offlined then reonlined, we need to reinitialize it.
>          */
> -       if (unlikely(lruvec->zone != zone))
> -               lruvec->zone = zone;
> +       if (unlikely(lruvec->pgdat != zone->zone_pgdat))
> +               lruvec->pgdat = zone->zone_pgdat;
>         return lruvec;
>  }
>
> -struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
> +struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *);
>
>  bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);
>  struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
> @@ -438,7 +438,7 @@ static inline bool mem_cgroup_online(struct mem_cgroup *memcg)
>  int mem_cgroup_select_victim_node(struct mem_cgroup *memcg);
>
>  void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
> -               int nr_pages);
> +               enum zone_type zid, int nr_pages);
>
>  unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
>                                            int nid, unsigned int lru_mask);
> @@ -613,13 +613,13 @@ static inline void mem_cgroup_migrate(struct page *old, struct page *new)
>  static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
>                                                     struct mem_cgroup *memcg)
>  {
> -       return &zone->lruvec;
> +       return zone_lruvec(zone);
>  }
>
>  static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
> -                                                   struct zone *zone)
> +                                                   struct pglist_data *pgdat)
>  {
> -       return &zone->lruvec;
> +       return &pgdat->lruvec;
>  }
>
>  static inline bool mm_match_cgroup(struct mm_struct *mm,
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index 5bd29ba4f174..9aadcc781857 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -23,25 +23,32 @@ static inline int page_is_file_cache(struct page *page)
>  }
>
>  static __always_inline void __update_lru_size(struct lruvec *lruvec,
> -                               enum lru_list lru, int nr_pages)
> +                               enum lru_list lru, enum zone_type zid,
> +                               int nr_pages)
>  {
> -       __mod_zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru, nr_pages);
> +       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> +
> +       __mod_node_page_state(pgdat, NR_LRU_BASE + lru, nr_pages);
> +       __mod_zone_page_state(&pgdat->node_zones[zid],
> +               NR_ZONE_LRU_BASE + !!is_file_lru(lru),
> +               nr_pages);
>  }
>
>  static __always_inline void update_lru_size(struct lruvec *lruvec,
> -                               enum lru_list lru, int nr_pages)
> +                               enum lru_list lru, enum zone_type zid,
> +                               int nr_pages)
>  {
>  #ifdef CONFIG_MEMCG
> -       mem_cgroup_update_lru_size(lruvec, lru, nr_pages);
> +       mem_cgroup_update_lru_size(lruvec, lru, zid, nr_pages);
>  #else
> -       __update_lru_size(lruvec, lru, nr_pages);
> +       __update_lru_size(lruvec, lru, zid, nr_pages);
>  #endif
>  }
>
>  static __always_inline void add_page_to_lru_list(struct page *page,
>                                 struct lruvec *lruvec, enum lru_list lru)
>  {
> -       update_lru_size(lruvec, lru, hpage_nr_pages(page));
> +       update_lru_size(lruvec, lru, page_zonenum(page), hpage_nr_pages(page));
>         list_add(&page->lru, &lruvec->lists[lru]);
>  }
>
> @@ -49,7 +56,7 @@ static __always_inline void del_page_from_lru_list(struct page *page,
>                                 struct lruvec *lruvec, enum lru_list lru)
>  {
>         list_del(&page->lru);
> -       update_lru_size(lruvec, lru, -hpage_nr_pages(page));
> +       update_lru_size(lruvec, lru, page_zonenum(page), -hpage_nr_pages(page));
>  }
>
>  /**
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index cfa870107abe..d4f5cac0a8c3 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -111,12 +111,9 @@ enum zone_stat_item {
>         /* First 128 byte cacheline (assuming 64 bit words) */
>         NR_FREE_PAGES,
>         NR_ALLOC_BATCH,
> -       NR_LRU_BASE,
> -       NR_INACTIVE_ANON = NR_LRU_BASE, /* must match order of LRU_[IN]ACTIVE */
> -       NR_ACTIVE_ANON,         /*  "     "     "   "       "         */
> -       NR_INACTIVE_FILE,       /*  "     "     "   "       "         */
> -       NR_ACTIVE_FILE,         /*  "     "     "   "       "         */
> -       NR_UNEVICTABLE,         /*  "     "     "   "       "         */
> +       NR_ZONE_LRU_BASE, /* Used only for compaction and reclaim retry */
> +       NR_ZONE_LRU_ANON = NR_ZONE_LRU_BASE,
> +       NR_ZONE_LRU_FILE,
>         NR_MLOCK,               /* mlock()ed pages found and moved off LRU */
>         NR_ANON_PAGES,  /* Mapped anonymous pages */
>         NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
> @@ -134,12 +131,9 @@ enum zone_stat_item {
>         NR_VMSCAN_WRITE,
>         NR_VMSCAN_IMMEDIATE,    /* Prioritise for reclaim when writeback ends */
>         NR_WRITEBACK_TEMP,      /* Writeback using temporary buffers */
> -       NR_ISOLATED_ANON,       /* Temporary isolated pages from anon lru */
> -       NR_ISOLATED_FILE,       /* Temporary isolated pages from file lru */
>         NR_SHMEM,               /* shmem pages (included tmpfs/GEM pages) */
>         NR_DIRTIED,             /* page dirtyings since bootup */
>         NR_WRITTEN,             /* page writings since bootup */
> -       NR_PAGES_SCANNED,       /* pages scanned since last reclaim */
>  #if IS_ENABLED(CONFIG_ZSMALLOC)
>         NR_ZSPAGES,             /* allocated in zsmalloc */
>  #endif
> @@ -161,6 +155,15 @@ enum zone_stat_item {
>         NR_VM_ZONE_STAT_ITEMS };
>
>  enum node_stat_item {
> +       NR_LRU_BASE,
> +       NR_INACTIVE_ANON = NR_LRU_BASE, /* must match order of LRU_[IN]ACTIVE */
> +       NR_ACTIVE_ANON,         /*  "     "     "   "       "         */
> +       NR_INACTIVE_FILE,       /*  "     "     "   "       "         */
> +       NR_ACTIVE_FILE,         /*  "     "     "   "       "         */
> +       NR_UNEVICTABLE,         /*  "     "     "   "       "         */
> +       NR_ISOLATED_ANON,       /* Temporary isolated pages from anon lru */
> +       NR_ISOLATED_FILE,       /* Temporary isolated pages from file lru */
> +       NR_PAGES_SCANNED,       /* pages scanned since last reclaim */
>         NR_VM_NODE_STAT_ITEMS
>  };
>
> @@ -219,7 +222,7 @@ struct lruvec {
>         /* Evictions & activations on the inactive file list */
>         atomic_long_t                   inactive_age;
>  #ifdef CONFIG_MEMCG
> -       struct zone                     *zone;
> +       struct pglist_data *pgdat;
>  #endif
>  };
>
> @@ -357,13 +360,6 @@ struct zone {
>  #ifdef CONFIG_NUMA
>         int node;
>  #endif
> -
> -       /*
> -        * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
> -        * this zone's LRU.  Maintained by the pageout code.
> -        */
> -       unsigned int inactive_ratio;
> -
>         struct pglist_data      *zone_pgdat;
>         struct per_cpu_pageset __percpu *pageset;
>
> @@ -495,9 +491,6 @@ struct zone {
>
>         /* Write-intensive fields used by page reclaim */
>
> -       /* Fields commonly accessed by the page reclaim scanner */
> -       struct lruvec           lruvec;
> -
>         /*
>          * When free pages are below this point, additional steps are taken
>          * when reading the number of free pages to avoid per-cpu counter
> @@ -537,17 +530,20 @@ struct zone {
>
>  enum zone_flags {
>         ZONE_RECLAIM_LOCKED,            /* prevents concurrent reclaim */
> -       ZONE_CONGESTED,                 /* zone has many dirty pages backed by
> +       ZONE_FAIR_DEPLETED,             /* fair zone policy batch depleted */
> +};
> +
> +enum pgdat_flags {
> +       PGDAT_CONGESTED,                /* pgdat has many dirty pages backed by
>                                          * a congested BDI
>                                          */
> -       ZONE_DIRTY,                     /* reclaim scanning has recently found
> +       PGDAT_DIRTY,                    /* reclaim scanning has recently found
>                                          * many dirty file pages at the tail
>                                          * of the LRU.
>                                          */
> -       ZONE_WRITEBACK,                 /* reclaim scanning has recently found
> +       PGDAT_WRITEBACK,                /* reclaim scanning has recently found
>                                          * many pages under writeback
>                                          */
> -       ZONE_FAIR_DEPLETED,             /* fair zone policy batch depleted */
>  };
>
>  static inline unsigned long zone_end_pfn(const struct zone *zone)
> @@ -707,6 +703,19 @@ typedef struct pglist_data {
>         unsigned long split_queue_len;
>  #endif
>
> +       /* Fields commonly accessed by the page reclaim scanner */
> +       struct lruvec           lruvec;
> +
> +       /*
> +        * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
> +        * this node's LRU.  Maintained by the pageout code.
> +        */
> +       unsigned int inactive_ratio;
> +
> +       unsigned long           flags;
> +
> +       ZONE_PADDING(_pad2_)
> +
>         /* Per-node vmstats */
>         struct per_cpu_nodestat __percpu *per_cpu_nodestats;
>         atomic_long_t           vm_stat[NR_VM_NODE_STAT_ITEMS];
> @@ -728,6 +737,11 @@ static inline spinlock_t *zone_lru_lock(struct zone *zone)
>         return &zone->zone_pgdat->lru_lock;
>  }
>
> +static inline struct lruvec *zone_lruvec(struct zone *zone)
> +{
> +       return &zone->zone_pgdat->lruvec;
> +}
> +
>  static inline unsigned long pgdat_end_pfn(pg_data_t *pgdat)
>  {
>         return pgdat->node_start_pfn + pgdat->node_spanned_pages;
> @@ -779,12 +793,12 @@ extern int init_currently_empty_zone(struct zone *zone, unsigned long start_pfn,
>
>  extern void lruvec_init(struct lruvec *lruvec);
>
> -static inline struct zone *lruvec_zone(struct lruvec *lruvec)
> +static inline struct pglist_data *lruvec_pgdat(struct lruvec *lruvec)
>  {
>  #ifdef CONFIG_MEMCG
> -       return lruvec->zone;
> +       return lruvec->pgdat;
>  #else
> -       return container_of(lruvec, struct zone, lruvec);
> +       return container_of(lruvec, struct pglist_data, lruvec);
>  #endif
>  }
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 0af2bb2028fd..c82f916008b7 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -317,6 +317,7 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
>
>  /* linux/mm/vmscan.c */
>  extern unsigned long zone_reclaimable_pages(struct zone *zone);
> +extern unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat);
>  extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>                                         gfp_t gfp_mask, nodemask_t *mask);
>  extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 42604173f122..1798ff542517 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -26,11 +26,11 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>                 PGFREE, PGACTIVATE, PGDEACTIVATE,
>                 PGFAULT, PGMAJFAULT,
>                 PGLAZYFREED,
> -               FOR_ALL_ZONES(PGREFILL),
> -               FOR_ALL_ZONES(PGSTEAL_KSWAPD),
> -               FOR_ALL_ZONES(PGSTEAL_DIRECT),
> -               FOR_ALL_ZONES(PGSCAN_KSWAPD),
> -               FOR_ALL_ZONES(PGSCAN_DIRECT),
> +               PGREFILL,
> +               PGSTEAL_KSWAPD,
> +               PGSTEAL_DIRECT,
> +               PGSCAN_KSWAPD,
> +               PGSCAN_DIRECT,
>                 PGSCAN_DIRECT_THROTTLE,
>  #ifdef CONFIG_NUMA
>                 PGSCAN_ZONE_RECLAIM_FAILED,
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index d1744aa3ab9c..fee321c98550 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -178,6 +178,23 @@ static inline unsigned long zone_page_state_snapshot(struct zone *zone,
>         return x;
>  }
>
> +static inline unsigned long node_page_state_snapshot(pg_data_t *pgdat,
> +                                       enum node_stat_item item)
> +{
> +       long x = atomic_long_read(&pgdat->vm_stat[item]);
> +
> +#ifdef CONFIG_SMP
> +       int cpu;
> +       for_each_online_cpu(cpu)
> +               x += per_cpu_ptr(pgdat->per_cpu_nodestats, cpu)->vm_node_stat_diff[item];
> +
> +       if (x < 0)
> +               x = 0;
> +#endif
> +       return x;
> +}
> +
> +
>  #ifdef CONFIG_NUMA
>  extern unsigned long sum_zone_node_page_state(int node,
>                                                 enum zone_stat_item item);
> diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
> index 0101ef37f1ee..897f1aa1ee5f 100644
> --- a/include/trace/events/vmscan.h
> +++ b/include/trace/events/vmscan.h
> @@ -352,15 +352,14 @@ TRACE_EVENT(mm_vmscan_writepage,
>
>  TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
>
> -       TP_PROTO(struct zone *zone,
> +       TP_PROTO(int nid,
>                 unsigned long nr_scanned, unsigned long nr_reclaimed,
>                 int priority, int file),
>
> -       TP_ARGS(zone, nr_scanned, nr_reclaimed, priority, file),
> +       TP_ARGS(nid, nr_scanned, nr_reclaimed, priority, file),
>
>         TP_STRUCT__entry(
>                 __field(int, nid)
> -               __field(int, zid)
>                 __field(unsigned long, nr_scanned)
>                 __field(unsigned long, nr_reclaimed)
>                 __field(int, priority)
> @@ -368,16 +367,15 @@ TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
>         ),
>
>         TP_fast_assign(
> -               __entry->nid = zone_to_nid(zone);
> -               __entry->zid = zone_idx(zone);
> +               __entry->nid = nid;
>                 __entry->nr_scanned = nr_scanned;
>                 __entry->nr_reclaimed = nr_reclaimed;
>                 __entry->priority = priority;
>                 __entry->reclaim_flags = trace_shrink_flags(file);
>         ),
>
> -       TP_printk("nid=%d zid=%d nr_scanned=%ld nr_reclaimed=%ld priority=%d flags=%s",
> -               __entry->nid, __entry->zid,
> +       TP_printk("nid=%d nr_scanned=%ld nr_reclaimed=%ld priority=%d flags=%s",
> +               __entry->nid,
>                 __entry->nr_scanned, __entry->nr_reclaimed,
>                 __entry->priority,
>                 show_reclaim_flags(__entry->reclaim_flags))
> diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
> index 3a970604308f..24a06bc23f85 100644
> --- a/kernel/power/snapshot.c
> +++ b/kernel/power/snapshot.c
> @@ -1525,11 +1525,11 @@ static unsigned long minimum_image_size(unsigned long saveable)
>         unsigned long size;
>
>         size = global_page_state(NR_SLAB_RECLAIMABLE)
> -               + global_page_state(NR_ACTIVE_ANON)
> -               + global_page_state(NR_INACTIVE_ANON)
> -               + global_page_state(NR_ACTIVE_FILE)
> -               + global_page_state(NR_INACTIVE_FILE)
> -               - global_page_state(NR_FILE_MAPPED);
> +               + global_node_page_state(NR_ACTIVE_ANON)
> +               + global_node_page_state(NR_INACTIVE_ANON)
> +               + global_node_page_state(NR_ACTIVE_FILE)
> +               + global_node_page_state(NR_INACTIVE_FILE)
> +               - global_node_page_state(NR_FILE_MAPPED);
>
>         return saveable <= size ? 0 : saveable - size;
>  }
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index f53b23ab7ed7..a8c3af46bd3d 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -982,24 +982,24 @@ long congestion_wait(int sync, long timeout)
>  EXPORT_SYMBOL(congestion_wait);
>
>  /**
> - * wait_iff_congested - Conditionally wait for a backing_dev to become uncongested or a zone to complete writes
> - * @zone: A zone to check if it is heavily congested
> + * wait_iff_congested - Conditionally wait for a backing_dev to become uncongested or a pgdat to complete writes
> + * @pgdat: A pgdat to check if it is heavily congested
>   * @sync: SYNC or ASYNC IO
>   * @timeout: timeout in jiffies
>   *
>   * In the event of a congested backing_dev (any backing_dev) and the given
> - * @zone has experienced recent congestion, this waits for up to @timeout
> + * @pgdat has experienced recent congestion, this waits for up to @timeout
>   * jiffies for either a BDI to exit congestion of the given @sync queue
>   * or a write to complete.
>   *
> - * In the absence of zone congestion, cond_resched() is called to yield
> + * In the absence of pgdat congestion, cond_resched() is called to yield
>   * the processor if necessary but otherwise does not sleep.
>   *
>   * The return value is 0 if the sleep is for the full timeout. Otherwise,
>   * it is the number of jiffies that were still remaining when the function
>   * returned. return_value == timeout implies the function did not sleep.
>   */
> -long wait_iff_congested(struct zone *zone, int sync, long timeout)
> +long wait_iff_congested(struct pglist_data *pgdat, int sync, long timeout)
>  {
>         long ret;
>         unsigned long start = jiffies;
> @@ -1008,12 +1008,13 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
>
>         /*
>          * If there is no congestion, or heavy congestion is not being
> -        * encountered in the current zone, yield if necessary instead
> +        * encountered in the current pgdat, yield if necessary instead
>          * of sleeping on the congestion queue
>          */
>         if (atomic_read(&nr_wb_congested[sync]) == 0 ||
> -           !test_bit(ZONE_CONGESTED, &zone->flags)) {
> +           !test_bit(PGDAT_CONGESTED, &pgdat->flags)) {
>                 cond_resched();
> +
>                 /* In case we scheduled, work out time remaining */
>                 ret = timeout - (jiffies - start);
>                 if (ret < 0)
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 7607efb7bee2..a0bd85712516 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -646,8 +646,8 @@ static void acct_isolated(struct zone *zone, struct compact_control *cc)
>         list_for_each_entry(page, &cc->migratepages, lru)
>                 count[!!page_is_file_cache(page)]++;
>
> -       mod_zone_page_state(zone, NR_ISOLATED_ANON, count[0]);
> -       mod_zone_page_state(zone, NR_ISOLATED_FILE, count[1]);
> +       mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_ANON, count[0]);
> +       mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE, count[1]);
>  }
>
>  /* Similar to reclaim, but different enough that they don't share logic */
> @@ -655,12 +655,12 @@ static bool too_many_isolated(struct zone *zone)
>  {
>         unsigned long active, inactive, isolated;
>
> -       inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
> -                                       zone_page_state(zone, NR_INACTIVE_ANON);
> -       active = zone_page_state(zone, NR_ACTIVE_FILE) +
> -                                       zone_page_state(zone, NR_ACTIVE_ANON);
> -       isolated = zone_page_state(zone, NR_ISOLATED_FILE) +
> -                                       zone_page_state(zone, NR_ISOLATED_ANON);
> +       inactive = node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE) +
> +                       node_page_state(zone->zone_pgdat, NR_INACTIVE_ANON);
> +       active = node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE) +
> +                       node_page_state(zone->zone_pgdat, NR_ACTIVE_ANON);
> +       isolated = node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE) +
> +                       node_page_state(zone->zone_pgdat, NR_ISOLATED_ANON);
>
>         return isolated > (inactive + active) / 2;
>  }
> @@ -856,7 +856,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>                         }
>                 }
>
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
>
>                 /* Try isolate the page */
>                 if (__isolate_lru_page(page, isolate_mode) != 0)
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 2f997328ae64..5d5b2207cfd2 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1830,7 +1830,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>         pgoff_t end = -1;
>         int i;
>
> -       lruvec = mem_cgroup_page_lruvec(head, zone);
> +       lruvec = mem_cgroup_page_lruvec(head, zone->zone_pgdat);
>
>         /* complete memcg works before add pages to LRU */
>         mem_cgroup_split_huge_fixup(head);
> diff --git a/mm/internal.h b/mm/internal.h
> index 9b6a6c43ac39..2f80d0343c56 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -78,7 +78,7 @@ extern unsigned long highest_memmap_pfn;
>   */
>  extern int isolate_lru_page(struct page *page);
>  extern void putback_lru_page(struct page *page);
> -extern bool zone_reclaimable(struct zone *zone);
> +extern bool pgdat_reclaimable(struct pglist_data *pgdat);
>
>  /*
>   * in mm/rmap.c:
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 93d5f87c00d5..d7a49f665f04 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -480,7 +480,7 @@ void __khugepaged_exit(struct mm_struct *mm)
>  static void release_pte_page(struct page *page)
>  {
>         /* 0 stands for page_is_file_cache(page) == false */
> -       dec_zone_page_state(page, NR_ISOLATED_ANON + 0);
> +       dec_node_page_state(page, NR_ISOLATED_ANON + 0);
>         unlock_page(page);
>         putback_lru_page(page);
>  }
> @@ -576,7 +576,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>                         goto out;
>                 }
>                 /* 0 stands for page_is_file_cache(page) == false */
> -               inc_zone_page_state(page, NR_ISOLATED_ANON + 0);
> +               inc_node_page_state(page, NR_ISOLATED_ANON + 0);
>                 VM_BUG_ON_PAGE(!PageLocked(page), page);
>                 VM_BUG_ON_PAGE(PageLRU(page), page);
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 9b70f9ca8ddf..50c86ad121bc 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -943,14 +943,14 @@ static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
>   * and putback protocol: the LRU lock must be held, and the page must
>   * either be PageLRU() or the caller must have isolated/allocated it.
>   */
> -struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
> +struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgdat)
>  {
>         struct mem_cgroup_per_zone *mz;
>         struct mem_cgroup *memcg;
>         struct lruvec *lruvec;
>
>         if (mem_cgroup_disabled()) {
> -               lruvec = &zone->lruvec;
> +               lruvec = &pgdat->lruvec;
>                 goto out;
>         }
>
> @@ -970,8 +970,8 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
>          * we have to be prepared to initialize lruvec->zone here;
>          * and if offlined then reonlined, we need to reinitialize it.
>          */
> -       if (unlikely(lruvec->zone != zone))
> -               lruvec->zone = zone;
> +       if (unlikely(lruvec->pgdat != pgdat))
> +               lruvec->pgdat = pgdat;
>         return lruvec;
>  }
>
> @@ -979,6 +979,7 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
>   * mem_cgroup_update_lru_size - account for adding or removing an lru page
>   * @lruvec: mem_cgroup per zone lru vector
>   * @lru: index of lru list the page is sitting on
> + * @zid: Zone ID of the zone pages have been added to
>   * @nr_pages: positive when adding or negative when removing
>   *
>   * This function must be called under lru_lock, just before a page is added
> @@ -986,14 +987,14 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
>   * so as to allow it to check that lru_size 0 is consistent with list_empty).
>   */
>  void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
> -                               int nr_pages)
> +                               enum zone_type zid, int nr_pages)
>  {
>         struct mem_cgroup_per_zone *mz;
>         unsigned long *lru_size;
>         long size;
>         bool empty;
>
> -       __update_lru_size(lruvec, lru, nr_pages);
> +       __update_lru_size(lruvec, lru, zid, nr_pages);
>
>         if (mem_cgroup_disabled())
>                 return;
> @@ -2069,7 +2070,7 @@ static void lock_page_lru(struct page *page, int *isolated)
>         if (PageLRU(page)) {
>                 struct lruvec *lruvec;
>
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
>                 ClearPageLRU(page);
>                 del_page_from_lru_list(page, lruvec, page_lru(page));
>                 *isolated = 1;
> @@ -2084,7 +2085,7 @@ static void unlock_page_lru(struct page *page, int isolated)
>         if (isolated) {
>                 struct lruvec *lruvec;
>
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
>                 VM_BUG_ON_PAGE(PageLRU(page), page);
>                 SetPageLRU(page);
>                 add_page_to_lru_list(page, lruvec, page_lru(page));
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 2fcca6b0e005..11de752ccaf5 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1663,7 +1663,7 @@ static int __soft_offline_page(struct page *page, int flags)
>         put_hwpoison_page(page);
>         if (!ret) {
>                 LIST_HEAD(pagelist);
> -               inc_zone_page_state(page, NR_ISOLATED_ANON +
> +               inc_node_page_state(page, NR_ISOLATED_ANON +
>                                         page_is_file_cache(page));
>                 list_add(&page->lru, &pagelist);
>                 ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
> @@ -1671,7 +1671,7 @@ static int __soft_offline_page(struct page *page, int flags)
>                 if (ret) {
>                         if (!list_empty(&pagelist)) {
>                                 list_del(&page->lru);
> -                               dec_zone_page_state(page, NR_ISOLATED_ANON +
> +                               dec_node_page_state(page, NR_ISOLATED_ANON +
>                                                 page_is_file_cache(page));
>                                 putback_lru_page(page);
>                         }
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 82d0b98d27f8..c5278360ca66 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1586,7 +1586,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
>                         put_page(page);
>                         list_add_tail(&page->lru, &source);
>                         move_pages--;
> -                       inc_zone_page_state(page, NR_ISOLATED_ANON +
> +                       inc_node_page_state(page, NR_ISOLATED_ANON +
>                                             page_is_file_cache(page));
>
>                 } else {
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 53e40d3f3933..d8c4e38fb5f4 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -962,7 +962,7 @@ static void migrate_page_add(struct page *page, struct list_head *pagelist,
>         if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(page) == 1) {
>                 if (!isolate_lru_page(page)) {
>                         list_add_tail(&page->lru, pagelist);
> -                       inc_zone_page_state(page, NR_ISOLATED_ANON +
> +                       inc_node_page_state(page, NR_ISOLATED_ANON +
>                                             page_is_file_cache(page));
>                 }
>         }
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 2232f6923cc7..3033dae33a0a 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -168,7 +168,7 @@ void putback_movable_pages(struct list_head *l)
>                         continue;
>                 }
>                 list_del(&page->lru);
> -               dec_zone_page_state(page, NR_ISOLATED_ANON +
> +               dec_node_page_state(page, NR_ISOLATED_ANON +
>                                 page_is_file_cache(page));
>                 /*
>                  * We isolated non-lru movable page so here we can use
> @@ -1119,7 +1119,7 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
>                  * restored.
>                  */
>                 list_del(&page->lru);
> -               dec_zone_page_state(page, NR_ISOLATED_ANON +
> +               dec_node_page_state(page, NR_ISOLATED_ANON +
>                                 page_is_file_cache(page));
>         }
>
> @@ -1460,7 +1460,7 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
>                 err = isolate_lru_page(page);
>                 if (!err) {
>                         list_add_tail(&page->lru, &pagelist);
> -                       inc_zone_page_state(page, NR_ISOLATED_ANON +
> +                       inc_node_page_state(page, NR_ISOLATED_ANON +
>                                             page_is_file_cache(page));
>                 }
>  put_and_set:
> @@ -1726,15 +1726,16 @@ static bool migrate_balanced_pgdat(struct pglist_data *pgdat,
>                                    unsigned long nr_migrate_pages)
>  {
>         int z;
> +
> +       if (!pgdat_reclaimable(pgdat))
> +               return false;
> +
>         for (z = pgdat->nr_zones - 1; z >= 0; z--) {
>                 struct zone *zone = pgdat->node_zones + z;
>
>                 if (!populated_zone(zone))
>                         continue;
>
> -               if (!zone_reclaimable(zone))
> -                       continue;
> -
>                 /* Avoid waking kswapd by allocating pages_to_migrate pages. */
>                 if (!zone_watermark_ok(zone, 0,
>                                        high_wmark_pages(zone) +
> @@ -1828,7 +1829,7 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
>         }
>
>         page_lru = page_is_file_cache(page);
> -       mod_zone_page_state(page_zone(page), NR_ISOLATED_ANON + page_lru,
> +       mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON + page_lru,
>                                 hpage_nr_pages(page));
>
>         /*
> @@ -1886,7 +1887,7 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
>         if (nr_remaining) {
>                 if (!list_empty(&migratepages)) {
>                         list_del(&page->lru);
> -                       dec_zone_page_state(page, NR_ISOLATED_ANON +
> +                       dec_node_page_state(page, NR_ISOLATED_ANON +
>                                         page_is_file_cache(page));
>                         putback_lru_page(page);
>                 }
> @@ -1979,7 +1980,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
>                 /* Retake the callers reference and putback on LRU */
>                 get_page(page);
>                 putback_lru_page(page);
> -               mod_zone_page_state(page_zone(page),
> +               mod_node_page_state(page_pgdat(page),
>                          NR_ISOLATED_ANON + page_lru, -HPAGE_PMD_NR);
>
>                 goto out_unlock;
> @@ -2030,7 +2031,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
>         count_vm_events(PGMIGRATE_SUCCESS, HPAGE_PMD_NR);
>         count_vm_numa_events(NUMA_PAGE_MIGRATE, HPAGE_PMD_NR);
>
> -       mod_zone_page_state(page_zone(page),
> +       mod_node_page_state(page_pgdat(page),
>                         NR_ISOLATED_ANON + page_lru,
>                         -HPAGE_PMD_NR);
>         return isolated;
> diff --git a/mm/mlock.c b/mm/mlock.c
> index 997f63082ff5..14645be06e30 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -103,7 +103,7 @@ static bool __munlock_isolate_lru_page(struct page *page, bool getpage)
>         if (PageLRU(page)) {
>                 struct lruvec *lruvec;
>
> -               lruvec = mem_cgroup_page_lruvec(page, page_zone(page));
> +               lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
>                 if (getpage)
>                         get_page(page);
>                 ClearPageLRU(page);
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index d578d2a56b19..0ada2b2954b0 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -285,8 +285,8 @@ static unsigned long zone_dirtyable_memory(struct zone *zone)
>          */
>         nr_pages -= min(nr_pages, zone->totalreserve_pages);
>
> -       nr_pages += zone_page_state(zone, NR_INACTIVE_FILE);
> -       nr_pages += zone_page_state(zone, NR_ACTIVE_FILE);
> +       nr_pages += node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE);
> +       nr_pages += node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE);
>
>         return nr_pages;
>  }
> @@ -348,8 +348,8 @@ static unsigned long global_dirtyable_memory(void)
>          */
>         x -= min(x, totalreserve_pages);
>
> -       x += global_page_state(NR_INACTIVE_FILE);
> -       x += global_page_state(NR_ACTIVE_FILE);
> +       x += global_node_page_state(NR_INACTIVE_FILE);
> +       x += global_node_page_state(NR_ACTIVE_FILE);
>
>         if (!vm_highmem_is_dirtyable)
>                 x -= highmem_dirtyable_memory(x);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 48b5414009ac..b84b85ae54ff 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1090,9 +1090,9 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>
>         spin_lock(&zone->lock);
>         isolated_pageblocks = has_isolate_pageblock(zone);
> -       nr_scanned = zone_page_state(zone, NR_PAGES_SCANNED);
> +       nr_scanned = node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED);
>         if (nr_scanned)
> -               __mod_zone_page_state(zone, NR_PAGES_SCANNED, -nr_scanned);
> +               __mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);
>
>         while (count) {
>                 struct page *page;
> @@ -1147,9 +1147,9 @@ static void free_one_page(struct zone *zone,
>  {
>         unsigned long nr_scanned;
>         spin_lock(&zone->lock);
> -       nr_scanned = zone_page_state(zone, NR_PAGES_SCANNED);
> +       nr_scanned = node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED);
>         if (nr_scanned)
> -               __mod_zone_page_state(zone, NR_PAGES_SCANNED, -nr_scanned);
> +               __mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);
>
>         if (unlikely(has_isolate_pageblock(zone) ||
>                 is_migrate_isolate(migratetype))) {
> @@ -4331,6 +4331,7 @@ void show_free_areas(unsigned int filter)
>         unsigned long free_pcp = 0;
>         int cpu;
>         struct zone *zone;
> +       pg_data_t *pgdat;
>
>         for_each_populated_zone(zone) {
>                 if (skip_free_areas_node(filter, zone_to_nid(zone)))
> @@ -4349,13 +4350,13 @@ void show_free_areas(unsigned int filter)
>                 " anon_thp: %lu shmem_thp: %lu shmem_pmdmapped: %lu\n"
>  #endif
>                 " free:%lu free_pcp:%lu free_cma:%lu\n",
> -               global_page_state(NR_ACTIVE_ANON),
> -               global_page_state(NR_INACTIVE_ANON),
> -               global_page_state(NR_ISOLATED_ANON),
> -               global_page_state(NR_ACTIVE_FILE),
> -               global_page_state(NR_INACTIVE_FILE),
> -               global_page_state(NR_ISOLATED_FILE),
> -               global_page_state(NR_UNEVICTABLE),
> +               global_node_page_state(NR_ACTIVE_ANON),
> +               global_node_page_state(NR_INACTIVE_ANON),
> +               global_node_page_state(NR_ISOLATED_ANON),
> +               global_node_page_state(NR_ACTIVE_FILE),
> +               global_node_page_state(NR_INACTIVE_FILE),
> +               global_node_page_state(NR_ISOLATED_FILE),
> +               global_node_page_state(NR_UNEVICTABLE),
>                 global_page_state(NR_FILE_DIRTY),
>                 global_page_state(NR_WRITEBACK),
>                 global_page_state(NR_UNSTABLE_NFS),
> @@ -4374,6 +4375,28 @@ void show_free_areas(unsigned int filter)
>                 free_pcp,
>                 global_page_state(NR_FREE_CMA_PAGES));
>
> +       for_each_online_pgdat(pgdat) {
> +               printk("Node %d"
> +                       " active_anon:%lukB"
> +                       " inactive_anon:%lukB"
> +                       " active_file:%lukB"
> +                       " inactive_file:%lukB"
> +                       " unevictable:%lukB"
> +                       " isolated(anon):%lukB"
> +                       " isolated(file):%lukB"
> +                       " all_unreclaimable? %s"
> +                       "\n",
> +                       pgdat->node_id,
> +                       K(node_page_state(pgdat, NR_ACTIVE_ANON)),
> +                       K(node_page_state(pgdat, NR_INACTIVE_ANON)),
> +                       K(node_page_state(pgdat, NR_ACTIVE_FILE)),
> +                       K(node_page_state(pgdat, NR_INACTIVE_FILE)),
> +                       K(node_page_state(pgdat, NR_UNEVICTABLE)),
> +                       K(node_page_state(pgdat, NR_ISOLATED_ANON)),
> +                       K(node_page_state(pgdat, NR_ISOLATED_FILE)),
> +                       !pgdat_reclaimable(pgdat) ? "yes" : "no");
> +       }
> +
>         for_each_populated_zone(zone) {
>                 int i;
>
> @@ -4390,13 +4413,6 @@ void show_free_areas(unsigned int filter)
>                         " min:%lukB"
>                         " low:%lukB"
>                         " high:%lukB"
> -                       " active_anon:%lukB"
> -                       " inactive_anon:%lukB"
> -                       " active_file:%lukB"
> -                       " inactive_file:%lukB"
> -                       " unevictable:%lukB"
> -                       " isolated(anon):%lukB"
> -                       " isolated(file):%lukB"
>                         " present:%lukB"
>                         " managed:%lukB"
>                         " mlocked:%lukB"
> @@ -4419,21 +4435,13 @@ void show_free_areas(unsigned int filter)
>                         " local_pcp:%ukB"
>                         " free_cma:%lukB"
>                         " writeback_tmp:%lukB"
> -                       " pages_scanned:%lu"
> -                       " all_unreclaimable? %s"
> +                       " node_pages_scanned:%lu"
>                         "\n",
>                         zone->name,
>                         K(zone_page_state(zone, NR_FREE_PAGES)),
>                         K(min_wmark_pages(zone)),
>                         K(low_wmark_pages(zone)),
>                         K(high_wmark_pages(zone)),
> -                       K(zone_page_state(zone, NR_ACTIVE_ANON)),
> -                       K(zone_page_state(zone, NR_INACTIVE_ANON)),
> -                       K(zone_page_state(zone, NR_ACTIVE_FILE)),
> -                       K(zone_page_state(zone, NR_INACTIVE_FILE)),
> -                       K(zone_page_state(zone, NR_UNEVICTABLE)),
> -                       K(zone_page_state(zone, NR_ISOLATED_ANON)),
> -                       K(zone_page_state(zone, NR_ISOLATED_FILE)),
>                         K(zone->present_pages),
>                         K(zone->managed_pages),
>                         K(zone_page_state(zone, NR_MLOCK)),
> @@ -4458,9 +4466,7 @@ void show_free_areas(unsigned int filter)
>                         K(this_cpu_read(zone->pageset->pcp.count)),
>                         K(zone_page_state(zone, NR_FREE_CMA_PAGES)),
>                         K(zone_page_state(zone, NR_WRITEBACK_TEMP)),
> -                       K(zone_page_state(zone, NR_PAGES_SCANNED)),
> -                       (!zone_reclaimable(zone) ? "yes" : "no")
> -                       );
> +                       K(node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED)));
>                 printk("lowmem_reserve[]:");
>                 for (i = 0; i < MAX_NR_ZONES; i++)
>                         printk(" %ld", zone->lowmem_reserve[i]);
> @@ -6010,7 +6016,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
>                 /* For bootup, initialized properly in watermark setup */
>                 mod_zone_page_state(zone, NR_ALLOC_BATCH, zone->managed_pages);
>
> -               lruvec_init(&zone->lruvec);
> +               lruvec_init(zone_lruvec(zone));
>                 if (!size)
>                         continue;
>
> diff --git a/mm/swap.c b/mm/swap.c
> index bf37e5cfae81..77af473635fe 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -63,7 +63,7 @@ static void __page_cache_release(struct page *page)
>                 unsigned long flags;
>
>                 spin_lock_irqsave(zone_lru_lock(zone), flags);
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
>                 VM_BUG_ON_PAGE(!PageLRU(page), page);
>                 __ClearPageLRU(page);
>                 del_page_from_lru_list(page, lruvec, page_off_lru(page));
> @@ -194,7 +194,7 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
>                         spin_lock_irqsave(zone_lru_lock(zone), flags);
>                 }
>
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
>                 (*move_fn)(page, lruvec, arg);
>         }
>         if (zone)
> @@ -319,7 +319,7 @@ void activate_page(struct page *page)
>
>         page = compound_head(page);
>         spin_lock_irq(zone_lru_lock(zone));
> -       __activate_page(page, mem_cgroup_page_lruvec(page, zone), NULL);
> +       __activate_page(page, mem_cgroup_page_lruvec(page, zone->zone_pgdat), NULL);
>         spin_unlock_irq(zone_lru_lock(zone));
>  }
>  #endif
> @@ -445,16 +445,16 @@ void lru_cache_add(struct page *page)
>   */
>  void add_page_to_unevictable_list(struct page *page)
>  {
> -       struct zone *zone = page_zone(page);
> +       struct pglist_data *pgdat = page_pgdat(page);
>         struct lruvec *lruvec;
>
> -       spin_lock_irq(zone_lru_lock(zone));
> -       lruvec = mem_cgroup_page_lruvec(page, zone);
> +       spin_lock_irq(&pgdat->lru_lock);
> +       lruvec = mem_cgroup_page_lruvec(page, pgdat);
>         ClearPageActive(page);
>         SetPageUnevictable(page);
>         SetPageLRU(page);
>         add_page_to_lru_list(page, lruvec, LRU_UNEVICTABLE);
> -       spin_unlock_irq(zone_lru_lock(zone));
> +       spin_unlock_irq(&pgdat->lru_lock);
>  }
>
>  /**
> @@ -730,7 +730,7 @@ void release_pages(struct page **pages, int nr, bool cold)
>  {
>         int i;
>         LIST_HEAD(pages_to_free);
> -       struct zone *zone = NULL;
> +       struct pglist_data *locked_pgdat = NULL;
>         struct lruvec *lruvec;
>         unsigned long uninitialized_var(flags);
>         unsigned int uninitialized_var(lock_batch);
> @@ -741,11 +741,11 @@ void release_pages(struct page **pages, int nr, bool cold)
>                 /*
>                  * Make sure the IRQ-safe lock-holding time does not get
>                  * excessive with a continuous string of pages from the
> -                * same zone. The lock is held only if zone != NULL.
> +                * same pgdat. The lock is held only if pgdat != NULL.
>                  */
> -               if (zone && ++lock_batch == SWAP_CLUSTER_MAX) {
> -                       spin_unlock_irqrestore(zone_lru_lock(zone), flags);
> -                       zone = NULL;
> +               if (locked_pgdat && ++lock_batch == SWAP_CLUSTER_MAX) {
> +                       spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
> +                       locked_pgdat = NULL;
>                 }
>
>                 if (is_huge_zero_page(page)) {
> @@ -758,27 +758,27 @@ void release_pages(struct page **pages, int nr, bool cold)
>                         continue;
>
>                 if (PageCompound(page)) {
> -                       if (zone) {
> -                               spin_unlock_irqrestore(zone_lru_lock(zone), flags);
> -                               zone = NULL;
> +                       if (locked_pgdat) {
> +                               spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
> +                               locked_pgdat = NULL;
>                         }
>                         __put_compound_page(page);
>                         continue;
>                 }
>
>                 if (PageLRU(page)) {
> -                       struct zone *pagezone = page_zone(page);
> +                       struct pglist_data *pgdat = page_pgdat(page);
>
> -                       if (pagezone != zone) {
> -                               if (zone)
> -                                       spin_unlock_irqrestore(zone_lru_lock(zone),
> +                       if (pgdat != locked_pgdat) {
> +                               if (locked_pgdat)
> +                                       spin_unlock_irqrestore(&locked_pgdat->lru_lock,
>                                                                         flags);
>                                 lock_batch = 0;
> -                               zone = pagezone;
> -                               spin_lock_irqsave(zone_lru_lock(zone), flags);
> +                               locked_pgdat = pgdat;
> +                               spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
>                         }
>
> -                       lruvec = mem_cgroup_page_lruvec(page, zone);
> +                       lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
>                         VM_BUG_ON_PAGE(!PageLRU(page), page);
>                         __ClearPageLRU(page);
>                         del_page_from_lru_list(page, lruvec, page_off_lru(page));
> @@ -789,8 +789,8 @@ void release_pages(struct page **pages, int nr, bool cold)
>
>                 list_add(&page->lru, &pages_to_free);
>         }
> -       if (zone)
> -               spin_unlock_irqrestore(zone_lru_lock(zone), flags);
> +       if (locked_pgdat)
> +               spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
>
>         mem_cgroup_uncharge_list(&pages_to_free);
>         free_hot_cold_page_list(&pages_to_free, cold);
> @@ -826,7 +826,7 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
>         VM_BUG_ON_PAGE(PageCompound(page_tail), page);
>         VM_BUG_ON_PAGE(PageLRU(page_tail), page);
>         VM_BUG_ON(NR_CPUS != 1 &&
> -                 !spin_is_locked(zone_lru_lock(lruvec_zone(lruvec))));
> +                 !spin_is_locked(&lruvec_pgdat(lruvec)->lru_lock));
>
>         if (!list)
>                 SetPageLRU(page_tail);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index e7ffcd259cc4..86a523a761c9 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -191,26 +191,42 @@ static bool sane_reclaim(struct scan_control *sc)
>  }
>  #endif
>
> +/*
> + * This misses isolated pages which are not accounted for to save counters.
> + * As the data only determines if reclaim or compaction continues, it is
> + * not expected that isolated pages will be a dominating factor.
> + */
>  unsigned long zone_reclaimable_pages(struct zone *zone)
>  {
>         unsigned long nr;
>
> -       nr = zone_page_state_snapshot(zone, NR_ACTIVE_FILE) +
> -            zone_page_state_snapshot(zone, NR_INACTIVE_FILE) +
> -            zone_page_state_snapshot(zone, NR_ISOLATED_FILE);
> +       nr = zone_page_state_snapshot(zone, NR_ZONE_LRU_FILE);
> +       if (get_nr_swap_pages() > 0)
> +               nr += zone_page_state_snapshot(zone, NR_ZONE_LRU_ANON);
> +
> +       return nr;
> +}
> +
> +unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat)
> +{
> +       unsigned long nr;
> +
> +       nr = node_page_state_snapshot(pgdat, NR_ACTIVE_FILE) +
> +            node_page_state_snapshot(pgdat, NR_INACTIVE_FILE) +
> +            node_page_state_snapshot(pgdat, NR_ISOLATED_FILE);
>
>         if (get_nr_swap_pages() > 0)
> -               nr += zone_page_state_snapshot(zone, NR_ACTIVE_ANON) +
> -                     zone_page_state_snapshot(zone, NR_INACTIVE_ANON) +
> -                     zone_page_state_snapshot(zone, NR_ISOLATED_ANON);
> +               nr += node_page_state_snapshot(pgdat, NR_ACTIVE_ANON) +
> +                     node_page_state_snapshot(pgdat, NR_INACTIVE_ANON) +
> +                     node_page_state_snapshot(pgdat, NR_ISOLATED_ANON);
>
>         return nr;
>  }
>
> -bool zone_reclaimable(struct zone *zone)
> +bool pgdat_reclaimable(struct pglist_data *pgdat)
>  {
> -       return zone_page_state_snapshot(zone, NR_PAGES_SCANNED) <
> -               zone_reclaimable_pages(zone) * 6;
> +       return node_page_state_snapshot(pgdat, NR_PAGES_SCANNED) <
> +               pgdat_reclaimable_pages(pgdat) * 6;
>  }
>
>  unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru)
> @@ -218,7 +234,7 @@ unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru)
>         if (!mem_cgroup_disabled())
>                 return mem_cgroup_get_lru_size(lruvec, lru);
>
> -       return zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru);
> +       return node_page_state(lruvec_pgdat(lruvec), NR_LRU_BASE + lru);
>  }
>
>  /*
> @@ -877,7 +893,7 @@ static void page_check_dirty_writeback(struct page *page,
>   * shrink_page_list() returns the number of reclaimed pages
>   */
>  static unsigned long shrink_page_list(struct list_head *page_list,
> -                                     struct zone *zone,
> +                                     struct pglist_data *pgdat,
>                                       struct scan_control *sc,
>                                       enum ttu_flags ttu_flags,
>                                       unsigned long *ret_nr_dirty,
> @@ -917,7 +933,6 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>                         goto keep;
>
>                 VM_BUG_ON_PAGE(PageActive(page), page);
> -               VM_BUG_ON_PAGE(page_zone(page) != zone, page);
>
>                 sc->nr_scanned++;
>
> @@ -996,7 +1011,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>                         /* Case 1 above */
>                         if (current_is_kswapd() &&
>                             PageReclaim(page) &&
> -                           test_bit(ZONE_WRITEBACK, &zone->flags)) {
> +                           test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {
>                                 nr_immediate++;
>                                 goto keep_locked;
>
> @@ -1092,7 +1107,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>                          */
>                         if (page_is_file_cache(page) &&
>                                         (!current_is_kswapd() ||
> -                                        !test_bit(ZONE_DIRTY, &zone->flags))) {
> +                                        !test_bit(PGDAT_DIRTY, &pgdat->flags))) {
>                                 /*
>                                  * Immediately reclaim when written back.
>                                  * Similar in principle to deactivate_page()
> @@ -1266,11 +1281,11 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
>                 }
>         }
>
> -       ret = shrink_page_list(&clean_pages, zone, &sc,
> +       ret = shrink_page_list(&clean_pages, zone->zone_pgdat, &sc,
>                         TTU_UNMAP|TTU_IGNORE_ACCESS,
>                         &dummy1, &dummy2, &dummy3, &dummy4, &dummy5, true);
>         list_splice(&clean_pages, page_list);
> -       mod_zone_page_state(zone, NR_ISOLATED_FILE, -ret);
> +       mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE, -ret);
>         return ret;
>  }
>
> @@ -1375,7 +1390,8 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>  {
>         struct list_head *src = &lruvec->lists[lru];
>         unsigned long nr_taken = 0;
> -       unsigned long scan;
> +       unsigned long nr_zone_taken[MAX_NR_ZONES] = { 0 };
> +       unsigned long scan, nr_pages;
>
>         for (scan = 0; scan < nr_to_scan && nr_taken < nr_to_scan &&
>                                         !list_empty(src); scan++) {
> @@ -1388,7 +1404,9 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>
>                 switch (__isolate_lru_page(page, mode)) {
>                 case 0:
> -                       nr_taken += hpage_nr_pages(page);
> +                       nr_pages = hpage_nr_pages(page);
> +                       nr_taken += nr_pages;
> +                       nr_zone_taken[page_zonenum(page)] += nr_pages;
>                         list_move(&page->lru, dst);
>                         break;
>
> @@ -1405,6 +1423,13 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>         *nr_scanned = scan;
>         trace_mm_vmscan_lru_isolate(sc->order, nr_to_scan, scan,
>                                     nr_taken, mode, is_file_lru(lru));
> +       for (scan = 0; scan < MAX_NR_ZONES; scan++) {
> +               nr_pages = nr_zone_taken[scan];
> +               if (!nr_pages)
> +                       continue;
> +
> +               update_lru_size(lruvec, lru, scan, -nr_pages);
> +       }
>         return nr_taken;
>  }
>
> @@ -1445,7 +1470,7 @@ int isolate_lru_page(struct page *page)
>                 struct lruvec *lruvec;
>
>                 spin_lock_irq(zone_lru_lock(zone));
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
>                 if (PageLRU(page)) {
>                         int lru = page_lru(page);
>                         get_page(page);
> @@ -1465,7 +1490,7 @@ int isolate_lru_page(struct page *page)
>   * the LRU list will go small and be scanned faster than necessary, leading to
>   * unnecessary swapping, thrashing and OOM.
>   */
> -static int too_many_isolated(struct zone *zone, int file,
> +static int too_many_isolated(struct pglist_data *pgdat, int file,
>                 struct scan_control *sc)
>  {
>         unsigned long inactive, isolated;
> @@ -1477,11 +1502,11 @@ static int too_many_isolated(struct zone *zone, int file,
>                 return 0;
>
>         if (file) {
> -               inactive = zone_page_state(zone, NR_INACTIVE_FILE);
> -               isolated = zone_page_state(zone, NR_ISOLATED_FILE);
> +               inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
> +               isolated = node_page_state(pgdat, NR_ISOLATED_FILE);
>         } else {
> -               inactive = zone_page_state(zone, NR_INACTIVE_ANON);
> -               isolated = zone_page_state(zone, NR_ISOLATED_ANON);
> +               inactive = node_page_state(pgdat, NR_INACTIVE_ANON);
> +               isolated = node_page_state(pgdat, NR_ISOLATED_ANON);
>         }
>
>         /*
> @@ -1499,7 +1524,7 @@ static noinline_for_stack void
>  putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
>  {
>         struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> -       struct zone *zone = lruvec_zone(lruvec);
> +       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>         LIST_HEAD(pages_to_free);
>
>         /*
> @@ -1512,13 +1537,13 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
>                 VM_BUG_ON_PAGE(PageLRU(page), page);
>                 list_del(&page->lru);
>                 if (unlikely(!page_evictable(page))) {
> -                       spin_unlock_irq(zone_lru_lock(zone));
> +                       spin_unlock_irq(&pgdat->lru_lock);
>                         putback_lru_page(page);
> -                       spin_lock_irq(zone_lru_lock(zone));
> +                       spin_lock_irq(&pgdat->lru_lock);
>                         continue;
>                 }
>
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, pgdat);
>
>                 SetPageLRU(page);
>                 lru = page_lru(page);
> @@ -1535,10 +1560,10 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
>                         del_page_from_lru_list(page, lruvec, lru);
>
>                         if (unlikely(PageCompound(page))) {
> -                               spin_unlock_irq(zone_lru_lock(zone));
> +                               spin_unlock_irq(&pgdat->lru_lock);
>                                 mem_cgroup_uncharge(page);
>                                 (*get_compound_page_dtor(page))(page);
> -                               spin_lock_irq(zone_lru_lock(zone));
> +                               spin_lock_irq(&pgdat->lru_lock);
>                         } else
>                                 list_add(&page->lru, &pages_to_free);
>                 }
> @@ -1582,10 +1607,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>         unsigned long nr_immediate = 0;
>         isolate_mode_t isolate_mode = 0;
>         int file = is_file_lru(lru);
> -       struct zone *zone = lruvec_zone(lruvec);
> +       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>         struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
>
> -       while (unlikely(too_many_isolated(zone, file, sc))) {
> +       while (unlikely(too_many_isolated(pgdat, file, sc))) {
>                 congestion_wait(BLK_RW_ASYNC, HZ/10);
>
>                 /* We are about to die and free our memory. Return now. */
> @@ -1600,48 +1625,45 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>         if (!sc->may_writepage)
>                 isolate_mode |= ISOLATE_CLEAN;
>
> -       spin_lock_irq(zone_lru_lock(zone));
> +       spin_lock_irq(&pgdat->lru_lock);
>
>         nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
>                                      &nr_scanned, sc, isolate_mode, lru);
>
> -       update_lru_size(lruvec, lru, -nr_taken);
> -       __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
> +       __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
>         reclaim_stat->recent_scanned[file] += nr_taken;
>
>         if (global_reclaim(sc)) {
> -               __mod_zone_page_state(zone, NR_PAGES_SCANNED, nr_scanned);
> +               __mod_node_page_state(pgdat, NR_PAGES_SCANNED, nr_scanned);
>                 if (current_is_kswapd())
> -                       __count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scanned);
> +                       __count_vm_events(PGSCAN_KSWAPD, nr_scanned);
>                 else
> -                       __count_zone_vm_events(PGSCAN_DIRECT, zone, nr_scanned);
> +                       __count_vm_events(PGSCAN_DIRECT, nr_scanned);
>         }
> -       spin_unlock_irq(zone_lru_lock(zone));
> +       spin_unlock_irq(&pgdat->lru_lock);
>
>         if (nr_taken == 0)
>                 return 0;
>
> -       nr_reclaimed = shrink_page_list(&page_list, zone, sc, TTU_UNMAP,
> +       nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, TTU_UNMAP,
>                                 &nr_dirty, &nr_unqueued_dirty, &nr_congested,
>                                 &nr_writeback, &nr_immediate,
>                                 false);
>
> -       spin_lock_irq(zone_lru_lock(zone));
> +       spin_lock_irq(&pgdat->lru_lock);
>
>         if (global_reclaim(sc)) {
>                 if (current_is_kswapd())
> -                       __count_zone_vm_events(PGSTEAL_KSWAPD, zone,
> -                                              nr_reclaimed);
> +                       __count_vm_events(PGSTEAL_KSWAPD, nr_reclaimed);
>                 else
> -                       __count_zone_vm_events(PGSTEAL_DIRECT, zone,
> -                                              nr_reclaimed);
> +                       __count_vm_events(PGSTEAL_DIRECT, nr_reclaimed);
>         }
>
>         putback_inactive_pages(lruvec, &page_list);
>
> -       __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
> +       __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
>
> -       spin_unlock_irq(zone_lru_lock(zone));
> +       spin_unlock_irq(&pgdat->lru_lock);
>
>         mem_cgroup_uncharge_list(&page_list);
>         free_hot_cold_page_list(&page_list, true);
> @@ -1661,7 +1683,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>          * are encountered in the nr_immediate check below.
>          */
>         if (nr_writeback && nr_writeback == nr_taken)
> -               set_bit(ZONE_WRITEBACK, &zone->flags);
> +               set_bit(PGDAT_WRITEBACK, &pgdat->flags);
>
>         /*
>          * Legacy memcg will stall in page writeback so avoid forcibly
> @@ -1673,16 +1695,16 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>                  * backed by a congested BDI and wait_iff_congested will stall.
>                  */
>                 if (nr_dirty && nr_dirty == nr_congested)
> -                       set_bit(ZONE_CONGESTED, &zone->flags);
> +                       set_bit(PGDAT_CONGESTED, &pgdat->flags);
>
>                 /*
>                  * If dirty pages are scanned that are not queued for IO, it
>                  * implies that flushers are not keeping up. In this case, flag
> -                * the zone ZONE_DIRTY and kswapd will start writing pages from
> +                * the pgdat PGDAT_DIRTY and kswapd will start writing pages from
>                  * reclaim context.
>                  */
>                 if (nr_unqueued_dirty == nr_taken)
> -                       set_bit(ZONE_DIRTY, &zone->flags);
> +                       set_bit(PGDAT_DIRTY, &pgdat->flags);
>
>                 /*
>                  * If kswapd scans pages marked for immediate
> @@ -1701,9 +1723,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>          */
>         if (!sc->hibernation_mode && !current_is_kswapd() &&
>             current_may_throttle())
> -               wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
> +               wait_iff_congested(pgdat, BLK_RW_ASYNC, HZ/10);
>
> -       trace_mm_vmscan_lru_shrink_inactive(zone, nr_scanned, nr_reclaimed,
> +       trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
> +                       nr_scanned, nr_reclaimed,
>                         sc->priority, file);
>         return nr_reclaimed;
>  }
> @@ -1731,20 +1754,20 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
>                                      struct list_head *pages_to_free,
>                                      enum lru_list lru)
>  {
> -       struct zone *zone = lruvec_zone(lruvec);
> +       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>         unsigned long pgmoved = 0;
>         struct page *page;
>         int nr_pages;
>
>         while (!list_empty(list)) {
>                 page = lru_to_page(list);
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, pgdat);
>
>                 VM_BUG_ON_PAGE(PageLRU(page), page);
>                 SetPageLRU(page);
>
>                 nr_pages = hpage_nr_pages(page);
> -               update_lru_size(lruvec, lru, nr_pages);
> +               update_lru_size(lruvec, lru, page_zonenum(page), nr_pages);
>                 list_move(&page->lru, &lruvec->lists[lru]);
>                 pgmoved += nr_pages;
>
> @@ -1754,10 +1777,10 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
>                         del_page_from_lru_list(page, lruvec, lru);
>
>                         if (unlikely(PageCompound(page))) {
> -                               spin_unlock_irq(zone_lru_lock(zone));
> +                               spin_unlock_irq(&pgdat->lru_lock);
>                                 mem_cgroup_uncharge(page);
>                                 (*get_compound_page_dtor(page))(page);
> -                               spin_lock_irq(zone_lru_lock(zone));
> +                               spin_lock_irq(&pgdat->lru_lock);
>                         } else
>                                 list_add(&page->lru, pages_to_free);
>                 }
> @@ -1783,7 +1806,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
>         unsigned long nr_rotated = 0;
>         isolate_mode_t isolate_mode = 0;
>         int file = is_file_lru(lru);
> -       struct zone *zone = lruvec_zone(lruvec);
> +       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>
>         lru_add_drain();
>
> @@ -1792,20 +1815,19 @@ static void shrink_active_list(unsigned long nr_to_scan,
>         if (!sc->may_writepage)
>                 isolate_mode |= ISOLATE_CLEAN;
>
> -       spin_lock_irq(zone_lru_lock(zone));
> +       spin_lock_irq(&pgdat->lru_lock);
>
>         nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
>                                      &nr_scanned, sc, isolate_mode, lru);
>
> -       update_lru_size(lruvec, lru, -nr_taken);
> -       __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
> +       __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
>         reclaim_stat->recent_scanned[file] += nr_taken;
>
>         if (global_reclaim(sc))
> -               __mod_zone_page_state(zone, NR_PAGES_SCANNED, nr_scanned);
> -       __count_zone_vm_events(PGREFILL, zone, nr_scanned);
> +               __mod_node_page_state(pgdat, NR_PAGES_SCANNED, nr_scanned);
> +       __count_vm_events(PGREFILL, nr_scanned);
>
> -       spin_unlock_irq(zone_lru_lock(zone));
> +       spin_unlock_irq(&pgdat->lru_lock);
>
>         while (!list_empty(&l_hold)) {
>                 cond_resched();
> @@ -1850,7 +1872,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
>         /*
>          * Move pages back to the lru list.
>          */
> -       spin_lock_irq(zone_lru_lock(zone));
> +       spin_lock_irq(&pgdat->lru_lock);
>         /*
>          * Count referenced pages from currently used mappings as rotated,
>          * even though only some of them are actually re-activated.  This
> @@ -1861,8 +1883,8 @@ static void shrink_active_list(unsigned long nr_to_scan,
>
>         move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
>         move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
> -       __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
> -       spin_unlock_irq(zone_lru_lock(zone));
> +       __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
> +       spin_unlock_irq(&pgdat->lru_lock);
>
>         mem_cgroup_uncharge_list(&l_hold);
>         free_hot_cold_page_list(&l_hold, true);
> @@ -1956,7 +1978,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>         struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
>         u64 fraction[2];
>         u64 denominator = 0;    /* gcc */
> -       struct zone *zone = lruvec_zone(lruvec);
> +       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>         unsigned long anon_prio, file_prio;
>         enum scan_balance scan_balance;
>         unsigned long anon, file;
> @@ -1977,7 +1999,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>          * well.
>          */
>         if (current_is_kswapd()) {
> -               if (!zone_reclaimable(zone))
> +               if (!pgdat_reclaimable(pgdat))
>                         force_scan = true;
>                 if (!mem_cgroup_online(memcg))
>                         force_scan = true;
> @@ -2023,14 +2045,24 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>          * anon pages.  Try to detect this based on file LRU size.
>          */
>         if (global_reclaim(sc)) {
> -               unsigned long zonefile;
> -               unsigned long zonefree;
> +               unsigned long pgdatfile;
> +               unsigned long pgdatfree;
> +               int z;
> +               unsigned long total_high_wmark = 0;
>
> -               zonefree = zone_page_state(zone, NR_FREE_PAGES);
> -               zonefile = zone_page_state(zone, NR_ACTIVE_FILE) +
> -                          zone_page_state(zone, NR_INACTIVE_FILE);
> +               pgdatfree = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
> +               pgdatfile = node_page_state(pgdat, NR_ACTIVE_FILE) +
> +                          node_page_state(pgdat, NR_INACTIVE_FILE);
> +
> +               for (z = 0; z < MAX_NR_ZONES; z++) {
> +                       struct zone *zone = &pgdat->node_zones[z];
> +                       if (!populated_zone(zone))
> +                               continue;
> +
> +                       total_high_wmark += high_wmark_pages(zone);
> +               }
>
> -               if (unlikely(zonefile + zonefree <= high_wmark_pages(zone))) {
> +               if (unlikely(pgdatfile + pgdatfree <= total_high_wmark)) {
>                         scan_balance = SCAN_ANON;
>                         goto out;
>                 }
> @@ -2077,7 +2109,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>         file  = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE) +
>                 lruvec_lru_size(lruvec, LRU_INACTIVE_FILE);
>
> -       spin_lock_irq(zone_lru_lock(zone));
> +       spin_lock_irq(&pgdat->lru_lock);
>         if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
>                 reclaim_stat->recent_scanned[0] /= 2;
>                 reclaim_stat->recent_rotated[0] /= 2;
> @@ -2098,7 +2130,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>
>         fp = file_prio * (reclaim_stat->recent_scanned[1] + 1);
>         fp /= reclaim_stat->recent_rotated[1] + 1;
> -       spin_unlock_irq(zone_lru_lock(zone));
> +       spin_unlock_irq(&pgdat->lru_lock);
>
>         fraction[0] = ap;
>         fraction[1] = fp;
> @@ -2352,9 +2384,9 @@ static inline bool should_continue_reclaim(struct zone *zone,
>          * inactive lists are large enough, continue reclaiming
>          */
>         pages_for_compaction = (2UL << sc->order);
> -       inactive_lru_pages = zone_page_state(zone, NR_INACTIVE_FILE);
> +       inactive_lru_pages = node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE);
>         if (get_nr_swap_pages() > 0)
> -               inactive_lru_pages += zone_page_state(zone, NR_INACTIVE_ANON);
> +               inactive_lru_pages += node_page_state(zone->zone_pgdat, NR_INACTIVE_ANON);
>         if (sc->nr_reclaimed < pages_for_compaction &&
>                         inactive_lru_pages > pages_for_compaction)
>                 return true;
> @@ -2554,7 +2586,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>                                 continue;
>
>                         if (sc->priority != DEF_PRIORITY &&
> -                           !zone_reclaimable(zone))
> +                           !pgdat_reclaimable(zone->zone_pgdat))
>                                 continue;       /* Let kswapd poll it */
>
>                         /*
> @@ -2692,7 +2724,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
>         for (i = 0; i <= ZONE_NORMAL; i++) {
>                 zone = &pgdat->node_zones[i];
>                 if (!populated_zone(zone) ||
> -                   zone_reclaimable_pages(zone) == 0)
> +                   pgdat_reclaimable_pages(pgdat) == 0)
>                         continue;
>
>                 pfmemalloc_reserve += min_wmark_pages(zone);
> @@ -3000,7 +3032,7 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
>                  * DEF_PRIORITY. Effectively, it considers them balanced so
>                  * they must be considered balanced here as well!
>                  */
> -               if (!zone_reclaimable(zone)) {
> +               if (!pgdat_reclaimable(zone->zone_pgdat)) {
>                         balanced_pages += zone->managed_pages;
>                         continue;
>                 }
> @@ -3063,6 +3095,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
>  {
>         unsigned long balance_gap;
>         bool lowmem_pressure;
> +       struct pglist_data *pgdat = zone->zone_pgdat;
>
>         /* Reclaim above the high watermark. */
>         sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
> @@ -3087,7 +3120,8 @@ static bool kswapd_shrink_zone(struct zone *zone,
>
>         shrink_zone(zone, sc, zone_idx(zone) == classzone_idx);
>
> -       clear_bit(ZONE_WRITEBACK, &zone->flags);
> +       /* TODO: ANOMALY */
> +       clear_bit(PGDAT_WRITEBACK, &pgdat->flags);
>
>         /*
>          * If a zone reaches its high watermark, consider it to be no longer
> @@ -3095,10 +3129,10 @@ static bool kswapd_shrink_zone(struct zone *zone,
>          * BDIs but as pressure is relieved, speculatively avoid congestion
>          * waits.
>          */
> -       if (zone_reclaimable(zone) &&
> +       if (pgdat_reclaimable(zone->zone_pgdat) &&
>             zone_balanced(zone, sc->order, false, 0, classzone_idx)) {
> -               clear_bit(ZONE_CONGESTED, &zone->flags);
> -               clear_bit(ZONE_DIRTY, &zone->flags);
> +               clear_bit(PGDAT_CONGESTED, &pgdat->flags);
> +               clear_bit(PGDAT_DIRTY, &pgdat->flags);
>         }
>
>         return sc->nr_scanned >= sc->nr_to_reclaim;
> @@ -3157,7 +3191,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>                                 continue;
>
>                         if (sc.priority != DEF_PRIORITY &&
> -                           !zone_reclaimable(zone))
> +                           !pgdat_reclaimable(zone->zone_pgdat))
>                                 continue;
>
>                         /*
> @@ -3184,9 +3218,11 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>                                 /*
>                                  * If balanced, clear the dirty and congested
>                                  * flags
> +                                *
> +                                * TODO: ANOMALY
>                                  */
> -                               clear_bit(ZONE_CONGESTED, &zone->flags);
> -                               clear_bit(ZONE_DIRTY, &zone->flags);
> +                               clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags);
> +                               clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags);
>                         }
>                 }
>
> @@ -3216,7 +3252,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>                                 continue;
>
>                         if (sc.priority != DEF_PRIORITY &&
> -                           !zone_reclaimable(zone))
> +                           !pgdat_reclaimable(zone->zone_pgdat))
>                                 continue;
>
>                         sc.nr_scanned = 0;
> @@ -3612,8 +3648,8 @@ int sysctl_min_slab_ratio = 5;
>  static inline unsigned long zone_unmapped_file_pages(struct zone *zone)
>  {
>         unsigned long file_mapped = zone_page_state(zone, NR_FILE_MAPPED);
> -       unsigned long file_lru = zone_page_state(zone, NR_INACTIVE_FILE) +
> -               zone_page_state(zone, NR_ACTIVE_FILE);
> +       unsigned long file_lru = node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE) +
> +               node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE);
>
>         /*
>          * It's possible for there to be more file mapped pages than
> @@ -3716,7 +3752,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
>             zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
>                 return ZONE_RECLAIM_FULL;
>
> -       if (!zone_reclaimable(zone))
> +       if (!pgdat_reclaimable(zone->zone_pgdat))
>                 return ZONE_RECLAIM_FULL;
>
>         /*
> @@ -3795,7 +3831,7 @@ void check_move_unevictable_pages(struct page **pages, int nr_pages)
>                         zone = pagezone;
>                         spin_lock_irq(zone_lru_lock(zone));
>                 }
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
>
>                 if (!PageLRU(page) || !PageUnevictable(page))
>                         continue;
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 3345d396a99b..de0c17076270 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -936,11 +936,8 @@ const char * const vmstat_text[] = {
>         /* enum zone_stat_item countes */
>         "nr_free_pages",
>         "nr_alloc_batch",
> -       "nr_inactive_anon",
> -       "nr_active_anon",
> -       "nr_inactive_file",
> -       "nr_active_file",
> -       "nr_unevictable",
> +       "nr_zone_anon_lru",
> +       "nr_zone_file_lru",
>         "nr_mlock",
>         "nr_anon_pages",
>         "nr_mapped",
> @@ -956,12 +953,9 @@ const char * const vmstat_text[] = {
>         "nr_vmscan_write",
>         "nr_vmscan_immediate_reclaim",
>         "nr_writeback_temp",
> -       "nr_isolated_anon",
> -       "nr_isolated_file",
>         "nr_shmem",
>         "nr_dirtied",
>         "nr_written",
> -       "nr_pages_scanned",
>  #if IS_ENABLED(CONFIG_ZSMALLOC)
>         "nr_zspages",
>  #endif
> @@ -981,6 +975,16 @@ const char * const vmstat_text[] = {
>         "nr_shmem_pmdmapped",
>         "nr_free_cma",
>
> +       /* Node-based counters */
> +       "nr_inactive_anon",
> +       "nr_active_anon",
> +       "nr_inactive_file",
> +       "nr_active_file",
> +       "nr_unevictable",
> +       "nr_isolated_anon",
> +       "nr_isolated_file",
> +       "nr_pages_scanned",
> +
>         /* enum writeback_stat_item counters */
>         "nr_dirty_threshold",
>         "nr_dirty_background_threshold",
> @@ -1002,11 +1006,11 @@ const char * const vmstat_text[] = {
>         "pgmajfault",
>         "pglazyfreed",
>
> -       TEXTS_FOR_ZONES("pgrefill")
> -       TEXTS_FOR_ZONES("pgsteal_kswapd")
> -       TEXTS_FOR_ZONES("pgsteal_direct")
> -       TEXTS_FOR_ZONES("pgscan_kswapd")
> -       TEXTS_FOR_ZONES("pgscan_direct")
> +       "pgrefill",
> +       "pgsteal_kswapd",
> +       "pgsteal_direct",
> +       "pgscan_kswapd",
> +       "pgscan_direct",
>         "pgscan_direct_throttle",
>
>  #ifdef CONFIG_NUMA
> @@ -1434,7 +1438,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>                    "\n        min      %lu"
>                    "\n        low      %lu"
>                    "\n        high     %lu"
> -                  "\n        scanned  %lu"
> +                  "\n   node_scanned  %lu"
>                    "\n        spanned  %lu"
>                    "\n        present  %lu"
>                    "\n        managed  %lu",
> @@ -1442,13 +1446,13 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>                    min_wmark_pages(zone),
>                    low_wmark_pages(zone),
>                    high_wmark_pages(zone),
> -                  zone_page_state(zone, NR_PAGES_SCANNED),
> +                  node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED),
>                    zone->spanned_pages,
>                    zone->present_pages,
>                    zone->managed_pages);
>
>         for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
> -               seq_printf(m, "\n    %-12s %lu", vmstat_text[i],
> +               seq_printf(m, "\n      %-12s %lu", vmstat_text[i],
>                                 zone_page_state(zone, i));
>
>         seq_printf(m,
> @@ -1478,12 +1482,12 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>  #endif
>         }
>         seq_printf(m,
> -                  "\n  all_unreclaimable: %u"
> -                  "\n  start_pfn:         %lu"
> -                  "\n  inactive_ratio:    %u",
> -                  !zone_reclaimable(zone),
> +                  "\n  node_unreclaimable:  %u"
> +                  "\n  start_pfn:           %lu"
> +                  "\n  node_inactive_ratio: %u",
> +                  !pgdat_reclaimable(zone->zone_pgdat),
>                    zone->zone_start_pfn,
> -                  zone->inactive_ratio);
> +                  zone->zone_pgdat->inactive_ratio);
>         seq_putc(m, '\n');
>  }
>
> @@ -1574,7 +1578,6 @@ static int vmstat_show(struct seq_file *m, void *arg)
>  {
>         unsigned long *l = arg;
>         unsigned long off = l - (unsigned long *)m->private;
> -
>         seq_printf(m, "%s %lu\n", vmstat_text[off], *l);
>         return 0;
>  }
> diff --git a/mm/workingset.c b/mm/workingset.c
> index ba972ac2dfdd..ebe14445809a 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -355,8 +355,8 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
>                 pages = mem_cgroup_node_nr_lru_pages(sc->memcg, sc->nid,
>                                                      LRU_ALL_FILE);
>         } else {
> -               pages = sum_zone_node_page_state(sc->nid, NR_ACTIVE_FILE) +
> -                       sum_zone_node_page_state(sc->nid, NR_INACTIVE_FILE);
> +               pages = node_page_state(NODE_DATA(sc->nid), NR_ACTIVE_FILE) +
> +                       node_page_state(NODE_DATA(sc->nid), NR_INACTIVE_FILE);
>         }
>
>         /*
> --
> 2.6.4
>

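As a purely illustrative aside, and not part of the patch under review: below is a
minimal sketch of a defensive variant of the new node_page_state_snapshot() helper,
assuming the problem is only that pgdat->per_cpu_nodestats has not been allocated
yet when show_mem() runs this early in boot. The guard is hypothetical; allocating
the per-cpu nodestats earlier, or simply not taking the snapshot before
setup_per_cpu_pageset(), would be alternatives.

/* Sketch only -- builds on the definitions introduced by this patch. */
static inline unsigned long node_page_state_snapshot(pg_data_t *pgdat,
					enum node_stat_item item)
{
	long x = atomic_long_read(&pgdat->vm_stat[item]);

#ifdef CONFIG_SMP
	int cpu;

	/*
	 * Hypothetical guard: with a NULL per_cpu_nodestats pointer,
	 * per_cpu_ptr(NULL, cpu) typically evaluates to
	 * __per_cpu_offset[cpu] plus the member offset, which would be
	 * consistent with the fault address seen here.
	 */
	if (pgdat->per_cpu_nodestats) {
		for_each_online_cpu(cpu)
			x += per_cpu_ptr(pgdat->per_cpu_nodestats,
					 cpu)->vm_node_stat_diff[item];
	}

	if (x < 0)
		x = 0;
#endif
	return x;
}

Either way this only papers over the init-ordering issue; it is meant to show where
the bad dereference happens, not to be the actual fix.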
On 8 July 2016 at 10:34, Mel Gorman <mgorman@techsingularity.net> wrote:
> This moves the LRU lists from the zone to the node and related data
> such as counters, tracing, congestion tracking and writeback tracking.
> Unfortunately, due to reclaim and compaction retry logic, it is necessary
> to account for the number of LRU pages on both zone and node logic.
> Most reclaim logic is based on the node counters but the retry logic uses
> the zone counters which do not distinguish inactive and active sizes.
> It would be possible to leave the LRU counters on a per-zone basis but
> it's a heavier calculation across multiple cache lines that is much more
> frequent than the retry checks.
>
> Other than the LRU counters, this is mostly a mechanical patch but note
> that it introduces a number of anomalies.  For example, the scans are
> per-zone but using per-node counters.  We also mark a node as congested
> when a zone is congested.  This causes weird problems that are fixed later
> but is easier to review.
>
> In the event that there is excessive overhead on 32-bit systems due to
> the nodes being on LRU then there are two potential solutions
>
> 1. Long-term isolation of highmem pages when reclaim is lowmem
>
>    When pages are skipped, they are immediately added back onto the LRU
>    list. If lowmem reclaim persisted for long periods of time, the same
>    highmem pages get continually scanned. The idea would be that lowmem
>    keeps those pages on a separate list until a reclaim for highmem pages
>    arrives that splices the highmem pages back onto the LRU. It potentially
>    could be implemented similar to the UNEVICTABLE list.
>
>    That would reduce the skip rate with the potential corner case is that
>    highmem pages have to be scanned and reclaimed to free lowmem slab pages.
>
> 2. Linear scan lowmem pages if the initial LRU shrink fails
>
>    This will break LRU ordering but may be preferable and faster during
>    memory pressure than skipping LRU pages.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

This breaks boot on metag architecture:
Oops: err 0007 (Data access general read/write fault) addr 00233008 [#1]

It appears to be in node_page_state_snapshot() (via
pgdat_reclaimable()), and have come via mm_init. Here's the relevant
bit of the backtrace:

    node_page_state_snapshot@0x4009c884(enum node_stat_item item =
???, struct pglist_data * pgdat = ???) + 0x48
    pgdat_reclaimable(struct pglist_data * pgdat = 0x402517a0)
    show_free_areas(unsigned int filter = 0) + 0x2cc
    show_mem(unsigned int filter = 0) + 0x18
    mm_init@0x4025c3d4()
    start_kernel() + 0x204

__per_cpu_offset[0] == 0x233000 (close to bad addr),
pgdat->per_cpu_nodestats = NULL. and setup_per_cpu_pageset()
definitely hasn't been called yet (mm_init is called before
setup_per_cpu_pageset()).

Any ideas what the correct solution is (and why presumably others
haven't seen the same issue on other architectures?).

Thanks
James

> ---
>  arch/tile/mm/pgtable.c                    |   8 +-
>  drivers/base/node.c                       |  19 +--
>  drivers/staging/android/lowmemorykiller.c |   8 +-
>  include/linux/backing-dev.h               |   2 +-
>  include/linux/memcontrol.h                |  18 +--
>  include/linux/mm_inline.h                 |  21 ++-
>  include/linux/mmzone.h                    |  68 +++++----
>  include/linux/swap.h                      |   1 +
>  include/linux/vm_event_item.h             |  10 +-
>  include/linux/vmstat.h                    |  17 +++
>  include/trace/events/vmscan.h             |  12 +-
>  kernel/power/snapshot.c                   |  10 +-
>  mm/backing-dev.c                          |  15 +-
>  mm/compaction.c                           |  18 +--
>  mm/huge_memory.c                          |   2 +-
>  mm/internal.h                             |   2 +-
>  mm/khugepaged.c                           |   4 +-
>  mm/memcontrol.c                           |  17 +--
>  mm/memory-failure.c                       |   4 +-
>  mm/memory_hotplug.c                       |   2 +-
>  mm/mempolicy.c                            |   2 +-
>  mm/migrate.c                              |  21 +--
>  mm/mlock.c                                |   2 +-
>  mm/page-writeback.c                       |   8 +-
>  mm/page_alloc.c                           |  68 +++++----
>  mm/swap.c                                 |  50 +++----
>  mm/vmscan.c                               | 226 +++++++++++++++++-------------
>  mm/vmstat.c                               |  47 ++++---
>  mm/workingset.c                           |   4 +-
>  29 files changed, 386 insertions(+), 300 deletions(-)
>
> diff --git a/arch/tile/mm/pgtable.c b/arch/tile/mm/pgtable.c
> index c4d5bf841a7f..9e389213580d 100644
> --- a/arch/tile/mm/pgtable.c
> +++ b/arch/tile/mm/pgtable.c
> @@ -45,10 +45,10 @@ void show_mem(unsigned int filter)
>         struct zone *zone;
>
>         pr_err("Active:%lu inactive:%lu dirty:%lu writeback:%lu unstable:%lu free:%lu\n slab:%lu mapped:%lu pagetables:%lu bounce:%lu pagecache:%lu swap:%lu\n",
> -              (global_page_state(NR_ACTIVE_ANON) +
> -               global_page_state(NR_ACTIVE_FILE)),
> -              (global_page_state(NR_INACTIVE_ANON) +
> -               global_page_state(NR_INACTIVE_FILE)),
> +              (global_node_page_state(NR_ACTIVE_ANON) +
> +               global_node_page_state(NR_ACTIVE_FILE)),
> +              (global_node_page_state(NR_INACTIVE_ANON) +
> +               global_node_page_state(NR_INACTIVE_FILE)),
>                global_page_state(NR_FILE_DIRTY),
>                global_page_state(NR_WRITEBACK),
>                global_page_state(NR_UNSTABLE_NFS),
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 92d8e090c5b3..b7f01a4a642d 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -56,6 +56,7 @@ static ssize_t node_read_meminfo(struct device *dev,
>  {
>         int n;
>         int nid = dev->id;
> +       struct pglist_data *pgdat = NODE_DATA(nid);
>         struct sysinfo i;
>
>         si_meminfo_node(&i, nid);
> @@ -74,15 +75,15 @@ static ssize_t node_read_meminfo(struct device *dev,
>                        nid, K(i.totalram),
>                        nid, K(i.freeram),
>                        nid, K(i.totalram - i.freeram),
> -                      nid, K(sum_zone_node_page_state(nid, NR_ACTIVE_ANON) +
> -                               sum_zone_node_page_state(nid, NR_ACTIVE_FILE)),
> -                      nid, K(sum_zone_node_page_state(nid, NR_INACTIVE_ANON) +
> -                               sum_zone_node_page_state(nid, NR_INACTIVE_FILE)),
> -                      nid, K(sum_zone_node_page_state(nid, NR_ACTIVE_ANON)),
> -                      nid, K(sum_zone_node_page_state(nid, NR_INACTIVE_ANON)),
> -                      nid, K(sum_zone_node_page_state(nid, NR_ACTIVE_FILE)),
> -                      nid, K(sum_zone_node_page_state(nid, NR_INACTIVE_FILE)),
> -                      nid, K(sum_zone_node_page_state(nid, NR_UNEVICTABLE)),
> +                      nid, K(node_page_state(pgdat, NR_ACTIVE_ANON) +
> +                               node_page_state(pgdat, NR_ACTIVE_FILE)),
> +                      nid, K(node_page_state(pgdat, NR_INACTIVE_ANON) +
> +                               node_page_state(pgdat, NR_INACTIVE_FILE)),
> +                      nid, K(node_page_state(pgdat, NR_ACTIVE_ANON)),
> +                      nid, K(node_page_state(pgdat, NR_INACTIVE_ANON)),
> +                      nid, K(node_page_state(pgdat, NR_ACTIVE_FILE)),
> +                      nid, K(node_page_state(pgdat, NR_INACTIVE_FILE)),
> +                      nid, K(node_page_state(pgdat, NR_UNEVICTABLE)),
>                        nid, K(sum_zone_node_page_state(nid, NR_MLOCK)));
>
>  #ifdef CONFIG_HIGHMEM
> diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
> index 24d2745e9437..93dbcc38eb0f 100644
> --- a/drivers/staging/android/lowmemorykiller.c
> +++ b/drivers/staging/android/lowmemorykiller.c
> @@ -72,10 +72,10 @@ static unsigned long lowmem_deathpending_timeout;
>  static unsigned long lowmem_count(struct shrinker *s,
>                                   struct shrink_control *sc)
>  {
> -       return global_page_state(NR_ACTIVE_ANON) +
> -               global_page_state(NR_ACTIVE_FILE) +
> -               global_page_state(NR_INACTIVE_ANON) +
> -               global_page_state(NR_INACTIVE_FILE);
> +       return global_node_page_state(NR_ACTIVE_ANON) +
> +               global_node_page_state(NR_ACTIVE_FILE) +
> +               global_node_page_state(NR_INACTIVE_ANON) +
> +               global_node_page_state(NR_INACTIVE_FILE);
>  }
>
>  static unsigned long lowmem_scan(struct shrinker *s, struct shrink_control *sc)
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index c82794f20110..491a91717788 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -197,7 +197,7 @@ static inline int wb_congested(struct bdi_writeback *wb, int cong_bits)
>  }
>
>  long congestion_wait(int sync, long timeout);
> -long wait_iff_congested(struct zone *zone, int sync, long timeout);
> +long wait_iff_congested(struct pglist_data *pgdat, int sync, long timeout);
>  int pdflush_proc_obsolete(struct ctl_table *table, int write,
>                 void __user *buffer, size_t *lenp, loff_t *ppos);
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 104efa6874db..68f1121c8fe7 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -340,7 +340,7 @@ static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
>         struct lruvec *lruvec;
>
>         if (mem_cgroup_disabled()) {
> -               lruvec = &zone->lruvec;
> +               lruvec = zone_lruvec(zone);
>                 goto out;
>         }
>
> @@ -349,15 +349,15 @@ static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
>  out:
>         /*
>          * Since a node can be onlined after the mem_cgroup was created,
> -        * we have to be prepared to initialize lruvec->zone here;
> +        * we have to be prepared to initialize lruvec->pgdat here;
>          * and if offlined then reonlined, we need to reinitialize it.
>          */
> -       if (unlikely(lruvec->zone != zone))
> -               lruvec->zone = zone;
> +       if (unlikely(lruvec->pgdat != zone->zone_pgdat))
> +               lruvec->pgdat = zone->zone_pgdat;
>         return lruvec;
>  }
>
> -struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
> +struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *);
>
>  bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);
>  struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
> @@ -438,7 +438,7 @@ static inline bool mem_cgroup_online(struct mem_cgroup *memcg)
>  int mem_cgroup_select_victim_node(struct mem_cgroup *memcg);
>
>  void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
> -               int nr_pages);
> +               enum zone_type zid, int nr_pages);
>
>  unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
>                                            int nid, unsigned int lru_mask);
> @@ -613,13 +613,13 @@ static inline void mem_cgroup_migrate(struct page *old, struct page *new)
>  static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
>                                                     struct mem_cgroup *memcg)
>  {
> -       return &zone->lruvec;
> +       return zone_lruvec(zone);
>  }
>
>  static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
> -                                                   struct zone *zone)
> +                                                   struct pglist_data *pgdat)
>  {
> -       return &zone->lruvec;
> +       return &pgdat->lruvec;
>  }
>
>  static inline bool mm_match_cgroup(struct mm_struct *mm,
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index 5bd29ba4f174..9aadcc781857 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -23,25 +23,32 @@ static inline int page_is_file_cache(struct page *page)
>  }
>
>  static __always_inline void __update_lru_size(struct lruvec *lruvec,
> -                               enum lru_list lru, int nr_pages)
> +                               enum lru_list lru, enum zone_type zid,
> +                               int nr_pages)
>  {
> -       __mod_zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru, nr_pages);
> +       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> +
> +       __mod_node_page_state(pgdat, NR_LRU_BASE + lru, nr_pages);
> +       __mod_zone_page_state(&pgdat->node_zones[zid],
> +               NR_ZONE_LRU_BASE + !!is_file_lru(lru),
> +               nr_pages);
>  }
>
>  static __always_inline void update_lru_size(struct lruvec *lruvec,
> -                               enum lru_list lru, int nr_pages)
> +                               enum lru_list lru, enum zone_type zid,
> +                               int nr_pages)
>  {
>  #ifdef CONFIG_MEMCG
> -       mem_cgroup_update_lru_size(lruvec, lru, nr_pages);
> +       mem_cgroup_update_lru_size(lruvec, lru, zid, nr_pages);
>  #else
> -       __update_lru_size(lruvec, lru, nr_pages);
> +       __update_lru_size(lruvec, lru, zid, nr_pages);
>  #endif
>  }
>
>  static __always_inline void add_page_to_lru_list(struct page *page,
>                                 struct lruvec *lruvec, enum lru_list lru)
>  {
> -       update_lru_size(lruvec, lru, hpage_nr_pages(page));
> +       update_lru_size(lruvec, lru, page_zonenum(page), hpage_nr_pages(page));
>         list_add(&page->lru, &lruvec->lists[lru]);
>  }
>
> @@ -49,7 +56,7 @@ static __always_inline void del_page_from_lru_list(struct page *page,
>                                 struct lruvec *lruvec, enum lru_list lru)
>  {
>         list_del(&page->lru);
> -       update_lru_size(lruvec, lru, -hpage_nr_pages(page));
> +       update_lru_size(lruvec, lru, page_zonenum(page), -hpage_nr_pages(page));
>  }
>
>  /**
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index cfa870107abe..d4f5cac0a8c3 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -111,12 +111,9 @@ enum zone_stat_item {
>         /* First 128 byte cacheline (assuming 64 bit words) */
>         NR_FREE_PAGES,
>         NR_ALLOC_BATCH,
> -       NR_LRU_BASE,
> -       NR_INACTIVE_ANON = NR_LRU_BASE, /* must match order of LRU_[IN]ACTIVE */
> -       NR_ACTIVE_ANON,         /*  "     "     "   "       "         */
> -       NR_INACTIVE_FILE,       /*  "     "     "   "       "         */
> -       NR_ACTIVE_FILE,         /*  "     "     "   "       "         */
> -       NR_UNEVICTABLE,         /*  "     "     "   "       "         */
> +       NR_ZONE_LRU_BASE, /* Used only for compaction and reclaim retry */
> +       NR_ZONE_LRU_ANON = NR_ZONE_LRU_BASE,
> +       NR_ZONE_LRU_FILE,
>         NR_MLOCK,               /* mlock()ed pages found and moved off LRU */
>         NR_ANON_PAGES,  /* Mapped anonymous pages */
>         NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
> @@ -134,12 +131,9 @@ enum zone_stat_item {
>         NR_VMSCAN_WRITE,
>         NR_VMSCAN_IMMEDIATE,    /* Prioritise for reclaim when writeback ends */
>         NR_WRITEBACK_TEMP,      /* Writeback using temporary buffers */
> -       NR_ISOLATED_ANON,       /* Temporary isolated pages from anon lru */
> -       NR_ISOLATED_FILE,       /* Temporary isolated pages from file lru */
>         NR_SHMEM,               /* shmem pages (included tmpfs/GEM pages) */
>         NR_DIRTIED,             /* page dirtyings since bootup */
>         NR_WRITTEN,             /* page writings since bootup */
> -       NR_PAGES_SCANNED,       /* pages scanned since last reclaim */
>  #if IS_ENABLED(CONFIG_ZSMALLOC)
>         NR_ZSPAGES,             /* allocated in zsmalloc */
>  #endif
> @@ -161,6 +155,15 @@ enum zone_stat_item {
>         NR_VM_ZONE_STAT_ITEMS };
>
>  enum node_stat_item {
> +       NR_LRU_BASE,
> +       NR_INACTIVE_ANON = NR_LRU_BASE, /* must match order of LRU_[IN]ACTIVE */
> +       NR_ACTIVE_ANON,         /*  "     "     "   "       "         */
> +       NR_INACTIVE_FILE,       /*  "     "     "   "       "         */
> +       NR_ACTIVE_FILE,         /*  "     "     "   "       "         */
> +       NR_UNEVICTABLE,         /*  "     "     "   "       "         */
> +       NR_ISOLATED_ANON,       /* Temporary isolated pages from anon lru */
> +       NR_ISOLATED_FILE,       /* Temporary isolated pages from file lru */
> +       NR_PAGES_SCANNED,       /* pages scanned since last reclaim */
>         NR_VM_NODE_STAT_ITEMS
>  };
>
> @@ -219,7 +222,7 @@ struct lruvec {
>         /* Evictions & activations on the inactive file list */
>         atomic_long_t                   inactive_age;
>  #ifdef CONFIG_MEMCG
> -       struct zone                     *zone;
> +       struct pglist_data *pgdat;
>  #endif
>  };
>
> @@ -357,13 +360,6 @@ struct zone {
>  #ifdef CONFIG_NUMA
>         int node;
>  #endif
> -
> -       /*
> -        * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
> -        * this zone's LRU.  Maintained by the pageout code.
> -        */
> -       unsigned int inactive_ratio;
> -
>         struct pglist_data      *zone_pgdat;
>         struct per_cpu_pageset __percpu *pageset;
>
> @@ -495,9 +491,6 @@ struct zone {
>
>         /* Write-intensive fields used by page reclaim */
>
> -       /* Fields commonly accessed by the page reclaim scanner */
> -       struct lruvec           lruvec;
> -
>         /*
>          * When free pages are below this point, additional steps are taken
>          * when reading the number of free pages to avoid per-cpu counter
> @@ -537,17 +530,20 @@ struct zone {
>
>  enum zone_flags {
>         ZONE_RECLAIM_LOCKED,            /* prevents concurrent reclaim */
> -       ZONE_CONGESTED,                 /* zone has many dirty pages backed by
> +       ZONE_FAIR_DEPLETED,             /* fair zone policy batch depleted */
> +};
> +
> +enum pgdat_flags {
> +       PGDAT_CONGESTED,                /* pgdat has many dirty pages backed by
>                                          * a congested BDI
>                                          */
> -       ZONE_DIRTY,                     /* reclaim scanning has recently found
> +       PGDAT_DIRTY,                    /* reclaim scanning has recently found
>                                          * many dirty file pages at the tail
>                                          * of the LRU.
>                                          */
> -       ZONE_WRITEBACK,                 /* reclaim scanning has recently found
> +       PGDAT_WRITEBACK,                /* reclaim scanning has recently found
>                                          * many pages under writeback
>                                          */
> -       ZONE_FAIR_DEPLETED,             /* fair zone policy batch depleted */
>  };
>
>  static inline unsigned long zone_end_pfn(const struct zone *zone)
> @@ -707,6 +703,19 @@ typedef struct pglist_data {
>         unsigned long split_queue_len;
>  #endif
>
> +       /* Fields commonly accessed by the page reclaim scanner */
> +       struct lruvec           lruvec;
> +
> +       /*
> +        * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
> +        * this node's LRU.  Maintained by the pageout code.
> +        */
> +       unsigned int inactive_ratio;
> +
> +       unsigned long           flags;
> +
> +       ZONE_PADDING(_pad2_)
> +
>         /* Per-node vmstats */
>         struct per_cpu_nodestat __percpu *per_cpu_nodestats;
>         atomic_long_t           vm_stat[NR_VM_NODE_STAT_ITEMS];
> @@ -728,6 +737,11 @@ static inline spinlock_t *zone_lru_lock(struct zone *zone)
>         return &zone->zone_pgdat->lru_lock;
>  }
>
> +static inline struct lruvec *zone_lruvec(struct zone *zone)
> +{
> +       return &zone->zone_pgdat->lruvec;
> +}
> +
>  static inline unsigned long pgdat_end_pfn(pg_data_t *pgdat)
>  {
>         return pgdat->node_start_pfn + pgdat->node_spanned_pages;
> @@ -779,12 +793,12 @@ extern int init_currently_empty_zone(struct zone *zone, unsigned long start_pfn,
>
>  extern void lruvec_init(struct lruvec *lruvec);
>
> -static inline struct zone *lruvec_zone(struct lruvec *lruvec)
> +static inline struct pglist_data *lruvec_pgdat(struct lruvec *lruvec)
>  {
>  #ifdef CONFIG_MEMCG
> -       return lruvec->zone;
> +       return lruvec->pgdat;
>  #else
> -       return container_of(lruvec, struct zone, lruvec);
> +       return container_of(lruvec, struct pglist_data, lruvec);
>  #endif
>  }
>
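
For anyone not fluent in the container_of() idiom, the !CONFIG_MEMCG branch of
lruvec_pgdat() above only works because the lruvec is now embedded directly in
struct pglist_data (see the mmzone.h hunk further up). A minimal userspace
sketch with stand-in struct definitions (not the kernel's) showing how the
embedded member maps back to its owner:

    #include <stddef.h>
    #include <stdio.h>

    /* Simplified form of the kernel macro: same pointer arithmetic. */
    #define container_of(ptr, type, member) \
            ((type *)((char *)(ptr) - offsetof(type, member)))

    struct lruvec { int dummy; };
    struct pglist_data { int node_id; struct lruvec lruvec; };

    static struct pglist_data *lruvec_pgdat(struct lruvec *lruvec)
    {
            return container_of(lruvec, struct pglist_data, lruvec);
    }

    int main(void)
    {
            struct pglist_data node = { .node_id = 3 };

            /* The embedded lruvec maps back to its owning node. */
            printf("node %d\n", lruvec_pgdat(&node.lruvec)->node_id);
            return 0;
    }
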
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 0af2bb2028fd..c82f916008b7 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -317,6 +317,7 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
>
>  /* linux/mm/vmscan.c */
>  extern unsigned long zone_reclaimable_pages(struct zone *zone);
> +extern unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat);
>  extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>                                         gfp_t gfp_mask, nodemask_t *mask);
>  extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 42604173f122..1798ff542517 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -26,11 +26,11 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>                 PGFREE, PGACTIVATE, PGDEACTIVATE,
>                 PGFAULT, PGMAJFAULT,
>                 PGLAZYFREED,
> -               FOR_ALL_ZONES(PGREFILL),
> -               FOR_ALL_ZONES(PGSTEAL_KSWAPD),
> -               FOR_ALL_ZONES(PGSTEAL_DIRECT),
> -               FOR_ALL_ZONES(PGSCAN_KSWAPD),
> -               FOR_ALL_ZONES(PGSCAN_DIRECT),
> +               PGREFILL,
> +               PGSTEAL_KSWAPD,
> +               PGSTEAL_DIRECT,
> +               PGSCAN_KSWAPD,
> +               PGSCAN_DIRECT,
>                 PGSCAN_DIRECT_THROTTLE,
>  #ifdef CONFIG_NUMA
>                 PGSCAN_ZONE_RECLAIM_FAILED,
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index d1744aa3ab9c..fee321c98550 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -178,6 +178,23 @@ static inline unsigned long zone_page_state_snapshot(struct zone *zone,
>         return x;
>  }
>
> +static inline unsigned long node_page_state_snapshot(pg_data_t *pgdat,
> +                                       enum node_stat_item item)
> +{
> +       long x = atomic_long_read(&pgdat->vm_stat[item]);
> +
> +#ifdef CONFIG_SMP
> +       int cpu;
> +       for_each_online_cpu(cpu)
> +               x += per_cpu_ptr(pgdat->per_cpu_nodestats, cpu)->vm_node_stat_diff[item];
> +
> +       if (x < 0)
> +               x = 0;
> +#endif
> +       return x;
> +}
> +
> +
>  #ifdef CONFIG_NUMA
>  extern unsigned long sum_zone_node_page_state(int node,
>                                                 enum zone_stat_item item);
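
The new node_page_state_snapshot() above is the node-level twin of
zone_page_state_snapshot(): the global atomic counter plus whatever per-CPU
deltas have not been folded back yet, clamped at zero. Roughly this, as a
self-contained userspace sketch (the counter values and CPU count are made
up):

    #include <stdio.h>

    #define NR_CPUS 4

    static long node_counter = 100;              /* atomic_long in the kernel */
    static signed char per_cpu_diff[NR_CPUS] = { 3, -5, 2, -7 };

    static unsigned long node_state_snapshot(void)
    {
            long x = node_counter;
            int cpu;

            for (cpu = 0; cpu < NR_CPUS; cpu++)
                    x += per_cpu_diff[cpu];

            /* Deltas can transiently drive the sum negative; report 0 instead. */
            if (x < 0)
                    x = 0;
            return x;
    }

    int main(void)
    {
            printf("%lu\n", node_state_snapshot());      /* prints 93 */
            return 0;
    }
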
> diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
> index 0101ef37f1ee..897f1aa1ee5f 100644
> --- a/include/trace/events/vmscan.h
> +++ b/include/trace/events/vmscan.h
> @@ -352,15 +352,14 @@ TRACE_EVENT(mm_vmscan_writepage,
>
>  TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
>
> -       TP_PROTO(struct zone *zone,
> +       TP_PROTO(int nid,
>                 unsigned long nr_scanned, unsigned long nr_reclaimed,
>                 int priority, int file),
>
> -       TP_ARGS(zone, nr_scanned, nr_reclaimed, priority, file),
> +       TP_ARGS(nid, nr_scanned, nr_reclaimed, priority, file),
>
>         TP_STRUCT__entry(
>                 __field(int, nid)
> -               __field(int, zid)
>                 __field(unsigned long, nr_scanned)
>                 __field(unsigned long, nr_reclaimed)
>                 __field(int, priority)
> @@ -368,16 +367,15 @@ TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
>         ),
>
>         TP_fast_assign(
> -               __entry->nid = zone_to_nid(zone);
> -               __entry->zid = zone_idx(zone);
> +               __entry->nid = nid;
>                 __entry->nr_scanned = nr_scanned;
>                 __entry->nr_reclaimed = nr_reclaimed;
>                 __entry->priority = priority;
>                 __entry->reclaim_flags = trace_shrink_flags(file);
>         ),
>
> -       TP_printk("nid=%d zid=%d nr_scanned=%ld nr_reclaimed=%ld priority=%d flags=%s",
> -               __entry->nid, __entry->zid,
> +       TP_printk("nid=%d nr_scanned=%ld nr_reclaimed=%ld priority=%d flags=%s",
> +               __entry->nid,
>                 __entry->nr_scanned, __entry->nr_reclaimed,
>                 __entry->priority,
>                 show_reclaim_flags(__entry->reclaim_flags))
> diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
> index 3a970604308f..24a06bc23f85 100644
> --- a/kernel/power/snapshot.c
> +++ b/kernel/power/snapshot.c
> @@ -1525,11 +1525,11 @@ static unsigned long minimum_image_size(unsigned long saveable)
>         unsigned long size;
>
>         size = global_page_state(NR_SLAB_RECLAIMABLE)
> -               + global_page_state(NR_ACTIVE_ANON)
> -               + global_page_state(NR_INACTIVE_ANON)
> -               + global_page_state(NR_ACTIVE_FILE)
> -               + global_page_state(NR_INACTIVE_FILE)
> -               - global_page_state(NR_FILE_MAPPED);
> +               + global_node_page_state(NR_ACTIVE_ANON)
> +               + global_node_page_state(NR_INACTIVE_ANON)
> +               + global_node_page_state(NR_ACTIVE_FILE)
> +               + global_node_page_state(NR_INACTIVE_FILE)
> +               - global_node_page_state(NR_FILE_MAPPED);
>
>         return saveable <= size ? 0 : saveable - size;
>  }
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index f53b23ab7ed7..a8c3af46bd3d 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -982,24 +982,24 @@ long congestion_wait(int sync, long timeout)
>  EXPORT_SYMBOL(congestion_wait);
>
>  /**
> - * wait_iff_congested - Conditionally wait for a backing_dev to become uncongested or a zone to complete writes
> - * @zone: A zone to check if it is heavily congested
> + * wait_iff_congested - Conditionally wait for a backing_dev to become uncongested or a pgdat to complete writes
> + * @pgdat: A pgdat to check if it is heavily congested
>   * @sync: SYNC or ASYNC IO
>   * @timeout: timeout in jiffies
>   *
>   * In the event of a congested backing_dev (any backing_dev) and the given
> - * @zone has experienced recent congestion, this waits for up to @timeout
> + * @pgdat has experienced recent congestion, this waits for up to @timeout
>   * jiffies for either a BDI to exit congestion of the given @sync queue
>   * or a write to complete.
>   *
> - * In the absence of zone congestion, cond_resched() is called to yield
> + * In the absence of pgdat congestion, cond_resched() is called to yield
>   * the processor if necessary but otherwise does not sleep.
>   *
>   * The return value is 0 if the sleep is for the full timeout. Otherwise,
>   * it is the number of jiffies that were still remaining when the function
>   * returned. return_value == timeout implies the function did not sleep.
>   */
> -long wait_iff_congested(struct zone *zone, int sync, long timeout)
> +long wait_iff_congested(struct pglist_data *pgdat, int sync, long timeout)
>  {
>         long ret;
>         unsigned long start = jiffies;
> @@ -1008,12 +1008,13 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
>
>         /*
>          * If there is no congestion, or heavy congestion is not being
> -        * encountered in the current zone, yield if necessary instead
> +        * encountered in the current pgdat, yield if necessary instead
>          * of sleeping on the congestion queue
>          */
>         if (atomic_read(&nr_wb_congested[sync]) == 0 ||
> -           !test_bit(ZONE_CONGESTED, &zone->flags)) {
> +           !test_bit(PGDAT_CONGESTED, &pgdat->flags)) {
>                 cond_resched();
> +
>                 /* In case we scheduled, work out time remaining */
>                 ret = timeout - (jiffies - start);
>                 if (ret < 0)
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 7607efb7bee2..a0bd85712516 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -646,8 +646,8 @@ static void acct_isolated(struct zone *zone, struct compact_control *cc)
>         list_for_each_entry(page, &cc->migratepages, lru)
>                 count[!!page_is_file_cache(page)]++;
>
> -       mod_zone_page_state(zone, NR_ISOLATED_ANON, count[0]);
> -       mod_zone_page_state(zone, NR_ISOLATED_FILE, count[1]);
> +       mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_ANON, count[0]);
> +       mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE, count[1]);
>  }
>
>  /* Similar to reclaim, but different enough that they don't share logic */
> @@ -655,12 +655,12 @@ static bool too_many_isolated(struct zone *zone)
>  {
>         unsigned long active, inactive, isolated;
>
> -       inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
> -                                       zone_page_state(zone, NR_INACTIVE_ANON);
> -       active = zone_page_state(zone, NR_ACTIVE_FILE) +
> -                                       zone_page_state(zone, NR_ACTIVE_ANON);
> -       isolated = zone_page_state(zone, NR_ISOLATED_FILE) +
> -                                       zone_page_state(zone, NR_ISOLATED_ANON);
> +       inactive = node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE) +
> +                       node_page_state(zone->zone_pgdat, NR_INACTIVE_ANON);
> +       active = node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE) +
> +                       node_page_state(zone->zone_pgdat, NR_ACTIVE_ANON);
> +       isolated = node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE) +
> +                       node_page_state(zone->zone_pgdat, NR_ISOLATED_ANON);
>
>         return isolated > (inactive + active) / 2;
>  }
> @@ -856,7 +856,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>                         }
>                 }
>
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
>
>                 /* Try isolate the page */
>                 if (__isolate_lru_page(page, isolate_mode) != 0)
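
Worth noting on the too_many_isolated() change above: the threshold itself is
unchanged, compaction still backs off once isolated pages exceed half of the
inactive + active LRU pages, only the counts are now read from the node rather
than the zone. E.g. with 800 inactive and 200 active pages on the node, more
than 500 isolated pages makes it return true.
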
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 2f997328ae64..5d5b2207cfd2 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1830,7 +1830,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>         pgoff_t end = -1;
>         int i;
>
> -       lruvec = mem_cgroup_page_lruvec(head, zone);
> +       lruvec = mem_cgroup_page_lruvec(head, zone->zone_pgdat);
>
>         /* complete memcg works before add pages to LRU */
>         mem_cgroup_split_huge_fixup(head);
> diff --git a/mm/internal.h b/mm/internal.h
> index 9b6a6c43ac39..2f80d0343c56 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -78,7 +78,7 @@ extern unsigned long highest_memmap_pfn;
>   */
>  extern int isolate_lru_page(struct page *page);
>  extern void putback_lru_page(struct page *page);
> -extern bool zone_reclaimable(struct zone *zone);
> +extern bool pgdat_reclaimable(struct pglist_data *pgdat);
>
>  /*
>   * in mm/rmap.c:
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 93d5f87c00d5..d7a49f665f04 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -480,7 +480,7 @@ void __khugepaged_exit(struct mm_struct *mm)
>  static void release_pte_page(struct page *page)
>  {
>         /* 0 stands for page_is_file_cache(page) == false */
> -       dec_zone_page_state(page, NR_ISOLATED_ANON + 0);
> +       dec_node_page_state(page, NR_ISOLATED_ANON + 0);
>         unlock_page(page);
>         putback_lru_page(page);
>  }
> @@ -576,7 +576,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>                         goto out;
>                 }
>                 /* 0 stands for page_is_file_cache(page) == false */
> -               inc_zone_page_state(page, NR_ISOLATED_ANON + 0);
> +               inc_node_page_state(page, NR_ISOLATED_ANON + 0);
>                 VM_BUG_ON_PAGE(!PageLocked(page), page);
>                 VM_BUG_ON_PAGE(PageLRU(page), page);
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 9b70f9ca8ddf..50c86ad121bc 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -943,14 +943,14 @@ static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
>   * and putback protocol: the LRU lock must be held, and the page must
>   * either be PageLRU() or the caller must have isolated/allocated it.
>   */
> -struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
> +struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgdat)
>  {
>         struct mem_cgroup_per_zone *mz;
>         struct mem_cgroup *memcg;
>         struct lruvec *lruvec;
>
>         if (mem_cgroup_disabled()) {
> -               lruvec = &zone->lruvec;
> +               lruvec = &pgdat->lruvec;
>                 goto out;
>         }
>
> @@ -970,8 +970,8 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
>          * we have to be prepared to initialize lruvec->zone here;
>          * and if offlined then reonlined, we need to reinitialize it.
>          */
> -       if (unlikely(lruvec->zone != zone))
> -               lruvec->zone = zone;
> +       if (unlikely(lruvec->pgdat != pgdat))
> +               lruvec->pgdat = pgdat;
>         return lruvec;
>  }
>
> @@ -979,6 +979,7 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
>   * mem_cgroup_update_lru_size - account for adding or removing an lru page
>   * @lruvec: mem_cgroup per zone lru vector
>   * @lru: index of lru list the page is sitting on
> + * @zid: Zone ID of the zone pages have been added to
>   * @nr_pages: positive when adding or negative when removing
>   *
>   * This function must be called under lru_lock, just before a page is added
> @@ -986,14 +987,14 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
>   * so as to allow it to check that lru_size 0 is consistent with list_empty).
>   */
>  void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
> -                               int nr_pages)
> +                               enum zone_type zid, int nr_pages)
>  {
>         struct mem_cgroup_per_zone *mz;
>         unsigned long *lru_size;
>         long size;
>         bool empty;
>
> -       __update_lru_size(lruvec, lru, nr_pages);
> +       __update_lru_size(lruvec, lru, zid, nr_pages);
>
>         if (mem_cgroup_disabled())
>                 return;
> @@ -2069,7 +2070,7 @@ static void lock_page_lru(struct page *page, int *isolated)
>         if (PageLRU(page)) {
>                 struct lruvec *lruvec;
>
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
>                 ClearPageLRU(page);
>                 del_page_from_lru_list(page, lruvec, page_lru(page));
>                 *isolated = 1;
> @@ -2084,7 +2085,7 @@ static void unlock_page_lru(struct page *page, int isolated)
>         if (isolated) {
>                 struct lruvec *lruvec;
>
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
>                 VM_BUG_ON_PAGE(PageLRU(page), page);
>                 SetPageLRU(page);
>                 add_page_to_lru_list(page, lruvec, page_lru(page));
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 2fcca6b0e005..11de752ccaf5 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1663,7 +1663,7 @@ static int __soft_offline_page(struct page *page, int flags)
>         put_hwpoison_page(page);
>         if (!ret) {
>                 LIST_HEAD(pagelist);
> -               inc_zone_page_state(page, NR_ISOLATED_ANON +
> +               inc_node_page_state(page, NR_ISOLATED_ANON +
>                                         page_is_file_cache(page));
>                 list_add(&page->lru, &pagelist);
>                 ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
> @@ -1671,7 +1671,7 @@ static int __soft_offline_page(struct page *page, int flags)
>                 if (ret) {
>                         if (!list_empty(&pagelist)) {
>                                 list_del(&page->lru);
> -                               dec_zone_page_state(page, NR_ISOLATED_ANON +
> +                               dec_node_page_state(page, NR_ISOLATED_ANON +
>                                                 page_is_file_cache(page));
>                                 putback_lru_page(page);
>                         }
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 82d0b98d27f8..c5278360ca66 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1586,7 +1586,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
>                         put_page(page);
>                         list_add_tail(&page->lru, &source);
>                         move_pages--;
> -                       inc_zone_page_state(page, NR_ISOLATED_ANON +
> +                       inc_node_page_state(page, NR_ISOLATED_ANON +
>                                             page_is_file_cache(page));
>
>                 } else {
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 53e40d3f3933..d8c4e38fb5f4 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -962,7 +962,7 @@ static void migrate_page_add(struct page *page, struct list_head *pagelist,
>         if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(page) == 1) {
>                 if (!isolate_lru_page(page)) {
>                         list_add_tail(&page->lru, pagelist);
> -                       inc_zone_page_state(page, NR_ISOLATED_ANON +
> +                       inc_node_page_state(page, NR_ISOLATED_ANON +
>                                             page_is_file_cache(page));
>                 }
>         }
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 2232f6923cc7..3033dae33a0a 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -168,7 +168,7 @@ void putback_movable_pages(struct list_head *l)
>                         continue;
>                 }
>                 list_del(&page->lru);
> -               dec_zone_page_state(page, NR_ISOLATED_ANON +
> +               dec_node_page_state(page, NR_ISOLATED_ANON +
>                                 page_is_file_cache(page));
>                 /*
>                  * We isolated non-lru movable page so here we can use
> @@ -1119,7 +1119,7 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
>                  * restored.
>                  */
>                 list_del(&page->lru);
> -               dec_zone_page_state(page, NR_ISOLATED_ANON +
> +               dec_node_page_state(page, NR_ISOLATED_ANON +
>                                 page_is_file_cache(page));
>         }
>
> @@ -1460,7 +1460,7 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
>                 err = isolate_lru_page(page);
>                 if (!err) {
>                         list_add_tail(&page->lru, &pagelist);
> -                       inc_zone_page_state(page, NR_ISOLATED_ANON +
> +                       inc_node_page_state(page, NR_ISOLATED_ANON +
>                                             page_is_file_cache(page));
>                 }
>  put_and_set:
> @@ -1726,15 +1726,16 @@ static bool migrate_balanced_pgdat(struct pglist_data *pgdat,
>                                    unsigned long nr_migrate_pages)
>  {
>         int z;
> +
> +       if (!pgdat_reclaimable(pgdat))
> +               return false;
> +
>         for (z = pgdat->nr_zones - 1; z >= 0; z--) {
>                 struct zone *zone = pgdat->node_zones + z;
>
>                 if (!populated_zone(zone))
>                         continue;
>
> -               if (!zone_reclaimable(zone))
> -                       continue;
> -
>                 /* Avoid waking kswapd by allocating pages_to_migrate pages. */
>                 if (!zone_watermark_ok(zone, 0,
>                                        high_wmark_pages(zone) +
> @@ -1828,7 +1829,7 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
>         }
>
>         page_lru = page_is_file_cache(page);
> -       mod_zone_page_state(page_zone(page), NR_ISOLATED_ANON + page_lru,
> +       mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON + page_lru,
>                                 hpage_nr_pages(page));
>
>         /*
> @@ -1886,7 +1887,7 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
>         if (nr_remaining) {
>                 if (!list_empty(&migratepages)) {
>                         list_del(&page->lru);
> -                       dec_zone_page_state(page, NR_ISOLATED_ANON +
> +                       dec_node_page_state(page, NR_ISOLATED_ANON +
>                                         page_is_file_cache(page));
>                         putback_lru_page(page);
>                 }
> @@ -1979,7 +1980,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
>                 /* Retake the callers reference and putback on LRU */
>                 get_page(page);
>                 putback_lru_page(page);
> -               mod_zone_page_state(page_zone(page),
> +               mod_node_page_state(page_pgdat(page),
>                          NR_ISOLATED_ANON + page_lru, -HPAGE_PMD_NR);
>
>                 goto out_unlock;
> @@ -2030,7 +2031,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
>         count_vm_events(PGMIGRATE_SUCCESS, HPAGE_PMD_NR);
>         count_vm_numa_events(NUMA_PAGE_MIGRATE, HPAGE_PMD_NR);
>
> -       mod_zone_page_state(page_zone(page),
> +       mod_node_page_state(page_pgdat(page),
>                         NR_ISOLATED_ANON + page_lru,
>                         -HPAGE_PMD_NR);
>         return isolated;
> diff --git a/mm/mlock.c b/mm/mlock.c
> index 997f63082ff5..14645be06e30 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -103,7 +103,7 @@ static bool __munlock_isolate_lru_page(struct page *page, bool getpage)
>         if (PageLRU(page)) {
>                 struct lruvec *lruvec;
>
> -               lruvec = mem_cgroup_page_lruvec(page, page_zone(page));
> +               lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
>                 if (getpage)
>                         get_page(page);
>                 ClearPageLRU(page);
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index d578d2a56b19..0ada2b2954b0 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -285,8 +285,8 @@ static unsigned long zone_dirtyable_memory(struct zone *zone)
>          */
>         nr_pages -= min(nr_pages, zone->totalreserve_pages);
>
> -       nr_pages += zone_page_state(zone, NR_INACTIVE_FILE);
> -       nr_pages += zone_page_state(zone, NR_ACTIVE_FILE);
> +       nr_pages += node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE);
> +       nr_pages += node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE);
>
>         return nr_pages;
>  }
> @@ -348,8 +348,8 @@ static unsigned long global_dirtyable_memory(void)
>          */
>         x -= min(x, totalreserve_pages);
>
> -       x += global_page_state(NR_INACTIVE_FILE);
> -       x += global_page_state(NR_ACTIVE_FILE);
> +       x += global_node_page_state(NR_INACTIVE_FILE);
> +       x += global_node_page_state(NR_ACTIVE_FILE);
>
>         if (!vm_highmem_is_dirtyable)
>                 x -= highmem_dirtyable_memory(x);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 48b5414009ac..b84b85ae54ff 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1090,9 +1090,9 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>
>         spin_lock(&zone->lock);
>         isolated_pageblocks = has_isolate_pageblock(zone);
> -       nr_scanned = zone_page_state(zone, NR_PAGES_SCANNED);
> +       nr_scanned = node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED);
>         if (nr_scanned)
> -               __mod_zone_page_state(zone, NR_PAGES_SCANNED, -nr_scanned);
> +               __mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);
>
>         while (count) {
>                 struct page *page;
> @@ -1147,9 +1147,9 @@ static void free_one_page(struct zone *zone,
>  {
>         unsigned long nr_scanned;
>         spin_lock(&zone->lock);
> -       nr_scanned = zone_page_state(zone, NR_PAGES_SCANNED);
> +       nr_scanned = node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED);
>         if (nr_scanned)
> -               __mod_zone_page_state(zone, NR_PAGES_SCANNED, -nr_scanned);
> +               __mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);
>
>         if (unlikely(has_isolate_pageblock(zone) ||
>                 is_migrate_isolate(migratetype))) {
> @@ -4331,6 +4331,7 @@ void show_free_areas(unsigned int filter)
>         unsigned long free_pcp = 0;
>         int cpu;
>         struct zone *zone;
> +       pg_data_t *pgdat;
>
>         for_each_populated_zone(zone) {
>                 if (skip_free_areas_node(filter, zone_to_nid(zone)))
> @@ -4349,13 +4350,13 @@ void show_free_areas(unsigned int filter)
>                 " anon_thp: %lu shmem_thp: %lu shmem_pmdmapped: %lu\n"
>  #endif
>                 " free:%lu free_pcp:%lu free_cma:%lu\n",
> -               global_page_state(NR_ACTIVE_ANON),
> -               global_page_state(NR_INACTIVE_ANON),
> -               global_page_state(NR_ISOLATED_ANON),
> -               global_page_state(NR_ACTIVE_FILE),
> -               global_page_state(NR_INACTIVE_FILE),
> -               global_page_state(NR_ISOLATED_FILE),
> -               global_page_state(NR_UNEVICTABLE),
> +               global_node_page_state(NR_ACTIVE_ANON),
> +               global_node_page_state(NR_INACTIVE_ANON),
> +               global_node_page_state(NR_ISOLATED_ANON),
> +               global_node_page_state(NR_ACTIVE_FILE),
> +               global_node_page_state(NR_INACTIVE_FILE),
> +               global_node_page_state(NR_ISOLATED_FILE),
> +               global_node_page_state(NR_UNEVICTABLE),
>                 global_page_state(NR_FILE_DIRTY),
>                 global_page_state(NR_WRITEBACK),
>                 global_page_state(NR_UNSTABLE_NFS),
> @@ -4374,6 +4375,28 @@ void show_free_areas(unsigned int filter)
>                 free_pcp,
>                 global_page_state(NR_FREE_CMA_PAGES));
>
> +       for_each_online_pgdat(pgdat) {
> +               printk("Node %d"
> +                       " active_anon:%lukB"
> +                       " inactive_anon:%lukB"
> +                       " active_file:%lukB"
> +                       " inactive_file:%lukB"
> +                       " unevictable:%lukB"
> +                       " isolated(anon):%lukB"
> +                       " isolated(file):%lukB"
> +                       " all_unreclaimable? %s"
> +                       "\n",
> +                       pgdat->node_id,
> +                       K(node_page_state(pgdat, NR_ACTIVE_ANON)),
> +                       K(node_page_state(pgdat, NR_INACTIVE_ANON)),
> +                       K(node_page_state(pgdat, NR_ACTIVE_FILE)),
> +                       K(node_page_state(pgdat, NR_INACTIVE_FILE)),
> +                       K(node_page_state(pgdat, NR_UNEVICTABLE)),
> +                       K(node_page_state(pgdat, NR_ISOLATED_ANON)),
> +                       K(node_page_state(pgdat, NR_ISOLATED_FILE)),
> +                       !pgdat_reclaimable(pgdat) ? "yes" : "no");
> +       }
> +
>         for_each_populated_zone(zone) {
>                 int i;
>
> @@ -4390,13 +4413,6 @@ void show_free_areas(unsigned int filter)
>                         " min:%lukB"
>                         " low:%lukB"
>                         " high:%lukB"
> -                       " active_anon:%lukB"
> -                       " inactive_anon:%lukB"
> -                       " active_file:%lukB"
> -                       " inactive_file:%lukB"
> -                       " unevictable:%lukB"
> -                       " isolated(anon):%lukB"
> -                       " isolated(file):%lukB"
>                         " present:%lukB"
>                         " managed:%lukB"
>                         " mlocked:%lukB"
> @@ -4419,21 +4435,13 @@ void show_free_areas(unsigned int filter)
>                         " local_pcp:%ukB"
>                         " free_cma:%lukB"
>                         " writeback_tmp:%lukB"
> -                       " pages_scanned:%lu"
> -                       " all_unreclaimable? %s"
> +                       " node_pages_scanned:%lu"
>                         "\n",
>                         zone->name,
>                         K(zone_page_state(zone, NR_FREE_PAGES)),
>                         K(min_wmark_pages(zone)),
>                         K(low_wmark_pages(zone)),
>                         K(high_wmark_pages(zone)),
> -                       K(zone_page_state(zone, NR_ACTIVE_ANON)),
> -                       K(zone_page_state(zone, NR_INACTIVE_ANON)),
> -                       K(zone_page_state(zone, NR_ACTIVE_FILE)),
> -                       K(zone_page_state(zone, NR_INACTIVE_FILE)),
> -                       K(zone_page_state(zone, NR_UNEVICTABLE)),
> -                       K(zone_page_state(zone, NR_ISOLATED_ANON)),
> -                       K(zone_page_state(zone, NR_ISOLATED_FILE)),
>                         K(zone->present_pages),
>                         K(zone->managed_pages),
>                         K(zone_page_state(zone, NR_MLOCK)),
> @@ -4458,9 +4466,7 @@ void show_free_areas(unsigned int filter)
>                         K(this_cpu_read(zone->pageset->pcp.count)),
>                         K(zone_page_state(zone, NR_FREE_CMA_PAGES)),
>                         K(zone_page_state(zone, NR_WRITEBACK_TEMP)),
> -                       K(zone_page_state(zone, NR_PAGES_SCANNED)),
> -                       (!zone_reclaimable(zone) ? "yes" : "no")
> -                       );
> +                       K(node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED)));
>                 printk("lowmem_reserve[]:");
>                 for (i = 0; i < MAX_NR_ZONES; i++)
>                         printk(" %ld", zone->lowmem_reserve[i]);
> @@ -6010,7 +6016,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
>                 /* For bootup, initialized properly in watermark setup */
>                 mod_zone_page_state(zone, NR_ALLOC_BATCH, zone->managed_pages);
>
> -               lruvec_init(&zone->lruvec);
> +               lruvec_init(zone_lruvec(zone));
>                 if (!size)
>                         continue;
>
> diff --git a/mm/swap.c b/mm/swap.c
> index bf37e5cfae81..77af473635fe 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -63,7 +63,7 @@ static void __page_cache_release(struct page *page)
>                 unsigned long flags;
>
>                 spin_lock_irqsave(zone_lru_lock(zone), flags);
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
>                 VM_BUG_ON_PAGE(!PageLRU(page), page);
>                 __ClearPageLRU(page);
>                 del_page_from_lru_list(page, lruvec, page_off_lru(page));
> @@ -194,7 +194,7 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
>                         spin_lock_irqsave(zone_lru_lock(zone), flags);
>                 }
>
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
>                 (*move_fn)(page, lruvec, arg);
>         }
>         if (zone)
> @@ -319,7 +319,7 @@ void activate_page(struct page *page)
>
>         page = compound_head(page);
>         spin_lock_irq(zone_lru_lock(zone));
> -       __activate_page(page, mem_cgroup_page_lruvec(page, zone), NULL);
> +       __activate_page(page, mem_cgroup_page_lruvec(page, zone->zone_pgdat), NULL);
>         spin_unlock_irq(zone_lru_lock(zone));
>  }
>  #endif
> @@ -445,16 +445,16 @@ void lru_cache_add(struct page *page)
>   */
>  void add_page_to_unevictable_list(struct page *page)
>  {
> -       struct zone *zone = page_zone(page);
> +       struct pglist_data *pgdat = page_pgdat(page);
>         struct lruvec *lruvec;
>
> -       spin_lock_irq(zone_lru_lock(zone));
> -       lruvec = mem_cgroup_page_lruvec(page, zone);
> +       spin_lock_irq(&pgdat->lru_lock);
> +       lruvec = mem_cgroup_page_lruvec(page, pgdat);
>         ClearPageActive(page);
>         SetPageUnevictable(page);
>         SetPageLRU(page);
>         add_page_to_lru_list(page, lruvec, LRU_UNEVICTABLE);
> -       spin_unlock_irq(zone_lru_lock(zone));
> +       spin_unlock_irq(&pgdat->lru_lock);
>  }
>
>  /**
> @@ -730,7 +730,7 @@ void release_pages(struct page **pages, int nr, bool cold)
>  {
>         int i;
>         LIST_HEAD(pages_to_free);
> -       struct zone *zone = NULL;
> +       struct pglist_data *locked_pgdat = NULL;
>         struct lruvec *lruvec;
>         unsigned long uninitialized_var(flags);
>         unsigned int uninitialized_var(lock_batch);
> @@ -741,11 +741,11 @@ void release_pages(struct page **pages, int nr, bool cold)
>                 /*
>                  * Make sure the IRQ-safe lock-holding time does not get
>                  * excessive with a continuous string of pages from the
> -                * same zone. The lock is held only if zone != NULL.
> +                * same pgdat. The lock is held only if pgdat != NULL.
>                  */
> -               if (zone && ++lock_batch == SWAP_CLUSTER_MAX) {
> -                       spin_unlock_irqrestore(zone_lru_lock(zone), flags);
> -                       zone = NULL;
> +               if (locked_pgdat && ++lock_batch == SWAP_CLUSTER_MAX) {
> +                       spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
> +                       locked_pgdat = NULL;
>                 }
>
>                 if (is_huge_zero_page(page)) {
> @@ -758,27 +758,27 @@ void release_pages(struct page **pages, int nr, bool cold)
>                         continue;
>
>                 if (PageCompound(page)) {
> -                       if (zone) {
> -                               spin_unlock_irqrestore(zone_lru_lock(zone), flags);
> -                               zone = NULL;
> +                       if (locked_pgdat) {
> +                               spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
> +                               locked_pgdat = NULL;
>                         }
>                         __put_compound_page(page);
>                         continue;
>                 }
>
>                 if (PageLRU(page)) {
> -                       struct zone *pagezone = page_zone(page);
> +                       struct pglist_data *pgdat = page_pgdat(page);
>
> -                       if (pagezone != zone) {
> -                               if (zone)
> -                                       spin_unlock_irqrestore(zone_lru_lock(zone),
> +                       if (pgdat != locked_pgdat) {
> +                               if (locked_pgdat)
> +                                       spin_unlock_irqrestore(&locked_pgdat->lru_lock,
>                                                                         flags);
>                                 lock_batch = 0;
> -                               zone = pagezone;
> -                               spin_lock_irqsave(zone_lru_lock(zone), flags);
> +                               locked_pgdat = pgdat;
> +                               spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
>                         }
>
> -                       lruvec = mem_cgroup_page_lruvec(page, zone);
> +                       lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
>                         VM_BUG_ON_PAGE(!PageLRU(page), page);
>                         __ClearPageLRU(page);
>                         del_page_from_lru_list(page, lruvec, page_off_lru(page));
> @@ -789,8 +789,8 @@ void release_pages(struct page **pages, int nr, bool cold)
>
>                 list_add(&page->lru, &pages_to_free);
>         }
> -       if (zone)
> -               spin_unlock_irqrestore(zone_lru_lock(zone), flags);
> +       if (locked_pgdat)
> +               spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
>
>         mem_cgroup_uncharge_list(&pages_to_free);
>         free_hot_cold_page_list(&pages_to_free, cold);
> @@ -826,7 +826,7 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
>         VM_BUG_ON_PAGE(PageCompound(page_tail), page);
>         VM_BUG_ON_PAGE(PageLRU(page_tail), page);
>         VM_BUG_ON(NR_CPUS != 1 &&
> -                 !spin_is_locked(zone_lru_lock(lruvec_zone(lruvec))));
> +                 !spin_is_locked(&lruvec_pgdat(lruvec)->lru_lock));
>
>         if (!list)
>                 SetPageLRU(page_tail);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index e7ffcd259cc4..86a523a761c9 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -191,26 +191,42 @@ static bool sane_reclaim(struct scan_control *sc)
>  }
>  #endif
>
> +/*
> + * This misses isolated pages which are not accounted for to save counters.
> + * As the data only determines if reclaim or compaction continues, it is
> + * not expected that isolated pages will be a dominating factor.
> + */
>  unsigned long zone_reclaimable_pages(struct zone *zone)
>  {
>         unsigned long nr;
>
> -       nr = zone_page_state_snapshot(zone, NR_ACTIVE_FILE) +
> -            zone_page_state_snapshot(zone, NR_INACTIVE_FILE) +
> -            zone_page_state_snapshot(zone, NR_ISOLATED_FILE);
> +       nr = zone_page_state_snapshot(zone, NR_ZONE_LRU_FILE);
> +       if (get_nr_swap_pages() > 0)
> +               nr += zone_page_state_snapshot(zone, NR_ZONE_LRU_ANON);
> +
> +       return nr;
> +}
> +
> +unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat)
> +{
> +       unsigned long nr;
> +
> +       nr = node_page_state_snapshot(pgdat, NR_ACTIVE_FILE) +
> +            node_page_state_snapshot(pgdat, NR_INACTIVE_FILE) +
> +            node_page_state_snapshot(pgdat, NR_ISOLATED_FILE);
>
>         if (get_nr_swap_pages() > 0)
> -               nr += zone_page_state_snapshot(zone, NR_ACTIVE_ANON) +
> -                     zone_page_state_snapshot(zone, NR_INACTIVE_ANON) +
> -                     zone_page_state_snapshot(zone, NR_ISOLATED_ANON);
> +               nr += node_page_state_snapshot(pgdat, NR_ACTIVE_ANON) +
> +                     node_page_state_snapshot(pgdat, NR_INACTIVE_ANON) +
> +                     node_page_state_snapshot(pgdat, NR_ISOLATED_ANON);
>
>         return nr;
>  }
>
> -bool zone_reclaimable(struct zone *zone)
> +bool pgdat_reclaimable(struct pglist_data *pgdat)
>  {
> -       return zone_page_state_snapshot(zone, NR_PAGES_SCANNED) <
> -               zone_reclaimable_pages(zone) * 6;
> +       return node_page_state_snapshot(pgdat, NR_PAGES_SCANNED) <
> +               pgdat_reclaimable_pages(pgdat) * 6;
>  }
>
>  unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru)
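
Same heuristic as the zone version it replaces: pgdat_reclaimable() above keeps
treating the node as reclaimable while NR_PAGES_SCANNED stays below six times
the reclaimable page count. E.g. a node with 10,000 reclaimable pages is only
reported as unreclaimable once 60,000 pages have been scanned without the
counter being cleared by pages being freed back to the allocator (see the
page_alloc.c hunks above).
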
> @@ -218,7 +234,7 @@ unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru)
>         if (!mem_cgroup_disabled())
>                 return mem_cgroup_get_lru_size(lruvec, lru);
>
> -       return zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru);
> +       return node_page_state(lruvec_pgdat(lruvec), NR_LRU_BASE + lru);
>  }
>
>  /*
> @@ -877,7 +893,7 @@ static void page_check_dirty_writeback(struct page *page,
>   * shrink_page_list() returns the number of reclaimed pages
>   */
>  static unsigned long shrink_page_list(struct list_head *page_list,
> -                                     struct zone *zone,
> +                                     struct pglist_data *pgdat,
>                                       struct scan_control *sc,
>                                       enum ttu_flags ttu_flags,
>                                       unsigned long *ret_nr_dirty,
> @@ -917,7 +933,6 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>                         goto keep;
>
>                 VM_BUG_ON_PAGE(PageActive(page), page);
> -               VM_BUG_ON_PAGE(page_zone(page) != zone, page);
>
>                 sc->nr_scanned++;
>
> @@ -996,7 +1011,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>                         /* Case 1 above */
>                         if (current_is_kswapd() &&
>                             PageReclaim(page) &&
> -                           test_bit(ZONE_WRITEBACK, &zone->flags)) {
> +                           test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {
>                                 nr_immediate++;
>                                 goto keep_locked;
>
> @@ -1092,7 +1107,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>                          */
>                         if (page_is_file_cache(page) &&
>                                         (!current_is_kswapd() ||
> -                                        !test_bit(ZONE_DIRTY, &zone->flags))) {
> +                                        !test_bit(PGDAT_DIRTY, &pgdat->flags))) {
>                                 /*
>                                  * Immediately reclaim when written back.
>                                  * Similar in principle to deactivate_page()
> @@ -1266,11 +1281,11 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
>                 }
>         }
>
> -       ret = shrink_page_list(&clean_pages, zone, &sc,
> +       ret = shrink_page_list(&clean_pages, zone->zone_pgdat, &sc,
>                         TTU_UNMAP|TTU_IGNORE_ACCESS,
>                         &dummy1, &dummy2, &dummy3, &dummy4, &dummy5, true);
>         list_splice(&clean_pages, page_list);
> -       mod_zone_page_state(zone, NR_ISOLATED_FILE, -ret);
> +       mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE, -ret);
>         return ret;
>  }
>
> @@ -1375,7 +1390,8 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>  {
>         struct list_head *src = &lruvec->lists[lru];
>         unsigned long nr_taken = 0;
> -       unsigned long scan;
> +       unsigned long nr_zone_taken[MAX_NR_ZONES] = { 0 };
> +       unsigned long scan, nr_pages;
>
>         for (scan = 0; scan < nr_to_scan && nr_taken < nr_to_scan &&
>                                         !list_empty(src); scan++) {
> @@ -1388,7 +1404,9 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>
>                 switch (__isolate_lru_page(page, mode)) {
>                 case 0:
> -                       nr_taken += hpage_nr_pages(page);
> +                       nr_pages = hpage_nr_pages(page);
> +                       nr_taken += nr_pages;
> +                       nr_zone_taken[page_zonenum(page)] += nr_pages;
>                         list_move(&page->lru, dst);
>                         break;
>
> @@ -1405,6 +1423,13 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>         *nr_scanned = scan;
>         trace_mm_vmscan_lru_isolate(sc->order, nr_to_scan, scan,
>                                     nr_taken, mode, is_file_lru(lru));
> +       for (scan = 0; scan < MAX_NR_ZONES; scan++) {
> +               nr_pages = nr_zone_taken[scan];
> +               if (!nr_pages)
> +                       continue;
> +
> +               update_lru_size(lruvec, lru, scan, -nr_pages);
> +       }
>         return nr_taken;
>  }
>
> @@ -1445,7 +1470,7 @@ int isolate_lru_page(struct page *page)
>                 struct lruvec *lruvec;
>
>                 spin_lock_irq(zone_lru_lock(zone));
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
>                 if (PageLRU(page)) {
>                         int lru = page_lru(page);
>                         get_page(page);
> @@ -1465,7 +1490,7 @@ int isolate_lru_page(struct page *page)
>   * the LRU list will go small and be scanned faster than necessary, leading to
>   * unnecessary swapping, thrashing and OOM.
>   */
> -static int too_many_isolated(struct zone *zone, int file,
> +static int too_many_isolated(struct pglist_data *pgdat, int file,
>                 struct scan_control *sc)
>  {
>         unsigned long inactive, isolated;
> @@ -1477,11 +1502,11 @@ static int too_many_isolated(struct zone *zone, int file,
>                 return 0;
>
>         if (file) {
> -               inactive = zone_page_state(zone, NR_INACTIVE_FILE);
> -               isolated = zone_page_state(zone, NR_ISOLATED_FILE);
> +               inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
> +               isolated = node_page_state(pgdat, NR_ISOLATED_FILE);
>         } else {
> -               inactive = zone_page_state(zone, NR_INACTIVE_ANON);
> -               isolated = zone_page_state(zone, NR_ISOLATED_ANON);
> +               inactive = node_page_state(pgdat, NR_INACTIVE_ANON);
> +               isolated = node_page_state(pgdat, NR_ISOLATED_ANON);
>         }
>
>         /*
> @@ -1499,7 +1524,7 @@ static noinline_for_stack void
>  putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
>  {
>         struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> -       struct zone *zone = lruvec_zone(lruvec);
> +       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>         LIST_HEAD(pages_to_free);
>
>         /*
> @@ -1512,13 +1537,13 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
>                 VM_BUG_ON_PAGE(PageLRU(page), page);
>                 list_del(&page->lru);
>                 if (unlikely(!page_evictable(page))) {
> -                       spin_unlock_irq(zone_lru_lock(zone));
> +                       spin_unlock_irq(&pgdat->lru_lock);
>                         putback_lru_page(page);
> -                       spin_lock_irq(zone_lru_lock(zone));
> +                       spin_lock_irq(&pgdat->lru_lock);
>                         continue;
>                 }
>
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, pgdat);
>
>                 SetPageLRU(page);
>                 lru = page_lru(page);
> @@ -1535,10 +1560,10 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
>                         del_page_from_lru_list(page, lruvec, lru);
>
>                         if (unlikely(PageCompound(page))) {
> -                               spin_unlock_irq(zone_lru_lock(zone));
> +                               spin_unlock_irq(&pgdat->lru_lock);
>                                 mem_cgroup_uncharge(page);
>                                 (*get_compound_page_dtor(page))(page);
> -                               spin_lock_irq(zone_lru_lock(zone));
> +                               spin_lock_irq(&pgdat->lru_lock);
>                         } else
>                                 list_add(&page->lru, &pages_to_free);
>                 }
> @@ -1582,10 +1607,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>         unsigned long nr_immediate = 0;
>         isolate_mode_t isolate_mode = 0;
>         int file = is_file_lru(lru);
> -       struct zone *zone = lruvec_zone(lruvec);
> +       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>         struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
>
> -       while (unlikely(too_many_isolated(zone, file, sc))) {
> +       while (unlikely(too_many_isolated(pgdat, file, sc))) {
>                 congestion_wait(BLK_RW_ASYNC, HZ/10);
>
>                 /* We are about to die and free our memory. Return now. */
> @@ -1600,48 +1625,45 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>         if (!sc->may_writepage)
>                 isolate_mode |= ISOLATE_CLEAN;
>
> -       spin_lock_irq(zone_lru_lock(zone));
> +       spin_lock_irq(&pgdat->lru_lock);
>
>         nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
>                                      &nr_scanned, sc, isolate_mode, lru);
>
> -       update_lru_size(lruvec, lru, -nr_taken);
> -       __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
> +       __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
>         reclaim_stat->recent_scanned[file] += nr_taken;
>
>         if (global_reclaim(sc)) {
> -               __mod_zone_page_state(zone, NR_PAGES_SCANNED, nr_scanned);
> +               __mod_node_page_state(pgdat, NR_PAGES_SCANNED, nr_scanned);
>                 if (current_is_kswapd())
> -                       __count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scanned);
> +                       __count_vm_events(PGSCAN_KSWAPD, nr_scanned);
>                 else
> -                       __count_zone_vm_events(PGSCAN_DIRECT, zone, nr_scanned);
> +                       __count_vm_events(PGSCAN_DIRECT, nr_scanned);
>         }
> -       spin_unlock_irq(zone_lru_lock(zone));
> +       spin_unlock_irq(&pgdat->lru_lock);
>
>         if (nr_taken == 0)
>                 return 0;
>
> -       nr_reclaimed = shrink_page_list(&page_list, zone, sc, TTU_UNMAP,
> +       nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, TTU_UNMAP,
>                                 &nr_dirty, &nr_unqueued_dirty, &nr_congested,
>                                 &nr_writeback, &nr_immediate,
>                                 false);
>
> -       spin_lock_irq(zone_lru_lock(zone));
> +       spin_lock_irq(&pgdat->lru_lock);
>
>         if (global_reclaim(sc)) {
>                 if (current_is_kswapd())
> -                       __count_zone_vm_events(PGSTEAL_KSWAPD, zone,
> -                                              nr_reclaimed);
> +                       __count_vm_events(PGSTEAL_KSWAPD, nr_reclaimed);
>                 else
> -                       __count_zone_vm_events(PGSTEAL_DIRECT, zone,
> -                                              nr_reclaimed);
> +                       __count_vm_events(PGSTEAL_DIRECT, nr_reclaimed);
>         }
>
>         putback_inactive_pages(lruvec, &page_list);
>
> -       __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
> +       __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
>
> -       spin_unlock_irq(zone_lru_lock(zone));
> +       spin_unlock_irq(&pgdat->lru_lock);
>
>         mem_cgroup_uncharge_list(&page_list);
>         free_hot_cold_page_list(&page_list, true);
> @@ -1661,7 +1683,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>          * are encountered in the nr_immediate check below.
>          */
>         if (nr_writeback && nr_writeback == nr_taken)
> -               set_bit(ZONE_WRITEBACK, &zone->flags);
> +               set_bit(PGDAT_WRITEBACK, &pgdat->flags);
>
>         /*
>          * Legacy memcg will stall in page writeback so avoid forcibly
> @@ -1673,16 +1695,16 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>                  * backed by a congested BDI and wait_iff_congested will stall.
>                  */
>                 if (nr_dirty && nr_dirty == nr_congested)
> -                       set_bit(ZONE_CONGESTED, &zone->flags);
> +                       set_bit(PGDAT_CONGESTED, &pgdat->flags);
>
>                 /*
>                  * If dirty pages are scanned that are not queued for IO, it
>                  * implies that flushers are not keeping up. In this case, flag
> -                * the zone ZONE_DIRTY and kswapd will start writing pages from
> +                * the pgdat PGDAT_DIRTY and kswapd will start writing pages from
>                  * reclaim context.
>                  */
>                 if (nr_unqueued_dirty == nr_taken)
> -                       set_bit(ZONE_DIRTY, &zone->flags);
> +                       set_bit(PGDAT_DIRTY, &pgdat->flags);
>
>                 /*
>                  * If kswapd scans pages marked for immediate
> @@ -1701,9 +1723,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>          */
>         if (!sc->hibernation_mode && !current_is_kswapd() &&
>             current_may_throttle())
> -               wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
> +               wait_iff_congested(pgdat, BLK_RW_ASYNC, HZ/10);
>
> -       trace_mm_vmscan_lru_shrink_inactive(zone, nr_scanned, nr_reclaimed,
> +       trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
> +                       nr_scanned, nr_reclaimed,
>                         sc->priority, file);
>         return nr_reclaimed;
>  }
> @@ -1731,20 +1754,20 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
>                                      struct list_head *pages_to_free,
>                                      enum lru_list lru)
>  {
> -       struct zone *zone = lruvec_zone(lruvec);
> +       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>         unsigned long pgmoved = 0;
>         struct page *page;
>         int nr_pages;
>
>         while (!list_empty(list)) {
>                 page = lru_to_page(list);
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, pgdat);
>
>                 VM_BUG_ON_PAGE(PageLRU(page), page);
>                 SetPageLRU(page);
>
>                 nr_pages = hpage_nr_pages(page);
> -               update_lru_size(lruvec, lru, nr_pages);
> +               update_lru_size(lruvec, lru, page_zonenum(page), nr_pages);
>                 list_move(&page->lru, &lruvec->lists[lru]);
>                 pgmoved += nr_pages;
>
> @@ -1754,10 +1777,10 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
>                         del_page_from_lru_list(page, lruvec, lru);
>
>                         if (unlikely(PageCompound(page))) {
> -                               spin_unlock_irq(zone_lru_lock(zone));
> +                               spin_unlock_irq(&pgdat->lru_lock);
>                                 mem_cgroup_uncharge(page);
>                                 (*get_compound_page_dtor(page))(page);
> -                               spin_lock_irq(zone_lru_lock(zone));
> +                               spin_lock_irq(&pgdat->lru_lock);
>                         } else
>                                 list_add(&page->lru, pages_to_free);
>                 }
> @@ -1783,7 +1806,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
>         unsigned long nr_rotated = 0;
>         isolate_mode_t isolate_mode = 0;
>         int file = is_file_lru(lru);
> -       struct zone *zone = lruvec_zone(lruvec);
> +       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>
>         lru_add_drain();
>
> @@ -1792,20 +1815,19 @@ static void shrink_active_list(unsigned long nr_to_scan,
>         if (!sc->may_writepage)
>                 isolate_mode |= ISOLATE_CLEAN;
>
> -       spin_lock_irq(zone_lru_lock(zone));
> +       spin_lock_irq(&pgdat->lru_lock);
>
>         nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
>                                      &nr_scanned, sc, isolate_mode, lru);
>
> -       update_lru_size(lruvec, lru, -nr_taken);
> -       __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
> +       __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
>         reclaim_stat->recent_scanned[file] += nr_taken;
>
>         if (global_reclaim(sc))
> -               __mod_zone_page_state(zone, NR_PAGES_SCANNED, nr_scanned);
> -       __count_zone_vm_events(PGREFILL, zone, nr_scanned);
> +               __mod_node_page_state(pgdat, NR_PAGES_SCANNED, nr_scanned);
> +       __count_vm_events(PGREFILL, nr_scanned);
>
> -       spin_unlock_irq(zone_lru_lock(zone));
> +       spin_unlock_irq(&pgdat->lru_lock);
>
>         while (!list_empty(&l_hold)) {
>                 cond_resched();
> @@ -1850,7 +1872,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
>         /*
>          * Move pages back to the lru list.
>          */
> -       spin_lock_irq(zone_lru_lock(zone));
> +       spin_lock_irq(&pgdat->lru_lock);
>         /*
>          * Count referenced pages from currently used mappings as rotated,
>          * even though only some of them are actually re-activated.  This
> @@ -1861,8 +1883,8 @@ static void shrink_active_list(unsigned long nr_to_scan,
>
>         move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
>         move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
> -       __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
> -       spin_unlock_irq(zone_lru_lock(zone));
> +       __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
> +       spin_unlock_irq(&pgdat->lru_lock);
>
>         mem_cgroup_uncharge_list(&l_hold);
>         free_hot_cold_page_list(&l_hold, true);
> @@ -1956,7 +1978,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>         struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
>         u64 fraction[2];
>         u64 denominator = 0;    /* gcc */
> -       struct zone *zone = lruvec_zone(lruvec);
> +       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>         unsigned long anon_prio, file_prio;
>         enum scan_balance scan_balance;
>         unsigned long anon, file;
> @@ -1977,7 +1999,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>          * well.
>          */
>         if (current_is_kswapd()) {
> -               if (!zone_reclaimable(zone))
> +               if (!pgdat_reclaimable(pgdat))
>                         force_scan = true;
>                 if (!mem_cgroup_online(memcg))
>                         force_scan = true;
> @@ -2023,14 +2045,24 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>          * anon pages.  Try to detect this based on file LRU size.
>          */
>         if (global_reclaim(sc)) {
> -               unsigned long zonefile;
> -               unsigned long zonefree;
> +               unsigned long pgdatfile;
> +               unsigned long pgdatfree;
> +               int z;
> +               unsigned long total_high_wmark = 0;
>
> -               zonefree = zone_page_state(zone, NR_FREE_PAGES);
> -               zonefile = zone_page_state(zone, NR_ACTIVE_FILE) +
> -                          zone_page_state(zone, NR_INACTIVE_FILE);
> +               pgdatfree = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
> +               pgdatfile = node_page_state(pgdat, NR_ACTIVE_FILE) +
> +                          node_page_state(pgdat, NR_INACTIVE_FILE);
> +
> +               for (z = 0; z < MAX_NR_ZONES; z++) {
> +                       struct zone *zone = &pgdat->node_zones[z];
> +                       if (!populated_zone(zone))
> +                               continue;
> +
> +                       total_high_wmark += high_wmark_pages(zone);
> +               }
>
> -               if (unlikely(zonefile + zonefree <= high_wmark_pages(zone))) {
> +               if (unlikely(pgdatfile + pgdatfree <= total_high_wmark)) {
>                         scan_balance = SCAN_ANON;
>                         goto out;
>                 }
> @@ -2077,7 +2109,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>         file  = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE) +
>                 lruvec_lru_size(lruvec, LRU_INACTIVE_FILE);
>
> -       spin_lock_irq(zone_lru_lock(zone));
> +       spin_lock_irq(&pgdat->lru_lock);
>         if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
>                 reclaim_stat->recent_scanned[0] /= 2;
>                 reclaim_stat->recent_rotated[0] /= 2;
> @@ -2098,7 +2130,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>
>         fp = file_prio * (reclaim_stat->recent_scanned[1] + 1);
>         fp /= reclaim_stat->recent_rotated[1] + 1;
> -       spin_unlock_irq(zone_lru_lock(zone));
> +       spin_unlock_irq(&pgdat->lru_lock);
>
>         fraction[0] = ap;
>         fraction[1] = fp;
> @@ -2352,9 +2384,9 @@ static inline bool should_continue_reclaim(struct zone *zone,
>          * inactive lists are large enough, continue reclaiming
>          */
>         pages_for_compaction = (2UL << sc->order);
> -       inactive_lru_pages = zone_page_state(zone, NR_INACTIVE_FILE);
> +       inactive_lru_pages = node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE);
>         if (get_nr_swap_pages() > 0)
> -               inactive_lru_pages += zone_page_state(zone, NR_INACTIVE_ANON);
> +               inactive_lru_pages += node_page_state(zone->zone_pgdat, NR_INACTIVE_ANON);
>         if (sc->nr_reclaimed < pages_for_compaction &&
>                         inactive_lru_pages > pages_for_compaction)
>                 return true;
> @@ -2554,7 +2586,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>                                 continue;
>
>                         if (sc->priority != DEF_PRIORITY &&
> -                           !zone_reclaimable(zone))
> +                           !pgdat_reclaimable(zone->zone_pgdat))
>                                 continue;       /* Let kswapd poll it */
>
>                         /*
> @@ -2692,7 +2724,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
>         for (i = 0; i <= ZONE_NORMAL; i++) {
>                 zone = &pgdat->node_zones[i];
>                 if (!populated_zone(zone) ||
> -                   zone_reclaimable_pages(zone) == 0)
> +                   pgdat_reclaimable_pages(pgdat) == 0)
>                         continue;
>
>                 pfmemalloc_reserve += min_wmark_pages(zone);
> @@ -3000,7 +3032,7 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
>                  * DEF_PRIORITY. Effectively, it considers them balanced so
>                  * they must be considered balanced here as well!
>                  */
> -               if (!zone_reclaimable(zone)) {
> +               if (!pgdat_reclaimable(zone->zone_pgdat)) {
>                         balanced_pages += zone->managed_pages;
>                         continue;
>                 }
> @@ -3063,6 +3095,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
>  {
>         unsigned long balance_gap;
>         bool lowmem_pressure;
> +       struct pglist_data *pgdat = zone->zone_pgdat;
>
>         /* Reclaim above the high watermark. */
>         sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
> @@ -3087,7 +3120,8 @@ static bool kswapd_shrink_zone(struct zone *zone,
>
>         shrink_zone(zone, sc, zone_idx(zone) == classzone_idx);
>
> -       clear_bit(ZONE_WRITEBACK, &zone->flags);
> +       /* TODO: ANOMALY */
> +       clear_bit(PGDAT_WRITEBACK, &pgdat->flags);
>
>         /*
>          * If a zone reaches its high watermark, consider it to be no longer
> @@ -3095,10 +3129,10 @@ static bool kswapd_shrink_zone(struct zone *zone,
>          * BDIs but as pressure is relieved, speculatively avoid congestion
>          * waits.
>          */
> -       if (zone_reclaimable(zone) &&
> +       if (pgdat_reclaimable(zone->zone_pgdat) &&
>             zone_balanced(zone, sc->order, false, 0, classzone_idx)) {
> -               clear_bit(ZONE_CONGESTED, &zone->flags);
> -               clear_bit(ZONE_DIRTY, &zone->flags);
> +               clear_bit(PGDAT_CONGESTED, &pgdat->flags);
> +               clear_bit(PGDAT_DIRTY, &pgdat->flags);
>         }
>
>         return sc->nr_scanned >= sc->nr_to_reclaim;
> @@ -3157,7 +3191,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>                                 continue;
>
>                         if (sc.priority != DEF_PRIORITY &&
> -                           !zone_reclaimable(zone))
> +                           !pgdat_reclaimable(zone->zone_pgdat))
>                                 continue;
>
>                         /*
> @@ -3184,9 +3218,11 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>                                 /*
>                                  * If balanced, clear the dirty and congested
>                                  * flags
> +                                *
> +                                * TODO: ANOMALY
>                                  */
> -                               clear_bit(ZONE_CONGESTED, &zone->flags);
> -                               clear_bit(ZONE_DIRTY, &zone->flags);
> +                               clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags);
> +                               clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags);
>                         }
>                 }
>
> @@ -3216,7 +3252,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>                                 continue;
>
>                         if (sc.priority != DEF_PRIORITY &&
> -                           !zone_reclaimable(zone))
> +                           !pgdat_reclaimable(zone->zone_pgdat))
>                                 continue;
>
>                         sc.nr_scanned = 0;
> @@ -3612,8 +3648,8 @@ int sysctl_min_slab_ratio = 5;
>  static inline unsigned long zone_unmapped_file_pages(struct zone *zone)
>  {
>         unsigned long file_mapped = zone_page_state(zone, NR_FILE_MAPPED);
> -       unsigned long file_lru = zone_page_state(zone, NR_INACTIVE_FILE) +
> -               zone_page_state(zone, NR_ACTIVE_FILE);
> +       unsigned long file_lru = node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE) +
> +               node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE);
>
>         /*
>          * It's possible for there to be more file mapped pages than
> @@ -3716,7 +3752,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
>             zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
>                 return ZONE_RECLAIM_FULL;
>
> -       if (!zone_reclaimable(zone))
> +       if (!pgdat_reclaimable(zone->zone_pgdat))
>                 return ZONE_RECLAIM_FULL;
>
>         /*
> @@ -3795,7 +3831,7 @@ void check_move_unevictable_pages(struct page **pages, int nr_pages)
>                         zone = pagezone;
>                         spin_lock_irq(zone_lru_lock(zone));
>                 }
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
>
>                 if (!PageLRU(page) || !PageUnevictable(page))
>                         continue;
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 3345d396a99b..de0c17076270 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -936,11 +936,8 @@ const char * const vmstat_text[] = {
>         /* enum zone_stat_item countes */
>         "nr_free_pages",
>         "nr_alloc_batch",
> -       "nr_inactive_anon",
> -       "nr_active_anon",
> -       "nr_inactive_file",
> -       "nr_active_file",
> -       "nr_unevictable",
> +       "nr_zone_anon_lru",
> +       "nr_zone_file_lru",
>         "nr_mlock",
>         "nr_anon_pages",
>         "nr_mapped",
> @@ -956,12 +953,9 @@ const char * const vmstat_text[] = {
>         "nr_vmscan_write",
>         "nr_vmscan_immediate_reclaim",
>         "nr_writeback_temp",
> -       "nr_isolated_anon",
> -       "nr_isolated_file",
>         "nr_shmem",
>         "nr_dirtied",
>         "nr_written",
> -       "nr_pages_scanned",
>  #if IS_ENABLED(CONFIG_ZSMALLOC)
>         "nr_zspages",
>  #endif
> @@ -981,6 +975,16 @@ const char * const vmstat_text[] = {
>         "nr_shmem_pmdmapped",
>         "nr_free_cma",
>
> +       /* Node-based counters */
> +       "nr_inactive_anon",
> +       "nr_active_anon",
> +       "nr_inactive_file",
> +       "nr_active_file",
> +       "nr_unevictable",
> +       "nr_isolated_anon",
> +       "nr_isolated_file",
> +       "nr_pages_scanned",
> +
>         /* enum writeback_stat_item counters */
>         "nr_dirty_threshold",
>         "nr_dirty_background_threshold",
> @@ -1002,11 +1006,11 @@ const char * const vmstat_text[] = {
>         "pgmajfault",
>         "pglazyfreed",
>
> -       TEXTS_FOR_ZONES("pgrefill")
> -       TEXTS_FOR_ZONES("pgsteal_kswapd")
> -       TEXTS_FOR_ZONES("pgsteal_direct")
> -       TEXTS_FOR_ZONES("pgscan_kswapd")
> -       TEXTS_FOR_ZONES("pgscan_direct")
> +       "pgrefill",
> +       "pgsteal_kswapd",
> +       "pgsteal_direct",
> +       "pgscan_kswapd",
> +       "pgscan_direct",
>         "pgscan_direct_throttle",
>
>  #ifdef CONFIG_NUMA
> @@ -1434,7 +1438,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>                    "\n        min      %lu"
>                    "\n        low      %lu"
>                    "\n        high     %lu"
> -                  "\n        scanned  %lu"
> +                  "\n   node_scanned  %lu"
>                    "\n        spanned  %lu"
>                    "\n        present  %lu"
>                    "\n        managed  %lu",
> @@ -1442,13 +1446,13 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>                    min_wmark_pages(zone),
>                    low_wmark_pages(zone),
>                    high_wmark_pages(zone),
> -                  zone_page_state(zone, NR_PAGES_SCANNED),
> +                  node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED),
>                    zone->spanned_pages,
>                    zone->present_pages,
>                    zone->managed_pages);
>
>         for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
> -               seq_printf(m, "\n    %-12s %lu", vmstat_text[i],
> +               seq_printf(m, "\n      %-12s %lu", vmstat_text[i],
>                                 zone_page_state(zone, i));
>
>         seq_printf(m,
> @@ -1478,12 +1482,12 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>  #endif
>         }
>         seq_printf(m,
> -                  "\n  all_unreclaimable: %u"
> -                  "\n  start_pfn:         %lu"
> -                  "\n  inactive_ratio:    %u",
> -                  !zone_reclaimable(zone),
> +                  "\n  node_unreclaimable:  %u"
> +                  "\n  start_pfn:           %lu"
> +                  "\n  node_inactive_ratio: %u",
> +                  !pgdat_reclaimable(zone->zone_pgdat),
>                    zone->zone_start_pfn,
> -                  zone->inactive_ratio);
> +                  zone->zone_pgdat->inactive_ratio);
>         seq_putc(m, '\n');
>  }
>
> @@ -1574,7 +1578,6 @@ static int vmstat_show(struct seq_file *m, void *arg)
>  {
>         unsigned long *l = arg;
>         unsigned long off = l - (unsigned long *)m->private;
> -
>         seq_printf(m, "%s %lu\n", vmstat_text[off], *l);
>         return 0;
>  }
> diff --git a/mm/workingset.c b/mm/workingset.c
> index ba972ac2dfdd..ebe14445809a 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -355,8 +355,8 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
>                 pages = mem_cgroup_node_nr_lru_pages(sc->memcg, sc->nid,
>                                                      LRU_ALL_FILE);
>         } else {
> -               pages = sum_zone_node_page_state(sc->nid, NR_ACTIVE_FILE) +
> -                       sum_zone_node_page_state(sc->nid, NR_INACTIVE_FILE);
> +               pages = node_page_state(NODE_DATA(sc->nid), NR_ACTIVE_FILE) +
> +                       node_page_state(NODE_DATA(sc->nid), NR_INACTIVE_FILE);
>         }
>
>         /*
> --
> 2.6.4
>

>
> -       x += global_page_state(NR_INACTIVE_FILE);
> -       x += global_page_state(NR_ACTIVE_FILE);
> +       x += global_node_page_state(NR_INACTIVE_FILE);
> +       x += global_node_page_state(NR_ACTIVE_FILE);
>
>         if (!vm_highmem_is_dirtyable)
>                 x -= highmem_dirtyable_memory(x);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 48b5414009ac..b84b85ae54ff 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1090,9 +1090,9 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>
>         spin_lock(&zone->lock);
>         isolated_pageblocks = has_isolate_pageblock(zone);
> -       nr_scanned = zone_page_state(zone, NR_PAGES_SCANNED);
> +       nr_scanned = node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED);
>         if (nr_scanned)
> -               __mod_zone_page_state(zone, NR_PAGES_SCANNED, -nr_scanned);
> +               __mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);
>
>         while (count) {
>                 struct page *page;
> @@ -1147,9 +1147,9 @@ static void free_one_page(struct zone *zone,
>  {
>         unsigned long nr_scanned;
>         spin_lock(&zone->lock);
> -       nr_scanned = zone_page_state(zone, NR_PAGES_SCANNED);
> +       nr_scanned = node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED);
>         if (nr_scanned)
> -               __mod_zone_page_state(zone, NR_PAGES_SCANNED, -nr_scanned);
> +               __mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);
>
>         if (unlikely(has_isolate_pageblock(zone) ||
>                 is_migrate_isolate(migratetype))) {
> @@ -4331,6 +4331,7 @@ void show_free_areas(unsigned int filter)
>         unsigned long free_pcp = 0;
>         int cpu;
>         struct zone *zone;
> +       pg_data_t *pgdat;
>
>         for_each_populated_zone(zone) {
>                 if (skip_free_areas_node(filter, zone_to_nid(zone)))
> @@ -4349,13 +4350,13 @@ void show_free_areas(unsigned int filter)
>                 " anon_thp: %lu shmem_thp: %lu shmem_pmdmapped: %lu\n"
>  #endif
>                 " free:%lu free_pcp:%lu free_cma:%lu\n",
> -               global_page_state(NR_ACTIVE_ANON),
> -               global_page_state(NR_INACTIVE_ANON),
> -               global_page_state(NR_ISOLATED_ANON),
> -               global_page_state(NR_ACTIVE_FILE),
> -               global_page_state(NR_INACTIVE_FILE),
> -               global_page_state(NR_ISOLATED_FILE),
> -               global_page_state(NR_UNEVICTABLE),
> +               global_node_page_state(NR_ACTIVE_ANON),
> +               global_node_page_state(NR_INACTIVE_ANON),
> +               global_node_page_state(NR_ISOLATED_ANON),
> +               global_node_page_state(NR_ACTIVE_FILE),
> +               global_node_page_state(NR_INACTIVE_FILE),
> +               global_node_page_state(NR_ISOLATED_FILE),
> +               global_node_page_state(NR_UNEVICTABLE),
>                 global_page_state(NR_FILE_DIRTY),
>                 global_page_state(NR_WRITEBACK),
>                 global_page_state(NR_UNSTABLE_NFS),
> @@ -4374,6 +4375,28 @@ void show_free_areas(unsigned int filter)
>                 free_pcp,
>                 global_page_state(NR_FREE_CMA_PAGES));
>
> +       for_each_online_pgdat(pgdat) {
> +               printk("Node %d"
> +                       " active_anon:%lukB"
> +                       " inactive_anon:%lukB"
> +                       " active_file:%lukB"
> +                       " inactive_file:%lukB"
> +                       " unevictable:%lukB"
> +                       " isolated(anon):%lukB"
> +                       " isolated(file):%lukB"
> +                       " all_unreclaimable? %s"
> +                       "\n",
> +                       pgdat->node_id,
> +                       K(node_page_state(pgdat, NR_ACTIVE_ANON)),
> +                       K(node_page_state(pgdat, NR_INACTIVE_ANON)),
> +                       K(node_page_state(pgdat, NR_ACTIVE_FILE)),
> +                       K(node_page_state(pgdat, NR_INACTIVE_FILE)),
> +                       K(node_page_state(pgdat, NR_UNEVICTABLE)),
> +                       K(node_page_state(pgdat, NR_ISOLATED_ANON)),
> +                       K(node_page_state(pgdat, NR_ISOLATED_FILE)),
> +                       !pgdat_reclaimable(pgdat) ? "yes" : "no");
> +       }
> +
>         for_each_populated_zone(zone) {
>                 int i;
>
> @@ -4390,13 +4413,6 @@ void show_free_areas(unsigned int filter)
>                         " min:%lukB"
>                         " low:%lukB"
>                         " high:%lukB"
> -                       " active_anon:%lukB"
> -                       " inactive_anon:%lukB"
> -                       " active_file:%lukB"
> -                       " inactive_file:%lukB"
> -                       " unevictable:%lukB"
> -                       " isolated(anon):%lukB"
> -                       " isolated(file):%lukB"
>                         " present:%lukB"
>                         " managed:%lukB"
>                         " mlocked:%lukB"
> @@ -4419,21 +4435,13 @@ void show_free_areas(unsigned int filter)
>                         " local_pcp:%ukB"
>                         " free_cma:%lukB"
>                         " writeback_tmp:%lukB"
> -                       " pages_scanned:%lu"
> -                       " all_unreclaimable? %s"
> +                       " node_pages_scanned:%lu"
>                         "\n",
>                         zone->name,
>                         K(zone_page_state(zone, NR_FREE_PAGES)),
>                         K(min_wmark_pages(zone)),
>                         K(low_wmark_pages(zone)),
>                         K(high_wmark_pages(zone)),
> -                       K(zone_page_state(zone, NR_ACTIVE_ANON)),
> -                       K(zone_page_state(zone, NR_INACTIVE_ANON)),
> -                       K(zone_page_state(zone, NR_ACTIVE_FILE)),
> -                       K(zone_page_state(zone, NR_INACTIVE_FILE)),
> -                       K(zone_page_state(zone, NR_UNEVICTABLE)),
> -                       K(zone_page_state(zone, NR_ISOLATED_ANON)),
> -                       K(zone_page_state(zone, NR_ISOLATED_FILE)),
>                         K(zone->present_pages),
>                         K(zone->managed_pages),
>                         K(zone_page_state(zone, NR_MLOCK)),
> @@ -4458,9 +4466,7 @@ void show_free_areas(unsigned int filter)
>                         K(this_cpu_read(zone->pageset->pcp.count)),
>                         K(zone_page_state(zone, NR_FREE_CMA_PAGES)),
>                         K(zone_page_state(zone, NR_WRITEBACK_TEMP)),
> -                       K(zone_page_state(zone, NR_PAGES_SCANNED)),
> -                       (!zone_reclaimable(zone) ? "yes" : "no")
> -                       );
> +                       K(node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED)));
>                 printk("lowmem_reserve[]:");
>                 for (i = 0; i < MAX_NR_ZONES; i++)
>                         printk(" %ld", zone->lowmem_reserve[i]);
> @@ -6010,7 +6016,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
>                 /* For bootup, initialized properly in watermark setup */
>                 mod_zone_page_state(zone, NR_ALLOC_BATCH, zone->managed_pages);
>
> -               lruvec_init(&zone->lruvec);
> +               lruvec_init(zone_lruvec(zone));
>                 if (!size)
>                         continue;
>
> diff --git a/mm/swap.c b/mm/swap.c
> index bf37e5cfae81..77af473635fe 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -63,7 +63,7 @@ static void __page_cache_release(struct page *page)
>                 unsigned long flags;
>
>                 spin_lock_irqsave(zone_lru_lock(zone), flags);
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
>                 VM_BUG_ON_PAGE(!PageLRU(page), page);
>                 __ClearPageLRU(page);
>                 del_page_from_lru_list(page, lruvec, page_off_lru(page));
> @@ -194,7 +194,7 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
>                         spin_lock_irqsave(zone_lru_lock(zone), flags);
>                 }
>
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
>                 (*move_fn)(page, lruvec, arg);
>         }
>         if (zone)
> @@ -319,7 +319,7 @@ void activate_page(struct page *page)
>
>         page = compound_head(page);
>         spin_lock_irq(zone_lru_lock(zone));
> -       __activate_page(page, mem_cgroup_page_lruvec(page, zone), NULL);
> +       __activate_page(page, mem_cgroup_page_lruvec(page, zone->zone_pgdat), NULL);
>         spin_unlock_irq(zone_lru_lock(zone));
>  }
>  #endif
> @@ -445,16 +445,16 @@ void lru_cache_add(struct page *page)
>   */
>  void add_page_to_unevictable_list(struct page *page)
>  {
> -       struct zone *zone = page_zone(page);
> +       struct pglist_data *pgdat = page_pgdat(page);
>         struct lruvec *lruvec;
>
> -       spin_lock_irq(zone_lru_lock(zone));
> -       lruvec = mem_cgroup_page_lruvec(page, zone);
> +       spin_lock_irq(&pgdat->lru_lock);
> +       lruvec = mem_cgroup_page_lruvec(page, pgdat);
>         ClearPageActive(page);
>         SetPageUnevictable(page);
>         SetPageLRU(page);
>         add_page_to_lru_list(page, lruvec, LRU_UNEVICTABLE);
> -       spin_unlock_irq(zone_lru_lock(zone));
> +       spin_unlock_irq(&pgdat->lru_lock);
>  }
>
>  /**
> @@ -730,7 +730,7 @@ void release_pages(struct page **pages, int nr, bool cold)
>  {
>         int i;
>         LIST_HEAD(pages_to_free);
> -       struct zone *zone = NULL;
> +       struct pglist_data *locked_pgdat = NULL;
>         struct lruvec *lruvec;
>         unsigned long uninitialized_var(flags);
>         unsigned int uninitialized_var(lock_batch);
> @@ -741,11 +741,11 @@ void release_pages(struct page **pages, int nr, bool cold)
>                 /*
>                  * Make sure the IRQ-safe lock-holding time does not get
>                  * excessive with a continuous string of pages from the
> -                * same zone. The lock is held only if zone != NULL.
> +                * same pgdat. The lock is held only if pgdat != NULL.
>                  */
> -               if (zone && ++lock_batch == SWAP_CLUSTER_MAX) {
> -                       spin_unlock_irqrestore(zone_lru_lock(zone), flags);
> -                       zone = NULL;
> +               if (locked_pgdat && ++lock_batch == SWAP_CLUSTER_MAX) {
> +                       spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
> +                       locked_pgdat = NULL;
>                 }
>
>                 if (is_huge_zero_page(page)) {
> @@ -758,27 +758,27 @@ void release_pages(struct page **pages, int nr, bool cold)
>                         continue;
>
>                 if (PageCompound(page)) {
> -                       if (zone) {
> -                               spin_unlock_irqrestore(zone_lru_lock(zone), flags);
> -                               zone = NULL;
> +                       if (locked_pgdat) {
> +                               spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
> +                               locked_pgdat = NULL;
>                         }
>                         __put_compound_page(page);
>                         continue;
>                 }
>
>                 if (PageLRU(page)) {
> -                       struct zone *pagezone = page_zone(page);
> +                       struct pglist_data *pgdat = page_pgdat(page);
>
> -                       if (pagezone != zone) {
> -                               if (zone)
> -                                       spin_unlock_irqrestore(zone_lru_lock(zone),
> +                       if (pgdat != locked_pgdat) {
> +                               if (locked_pgdat)
> +                                       spin_unlock_irqrestore(&locked_pgdat->lru_lock,
>                                                                         flags);
>                                 lock_batch = 0;
> -                               zone = pagezone;
> -                               spin_lock_irqsave(zone_lru_lock(zone), flags);
> +                               locked_pgdat = pgdat;
> +                               spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
>                         }
>
> -                       lruvec = mem_cgroup_page_lruvec(page, zone);
> +                       lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
>                         VM_BUG_ON_PAGE(!PageLRU(page), page);
>                         __ClearPageLRU(page);
>                         del_page_from_lru_list(page, lruvec, page_off_lru(page));
> @@ -789,8 +789,8 @@ void release_pages(struct page **pages, int nr, bool cold)
>
>                 list_add(&page->lru, &pages_to_free);
>         }
> -       if (zone)
> -               spin_unlock_irqrestore(zone_lru_lock(zone), flags);
> +       if (locked_pgdat)
> +               spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
>
>         mem_cgroup_uncharge_list(&pages_to_free);
>         free_hot_cold_page_list(&pages_to_free, cold);
> @@ -826,7 +826,7 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
>         VM_BUG_ON_PAGE(PageCompound(page_tail), page);
>         VM_BUG_ON_PAGE(PageLRU(page_tail), page);
>         VM_BUG_ON(NR_CPUS != 1 &&
> -                 !spin_is_locked(zone_lru_lock(lruvec_zone(lruvec))));
> +                 !spin_is_locked(&lruvec_pgdat(lruvec)->lru_lock));
>
>         if (!list)
>                 SetPageLRU(page_tail);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index e7ffcd259cc4..86a523a761c9 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -191,26 +191,42 @@ static bool sane_reclaim(struct scan_control *sc)
>  }
>  #endif
>
> +/*
> + * This misses isolated pages which are not accounted for to save counters.
> + * As the data only determines if reclaim or compaction continues, it is
> + * not expected that isolated pages will be a dominating factor.
> + */
>  unsigned long zone_reclaimable_pages(struct zone *zone)
>  {
>         unsigned long nr;
>
> -       nr = zone_page_state_snapshot(zone, NR_ACTIVE_FILE) +
> -            zone_page_state_snapshot(zone, NR_INACTIVE_FILE) +
> -            zone_page_state_snapshot(zone, NR_ISOLATED_FILE);
> +       nr = zone_page_state_snapshot(zone, NR_ZONE_LRU_FILE);
> +       if (get_nr_swap_pages() > 0)
> +               nr += zone_page_state_snapshot(zone, NR_ZONE_LRU_ANON);
> +
> +       return nr;
> +}
> +
> +unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat)
> +{
> +       unsigned long nr;
> +
> +       nr = node_page_state_snapshot(pgdat, NR_ACTIVE_FILE) +
> +            node_page_state_snapshot(pgdat, NR_INACTIVE_FILE) +
> +            node_page_state_snapshot(pgdat, NR_ISOLATED_FILE);
>
>         if (get_nr_swap_pages() > 0)
> -               nr += zone_page_state_snapshot(zone, NR_ACTIVE_ANON) +
> -                     zone_page_state_snapshot(zone, NR_INACTIVE_ANON) +
> -                     zone_page_state_snapshot(zone, NR_ISOLATED_ANON);
> +               nr += node_page_state_snapshot(pgdat, NR_ACTIVE_ANON) +
> +                     node_page_state_snapshot(pgdat, NR_INACTIVE_ANON) +
> +                     node_page_state_snapshot(pgdat, NR_ISOLATED_ANON);
>
>         return nr;
>  }
>
> -bool zone_reclaimable(struct zone *zone)
> +bool pgdat_reclaimable(struct pglist_data *pgdat)
>  {
> -       return zone_page_state_snapshot(zone, NR_PAGES_SCANNED) <
> -               zone_reclaimable_pages(zone) * 6;
> +       return node_page_state_snapshot(pgdat, NR_PAGES_SCANNED) <
> +               pgdat_reclaimable_pages(pgdat) * 6;
>  }
>
>  unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru)
> @@ -218,7 +234,7 @@ unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru)
>         if (!mem_cgroup_disabled())
>                 return mem_cgroup_get_lru_size(lruvec, lru);
>
> -       return zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru);
> +       return node_page_state(lruvec_pgdat(lruvec), NR_LRU_BASE + lru);
>  }
>
>  /*
> @@ -877,7 +893,7 @@ static void page_check_dirty_writeback(struct page *page,
>   * shrink_page_list() returns the number of reclaimed pages
>   */
>  static unsigned long shrink_page_list(struct list_head *page_list,
> -                                     struct zone *zone,
> +                                     struct pglist_data *pgdat,
>                                       struct scan_control *sc,
>                                       enum ttu_flags ttu_flags,
>                                       unsigned long *ret_nr_dirty,
> @@ -917,7 +933,6 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>                         goto keep;
>
>                 VM_BUG_ON_PAGE(PageActive(page), page);
> -               VM_BUG_ON_PAGE(page_zone(page) != zone, page);
>
>                 sc->nr_scanned++;
>
> @@ -996,7 +1011,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>                         /* Case 1 above */
>                         if (current_is_kswapd() &&
>                             PageReclaim(page) &&
> -                           test_bit(ZONE_WRITEBACK, &zone->flags)) {
> +                           test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {
>                                 nr_immediate++;
>                                 goto keep_locked;
>
> @@ -1092,7 +1107,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>                          */
>                         if (page_is_file_cache(page) &&
>                                         (!current_is_kswapd() ||
> -                                        !test_bit(ZONE_DIRTY, &zone->flags))) {
> +                                        !test_bit(PGDAT_DIRTY, &pgdat->flags))) {
>                                 /*
>                                  * Immediately reclaim when written back.
>                                  * Similar in principal to deactivate_page()
> @@ -1266,11 +1281,11 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
>                 }
>         }
>
> -       ret = shrink_page_list(&clean_pages, zone, &sc,
> +       ret = shrink_page_list(&clean_pages, zone->zone_pgdat, &sc,
>                         TTU_UNMAP|TTU_IGNORE_ACCESS,
>                         &dummy1, &dummy2, &dummy3, &dummy4, &dummy5, true);
>         list_splice(&clean_pages, page_list);
> -       mod_zone_page_state(zone, NR_ISOLATED_FILE, -ret);
> +       mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE, -ret);
>         return ret;
>  }
>
> @@ -1375,7 +1390,8 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>  {
>         struct list_head *src = &lruvec->lists[lru];
>         unsigned long nr_taken = 0;
> -       unsigned long scan;
> +       unsigned long nr_zone_taken[MAX_NR_ZONES] = { 0 };
> +       unsigned long scan, nr_pages;
>
>         for (scan = 0; scan < nr_to_scan && nr_taken < nr_to_scan &&
>                                         !list_empty(src); scan++) {
> @@ -1388,7 +1404,9 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>
>                 switch (__isolate_lru_page(page, mode)) {
>                 case 0:
> -                       nr_taken += hpage_nr_pages(page);
> +                       nr_pages = hpage_nr_pages(page);
> +                       nr_taken += nr_pages;
> +                       nr_zone_taken[page_zonenum(page)] += nr_pages;
>                         list_move(&page->lru, dst);
>                         break;
>
> @@ -1405,6 +1423,13 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>         *nr_scanned = scan;
>         trace_mm_vmscan_lru_isolate(sc->order, nr_to_scan, scan,
>                                     nr_taken, mode, is_file_lru(lru));
> +       for (scan = 0; scan < MAX_NR_ZONES; scan++) {
> +               nr_pages = nr_zone_taken[scan];
> +               if (!nr_pages)
> +                       continue;
> +
> +               update_lru_size(lruvec, lru, scan, -nr_pages);
> +       }
>         return nr_taken;
>  }
>
> @@ -1445,7 +1470,7 @@ int isolate_lru_page(struct page *page)
>                 struct lruvec *lruvec;
>
>                 spin_lock_irq(zone_lru_lock(zone));
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
>                 if (PageLRU(page)) {
>                         int lru = page_lru(page);
>                         get_page(page);
> @@ -1465,7 +1490,7 @@ int isolate_lru_page(struct page *page)
>   * the LRU list will go small and be scanned faster than necessary, leading to
>   * unnecessary swapping, thrashing and OOM.
>   */
> -static int too_many_isolated(struct zone *zone, int file,
> +static int too_many_isolated(struct pglist_data *pgdat, int file,
>                 struct scan_control *sc)
>  {
>         unsigned long inactive, isolated;
> @@ -1477,11 +1502,11 @@ static int too_many_isolated(struct zone *zone, int file,
>                 return 0;
>
>         if (file) {
> -               inactive = zone_page_state(zone, NR_INACTIVE_FILE);
> -               isolated = zone_page_state(zone, NR_ISOLATED_FILE);
> +               inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
> +               isolated = node_page_state(pgdat, NR_ISOLATED_FILE);
>         } else {
> -               inactive = zone_page_state(zone, NR_INACTIVE_ANON);
> -               isolated = zone_page_state(zone, NR_ISOLATED_ANON);
> +               inactive = node_page_state(pgdat, NR_INACTIVE_ANON);
> +               isolated = node_page_state(pgdat, NR_ISOLATED_ANON);
>         }
>
>         /*
> @@ -1499,7 +1524,7 @@ static noinline_for_stack void
>  putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
>  {
>         struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> -       struct zone *zone = lruvec_zone(lruvec);
> +       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>         LIST_HEAD(pages_to_free);
>
>         /*
> @@ -1512,13 +1537,13 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
>                 VM_BUG_ON_PAGE(PageLRU(page), page);
>                 list_del(&page->lru);
>                 if (unlikely(!page_evictable(page))) {
> -                       spin_unlock_irq(zone_lru_lock(zone));
> +                       spin_unlock_irq(&pgdat->lru_lock);
>                         putback_lru_page(page);
> -                       spin_lock_irq(zone_lru_lock(zone));
> +                       spin_lock_irq(&pgdat->lru_lock);
>                         continue;
>                 }
>
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, pgdat);
>
>                 SetPageLRU(page);
>                 lru = page_lru(page);
> @@ -1535,10 +1560,10 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
>                         del_page_from_lru_list(page, lruvec, lru);
>
>                         if (unlikely(PageCompound(page))) {
> -                               spin_unlock_irq(zone_lru_lock(zone));
> +                               spin_unlock_irq(&pgdat->lru_lock);
>                                 mem_cgroup_uncharge(page);
>                                 (*get_compound_page_dtor(page))(page);
> -                               spin_lock_irq(zone_lru_lock(zone));
> +                               spin_lock_irq(&pgdat->lru_lock);
>                         } else
>                                 list_add(&page->lru, &pages_to_free);
>                 }
> @@ -1582,10 +1607,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>         unsigned long nr_immediate = 0;
>         isolate_mode_t isolate_mode = 0;
>         int file = is_file_lru(lru);
> -       struct zone *zone = lruvec_zone(lruvec);
> +       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>         struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
>
> -       while (unlikely(too_many_isolated(zone, file, sc))) {
> +       while (unlikely(too_many_isolated(pgdat, file, sc))) {
>                 congestion_wait(BLK_RW_ASYNC, HZ/10);
>
>                 /* We are about to die and free our memory. Return now. */
> @@ -1600,48 +1625,45 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>         if (!sc->may_writepage)
>                 isolate_mode |= ISOLATE_CLEAN;
>
> -       spin_lock_irq(zone_lru_lock(zone));
> +       spin_lock_irq(&pgdat->lru_lock);
>
>         nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
>                                      &nr_scanned, sc, isolate_mode, lru);
>
> -       update_lru_size(lruvec, lru, -nr_taken);
> -       __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
> +       __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
>         reclaim_stat->recent_scanned[file] += nr_taken;
>
>         if (global_reclaim(sc)) {
> -               __mod_zone_page_state(zone, NR_PAGES_SCANNED, nr_scanned);
> +               __mod_node_page_state(pgdat, NR_PAGES_SCANNED, nr_scanned);
>                 if (current_is_kswapd())
> -                       __count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scanned);
> +                       __count_vm_events(PGSCAN_KSWAPD, nr_scanned);
>                 else
> -                       __count_zone_vm_events(PGSCAN_DIRECT, zone, nr_scanned);
> +                       __count_vm_events(PGSCAN_DIRECT, nr_scanned);
>         }
> -       spin_unlock_irq(zone_lru_lock(zone));
> +       spin_unlock_irq(&pgdat->lru_lock);
>
>         if (nr_taken == 0)
>                 return 0;
>
> -       nr_reclaimed = shrink_page_list(&page_list, zone, sc, TTU_UNMAP,
> +       nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, TTU_UNMAP,
>                                 &nr_dirty, &nr_unqueued_dirty, &nr_congested,
>                                 &nr_writeback, &nr_immediate,
>                                 false);
>
> -       spin_lock_irq(zone_lru_lock(zone));
> +       spin_lock_irq(&pgdat->lru_lock);
>
>         if (global_reclaim(sc)) {
>                 if (current_is_kswapd())
> -                       __count_zone_vm_events(PGSTEAL_KSWAPD, zone,
> -                                              nr_reclaimed);
> +                       __count_vm_events(PGSTEAL_KSWAPD, nr_reclaimed);
>                 else
> -                       __count_zone_vm_events(PGSTEAL_DIRECT, zone,
> -                                              nr_reclaimed);
> +                       __count_vm_events(PGSTEAL_DIRECT, nr_reclaimed);
>         }
>
>         putback_inactive_pages(lruvec, &page_list);
>
> -       __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
> +       __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
>
> -       spin_unlock_irq(zone_lru_lock(zone));
> +       spin_unlock_irq(&pgdat->lru_lock);
>
>         mem_cgroup_uncharge_list(&page_list);
>         free_hot_cold_page_list(&page_list, true);
> @@ -1661,7 +1683,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>          * are encountered in the nr_immediate check below.
>          */
>         if (nr_writeback && nr_writeback == nr_taken)
> -               set_bit(ZONE_WRITEBACK, &zone->flags);
> +               set_bit(PGDAT_WRITEBACK, &pgdat->flags);
>
>         /*
>          * Legacy memcg will stall in page writeback so avoid forcibly
> @@ -1673,16 +1695,16 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>                  * backed by a congested BDI and wait_iff_congested will stall.
>                  */
>                 if (nr_dirty && nr_dirty == nr_congested)
> -                       set_bit(ZONE_CONGESTED, &zone->flags);
> +                       set_bit(PGDAT_CONGESTED, &pgdat->flags);
>
>                 /*
>                  * If dirty pages are scanned that are not queued for IO, it
>                  * implies that flushers are not keeping up. In this case, flag
> -                * the zone ZONE_DIRTY and kswapd will start writing pages from
> +                * the pgdat PGDAT_DIRTY and kswapd will start writing pages from
>                  * reclaim context.
>                  */
>                 if (nr_unqueued_dirty == nr_taken)
> -                       set_bit(ZONE_DIRTY, &zone->flags);
> +                       set_bit(PGDAT_DIRTY, &pgdat->flags);
>
>                 /*
>                  * If kswapd scans pages marked marked for immediate
> @@ -1701,9 +1723,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>          */
>         if (!sc->hibernation_mode && !current_is_kswapd() &&
>             current_may_throttle())
> -               wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
> +               wait_iff_congested(pgdat, BLK_RW_ASYNC, HZ/10);
>
> -       trace_mm_vmscan_lru_shrink_inactive(zone, nr_scanned, nr_reclaimed,
> +       trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
> +                       nr_scanned, nr_reclaimed,
>                         sc->priority, file);
>         return nr_reclaimed;
>  }
> @@ -1731,20 +1754,20 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
>                                      struct list_head *pages_to_free,
>                                      enum lru_list lru)
>  {
> -       struct zone *zone = lruvec_zone(lruvec);
> +       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>         unsigned long pgmoved = 0;
>         struct page *page;
>         int nr_pages;
>
>         while (!list_empty(list)) {
>                 page = lru_to_page(list);
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, pgdat);
>
>                 VM_BUG_ON_PAGE(PageLRU(page), page);
>                 SetPageLRU(page);
>
>                 nr_pages = hpage_nr_pages(page);
> -               update_lru_size(lruvec, lru, nr_pages);
> +               update_lru_size(lruvec, lru, page_zonenum(page), nr_pages);
>                 list_move(&page->lru, &lruvec->lists[lru]);
>                 pgmoved += nr_pages;
>
> @@ -1754,10 +1777,10 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
>                         del_page_from_lru_list(page, lruvec, lru);
>
>                         if (unlikely(PageCompound(page))) {
> -                               spin_unlock_irq(zone_lru_lock(zone));
> +                               spin_unlock_irq(&pgdat->lru_lock);
>                                 mem_cgroup_uncharge(page);
>                                 (*get_compound_page_dtor(page))(page);
> -                               spin_lock_irq(zone_lru_lock(zone));
> +                               spin_lock_irq(&pgdat->lru_lock);
>                         } else
>                                 list_add(&page->lru, pages_to_free);
>                 }
> @@ -1783,7 +1806,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
>         unsigned long nr_rotated = 0;
>         isolate_mode_t isolate_mode = 0;
>         int file = is_file_lru(lru);
> -       struct zone *zone = lruvec_zone(lruvec);
> +       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>
>         lru_add_drain();
>
> @@ -1792,20 +1815,19 @@ static void shrink_active_list(unsigned long nr_to_scan,
>         if (!sc->may_writepage)
>                 isolate_mode |= ISOLATE_CLEAN;
>
> -       spin_lock_irq(zone_lru_lock(zone));
> +       spin_lock_irq(&pgdat->lru_lock);
>
>         nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
>                                      &nr_scanned, sc, isolate_mode, lru);
>
> -       update_lru_size(lruvec, lru, -nr_taken);
> -       __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
> +       __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
>         reclaim_stat->recent_scanned[file] += nr_taken;
>
>         if (global_reclaim(sc))
> -               __mod_zone_page_state(zone, NR_PAGES_SCANNED, nr_scanned);
> -       __count_zone_vm_events(PGREFILL, zone, nr_scanned);
> +               __mod_node_page_state(pgdat, NR_PAGES_SCANNED, nr_scanned);
> +       __count_vm_events(PGREFILL, nr_scanned);
>
> -       spin_unlock_irq(zone_lru_lock(zone));
> +       spin_unlock_irq(&pgdat->lru_lock);
>
>         while (!list_empty(&l_hold)) {
>                 cond_resched();
> @@ -1850,7 +1872,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
>         /*
>          * Move pages back to the lru list.
>          */
> -       spin_lock_irq(zone_lru_lock(zone));
> +       spin_lock_irq(&pgdat->lru_lock);
>         /*
>          * Count referenced pages from currently used mappings as rotated,
>          * even though only some of them are actually re-activated.  This
> @@ -1861,8 +1883,8 @@ static void shrink_active_list(unsigned long nr_to_scan,
>
>         move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
>         move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
> -       __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
> -       spin_unlock_irq(zone_lru_lock(zone));
> +       __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
> +       spin_unlock_irq(&pgdat->lru_lock);
>
>         mem_cgroup_uncharge_list(&l_hold);
>         free_hot_cold_page_list(&l_hold, true);
> @@ -1956,7 +1978,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>         struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
>         u64 fraction[2];
>         u64 denominator = 0;    /* gcc */
> -       struct zone *zone = lruvec_zone(lruvec);
> +       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>         unsigned long anon_prio, file_prio;
>         enum scan_balance scan_balance;
>         unsigned long anon, file;
> @@ -1977,7 +1999,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>          * well.
>          */
>         if (current_is_kswapd()) {
> -               if (!zone_reclaimable(zone))
> +               if (!pgdat_reclaimable(pgdat))
>                         force_scan = true;
>                 if (!mem_cgroup_online(memcg))
>                         force_scan = true;
> @@ -2023,14 +2045,24 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>          * anon pages.  Try to detect this based on file LRU size.
>          */
>         if (global_reclaim(sc)) {
> -               unsigned long zonefile;
> -               unsigned long zonefree;
> +               unsigned long pgdatfile;
> +               unsigned long pgdatfree;
> +               int z;
> +               unsigned long total_high_wmark = 0;
>
> -               zonefree = zone_page_state(zone, NR_FREE_PAGES);
> -               zonefile = zone_page_state(zone, NR_ACTIVE_FILE) +
> -                          zone_page_state(zone, NR_INACTIVE_FILE);
> +               pgdatfree = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
> +               pgdatfile = node_page_state(pgdat, NR_ACTIVE_FILE) +
> +                          node_page_state(pgdat, NR_INACTIVE_FILE);
> +
> +               for (z = 0; z < MAX_NR_ZONES; z++) {
> +                       struct zone *zone = &pgdat->node_zones[z];
> +                       if (!populated_zone(zone))
> +                               continue;
> +
> +                       total_high_wmark += high_wmark_pages(zone);
> +               }
>
> -               if (unlikely(zonefile + zonefree <= high_wmark_pages(zone))) {
> +               if (unlikely(pgdatfile + pgdatfree <= total_high_wmark)) {
>                         scan_balance = SCAN_ANON;
>                         goto out;
>                 }
> @@ -2077,7 +2109,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>         file  = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE) +
>                 lruvec_lru_size(lruvec, LRU_INACTIVE_FILE);
>
> -       spin_lock_irq(zone_lru_lock(zone));
> +       spin_lock_irq(&pgdat->lru_lock);
>         if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
>                 reclaim_stat->recent_scanned[0] /= 2;
>                 reclaim_stat->recent_rotated[0] /= 2;
> @@ -2098,7 +2130,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>
>         fp = file_prio * (reclaim_stat->recent_scanned[1] + 1);
>         fp /= reclaim_stat->recent_rotated[1] + 1;
> -       spin_unlock_irq(zone_lru_lock(zone));
> +       spin_unlock_irq(&pgdat->lru_lock);
>
>         fraction[0] = ap;
>         fraction[1] = fp;
> @@ -2352,9 +2384,9 @@ static inline bool should_continue_reclaim(struct zone *zone,
>          * inactive lists are large enough, continue reclaiming
>          */
>         pages_for_compaction = (2UL << sc->order);
> -       inactive_lru_pages = zone_page_state(zone, NR_INACTIVE_FILE);
> +       inactive_lru_pages = node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE);
>         if (get_nr_swap_pages() > 0)
> -               inactive_lru_pages += zone_page_state(zone, NR_INACTIVE_ANON);
> +               inactive_lru_pages += node_page_state(zone->zone_pgdat, NR_INACTIVE_ANON);
>         if (sc->nr_reclaimed < pages_for_compaction &&
>                         inactive_lru_pages > pages_for_compaction)
>                 return true;
> @@ -2554,7 +2586,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>                                 continue;
>
>                         if (sc->priority != DEF_PRIORITY &&
> -                           !zone_reclaimable(zone))
> +                           !pgdat_reclaimable(zone->zone_pgdat))
>                                 continue;       /* Let kswapd poll it */
>
>                         /*
> @@ -2692,7 +2724,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
>         for (i = 0; i <= ZONE_NORMAL; i++) {
>                 zone = &pgdat->node_zones[i];
>                 if (!populated_zone(zone) ||
> -                   zone_reclaimable_pages(zone) == 0)
> +                   pgdat_reclaimable_pages(pgdat) == 0)
>                         continue;
>
>                 pfmemalloc_reserve += min_wmark_pages(zone);
> @@ -3000,7 +3032,7 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
>                  * DEF_PRIORITY. Effectively, it considers them balanced so
>                  * they must be considered balanced here as well!
>                  */
> -               if (!zone_reclaimable(zone)) {
> +               if (!pgdat_reclaimable(zone->zone_pgdat)) {
>                         balanced_pages += zone->managed_pages;
>                         continue;
>                 }
> @@ -3063,6 +3095,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
>  {
>         unsigned long balance_gap;
>         bool lowmem_pressure;
> +       struct pglist_data *pgdat = zone->zone_pgdat;
>
>         /* Reclaim above the high watermark. */
>         sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
> @@ -3087,7 +3120,8 @@ static bool kswapd_shrink_zone(struct zone *zone,
>
>         shrink_zone(zone, sc, zone_idx(zone) == classzone_idx);
>
> -       clear_bit(ZONE_WRITEBACK, &zone->flags);
> +       /* TODO: ANOMALY */
> +       clear_bit(PGDAT_WRITEBACK, &pgdat->flags);
>
>         /*
>          * If a zone reaches its high watermark, consider it to be no longer
> @@ -3095,10 +3129,10 @@ static bool kswapd_shrink_zone(struct zone *zone,
>          * BDIs but as pressure is relieved, speculatively avoid congestion
>          * waits.
>          */
> -       if (zone_reclaimable(zone) &&
> +       if (pgdat_reclaimable(zone->zone_pgdat) &&
>             zone_balanced(zone, sc->order, false, 0, classzone_idx)) {
> -               clear_bit(ZONE_CONGESTED, &zone->flags);
> -               clear_bit(ZONE_DIRTY, &zone->flags);
> +               clear_bit(PGDAT_CONGESTED, &pgdat->flags);
> +               clear_bit(PGDAT_DIRTY, &pgdat->flags);
>         }
>
>         return sc->nr_scanned >= sc->nr_to_reclaim;
> @@ -3157,7 +3191,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>                                 continue;
>
>                         if (sc.priority != DEF_PRIORITY &&
> -                           !zone_reclaimable(zone))
> +                           !pgdat_reclaimable(zone->zone_pgdat))
>                                 continue;
>
>                         /*
> @@ -3184,9 +3218,11 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>                                 /*
>                                  * If balanced, clear the dirty and congested
>                                  * flags
> +                                *
> +                                * TODO: ANOMALY
>                                  */
> -                               clear_bit(ZONE_CONGESTED, &zone->flags);
> -                               clear_bit(ZONE_DIRTY, &zone->flags);
> +                               clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags);
> +                               clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags);
>                         }
>                 }
>
> @@ -3216,7 +3252,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>                                 continue;
>
>                         if (sc.priority != DEF_PRIORITY &&
> -                           !zone_reclaimable(zone))
> +                           !pgdat_reclaimable(zone->zone_pgdat))
>                                 continue;
>
>                         sc.nr_scanned = 0;
> @@ -3612,8 +3648,8 @@ int sysctl_min_slab_ratio = 5;
>  static inline unsigned long zone_unmapped_file_pages(struct zone *zone)
>  {
>         unsigned long file_mapped = zone_page_state(zone, NR_FILE_MAPPED);
> -       unsigned long file_lru = zone_page_state(zone, NR_INACTIVE_FILE) +
> -               zone_page_state(zone, NR_ACTIVE_FILE);
> +       unsigned long file_lru = node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE) +
> +               node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE);
>
>         /*
>          * It's possible for there to be more file mapped pages than
> @@ -3716,7 +3752,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
>             zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
>                 return ZONE_RECLAIM_FULL;
>
> -       if (!zone_reclaimable(zone))
> +       if (!pgdat_reclaimable(zone->zone_pgdat))
>                 return ZONE_RECLAIM_FULL;
>
>         /*
> @@ -3795,7 +3831,7 @@ void check_move_unevictable_pages(struct page **pages, int nr_pages)
>                         zone = pagezone;
>                         spin_lock_irq(zone_lru_lock(zone));
>                 }
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
>
>                 if (!PageLRU(page) || !PageUnevictable(page))
>                         continue;
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 3345d396a99b..de0c17076270 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -936,11 +936,8 @@ const char * const vmstat_text[] = {
>         /* enum zone_stat_item countes */
>         "nr_free_pages",
>         "nr_alloc_batch",
> -       "nr_inactive_anon",
> -       "nr_active_anon",
> -       "nr_inactive_file",
> -       "nr_active_file",
> -       "nr_unevictable",
> +       "nr_zone_anon_lru",
> +       "nr_zone_file_lru",
>         "nr_mlock",
>         "nr_anon_pages",
>         "nr_mapped",
> @@ -956,12 +953,9 @@ const char * const vmstat_text[] = {
>         "nr_vmscan_write",
>         "nr_vmscan_immediate_reclaim",
>         "nr_writeback_temp",
> -       "nr_isolated_anon",
> -       "nr_isolated_file",
>         "nr_shmem",
>         "nr_dirtied",
>         "nr_written",
> -       "nr_pages_scanned",
>  #if IS_ENABLED(CONFIG_ZSMALLOC)
>         "nr_zspages",
>  #endif
> @@ -981,6 +975,16 @@ const char * const vmstat_text[] = {
>         "nr_shmem_pmdmapped",
>         "nr_free_cma",
>
> +       /* Node-based counters */
> +       "nr_inactive_anon",
> +       "nr_active_anon",
> +       "nr_inactive_file",
> +       "nr_active_file",
> +       "nr_unevictable",
> +       "nr_isolated_anon",
> +       "nr_isolated_file",
> +       "nr_pages_scanned",
> +
>         /* enum writeback_stat_item counters */
>         "nr_dirty_threshold",
>         "nr_dirty_background_threshold",
> @@ -1002,11 +1006,11 @@ const char * const vmstat_text[] = {
>         "pgmajfault",
>         "pglazyfreed",
>
> -       TEXTS_FOR_ZONES("pgrefill")
> -       TEXTS_FOR_ZONES("pgsteal_kswapd")
> -       TEXTS_FOR_ZONES("pgsteal_direct")
> -       TEXTS_FOR_ZONES("pgscan_kswapd")
> -       TEXTS_FOR_ZONES("pgscan_direct")
> +       "pgrefill",
> +       "pgsteal_kswapd",
> +       "pgsteal_direct",
> +       "pgscan_kswapd",
> +       "pgscan_direct",
>         "pgscan_direct_throttle",
>
>  #ifdef CONFIG_NUMA
> @@ -1434,7 +1438,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>                    "\n        min      %lu"
>                    "\n        low      %lu"
>                    "\n        high     %lu"
> -                  "\n        scanned  %lu"
> +                  "\n   node_scanned  %lu"
>                    "\n        spanned  %lu"
>                    "\n        present  %lu"
>                    "\n        managed  %lu",
> @@ -1442,13 +1446,13 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>                    min_wmark_pages(zone),
>                    low_wmark_pages(zone),
>                    high_wmark_pages(zone),
> -                  zone_page_state(zone, NR_PAGES_SCANNED),
> +                  node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED),
>                    zone->spanned_pages,
>                    zone->present_pages,
>                    zone->managed_pages);
>
>         for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
> -               seq_printf(m, "\n    %-12s %lu", vmstat_text[i],
> +               seq_printf(m, "\n      %-12s %lu", vmstat_text[i],
>                                 zone_page_state(zone, i));
>
>         seq_printf(m,
> @@ -1478,12 +1482,12 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>  #endif
>         }
>         seq_printf(m,
> -                  "\n  all_unreclaimable: %u"
> -                  "\n  start_pfn:         %lu"
> -                  "\n  inactive_ratio:    %u",
> -                  !zone_reclaimable(zone),
> +                  "\n  node_unreclaimable:  %u"
> +                  "\n  start_pfn:           %lu"
> +                  "\n  node_inactive_ratio: %u",
> +                  !pgdat_reclaimable(zone->zone_pgdat),
>                    zone->zone_start_pfn,
> -                  zone->inactive_ratio);
> +                  zone->zone_pgdat->inactive_ratio);
>         seq_putc(m, '\n');
>  }
>
> @@ -1574,7 +1578,6 @@ static int vmstat_show(struct seq_file *m, void *arg)
>  {
>         unsigned long *l = arg;
>         unsigned long off = l - (unsigned long *)m->private;
> -
>         seq_printf(m, "%s %lu\n", vmstat_text[off], *l);
>         return 0;
>  }
> diff --git a/mm/workingset.c b/mm/workingset.c
> index ba972ac2dfdd..ebe14445809a 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -355,8 +355,8 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
>                 pages = mem_cgroup_node_nr_lru_pages(sc->memcg, sc->nid,
>                                                      LRU_ALL_FILE);
>         } else {
> -               pages = sum_zone_node_page_state(sc->nid, NR_ACTIVE_FILE) +
> -                       sum_zone_node_page_state(sc->nid, NR_INACTIVE_FILE);
> +               pages = node_page_state(NODE_DATA(sc->nid), NR_ACTIVE_FILE) +
> +                       node_page_state(NODE_DATA(sc->nid), NR_INACTIVE_FILE);
>         }
>
>         /*
> --
> 2.6.4
>

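As the mm/vmscan.c hunk above shows, the per-zone reclaimable check becomes
pgdat_reclaimable(), built on a snapshot read of the per-node counters. A
snapshot read folds the per-CPU counter deltas into the node-wide total, so it
only works once the per-CPU areas are reachable. The stand-alone C model below
is only a sketch of that accounting under simplified assumptions; the struct
layout, NR_CPUS value, field names and the reduced set of stat items here are
illustrative and are not the kernel definitions.

/*
 * Stand-alone model of the per-node vmstat accounting used above
 * (simplified, hypothetical types -- not the kernel code).
 */
#include <stdio.h>

#define NR_CPUS 4

enum node_stat_item {
	NR_INACTIVE_FILE,
	NR_ACTIVE_FILE,
	NR_PAGES_SCANNED,
	NR_NODE_STAT_ITEMS
};

struct pglist_data {
	long vm_stat[NR_NODE_STAT_ITEMS];		/* node-wide totals */
	long vm_stat_diff[NR_CPUS][NR_NODE_STAT_ITEMS];	/* per-CPU deltas */
};

/* Fold the per-CPU deltas into the node-wide total (a "snapshot" read). */
static long node_page_state_snapshot(struct pglist_data *pgdat,
				     enum node_stat_item item)
{
	long x = pgdat->vm_stat[item];
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		x += pgdat->vm_stat_diff[cpu][item];
	return x < 0 ? 0 : x;
}

/* A node counts as reclaimable while scanning lags well behind the LRU size. */
static int pgdat_reclaimable(struct pglist_data *pgdat)
{
	long lru = node_page_state_snapshot(pgdat, NR_INACTIVE_FILE) +
		   node_page_state_snapshot(pgdat, NR_ACTIVE_FILE);

	return node_page_state_snapshot(pgdat, NR_PAGES_SCANNED) < lru * 6;
}

int main(void)
{
	struct pglist_data node = { .vm_stat = { 100, 50, 30 } };

	node.vm_stat_diff[1][NR_PAGES_SCANNED] = 5;
	printf("reclaimable=%d scanned=%ld\n",
	       pgdat_reclaimable(&node),
	       node_page_state_snapshot(&node, NR_PAGES_SCANNED));
	return 0;
}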

Thread overview: 220+ messages
2016-07-08  9:34 [PATCH 00/34] Move LRU page reclaim from zones to nodes v9 Mel Gorman
2016-07-08  9:34 ` Mel Gorman
2016-07-08  9:34 ` [PATCH 01/34] mm, vmstat: add infrastructure for per-node vmstats Mel Gorman
2016-07-08  9:34   ` Mel Gorman
2016-08-03 19:13   ` Reza Arbab
2016-08-03 19:13     ` Reza Arbab
2016-07-08  9:34 ` [PATCH 02/34] mm, vmscan: move lru_lock to the node Mel Gorman
2016-07-08  9:34   ` Mel Gorman
2016-07-12 11:06   ` Balbir Singh
2016-07-12 11:06     ` Balbir Singh
2016-07-12 11:18     ` Mel Gorman
2016-07-12 11:18       ` Mel Gorman
2016-07-13  5:50       ` Balbir Singh
2016-07-13  5:50         ` Balbir Singh
2016-07-13  8:39         ` Vlastimil Babka
2016-07-13  8:39           ` Vlastimil Babka
2016-07-08  9:34 ` [PATCH 03/34] mm, vmscan: move LRU lists to node Mel Gorman
2016-07-08  9:34   ` Mel Gorman
2016-08-04 20:59   ` James Hogan [this message]
2016-08-04 20:59     ` James Hogan
2016-08-04 20:59     ` James Hogan
2016-08-05  8:41     ` Mel Gorman
2016-08-05  8:41       ` Mel Gorman
2016-08-05 10:52       ` James Hogan
2016-08-05 10:52         ` James Hogan
2016-08-05 11:55         ` Mel Gorman
2016-08-05 11:55           ` Mel Gorman
2016-08-05 11:55           ` Mel Gorman
2016-08-05 12:02           ` James Hogan
2016-08-05 12:02             ` James Hogan
2016-07-08  9:34 ` [PATCH 04/34] mm, mmzone: clarify the usage of zone padding Mel Gorman
2016-07-08  9:34   ` Mel Gorman
2016-07-12 13:49   ` Johannes Weiner
2016-07-12 13:49     ` Johannes Weiner
2016-07-08  9:34 ` [PATCH 05/34] mm, vmscan: begin reclaiming pages on a per-node basis Mel Gorman
2016-07-08  9:34   ` Mel Gorman
2016-07-12 13:54   ` Johannes Weiner
2016-07-12 13:54     ` Johannes Weiner
2016-07-14  9:19   ` Vlastimil Babka
2016-07-14  9:19     ` Vlastimil Babka
2016-07-08  9:34 ` [PATCH 06/34] mm, vmscan: have kswapd only scan based on the highest requested zone Mel Gorman
2016-07-08  9:34   ` Mel Gorman
2016-07-12 14:05   ` Johannes Weiner
2016-07-12 14:05     ` Johannes Weiner
2016-07-13  8:37     ` Mel Gorman
2016-07-13  8:37       ` Mel Gorman
2016-07-08  9:34 ` [PATCH 07/34] mm, vmscan: make kswapd reclaim in terms of nodes Mel Gorman
2016-07-08  9:34   ` Mel Gorman
2016-08-29  9:38   ` Srikar Dronamraju
2016-08-29  9:38     ` Srikar Dronamraju
2016-08-30 12:07     ` Mel Gorman
2016-08-30 12:07       ` Mel Gorman
2016-08-30 14:25       ` Srikar Dronamraju
2016-08-30 14:25         ` Srikar Dronamraju
2016-08-30 15:00         ` Mel Gorman
2016-08-30 15:00           ` Mel Gorman
2016-08-31  6:09           ` Srikar Dronamraju
2016-08-31  6:09             ` Srikar Dronamraju
2016-08-31  8:49             ` Mel Gorman
2016-08-31  8:49               ` Mel Gorman
2016-08-31 11:09               ` Michal Hocko
2016-08-31 11:09                 ` Michal Hocko
2016-08-31 12:46                 ` Mel Gorman
2016-08-31 12:46                   ` Mel Gorman
2016-08-31 17:33               ` Srikar Dronamraju
2016-08-31 17:33                 ` Srikar Dronamraju
2016-07-08  9:34 ` [PATCH 08/34] mm, vmscan: remove balance gap Mel Gorman
2016-07-08  9:34   ` Mel Gorman
2016-07-12 14:06   ` Johannes Weiner
2016-07-12 14:06     ` Johannes Weiner
2016-07-08  9:34 ` [PATCH 09/34] mm, vmscan: simplify the logic deciding whether kswapd sleeps Mel Gorman
2016-07-08  9:34   ` Mel Gorman
2016-07-08  9:34 ` [PATCH 10/34] mm, vmscan: by default have direct reclaim only shrink once per node Mel Gorman
2016-07-08  9:34   ` Mel Gorman
2016-07-08  9:34 ` [PATCH 11/34] mm, vmscan: remove duplicate logic clearing node congestion and dirty state Mel Gorman
2016-07-08  9:34   ` Mel Gorman
2016-07-12 14:22   ` Johannes Weiner
2016-07-12 14:22     ` Johannes Weiner
2016-07-13  8:40     ` Mel Gorman
2016-07-13  8:40       ` Mel Gorman
2016-07-14  9:45   ` Vlastimil Babka
2016-07-14  9:45     ` Vlastimil Babka
2016-07-08  9:34 ` [PATCH 12/34] mm: vmscan: do not reclaim from kswapd if there is any eligible zone Mel Gorman
2016-07-08  9:34   ` Mel Gorman
2016-07-12 14:29   ` Johannes Weiner
2016-07-12 14:29     ` Johannes Weiner
2016-07-13  8:47     ` Mel Gorman
2016-07-13  8:47       ` Mel Gorman
2016-07-13 12:28       ` Johannes Weiner
2016-07-13 12:28         ` Johannes Weiner
2016-07-08  9:34 ` [PATCH 13/34] mm, vmscan: make shrink_node decisions more node-centric Mel Gorman
2016-07-08  9:34   ` Mel Gorman
2016-07-12 14:32   ` Johannes Weiner
2016-07-12 14:32     ` Johannes Weiner
2016-07-13  8:48     ` Mel Gorman
2016-07-13  8:48       ` Mel Gorman
2016-07-08  9:34 ` [PATCH 14/34] mm, memcg: move memcg limit enforcement from zones to nodes Mel Gorman
2016-07-08  9:34   ` Mel Gorman
2016-07-12 14:38   ` Johannes Weiner
2016-07-12 14:38     ` Johannes Weiner
2016-07-08  9:34 ` [PATCH 15/34] mm, workingset: make working set detection node-aware Mel Gorman
2016-07-08  9:34   ` Mel Gorman
2016-07-08  9:34 ` [PATCH 16/34] mm, page_alloc: consider dirtyable memory in terms of nodes Mel Gorman
2016-07-08  9:34   ` Mel Gorman
2016-07-08  9:34 ` [PATCH 17/34] mm: move page mapped accounting to the node Mel Gorman
2016-07-08  9:34   ` Mel Gorman
2016-07-12 14:42   ` Johannes Weiner
2016-07-12 14:42     ` Johannes Weiner
2016-07-08  9:34 ` [PATCH 18/34] mm: rename NR_ANON_PAGES to NR_ANON_MAPPED Mel Gorman
2016-07-08  9:34   ` Mel Gorman
2016-07-12 14:58   ` Johannes Weiner
2016-07-12 14:58     ` Johannes Weiner
2016-07-13  8:55     ` Mel Gorman
2016-07-13  8:55       ` Mel Gorman
2016-07-13 13:04       ` Johannes Weiner
2016-07-13 13:04         ` Johannes Weiner
2016-07-13 13:37         ` Mel Gorman
2016-07-13 13:37           ` Mel Gorman
2016-07-13 21:13           ` Andrew Morton
2016-07-13 21:13             ` Andrew Morton
2016-07-15 10:46             ` Mel Gorman
2016-07-15 10:46               ` Mel Gorman
2016-07-15 22:35               ` Andrew Morton
2016-07-15 22:35                 ` Andrew Morton
2016-07-18 13:34                 ` Johannes Weiner
2016-07-18 13:34                   ` Johannes Weiner
2016-07-14  1:27           ` Minchan Kim
2016-07-14  1:27             ` Minchan Kim
2016-07-08  9:34 ` [PATCH 19/34] mm: move most file-based accounting to the node Mel Gorman
2016-07-08  9:34   ` Mel Gorman
2016-07-12 15:11   ` Johannes Weiner
2016-07-12 15:11     ` Johannes Weiner
2016-07-08  9:34 ` [PATCH 20/34] mm: move vmscan writes and file write " Mel Gorman
2016-07-08  9:34   ` Mel Gorman
2016-07-12 15:15   ` Johannes Weiner
2016-07-12 15:15     ` Johannes Weiner
2016-07-08  9:34 ` [PATCH 21/34] mm, vmscan: only wakeup kswapd once per node for the requested classzone Mel Gorman
2016-07-08  9:34   ` Mel Gorman
2016-07-12 17:18   ` Johannes Weiner
2016-07-12 17:18     ` Johannes Weiner
2016-07-08  9:34 ` [PATCH 22/34] mm, page_alloc: wake kswapd based on the highest eligible zone Mel Gorman
2016-07-08  9:34   ` Mel Gorman
2016-07-12 17:24   ` Johannes Weiner
2016-07-12 17:24     ` Johannes Weiner
2016-07-14 10:05   ` Vlastimil Babka
2016-07-14 10:05     ` Vlastimil Babka
2016-07-08  9:34 ` [PATCH 23/34] mm: convert zone_reclaim to node_reclaim Mel Gorman
2016-07-08  9:34   ` Mel Gorman
2016-07-12 17:28   ` Johannes Weiner
2016-07-12 17:28     ` Johannes Weiner
2016-07-08  9:35 ` [PATCH 24/34] mm, vmscan: avoid passing in classzone_idx unnecessarily to shrink_node Mel Gorman
2016-07-08  9:35   ` Mel Gorman
2016-07-12 17:31   ` Johannes Weiner
2016-07-12 17:31     ` Johannes Weiner
2016-07-14 10:09   ` Vlastimil Babka
2016-07-14 10:09     ` Vlastimil Babka
2016-07-08  9:35 ` [PATCH 25/34] mm, vmscan: avoid passing in classzone_idx unnecessarily to compaction_ready Mel Gorman
2016-07-08  9:35   ` Mel Gorman
2016-07-12 18:01   ` Johannes Weiner
2016-07-12 18:01     ` Johannes Weiner
2016-07-14 12:12   ` Vlastimil Babka
2016-07-14 12:12     ` Vlastimil Babka
2016-07-08  9:35 ` [PATCH 26/34] mm, vmscan: avoid passing in remaining unnecessarily to prepare_kswapd_sleep Mel Gorman
2016-07-08  9:35   ` Mel Gorman
2016-07-12 18:06   ` Johannes Weiner
2016-07-12 18:06     ` Johannes Weiner
2016-07-14 12:48   ` Vlastimil Babka
2016-07-14 12:48     ` Vlastimil Babka
2016-07-08  9:35 ` [PATCH 27/34] mm, vmscan: Have kswapd reclaim from all zones if reclaiming and buffer_heads_over_limit Mel Gorman
2016-07-08  9:35   ` Mel Gorman
2016-07-12 18:10   ` Johannes Weiner
2016-07-12 18:10     ` Johannes Weiner
2016-07-14 12:54   ` Vlastimil Babka
2016-07-14 12:54     ` Vlastimil Babka
2016-07-08  9:35 ` [PATCH 28/34] mm, vmscan: add classzone information to tracepoints Mel Gorman
2016-07-08  9:35   ` Mel Gorman
2016-07-12 18:13   ` Johannes Weiner
2016-07-12 18:13     ` Johannes Weiner
2016-07-08  9:35 ` [PATCH 29/34] mm, page_alloc: remove fair zone allocation policy Mel Gorman
2016-07-08  9:35   ` Mel Gorman
2016-07-12 18:18   ` Johannes Weiner
2016-07-12 18:18     ` Johannes Weiner
2016-07-08  9:35 ` [PATCH 30/34] mm: page_alloc: cache the last node whose dirty limit is reached Mel Gorman
2016-07-08  9:35   ` Mel Gorman
2016-07-12 18:43   ` Johannes Weiner
2016-07-12 18:43     ` Johannes Weiner
2016-07-08  9:35 ` [PATCH 31/34] mm: vmstat: replace __count_zone_vm_events with a zone id equivalent Mel Gorman
2016-07-08  9:35   ` Mel Gorman
2016-07-12 19:10   ` Johannes Weiner
2016-07-12 19:10     ` Johannes Weiner
2016-07-08  9:35 ` [PATCH 32/34] mm: vmstat: account per-zone stalls and pages skipped during reclaim Mel Gorman
2016-07-08  9:35   ` Mel Gorman
2016-07-12 19:06   ` Johannes Weiner
2016-07-12 19:06     ` Johannes Weiner
2016-07-08  9:35 ` [PATCH 33/34] mm, vmstat: print node-based stats in zoneinfo file Mel Gorman
2016-07-08  9:35   ` Mel Gorman
2016-07-12 19:18   ` Johannes Weiner
2016-07-12 19:18     ` Johannes Weiner
2016-07-14 12:56   ` Vlastimil Babka
2016-07-14 12:56     ` Vlastimil Babka
2016-07-08  9:35 ` [PATCH 34/34] mm, vmstat: remove zone and node double accounting by approximating retries Mel Gorman
2016-07-08  9:35   ` Mel Gorman
2016-07-14 13:40   ` Vlastimil Babka
2016-07-14 13:40     ` Vlastimil Babka
2016-07-15  7:48     ` Mel Gorman
2016-07-15  7:48       ` Mel Gorman
2016-07-15 12:20       ` Vlastimil Babka
2016-07-15 12:20         ` Vlastimil Babka
2016-08-19 13:12 ` [PATCH 00/34] Move LRU page reclaim from zones to nodes v9 Andrea Arcangeli
2016-08-19 13:12   ` Andrea Arcangeli
2016-08-19 13:23   ` Vlastimil Babka
2016-08-19 13:23     ` Vlastimil Babka
2016-08-19 13:55     ` Andrea Arcangeli
2016-08-19 13:55       ` Andrea Arcangeli
2016-08-19 14:53   ` Mel Gorman
2016-08-19 14:53     ` Mel Gorman
2016-08-19 15:32     ` Andrea Arcangeli
2016-08-19 15:32       ` Andrea Arcangeli
2016-08-19 15:55       ` Mel Gorman
2016-08-19 15:55         ` Mel Gorman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox
  (a minimal example using mutt follows this list)

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAAG0J9_k3edxDzqpEjt2BqqZXMW4PVj7BNUBAk6TWtw3Zh_oMg@mail.gmail.com \
    --to=james.hogan@imgtec.com \
    --cc=akpm@linux-foundation.org \
    --cc=hannes@cmpxchg.org \
    --cc=iamjoonsoo.kim@lge.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-metag@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@techsingularity.net \
    --cc=minchan@kernel.org \
    --cc=riel@surriel.com \
    --cc=vbabka@suse.cz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.
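
For the mbox method (the first option above), here is a minimal sketch of one
possible workflow, assuming mutt is installed and the downloaded thread archive
has been saved as thread.mbox (the filename is only an illustration):

  # Open the saved mbox read-only in mutt.
  mutt -R -f thread.mbox
  # Inside mutt, select the message and press 'g' (group-reply) so the
  # original To/Cc recipients are kept, then quote inline rather than
  # top-posting before sending.

Any mail client that can import an mbox file works equally well; mutt is used
here only as one example.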