* [RFC][PATCH] memcg: fix swap account (26/May)[0/5]
@ 2009-05-26  3:12 ` KAMEZAWA Hiroyuki
From: KAMEZAWA Hiroyuki @ 2009-05-26  3:12 UTC
  To: linux-mm; +Cc: balbir, nishimura, hugh.dickins, hannes, linux-kernel


As Nishimura reported, there is a race in handling swap cache.

Typical cases are as follows (from Nishimura's mail):


== Type-1 ==
  If some pages of processA have been swapped out, it calls free_swap_and_cache().
  If, at the same time, processB calls read_swap_cache_async() on a swap entry
  *that is used by processA*, a race like the one below can happen.

            processA                   |           processB
  -------------------------------------+-------------------------------------
    (free_swap_and_cache())            |  (read_swap_cache_async())
                                       |    swap_duplicate()
                                       |    __set_page_locked()
                                       |    add_to_swap_cache()
      swap_entry_free() == 0           |
      find_get_page() -> found         |
      try_lock_page() -> fail & return |
                                       |    lru_cache_add_anon()
                                       |      doesn't link this page to memcg's
                                       |      LRU, because of !PageCgroupUsed.

  This type of leak can be avoided by setting /proc/sys/vm/page-cluster to 0.


== Type-2 ==
    Assume processA is exiting and its pte points to a page (!PageSwapCache),
    and processB is trying to reclaim that page.

              processA                   |           processB
    -------------------------------------+-------------------------------------
      (page_remove_rmap())               |  (shrink_page_list())
         mem_cgroup_uncharge_page()      |
            ->uncharged because it's not |
              PageSwapCache yet.         |
              So, both mem/memsw.usage   |
              are decremented.           |
                                         |    add_to_swap() -> added to swap cache.

    If this page goes through reclaim without being freed for some reason, it
    doesn't go back to memcg's LRU because of !PageCgroupUsed.
==

This patch set is an attempt to fix the above problems by fixing memcg's swap
accounting logic, but it requires some amount of change in the swap code.

Compared with my previous post (22/May)
(http://marc.info/?l=linux-mm&m=124297915418698&w=2),
I think this one is much easier to read.


[1/5] change interface of swap_duplicate()/swap_free()
    Adds the functions swapcache_prepare() and swapcache_free().

[2/5] add SWAP_HAS_CACHE flag to swap_map
    Adds a SWAP_HAS_CACHE flag to the swap_map array so that we can tell that
    "there is only a swap cache and the swap entry has no other reference"
    without calling find_get_page().

[3/5] count the number of swap-cache-only swaps
    After repeated swap-in/out, there are tons of cache-only swap entries
    (via mapped swap caches while vm_swap_full() == false).
    This patch counts the number of such entries and shows it in debug
    information (for example, sysrq-m).

[4/5] fix memcg's swap accounting
    Changes memcg's swap accounting logic to look at the number of references
    to a swap entry.

[5/5] experimental garbage collection for cache-only swaps
    Reclaims swap entries which are not used.

Patch [4/5] is for type-1; patch [5/5] is for type-2 and for the overall
sanity of swap control.

Thank you for all your help. Any comments are welcome.

Thanks,
-Kame






* [RFC][PATCH 1/5] change swap cache interfaces
@ 2009-05-26  3:14   ` KAMEZAWA Hiroyuki
From: KAMEZAWA Hiroyuki @ 2009-05-26  3:14 UTC
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, balbir, nishimura, hugh.dickins, hannes, linux-kernel

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

In a following patch, swap cache usage will be recorded in swap_map.
This patch makes the interface changes necessary for that.

Two interfaces:
  - swapcache_prepare()
  - swapcache_free()
are added for taking/releasing the swap cache's reference on existing
swap entries; the implementation itself is not changed by this patch.
While adding swapcache_free(), memcg's uncharge hook is moved into
swapcache_free(). This is better than using scattered hooks.
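
For illustration, here is a user-space sketch of the calling convention
(not kernel code; the stub bodies and the plain-integer entry type are
placeholders for the real swap machinery):

#include <stdio.h>

/* stubs standing in for the real swap code */
static int swap_duplicate(unsigned long entry) { (void)entry; return 1; }
static void swap_free(unsigned long entry) { (void)entry; }
static void mem_cgroup_uncharge_swapcache(void *page, unsigned long entry)
{
	(void)page; (void)entry;
}

/* the new pair: for now it only wraps the old calls ... */
static int swapcache_prepare(unsigned long entry)
{
	return swap_duplicate(entry);
}

/* ... and gives the memcg uncharge hook a single home */
static void swapcache_free(unsigned long entry, void *page)
{
	if (page)
		mem_cgroup_uncharge_swapcache(page, entry);
	swap_free(entry);
}

int main(void)
{
	unsigned long entry = 1;

	if (swapcache_prepare(entry))		/* take the cache's reference */
		swapcache_free(entry, NULL);	/* release it again */
	printf("prepare/free paired\n");
	return 0;
}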

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/swap.h |    7 +++++++
 mm/swap_state.c      |   11 +++++------
 mm/swapfile.c        |   19 +++++++++++++++++++
 mm/vmscan.c          |    3 +--
 4 files changed, 32 insertions(+), 8 deletions(-)

Index: new-trial-swapcount/include/linux/swap.h
===================================================================
--- new-trial-swapcount.orig/include/linux/swap.h
+++ new-trial-swapcount/include/linux/swap.h
@@ -301,8 +301,10 @@ extern void si_swapinfo(struct sysinfo *
 extern swp_entry_t get_swap_page(void);
 extern swp_entry_t get_swap_page_of_type(int);
 extern int swap_duplicate(swp_entry_t);
+extern int swapcache_prepare(swp_entry_t);
 extern int valid_swaphandles(swp_entry_t, unsigned long *);
 extern void swap_free(swp_entry_t);
+extern void swapcache_free(swp_entry_t, struct page *page);
 extern int free_swap_and_cache(swp_entry_t);
 extern int swap_type_of(dev_t, sector_t, struct block_device **);
 extern unsigned int count_swap_pages(int, int);
@@ -371,11 +373,16 @@ static inline void show_swap_cache_info(
 
 #define free_swap_and_cache(swp)	is_migration_entry(swp)
 #define swap_duplicate(swp)		is_migration_entry(swp)
+#define swapcache_prepare(swp)		is_migration_entry(swp)
 
 static inline void swap_free(swp_entry_t swp)
 {
 }
 
+static inline void swapcache_free(swp_entry_t swp, struct page *page)
+{
+}
+
 static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
 			struct vm_area_struct *vma, unsigned long addr)
 {
Index: new-trial-swapcount/mm/swap_state.c
===================================================================
--- new-trial-swapcount.orig/mm/swap_state.c
+++ new-trial-swapcount/mm/swap_state.c
@@ -162,11 +162,11 @@ int add_to_swap(struct page *page)
 			return 1;
 		case -EEXIST:
 			/* Raced with "speculative" read_swap_cache_async */
-			swap_free(entry);
+			swapcache_free(entry, NULL);
 			continue;
 		default:
 			/* -ENOMEM radix-tree allocation failure */
-			swap_free(entry);
+			swapcache_free(entry, NULL);
 			return 0;
 		}
 	}
@@ -188,8 +188,7 @@ void delete_from_swap_cache(struct page 
 	__delete_from_swap_cache(page);
 	spin_unlock_irq(&swapper_space.tree_lock);
 
-	mem_cgroup_uncharge_swapcache(page, entry);
-	swap_free(entry);
+	swapcache_free(entry, page);
 	page_cache_release(page);
 }
 
@@ -293,7 +292,7 @@ struct page *read_swap_cache_async(swp_e
 		/*
 		 * Swap entry may have been freed since our caller observed it.
 		 */
-		if (!swap_duplicate(entry))
+		if (!swapcache_prepare(entry))
 			break;
 
 		/*
@@ -317,7 +316,7 @@ struct page *read_swap_cache_async(swp_e
 		}
 		ClearPageSwapBacked(new_page);
 		__clear_page_locked(new_page);
-		swap_free(entry);
+		swapcache_free(entry, NULL);
 	} while (err != -ENOMEM);
 
 	if (new_page)
Index: new-trial-swapcount/mm/swapfile.c
===================================================================
--- new-trial-swapcount.orig/mm/swapfile.c
+++ new-trial-swapcount/mm/swapfile.c
@@ -510,6 +510,16 @@ void swap_free(swp_entry_t entry)
 }
 
 /*
+ * Called after dropping swapcache to decrease the refcnt of swap entries.
+ */
+void swapcache_free(swp_entry_t entry, struct page *page)
+{
+	if (page)
+		mem_cgroup_uncharge_swapcache(page, entry);
+	return swap_free(entry);
+}
+
+/*
  * How many references to page are currently swapped out?
  */
 static inline int page_swapcount(struct page *page)
@@ -1979,6 +1989,15 @@ bad_file:
 	goto out;
 }
 
+/*
+ * Called when allocating swap cache for an existing swap entry.
+ */
+int swapcache_prepare(swp_entry_t entry)
+{
+	return swap_duplicate(entry);
+}
+
+
 struct swap_info_struct *
 get_swap_info_struct(unsigned type)
 {
Index: new-trial-swapcount/mm/vmscan.c
===================================================================
--- new-trial-swapcount.orig/mm/vmscan.c
+++ new-trial-swapcount/mm/vmscan.c
@@ -477,8 +477,7 @@ static int __remove_mapping(struct addre
 		swp_entry_t swap = { .val = page_private(page) };
 		__delete_from_swap_cache(page);
 		spin_unlock_irq(&mapping->tree_lock);
-		mem_cgroup_uncharge_swapcache(page, swap);
-		swap_free(swap);
+		swapcache_free(swap, page);
 	} else {
 		__remove_from_page_cache(page);
 		spin_unlock_irq(&mapping->tree_lock);



* [RFC][PATCH 2/5] add SWAP_HAS_CACHE flag to swap_map
@ 2009-05-26  3:15   ` KAMEZAWA Hiroyuki
From: KAMEZAWA Hiroyuki @ 2009-05-26  3:15 UTC
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, balbir, nishimura, hugh.dickins, hannes, linux-kernel

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

This is part of a patch series for fixing memcg's swap account leak. But,
IMHO, it is not a bad patch even without memcg.

Now, references to swap are counted by swap_map[], an array of unsigned short.
There are 2 kinds of references to swap:
 - references from swap entries
 - the reference from the swap cache
Then,
 - if there is a swap cache and the swap's refcnt is 1, there is only the
   swap cache:
  (*) swapcount(entry) == 1 && find_get_page(swapper_space, entry) != NULL

This counting logic has worked well for a long time. But considering that
we cannot tell from swap_map[] whether a reference is a _real_ one or not,
the current use of the counter is not very good.

This patch adds a SWAP_HAS_CACHE flag and records whether a swap entry has
a swap cache or not. This removes the -1 magic used in swapfile.c and helps
avoid unnecessary find_get_page() calls.
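
To make the encoding concrete, here is a minimal user-space model of how a
swap_map element packs the reference count and the cache flag (the constants
and helpers mirror this patch; main() is only a demonstration, not kernel
code):

#include <assert.h>
#include <stdio.h>

#define SWAP_MAP_MAX	0x7ffe
#define SWAP_MAP_BAD	0x7fff
#define SWAP_HAS_CACHE	0x8000
#define SWAP_COUNT_MASK	(~SWAP_HAS_CACHE)

static int swap_count(unsigned short ent)
{
	return ent & SWAP_COUNT_MASK;
}

static int swap_has_cache(unsigned short ent)
{
	return ent & SWAP_HAS_CACHE;
}

static unsigned short make_swap_count(int count, int has_cache)
{
	unsigned short ret = count;

	if (has_cache)
		return SWAP_HAS_CACHE | ret;
	return ret;
}

int main(void)
{
	/* usual swap-out: the entry is referenced by the cache alone */
	unsigned short ent = make_swap_count(0, 1);

	assert(swap_count(ent) == 0 && swap_has_cache(ent));

	/* a swap pte references it too */
	ent = make_swap_count(swap_count(ent) + 1, 1);
	assert(swap_count(ent) == 1 && swap_has_cache(ent));

	/* the cache is dropped: one real reference remains */
	ent = make_swap_count(swap_count(ent), 0);
	printf("count=%d has_cache=%d\n", swap_count(ent), !!swap_has_cache(ent));
	return 0;
}

"Cache only" is then simply SWAP_HAS_CACHE set with a zero count, which is
what the following patches test for.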

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/swap.h |    7 +-
 mm/swapfile.c        |  169 +++++++++++++++++++++++++++++++++++----------------
 2 files changed, 123 insertions(+), 53 deletions(-)

Index: new-trial-swapcount/include/linux/swap.h
===================================================================
--- new-trial-swapcount.orig/include/linux/swap.h
+++ new-trial-swapcount/include/linux/swap.h
@@ -129,9 +129,10 @@ enum {
 
 #define SWAP_CLUSTER_MAX 32
 
-#define SWAP_MAP_MAX	0x7fff
-#define SWAP_MAP_BAD	0x8000
-
+#define SWAP_MAP_MAX	0x7ffe
+#define SWAP_MAP_BAD	0x7fff
+#define SWAP_HAS_CACHE  0x8000		/* There is a swap cache of entry. */
+#define SWAP_COUNT_MASK (~SWAP_HAS_CACHE)
 /*
  * The in-memory structure used to track swap areas.
  */
Index: new-trial-swapcount/mm/swapfile.c
===================================================================
--- new-trial-swapcount.orig/mm/swapfile.c
+++ new-trial-swapcount/mm/swapfile.c
@@ -53,6 +53,26 @@ static struct swap_info_struct swap_info
 
 static DEFINE_MUTEX(swapon_mutex);
 
+/* For reference count accounting in swap_map */
+static inline int swap_count(unsigned short ent)
+{
+	return ent & SWAP_COUNT_MASK;
+}
+
+static inline int swap_has_cache(unsigned short ent)
+{
+	return ent & SWAP_HAS_CACHE;
+}
+
+static inline unsigned short make_swap_count(int count, int has_cache)
+{
+	unsigned short ret = count;
+
+	if (has_cache)
+		return SWAP_HAS_CACHE | ret;
+	return ret;
+}
+
 /*
  * We need this because the bdev->unplug_fn can sleep and we cannot
  * hold swap_lock while calling the unplug_fn. And swap_lock
@@ -167,7 +187,8 @@ static int wait_for_discard(void *word)
 #define SWAPFILE_CLUSTER	256
 #define LATENCY_LIMIT		256
 
-static inline unsigned long scan_swap_map(struct swap_info_struct *si)
+static inline unsigned long scan_swap_map(struct swap_info_struct *si,
+					  int cache)
 {
 	unsigned long offset;
 	unsigned long scan_base;
@@ -285,7 +306,10 @@ checks:
 		si->lowest_bit = si->max;
 		si->highest_bit = 0;
 	}
-	si->swap_map[offset] = 1;
+	if (cache) /* at usual swap-out via vmscan.c */
+		si->swap_map[offset] = make_swap_count(0, 1);
+	else /* at suspend */
+		si->swap_map[offset] = make_swap_count(1, 0);
 	si->cluster_next = offset + 1;
 	si->flags -= SWP_SCANNING;
 
@@ -401,7 +425,8 @@ swp_entry_t get_swap_page(void)
 			continue;
 
 		swap_list.next = next;
-		offset = scan_swap_map(si);
+		/* This is called for allocating swap entry for cache */
+		offset = scan_swap_map(si, 1);
 		if (offset) {
 			spin_unlock(&swap_lock);
 			return swp_entry(type, offset);
@@ -415,6 +440,7 @@ noswap:
 	return (swp_entry_t) {0};
 }
 
+/* The only caller of this function is now the suspend routine */
 swp_entry_t get_swap_page_of_type(int type)
 {
 	struct swap_info_struct *si;
@@ -424,7 +450,8 @@ swp_entry_t get_swap_page_of_type(int ty
 	si = swap_info + type;
 	if (si->flags & SWP_WRITEOK) {
 		nr_swap_pages--;
-		offset = scan_swap_map(si);
+		/* This is called for allocating swap entry, not cache */
+		offset = scan_swap_map(si, 0);
 		if (offset) {
 			spin_unlock(&swap_lock);
 			return swp_entry(type, offset);
@@ -471,25 +498,36 @@ out:
 	return NULL;
 }
 
-static int swap_entry_free(struct swap_info_struct *p, swp_entry_t ent)
+static int swap_entry_free(struct swap_info_struct *p,
+			   swp_entry_t ent, int cache)
 {
 	unsigned long offset = swp_offset(ent);
-	int count = p->swap_map[offset];
+	int count = swap_count(p->swap_map[offset]);
+	int has_cache = swap_has_cache(p->swap_map[offset]);
 
-	if (count < SWAP_MAP_MAX) {
-		count--;
-		p->swap_map[offset] = count;
-		if (!count) {
-			if (offset < p->lowest_bit)
-				p->lowest_bit = offset;
-			if (offset > p->highest_bit)
-				p->highest_bit = offset;
-			if (p->prio > swap_info[swap_list.next].prio)
-				swap_list.next = p - swap_info;
-			nr_swap_pages++;
-			p->inuse_pages--;
-			mem_cgroup_uncharge_swap(ent);
-		}
+	if (!cache) { /* dropping usage count of swap */
+		if (count < SWAP_MAP_MAX) {
+			count--;
+			p->swap_map[offset] = make_swap_count(count, has_cache);
+		}
+	} else { /* dropping swap cache flag */
+		VM_BUG_ON(!has_cache);
+		p->swap_map[offset] = make_swap_count(count, 0);
+
+	}
+	/* return code. */
+	count = p->swap_map[offset];
+	/* free if no reference */
+	if (!count) {
+		if (offset < p->lowest_bit)
+			p->lowest_bit = offset;
+		if (offset > p->highest_bit)
+			p->highest_bit = offset;
+		if (p->prio > swap_info[swap_list.next].prio)
+			swap_list.next = p - swap_info;
+		nr_swap_pages++;
+		p->inuse_pages--;
+		mem_cgroup_uncharge_swap(ent);
 	}
 	return count;
 }
@@ -504,7 +542,7 @@ void swap_free(swp_entry_t entry)
 
 	p = swap_info_get(entry);
 	if (p) {
-		swap_entry_free(p, entry);
+		swap_entry_free(p, entry, 0);
 		spin_unlock(&swap_lock);
 	}
 }
@@ -514,9 +552,16 @@ void swap_free(swp_entry_t entry)
  */
 void swapcache_free(swp_entry_t entry, struct page *page)
 {
+	struct swap_info_struct *p;
+
 	if (page)
 		mem_cgroup_uncharge_swapcache(page, entry);
-	return swap_free(entry);
+	p = swap_info_get(entry);
+	if (p) {
+		swap_entry_free(p, entry, 1);
+		spin_unlock(&swap_lock);
+	}
+	return;
 }
 
 /*
@@ -531,8 +576,7 @@ static inline int page_swapcount(struct 
 	entry.val = page_private(page);
 	p = swap_info_get(entry);
 	if (p) {
-		/* Subtract the 1 for the swap cache itself */
-		count = p->swap_map[swp_offset(entry)] - 1;
+		count = swap_count(p->swap_map[swp_offset(entry)]);
 		spin_unlock(&swap_lock);
 	}
 	return count;
@@ -594,7 +638,7 @@ int free_swap_and_cache(swp_entry_t entr
 
 	p = swap_info_get(entry);
 	if (p) {
-		if (swap_entry_free(p, entry) == 1) {
+		if (swap_entry_free(p, entry, 0) == SWAP_HAS_CACHE) {
 			page = find_get_page(&swapper_space, entry.val);
 			if (page && !trylock_page(page)) {
 				page_cache_release(page);
@@ -901,7 +945,7 @@ static unsigned int find_next_to_unuse(s
 			i = 1;
 		}
 		count = si->swap_map[i];
-		if (count && count != SWAP_MAP_BAD)
+		if (count && swap_count(count) != SWAP_MAP_BAD)
 			break;
 	}
 	return i;
@@ -1005,13 +1049,13 @@ static int try_to_unuse(unsigned int typ
 		 */
 		shmem = 0;
 		swcount = *swap_map;
-		if (swcount > 1) {
+		if (swap_count(swcount)) {
 			if (start_mm == &init_mm)
 				shmem = shmem_unuse(entry, page);
 			else
 				retval = unuse_mm(start_mm, entry, page);
 		}
-		if (*swap_map > 1) {
+		if (swap_count(*swap_map)) {
 			int set_start_mm = (*swap_map >= swcount);
 			struct list_head *p = &start_mm->mmlist;
 			struct mm_struct *new_start_mm = start_mm;
@@ -1021,7 +1065,7 @@ static int try_to_unuse(unsigned int typ
 			atomic_inc(&new_start_mm->mm_users);
 			atomic_inc(&prev_mm->mm_users);
 			spin_lock(&mmlist_lock);
-			while (*swap_map > 1 && !retval && !shmem &&
+			while (swap_count(*swap_map) && !retval && !shmem &&
 					(p = p->next) != &start_mm->mmlist) {
 				mm = list_entry(p, struct mm_struct, mmlist);
 				if (!atomic_inc_not_zero(&mm->mm_users))
@@ -1033,14 +1077,16 @@ static int try_to_unuse(unsigned int typ
 				cond_resched();
 
 				swcount = *swap_map;
-				if (swcount <= 1)
+				if (!swap_count(swcount)) /* any usage ? */
 					;
 				else if (mm == &init_mm) {
 					set_start_mm = 1;
 					shmem = shmem_unuse(entry, page);
 				} else
 					retval = unuse_mm(mm, entry, page);
-				if (set_start_mm && *swap_map < swcount) {
+
+				if (set_start_mm &&
+				    swap_count(*swap_map) < swcount) {
 					mmput(new_start_mm);
 					atomic_inc(&mm->mm_users);
 					new_start_mm = mm;
@@ -1067,21 +1113,21 @@ static int try_to_unuse(unsigned int typ
 		}
 
 		/*
-		 * How could swap count reach 0x7fff when the maximum
-		 * pid is 0x7fff, and there's no way to repeat a swap
-		 * page within an mm (except in shmem, where it's the
-		 * shared object which takes the reference count)?
-		 * We believe SWAP_MAP_MAX cannot occur in Linux 2.4.
-		 *
+		 * How could the swap count reach 0x7ffe?
+		 * There's no way to repeat a swap page within an mm
+		 * (except in shmem, where it's the shared object which takes
+		 * the reference count).
+		 * We believe SWAP_MAP_MAX cannot occur (if it did, unsigned
+		 * short would be too small).
 		 * If that's wrong, then we should worry more about
 		 * exit_mmap() and do_munmap() cases described above:
 		 * we might be resetting SWAP_MAP_MAX too early here.
 		 * We know "Undead"s can happen, they're okay, so don't
 		 * report them; but do report if we reset SWAP_MAP_MAX.
 		 */
-		if (*swap_map == SWAP_MAP_MAX) {
+		if (swap_count(*swap_map) == SWAP_MAP_MAX) {
 			spin_lock(&swap_lock);
-			*swap_map = 1;
+			*swap_map = make_swap_count(0, 1);
 			spin_unlock(&swap_lock);
 			reset_overflow = 1;
 		}
@@ -1099,7 +1145,8 @@ static int try_to_unuse(unsigned int typ
 		 * pages would be incorrect if swap supported "shared
 		 * private" pages, but they are handled by tmpfs files.
 		 */
-		if ((*swap_map > 1) && PageDirty(page) && PageSwapCache(page)) {
+		if (swap_count(*swap_map) &&
+		     PageDirty(page) && PageSwapCache(page)) {
 			struct writeback_control wbc = {
 				.sync_mode = WB_SYNC_NONE,
 			};
@@ -1953,11 +2000,12 @@ void si_swapinfo(struct sysinfo *val)
  * Note: if swap_map[] reaches SWAP_MAP_MAX the entries are treated as
  * "permanent", but will be reclaimed by the next swapoff.
  */
-int swap_duplicate(swp_entry_t entry)
+static int __swap_duplicate(swp_entry_t entry, int cache)
 {
 	struct swap_info_struct * p;
 	unsigned long offset, type;
 	int result = 0;
+	int count, has_cache;
 
 	if (is_migration_entry(entry))
 		return 1;
@@ -1969,17 +2017,33 @@ int swap_duplicate(swp_entry_t entry)
 	offset = swp_offset(entry);
 
 	spin_lock(&swap_lock);
-	if (offset < p->max && p->swap_map[offset]) {
-		if (p->swap_map[offset] < SWAP_MAP_MAX - 1) {
-			p->swap_map[offset]++;
+
+	if (unlikely(offset >= p->max))
+		goto unlock_out;
+
+	count = swap_count(p->swap_map[offset]);
+	has_cache = swap_has_cache(p->swap_map[offset]);
+	if (cache) {
+		/* set SWAP_HAS_CACHE if there is no cache and entry is used */
+		if (!has_cache && count) {
+			p->swap_map[offset] = make_swap_count(count, 1);
+			result = 1;
+		}
+	} else if (count || has_cache) {
+		if (count < SWAP_MAP_MAX - 1) {
+			p->swap_map[offset] = make_swap_count(count + 1,
+							      has_cache);
 			result = 1;
-		} else if (p->swap_map[offset] <= SWAP_MAP_MAX) {
+		} else if (count <= SWAP_MAP_MAX) {
 			if (swap_overflow++ < 5)
-				printk(KERN_WARNING "swap_dup: swap entry overflow\n");
-			p->swap_map[offset] = SWAP_MAP_MAX;
+				printk(KERN_WARNING
+				       "swap_dup: swap entry overflow\n");
+			p->swap_map[offset] = make_swap_count(SWAP_MAP_MAX,
+							      has_cache);
 			result = 1;
 		}
 	}
+unlock_out:
 	spin_unlock(&swap_lock);
 out:
 	return result;
@@ -1989,12 +2053,17 @@ bad_file:
 	goto out;
 }
 
+int swap_duplicate(swp_entry_t entry)
+{
+	return __swap_duplicate(entry, 0);
+}
+
 /*
  * Called when allocating swap cache for an existing swap entry.
  */
 int swapcache_prepare(swp_entry_t entry)
 {
-	return swap_duplicate(entry);
+	return __swap_duplicate(entry, 1);
 }
 
 
@@ -2035,7 +2104,7 @@ int valid_swaphandles(swp_entry_t entry,
 		/* Don't read in free or bad pages */
 		if (!si->swap_map[toff])
 			break;
-		if (si->swap_map[toff] == SWAP_MAP_BAD)
+		if (swap_count(si->swap_map[toff]) == SWAP_MAP_BAD)
 			break;
 	}
 	/* Count contiguous allocated slots below our target */
@@ -2043,7 +2112,7 @@ int valid_swaphandles(swp_entry_t entry,
 		/* Don't read in free or bad pages */
 		if (!si->swap_map[toff])
 			break;
-		if (si->swap_map[toff] == SWAP_MAP_BAD)
+		if (swap_count(si->swap_map[toff]) == SWAP_MAP_BAD)
 			break;
 	}
 	spin_unlock(&swap_lock);



* [RFC][PATCH 3/5] count cache-only swaps
@ 2009-05-26  3:16   ` KAMEZAWA Hiroyuki
From: KAMEZAWA Hiroyuki @ 2009-05-26  3:16 UTC
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, balbir, nishimura, hugh.dickins, hannes, linux-kernel

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

This patch adds a counter of cache-only swap entries (swap caches with no
other reference). It may be useful for seeing "we're really under a shortage
of swap".

The value can be seen in kernel messages at sysrq-m etc.
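
The counter is adjusted by hand at each place swap_map changes; stated as an
invariant, it counts entries whose map value has SWAP_HAS_CACHE set and a
zero reference count. A user-space sketch of that invariant (not kernel code;
set_swap_map() is an illustrative helper, the kernel does this bookkeeping
inline at each site):

#include <stdio.h>

#define SWAP_HAS_CACHE	0x8000
#define SWAP_COUNT_MASK	(~SWAP_HAS_CACHE)

static long nr_cache_only_swaps;

static int cache_only(unsigned short ent)
{
	return (ent & SWAP_HAS_CACHE) && !(ent & SWAP_COUNT_MASK);
}

/* every swap_map update keeps the counter in sync */
static void set_swap_map(unsigned short *slot, unsigned short newval)
{
	nr_cache_only_swaps += cache_only(newval) - cache_only(*slot);
	*slot = newval;
}

int main(void)
{
	unsigned short map = 0;

	set_swap_map(&map, SWAP_HAS_CACHE);	/* swap-out: cache only */
	set_swap_map(&map, SWAP_HAS_CACHE | 1);	/* a pte references it */
	set_swap_map(&map, 1);			/* cache dropped */
	printf("cache-only swaps: %ld\n", nr_cache_only_swaps);	/* prints 0 */
	return 0;
}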

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/swap.h |    3 +++
 mm/swap_state.c      |    2 ++
 mm/swapfile.c        |   23 ++++++++++++++++++++---
 3 files changed, 25 insertions(+), 3 deletions(-)

Index: new-trial-swapcount/include/linux/swap.h
===================================================================
--- new-trial-swapcount.orig/include/linux/swap.h
+++ new-trial-swapcount/include/linux/swap.h
@@ -155,6 +155,7 @@ struct swap_info_struct {
 	unsigned int max;
 	unsigned int inuse_pages;
 	unsigned int old_block_size;
+	unsigned int cache_only;
 };
 
 struct swap_list_t {
@@ -298,6 +299,7 @@ extern struct page *swapin_readahead(swp
 /* linux/mm/swapfile.c */
 extern long nr_swap_pages;
 extern long total_swap_pages;
+extern long nr_cache_only_swaps;
 extern void si_swapinfo(struct sysinfo *);
 extern swp_entry_t get_swap_page(void);
 extern swp_entry_t get_swap_page_of_type(int);
@@ -358,6 +360,7 @@ static inline void mem_cgroup_uncharge_s
 #define nr_swap_pages				0L
 #define total_swap_pages			0L
 #define total_swapcache_pages			0UL
+#define nr_cache_only_swaps			0UL
 
 #define si_swapinfo(val) \
 	do { (val)->freeswap = (val)->totalswap = 0; } while (0)
Index: new-trial-swapcount/mm/swapfile.c
===================================================================
--- new-trial-swapcount.orig/mm/swapfile.c
+++ new-trial-swapcount/mm/swapfile.c
@@ -39,6 +39,7 @@ static DEFINE_SPINLOCK(swap_lock);
 static unsigned int nr_swapfiles;
 long nr_swap_pages;
 long total_swap_pages;
+long nr_cache_only_swaps;
 static int swap_overflow;
 static int least_priority;
 
@@ -306,9 +307,11 @@ checks:
 		si->lowest_bit = si->max;
 		si->highest_bit = 0;
 	}
-	if (cache) /* at usual swap-out via vmscan.c */
+	if (cache) {/* at usual swap-out via vmscan.c */
 		si->swap_map[offset] = make_swap_count(0, 1);
-	else /* at suspend */
+		si->cache_only++;
+		nr_cache_only_swaps++;
+	} else /* at suspend */
 		si->swap_map[offset] = make_swap_count(1, 0);
 	si->cluster_next = offset + 1;
 	si->flags -= SWP_SCANNING;
@@ -513,7 +516,10 @@ static int swap_entry_free(struct swap_i
 	} else { /* dropping swap cache flag */
 		VM_BUG_ON(!has_cache);
 		p->swap_map[offset] = make_swap_count(count, 0);
-
+		if (!count) {
+			p->cache_only--;
+			nr_cache_only_swaps--;
+		}
 	}
 	/* return code. */
 	count = p->swap_map[offset];
@@ -529,6 +535,11 @@ static int swap_entry_free(struct swap_i
 		p->inuse_pages--;
 		mem_cgroup_uncharge_swap(ent);
 	}
+	if (swap_has_cache(count) && !swap_count(count)) {
+		nr_cache_only_swaps++;
+		p->cache_only++;
+	}
+
 	return count;
 }
 
@@ -1128,6 +1139,8 @@ static int try_to_unuse(unsigned int typ
 		if (swap_count(*swap_map) == SWAP_MAP_MAX) {
 			spin_lock(&swap_lock);
 			*swap_map = make_swap_count(0, 1);
+			si->cache_only++;
+			nr_cache_only_swaps++;
 			spin_unlock(&swap_lock);
 			reset_overflow = 1;
 		}
@@ -2033,6 +2046,10 @@ static int __swap_duplicate(swp_entry_t 
 		if (count < SWAP_MAP_MAX - 1) {
 			p->swap_map[offset] = make_swap_count(count + 1,
 							      has_cache);
+			if (has_cache && !count) {
+				p->cache_only--;
+				nr_cache_only_swaps--;
+			}
 			result = 1;
 		} else if (count <= SWAP_MAP_MAX) {
 			if (swap_overflow++ < 5)
Index: new-trial-swapcount/mm/swap_state.c
===================================================================
--- new-trial-swapcount.orig/mm/swap_state.c
+++ new-trial-swapcount/mm/swap_state.c
@@ -63,6 +63,8 @@ void show_swap_cache_info(void)
 		swap_cache_info.find_success, swap_cache_info.find_total);
 	printk("Free swap  = %ldkB\n", nr_swap_pages << (PAGE_SHIFT - 10));
 	printk("Total swap = %lukB\n", total_swap_pages << (PAGE_SHIFT - 10));
+	printk("Cache only swap = %lukB\n",
+	       nr_cache_only_swaps << (PAGE_SHIFT - 10));
 }
 
 /*


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [RFC][PATCH 3/5] count cache-only swaps
@ 2009-05-26  3:16   ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 48+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-05-26  3:16 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, balbir, nishimura, hugh.dickins, hannes, linux-kernel

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

This patch adds a counter for unused swap caches.
Maybe useful to see "we're really under shortage of swap".

The value can be seen as kernel message at Sysrq-m etc.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/swap.h |    3 +++
 mm/swap_state.c      |    2 ++
 mm/swapfile.c        |   23 ++++++++++++++++++++---
 3 files changed, 25 insertions(+), 3 deletions(-)

Index: new-trial-swapcount/include/linux/swap.h
===================================================================
--- new-trial-swapcount.orig/include/linux/swap.h
+++ new-trial-swapcount/include/linux/swap.h
@@ -155,6 +155,7 @@ struct swap_info_struct {
 	unsigned int max;
 	unsigned int inuse_pages;
 	unsigned int old_block_size;
+	unsigned int cache_only;
 };
 
 struct swap_list_t {
@@ -298,6 +299,7 @@ extern struct page *swapin_readahead(swp
 /* linux/mm/swapfile.c */
 extern long nr_swap_pages;
 extern long total_swap_pages;
+extern long nr_cache_only_swaps;
 extern void si_swapinfo(struct sysinfo *);
 extern swp_entry_t get_swap_page(void);
 extern swp_entry_t get_swap_page_of_type(int);
@@ -358,6 +360,7 @@ static inline void mem_cgroup_uncharge_s
 #define nr_swap_pages				0L
 #define total_swap_pages			0L
 #define total_swapcache_pages			0UL
+#define nr_cache_only_swaps			0UL
 
 #define si_swapinfo(val) \
 	do { (val)->freeswap = (val)->totalswap = 0; } while (0)
Index: new-trial-swapcount/mm/swapfile.c
===================================================================
--- new-trial-swapcount.orig/mm/swapfile.c
+++ new-trial-swapcount/mm/swapfile.c
@@ -39,6 +39,7 @@ static DEFINE_SPINLOCK(swap_lock);
 static unsigned int nr_swapfiles;
 long nr_swap_pages;
 long total_swap_pages;
+long nr_cache_only_swaps;
 static int swap_overflow;
 static int least_priority;
 
@@ -306,9 +307,11 @@ checks:
 		si->lowest_bit = si->max;
 		si->highest_bit = 0;
 	}
-	if (cache) /* at usual swap-out via vmscan.c */
+	if (cache) {/* at usual swap-out via vmscan.c */
 		si->swap_map[offset] = make_swap_count(0, 1);
-	else /* at suspend */
+		si->cache_only++;
+		nr_cache_only_swaps++;
+	} else /* at suspend */
 		si->swap_map[offset] = make_swap_count(1, 0);
 	si->cluster_next = offset + 1;
 	si->flags -= SWP_SCANNING;
@@ -513,7 +516,10 @@ static int swap_entry_free(struct swap_i
 	} else { /* dropping swap cache flag */
 		VM_BUG_ON(!has_cache);
 		p->swap_map[offset] = make_swap_count(count, 0);
-
+		if (!count) {
+			p->cache_only--;
+			nr_cache_only_swaps--;
+		}
 	}
 	/* return code. */
 	count = p->swap_map[offset];
@@ -529,6 +535,11 @@ static int swap_entry_free(struct swap_i
 		p->inuse_pages--;
 		mem_cgroup_uncharge_swap(ent);
 	}
+	if (swap_has_cache(count) && !swap_count(count)) {
+		nr_cache_only_swaps++;
+		p->cache_only++;
+	}
+
 	return count;
 }
 
@@ -1128,6 +1139,8 @@ static int try_to_unuse(unsigned int typ
 		if (swap_count(*swap_map) == SWAP_MAP_MAX) {
 			spin_lock(&swap_lock);
 			*swap_map = make_swap_count(0, 1);
+			si->cache_only++;
+			nr_cache_only_swaps++;
 			spin_unlock(&swap_lock);
 			reset_overflow = 1;
 		}
@@ -2033,6 +2046,10 @@ static int __swap_duplicate(swp_entry_t 
 		if (count < SWAP_MAP_MAX - 1) {
 			p->swap_map[offset] = make_swap_count(count + 1,
 							      has_cache);
+			if (has_cache && !count) {
+				p->cache_only--;
+				nr_cache_only_swaps--;
+			}
 			result = 1;
 		} else if (count <= SWAP_MAP_MAX) {
 			if (swap_overflow++ < 5)
Index: new-trial-swapcount/mm/swap_state.c
===================================================================
--- new-trial-swapcount.orig/mm/swap_state.c
+++ new-trial-swapcount/mm/swap_state.c
@@ -63,6 +63,8 @@ void show_swap_cache_info(void)
 		swap_cache_info.find_success, swap_cache_info.find_total);
 	printk("Free swap  = %ldkB\n", nr_swap_pages << (PAGE_SHIFT - 10));
 	printk("Total swap = %lukB\n", total_swap_pages << (PAGE_SHIFT - 10));
+	printk("Cache only swap = %lukB\n",
+	       nr_cache_only_swaps << (PAGE_SHIFT - 10));
 }
 
 /*

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [RFC][PATCH 4/5] memcg: fix swap account
@ 2009-05-26  3:17   ` KAMEZAWA Hiroyuki
From: KAMEZAWA Hiroyuki @ 2009-05-26  3:17 UTC
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, balbir, nishimura, hugh.dickins, hannes, linux-kernel

This patch fixes mis-accounting of swap usage in memcg.

In the current implementation, memcg's swap account is uncharged only when
the swap entry is completely freed. But there are several cases where a swap
entry cannot be freed cleanly. To handle them, this patch changes memcg to
uncharge the swap account when the swap entry has no references other than
the swap cache.

With this, memcg's swap entry accounting can be fully synchronous with the
application's behavior.
This patch also changes memcg's hooks for swap-out: if
delete_from_swap_cache() is called but there is no swap reference, no charge
to swap occurs (the charge for mem+swap stays attached to the page itself if
it is mapped).
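
For illustration, a user-space sketch of the decision swapcache_free() makes
after this patch (not kernel code; the *_model names are placeholders): the
return value of swap_entry_free() says whether references remain once the
cache's reference is gone, and that selects "swapout" (record the memcg in
the swap record) versus "drop" (the swap was unused):

#include <stdio.h>

/* model: returns the references remaining after the cache flag is cleared */
static int swap_entry_free_model(int remaining_refs)
{
	return remaining_refs;
}

static void mem_cgroup_uncharge_swapcache_model(int entry, int swapout)
{
	if (swapout)
		printf("entry %d: real swap-out, charge moves to swap record\n",
		       entry);
	else
		printf("entry %d: unused swap, drop the page charge only\n",
		       entry);
}

static void swapcache_free_model(int entry, int remaining_refs)
{
	int ret = swap_entry_free_model(remaining_refs);

	/* nonzero: somebody still uses the entry -> a real swap-out */
	mem_cgroup_uncharge_swapcache_model(entry, ret != 0);
}

int main(void)
{
	swapcache_free_model(1, 1);	/* a pte still references the entry */
	swapcache_free_model(2, 0);	/* nobody does: plain drop */
	return 0;
}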

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/swap.h |    5 +++--
 mm/memcontrol.c      |   17 ++++++++++++-----
 mm/swapfile.c        |   14 ++++++++++----
 3 files changed, 25 insertions(+), 11 deletions(-)

Index: new-trial-swapcount/include/linux/swap.h
===================================================================
--- new-trial-swapcount.orig/include/linux/swap.h
+++ new-trial-swapcount/include/linux/swap.h
@@ -340,10 +340,11 @@ static inline void disable_swap_token(vo
 }
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
-extern void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent);
+extern void
+mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, int swapout);
 #else
 static inline void
-mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
+mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, int swapout)
 {
 }
 #endif
Index: new-trial-swapcount/mm/memcontrol.c
===================================================================
--- new-trial-swapcount.orig/mm/memcontrol.c
+++ new-trial-swapcount/mm/memcontrol.c
@@ -189,6 +189,7 @@ enum charge_type {
 	MEM_CGROUP_CHARGE_TYPE_SHMEM,	/* used by page migration of shmem */
 	MEM_CGROUP_CHARGE_TYPE_FORCE,	/* used by force_empty */
 	MEM_CGROUP_CHARGE_TYPE_SWAPOUT,	/* for accounting swapcache */
+	MEM_CGROUP_CHARGE_TYPE_DROP,	/* a page was unused swap cache */
 	NR_CHARGE_TYPE,
 };
 
@@ -1501,6 +1502,7 @@ __mem_cgroup_uncharge_common(struct page
 
 	switch (ctype) {
 	case MEM_CGROUP_CHARGE_TYPE_MAPPED:
+	case MEM_CGROUP_CHARGE_TYPE_DROP:
 		if (page_mapped(page))
 			goto unlock_out;
 		break;
@@ -1564,18 +1566,23 @@ void mem_cgroup_uncharge_cache_page(stru
  * called after __delete_from_swap_cache() and drop "page" account.
  * memcg information is recorded to swap_cgroup of "ent"
  */
-void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
+void
+mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, int swapout)
 {
 	struct mem_cgroup *memcg;
+	int ctype = MEM_CGROUP_CHARGE_TYPE_SWAPOUT;
+
+	if (!swapout) /* this was a swap cache but the swap is unused ! */
+		ctype = MEM_CGROUP_CHARGE_TYPE_DROP;
+
+	memcg = __mem_cgroup_uncharge_common(page, ctype);
 
-	memcg = __mem_cgroup_uncharge_common(page,
-					MEM_CGROUP_CHARGE_TYPE_SWAPOUT);
 	/* record memcg information */
-	if (do_swap_account && memcg) {
+	if (do_swap_account && swapout && memcg) {
 		swap_cgroup_record(ent, css_id(&memcg->css));
 		mem_cgroup_get(memcg);
 	}
-	if (memcg)
+	if (swapout && memcg)
 		css_put(&memcg->css);
 }
 #endif
Index: new-trial-swapcount/mm/swapfile.c
===================================================================
--- new-trial-swapcount.orig/mm/swapfile.c
+++ new-trial-swapcount/mm/swapfile.c
@@ -533,8 +533,9 @@ static int swap_entry_free(struct swap_i
 			swap_list.next = p - swap_info;
 		nr_swap_pages++;
 		p->inuse_pages--;
-		mem_cgroup_uncharge_swap(ent);
 	}
+	if (!swap_count(count))
+		mem_cgroup_uncharge_swap(ent);
 	if (swap_has_cache(count) && !swap_count(count)) {
 		nr_cache_only_swaps++;
 		p->cache_only++;
@@ -564,12 +565,17 @@ void swap_free(swp_entry_t entry)
 void swapcache_free(swp_entry_t entry, struct page *page)
 {
 	struct swap_info_struct *p;
+	int ret;
 
-	if (page)
-		mem_cgroup_uncharge_swapcache(page, entry);
 	p = swap_info_get(entry);
 	if (p) {
-		swap_entry_free(p, entry, 1);
+		ret = swap_entry_free(p, entry, 1);
+		if (page) {
+			if (ret)
+				mem_cgroup_uncharge_swapcache(page, entry, 1);
+			else
+				mem_cgroup_uncharge_swapcache(page, entry, 0);
+		}
 		spin_unlock(&swap_lock);
 	}
 	return;


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [RFC][PATCH 5/5] (experimental) chase and free cache only swap
  2009-05-26  3:12 ` KAMEZAWA Hiroyuki
@ 2009-05-26  3:18   ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 48+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-05-26  3:18 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, balbir, nishimura, hugh.dickins, hannes, linux-kernel


From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Just a trial/example patch.
I'd like to consider this more; better implementation ideas are welcome.

When the system does swap-in/swap-out repeatedly, cache-only swap entries
accumulate in general. Typically, pages that were swapped out in the past
but are in memory now, while vm_swap_full() returns true, are cache-only
swaps (their swap_map entries hold no references).

These cache-only swaps can be an obstacle to smooth page reclaiming.
The current implementation is very naive: just scan & free.
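
In swap_map terms, "cache-only" means the SWAP_HAS_CACHE flag is set while
the reference count is zero; that is exactly what the scan below tests.
A minimal helper sketch (assuming the swap_map encoding from patch 2/5;
the helper name is made up, the patch open-codes the test):

/* A slot is cache-only when the swap cache is its sole user:
 * SWAP_HAS_CACHE set and the reference count zero. */
static inline int swap_is_cache_only(unsigned short ent)
{
	return ent == SWAP_HAS_CACHE;
}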

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/swap.h |    1 
 mm/swapfile.c        |   88 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 89 insertions(+)

Index: new-trial-swapcount/mm/swapfile.c
===================================================================
--- new-trial-swapcount.orig/mm/swapfile.c
+++ new-trial-swapcount/mm/swapfile.c
@@ -74,6 +74,8 @@ static inline unsigned short make_swap_c
 	return ret;
 }
 
+static void call_gc_cache_only_swap(void);
+
 /*
  * We need this because the bdev->unplug_fn can sleep and we cannot
  * hold swap_lock while calling the unplug_fn. And swap_lock
@@ -432,6 +434,8 @@ swp_entry_t get_swap_page(void)
 		offset = scan_swap_map(si, 1);
 		if (offset) {
 			spin_unlock(&swap_lock);
+			/* reclaim cache-only swaps if vm_swap_full() */
+			call_gc_cache_only_swap();
 			return swp_entry(type, offset);
 		}
 		next = swap_list.next;
@@ -2147,3 +2151,87 @@ int valid_swaphandles(swp_entry_t entry,
 	*offset = ++toff;
 	return nr_pages? ++nr_pages: 0;
 }
+
+/*
+ * The following code frees cache-only swap entries.  It is invoked in the
+ * vm_swap_full() situation; freeing cache-only swaps makes some swap
+ * entries usable again.
+ */
+
+static int find_free_cache_only_swap(int type)
+{
+	unsigned long buf[SWAP_CLUSTER_MAX];
+	int nr, offset, i;
+	unsigned short count;
+	struct swap_info_struct *si = swap_info + type;
+	int ret = 0;
+
+	spin_lock(&swap_lock);
+	nr = 0;
+	if (!(si->flags & SWP_WRITEOK) || si->cache_only == 0) {
+		ret = 1;
+		goto unlock;
+	}
+	offset = si->garbage_scan_offset;
+	/* Scan 2048 entries at most and free up to 32 entries per scan.*/
+	for (i = 2048; i > 0 && nr < 32; i--, offset++) {
+		if (offset >= si->max) {
+			offset = 0;
+			ret = 1;
+			break;
+		}
+		count = si->swap_map[offset];
+		if (count == SWAP_HAS_CACHE)
+			buf[nr++] = offset;
+	}
+	si->garbage_scan_offset = offset;
+unlock:
+	spin_unlock(&swap_lock);
+
+	for (i = 0; i < nr; i++) {
+		swp_entry_t ent;
+		struct page *page;
+
+		ent = swp_entry(type, buf[i]);
+
+		page = find_get_page(&swapper_space, ent.val);
+		if (page) {
+			lock_page(page);
+			try_to_free_swap(page);
+			unlock_page(page);
+		}
+	}
+	return ret;
+}
+
+#define SWAP_GC_THRESH	(4096)
+static void scan_and_free_cache_only_swap_work(struct work_struct *work);
+DECLARE_DELAYED_WORK(swap_gc_work, scan_and_free_cache_only_swap_work);
+static int swap_gc_last_scan;
+
+static void scan_and_free_cache_only_swap_work(struct work_struct *work)
+{
+	int type = swap_gc_last_scan;
+	int i;
+
+	spin_lock(&swap_lock);
+	for (i = type; i < MAX_SWAPFILES; i++) {
+		if (swap_info[i].flags & SWP_WRITEOK)
+			break;
+	}
+	if (i >= MAX_SWAPFILES)
+		i = 0;
+	spin_unlock(&swap_lock);
+	if (find_free_cache_only_swap(i))
+		swap_gc_last_scan = i + 1;
+
+	if (vm_swap_full() && (nr_cache_only_swaps > SWAP_GC_THRESH))
+		schedule_delayed_work(&swap_gc_work, HZ/10);
+}
+
+static void call_gc_cache_only_swap(void)
+{
+	if (vm_swap_full()  && (nr_cache_only_swaps > SWAP_GC_THRESH))
+		schedule_delayed_work(&swap_gc_work, HZ/10);
+}
+
Index: new-trial-swapcount/include/linux/swap.h
===================================================================
--- new-trial-swapcount.orig/include/linux/swap.h
+++ new-trial-swapcount/include/linux/swap.h
@@ -156,6 +156,7 @@ struct swap_info_struct {
 	unsigned int inuse_pages;
 	unsigned int old_block_size;
 	unsigned int cache_only;
+	unsigned int garbage_scan_offset;
 };
 
 struct swap_list_t {


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH 3/5] count cache-only swaps
  2009-05-26  3:16   ` KAMEZAWA Hiroyuki
@ 2009-05-26 17:37     ` Johannes Weiner
  -1 siblings, 0 replies; 48+ messages in thread
From: Johannes Weiner @ 2009-05-26 17:37 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura, hugh.dickins, linux-kernel

On Tue, May 26, 2009 at 12:16:38PM +0900, KAMEZAWA Hiroyuki wrote:
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> This patch adds a counter for unused swap caches.
> Maybe useful to see "we're really under shortage of swap".
> 
> The value can be seen as kernel message at Sysrq-m etc.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  include/linux/swap.h |    3 +++
>  mm/swap_state.c      |    2 ++
>  mm/swapfile.c        |   23 ++++++++++++++++++++---
>  3 files changed, 25 insertions(+), 3 deletions(-)
> 
> Index: new-trial-swapcount/include/linux/swap.h
> ===================================================================
> --- new-trial-swapcount.orig/include/linux/swap.h
> +++ new-trial-swapcount/include/linux/swap.h
> @@ -155,6 +155,7 @@ struct swap_info_struct {
>  	unsigned int max;
>  	unsigned int inuse_pages;
>  	unsigned int old_block_size;
> +	unsigned int cache_only;
>  };
>  
>  struct swap_list_t {
> @@ -298,6 +299,7 @@ extern struct page *swapin_readahead(swp
>  /* linux/mm/swapfile.c */
>  extern long nr_swap_pages;
>  extern long total_swap_pages;
> +extern long nr_cache_only_swaps;
>  extern void si_swapinfo(struct sysinfo *);
>  extern swp_entry_t get_swap_page(void);
>  extern swp_entry_t get_swap_page_of_type(int);
> @@ -358,6 +360,7 @@ static inline void mem_cgroup_uncharge_s
>  #define nr_swap_pages				0L
>  #define total_swap_pages			0L
>  #define total_swapcache_pages			0UL
> +#define nr_cache_only_swaps			0UL
>  
>  #define si_swapinfo(val) \
>  	do { (val)->freeswap = (val)->totalswap = 0; } while (0)
> Index: new-trial-swapcount/mm/swapfile.c
> ===================================================================
> --- new-trial-swapcount.orig/mm/swapfile.c
> +++ new-trial-swapcount/mm/swapfile.c
> @@ -39,6 +39,7 @@ static DEFINE_SPINLOCK(swap_lock);
>  static unsigned int nr_swapfiles;
>  long nr_swap_pages;
>  long total_swap_pages;
> +long nr_cache_only_swaps;
>  static int swap_overflow;
>  static int least_priority;
>  
> @@ -306,9 +307,11 @@ checks:
>  		si->lowest_bit = si->max;
>  		si->highest_bit = 0;
>  	}
> -	if (cache) /* at usual swap-out via vmscan.c */
> +	if (cache) {/* at usual swap-out via vmscan.c */
>  		si->swap_map[offset] = make_swap_count(0, 1);
> -	else /* at suspend */
> +		si->cache_only++;
> +		nr_cache_only_swaps++;
> +	} else /* at suspend */
>  		si->swap_map[offset] = make_swap_count(1, 0);
>  	si->cluster_next = offset + 1;
>  	si->flags -= SWP_SCANNING;
> @@ -513,7 +516,10 @@ static int swap_entry_free(struct swap_i
>  	} else { /* dropping swap cache flag */
>  		VM_BUG_ON(!has_cache);
>  		p->swap_map[offset] = make_swap_count(count, 0);
> -
> +		if (!count) {
> +			p->cache_only--;
> +			nr_cache_only_swaps--;
> +		}
>  	}
>  	/* return code. */
>  	count = p->swap_map[offset];
> @@ -529,6 +535,11 @@ static int swap_entry_free(struct swap_i
>  		p->inuse_pages--;
>  		mem_cgroup_uncharge_swap(ent);
>  	}
> +	if (swap_has_cache(count) && !swap_count(count)) {
> +		nr_cache_only_swaps++;
> +		p->cache_only++;
> +	}
> +
>  	return count;
>  }
>  
> @@ -1128,6 +1139,8 @@ static int try_to_unuse(unsigned int typ
>  		if (swap_count(*swap_map) == SWAP_MAP_MAX) {
>  			spin_lock(&swap_lock);
>  			*swap_map = make_swap_count(0, 1);
> +			si->cache_only++;
> +			nr_cache_only_swaps++;
>  			spin_unlock(&swap_lock);
>  			reset_overflow = 1;
>  		}
> @@ -2033,6 +2046,10 @@ static int __swap_duplicate(swp_entry_t 
>  		if (count < SWAP_MAP_MAX - 1) {
>  			p->swap_map[offset] = make_swap_count(count + 1,
>  							      has_cache);
> +			if (has_cache && !count) {
> +				p->cache_only--;
> +				nr_cache_only_swaps--;
> +			}
>  			result = 1;
>  		} else if (count <= SWAP_MAP_MAX) {
>  			if (swap_overflow++ < 5)
> Index: new-trial-swapcount/mm/swap_state.c
> ===================================================================
> --- new-trial-swapcount.orig/mm/swap_state.c
> +++ new-trial-swapcount/mm/swap_state.c
> @@ -63,6 +63,8 @@ void show_swap_cache_info(void)
>  		swap_cache_info.find_success, swap_cache_info.find_total);
>  	printk("Free swap  = %ldkB\n", nr_swap_pages << (PAGE_SHIFT - 10));
>  	printk("Total swap = %lukB\n", total_swap_pages << (PAGE_SHIFT - 10));
> +	printk("Cache only swap = %lukB\n",
> +	       nr_cache_only_swaps << (PAGE_SHIFT - 10));
>  }

This is shown rather seldom (sysrq and OOM); for that purpose two
counters are overkill.  Maybe remove the global one and sum up the
per-swap-device counters on demand?
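
Something like this, roughly (a sketch that assumes this patch's
per-device cache_only field and the existing swap_info[]/nr_swapfiles in
mm/swapfile.c; the function name is made up):

static unsigned long sum_cache_only_swaps(void)
{
	unsigned long total = 0;
	int type;

	spin_lock(&swap_lock);
	for (type = 0; type < nr_swapfiles; type++)
		total += swap_info[type].cache_only;
	spin_unlock(&swap_lock);
	return total;
}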

	Hannes

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH 5/5] (experimental) chase and free cache only swap
  2009-05-26  3:18   ` KAMEZAWA Hiroyuki
@ 2009-05-26 18:14     ` Johannes Weiner
  -1 siblings, 0 replies; 48+ messages in thread
From: Johannes Weiner @ 2009-05-26 18:14 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura, hugh.dickins, linux-kernel

On Tue, May 26, 2009 at 12:18:34PM +0900, KAMEZAWA Hiroyuki wrote:
> 
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Just a trial/example patch.
> I'd like to consider more. Better implementation idea is welcome.
> 
> When the system does swap-in/swap-out repeatedly, there are 
> cache-only swaps in general.
> Typically,
>  - swapped out in past but on memory now while vm_swap_full() returns true
> pages are cache-only swaps. (swap_map has no references.)
> 
> This cache-only swaps can be an obstacles for smooth page reclaiming.
> Current implemantation is very naive, just scan & free.

I think we can just remove that vm_swap_full() check in do_swap_page()
and try to remove the page from swap cache unconditionally.

If it's still mapped someplace else, we leave it cached.  If not, there
is not much use in keeping it around and we free it.

When I removed it and did benchmarks, I couldn't spot any difference
in the timings, though.  Did you measure the benefits of your patch
somehow?

According to the git history tree, vm_swap_full() was initially only
used to aggressively drop cache entries even when they are mapped.

Rik put it into vmscan to reclaim swap cache _at all_ for activated
pages.  But I think unconditionally dropping the cache entry makes
sense if the page gets shuffled around on the LRU list.  Better to
re-allocate a swap slot close to the new LRU buddies on the next scan.

With all of this covered, the need for the scanning your patch does
should be gone, unless I missed something.
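
Concretely, that would mean a change along these lines in do_swap_page()
(only a sketch against the current code, not a posted patch):

 	swap_free(entry);
-	if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
-		try_to_free_swap(page);
+	/* try unconditionally; this succeeds only when no users remain */
+	try_to_free_swap(page);
 	unlock_page(page);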

	Hannes

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH 3/5] count cache-only swaps
  2009-05-26 17:37     ` Johannes Weiner
@ 2009-05-26 23:49       ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 48+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-05-26 23:49 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: linux-mm, balbir, nishimura, hugh.dickins, linux-kernel

On Tue, 26 May 2009 19:37:36 +0200
Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Tue, May 26, 2009 at 12:16:38PM +0900, KAMEZAWA Hiroyuki wrote:
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > This patch adds a counter for unused swap caches.
> > Maybe useful to see "we're really under shortage of swap".
> > 
> > The value can be seen as kernel message at Sysrq-m etc.
> > 
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> >  include/linux/swap.h |    3 +++
> >  mm/swap_state.c      |    2 ++
> >  mm/swapfile.c        |   23 ++++++++++++++++++++---
> >  3 files changed, 25 insertions(+), 3 deletions(-)
> > 
> > Index: new-trial-swapcount/include/linux/swap.h
> > ===================================================================
> > --- new-trial-swapcount.orig/include/linux/swap.h
> > +++ new-trial-swapcount/include/linux/swap.h
> > @@ -155,6 +155,7 @@ struct swap_info_struct {
> >  	unsigned int max;
> >  	unsigned int inuse_pages;
> >  	unsigned int old_block_size;
> > +	unsigned int cache_only;
> >  };
> >  
> >  struct swap_list_t {
> > @@ -298,6 +299,7 @@ extern struct page *swapin_readahead(swp
> >  /* linux/mm/swapfile.c */
> >  extern long nr_swap_pages;
> >  extern long total_swap_pages;
> > +extern long nr_cache_only_swaps;
> >  extern void si_swapinfo(struct sysinfo *);
> >  extern swp_entry_t get_swap_page(void);
> >  extern swp_entry_t get_swap_page_of_type(int);
> > @@ -358,6 +360,7 @@ static inline void mem_cgroup_uncharge_s
> >  #define nr_swap_pages				0L
> >  #define total_swap_pages			0L
> >  #define total_swapcache_pages			0UL
> > +#define nr_cache_only_swaps			0UL
> >  
> >  #define si_swapinfo(val) \
> >  	do { (val)->freeswap = (val)->totalswap = 0; } while (0)
> > Index: new-trial-swapcount/mm/swapfile.c
> > ===================================================================
> > --- new-trial-swapcount.orig/mm/swapfile.c
> > +++ new-trial-swapcount/mm/swapfile.c
> > @@ -39,6 +39,7 @@ static DEFINE_SPINLOCK(swap_lock);
> >  static unsigned int nr_swapfiles;
> >  long nr_swap_pages;
> >  long total_swap_pages;
> > +long nr_cache_only_swaps;
> >  static int swap_overflow;
> >  static int least_priority;
> >  
> > @@ -306,9 +307,11 @@ checks:
> >  		si->lowest_bit = si->max;
> >  		si->highest_bit = 0;
> >  	}
> > -	if (cache) /* at usual swap-out via vmscan.c */
> > +	if (cache) {/* at usual swap-out via vmscan.c */
> >  		si->swap_map[offset] = make_swap_count(0, 1);
> > -	else /* at suspend */
> > +		si->cache_only++;
> > +		nr_cache_only_swaps++;
> > +	} else /* at suspend */
> >  		si->swap_map[offset] = make_swap_count(1, 0);
> >  	si->cluster_next = offset + 1;
> >  	si->flags -= SWP_SCANNING;
> > @@ -513,7 +516,10 @@ static int swap_entry_free(struct swap_i
> >  	} else { /* dropping swap cache flag */
> >  		VM_BUG_ON(!has_cache);
> >  		p->swap_map[offset] = make_swap_count(count, 0);
> > -
> > +		if (!count) {
> > +			p->cache_only--;
> > +			nr_cache_only_swaps--;
> > +		}
> >  	}
> >  	/* return code. */
> >  	count = p->swap_map[offset];
> > @@ -529,6 +535,11 @@ static int swap_entry_free(struct swap_i
> >  		p->inuse_pages--;
> >  		mem_cgroup_uncharge_swap(ent);
> >  	}
> > +	if (swap_has_cache(count) && !swap_count(count)) {
> > +		nr_cache_only_swaps++;
> > +		p->cache_only++;
> > +	}
> > +
> >  	return count;
> >  }
> >  
> > @@ -1128,6 +1139,8 @@ static int try_to_unuse(unsigned int typ
> >  		if (swap_count(*swap_map) == SWAP_MAP_MAX) {
> >  			spin_lock(&swap_lock);
> >  			*swap_map = make_swap_count(0, 1);
> > +			si->cache_only++;
> > +			nr_cache_only_swaps++;
> >  			spin_unlock(&swap_lock);
> >  			reset_overflow = 1;
> >  		}
> > @@ -2033,6 +2046,10 @@ static int __swap_duplicate(swp_entry_t 
> >  		if (count < SWAP_MAP_MAX - 1) {
> >  			p->swap_map[offset] = make_swap_count(count + 1,
> >  							      has_cache);
> > +			if (has_cache && !count) {
> > +				p->cache_only--;
> > +				nr_cache_only_swaps--;
> > +			}
> >  			result = 1;
> >  		} else if (count <= SWAP_MAP_MAX) {
> >  			if (swap_overflow++ < 5)
> > Index: new-trial-swapcount/mm/swap_state.c
> > ===================================================================
> > --- new-trial-swapcount.orig/mm/swap_state.c
> > +++ new-trial-swapcount/mm/swap_state.c
> > @@ -63,6 +63,8 @@ void show_swap_cache_info(void)
> >  		swap_cache_info.find_success, swap_cache_info.find_total);
> >  	printk("Free swap  = %ldkB\n", nr_swap_pages << (PAGE_SHIFT - 10));
> >  	printk("Total swap = %lukB\n", total_swap_pages << (PAGE_SHIFT - 10));
> > +	printk("Cache only swap = %lukB\n",
> > +	       nr_cache_only_swaps << (PAGE_SHIFT - 10));
> >  }
> 
> This is shown rather seldomly (sysrq and oom), for that purpose two
> counters are overkill.  Maybe remove the global one and sum up the
> per-swapdevice counters on demand?
> 
One is for scanning (as in 5/5), the other for checking whether we should
scan at all. But yes, I feel two counters may be overkill...
Maybe I'll remove the global counter in the next version.

Thanks,
-Kame





^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH 5/5] (experimental) chase and free cache only swap
  2009-05-26 18:14     ` Johannes Weiner
@ 2009-05-27  0:08       ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 48+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-05-27  0:08 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: linux-mm, balbir, nishimura, hugh.dickins, linux-kernel

On Tue, 26 May 2009 20:14:00 +0200
Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Tue, May 26, 2009 at 12:18:34PM +0900, KAMEZAWA Hiroyuki wrote:
> > 
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > Just a trial/example patch.
> > I'd like to consider more. Better implementation idea is welcome.
> > 
> > When the system does swap-in/swap-out repeatedly, there are 
> > cache-only swaps in general.
> > Typically,
> >  - swapped out in past but on memory now while vm_swap_full() returns true
> > pages are cache-only swaps. (swap_map has no references.)
> > 
> > This cache-only swaps can be an obstacles for smooth page reclaiming.
> > Current implemantation is very naive, just scan & free.
> 
> I think we can just remove that vm_swap_full() check in do_swap_page()
> and try to remove the page from swap cache unconditionally.
> 
I'm not sure why we reclaim the swap entry only at write fault.

> If it's still mapped someplace else, we let it cached.  If not, there
> is not much use for keeping it around and we free it.
> 
yes.

> When I removed it and did benchmarks, I couldn't spot any difference
> in the timings, though.  Did you measure the benefits of your patch
> somehow?
My patch has no "performance benefit". (My patch description may be bad.)
I just checked that cache-only swap can grow big (via sysrq-m).

Even when we remove vm_swap_full() in do_swap_page(),
swapin-readahead + trylock-at-zap + vmscan produces "unused" swap caches easily.
It reached 1M entries in a 2-hour test of a heavy-swap program.

But yes, I admit I don't like scan & free. I'm now thinking about some
mark-and-sweep approach...but it tends to consume memory and to be racy ;)

> 
> According to the git history tree, vm_swap_full() was initially only
> used to aggressively drop cache entries even when they are mapped.
> 
> Rik put it into vmscan to reclaim swap cache _at all_ for activated
> pages.  But I think unconditionally dropping the cache entry makes
> sense if the page gets shuffled around on the LRU list.  Better to
> re-allocate a swap slot close to the new LRU buddies on the next scan.
> 
> And having this all covered, the need for the scanning your patch does
> should be gone, unless I missed something.
> 
Considering memcg, global LRU scanning is no help ;(
And I'm writing this patch for memcg.

It seems I should make this 5/5 patch an independent one and
test 1-4/5 first.

Thank you for review.

Regards,
-Kame




^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH 5/5] (experimental) chase and free cache only swap
  2009-05-27  0:08       ` KAMEZAWA Hiroyuki
@ 2009-05-27  1:26         ` Johannes Weiner
  -1 siblings, 0 replies; 48+ messages in thread
From: Johannes Weiner @ 2009-05-27  1:26 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura, hugh.dickins, linux-kernel

On Wed, May 27, 2009 at 09:08:13AM +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 26 May 2009 20:14:00 +0200
> Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > On Tue, May 26, 2009 at 12:18:34PM +0900, KAMEZAWA Hiroyuki wrote:
> > > 
> > > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > 
> > > Just a trial/example patch.
> > > I'd like to consider more. Better implementation idea is welcome.
> > > 
> > > When the system does swap-in/swap-out repeatedly, there are 
> > > cache-only swaps in general.
> > > Typically,
> > >  - swapped out in past but on memory now while vm_swap_full() returns true
> > > pages are cache-only swaps. (swap_map has no references.)
> > > 
> > > This cache-only swaps can be an obstacles for smooth page reclaiming.
> > > Current implemantation is very naive, just scan & free.
> > 
> > I think we can just remove that vm_swap_full() check in do_swap_page()
> > and try to remove the page from swap cache unconditionally.
> > 
> I'm not sure why reclaim swap entry only at write fault.

How did you come to that conclusion?  Do you mean the current code does
that?  Or did you understand me to be suggesting that?

> > If it's still mapped someplace else, we let it cached.  If not, there
> > is not much use for keeping it around and we free it.
> > 
> yes.
> 
> > When I removed it and did benchmarks, I couldn't spot any difference
> > in the timings, though.  Did you measure the benefits of your patch
> > somehow?
> My patche has no "performance benefit". (My patch description may be bad.)
> I just checked that cache-only-swap can be big.(by sysrq-m)
> 
> Even when we remove vm_swap_full() in do_swap_page(),
> swapin-readahead + trylock-at-zap + vmscan makes "unused" swap caches easily.
> It reaches 1M in 2hours test of heavy swap program.

Ouch.

> > According to the git history tree, vm_swap_full() was initially only
> > used to aggressively drop cache entries even when they are mapped.
> > 
> > Rik put it into vmscan to reclaim swap cache _at all_ for activated
> > pages.  But I think unconditionally dropping the cache entry makes
> > sense if the page gets shuffled around on the LRU list.  Better to
> > re-allocate a swap slot close to the new LRU buddies on the next scan.
> > 
> > And having this all covered, the need for the scanning your patch does
> > should be gone, unless I missed something.
> > 
> Considering memcg, global lru scanning is no help ;(
> And I'm writing this patch for memcg.

Oh, sorry.  That makes sense of course.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH 5/5] (experimental) chase and free cache only swap
  2009-05-27  1:26         ` Johannes Weiner
@ 2009-05-27  1:31           ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 48+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-05-27  1:31 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: linux-mm, balbir, nishimura, hugh.dickins, linux-kernel

On Wed, 27 May 2009 03:26:58 +0200
Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Wed, May 27, 2009 at 09:08:13AM +0900, KAMEZAWA Hiroyuki wrote:
> > On Tue, 26 May 2009 20:14:00 +0200
> > Johannes Weiner <hannes@cmpxchg.org> wrote:
> > 
> > > On Tue, May 26, 2009 at 12:18:34PM +0900, KAMEZAWA Hiroyuki wrote:
> > > > 
> > > > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > > 
> > > > Just a trial/example patch.
> > > > I'd like to consider more. Better implementation idea is welcome.
> > > > 
> > > > When the system does swap-in/swap-out repeatedly, there are 
> > > > cache-only swaps in general.
> > > > Typically,
> > > >  - swapped out in past but on memory now while vm_swap_full() returns true
> > > > pages are cache-only swaps. (swap_map has no references.)
> > > > 
> > > > This cache-only swaps can be an obstacles for smooth page reclaiming.
> > > > Current implemantation is very naive, just scan & free.
> > > 
> > > I think we can just remove that vm_swap_full() check in do_swap_page()
> > > and try to remove the page from swap cache unconditionally.
> > > 
> > I'm not sure why reclaim swap entry only at write fault.
> 
> How do you come to that conclusion?  Do you mean the current code does
> that? 
yes.

2474         pte = mk_pte(page, vma->vm_page_prot);
2475         if (write_access && reuse_swap_page(page)) {
2476                 pte = maybe_mkwrite(pte_mkdirty(pte), vma);
2477                 write_access = 0;
2478         }


> Did you understand that I suggested that?
> 

I thought you were suggesting that the swp_entry should be reclaimed at
read fault in the same way as at write fault.

Thanks,
-Kame





^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH 5/5] (experimental) chase and free cache only swap
  2009-05-27  1:31           ` KAMEZAWA Hiroyuki
@ 2009-05-27  2:06             ` Johannes Weiner
  -1 siblings, 0 replies; 48+ messages in thread
From: Johannes Weiner @ 2009-05-27  2:06 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura, hugh.dickins, linux-kernel

On Wed, May 27, 2009 at 10:31:07AM +0900, KAMEZAWA Hiroyuki wrote:
> On Wed, 27 May 2009 03:26:58 +0200
> Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > On Wed, May 27, 2009 at 09:08:13AM +0900, KAMEZAWA Hiroyuki wrote:
> > > On Tue, 26 May 2009 20:14:00 +0200
> > > Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > 
> > > > On Tue, May 26, 2009 at 12:18:34PM +0900, KAMEZAWA Hiroyuki wrote:
> > > > > 
> > > > > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > > > 
> > > > > Just a trial/example patch.
> > > > > I'd like to consider it more. Better implementation ideas are welcome.
> > > > > 
> > > > > When the system does swap-in/swap-out repeatedly, cache-only swaps
> > > > > accumulate in general.
> > > > > Typically,
> > > > >  - pages swapped out in the past but in memory now, while
> > > > >    vm_swap_full() returns true,
> > > > > are cache-only swaps. (Their swap_map entries hold no references.)
> > > > > 
> > > > > These cache-only swap entries can be an obstacle to smooth page reclaim.
> > > > > The current implementation is very naive: just scan & free.
> > > > 
> > > > I think we can just remove that vm_swap_full() check in do_swap_page()
> > > > and try to remove the page from swap cache unconditionally.
> > > > 
> > > I'm not sure why we reclaim the swap entry only at write faults.
> > 
> > How do you come to that conclusion?  Do you mean the current code does
> > that? 
> yes.
> 
> 	pte = mk_pte(page, vma->vm_page_prot);
> 	if (write_access && reuse_swap_page(page)) {
> 		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> 		write_access = 0;
> 	}

Ahh.  But further down after installing the PTE, it does

	swap_free(entry);
	if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
	        try_to_free_swap(page);
	unlock_page(page);

You are right, it tries to reuse the page and free the swap slot for
writes, but later it removes the swap reference from the pte and then
tries to free the slot again, also for reads.

My suggestion was to remove these checks in the second attempt and
just try regardless of swap usage or mlock.  I just sent out a patch
that does that.
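
A minimal sketch of that suggestion, assuming the do_swap_page() tail quoted
above (an illustration only, not the patch that was actually posted):

	swap_free(entry);
	/*
	 * Try to drop the swap-cache copy unconditionally: if the entry
	 * still has other references, try_to_free_swap() just leaves it
	 * alone, so this is safe for both read and write faults.
	 */
	try_to_free_swap(page);
	unlock_page(page);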


* Re: [RFC][PATCH 2/5] add SWAP_HAS_CACHE flag to swap_map
  2009-05-26  3:15   ` KAMEZAWA Hiroyuki
@ 2009-05-27  4:02     ` Daisuke Nishimura
  -1 siblings, 0 replies; 48+ messages in thread
From: Daisuke Nishimura @ 2009-05-27  4:02 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, balbir, hugh.dickins, hannes, linux-kernel, Daisuke Nishimura

> @@ -1067,21 +1113,21 @@ static int try_to_unuse(unsigned int typ
>  		}
>  
>  		/*
> -		 * How could swap count reach 0x7fff when the maximum
> -		 * pid is 0x7fff, and there's no way to repeat a swap
> -		 * page within an mm (except in shmem, where it's the
> -		 * shared object which takes the reference count)?
> -		 * We believe SWAP_MAP_MAX cannot occur in Linux 2.4.
> -		 *
> +		 * How could swap count reach 0x7ffe?
> +		 * There's no way to repeat a swap page within an mm
> +		 * (except in shmem, where it's the shared object which takes
> +		 * the reference count).
> +		 * We believe SWAP_MAP_MAX cannot occur. (If it did, unsigned
> +		 * short would be too small....)
>  		 * If that's wrong, then we should worry more about
>  		 * exit_mmap() and do_munmap() cases described above:
>  		 * we might be resetting SWAP_MAP_MAX too early here.
>  		 * We know "Undead"s can happen, they're okay, so don't
>  		 * report them; but do report if we reset SWAP_MAP_MAX.
>  		 */
> -		if (*swap_map == SWAP_MAP_MAX) {
> +		if (swap_count(*swap_map) == SWAP_MAP_MAX) {
>  			spin_lock(&swap_lock);
> -			*swap_map = 1;
> +			*swap_map = make_swap_count(0, 1);
Can we assume the entry has SWAP_HAS_CACHE here ?
Shouldn't we check PageSwapCache beforehand ?

>  			spin_unlock(&swap_lock);
>  			reset_overflow = 1;
>  		}


Thanks,
Daisuke Nishimura.


* Re: [RFC][PATCH 2/5] add SWAP_HAS_CACHE flag to swap_map
  2009-05-27  4:02     ` Daisuke Nishimura
@ 2009-05-27  4:36       ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 48+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-05-27  4:36 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: linux-mm, balbir, hugh.dickins, hannes, linux-kernel

On Wed, 27 May 2009 13:02:46 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> > @@ -1067,21 +1113,21 @@ static int try_to_unuse(unsigned int typ
> >  		}
> >  
> >  		/*
> > -		 * How could swap count reach 0x7fff when the maximum
> > -		 * pid is 0x7fff, and there's no way to repeat a swap
> > -		 * page within an mm (except in shmem, where it's the
> > -		 * shared object which takes the reference count)?
> > -		 * We believe SWAP_MAP_MAX cannot occur in Linux 2.4.
> > -		 *
> > +		 * How could swap count reach 0x7ffe?
> > +		 * There's no way to repeat a swap page within an mm
> > +		 * (except in shmem, where it's the shared object which takes
> > +		 * the reference count).
> > +		 * We believe SWAP_MAP_MAX cannot occur. (If it did, unsigned
> > +		 * short would be too small....)
> >  		 * If that's wrong, then we should worry more about
> >  		 * exit_mmap() and do_munmap() cases described above:
> >  		 * we might be resetting SWAP_MAP_MAX too early here.
> >  		 * We know "Undead"s can happen, they're okay, so don't
> >  		 * report them; but do report if we reset SWAP_MAP_MAX.
> >  		 */
> > -		if (*swap_map == SWAP_MAP_MAX) {
> > +		if (swap_count(*swap_map) == SWAP_MAP_MAX) {
> >  			spin_lock(&swap_lock);
> > -			*swap_map = 1;
> > +			*swap_map = make_swap_count(0, 1);
> Can we assume the entry has SWAP_HAS_CACHE here ?
> Shouldn't we check PageSwapCache beforehand ?
> 

IIUC, in this try_to_unuse() code, the page has been added to the swap cache
and locked before we reach here. But....ah, OK, unuse_mm() may release the
page lock before we reach here. Then...

if (PageSwapCache(page) && swap_count(*swap_map) == SWAP_MAP_MAX)

is right? (Maybe the original code, which sets the count to "1", is also buggy.)

Thanks,
-Kame


> >  			spin_unlock(&swap_lock);
> >  			reset_overflow = 1;
> >  		}
> 
> 
> Thanks,
> Daisuke Nishimura.
> 



* Re: [RFC][PATCH 2/5] add SWAP_HAS_CACHE flag to swap_map
  2009-05-27  4:36       ` KAMEZAWA Hiroyuki
@ 2009-05-27  5:00         ` Daisuke Nishimura
  -1 siblings, 0 replies; 48+ messages in thread
From: Daisuke Nishimura @ 2009-05-27  5:00 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, balbir, hugh.dickins, hannes, linux-kernel, Daisuke Nishimura

On Wed, 27 May 2009 13:36:29 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Wed, 27 May 2009 13:02:46 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> 
> > > @@ -1067,21 +1113,21 @@ static int try_to_unuse(unsigned int typ
> > >  		}
> > >  
> > >  		/*
> > > -		 * How could swap count reach 0x7fff when the maximum
> > > -		 * pid is 0x7fff, and there's no way to repeat a swap
> > > -		 * page within an mm (except in shmem, where it's the
> > > -		 * shared object which takes the reference count)?
> > > -		 * We believe SWAP_MAP_MAX cannot occur in Linux 2.4.
> > > -		 *
> > > +		 * How could swap count reach 0x7ffe?
> > > +		 * There's no way to repeat a swap page within an mm
> > > +		 * (except in shmem, where it's the shared object which takes
> > > +		 * the reference count).
> > > +		 * We believe SWAP_MAP_MAX cannot occur. (If it did, unsigned
> > > +		 * short would be too small....)
> > >  		 * If that's wrong, then we should worry more about
> > >  		 * exit_mmap() and do_munmap() cases described above:
> > >  		 * we might be resetting SWAP_MAP_MAX too early here.
> > >  		 * We know "Undead"s can happen, they're okay, so don't
> > >  		 * report them; but do report if we reset SWAP_MAP_MAX.
> > >  		 */
> > > -		if (*swap_map == SWAP_MAP_MAX) {
> > > +		if (swap_count(*swap_map) == SWAP_MAP_MAX) {
> > >  			spin_lock(&swap_lock);
> > > -			*swap_map = 1;
> > > +			*swap_map = make_swap_count(0, 1);
> > Can we assume the entry has SWAP_HAS_CACHE here ?
> > Shouldn't we check PageSwapCache beforehand ?
> > 
> 
> IIUC, in this try_to_unuse() code, the page has been added to the swap cache
> and locked before we reach here. But....ah, OK, unuse_mm() may release the
> page lock before we reach here. Then...
> 
And the owner process might have removed the swap cache before we took the lock,
as the following comment in try_to_unuse() says.

> if (PageSwapCache(page) && swap_count(*swap_map) == SWAP_MAP_MAX)
> 
> is right? (Maybe the original code, which sets the count to "1", is also buggy.)
> 
Reading the following code in try_to_unuse(), I think

	int valid_swap_cache = !!(PageSwapCache(page) &&
					page_private(page) == entry.val);
	  :
	*swap_map = make_swap_count(0 (or 1?), valid_swap_cache);

might be better.

But I can't confirm it anyway. I've never hit SWAP_MAP_MAX.
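
Putting this exchange together, the overflow reset might end up looking
roughly like this (a sketch only: make_swap_count() and swap_count() are the
helpers added by this patch, and whether the count should be reset to 0 or 1
is exactly the open question above):

	if (swap_count(*swap_map) == SWAP_MAP_MAX) {
		/*
		 * Keep SWAP_HAS_CACHE only if the page really is this
		 * entry's swap cache; the owner may have removed it
		 * while the page lock was dropped.
		 */
		int valid_swap_cache = PageSwapCache(page) &&
					page_private(page) == entry.val;

		spin_lock(&swap_lock);
		*swap_map = make_swap_count(0, valid_swap_cache);
		spin_unlock(&swap_lock);
		reset_overflow = 1;
	}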


Thanks,
Daisuke Nishimura.


* Re: [RFC][PATCH 5/5] (experimental) chase and free cache only swap
  2009-05-26  3:18   ` KAMEZAWA Hiroyuki
@ 2009-05-27  5:14     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 48+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-05-27  5:14 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, balbir, nishimura, hugh.dickins, hannes, linux-kernel

On Tue, 26 May 2009 12:18:34 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> 
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
This is a replacement for it. (Just an idea, not tested.)

I think this works well. Does anyone have concerns?
Do I have to modify the swap-cluster code to do this in a sane way?

---
 mm/swapfile.c |   40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

Index: new-trial-swapcount/mm/swapfile.c
===================================================================
--- new-trial-swapcount.orig/mm/swapfile.c
+++ new-trial-swapcount/mm/swapfile.c
@@ -74,6 +74,26 @@ static inline unsigned short make_swap_c
 	return ret;
 }
 
+static int try_to_reuse_swap(struct swap_info_struct *si, unsigned long offset)
+{
+	int type = si - swap_info;
+	swp_entry_t entry = swp_entry(type, offset);
+	struct page *page;
+
+	page = find_get_page(&swapper_space, entry.val);
+	if (!page)
+		return 0;
+	if (!trylock_page(page)) {
+		page_cache_release(page);
+		return 0;
+	}
+	try_to_free_swap(page);
+	unlock_page(page);
+	page_cache_release(page);
+	return 1;
+}
+
+
 /*
  * We need this because the bdev->unplug_fn can sleep and we cannot
  * hold swap_lock while calling the unplug_fn. And swap_lock
@@ -295,6 +315,18 @@ checks:
 		goto no_page;
 	if (offset > si->highest_bit)
 		scan_base = offset = si->lowest_bit;
+
+	/* reuse swap entry of cache-only swap if not busy. */
+	if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
+		int ret;
+		spin_unlock(&swap_lock);
+		ret = try_to_reuse_swap(si, offset);
+		spin_lock(&swap_lock);
+		if (ret)
+			goto checks; /* we released swap_lock */
+		goto scan;
+	}
+
 	if (si->swap_map[offset])
 		goto scan;
 
@@ -378,6 +410,10 @@ scan:
 			spin_lock(&swap_lock);
 			goto checks;
 		}
+		if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
+			spin_lock(&swap_lock);
+			goto checks;
+		}
 		if (unlikely(--latency_ration < 0)) {
 			cond_resched();
 			latency_ration = LATENCY_LIMIT;
@@ -389,6 +425,10 @@ scan:
 			spin_lock(&swap_lock);
 			goto checks;
 		}
+		if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
+			spin_lock(&swap_lock);
+			goto checks;
+		}
 		if (unlikely(--latency_ration < 0)) {
 			cond_resched();
 			latency_ration = LATENCY_LIMIT;



* Re: [RFC][PATCH 5/5] (experimental) chase and free cache only swap
  2009-05-27  5:14     ` KAMEZAWA Hiroyuki
@ 2009-05-27  6:30       ` Daisuke Nishimura
  -1 siblings, 0 replies; 48+ messages in thread
From: Daisuke Nishimura @ 2009-05-27  6:30 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, balbir, hugh.dickins, hannes, linux-kernel, Daisuke Nishimura

On Wed, 27 May 2009 14:14:42 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 26 May 2009 12:18:34 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > 
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> This is a replacement for it. (Just an idea, not tested.)
> 
> I think this works well. Does anyone have concerns?
I think so too, except some trivial build errors ;)

I'll test it, but it will take a long time to see the effect of this patch
even if I set the swap space to a reasonable size.


Thanks,
Daisuke Nishimura.

> Do I have to modify the swap-cluster code to do this in a sane way?
> 
> ---
>  mm/swapfile.c |   40 ++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 40 insertions(+)
> 
> Index: new-trial-swapcount/mm/swapfile.c
> ===================================================================
> --- new-trial-swapcount.orig/mm/swapfile.c
> +++ new-trial-swapcount/mm/swapfile.c
> @@ -74,6 +74,26 @@ static inline unsigned short make_swap_c
>  	return ret;
>  }
>  
> +static int try_to_reuse_swap(struct swap_info_struct *si, unsigned long offset)
> +{
> +	int type = si - swap_info;
> +	swp_entry_t entry = swp_entry(type, offset);
> +	struct page *page;
> +
> +	page = find_get_page(page);
> +	if (!page)
> +		return 0;
> +	if (!trylock_page(page)) {
> +		page_cache_release(page);
> +		return 0;
> +	}
> +	try_to_free_swap(page);
> +	unlock_page(page);
> +	page_cache_release(page);
> +	return 1;
> +}
> +
> +
>  /*
>   * We need this because the bdev->unplug_fn can sleep and we cannot
>   * hold swap_lock while calling the unplug_fn. And swap_lock
> @@ -295,6 +315,18 @@ checks:
>  		goto no_page;
>  	if (offset > si->highest_bit)
>  		scan_base = offset = si->lowest_bit;
> +
> +	/* reuse swap entry of cache-only swap if not busy. */
> +	if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
> +		int ret;
> +		spin_unlock(&swap_lock);
> +		ret = try_to_reuse_swap(si, offset));
> +		spin_lock(&swap_lock);
> +		if (ret)
> +			goto checks; /* we released swap_lock */
> +		goto scan;
> +	}
> +
>  	if (si->swap_map[offset])
>  		goto scan;
>  
> @@ -378,6 +410,10 @@ scan:
>  			spin_lock(&swap_lock);
>  			goto checks;
>  		}
> +		if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
> +			spin_lock(&swap_lock);
> +			goto checks;
> +		}
>  		if (unlikely(--latency_ration < 0)) {
>  			cond_resched();
>  			latency_ration = LATENCY_LIMIT;
> @@ -389,6 +425,10 @@ scan:
>  			spin_lock(&swap_lock);
>  			goto checks;
>  		}
> +		if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
> +			spin_lock(&swap_lock);
> +			goto checks;
> +		}
>  		if (unlikely(--latency_ration < 0)) {
>  			cond_resched();
>  			latency_ration = LATENCY_LIMIT;
> 


* Re: [RFC][PATCH] memcg: fix swap account (26/May)[0/5]
  2009-05-26  3:12 ` KAMEZAWA Hiroyuki
@ 2009-05-27  6:43   ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 48+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-05-27  6:43 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, balbir, nishimura, hugh.dickins, hannes, linux-kernel

On Tue, 26 May 2009 12:12:59 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> [1/5] change interface of swap_duplicate()/swap_free()
>     Adds the functions swapcache_prepare() and swapcache_free().
> 
> [2/5] add SWAP_HAS_CACHE flag to swap_map
>     Add a SWAP_HAS_CACHE flag to the swap_map array, so that we know that
>     "there is only a swap cache and the swap entry has no references"
>     without calling find_get_page().
> 
> [3/5] Count the number of swap-cache-only swaps
>     After repeating swap-in/out, there are tons of cache-only swaps
>     (via mapped swapcaches while vm_swap_full() == false).
>     This patch counts the number of such entries and shows it in debug
>     information (for example, sysrq-m).
> 
> [4/5] fix memcg's swap accounting.
>     change the memcg's swap accounting logic to see # of references to swap.
> 
> [5/5] experimental garbage collection for cache-only swaps.
>     reclaim swap entries which are not used.
> 

Thank you for all the reviews. I'll repost when the new mmotm comes.
Maybe:
[1/5] ... no change
[2/5] ... fix the bug Nishimura-san pointed out
[3/5] ... drop
[4/5] ... no change
[5/5] ... will use direct reclaim in get_swap_page() (rough sketch below).
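
A very rough sketch of that direction (purely illustrative, not from any
posted patch; reclaim_one_cache_only_swap() and si->cache_only_swaps are
hypothetical names in the spirit of the try_to_reuse_swap() helper above):

	/* in get_swap_page(), when the scan finds no free slot: */
	offset = scan_swap_map(si);
	if (!offset && si->cache_only_swaps) {
		/*
		 * No free slot, but cache-only entries exist (patch [3/5]
		 * counted them): reclaim one in-line and rescan once
		 * instead of failing the allocation.
		 */
		if (reclaim_one_cache_only_swap(si))
			offset = scan_swap_map(si);
	}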

Thanks,
-Kame



* Re: [RFC][PATCH 5/5] (experimental) chase and free cache only swap
  2009-05-27  6:30       ` Daisuke Nishimura
@ 2009-05-27  6:50         ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 48+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-05-27  6:50 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: linux-mm, balbir, hugh.dickins, hannes, linux-kernel

On Wed, 27 May 2009 15:30:24 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> On Wed, 27 May 2009 14:14:42 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Tue, 26 May 2009 12:18:34 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > 
> > > 
> > > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > 
> > This is a replacement for it. (Just an idea, not tested.)
> > 
> > I think this works well. Does anyone have concerns?
> I think so too, except some trivial build errors ;)
> 
Oh, will be fixed in the next post :(

> I'll test it, but it will take a long time to see the effect of this patch
> even if I set the swap space to a reasonable size.
> 
Yes, how to test this is my concern, too.

Thanks,
-Kame


> 
> Thanks,
> Daisuke Nishimura.
> 
> > Do I have to modify the swap-cluster code to do this in a sane way?
> > 
> > ---
> >  mm/swapfile.c |   40 ++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 40 insertions(+)
> > 
> > Index: new-trial-swapcount/mm/swapfile.c
> > ===================================================================
> > --- new-trial-swapcount.orig/mm/swapfile.c
> > +++ new-trial-swapcount/mm/swapfile.c
> > @@ -74,6 +74,26 @@ static inline unsigned short make_swap_c
> >  	return ret;
> >  }
> >  
> > +static int try_to_reuse_swap(struct swap_info_struct *si, unsigned long offset)
> > +{
> > +	int type = si - swap_info;
> > +	swp_entry_t entry = swp_entry(type, offset);
> > +	struct page *page;
> > +
> > +	page = find_get_page(&swapper_space, entry.val);
> > +	if (!page)
> > +		return 0;
> > +	if (!trylock_page(page)) {
> > +		page_cache_release(page);
> > +		return 0;
> > +	}
> > +	try_to_free_swap(page);
> > +	unlock_page(page);
> > +	page_cache_release(page);
> > +	return 1;
> > +}
> > +
> > +
> >  /*
> >   * We need this because the bdev->unplug_fn can sleep and we cannot
> >   * hold swap_lock while calling the unplug_fn. And swap_lock
> > @@ -295,6 +315,18 @@ checks:
> >  		goto no_page;
> >  	if (offset > si->highest_bit)
> >  		scan_base = offset = si->lowest_bit;
> > +
> > +	/* reuse swap entry of cache-only swap if not busy. */
> > +	if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
> > +		int ret;
> > +		spin_unlock(&swap_lock);
> > +		ret = try_to_reuse_swap(si, offset);
> > +		spin_lock(&swap_lock);
> > +		if (ret)
> > +			goto checks; /* we released swap_lock */
> > +		goto scan;
> > +	}
> > +
> >  	if (si->swap_map[offset])
> >  		goto scan;
> >  
> > @@ -378,6 +410,10 @@ scan:
> >  			spin_lock(&swap_lock);
> >  			goto checks;
> >  		}
> > +		if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
> > +			spin_lock(&swap_lock);
> > +			goto checks;
> > +		}
> >  		if (unlikely(--latency_ration < 0)) {
> >  			cond_resched();
> >  			latency_ration = LATENCY_LIMIT;
> > @@ -389,6 +425,10 @@ scan:
> >  			spin_lock(&swap_lock);
> >  			goto checks;
> >  		}
> > +		if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
> > +			spin_lock(&swap_lock);
> > +			goto checks;
> > +		}
> >  		if (unlikely(--latency_ration < 0)) {
> >  			cond_resched();
> >  			latency_ration = LATENCY_LIMIT;
> > 
> 



* Re: [RFC][PATCH 2/5] add SWAP_HAS_CACHE flag to swap_map
  2009-05-26  3:15   ` KAMEZAWA Hiroyuki
@ 2009-05-28  0:41     ` Daisuke Nishimura
  -1 siblings, 0 replies; 48+ messages in thread
From: Daisuke Nishimura @ 2009-05-28  0:41 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, balbir, hugh.dickins, hannes, linux-kernel, Daisuke Nishimura

> @@ -1969,17 +2017,33 @@ int swap_duplicate(swp_entry_t entry)
>  	offset = swp_offset(entry);
>  
>  	spin_lock(&swap_lock);
> -	if (offset < p->max && p->swap_map[offset]) {
> -		if (p->swap_map[offset] < SWAP_MAP_MAX - 1) {
> -			p->swap_map[offset]++;
> +
> +	if (unlikely(offset >= p->max))
> +		goto unlock_out;
> +
> +	count = swap_count(p->swap_map[offset]);
> +	has_cache = swap_has_cache(p->swap_map[offset]);
> +	if (cache) {
> +		/* set SWAP_HAS_CACHE if there is no cache and entry is used */
> +		if (!has_cache && count) {
Should we check !has_cache here ?

A concurrent read_swap_cache_async() might have set SWAP_HAS_CACHE, but not have added
a page to the swap cache yet when find_get_page() was called.
add_to_swap_cache() would handle the race with a concurrent read_swap_cache_async(),
but considering it more, swapcache_free() at the end of the loop might be dangerous in this case...
So I think it should be like:

	read_swap_cache_async()
		:
		valid = swapcache_prepare(entry);
		if (!valid)
			break;
		if (valid == -EAGAIN)
			continue;

to let the context that succeeded in swapcache_prepare() do add_to_swap_cache().


Thanks,
Daisuke Nishimura.

> +			p->swap_map[offset] = make_swap_count(count, 1);
> +			result = 1;
> +		}
> +	} else if (count || has_cache) {
> +		if (count < SWAP_MAP_MAX - 1) {
> +			p->swap_map[offset] = make_swap_count(count + 1,
> +							      has_cache);
>  			result = 1;
> -		} else if (p->swap_map[offset] <= SWAP_MAP_MAX) {
> +		} else if (count <= SWAP_MAP_MAX) {
>  			if (swap_overflow++ < 5)
> -				printk(KERN_WARNING "swap_dup: swap entry overflow\n");
> -			p->swap_map[offset] = SWAP_MAP_MAX;
> +				printk(KERN_WARNING
> +				       "swap_dup: swap entry overflow\n");
> +			p->swap_map[offset] = make_swap_count(SWAP_MAP_MAX,
> +							      has_cache);
>  			result = 1;
>  		}
>  	}
> +unlock_out:
>  	spin_unlock(&swap_lock);
>  out:
>  	return result;


* Re: [RFC][PATCH 2/5] add SWAP_HAS_CACHE flag to swap_map
  2009-05-28  0:41     ` Daisuke Nishimura
@ 2009-05-28  1:05       ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 48+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-05-28  1:05 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: linux-mm, balbir, hugh.dickins, hannes, linux-kernel

On Thu, 28 May 2009 09:41:57 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> > @@ -1969,17 +2017,33 @@ int swap_duplicate(swp_entry_t entry)
> >  	offset = swp_offset(entry);
> >  
> >  	spin_lock(&swap_lock);
> > -	if (offset < p->max && p->swap_map[offset]) {
> > -		if (p->swap_map[offset] < SWAP_MAP_MAX - 1) {
> > -			p->swap_map[offset]++;
> > +
> > +	if (unlikely(offset >= p->max))
> > +		goto unlock_out;
> > +
> > +	count = swap_count(p->swap_map[offset]);
> > +	has_cache = swap_has_cache(p->swap_map[offset]);
> > +	if (cache) {
> > +		/* set SWAP_HAS_CACHE if there is no cache and entry is used */
> > +		if (!has_cache && count) {
> Should we check !has_cache here ?
I added !has_cache to return 0 in the racy case.

> 
> A concurrent read_swap_cache_async() might have set SWAP_HAS_CACHE, but not have added
> a page to the swap cache yet when find_get_page() was called.
yes.

> add_to_swap_cache() would handle the race with a concurrent read_swap_cache_async(),
> but considering it more, swapcache_free() at the end of the loop might be dangerous in this case...

I can't catch what you mean.

I think swapcache_prepare() returns 0 in the racy case and no add_to_swap_cache() happens.
Wrong?

> So I think it should be like:
> 
> 	read_swap_cache_async()
> 		:
> 		valid = swapcache_prepare(entry);
> 		if (!valid)
> 			break;
> 		if (valid == -EAGAIN)
> 			continue;
> 
> to let the context that succeeded in swapcache_prepare() do add_to_swap_cache().
> 

What you recommend is code like this?

==
	ret = swapcache_prepare(entry);
	if (ret == -ENOENT)
		break;    /* unused swap entry */
	if (ret == -EBUSY)
		continue; /* to call find_get_page() again */
==
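
For reference, a sketch of how the whole read_swap_cache_async() loop might
then be structured, assuming the convention above (0 on success, -ENOENT for
an unused entry, -EBUSY for the race) and the swapcache_free() from patch
[1/5], whose exact signature is assumed here; an illustration, not a posted
patch:

	struct page *found_page = NULL, *new_page = NULL;
	int err;

	do {
		found_page = find_get_page(&swapper_space, entry.val);
		if (found_page)
			break;			/* already in swap cache */

		if (!new_page) {
			new_page = alloc_page_vma(gfp_mask, vma, addr);
			if (!new_page)
				break;		/* out of memory */
		}

		err = swapcache_prepare(entry);
		if (err == -ENOENT)
			break;			/* entry became unused */
		if (err == -EBUSY)
			continue;		/* racing cacher: look up again */

		/* we own SWAP_HAS_CACHE now, so the insertion cannot race */
		__set_page_locked(new_page);
		SetPageSwapBacked(new_page);
		err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
		if (!err) {
			lru_cache_add_anon(new_page);
			swap_readpage(NULL, new_page);
			return new_page;
		}
		/* add failed (-ENOMEM): undo and give the entry back */
		ClearPageSwapBacked(new_page);
		__clear_page_locked(new_page);
		swapcache_free(entry, NULL);
	} while (err != -ENOMEM);

	if (new_page)
		page_cache_release(new_page);
	return found_page;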

-Kame







* Re: [RFC][PATCH 2/5] add SWAP_HAS_CACHE flag to swap_map
  2009-05-28  1:05       ` KAMEZAWA Hiroyuki
@ 2009-05-28  1:40         ` Daisuke Nishimura
  -1 siblings, 0 replies; 48+ messages in thread
From: Daisuke Nishimura @ 2009-05-28  1:40 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: nishimura, linux-mm, balbir, hugh.dickins, hannes, linux-kernel

On Thu, 28 May 2009 10:05:01 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Thu, 28 May 2009 09:41:57 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> 
> > > @@ -1969,17 +2017,33 @@ int swap_duplicate(swp_entry_t entry)
> > >  	offset = swp_offset(entry);
> > >  
> > >  	spin_lock(&swap_lock);
> > > -	if (offset < p->max && p->swap_map[offset]) {
> > > -		if (p->swap_map[offset] < SWAP_MAP_MAX - 1) {
> > > -			p->swap_map[offset]++;
> > > +
> > > +	if (unlikely(offset >= p->max))
> > > +		goto unlock_out;
> > > +
> > > +	count = swap_count(p->swap_map[offset]);
> > > +	has_cache = swap_has_cache(p->swap_map[offset]);
> > > +	if (cache) {
> > > +		/* set SWAP_HAS_CACHE if there is no cache and entry is used */
> > > +		if (!has_cache && count) {
> > Should we check !has_cache here ?
> I added !has_cache to return 0 in the racy case.
> 
> > 
> > A concurrent read_swap_cache_async() might have set SWAP_HAS_CACHE, but not have added
> > a page to the swap cache yet when find_get_page() was called.
> yes.
> 
> > add_to_swap_cache() would handle the race with a concurrent read_swap_cache_async(),
> > but considering it more, swapcache_free() at the end of the loop might be dangerous in this case...
> 
> I can't catch what you mean.
> 
> I think swapcache_prepare() returns 0 in the racy case and no add_to_swap_cache() happens.
> Wrong?
> 
Ah, you're right for this version of your patch.
I was talking about the case where we changed swapcache_prepare() simply not to
return 0 in the SWAP_HAS_CACHE case.

> > So I think it should be like:
> > 
> > 	read_swap_cache_async()
> > 		:
> > 		valid = swapcache_prepare(entry);
> > 		if (!valid)
> > 			break;
> > 		if (valid == -EAGAIN)
> > 			continue;
> > 
> > to let the context that succeeded in swapcache_prepare() do add_to_swap_cache().
> > 
> 
> What you recommend is code like this?
> 
> ==
> 	ret = swapcache_prepare(entry);
> 	if (ret == -ENOENT)
> 		break;    /* unused swap entry */
> 	if (ret == -EBUSY)
> 		continue; /* to call find_get_page() again */
> ==
> 
Yes.
With the current version of your patch, read_swap_cache_async() might return NULL
if a concurrent read_swap_cache_async() exists. That differs from the current behavior.
And this means swapin_readahead() might fail (it calls read_swap_cache_async()
twice, though) and can cause an OOM, right?


Thanks,
Daisuke Nishimura.


* Re: [RFC][PATCH 2/5] add SWAP_HAS_CACHE flag to swap_map
  2009-05-28  1:40         ` Daisuke Nishimura
@ 2009-05-28  1:44           ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 48+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-05-28  1:44 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: linux-mm, balbir, hugh.dickins, hannes, linux-kernel

On Thu, 28 May 2009 10:40:13 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> > > So I think it should be like:
> > > 
> > > 	read_swap_cache_async()
> > > 		:
> > > 		valid = swapcache_prepare(entry);
> > > 		if (!valid)
> > > 			break;
> > > 		if (valid == -EAGAIN)
> > > 			continue;
> > > 
> > > to let the context that succeeded in swapcache_prepare() do add_to_swap_cache().
> > > 
> > 
> > What you recommend is code like this?
> > 
> > ==
> > 	ret = swapcache_prepare(entry);
> > 	if (ret == -ENOENT)
> > 		break;    /* unused swap entry */
> > 	if (ret == -EBUSY)
> > 		continue; /* to call find_get_page() again */
> > ==
> > 
> Yes.
> With the current version of your patch, read_swap_cache_async() might return NULL
> if a concurrent read_swap_cache_async() exists. That is different from the current behavior.
> And this means swapin_readahead() might fail (it calls read_swap_cache_async()
> twice, though) and can cause an OOM, right?
> 
Good point. I fixed this in a patch in my box ;)
Thank you for the good review.
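
For reference, a minimal sketch of the callee side under the same
convention; swap_count()/swap_has_cache() and the (cache) branch mirror
the hunk quoted earlier in the thread, while the bit layout, the error
values, and the swapcache_prepare() wrapper are illustrative assumptions,
not the fixed patch:

==
/* assumed encoding: one high bit flags the cache, the rest is a count */
#define SWAP_HAS_CACHE		0x8000
#define swap_count(ent)		((ent) & ~SWAP_HAS_CACHE)
#define swap_has_cache(ent)	((ent) & SWAP_HAS_CACHE)

/* validation of swp_type()/offset and SWAP_MAP_BAD handling elided */
static int __swap_duplicate(struct swap_info_struct *p, unsigned long offset,
			    int cache)
{
	int count, has_cache;
	int err = 0;

	spin_lock(&swap_lock);
	count = swap_count(p->swap_map[offset]);
	has_cache = swap_has_cache(p->swap_map[offset]);

	if (cache) {
		/* set SWAP_HAS_CACHE if there is no cache and entry is used */
		if (!count && !has_cache)
			err = -ENOENT;	/* unused entry: nothing to cache */
		else if (has_cache)
			err = -EBUSY;	/* a racing caller owns the cache */
		else
			p->swap_map[offset] |= SWAP_HAS_CACHE;
	} else {
		/* ordinary reference from a pte: bump the count */
		if (!count && !has_cache)
			err = -ENOENT;	/* cannot duplicate a free entry */
		else if (count < SWAP_MAP_MAX - 1)
			p->swap_map[offset] = has_cache | (count + 1);
		else
			err = -EINVAL;	/* count would overflow */
	}
	spin_unlock(&swap_lock);
	return err;
}

/* swapcache_prepare() is then just the cache flavor of duplication */
int swapcache_prepare(swp_entry_t entry)
{
	return __swap_duplicate(swap_info + swp_type(entry),
				swp_offset(entry), 1);
}
==

With this, a caller that gets 0 knows it exclusively owns the
SWAP_HAS_CACHE bit and is the one responsible for add_to_swap_cache().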

Thanks,
-Kame




^ permalink raw reply	[flat|nested] 48+ messages in thread


end of thread

Thread overview: 48+ messages
-- links below jump to the message on this page --
2009-05-26  3:12 [RFC][PATCH] memcg: fix swap account (26/May)[0/5] KAMEZAWA Hiroyuki
2009-05-26  3:14 ` [RFC][PATCH 1/5] change swap cache interfaces KAMEZAWA Hiroyuki
2009-05-26  3:15 ` [RFC][PATCH 2/5] add SWAP_HAS_CACHE flag to swap_map KAMEZAWA Hiroyuki
2009-05-27  4:02   ` Daisuke Nishimura
2009-05-27  4:36     ` KAMEZAWA Hiroyuki
2009-05-27  5:00       ` Daisuke Nishimura
2009-05-28  0:41   ` Daisuke Nishimura
2009-05-28  1:05     ` KAMEZAWA Hiroyuki
2009-05-28  1:40       ` Daisuke Nishimura
2009-05-28  1:44         ` KAMEZAWA Hiroyuki
2009-05-26  3:16 ` [RFC][PATCH 3/5] count cache-only swaps KAMEZAWA Hiroyuki
2009-05-26 17:37   ` Johannes Weiner
2009-05-26 23:49     ` KAMEZAWA Hiroyuki
2009-05-26  3:17 ` [RFC][PATCH 4/5] memcg: fix swap account KAMEZAWA Hiroyuki
2009-05-26  3:18 ` [RFC][PATCH 5/5] (experimental) chase and free cache only swap KAMEZAWA Hiroyuki
2009-05-26 18:14   ` Johannes Weiner
2009-05-27  0:08     ` KAMEZAWA Hiroyuki
2009-05-27  1:26       ` Johannes Weiner
2009-05-27  1:31         ` KAMEZAWA Hiroyuki
2009-05-27  2:06           ` Johannes Weiner
2009-05-27  5:14   ` KAMEZAWA Hiroyuki
2009-05-27  6:30     ` Daisuke Nishimura
2009-05-27  6:50       ` KAMEZAWA Hiroyuki
2009-05-27  6:43 ` [RFC][PATCH] memcg: fix swap account (26/May)[0/5] KAMEZAWA Hiroyuki
