* [PATCH 0/6 v2] PM / Hibernate: Memory bitmap scalability improvements
@ 2014-07-21 10:26 Joerg Roedel
  2014-07-21 10:26 ` [PATCH 1/6] PM / Hibernate: Create a Radix-Tree to store memory bitmap Joerg Roedel
                   ` (6 more replies)
  0 siblings, 7 replies; 25+ messages in thread
From: Joerg Roedel @ 2014-07-21 10:26 UTC (permalink / raw)
  To: Rafael J. Wysocki, Pavel Machek, Len Brown
  Cc: linux-pm, linux-kernel, Joerg Roedel

Changes v1->v2:

* Rebased to v3.16-rc6
* Fixed the style issues in Patch 1 mentioned by Rafael

Hi,

here is the revised patch set to improve the scalability of
the memory bitmap implementation used for hibernation. The
current implementation does not scale well to machines with
several TB of memory. A resume on those machines may cause
soft lockups to be reported.

These patches improve the data structure by adding a radix
tree to the linked list structure, which improves random
access performance from O(n) to O(log_b(n)), where b depends
on the architecture (b = 512 on amd64, 1024 on i386).
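
As a rough illustration (a user-space sketch, not code from these
patches; it assumes 4 KiB pages and a 64-bit unsigned long), the
branching factor b is just the number of node slots that fit into
one page, and the resulting tree depth stays small even for huge
zones:

	#include <stdio.h>

	#define PAGE_SIZE_ASSUMED	4096UL	/* assumption: 4 KiB pages */
	#define ENTRIES_PER_LEVEL	(PAGE_SIZE_ASSUMED / sizeof(unsigned long))
	#define BITS_PER_BLOCK		(PAGE_SIZE_ASSUMED * 8)	/* bits per bitmap page */

	int main(void)
	{
		/* pfns of a 12TB zone of 4 KiB pages, as in the test below */
		unsigned long pfns   = 12UL << (40 - 12);
		unsigned long blocks = pfns / BITS_PER_BLOCK;
		unsigned int levels  = 0;

		while (blocks) {		/* one radix level per step */
			levels++;
			blocks /= ENTRIES_PER_LEVEL;
		}

		/* prints: b = 512, levels = 2 */
		printf("b = %lu, levels = %u\n",
		       (unsigned long)ENTRIES_PER_LEVEL, levels);
		return 0;
	}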

A test on a 12TB machine showed an improvement in resume
time from 76s with the old implementation to 2.4s with the
radix tree and the improved swsusp_free function. See below
for details of this test.

Patches 1-3 add the radix tree while keeping the existing
memory bitmap implementation in place, and add code to
compare the results of both implementations. This was used
during development to make sure both data structures return
the same results.

Patch 4 re-implements the swsusp_free() function so that it
iterates only over the bits set in the bitmaps instead of
over all pfns. This was shown to scale better on large
memory machines.
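
The difference is essentially that walking only the set bits costs one
step per set bit instead of one step per possible pfn. A minimal
user-space sketch of that idea (using the GCC builtin __builtin_ctzll,
not the kernel's find_next_bit):

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		uint64_t map = 0x8000000000000021ULL;	/* sparse toy bitmap */

		while (map) {
			int bit = __builtin_ctzll(map);	/* lowest set bit */
			printf("bit %d is set\n", bit);	/* visits 0, 5, 63 */
			map &= map - 1;			/* clear that bit */
		}
		return 0;
	}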

Patch 5 removes the old memory bitmap implementation now
that the radix tree is in place and working correctly.

The last patch makes rtree_next_node touch the soft lockup
watchdog. This is necessary because the worst-case
performance (all bits set in the forbidden_pages_map and
free_pages_map) is the same as with the old implementation
and may still cause soft lockups; patch 6 avoids this.

The code was tested on 32- and 64-bit x86 and showed no
issues there.

Below is an example test that shows the performance
improvement on a 12TB machine. First the test with the old
memory bitmap:

# time perf record /usr/sbin/resume $sdev
resume: libgcrypt version: 1.5.0
[ perf record: Woken up 12 times to write data ]
[ perf record: Captured and wrote 2.882 MB perf.data (~125898 samples) ]

real    1m16.043s
user    0m0.016s
sys     0m0.312s
# perf report --stdio |head -50
# Events: 75K cycles
#
# Overhead  Command         Shared Object                                   Symbol
# ........  .......  ....................  ........................................
#
    56.16%   resume  [kernel.kallsyms]     [k] memory_bm_test_bit
    19.35%   resume  [kernel.kallsyms]     [k] swsusp_free
    14.90%   resume  [kernel.kallsyms]     [k] memory_bm_find_bit
     7.28%   resume  [kernel.kallsyms]     [k] swsusp_page_is_forbidden

And here is the same test on the same machine with these
patches applied:

#  time perf record /usr/sbin/resume $sdev
resume: libgcrypt version: 1.5.0
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.039 MB perf.data (~1716 samples) ]

real    0m2.376s
user    0m0.020s
sys     0m0.408s

# perf report --stdio |head -50
# Events: 762  cycles
#
# Overhead  Command      Shared Object                     Symbol
# ........  .......  .................  .........................
#
    34.78%   resume  [kernel.kallsyms]  [k] find_next_bit
    27.03%   resume  [kernel.kallsyms]  [k] clear_page_c_e
     9.70%   resume  [kernel.kallsyms]  [k] mark_nosave_pages
     3.92%   resume  [kernel.kallsyms]  [k] alloc_rtree_node
     2.38%   resume  [kernel.kallsyms]  [k] get_image_page

As can be seen from these results, these patches improve the
scalability significantly. Please review; any comments are
appreciated.

Thanks,

	Joerg

Joerg Roedel (6):
  PM / Hibernate: Create a Radix-Tree to store memory bitmap
  PM / Hibernate: Add memory_rtree_find_bit function
  PM / Hibernate: Implement position keeping in radix tree
  PM / Hibernate: Iterate over set bits instead of PFNs in swsusp_free()
  PM / Hibernate: Remove the old memory-bitmap implementation
  PM / Hibernate: Touch Soft Lockup Watchdog in rtree_next_node

 kernel/power/snapshot.c | 494 +++++++++++++++++++++++++++++++++++-------------
 1 file changed, 367 insertions(+), 127 deletions(-)

-- 
1.9.1



* [PATCH 1/6] PM / Hibernate: Create a Radix-Tree to store memory bitmap
  2014-07-21 10:26 [PATCH 0/6 v2] PM / Hibernate: Memory bitmap scalability improvements Joerg Roedel
@ 2014-07-21 10:26 ` Joerg Roedel
  2014-07-21 22:36   ` Joerg Roedel
  2014-07-21 10:26 ` [PATCH 2/6] PM / Hibernate: Add memory_rtree_find_bit function Joerg Roedel
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 25+ messages in thread
From: Joerg Roedel @ 2014-07-21 10:26 UTC (permalink / raw)
  To: Rafael J. Wysocki, Pavel Machek, Len Brown
  Cc: linux-pm, linux-kernel, Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

This patch adds the code to allocate and build the radix
tree to store the memory bitmap. The old data structure is
left in place until the radix tree implementation is
finished.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 kernel/power/snapshot.c | 223 +++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 222 insertions(+), 1 deletion(-)

diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 1ea328a..cc6bd73 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -248,11 +248,24 @@ static void *chain_alloc(struct chain_allocator *ca, unsigned int size)
  *	information is stored (in the form of a block of bitmap)
  *	It also contains the pfns that correspond to the start and end of
  *	the represented memory area.
+ *
+ *	The memory bitmap is organized as a radix tree to guarantee fast random
+ *	access to the bits. There is one radix tree for each zone (as returned
+ *	from create_mem_extents).
+ *
+ *	One radix tree is represented by one struct mem_zone_bm_rtree. There are
+ *	two linked lists for the nodes of the tree, one for the inner nodes and
+ *	one for the leaf nodes. The linked leaf nodes are used for fast linear
+ *	access of the memory bitmap.
+ *
+ *	The struct rtree_node represents one node of the radix tree.
  */
 
 #define BM_END_OF_MAP	(~0UL)
 
 #define BM_BITS_PER_BLOCK	(PAGE_SIZE * BITS_PER_BYTE)
+#define BM_BLOCK_SHIFT		(PAGE_SHIFT + 3)
+#define BM_BLOCK_MASK		((1UL << BM_BLOCK_SHIFT) - 1)
 
 struct bm_block {
 	struct list_head hook;	/* hook into a list of bitmap blocks */
@@ -266,6 +279,31 @@ static inline unsigned long bm_block_bits(struct bm_block *bb)
 	return bb->end_pfn - bb->start_pfn;
 }
 
+/*
+ * struct rtree_node is a wrapper struct to link the nodes
+ * of the rtree together for easy linear iteration over
+ * bits and easy freeing
+ */
+struct rtree_node {
+	struct list_head list;
+	unsigned long *data;
+};
+
+/*
+ * struct mem_zone_bm_rtree represents a bitmap used for one
+ * populated memory zone.
+ */
+struct mem_zone_bm_rtree {
+	struct list_head list;		/* Link Zones together         */
+	struct list_head nodes;		/* Radix Tree inner nodes      */
+	struct list_head leaves;	/* Radix Tree leaves           */
+	unsigned long start_pfn;	/* Zone start page frame       */
+	unsigned long end_pfn;		/* Zone end page frame + 1     */
+	struct rtree_node *rtree;	/* Radix Tree Root             */
+	int levels;			/* Number of Radix Tree Levels */
+	unsigned int blocks;		/* Number of Bitmap Blocks     */
+};
+
 /* struct bm_position is used for browsing memory bitmaps */
 
 struct bm_position {
@@ -274,6 +312,7 @@ struct bm_position {
 };
 
 struct memory_bitmap {
+	struct list_head zones;
 	struct list_head blocks;	/* list of bitmap blocks */
 	struct linked_page *p_list;	/* list of pages used to store zone
 					 * bitmap objects and bitmap block
@@ -284,6 +323,166 @@ struct memory_bitmap {
 
 /* Functions that operate on memory bitmaps */
 
+#define BM_ENTRIES_PER_LEVEL	(PAGE_SIZE / sizeof(unsigned long))
+#if BITS_PER_LONG == 32
+#define BM_RTREE_LEVEL_SHIFT	(PAGE_SHIFT - 2)
+#else
+#define BM_RTREE_LEVEL_SHIFT	(PAGE_SHIFT - 3)
+#endif
+#define BM_RTREE_LEVEL_MASK	((1UL << BM_RTREE_LEVEL_SHIFT) - 1)
+
+/*
+ *	alloc_rtree_node - Allocate a new node and add it to the radix tree.
+ *
+ *	This function is used to allocate inner nodes as well as the
+ *	leaf nodes of the radix tree. It also adds the node to the
+ *	corresponding linked list passed in by the *list parameter.
+ */
+static struct rtree_node *alloc_rtree_node(gfp_t gfp_mask, int safe_needed,
+					   struct chain_allocator *ca,
+					   struct list_head *list)
+{
+	struct rtree_node *node;
+
+	node = chain_alloc(ca, sizeof(struct rtree_node));
+	if (!node)
+		return NULL;
+
+	node->data = get_image_page(gfp_mask, safe_needed);
+	if (!node->data)
+		return NULL;
+
+	list_add_tail(&node->list, list);
+
+	return node;
+}
+
+/*
+ *	add_rtree_block - Add a new leaf node to the radix tree
+ *
+ *	The leaf nodes need to be allocated in order to keep the leaves
+ *	linked list in order. This is guaranteed by the zone->blocks
+ *	counter.
+ */
+static int add_rtree_block(struct mem_zone_bm_rtree *zone, gfp_t gfp_mask,
+			   int safe_needed, struct chain_allocator *ca)
+{
+	struct rtree_node *node, *block, **dst;
+	unsigned int levels_needed, block_nr;
+	int i;
+
+	block_nr      = zone->blocks;
+	levels_needed = 0;
+
+	/* How many levels do we need for this block nr? */
+	while (block_nr) {
+		levels_needed += 1;
+		block_nr >>= BM_RTREE_LEVEL_SHIFT;
+	}
+
+	/* Make sure the rtree has enough levels */
+	for (i = zone->levels; i < levels_needed; i++) {
+		node = alloc_rtree_node(gfp_mask, safe_needed, ca,
+					&zone->nodes);
+		if (!node)
+			return -ENOMEM;
+
+		node->data[0] = (unsigned long)zone->rtree;
+		zone->rtree   = node;
+		zone->levels += 1;
+	}
+
+	/* Allocate new block */
+	block = alloc_rtree_node(gfp_mask, safe_needed, ca, &zone->leaves);
+	if (!block)
+		return -ENOMEM;
+
+	/* Now walk the rtree to insert the block */
+	node     = zone->rtree;
+	dst      = &zone->rtree;
+	block_nr = zone->blocks;
+	for (i = zone->levels; i > 0; i--) {
+		int index;
+
+		if (!node) {
+			node = alloc_rtree_node(gfp_mask, safe_needed, ca,
+						&zone->nodes);
+			if (!node)
+				return -ENOMEM;
+			*dst = node;
+		}
+
+		index  = block_nr >> ((i - 1) * BM_RTREE_LEVEL_SHIFT);
+		index &= BM_RTREE_LEVEL_MASK;
+		dst    = (struct rtree_node **)&((*dst)->data[index]);
+		node   = *dst;
+	}
+
+	zone->blocks += 1;
+	*dst          = block;
+
+	return 0;
+}
+
+static void free_zone_bm_rtree(struct mem_zone_bm_rtree *zone,
+			       int clear_nosave_free);
+
+/*
+ *	create_zone_bm_rtree - create a radix tree for one zone
+ *
+ *	Allocates the mem_zone_bm_rtree structure and initializes it.
+ *	This function also allocates and builds the radix tree for the
+ *	zone.
+ */
+static struct mem_zone_bm_rtree *
+create_zone_bm_rtree(gfp_t gfp_mask, int safe_needed,
+		     struct chain_allocator *ca,
+		     unsigned long start, unsigned long end)
+{
+	struct mem_zone_bm_rtree *zone;
+	unsigned int i, nr_blocks;
+	unsigned long pages;
+
+	pages = end - start;
+	zone  = chain_alloc(ca, sizeof(struct mem_zone_bm_rtree));
+	if (!zone)
+		return NULL;
+
+	INIT_LIST_HEAD(&zone->nodes);
+	INIT_LIST_HEAD(&zone->leaves);
+	zone->start_pfn = start;
+	zone->end_pfn   = end;
+	nr_blocks       = DIV_ROUND_UP(pages, BM_BITS_PER_BLOCK);
+
+	for (i = 0; i < nr_blocks; i++) {
+		if (add_rtree_block(zone, gfp_mask, safe_needed, ca)) {
+			free_zone_bm_rtree(zone, PG_UNSAFE_CLEAR);
+			return NULL;
+		}
+	}
+
+	return zone;
+}
+
+/*
+ *	free_zone_bm_rtree - Free the memory of the radix tree
+ *
+ *	Free all node pages of the radix tree. The mem_zone_bm_rtree
+ *	structure itself is not freed here nor are the rtree_node
+ *	structs.
+ */
+static void free_zone_bm_rtree(struct mem_zone_bm_rtree *zone,
+			       int clear_nosave_free)
+{
+	struct rtree_node *node;
+
+	list_for_each_entry(node, &zone->nodes, list)
+		free_image_page(node->data, clear_nosave_free);
+
+	list_for_each_entry(node, &zone->leaves, list)
+		free_image_page(node->data, clear_nosave_free);
+}
+
 static void memory_bm_position_reset(struct memory_bitmap *bm)
 {
 	bm->cur.block = list_entry(bm->blocks.next, struct bm_block, hook);
@@ -408,12 +607,14 @@ memory_bm_create(struct memory_bitmap *bm, gfp_t gfp_mask, int safe_needed)
 
 	chain_init(&ca, gfp_mask, safe_needed);
 	INIT_LIST_HEAD(&bm->blocks);
+	INIT_LIST_HEAD(&bm->zones);
 
 	error = create_mem_extents(&mem_extents, gfp_mask);
 	if (error)
 		return error;
 
 	list_for_each_entry(ext, &mem_extents, hook) {
+		struct mem_zone_bm_rtree *zone;
 		struct bm_block *bb;
 		unsigned long pfn = ext->start;
 		unsigned long pages = ext->end - ext->start;
@@ -441,6 +642,12 @@ memory_bm_create(struct memory_bitmap *bm, gfp_t gfp_mask, int safe_needed)
 			}
 			bb->end_pfn = pfn;
 		}
+
+		zone = create_zone_bm_rtree(gfp_mask, safe_needed, &ca,
+					    ext->start, ext->end);
+		if (!zone)
+			goto Error;
+		list_add_tail(&zone->list, &bm->zones);
 	}
 
 	bm->p_list = ca.chain;
@@ -460,14 +667,19 @@ memory_bm_create(struct memory_bitmap *bm, gfp_t gfp_mask, int safe_needed)
   */
 static void memory_bm_free(struct memory_bitmap *bm, int clear_nosave_free)
 {
+	struct mem_zone_bm_rtree *zone;
 	struct bm_block *bb;
 
 	list_for_each_entry(bb, &bm->blocks, hook)
 		if (bb->data)
 			free_image_page(bb->data, clear_nosave_free);
 
+	list_for_each_entry(zone, &bm->zones, list)
+		free_zone_bm_rtree(zone, clear_nosave_free);
+
 	free_list_of_pages(bm->p_list, clear_nosave_free);
 
+	INIT_LIST_HEAD(&bm->zones);
 	INIT_LIST_HEAD(&bm->blocks);
 }
 
@@ -816,12 +1028,21 @@ void free_basic_memory_bitmaps(void)
 
 unsigned int snapshot_additional_pages(struct zone *zone)
 {
+	unsigned int rtree, nodes;
 	unsigned int res;
 
 	res = DIV_ROUND_UP(zone->spanned_pages, BM_BITS_PER_BLOCK);
 	res += DIV_ROUND_UP(res * sizeof(struct bm_block),
 			    LINKED_PAGE_DATA_SIZE);
-	return 2 * res;
+	rtree = nodes = DIV_ROUND_UP(zone->spanned_pages, BM_BITS_PER_BLOCK);
+	rtree += DIV_ROUND_UP(rtree * sizeof(struct rtree_node),
+			      LINKED_PAGE_DATA_SIZE);
+	while (nodes > 1) {
+		nodes = DIV_ROUND_UP(nodes, BM_ENTRIES_PER_LEVEL);
+		rtree += nodes;
+	}
+
+	return 2 * (res + rtree);
 }
 
 #ifdef CONFIG_HIGHMEM
-- 
1.9.1



* [PATCH 2/6] PM / Hibernate: Add memory_rtree_find_bit function
  2014-07-21 10:26 [PATCH 0/6 v2] PM / Hibernate: Memory bitmap scalability improvements Joerg Roedel
  2014-07-21 10:26 ` [PATCH 1/6] PM / Hibernate: Create a Radix-Tree to store memory bitmap Joerg Roedel
@ 2014-07-21 10:26 ` Joerg Roedel
  2014-07-21 10:26 ` [PATCH 3/6] PM / Hibernate: Implement position keeping in radix tree Joerg Roedel
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 25+ messages in thread
From: Joerg Roedel @ 2014-07-21 10:26 UTC (permalink / raw)
  To: Rafael J. Wysocki, Pavel Machek, Len Brown
  Cc: linux-pm, linux-kernel, Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

Add a function to find a bit in the radix tree for a given
pfn. Also add code to the memory bitmap wrapper functions to
use the radix tree together with the existing memory bitmap
implementation.

On read accesses, compare the results of both bitmaps to make
sure the radix tree behaves the same way.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 kernel/power/snapshot.c | 84 +++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 81 insertions(+), 3 deletions(-)

diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index cc6bd73..d5eea1b 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -720,6 +720,56 @@ static int memory_bm_find_bit(struct memory_bitmap *bm, unsigned long pfn,
 	return 0;
 }
 
+/*
+ *	memory_rtree_find_bit - Find the bit for pfn in the memory
+ *				bitmap
+ *
+ *	Walks the radix tree to find the page which contains the bit for
+ *	pfn and returns the bit position in **addr and *bit_nr.
+ */
+static int memory_rtree_find_bit(struct memory_bitmap *bm, unsigned long pfn,
+				 void **addr, unsigned int *bit_nr)
+{
+	struct mem_zone_bm_rtree *curr, *zone;
+	struct rtree_node *node;
+	int i, block_nr;
+
+	zone = NULL;
+
+	/* Find the right zone */
+	list_for_each_entry(curr, &bm->zones, list) {
+		if (pfn >= curr->start_pfn && pfn < curr->end_pfn) {
+			zone = curr;
+			break;
+		}
+	}
+
+	if (!zone)
+		return -EFAULT;
+
+	/*
+	 * We have a zone. Now walk the radix tree to find the leaf
+	 * node for our pfn.
+	 */
+	node      = zone->rtree;
+	block_nr  = (pfn - zone->start_pfn) >> BM_BLOCK_SHIFT;
+
+	for (i = zone->levels; i > 0; i--) {
+		int index;
+
+		index  = block_nr >> ((i - 1) * BM_RTREE_LEVEL_SHIFT);
+		index &= BM_RTREE_LEVEL_MASK;
+		BUG_ON(node->data[index] == 0);
+		node = (struct rtree_node *)node->data[index];
+	}
+
+	/* Set return values */
+	*addr   = node->data;
+	*bit_nr = (pfn - zone->start_pfn) & BM_BLOCK_MASK;
+
+	return 0;
+}
+
 static void memory_bm_set_bit(struct memory_bitmap *bm, unsigned long pfn)
 {
 	void *addr;
@@ -729,6 +779,10 @@ static void memory_bm_set_bit(struct memory_bitmap *bm, unsigned long pfn)
 	error = memory_bm_find_bit(bm, pfn, &addr, &bit);
 	BUG_ON(error);
 	set_bit(bit, addr);
+
+	error = memory_rtree_find_bit(bm, pfn, &addr, &bit);
+	BUG_ON(error);
+	set_bit(bit, addr);
 }
 
 static int mem_bm_set_bit_check(struct memory_bitmap *bm, unsigned long pfn)
@@ -740,6 +794,13 @@ static int mem_bm_set_bit_check(struct memory_bitmap *bm, unsigned long pfn)
 	error = memory_bm_find_bit(bm, pfn, &addr, &bit);
 	if (!error)
 		set_bit(bit, addr);
+	else
+		return error;
+
+	error = memory_rtree_find_bit(bm, pfn, &addr, &bit);
+	if (!error)
+		set_bit(bit, addr);
+
 	return error;
 }
 
@@ -752,25 +813,42 @@ static void memory_bm_clear_bit(struct memory_bitmap *bm, unsigned long pfn)
 	error = memory_bm_find_bit(bm, pfn, &addr, &bit);
 	BUG_ON(error);
 	clear_bit(bit, addr);
+
+	error = memory_rtree_find_bit(bm, pfn, &addr, &bit);
+	BUG_ON(error);
+	clear_bit(bit, addr);
 }
 
 static int memory_bm_test_bit(struct memory_bitmap *bm, unsigned long pfn)
 {
 	void *addr;
 	unsigned int bit;
-	int error;
+	int error, error2;
+	int v;
 
 	error = memory_bm_find_bit(bm, pfn, &addr, &bit);
 	BUG_ON(error);
-	return test_bit(bit, addr);
+	v = test_bit(bit, addr);
+
+	error2 = memory_rtree_find_bit(bm, pfn, &addr, &bit);
+	BUG_ON(error2);
+
+	WARN_ON_ONCE(v != test_bit(bit, addr));
+
+	return v;
 }
 
 static bool memory_bm_pfn_present(struct memory_bitmap *bm, unsigned long pfn)
 {
 	void *addr;
 	unsigned int bit;
+	int present;
+
+	present = !memory_bm_find_bit(bm, pfn, &addr, &bit);
+
+	WARN_ON_ONCE(present != !memory_rtree_find_bit(bm, pfn, &addr, &bit));
 
-	return !memory_bm_find_bit(bm, pfn, &addr, &bit);
+	return present;
 }
 
 /**
-- 
1.9.1



* [PATCH 3/6] PM / Hibernate: Implement position keeping in radix tree
  2014-07-21 10:26 [PATCH 0/6 v2] PM / Hibernate: Memory bitmap scalability improvements Joerg Roedel
  2014-07-21 10:26 ` [PATCH 1/6] PM / Hibernate: Create a Radix-Tree to store memory bitmap Joerg Roedel
  2014-07-21 10:26 ` [PATCH 2/6] PM / Hibernate: Add memory_rtree_find_bit function Joerg Roedel
@ 2014-07-21 10:26 ` Joerg Roedel
  2014-07-21 10:27 ` [PATCH 4/6] PM / Hibernate: Iterate over set bits instead of PFNs in swsusp_free() Joerg Roedel
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 25+ messages in thread
From: Joerg Roedel @ 2014-07-21 10:26 UTC (permalink / raw)
  To: Rafael J. Wysocki, Pavel Machek, Len Brown
  Cc: linux-pm, linux-kernel, Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

Add code to remember the last position that was requested in
the radix tree. Use it as a cache for faster linear walking
of the bitmap in the memory_bm_rtree_next_pfn() function,
which is also added with this patch.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 kernel/power/snapshot.c | 102 +++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 100 insertions(+), 2 deletions(-)

diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index d5eea1b..6c09879 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -309,6 +309,11 @@ struct mem_zone_bm_rtree {
 struct bm_position {
 	struct bm_block *block;
 	int bit;
+
+	struct mem_zone_bm_rtree *zone;
+	struct rtree_node *node;
+	unsigned long node_pfn;
+	int node_bit;
 };
 
 struct memory_bitmap {
@@ -487,6 +492,13 @@ static void memory_bm_position_reset(struct memory_bitmap *bm)
 {
 	bm->cur.block = list_entry(bm->blocks.next, struct bm_block, hook);
 	bm->cur.bit = 0;
+
+	bm->cur.zone = list_entry(bm->zones.next, struct mem_zone_bm_rtree,
+				  list);
+	bm->cur.node = list_entry(bm->cur.zone->leaves.next,
+				  struct rtree_node, list);
+	bm->cur.node_pfn = 0;
+	bm->cur.node_bit = 0;
 }
 
 static void memory_bm_free(struct memory_bitmap *bm, int clear_nosave_free);
@@ -734,6 +746,11 @@ static int memory_rtree_find_bit(struct memory_bitmap *bm, unsigned long pfn,
 	struct rtree_node *node;
 	int i, block_nr;
 
+	zone = bm->cur.zone;
+
+	if (pfn >= zone->start_pfn && pfn < zone->end_pfn)
+		goto zone_found;
+
 	zone = NULL;
 
 	/* Find the right zone */
@@ -747,10 +764,16 @@ static int memory_rtree_find_bit(struct memory_bitmap *bm, unsigned long pfn,
 	if (!zone)
 		return -EFAULT;
 
+zone_found:
 	/*
 	 * We have a zone. Now walk the radix tree to find the leaf
 	 * node for our pfn.
 	 */
+
+	node = bm->cur.node;
+	if (((pfn - zone->start_pfn) & ~BM_BLOCK_MASK) == bm->cur.node_pfn)
+		goto node_found;
+
 	node      = zone->rtree;
 	block_nr  = (pfn - zone->start_pfn) >> BM_BLOCK_SHIFT;
 
@@ -763,9 +786,15 @@ static int memory_rtree_find_bit(struct memory_bitmap *bm, unsigned long pfn,
 		node = (struct rtree_node *)node->data[index];
 	}
 
+node_found:
+	/* Update last position */
+	bm->cur.zone     = zone;
+	bm->cur.node     = node;
+	bm->cur.node_pfn = (pfn - zone->start_pfn) & ~BM_BLOCK_MASK;
+
 	/* Set return values */
-	*addr   = node->data;
-	*bit_nr = (pfn - zone->start_pfn) & BM_BLOCK_MASK;
+	*addr            = node->data;
+	*bit_nr          = (pfn - zone->start_pfn) & BM_BLOCK_MASK;
 
 	return 0;
 }
@@ -860,11 +889,16 @@ static bool memory_bm_pfn_present(struct memory_bitmap *bm, unsigned long pfn)
  *	this function.
  */
 
+static unsigned long memory_bm_rtree_next_pfn(struct memory_bitmap *bm);
+
 static unsigned long memory_bm_next_pfn(struct memory_bitmap *bm)
 {
+	unsigned long rtree_pfn;
 	struct bm_block *bb;
 	int bit;
 
+	rtree_pfn = memory_bm_rtree_next_pfn(bm);
+
 	bb = bm->cur.block;
 	do {
 		bit = bm->cur.bit;
@@ -878,13 +912,77 @@ static unsigned long memory_bm_next_pfn(struct memory_bitmap *bm)
 	} while (&bb->hook != &bm->blocks);
 
 	memory_bm_position_reset(bm);
+	WARN_ON_ONCE(rtree_pfn != BM_END_OF_MAP);
 	return BM_END_OF_MAP;
 
  Return_pfn:
+	WARN_ON_ONCE(bb->start_pfn + bit != rtree_pfn);
 	bm->cur.bit = bit + 1;
 	return bb->start_pfn + bit;
 }
 
+/*
+ *	rtree_next_node - Jumps to the next leaf node
+ *
+ *	Sets the position to the beginning of the next node in the
+ *	memory bitmap. This is either the next node in the current
+ *	zone's radix tree or the first node in the radix tree of the
+ *	next zone.
+ *
+ *	Returns true if there is a next node, false otherwise.
+ */
+static bool rtree_next_node(struct memory_bitmap *bm)
+{
+	bm->cur.node = list_entry(bm->cur.node->list.next,
+				  struct rtree_node, list);
+	if (&bm->cur.node->list != &bm->cur.zone->leaves) {
+		bm->cur.node_pfn += BM_BITS_PER_BLOCK;
+		bm->cur.node_bit  = 0;
+		return true;
+	}
+
+	/* No more nodes, goto next zone */
+	bm->cur.zone = list_entry(bm->cur.zone->list.next,
+				  struct mem_zone_bm_rtree, list);
+	if (&bm->cur.zone->list != &bm->zones) {
+		bm->cur.node = list_entry(bm->cur.zone->leaves.next,
+					  struct rtree_node, list);
+		bm->cur.node_pfn = 0;
+		bm->cur.node_bit = 0;
+		return true;
+	}
+
+	/* No more zones */
+	return false;
+}
+
+/*
+ *	memory_bm_rtree_next_pfn - Find the next set bit
+ *
+ *	Starting from the last returned position this function searches
+ *	for the next set bit in the memory bitmap and returns its
+ *	number. If no more bits are set, BM_END_OF_MAP is returned.
+ */
+static unsigned long memory_bm_rtree_next_pfn(struct memory_bitmap *bm)
+{
+	unsigned long bits, pfn, pages;
+	int bit;
+
+	do {
+		pages	  = bm->cur.zone->end_pfn - bm->cur.zone->start_pfn;
+		bits      = min(pages - bm->cur.node_pfn, BM_BITS_PER_BLOCK);
+		bit	  = find_next_bit(bm->cur.node->data, bits,
+					  bm->cur.node_bit);
+		if (bit < bits) {
+			pfn = bm->cur.zone->start_pfn + bm->cur.node_pfn + bit;
+			bm->cur.node_bit = bit + 1;
+			return pfn;
+		}
+	} while (rtree_next_node(bm));
+
+	return BM_END_OF_MAP;
+}
+
 /**
  *	This structure represents a range of page frames the contents of which
  *	should not be saved during the suspend.
-- 
1.9.1



* [PATCH 4/6] PM / Hibernate: Iterate over set bits instead of PFNs in swsusp_free()
  2014-07-21 10:26 [PATCH 0/6 v2] PM / Hibernate: Memory bitmap scalability improvements Joerg Roedel
                   ` (2 preceding siblings ...)
  2014-07-21 10:26 ` [PATCH 3/6] PM / Hibernate: Implement position keeping in radix tree Joerg Roedel
@ 2014-07-21 10:27 ` Joerg Roedel
  2014-07-21 10:27 ` [PATCH 5/6] PM / Hibernate: Remove the old memory-bitmap implementation Joerg Roedel
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 25+ messages in thread
From: Joerg Roedel @ 2014-07-21 10:27 UTC (permalink / raw)
  To: Rafael J. Wysocki, Pavel Machek, Len Brown
  Cc: linux-pm, linux-kernel, Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

The existing implementation of swsusp_free iterates over all
pfns in the system and checks every bit in the two memory
bitmaps.

This doesn't scale very well with large numbers of pfns,
especially when the bitmaps are not populated very densely.
Change the algorithm to iterate over the set bits in the
bitmaps instead to make it scale better in large memory
configurations.

Also add a memory_bm_clear_current() helper function that
clears the bit for the last position returned from the
memory bitmap.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 kernel/power/snapshot.c | 53 +++++++++++++++++++++++++++++++++++--------------
 1 file changed, 38 insertions(+), 15 deletions(-)

diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 6c09879..83a6c61 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -848,6 +848,17 @@ static void memory_bm_clear_bit(struct memory_bitmap *bm, unsigned long pfn)
 	clear_bit(bit, addr);
 }
 
+static void memory_bm_clear_current(struct memory_bitmap *bm)
+{
+	int bit;
+
+	bit  = max(bm->cur.node_bit - 1, 0);
+	clear_bit(bit, bm->cur.node->data);
+
+	bit = max(bm->cur.bit - 1, 0);
+	clear_bit(bit, bm->cur.block->data);
+}
+
 static int memory_bm_test_bit(struct memory_bitmap *bm, unsigned long pfn)
 {
 	void *addr;
@@ -1491,23 +1502,35 @@ static struct memory_bitmap copy_bm;
 
 void swsusp_free(void)
 {
-	struct zone *zone;
-	unsigned long pfn, max_zone_pfn;
+	unsigned long fb_pfn, fr_pfn;
 
-	for_each_populated_zone(zone) {
-		max_zone_pfn = zone_end_pfn(zone);
-		for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++)
-			if (pfn_valid(pfn)) {
-				struct page *page = pfn_to_page(pfn);
-
-				if (swsusp_page_is_forbidden(page) &&
-				    swsusp_page_is_free(page)) {
-					swsusp_unset_page_forbidden(page);
-					swsusp_unset_page_free(page);
-					__free_page(page);
-				}
-			}
+	memory_bm_position_reset(forbidden_pages_map);
+	memory_bm_position_reset(free_pages_map);
+
+loop:
+	fr_pfn = memory_bm_next_pfn(free_pages_map);
+	fb_pfn = memory_bm_next_pfn(forbidden_pages_map);
+
+	/*
+	 * Find the next bit set in both bitmaps. This is guaranteed to
+	 * terminate when fb_pfn == fr_pfn == BM_END_OF_MAP.
+	 */
+	do {
+		if (fb_pfn < fr_pfn)
+			fb_pfn = memory_bm_next_pfn(forbidden_pages_map);
+		if (fr_pfn < fb_pfn)
+			fr_pfn = memory_bm_next_pfn(free_pages_map);
+	} while (fb_pfn != fr_pfn);
+
+	if (fr_pfn != BM_END_OF_MAP && pfn_valid(fr_pfn)) {
+		struct page *page = pfn_to_page(fr_pfn);
+
+		memory_bm_clear_current(forbidden_pages_map);
+		memory_bm_clear_current(free_pages_map);
+		__free_page(page);
+		goto loop;
 	}
+
 	nr_copy_pages = 0;
 	nr_meta_pages = 0;
 	restore_pblist = NULL;
-- 
1.9.1



* [PATCH 5/6] PM / Hibernate: Remove the old memory-bitmap implementation
  2014-07-21 10:26 [PATCH 0/6 v2] PM / Hibernate: Memory bitmap scalability improvements Joerg Roedel
                   ` (3 preceding siblings ...)
  2014-07-21 10:27 ` [PATCH 4/6] PM / Hibernate: Iterate over set bits instead of PFNs in swsusp_free() Joerg Roedel
@ 2014-07-21 10:27 ` Joerg Roedel
  2014-07-21 10:27 ` [PATCH 6/6] PM / Hibernate: Touch Soft Lockup Watchdog in rtree_next_node Joerg Roedel
  2014-07-21 12:00 ` [PATCH 0/6 v2] PM / Hibernate: Memory bitmap scalability improvements Pavel Machek
  6 siblings, 0 replies; 25+ messages in thread
From: Joerg Roedel @ 2014-07-21 10:27 UTC (permalink / raw)
  To: Rafael J. Wysocki, Pavel Machek, Len Brown
  Cc: linux-pm, linux-kernel, Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

The radix tree implementation has been proven to work the
same as the old implementation. So the old implementation can be
removed to finish the switch to the radix tree for the
memory bitmaps.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 kernel/power/snapshot.c | 223 +++++-------------------------------------------
 1 file changed, 21 insertions(+), 202 deletions(-)

diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 83a6c61..6a4e07e 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -267,18 +267,6 @@ static void *chain_alloc(struct chain_allocator *ca, unsigned int size)
 #define BM_BLOCK_SHIFT		(PAGE_SHIFT + 3)
 #define BM_BLOCK_MASK		((1UL << BM_BLOCK_SHIFT) - 1)
 
-struct bm_block {
-	struct list_head hook;	/* hook into a list of bitmap blocks */
-	unsigned long start_pfn;	/* pfn represented by the first bit */
-	unsigned long end_pfn;	/* pfn represented by the last bit plus 1 */
-	unsigned long *data;	/* bitmap representing pages */
-};
-
-static inline unsigned long bm_block_bits(struct bm_block *bb)
-{
-	return bb->end_pfn - bb->start_pfn;
-}
-
 /*
  * struct rtree_node is a wrapper struct to link the nodes
  * of the rtree together for easy linear iteration over
@@ -307,9 +295,6 @@ struct mem_zone_bm_rtree {
 /* struct bm_position is used for browsing memory bitmaps */
 
 struct bm_position {
-	struct bm_block *block;
-	int bit;
-
 	struct mem_zone_bm_rtree *zone;
 	struct rtree_node *node;
 	unsigned long node_pfn;
@@ -318,7 +303,6 @@ struct bm_position {
 
 struct memory_bitmap {
 	struct list_head zones;
-	struct list_head blocks;	/* list of bitmap blocks */
 	struct linked_page *p_list;	/* list of pages used to store zone
 					 * bitmap objects and bitmap block
 					 * objects
@@ -490,9 +474,6 @@ static void free_zone_bm_rtree(struct mem_zone_bm_rtree *zone,
 
 static void memory_bm_position_reset(struct memory_bitmap *bm)
 {
-	bm->cur.block = list_entry(bm->blocks.next, struct bm_block, hook);
-	bm->cur.bit = 0;
-
 	bm->cur.zone = list_entry(bm->zones.next, struct mem_zone_bm_rtree,
 				  list);
 	bm->cur.node = list_entry(bm->cur.zone->leaves.next,
@@ -503,30 +484,6 @@ static void memory_bm_position_reset(struct memory_bitmap *bm)
 
 static void memory_bm_free(struct memory_bitmap *bm, int clear_nosave_free);
 
-/**
- *	create_bm_block_list - create a list of block bitmap objects
- *	@pages - number of pages to track
- *	@list - list to put the allocated blocks into
- *	@ca - chain allocator to be used for allocating memory
- */
-static int create_bm_block_list(unsigned long pages,
-				struct list_head *list,
-				struct chain_allocator *ca)
-{
-	unsigned int nr_blocks = DIV_ROUND_UP(pages, BM_BITS_PER_BLOCK);
-
-	while (nr_blocks-- > 0) {
-		struct bm_block *bb;
-
-		bb = chain_alloc(ca, sizeof(struct bm_block));
-		if (!bb)
-			return -ENOMEM;
-		list_add(&bb->hook, list);
-	}
-
-	return 0;
-}
-
 struct mem_extent {
 	struct list_head hook;
 	unsigned long start;
@@ -618,7 +575,6 @@ memory_bm_create(struct memory_bitmap *bm, gfp_t gfp_mask, int safe_needed)
 	int error;
 
 	chain_init(&ca, gfp_mask, safe_needed);
-	INIT_LIST_HEAD(&bm->blocks);
 	INIT_LIST_HEAD(&bm->zones);
 
 	error = create_mem_extents(&mem_extents, gfp_mask);
@@ -627,38 +583,13 @@ memory_bm_create(struct memory_bitmap *bm, gfp_t gfp_mask, int safe_needed)
 
 	list_for_each_entry(ext, &mem_extents, hook) {
 		struct mem_zone_bm_rtree *zone;
-		struct bm_block *bb;
-		unsigned long pfn = ext->start;
-		unsigned long pages = ext->end - ext->start;
-
-		bb = list_entry(bm->blocks.prev, struct bm_block, hook);
-
-		error = create_bm_block_list(pages, bm->blocks.prev, &ca);
-		if (error)
-			goto Error;
-
-		list_for_each_entry_continue(bb, &bm->blocks, hook) {
-			bb->data = get_image_page(gfp_mask, safe_needed);
-			if (!bb->data) {
-				error = -ENOMEM;
-				goto Error;
-			}
-
-			bb->start_pfn = pfn;
-			if (pages >= BM_BITS_PER_BLOCK) {
-				pfn += BM_BITS_PER_BLOCK;
-				pages -= BM_BITS_PER_BLOCK;
-			} else {
-				/* This is executed only once in the loop */
-				pfn += pages;
-			}
-			bb->end_pfn = pfn;
-		}
 
 		zone = create_zone_bm_rtree(gfp_mask, safe_needed, &ca,
 					    ext->start, ext->end);
-		if (!zone)
+		if (!zone) {
+			error = -ENOMEM;
 			goto Error;
+		}
 		list_add_tail(&zone->list, &bm->zones);
 	}
 
@@ -680,11 +611,6 @@ memory_bm_create(struct memory_bitmap *bm, gfp_t gfp_mask, int safe_needed)
 static void memory_bm_free(struct memory_bitmap *bm, int clear_nosave_free)
 {
 	struct mem_zone_bm_rtree *zone;
-	struct bm_block *bb;
-
-	list_for_each_entry(bb, &bm->blocks, hook)
-		if (bb->data)
-			free_image_page(bb->data, clear_nosave_free);
 
 	list_for_each_entry(zone, &bm->zones, list)
 		free_zone_bm_rtree(zone, clear_nosave_free);
@@ -692,55 +618,20 @@ static void memory_bm_free(struct memory_bitmap *bm, int clear_nosave_free)
 	free_list_of_pages(bm->p_list, clear_nosave_free);
 
 	INIT_LIST_HEAD(&bm->zones);
-	INIT_LIST_HEAD(&bm->blocks);
 }
 
 /**
- *	memory_bm_find_bit - find the bit in the bitmap @bm that corresponds
- *	to given pfn.  The cur_zone_bm member of @bm and the cur_block member
- *	of @bm->cur_zone_bm are updated.
- */
-static int memory_bm_find_bit(struct memory_bitmap *bm, unsigned long pfn,
-				void **addr, unsigned int *bit_nr)
-{
-	struct bm_block *bb;
-
-	/*
-	 * Check if the pfn corresponds to the current bitmap block and find
-	 * the block where it fits if this is not the case.
-	 */
-	bb = bm->cur.block;
-	if (pfn < bb->start_pfn)
-		list_for_each_entry_continue_reverse(bb, &bm->blocks, hook)
-			if (pfn >= bb->start_pfn)
-				break;
-
-	if (pfn >= bb->end_pfn)
-		list_for_each_entry_continue(bb, &bm->blocks, hook)
-			if (pfn >= bb->start_pfn && pfn < bb->end_pfn)
-				break;
-
-	if (&bb->hook == &bm->blocks)
-		return -EFAULT;
-
-	/* The block has been found */
-	bm->cur.block = bb;
-	pfn -= bb->start_pfn;
-	bm->cur.bit = pfn + 1;
-	*bit_nr = pfn;
-	*addr = bb->data;
-	return 0;
-}
-
-/*
- *	memory_rtree_find_bit - Find the bit for pfn in the memory
- *				bitmap
+ *	memory_bm_find_bit - Find the bit for pfn in the memory
+ *			     bitmap
  *
- *	Walks the radix tree to find the page which contains the bit for
+ *	Find the bit in the bitmap @bm that corresponds to the given pfn.
+ *	The cur.zone, cur.node and cur.node_pfn members of @bm are
+ *	updated.
+ *	It walks the radix tree to find the page which contains the bit for
  *	pfn and returns the bit position in **addr and *bit_nr.
  */
-static int memory_rtree_find_bit(struct memory_bitmap *bm, unsigned long pfn,
-				 void **addr, unsigned int *bit_nr)
+static int memory_bm_find_bit(struct memory_bitmap *bm, unsigned long pfn,
+			      void **addr, unsigned int *bit_nr)
 {
 	struct mem_zone_bm_rtree *curr, *zone;
 	struct rtree_node *node;
@@ -808,10 +699,6 @@ static void memory_bm_set_bit(struct memory_bitmap *bm, unsigned long pfn)
 	error = memory_bm_find_bit(bm, pfn, &addr, &bit);
 	BUG_ON(error);
 	set_bit(bit, addr);
-
-	error = memory_rtree_find_bit(bm, pfn, &addr, &bit);
-	BUG_ON(error);
-	set_bit(bit, addr);
 }
 
 static int mem_bm_set_bit_check(struct memory_bitmap *bm, unsigned long pfn)
@@ -823,12 +710,6 @@ static int mem_bm_set_bit_check(struct memory_bitmap *bm, unsigned long pfn)
 	error = memory_bm_find_bit(bm, pfn, &addr, &bit);
 	if (!error)
 		set_bit(bit, addr);
-	else
-		return error;
-
-	error = memory_rtree_find_bit(bm, pfn, &addr, &bit);
-	if (!error)
-		set_bit(bit, addr);
 
 	return error;
 }
@@ -842,10 +723,6 @@ static void memory_bm_clear_bit(struct memory_bitmap *bm, unsigned long pfn)
 	error = memory_bm_find_bit(bm, pfn, &addr, &bit);
 	BUG_ON(error);
 	clear_bit(bit, addr);
-
-	error = memory_rtree_find_bit(bm, pfn, &addr, &bit);
-	BUG_ON(error);
-	clear_bit(bit, addr);
 }
 
 static void memory_bm_clear_current(struct memory_bitmap *bm)
@@ -854,82 +731,25 @@ static void memory_bm_clear_current(struct memory_bitmap *bm)
 
 	bit  = max(bm->cur.node_bit - 1, 0);
 	clear_bit(bit, bm->cur.node->data);
-
-	bit = max(bm->cur.bit - 1, 0);
-	clear_bit(bit, bm->cur.block->data);
 }
 
 static int memory_bm_test_bit(struct memory_bitmap *bm, unsigned long pfn)
 {
 	void *addr;
 	unsigned int bit;
-	int error, error2;
-	int v;
+	int error;
 
 	error = memory_bm_find_bit(bm, pfn, &addr, &bit);
 	BUG_ON(error);
-	v = test_bit(bit, addr);
-
-	error2 = memory_rtree_find_bit(bm, pfn, &addr, &bit);
-	BUG_ON(error2);
-
-	WARN_ON_ONCE(v != test_bit(bit, addr));
-
-	return v;
+	return test_bit(bit, addr);
 }
 
 static bool memory_bm_pfn_present(struct memory_bitmap *bm, unsigned long pfn)
 {
 	void *addr;
 	unsigned int bit;
-	int present;
-
-	present = !memory_bm_find_bit(bm, pfn, &addr, &bit);
-
-	WARN_ON_ONCE(present != !memory_rtree_find_bit(bm, pfn, &addr, &bit));
 
-	return present;
-}
-
-/**
- *	memory_bm_next_pfn - find the pfn that corresponds to the next set bit
- *	in the bitmap @bm.  If the pfn cannot be found, BM_END_OF_MAP is
- *	returned.
- *
- *	It is required to run memory_bm_position_reset() before the first call to
- *	this function.
- */
-
-static unsigned long memory_bm_rtree_next_pfn(struct memory_bitmap *bm);
-
-static unsigned long memory_bm_next_pfn(struct memory_bitmap *bm)
-{
-	unsigned long rtree_pfn;
-	struct bm_block *bb;
-	int bit;
-
-	rtree_pfn = memory_bm_rtree_next_pfn(bm);
-
-	bb = bm->cur.block;
-	do {
-		bit = bm->cur.bit;
-		bit = find_next_bit(bb->data, bm_block_bits(bb), bit);
-		if (bit < bm_block_bits(bb))
-			goto Return_pfn;
-
-		bb = list_entry(bb->hook.next, struct bm_block, hook);
-		bm->cur.block = bb;
-		bm->cur.bit = 0;
-	} while (&bb->hook != &bm->blocks);
-
-	memory_bm_position_reset(bm);
-	WARN_ON_ONCE(rtree_pfn != BM_END_OF_MAP);
-	return BM_END_OF_MAP;
-
- Return_pfn:
-	WARN_ON_ONCE(bb->start_pfn + bit != rtree_pfn);
-	bm->cur.bit = bit + 1;
-	return bb->start_pfn + bit;
+	return !memory_bm_find_bit(bm, pfn, &addr, &bit);
 }
 
 /*
@@ -967,14 +787,17 @@ static bool rtree_next_node(struct memory_bitmap *bm)
 	return false;
 }
 
-/*
- *	memory_bm_rtree_next_pfn - Find the next set bit
+/**
+ *	memory_bm_rtree_next_pfn - Find the next set bit in the bitmap @bm
  *
  *	Starting from the last returned position this function searches
  *	for the next set bit in the memory bitmap and returns its
 *	number. If no more bits are set, BM_END_OF_MAP is returned.
+ *
+ *	It is required to run memory_bm_position_reset() before the
+ *	first call to this function.
  */
-static unsigned long memory_bm_rtree_next_pfn(struct memory_bitmap *bm)
+static unsigned long memory_bm_next_pfn(struct memory_bitmap *bm)
 {
 	unsigned long bits, pfn, pages;
 	int bit;
@@ -1216,11 +1039,7 @@ void free_basic_memory_bitmaps(void)
 unsigned int snapshot_additional_pages(struct zone *zone)
 {
 	unsigned int rtree, nodes;
-	unsigned int res;
 
-	res = DIV_ROUND_UP(zone->spanned_pages, BM_BITS_PER_BLOCK);
-	res += DIV_ROUND_UP(res * sizeof(struct bm_block),
-			    LINKED_PAGE_DATA_SIZE);
 	rtree = nodes = DIV_ROUND_UP(zone->spanned_pages, BM_BITS_PER_BLOCK);
 	rtree += DIV_ROUND_UP(rtree * sizeof(struct rtree_node),
 			      LINKED_PAGE_DATA_SIZE);
@@ -1229,7 +1048,7 @@ unsigned int snapshot_additional_pages(struct zone *zone)
 		rtree += nodes;
 	}
 
-	return 2 * (res + rtree);
+	return 2 * rtree;
 }
 
 #ifdef CONFIG_HIGHMEM
-- 
1.9.1



* [PATCH 6/6] PM / Hibernate: Touch Soft Lockup Watchdog in rtree_next_node
  2014-07-21 10:26 [PATCH 0/6 v2] PM / Hibernate: Memory bitmap scalability improvements Joerg Roedel
                   ` (4 preceding siblings ...)
  2014-07-21 10:27 ` [PATCH 5/6] PM / Hibernate: Remove the old memory-bitmap implementation Joerg Roedel
@ 2014-07-21 10:27 ` Joerg Roedel
  2014-07-21 12:00 ` [PATCH 0/6 v2] PM / Hibernate: Memory bitmap scalability improvements Pavel Machek
  6 siblings, 0 replies; 25+ messages in thread
From: Joerg Roedel @ 2014-07-21 10:27 UTC (permalink / raw)
  To: Rafael J. Wysocki, Pavel Machek, Len Brown
  Cc: linux-pm, linux-kernel, Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

When a memory bitmap is fully populated on a large memory
machine (several TB of RAM), it can take more than a minute
to walk through all the bits. This causes the soft lockup
detector on these machines to report warnings.

Avoid this by touching the soft lockup watchdog in the
memory bitmap walking code.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 kernel/power/snapshot.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 6a4e07e..738d930 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -769,6 +769,7 @@ static bool rtree_next_node(struct memory_bitmap *bm)
 	if (&bm->cur.node->list != &bm->cur.zone->leaves) {
 		bm->cur.node_pfn += BM_BITS_PER_BLOCK;
 		bm->cur.node_bit  = 0;
+		touch_softlockup_watchdog();
 		return true;
 	}
 
-- 
1.9.1



* Re: [PATCH 0/6 v2] PM / Hibernate: Memory bitmap scalability improvements
  2014-07-21 10:26 [PATCH 0/6 v2] PM / Hibernate: Memory bitmap scalability improvements Joerg Roedel
                   ` (5 preceding siblings ...)
  2014-07-21 10:27 ` [PATCH 6/6] PM / Hibernate: Touch Soft Lockup Watchdog in rtree_next_node Joerg Roedel
@ 2014-07-21 12:00 ` Pavel Machek
  2014-07-21 12:36   ` Joerg Roedel
  6 siblings, 1 reply; 25+ messages in thread
From: Pavel Machek @ 2014-07-21 12:00 UTC (permalink / raw)
  To: Joerg Roedel; +Cc: Rafael J. Wysocki, Len Brown, linux-pm, linux-kernel

Hi!

> * Rebased to v3.16-rc6
> * Fixed the style issues in Patch 1 mentioned by Rafael
> 
> Hi,
> 
> here is the revised patch set to improve the scalability of
> the memory bitmap implementation used for hibernation. The
> current implementation does not scale well to machines with
> several TB of memory. A resume on those machines may cause
> soft lockups to be reported.
> 
> These patches improve the data structure by adding a radix
> tree to the linked list structure, which improves random
> access performance from O(n) to O(log_b(n)), where b depends
> on the architecture (b = 512 on amd64, 1024 on i386).

Why are we doing random access there?

Is the improvement from the fact that normally very little memory is
used on big memory machines?

> A test on a 12TB machine showed an improvement in resume
> time from 76s with the old implementation to 2.4s with the
> radix tree and the improved swsusp_free function. See below
> for details of this test.

Actually... how long does it take to hibernate a 12TB machine? That
should be many hours, right? You just can't hibernate a machine that
big.

> The last patch makes rtree_next_node touch the soft lockup
> watchdog. This is necessary because the worst-case
> performance (all bits set in the forbidden_pages_map and
> free_pages_map) is the same as with the old implementation
> and may still cause soft lockups; patch 6 avoids this.

Ok, so what about a simpler patch? Just touch the watchdog?

An additional 70 seconds will be lost in the noise if you write 12TB
of RAM to (even quite fast) disk.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: [PATCH 0/6 v2] PM / Hibernate: Memory bitmap scalability improvements
  2014-07-21 12:00 ` [PATCH 0/6 v2] PM / Hibernate: Memory bitmap scalability improvements Pavel Machek
@ 2014-07-21 12:36   ` Joerg Roedel
  2014-07-21 13:06     ` Pavel Machek
  0 siblings, 1 reply; 25+ messages in thread
From: Joerg Roedel @ 2014-07-21 12:36 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Rafael J. Wysocki, Len Brown, linux-pm, linux-kernel

Hi,

On Mon, Jul 21, 2014 at 02:00:53PM +0200, Pavel Machek wrote:
> > These patches improve the data structure by adding a radix
> > tree to the linked list structure, which improves random
> > access performance from O(n) to O(log_b(n)), where b depends
> > on the architecture (b = 512 on amd64, 1024 on i386).
> 
> Why are we doing random access there?

Mostly because the bits in the bitmaps need to be set, cleared and
tested, basically at every place that calls one of the swsusp_*page*
functions.

> Is the improvement from the fact that normally very little memory is
> used on big memory machines?

That is one part of the optimization, yes. As I wrote in Patch 6, the
worst-case performance is still the same as with the old implementation,
hence it is still necessary to touch the soft lockup watchdog.

> Actually... how long does it take to hibernate a 12TB machine? That
> should be many hours, right? You just can't hibernate a machine that
> big.

Sorry, I didn't run the tests on these big machines myself as I don't
have access to them; I relied on our partner to do that and report back
the results.

> > The last patch makes rtree_next_node touch the soft lockup
> > watchdog. This is necessary because the worst-case
> > performance (all bits set in the forbidden_pages_map and
> > free_pages_map) is the same as with the old implementation
> > and may still cause soft lockups; patch 6 avoids this.
> 
> Ok, so what about a simpler patch? Just touch the watchdog?

That would just paper over the problem that the bitmap data structure
and the algorithm in swsusp_free do not scale well on bigmem machines.

If you want to test the correctness of these patches yourself, you can
test with only patches 1-4. This will run hibernation with both bitmap
implementations in parallel and trigger a WARN_ON_ONCE when any
difference is found.

I did that with different configurations (64- and 32-bit kernels, in KVM
and on real hardware) and found no issues with only patches 1-4 applied.

> An additional 70 seconds will be lost in the noise if you write 12TB
> of RAM to (even quite fast) disk.

Sure, but you would still get the soft lockup warnings when swsusp_free
runs at the end.


	Joerg



* Re: [PATCH 0/6 v2] PM / Hibernate: Memory bitmap scalability improvements
  2014-07-21 12:36   ` Joerg Roedel
@ 2014-07-21 13:06     ` Pavel Machek
  2014-07-21 13:38       ` Joerg Roedel
  0 siblings, 1 reply; 25+ messages in thread
From: Pavel Machek @ 2014-07-21 13:06 UTC (permalink / raw)
  To: Joerg Roedel; +Cc: Rafael J. Wysocki, Len Brown, linux-pm, linux-kernel

Hi!

> > > The last patch makes rtree_next_node touch the soft lockup
> > > watchdog. This is necessary because the worst-case
> > > performance (all bits set in the forbidden_pages_map and
> > > free_pages_map) is the same as with the old implementation
> > > and may still cause soft lockups; patch 6 avoids this.
> > 
> > Ok, so what about a simpler patch? Just touch the watchdog?
> 
> That would just paper over the problem that the bitmap data structure and the
> algorithm in swsusp_free do not scale well on bigmem machines.

And is it a problem? Hibernation of a 12TB machine will take 6 hours if
you back your swap with SSDs.

Does not scale == burns additional 60 seconds of CPU time. I think we
can live with that...

...because no one sane will hibernate a 12TB machine.

> > An additional 70 seconds will be lost in the noise if you write 12TB
> > of RAM to (even quite fast) disk.
> 
> Sure, but you would still get the soft lockup warnings when swsusp_free
> runs at the end.

Yes, that's why I propose to apply just patch 6 -- to avoid soft
lockup warnings.

Best regards,
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: [PATCH 0/6 v2] PM / Hibernate: Memory bitmap scalability improvements
  2014-07-21 13:06     ` Pavel Machek
@ 2014-07-21 13:38       ` Joerg Roedel
  2014-07-21 14:10         ` Pavel Machek
  0 siblings, 1 reply; 25+ messages in thread
From: Joerg Roedel @ 2014-07-21 13:38 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Rafael J. Wysocki, Len Brown, linux-pm, linux-kernel

Hi,

On Mon, Jul 21, 2014 at 03:06:29PM +0200, Pavel Machek wrote:
> > That would just paper over the problem that the bitmap data structure and the
> > algorithm in swsusp_free do not scale well on bigmem machines.
> 
> And is it a problem? Hibernation of a 12TB machine will take 6 hours if
> you back your swap with SSDs.
> 
> Does not scale == burns additional 60 seconds of CPU time. I think we
> can live with that...

The problem is that these 76s are burned every time, whether you just use
500MB or the full 12TB of the machine.

The next problem is that the bitmaps are allocated (and need to be freed)
without even being sure that a resume will happen.

So when you boot the kernel with 'resume=/dev/something' on the command
line, it will always take these 76s on such a machine just to allocate
and free the bitmaps for nothing.

> ...because no one sane will hibernate a 12TB machine.

And Linux is only made for sane people? That's pretty new to me ;-)

> Yes, that's why I propose to apply just patch 6 -- to avoid soft
> lockup warnings.

Patch 6 alone would only paper over the soft lockup problem, but the
other patches actually improve the resume and boot situation on those
machines.


	Joerg



* Re: [PATCH 0/6 v2] PM / Hibernate: Memory bitmap scalability improvements
  2014-07-21 13:38       ` Joerg Roedel
@ 2014-07-21 14:10         ` Pavel Machek
  2014-07-21 16:03           ` Joerg Roedel
  0 siblings, 1 reply; 25+ messages in thread
From: Pavel Machek @ 2014-07-21 14:10 UTC (permalink / raw)
  To: Joerg Roedel; +Cc: Rafael J. Wysocki, Len Brown, linux-pm, linux-kernel

Hi!

> > > That would just paper over the problem that the bitmap data structure and the
> > > algorithm in swsusp_free do not scale well on bigmem machines.
> > 
> > And is it a problem? Hibernation of a 12TB machine will take 6 hours if
> > you back your swap with SSDs.
> > 
> > Does not scale == burns additional 60 seconds of CPU time. I think we
> > can live with that...
> 
> The problem is that these 76s are burned every time, whether you just use
> 500MB or the full 12TB of the machine.

On a 12TB machine, how much memory is used after boot finishes? I
suspect it is not 500MB.

> The next problem is that the bitmaps are allocated (and need to be freed)
> without even being sure that a resume will happen.

Could we fix that instead? Allocating bitmaps before we see the
hibernation signature is a bug.

[How long does a 12TB machine take to boot?]

> > ...because no one sane will hibernate a 12TB machine.
> 
> And Linux is only made for sane people? That's pretty new to me ;-)

Yeah, but please don't optimize for insane people :-).

> > Yes, that's why I propose to apply just patch 6 -- to avoid soft
> > lockup warnings.
> 
> Patch 6 alone would only paper over the soft lockup problem, but the
> other patches actually improve the resume and boot situation on those
> machines.

Yeah. Provide timings for a 12TB machine suspending/resuming and we may
talk about it.

#6 is a bugfix. We should take it. The rest is a performance improvement
for an insane config. I don't think we should take it.

								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: [PATCH 0/6 v2] PM / Hibernate: Memory bitmap scalability improvements
  2014-07-21 14:10         ` Pavel Machek
@ 2014-07-21 16:03           ` Joerg Roedel
  2014-07-21 23:05             ` Pavel Machek
  0 siblings, 1 reply; 25+ messages in thread
From: Joerg Roedel @ 2014-07-21 16:03 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Rafael J. Wysocki, Len Brown, linux-pm, linux-kernel

Hi,

On Mon, Jul 21, 2014 at 04:10:27PM +0200, Pavel Machek wrote:
> > And Linux is only made for sane people? That's pretty new to me ;-)
> 
> Yeah, but please don't optimize for insane people :-).

That's rather discriminating, even more so when you realize that 'insane'
is a very subjective attribute.

> #6 is a bugfix. We should take it. The rest is a performance improvement
> for an insane config. I don't think we should take it.

Have you any other reason against these changes besides your
proven-to-be-wrong disbelief that it is necessary?


	Joerg



* Re: [PATCH 1/6] PM / Hibernate: Create a Radix-Tree to store memory bitmap
  2014-07-21 10:26 ` [PATCH 1/6] PM / Hibernate: Create a Radix-Tree to store memory bitmap Joerg Roedel
@ 2014-07-21 22:36   ` Joerg Roedel
  2014-07-21 23:05     ` Pavel Machek
  0 siblings, 1 reply; 25+ messages in thread
From: Joerg Roedel @ 2014-07-21 22:36 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Rafael J. Wysocki, Len Brown, linux-pm, linux-kernel, Joerg Roedel

Pavel,

On Mon, Jul 21, 2014 at 12:26:57PM +0200, Joerg Roedel wrote:
>  unsigned int snapshot_additional_pages(struct zone *zone)
>  {
> +	unsigned int rtree, nodes;
>  	unsigned int res;
>  
>  	res = DIV_ROUND_UP(zone->spanned_pages, BM_BITS_PER_BLOCK);
>  	res += DIV_ROUND_UP(res * sizeof(struct bm_block),
>  			    LINKED_PAGE_DATA_SIZE);
> -	return 2 * res;
> +	rtree = nodes = DIV_ROUND_UP(zone->spanned_pages, BM_BITS_PER_BLOCK);
> +	rtree += DIV_ROUND_UP(rtree * sizeof(struct rtree_node),
> +			      LINKED_PAGE_DATA_SIZE);
> +	while (nodes > 1) {
> +		nodes = DIV_ROUND_UP(nodes, BM_ENTRIES_PER_LEVEL);
> +		rtree += nodes;
> +	}
> +
> +	return 2 * (res + rtree);
>  }

Since you asked in another mail if I took the new data structure size
requirements into account, here is the code I added for this
computation.

Note that the above diff leaves the old code around because the old
memory bitmap implementation is not removed before patch 5/6 (which also
removes the old computation). So the final code will only have the new
size calculation in it.
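
To give a feel for the numbers, here is a back-of-the-envelope example
(mine, not from the patch; it assumes 4 KiB pages and a 64-bit kernel,
so BM_BITS_PER_BLOCK = 32768, BM_ENTRIES_PER_LEVEL = 512,
sizeof(struct rtree_node) = 16 and LINKED_PAGE_DATA_SIZE of roughly
4088) for a zone with 2^30 spanned pages, i.e. 4 TiB:

	rtree = nodes = DIV_ROUND_UP(2^30, 32768)  = 32768  /* leaf pages       */
	rtree += DIV_ROUND_UP(32768 * 16, 4088)    -> +129  /* rtree_node pages */
	nodes  = DIV_ROUND_UP(32768, 512) = 64     -> +64   /* inner nodes      */
	nodes  = DIV_ROUND_UP(64, 512)    = 1      -> +1    /* root node        */

	return 2 * 32962 = 65924 pages, roughly 257 MiB for both bitmaps.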

Do you have any more concerns?


	Joerg



* Re: [PATCH 0/6 v2] PM / Hibernate: Memory bitmap scalability improvements
  2014-07-21 16:03           ` Joerg Roedel
@ 2014-07-21 23:05             ` Pavel Machek
  2014-07-22  0:41               ` Rafael J. Wysocki
  0 siblings, 1 reply; 25+ messages in thread
From: Pavel Machek @ 2014-07-21 23:05 UTC (permalink / raw)
  To: Joerg Roedel; +Cc: Rafael J. Wysocki, Len Brown, linux-pm, linux-kernel

Hi!

On Mon 2014-07-21 18:03:46, Joerg Roedel wrote:
> On Mon, Jul 21, 2014 at 04:10:27PM +0200, Pavel Machek wrote:
> > > And Linux is only made for sane people? That's pretty new to me ;-)
> > 
> > Yeah, but please don't optimize for insane people :-).
> 
> That's rather discriminating, even more so when you realize that 'insane'
> is a very subjective attribute.

You admitted you don't even have access to the machine you are trying to
optimize for.

> > #6 is a bugfix. We should take it. The rest is a performance improvement
> > for an insane config. I don't think we should take it.
> 
> Do you have any other reason against these changes besides your
> disbelief, already proven wrong, that they are necessary?

Where have you proven me wrong?

If the kernel is preparing bitmaps before checking the resume signature,
that is a bug that costs you 60 seconds of boot time on your 12TB
machine. Fix that bug. It will benefit everyone.

The software watchdog firing during hibernation on a 12TB machine is a
bug that hits insane people. OK, let's fix it.
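
Something along these lines in the long bitmap walk should be enough
(sketch only -- the helper names are illustrative, the real change is
what patch 6 does in rtree_next_node):

	while (more_work()) {
		/* tell the soft-lockup detector that this is an
		 * expected, legitimately long-running loop */
		touch_softlockup_watchdog();
		process_next_chunk();
	}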

Optimizing hibernation for the memory-is-empty case is wrong. Linux
tries to use as much memory as it can.

Hibernating a 12TB machine will take 5 hours or more. Redoing data
structures to save 60 seconds of a 5-hour process is not worth it.
Nobody sane would hibernate a 12TB machine.

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: [PATCH 1/6] PM / Hibernate: Create a Radix-Tree to store memory bitmap
  2014-07-21 22:36   ` Joerg Roedel
@ 2014-07-21 23:05     ` Pavel Machek
  0 siblings, 0 replies; 25+ messages in thread
From: Pavel Machek @ 2014-07-21 23:05 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Rafael J. Wysocki, Len Brown, linux-pm, linux-kernel, Joerg Roedel

On Tue 2014-07-22 00:36:37, Joerg Roedel wrote:
> Pavel,
> 
> On Mon, Jul 21, 2014 at 12:26:57PM +0200, Joerg Roedel wrote:
> >  unsigned int snapshot_additional_pages(struct zone *zone)
> >  {
> > +	unsigned int rtree, nodes;
> >  	unsigned int res;
> >  
> >  	res = DIV_ROUND_UP(zone->spanned_pages, BM_BITS_PER_BLOCK);
> >  	res += DIV_ROUND_UP(res * sizeof(struct bm_block),
> >  			    LINKED_PAGE_DATA_SIZE);
> > -	return 2 * res;
> > +	rtree = nodes = DIV_ROUND_UP(zone->spanned_pages, BM_BITS_PER_BLOCK);
> > +	rtree += DIV_ROUND_UP(rtree * sizeof(struct rtree_node),
> > +			      LINKED_PAGE_DATA_SIZE);
> > +	while (nodes > 1) {
> > +		nodes = DIV_ROUND_UP(nodes, BM_ENTRIES_PER_LEVEL);
> > +		rtree += nodes;
> > +	}
> > +
> > +	return 2 * (res + rtree);
> >  }
> 
> Since you asked in another mail if I took the new data structure size
> requirements into account, here is the code I added for this
> computation.

Yep, thanks, I found it in the meantime.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: [PATCH 0/6 v2] PM / Hibernate: Memory bitmap scalability improvements
  2014-07-21 23:05             ` Pavel Machek
@ 2014-07-22  0:41               ` Rafael J. Wysocki
  2014-07-22 10:34                 ` Joerg Roedel
  0 siblings, 1 reply; 25+ messages in thread
From: Rafael J. Wysocki @ 2014-07-22  0:41 UTC (permalink / raw)
  To: Pavel Machek, Joerg Roedel; +Cc: Len Brown, linux-pm, linux-kernel

On Tuesday, July 22, 2014 01:05:00 AM Pavel Machek wrote:
> Hi!
> 
> On Mon 2014-07-21 18:03:46, Joerg Roedel wrote:
> > On Mon, Jul 21, 2014 at 04:10:27PM +0200, Pavel Machek wrote:
> > > > And Linux is only made for sane people? That's pretty new to me ;-)
> > >
> > > Yeah, but please don't optimize for insane people :-).
> >
> > That's rather discriminating, even more so when you realize that 'insane'
> > is a very subjective attribute.
>
> You admitted you don't even have access to the machine you are trying to
> optimize for.
>
> > > #6 is a bugfix. We should take it. The rest is a performance improvement
> > > for an insane config. I don't think we should take it.
> >
> > Do you have any other reason against these changes besides your
> > disbelief, already proven wrong, that they are necessary?
>
> Where have you proven me wrong?
>
> If the kernel is preparing bitmaps before checking the resume signature,
> that is a bug that costs you 60 seconds of boot time on your 12TB
> machine. Fix that bug. It will benefit everyone.
>
> The software watchdog firing during hibernation on a 12TB machine is a
> bug that hits insane people. OK, let's fix it.
>
> Optimizing hibernation for the memory-is-empty case is wrong. Linux
> tries to use as much memory as it can.
>
> Hibernating a 12TB machine will take 5 hours or more. Redoing data
> structures to save 60 seconds of a 5-hour process is not worth it.
> Nobody sane would hibernate a 12TB machine.

It looks like some specific need motivated Joerg's work, however, so
let's not dismiss the use case lightly without knowing it.

That said, I would like to know how much time we save through this
optimization relative to the total hibernation time on systems with
various amounts of memory (say, 4 GB, 8 GB, 16 GB, 32 GB, more) and
whether or not it makes hibernation slower in any case.

Rafael



* Re: [PATCH 0/6 v2] PM / Hibernate: Memory bitmap scalability improvements
  2014-07-22  0:41               ` Rafael J. Wysocki
@ 2014-07-22 10:34                 ` Joerg Roedel
  2014-07-22 10:55                   ` Pavel Machek
                                     ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Joerg Roedel @ 2014-07-22 10:34 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Pavel Machek, Len Brown, linux-pm, linux-kernel

On Tue, Jul 22, 2014 at 02:41:29AM +0200, Rafael J. Wysocki wrote:
> It looks like some specific need motivated Joerg's work, however, so
> let's not dismiss the use case lightly without knowing it.

The motivation was to optimize the data structures for machines with
large amounts of RAM without penalizing average machines. On a 12TB
machine you are close to 100000 pages just for one bitmap. Scanning
through that linearly to find a given bit just doesn't scale anymore in
this case.
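
For reference, the back-of-the-envelope numbers behind that figure
(assuming 4K pages):

	12TB / 4KB per page    = 3,221,225,472 pfns
	one bit per pfn        = ~384 MB of bitmap
	384 MB / 4KB per page  = 98,304 bitmap pages

With 512 entries per radix-tree level, two internal levels (192 nodes
plus one root) cover all of those leaf pages, so any bit is reached in
three steps instead of a linear walk over a list of ~100000 entries.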

Same for the algorithm currently used in swsusp_free(). Scanning over
every pfn also doesn't scale well anymore in these ranges. I agree that
the optimizations are not noticeable on average systems (see below), but
they are still measurable.
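
The difference boils down to the following minimal user-space sketch;
visit() and the demo bitmap are placeholders, and __builtin_ctzl()
stands in for the kernel's __ffs():

#include <stdio.h>

#define BITS_PER_LONG	(8 * sizeof(unsigned long))

/* placeholder for the real per-page work */
static void visit(unsigned long pfn)
{
	printf("pfn %lu\n", pfn);
}

/* walk only the set bits, skipping zero words entirely, instead of
 * testing every single pfn in the system */
static void walk_set_bits(const unsigned long *map, unsigned long nr_words)
{
	unsigned long i;

	for (i = 0; i < nr_words; i++) {
		unsigned long word = map[i];

		while (word) {
			unsigned long bit = __builtin_ctzl(word);

			visit(i * BITS_PER_LONG + bit);
			word &= word - 1;	/* clear the lowest set bit */
		}
	}
}

int main(void)
{
	unsigned long map[2] = { 0x11, 0x2 };

	walk_set_bits(map, 2);	/* visits pfns 0, 4 and 65 on 64-bit */
	return 0;
}

On a mostly-empty bitmap this touches a handful of words instead of
billions of pfns.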

I also see how the problem could be solved differently, but what I
didn't get from the discussion yet is: What is actually *wrong* with
*this* approach?

> That said I would like to know how much time we save through this
> optimization relative to the total hibernation time on systems with
> various amounts of memory (say, 4 GB, 8 GB, 16 GB, 32 GB, more) and
> whether or not it makes hibernation slower in any case.

Okay, I tested on a 16GB system (actually 15GB, one GB is taken by the
GPU). Since the total time for hibernation depends not so much on the
amount of RAM in the machine as on the size of the hibernation image
and the speed of the disk, there is not much value in measuring a
complete resume cycle. The time needed there depends more on the system
and the workload than anything else.

So my test was to resume from a swap partition that contained no image.
Here is the result from the 16GB machine. First with a v3.16-rc6 kernel
without my changes:

kv:~/base # time perf record /usr/sbin/resume /dev/sda1
resume: libgcrypt version: 1.5.3
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.019 MB perf.data (~823 samples) ]

real    0m0.084s
user    0m0.012s
sys     0m0.064s

Here is the result with my patches on top:

kv:~/hibernate # time perf record /usr/sbin/resume /dev/sda1
resume: libgcrypt version: 1.5.3
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.014 MB perf.data (~602 samples) ]

real    0m0.032s
user    0m0.003s
sys     0m0.027s

So we save around 50ms (or 62% of time) already on this 16GB machine.

Thanks,

	Joerg



* Re: [PATCH 0/6 v2] PM / Hibernate: Memory bitmap scalability improvements
  2014-07-22 10:34                 ` Joerg Roedel
@ 2014-07-22 10:55                   ` Pavel Machek
  2014-07-22 12:24                     ` Joerg Roedel
  2014-07-22 10:58                   ` Pavel Machek
  2014-07-28 13:59                   ` Borislav Petkov
  2 siblings, 1 reply; 25+ messages in thread
From: Pavel Machek @ 2014-07-22 10:55 UTC (permalink / raw)
  To: Joerg Roedel; +Cc: Rafael J. Wysocki, Len Brown, linux-pm, linux-kernel

On Tue 2014-07-22 12:34:44, Joerg Roedel wrote:
> On Tue, Jul 22, 2014 at 02:41:29AM +0200, Rafael J. Wysocki wrote:
> > It looks like some specific need motivated Joerg's work, however, so
> > let's not dismiss the use case lightly without knowing it.
> 
> The motivation was to optimize the data structures for machines with
> large amounts of RAM without penalizing average machines. On a 12TB
> machine you are close to 100000 pages just for one bitmap. Scanning
> through that linearly to find a given bit just doesn't scale anymore in
> this case.

Writing out every single page on a 12TB machine to disk does not scale,
either :-).

> I also see how the problem could be solved differently, but what I
> didn't get from the discussion yet is: What is actually *wrong* with
> *this* approach?

It throws complex, tricky-to-review code at a problem... that is not
a problem in any reasonable configuration.

Now... should I spend half an hour reviewing your changes, or are we
maybe better off without them?

> So we save around 50ms (or 62% of time) already on this 16GB machine.

Or about 5 seeks, or about 0.000% of the total hibernation cycle.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: [PATCH 0/6 v2] PM / Hibernate: Memory bitmap scalability improvements
  2014-07-22 10:34                 ` Joerg Roedel
  2014-07-22 10:55                   ` Pavel Machek
@ 2014-07-22 10:58                   ` Pavel Machek
  2014-07-22 12:10                     ` Joerg Roedel
  2014-07-28 13:59                   ` Borislav Petkov
  2 siblings, 1 reply; 25+ messages in thread
From: Pavel Machek @ 2014-07-22 10:58 UTC (permalink / raw)
  To: Joerg Roedel; +Cc: Rafael J. Wysocki, Len Brown, linux-pm, linux-kernel

On Tue 2014-07-22 12:34:44, Joerg Roedel wrote:
> On Tue, Jul 22, 2014 at 02:41:29AM +0200, Rafael J. Wysocki wrote:
> > It looks like some specific need motivated Joerg's work, however, so
> > let's not dismiss the use case lightly without knowing it.
> 
> The motivation was to optimize the data structures for machines with
> large amounts of RAM without penalizing average machines. On a 12TB
> machine you are close to 100000 pages just for one bitmap. Scanning
> through that linearly to find a given bit just doesn't scale anymore in
> this case.

Can you produce a backtrace showing where a 12TB machine spends time
during boot with a resume= parameter but no suspend image?

AFAICT swsusp_check() does not play with bitmaps.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: [PATCH 0/6 v2] PM / Hibernate: Memory bitmap scalability improvements
  2014-07-22 10:58                   ` Pavel Machek
@ 2014-07-22 12:10                     ` Joerg Roedel
  2014-07-23 10:57                       ` Pavel Machek
  0 siblings, 1 reply; 25+ messages in thread
From: Joerg Roedel @ 2014-07-22 12:10 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Rafael J. Wysocki, Len Brown, linux-pm, linux-kernel

On Tue, Jul 22, 2014 at 12:58:12PM +0200, Pavel Machek wrote:
> On Tue 2014-07-22 12:34:44, Joerg Roedel wrote:
> > On Tue, Jul 22, 2014 at 02:41:29AM +0200, Rafael J. Wysocki wrote:
> > > It looks like some specific need motivated Joerg's work, however, so
> > > let's not dismiss the use case lightly without knowing it.
> > 
> > The motivation was to optimize the data structures for machines with
> > large amounts of RAM without penalizing average machines. On a 12TB
> > machine you are close to 100000 pages just for one bitmap. Scanning
> > through that linearly to find a given bit just doesn't scale anymore in
> > this case.
> 
> Can you produce a backtrace showing where a 12TB machine spends time
> during boot with a resume= parameter but no suspend image?
> 
> AFAICT swsusp_check() does not play with bitmaps.

I can ask for a backtrace; I currently don't have one. But I have
perf data for this case. It also shows that most of the time is spent
in bitmap operations:

~# time perf record /usr/sbin/resume $sdev
resume: libgcrypt version: 1.5.0
[ perf record: Woken up 12 times to write data ]
[ perf record: Captured and wrote 2.882 MB perf.data (~125898 samples) ]

real    1m16.043s
user    0m0.016s
sys     0m0.312s
~# perf report --stdio |head -50
# Events: 75K cycles
#
# Overhead  Command        Shared Object                             Symbol
# ........  .......  ....................  ........................................
#
    56.16%   resume  [kernel.kallsyms]     [k] memory_bm_test_bit
    19.35%   resume  [kernel.kallsyms]     [k] swsusp_free
    14.90%   resume  [kernel.kallsyms]     [k] memory_bm_find_bit
     7.28%   resume  [kernel.kallsyms]     [k] swsusp_page_is_forbidden
     0.40%   resume  [kernel.kallsyms]     [k] clear_page_c_e
     0.34%   resume  [kernel.kallsyms]     [k] native_read_tsc
     0.22%   resume  [kernel.kallsyms]     [k] delay_tsc
     0.19%   resume  [kernel.kallsyms]     [k] pfn_valid
     0.11%   resume  [kernel.kallsyms]     [k] create_basic_memory_bitmaps
     0.09%   resume  [kernel.kallsyms]     [k] _raw_spin_lock
     0.09%   resume  [kernel.kallsyms]     [k] __alloc_pages_nodemask
     0.07%   resume  [kernel.kallsyms]     [k] scheduler_tick
     0.06%   resume  [kernel.kallsyms]     [k] io_serial_in
     0.05%   resume  [kernel.kallsyms]     [k] update_curr
     0.05%   resume  [kernel.kallsyms]     [k] idle_cpu
     0.05%   resume  [kernel.kallsyms]     [k] perf_adjust_freq_unthr_context
     0.04%   resume  [kernel.kallsyms]     [k] find_get_page
     0.04%   resume  [kernel.kallsyms]     [k] __memset
     0.04%   resume  [kernel.kallsyms]     [k] ktime_get_update_offsets
     0.03%   resume  [kernel.kallsyms]     [k] get_page_from_freelist
     0.03%   resume  [kernel.kallsyms]     [k] __bitmap_empty
     0.02%   resume  [kernel.kallsyms]     [k] update_cpu_load
     0.02%   resume  [kernel.kallsyms]     [k] update_rq_clock
     0.02%   resume  [kernel.kallsyms]     [k] native_write_msr_safe
     0.02%   resume  [kernel.kallsyms]     [k] mem_cgroup_del_lru_list
     0.02%   resume  [kernel.kallsyms]     [k] free_pcppages_bulk
     0.02%   resume  [kernel.kallsyms]     [k] __rcu_pending
     0.02%   resume  [kernel.kallsyms]     [k] rcu_exit_nohz
     0.01%   resume  [kernel.kallsyms]     [k] __rmqueue
     0.01%   resume  [kernel.kallsyms]     [k] x86_pmu_disable
     0.01%   resume  [kernel.kallsyms]     [k] touch_nmi_watchdog
     0.01%   resume  [kernel.kallsyms]     [k] __do_softirq
     0.01%   resume  [kernel.kallsyms]     [k] alloc_pages_current
     0.01%   resume  [kernel.kallsyms]     [k] memory_bm_create
     0.01%   resume  [kernel.kallsyms]     [k] check_cpu_stall
     0.01%   resume  [kernel.kallsyms]     [k] account_system_time
     0.01%   resume  [kernel.kallsyms]     [k] vma_prio_tree_next
     0.01%   resume  [kernel.kallsyms]     [k] free_hot_cold_page
     0.01%   resume  [kernel.kallsyms]     [k] find_vma
     0.01%   resume  [kernel.kallsyms]     [k] entity_tick
     0.01%   resume  [kernel.kallsyms]     [k] apic_timer_interrupt
     0.01%   resume  [kernel.kallsyms]     [k] read_tsc


* Re: [PATCH 0/6 v2] PM / Hibernate: Memory bitmap scalability improvements
  2014-07-22 10:55                   ` Pavel Machek
@ 2014-07-22 12:24                     ` Joerg Roedel
  0 siblings, 0 replies; 25+ messages in thread
From: Joerg Roedel @ 2014-07-22 12:24 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Rafael J. Wysocki, Len Brown, linux-pm, linux-kernel

On Tue, Jul 22, 2014 at 12:55:46PM +0200, Pavel Machek wrote:
> On Tue 2014-07-22 12:34:44, Joerg Roedel wrote:

> Writing out every single page on a 12TB machine to disk does not scale,
> either :-).

But there is not much potential for optimizing the write-out either (in
software). That doesn't mean we shouldn't improve the parts that we
*can* do better.

> > I also see how the problem could be solved differently, but what I
> > didn't get from the discussion yet is: What is actually *wrong* with
> > *this* approach?
> 
> It throws complex, tricky-to-review code at a problem...

A radix tree implementation is neither tricky nor overly complex. Also,
if you have questions about particular parts of the implementation, I am
happy to explain them or add a comment to make them clearer.
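
To illustrate the point, the core of such a lookup is only a few lines.
This is a generic sketch with made-up names, not the patch code itself:

#define ENTRIES_SHIFT	9	/* 512 slots per level on 64-bit */
#define ENTRIES_MASK	((1UL << ENTRIES_SHIFT) - 1)

struct node {
	void *slots[1UL << ENTRIES_SHIFT];	/* child nodes or leaf pages */
};

/* walk one tree level per iteration, consuming ENTRIES_SHIFT index
 * bits each time; 'levels' is the number of internal levels */
static void *rtree_lookup(struct node *root, int levels, unsigned long idx)
{
	struct node *n = root;

	while (levels-- > 0)
		n = n->slots[(idx >> (levels * ENTRIES_SHIFT)) & ENTRIES_MASK];

	return n;	/* the leaf (bitmap page) covering idx */
}

Two or three iterations of that loop replace a linear scan over the
whole block list.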

But all you have done so far is look for reasons why this change is bad,
and when that failed you went back to simply questioning its usefulness.

> that is not a problem in any reasonable configuration.

What is reasonable heavily depends on whom you ask. Just because it's
not an issue for you doesn't mean it isn't for anybody else. For our
partner the current situation is a real-world problem, and these patches
fix it.

> Now... should I spend half an hour reviewing your changes, or are we
> maybe better off without them?

I bet you have already spent more than half an hour discussing with me
reasons not to review this code.


	Joerg



* Re: [PATCH 0/6 v2] PM / Hibernate: Memory bitmap scalability improvements
  2014-07-22 12:10                     ` Joerg Roedel
@ 2014-07-23 10:57                       ` Pavel Machek
  0 siblings, 0 replies; 25+ messages in thread
From: Pavel Machek @ 2014-07-23 10:57 UTC (permalink / raw)
  To: Joerg Roedel; +Cc: Rafael J. Wysocki, Len Brown, linux-pm, linux-kernel

On Tue 2014-07-22 14:10:22, Joerg Roedel wrote:
> On Tue, Jul 22, 2014 at 12:58:12PM +0200, Pavel Machek wrote:
> > On Tue 2014-07-22 12:34:44, Joerg Roedel wrote:
> > > On Tue, Jul 22, 2014 at 02:41:29AM +0200, Rafael J. Wysocki wrote:
> > > > It looks like some specific need motivated Joerg's work, however, so
> > > > let's not dismiss the use case lightly without knowing it.
> > > 
> > > The motivation was to optimize the data structures for machines with
> > > large amounts of RAM without penalizing average machines. On a 12TB
> > > machine you are close to 100000 pages just for one bitmap. Scanning
> > > through that linearly to find a given bit just doesn't scale anymore in
> > > this case.
> > 
> > Can you produce a backtrace showing where a 12TB machine spends time
> > during boot with a resume= parameter but no suspend image?
> > 
> > AFAICT swsusp_check() does not play with bitmaps.
> 
> I can ask for a backtrace; I currently don't have one. But I have
> perf data for this case. It also shows that most of the time is spent
> in bitmap operations:
> 
> ~# time perf record /usr/sbin/resume $sdev
> resume: libgcrypt version: 1.5.0
> [ perf record: Woken up 12 times to write data ]
> [ perf record: Captured and wrote 2.882 MB perf.data (~125898 samples) ]

Aha, but that's the userland code... suspend package, resume.c: 

        dev = open(snapshot_dev_name, O_WRONLY);
        if (dev < 0) {
                error = ENOENT;
                goto Free;
        }

        resume_dev = open_resume_dev(resume_dev_name, &swsusp_header);
        if (resume_dev == -ENOMEDIUM) {
                error = 0;
                goto Close;
        } else if (resume_dev < 0) {
                error = -resume_dev;
                goto Close;
        }

What you probably want to do is call open_resume_dev() first, and only
open(snapshot_dev_name...) if it finds a correct signature.
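
I.e., roughly this (a sketch against the snippet above; error handling
is abbreviated and the cleanup label is illustrative):

	/* check the on-disk signature first; this is cheap and does
	 * not make the kernel set up the snapshot bitmaps */
	resume_dev = open_resume_dev(resume_dev_name, &swsusp_header);
	if (resume_dev == -ENOMEDIUM)
		return 0;	/* no image -- nothing to resume */
	else if (resume_dev < 0)
		return -resume_dev;

	/* only now open the snapshot device, which is what triggers
	 * create_basic_memory_bitmaps() in the kernel */
	dev = open(snapshot_dev_name, O_WRONLY);
	if (dev < 0) {
		error = ENOENT;
		goto Close_resume;
	}

That way the bitmaps are never set up when there is no image to resume
from.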
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: [PATCH 0/6 v2] PM / Hibernate: Memory bitmap scalability improvements
  2014-07-22 10:34                 ` Joerg Roedel
  2014-07-22 10:55                   ` Pavel Machek
  2014-07-22 10:58                   ` Pavel Machek
@ 2014-07-28 13:59                   ` Borislav Petkov
  2014-07-29 21:22                     ` Rafael J. Wysocki
  2 siblings, 1 reply; 25+ messages in thread
From: Borislav Petkov @ 2014-07-28 13:59 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Rafael J. Wysocki, Pavel Machek, Len Brown, linux-pm, linux-kernel

On Tue, Jul 22, 2014 at 12:34:44PM +0200, Joerg Roedel wrote:
> So my test was to resume from a swap partition that contained no image.
> Here is the result from the 16GB machine. First with a v3.16-rc6 kernel
> without my changes:
> 
> kv:~/base # time perf record /usr/sbin/resume /dev/sda1
> resume: libgcrypt version: 1.5.3
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.019 MB perf.data (~823 samples) ]
> 
> real    0m0.084s
> user    0m0.012s
> sys     0m0.064s
> 
> Here is the result with my patches on top:
> 
> kv:~/hibernate # time perf record /usr/sbin/resume /dev/sda1
> resume: libgcrypt version: 1.5.3
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.014 MB perf.data (~602 samples) ]
> 
> real    0m0.032s
> user    0m0.003s
> sys     0m0.027s
> 
> So we save around 50ms (or 62% of time) already on this 16GB machine.

So, let's see, with Joerg's patches we

- solve the issue on huge boxes. And yes, we most definitely want to be
able to suspend them too. RAS is one very prominent use case here.

*and*

- we see an improvement on smaller boxes; the above numbers look good to me.

and all that for an additional 8K for an S/R cycle?! And for some
additional complexity of a radix tree which is self-contained, well
tested, and well understood?

This certainly looks like a net win to me.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


* Re: [PATCH 0/6 v2] PM / Hibernate: Memory bitmap scalability improvements
  2014-07-28 13:59                   ` Borislav Petkov
@ 2014-07-29 21:22                     ` Rafael J. Wysocki
  0 siblings, 0 replies; 25+ messages in thread
From: Rafael J. Wysocki @ 2014-07-29 21:22 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Joerg Roedel, Pavel Machek, Len Brown, linux-pm, linux-kernel

On Monday, July 28, 2014 03:59:33 PM Borislav Petkov wrote:
> On Tue, Jul 22, 2014 at 12:34:44PM +0200, Joerg Roedel wrote:
> > So my test was to resume from a swap partition that contained no image.
> > Here is the result from the 16GB machine. First with a v3.16-rc6 kernel
> > without my changes:
> > 
> > kv:~/base # time perf record /usr/sbin/resume /dev/sda1
> > resume: libgcrypt version: 1.5.3
> > [ perf record: Woken up 1 times to write data ]
> > [ perf record: Captured and wrote 0.019 MB perf.data (~823 samples) ]
> > 
> > real    0m0.084s
> > user    0m0.012s
> > sys     0m0.064s
> > 
> > Here is the result with my patches on top:
> > 
> > kv:~/hibernate # time perf record /usr/sbin/resume /dev/sda1
> > resume: libgcrypt version: 1.5.3
> > [ perf record: Woken up 1 times to write data ]
> > [ perf record: Captured and wrote 0.014 MB perf.data (~602 samples) ]
> > 
> > real    0m0.032s
> > user    0m0.003s
> > sys     0m0.027s
> > 
> > So we save around 50ms (or 62% of time) already on this 16GB machine.
> 
> So, let's see, with Joerg's patches we
> 
> - solve the issue on huge boxes. And yes, we most definitely want to be
> able to suspend them too. RAS is one very prominent use case here.
> 
> *and*
> 
> - we see an improvement on smaller boxes; the above numbers look good to me.
>
> and all that for an additional 8K for an S/R cycle?! And for some
> additional complexity of a radix tree which is self-contained, well
> tested, and well understood?
>
> This certainly looks like a net win to me.

Yes, it does.

I've queued up Joerg's patches for 3.17, thanks!

Rafael



end of thread

Thread overview: 25+ messages
2014-07-21 10:26 [PATCH 0/6 v2] PM / Hibernate: Memory bitmap scalability improvements Joerg Roedel
2014-07-21 10:26 ` [PATCH 1/6] PM / Hibernate: Create a Radix-Tree to store memory bitmap Joerg Roedel
2014-07-21 22:36   ` Joerg Roedel
2014-07-21 23:05     ` Pavel Machek
2014-07-21 10:26 ` [PATCH 2/6] PM / Hibernate: Add memory_rtree_find_bit function Joerg Roedel
2014-07-21 10:26 ` [PATCH 3/6] PM / Hibernate: Implement position keeping in radix tree Joerg Roedel
2014-07-21 10:27 ` [PATCH 4/6] PM / Hibernate: Iterate over set bits instead of PFNs in swsusp_free() Joerg Roedel
2014-07-21 10:27 ` [PATCH 5/6] PM / Hibernate: Remove the old memory-bitmap implementation Joerg Roedel
2014-07-21 10:27 ` [PATCH 6/6] PM / Hibernate: Touch Soft Lockup Watchdog in rtree_next_node Joerg Roedel
2014-07-21 12:00 ` [PATCH 0/6 v2] PM / Hibernate: Memory bitmap scalability improvements Pavel Machek
2014-07-21 12:36   ` Joerg Roedel
2014-07-21 13:06     ` Pavel Machek
2014-07-21 13:38       ` Joerg Roedel
2014-07-21 14:10         ` Pavel Machek
2014-07-21 16:03           ` Joerg Roedel
2014-07-21 23:05             ` Pavel Machek
2014-07-22  0:41               ` Rafael J. Wysocki
2014-07-22 10:34                 ` Joerg Roedel
2014-07-22 10:55                   ` Pavel Machek
2014-07-22 12:24                     ` Joerg Roedel
2014-07-22 10:58                   ` Pavel Machek
2014-07-22 12:10                     ` Joerg Roedel
2014-07-23 10:57                       ` Pavel Machek
2014-07-28 13:59                   ` Borislav Petkov
2014-07-29 21:22                     ` Rafael J. Wysocki
