* [PATCH 0/6] PM / Hibernate: Memory bitmap scalability improvements
@ 2014-07-18 11:57 Joerg Roedel
  2014-07-18 11:57 ` [PATCH 1/6] PM / Hibernate: Create a Radix-Tree to store memory bitmap Joerg Roedel
                   ` (8 more replies)
  0 siblings, 9 replies; 20+ messages in thread
From: Joerg Roedel @ 2014-07-18 11:57 UTC (permalink / raw)
  To: Rafael J. Wysocki, Pavel Machek, Len Brown
  Cc: linux-pm, linux-kernel, Joerg Roedel

Hi,

here is a patch set to improve the scalability of the memory
bitmap implementation used for hibernation. The current
implementation does not scale well to machines with several
TB of memory. A resume on those machines may cause soft
lockups to be reported.

These patches improve the data structure by adding a radix
tree to the linked-list structure to improve random-access
performance from O(n) to O(log_b(n)), where b depends on the
architecture (b=512 on amd64, b=1024 on i386).
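
To illustrate the index arithmetic behind that bound, here is a small
userspace sketch (the SKETCH_* names are invented for illustration and
assume 4 KiB pages on a 64-bit build; the kernel derives the same values
from PAGE_SHIFT and sizeof(unsigned long)):

```c
#include <assert.h>

/* On a 64-bit kernel each rtree node is one page holding
 * PAGE_SIZE / sizeof(unsigned long) = 512 slots, so each tree level
 * consumes 9 bits of the block number (b = 512; on i386 a slot is
 * 4 bytes, giving 1024 slots and 10 bits per level). */
#define SKETCH_PAGE_SHIFT	12			/* 4 KiB pages    */
#define SKETCH_LEVEL_SHIFT	(SKETCH_PAGE_SHIFT - 3)	/* 512 slots/node */
#define SKETCH_LEVEL_MASK	((1UL << SKETCH_LEVEL_SHIFT) - 1)

/* Number of tree levels needed to address block numbers 0..blocks-1 */
static int levels_needed(unsigned long blocks)
{
	int levels = 0;

	while (blocks) {
		levels++;
		blocks >>= SKETCH_LEVEL_SHIFT;
	}
	return levels;
}

/* Slot index within the node at the given level (level 1 = leaves) */
static unsigned int level_index(unsigned long block_nr, int level)
{
	return (block_nr >> ((level - 1) * SKETCH_LEVEL_SHIFT))
		& SKETCH_LEVEL_MASK;
}
```

A tree over 512 leaf blocks thus needs two levels, and each additional
level multiplies the addressable blocks by 512.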

A test on a 12TB machine showed an improvement in resume
time from 76s with the old implementation to 2.4s with the
radix tree and the improved swsusp_free function. See below
for details of this test.

Patches 1-3 add the radix tree while keeping the existing
memory bitmap implementation in place, together with code to
compare the results of both implementations. This was used
during development to make sure both data structures return
the same results.

Patch 4 re-implements the swsusp_free() function to iterate
only over the bits set in the bitmaps instead of over all
pfns. This was shown to scale better on large-memory
machines.

Patch 5 removes the old memory bitmap implementation now
that the radix tree is in place and working correctly.

The last patch, patch 6, adds touching the soft lockup
watchdog in rtree_next_node. This is necessary because the
worst-case performance (all bits set in the
forbidden_pages_map and free_pages_map) is the same as with
the old implementation and may still cause soft lockups.

The code was tested on 32-bit and 64-bit x86 and showed no
issues there.

Below is an example test that shows the performance
improvement on a 12TB machine. First the test with the old
memory bitmap:

# time perf record /usr/sbin/resume $sdev
resume: libgcrypt version: 1.5.0
[ perf record: Woken up 12 times to write data ]
[ perf record: Captured and wrote 2.882 MB perf.data (~125898 samples) ]

real    1m16.043s
user    0m0.016s
sys     0m0.312s
# perf report --stdio |head -50
# Events: 75K cycles
#
# Overhead  Command      Shared Object                                     Symbol
# ........  .......  ....................  ........................................
#
    56.16%   resume  [kernel.kallsyms]     [k] memory_bm_test_bit
    19.35%   resume  [kernel.kallsyms]     [k] swsusp_free
    14.90%   resume  [kernel.kallsyms]     [k] memory_bm_find_bit
     7.28%   resume  [kernel.kallsyms]     [k] swsusp_page_is_forbidden

And here is the same test on the same machine with these
patches applied:

#  time perf record /usr/sbin/resume $sdev
resume: libgcrypt version: 1.5.0
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.039 MB perf.data (~1716 samples) ]

real    0m2.376s
user    0m0.020s
sys     0m0.408s

# perf report --stdio |head -50
# Events: 762  cycles
#
# Overhead  Command      Shared Object                     Symbol
# ........  .......  .................  .........................
#
    34.78%   resume  [kernel.kallsyms]  [k] find_next_bit
    27.03%   resume  [kernel.kallsyms]  [k] clear_page_c_e
     9.70%   resume  [kernel.kallsyms]  [k] mark_nosave_pages
     3.92%   resume  [kernel.kallsyms]  [k] alloc_rtree_node
     2.38%   resume  [kernel.kallsyms]  [k] get_image_page

As can be seen from these results, the patches improve the
scalability significantly. Please review; any comments are
appreciated.

Thanks,

	Joerg

Joerg Roedel (6):
  PM / Hibernate: Create a Radix-Tree to store memory bitmap
  PM / Hibernate: Add memory_rtree_find_bit function
  PM / Hibernate: Implement position keeping in radix tree
  PM / Hibernate: Iterate over set bits instead of PFNs in swsusp_free()
  PM / Hibernate: Remove the old memory-bitmap implementation
  PM / Hibernate: Touch Soft Lockup Watchdog in rtree_next_node

 kernel/power/snapshot.c | 510 ++++++++++++++++++++++++++++++++++--------------
 1 file changed, 368 insertions(+), 142 deletions(-)

-- 
1.9.1



* [PATCH 1/6] PM / Hibernate: Create a Radix-Tree to store memory bitmap
  2014-07-18 11:57 [PATCH 0/6] PM / Hibernate: Memory bitmap scalability improvements Joerg Roedel
@ 2014-07-18 11:57 ` Joerg Roedel
  2014-07-19  0:00   ` Rafael J. Wysocki
  2014-07-18 11:57 ` [PATCH 2/6] PM / Hibernate: Add memory_rtree_find_bit function Joerg Roedel
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 20+ messages in thread
From: Joerg Roedel @ 2014-07-18 11:57 UTC (permalink / raw)
  To: Rafael J. Wysocki, Pavel Machek, Len Brown
  Cc: linux-pm, linux-kernel, Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

This patch adds the code to allocate and build the radix
tree to store the memory bitmap. The old data structure is
left in place until the radix tree implementation is
finished.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 kernel/power/snapshot.c | 224 +++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 223 insertions(+), 1 deletion(-)

diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 1ea328a..d0f11ec 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -248,11 +248,24 @@ static void *chain_alloc(struct chain_allocator *ca, unsigned int size)
  *	information is stored (in the form of a block of bitmap)
  *	It also contains the pfns that correspond to the start and end of
  *	the represented memory area.
+ *
+ *	The memory bitmap is organized as a radix tree to guarantee fast random
+ *	access to the bits. There is one radix tree for each zone (as returned
+ *	from create_mem_extents).
+ *
+ *	One radix tree is represented by one struct mem_zone_bm_rtree. There are
+ *	two linked lists for the nodes of the tree, one for the inner nodes and
+ *	one for the leaf nodes. The linked leaf nodes are used for fast linear
+ *	access of the memory bitmap.
+ *
+ *	The struct rtree_node represents one node of the radix tree.
  */
 
 #define BM_END_OF_MAP	(~0UL)
 
 #define BM_BITS_PER_BLOCK	(PAGE_SIZE * BITS_PER_BYTE)
+#define BM_BLOCK_SHIFT		(PAGE_SHIFT + 3)
+#define BM_BLOCK_MASK		((1UL << BM_BLOCK_SHIFT) - 1)
 
 struct bm_block {
 	struct list_head hook;	/* hook into a list of bitmap blocks */
@@ -266,6 +279,31 @@ static inline unsigned long bm_block_bits(struct bm_block *bb)
 	return bb->end_pfn - bb->start_pfn;
 }
 
+/*
+ * struct rtree_node is a wrapper struct to link the nodes
+ * of the rtree together for easy linear iteration over
+ * bits and easy freeing
+ */
+struct rtree_node {
+	struct list_head list;
+	unsigned long *data;
+};
+
+/*
+ * struct mem_zone_bm_rtree represents a bitmap used for one
+ * populated memory zone.
+ */
+struct mem_zone_bm_rtree {
+	struct list_head list;		/* Link Zones together         */
+	struct list_head nodes;		/* Radix Tree inner nodes      */
+	struct list_head leaves;	/* Radix Tree leaves           */
+	unsigned long start_pfn;	/* Zone start page frame       */
+	unsigned long end_pfn;		/* Zone end page frame + 1     */
+	struct rtree_node *rtree;	/* Radix Tree Root             */
+	int levels;			/* Number of Radix Tree Levels */
+	unsigned int blocks;		/* Number of Bitmap Blocks     */
+};
+
 /* struct bm_position is used for browsing memory bitmaps */
 
 struct bm_position {
@@ -274,6 +312,7 @@ struct bm_position {
 };
 
 struct memory_bitmap {
+	struct list_head zones;
 	struct list_head blocks;	/* list of bitmap blocks */
 	struct linked_page *p_list;	/* list of pages used to store zone
 					 * bitmap objects and bitmap block
@@ -284,6 +323,167 @@ struct memory_bitmap {
 
 /* Functions that operate on memory bitmaps */
 
+#define BM_ENTRIES_PER_LEVEL	(PAGE_SIZE / sizeof(unsigned long))
+#if BITS_PER_LONG == 32
+#define BM_RTREE_LEVEL_SHIFT	(PAGE_SHIFT - 2)
+#else
+#define BM_RTREE_LEVEL_SHIFT	(PAGE_SHIFT - 3)
+#endif
+#define BM_RTREE_LEVEL_MASK	((1UL << BM_RTREE_LEVEL_SHIFT) - 1)
+
+/*
+ *	alloc_rtree_node - Allocate a new node and add it to the radix
+ *			   tree.
+ *
+ *	This function is used to allocate inner nodes as well as the
+ *	leaf nodes of the radix tree. It also adds the node to the
+ *	corresponding linked list passed in by the *list parameter.
+ */
+static struct rtree_node *alloc_rtree_node(gfp_t gfp_mask, int safe_needed,
+					   struct chain_allocator *ca,
+					   struct list_head *list)
+{
+	struct rtree_node *node;
+
+	node = chain_alloc(ca, sizeof(struct rtree_node));
+	if (!node)
+		return NULL;
+
+	node->data = get_image_page(gfp_mask, safe_needed);
+	if (!node->data)
+		return NULL;
+
+	list_add_tail(&node->list, list);
+
+	return node;
+}
+
+/*
+ *	add_rtree_block - Add a new leaf node to the radix tree
+ *
+ *	The leaf nodes need to be allocated in order to keep the leaves
+ *	linked list sorted. This is guaranteed by the zone->blocks
+ *	counter.
+ */
+static int add_rtree_block(struct mem_zone_bm_rtree *zone, gfp_t gfp_mask,
+			   int safe_needed, struct chain_allocator *ca)
+{
+	struct rtree_node *node, *block, **dst;
+	unsigned int levels_needed, block_nr;
+	int i;
+
+	block_nr	= zone->blocks;
+	levels_needed	= 0;
+
+	/* How many levels do we need for this block nr? */
+	while (block_nr) {
+		levels_needed += 1;
+		block_nr >>= BM_RTREE_LEVEL_SHIFT;
+	}
+
+	/* Make sure the rtree has enough levels */
+	for (i = zone->levels; i < levels_needed; i++) {
+		node = alloc_rtree_node(gfp_mask, safe_needed, ca,
+					&zone->nodes);
+		if (!node)
+			return -ENOMEM;
+
+		node->data[0] = (unsigned long)zone->rtree;
+		zone->rtree   = node;
+		zone->levels += 1;
+	}
+
+	/* Allocate new block */
+	block = alloc_rtree_node(gfp_mask, safe_needed, ca, &zone->leaves);
+	if (!block)
+		return -ENOMEM;
+
+	/* Now walk the rtree to insert the block */
+	node     = zone->rtree;
+	dst      = &zone->rtree;
+	block_nr = zone->blocks;
+	for (i = zone->levels; i > 0; i--) {
+		int index;
+
+		if (!node) {
+			node = alloc_rtree_node(gfp_mask, safe_needed, ca,
+						&zone->nodes);
+			if (!node)
+				return -ENOMEM;
+			*dst = node;
+		}
+
+		index  = block_nr >> ((i - 1) * BM_RTREE_LEVEL_SHIFT);
+		index &= BM_RTREE_LEVEL_MASK;
+		dst    = (struct rtree_node **)&((*dst)->data[index]);
+		node   = *dst;
+	}
+
+	zone->blocks += 1;
+	*dst          = block;
+
+	return 0;
+}
+
+static void free_zone_bm_rtree(struct mem_zone_bm_rtree *zone,
+			       int clear_nosave_free);
+
+/*
+ *	create_zone_bm_rtree - create a radix tree for one zone
+ *
+ *	Allocates the mem_zone_bm_rtree structure and initializes it.
+ *	This function also allocates and builds the radix tree for the
+ *	zone.
+ */
+static struct mem_zone_bm_rtree *
+create_zone_bm_rtree(gfp_t gfp_mask, int safe_needed,
+		     struct chain_allocator *ca,
+		     unsigned long start, unsigned long end)
+{
+	struct mem_zone_bm_rtree *zone;
+	unsigned int i, nr_blocks;
+	unsigned long pages;
+
+	pages = end - start;
+	zone  = chain_alloc(ca, sizeof(struct mem_zone_bm_rtree));
+	if (!zone)
+		return NULL;
+
+	INIT_LIST_HEAD(&zone->nodes);
+	INIT_LIST_HEAD(&zone->leaves);
+	zone->start_pfn = start;
+	zone->end_pfn   = end;
+	nr_blocks       = DIV_ROUND_UP(pages, BM_BITS_PER_BLOCK);
+
+	for (i = 0; i < nr_blocks; i++) {
+		if (add_rtree_block(zone, gfp_mask, safe_needed, ca)) {
+			free_zone_bm_rtree(zone, PG_UNSAFE_CLEAR);
+			return NULL;
+		}
+	}
+
+	return zone;
+}
+
+/*
+ *	free_zone_bm_rtree - Free the memory of the radix tree
+ *
+ *	Free all node pages of the radix tree. The mem_zone_bm_rtree
+ *	structure itself is not freed here nor are the rtree_node
+ *	structs.
+ */
+static void free_zone_bm_rtree(struct mem_zone_bm_rtree *zone,
+			       int clear_nosave_free)
+{
+	struct rtree_node *node;
+
+	list_for_each_entry(node, &zone->nodes, list)
+		free_image_page(node->data, clear_nosave_free);
+
+	list_for_each_entry(node, &zone->leaves, list)
+		free_image_page(node->data, clear_nosave_free);
+}
+
 static void memory_bm_position_reset(struct memory_bitmap *bm)
 {
 	bm->cur.block = list_entry(bm->blocks.next, struct bm_block, hook);
@@ -408,12 +608,14 @@ memory_bm_create(struct memory_bitmap *bm, gfp_t gfp_mask, int safe_needed)
 
 	chain_init(&ca, gfp_mask, safe_needed);
 	INIT_LIST_HEAD(&bm->blocks);
+	INIT_LIST_HEAD(&bm->zones);
 
 	error = create_mem_extents(&mem_extents, gfp_mask);
 	if (error)
 		return error;
 
 	list_for_each_entry(ext, &mem_extents, hook) {
+		struct mem_zone_bm_rtree *zone;
 		struct bm_block *bb;
 		unsigned long pfn = ext->start;
 		unsigned long pages = ext->end - ext->start;
@@ -441,6 +643,12 @@ memory_bm_create(struct memory_bitmap *bm, gfp_t gfp_mask, int safe_needed)
 			}
 			bb->end_pfn = pfn;
 		}
+
+		zone = create_zone_bm_rtree(gfp_mask, safe_needed, &ca,
+					    ext->start, ext->end);
+		if (!zone)
+			goto Error;
+		list_add_tail(&zone->list, &bm->zones);
 	}
 
 	bm->p_list = ca.chain;
@@ -460,14 +668,19 @@ memory_bm_create(struct memory_bitmap *bm, gfp_t gfp_mask, int safe_needed)
   */
 static void memory_bm_free(struct memory_bitmap *bm, int clear_nosave_free)
 {
+	struct mem_zone_bm_rtree *zone;
 	struct bm_block *bb;
 
 	list_for_each_entry(bb, &bm->blocks, hook)
 		if (bb->data)
 			free_image_page(bb->data, clear_nosave_free);
 
+	list_for_each_entry(zone, &bm->zones, list)
+		free_zone_bm_rtree(zone, clear_nosave_free);
+
 	free_list_of_pages(bm->p_list, clear_nosave_free);
 
+	INIT_LIST_HEAD(&bm->zones);
 	INIT_LIST_HEAD(&bm->blocks);
 }
 
@@ -816,12 +1029,21 @@ void free_basic_memory_bitmaps(void)
 
 unsigned int snapshot_additional_pages(struct zone *zone)
 {
+	unsigned int rtree, nodes;
 	unsigned int res;
 
 	res = DIV_ROUND_UP(zone->spanned_pages, BM_BITS_PER_BLOCK);
 	res += DIV_ROUND_UP(res * sizeof(struct bm_block),
 			    LINKED_PAGE_DATA_SIZE);
-	return 2 * res;
+	rtree = nodes = DIV_ROUND_UP(zone->spanned_pages, BM_BITS_PER_BLOCK);
+	rtree += DIV_ROUND_UP(rtree * sizeof(struct rtree_node),
+			      LINKED_PAGE_DATA_SIZE);
+	while (nodes > 1) {
+		nodes = DIV_ROUND_UP(nodes, BM_ENTRIES_PER_LEVEL);
+		rtree += nodes;
+	}
+
+	return 2 * (res + rtree);
 }
 
 #ifdef CONFIG_HIGHMEM
-- 
1.9.1



* [PATCH 2/6] PM / Hibernate: Add memory_rtree_find_bit function
  2014-07-18 11:57 [PATCH 0/6] PM / Hibernate: Memory bitmap scalability improvements Joerg Roedel
  2014-07-18 11:57 ` [PATCH 1/6] PM / Hibernate: Create a Radix-Tree to store memory bitmap Joerg Roedel
@ 2014-07-18 11:57 ` Joerg Roedel
  2014-07-18 11:57 ` [PATCH 3/6] PM / Hibernate: Implement position keeping in radix tree Joerg Roedel
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: Joerg Roedel @ 2014-07-18 11:57 UTC (permalink / raw)
  To: Rafael J. Wysocki, Pavel Machek, Len Brown
  Cc: linux-pm, linux-kernel, Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

Add a function to find a bit in the radix tree for a given
pfn. Also add code to the memory bitmap wrapper functions to
use the radix tree together with the existing memory bitmap
implementation.

On read accesses compare the results of both bitmaps to make
sure the radix tree behaves the same way.
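
The split a find-bit helper has to perform can be pictured with a small
userspace sketch (SKETCH_* names are invented for illustration and
assume 4 KiB pages; the kernel uses BM_BLOCK_SHIFT/BM_BLOCK_MASK for the
same arithmetic):

```c
#include <assert.h>

/* Each leaf page of the bitmap holds PAGE_SIZE * 8 bits, so the block
 * shift is PAGE_SHIFT + 3. A pfn relative to the zone start splits
 * into a leaf-block number (walked down the radix tree) and a bit
 * offset inside that leaf page. */
#define SKETCH_PAGE_SHIFT	12
#define SKETCH_BLOCK_SHIFT	(SKETCH_PAGE_SHIFT + 3)	/* 32768 bits/leaf */
#define SKETCH_BLOCK_MASK	((1UL << SKETCH_BLOCK_SHIFT) - 1)

/* Which leaf page of the zone's tree holds the bit for pfn */
static unsigned long block_nr_for(unsigned long pfn, unsigned long start_pfn)
{
	return (pfn - start_pfn) >> SKETCH_BLOCK_SHIFT;
}

/* Bit index within that leaf page */
static unsigned int bit_nr_for(unsigned long pfn, unsigned long start_pfn)
{
	return (pfn - start_pfn) & SKETCH_BLOCK_MASK;
}
```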

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 kernel/power/snapshot.c | 84 +++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 81 insertions(+), 3 deletions(-)

diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index d0f11ec..9a3b723 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -721,6 +721,56 @@ static int memory_bm_find_bit(struct memory_bitmap *bm, unsigned long pfn,
 	return 0;
 }
 
+/*
+ *	memory_rtree_find_bit - Find the bit for pfn in the memory
+ *				bitmap
+ *
+ *	Walks the radix tree to find the page which contains the bit for
+ *	pfn, returning the page address in *addr and the bit index in *bit_nr.
+ */
+static int memory_rtree_find_bit(struct memory_bitmap *bm, unsigned long pfn,
+				 void **addr, unsigned int *bit_nr)
+{
+	struct mem_zone_bm_rtree *curr, *zone;
+	struct rtree_node *node;
+	int i, block_nr;
+
+	zone = NULL;
+
+	/* Find the right zone */
+	list_for_each_entry(curr, &bm->zones, list) {
+		if (pfn >= curr->start_pfn && pfn < curr->end_pfn) {
+			zone = curr;
+			break;
+		}
+	}
+
+	if (!zone)
+		return -EFAULT;
+
+	/*
+	 * We have a zone. Now walk the radix tree to find the leaf
+	 * node for our pfn.
+	 */
+	node      = zone->rtree;
+	block_nr  = (pfn - zone->start_pfn) >> BM_BLOCK_SHIFT;
+
+	for (i = zone->levels; i > 0; i--) {
+		int index;
+
+		index  = block_nr >> ((i - 1) * BM_RTREE_LEVEL_SHIFT);
+		index &= BM_RTREE_LEVEL_MASK;
+		BUG_ON(node->data[index] == 0);
+		node = (struct rtree_node *)node->data[index];
+	}
+
+	/* Set return values */
+	*addr   = node->data;
+	*bit_nr = (pfn - zone->start_pfn) & BM_BLOCK_MASK;
+
+	return 0;
+}
+
 static void memory_bm_set_bit(struct memory_bitmap *bm, unsigned long pfn)
 {
 	void *addr;
@@ -730,6 +780,10 @@ static void memory_bm_set_bit(struct memory_bitmap *bm, unsigned long pfn)
 	error = memory_bm_find_bit(bm, pfn, &addr, &bit);
 	BUG_ON(error);
 	set_bit(bit, addr);
+
+	error = memory_rtree_find_bit(bm, pfn, &addr, &bit);
+	BUG_ON(error);
+	set_bit(bit, addr);
 }
 
 static int mem_bm_set_bit_check(struct memory_bitmap *bm, unsigned long pfn)
@@ -741,6 +795,13 @@ static int mem_bm_set_bit_check(struct memory_bitmap *bm, unsigned long pfn)
 	error = memory_bm_find_bit(bm, pfn, &addr, &bit);
 	if (!error)
 		set_bit(bit, addr);
+	else
+		return error;
+
+	error = memory_rtree_find_bit(bm, pfn, &addr, &bit);
+	if (!error)
+		set_bit(bit, addr);
+
 	return error;
 }
 
@@ -753,25 +814,42 @@ static void memory_bm_clear_bit(struct memory_bitmap *bm, unsigned long pfn)
 	error = memory_bm_find_bit(bm, pfn, &addr, &bit);
 	BUG_ON(error);
 	clear_bit(bit, addr);
+
+	error = memory_rtree_find_bit(bm, pfn, &addr, &bit);
+	BUG_ON(error);
+	clear_bit(bit, addr);
 }
 
 static int memory_bm_test_bit(struct memory_bitmap *bm, unsigned long pfn)
 {
 	void *addr;
 	unsigned int bit;
-	int error;
+	int error, error2;
+	int v;
 
 	error = memory_bm_find_bit(bm, pfn, &addr, &bit);
 	BUG_ON(error);
-	return test_bit(bit, addr);
+	v = test_bit(bit, addr);
+
+	error2 = memory_rtree_find_bit(bm, pfn, &addr, &bit);
+	BUG_ON(error2);
+
+	WARN_ON_ONCE(v != test_bit(bit, addr));
+
+	return v;
 }
 
 static bool memory_bm_pfn_present(struct memory_bitmap *bm, unsigned long pfn)
 {
 	void *addr;
 	unsigned int bit;
+	int present;
+
+	present = !memory_bm_find_bit(bm, pfn, &addr, &bit);
+
+	WARN_ON_ONCE(present != !memory_rtree_find_bit(bm, pfn, &addr, &bit));
 
-	return !memory_bm_find_bit(bm, pfn, &addr, &bit);
+	return present;
 }
 
 /**
-- 
1.9.1



* [PATCH 3/6] PM / Hibernate: Implement position keeping in radix tree
  2014-07-18 11:57 [PATCH 0/6] PM / Hibernate: Memory bitmap scalability improvements Joerg Roedel
  2014-07-18 11:57 ` [PATCH 1/6] PM / Hibernate: Create a Radix-Tree to store memory bitmap Joerg Roedel
  2014-07-18 11:57 ` [PATCH 2/6] PM / Hibernate: Add memory_rtree_find_bit function Joerg Roedel
@ 2014-07-18 11:57 ` Joerg Roedel
  2014-07-18 11:57 ` [PATCH 4/6] PM / Hibernate: Iterate over set bits instead of PFNs in swsusp_free() Joerg Roedel
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: Joerg Roedel @ 2014-07-18 11:57 UTC (permalink / raw)
  To: Rafael J. Wysocki, Pavel Machek, Len Brown
  Cc: linux-pm, linux-kernel, Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

Add code to remember the last position that was requested in
the radix tree. Use it as a cache for faster linear walking
of the bitmap in the memory_bm_rtree_next_pfn() function
which is also added with this patch.
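
The cursor idea can be pictured with a simplified userspace model (the
sketch_* names and the flat word array are invented for illustration;
the kernel instead keeps a cursor over linked leaf pages and uses
find_next_bit()):

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_END	(~0UL)	/* stands in for BM_END_OF_MAP */

/* The bitmap remembers the last visited word and the next bit to test,
 * so each call resumes where the previous one stopped instead of
 * re-walking the structure from the start. */
struct sketch_cursor {
	const unsigned long *words;
	size_t nr_words;
	size_t word;	/* current word            */
	int bit;	/* next bit to test in it  */
};

/* Return the next set bit position, or SKETCH_END when exhausted */
static unsigned long sketch_next_pfn(struct sketch_cursor *c)
{
	while (c->word < c->nr_words) {
		for (; c->bit < 64; c->bit++) {
			if (c->words[c->word] & (1UL << c->bit)) {
				unsigned long pfn = c->word * 64 + c->bit;

				c->bit++;	/* resume after this bit */
				return pfn;
			}
		}
		c->word++;	/* advance to the next "leaf" */
		c->bit = 0;
	}
	return SKETCH_END;
}
```

A full linear walk is then just repeated calls until SKETCH_END, each
call doing O(1) amortized work per visited word.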

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 kernel/power/snapshot.c | 102 +++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 100 insertions(+), 2 deletions(-)

diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 9a3b723..2ccbf77 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -309,6 +309,11 @@ struct mem_zone_bm_rtree {
 struct bm_position {
 	struct bm_block *block;
 	int bit;
+
+	struct mem_zone_bm_rtree *zone;
+	struct rtree_node *node;
+	unsigned long node_pfn;
+	int node_bit;
 };
 
 struct memory_bitmap {
@@ -488,6 +493,13 @@ static void memory_bm_position_reset(struct memory_bitmap *bm)
 {
 	bm->cur.block = list_entry(bm->blocks.next, struct bm_block, hook);
 	bm->cur.bit = 0;
+
+	bm->cur.zone = list_entry(bm->zones.next, struct mem_zone_bm_rtree,
+				  list);
+	bm->cur.node = list_entry(bm->cur.zone->leaves.next,
+				  struct rtree_node, list);
+	bm->cur.node_pfn = 0;
+	bm->cur.node_bit = 0;
 }
 
 static void memory_bm_free(struct memory_bitmap *bm, int clear_nosave_free);
@@ -735,6 +747,11 @@ static int memory_rtree_find_bit(struct memory_bitmap *bm, unsigned long pfn,
 	struct rtree_node *node;
 	int i, block_nr;
 
+	zone = bm->cur.zone;
+
+	if (pfn >= zone->start_pfn && pfn < zone->end_pfn)
+		goto zone_found;
+
 	zone = NULL;
 
 	/* Find the right zone */
@@ -748,10 +765,16 @@ static int memory_rtree_find_bit(struct memory_bitmap *bm, unsigned long pfn,
 	if (!zone)
 		return -EFAULT;
 
+zone_found:
 	/*
 	 * We have a zone. Now walk the radix tree to find the leaf
 	 * node for our pfn.
 	 */
+
+	node = bm->cur.node;
+	if (((pfn - zone->start_pfn) & ~BM_BLOCK_MASK) == bm->cur.node_pfn)
+		goto node_found;
+
 	node      = zone->rtree;
 	block_nr  = (pfn - zone->start_pfn) >> BM_BLOCK_SHIFT;
 
@@ -764,9 +787,15 @@ static int memory_rtree_find_bit(struct memory_bitmap *bm, unsigned long pfn,
 		node = (struct rtree_node *)node->data[index];
 	}
 
+node_found:
+	/* Update last position */
+	bm->cur.zone     = zone;
+	bm->cur.node     = node;
+	bm->cur.node_pfn = (pfn - zone->start_pfn) & ~BM_BLOCK_MASK;
+
 	/* Set return values */
-	*addr   = node->data;
-	*bit_nr = (pfn - zone->start_pfn) & BM_BLOCK_MASK;
+	*addr            = node->data;
+	*bit_nr          = (pfn - zone->start_pfn) & BM_BLOCK_MASK;
 
 	return 0;
 }
@@ -861,11 +890,16 @@ static bool memory_bm_pfn_present(struct memory_bitmap *bm, unsigned long pfn)
  *	this function.
  */
 
+static unsigned long memory_bm_rtree_next_pfn(struct memory_bitmap *bm);
+
 static unsigned long memory_bm_next_pfn(struct memory_bitmap *bm)
 {
+	unsigned long rtree_pfn;
 	struct bm_block *bb;
 	int bit;
 
+	rtree_pfn = memory_bm_rtree_next_pfn(bm);
+
 	bb = bm->cur.block;
 	do {
 		bit = bm->cur.bit;
@@ -879,13 +913,77 @@ static unsigned long memory_bm_next_pfn(struct memory_bitmap *bm)
 	} while (&bb->hook != &bm->blocks);
 
 	memory_bm_position_reset(bm);
+	WARN_ON_ONCE(rtree_pfn != BM_END_OF_MAP);
 	return BM_END_OF_MAP;
 
  Return_pfn:
+	WARN_ON_ONCE(bb->start_pfn + bit != rtree_pfn);
 	bm->cur.bit = bit + 1;
 	return bb->start_pfn + bit;
 }
 
+/*
+ *	rtree_next_node - Jumps to the next leaf node
+ *
+ *	Sets the position to the beginning of the next node in the
+ *	memory bitmap. This is either the next node in the current
+ *	zone's radix tree or the first node in the radix tree of the
+ *	next zone.
+ *
+ *	Returns true if there is a next node, false otherwise.
+ */
+static bool rtree_next_node(struct memory_bitmap *bm)
+{
+	bm->cur.node = list_entry(bm->cur.node->list.next,
+				  struct rtree_node, list);
+	if (&bm->cur.node->list != &bm->cur.zone->leaves) {
+		bm->cur.node_pfn += BM_BITS_PER_BLOCK;
+		bm->cur.node_bit  = 0;
+		return true;
+	}
+
+	/* No more nodes, goto next zone */
+	bm->cur.zone = list_entry(bm->cur.zone->list.next,
+				  struct mem_zone_bm_rtree, list);
+	if (&bm->cur.zone->list != &bm->zones) {
+		bm->cur.node = list_entry(bm->cur.zone->leaves.next,
+					  struct rtree_node, list);
+		bm->cur.node_pfn = 0;
+		bm->cur.node_bit = 0;
+		return true;
+	}
+
+	/* No more zones */
+	return false;
+}
+
+/*
+ *	memory_bm_rtree_next_pfn - Find the next set bit
+ *
+ *	Starting from the last returned position this function searches
+ *	for the next set bit in the memory bitmap and returns its
+ *	number. If no more bits are set, BM_END_OF_MAP is returned.
+ */
+static unsigned long memory_bm_rtree_next_pfn(struct memory_bitmap *bm)
+{
+	unsigned long bits, pfn, pages;
+	int bit;
+
+	do {
+		pages	  = bm->cur.zone->end_pfn - bm->cur.zone->start_pfn;
+		bits      = min(pages - bm->cur.node_pfn, BM_BITS_PER_BLOCK);
+		bit	  = find_next_bit(bm->cur.node->data, bits,
+					  bm->cur.node_bit);
+		if (bit < bits) {
+			pfn = bm->cur.zone->start_pfn + bm->cur.node_pfn + bit;
+			bm->cur.node_bit = bit + 1;
+			return pfn;
+		}
+	} while (rtree_next_node(bm));
+
+	return BM_END_OF_MAP;
+}
+
 /**
  *	This structure represents a range of page frames the contents of which
  *	should not be saved during the suspend.
-- 
1.9.1



* [PATCH 4/6] PM / Hibernate: Iterate over set bits instead of PFNs in swsusp_free()
  2014-07-18 11:57 [PATCH 0/6] PM / Hibernate: Memory bitmap scalability improvements Joerg Roedel
                   ` (2 preceding siblings ...)
  2014-07-18 11:57 ` [PATCH 3/6] PM / Hibernate: Implement position keeping in radix tree Joerg Roedel
@ 2014-07-18 11:57 ` Joerg Roedel
  2014-07-18 11:57 ` [PATCH 5/6] PM / Hibernate: Remove the old memory-bitmap implementation Joerg Roedel
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: Joerg Roedel @ 2014-07-18 11:57 UTC (permalink / raw)
  To: Rafael J. Wysocki, Pavel Machek, Len Brown
  Cc: linux-pm, linux-kernel, Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

The existing implementation of swsusp_free iterates over all
pfns in the system and checks every bit in the two memory
bitmaps.

This doesn't scale very well with large numbers of pfns,
especially when the bitmaps are sparsely populated. Change
the algorithm to iterate over the set bits in the bitmaps
instead to make it scale better on large memory
configurations.

Also add a memory_bm_clear_current() helper function that
clears the bit for the last position returned from the
memory bitmap.
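
The core of the new loop is a sorted-stream intersection: advance
whichever of the two bit streams is behind until both report the same
pfn. A hedged userspace sketch (arrays stand in for the bitmap
iterators, SKETCH_END terminates each stream; the invented function acts
only where a pfn is set in both bitmaps, as swsusp_free() does before
freeing the page):

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_END	(~0UL)	/* stands in for BM_END_OF_MAP */

/* Intersect two ascending, SKETCH_END-terminated streams of set-bit
 * positions; store the common positions in out[] and return how many
 * were found. Terminates because the lagging stream always advances. */
static size_t intersect_set_bits(const unsigned long *fb,
				 const unsigned long *fr,
				 unsigned long *out)
{
	size_t n = 0;

	while (*fb != SKETCH_END || *fr != SKETCH_END) {
		if (*fb < *fr) {
			fb++;		/* forbidden stream lags: advance */
		} else if (*fr < *fb) {
			fr++;		/* free stream lags: advance      */
		} else {
			out[n++] = *fb;	/* set in both: would be freed    */
			fb++;
			fr++;
		}
	}
	return n;
}
```

The cost is proportional to the number of set bits rather than to the
total number of pfns in the system.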

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 kernel/power/snapshot.c | 53 +++++++++++++++++++++++++++++++++++--------------
 1 file changed, 38 insertions(+), 15 deletions(-)

diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 2ccbf77..af733c3 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -849,6 +849,17 @@ static void memory_bm_clear_bit(struct memory_bitmap *bm, unsigned long pfn)
 	clear_bit(bit, addr);
 }
 
+static void memory_bm_clear_current(struct memory_bitmap *bm)
+{
+	int bit;
+
+	bit  = max(bm->cur.node_bit - 1, 0);
+	clear_bit(bit, bm->cur.node->data);
+
+	bit = max(bm->cur.bit - 1, 0);
+	clear_bit(bit, bm->cur.block->data);
+}
+
 static int memory_bm_test_bit(struct memory_bitmap *bm, unsigned long pfn)
 {
 	void *addr;
@@ -1492,23 +1503,35 @@ static struct memory_bitmap copy_bm;
 
 void swsusp_free(void)
 {
-	struct zone *zone;
-	unsigned long pfn, max_zone_pfn;
+	unsigned long fb_pfn, fr_pfn;
 
-	for_each_populated_zone(zone) {
-		max_zone_pfn = zone_end_pfn(zone);
-		for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++)
-			if (pfn_valid(pfn)) {
-				struct page *page = pfn_to_page(pfn);
-
-				if (swsusp_page_is_forbidden(page) &&
-				    swsusp_page_is_free(page)) {
-					swsusp_unset_page_forbidden(page);
-					swsusp_unset_page_free(page);
-					__free_page(page);
-				}
-			}
+	memory_bm_position_reset(forbidden_pages_map);
+	memory_bm_position_reset(free_pages_map);
+
+loop:
+	fr_pfn = memory_bm_next_pfn(free_pages_map);
+	fb_pfn = memory_bm_next_pfn(forbidden_pages_map);
+
+	/*
+	 * Find the next bit set in both bitmaps. This is guaranteed to
+	 * terminate when fb_pfn == fr_pfn == BM_END_OF_MAP.
+	 */
+	do {
+		if (fb_pfn < fr_pfn)
+			fb_pfn = memory_bm_next_pfn(forbidden_pages_map);
+		if (fr_pfn < fb_pfn)
+			fr_pfn = memory_bm_next_pfn(free_pages_map);
+	} while (fb_pfn != fr_pfn);
+
+	if (fr_pfn != BM_END_OF_MAP && pfn_valid(fr_pfn)) {
+		struct page *page = pfn_to_page(fr_pfn);
+
+		memory_bm_clear_current(forbidden_pages_map);
+		memory_bm_clear_current(free_pages_map);
+		__free_page(page);
+		goto loop;
 	}
+
 	nr_copy_pages = 0;
 	nr_meta_pages = 0;
 	restore_pblist = NULL;
-- 
1.9.1



* [PATCH 5/6] PM / Hibernate: Remove the old memory-bitmap implementation
  2014-07-18 11:57 [PATCH 0/6] PM / Hibernate: Memory bitmap scalability improvements Joerg Roedel
                   ` (3 preceding siblings ...)
  2014-07-18 11:57 ` [PATCH 4/6] PM / Hibernate: Iterate over set bits instead of PFNs in swsusp_free() Joerg Roedel
@ 2014-07-18 11:57 ` Joerg Roedel
  2014-07-18 11:57 ` [PATCH 6/6] PM / Hibernate: Touch Soft Lockup Watchdog in rtree_next_node Joerg Roedel
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: Joerg Roedel @ 2014-07-18 11:57 UTC (permalink / raw)
  To: Rafael J. Wysocki, Pavel Machek, Len Brown
  Cc: linux-pm, linux-kernel, Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

The radix tree implementation has now been shown to work the
same as the old implementation, so the old implementation
can be removed to finish the switch to the radix tree for
the memory bitmaps.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 kernel/power/snapshot.c | 260 ++++++------------------------------------------
 1 file changed, 32 insertions(+), 228 deletions(-)

diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index af733c3..750b32f 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -223,32 +223,6 @@ static void *chain_alloc(struct chain_allocator *ca, unsigned int size)
 /**
  *	Data types related to memory bitmaps.
  *
- *	Memory bitmap is a structure consiting of many linked lists of
- *	objects.  The main list's elements are of type struct zone_bitmap
- *	and each of them corresonds to one zone.  For each zone bitmap
- *	object there is a list of objects of type struct bm_block that
- *	represent each blocks of bitmap in which information is stored.
- *
- *	struct memory_bitmap contains a pointer to the main list of zone
- *	bitmap objects, a struct bm_position used for browsing the bitmap,
- *	and a pointer to the list of pages used for allocating all of the
- *	zone bitmap objects and bitmap block objects.
- *
- *	NOTE: It has to be possible to lay out the bitmap in memory
- *	using only allocations of order 0.  Additionally, the bitmap is
- *	designed to work with arbitrary number of zones (this is over the
- *	top for now, but let's avoid making unnecessary assumptions ;-).
- *
- *	struct zone_bitmap contains a pointer to a list of bitmap block
- *	objects and a pointer to the bitmap block object that has been
- *	most recently used for setting bits.  Additionally, it contains the
- *	pfns that correspond to the start and end of the represented zone.
- *
- *	struct bm_block contains a pointer to the memory page in which
- *	information is stored (in the form of a block of bitmap)
- *	It also contains the pfns that correspond to the start and end of
- *	the represented memory area.
- *
  *	The memory bitmap is organized as a radix tree to guarantee fast random
  *	access to the bits. There is one radix tree for each zone (as returned
  *	from create_mem_extents).
@@ -258,7 +232,18 @@ static void *chain_alloc(struct chain_allocator *ca, unsigned int size)
  *	one for the leave nodes. The linked leave nodes are used for fast linear
  *	access of the memory bitmap.
  *
+ *	struct memory_bitmap contains a pointer to the main list of zone
+ *	rtree objects, a struct bm_position used for browsing the bitmap,
+ *	and a pointer to the list of pages used for allocating all of the
+ *	zone bitmap objects and bitmap block objects.
+ *
  *	The struct rtree_node represents one node of the radix tree.
+ *
+ *	NOTE: It has to be possible to lay out the bitmap in memory
+ *	using only allocations of order 0.  Additionally, the bitmap is
+ *	designed to work with arbitrary number of zones (this is over the
+ *	top for now, but let's avoid making unnecessary assumptions ;-).
+ *
  */
 
 #define BM_END_OF_MAP	(~0UL)
@@ -267,18 +252,6 @@ static void *chain_alloc(struct chain_allocator *ca, unsigned int size)
 #define BM_BLOCK_SHIFT		(PAGE_SHIFT + 3)
 #define BM_BLOCK_MASK		((1UL << BM_BLOCK_SHIFT) - 1)
 
-struct bm_block {
-	struct list_head hook;	/* hook into a list of bitmap blocks */
-	unsigned long start_pfn;	/* pfn represented by the first bit */
-	unsigned long end_pfn;	/* pfn represented by the last bit plus 1 */
-	unsigned long *data;	/* bitmap representing pages */
-};
-
-static inline unsigned long bm_block_bits(struct bm_block *bb)
-{
-	return bb->end_pfn - bb->start_pfn;
-}
-
 /*
  * struct rtree_node is a wrapper struct to link the nodes
  * of the rtree together for easy linear iteration over
@@ -307,9 +280,6 @@ struct mem_zone_bm_rtree {
 /* strcut bm_position is used for browsing memory bitmaps */
 
 struct bm_position {
-	struct bm_block *block;
-	int bit;
-
 	struct mem_zone_bm_rtree *zone;
 	struct rtree_node *node;
 	unsigned long node_pfn;
@@ -318,7 +288,6 @@ struct bm_position {
 
 struct memory_bitmap {
 	struct list_head zones;
-	struct list_head blocks;	/* list of bitmap blocks */
 	struct linked_page *p_list;	/* list of pages used to store zone
 					 * bitmap objects and bitmap block
 					 * objects
@@ -491,9 +460,6 @@ static void free_zone_bm_rtree(struct mem_zone_bm_rtree *zone,
 
 static void memory_bm_position_reset(struct memory_bitmap *bm)
 {
-	bm->cur.block = list_entry(bm->blocks.next, struct bm_block, hook);
-	bm->cur.bit = 0;
-
 	bm->cur.zone = list_entry(bm->zones.next, struct mem_zone_bm_rtree,
 				  list);
 	bm->cur.node = list_entry(bm->cur.zone->leaves.next,
@@ -504,30 +470,6 @@ static void memory_bm_position_reset(struct memory_bitmap *bm)
 
 static void memory_bm_free(struct memory_bitmap *bm, int clear_nosave_free);
 
-/**
- *	create_bm_block_list - create a list of block bitmap objects
- *	@pages - number of pages to track
- *	@list - list to put the allocated blocks into
- *	@ca - chain allocator to be used for allocating memory
- */
-static int create_bm_block_list(unsigned long pages,
-				struct list_head *list,
-				struct chain_allocator *ca)
-{
-	unsigned int nr_blocks = DIV_ROUND_UP(pages, BM_BITS_PER_BLOCK);
-
-	while (nr_blocks-- > 0) {
-		struct bm_block *bb;
-
-		bb = chain_alloc(ca, sizeof(struct bm_block));
-		if (!bb)
-			return -ENOMEM;
-		list_add(&bb->hook, list);
-	}
-
-	return 0;
-}
-
 struct mem_extent {
 	struct list_head hook;
 	unsigned long start;
@@ -619,7 +561,6 @@ memory_bm_create(struct memory_bitmap *bm, gfp_t gfp_mask, int safe_needed)
 	int error;
 
 	chain_init(&ca, gfp_mask, safe_needed);
-	INIT_LIST_HEAD(&bm->blocks);
 	INIT_LIST_HEAD(&bm->zones);
 
 	error = create_mem_extents(&mem_extents, gfp_mask);
@@ -628,38 +569,13 @@ memory_bm_create(struct memory_bitmap *bm, gfp_t gfp_mask, int safe_needed)
 
 	list_for_each_entry(ext, &mem_extents, hook) {
 		struct mem_zone_bm_rtree *zone;
-		struct bm_block *bb;
-		unsigned long pfn = ext->start;
-		unsigned long pages = ext->end - ext->start;
-
-		bb = list_entry(bm->blocks.prev, struct bm_block, hook);
-
-		error = create_bm_block_list(pages, bm->blocks.prev, &ca);
-		if (error)
-			goto Error;
-
-		list_for_each_entry_continue(bb, &bm->blocks, hook) {
-			bb->data = get_image_page(gfp_mask, safe_needed);
-			if (!bb->data) {
-				error = -ENOMEM;
-				goto Error;
-			}
-
-			bb->start_pfn = pfn;
-			if (pages >= BM_BITS_PER_BLOCK) {
-				pfn += BM_BITS_PER_BLOCK;
-				pages -= BM_BITS_PER_BLOCK;
-			} else {
-				/* This is executed only once in the loop */
-				pfn += pages;
-			}
-			bb->end_pfn = pfn;
-		}
 
 		zone = create_zone_bm_rtree(gfp_mask, safe_needed, &ca,
 					    ext->start, ext->end);
-		if (!zone)
+		if (!zone) {
+			error = -ENOMEM;
 			goto Error;
+		}
 		list_add_tail(&zone->list, &bm->zones);
 	}
 
@@ -681,11 +597,6 @@ memory_bm_create(struct memory_bitmap *bm, gfp_t gfp_mask, int safe_needed)
 static void memory_bm_free(struct memory_bitmap *bm, int clear_nosave_free)
 {
 	struct mem_zone_bm_rtree *zone;
-	struct bm_block *bb;
-
-	list_for_each_entry(bb, &bm->blocks, hook)
-		if (bb->data)
-			free_image_page(bb->data, clear_nosave_free);
 
 	list_for_each_entry(zone, &bm->zones, list)
 		free_zone_bm_rtree(zone, clear_nosave_free);
@@ -693,55 +604,20 @@ static void memory_bm_free(struct memory_bitmap *bm, int clear_nosave_free)
 	free_list_of_pages(bm->p_list, clear_nosave_free);
 
 	INIT_LIST_HEAD(&bm->zones);
-	INIT_LIST_HEAD(&bm->blocks);
 }
 
 /**
- *	memory_bm_find_bit - find the bit in the bitmap @bm that corresponds
- *	to given pfn.  The cur_zone_bm member of @bm and the cur_block member
- *	of @bm->cur_zone_bm are updated.
- */
-static int memory_bm_find_bit(struct memory_bitmap *bm, unsigned long pfn,
-				void **addr, unsigned int *bit_nr)
-{
-	struct bm_block *bb;
-
-	/*
-	 * Check if the pfn corresponds to the current bitmap block and find
-	 * the block where it fits if this is not the case.
-	 */
-	bb = bm->cur.block;
-	if (pfn < bb->start_pfn)
-		list_for_each_entry_continue_reverse(bb, &bm->blocks, hook)
-			if (pfn >= bb->start_pfn)
-				break;
-
-	if (pfn >= bb->end_pfn)
-		list_for_each_entry_continue(bb, &bm->blocks, hook)
-			if (pfn >= bb->start_pfn && pfn < bb->end_pfn)
-				break;
-
-	if (&bb->hook == &bm->blocks)
-		return -EFAULT;
-
-	/* The block has been found */
-	bm->cur.block = bb;
-	pfn -= bb->start_pfn;
-	bm->cur.bit = pfn + 1;
-	*bit_nr = pfn;
-	*addr = bb->data;
-	return 0;
-}
-
-/*
- *	memory_rtree_find_bit - Find the bit for pfn in the memory
- *				bitmap
+ *	memory_bm_find_bit - Find the bit for pfn in the memory
+ *			     bitmap
  *
- *	Walks the radix tree to find the page which contains the bit for
+ *	Find the bit in the bitmap @bm that corresponds to given pfn.
+ *	The cur.zone, cur.block and cur.node_pfn member of @bm are
+ *	updated.
+ *	It walks the radix tree to find the page which contains the bit for
  *	pfn and returns the bit position in **addr and *bit_nr.
  */
-static int memory_rtree_find_bit(struct memory_bitmap *bm, unsigned long pfn,
-				 void **addr, unsigned int *bit_nr)
+static int memory_bm_find_bit(struct memory_bitmap *bm, unsigned long pfn,
+			      void **addr, unsigned int *bit_nr)
 {
 	struct mem_zone_bm_rtree *curr, *zone;
 	struct rtree_node *node;
@@ -809,10 +685,6 @@ static void memory_bm_set_bit(struct memory_bitmap *bm, unsigned long pfn)
 	error = memory_bm_find_bit(bm, pfn, &addr, &bit);
 	BUG_ON(error);
 	set_bit(bit, addr);
-
-	error = memory_rtree_find_bit(bm, pfn, &addr, &bit);
-	BUG_ON(error);
-	set_bit(bit, addr);
 }
 
 static int mem_bm_set_bit_check(struct memory_bitmap *bm, unsigned long pfn)
@@ -824,12 +696,6 @@ static int mem_bm_set_bit_check(struct memory_bitmap *bm, unsigned long pfn)
 	error = memory_bm_find_bit(bm, pfn, &addr, &bit);
 	if (!error)
 		set_bit(bit, addr);
-	else
-		return error;
-
-	error = memory_rtree_find_bit(bm, pfn, &addr, &bit);
-	if (!error)
-		set_bit(bit, addr);
 
 	return error;
 }
@@ -843,10 +709,6 @@ static void memory_bm_clear_bit(struct memory_bitmap *bm, unsigned long pfn)
 	error = memory_bm_find_bit(bm, pfn, &addr, &bit);
 	BUG_ON(error);
 	clear_bit(bit, addr);
-
-	error = memory_rtree_find_bit(bm, pfn, &addr, &bit);
-	BUG_ON(error);
-	clear_bit(bit, addr);
 }
 
 static void memory_bm_clear_current(struct memory_bitmap *bm)
@@ -855,82 +717,25 @@ static void memory_bm_clear_current(struct memory_bitmap *bm)
 
 	bit  = max(bm->cur.node_bit - 1, 0);
 	clear_bit(bit, bm->cur.node->data);
-
-	bit = max(bm->cur.bit - 1, 0);
-	clear_bit(bit, bm->cur.block->data);
 }
 
 static int memory_bm_test_bit(struct memory_bitmap *bm, unsigned long pfn)
 {
 	void *addr;
 	unsigned int bit;
-	int error, error2;
-	int v;
+	int error;
 
 	error = memory_bm_find_bit(bm, pfn, &addr, &bit);
 	BUG_ON(error);
-	v = test_bit(bit, addr);
-
-	error2 = memory_rtree_find_bit(bm, pfn, &addr, &bit);
-	BUG_ON(error2);
-
-	WARN_ON_ONCE(v != test_bit(bit, addr));
-
-	return v;
+	return test_bit(bit, addr);
 }
 
 static bool memory_bm_pfn_present(struct memory_bitmap *bm, unsigned long pfn)
 {
 	void *addr;
 	unsigned int bit;
-	int present;
-
-	present = !memory_bm_find_bit(bm, pfn, &addr, &bit);
 
-	WARN_ON_ONCE(present != !memory_rtree_find_bit(bm, pfn, &addr, &bit));
-
-	return present;
-}
-
-/**
- *	memory_bm_next_pfn - find the pfn that corresponds to the next set bit
- *	in the bitmap @bm.  If the pfn cannot be found, BM_END_OF_MAP is
- *	returned.
- *
- *	It is required to run memory_bm_position_reset() before the first call to
- *	this function.
- */
-
-static unsigned long memory_bm_rtree_next_pfn(struct memory_bitmap *bm);
-
-static unsigned long memory_bm_next_pfn(struct memory_bitmap *bm)
-{
-	unsigned long rtree_pfn;
-	struct bm_block *bb;
-	int bit;
-
-	rtree_pfn = memory_bm_rtree_next_pfn(bm);
-
-	bb = bm->cur.block;
-	do {
-		bit = bm->cur.bit;
-		bit = find_next_bit(bb->data, bm_block_bits(bb), bit);
-		if (bit < bm_block_bits(bb))
-			goto Return_pfn;
-
-		bb = list_entry(bb->hook.next, struct bm_block, hook);
-		bm->cur.block = bb;
-		bm->cur.bit = 0;
-	} while (&bb->hook != &bm->blocks);
-
-	memory_bm_position_reset(bm);
-	WARN_ON_ONCE(rtree_pfn != BM_END_OF_MAP);
-	return BM_END_OF_MAP;
-
- Return_pfn:
-	WARN_ON_ONCE(bb->start_pfn + bit != rtree_pfn);
-	bm->cur.bit = bit + 1;
-	return bb->start_pfn + bit;
+	return !memory_bm_find_bit(bm, pfn, &addr, &bit);
 }
 
 /*
@@ -968,14 +773,17 @@ static bool rtree_next_node(struct memory_bitmap *bm)
 	return false;
 }
 
-/*
- *	memory_bm_rtree_next_pfn - Find the next set bit
+/**
+ *	memory_bm_rtree_next_pfn - Find the next set bit in the bitmap @bm
  *
  *	Starting from the last returned position this function searches
  *	for the next set bit in the memory bitmap and returns its
  *	number. If no more bit is set BM_END_OF_MAP is returned.
+ *
+ *	It is required to run memory_bm_position_reset() before the
+ *	first call to this function.
  */
-static unsigned long memory_bm_rtree_next_pfn(struct memory_bitmap *bm)
+static unsigned long memory_bm_next_pfn(struct memory_bitmap *bm)
 {
 	unsigned long bits, pfn, pages;
 	int bit;
@@ -1217,11 +1025,7 @@ void free_basic_memory_bitmaps(void)
 unsigned int snapshot_additional_pages(struct zone *zone)
 {
 	unsigned int rtree, nodes;
-	unsigned int res;
 
-	res = DIV_ROUND_UP(zone->spanned_pages, BM_BITS_PER_BLOCK);
-	res += DIV_ROUND_UP(res * sizeof(struct bm_block),
-			    LINKED_PAGE_DATA_SIZE);
 	rtree = nodes = DIV_ROUND_UP(zone->spanned_pages, BM_BITS_PER_BLOCK);
 	rtree += DIV_ROUND_UP(rtree * sizeof(struct rtree_node),
 			      LINKED_PAGE_DATA_SIZE);
@@ -1230,7 +1034,7 @@ unsigned int snapshot_additional_pages(struct zone *zone)
 		rtree += nodes;
 	}
 
-	return 2 * (res + rtree);
+	return 2 * rtree;
 }
 
 #ifdef CONFIG_HIGHMEM
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 6/6] PM / Hibernate: Touch Soft Lockup Watchdog in rtree_next_node
  2014-07-18 11:57 [PATCH 0/6] PM / Hibernate: Memory bitmap scalability improvements Joerg Roedel
                   ` (4 preceding siblings ...)
  2014-07-18 11:57 ` [PATCH 5/6] PM / Hibernate: Remove the old memory-bitmap implementation Joerg Roedel
@ 2014-07-18 11:57 ` Joerg Roedel
  2014-07-18 22:05 ` [PATCH 0/6] PM / Hibernate: Memory bitmap scalability improvements Rafael J. Wysocki
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: Joerg Roedel @ 2014-07-18 11:57 UTC (permalink / raw)
  To: Rafael J. Wysocki, Pavel Machek, Len Brown
  Cc: linux-pm, linux-kernel, Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

When a memory bitmap is fully populated on a large memory
machine (several TB of RAM) it can take more than a minute
to walk through all bits. This causes the soft lockup
detector on these machines to report warnings.

Avoid this by touching the soft lockup watchdog in the
memory bitmap walking code.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 kernel/power/snapshot.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 750b32f..c5d19a9 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -755,6 +755,7 @@ static bool rtree_next_node(struct memory_bitmap *bm)
 	if (&bm->cur.node->list != &bm->cur.zone->leaves) {
 		bm->cur.node_pfn += BM_BITS_PER_BLOCK;
 		bm->cur.node_bit  = 0;
+		touch_softlockup_watchdog();
 		return true;
 	}
 
-- 
1.9.1



* Re: [PATCH 0/6] PM / Hibernate: Memory bitmap scalability improvements
  2014-07-18 11:57 [PATCH 0/6] PM / Hibernate: Memory bitmap scalability improvements Joerg Roedel
                   ` (5 preceding siblings ...)
  2014-07-18 11:57 ` [PATCH 6/6] PM / Hibernate: Touch Soft Lockup Watchdog in rtree_next_node Joerg Roedel
@ 2014-07-18 22:05 ` Rafael J. Wysocki
  2014-07-19  7:55   ` Joerg Roedel
  2014-07-19 10:26 ` Pavel Machek
  2014-07-19 13:49 ` Pavel Machek
  8 siblings, 1 reply; 20+ messages in thread
From: Rafael J. Wysocki @ 2014-07-18 22:05 UTC (permalink / raw)
  To: Joerg Roedel; +Cc: Pavel Machek, Len Brown, linux-pm, linux-kernel

On Friday, July 18, 2014 01:57:17 PM Joerg Roedel wrote:
> Hi,
> 
> here is a patch set to improve the scalability of the memory
> bitmap implementation used for hibernation. The current
> implementation does not scale well to machines with several
> TB of memory. A resume on those machines may cause soft
> lockups to be reported.
> 
> These patches improve the data structure by adding a radix
> tree to the linked list structure to improve random access
> performance from O(n) to O(log_b(n)), where b depends on the
> architecture (b=512 on amd64, 1024 in i386).
> 
> A test on a 12TB machine showed an improvement in resume
> time from 76s with the old implementation to 2.4s with the
> radix tree and the improved swsusp_free function. See below
> for details of this test.
> 
> Patches 1-3 that add the radix tree while keeping the
> existing memory bitmap implementation in place and add code
> to compare the results between both implementations. This
> was used during development to make sure both data
> structures return the same results.
> 
> Patch 4 re-implements the swsusp_free() function to not
> iterate over all pfns but only over the bits set in the
> bitmaps. This showed to scale better on large memory
> machines.
> 
> Patch 5 removes the old memory bitmap implementation now
> that the radix tree is in place and working correctly.
> 
> The last patch adds touching the soft lockup watchdog in
> rtree_next_node. This is necessary because the worst case
> performance (all bits set in the forbidden_pages_map and
> free_pages_map) is the same as with the old implementation
> and may still cause soft lockups. Patch 6 avoids this.
> 
> The code was tested in 32 and 64 bit x86 and showed no
> issues there.
> 
> Below is an example test that shows the performance
> improvement on a 12TB machine. First the test with the old
> memory bitmap:
> 
> # time perf record /usr/sbin/resume $sdev
> resume: libgcrypt version: 1.5.0
> [ perf record: Woken up 12 times to write data ]
> [ perf record: Captured and wrote 2.882 MB perf.data (~125898 samples) ]
> 
> real    1m16.043s
> user    0m0.016s
> sys     0m0.312s
> # perf report --stdio |head -50
> # Events: 75K cycles
> #
> # Overhead  Command         Shared Object                                   
> Symbol
> # ........  .......  .................... 
> ........................................
> #
>     56.16%   resume  [kernel.kallsyms]     [k] memory_bm_test_bit
>     19.35%   resume  [kernel.kallsyms]     [k] swsusp_free
>     14.90%   resume  [kernel.kallsyms]     [k] memory_bm_find_bit
>      7.28%   resume  [kernel.kallsyms]     [k] swsusp_page_is_forbidden
> 
> And here is the same test on the same machine with these
> patches applied:
> 
> #  time perf record /usr/sbin/resume $sdev
> resume: libgcrypt version: 1.5.0
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.039 MB perf.data (~1716 samples) ]
> 
> real    0m2.376s
> user    0m0.020s
> sys     0m0.408s
> 
> # perf report --stdio |head -50
> # Events: 762  cycles
> #
> # Overhead  Command      Shared Object                     Symbol
> # ........  .......  .................  .........................
> #
>     34.78%   resume  [kernel.kallsyms]  [k] find_next_bit
>     27.03%   resume  [kernel.kallsyms]  [k] clear_page_c_e
>      9.70%   resume  [kernel.kallsyms]  [k] mark_nosave_pages
>      3.92%   resume  [kernel.kallsyms]  [k] alloc_rtree_node
>      2.38%   resume  [kernel.kallsyms]  [k] get_image_page
> 
> As can be seen on these results these patches improve the
> scalability significantly. Please review, any comments
> appreciated.

Looks good.

How much testing did that receive?

Rafael



* Re: [PATCH 1/6] PM / Hibernate: Create a Radix-Tree to store memory bitmap
  2014-07-18 11:57 ` [PATCH 1/6] PM / Hibernate: Create a Radix-Tree to store memory bitmap Joerg Roedel
@ 2014-07-19  0:00   ` Rafael J. Wysocki
  2014-07-19  7:57     ` Joerg Roedel
  0 siblings, 1 reply; 20+ messages in thread
From: Rafael J. Wysocki @ 2014-07-19  0:00 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Pavel Machek, Len Brown, linux-pm, linux-kernel, Joerg Roedel

On Friday, July 18, 2014 01:57:18 PM Joerg Roedel wrote:
> From: Joerg Roedel <jroedel@suse.de>
> 
> This patch adds the code to allocate and build the radix
> tree to store the memory bitmap. The old data structure is
> left in place until the radix tree implementation is
> finished.
> 
> Signed-off-by: Joerg Roedel <jroedel@suse.de>
> ---
>  kernel/power/snapshot.c | 224 +++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 223 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
> index 1ea328a..d0f11ec 100644
> --- a/kernel/power/snapshot.c
> +++ b/kernel/power/snapshot.c
> @@ -248,11 +248,24 @@ static void *chain_alloc(struct chain_allocator *ca, unsigned int size)
>   *	information is stored (in the form of a block of bitmap)
>   *	It also contains the pfns that correspond to the start and end of
>   *	the represented memory area.
> + *
> + *	The memory bitmap is organized as a radix tree to guarantee fast random
> + *	access to the bits. There is one radix tree for each zone (as returned
> + *	from create_mem_extents).
> + *
> + *	One radix tree is represented by one struct mem_zone_bm_rtree. There are
> + *	two linked lists for the nodes of the tree, one for the inner nodes and
> + *	one for the leave nodes. The linked leave nodes are used for fast linear
> + *	access of the memory bitmap.
> + *
> + *	The struct rtree_node represents one node of the radix tree.
>   */
>  
>  #define BM_END_OF_MAP	(~0UL)
>  
>  #define BM_BITS_PER_BLOCK	(PAGE_SIZE * BITS_PER_BYTE)
> +#define BM_BLOCK_SHIFT		(PAGE_SHIFT + 3)
> +#define BM_BLOCK_MASK		((1UL << BM_BLOCK_SHIFT) - 1)
>  
>  struct bm_block {
>  	struct list_head hook;	/* hook into a list of bitmap blocks */
> @@ -266,6 +279,31 @@ static inline unsigned long bm_block_bits(struct bm_block *bb)
>  	return bb->end_pfn - bb->start_pfn;
>  }
>  
> +/*
> + * struct rtree_node is a wrapper struct to link the nodes
> + * of the rtree together for easy linear iteration over
> + * bits and easy freeing
> + */
> +struct rtree_node {
> +	struct list_head list;
> +	unsigned long *data;
> +};
> +
> +/*
> + * struct mem_zone_bm_rtree represents a bitmap used for one
> + * populated memory zone.
> + */
> +struct mem_zone_bm_rtree {
> +	struct list_head list;		/* Link Zones together         */
> +	struct list_head nodes;		/* Radix Tree inner nodes      */
> +	struct list_head leaves;	/* Radix Tree leaves           */
> +	unsigned long start_pfn;	/* Zone start page frame       */
> +	unsigned long end_pfn;		/* Zone end page frame + 1     */
> +	struct rtree_node *rtree;	/* Radix Tree Root             */
> +	int levels;			/* Number of Radix Tree Levels */
> +	unsigned int blocks;		/* Number of Bitmap Blocks     */
> +};
> +
>  /* strcut bm_position is used for browsing memory bitmaps */
>  
>  struct bm_position {
> @@ -274,6 +312,7 @@ struct bm_position {
>  };
>  
>  struct memory_bitmap {
> +	struct list_head zones;
>  	struct list_head blocks;	/* list of bitmap blocks */
>  	struct linked_page *p_list;	/* list of pages used to store zone
>  					 * bitmap objects and bitmap block
> @@ -284,6 +323,167 @@ struct memory_bitmap {
>  
>  /* Functions that operate on memory bitmaps */
>  
> +#define BM_ENTRIES_PER_LEVEL	(PAGE_SIZE / sizeof(unsigned long))
> +#if BITS_PER_LONG == 32
> +#define BM_RTREE_LEVEL_SHIFT	(PAGE_SHIFT - 2)
> +#else
> +#define BM_RTREE_LEVEL_SHIFT	(PAGE_SHIFT - 3)
> +#endif
> +#define BM_RTREE_LEVEL_MASK	((1UL << BM_RTREE_LEVEL_SHIFT) - 1)
> +
> +/*
> + *	alloc_rtree_node - Allocate a new node and add it to the radix
> + *			   tree.

Please make this a single line.

> + *
> + *	This function is used to allocate inner nodes as well as the
> + *	leave nodes of the radix tree. It also adds the node to the
> + *	corresponding linked list passed in by the *list parameter.
> + */
> +static struct rtree_node *alloc_rtree_node(gfp_t gfp_mask, int safe_needed,
> +					   struct chain_allocator *ca,
> +					   struct list_head *list)
> +{
> +	struct rtree_node *node;
> +
> +	node = chain_alloc(ca, sizeof(struct rtree_node));
> +	if (!node)
> +		return NULL;
> +
> +	node->data = get_image_page(gfp_mask, safe_needed);
> +	if (!node->data)
> +		return NULL;
> +
> +	list_add_tail(&node->list, list);
> +
> +	return node;
> +}
> +
> +/*
> + *	add_rtree_block - Add a new leave node to the radix tree
> + *
> + *	The leave nodes need to be allocated in order to keep the leaves
> + *	linked list in order. This is guaranteed by the zone->blocks
> + *	counter.
> + */
> +static int add_rtree_block(struct mem_zone_bm_rtree *zone, gfp_t gfp_mask,
> +			   int safe_needed, struct chain_allocator *ca)
> +{
> +	struct rtree_node *node, *block, **dst;
> +	unsigned int levels_needed, block_nr;
> +	int i;
> +
> +	block_nr	= zone->blocks;

This isn't proper kernel coding style.  Please use spaces as separators
before the "=" here and elsewhere.

Rafael



* Re: [PATCH 0/6] PM / Hibernate: Memory bitmap scalability improvements
  2014-07-18 22:05 ` [PATCH 0/6] PM / Hibernate: Memory bitmap scalability improvements Rafael J. Wysocki
@ 2014-07-19  7:55   ` Joerg Roedel
  0 siblings, 0 replies; 20+ messages in thread
From: Joerg Roedel @ 2014-07-19  7:55 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Pavel Machek, Len Brown, linux-pm, linux-kernel

Hi Rafael,

On Sat, Jul 19, 2014 at 12:05:02AM +0200, Rafael J. Wysocki wrote:
> On Friday, July 18, 2014 01:57:17 PM Joerg Roedel wrote:
> > here is a patch set to improve the scalability of the memory
> > bitmap implementation used for hibernation. The current
> > implementation does not scale well to machines with several
> > TB of memory. A resume on those machines may cause soft
> > lockups to be reported.
> 
> Looks good.
> 
> How much testing did that receive?

Well, I did functional testing of the radix tree and the old memory bitmap
in parallel (patches 1-4) and looked for the warnings to trigger in case
the bitmaps returned different results. With these patches the radix tree
always produced the same results as the old implementation (none of the
warnings triggered).

I first tested with KVM (64 bit and 32 bit x86 guest) and later on real
amd64 machines. One of our partners tested the patches on a 1TB, a 4TB
and a 12TB memory machine, always with positive results.

Thanks,

	Joerg



* Re: [PATCH 1/6] PM / Hibernate: Create a Radix-Tree to store memory bitmap
  2014-07-19  0:00   ` Rafael J. Wysocki
@ 2014-07-19  7:57     ` Joerg Roedel
  0 siblings, 0 replies; 20+ messages in thread
From: Joerg Roedel @ 2014-07-19  7:57 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Pavel Machek, Len Brown, linux-pm, linux-kernel, Joerg Roedel

On Sat, Jul 19, 2014 at 02:00:39AM +0200, Rafael J. Wysocki wrote:
> > +/*
> > + *	alloc_rtree_node - Allocate a new node and add it to the radix
> > + *			   tree.
> 
> Please make this a single line.
> 
> > +static int add_rtree_block(struct mem_zone_bm_rtree *zone, gfp_t gfp_mask,
> > +			   int safe_needed, struct chain_allocator *ca)
> > +{
> > +	struct rtree_node *node, *block, **dst;
> > +	unsigned int levels_needed, block_nr;
> > +	int i;
> > +
> > +	block_nr	= zone->blocks;
> 
> This isn't proper kernel coding style.  Please use spaces as separators
> before the "=" here and elsewhere.

Thanks for your comments, will fix this and resend next week.


	Joerg



* Re: [PATCH 0/6] PM / Hibernate: Memory bitmap scalability improvements
  2014-07-18 11:57 [PATCH 0/6] PM / Hibernate: Memory bitmap scalability improvements Joerg Roedel
                   ` (6 preceding siblings ...)
  2014-07-18 22:05 ` [PATCH 0/6] PM / Hibernate: Memory bitmap scalability improvements Rafael J. Wysocki
@ 2014-07-19 10:26 ` Pavel Machek
  2014-07-19 11:35   ` Joerg Roedel
  2014-07-19 13:49 ` Pavel Machek
  8 siblings, 1 reply; 20+ messages in thread
From: Pavel Machek @ 2014-07-19 10:26 UTC (permalink / raw)
  To: Joerg Roedel; +Cc: Rafael J. Wysocki, Len Brown, linux-pm, linux-kernel

Hi!

> here is a patch set to improve the scalability of the memory
> bitmap implementation used for hibernation. The current
> implementation does not scale well to machines with several
> TB of memory. A resume on those machines may cause soft
> lockups to be reported.
> 
> These patches improve the data structure by adding a radix
> tree to the linked list structure to improve random access
> performance from O(n) to O(log_b(n)), where b depends on the
> architecture (b=512 on amd64, 1024 in i386).
> 
> A test on a 12TB machine showed an improvement in resume
> time from 76s with the old implementation to 2.4s with the
> radix tree and the improved swsusp_free function. See below
> for details of this test.

Ok, nice.

How did space requirements change?

In particular, do we need to reserve a bit more pages for hibernation
now?

Thanks,
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: [PATCH 0/6] PM / Hibernate: Memory bitmap scalability improvements
  2014-07-19 10:26 ` Pavel Machek
@ 2014-07-19 11:35   ` Joerg Roedel
  0 siblings, 0 replies; 20+ messages in thread
From: Joerg Roedel @ 2014-07-19 11:35 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Rafael J. Wysocki, Len Brown, linux-pm, linux-kernel

Hi,

On Sat, Jul 19, 2014 at 12:26:23PM +0200, Pavel Machek wrote:
> > A test on a 12TB machine showed an improvement in resume
> > time from 76s with the old implementation to 2.4s with the
> > radix tree and the improved swsusp_free function. See below
> > for details of this test.
> 
> Ok, nice.
> 
> How did space requirements change?
> 
> In particular, do we need to reserve a bit more pages for hibernation
> now?

This depends on the amount of memory in the system. On the one hand, the
struct bm_block (32 bytes) was replaced by struct rtree_node (16 bytes).

On the other hand, more pages are required to build the radix tree.
For systems with up to 64GB of RAM on amd64 this is one additional page.
On a 12TB system this adds up to 193 additional pages (again, on amd64)
required for the radix tree inner nodes.

These 193 pages take around 800KB, while we save roughly 1.5MB because of
the smaller struct rtree_node.

On a 4GB system the situation is different: there we need to reserve one
additional page while saving only 512 bytes with the smaller struct. The
break-even is somewhere around 32GB, where the savings from struct
rtree_node outweigh the additional page required for the radix tree.
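For reference, the numbers above can be reproduced with a small
back-of-the-envelope script (a sketch, not kernel code; it assumes amd64
with 4KiB pages and the struct sizes quoted in this mail, bm_block = 32
bytes and rtree_node = 16 bytes):

```python
PAGE_SIZE = 4096
BITS_PER_BLOCK = PAGE_SIZE * 8              # pfns covered by one bitmap page
BLOCK_COVERS = BITS_PER_BLOCK * PAGE_SIZE   # 128MB of RAM per bitmap page

def inner_pages(blocks, fanout=512):
    """Index pages needed for a 512-ary radix tree over `blocks` leaves."""
    pages = 0
    while blocks > 1:
        blocks = -(-blocks // fanout)       # ceiling division
        pages += blocks
    return pages

def overhead_delta(ram_bytes):
    """Extra bytes for index pages minus bytes saved by the smaller struct."""
    blocks = -(-ram_bytes // BLOCK_COVERS)
    saved = blocks * (32 - 16)              # bm_block -> rtree_node
    extra = inner_pages(blocks) * PAGE_SIZE
    return extra - saved

GiB = 1 << 30
print(inner_pages((12 << 40) // BLOCK_COVERS))  # 193 index pages at 12TB
print(overhead_delta(4 * GiB) > 0)              # True: net cost at 4GB
print(overhead_delta(64 * GiB) < 0)             # True: net saving at 64GB
```

With these assumptions overhead_delta(32 * GiB) comes out to exactly
zero, matching the break-even estimate.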

Thanks,

	Joerg



* Re: [PATCH 0/6] PM / Hibernate: Memory bitmap scalability improvements
  2014-07-18 11:57 [PATCH 0/6] PM / Hibernate: Memory bitmap scalability improvements Joerg Roedel
                   ` (7 preceding siblings ...)
  2014-07-19 10:26 ` Pavel Machek
@ 2014-07-19 13:49 ` Pavel Machek
  2014-07-21  8:32   ` Joerg Roedel
  8 siblings, 1 reply; 20+ messages in thread
From: Pavel Machek @ 2014-07-19 13:49 UTC (permalink / raw)
  To: Joerg Roedel; +Cc: Rafael J. Wysocki, Len Brown, linux-pm, linux-kernel

Hi!

> These patches improve the data structure by adding a radix
> tree to the linked list structure to improve random access
> performance from O(n) to O(log_b(n)), where b depends on the
> architecture (b=512 on amd64, 1024 in i386).

Are you sure? From your other mail, you said you are adding just a
single page. I'd expect random access performance to go from

O(n) to O(n/1024) in such case?

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: [PATCH 0/6] PM / Hibernate: Memory bitmap scalability improvements
  2014-07-19 13:49 ` Pavel Machek
@ 2014-07-21  8:32   ` Joerg Roedel
  2014-07-21 11:45     ` Pavel Machek
  0 siblings, 1 reply; 20+ messages in thread
From: Joerg Roedel @ 2014-07-21  8:32 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Rafael J. Wysocki, Len Brown, linux-pm, linux-kernel

On Sat, Jul 19, 2014 at 03:49:04PM +0200, Pavel Machek wrote:
> Hi!
> 
> > These patches improve the data structure by adding a radix
> > tree to the linked list structure to improve random access
> > performance from O(n) to O(log_b(n)), where b depends on the
> > architecture (b=512 on amd64, 1024 in i386).
> 
> Are you sure? From your other mail, you said you are adding just a
> single page. I'd expect random access performance to go from
> 
> O(n) to O(n/1024) in such case?

No, for the 4GB case the additional page is used as an index page into
the block-bitmap pages. On amd64 a page can hold 512 references and a
single block-bitmap page is enough for 128MB of RAM. This means that for
systems with up to 64GB of RAM we can get the block-bitmap page directly
from the index page. For systems with more than 64GB of RAM we need
another level of index pages, and now we need two memory accesses to get
to the block-bitmap page (okay, with this implementation it's actually two
memory accesses per level, because the index pages refer to a struct
rtree_node which itself contains the pointer to the index/data page).

A two-level radix tree would be enough for systems with up to 32TB of
RAM. After that we need three levels, but you can already see that this
no longer scales linearly but with log_512(n), where n is the
number of block-bitmap pages required.
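A quick numeric sketch of that scaling (again under the amd64 assumptions
in this mail: 4KiB pages, 512 pointers per index page, 128MB of RAM
covered per block-bitmap page):

```python
def rtree_levels(ram_bytes, fanout=512, block_covers=128 << 20):
    """Index levels needed above the block-bitmap pages."""
    blocks = -(-ram_bytes // block_covers)  # ceiling: bitmap pages required
    levels = 0
    while blocks > 1:
        blocks = -(-blocks // fanout)
        levels += 1
    return levels

TiB = 1 << 40
print(rtree_levels(64 << 30))  # 1: up to 64GB a single index page suffices
print(rtree_levels(12 * TiB))  # 2: the 12TB machine needs two levels
print(rtree_levels(32 * TiB))  # 2: two levels are enough up to 32TB
print(rtree_levels(64 * TiB))  # 3: beyond 32TB a third level is needed
```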


	Joerg



* Re: [PATCH 0/6] PM / Hibernate: Memory bitmap scalability improvements
  2014-07-21  8:32   ` Joerg Roedel
@ 2014-07-21 11:45     ` Pavel Machek
  2014-07-21 14:29       ` Joerg Roedel
  0 siblings, 1 reply; 20+ messages in thread
From: Pavel Machek @ 2014-07-21 11:45 UTC (permalink / raw)
  To: Joerg Roedel; +Cc: Rafael J. Wysocki, Len Brown, linux-pm, linux-kernel

On Mon 2014-07-21 10:32:11, Joerg Roedel wrote:
> On Sat, Jul 19, 2014 at 03:49:04PM +0200, Pavel Machek wrote:
> > Hi!
> > 
> > > These patches improve the data structure by adding a radix
> > > tree to the linked list structure to improve random access
> > > performance from O(n) to O(log_b(n)), where b depends on the
> > > architecture (b=512 on amd64, 1024 in i386).
> > 
> > Are you sure? From your other mail, you said you are adding just a
> > single page. I'd expect random access performance to go from
> > 
> > O(n) to O(n/1024) in such case?
> 
> No, for the 4GB case the additional page is used as an index page into
> the block-bitmap pages. On AMD64 a page can hold 512 references, and a
> single block-bitmap page is enough for 128MB of RAM. This means that for
> systems with up to 64GB of RAM we can get the block-bitmap page directly
> from the index page. For systems with more than 64GB of RAM we need
> another level of index pages, and then we need two memory accesses to get
> to the block-bitmap page (okay, with this implementation it's actually two
> memory accesses per level, because the index pages refer to a struct
> rtree_node which itself contains the pointer to the index/data page).
> 
> A two-level radix tree would be enough for systems with up to 32TB of
> RAM. After that we need 3 levels, but you can already see that this
> doesn't scale linearly anymore but with log_512(n), where n is the
> number of block-bitmap pages required.

Ok, so you have sped it up from O(n) to O(log(n)) speed, and increased
memory requirements from O(n) to O(n * log(n)), right?

does enough_swap() or reqd_free_pages() need to be updated?

Best regards,
								Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: [PATCH 0/6] PM / Hibernate: Memory bitmap scalability improvements
  2014-07-21 11:45     ` Pavel Machek
@ 2014-07-21 14:29       ` Joerg Roedel
  2014-07-21 14:39         ` Pavel Machek
  0 siblings, 1 reply; 20+ messages in thread
From: Joerg Roedel @ 2014-07-21 14:29 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Rafael J. Wysocki, Len Brown, linux-pm, linux-kernel

Hi,

On Mon, Jul 21, 2014 at 01:45:07PM +0200, Pavel Machek wrote:
> Ok, so you have sped it up from O(n) to O(log(n)) speed, and increased
> memory requirements from O(n) to O(n * log(n)), right?

No, new memory requirements are still ~O(n), because the additional pages
needed for the index in the radix tree are outweighed by the memory
saved by the smaller size of struct rtree_node compared to struct
bm_block.

If you want to see real numbers, here is a small tool which tells you
exactly when you need more memory and how much you save with the new
memory bitmap implementation:

	http://www.zlug.org/~joro/data/mem_req.c

It will tell you that the new implementation uses at most 2 more pages
for smaller RAM sizes (2 pages because 2 memory bitmaps are allocated)
and will actually save pages for bigger RAM sizes (already 30 pages on a
1TB machine, 384 pages on a 12TB machine).

You can also modify the above tool to give you some data to plot, then
you will also SEE that there is still a linear relationship between RAM
size and memory required by the old and new memory bitmap
implementation.


	Joerg



* Re: [PATCH 0/6] PM / Hibernate: Memory bitmap scalability improvements
  2014-07-21 14:29       ` Joerg Roedel
@ 2014-07-21 14:39         ` Pavel Machek
  2014-07-21 15:55           ` Joerg Roedel
  0 siblings, 1 reply; 20+ messages in thread
From: Pavel Machek @ 2014-07-21 14:39 UTC (permalink / raw)
  To: Joerg Roedel; +Cc: Rafael J. Wysocki, Len Brown, linux-pm, linux-kernel

On Mon 2014-07-21 16:29:51, Joerg Roedel wrote:
> Hi,
> 
> On Mon, Jul 21, 2014 at 01:45:07PM +0200, Pavel Machek wrote:
> > Ok, so you have sped it up from O(n) to O(log(n)) speed, and increased
> > memory requirements from O(n) to O(n * log(n)), right?
> 
> No, new memory requirements are still ~O(n), because the additional pages
> needed for the index in the radix tree are outweighed by the memory
> saved by the smaller size of struct rtree_node compared to struct
> bm_block.

Are we in agreement on how O() notation works?

So you reduced O(n) to O(n/2 log(n)) ... that's still O(n log(n)), right?

> You can also modify the above tool to give you some data to plot, then
> you will also SEE that there is still a linear relationship between RAM
> size and memory required by the old and new memory bitmap
> implementation.

I believe radix tree is O(log(N)) per operation, but its memory is
certainly not linear.

Yes, log(n) looks quite constant even for huge n :-).
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: [PATCH 0/6] PM / Hibernate: Memory bitmap scalability improvements
  2014-07-21 14:39         ` Pavel Machek
@ 2014-07-21 15:55           ` Joerg Roedel
  0 siblings, 0 replies; 20+ messages in thread
From: Joerg Roedel @ 2014-07-21 15:55 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Rafael J. Wysocki, Len Brown, linux-pm, linux-kernel

Hi,

On Mon, Jul 21, 2014 at 04:39:05PM +0200, Pavel Machek wrote:
> So you reduced O(n) to O(n/2 log(n)) ... that's still O(n log(n)), right?

Radix trees have a memory complexity of ~O(n + log(n)). We have an
additional log(n) space requirement for the tree itself, but we linearly
save memory for each node compared to the old implementation. Since n
beats log(n) there is a break-even point beyond which we need fewer
pages with the new code than with the old. That break-even point is
somewhere around 32GB of RAM; for smaller sizes we need 2 additional
pages at most.

> 
> > You can also modify the above tool to give you some data to plot, then
> > you will also SEE that there is still a linear relationship between RAM
> > size and memory required by the old and new memory bitmap
> > implementation.
> 
> I believe radix tree is O(log(N)) per operation, but its memory is
> certainly not linear.

Numbers beat beliefs.


	Joerg



* [PATCH 6/6] PM / Hibernate: Touch Soft Lockup Watchdog in rtree_next_node
  2014-07-21 10:26 [PATCH 0/6 v2] " Joerg Roedel
@ 2014-07-21 10:27 ` Joerg Roedel
  0 siblings, 0 replies; 20+ messages in thread
From: Joerg Roedel @ 2014-07-21 10:27 UTC (permalink / raw)
  To: Rafael J. Wysocki, Pavel Machek, Len Brown
  Cc: linux-pm, linux-kernel, Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

When a memory bitmap is fully populated on a large memory
machine (several TB of RAM) it can take more than a minute
to walk through all bits. This causes the soft lockup
detector on these machines to report warnings.

Avoid this by touching the soft lockup watchdog in the
memory bitmap walking code.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 kernel/power/snapshot.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 6a4e07e..738d930 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -769,6 +769,7 @@ static bool rtree_next_node(struct memory_bitmap *bm)
 	if (&bm->cur.node->list != &bm->cur.zone->leaves) {
 		bm->cur.node_pfn += BM_BITS_PER_BLOCK;
 		bm->cur.node_bit  = 0;
+		touch_softlockup_watchdog();
 		return true;
 	}
 
-- 
1.9.1



end of thread, other threads:[~2014-07-21 15:55 UTC | newest]

Thread overview: 20+ messages
2014-07-18 11:57 [PATCH 0/6] PM / Hibernate: Memory bitmap scalability improvements Joerg Roedel
2014-07-18 11:57 ` [PATCH 1/6] PM / Hibernate: Create a Radix-Tree to store memory bitmap Joerg Roedel
2014-07-19  0:00   ` Rafael J. Wysocki
2014-07-19  7:57     ` Joerg Roedel
2014-07-18 11:57 ` [PATCH 2/6] PM / Hibernate: Add memory_rtree_find_bit function Joerg Roedel
2014-07-18 11:57 ` [PATCH 3/6] PM / Hibernate: Implement position keeping in radix tree Joerg Roedel
2014-07-18 11:57 ` [PATCH 4/6] PM / Hibernate: Iterate over set bits instead of PFNs in swsusp_free() Joerg Roedel
2014-07-18 11:57 ` [PATCH 5/6] PM / Hibernate: Remove the old memory-bitmap implementation Joerg Roedel
2014-07-18 11:57 ` [PATCH 6/6] PM / Hibernate: Touch Soft Lockup Watchdog in rtree_next_node Joerg Roedel
2014-07-18 22:05 ` [PATCH 0/6] PM / Hibernate: Memory bitmap scalability improvements Rafael J. Wysocki
2014-07-19  7:55   ` Joerg Roedel
2014-07-19 10:26 ` Pavel Machek
2014-07-19 11:35   ` Joerg Roedel
2014-07-19 13:49 ` Pavel Machek
2014-07-21  8:32   ` Joerg Roedel
2014-07-21 11:45     ` Pavel Machek
2014-07-21 14:29       ` Joerg Roedel
2014-07-21 14:39         ` Pavel Machek
2014-07-21 15:55           ` Joerg Roedel
2014-07-21 10:26 [PATCH 0/6 v2] " Joerg Roedel
2014-07-21 10:27 ` [PATCH 6/6] PM / Hibernate: Touch Soft Lockup Watchdog in rtree_next_node Joerg Roedel
