* [kvm-unit-tests RFC v1 0/5] Rewrite the allocators
@ 2020-08-14 15:10 Claudio Imbrenda
  2020-08-14 15:10 ` [kvm-unit-tests RFC v1 1/5] lib/vmalloc: vmalloc support for handling allocation metadata Claudio Imbrenda
                   ` (5 more replies)
  0 siblings, 6 replies; 13+ messages in thread
From: Claudio Imbrenda @ 2020-08-14 15:10 UTC (permalink / raw)
  To: kvm, pbonzini; +Cc: frankja, david, thuth, cohuck, lvivier

The KVM unit tests are increasingly being used to test more than just
KVM.  They are being used to test TCG, qemu I/O device emulation, other
hypervisors, and even actual hardware.

The existing memory allocators are increasingly inadequate for the needs
of the upcoming unit tests (and also of some existing ones, see below).

Some important features that are lacking:
* ability to perform a small physical page allocation with a large
  alignment without wasting huge amounts of memory
* ability to allocate physical pages from specific pools/areas (e.g.
  below 16M, or 4G, etc.)
* ability to reserve arbitrary pages (if free), removing them from the
  free pool

Some other features that are nice, but not so fundamental:
* no need for the generic allocator to keep track of metadata
  (i.e. allocation size); this is now handled by the lower-level
  allocators
* coalescing small blocks into bigger ones, to allow contiguous memory
  freed in small blocks in a random order to be used for large
  allocations again

This is achieved in the following ways:

For the virtual allocator:
* the virtual allocator needs one extra page of metadata, but only for
  allocations that would not fit in one page

For the page allocator:
* the page allocator has up to 4 memory areas; each area has a metadata
  region with one byte per usable page, describing the order of the
  block the page belongs to and whether it is free
* if there are no free blocks of the desired size, a bigger block is
  split until we reach the required size; the unused parts of the block
  are put back in the free lists
* if an allocation requires an alignment larger than its size, a larger
  block of (at least) the required order is split; the unused parts are
  put back in the free lists (see the sketch after this list)
* if the allocation could not be satisfied, the next allowed area is
  searched; the allocation fails only when all allowed areas have been
  tried
* new functions to perform allocations from specific areas; the areas
  are arch-dependent and should be set up by the arch code
* for now x86 has a memory area for "low" memory under 4GB and one for
  the rest, while s390x has one for memory under 2GB and one for the
  rest; suggestions for more fine-grained areas are welcome
* upon freeing a block, an attempt is made to coalesce it with its
  neighbour (if it is free); the process is repeated on the resulting
  larger block
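
For example, a single page with a large alignment can now be obtained
without wasting the whole aligned block (a minimal sketch, using the
memalign_pages() introduced in patch 2):

  /*
   * one page of memory, aligned to 1 MiB; the unused parts of the
   * split block go back on the free lists
   */
  void *p = memalign_pages(1ul << 20, PAGE_SIZE);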

For the physical allocator:
* the minimum alignment is now handled manually, since it has been
  removed from the common struct


This patchset addresses some current but otherwise unsolvable issues on
s390x, such as the need to allocate a block under 2GB for each SMP CPU
upon CPU activation.
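
As a sketch, the lowcore allocation then looks like this (based on the
lib/s390x/smp.c hunk in patch 2, where _AREA() and alloc_pages_area()
are introduced):

  /* two naturally aligned pages (order 1) from area 1, i.e. below 2GB */
  void *lc = alloc_pages_area(_AREA(1), 1);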

This patch series has been tested on s390x, amd64 and i386. It has also been
compiled on aarch64.

Claudio Imbrenda (5):
  lib/vmalloc: vmalloc support for handling allocation metadata
  lib/alloc_page: complete rewrite of the page allocator
  lib/alloc: simplify free and malloc
  lib/alloc.h: remove align_min from struct alloc_ops
  lib/alloc_page: allow reserving arbitrary memory ranges

 lib/alloc.h      |   3 +-
 lib/alloc_page.h |  81 +++++++-
 lib/alloc.c      |  42 +---
 lib/alloc_page.c | 510 ++++++++++++++++++++++++++++++++++++++---------
 lib/alloc_phys.c |   9 +-
 lib/arm/setup.c  |   2 +-
 lib/s390x/sclp.c |  11 +-
 lib/s390x/smp.c  |   6 +-
 lib/vmalloc.c    | 121 +++++++++--
 s390x/smp.c      |   4 +-
 10 files changed, 617 insertions(+), 172 deletions(-)

-- 
2.26.2



* [kvm-unit-tests RFC v1 1/5] lib/vmalloc: vmalloc support for handling allocation metadata
  2020-08-14 15:10 [kvm-unit-tests RFC v1 0/5] Rewrite the allocators Claudio Imbrenda
@ 2020-08-14 15:10 ` Claudio Imbrenda
  2020-08-19 14:36   ` Janosch Frank
  2020-08-14 15:10 ` [kvm-unit-tests RFC v1 2/5] lib/alloc_page: complete rewrite of the page allocator Claudio Imbrenda
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 13+ messages in thread
From: Claudio Imbrenda @ 2020-08-14 15:10 UTC (permalink / raw)
  To: kvm, pbonzini; +Cc: frankja, david, thuth, cohuck, lvivier

Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
---
 lib/vmalloc.c | 105 +++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 95 insertions(+), 10 deletions(-)

diff --git a/lib/vmalloc.c b/lib/vmalloc.c
index e0c7b6b..aca0876 100644
--- a/lib/vmalloc.c
+++ b/lib/vmalloc.c
@@ -15,6 +15,13 @@
 #include <bitops.h>
 #include "vmalloc.h"
 
+#define VM_MAGIC 0x7E57C0DE
+
+struct metadata {
+	unsigned long npages;
+	unsigned long magic;
+};
+
 static struct spinlock lock;
 static void *vfree_top = 0;
 static void *page_root;
@@ -25,8 +32,14 @@ static void *page_root;
  *
  * nr is the number of pages to allocate
  * alignment_pages is the alignment of the allocation *in pages*
+ * metadata indicates whether an extra (unaligned) page needs to be allocated
+ * right before the main (aligned) allocation.
+ *
+ * The return value points to the first allocated virtual page, which will
+ * be the (potentially unaligned) metadata page if the metadata flag is
+ * specified.
  */
-void *alloc_vpages_aligned(ulong nr, unsigned int align_order)
+static void *do_alloc_vpages(ulong nr, unsigned int align_order, bool metadata)
 {
 	uintptr_t ptr;
 
@@ -34,6 +47,8 @@ void *alloc_vpages_aligned(ulong nr, unsigned int align_order)
 	ptr = (uintptr_t)vfree_top;
 	ptr -= PAGE_SIZE * nr;
 	ptr &= GENMASK_ULL(63, PAGE_SHIFT + align_order);
+	if (metadata)
+		ptr -= PAGE_SIZE;
 	vfree_top = (void *)ptr;
 	spin_unlock(&lock);
 
@@ -41,6 +56,11 @@ void *alloc_vpages_aligned(ulong nr, unsigned int align_order)
 	return (void *)ptr;
 }
 
+void *alloc_vpages_aligned(ulong nr, unsigned int align_order)
+{
+	return do_alloc_vpages(nr, align_order, false);
+}
+
 void *alloc_vpages(ulong nr)
 {
 	return alloc_vpages_aligned(nr, 0);
@@ -69,35 +89,100 @@ void *vmap(phys_addr_t phys, size_t size)
 	return mem;
 }
 
+/*
+ * Allocate one page, for an object with specified alignment.
+ * The resulting pointer will be aligned to the required alignment, but
+ * intentionally not page-aligned.
+ */
+static void *vm_alloc_one_page(size_t alignment)
+{
+	void *p;
+
+	assert(alignment >= sizeof(uintptr_t));
+	assert(alignment < PAGE_SIZE);
+	p = alloc_vpage();
+	install_page(page_root, virt_to_phys(alloc_page()), p);
+	/* write the magic at the beginning of the page */
+	*(uintptr_t *)p = VM_MAGIC;
+	return (void*)((uintptr_t)p + alignment);
+}
+
+static struct metadata *get_metadata(void *p)
+{
+	struct metadata *m = p;
+
+	return m - 1;
+}
+
 /*
  * Allocate virtual memory, with the specified minimum alignment.
+ * If the allocation fits in one page, only one page is allocated. Otherwise
+ * enough pages are allocated for the object, plus one to keep metadata
+ * information about the allocation.
  */
 static void *vm_memalign(size_t alignment, size_t size)
 {
+	struct metadata *m;
 	phys_addr_t pa;
-	void *mem, *p;
+	uintptr_t p;
+	void *mem;
+	size_t i;
 
+	if (!size)
+		return NULL;
 	assert(is_power_of_2(alignment));
 
+	if (alignment < sizeof(uintptr_t))
+		alignment = sizeof(uintptr_t);
+	/* it fits in one page, allocate only one page */
+	if (alignment + size <= PAGE_SIZE)
+		return vm_alloc_one_page(alignment);
 	size = PAGE_ALIGN(size) / PAGE_SIZE;
 	alignment = get_order(PAGE_ALIGN(alignment) / PAGE_SIZE);
-	mem = p = alloc_vpages_aligned(size, alignment);
-	while (size--) {
+	mem = do_alloc_vpages(size, alignment, true);
+	p = (uintptr_t)mem;
+	/* skip the metadata page */
+	mem = (void *)(p + PAGE_SIZE);
+	/*
+	 * time to actually allocate the physical pages to back our virtual
+	 * allocation; note that we need to allocate one extra page (for the
+	 * metadata), hence the <=
+	 */
+	for (i = 0; i <= size; i++, p += PAGE_SIZE) {
 		pa = virt_to_phys(alloc_page());
 		assert(pa);
-		install_page(page_root, pa, p);
-		p += PAGE_SIZE;
+		install_page(page_root, pa, (void *)p);
 	}
+	m = get_metadata(mem);
+	m->npages = size;
+	m->magic = VM_MAGIC;
 	return mem;
 }
 
 static void vm_free(void *mem, size_t size)
 {
-	while (size) {
-		free_page(phys_to_virt(virt_to_pte_phys(page_root, mem)));
-		mem += PAGE_SIZE;
-		size -= PAGE_SIZE;
+	struct metadata *m;
+	uintptr_t ptr, end;
+
+	/* the pointer is not page-aligned, it was a single-page allocation */
+	if (!IS_ALIGNED((uintptr_t)mem, PAGE_SIZE)) {
+		ptr = virt_to_pte_phys(page_root, mem) & PAGE_MASK;
+		assert(*(uintptr_t *)ptr == VM_MAGIC);
+		free_page(phys_to_virt(ptr));
+		return;
 	}
+
+	/* the pointer is page-aligned, it was a multi-page allocation */
+	m = get_metadata(mem);
+	assert(m->magic == VM_MAGIC);
+	assert(m->npages > 0);
+	/* free all the pages including the metadata page */
+	ptr = (uintptr_t)mem - PAGE_SIZE;
+	end = ptr + m->npages * PAGE_SIZE;
+	for ( ; ptr < end; ptr += PAGE_SIZE)
+		free_page(phys_to_virt(virt_to_pte_phys(page_root, (void *)ptr)));
+	/* free the last one separately to avoid overflow issues */
+	free_page(phys_to_virt(virt_to_pte_phys(page_root, (void *)ptr)));
 }
 
 static struct alloc_ops vmalloc_ops = {
-- 
2.26.2



* [kvm-unit-tests RFC v1 2/5] lib/alloc_page: complete rewrite of the page allocator
  2020-08-14 15:10 [kvm-unit-tests RFC v1 0/5] Rewrite the allocators Claudio Imbrenda
  2020-08-14 15:10 ` [kvm-unit-tests RFC v1 1/5] lib/vmalloc: vmalloc support for handling allocation metadata Claudio Imbrenda
@ 2020-08-14 15:10 ` Claudio Imbrenda
  2020-08-18 16:06   ` Cornelia Huck
  2020-08-19 15:44   ` Janosch Frank
  2020-08-14 15:10 ` [kvm-unit-tests RFC v1 3/5] lib/alloc: simplify free and malloc Claudio Imbrenda
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 13+ messages in thread
From: Claudio Imbrenda @ 2020-08-14 15:10 UTC (permalink / raw)
  To: kvm, pbonzini; +Cc: frankja, david, thuth, cohuck, lvivier

This is a complete rewrite of the page allocator.

This will bring a few improvements:
* no need to specify the size when freeing
* allocate small areas with a large alignment without wasting memory
* ability to initialize and use multiple memory areas (e.g. DMA)
* more sanity checks

A few things have changed:
* initialization cannot be done with free_pages like before,
  page_alloc_init_area has to be used instead (see the sketch below)
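
A minimal initialization sketch (base_pfn and top_pfn are placeholders
for the page frame numbers of the usable range, as in the lib/arm/setup.c
hunk below):

  /* -1 means "use the first available area slot" */
  page_alloc_init_area(-1, base_pfn, top_pfn);
  page_alloc_ops_enable();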

Arch-specific changes:
* arm and x86 have been adapted to put all the memory in just one big
  area (or two, for x86_64 with highmem).
* s390x instead creates one area below 2GiB and one above; the area
  below 2GiB is used for SMP lowcore initialization.

Details:
Each memory area has metadata at the very beginning. The metadata is a
byte array with one entry per usable page (so, excluding the metadata
itself). Each entry indicates if the page is special (unused for now),
if it is allocated, and the order of the block. Both free and allocated
pages are part of larger blocks.
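
The state byte can be thought of as follows (illustrative helpers only,
not part of the patch; ORDER_MASK and ALLOC_MASK are defined in
lib/alloc_page.c below):

  static inline u8 page_block_order(u8 state)
  {
          return state & ORDER_MASK;  /* order of the block the page is in */
  }

  static inline bool page_is_allocated(u8 state)
  {
          return state & ALLOC_MASK;
  }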

Additional per-area metadata is kept in a fixed-size static array. This
metadata contains the start and end page frame numbers, the pointer to the
per-page metadata array, and the array of freelists. The array of freelists
has an entry for each possible order (indicated by the macro NLISTS,
defined as BITS_PER_LONG - PAGE_SHIFT).

On allocation, if the free list for the needed size is empty, larger
blocks are split. When a small allocation with a large alignment is
requested, an appropriately large block is split, to guarantee the
alignment.

When a block is freed, an attempt is made to merge it with its neighbour,
repeating the process on the resulting larger block as long as possible.

Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
---
 lib/alloc_page.h |  64 ++++++-
 lib/alloc_page.c | 451 ++++++++++++++++++++++++++++++++++++-----------
 lib/arm/setup.c  |   2 +-
 lib/s390x/sclp.c |  11 +-
 lib/s390x/smp.c  |   2 +-
 lib/vmalloc.c    |  13 +-
 6 files changed, 427 insertions(+), 116 deletions(-)

diff --git a/lib/alloc_page.h b/lib/alloc_page.h
index 88540d1..6472abd 100644
--- a/lib/alloc_page.h
+++ b/lib/alloc_page.h
@@ -8,12 +8,72 @@
 #ifndef ALLOC_PAGE_H
 #define ALLOC_PAGE_H 1
 
+#define _AREA(x) (1 << (x))
+
+/* Returns true if the page allocator has been initialized */
 bool page_alloc_initialized(void);
+
+/*
+ * Initializes a memory area.
+ * n is the number of the area to initialize; if n == -1, the first available
+ * area is used
+ * base_pfn is the physical frame number of the start of the area to initialize
+ * top_pfn is the physical frame number of the first page immediately after
+ * the end of the area to initialize
+ */
+void page_alloc_init_area(int n, uintptr_t base_pfn, uintptr_t top_pfn);
+
+/* Enables the page allocator. At least one area must have been initialized */
 void page_alloc_ops_enable(void);
+
+/*
+ * Allocate aligned memory from the specified areas.
+ * areas is a bitmap of allowed areas
+ * alignment must be a power of 2
+ */
+void *memalign_pages_area(unsigned int areas, size_t alignment, size_t size);
+
+/*
+ * Allocate aligned memory from any area.
+ * Equivalent to memalign_pages_area(~0, alignment, size).
+ */
+void *memalign_pages(size_t alignment, size_t size);
+
+/*
+ * Allocate naturally aligned memory from the specified areas.
+ * Equivalent to memalign_pages_area(areas, 1ull << order, 1ull << order).
+ */
+void *alloc_pages_area(unsigned int areas, unsigned int order);
+
+/*
+ * Allocate one page from any area.
+ * Equivalent to alloc_pages(0);
+ */
 void *alloc_page(void);
+
+/*
+ * Allocate naturally aligned memory from any area.
+ * Equivalent to alloc_pages_area(~0, order);
+ */
 void *alloc_pages(unsigned int order);
-void free_page(void *page);
+
+/*
+ * Frees a memory block allocated with any of the memalign_pages* or
+ * alloc_pages* functions.
+ * The pointer must point to the start of the block.
+ */
 void free_pages(void *mem, size_t size);
-void free_pages_by_order(void *mem, unsigned int order);
+
+/* For backwards compatibility */
+static inline void free_page(void *mem)
+{
+	return free_pages(mem, 1);
+}
+
+/* For backwards compatibility */
+static inline void free_pages_by_order(void *mem, unsigned int order)
+{
+	free_pages(mem, 1ull << order);
+}
 
 #endif
diff --git a/lib/alloc_page.c b/lib/alloc_page.c
index 74fe726..7c91f91 100644
--- a/lib/alloc_page.c
+++ b/lib/alloc_page.c
@@ -13,165 +13,410 @@
 #include <asm/io.h>
 #include <asm/spinlock.h>
 
+#define IS_ALIGNED_ORDER(x,order) IS_ALIGNED((x),BIT_ULL(order))
+#define NLISTS ((BITS_PER_LONG) - (PAGE_SHIFT))
+#define PFN(x) ((uintptr_t)(x) >> PAGE_SHIFT)
+
+#define MAX_AREAS	4
+
+#define ORDER_MASK	0x3f
+#define ALLOC_MASK	0x40
+
+struct free_list {
+	struct free_list *prev;
+	struct free_list *next;
+};
+
+struct mem_area {
+	/* Physical frame number of the first usable frame in the area */
+	uintptr_t base;
+	/* Physical frame number of the first frame outside the area */
+	uintptr_t top;
+	/* Combination ALLOC_MASK and order */
+	u8 *page_states;
+	/* One freelist for each possible block size, up to NLISTS */
+	struct free_list freelists[NLISTS];
+};
+
+static struct mem_area areas[MAX_AREAS];
+static unsigned int areas_mask;
 static struct spinlock lock;
-static void *freelist = 0;
 
 bool page_alloc_initialized(void)
 {
-	return freelist != 0;
+	return areas_mask != 0;
 }
 
-void free_pages(void *mem, size_t size)
+static inline bool area_overlaps(struct mem_area *a, uintptr_t pfn)
 {
-	void *old_freelist;
-	void *end;
+	return (pfn >= PFN(a->page_states)) && (pfn < a->top);
+}
 
-	assert_msg((unsigned long) mem % PAGE_SIZE == 0,
-		   "mem not page aligned: %p", mem);
+static inline bool area_contains(struct mem_area *a, uintptr_t pfn)
+{
+	return (pfn >= a->base) && (pfn < a->top);
+}
 
-	assert_msg(size % PAGE_SIZE == 0, "size not page aligned: %#zx", size);
+static inline bool is_list_empty(struct free_list *p)
+{
+	return !p->next || !p->prev || p == p->next || p == p->prev;
+}
 
-	assert_msg(size == 0 || (uintptr_t)mem == -size ||
-		   (uintptr_t)mem + size > (uintptr_t)mem,
-		   "mem + size overflow: %p + %#zx", mem, size);
+static struct free_list *list_remove(struct free_list *l)
+{
+	if (is_list_empty(l))
+		return NULL;
 
-	if (size == 0) {
-		freelist = NULL;
-		return;
-	}
+	l->prev->next = l->next;
+	l->next->prev = l->prev;
+	l->prev = l->next = NULL;
 
-	spin_lock(&lock);
-	old_freelist = freelist;
-	freelist = mem;
-	end = mem + size;
-	while (mem + PAGE_SIZE != end) {
-		*(void **)mem = (mem + PAGE_SIZE);
-		mem += PAGE_SIZE;
-	}
+	return l;
+}
 
-	*(void **)mem = old_freelist;
-	spin_unlock(&lock);
+static void list_add(struct free_list *head, struct free_list *li)
+{
+	assert(li);
+	assert(head);
+	li->prev = head;
+	li->next = head->next;
+	head->next->prev = li;
+	head->next = li;
 }
 
-void free_pages_by_order(void *mem, unsigned int order)
+/*
+ * Splits the free block starting at addr into 2 blocks of half the size.
+ *
+ * The function depends on the following assumptions:
+ * - The allocator must have been initialized
+ * - the block must be within the memory area
+ * - all pages in the block must be free and not special
+ * - the pointer must point to the start of the block
+ * - all pages in the block must have the same block size.
+ * - the block size must be greater than 0
+ * - the block size must be smaller than the maximum allowed
+ * - the block must be in a free list
+ * - the function is called with the lock held
+ */
+static void split(struct mem_area *a, void *addr)
 {
-	free_pages(mem, 1ul << (order + PAGE_SHIFT));
+	uintptr_t pfn = PFN(addr);
+	struct free_list *p;
+	uintptr_t i, idx;
+	u8 order;
+
+	assert(a && area_contains(a, pfn));
+	idx = pfn - a->base;
+	order = a->page_states[idx];
+	assert(!(order & ~ORDER_MASK) && order && (order < NLISTS));
+	assert(IS_ALIGNED_ORDER(pfn, order));
+	assert(area_contains(a, pfn + BIT(order) - 1));
+
+	/* Remove the block from its free list */
+	p = list_remove(addr);
+	assert(p);
+
+	/* update the block size for each page in the block */
+	for (i = 0; i < BIT(order); i++) {
+		assert(a->page_states[idx + i] == order);
+		a->page_states[idx + i] = order - 1;
+	}
+	order--;
+	/* add the first half block to the appropriate free list */
+	list_add(a->freelists + order, p);
+	/* add the second half block to the appropriate free list */
+	list_add(a->freelists + order, (void *)((pfn + BIT(order)) * PAGE_SIZE));
 }
 
-void *alloc_page()
+/*
+ * Returns a block whose alignment and size are at least the parameter values.
+ * If there is not enough free memory, NULL is returned.
+ *
+ * Both parameters must be not larger than the largest allowed order
+ */
+static void *page_memalign_order(struct mem_area *a, u8 al, u8 sz)
 {
-	void *p;
+	struct free_list *p, *res = NULL;
+	u8 order;
 
-	if (!freelist)
-		return 0;
+	assert((al < NLISTS) && (sz < NLISTS));
+	/* we need the bigger of the two as starting point */
+	order = sz > al ? sz : al;
 
-	spin_lock(&lock);
-	p = freelist;
-	freelist = *(void **)freelist;
-	spin_unlock(&lock);
+	/* search all free lists for some memory */
+	for ( ; order < NLISTS; order++) {
+		p = a->freelists[order].next;
+		if (!is_list_empty(p))
+			break;
+	}
+	/* out of memory */
+	if (order >= NLISTS)
+		return NULL;
+
+	/*
+	 * the block is bigger than what we need because either there were
+	 * no smaller blocks, or the smaller blocks were not aligned to our
+	 * needs; therefore we split the block until we reach the needed size
+	 */
+	for (; order > sz; order--)
+		split(a, p);
 
-	if (p)
-		memset(p, 0, PAGE_SIZE);
-	return p;
+	res = list_remove(p);
+	memset(a->page_states + (PFN(res) - a->base), ALLOC_MASK | order, BIT(order));
+	return res;
 }
 
 /*
- * Allocates (1 << order) physically contiguous and naturally aligned pages.
- * Returns NULL if there's no memory left.
+ * Try to merge two blocks into a bigger one.
+ * Returns true in case of a successful merge.
+ * Merging will succeed only if both blocks have the same block size and are
+ * both free.
+ *
+ * The function depends on the following assumptions:
+ * - the first parameter is strictly smaller than the second
+ * - the parameters must point each to the start of their block
+ * - the two parameters point to adjacent blocks
+ * - the two blocks are both in a free list
+ * - all of the pages of the two blocks must be free
+ * - all of the pages of the two blocks must have the same block size
+ * - the function is called with the lock held
  */
-void *alloc_pages(unsigned int order)
+static bool coalesce(struct mem_area *a, u8 order, uintptr_t pfn, uintptr_t pfn2)
 {
-	/* Generic list traversal. */
-	void *prev;
-	void *curr = NULL;
-	void *next = freelist;
+	uintptr_t first, second, i;
+	struct free_list *li;
 
-	/* Looking for a run of length (1 << order). */
-	unsigned long run = 0;
-	const unsigned long n = 1ul << order;
-	const unsigned long align_mask = (n << PAGE_SHIFT) - 1;
-	void *run_start = NULL;
-	void *run_prev = NULL;
-	unsigned long run_next_pa = 0;
-	unsigned long pa;
+	assert(IS_ALIGNED_ORDER(pfn, order) && IS_ALIGNED_ORDER(pfn2, order));
+	assert(pfn2 == pfn + BIT(order));
+	assert(a);
 
-	assert(order < sizeof(unsigned long) * 8);
+	if (!area_contains(a, pfn) || !area_contains(a, pfn2 + BIT(order) - 1))
+		return false;
+	first = pfn - a->base;
+	second = pfn2 - a->base;
+	if ((a->page_states[first] != order) || (a->page_states[second] != order))
+		return false;
 
-	spin_lock(&lock);
-	for (;;) {
-		prev = curr;
-		curr = next;
+	li = list_remove((void *)(pfn2 << PAGE_SHIFT));
+	assert(li);
+	li = list_remove((void *)(pfn << PAGE_SHIFT));
+	assert(li);
+	for (i = 0; i < (2ull << order); i++) {
+		assert(a->page_states[first + i] == order);
+		a->page_states[first + i] = order + 1;
+	}
+	list_add(a->freelists + order + 1, li);
+	return true;
+}
 
-		if (!curr) {
-			run_start = NULL;
-			break;
-		}
+/*
+ * Free a block of memory.
+ * The parameter can be NULL, in which case nothing happens.
+ *
+ * The function depends on the following assumptions:
+ * - the parameter is page aligned
+ * - the parameter belongs to an existing memory area
+ * - the parameter points to the beginning of the block
+ * - the size of the block is less than the maximum allowed
+ * - the block is completely contained in its memory area
+ * - all pages in the block have the same block size
+ * - no pages in the memory block were already free
+ * - no pages in the memory block are special
+ */
+static void _free_pages(void *mem)
+{
+	uintptr_t pfn2, pfn = PFN(mem);
+	struct mem_area *a = NULL;
+	uintptr_t i, p;
+	u8 order;
 
-		next = *((void **) curr);
-		pa = virt_to_phys(curr);
-
-		if (run == 0) {
-			if (!(pa & align_mask)) {
-				run_start = curr;
-				run_prev = prev;
-				run_next_pa = pa + PAGE_SIZE;
-				run = 1;
-			}
-		} else if (pa == run_next_pa) {
-			run_next_pa += PAGE_SIZE;
-			run += 1;
-		} else {
-			run = 0;
-		}
+	if (!mem)
+		return;
+	assert(IS_ALIGNED((uintptr_t)mem, PAGE_SIZE));
+	for (i = 0; !a && (i < MAX_AREAS); i++) {
+		if ((areas_mask & BIT(i)) && area_contains(areas + i, pfn))
+			a = areas + i;
+	}
+	assert_msg(a, "memory does not belong to any area: %p", mem);
 
-		if (run == n) {
-			if (run_prev)
-				*((void **) run_prev) = next;
-			else
-				freelist = next;
-			break;
-		}
+	p = pfn - a->base;
+	order = a->page_states[p] & ORDER_MASK;
+
+	assert(a->page_states[p] == (order | ALLOC_MASK));
+	assert(order < NLISTS);
+	assert(IS_ALIGNED_ORDER(pfn, order));
+	assert(area_contains(a, pfn + BIT(order) - 1));
+
+	for (i = 0; i < BIT(order); i++) {
+		assert(a->page_states[p + i] == (ALLOC_MASK | order));
+		a->page_states[p + i] &= ~ALLOC_MASK;
 	}
-	spin_unlock(&lock);
-	if (run_start)
-		memset(run_start, 0, n * PAGE_SIZE);
-	return run_start;
+	list_add(a->freelists + order, mem);
+	do {
+		order = a->page_states[p] & ORDER_MASK;
+		if (!IS_ALIGNED_ORDER(pfn, order + 1))
+			pfn = pfn - BIT(order);
+		pfn2 = pfn + BIT(order);
+	} while (coalesce(a, order, pfn, pfn2));
 }
 
+void free_pages(void *mem, size_t size)
+{
+	spin_lock(&lock);
+	_free_pages(mem);
+	spin_unlock(&lock);
+}
 
-void free_page(void *page)
+static void *page_memalign_order_area(unsigned area, u8 ord, u8 al)
 {
+	void *res = NULL;
+	int i;
+
 	spin_lock(&lock);
-	*(void **)page = freelist;
-	freelist = page;
+	area &= areas_mask;
+	for (i = 0; !res && (i < MAX_AREAS); i++)
+		if (area & BIT(i))
+			res = page_memalign_order(areas + i, ord, al);
 	spin_unlock(&lock);
+	return res;
 }
 
-static void *page_memalign(size_t alignment, size_t size)
+/*
+ * Allocates (1 << order) physically contiguous and naturally aligned pages.
+ * Returns NULL if the allocation was not possible.
+ */
+void *alloc_pages_area(unsigned int area, unsigned int order)
 {
-	unsigned long n = ALIGN(size, PAGE_SIZE) >> PAGE_SHIFT;
-	unsigned int order;
+	return page_memalign_order_area(area, order, order);
+}
 
-	if (!size)
-		return NULL;
+void *alloc_pages(unsigned int order)
+{
+	return alloc_pages_area(~0, order);
+}
 
-	order = get_order(n);
+/*
+ * Allocates (1 << order) physically contiguous aligned pages.
+ * Returns NULL if the allocation was not possible.
+ */
+void *memalign_pages_area(unsigned int area, size_t alignment, size_t size)
+{
+	assert(is_power_of_2(alignment));
+	alignment = get_order(PAGE_ALIGN(alignment) >> PAGE_SHIFT);
+	size = get_order(PAGE_ALIGN(size) >> PAGE_SHIFT);
+	assert(alignment < NLISTS);
+	assert(size < NLISTS);
+	return page_memalign_order_area(area, size, alignment);
+}
 
-	return alloc_pages(order);
+void *memalign_pages(size_t alignment, size_t size)
+{
+	return memalign_pages_area(~0, alignment, size);
 }
 
-static void page_free(void *mem, size_t size)
+/*
+ * Allocates one page
+ */
+void *alloc_page()
 {
-	free_pages(mem, size);
+	return alloc_pages(0);
 }
 
 static struct alloc_ops page_alloc_ops = {
-	.memalign = page_memalign,
-	.free = page_free,
+	.memalign = memalign_pages,
+	.free = free_pages,
 	.align_min = PAGE_SIZE,
 };
 
+/*
+ * Enables the page allocator.
+ *
+ * Prerequisites:
+ * - at least one memory area has been initialized
+ */
 void page_alloc_ops_enable(void)
 {
+	spin_lock(&lock);
+	assert(page_alloc_initialized());
 	alloc_ops = &page_alloc_ops;
+	spin_unlock(&lock);
+}
+
+/*
+ * Adds a new memory area to the pool of available memory.
+ *
+ * Prerequisites:
+ * - the lock is held
+ * - start and top are page frame numbers
+ * - start is smaller than top
+ * - top does not fall outside of addressable memory
+ * - there is at least one more slot free for memory areas
+ * - if a specific memory area number has been indicated, it needs to be free
+ * - the memory area to add does not overlap with existing areas
+ * - the memory area to add has at least 5 pages available
+ */
+static void _page_alloc_init_area(int n, uintptr_t start, uintptr_t top)
+{
+	size_t table_size, npages, i;
+	struct mem_area *a;
+	u8 order = 0;
+
+	if (n == -1) {
+		for (n = 0; n < MAX_AREAS; n++) {
+			if (!(areas_mask & BIT(n)))
+				break;
+		}
+	}
+	assert(n < MAX_AREAS);
+	assert(!(areas_mask & BIT(n)));
+
+	assert(top > start);
+	assert(top - start > 4);
+	assert(top < BIT_ULL(sizeof(void *) * 8 - PAGE_SHIFT));
+
+	table_size = (top - start + PAGE_SIZE) / (PAGE_SIZE + 1);
+
+	a = areas + n;
+	a->page_states = (void *)(start << PAGE_SHIFT);
+	a->base = start + table_size;
+	a->top = top;
+	npages = top - a->base;
+
+	assert((a->base - start) * PAGE_SIZE >= npages);
+	for (i = 0; i < MAX_AREAS; i++) {
+		if (!(areas_mask & BIT(i)))
+			continue;
+		assert(!area_overlaps(areas + i, start));
+		assert(!area_overlaps(areas + i, top - 1));
+		assert(!area_overlaps(a, PFN(areas[i].page_states)));
+		assert(!area_overlaps(a, areas[i].top - 1));
+	}
+	for (i = 0; i < NLISTS; i++)
+		a->freelists[i].next = a->freelists[i].prev = a->freelists + i;
+
+	for (i = a->base; i < a->top; i += 1ull << order) {
+		while (i + BIT(order) > a->top) {
+			assert(order);
+			order--;
+		}
+		while (IS_ALIGNED_ORDER(i, order + 1) && (i + BIT(order + 1) <= a->top))
+			order++;
+		assert(order < NLISTS);
+		memset(a->page_states + (i - a->base), order, BIT(order));
+		list_add(a->freelists + order, (void *)(i << PAGE_SHIFT));
+	}
+	areas_mask |= BIT(n);
+}
+
+/*
+ * Adds a new memory area to the pool of available memory.
+ *
+ * Prerequisites:
+ * see _page_alloc_init_area
+ */
+void page_alloc_init_area(int n, uintptr_t base_pfn, uintptr_t top_pfn)
+{
+	spin_lock(&lock);
+	_page_alloc_init_area(n, base_pfn, top_pfn);
+	spin_unlock(&lock);
 }
diff --git a/lib/arm/setup.c b/lib/arm/setup.c
index 78562e4..a3c573f 100644
--- a/lib/arm/setup.c
+++ b/lib/arm/setup.c
@@ -155,7 +155,7 @@ static void mem_init(phys_addr_t freemem_start)
 	assert(sizeof(long) == 8 || !(base >> 32));
 	if (sizeof(long) != 8 && (top >> 32) != 0)
 		top = ((uint64_t)1 << 32);
-	free_pages((void *)(unsigned long)base, top - base);
+	page_alloc_init_area(-1, base, top);
 	page_alloc_ops_enable();
 }
 
diff --git a/lib/s390x/sclp.c b/lib/s390x/sclp.c
index 4054d0e..c25d442 100644
--- a/lib/s390x/sclp.c
+++ b/lib/s390x/sclp.c
@@ -37,11 +37,16 @@ static void mem_init(phys_addr_t mem_end)
 
 	phys_alloc_init(freemem_start, mem_end - freemem_start);
 	phys_alloc_get_unused(&base, &top);
-	base = (base + PAGE_SIZE - 1) & -PAGE_SIZE;
-	top = top & -PAGE_SIZE;
+	base = PAGE_ALIGN(base) >> PAGE_SHIFT;
+	top = top >> PAGE_SHIFT;
 
 	/* Make the pages available to the physical allocator */
-	free_pages((void *)(unsigned long)base, top - base);
+	if (top > (2ull * SZ_1G >> PAGE_SHIFT)) {
+		page_alloc_init_area(0, 2ull * SZ_1G >> PAGE_SHIFT, top);
+		page_alloc_init_area(1, base, 2ull * SZ_1G >> PAGE_SHIFT);
+	} else {
+		page_alloc_init_area(1, base, top);
+	}
 	page_alloc_ops_enable();
 }
 
diff --git a/lib/s390x/smp.c b/lib/s390x/smp.c
index 2860e9c..d954094 100644
--- a/lib/s390x/smp.c
+++ b/lib/s390x/smp.c
@@ -190,7 +190,7 @@ int smp_cpu_setup(uint16_t addr, struct psw psw)
 
 	sigp_retry(cpu->addr, SIGP_INITIAL_CPU_RESET, 0, NULL);
 
-	lc = alloc_pages(1);
+	lc = alloc_pages_area(_AREA(1), 1);
 	cpu->lowcore = lc;
 	memset(lc, 0, PAGE_SIZE * 2);
 	sigp_retry(cpu->addr, SIGP_SET_PREFIX, (unsigned long )lc, NULL);
diff --git a/lib/vmalloc.c b/lib/vmalloc.c
index aca0876..f72c5b3 100644
--- a/lib/vmalloc.c
+++ b/lib/vmalloc.c
@@ -217,18 +217,19 @@ void setup_vm()
 	 * so that it can be used to allocate page tables.
 	 */
 	if (!page_alloc_initialized()) {
-		base = PAGE_ALIGN(base);
-		top = top & -PAGE_SIZE;
-		free_pages(phys_to_virt(base), top - base);
+		base = PAGE_ALIGN(base) >> PAGE_SHIFT;
+		top = top >> PAGE_SHIFT;
+		page_alloc_init_area(1, base, top);
+		page_alloc_ops_enable();
 	}
 
 	find_highmem();
 	phys_alloc_get_unused(&base, &top);
 	page_root = setup_mmu(top);
 	if (base != top) {
-		base = PAGE_ALIGN(base);
-		top = top & -PAGE_SIZE;
-		free_pages(phys_to_virt(base), top - base);
+		base = PAGE_ALIGN(base) >> PAGE_SHIFT;
+		top = top >> PAGE_SHIFT;
+		page_alloc_init_area(0, base, top);
 	}
 
 	spin_lock(&lock);
-- 
2.26.2



* [kvm-unit-tests RFC v1 3/5] lib/alloc: simplify free and malloc
  2020-08-14 15:10 [kvm-unit-tests RFC v1 0/5] Rewrite the allocators Claudio Imbrenda
  2020-08-14 15:10 ` [kvm-unit-tests RFC v1 1/5] lib/vmalloc: vmalloc support for handling allocation metadata Claudio Imbrenda
  2020-08-14 15:10 ` [kvm-unit-tests RFC v1 2/5] lib/alloc_page: complete rewrite of the page allocator Claudio Imbrenda
@ 2020-08-14 15:10 ` Claudio Imbrenda
  2020-08-14 15:10 ` [kvm-unit-tests RFC v1 4/5] lib/alloc.h: remove align_min from struct alloc_ops Claudio Imbrenda
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 13+ messages in thread
From: Claudio Imbrenda @ 2020-08-14 15:10 UTC (permalink / raw)
  To: kvm, pbonzini; +Cc: frankja, david, thuth, cohuck, lvivier

Remove the size parameter from the various free functions.

Since the backends can handle the allocation sizes on their own,
simplify the generic malloc wrappers.
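
The effect on callers can be seen in the lib/s390x/smp.c hunk below, e.g.:

  /* before: the caller had to pass the allocation size */
  free_pages(cpu->lowcore, 2 * PAGE_SIZE);
  /* after: the allocator tracks the size itself */
  free_pages(cpu->lowcore);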

Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
---
 lib/alloc.h      |  2 +-
 lib/alloc_page.h |  6 +++---
 lib/alloc.c      | 42 +++++-------------------------------------
 lib/alloc_page.c |  2 +-
 lib/s390x/smp.c  |  4 ++--
 lib/vmalloc.c    |  2 +-
 s390x/smp.c      |  4 ++--
 7 files changed, 15 insertions(+), 47 deletions(-)

diff --git a/lib/alloc.h b/lib/alloc.h
index c44d459..9b4b634 100644
--- a/lib/alloc.h
+++ b/lib/alloc.h
@@ -24,7 +24,7 @@
 
 struct alloc_ops {
 	void *(*memalign)(size_t alignment, size_t size);
-	void (*free)(void *ptr, size_t size);
+	void (*free)(void *ptr);
 	size_t align_min;
 };
 
diff --git a/lib/alloc_page.h b/lib/alloc_page.h
index 6472abd..26caefe 100644
--- a/lib/alloc_page.h
+++ b/lib/alloc_page.h
@@ -62,18 +62,18 @@ void *alloc_pages(unsigned int order);
  * alloc_pages* functions.
  * The pointer must point to the start of the block.
  */
-void free_pages(void *mem, size_t size);
+void free_pages(void *mem);
 
 /* For backwards compatibility */
 static inline void free_page(void *mem)
 {
-	return free_pages(mem, 1);
+	return free_pages(mem);
 }
 
 /* For backwards compatibility */
 static inline void free_pages_by_order(void *mem, unsigned int order)
 {
-	free_pages(mem, 1ull << order);
+	free_pages(mem);
 }
 
 #endif
diff --git a/lib/alloc.c b/lib/alloc.c
index 9d89d24..a46f464 100644
--- a/lib/alloc.c
+++ b/lib/alloc.c
@@ -50,56 +50,24 @@ void *calloc(size_t nmemb, size_t size)
 	return ptr;
 }
 
-#define METADATA_EXTRA	(2 * sizeof(uintptr_t))
-#define OFS_SLACK	(-2 * sizeof(uintptr_t))
-#define OFS_SIZE	(-sizeof(uintptr_t))
-
-static inline void *block_begin(void *mem)
-{
-	uintptr_t slack = *(uintptr_t *)(mem + OFS_SLACK);
-	return mem - slack;
-}
-
-static inline uintptr_t block_size(void *mem)
-{
-	return *(uintptr_t *)(mem + OFS_SIZE);
-}
-
 void free(void *ptr)
 {
-	if (!alloc_ops->free)
-		return;
-
-	void *base = block_begin(ptr);
-	uintptr_t sz = block_size(ptr);
-
-	alloc_ops->free(base, sz);
+	if (alloc_ops->free)
+		alloc_ops->free(ptr);
 }
 
 void *memalign(size_t alignment, size_t size)
 {
 	void *p;
-	uintptr_t blkalign;
-	uintptr_t mem;
 
 	if (!size)
 		return NULL;
 
-	assert(alignment >= sizeof(void *) && is_power_of_2(alignment));
+	assert(is_power_of_2(alignment));
 	assert(alloc_ops && alloc_ops->memalign);
 
-	size += alignment - 1;
-	blkalign = MAX(alignment, alloc_ops->align_min);
-	size = ALIGN(size + METADATA_EXTRA, alloc_ops->align_min);
-	p = alloc_ops->memalign(blkalign, size);
+	p = alloc_ops->memalign(alignment, size);
 	assert(p);
 
-	/* Leave room for metadata before aligning the result.  */
-	mem = (uintptr_t)p + METADATA_EXTRA;
-	mem = ALIGN(mem, alignment);
-
-	/* Write the metadata */
-	*(uintptr_t *)(mem + OFS_SLACK) = mem - (uintptr_t)p;
-	*(uintptr_t *)(mem + OFS_SIZE) = size;
-	return (void *)mem;
+	return (void *)p;
 }
diff --git a/lib/alloc_page.c b/lib/alloc_page.c
index 7c91f91..0e720ad 100644
--- a/lib/alloc_page.c
+++ b/lib/alloc_page.c
@@ -260,7 +260,7 @@ static void _free_pages(void *mem)
 	} while (coalesce(a, order, pfn, pfn2));
 }
 
-void free_pages(void *mem, size_t size)
+void free_pages(void *mem)
 {
 	spin_lock(&lock);
 	_free_pages(mem);
diff --git a/lib/s390x/smp.c b/lib/s390x/smp.c
index d954094..1bee8a9 100644
--- a/lib/s390x/smp.c
+++ b/lib/s390x/smp.c
@@ -163,8 +163,8 @@ int smp_cpu_destroy(uint16_t addr)
 	rc = smp_cpu_stop_nolock(addr, false);
 	if (!rc) {
 		cpu = smp_cpu_from_addr(addr);
-		free_pages(cpu->lowcore, 2 * PAGE_SIZE);
-		free_pages(cpu->stack, 4 * PAGE_SIZE);
+		free_pages(cpu->lowcore);
+		free_pages(cpu->stack);
 		cpu->lowcore = (void *)-1UL;
 		cpu->stack = (void *)-1UL;
 	}
diff --git a/lib/vmalloc.c b/lib/vmalloc.c
index f72c5b3..55b7a74 100644
--- a/lib/vmalloc.c
+++ b/lib/vmalloc.c
@@ -159,7 +159,7 @@ static void *vm_memalign(size_t alignment, size_t size)
 	return mem;
 }
 
-static void vm_free(void *mem, size_t size)
+static void vm_free(void *mem)
 {
 	struct metadata *m;
 	uintptr_t ptr, end;
diff --git a/s390x/smp.c b/s390x/smp.c
index ad30e3c..4ca1dce 100644
--- a/s390x/smp.c
+++ b/s390x/smp.c
@@ -143,7 +143,7 @@ static void test_store_status(void)
 	sigp(1, SIGP_STORE_STATUS_AT_ADDRESS, (uintptr_t)status, NULL);
 	while (!status->prefix) { mb(); }
 	report(1, "status written");
-	free_pages(status, PAGE_SIZE * 2);
+	free_pages(status);
 	report_prefix_pop();
 	smp_cpu_stop(1);
 
@@ -276,7 +276,7 @@ static void test_reset_initial(void)
 	report_prefix_pop();
 
 	report(smp_cpu_stopped(1), "cpu stopped");
-	free_pages(status, PAGE_SIZE);
+	free_pages(status);
 	report_prefix_pop();
 }
 
-- 
2.26.2



* [kvm-unit-tests RFC v1 4/5] lib/alloc.h: remove align_min from struct alloc_ops
  2020-08-14 15:10 [kvm-unit-tests RFC v1 0/5] Rewrite the allocators Claudio Imbrenda
                   ` (2 preceding siblings ...)
  2020-08-14 15:10 ` [kvm-unit-tests RFC v1 3/5] lib/alloc: simplify free and malloc Claudio Imbrenda
@ 2020-08-14 15:10 ` Claudio Imbrenda
  2020-08-14 15:10 ` [kvm-unit-tests RFC v1 5/5] lib/alloc_page: allow reserving arbitrary memory ranges Claudio Imbrenda
  2020-08-17 16:44 ` [kvm-unit-tests RFC v1 0/5] Rewrite the allocators Paolo Bonzini
  5 siblings, 0 replies; 13+ messages in thread
From: Claudio Imbrenda @ 2020-08-14 15:10 UTC (permalink / raw)
  To: kvm, pbonzini; +Cc: frankja, david, thuth, cohuck, lvivier

Remove align_min from struct alloc_ops.

Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
---
 lib/alloc.h      | 1 -
 lib/alloc_page.c | 1 -
 lib/alloc_phys.c | 9 +++++----
 lib/vmalloc.c    | 1 -
 4 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/lib/alloc.h b/lib/alloc.h
index 9b4b634..db90b01 100644
--- a/lib/alloc.h
+++ b/lib/alloc.h
@@ -25,7 +25,6 @@
 struct alloc_ops {
 	void *(*memalign)(size_t alignment, size_t size);
 	void (*free)(void *ptr);
-	size_t align_min;
 };
 
 extern struct alloc_ops *alloc_ops;
diff --git a/lib/alloc_page.c b/lib/alloc_page.c
index 0e720ad..d3ade58 100644
--- a/lib/alloc_page.c
+++ b/lib/alloc_page.c
@@ -325,7 +325,6 @@ void *alloc_page()
 static struct alloc_ops page_alloc_ops = {
 	.memalign = memalign_pages,
 	.free = free_pages,
-	.align_min = PAGE_SIZE,
 };
 
 /*
diff --git a/lib/alloc_phys.c b/lib/alloc_phys.c
index 72e20f7..a4d2bf2 100644
--- a/lib/alloc_phys.c
+++ b/lib/alloc_phys.c
@@ -29,8 +29,8 @@ static phys_addr_t base, top;
 static void *early_memalign(size_t alignment, size_t size);
 static struct alloc_ops early_alloc_ops = {
 	.memalign = early_memalign,
-	.align_min = DEFAULT_MINIMUM_ALIGNMENT
 };
+static size_t align_min;
 
 struct alloc_ops *alloc_ops = &early_alloc_ops;
 
@@ -39,8 +39,7 @@ void phys_alloc_show(void)
 	int i;
 
 	spin_lock(&lock);
-	printf("phys_alloc minimum alignment: %#" PRIx64 "\n",
-		(u64)early_alloc_ops.align_min);
+	printf("phys_alloc minimum alignment: %#" PRIx64 "\n", (u64)align_min);
 	for (i = 0; i < nr_regions; ++i)
 		printf("%016" PRIx64 "-%016" PRIx64 " [%s]\n",
 			(u64)regions[i].base,
@@ -64,7 +63,7 @@ void phys_alloc_set_minimum_alignment(phys_addr_t align)
 {
 	assert(align && !(align & (align - 1)));
 	spin_lock(&lock);
-	early_alloc_ops.align_min = align;
+	align_min = align;
 	spin_unlock(&lock);
 }
 
@@ -83,6 +82,8 @@ static phys_addr_t phys_alloc_aligned_safe(phys_addr_t size,
 		top_safe = MIN(top_safe, 1ULL << 32);
 
 	assert(base < top_safe);
+	if (align < align_min)
+		align = align_min;
 
 	addr = ALIGN(base, align);
 	size += addr - base;
diff --git a/lib/vmalloc.c b/lib/vmalloc.c
index 55b7a74..bcb9bf5 100644
--- a/lib/vmalloc.c
+++ b/lib/vmalloc.c
@@ -188,7 +188,6 @@ static void vm_free(void *mem)
 static struct alloc_ops vmalloc_ops = {
 	.memalign = vm_memalign,
 	.free = vm_free,
-	.align_min = PAGE_SIZE,
 };
 
 void __attribute__((__weak__)) find_highmem(void)
-- 
2.26.2



* [kvm-unit-tests RFC v1 5/5] lib/alloc_page: allow reserving arbitrary memory ranges
  2020-08-14 15:10 [kvm-unit-tests RFC v1 0/5] Rewrite the allocators Claudio Imbrenda
                   ` (3 preceding siblings ...)
  2020-08-14 15:10 ` [kvm-unit-tests RFC v1 4/5] lib/alloc.h: remove align_min from struct alloc_ops Claudio Imbrenda
@ 2020-08-14 15:10 ` Claudio Imbrenda
  2020-08-17 16:44 ` [kvm-unit-tests RFC v1 0/5] Rewrite the allocators Paolo Bonzini
  5 siblings, 0 replies; 13+ messages in thread
From: Claudio Imbrenda @ 2020-08-14 15:10 UTC (permalink / raw)
  To: kvm, pbonzini; +Cc: frankja, david, thuth, cohuck, lvivier

Two new functions are introduced that allow specific memory ranges to
be reserved and freed.

This is useful when a testcase needs memory at very specific addresses,
with the guarantee that the page allocator will not touch those pages.
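
A sketch of the intended usage (TEST_ADDR is a placeholder for a
page-aligned address inside an initialized memory area):

  /* try to reserve 4 specific pages; returns NULL if any of them is busy */
  void *p = alloc_pages_special(TEST_ADDR, 4);
  if (p) {
          /* the page allocator will not hand out these pages */
          free_pages_special(TEST_ADDR, 4);
  }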

Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
---
 lib/alloc_page.h | 15 ++++++++++
 lib/alloc_page.c | 78 ++++++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 88 insertions(+), 5 deletions(-)

diff --git a/lib/alloc_page.h b/lib/alloc_page.h
index 26caefe..7010b20 100644
--- a/lib/alloc_page.h
+++ b/lib/alloc_page.h
@@ -76,4 +76,19 @@ static inline void free_pages_by_order(void *mem, unsigned int order)
 	free_pages(mem);
 }
 
+/*
+ * Allocates and reserves the specified memory range if possible.
+ * Returns NULL in case of failure.
+ */
+void *alloc_pages_special(uintptr_t addr, size_t npages);
+
+/*
+ * Frees a reserved memory range that had been reserved with
+ * alloc_pages_special.
+ * The memory range does not need to match a previous allocation
+ * exactly, it can also be a subset, in which case only the specified
+ * pages will be freed and unreserved.
+ */
+void free_pages_special(uintptr_t addr, size_t npages);
+
 #endif
diff --git a/lib/alloc_page.c b/lib/alloc_page.c
index d3ade58..1ca905b 100644
--- a/lib/alloc_page.c
+++ b/lib/alloc_page.c
@@ -21,6 +21,7 @@
 
 #define ORDER_MASK	0x3f
 #define ALLOC_MASK	0x40
+#define SPECIAL_MASK	0x80
 
 struct free_list {
 	struct free_list *prev;
@@ -32,7 +33,7 @@ struct mem_area {
 	uintptr_t base;
 	/* Physical frame number of the first frame outside the area */
 	uintptr_t top;
-	/* Combination ALLOC_MASK and order */
+	/* Combination of SPECIAL_MASK, ALLOC_MASK, and order */
 	u8 *page_states;
 	/* One freelist for each possible block size, up to NLISTS */
 	struct free_list freelists[NLISTS];
@@ -166,6 +167,16 @@ static void *page_memalign_order(struct mem_area *a, u8 al, u8 sz)
 	return res;
 }
 
+static struct mem_area *get_area(uintptr_t pfn)
+{
+	uintptr_t i;
+
+	for (i = 0; i < MAX_AREAS; i++)
+		if ((areas_mask & BIT(i)) && area_contains(areas + i, pfn))
+			return areas + i;
+	return NULL;
+}
+
 /*
  * Try to merge two blocks into a bigger one.
  * Returns true in case of a successful merge.
@@ -233,10 +244,7 @@ static void _free_pages(void *mem)
 	if (!mem)
 		return;
 	assert(IS_ALIGNED((uintptr_t)mem, PAGE_SIZE));
-	for (i = 0; !a && (i < MAX_AREAS); i++) {
-		if ((areas_mask & BIT(i)) && area_contains(areas + i, pfn))
-			a = areas + i;
-	}
+	a = get_area(pfn);
 	assert_msg(a, "memory does not belong to any area: %p", mem);
 
 	p = pfn - a->base;
@@ -267,6 +275,66 @@ void free_pages(void *mem)
 	spin_unlock(&lock);
 }
 
+static void *_alloc_page_special(uintptr_t addr)
+{
+	struct mem_area *a;
+	uintptr_t mask, i;
+
+	a = get_area(PFN(addr));
+	assert(a);
+	i = PFN(addr) - a->base;
+	if (a->page_states[i] & (ALLOC_MASK | SPECIAL_MASK))
+		return NULL;
+	while (a->page_states[i]) {
+		mask = GENMASK_ULL(63, PAGE_SHIFT + a->page_states[i]);
+		split(a, (void *)(addr & mask));
+	}
+	a->page_states[i] = SPECIAL_MASK;
+	return (void *)addr;
+}
+
+static void _free_page_special(uintptr_t addr)
+{
+	struct mem_area *a;
+	uintptr_t i;
+
+	a = get_area(PFN(addr));
+	assert(a);
+	i = PFN(addr) - a->base;
+	assert(a->page_states[i] == SPECIAL_MASK);
+	a->page_states[i] = ALLOC_MASK;
+	_free_pages((void *)addr);
+}
+
+void *alloc_pages_special(uintptr_t addr, size_t n)
+{
+	uintptr_t i;
+
+	assert(IS_ALIGNED(addr, PAGE_SIZE));
+	spin_lock(&lock);
+	for (i = 0; i < n; i++)
+		if (!_alloc_page_special(addr + i * PAGE_SIZE))
+			break;
+	if (i < n) {
+		for (n = 0 ; n < i; n++)
+			_free_page_special(addr + n * PAGE_SIZE);
+		addr = 0;
+	}
+	spin_unlock(&lock);
+	return (void *)addr;
+}
+
+void free_pages_special(uintptr_t addr, size_t n)
+{
+	uintptr_t i;
+
+	assert(IS_ALIGNED(addr, PAGE_SIZE));
+	spin_lock(&lock);
+	for (i = 0; i < n; i++)
+		_free_page_special(addr + i * PAGE_SIZE);
+	spin_unlock(&lock);
+}
+
 static void *page_memalign_order_area(unsigned area, u8 ord, u8 al)
 {
 	void *res = NULL;
-- 
2.26.2



* Re: [kvm-unit-tests RFC v1 0/5] Rewrite the allocators
  2020-08-14 15:10 [kvm-unit-tests RFC v1 0/5] Rewrite the allocators Claudio Imbrenda
                   ` (4 preceding siblings ...)
  2020-08-14 15:10 ` [kvm-unit-tests RFC v1 5/5] lib/alloc_page: allow reserving arbitrary memory ranges Claudio Imbrenda
@ 2020-08-17 16:44 ` Paolo Bonzini
  5 siblings, 0 replies; 13+ messages in thread
From: Paolo Bonzini @ 2020-08-17 16:44 UTC (permalink / raw)
  To: Claudio Imbrenda, kvm; +Cc: frankja, david, thuth, cohuck, lvivier

On 14/08/20 17:10, Claudio Imbrenda wrote:
> 
> The existing memory allocators are becoming more and more inadequate to
> the needs of the upcoming unit tests (but also some existing ones, see
> below).
> 
> Some important features that are lacking:
> * ability to perform a small physical page allocation with a big
>   alignment withtout wasting huge amounts of memory
> * ability to allocate physical pages from specific pools/areaas (e.g.
>   below 16M, or 4G, etc)
> * ability to reserve arbitrary pages (if free), removing them from the
>   free pool
> 
> Some other features that are nice, but not so fundamental:
> * no need for the generic allocator to keep track of metadata
>   (i.e. allocation size), this is now handled by the lower level
>   allocators
> * coalescing small blocks into bigger ones, to allow contiguous memory
>   freed in small blocks in a random order to be used for large
>   allocations again

I haven't reviewed the patches in detail, but the deficiencies of the
page allocator are of course known and it's welcome that you're
attempting to fix them!

Thanks,

Paolo



* Re: [kvm-unit-tests RFC v1 2/5] lib/alloc_page: complete rewrite of the page allocator
  2020-08-14 15:10 ` [kvm-unit-tests RFC v1 2/5] lib/alloc_page: complete rewrite of the page allocator Claudio Imbrenda
@ 2020-08-18 16:06   ` Cornelia Huck
  2020-08-19 15:44   ` Janosch Frank
  1 sibling, 0 replies; 13+ messages in thread
From: Cornelia Huck @ 2020-08-18 16:06 UTC (permalink / raw)
  To: Claudio Imbrenda; +Cc: kvm, pbonzini, frankja, david, thuth, lvivier

On Fri, 14 Aug 2020 17:10:06 +0200
Claudio Imbrenda <imbrenda@linux.ibm.com> wrote:

> This is a complete rewrite of the page allocator.
> 
> This will bring a few improvements:
> * no need to specify the size when freeing
> * allocate small areas with a large alignment without wasting memory
> * ability to initialize and use multiple memory areas (e.g. DMA)
> * more sanity checks
> 
> A few things have changed:
> * initialization cannot be done with free_pages like before,
>   page_alloc_init_area has to be used instead
> 
> Arch-specific changes:
> * arm and x86 have been adapted to put all the memory in just one big
>   area (or two, for x86_64 with highmem).
> * s390x instead creates one area below 2GiB and one above; the area
>   below 2GiB is used for SMP lowcore initialization.

I was just wondering why we did not run into that problem before for
the css control blocks, but these are statically allocated.

> 
> Details:
> Each memory area has metadata at the very beginning. The metadata is a
> byte array with one entry per usable page (so, excluding the metadata
> itself). Each entry indicates if the page is special (unused for now),
> if it is allocated, and the order of the block. Both free and allocated
> pages are part of larger blocks.
> 
> Some more fixed size metadata is present in a fixed-size static array.
> This metadata contains start and end page frame numbers, the pointer to
> the metadata array, and the array of freelists. The array of freelists
> has an entry for each possible order (indicated by the macro NLISTS,
> defined as BITS_PER_LONG - PAGE_SHIFT).
> 
> On allocation, if the free list for the needed size is empty, larger
> blocks are split. When a small allocation with a large alignment is
> requested, an appropriately large block is split, to guarantee the
> alignment.
> 
> When a block is freed, an attempt will be made to merge it into the
> neighbour, iterating the process as long as possible.
> 
> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
> ---
>  lib/alloc_page.h |  64 ++++++-
>  lib/alloc_page.c | 451 ++++++++++++++++++++++++++++++++++++-----------
>  lib/arm/setup.c  |   2 +-
>  lib/s390x/sclp.c |  11 +-
>  lib/s390x/smp.c  |   2 +-
>  lib/vmalloc.c    |  13 +-
>  6 files changed, 427 insertions(+), 116 deletions(-)

(...)

> diff --git a/lib/s390x/smp.c b/lib/s390x/smp.c
> index 2860e9c..d954094 100644
> --- a/lib/s390x/smp.c
> +++ b/lib/s390x/smp.c
> @@ -190,7 +190,7 @@ int smp_cpu_setup(uint16_t addr, struct psw psw)
>  
>  	sigp_retry(cpu->addr, SIGP_INITIAL_CPU_RESET, 0, NULL);
>  
> -	lc = alloc_pages(1);
> +	lc = alloc_pages_area(_AREA(1), 1);

Add a comment describing what this _AREA(1) is? Or maybe add a wrapper
for allocating below 2G? We might want to use it in other places as
well.

>  	cpu->lowcore = lc;
>  	memset(lc, 0, PAGE_SIZE * 2);
>  	sigp_retry(cpu->addr, SIGP_SET_PREFIX, (unsigned long )lc, NULL);

(...)

I'll pass on reviewing the rest of this patch :)



* Re: [kvm-unit-tests RFC v1 1/5] lib/vmalloc: vmalloc support for handling allocation metadata
  2020-08-14 15:10 ` [kvm-unit-tests RFC v1 1/5] lib/vmalloc: vmalloc support for handling allocation metadata Claudio Imbrenda
@ 2020-08-19 14:36   ` Janosch Frank
  2020-08-19 15:31     ` Claudio Imbrenda
  0 siblings, 1 reply; 13+ messages in thread
From: Janosch Frank @ 2020-08-19 14:36 UTC (permalink / raw)
  To: Claudio Imbrenda, kvm, pbonzini; +Cc: david, thuth, cohuck, lvivier



On 8/14/20 5:10 PM, Claudio Imbrenda wrote:

LGTM, some smaller nits/questions below:

Commit message?

> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
> ---
>  lib/vmalloc.c | 105 +++++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 95 insertions(+), 10 deletions(-)
> 
> diff --git a/lib/vmalloc.c b/lib/vmalloc.c
> index e0c7b6b..aca0876 100644
> --- a/lib/vmalloc.c
> +++ b/lib/vmalloc.c
> @@ -15,6 +15,13 @@
>  #include <bitops.h>
>  #include "vmalloc.h"
>  
> +#define VM_MAGIC 0x7E57C0DE
> +
> +struct metadata {
> +	unsigned long npages;
> +	unsigned long magic;
> +};
> +
>  static struct spinlock lock;
>  static void *vfree_top = 0;
>  static void *page_root;
> @@ -25,8 +32,14 @@ static void *page_root;
>   *
>   * nr is the number of pages to allocate
>   * alignment_pages is the alignment of the allocation *in pages*
> + * metadata indicates whether an extra (unaligned) page needs to be allocated
> + * right before the main (aligned) allocation.
> + *
> + * The return value points to the first allocated virtual page, which will
> + * be the (potentially unaligned) metadata page if the metadata flag is
> + * specified.
>   */
> -void *alloc_vpages_aligned(ulong nr, unsigned int align_order)
> +static void *do_alloc_vpages(ulong nr, unsigned int align_order, bool metadata)
>  {
>  	uintptr_t ptr;
>  
> @@ -34,6 +47,8 @@ void *alloc_vpages_aligned(ulong nr, unsigned int align_order)
>  	ptr = (uintptr_t)vfree_top;
>  	ptr -= PAGE_SIZE * nr;
>  	ptr &= GENMASK_ULL(63, PAGE_SHIFT + align_order);
> +	if (metadata)
> +		ptr -= PAGE_SIZE;
>  	vfree_top = (void *)ptr;
>  	spin_unlock(&lock);
>  
> @@ -41,6 +56,11 @@ void *alloc_vpages_aligned(ulong nr, unsigned int align_order)
>  	return (void *)ptr;
>  }
>  
> +void *alloc_vpages_aligned(ulong nr, unsigned int align_order)
> +{
> +	return do_alloc_vpages(nr, align_order, false);
> +}
> +
>  void *alloc_vpages(ulong nr)
>  {
>  	return alloc_vpages_aligned(nr, 0);
> @@ -69,35 +89,100 @@ void *vmap(phys_addr_t phys, size_t size)
>  	return mem;
>  }
>  
> +/*
> + * Allocate one page, for an object with specified alignment.
> + * The resulting pointer will be aligned to the required alignment, but
> + * intentionally not page-aligned.
> + */
> +static void *vm_alloc_one_page(size_t alignment)
> +{
> +	void *p;
> +
> +	assert(alignment >= sizeof(uintptr_t));
> +	assert(alignment < PAGE_SIZE);
> +	p = alloc_vpage();
> +	install_page(page_root, virt_to_phys(alloc_page()), p);
> +	/* write the magic at the beginning of the page */
> +	*(uintptr_t *)p = VM_MAGIC;
> +	return (void*)((uintptr_t)p + alignment);

s/(void*)/(void *)/

> +}
> +
> +static struct metadata *get_metadata(void *p)
> +{
> +	struct metadata *m = p;
> +
> +	return m - 1;
> +}

So the metadata is not at the start of the metadata page, but at the
end? We have it at the beginning for the one page case and at the end
for the multi page case with metadata on an extra page.

> +
>  /*
>   * Allocate virtual memory, with the specified minimum alignment.
> + * If the allocation fits in one page, only one page is allocated. Otherwise
> + * enough pages are allocated for the object, plus one to keep metadata
> + * information about the allocation.
>   */
>  static void *vm_memalign(size_t alignment, size_t size)
>  {
> +	struct metadata *m;
>  	phys_addr_t pa;
> -	void *mem, *p;
> +	uintptr_t p;
> +	void *mem;
> +	size_t i;
>  
> +	if (!size)
> +		return NULL;
>  	assert(is_power_of_2(alignment));
>  
> +	if (alignment < sizeof(uintptr_t))
> +		alignment = sizeof(uintptr_t);
> +	/* it fits in one page, allocate only one page */
> +	if (alignment + size <= PAGE_SIZE)
> +		return vm_alloc_one_page(alignment);

Don't we also need to take the metadata into account in any size
calculation for one page?

>  	size = PAGE_ALIGN(size) / PAGE_SIZE;
>  	alignment = get_order(PAGE_ALIGN(alignment) / PAGE_SIZE);
> -	mem = p = alloc_vpages_aligned(size, alignment);
> -	while (size--) {
> +	mem = do_alloc_vpages(size, alignment, true);
> +	p = (uintptr_t)mem;
> +	/* skip the metadata page */
> +	mem = (void *)(p + PAGE_SIZE);
> +	/*
> +	 * time to actually allocate the physical pages to back our virtual
> +	 * allocation; note that we need to allocate one extra page (for the
> +	 * metadata), hence the <=
> +	 */
> +	for (i = 0; i <= size; i++, p += PAGE_SIZE) {
>  		pa = virt_to_phys(alloc_page());
>  		assert(pa);
> -		install_page(page_root, pa, p);
> -		p += PAGE_SIZE;
> +		install_page(page_root, pa, (void *)p);
>  	}
> +	m = get_metadata(mem);
> +	m->npages = size;
> +	m->magic = VM_MAGIC;
>  	return mem;
>  }
>  
>  static void vm_free(void *mem, size_t size)
>  {
> -	while (size) {
> -		free_page(phys_to_virt(virt_to_pte_phys(page_root, mem)));
> -		mem += PAGE_SIZE;
> -		size -= PAGE_SIZE;
> +	struct metadata *m;
> +	uintptr_t ptr, end;
> +
> +	/* the pointer is not page-aligned, it was a single-page allocation */
> +	if (!IS_ALIGNED((uintptr_t)mem, PAGE_SIZE)) {
> +		ptr = virt_to_pte_phys(page_root, mem) & PAGE_MASK;
> +		assert(*(uintptr_t *)ptr == VM_MAGIC);
> +		free_page(phys_to_virt(ptr));
> +		return;
>  	}
> +
> +	/* the pointer is page-aligned, it was a multi-page allocation */
> +	m = get_metadata(mem);
> +	assert(m->magic == VM_MAGIC);
> +	assert(m->npages > 0);
> +	/* free all the pages including the metadata page */
> +	ptr = (uintptr_t)mem - PAGE_SIZE;
> +	end = ptr + m->npages * PAGE_SIZE;
> +	for ( ; ptr < end; ptr += PAGE_SIZE)
> +		free_page(phys_to_virt(virt_to_pte_phys(page_root, (void *)ptr)));
> +	/* free the last one separately to avoid overflow issues */
> +	free_page(phys_to_virt(virt_to_pte_phys(page_root, (void *)ptr)));
>  }
>  
>  static struct alloc_ops vmalloc_ops = {
> 





* Re: [kvm-unit-tests RFC v1 1/5] lib/vmalloc: vmalloc support for handling allocation metadata
  2020-08-19 14:36   ` Janosch Frank
@ 2020-08-19 15:31     ` Claudio Imbrenda
  2020-08-19 15:36       ` Janosch Frank
  0 siblings, 1 reply; 13+ messages in thread
From: Claudio Imbrenda @ 2020-08-19 15:31 UTC (permalink / raw)
  To: Janosch Frank; +Cc: kvm, pbonzini, david, thuth, cohuck, lvivier

On Wed, 19 Aug 2020 16:36:07 +0200
Janosch Frank <frankja@linux.ibm.com> wrote:

> On 8/14/20 5:10 PM, Claudio Imbrenda wrote:
> 
> LGTM, some smaller nits/questions below:
> 
> Commit message?

oops! I'll fix it

> 
> > Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
> > ---
> >  lib/vmalloc.c | 105
> > +++++++++++++++++++++++++++++++++++++++++++++----- 1 file changed,
> > 95 insertions(+), 10 deletions(-)
> > 
> > diff --git a/lib/vmalloc.c b/lib/vmalloc.c
> > index e0c7b6b..aca0876 100644
> > --- a/lib/vmalloc.c
> > +++ b/lib/vmalloc.c
> > @@ -15,6 +15,13 @@
> >  #include <bitops.h>
> >  #include "vmalloc.h"
> >  
> > +#define VM_MAGIC 0x7E57C0DE
> > +
> > +struct metadata {
> > +	unsigned long npages;
> > +	unsigned long magic;
> > +};
> > +
> >  static struct spinlock lock;
> >  static void *vfree_top = 0;
> >  static void *page_root;
> > @@ -25,8 +32,14 @@ static void *page_root;
> >   *
> >   * nr is the number of pages to allocate
> >   * alignment_pages is the alignment of the allocation *in pages*
> > + * metadata indicates whether an extra (unaligned) page needs to
> > be allocated
> > + * right before the main (aligned) allocation.
> > + *
> > + * The return value points to the first allocated virtual page,
> > which will
> > + * be the (potentially unaligned) metadata page if the metadata
> > flag is
> > + * specified.
> >   */
> > -void *alloc_vpages_aligned(ulong nr, unsigned int align_order)
> > +static void *do_alloc_vpages(ulong nr, unsigned int align_order,
> > bool metadata) {
> >  	uintptr_t ptr;
> >  
> > @@ -34,6 +47,8 @@ void *alloc_vpages_aligned(ulong nr, unsigned int
> > align_order) ptr = (uintptr_t)vfree_top;
> >  	ptr -= PAGE_SIZE * nr;
> >  	ptr &= GENMASK_ULL(63, PAGE_SHIFT + align_order);
> > +	if (metadata)
> > +		ptr -= PAGE_SIZE;
> >  	vfree_top = (void *)ptr;
> >  	spin_unlock(&lock);
> >  
> > @@ -41,6 +56,11 @@ void *alloc_vpages_aligned(ulong nr, unsigned
> > int align_order) return (void *)ptr;
> >  }
> >  
> > +void *alloc_vpages_aligned(ulong nr, unsigned int align_order)
> > +{
> > +	return do_alloc_vpages(nr, align_order, false);
> > +}
> > +
> >  void *alloc_vpages(ulong nr)
> >  {
> >  	return alloc_vpages_aligned(nr, 0);
> > @@ -69,35 +89,100 @@ void *vmap(phys_addr_t phys, size_t size)
> >  	return mem;
> >  }
> >  
> > +/*
> > + * Allocate one page, for an object with specified alignment.
> > + * The resulting pointer will be aligned to the required
> > alignment, but
> > + * intentionally not page-aligned.
> > + */
> > +static void *vm_alloc_one_page(size_t alignment)
> > +{
> > +	void *p;
> > +
> > +	assert(alignment >= sizeof(uintptr_t));
> > +	assert(alignment < PAGE_SIZE);
> > +	p = alloc_vpage();
> > +	install_page(page_root, virt_to_phys(alloc_page()), p);
> > +	/* write the magic at the beginning of the page */
> > +	*(uintptr_t *)p = VM_MAGIC;
> > +	return (void*)((uintptr_t)p + alignment);  
> 
> s/(void*)/(void *)/

will be fixed

> > +}
> > +
> > +static struct metadata *get_metadata(void *p)
> > +{
> > +	struct metadata *m = p;
> > +
> > +	return m - 1;
> > +}  
> 
> So the metadata is not at the start of the metadata page, but at the
> end? We have it at the beginning for the one page case and at the end
> for the multi page case with metadata on an extra page.

correct. it doesn't make a huge difference in the end where the
metadata is, as long as it is somewhere. Probably putting it always
right before the start of the memory is better, in order to catch
accidental off-by-one writes (as they would corrupt the magic value)

please note that the metadata for a single page is just the magic value
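
To illustrate (just a sketch, not part of the patch), the two layouts
look roughly like this:

/*
 * single-page allocation (returned pointer is not page-aligned):
 *
 *   | magic | ............ object ............ |
 *   ^page   ^page + alignment  <- returned pointer
 *
 * multi-page allocation (returned pointer is page-aligned):
 *
 *   | metadata page | object page 0 | object page 1 | ...
 *                  ^ struct metadata sits at the end of the metadata
 *                    page, i.e. immediately before the returned pointer
 */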

> > +
> >  /*
> >   * Allocate virtual memory, with the specified minimum alignment.
> > + * If the allocation fits in one page, only one page is allocated.
> > Otherwise
> > + * enough pages are allocated for the object, plus one to keep
> > metadata
> > + * information about the allocation.
> >   */
> >  static void *vm_memalign(size_t alignment, size_t size)
> >  {
> > +	struct metadata *m;
> >  	phys_addr_t pa;
> > -	void *mem, *p;
> > +	uintptr_t p;
> > +	void *mem;
> > +	size_t i;
> >  
> > +	if (!size)
> > +		return NULL;
> >  	assert(is_power_of_2(alignment));
> >  
> > +	if (alignment < sizeof(uintptr_t))
> > +		alignment = sizeof(uintptr_t);

                            ^^^^^^^^^^^^^^^^^

> > +	/* it fits in one page, allocate only one page */
> > +	if (alignment + size <= PAGE_SIZE)
> > +		return vm_alloc_one_page(alignment);  
> 
> Don't we also need to take the metadata into account in any size
> calculation for one page?

kinda... we guarantee a minimum alignment, which is enough to fit the
magic value, which is the only metadata item for single pages (see
above)
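
For example (hypothetical numbers, assuming PAGE_SIZE == 4096, and
calling the backend directly only to show which path would be taken):

	vm_memalign(16, 4000);	/* 16 + 4000 <= 4096: single-page path,
				   object at page + 16, magic in the
				   first 16 bytes of the page */
	vm_memalign(16, 4090);	/* 16 + 4090 > 4096: multi-page path,
				   with a separate metadata page */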

> >  	size = PAGE_ALIGN(size) / PAGE_SIZE;
> >  	alignment = get_order(PAGE_ALIGN(alignment) / PAGE_SIZE);
> > -	mem = p = alloc_vpages_aligned(size, alignment);
> > -	while (size--) {
> > +	mem = do_alloc_vpages(size, alignment, true);
> > +	p = (uintptr_t)mem;
> > +	/* skip the metadata page */
> > +	mem = (void *)(p + PAGE_SIZE);
> > +	/*
> > +	 * time to actually allocate the physical pages to back
> > our virtual
> > +	 * allocation; note that we need to allocate one extra
> > page (for the
> > +	 * metadata), hence the <=
> > +	 */
> > +	for (i = 0; i <= size; i++, p += PAGE_SIZE) {
> >  		pa = virt_to_phys(alloc_page());
> >  		assert(pa);
> > -		install_page(page_root, pa, p);
> > -		p += PAGE_SIZE;
> > +		install_page(page_root, pa, (void *)p);
> >  	}
> > +	m = get_metadata(mem);
> > +	m->npages = size;
> > +	m->magic = VM_MAGIC;
> >  	return mem;
> >  }
> >  
> >  static void vm_free(void *mem, size_t size)
> >  {
> > -	while (size) {
> > -		free_page(phys_to_virt(virt_to_pte_phys(page_root,
> > mem)));
> > -		mem += PAGE_SIZE;
> > -		size -= PAGE_SIZE;
> > +	struct metadata *m;
> > +	uintptr_t ptr, end;
> > +
> > +	/* the pointer is not page-aligned, it was a single-page
> > allocation */
> > +	if (!IS_ALIGNED((uintptr_t)mem, PAGE_SIZE)) {
> > +		ptr = virt_to_pte_phys(page_root, mem) & PAGE_MASK;
> > +		assert(*(uintptr_t *)ptr == VM_MAGIC);
> > +		free_page(phys_to_virt(ptr));
> > +		return;
> >  	}
> > +
> > +	/* the pointer is page-aligned, it was a multi-page
> > allocation */
> > +	m = get_metadata(mem);
> > +	assert(m->magic == VM_MAGIC);
> > +	assert(m->npages > 0);
> > +	/* free all the pages including the metadata page */
> > +	ptr = (uintptr_t)mem - PAGE_SIZE;
> > +	end = ptr + m->npages * PAGE_SIZE;
> > +	for ( ; ptr < end; ptr += PAGE_SIZE)
> > +		free_page(phys_to_virt(virt_to_pte_phys(page_root,
> > (void *)ptr)));
> > +	/* free the last one separately to avoid overflow issues */
> > +	free_page(phys_to_virt(virt_to_pte_phys(page_root, (void
> > *)ptr))); }
> >  
> >  static struct alloc_ops vmalloc_ops = {
> >   
> 
> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [kvm-unit-tests RFC v1 1/5] lib/vmalloc: vmalloc support for handling allocation metadata
  2020-08-19 15:31     ` Claudio Imbrenda
@ 2020-08-19 15:36       ` Janosch Frank
  0 siblings, 0 replies; 13+ messages in thread
From: Janosch Frank @ 2020-08-19 15:36 UTC (permalink / raw)
  To: Claudio Imbrenda; +Cc: kvm, pbonzini, david, thuth, cohuck, lvivier


[-- Attachment #1.1: Type: text/plain, Size: 7239 bytes --]

On 8/19/20 5:31 PM, Claudio Imbrenda wrote:
> On Wed, 19 Aug 2020 16:36:07 +0200
> Janosch Frank <frankja@linux.ibm.com> wrote:
> 
>> On 8/14/20 5:10 PM, Claudio Imbrenda wrote:
>>
>> LGTM, some smaller nits/questions below:
>>
>> Commit message?
> 
> oops! I'll fix it
> 
>>
>>> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
>>> ---
>>>  lib/vmalloc.c | 105
>>> +++++++++++++++++++++++++++++++++++++++++++++----- 1 file changed,
>>> 95 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/lib/vmalloc.c b/lib/vmalloc.c
>>> index e0c7b6b..aca0876 100644
>>> --- a/lib/vmalloc.c
>>> +++ b/lib/vmalloc.c
>>> @@ -15,6 +15,13 @@
>>>  #include <bitops.h>
>>>  #include "vmalloc.h"
>>>  
>>> +#define VM_MAGIC 0x7E57C0DE
>>> +
>>> +struct metadata {
>>> +	unsigned long npages;
>>> +	unsigned long magic;
>>> +};
>>> +
>>>  static struct spinlock lock;
>>>  static void *vfree_top = 0;
>>>  static void *page_root;
>>> @@ -25,8 +32,14 @@ static void *page_root;
>>>   *
>>>   * nr is the number of pages to allocate
>>>   * alignment_pages is the alignment of the allocation *in pages*
>>> + * metadata indicates whether an extra (unaligned) page needs to
>>> be allocated
>>> + * right before the main (aligned) allocation.
>>> + *
>>> + * The return value points to the first allocated virtual page,
>>> which will
>>> + * be the (potentially unaligned) metadata page if the metadata
>>> flag is
>>> + * specified.
>>>   */
>>> -void *alloc_vpages_aligned(ulong nr, unsigned int align_order)
>>> +static void *do_alloc_vpages(ulong nr, unsigned int align_order,
>>> bool metadata) {
>>>  	uintptr_t ptr;
>>>  
>>> @@ -34,6 +47,8 @@ void *alloc_vpages_aligned(ulong nr, unsigned int
>>> align_order) ptr = (uintptr_t)vfree_top;
>>>  	ptr -= PAGE_SIZE * nr;
>>>  	ptr &= GENMASK_ULL(63, PAGE_SHIFT + align_order);
>>> +	if (metadata)
>>> +		ptr -= PAGE_SIZE;
>>>  	vfree_top = (void *)ptr;
>>>  	spin_unlock(&lock);
>>>  
>>> @@ -41,6 +56,11 @@ void *alloc_vpages_aligned(ulong nr, unsigned
>>> int align_order) return (void *)ptr;
>>>  }
>>>  
>>> +void *alloc_vpages_aligned(ulong nr, unsigned int align_order)
>>> +{
>>> +	return do_alloc_vpages(nr, align_order, false);
>>> +}
>>> +
>>>  void *alloc_vpages(ulong nr)
>>>  {
>>>  	return alloc_vpages_aligned(nr, 0);
>>> @@ -69,35 +89,100 @@ void *vmap(phys_addr_t phys, size_t size)
>>>  	return mem;
>>>  }
>>>  
>>> +/*
>>> + * Allocate one page, for an object with specified alignment.
>>> + * The resulting pointer will be aligned to the required
>>> alignment, but
>>> + * intentionally not page-aligned.
>>> + */
>>> +static void *vm_alloc_one_page(size_t alignment)
>>> +{
>>> +	void *p;
>>> +
>>> +	assert(alignment >= sizeof(uintptr_t));
>>> +	assert(alignment < PAGE_SIZE);
>>> +	p = alloc_vpage();
>>> +	install_page(page_root, virt_to_phys(alloc_page()), p);
>>> +	/* write the magic at the beginning of the page */
>>> +	*(uintptr_t *)p = VM_MAGIC;
>>> +	return (void*)((uintptr_t)p + alignment);  
>>
>> s/(void*)/(void *)/
> 
> will be fixed
> 
>>> +}
>>> +
>>> +static struct metadata *get_metadata(void *p)
>>> +{
>>> +	struct metadata *m = p;
>>> +
>>> +	return m - 1;
>>> +}  
>>
>> So the metadata is not at the start of the metadata page, but at the
>> end? We have it at the beginning for the one page case and at the end
>> for the multi page case with metadata on an extra page.
> 
> correct. it doesn't make a huge difference in the end where the
> metadata is, as long as it is somewhere. Probably putting it always
> right before the start of the memory is better, in order to catch
> accidental off-by-one writes (as they would corrupt the magic value)
> 
> please note that the metadata for a single page is just the magic value


I assumed you'd write the struct in both cases, should've looked closer.

> 
>>> +
>>>  /*
>>>   * Allocate virtual memory, with the specified minimum alignment.
>>> + * If the allocation fits in one page, only one page is allocated.
>>> Otherwise
>>> + * enough pages are allocated for the object, plus one to keep
>>> metadata
>>> + * information about the allocation.
>>>   */
>>>  static void *vm_memalign(size_t alignment, size_t size)
>>>  {
>>> +	struct metadata *m;
>>>  	phys_addr_t pa;
>>> -	void *mem, *p;
>>> +	uintptr_t p;
>>> +	void *mem;
>>> +	size_t i;
>>>  
>>> +	if (!size)
>>> +		return NULL;
>>>  	assert(is_power_of_2(alignment));
>>>  
>>> +	if (alignment < sizeof(uintptr_t))
>>> +		alignment = sizeof(uintptr_t);
> 
>                             ^^^^^^^^^^^^^^^^^
> 
>>> +	/* it fits in one page, allocate only one page */
>>> +	if (alignment + size <= PAGE_SIZE)
>>> +		return vm_alloc_one_page(alignment);  
>>
>> Don't we also need to take the metadata into account in any size
>> calculation for one page?
> 
> kinda... we guarantee a minimum alignment, which is enough to fit the
> magic value, which is the only metadata item for single pages (see
> above)

Ah, right

> 
>>>  	size = PAGE_ALIGN(size) / PAGE_SIZE;
>>>  	alignment = get_order(PAGE_ALIGN(alignment) / PAGE_SIZE);
>>> -	mem = p = alloc_vpages_aligned(size, alignment);
>>> -	while (size--) {
>>> +	mem = do_alloc_vpages(size, alignment, true);
>>> +	p = (uintptr_t)mem;
>>> +	/* skip the metadata page */
>>> +	mem = (void *)(p + PAGE_SIZE);
>>> +	/*
>>> +	 * time to actually allocate the physical pages to back
>>> our virtual
>>> +	 * allocation; note that we need to allocate one extra
>>> page (for the
>>> +	 * metadata), hence the <=
>>> +	 */
>>> +	for (i = 0; i <= size; i++, p += PAGE_SIZE) {
>>>  		pa = virt_to_phys(alloc_page());
>>>  		assert(pa);
>>> -		install_page(page_root, pa, p);
>>> -		p += PAGE_SIZE;
>>> +		install_page(page_root, pa, (void *)p);
>>>  	}
>>> +	m = get_metadata(mem);
>>> +	m->npages = size;
>>> +	m->magic = VM_MAGIC;
>>>  	return mem;
>>>  }
>>>  
>>>  static void vm_free(void *mem, size_t size)
>>>  {
>>> -	while (size) {
>>> -		free_page(phys_to_virt(virt_to_pte_phys(page_root,
>>> mem)));
>>> -		mem += PAGE_SIZE;
>>> -		size -= PAGE_SIZE;
>>> +	struct metadata *m;
>>> +	uintptr_t ptr, end;
>>> +
>>> +	/* the pointer is not page-aligned, it was a single-page
>>> allocation */
>>> +	if (!IS_ALIGNED((uintptr_t)mem, PAGE_SIZE)) {
>>> +		ptr = virt_to_pte_phys(page_root, mem) & PAGE_MASK;
>>> +		assert(*(uintptr_t *)ptr == VM_MAGIC);
>>> +		free_page(phys_to_virt(ptr));
>>> +		return;
>>>  	}
>>> +
>>> +	/* the pointer is page-aligned, it was a multi-page
>>> allocation */
>>> +	m = get_metadata(mem);
>>> +	assert(m->magic == VM_MAGIC);
>>> +	assert(m->npages > 0);
>>> +	/* free all the pages including the metadata page */
>>> +	ptr = (uintptr_t)mem - PAGE_SIZE;
>>> +	end = ptr + m->npages * PAGE_SIZE;
>>> +	for ( ; ptr < end; ptr += PAGE_SIZE)
>>> +		free_page(phys_to_virt(virt_to_pte_phys(page_root,
>>> (void *)ptr)));
>>> +	/* free the last one separately to avoid overflow issues */
>>> +	free_page(phys_to_virt(virt_to_pte_phys(page_root, (void
>>> *)ptr))); }
>>>  
>>>  static struct alloc_ops vmalloc_ops = {
>>>   
>>
>>
> 



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [kvm-unit-tests RFC v1 2/5] lib/alloc_page: complete rewrite of the page allocator
  2020-08-14 15:10 ` [kvm-unit-tests RFC v1 2/5] lib/alloc_page: complete rewrite of the page allocator Claudio Imbrenda
  2020-08-18 16:06   ` Cornelia Huck
@ 2020-08-19 15:44   ` Janosch Frank
  2020-08-19 16:46     ` Claudio Imbrenda
  1 sibling, 1 reply; 13+ messages in thread
From: Janosch Frank @ 2020-08-19 15:44 UTC (permalink / raw)
  To: Claudio Imbrenda, kvm, pbonzini; +Cc: david, thuth, cohuck, lvivier


[-- Attachment #1.1: Type: text/plain, Size: 21751 bytes --]

On 8/14/20 5:10 PM, Claudio Imbrenda wrote:
> This is a complete rewrite of the page allocator.
> 
> This will bring a few improvements:
> * no need to specify the size when freeing
> * allocate small areas with a large alignment without wasting memory
> * ability to initialize and use multiple memory areas (e.g. DMA)
> * more sanity checks
> 
> A few things have changed:
> * initialization cannot be done with free_pages like before,
>   page_alloc_init_area has to be used instead
> 
> Arch-specific changes:
> * arm and x86 have been adapted to put all the memory in just one big
>   area (or two, for x86_64 with highmem).
> * s390x instead creates one area below 2GiB and one above; the area
>   below 2GiB is used for SMP lowcore initialization.
> 
> Details:
> Each memory area has metadata at the very beginning. The metadata is a
> byte array with one entry per usable page (so, excluding the metadata
> itself). Each entry indicates if the page is special (unused for now),
> if it is allocated, and the order of the block. Both free and allocated
> pages are part of larger blocks.
> 
> Some more fixed size metadata is present in a fixed-size static array.
> This metadata contains start and end page frame numbers, the pointer to
> the metadata array, and the array of freelists. The array of freelists
> has an entry for each possible order (indicated by the macro NLISTS,
> defined as BITS_PER_LONG - PAGE_SHIFT).
> 
> On allocation, if the free list for the needed size is empty, larger
> blocks are split. When a small allocation with a large alignment is
> requested, an appropriately large block is split, to guarantee the
> alignment.
> 
> When a block is freed, an attempt will be made to merge it into the
> neighbour, iterating the process as long as possible.
> 
> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>

I'm not sure if it's possible without too much work, but splitting this
would make my brain hurt less upon reading. I'll try to continue the
review tomorrow; being wide awake is preferable for this patch.

Some comments below:

> ---
>  lib/alloc_page.h |  64 ++++++-
>  lib/alloc_page.c | 451 ++++++++++++++++++++++++++++++++++++-----------
>  lib/arm/setup.c  |   2 +-
>  lib/s390x/sclp.c |  11 +-
>  lib/s390x/smp.c  |   2 +-
>  lib/vmalloc.c    |  13 +-
>  6 files changed, 427 insertions(+), 116 deletions(-)
> 
> diff --git a/lib/alloc_page.h b/lib/alloc_page.h
> index 88540d1..6472abd 100644
> --- a/lib/alloc_page.h
> +++ b/lib/alloc_page.h
[..]
> diff --git a/lib/alloc_page.c b/lib/alloc_page.c
> index 74fe726..7c91f91 100644
> --- a/lib/alloc_page.c
> +++ b/lib/alloc_page.c
> @@ -13,165 +13,410 @@
>  #include <asm/io.h>
>  #include <asm/spinlock.h>
>  
> +#define IS_ALIGNED_ORDER(x,order) IS_ALIGNED((x),BIT_ULL(order))
> +#define NLISTS ((BITS_PER_LONG) - (PAGE_SHIFT))
> +#define PFN(x) ((uintptr_t)(x) >> PAGE_SHIFT)

So, more or less virt_to_pfn but with uintptr_t rather than unsigned long?

> +
> +#define MAX_AREAS	4
> +
> +#define ORDER_MASK	0x3f
> +#define ALLOC_MASK	0x40
> +
> +struct free_list {
> +	struct free_list *prev;
> +	struct free_list *next;
> +};
> +
> +struct mem_area {
> +	/* Physical frame number of the first usable frame in the area */
> +	uintptr_t base;
> +	/* Physical frame number of the first frame outside the area */
> +	uintptr_t top;
> +	/* Combination ALLOC_MASK and order */
> +	u8 *page_states;
> +	/* One freelist for each possible block size, up to NLISTS */
> +	struct free_list freelists[NLISTS];
> +};
> +
> +static struct mem_area areas[MAX_AREAS];
> +static unsigned int areas_mask;
>  static struct spinlock lock;
> -static void *freelist = 0;
>  
>  bool page_alloc_initialized(void)
>  {
> -	return freelist != 0;
> +	return areas_mask != 0;
>  }
>  
> -void free_pages(void *mem, size_t size)
> +static inline bool area_overlaps(struct mem_area *a, uintptr_t pfn)
>  {
> -	void *old_freelist;
> -	void *end;
> +	return (pfn >= PFN(a->page_states)) && (pfn < a->top);
> +}

I have pondered over this function for quite a while now and I don't
think the naming is correct.

This is basically the contains function below, but including the page
states which are at the beginning of the area but will never be shown to
users, right?

>  
> -	assert_msg((unsigned long) mem % PAGE_SIZE == 0,
> -		   "mem not page aligned: %p", mem);
> +static inline bool area_contains(struct mem_area *a, uintptr_t pfn)
> +{
> +	return (pfn >= a->base) && (pfn < a->top);
> +}
>  
> -	assert_msg(size % PAGE_SIZE == 0, "size not page aligned: %#zx", size);
> +static inline bool is_list_empty(struct free_list *p)
> +{
> +	return !p->next || !p->prev || p == p->next || p == p->prev;
> +}
>  
> -	assert_msg(size == 0 || (uintptr_t)mem == -size ||
> -		   (uintptr_t)mem + size > (uintptr_t)mem,
> -		   "mem + size overflow: %p + %#zx", mem, size);
> +static struct free_list *list_remove(struct free_list *l)
> +{
> +	if (is_list_empty(l))
> +		return NULL;
>  
> -	if (size == 0) {
> -		freelist = NULL;
> -		return;
> -	}
> +	l->prev->next = l->next;
> +	l->next->prev = l->prev;
> +	l->prev = l->next = NULL;
>  
> -	spin_lock(&lock);
> -	old_freelist = freelist;
> -	freelist = mem;
> -	end = mem + size;
> -	while (mem + PAGE_SIZE != end) {
> -		*(void **)mem = (mem + PAGE_SIZE);
> -		mem += PAGE_SIZE;
> -	}
> +	return l;
> +}
>  
> -	*(void **)mem = old_freelist;
> -	spin_unlock(&lock);
> +static void list_add(struct free_list *head, struct free_list *li)
> +{
> +	assert(li);
> +	assert(head);
> +	li->prev = head;
> +	li->next = head->next;
> +	head->next->prev = li;
> +	head->next = li;
>  }

lib/lists.(c|h) or maybe util.c
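
Something along these lines maybe (only a sketch, names made up), with
list_add()/list_remove() moved over as-is:

/* lib/list.h (hypothetical) */
struct linked_list {
	struct linked_list *prev;
	struct linked_list *next;
};

static inline void list_init(struct linked_list *head)
{
	head->prev = head->next = head;
}

static inline bool is_list_empty(struct linked_list *p)
{
	return !p->next || !p->prev || p == p->next || p == p->prev;
}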

I guess it's only a matter of time until we have a fully preempt testing
kernel...

>  
> -void free_pages_by_order(void *mem, unsigned int order)
> +/*
> + * Splits the free block starting at addr into 2 blocks of half the size.
> + *
> + * The function depends on the following assumptions:
> + * - The allocator must have been initialized
> + * - the block must be within the memory area
> + * - all pages in the block must be free and not special
> + * - the pointer must point to the start of the block
> + * - all pages in the block must have the same block size.
> + * - the block size must be greater than 0
> + * - the block size must be smaller than the maximum allowed
> + * - the block must be in a free list
> + * - the function is called with the lock held
> + */
> +static void split(struct mem_area *a, void *addr)
>  {
> -	free_pages(mem, 1ul << (order + PAGE_SHIFT));
> +	uintptr_t pfn = PFN(addr);
> +	struct free_list *p;
> +	uintptr_t i, idx;
> +	u8 order;
> +
> +	assert(a && area_contains(a, pfn));
> +	idx = pfn - a->base;
> +	order = a->page_states[idx];
> +	assert(!(order & ~ORDER_MASK) && order && (order < NLISTS));
> +	assert(IS_ALIGNED_ORDER(pfn, order));
> +	assert(area_contains(a, pfn + BIT(order) - 1));
> +
> +	/* Remove the block from its free list */
> +	p = list_remove(addr);
> +	assert(p);
> +
> +	/* update the block size for each page in the block */
> +	for (i = 0; i < BIT(order); i++) {
> +		assert(a->page_states[idx + i] == order);
> +		a->page_states[idx + i] = order - 1;
> +	}
> +	order--;
> +	/* add the first half block to the appropriate free list */
> +	list_add(a->freelists + order, p);
> +	/* add the second half block to the appropriate free list */
> +	list_add(a->freelists + order, (void *)((pfn + BIT(order)) * PAGE_SIZE));
>  }
>  
> -void *alloc_page()
> +/*
> + * Returns a block whose alignment and size are at least the parameter values.
> + * If there is not enough free memory, NULL is returned.
> + *
> + * Both parameters must be not larger than the largest allowed order
> + */
> +static void *page_memalign_order(struct mem_area *a, u8 al, u8 sz)
>  {
> -	void *p;
> +	struct free_list *p, *res = NULL;
> +	u8 order;
>  
> -	if (!freelist)
> -		return 0;
> +	assert((al < NLISTS) && (sz < NLISTS));
> +	/* we need the bigger of the two as starting point */
> +	order = sz > al ? sz : al;
>  
> -	spin_lock(&lock);
> -	p = freelist;
> -	freelist = *(void **)freelist;
> -	spin_unlock(&lock);
> +	/* search all free lists for some memory */
> +	for ( ; order < NLISTS; order++) {
> +		p = a->freelists[order].next;
> +		if (!is_list_empty(p))
> +			break;
> +	}
> +	/* out of memory */
> +	if (order >= NLISTS)
> +		return NULL;
> +
> +	/*
> +	 * the block is bigger than what we need because either there were
> +	 * no smaller blocks, or the smaller blocks were not aligned to our
> +	 * needs; therefore we split the block until we reach the needed size
> +	 */
> +	for (; order > sz; order--)
> +		split(a, p);
>  
> -	if (p)
> -		memset(p, 0, PAGE_SIZE);
> -	return p;
> +	res = list_remove(p);
> +	memset(a->page_states + (PFN(res) - a->base), ALLOC_MASK | order, BIT(order));
> +	return res;
>  }
>  
>  /*
> - * Allocates (1 << order) physically contiguous and naturally aligned pages.
> - * Returns NULL if there's no memory left.
> + * Try to merge two blocks into a bigger one.
> + * Returns true in case of a successful merge.
> + * Merging will succeed only if both blocks have the same block size and are
> + * both free.
> + *
> + * The function depends on the following assumptions:
> + * - the first parameter is strictly smaller than the second
> + * - the parameters must point each to the start of their block
> + * - the two parameters point to adjacent blocks
> + * - the two blocks are both in a free list
> + * - all of the pages of the two blocks must be free
> + * - all of the pages of the two blocks must have the same block size
> + * - the function is called with the lock held
>   */
> -void *alloc_pages(unsigned int order)
> +static bool coalesce(struct mem_area *a, u8 order, uintptr_t pfn, uintptr_t pfn2)
>  {
> -	/* Generic list traversal. */
> -	void *prev;
> -	void *curr = NULL;
> -	void *next = freelist;
> +	uintptr_t first, second, i;
> +	struct free_list *li;
>  
> -	/* Looking for a run of length (1 << order). */
> -	unsigned long run = 0;
> -	const unsigned long n = 1ul << order;
> -	const unsigned long align_mask = (n << PAGE_SHIFT) - 1;
> -	void *run_start = NULL;
> -	void *run_prev = NULL;
> -	unsigned long run_next_pa = 0;
> -	unsigned long pa;
> +	assert(IS_ALIGNED_ORDER(pfn, order) && IS_ALIGNED_ORDER(pfn2, order));
> +	assert(pfn2 == pfn + BIT(order));
> +	assert(a);
>  
> -	assert(order < sizeof(unsigned long) * 8);
> +	if (!area_contains(a, pfn) || !area_contains(a, pfn2 + BIT(order) - 1))
> +		return false;
> +	first = pfn - a->base;
> +	second = pfn2 - a->base;
> +	if ((a->page_states[first] != order) || (a->page_states[second] != order))
> +		return false;
>  
> -	spin_lock(&lock);
> -	for (;;) {
> -		prev = curr;
> -		curr = next;
> +	li = list_remove((void *)(pfn2 << PAGE_SHIFT));
> +	assert(li);
> +	li = list_remove((void *)(pfn << PAGE_SHIFT));
> +	assert(li);
> +	for (i = 0; i < (2ull << order); i++) {
> +		assert(a->page_states[first + i] == order);
> +		a->page_states[first + i] = order + 1;
> +	}
> +	list_add(a->freelists + order + 1, li);
> +	return true;
> +}
>  
> -		if (!curr) {
> -			run_start = NULL;
> -			break;
> -		}
> +/*
> + * Free a block of memory.
> + * The parameter can be NULL, in which case nothing happens.
> + *
> + * The function depends on the following assumptions:
> + * - the parameter is page aligned
> + * - the parameter belongs to an existing memory area
> + * - the parameter points to the beginning of the block
> + * - the size of the block is less than the maximum allowed
> + * - the block is completely contained in its memory area
> + * - all pages in the block have the same block size
> + * - no pages in the memory block were already free
> + * - no pages in the memory block are special
> + */
> +static void _free_pages(void *mem)
> +{
> +	uintptr_t pfn2, pfn = PFN(mem);
> +	struct mem_area *a = NULL;
> +	uintptr_t i, p;
> +	u8 order;
>  
> -		next = *((void **) curr);
> -		pa = virt_to_phys(curr);
> -
> -		if (run == 0) {
> -			if (!(pa & align_mask)) {
> -				run_start = curr;
> -				run_prev = prev;
> -				run_next_pa = pa + PAGE_SIZE;
> -				run = 1;
> -			}
> -		} else if (pa == run_next_pa) {
> -			run_next_pa += PAGE_SIZE;
> -			run += 1;
> -		} else {
> -			run = 0;
> -		}
> +	if (!mem)
> +		return;
> +	assert(IS_ALIGNED((uintptr_t)mem, PAGE_SIZE));
> +	for (i = 0; !a && (i < MAX_AREAS); i++) {
> +		if ((areas_mask & BIT(i)) && area_contains(areas + i, pfn))
> +			a = areas + i;
> +	}
> +	assert_msg(a, "memory does not belong to any area: %p", mem);
>  
> -		if (run == n) {
> -			if (run_prev)
> -				*((void **) run_prev) = next;
> -			else
> -				freelist = next;
> -			break;
> -		}
> +	p = pfn - a->base;
> +	order = a->page_states[p] & ORDER_MASK;
> +
> +	assert(a->page_states[p] == (order | ALLOC_MASK));
> +	assert(order < NLISTS);
> +	assert(IS_ALIGNED_ORDER(pfn, order));
> +	assert(area_contains(a, pfn + BIT(order) - 1));
> +
> +	for (i = 0; i < BIT(order); i++) {
> +		assert(a->page_states[p + i] == (ALLOC_MASK | order));
> +		a->page_states[p + i] &= ~ALLOC_MASK;
>  	}
> -	spin_unlock(&lock);
> -	if (run_start)
> -		memset(run_start, 0, n * PAGE_SIZE);
> -	return run_start;
> +	list_add(a->freelists + order, mem);
> +	do {
> +		order = a->page_states[p] & ORDER_MASK;
> +		if (!IS_ALIGNED_ORDER(pfn, order + 1))
> +			pfn = pfn - BIT(order);
> +		pfn2 = pfn + BIT(order);
> +	} while (coalesce(a, order, pfn, pfn2));
>  }
>  
> +void free_pages(void *mem, size_t size)
> +{
> +	spin_lock(&lock);
> +	_free_pages(mem);
> +	spin_unlock(&lock);
> +}
>  
> -void free_page(void *page)
> +static void *page_memalign_order_area(unsigned area, u8 ord, u8 al)
>  {
> +	void *res = NULL;
> +	int i;
> +
>  	spin_lock(&lock);
> -	*(void **)page = freelist;
> -	freelist = page;
> +	area &= areas_mask;
> +	for (i = 0; !res && (i < MAX_AREAS); i++)
> +		if (area & BIT(i))
> +			res = page_memalign_order(areas + i, ord, al);
>  	spin_unlock(&lock);
> +	return res;
>  }
>  
> -static void *page_memalign(size_t alignment, size_t size)
> +/*
> + * Allocates (1 << order) physically contiguous and naturally aligned pages.
> + * Returns NULL if the allocation was not possible.
> + */
> +void *alloc_pages_area(unsigned int area, unsigned int order)
>  {
> -	unsigned long n = ALIGN(size, PAGE_SIZE) >> PAGE_SHIFT;
> -	unsigned int order;
> +	return page_memalign_order_area(area, order, order);
> +}
>  
> -	if (!size)
> -		return NULL;
> +void *alloc_pages(unsigned int order)
> +{
> +	return alloc_pages_area(~0, order);
> +}
>  
> -	order = get_order(n);
> +/*
> + * Allocates (1 << order) physically contiguous aligned pages.
> + * Returns NULL if the allocation was not possible.
> + */
> +void *memalign_pages_area(unsigned int area, size_t alignment, size_t size)
> +{
> +	assert(is_power_of_2(alignment));
> +	alignment = get_order(PAGE_ALIGN(alignment) >> PAGE_SHIFT);
> +	size = get_order(PAGE_ALIGN(size) >> PAGE_SHIFT);
> +	assert(alignment < NLISTS);
> +	assert(size < NLISTS);
> +	return page_memalign_order_area(area, size, alignment);
> +}
>  
> -	return alloc_pages(order);
> +void *memalign_pages(size_t alignment, size_t size)
> +{
> +	return memalign_pages_area(~0, alignment, size);
>  }
>  
> -static void page_free(void *mem, size_t size)
> +/*
> + * Allocates one page
> + */
> +void *alloc_page()
>  {
> -	free_pages(mem, size);
> +	return alloc_pages(0);
>  }
>  
>  static struct alloc_ops page_alloc_ops = {
> -	.memalign = page_memalign,
> -	.free = page_free,
> +	.memalign = memalign_pages,
> +	.free = free_pages,
>  	.align_min = PAGE_SIZE,
>  };
>  
> +/*
> + * Enables the page allocator.
> + *
> + * Prerequisites:
> + * - at least one memory area has been initialized
> + */
>  void page_alloc_ops_enable(void)
>  {
> +	spin_lock(&lock);
> +	assert(page_alloc_initialized());
>  	alloc_ops = &page_alloc_ops;
> +	spin_unlock(&lock);
> +}
> +
> +/*
> + * Adds a new memory area to the pool of available memory.
> + *
> + * Prerequisites:
> + * - the lock is held
> + * - start and top are page frame numbers

Calling them pfn_start and pfn_top would make that clearer.
Maybe add a list of the input parameters and their contents?

> + * - start is smaller than top
> + * - top does not fall outside of addressable memory
> + * - there is at least one more slot free for memory areas
> + * - if a specific memory area number has been indicated, it needs to be free
> + * - the memory area to add does not overlap with existing areas
> + * - the memory area to add has at least 5 pages available
> + */
> +static void _page_alloc_init_area(int n, uintptr_t start, uintptr_t top)
> +{
> +	size_t table_size, npages, i;
> +	struct mem_area *a;
> +	u8 order = 0;
> +
> +	if (n == -1) {
> +		for (n = 0; n < MAX_AREAS; n++) {
> +			if (!(areas_mask & BIT(n)))
> +				break;
> +		}
> +	}
> +	assert(n < MAX_AREAS);
> +	assert(!(areas_mask & BIT(n)));
> +
> +	assert(top > start);
> +	assert(top - start > 4);
> +	assert(top < BIT_ULL(sizeof(void *) * 8 - PAGE_SHIFT));
> +
> +	table_size = (top - start + PAGE_SIZE) / (PAGE_SIZE + 1);
> +
> +	a = areas + n;
> +	a->page_states = (void *)(start << PAGE_SHIFT);
> +	a->base = start + table_size;
> +	a->top = top;
> +	npages = top - a->base;
> +
> +	assert((a->base - start) * PAGE_SIZE >= npages);
> +	for (i = 0; i < MAX_AREAS; i++) {
> +		if (!(areas_mask & BIT(i)))
> +			continue;
> +		assert(!area_overlaps(areas + i, start));
> +		assert(!area_overlaps(areas + i, top - 1));
> +		assert(!area_overlaps(a, PFN(areas[i].page_states)));
> +		assert(!area_overlaps(a, areas[i].top - 1));
> +	}
> +	for (i = 0; i < NLISTS; i++)
> +		a->freelists[i].next = a->freelists[i].prev = a->freelists + i;
> +
> +	for (i = a->base; i < a->top; i += 1ull << order) {
> +		while (i + BIT(order) > a->top) {
> +			assert(order);
> +			order--;
> +		}
> +		while (IS_ALIGNED_ORDER(i, order + 1) && (i + BIT(order + 1) <= a->top))
> +			order++;
> +		assert(order < NLISTS);
> +		memset(a->page_states + (i - a->base), order, BIT(order));
> +		list_add(a->freelists + order, (void *)(i << PAGE_SHIFT));
> +	}
> +	areas_mask |= BIT(n);
> +}
> +
> +/*
> + * Adds a new memory area to the pool of available memory.
> + *
> + * Prerequisites:
> + * see _page_alloc_init_area
> + */
> +void page_alloc_init_area(int n, uintptr_t base_pfn, uintptr_t top_pfn)
> +{
> +	spin_lock(&lock);
> +	_page_alloc_init_area(n, base_pfn, top_pfn);
> +	spin_unlock(&lock);
>  }
> diff --git a/lib/arm/setup.c b/lib/arm/setup.c
> index 78562e4..a3c573f 100644
> --- a/lib/arm/setup.c
> +++ b/lib/arm/setup.c
> @@ -155,7 +155,7 @@ static void mem_init(phys_addr_t freemem_start)
>  	assert(sizeof(long) == 8 || !(base >> 32));
>  	if (sizeof(long) != 8 && (top >> 32) != 0)
>  		top = ((uint64_t)1 << 32);
> -	free_pages((void *)(unsigned long)base, top - base);
> +	page_alloc_init_area(-1, base, top);
>  	page_alloc_ops_enable();
>  }
>  
> diff --git a/lib/s390x/sclp.c b/lib/s390x/sclp.c
> index 4054d0e..c25d442 100644
> --- a/lib/s390x/sclp.c
> +++ b/lib/s390x/sclp.c
> @@ -37,11 +37,16 @@ static void mem_init(phys_addr_t mem_end)
>  
>  	phys_alloc_init(freemem_start, mem_end - freemem_start);
>  	phys_alloc_get_unused(&base, &top);
> -	base = (base + PAGE_SIZE - 1) & -PAGE_SIZE;
> -	top = top & -PAGE_SIZE;
> +	base = PAGE_ALIGN(base) >> PAGE_SHIFT;
> +	top = top >> PAGE_SHIFT;
>  
>  	/* Make the pages available to the physical allocator */
> -	free_pages((void *)(unsigned long)base, top - base);
> +	if (top > (2ull * SZ_1G >> PAGE_SHIFT)) {
> +		page_alloc_init_area(0, 2ull * SZ_1G >> PAGE_SHIFT, top);
> +		page_alloc_init_area(1, base, 2ull * SZ_1G >> PAGE_SHIFT);
> +	} else {
> +		page_alloc_init_area(1, base, top);
> +	}
>  	page_alloc_ops_enable();
>  }
>  
> diff --git a/lib/s390x/smp.c b/lib/s390x/smp.c
> index 2860e9c..d954094 100644
> --- a/lib/s390x/smp.c
> +++ b/lib/s390x/smp.c
> @@ -190,7 +190,7 @@ int smp_cpu_setup(uint16_t addr, struct psw psw)
>  
>  	sigp_retry(cpu->addr, SIGP_INITIAL_CPU_RESET, 0, NULL);
>  
> -	lc = alloc_pages(1);
> +	lc = alloc_pages_area(_AREA(1), 1);
>  	cpu->lowcore = lc;
>  	memset(lc, 0, PAGE_SIZE * 2);
>  	sigp_retry(cpu->addr, SIGP_SET_PREFIX, (unsigned long )lc, NULL);
> diff --git a/lib/vmalloc.c b/lib/vmalloc.c
> index aca0876..f72c5b3 100644
> --- a/lib/vmalloc.c
> +++ b/lib/vmalloc.c
> @@ -217,18 +217,19 @@ void setup_vm()
>  	 * so that it can be used to allocate page tables.
>  	 */
>  	if (!page_alloc_initialized()) {
> -		base = PAGE_ALIGN(base);
> -		top = top & -PAGE_SIZE;
> -		free_pages(phys_to_virt(base), top - base);
> +		base = PAGE_ALIGN(base) >> PAGE_SHIFT;
> +		top = top >> PAGE_SHIFT;
> +		page_alloc_init_area(1, base, top);
> +		page_alloc_ops_enable();
>  	}
>  
>  	find_highmem();
>  	phys_alloc_get_unused(&base, &top);
>  	page_root = setup_mmu(top);
>  	if (base != top) {
> -		base = PAGE_ALIGN(base);
> -		top = top & -PAGE_SIZE;
> -		free_pages(phys_to_virt(base), top - base);
> +		base = PAGE_ALIGN(base) >> PAGE_SHIFT;
> +		top = top >> PAGE_SHIFT;
> +		page_alloc_init_area(0, base, top);
>  	}
>  
>  	spin_lock(&lock);
> 



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [kvm-unit-tests RFC v1 2/5] lib/alloc_page: complete rewrite of the page allocator
  2020-08-19 15:44   ` Janosch Frank
@ 2020-08-19 16:46     ` Claudio Imbrenda
  0 siblings, 0 replies; 13+ messages in thread
From: Claudio Imbrenda @ 2020-08-19 16:46 UTC (permalink / raw)
  To: Janosch Frank; +Cc: kvm, pbonzini, david, thuth, cohuck, lvivier

On Wed, 19 Aug 2020 17:44:55 +0200
Janosch Frank <frankja@linux.ibm.com> wrote:

> On 8/14/20 5:10 PM, Claudio Imbrenda wrote:
> > This is a complete rewrite of the page allocator.
> > 
> > This will bring a few improvements:
> > * no need to specify the size when freeing
> > * allocate small areas with a large alignment without wasting memory
> > * ability to initialize and use multiple memory areas (e.g. DMA)
> > * more sanity checks
> > 
> > A few things have changed:
> > * initialization cannot be done with free_pages like before,
> >   page_alloc_init_area has to be used instead
> > 
> > Arch-specific changes:
> > * arm and x86 have been adapted to put all the memory in just one
> > big area (or two, for x86_64 with highmem).
> > * s390x instead creates one area below 2GiB and one above; the area
> >   below 2GiB is used for SMP lowcore initialization.
> > 
> > Details:
> > Each memory area has metadata at the very beginning. The metadata
> > is a byte array with one entry per usable page (so, excluding the
> > metadata itself). Each entry indicates if the page is special
> > (unused for now), if it is allocated, and the order of the block.
> > Both free and allocated pages are part of larger blocks.
> > 
> > Some more fixed size metadata is present in a fixed-size static
> > array. This metadata contains start and end page frame numbers, the
> > pointer to the metadata array, and the array of freelists. The
> > array of freelists has an entry for each possible order (indicated
> > by the macro NLISTS, defined as BITS_PER_LONG - PAGE_SHIFT).
> > 
> > On allocation, if the free list for the needed size is empty, larger
> > blocks are split. When a small allocation with a large alignment is
> > requested, an appropriately large block is split, to guarantee the
> > alignment.
> > 
> > When a block is freed, an attempt will be made to merge it into the
> > neighbour, iterating the process as long as possible.
> > 
> > Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>  
> 
> I'm not sure if it's possible without too much work, but splitting
> this would make my brain hurt less upon reading. I'll try to continue
> the review tomorrow; being wide awake is preferable for this patch.
> 
> Some comments below:
> 
> > ---
> >  lib/alloc_page.h |  64 ++++++-
> >  lib/alloc_page.c | 451
> > ++++++++++++++++++++++++++++++++++++----------- lib/arm/setup.c  |
> >  2 +- lib/s390x/sclp.c |  11 +-
> >  lib/s390x/smp.c  |   2 +-
> >  lib/vmalloc.c    |  13 +-
> >  6 files changed, 427 insertions(+), 116 deletions(-)
> > 
> > diff --git a/lib/alloc_page.h b/lib/alloc_page.h
> > index 88540d1..6472abd 100644
> > --- a/lib/alloc_page.h
> > +++ b/lib/alloc_page.h  
> [..]
> > diff --git a/lib/alloc_page.c b/lib/alloc_page.c
> > index 74fe726..7c91f91 100644
> > --- a/lib/alloc_page.c
> > +++ b/lib/alloc_page.c
> > @@ -13,165 +13,410 @@
> >  #include <asm/io.h>
> >  #include <asm/spinlock.h>
> >  
> > +#define IS_ALIGNED_ORDER(x,order) IS_ALIGNED((x),BIT_ULL(order))
> > +#define NLISTS ((BITS_PER_LONG) - (PAGE_SHIFT))
> > +#define PFN(x) ((uintptr_t)(x) >> PAGE_SHIFT)  
> 
> So, more or less virt_to_pfn but with uintptr_t rather than unsigned
> long?

kinda... 

virt_to_pfn implies that the starting addresses are virtual. note that
in some architectures virt_to_pfn does actually more than just a shift.
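
roughly (illustrative only; example_virt_to_pfn is made up here, the
real virt_to_pfn differs between architectures):

#define PFN(x) ((uintptr_t)(x) >> PAGE_SHIFT)	/* plain shift, no translation */

static inline unsigned long example_virt_to_pfn(void *vaddr)
{
	/* goes through the virtual->physical translation first */
	return virt_to_phys(vaddr) >> PAGE_SHIFT;
}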

> > +
> > +#define MAX_AREAS	4
> > +
> > +#define ORDER_MASK	0x3f
> > +#define ALLOC_MASK	0x40
> > +
> > +struct free_list {
> > +	struct free_list *prev;
> > +	struct free_list *next;
> > +};
> > +
> > +struct mem_area {
> > +	/* Physical frame number of the first usable frame in the
> > area */
> > +	uintptr_t base;
> > +	/* Physical frame number of the first frame outside the
> > area */
> > +	uintptr_t top;
> > +	/* Combination ALLOC_MASK and order */
> > +	u8 *page_states;
> > +	/* One freelist for each possible block size, up to NLISTS
> > */
> > +	struct free_list freelists[NLISTS];
> > +};
> > +
> > +static struct mem_area areas[MAX_AREAS];
> > +static unsigned int areas_mask;
> >  static struct spinlock lock;
> > -static void *freelist = 0;
> >  
> >  bool page_alloc_initialized(void)
> >  {
> > -	return freelist != 0;
> > +	return areas_mask != 0;
> >  }
> >  
> > -void free_pages(void *mem, size_t size)
> > +static inline bool area_overlaps(struct mem_area *a, uintptr_t pfn)
> >  {
> > -	void *old_freelist;
> > -	void *end;
> > +	return (pfn >= PFN(a->page_states)) && (pfn < a->top);
> > +}  
> 
> I have pondered over this function for quite a while now and I don't
> think the naming is correct.
> 
> This is basically the contains function below, but including the page
> states which are at the beginning of the area but will never be shown
> to users, right?

correct. I needed a different, short, name for "contains" 

suggestions for a better name are welcome :)
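
for reference, the two ranges are (sketch):

/*
 *  |<---- page_states table ---->|<-------- usable pages -------->|
 *  ^PFN(a->page_states)          ^a->base                         ^a->top
 *
 * area_overlaps():  pfn in [PFN(a->page_states), a->top)
 * area_contains():  pfn in [a->base, a->top)
 */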

> >  
> > -	assert_msg((unsigned long) mem % PAGE_SIZE == 0,
> > -		   "mem not page aligned: %p", mem);
> > +static inline bool area_contains(struct mem_area *a, uintptr_t pfn)
> > +{
> > +	return (pfn >= a->base) && (pfn < a->top);
> > +}
> >  
> > -	assert_msg(size % PAGE_SIZE == 0, "size not page aligned:
> > %#zx", size); +static inline bool is_list_empty(struct free_list *p)
> > +{
> > +	return !p->next || !p->prev || p == p->next || p ==
> > p->prev; +}
> >  
> > -	assert_msg(size == 0 || (uintptr_t)mem == -size ||
> > -		   (uintptr_t)mem + size > (uintptr_t)mem,
> > -		   "mem + size overflow: %p + %#zx", mem, size);
> > +static struct free_list *list_remove(struct free_list *l)
> > +{
> > +	if (is_list_empty(l))
> > +		return NULL;
> >  
> > -	if (size == 0) {
> > -		freelist = NULL;
> > -		return;
> > -	}
> > +	l->prev->next = l->next;
> > +	l->next->prev = l->prev;
> > +	l->prev = l->next = NULL;
> >  
> > -	spin_lock(&lock);
> > -	old_freelist = freelist;
> > -	freelist = mem;
> > -	end = mem + size;
> > -	while (mem + PAGE_SIZE != end) {
> > -		*(void **)mem = (mem + PAGE_SIZE);
> > -		mem += PAGE_SIZE;
> > -	}
> > +	return l;
> > +}
> >  
> > -	*(void **)mem = old_freelist;
> > -	spin_unlock(&lock);
> > +static void list_add(struct free_list *head, struct free_list *li)
> > +{
> > +	assert(li);
> > +	assert(head);
> > +	li->prev = head;
> > +	li->next = head->next;
> > +	head->next->prev = li;
> > +	head->next = li;
> >  }  
> 
> lib/lists.(c|h) or maybe util.c

hmm makes sense, I'll split

> I guess it's only a matter of time until we have a fully preempt
> testing kernel...
> 
> >  
> > -void free_pages_by_order(void *mem, unsigned int order)
> > +/*
> > + * Splits the free block starting at addr into 2 blocks of half
> > the size.
> > + *
> > + * The function depends on the following assumptions:
> > + * - The allocator must have been initialized
> > + * - the block must be within the memory area
> > + * - all pages in the block must be free and not special
> > + * - the pointer must point to the start of the block
> > + * - all pages in the block must have the same block size.
> > + * - the block size must be greater than 0
> > + * - the block size must be smaller than the maximum allowed
> > + * - the block must be in a free list
> > + * - the function is called with the lock held
> > + */
> > +static void split(struct mem_area *a, void *addr)
> >  {
> > -	free_pages(mem, 1ul << (order + PAGE_SHIFT));
> > +	uintptr_t pfn = PFN(addr);
> > +	struct free_list *p;
> > +	uintptr_t i, idx;
> > +	u8 order;
> > +
> > +	assert(a && area_contains(a, pfn));
> > +	idx = pfn - a->base;
> > +	order = a->page_states[idx];
> > +	assert(!(order & ~ORDER_MASK) && order && (order <
> > NLISTS));
> > +	assert(IS_ALIGNED_ORDER(pfn, order));
> > +	assert(area_contains(a, pfn + BIT(order) - 1));
> > +
> > +	/* Remove the block from its free list */
> > +	p = list_remove(addr);
> > +	assert(p);
> > +
> > +	/* update the block size for each page in the block */
> > +	for (i = 0; i < BIT(order); i++) {
> > +		assert(a->page_states[idx + i] == order);
> > +		a->page_states[idx + i] = order - 1;
> > +	}
> > +	order--;
> > +	/* add the first half block to the appropriate free list */
> > +	list_add(a->freelists + order, p);
> > +	/* add the second half block to the appropriate free list
> > */
> > +	list_add(a->freelists + order, (void *)((pfn + BIT(order))
> > * PAGE_SIZE)); }
> >  
> > -void *alloc_page()
> > +/*
> > + * Returns a block whose alignment and size are at least the
> > parameter values.
> > + * If there is not enough free memory, NULL is returned.
> > + *
> > + * Both parameters must be not larger than the largest allowed
> > order
> > + */
> > +static void *page_memalign_order(struct mem_area *a, u8 al, u8 sz)
> >  {
> > -	void *p;
> > +	struct free_list *p, *res = NULL;
> > +	u8 order;
> >  
> > -	if (!freelist)
> > -		return 0;
> > +	assert((al < NLISTS) && (sz < NLISTS));
> > +	/* we need the bigger of the two as starting point */
> > +	order = sz > al ? sz : al;
> >  
> > -	spin_lock(&lock);
> > -	p = freelist;
> > -	freelist = *(void **)freelist;
> > -	spin_unlock(&lock);
> > +	/* search all free lists for some memory */
> > +	for ( ; order < NLISTS; order++) {
> > +		p = a->freelists[order].next;
> > +		if (!is_list_empty(p))
> > +			break;
> > +	}
> > +	/* out of memory */
> > +	if (order >= NLISTS)
> > +		return NULL;
> > +
> > +	/*
> > +	 * the block is bigger than what we need because either
> > there were
> > +	 * no smaller blocks, or the smaller blocks were not
> > aligned to our
> > +	 * needs; therefore we split the block until we reach the
> > needed size
> > +	 */
> > +	for (; order > sz; order--)
> > +		split(a, p);
> >  
> > -	if (p)
> > -		memset(p, 0, PAGE_SIZE);
> > -	return p;
> > +	res = list_remove(p);
> > +	memset(a->page_states + (PFN(res) - a->base), ALLOC_MASK |
> > order, BIT(order));
> > +	return res;
> >  }
> >  
> >  /*
> > - * Allocates (1 << order) physically contiguous and naturally
> > aligned pages.
> > - * Returns NULL if there's no memory left.
> > + * Try to merge two blocks into a bigger one.
> > + * Returns true in case of a successful merge.
> > + * Merging will succeed only if both blocks have the same block
> > size and are
> > + * both free.
> > + *
> > + * The function depends on the following assumptions:
> > + * - the first parameter is strictly smaller than the second
> > + * - the parameters must point each to the start of their block
> > + * - the two parameters point to adjacent blocks
> > + * - the two blocks are both in a free list
> > + * - all of the pages of the two blocks must be free
> > + * - all of the pages of the two blocks must have the same block
> > size
> > + * - the function is called with the lock held
> >   */
> > -void *alloc_pages(unsigned int order)
> > +static bool coalesce(struct mem_area *a, u8 order, uintptr_t pfn,
> > uintptr_t pfn2) {
> > -	/* Generic list traversal. */
> > -	void *prev;
> > -	void *curr = NULL;
> > -	void *next = freelist;
> > +	uintptr_t first, second, i;
> > +	struct free_list *li;
> >  
> > -	/* Looking for a run of length (1 << order). */
> > -	unsigned long run = 0;
> > -	const unsigned long n = 1ul << order;
> > -	const unsigned long align_mask = (n << PAGE_SHIFT) - 1;
> > -	void *run_start = NULL;
> > -	void *run_prev = NULL;
> > -	unsigned long run_next_pa = 0;
> > -	unsigned long pa;
> > +	assert(IS_ALIGNED_ORDER(pfn, order) &&
> > IS_ALIGNED_ORDER(pfn2, order));
> > +	assert(pfn2 == pfn + BIT(order));
> > +	assert(a);
> >  
> > -	assert(order < sizeof(unsigned long) * 8);
> > +	if (!area_contains(a, pfn) || !area_contains(a, pfn2 +
> > BIT(order) - 1))
> > +		return false;
> > +	first = pfn - a->base;
> > +	second = pfn2 - a->base;
> > +	if ((a->page_states[first] != order) ||
> > (a->page_states[second] != order))
> > +		return false;
> >  
> > -	spin_lock(&lock);
> > -	for (;;) {
> > -		prev = curr;
> > -		curr = next;
> > +	li = list_remove((void *)(pfn2 << PAGE_SHIFT));
> > +	assert(li);
> > +	li = list_remove((void *)(pfn << PAGE_SHIFT));
> > +	assert(li);
> > +	for (i = 0; i < (2ull << order); i++) {
> > +		assert(a->page_states[first + i] == order);
> > +		a->page_states[first + i] = order + 1;
> > +	}
> > +	list_add(a->freelists + order + 1, li);
> > +	return true;
> > +}
> >  
> > -		if (!curr) {
> > -			run_start = NULL;
> > -			break;
> > -		}
> > +/*
> > + * Free a block of memory.
> > + * The parameter can be NULL, in which case nothing happens.
> > + *
> > + * The function depends on the following assumptions:
> > + * - the parameter is page aligned
> > + * - the parameter belongs to an existing memory area
> > + * - the parameter points to the beginning of the block
> > + * - the size of the block is less than the maximum allowed
> > + * - the block is completely contained in its memory area
> > + * - all pages in the block have the same block size
> > + * - no pages in the memory block were already free
> > + * - no pages in the memory block are special
> > + */
> > +static void _free_pages(void *mem)
> > +{
> > +	uintptr_t pfn2, pfn = PFN(mem);
> > +	struct mem_area *a = NULL;
> > +	uintptr_t i, p;
> > +	u8 order;
> >  
> > -		next = *((void **) curr);
> > -		pa = virt_to_phys(curr);
> > -
> > -		if (run == 0) {
> > -			if (!(pa & align_mask)) {
> > -				run_start = curr;
> > -				run_prev = prev;
> > -				run_next_pa = pa + PAGE_SIZE;
> > -				run = 1;
> > -			}
> > -		} else if (pa == run_next_pa) {
> > -			run_next_pa += PAGE_SIZE;
> > -			run += 1;
> > -		} else {
> > -			run = 0;
> > -		}
> > +	if (!mem)
> > +		return;
> > +	assert(IS_ALIGNED((uintptr_t)mem, PAGE_SIZE));
> > +	for (i = 0; !a && (i < MAX_AREAS); i++) {
> > +		if ((areas_mask & BIT(i)) && area_contains(areas +
> > i, pfn))
> > +			a = areas + i;
> > +	}
> > +	assert_msg(a, "memory does not belong to any area: %p",
> > mem); 
> > -		if (run == n) {
> > -			if (run_prev)
> > -				*((void **) run_prev) = next;
> > -			else
> > -				freelist = next;
> > -			break;
> > -		}
> > +	p = pfn - a->base;
> > +	order = a->page_states[p] & ORDER_MASK;
> > +
> > +	assert(a->page_states[p] == (order | ALLOC_MASK));
> > +	assert(order < NLISTS);
> > +	assert(IS_ALIGNED_ORDER(pfn, order));
> > +	assert(area_contains(a, pfn + BIT(order) - 1));
> > +
> > +	for (i = 0; i < BIT(order); i++) {
> > +		assert(a->page_states[p + i] == (ALLOC_MASK |
> > order));
> > +		a->page_states[p + i] &= ~ALLOC_MASK;
> >  	}
> > -	spin_unlock(&lock);
> > -	if (run_start)
> > -		memset(run_start, 0, n * PAGE_SIZE);
> > -	return run_start;
> > +	list_add(a->freelists + order, mem);
> > +	do {
> > +		order = a->page_states[p] & ORDER_MASK;
> > +		if (!IS_ALIGNED_ORDER(pfn, order + 1))
> > +			pfn = pfn - BIT(order);
> > +		pfn2 = pfn + BIT(order);
> > +	} while (coalesce(a, order, pfn, pfn2));
> >  }
> >  
> > +void free_pages(void *mem, size_t size)
> > +{
> > +	spin_lock(&lock);
> > +	_free_pages(mem);
> > +	spin_unlock(&lock);
> > +}
> >  
> > -void free_page(void *page)
> > +static void *page_memalign_order_area(unsigned area, u8 ord, u8 al)
> >  {
> > +	void *res = NULL;
> > +	int i;
> > +
> >  	spin_lock(&lock);
> > -	*(void **)page = freelist;
> > -	freelist = page;
> > +	area &= areas_mask;
> > +	for (i = 0; !res && (i < MAX_AREAS); i++)
> > +		if (area & BIT(i))
> > +			res = page_memalign_order(areas + i, ord,
> > al); spin_unlock(&lock);
> > +	return res;
> >  }
> >  
> > -static void *page_memalign(size_t alignment, size_t size)
> > +/*
> > + * Allocates (1 << order) physically contiguous and naturally
> > aligned pages.
> > + * Returns NULL if the allocation was not possible.
> > + */
> > +void *alloc_pages_area(unsigned int area, unsigned int order)
> >  {
> > -	unsigned long n = ALIGN(size, PAGE_SIZE) >> PAGE_SHIFT;
> > -	unsigned int order;
> > +	return page_memalign_order_area(area, order, order);
> > +}
> >  
> > -	if (!size)
> > -		return NULL;
> > +void *alloc_pages(unsigned int order)
> > +{
> > +	return alloc_pages_area(~0, order);
> > +}
> >  
> > -	order = get_order(n);
> > +/*
> > + * Allocates (1 << order) physically contiguous aligned pages.
> > + * Returns NULL if the allocation was not possible.
> > + */
> > +void *memalign_pages_area(unsigned int area, size_t alignment,
> > size_t size) +{
> > +	assert(is_power_of_2(alignment));
> > +	alignment = get_order(PAGE_ALIGN(alignment) >> PAGE_SHIFT);
> > +	size = get_order(PAGE_ALIGN(size) >> PAGE_SHIFT);
> > +	assert(alignment < NLISTS);
> > +	assert(size < NLISTS);
> > +	return page_memalign_order_area(area, size, alignment);
> > +}
> >  
> > -	return alloc_pages(order);
> > +void *memalign_pages(size_t alignment, size_t size)
> > +{
> > +	return memalign_pages_area(~0, alignment, size);
> >  }
> >  
> > -static void page_free(void *mem, size_t size)
> > +/*
> > + * Allocates one page
> > + */
> > +void *alloc_page()
> >  {
> > -	free_pages(mem, size);
> > +	return alloc_pages(0);
> >  }
> >  
> >  static struct alloc_ops page_alloc_ops = {
> > -	.memalign = page_memalign,
> > -	.free = page_free,
> > +	.memalign = memalign_pages,
> > +	.free = free_pages,
> >  	.align_min = PAGE_SIZE,
> >  };
> >  
> > +/*
> > + * Enables the page allocator.
> > + *
> > + * Prerequisites:
> > + * - at least one memory area has been initialized
> > + */
> >  void page_alloc_ops_enable(void)
> >  {
> > +	spin_lock(&lock);
> > +	assert(page_alloc_initialized());
> >  	alloc_ops = &page_alloc_ops;
> > +	spin_unlock(&lock);
> > +}
> > +
> > +/*
> > + * Adds a new memory area to the pool of available memory.
> > + *
> > + * Prerequisites:
> > + * - the lock is held
> > + * - start and top are page frame numbers  
> 
> Calling them pfn_start and pfn_top would make that clearer.
> Maybe add a list of the input parameters and their contents?

yes, I will fix both

actually I should probably fix the comments for every function

> > + * - start is smaller than top
> > + * - top does not fall outside of addressable memory
> > + * - there is at least one more slot free for memory areas
> > + * - if a specific memory area number has been indicated, it needs
> > to be free
> > + * - the memory area to add does not overlap with existing areas
> > + * - the memory area to add has at least 5 pages available
> > + */
> > +static void _page_alloc_init_area(int n, uintptr_t start,
> > uintptr_t top) +{
> > +	size_t table_size, npages, i;
> > +	struct mem_area *a;
> > +	u8 order = 0;
> > +
> > +	if (n == -1) {
> > +		for (n = 0; n < MAX_AREAS; n++) {
> > +			if (!(areas_mask & BIT(n)))
> > +				break;
> > +		}
> > +	}
> > +	assert(n < MAX_AREAS);
> > +	assert(!(areas_mask & BIT(n)));
> > +
> > +	assert(top > start);
> > +	assert(top - start > 4);
> > +	assert(top < BIT_ULL(sizeof(void *) * 8 - PAGE_SHIFT));
> > +
> > +	table_size = (top - start + PAGE_SIZE) / (PAGE_SIZE + 1);
> > +
> > +	a = areas + n;
> > +	a->page_states = (void *)(start << PAGE_SHIFT);
> > +	a->base = start + table_size;
> > +	a->top = top;
> > +	npages = top - a->base;
> > +
> > +	assert((a->base - start) * PAGE_SIZE >= npages);
> > +	for (i = 0; i < MAX_AREAS; i++) {
> > +		if (!(areas_mask & BIT(i)))
> > +			continue;
> > +		assert(!area_overlaps(areas + i, start));
> > +		assert(!area_overlaps(areas + i, top - 1));
> > +		assert(!area_overlaps(a,
> > PFN(areas[i].page_states)));
> > +		assert(!area_overlaps(a, areas[i].top - 1));
> > +	}
> > +	for (i = 0; i < NLISTS; i++)
> > +		a->freelists[i].next = a->freelists[i].prev =
> > a->freelists + i; +
> > +	for (i = a->base; i < a->top; i += 1ull << order) {
> > +		while (i + BIT(order) > a->top) {
> > +			assert(order);
> > +			order--;
> > +		}
> > +		while (IS_ALIGNED_ORDER(i, order + 1) && (i +
> > BIT(order + 1) <= a->top))
> > +			order++;
> > +		assert(order < NLISTS);
> > +		memset(a->page_states + (i - a->base), order,
> > BIT(order));
> > +		list_add(a->freelists + order, (void *)(i <<
> > PAGE_SHIFT));
> > +	}
> > +	areas_mask |= BIT(n);
> > +}
> > +
> > +/*
> > + * Adds a new memory area to the pool of available memory.
> > + *
> > + * Prerequisites:
> > + * see _page_alloc_init_area
> > + */
> > +void page_alloc_init_area(int n, uintptr_t base_pfn, uintptr_t top_pfn)
> > +{
> > +	spin_lock(&lock);
> > +	_page_alloc_init_area(n, base_pfn, top_pfn);
> > +	spin_unlock(&lock);
> >  }
> > diff --git a/lib/arm/setup.c b/lib/arm/setup.c
> > index 78562e4..a3c573f 100644
> > --- a/lib/arm/setup.c
> > +++ b/lib/arm/setup.c
> > @@ -155,7 +155,7 @@ static void mem_init(phys_addr_t freemem_start)
> >  	assert(sizeof(long) == 8 || !(base >> 32));
> >  	if (sizeof(long) != 8 && (top >> 32) != 0)
> >  		top = ((uint64_t)1 << 32);
> > -	free_pages((void *)(unsigned long)base, top - base);
> > +	page_alloc_init_area(-1, base, top);
> >  	page_alloc_ops_enable();
> >  }
> >  
> > diff --git a/lib/s390x/sclp.c b/lib/s390x/sclp.c
> > index 4054d0e..c25d442 100644
> > --- a/lib/s390x/sclp.c
> > +++ b/lib/s390x/sclp.c
> > @@ -37,11 +37,16 @@ static void mem_init(phys_addr_t mem_end)
> >  
> >  	phys_alloc_init(freemem_start, mem_end - freemem_start);
> >  	phys_alloc_get_unused(&base, &top);
> > -	base = (base + PAGE_SIZE - 1) & -PAGE_SIZE;
> > -	top = top & -PAGE_SIZE;
> > +	base = PAGE_ALIGN(base) >> PAGE_SHIFT;
> > +	top = top >> PAGE_SHIFT;
> >  
> >  	/* Make the pages available to the physical allocator */
> > -	free_pages((void *)(unsigned long)base, top - base);
> > +	if (top > (2ull * SZ_1G >> PAGE_SHIFT)) {
> > +		page_alloc_init_area(0, 2ull * SZ_1G >> PAGE_SHIFT, top);
> > +		page_alloc_init_area(1, base, 2ull * SZ_1G >> PAGE_SHIFT);
> > +	} else {
> > +		page_alloc_init_area(1, base, top);
> > +	}
> >  	page_alloc_ops_enable();
> >  }
> >  
> > diff --git a/lib/s390x/smp.c b/lib/s390x/smp.c
> > index 2860e9c..d954094 100644
> > --- a/lib/s390x/smp.c
> > +++ b/lib/s390x/smp.c
> > @@ -190,7 +190,7 @@ int smp_cpu_setup(uint16_t addr, struct psw psw)
> >  
> >  	sigp_retry(cpu->addr, SIGP_INITIAL_CPU_RESET, 0, NULL);
> >  
> > -	lc = alloc_pages(1);
> > +	lc = alloc_pages_area(_AREA(1), 1);
> >  	cpu->lowcore = lc;
> >  	memset(lc, 0, PAGE_SIZE * 2);
> >  	sigp_retry(cpu->addr, SIGP_SET_PREFIX, (unsigned long )lc,
> > NULL); diff --git a/lib/vmalloc.c b/lib/vmalloc.c
> > index aca0876..f72c5b3 100644
> > --- a/lib/vmalloc.c
> > +++ b/lib/vmalloc.c
> > @@ -217,18 +217,19 @@ void setup_vm()
> >  	 * so that it can be used to allocate page tables.
> >  	 */
> >  	if (!page_alloc_initialized()) {
> > -		base = PAGE_ALIGN(base);
> > -		top = top & -PAGE_SIZE;
> > -		free_pages(phys_to_virt(base), top - base);
> > +		base = PAGE_ALIGN(base) >> PAGE_SHIFT;
> > +		top = top >> PAGE_SHIFT;
> > +		page_alloc_init_area(1, base, top);
> > +		page_alloc_ops_enable();
> >  	}
> >  
> >  	find_highmem();
> >  	phys_alloc_get_unused(&base, &top);
> >  	page_root = setup_mmu(top);
> >  	if (base != top) {
> > -		base = PAGE_ALIGN(base);
> > -		top = top & -PAGE_SIZE;
> > -		free_pages(phys_to_virt(base), top - base);
> > +		base = PAGE_ALIGN(base) >> PAGE_SHIFT;
> > +		top = top >> PAGE_SHIFT;
> > +		page_alloc_init_area(0, base, top);
> >  	}
> >  
> >  	spin_lock(&lock);
> >   
> 
> 
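
For reference, roughly how the new interface is meant to be used (just a
sketch; area_example() and the pfn_* arguments are placeholders, the
prototypes are in lib/alloc_page.h, and area 1 here is the low area below
2GB as set up by the s390x code above):

static void area_example(uintptr_t base_pfn, uintptr_t pfn_2g, uintptr_t top_pfn)
{
	void *lc, *p;

	/* register the areas once, normally done by the arch setup code */
	page_alloc_init_area(1, base_pfn, pfn_2g);	/* low area */
	page_alloc_init_area(0, pfn_2g, top_pfn);	/* the rest */
	page_alloc_ops_enable();

	/* an order 1 block (2 contiguous pages), taken only from area 1 */
	lc = alloc_pages_area(_AREA(1), 1);
	assert(lc);

	/* one page with a 1MB alignment, from any area that can satisfy it */
	p = memalign_pages_area(~0, 1ul << 20, PAGE_SIZE);
	assert(p);
}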



end of thread

Thread overview: 13+ messages
2020-08-14 15:10 [kvm-unit-tests RFC v1 0/5] Rewrite the allocators Claudio Imbrenda
2020-08-14 15:10 ` [kvm-unit-tests RFC v1 1/5] lib/vmalloc: vmalloc support for handling allocation metadata Claudio Imbrenda
2020-08-19 14:36   ` Janosch Frank
2020-08-19 15:31     ` Claudio Imbrenda
2020-08-19 15:36       ` Janosch Frank
2020-08-14 15:10 ` [kvm-unit-tests RFC v1 2/5] lib/alloc_page: complete rewrite of the page allocator Claudio Imbrenda
2020-08-18 16:06   ` Cornelia Huck
2020-08-19 15:44   ` Janosch Frank
2020-08-19 16:46     ` Claudio Imbrenda
2020-08-14 15:10 ` [kvm-unit-tests RFC v1 3/5] lib/alloc: simplify free and malloc Claudio Imbrenda
2020-08-14 15:10 ` [kvm-unit-tests RFC v1 4/5] lib/alloc.h: remove align_min from struct alloc_ops Claudio Imbrenda
2020-08-14 15:10 ` [kvm-unit-tests RFC v1 5/5] lib/alloc_page: allow reserving arbitrary memory ranges Claudio Imbrenda
2020-08-17 16:44 ` [kvm-unit-tests RFC v1 0/5] Rewrite the allocators Paolo Bonzini
