* [PATCHSET x86/core/percpu] implement dynamic percpu allocator
@ 2009-02-18 12:04 Tejun Heo
  2009-02-18 12:04 ` [PATCH 01/10] vmalloc: call flush_cache_vunmap() from unmap_kernel_range() Tejun Heo
                   ` (11 more replies)
  0 siblings, 12 replies; 78+ messages in thread
From: Tejun Heo @ 2009-02-18 12:04 UTC (permalink / raw)
  To: rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo

Hello, all.

This patchset implements a dynamic percpu allocator.  As I wrote
before, the percpu areas are organized in chunks which in turn are
composed of num_possible_cpus() units.  As the offsets of the units
against the first unit stay the same regardless of where the chunk is,
arch code can directly access each percpu area by setting up percpu
access such that each cpu translates the same percpu address to
locations one unit size apart.
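
To put it in code, a minimal sketch (the helper below is made up for
illustration and is not part of the patchset):

static inline void *unit_addr(void *cpu0_addr, unsigned int cpu,
			      size_t unit_size)
{
	/*
	 * Units of a chunk are laid out back to back, so cpu N's copy
	 * of a percpu object always sits N * unit_size bytes after
	 * cpu 0's copy, no matter where the chunk was allocated.
	 */
	return (char *)cpu0_addr + cpu * unit_size;
}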

The statically declared percpu area for the kernel, which is set up
early during boot, is also served by the same allocator, but it needs
a special init path as it has to be up and running well before regular
memory management is initialized.
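
For reference, the arch-side hookup would look roughly like the sketch
below.  It is built around the pcpu_setup_static() declaration added
in patch 0009; the surrounding names are made up and the return value
is assumed to be the unit size:

/* callback: make sure PTE pages exist so @addr can be mapped */
static void __init my_populate_pte(unsigned long addr)
{
	/* arch specific */
}

/* called early during boot, before regular memory management is up */
static size_t __init my_setup_percpu(struct page **static_pages,
				     size_t static_size)
{
	/*
	 * Hand the pages backing the static percpu area over; the
	 * returned unit size is what the arch then uses to set up
	 * its per-cpu base offsets.
	 */
	return pcpu_setup_static(my_populate_pte, static_pages,
				 static_size);
}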

Percpu areas are allocated from the vmalloc space and managed directly
by the percpu code.  Chunks start empty and are populated with pages
as they're allocated.  As there are many small allocations and
allocations often need much smaller alignment (no need for cacheline
alignment), the allocator tries to maximize chunk utilization and put
allocations in fuller chunks.
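
Strung together, the allocation path implied above looks something
like this sketch (not the actual code; it reuses the helper names from
patch 0009 and leaves out locking and new-chunk creation):

static void *pcpu_alloc_sketch(size_t size, size_t align)
{
	struct pcpu_chunk *chunk;
	int slot, off;

	/*
	 * Start from the list of chunks that are just big enough for
	 * the request and work towards emptier ones, so fuller chunks
	 * get filled up first.
	 */
	for (slot = pcpu_size_to_slot(size); slot < PCPU_NR_SLOTS; slot++) {
		list_for_each_entry(chunk, &pcpu_slot[slot], list) {
			if (size > chunk->contig_hint)
				continue;	/* can't possibly fit */
			off = pcpu_alloc_area(chunk, size, align);
			if (off >= 0)
				return __addr_to_pcpu_ptr(chunk->vm->addr +
							  off);
		}
	}

	/* nothing fits: a new chunk would be created and the search redone */
	return NULL;
}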

There have been several concerns regarding this approach.

* On 64bit, no need for chunks.  We can just allocate contiguous
  areas.

  For 32bit, with the overcrowded address space, consolidating percpu
  allocations into the vmalloc (or another) area is a big win as no
  additional address space needs to be set aside for percpu variables,
  and with a relatively small number of possible cpus, the chunks can
  be kept at a manageable size (e.g. 128k chunks for 4-way smp wouldn't
  be too bad) while still achieving reasonable scalability.

  So, I think the question becomes whether it makes sense to use a
  different allocation scheme for 32bit and 64bit.  The added overhead
  of chunk handling itself isn't anything which can warrant separate
  implementations.  If there's a way to solve some other issues nicely
  with larger address space, maybe, but I really think it would be
  best to stick with one implementation.

* It adds to TLB pressure.

  Yeah, unfortunately, it does.  Currently it adds a number of kernel
  4k pages into circulation (cold/high pages, so unlikely to affect
  other large mappings).  There are several different varieties of
  this issue.

  The unit size and thus the chunk size is pretty flexible (it
  currently requires a power of 2 but that restriction can be lifted
  easily).  With vm area allocation at larger alignment, using a large
  page per chunk (non-NUMA) or per unit (very large NUMA) shouldn't be
  too difficult for high-end machines, but for mid-range stuff it looks
  like there isn't much to do other than stick with 4k mappings.

  The TLB pressure problem would be there regardless of address layout
  as long as we want to grow the percpu area dynamically.
  Page-granular growth will add 4k pressure.  Large-page granularity is
  likely to waste lots of space.

  One trick we can do is to reserve the initial chunk in a non-vmalloc
  area so that at least the static percpu variables and whatever gets
  allocated in the first chunk are served by regular large page
  mappings.  Given that those are the most frequently visited ones,
  this could be a nice compromise - no noticeable penalty for usual
  cases yet allowing scalability for unusual cases.  If this is
  something which can be agreed on, I'll pursue this.

The percpu allocator is an optional feature which can be selected by
each arch by setting the HAVE_DYNAMIC_PER_CPU_AREA configuration
variable.  Currently only x86_32 and x86_64 use it.
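
The arch-side opt-in is just a Kconfig symbol along these lines
(sketch; the actual x86 hunk lives in patch 0010 and isn't reproduced
here):

config HAVE_DYNAMIC_PER_CPU_AREA
	def_bool y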

Ah.. I also left out cpu hotplugging stuff for now.  This largely
isn't an issue on most machines where num_possible_cpus() doesn't
deviate much from num_online_cpus().  Are there cases where this is
critical?  Currently, no user of percpu allocation, static or dynamic,
cares about this and it has been like this for a long time, so I'm a
little bit skeptical about it.

This patchset contains the following ten patches.

  0001-vmalloc-call-flush_cache_vunmap-from-unmap_kernel.patch
  0002-module-fix-out-of-range-memory-access.patch
  0003-module-reorder-module-pcpu-related-functions.patch
  0004-alloc_percpu-change-percpu_ptr-to-per_cpu_ptr.patch
  0005-alloc_percpu-add-align-argument-to-__alloc_percpu.patch
  0006-percpu-kill-percpu_alloc-and-friends.patch
  0007-vmalloc-implement-vm_area_register_early.patch
  0008-vmalloc-add-un-map_kernel_range_noflush.patch
  0009-percpu-implement-new-dynamic-percpu-allocator.patch
  0010-x86-convert-to-the-new-dynamic-percpu-allocator.patch

0001-0003 contain fixes and trivial prep.  0004-0006 clean up percpu.
0007-0008 add stuff to vmalloc which will be used by the new
allocator.  0009-0010 implement and use the new allocator.

This patchset is on top of the current x86/core/percpu[1] and can be
fetched from the following git branch.

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git tj-percpu

diffstat follows.

 arch/alpha/mm/init.c                       |   20 
 arch/x86/Kconfig                           |    3 
 arch/x86/include/asm/percpu.h              |    8 
 arch/x86/include/asm/pgtable.h             |    1 
 arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c |    2 
 arch/x86/kernel/setup_percpu.c             |   62 +-
 arch/x86/mm/init_32.c                      |   10 
 arch/x86/mm/init_64.c                      |   19 
 block/blktrace.c                           |    2 
 drivers/acpi/processor_perflib.c           |    4 
 include/linux/percpu.h                     |   65 +-
 include/linux/vmalloc.h                    |    4 
 kernel/module.c                            |   78 +-
 kernel/sched.c                             |    6 
 kernel/stop_machine.c                      |    2 
 mm/Makefile                                |    4 
 mm/allocpercpu.c                           |   32 -
 mm/percpu.c                                |  876 +++++++++++++++++++++++++++++
 mm/vmalloc.c                               |   84 ++
 net/ipv4/af_inet.c                         |    4 
 20 files changed, 1183 insertions(+), 103 deletions(-)

Thanks.

--
tejun

[1] 58105ef1857112a186696c9b8957020090226a28


* [PATCH 01/10] vmalloc: call flush_cache_vunmap() from unmap_kernel_range()
  2009-02-18 12:04 [PATCHSET x86/core/percpu] implement dynamic percpu allocator Tejun Heo
@ 2009-02-18 12:04 ` Tejun Heo
  2009-02-19 12:06   ` Nick Piggin
  2009-02-18 12:04 ` [PATCH 02/10] module: fix out-of-range memory access Tejun Heo
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 78+ messages in thread
From: Tejun Heo @ 2009-02-18 12:04 UTC (permalink / raw)
  To: rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo; +Cc: Tejun Heo

Impact: proper vcache flush on unmap_kernel_range()

flush_cache_vunmap() should be called before pages are unmapped.  Add
a call to it in unmap_kernel_range().

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 mm/vmalloc.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 75f49d3..c37924a 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1012,6 +1012,8 @@ void __init vmalloc_init(void)
 void unmap_kernel_range(unsigned long addr, unsigned long size)
 {
 	unsigned long end = addr + size;
+
+	flush_cache_vunmap(addr, end);
 	vunmap_page_range(addr, end);
 	flush_tlb_kernel_range(addr, end);
 }
-- 
1.6.0.2



* [PATCH 02/10] module: fix out-of-range memory access
  2009-02-18 12:04 [PATCHSET x86/core/percpu] implement dynamic percpu allocator Tejun Heo
  2009-02-18 12:04 ` [PATCH 01/10] vmalloc: call flush_cache_vunmap() from unmap_kernel_range() Tejun Heo
@ 2009-02-18 12:04 ` Tejun Heo
  2009-02-19 12:08   ` Nick Piggin
  2009-02-20  7:16   ` Tejun Heo
  2009-02-18 12:04 ` [PATCH 03/10] module: reorder module pcpu related functions Tejun Heo
                   ` (9 subsequent siblings)
  11 siblings, 2 replies; 78+ messages in thread
From: Tejun Heo @ 2009-02-18 12:04 UTC (permalink / raw)
  To: rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo; +Cc: Tejun Heo

Impact: subtle memory access bug fix

percpu_modalloc() may access pcpu_size[-1]: for the first block the
alignment slack 'extra' is always zero, yet the transfer code still
does a read-modify-write on the element just before the array.  The
access doesn't change the value, but it still is an out-of-range
read/write access and dangerous.  Fix it by doing the transfer only
when there actually is extra to transfer.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/module.c |   14 ++++++++------
 1 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/kernel/module.c b/kernel/module.c
index ba22484..d54a63e 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -426,12 +426,14 @@ static void *percpu_modalloc(unsigned long size, unsigned long align,
 			continue;
 
 		/* Transfer extra to previous block. */
-		if (pcpu_size[i-1] < 0)
-			pcpu_size[i-1] -= extra;
-		else
-			pcpu_size[i-1] += extra;
-		pcpu_size[i] -= extra;
-		ptr += extra;
+		if (extra) {
+			if (pcpu_size[i-1] < 0)
+				pcpu_size[i-1] -= extra;
+			else
+				pcpu_size[i-1] += extra;
+			pcpu_size[i] -= extra;
+			ptr += extra;
+		}
 
 		/* Split block if warranted */
 		if (pcpu_size[i] - size > sizeof(unsigned long))
-- 
1.6.0.2



* [PATCH 03/10] module: reorder module pcpu related functions
  2009-02-18 12:04 [PATCHSET x86/core/percpu] implement dynamic percpu allocator Tejun Heo
  2009-02-18 12:04 ` [PATCH 01/10] vmalloc: call flush_cache_vunmap() from unmap_kernel_range() Tejun Heo
  2009-02-18 12:04 ` [PATCH 02/10] module: fix out-of-range memory access Tejun Heo
@ 2009-02-18 12:04 ` Tejun Heo
  2009-02-18 12:04 ` [PATCH 04/10] alloc_percpu: change percpu_ptr to per_cpu_ptr Tejun Heo
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 78+ messages in thread
From: Tejun Heo @ 2009-02-18 12:04 UTC (permalink / raw)
  To: rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo; +Cc: Tejun Heo

Impact: cleanup

Move percpu_modinit() upwards.  This is to ease further changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/module.c |   33 ++++++++++++++++++---------------
 1 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/kernel/module.c b/kernel/module.c
index d54a63e..84773e6 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -482,21 +482,6 @@ static void percpu_modfree(void *freeme)
 	}
 }
 
-static unsigned int find_pcpusec(Elf_Ehdr *hdr,
-				 Elf_Shdr *sechdrs,
-				 const char *secstrings)
-{
-	return find_sec(hdr, sechdrs, secstrings, ".data.percpu");
-}
-
-static void percpu_modcopy(void *pcpudest, const void *from, unsigned long size)
-{
-	int cpu;
-
-	for_each_possible_cpu(cpu)
-		memcpy(pcpudest + per_cpu_offset(cpu), from, size);
-}
-
 static int percpu_modinit(void)
 {
 	pcpu_num_used = 2;
@@ -515,7 +500,24 @@ static int percpu_modinit(void)
 	return 0;
 }
 __initcall(percpu_modinit);
+
+static unsigned int find_pcpusec(Elf_Ehdr *hdr,
+				 Elf_Shdr *sechdrs,
+				 const char *secstrings)
+{
+	return find_sec(hdr, sechdrs, secstrings, ".data.percpu");
+}
+
+static void percpu_modcopy(void *pcpudest, const void *from, unsigned long size)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		memcpy(pcpudest + per_cpu_offset(cpu), from, size);
+}
+
 #else /* ... !CONFIG_SMP */
+
 static inline void *percpu_modalloc(unsigned long size, unsigned long align,
 				    const char *name)
 {
@@ -537,6 +539,7 @@ static inline void percpu_modcopy(void *pcpudst, const void *src,
 	/* pcpusec should be 0, and size of that section should be 0. */
 	BUG_ON(size != 0);
 }
+
 #endif /* CONFIG_SMP */
 
 #define MODINFO_ATTR(field)	\
-- 
1.6.0.2



* [PATCH 04/10] alloc_percpu: change percpu_ptr to per_cpu_ptr
  2009-02-18 12:04 [PATCHSET x86/core/percpu] implement dynamic percpu allocator Tejun Heo
                   ` (2 preceding siblings ...)
  2009-02-18 12:04 ` [PATCH 03/10] module: reorder module pcpu related functions Tejun Heo
@ 2009-02-18 12:04 ` Tejun Heo
  2009-02-18 12:04 ` [PATCH 05/10] alloc_percpu: add align argument to __alloc_percpu Tejun Heo
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 78+ messages in thread
From: Tejun Heo @ 2009-02-18 12:04 UTC (permalink / raw)
  To: rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo
  Cc: mingo, lenb, cpufreq, Tejun Heo

From: Rusty Russell <rusty@rustcorp.com.au>

Impact: cleanup

There are two allocated per-cpu accessor macros with almost identical
spelling.  The original and far more popular is per_cpu_ptr (44
files), so change over the other 4 files.

tj: kill percpu_ptr() and update UP too

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: mingo@redhat.com
Cc: lenb@kernel.org
Cc: cpufreq@vger.kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c |    2 +-
 drivers/acpi/processor_perflib.c           |    4 ++--
 include/linux/percpu.h                     |   23 +++++++++++------------
 kernel/sched.c                             |    6 +++---
 kernel/stop_machine.c                      |    2 +-
 5 files changed, 18 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c b/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c
index 4b1c319..22590cf 100644
--- a/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c
+++ b/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c
@@ -601,7 +601,7 @@ static int acpi_cpufreq_cpu_init(struct cpufreq_policy *policy)
 	if (!data)
 		return -ENOMEM;
 
-	data->acpi_data = percpu_ptr(acpi_perf_data, cpu);
+	data->acpi_data = per_cpu_ptr(acpi_perf_data, cpu);
 	per_cpu(drv_data, cpu) = data;
 
 	if (cpu_has(c, X86_FEATURE_CONSTANT_TSC))
diff --git a/drivers/acpi/processor_perflib.c b/drivers/acpi/processor_perflib.c
index 9cc769b..68fd3d2 100644
--- a/drivers/acpi/processor_perflib.c
+++ b/drivers/acpi/processor_perflib.c
@@ -516,12 +516,12 @@ int acpi_processor_preregister_performance(
 			continue;
 		}
 
-		if (!performance || !percpu_ptr(performance, i)) {
+		if (!performance || !per_cpu_ptr(performance, i)) {
 			retval = -EINVAL;
 			continue;
 		}
 
-		pr->performance = percpu_ptr(performance, i);
+		pr->performance = per_cpu_ptr(performance, i);
 		cpumask_set_cpu(i, pr->performance->shared_cpu_map);
 		if (acpi_processor_get_psd(pr)) {
 			retval = -EINVAL;
diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 3577ffd..c80cfe1 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -81,23 +81,13 @@ struct percpu_data {
 };
 
 #define __percpu_disguise(pdata) (struct percpu_data *)~(unsigned long)(pdata)
-/* 
- * Use this to get to a cpu's version of the per-cpu object dynamically
- * allocated. Non-atomic access to the current CPU's version should
- * probably be combined with get_cpu()/put_cpu().
- */ 
-#define percpu_ptr(ptr, cpu)                              \
-({                                                        \
-        struct percpu_data *__p = __percpu_disguise(ptr); \
-        (__typeof__(ptr))__p->ptrs[(cpu)];	          \
-})
 
 extern void *__percpu_alloc_mask(size_t size, gfp_t gfp, cpumask_t *mask);
 extern void percpu_free(void *__pdata);
 
 #else /* CONFIG_SMP */
 
-#define percpu_ptr(ptr, cpu) ({ (void)(cpu); (ptr); })
+#define per_cpu_ptr(ptr, cpu) ({ (void)(cpu); (ptr); })
 
 static __always_inline void *__percpu_alloc_mask(size_t size, gfp_t gfp, cpumask_t *mask)
 {
@@ -122,6 +112,15 @@ static inline void percpu_free(void *__pdata)
 						  cpu_possible_map)
 #define alloc_percpu(type)	(type *)__alloc_percpu(sizeof(type))
 #define free_percpu(ptr)	percpu_free((ptr))
-#define per_cpu_ptr(ptr, cpu)	percpu_ptr((ptr), (cpu))
+/*
+ * Use this to get to a cpu's version of the per-cpu object dynamically
+ * allocated. Non-atomic access to the current CPU's version should
+ * probably be combined with get_cpu()/put_cpu().
+ */
+#define per_cpu_ptr(ptr, cpu)						\
+({									\
+        struct percpu_data *__p = __percpu_disguise(ptr);		\
+        (__typeof__(ptr))__p->ptrs[(cpu)];				\
+})
 
 #endif /* __LINUX_PERCPU_H */
diff --git a/kernel/sched.c b/kernel/sched.c
index fc17fd9..9d30ac9 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -9472,7 +9472,7 @@ cpuacct_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
 
 static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu)
 {
-	u64 *cpuusage = percpu_ptr(ca->cpuusage, cpu);
+	u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
 	u64 data;
 
 #ifndef CONFIG_64BIT
@@ -9491,7 +9491,7 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu)
 
 static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
 {
-	u64 *cpuusage = percpu_ptr(ca->cpuusage, cpu);
+	u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
 
 #ifndef CONFIG_64BIT
 	/*
@@ -9587,7 +9587,7 @@ static void cpuacct_charge(struct task_struct *tsk, u64 cputime)
 	ca = task_ca(tsk);
 
 	for (; ca; ca = ca->parent) {
-		u64 *cpuusage = percpu_ptr(ca->cpuusage, cpu);
+		u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
 		*cpuusage += cputime;
 	}
 }
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index 0cd415e..74541ca 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -170,7 +170,7 @@ int __stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
 	 * doesn't hit this CPU until we're ready. */
 	get_cpu();
 	for_each_online_cpu(i) {
-		sm_work = percpu_ptr(stop_machine_work, i);
+		sm_work = per_cpu_ptr(stop_machine_work, i);
 		INIT_WORK(sm_work, stop_cpu);
 		queue_work_on(i, stop_machine_wq, sm_work);
 	}
-- 
1.6.0.2



* [PATCH 05/10] alloc_percpu: add align argument to __alloc_percpu.
  2009-02-18 12:04 [PATCHSET x86/core/percpu] implement dynamic percpu allocator Tejun Heo
                   ` (3 preceding siblings ...)
  2009-02-18 12:04 ` [PATCH 04/10] alloc_percpu: change percpu_ptr to per_cpu_ptr Tejun Heo
@ 2009-02-18 12:04 ` Tejun Heo
  2009-02-18 12:04 ` [PATCH 06/10] percpu: kill percpu_alloc() and friends Tejun Heo
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 78+ messages in thread
From: Tejun Heo @ 2009-02-18 12:04 UTC (permalink / raw)
  To: rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo
  Cc: Christoph Lameter, Jens Axboe

From: Rusty Russell <rusty@rustcorp.com.au>

This prepares for a real __alloc_percpu by adding an alignment argument.
Only one place uses __alloc_percpu directly, and that's for a string.

tj: af_inet also uses __alloc_percpu(), update it.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
---
 block/blktrace.c       |    2 +-
 include/linux/percpu.h |    5 +++--
 net/ipv4/af_inet.c     |    4 ++--
 3 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/block/blktrace.c b/block/blktrace.c
index 39cc3bf..4877662 100644
--- a/block/blktrace.c
+++ b/block/blktrace.c
@@ -363,7 +363,7 @@ int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
 	if (!bt->sequence)
 		goto err;
 
-	bt->msg_data = __alloc_percpu(BLK_TN_MAX_MSG);
+	bt->msg_data = __alloc_percpu(BLK_TN_MAX_MSG, __alignof__(char));
 	if (!bt->msg_data)
 		goto err;
 
diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index c80cfe1..1fdaee9 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -108,9 +108,10 @@ static inline void percpu_free(void *__pdata)
 
 /* (legacy) interface for use without CPU hotplug handling */
 
-#define __alloc_percpu(size)	percpu_alloc_mask((size), GFP_KERNEL, \
+#define __alloc_percpu(size, align)	percpu_alloc_mask((size), GFP_KERNEL, \
 						  cpu_possible_map)
-#define alloc_percpu(type)	(type *)__alloc_percpu(sizeof(type))
+#define alloc_percpu(type)	(type *)__alloc_percpu(sizeof(type), \
+						       __alignof__(type))
 #define free_percpu(ptr)	percpu_free((ptr))
 /*
  * Use this to get to a cpu's version of the per-cpu object dynamically
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 743f554..3a3dad8 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1375,10 +1375,10 @@ EXPORT_SYMBOL_GPL(snmp_fold_field);
 int snmp_mib_init(void *ptr[2], size_t mibsize)
 {
 	BUG_ON(ptr == NULL);
-	ptr[0] = __alloc_percpu(mibsize);
+	ptr[0] = __alloc_percpu(mibsize, __alignof__(unsigned long long));
 	if (!ptr[0])
 		goto err0;
-	ptr[1] = __alloc_percpu(mibsize);
+	ptr[1] = __alloc_percpu(mibsize, __alignof__(unsigned long long));
 	if (!ptr[1])
 		goto err1;
 	return 0;
-- 
1.6.0.2



* [PATCH 06/10] percpu: kill percpu_alloc() and friends
  2009-02-18 12:04 [PATCHSET x86/core/percpu] implement dynamic percpu allocator Tejun Heo
                   ` (4 preceding siblings ...)
  2009-02-18 12:04 ` [PATCH 05/10] alloc_percpu: add align argument to __alloc_percpu Tejun Heo
@ 2009-02-18 12:04 ` Tejun Heo
  2009-02-19  0:17   ` Rusty Russell
  2009-03-11 18:36   ` Tony Luck
  2009-02-18 12:04 ` [PATCH 07/10] vmalloc: implement vm_area_register_early() Tejun Heo
                   ` (5 subsequent siblings)
  11 siblings, 2 replies; 78+ messages in thread
From: Tejun Heo @ 2009-02-18 12:04 UTC (permalink / raw)
  To: rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo; +Cc: Tejun Heo

Impact: kill unused functions

percpu_alloc() and its friends never saw much action.  It was supposed
to replace the cpu-mask unaware __alloc_percpu() but that never
happened, and in fact __percpu_alloc_mask() itself never really grew a
proper up/down handling interface either (no exported interface for
populate/depopulate).

percpu allocation is about to go through major reimplementation and
there's no reason to carry this unused interface around.  Replace it
with __alloc_percpu() and free_percpu().

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/percpu.h |   47 ++++++++++++++++++++++-------------------------
 mm/allocpercpu.c       |   32 +++++++++++++++++++-------------
 2 files changed, 41 insertions(+), 38 deletions(-)

diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 1fdaee9..d99e24a 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -82,46 +82,43 @@ struct percpu_data {
 
 #define __percpu_disguise(pdata) (struct percpu_data *)~(unsigned long)(pdata)
 
-extern void *__percpu_alloc_mask(size_t size, gfp_t gfp, cpumask_t *mask);
-extern void percpu_free(void *__pdata);
+/*
+ * Use this to get to a cpu's version of the per-cpu object
+ * dynamically allocated. Non-atomic access to the current CPU's
+ * version should probably be combined with get_cpu()/put_cpu().
+ */
+#define per_cpu_ptr(ptr, cpu)						\
+({									\
+        struct percpu_data *__p = __percpu_disguise(ptr);		\
+        (__typeof__(ptr))__p->ptrs[(cpu)];				\
+})
+
+extern void *__alloc_percpu(size_t size, size_t align);
+extern void free_percpu(void *__pdata);
 
 #else /* CONFIG_SMP */
 
 #define per_cpu_ptr(ptr, cpu) ({ (void)(cpu); (ptr); })
 
-static __always_inline void *__percpu_alloc_mask(size_t size, gfp_t gfp, cpumask_t *mask)
+static inline void *__alloc_percpu(size_t size, size_t align)
 {
+	/*
+	 * Can't easily make larger alignment work with kmalloc.  WARN
+	 * on it.  Larger alignment should only be used for module
+	 * percpu sections on SMP for which this path isn't used.
+	 */
+	WARN_ON_ONCE(align > __alignof__(unsigned long long));
-	return kzalloc(size, gfp);
+	return kzalloc(size, GFP_KERNEL);
 }
 
-static inline void percpu_free(void *__pdata)
+static inline void free_percpu(void *p)
 {
-	kfree(__pdata);
+	kfree(p);
 }
 
 #endif /* CONFIG_SMP */
 
-#define percpu_alloc_mask(size, gfp, mask) \
-	__percpu_alloc_mask((size), (gfp), &(mask))
-
-#define percpu_alloc(size, gfp) percpu_alloc_mask((size), (gfp), cpu_online_map)
-
-/* (legacy) interface for use without CPU hotplug handling */
-
-#define __alloc_percpu(size, align)	percpu_alloc_mask((size), GFP_KERNEL, \
-						  cpu_possible_map)
 #define alloc_percpu(type)	(type *)__alloc_percpu(sizeof(type), \
 						       __alignof__(type))
-#define free_percpu(ptr)	percpu_free((ptr))
-/*
- * Use this to get to a cpu's version of the per-cpu object dynamically
- * allocated. Non-atomic access to the current CPU's version should
- * probably be combined with get_cpu()/put_cpu().
- */
-#define per_cpu_ptr(ptr, cpu)						\
-({									\
-        struct percpu_data *__p = __percpu_disguise(ptr);		\
-        (__typeof__(ptr))__p->ptrs[(cpu)];				\
-})
 
 #endif /* __LINUX_PERCPU_H */
diff --git a/mm/allocpercpu.c b/mm/allocpercpu.c
index 4297bc4..3653c57 100644
--- a/mm/allocpercpu.c
+++ b/mm/allocpercpu.c
@@ -99,45 +99,51 @@ static int __percpu_populate_mask(void *__pdata, size_t size, gfp_t gfp,
 	__percpu_populate_mask((__pdata), (size), (gfp), &(mask))
 
 /**
- * percpu_alloc_mask - initial setup of per-cpu data
+ * alloc_percpu - initial setup of per-cpu data
  * @size: size of per-cpu object
- * @gfp: may sleep or not etc.
- * @mask: populate per-data for cpu's selected through mask bits
+ * @align: alignment
  *
- * Populating per-cpu data for all online cpu's would be a typical use case,
- * which is simplified by the percpu_alloc() wrapper.
- * Per-cpu objects are populated with zeroed buffers.
+ * Allocate dynamic percpu area.  Percpu objects are populated with
+ * zeroed buffers.
  */
-void *__percpu_alloc_mask(size_t size, gfp_t gfp, cpumask_t *mask)
+void *__alloc_percpu(size_t size, size_t align)
 {
 	/*
 	 * We allocate whole cache lines to avoid false sharing
 	 */
 	size_t sz = roundup(nr_cpu_ids * sizeof(void *), cache_line_size());
-	void *pdata = kzalloc(sz, gfp);
+	void *pdata = kzalloc(sz, GFP_KERNEL);
 	void *__pdata = __percpu_disguise(pdata);
 
+	/*
+	 * Can't easily make larger alignment work with kmalloc.  WARN
+	 * on it.  Larger alignment should only be used for module
+	 * percpu sections on SMP for which this path isn't used.
+	 */
+	WARN_ON_ONCE(align > __alignof__(unsigned long long));
+
 	if (unlikely(!pdata))
 		return NULL;
-	if (likely(!__percpu_populate_mask(__pdata, size, gfp, mask)))
+	if (likely(!__percpu_populate_mask(__pdata, size, GFP_KERNEL,
+					   &cpu_possible_map)))
 		return __pdata;
 	kfree(pdata);
 	return NULL;
 }
-EXPORT_SYMBOL_GPL(__percpu_alloc_mask);
+EXPORT_SYMBOL_GPL(__alloc_percpu);
 
 /**
- * percpu_free - final cleanup of per-cpu data
+ * free_percpu - final cleanup of per-cpu data
  * @__pdata: object to clean up
  *
  * We simply clean up any per-cpu object left. No need for the client to
  * track and specify through a bis mask which per-cpu objects are to free.
  */
-void percpu_free(void *__pdata)
+void free_percpu(void *__pdata)
 {
 	if (unlikely(!__pdata))
 		return;
 	__percpu_depopulate_mask(__pdata, &cpu_possible_map);
 	kfree(__percpu_disguise(__pdata));
 }
-EXPORT_SYMBOL_GPL(percpu_free);
+EXPORT_SYMBOL_GPL(free_percpu);
-- 
1.6.0.2



* [PATCH 07/10] vmalloc: implement vm_area_register_early()
  2009-02-18 12:04 [PATCHSET x86/core/percpu] implement dynamic percpu allocator Tejun Heo
                   ` (5 preceding siblings ...)
  2009-02-18 12:04 ` [PATCH 06/10] percpu: kill percpu_alloc() and friends Tejun Heo
@ 2009-02-18 12:04 ` Tejun Heo
  2009-02-19  0:55   ` Tejun Heo
  2009-02-19 12:09   ` Nick Piggin
  2009-02-18 12:04 ` [PATCH 08/10] vmalloc: add un/map_kernel_range_noflush() Tejun Heo
                   ` (4 subsequent siblings)
  11 siblings, 2 replies; 78+ messages in thread
From: Tejun Heo @ 2009-02-18 12:04 UTC (permalink / raw)
  To: rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo; +Cc: Tejun Heo

Impact: allow multiple early vm areas

There are places where a kernel VM area needs to be allocated before
vmalloc is initialized.  This is currently done by allocating a static
vm_struct, initializing several fields and linking it to vmlist, which
vmalloc initialization later picks up.  This is done manually and if
there is more than one such area, there's no defined way to arbitrate
who gets which address.

This patch implements vm_area_register_early(), which takes a
vm_struct with flags and size initialized, assigns an address to it
and puts it on the vmlist.  This way, multiple early vm areas can
determine which addresses they should use.  The only current user -
alpha mm init - is converted to use it.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 arch/alpha/mm/init.c    |   20 +++++++++++++-------
 include/linux/vmalloc.h |    1 +
 mm/vmalloc.c            |   24 ++++++++++++++++++++++++
 3 files changed, 38 insertions(+), 7 deletions(-)

diff --git a/arch/alpha/mm/init.c b/arch/alpha/mm/init.c
index 5d7a16e..df6df02 100644
--- a/arch/alpha/mm/init.c
+++ b/arch/alpha/mm/init.c
@@ -189,9 +189,21 @@ callback_init(void * kernel_end)
 
 	if (alpha_using_srm) {
 		static struct vm_struct console_remap_vm;
-		unsigned long vaddr = VMALLOC_START;
+		unsigned long nr_pages = 0;
+		unsigned long vaddr;
 		unsigned long i, j;
 
+		/* calculate needed size */
+		for (i = 0; i < crb->map_entries; ++i)
+			nr_pages += crb->map[i].count;
+
+		/* register the vm area */
+		console_remap_vm.flags = VM_ALLOC;
+		console_remap_vm.size = nr_pages << PAGE_SHIFT;
+		vm_area_register_early(&console_remap_vm);
+
+		vaddr = (unsigned long)console_remap_vm.addr;
+
 		/* Set up the third level PTEs and update the virtual
 		   addresses of the CRB entries.  */
 		for (i = 0; i < crb->map_entries; ++i) {
@@ -213,12 +225,6 @@ callback_init(void * kernel_end)
 				vaddr += PAGE_SIZE;
 			}
 		}
-
-		/* Let vmalloc know that we've allocated some space.  */
-		console_remap_vm.flags = VM_ALLOC;
-		console_remap_vm.addr = (void *) VMALLOC_START;
-		console_remap_vm.size = vaddr - VMALLOC_START;
-		vmlist = &console_remap_vm;
 	}
 
 	callback_init_done = 1;
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 506e762..bbc0513 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -106,5 +106,6 @@ extern long vwrite(char *buf, char *addr, unsigned long count);
  */
 extern rwlock_t vmlist_lock;
 extern struct vm_struct *vmlist;
+extern __init void vm_area_register_early(struct vm_struct *vm);
 
 #endif /* _LINUX_VMALLOC_H */
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index c37924a..d206261 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -24,6 +24,7 @@
 #include <linux/radix-tree.h>
 #include <linux/rcupdate.h>
 #include <linux/bootmem.h>
+#include <linux/pfn.h>
 
 #include <asm/atomic.h>
 #include <asm/uaccess.h>
@@ -982,6 +983,29 @@ void *vm_map_ram(struct page **pages, unsigned int count, int node, pgprot_t pro
 }
 EXPORT_SYMBOL(vm_map_ram);
 
+/**
+ * vm_area_register_early - register vmap area early during boot
+ * @vm: vm_struct to register; @vm->flags and @vm->size must be
+ *	initialized by the caller
+ *
+ * This function is used to register kernel vm area before
+ * vmalloc_init() is called.  @vm->size and @vm->flags should contain
+ * proper values on entry and other fields should be zero.  On return,
+ * vm->addr contains the allocated address.
+ *
+ * DO NOT USE THIS FUNCTION UNLESS YOU KNOW WHAT YOU'RE DOING.
+ */
+void __init vm_area_register_early(struct vm_struct *vm)
+{
+	static size_t vm_init_off __initdata;
+
+	vm->addr = (void *)VMALLOC_START + vm_init_off;
+	vm_init_off = PFN_ALIGN(vm_init_off + vm->size);
+
+	vm->next = vmlist;
+	vmlist = vm;
+}
+
 void __init vmalloc_init(void)
 {
 	struct vmap_area *va;
-- 
1.6.0.2



* [PATCH 08/10] vmalloc: add un/map_kernel_range_noflush()
  2009-02-18 12:04 [PATCHSET x86/core/percpu] implement dynamic percpu allocator Tejun Heo
                   ` (6 preceding siblings ...)
  2009-02-18 12:04 ` [PATCH 07/10] vmalloc: implement vm_area_register_early() Tejun Heo
@ 2009-02-18 12:04 ` Tejun Heo
  2009-02-19 12:17   ` Nick Piggin
  2009-02-20  7:15   ` Subject: [PATCH 08/10 UPDATED] " Tejun Heo
  2009-02-18 12:04 ` [PATCH 09/10] percpu: implement new dynamic percpu allocator Tejun Heo
                   ` (3 subsequent siblings)
  11 siblings, 2 replies; 78+ messages in thread
From: Tejun Heo @ 2009-02-18 12:04 UTC (permalink / raw)
  To: rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo; +Cc: Tejun Heo

Impact: two more public map/unmap functions

Implement map_kernel_range_noflush() and unmap_kernel_range_noflush().
These functions respectively map and unmap an address range in the
kernel VM area but don't do any vcache or tlb flushing.  They will be
used by the new percpu allocator.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/vmalloc.h |    3 ++
 mm/vmalloc.c            |   58 ++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 58 insertions(+), 3 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index bbc0513..599ba79 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -91,6 +91,9 @@ extern struct vm_struct *remove_vm_area(const void *addr);
 
 extern int map_vm_area(struct vm_struct *area, pgprot_t prot,
 			struct page ***pages);
+extern int map_kernel_range_noflush(unsigned long start, unsigned long size,
+				    pgprot_t prot, struct page **pages);
+extern void unmap_kernel_range_noflush(unsigned long addr, unsigned long size);
 extern void unmap_kernel_range(unsigned long addr, unsigned long size);
 
 /* Allocate/destroy a 'vmalloc' VM area. */
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d206261..e62c212 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -153,8 +153,8 @@ static int vmap_pud_range(pgd_t *pgd, unsigned long addr,
  *
  * Ie. pte at addr+N*PAGE_SIZE shall point to pfn corresponding to pages[N]
  */
-static int vmap_page_range(unsigned long start, unsigned long end,
-				pgprot_t prot, struct page **pages)
+static int vmap_page_range_noflush(unsigned long start, unsigned long end,
+				   pgprot_t prot, struct page **pages)
 {
 	pgd_t *pgd;
 	unsigned long next;
@@ -170,13 +170,22 @@ static int vmap_page_range(unsigned long start, unsigned long end,
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
-	flush_cache_vmap(start, end);
 
 	if (unlikely(err))
 		return err;
 	return nr;
 }
 
+static int vmap_page_range(unsigned long start, unsigned long end,
+			   pgprot_t prot, struct page **pages)
+{
+	int ret;
+
+	ret = vmap_page_range_noflush(start, end, prot, pages);
+	flush_cache_vmap(start, end);
+	return ret;
+}
+
 static inline int is_vmalloc_or_module_addr(const void *x)
 {
 	/*
@@ -1033,6 +1042,49 @@ void __init vmalloc_init(void)
 	vmap_initialized = true;
 }
 
+/**
+ * map_kernel_range_noflush - map kernel VM area with the specified pages
+ * @addr: start of the VM area to map
+ * @size: size of the VM area to map
+ * @prot: page protection flags to use
+ * @pages: pages to map
+ *
+ * Map PFN_UP(@size) pages at @addr.  The VM area @addr and @size
+ * specify should have been allocated using get_vm_area() and its
+ * friends.  This function doesn't call flush_cache_vmap().
+ *
+ * RETURNS:
+ * The number of pages mapped on success, -errno on failure.
+ */
+int map_kernel_range_noflush(unsigned long addr, unsigned long size,
+			     pgprot_t prot, struct page **pages)
+{
+	return vmap_page_range_noflush(addr, addr + size, prot, pages);
+}
+
+/**
+ * unmap_kernel_range_noflush - unmap kernel VM area
+ * @addr: start of the VM area to unmap
+ * @size: size of the VM area to unmap
+ *
+ * Unmap PFN_UP(@size) pages at @addr.  The VM area @addr and @size
+ * specify should have been allocated using get_vm_area() and its
+ * friends.  This function doesn't flush_cache_vunmap() or
+ * flush_tlb_kernel_range().
+ */
+void unmap_kernel_range_noflush(unsigned long addr, unsigned long size)
+{
+	vunmap_page_range(addr, addr + size);
+}
+
+/**
+ * unmap_kernel_range - unmap kernel VM area and flush cache and TLB
+ * @addr: start of the VM area to unmap
+ * @size: size of the VM area to unmap
+ *
+ * Similar to unmap_kernel_range_noflush() but flushes vcache before
+ * the unmapping and tlb after.
+ */
 void unmap_kernel_range(unsigned long addr, unsigned long size)
 {
 	unsigned long end = addr + size;
-- 
1.6.0.2



* [PATCH 09/10] percpu: implement new dynamic percpu allocator
  2009-02-18 12:04 [PATCHSET x86/core/percpu] implement dynamic percpu allocator Tejun Heo
                   ` (7 preceding siblings ...)
  2009-02-18 12:04 ` [PATCH 08/10] vmalloc: add un/map_kernel_range_noflush() Tejun Heo
@ 2009-02-18 12:04 ` Tejun Heo
  2009-02-19 10:10   ` Andrew Morton
                     ` (3 more replies)
  2009-02-18 12:04 ` [PATCH 10/10] x86: convert to the new dynamic percpu allocator Tejun Heo
                   ` (2 subsequent siblings)
  11 siblings, 4 replies; 78+ messages in thread
From: Tejun Heo @ 2009-02-18 12:04 UTC (permalink / raw)
  To: rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo; +Cc: Tejun Heo

Impact: new scalable dynamic percpu allocator which allows dynamic
        percpu areas to be accessed the same way as static ones

Implement a scalable dynamic percpu allocator which can be used for
both static and dynamic percpu areas.  This will allow static and
dynamic areas to share the faster direct access methods.  The feature
is optional and enabled only when CONFIG_HAVE_DYNAMIC_PER_CPU_AREA is
defined by the arch.  Please read the comment on top of mm/percpu.c
for details.
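
As a quick usage sketch (hypothetical example code, not part of this
patch), the interface remains the familiar alloc_percpu() /
per_cpu_ptr() / free_percpu() trio:

static int demo_percpu_counter(void)
{
	unsigned long *counts, total = 0;
	int cpu;

	counts = alloc_percpu(unsigned long);	/* one zeroed copy per cpu */
	if (!counts)
		return -ENOMEM;

	for_each_possible_cpu(cpu)
		total += *per_cpu_ptr(counts, cpu);

	printk(KERN_INFO "total: %lu\n", total);
	free_percpu(counts);
	return 0;
}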

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/percpu.h |   22 +-
 kernel/module.c        |   31 ++
 mm/Makefile            |    4 +
 mm/percpu.c            |  876 ++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 929 insertions(+), 4 deletions(-)
 create mode 100644 mm/percpu.c

diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index d99e24a..1808099 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -76,23 +76,37 @@
 
 #ifdef CONFIG_SMP
 
-struct percpu_data {
-	void *ptrs[1];
-};
+#ifdef CONFIG_HAVE_DYNAMIC_PER_CPU_AREA
 
-#define __percpu_disguise(pdata) (struct percpu_data *)~(unsigned long)(pdata)
+extern void *pcpu_base_addr;
 
+typedef void (*pcpu_populate_pte_fn_t)(unsigned long addr);
+
+extern size_t __init pcpu_setup_static(pcpu_populate_pte_fn_t populate_pte_fn,
+				       struct page **pages, size_t cpu_size);
 /*
  * Use this to get to a cpu's version of the per-cpu object
  * dynamically allocated. Non-atomic access to the current CPU's
  * version should probably be combined with get_cpu()/put_cpu().
  */
+#define per_cpu_ptr(ptr, cpu)	SHIFT_PERCPU_PTR((ptr), per_cpu_offset((cpu)))
+
+#else /* CONFIG_HAVE_DYNAMIC_PER_CPU_AREA */
+
+struct percpu_data {
+	void *ptrs[1];
+};
+
+#define __percpu_disguise(pdata) (struct percpu_data *)~(unsigned long)(pdata)
+
 #define per_cpu_ptr(ptr, cpu)						\
 ({									\
         struct percpu_data *__p = __percpu_disguise(ptr);		\
         (__typeof__(ptr))__p->ptrs[(cpu)];				\
 })
 
+#endif /* CONFIG_HAVE_DYNAMIC_PER_CPU_AREA */
+
 extern void *__alloc_percpu(size_t size, size_t align);
 extern void free_percpu(void *__pdata);
 
diff --git a/kernel/module.c b/kernel/module.c
index 84773e6..6cf0797 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -51,6 +51,7 @@
 #include <linux/tracepoint.h>
 #include <linux/ftrace.h>
 #include <linux/async.h>
+#include <linux/percpu.h>
 
 #if 0
 #define DEBUGP printk
@@ -366,6 +367,34 @@ static struct module *find_module(const char *name)
 }
 
 #ifdef CONFIG_SMP
+
+#ifdef CONFIG_HAVE_DYNAMIC_PER_CPU_AREA
+
+static void *percpu_modalloc(unsigned long size, unsigned long align,
+			     const char *name)
+{
+	void *ptr;
+
+	if (align > PAGE_SIZE) {
+		printk(KERN_WARNING "%s: per-cpu alignment %li > %li\n",
+		       name, align, PAGE_SIZE);
+		align = PAGE_SIZE;
+	}
+
+	ptr = __alloc_percpu(size, align);
+	if (!ptr)
+		printk(KERN_WARNING
+		       "Could not allocate %lu bytes percpu data\n", size);
+	return ptr;
+}
+
+static void percpu_modfree(void *freeme)
+{
+	free_percpu(freeme);
+}
+
+#else /* ... !CONFIG_HAVE_DYNAMIC_PER_CPU_AREA */
+
 /* Number of blocks used and allocated. */
 static unsigned int pcpu_num_used, pcpu_num_allocated;
 /* Size of each block.  -ve means used. */
@@ -501,6 +530,8 @@ static int percpu_modinit(void)
 }
 __initcall(percpu_modinit);
 
+#endif /* CONFIG_HAVE_DYNAMIC_PER_CPU_AREA */
+
 static unsigned int find_pcpusec(Elf_Ehdr *hdr,
 				 Elf_Shdr *sechdrs,
 				 const char *secstrings)
diff --git a/mm/Makefile b/mm/Makefile
index 72255be..818569b 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -30,6 +30,10 @@ obj-$(CONFIG_FAILSLAB) += failslab.o
 obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
 obj-$(CONFIG_FS_XIP) += filemap_xip.o
 obj-$(CONFIG_MIGRATION) += migrate.o
+ifdef CONFIG_HAVE_DYNAMIC_PER_CPU_AREA
+obj-$(CONFIG_SMP) += percpu.o
+else
 obj-$(CONFIG_SMP) += allocpercpu.o
+endif
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
diff --git a/mm/percpu.c b/mm/percpu.c
new file mode 100644
index 0000000..c5708cd
--- /dev/null
+++ b/mm/percpu.c
@@ -0,0 +1,876 @@
+/*
+ * linux/mm/percpu.c - percpu memory allocator
+ *
+ * Copyright (C) 2009		SUSE Linux Products GmbH
+ * Copyright (C) 2009		Tejun Heo <tj@kernel.org>
+ *
+ * This file is released under the GPLv2.
+ *
+ * This is a percpu allocator which can handle both static and dynamic
+ * areas.  Percpu areas are allocated in chunks in the vmalloc area.
+ * Each chunk consists of num_possible_cpus() units and the first chunk
+ * is used for static percpu variables in the kernel image (special
+ * boot time alloc/init handling is necessary as these areas need to be
+ * brought up before allocation services are running).  Units grow as
+ * necessary and all units grow or shrink in unison.  When a chunk is
+ * filled up, another chunk is allocated, i.e. in the vmalloc area:
+ *
+ *  c0                           c1                         c2
+ *  -------------------          -------------------        ------------
+ * | u0 | u1 | u2 | u3 |        | u0 | u1 | u2 | u3 |      | u0 | u1 | u
+ *  -------------------  ......  -------------------  ....  ------------
+ *
+ * Allocation is done in offset-size areas of a single unit's space,
+ * e.g. when UNIT_SIZE is 128k, a 512 byte area at 134k occupies 512
+ * bytes at 6k of c1:u0, c1:u1, c1:u2 and c1:u3.  Percpu access can be
+ * done by configuring percpu base registers UNIT_SIZE apart.
+ *
+ * There are usually many small percpu allocations, many of them as
+ * small as 4 bytes.  The allocator organizes chunks into lists
+ * according to free size and tries to allocate from the fullest one.
+ * Each chunk keeps the maximum contiguous area size hint which is
+ * guaranteed to be equal to or larger than the maximum contiguous
+ * area in the chunk.  This helps the allocator avoid iterating the
+ * chunk maps unnecessarily.
+ *
+ * Allocation state in each chunk is kept using an array of integers.
+ * A positive value represents a free region, a negative an allocated
+ * one.  Allocation inside a chunk is done by scanning this map
+ * sequentially and serving the first matching entry.  This is mostly
+ * copied from the percpu_modalloc() allocator.  Chunks are also
+ * linked into an rb tree to ease address-to-chunk mapping during free.
+ *
+ * To use this allocator, arch code should do the following:
+ *
+ * - define CONFIG_HAVE_DYNAMIC_PER_CPU_AREA
+ *
+ * - define __addr_to_pcpu_ptr() and __pcpu_ptr_to_addr() to translate
+ *   regular address to percpu pointer and back
+ *
+ * - use pcpu_setup_static() during percpu area initialization to
+ *   setup kernel static percpu area
+ */
+
+#include <linux/bitmap.h>
+#include <linux/bootmem.h>
+#include <linux/list.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/percpu.h>
+#include <linux/pfn.h>
+#include <linux/rbtree.h>
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+
+#include <asm/cacheflush.h>
+#include <asm/tlbflush.h>
+
+#define PCPU_MIN_UNIT_PAGES_SHIFT	4	/* also max alloc size */
+#define PCPU_SLOT_BASE_SHIFT		5	/* 1-31 shares the same slot */
+#define PCPU_DFL_MAP_ALLOC		16	/* start a map with 16 ents */
+
+struct pcpu_chunk {
+	struct list_head	list;		/* linked to pcpu_slot lists */
+	struct rb_node		rb_node;	/* key is chunk->vm->addr */
+	int			free_size;
+	int			contig_hint;	/* max contiguous size hint */
+	struct vm_struct	*vm;
+	int			map_used;	/* # of map entries used */
+	int			map_alloc;	/* # of map entries allocated */
+	int			*map;
+	struct page		*page[];	/* #cpus * UNIT_PAGES */
+};
+
+#define SIZEOF_STRUCT_PCPU_CHUNK					\
+	(sizeof(struct pcpu_chunk) +					\
+	 (num_possible_cpus() << PCPU_UNIT_PAGES_SHIFT) * sizeof(struct page *))
+
+static int __pcpu_unit_pages_shift = PCPU_MIN_UNIT_PAGES_SHIFT;
+static int __pcpu_unit_pages;
+static int __pcpu_unit_shift;
+static int __pcpu_unit_size;
+static int __pcpu_chunk_size;
+static int __pcpu_nr_slots;
+
+/* currently everything is power of two, there's no hard dependency on it tho */
+#define PCPU_UNIT_PAGES_SHIFT	((int)__pcpu_unit_pages_shift)
+#define PCPU_UNIT_PAGES		((int)__pcpu_unit_pages)
+#define PCPU_UNIT_SHIFT		((int)__pcpu_unit_shift)
+#define PCPU_UNIT_SIZE		((int)__pcpu_unit_size)
+#define PCPU_CHUNK_SIZE		((int)__pcpu_chunk_size)
+#define PCPU_NR_SLOTS		((int)__pcpu_nr_slots)
+
+/* the address of the first chunk which starts with the kernel static area */
+void *pcpu_base_addr;
+EXPORT_SYMBOL_GPL(pcpu_base_addr);
+
+/* the size of kernel static area */
+static int pcpu_static_size;
+
+static DEFINE_MUTEX(pcpu_mutex);		/* one mutex to rule them all */
+static struct list_head *pcpu_slot;		/* chunk list slots */
+static struct rb_root pcpu_addr_root = RB_ROOT;	/* chunks by address */
+
+static int pcpu_size_to_slot(int size)
+{
+	int highbit = fls(size);
+	return max(highbit - PCPU_SLOT_BASE_SHIFT + 2, 1);
+}
+
+static int pcpu_chunk_slot(const struct pcpu_chunk *chunk)
+{
+	if (chunk->free_size < sizeof(int) || chunk->contig_hint < sizeof(int))
+		return 0;
+
+	return pcpu_size_to_slot(chunk->free_size);
+}
+
+static int pcpu_page_idx(unsigned int cpu, int page_idx)
+{
+	return (cpu << PCPU_UNIT_PAGES_SHIFT) + page_idx;
+}
+
+static struct page **pcpu_chunk_pagep(struct pcpu_chunk *chunk,
+				      unsigned int cpu, int page_idx)
+{
+	return &chunk->page[pcpu_page_idx(cpu, page_idx)];
+}
+
+static unsigned long pcpu_chunk_addr(struct pcpu_chunk *chunk,
+				     unsigned int cpu, int page_idx)
+{
+	return (unsigned long)chunk->vm->addr +
+		(pcpu_page_idx(cpu, page_idx) << PAGE_SHIFT);
+}
+
+static bool pcpu_chunk_page_occupied(struct pcpu_chunk *chunk,
+				     int page_idx)
+{
+	return *pcpu_chunk_pagep(chunk, 0, page_idx) != NULL;
+}
+
+/**
+ * pcpu_realloc - versatile realloc
+ * @p: the current pointer (can be NULL for new allocations)
+ * @size: the current size (can be 0 for new allocations)
+ * @new_size: the wanted new size (can be 0 for free)
+ *
+ * More robust realloc which can be used to allocate, resize or free a
+ * memory area of arbitrary size.  If the needed size goes over
+ * PAGE_SIZE, kernel VM is used.
+ *
+ * RETURNS:
+ * The new pointer on success, NULL on failure.
+ */
+static void *pcpu_realloc(void *p, size_t size, size_t new_size)
+{
+	void *new;
+
+	if (new_size <= PAGE_SIZE)
+		new = kmalloc(new_size, GFP_KERNEL);
+	else
+		new = vmalloc(new_size);
+	if (new_size && !new)
+		return NULL;
+
+	memcpy(new, p, min(size, new_size));
+	if (new_size > size)
+		memset(new + size, 0, new_size - size);
+
+	if (size <= PAGE_SIZE)
+		kfree(p);
+	else
+		vfree(p);
+
+	return new;
+}
+
+/**
+ * pcpu_chunk_relocate - put chunk in the appropriate chunk slot
+ * @chunk: chunk of interest
+ * @oslot: the previous slot it was on
+ *
+ * This function is called after an allocation or free changed @chunk.
+ * New slot according to the changed state is determined and @chunk is
+ * moved to the slot.
+ */
+static void pcpu_chunk_relocate(struct pcpu_chunk *chunk, int oslot)
+{
+	int nslot = pcpu_chunk_slot(chunk);
+
+	if (oslot != nslot) {
+		if (oslot < nslot)
+			list_move(&chunk->list, &pcpu_slot[nslot]);
+		else
+			list_move_tail(&chunk->list, &pcpu_slot[nslot]);
+	}
+}
+
+static struct rb_node **pcpu_chunk_rb_search(void *addr,
+					     struct rb_node **parentp)
+{
+	struct rb_node **p = &pcpu_addr_root.rb_node;
+	struct rb_node *parent = NULL;
+	struct pcpu_chunk *chunk;
+
+	while (*p) {
+		parent = *p;
+		chunk = rb_entry(parent, struct pcpu_chunk, rb_node);
+
+		if (addr < chunk->vm->addr)
+			p = &(*p)->rb_left;
+		else if (addr > chunk->vm->addr)
+			p = &(*p)->rb_right;
+		else
+			break;
+	}
+
+	if (parentp)
+		*parentp = parent;
+	return p;
+}
+
+/**
+ * pcpu_chunk_addr_search - search for chunk containing specified address
+ * @addr: address to search for
+ *
+ * Look for chunk which might contain @addr.  More specifically, it
+ * searches for the chunk with the highest start address which isn't
+ * beyond @addr.
+ *
+ * RETURNS:
+ * The address of the found chunk.
+ */
+static struct pcpu_chunk *pcpu_chunk_addr_search(void *addr)
+{
+	struct rb_node *n, *parent;
+	struct pcpu_chunk *chunk;
+
+	n = *pcpu_chunk_rb_search(addr, &parent);
+	if (!n) {
+		/* no exactly matching chunk, the parent is the closest */
+		n = parent;
+		BUG_ON(!n);
+	}
+	chunk = rb_entry(n, struct pcpu_chunk, rb_node);
+
+	if (addr < chunk->vm->addr) {
+		/* the parent was the next one, look for the previous one */
+		n = rb_prev(n);
+		BUG_ON(!n);
+		chunk = rb_entry(n, struct pcpu_chunk, rb_node);
+	}
+
+	return chunk;
+}
+
+/**
+ * pcpu_chunk_addr_insert - insert chunk into address rb tree
+ * @new: chunk to insert
+ *
+ * Insert @new into address rb tree.
+ */
+static void pcpu_chunk_addr_insert(struct pcpu_chunk *new)
+{
+	struct rb_node **p, *parent;
+
+	p = pcpu_chunk_rb_search(new->vm->addr, &parent);
+	BUG_ON(*p);
+	rb_link_node(&new->rb_node, parent, p);
+	rb_insert_color(&new->rb_node, &pcpu_addr_root);
+}
+
+/**
+ * pcpu_split_block - split a map block
+ * @chunk: chunk of interest
+ * @i: index of map block to split
+ * @head: head size (can be 0)
+ * @tail: tail size (can be 0)
+ *
+ * Split the @i'th map block into two or three blocks.  If @head is
+ * non-zero, @head bytes block is inserted before block @i moving it
+ * to @i+1 and reducing its size by @head bytes.
+ *
+ * If @tail is non-zero, the target block, which can be @i or @i+1
+ * depending on @head, is reduced by @tail bytes and @tail byte block
+ * is inserted after the target block.
+ *
+ * RETURNS:
+ * 0 on success, -errno on failure.
+ */
+static int pcpu_split_block(struct pcpu_chunk *chunk, int i, int head, int tail)
+{
+	int nr_extra = !!head + !!tail;
+	int target = chunk->map_used + nr_extra;
+
+	/* reallocation required? */
+	if (chunk->map_alloc < target) {
+		int new_alloc = chunk->map_alloc;
+		int *new;
+
+		while (new_alloc < target)
+			new_alloc *= 2;
+
+		new = pcpu_realloc(chunk->map,
+				   chunk->map_alloc * sizeof(new[0]),
+				   new_alloc * sizeof(new[0]));
+		if (!new)
+			return -ENOMEM;
+
+		chunk->map_alloc = new_alloc;
+		chunk->map = new;
+	}
+
+	/* insert a new subblock */
+	memmove(&chunk->map[i + nr_extra], &chunk->map[i],
+		sizeof(chunk->map[0]) * (chunk->map_used - i));
+	chunk->map_used += nr_extra;
+
+	if (head) {
+		chunk->map[i + 1] = chunk->map[i] - head;
+		chunk->map[i++] = head;
+	}
+	if (tail) {
+		chunk->map[i++] -= tail;
+		chunk->map[i] = tail;
+	}
+	return 0;
+}
+
+/**
+ * pcpu_alloc_area - allocate area from a pcpu_chunk
+ * @chunk: chunk of interest
+ * @size: wanted size
+ * @align: wanted align
+ *
+ * Try to allocate @size bytes area aligned at @align from @chunk.
+ * Note that this function only allocates the offset.  It doesn't
+ * populate or map the area.
+ *
+ * RETURNS:
+ * Allocated offset in @chunk on success, -errno on failure.
+ */
+static int pcpu_alloc_area(struct pcpu_chunk *chunk, int size, int align)
+{
+	int oslot = pcpu_chunk_slot(chunk);
+	int max_contig = 0;
+	int i, off;
+
+	/*
+	 * The static chunk initially doesn't have map attached
+	 * because kmalloc wasn't available during init.  Give it one.
+	 */
+	if (unlikely(!chunk->map)) {
+		chunk->map = pcpu_realloc(NULL, 0,
+				PCPU_DFL_MAP_ALLOC * sizeof(chunk->map[0]));
+		if (!chunk->map)
+			return -ENOMEM;
+
+		chunk->map_alloc = PCPU_DFL_MAP_ALLOC;
+		chunk->map[chunk->map_used++] = -pcpu_static_size;
+		if (chunk->free_size)
+			chunk->map[chunk->map_used++] = chunk->free_size;
+	}
+
+	for (i = 0, off = 0; i < chunk->map_used; off += abs(chunk->map[i++])) {
+		bool is_last = i + 1 == chunk->map_used;
+		int head, tail;
+
+		/* extra for alignment requirement */
+		head = ALIGN(off, align) - off;
+		BUG_ON(i == 0 && head != 0);
+
+		if (chunk->map[i] < 0)
+			continue;
+		if (chunk->map[i] < head + size) {
+			max_contig = max(chunk->map[i], max_contig);
+			continue;
+		}
+
+		/*
+		 * If head is small or the previous block is free,
+		 * merge'em.  Note that 'small' is defined as smaller
+		 * than sizeof(int), which is very small but isn't too
+		 * uncommon for percpu allocations.
+		 */
+		if (head && (head < sizeof(int) || chunk->map[i - 1] > 0)) {
+			if (chunk->map[i - 1] > 0)
+				chunk->map[i - 1] += head;
+			else {
+				chunk->map[i - 1] -= head;
+				chunk->free_size -= head;
+			}
+			chunk->map[i] -= head;
+			off += head;
+			head = 0;
+		}
+
+		/* if tail is small, just keep it around */
+		tail = chunk->map[i] - head - size;
+		if (tail < sizeof(int))
+			tail = 0;
+
+		/* split if warranted */
+		if (head || tail) {
+			if (pcpu_split_block(chunk, i, head, tail))
+				return -ENOMEM;
+			if (head) {
+				i++;
+				off += head;
+				max_contig = max(chunk->map[i - 1], max_contig);
+			}
+			if (tail)
+				max_contig = max(chunk->map[i + 1], max_contig);
+		}
+
+		/* update hint and mark allocated */
+		if (is_last)
+			chunk->contig_hint = max_contig; /* fully scanned */
+		else
+			chunk->contig_hint = max(chunk->contig_hint,
+						 max_contig);
+
+		chunk->free_size -= chunk->map[i];
+		chunk->map[i] = -chunk->map[i];
+
+		pcpu_chunk_relocate(chunk, oslot);
+		return off;
+	}
+
+	chunk->contig_hint = max_contig;	/* fully scanned */
+	pcpu_chunk_relocate(chunk, oslot);
+	return -ENOSPC;
+}
+
+/**
+ * pcpu_free_area - free area to a pcpu_chunk
+ * @chunk: chunk of interest
+ * @freeme: offset of area to free
+ *
+ * Free area starting from @freeme to @chunk.  Note that this function
+ * only modifies the allocation map.  It doesn't depopulate or unmap
+ * the area.
+ */
+static void pcpu_free_area(struct pcpu_chunk *chunk, int freeme)
+{
+	int oslot = pcpu_chunk_slot(chunk);
+	int i, off;
+
+	for (i = 0, off = 0; i < chunk->map_used; off += abs(chunk->map[i++]))
+		if (off == freeme)
+			break;
+	BUG_ON(off != freeme);
+	BUG_ON(chunk->map[i] > 0);
+
+	chunk->map[i] = -chunk->map[i];
+	chunk->free_size += chunk->map[i];
+
+	/* merge with previous? */
+	if (i > 0 && chunk->map[i - 1] >= 0) {
+		chunk->map[i - 1] += chunk->map[i];
+		chunk->map_used--;
+		memmove(&chunk->map[i], &chunk->map[i + 1],
+			(chunk->map_used - i) * sizeof(chunk->map[0]));
+		i--;
+	}
+	/* merge with next? */
+	if (i + 1 < chunk->map_used && chunk->map[i + 1] >= 0) {
+		chunk->map[i] += chunk->map[i + 1];
+		chunk->map_used--;
+		memmove(&chunk->map[i + 1], &chunk->map[i + 2],
+			(chunk->map_used - (i + 1)) * sizeof(chunk->map[0]));
+	}
+
+	chunk->contig_hint = max(chunk->map[i], chunk->contig_hint);
+	pcpu_chunk_relocate(chunk, oslot);
+}
+
+/**
+ * pcpu_unmap - unmap pages out of a pcpu_chunk
+ * @chunk: chunk of interest
+ * @page_start: page index of the first page to unmap
+ * @page_end: page index of the last page to unmap + 1
+ * @flush: whether to flush cache and tlb or not
+ *
+ * For each cpu, unmap pages [@page_start,@page_end) out of @chunk.
+ * If @flush is true, vcache is flushed before unmapping and tlb
+ * after.
+ */
+static void pcpu_unmap(struct pcpu_chunk *chunk, int page_start, int page_end,
+		       bool flush)
+{
+	unsigned int last = num_possible_cpus() - 1;
+	unsigned int cpu;
+
+	/*
+	 * Each flushing trial can be very expensive, issue flush on
+	 * the whole region at once rather than doing it for each cpu.
+	 * This could be an overkill but is more scalable.
+	 */
+	if (flush)
+		flush_cache_vunmap(pcpu_chunk_addr(chunk, 0, page_start),
+				   pcpu_chunk_addr(chunk, last, page_end));
+
+	for_each_possible_cpu(cpu)
+		unmap_kernel_range_noflush(
+				pcpu_chunk_addr(chunk, cpu, page_start),
+				(page_end - page_start) << PAGE_SHIFT);
+
+	/* ditto as flush_cache_vunmap() */
+	if (flush)
+		flush_tlb_kernel_range(pcpu_chunk_addr(chunk, 0, page_start),
+				       pcpu_chunk_addr(chunk, last, page_end));
+}
+
+/**
+ * pcpu_depopulate_chunk - depopulate and unmap an area of a pcpu_chunk
+ * @chunk: chunk to depopulate
+ * @off: offset to the area to depopulate
+ * @size: size of the area to depopulate
+ * @flush: whether to flush cache and tlb or not
+ *
+ * For each cpu, depopulate and unmap the area [@off, @off + @size)
+ * from @chunk.  If @flush is true, vcache is flushed before unmapping
+ * and tlb after.
+ */
+static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk, size_t off,
+				  size_t size, bool flush)
+{
+	int page_start = PFN_DOWN(off);
+	int page_end = PFN_UP(off + size);
+	int unmap_start = -1;
+	int uninitialized_var(unmap_end);
+	unsigned int cpu;
+	int i;
+
+	for (i = page_start; i < page_end; i++) {
+		for_each_possible_cpu(cpu) {
+			struct page **pagep = pcpu_chunk_pagep(chunk, cpu, i);
+
+			if (!*pagep)
+				continue;
+
+			__free_page(*pagep);
+			*pagep = NULL;
+
+			unmap_start = unmap_start < 0 ? i : unmap_start;
+			unmap_end = i + 1;
+		}
+	}
+
+	if (unmap_start >= 0)
+		pcpu_unmap(chunk, unmap_start, unmap_end, flush);
+}
+
+/**
+ * pcpu_map - map pages into a pcpu_chunk
+ * @chunk: chunk of interest
+ * @page_start: page index of the first page to map
+ * @page_end: page index of the last page to map + 1
+ *
+ * For each cpu, map pages [@page_start,@page_end) into @chunk.
+ * vcache is flushed afterwards.
+ */
+static int pcpu_map(struct pcpu_chunk *chunk, int page_start, int page_end)
+{
+	unsigned int last = num_possible_cpus() - 1;
+	unsigned int cpu;
+	int err;
+
+	for_each_possible_cpu(cpu) {
+		err = map_kernel_range_noflush(
+				pcpu_chunk_addr(chunk, cpu, page_start),
+				(page_end - page_start) << PAGE_SHIFT,
+				PAGE_KERNEL,
+				pcpu_chunk_pagep(chunk, cpu, page_start));
+		if (err < 0)
+			return err;
+	}
+
+	/* flush at once, please read comments in pcpu_unmap() */
+	flush_cache_vmap(pcpu_chunk_addr(chunk, 0, page_start),
+			 pcpu_chunk_addr(chunk, last, page_end));
+	return 0;
+}
+
+/**
+ * pcpu_populate_chunk - populate and map an area of a pcpu_chunk
+ * @chunk: chunk of interest
+ * @off: offset to the area to populate
+ * @size: size of the area to populate
+ *
+ * For each cpu, populate and map the area [@off, @off + @size) into
+ * @chunk.  The area is cleared on return.
+ */
+static int pcpu_populate_chunk(struct pcpu_chunk *chunk, int off, int size)
+{
+	const gfp_t alloc_mask = GFP_KERNEL | __GFP_HIGHMEM | __GFP_COLD;
+	int page_start = PFN_DOWN(off);
+	int page_end = PFN_UP(off + size);
+	int map_start = -1;
+	int map_end;
+	unsigned int cpu;
+	int i;
+
+	for (i = page_start; i < page_end; i++) {
+		if (pcpu_chunk_page_occupied(chunk, i)) {
+			if (map_start >= 0) {
+				if (pcpu_map(chunk, map_start, map_end))
+					goto err;
+				map_start = -1;
+			}
+			continue;
+		}
+
+		map_start = map_start < 0 ? i : map_start;
+		map_end = i + 1;
+
+		for_each_possible_cpu(cpu) {
+			struct page **pagep = pcpu_chunk_pagep(chunk, cpu, i);
+
+			*pagep = alloc_pages_node(cpu_to_node(cpu),
+						  alloc_mask, 0);
+			if (!*pagep)
+				goto err;
+		}
+	}
+
+	if (map_start >= 0 && pcpu_map(chunk, map_start, map_end))
+		goto err;
+
+	for_each_possible_cpu(cpu)
+		memset(chunk->vm->addr + (cpu << PCPU_UNIT_SHIFT) + off, 0,
+		       size);
+
+	return 0;
+err:
+	/* likely under heavy memory pressure, give memory back */
+	pcpu_depopulate_chunk(chunk, off, size, true);
+	return -ENOMEM;
+}
+
+static void free_pcpu_chunk(struct pcpu_chunk *chunk)
+{
+	if (!chunk)
+		return;
+	if (chunk->vm)
+		free_vm_area(chunk->vm);
+	pcpu_realloc(chunk->map, chunk->map_alloc * sizeof(chunk->map[0]), 0);
+	kfree(chunk);
+}
+
+static struct pcpu_chunk *alloc_pcpu_chunk(void)
+{
+	struct pcpu_chunk *chunk;
+
+	chunk = kzalloc(SIZEOF_STRUCT_PCPU_CHUNK, GFP_KERNEL);
+	if (!chunk)
+		return NULL;
+
+	chunk->map = pcpu_realloc(NULL, 0,
+				  PCPU_DFL_MAP_ALLOC * sizeof(chunk->map[0]));
+	if (!chunk->map) {
+		kfree(chunk);
+		return NULL;
+	}
+	chunk->map_alloc = PCPU_DFL_MAP_ALLOC;
+	chunk->map[chunk->map_used++] = PCPU_UNIT_SIZE;
+
+	chunk->vm = get_vm_area(PCPU_CHUNK_SIZE, GFP_KERNEL);
+	if (!chunk->vm) {
+		free_pcpu_chunk(chunk);
+		return NULL;
+	}
+
+	INIT_LIST_HEAD(&chunk->list);
+	chunk->free_size = PCPU_UNIT_SIZE;
+	chunk->contig_hint = PCPU_UNIT_SIZE;
+
+	return chunk;
+}
+
+/**
+ * __alloc_percpu - allocate percpu area
+ * @size: size of area to allocate
+ * @align: alignment of area (max PAGE_SIZE)
+ *
+ * Allocate percpu area of @size bytes aligned at @align.  Might
+ * sleep.  Might trigger writeouts.
+ *
+ * RETURNS:
+ * Percpu pointer to the allocated area on success, NULL on failure.
+ */
+void *__alloc_percpu(size_t size, size_t align)
+{
+	void *ptr = NULL;
+	struct pcpu_chunk *chunk;
+	int slot, off, err;
+
+	if (unlikely(!size))
+		return NULL;
+
+	if (unlikely(size > PAGE_SIZE << PCPU_MIN_UNIT_PAGES_SHIFT ||
+		     align > PAGE_SIZE)) {
+		printk(KERN_WARNING "illegal size (%zu) or align (%zu) for "
+		       "percpu allocation\n", size, align);
+		return NULL;
+	}
+
+	mutex_lock(&pcpu_mutex);
+
+	/* allocate area */
+	for (slot = pcpu_size_to_slot(size); slot < PCPU_NR_SLOTS; slot++) {
+		list_for_each_entry(chunk, &pcpu_slot[slot], list) {
+			if (size > chunk->contig_hint)
+				continue;
+			err = pcpu_alloc_area(chunk, size, align);
+			if (err >= 0) {
+				off = err;
+				goto area_found;
+			}
+			if (err != -ENOSPC)
+				goto out_unlock;
+		}
+	}
+
+	/* hmmm... no space left, create a new chunk */
+	err = -ENOMEM;
+	chunk = alloc_pcpu_chunk();
+	if (!chunk)
+		goto out_unlock;
+	pcpu_chunk_relocate(chunk, -1);
+	pcpu_chunk_addr_insert(chunk);
+
+	err = pcpu_alloc_area(chunk, size, align);
+	if (err < 0)
+		goto out_unlock;
+	off = err;
+
+area_found:
+	/* populate, map and clear the area */
+	if (pcpu_populate_chunk(chunk, off, size)) {
+		pcpu_free_area(chunk, off);
+		goto out_unlock;
+	}
+
+	ptr = __addr_to_pcpu_ptr(chunk->vm->addr + off);
+out_unlock:
+	mutex_unlock(&pcpu_mutex);
+	return ptr;
+}
+EXPORT_SYMBOL_GPL(__alloc_percpu);
+
+static void pcpu_kill_chunk(struct pcpu_chunk *chunk)
+{
+	pcpu_depopulate_chunk(chunk, 0, PCPU_UNIT_SIZE, false);
+	list_del(&chunk->list);
+	rb_erase(&chunk->rb_node, &pcpu_addr_root);
+	free_pcpu_chunk(chunk);
+}
+
+/**
+ * free_percpu - free percpu area
+ * @ptr: pointer to area to free
+ *
+ * Free percpu area @ptr.  Might sleep.
+ */
+void free_percpu(void *ptr)
+{
+	void *addr = __pcpu_ptr_to_addr(ptr);
+	struct pcpu_chunk *chunk;
+	int off;
+
+	if (!ptr)
+		return;
+
+	mutex_lock(&pcpu_mutex);
+
+	chunk = pcpu_chunk_addr_search(addr);
+	off = addr - chunk->vm->addr;
+
+	pcpu_free_area(chunk, off);
+
+	/* the chunk became fully free, kill one if there are other free ones */
+	if (chunk->free_size == PCPU_UNIT_SIZE) {
+		struct pcpu_chunk *pos;
+
+		list_for_each_entry(pos,
+				    &pcpu_slot[pcpu_chunk_slot(chunk)], list)
+			if (pos != chunk) {
+				pcpu_kill_chunk(pos);
+				break;
+			}
+	}
+
+	mutex_unlock(&pcpu_mutex);
+}
+EXPORT_SYMBOL_GPL(free_percpu);
+
+/**
+ * pcpu_setup_static - initialize kernel static percpu area
+ * @populate_pte_fn: callback to allocate pagetable
+ * @pages: num_possible_cpus() * PFN_UP(cpu_size) pages
+ * @cpu_size: size of the kernel static percpu area
+ *
+ * Initialize kernel static percpu area.  The caller should allocate
+ * all the necessary pages and pass them in @pages.
+ * @populate_pte_fn() is called on each page to be used for percpu
+ * mapping and is responsible for making sure all the necessary page
+ * tables for the page is allocated.
+ *
+ * RETURNS:
+ * The determined PCPU_UNIT_SIZE which can be used to initialize
+ * percpu access.
+ */
+size_t __init pcpu_setup_static(pcpu_populate_pte_fn_t populate_pte_fn,
+				struct page **pages, size_t cpu_size)
+{
+	static struct vm_struct static_vm;
+	struct pcpu_chunk *static_chunk;
+	int nr_cpu_pages = DIV_ROUND_UP(cpu_size, PAGE_SIZE);
+	unsigned int cpu;
+	int err, i;
+
+	while (1 << __pcpu_unit_pages_shift < nr_cpu_pages)
+		__pcpu_unit_pages_shift++;
+
+	pcpu_static_size = cpu_size;
+	__pcpu_unit_pages = 1 << __pcpu_unit_pages_shift;
+	__pcpu_unit_shift = PAGE_SHIFT + __pcpu_unit_pages_shift;
+	__pcpu_unit_size = 1 << __pcpu_unit_shift;
+	__pcpu_chunk_size = num_possible_cpus() * __pcpu_unit_size;
+	__pcpu_nr_slots = pcpu_size_to_slot(__pcpu_unit_size) + 1;
+
+	/* allocate chunk slots */
+	pcpu_slot = alloc_bootmem(PCPU_NR_SLOTS * sizeof(pcpu_slot[0]));
+	for (i = 0; i < PCPU_NR_SLOTS; i++)
+		INIT_LIST_HEAD(&pcpu_slot[i]);
+
+	/* init and register vm area */
+	static_vm.flags = VM_ALLOC;
+	static_vm.size = PCPU_CHUNK_SIZE;
+	vm_area_register_early(&static_vm);
+
+	/* init static_chunk */
+	static_chunk = alloc_bootmem(SIZEOF_STRUCT_PCPU_CHUNK);
+	INIT_LIST_HEAD(&static_chunk->list);
+	static_chunk->vm = &static_vm;
+	static_chunk->free_size = PCPU_UNIT_SIZE - pcpu_static_size;
+	static_chunk->contig_hint = static_chunk->free_size;
+
+	/* assign pages and map them */
+	for_each_possible_cpu(cpu) {
+		for (i = 0; i < nr_cpu_pages; i++) {
+			*pcpu_chunk_pagep(static_chunk, cpu, i) = *pages++;
+			populate_pte_fn(pcpu_chunk_addr(static_chunk, cpu, i));
+		}
+	}
+
+	err = pcpu_map(static_chunk, 0, nr_cpu_pages);
+	if (err)
+		panic("failed to setup static percpu area, err=%d\n", err);
+
+	/* link static_chunk in */
+	pcpu_chunk_relocate(static_chunk, -1);
+	pcpu_chunk_addr_insert(static_chunk);
+
+	/* we're done */
+	pcpu_base_addr = (void *)pcpu_chunk_addr(static_chunk, 0, 0);
+	return PCPU_UNIT_SIZE;
+}
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 10/10] x86: convert to the new dynamic percpu allocator
  2009-02-18 12:04 [PATCHSET x86/core/percpu] implement dynamic percpu allocator Tejun Heo
                   ` (8 preceding siblings ...)
  2009-02-18 12:04 ` [PATCH 09/10] percpu: implement new dynamic percpu allocator Tejun Heo
@ 2009-02-18 12:04 ` Tejun Heo
  2009-02-18 13:43 ` [PATCHSET x86/core/percpu] implement " Ingo Molnar
  2009-02-19  0:30 ` Tejun Heo
  11 siblings, 0 replies; 78+ messages in thread
From: Tejun Heo @ 2009-02-18 12:04 UTC (permalink / raw)
  To: rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo; +Cc: Tejun Heo

Impact: use new dynamic allocator, unified access to static/dynamic
        percpu memory

Convert to the new dynamic percpu allocator.

* implement populate_extra_pte() for both 32 and 64
* update setup_per_cpu_areas() to use pcpu_setup_static()
* define __addr_to_pcpu_ptr() and __pcpu_ptr_to_addr()
* define config HAVE_DYNAMIC_PER_CPU_AREA

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 arch/x86/Kconfig               |    3 ++
 arch/x86/include/asm/percpu.h  |    8 +++++
 arch/x86/include/asm/pgtable.h |    1 +
 arch/x86/kernel/setup_percpu.c |   62 +++++++++++++++++++++++++--------------
 arch/x86/mm/init_32.c          |   10 ++++++
 arch/x86/mm/init_64.c          |   19 ++++++++++++
 6 files changed, 81 insertions(+), 22 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f760a22..d3f6ead 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -135,6 +135,9 @@ config ARCH_HAS_CACHE_LINE_SIZE
 config HAVE_SETUP_PER_CPU_AREA
 	def_bool y
 
+config HAVE_DYNAMIC_PER_CPU_AREA
+	def_bool y
+
 config HAVE_CPUMASK_OF_CPU_MAP
 	def_bool X86_64_SMP
 
diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index aee103b..8f1d2fb 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -43,6 +43,14 @@
 #else /* ...!ASSEMBLY */
 
 #include <linux/stringify.h>
+#include <asm/sections.h>
+
+#define __addr_to_pcpu_ptr(addr)					\
+	(void *)((unsigned long)(addr) - (unsigned long)pcpu_base_addr	\
+		 + (unsigned long)__per_cpu_start)
+#define __pcpu_ptr_to_addr(ptr)						\
+	(void *)((unsigned long)(ptr) + (unsigned long)pcpu_base_addr	\
+		 - (unsigned long)__per_cpu_start)
 
 #ifdef CONFIG_SMP
 #define __percpu_arg(x)		"%%"__stringify(__percpu_seg)":%P" #x
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 6f7c102..dd91c25 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -402,6 +402,7 @@ int phys_mem_access_prot_allowed(struct file *file, unsigned long pfn,
 
 /* Install a pte for a particular vaddr in kernel space. */
 void set_pte_vaddr(unsigned long vaddr, pte_t pte);
+void populate_extra_pte(unsigned long vaddr);
 
 #ifdef CONFIG_X86_32
 extern void native_pagetable_setup_start(pgd_t *base);
diff --git a/arch/x86/kernel/setup_percpu.c b/arch/x86/kernel/setup_percpu.c
index d992e6c..2dce435 100644
--- a/arch/x86/kernel/setup_percpu.c
+++ b/arch/x86/kernel/setup_percpu.c
@@ -61,38 +61,56 @@ static inline void setup_percpu_segment(int cpu)
  */
 void __init setup_per_cpu_areas(void)
 {
-	ssize_t size;
-	char *ptr;
-	int cpu;
-
-	/* Copy section for each CPU (we discard the original) */
-	size = roundup(PERCPU_ENOUGH_ROOM, PAGE_SIZE);
+	ssize_t size = __per_cpu_end - __per_cpu_start;
+	unsigned int nr_cpu_pages = DIV_ROUND_UP(size, PAGE_SIZE);
+	static struct page **pages;
+	size_t pages_size;
+	unsigned int cpu, i, j;
+	unsigned long delta;
+	size_t pcpu_unit_size;
 
 	pr_info("NR_CPUS:%d nr_cpumask_bits:%d nr_cpu_ids:%d nr_node_ids:%d\n",
 		NR_CPUS, nr_cpumask_bits, nr_cpu_ids, nr_node_ids);
+	pr_info("PERCPU: Allocating %zd bytes for static per cpu data\n", size);
 
-	pr_info("PERCPU: Allocating %zd bytes of per cpu data\n", size);
+	pages_size = nr_cpu_pages * num_possible_cpus() * sizeof(pages[0]);
+	pages = alloc_bootmem(pages_size);
 
+	j = 0;
 	for_each_possible_cpu(cpu) {
+		void *ptr;
+
+		for (i = 0; i < nr_cpu_pages; i++) {
 #ifndef CONFIG_NEED_MULTIPLE_NODES
-		ptr = alloc_bootmem_pages(size);
+			ptr = alloc_bootmem_pages(PAGE_SIZE);
 #else
-		int node = early_cpu_to_node(cpu);
-		if (!node_online(node) || !NODE_DATA(node)) {
-			ptr = alloc_bootmem_pages(size);
-			pr_info("cpu %d has no node %d or node-local memory\n",
-				cpu, node);
-			pr_debug("per cpu data for cpu%d at %016lx\n",
-				 cpu, __pa(ptr));
-		} else {
-			ptr = alloc_bootmem_pages_node(NODE_DATA(node), size);
-			pr_debug("per cpu data for cpu%d on node%d at %016lx\n",
-				cpu, node, __pa(ptr));
-		}
+			int node = early_cpu_to_node(cpu);
+
+			if (!node_online(node) || !NODE_DATA(node)) {
+				ptr = alloc_bootmem_pages(PAGE_SIZE);
+				pr_info("cpu %d has no node %d or node-local "
+					"memory\n", cpu, node);
+				pr_debug("per cpu data for cpu%d at %016lx\n",
+					 cpu, __pa(ptr));
+			} else {
+				ptr = alloc_bootmem_pages_node(NODE_DATA(node),
+							       PAGE_SIZE);
+				pr_debug("per cpu data for cpu%d on node%d "
+					 "at %016lx\n", cpu, node, __pa(ptr));
+			}
 #endif
+			memcpy(ptr, __per_cpu_load + i * PAGE_SIZE, PAGE_SIZE);
+			pages[j++] = virt_to_page(ptr);
+		}
+	}
+
+	pcpu_unit_size = pcpu_setup_static(populate_extra_pte, pages, size);
 
-		memcpy(ptr, __per_cpu_load, __per_cpu_end - __per_cpu_start);
-		per_cpu_offset(cpu) = ptr - __per_cpu_start;
+	free_bootmem(__pa(pages), pages_size);
+
+	delta = (unsigned long)pcpu_base_addr - (unsigned long)__per_cpu_start;
+	for_each_possible_cpu(cpu) {
+		per_cpu_offset(cpu) = delta + cpu * pcpu_unit_size;
 		per_cpu(this_cpu_off, cpu) = per_cpu_offset(cpu);
 		per_cpu(cpu_number, cpu) = cpu;
 		setup_percpu_segment(cpu);
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 00263bf..8b1a0ef 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -137,6 +137,16 @@ static pte_t * __init one_page_table_init(pmd_t *pmd)
 	return pte_offset_kernel(pmd, 0);
 }
 
+void __init populate_extra_pte(unsigned long vaddr)
+{
+	int pgd_idx = pgd_index(vaddr);
+	int pmd_idx = pmd_index(vaddr);
+	pmd_t *pmd;
+
+	pmd = one_md_table_init(swapper_pg_dir + pgd_idx);
+	one_page_table_init(pmd + pmd_idx);
+}
+
 static pte_t *__init page_table_kmap_check(pte_t *pte, pmd_t *pmd,
 					   unsigned long vaddr, pte_t *lastpte)
 {
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index e6d36b4..7f91e2c 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -223,6 +223,25 @@ set_pte_vaddr(unsigned long vaddr, pte_t pteval)
 	set_pte_vaddr_pud(pud_page, vaddr, pteval);
 }
 
+void __init populate_extra_pte(unsigned long vaddr)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+
+	pgd = pgd_offset_k(vaddr);
+	if (pgd_none(*pgd)) {
+		pud = (pud_t *)spp_getpage();
+		pgd_populate(&init_mm, pgd, pud);
+		if (pud != pud_offset(pgd, 0)) {
+			printk(KERN_ERR "PAGETABLE BUG #00! %p <-> %p\n",
+			       pud, pud_offset(pgd, 0));
+			return;
+		}
+	}
+
+	set_pte_vaddr_pud((pud_t *)pgd_page_vaddr(*pgd), vaddr, __pte(0));
+}
+
 /*
  * Create large page table mappings for a range of physical addresses.
  */
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCHSET x86/core/percpu] implement dynamic percpu allocator
  2009-02-18 12:04 [PATCHSET x86/core/percpu] implement dynamic percpu allocator Tejun Heo
                   ` (9 preceding siblings ...)
  2009-02-18 12:04 ` [PATCH 10/10] x86: convert to the new dynamic percpu allocator Tejun Heo
@ 2009-02-18 13:43 ` Ingo Molnar
  2009-02-19  0:31   ` Tejun Heo
  2009-02-19 10:51   ` Rusty Russell
  2009-02-19  0:30 ` Tejun Heo
  11 siblings, 2 replies; 78+ messages in thread
From: Ingo Molnar @ 2009-02-18 13:43 UTC (permalink / raw)
  To: Tejun Heo; +Cc: rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw


* Tejun Heo <tj@kernel.org> wrote:

>   0001-vmalloc-call-flush_cache_vunmap-from-unmap_kernel.patch
>   0002-module-fix-out-of-range-memory-access.patch

Hm, these two seem to be .29 material too, agreed?

Rusty, if the fixes are fine with you i can put those two 
commits into tip/core/urgent straight away, the full string of 
10 commits into tip/core/percpu and thus we'd avoid duplicate 
(or even conflicting) commits.

	Ingo

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 06/10] percpu: kill percpu_alloc() and friends
  2009-02-18 12:04 ` [PATCH 06/10] percpu: kill percpu_alloc() and friends Tejun Heo
@ 2009-02-19  0:17   ` Rusty Russell
  2009-03-11 18:36   ` Tony Luck
  1 sibling, 0 replies; 78+ messages in thread
From: Rusty Russell @ 2009-02-19  0:17 UTC (permalink / raw)
  To: Tejun Heo; +Cc: tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo

On Wednesday 18 February 2009 22:34:32 Tejun Heo wrote:
> Impact: kill unused functions
> 
> percpu_alloc() and its friends never saw much action.  It was supposed
> to replace the cpu-mask unaware __alloc_percpu() but it never happened
> and in fact __percpu_alloc_mask() itself never really grew proper
> up/down handling interface either (no exported interface for
> populate/depopulate).
> 
> percpu allocation is about to go through major reimplementation and
> there's no reason to carry this unused interface around.  Replace it
> with __alloc_percpu() and free_percpu().
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>

Nice patch.  FWIW, Acked-by: Rusty Russell <rusty@rustcorp.com.au>

(Oh, and your other mods were acked as well, for the record).

Thanks!
Rusty.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCHSET x86/core/percpu] implement dynamic percpu allocator
  2009-02-18 12:04 [PATCHSET x86/core/percpu] implement dynamic percpu allocator Tejun Heo
                   ` (10 preceding siblings ...)
  2009-02-18 13:43 ` [PATCHSET x86/core/percpu] implement " Ingo Molnar
@ 2009-02-19  0:30 ` Tejun Heo
  2009-02-19 11:07   ` Ingo Molnar
  11 siblings, 1 reply; 78+ messages in thread
From: Tejun Heo @ 2009-02-19  0:30 UTC (permalink / raw)
  To: rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo

Tejun Heo wrote:
>   One trick we can do is to reserve the initial chunk in non-vmalloc
>   area so that at least the static cpu ones and whatever gets
>   allocated in the first chunk is served by regular large page
>   mappings.  Given that those are most frequent visited ones, this
>   could be a nice compromise - no noticeable penalty for usual cases
>   yet allowing scalability for unusual cases.  If this is something
>   which can be agreed on, I'll pursue this.

I've given more thought to this and it actually will solve most of the
issues for non-NUMA, but it can't be done for NUMA.  Any better ideas?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCHSET x86/core/percpu] implement dynamic percpu allocator
  2009-02-18 13:43 ` [PATCHSET x86/core/percpu] implement " Ingo Molnar
@ 2009-02-19  0:31   ` Tejun Heo
  2009-02-19 10:51   ` Rusty Russell
  1 sibling, 0 replies; 78+ messages in thread
From: Tejun Heo @ 2009-02-19  0:31 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw

Hello, Ingo.

Ingo Molnar wrote:
> * Tejun Heo <tj@kernel.org> wrote:
> 
>>   0001-vmalloc-call-flush_cache_vunmap-from-unmap_kernel.patch
>>   0002-module-fix-out-of-range-memory-access.patch
> 
> Hm, these two seem to be .29 material too, agreed?

Yeap.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 07/10] vmalloc: implement vm_area_register_early()
  2009-02-18 12:04 ` [PATCH 07/10] vmalloc: implement vm_area_register_early() Tejun Heo
@ 2009-02-19  0:55   ` Tejun Heo
  2009-02-19 12:09   ` Nick Piggin
  1 sibling, 0 replies; 78+ messages in thread
From: Tejun Heo @ 2009-02-19  0:55 UTC (permalink / raw)
  To: rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo; +Cc: ink

cc'ing Ivan Kokshaysky.  Hello.

Can you please verify that the alpha-related change in the following
patch is correct?  I forgot to cc you while posting the original
patchset.

------------ original message follows --------------
Impact: allow multiple early vm areas

There are places where kernel VM area needs to be allocated before
vmalloc is initialized.  This is done by allocating static vm_struct,
initializing several fields and linking it to vmlist and later vmalloc
initialization picking up these from vmlist.  This is currently done
manually, and if there's more than one such area, there's no defined
way to arbitrate who gets which address.

This patch implements vm_area_register_early(), which takes vm_area
struct with flags and size initialized, assigns address to it and puts
it on the vmlist.  This way, multiple early vm areas can determine
which addresses they should use.  The only current user - alpha mm
init - is converted to use it.
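
A second early user (hypothetical, names made up) would then just do
something like:

	static struct vm_struct foo_early_vm;

	foo_early_vm.flags = VM_ALLOC;
	foo_early_vm.size = foo_size;	/* reservation is page-granular */
	vm_area_register_early(&foo_early_vm);
	/* foo_early_vm.addr now points into the reserved vmalloc space */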

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 arch/alpha/mm/init.c    |   20 +++++++++++++-------
 include/linux/vmalloc.h |    1 +
 mm/vmalloc.c            |   24 ++++++++++++++++++++++++
 3 files changed, 38 insertions(+), 7 deletions(-)

diff --git a/arch/alpha/mm/init.c b/arch/alpha/mm/init.c
index 5d7a16e..df6df02 100644
--- a/arch/alpha/mm/init.c
+++ b/arch/alpha/mm/init.c
@@ -189,9 +189,21 @@ callback_init(void * kernel_end)
 
 	if (alpha_using_srm) {
 		static struct vm_struct console_remap_vm;
-		unsigned long vaddr = VMALLOC_START;
+		unsigned long nr_pages = 0;
+		unsigned long vaddr;
 		unsigned long i, j;
 
+		/* calculate needed size */
+		for (i = 0; i < crb->map_entries; ++i)
+			nr_pages += crb->map[i].count;
+
+		/* register the vm area */
+		console_remap_vm.flags = VM_ALLOC;
+		console_remap_vm.size = nr_pages << PAGE_SHIFT;
+		vm_area_register_early(&console_remap_vm);
+
+		vaddr = (unsigned long)console_remap_vm.addr;
+
 		/* Set up the third level PTEs and update the virtual
 		   addresses of the CRB entries.  */
 		for (i = 0; i < crb->map_entries; ++i) {
@@ -213,12 +225,6 @@ callback_init(void * kernel_end)
 				vaddr += PAGE_SIZE;
 			}
 		}
-
-		/* Let vmalloc know that we've allocated some space.  */
-		console_remap_vm.flags = VM_ALLOC;
-		console_remap_vm.addr = (void *) VMALLOC_START;
-		console_remap_vm.size = vaddr - VMALLOC_START;
-		vmlist = &console_remap_vm;
 	}
 
 	callback_init_done = 1;
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 506e762..bbc0513 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -106,5 +106,6 @@ extern long vwrite(char *buf, char *addr, unsigned long count);
  */
 extern rwlock_t vmlist_lock;
 extern struct vm_struct *vmlist;
+extern __init void vm_area_register_early(struct vm_struct *vm);
 
 #endif /* _LINUX_VMALLOC_H */
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index c37924a..d206261 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -24,6 +24,7 @@
 #include <linux/radix-tree.h>
 #include <linux/rcupdate.h>
 #include <linux/bootmem.h>
+#include <linux/pfn.h>
 
 #include <asm/atomic.h>
 #include <asm/uaccess.h>
@@ -982,6 +983,29 @@ void *vm_map_ram(struct page **pages, unsigned int count, int node, pgprot_t pro
 }
 EXPORT_SYMBOL(vm_map_ram);
 
+/**
+ * vm_area_register_early - register vmap area early during boot
+ * @vm: vm_struct to register
+ * @size: size of area to register
+ *
+ * This function is used to register kernel vm area before
+ * vmalloc_init() is called.  @vm->size and @vm->flags should contain
+ * proper values on entry and other fields should be zero.  On return,
+ * vm->addr contains the allocated address.
+ *
+ * DO NOT USE THIS FUNCTION UNLESS YOU KNOW WHAT YOU'RE DOING.
+ */
+void __init vm_area_register_early(struct vm_struct *vm)
+{
+	static size_t vm_init_off __initdata;
+
+	vm->addr = (void *)VMALLOC_START + vm_init_off;
+	vm_init_off = PFN_ALIGN(vm_init_off + vm->size);
+
+	vm->next = vmlist;
+	vmlist = vm;
+}
+
 void __init vmalloc_init(void)
 {
 	struct vmap_area *va;

-- 
tejun

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH 09/10] percpu: implement new dynamic percpu allocator
  2009-02-18 12:04 ` [PATCH 09/10] percpu: implement new dynamic percpu allocator Tejun Heo
@ 2009-02-19 10:10   ` Andrew Morton
  2009-02-19 11:01     ` Ingo Molnar
                       ` (2 more replies)
  2009-02-19 11:51   ` Rusty Russell
                     ` (2 subsequent siblings)
  3 siblings, 3 replies; 78+ messages in thread
From: Andrew Morton @ 2009-02-19 10:10 UTC (permalink / raw)
  To: Tejun Heo; +Cc: rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo

On Wed, 18 Feb 2009 21:04:35 +0900 Tejun Heo <tj@kernel.org> wrote:

> Impact: new scalable dynamic percpu allocator which allows dynamic
>         percpu areas to be accessed the same way as static ones
> 
> Implement scalable dynamic percpu allocator which can be used for both
> static and dynamic percpu areas.  This will allow static and dynamic
> areas to share faster direct access methods.  This feature is optional
> and enabled only when CONFIG_HAVE_DYNAMIC_PER_CPU_AREA is defined by
> arch.  Please read comment on top of mm/percpu.c for details.
> 
>
> ...
>
> +static void *percpu_modalloc(unsigned long size, unsigned long align,
> +			     const char *name)
> +{
> +	void *ptr;
> +
> +	if (align > PAGE_SIZE) {
> +		printk(KERN_WARNING "%s: per-cpu alignment %li > %li\n",
> +		       name, align, PAGE_SIZE);

It used to be the case that PAGE_SIZE has type `unsigned' on some
architectures and `unsigned long' on others.  I don't know if that was
fixed - probably not.
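
An explicit cast in the printk would sidestep the difference either
way, e.g. (untested):

	printk(KERN_WARNING "%s: per-cpu alignment %lu > %lu\n",
	       name, align, (unsigned long)PAGE_SIZE);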

> +		align = PAGE_SIZE;
> +	}
> +
> +	ptr = __alloc_percpu(size, align);
> +	if (!ptr)
> +		printk(KERN_WARNING
> +		       "Could not allocate %lu bytes percpu data\n", size);

A dump_stack() here would be useful.
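
e.g. (untested):

	ptr = __alloc_percpu(size, align);
	if (!ptr) {
		printk(KERN_WARNING
		       "Could not allocate %lu bytes percpu data\n", size);
		dump_stack();
	}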

> +	return ptr;
> +}
> +
> +static void percpu_modfree(void *freeme)
> +{
> +	free_percpu(freeme);
> +}
> +
> +#else /* ... !CONFIG_HAVE_DYNAMIC_PER_CPU_AREA */
> +
>
> ...
>
> +/*
> + * linux/mm/percpu.c - percpu memory allocator
> + *
> + * Copyright (C) 2009		SUSE Linux Products GmbH
> + * Copyright (C) 2009		Tejun Heo <tj@kernel.org>
> + *
> + * This file is released under the GPLv2.
> + *
> + * This is percpu allocator which can handle both static and dynamic
> + * areas.  Percpu areas are allocated in chunks in vmalloc area.  Each
> + * chunk consists of num_possible_cpus() units and the first chunk
> + * is used for static percpu variables in the kernel image (special
> + * boot time alloc/init handling necessary as these areas need to be
> + * brought up before allocation services are running).  Unit grows as
> + * necessary and all units grow or shrink in unison.  When a chunk is
> + * filled up, another chunk is allocated.  ie. in vmalloc area
> + *
> + *  c0                           c1                         c2
> + *  -------------------          -------------------        ------------
> + * | u0 | u1 | u2 | u3 |        | u0 | u1 | u2 | u3 |      | u0 | u1 | u
> + *  -------------------  ......  -------------------  ....  ------------
> + *
> + * Allocation is done in offset-size areas of single unit space.  Ie,
> + * when UNIT_SIZE is 128k, an area at 134k of 512 bytes occupies 512
> + * bytes at 6k of c1:u0, c1:u1, c1:u2 and c1:u3.  Percpu access can be
> + * done by configuring percpu base registers UNIT_SIZE apart.
> + *
> + * There are usually many small percpu allocations, many of them as
> + * small as 4 bytes.  The allocator organizes chunks into lists
> + * according to free size and tries to allocate from the fullest one.
> + * Each chunk keeps the maximum contiguous area size hint which is
> + * guaranteed to be equal to or larger than the maximum contiguous
> + * area in the chunk.  This helps the allocator not to iterate the
> + * chunk maps unnecessarily.
> + *
> + * Allocation state in each chunk is kept using an array of integers.
> + * A positive value represents free region and negative allocated.
> + * Allocation inside a chunk is done by scanning this map sequentially
> + * and serving the first matching entry.  This is mostly copied from
> + * the percpu_modalloc() allocator.  Chunks are also linked into a rb
> + * tree to ease address to chunk mapping during free.
> + *
> + * To use this allocator, arch code should do the following.
> + *
> + * - define CONFIG_HAVE_DYNAMIC_PER_CPU_AREA
> + *
> + * - define __addr_to_pcpu_ptr() and __pcpu_ptr_to_addr() to translate
> + *   regular address to percpu pointer and back
> + *
> + * - use pcpu_setup_static() during percpu area initialization to
> + *   setup kernel static percpu area
> + */

afaict nobody has answered your "is num_possible_cpus() ever a lot
larger than num_online_cpus()" question.

It is fairly important.

> +#include <linux/bitmap.h>
> +#include <linux/bootmem.h>
> +#include <linux/list.h>
> +#include <linux/mm.h>
> +#include <linux/module.h>
> +#include <linux/mutex.h>
> +#include <linux/percpu.h>
> +#include <linux/pfn.h>
> +#include <linux/rbtree.h>
> +#include <linux/slab.h>
> +#include <linux/vmalloc.h>
> +
> +#include <asm/cacheflush.h>
> +#include <asm/tlbflush.h>
> +
> +#define PCPU_MIN_UNIT_PAGES_SHIFT	4	/* also max alloc size */
> +#define PCPU_SLOT_BASE_SHIFT		5	/* 1-31 shares the same slot */
> +#define PCPU_DFL_MAP_ALLOC		16	/* start a map with 16 ents */
> +
> +struct pcpu_chunk {
> +	struct list_head	list;		/* linked to pcpu_slot lists */
> +	struct rb_node		rb_node;	/* key is chunk->vm->addr */
> +	int			free_size;

what's this?

> +	int			contig_hint;	/* max contiguous size hint */
> +	struct vm_struct	*vm;

?

> +	int			map_used;	/* # of map entries used */
> +	int			map_alloc;	/* # of map entries allocated */
> +	int			*map;

?

> +	struct page		*page[];	/* #cpus * UNIT_PAGES */

"pages" ;)

> +};
> +
> +#define SIZEOF_STRUCT_PCPU_CHUNK					\
> +	(sizeof(struct pcpu_chunk) +					\
> +	 (num_possible_cpus() << PCPU_UNIT_PAGES_SHIFT) * sizeof(struct page *))

This macro generates real code.  It is misleading to pretend that it is
a compile-time constant.  Suggest that it be converted to a plain old C
function.
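
Something along these lines, perhaps (untested sketch):

	static inline size_t pcpu_chunk_struct_size(void)
	{
		return sizeof(struct pcpu_chunk) +
			(num_possible_cpus() << PCPU_UNIT_PAGES_SHIFT) *
			sizeof(struct page *);
	}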

> +static int __pcpu_unit_pages_shift = PCPU_MIN_UNIT_PAGES_SHIFT;
> +static int __pcpu_unit_pages;
> +static int __pcpu_unit_shift;
> +static int __pcpu_unit_size;
> +static int __pcpu_chunk_size;
> +static int __pcpu_nr_slots;
> +
> +/* currently everything is power of two, there's no hard dependency on it tho */
> +#define PCPU_UNIT_PAGES_SHIFT	((int)__pcpu_unit_pages_shift)
> +#define PCPU_UNIT_PAGES		((int)__pcpu_unit_pages)
> +#define PCPU_UNIT_SHIFT		((int)__pcpu_unit_shift)
> +#define PCPU_UNIT_SIZE		((int)__pcpu_unit_size)
> +#define PCPU_CHUNK_SIZE		((int)__pcpu_chunk_size)
> +#define PCPU_NR_SLOTS		((int)__pcpu_nr_slots)

hm.  Why do these exist?

Again, they look like compile-time constants, but aren't.
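
Plain lowercase variables would at least make the runtime nature
obvious, e.g. (sketch):

	static int pcpu_unit_pages __read_mostly;
	static int pcpu_unit_size __read_mostly;
	static int pcpu_chunk_size __read_mostly;
	static int pcpu_nr_slots __read_mostly;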

> +/* the address of the first chunk which starts with the kernel static area */
> +void *pcpu_base_addr;
> +EXPORT_SYMBOL_GPL(pcpu_base_addr);
> +
>
> ...
>
> +/**
> + * pcpu_realloc - versatile realloc
> + * @p: the current pointer (can be NULL for new allocations)
> + * @size: the current size (can be 0 for new allocations)
> + * @new_size: the wanted new size (can be 0 for free)

So the allocator doesn't internally record the size of each hunk?

<squints at the undocumented `free_size'>

> + * More robust realloc which can be used to allocate, resize or free a
> + * memory area of arbitrary size.  If the needed size goes over
> + * PAGE_SIZE, kernel VM is used.
> + *
> + * RETURNS:
> + * The new pointer on success, NULL on failure.
> + */
> +static void *pcpu_realloc(void *p, size_t size, size_t new_size)
> +{
> +	void *new;
> +
> +	if (new_size <= PAGE_SIZE)
> +		new = kmalloc(new_size, GFP_KERNEL);
> +	else
> +		new = vmalloc(new_size);
> +	if (new_size && !new)
> +		return NULL;
> +
> +	memcpy(new, p, min(size, new_size));
> +	if (new_size > size)
> +		memset(new + size, 0, new_size - size);
> +
> +	if (size <= PAGE_SIZE)
> +		kfree(p);
> +	else
> +		vfree(p);
> +
> +	return new;
> +}

This function can be called under spinlock if new_size>PAGE_SIZE and
the kernel won't (I think) warn.  If new_size<=PAGE_SIZE, the kernel
will warn.

Methinks vmalloc() should have a might_sleep().  Dunno.
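
An explicit might_sleep() at the top of pcpu_realloc() would at least
make the requirement visible on both paths (untested sketch):

	might_sleep();	/* kmalloc(GFP_KERNEL) and vmalloc may both sleep */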

> +/**
> + * pcpu_chunk_relocate - put chunk in the appropriate chunk slot
> + * @chunk: chunk of interest
> + * @oslot: the previous slot it was on
> + *
> + * This function is called after an allocation or free changed @chunk.
> + * New slot according to the changed state is determined and @chunk is
> + * moved to the slot.

Locking requirements?

> + */
> +static void pcpu_chunk_relocate(struct pcpu_chunk *chunk, int oslot)
> +{
> +	int nslot = pcpu_chunk_slot(chunk);
> +
> +	if (oslot != nslot) {
> +		if (oslot < nslot)
> +			list_move(&chunk->list, &pcpu_slot[nslot]);
> +		else
> +			list_move_tail(&chunk->list, &pcpu_slot[nslot]);
> +	}
> +}
> +
>
> ...
>
> +static int pcpu_alloc_area(struct pcpu_chunk *chunk, int size, int align)
> +{
> +	int oslot = pcpu_chunk_slot(chunk);
> +	int max_contig = 0;
> +	int i, off;
> +
> +	/*
> +	 * The static chunk initially doesn't have map attached
> +	 * because kmalloc wasn't available during init.  Give it one.
> +	 */
> +	if (unlikely(!chunk->map)) {
> +		chunk->map = pcpu_realloc(NULL, 0,
> +				PCPU_DFL_MAP_ALLOC * sizeof(chunk->map[0]));
> +		if (!chunk->map)
> +			return -ENOMEM;
> +
> +		chunk->map_alloc = PCPU_DFL_MAP_ALLOC;
> +		chunk->map[chunk->map_used++] = -pcpu_static_size;
> +		if (chunk->free_size)
> +			chunk->map[chunk->map_used++] = chunk->free_size;
> +	}
> +
> +	for (i = 0, off = 0; i < chunk->map_used; off += abs(chunk->map[i++])) {
> +		bool is_last = i + 1 == chunk->map_used;
> +		int head, tail;
> +
> +		/* extra for alignment requirement */
> +		head = ALIGN(off, align) - off;
> +		BUG_ON(i == 0 && head != 0);
> +
> +		if (chunk->map[i] < 0)
> +			continue;
> +		if (chunk->map[i] < head + size) {
> +			max_contig = max(chunk->map[i], max_contig);
> +			continue;
> +		}
> +
> +		/*
> +		 * If head is small or the previous block is free,
> +		 * merge'em.  Note that 'small' is defined as smaller
> +		 * than sizeof(int), which is very small but isn't too
> +		 * uncommon for percpu allocations.
> +		 */
> +		if (head && (head < sizeof(int) || chunk->map[i - 1] > 0)) {
> +			if (chunk->map[i - 1] > 0)
> +				chunk->map[i - 1] += head;
> +			else {
> +				chunk->map[i - 1] -= head;
> +				chunk->free_size -= head;
> +			}
> +			chunk->map[i] -= head;
> +			off += head;
> +			head = 0;
> +		}
> +
> +		/* if tail is small, just keep it around */
> +		tail = chunk->map[i] - head - size;
> +		if (tail < sizeof(int))
> +			tail = 0;
> +
> +		/* split if warranted */
> +		if (head || tail) {
> +			if (pcpu_split_block(chunk, i, head, tail))
> +				return -ENOMEM;
> +			if (head) {
> +				i++;
> +				off += head;
> +				max_contig = max(chunk->map[i - 1], max_contig);
> +			}
> +			if (tail)
> +				max_contig = max(chunk->map[i + 1], max_contig);
> +		}
> +
> +		/* update hint and mark allocated */
> +		if (is_last)
> +			chunk->contig_hint = max_contig; /* fully scanned */
> +		else
> +			chunk->contig_hint = max(chunk->contig_hint,
> +						 max_contig);
> +
> +		chunk->free_size -= chunk->map[i];
> +		chunk->map[i] = -chunk->map[i];

When pcpu_chunk.map gets documented, please also explain the
significance of negative entries in there.
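
e.g. something like this above the scan loop (untested, just a sketch):

	/*
	 * chunk->map[] holds area sizes in bytes: a positive entry is a
	 * free area, a negative entry is an allocated area stored with
	 * its size negated, which is why the scan uses abs(chunk->map[i]).
	 */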

> +		pcpu_chunk_relocate(chunk, oslot);
> +		return off;
> +	}
> +
> +	chunk->contig_hint = max_contig;	/* fully scanned */
> +	pcpu_chunk_relocate(chunk, oslot);
> +	return -ENOSPC;

"No space left on device".

This is not a disk drive.

> +}
> +
>
> ...
>
> +static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk, size_t off,
> +				  size_t size, bool flush)
> +{
> +	int page_start = PFN_DOWN(off);
> +	int page_end = PFN_UP(off + size);
> +	int unmap_start = -1;
> +	int uninitialized_var(unmap_end);
> +	unsigned int cpu;
> +	int i;
> +
> +	for (i = page_start; i < page_end; i++) {
> +		for_each_possible_cpu(cpu) {
> +			struct page **pagep = pcpu_chunk_pagep(chunk, cpu, i);
> +
> +			if (!*pagep)
> +				continue;
> +
> +			__free_page(*pagep);
> +			*pagep = NULL;

Why did *pagep get zeroed?  Needs comment?
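
Perhaps something like (sketch):

	/*
	 * Clear the slot so pcpu_chunk_page_occupied() and a later
	 * repopulation see this page as gone.
	 */
	*pagep = NULL;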

> +			unmap_start = unmap_start < 0 ? i : unmap_start;
> +			unmap_end = i + 1;
> +		}
> +	}
> +
> +	if (unmap_start >= 0)
> +		pcpu_unmap(chunk, unmap_start, unmap_end, flush);
> +}
> +
>
> ...
>
> +/**
> + * pcpu_populate_chunk - populate and map an area of a pcpu_chunk
> + * @chunk: chunk of interest
> + * @off: offset to the area to populate
> + * @size: size of the area to populate
> + *
> + * For each cpu, populate and map pages [@page_start,@page_end) into
> + * @chunk.  The area is cleared on return.
> + */
> +static int pcpu_populate_chunk(struct pcpu_chunk *chunk, int off, int size)
> +{
> +	const gfp_t alloc_mask = GFP_KERNEL | __GFP_HIGHMEM | __GFP_COLD;

A design decision has been made to not permit the caller to specify
the allocation mode?

Usually a mistake.  Probably appropriate in this case.  Should be
mentioned up-front and discussed a bit.

> +	int page_start = PFN_DOWN(off);
> +	int page_end = PFN_UP(off + size);
> +	int map_start = -1;
> +	int map_end;
> +	unsigned int cpu;
> +	int i;
> +
> +	for (i = page_start; i < page_end; i++) {
> +		if (pcpu_chunk_page_occupied(chunk, i)) {
> +			if (map_start >= 0) {
> +				if (pcpu_map(chunk, map_start, map_end))
> +					goto err;
> +				map_start = -1;
> +			}
> +			continue;
> +		}
> +
> +		map_start = map_start < 0 ? i : map_start;
> +		map_end = i + 1;
> +
> +		for_each_possible_cpu(cpu) {
> +			struct page **pagep = pcpu_chunk_pagep(chunk, cpu, i);
> +
> +			*pagep = alloc_pages_node(cpu_to_node(cpu),
> +						  alloc_mask, 0);
> +			if (!*pagep)
> +				goto err;
> +		}
> +	}
> +
> +	if (map_start >= 0 && pcpu_map(chunk, map_start, map_end))
> +		goto err;
> +
> +	for_each_possible_cpu(cpu)
> +		memset(chunk->vm->addr + (cpu << PCPU_UNIT_SHIFT) + off, 0,
> +		       size);
> +
> +	return 0;
> +err:
> +	/* likely under heavy memory pressure, give memory back */
> +	pcpu_depopulate_chunk(chunk, off, size, true);
> +	return -ENOMEM;
> +}
> +
> +static void free_pcpu_chunk(struct pcpu_chunk *chunk)
> +{
> +	if (!chunk)
> +		return;

afaict this test is unneeded.

> +	if (chunk->vm)
> +		free_vm_area(chunk->vm);

I didn't check whether this one is needed.

> +	pcpu_realloc(chunk->map, chunk->map_alloc * sizeof(chunk->map[0]), 0);
> +	kfree(chunk);
> +}
> +
>
> ...
>
> +/**
> + * __alloc_percpu - allocate percpu area
> + * @size: size of area to allocate
> + * @align: alignment of area (max PAGE_SIZE)
> + *
> + * Allocate percpu area of @size bytes aligned at @align.  Might
> + * sleep.  Might trigger writeouts.
> + *
> + * RETURNS:
> + * Percpu pointer to the allocated area on success, NULL on failure.
> + */
> +void *__alloc_percpu(size_t size, size_t align)
> +{
> +	void *ptr = NULL;
> +	struct pcpu_chunk *chunk;
> +	int slot, off, err;
> +
> +	if (unlikely(!size))
> +		return NULL;

hm.  Why do we do this?  Perhaps emitting this warning:

> +	if (unlikely(size > PAGE_SIZE << PCPU_MIN_UNIT_PAGES_SHIFT ||
> +		     align > PAGE_SIZE)) {
> +		printk(KERN_WARNING "illegal size (%zu) or align (%zu) for "
> +		       "percpu allocation\n", size, align);

would be more appropriate.
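
i.e. folding the !size case into the same check (untested sketch):

	if (unlikely(!size || size > PAGE_SIZE << PCPU_MIN_UNIT_PAGES_SHIFT ||
		     align > PAGE_SIZE)) {
		printk(KERN_WARNING "illegal size (%zu) or align (%zu) for "
		       "percpu allocation\n", size, align);
		return NULL;
	}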

> +		return NULL;
> +	}
> +
> +	mutex_lock(&pcpu_mutex);

OK, so we do GFP_KERNEL allocations under this lock, so vast amounts of
kernel code (filesystems, page reclaim, block/io) are not allowed to do
per-cpu allocations.

I doubt if there's a problem with that, but it's worth pointing out.

> +	/* allocate area */
> +	for (slot = pcpu_size_to_slot(size); slot < PCPU_NR_SLOTS; slot++) {
> +		list_for_each_entry(chunk, &pcpu_slot[slot], list) {
> +			if (size > chunk->contig_hint)
> +				continue;
> +			err = pcpu_alloc_area(chunk, size, align);
> +			if (err >= 0) {
> +				off = err;
> +				goto area_found;
> +			}
> +			if (err != -ENOSPC)
> +				goto out_unlock;
> +		}
> +	}
> +
> +	/* hmmm... no space left, create a new chunk */
> +	err = -ENOMEM;

This statement is unneeded.

> +	chunk = alloc_pcpu_chunk();
> +	if (!chunk)
> +		goto out_unlock;
> +	pcpu_chunk_relocate(chunk, -1);
> +	pcpu_chunk_addr_insert(chunk);
> +
> +	err = pcpu_alloc_area(chunk, size, align);
> +	if (err < 0)
> +		goto out_unlock;
> +	off = err;

It would be cleaner to do

	off = pcpu_alloc_area(chunk, size, align);
	if (off < 0)
		goto out_unlock;

> +area_found:
> +	/* populate, map and clear the area */
> +	if (pcpu_populate_chunk(chunk, off, size)) {
> +		pcpu_free_area(chunk, off);
> +		goto out_unlock;
> +	}
> +
> +	ptr = __addr_to_pcpu_ptr(chunk->vm->addr + off);
> +out_unlock:
> +	mutex_unlock(&pcpu_mutex);
> +	return ptr;
> +}
> +EXPORT_SYMBOL_GPL(__alloc_percpu);
> +
>
> ...
>
> +/**
> + * free_percpu - free percpu area
> + * @ptr: pointer to area to free
> + *
> + * Free percpu area @ptr.  Might sleep.
> + */
> +void free_percpu(void *ptr)
> +{
> +	void *addr = __pcpu_ptr_to_addr(ptr);
> +	struct pcpu_chunk *chunk;
> +	int off;
> +
> +	if (!ptr)
> +		return;

Do we ever do this?  Should it be permitted?  Should we warn?

> +	mutex_lock(&pcpu_mutex);
> +
> +	chunk = pcpu_chunk_addr_search(addr);
> +	off = addr - chunk->vm->addr;
> +
> +	pcpu_free_area(chunk, off);
> +
> +	/* the chunk became fully free, kill one if there are other free ones */
> +	if (chunk->free_size == PCPU_UNIT_SIZE) {
> +		struct pcpu_chunk *pos;
> +
> +		list_for_each_entry(pos,
> +				    &pcpu_slot[pcpu_chunk_slot(chunk)], list)
> +			if (pos != chunk) {
> +				pcpu_kill_chunk(pos);
> +				break;
> +			}
> +	}
> +
> +	mutex_unlock(&pcpu_mutex);
> +}
> +EXPORT_SYMBOL_GPL(free_percpu);
> +
> +/**
> + * pcpu_setup_static - initialize kernel static percpu area
> + * @populate_pte_fn: callback to allocate pagetable
> + * @pages: num_possible_cpus() * PFN_UP(cpu_size) pages
> + *
> + * Initialize kernel static percpu area.  The caller should allocate
> + * all the necessary pages and pass them in @pages.
> + * @populate_pte_fn() is called on each page to be used for percpu
> + * mapping and is responsible for making sure all the necessary page
> + * tables for the page is allocated.
> + *
> + * RETURNS:
> + * The determined PCPU_UNIT_SIZE which can be used to initialize
> + * percpu access.
> + */
> +size_t __init pcpu_setup_static(pcpu_populate_pte_fn_t populate_pte_fn,
> +				struct page **pages, size_t cpu_size)
> +{
> +	static struct vm_struct static_vm;
> +	struct pcpu_chunk *static_chunk;
> +	int nr_cpu_pages = DIV_ROUND_UP(cpu_size, PAGE_SIZE);
> +	unsigned int cpu;
> +	int err, i;
> +
> +	while (1 << __pcpu_unit_pages_shift < nr_cpu_pages)
> +		__pcpu_unit_pages_shift++;

Is there an ilog2() hiding in there somewhere?
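
Probably; assuming nr_cpu_pages >= 1, something like this should be
equivalent (untested):

	if (nr_cpu_pages > (1 << __pcpu_unit_pages_shift))
		__pcpu_unit_pages_shift =
			ilog2(roundup_pow_of_two(nr_cpu_pages));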

> +	pcpu_static_size = cpu_size;
> +	__pcpu_unit_pages = 1 << __pcpu_unit_pages_shift;
> +	__pcpu_unit_shift = PAGE_SHIFT + __pcpu_unit_pages_shift;
> +	__pcpu_unit_size = 1 << __pcpu_unit_shift;
> +	__pcpu_chunk_size = num_possible_cpus() * __pcpu_unit_size;
> +	__pcpu_nr_slots = pcpu_size_to_slot(__pcpu_unit_size) + 1;
> +
> +	/* allocate chunk slots */
> +	pcpu_slot = alloc_bootmem(PCPU_NR_SLOTS * sizeof(pcpu_slot[0]));
> +	for (i = 0; i < PCPU_NR_SLOTS; i++)
> +		INIT_LIST_HEAD(&pcpu_slot[i]);
> +
> +	/* init and register vm area */
> +	static_vm.flags = VM_ALLOC;
> +	static_vm.size = PCPU_CHUNK_SIZE;
> +	vm_area_register_early(&static_vm);
> +
> +	/* init static_chunk */
> +	static_chunk = alloc_bootmem(SIZEOF_STRUCT_PCPU_CHUNK);
> +	INIT_LIST_HEAD(&static_chunk->list);
> +	static_chunk->vm = &static_vm;
> +	static_chunk->free_size = PCPU_UNIT_SIZE - pcpu_static_size;
> +	static_chunk->contig_hint = static_chunk->free_size;
> +
> +	/* assign pages and map them */
> +	for_each_possible_cpu(cpu) {
> +		for (i = 0; i < nr_cpu_pages; i++) {
> +			*pcpu_chunk_pagep(static_chunk, cpu, i) = *pages++;
> +			populate_pte_fn(pcpu_chunk_addr(static_chunk, cpu, i));
> +		}
> +	}
> +
> +	err = pcpu_map(static_chunk, 0, nr_cpu_pages);
> +	if (err)
> +		panic("failed to setup static percpu area, err=%d\n", err);
> +
> +	/* link static_chunk in */
> +	pcpu_chunk_relocate(static_chunk, -1);
> +	pcpu_chunk_addr_insert(static_chunk);
> +
> +	/* we're done */
> +	pcpu_base_addr = (void *)pcpu_chunk_addr(static_chunk, 0, 0);
> +	return PCPU_UNIT_SIZE;
> +}


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCHSET x86/core/percpu] implement dynamic percpu allocator
  2009-02-18 13:43 ` [PATCHSET x86/core/percpu] implement " Ingo Molnar
  2009-02-19  0:31   ` Tejun Heo
@ 2009-02-19 10:51   ` Rusty Russell
  2009-02-19 11:06     ` Ingo Molnar
  1 sibling, 1 reply; 78+ messages in thread
From: Rusty Russell @ 2009-02-19 10:51 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Tejun Heo, tglx, x86, linux-kernel, hpa, jeremy, cpw

On Thursday 19 February 2009 00:13:31 Ingo Molnar wrote:
> 
> * Tejun Heo <tj@kernel.org> wrote:
> 
> >   0001-vmalloc-call-flush_cache_vunmap-from-unmap_kernel.patch
> >   0002-module-fix-out-of-range-memory-access.patch
> 
> Hm, these two seem to be .29 material too, agreed?
> 
> Rusty, if the fixes are fine with you i can put those two 
> commits into tip/core/urgent straight away, the full string of 
> 10 commits into tip/core/percpu and thus we'd avoid duplicate 
> (or even conflicting) commits.

No, the second one is not .29 material; it's a nice, but theoretical, fix.

Don't know about the first one.

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 09/10] percpu: implement new dynamic percpu allocator
  2009-02-19 10:10   ` Andrew Morton
@ 2009-02-19 11:01     ` Ingo Molnar
  2009-02-20  2:45       ` Tejun Heo
  2009-02-19 12:07     ` Rusty Russell
  2009-02-20  2:35     ` Tejun Heo
  2 siblings, 1 reply; 78+ messages in thread
From: Ingo Molnar @ 2009-02-19 11:01 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Tejun Heo, rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw


* Andrew Morton <akpm@linux-foundation.org> wrote:

> > + * To use this allocator, arch code should do the followings.
> > + *
> > + * - define CONFIG_HAVE_DYNAMIC_PER_CPU_AREA
> > + *
> > + * - define __addr_to_pcpu_ptr() and __pcpu_ptr_to_addr() to translate
> > + *   regular address to percpu pointer and back
> > + *
> > + * - use pcpu_setup_static() during percpu area initialization to
> > + *   setup kernel static percpu area
> > + */
> 
> afacit nobody has answered your "is num_possible_cpus() ever a 
> lot larger than num_online_cpus()" question.
> 
> It is fairly important.

yeah.

On x86 we limit num_possible_cpus() at boot time from NR_CPUS to 
the BIOS-enumerated set of possible CPUs - i.e. the two will 
always be either equal, or be very close to each other.

( there used to be broken early BIOSes that enumerated more CPUs 
  than needed but it's very rare and because it also wastes BIOS 
  RAM/ROM it's something they'll usually avoid even if they don't
  care about Linux. )

So this should be a pretty OK assumption.

	Ingo

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCHSET x86/core/percpu] implement dynamic percpu allocator
  2009-02-19 10:51   ` Rusty Russell
@ 2009-02-19 11:06     ` Ingo Molnar
  2009-02-19 12:14       ` Rusty Russell
  0 siblings, 1 reply; 78+ messages in thread
From: Ingo Molnar @ 2009-02-19 11:06 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Tejun Heo, tglx, x86, linux-kernel, hpa, jeremy, cpw


* Rusty Russell <rusty@rustcorp.com.au> wrote:

> On Thursday 19 February 2009 00:13:31 Ingo Molnar wrote:
> > 
> > * Tejun Heo <tj@kernel.org> wrote:
> > 
> > >   0001-vmalloc-call-flush_cache_vunmap-from-unmap_kernel.patch
> > >   0002-module-fix-out-of-range-memory-access.patch
> > 
> > Hm, these two seem to be .29 material too, agreed?
> > 
> > Rusty, if the fixes are fine with you i can put those two 
> > commits into tip/core/urgent straight away, the full string of 
> > 10 commits into tip/core/percpu and thus we'd avoid duplicate 
> > (or even conflicting) commits.
> 
> No, the second one is not .29 material; it's a nice, but 
> theoretical, fix.

Can it never trigger?

	Ingo

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCHSET x86/core/percpu] implement dynamic percpu allocator
  2009-02-19  0:30 ` Tejun Heo
@ 2009-02-19 11:07   ` Ingo Molnar
  2009-02-20  3:17     ` Tejun Heo
  0 siblings, 1 reply; 78+ messages in thread
From: Ingo Molnar @ 2009-02-19 11:07 UTC (permalink / raw)
  To: Tejun Heo; +Cc: rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw


* Tejun Heo <tj@kernel.org> wrote:

> Tejun Heo wrote:
> >   One trick we can do is to reserve the initial chunk in non-vmalloc
> >   area so that at least the static cpu ones and whatever gets
> >   allocated in the first chunk is served by regular large page
> >   mappings.  Given that those are most frequent visited ones, this
> >   could be a nice compromise - no noticeable penalty for usual cases
> >   yet allowing scalability for unusual cases.  If this is something
> >   which can be agreed on, I'll pursue this.
> 
> I've given more thought to this and it actually will solve 
> most of issues for non-NUMA but it can't be done for NUMA.  
> Any better ideas?

It could be allocated via NUMA-aware bootmem allocations.

	Ingo

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 09/10] percpu: implement new dynamic percpu allocator
  2009-02-18 12:04 ` [PATCH 09/10] percpu: implement new dynamic percpu allocator Tejun Heo
  2009-02-19 10:10   ` Andrew Morton
@ 2009-02-19 11:51   ` Rusty Russell
  2009-02-20  3:01     ` Tejun Heo
  2009-02-19 12:36   ` Nick Piggin
  2009-02-20  7:30   ` [PATCH UPDATED " Tejun Heo
  3 siblings, 1 reply; 78+ messages in thread
From: Rusty Russell @ 2009-02-19 11:51 UTC (permalink / raw)
  To: Tejun Heo; +Cc: tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo, tony.luck

On Wednesday 18 February 2009 22:34:35 Tejun Heo wrote:
> Impact: new scalable dynamic percpu allocator which allows dynamic
>         percpu areas to be accessed the same way as static ones
> 
> Implement scalable dynamic percpu allocator which can be used for both
> static and dynamic percpu areas.  This will allow static and dynamic
> areas to share faster direct access methods.  This feature is optional
> and enabled only when CONFIG_HAVE_DYNAMIC_PER_CPU_AREA is defined by
> arch.  Please read comment on top of mm/percpu.c for details.

Hi Tejun,

   One question.  Are you expecting that to be defined by every SMP arch
long-term?  Because there are benefits in having &<percpuvar> == valid
percpuptr, such as passing them around as parameters.  If so, IA64
will want a dedicated per-cpu area for statics (tho it can probably just
map it somehow, but it has to be 64k).

   It'd also be nice to use your generalised module_percpu allocator for the
!CONFIG_HAVE_DYNAMIC_PER_CPU_AREA case, but doesn't really matter if that's
temporary anyway.

Direct comments follow:

> +static int __pcpu_unit_pages_shift = PCPU_MIN_UNIT_PAGES_SHIFT;
> +static int __pcpu_unit_pages;
> +static int __pcpu_unit_shift;
> +static int __pcpu_unit_size;
> +static int __pcpu_chunk_size;
> +static int __pcpu_nr_slots;
> +
> +/* currently everything is power of two, there's no hard dependency on it tho */
> +#define PCPU_UNIT_PAGES_SHIFT	((int)__pcpu_unit_pages_shift)
> +#define PCPU_UNIT_PAGES		((int)__pcpu_unit_pages)
> +#define PCPU_UNIT_SHIFT		((int)__pcpu_unit_shift)
> +#define PCPU_UNIT_SIZE		((int)__pcpu_unit_size)
> +#define PCPU_CHUNK_SIZE		((int)__pcpu_chunk_size)
> +#define PCPU_NR_SLOTS		((int)__pcpu_nr_slots)

These pseudo-constants seem like a really weird thing to do to me.

And AFAICT you have the requirement that PCPU_UNIT_PAGES*PAGE_SIZE >=
sizeof(.data.percpu).  Should probably note that somewhere.

> +static DEFINE_MUTEX(pcpu_mutex);		/* one mutex to rule them all */
> +static struct list_head *pcpu_slot;		/* chunk list slots */
> +static struct rb_root pcpu_addr_root = RB_ROOT;	/* chunks by address */

rbtree might be overkill on first cut.  I'm bearing in mind that Christoph L
had a nice patch to use dynamic percpu allocation in the sl*b allocators,
which would mean this needs to only use get_free_page.

Ah, I see akpm has responded.  I'll stop now and chain onto his comments
in the morning.

Thanks!
Rusty.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 01/10] vmalloc: call flush_cache_vunmap() from unmap_kernel_range()
  2009-02-18 12:04 ` [PATCH 01/10] vmalloc: call flush_cache_vunmap() from unmap_kernel_range() Tejun Heo
@ 2009-02-19 12:06   ` Nick Piggin
  2009-02-19 22:36     ` David Miller
  0 siblings, 1 reply; 78+ messages in thread
From: Nick Piggin @ 2009-02-19 12:06 UTC (permalink / raw)
  To: Tejun Heo; +Cc: rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo

On Wednesday 18 February 2009 23:04:27 Tejun Heo wrote:
> Impact: proper vcache flush on unmap_kernel_range()
>
> flush_cache_vunmap() should be called before pages are unmapped.  Add
> a call to it in unmap_kernel_range().
>
> Signed-off-by: Tejun Heo <tj@kernel.org>

Shouldn't this go as a fix to mainline and even .stable?

Otherwise:
Acked-by: Nick Piggin <npiggin@suse.de>

> ---
>  mm/vmalloc.c |    2 ++
>  1 files changed, 2 insertions(+), 0 deletions(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 75f49d3..c37924a 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1012,6 +1012,8 @@ void __init vmalloc_init(void)
>  void unmap_kernel_range(unsigned long addr, unsigned long size)
>  {
>  	unsigned long end = addr + size;
> +
> +	flush_cache_vunmap(addr, end);
>  	vunmap_page_range(addr, end);
>  	flush_tlb_kernel_range(addr, end);
>  }



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 09/10] percpu: implement new dynamic percpu allocator
  2009-02-19 10:10   ` Andrew Morton
  2009-02-19 11:01     ` Ingo Molnar
@ 2009-02-19 12:07     ` Rusty Russell
  2009-02-20  2:35     ` Tejun Heo
  2 siblings, 0 replies; 78+ messages in thread
From: Rusty Russell @ 2009-02-19 12:07 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Tejun Heo, tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo

On Thursday 19 February 2009 20:40:15 Andrew Morton wrote:
> afacit nobody has answered your "is num_possible_cpus() ever a lot
> larger than num_online_cpus()" question.
> 
> It is fairly important.

Hi Andrew,

  It can be: suspend a giant machine; goes down to 1 cpu.

But I don't think there's much point worrying about a potentially-giant-
but-actually-tiny machine.  No one else has, so we wait until someone actually
creates such a thing, then they can fix this, as well as all the others.

(The only place I can see that this makes sense is in the virtualization space
when you might be on a 4096 CPU host, so all guests might want the capability
to expand to fill the machine.)

> > +	struct page		*page[];	/* #cpus * UNIT_PAGES */
> 
> "pages" ;)

Heh, disagree: users are clearer if it's page :)

> > +static int pcpu_populate_chunk(struct pcpu_chunk *chunk, int off, int size)
> > +{
> > +	const gfp_t alloc_mask = GFP_KERNEL | __GFP_HIGHMEM | __GFP_COLD;
> 
> A designed decision has been made to not permit the caller to specify
> the allocation mode?
> 
> Usually a mistake.  Probably appropriate in this case.  Should be
> mentioned up-front and discussed a bit.

Yes, it derives from alloc_percpu which (1) zeroes, and (2) can sleep.

I chose this way-back-when because I didn't want to require atomic allocs
when it was implemented properly, and I couldn't think of a single sane use
case, so I'd rather that pioneer be the one to add the flags.

> > +	if (unlikely(!size))
> > +		return NULL;
> 
> hm.  Why do we do this?  Perhaps emitting this warning:

Yes, I prefer size++ myself, maybe with a warn_on until someone uses it.
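A minimal sketch of that suggestion (illustrative only, not the posted
patch; WARN_ON_ONCE is just one way to do the warning):

	if (unlikely(!size)) {
		WARN_ON_ONCE(1);	/* flag the first zero-size caller */
		size++;			/* serve a minimal allocation instead of failing */
	}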



> > +void free_percpu(void *ptr)
> > +{
> > +	void *addr = __pcpu_ptr_to_addr(ptr);
> > +	struct pcpu_chunk *chunk;
> > +	int off;
> > +
> > +	if (!ptr)
> > +		return;
> 
> Do we ever do this?  Should it be permitted?  Should we warn?

I want to.  Yes.  No.

Any generic free function should take NULL; it's a bug otherwise, and just
makes for gratuitous over-cautious branches in callers when we equivocate.

BTW Andrew, this was an excellent example of how to review kernel code.

Thanks,
Rusty.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 02/10] module: fix out-of-range memory access
  2009-02-18 12:04 ` [PATCH 02/10] module: fix out-of-range memory access Tejun Heo
@ 2009-02-19 12:08   ` Nick Piggin
  2009-02-20  7:16   ` Tejun Heo
  1 sibling, 0 replies; 78+ messages in thread
From: Nick Piggin @ 2009-02-19 12:08 UTC (permalink / raw)
  To: Tejun Heo; +Cc: rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo

On Wednesday 18 February 2009 23:04:28 Tejun Heo wrote:
> Impact: subtle memory access bug fix
>
> percpu_modalloc() may access pcpu_size[-1].  The access won't change
> the value by itself but it is still a read/write access and dangerous.
> Fix it.

Ditto for this one...

>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  kernel/module.c |   14 ++++++++------
>  1 files changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/module.c b/kernel/module.c
> index ba22484..d54a63e 100644
> --- a/kernel/module.c
> +++ b/kernel/module.c
> @@ -426,12 +426,14 @@ static void *percpu_modalloc(unsigned long size, unsigned long align,
>  			continue;
>
>  		/* Transfer extra to previous block. */
> -		if (pcpu_size[i-1] < 0)
> -			pcpu_size[i-1] -= extra;
> -		else
> -			pcpu_size[i-1] += extra;
> -		pcpu_size[i] -= extra;
> -		ptr += extra;
> +		if (extra) {
> +			if (pcpu_size[i-1] < 0)
> +				pcpu_size[i-1] -= extra;
> +			else
> +				pcpu_size[i-1] += extra;
> +			pcpu_size[i] -= extra;
> +			ptr += extra;
> +		}
>
>  		/* Split block if warranted */
>  		if (pcpu_size[i] - size > sizeof(unsigned long))



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 07/10] vmalloc: implement vm_area_register_early()
  2009-02-18 12:04 ` [PATCH 07/10] vmalloc: implement vm_area_register_early() Tejun Heo
  2009-02-19  0:55   ` Tejun Heo
@ 2009-02-19 12:09   ` Nick Piggin
  1 sibling, 0 replies; 78+ messages in thread
From: Nick Piggin @ 2009-02-19 12:09 UTC (permalink / raw)
  To: Tejun Heo; +Cc: rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo

On Wednesday 18 February 2009 23:04:33 Tejun Heo wrote:
> Impact: allow multiple early vm areas
>
> There are places where a kernel VM area needs to be allocated before
> vmalloc is initialized.  This is done by allocating a static vm_struct,
> initializing several fields and linking it to vmlist; later, vmalloc
> initialization picks these up from vmlist.  This is currently done
> manually and if there is more than one such area, there's no defined
> way to arbitrate who gets which address.
>
> This patch implements vm_area_register_early(), which takes a vm_struct
> with flags and size initialized, assigns an address to it and puts it
> on the vmlist.  This way, multiple early vm areas can determine which
> addresses they should use.  The only current user - alpha mm init - is
> converted to use it.

Yes, this is much cleaner. Arguably could go upstream earlier, but
if there are no other callers, probably doesn't matter so much.

Acked-by: Nick Piggin <npiggin@suse.de>

>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  arch/alpha/mm/init.c    |   20 +++++++++++++-------
>  include/linux/vmalloc.h |    1 +
>  mm/vmalloc.c            |   24 ++++++++++++++++++++++++
>  3 files changed, 38 insertions(+), 7 deletions(-)
>
> diff --git a/arch/alpha/mm/init.c b/arch/alpha/mm/init.c
> index 5d7a16e..df6df02 100644
> --- a/arch/alpha/mm/init.c
> +++ b/arch/alpha/mm/init.c
> @@ -189,9 +189,21 @@ callback_init(void * kernel_end)
>
>  	if (alpha_using_srm) {
>  		static struct vm_struct console_remap_vm;
> -		unsigned long vaddr = VMALLOC_START;
> +		unsigned long nr_pages = 0;
> +		unsigned long vaddr;
>  		unsigned long i, j;
>
> +		/* calculate needed size */
> +		for (i = 0; i < crb->map_entries; ++i)
> +			nr_pages += crb->map[i].count;
> +
> +		/* register the vm area */
> +		console_remap_vm.flags = VM_ALLOC;
> +		console_remap_vm.size = nr_pages << PAGE_SHIFT;
> +		vm_area_register_early(&console_remap_vm);
> +
> +		vaddr = (unsigned long)console_remap_vm.addr;
> +
>  		/* Set up the third level PTEs and update the virtual
>  		   addresses of the CRB entries.  */
>  		for (i = 0; i < crb->map_entries; ++i) {
> @@ -213,12 +225,6 @@ callback_init(void * kernel_end)
>  				vaddr += PAGE_SIZE;
>  			}
>  		}
> -
> -		/* Let vmalloc know that we've allocated some space.  */
> -		console_remap_vm.flags = VM_ALLOC;
> -		console_remap_vm.addr = (void *) VMALLOC_START;
> -		console_remap_vm.size = vaddr - VMALLOC_START;
> -		vmlist = &console_remap_vm;
>  	}
>
>  	callback_init_done = 1;
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index 506e762..bbc0513 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -106,5 +106,6 @@ extern long vwrite(char *buf, char *addr, unsigned long count);
>  */
>  extern rwlock_t vmlist_lock;
>  extern struct vm_struct *vmlist;
> +extern __init void vm_area_register_early(struct vm_struct *vm);
>
>  #endif /* _LINUX_VMALLOC_H */
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index c37924a..d206261 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -24,6 +24,7 @@
>  #include <linux/radix-tree.h>
>  #include <linux/rcupdate.h>
>  #include <linux/bootmem.h>
> +#include <linux/pfn.h>
>
>  #include <asm/atomic.h>
>  #include <asm/uaccess.h>
> @@ -982,6 +983,29 @@ void *vm_map_ram(struct page **pages, unsigned int count, int node, pgprot_t pro
>  }
>  EXPORT_SYMBOL(vm_map_ram);
>
> +/**
> + * vm_area_register_early - register vmap area early during boot
> + * @vm: vm_struct to register
> + * @size: size of area to register
> + *
> + * This function is used to register kernel vm area before
> + * vmalloc_init() is called.  @vm->size and @vm->flags should contain
> + * proper values on entry and other fields should be zero.  On return,
> + * vm->addr contains the allocated address.
> + *
> + * DO NOT USE THIS FUNCTION UNLESS YOU KNOW WHAT YOU'RE DOING.
> + */
> +void __init vm_area_register_early(struct vm_struct *vm)
> +{
> +	static size_t vm_init_off __initdata;
> +
> +	vm->addr = (void *)VMALLOC_START + vm_init_off;
> +	vm_init_off = PFN_ALIGN(vm_init_off + vm->size);
> +
> +	vm->next = vmlist;
> +	vmlist = vm;
> +}
> +
>  void __init vmalloc_init(void)
>  {
>  	struct vmap_area *va;



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCHSET x86/core/percpu] implement dynamic percpu allocator
  2009-02-19 11:06     ` Ingo Molnar
@ 2009-02-19 12:14       ` Rusty Russell
  2009-02-20  3:08         ` Tejun Heo
  0 siblings, 1 reply; 78+ messages in thread
From: Rusty Russell @ 2009-02-19 12:14 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Tejun Heo, tglx, x86, linux-kernel, hpa, jeremy, cpw

On Thursday 19 February 2009 21:36:31 Ingo Molnar wrote:
> 
> * Rusty Russell <rusty@rustcorp.com.au> wrote:
> 
> > On Thursday 19 February 2009 00:13:31 Ingo Molnar wrote:
> > > 
> > > * Tejun Heo <tj@kernel.org> wrote:
> > > 
> > > >   0001-vmalloc-call-flush_cache_vunmap-from-unmap_kernel.patch
> > > >   0002-module-fix-out-of-range-memory-access.patch
> > > 
> > > Hm, these two seem to be .29 material too, agreed?
> > > 
> > > Rusty, if the fixes are fine with you i can put those two 
> > > commits into tip/core/urgent straight away, the full string of 
> > > 10 commits into tip/core/percpu and thus we'd avoid duplicate 
> > > (or even conflicting) commits.
> > 
> > No, the second one is not .29 material; it's a nice, but 
> > theoretical, fix.
> 
> Can it never trigger?

Actually, checked again.  It's not even necessary AFAICT (tho a comment
would be nice):

	for (i = 0; i < pcpu_num_used; ptr += block_size(pcpu_size[i]), i++) {
		/* Extra for alignment requirement. */
		extra = ALIGN((unsigned long)ptr, align) - (unsigned long)ptr;
		BUG_ON(i == 0 && extra != 0);

		if (pcpu_size[i] < 0 || pcpu_size[i] < extra + size)
			continue;

		/* Transfer extra to previous block. */
		if (pcpu_size[i-1] < 0)
			pcpu_size[i-1] -= extra;
		else
			pcpu_size[i-1] += extra;

pcpu_size[0] is *always* negative: it's marked allocated at initialization
(it's the static per-cpu allocations).
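A comment along these lines (illustrative wording only) would capture
that invariant in the code:

	/*
	 * i == 0 can never match below: pcpu_size[0] describes the
	 * static per-cpu area and is marked allocated (negative) at
	 * init time, so pcpu_size[i-1] always refers to a valid block.
	 */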

Sorry I didn't examine more closely,
Rusty.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 08/10] vmalloc: add un/map_kernel_range_noflush()
  2009-02-18 12:04 ` [PATCH 08/10] vmalloc: add un/map_kernel_range_noflush() Tejun Heo
@ 2009-02-19 12:17   ` Nick Piggin
  2009-02-20  1:27     ` Tejun Heo
  2009-02-20  7:15   ` Subject: [PATCH 08/10 UPDATED] " Tejun Heo
  1 sibling, 1 reply; 78+ messages in thread
From: Nick Piggin @ 2009-02-19 12:17 UTC (permalink / raw)
  To: Tejun Heo; +Cc: rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo

On Wednesday 18 February 2009 23:04:34 Tejun Heo wrote:
> Impact: two more public map/unmap functions
>
> Implement map_kernel_range_noflush() and unmap_kernel_range_noflush().
> These functions respectively map and unmap address ranges in the kernel
> VM area but don't do any vcache or tlb flushing.  These will be used by
> the new percpu allocator.

Hmm... I have no real issues with this, although the caller is going
to have to be very careful not to introduce bugs (which I'm sure you
were ;)).

Maybe can you add comments specifying the minimum of which flushes
are required and when, to scare people away from using them?


>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  include/linux/vmalloc.h |    3 ++
>  mm/vmalloc.c            |   58 ++++++++++++++++++++++++++++++++++++++++++++--
>  2 files changed, 58 insertions(+), 3 deletions(-)


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 09/10] percpu: implement new dynamic percpu allocator
  2009-02-18 12:04 ` [PATCH 09/10] percpu: implement new dynamic percpu allocator Tejun Heo
  2009-02-19 10:10   ` Andrew Morton
  2009-02-19 11:51   ` Rusty Russell
@ 2009-02-19 12:36   ` Nick Piggin
  2009-02-20  3:04     ` Tejun Heo
  2009-02-20  7:30   ` [PATCH UPDATED " Tejun Heo
  3 siblings, 1 reply; 78+ messages in thread
From: Nick Piggin @ 2009-02-19 12:36 UTC (permalink / raw)
  To: Tejun Heo; +Cc: rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo

On Wednesday 18 February 2009 23:04:35 Tejun Heo wrote:
> Impact: new scalable dynamic percpu allocator which allows dynamic
>         percpu areas to be accessed the same way as static ones
>
> Implement scalable dynamic percpu allocator which can be used for both
> static and dynamic percpu areas.  This will allow static and dynamic
> areas to share faster direct access methods.  This feature is optional
> and enabled only when CONFIG_HAVE_DYNAMIC_PER_CPU_AREA is defined by
> arch.  Please read comment on top of mm/percpu.c for details.

Seems pretty nice. Wishlist: would be cool to have per-cpu virtual
memory mappings and do CPU-local percpu access via a single pointer.
Of course there would need to be some machinery and maybe a new API
to be more careful about accessing remote percpu data (that access
could perhaps just be slower and go via the linear mapping).

It would probably be quite a bit slower to do remote percpu access,
but some users never do this remote access in fastpath, and want
really fast local access (eg slab allocators).

I guess the hardest part would be doing the arch code.
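Rough illustration of that wishlist (hypothetical names and layout, no
such API exists): local access goes through one fixed per-cpu virtual
window, remote access falls back to per-cpu offsets in the linear
mapping:

	/* hypothetical per-cpu window base, mapped to different pages per CPU */
	#define PCPU_LOCAL_BASE		0xffffe00000000000UL

	static inline void *my_cpu_ptr(void *pcpu_ptr)
	{
		/* same virtual address on every CPU, different backing pages */
		return (void *)(PCPU_LOCAL_BASE + (unsigned long)pcpu_ptr);
	}

	static inline void *remote_cpu_ptr(void *pcpu_ptr, unsigned int cpu)
	{
		/* slower path for remote access, via the linear mapping */
		return (void *)((unsigned long)pcpu_ptr + per_cpu_offset(cpu));
	}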


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 01/10] vmalloc: call flush_cache_vunmap() from unmap_kernel_range()
  2009-02-19 12:06   ` Nick Piggin
@ 2009-02-19 22:36     ` David Miller
  0 siblings, 0 replies; 78+ messages in thread
From: David Miller @ 2009-02-19 22:36 UTC (permalink / raw)
  To: nickpiggin; +Cc: tj, rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo

From: Nick Piggin <nickpiggin@yahoo.com.au>
Date: Thu, 19 Feb 2009 23:06:27 +1100

> On Wednesday 18 February 2009 23:04:27 Tejun Heo wrote:
> > Impact: proper vcache flush on unmap_kernel_range()
> >
> > flush_cache_vunmap() should be called before pages are unmapped.  Add
> > a call to it in unmap_kernel_range().
> >
> > Signed-off-by: Tejun Heo <tj@kernel.org>
> 
> Shouldn't this go as a fix to mainline and even .stable?
> 
> Otherwise:
> Acked-by: Nick Piggin <npiggin@suse.de>

Agreed, this is -stable material:

Acked-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 08/10] vmalloc: add un/map_kernel_range_noflush()
  2009-02-19 12:17   ` Nick Piggin
@ 2009-02-20  1:27     ` Tejun Heo
  0 siblings, 0 replies; 78+ messages in thread
From: Tejun Heo @ 2009-02-20  1:27 UTC (permalink / raw)
  To: Nick Piggin; +Cc: rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo

Nick Piggin wrote:
> On Wednesday 18 February 2009 23:04:34 Tejun Heo wrote:
>> Impact: two more public map/unmap functions
>>
>> Implement map_kernel_range_noflush() and unmap_kernel_range_noflush().
>> These functions respectively map and unmap address ranges in the kernel
>> VM area but don't do any vcache or tlb flushing.  These will be used by
>> the new percpu allocator.
> 
> Hmm... I have no real issues with this, although the caller is going
> to have to be very careful not to introduce bugs (which I'm sure you
> were ;)).
> 
> Maybe can you add comments specifying the minimum of which flushes
> are required and when, to scare people away from using them?

Yeap, will add comments.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 09/10] percpu: implement new dynamic percpu allocator
  2009-02-19 10:10   ` Andrew Morton
  2009-02-19 11:01     ` Ingo Molnar
  2009-02-19 12:07     ` Rusty Russell
@ 2009-02-20  2:35     ` Tejun Heo
  2009-02-20  3:04       ` Andrew Morton
  2 siblings, 1 reply; 78+ messages in thread
From: Tejun Heo @ 2009-02-20  2:35 UTC (permalink / raw)
  To: Andrew Morton; +Cc: rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo

Hello, Andrew.

Andrew Morton wrote:
>> +static void *percpu_modalloc(unsigned long size, unsigned long align,
>> +			     const char *name)
>> +{
>> +	void *ptr;
>> +
>> +	if (align > PAGE_SIZE) {
>> +		printk(KERN_WARNING "%s: per-cpu alignment %li > %li\n",
>> +		       name, align, PAGE_SIZE);
> 
> It used to be the case that PAGE_SIZE has type `unsigned' on some
> architectures and `unsigned long' on others.  I don't know if that was
> fixed - probably not.

The printk has been there long before this patch in the original
percpu_modalloc().  Given the wide build coverage module.c gets, I
wonder whether the PAGE_SIZE type problem still exists.  Grepping...
Simple grep "define[ ^t]*PAGE_SIZE" doesn't show any non-UL PAGE_SIZE
definitions, although there are places where the _AC macro could be used
instead of an explicit ifdef or custom macro.
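For reference, x86 already uses the _AC() form, which keeps PAGE_SIZE
unsigned long on both 32 and 64 bit:

	#define PAGE_SHIFT	12
	#define PAGE_SIZE	(_AC(1,UL) << PAGE_SHIFT)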

>> +		align = PAGE_SIZE;
>> +	}
>> +
>> +	ptr = __alloc_percpu(size, align);
>> +	if (!ptr)
>> +		printk(KERN_WARNING
>> +		       "Could not allocate %lu bytes percpu data\n", size);
> 
> A dump_stack() here would be useful.

Hmmm... Is it customary to dump stack on allocation failure?  AFAICS,
kmalloc or vmalloc isn't doing it.

>> + * - use pcpu_setup_static() during percpu area initialization to
>> + *   setup kernel static percpu area
>> + */
> 
> afacit nobody has answered your "is num_possible_cpus() ever a lot
> larger than num_online_cpus()" question.
> 
> It is fairly important.

Heh.. People don't seem to agree on this.  I'll write in other replies.

>> +struct pcpu_chunk {
>> +	struct list_head	list;		/* linked to pcpu_slot lists */
>> +	struct rb_node		rb_node;	/* key is chunk->vm->addr */
>> +	int			free_size;
> 
> what's this?

Size of free space in the chunk.  Will add comment.

>> +	int			contig_hint;	/* max contiguous size hint */
>> +	struct vm_struct	*vm;
> 
> ?

vmalloc area for the chunk.

>> +	int			map_used;	/* # of map entries used */
>> +	int			map_alloc;	/* # of map entries allocated */
>> +	int			*map;
> 
> ?

And, area allocation map.

>> +	struct page		*page[];	/* #cpus * UNIT_PAGES */
> 
> "pages" ;)

I kind of bounce between singular and plural when naming arrays, list
heads or whatever collective data structures.  Plural seems better
suited for the field itself but when accessing the elements it's more
natural to use singular form and the naming convention is mixed all
over the kernel.  One way doesn't really have much technical advantage
over the other so it's still better to have some consistency in place.
Maybe we need to decide on either one, put it in code style and stick
with it?

>> +#define SIZEOF_STRUCT_PCPU_CHUNK					\
>> +	(sizeof(struct pcpu_chunk) +					\
>> +	 (num_possible_cpus() << PCPU_UNIT_PAGES_SHIFT) * sizeof(struct page *))
> 
> This macro generates real code.  It is misleading to pretend that it is
> a compile-time constant.  Suggest that it be converted to a plain old C
> function.

Please see below.

>> +static int __pcpu_unit_pages_shift = PCPU_MIN_UNIT_PAGES_SHIFT;
>> +static int __pcpu_unit_pages;
>> +static int __pcpu_unit_shift;
>> +static int __pcpu_unit_size;
>> +static int __pcpu_chunk_size;
>> +static int __pcpu_nr_slots;
>> +
>> +/* currently everything is power of two, there's no hard dependency on it tho */
>> +#define PCPU_UNIT_PAGES_SHIFT	((int)__pcpu_unit_pages_shift)
>> +#define PCPU_UNIT_PAGES		((int)__pcpu_unit_pages)
>> +#define PCPU_UNIT_SHIFT		((int)__pcpu_unit_shift)
>> +#define PCPU_UNIT_SIZE		((int)__pcpu_unit_size)
>> +#define PCPU_CHUNK_SIZE		((int)__pcpu_chunk_size)
>> +#define PCPU_NR_SLOTS		((int)__pcpu_nr_slots)
> 
> hm.  Why do these exist?

Because those parameters are initialized during boot but should work
as constant once they're set up.  I wanted to make sure that these
values don't get assigned to or changed and make that clear by making
them look like constants as except for the init code they're constants
for all purposes.  So, they aren't constants in the technical sense
but are in their semantics, which I think is more important.  Changing
them isn't difficult at all but I think it's better this way.

>> +/**
>> + * pcpu_realloc - versatile realloc
>> + * @p: the current pointer (can be NULL for new allocations)
>> + * @size: the current size (can be 0 for new allocations)
>> + * @new_size: the wanted new size (can be 0 for free)
> 
> So the allocator doesn't internally record the size of each hunk?
> 
> <squints at the undocumented `free_size'>

This one is a utility function used only inside allocator
implementation as I wanted something more robust than krealloc.  It
has nothing to do with the free_size or chunk management.
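For illustration, the semantics described in the kerneldoc boil down to
something like this (a sketch only, not the actual mm/percpu.c helper,
which per the discussion above mixes kmalloc and vmalloc depending on
size):

	static void *pcpu_realloc_sketch(void *p, size_t size, size_t new_size)
	{
		void *new = NULL;

		if (new_size) {
			new = vmalloc(new_size);
			if (!new)
				return NULL;
			memset(new, 0, new_size);
			if (p)
				memcpy(new, p, min(size, new_size));
		}
		vfree(p);		/* vfree(NULL) is a no-op */
		return new;
	}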

...
> This function can be called under spinlock if new_size>PAGE_SIZE and
> the kernel won't (I think) warn.  If new_size<=PAGE_SIZE, the kernel
> will warn.
> 
> Methinks vmalloc() should have a might_sleep().  Dunno.

I can add that but it's an internal utility function which is called
from a few well known obvious call sites to replace krealloc, so I
don't think it's a big deal.  Maybe the correct thing to do is adding
might_sleep() to vmalloc?
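For what it's worth, the annotation being discussed would just sit at
vmalloc's entry point, roughly like this (a sketch; the real function
goes through __vmalloc_node internally):

	void *vmalloc(unsigned long size)
	{
		might_sleep();
		return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL);
	}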

>> +/**
>> + * pcpu_chunk_relocate - put chunk in the appropriate chunk slot
>> + * @chunk: chunk of interest
>> + * @oslot: the previous slot it was on
>> + *
>> + * This function is called after an allocation or free changed @chunk.
>> + * New slot according to the changed state is determined and @chunk is
>> + * moved to the slot.
> 
> Locking requirements?

I thought "one mutex to rule them all" comment was enough.  I'll add
more description about locking to the comment at the top.

>> +		chunk->free_size -= chunk->map[i];
>> +		chunk->map[i] = -chunk->map[i];
> 
> When pcpu_chunk.map gets documented, please also explain the
> significance of negative entries in there.

Hmmm... this is already explained in the comment at the top.  I
generally find it more useful to have a general overview of the design
at the top rather than piecemeal descriptions here and there, so I try
to write a decent description of the design at the top of the file.
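To illustrate the convention being referred to (example values only):
each chunk->map[] entry is an area size in bytes, positive while the
area is free and negated once it is allocated:

	int example_map[] = {
		-128,	/* 128 bytes allocated (e.g. by an earlier alloc) */
		 512,	/* 512 bytes free */
		 -64,	/*  64 bytes allocated */
		4096,	/* rest of the unit, free */
	};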

>> +		pcpu_chunk_relocate(chunk, oslot);
>> +		return off;
>> +	}
>> +
>> +	chunk->contig_hint = max_contig;	/* fully scanned */
>> +	pcpu_chunk_relocate(chunk, oslot);
>> +	return -ENOSPC;
> 
> "No space left on device".
> 
> This is not a disk drive.

That error value is an internal code to notify the caller to extend the
area and retry, and it won't be visible to the outside.  I originally used
-EAGAIN but -ENOSPC seemed more fitting.  Which value is used
eventually doesn't really matter tho.  I'll add a comment to explain
what's going on.

>> +	for (i = page_start; i < page_end; i++) {
>> +		for_each_possible_cpu(cpu) {
>> +			struct page **pagep = pcpu_chunk_pagep(chunk, cpu, i);
>> +
>> +			if (!*pagep)
>> +				continue;
>> +
>> +			__free_page(*pagep);
>> +			*pagep = NULL;
> 
> Why did *pagep get zeroed?  Needs comment?

Cuz chunks can be partially occupied and allocation status is
represented by non-NULL values in the page pointer array.  Will add
comment.

>> +/**
>> + * pcpu_populate_chunk - populate and map an area of a pcpu_chunk
>> + * @chunk: chunk of interest
>> + * @off: offset to the area to populate
>> + * @size: size of the area to populate
>> + *
>> + * For each cpu, populate and map pages [@page_start,@page_end) into
>> + * @chunk.  The area is cleared on return.
>> + */
>> +static int pcpu_populate_chunk(struct pcpu_chunk *chunk, int off, int size)
>> +{
>> +	const gfp_t alloc_mask = GFP_KERNEL | __GFP_HIGHMEM | __GFP_COLD;
> 
> A designed decision has been made to not permit the caller to specify
> the allocation mode?

The design decision was inherited from the original percpu allocator.

> Usually a mistake.  Probably appropriate in this case.  Should be
> mentioned up-front and discussed a bit.

percpu allocator had the interface but nobody used it till now.  I can
put the whole thing under a spinlock (and increase locking granularity)
and add a gfp mask but given the current usage I'm a bit skeptical
whether it would be necessary.

>> +static void free_pcpu_chunk(struct pcpu_chunk *chunk)
>> +{
>> +	if (!chunk)
>> +		return;
> 
> afaict this test is unneeded.

I think it's better to put NULL test to free functions in general.
People expect free functions to take NULL argument and swallow it.
Doing it otherwise adds unnecessary danger for subtle bugs later on.

>> +	if (chunk->vm)
>> +		free_vm_area(chunk->vm);
> 
> I didn't check whether this one is needed.

The function is also called from the allocation failure path, so it is
necessary, and even if it's not, I think it's better to make free
functions (or any kind of backout/shutdown functions) robust.

>> +void *__alloc_percpu(size_t size, size_t align)
>> +{
>> +	void *ptr = NULL;
>> +	struct pcpu_chunk *chunk;
>> +	int slot, off, err;
>> +
>> +	if (unlikely(!size))
>> +		return NULL;
> 
> hm.  Why do we do this?  Perhaps emitting this warning:
> 
>> +	if (unlikely(size > PAGE_SIZE << PCPU_MIN_UNIT_PAGES_SHIFT ||
>> +		     align > PAGE_SIZE)) {
>> +		printk(KERN_WARNING "illegal size (%zu) or align (%zu) for "
>> +		       "percpu allocation\n", size, align);
> 
> would be more appropriate.

Maybe.  Dunno.  Returning NULL is what malloc/calloc are allowed to do
at least.  kmalloc() returns a special token.

>> +		return NULL;
>> +	}
>> +
>> +	mutex_lock(&pcpu_mutex);
> 
> OK, so we do GFP_KERNEL allocations under this lock, so vast amounts of
> kernel code (filesystems, page reclaim, block/io) are not allowed to do
> per-cpu allocations.
> 
> I doubt if there's a problem with that, but it's worth pointing out.

Yeah, it's the same restriction inherited from the original percpu
allocator.  Rusty seems to think it's enough and I wanted to keep
things simple.  If the gfp flag thing is necessary it can easily be
changed by putting the area allocation under a spinlock and doing the
page allocation without the lock, but given the possibly large number
of page allocations a percpu allocation has to do, I don't think
allowing the function to be called from non-preemptive context is a
wise thing.

>> +	/* allocate area */
>> +	for (slot = pcpu_size_to_slot(size); slot < PCPU_NR_SLOTS; slot++) {
>> +		list_for_each_entry(chunk, &pcpu_slot[slot], list) {
>> +			if (size > chunk->contig_hint)
>> +				continue;
>> +			err = pcpu_alloc_area(chunk, size, align);
>> +			if (err >= 0) {
>> +				off = err;
>> +				goto area_found;
>> +			}
>> +			if (err != -ENOSPC)
>> +				goto out_unlock;
>> +		}
>> +	}
>> +
>> +	/* hmmm... no space left, create a new chunk */
>> +	err = -ENOMEM;
> 
> This statement is unneeded.
>
>> +	chunk = alloc_pcpu_chunk();
>> +	if (!chunk)
>> +		goto out_unlock;
>> +	pcpu_chunk_relocate(chunk, -1);
>> +	pcpu_chunk_addr_insert(chunk);
>> +
>> +	err = pcpu_alloc_area(chunk, size, align);
>> +	if (err < 0)
>> +		goto out_unlock;
>> +	off = err;
> 
> It would be cleaner to do
> 
> 	off = pcpu_alloc_area(chunk, size, align);
> 	if (off < 0)
> 		goto out_unlock;

Yeah, right.  The err thing is remnant of different interface where it
returned ERR_PTR value.  I'll remove it.

>> +/**
>> + * free_percpu - free percpu area
>> + * @ptr: pointer to area to free
>> + *
>> + * Free percpu area @ptr.  Might sleep.
>> + */
>> +void free_percpu(void *ptr)
>> +{
>> +	void *addr = __pcpu_ptr_to_addr(ptr);
>> +	struct pcpu_chunk *chunk;
>> +	int off;
>> +
>> +	if (!ptr)
>> +		return;
> 
> Do we ever do this?  Should it be permitted?  Should we warn?

Dunno but should be allowed, yes, no.  :-)

>> +size_t __init pcpu_setup_static(pcpu_populate_pte_fn_t populate_pte_fn,
>> +				struct page **pages, size_t cpu_size)
>> +{
>> +	static struct vm_struct static_vm;
>> +	struct pcpu_chunk *static_chunk;
>> +	int nr_cpu_pages = DIV_ROUND_UP(cpu_size, PAGE_SIZE);
>> +	unsigned int cpu;
>> +	int err, i;
>> +
>> +	while (1 << __pcpu_unit_pages_shift < nr_cpu_pages)
>> +		__pcpu_unit_pages_shift++;
> 
> Is there an ilog2() hiding in there somewhere?

Will convert.
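The conversion could look roughly like this (a sketch using the
linux/log2.h helpers, not the final patch):

	__pcpu_unit_pages_shift = max(__pcpu_unit_pages_shift,
				      (int)ilog2(roundup_pow_of_two(nr_cpu_pages)));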

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 09/10] percpu: implement new dynamic percpu allocator
  2009-02-19 11:01     ` Ingo Molnar
@ 2009-02-20  2:45       ` Tejun Heo
  0 siblings, 0 replies; 78+ messages in thread
From: Tejun Heo @ 2009-02-20  2:45 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw

Ingo Molnar wrote:
> * Andrew Morton <akpm@linux-foundation.org> wrote:
> 
>>> + * To use this allocator, arch code should do the followings.
>>> + *
>>> + * - define CONFIG_HAVE_DYNAMIC_PER_CPU_AREA
>>> + *
>>> + * - define __addr_to_pcpu_ptr() and __pcpu_ptr_to_addr() to translate
>>> + *   regular address to percpu pointer and back
>>> + *
>>> + * - use pcpu_setup_static() during percpu area initialization to
>>> + *   setup kernel static percpu area
>>> + */
>> afacit nobody has answered your "is num_possible_cpus() ever a 
>> lot larger than num_online_cpus()" question.
>>
>> It is fairly important.
> 
> yeah.
> 
> On x86 we limit num_possible_cpus() at boot time from NR_CPUS to 
> the BIOS-enumerated set of possible CPUs - i.e. the two will 
> always be either equal, or be very close to each other.
> 
> ( there used to be broken early BIOSes that enumerated more CPUs 
>   than needed but it's very rare and because it also wastes BIOS 
>   RAM/ROM it's something they'll usually avoid even if they dont 
>   care about Linux. )
> 
> So this should be a pretty OK assumption.

Hmm... this is a confusing conversation.  Andrew seems to say that not
allocating memory for offline cpus is fairly important and Ingo's
reply starts with yeah but draws the opposite conclusion.  Or is my
English failing me again?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 09/10] percpu: implement new dynamic percpu allocator
  2009-02-19 11:51   ` Rusty Russell
@ 2009-02-20  3:01     ` Tejun Heo
  2009-02-20  3:02       ` Tejun Heo
  2009-02-24  2:56       ` Rusty Russell
  0 siblings, 2 replies; 78+ messages in thread
From: Tejun Heo @ 2009-02-20  3:01 UTC (permalink / raw)
  To: Rusty Russell; +Cc: tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo, tony.luck

Hello, Rusty.

Rusty Russell wrote:
> On Wednesday 18 February 2009 22:34:35 Tejun Heo wrote:
>> Impact: new scalable dynamic percpu allocator which allows dynamic
>>         percpu areas to be accessed the same way as static ones
>>
>> Implement scalable dynamic percpu allocator which can be used for both
>> static and dynamic percpu areas.  This will allow static and dynamic
>> areas to share faster direct access methods.  This feature is optional
>> and enabled only when CONFIG_HAVE_DYNAMIC_PER_CPU_AREA is defined by
>> arch.  Please read comment on top of mm/percpu.c for details.
> 
> Hi Tejun,
> 
>    One question.  Are you thinking that to be defined by every SMP arch
> long-term?

Yeap, definitely.

> Because there are benefits in having &<percpuvar> == valid
> percpuptr, such as passing them around as parameters.  If so, IA64
> will want a dedicated per-cpu area for statics (tho it can probably
> just map it somehow, but it has to be 64k).

Hmmm...  Don't have much idea about ia64 and its magic 64k.  Can it
somehow be used for the first chunk?

>    It'd also be nice to use your generalised module_percpu allocator for the
> !CONFIG_HAVE_DYNAMIC_PER_CPU_AREA case, but doesn't really matter if that's
> temporary anyway.

Yeap, once the conversion is complete, the old allocator will go away
so there's no reason to put more work into it.

>> +#define PCPU_UNIT_PAGES_SHIFT	((int)__pcpu_unit_pages_shift)
>> +#define PCPU_UNIT_PAGES		((int)__pcpu_unit_pages)
>> +#define PCPU_UNIT_SHIFT		((int)__pcpu_unit_shift)
>> +#define PCPU_UNIT_SIZE		((int)__pcpu_unit_size)
>> +#define PCPU_CHUNK_SIZE		((int)__pcpu_chunk_size)
>> +#define PCPU_NR_SLOTS		((int)__pcpu_nr_slots)
> 
> These pseudo-constants seem like a really weird thing to do to me.

I explained this in the reply to Andrew's comment.  It's
non-really-constant-but-should-be-considered-so-by-users thing.  Is it
too weird?  Even if I add a comment explaining it?

> And AFAICT you have the requirement that PCPU_UNIT_PAGES*PAGE_SIZE >=
> sizeof(.data.percpu).  Should probably note that somewhere.

__pcpu_unit_pages_shift is adjusted automatically according to
sizeof(.data.percpu), so it will adapt as necessary.  After the
initial adjustment, it should be considered constant, so the above
seemingly weird hack.

>> +static DEFINE_MUTEX(pcpu_mutex);		/* one mutex to rule them all */
>> +static struct list_head *pcpu_slot;		/* chunk list slots */
>> +static struct rb_root pcpu_addr_root = RB_ROOT;	/* chunks by address */
> 
> rbtree might be overkill on first cut.  I'm bearing in mind that Christoph L
> had a nice patch to use dynamic percpu allocation in the sl*b allocators;
> which would mean this needs to only use get_free_page.

Hmmm... the reverse mapping can be piggybacked on vmalloc by adding a
private pointer to the vm_struct but rbtree isn't too difficult to use
so I just did it directly.  Nick, what do you think about adding
private field to vm_struct and providing a reverse map function?
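A sketch of that piggy-back idea (hypothetical: vm_struct has no
'private' field today, and find_vm_area() is assumed as the lookup):

	/* chunk lookup via the vmalloc area instead of a private rbtree */
	static struct pcpu_chunk *pcpu_chunk_from_addr(void *addr)
	{
		struct vm_struct *vm = find_vm_area(addr);

		return vm ? vm->private : NULL;	/* 'private' would be new */
	}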

As for the sl*b allocation thing, can you please explain in more
detail or point me to the patches / threads?

Thanks.  :-)

-- 
tejun

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 09/10] percpu: implement new dynamic percpu allocator
  2009-02-20  3:01     ` Tejun Heo
@ 2009-02-20  3:02       ` Tejun Heo
  2009-02-24  2:56       ` Rusty Russell
  1 sibling, 0 replies; 78+ messages in thread
From: Tejun Heo @ 2009-02-20  3:02 UTC (permalink / raw)
  To: Rusty Russell
  Cc: tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo, tony.luck, Nick Piggin

Oops, forgot to cc Nick.  cc'ing and quoting whole body.

Tejun Heo wrote:
> Hello, Rusty.
> 
> Rusty Russell wrote:
>> On Wednesday 18 February 2009 22:34:35 Tejun Heo wrote:
>>> Impact: new scalable dynamic percpu allocator which allows dynamic
>>>         percpu areas to be accessed the same way as static ones
>>>
>>> Implement scalable dynamic percpu allocator which can be used for both
>>> static and dynamic percpu areas.  This will allow static and dynamic
>>> areas to share faster direct access methods.  This feature is optional
>>> and enabled only when CONFIG_HAVE_DYNAMIC_PER_CPU_AREA is defined by
>>> arch.  Please read comment on top of mm/percpu.c for details.
>> Hi Tejun,
>>
>>    One question.  Are you thinking that to be defined by every SMP arch
>> long-term?
> 
> Yeap, definitely.
> 
>> Because there are benefits in having &<percpuvar> == valid
>> percpuptr, such as passing them around as parameters.  If so, IA64
>> will want a dedicated per-cpu area for statics (tho it can probably
>> just map it somehow, but it has to be 64k).
> 
> Hmmm...  Don't have much idea about ia64 and its magic 64k.  Can it
> somehow be used for the first chunk?
> 
>>    It'd also be nice to use your generalised module_percpu allocator for the
>> !CONFIG_HAVE_DYNAMIC_PER_CPU_AREA case, but doesn't really matter if that's
>> temporary anyway.
> 
> Yeap, once the conversion is complete, the old allocator will go away
> so there's no reason to put more work into it.
> 
>>> +#define PCPU_UNIT_PAGES_SHIFT	((int)__pcpu_unit_pages_shift)
>>> +#define PCPU_UNIT_PAGES		((int)__pcpu_unit_pages)
>>> +#define PCPU_UNIT_SHIFT		((int)__pcpu_unit_shift)
>>> +#define PCPU_UNIT_SIZE		((int)__pcpu_unit_size)
>>> +#define PCPU_CHUNK_SIZE		((int)__pcpu_chunk_size)
>>> +#define PCPU_NR_SLOTS		((int)__pcpu_nr_slots)
>> These pseudo-constants seem like a really weird thing to do to me.
> 
> I explained this in the reply to Andrew's comment.  It's
> non-really-constant-but-should-be-considered-so-by-users thing.  Is it
> too weird?  Even if I add a comment explaining it?
> 
>> And AFAICT you have the requirement that PCPU_UNIT_PAGES*PAGE_SIZE >=
>> sizeof(.data.percpu).  Should probably note that somewhere.
> 
> __pcpu_unit_pages_shift is adjusted automatically according to
> sizeof(.data.percpu), so it will adapt as necessary.  After the
> initial adjustment, it should be considered constant, so the above
> seemingly weird hack.
> 
>>> +static DEFINE_MUTEX(pcpu_mutex);		/* one mutex to rule them all */
>>> +static struct list_head *pcpu_slot;		/* chunk list slots */
>>> +static struct rb_root pcpu_addr_root = RB_ROOT;	/* chunks by address */
>> rbtree might be overkill on first cut.  I'm bearing in mind that Christoph L
>> had a nice patch to use dynamic percpu allocation in the sl*b allocators;
>> which would mean this needs to only use get_free_page.
> 
> Hmmm... the reverse mapping can be piggybacked on vmalloc by adding a
> private pointer to the vm_struct but rbtree isn't too difficult to use
> so I just did it directly.  Nick, what do you think about adding
> private field to vm_struct and providing a reverse map function?
> 
> As for the sl*b allocation thing, can you please explain in more
> detail or point me to the patches / threads?
> 
> Thanks.  :-)
> 


-- 
tejun

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 09/10] percpu: implement new dynamic percpu allocator
  2009-02-20  2:35     ` Tejun Heo
@ 2009-02-20  3:04       ` Andrew Morton
  2009-02-20  5:29         ` Tejun Heo
  2009-02-24  2:52         ` Rusty Russell
  0 siblings, 2 replies; 78+ messages in thread
From: Andrew Morton @ 2009-02-20  3:04 UTC (permalink / raw)
  To: Tejun Heo; +Cc: rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo

On Fri, 20 Feb 2009 11:35:01 +0900 Tejun Heo <tj@kernel.org> wrote:

> Hello, Andrew.
> 
> >> +		align = PAGE_SIZE;
> >> +	}
> >> +
> >> +	ptr = __alloc_percpu(size, align);
> >> +	if (!ptr)
> >> +		printk(KERN_WARNING
> >> +		       "Could not allocate %lu bytes percpu data\n", size);
> > 
> > A dump_stack() here would be useful.
> 
> Hmmm... Is it customary to dump stack on allocation failure?  AFAICS,
> kmalloc or vmalloc isn't doing it.

The page allocator (and hence kmalloc) will do it.

But a more important question is "is a trace useful".  I'd say "yes". 
Because being told that something ran out of memory isn't terribly
useful.  The very first question is "OK, well _what_ ran out of
memory?".

> >> +#define SIZEOF_STRUCT_PCPU_CHUNK					\
> >> +	(sizeof(struct pcpu_chunk) +					\
> >> +	 (num_possible_cpus() << PCPU_UNIT_PAGES_SHIFT) * sizeof(struct page *))
> > 
> > This macro generates real code.  It is misleading to pretend that it is
> > a compile-time constant.  Suggest that it be converted to a plain old C
> > function.
> 
> Please see below.
> 
> >> +static int __pcpu_unit_pages_shift = PCPU_MIN_UNIT_PAGES_SHIFT;
> >> +static int __pcpu_unit_pages;
> >> +static int __pcpu_unit_shift;
> >> +static int __pcpu_unit_size;
> >> +static int __pcpu_chunk_size;
> >> +static int __pcpu_nr_slots;
> >> +
> >> +/* currently everything is power of two, there's no hard dependency on it tho */
> >> +#define PCPU_UNIT_PAGES_SHIFT	((int)__pcpu_unit_pages_shift)
> >> +#define PCPU_UNIT_PAGES		((int)__pcpu_unit_pages)
> >> +#define PCPU_UNIT_SHIFT		((int)__pcpu_unit_shift)
> >> +#define PCPU_UNIT_SIZE		((int)__pcpu_unit_size)
> >> +#define PCPU_CHUNK_SIZE		((int)__pcpu_chunk_size)
> >> +#define PCPU_NR_SLOTS		((int)__pcpu_nr_slots)
> > 
> > hm.  Why do these exist?
> 
> Because those parameters are initialized during boot but should work
> as constant once they're set up.  I wanted to make sure that these
> values don't get assigned to or changed and make that clear by making
> them look like constants as except for the init code they're constants
> for all purposes.

Well, there are an infinite number of ways in which people can later introduce
bugs.  Why defend against just one?  Particularly in a way which mucks up
the code?

If you really want to defend against alterations, access these things
via function calls rather than via nastycasts which masquerade as
constants?

static inline int pcpu_unit_pages_shift(void)
{
	return __pcpu_unit_pages_shift;
}

> >> +/**
> >> + * pcpu_realloc - versatile realloc
> >> + * @p: the current pointer (can be NULL for new allocations)
> >> + * @size: the current size (can be 0 for new allocations)
> >> + * @new_size: the wanted new size (can be 0 for free)
> > 
> > So the allocator doesn't internally record the size of each hunk?
> > 
> > <squints at the undocumented `free_size'>
> 
> This one is a utility function used only inside allocator
> implementation as I wanted something more robust than krealloc.  It
> has nothing to do with the free_size or chunk management.
> 
> ...
> > This function can be called under spinlock if new_size>PAGE_SIZE and
> > the kernel won't (I think) warn.  If new_size<=PAGE_SIZE, the kernel
> > will warn.
> > 
> > Methinks vmalloc() should have a might_sleep().  Dunno.
> 
> I can add that but it's an internal utility function which is called
> from a few well known obvious call sites to replace krealloc, so I
> don't thinks it's a big deal.  Maybe the correct thing to do is adding
> might_sleep() to vmalloc?

I think so.  It perhaps already has one, via indirect means.

> >> + * pcpu_populate_chunk - populate and map an area of a pcpu_chunk
> >> + * @chunk: chunk of interest
> >> + * @off: offset to the area to populate
> >> + * @size: size of the area to populate
> >> + *
> >> + * For each cpu, populate and map pages [@page_start,@page_end) into
> >> + * @chunk.  The area is cleared on return.
> >> + */
> >> +static int pcpu_populate_chunk(struct pcpu_chunk *chunk, int off, int size)
> >> +{
> >> +	const gfp_t alloc_mask = GFP_KERNEL | __GFP_HIGHMEM | __GFP_COLD;
> > 
> > A designed decision has been made to not permit the caller to specify
> > the allocation mode?
> 
> The design decision was inherited from the original percpu allocator.

That doesn't mean it's right ;)

But I don't recall us ever wishing that the gfp_t arg had been
included.

> >> +static void free_pcpu_chunk(struct pcpu_chunk *chunk)
> >> +{
> >> +	if (!chunk)
> >> +		return;
> > 
> > afaict this test is unneeded.
> 
> I think it's better to put NULL test to free functions in general.
> People expect free functions to take NULL argument and swallow it.
> Doint it otherwise adds unnecessary danger for subtle bugs later on.

It's a dumb convention.  In the vast majority of cases the pointer is
not NULL.  We add a test-n-branch to 99.999999999% of cases just to
save three seconds of programmer effort a single time.

A better design would have been to have kfree() and
kfree_might_be_null().  (We can still do that by adding a new
kfree_im_not_stupid() which doesn't do the check).

It's a bad tradeoff to expend billions of cycles on millions of
machines to save a little programmer effort.

(And we're not consistent anyway - see pci_free_consistent)

> 
> >> +void *__alloc_percpu(size_t size, size_t align)
> >> +{
> >> +	void *ptr = NULL;
> >> +	struct pcpu_chunk *chunk;
> >> +	int slot, off, err;
> >> +
> >> +	if (unlikely(!size))
> >> +		return NULL;
> > 
> > hm.  Why do we do this?  Perhaps emitting this warning:
> > 
> >> +	if (unlikely(size > PAGE_SIZE << PCPU_MIN_UNIT_PAGES_SHIFT ||
> >> +		     align > PAGE_SIZE)) {
> >> +		printk(KERN_WARNING "illegal size (%zu) or align (%zu) for "
> >> +		       "percpu allocation\n", size, align);
> > 
> > would be more appropriate.
> 
> Maybe.  Dunno.  Returning NULL is what malloc/calloc are allowed to do
> at least.

Yes, but it is probably a programming error in the caller.  We want to
report that asap, not hide it.  The buggy caller will probably now
assume that the memory allocation failed and will bail out altogether,
leaving everyone all confused.

> >> +/**
> >> + * free_percpu - free percpu area
> >> + * @ptr: pointer to area to free
> >> + *
> >> + * Free percpu area @ptr.  Might sleep.
> >> + */
> >> +void free_percpu(void *ptr)
> >> +{
> >> +	void *addr = __pcpu_ptr_to_addr(ptr);
> >> +	struct pcpu_chunk *chunk;
> >> +	int off;
> >> +
> >> +	if (!ptr)
> >> +		return;
> > 
> > Do we ever do this?  Should it be permitted?  Should we warn?
> 
> Dunno but should be allowed, yes, no.  :-)

It adds cycles and hides caller bugs.  Zap it!

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 09/10] percpu: implement new dynamic percpu allocator
  2009-02-19 12:36   ` Nick Piggin
@ 2009-02-20  3:04     ` Tejun Heo
  0 siblings, 0 replies; 78+ messages in thread
From: Tejun Heo @ 2009-02-20  3:04 UTC (permalink / raw)
  To: Nick Piggin; +Cc: rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo

Hello, Nick.

Nick Piggin wrote:
> On Wednesday 18 February 2009 23:04:35 Tejun Heo wrote:
>> Impact: new scalable dynamic percpu allocator which allows dynamic
>>         percpu areas to be accessed the same way as static ones
>>
>> Implement scalable dynamic percpu allocator which can be used for both
>> static and dynamic percpu areas.  This will allow static and dynamic
>> areas to share faster direct access methods.  This feature is optional
>> and enabled only when CONFIG_HAVE_DYNAMIC_PER_CPU_AREA is defined by
>> arch.  Please read comment on top of mm/percpu.c for details.
> 
> Seems pretty nice. Wishlist: would be cool to have per-cpu virtual
> memory mappings and do CPU-local percpu access via a single pointer.
> Of course there would need to be some machinery and maybe a new API
> to be more careful about accessing remote percpu data (that access
> could perhaps just be slower and go via the linear mapping).

Yeah, that's what's scheduled next.  Direct percpu accessors and
probably consolidation of local_t into percpu accessors.  Once the dust
around the allocator itself settles down, I'll work on those.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCHSET x86/core/percpu] implement dynamic percpu allocator
  2009-02-19 12:14       ` Rusty Russell
@ 2009-02-20  3:08         ` Tejun Heo
  2009-02-20  5:36           ` Tejun Heo
  0 siblings, 1 reply; 78+ messages in thread
From: Tejun Heo @ 2009-02-20  3:08 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Ingo Molnar, tglx, x86, linux-kernel, hpa, jeremy, cpw

Rusty Russell wrote:
>>>> Rusty, if the fixes are fine with you i can put those two 
>>>> commits into tip/core/urgent straight away, the full string of 
>>>> 10 commits into tip/core/percpu and thus we'd avoid duplicate 
>>>> (or even conflicting) commits.
>>> No, the second one is not .29 material; it's a nice, but 
>>> theoretical, fix.
>> Can it never trigger?
> 
> Actually, checked again.  It's not even necessary AFAICT (tho a comment
> would be nice):
> 
> 	for (i = 0; i < pcpu_num_used; ptr += block_size(pcpu_size[i]), i++) {
> 		/* Extra for alignment requirement. */
> 		extra = ALIGN((unsigned long)ptr, align) - (unsigned long)ptr;
> 		BUG_ON(i == 0 && extra != 0);
> 
> 		if (pcpu_size[i] < 0 || pcpu_size[i] < extra + size)
> 			continue;
> 
> 		/* Transfer extra to previous block. */
> 		if (pcpu_size[i-1] < 0)
> 			pcpu_size[i-1] -= extra;
> 		else
> 			pcpu_size[i-1] += extra;
> 
> pcpu_size[0] is *always* negative: it's marked allocated at initialization
> (it's the static per-cpu allocations).
> 
> Sorry I didn't examine more closely,

Ah... okay.  Right.  I took the code and used it in the chunk area
allocator, where 0 isn't guaranteed to be occupied, saw the problem
triggering and then assumed the modalloc allocator shared the same
problem.  So it's an unnecessary fix, but I think it really needs some
explanation.

What to do about #tj-percpu?  Ingo, do you want me to rebase the tree sans
the second one?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCHSET x86/core/percpu] implement dynamic percpu allocator
  2009-02-19 11:07   ` Ingo Molnar
@ 2009-02-20  3:17     ` Tejun Heo
  2009-02-20  9:32       ` Ingo Molnar
  0 siblings, 1 reply; 78+ messages in thread
From: Tejun Heo @ 2009-02-20  3:17 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw

Hello, Ingo.

Ingo Molnar wrote:
> * Tejun Heo <tj@kernel.org> wrote:
> 
>> Tejun Heo wrote:
>>>   One trick we can do is to reserve the initial chunk in non-vmalloc
>>>   area so that at least the static cpu ones and whatever gets
>>>   allocated in the first chunk is served by regular large page
>>>   mappings.  Given that those are most frequent visited ones, this
>>>   could be a nice compromise - no noticeable penalty for usual cases
>>>   yet allowing scalability for unusual cases.  If this is something
>>>   which can be agreed on, I'll pursue this.
>> I've given more thought to this and it actually will solve 
>> most of issues for non-NUMA but it can't be done for NUMA.  
>> Any better ideas?
> 
> It could be allocated via NUMA-aware bootmem allocations.

Hmmm... not really.  Here's what I was planning to do on non-NUMA.

  Allocate the first chunk using alloc_bootmem().  After setting up
  each unit, call free_bootmem() to give back the extra space, keeping
  the initialized static area plus some amount of free space which
  should be enough for common cases.  Mark the returned space as used
  in the chunk map.

This will allow a sane chunk size and scalability without adding TLB
pressure, so it's actually pretty sweet.  Unfortunately, this doesn't
really work for NUMA because we don't have control over how NUMA
addresses are laid out, so we can't allocate a contiguous NUMA-correct
chunk without remapping.  And if we remap, we can't give back what's
left to the allocator.  Giving back the original address doubles TLB
usage and giving back the remapped address breaks __pa/__va.  :-(
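A very rough sketch of the non-NUMA plan above (names, sizes and the
exact bookkeeping are assumptions, not the eventual implementation):

	void __init pcpu_embed_first_chunk_sketch(size_t static_size,
						  size_t reserve, size_t unit_size)
	{
		char *base = alloc_bootmem(unit_size * num_possible_cpus());
		size_t keep = PFN_ALIGN(static_size + reserve);
		unsigned int cpu;

		for_each_possible_cpu(cpu) {
			char *unit = base + cpu * unit_size;

			memcpy(unit, __per_cpu_start, static_size);
			/* hand the tail of each unit back to bootmem... */
			free_bootmem(__pa(unit + keep), unit_size - keep);
		}
		/* ...and mark those tails as used in the first chunk's map */
	}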

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 09/10] percpu: implement new dynamic percpu allocator
  2009-02-20  3:04       ` Andrew Morton
@ 2009-02-20  5:29         ` Tejun Heo
  2009-02-24  2:52         ` Rusty Russell
  1 sibling, 0 replies; 78+ messages in thread
From: Tejun Heo @ 2009-02-20  5:29 UTC (permalink / raw)
  To: Andrew Morton; +Cc: rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo

Hello, Andrew.

Andrew Morton wrote:
>> Hmmm... Is it customary to dump stack on allocation failure?  AFAICS,
>> kmalloc or vmalloc isn't doing it.
> 
> The page allocator (and hence kmalloc) will do it.
> 
> But a more important question is "is a trace useful".  I'd say "yes". 
> Because being told that something ran out of memory isn't terribly
> useful.  The very first question is "OK, well _what_ ran out of
> memory?".

Then the page allocator will do it for the percpu allocator too, except
for get_vm_area() failures.  I think the right place to add dump_stack()
would be the failure path of get_vm_area().  Right?
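Something along these lines, say (placement and names are a sketch,
not an actual patch):

	struct vm_struct *vm = get_vm_area(chunk_size, VM_ALLOC);

	if (!vm) {
		printk(KERN_WARNING
		       "percpu: could not get %zu bytes of vm area\n", chunk_size);
		dump_stack();
		return NULL;
	}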

>> Because those parameters are initialized during boot but should work
>> as constant once they're set up.  I wanted to make sure that these
>> values don't get assigned to or changed and make that clear by making
>> them look like constants as except for the init code they're constants
>> for all purposes.
> 
> Well, there are an infinite number of ways in which people can later
> introduce bugs.  Why defend against just one?  Particularly in a way
> which mucks up the code?
> 
> If you really want to defend against alterations, access these things
> via function calls rather than via nastycasts which masquerade as
> constants?
> 
> static inline int pcpu_unit_pages_shift(void)
> {
> 	return __pcpu_unit_pages_shift;
> }

It's more a notation to signify the semantics of usage rather than the
mechanics.  I don't really see why this is such a big deal.  Macros
evaluating to rvalues to act as pseudo constants aren't that uncommon.
Anyway, I'll just drop the macros and use the raw variables.

>>> Methinks vmalloc() should have a might_sleep().  Dunno.
>> I can add that but it's an internal utility function which is called
>> from a few well known obvious call sites to replace krealloc, so I
>> don't think it's a big deal.  Maybe the correct thing to do is adding
>> might_sleep() to vmalloc?
> 
> I think so.  It perhaps already has one, via indirect means.

Alright, will look into it and add it if it actually is missing.

>>> A designed decision has been made to not permit the caller to specify
>>> the allocation mode?
>> The design decision was inherited from the original percpu allocator.
> 
> That doesn't mean it's right ;)

:-)

> But I don't recall us ever wishing that the gfp_t arg had been
> included.
> 
>>>> +static void free_pcpu_chunk(struct pcpu_chunk *chunk)
>>>> +{
>>>> +	if (!chunk)
>>>> +		return;
>>> afaict this test is unneeded.
>> I think it's better to put NULL test to free functions in general.
>> People expect free functions to take NULL argument and swallow it.
>> Doing it otherwise adds unnecessary danger for subtle bugs later on.
> 
> It's a dumb convention.  In the vast majority of cases the pointer is
> not NULL.  We add a test-n-branch to 99.999999999% of cases just to
> save three seconds of programmer effort a single time.
> 
> A better design would have been to have kfree() and
> kfree_might_be_null().  (We can still do that by adding a new
> kfree_im_not_stupid() which doesn't do the check).
> 
> It's a bad tradeoff to expend billions of cycles on millions of
> machines to save a little programmer effort.
> 
> (And we're not consistent anyway - see pci_free_consistent)

By making free_pcpu_chunk() not accept NULL, we'll only increase the
inconsistency.  The given fact is that we simply can't remove it from
kfree() at this point.  With the most popular free function supporting
that convention, it's silly and unfruitful to do things otherwise.  It
forces callers of any free function to go look at each function's
implementation to check whether it accepts NULL or not and in many
cases to wrongly assume one way or the other.  I don't think the
minute performance gain justifies the programming overhead.  Given the
current situation, what needs fixing is pci_free_consistent().

>>>> +void *__alloc_percpu(size_t size, size_t align)
>>>> +{
>>>> +	void *ptr = NULL;
>>>> +	struct pcpu_chunk *chunk;
>>>> +	int slot, off, err;
>>>> +
>>>> +	if (unlikely(!size))
>>>> +		return NULL;
>>> hm.  Why do we do this?  Perhaps emitting this warning:
>>>
>>>> +	if (unlikely(size > PAGE_SIZE << PCPU_MIN_UNIT_PAGES_SHIFT ||
>>>> +		     align > PAGE_SIZE)) {
>>>> +		printk(KERN_WARNING "illegal size (%zu) or align (%zu) for "
>>>> +		       "percpu allocation\n", size, align);
>>> would be more appropriate.
>> Maybe.  Dunno.  Returning NULL is what malloc/calloc are allowed to do
>> at least.
> 
> Yes, but it is probably a programming error in the caller.  We want to
> report that asap, not hide it.  The buggy caller will probably now
> assume that the memory allocation failed and will bail out altogether,
> leaving everyone all confused.

I kind of agree with this one.  Most alloc functions do allow it but
yeap its usage isn't very prevalent and is much more likely to be
buggy.  Alright, WARN_ON() then.

>>>> +/**
>>>> + * free_percpu - free percpu area
>>>> + * @ptr: pointer to area to free
>>>> + *
>>>> + * Free percpu area @ptr.  Might sleep.
>>>> + */
>>>> +void free_percpu(void *ptr)
>>>> +{
>>>> +	void *addr = __pcpu_ptr_to_addr(ptr);
>>>> +	struct pcpu_chunk *chunk;
>>>> +	int off;
>>>> +
>>>> +	if (!ptr)
>>>> +		return;
>>> Do we ever do this?  Should it be permitted?  Should we warn?
>> Dunno but should be allowed, yes, no.  :-)
> 
> It adds cycles and hides caller bugs.  Zap it!

Heh heh... No! :-) I'm sorry but I think that's the wrong decision.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCHSET x86/core/percpu] implement dynamic percpu allocator
  2009-02-20  3:08         ` Tejun Heo
@ 2009-02-20  5:36           ` Tejun Heo
  2009-02-20  7:33             ` Tejun Heo
  0 siblings, 1 reply; 78+ messages in thread
From: Tejun Heo @ 2009-02-20  5:36 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Ingo Molnar, tglx, x86, linux-kernel, hpa, jeremy, cpw

Tejun Heo wrote:
> Rusty Russell wrote:
>>>>> Rusty, if the fixes are fine with you i can put those two 
>>>>> commits into tip/core/urgent straight away, the full string of 
>>>>> 10 commits into tip/core/percpu and thus we'd avoid duplicate 
>>>>> (or even conflicting) commits.
>>>> No, the second one is not .29 material; it's a nice, but 
>>>> theoretical, fix.
>>> Can it never trigger?
>> Actually, checked again.  It's not even necessary AFAICT (tho a comment
>> would be nice):
>>
>> 	for (i = 0; i < pcpu_num_used; ptr += block_size(pcpu_size[i]), i++) {
>> 		/* Extra for alignment requirement. */
>> 		extra = ALIGN((unsigned long)ptr, align) - (unsigned long)ptr;
>> 		BUG_ON(i == 0 && extra != 0);
>>
>> 		if (pcpu_size[i] < 0 || pcpu_size[i] < extra + size)
>> 			continue;
>>
>> 		/* Transfer extra to previous block. */
>> 		if (pcpu_size[i-1] < 0)
>> 			pcpu_size[i-1] -= extra;
>> 		else
>> 			pcpu_size[i-1] += extra;
>>
>> pcpu_size[0] is *always* negative: it's marked allocated at initialization
>> (it's the static per-cpu allocations).
>>
>> Sorry I didn't examine more closely,
> 
> Ah... okay.  Right.  I took the code and used it in the chunk area
> allocator, where 0 isn't guaranteed to be occupied, saw the problem
> triggering and then assumed the modalloc allocator shared the same
> problem.  So it's an unnecessary fix, but I think it really needs some
> explanation.
> 
> What to do about #tj-percpu?  Ingo, do you want me to rebase the tree sans
> the second one?

Ingo, as you haven't pulled yet, I'm incorporating changes from the
comments posted till now and rebasing the tree.  Please stand by a
bit.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Subject: [PATCH 08/10 UPDATED] vmalloc: add un/map_kernel_range_noflush()
  2009-02-18 12:04 ` [PATCH 08/10] vmalloc: add un/map_kernel_range_noflush() Tejun Heo
  2009-02-19 12:17   ` Nick Piggin
@ 2009-02-20  7:15   ` Tejun Heo
  2009-02-20  8:32     ` Andrew Morton
  1 sibling, 1 reply; 78+ messages in thread
From: Tejun Heo @ 2009-02-20  7:15 UTC (permalink / raw)
  To: rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo

Impact: two more public map/unmap functions

Implement map_kernel_range_noflush() and unmap_kernel_range_noflush().
These functions respectively map and unmap an address range in the
kernel VM area but don't do any vcache or TLB flushing.  They will be
used by the new percpu allocator.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
---
NOTE: notes about the cache flush requirements were added to the
kerneldoc as per Nick's suggestion.

Thanks.
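
As a rough usage sketch, this is the calling convention the new percpu
allocator in this series follows (see pcpu_map()/pcpu_unmap() in the
later patch; addr, size, pages and err here are placeholders):

	/* map: attach the pages, then flush the vcache over the range */
	err = map_kernel_range_noflush(addr, size, PAGE_KERNEL, pages);
	if (err < 0)
		return err;
	flush_cache_vmap(addr, addr + size);

	/* unmap: flush the vcache, drop the mappings, then flush the TLB */
	flush_cache_vunmap(addr, addr + size);
	unmap_kernel_range_noflush(addr, size);
	flush_tlb_kernel_range(addr, addr + size);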

 include/linux/vmalloc.h |    3 ++
 mm/vmalloc.c            |   67 ++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 67 insertions(+), 3 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index bbc0513..599ba79 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -91,6 +91,9 @@ extern struct vm_struct *remove_vm_area(const void *addr);
 
 extern int map_vm_area(struct vm_struct *area, pgprot_t prot,
 			struct page ***pages);
+extern int map_kernel_range_noflush(unsigned long start, unsigned long size,
+				    pgprot_t prot, struct page **pages);
+extern void unmap_kernel_range_noflush(unsigned long addr, unsigned long size);
 extern void unmap_kernel_range(unsigned long addr, unsigned long size);
 
 /* Allocate/destroy a 'vmalloc' VM area. */
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d206261..224eca9 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -153,8 +153,8 @@ static int vmap_pud_range(pgd_t *pgd, unsigned long addr,
  *
  * Ie. pte at addr+N*PAGE_SIZE shall point to pfn corresponding to pages[N]
  */
-static int vmap_page_range(unsigned long start, unsigned long end,
-				pgprot_t prot, struct page **pages)
+static int vmap_page_range_noflush(unsigned long start, unsigned long end,
+				   pgprot_t prot, struct page **pages)
 {
 	pgd_t *pgd;
 	unsigned long next;
@@ -170,13 +170,22 @@ static int vmap_page_range(unsigned long start, unsigned long end,
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
-	flush_cache_vmap(start, end);
 
 	if (unlikely(err))
 		return err;
 	return nr;
 }
 
+static int vmap_page_range(unsigned long start, unsigned long end,
+			   pgprot_t prot, struct page **pages)
+{
+	int ret;
+
+	ret = vmap_page_range_noflush(start, end, prot, pages);
+	flush_cache_vmap(start, end);
+	return ret;
+}
+
 static inline int is_vmalloc_or_module_addr(const void *x)
 {
 	/*
@@ -1033,6 +1042,58 @@ void __init vmalloc_init(void)
 	vmap_initialized = true;
 }
 
+/**
+ * map_kernel_range_noflush - map kernel VM area with the specified pages
+ * @addr: start of the VM area to map
+ * @size: size of the VM area to map
+ * @prot: page protection flags to use
+ * @pages: pages to map
+ *
+ * Map PFN_UP(@size) pages at @addr.  The VM area @addr and @size
+ * specify should have been allocated using get_vm_area() and its
+ * friends.
+ *
+ * NOTE:
+ * This function does NOT do any cache flushing.  The caller is
+ * responsible for calling flush_cache_vmap() on to-be-mapped areas
+ * before calling this function.
+ *
+ * RETURNS:
+ * The number of pages mapped on success, -errno on failure.
+ */
+int map_kernel_range_noflush(unsigned long addr, unsigned long size,
+			     pgprot_t prot, struct page **pages)
+{
+	return vmap_page_range_noflush(addr, addr + size, prot, pages);
+}
+
+/**
+ * unmap_kernel_range_noflush - unmap kernel VM area
+ * @addr: start of the VM area to unmap
+ * @size: size of the VM area to unmap
+ *
+ * Unmap PFN_UP(@size) pages at @addr.  The VM area @addr and @size
+ * specify should have been allocated using get_vm_area() and its
+ * friends.
+ *
+ * NOTE:
+ * This function does NOT do any cache flushing.  The caller is
+ * responsible for calling flush_cache_vunmap() on to-be-mapped areas
+ * before calling this function and flush_tlb_kernel_range() after.
+ */
+void unmap_kernel_range_noflush(unsigned long addr, unsigned long size)
+{
+	vunmap_page_range(addr, addr + size);
+}
+
+/**
+ * unmap_kernel_range - unmap kernel VM area and flush cache and TLB
+ * @addr: start of the VM area to unmap
+ * @size: size of the VM area to unmap
+ *
+ * Similar to unmap_kernel_range_noflush() but flushes vcache before
+ * the unmapping and tlb after.
+ */
 void unmap_kernel_range(unsigned long addr, unsigned long size)
 {
 	unsigned long end = addr + size;
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH 02/10] module: fix out-of-range memory access
  2009-02-18 12:04 ` [PATCH 02/10] module: fix out-of-range memory access Tejun Heo
  2009-02-19 12:08   ` Nick Piggin
@ 2009-02-20  7:16   ` Tejun Heo
  1 sibling, 0 replies; 78+ messages in thread
From: Tejun Heo @ 2009-02-20  7:16 UTC (permalink / raw)
  To: rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo

Tejun Heo wrote:
> Impact: subtle memory access bug fix
> 
> percpu_modalloc() may access pcpu_size[-1].  The access won't change
> the value by itself but it still is read/write access and dangerous.
> Fix it.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>

Dropped as this can never happen.

-- 
tejun

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH UPDATED 09/10] percpu: implement new dynamic percpu allocator
  2009-02-18 12:04 ` [PATCH 09/10] percpu: implement new dynamic percpu allocator Tejun Heo
                     ` (2 preceding siblings ...)
  2009-02-19 12:36   ` Nick Piggin
@ 2009-02-20  7:30   ` Tejun Heo
  2009-02-20  8:37     ` Andrew Morton
  3 siblings, 1 reply; 78+ messages in thread
From: Tejun Heo @ 2009-02-20  7:30 UTC (permalink / raw)
  To: rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo

Impact: new scalable dynamic percpu allocator which allows dynamic
        percpu areas to be accessed the same way as static ones

Implement scalable dynamic percpu allocator which can be used for both
static and dynamic percpu areas.  This will allow static and dynamic
areas to share faster direct access methods.  This feature is optional
and enabled only when CONFIG_HAVE_DYNAMIC_PER_CPU_AREA is defined by
arch.  Please read comment on top of mm/percpu.c for details.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
The following changes have been made as per Andrew's suggestions.

* drop PCPU_* macros and use variables directly
* chunk->map, ->free_size related comments added
* more locking comment
* comment explaining why *pagep needs clearing in
  pcpu_depopulate_chunk()
* drop unnecessary err variable from __alloc_percpu() and use off
  directly
* use order_base_2() in pcpu_setup_static() instead of an open-coded loop
* explain the use of -ENOSPC

Thanks.
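
For context, a quick sketch of how a dynamic allocation is consumed once
this is in place (struct and variable names are made up; __alloc_percpu(),
per_cpu_ptr() and free_percpu() are the interfaces touched by this patch):

	struct blah_stats {
		unsigned long	packets;
		unsigned long	bytes;
	};
	struct blah_stats *stats;
	unsigned long total = 0;
	int cpu;

	stats = __alloc_percpu(sizeof(*stats), __alignof__(*stats));
	if (!stats)
		return -ENOMEM;

	/* the same offset is valid in every cpu's unit */
	for_each_possible_cpu(cpu)
		total += per_cpu_ptr(stats, cpu)->packets;

	free_percpu(stats);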

 include/linux/percpu.h |   22 +-
 kernel/module.c        |   31 ++
 mm/Makefile            |    4 +
 mm/percpu.c            |  890 ++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 943 insertions(+), 4 deletions(-)
 create mode 100644 mm/percpu.c

diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index d99e24a..1808099 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -76,23 +76,37 @@
 
 #ifdef CONFIG_SMP
 
-struct percpu_data {
-	void *ptrs[1];
-};
+#ifdef CONFIG_HAVE_DYNAMIC_PER_CPU_AREA
 
-#define __percpu_disguise(pdata) (struct percpu_data *)~(unsigned long)(pdata)
+extern void *pcpu_base_addr;
 
+typedef void (*pcpu_populate_pte_fn_t)(unsigned long addr);
+
+extern size_t __init pcpu_setup_static(pcpu_populate_pte_fn_t populate_pte_fn,
+				       struct page **pages, size_t cpu_size);
 /*
  * Use this to get to a cpu's version of the per-cpu object
  * dynamically allocated. Non-atomic access to the current CPU's
  * version should probably be combined with get_cpu()/put_cpu().
  */
+#define per_cpu_ptr(ptr, cpu)	SHIFT_PERCPU_PTR((ptr), per_cpu_offset((cpu)))
+
+#else /* CONFIG_HAVE_DYNAMIC_PER_CPU_AREA */
+
+struct percpu_data {
+	void *ptrs[1];
+};
+
+#define __percpu_disguise(pdata) (struct percpu_data *)~(unsigned long)(pdata)
+
 #define per_cpu_ptr(ptr, cpu)						\
 ({									\
         struct percpu_data *__p = __percpu_disguise(ptr);		\
         (__typeof__(ptr))__p->ptrs[(cpu)];				\
 })
 
+#endif /* CONFIG_HAVE_DYNAMIC_PER_CPU_AREA */
+
 extern void *__alloc_percpu(size_t size, size_t align);
 extern void free_percpu(void *__pdata);
 
diff --git a/kernel/module.c b/kernel/module.c
index 52b3497..1f0657a 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -51,6 +51,7 @@
 #include <linux/tracepoint.h>
 #include <linux/ftrace.h>
 #include <linux/async.h>
+#include <linux/percpu.h>
 
 #if 0
 #define DEBUGP printk
@@ -366,6 +367,34 @@ static struct module *find_module(const char *name)
 }
 
 #ifdef CONFIG_SMP
+
+#ifdef CONFIG_HAVE_DYNAMIC_PER_CPU_AREA
+
+static void *percpu_modalloc(unsigned long size, unsigned long align,
+			     const char *name)
+{
+	void *ptr;
+
+	if (align > PAGE_SIZE) {
+		printk(KERN_WARNING "%s: per-cpu alignment %li > %li\n",
+		       name, align, PAGE_SIZE);
+		align = PAGE_SIZE;
+	}
+
+	ptr = __alloc_percpu(size, align);
+	if (!ptr)
+		printk(KERN_WARNING
+		       "Could not allocate %lu bytes percpu data\n", size);
+	return ptr;
+}
+
+static void percpu_modfree(void *freeme)
+{
+	free_percpu(freeme);
+}
+
+#else /* ... !CONFIG_HAVE_DYNAMIC_PER_CPU_AREA */
+
 /* Number of blocks used and allocated. */
 static unsigned int pcpu_num_used, pcpu_num_allocated;
 /* Size of each block.  -ve means used. */
@@ -499,6 +528,8 @@ static int percpu_modinit(void)
 }
 __initcall(percpu_modinit);
 
+#endif /* CONFIG_HAVE_DYNAMIC_PER_CPU_AREA */
+
 static unsigned int find_pcpusec(Elf_Ehdr *hdr,
 				 Elf_Shdr *sechdrs,
 				 const char *secstrings)
diff --git a/mm/Makefile b/mm/Makefile
index 72255be..818569b 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -30,6 +30,10 @@ obj-$(CONFIG_FAILSLAB) += failslab.o
 obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
 obj-$(CONFIG_FS_XIP) += filemap_xip.o
 obj-$(CONFIG_MIGRATION) += migrate.o
+ifdef CONFIG_HAVE_DYNAMIC_PER_CPU_AREA
+obj-$(CONFIG_SMP) += percpu.o
+else
 obj-$(CONFIG_SMP) += allocpercpu.o
+endif
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
diff --git a/mm/percpu.c b/mm/percpu.c
new file mode 100644
index 0000000..4617d97
--- /dev/null
+++ b/mm/percpu.c
@@ -0,0 +1,890 @@
+/*
+ * linux/mm/percpu.c - percpu memory allocator
+ *
+ * Copyright (C) 2009		SUSE Linux Products GmbH
+ * Copyright (C) 2009		Tejun Heo <tj@kernel.org>
+ *
+ * This file is released under the GPLv2.
+ *
+ * This is percpu allocator which can handle both static and dynamic
+ * areas.  Percpu areas are allocated in chunks in vmalloc area.  Each
+ * chunk consists of num_possible_cpus() units and the first chunk
+ * is used for static percpu variables in the kernel image (special
+ * boot time alloc/init handling necessary as these areas need to be
+ * brought up before allocation services are running).  Unit grows as
+ * necessary and all units grow or shrink in unison.  When a chunk is
+ * filled up, another chunk is allocated.  I.e., in the vmalloc area:
+ *
+ *  c0                           c1                         c2
+ *  -------------------          -------------------        ------------
+ * | u0 | u1 | u2 | u3 |        | u0 | u1 | u2 | u3 |      | u0 | u1 | u
+ *  -------------------  ......  -------------------  ....  ------------
+ *
+ * Allocation is done in offset-size areas of single unit space.  Ie,
+ * an area of 512 bytes at 6k in c1 occupies 512 bytes at 6k of c1:u0,
+ * c1:u1, c1:u2 and c1:u3.  Percpu access can be done by configuring
+ * percpu base registers UNIT_SIZE apart.
+ *
+ * There are usually many small percpu allocations, many of them as
+ * small as 4 bytes.  The allocator organizes chunks into lists
+ * according to free size and tries to allocate from the fullest one.
+ * Each chunk keeps the maximum contiguous area size hint which is
+ * guaranteed to be equal to or larger than the maximum contiguous
+ * area in the chunk.  This helps the allocator not to iterate the
+ * chunk maps unnecessarily.
+ *
+ * Allocation state in each chunk is kept using an array of integers
+ * on chunk->map.  A positive value in the map represents a free
+ * region and negative allocated.  Allocation inside a chunk is done
+ * by scanning this map sequentially and serving the first matching
+ * entry.  This is mostly copied from the percpu_modalloc() allocator.
+ * Chunks are also linked into a rb tree to ease address to chunk
+ * mapping during free.
+ *
+ * To use this allocator, arch code should do the following.
+ *
+ * - define CONFIG_HAVE_DYNAMIC_PER_CPU_AREA
+ *
+ * - define __addr_to_pcpu_ptr() and __pcpu_ptr_to_addr() to translate
+ *   regular address to percpu pointer and back
+ *
+ * - use pcpu_setup_static() during percpu area initialization to
+ *   setup kernel static percpu area
+ */
+
+#include <linux/bitmap.h>
+#include <linux/bootmem.h>
+#include <linux/list.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/percpu.h>
+#include <linux/pfn.h>
+#include <linux/rbtree.h>
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+
+#include <asm/cacheflush.h>
+#include <asm/tlbflush.h>
+
+#define PCPU_MIN_UNIT_PAGES_SHIFT	4	/* also max alloc size */
+#define PCPU_SLOT_BASE_SHIFT		5	/* 1-31 shares the same slot */
+#define PCPU_DFL_MAP_ALLOC		16	/* start a map with 16 ents */
+
+struct pcpu_chunk {
+	struct list_head	list;		/* linked to pcpu_slot lists */
+	struct rb_node		rb_node;	/* key is chunk->vm->addr */
+	int			free_size;	/* free bytes in the chunk */
+	int			contig_hint;	/* max contiguous size hint */
+	struct vm_struct	*vm;		/* mapped vmalloc region */
+	int			map_used;	/* # of map entries used */
+	int			map_alloc;	/* # of map entries allocated */
+	int			*map;		/* allocation map */
+	struct page		*page[];	/* #cpus * UNIT_PAGES */
+};
+
+static int pcpu_unit_pages_shift;
+static int pcpu_unit_pages;
+static int pcpu_unit_shift;
+static int pcpu_unit_size;
+static int pcpu_chunk_size;
+static int pcpu_nr_slots;
+static size_t pcpu_chunk_struct_size;
+
+/* the address of the first chunk which starts with the kernel static area */
+void *pcpu_base_addr;
+EXPORT_SYMBOL_GPL(pcpu_base_addr);
+
+/* the size of kernel static area */
+static int pcpu_static_size;
+
+/*
+ * One mutex to rule them all.
+ *
+ * The following mutex is grabbed in the outermost public alloc/free
+ * interface functions and released only when the operation is
+ * complete.  As such, every function in this file other than the
+ * outermost functions are called under pcpu_mutex.
+ *
+ * It can easily be switched to use spinlock such that only the area
+ * allocation and page population commit are protected with it doing
+ * actual [de]allocation without holding any lock.  However, given
+ * what this allocator does, I think it's better to let them run
+ * sequentially.
+ */
+static DEFINE_MUTEX(pcpu_mutex);
+
+static struct list_head *pcpu_slot;		/* chunk list slots */
+static struct rb_root pcpu_addr_root = RB_ROOT;	/* chunks by address */
+
+static int pcpu_size_to_slot(int size)
+{
+	int highbit = fls(size);
+	return max(highbit - PCPU_SLOT_BASE_SHIFT + 2, 1);
+}
+
+static int pcpu_chunk_slot(const struct pcpu_chunk *chunk)
+{
+	if (chunk->free_size < sizeof(int) || chunk->contig_hint < sizeof(int))
+		return 0;
+
+	return pcpu_size_to_slot(chunk->free_size);
+}
+
+static int pcpu_page_idx(unsigned int cpu, int page_idx)
+{
+	return (cpu << pcpu_unit_pages_shift) + page_idx;
+}
+
+static struct page **pcpu_chunk_pagep(struct pcpu_chunk *chunk,
+				      unsigned int cpu, int page_idx)
+{
+	return &chunk->page[pcpu_page_idx(cpu, page_idx)];
+}
+
+static unsigned long pcpu_chunk_addr(struct pcpu_chunk *chunk,
+				     unsigned int cpu, int page_idx)
+{
+	return (unsigned long)chunk->vm->addr +
+		(pcpu_page_idx(cpu, page_idx) << PAGE_SHIFT);
+}
+
+static bool pcpu_chunk_page_occupied(struct pcpu_chunk *chunk,
+				     int page_idx)
+{
+	return *pcpu_chunk_pagep(chunk, 0, page_idx) != NULL;
+}
+
+/**
+ * pcpu_realloc - versatile realloc
+ * @p: the current pointer (can be NULL for new allocations)
+ * @size: the current size (can be 0 for new allocations)
+ * @new_size: the wanted new size (can be 0 for free)
+ *
+ * More robust realloc which can be used to allocate, resize or free a
+ * memory area of arbitrary size.  If the needed size goes over
+ * PAGE_SIZE, kernel VM is used.
+ *
+ * RETURNS:
+ * The new pointer on success, NULL on failure.
+ */
+static void *pcpu_realloc(void *p, size_t size, size_t new_size)
+{
+	void *new;
+
+	if (new_size <= PAGE_SIZE)
+		new = kmalloc(new_size, GFP_KERNEL);
+	else
+		new = vmalloc(new_size);
+	if (new_size && !new)
+		return NULL;
+
+	memcpy(new, p, min(size, new_size));
+	if (new_size > size)
+		memset(new + size, 0, new_size - size);
+
+	if (size <= PAGE_SIZE)
+		kfree(p);
+	else
+		vfree(p);
+
+	return new;
+}
+
+/**
+ * pcpu_chunk_relocate - put chunk in the appropriate chunk slot
+ * @chunk: chunk of interest
+ * @oslot: the previous slot it was on
+ *
+ * This function is called after an allocation or free changed @chunk.
+ * New slot according to the changed state is determined and @chunk is
+ * moved to the slot.
+ */
+static void pcpu_chunk_relocate(struct pcpu_chunk *chunk, int oslot)
+{
+	int nslot = pcpu_chunk_slot(chunk);
+
+	if (oslot != nslot) {
+		if (oslot < nslot)
+			list_move(&chunk->list, &pcpu_slot[nslot]);
+		else
+			list_move_tail(&chunk->list, &pcpu_slot[nslot]);
+	}
+}
+
+static struct rb_node **pcpu_chunk_rb_search(void *addr,
+					     struct rb_node **parentp)
+{
+	struct rb_node **p = &pcpu_addr_root.rb_node;
+	struct rb_node *parent = NULL;
+	struct pcpu_chunk *chunk;
+
+	while (*p) {
+		parent = *p;
+		chunk = rb_entry(parent, struct pcpu_chunk, rb_node);
+
+		if (addr < chunk->vm->addr)
+			p = &(*p)->rb_left;
+		else if (addr > chunk->vm->addr)
+			p = &(*p)->rb_right;
+		else
+			break;
+	}
+
+	if (parentp)
+		*parentp = parent;
+	return p;
+}
+
+/**
+ * pcpu_chunk_addr_search - search for chunk containing specified address
+ * @addr: address to search for
+ *
+ * Look for chunk which might contain @addr.  More specifically, it
+ * searches for the chunk with the highest start address which isn't
+ * beyond @addr.
+ *
+ * RETURNS:
+ * The address of the found chunk.
+ */
+static struct pcpu_chunk *pcpu_chunk_addr_search(void *addr)
+{
+	struct rb_node *n, *parent;
+	struct pcpu_chunk *chunk;
+
+	n = *pcpu_chunk_rb_search(addr, &parent);
+	if (!n) {
+		/* no exactly matching chunk, the parent is the closest */
+		n = parent;
+		BUG_ON(!n);
+	}
+	chunk = rb_entry(n, struct pcpu_chunk, rb_node);
+
+	if (addr < chunk->vm->addr) {
+		/* the parent was the next one, look for the previous one */
+		n = rb_prev(n);
+		BUG_ON(!n);
+		chunk = rb_entry(n, struct pcpu_chunk, rb_node);
+	}
+
+	return chunk;
+}
+
+/**
+ * pcpu_chunk_addr_insert - insert chunk into address rb tree
+ * @new: chunk to insert
+ *
+ * Insert @new into address rb tree.
+ */
+static void pcpu_chunk_addr_insert(struct pcpu_chunk *new)
+{
+	struct rb_node **p, *parent;
+
+	p = pcpu_chunk_rb_search(new->vm->addr, &parent);
+	BUG_ON(*p);
+	rb_link_node(&new->rb_node, parent, p);
+	rb_insert_color(&new->rb_node, &pcpu_addr_root);
+}
+
+/**
+ * pcpu_split_block - split a map block
+ * @chunk: chunk of interest
+ * @i: index of map block to split
+ * @head: head size (can be 0)
+ * @tail: tail size (can be 0)
+ *
+ * Split the @i'th map block into two or three blocks.  If @head is
+ * non-zero, @head bytes block is inserted before block @i moving it
+ * to @i+1 and reducing its size by @head bytes.
+ *
+ * If @tail is non-zero, the target block, which can be @i or @i+1
+ * depending on @head, is reduced by @tail bytes and @tail byte block
+ * is inserted after the target block.
+ *
+ * RETURNS:
+ * 0 on success, -errno on failure.
+ */
+static int pcpu_split_block(struct pcpu_chunk *chunk, int i, int head, int tail)
+{
+	int nr_extra = !!head + !!tail;
+	int target = chunk->map_used + nr_extra;
+
+	/* reallocation required? */
+	if (chunk->map_alloc < target) {
+		int new_alloc = chunk->map_alloc;
+		int *new;
+
+		while (new_alloc < target)
+			new_alloc *= 2;
+
+		new = pcpu_realloc(chunk->map,
+				   chunk->map_alloc * sizeof(new[0]),
+				   new_alloc * sizeof(new[0]));
+		if (!new)
+			return -ENOMEM;
+
+		chunk->map_alloc = new_alloc;
+		chunk->map = new;
+	}
+
+	/* insert a new subblock */
+	memmove(&chunk->map[i + nr_extra], &chunk->map[i],
+		sizeof(chunk->map[0]) * (chunk->map_used - i));
+	chunk->map_used += nr_extra;
+
+	if (head) {
+		chunk->map[i + 1] = chunk->map[i] - head;
+		chunk->map[i++] = head;
+	}
+	if (tail) {
+		chunk->map[i++] -= tail;
+		chunk->map[i] = tail;
+	}
+	return 0;
+}
+
+/**
+ * pcpu_alloc_area - allocate area from a pcpu_chunk
+ * @chunk: chunk of interest
+ * @size: wanted size
+ * @align: wanted align
+ *
+ * Try to allocate @size bytes area aligned at @align from @chunk.
+ * Note that this function only allocates the offset.  It doesn't
+ * populate or map the area.
+ *
+ * RETURNS:
+ * Allocated offset in @chunk on success, -errno on failure.
+ */
+static int pcpu_alloc_area(struct pcpu_chunk *chunk, int size, int align)
+{
+	int oslot = pcpu_chunk_slot(chunk);
+	int max_contig = 0;
+	int i, off;
+
+	/*
+	 * The static chunk initially doesn't have map attached
+	 * because kmalloc wasn't available during init.  Give it one.
+	 */
+	if (unlikely(!chunk->map)) {
+		chunk->map = pcpu_realloc(NULL, 0,
+				PCPU_DFL_MAP_ALLOC * sizeof(chunk->map[0]));
+		if (!chunk->map)
+			return -ENOMEM;
+
+		chunk->map_alloc = PCPU_DFL_MAP_ALLOC;
+		chunk->map[chunk->map_used++] = -pcpu_static_size;
+		if (chunk->free_size)
+			chunk->map[chunk->map_used++] = chunk->free_size;
+	}
+
+	for (i = 0, off = 0; i < chunk->map_used; off += abs(chunk->map[i++])) {
+		bool is_last = i + 1 == chunk->map_used;
+		int head, tail;
+
+		/* extra for alignment requirement */
+		head = ALIGN(off, align) - off;
+		BUG_ON(i == 0 && head != 0);
+
+		if (chunk->map[i] < 0)
+			continue;
+		if (chunk->map[i] < head + size) {
+			max_contig = max(chunk->map[i], max_contig);
+			continue;
+		}
+
+		/*
+		 * If head is small or the previous block is free,
+		 * merge'em.  Note that 'small' is defined as smaller
+		 * than sizeof(int), which is very small but isn't too
+		 * uncommon for percpu allocations.
+		 */
+		if (head && (head < sizeof(int) || chunk->map[i - 1] > 0)) {
+			if (chunk->map[i - 1] > 0)
+				chunk->map[i - 1] += head;
+			else {
+				chunk->map[i - 1] -= head;
+				chunk->free_size -= head;
+			}
+			chunk->map[i] -= head;
+			off += head;
+			head = 0;
+		}
+
+		/* if tail is small, just keep it around */
+		tail = chunk->map[i] - head - size;
+		if (tail < sizeof(int))
+			tail = 0;
+
+		/* split if warranted */
+		if (head || tail) {
+			if (pcpu_split_block(chunk, i, head, tail))
+				return -ENOMEM;
+			if (head) {
+				i++;
+				off += head;
+				max_contig = max(chunk->map[i - 1], max_contig);
+			}
+			if (tail)
+				max_contig = max(chunk->map[i + 1], max_contig);
+		}
+
+		/* update hint and mark allocated */
+		if (is_last)
+			chunk->contig_hint = max_contig; /* fully scanned */
+		else
+			chunk->contig_hint = max(chunk->contig_hint,
+						 max_contig);
+
+		chunk->free_size -= chunk->map[i];
+		chunk->map[i] = -chunk->map[i];
+
+		pcpu_chunk_relocate(chunk, oslot);
+		return off;
+	}
+
+	chunk->contig_hint = max_contig;	/* fully scanned */
+	pcpu_chunk_relocate(chunk, oslot);
+
+	/*
+	 * Tell the upper layer that this chunk has no area left.
+	 * Note that this is not an error condition but a notification
+	 * to upper layer that it needs to look at other chunks.
+	 * -ENOSPC is chosen as it isn't used in memory subsystem and
+	 * matches the meaning in a way.
+	 */
+	return -ENOSPC;
+}
+
+/**
+ * pcpu_free_area - free area to a pcpu_chunk
+ * @chunk: chunk of interest
+ * @freeme: offset of area to free
+ *
+ * Free the area starting at @freeme in @chunk.  Note that this function
+ * only modifies the allocation map.  It doesn't depopulate or unmap
+ * the area.
+ */
+static void pcpu_free_area(struct pcpu_chunk *chunk, int freeme)
+{
+	int oslot = pcpu_chunk_slot(chunk);
+	int i, off;
+
+	for (i = 0, off = 0; i < chunk->map_used; off += abs(chunk->map[i++]))
+		if (off == freeme)
+			break;
+	BUG_ON(off != freeme);
+	BUG_ON(chunk->map[i] > 0);
+
+	chunk->map[i] = -chunk->map[i];
+	chunk->free_size += chunk->map[i];
+
+	/* merge with previous? */
+	if (i > 0 && chunk->map[i - 1] >= 0) {
+		chunk->map[i - 1] += chunk->map[i];
+		chunk->map_used--;
+		memmove(&chunk->map[i], &chunk->map[i + 1],
+			(chunk->map_used - i) * sizeof(chunk->map[0]));
+		i--;
+	}
+	/* merge with next? */
+	if (i + 1 < chunk->map_used && chunk->map[i + 1] >= 0) {
+		chunk->map[i] += chunk->map[i + 1];
+		chunk->map_used--;
+		memmove(&chunk->map[i + 1], &chunk->map[i + 2],
+			(chunk->map_used - (i + 1)) * sizeof(chunk->map[0]));
+	}
+
+	chunk->contig_hint = max(chunk->map[i], chunk->contig_hint);
+	pcpu_chunk_relocate(chunk, oslot);
+}
+
+/**
+ * pcpu_unmap - unmap pages out of a pcpu_chunk
+ * @chunk: chunk of interest
+ * @page_start: page index of the first page to unmap
+ * @page_end: page index of the last page to unmap + 1
+ * @flush: whether to flush cache and tlb or not
+ *
+ * For each cpu, unmap pages [@page_start,@page_end) out of @chunk.
+ * If @flush is true, vcache is flushed before unmapping and tlb
+ * after.
+ */
+static void pcpu_unmap(struct pcpu_chunk *chunk, int page_start, int page_end,
+		       bool flush)
+{
+	unsigned int last = num_possible_cpus() - 1;
+	unsigned int cpu;
+
+	/*
+	 * Each flushing trial can be very expensive, issue flush on
+	 * the whole region at once rather than doing it for each cpu.
+	 * This could be overkill but is more scalable.
+	 */
+	if (flush)
+		flush_cache_vunmap(pcpu_chunk_addr(chunk, 0, page_start),
+				   pcpu_chunk_addr(chunk, last, page_end));
+
+	for_each_possible_cpu(cpu)
+		unmap_kernel_range_noflush(
+				pcpu_chunk_addr(chunk, cpu, page_start),
+				(page_end - page_start) << PAGE_SHIFT);
+
+	/* ditto as flush_cache_vunmap() */
+	if (flush)
+		flush_tlb_kernel_range(pcpu_chunk_addr(chunk, 0, page_start),
+				       pcpu_chunk_addr(chunk, last, page_end));
+}
+
+/**
+ * pcpu_depopulate_chunk - depopulate and unmap an area of a pcpu_chunk
+ * @chunk: chunk to depopulate
+ * @off: offset to the area to depopulate
+ * @size: size of the area to depopulate
+ * @flush: whether to flush cache and tlb or not
+ *
+ * For each cpu, depopulate and unmap pages [@page_start,@page_end)
+ * from @chunk.  If @flush is true, vcache is flushed before unmapping
+ * and tlb after.
+ */
+static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk, size_t off,
+				  size_t size, bool flush)
+{
+	int page_start = PFN_DOWN(off);
+	int page_end = PFN_UP(off + size);
+	int unmap_start = -1;
+	int uninitialized_var(unmap_end);
+	unsigned int cpu;
+	int i;
+
+	for (i = page_start; i < page_end; i++) {
+		for_each_possible_cpu(cpu) {
+			struct page **pagep = pcpu_chunk_pagep(chunk, cpu, i);
+
+			if (!*pagep)
+				continue;
+
+			__free_page(*pagep);
+
+			/*
+			 * If it's partial depopulation, it might get
+			 * populated or depopulated again.  Mark the
+			 * page gone.
+			 */
+			*pagep = NULL;
+
+			unmap_start = unmap_start < 0 ? i : unmap_start;
+			unmap_end = i + 1;
+		}
+	}
+
+	if (unmap_start >= 0)
+		pcpu_unmap(chunk, unmap_start, unmap_end, flush);
+}
+
+/**
+ * pcpu_map - map pages into a pcpu_chunk
+ * @chunk: chunk of interest
+ * @page_start: page index of the first page to map
+ * @page_end: page index of the last page to map + 1
+ *
+ * For each cpu, map pages [@page_start,@page_end) into @chunk.
+ * vcache is flushed afterwards.
+ */
+static int pcpu_map(struct pcpu_chunk *chunk, int page_start, int page_end)
+{
+	unsigned int last = num_possible_cpus() - 1;
+	unsigned int cpu;
+	int err;
+
+	for_each_possible_cpu(cpu) {
+		err = map_kernel_range_noflush(
+				pcpu_chunk_addr(chunk, cpu, page_start),
+				(page_end - page_start) << PAGE_SHIFT,
+				PAGE_KERNEL,
+				pcpu_chunk_pagep(chunk, cpu, page_start));
+		if (err < 0)
+			return err;
+	}
+
+	/* flush at once, please read comments in pcpu_unmap() */
+	flush_cache_vmap(pcpu_chunk_addr(chunk, 0, page_start),
+			 pcpu_chunk_addr(chunk, last, page_end));
+	return 0;
+}
+
+/**
+ * pcpu_populate_chunk - populate and map an area of a pcpu_chunk
+ * @chunk: chunk of interest
+ * @off: offset to the area to populate
+ * @size: size of the area to populate
+ *
+ * For each cpu, populate and map pages [@page_start,@page_end) into
+ * @chunk.  The area is cleared on return.
+ */
+static int pcpu_populate_chunk(struct pcpu_chunk *chunk, int off, int size)
+{
+	const gfp_t alloc_mask = GFP_KERNEL | __GFP_HIGHMEM | __GFP_COLD;
+	int page_start = PFN_DOWN(off);
+	int page_end = PFN_UP(off + size);
+	int map_start = -1;
+	int map_end;
+	unsigned int cpu;
+	int i;
+
+	for (i = page_start; i < page_end; i++) {
+		if (pcpu_chunk_page_occupied(chunk, i)) {
+			if (map_start >= 0) {
+				if (pcpu_map(chunk, map_start, map_end))
+					goto err;
+				map_start = -1;
+			}
+			continue;
+		}
+
+		map_start = map_start < 0 ? i : map_start;
+		map_end = i + 1;
+
+		for_each_possible_cpu(cpu) {
+			struct page **pagep = pcpu_chunk_pagep(chunk, cpu, i);
+
+			*pagep = alloc_pages_node(cpu_to_node(cpu),
+						  alloc_mask, 0);
+			if (!*pagep)
+				goto err;
+		}
+	}
+
+	if (map_start >= 0 && pcpu_map(chunk, map_start, map_end))
+		goto err;
+
+	for_each_possible_cpu(cpu)
+		memset(chunk->vm->addr + (cpu << pcpu_unit_shift) + off, 0,
+		       size);
+
+	return 0;
+err:
+	/* likely under heavy memory pressure, give memory back */
+	pcpu_depopulate_chunk(chunk, off, size, true);
+	return -ENOMEM;
+}
+
+static void free_pcpu_chunk(struct pcpu_chunk *chunk)
+{
+	if (!chunk)
+		return;
+	if (chunk->vm)
+		free_vm_area(chunk->vm);
+	pcpu_realloc(chunk->map, chunk->map_alloc * sizeof(chunk->map[0]), 0);
+	kfree(chunk);
+}
+
+static struct pcpu_chunk *alloc_pcpu_chunk(void)
+{
+	struct pcpu_chunk *chunk;
+
+	chunk = kzalloc(pcpu_chunk_struct_size, GFP_KERNEL);
+	if (!chunk)
+		return NULL;
+
+	chunk->map = pcpu_realloc(NULL, 0,
+				  PCPU_DFL_MAP_ALLOC * sizeof(chunk->map[0]));
+	chunk->map_alloc = PCPU_DFL_MAP_ALLOC;
+	chunk->map[chunk->map_used++] = pcpu_unit_size;
+
+	chunk->vm = get_vm_area(pcpu_chunk_size, GFP_KERNEL);
+	if (!chunk->vm) {
+		free_pcpu_chunk(chunk);
+		return NULL;
+	}
+
+	INIT_LIST_HEAD(&chunk->list);
+	chunk->free_size = pcpu_unit_size;
+	chunk->contig_hint = pcpu_unit_size;
+
+	return chunk;
+}
+
+/**
+ * __alloc_percpu - allocate percpu area
+ * @size: size of area to allocate
+ * @align: alignment of area (max PAGE_SIZE)
+ *
+ * Allocate percpu area of @size bytes aligned at @align.  Might
+ * sleep.  Might trigger writeouts.
+ *
+ * RETURNS:
+ * Percpu pointer to the allocated area on success, NULL on failure.
+ */
+void *__alloc_percpu(size_t size, size_t align)
+{
+	void *ptr = NULL;
+	struct pcpu_chunk *chunk;
+	int slot, off;
+
+	if (unlikely(!size || size > PAGE_SIZE << PCPU_MIN_UNIT_PAGES_SHIFT ||
+		     align > PAGE_SIZE)) {
+		WARN(true, "illegal size (%zu) or align (%zu) for "
+		     "percpu allocation\n", size, align);
+		return NULL;
+	}
+
+	mutex_lock(&pcpu_mutex);
+
+	/* allocate area */
+	for (slot = pcpu_size_to_slot(size); slot < pcpu_nr_slots; slot++) {
+		list_for_each_entry(chunk, &pcpu_slot[slot], list) {
+			if (size > chunk->contig_hint)
+				continue;
+			off = pcpu_alloc_area(chunk, size, align);
+			if (off >= 0)
+				goto area_found;
+			if (off != -ENOSPC)
+				goto out_unlock;
+		}
+	}
+
+	/* hmmm... no space left, create a new chunk */
+	chunk = alloc_pcpu_chunk();
+	if (!chunk)
+		goto out_unlock;
+	pcpu_chunk_relocate(chunk, -1);
+	pcpu_chunk_addr_insert(chunk);
+
+	off = pcpu_alloc_area(chunk, size, align);
+	if (off < 0)
+		goto out_unlock;
+
+area_found:
+	/* populate, map and clear the area */
+	if (pcpu_populate_chunk(chunk, off, size)) {
+		pcpu_free_area(chunk, off);
+		goto out_unlock;
+	}
+
+	ptr = __addr_to_pcpu_ptr(chunk->vm->addr + off);
+out_unlock:
+	mutex_unlock(&pcpu_mutex);
+	return ptr;
+}
+EXPORT_SYMBOL_GPL(__alloc_percpu);
+
+static void pcpu_kill_chunk(struct pcpu_chunk *chunk)
+{
+	pcpu_depopulate_chunk(chunk, 0, pcpu_unit_size, false);
+	list_del(&chunk->list);
+	rb_erase(&chunk->rb_node, &pcpu_addr_root);
+	free_pcpu_chunk(chunk);
+}
+
+/**
+ * free_percpu - free percpu area
+ * @ptr: pointer to area to free
+ *
+ * Free percpu area @ptr.  Might sleep.
+ */
+void free_percpu(void *ptr)
+{
+	void *addr = __pcpu_ptr_to_addr(ptr);
+	struct pcpu_chunk *chunk;
+	int off;
+
+	if (!ptr)
+		return;
+
+	mutex_lock(&pcpu_mutex);
+
+	chunk = pcpu_chunk_addr_search(addr);
+	off = addr - chunk->vm->addr;
+
+	pcpu_free_area(chunk, off);
+
+	/* the chunk became fully free, kill one if there are other free ones */
+	if (chunk->free_size == pcpu_unit_size) {
+		struct pcpu_chunk *pos;
+
+		list_for_each_entry(pos,
+				    &pcpu_slot[pcpu_chunk_slot(chunk)], list)
+			if (pos != chunk) {
+				pcpu_kill_chunk(pos);
+				break;
+			}
+	}
+
+	mutex_unlock(&pcpu_mutex);
+}
+EXPORT_SYMBOL_GPL(free_percpu);
+
+/**
+ * pcpu_setup_static - initialize kernel static percpu area
+ * @populate_pte_fn: callback to allocate pagetable
+ * @pages: num_possible_cpus() * PFN_UP(cpu_size) pages
+ *
+ * Initialize kernel static percpu area.  The caller should allocate
+ * all the necessary pages and pass them in @pages.
+ * @populate_pte_fn() is called on each page to be used for percpu
+ * mapping and is responsible for making sure all the necessary page
+ * tables for the page are allocated.
+ *
+ * RETURNS:
+ * The determined pcpu_unit_size which can be used to initialize
+ * percpu access.
+ */
+size_t __init pcpu_setup_static(pcpu_populate_pte_fn_t populate_pte_fn,
+				struct page **pages, size_t cpu_size)
+{
+	static struct vm_struct static_vm;
+	struct pcpu_chunk *static_chunk;
+	int nr_cpu_pages = DIV_ROUND_UP(cpu_size, PAGE_SIZE);
+	unsigned int cpu;
+	int err, i;
+
+	pcpu_unit_pages_shift = max_t(int, PCPU_MIN_UNIT_PAGES_SHIFT,
+				      order_base_2(cpu_size) - PAGE_SHIFT);
+
+	pcpu_static_size = cpu_size;
+	pcpu_unit_pages = 1 << pcpu_unit_pages_shift;
+	pcpu_unit_shift = PAGE_SHIFT + pcpu_unit_pages_shift;
+	pcpu_unit_size = 1 << pcpu_unit_shift;
+	pcpu_chunk_size = num_possible_cpus() * pcpu_unit_size;
+	pcpu_nr_slots = pcpu_size_to_slot(pcpu_unit_size) + 1;
+	pcpu_chunk_struct_size = sizeof(struct pcpu_chunk)
+		+ (1 << pcpu_unit_pages_shift) * sizeof(struct page *);
+
+	/* allocate chunk slots */
+	pcpu_slot = alloc_bootmem(pcpu_nr_slots * sizeof(pcpu_slot[0]));
+	for (i = 0; i < pcpu_nr_slots; i++)
+		INIT_LIST_HEAD(&pcpu_slot[i]);
+
+	/* init and register vm area */
+	static_vm.flags = VM_ALLOC;
+	static_vm.size = pcpu_chunk_size;
+	vm_area_register_early(&static_vm);
+
+	/* init static_chunk */
+	static_chunk = alloc_bootmem(pcpu_chunk_struct_size);
+	INIT_LIST_HEAD(&static_chunk->list);
+	static_chunk->vm = &static_vm;
+	static_chunk->free_size = pcpu_unit_size - pcpu_static_size;
+	static_chunk->contig_hint = static_chunk->free_size;
+
+	/* assign pages and map them */
+	for_each_possible_cpu(cpu) {
+		for (i = 0; i < nr_cpu_pages; i++) {
+			*pcpu_chunk_pagep(static_chunk, cpu, i) = *pages++;
+			populate_pte_fn(pcpu_chunk_addr(static_chunk, cpu, i));
+		}
+	}
+
+	err = pcpu_map(static_chunk, 0, nr_cpu_pages);
+	if (err)
+		panic("failed to setup static percpu area, err=%d\n", err);
+
+	/* link static_chunk in */
+	pcpu_chunk_relocate(static_chunk, -1);
+	pcpu_chunk_addr_insert(static_chunk);
+
+	/* we're done */
+	pcpu_base_addr = (void *)pcpu_chunk_addr(static_chunk, 0, 0);
+	return pcpu_unit_size;
+}
-- 
1.6.0.2
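
As a companion sketch (not part of the patch): roughly what the arch side
is expected to do with pcpu_setup_static(), going by the kerneldoc above.
The populate_pte callback, the page allocation and the copy loop are
illustrative; the real x86 wiring is done in a separate patch of this
series.

	static void __init pcpu_populate_pte_sketch(unsigned long addr)
	{
		/* arch specific: make sure pmd/pte pages exist for addr */
	}

	void __init setup_per_cpu_areas(void)
	{
		size_t static_size = __per_cpu_end - __per_cpu_start;
		int nr_cpu_pages = PFN_UP(static_size);
		struct page **pages;
		size_t unit_size;
		unsigned int cpu;
		int i, j = 0;

		pages = alloc_bootmem(num_possible_cpus() * nr_cpu_pages *
				      sizeof(pages[0]));

		/* allocate backing pages and copy the static percpu data in */
		for_each_possible_cpu(cpu)
			for (i = 0; i < nr_cpu_pages; i++) {
				void *p = alloc_bootmem_pages(PAGE_SIZE);

				memcpy(p, __per_cpu_start + i * PAGE_SIZE,
				       min_t(size_t, PAGE_SIZE,
					     static_size - i * PAGE_SIZE));
				pages[j++] = virt_to_page(p);
			}

		unit_size = pcpu_setup_static(pcpu_populate_pte_sketch, pages,
					      static_size);

		/*
		 * The arch then points its percpu base (per_cpu_offset())
		 * at pcpu_base_addr, each cpu unit_size apart.
		 */
	}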


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCHSET x86/core/percpu] implement dynamic percpu allocator
  2009-02-20  5:36           ` Tejun Heo
@ 2009-02-20  7:33             ` Tejun Heo
  0 siblings, 0 replies; 78+ messages in thread
From: Tejun Heo @ 2009-02-20  7:33 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Ingo Molnar, tglx, x86, linux-kernel, hpa, jeremy, cpw

Tejun Heo wrote:
> Ingo, as you haven't pulled yet, I'm incorporating the changes from
> the comments posted so far and rebasing the tree.  Please stand by a
> bit.

Alright, the updated tree is at

  http://git.kernel.org/?p=linux/kernel/git/tj/misc.git tj-percpu

The commit ID is 11124411aa95827404d6bfdfc14c908e1b54513c.

Changes from the last tree are...

* Lai's patch to use percpu data for irq stacks is now the first one.

* The bogus modalloc fix patch dropped.

* Scary comments to map/unmap_kernel_range_noflush() added as per
  Nick's suggestion.

* implement-new-dynamic-percpu-allocator patch updated as per Andrew's
  suggestions.

I think I'll just stack future changes on top of this tree from now
on, so please feel free to pull from it.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: Subject: [PATCH 08/10 UPDATED] vmalloc: add un/map_kernel_range_noflush()
  2009-02-20  7:15   ` Subject: [PATCH 08/10 UPDATED] " Tejun Heo
@ 2009-02-20  8:32     ` Andrew Morton
  2009-02-21  3:21       ` Tejun Heo
  0 siblings, 1 reply; 78+ messages in thread
From: Andrew Morton @ 2009-02-20  8:32 UTC (permalink / raw)
  To: Tejun Heo; +Cc: rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo

On Fri, 20 Feb 2009 16:15:39 +0900 Tejun Heo <tj@kernel.org> wrote:

> +/**
> + * map_kernel_range_noflush - map kernel VM area with the specified pages
> + * @addr: start of the VM area to map
> + * @size: size of the VM area to map
> + * @prot: page protection flags to use
> + * @pages: pages to map
> + *
> + * Map PFN_UP(@size) pages at @addr.  The VM area @addr and @size
> + * specify should have been allocated using get_vm_area() and its
> + * friends.
> + *
> + * NOTE:
> + * This function does NOT do any cache flushing.  The caller is
> + * responsible for calling flush_cache_vmap() on to-be-mapped areas
> + * before calling this function.
> + *
> + * RETURNS:
> + * The number of pages mapped on success, -errno on failure.
> + */
> +int map_kernel_range_noflush(unsigned long addr, unsigned long size,
> +			     pgprot_t prot, struct page **pages)
> +{
> +	return vmap_page_range_noflush(addr, addr + size, prot, pages);
> +}
> +
> +/**
> + * unmap_kernel_range_noflush - unmap kernel VM area
> + * @addr: start of the VM area to unmap
> + * @size: size of the VM area to unmap
> + *
> + * Unmap PFN_UP(@size) pages at @addr.  The VM area @addr and @size
> + * specify should have been allocated using get_vm_area() and its
> + * friends.
> + *
> + * NOTE:
> + * This function does NOT do any cache flushing.  The caller is
> + * responsible for calling flush_cache_vunmap() on to-be-mapped areas
> + * before calling this function and flush_tlb_kernel_range() after.
> + */
> +void unmap_kernel_range_noflush(unsigned long addr, unsigned long size)
> +{
> +	vunmap_page_range(addr, addr + size);
> +}

Should these be called
vmap_kernel_range_noflush/vunmap_kernel_range_noflush?

<avoids pointing out the 2 gigapage limit>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH UPDATED 09/10] percpu: implement new dynamic percpu allocator
  2009-02-20  7:30   ` [PATCH UPDATED " Tejun Heo
@ 2009-02-20  8:37     ` Andrew Morton
  2009-02-21  3:23       ` Tejun Heo
  0 siblings, 1 reply; 78+ messages in thread
From: Andrew Morton @ 2009-02-20  8:37 UTC (permalink / raw)
  To: Tejun Heo; +Cc: rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo

On Fri, 20 Feb 2009 16:30:01 +0900 Tejun Heo <tj@kernel.org> wrote:

> Impact: new scalable dynamic percpu allocator which allows dynamic
>         percpu areas to be accessed the same way as static ones
> 
> Implement scalable dynamic percpu allocator which can be used for both
> static and dynamic percpu areas.  This will allow static and dynamic
> areas to share faster direct access methods.  This feature is optional
> and enabled only when CONFIG_HAVE_DYNAMIC_PER_CPU_AREA is defined by
> arch.  Please read comment on top of mm/percpu.c for details.
> 
> ...
>
> +static int pcpu_unit_pages_shift;
> +static int pcpu_unit_pages;
> +static int pcpu_unit_shift;
> +static int pcpu_unit_size;
> +static int pcpu_chunk_size;
> +static int pcpu_nr_slots;
> +static size_t pcpu_chunk_struct_size;
> +
> +/* the address of the first chunk which starts with the kernel static area */
> +void *pcpu_base_addr;
> +EXPORT_SYMBOL_GPL(pcpu_base_addr);
> +
> +/* the size of kernel static area */
> +static int pcpu_static_size;

It would be nice to document the units of the `size' variables.  Bytes?
Pages?

Or, better: s/size/bytes/g.  

> +static int pcpu_size_to_slot(int size)
> +{
> +	int highbit = fls(size);
> +	return max(highbit - PCPU_SLOT_BASE_SHIFT + 2, 1);
> +}

See,

static int pcpu_bytes_to_slot(int bytes)
{
	int highbit = fls(bytes);
	return max(highbit - PCPU_SLOT_BASE_SHIFT + 2, 1);
}

is clearer.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCHSET x86/core/percpu] implement dynamic percpu allocator
  2009-02-20  3:17     ` Tejun Heo
@ 2009-02-20  9:32       ` Ingo Molnar
  2009-02-21  7:10         ` Tejun Heo
  0 siblings, 1 reply; 78+ messages in thread
From: Ingo Molnar @ 2009-02-20  9:32 UTC (permalink / raw)
  To: Tejun Heo; +Cc: rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw


* Tejun Heo <tj@kernel.org> wrote:

> Hello, Ingo.
> 
> Ingo Molnar wrote:
> > * Tejun Heo <tj@kernel.org> wrote:
> > 
> >> Tejun Heo wrote:
> >>>   One trick we can do is to reserve the initial chunk in non-vmalloc
> >>>   area so that at least the static cpu ones and whatever gets
> >>>   allocated in the first chunk is served by regular large page
> >>>   mappings.  Given that those are most frequent visited ones, this
> >>>   could be a nice compromise - no noticeable penalty for usual cases
> >>>   yet allowing scalability for unusual cases.  If this is something
> >>>   which can be agreed on, I'll pursue this.
> >> I've given more thought to this and it actually will solve 
> >> most of issues for non-NUMA but it can't be done for NUMA.  
> >> Any better ideas?
> > 
> > It could be allocated via NUMA-aware bootmem allocations.
> 
> Hmmm... not really.  Here's what I was planning to do on non-NUMA.
> 
>   Allocate the first chunk using alloc_bootmem().  After setting up
>   each unit, give back extra space sans the initialized static area
>   and some amount of free space which should be enough for common
>   cases by calling free_bootmem().  Mark the returned space as used in
>   the chunk map.
> 
> This will allow sane chunk size and scalability without adding 
> TLB pressure, so it's actually pretty sweet.  Unfortunately, 
> this doesn't really work for NUMA because we don't have 
> control over how NUMA addresses are laid out so we can't 
> allocate contiguous NUMA-correct chunk without remapping.  And 
> if we remap, we can't give back what's left to the allocator.  
> Giving back the original address doubles TLB usage and giving 
> back the remapped address breaks __pa/__va.  :-(

Where's the problem? Via bootmem we can allocate an arbitrarily 
large, properly NUMA-affine, well-aligned, linear, large-TLB 
piece of memory, for each CPU.

We should allocate a large enough chunk for the static percpu 
variables, and remap them using 2MB mapping[s].

I'm not sure where the desire for 'chunking' below 2MB comes 
from - there's no real benefit from it - the TLB will either be 
4K or 2MB, going in between makes little sense.

So i think the best (and simplest) approach is to:

 - allocate the static percpu area using bootmem-alloc, but 
   using a 2MB alignment parameter and a 2MB aligned size. Then 
   we can remap it to some convenient and undisturbed virtual 
   memory area, using 2MB TLBs. [*]

 - The 'partial' bit of the 2MB page (the one that is outside 
   the 4K-uprounded portion of __per_cpu_end - __per_cpu_start) 
   can then be freed via bootmem and is available as regular 
   pages to the rest of the kernel.

 - Then we start dynamic allocations at the _next_ 2MB boundary 
   in the virtual remapped space, and use 4K mappings from that 
   point on. Since at least initially we dont want to waste a 
   full 2MB page on dynamic allocations, we've got no choice but 
   to use 4K pages.

 - This means that percpu_alloc() will not return a pointer to 
   an array of percpu pointers - but will return a standard 
   offset that is valid in each percpu area and points to 
   somewhere beyond the 2MB boundary that comes after the 
   initial static area. This means it needs some minimal memory 
   management - but it all looks relatively straightforward.

So the virtual memory area will be continous, with a 'hole' in 
it that separates the static and dynamic portions, and dynamic 
percpu pointers will point straight into it [with a %gs offset] 
- without an intermediary array of pointers.

No chunking, no fuss - just bootmem plus 4K allocations - the 
best of both worlds.
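
A rough sketch of the bootmem arithmetic being proposed (illustrative
only; PMD_SIZE stands in for the 2MB large page size, 'node' is the
cpu's node, and the 2MB remap step itself is left out):

	size_t static_size = __per_cpu_end - __per_cpu_start;
	size_t alloc_size = roundup(static_size, PMD_SIZE);
	size_t used_size = PFN_ALIGN(static_size);
	void *ptr;

	/* 2MB-aligned, 2MB-multiple, node-affine bootmem allocation */
	ptr = __alloc_bootmem_node(NODE_DATA(node), alloc_size, PMD_SIZE,
				   __pa(MAX_DMA_ADDRESS));

	/* ... remap ptr with a 2MB TLB entry into the percpu virtual area ... */

	/* hand the tail beyond the 4K-uprounded static area back */
	if (alloc_size > used_size)
		free_bootmem_node(NODE_DATA(node), __pa(ptr) + used_size,
				  alloc_size - used_size);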

This also means we've essentially eliminated the boundary 
between static and dynamic APIs, and can probably use some of 
the same direct assembly optimizations (on x86) for local-CPU 
dynamic percpu accesses too. [maybe not all addressing modes are 
possible straight away, this needs a more precise check.]

	Ingo

[*] Note: the 2MB up-rounding bootmem trick above is needed to 
          make sure the partial 2MB page is still fully RAM - 
          it's not well-specified to have a PAT-incompatible 
          area (unmapped RAM, device memory, etc.) in that hole.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: Subject: [PATCH 08/10 UPDATED] vmalloc: add un/map_kernel_range_noflush()
  2009-02-20  8:32     ` Andrew Morton
@ 2009-02-21  3:21       ` Tejun Heo
  0 siblings, 0 replies; 78+ messages in thread
From: Tejun Heo @ 2009-02-21  3:21 UTC (permalink / raw)
  To: Andrew Morton; +Cc: rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo

Andrew Morton wrote:
> Should these be called
> vmap_kernel_range_noflush/vunmap_kernel_range_noflush?
> 
> <avoids pointing out the 2 gigapage limit>

Yeah, having the 'v' there would be nicer, but there already was an
exported function unmap_kernel_range() and I didn't want to rename the
current users or introduce inconsistent names (or interfaces).
Hmmm... there aren't many users and we could rename them all, but I
don't feel it's really necessary.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH UPDATED 09/10] percpu: implement new dynamic percpu allocator
  2009-02-20  8:37     ` Andrew Morton
@ 2009-02-21  3:23       ` Tejun Heo
  2009-02-21  3:42         ` [PATCH tj-percpu] percpu: s/size/bytes/g in new percpu allocator and interface Tejun Heo
  0 siblings, 1 reply; 78+ messages in thread
From: Tejun Heo @ 2009-02-21  3:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo

Andrew Morton wrote:
>> +void *pcpu_base_addr;
>> +EXPORT_SYMBOL_GPL(pcpu_base_addr);
>> +
>> +/* the size of kernel static area */
>> +static int pcpu_static_size;
> 
> It would be nice to document the units of the `size' variables.  Bytes?
> Pages?

I almost always use size for bytes, so it isn't confusing to me.

> Or, better: s/size/bytes/g.  
>
>> +static int pcpu_size_to_slot(int size)
>> +{
>> +	int highbit = fls(size);
>> +	return max(highbit - PCPU_SLOT_BASE_SHIFT + 2, 1);
>> +}
> 
> See,
> 
> static int pcpu_bytes_to_slot(int bytes)
> {
> 	int highbit = fls(bytes);
> 	return max(highbit - PCPU_SLOT_BASE_SHIFT + 2, 1);
> }
> 
> is clearer.

but, yeah, I agree.  I'll post a patch to do the renaming.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH tj-percpu] percpu: s/size/bytes/g in new percpu allocator and interface
  2009-02-21  3:23       ` Tejun Heo
@ 2009-02-21  3:42         ` Tejun Heo
  2009-02-21  7:48           ` Tejun Heo
  0 siblings, 1 reply; 78+ messages in thread
From: Tejun Heo @ 2009-02-21  3:42 UTC (permalink / raw)
  To: Andrew Morton; +Cc: rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo

Do s/size/bytes/g as per Andrew Morton's suggestion.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
Okay, here's the patch.  I also merged it into #tj-percpu.  Having done
the conversion, I'm not too thrilled though.  'size' was consistently
used to represent bytes, which is customary for a memory allocator, and
I can't really see how s/size/bytes/g makes things better for the percpu
allocator.  Clear naming is good, but not being able to use 'size' in
favor of 'bytes' seems a bit extreme to me.  After all, it's size_t and
sizeof(), not bytes_t and bytesof().  That said, I have nothing against
bytes either, so...

Thanks.

 include/linux/percpu.h |    8 +-
 mm/percpu.c            |  154 ++++++++++++++++++++++++------------------------
 2 files changed, 81 insertions(+), 81 deletions(-)

diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 1808099..7b61606 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -83,7 +83,7 @@ extern void *pcpu_base_addr;
 typedef void (*pcpu_populate_pte_fn_t)(unsigned long addr);
 
 extern size_t __init pcpu_setup_static(pcpu_populate_pte_fn_t populate_pte_fn,
-				       struct page **pages, size_t cpu_size);
+				       struct page **pages, size_t cpu_bytes);
 /*
  * Use this to get to a cpu's version of the per-cpu object
  * dynamically allocated. Non-atomic access to the current CPU's
@@ -107,14 +107,14 @@ struct percpu_data {
 
 #endif /* CONFIG_HAVE_DYNAMIC_PER_CPU_AREA */
 
-extern void *__alloc_percpu(size_t size, size_t align);
+extern void *__alloc_percpu(size_t bytes, size_t align);
 extern void free_percpu(void *__pdata);
 
 #else /* CONFIG_SMP */
 
 #define per_cpu_ptr(ptr, cpu) ({ (void)(cpu); (ptr); })
 
-static inline void *__alloc_percpu(size_t size, size_t align)
+static inline void *__alloc_percpu(size_t bytes, size_t align)
 {
 	/*
 	 * Can't easily make larger alignment work with kmalloc.  WARN
@@ -122,7 +122,7 @@ static inline void *__alloc_percpu(size_t size, size_t align)
 	 * percpu sections on SMP for which this path isn't used.
 	 */
 	WARN_ON_ONCE(align > __alignof__(unsigned long long));
-	return kzalloc(size, gfp);
+	return kzalloc(bytes, gfp);
 }
 
 static inline void free_percpu(void *p)
diff --git a/mm/percpu.c b/mm/percpu.c
index 4617d97..8d6725a 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -20,15 +20,15 @@
  * | u0 | u1 | u2 | u3 |        | u0 | u1 | u2 | u3 |      | u0 | u1 | u
  *  -------------------  ......  -------------------  ....  ------------
  *
- * Allocation is done in offset-size areas of single unit space.  Ie,
+ * Allocation is done in offset-bytes areas of single unit space.  Ie,
  * an area of 512 bytes at 6k in c1 occupies 512 bytes at 6k of c1:u0,
  * c1:u1, c1:u2 and c1:u3.  Percpu access can be done by configuring
- * percpu base registers UNIT_SIZE apart.
+ * percpu base registers pcpu_unit_bytes apart.
  *
  * There are usually many small percpu allocations, many of them as
  * small as 4 bytes.  The allocator organizes chunks into lists
- * according to free size and tries to allocate from the fullest one.
- * Each chunk keeps the maximum contiguous area size hint which is
+ * according to free bytes and tries to allocate from the fullest one.
+ * Each chunk keeps the maximum contiguous area bytes hint which is
  * guaranteed to be equal to or larger than the maximum contiguous
  * area in the chunk.  This helps the allocator not to iterate the
  * chunk maps unnecessarily.
@@ -67,15 +67,15 @@
 #include <asm/cacheflush.h>
 #include <asm/tlbflush.h>
 
-#define PCPU_MIN_UNIT_PAGES_SHIFT	4	/* also max alloc size */
+#define PCPU_MIN_UNIT_PAGES_SHIFT	4	/* also max alloc bytes */
 #define PCPU_SLOT_BASE_SHIFT		5	/* 1-31 shares the same slot */
 #define PCPU_DFL_MAP_ALLOC		16	/* start a map with 16 ents */
 
 struct pcpu_chunk {
 	struct list_head	list;		/* linked to pcpu_slot lists */
 	struct rb_node		rb_node;	/* key is chunk->vm->addr */
-	int			free_size;	/* free bytes in the chunk */
-	int			contig_hint;	/* max contiguous size hint */
+	int			free_bytes;	/* free bytes in the chunk */
+	int			contig_hint;	/* max contiguous bytes hint */
 	struct vm_struct	*vm;		/* mapped vmalloc region */
 	int			map_used;	/* # of map entries used */
 	int			map_alloc;	/* # of map entries allocated */
@@ -86,8 +86,8 @@ struct pcpu_chunk {
 static int pcpu_unit_pages_shift;
 static int pcpu_unit_pages;
 static int pcpu_unit_shift;
-static int pcpu_unit_size;
-static int pcpu_chunk_size;
+static int pcpu_unit_bytes;
+static int pcpu_chunk_bytes;
 static int pcpu_nr_slots;
 static size_t pcpu_chunk_struct_size;
 
@@ -96,7 +96,7 @@ void *pcpu_base_addr;
 EXPORT_SYMBOL_GPL(pcpu_base_addr);
 
 /* the size of kernel static area */
-static int pcpu_static_size;
+static int pcpu_static_bytes;
 
 /*
  * One mutex to rule them all.
@@ -117,18 +117,18 @@ static DEFINE_MUTEX(pcpu_mutex);
 static struct list_head *pcpu_slot;		/* chunk list slots */
 static struct rb_root pcpu_addr_root = RB_ROOT;	/* chunks by address */
 
-static int pcpu_size_to_slot(int size)
+static int pcpu_bytes_to_slot(int bytes)
 {
-	int highbit = fls(size);
+	int highbit = fls(bytes);
 	return max(highbit - PCPU_SLOT_BASE_SHIFT + 2, 1);
 }
 
 static int pcpu_chunk_slot(const struct pcpu_chunk *chunk)
 {
-	if (chunk->free_size < sizeof(int) || chunk->contig_hint < sizeof(int))
+	if (chunk->free_bytes < sizeof(int) || chunk->contig_hint < sizeof(int))
 		return 0;
 
-	return pcpu_size_to_slot(chunk->free_size);
+	return pcpu_bytes_to_slot(chunk->free_bytes);
 }
 
 static int pcpu_page_idx(unsigned int cpu, int page_idx)
@@ -158,8 +158,8 @@ static bool pcpu_chunk_page_occupied(struct pcpu_chunk *chunk,
 /**
  * pcpu_realloc - versatile realloc
  * @p: the current pointer (can be NULL for new allocations)
- * @size: the current size (can be 0 for new allocations)
- * @new_size: the wanted new size (can be 0 for free)
+ * @bytes: the current size (can be 0 for new allocations)
+ * @new_bytes: the wanted new size (can be 0 for free)
  *
  * More robust realloc which can be used to allocate, resize or free a
  * memory area of arbitrary size.  If the needed size goes over
@@ -168,22 +168,22 @@ static bool pcpu_chunk_page_occupied(struct pcpu_chunk *chunk,
  * RETURNS:
  * The new pointer on success, NULL on failure.
  */
-static void *pcpu_realloc(void *p, size_t size, size_t new_size)
+static void *pcpu_realloc(void *p, size_t bytes, size_t new_bytes)
 {
 	void *new;
 
-	if (new_size <= PAGE_SIZE)
-		new = kmalloc(new_size, GFP_KERNEL);
+	if (new_bytes <= PAGE_SIZE)
+		new = kmalloc(new_bytes, GFP_KERNEL);
 	else
-		new = vmalloc(new_size);
-	if (new_size && !new)
+		new = vmalloc(new_bytes);
+	if (new_bytes && !new)
 		return NULL;
 
-	memcpy(new, p, min(size, new_size));
-	if (new_size > size)
-		memset(new + size, 0, new_size - size);
+	memcpy(new, p, min(bytes, new_bytes));
+	if (new_bytes > bytes)
+		memset(new + bytes, 0, new_bytes - bytes);
 
-	if (size <= PAGE_SIZE)
+	if (bytes <= PAGE_SIZE)
 		kfree(p);
 	else
 		vfree(p);
@@ -346,17 +346,17 @@ static int pcpu_split_block(struct pcpu_chunk *chunk, int i, int head, int tail)
 /**
  * pcpu_alloc_area - allocate area from a pcpu_chunk
  * @chunk: chunk of interest
- * @size: wanted size
+ * @bytes: wanted size
  * @align: wanted align
  *
- * Try to allocate @size bytes area aligned at @align from @chunk.
- * Note that this function only allocates the offset.  It doesn't
- * populate or map the area.
+ * Try to allocate @bytes area aligned at @align from @chunk.  Note
+ * that this function only allocates the offset.  It doesn't populate
+ * or map the area.
  *
  * RETURNS:
  * Allocated offset in @chunk on success, -errno on failure.
  */
-static int pcpu_alloc_area(struct pcpu_chunk *chunk, int size, int align)
+static int pcpu_alloc_area(struct pcpu_chunk *chunk, int bytes, int align)
 {
 	int oslot = pcpu_chunk_slot(chunk);
 	int max_contig = 0;
@@ -373,9 +373,9 @@ static int pcpu_alloc_area(struct pcpu_chunk *chunk, int size, int align)
 			return -ENOMEM;
 
 		chunk->map_alloc = PCPU_DFL_MAP_ALLOC;
-		chunk->map[chunk->map_used++] = -pcpu_static_size;
-		if (chunk->free_size)
-			chunk->map[chunk->map_used++] = chunk->free_size;
+		chunk->map[chunk->map_used++] = -pcpu_static_bytes;
+		if (chunk->free_bytes)
+			chunk->map[chunk->map_used++] = chunk->free_bytes;
 	}
 
 	for (i = 0, off = 0; i < chunk->map_used; off += abs(chunk->map[i++])) {
@@ -388,7 +388,7 @@ static int pcpu_alloc_area(struct pcpu_chunk *chunk, int size, int align)
 
 		if (chunk->map[i] < 0)
 			continue;
-		if (chunk->map[i] < head + size) {
+		if (chunk->map[i] < head + bytes) {
 			max_contig = max(chunk->map[i], max_contig);
 			continue;
 		}
@@ -404,7 +404,7 @@ static int pcpu_alloc_area(struct pcpu_chunk *chunk, int size, int align)
 				chunk->map[i - 1] += head;
 			else {
 				chunk->map[i - 1] -= head;
-				chunk->free_size -= head;
+				chunk->free_bytes -= head;
 			}
 			chunk->map[i] -= head;
 			off += head;
@@ -412,7 +412,7 @@ static int pcpu_alloc_area(struct pcpu_chunk *chunk, int size, int align)
 		}
 
 		/* if tail is small, just keep it around */
-		tail = chunk->map[i] - head - size;
+		tail = chunk->map[i] - head - bytes;
 		if (tail < sizeof(int))
 			tail = 0;
 
@@ -436,7 +436,7 @@ static int pcpu_alloc_area(struct pcpu_chunk *chunk, int size, int align)
 			chunk->contig_hint = max(chunk->contig_hint,
 						 max_contig);
 
-		chunk->free_size -= chunk->map[i];
+		chunk->free_bytes -= chunk->map[i];
 		chunk->map[i] = -chunk->map[i];
 
 		pcpu_chunk_relocate(chunk, oslot);
@@ -477,7 +477,7 @@ static void pcpu_free_area(struct pcpu_chunk *chunk, int freeme)
 	BUG_ON(chunk->map[i] > 0);
 
 	chunk->map[i] = -chunk->map[i];
-	chunk->free_size += chunk->map[i];
+	chunk->free_bytes += chunk->map[i];
 
 	/* merge with previous? */
 	if (i > 0 && chunk->map[i - 1] >= 0) {
@@ -540,18 +540,18 @@ static void pcpu_unmap(struct pcpu_chunk *chunk, int page_start, int page_end,
  * pcpu_depopulate_chunk - depopulate and unmap an area of a pcpu_chunk
  * @chunk: chunk to depopulate
  * @off: offset to the area to depopulate
- * @size: size of the area to depopulate
+ * @bytes: size of the area to depopulate
  * @flush: whether to flush cache and tlb or not
  *
  * For each cpu, depopulate and unmap pages [@page_start,@page_end)
  * from @chunk.  If @flush is true, vcache is flushed before unmapping
  * and tlb after.
  */
-static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk, size_t off,
-				  size_t size, bool flush)
+static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk, int off, int bytes,
+				  bool flush)
 {
 	int page_start = PFN_DOWN(off);
-	int page_end = PFN_UP(off + size);
+	int page_end = PFN_UP(off + bytes);
 	int unmap_start = -1;
 	int uninitialized_var(unmap_end);
 	unsigned int cpu;
@@ -617,16 +617,16 @@ static int pcpu_map(struct pcpu_chunk *chunk, int page_start, int page_end)
  * pcpu_populate_chunk - populate and map an area of a pcpu_chunk
  * @chunk: chunk of interest
  * @off: offset to the area to populate
- * @size: size of the area to populate
+ * @bytes: size of the area to populate
  *
  * For each cpu, populate and map pages [@page_start,@page_end) into
  * @chunk.  The area is cleared on return.
  */
-static int pcpu_populate_chunk(struct pcpu_chunk *chunk, int off, int size)
+static int pcpu_populate_chunk(struct pcpu_chunk *chunk, int off, int bytes)
 {
 	const gfp_t alloc_mask = GFP_KERNEL | __GFP_HIGHMEM | __GFP_COLD;
 	int page_start = PFN_DOWN(off);
-	int page_end = PFN_UP(off + size);
+	int page_end = PFN_UP(off + bytes);
 	int map_start = -1;
 	int map_end;
 	unsigned int cpu;
@@ -660,12 +660,12 @@ static int pcpu_populate_chunk(struct pcpu_chunk *chunk, int off, int size)
 
 	for_each_possible_cpu(cpu)
 		memset(chunk->vm->addr + (cpu << pcpu_unit_shift) + off, 0,
-		       size);
+		       bytes);
 
 	return 0;
 err:
 	/* likely under heavy memory pressure, give memory back */
-	pcpu_depopulate_chunk(chunk, off, size, true);
+	pcpu_depopulate_chunk(chunk, off, bytes, true);
 	return -ENOMEM;
 }
 
@@ -690,53 +690,53 @@ static struct pcpu_chunk *alloc_pcpu_chunk(void)
 	chunk->map = pcpu_realloc(NULL, 0,
 				  PCPU_DFL_MAP_ALLOC * sizeof(chunk->map[0]));
 	chunk->map_alloc = PCPU_DFL_MAP_ALLOC;
-	chunk->map[chunk->map_used++] = pcpu_unit_size;
+	chunk->map[chunk->map_used++] = pcpu_unit_bytes;
 
-	chunk->vm = get_vm_area(pcpu_chunk_size, GFP_KERNEL);
+	chunk->vm = get_vm_area(pcpu_chunk_bytes, GFP_KERNEL);
 	if (!chunk->vm) {
 		free_pcpu_chunk(chunk);
 		return NULL;
 	}
 
 	INIT_LIST_HEAD(&chunk->list);
-	chunk->free_size = pcpu_unit_size;
-	chunk->contig_hint = pcpu_unit_size;
+	chunk->free_bytes = pcpu_unit_bytes;
+	chunk->contig_hint = pcpu_unit_bytes;
 
 	return chunk;
 }
 
 /**
  * __alloc_percpu - allocate percpu area
- * @size: size of area to allocate
+ * @bytes: size of area to allocate
  * @align: alignment of area (max PAGE_SIZE)
  *
- * Allocate percpu area of @size bytes aligned at @align.  Might
- * sleep.  Might trigger writeouts.
+ * Allocate percpu area of @bytes aligned at @align.  Might sleep.
+ * Might trigger writeouts.
  *
  * RETURNS:
  * Percpu pointer to the allocated area on success, NULL on failure.
  */
-void *__alloc_percpu(size_t size, size_t align)
+void *__alloc_percpu(size_t bytes, size_t align)
 {
 	void *ptr = NULL;
 	struct pcpu_chunk *chunk;
 	int slot, off;
 
-	if (unlikely(!size || size > PAGE_SIZE << PCPU_MIN_UNIT_PAGES_SHIFT ||
+	if (unlikely(!bytes || bytes > PAGE_SIZE << PCPU_MIN_UNIT_PAGES_SHIFT ||
 		     align > PAGE_SIZE)) {
 		WARN(true, "illegal size (%zu) or align (%zu) for "
-		     "percpu allocation\n", size, align);
+		     "percpu allocation\n", bytes, align);
 		return NULL;
 	}
 
 	mutex_lock(&pcpu_mutex);
 
 	/* allocate area */
-	for (slot = pcpu_size_to_slot(size); slot < pcpu_nr_slots; slot++) {
+	for (slot = pcpu_bytes_to_slot(bytes); slot < pcpu_nr_slots; slot++) {
 		list_for_each_entry(chunk, &pcpu_slot[slot], list) {
-			if (size > chunk->contig_hint)
+			if (bytes > chunk->contig_hint)
 				continue;
-			off = pcpu_alloc_area(chunk, size, align);
+			off = pcpu_alloc_area(chunk, bytes, align);
 			if (off >= 0)
 				goto area_found;
 			if (off != -ENOSPC)
@@ -751,13 +751,13 @@ void *__alloc_percpu(size_t size, size_t align)
 	pcpu_chunk_relocate(chunk, -1);
 	pcpu_chunk_addr_insert(chunk);
 
-	off = pcpu_alloc_area(chunk, size, align);
+	off = pcpu_alloc_area(chunk, bytes, align);
 	if (off < 0)
 		goto out_unlock;
 
 area_found:
 	/* populate, map and clear the area */
-	if (pcpu_populate_chunk(chunk, off, size)) {
+	if (pcpu_populate_chunk(chunk, off, bytes)) {
 		pcpu_free_area(chunk, off);
 		goto out_unlock;
 	}
@@ -771,7 +771,7 @@ EXPORT_SYMBOL_GPL(__alloc_percpu);
 
 static void pcpu_kill_chunk(struct pcpu_chunk *chunk)
 {
-	pcpu_depopulate_chunk(chunk, 0, pcpu_unit_size, false);
+	pcpu_depopulate_chunk(chunk, 0, pcpu_unit_bytes, false);
 	list_del(&chunk->list);
 	rb_erase(&chunk->rb_node, &pcpu_addr_root);
 	free_pcpu_chunk(chunk);
@@ -800,7 +800,7 @@ void free_percpu(void *ptr)
 	pcpu_free_area(chunk, off);
 
 	/* the chunk became fully free, kill one if there are other free ones */
-	if (chunk->free_size == pcpu_unit_size) {
+	if (chunk->free_bytes == pcpu_unit_bytes) {
 		struct pcpu_chunk *pos;
 
 		list_for_each_entry(pos,
@@ -818,7 +818,7 @@ EXPORT_SYMBOL_GPL(free_percpu);
 /**
  * pcpu_setup_static - initialize kernel static percpu area
  * @populate_pte_fn: callback to allocate pagetable
- * @pages: num_possible_cpus() * PFN_UP(cpu_size) pages
+ * @pages: num_possible_cpus() * PFN_UP(cpu_bytes) pages
  *
  * Initialize kernel static percpu area.  The caller should allocate
  * all the necessary pages and pass them in @pages.
@@ -827,27 +827,27 @@ EXPORT_SYMBOL_GPL(free_percpu);
  * tables for the page is allocated.
  *
  * RETURNS:
- * The determined pcpu_unit_size which can be used to initialize
+ * The determined pcpu_unit_bytes which can be used to initialize
  * percpu access.
  */
 size_t __init pcpu_setup_static(pcpu_populate_pte_fn_t populate_pte_fn,
-				struct page **pages, size_t cpu_size)
+				struct page **pages, size_t cpu_bytes)
 {
 	static struct vm_struct static_vm;
 	struct pcpu_chunk *static_chunk;
-	int nr_cpu_pages = DIV_ROUND_UP(cpu_size, PAGE_SIZE);
+	int nr_cpu_pages = DIV_ROUND_UP(cpu_bytes, PAGE_SIZE);
 	unsigned int cpu;
 	int err, i;
 
 	pcpu_unit_pages_shift = max_t(int, PCPU_MIN_UNIT_PAGES_SHIFT,
-				      order_base_2(cpu_size) - PAGE_SHIFT);
+				      order_base_2(cpu_bytes) - PAGE_SHIFT);
 
-	pcpu_static_size = cpu_size;
+	pcpu_static_bytes = cpu_bytes;
 	pcpu_unit_pages = 1 << pcpu_unit_pages_shift;
 	pcpu_unit_shift = PAGE_SHIFT + pcpu_unit_pages_shift;
-	pcpu_unit_size = 1 << pcpu_unit_shift;
-	pcpu_chunk_size = num_possible_cpus() * pcpu_unit_size;
-	pcpu_nr_slots = pcpu_size_to_slot(pcpu_unit_size) + 1;
+	pcpu_unit_bytes = 1 << pcpu_unit_shift;
+	pcpu_chunk_bytes = num_possible_cpus() * pcpu_unit_bytes;
+	pcpu_nr_slots = pcpu_bytes_to_slot(pcpu_unit_bytes) + 1;
 	pcpu_chunk_struct_size = sizeof(struct pcpu_chunk)
 		+ (1 << pcpu_unit_pages_shift) * sizeof(struct page *);
 
@@ -858,15 +858,15 @@ size_t __init pcpu_setup_static(pcpu_populate_pte_fn_t populate_pte_fn,
 
 	/* init and register vm area */
 	static_vm.flags = VM_ALLOC;
-	static_vm.size = pcpu_chunk_size;
+	static_vm.size = pcpu_chunk_bytes;
 	vm_area_register_early(&static_vm);
 
 	/* init static_chunk */
 	static_chunk = alloc_bootmem(pcpu_chunk_struct_size);
 	INIT_LIST_HEAD(&static_chunk->list);
 	static_chunk->vm = &static_vm;
-	static_chunk->free_size = pcpu_unit_size - pcpu_static_size;
-	static_chunk->contig_hint = static_chunk->free_size;
+	static_chunk->free_bytes = pcpu_unit_bytes - pcpu_static_bytes;
+	static_chunk->contig_hint = static_chunk->free_bytes;
 
 	/* assign pages and map them */
 	for_each_possible_cpu(cpu) {
@@ -886,5 +886,5 @@ size_t __init pcpu_setup_static(pcpu_populate_pte_fn_t populate_pte_fn,
 
 	/* we're done */
 	pcpu_base_addr = (void *)pcpu_chunk_addr(static_chunk, 0, 0);
-	return pcpu_unit_size;
+	return pcpu_unit_bytes;
 }
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCHSET x86/core/percpu] implement dynamic percpu allocator
  2009-02-20  9:32       ` Ingo Molnar
@ 2009-02-21  7:10         ` Tejun Heo
  2009-02-21  7:33           ` Tejun Heo
  2009-02-22 19:27           ` [PATCHSET x86/core/percpu] implement dynamic percpu allocator Ingo Molnar
  0 siblings, 2 replies; 78+ messages in thread
From: Tejun Heo @ 2009-02-21  7:10 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw

Hello, Ingo.

Ingo Molnar wrote:
> Where's the problem? Via bootmem we can allocate an arbitrarily 
> large, properly NUMA-affine, well-aligned, linear, large-TLB 
> piece of memory, for each CPU.

I wish it was that peachy.  The problem is the added TLB pressure.

> We should allocate a large enough chunk for the static percpu 
> variables, and remap them using 2MB mapping[s].
> 
> I'm not sure where the desire for 'chunking' below 2MB comes 
> from - there's no real benefit from it - the TLB will either be 
> 4K or 2MB, going inbetween makes little sense.

Making the 'chunk' size 2MB would be useful for non-NUMA.  For NUMA,
making the 'chunk' size 2MB doesn't help much.  For unit size, 4k is
the minimum and 2MB is a meaningful boundary if the percpu area gets
sufficiently large, as large page mapping can then be used for NUMA.
For chunk size, 4k * num_possible_cpus() is the minimum; 2MB is a
meaningful boundary for !NUMA and 2MB * num_possible_cpus() for NUMA.

Anything between 4k and one of the meaningful boundaries doesn't make
much difference other than that the chunk size needs to be at least as
large as the maximum supported allocation.  Above a certain limit,
going larger doesn't provide much benefit.  Given the tight vm
situation on 32bits, there simply isn't a good reason to default to
2MB unless large mapping is gonna be used.
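
(As a rough illustration only - this is not code from the patchset and
the names are made up - the size relationships above boil down to
something like the following.)

	/* illustrative sketch, not the allocator's actual code */
	static unsigned long example_unit_bytes(unsigned long static_bytes)
	{
		unsigned long unit = 4096;	/* 4k minimum */

		while (unit < static_bytes)	/* round up to a power of 2 */
			unit <<= 1;
		return unit;
	}

	static unsigned long example_chunk_bytes(unsigned long unit_bytes,
						 unsigned int nr_cpus)
	{
		/* e.g. 2MB * num_possible_cpus() at the NUMA boundary */
		return unit_bytes * nr_cpus;
	}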

> So i think the best (and simplest) approach is to:
> 
>  - allocate the static percpu area using bootmem-alloc, but 
>    using a 2MB alignment parameter and a 2MB aligned size. Then 
>    we can remap it to some convenient and undisturbed virtual 
>    memory area, using 2MB TLBs. [*]
> 
>  - The 'partial' bit of the 2MB page (the one that is outside 
>    the 4K-uprounded portion of __per_cpu_end - __per_cpu_start) 
>    can then be freed via bootmem and is available as regular 
>    pages to the rest of the kernel.

Heh... that's exactly where the problem is.  If you remap and free
what's left.  The remapped area and the freed area will use two
separate 2MB TLBs instead of one.  It's probably worse than simply
using 4k mappings.

On !NUMA, we can get away with this because the static percpu area
doesn't need to be remapped so the physical mapping can be used unchanged
and what's left can be returned to the system.  On NUMA, we need to remap
so we can't easily return what's left.

>  - Then we start dynamic allocations at the _next_ 2MB boundary 
>    in the virtual remapped space, and use 4K mappings from that 
>    point on. Since at least initially we dont want to waste a 
>    full 2MB page on dynamic allocations, we've got no choice but 
>    to use 4K pages.

It will be better to reserve some area for dynamic allocation so that
usual percpu allocations can be served by the initial mapping, which
tends to be pretty small on usual configurations.

>  - This means that percpu_alloc() will not return a pointer to 
>    an array of percpu pointers - but will return a standard 
>    offset that is valid in each percpu area and points to 
>    somewhere beyond the 2MB boundary that comes after the 
>    initial static area. This means it needs some minimal memory 
>    management - but it all looks relatively straightforward.
>
> So the virtual memory area will be continous, with a 'hole' in 
> it that separates the static and dynamic portions, and dynamic 
> percpu pointers will point straight into it [with a %gs offset] 
> - without an intermediary array of pointers.
> 
> No chunking, no fuss - just bootmem plus 4K allocations - the 
> best of both worlds.

The new percpu_alloc() already does that.  Chunking or not makes no
difference in this regard.  The only difference is whether there are
more holes in the allocated percpu addresses or not, which basically is
irrelevant, and chunking makes things much more flexible and scalable.
ie. It can scale toward many many cpus or large large percpu areas
whereas making the areas contiguous makes the scalability determined by
the product of the two.
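
(Purely as a sketch, not the actual per_cpu_ptr() implementation: with
the units laid out a fixed stride apart, a single allocated offset
addresses every cpu's copy roughly like this.)

	/* illustrative only; names and layout are simplified */
	static void *example_pcpu_ptr(void *base, unsigned long unit_bytes,
				      unsigned int cpu, unsigned long off)
	{
		/* cpu N's unit starts N * unit_bytes past the first unit */
		return (char *)base + cpu * unit_bytes + off;
	}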

Also, contiguous per-cpu areas might look simpler but they actually are
more complicated because the approach becomes much more arch dependent.
With chunking, the complexity stays in generic code as the virtual
address handling is already in place.  If the cpu areas need to be made
contiguous, the generic code will get simpler but each arch needs to
come up with a new address space layout.

There simply isn't any measurable advantage to making the area
contiguous.

> This also means we've essentially eliminated the boundary 
> between static and dynamic APIs, and can probably use some of 
> the same direct assembly optimizations (on x86) for local-CPU 
> dynamic percpu accesses too. [maybe not all addressing modes are 
> possible straight away, this needs a more precise check.]

The posted patchset already does that.  Please take a look at the new
per_cpu_ptr().  It's basically &per_cpu().  Unifying accessors is the
next step and I'm planning to consolidate the local_t implementation
into it too, but I think all that depends on us agreeing on the allocator.
I can remove the TLB problem from non-NUMA case but for NUMA I still
don't have a good idea.  Maybe we need to accept the overhead for
NUMA?  I don't know.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCHSET x86/core/percpu] implement dynamic percpu allocator
  2009-02-21  7:10         ` Tejun Heo
@ 2009-02-21  7:33           ` Tejun Heo
  2009-02-22 19:38             ` Ingo Molnar
  2009-02-22 19:27           ` [PATCHSET x86/core/percpu] implement dynamic percpu allocator Ingo Molnar
  1 sibling, 1 reply; 78+ messages in thread
From: Tejun Heo @ 2009-02-21  7:33 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw

Tejun Heo wrote:
> I can remove the TLB problem from non-NUMA case but for NUMA I still
> don't have a good idea.  Maybe we need to accept the overhead for
> NUMA?  I don't know.

Hmmmm... one thing we can do on NUMA is to remap and free the remapped
address and make __pa() and __va() handle that area specially.  It's a
bit convoluted but the added overhead should be minimal.  It'll only
be simple range check in __pa()/__va() and it's not like they are
super hot paths anyway.  I'll give it a shot.
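
(Sketch only, with made-up symbol names - not actual code: the special
casing would look something along the lines of the below, assuming the
usual PAGE_OFFSET linear mapping for everything else.)

	/* hypothetical bounds of the remapped percpu region */
	extern unsigned long pcpu_remap_virt_start, pcpu_remap_virt_end;
	extern unsigned long pcpu_remap_phys_start;

	static inline unsigned long example_pa(unsigned long x)
	{
		if (x >= pcpu_remap_virt_start && x < pcpu_remap_virt_end)
			return x - pcpu_remap_virt_start + pcpu_remap_phys_start;
		return x - PAGE_OFFSET;		/* the usual linear case */
	}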

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH tj-percpu] percpu: s/size/bytes/g in new percpu allocator and interface
  2009-02-21  3:42         ` [PATCH tj-percpu] percpu: s/size/bytes/g in new percpu allocator and interface Tejun Heo
@ 2009-02-21  7:48           ` Tejun Heo
  2009-02-21  7:55             ` [PATCH tj-percpu] percpu: clean up size usage Tejun Heo
  0 siblings, 1 reply; 78+ messages in thread
From: Tejun Heo @ 2009-02-21  7:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo

Hello, Andrew.

Tejun Heo wrote:
> Do s/size/bytes/g as per Andrew Morton's suggestion.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
> Okay, here's the patch.  I also merged it to #tj-percpu.  Having done
> the conversion, I'm not too thrilled tho.  size was consistently used
> to represent bytes and it's very customary especially if it's a memory
> allocator and I can't really see how s/size/bytes/g makes things
> better for percpu allocator.  Clear naming is good but not being able
> to use size in favor of bytes seems a bit extreme to me.  After all,
> it's size_t and sizeof() not bytes_t and bytesof().  That said, I have
> nothing against bytes either, so...

Eeeek... I'm sorry but I'm popping this patch.  It just doesn't look
right.  I'll add comments where appropriate that size is in bytes
instead.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH tj-percpu] percpu: clean up size usage
  2009-02-21  7:48           ` Tejun Heo
@ 2009-02-21  7:55             ` Tejun Heo
  2009-02-21  7:56               ` Tejun Heo
  0 siblings, 1 reply; 78+ messages in thread
From: Tejun Heo @ 2009-02-21  7:55 UTC (permalink / raw)
  To: Andrew Morton; +Cc: rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo

Andrew was concerned about the unit of variables which are named size
or have a size suffix.  Every usage in the percpu allocator is in bytes
but make it super clear by adding comments.

While at it, make pcpu_depopulate_chunk() take int @off and @size like
everyone else.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
Thanks.

 mm/percpu.c |   23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 4617d97..297b31f 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -119,7 +119,7 @@ static struct rb_root pcpu_addr_root = RB_ROOT;	/* chunks by address */
 
 static int pcpu_size_to_slot(int size)
 {
-	int highbit = fls(size);
+	int highbit = fls(size);	/* size is in bytes */
 	return max(highbit - PCPU_SLOT_BASE_SHIFT + 2, 1);
 }
 
@@ -158,8 +158,8 @@ static bool pcpu_chunk_page_occupied(struct pcpu_chunk *chunk,
 /**
  * pcpu_realloc - versatile realloc
  * @p: the current pointer (can be NULL for new allocations)
- * @size: the current size (can be 0 for new allocations)
- * @new_size: the wanted new size (can be 0 for free)
+ * @size: the current size in bytes (can be 0 for new allocations)
+ * @new_size: the wanted new size in bytes (can be 0 for free)
  *
  * More robust realloc which can be used to allocate, resize or free a
  * memory area of arbitrary size.  If the needed size goes over
@@ -290,8 +290,8 @@ static void pcpu_chunk_addr_insert(struct pcpu_chunk *new)
  * pcpu_split_block - split a map block
  * @chunk: chunk of interest
  * @i: index of map block to split
- * @head: head size (can be 0)
- * @tail: tail size (can be 0)
+ * @head: head size in bytes (can be 0)
+ * @tail: tail size in bytes (can be 0)
  *
  * Split the @i'th map block into two or three blocks.  If @head is
  * non-zero, @head bytes block is inserted before block @i moving it
@@ -346,7 +346,7 @@ static int pcpu_split_block(struct pcpu_chunk *chunk, int i, int head, int tail)
 /**
  * pcpu_alloc_area - allocate area from a pcpu_chunk
  * @chunk: chunk of interest
- * @size: wanted size
+ * @size: wanted size in bytes
  * @align: wanted align
  *
  * Try to allocate @size bytes area aligned at @align from @chunk.
@@ -540,15 +540,15 @@ static void pcpu_unmap(struct pcpu_chunk *chunk, int page_start, int page_end,
  * pcpu_depopulate_chunk - depopulate and unmap an area of a pcpu_chunk
  * @chunk: chunk to depopulate
  * @off: offset to the area to depopulate
- * @size: size of the area to depopulate
+ * @size: size of the area to depopulate in bytes
  * @flush: whether to flush cache and tlb or not
  *
  * For each cpu, depopulate and unmap pages [@page_start,@page_end)
  * from @chunk.  If @flush is true, vcache is flushed before unmapping
  * and tlb after.
  */
-static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk, size_t off,
-				  size_t size, bool flush)
+static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk, intt off, int size,
+				  bool flush)
 {
 	int page_start = PFN_DOWN(off);
 	int page_end = PFN_UP(off + size);
@@ -617,7 +617,7 @@ static int pcpu_map(struct pcpu_chunk *chunk, int page_start, int page_end)
  * pcpu_populate_chunk - populate and map an area of a pcpu_chunk
  * @chunk: chunk of interest
  * @off: offset to the area to populate
- * @size: size of the area to populate
+ * @size: size of the area to populate in bytes
  *
  * For each cpu, populate and map pages [@page_start,@page_end) into
  * @chunk.  The area is cleared on return.
@@ -707,7 +707,7 @@ static struct pcpu_chunk *alloc_pcpu_chunk(void)
 
 /**
  * __alloc_percpu - allocate percpu area
- * @size: size of area to allocate
+ * @size: size of area to allocate in bytes
  * @align: alignment of area (max PAGE_SIZE)
  *
  * Allocate percpu area of @size bytes aligned at @align.  Might
@@ -819,6 +819,7 @@ EXPORT_SYMBOL_GPL(free_percpu);
  * pcpu_setup_static - initialize kernel static percpu area
  * @populate_pte_fn: callback to allocate pagetable
  * @pages: num_possible_cpus() * PFN_UP(cpu_size) pages
+ * @cpu_size: the size of static percpu area in bytes
  *
  * Initialize kernel static percpu area.  The caller should allocate
  * all the necessary pages and pass them in @pages.

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH tj-percpu] percpu: clean up size usage
  2009-02-21  7:55             ` [PATCH tj-percpu] percpu: clean up size usage Tejun Heo
@ 2009-02-21  7:56               ` Tejun Heo
  0 siblings, 0 replies; 78+ messages in thread
From: Tejun Heo @ 2009-02-21  7:56 UTC (permalink / raw)
  To: Andrew Morton; +Cc: rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo

Tejun Heo wrote:
> +static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk, intt off, int size,

intt should have been int.  Corrected version committed to tj-percpu.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCHSET x86/core/percpu] implement dynamic percpu allocator
  2009-02-21  7:10         ` Tejun Heo
  2009-02-21  7:33           ` Tejun Heo
@ 2009-02-22 19:27           ` Ingo Molnar
  2009-02-23  0:47             ` Tejun Heo
  1 sibling, 1 reply; 78+ messages in thread
From: Ingo Molnar @ 2009-02-22 19:27 UTC (permalink / raw)
  To: Tejun Heo; +Cc: rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw


* Tejun Heo <tj@kernel.org> wrote:

> > So i think the best (and simplest) approach is to:
> > 
> >  - allocate the static percpu area using bootmem-alloc, but 
> >    using a 2MB alignment parameter and a 2MB aligned size. Then 
> >    we can remap it to some convenient and undisturbed virtual 
> >    memory area, using 2MB TLBs. [*]
> > 
> >  - The 'partial' bit of the 2MB page (the one that is outside 
> >    the 4K-uprounded portion of __per_cpu_end - __per_cpu_start) 
> >    can then be freed via bootmem and is available as regular 
> >    pages to the rest of the kernel.
> 
> Heh... that's exactly where the problem is.  If you remap and 
> free what's left.  The remapped area and the freed area will 
> use two separate 2MB TLBs instead of one.  It's probably worse 
> than simply using 4k mappings.

Uhm, no. We'll have one extra 2MB TLB and that's it. Both the 
low linear 2MB TLB and the high remapped alias 2MB TLB will 
cover an average of 256 4K pages. A very good deal still.

We dont want to split up the static percpu area into zillions of 
small 4K TLBs - we'd rather use +1 large-TLB.

If we used 4K ptes we'd waste up to 512 TLB entries. (largely 
simplified, as the number of large TLB entries is smaller than that 
of 4K TLBs, but the argument still holds in terms of TLB reach.)

So there is no "TLB problem" whatsoever that i can see ...

	Ingo

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCHSET x86/core/percpu] implement dynamic percpu allocator
  2009-02-21  7:33           ` Tejun Heo
@ 2009-02-22 19:38             ` Ingo Molnar
  2009-02-23  0:43               ` Tejun Heo
  0 siblings, 1 reply; 78+ messages in thread
From: Ingo Molnar @ 2009-02-22 19:38 UTC (permalink / raw)
  To: Tejun Heo, Linus Torvalds
  Cc: rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw


* Tejun Heo <tj@kernel.org> wrote:

> Tejun Heo wrote:
> > I can remove the TLB problem from non-NUMA case but for NUMA I still
> > don't have a good idea.  Maybe we need to accept the overhead for
> > NUMA?  I don't know.
> 
> Hmmmm... one thing we can do on NUMA is to remap and free the 
> remapped address and make __pa() and __va() handle that area 
> specially.  It's a bit convoluted but the added overhead 
> should be minimal.  It'll only be simple range check in 
> __pa()/__va() and it's not like they are super hot paths 
> anyway.  I'll give it a shot.

Heck no. It is absolutely crazy to complicate __pa()/__va() in 
_any_ way just to 'save' one more 2MB dTLB.

We'll use that TLB because that is what TLBs are for: to handle 
mapped pages. Yes, in the percpu scheme we are working on we'll 
have a 'dual' mapping for the static percpu area (on 64-bit) but 
mapping aliases have been one of the most basic CPU features for 
the past 15 years ...

Even a single NOP in the __pa()/__va() path is _more_ expensive 
than that TLB, believe me.

Look at last year's cheap quad CPU:

 Data TLB: 4MB pages, 4-way associative, 32 entries

That's 32x2MB = 64MB of data reach. Our access patterns in the 
kernel tend to be pretty focused as well, so 32 is more than 
enough in practice.

Especially if the pte is cached a TLB fill is very cheap on 
Intel CPUs. So even if we were trashing those 32 entries (which 
we are generally not), having a dTLB for the percpu area is a 
TLB entry well spent.

So lets just do the most simple and most straightforward mapping 
approach which i suggested: it takes advantage of everything, is 
very close to the best possible performance in the cached case - 
and dont worry about hardware resources.

The moment you start worrying about hardware resources on that 
level and start 'optimizing' it in software, you've already lost 
it. It leads down to the path of soft-TLB handlers and other 
sillyness. There's no way you can win such a race against 
hardware fundamentals - at least at today's speed of advance in 
the hw space.

	Ingo

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCHSET x86/core/percpu] implement dynamic percpu allocator
  2009-02-22 19:38             ` Ingo Molnar
@ 2009-02-23  0:43               ` Tejun Heo
  2009-02-23 10:17                 ` Ingo Molnar
  0 siblings, 1 reply; 78+ messages in thread
From: Tejun Heo @ 2009-02-23  0:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw

Hello, Ingo.

Ingo Molnar wrote:
> Heck no. It is absolutely crazy to complicate __pa()/__va() in 
> _any_ way just to 'save' one more 2MB dTLB.

Are __pa()/__va() that hot paths?  Or am I over-estimating the cost of
2MB dTLB?

> We'll use that TLB because that is what TLBs are for: to handle 
> mapped pages. Yes, in the percpu scheme we are working on we'll 
> have a 'dual' mapping for the static percpu area (on 64-bit) but 
> mapping aliases have been one of the most basic CPU features for 
> the past 15 years ...
> 
> Even a single NOP in the __pa()/__va() path is _more_ expensive 
> than that TLB, believe me.

Alright, I'll believe you.  That actually works very nice for me.  :-)

> Look at last year's cheap quad CPU:
> 
>  Data TLB: 4MB pages, 4-way associative, 32 entries
> 
> That's 32x2MB = 64MB of data reach. Our access patterns in the 
> kernel tend to be pretty focused as well, so 32 is more than 
> enough in practice.
> 
> Especially if the pte is cached a TLB fill is very cheap on 
> Intel CPUs. So even if we were trashing those 32 entries (which 
> we are generally not), having a dTLB for the percpu area is a 
> TLB entry well spent.
> 
> So lets just do the most simple and most straightforward mapping 
> approach which i suggested: it takes advantage of everything, is 
> very close to the best possible performance in the cached case - 
> and dont worry about hardware resources.

Alright, for NUMA, I'll just remap a large page.  For UMA, I already
wrote code to embed it in the existing large page nicely, so I'll keep
it that way.  The added code is only about 40 lines, all localized in
setup_percpu.c and all __init.  The NUMA remap also shouldn't take too
much code if the __pa/__va() trick isn't necessary.  I'll post the
patches soon.

> The moment you start worrying about hardware resources on that 
> level and start 'optimizing' it in software, you've already lost 
> it. It leads down to the path of soft-TLB handlers and other 
> sillyness. There's no way you can win such a race against 
> hardware fundamentals - at least at today's speed of advance in 
> the hw space.

Well, I was hoping not to introduce any performance regression while
converting to the new allocator.  The performance penalty due to TLB
pressure is especially difficult to measure, so avoiding any addition
there makes accepting the new allocator much easier, but I gotta admit
that I'm not an expert at x86 micro performance tuning.  If you think
the overhead is acceptable, I'm a happy camper.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCHSET x86/core/percpu] implement dynamic percpu allocator
  2009-02-22 19:27           ` [PATCHSET x86/core/percpu] implement dynamic percpu allocator Ingo Molnar
@ 2009-02-23  0:47             ` Tejun Heo
  0 siblings, 0 replies; 78+ messages in thread
From: Tejun Heo @ 2009-02-23  0:47 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw

Hello, Ingo.

Ingo Molnar wrote:
> Uhm, no. We'll have one extra 2MB TLB and that's it. Both the 
> low linear 2MB TLB and the high remapped alias 2MB TLB will 
> cover an average of 256 4K pages. A very good deal still.

Yeah, double the TLB usage for the specific large page.  Maybe I was
reading too many corporate emails.  :-)

> We dont want to split up the static percpu area into zillions of 
> small 4K TLBs - we'd rather use +1 large-TLB.
> 
> If we used 4K ptes we'd waste up to 512 TLB entries. (largely 
> simplified, as the number of large TLB entries is smaller than that 
> of 4K TLBs, but the argument still holds in terms of TLB reach.)
> 
> So there is no "TLB problem" whatsoever that i can see ...

Well, other people raised the issue and for machines with very small
separate TLBs for large pages (earlier x86s), it might be a measurable
penalty.  Anyways, remapping only for NUMA should suffice, it seems.  I'll
post patches soon.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCHSET x86/core/percpu] implement dynamic percpu allocator
  2009-02-23  0:43               ` Tejun Heo
@ 2009-02-23 10:17                 ` Ingo Molnar
  2009-02-23 13:38                   ` [patch] x86: optimize __pa() to be linear again on 64-bit x86 Ingo Molnar
  0 siblings, 1 reply; 78+ messages in thread
From: Ingo Molnar @ 2009-02-23 10:17 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw


* Tejun Heo <tj@kernel.org> wrote:

> Hello, Ingo.
> 
> Ingo Molnar wrote:
> > Heck no. It is absolutely crazy to complicate __pa()/__va() in 
> > _any_ way just to 'save' one more 2MB dTLB.
> 
> Are __pa()/__va() that hot paths?  Or am I over-estimating the 
> cost of 2MB dTLB?

yes, __pa()/__va() is a very hot path - in a defconfig they are 
used in about a thousand different places.

In fact it would be nice to get rid of the __phys_addr() 
redirection on the 64-bit side (which is non-linear and a 
function there, and all __pa()s go through it) and make it a 
constant offset again.

This isnt trivial/possible to do though as .data/.bss is in the 
high alias. (high .text aliases alone wouldnt be a big issue to 
fix, but the data aliases are an issue.)

Moving .data/.bss into the linear space isnt feasible as we'd 
lose RIP-relative addressing shortcuts.

Maybe we could figure out the places that do __pa() on a high 
alias and gradually eliminate them. __pa() on .data/.bss is a 
rare and unusual thing to do, and CONFIG_DEBUG_VIRTUAL could warn 
about them without crashing the kernel.

Later on we could make this check unconditional, and then switch 
over __pa() to addr-PAGE_OFFSET in the !CONFIG_DEBUG_VIRTUAL 
case (which is the default).

	Ingo

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [patch] x86: optimize __pa() to be linear again on 64-bit x86
  2009-02-23 10:17                 ` Ingo Molnar
@ 2009-02-23 13:38                   ` Ingo Molnar
  2009-02-23 14:08                     ` Nick Piggin
  0 siblings, 1 reply; 78+ messages in thread
From: Ingo Molnar @ 2009-02-23 13:38 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw


* Ingo Molnar <mingo@elte.hu> wrote:

> > Are __pa()/__va() that hot paths?  Or am I over-estimating 
> > the cost of 2MB dTLB?
> 
> yes, __pa()/__va() is a very hot path - in a defconfig they 
> are used in about a thousand different places.
> 
> In fact it would be nice to get rid of the __phys_addr() 
> redirection on the 64-bit side (which is non-linear and a 
> function there, and all __pa()s go through it) and make it a 
> constant offset again.
> 
> This isnt trivial/possible to do though as .data/.bss is in 
> the high alias. (high .text aliases alone wouldnt be a big 
> issue to fix, but the data aliases are an issue.)
> 
> Moving .data/.bss into the linear space isnt feasible as we'd 
> lose RIP-relative addressing shortcuts.
> 
> Maybe we could figure out the places that do __pa() on a high 
> alias and gradually eliminate them. __pa() on .data/.bss is a 
> rare and unusual thing to do, and CONFIG_DEBUG_VIRTUAL could 
> warn about them without crashing the kernel.
> 
> Later on we could make this check unconditional, and then 
> switch over __pa() to addr-PAGE_OFFSET in the 
> !CONFIG_DEBUG_VIRTUAL case (which is the default).

Ok, i couldnt resist and using ftrace_printk() (regular printk 
in __pa() would hang during bootup) and came up with the patch 
below - which allows the second patch below that does:

 -#define __pa(x)		__phys_addr((unsigned long)(x))
 +#define __pa(x)		((unsigned long)(x)-PAGE_OFFSET)

It cuts a nice (and hotly executed) ~650 bytes chunk out of the 
x86 64-bit defconfig kernel text:

    text	   data	    bss	    dec	    hex	filename
 7999071	1137780	 843672	9980523	 984a6b	vmlinux.before
 7998414	1137780	 843672	9979866	 9847da	vmlinux.after

And it even boots.

(the load_cr3() hack needs to be changed, by setting the init 
pgdir from init_level4_pgt to __va(__pa_symbol(init_level4_pgt)).)

(32-bit is untested and likely wont even build.)

It's not even that bad and looks quite maintainable as a 
concept.

This also means that __va() and __pa() will be one and the same 
thing simple arithmetics again on both 32-bit and 64-bit 
kernels.

	Ingo

---
 arch/x86/include/asm/page.h          |    4 +++-
 arch/x86/include/asm/page_64_types.h |    1 +
 arch/x86/include/asm/pgalloc.h       |    4 ++--
 arch/x86/include/asm/pgtable.h       |    2 +-
 arch/x86/include/asm/processor.h     |    7 ++++++-
 arch/x86/kernel/setup.c              |   12 ++++++------
 arch/x86/mm/init_64.c                |    6 +++---
 arch/x86/mm/ioremap.c                |   12 +++++++++++-
 arch/x86/mm/pageattr.c               |   28 ++++++++++++++--------------
 arch/x86/mm/pgtable.c                |    2 +-
 10 files changed, 48 insertions(+), 30 deletions(-)

Index: linux/arch/x86/include/asm/page.h
===================================================================
--- linux.orig/arch/x86/include/asm/page.h
+++ linux/arch/x86/include/asm/page.h
@@ -34,10 +34,11 @@ static inline void copy_user_page(void *
 #define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
 
 #define __pa(x)		__phys_addr((unsigned long)(x))
+#define __pa_slow(x)		__phys_addr_slow((unsigned long)(x))
 #define __pa_nodebug(x)	__phys_addr_nodebug((unsigned long)(x))
 /* __pa_symbol should be used for C visible symbols.
    This seems to be the official gcc blessed way to do such arithmetic. */
-#define __pa_symbol(x)	__pa(__phys_reloc_hide((unsigned long)(x)))
+#define __pa_symbol(x)	__pa_slow(__phys_reloc_hide((unsigned long)(x)))
 
 #define __va(x)			((void *)((unsigned long)(x)+PAGE_OFFSET))
 
@@ -49,6 +50,7 @@ static inline void copy_user_page(void *
  * virt_addr_valid(kaddr) returns true.
  */
 #define virt_to_page(kaddr)	pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)
+#define virt_to_page_slow(kaddr) pfn_to_page(__pa_slow(kaddr) >> PAGE_SHIFT)
 #define pfn_to_kaddr(pfn)      __va((pfn) << PAGE_SHIFT)
 extern bool __virt_addr_valid(unsigned long kaddr);
 #define virt_addr_valid(kaddr)	__virt_addr_valid((unsigned long) (kaddr))
Index: linux/arch/x86/include/asm/page_64_types.h
===================================================================
--- linux.orig/arch/x86/include/asm/page_64_types.h
+++ linux/arch/x86/include/asm/page_64_types.h
@@ -67,6 +67,7 @@ extern unsigned long max_pfn;
 extern unsigned long phys_base;
 
 extern unsigned long __phys_addr(unsigned long);
+extern unsigned long __phys_addr_slow(unsigned long);
 #define __phys_reloc_hide(x)	(x)
 
 #define vmemmap ((struct page *)VMEMMAP_START)
Index: linux/arch/x86/include/asm/pgalloc.h
===================================================================
--- linux.orig/arch/x86/include/asm/pgalloc.h
+++ linux/arch/x86/include/asm/pgalloc.h
@@ -51,8 +51,8 @@ extern void __pte_free_tlb(struct mmu_ga
 static inline void pmd_populate_kernel(struct mm_struct *mm,
 				       pmd_t *pmd, pte_t *pte)
 {
-	paravirt_alloc_pte(mm, __pa(pte) >> PAGE_SHIFT);
-	set_pmd(pmd, __pmd(__pa(pte) | _PAGE_TABLE));
+	paravirt_alloc_pte(mm, __pa_symbol(pte) >> PAGE_SHIFT);
+	set_pmd(pmd, __pmd(__pa_symbol(pte) | _PAGE_TABLE));
 }
 
 static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd,
Index: linux/arch/x86/include/asm/pgtable.h
===================================================================
--- linux.orig/arch/x86/include/asm/pgtable.h
+++ linux/arch/x86/include/asm/pgtable.h
@@ -20,7 +20,7 @@
  * for zero-mapped memory areas etc..
  */
 extern unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)];
-#define ZERO_PAGE(vaddr) (virt_to_page(empty_zero_page))
+#define ZERO_PAGE(vaddr) (virt_to_page_slow(empty_zero_page))
 
 extern spinlock_t pgd_lock;
 extern struct list_head pgd_list;
Index: linux/arch/x86/include/asm/processor.h
===================================================================
--- linux.orig/arch/x86/include/asm/processor.h
+++ linux/arch/x86/include/asm/processor.h
@@ -186,9 +186,14 @@ static inline void native_cpuid(unsigned
 	    : "0" (*eax), "2" (*ecx));
 }
 
+extern pgd_t init_level4_pgt[];
+
 static inline void load_cr3(pgd_t *pgdir)
 {
-	write_cr3(__pa(pgdir));
+	if (pgdir == init_level4_pgt)
+		write_cr3((unsigned long)(pgdir) - __START_KERNEL_map);
+	else
+		write_cr3(__pa(pgdir));
 }
 
 #ifdef CONFIG_X86_32
Index: linux/arch/x86/kernel/setup.c
===================================================================
--- linux.orig/arch/x86/kernel/setup.c
+++ linux/arch/x86/kernel/setup.c
@@ -733,12 +733,12 @@ void __init setup_arch(char **cmdline_p)
 	init_mm.brk = (unsigned long) &_end;
 #endif
 
-	code_resource.start = virt_to_phys(_text);
-	code_resource.end = virt_to_phys(_etext)-1;
-	data_resource.start = virt_to_phys(_etext);
-	data_resource.end = virt_to_phys(_edata)-1;
-	bss_resource.start = virt_to_phys(&__bss_start);
-	bss_resource.end = virt_to_phys(&__bss_stop)-1;
+	code_resource.start = __pa_symbol(_text);
+	code_resource.end = __pa_symbol(_etext)-1;
+	data_resource.start = __pa_symbol(_etext);
+	data_resource.end = __pa_symbol(_edata)-1;
+	bss_resource.start = __pa_symbol(&__bss_start);
+	bss_resource.end = __pa_symbol(&__bss_stop)-1;
 
 #ifdef CONFIG_CMDLINE_BOOL
 #ifdef CONFIG_CMDLINE_OVERRIDE
Index: linux/arch/x86/mm/init_64.c
===================================================================
--- linux.orig/arch/x86/mm/init_64.c
+++ linux/arch/x86/mm/init_64.c
@@ -965,11 +965,11 @@ void free_init_pages(char *what, unsigne
 	printk(KERN_INFO "Freeing %s: %luk freed\n", what, (end - begin) >> 10);
 
 	for (; addr < end; addr += PAGE_SIZE) {
-		ClearPageReserved(virt_to_page(addr));
-		init_page_count(virt_to_page(addr));
+		ClearPageReserved(virt_to_page_slow(addr));
+		init_page_count(virt_to_page_slow(addr));
 		memset((void *)(addr & ~(PAGE_SIZE-1)),
 			POISON_FREE_INITMEM, PAGE_SIZE);
-		free_page(addr);
+		free_page((unsigned long)__va(__pa_symbol(addr)));
 		totalram_pages++;
 	}
 #endif
Index: linux/arch/x86/mm/ioremap.c
===================================================================
--- linux.orig/arch/x86/mm/ioremap.c
+++ linux/arch/x86/mm/ioremap.c
@@ -13,6 +13,7 @@
 #include <linux/slab.h>
 #include <linux/vmalloc.h>
 #include <linux/mmiotrace.h>
+#include <linux/ftrace.h>
 
 #include <asm/cacheflush.h>
 #include <asm/e820.h>
@@ -29,7 +30,7 @@ static inline int phys_addr_valid(unsign
 	return addr < (1UL << boot_cpu_data.x86_phys_bits);
 }
 
-unsigned long __phys_addr(unsigned long x)
+unsigned long __phys_addr_slow(unsigned long x)
 {
 	if (x >= __START_KERNEL_map) {
 		x -= __START_KERNEL_map;
@@ -43,6 +44,15 @@ unsigned long __phys_addr(unsigned long 
 	}
 	return x;
 }
+EXPORT_SYMBOL(__phys_addr_slow);
+
+unsigned long __phys_addr(unsigned long x)
+{
+	if (x >= __START_KERNEL_map)
+		ftrace_printk("__phys_addr() done on symbol: %p\n", (void *)x);
+
+	return __phys_addr_slow(x);
+}
 EXPORT_SYMBOL(__phys_addr);
 
 bool __virt_addr_valid(unsigned long x)
Index: linux/arch/x86/mm/pageattr.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr.c
+++ linux/arch/x86/mm/pageattr.c
@@ -90,12 +90,12 @@ static inline void split_page_count(int 
 
 static inline unsigned long highmap_start_pfn(void)
 {
-	return __pa(_text) >> PAGE_SHIFT;
+	return __pa_symbol(_text) >> PAGE_SHIFT;
 }
 
 static inline unsigned long highmap_end_pfn(void)
 {
-	return __pa(roundup((unsigned long)_end, PMD_SIZE)) >> PAGE_SHIFT;
+	return __pa_symbol(roundup((unsigned long)_end, PMD_SIZE)) >> PAGE_SHIFT;
 }
 
 #endif
@@ -266,8 +266,8 @@ static inline pgprot_t static_protection
 	 * The .rodata section needs to be read-only. Using the pfn
 	 * catches all aliases.
 	 */
-	if (within(pfn, __pa((unsigned long)__start_rodata) >> PAGE_SHIFT,
-		   __pa((unsigned long)__end_rodata) >> PAGE_SHIFT))
+	if (within(pfn, __pa_symbol((unsigned long)__start_rodata) >> PAGE_SHIFT,
+		   __pa_symbol((unsigned long)__end_rodata) >> PAGE_SHIFT))
 		pgprot_val(forbidden) |= _PAGE_RW;
 
 	prot = __pgprot(pgprot_val(prot) & ~pgprot_val(forbidden));
@@ -555,7 +555,7 @@ static int __cpa_process_fault(struct cp
 	if (within(vaddr, PAGE_OFFSET,
 		   PAGE_OFFSET + (max_pfn_mapped << PAGE_SHIFT))) {
 		cpa->numpages = 1;
-		cpa->pfn = __pa(vaddr) >> PAGE_SHIFT;
+		cpa->pfn = __pa_symbol(vaddr) >> PAGE_SHIFT;
 		return 0;
 	} else {
 		WARN(1, KERN_WARNING "CPA: called for zero pte. "
@@ -901,7 +901,7 @@ int set_memory_uc(unsigned long addr, in
 	/*
 	 * for now UC MINUS. see comments in ioremap_nocache()
 	 */
-	if (reserve_memtype(__pa(addr), __pa(addr) + numpages * PAGE_SIZE,
+	if (reserve_memtype(__pa_symbol(addr), __pa_symbol(addr) + numpages * PAGE_SIZE,
 			    _PAGE_CACHE_UC_MINUS, NULL))
 		return -EINVAL;
 
@@ -918,9 +918,9 @@ int set_memory_array_uc(unsigned long *a
 	 * for now UC MINUS. see comments in ioremap_nocache()
 	 */
 	for (i = 0; i < addrinarray; i++) {
-		start = __pa(addr[i]);
+		start = __pa_symbol(addr[i]);
 		for (end = start + PAGE_SIZE; i < addrinarray - 1; end += PAGE_SIZE) {
-			if (end != __pa(addr[i + 1]))
+			if (end != __pa_symbol(addr[i + 1]))
 				break;
 			i++;
 		}
@@ -932,12 +932,12 @@ int set_memory_array_uc(unsigned long *a
 				    __pgprot(_PAGE_CACHE_UC_MINUS), 1);
 out:
 	for (i = 0; i < addrinarray; i++) {
-		unsigned long tmp = __pa(addr[i]);
+		unsigned long tmp = __pa_symbol(addr[i]);
 
 		if (tmp == start)
 			break;
 		for (end = tmp + PAGE_SIZE; i < addrinarray - 1; end += PAGE_SIZE) {
-			if (end != __pa(addr[i + 1]))
+			if (end != __pa_symbol(addr[i + 1]))
 				break;
 			i++;
 		}
@@ -958,7 +958,7 @@ int set_memory_wc(unsigned long addr, in
 	if (!pat_enabled)
 		return set_memory_uc(addr, numpages);
 
-	if (reserve_memtype(__pa(addr), __pa(addr) + numpages * PAGE_SIZE,
+	if (reserve_memtype(__pa_symbol(addr), __pa_symbol(addr) + numpages * PAGE_SIZE,
 		_PAGE_CACHE_WC, NULL))
 		return -EINVAL;
 
@@ -974,7 +974,7 @@ int _set_memory_wb(unsigned long addr, i
 
 int set_memory_wb(unsigned long addr, int numpages)
 {
-	free_memtype(__pa(addr), __pa(addr) + numpages * PAGE_SIZE);
+	free_memtype(__pa_symbol(addr), __pa_symbol(addr) + numpages * PAGE_SIZE);
 
 	return _set_memory_wb(addr, numpages);
 }
@@ -985,11 +985,11 @@ int set_memory_array_wb(unsigned long *a
 	int i;
 
 	for (i = 0; i < addrinarray; i++) {
-		unsigned long start = __pa(addr[i]);
+		unsigned long start = __pa_symbol(addr[i]);
 		unsigned long end;
 
 		for (end = start + PAGE_SIZE; i < addrinarray - 1; end += PAGE_SIZE) {
-			if (end != __pa(addr[i + 1]))
+			if (end != __pa_symbol(addr[i + 1]))
 				break;
 			i++;
 		}
Index: linux/arch/x86/mm/pgtable.c
===================================================================
--- linux.orig/arch/x86/mm/pgtable.c
+++ linux/arch/x86/mm/pgtable.c
@@ -77,7 +77,7 @@ static void pgd_ctor(pgd_t *pgd)
 				swapper_pg_dir + KERNEL_PGD_BOUNDARY,
 				KERNEL_PGD_PTRS);
 		paravirt_alloc_pmd_clone(__pa(pgd) >> PAGE_SHIFT,
-					 __pa(swapper_pg_dir) >> PAGE_SHIFT,
+					 __pa_symbol(swapper_pg_dir) >> PAGE_SHIFT,
 					 KERNEL_PGD_BOUNDARY,
 					 KERNEL_PGD_PTRS);
 	}

---
 arch/x86/include/asm/page.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/arch/x86/include/asm/page.h
===================================================================
--- linux.orig/arch/x86/include/asm/page.h
+++ linux/arch/x86/include/asm/page.h
@@ -33,7 +33,7 @@ static inline void copy_user_page(void *
 	alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma, vaddr)
 #define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
 
-#define __pa(x)		__phys_addr((unsigned long)(x))
+#define __pa(x)		((unsigned long)(x)-PAGE_OFFSET)
 #define __pa_slow(x)		__phys_addr_slow((unsigned long)(x))
 #define __pa_nodebug(x)	__phys_addr_nodebug((unsigned long)(x))
 /* __pa_symbol should be used for C visible symbols.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [patch] x86: optimize __pa() to be linear again on 64-bit x86
  2009-02-23 13:38                   ` [patch] x86: optimize __pa() to be linear again on 64-bit x86 Ingo Molnar
@ 2009-02-23 14:08                     ` Nick Piggin
  2009-02-23 14:53                       ` Ingo Molnar
  0 siblings, 1 reply; 78+ messages in thread
From: Nick Piggin @ 2009-02-23 14:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Tejun Heo, Linus Torvalds, rusty, tglx, x86, linux-kernel, hpa,
	jeremy, cpw

On Tuesday 24 February 2009 00:38:04 Ingo Molnar wrote:
> * Ingo Molnar <mingo@elte.hu> wrote:
> > > Are __pa()/__va() that hot paths?  Or am I over-estimating
> > > the cost of 2MB dTLB?
> >
> > yes, __pa()/__va() is a very hot path - in a defconfig they
> > are used in about a thousand different places.
> >
> > In fact it would be nice to get rid of the __phys_addr()
> > redirection on the 64-bit side (which is non-linear and a
> > function there, and all __pa()s go through it) and make it a
> > constant offset again.
> >
> > This isnt trivial/possible to do though as .data/.bss is in
> > the high alias. (high .text aliases alone wouldnt be a big
> > issue to fix, but the data aliases are an issue.)
> >
> > Moving .data/.bss into the linear space isnt feasible as we'd
> > lose RIP-relative addressing shortcuts.
> >
> > Maybe we could figure out the places that do __pa() on a high
> > alias and gradually eliminate them. __pa() on .data/.bss is a
> > rare and unusual thing to do, and CONFIG_DEBUG_VIRTUAL could
> > warn about them without crashing the kernel.
> >
> > Later on we could make this check unconditional, and then
> > switch over __pa() to addr-PAGE_OFFSET in the
> > !CONFIG_DEBUG_VIRTUAL case (which is the default).
>
> Ok, i couldnt resist and using ftrace_printk() (regular printk
> in __pa() would hang during bootup) and came up with the patch
> below - which allows the second patch below that does:
>
>  -#define __pa(x)		__phys_addr((unsigned long)(x))
>  +#define __pa(x)		((unsigned long)(x)-PAGE_OFFSET)
>
> It cuts a nice (and hotly executed) ~650 bytes chunk out of the
> x86 64-bit defconfig kernel text:
>
>     text	   data	    bss	    dec	    hex	filename
>  7999071	1137780	 843672	9980523	 984a6b	vmlinux.before
>  7998414	1137780	 843672	9979866	 9847da	vmlinux.after
>
> And it even boots.
>
> (the load_cr3() hack needs to be changed, by setting the init
> pgdir from init_level4_pgt to __va(__pa_symbol(init_level4_pgt)).)
>
> (32-bit is untested and likely wont even build.)
>
> It's not even that bad and looks quite maintainable as a
> concept.
>
> This also means that __va() and __pa() will be one and the same
> thing simple arithmetics again on both 32-bit and 64-bit
> kernels.
>
> 	Ingo
>
> ---
>  arch/x86/include/asm/page.h          |    4 +++-
>  arch/x86/include/asm/page_64_types.h |    1 +
>  arch/x86/include/asm/pgalloc.h       |    4 ++--
>  arch/x86/include/asm/pgtable.h       |    2 +-
>  arch/x86/include/asm/processor.h     |    7 ++++++-
>  arch/x86/kernel/setup.c              |   12 ++++++------
>  arch/x86/mm/init_64.c                |    6 +++---
>  arch/x86/mm/ioremap.c                |   12 +++++++++++-
>  arch/x86/mm/pageattr.c               |   28 ++++++++++++++--------------
>  arch/x86/mm/pgtable.c                |    2 +-
>  10 files changed, 48 insertions(+), 30 deletions(-)
>
> Index: linux/arch/x86/include/asm/page.h
> ===================================================================
> --- linux.orig/arch/x86/include/asm/page.h
> +++ linux/arch/x86/include/asm/page.h
> @@ -34,10 +34,11 @@ static inline void copy_user_page(void *
>  #define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
>
>  #define __pa(x)		__phys_addr((unsigned long)(x))
> +#define __pa_slow(x)		__phys_addr_slow((unsigned long)(x))
>  #define __pa_nodebug(x)	__phys_addr_nodebug((unsigned long)(x))
>  /* __pa_symbol should be used for C visible symbols.
>     This seems to be the official gcc blessed way to do such arithmetic. */
> -#define __pa_symbol(x)	__pa(__phys_reloc_hide((unsigned long)(x)))
> +#define __pa_symbol(x)	__pa_slow(__phys_reloc_hide((unsigned long)(x)))
>
>  #define __va(x)			((void *)((unsigned long)(x)+PAGE_OFFSET))
>
> @@ -49,6 +50,7 @@ static inline void copy_user_page(void *
>   * virt_addr_valid(kaddr) returns true.
>   */
>  #define virt_to_page(kaddr)	pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)
> +#define virt_to_page_slow(kaddr) pfn_to_page(__pa_slow(kaddr) >>

Heh. I have almost the exact opposite patch which adds a virt_to_page_fast
and uses it in critical places (in the slab allocator).

But if you can do this more complete conversion, cool. Yes, __pa is very
performance critical (not just code size). Time to alloc+free an object
in the slab allocator is on the order of 100 cycles, so saving a few
cycles here == saving a few %. (although saying that, you hardly ever see
a workload where the slab allocator is too prominent)



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [patch] x86: optimize __pa() to be linear again on 64-bit x86
  2009-02-23 14:08                     ` Nick Piggin
@ 2009-02-23 14:53                       ` Ingo Molnar
  2009-02-24 16:00                         ` Andi Kleen
  2009-02-27  5:57                         ` Tejun Heo
  0 siblings, 2 replies; 78+ messages in thread
From: Ingo Molnar @ 2009-02-23 14:53 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Tejun Heo, Linus Torvalds, rusty, tglx, x86, linux-kernel, hpa,
	jeremy, cpw


* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> On Tuesday 24 February 2009 00:38:04 Ingo Molnar wrote:
> > * Ingo Molnar <mingo@elte.hu> wrote:
> > > > Are __pa()/__va() that hot paths?  Or am I over-estimating
> > > > the cost of 2MB dTLB?
> > >
> > > yes, __pa()/__va() is a very hot path - in a defconfig they
> > > are used in about a thousand different places.
> > >
> > > In fact it would be nice to get rid of the __phys_addr()
> > > redirection on the 64-bit side (which is non-linear and a
> > > function there, and all __pa()s go through it) and make it a
> > > constant offset again.
> > >
> > > This isnt trivial/possible to do though as .data/.bss is in
> > > the high alias. (high .text aliases alone wouldnt be a big
> > > issue to fix, but the data aliases are an issue.)
> > >
> > > Moving .data/.bss into the linear space isnt feasible as we'd
> > > lose RIP-relative addressing shortcuts.
> > >
> > > Maybe we could figure out the places that do __pa() on a high
> > > alias and gradually eliminate them. __pa() on .data/.bss is a
> > > rare and unusal thing to do, and CONFIG_DEBUG_VIRTUAL could
> > > warn about them without crashing the kernel.
> > >
> > > Later on we could make this check unconditional, and then
> > > switch over __pa() to addr-PAGE_OFFSET in the
> > > !CONFIG_DEBUG_VIRTUAL case (which is the default).
> >
> > Ok, i couldnt resist and using ftrace_printk() (regular printk
> > in __pa() would hang during bootup) and came up with the patch
> > below - which allows the second patch below that does:
> >
> >  -#define __pa(x)		__phys_addr((unsigned long)(x))
> >  +#define __pa(x)		((unsigned long)(x)-PAGE_OFFSET)
> >
> > It cuts a nice (and hotly executed) ~650 bytes chunk out of the
> > x86 64-bit defconfig kernel text:
> >
> >     text	   data	    bss	    dec	    hex	filename
> >  7999071	1137780	 843672	9980523	 984a6b	vmlinux.before
> >  7998414	1137780	 843672	9979866	 9847da	vmlinux.after
> >
> > And it even boots.
> >
> > (the load_cr3() hack needs to be changed, by setting the init
> > pgdir from init_level4_pgt to __va(__pa_symbol(init_level4_pgt)).)
> >
> > (32-bit is untested and likely wont even build.)
> >
> > It's not even that bad and looks quite maintainable as a
> > concept.
> >
> > This also means that __va() and __pa() will be one and the same
> > thing simple arithmetics again on both 32-bit and 64-bit
> > kernels.
> >
> > 	Ingo
> >
> > ---
> >  arch/x86/include/asm/page.h          |    4 +++-
> >  arch/x86/include/asm/page_64_types.h |    1 +
> >  arch/x86/include/asm/pgalloc.h       |    4 ++--
> >  arch/x86/include/asm/pgtable.h       |    2 +-
> >  arch/x86/include/asm/processor.h     |    7 ++++++-
> >  arch/x86/kernel/setup.c              |   12 ++++++------
> >  arch/x86/mm/init_64.c                |    6 +++---
> >  arch/x86/mm/ioremap.c                |   12 +++++++++++-
> >  arch/x86/mm/pageattr.c               |   28 ++++++++++++++--------------
> >  arch/x86/mm/pgtable.c                |    2 +-
> >  10 files changed, 48 insertions(+), 30 deletions(-)
> >
> > Index: linux/arch/x86/include/asm/page.h
> > ===================================================================
> > --- linux.orig/arch/x86/include/asm/page.h
> > +++ linux/arch/x86/include/asm/page.h
> > @@ -34,10 +34,11 @@ static inline void copy_user_page(void *
> >  #define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
> >
> >  #define __pa(x)		__phys_addr((unsigned long)(x))
> > +#define __pa_slow(x)		__phys_addr_slow((unsigned long)(x))
> >  #define __pa_nodebug(x)	__phys_addr_nodebug((unsigned long)(x))
> >  /* __pa_symbol should be used for C visible symbols.
> >     This seems to be the official gcc blessed way to do such arithmetic. */
> > -#define __pa_symbol(x)	__pa(__phys_reloc_hide((unsigned long)(x)))
> > +#define __pa_symbol(x)	__pa_slow(__phys_reloc_hide((unsigned long)(x)))
> >
> >  #define __va(x)			((void *)((unsigned long)(x)+PAGE_OFFSET))
> >
> > @@ -49,6 +50,7 @@ static inline void copy_user_page(void *
> >   * virt_addr_valid(kaddr) returns true.
> >   */
> >  #define virt_to_page(kaddr)	pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)
> > +#define virt_to_page_slow(kaddr) pfn_to_page(__pa_slow(kaddr) >>
> 
> Heh. I have almost the exact opposite patch which adds a 
> virt_to_page_fast and uses it in critical places (in the slab 
> allocator).
> 
> But if you can do this more complete conversion, cool. Yes, 
> __pa is very performance critical (not just code size). Time 
> to alloc+free an object in the slab allocator is on the order 
> of 100 cycles, so saving a few cycles here == saving a few %. 
> (although saying that, you hardly ever see a workload where 
> the slab allocator is too prominent)

Yeah, we can do this complete conversion.

I'll clean it up some more. I think the best representation of 
this will be via a virt_to_sym() and sym_to_virt() space. That 
makes it really clear when we are moving from the symbol space 
to the linear space and back.

That way we wont need the _slow() methods at all - we'll always 
know whether an address is pure linear or in the symbol space.
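
Roughly something like this (completely untested sketch - these
helpers don't exist yet, and i'm ignoring phys_base/relocation
details):

/* sketch only: convert between the high kernel-image alias
 * (.text/.data/.bss) and the linear mapping */
static inline void *sym_to_virt(const void *sym)
{
	return __va(__pa_symbol(sym));
}

static inline unsigned long virt_to_sym(const void *vaddr)
{
	/* assumes a non-relocated kernel for simplicity */
	return (unsigned long)vaddr - PAGE_OFFSET + __START_KERNEL_map;
}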

In other words, it will be even faster and even nicer ;-)

	Ingo

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 09/10] percpu: implement new dynamic percpu allocator
  2009-02-20  3:04       ` Andrew Morton
  2009-02-20  5:29         ` Tejun Heo
@ 2009-02-24  2:52         ` Rusty Russell
  1 sibling, 0 replies; 78+ messages in thread
From: Rusty Russell @ 2009-02-24  2:52 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Tejun Heo, tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo

On Friday 20 February 2009 13:34:17 Andrew Morton wrote:
> It's a dumb convention.

I disagree, but it doesn't matter.  Least surprise wins; let's not make
kernel coding any harder than it has to be.

free() does it, so kfree() should do it.  Otherwise call it something
completely different.  Too late, let's move on...

> In the vast majority of cases the pointer is
> not NULL.  We add a test-n-branch to 99.999999999% of cases just to
> save three seconds of programmer effort a single time.

It's unusual, but since I've used it several times in the kernel myself,
it's less than 4 9s (by call sites not by usage, since it tends to be
error paths).
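
To illustrate what I mean by error paths: the usual unwind pattern
looks something like this (made-up example):

/* made-up example: the common unwind pattern that relies on
 * kfree(NULL) being a no-op */
struct foo {
	void *buf;
	char *name;
};

static struct foo *foo_create(void)
{
	struct foo *foo = kzalloc(sizeof(*foo), GFP_KERNEL);

	if (!foo)
		return NULL;
	foo->buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
	foo->name = kstrdup("foo", GFP_KERNEL);
	if (!foo->buf || !foo->name)
		goto err;
	return foo;
err:
	kfree(foo->name);	/* either of these may be NULL... */
	kfree(foo->buf);	/* ...and that's exactly the point */
	kfree(foo);
	return NULL;
}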

> (We can still do that by adding a new
> kfree_im_not_stupid() which doesn't do the check).

Now you're insulting people who use it as well as exaggerating your case.

Do you need a hug?
Rusty.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 09/10] percpu: implement new dynamic percpu allocator
  2009-02-20  3:01     ` Tejun Heo
  2009-02-20  3:02       ` Tejun Heo
@ 2009-02-24  2:56       ` Rusty Russell
  2009-02-24  5:27         ` [PATCH tj-percpu] percpu: add __read_mostly to variables which are mostly read only Tejun Heo
  2009-02-24  5:47         ` [PATCH 09/10] percpu: implement new dynamic percpu allocator Tejun Heo
  1 sibling, 2 replies; 78+ messages in thread
From: Rusty Russell @ 2009-02-24  2:56 UTC (permalink / raw)
  To: Tejun Heo; +Cc: tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo, tony.luck

On Friday 20 February 2009 13:31:21 Tejun Heo wrote:
> >    One question.  Are you thinking that to be defined by every SMP arch
> > long-term?
> 
> Yeap, definitely.

Excellent.  That opens some really nice stuff.

> > Because there are benefits in having &<percpuvar> == valid
> > percpuptr, such as passing them around as parameters.  If so, IA64
> > will want a dedicated per-cpu area for statics (tho it can probably
> > just map it somehow, but it has to be 64k).
> 
> Hmmm...  Don't have much idea about ia64 and its magic 64k.  Can it
> somehow be used for the first chunk?

Yes, but I think that chunk must not be handed out for dynamic allocations
but kept in reserve for modules.

IA64 uses a pinned TLB entry to map this cpu's 64k at __phys_per_cpu_start.
See __ia64_per_cpu_var() in arch/ia64/include/asm/percpu.h.  This means they
can also optimize cpu_local_* and read_cpuvar (or whatever it's called now).
IIUC IA64 needs this region internally, using it for percpu vars is a bonus.

> > These pseudo-constants seem like a really weird thing to do to me.
> 
> I explained this in the reply to Andrew's comment.  It's a
> non-really-constant-but-should-be-considered-so-by-users thing.  Is it
> too weird?  Even if I add a comment explaining it?

It's weird; I'd make them __read_mostly and be done with it.

> > rbtree might be overkill on first cut.  I'm bearing in mind that Christoph L
> > had a nice patch to use dynamic percpu allocation in the sl*b allocators;
> > which would mean this needs to only use get_free_page.
> 
> Hmmm... the reverse mapping can be piggy backed on vmalloc by adding a
> private pointer to the vm_struct but rbtree isn't too difficult to use
> so I just did it directly.  Nick, what do you think about adding
> private field to vm_struct and providing a reverse map function?

Naah, just walk the arrays to do the mapping.  Cuts a heap of code, and
we can optimize when someone complains :)

Walking arrays is cache friendly, too.
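
Something along these lines would do (untested sketch; the chunk/slot
names are from Tejun's patch, so treat the exact fields as approximate):

/* reverse-map a percpu address to its chunk by scanning the chunk
 * lists instead of keeping an rbtree */
static struct pcpu_chunk *pcpu_addr_to_chunk_slow(void *addr)
{
	struct pcpu_chunk *chunk;
	int i;

	for (i = 0; i < pcpu_nr_slots; i++)
		list_for_each_entry(chunk, &pcpu_slot[i], list)
			if (addr >= chunk->vm->addr &&
			    addr <  chunk->vm->addr + pcpu_chunk_size)
				return chunk;
	return NULL;
}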

> As for the sl*b allocation thing, can you please explain in more
> detail or point me to the patches / threads?

lkml from 2008-05-30:

Message-Id: <20080530040021.800522644@sgi.com>:
Subject: [patch 32/41] cpu alloc: Use in slub
And:
Subject: [patch 33/41] cpu alloc: Remove slub fields
Subject: [patch 34/41] cpu alloc: Page allocator conversion

> Thanks.  :-)

Don't thank me: you're doing all the work!
Rusty.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH tj-percpu] percpu: add __read_mostly to variables which are mostly read only
  2009-02-24  2:56       ` Rusty Russell
@ 2009-02-24  5:27         ` Tejun Heo
  2009-02-24  5:47         ` [PATCH 09/10] percpu: implement new dynamic percpu allocator Tejun Heo
  1 sibling, 0 replies; 78+ messages in thread
From: Tejun Heo @ 2009-02-24  5:27 UTC (permalink / raw)
  To: Rusty Russell; +Cc: tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo, tony.luck

Most global variables in the percpu allocator are initialized during boot
and read only from that point on.  Add __read_mostly as per Rusty's
suggestion.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
---
Added to #tj-percpu.

 mm/percpu.c |   16 ++++++++--------
 1 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 9ac0198..5954e7a 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -83,18 +83,18 @@ struct pcpu_chunk {
 	struct page		*page[];	/* #cpus * UNIT_PAGES */
 };
 
-static int pcpu_unit_pages;
-static int pcpu_unit_size;
-static int pcpu_chunk_size;
-static int pcpu_nr_slots;
-static size_t pcpu_chunk_struct_size;
+static int pcpu_unit_pages __read_mostly;
+static int pcpu_unit_size __read_mostly;
+static int pcpu_chunk_size __read_mostly;
+static int pcpu_nr_slots __read_mostly;
+static size_t pcpu_chunk_struct_size __read_mostly;
 
 /* the address of the first chunk which starts with the kernel static area */
-void *pcpu_base_addr;
+void *pcpu_base_addr __read_mostly;
 EXPORT_SYMBOL_GPL(pcpu_base_addr);
 
 /* the size of kernel static area */
-static int pcpu_static_size;
+static int pcpu_static_size __read_mostly;
 
 /*
  * One mutex to rule them all.
@@ -112,7 +112,7 @@ static int pcpu_static_size;
  */
 static DEFINE_MUTEX(pcpu_mutex);
 
-static struct list_head *pcpu_slot;		/* chunk list slots */
+static struct list_head *pcpu_slot __read_mostly; /* chunk list slots */
 static struct rb_root pcpu_addr_root = RB_ROOT;	/* chunks by address */
 
 static int __pcpu_size_to_slot(int size)
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH 09/10] percpu: implement new dynamic percpu allocator
  2009-02-24  2:56       ` Rusty Russell
  2009-02-24  5:27         ` [PATCH tj-percpu] percpu: add __read_mostly to variables which are mostly read only Tejun Heo
@ 2009-02-24  5:47         ` Tejun Heo
  2009-02-24 17:41           ` Luck, Tony
  1 sibling, 1 reply; 78+ messages in thread
From: Tejun Heo @ 2009-02-24  5:47 UTC (permalink / raw)
  To: Rusty Russell; +Cc: tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo, tony.luck

Hello, Rusty.

Rusty Russell wrote:
> On Friday 20 February 2009 13:31:21 Tejun Heo wrote:
>>>    One question.  Are you thinking that to be defined by every SMP arch
>>> long-term?
>> Yeap, definitely.
> 
> Excellent.  That opens some really nice stuff.

Yeap, I think it'll be pretty interesting.

>>> Because there are benefits in having &<percpuvar> == valid
>>> percpuptr, such as passing them around as parameters.  If so, IA64
>>> will want a dedicated per-cpu area for statics (tho it can probably
>>> just map it somehow, but it has to be 64k).
>> Hmmm...  Don't have much idea about ia64 and its magic 64k.  Can it
>> somehow be used for the first chunk?
> 
> Yes, but I think that chunk must not be handed out for dynamic allocations
> but kept in reserve for modules.
> 
> IA64 uses a pinned TLB entry to map this cpu's 64k at __phys_per_cpu_start.
> See __ia64_per_cpu_var() in arch/ia64/include/asm/percpu.h.  This means they
> can also optimize cpu_local_* and read_cpuvar (or whatever it's called now).
> IIUC IA64 needs this region internally, using it for percpu vars is a bonus.

I'll take a look.

>>> These pseudo-constants seem like a really weird thing to do to me.
>> I explained this in the reply to Andrew's comment.  It's a
>> non-really-constant-but-should-be-considered-so-by-users thing.  Is it
>> too weird?  Even if I add a comment explaining it?
> 
> It's weird; I'd make them __read_mostly and be done with it.

Already dropped.  It seems I was the only one liking it.

>> Hmmm... the reverse mapping can be piggy backed on vmalloc by adding a
>> private pointer to the vm_struct but rbtree isn't too difficult to use
>> so I just did it directly.  Nick, what do you think about adding
>> private field to vm_struct and providing a reverse map function?
> 
> Naah, just walk the arrays to do the mapping.  Cuts a heap of code, and
> we can optimize when someone complains :)
> 
> Walking arrays is cache friendly, too.

It won't make much difference cache-line-wise here as it needs to
dereference anyway.  It will cut less than a hundred lines of code,
comments included.  Given the not-so-large reduction in complexity,
I'm a little bit reluctant to cut the code, but please feel free to
submit a patch to kill it if you think it's really necessary.

>> As for the sl*b allocation thing, can you please explain in more
>> detail or point me to the patches / threads?
> 
> lkml from 2008-05-30:
> 
> Message-Id: <20080530040021.800522644@sgi.com>:
> Subject: [patch 32/41] cpu alloc: Use in slub
> And:
> Subject: [patch 33/41] cpu alloc: Remove slub fields
> Subject: [patch 34/41] cpu alloc: Page allocator conversion

I'll read them.  Thanks.

>> Thanks.  :-)
> 
> Don't thank me: you're doing all the work!
> Rusty.

Heh... I'm just being a coward.  I keep the thanks around so that I can
remove it when I wanna curse.  :-P

-- 
tejun

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [patch] x86: optimize __pa() to be linear again on 64-bit x86
  2009-02-23 14:53                       ` Ingo Molnar
@ 2009-02-24 16:00                         ` Andi Kleen
  2009-02-27  5:57                         ` Tejun Heo
  1 sibling, 0 replies; 78+ messages in thread
From: Andi Kleen @ 2009-02-24 16:00 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Nick Piggin, Tejun Heo, Linus Torvalds, rusty, tglx, x86,
	linux-kernel, hpa, jeremy, cpw

Ingo Molnar <mingo@elte.hu> writes:
>
> Yeah, we can do this complete conversion.
>
> I'll clean it up some more. I think the best representation of 
> this will be via a virt_to_sym() and sym_to_virt() space. That 
> makes it really clear when we are moving from the symbol space 
> to the linear space and back.
>
> That way we wont need the _slow() methods at all - we'll always 
> know whether an address is pure linear or in the symbol space.
>
> In other words, it will be even faster and even nicer ;-)

That is what the original code did (virt_to_sym was just
done through __pa_symbol), but it was sometimes tricky
to get right, and Linus wanted a unified __pa/__va
and put it out of line.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* RE: [PATCH 09/10] percpu: implement new dynamic percpu allocator
  2009-02-24  5:47         ` [PATCH 09/10] percpu: implement new dynamic percpu allocator Tejun Heo
@ 2009-02-24 17:41           ` Luck, Tony
  2009-02-26  3:17             ` Tejun Heo
  0 siblings, 1 reply; 78+ messages in thread
From: Luck, Tony @ 2009-02-24 17:41 UTC (permalink / raw)
  To: Tejun Heo, Rusty Russell; +Cc: tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo

> > IA64 uses a pinned TLB entry to map this cpu's 64k at __phys_per_cpu_start.
> > See __ia64_per_cpu_var() in arch/ia64/include/asm/percpu.h.  This means they
> > can also optimize cpu_local_* and read_cpuvar (or whatever it's called now).
> > IIUC IA64 needs this region internally, using it for percpu vars is a bonus.

Something like that ...

ia64 started out with a pinned TLB entry to map the percpu space to the
top 64K of address space (so that the compiler can generate ld/st instructions
with a small negative offset from register r0 to access local-to-this-cpu
objects).

Then we started using one of the ar.k* registers to hold the base
physical address of each cpu's per-cpu area so that early parts of the
machine check code (which runs with the MMU off) can access per-cpu variables.

Finally we found that certain transaction processing benchmarks ran faster
if we let the cpu have free access to one extra TLB entry ... so we
stopped pinning the per-cpu area, and wrote a s/w fault handler to
insert the mapping on demand (using the ar.k3 register to get the
physical address for the mapping).

N.B. ar.k3 is a medium-slow register ... I wouldn't want to use it
in the code sequence for *every* per-cpu variable access.

-Tony

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 09/10] percpu: implement new dynamic percpu allocator
  2009-02-24 17:41           ` Luck, Tony
@ 2009-02-26  3:17             ` Tejun Heo
  2009-02-27 19:41               ` Luck, Tony
  0 siblings, 1 reply; 78+ messages in thread
From: Tejun Heo @ 2009-02-26  3:17 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Rusty Russell, tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo

Hello,

Luck, Tony wrote:
> ia64 started out with a pinned TLB entry to map the percpu space to the
> top 64K of address space (so that the compiler can generate ld/st instructions
> with a small negative offset from register r0 to access local-to-this-cpu
> objects).
> 
> Then we started using a one of the ar.k* registers to hold the base
> physical address for each cpus per-cpu area so that early parts of
> machine check code (which runs with MMU off) can access per-cpu variables.
> 
> Finally we found that certain transaction processing benchmarks ran faster
> if we let the cpu have free access to one extra TLB entry ... so we
> stopped pinning the per-cpu area, and wrote a s/w fault handler to
> insert the mapping on demand (using the ar.k3 register to get the
> physical address for the mapping).
> 
> N.B. ar.k3 is a medium-slow register ... I wouldn't want to use it
> in the code sequence for *every* per-cpu variable access.

Ah... I see, so the 64k limit for small offset.  I think what we can
do is use the first chunk for static percpu variables.  We'll still
be able to use the same accessor by doing something like...

#define unified_percpu_accessor(ptr) ({ \
	if (__builtin_constant_p(ptr)) \
		return r0 - unit_size + ptr; \
	else \
		do ar.k3 + ptr; \
	})

So, dynamic ones will be slower than normal ones but faster than what
we currently have (it will be faster than indirect pointer
dereferencing, right?) while keeping static accesses fast.  Does it
sound okay to you?  Also, does anyone know whether there's a working
ia64 emulator?  There doesn't seem to be any and it seems almost
impossible to get hold of an actual ia64 machine over here.  :-(

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [patch] x86: optimize __pa() to be linear again on 64-bit x86
  2009-02-23 14:53                       ` Ingo Molnar
  2009-02-24 16:00                         ` Andi Kleen
@ 2009-02-27  5:57                         ` Tejun Heo
  2009-02-27  6:57                           ` Ingo Molnar
  1 sibling, 1 reply; 78+ messages in thread
From: Tejun Heo @ 2009-02-27  5:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Nick Piggin, Linus Torvalds, rusty, tglx, x86, linux-kernel, hpa,
	jeremy, cpw

Hello,

Ingo Molnar wrote:
> Yeah, we can do this complete conversion.
>
> I'll clean it up some more. I think the best representation of 
> this will be via a virt_to_sym() and sym_to_virt() space. That 
> makes it really clear when we are moving from the symbol space 
> to the linear space and back.

For arch code, maybe it's maintainable, but with my driver developer
hat on I gotta say virt_to_page() not working on .data/.bss is quite
scary.  We can try to convert whatever could be affected, but:

* The affected places aren't clear at all, not only when the code is
  being written but also when someone later uses that code, which can
  be buried several layers down.

* The failure cases can be hidden very well and pass most tests
  unnoticed.  For example, a statically allocated buffer reserved for
  exception cases which is usually used via PIO (no problem) but on a
  few selected controllers is used for DMA (see the sketch below).

* The failure mode is unobvious and very nasty.  With the debug code
  left out, the failure is simply a mistranslated address or page
  pointer.  We might end up feeding the wrong address to controllers.
  The addresses are likely to be invalid, but we really have no idea
  how the controllers would react.  If it ever happens, it's gonna be
  nasty.

* There isn't any point in trying to save a few cycles when we're deep
  in the IO path.  The cost is simply negligible compared to all the
  stuff necessary for programming devices.
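
To be concrete, the kind of pattern I'm worried about looks like this
(made-up driver snippet, all names are hypothetical):

#include <linux/dma-mapping.h>

/* lives in .bss, i.e. in the high kernel-image alias */
static u8 scratch_buf[512];

static int xfer_via_dma(struct device *dev)
{
	dma_addr_t dma;

	/*
	 * dma_map_single() ends up doing virt_to_page()/__pa() on the
	 * buffer; with a purely linear __pa() this silently produces a
	 * bogus physical address for a .bss buffer.
	 */
	dma = dma_map_single(dev, scratch_buf, sizeof(scratch_buf),
			     DMA_FROM_DEVICE);
	if (dma_mapping_error(dev, dma))
		return -EIO;

	/* ... program the controller and wait for completion ... */

	dma_unmap_single(dev, dma, sizeof(scratch_buf), DMA_FROM_DEVICE);
	return 0;
}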

So, I really think we should do what Nick suggested.  Make a fast
version and use it where the saved few cycles actually matter.  A
postfix which is more descriptive than _fast would be better tho.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [patch] x86: optimize __pa() to be linear again on 64-bit x86
  2009-02-27  5:57                         ` Tejun Heo
@ 2009-02-27  6:57                           ` Ingo Molnar
  2009-02-27  7:11                             ` Tejun Heo
  0 siblings, 1 reply; 78+ messages in thread
From: Ingo Molnar @ 2009-02-27  6:57 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Nick Piggin, Linus Torvalds, rusty, tglx, x86, linux-kernel, hpa,
	jeremy, cpw


* Tejun Heo <tj@kernel.org> wrote:

> Hello,
> 
> Ingo Molnar wrote:
> > Yeah, we can do this complete conversion.
> >
> > I'll clean it up some more. I think the best representation of 
> > this will be via a virt_to_sym() and sym_to_virt() space. That 
> > makes it really clear when we are moving from the symbol space 
> > to the linear space and back.
> 
> For arch code, maybe it's maintainable but with my driver developer
> hat on I gotta say virt_to_page() not working on .data/.bss is quite
> scary. [...]

Well, we have a debug mechanism in place.

As i suggested in my first mail, we can run with debug enabled
for a cycle and then turn on the optimization by default (with
the debug option still available too).

Drivers doing DMA on .data/.bss items is rather questionable 
anyway (and dangerous as well, on any platform where there's 
coherency problems if DMA is misaligned, etc.), and a quick look 
shows there's at most 2-3 dozen examples of that in all of 
drivers/*.
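
The debug check itself is simple - something like this (sketch, the
function name is made up):

/*
 * Keep translating high-alias (.text/.data/.bss) addresses correctly,
 * but warn so the offending __pa() call sites can be found and
 * converted:
 */
unsigned long __phys_addr_check(unsigned long x)
{
	if (unlikely(x >= __START_KERNEL_map)) {
		WARN_ONCE(1, "__pa() used on kernel image address %lx\n", x);
		return x - __START_KERNEL_map + phys_base;
	}
	return x - PAGE_OFFSET;
}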

	Ingo

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [patch] x86: optimize __pa() to be linear again on 64-bit x86
  2009-02-27  6:57                           ` Ingo Molnar
@ 2009-02-27  7:11                             ` Tejun Heo
  0 siblings, 0 replies; 78+ messages in thread
From: Tejun Heo @ 2009-02-27  7:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Nick Piggin, Linus Torvalds, rusty, tglx, x86, linux-kernel, hpa,
	jeremy, cpw

Hello, Ingo.

Ingo Molnar wrote:
> * Tejun Heo <tj@kernel.org> wrote:
> 
>> Hello,
>>
>> Ingo Molnar wrote:
>>> Yeah, we can do this complete conversion.
>>>
>>> I'll clean it up some more. I think the best representation of 
>>> this will be via a virt_to_sym() and sym_to_virt() space. That 
>>> makes it really clear when we are moving from the symbol space 
>>> to the linear space and back.
>> For arch code, maybe it's maintainable but with my driver developer
>> hat on I gotta say virt_to_page() not working on .data/.bss is quite
>> scary. [...]
> 
> Well, we have a debug mechanism in place.
> 
> As i suggested it in my first mail we can run with debug enabled 
> for a cycle and then turn on the optimization by default (with 
> the debug option still available too).

I don't know.  The failure mode just seems too subtle to me, and we'll
be able to gain most of the benefits by using the fast version in
appropriate places without adding any risk.

> Drivers doing DMA on .data/.bss items is rather questionable 
> anyway (and dangerous as well, on any platform where there's 
> coherency problems if DMA is misaligned, etc.), and a quick look 
> shows there's at most 2-3 dozen examples of that in all of 
> drivers/*.

The gained-benefit vs. added-danger equation just doesn't seem right to
me.  Yes, we'll be able to filter most of them out in a cycle or two, but
we will never know whether it's fully safe or not.  Please note that
when it goes wrong, it can go wrong silently, corrupting some unrelated
stuff.  When there is a way to achieve almost the same level of
performance gain in a safe way, I don't think doing it this way is a
good choice.  Also, if we do this, we're basically introducing a new API
by changing the semantics of an existing one in a way that can break
current users, which we really should avoid.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 78+ messages in thread

* RE: [PATCH 09/10] percpu: implement new dynamic percpu allocator
  2009-02-26  3:17             ` Tejun Heo
@ 2009-02-27 19:41               ` Luck, Tony
  0 siblings, 0 replies; 78+ messages in thread
From: Luck, Tony @ 2009-02-27 19:41 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Rusty Russell, tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo

> Ah... I see, so the 64k limit for small offset.  I think what we can

The 64k currently in use is determined by the TLB page size that was chosen
for the percpu area.  We can move up to a larger size (but supported page
sizes increase in even powers of two, so next up from 64K is 256K, then 1M).
Just changing PERCPU_PAGE_SHIFT in asm/page.h is sufficient to use a different
page size.
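
For example, going to 256K would just be (sketch):

#define PERCPU_PAGE_SHIFT	18	/* was 16, i.e. 64K */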

> do is using the first chunk for static percpu variables.  We'll still
> be able to use the same accessor by doing something like...
>
> #define unified_percpu_accessor(ptr) ({ \
>       if (__builtin_constant_p(ptr)) \
>               return r0 - unit_size + ptr; \
>       else \
>               do ar.k3 + ptr; \
>       })
>
> So, dynamic ones will be slower than normal ones but faster than what
> we currently have (it will be faster than indirect pointer
> dereferencing, right?)

Depends on how many dynamic percpu accesses are being done, and how close
together they are.  The read of ar.k3 looks to take about 30ns on my test
machine.  Faster than a memory access, but slower than a cache-hit. So
a small sequence of close together dynamic percpu accesses will go
faster with dereferencing than looking at ar.k3 for each one.

> while keeping static accesses fast.  Does it
> sound okay to you?  Also, does anyone know whether there's a working
> ia64 emulator?  There doesn't seem to be any and it seems almost
> impossible to get hold of an actual ia64 machine over here.  :-(

The HP "ski" simulator: http://www.hpl.hp.com/research/linux/ski/ might
do what you want ... but I haven't actually booted a kernel on it in
a very long time.

-Tony


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 06/10] percpu: kill percpu_alloc() and friends
  2009-02-18 12:04 ` [PATCH 06/10] percpu: kill percpu_alloc() and friends Tejun Heo
  2009-02-19  0:17   ` Rusty Russell
@ 2009-03-11 18:36   ` Tony Luck
  2009-03-11 22:44     ` Rusty Russell
  2009-03-12  2:06     ` Tejun Heo
  1 sibling, 2 replies; 78+ messages in thread
From: Tony Luck @ 2009-03-11 18:36 UTC (permalink / raw)
  To: Tejun Heo; +Cc: rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo

On Wed, Feb 18, 2009 at 5:04 AM, Tejun Heo <tj@kernel.org> wrote:
> +static inline void *__alloc_percpu(size_t size, size_t align)
>  {
> +       /*
> +        * Can't easily make larger alignment work with kmalloc.  WARN
> +        * on it.  Larger alignment should only be used for module
> +        * percpu sections on SMP for which this path isn't used.
> +        */
> +       WARN_ON_ONCE(align > __alignof__(unsigned long long));
>        return kzalloc(size, gfp);
>  }

This WARN_ON just pinged for me when I built & ran linux-next tag next-20090311

Stack trace from the WARN_ON pointed to __create_workqueue_key() which
does:

         wq->cpu_wq = alloc_percpu(struct cpu_workqueue_struct);

and the cpu_workqueue_struct is defined as ____cacheline_aligned

I hit this on ia64, but all this code looks generic.

-Tony

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 06/10] percpu: kill percpu_alloc() and friends
  2009-03-11 18:36   ` Tony Luck
@ 2009-03-11 22:44     ` Rusty Russell
  2009-03-12  2:06     ` Tejun Heo
  1 sibling, 0 replies; 78+ messages in thread
From: Rusty Russell @ 2009-03-11 22:44 UTC (permalink / raw)
  To: Tony Luck; +Cc: Tejun Heo, tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo

On Thursday 12 March 2009 05:06:58 Tony Luck wrote:
>          wq->cpu_wq = alloc_percpu(struct cpu_workqueue_struct);
> 
> and the cpu_workqueue_struct is defined as ____cacheline_aligned

Yes, and it no longer needs to be, now that we have the real per-cpu allocator.
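
The fix is just to drop the annotation in kernel/workqueue.c, i.e.
something like (sketch, not the actual commit):

-} ____cacheline_aligned;
+};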

Thanks,
Rusty.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 06/10] percpu: kill percpu_alloc() and friends
  2009-03-11 18:36   ` Tony Luck
  2009-03-11 22:44     ` Rusty Russell
@ 2009-03-12  2:06     ` Tejun Heo
  1 sibling, 0 replies; 78+ messages in thread
From: Tejun Heo @ 2009-03-12  2:06 UTC (permalink / raw)
  To: Tony Luck; +Cc: rusty, tglx, x86, linux-kernel, hpa, jeremy, cpw, mingo

Tony Luck wrote:
> On Wed, Feb 18, 2009 at 5:04 AM, Tejun Heo <tj@kernel.org> wrote:
>> +static inline void *__alloc_percpu(size_t size, size_t align)
>>  {
>> +       /*
>> +        * Can't easily make larger alignment work with kmalloc.  WARN
>> +        * on it.  Larger alignment should only be used for module
>> +        * percpu sections on SMP for which this path isn't used.
>> +        */
>> +       WARN_ON_ONCE(align > __alignof__(unsigned long long));
>>        return kzalloc(size, gfp);
>>  }
> 
> This WARN_ON just pinged for me when I built & ran linux-next tag next-20090311
> 
> Stack trace from the WARN_ON pointed to __create_workqueue_key() which
> does:
> 
>          wq->cpu_wq = alloc_percpu(struct cpu_workqueue_struct);
> 
> and the cpu_workqueue_struct is defined as ____cacheline_aligned
> 
> I hit this on ia64, but all this code looks generic.

Yeap, it's fixed now, but as Rusty pointed out, once the move to the
dynamic percpu allocator is complete, we won't need cacheline alignment
for percpu data structures.  It would only hurt performance by wasting
cachelines.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 78+ messages in thread

end of thread, other threads:[~2009-03-12  2:06 UTC | newest]

Thread overview: 78+ messages
2009-02-18 12:04 [PATCHSET x86/core/percpu] implement dynamic percpu allocator Tejun Heo
2009-02-18 12:04 ` [PATCH 01/10] vmalloc: call flush_cache_vunmap() from unmap_kernel_range() Tejun Heo
2009-02-19 12:06   ` Nick Piggin
2009-02-19 22:36     ` David Miller
2009-02-18 12:04 ` [PATCH 02/10] module: fix out-of-range memory access Tejun Heo
2009-02-19 12:08   ` Nick Piggin
2009-02-20  7:16   ` Tejun Heo
2009-02-18 12:04 ` [PATCH 03/10] module: reorder module pcpu related functions Tejun Heo
2009-02-18 12:04 ` [PATCH 04/10] alloc_percpu: change percpu_ptr to per_cpu_ptr Tejun Heo
2009-02-18 12:04 ` [PATCH 05/10] alloc_percpu: add align argument to __alloc_percpu Tejun Heo
2009-02-18 12:04 ` [PATCH 06/10] percpu: kill percpu_alloc() and friends Tejun Heo
2009-02-19  0:17   ` Rusty Russell
2009-03-11 18:36   ` Tony Luck
2009-03-11 22:44     ` Rusty Russell
2009-03-12  2:06     ` Tejun Heo
2009-02-18 12:04 ` [PATCH 07/10] vmalloc: implement vm_area_register_early() Tejun Heo
2009-02-19  0:55   ` Tejun Heo
2009-02-19 12:09   ` Nick Piggin
2009-02-18 12:04 ` [PATCH 08/10] vmalloc: add un/map_kernel_range_noflush() Tejun Heo
2009-02-19 12:17   ` Nick Piggin
2009-02-20  1:27     ` Tejun Heo
2009-02-20  7:15   ` Subject: [PATCH 08/10 UPDATED] " Tejun Heo
2009-02-20  8:32     ` Andrew Morton
2009-02-21  3:21       ` Tejun Heo
2009-02-18 12:04 ` [PATCH 09/10] percpu: implement new dynamic percpu allocator Tejun Heo
2009-02-19 10:10   ` Andrew Morton
2009-02-19 11:01     ` Ingo Molnar
2009-02-20  2:45       ` Tejun Heo
2009-02-19 12:07     ` Rusty Russell
2009-02-20  2:35     ` Tejun Heo
2009-02-20  3:04       ` Andrew Morton
2009-02-20  5:29         ` Tejun Heo
2009-02-24  2:52         ` Rusty Russell
2009-02-19 11:51   ` Rusty Russell
2009-02-20  3:01     ` Tejun Heo
2009-02-20  3:02       ` Tejun Heo
2009-02-24  2:56       ` Rusty Russell
2009-02-24  5:27         ` [PATCH tj-percpu] percpu: add __read_mostly to variables which are mostly read only Tejun Heo
2009-02-24  5:47         ` [PATCH 09/10] percpu: implement new dynamic percpu allocator Tejun Heo
2009-02-24 17:41           ` Luck, Tony
2009-02-26  3:17             ` Tejun Heo
2009-02-27 19:41               ` Luck, Tony
2009-02-19 12:36   ` Nick Piggin
2009-02-20  3:04     ` Tejun Heo
2009-02-20  7:30   ` [PATCH UPDATED " Tejun Heo
2009-02-20  8:37     ` Andrew Morton
2009-02-21  3:23       ` Tejun Heo
2009-02-21  3:42         ` [PATCH tj-percpu] percpu: s/size/bytes/g in new percpu allocator and interface Tejun Heo
2009-02-21  7:48           ` Tejun Heo
2009-02-21  7:55             ` [PATCH tj-percpu] percpu: clean up size usage Tejun Heo
2009-02-21  7:56               ` Tejun Heo
2009-02-18 12:04 ` [PATCH 10/10] x86: convert to the new dynamic percpu allocator Tejun Heo
2009-02-18 13:43 ` [PATCHSET x86/core/percpu] implement " Ingo Molnar
2009-02-19  0:31   ` Tejun Heo
2009-02-19 10:51   ` Rusty Russell
2009-02-19 11:06     ` Ingo Molnar
2009-02-19 12:14       ` Rusty Russell
2009-02-20  3:08         ` Tejun Heo
2009-02-20  5:36           ` Tejun Heo
2009-02-20  7:33             ` Tejun Heo
2009-02-19  0:30 ` Tejun Heo
2009-02-19 11:07   ` Ingo Molnar
2009-02-20  3:17     ` Tejun Heo
2009-02-20  9:32       ` Ingo Molnar
2009-02-21  7:10         ` Tejun Heo
2009-02-21  7:33           ` Tejun Heo
2009-02-22 19:38             ` Ingo Molnar
2009-02-23  0:43               ` Tejun Heo
2009-02-23 10:17                 ` Ingo Molnar
2009-02-23 13:38                   ` [patch] x86: optimize __pa() to be linear again on 64-bit x86 Ingo Molnar
2009-02-23 14:08                     ` Nick Piggin
2009-02-23 14:53                       ` Ingo Molnar
2009-02-24 16:00                         ` Andi Kleen
2009-02-27  5:57                         ` Tejun Heo
2009-02-27  6:57                           ` Ingo Molnar
2009-02-27  7:11                             ` Tejun Heo
2009-02-22 19:27           ` [PATCHSET x86/core/percpu] implement dynamic percpu allocator Ingo Molnar
2009-02-23  0:47             ` Tejun Heo
