All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v6 0/2] Replacing PID bitmap implementation with IDR API
@ 2017-10-11 22:19 Gargi Sharma
  2017-10-11 22:19 ` [PATCH v6 1/2] pid: Replace pid " Gargi Sharma
  2017-10-11 22:19 ` [PATCH v6 2/2] pid: Remove pidhash Gargi Sharma
  0 siblings, 2 replies; 17+ messages in thread
From: Gargi Sharma @ 2017-10-11 22:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: riel, julia.lawall, akpm, mingo, pasha.tatashin, ktkhai, oleg,
	ebiederm, hch, lkp, tony.luck, Gargi Sharma

This patch series replaces kernel bitmap implementation of PID allocation
with IDR API. These patches are written to simplify the kernel by replacing custom code with calls to generic code.

The following are the stats for pid and pid_namespace object files
before and after the replacement. There is a noteworthy change between
the IDR and bitmap implementation.

Before
text       data        bss        dec        hex    filename
   8447       3894         64      12405       3075    kernel/pid.o
After
text       data        bss        dec        hex    filename
   3397        304          0       3701        e75    kernel/pid.o

Before
 text       data        bss        dec        hex    filename
   5692       1842        192       7726       1e2e    kernel/pid_namespace.o
After
text       data        bss        dec        hex    filename
   2854        216         16       3086        c0e    kernel/pid_namespace.o

The following are the stats for ps, pstree and calling readdir on /proc
for 10,000 processes.

ps:
        With IDR API    With bitmap
real    0m1.962s        0m2.319s
user    0m0.052s        0m0.060s
sys     0m0.392s        0m0.516s

pstree:
        With IDR API    With bitmap
real    0m1.062s        0m1.794s
user    0m0.536s        0m0.612s
sys     0m0.184s        0m0.264s

proc:
        With IDR API    With bitmap
real    0m0.073s        0m0.074s
user    0m0.004s        0m0.004s
sys     0m0.012s        0m0.016s

---
Changes in v6:
	- Fix comment style inside alloc_pid.
	- Move rcu lock before tasklist lock.
	- Fix build error for upid.
Changes in v5:
        - Add rcu lock while iterating over idr tree.
        - Fix checkpatch issues.
Changes in v4:
        - Make comments for alloc_pid clearer.
Changes in v3:
        - Replace idr_next with idr_get_cursor().
        - Correct pid_alloc so that find_pid_ns can't
          find not completely allocated pids.
Changes in v2:
        - Removed redundant  IDR function that was introduced
          in the previous patchset.
        - Renamed PIDNS_HASH_ADDING
        - Used idr_for_each_entry_continue()
        - Used idr_find() to lookup pids

Gargi Sharma (2):
  pid: Replace pid bitmap implementation with IDR API
  pid: Remove pidhash

 arch/ia64/kernel/asm-offsets.c            |   4 +-
 arch/powerpc/platforms/cell/spufs/sched.c |   2 +-
 fs/proc/loadavg.c                         |   2 +-
 include/linux/init_task.h                 |   1 -
 include/linux/pid.h                       |   2 -
 include/linux/pid_namespace.h             |  18 +--
 init/main.c                               |   3 +-
 kernel/fork.c                             |   2 +-
 kernel/pid.c                              | 247 ++++++------------------------
 kernel/pid_namespace.c                    |  50 +++---
 10 files changed, 74 insertions(+), 257 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v6 1/2] pid: Replace pid bitmap implementation with IDR API
  2017-10-11 22:19 [PATCH v6 0/2] Replacing PID bitmap implementation with IDR API Gargi Sharma
@ 2017-10-11 22:19 ` Gargi Sharma
  2017-10-11 22:28   ` Rik van Riel
                     ` (2 more replies)
  2017-10-11 22:19 ` [PATCH v6 2/2] pid: Remove pidhash Gargi Sharma
  1 sibling, 3 replies; 17+ messages in thread
From: Gargi Sharma @ 2017-10-11 22:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: riel, julia.lawall, akpm, mingo, pasha.tatashin, ktkhai, oleg,
	ebiederm, hch, lkp, tony.luck, Gargi Sharma

This patch replaces the current bitmap implemetation for
Process ID allocation. Functions that are no longer required,
for example, free_pidmap(), alloc_pidmap(), etc. are removed.
The rest of the functions are modified to use the IDR API.
The change was made to make the PID allocation less complex by
replacing custom code with calls to generic API.

Signed-off-by: Gargi Sharma <gs051095@gmail.com>
---
 arch/powerpc/platforms/cell/spufs/sched.c |   2 +-
 fs/proc/loadavg.c                         |   2 +-
 include/linux/pid_namespace.h             |  14 +--
 init/main.c                               |   2 +-
 kernel/pid.c                              | 201 ++++++------------------------
 kernel/pid_namespace.c                    |  44 +++----
 6 files changed, 57 insertions(+), 208 deletions(-)

diff --git a/arch/powerpc/platforms/cell/spufs/sched.c b/arch/powerpc/platforms/cell/spufs/sched.c
index 1fbb5da..e47761c 100644
--- a/arch/powerpc/platforms/cell/spufs/sched.c
+++ b/arch/powerpc/platforms/cell/spufs/sched.c
@@ -1093,7 +1093,7 @@ static int show_spu_loadavg(struct seq_file *s, void *private)
 		LOAD_INT(c), LOAD_FRAC(c),
 		count_active_contexts(),
 		atomic_read(&nr_spu_contexts),
-		task_active_pid_ns(current)->last_pid);
+		idr_get_cursor(&task_active_pid_ns(current)->idr));
 	return 0;
 }
 
diff --git a/fs/proc/loadavg.c b/fs/proc/loadavg.c
index 983fce5..ba3d0e2 100644
--- a/fs/proc/loadavg.c
+++ b/fs/proc/loadavg.c
@@ -23,7 +23,7 @@ static int loadavg_proc_show(struct seq_file *m, void *v)
 		LOAD_INT(avnrun[1]), LOAD_FRAC(avnrun[1]),
 		LOAD_INT(avnrun[2]), LOAD_FRAC(avnrun[2]),
 		nr_running(), nr_threads,
-		task_active_pid_ns(current)->last_pid);
+		idr_get_cursor(&task_active_pid_ns(current)->idr));
 	return 0;
 }
 
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index b09136f..f4db4a7 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -9,15 +9,8 @@
 #include <linux/nsproxy.h>
 #include <linux/kref.h>
 #include <linux/ns_common.h>
+#include <linux/idr.h>
 
-struct pidmap {
-       atomic_t nr_free;
-       void *page;
-};
-
-#define BITS_PER_PAGE		(PAGE_SIZE * 8)
-#define BITS_PER_PAGE_MASK	(BITS_PER_PAGE-1)
-#define PIDMAP_ENTRIES		((PID_MAX_LIMIT+BITS_PER_PAGE-1)/BITS_PER_PAGE)
 
 struct fs_pin;
 
@@ -29,9 +22,8 @@ enum { /* definitions for pid_namespace's hide_pid field */
 
 struct pid_namespace {
 	struct kref kref;
-	struct pidmap pidmap[PIDMAP_ENTRIES];
+	struct idr idr;
 	struct rcu_head rcu;
-	int last_pid;
 	unsigned int nr_hashed;
 	struct task_struct *child_reaper;
 	struct kmem_cache *pid_cachep;
@@ -105,6 +97,6 @@ static inline int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd)
 
 extern struct pid_namespace *task_active_pid_ns(struct task_struct *tsk);
 void pidhash_init(void);
-void pidmap_init(void);
+void pid_idr_init(void);
 
 #endif /* _LINUX_PID_NS_H */
diff --git a/init/main.c b/init/main.c
index 0ee9c686..9f4db20 100644
--- a/init/main.c
+++ b/init/main.c
@@ -667,7 +667,7 @@ asmlinkage __visible void __init start_kernel(void)
 	if (late_time_init)
 		late_time_init();
 	calibrate_delay();
-	pidmap_init();
+	pid_idr_init();
 	anon_vma_init();
 	acpi_early_init();
 #ifdef CONFIG_X86
diff --git a/kernel/pid.c b/kernel/pid.c
index 020dedb..0ce5936 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -39,6 +39,7 @@
 #include <linux/proc_ns.h>
 #include <linux/proc_fs.h>
 #include <linux/sched/task.h>
+#include <linux/idr.h>
 
 #define pid_hashfn(nr, ns)	\
 	hash_long((unsigned long)nr + (unsigned long)ns, pidhash_shift)
@@ -53,14 +54,6 @@ int pid_max = PID_MAX_DEFAULT;
 int pid_max_min = RESERVED_PIDS + 1;
 int pid_max_max = PID_MAX_LIMIT;
 
-static inline int mk_pid(struct pid_namespace *pid_ns,
-		struct pidmap *map, int off)
-{
-	return (map - pid_ns->pidmap)*BITS_PER_PAGE + off;
-}
-
-#define find_next_offset(map, off)					\
-		find_next_zero_bit((map)->page, BITS_PER_PAGE, off)
 
 /*
  * PID-map pages start out as NULL, they get allocated upon
@@ -70,10 +63,7 @@ static inline int mk_pid(struct pid_namespace *pid_ns,
  */
 struct pid_namespace init_pid_ns = {
 	.kref = KREF_INIT(2),
-	.pidmap = {
-		[ 0 ... PIDMAP_ENTRIES-1] = { ATOMIC_INIT(BITS_PER_PAGE), NULL }
-	},
-	.last_pid = 0,
+	.idr = IDR_INIT,
 	.nr_hashed = PIDNS_HASH_ADDING,
 	.level = 0,
 	.child_reaper = &init_task,
@@ -101,138 +91,6 @@ EXPORT_SYMBOL_GPL(init_pid_ns);
 
 static  __cacheline_aligned_in_smp DEFINE_SPINLOCK(pidmap_lock);
 
-static void free_pidmap(struct upid *upid)
-{
-	int nr = upid->nr;
-	struct pidmap *map = upid->ns->pidmap + nr / BITS_PER_PAGE;
-	int offset = nr & BITS_PER_PAGE_MASK;
-
-	clear_bit(offset, map->page);
-	atomic_inc(&map->nr_free);
-}
-
-/*
- * If we started walking pids at 'base', is 'a' seen before 'b'?
- */
-static int pid_before(int base, int a, int b)
-{
-	/*
-	 * This is the same as saying
-	 *
-	 * (a - base + MAXUINT) % MAXUINT < (b - base + MAXUINT) % MAXUINT
-	 * and that mapping orders 'a' and 'b' with respect to 'base'.
-	 */
-	return (unsigned)(a - base) < (unsigned)(b - base);
-}
-
-/*
- * We might be racing with someone else trying to set pid_ns->last_pid
- * at the pid allocation time (there's also a sysctl for this, but racing
- * with this one is OK, see comment in kernel/pid_namespace.c about it).
- * We want the winner to have the "later" value, because if the
- * "earlier" value prevails, then a pid may get reused immediately.
- *
- * Since pids rollover, it is not sufficient to just pick the bigger
- * value.  We have to consider where we started counting from.
- *
- * 'base' is the value of pid_ns->last_pid that we observed when
- * we started looking for a pid.
- *
- * 'pid' is the pid that we eventually found.
- */
-static void set_last_pid(struct pid_namespace *pid_ns, int base, int pid)
-{
-	int prev;
-	int last_write = base;
-	do {
-		prev = last_write;
-		last_write = cmpxchg(&pid_ns->last_pid, prev, pid);
-	} while ((prev != last_write) && (pid_before(base, last_write, pid)));
-}
-
-static int alloc_pidmap(struct pid_namespace *pid_ns)
-{
-	int i, offset, max_scan, pid, last = pid_ns->last_pid;
-	struct pidmap *map;
-
-	pid = last + 1;
-	if (pid >= pid_max)
-		pid = RESERVED_PIDS;
-	offset = pid & BITS_PER_PAGE_MASK;
-	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
-	/*
-	 * If last_pid points into the middle of the map->page we
-	 * want to scan this bitmap block twice, the second time
-	 * we start with offset == 0 (or RESERVED_PIDS).
-	 */
-	max_scan = DIV_ROUND_UP(pid_max, BITS_PER_PAGE) - !offset;
-	for (i = 0; i <= max_scan; ++i) {
-		if (unlikely(!map->page)) {
-			void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
-			/*
-			 * Free the page if someone raced with us
-			 * installing it:
-			 */
-			spin_lock_irq(&pidmap_lock);
-			if (!map->page) {
-				map->page = page;
-				page = NULL;
-			}
-			spin_unlock_irq(&pidmap_lock);
-			kfree(page);
-			if (unlikely(!map->page))
-				return -ENOMEM;
-		}
-		if (likely(atomic_read(&map->nr_free))) {
-			for ( ; ; ) {
-				if (!test_and_set_bit(offset, map->page)) {
-					atomic_dec(&map->nr_free);
-					set_last_pid(pid_ns, last, pid);
-					return pid;
-				}
-				offset = find_next_offset(map, offset);
-				if (offset >= BITS_PER_PAGE)
-					break;
-				pid = mk_pid(pid_ns, map, offset);
-				if (pid >= pid_max)
-					break;
-			}
-		}
-		if (map < &pid_ns->pidmap[(pid_max-1)/BITS_PER_PAGE]) {
-			++map;
-			offset = 0;
-		} else {
-			map = &pid_ns->pidmap[0];
-			offset = RESERVED_PIDS;
-			if (unlikely(last == offset))
-				break;
-		}
-		pid = mk_pid(pid_ns, map, offset);
-	}
-	return -EAGAIN;
-}
-
-int next_pidmap(struct pid_namespace *pid_ns, unsigned int last)
-{
-	int offset;
-	struct pidmap *map, *end;
-
-	if (last >= PID_MAX_LIMIT)
-		return -1;
-
-	offset = (last + 1) & BITS_PER_PAGE_MASK;
-	map = &pid_ns->pidmap[(last + 1)/BITS_PER_PAGE];
-	end = &pid_ns->pidmap[PIDMAP_ENTRIES];
-	for (; map < end; map++, offset = 0) {
-		if (unlikely(!map->page))
-			continue;
-		offset = find_next_bit((map)->page, BITS_PER_PAGE, offset);
-		if (offset < BITS_PER_PAGE)
-			return mk_pid(pid_ns, map, offset);
-	}
-	return -1;
-}
-
 void put_pid(struct pid *pid)
 {
 	struct pid_namespace *ns;
@@ -266,7 +124,7 @@ void free_pid(struct pid *pid)
 		struct upid *upid = pid->numbers + i;
 		struct pid_namespace *ns = upid->ns;
 		hlist_del_rcu(&upid->pid_chain);
-		switch(--ns->nr_hashed) {
+		switch (--ns->nr_hashed) {
 		case 2:
 		case 1:
 			/* When all that is left in the pid namespace
@@ -284,12 +142,11 @@ void free_pid(struct pid *pid)
 			schedule_work(&ns->proc_work);
 			break;
 		}
+
+		idr_remove(&ns->idr, upid->nr);
 	}
 	spin_unlock_irqrestore(&pidmap_lock, flags);
 
-	for (i = 0; i <= pid->level; i++)
-		free_pidmap(pid->numbers + i);
-
 	call_rcu(&pid->rcu, delayed_put_pid);
 }
 
@@ -308,8 +165,29 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 
 	tmp = ns;
 	pid->level = ns->level;
+
 	for (i = ns->level; i >= 0; i--) {
-		nr = alloc_pidmap(tmp);
+		int pid_min = 1;
+
+		idr_preload(GFP_KERNEL);
+		spin_lock_irq(&pidmap_lock);
+
+		/*
+		 * init really needs pid 1, but after reaching the maximum
+		 * wrap back to RESERVED_PIDS
+		 */
+		if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS)
+			pid_min = RESERVED_PIDS;
+
+		/*
+		 * Store a null pointer so find_pid_ns does not find
+		 * a partially initialized PID (see below).
+		 */
+		nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
+				      pid_max, GFP_ATOMIC);
+		spin_unlock_irq(&pidmap_lock);
+		idr_preload_end();
+
 		if (nr < 0) {
 			retval = nr;
 			goto out_free;
@@ -339,6 +217,8 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 	for ( ; upid >= pid->numbers; --upid) {
 		hlist_add_head_rcu(&upid->pid_chain,
 				&pid_hash[pid_hashfn(upid->nr, upid->ns)]);
+		/* Make the PID visible to find_pid_ns. */
+		idr_replace(&upid->ns->idr, pid, upid->nr);
 		upid->ns->nr_hashed++;
 	}
 	spin_unlock_irq(&pidmap_lock);
@@ -350,8 +230,11 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 	put_pid_ns(ns);
 
 out_free:
+	spin_lock_irq(&pidmap_lock);
 	while (++i <= ns->level)
-		free_pidmap(pid->numbers + i);
+		idr_remove(&ns->idr, (pid->numbers + i)->nr);
+
+	spin_unlock_irq(&pidmap_lock);
 
 	kmem_cache_free(ns->pid_cachep, pid);
 	return ERR_PTR(retval);
@@ -553,16 +436,7 @@ EXPORT_SYMBOL_GPL(task_active_pid_ns);
  */
 struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
 {
-	struct pid *pid;
-
-	do {
-		pid = find_pid_ns(nr, ns);
-		if (pid)
-			break;
-		nr = next_pidmap(ns, nr);
-	} while (nr > 0);
-
-	return pid;
+	return idr_get_next(&ns->idr, &nr);
 }
 
 /*
@@ -578,7 +452,7 @@ void __init pidhash_init(void)
 					   0, 4096);
 }
 
-void __init pidmap_init(void)
+void __init pid_idr_init(void)
 {
 	/* Verify no one has done anything silly: */
 	BUILD_BUG_ON(PID_MAX_LIMIT >= PIDNS_HASH_ADDING);
@@ -590,10 +464,7 @@ void __init pidmap_init(void)
 				PIDS_PER_CPU_MIN * num_possible_cpus());
 	pr_info("pid_max: default: %u minimum: %u\n", pid_max, pid_max_min);
 
-	init_pid_ns.pidmap[0].page = kzalloc(PAGE_SIZE, GFP_KERNEL);
-	/* Reserve PID 0. We never call free_pidmap(0) */
-	set_bit(0, init_pid_ns.pidmap[0].page);
-	atomic_dec(&init_pid_ns.pidmap[0].nr_free);
+	idr_init(&init_pid_ns.idr);
 
 	init_pid_ns.pid_cachep = KMEM_CACHE(pid,
 			SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT);
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 4918314..5009dbe 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -21,6 +21,7 @@
 #include <linux/export.h>
 #include <linux/sched/task.h>
 #include <linux/sched/signal.h>
+#include <linux/idr.h>
 
 struct pid_cache {
 	int nr_ids;
@@ -98,7 +99,6 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
 	struct pid_namespace *ns;
 	unsigned int level = parent_pid_ns->level + 1;
 	struct ucounts *ucounts;
-	int i;
 	int err;
 
 	err = -EINVAL;
@@ -117,17 +117,15 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
 	if (ns == NULL)
 		goto out_dec;
 
-	ns->pidmap[0].page = kzalloc(PAGE_SIZE, GFP_KERNEL);
-	if (!ns->pidmap[0].page)
-		goto out_free;
+	idr_init(&ns->idr);
 
 	ns->pid_cachep = create_pid_cachep(level + 1);
 	if (ns->pid_cachep == NULL)
-		goto out_free_map;
+		goto out_free_idr;
 
 	err = ns_alloc_inum(&ns->ns);
 	if (err)
-		goto out_free_map;
+		goto out_free_idr;
 	ns->ns.ops = &pidns_operations;
 
 	kref_init(&ns->kref);
@@ -138,17 +136,10 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
 	ns->nr_hashed = PIDNS_HASH_ADDING;
 	INIT_WORK(&ns->proc_work, proc_cleanup_work);
 
-	set_bit(0, ns->pidmap[0].page);
-	atomic_set(&ns->pidmap[0].nr_free, BITS_PER_PAGE - 1);
-
-	for (i = 1; i < PIDMAP_ENTRIES; i++)
-		atomic_set(&ns->pidmap[i].nr_free, BITS_PER_PAGE);
-
 	return ns;
 
-out_free_map:
-	kfree(ns->pidmap[0].page);
-out_free:
+out_free_idr:
+	idr_destroy(&ns->idr);
 	kmem_cache_free(pid_ns_cachep, ns);
 out_dec:
 	dec_pid_namespaces(ucounts);
@@ -168,11 +159,9 @@ static void delayed_free_pidns(struct rcu_head *p)
 
 static void destroy_pid_namespace(struct pid_namespace *ns)
 {
-	int i;
-
 	ns_free_inum(&ns->ns);
-	for (i = 0; i < PIDMAP_ENTRIES; i++)
-		kfree(ns->pidmap[i].page);
+
+	idr_destroy(&ns->idr);
 	call_rcu(&ns->rcu, delayed_free_pidns);
 }
 
@@ -213,6 +202,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
 	int rc;
 	struct task_struct *task, *me = current;
 	int init_pids = thread_group_leader(me) ? 1 : 2;
+	struct pid *pid;
 
 	/* Don't allow any more processes into the pid namespace */
 	disable_pid_allocation(pid_ns);
@@ -239,20 +229,16 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
 	 * 	  maintain a tasklist for each pid namespace.
 	 *
 	 */
+	rcu_read_lock();
 	read_lock(&tasklist_lock);
-	nr = next_pidmap(pid_ns, 1);
-	while (nr > 0) {
-		rcu_read_lock();
-
-		task = pid_task(find_vpid(nr), PIDTYPE_PID);
+	nr = 2;
+	idr_for_each_entry_continue(&pid_ns->idr, pid, nr) {
+		task = pid_task(pid, PIDTYPE_PID);
 		if (task && !__fatal_signal_pending(task))
 			send_sig_info(SIGKILL, SEND_SIG_FORCED, task);
-
-		rcu_read_unlock();
-
-		nr = next_pidmap(pid_ns, nr);
 	}
 	read_unlock(&tasklist_lock);
+	rcu_read_unlock();
 
 	/*
 	 * Reap the EXIT_ZOMBIE children we had before we ignored SIGCHLD.
@@ -311,7 +297,7 @@ static int pid_ns_ctl_handler(struct ctl_table *table, int write,
 	 * it should synchronize its usage with external means.
 	 */
 
-	tmp.data = &pid_ns->last_pid;
+	tmp.data = &pid_ns->idr.idr_next;
 	return proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
 }
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v6 2/2] pid: Remove pidhash
  2017-10-11 22:19 [PATCH v6 0/2] Replacing PID bitmap implementation with IDR API Gargi Sharma
  2017-10-11 22:19 ` [PATCH v6 1/2] pid: Replace pid " Gargi Sharma
@ 2017-10-11 22:19 ` Gargi Sharma
  2017-10-11 22:28   ` Rik van Riel
  2017-10-11 22:47   ` Luck, Tony
  1 sibling, 2 replies; 17+ messages in thread
From: Gargi Sharma @ 2017-10-11 22:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: riel, julia.lawall, akpm, mingo, pasha.tatashin, ktkhai, oleg,
	ebiederm, hch, lkp, tony.luck, Gargi Sharma

pidhash is no longer required as all the information
can be looked up from idr tree. nr_hashed represented
the number of pids that had been hashed. Since, nr_hashed and
PIDNS_HASH_ADDING are no longer relevant, it has been renamed
to pid_allocated and PIDNS_ADDING respectively.

Signed-off-by: Gargi Sharma <gs051095@gmail.com>
---
 arch/ia64/kernel/asm-offsets.c |  4 ++--
 include/linux/init_task.h      |  1 -
 include/linux/pid.h            |  2 --
 include/linux/pid_namespace.h  |  4 ++--
 init/main.c                    |  1 -
 kernel/fork.c                  |  2 +-
 kernel/pid.c                   | 48 +++++++++---------------------------------
 kernel/pid_namespace.c         |  6 +++---
 8 files changed, 18 insertions(+), 50 deletions(-)

diff --git a/arch/ia64/kernel/asm-offsets.c b/arch/ia64/kernel/asm-offsets.c
index 798bdb2..62ba885 100644
--- a/arch/ia64/kernel/asm-offsets.c
+++ b/arch/ia64/kernel/asm-offsets.c
@@ -30,8 +30,8 @@ void foo(void)
 	DEFINE(SIGFRAME_SIZE, sizeof (struct sigframe));
 	DEFINE(UNW_FRAME_INFO_SIZE, sizeof (struct unw_frame_info));
 
-	BUILD_BUG_ON(sizeof(struct upid) != 32);
-	DEFINE(IA64_UPID_SHIFT, 5);
+	BUILD_BUG_ON(sizeof(struct upid) != 16);
+	DEFINE(IA64_UPID_SHIFT, 4);
 
 	BLANK();
 
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 3c07ace..cc45798 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -104,7 +104,6 @@ extern struct group_info init_groups;
 	.numbers	= { {						\
 		.nr		= 0,					\
 		.ns		= &init_pid_ns,				\
-		.pid_chain	= { .next = NULL, .pprev = NULL },	\
 	}, }								\
 }
 
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 7195827..3915664 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -50,10 +50,8 @@ enum pid_type
  */
 
 struct upid {
-	/* Try to keep pid_chain in the same cacheline as nr for find_vpid */
 	int nr;
 	struct pid_namespace *ns;
-	struct hlist_node pid_chain;
 };
 
 struct pid
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index f4db4a7..7911b58 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -24,7 +24,7 @@ struct pid_namespace {
 	struct kref kref;
 	struct idr idr;
 	struct rcu_head rcu;
-	unsigned int nr_hashed;
+	unsigned int pid_allocated;
 	struct task_struct *child_reaper;
 	struct kmem_cache *pid_cachep;
 	unsigned int level;
@@ -48,7 +48,7 @@ struct pid_namespace {
 
 extern struct pid_namespace init_pid_ns;
 
-#define PIDNS_HASH_ADDING (1U << 31)
+#define PIDNS_ADDING (1U << 31)
 
 #ifdef CONFIG_PID_NS
 static inline struct pid_namespace *get_pid_ns(struct pid_namespace *ns)
diff --git a/init/main.c b/init/main.c
index 9f4db20..c17a21b 100644
--- a/init/main.c
+++ b/init/main.c
@@ -562,7 +562,6 @@ asmlinkage __visible void __init start_kernel(void)
 	 * kmem_cache_init()
 	 */
 	setup_log_buf(0);
-	pidhash_init();
 	vfs_caches_init_early();
 	sort_main_extable();
 	trap_init();
diff --git a/kernel/fork.c b/kernel/fork.c
index 1064618..c3518b8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1853,7 +1853,7 @@ static __latent_entropy struct task_struct *copy_process(
 		retval = -ERESTARTNOINTR;
 		goto bad_fork_cancel_cgroup;
 	}
-	if (unlikely(!(ns_of_pid(pid)->nr_hashed & PIDNS_HASH_ADDING))) {
+	if (unlikely(!(ns_of_pid(pid)->pid_allocated & PIDNS_ADDING))) {
 		retval = -ENOMEM;
 		goto bad_fork_cancel_cgroup;
 	}
diff --git a/kernel/pid.c b/kernel/pid.c
index 0ce5936..b13b624 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -41,10 +41,6 @@
 #include <linux/sched/task.h>
 #include <linux/idr.h>
 
-#define pid_hashfn(nr, ns)	\
-	hash_long((unsigned long)nr + (unsigned long)ns, pidhash_shift)
-static struct hlist_head *pid_hash;
-static unsigned int pidhash_shift = 4;
 struct pid init_struct_pid = INIT_STRUCT_PID;
 
 int pid_max = PID_MAX_DEFAULT;
@@ -54,7 +50,6 @@ int pid_max = PID_MAX_DEFAULT;
 int pid_max_min = RESERVED_PIDS + 1;
 int pid_max_max = PID_MAX_LIMIT;
 
-
 /*
  * PID-map pages start out as NULL, they get allocated upon
  * first use and are never deallocated. This way a low pid_max
@@ -64,7 +59,7 @@ int pid_max_max = PID_MAX_LIMIT;
 struct pid_namespace init_pid_ns = {
 	.kref = KREF_INIT(2),
 	.idr = IDR_INIT,
-	.nr_hashed = PIDNS_HASH_ADDING,
+	.pid_allocated = PIDNS_ADDING,
 	.level = 0,
 	.child_reaper = &init_task,
 	.user_ns = &init_user_ns,
@@ -123,8 +118,7 @@ void free_pid(struct pid *pid)
 	for (i = 0; i <= pid->level; i++) {
 		struct upid *upid = pid->numbers + i;
 		struct pid_namespace *ns = upid->ns;
-		hlist_del_rcu(&upid->pid_chain);
-		switch (--ns->nr_hashed) {
+		switch (--ns->pid_allocated) {
 		case 2:
 		case 1:
 			/* When all that is left in the pid namespace
@@ -133,10 +127,10 @@ void free_pid(struct pid *pid)
 			 */
 			wake_up_process(ns->child_reaper);
 			break;
-		case PIDNS_HASH_ADDING:
+		case PIDNS_ADDING:
 			/* Handle a fork failure of the first process */
 			WARN_ON(ns->child_reaper);
-			ns->nr_hashed = 0;
+			ns->pid_allocated = 0;
 			/* fall through */
 		case 0:
 			schedule_work(&ns->proc_work);
@@ -212,14 +206,12 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 
 	upid = pid->numbers + ns->level;
 	spin_lock_irq(&pidmap_lock);
-	if (!(ns->nr_hashed & PIDNS_HASH_ADDING))
+	if (!(ns->pid_allocated & PIDNS_ADDING))
 		goto out_unlock;
 	for ( ; upid >= pid->numbers; --upid) {
-		hlist_add_head_rcu(&upid->pid_chain,
-				&pid_hash[pid_hashfn(upid->nr, upid->ns)]);
 		/* Make the PID visible to find_pid_ns. */
 		idr_replace(&upid->ns->idr, pid, upid->nr);
-		upid->ns->nr_hashed++;
+		upid->ns->pid_allocated++;
 	}
 	spin_unlock_irq(&pidmap_lock);
 
@@ -243,21 +235,13 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 void disable_pid_allocation(struct pid_namespace *ns)
 {
 	spin_lock_irq(&pidmap_lock);
-	ns->nr_hashed &= ~PIDNS_HASH_ADDING;
+	ns->pid_allocated &= ~PIDNS_ADDING;
 	spin_unlock_irq(&pidmap_lock);
 }
 
 struct pid *find_pid_ns(int nr, struct pid_namespace *ns)
 {
-	struct upid *pnr;
-
-	hlist_for_each_entry_rcu(pnr,
-			&pid_hash[pid_hashfn(nr, ns)], pid_chain)
-		if (pnr->nr == nr && pnr->ns == ns)
-			return container_of(pnr, struct pid,
-					numbers[ns->level]);
-
-	return NULL;
+	return idr_find(&ns->idr, nr);
 }
 EXPORT_SYMBOL_GPL(find_pid_ns);
 
@@ -413,6 +397,7 @@ pid_t __task_pid_nr_ns(struct task_struct *task, enum pid_type type,
 		if (type != PIDTYPE_PID) {
 			if (type == __PIDTYPE_TGID)
 				type = PIDTYPE_PID;
+
 			task = task->group_leader;
 		}
 		nr = pid_nr_ns(rcu_dereference(task->pids[type].pid), ns);
@@ -439,23 +424,10 @@ struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
 	return idr_get_next(&ns->idr, &nr);
 }
 
-/*
- * The pid hash table is scaled according to the amount of memory in the
- * machine.  From a minimum of 16 slots up to 4096 slots at one gigabyte or
- * more.
- */
-void __init pidhash_init(void)
-{
-	pid_hash = alloc_large_system_hash("PID", sizeof(*pid_hash), 0, 18,
-					   HASH_EARLY | HASH_SMALL | HASH_ZERO,
-					   &pidhash_shift, NULL,
-					   0, 4096);
-}
-
 void __init pid_idr_init(void)
 {
 	/* Verify no one has done anything silly: */
-	BUILD_BUG_ON(PID_MAX_LIMIT >= PIDNS_HASH_ADDING);
+	BUILD_BUG_ON(PID_MAX_LIMIT >= PIDNS_ADDING);
 
 	/* bump default and minimum pid_max based on number of cpus */
 	pid_max = min(pid_max_max, max_t(int, pid_max,
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 5009dbe..fea2c24 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -133,7 +133,7 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
 	ns->parent = get_pid_ns(parent_pid_ns);
 	ns->user_ns = get_user_ns(user_ns);
 	ns->ucounts = ucounts;
-	ns->nr_hashed = PIDNS_HASH_ADDING;
+	ns->pid_allocated = PIDNS_ADDING;
 	INIT_WORK(&ns->proc_work, proc_cleanup_work);
 
 	return ns;
@@ -254,7 +254,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
 	 * sys_wait4() above can't reap the EXIT_DEAD children but we do not
 	 * really care, we could reparent them to the global init. We could
 	 * exit and reap ->child_reaper even if it is not the last thread in
-	 * this pid_ns, free_pid(nr_hashed == 0) calls proc_cleanup_work(),
+	 * this pid_ns, free_pid(pid_allocated == 0) calls proc_cleanup_work(),
 	 * pid_ns can not go away until proc_kill_sb() drops the reference.
 	 *
 	 * But this ns can also have other tasks injected by setns()+fork().
@@ -268,7 +268,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
 	 */
 	for (;;) {
 		set_current_state(TASK_INTERRUPTIBLE);
-		if (pid_ns->nr_hashed == init_pids)
+		if (pid_ns->pid_allocated == init_pids)
 			break;
 		schedule();
 	}
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 1/2] pid: Replace pid bitmap implementation with IDR API
  2017-10-11 22:19 ` [PATCH v6 1/2] pid: Replace pid " Gargi Sharma
@ 2017-10-11 22:28   ` Rik van Riel
  2017-10-11 22:37   ` Andrew Morton
  2017-10-19  7:30   ` [v6,1/2] " Andrei Vagin
  2 siblings, 0 replies; 17+ messages in thread
From: Rik van Riel @ 2017-10-11 22:28 UTC (permalink / raw)
  To: Gargi Sharma, linux-kernel
  Cc: julia.lawall, akpm, mingo, pasha.tatashin, ktkhai, oleg,
	ebiederm, hch, lkp, tony.luck

[-- Attachment #1: Type: text/plain, Size: 1171 bytes --]

On Wed, 2017-10-11 at 18:19 -0400, Gargi Sharma wrote:
> This patch replaces the current bitmap implemetation for
> Process ID allocation. Functions that are no longer required,
> for example, free_pidmap(), alloc_pidmap(), etc. are removed.
> The rest of the functions are modified to use the IDR API.
> The change was made to make the PID allocation less complex by
> replacing custom code with calls to generic API.
> 
> Signed-off-by: Gargi Sharma <gs051095@gmail.com>
> ---
>  arch/powerpc/platforms/cell/spufs/sched.c |   2 +-
>  fs/proc/loadavg.c                         |   2 +-
>  include/linux/pid_namespace.h             |  14 +--
>  init/main.c                               |   2 +-
>  kernel/pid.c                              | 201 ++++++------------
> ------------
>  kernel/pid_namespace.c                    |  44 +++----
>  6 files changed, 57 insertions(+), 208 deletions(-)

I'm all out of nitpicks now :)
Also, I really love that diffstat.

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 2/2] pid: Remove pidhash
  2017-10-11 22:19 ` [PATCH v6 2/2] pid: Remove pidhash Gargi Sharma
@ 2017-10-11 22:28   ` Rik van Riel
  2017-10-11 22:47   ` Luck, Tony
  1 sibling, 0 replies; 17+ messages in thread
From: Rik van Riel @ 2017-10-11 22:28 UTC (permalink / raw)
  To: Gargi Sharma, linux-kernel
  Cc: julia.lawall, akpm, mingo, pasha.tatashin, ktkhai, oleg,
	ebiederm, hch, lkp, tony.luck

[-- Attachment #1: Type: text/plain, Size: 1076 bytes --]

On Wed, 2017-10-11 at 18:19 -0400, Gargi Sharma wrote:
> pidhash is no longer required as all the information
> can be looked up from idr tree. nr_hashed represented
> the number of pids that had been hashed. Since, nr_hashed and
> PIDNS_HASH_ADDING are no longer relevant, it has been renamed
> to pid_allocated and PIDNS_ADDING respectively.
> 
> Signed-off-by: Gargi Sharma <gs051095@gmail.com>
> ---
>  arch/ia64/kernel/asm-offsets.c |  4 ++--
>  include/linux/init_task.h      |  1 -
>  include/linux/pid.h            |  2 --
>  include/linux/pid_namespace.h  |  4 ++--
>  init/main.c                    |  1 -
>  kernel/fork.c                  |  2 +-
>  kernel/pid.c                   | 48 +++++++++-----------------------
> ----------
>  kernel/pid_namespace.c         |  6 +++---
>  8 files changed, 18 insertions(+), 50 deletions(-)

Fingers crossed for IA64. It all looks good, though.

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 1/2] pid: Replace pid bitmap implementation with IDR API
  2017-10-11 22:19 ` [PATCH v6 1/2] pid: Replace pid " Gargi Sharma
  2017-10-11 22:28   ` Rik van Riel
@ 2017-10-11 22:37   ` Andrew Morton
  2017-10-12  0:54     ` Rik van Riel
  2017-10-19  7:30   ` [v6,1/2] " Andrei Vagin
  2 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2017-10-11 22:37 UTC (permalink / raw)
  To: Gargi Sharma
  Cc: linux-kernel, riel, julia.lawall, mingo, pasha.tatashin, ktkhai,
	oleg, ebiederm, hch, lkp, tony.luck

On Wed, 11 Oct 2017 18:19:38 -0400 Gargi Sharma <gs051095@gmail.com> wrote:

> This patch replaces the current bitmap implemetation for
> Process ID allocation. Functions that are no longer required,
> for example, free_pidmap(), alloc_pidmap(), etc. are removed.
> The rest of the functions are modified to use the IDR API.
> The change was made to make the PID allocation less complex by
> replacing custom code with calls to generic API.

I still don't understand the locking.  spin_lock_irq(&pidmap_lock) in
some places, rcu_read_lock() in others.

If the locking is indeed now correct, can we please get it fully
documented?  A comment at the pid_namespace.idr definition site would
suit.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 2/2] pid: Remove pidhash
  2017-10-11 22:19 ` [PATCH v6 2/2] pid: Remove pidhash Gargi Sharma
  2017-10-11 22:28   ` Rik van Riel
@ 2017-10-11 22:47   ` Luck, Tony
  1 sibling, 0 replies; 17+ messages in thread
From: Luck, Tony @ 2017-10-11 22:47 UTC (permalink / raw)
  To: Gargi Sharma
  Cc: linux-kernel, riel, julia.lawall, akpm, mingo, pasha.tatashin,
	ktkhai, oleg, ebiederm, hch, lkp

On Wed, Oct 11, 2017 at 06:19:39PM -0400, Gargi Sharma wrote:
> pidhash is no longer required as all the information
> can be looked up from idr tree. nr_hashed represented
> the number of pids that had been hashed. Since, nr_hashed and
> PIDNS_HASH_ADDING are no longer relevant, it has been renamed
> to pid_allocated and PIDNS_ADDING respectively.
> 
> Signed-off-by: Gargi Sharma <gs051095@gmail.com>

Tested-by: Tony Luck <tony.luck@intel.com> #ia64

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 1/2] pid: Replace pid bitmap implementation with IDR API
  2017-10-11 22:37   ` Andrew Morton
@ 2017-10-12  0:54     ` Rik van Riel
  2017-10-12 22:58       ` Andrew Morton
  0 siblings, 1 reply; 17+ messages in thread
From: Rik van Riel @ 2017-10-12  0:54 UTC (permalink / raw)
  To: Andrew Morton, Gargi Sharma
  Cc: linux-kernel, julia.lawall, mingo, pasha.tatashin, ktkhai, oleg,
	ebiederm, hch, lkp, tony.luck

[-- Attachment #1: Type: text/plain, Size: 1035 bytes --]

On Wed, 2017-10-11 at 15:37 -0700, Andrew Morton wrote:
> On Wed, 11 Oct 2017 18:19:38 -0400 Gargi Sharma <gs051095@gmail.com>
> wrote:
> 
> > This patch replaces the current bitmap implemetation for
> > Process ID allocation. Functions that are no longer required,
> > for example, free_pidmap(), alloc_pidmap(), etc. are removed.
> > The rest of the functions are modified to use the IDR API.
> > The change was made to make the PID allocation less complex by
> > replacing custom code with calls to generic API.
> 
> I still don't understand the locking.  spin_lock_irq(&pidmap_lock) in
> some places, rcu_read_lock() in others.
> 
> If the locking is indeed now correct, can we please get it fully
> documented?  A comment at the pid_namespace.idr definition site would
> suit.

Would you like me to send a follow-up patch to document the
locking?

Documenting the locking on all the existing code, plus the
new code, seems a little out of scope of an Outreachy
internship...

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 1/2] pid: Replace pid bitmap implementation with IDR API
  2017-10-12  0:54     ` Rik van Riel
@ 2017-10-12 22:58       ` Andrew Morton
  2017-10-13 14:58         ` Oleg Nesterov
  0 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2017-10-12 22:58 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Gargi Sharma, linux-kernel, julia.lawall, mingo, pasha.tatashin,
	ktkhai, oleg, ebiederm, hch, lkp, tony.luck

On Wed, 11 Oct 2017 20:54:32 -0400 Rik van Riel <riel@surriel.com> wrote:

> On Wed, 2017-10-11 at 15:37 -0700, Andrew Morton wrote:
> > On Wed, 11 Oct 2017 18:19:38 -0400 Gargi Sharma <gs051095@gmail.com>
> > wrote:
> > 
> > > This patch replaces the current bitmap implemetation for
> > > Process ID allocation. Functions that are no longer required,
> > > for example, free_pidmap(), alloc_pidmap(), etc. are removed.
> > > The rest of the functions are modified to use the IDR API.
> > > The change was made to make the PID allocation less complex by
> > > replacing custom code with calls to generic API.
> > 
> > I still don't understand the locking.  spin_lock_irq(&pidmap_lock) in
> > some places, rcu_read_lock() in others.
> > 
> > If the locking is indeed now correct, can we please get it fully
> > documented?  A comment at the pid_namespace.idr definition site would
> > suit.
> 
> Would you like me to send a follow-up patch to document the
> locking?

Sure.

> Documenting the locking on all the existing code, plus the
> new code, seems a little out of scope of an Outreachy
> internship...

I'm not referring to all the existing code!  Just this new
pid_namespace.idr's locking.  If it was protected by
spin_lock_irq(&pidmap_lock) everywhere then an explanation wouldn't be
needed.  But we have this oddball site where pidmap_lock isn't taken
but it uses rcu_read_lock() which is surprising to say the least. 
Readers could be forgiven for thinking that is a bug.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 1/2] pid: Replace pid bitmap implementation with IDR API
  2017-10-12 22:58       ` Andrew Morton
@ 2017-10-13 14:58         ` Oleg Nesterov
  0 siblings, 0 replies; 17+ messages in thread
From: Oleg Nesterov @ 2017-10-13 14:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Gargi Sharma, linux-kernel, julia.lawall, mingo,
	pasha.tatashin, ktkhai, ebiederm, hch, lkp, tony.luck

On 10/12, Andrew Morton wrote:
>
> On Wed, 11 Oct 2017 20:54:32 -0400 Rik van Riel <riel@surriel.com> wrote:
>
> > Documenting the locking on all the existing code, plus the
> > new code, seems a little out of scope of an Outreachy
> > internship...
>
> I'm not referring to all the existing code!  Just this new
> pid_namespace.idr's locking.  If it was protected by
> spin_lock_irq(&pidmap_lock) everywhere then an explanation wouldn't be
> needed.  But we have this oddball site where pidmap_lock isn't taken
> but it uses rcu_read_lock() which is surprising to say the least.
> Readers could be forgiven for thinking that is a bug.

Actually this all looks simple, or I missed something...

pid_namespace.idr is protected by pidmap_lock, it is always modified
with this lock held.

idr_find/idr_find_ext/etc can be called under rcu_read_lock() simply
because idr/radix_tree is rcu-safe and this is already documented in
idr.h, say, the comment above idr_find() says

 * This function can be called under rcu_read_lock(), given that the leaf
 * pointers lifetimes are correctly managed.

Oleg.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [v6,1/2] pid: Replace pid bitmap implementation with IDR API
  2017-10-11 22:19 ` [PATCH v6 1/2] pid: Replace pid " Gargi Sharma
  2017-10-11 22:28   ` Rik van Riel
  2017-10-11 22:37   ` Andrew Morton
@ 2017-10-19  7:30   ` Andrei Vagin
  2017-10-19 13:20     ` Gargi Sharma
  2017-10-19 16:18     ` Oleg Nesterov
  2 siblings, 2 replies; 17+ messages in thread
From: Andrei Vagin @ 2017-10-19  7:30 UTC (permalink / raw)
  To: Gargi Sharma
  Cc: linux-kernel, riel, julia.lawall, akpm, mingo, pasha.tatashin,
	ktkhai, oleg, ebiederm, hch, lkp, tony.luck

Hi Gargi,

This patch breaks CRIU, because it changes a meaning of ns_last_pid.

========================== Run zdtm/static/env00 in h ==========================
 DEP       env00.d
 CC        env00.o
 LINK      env00
Start test
./env00 --pidfile=env00.pid --outfile=env00.out --envname=ENV_00_TEST
Run criu dump
Run criu restore
=[log]=> dump/zdtm/static/env00/52/1/restore.log
------------------------ grep Error ------------------------
(00.000587) No mountpoints-6.img image
(00.000593) mnt: Reading mountpoint images (id 6 pid 52)
(00.000653) Forking task with 52 pid (flags 0x0)
(00.007568) PID: real 51 virt 52
(00.010363)     52: Error (criu/cr-restore.c:1787): Pid 51 do not match expected 52 (task 52)
(00.010474) Error (criu/cr-restore.c:2449): Restoring FAILED.
------------------------ ERROR OVER ------------------------

Before this patch, ns_last_pid contains a pid of a last process. With
this patch, it contains a pid of a "next" process.

In CRIU we use ns_last_pid to restore a process with a specified pid,
and now this logic is broken:

$ uname -a
Linux laptop 4.11.11-200.fc25.x86_64 #1 SMP Mon Jul 17 17:41:12 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
$ echo 19999 > /proc/sys/kernel/ns_last_pid && sh -c 'echo $$'
20000

$ uname -a
Linux fc24 4.14.0-rc5-next-20171018 #1 SMP Wed Oct 18 23:52:43 PDT 2017 x86_64 x86_64 x86_64 GNU/Linux
$ echo 19999 > /proc/sys/kernel/ns_last_pid && sh -c 'echo $$'
19999

Thanks,
Andrei

On Wed, Oct 11, 2017 at 06:19:38PM -0400, Gargi Sharma wrote:
> This patch replaces the current bitmap implemetation for
> Process ID allocation. Functions that are no longer required,
> for example, free_pidmap(), alloc_pidmap(), etc. are removed.
> The rest of the functions are modified to use the IDR API.
> The change was made to make the PID allocation less complex by
> replacing custom code with calls to generic API.
> 
> Signed-off-by: Gargi Sharma <gs051095@gmail.com>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> ---
>  arch/powerpc/platforms/cell/spufs/sched.c |   2 +-
>  fs/proc/loadavg.c                         |   2 +-
>  include/linux/pid_namespace.h             |  14 +--
>  init/main.c                               |   2 +-
>  kernel/pid.c                              | 201 ++++++------------------------
>  kernel/pid_namespace.c                    |  44 +++----
>  6 files changed, 57 insertions(+), 208 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/cell/spufs/sched.c b/arch/powerpc/platforms/cell/spufs/sched.c
> index 1fbb5da..e47761c 100644
> --- a/arch/powerpc/platforms/cell/spufs/sched.c
> +++ b/arch/powerpc/platforms/cell/spufs/sched.c
> @@ -1093,7 +1093,7 @@ static int show_spu_loadavg(struct seq_file *s, void *private)
>  		LOAD_INT(c), LOAD_FRAC(c),
>  		count_active_contexts(),
>  		atomic_read(&nr_spu_contexts),
> -		task_active_pid_ns(current)->last_pid);
> +		idr_get_cursor(&task_active_pid_ns(current)->idr));
>  	return 0;
>  }
>  
> diff --git a/fs/proc/loadavg.c b/fs/proc/loadavg.c
> index 983fce5..ba3d0e2 100644
> --- a/fs/proc/loadavg.c
> +++ b/fs/proc/loadavg.c
> @@ -23,7 +23,7 @@ static int loadavg_proc_show(struct seq_file *m, void *v)
>  		LOAD_INT(avnrun[1]), LOAD_FRAC(avnrun[1]),
>  		LOAD_INT(avnrun[2]), LOAD_FRAC(avnrun[2]),
>  		nr_running(), nr_threads,
> -		task_active_pid_ns(current)->last_pid);
> +		idr_get_cursor(&task_active_pid_ns(current)->idr));
>  	return 0;
>  }
>  
> diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
> index b09136f..f4db4a7 100644
> --- a/include/linux/pid_namespace.h
> +++ b/include/linux/pid_namespace.h
> @@ -9,15 +9,8 @@
>  #include <linux/nsproxy.h>
>  #include <linux/kref.h>
>  #include <linux/ns_common.h>
> +#include <linux/idr.h>
>  
> -struct pidmap {
> -       atomic_t nr_free;
> -       void *page;
> -};
> -
> -#define BITS_PER_PAGE		(PAGE_SIZE * 8)
> -#define BITS_PER_PAGE_MASK	(BITS_PER_PAGE-1)
> -#define PIDMAP_ENTRIES		((PID_MAX_LIMIT+BITS_PER_PAGE-1)/BITS_PER_PAGE)
>  
>  struct fs_pin;
>  
> @@ -29,9 +22,8 @@ enum { /* definitions for pid_namespace's hide_pid field */
>  
>  struct pid_namespace {
>  	struct kref kref;
> -	struct pidmap pidmap[PIDMAP_ENTRIES];
> +	struct idr idr;
>  	struct rcu_head rcu;
> -	int last_pid;
>  	unsigned int nr_hashed;
>  	struct task_struct *child_reaper;
>  	struct kmem_cache *pid_cachep;
> @@ -105,6 +97,6 @@ static inline int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd)
>  
>  extern struct pid_namespace *task_active_pid_ns(struct task_struct *tsk);
>  void pidhash_init(void);
> -void pidmap_init(void);
> +void pid_idr_init(void);
>  
>  #endif /* _LINUX_PID_NS_H */
> diff --git a/init/main.c b/init/main.c
> index 0ee9c686..9f4db20 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -667,7 +667,7 @@ asmlinkage __visible void __init start_kernel(void)
>  	if (late_time_init)
>  		late_time_init();
>  	calibrate_delay();
> -	pidmap_init();
> +	pid_idr_init();
>  	anon_vma_init();
>  	acpi_early_init();
>  #ifdef CONFIG_X86
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 020dedb..0ce5936 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -39,6 +39,7 @@
>  #include <linux/proc_ns.h>
>  #include <linux/proc_fs.h>
>  #include <linux/sched/task.h>
> +#include <linux/idr.h>
>  
>  #define pid_hashfn(nr, ns)	\
>  	hash_long((unsigned long)nr + (unsigned long)ns, pidhash_shift)
> @@ -53,14 +54,6 @@ int pid_max = PID_MAX_DEFAULT;
>  int pid_max_min = RESERVED_PIDS + 1;
>  int pid_max_max = PID_MAX_LIMIT;
>  
> -static inline int mk_pid(struct pid_namespace *pid_ns,
> -		struct pidmap *map, int off)
> -{
> -	return (map - pid_ns->pidmap)*BITS_PER_PAGE + off;
> -}
> -
> -#define find_next_offset(map, off)					\
> -		find_next_zero_bit((map)->page, BITS_PER_PAGE, off)
>  
>  /*
>   * PID-map pages start out as NULL, they get allocated upon
> @@ -70,10 +63,7 @@ static inline int mk_pid(struct pid_namespace *pid_ns,
>   */
>  struct pid_namespace init_pid_ns = {
>  	.kref = KREF_INIT(2),
> -	.pidmap = {
> -		[ 0 ... PIDMAP_ENTRIES-1] = { ATOMIC_INIT(BITS_PER_PAGE), NULL }
> -	},
> -	.last_pid = 0,
> +	.idr = IDR_INIT,
>  	.nr_hashed = PIDNS_HASH_ADDING,
>  	.level = 0,
>  	.child_reaper = &init_task,
> @@ -101,138 +91,6 @@ EXPORT_SYMBOL_GPL(init_pid_ns);
>  
>  static  __cacheline_aligned_in_smp DEFINE_SPINLOCK(pidmap_lock);
>  
> -static void free_pidmap(struct upid *upid)
> -{
> -	int nr = upid->nr;
> -	struct pidmap *map = upid->ns->pidmap + nr / BITS_PER_PAGE;
> -	int offset = nr & BITS_PER_PAGE_MASK;
> -
> -	clear_bit(offset, map->page);
> -	atomic_inc(&map->nr_free);
> -}
> -
> -/*
> - * If we started walking pids at 'base', is 'a' seen before 'b'?
> - */
> -static int pid_before(int base, int a, int b)
> -{
> -	/*
> -	 * This is the same as saying
> -	 *
> -	 * (a - base + MAXUINT) % MAXUINT < (b - base + MAXUINT) % MAXUINT
> -	 * and that mapping orders 'a' and 'b' with respect to 'base'.
> -	 */
> -	return (unsigned)(a - base) < (unsigned)(b - base);
> -}
> -
> -/*
> - * We might be racing with someone else trying to set pid_ns->last_pid
> - * at the pid allocation time (there's also a sysctl for this, but racing
> - * with this one is OK, see comment in kernel/pid_namespace.c about it).
> - * We want the winner to have the "later" value, because if the
> - * "earlier" value prevails, then a pid may get reused immediately.
> - *
> - * Since pids rollover, it is not sufficient to just pick the bigger
> - * value.  We have to consider where we started counting from.
> - *
> - * 'base' is the value of pid_ns->last_pid that we observed when
> - * we started looking for a pid.
> - *
> - * 'pid' is the pid that we eventually found.
> - */
> -static void set_last_pid(struct pid_namespace *pid_ns, int base, int pid)
> -{
> -	int prev;
> -	int last_write = base;
> -	do {
> -		prev = last_write;
> -		last_write = cmpxchg(&pid_ns->last_pid, prev, pid);
> -	} while ((prev != last_write) && (pid_before(base, last_write, pid)));
> -}
> -
> -static int alloc_pidmap(struct pid_namespace *pid_ns)
> -{
> -	int i, offset, max_scan, pid, last = pid_ns->last_pid;
> -	struct pidmap *map;
> -
> -	pid = last + 1;
> -	if (pid >= pid_max)
> -		pid = RESERVED_PIDS;
> -	offset = pid & BITS_PER_PAGE_MASK;
> -	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
> -	/*
> -	 * If last_pid points into the middle of the map->page we
> -	 * want to scan this bitmap block twice, the second time
> -	 * we start with offset == 0 (or RESERVED_PIDS).
> -	 */
> -	max_scan = DIV_ROUND_UP(pid_max, BITS_PER_PAGE) - !offset;
> -	for (i = 0; i <= max_scan; ++i) {
> -		if (unlikely(!map->page)) {
> -			void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
> -			/*
> -			 * Free the page if someone raced with us
> -			 * installing it:
> -			 */
> -			spin_lock_irq(&pidmap_lock);
> -			if (!map->page) {
> -				map->page = page;
> -				page = NULL;
> -			}
> -			spin_unlock_irq(&pidmap_lock);
> -			kfree(page);
> -			if (unlikely(!map->page))
> -				return -ENOMEM;
> -		}
> -		if (likely(atomic_read(&map->nr_free))) {
> -			for ( ; ; ) {
> -				if (!test_and_set_bit(offset, map->page)) {
> -					atomic_dec(&map->nr_free);
> -					set_last_pid(pid_ns, last, pid);
> -					return pid;
> -				}
> -				offset = find_next_offset(map, offset);
> -				if (offset >= BITS_PER_PAGE)
> -					break;
> -				pid = mk_pid(pid_ns, map, offset);
> -				if (pid >= pid_max)
> -					break;
> -			}
> -		}
> -		if (map < &pid_ns->pidmap[(pid_max-1)/BITS_PER_PAGE]) {
> -			++map;
> -			offset = 0;
> -		} else {
> -			map = &pid_ns->pidmap[0];
> -			offset = RESERVED_PIDS;
> -			if (unlikely(last == offset))
> -				break;
> -		}
> -		pid = mk_pid(pid_ns, map, offset);
> -	}
> -	return -EAGAIN;
> -}
> -
> -int next_pidmap(struct pid_namespace *pid_ns, unsigned int last)
> -{
> -	int offset;
> -	struct pidmap *map, *end;
> -
> -	if (last >= PID_MAX_LIMIT)
> -		return -1;
> -
> -	offset = (last + 1) & BITS_PER_PAGE_MASK;
> -	map = &pid_ns->pidmap[(last + 1)/BITS_PER_PAGE];
> -	end = &pid_ns->pidmap[PIDMAP_ENTRIES];
> -	for (; map < end; map++, offset = 0) {
> -		if (unlikely(!map->page))
> -			continue;
> -		offset = find_next_bit((map)->page, BITS_PER_PAGE, offset);
> -		if (offset < BITS_PER_PAGE)
> -			return mk_pid(pid_ns, map, offset);
> -	}
> -	return -1;
> -}
> -
>  void put_pid(struct pid *pid)
>  {
>  	struct pid_namespace *ns;
> @@ -266,7 +124,7 @@ void free_pid(struct pid *pid)
>  		struct upid *upid = pid->numbers + i;
>  		struct pid_namespace *ns = upid->ns;
>  		hlist_del_rcu(&upid->pid_chain);
> -		switch(--ns->nr_hashed) {
> +		switch (--ns->nr_hashed) {
>  		case 2:
>  		case 1:
>  			/* When all that is left in the pid namespace
> @@ -284,12 +142,11 @@ void free_pid(struct pid *pid)
>  			schedule_work(&ns->proc_work);
>  			break;
>  		}
> +
> +		idr_remove(&ns->idr, upid->nr);
>  	}
>  	spin_unlock_irqrestore(&pidmap_lock, flags);
>  
> -	for (i = 0; i <= pid->level; i++)
> -		free_pidmap(pid->numbers + i);
> -
>  	call_rcu(&pid->rcu, delayed_put_pid);
>  }
>  
> @@ -308,8 +165,29 @@ struct pid *alloc_pid(struct pid_namespace *ns)
>  
>  	tmp = ns;
>  	pid->level = ns->level;
> +
>  	for (i = ns->level; i >= 0; i--) {
> -		nr = alloc_pidmap(tmp);
> +		int pid_min = 1;
> +
> +		idr_preload(GFP_KERNEL);
> +		spin_lock_irq(&pidmap_lock);
> +
> +		/*
> +		 * init really needs pid 1, but after reaching the maximum
> +		 * wrap back to RESERVED_PIDS
> +		 */
> +		if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS)
> +			pid_min = RESERVED_PIDS;
> +
> +		/*
> +		 * Store a null pointer so find_pid_ns does not find
> +		 * a partially initialized PID (see below).
> +		 */
> +		nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
> +				      pid_max, GFP_ATOMIC);
> +		spin_unlock_irq(&pidmap_lock);
> +		idr_preload_end();
> +
>  		if (nr < 0) {
>  			retval = nr;
>  			goto out_free;
> @@ -339,6 +217,8 @@ struct pid *alloc_pid(struct pid_namespace *ns)
>  	for ( ; upid >= pid->numbers; --upid) {
>  		hlist_add_head_rcu(&upid->pid_chain,
>  				&pid_hash[pid_hashfn(upid->nr, upid->ns)]);
> +		/* Make the PID visible to find_pid_ns. */
> +		idr_replace(&upid->ns->idr, pid, upid->nr);
>  		upid->ns->nr_hashed++;
>  	}
>  	spin_unlock_irq(&pidmap_lock);
> @@ -350,8 +230,11 @@ struct pid *alloc_pid(struct pid_namespace *ns)
>  	put_pid_ns(ns);
>  
>  out_free:
> +	spin_lock_irq(&pidmap_lock);
>  	while (++i <= ns->level)
> -		free_pidmap(pid->numbers + i);
> +		idr_remove(&ns->idr, (pid->numbers + i)->nr);
> +
> +	spin_unlock_irq(&pidmap_lock);
>  
>  	kmem_cache_free(ns->pid_cachep, pid);
>  	return ERR_PTR(retval);
> @@ -553,16 +436,7 @@ EXPORT_SYMBOL_GPL(task_active_pid_ns);
>   */
>  struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
>  {
> -	struct pid *pid;
> -
> -	do {
> -		pid = find_pid_ns(nr, ns);
> -		if (pid)
> -			break;
> -		nr = next_pidmap(ns, nr);
> -	} while (nr > 0);
> -
> -	return pid;
> +	return idr_get_next(&ns->idr, &nr);
>  }
>  
>  /*
> @@ -578,7 +452,7 @@ void __init pidhash_init(void)
>  					   0, 4096);
>  }
>  
> -void __init pidmap_init(void)
> +void __init pid_idr_init(void)
>  {
>  	/* Verify no one has done anything silly: */
>  	BUILD_BUG_ON(PID_MAX_LIMIT >= PIDNS_HASH_ADDING);
> @@ -590,10 +464,7 @@ void __init pidmap_init(void)
>  				PIDS_PER_CPU_MIN * num_possible_cpus());
>  	pr_info("pid_max: default: %u minimum: %u\n", pid_max, pid_max_min);
>  
> -	init_pid_ns.pidmap[0].page = kzalloc(PAGE_SIZE, GFP_KERNEL);
> -	/* Reserve PID 0. We never call free_pidmap(0) */
> -	set_bit(0, init_pid_ns.pidmap[0].page);
> -	atomic_dec(&init_pid_ns.pidmap[0].nr_free);
> +	idr_init(&init_pid_ns.idr);
>  
>  	init_pid_ns.pid_cachep = KMEM_CACHE(pid,
>  			SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT);
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index 4918314..5009dbe 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -21,6 +21,7 @@
>  #include <linux/export.h>
>  #include <linux/sched/task.h>
>  #include <linux/sched/signal.h>
> +#include <linux/idr.h>
>  
>  struct pid_cache {
>  	int nr_ids;
> @@ -98,7 +99,6 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
>  	struct pid_namespace *ns;
>  	unsigned int level = parent_pid_ns->level + 1;
>  	struct ucounts *ucounts;
> -	int i;
>  	int err;
>  
>  	err = -EINVAL;
> @@ -117,17 +117,15 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
>  	if (ns == NULL)
>  		goto out_dec;
>  
> -	ns->pidmap[0].page = kzalloc(PAGE_SIZE, GFP_KERNEL);
> -	if (!ns->pidmap[0].page)
> -		goto out_free;
> +	idr_init(&ns->idr);
>  
>  	ns->pid_cachep = create_pid_cachep(level + 1);
>  	if (ns->pid_cachep == NULL)
> -		goto out_free_map;
> +		goto out_free_idr;
>  
>  	err = ns_alloc_inum(&ns->ns);
>  	if (err)
> -		goto out_free_map;
> +		goto out_free_idr;
>  	ns->ns.ops = &pidns_operations;
>  
>  	kref_init(&ns->kref);
> @@ -138,17 +136,10 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
>  	ns->nr_hashed = PIDNS_HASH_ADDING;
>  	INIT_WORK(&ns->proc_work, proc_cleanup_work);
>  
> -	set_bit(0, ns->pidmap[0].page);
> -	atomic_set(&ns->pidmap[0].nr_free, BITS_PER_PAGE - 1);
> -
> -	for (i = 1; i < PIDMAP_ENTRIES; i++)
> -		atomic_set(&ns->pidmap[i].nr_free, BITS_PER_PAGE);
> -
>  	return ns;
>  
> -out_free_map:
> -	kfree(ns->pidmap[0].page);
> -out_free:
> +out_free_idr:
> +	idr_destroy(&ns->idr);
>  	kmem_cache_free(pid_ns_cachep, ns);
>  out_dec:
>  	dec_pid_namespaces(ucounts);
> @@ -168,11 +159,9 @@ static void delayed_free_pidns(struct rcu_head *p)
>  
>  static void destroy_pid_namespace(struct pid_namespace *ns)
>  {
> -	int i;
> -
>  	ns_free_inum(&ns->ns);
> -	for (i = 0; i < PIDMAP_ENTRIES; i++)
> -		kfree(ns->pidmap[i].page);
> +
> +	idr_destroy(&ns->idr);
>  	call_rcu(&ns->rcu, delayed_free_pidns);
>  }
>  
> @@ -213,6 +202,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>  	int rc;
>  	struct task_struct *task, *me = current;
>  	int init_pids = thread_group_leader(me) ? 1 : 2;
> +	struct pid *pid;
>  
>  	/* Don't allow any more processes into the pid namespace */
>  	disable_pid_allocation(pid_ns);
> @@ -239,20 +229,16 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>  	 * 	  maintain a tasklist for each pid namespace.
>  	 *
>  	 */
> +	rcu_read_lock();
>  	read_lock(&tasklist_lock);
> -	nr = next_pidmap(pid_ns, 1);
> -	while (nr > 0) {
> -		rcu_read_lock();
> -
> -		task = pid_task(find_vpid(nr), PIDTYPE_PID);
> +	nr = 2;
> +	idr_for_each_entry_continue(&pid_ns->idr, pid, nr) {
> +		task = pid_task(pid, PIDTYPE_PID);
>  		if (task && !__fatal_signal_pending(task))
>  			send_sig_info(SIGKILL, SEND_SIG_FORCED, task);
> -
> -		rcu_read_unlock();
> -
> -		nr = next_pidmap(pid_ns, nr);
>  	}
>  	read_unlock(&tasklist_lock);
> +	rcu_read_unlock();
>  
>  	/*
>  	 * Reap the EXIT_ZOMBIE children we had before we ignored SIGCHLD.
> @@ -311,7 +297,7 @@ static int pid_ns_ctl_handler(struct ctl_table *table, int write,
>  	 * it should synchronize its usage with external means.
>  	 */
>  
> -	tmp.data = &pid_ns->last_pid;
> +	tmp.data = &pid_ns->idr.idr_next;
>  	return proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
>  }
>  

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [v6,1/2] pid: Replace pid bitmap implementation with IDR API
  2017-10-19  7:30   ` [v6,1/2] " Andrei Vagin
@ 2017-10-19 13:20     ` Gargi Sharma
  2017-10-19 16:18     ` Oleg Nesterov
  1 sibling, 0 replies; 17+ messages in thread
From: Gargi Sharma @ 2017-10-19 13:20 UTC (permalink / raw)
  To: Andrei Vagin
  Cc: linux-kernel, Rik van Riel, Julia Lawall, Andrew Morton, mingo,
	pasha.tatashin, ktkhai, Oleg Nesterov, Eric W. Biederman,
	Christoph Hellwig, lkp, tony.luck

On Thu, Oct 19, 2017 at 8:30 AM, Andrei Vagin <avagin@virtuozzo.com> wrote:
> Hi Gargi,
>
> This patch breaks CRIU, because it changes a meaning of ns_last_pid.
>
> ========================== Run zdtm/static/env00 in h ==========================
>  DEP       env00.d
>  CC        env00.o
>  LINK      env00
> Start test
> ./env00 --pidfile=env00.pid --outfile=env00.out --envname=ENV_00_TEST
> Run criu dump
> Run criu restore
> =[log]=> dump/zdtm/static/env00/52/1/restore.log
> ------------------------ grep Error ------------------------
> (00.000587) No mountpoints-6.img image
> (00.000593) mnt: Reading mountpoint images (id 6 pid 52)
> (00.000653) Forking task with 52 pid (flags 0x0)
> (00.007568) PID: real 51 virt 52
> (00.010363)     52: Error (criu/cr-restore.c:1787): Pid 51 do not match expected 52 (task 52)
> (00.010474) Error (criu/cr-restore.c:2449): Restoring FAILED.
> ------------------------ ERROR OVER ------------------------
>
> Before this patch, ns_last_pid contains a pid of a last process. With
> this patch, it contains a pid of a "next" process.
>
> In CRIU we use ns_last_pid to restore a process with a specified pid,
> and now this logic is broken:
Hi Andrei,

We looked at the uses of last_pid and did not find critical uses
inside the kernel, hence figured it'll be okay to change it to store
the next pid. But that was not true, I will fix this in the next
version.

Thanks,
Gargi

>
> $ uname -a
> Linux laptop 4.11.11-200.fc25.x86_64 #1 SMP Mon Jul 17 17:41:12 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
> $ echo 19999 > /proc/sys/kernel/ns_last_pid && sh -c 'echo $$'
> 20000
>
> $ uname -a
> Linux fc24 4.14.0-rc5-next-20171018 #1 SMP Wed Oct 18 23:52:43 PDT 2017 x86_64 x86_64 x86_64 GNU/Linux
> $ echo 19999 > /proc/sys/kernel/ns_last_pid && sh -c 'echo $$'
> 19999
>
> Thanks,
> Andrei
>
> On Wed, Oct 11, 2017 at 06:19:38PM -0400, Gargi Sharma wrote:
>> This patch replaces the current bitmap implemetation for
>> Process ID allocation. Functions that are no longer required,
>> for example, free_pidmap(), alloc_pidmap(), etc. are removed.
>> The rest of the functions are modified to use the IDR API.
>> The change was made to make the PID allocation less complex by
>> replacing custom code with calls to generic API.
>>
>> Signed-off-by: Gargi Sharma <gs051095@gmail.com>
>> Reviewed-by: Rik van Riel <riel@redhat.com>
>> ---
>>  arch/powerpc/platforms/cell/spufs/sched.c |   2 +-
>>  fs/proc/loadavg.c                         |   2 +-
>>  include/linux/pid_namespace.h             |  14 +--
>>  init/main.c                               |   2 +-
>>  kernel/pid.c                              | 201 ++++++------------------------
>>  kernel/pid_namespace.c                    |  44 +++----
>>  6 files changed, 57 insertions(+), 208 deletions(-)
>>
>> diff --git a/arch/powerpc/platforms/cell/spufs/sched.c b/arch/powerpc/platforms/cell/spufs/sched.c
>> index 1fbb5da..e47761c 100644
>> --- a/arch/powerpc/platforms/cell/spufs/sched.c
>> +++ b/arch/powerpc/platforms/cell/spufs/sched.c
>> @@ -1093,7 +1093,7 @@ static int show_spu_loadavg(struct seq_file *s, void *private)
>>               LOAD_INT(c), LOAD_FRAC(c),
>>               count_active_contexts(),
>>               atomic_read(&nr_spu_contexts),
>> -             task_active_pid_ns(current)->last_pid);
>> +             idr_get_cursor(&task_active_pid_ns(current)->idr));
>>       return 0;
>>  }
>>
>> diff --git a/fs/proc/loadavg.c b/fs/proc/loadavg.c
>> index 983fce5..ba3d0e2 100644
>> --- a/fs/proc/loadavg.c
>> +++ b/fs/proc/loadavg.c
>> @@ -23,7 +23,7 @@ static int loadavg_proc_show(struct seq_file *m, void *v)
>>               LOAD_INT(avnrun[1]), LOAD_FRAC(avnrun[1]),
>>               LOAD_INT(avnrun[2]), LOAD_FRAC(avnrun[2]),
>>               nr_running(), nr_threads,
>> -             task_active_pid_ns(current)->last_pid);
>> +             idr_get_cursor(&task_active_pid_ns(current)->idr));
>>       return 0;
>>  }
>>
>> diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
>> index b09136f..f4db4a7 100644
>> --- a/include/linux/pid_namespace.h
>> +++ b/include/linux/pid_namespace.h
>> @@ -9,15 +9,8 @@
>>  #include <linux/nsproxy.h>
>>  #include <linux/kref.h>
>>  #include <linux/ns_common.h>
>> +#include <linux/idr.h>
>>
>> -struct pidmap {
>> -       atomic_t nr_free;
>> -       void *page;
>> -};
>> -
>> -#define BITS_PER_PAGE                (PAGE_SIZE * 8)
>> -#define BITS_PER_PAGE_MASK   (BITS_PER_PAGE-1)
>> -#define PIDMAP_ENTRIES               ((PID_MAX_LIMIT+BITS_PER_PAGE-1)/BITS_PER_PAGE)
>>
>>  struct fs_pin;
>>
>> @@ -29,9 +22,8 @@ enum { /* definitions for pid_namespace's hide_pid field */
>>
>>  struct pid_namespace {
>>       struct kref kref;
>> -     struct pidmap pidmap[PIDMAP_ENTRIES];
>> +     struct idr idr;
>>       struct rcu_head rcu;
>> -     int last_pid;
>>       unsigned int nr_hashed;
>>       struct task_struct *child_reaper;
>>       struct kmem_cache *pid_cachep;
>> @@ -105,6 +97,6 @@ static inline int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd)
>>
>>  extern struct pid_namespace *task_active_pid_ns(struct task_struct *tsk);
>>  void pidhash_init(void);
>> -void pidmap_init(void);
>> +void pid_idr_init(void);
>>
>>  #endif /* _LINUX_PID_NS_H */
>> diff --git a/init/main.c b/init/main.c
>> index 0ee9c686..9f4db20 100644
>> --- a/init/main.c
>> +++ b/init/main.c
>> @@ -667,7 +667,7 @@ asmlinkage __visible void __init start_kernel(void)
>>       if (late_time_init)
>>               late_time_init();
>>       calibrate_delay();
>> -     pidmap_init();
>> +     pid_idr_init();
>>       anon_vma_init();
>>       acpi_early_init();
>>  #ifdef CONFIG_X86
>> diff --git a/kernel/pid.c b/kernel/pid.c
>> index 020dedb..0ce5936 100644
>> --- a/kernel/pid.c
>> +++ b/kernel/pid.c
>> @@ -39,6 +39,7 @@
>>  #include <linux/proc_ns.h>
>>  #include <linux/proc_fs.h>
>>  #include <linux/sched/task.h>
>> +#include <linux/idr.h>
>>
>>  #define pid_hashfn(nr, ns)   \
>>       hash_long((unsigned long)nr + (unsigned long)ns, pidhash_shift)
>> @@ -53,14 +54,6 @@ int pid_max = PID_MAX_DEFAULT;
>>  int pid_max_min = RESERVED_PIDS + 1;
>>  int pid_max_max = PID_MAX_LIMIT;
>>
>> -static inline int mk_pid(struct pid_namespace *pid_ns,
>> -             struct pidmap *map, int off)
>> -{
>> -     return (map - pid_ns->pidmap)*BITS_PER_PAGE + off;
>> -}
>> -
>> -#define find_next_offset(map, off)                                   \
>> -             find_next_zero_bit((map)->page, BITS_PER_PAGE, off)
>>
>>  /*
>>   * PID-map pages start out as NULL, they get allocated upon
>> @@ -70,10 +63,7 @@ static inline int mk_pid(struct pid_namespace *pid_ns,
>>   */
>>  struct pid_namespace init_pid_ns = {
>>       .kref = KREF_INIT(2),
>> -     .pidmap = {
>> -             [ 0 ... PIDMAP_ENTRIES-1] = { ATOMIC_INIT(BITS_PER_PAGE), NULL }
>> -     },
>> -     .last_pid = 0,
>> +     .idr = IDR_INIT,
>>       .nr_hashed = PIDNS_HASH_ADDING,
>>       .level = 0,
>>       .child_reaper = &init_task,
>> @@ -101,138 +91,6 @@ EXPORT_SYMBOL_GPL(init_pid_ns);
>>
>>  static  __cacheline_aligned_in_smp DEFINE_SPINLOCK(pidmap_lock);
>>
>> -static void free_pidmap(struct upid *upid)
>> -{
>> -     int nr = upid->nr;
>> -     struct pidmap *map = upid->ns->pidmap + nr / BITS_PER_PAGE;
>> -     int offset = nr & BITS_PER_PAGE_MASK;
>> -
>> -     clear_bit(offset, map->page);
>> -     atomic_inc(&map->nr_free);
>> -}
>> -
>> -/*
>> - * If we started walking pids at 'base', is 'a' seen before 'b'?
>> - */
>> -static int pid_before(int base, int a, int b)
>> -{
>> -     /*
>> -      * This is the same as saying
>> -      *
>> -      * (a - base + MAXUINT) % MAXUINT < (b - base + MAXUINT) % MAXUINT
>> -      * and that mapping orders 'a' and 'b' with respect to 'base'.
>> -      */
>> -     return (unsigned)(a - base) < (unsigned)(b - base);
>> -}
>> -
>> -/*
>> - * We might be racing with someone else trying to set pid_ns->last_pid
>> - * at the pid allocation time (there's also a sysctl for this, but racing
>> - * with this one is OK, see comment in kernel/pid_namespace.c about it).
>> - * We want the winner to have the "later" value, because if the
>> - * "earlier" value prevails, then a pid may get reused immediately.
>> - *
>> - * Since pids rollover, it is not sufficient to just pick the bigger
>> - * value.  We have to consider where we started counting from.
>> - *
>> - * 'base' is the value of pid_ns->last_pid that we observed when
>> - * we started looking for a pid.
>> - *
>> - * 'pid' is the pid that we eventually found.
>> - */
>> -static void set_last_pid(struct pid_namespace *pid_ns, int base, int pid)
>> -{
>> -     int prev;
>> -     int last_write = base;
>> -     do {
>> -             prev = last_write;
>> -             last_write = cmpxchg(&pid_ns->last_pid, prev, pid);
>> -     } while ((prev != last_write) && (pid_before(base, last_write, pid)));
>> -}
>> -
>> -static int alloc_pidmap(struct pid_namespace *pid_ns)
>> -{
>> -     int i, offset, max_scan, pid, last = pid_ns->last_pid;
>> -     struct pidmap *map;
>> -
>> -     pid = last + 1;
>> -     if (pid >= pid_max)
>> -             pid = RESERVED_PIDS;
>> -     offset = pid & BITS_PER_PAGE_MASK;
>> -     map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
>> -     /*
>> -      * If last_pid points into the middle of the map->page we
>> -      * want to scan this bitmap block twice, the second time
>> -      * we start with offset == 0 (or RESERVED_PIDS).
>> -      */
>> -     max_scan = DIV_ROUND_UP(pid_max, BITS_PER_PAGE) - !offset;
>> -     for (i = 0; i <= max_scan; ++i) {
>> -             if (unlikely(!map->page)) {
>> -                     void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
>> -                     /*
>> -                      * Free the page if someone raced with us
>> -                      * installing it:
>> -                      */
>> -                     spin_lock_irq(&pidmap_lock);
>> -                     if (!map->page) {
>> -                             map->page = page;
>> -                             page = NULL;
>> -                     }
>> -                     spin_unlock_irq(&pidmap_lock);
>> -                     kfree(page);
>> -                     if (unlikely(!map->page))
>> -                             return -ENOMEM;
>> -             }
>> -             if (likely(atomic_read(&map->nr_free))) {
>> -                     for ( ; ; ) {
>> -                             if (!test_and_set_bit(offset, map->page)) {
>> -                                     atomic_dec(&map->nr_free);
>> -                                     set_last_pid(pid_ns, last, pid);
>> -                                     return pid;
>> -                             }
>> -                             offset = find_next_offset(map, offset);
>> -                             if (offset >= BITS_PER_PAGE)
>> -                                     break;
>> -                             pid = mk_pid(pid_ns, map, offset);
>> -                             if (pid >= pid_max)
>> -                                     break;
>> -                     }
>> -             }
>> -             if (map < &pid_ns->pidmap[(pid_max-1)/BITS_PER_PAGE]) {
>> -                     ++map;
>> -                     offset = 0;
>> -             } else {
>> -                     map = &pid_ns->pidmap[0];
>> -                     offset = RESERVED_PIDS;
>> -                     if (unlikely(last == offset))
>> -                             break;
>> -             }
>> -             pid = mk_pid(pid_ns, map, offset);
>> -     }
>> -     return -EAGAIN;
>> -}
>> -
>> -int next_pidmap(struct pid_namespace *pid_ns, unsigned int last)
>> -{
>> -     int offset;
>> -     struct pidmap *map, *end;
>> -
>> -     if (last >= PID_MAX_LIMIT)
>> -             return -1;
>> -
>> -     offset = (last + 1) & BITS_PER_PAGE_MASK;
>> -     map = &pid_ns->pidmap[(last + 1)/BITS_PER_PAGE];
>> -     end = &pid_ns->pidmap[PIDMAP_ENTRIES];
>> -     for (; map < end; map++, offset = 0) {
>> -             if (unlikely(!map->page))
>> -                     continue;
>> -             offset = find_next_bit((map)->page, BITS_PER_PAGE, offset);
>> -             if (offset < BITS_PER_PAGE)
>> -                     return mk_pid(pid_ns, map, offset);
>> -     }
>> -     return -1;
>> -}
>> -
>>  void put_pid(struct pid *pid)
>>  {
>>       struct pid_namespace *ns;
>> @@ -266,7 +124,7 @@ void free_pid(struct pid *pid)
>>               struct upid *upid = pid->numbers + i;
>>               struct pid_namespace *ns = upid->ns;
>>               hlist_del_rcu(&upid->pid_chain);
>> -             switch(--ns->nr_hashed) {
>> +             switch (--ns->nr_hashed) {
>>               case 2:
>>               case 1:
>>                       /* When all that is left in the pid namespace
>> @@ -284,12 +142,11 @@ void free_pid(struct pid *pid)
>>                       schedule_work(&ns->proc_work);
>>                       break;
>>               }
>> +
>> +             idr_remove(&ns->idr, upid->nr);
>>       }
>>       spin_unlock_irqrestore(&pidmap_lock, flags);
>>
>> -     for (i = 0; i <= pid->level; i++)
>> -             free_pidmap(pid->numbers + i);
>> -
>>       call_rcu(&pid->rcu, delayed_put_pid);
>>  }
>>
>> @@ -308,8 +165,29 @@ struct pid *alloc_pid(struct pid_namespace *ns)
>>
>>       tmp = ns;
>>       pid->level = ns->level;
>> +
>>       for (i = ns->level; i >= 0; i--) {
>> -             nr = alloc_pidmap(tmp);
>> +             int pid_min = 1;
>> +
>> +             idr_preload(GFP_KERNEL);
>> +             spin_lock_irq(&pidmap_lock);
>> +
>> +             /*
>> +              * init really needs pid 1, but after reaching the maximum
>> +              * wrap back to RESERVED_PIDS
>> +              */
>> +             if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS)
>> +                     pid_min = RESERVED_PIDS;
>> +
>> +             /*
>> +              * Store a null pointer so find_pid_ns does not find
>> +              * a partially initialized PID (see below).
>> +              */
>> +             nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
>> +                                   pid_max, GFP_ATOMIC);
>> +             spin_unlock_irq(&pidmap_lock);
>> +             idr_preload_end();
>> +
>>               if (nr < 0) {
>>                       retval = nr;
>>                       goto out_free;
>> @@ -339,6 +217,8 @@ struct pid *alloc_pid(struct pid_namespace *ns)
>>       for ( ; upid >= pid->numbers; --upid) {
>>               hlist_add_head_rcu(&upid->pid_chain,
>>                               &pid_hash[pid_hashfn(upid->nr, upid->ns)]);
>> +             /* Make the PID visible to find_pid_ns. */
>> +             idr_replace(&upid->ns->idr, pid, upid->nr);
>>               upid->ns->nr_hashed++;
>>       }
>>       spin_unlock_irq(&pidmap_lock);
>> @@ -350,8 +230,11 @@ struct pid *alloc_pid(struct pid_namespace *ns)
>>       put_pid_ns(ns);
>>
>>  out_free:
>> +     spin_lock_irq(&pidmap_lock);
>>       while (++i <= ns->level)
>> -             free_pidmap(pid->numbers + i);
>> +             idr_remove(&ns->idr, (pid->numbers + i)->nr);
>> +
>> +     spin_unlock_irq(&pidmap_lock);
>>
>>       kmem_cache_free(ns->pid_cachep, pid);
>>       return ERR_PTR(retval);
>> @@ -553,16 +436,7 @@ EXPORT_SYMBOL_GPL(task_active_pid_ns);
>>   */
>>  struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
>>  {
>> -     struct pid *pid;
>> -
>> -     do {
>> -             pid = find_pid_ns(nr, ns);
>> -             if (pid)
>> -                     break;
>> -             nr = next_pidmap(ns, nr);
>> -     } while (nr > 0);
>> -
>> -     return pid;
>> +     return idr_get_next(&ns->idr, &nr);
>>  }
>>
>>  /*
>> @@ -578,7 +452,7 @@ void __init pidhash_init(void)
>>                                          0, 4096);
>>  }
>>
>> -void __init pidmap_init(void)
>> +void __init pid_idr_init(void)
>>  {
>>       /* Verify no one has done anything silly: */
>>       BUILD_BUG_ON(PID_MAX_LIMIT >= PIDNS_HASH_ADDING);
>> @@ -590,10 +464,7 @@ void __init pidmap_init(void)
>>                               PIDS_PER_CPU_MIN * num_possible_cpus());
>>       pr_info("pid_max: default: %u minimum: %u\n", pid_max, pid_max_min);
>>
>> -     init_pid_ns.pidmap[0].page = kzalloc(PAGE_SIZE, GFP_KERNEL);
>> -     /* Reserve PID 0. We never call free_pidmap(0) */
>> -     set_bit(0, init_pid_ns.pidmap[0].page);
>> -     atomic_dec(&init_pid_ns.pidmap[0].nr_free);
>> +     idr_init(&init_pid_ns.idr);
>>
>>       init_pid_ns.pid_cachep = KMEM_CACHE(pid,
>>                       SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT);
>> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
>> index 4918314..5009dbe 100644
>> --- a/kernel/pid_namespace.c
>> +++ b/kernel/pid_namespace.c
>> @@ -21,6 +21,7 @@
>>  #include <linux/export.h>
>>  #include <linux/sched/task.h>
>>  #include <linux/sched/signal.h>
>> +#include <linux/idr.h>
>>
>>  struct pid_cache {
>>       int nr_ids;
>> @@ -98,7 +99,6 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
>>       struct pid_namespace *ns;
>>       unsigned int level = parent_pid_ns->level + 1;
>>       struct ucounts *ucounts;
>> -     int i;
>>       int err;
>>
>>       err = -EINVAL;
>> @@ -117,17 +117,15 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
>>       if (ns == NULL)
>>               goto out_dec;
>>
>> -     ns->pidmap[0].page = kzalloc(PAGE_SIZE, GFP_KERNEL);
>> -     if (!ns->pidmap[0].page)
>> -             goto out_free;
>> +     idr_init(&ns->idr);
>>
>>       ns->pid_cachep = create_pid_cachep(level + 1);
>>       if (ns->pid_cachep == NULL)
>> -             goto out_free_map;
>> +             goto out_free_idr;
>>
>>       err = ns_alloc_inum(&ns->ns);
>>       if (err)
>> -             goto out_free_map;
>> +             goto out_free_idr;
>>       ns->ns.ops = &pidns_operations;
>>
>>       kref_init(&ns->kref);
>> @@ -138,17 +136,10 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
>>       ns->nr_hashed = PIDNS_HASH_ADDING;
>>       INIT_WORK(&ns->proc_work, proc_cleanup_work);
>>
>> -     set_bit(0, ns->pidmap[0].page);
>> -     atomic_set(&ns->pidmap[0].nr_free, BITS_PER_PAGE - 1);
>> -
>> -     for (i = 1; i < PIDMAP_ENTRIES; i++)
>> -             atomic_set(&ns->pidmap[i].nr_free, BITS_PER_PAGE);
>> -
>>       return ns;
>>
>> -out_free_map:
>> -     kfree(ns->pidmap[0].page);
>> -out_free:
>> +out_free_idr:
>> +     idr_destroy(&ns->idr);
>>       kmem_cache_free(pid_ns_cachep, ns);
>>  out_dec:
>>       dec_pid_namespaces(ucounts);
>> @@ -168,11 +159,9 @@ static void delayed_free_pidns(struct rcu_head *p)
>>
>>  static void destroy_pid_namespace(struct pid_namespace *ns)
>>  {
>> -     int i;
>> -
>>       ns_free_inum(&ns->ns);
>> -     for (i = 0; i < PIDMAP_ENTRIES; i++)
>> -             kfree(ns->pidmap[i].page);
>> +
>> +     idr_destroy(&ns->idr);
>>       call_rcu(&ns->rcu, delayed_free_pidns);
>>  }
>>
>> @@ -213,6 +202,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>>       int rc;
>>       struct task_struct *task, *me = current;
>>       int init_pids = thread_group_leader(me) ? 1 : 2;
>> +     struct pid *pid;
>>
>>       /* Don't allow any more processes into the pid namespace */
>>       disable_pid_allocation(pid_ns);
>> @@ -239,20 +229,16 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>>        *        maintain a tasklist for each pid namespace.
>>        *
>>        */
>> +     rcu_read_lock();
>>       read_lock(&tasklist_lock);
>> -     nr = next_pidmap(pid_ns, 1);
>> -     while (nr > 0) {
>> -             rcu_read_lock();
>> -
>> -             task = pid_task(find_vpid(nr), PIDTYPE_PID);
>> +     nr = 2;
>> +     idr_for_each_entry_continue(&pid_ns->idr, pid, nr) {
>> +             task = pid_task(pid, PIDTYPE_PID);
>>               if (task && !__fatal_signal_pending(task))
>>                       send_sig_info(SIGKILL, SEND_SIG_FORCED, task);
>> -
>> -             rcu_read_unlock();
>> -
>> -             nr = next_pidmap(pid_ns, nr);
>>       }
>>       read_unlock(&tasklist_lock);
>> +     rcu_read_unlock();
>>
>>       /*
>>        * Reap the EXIT_ZOMBIE children we had before we ignored SIGCHLD.
>> @@ -311,7 +297,7 @@ static int pid_ns_ctl_handler(struct ctl_table *table, int write,
>>        * it should synchronize its usage with external means.
>>        */
>>
>> -     tmp.data = &pid_ns->last_pid;
>> +     tmp.data = &pid_ns->idr.idr_next;
>>       return proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
>>  }
>>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [v6,1/2] pid: Replace pid bitmap implementation with IDR API
  2017-10-19  7:30   ` [v6,1/2] " Andrei Vagin
  2017-10-19 13:20     ` Gargi Sharma
@ 2017-10-19 16:18     ` Oleg Nesterov
  2017-10-20 16:06       ` Gargi Sharma
  1 sibling, 1 reply; 17+ messages in thread
From: Oleg Nesterov @ 2017-10-19 16:18 UTC (permalink / raw)
  To: Andrei Vagin
  Cc: Gargi Sharma, linux-kernel, riel, julia.lawall, akpm, mingo,
	pasha.tatashin, ktkhai, ebiederm, hch, lkp, tony.luck

On 10/19, Andrei Vagin wrote:
>
> Hi Gargi,
>
> This patch breaks CRIU, because it changes a meaning of ns_last_pid.

...

> > @@ -311,7 +297,7 @@ static int pid_ns_ctl_handler(struct ctl_table *table, int write,
> >  	 * it should synchronize its usage with external means.
> >  	 */
> >  
> > -	tmp.data = &pid_ns->last_pid;
> > +	tmp.data = &pid_ns->idr.idr_next;

Ah, yes, off-by-one error...

Gargi, I don't think you need to make another version, I'd suggest you to send
the trivial fix to Andrew, afaics you just need to replace these 2 lines with

	unsigned int last;
	int err;

	tmp.data = &last;
	err = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
	if (!err)
		idr_set_cursor(&pid_ns->idr, last + 1);
	return err;

Oleg.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [v6,1/2] pid: Replace pid bitmap implementation with IDR API
  2017-10-19 16:18     ` Oleg Nesterov
@ 2017-10-20 16:06       ` Gargi Sharma
  2017-10-20 17:21         ` Andrei Vagin
  0 siblings, 1 reply; 17+ messages in thread
From: Gargi Sharma @ 2017-10-20 16:06 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrei Vagin, linux-kernel, Rik van Riel, Julia Lawall,
	Andrew Morton, mingo, pasha.tatashin, ktkhai, Eric W. Biederman,
	Christoph Hellwig, lkp, tony.luck

On Thu, Oct 19, 2017 at 5:18 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 10/19, Andrei Vagin wrote:
>>
>> Hi Gargi,
>>
>> This patch breaks CRIU, because it changes a meaning of ns_last_pid.
>
> ...
>
>> > @@ -311,7 +297,7 @@ static int pid_ns_ctl_handler(struct ctl_table *table, int write,
>> >      * it should synchronize its usage with external means.
>> >      */
>> >
>> > -   tmp.data = &pid_ns->last_pid;
>> > +   tmp.data = &pid_ns->idr.idr_next;
>
> Ah, yes, off-by-one error...
>
> Gargi, I don't think you need to make another version, I'd suggest you to send
> the trivial fix to Andrew, afaics you just need to replace these 2 lines with
>
>         unsigned int last;
>         int err;
>
>         tmp.data = &last;
>         err = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
>         if (!err)
>                 idr_set_cursor(&pid_ns->idr, last + 1);
>         return err;
I'm not sure entirely understand how this takes care of rolling over of PIDs?
Can we ignore that? If yes, won't the tests for CRIU still break?

Thanks,
Gargi
>
> Oleg.
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [v6,1/2] pid: Replace pid bitmap implementation with IDR API
  2017-10-20 16:06       ` Gargi Sharma
@ 2017-10-20 17:21         ` Andrei Vagin
  2017-10-22  8:05           ` Gargi Sharma
  0 siblings, 1 reply; 17+ messages in thread
From: Andrei Vagin @ 2017-10-20 17:21 UTC (permalink / raw)
  To: Gargi Sharma
  Cc: Oleg Nesterov, linux-kernel, Rik van Riel, Julia Lawall,
	Andrew Morton, mingo, pasha.tatashin, ktkhai, Eric W. Biederman,
	Christoph Hellwig, lkp, tony.luck

On Fri, Oct 20, 2017 at 05:06:47PM +0100, Gargi Sharma wrote:
> On Thu, Oct 19, 2017 at 5:18 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> > On 10/19, Andrei Vagin wrote:
> >>
> >> Hi Gargi,
> >>
> >> This patch breaks CRIU, because it changes a meaning of ns_last_pid.
> >
> > ...
> >
> >> > @@ -311,7 +297,7 @@ static int pid_ns_ctl_handler(struct ctl_table *table, int write,
> >> >      * it should synchronize its usage with external means.
> >> >      */
> >> >
> >> > -   tmp.data = &pid_ns->last_pid;
> >> > +   tmp.data = &pid_ns->idr.idr_next;
> >
> > Ah, yes, off-by-one error...
> >
> > Gargi, I don't think you need to make another version, I'd suggest you to send
> > the trivial fix to Andrew, afaics you just need to replace these 2 lines with
> >
> >         unsigned int last;
> >         int err;
> >
> >         tmp.data = &last;
> >         err = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
> >         if (!err)
> >                 idr_set_cursor(&pid_ns->idr, last + 1);
> >         return err;
> I'm not sure entirely understand how this takes care of rolling over of PIDs?
> Can we ignore that? If yes, won't the tests for CRIU still break?

Gargi, I don't understand what you mean. Could you elaborate? Do you
mean a case when idr_next is bigger than pid_max? I think this logic
remains the same what we had before switching to idr.

CRIU tests works with a following patch. It is slightly modified version
of Oleg's patch.

diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index fea2c24..1c791b3 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -287,6 +287,7 @@ static int pid_ns_ctl_handler(struct ctl_table
*table, int write,
 {
        struct pid_namespace *pid_ns = task_active_pid_ns(current);
        struct ctl_table tmp = *table;
+       int ret;
 
        if (write && !ns_capable(pid_ns->user_ns, CAP_SYS_ADMIN))
                return -EPERM;
@@ -298,7 +299,12 @@ static int pid_ns_ctl_handler(struct ctl_table
*table, int write,
         */
 
        tmp.data = &pid_ns->idr.idr_next;
-       return proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
+       ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
+       if (ret < 0)
+               return ret;
+
+       idr_set_cursor(&pid_ns->idr, pid_ns->idr.idr_next + 1);
+       return 0;
 }
 
 extern int pid_max;


> 
> Thanks,
> Gargi
> >
> > Oleg.
> >

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [v6,1/2] pid: Replace pid bitmap implementation with IDR API
  2017-10-20 17:21         ` Andrei Vagin
@ 2017-10-22  8:05           ` Gargi Sharma
  2017-10-23 15:31             ` Oleg Nesterov
  0 siblings, 1 reply; 17+ messages in thread
From: Gargi Sharma @ 2017-10-22  8:05 UTC (permalink / raw)
  To: Andrei Vagin
  Cc: Oleg Nesterov, linux-kernel, Rik van Riel, Julia Lawall,
	Andrew Morton, mingo, pasha.tatashin, ktkhai, Eric W. Biederman,
	Christoph Hellwig, lkp, tony.luck

On Fri, Oct 20, 2017 at 6:21 PM, Andrei Vagin <avagin@virtuozzo.com> wrote:
> On Fri, Oct 20, 2017 at 05:06:47PM +0100, Gargi Sharma wrote:
>> On Thu, Oct 19, 2017 at 5:18 PM, Oleg Nesterov <oleg@redhat.com> wrote:
>> > On 10/19, Andrei Vagin wrote:
>> >>
>> >> Hi Gargi,
>> >>
>> >> This patch breaks CRIU, because it changes a meaning of ns_last_pid.
>> >
>> > ...
>> >
>> >> > @@ -311,7 +297,7 @@ static int pid_ns_ctl_handler(struct ctl_table *table, int write,
>> >> >      * it should synchronize its usage with external means.
>> >> >      */
>> >> >
>> >> > -   tmp.data = &pid_ns->last_pid;
>> >> > +   tmp.data = &pid_ns->idr.idr_next;
>> >
>> > Ah, yes, off-by-one error...
>> >
>> > Gargi, I don't think you need to make another version, I'd suggest you to send
>> > the trivial fix to Andrew, afaics you just need to replace these 2 lines with
>> >
>> >         unsigned int last;
>> >         int err;
>> >
>> >         tmp.data = &last;
>> >         err = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
>> >         if (!err)
>> >                 idr_set_cursor(&pid_ns->idr, last + 1);
>> >         return err;
>> I'm not sure entirely understand how this takes care of rolling over of PIDs?
>> Can we ignore that? If yes, won't the tests for CRIU still break?
>
> Gargi, I don't understand what you mean. Could you elaborate? Do you
> mean a case when idr_next is bigger than pid_max? I think this logic
> remains the same what we had before switching to idr.

When the PIDs are allocated, if the allocation exceeds pid_max wraps
around and starts allocating PIDs starting from pid_min.
I'm not sure if idr_set_cursor(&pid_ns->idr, pid_ns->idr.idr_next + 1)
takes care of that.

Gargi
>
> CRIU tests works with a following patch. It is slightly modified version
> of Oleg's patch.
>
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index fea2c24..1c791b3 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -287,6 +287,7 @@ static int pid_ns_ctl_handler(struct ctl_table
> *table, int write,
>  {
>         struct pid_namespace *pid_ns = task_active_pid_ns(current);
>         struct ctl_table tmp = *table;
> +       int ret;
>
>         if (write && !ns_capable(pid_ns->user_ns, CAP_SYS_ADMIN))
>                 return -EPERM;
> @@ -298,7 +299,12 @@ static int pid_ns_ctl_handler(struct ctl_table
> *table, int write,
>          */
>
>         tmp.data = &pid_ns->idr.idr_next;
> -       return proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
> +       ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
> +       if (ret < 0)
> +               return ret;
> +
> +       idr_set_cursor(&pid_ns->idr, pid_ns->idr.idr_next + 1);
> +       return 0;
>  }
>
>  extern int pid_max;
>
>
>>
>> Thanks,
>> Gargi
>> >
>> > Oleg.
>> >

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [v6,1/2] pid: Replace pid bitmap implementation with IDR API
  2017-10-22  8:05           ` Gargi Sharma
@ 2017-10-23 15:31             ` Oleg Nesterov
  0 siblings, 0 replies; 17+ messages in thread
From: Oleg Nesterov @ 2017-10-23 15:31 UTC (permalink / raw)
  To: Gargi Sharma
  Cc: Andrei Vagin, linux-kernel, Rik van Riel, Julia Lawall,
	Andrew Morton, mingo, pasha.tatashin, ktkhai, Eric W. Biederman,
	Christoph Hellwig, lkp, tony.luck

On 10/22, Gargi Sharma wrote:
>
> On Fri, Oct 20, 2017 at 6:21 PM, Andrei Vagin <avagin@virtuozzo.com> wrote:
> > On Fri, Oct 20, 2017 at 05:06:47PM +0100, Gargi Sharma wrote:
> >> On Thu, Oct 19, 2017 at 5:18 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> >> >
> >> >         unsigned int last;
> >> >         int err;
> >> >
> >> >         tmp.data = &last;
> >> >         err = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
> >> >         if (!err)
> >> >                 idr_set_cursor(&pid_ns->idr, last + 1);
> >> >         return err;
> >> I'm not sure entirely understand how this takes care of rolling over of PIDs?
> >> Can we ignore that? If yes, won't the tests for CRIU still break?
> >
> > Gargi, I don't understand what you mean. Could you elaborate? Do you
> > mean a case when idr_next is bigger than pid_max? I think this logic
> > remains the same what we had before switching to idr.
>
> When the PIDs are allocated, if the allocation exceeds pid_max wraps
> around and starts allocating PIDs starting from pid_min.

You misunderstood the problem introduced by your patch...

We do not care about pid_max overlap. The problem is that criu writes
to /proc/sys/kernel/ns_last_pid to control the pid number allocated by
the next fork().

So if you do "echo 100 > /proc/sys/kernel/ns_last_pid" the next fork()
should create the process with tid=101 (of course, if 101 is free).

After your patch the new task will have tid=100. See?

> > CRIU tests works with a following patch. It is slightly modified version
> > of Oleg's patch.

Andrei, could you write the changelog and send the fix to akpm? Feel free
to add my ack or sob.

Oleg.

> > diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> > index fea2c24..1c791b3 100644
> > --- a/kernel/pid_namespace.c
> > +++ b/kernel/pid_namespace.c
> > @@ -287,6 +287,7 @@ static int pid_ns_ctl_handler(struct ctl_table
> > *table, int write,
> >  {
> >         struct pid_namespace *pid_ns = task_active_pid_ns(current);
> >         struct ctl_table tmp = *table;
> > +       int ret;
> >
> >         if (write && !ns_capable(pid_ns->user_ns, CAP_SYS_ADMIN))
> >                 return -EPERM;
> > @@ -298,7 +299,12 @@ static int pid_ns_ctl_handler(struct ctl_table
> > *table, int write,
> >          */
> >
> >         tmp.data = &pid_ns->idr.idr_next;
> > -       return proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
> > +       ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
> > +       if (ret < 0)
> > +               return ret;
> > +
> > +       idr_set_cursor(&pid_ns->idr, pid_ns->idr.idr_next + 1);
> > +       return 0;
> >  }
> >
> >  extern int pid_max;
> >
> >
> >>
> >> Thanks,
> >> Gargi
> >> >
> >> > Oleg.
> >> >

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2017-10-23 15:31 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-11 22:19 [PATCH v6 0/2] Replacing PID bitmap implementation with IDR API Gargi Sharma
2017-10-11 22:19 ` [PATCH v6 1/2] pid: Replace pid " Gargi Sharma
2017-10-11 22:28   ` Rik van Riel
2017-10-11 22:37   ` Andrew Morton
2017-10-12  0:54     ` Rik van Riel
2017-10-12 22:58       ` Andrew Morton
2017-10-13 14:58         ` Oleg Nesterov
2017-10-19  7:30   ` [v6,1/2] " Andrei Vagin
2017-10-19 13:20     ` Gargi Sharma
2017-10-19 16:18     ` Oleg Nesterov
2017-10-20 16:06       ` Gargi Sharma
2017-10-20 17:21         ` Andrei Vagin
2017-10-22  8:05           ` Gargi Sharma
2017-10-23 15:31             ` Oleg Nesterov
2017-10-11 22:19 ` [PATCH v6 2/2] pid: Remove pidhash Gargi Sharma
2017-10-11 22:28   ` Rik van Riel
2017-10-11 22:47   ` Luck, Tony

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.