* [PATCHES][RFC][CFT] scalability fixes for shitloads of mounts
@ 2014-03-05  3:47 Al Viro
  2014-03-05  3:49 ` [PATCH 1/5][RFC][CFT] percpu fixes, part 1 Al Viro
                   ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: Al Viro @ 2014-03-05  3:47 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Linus Torvalds, linux-kernel, Stephen Tweedie, Jeremy Eder

	Looks like people have finally started to use setups with serious
numbers of mounts.  And that has uncovered some downright embarrassing
bottlenecks.  Below is more or less the catch from the last couple of weeks.
Kudos to Stephen and Jeremy for testing and profiling that stuff.

	1) percpu_alloc()/percpu_free() doesn't scale well, to put it
_very_ mildly.  The time complexity of N small allocations is O(N^2).
Yes, really.  Freeing is the same.  Fortunately, there is some
low-hanging fruit in there and dealing with it helps quite a bit.
First of all, the information about the split-up of a chunk into areas
is kept in the following form: an array of +-length entries, with the
sign used to distinguish free from in-use.  It's much better to use
offset | (0 or 1), with the lowest bit distinguishing free from in-use.
For starters, that allows freeing to use a binary search in that array.
Another thing is that we can easily maintain a "the first N areas are
all in use" hint, so that sequential allocation does *not* have to add
up the lengths of already allocated areas just to find the candidate
offset.  It could be improved further, but these two already kick the
damn thing off the top of the profiles on "lots of mount/umount" loads.
Note that all this work is done with IRQs disabled, so in mainline
excessive time spent there easily leads to softlockups.  Another
beneficiary is aio - io_setup(2) does a couple of 4-byte allocations
(alloc_vfsmnt() does one 8-byte percpu_alloc()), so a load with a bunch
of aio contexts created in parallel on the same box suffers from the
same bottleneck.  That's the first couple of patches (1 and 2 of 5);
a simplified sketch of the new ->map[] encoding follows below.
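	To make the encoding concrete, here is a minimal user-space
sketch of the <offset | in-use bit> layout with the sentry element and
the binary-search lookup used on the freeing path.  It is an
illustration only, not the kernel code; the array below encodes areas
of 4, 8, 4, 4, 12 and 100 bytes, matching the worked example in the
reworked commit message later in this thread.

#include <assert.h>
#include <stdio.h>

/*
 * Toy model of the new chunk->map[] layout: each element is the area's
 * start offset, with the LSB set when the area is in use.  A sentry
 * <total size | 1> terminates the array, so map[i + 1] - map[i] (low
 * bits masked off) is always the size of area i.
 */
static int map[] = { 0, 4 | 1, 12 | 1, 16, 20 | 1, 32, 132 | 1 };
static int map_used = 6;		/* entries before the sentry */

/* find the index of the in-use area starting at offset 'freeme' */
static int find_area(int freeme)
{
	int lo = 0, hi = map_used;

	freeme |= 1;			/* searching for <offset, in use> */
	while (lo < hi) {
		int mid = (lo + hi) / 2;

		if (map[mid] < freeme)
			lo = mid + 1;
		else if (map[mid] > freeme)
			hi = mid;
		else
			return mid;
	}
	return -1;			/* no such allocation */
}

int main(void)
{
	int i = find_area(20);		/* area allocated at offset 20 */

	assert(i == 4);
	printf("offset 20: index %d, size %d, %s\n", i,
	       (map[i + 1] & ~1) - (map[i] & ~1),
	       (map[i] & 1) ? "in use" : "free");
	return 0;
}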
	2) propagate_mnt() tried to avoid extra complexity in the case
when some nodes in the propagation graph are rejected because they do
not contain the counterpart of our mountpoint dentry.  It's much simpler
to create a copy for each node, set propagation between the copies
parallel to that between what they are mounted on and then kill the ones
that shouldn't survive.  IIRC, we recognized the problem, but decided to
go for the simpler logic unless and until somebody runs into situations
where it creates a lot of copies only to tear almost all of them down.
Well, it turns out that it's *very* easy to trigger the worst-case
behaviour - have N mount --bind calls create O(N^2) vfsmounts and tear
down all but O(N) of them.  Just
	mkdir /tmp/blah
	mount --bind /tmp/blah /tmp/blah
	mount --make-shared /tmp/blah
	cd /tmp/blah
	touch foo
	for i in `seq 4000`; do touch $i; mount --bind foo $i; done
will step into it - that loop adds 4000 vfsmounts, but allocates about
two million more of them, never seen by userland and freed in the same
mount(2) call that allocated them.
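	A quick way to see that only the O(N) survivors are ever visible
to userland (just an observation about the reproducer above, not part
of the series):

	# one line per vfsmount in this namespace; ends up around 4000
	# plus the preexisting mounts, while the transient copies are
	# allocated and freed inside each mount(2) call
	grep -c . /proc/self/mountinfo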
	Turns out, it *is* possible to avoid the excessive allocations,
with a slightly different order of creating the copies and a bit of a
trick to find which copy will end up the master of a given one.  See the
commit message for more details.  That's patch 3 of 5.
	3) reading /proc/mounts end-to-end (e.g. cat /proc/mounts) turns
out to be (embarrassingly) quadratic in the size of said /proc/mounts.
The reason is that each read() needs to find the next vfsmount to
return, and that's done by walking the list.  From the beginning.  Even
if no changes have happened, which we *can* detect - the namespace has
an event counter.  The solution is trivial - cache that counter + the
last vfsmount we'd been looking at + the index it had been at.  If the
event counter matches, m_start() can skip the search from the beginning
of the list.  In fact, all we really need to handle is a couple of
cases - the last index we looked at and that index plus one.  Taking
care of that reduces the time needed to read the damn thing from
O(mounts^2) to O(mounts).  The same goes for /proc/self/mountinfo, and
e.g. umount(8) does read it end-to-end.  So does Lennart's horror.
Without that patch (and with the previous stupidities taken care of) we
had seq_list_start() fairly high in the profiles; with it it's pretty
much gone.  That's patch 4 of 5; a sketch of the cached-cursor idea
follows below.
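	The caching in patch 4, shown as a simplified stand-alone model
rather than the actual fs/namespace.c code (all names here are
illustrative):

#include <stdio.h>

struct node { struct node *next; int id; };

struct ns_model {
	unsigned long long event;	/* bumped on every mount/umount */
	struct node *head;
};

struct cursor {			/* per-open state, as in struct proc_mounts */
	unsigned long long cached_event;
	struct node *cached_node;
	long cached_index;
};

/*
 * Model of m_start(): if nothing changed since the last read() (event
 * counters match) and the requested position is the cached one or the
 * one right after it, resume from the cached node instead of walking
 * the list from its head.
 */
static struct node *m_start_model(struct cursor *c, struct ns_model *ns,
				  long pos)
{
	struct node *n;
	long i;

	if (c->cached_event == ns->event) {
		if (pos == c->cached_index)
			return c->cached_node;
		if (pos == c->cached_index + 1 && c->cached_node) {
			c->cached_node = c->cached_node->next;
			c->cached_index = pos;
			return c->cached_node;
		}
	}
	/* slow path: walk from the head, then remember where we ended up */
	for (n = ns->head, i = 0; n && i < pos; n = n->next, i++)
		;
	c->cached_event = ns->event;
	c->cached_node = n;
	c->cached_index = pos;
	return n;
}

int main(void)
{
	struct node c3 = { NULL, 3 }, c2 = { &c3, 2 }, c1 = { &c2, 1 };
	struct ns_model ns = { 1, &c1 };
	struct cursor cur = { 0, NULL, 0 };	/* stale: forces one full walk */
	long pos;

	for (pos = 0; pos < 3; pos++)	/* sequential read(): O(1) resumes */
		printf("%d\n", m_start_model(&cur, &ns, pos)->id);
	return 0;
}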
	4) the mount hash is ridiculously small.  256 hash chains is not
enough when you might get thousands of elements in that hash and when
hash lookups happen on fairly hot paths.  Solved by switching to
alloc_large_system_hash(), hopefully with a sane order (less than it
used to be way back).  That may need tuning, but I think that this
variant is tolerable.
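	For tuning, patch 5 also adds mhash_entries= and mphash_entries=
boot parameters; e.g. something like

	mhash_entries=131072 mphash_entries=65536

on the kernel command line would override the automatically chosen
sizes (the exact numbers here are arbitrary, just an illustration).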

	Folks, please review.  It survives pretty heavy testing and most
of that stuff is pretty straightforward, but I would rather have it
discussed on fsdevel before I push that pile to Linus.  Patches follow;
if there are any questions and/or objections - please, yell.


* [PATCH 1/5][RFC][CFT] percpu fixes, part 1
  2014-03-05  3:47 [PATCHES][RFC][CFT] scalability fixes for shitloads of mounts Al Viro
@ 2014-03-05  3:49 ` Al Viro
  2014-03-06 19:20   ` Tejun Heo
  2014-03-05  3:50 ` [PATCH 2/5][RFC][CFT] fold pcpu_split_block() into the only caller Al Viro
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: Al Viro @ 2014-03-05  3:49 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Linus Torvalds, linux-kernel, Stephen Tweedie, Jeremy Eder

convert ->map[] to an array of offsets, cache the "no free areas among
the first N" hint in the chunk.  Free/in-use is represented by the LSB;
a sentry element (<free = false, offset = total size of chunk>) is added
at the end.
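For illustration (the same worked example as in the expanded commit
message later in this thread): where the old code stored

	4, -8, -4, 4, -12, 100

(4 bytes free, 8 in use, 4 in use, 4 free, 12 in use, 100 free), the
new representation stores <offset, in-use> packed into single ints:

	0, 5, 13, 16, 21, 32, 133

with the final 133 (total size | 1) being the sentry.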

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 mm/percpu.c |  165 +++++++++++++++++++++++++++++++++++++----------------------
 1 file changed, 103 insertions(+), 62 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 036cfe0..4469fbd 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -102,10 +102,11 @@ struct pcpu_chunk {
 	int			free_size;	/* free bytes in the chunk */
 	int			contig_hint;	/* max contiguous size hint */
 	void			*base_addr;	/* base address of this chunk */
-	int			map_used;	/* # of map entries used */
+	int			map_used;	/* # of map entries used before the sentry */
 	int			map_alloc;	/* # of map entries allocated */
 	int			*map;		/* allocation map */
 	void			*data;		/* chunk data */
+	int			first_free;	/* no free below this */
 	bool			immutable;	/* no [de]population allowed */
 	unsigned long		populated[];	/* populated bitmap */
 };
@@ -356,11 +357,11 @@ static int pcpu_need_to_extend(struct pcpu_chunk *chunk)
 {
 	int new_alloc;
 
-	if (chunk->map_alloc >= chunk->map_used + 2)
+	if (chunk->map_alloc >= chunk->map_used + 3)
 		return 0;
 
 	new_alloc = PCPU_DFL_MAP_ALLOC;
-	while (new_alloc < chunk->map_used + 2)
+	while (new_alloc < chunk->map_used + 3)
 		new_alloc *= 2;
 
 	return new_alloc;
@@ -438,25 +439,24 @@ out_unlock:
  * pcpu_lock.
  */
 static void pcpu_split_block(struct pcpu_chunk *chunk, int i,
-			     int head, int tail)
+			     int head, int size, int tail)
 {
 	int nr_extra = !!head + !!tail;
+	int off;
 
-	BUG_ON(chunk->map_alloc < chunk->map_used + nr_extra);
+	BUG_ON(chunk->map_alloc <= chunk->map_used + nr_extra);
 
 	/* insert new subblocks */
-	memmove(&chunk->map[i + nr_extra], &chunk->map[i],
+	memmove(&chunk->map[i + nr_extra] + 1, &chunk->map[i] + 1,
 		sizeof(chunk->map[0]) * (chunk->map_used - i));
 	chunk->map_used += nr_extra;
 
-	if (head) {
-		chunk->map[i + 1] = chunk->map[i] - head;
-		chunk->map[i++] = head;
-	}
-	if (tail) {
-		chunk->map[i++] -= tail;
-		chunk->map[i] = tail;
-	}
+	off = chunk->map[i];
+
+	if (head)
+		chunk->map[++i] = off += head;
+	if (tail)
+		chunk->map[++i] = off += size;
 }
 
 /**
@@ -483,19 +483,27 @@ static int pcpu_alloc_area(struct pcpu_chunk *chunk, int size, int align)
 	int oslot = pcpu_chunk_slot(chunk);
 	int max_contig = 0;
 	int i, off;
+	int seen_free = 0;
+	int *p;
 
-	for (i = 0, off = 0; i < chunk->map_used; off += abs(chunk->map[i++])) {
-		bool is_last = i + 1 == chunk->map_used;
+	for (i = chunk->first_free, p = chunk->map + i; i < chunk->map_used; i++, p++) {
 		int head, tail;
+		int this_size;
+
+		off = *p;
+		if (off & 1)
+			continue;
 
 		/* extra for alignment requirement */
 		head = ALIGN(off, align) - off;
-		BUG_ON(i == 0 && head != 0);
 
-		if (chunk->map[i] < 0)
-			continue;
-		if (chunk->map[i] < head + size) {
-			max_contig = max(chunk->map[i], max_contig);
+		this_size = (p[1] & ~1) - off;
+		if (this_size < head + size) {
+			if (!seen_free) {
+				chunk->first_free = i;
+				seen_free = 1;
+			}
+			max_contig = max(this_size, max_contig);
 			continue;
 		}
 
@@ -505,44 +513,50 @@ static int pcpu_alloc_area(struct pcpu_chunk *chunk, int size, int align)
 		 * than sizeof(int), which is very small but isn't too
 		 * uncommon for percpu allocations.
 		 */
-		if (head && (head < sizeof(int) || chunk->map[i - 1] > 0)) {
-			if (chunk->map[i - 1] > 0)
-				chunk->map[i - 1] += head;
-			else {
-				chunk->map[i - 1] -= head;
+		if (head && (head < sizeof(int) || !(p[-1] & 1))) {
+			if (p[-1] & 1)
 				chunk->free_size -= head;
-			}
-			chunk->map[i] -= head;
-			off += head;
+			*p = off += head;
+			this_size -= head;
 			head = 0;
 		}
 
 		/* if tail is small, just keep it around */
-		tail = chunk->map[i] - head - size;
-		if (tail < sizeof(int))
+		tail = this_size - head - size;
+		if (tail < sizeof(int)) {
 			tail = 0;
+			size = this_size - head;
+		}
 
 		/* split if warranted */
 		if (head || tail) {
-			pcpu_split_block(chunk, i, head, tail);
+			pcpu_split_block(chunk, i, head, size, tail);
 			if (head) {
+				if (!seen_free) {
+					chunk->first_free = i;
+					seen_free = 1;
+				}
 				i++;
+				p++;
 				off += head;
-				max_contig = max(chunk->map[i - 1], max_contig);
+				max_contig = max(head, max_contig);
 			}
 			if (tail)
-				max_contig = max(chunk->map[i + 1], max_contig);
+				max_contig = max(tail, max_contig);
 		}
 
+		if (!seen_free)
+			chunk->first_free = i + 1;
+
 		/* update hint and mark allocated */
-		if (is_last)
+		if (i + 1 == chunk->map_used)
 			chunk->contig_hint = max_contig; /* fully scanned */
 		else
 			chunk->contig_hint = max(chunk->contig_hint,
 						 max_contig);
 
-		chunk->free_size -= chunk->map[i];
-		chunk->map[i] = -chunk->map[i];
+		chunk->free_size -= size;
+		*p |= 1;
 
 		pcpu_chunk_relocate(chunk, oslot);
 		return off;
@@ -570,34 +584,50 @@ static int pcpu_alloc_area(struct pcpu_chunk *chunk, int size, int align)
 static void pcpu_free_area(struct pcpu_chunk *chunk, int freeme)
 {
 	int oslot = pcpu_chunk_slot(chunk);
-	int i, off;
-
-	for (i = 0, off = 0; i < chunk->map_used; off += abs(chunk->map[i++]))
-		if (off == freeme)
-			break;
+	int off = 0;
+	unsigned i, j;
+	int to_free = 0;
+	int *p;
+
+	freeme |= 1;
+
+	i = 0;
+	j = chunk->map_used;
+	while (i != j) {
+		unsigned k = (i + j) / 2;
+		off = chunk->map[k];
+		if (off < freeme)
+			i = k + 1;
+		else if (off > freeme)
+			j = k;
+		else
+			i = j = k;
+	}
 	BUG_ON(off != freeme);
-	BUG_ON(chunk->map[i] > 0);
 
-	chunk->map[i] = -chunk->map[i];
-	chunk->free_size += chunk->map[i];
+	if (i < chunk->first_free)
+		chunk->first_free = i;
+
+	p = chunk->map + i;
+	*p = off &= ~1;
+	chunk->free_size += (p[1] & ~1) - off;
 
+	/* merge with next? */
+	if (!(p[1] & 1))
+		to_free++;
 	/* merge with previous? */
-	if (i > 0 && chunk->map[i - 1] >= 0) {
-		chunk->map[i - 1] += chunk->map[i];
-		chunk->map_used--;
-		memmove(&chunk->map[i], &chunk->map[i + 1],
-			(chunk->map_used - i) * sizeof(chunk->map[0]));
+	if (i > 0 && !(p[-1] & 1)) {
+		to_free++;
 		i--;
+		p--;
 	}
-	/* merge with next? */
-	if (i + 1 < chunk->map_used && chunk->map[i + 1] >= 0) {
-		chunk->map[i] += chunk->map[i + 1];
-		chunk->map_used--;
-		memmove(&chunk->map[i + 1], &chunk->map[i + 2],
-			(chunk->map_used - (i + 1)) * sizeof(chunk->map[0]));
+	if (to_free) {
+		chunk->map_used -= to_free;
+		memmove(p + 1, p + 1 + to_free, 
+			(chunk->map_used - i) * sizeof(chunk->map[0]));
 	}
 
-	chunk->contig_hint = max(chunk->map[i], chunk->contig_hint);
+	chunk->contig_hint = max(chunk->map[i + 1] - chunk->map[i] - 1, chunk->contig_hint);
 	pcpu_chunk_relocate(chunk, oslot);
 }
 
@@ -617,7 +647,9 @@ static struct pcpu_chunk *pcpu_alloc_chunk(void)
 	}
 
 	chunk->map_alloc = PCPU_DFL_MAP_ALLOC;
-	chunk->map[chunk->map_used++] = pcpu_unit_size;
+	chunk->map[0] = 0;
+	chunk->map[1] = pcpu_unit_size | 1;
+	chunk->map_used = 1;
 
 	INIT_LIST_HEAD(&chunk->list);
 	chunk->free_size = pcpu_unit_size;
@@ -713,6 +745,9 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved)
 	unsigned long flags;
 	void __percpu *ptr;
 
+	if (unlikely(align < 2))
+		align = 2;
+
 	if (unlikely(!size || size > PCPU_MIN_UNIT_SIZE || align > PAGE_SIZE)) {
 		WARN(true, "illegal size (%zu) or align (%zu) for "
 		     "percpu allocation\n", size, align);
@@ -1343,9 +1378,13 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
 	}
 	schunk->contig_hint = schunk->free_size;
 
-	schunk->map[schunk->map_used++] = -ai->static_size;
+	schunk->map[0] = 1;
+	schunk->map[1] = ai->static_size;
+	schunk->map_used = 1;
 	if (schunk->free_size)
-		schunk->map[schunk->map_used++] = schunk->free_size;
+		schunk->map[++schunk->map_used] = 1 | (ai->static_size + schunk->free_size);
+	else
+		schunk->map[1] |= 1;
 
 	/* init dynamic chunk if necessary */
 	if (dyn_size) {
@@ -1358,8 +1397,10 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
 		bitmap_fill(dchunk->populated, pcpu_unit_pages);
 
 		dchunk->contig_hint = dchunk->free_size = dyn_size;
-		dchunk->map[dchunk->map_used++] = -pcpu_reserved_chunk_limit;
-		dchunk->map[dchunk->map_used++] = dchunk->free_size;
+		dchunk->map[0] = 1;
+		dchunk->map[1] = pcpu_reserved_chunk_limit;
+		dchunk->map[2] = (pcpu_reserved_chunk_limit + dchunk->free_size) | 1;
+		dchunk->map_used = 2;
 	}
 
 	/* link the first chunk in */
-- 
1.7.10.4



* [PATCH 2/5][RFC][CFT] fold pcpu_split_block() into the only caller
  2014-03-05  3:47 [PATCHES][RFC][CFT] scalability fixes for shitloads of mounts Al Viro
  2014-03-05  3:49 ` [PATCH 1/5][RFC][CFT] percpu fixes, part 1 Al Viro
@ 2014-03-05  3:50 ` Al Viro
  2014-03-06 19:21   ` Tejun Heo
  2014-03-05  3:51 ` [PATCH 3/5][RFC][CFT] smarter propagate_mnt() Al Viro
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: Al Viro @ 2014-03-05  3:50 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Linus Torvalds, linux-kernel, Stephen Tweedie, Jeremy Eder

... and clean it up a bit

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 mm/percpu.c |   58 ++++++++++++----------------------------------------------
 1 file changed, 12 insertions(+), 46 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 4469fbd..383801e 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -419,47 +419,6 @@ out_unlock:
 }
 
 /**
- * pcpu_split_block - split a map block
- * @chunk: chunk of interest
- * @i: index of map block to split
- * @head: head size in bytes (can be 0)
- * @tail: tail size in bytes (can be 0)
- *
- * Split the @i'th map block into two or three blocks.  If @head is
- * non-zero, @head bytes block is inserted before block @i moving it
- * to @i+1 and reducing its size by @head bytes.
- *
- * If @tail is non-zero, the target block, which can be @i or @i+1
- * depending on @head, is reduced by @tail bytes and @tail byte block
- * is inserted after the target block.
- *
- * @chunk->map must have enough free slots to accommodate the split.
- *
- * CONTEXT:
- * pcpu_lock.
- */
-static void pcpu_split_block(struct pcpu_chunk *chunk, int i,
-			     int head, int size, int tail)
-{
-	int nr_extra = !!head + !!tail;
-	int off;
-
-	BUG_ON(chunk->map_alloc <= chunk->map_used + nr_extra);
-
-	/* insert new subblocks */
-	memmove(&chunk->map[i + nr_extra] + 1, &chunk->map[i] + 1,
-		sizeof(chunk->map[0]) * (chunk->map_used - i));
-	chunk->map_used += nr_extra;
-
-	off = chunk->map[i];
-
-	if (head)
-		chunk->map[++i] = off += head;
-	if (tail)
-		chunk->map[++i] = off += size;
-}
-
-/**
  * pcpu_alloc_area - allocate area from a pcpu_chunk
  * @chunk: chunk of interest
  * @size: wanted size in bytes
@@ -530,19 +489,26 @@ static int pcpu_alloc_area(struct pcpu_chunk *chunk, int size, int align)
 
 		/* split if warranted */
 		if (head || tail) {
-			pcpu_split_block(chunk, i, head, size, tail);
+			int nr_extra = !!head + !!tail;
+
+			/* insert new subblocks */
+			memmove(p + nr_extra + 1, p + 1,
+				sizeof(chunk->map[0]) * (chunk->map_used - i));
+			chunk->map_used += nr_extra;
+
 			if (head) {
 				if (!seen_free) {
 					chunk->first_free = i;
 					seen_free = 1;
 				}
-				i++;
-				p++;
-				off += head;
+				*++p = off += head;
+				++i;
 				max_contig = max(head, max_contig);
 			}
-			if (tail)
+			if (tail) {
+				p[1] = off + size;
 				max_contig = max(tail, max_contig);
+			}
 		}
 
 		if (!seen_free)
-- 
1.7.10.4



* [PATCH 3/5][RFC][CFT] smarter propagate_mnt()
  2014-03-05  3:47 [PATCHES][RFC][CFT] scalability fixes for shitloads of mounts Al Viro
  2014-03-05  3:49 ` [PATCH 1/5][RFC][CFT] percpu fixes, part 1 Al Viro
  2014-03-05  3:50 ` [PATCH 2/5][RFC][CFT] fold pcpu_split_block() into the only caller Al Viro
@ 2014-03-05  3:51 ` Al Viro
  2014-03-05  3:51 ` [PATCH 4/5][RFC][CFT] reduce m_start() cost Al Viro
  2014-03-05  3:52 ` [PATCH 5/5][RFC][CFT] resizable namespace.c hashes Al Viro
  4 siblings, 0 replies; 18+ messages in thread
From: Al Viro @ 2014-03-05  3:51 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Linus Torvalds, linux-kernel, Stephen Tweedie, Jeremy Eder

The current mainline has copies propagated to *all* nodes, then
tears down the copies we made for nodes that do not contain
counterparts of the desired mountpoint.  That sets up the right
propagation graph for the copies (at teardown time we move
the slaves of a removed node to a surviving peer or directly
to its master), but we end up paying a fairly steep price in
useless allocations.  It's fairly easy to create a situation
where N calls of mount(2) create exactly N bindings, with
O(N^2) vfsmounts allocated and freed in the process.

Fortunately, it is possible to avoid those allocations/freeings.
The trick is to create copies in the right order and find which
one would've eventually become a master with the current algorithm.
It turns out to be possible in O(nodes getting propagation) time
and with no extra allocations at all.

One part is that we need to make sure that the eventual master will
be created before its slaves, so we need to walk the propagation
tree in a different order - by peer groups, iterating through the
peers before dealing with the next group.

Another thing is finding the (earlier) copy that will be the master
of the one we are about to create; to do that we are (temporarily)
marking the masters of the mountpoints we are attaching the copies to.

Either we are in a peer of the last mountpoint we'd dealt with,
or we have the following situation: we are attaching to mountpoint M,
the last copy S_0 had been attached to M_0 and there are sequences
S_0...S_n, M_0...M_n such that S_{i+1} is the master of S_{i},
S_{i} is mounted on M_{i} and we need to create a slave of the first
S_{k} such that M is getting propagation from M_{k}.  It means that
the master of M_{k} will be among the sequence of masters of M.  On
the other hand, the nearest marked node in that sequence will either
be the master of M_{k} or the master of M_{k-1} (the latter - in the
case when M_{k-1} is a slave of something M gets propagation from,
but in the wrong peer group).

So we go through the sequence of masters of M until we find a marked
one (P).  Let N be the one before it.  Then we go through the sequence
of masters of S_0 until we find one (say, S) mounted on a node D that
has P as its master and check whether D is a peer of N.  If it is,
S will be the master of the new copy; if not, the master of S will be.

That's it for the hard part; the rest is fairly simple.  The iterator
is in next_group(), the handling of one prospective mountpoint is in
propagate_one().

It seems to survive all tests and gives noticeably better performance
than the current mainline for setups that seriously use shared
subtrees.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/namespace.c        |   26 ++++---
 fs/pnode.c            |  194 ++++++++++++++++++++++++++++++-------------------
 fs/pnode.h            |    3 +
 include/linux/mount.h |    3 +
 4 files changed, 138 insertions(+), 88 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 22e5367..f28c4fd 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -849,7 +849,7 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
 			goto out_free;
 	}
 
-	mnt->mnt.mnt_flags = old->mnt.mnt_flags & ~MNT_WRITE_HOLD;
+	mnt->mnt.mnt_flags = old->mnt.mnt_flags & ~(MNT_WRITE_HOLD|MNT_MARKED);
 	/* Don't allow unprivileged users to change mount flags */
 	if ((flag & CL_UNPRIVILEGED) && (mnt->mnt.mnt_flags & MNT_READONLY))
 		mnt->mnt.mnt_flags |= MNT_LOCK_READONLY;
@@ -1613,16 +1613,14 @@ static int attach_recursive_mnt(struct mount *source_mnt,
 		err = invent_group_ids(source_mnt, true);
 		if (err)
 			goto out;
-	}
-	err = propagate_mnt(dest_mnt, dest_mp, source_mnt, &tree_list);
-	if (err)
-		goto out_cleanup_ids;
-
-	lock_mount_hash();
-
-	if (IS_MNT_SHARED(dest_mnt)) {
+		err = propagate_mnt(dest_mnt, dest_mp, source_mnt, &tree_list);
+		lock_mount_hash();
+		if (err)
+			goto out_cleanup_ids;
 		for (p = source_mnt; p; p = next_mnt(p, source_mnt))
 			set_mnt_shared(p);
+	} else {
+		lock_mount_hash();
 	}
 	if (parent_path) {
 		detach_mnt(source_mnt, parent_path);
@@ -1642,8 +1640,12 @@ static int attach_recursive_mnt(struct mount *source_mnt,
 	return 0;
 
  out_cleanup_ids:
-	if (IS_MNT_SHARED(dest_mnt))
-		cleanup_group_ids(source_mnt, NULL);
+	while (!list_empty(&tree_list)) {
+		child = list_first_entry(&tree_list, struct mount, mnt_hash);
+		umount_tree(child, 0);
+	}
+	unlock_mount_hash();
+	cleanup_group_ids(source_mnt, NULL);
  out:
 	return err;
 }
@@ -1997,7 +1999,7 @@ static int do_add_mount(struct mount *newmnt, struct path *path, int mnt_flags)
 	struct mount *parent;
 	int err;
 
-	mnt_flags &= ~(MNT_SHARED | MNT_WRITE_HOLD | MNT_INTERNAL | MNT_DOOMED | MNT_SYNC_UMOUNT);
+	mnt_flags &= ~MNT_INTERNAL_FLAGS;
 
 	mp = lock_mount(path);
 	if (IS_ERR(mp))
diff --git a/fs/pnode.c b/fs/pnode.c
index c7221bb..04ca12c 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -164,46 +164,94 @@ static struct mount *propagation_next(struct mount *m,
 	}
 }
 
-/*
- * return the source mount to be used for cloning
- *
- * @dest 	the current destination mount
- * @last_dest  	the last seen destination mount
- * @last_src  	the last seen source mount
- * @type	return CL_SLAVE if the new mount has to be
- * 		cloned as a slave.
- */
-static struct mount *get_source(struct mount *dest,
-				struct mount *last_dest,
-				struct mount *last_src,
-				int *type)
+static struct mount *next_group(struct mount *m, struct mount *origin)
 {
-	struct mount *p_last_src = NULL;
-	struct mount *p_last_dest = NULL;
-
-	while (last_dest != dest->mnt_master) {
-		p_last_dest = last_dest;
-		p_last_src = last_src;
-		last_dest = last_dest->mnt_master;
-		last_src = last_src->mnt_master;
+	while (1) {
+		while (1) {
+			struct mount *next;
+			if (!IS_MNT_NEW(m) && !list_empty(&m->mnt_slave_list))
+				return first_slave(m);
+			next = next_peer(m);
+			if (m->mnt_group_id == origin->mnt_group_id) {
+				if (next == origin)
+					return NULL;
+			} else if (m->mnt_slave.next != &next->mnt_slave)
+				break;
+			m = next;
+		}
+		/* m is the last peer */
+		while (1) {
+			struct mount *master = m->mnt_master;
+			if (m->mnt_slave.next != &master->mnt_slave_list)
+				return next_slave(m);
+			m = next_peer(master);
+			if (master->mnt_group_id == origin->mnt_group_id)
+				break;
+			if (master->mnt_slave.next == &m->mnt_slave)
+				break;
+			m = master;
+		}
+		if (m == origin)
+			return NULL;
 	}
+}
 
-	if (p_last_dest) {
-		do {
-			p_last_dest = next_peer(p_last_dest);
-		} while (IS_MNT_NEW(p_last_dest));
-		/* is that a peer of the earlier? */
-		if (dest == p_last_dest) {
-			*type = CL_MAKE_SHARED;
-			return p_last_src;
+/* all accesses are serialized by namespace_sem */
+static struct user_namespace *user_ns;
+static struct mount *last_dest, *last_source, *dest_master;
+static struct mountpoint *mp;
+static struct list_head *list;
+
+static int propagate_one(struct mount *m)
+{
+	struct mount *child;
+	int type;
+	/* skip ones added by this propagate_mnt() */
+	if (IS_MNT_NEW(m))
+		return 0;
+	/* skip if mountpoint isn't covered by it */
+	if (!is_subdir(mp->m_dentry, m->mnt.mnt_root))
+		return 0;
+	if (m->mnt_group_id == last_dest->mnt_group_id) {
+		type = CL_MAKE_SHARED;
+	} else {
+		struct mount *n, *p;
+		for (n = m; ; n = p) {
+			p = n->mnt_master;
+			if (p == dest_master || IS_MNT_MARKED(p)) {
+				while (last_dest->mnt_master != p) {
+					last_source = last_source->mnt_master;
+					last_dest = last_source->mnt_parent;
+				}
+				if (n->mnt_group_id != last_dest->mnt_group_id) {
+					last_source = last_source->mnt_master;
+					last_dest = last_source->mnt_parent;
+				}
+				break;
+			}
 		}
+		type = CL_SLAVE;
+		/* beginning of peer group among the slaves? */
+		if (IS_MNT_SHARED(m))
+			type |= CL_MAKE_SHARED;
+	}
+		
+	/* Notice when we are propagating across user namespaces */
+	if (m->mnt_ns->user_ns != user_ns)
+		type |= CL_UNPRIVILEGED;
+	child = copy_tree(last_source, last_source->mnt.mnt_root, type);
+	if (IS_ERR(child))
+		return PTR_ERR(child);
+	mnt_set_mountpoint(m, mp, child);
+	last_dest = m;
+	last_source = child;
+	if (m->mnt_master != dest_master) {
+		read_seqlock_excl(&mount_lock);
+		SET_MNT_MARK(m->mnt_master);
+		read_sequnlock_excl(&mount_lock);
 	}
-	/* slave of the earlier, then */
-	*type = CL_SLAVE;
-	/* beginning of peer group among the slaves? */
-	if (IS_MNT_SHARED(dest))
-		*type |= CL_MAKE_SHARED;
-	return last_src;
+	list_add_tail(&child->mnt_hash, list);
+	return 0;
 }
 
 /*
@@ -222,54 +270,48 @@ static struct mount *get_source(struct mount *dest,
 int propagate_mnt(struct mount *dest_mnt, struct mountpoint *dest_mp,
 		    struct mount *source_mnt, struct list_head *tree_list)
 {
-	struct user_namespace *user_ns = current->nsproxy->mnt_ns->user_ns;
-	struct mount *m, *child;
+	struct mount *m, *n;
 	int ret = 0;
-	struct mount *prev_dest_mnt = dest_mnt;
-	struct mount *prev_src_mnt  = source_mnt;
-	LIST_HEAD(tmp_list);
-
-	for (m = propagation_next(dest_mnt, dest_mnt); m;
-			m = propagation_next(m, dest_mnt)) {
-		int type;
-		struct mount *source;
-
-		if (IS_MNT_NEW(m))
-			continue;
-
-		source =  get_source(m, prev_dest_mnt, prev_src_mnt, &type);
-
-		/* Notice when we are propagating across user namespaces */
-		if (m->mnt_ns->user_ns != user_ns)
-			type |= CL_UNPRIVILEGED;
 
-		child = copy_tree(source, source->mnt.mnt_root, type);
-		if (IS_ERR(child)) {
-			ret = PTR_ERR(child);
-			list_splice(tree_list, tmp_list.prev);
+	/*
+	 * we don't want to bother passing tons of arguments to
+	 * propagate_one(); everything is serialized by namespace_sem,
+	 * so globals will do just fine.
+	 */
+	user_ns = current->nsproxy->mnt_ns->user_ns;
+	last_dest = dest_mnt;
+	last_source = source_mnt;
+	mp = dest_mp;
+	list = tree_list;
+	dest_master = dest_mnt->mnt_master;
+
+	/* all peers of dest_mnt, except dest_mnt itself */
+	for (n = next_peer(dest_mnt); n != dest_mnt; n = next_peer(n)) {
+		ret = propagate_one(n);
+		if (ret)
 			goto out;
-		}
+	}
 
-		if (is_subdir(dest_mp->m_dentry, m->mnt.mnt_root)) {
-			mnt_set_mountpoint(m, dest_mp, child);
-			list_add_tail(&child->mnt_hash, tree_list);
-		} else {
-			/*
-			 * This can happen if the parent mount was bind mounted
-			 * on some subdirectory of a shared/slave mount.
-			 */
-			list_add_tail(&child->mnt_hash, &tmp_list);
-		}
-		prev_dest_mnt = m;
-		prev_src_mnt  = child;
+	/* all slave groups */
+	for (m = next_group(dest_mnt, dest_mnt); m;
+			m = next_group(m, dest_mnt)) {
+		/* everything in that slave group */
+		n = m;
+		do {
+			ret = propagate_one(n);
+			if (ret)
+				goto out;
+			n = next_peer(n);
+		} while (n != m);
 	}
 out:
-	lock_mount_hash();
-	while (!list_empty(&tmp_list)) {
-		child = list_first_entry(&tmp_list, struct mount, mnt_hash);
-		umount_tree(child, 0);
+	read_seqlock_excl(&mount_lock);
+	list_for_each_entry(n, tree_list, mnt_hash) {
+		m = n->mnt_parent;
+		if (m->mnt_master != dest_mnt->mnt_master)
+			CLEAR_MNT_MARK(m->mnt_master);
 	}
-	unlock_mount_hash();
+	read_sequnlock_excl(&mount_lock);
 	return ret;
 }
 
diff --git a/fs/pnode.h b/fs/pnode.h
index 59e7eda..65e04f6 100644
--- a/fs/pnode.h
+++ b/fs/pnode.h
@@ -16,6 +16,9 @@
 #define IS_MNT_NEW(m)  (!(m)->mnt_ns)
 #define CLEAR_MNT_SHARED(m) ((m)->mnt.mnt_flags &= ~MNT_SHARED)
 #define IS_MNT_UNBINDABLE(m) ((m)->mnt.mnt_flags & MNT_UNBINDABLE)
+#define IS_MNT_MARKED(m) ((m)->mnt.mnt_flags & MNT_MARKED)
+#define SET_MNT_MARK(m) ((m)->mnt.mnt_flags |= MNT_MARKED)
+#define CLEAR_MNT_MARK(m) ((m)->mnt.mnt_flags &= ~MNT_MARKED)
 
 #define CL_EXPIRE    		0x01
 #define CL_SLAVE     		0x02
diff --git a/include/linux/mount.h b/include/linux/mount.h
index 371d346..839bac2 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -44,6 +44,8 @@ struct mnt_namespace;
 #define MNT_SHARED_MASK	(MNT_UNBINDABLE)
 #define MNT_PROPAGATION_MASK	(MNT_SHARED | MNT_UNBINDABLE)
 
+#define MNT_INTERNAL_FLAGS (MNT_SHARED | MNT_WRITE_HOLD | MNT_INTERNAL | \
+			    MNT_DOOMED | MNT_SYNC_UMOUNT | MNT_MARKED)
 
 #define MNT_INTERNAL	0x4000
 
@@ -51,6 +53,7 @@ struct mnt_namespace;
 #define MNT_LOCKED		0x800000
 #define MNT_DOOMED		0x1000000
 #define MNT_SYNC_UMOUNT		0x2000000
+#define MNT_MARKED		0x4000000
 
 struct vfsmount {
 	struct dentry *mnt_root;	/* root of the mounted tree */
-- 
1.7.10.4



* [PATCH 4/5][RFC][CFT] reduce m_start() cost...
  2014-03-05  3:47 [PATCHES][RFC][CFT] scalability fixes for shitloads of mounts Al Viro
                   ` (2 preceding siblings ...)
  2014-03-05  3:51 ` [PATCH 3/5][RFC][CFT] smarter propagate_mnt() Al Viro
@ 2014-03-05  3:51 ` Al Viro
  2014-03-05  3:52 ` [PATCH 5/5][RFC][CFT] resizable namespace.c hashes Al Viro
  4 siblings, 0 replies; 18+ messages in thread
From: Al Viro @ 2014-03-05  3:51 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Linus Torvalds, linux-kernel, Stephen Tweedie, Jeremy Eder

Don't rescan the mount list in m_start() if there haven't been any
changes and we are right at the needed place or just before it (those
are the only cases occurring on a sequential read() without seeks).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/mount.h          |    5 ++++-
 fs/namespace.c      |   21 ++++++++++++++++++---
 fs/proc_namespace.c |    1 +
 3 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/fs/mount.h b/fs/mount.h
index a17458c..61e1c05 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -10,7 +10,7 @@ struct mnt_namespace {
 	struct user_namespace	*user_ns;
 	u64			seq;	/* Sequence number to prevent loops */
 	wait_queue_head_t poll;
-	int event;
+	u64 event;
 };
 
 struct mnt_pcp {
@@ -104,6 +104,9 @@ struct proc_mounts {
 	struct mnt_namespace *ns;
 	struct path root;
 	int (*show)(struct seq_file *, struct vfsmount *);
+	void *cached_mount;
+	u64 cached_event;
+	loff_t cached_index;
 };
 
 #define proc_mounts(p) (container_of((p), struct proc_mounts, m))
diff --git a/fs/namespace.c b/fs/namespace.c
index f28c4fd..00f12af 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -29,7 +29,7 @@
 #define HASH_SHIFT ilog2(PAGE_SIZE / sizeof(struct list_head))
 #define HASH_SIZE (1UL << HASH_SHIFT)
 
-static int event;
+static u64 event;
 static DEFINE_IDA(mnt_id_ida);
 static DEFINE_IDA(mnt_group_ida);
 static DEFINE_SPINLOCK(mnt_id_lock);
@@ -1064,14 +1064,29 @@ static void *m_start(struct seq_file *m, loff_t *pos)
 	struct proc_mounts *p = proc_mounts(m);
 
 	down_read(&namespace_sem);
-	return seq_list_start(&p->ns->list, *pos);
+	if (p->cached_event == p->ns->event) {
+		void *v = p->cached_mount;
+		if (*pos == p->cached_index)
+			return v;
+		if (*pos == p->cached_index + 1) {
+			v = seq_list_next(v, &p->ns->list, &p->cached_index);
+			return p->cached_mount = v;
+		}
+	}
+
+	p->cached_event = p->ns->event;
+	p->cached_mount = seq_list_start(&p->ns->list, *pos);
+	p->cached_index = *pos;
+	return p->cached_mount;
 }
 
 static void *m_next(struct seq_file *m, void *v, loff_t *pos)
 {
 	struct proc_mounts *p = proc_mounts(m);
 
-	return seq_list_next(v, &p->ns->list, pos);
+	p->cached_mount = seq_list_next(v, &p->ns->list, pos);
+	p->cached_index = *pos;
+	return p->cached_mount;
 }
 
 static void m_stop(struct seq_file *m, void *v)
diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
index 7be26f0..1a81373 100644
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -267,6 +267,7 @@ static int mounts_open_common(struct inode *inode, struct file *file,
 	p->root = root;
 	p->m.poll_event = ns->event;
 	p->show = show;
+	p->cached_event = ~0ULL;
 
 	return 0;
 
-- 
1.7.10.4



* [PATCH 5/5][RFC][CFT] resizable namespace.c hashes
  2014-03-05  3:47 [PATCHES][RFC][CFT] scalability fixes for shitloads of mounts Al Viro
                   ` (3 preceding siblings ...)
  2014-03-05  3:51 ` [PATCH 4/5][RFC][CFT] reduce m_start() cost Al Viro
@ 2014-03-05  3:52 ` Al Viro
  2014-03-07 17:17   ` Andi Kleen
  4 siblings, 1 reply; 18+ messages in thread
From: Al Viro @ 2014-03-05  3:52 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Linus Torvalds, linux-kernel, Stephen Tweedie, Jeremy Eder

* switch allocation to alloc_large_system_hash()
* make sizes overridable by boot parameters (mhash_entries=, mphash_entries=)
* switch mountpoint_hashtable from list_head to hlist_head

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/mount.h     |    2 +-
 fs/namespace.c |   81 ++++++++++++++++++++++++++++++++++++++++----------------
 2 files changed, 59 insertions(+), 24 deletions(-)

diff --git a/fs/mount.h b/fs/mount.h
index 61e1c05..46790ae1 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -19,7 +19,7 @@ struct mnt_pcp {
 };
 
 struct mountpoint {
-	struct list_head m_hash;
+	struct hlist_node m_hash;
 	struct dentry *m_dentry;
 	int m_count;
 };
diff --git a/fs/namespace.c b/fs/namespace.c
index 00f12af..6be899f 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -23,11 +23,34 @@
 #include <linux/uaccess.h>
 #include <linux/proc_ns.h>
 #include <linux/magic.h>
+#include <linux/bootmem.h>
 #include "pnode.h"
 #include "internal.h"
 
-#define HASH_SHIFT ilog2(PAGE_SIZE / sizeof(struct list_head))
-#define HASH_SIZE (1UL << HASH_SHIFT)
+static unsigned int m_hash_mask __read_mostly;
+static unsigned int m_hash_shift __read_mostly;
+static unsigned int mp_hash_mask __read_mostly;
+static unsigned int mp_hash_shift __read_mostly;
+
+static __initdata unsigned long mhash_entries;
+static int __init set_mhash_entries(char *str)
+{
+	if (!str)
+		return 0;
+	mhash_entries = simple_strtoul(str, &str, 0);
+	return 1;
+}
+__setup("mhash_entries=", set_mhash_entries);
+
+static __initdata unsigned long mphash_entries;
+static int __init set_mphash_entries(char *str)
+{
+	if (!str)
+		return 0;
+	mphash_entries = simple_strtoul(str, &str, 0);
+	return 1;
+}
+__setup("mphash_entries=", set_mphash_entries);
 
 static u64 event;
 static DEFINE_IDA(mnt_id_ida);
@@ -37,7 +60,7 @@ static int mnt_id_start = 0;
 static int mnt_group_start = 1;
 
 static struct list_head *mount_hashtable __read_mostly;
-static struct list_head *mountpoint_hashtable __read_mostly;
+static struct hlist_head *mountpoint_hashtable __read_mostly;
 static struct kmem_cache *mnt_cache __read_mostly;
 static DECLARE_RWSEM(namespace_sem);
 
@@ -55,12 +78,19 @@ EXPORT_SYMBOL_GPL(fs_kobj);
  */
 __cacheline_aligned_in_smp DEFINE_SEQLOCK(mount_lock);
 
-static inline unsigned long hash(struct vfsmount *mnt, struct dentry *dentry)
+static inline struct list_head *m_hash(struct vfsmount *mnt, struct dentry *dentry)
 {
 	unsigned long tmp = ((unsigned long)mnt / L1_CACHE_BYTES);
 	tmp += ((unsigned long)dentry / L1_CACHE_BYTES);
-	tmp = tmp + (tmp >> HASH_SHIFT);
-	return tmp & (HASH_SIZE - 1);
+	tmp = tmp + (tmp >> m_hash_shift);
+	return &mount_hashtable[tmp & m_hash_mask];
+}
+
+static inline struct hlist_head *mp_hash(struct dentry *dentry)
+{
+	unsigned long tmp = ((unsigned long)dentry / L1_CACHE_BYTES);
+	tmp = tmp + (tmp >> mp_hash_shift);
+	return &mountpoint_hashtable[tmp & mp_hash_mask];
 }
 
 /*
@@ -575,7 +605,7 @@ bool legitimize_mnt(struct vfsmount *bastard, unsigned seq)
  */
 struct mount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry)
 {
-	struct list_head *head = mount_hashtable + hash(mnt, dentry);
+	struct list_head *head = m_hash(mnt, dentry);
 	struct mount *p;
 
 	list_for_each_entry_rcu(p, head, mnt_hash)
@@ -590,7 +620,7 @@ struct mount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry)
  */
 struct mount *__lookup_mnt_last(struct vfsmount *mnt, struct dentry *dentry)
 {
-	struct list_head *head = mount_hashtable + hash(mnt, dentry);
+	struct list_head *head = m_hash(mnt, dentry);
 	struct mount *p;
 
 	list_for_each_entry_reverse(p, head, mnt_hash)
@@ -633,11 +663,11 @@ struct vfsmount *lookup_mnt(struct path *path)
 
 static struct mountpoint *new_mountpoint(struct dentry *dentry)
 {
-	struct list_head *chain = mountpoint_hashtable + hash(NULL, dentry);
+	struct hlist_head *chain = mp_hash(dentry);
 	struct mountpoint *mp;
 	int ret;
 
-	list_for_each_entry(mp, chain, m_hash) {
+	hlist_for_each_entry(mp, chain, m_hash) {
 		if (mp->m_dentry == dentry) {
 			/* might be worth a WARN_ON() */
 			if (d_unlinked(dentry))
@@ -659,7 +689,7 @@ static struct mountpoint *new_mountpoint(struct dentry *dentry)
 
 	mp->m_dentry = dentry;
 	mp->m_count = 1;
-	list_add(&mp->m_hash, chain);
+	hlist_add_head(&mp->m_hash, chain);
 	return mp;
 }
 
@@ -670,7 +700,7 @@ static void put_mountpoint(struct mountpoint *mp)
 		spin_lock(&dentry->d_lock);
 		dentry->d_flags &= ~DCACHE_MOUNTED;
 		spin_unlock(&dentry->d_lock);
-		list_del(&mp->m_hash);
+		hlist_del(&mp->m_hash);
 		kfree(mp);
 	}
 }
@@ -739,8 +769,7 @@ static void attach_mnt(struct mount *mnt,
 			struct mountpoint *mp)
 {
 	mnt_set_mountpoint(parent, mp, mnt);
-	list_add_tail(&mnt->mnt_hash, mount_hashtable +
-			hash(&parent->mnt, mp->m_dentry));
+	list_add_tail(&mnt->mnt_hash, m_hash(&parent->mnt, mp->m_dentry));
 	list_add_tail(&mnt->mnt_child, &parent->mnt_mounts);
 }
 
@@ -762,8 +791,8 @@ static void commit_tree(struct mount *mnt)
 
 	list_splice(&head, n->list.prev);
 
-	list_add_tail(&mnt->mnt_hash, mount_hashtable +
-				hash(&parent->mnt, mnt->mnt_mountpoint));
+	list_add_tail(&mnt->mnt_hash,
+				m_hash(&parent->mnt, mnt->mnt_mountpoint));
 	list_add_tail(&mnt->mnt_child, &parent->mnt_mounts);
 	touch_mnt_namespace(n);
 }
@@ -2794,18 +2823,24 @@ void __init mnt_init(void)
 	mnt_cache = kmem_cache_create("mnt_cache", sizeof(struct mount),
 			0, SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
 
-	mount_hashtable = (struct list_head *)__get_free_page(GFP_ATOMIC);
-	mountpoint_hashtable = (struct list_head *)__get_free_page(GFP_ATOMIC);
+	mount_hashtable = alloc_large_system_hash("Mount-cache",
+				sizeof(struct list_head),
+				mhash_entries, 19,
+				0,
+				&m_hash_shift, &m_hash_mask, 0, 0);
+	mountpoint_hashtable = alloc_large_system_hash("Mountpoint-cache",
+				sizeof(struct hlist_head),
+				mphash_entries, 19,
+				0,
+				&mp_hash_shift, &mp_hash_mask, 0, 0);
 
 	if (!mount_hashtable || !mountpoint_hashtable)
 		panic("Failed to allocate mount hash table\n");
 
-	printk(KERN_INFO "Mount-cache hash table entries: %lu\n", HASH_SIZE);
-
-	for (u = 0; u < HASH_SIZE; u++)
+	for (u = 0; u <= m_hash_mask; u++)
 		INIT_LIST_HEAD(&mount_hashtable[u]);
-	for (u = 0; u < HASH_SIZE; u++)
-		INIT_LIST_HEAD(&mountpoint_hashtable[u]);
+	for (u = 0; u <= mp_hash_mask; u++)
+		INIT_HLIST_HEAD(&mountpoint_hashtable[u]);
 
 	kernfs_init();
 
-- 
1.7.10.4



* Re: [PATCH 1/5][RFC][CFT] percpu fixes, part 1
  2014-03-05  3:49 ` [PATCH 1/5][RFC][CFT] percpu fixes, part 1 Al Viro
@ 2014-03-06 19:20   ` Tejun Heo
  2014-03-06 20:30     ` Al Viro
  0 siblings, 1 reply; 18+ messages in thread
From: Tejun Heo @ 2014-03-06 19:20 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, Linus Torvalds, linux-kernel, Stephen Tweedie,
	Jeremy Eder

Hello, Al.

On Wed, Mar 05, 2014 at 03:49:19AM +0000, Al Viro wrote:
> convert ->map[] to an array of offsets, cache the "no free areas among
> the first N" hint in the chunk.  Free/in-use is represented by the LSB;
> a sentry element (<free = false, offset = total size of chunk>) is added
> at the end.

Can you please add why this change is necessary to the description?
Also, I think it'd be better to split addition of first_free hint to a
separate patch.

>  static void pcpu_split_block(struct pcpu_chunk *chunk, int i,
> -			     int head, int tail)
> +			     int head, int size, int tail)
>  {
>  	int nr_extra = !!head + !!tail;
> +	int off;
>  
> -	BUG_ON(chunk->map_alloc < chunk->map_used + nr_extra);
> +	BUG_ON(chunk->map_alloc <= chunk->map_used + nr_extra);
>  
>  	/* insert new subblocks */
> -	memmove(&chunk->map[i + nr_extra], &chunk->map[i],
> +	memmove(&chunk->map[i + nr_extra] + 1, &chunk->map[i] + 1,
>  		sizeof(chunk->map[0]) * (chunk->map_used - i));
>  	chunk->map_used += nr_extra;
>  
> -	if (head) {
> -		chunk->map[i + 1] = chunk->map[i] - head;
> -		chunk->map[i++] = head;
> -	}
> -	if (tail) {
> -		chunk->map[i++] -= tail;
> -		chunk->map[i] = tail;
> -	}
> +	off = chunk->map[i];
> +
> +	if (head)
> +		chunk->map[++i] = off += head;
> +	if (tail)
> +		chunk->map[++i] = off += size;
>  }

Do we need to pass @size in the above function?  Isn't that something
which can be easily determined?  If @size is gonna stay, we'll need to
update the function comment too.

>  /**
> @@ -483,19 +483,27 @@ static int pcpu_alloc_area(struct pcpu_chunk *chunk, int size, int align)
>  	int oslot = pcpu_chunk_slot(chunk);
>  	int max_contig = 0;
>  	int i, off;
> +	int seen_free = 0;

bool

> @@ -570,34 +584,50 @@ static int pcpu_alloc_area(struct pcpu_chunk *chunk, int size, int align)
>  static void pcpu_free_area(struct pcpu_chunk *chunk, int freeme)
>  {
>  	int oslot = pcpu_chunk_slot(chunk);
> -	int i, off;
> -
> -	for (i = 0, off = 0; i < chunk->map_used; off += abs(chunk->map[i++]))
> -		if (off == freeme)
> -			break;
> +	int off = 0;
> +	unsigned i, j;
> +	int to_free = 0;
> +	int *p;
> +
> +	freeme |= 1;
> +
> +	i = 0;
> +	j = chunk->map_used;
> +	while (i != j) {
> +		unsigned k = (i + j) / 2;
> +		off = chunk->map[k];
> +		if (off < freeme)
> +			i = k + 1;
> +		else if (off > freeme)
> +			j = k;
> +		else
> +			i = j = k;
> +	}
>  	BUG_ON(off != freeme);
> -	BUG_ON(chunk->map[i] > 0);

A comment explaining why ignoring the free bit during bin search is
okay would be nice?

> @@ -617,7 +647,9 @@ static struct pcpu_chunk *pcpu_alloc_chunk(void)
>  	}
>  
>  	chunk->map_alloc = PCPU_DFL_MAP_ALLOC;
> -	chunk->map[chunk->map_used++] = pcpu_unit_size;
> +	chunk->map[0] = 0;
> +	chunk->map[1] = pcpu_unit_size | 1;
> +	chunk->map_used = 1;
>  
>  	INIT_LIST_HEAD(&chunk->list);
>  	chunk->free_size = pcpu_unit_size;
> @@ -713,6 +745,9 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved)
>  	unsigned long flags;
>  	void __percpu *ptr;
>  
> +	if (unlikely(align < 2))
> +		align = 2;

Please add a comment explaining why the above min alignment is
necessary.

Other than the above, looks good to me.

Thanks.

-- 
tejun


* Re: [PATCH 2/5][RFC][CFT] fold pcpu_split_block() into the only caller
  2014-03-05  3:50 ` [PATCH 2/5][RFC][CFT] fold pcpu_split_block() into the only caller Al Viro
@ 2014-03-06 19:21   ` Tejun Heo
  0 siblings, 0 replies; 18+ messages in thread
From: Tejun Heo @ 2014-03-06 19:21 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, Linus Torvalds, linux-kernel, Stephen Tweedie,
	Jeremy Eder

On Wed, Mar 05, 2014 at 03:50:10AM +0000, Al Viro wrote:
> ... and clean it up a bit
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Oh, you're collapsing pcpu_split_block() here.  Please disregard the
review about the function comment and @size in the previous patch
then.

Thanks.

-- 
tejun


* Re: [PATCH 1/5][RFC][CFT] percpu fixes, part 1
  2014-03-06 19:20   ` Tejun Heo
@ 2014-03-06 20:30     ` Al Viro
  2014-03-06 20:47       ` Tejun Heo
  0 siblings, 1 reply; 18+ messages in thread
From: Al Viro @ 2014-03-06 20:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-fsdevel, Linus Torvalds, linux-kernel, Stephen Tweedie,
	Jeremy Eder

On Thu, Mar 06, 2014 at 02:20:26PM -0500, Tejun Heo wrote:

> Can you please add why this change is necessary to the description?

OK, will do...

> Also, I think it'd be better to split addition of first_free hint to a
> separate patch.

OK, but I'm not sure how much it simplifies things, actually.
 
> > +		chunk->map[++i] = off += size;
> >  }
> 
> Do we need to pass @size in the above function?  Isn't that something
> which can be easily determined?  If @size is gonna stay, we'll need to
> update the function comment too.

It's folded into the caller in the next patch.

> > @@ -483,19 +483,27 @@ static int pcpu_alloc_area(struct pcpu_chunk *chunk, int size, int align)
> >  	int oslot = pcpu_chunk_slot(chunk);
> >  	int max_contig = 0;
> >  	int i, off;
> > +	int seen_free = 0;
> 
> bool

Umm...  Matter of taste, but OK, I'll do that.

> > @@ -570,34 +584,50 @@ static int pcpu_alloc_area(struct pcpu_chunk *chunk, int size, int align)
> >  static void pcpu_free_area(struct pcpu_chunk *chunk, int freeme)
> >  {
> >  	int oslot = pcpu_chunk_slot(chunk);
> > -	int i, off;
> > -
> > -	for (i = 0, off = 0; i < chunk->map_used; off += abs(chunk->map[i++]))
> > -		if (off == freeme)
> > -			break;
> > +	int off = 0;
> > +	unsigned i, j;
> > +	int to_free = 0;
> > +	int *p;
> > +
> > +	freeme |= 1;
> > +
> > +	i = 0;
> > +	j = chunk->map_used;
> > +	while (i != j) {
> > +		unsigned k = (i + j) / 2;
> > +		off = chunk->map[k];
> > +		if (off < freeme)
> > +			i = k + 1;
> > +		else if (off > freeme)
> > +			j = k;
> > +		else
> > +			i = j = k;
> > +	}
> >  	BUG_ON(off != freeme);
> > -	BUG_ON(chunk->map[i] > 0);
> 
> A comment explaining why ignoring the free bit during bin search is
> okay would be nice?

Huh?  We are not ignoring it - we are searching for exact value, including
the lower bit being set.  It might be worth adding a comment next to
"freeme |= 1;" before the loop, but that's it.  These two BUG_ON() fold
nicely - that's one of the reasons why I prefer to keep the offset of
an area and the is_free flag of the same area in one array element.
That's why I prefer to have the first element of the array be <0,false>
or <0,true>, and add <total_size, true> as the sentry at the end.  Sure,
we could keep <offset of the next, is this one free> together instead,
and make that array one element shorter, but that way we get more
complex logic, including that search in freeing...

> > +	if (unlikely(align < 2))
> > +		align = 2;
> 
> Please add a comment explaining why the above min alignment is
> necessary.

Umm...  Will "we want the lowest bit of offset available for free/in_use
indicator" do?


* Re: [PATCH 1/5][RFC][CFT] percpu fixes, part 1
  2014-03-06 20:30     ` Al Viro
@ 2014-03-06 20:47       ` Tejun Heo
  2014-03-07  2:52         ` Al Viro
  0 siblings, 1 reply; 18+ messages in thread
From: Tejun Heo @ 2014-03-06 20:47 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, Linus Torvalds, linux-kernel, Stephen Tweedie,
	Jeremy Eder

Hello, Al.

On Thu, Mar 06, 2014 at 08:30:30PM +0000, Al Viro wrote:
> > Also, I think it'd be better to split addition of first_free hint to a
> > separate patch.
> 
> OK, but I'm not sure how much it simplifies things, actually.

Not much, but it should at least help bisection if something goes
wrong, I think.

> > A comment explaining why ignoring the free bit during bin search is
> > okay would be nice?
> 
> Huh?  We are not ignoring it - we are searching for exact value, including
> the lower bit being set.  It might be worth adding a comment next to
> "freeme |= 1;" before the loop, but that's it.  These two BUG_ON() fold

Oh yeah, nothing too much.  I was just thinking about noting that
looking for the exact match is enough as we're looking for the
matching busy slot and the bit itself can't change ordering between
entries.

> > > +	if (unlikely(align < 2))
> > > +		align = 2;
> > 
> > Please add a comment explaining why the above min alignment is
> > necessary.
> 
> Umm...  Will "we want the lowest bit of offset available for free/in_use
> indicator" do?

Yeap.

Thanks!

-- 
tejun


* Re: [PATCH 1/5][RFC][CFT] percpu fixes, part 1
  2014-03-06 20:47       ` Tejun Heo
@ 2014-03-07  2:52         ` Al Viro
  2014-03-07 12:30           ` Tejun Heo
  2014-03-14 18:45           ` Al Viro
  0 siblings, 2 replies; 18+ messages in thread
From: Al Viro @ 2014-03-07  2:52 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-fsdevel, Linus Torvalds, linux-kernel, Stephen Tweedie,
	Jeremy Eder

[-- Attachment #1: Type: text/plain, Size: 355 bytes --]

On Thu, Mar 06, 2014 at 03:47:26PM -0500, Tejun Heo wrote:

> Not much, but it should at least help bisection if something goes
> wrong, I think.

OK.  It looks better when folding pcpu_split_block() into the caller
is done as the first step.  See the attached 3 patches - the combination
is the same, modulo comment additions and switching seen_free to bool.

[-- Attachment #2: 0001-perpcu-fold-pcpu_split_block-into-the-only-caller.patch --]
[-- Type: text/plain, Size: 3011 bytes --]

>From c58e32c7b5d750ae7caa5369a4a817e66fbd96c2 Mon Sep 17 00:00:00 2001
From: Al Viro <viro@zeniv.linux.org.uk>
Date: Thu, 6 Mar 2014 21:08:24 -0500
Subject: [PATCH 1/3] perpcu: fold pcpu_split_block() into the only caller

... and simplify the results a bit.  Makes the next step easier
to deal with - we will be changing the data representation for
chunk->map[] and it's easier to do if the code in question is
not split between pcpu_alloc_area() and pcpu_split_block().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 mm/percpu.c |   63 +++++++++++++++--------------------------------------------
 1 file changed, 16 insertions(+), 47 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 036cfe0..592f289 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -418,48 +418,6 @@ out_unlock:
 }
 
 /**
- * pcpu_split_block - split a map block
- * @chunk: chunk of interest
- * @i: index of map block to split
- * @head: head size in bytes (can be 0)
- * @tail: tail size in bytes (can be 0)
- *
- * Split the @i'th map block into two or three blocks.  If @head is
- * non-zero, @head bytes block is inserted before block @i moving it
- * to @i+1 and reducing its size by @head bytes.
- *
- * If @tail is non-zero, the target block, which can be @i or @i+1
- * depending on @head, is reduced by @tail bytes and @tail byte block
- * is inserted after the target block.
- *
- * @chunk->map must have enough free slots to accommodate the split.
- *
- * CONTEXT:
- * pcpu_lock.
- */
-static void pcpu_split_block(struct pcpu_chunk *chunk, int i,
-			     int head, int tail)
-{
-	int nr_extra = !!head + !!tail;
-
-	BUG_ON(chunk->map_alloc < chunk->map_used + nr_extra);
-
-	/* insert new subblocks */
-	memmove(&chunk->map[i + nr_extra], &chunk->map[i],
-		sizeof(chunk->map[0]) * (chunk->map_used - i));
-	chunk->map_used += nr_extra;
-
-	if (head) {
-		chunk->map[i + 1] = chunk->map[i] - head;
-		chunk->map[i++] = head;
-	}
-	if (tail) {
-		chunk->map[i++] -= tail;
-		chunk->map[i] = tail;
-	}
-}
-
-/**
  * pcpu_alloc_area - allocate area from a pcpu_chunk
  * @chunk: chunk of interest
  * @size: wanted size in bytes
@@ -524,14 +482,25 @@ static int pcpu_alloc_area(struct pcpu_chunk *chunk, int size, int align)
 
 		/* split if warranted */
 		if (head || tail) {
-			pcpu_split_block(chunk, i, head, tail);
+			int nr_extra = !!head + !!tail;
+
+			/* insert new subblocks */
+			memmove(&chunk->map[i + nr_extra], &chunk->map[i],
+				sizeof(chunk->map[0]) * (chunk->map_used - i));
+			chunk->map_used += nr_extra;
+
 			if (head) {
-				i++;
+				chunk->map[i + 1] = chunk->map[i] - head;
+				chunk->map[i] = head;
 				off += head;
-				max_contig = max(chunk->map[i - 1], max_contig);
+				i++;
+				max_contig = max(head, max_contig);
+			}
+			if (tail) {
+				chunk->map[i] -= tail;
+				chunk->map[i + 1] = tail;
+				max_contig = max(tail, max_contig);
 			}
-			if (tail)
-				max_contig = max(chunk->map[i + 1], max_contig);
 		}
 
 		/* update hint and mark allocated */
-- 
1.7.10.4


[-- Attachment #3: 0002-percpu-store-offsets-instead-of-lengths-in-map.patch --]
[-- Type: text/plain, Size: 9181 bytes --]

>From d9b884b48114b60da0be9bc2fe35f6a5b1456520 Mon Sep 17 00:00:00 2001
From: Al Viro <viro@zeniv.linux.org.uk>
Date: Thu, 6 Mar 2014 21:13:18 -0500
Subject: [PATCH 2/3] percpu: store offsets instead of lengths in ->map[]

Current code keeps +-length for each area in chunk->map[].  It has
several unpleasant consequences:
	* even if we know that the first 50 areas are all in use,
allocation still needs to go through all those areas just to sum their
sizes, just to get the offset of a free one.
	* freeing needs to find the array entry referring to the area
in question; again, that means summing the sizes until we reach the
offset we are interested in.  Note that offsets are monotonic, so a
simple binary search would do here.

	New data representation: an array of <offset, in-use flag> pairs.
Each pair is represented by one int - we use offset|1 for <offset, in use>
and offset for <offset, free> (we make sure that all offsets are even).
At the end we put a sentry entry - <total size, in use>.  The first
entry is <0, flag>; it would be possible to store the flag for the Nth
area together with the offset of the N+1st, but that leads to much
hairier code.

In other words, where the old variant would have
	4, -8, -4, 4, -12, 100
(4 bytes free, 8 in use, 4 in use, 4 free, 12 in use, 100 free) we store
	<0,0>, <4,1>, <12,1>, <16,0>, <20,1>, <32,0>, <132,1>
i.e.
	0, 5, 13, 16, 21, 32, 133

This commit switches to the new data representation and takes care of
a couple of low-hanging fruits in pcpu_free_area() - one is the switch
to binary search, another is not doing two memmove() calls when one
would do.  Speeding the alloc side up (by keeping track of how many
areas in the beginning are known to be all in use) also becomes
possible - that'll be done in the next commit.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 mm/percpu.c |  136 +++++++++++++++++++++++++++++++++++------------------------
 1 file changed, 81 insertions(+), 55 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 592f289..f82cc21 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -102,7 +102,7 @@ struct pcpu_chunk {
 	int			free_size;	/* free bytes in the chunk */
 	int			contig_hint;	/* max contiguous size hint */
 	void			*base_addr;	/* base address of this chunk */
-	int			map_used;	/* # of map entries used */
+	int			map_used;	/* # of map entries used before the sentry */
 	int			map_alloc;	/* # of map entries allocated */
 	int			*map;		/* allocation map */
 	void			*data;		/* chunk data */
@@ -356,11 +356,11 @@ static int pcpu_need_to_extend(struct pcpu_chunk *chunk)
 {
 	int new_alloc;
 
-	if (chunk->map_alloc >= chunk->map_used + 2)
+	if (chunk->map_alloc >= chunk->map_used + 3)
 		return 0;
 
 	new_alloc = PCPU_DFL_MAP_ALLOC;
-	while (new_alloc < chunk->map_used + 2)
+	while (new_alloc < chunk->map_used + 3)
 		new_alloc *= 2;
 
 	return new_alloc;
@@ -441,19 +441,22 @@ static int pcpu_alloc_area(struct pcpu_chunk *chunk, int size, int align)
 	int oslot = pcpu_chunk_slot(chunk);
 	int max_contig = 0;
 	int i, off;
+	int *p;
 
-	for (i = 0, off = 0; i < chunk->map_used; off += abs(chunk->map[i++])) {
-		bool is_last = i + 1 == chunk->map_used;
+	for (i = 0, p = chunk->map; i < chunk->map_used; i++, p++) {
 		int head, tail;
+		int this_size;
+
+		off = *p;
+		if (off & 1)
+			continue;
 
 		/* extra for alignment requirement */
 		head = ALIGN(off, align) - off;
-		BUG_ON(i == 0 && head != 0);
 
-		if (chunk->map[i] < 0)
-			continue;
-		if (chunk->map[i] < head + size) {
-			max_contig = max(chunk->map[i], max_contig);
+		this_size = (p[1] & ~1) - off;
+		if (this_size < head + size) {
+			max_contig = max(this_size, max_contig);
 			continue;
 		}
 
@@ -463,55 +466,50 @@ static int pcpu_alloc_area(struct pcpu_chunk *chunk, int size, int align)
 		 * than sizeof(int), which is very small but isn't too
 		 * uncommon for percpu allocations.
 		 */
-		if (head && (head < sizeof(int) || chunk->map[i - 1] > 0)) {
-			if (chunk->map[i - 1] > 0)
-				chunk->map[i - 1] += head;
-			else {
-				chunk->map[i - 1] -= head;
+		if (head && (head < sizeof(int) || !(p[-1] & 1))) {
+			if (p[-1] & 1)
 				chunk->free_size -= head;
-			}
-			chunk->map[i] -= head;
-			off += head;
+			*p = off += head;
+			this_size -= head;
 			head = 0;
 		}
 
 		/* if tail is small, just keep it around */
-		tail = chunk->map[i] - head - size;
-		if (tail < sizeof(int))
+		tail = this_size - head - size;
+		if (tail < sizeof(int)) {
 			tail = 0;
+			size = this_size - head;
+		}
 
 		/* split if warranted */
 		if (head || tail) {
 			int nr_extra = !!head + !!tail;
 
 			/* insert new subblocks */
-			memmove(&chunk->map[i + nr_extra], &chunk->map[i],
+			memmove(p + nr_extra + 1, p + 1,
 				sizeof(chunk->map[0]) * (chunk->map_used - i));
 			chunk->map_used += nr_extra;
 
 			if (head) {
-				chunk->map[i + 1] = chunk->map[i] - head;
-				chunk->map[i] = head;
-				off += head;
-				i++;
+				*++p = off += head;
+				++i;
 				max_contig = max(head, max_contig);
 			}
 			if (tail) {
-				chunk->map[i] -= tail;
-				chunk->map[i + 1] = tail;
+				p[1] = off + size;
 				max_contig = max(tail, max_contig);
 			}
 		}
 
 		/* update hint and mark allocated */
-		if (is_last)
+		if (i + 1 == chunk->map_used)
 			chunk->contig_hint = max_contig; /* fully scanned */
 		else
 			chunk->contig_hint = max(chunk->contig_hint,
 						 max_contig);
 
-		chunk->free_size -= chunk->map[i];
-		chunk->map[i] = -chunk->map[i];
+		chunk->free_size -= size;
+		*p |= 1;
 
 		pcpu_chunk_relocate(chunk, oslot);
 		return off;
@@ -539,34 +537,47 @@ static int pcpu_alloc_area(struct pcpu_chunk *chunk, int size, int align)
 static void pcpu_free_area(struct pcpu_chunk *chunk, int freeme)
 {
 	int oslot = pcpu_chunk_slot(chunk);
-	int i, off;
-
-	for (i = 0, off = 0; i < chunk->map_used; off += abs(chunk->map[i++]))
-		if (off == freeme)
-			break;
+	int off = 0;
+	unsigned i, j;
+	int to_free = 0;
+	int *p;
+
+	freeme |= 1;	/* we are searching for <given offset, in use> pair */
+
+	i = 0;
+	j = chunk->map_used;
+	while (i != j) {
+		unsigned k = (i + j) / 2;
+		off = chunk->map[k];
+		if (off < freeme)
+			i = k + 1;
+		else if (off > freeme)
+			j = k;
+		else
+			i = j = k;
+	}
 	BUG_ON(off != freeme);
-	BUG_ON(chunk->map[i] > 0);
 
-	chunk->map[i] = -chunk->map[i];
-	chunk->free_size += chunk->map[i];
+	p = chunk->map + i;
+	*p = off &= ~1;
+	chunk->free_size += (p[1] & ~1) - off;
 
+	/* merge with next? */
+	if (!(p[1] & 1))
+		to_free++;
 	/* merge with previous? */
-	if (i > 0 && chunk->map[i - 1] >= 0) {
-		chunk->map[i - 1] += chunk->map[i];
-		chunk->map_used--;
-		memmove(&chunk->map[i], &chunk->map[i + 1],
-			(chunk->map_used - i) * sizeof(chunk->map[0]));
+	if (i > 0 && !(p[-1] & 1)) {
+		to_free++;
 		i--;
+		p--;
 	}
-	/* merge with next? */
-	if (i + 1 < chunk->map_used && chunk->map[i + 1] >= 0) {
-		chunk->map[i] += chunk->map[i + 1];
-		chunk->map_used--;
-		memmove(&chunk->map[i + 1], &chunk->map[i + 2],
-			(chunk->map_used - (i + 1)) * sizeof(chunk->map[0]));
+	if (to_free) {
+		chunk->map_used -= to_free;
+		memmove(p + 1, p + 1 + to_free, 
+			(chunk->map_used - i) * sizeof(chunk->map[0]));
 	}
 
-	chunk->contig_hint = max(chunk->map[i], chunk->contig_hint);
+	chunk->contig_hint = max(chunk->map[i + 1] - chunk->map[i] - 1, chunk->contig_hint);
 	pcpu_chunk_relocate(chunk, oslot);
 }
 
@@ -586,7 +597,9 @@ static struct pcpu_chunk *pcpu_alloc_chunk(void)
 	}
 
 	chunk->map_alloc = PCPU_DFL_MAP_ALLOC;
-	chunk->map[chunk->map_used++] = pcpu_unit_size;
+	chunk->map[0] = 0;
+	chunk->map[1] = pcpu_unit_size | 1;
+	chunk->map_used = 1;
 
 	INIT_LIST_HEAD(&chunk->list);
 	chunk->free_size = pcpu_unit_size;
@@ -682,6 +695,13 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved)
 	unsigned long flags;
 	void __percpu *ptr;
 
+	/*
+	 * We want the lowest bit of offset available for in-use/free
+	 * indicator.
+	 */
+	if (unlikely(align < 2))
+		align = 2;
+
 	if (unlikely(!size || size > PCPU_MIN_UNIT_SIZE || align > PAGE_SIZE)) {
 		WARN(true, "illegal size (%zu) or align (%zu) for "
 		     "percpu allocation\n", size, align);
@@ -1312,9 +1332,13 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
 	}
 	schunk->contig_hint = schunk->free_size;
 
-	schunk->map[schunk->map_used++] = -ai->static_size;
+	schunk->map[0] = 1;
+	schunk->map[1] = ai->static_size;
+	schunk->map_used = 1;
 	if (schunk->free_size)
-		schunk->map[schunk->map_used++] = schunk->free_size;
+		schunk->map[++schunk->map_used] = 1 | (ai->static_size + schunk->free_size);
+	else
+		schunk->map[1] |= 1;
 
 	/* init dynamic chunk if necessary */
 	if (dyn_size) {
@@ -1327,8 +1351,10 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
 		bitmap_fill(dchunk->populated, pcpu_unit_pages);
 
 		dchunk->contig_hint = dchunk->free_size = dyn_size;
-		dchunk->map[dchunk->map_used++] = -pcpu_reserved_chunk_limit;
-		dchunk->map[dchunk->map_used++] = dchunk->free_size;
+		dchunk->map[0] = 1;
+		dchunk->map[1] = pcpu_reserved_chunk_limit;
+		dchunk->map[2] = (pcpu_reserved_chunk_limit + dchunk->free_size) | 1;
+		dchunk->map_used = 2;
 	}
 
 	/* link the first chunk in */
-- 
1.7.10.4


[-- Attachment #4: 0003-percpu-speed-alloc_pcpu_area-up.patch --]
[-- Type: text/plain, Size: 2485 bytes --]

>From 76ffa9a3ed635a860a1b7de612f2552fbf46a775 Mon Sep 17 00:00:00 2001
From: Al Viro <viro@zeniv.linux.org.uk>
Date: Thu, 6 Mar 2014 20:52:32 -0500
Subject: [PATCH 3/3] percpu: speed alloc_pcpu_area() up

If we know that the first N areas are all in use, we can obviously skip
them when searching for a free one.  And that kind of hint is very
easy to maintain.
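
A rough standalone sketch of the invariant (simplified, with made-up
names - the real code below threads it through pcpu_alloc_area() and
pcpu_free_area()):

	/* map[] as in the previous patch: offset|1 in use, offset free;
	 * first_free: no free area lives at an index below this one.
	 */
	static int first_free;

	static int find_free(const int *map, int used)
	{
		int i;

		for (i = first_free; i < used; i++)
			if (!(map[i] & 1))
				return first_free = i;
		return -1;
	}

	static void note_freed(int i)
	{
		if (i < first_free)
			first_free = i;
	}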

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 mm/percpu.c |   18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index f82cc21..5bb658a 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -106,6 +106,7 @@ struct pcpu_chunk {
 	int			map_alloc;	/* # of map entries allocated */
 	int			*map;		/* allocation map */
 	void			*data;		/* chunk data */
+	int			first_free;	/* no free below this */
 	bool			immutable;	/* no [de]population allowed */
 	unsigned long		populated[];	/* populated bitmap */
 };
@@ -441,9 +442,10 @@ static int pcpu_alloc_area(struct pcpu_chunk *chunk, int size, int align)
 	int oslot = pcpu_chunk_slot(chunk);
 	int max_contig = 0;
 	int i, off;
+	bool seen_free = false;
 	int *p;
 
-	for (i = 0, p = chunk->map; i < chunk->map_used; i++, p++) {
+	for (i = chunk->first_free, p = chunk->map + i; i < chunk->map_used; i++, p++) {
 		int head, tail;
 		int this_size;
 
@@ -456,6 +458,10 @@ static int pcpu_alloc_area(struct pcpu_chunk *chunk, int size, int align)
 
 		this_size = (p[1] & ~1) - off;
 		if (this_size < head + size) {
+			if (!seen_free) {
+				chunk->first_free = i;
+				seen_free = true;
+			}
 			max_contig = max(this_size, max_contig);
 			continue;
 		}
@@ -491,6 +497,10 @@ static int pcpu_alloc_area(struct pcpu_chunk *chunk, int size, int align)
 			chunk->map_used += nr_extra;
 
 			if (head) {
+				if (!seen_free) {
+					chunk->first_free = i;
+					seen_free = true;
+				}
 				*++p = off += head;
 				++i;
 				max_contig = max(head, max_contig);
@@ -501,6 +511,9 @@ static int pcpu_alloc_area(struct pcpu_chunk *chunk, int size, int align)
 			}
 		}
 
+		if (!seen_free)
+			chunk->first_free = i + 1;
+
 		/* update hint and mark allocated */
 		if (i + 1 == chunk->map_used)
 			chunk->contig_hint = max_contig; /* fully scanned */
@@ -558,6 +571,9 @@ static void pcpu_free_area(struct pcpu_chunk *chunk, int freeme)
 	}
 	BUG_ON(off != freeme);
 
+	if (i < chunk->first_free)
+		chunk->first_free = i;
+
 	p = chunk->map + i;
 	*p = off &= ~1;
 	chunk->free_size += (p[1] & ~1) - off;
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/5][RFC][CFT] percpu fixes, part 1
  2014-03-07  2:52         ` Al Viro
@ 2014-03-07 12:30           ` Tejun Heo
  2014-03-14 18:45           ` Al Viro
  1 sibling, 0 replies; 18+ messages in thread
From: Tejun Heo @ 2014-03-07 12:30 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, Linus Torvalds, linux-kernel, Stephen Tweedie,
	Jeremy Eder

On Fri, Mar 07, 2014 at 02:52:06AM +0000, Al Viro wrote:
> On Thu, Mar 06, 2014 at 03:47:26PM -0500, Tejun Heo wrote:
> 
> > Not much, but it should at least help bisection if something goes
> > wrong, I think.
> 
> OK.  It looks better when folding pcpu_split_block() into the caller
> is done as the first step.  See the attached 3 patches - combination
> is the same, modulo comment addition and switching seen_free to bool.

All look good to me.  Applying them to percpu/for-3.15.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 5/5][RFC][CFT] resizable namespace.c hashes
  2014-03-05  3:52 ` [PATCH 5/5][RFC][CFT] resizable namespace.c hashes Al Viro
@ 2014-03-07 17:17   ` Andi Kleen
  2014-03-07 18:38     ` Al Viro
  0 siblings, 1 reply; 18+ messages in thread
From: Andi Kleen @ 2014-03-07 17:17 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, Linus Torvalds, linux-kernel, Stephen Tweedie,
	Jeremy Eder

Al Viro <viro@ZenIV.linux.org.uk> writes:

> * switch allocation to alloc_large_system_hash()
> * make sizes overridable by boot parameters (mhash_entries=, mphash_entries=)
> * switch mountpoint_hashtable from list_head to hlist_head

So how much memory does this use on a standard system (<4GB memory)?
How much memory does it use on a large system (0.5TB)?

How good is your hash function? Would jhash be more appropriate
and allow smaller hash tables?

Perhaps just want a tree here.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 5/5][RFC][CFT] resizable namespace.c hashes
  2014-03-07 17:17   ` Andi Kleen
@ 2014-03-07 18:38     ` Al Viro
  0 siblings, 0 replies; 18+ messages in thread
From: Al Viro @ 2014-03-07 18:38 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-fsdevel, Linus Torvalds, linux-kernel, Stephen Tweedie,
	Jeremy Eder

On Fri, Mar 07, 2014 at 09:17:19AM -0800, Andi Kleen wrote:
> Al Viro <viro@ZenIV.linux.org.uk> writes:
> 
> > * switch allocation to alloc_large_system_hash()
> > * make sizes overridable by boot parameters (mhash_entries=, mphash_entries=)
> > * switch mountpoint_hashtable from list_head to hlist_head
> 
> So how much memory does this use on a standard system (<4GB memory)?

Two hash chains per megabyte.  IOW, 2GB => 4096 of them.  Could drive
it lower, probably, but I'm a bit nervous about _really_ low-end
boxen.  Right now it matches your variant on a 128MB box.

> How much memory does it use on a large system (0.5TB)?

Probably over the top - alloc_large_system_hash() has that problem
in general.

0.5TB would amount to 2^20 hash chains.  Which is probably too much,
but then the dentry hash on the same box will be 2^26 hash chains and
the inode hash 2^25 chains.
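
Spelling the sizing out (an illustration - the actual
alloc_large_system_hash() call isn't quoted here, and the scale value
is inferred from the numbers above):

	#include <stdio.h>

	/* roughly what alloc_large_system_hash() ends up doing for this
	 * table: one hash chain per 2^scale bytes of RAM, rounded up to
	 * a power of two; scale 19 (512KB) gives "two chains per megabyte".
	 */
	static unsigned long long chains(unsigned long long mem_bytes, int scale)
	{
		unsigned long long n = mem_bytes >> scale, p = 1;

		while (p < n)
			p <<= 1;
		return p;
	}

	int main(void)
	{
		printf("%llu\n", chains(128ULL << 20, 19));	/* 128MB -> 256  */
		printf("%llu\n", chains(2ULL << 30, 19));	/* 2GB   -> 4096 */
		printf("%llu\n", chains(1ULL << 39, 19));	/* 0.5TB -> 2^20 */
		return 0;
	}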

> How good is your hash function? Would jhash be more appropriate
> and allow smaller hash tables?

How the hell does the hash function affect the average chain length?  And
yes, the chains are pretty evenly distributed.  And while we are at it,
what the hell does jhash have to do with the whole thing?  No strings
involved - the hash key is a pair of pointers to objects allocated
by kmem_cache_alloc(), so their middle bits are already fairly random.
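
A rough sketch of hashing such a pointer pair (illustrative only - the
exact function from the patch isn't quoted here, and the names and the
cache-line constant below are assumptions):

	#include <stdio.h>

	#define L1_CACHE_BYTES	64	/* assumed for the example */

	static unsigned long m_hash(void *mnt, void *dentry,
				    unsigned int shift, unsigned long mask)
	{
		unsigned long tmp = (unsigned long)mnt / L1_CACHE_BYTES;

		tmp += (unsigned long)dentry / L1_CACHE_BYTES;
		tmp += tmp >> shift;	/* fold the high bits back in */
		return tmp & mask;	/* chain index */
	}

	int main(void)
	{
		int a, b;	/* stand-ins for the two hashed objects */

		printf("chain %lu\n", m_hash(&a, &b, 12, 4095));
		return 0;
	}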

> Perhaps just want a tree here.

_What_ tree?  Andi, WTF are you talking about?  That hash is a mapping
from (parent vfsmount, mountpoint dentry) to child vfsmount.  Sure,
we do have the mount tree - all children of given vfsmount are on a
cyclic list anchored in it.  And iterating through those is painfully
slow on realistic setups.  Have a bunch of nfs4 referrals on one fs,
and you are welcome to a hundred vfsmounts on the child list of one.
Create a bunch of bindings in assorted places in /usr, have *any*
mountpoint in /usr pay the price when we cross it on pathname resolution.
Same with /, since there tends to be a bunch of mounts on directories
in the root filesystem.  FWIW, on the box I'm typing at right now - /proc,
/sys, /dev, /run, /tmp, /home, /usr, /var.   8 of them.  On the box I'm
sshed into: /proc, /sys, /dev, /run, /usr, /media, /var, /tmp, /boot,
/archive - 10.  And traversing that would be the price on *any* pathname
resolution trying to cross from root to e.g. /home or /usr.

To get an equally bad behaviour on the current setup (even with the
ridiculously small hash table size set by your patch back in 2002),
you'd need ~2000 mounts.  And it very easily degrades even more -
consider e.g. a box that has a shitload of stuff automounted under
/home.  It can easily go well past 10 simultaneous ones...

We could keep a per-mountpoint list, anchored in dentry.  And pay with
an extra pointer in each struct dentry out there.  Which will cost a lot
more.  Or, we could anchor them in struct mountpoint and do hash lookup
*and* list search on each mountpoint crossing - one to find struct mountpoint
by struct dentry, another to look for the vfsmount with the right parent.
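
Roughly, the two layouts being weighed (made-up names, list heads reduced
to bare pointers - the real structures differ):

	struct mount_like;

	struct dentry_like {
		/* ...all the existing struct dentry fields... */
		struct mount_like *mounted_here;  /* option 1: paid for in EVERY dentry */
	};

	struct mountpoint_like {                  /* option 2: found via hash on the dentry */
		struct dentry_like *m_dentry;
		struct mount_like *m_list;        /* then searched for the right parent mount */
	};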

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/5][RFC][CFT] percpu fixes, part 1
  2014-03-07  2:52         ` Al Viro
  2014-03-07 12:30           ` Tejun Heo
@ 2014-03-14 18:45           ` Al Viro
  2014-03-14 18:47             ` Tejun Heo
  1 sibling, 1 reply; 18+ messages in thread
From: Al Viro @ 2014-03-14 18:45 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-fsdevel, Linus Torvalds, linux-kernel, Stephen Tweedie,
	Jeremy Eder

There's a missing piece in 2/3 (percpu: store offsets instead of lengths in
->map[]) - we need to round size up to multiple of 2 in pcpu_alloc().
Updated patch follows; if you want an incremental instead of replacement -
please, yell.
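
One way an odd size bites, going by the tail-split path in the patch
below (annotation added here, not a quote from the discussion):

	/* tail is the free remainder left after the allocated area */
	p[1] = off + size;	/* off is always even; an odd size sets the
				 * low bit here, i.e. records the free tail
				 * as in use */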

commit 0b0fc67ae4b574ec35c37b508b84329b99d07bf7
Author: Al Viro <viro@zeniv.linux.org.uk>
Date:   Thu Mar 6 21:13:18 2014 -0500

    percpu: store offsets instead of lengths in ->map[]
    
    Current code keeps +-length for each area in chunk->map[].  It has
    several unpleasant consequences:
    	* even if we know that the first 50 areas are all in use, allocation
    still needs to walk through all of them just to sum their sizes and
    get the offset of a free one.
    	* freeing needs to find the array entry referring to the area
    in question; again, that means summing the sizes until we reach the
    offset we are interested in.  Note that offsets are monotonically
    increasing, so a simple binary search would do here.
    
    	New data representation: array of <offset,in-use flag> pairs.
    Each pair is represented by one int - we use offset|1 for <offset, in use>
    and offset for <offset, free> (we make sure that all offsets are even).
    In the end we put a sentry entry - <total size, in use>.  The first
    entry is <0, flag>; it would be possible to store together the flag
    for Nth area and offset for N+1st, but that leads to much hairier code.
    
    In other words, where the old variant would have
    	4, -8, -4, 4, -12, 100
    (4 bytes free, 8 in use, 4 in use, 4 free, 12 in use, 100 free) we store
    	<0,0>, <4,1>, <12,1>, <16,0>, <20,1>, <32,0>, <132,1>
    i.e.
    	0, 5, 13, 16, 21, 32, 133
    
    This commit switches to the new data representation and takes care of a
    couple of low-hanging fruits in pcpu_free_area() - one is the switch to
    binary search, another is not doing two memmove() when one would do.
    Speeding the alloc side up (by keeping track of how many areas in the
    beginning are known to be all in use) also becomes possible - that'll be
    done in the next commit.
    
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

diff --git a/mm/percpu.c b/mm/percpu.c
index 592f289..648ccbf 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -102,7 +102,7 @@ struct pcpu_chunk {
 	int			free_size;	/* free bytes in the chunk */
 	int			contig_hint;	/* max contiguous size hint */
 	void			*base_addr;	/* base address of this chunk */
-	int			map_used;	/* # of map entries used */
+	int			map_used;	/* # of map entries used before the sentry */
 	int			map_alloc;	/* # of map entries allocated */
 	int			*map;		/* allocation map */
 	void			*data;		/* chunk data */
@@ -356,11 +356,11 @@ static int pcpu_need_to_extend(struct pcpu_chunk *chunk)
 {
 	int new_alloc;
 
-	if (chunk->map_alloc >= chunk->map_used + 2)
+	if (chunk->map_alloc >= chunk->map_used + 3)
 		return 0;
 
 	new_alloc = PCPU_DFL_MAP_ALLOC;
-	while (new_alloc < chunk->map_used + 2)
+	while (new_alloc < chunk->map_used + 3)
 		new_alloc *= 2;
 
 	return new_alloc;
@@ -441,19 +441,22 @@ static int pcpu_alloc_area(struct pcpu_chunk *chunk, int size, int align)
 	int oslot = pcpu_chunk_slot(chunk);
 	int max_contig = 0;
 	int i, off;
+	int *p;
 
-	for (i = 0, off = 0; i < chunk->map_used; off += abs(chunk->map[i++])) {
-		bool is_last = i + 1 == chunk->map_used;
+	for (i = 0, p = chunk->map; i < chunk->map_used; i++, p++) {
 		int head, tail;
+		int this_size;
+
+		off = *p;
+		if (off & 1)
+			continue;
 
 		/* extra for alignment requirement */
 		head = ALIGN(off, align) - off;
-		BUG_ON(i == 0 && head != 0);
 
-		if (chunk->map[i] < 0)
-			continue;
-		if (chunk->map[i] < head + size) {
-			max_contig = max(chunk->map[i], max_contig);
+		this_size = (p[1] & ~1) - off;
+		if (this_size < head + size) {
+			max_contig = max(this_size, max_contig);
 			continue;
 		}
 
@@ -463,55 +466,50 @@ static int pcpu_alloc_area(struct pcpu_chunk *chunk, int size, int align)
 		 * than sizeof(int), which is very small but isn't too
 		 * uncommon for percpu allocations.
 		 */
-		if (head && (head < sizeof(int) || chunk->map[i - 1] > 0)) {
-			if (chunk->map[i - 1] > 0)
-				chunk->map[i - 1] += head;
-			else {
-				chunk->map[i - 1] -= head;
+		if (head && (head < sizeof(int) || !(p[-1] & 1))) {
+			if (p[-1] & 1)
 				chunk->free_size -= head;
-			}
-			chunk->map[i] -= head;
-			off += head;
+			*p = off += head;
+			this_size -= head;
 			head = 0;
 		}
 
 		/* if tail is small, just keep it around */
-		tail = chunk->map[i] - head - size;
-		if (tail < sizeof(int))
+		tail = this_size - head - size;
+		if (tail < sizeof(int)) {
 			tail = 0;
+			size = this_size - head;
+		}
 
 		/* split if warranted */
 		if (head || tail) {
 			int nr_extra = !!head + !!tail;
 
 			/* insert new subblocks */
-			memmove(&chunk->map[i + nr_extra], &chunk->map[i],
+			memmove(p + nr_extra + 1, p + 1,
 				sizeof(chunk->map[0]) * (chunk->map_used - i));
 			chunk->map_used += nr_extra;
 
 			if (head) {
-				chunk->map[i + 1] = chunk->map[i] - head;
-				chunk->map[i] = head;
-				off += head;
-				i++;
+				*++p = off += head;
+				++i;
 				max_contig = max(head, max_contig);
 			}
 			if (tail) {
-				chunk->map[i] -= tail;
-				chunk->map[i + 1] = tail;
+				p[1] = off + size;
 				max_contig = max(tail, max_contig);
 			}
 		}
 
 		/* update hint and mark allocated */
-		if (is_last)
+		if (i + 1 == chunk->map_used)
 			chunk->contig_hint = max_contig; /* fully scanned */
 		else
 			chunk->contig_hint = max(chunk->contig_hint,
 						 max_contig);
 
-		chunk->free_size -= chunk->map[i];
-		chunk->map[i] = -chunk->map[i];
+		chunk->free_size -= size;
+		*p |= 1;
 
 		pcpu_chunk_relocate(chunk, oslot);
 		return off;
@@ -539,34 +537,47 @@ static int pcpu_alloc_area(struct pcpu_chunk *chunk, int size, int align)
 static void pcpu_free_area(struct pcpu_chunk *chunk, int freeme)
 {
 	int oslot = pcpu_chunk_slot(chunk);
-	int i, off;
-
-	for (i = 0, off = 0; i < chunk->map_used; off += abs(chunk->map[i++]))
-		if (off == freeme)
-			break;
+	int off = 0;
+	unsigned i, j;
+	int to_free = 0;
+	int *p;
+
+	freeme |= 1;	/* we are searching for <given offset, in use> pair */
+
+	i = 0;
+	j = chunk->map_used;
+	while (i != j) {
+		unsigned k = (i + j) / 2;
+		off = chunk->map[k];
+		if (off < freeme)
+			i = k + 1;
+		else if (off > freeme)
+			j = k;
+		else
+			i = j = k;
+	}
 	BUG_ON(off != freeme);
-	BUG_ON(chunk->map[i] > 0);
 
-	chunk->map[i] = -chunk->map[i];
-	chunk->free_size += chunk->map[i];
+	p = chunk->map + i;
+	*p = off &= ~1;
+	chunk->free_size += (p[1] & ~1) - off;
 
+	/* merge with next? */
+	if (!(p[1] & 1))
+		to_free++;
 	/* merge with previous? */
-	if (i > 0 && chunk->map[i - 1] >= 0) {
-		chunk->map[i - 1] += chunk->map[i];
-		chunk->map_used--;
-		memmove(&chunk->map[i], &chunk->map[i + 1],
-			(chunk->map_used - i) * sizeof(chunk->map[0]));
+	if (i > 0 && !(p[-1] & 1)) {
+		to_free++;
 		i--;
+		p--;
 	}
-	/* merge with next? */
-	if (i + 1 < chunk->map_used && chunk->map[i + 1] >= 0) {
-		chunk->map[i] += chunk->map[i + 1];
-		chunk->map_used--;
-		memmove(&chunk->map[i + 1], &chunk->map[i + 2],
-			(chunk->map_used - (i + 1)) * sizeof(chunk->map[0]));
+	if (to_free) {
+		chunk->map_used -= to_free;
+		memmove(p + 1, p + 1 + to_free, 
+			(chunk->map_used - i) * sizeof(chunk->map[0]));
 	}
 
-	chunk->contig_hint = max(chunk->map[i], chunk->contig_hint);
+	chunk->contig_hint = max(chunk->map[i + 1] - chunk->map[i] - 1, chunk->contig_hint);
 	pcpu_chunk_relocate(chunk, oslot);
 }
 
@@ -586,7 +597,9 @@ static struct pcpu_chunk *pcpu_alloc_chunk(void)
 	}
 
 	chunk->map_alloc = PCPU_DFL_MAP_ALLOC;
-	chunk->map[chunk->map_used++] = pcpu_unit_size;
+	chunk->map[0] = 0;
+	chunk->map[1] = pcpu_unit_size | 1;
+	chunk->map_used = 1;
 
 	INIT_LIST_HEAD(&chunk->list);
 	chunk->free_size = pcpu_unit_size;
@@ -682,6 +695,16 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved)
 	unsigned long flags;
 	void __percpu *ptr;
 
+	/*
+	 * We want the lowest bit of offset available for in-use/free
+	 * indicator.
+	 */
+	if (unlikely(align < 2))
+		align = 2;
+
+	if (unlikely(size & 1))
+		size++;
+
 	if (unlikely(!size || size > PCPU_MIN_UNIT_SIZE || align > PAGE_SIZE)) {
 		WARN(true, "illegal size (%zu) or align (%zu) for "
 		     "percpu allocation\n", size, align);
@@ -1312,9 +1335,13 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
 	}
 	schunk->contig_hint = schunk->free_size;
 
-	schunk->map[schunk->map_used++] = -ai->static_size;
+	schunk->map[0] = 1;
+	schunk->map[1] = ai->static_size;
+	schunk->map_used = 1;
 	if (schunk->free_size)
-		schunk->map[schunk->map_used++] = schunk->free_size;
+		schunk->map[++schunk->map_used] = 1 | (ai->static_size + schunk->free_size);
+	else
+		schunk->map[1] |= 1;
 
 	/* init dynamic chunk if necessary */
 	if (dyn_size) {
@@ -1327,8 +1354,10 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
 		bitmap_fill(dchunk->populated, pcpu_unit_pages);
 
 		dchunk->contig_hint = dchunk->free_size = dyn_size;
-		dchunk->map[dchunk->map_used++] = -pcpu_reserved_chunk_limit;
-		dchunk->map[dchunk->map_used++] = dchunk->free_size;
+		dchunk->map[0] = 1;
+		dchunk->map[1] = pcpu_reserved_chunk_limit;
+		dchunk->map[2] = (pcpu_reserved_chunk_limit + dchunk->free_size) | 1;
+		dchunk->map_used = 2;
 	}
 
 	/* link the first chunk in */

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/5][RFC][CFT] percpu fixes, part 1
  2014-03-14 18:45           ` Al Viro
@ 2014-03-14 18:47             ` Tejun Heo
  2014-03-14 18:53               ` Al Viro
  0 siblings, 1 reply; 18+ messages in thread
From: Tejun Heo @ 2014-03-14 18:47 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, Linus Torvalds, linux-kernel, Stephen Tweedie,
	Jeremy Eder

On Fri, Mar 14, 2014 at 06:45:45PM +0000, Al Viro wrote:
> There's a missing piece in 2/3 (percpu: store offsets instead of lengths in
> ->map[]) - we need to round size up to multiple of 2 in pcpu_alloc().
> Updated patch follows; if you want an incremental instead of replacement -
> please, yell.

Yeah, I'd prefer an incremental patch.  Also, can you please update
the comment above alignment / size updates with a bit more detail?

Thanks!

-- 
tejun

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/5][RFC][CFT] percpu fixes, part 1
  2014-03-14 18:47             ` Tejun Heo
@ 2014-03-14 18:53               ` Al Viro
  2014-03-17 20:12                 ` [PATCH percpu/for-3.15] percpu: allocation size should be even Tejun Heo
  0 siblings, 1 reply; 18+ messages in thread
From: Al Viro @ 2014-03-14 18:53 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-fsdevel, Linus Torvalds, linux-kernel, Stephen Tweedie,
	Jeremy Eder

On Fri, Mar 14, 2014 at 02:47:45PM -0400, Tejun Heo wrote:
> On Fri, Mar 14, 2014 at 06:45:45PM +0000, Al Viro wrote:
> > There's a missing piece in 2/3 (percpu: store offsets instead of lengths in
> > ->map[]) - we need to round size up to multiple of 2 in pcpu_alloc().
> > Updated patch follows; if you want an incremental instead of replacement -
> > please, yell.
> 
> Yeah, I'd prefer an incremental patch.  Also, can you please update
> the comment above alignment / size updates with a bit more detail?
> 
> Thanks!

Done; incremental follows:

diff --git a/mm/percpu.c b/mm/percpu.c
index 5bb658a..65cab3b 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -713,11 +713,14 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved)
 
 	/*
 	 * We want the lowest bit of offset available for in-use/free
-	 * indicator.
+	 * indicator, so force >= 16bit alignment and make size even.
 	 */
 	if (unlikely(align < 2))
 		align = 2;
 
+	if (unlikely(size & 1))
+		size++;
+
 	if (unlikely(!size || size > PCPU_MIN_UNIT_SIZE || align > PAGE_SIZE)) {
 		WARN(true, "illegal size (%zu) or align (%zu) for "
 		     "percpu allocation\n", size, align);

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH percpu/for-3.15] percpu: allocation size should be even
  2014-03-14 18:53               ` Al Viro
@ 2014-03-17 20:12                 ` Tejun Heo
  0 siblings, 0 replies; 18+ messages in thread
From: Tejun Heo @ 2014-03-17 20:12 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, Linus Torvalds, linux-kernel, Stephen Tweedie,
	Jeremy Eder

>From 2f69fa829cb4ca062aaffee9ab9eb44484db75b1 Mon Sep 17 00:00:00 2001
From: Viro <viro@ZenIV.linux.org.uk>
Date: Mon, 17 Mar 2014 16:01:27 -0400

723ad1d90b56 ("percpu: store offsets instead of lengths in ->map[]")
updated percpu area allocator to use the lowest bit, instead of sign,
to signify whether the area is occupied and forced min align to 2;
unfortunately, it forgot to force the allocation size to be even
causing malfunctions for the very rare odd-sized allocations.

Always force the allocations to be even sized.

tj: Wrote patch description.

Original-patch-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
Applied to percpu/for-3.15.

Thanks.

 mm/percpu.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index c7206d0..202e104 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -713,11 +713,14 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved)
 
 	/*
 	 * We want the lowest bit of offset available for in-use/free
-	 * indicator.
+	 * indicator, so force >= 16bit alignment and make size even.
 	 */
 	if (unlikely(align < 2))
 		align = 2;
 
+	if (unlikely(size & 1))
+		size++;
+
 	if (unlikely(!size || size > PCPU_MIN_UNIT_SIZE || align > PAGE_SIZE)) {
 		WARN(true, "illegal size (%zu) or align (%zu) for "
 		     "percpu allocation\n", size, align);
-- 
1.8.5.3


^ permalink raw reply related	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2014-03-17 20:12 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-03-05  3:47 [PATCHES][RFC][CFT] scalability fixes for shitloads of mounts Al Viro
2014-03-05  3:49 ` [PATCH 1/5][RFC][CFT] percpu fixes, part 1 Al Viro
2014-03-06 19:20   ` Tejun Heo
2014-03-06 20:30     ` Al Viro
2014-03-06 20:47       ` Tejun Heo
2014-03-07  2:52         ` Al Viro
2014-03-07 12:30           ` Tejun Heo
2014-03-14 18:45           ` Al Viro
2014-03-14 18:47             ` Tejun Heo
2014-03-14 18:53               ` Al Viro
2014-03-17 20:12                 ` [PATCH percpu/for-3.15] percpu: allocation size should be even Tejun Heo
2014-03-05  3:50 ` [PATCH 2/5][RFC][CFT] fold pcpu_split_block() into the only caller Al Viro
2014-03-06 19:21   ` Tejun Heo
2014-03-05  3:51 ` [PATCH 3/5][RFC][CFT] smarter propagate_mnt() Al Viro
2014-03-05  3:51 ` [PATCH 4/5][RFC][CFT] reduce m_start() cost Al Viro
2014-03-05  3:52 ` [PATCH 5/5][RFC][CFT] resizable namespace.c hashes Al Viro
2014-03-07 17:17   ` Andi Kleen
2014-03-07 18:38     ` Al Viro
