All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH] Enhance AF_PACKET implementation to not require high order contiguous memory allocation
       [not found] <E1PAVIx-0001qL-EB@smtp.tuxdriver.com>
@ 2010-10-25 22:30 ` Eric Dumazet
  2010-10-25 23:35   ` Neil Horman
  0 siblings, 1 reply; 9+ messages in thread
From: Eric Dumazet @ 2010-10-25 22:30 UTC (permalink / raw)
  To: nhorman; +Cc: netdev, davem, jpirko

Le lundi 25 octobre 2010 à 18:14 -0400, nhorman@tuxdriver.com a écrit :
> I think I remember those changes and IIrc yes,  tcpdump will make
> several attempts to get buffers of an appropriate size.  But while it
> tries to do that it bogs the system trying to write out pagecahe,
> swap, etc.  And that activity doesn't guarantee success.  His does
> either, but getting 5 order 0 pages is far easier and less intrusive
> to a loaded system than trying to get 1 order 4 chunk.  That's all I'm
> trying to accomplish here.  Just making it easier to use af_packet
> sockets without interfering with system performance
> 

Actually, using vmalloc() would probably hurt performance, because of
extra TLB pressure.

Of course, on recent x86 hardware you dont notice that much...

If not, why af_packet would use such convoluted double array of
'compound pages' ?

Also, on x86_32, vmalloc()/vmap() space is small (128 MB) so you might
exhaust it pretty fast with several sniffers running.

I would try a two level thing : Try to get high order pages, and
fallback on low order pages, but normally libpcap does this for us ?




^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] Enhance AF_PACKET implementation to not require high order contiguous memory allocation
  2010-10-25 22:30 ` [PATCH] Enhance AF_PACKET implementation to not require high order contiguous memory allocation Eric Dumazet
@ 2010-10-25 23:35   ` Neil Horman
  2010-10-25 23:46     ` David Miller
  0 siblings, 1 reply; 9+ messages in thread
From: Neil Horman @ 2010-10-25 23:35 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, davem, jpirko

On Tue, Oct 26, 2010 at 12:30:56AM +0200, Eric Dumazet wrote:
> Le lundi 25 octobre 2010 à 18:14 -0400, nhorman@tuxdriver.com a écrit :
> > I think I remember those changes and IIrc yes,  tcpdump will make
> > several attempts to get buffers of an appropriate size.  But while it
> > tries to do that it bogs the system trying to write out pagecahe,
> > swap, etc.  And that activity doesn't guarantee success.  His does
> > either, but getting 5 order 0 pages is far easier and less intrusive
> > to a loaded system than trying to get 1 order 4 chunk.  That's all I'm
> > trying to accomplish here.  Just making it easier to use af_packet
> > sockets without interfering with system performance
> > 
> 
> Actually, using vmalloc() would probably hurt performance, because of
> extra TLB pressure.
> 
> Of course, on recent x86 hardware you dont notice that much...
> 
Exactly, you notice it a good deal less then you do the swapping that occurs if
you try to allocate a contiguous order 4 chunk of RAM.  That will bog down the
system, even if the allocation ultimately fails.

> If not, why af_packet would use such convoluted double array of
> 'compound pages' ?
> 
Gah!  Because I have blinders on, apparently.  The origional implementation used
a ring of pointer, and apparently I was so focused on keeping with that
implementation, it never occured to me to just use vmalloc.  That was stupid of
me, I'll respin this and get rid of my idiocy.

> Also, on x86_32, vmalloc()/vmap() space is small (128 MB) so you might
> exhaust it pretty fast with several sniffers running.
> 
You might, although (assuming no other significant users), 64K * 32 ~= 1.5Mb.
You could run 10 sniffers and only consume about 10-15% of the vmalloc space. 

> I would try a two level thing : Try to get high order pages, and
> fallback on low order pages, but normally libpcap does this for us ?
> 
> 
It does, but it tries them in that order, which causes the problem I'm
describing, which is to say that attempting to get a large high order allocation
causes the system to dig into swap and become unresponsive while it tries to
assemble those allocations.  I would suggest a vmalloc, with a backoff to high
order allocation if that fails.

I'll post a new patch shortly.
Neil

> 
> 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] Enhance AF_PACKET implementation to not require high order contiguous memory allocation
  2010-10-25 23:35   ` Neil Horman
@ 2010-10-25 23:46     ` David Miller
  2010-10-26  0:48       ` Maciej Żenczykowski
  2010-10-26  1:58       ` Neil Horman
  0 siblings, 2 replies; 9+ messages in thread
From: David Miller @ 2010-10-25 23:46 UTC (permalink / raw)
  To: nhorman; +Cc: eric.dumazet, netdev, jpirko

From: Neil Horman <nhorman@tuxdriver.com>
Date: Mon, 25 Oct 2010 19:35:58 -0400

> On Tue, Oct 26, 2010 at 12:30:56AM +0200, Eric Dumazet wrote:
>> I would try a two level thing : Try to get high order pages, and
>> fallback on low order pages, but normally libpcap does this for us ?
>> 
>> 
> It does, but it tries them in that order, which causes the problem I'm
> describing, which is to say that attempting to get a large high order allocation
> causes the system to dig into swap and become unresponsive while it tries to
> assemble those allocations.  I would suggest a vmalloc, with a backoff to high
> order allocation if that fails.

I think that logic should be maintained, except that one of the GFP_*
flags should be specified so that it doesn't go all swap crazy on us,
and instead fails a high order allocation earlier.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] Enhance AF_PACKET implementation to not require high order contiguous memory allocation
  2010-10-25 23:46     ` David Miller
@ 2010-10-26  0:48       ` Maciej Żenczykowski
  2010-10-26  1:53         ` Neil Horman
  2010-10-26  1:58       ` Neil Horman
  1 sibling, 1 reply; 9+ messages in thread
From: Maciej Żenczykowski @ 2010-10-26  0:48 UTC (permalink / raw)
  To: David Miller; +Cc: nhorman, eric.dumazet, netdev, jpirko

This is http://bugzilla.redhat.com/637619

I'm the original reporter and I haven't yet fully vetted / tested the
patch (not directly related to my job and all that).

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] Enhance AF_PACKET implementation to not require high order contiguous memory allocation
  2010-10-26  0:48       ` Maciej Żenczykowski
@ 2010-10-26  1:53         ` Neil Horman
  0 siblings, 0 replies; 9+ messages in thread
From: Neil Horman @ 2010-10-26  1:53 UTC (permalink / raw)
  To: Maciej Żenczykowski; +Cc: David Miller, eric.dumazet, netdev, jpirko

On Mon, Oct 25, 2010 at 05:48:33PM -0700, Maciej Żenczykowski wrote:
> This is http://bugzilla.redhat.com/637619
> 
> I'm the original reporter and I haven't yet fully vetted / tested the
> patch (not directly related to my job and all that).
> 
I validated it.  You reported that tcpdump startup caused your system to
'freeze'.  I managed to recreate that situation, as observed by my system
becomming unresponsive on the console, and validated that that situation did not
occur when this patch was applied.

Neil


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] Enhance AF_PACKET implementation to not require high order contiguous memory allocation
  2010-10-25 23:46     ` David Miller
  2010-10-26  0:48       ` Maciej Żenczykowski
@ 2010-10-26  1:58       ` Neil Horman
  1 sibling, 0 replies; 9+ messages in thread
From: Neil Horman @ 2010-10-26  1:58 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev, jpirko

On Mon, Oct 25, 2010 at 04:46:46PM -0700, David Miller wrote:
> From: Neil Horman <nhorman@tuxdriver.com>
> Date: Mon, 25 Oct 2010 19:35:58 -0400
> 
> > On Tue, Oct 26, 2010 at 12:30:56AM +0200, Eric Dumazet wrote:
> >> I would try a two level thing : Try to get high order pages, and
> >> fallback on low order pages, but normally libpcap does this for us ?
> >> 
> >> 
> > It does, but it tries them in that order, which causes the problem I'm
> > describing, which is to say that attempting to get a large high order allocation
> > causes the system to dig into swap and become unresponsive while it tries to
> > assemble those allocations.  I would suggest a vmalloc, with a backoff to high
> > order allocation if that fails.
> 
> I think that logic should be maintained, except that one of the GFP_*
> flags should be specified so that it doesn't go all swap crazy on us,
> and instead fails a high order allocation earlier.
> 
I suppose we can do that.  Specifically we can add __GFP_NORETRY so that we
don't continually drive I/O or swap to get pages freed, but I think that all
that will result in is earlier failures.  No matter how you look at it, order 4
allocations are hard to come by, and we can make it work with non-contigous
pages (the vmalloc case).  Given that each of the cases can fail, I can see this
being a workable fallback set:

1) get_free_pages (GFP_KERNEL|...+|__GFP_NORETRY)
2) vmalloc
3) get_free_pages (GFP_KERNEL|...)

Theres no reason to fail af_packet socket options just because the system is
under presure.  My only concern is that, since af_packet is typically used for
debugging/tracing, I'd rather not have its memory allocation adversely affect
system behavio.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] Enhance AF_PACKET implementation to not require high order contiguous memory allocation
  2010-10-25 19:06 nhorman
  2010-10-25 20:17 ` Francois Romieu
@ 2010-10-25 20:38 ` Eric Dumazet
  1 sibling, 0 replies; 9+ messages in thread
From: Eric Dumazet @ 2010-10-25 20:38 UTC (permalink / raw)
  To: nhorman; +Cc: netdev, davem, jpirko

Le lundi 25 octobre 2010 à 15:06 -0400, nhorman@tuxdriver.com a écrit :
> From: Neil Horman <nhorman@tuxdriver.com>
> 
> It was shown to me recently that systems under high load were driven very deep
> into swap when tcpdump was run.  The reason this happened was because the
> AF_PACKET protocol has a SET_RINGBUFFER socket option that allows the user space
> application to specify how many entries an AF_PACKET socket will have and how
> large each entry will be.  It seems the default setting for tcpdump is to set
> the ring buffer to 32 entries of 64 Kb each, which implies 32 order 5
> allocation.  Thats difficult under good circumstances, and horrid under memory
> pressure.
> 
> I thought it would be good to make that a bit more usable.  I was going to do a
> simple conversion of the ring buffer from contigous pages to iovecs, but
> unfortunately, the metadata which AF_PACKET places in these buffers can easily
> span a page boundary, and given that these buffers get mapped into user space,
> and the data layout doesn't easily allow for a change to padding between frames
> to avoid that, a simple iovec change is just going to break user space ABI
> consistency.
> 
> So instead I've done this.  This patch does the aforementioned change,
> allocating an array of pages instead of one contiguous chunk, and then vmaps the
> array into a contiguous memory space, so that it can still be accessed in the
> same way it was before.  This allows for a consisten user and kernel space
> behavior for memory mapped AF_PACKET sockets, which at the same time relieving
> the memory pressure placed on a system when tcpdump defaults are used.
> 
> Tested successfully by me.
> 
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> ---

Strange because last time I took a look at this stuff, libpcap was doing
several tries, reducing page orders until it got no allocation
failures...

(It tries to get high order pages, maybe to reduce TLB pressure...)

I remember adding __GFP_NOWARN to avoid a kernel message, while tcpdump
was actually working...




^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] Enhance AF_PACKET implementation to not require high order contiguous memory allocation
  2010-10-25 19:06 nhorman
@ 2010-10-25 20:17 ` Francois Romieu
  2010-10-25 20:38 ` Eric Dumazet
  1 sibling, 0 replies; 9+ messages in thread
From: Francois Romieu @ 2010-10-25 20:17 UTC (permalink / raw)
  To: nhorman; +Cc: netdev, davem, eric.dumazet, jpirko

nhorman@tuxdriver.com <nhorman@tuxdriver.com> :
[...]
> +		if (order > 0) {
> +			for (j = 0; j < pg_vec[i]->num_pages; j++)
> +				pages[j] = virt_to_page(pg_vec[i]->pages[j]);
> +
> +			if (unlikely(!pg_vec[i]))
> +				goto out_free_pgvec;

Check pg_vec[i] sooner ?

-- 
Ueimor

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH] Enhance AF_PACKET implementation to not require high order contiguous memory allocation
@ 2010-10-25 19:06 nhorman
  2010-10-25 20:17 ` Francois Romieu
  2010-10-25 20:38 ` Eric Dumazet
  0 siblings, 2 replies; 9+ messages in thread
From: nhorman @ 2010-10-25 19:06 UTC (permalink / raw)
  To: netdev; +Cc: davem, eric.dumazet, jpirko, Neil Horman

From: Neil Horman <nhorman@tuxdriver.com>

It was shown to me recently that systems under high load were driven very deep
into swap when tcpdump was run.  The reason this happened was because the
AF_PACKET protocol has a SET_RINGBUFFER socket option that allows the user space
application to specify how many entries an AF_PACKET socket will have and how
large each entry will be.  It seems the default setting for tcpdump is to set
the ring buffer to 32 entries of 64 Kb each, which implies 32 order 5
allocation.  Thats difficult under good circumstances, and horrid under memory
pressure.

I thought it would be good to make that a bit more usable.  I was going to do a
simple conversion of the ring buffer from contigous pages to iovecs, but
unfortunately, the metadata which AF_PACKET places in these buffers can easily
span a page boundary, and given that these buffers get mapped into user space,
and the data layout doesn't easily allow for a change to padding between frames
to avoid that, a simple iovec change is just going to break user space ABI
consistency.

So instead I've done this.  This patch does the aforementioned change,
allocating an array of pages instead of one contiguous chunk, and then vmaps the
array into a contiguous memory space, so that it can still be accessed in the
same way it was before.  This allows for a consisten user and kernel space
behavior for memory mapped AF_PACKET sockets, which at the same time relieving
the memory pressure placed on a system when tcpdump defaults are used.

Tested successfully by me.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
---
 net/packet/af_packet.c |   94 +++++++++++++++++++++++++++++++++++++----------
 1 files changed, 74 insertions(+), 20 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 3616f27..fad3891 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -163,8 +163,14 @@ struct packet_mreq_max {
 static int packet_set_ring(struct sock *sk, struct tpacket_req *req,
 		int closing, int tx_ring);
 
+struct pkt_page_list {
+	unsigned int num_pages;
+	char	*vmap_area;
+	unsigned long pages[0];
+};
+
 struct packet_ring_buffer {
-	char			**pg_vec;
+	struct pkt_page_list	**pg_vec;
 	unsigned int		head;
 	unsigned int		frames_per_block;
 	unsigned int		frame_size;
@@ -283,7 +289,8 @@ static void *packet_lookup_frame(struct packet_sock *po,
 	pg_vec_pos = position / rb->frames_per_block;
 	frame_offset = position % rb->frames_per_block;
 
-	h.raw = rb->pg_vec[pg_vec_pos] + (frame_offset * rb->frame_size);
+	h.raw = rb->pg_vec[pg_vec_pos]->vmap_area +
+		(frame_offset * rb->frame_size);
 
 	if (status != __packet_get_status(po, h.raw))
 		return NULL;
@@ -2322,41 +2329,86 @@ static const struct vm_operations_struct packet_mmap_ops = {
 	.close	=	packet_mm_close,
 };
 
-static void free_pg_vec(char **pg_vec, unsigned int order, unsigned int len)
+static void free_one_pkt_list(struct pkt_page_list *pglist)
+{
+	int i;
+
+	if ((pglist->num_pages > 1) && (pglist->vmap_area))
+		vunmap(pglist->vmap_area);
+	for (i = 0; i < pglist->num_pages; i++)
+		free_pages(pglist->pages[i], 0);
+}
+
+static void free_pg_vec(struct pkt_page_list **pg_vec, unsigned int order,
+			unsigned int len)
 {
 	int i;
 
-	for (i = 0; i < len; i++) {
+	for (i = 0; i < len; i++)
 		if (likely(pg_vec[i]))
-			free_pages((unsigned long) pg_vec[i], order);
-	}
+			free_one_pkt_list(pg_vec[i]);
+
 	kfree(pg_vec);
 }
 
-static inline char *alloc_one_pg_vec_page(unsigned long order)
+static inline struct pkt_page_list *alloc_one_pg_page_list(unsigned long order)
 {
+	int i;
+	struct pkt_page_list *newlist;
 	gfp_t gfp_flags = GFP_KERNEL | __GFP_COMP | __GFP_ZERO | __GFP_NOWARN;
 
-	return (char *) __get_free_pages(gfp_flags, order);
+
+	newlist = kzalloc(sizeof(struct pkt_page_list)+
+			  (sizeof(void *)*(1<<order)), GFP_KERNEL);
+	if (!newlist)
+		return NULL;
+	newlist->num_pages = (1<<order);
+
+	for (i = 0; i < newlist->num_pages; i++) {
+		newlist->pages[i] = __get_free_pages(gfp_flags, 0);
+		if (!newlist->pages[i])
+			goto out_free;
+	}
+
+	return newlist;
+out_free:
+	free_one_pkt_list(newlist);
+	kfree(newlist);
+	return NULL;
 }
 
-static char **alloc_pg_vec(struct tpacket_req *req, int order)
+static struct pkt_page_list **alloc_pg_vec(struct tpacket_req *req, int order)
 {
 	unsigned int block_nr = req->tp_block_nr;
-	char **pg_vec;
-	int i;
+	struct pkt_page_list **pg_vec;
+	struct page **pages;
+	int i, j;
 
-	pg_vec = kzalloc(block_nr * sizeof(char *), GFP_KERNEL);
-	if (unlikely(!pg_vec))
+	pg_vec = kzalloc(block_nr * sizeof(struct pkt_page_list *), GFP_KERNEL);
+	pages = kzalloc(block_nr * sizeof(struct page *), GFP_KERNEL);
+
+	if (unlikely(!pg_vec || !pages))
 		goto out;
 
 	for (i = 0; i < block_nr; i++) {
-		pg_vec[i] = alloc_one_pg_vec_page(order);
-		if (unlikely(!pg_vec[i]))
-			goto out_free_pgvec;
+		pg_vec[i] = alloc_one_pg_page_list(order);
+
+		if (order > 0) {
+			for (j = 0; j < pg_vec[i]->num_pages; j++)
+				pages[j] = virt_to_page(pg_vec[i]->pages[j]);
+
+			if (unlikely(!pg_vec[i]))
+				goto out_free_pgvec;
+			pg_vec[i]->vmap_area = vmap(pages, pg_vec[i]->num_pages,
+						 VM_MAP, PAGE_KERNEL);
+			if (!pg_vec[i]->vmap_area)
+				goto out_free_pgvec;
+		} else
+			pg_vec[i]->vmap_area = (void *)pg_vec[i]->pages[0];
 	}
 
 out:
+	kfree(pages);
 	return pg_vec;
 
 out_free_pgvec:
@@ -2368,7 +2420,7 @@ out_free_pgvec:
 static int packet_set_ring(struct sock *sk, struct tpacket_req *req,
 		int closing, int tx_ring)
 {
-	char **pg_vec = NULL;
+	struct pkt_page_list **pg_vec = NULL;
 	struct packet_sock *po = pkt_sk(sk);
 	int was_running, order = 0;
 	struct packet_ring_buffer *rb;
@@ -2530,11 +2582,13 @@ static int packet_mmap(struct file *file, struct socket *sock,
 			continue;
 
 		for (i = 0; i < rb->pg_vec_len; i++) {
-			struct page *page = virt_to_page(rb->pg_vec[i]);
+			struct pkt_page_list *plist = rb->pg_vec[i];
 			int pg_num;
+			struct page *page;
 
-			for (pg_num = 0; pg_num < rb->pg_vec_pages;
-					pg_num++, page++) {
+			for (pg_num = 0; pg_num < plist->num_pages;
+					pg_num++) {
+				page = virt_to_page(plist->pages[pg_num]);
 				err = vm_insert_page(vma, start, page);
 				if (unlikely(err))
 					goto out;
-- 
1.7.2.3


^ permalink raw reply related	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2010-10-26  1:58 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <E1PAVIx-0001qL-EB@smtp.tuxdriver.com>
2010-10-25 22:30 ` [PATCH] Enhance AF_PACKET implementation to not require high order contiguous memory allocation Eric Dumazet
2010-10-25 23:35   ` Neil Horman
2010-10-25 23:46     ` David Miller
2010-10-26  0:48       ` Maciej Żenczykowski
2010-10-26  1:53         ` Neil Horman
2010-10-26  1:58       ` Neil Horman
2010-10-25 19:06 nhorman
2010-10-25 20:17 ` Francois Romieu
2010-10-25 20:38 ` Eric Dumazet

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.