* regression with poll(2)?
@ 2012-08-15 19:46 Sage Weil
  2012-08-15 20:45 ` Atchley, Scott
  2012-08-19 18:49 ` regression with poll(2) Sage Weil
  0 siblings, 2 replies; 13+ messages in thread
From: Sage Weil @ 2012-08-15 19:46 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel, ceph-devel

I'm experiencing a stall with Ceph daemons communicating over TCP that 
occurs reliably with 3.6-rc1 (and linus/master) but not 3.5.  The basic 
situation is:

 - the socket connects two processes communicating over TCP on the same host, e.g. 

tcp        0 2164849 10.214.132.38:6801      10.214.132.38:51729     ESTABLISHED

 - one end writes a bunch of data in
 - the other end consumes data, but at some point stalls.
 - reads are nonblocking, e.g.

  int got = ::recv( sd, buf, len, MSG_DONTWAIT );

 and between those calls we wait with

  struct pollfd pfd;
  short evmask;
  pfd.fd = sd;
  pfd.events = POLLIN;
#if defined(__linux__)
  pfd.events |= POLLRDHUP;
#endif

  if (poll(&pfd, 1, msgr->timeout) <= 0)
    return -1;

 - in my case the timeout is ~15 minutes.  at that point it errors out, 
and the daemons reconnect and continue for a while until hitting this 
again.

 - at the time of the stall, the reading process is blocked on that 
poll(2) call.  There are a bunch of threads sitting in poll(2), some of them 
stalled and some not, but they all have stacks like

[<ffffffff8118f6f9>] poll_schedule_timeout+0x49/0x70
[<ffffffff81190baf>] do_sys_poll+0x35f/0x4c0
[<ffffffff81190deb>] sys_poll+0x6b/0x100
[<ffffffff8163d369>] system_call_fastpath+0x16/0x1b

 - you'll note that the netstat output shows data queued:

tcp        0 1163264 10.214.132.36:6807      10.214.132.36:41738     ESTABLISHED
tcp        0 1622016 10.214.132.36:41738     10.214.132.36:6807      ESTABLISHED

etc.
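
For reference, a minimal self-contained sketch of the wait-then-read
pattern described above.  This is not the actual Ceph messenger code: the
helper name wait_and_recv() and its millisecond timeout parameter are
assumptions made for illustration, and an interrupted poll() simply
restarts with the full timeout.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from Ceph): wait until sd is readable, then
 * do one nonblocking read.  Returns the number of bytes read, 0 if the
 * peer shut down the connection, or -1 on timeout or error.
 */
static ssize_t wait_and_recv(int sd, void *buf, size_t len, int timeout_ms)
{
	struct pollfd pfd;

	for (;;) {
		pfd.fd = sd;
		pfd.events = POLLIN;
#ifdef POLLRDHUP
		/* Only visible on Linux when _GNU_SOURCE is defined. */
		pfd.events |= POLLRDHUP;
#endif
		pfd.revents = 0;

		int r = poll(&pfd, 1, timeout_ms);
		if (r < 0) {
			if (errno == EINTR)
				continue;	/* retry with the full timeout */
			return -1;
		}
		if (r == 0)
			return -1;		/* timed out */

		ssize_t got = recv(sd, buf, len, MSG_DONTWAIT);
		if (got < 0 && (errno == EAGAIN || errno == EWOULDBLOCK ||
				errno == EINTR))
			continue;		/* spurious wakeup, poll again */
		return got;
	}
}
```

With this pattern, a reader only stays in poll() for the whole timeout if
the kernel never marks the socket readable, which is the symptom reported
here.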

Is this a known regression?  Or might I be misusing the API?  What 
information would help track it down?

Thanks!
sage




* Re: regression with poll(2)?
  2012-08-15 19:46 regression with poll(2)? Sage Weil
@ 2012-08-15 20:45 ` Atchley, Scott
  2012-08-15 21:03   ` Sage Weil
  2012-08-19 18:49 ` regression with poll(2) Sage Weil
  1 sibling, 1 reply; 13+ messages in thread
From: Atchley, Scott @ 2012-08-15 20:45 UTC (permalink / raw)
  To: Sage Weil; +Cc: netdev, linux-kernel, ceph-devel

On Aug 15, 2012, at 3:46 PM, Sage Weil wrote:

> I'm experiencing a stall with Ceph daemons communicating over TCP that 
> occurs reliably with 3.6-rc1 (and linus/master) but not 3.5.  The basic 
> situation is:
> 
> - the socket is two processes communicating over TCP on the same host, e.g. 
> 
> tcp        0 2164849 10.214.132.38:6801      10.214.132.38:51729     ESTABLISHED
> 
> - one end writes a bunch of data in
> - the other end consumes data, but at some point stalls.
> - reads are nonblocking, e.g.
> 
>  int got = ::recv( sd, buf, len, MSG_DONTWAIT );
> 
> and between those calls we wait with
> 
>  struct pollfd pfd;
>  short evmask;
>  pfd.fd = sd;
>  pfd.events = POLLIN;
> #if defined(__linux__)
>  pfd.events |= POLLRDHUP;
> #endif
> 
>  if (poll(&pfd, 1, msgr->timeout) <= 0)
>    return -1;
> 
> - in my case the timeout is ~15 minutes.  at that point it errors out, 
> and the daemons reconnect and continue for a while until hitting this 
> again.
> 
> - at the time of the stall, the reading process is blocked on that 
> poll(2) call.  There are a bunch of threads stuck on poll(2), some of them 
> stuck and some not, but they all have stacks like
> 
> [<ffffffff8118f6f9>] poll_schedule_timeout+0x49/0x70
> [<ffffffff81190baf>] do_sys_poll+0x35f/0x4c0
> [<ffffffff81190deb>] sys_poll+0x6b/0x100
> [<ffffffff8163d369>] system_call_fastpath+0x16/0x1b
> 
> - you'll note that the netstat output shows data queued:
> 
> tcp        0 1163264 10.214.132.36:6807      10.214.132.36:41738     ESTABLISHED
> tcp        0 1622016 10.214.132.36:41738     10.214.132.36:6807      ESTABLISHED
> 
> etc.
> 
> Is this a known regression?  Or might I be misusing the API?  What 
> information would help track it down?
> 
> Thanks!
> sage


Sage,

Do you see the same behavior when using two hosts (i.e. not loopback)? If different, how much data is in the pipe in the localhost case?

Scott




* Re: regression with poll(2)?
  2012-08-15 20:45 ` Atchley, Scott
@ 2012-08-15 21:03   ` Sage Weil
  0 siblings, 0 replies; 13+ messages in thread
From: Sage Weil @ 2012-08-15 21:03 UTC (permalink / raw)
  To: Atchley, Scott; +Cc: netdev, linux-kernel, ceph-devel

On Wed, 15 Aug 2012, Atchley, Scott wrote:
> On Aug 15, 2012, at 3:46 PM, Sage Weil wrote:
> 
> > I'm experiencing a stall with Ceph daemons communicating over TCP that 
> > occurs reliably with 3.6-rc1 (and linus/master) but not 3.5.  The basic 
> > situation is:
> > 
> > - the socket is two processes communicating over TCP on the same host, e.g. 
> > 
> > tcp        0 2164849 10.214.132.38:6801      10.214.132.38:51729     ESTABLISHED
> > 
> > - one end writes a bunch of data in
> > - the other end consumes data, but at some point stalls.
> > - reads are nonblocking, e.g.
> > 
> >  int got = ::recv( sd, buf, len, MSG_DONTWAIT );
> > 
> > and between those calls we wait with
> > 
> >  struct pollfd pfd;
> >  short evmask;
> >  pfd.fd = sd;
> >  pfd.events = POLLIN;
> > #if defined(__linux__)
> >  pfd.events |= POLLRDHUP;
> > #endif
> > 
> >  if (poll(&pfd, 1, msgr->timeout) <= 0)
> >    return -1;
> > 
> > - in my case the timeout is ~15 minutes.  at that point it errors out, 
> > and the daemons reconnect and continue for a while until hitting this 
> > again.
> > 
> > - at the time of the stall, the reading process is blocked on that 
> > poll(2) call.  There are a bunch of threads stuck on poll(2), some of them 
> > stuck and some not, but they all have stacks like
> > 
> > [<ffffffff8118f6f9>] poll_schedule_timeout+0x49/0x70
> > [<ffffffff81190baf>] do_sys_poll+0x35f/0x4c0
> > [<ffffffff81190deb>] sys_poll+0x6b/0x100
> > [<ffffffff8163d369>] system_call_fastpath+0x16/0x1b
> > 
> > - you'll note that the netstat output shows data queued:
> > 
> > tcp        0 1163264 10.214.132.36:6807      10.214.132.36:41738     ESTABLISHED
> > tcp        0 1622016 10.214.132.36:41738     10.214.132.36:6807      ESTABLISHED
> > 
> > etc.
> > 
> > Is this a known regression?  Or might I be misusing the API?  What 
> > information would help track it down?
> > 
> > Thanks!
> > sage
> 
> 
> Sage,
> 
> Do you see the same behavior when using two hosts (i.e. not loopback)? If different, how much data is in the pipe in the localhost case?

I have only seen it in the loopback case, and have independently diagnosed 
it a half dozen or so times now.

:/
sage


> 
> Scott
> 
> 


* Re: regression with poll(2)
  2012-08-15 19:46 regression with poll(2)? Sage Weil
  2012-08-15 20:45 ` Atchley, Scott
@ 2012-08-19 18:49 ` Sage Weil
  2012-08-20  8:07   ` Eric Dumazet
  2012-08-20  9:04   ` Mel Gorman
  1 sibling, 2 replies; 13+ messages in thread
From: Sage Weil @ 2012-08-19 18:49 UTC (permalink / raw)
  To: mgorman, davem, netdev
  Cc: linux-kernel, ceph-devel, neilb, a.p.zijlstra, michaelc, emunson,
	eric.dumazet, sebastian, cl, akpm, torvalds

I've bisected and identified this commit:

    netvm: propagate page->pfmemalloc to skb
    
    The skb->pfmemalloc flag gets set to true iff the PFMEMALLOC reserves were
    used during the slab allocation of data in __alloc_skb.  If the
    packet is fragmented, it is possible that pages will be allocated from the
    PFMEMALLOC reserve without propagating this information to the skb.  This
    patch propagates page->pfmemalloc from pages allocated for fragments to
    the skb.
    
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Acked-by: David S. Miller <davem@davemloft.net>
    Cc: Neil Brown <neilb@suse.de>
    Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Cc: Mike Christie <michaelc@cs.wisc.edu>
    Cc: Eric B Munson <emunson@mgebm.net>
    Cc: Eric Dumazet <eric.dumazet@gmail.com>
    Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Christoph Lameter <cl@linux.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

I've retested several times and confirmed that this change leads to the 
breakage, and also confirmed that reverting it on top of -rc1 also fixes 
the problem.

I've also added some additional instrumentation to my code and confirmed 
that the process is blocking on poll(2) while netstat is reporting 
data available on the socket.

What can I do to help track this down?

Thanks!
sage




* Re: regression with poll(2)
  2012-08-19 18:49 ` regression with poll(2) Sage Weil
@ 2012-08-20  8:07   ` Eric Dumazet
  2012-08-20  9:04   ` Mel Gorman
  1 sibling, 0 replies; 13+ messages in thread
From: Eric Dumazet @ 2012-08-20  8:07 UTC (permalink / raw)
  To: Sage Weil
  Cc: mgorman, davem, netdev, linux-kernel, ceph-devel, neilb,
	a.p.zijlstra, michaelc, emunson, sebastian, cl, akpm, torvalds

On Sun, 2012-08-19 at 11:49 -0700, Sage Weil wrote:
> I've bisected and identified this commit:
> 
>     netvm: propagate page->pfmemalloc to skb
>     
>     The skb->pfmemalloc flag gets set to true iff during the slab allocation
>     of data in __alloc_skb that the the PFMEMALLOC reserves were used.  If the
>     packet is fragmented, it is possible that pages will be allocated from the
>     PFMEMALLOC reserve without propagating this information to the skb.  This
>     patch propagates page->pfmemalloc from pages allocated for fragments to
>     the skb.
>     
>     Signed-off-by: Mel Gorman <mgorman@suse.de>
>     Acked-by: David S. Miller <davem@davemloft.net>
>     Cc: Neil Brown <neilb@suse.de>
>     Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
>     Cc: Mike Christie <michaelc@cs.wisc.edu>
>     Cc: Eric B Munson <emunson@mgebm.net>
>     Cc: Eric Dumazet <eric.dumazet@gmail.com>
>     Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
>     Cc: Mel Gorman <mgorman@suse.de>
>     Cc: Christoph Lameter <cl@linux.com>
>     Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
>     Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> 
> I've retested several times and confirmed that this change leads to the 
> breakage, and also confirmed that reverting it on top of -rc1 also fixes 
> the problem.
> 
> I've also added some additional instrumentation to my code and confirmed 
> that the process is blocking on poll(2) while netstat is reporting 
> data available on the socket.
> 
> What can I do to help track this down?
> 
> Thanks!
> sage
> 
> 
> On Wed, 15 Aug 2012, Sage Weil wrote:
> 
> > I'm experiencing a stall with Ceph daemons communicating over TCP that 
> > occurs reliably with 3.6-rc1 (and linus/master) but not 3.5.  The basic 
> > situation is:
> > 
> >  - the socket is two processes communicating over TCP on the same host, e.g. 
> > 
> > tcp        0 2164849 10.214.132.38:6801      10.214.132.38:51729     ESTABLISHED
> > 
> >  - one end writes a bunch of data in
> >  - the other end consumes data, but at some point stalls.
> >  - reads are nonblocking, e.g.
> > 
> >   int got = ::recv( sd, buf, len, MSG_DONTWAIT );
> > 
> >  and between those calls we wait with
> > 
> >   struct pollfd pfd;
> >   short evmask;
> >   pfd.fd = sd;
> >   pfd.events = POLLIN;
> > #if defined(__linux__)
> >   pfd.events |= POLLRDHUP;
> > #endif
> > 
> >   if (poll(&pfd, 1, msgr->timeout) <= 0)
> >     return -1;
> > 
> >  - in my case the timeout is ~15 minutes.  at that point it errors out, 
> > and the daemons reconnect and continue for a while until hitting this 
> > again.
> > 
> >  - at the time of the stall, the reading process is blocked on that 
> > poll(2) call.  There are a bunch of threads stuck on poll(2), some of them 
> > stuck and some not, but they all have stacks like
> > 
> > [<ffffffff8118f6f9>] poll_schedule_timeout+0x49/0x70
> > [<ffffffff81190baf>] do_sys_poll+0x35f/0x4c0
> > [<ffffffff81190deb>] sys_poll+0x6b/0x100
> > [<ffffffff8163d369>] system_call_fastpath+0x16/0x1b
> > 
> >  - you'll note that the netstat output shows data queued:
> > 
> > tcp        0 1163264 10.214.132.36:6807      10.214.132.36:41738     ESTABLISHED
> > tcp        0 1622016 10.214.132.36:41738     10.214.132.36:6807      ESTABLISHED
> > 

In this netstat output, we can see some data in the output (send) queues,
but no data in the receive queues. poll() is OK.

Some TCP frames are not properly delivered, even after a retransmit.

(To see useful stats/counters: ss -emoi dst 10.214.132.36)

For loopback transmits, skbs are taken from the output queue, cloned and
fed to the local stack.

If they have the pfmemalloc bit set, they won't be delivered to normal
sockets, but dropped.

tcp_sendmsg() seems to be able to queue skbs with pfmemalloc set to
true, and this makes no sense to me.
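
The delivery rule described above can be sketched as a user-space toy
model.  Everything here is an assumption for illustration: struct toy_skb,
struct toy_sock and toy_deliver() are hypothetical stand-ins, not the
kernel's struct sk_buff, struct sock, or its receive path.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins; NOT the kernel's struct sk_buff / struct sock. */
struct toy_skb {
	bool pfmemalloc;	/* built from emergency reserves? */
	int len;		/* payload length in bytes */
};

struct toy_sock {
	bool memalloc;		/* socket allowed to use reserves? */
	int rcv_queued;		/* bytes sitting in the receive queue */
	int dropped;		/* packets discarded on delivery */
};

/*
 * Queue skb on sk unless it was built from emergency reserves and sk is
 * a normal socket.  Returns true if the packet was queued.
 */
static bool toy_deliver(struct toy_sock *sk, const struct toy_skb *skb)
{
	if (skb->pfmemalloc && !sk->memalloc) {
		sk->dropped++;
		return false;
	}
	sk->rcv_queued += skb->len;
	return true;
}
```

In this model a sender whose skbs all carry the flag keeps retransmitting
into a normal socket whose receive queue never grows, which would match a
large Send-Q next to an empty Recv-Q.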





* Re: regression with poll(2)
  2012-08-19 18:49 ` regression with poll(2) Sage Weil
  2012-08-20  8:07   ` Eric Dumazet
@ 2012-08-20  9:04   ` Mel Gorman
  2012-08-20  9:30     ` Eric Dumazet
                       ` (2 more replies)
  1 sibling, 3 replies; 13+ messages in thread
From: Mel Gorman @ 2012-08-20  9:04 UTC (permalink / raw)
  To: Sage Weil
  Cc: davem, netdev, linux-kernel, ceph-devel, neilb, a.p.zijlstra,
	michaelc, emunson, eric.dumazet, sebastian, cl, akpm, torvalds

On Sun, Aug 19, 2012 at 11:49:31AM -0700, Sage Weil wrote:
> I've bisected and identified this commit:
> 
>     netvm: propagate page->pfmemalloc to skb
>     
>     The skb->pfmemalloc flag gets set to true iff during the slab allocation
>     of data in __alloc_skb that the the PFMEMALLOC reserves were used.  If the
>     packet is fragmented, it is possible that pages will be allocated from the
>     PFMEMALLOC reserve without propagating this information to the skb.  This
>     patch propagates page->pfmemalloc from pages allocated for fragments to
>     the skb.
>     
>     Signed-off-by: Mel Gorman <mgorman@suse.de>
>     Acked-by: David S. Miller <davem@davemloft.net>
>     Cc: Neil Brown <neilb@suse.de>
>     Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
>     Cc: Mike Christie <michaelc@cs.wisc.edu>
>     Cc: Eric B Munson <emunson@mgebm.net>
>     Cc: Eric Dumazet <eric.dumazet@gmail.com>
>     Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
>     Cc: Mel Gorman <mgorman@suse.de>
>     Cc: Christoph Lameter <cl@linux.com>
>     Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
>     Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> 

Ok, thanks.

> I've retested several times and confirmed that this change leads to the 
> breakage, and also confirmed that reverting it on top of -rc1 also fixes 
> the problem.
> 
> I've also added some additional instrumentation to my code and confirmed 
> that the process is blocking on poll(2) while netstat is reporting 
> data available on the socket.
> 
> What can I do to help track this down?
> 

Can the following patch be tested please? It is reported to fix an fio
regression that may be similar to what you are experiencing but has not
been picked up yet.

---8<---
From: Alex Shi <alex.shi@intel.com>
Subject: [PATCH] mm: correct page->pfmemalloc to fix deactivate_slab regression

commit cfd19c5a9ec (mm: only set page->pfmemalloc when
ALLOC_NO_WATERMARKS was used) tried to narrow down where page->pfmemalloc
is set, but it missed some places where pfmemalloc should be set.

So, in __slab_alloc, the mismatch between pfmemalloc and ALLOC_NO_WATERMARKS
causes incorrect deactivate_slab() calls on our core2 server:

    64.73%           fio  [kernel.kallsyms]     [k] _raw_spin_lock
                     |
                     --- _raw_spin_lock
                        |
                        |---0.34%-- deactivate_slab
                        |          __slab_alloc
                        |          kmem_cache_alloc
                        |          |

That causes a 40% regression in our fio sync write performance.

This patch moves the check into get_page_from_freelist, which resolves
this issue.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 mm/page_alloc.c |   21 +++++++++++----------
 1 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 009ac28..07f1924 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1928,6 +1928,17 @@ this_zone_full:
 		zlc_active = 0;
 		goto zonelist_scan;
 	}
+
+	if (page)
+		/*
+		 * page->pfmemalloc is set when ALLOC_NO_WATERMARKS was
+		 * necessary to allocate the page. The expectation is
+		 * that the caller is taking steps that will free more
+		 * memory. The caller should avoid the page being used
+		 * for !PFMEMALLOC purposes.
+		 */
+		page->pfmemalloc = !!(alloc_flags & ALLOC_NO_WATERMARKS);
+
 	return page;
 }
 
@@ -2389,14 +2400,6 @@ rebalance:
 				zonelist, high_zoneidx, nodemask,
 				preferred_zone, migratetype);
 		if (page) {
-			/*
-			 * page->pfmemalloc is set when ALLOC_NO_WATERMARKS was
-			 * necessary to allocate the page. The expectation is
-			 * that the caller is taking steps that will free more
-			 * memory. The caller should avoid the page being used
-			 * for !PFMEMALLOC purposes.
-			 */
-			page->pfmemalloc = true;
 			goto got_pg;
 		}
 	}
@@ -2569,8 +2572,6 @@ retry_cpuset:
 		page = __alloc_pages_slowpath(gfp_mask, order,
 				zonelist, high_zoneidx, nodemask,
 				preferred_zone, migratetype);
-	else
-		page->pfmemalloc = false;
 
 	trace_mm_page_alloc(page, order, gfp_mask, migratetype);
 
-- 
1.7.5.4
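
The effect of the diff above can be sketched with a user-space toy model.
This is one reading of the patch, not kernel code: struct toy_page, the
TOY_ALLOC_NO_WATERMARKS value and both helpers are hypothetical names.  The
assumption is that, before the patch, a slow-path allocation that did not
need ALLOC_NO_WATERMARKS never wrote page->pfmemalloc, so a stale value
from an earlier use of the page could survive.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_ALLOC_NO_WATERMARKS 0x04	/* arbitrary value for this sketch */

struct toy_page {
	bool pfmemalloc;
};

/*
 * Pre-patch slow path, as read from the removed hunks: the flag is only
 * written when reserves were used; otherwise it is left untouched.
 */
static struct toy_page *toy_slowpath_old(struct toy_page *page, int alloc_flags)
{
	if (page && (alloc_flags & TOY_ALLOC_NO_WATERMARKS))
		page->pfmemalloc = true;
	return page;
}

/*
 * Post-patch behaviour: the flag is recomputed for every page that the
 * freelist helper hands out, so no stale value can leak through.
 */
static struct toy_page *toy_get_page(struct toy_page *page, int alloc_flags)
{
	if (page)
		page->pfmemalloc = !!(alloc_flags & TOY_ALLOC_NO_WATERMARKS);
	return page;
}
```

The `!!` normalises the masked bit to 0 or 1 before it is stored in the
boolean field.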



* Re: regression with poll(2)
  2012-08-20  9:04   ` Mel Gorman
@ 2012-08-20  9:30     ` Eric Dumazet
  2012-08-20 23:20       ` Andrew Morton
  2012-08-20 16:54     ` Sage Weil
  2012-08-20 17:02     ` Linus Torvalds
  2 siblings, 1 reply; 13+ messages in thread
From: Eric Dumazet @ 2012-08-20  9:30 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Sage Weil, davem, netdev, linux-kernel, ceph-devel, neilb,
	a.p.zijlstra, michaelc, emunson, sebastian, cl, akpm, torvalds

On Mon, 2012-08-20 at 10:04 +0100, Mel Gorman wrote:

> Can the following patch be tested please? It is reported to fix an fio
> regression that may be similar to what you are experiencing but has not
> been picked up yet.
> 
> -

This seems to help here.

Boot your machine with "mem=768M" or a bit less depending on your setup,
and try a netperf.

-> before patch :

# netperf
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
localhost.localdomain () port 0 AF_INET
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

 87380  16384  16384    14.00       6.05   

-> after patch :

Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

 87380  16384  16384    10.00    18509.73   




* Re: regression with poll(2)
  2012-08-20  9:04   ` Mel Gorman
  2012-08-20  9:30     ` Eric Dumazet
@ 2012-08-20 16:54     ` Sage Weil
  2012-08-21  7:05       ` Mel Gorman
  2012-08-20 17:02     ` Linus Torvalds
  2 siblings, 1 reply; 13+ messages in thread
From: Sage Weil @ 2012-08-20 16:54 UTC (permalink / raw)
  To: Mel Gorman
  Cc: davem, netdev, linux-kernel, ceph-devel, neilb, a.p.zijlstra,
	michaelc, emunson, eric.dumazet, sebastian, cl, akpm, torvalds

On Mon, 20 Aug 2012, Mel Gorman wrote:
> On Sun, Aug 19, 2012 at 11:49:31AM -0700, Sage Weil wrote:
> > I've bisected and identified this commit:
> > 
> >     netvm: propagate page->pfmemalloc to skb
> >     
> >     The skb->pfmemalloc flag gets set to true iff during the slab allocation
> >     of data in __alloc_skb that the the PFMEMALLOC reserves were used.  If the
> >     packet is fragmented, it is possible that pages will be allocated from the
> >     PFMEMALLOC reserve without propagating this information to the skb.  This
> >     patch propagates page->pfmemalloc from pages allocated for fragments to
> >     the skb.
> >     
> >     Signed-off-by: Mel Gorman <mgorman@suse.de>
> >     Acked-by: David S. Miller <davem@davemloft.net>
> >     Cc: Neil Brown <neilb@suse.de>
> >     Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> >     Cc: Mike Christie <michaelc@cs.wisc.edu>
> >     Cc: Eric B Munson <emunson@mgebm.net>
> >     Cc: Eric Dumazet <eric.dumazet@gmail.com>
> >     Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
> >     Cc: Mel Gorman <mgorman@suse.de>
> >     Cc: Christoph Lameter <cl@linux.com>
> >     Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> >     Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> > 
> 
> Ok, thanks.
> 
> > I've retested several times and confirmed that this change leads to the 
> > breakage, and also confirmed that reverting it on top of -rc1 also fixes 
> > the problem.
> > 
> > I've also added some additional instrumentation to my code and confirmed 
> > that the process is blocking on poll(2) while netstat is reporting 
> > data available on the socket.
> > 
> > What can I do to help track this down?
> > 
> 
> Can the following patch be tested please? It is reported to fix an fio
> regression that may be similar to what you are experiencing but has not
> been picked up yet.

This patch appears to resolve things for me as well, at least after a 
couple of passes.  I'll let you know if I see any further problems come up 
with more testing.

Thanks!
sage


> 
> ---8<---
> From: Alex Shi <alex.shi@intel.com>
> Subject: [PATCH] mm: correct page->pfmemalloc to fix deactivate_slab regression
> 
> commit cfd19c5a9ec (mm: only set page->pfmemalloc when
> ALLOC_NO_WATERMARKS was used) try to narrow down page->pfmemalloc
> setting, but it missed some places the pfmemalloc should be set.
> 
> So, in __slab_alloc, the unalignment pfmemalloc and ALLOC_NO_WATERMARKS
> cause incorrect deactivate_slab() on our core2 server:
> 
>     64.73%           fio  [kernel.kallsyms]     [k] _raw_spin_lock
>                      |
>                      --- _raw_spin_lock
>                         |
>                         |---0.34%-- deactivate_slab
>                         |          __slab_alloc
>                         |          kmem_cache_alloc
>                         |          |
> 
> That causes our fio sync write performance has 40% regression.
> 
> This patch move the checking in get_page_from_freelist, that resolved
> this issue.
> 
> Signed-off-by: Alex Shi <alex.shi@intel.com>
> ---
>  mm/page_alloc.c |   21 +++++++++++----------
>  1 files changed, 11 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 009ac28..07f1924 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1928,6 +1928,17 @@ this_zone_full:
>  		zlc_active = 0;
>  		goto zonelist_scan;
>  	}
> +
> +	if (page)
> +		/*
> +		 * page->pfmemalloc is set when ALLOC_NO_WATERMARKS was
> +		 * necessary to allocate the page. The expectation is
> +		 * that the caller is taking steps that will free more
> +		 * memory. The caller should avoid the page being used
> +		 * for !PFMEMALLOC purposes.
> +		 */
> +		page->pfmemalloc = !!(alloc_flags & ALLOC_NO_WATERMARKS);
> +
>  	return page;
>  }
>  
> @@ -2389,14 +2400,6 @@ rebalance:
>  				zonelist, high_zoneidx, nodemask,
>  				preferred_zone, migratetype);
>  		if (page) {
> -			/*
> -			 * page->pfmemalloc is set when ALLOC_NO_WATERMARKS was
> -			 * necessary to allocate the page. The expectation is
> -			 * that the caller is taking steps that will free more
> -			 * memory. The caller should avoid the page being used
> -			 * for !PFMEMALLOC purposes.
> -			 */
> -			page->pfmemalloc = true;
>  			goto got_pg;
>  		}
>  	}
> @@ -2569,8 +2572,6 @@ retry_cpuset:
>  		page = __alloc_pages_slowpath(gfp_mask, order,
>  				zonelist, high_zoneidx, nodemask,
>  				preferred_zone, migratetype);
> -	else
> -		page->pfmemalloc = false;
>  
>  	trace_mm_page_alloc(page, order, gfp_mask, migratetype);
>  
> -- 
> 1.7.5.4
> 
> 


* Re: regression with poll(2)
  2012-08-20  9:04   ` Mel Gorman
  2012-08-20  9:30     ` Eric Dumazet
  2012-08-20 16:54     ` Sage Weil
@ 2012-08-20 17:02     ` Linus Torvalds
  2012-08-21 15:58       ` Andrew Morton
  2 siblings, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2012-08-20 17:02 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Sage Weil, David Miller, netdev, linux-kernel, ceph-devel,
	Neil Brown, Peter Zijlstra, michaelc, emunson, Eric Dumazet,
	Christoph Lameter

On Mon, Aug 20, 2012 at 2:04 AM, Mel Gorman <mgorman@suse.de> wrote:
>
> Can the following patch be tested please? It is reported to fix an fio
> regression that may be similar to what you are experiencing but has not
> been picked up yet.

Andrew, is this in your queue, or should I take this directly, or
what? It seems to fix the problem for Eric and Sage, at least.

           Linus


* Re: regression with poll(2)
  2012-08-20  9:30     ` Eric Dumazet
@ 2012-08-20 23:20       ` Andrew Morton
  2012-08-21  5:16         ` Eric Dumazet
  0 siblings, 1 reply; 13+ messages in thread
From: Andrew Morton @ 2012-08-20 23:20 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Mel Gorman, Sage Weil, davem, netdev, linux-kernel, ceph-devel,
	neilb, a.p.zijlstra, michaelc, emunson, sebastian, cl, torvalds

On Mon, 20 Aug 2012 11:30:59 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Mon, 2012-08-20 at 10:04 +0100, Mel Gorman wrote:
> 
> > Can the following patch be tested please? It is reported to fix an fio
> > regression that may be similar to what you are experiencing but has not
> > been picked up yet.
> > 
> > -
> 
> This seems to help here.
> 
> Boot your machine with "mem=768M" or a bit less depending on your setup,
> and try a netperf.
> 
> -> before patch :
> 
> # netperf
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> localhost.localdomain () port 0 AF_INET
> Recv   Send    Send                          
> Socket Socket  Message  Elapsed              
> Size   Size    Size     Time     Throughput  
> bytes  bytes   bytes    secs.    10^6bits/sec  
> 
>  87380  16384  16384    14.00       6.05   
> 
> -> after patch :
> 
> Recv   Send    Send                          
> Socket Socket  Message  Elapsed              
> Size   Size    Size     Time     Throughput  
> bytes  bytes   bytes    secs.    10^6bits/sec  
> 
>  87380  16384  16384    10.00    18509.73   

"seem to help"?  Was previous performance fully restored?


* Re: regression with poll(2)
  2012-08-20 23:20       ` Andrew Morton
@ 2012-08-21  5:16         ` Eric Dumazet
  0 siblings, 0 replies; 13+ messages in thread
From: Eric Dumazet @ 2012-08-21  5:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Sage Weil, davem, netdev, linux-kernel, ceph-devel,
	neilb, a.p.zijlstra, michaelc, emunson, sebastian, cl, torvalds

On Mon, 2012-08-20 at 16:20 -0700, Andrew Morton wrote:
> On Mon, 20 Aug 2012 11:30:59 +0200
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > On Mon, 2012-08-20 at 10:04 +0100, Mel Gorman wrote:
> > 
> > > Can the following patch be tested please? It is reported to fix an fio
> > > regression that may be similar to what you are experiencing but has not
> > > been picked up yet.
> > > 
> > > -
> > 
> > This seems to help here.
> > 
> > Boot your machine with "mem=768M" or a bit less depending on your setup,
> > and try a netperf.
> > 
> > -> before patch :
> > 
> > # netperf
> > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> > localhost.localdomain () port 0 AF_INET
> > Recv   Send    Send                          
> > Socket Socket  Message  Elapsed              
> > Size   Size    Size     Time     Throughput  
> > bytes  bytes   bytes    secs.    10^6bits/sec  
> > 
> >  87380  16384  16384    14.00       6.05   
> > 
> > -> after patch :
> > 
> > Recv   Send    Send                          
> > Socket Socket  Message  Elapsed              
> > Size   Size    Size     Time     Throughput  
> > bytes  bytes   bytes    secs.    10^6bits/sec  
> > 
> >  87380  16384  16384    10.00    18509.73   
> 
> "seem to help"?  Was previous performance fully restored?

I did some tests this morning on my HP Z600, and got the same numbers as
3.5.0.

Of course, it's a bit difficult to say, because there is no
CONFIG_PFMEMALLOC to test the real impact.





* Re: regression with poll(2)
  2012-08-20 16:54     ` Sage Weil
@ 2012-08-21  7:05       ` Mel Gorman
  0 siblings, 0 replies; 13+ messages in thread
From: Mel Gorman @ 2012-08-21  7:05 UTC (permalink / raw)
  To: Sage Weil
  Cc: davem, netdev, linux-kernel, ceph-devel, neilb, a.p.zijlstra,
	michaelc, emunson, eric.dumazet, sebastian, cl, akpm, torvalds

On Mon, Aug 20, 2012 at 09:54:59AM -0700, Sage Weil wrote:
> > <SNIP>
> > 
> > > I've retested several times and confirmed that this change leads to the 
> > > breakage, and also confirmed that reverting it on top of -rc1 also fixes 
> > > the problem.
> > > 
> > > I've also added some additional instrumentation to my code and confirmed 
> > > that the process is blocking on poll(2) while netstat is reporting 
> > > data available on the socket.
> > > 
> > > What can I do to help track this down?
> > > 
> > 
> > Can the following patch be tested please? It is reported to fix an fio
> > regression that may be similar to what you are experiencing but has not
> > been picked up yet.
> 
> This patch appears to resolve things for me as well, at least after a 
> couple of passes.  I'll let you know if I see any further problems come up 
> with more testing.
> 

Thanks very much Sage.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 13+ messages in thread
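
The instrumentation described in the quoted text above (the process blocks in poll(2) while data is reported as queued on the socket) could be approximated from user space roughly as follows. This is only a minimal sketch, not the code Sage used: `struct readiness` and `check_readiness()` are hypothetical names, and it assumes a Linux stream socket where the FIONREAD ioctl reports the number of unread bytes. It asks the kernel how many bytes are queued and then does a zero-timeout poll(2) on the same descriptor, so the two answers can be compared.

```c
#include <poll.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper type: one snapshot of what the kernel reports. */
struct readiness {
	int queued_bytes;	/* unread bytes per FIONREAD, -1 on error */
	int poll_ready;		/* 1 if poll() reports POLLIN, 0 if not, -1 on error */
};

/* Compare the receive-queue size with what a non-blocking poll() says. */
static struct readiness check_readiness(int sd)
{
	struct readiness r = { -1, -1 };
	struct pollfd pfd = { .fd = sd, .events = POLLIN };
	int n;

	if (ioctl(sd, FIONREAD, &r.queued_bytes) < 0)
		r.queued_bytes = -1;

	n = poll(&pfd, 1, 0);	/* timeout 0: report current state, never block */
	if (n >= 0)
		r.poll_ready = (n > 0 && (pfd.revents & POLLIN)) ? 1 : 0;

	return r;
}
```

On a healthy kernel, a snapshot with queued_bytes > 0 should also have poll_ready == 1; a snapshot with queued bytes but no POLLIN would match the stall reported in this thread.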

* Re: regression with poll(2)
  2012-08-20 17:02     ` Linus Torvalds
@ 2012-08-21 15:58       ` Andrew Morton
  0 siblings, 0 replies; 13+ messages in thread
From: Andrew Morton @ 2012-08-21 15:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mel Gorman, Sage Weil, David Miller, netdev, linux-kernel,
	ceph-devel, Neil Brown, Peter Zijlstra, michaelc, emunson,
	Eric Dumazet, Christoph Lameter

On Mon, 20 Aug 2012 10:02:05 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, Aug 20, 2012 at 2:04 AM, Mel Gorman <mgorman@suse.de> wrote:
> >
> > Can the following patch be tested please? It is reported to fix an fio
> > regression that may be similar to what you are experiencing but has not
> > been picked up yet.
> 
> Andrew, is this in your queue, or should I take this directly, or
> what? It seems to fix the problem for Eric and Sage, at least.

Yes, I have a copy queued:


From: Alex Shi <alex.shi@intel.com>
Subject: mm: correct page->pfmemalloc to fix deactivate_slab regression

cfd19c5a9ec ("mm: only set page->pfmemalloc when ALLOC_NO_WATERMARKS was
used") tried to narrow down where page->pfmemalloc is set, but it missed
some places where pfmemalloc should be set.

So, in __slab_alloc, the mismatch between pfmemalloc and ALLOC_NO_WATERMARKS
causes incorrect deactivate_slab() calls on our core2 server:

    64.73%           fio  [kernel.kallsyms]     [k] _raw_spin_lock
                     |
                     --- _raw_spin_lock
                        |
                        |---0.34%-- deactivate_slab
                        |          __slab_alloc
                        |          kmem_cache_alloc
                        |          |

That causes a 40% regression in our fio sync write performance.

Move the check into get_page_from_freelist(), which resolves this issue.

Signed-off-by: Alex Shi <alex.shi@intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Sage Weil <sage@inktank.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |   21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff -puN mm/page_alloc.c~mm-correct-page-pfmemalloc-to-fix-deactivate_slab-regression mm/page_alloc.c
--- a/mm/page_alloc.c~mm-correct-page-pfmemalloc-to-fix-deactivate_slab-regression
+++ a/mm/page_alloc.c
@@ -1928,6 +1928,17 @@ this_zone_full:
 		zlc_active = 0;
 		goto zonelist_scan;
 	}
+
+	if (page)
+		/*
+		 * page->pfmemalloc is set when ALLOC_NO_WATERMARKS was
+		 * necessary to allocate the page. The expectation is
+		 * that the caller is taking steps that will free more
+		 * memory. The caller should avoid the page being used
+		 * for !PFMEMALLOC purposes.
+		 */
+		page->pfmemalloc = !!(alloc_flags & ALLOC_NO_WATERMARKS);
+
 	return page;
 }
 
@@ -2389,14 +2400,6 @@ rebalance:
 				zonelist, high_zoneidx, nodemask,
 				preferred_zone, migratetype);
 		if (page) {
-			/*
-			 * page->pfmemalloc is set when ALLOC_NO_WATERMARKS was
-			 * necessary to allocate the page. The expectation is
-			 * that the caller is taking steps that will free more
-			 * memory. The caller should avoid the page being used
-			 * for !PFMEMALLOC purposes.
-			 */
-			page->pfmemalloc = true;
 			goto got_pg;
 		}
 	}
@@ -2569,8 +2572,6 @@ retry_cpuset:
 		page = __alloc_pages_slowpath(gfp_mask, order,
 				zonelist, high_zoneidx, nodemask,
 				preferred_zone, migratetype);
-	else
-		page->pfmemalloc = false;
 
 	trace_mm_page_alloc(page, order, gfp_mask, migratetype);
 
_


^ permalink raw reply	[flat|nested] 13+ messages in thread
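
The effect of the patch above can be illustrated outside the kernel with a small sketch. It is not kernel code: `struct page_sketch` and `mark_pfmemalloc()` are hypothetical stand-ins, and the flag values are illustrative assumptions. It only mirrors the assignment the patch adds at the end of get_page_from_freelist(): every page that is returned has pfmemalloc set from the allocation flags, so the field cannot be left holding a stale value that disagrees with ALLOC_NO_WATERMARKS.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag values only; the real definitions live in the kernel. */
#define ALLOC_WMARK_LOW		0x01
#define ALLOC_NO_WATERMARKS	0x04

/* Hypothetical stand-in for the one struct page field the patch touches. */
struct page_sketch {
	bool pfmemalloc;
};

/*
 * Mirrors the hunk added to get_page_from_freelist(): pfmemalloc is true
 * only when ALLOC_NO_WATERMARKS was needed for this allocation, and it is
 * explicitly cleared otherwise.
 */
static struct page_sketch *mark_pfmemalloc(struct page_sketch *page,
					   int alloc_flags)
{
	if (page)
		page->pfmemalloc = !!(alloc_flags & ALLOC_NO_WATERMARKS);

	return page;
}
```

Because the assignment happens in one place for both the fast path and the slow path, the two hunks removed from the slow path and from the allocator entry point are no longer needed.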

end of thread, other threads:[~2012-08-21 15:55 UTC | newest]

Thread overview: 13+ messages
2012-08-15 19:46 regression with poll(2)? Sage Weil
2012-08-15 20:45 ` Atchley, Scott
2012-08-15 21:03   ` Sage Weil
2012-08-19 18:49 ` regression with poll(2) Sage Weil
2012-08-20  8:07   ` Eric Dumazet
2012-08-20  9:04   ` Mel Gorman
2012-08-20  9:30     ` Eric Dumazet
2012-08-20 23:20       ` Andrew Morton
2012-08-21  5:16         ` Eric Dumazet
2012-08-20 16:54     ` Sage Weil
2012-08-21  7:05       ` Mel Gorman
2012-08-20 17:02     ` Linus Torvalds
2012-08-21 15:58       ` Andrew Morton