All of lore.kernel.org
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org,
	Jesper Dangaard Brouer <brouer@redhat.com>,
	Florian Westphal <fw@strlen.de>,
	"David S. Miller" <davem@davemloft.net>
Subject: [PATCH 4.4 07/66] Revert "net: use lib/percpu_counter API for fragmentation mem accounting"
Date: Sun, 24 Sep 2017 22:31:02 +0200	[thread overview]
Message-ID: <20170924202920.905183117@linuxfoundation.org> (raw)
In-Reply-To: <20170924202920.581603259@linuxfoundation.org>

4.4-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Jesper Dangaard Brouer <brouer@redhat.com>


[ Upstream commit fb452a1aa3fd4034d7999e309c5466ff2d7005aa ]

This reverts commit 6d7b857d541ecd1d9bd997c97242d4ef94b19de2.

There is a bug in fragmentation codes use of the percpu_counter API,
that can cause issues on systems with many CPUs.

The frag_mem_limit() just reads the global counter (fbc->count),
without considering other CPUs can have upto batch size (130K) that
haven't been subtracted yet.  Due to the 3MBytes lower thresh limit,
this become dangerous at >=24 CPUs (3*1024*1024/130000=24).

The correct API usage would be to use __percpu_counter_compare() which
does the right thing, and takes into account the number of (online)
CPUs and batch size, to account for this and call __percpu_counter_sum()
when needed.

We choose to revert the use of the lib/percpu_counter API for frag
memory accounting for several reasons:

1) On systems with CPUs > 24, the heavier fully locked
   __percpu_counter_sum() is always invoked, which will be more
   expensive than the atomic_t that is reverted to.

Given systems with more than 24 CPUs are becoming common this doesn't
seem like a good option.  To mitigate this, the batch size could be
decreased and thresh be increased.

2) The add_frag_mem_limit+sub_frag_mem_limit pairs happen on the RX
   CPU, before SKBs are pushed into sockets on remote CPUs.  Given
   NICs can only hash on L2 part of the IP-header, the NIC-RXq's will
   likely be limited.  Thus, a fair chance that atomic add+dec happen
   on the same CPU.

Revert note that commit 1d6119baf061 ("net: fix percpu memory leaks")
removed init_frag_mem_limit() and instead use inet_frags_init_net().
After this revert, inet_frags_uninit_net() becomes empty.

Fixes: 6d7b857d541e ("net: use lib/percpu_counter API for fragmentation mem accounting")
Fixes: 1d6119baf061 ("net: fix percpu memory leaks")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 include/net/inet_frag.h  |   36 +++++++++---------------------------
 net/ipv4/inet_fragment.c |    4 +---
 2 files changed, 10 insertions(+), 30 deletions(-)

--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -1,14 +1,9 @@
 #ifndef __NET_FRAG_H__
 #define __NET_FRAG_H__
 
-#include <linux/percpu_counter.h>
-
 struct netns_frags {
-	/* The percpu_counter "mem" need to be cacheline aligned.
-	 *  mem.count must not share cacheline with other writers
-	 */
-	struct percpu_counter   mem ____cacheline_aligned_in_smp;
-
+	/* Keep atomic mem on separate cachelines in structs that include it */
+	atomic_t		mem ____cacheline_aligned_in_smp;
 	/* sysctls */
 	int			timeout;
 	int			high_thresh;
@@ -110,11 +105,11 @@ void inet_frags_fini(struct inet_frags *
 
 static inline int inet_frags_init_net(struct netns_frags *nf)
 {
-	return percpu_counter_init(&nf->mem, 0, GFP_KERNEL);
+	atomic_set(&nf->mem, 0);
+	return 0;
 }
 static inline void inet_frags_uninit_net(struct netns_frags *nf)
 {
-	percpu_counter_destroy(&nf->mem);
 }
 
 void inet_frags_exit_net(struct netns_frags *nf, struct inet_frags *f);
@@ -140,37 +135,24 @@ static inline bool inet_frag_evicting(st
 
 /* Memory Tracking Functions. */
 
-/* The default percpu_counter batch size is not big enough to scale to
- * fragmentation mem acct sizes.
- * The mem size of a 64K fragment is approx:
- *  (44 fragments * 2944 truesize) + frag_queue struct(200) = 129736 bytes
- */
-static unsigned int frag_percpu_counter_batch = 130000;
-
 static inline int frag_mem_limit(struct netns_frags *nf)
 {
-	return percpu_counter_read(&nf->mem);
+	return atomic_read(&nf->mem);
 }
 
 static inline void sub_frag_mem_limit(struct netns_frags *nf, int i)
 {
-	__percpu_counter_add(&nf->mem, -i, frag_percpu_counter_batch);
+	atomic_sub(i, &nf->mem);
 }
 
 static inline void add_frag_mem_limit(struct netns_frags *nf, int i)
 {
-	__percpu_counter_add(&nf->mem, i, frag_percpu_counter_batch);
+	atomic_add(i, &nf->mem);
 }
 
-static inline unsigned int sum_frag_mem_limit(struct netns_frags *nf)
+static inline int sum_frag_mem_limit(struct netns_frags *nf)
 {
-	unsigned int res;
-
-	local_bh_disable();
-	res = percpu_counter_sum_positive(&nf->mem);
-	local_bh_enable();
-
-	return res;
+	return atomic_read(&nf->mem);
 }
 
 /* RFC 3168 support :
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -234,10 +234,8 @@ evict_again:
 	cond_resched();
 
 	if (read_seqretry(&f->rnd_seqlock, seq) ||
-	    percpu_counter_sum(&nf->mem))
+	    sum_frag_mem_limit(nf))
 		goto evict_again;
-
-	percpu_counter_destroy(&nf->mem);
 }
 EXPORT_SYMBOL(inet_frags_exit_net);
 

  parent reply	other threads:[~2017-09-24 21:28 UTC|newest]

Thread overview: 69+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-09-24 20:30 [PATCH 4.4 00/66] 4.4.89-stable review Greg Kroah-Hartman
2017-09-24 20:30 ` [PATCH 4.4 01/66] ipv6: accept 64k - 1 packet length in ip6_find_1stfragopt() Greg Kroah-Hartman
2017-09-24 20:30 ` [PATCH 4.4 02/66] ipv6: add rcu grace period before freeing fib6_node Greg Kroah-Hartman
2017-09-24 20:30 ` [PATCH 4.4 03/66] ipv6: fix sparse warning on rt6i_node Greg Kroah-Hartman
2017-09-24 20:30 ` [PATCH 4.4 04/66] qlge: avoid memcpy buffer overflow Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 05/66] Revert "net: phy: Correctly process PHY_HALTED in phy_stop_machine()" Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 06/66] tcp: initialize rcv_mss to TCP_MIN_MSS instead of 0 Greg Kroah-Hartman
2017-09-24 20:31 ` Greg Kroah-Hartman [this message]
2017-09-24 20:31 ` [PATCH 4.4 08/66] Revert "net: fix percpu memory leaks" Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 09/66] gianfar: Fix Tx flow control deactivation Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 10/66] ipv6: fix memory leak with multiple tables during netns destruction Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 11/66] ipv6: fix typo in fib6_net_exit() Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 12/66] f2fs: check hot_data for roll-forward recovery Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 13/66] x86/fsgsbase/64: Report FSBASE and GSBASE correctly in core dumps Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 14/66] md/raid5: release/flush io in raid5_do_work() Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 15/66] nfsd: Fix general protection fault in release_lock_stateid() Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 16/66] mm: prevent double decrease of nr_reserved_highatomic Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 17/66] tty: improve tty_insert_flip_char() fast path Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 18/66] tty: improve tty_insert_flip_char() slow path Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 19/66] tty: fix __tty_insert_flip_char regression Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 20/66] Input: i8042 - add Gigabyte P57 to the keyboard reset table Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 21/66] MIPS: math-emu: <MAX|MAXA|MIN|MINA>.<D|S>: Fix quiet NaN propagation Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 22/66] MIPS: math-emu: <MAX|MAXA|MIN|MINA>.<D|S>: Fix cases of both inputs zero Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 23/66] MIPS: math-emu: <MAX|MIN>.<D|S>: Fix cases of both inputs negative Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 24/66] MIPS: math-emu: <MAXA|MINA>.<D|S>: Fix cases of input values with opposite signs Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 25/66] MIPS: math-emu: <MAXA|MINA>.<D|S>: Fix cases of both infinite inputs Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 26/66] MIPS: math-emu: MINA.<D|S>: Fix some cases of infinity and zero inputs Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 27/66] [PATCH - RESEND] crypto: AF_ALG - remove SGL terminator indicator when chaining Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 28/66] ext4: fix incorrect quotaoff if the quota feature is enabled Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 29/66] ext4: fix quota inconsistency during orphan cleanup for read-only mounts Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 30/66] powerpc: Fix DAR reporting when alignment handler faults Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 31/66] block: Relax a check in blk_start_queue() Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 32/66] md/bitmap: disable bitmap_resize for file-backed bitmaps Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 33/66] skd: Avoid that module unloading triggers a use-after-free Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 34/66] skd: Submit requests to firmware before triggering the doorbell Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 35/66] scsi: zfcp: fix queuecommand for scsi_eh commands when DIX enabled Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 36/66] scsi: zfcp: add handling for FCP_RESID_OVER to the fcp ingress path Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 37/66] scsi: zfcp: fix capping of unsuccessful GPN_FT SAN response trace records Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 38/66] scsi: zfcp: fix passing fsf_req to SCSI trace on TMF to correlate with HBA Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 39/66] scsi: zfcp: fix missing trace records for early returns in TMF eh handlers Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 40/66] scsi: zfcp: fix payload with full FCP_RSP IU in SCSI trace records Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 41/66] scsi: zfcp: trace HBA FSF response by default on dismiss or timedout late response Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 42/66] scsi: zfcp: trace high part of "new" 64 bit SCSI LUN Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 43/66] scsi: megaraid_sas: Check valid aen class range to avoid kernel panic Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 44/66] scsi: megaraid_sas: Return pended IOCTLs with cmd_status MFI_STAT_WRONG_STATE in case adapter is dead Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 45/66] scsi: storvsc: fix memory leak on ring buffer busy Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 46/66] scsi: sg: remove save_scat_len Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 47/66] scsi: sg: use standard lists for sg_requests Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 48/66] scsi: sg: off by one in sg_ioctl() Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 49/66] scsi: sg: factor out sg_fill_request_table() Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 50/66] scsi: sg: fixup infoleak when using SG_GET_REQUEST_TABLE Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 51/66] scsi: qla2xxx: Fix an integer overflow in sysfs code Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 52/66] ftrace: Fix selftest goto location on error Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 53/66] tracing: Apply trace_clock changes to instance max buffer Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 54/66] ARC: Re-enable MMU upon Machine Check exception Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 55/66] PCI: shpchp: Enable bridge bus mastering if MSI is enabled Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 56/66] media: v4l2-compat-ioctl32: Fix timespec conversion Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 57/66] media: uvcvideo: Prevent heap overflow when accessing mapped controls Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 58/66] bcache: initialize dirty stripes in flash_dev_run() Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 59/66] bcache: Fix leak of bdev reference Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 60/66] bcache: do not subtract sectors_to_gc for bypassed IO Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 61/66] bcache: correct cache_dirty_target in __update_writeback_rate() Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 62/66] bcache: Correct return value for sysfs attach errors Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 63/66] bcache: fix for gc and write-back race Greg Kroah-Hartman
2017-09-24 20:31 ` [PATCH 4.4 64/66] bcache: fix bch_hprint crash and improve output Greg Kroah-Hartman
2017-09-24 20:32 ` [PATCH 4.4 65/66] ftrace: Fix memleak when unregistering dynamic ops when tracing disabled Greg Kroah-Hartman
2017-09-24 20:32 ` [PATCH 4.4 66/66] mac80211: flush hw_roc_start work before cancelling the ROC Greg Kroah-Hartman
2017-09-25  1:04 ` [PATCH 4.4 00/66] 4.4.89-stable review Guenter Roeck
2017-09-25 23:12 ` Shuah Khan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170924202920.905183117@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=brouer@redhat.com \
    --cc=davem@davemloft.net \
    --cc=fw@strlen.de \
    --cc=linux-kernel@vger.kernel.org \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.