From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754188AbaFIIMH (ORCPT ); Mon, 9 Jun 2014 04:12:07 -0400 Received: from out2-smtp.messagingengine.com ([66.111.4.26]:44527 "EHLO out2-smtp.messagingengine.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753057AbaFIIME (ORCPT ); Mon, 9 Jun 2014 04:12:04 -0400 X-Sasl-enc: foq+wWgiG4z9xbNwvR5nizZcxVUBwM9ibBwOo63+dy6C 1402301522 Message-ID: <1402301519.2775.47.camel@perseus.fritz.box> Subject: Re: [PATCH 2/2] fs: proc/stat: use usual seq_file ops rather than single_open From: Ian Kent To: Andrew Morton Cc: Heiko Carstens , Christoph Hellwig , KAMEZAWA Hiroyuki , Andrea Righi , Eric Dumazet , linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, Hendrik Brueckner , Thorsten Diehl , "Elliott, Robert (Server Storage)" Date: Mon, 09 Jun 2014 16:11:59 +0800 In-Reply-To: <20140528153704.b2a3f46dc39ebf8f681d528a@linux-foundation.org> References: <20140521122521.GB7471@osiris> <20140521143229.GA32011@infradead.org> <20140528085841.GA4219@osiris> <20140528090153.GC4219@osiris> <20140528153704.b2a3f46dc39ebf8f681d528a@linux-foundation.org> Content-Type: text/plain; charset="UTF-8" X-Mailer: Evolution 3.10.4 (3.10.4-2.fc20) Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, 2014-05-28 at 15:37 -0700, Andrew Morton wrote: > On Wed, 28 May 2014 11:01:53 +0200 Heiko Carstens wrote: > > > Now, /proc/stat uses single_open() for showing information. This means > > the all data will be gathered and buffered once to a (big) buf. > > > > Now, /proc/stat shows stats per cpu and stats per IRQs. To get information > > in once-shot, it allocates a big buffer (until KMALLOC_MAX_SIZE). > > > > Eric Dumazet reported that the bufsize calculation doesn't take > > the numner of IRQ into account and the information cannot be > > got in one-shot. (By this, seq_read() will allocate buffer again > > and read the whole data gain...) > > > > This patch changes /proc/stat to use seq_open() rather than single_open() > > and provides ->start(), ->next(), ->stop(), ->show(). > > > > By this, /proc/stat will not need to take care of size of buffer. > > > > [heiko.carstens@de.ibm.com]: This is the forward port of a patch > > from KAMEZAWA Hiroyuki (https://lkml.org/lkml/2012/1/23/41). > > I added a couple of simple changes like e.g. that the cpu iterator > > handles 32 cpus in a batch to avoid lots of iterations. > > > > With this patch it should not happen anymore that reading /proc/stat > > fails because of a failing high order memory allocation. > > So this deletes the problematic allocation which [1/2] kind-of fixed, > yes? > > I agree with Ian - there's a hotplugging race. And [1/2] doesn't do > anything to address the worst-case allocation size. So I think we may > as well do this all in a single patch. > > Without having looked closely at the code I worry a bit about the > effects. /proc/pid/stat is a complex thing and its contents will vary > in strange ways as the things it is reporting on undergo concurrent > changes. This patch will presumably replace one set of bizarre > behaviour with a new set of bizarre behaviour and there may be > unforseen consequences to established userspace. > > So we're going to need a lot of testing and a lot of testing time to > identify issues and weed them out. > > So.. can we take this up for 3.16-rc1? See if we can get some careful > review done then and test it for a couple of months? > > Meanwhile, the changelog looks a bit hastily thrown together - some > smoothing would be nice, and perhaps some work spent identifying > possible behavioural changes. Timing changes, locking canges, effects > of concurrent fork/exit activity etc? > Umm ... I didn't expect this to turn into such a rant, apologies in advance. Certainly using the usual seq_file ops is desirable in the long run and that change should be worked on by those that maintain this area of code but, as Andrew says, it's too large a change to put in without considerable testing. Now the problem has been exposed by a change which sets the number of CPUs to the maximum number the architecture (s390) can have, 256, when no value has been specified and the kernel default configuration is used rather than 32 when hotplug is not set or 64 when it is. The allocation problem doesn't occur when the number of CPUs is set to the previous default of 64, even for low end systems with 2 CPUs and 2G RAM (like the one for which this problem was reported), but becomes a problem when the number of CPUs is set to 256 on these systems. Also, I believe the s390 maintainers are committed to keeping the the default configured number of CPUs at 256. So the actual problem is the heuristic used to calculate a initial buffer size not accounting for a configured number of CPUs much greater than the hardware can sensibly accommodate. If we assume that the current implementation functions correctly when the buffer overflows, ie. doubles the allocation size and restarts, then an interim solution to the problem comes down to not much more than what is in patch 1 above. Looking at the current heuristic allocation sizes, without taking the allocation for irqs into account we get: cpus: 32 size: 5k cpus: 64 size: 9k cpus: 128 size: 17k cpus: 256 size: 33k I don't know how many irqs need to be accounted for or if that will make a difference to my comments below. Someone else will need to comment on that. We know (from the order 4 allocation fail) that with 256 CPUs kmalloc() is looking for a 64k slab chunk IIUC and on low memory systems will frequently fail to get it. And for the previous default of 64 CPUs kmalloc() would be looking for a 16k slab which we have no evidence it doesn't always get even on low end systems. So why even use a heuristic calculation, since it can be quite wasteful anyway or, as in this case badly wrong, why not just allocate 8k or 16k in the open every time knowing that if the actual number of CPUs is large we can reasonably expect the system RAM to be correspondingly large which should avoid allocation failures upon a read retry. Comments please? Ian