From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1423394AbbEENKb (ORCPT ); Tue, 5 May 2015 09:10:31 -0400 Received: from metis.ext.pengutronix.de ([92.198.50.35]:41828 "EHLO metis.ext.pengutronix.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1423310AbbEENKR (ORCPT ); Tue, 5 May 2015 09:10:17 -0400 Date: Tue, 5 May 2015 15:10:08 +0200 From: Uwe =?iso-8859-1?Q?Kleine-K=F6nig?= To: Doug Anderson Cc: Addy Ke , Dmitry Torokhov , Heiko Stuebner , Wolfram Sang , Max Schwarz , "linux-kernel@vger.kernel.org" , "open list:ARM/Rockchip SoC..." , "linux-i2c@vger.kernel.org" , "linux-arm-kernel@lists.infradead.org" Subject: Re: [PATCH] i2c: rk3x: Increase wait timeout to 1 second Message-ID: <20150505131008.GH25193@pengutronix.de> References: <1430430247-9632-1-git-send-email-dianders@chromium.org> <20150504083312.GN25193@pengutronix.de> <20150504152415.GS25193@pengutronix.de> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) X-SA-Exim-Connect-IP: 2001:67c:670:100:1d::c0 X-SA-Exim-Mail-From: ukl@pengutronix.de X-SA-Exim-Scanned: No (on metis.ext.pengutronix.de); SAEximRunCond expanded to false X-PTX-Original-Recipient: linux-kernel@vger.kernel.org Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hello, On Mon, May 04, 2015 at 09:38:28AM -0700, Doug Anderson wrote: > On Mon, May 4, 2015 at 8:24 AM, Uwe Kleine-König > wrote: > >> Thank you for looking at this! I will clarify by giving explicit CPU > >> numbers (this issue can only happen in SMP, I think): > >> > >> 1. CPU1 is running rk3x_i2c_xfer() > >> > >> 2. CPU0 calls vprintk_emit(), which disables all IRQs on CPU0. > >> > >> 3. I2C interrupt is ready but is set to only run on CPU0, where IRQs > >> are disabled. > > Why does this irq only trigger on cpu0? > > So I've never done the research myself on the interrupt architecture > on ARM, but certainly on all ARM boards I've worked on recently all > IRQs (except those local to just one core) get routed to CPU0. I've > been told that there are some userspace programs that try to balance > things out a little bit, but even in those cases each IRQ is assigned > a single CPU and if that CPU has its interrupts off then the IRQ won't > be dynamically rerouted. I think I remember someone telling me that > there was also extra complexity around what happens when CPUs get > taken offline... > > A quick search shows some discussion from 2011 at > . > > Given that lots of smart people have looked at this and our interrupts > are still all going to CPU0, it's not something I'm going to try to > solve right now... researching a bit myself it seems you're right here. It would be nice to mention this in the commit log though. > > Assuming this is correct, the > > more robust change would be to detect this situation after 200ms instead > > of waiting 1s to work around this issue. > > Detect in what way? You mean add code to detect that the CPU that's > assigned our interrupt has been stalled for 200ms? ...and what do I > do in that case? > > I suppose I could add code that reads the I2C interrupt status and > notices that although the I2C controller claims that it should have an > interrupt by now but we never saw it go off. I could then give it > more time. Is that what you're looking for? We'd still want to > timeout eventually since there could be some other bug in the system > that's causing the interrupt not to ever go off... Hmm, probably that's overkill. (Better spend the time fixing printk not to disable irqs that long :-) > That adds a bunch of extra complexity, though. Is there a use case > where a timeout of 1 second poses a problem for you that would justify > the extra code? I'd presume you're thinking of a case where a timeout > would be expected in a case other than a bug in the i2c driver or a > hardware bug in the i2c controller where 200ms is an acceptable > timeout but 1 second is far too long. Note: if we are using the > timeout to detect a bug of some sort then I'd imagine that 1 second > might be actually better (a slightly longer delay makes it more > obvious that something bad is happening). > > > One other thing occurs to me: having a longer than 200ms delay may > actually be a correctness thing anyway. Technically i2c devices are > allowed to clock stretch "indefinitely", so transfers be quite long > and still be "legal". In practice a REALLY long clock stretch signals > a problem somewhere so we really do need some timeout, but 200ms may > be too short. For instance, some docs of the bq27541 battery gas > gauge claim that it can clock stretch (in extreme cases) for 144ms. > While this is still less than 200ms, it does seem prudent to give a > little more leeway before seeing a timeout. If you rework your commit log to describe the failing situation more accurately that should be fine. Best regards Uwe -- Pengutronix e.K. | Uwe Kleine-König | Industrial Linux Solutions | http://www.pengutronix.de/ | From mboxrd@z Thu Jan 1 00:00:00 1970 From: Uwe =?iso-8859-1?Q?Kleine-K=F6nig?= Subject: Re: [PATCH] i2c: rk3x: Increase wait timeout to 1 second Date: Tue, 5 May 2015 15:10:08 +0200 Message-ID: <20150505131008.GH25193@pengutronix.de> References: <1430430247-9632-1-git-send-email-dianders@chromium.org> <20150504083312.GN25193@pengutronix.de> <20150504152415.GS25193@pengutronix.de> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: QUOTED-PRINTABLE Return-path: Content-Disposition: inline In-Reply-To: Sender: linux-kernel-owner@vger.kernel.org To: Doug Anderson Cc: Addy Ke , Dmitry Torokhov , Heiko Stuebner , Wolfram Sang , Max Schwarz , "linux-kernel@vger.kernel.org" , "open list:ARM/Rockchip SoC..." , "linux-i2c@vger.kernel.org" , "linux-arm-kernel@lists.infradead.org" List-Id: linux-i2c@vger.kernel.org Hello, On Mon, May 04, 2015 at 09:38:28AM -0700, Doug Anderson wrote: > On Mon, May 4, 2015 at 8:24 AM, Uwe Kleine-K=F6nig > wrote: > >> Thank you for looking at this! I will clarify by giving explicit = CPU > >> numbers (this issue can only happen in SMP, I think): > >> > >> 1. CPU1 is running rk3x_i2c_xfer() > >> > >> 2. CPU0 calls vprintk_emit(), which disables all IRQs on CPU0. > >> > >> 3. I2C interrupt is ready but is set to only run on CPU0, where IR= Qs > >> are disabled. > > Why does this irq only trigger on cpu0? >=20 > So I've never done the research myself on the interrupt architecture > on ARM, but certainly on all ARM boards I've worked on recently all > IRQs (except those local to just one core) get routed to CPU0. I've > been told that there are some userspace programs that try to balance > things out a little bit, but even in those cases each IRQ is assigned > a single CPU and if that CPU has its interrupts off then the IRQ won'= t > be dynamically rerouted. I think I remember someone telling me that > there was also extra complexity around what happens when CPUs get > taken offline... >=20 > A quick search shows some discussion from 2011 at > . >=20 > Given that lots of smart people have looked at this and our interrupt= s > are still all going to CPU0, it's not something I'm going to try to > solve right now... researching a bit myself it seems you're right here. It would be nice t= o mention this in the commit log though. =20 > > Assuming this is correct, the > > more robust change would be to detect this situation after 200ms in= stead > > of waiting 1s to work around this issue. >=20 > Detect in what way? You mean add code to detect that the CPU that's > assigned our interrupt has been stalled for 200ms? ...and what do I > do in that case? >=20 > I suppose I could add code that reads the I2C interrupt status and > notices that although the I2C controller claims that it should have a= n > interrupt by now but we never saw it go off. I could then give it > more time. Is that what you're looking for? We'd still want to > timeout eventually since there could be some other bug in the system > that's causing the interrupt not to ever go off... Hmm, probably that's overkill. (Better spend the time fixing printk not to disable irqs that long :-) > That adds a bunch of extra complexity, though. Is there a use case > where a timeout of 1 second poses a problem for you that would justif= y > the extra code? I'd presume you're thinking of a case where a timeou= t > would be expected in a case other than a bug in the i2c driver or a > hardware bug in the i2c controller where 200ms is an acceptable > timeout but 1 second is far too long. Note: if we are using the > timeout to detect a bug of some sort then I'd imagine that 1 second > might be actually better (a slightly longer delay makes it more > obvious that something bad is happening). >=20 >=20 > One other thing occurs to me: having a longer than 200ms delay may > actually be a correctness thing anyway. Technically i2c devices are > allowed to clock stretch "indefinitely", so transfers be quite long > and still be "legal". In practice a REALLY long clock stretch signal= s > a problem somewhere so we really do need some timeout, but 200ms may > be too short. For instance, some docs of the bq27541 battery gas > gauge claim that it can clock stretch (in extreme cases) for 144ms. > While this is still less than 200ms, it does seem prudent to give a > little more leeway before seeing a timeout. If you rework your commit log to describe the failing situation more accurately that should be fine. Best regards Uwe --=20 Pengutronix e.K. | Uwe Kleine-K=F6nig = | Industrial Linux Solutions | http://www.pengutronix.de/= | From mboxrd@z Thu Jan 1 00:00:00 1970 From: u.kleine-koenig@pengutronix.de (Uwe =?iso-8859-1?Q?Kleine-K=F6nig?=) Date: Tue, 5 May 2015 15:10:08 +0200 Subject: [PATCH] i2c: rk3x: Increase wait timeout to 1 second In-Reply-To: References: <1430430247-9632-1-git-send-email-dianders@chromium.org> <20150504083312.GN25193@pengutronix.de> <20150504152415.GS25193@pengutronix.de> Message-ID: <20150505131008.GH25193@pengutronix.de> To: linux-arm-kernel@lists.infradead.org List-Id: linux-arm-kernel.lists.infradead.org Hello, On Mon, May 04, 2015 at 09:38:28AM -0700, Doug Anderson wrote: > On Mon, May 4, 2015 at 8:24 AM, Uwe Kleine-K?nig > wrote: > >> Thank you for looking at this! I will clarify by giving explicit CPU > >> numbers (this issue can only happen in SMP, I think): > >> > >> 1. CPU1 is running rk3x_i2c_xfer() > >> > >> 2. CPU0 calls vprintk_emit(), which disables all IRQs on CPU0. > >> > >> 3. I2C interrupt is ready but is set to only run on CPU0, where IRQs > >> are disabled. > > Why does this irq only trigger on cpu0? > > So I've never done the research myself on the interrupt architecture > on ARM, but certainly on all ARM boards I've worked on recently all > IRQs (except those local to just one core) get routed to CPU0. I've > been told that there are some userspace programs that try to balance > things out a little bit, but even in those cases each IRQ is assigned > a single CPU and if that CPU has its interrupts off then the IRQ won't > be dynamically rerouted. I think I remember someone telling me that > there was also extra complexity around what happens when CPUs get > taken offline... > > A quick search shows some discussion from 2011 at > . > > Given that lots of smart people have looked at this and our interrupts > are still all going to CPU0, it's not something I'm going to try to > solve right now... researching a bit myself it seems you're right here. It would be nice to mention this in the commit log though. > > Assuming this is correct, the > > more robust change would be to detect this situation after 200ms instead > > of waiting 1s to work around this issue. > > Detect in what way? You mean add code to detect that the CPU that's > assigned our interrupt has been stalled for 200ms? ...and what do I > do in that case? > > I suppose I could add code that reads the I2C interrupt status and > notices that although the I2C controller claims that it should have an > interrupt by now but we never saw it go off. I could then give it > more time. Is that what you're looking for? We'd still want to > timeout eventually since there could be some other bug in the system > that's causing the interrupt not to ever go off... Hmm, probably that's overkill. (Better spend the time fixing printk not to disable irqs that long :-) > That adds a bunch of extra complexity, though. Is there a use case > where a timeout of 1 second poses a problem for you that would justify > the extra code? I'd presume you're thinking of a case where a timeout > would be expected in a case other than a bug in the i2c driver or a > hardware bug in the i2c controller where 200ms is an acceptable > timeout but 1 second is far too long. Note: if we are using the > timeout to detect a bug of some sort then I'd imagine that 1 second > might be actually better (a slightly longer delay makes it more > obvious that something bad is happening). > > > One other thing occurs to me: having a longer than 200ms delay may > actually be a correctness thing anyway. Technically i2c devices are > allowed to clock stretch "indefinitely", so transfers be quite long > and still be "legal". In practice a REALLY long clock stretch signals > a problem somewhere so we really do need some timeout, but 200ms may > be too short. For instance, some docs of the bq27541 battery gas > gauge claim that it can clock stretch (in extreme cases) for 144ms. > While this is still less than 200ms, it does seem prudent to give a > little more leeway before seeing a timeout. If you rework your commit log to describe the failing situation more accurately that should be fine. Best regards Uwe -- Pengutronix e.K. | Uwe Kleine-K?nig | Industrial Linux Solutions | http://www.pengutronix.de/ |