linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Regression: at24 eeprom writing
@ 2015-10-02 23:05 Peter Rosin
  2015-10-04 19:50 ` Peter Rosin
  2015-10-05  8:45 ` Peter Rosin
  0 siblings, 2 replies; 16+ messages in thread
From: Peter Rosin @ 2015-10-02 23:05 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: Wolfram Sang, Christian Gmainer, linux-arm-kernel, Ludovic Desroches

Hi!

I recently upgraded from the atmel linux-3.18-at91 kernel to vanilla 4.2
and everything seemed fine. Until I tried to write to the little eeprom
chip. I then tried the linux-4.1-at91 kernel and that suffers too.

The symptoms are that it seems like writes get interrupted, and restarted
again without properly initializing everything again. Inspecting the i2c
bus during these fails gets me something like this (int hex) when I

echo abcdefghijklmnopqr > /sys/bus/i2c/devices/0-0050/eeprom

S a0 00 61 62 63 64 65 66 67 68 69 6a 6b 6c 6d 6e 6f 70 P
S a0 10 (clk and data low for a "long" time) 10 71 72 0a P

Notice how the address byte in the second chunk (10) is repeated after
the strange event on the i2c bus.

I looked around and found that if I revert a839ce663b3183209fdf7b1fc4796bfe2a4679c3
"eeprom: at24: extend driver to allow writing via i2c_smbus_write_byte_data"
eeprom writing starts working again.

AFAICT, the i2c-at91 bus driver makes the eeprom driver use the
i2c_transfer code path both with that patch and with it reverted,
so I sadly don't see why the patch makes a difference.

I'm on a board that is based on the sama5d31 evaluation kit, with a
NXP SE97BTP,547 chip and this in the devicetree:

			i2c0: i2c@f0014000 {
				status = "okay";

				jc42@18 {
					compatible = "jc42";
					reg = <0x18>;
				};

				eeprom@50 {
					compatible = "24c02";
					reg = <0x50>;
					pagesize = <16>;
				};
			};

Any ideas?

Cheers,
Peter

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Regression: at24 eeprom writing
  2015-10-02 23:05 Regression: at24 eeprom writing Peter Rosin
@ 2015-10-04 19:50 ` Peter Rosin
  2015-10-05  6:16   ` Christian Gmeiner
  2015-10-05  8:45 ` Peter Rosin
  1 sibling, 1 reply; 16+ messages in thread
From: Peter Rosin @ 2015-10-04 19:50 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: Wolfram Sang, Christian Gmainer, linux-arm-kernel, Ludovic Desroches

On 2015-10-03 01:05, Peter Rosin wrote:
> I looked around and found that if I revert a839ce663b3183209fdf7b1fc4796bfe2a4679c3
> "eeprom: at24: extend driver to allow writing via i2c_smbus_write_byte_data"
> eeprom writing starts working again.
> 
> AFAICT, the i2c-at91 bus driver makes the eeprom driver use the
> i2c_transfer code path both with that patch and with it reverted,
> so I sadly don't see why the patch makes a difference.

And now when I retry the same thing, that patch is no longer affecting things.
I must have confused myself over what kernel was actually running. Christian,
please accept my deepest apologies for implicating you in this regression.

But the regression is still there. In short, linux-3.18-at91 from the
linux4sam tree works, linux-4.1-at91 from the same tree does not, and
vanilla 4.2 also doesn't work. I have a hard time bisecting this thing
though, since the last known good version has a long list of atmel
patches that I refuse to even try to rebase...

Ideas still welcome of course.

Cheers,
Peter

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Regression: at24 eeprom writing
  2015-10-04 19:50 ` Peter Rosin
@ 2015-10-05  6:16   ` Christian Gmeiner
  2015-10-06 16:41     ` Peter Rosin
  0 siblings, 1 reply; 16+ messages in thread
From: Christian Gmeiner @ 2015-10-05  6:16 UTC (permalink / raw)
  To: Peter Rosin
  Cc: Linux Kernel Mailing List, Wolfram Sang, linux-arm-kernel,
	Ludovic Desroches

Hi Peter.

Sorry for the late answer - I am currently on my way to Dublin. Maybe
it helps if you enable
I2C_DEBUG_CORE and I2C_DEBUG_BUS. In theory you should see a little
bit better what
happens on the bus.

Greets
--
Christian Gmeiner, MSc

https://soundcloud.com/christian-gmeiner


2015-10-04 21:50 GMT+02:00 Peter Rosin <peda@lysator.liu.se>:
> On 2015-10-03 01:05, Peter Rosin wrote:
>> I looked around and found that if I revert a839ce663b3183209fdf7b1fc4796bfe2a4679c3
>> "eeprom: at24: extend driver to allow writing via i2c_smbus_write_byte_data"
>> eeprom writing starts working again.
>>
>> AFAICT, the i2c-at91 bus driver makes the eeprom driver use the
>> i2c_transfer code path both with that patch and with it reverted,
>> so I sadly don't see why the patch makes a difference.
>
> And now when I retry the same thing, that patch is no longer affecting things.
> I must have confused myself over what kernel was actually running. Christian,
> please accept my deepest apologies for implicating you in this regression.
>
> But the regression is still there. In short, linux-3.18-at91 from the
> linux4sam tree works, linux-4.1-at91 from the same tree does not, and
> vanilla 4.2 also doesn't work. I have a hard time bisecting this thing
> though, since the last known good version has a long list of atmel
> patches that I refuse to even try to rebase...
>
> Ideas still welcome of course.
>
> Cheers,
> Peter

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Regression: at24 eeprom writing
  2015-10-02 23:05 Regression: at24 eeprom writing Peter Rosin
  2015-10-04 19:50 ` Peter Rosin
@ 2015-10-05  8:45 ` Peter Rosin
  2015-10-05  8:59   ` kbuild test robot
                     ` (2 more replies)
  1 sibling, 3 replies; 16+ messages in thread
From: Peter Rosin @ 2015-10-05  8:45 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: Wolfram Sang, Christian Gmainer, linux-arm-kernel,
	Ludovic Desroches, Cyrille Pitchen

[-- Attachment #1: Type: text/plain, Size: 2298 bytes --]

On 2015-10-03 01:05, Peter Rosin wrote:
> Hi!
> 
> I recently upgraded from the atmel linux-3.18-at91 kernel to vanilla 4.2
> and everything seemed fine. Until I tried to write to the little eeprom
> chip. I then tried the linux-4.1-at91 kernel and that suffers too.
> 
> The symptoms are that it seems like writes get interrupted, and restarted
> again without properly initializing everything again. Inspecting the i2c
> bus during these fails gets me something like this (int hex) when I
> 
> echo abcdefghijklmnopqr > /sys/bus/i2c/devices/0-0050/eeprom
> 
> S a0 00 61 62 63 64 65 66 67 68 69 6a 6b 6c 6d 6e 6f 70 P
> S a0 10 (clk and data low for a "long" time) 10 71 72 0a P
> 
> Notice how the address byte in the second chunk (10) is repeated after
> the strange event on the i2c bus.
> 
> I looked around and found that if I revert a839ce663b3183209fdf7b1fc4796bfe2a4679c3
> "eeprom: at24: extend driver to allow writing via i2c_smbus_write_byte_data"
> eeprom writing starts working again.
> 
> AFAICT, the i2c-at91 bus driver makes the eeprom driver use the
> i2c_transfer code path both with that patch and with it reverted,
> so I sadly don't see why the patch makes a difference.
> 
> I'm on a board that is based on the sama5d31 evaluation kit, with a
> NXP SE97BTP,547 chip and this in the devicetree:
> 
> 			i2c0: i2c@f0014000 {
> 				status = "okay";
> 
> 				jc42@18 {
> 					compatible = "jc42";
> 					reg = <0x18>;
> 				};
> 
> 				eeprom@50 {
> 					compatible = "24c02";
> 					reg = <0x50>;
> 					pagesize = <16>;
> 				};
> 			};

Ok, I found the culprit, and I double and triple checked it this time...

If I move to the very latest on the linux-3.18-at91 branch, the bug is
there too. Which made it vastly more palatable to bisect the bug.

The offender (in the 4.2 kernel) is 93563a6a71bb69dd324fc7354c60fb05f84aae6b
"i2c: at91: fix a race condition when using the DMA controller"
which is far more understandable. Ao, adding Cyrille Pitchen to the Cc list.

If I add that patch on top of my previously working tree, it behaves just
as newer kernels, i.e. equally bad. The patch doesn't revert cleanly, but
reverting the patch and quick-n-dirty-fixing the conflict on vanilla 4.2
makes the problem go away.

I have attached what I actually reverted.

Cheers,
Peter


[-- Attachment #2: 0001-Revert-i2c-at91-fix-a-race-condition-when-using-the-.patch --]
[-- Type: text/x-patch, Size: 7502 bytes --]

>From d178e0636358e61503ac55d39c8612ef93c1d893 Mon Sep 17 00:00:00 2001
From: Peter Rosin <peda@axentia.se>
Date: Mon, 5 Oct 2015 10:16:18 +0200
Subject: [PATCH] Revert "i2c: at91: fix a race condition when using the DMA
 controller"

This reverts commit 93563a6a71bb69dd324fc7354c60fb05f84aae6b.

Conflicts:
	drivers/i2c/busses/i2c-at91.c
---
 drivers/i2c/busses/i2c-at91.c |   97 +++++++++--------------------------------
 1 file changed, 21 insertions(+), 76 deletions(-)

diff --git a/drivers/i2c/busses/i2c-at91.c b/drivers/i2c/busses/i2c-at91.c
index 1c758cd1e1ba..b5a5ef26b142 100644
--- a/drivers/i2c/busses/i2c-at91.c
+++ b/drivers/i2c/busses/i2c-at91.c
@@ -74,9 +74,6 @@
 #define	AT91_TWI_NACK		BIT(8)	/* Not Acknowledged */
 #define	AT91_TWI_LOCK		BIT(23) /* TWI Lock due to Frame Errors */
 
-#define	AT91_TWI_INT_MASK \
-	(AT91_TWI_TXCOMP | AT91_TWI_RXRDY | AT91_TWI_TXRDY | AT91_TWI_NACK)
-
 #define	AT91_TWI_IER		0x0024	/* Interrupt Enable Register */
 #define	AT91_TWI_IDR		0x0028	/* Interrupt Disable Register */
 #define	AT91_TWI_IMR		0x002c	/* Interrupt Mask Register */
@@ -155,12 +152,13 @@ static void at91_twi_write(struct at91_twi_dev *dev, unsigned reg, unsigned val)
 
 static void at91_disable_twi_interrupts(struct at91_twi_dev *dev)
 {
-	at91_twi_write(dev, AT91_TWI_IDR, AT91_TWI_INT_MASK);
+	at91_twi_write(dev, AT91_TWI_IDR,
+		       AT91_TWI_TXCOMP | AT91_TWI_RXRDY | AT91_TWI_TXRDY);
 }
 
 static void at91_twi_irq_save(struct at91_twi_dev *dev)
 {
-	dev->imr = at91_twi_read(dev, AT91_TWI_IMR) & AT91_TWI_INT_MASK;
+	dev->imr = at91_twi_read(dev, AT91_TWI_IMR) & 0x7;
 	at91_disable_twi_interrupts(dev);
 }
 
@@ -255,16 +253,7 @@ static void at91_twi_write_data_dma_callback(void *data)
 	dma_unmap_single(dev->dev, sg_dma_address(&dev->dma.sg[0]),
 			 dev->buf_len, DMA_TO_DEVICE);
 
-	/*
-	 * When this callback is called, THR/TX FIFO is likely not to be empty
-	 * yet. So we have to wait for TXCOMP or NACK bits to be set into the
-	 * Status Register to be sure that the STOP bit has been sent and the
-	 * transfer is completed. The NACK interrupt has already been enabled,
-	 * we just have to enable TXCOMP one.
-	 */
-	at91_twi_write(dev, AT91_TWI_IER, AT91_TWI_TXCOMP);
-	if (!dev->pdata->has_alt_cmd)
-		at91_twi_write(dev, AT91_TWI_CR, AT91_TWI_STOP);
+	at91_twi_write(dev, AT91_TWI_CR, AT91_TWI_STOP);
 }
 
 static void at91_twi_write_data_dma(struct at91_twi_dev *dev)
@@ -391,13 +380,10 @@ static void at91_twi_read_data_dma_callback(void *data)
 	dma_unmap_single(dev->dev, sg_dma_address(&dev->dma.sg[0]),
 			 dev->buf_len, DMA_FROM_DEVICE);
 
-	if (!dev->pdata->has_alt_cmd) {
-		/* The last two bytes have to be read without using dma */
-		dev->buf += dev->buf_len - 2;
-		dev->buf_len = 2;
-		ier |= AT91_TWI_RXRDY;
-	}
-	at91_twi_write(dev, AT91_TWI_IER, ier);
+	/* The last two bytes have to be read without using dma */
+	dev->buf += dev->buf_len - 2;
+	dev->buf_len = 2;
+	at91_twi_write(dev, AT91_TWI_IER, AT91_TWI_RXRDY);
 }
 
 static void at91_twi_read_data_dma(struct at91_twi_dev *dev)
@@ -473,7 +459,7 @@ static irqreturn_t atmel_twi_interrupt(int irq, void *dev_id)
 	/* catch error flags */
 	dev->transfer_status |= status;
 
-	if (irqstatus & (AT91_TWI_TXCOMP | AT91_TWI_NACK)) {
+	if (irqstatus & AT91_TWI_TXCOMP) {
 		at91_disable_twi_interrupts(dev);
 		complete(&dev->cmd_complete);
 	}
@@ -488,49 +474,6 @@ static int at91_do_twi_transfer(struct at91_twi_dev *dev)
 	bool has_unre_flag = dev->pdata->has_unre_flag;
 	bool has_alt_cmd = dev->pdata->has_alt_cmd;
 
-	/*
-	 * WARNING: the TXCOMP bit in the Status Register is NOT a clear on
-	 * read flag but shows the state of the transmission at the time the
-	 * Status Register is read. According to the programmer datasheet,
-	 * TXCOMP is set when both holding register and internal shifter are
-	 * empty and STOP condition has been sent.
-	 * Consequently, we should enable NACK interrupt rather than TXCOMP to
-	 * detect transmission failure.
-	 * Indeed let's take the case of an i2c write command using DMA.
-	 * Whenever the slave doesn't acknowledge a byte, the LOCK, NACK and
-	 * TXCOMP bits are set together into the Status Register.
-	 * LOCK is a clear on write bit, which is set to prevent the DMA
-	 * controller from sending new data on the i2c bus after a NACK
-	 * condition has happened. Once locked, this i2c peripheral stops
-	 * triggering the DMA controller for new data but it is more than
-	 * likely that a new DMA transaction is already in progress, writing
-	 * into the Transmit Holding Register. Since the peripheral is locked,
-	 * these new data won't be sent to the i2c bus but they will remain
-	 * into the Transmit Holding Register, so TXCOMP bit is cleared.
-	 * Then when the interrupt handler is called, the Status Register is
-	 * read: the TXCOMP bit is clear but NACK bit is still set. The driver
-	 * manage the error properly, without waiting for timeout.
-	 * This case can be reproduced easyly when writing into an at24 eeprom.
-	 *
-	 * Besides, the TXCOMP bit is already set before the i2c transaction
-	 * has been started. For read transactions, this bit is cleared when
-	 * writing the START bit into the Control Register. So the
-	 * corresponding interrupt can safely be enabled just after.
-	 * However for write transactions managed by the CPU, we first write
-	 * into THR, so TXCOMP is cleared. Then we can safely enable TXCOMP
-	 * interrupt. If TXCOMP interrupt were enabled before writing into THR,
-	 * the interrupt handler would be called immediately and the i2c command
-	 * would be reported as completed.
-	 * Also when a write transaction is managed by the DMA controller,
-	 * enabling the TXCOMP interrupt in this function may lead to a race
-	 * condition since we don't know whether the TXCOMP interrupt is enabled
-	 * before or after the DMA has started to write into THR. So the TXCOMP
-	 * interrupt is enabled later by at91_twi_write_data_dma_callback().
-	 * Immediately after in that DMA callback, if the alternative command
-	 * mode is not used, we still need to send the STOP condition manually
-	 * writing the corresponding bit into the Control Register.
-	 */
-
 	dev_dbg(dev->dev, "transfer: %s %d bytes.\n",
 		(dev->msg->flags & I2C_M_RD) ? "read" : "write", dev->buf_len);
 
@@ -578,24 +521,26 @@ static int at91_do_twi_transfer(struct at91_twi_dev *dev)
 		 * seems to be the best solution.
 		 */
 		if (dev->use_dma && (dev->buf_len > AT91_I2C_DMA_THRESHOLD)) {
-			at91_twi_write(dev, AT91_TWI_IER, AT91_TWI_NACK);
 			at91_twi_read_data_dma(dev);
-		} else {
+			/*
+			 * It is important to enable TXCOMP irq here because
+			 * doing it only when transferring the last two bytes
+			 * will mask NACK errors since TXCOMP is set when a
+			 * NACK occurs.
+			 */
 			at91_twi_write(dev, AT91_TWI_IER,
-				       AT91_TWI_TXCOMP |
-				       AT91_TWI_NACK |
-				       AT91_TWI_RXRDY);
-		}
+			       AT91_TWI_TXCOMP);
+		} else
+			at91_twi_write(dev, AT91_TWI_IER,
+			       AT91_TWI_TXCOMP | AT91_TWI_RXRDY);
 	} else {
 		if (dev->use_dma && (dev->buf_len > AT91_I2C_DMA_THRESHOLD)) {
-			at91_twi_write(dev, AT91_TWI_IER, AT91_TWI_NACK);
 			at91_twi_write_data_dma(dev);
+			at91_twi_write(dev, AT91_TWI_IER, AT91_TWI_TXCOMP);
 		} else {
 			at91_twi_write_next_byte(dev);
 			at91_twi_write(dev, AT91_TWI_IER,
-				       AT91_TWI_TXCOMP |
-				       AT91_TWI_NACK |
-				       AT91_TWI_TXRDY);
+				AT91_TWI_TXCOMP | AT91_TWI_TXRDY);
 		}
 	}
 
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: Re: Regression: at24 eeprom writing
  2015-10-05  8:45 ` Peter Rosin
@ 2015-10-05  8:59   ` kbuild test robot
  2015-10-05 15:00   ` Ludovic Desroches
  2015-10-05 15:28   ` Cyrille Pitchen
  2 siblings, 0 replies; 16+ messages in thread
From: kbuild test robot @ 2015-10-05  8:59 UTC (permalink / raw)
  To: Peter Rosin
  Cc: kbuild-all, Linux Kernel Mailing List, Wolfram Sang,
	Christian Gmainer, linux-arm-kernel, Ludovic Desroches,
	Cyrille Pitchen

[-- Attachment #1: Type: text/plain, Size: 2919 bytes --]

Hi Peter,

[auto build test WARNING on v4.3-rc4 -- if it's inappropriate base, please ignore]

config: arm-at91_dt_defconfig (attached as .config)
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=arm 

All warnings (new ones prefixed by >>):

   drivers/i2c/busses/i2c-at91.c: In function 'at91_twi_read_data_dma_callback':
>> drivers/i2c/busses/i2c-at91.c:378:11: warning: unused variable 'ier' [-Wunused-variable]
     unsigned ier = AT91_TWI_TXCOMP;
              ^

vim +/ier +378 drivers/i2c/busses/i2c-at91.c

75b81f33 Marek Roszko      2014-08-20  362  			dev->buf_len = 1;
75b81f33 Marek Roszko      2014-08-20  363  		}
fac368a0 Nikolaus Voss     2011-11-08  364  	}
fac368a0 Nikolaus Voss     2011-11-08  365  
fac368a0 Nikolaus Voss     2011-11-08  366  	/* send stop if second but last byte has been read */
0ef6f321 Cyrille Pitchen   2015-06-09  367  	if (!dev->pdata->has_alt_cmd && dev->buf_len == 1)
fac368a0 Nikolaus Voss     2011-11-08  368  		at91_twi_write(dev, AT91_TWI_CR, AT91_TWI_STOP);
fac368a0 Nikolaus Voss     2011-11-08  369  
fac368a0 Nikolaus Voss     2011-11-08  370  	dev_dbg(dev->dev, "read 0x%x, to go %d\n", *dev->buf, dev->buf_len);
fac368a0 Nikolaus Voss     2011-11-08  371  
fac368a0 Nikolaus Voss     2011-11-08  372  	++dev->buf;
fac368a0 Nikolaus Voss     2011-11-08  373  }
fac368a0 Nikolaus Voss     2011-11-08  374  
60937b2c Ludovic Desroches 2012-11-23  375  static void at91_twi_read_data_dma_callback(void *data)
60937b2c Ludovic Desroches 2012-11-23  376  {
60937b2c Ludovic Desroches 2012-11-23  377  	struct at91_twi_dev *dev = (struct at91_twi_dev *)data;
0ef6f321 Cyrille Pitchen   2015-06-09 @378  	unsigned ier = AT91_TWI_TXCOMP;
60937b2c Ludovic Desroches 2012-11-23  379  
5e3cfc6c Cyrille Pitchen   2015-06-09  380  	dma_unmap_single(dev->dev, sg_dma_address(&dev->dma.sg[0]),
28772ac8 Wolfram Sang      2014-07-21  381  			 dev->buf_len, DMA_FROM_DEVICE);
60937b2c Ludovic Desroches 2012-11-23  382  
60937b2c Ludovic Desroches 2012-11-23  383  	/* The last two bytes have to be read without using dma */
60937b2c Ludovic Desroches 2012-11-23  384  	dev->buf += dev->buf_len - 2;
60937b2c Ludovic Desroches 2012-11-23  385  	dev->buf_len = 2;
e0b440c7 Peter Rosin       2015-10-05  386  	at91_twi_write(dev, AT91_TWI_IER, AT91_TWI_RXRDY);

:::::: The code at line 378 was first introduced by commit
:::::: 0ef6f3213dac48bbc71c98b5a2752db556f3e649 i2c: at91: add support for new alternative command mode

:::::: TO: Cyrille Pitchen <cyrille.pitchen@atmel.com>
:::::: CC: Wolfram Sang <wsa@the-dreams.de>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 20006 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Regression: at24 eeprom writing
  2015-10-05  8:45 ` Peter Rosin
  2015-10-05  8:59   ` kbuild test robot
@ 2015-10-05 15:00   ` Ludovic Desroches
  2015-10-05 15:09     ` Peter Rosin
  2015-10-05 15:28   ` Cyrille Pitchen
  2 siblings, 1 reply; 16+ messages in thread
From: Ludovic Desroches @ 2015-10-05 15:00 UTC (permalink / raw)
  To: Peter Rosin
  Cc: Linux Kernel Mailing List, Wolfram Sang, Christian Gmainer,
	linux-arm-kernel, Ludovic Desroches, Cyrille Pitchen

Hi Peter

On Mon, Oct 05, 2015 at 10:45:29AM +0200, Peter Rosin wrote:
> On 2015-10-03 01:05, Peter Rosin wrote:

[...]

> Ok, I found the culprit, and I double and triple checked it this time...
> 
> If I move to the very latest on the linux-3.18-at91 branch, the bug is
> there too. Which made it vastly more palatable to bisect the bug.
> 
> The offender (in the 4.2 kernel) is 93563a6a71bb69dd324fc7354c60fb05f84aae6b
> "i2c: at91: fix a race condition when using the DMA controller"
> which is far more understandable. Ao, adding Cyrille Pitchen to the Cc list.
> 

Thanks for the bisecting effort. I am currently at ELCE where I have
met someone with the same kind of issue. Is it easily reproducible? It
doesn't seem to be the case for him.

I'll have a look once back.

> If I add that patch on top of my previously working tree, it behaves just
> as newer kernels, i.e. equally bad. The patch doesn't revert cleanly, but
> reverting the patch and quick-n-dirty-fixing the conflict on vanilla 4.2
> makes the problem go away.
> 
> I have attached what I actually reverted.
> 
> Cheers,
> Peter
> 

Ludovic

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Regression: at24 eeprom writing
  2015-10-05 15:00   ` Ludovic Desroches
@ 2015-10-05 15:09     ` Peter Rosin
  2015-10-12 15:13       ` Peter Rosin
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Rosin @ 2015-10-05 15:09 UTC (permalink / raw)
  To: Linux Kernel Mailing List, Wolfram Sang, Christian Gmainer,
	linux-arm-kernel, Cyrille Pitchen

On 2015-10-05 17:00, Ludovic Desroches wrote:
> Hi Peter
> 
> On Mon, Oct 05, 2015 at 10:45:29AM +0200, Peter Rosin wrote:
>> On 2015-10-03 01:05, Peter Rosin wrote:
> 
> [...]
> 
>> Ok, I found the culprit, and I double and triple checked it this time...
>>
>> If I move to the very latest on the linux-3.18-at91 branch, the bug is
>> there too. Which made it vastly more palatable to bisect the bug.
>>
>> The offender (in the 4.2 kernel) is 93563a6a71bb69dd324fc7354c60fb05f84aae6b
>> "i2c: at91: fix a race condition when using the DMA controller"
>> which is far more understandable. Ao, adding Cyrille Pitchen to the Cc list.
>>
> 
> Thanks for the bisecting effort. I am currently at ELCE where I have
> met someone with the same kind of issue. Is it easily reproducible? It
> doesn't seem to be the case for him.

Yes, easy as pie, happens on every eeprom write of 256 bytes so far...

> I'll have a look once back.

Ok good, to further help understanding, I'm seeing this on the i2c bus (I
hope you understand the notation, or just ask):

Working (4.2 + the revert from my previous message)
===================================================

S W50 0x00 "product = 1-776-" P   S W50 NACK P   S W50 NACK P
delay 15.2 ms
S W50 0x10 "3.0\n" P
delay 19.5 ms
S W50 0x10 "3.0\n" P   S W50 NACK P   S W50 NACK P   S W50 NACK P
delay 19.0 ms
S W50 0x14 "serial = 380" P
delay 18.8 ms
S W50 0x14 "serial = 380" P   S W50 NACK P   S W50 NACK P
delay 18.4 ms
S W50 0x20 "000002\n" P
delay 19.2 ms
S W50 0x20 "000002\n" P
delay 10.8 ms
S W50 0x27 "         " P   S W50 NACK P (repeated 5 times)
delay 16.7 ms
S W50 0x30 "                " P
delay 18.4 ms
S W50 0x30 "                " P   S W50 NACK P (repeated 3 times)
delay 17.9 ms
S W50 0x40 "                " P

etc

I.e. every write (but the first) seems to fail the first time and is
then retried, even if the i2c bus shows no failure indication (at
least that I can find).


Not working (vanilla 4.2)
=========================

S W50 0x00 "product = 1-776-" P   S W50 NACK P   S W50 NACK P
delay 17.3 ms
S W50 0x10 ACK...
delay 19.8 with both SDA and SCL low
...ACK 0x10 "3.0\n" P   S W50 NACK P   S W50 NACK P
delay 19.3 ms
S W50 0x14 "serial = 380" P   S W50 NACK P   S W50 NACK P
delay 18.5 ms
S W50 0x20 ACK...
delay 19.9 with both SDA and SCL low
...ACK 0x20 "000002\n" P   S W50 NACK P   S W50 NACK P
delay 18.9 ms
S W50 0x27 "         " P   S W50 NACK P   S W50 NACK P
delay 19.2 ms
S W50 0x30 "                " P   S W50 NACK P   S W50 NACK P
delay 17.6 ms
S W50 0x40 "                " P   S W50 NACK P   S W50 NACK P

etc

I.e. when there is a disturbance (the long ACks) the recovery
mechanism appears to attempt to heal it by resending only the
failing byte, but the eeprom appears to not see the failure and
takes both bytes instead of just the resend.



It seems dangerous to attempt to fix apparent trouble with an
i2c command by anything less than a full retry, like the working
version appears to do. No?

But what trouble does the i2c bus driver see? Admittedly I only
have a simple logic level bus viewer, and not a full-blown
oscilloscope, so there might be something analogue going on?
I don't think so though, those signals looked fine last time we
looked (but we obviously didn't have these issues then, and
didn't really look that closely). I'll see if I can recheck
with a real scope too.

Cheers,
Peter

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Regression: at24 eeprom writing
  2015-10-05  8:45 ` Peter Rosin
  2015-10-05  8:59   ` kbuild test robot
  2015-10-05 15:00   ` Ludovic Desroches
@ 2015-10-05 15:28   ` Cyrille Pitchen
  2015-10-05 15:54     ` Peter Rosin
  2 siblings, 1 reply; 16+ messages in thread
From: Cyrille Pitchen @ 2015-10-05 15:28 UTC (permalink / raw)
  To: Peter Rosin, Linux Kernel Mailing List
  Cc: Wolfram Sang, Christian Gmainer, linux-arm-kernel, Ludovic Desroches

Le 05/10/2015 10:45, Peter Rosin a écrit :
> On 2015-10-03 01:05, Peter Rosin wrote:
>> Hi!
>>
>> I recently upgraded from the atmel linux-3.18-at91 kernel to vanilla 4.2
>> and everything seemed fine. Until I tried to write to the little eeprom
>> chip. I then tried the linux-4.1-at91 kernel and that suffers too.
>>
>> The symptoms are that it seems like writes get interrupted, and restarted
>> again without properly initializing everything again. Inspecting the i2c
>> bus during these fails gets me something like this (int hex) when I
>>
>> echo abcdefghijklmnopqr > /sys/bus/i2c/devices/0-0050/eeprom
>>
>> S a0 00 61 62 63 64 65 66 67 68 69 6a 6b 6c 6d 6e 6f 70 P
>> S a0 10 (clk and data low for a "long" time) 10 71 72 0a P
>>
>> Notice how the address byte in the second chunk (10) is repeated after
>> the strange event on the i2c bus.
>>
>> I looked around and found that if I revert a839ce663b3183209fdf7b1fc4796bfe2a4679c3
>> "eeprom: at24: extend driver to allow writing via i2c_smbus_write_byte_data"
>> eeprom writing starts working again.
>>
>> AFAICT, the i2c-at91 bus driver makes the eeprom driver use the
>> i2c_transfer code path both with that patch and with it reverted,
>> so I sadly don't see why the patch makes a difference.
>>
>> I'm on a board that is based on the sama5d31 evaluation kit, with a
>> NXP SE97BTP,547 chip and this in the devicetree:
>>
>> 			i2c0: i2c@f0014000 {
>> 				status = "okay";
>>
>> 				jc42@18 {
>> 					compatible = "jc42";
>> 					reg = <0x18>;
>> 				};
>>
>> 				eeprom@50 {
>> 					compatible = "24c02";
>> 					reg = <0x50>;
>> 					pagesize = <16>;
>> 				};
>> 			};
> 
> Ok, I found the culprit, and I double and triple checked it this time...
> 
> If I move to the very latest on the linux-3.18-at91 branch, the bug is
> there too. Which made it vastly more palatable to bisect the bug.
> 
> The offender (in the 4.2 kernel) is 93563a6a71bb69dd324fc7354c60fb05f84aae6b
> "i2c: at91: fix a race condition when using the DMA controller"
> which is far more understandable. Ao, adding Cyrille Pitchen to the Cc list.
> 
> If I add that patch on top of my previously working tree, it behaves just
> as newer kernels, i.e. equally bad. The patch doesn't revert cleanly, but
> reverting the patch and quick-n-dirty-fixing the conflict on vanilla 4.2
> makes the problem go away.
> 
> I have attached what I actually reverted.
> 
> Cheers,
> Peter
> 

Hi Peter,

Can you tell me whether your device tree sets the I2C controller i2c0 to use
dma channels, especially the "tx" one. I guess so but it is just to confirm 
hence we look in the right direction.

Then I think we should look at this part of the original patch:

        } else {
                if (dev->use_dma && (dev->buf_len > AT91_I2C_DMA_THRESHOLD)) {
+                       at91_twi_write(dev, AT91_TWI_IER, AT91_TWI_NACK);
                        at91_twi_write_data_dma(dev);
-                       at91_twi_write(dev, AT91_TWI_IER, AT91_TWI_TXCOMP);
                } else {
                        at91_twi_write_next_byte(dev);
                        at91_twi_write(dev, AT91_TWI_IER,

Here, for DMA TX transfers, we enable the NACK interrupt instead of the TXCOMP
one. This is the actual fix of the DMA race. Indeed there were two issues when
using TXCOMP to detect NACK conditions.

As written in the datasheet and confirmed by the IP designer, the TXCOMP bit is
set in the Status Register when both the Transmit Holding Register (THR) and
its internal shifter are empty and the STOP condition has been sent.
So when a first transfer successfully completes, the TXCOMP bit is set. Then
this bit remains set until the next write into THR.

The first issue is the race condition:
at91_twi_write_data_dma(dev);
at91_twi_write(dev, AT91_TWI_IER, AT91_TWI_TXCOMP);

The first line prepares a DMA transfer but when we execute the second line to
enable the TXCOMP interrupt, we have no mean to know whether the DMA has
already performed a first write access into THR, which also clears the TXCOMP
bit in the Status Register. If the DMA controller hasn't completed this first
write yet, the TXCOMP bit is still set in the Status Register. Hence the
interrupt handler is executed immediately after the TXCOMP interrupt has been
enabled. If the interrupt handler reads the Status Register before the DMA
controller has written into the THR, the TXCOMP bit is still set. Consequently,
the interrupt handler calls complete(&dev->cmd_complete) thinking the transfer
has completed though it actually has not even started.


The second issue is about the detection of NACK condition when using the DMA
controller. Before the patch, the driver relied on the TXCOMP interrupt to
detect NACK condition. It is true that the TXCOMP bit is set in the Status
Register when a NACK condition occurs. However if the I2C controller has
already triggered the DMA controller before it detects a NACK condition and
sets the TXCOMP bit, the DMA controller writes into the THR right after, hence
clears the TXCOMP bit in the Status Register. when the interrupt handler is
executed, it reads the Status Register but fails to detect the NACK condition
since the TXCOMP bit has been cleared: The driver misses the NACK condition.
This is why we should rely on the NACK interrupt instead. the NACK bit is
cleared on read in the Status Register, the NACK condition is properly
detected.

So instead of reverting the patch, maybe you could try to add the single line
which used to enable the TXCOMP interrupt after having scheduled the TX DMA
transfer:

        } else {
                if (dev->use_dma && (dev->buf_len > AT91_I2C_DMA_THRESHOLD)) {
                        at91_twi_write(dev, AT91_TWI_IER, AT91_TWI_NACK);
                        at91_twi_write_data_dma(dev);
+                       at91_twi_write(dev, AT91_TWI_IER, AT91_TWI_TXCOMP);
                } else {
                        at91_twi_write_next_byte(dev);
                        at91_twi_write(dev, AT91_TWI_IER,

I don't know whether this would "fix" your issue. Anyway if it does, this is
not a proper fix but it may help us to understand what is going on.

On my side, I will try to reproduce your issue on a sama5d3x board.


Best Regards,

Cyrille

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Regression: at24 eeprom writing
  2015-10-05 15:28   ` Cyrille Pitchen
@ 2015-10-05 15:54     ` Peter Rosin
  0 siblings, 0 replies; 16+ messages in thread
From: Peter Rosin @ 2015-10-05 15:54 UTC (permalink / raw)
  To: Cyrille Pitchen, Linux Kernel Mailing List
  Cc: Wolfram Sang, Christian Gmainer, linux-arm-kernel, Ludovic Desroches

On 2015-10-05 17:28, Cyrille Pitchen wrote:
> Le 05/10/2015 10:45, Peter Rosin a écrit :
>> On 2015-10-03 01:05, Peter Rosin wrote:
>>> Hi!
>>>
>>> I recently upgraded from the atmel linux-3.18-at91 kernel to vanilla 4.2
>>> and everything seemed fine. Until I tried to write to the little eeprom
>>> chip. I then tried the linux-4.1-at91 kernel and that suffers too.
>>>
>>> The symptoms are that it seems like writes get interrupted, and restarted
>>> again without properly initializing everything again. Inspecting the i2c
>>> bus during these fails gets me something like this (int hex) when I
>>>
>>> echo abcdefghijklmnopqr > /sys/bus/i2c/devices/0-0050/eeprom
>>>
>>> S a0 00 61 62 63 64 65 66 67 68 69 6a 6b 6c 6d 6e 6f 70 P
>>> S a0 10 (clk and data low for a "long" time) 10 71 72 0a P
>>>
>>> Notice how the address byte in the second chunk (10) is repeated after
>>> the strange event on the i2c bus.
>>>
>>> I looked around and found that if I revert a839ce663b3183209fdf7b1fc4796bfe2a4679c3
>>> "eeprom: at24: extend driver to allow writing via i2c_smbus_write_byte_data"
>>> eeprom writing starts working again.
>>>
>>> AFAICT, the i2c-at91 bus driver makes the eeprom driver use the
>>> i2c_transfer code path both with that patch and with it reverted,
>>> so I sadly don't see why the patch makes a difference.
>>>
>>> I'm on a board that is based on the sama5d31 evaluation kit, with a
>>> NXP SE97BTP,547 chip and this in the devicetree:
>>>
>>> 			i2c0: i2c@f0014000 {
>>> 				status = "okay";
>>>
>>> 				jc42@18 {
>>> 					compatible = "jc42";
>>> 					reg = <0x18>;
>>> 				};
>>>
>>> 				eeprom@50 {
>>> 					compatible = "24c02";
>>> 					reg = <0x50>;
>>> 					pagesize = <16>;
>>> 				};
>>> 			};
>>
>> Ok, I found the culprit, and I double and triple checked it this time...
>>
>> If I move to the very latest on the linux-3.18-at91 branch, the bug is
>> there too. Which made it vastly more palatable to bisect the bug.
>>
>> The offender (in the 4.2 kernel) is 93563a6a71bb69dd324fc7354c60fb05f84aae6b
>> "i2c: at91: fix a race condition when using the DMA controller"
>> which is far more understandable. Ao, adding Cyrille Pitchen to the Cc list.
>>
>> If I add that patch on top of my previously working tree, it behaves just
>> as newer kernels, i.e. equally bad. The patch doesn't revert cleanly, but
>> reverting the patch and quick-n-dirty-fixing the conflict on vanilla 4.2
>> makes the problem go away.
>>
>> I have attached what I actually reverted.
>>
>> Cheers,
>> Peter
>>
> 
> Hi Peter,
> 
> Can you tell me whether your device tree sets the I2C controller i2c0 to use
> dma channels, especially the "tx" one. I guess so but it is just to confirm 
> hence we look in the right direction.

I think yes, I'm including sama5d3.dtsi and am not overriding anything interesting
in that area.
 
> Then I think we should look at this part of the original patch:
> 
>         } else {
>                 if (dev->use_dma && (dev->buf_len > AT91_I2C_DMA_THRESHOLD)) {
> +                       at91_twi_write(dev, AT91_TWI_IER, AT91_TWI_NACK);
>                         at91_twi_write_data_dma(dev);
> -                       at91_twi_write(dev, AT91_TWI_IER, AT91_TWI_TXCOMP);
>                 } else {
>                         at91_twi_write_next_byte(dev);
>                         at91_twi_write(dev, AT91_TWI_IER,
> 
> Here, for DMA TX transfers, we enable the NACK interrupt instead of the TXCOMP
> one. This is the actual fix of the DMA race. Indeed there were two issues when
> using TXCOMP to detect NACK conditions.
> 
> As written in the datasheet and confirmed by the IP designer, the TXCOMP bit is
> set in the Status Register when both the Transmit Holding Register (THR) and
> its internal shifter are empty and the STOP condition has been sent.
> So when a first transfer successfully completes, the TXCOMP bit is set. Then
> this bit remains set until the next write into THR.
> 
> The first issue is the race condition:
> at91_twi_write_data_dma(dev);
> at91_twi_write(dev, AT91_TWI_IER, AT91_TWI_TXCOMP);
> 
> The first line prepares a DMA transfer but when we execute the second line to
> enable the TXCOMP interrupt, we have no mean to know whether the DMA has
> already performed a first write access into THR, which also clears the TXCOMP
> bit in the Status Register. If the DMA controller hasn't completed this first
> write yet, the TXCOMP bit is still set in the Status Register. Hence the
> interrupt handler is executed immediately after the TXCOMP interrupt has been
> enabled. If the interrupt handler reads the Status Register before the DMA
> controller has written into the THR, the TXCOMP bit is still set. Consequently,
> the interrupt handler calls complete(&dev->cmd_complete) thinking the transfer
> has completed though it actually has not even started.
> 
> 
> The second issue is about the detection of NACK condition when using the DMA
> controller. Before the patch, the driver relied on the TXCOMP interrupt to
> detect NACK condition. It is true that the TXCOMP bit is set in the Status
> Register when a NACK condition occurs. However if the I2C controller has
> already triggered the DMA controller before it detects a NACK condition and
> sets the TXCOMP bit, the DMA controller writes into the THR right after, hence
> clears the TXCOMP bit in the Status Register. when the interrupt handler is
> executed, it reads the Status Register but fails to detect the NACK condition
> since the TXCOMP bit has been cleared: The driver misses the NACK condition.
> This is why we should rely on the NACK interrupt instead. the NACK bit is
> cleared on read in the Status Register, the NACK condition is properly
> detected.
> 
> So instead of reverting the patch, maybe you could try to add the single line
> which used to enable the TXCOMP interrupt after having scheduled the TX DMA
> transfer:
> 
>         } else {
>                 if (dev->use_dma && (dev->buf_len > AT91_I2C_DMA_THRESHOLD)) {
>                         at91_twi_write(dev, AT91_TWI_IER, AT91_TWI_NACK);
>                         at91_twi_write_data_dma(dev);
> +                       at91_twi_write(dev, AT91_TWI_IER, AT91_TWI_TXCOMP);
>                 } else {
>                         at91_twi_write_next_byte(dev);
>                         at91_twi_write(dev, AT91_TWI_IER,
> 
> I don't know whether this would "fix" your issue. Anyway if it does, this is
> not a proper fix but it may help us to understand what is going on.

Nope, no change, and the bus also looks like in the other message I sent in
response to Ludovic, with some ~20ms long ACKs after the eeprom address which
is then repeated.

> On my side, I will try to reproduce your issue on a sama5d3x board.

Cheers,
Peter


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Regression: at24 eeprom writing
  2015-10-05  6:16   ` Christian Gmeiner
@ 2015-10-06 16:41     ` Peter Rosin
  0 siblings, 0 replies; 16+ messages in thread
From: Peter Rosin @ 2015-10-06 16:41 UTC (permalink / raw)
  To: Christian Gmeiner
  Cc: Linux Kernel Mailing List, Wolfram Sang, linux-arm-kernel,
	Ludovic Desroches, Cyrille Pitchen

On 2015-10-05 08:16, Christian Gmeiner wrote:
> Hi Peter.
> 
> Sorry for the late answer - I am currently on my way to Dublin. Maybe
> it helps if you enable
> I2C_DEBUG_CORE and I2C_DEBUG_BUS. In theory you should see a little
> bit better what
> happens on the bus.

Good suggestion. I did a new run (vanilla 4.2), and this is what I saw on
the bus this time:

[0.00000] S W50 0x00 "product = 1-776-" P
[0.00184] S W50 NACK P
[0.00195] S W50 NACK P
[0.02027] S W50 0x10 ACK...
[0.04035] ...ACK 0x10 "3.0\n" P
[0.04100] S W50 NACK P
[0.04111] S W50 NACK P
[0.06032] S W50 "serial = 380" P
[0.06174] S W50 NACK P
[0.06185] S W50 NACK P
[0.08032] S W50 0x20 ACK...
[0.10034] ...ACK 0x20 "000002\n" P
[0.10127] S W50 NACK P
[0.10138] S W50 NACK P
[0.12039] S W50 0x27 "         " P
[0.12155] S W50 NACK P
[0.12166] S W50 NACK P
[0.14040] S W50 0x30 "                " P
[0.14219] S W50 NACK P
[0.14230] S W50 NACK P
[0.16042] S W50 0x40 "                " P
...

And this is the corresponding part of the log from the same run
(offset ~197.79 s):

[  197.790000] i2c i2c-0: master_xfer[0] W, addr=0x50, len=17
[  197.790000] i2c i2c-0: at91_xfer: processing 1 messages:
[  197.790000] at91_i2c f0014000.i2c: transfer: write 17 bytes.
[  197.800000] at91_i2c f0014000.i2c: transfer complete
[  197.800000] i2c i2c-0: master_xfer[0] W, addr=0x50, len=5
[  197.800000] i2c i2c-0: at91_xfer: processing 1 messages:
[  197.800000] at91_i2c f0014000.i2c: transfer: write 5 bytes.
[  197.800000] at91_i2c f0014000.i2c: wrote 0x10, to go 4
[  197.800000] at91_i2c f0014000.i2c: wrote 0x33, to go 3
[  197.800000] at91_i2c f0014000.i2c: received nack
[  197.820000] i2c i2c-0: master_xfer[0] W, addr=0x50, len=5
[  197.820000] i2c i2c-0: at91_xfer: processing 1 messages:
[  197.820000] at91_i2c f0014000.i2c: transfer: write 5 bytes.
[  197.820000] at91_i2c f0014000.i2c: wrote 0x10, to go 4
[  197.820000] at91_i2c f0014000.i2c: received nack
[  197.840000] i2c i2c-0: master_xfer[0] W, addr=0x50, len=5
[  197.840000] i2c i2c-0: at91_xfer: processing 1 messages:
[  197.840000] at91_i2c f0014000.i2c: transfer: write 5 bytes.
[  197.840000] at91_i2c f0014000.i2c: wrote 0x10, to go 4
[  197.840000] at91_i2c f0014000.i2c: wrote 0x33, to go 3
[  197.840000] at91_i2c f0014000.i2c: wrote 0x2e, to go 2
[  197.840000] at91_i2c f0014000.i2c: wrote 0x30, to go 1
[  197.840000] at91_i2c f0014000.i2c: wrote 0xa, to go 0
[  197.840000] at91_i2c f0014000.i2c: transfer complete
[  197.840000] i2c i2c-0: master_xfer[0] W, addr=0x50, len=13
[  197.840000] i2c i2c-0: at91_xfer: processing 1 messages:
[  197.840000] at91_i2c f0014000.i2c: transfer: write 13 bytes.
[  197.840000] at91_i2c f0014000.i2c: received nack
[  197.860000] i2c i2c-0: master_xfer[0] W, addr=0x50, len=13
[  197.860000] i2c i2c-0: at91_xfer: processing 1 messages:
[  197.860000] at91_i2c f0014000.i2c: transfer: write 13 bytes.
[  197.860000] at91_i2c f0014000.i2c: transfer complete
[  197.860000] i2c i2c-0: master_xfer[0] W, addr=0x50, len=8
[  197.860000] i2c i2c-0: at91_xfer: processing 1 messages:
[  197.860000] at91_i2c f0014000.i2c: transfer: write 8 bytes.
[  197.860000] at91_i2c f0014000.i2c: wrote 0x20, to go 7
[  197.860000] at91_i2c f0014000.i2c: wrote 0x30, to go 6
[  197.860000] at91_i2c f0014000.i2c: received nack
[  197.880000] i2c i2c-0: master_xfer[0] W, addr=0x50, len=8
[  197.880000] i2c i2c-0: at91_xfer: processing 1 messages:
[  197.880000] at91_i2c f0014000.i2c: transfer: write 8 bytes.
[  197.880000] at91_i2c f0014000.i2c: wrote 0x20, to go 7
[  197.880000] at91_i2c f0014000.i2c: received nack
[  197.900000] i2c i2c-0: master_xfer[0] W, addr=0x50, len=8
[  197.900000] i2c i2c-0: at91_xfer: processing 1 messages:
[  197.900000] at91_i2c f0014000.i2c: transfer: write 8 bytes.
[  197.900000] at91_i2c f0014000.i2c: wrote 0x20, to go 7
[  197.900000] at91_i2c f0014000.i2c: wrote 0x30, to go 6
[  197.900000] at91_i2c f0014000.i2c: wrote 0x30, to go 5
[  197.900000] at91_i2c f0014000.i2c: wrote 0x30, to go 4
[  197.900000] at91_i2c f0014000.i2c: wrote 0x30, to go 3
[  197.900000] at91_i2c f0014000.i2c: wrote 0x30, to go 2
[  197.900000] at91_i2c f0014000.i2c: wrote 0x32, to go 1
[  197.900000] at91_i2c f0014000.i2c: wrote 0xa, to go 0
[  197.900000] at91_i2c f0014000.i2c: transfer complete
[  197.900000] i2c i2c-0: master_xfer[0] W, addr=0x50, len=10
[  197.900000] i2c i2c-0: at91_xfer: processing 1 messages:
[  197.900000] at91_i2c f0014000.i2c: transfer: write 10 bytes.
[  197.900000] at91_i2c f0014000.i2c: received nack
[  197.920000] i2c i2c-0: master_xfer[0] W, addr=0x50, len=10
[  197.920000] i2c i2c-0: at91_xfer: processing 1 messages:
[  197.920000] at91_i2c f0014000.i2c: transfer: write 10 bytes.
[  197.920000] at91_i2c f0014000.i2c: transfer complete
[  197.920000] i2c i2c-0: master_xfer[0] W, addr=0x50, len=17
[  197.920000] i2c i2c-0: at91_xfer: processing 1 messages:
[  197.920000] at91_i2c f0014000.i2c: transfer: write 17 bytes.
[  197.920000] at91_i2c f0014000.i2c: received nack
[  197.940000] i2c i2c-0: master_xfer[0] W, addr=0x50, len=17
[  197.940000] i2c i2c-0: at91_xfer: processing 1 messages:
[  197.940000] at91_i2c f0014000.i2c: transfer: write 17 bytes.
[  197.940000] at91_i2c f0014000.i2c: transfer complete
[  197.940000] i2c i2c-0: master_xfer[0] W, addr=0x50, len=17
[  197.940000] i2c i2c-0: at91_xfer: processing 1 messages:
[  197.940000] at91_i2c f0014000.i2c: transfer: write 17 bytes.
[  197.940000] at91_i2c f0014000.i2c: received nack
[  197.960000] i2c i2c-0: master_xfer[0] W, addr=0x50, len=17
[  197.960000] i2c i2c-0: at91_xfer: processing 1 messages:
[  197.960000] at91_i2c f0014000.i2c: transfer: write 17 bytes.
[  197.960000] at91_i2c f0014000.i2c: transfer complete
...

Observations/questions:

- The number of NACKs seen on the bus does not match the number
  of "received nack" messages, every double NACK on the bus only
  has a single "received nack" message AFAICT.
- The bus driver seems to think that the 20ms ACKs are NACKs.
  Could it be that the eeprom is engaging in clock stretching
  or something such?
- After a 20ms ACK that the bus driver thinks is a NACK, the last
  command is reissued, but that seems to result in the tail being
  added to the ongoing command that the bus driver thinks have
  already terminated with a NACK. Or something?
- It seems to work better when I write spaces.

Cheers,
Peter


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Regression: at24 eeprom writing
  2015-10-05 15:09     ` Peter Rosin
@ 2015-10-12 15:13       ` Peter Rosin
  2015-10-12 16:13         ` Cyrille Pitchen
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Rosin @ 2015-10-12 15:13 UTC (permalink / raw)
  To: Linux Kernel Mailing List, Wolfram Sang, Christian Gmainer,
	linux-arm-kernel, Cyrille Pitchen

On 2015-10-05 17:09, Peter Rosin wrote:
> But what trouble does the i2c bus driver see? Admittedly I only
> have a simple logic level bus viewer, and not a full-blown
> oscilloscope, so there might be something analogue going on?
> I don't think so though, those signals looked fine last time we
> looked (but we obviously didn't have these issues then, and
> didn't really look that closely). I'll see if I can recheck
> with a real scope too.

We redid the tests with a real scope, and the signal looks nice
and square, so it is not that.

Speculating further on the cause of the long ACKs, I think that
the i2c driver gets confused by an interrupt that marks the
transfer complete, and thinks it's a NACK interrupt instead. Is that
plausible?

If the peripheral unit is such that it generates a stop automatically
on NACKs, then this makes perfect sense. I.e. the TWI sees that the
transfer is complete, generates an interrupt, and waits for further
data or a stop command. Meanwhile the driver thinks it's a NACK and
that a stop condition has already been sent to the bus, and just
notifies the i2c consumer (the eeprom driver in this case) of the
failure and frees up the bus for any future user.

This also matches what I see when I turn on some more traffic on the
bus, that is interleaved with the eeprom traffic. AFAICT, it can be
any command that gets chewed up by the eeprom if it is sent to the
i2c driver during the long ACK.

Are you Atmel people making any progress on this data corrupting
regression? Is there anything else I can do to help?

Cheers,
Peter

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Regression: at24 eeprom writing
  2015-10-12 15:13       ` Peter Rosin
@ 2015-10-12 16:13         ` Cyrille Pitchen
  2015-10-13 10:38           ` Peter Rosin
  0 siblings, 1 reply; 16+ messages in thread
From: Cyrille Pitchen @ 2015-10-12 16:13 UTC (permalink / raw)
  To: Peter Rosin, Linux Kernel Mailing List, Wolfram Sang,
	Christian Gmainer, linux-arm-kernel, ludovic.desroches

Le 12/10/2015 17:13, Peter Rosin a écrit :
> On 2015-10-05 17:09, Peter Rosin wrote:
>> But what trouble does the i2c bus driver see? Admittedly I only
>> have a simple logic level bus viewer, and not a full-blown
>> oscilloscope, so there might be something analogue going on?
>> I don't think so though, those signals looked fine last time we
>> looked (but we obviously didn't have these issues then, and
>> didn't really look that closely). I'll see if I can recheck
>> with a real scope too.
> 
> We redid the tests with a real scope, and the signal looks nice
> and square, so it is not that.
> 
> Speculating further on the cause of the long ACKs, I think that
> the i2c driver gets confused by an interrupt that marks the
> transfer complete, and thinks it's a NACK interrupt instead. Is that
> plausible?
> 
> If the peripheral unit is such that it generates a stop automatically
> on NACKs, then this makes perfect sense. I.e. the TWI sees that the
> transfer is complete, generates an interrupt, and waits for further
> data or a stop command. Meanwhile the driver thinks it's a NACK and
> that a stop condition has already been sent to the bus, and just
> notifies the i2c consumer (the eeprom driver in this case) of the
> failure and frees up the bus for any future user.
> 
> This also matches what I see when I turn on some more traffic on the
> bus, that is interleaved with the eeprom traffic. AFAICT, it can be
> any command that gets chewed up by the eeprom if it is sent to the
> i2c driver during the long ACK.
> 
> Are you Atmel people making any progress on this data corrupting
> regression? Is there anything else I can do to help?
> 
> Cheers,
> Peter
> 

Hi Peter,

I have sent a patch to Ludovic for a first internal review before publishing to
mainline. The patch should fix your issue since it fixes it on my sama5d36ek
board with an at24 eeprom.

More details on the reason of this bug would be provided in both the commit
message and comments in the code provided by the reviewed patch but I you want
an early fix just read the Status Register (AT91_TWI_SR) at the beginning of
at91_do_twi_transfer(). This read clears the NACK bit in the Status Register.
Then the following source code can safely enable the NACK interrupt, otherwise
in some cases a pending NACK interrupt would rise immediately after the line:
at91_twi_write(dev, AT91_TWI_IER, AT91_TWI_NACK);
hence breaking the sequence of operations to be done because the interrupt
handler would call complete() too early so wait_for_completion_timeout()
also exits too early.

So reading the Status Register at the beginning of at91_do_twi_transfer()
should be enough to fix the issue.

Another mistake is in the interrupt handler itself, ie atmel_twi_interrupt():
we should check the TWI_TXRDY status bit before calling
at91_twi_write_next_byte() only if both the TWI_TXCOMP and TWI_NACK status bits
are clear. Otherwise, writing a new byte into the THR tells the I2C controller
to start a new transfer. Then the I2C slave, the at24 eeprom, is likely to
also reply by a second NACK. Hence the NACK bit is already set into the Status
Register on the next call of at91_do_twi_transfer().
This is what I saw on my scope for PIO transfers.

By the way, in my case, the first NACK occurs because the at24 driver tries to
perform a second write transfer too quickly and the eeprom is not ready yet,
then replies with a NACK.
However while debugging, placing debug traces in the I2C controller driver
changed the timing: this small delay was enough to make the first NACK
disappear because the eeprom was now ready to process the second write
operation.

For DMA transfer, you have to know that when NACK error is detected, the I2C
controller sets the TWI_TXCOMP, TWI_NACK and TWI_TXRDY bits all together in
the Status Register:
- The TWI_NACK is cleared by reading the Status Register.
- The TWI_TXRDY triggers the DMA controller to write the next byte in the THR,
  which should also tell the I2C controller to start a new transfer.
- The TWI_TXCOMP bit is cleared by writing into the THR: this is why the
  interrupt handler may fail to detect NACK if it relies on TWI_TXCOMP.

Best Regards,

Cyrille

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Regression: at24 eeprom writing
  2015-10-12 16:13         ` Cyrille Pitchen
@ 2015-10-13 10:38           ` Peter Rosin
  2015-10-13 12:57             ` Nicolas Ferre
  2015-10-13 13:26             ` ludovic.desroches
  0 siblings, 2 replies; 16+ messages in thread
From: Peter Rosin @ 2015-10-13 10:38 UTC (permalink / raw)
  To: Cyrille Pitchen, Linux Kernel Mailing List, Wolfram Sang,
	Christian Gmainer, linux-arm-kernel, ludovic.desroches

On 2015-10-12 18:13, Cyrille Pitchen wrote:
> Le 12/10/2015 17:13, Peter Rosin a écrit :
>> On 2015-10-05 17:09, Peter Rosin wrote:
>>> But what trouble does the i2c bus driver see? Admittedly I only
>>> have a simple logic level bus viewer, and not a full-blown
>>> oscilloscope, so there might be something analogue going on?
>>> I don't think so though, those signals looked fine last time we
>>> looked (but we obviously didn't have these issues then, and
>>> didn't really look that closely). I'll see if I can recheck
>>> with a real scope too.
>>
>> We redid the tests with a real scope, and the signal looks nice
>> and square, so it is not that.
>>
>> Speculating further on the cause of the long ACKs, I think that
>> the i2c driver gets confused by an interrupt that marks the
>> transfer complete, and thinks it's a NACK interrupt instead. Is that
>> plausible?
>>
>> If the peripheral unit is such that it generates a stop automatically
>> on NACKs, then this makes perfect sense. I.e. the TWI sees that the
>> transfer is complete, generates an interrupt, and waits for further
>> data or a stop command. Meanwhile the driver thinks it's a NACK and
>> that a stop condition has already been sent to the bus, and just
>> notifies the i2c consumer (the eeprom driver in this case) of the
>> failure and frees up the bus for any future user.
>>
>> This also matches what I see when I turn on some more traffic on the
>> bus, that is interleaved with the eeprom traffic. AFAICT, it can be
>> any command that gets chewed up by the eeprom if it is sent to the
>> i2c driver during the long ACK.
>>
>> Are you Atmel people making any progress on this data corrupting
>> regression? Is there anything else I can do to help?
>>
>> Cheers,
>> Peter
>>
> 
> Hi Peter,
> 
> I have sent a patch to Ludovic for a first internal review before publishing to
> mainline. The patch should fix your issue since it fixes it on my sama5d36ek
> board with an at24 eeprom.
> 
> More details on the reason of this bug would be provided in both the commit
> message and comments in the code provided by the reviewed patch but I you want
> an early fix just read the Status Register (AT91_TWI_SR) at the beginning of
> at91_do_twi_transfer(). This read clears the NACK bit in the Status Register.
> Then the following source code can safely enable the NACK interrupt, otherwise
> in some cases a pending NACK interrupt would rise immediately after the line:
> at91_twi_write(dev, AT91_TWI_IER, AT91_TWI_NACK);
> hence breaking the sequence of operations to be done because the interrupt
> handler would call complete() too early so wait_for_completion_timeout()
> also exits too early.
> 
> So reading the Status Register at the beginning of at91_do_twi_transfer()
> should be enough to fix the issue.

Yes, I see no more long ACKs after that reading the Status Register there.
Great!

> Another mistake is in the interrupt handler itself, ie atmel_twi_interrupt():
> we should check the TWI_TXRDY status bit before calling
> at91_twi_write_next_byte() only if both the TWI_TXCOMP and TWI_NACK status bits
> are clear. Otherwise, writing a new byte into the THR tells the I2C controller
> to start a new transfer. Then the I2C slave, the at24 eeprom, is likely to
> also reply by a second NACK. Hence the NACK bit is already set into the Status
> Register on the next call of at91_do_twi_transfer().
> This is what I saw on my scope for PIO transfers.

I interpret this as a proposed solution for the strange double NACKs?

Anyway, I find it unnecessarily hard to grasp exactly what you mean
(wasteful policy you are apparently suffering from where it is OK to
publish a patch written in English, but apparently a big no-no to
send a diff until it passes some internal review???). I interpreted
your "patch" in English as:

 		at91_twi_read_next_byte(dev);
-	else if (irqstatus & AT_TWI_TXRDY)
+	else if ((irqstatus & (AT91_TWI_TXCOMP | AT91_TWI_TXRDY | AT91_TWI_NACK)) == AT91_TWI_TXRDY)
 		at91_twi_write_next_byte(dev);

But still see some double NACKs. Not always though, and it doesn't wreak
any havoc. But it still looks strange and I can't explain them when looking
at what the eeprom driver requests. Does this mean that there are more
races present?

Or, did I just parse your English "patch" badly?

> By the way, in my case, the first NACK occurs because the at24 driver tries to
> perform a second write transfer too quickly and the eeprom is not ready yet,
> then replies with a NACK.

Yes, I believe this is by design. Noone wants to encode the exact delays
needed since they are hard to predict. So, the eeprom driver polls.

> However while debugging, placing debug traces in the I2C controller driver
> changed the timing: this small delay was enough to make the first NACK
> disappear because the eeprom was now ready to process the second write
> operation.

Yes, w/o NACKs everything is going smooth regardless of kernel version.

> For DMA transfer, you have to know that when NACK error is detected, the I2C
> controller sets the TWI_TXCOMP, TWI_NACK and TWI_TXRDY bits all together in
> the Status Register:
> - The TWI_NACK is cleared by reading the Status Register.
> - The TWI_TXRDY triggers the DMA controller to write the next byte in the THR,
>   which should also tell the I2C controller to start a new transfer.
> - The TWI_TXCOMP bit is cleared by writing into the THR: this is why the
>   interrupt handler may fail to detect NACK if it relies on TWI_TXCOMP.
> 
> Best Regards,
> 
> Cyrille
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Regression: at24 eeprom writing
  2015-10-13 10:38           ` Peter Rosin
@ 2015-10-13 12:57             ` Nicolas Ferre
  2015-10-13 14:35               ` Peter Rosin
  2015-10-13 13:26             ` ludovic.desroches
  1 sibling, 1 reply; 16+ messages in thread
From: Nicolas Ferre @ 2015-10-13 12:57 UTC (permalink / raw)
  To: Peter Rosin, Cyrille Pitchen, Linux Kernel Mailing List,
	Wolfram Sang, Christian Gmainer, linux-arm-kernel,
	ludovic.desroches

Le 13/10/2015 12:38, Peter Rosin a écrit :
> On 2015-10-12 18:13, Cyrille Pitchen wrote:
>> Le 12/10/2015 17:13, Peter Rosin a écrit :
>>> On 2015-10-05 17:09, Peter Rosin wrote:
>>>> But what trouble does the i2c bus driver see? Admittedly I only
>>>> have a simple logic level bus viewer, and not a full-blown
>>>> oscilloscope, so there might be something analogue going on?
>>>> I don't think so though, those signals looked fine last time we
>>>> looked (but we obviously didn't have these issues then, and
>>>> didn't really look that closely). I'll see if I can recheck
>>>> with a real scope too.
>>>
>>> We redid the tests with a real scope, and the signal looks nice
>>> and square, so it is not that.
>>>
>>> Speculating further on the cause of the long ACKs, I think that
>>> the i2c driver gets confused by an interrupt that marks the
>>> transfer complete, and thinks it's a NACK interrupt instead. Is that
>>> plausible?
>>>
>>> If the peripheral unit is such that it generates a stop automatically
>>> on NACKs, then this makes perfect sense. I.e. the TWI sees that the
>>> transfer is complete, generates an interrupt, and waits for further
>>> data or a stop command. Meanwhile the driver thinks it's a NACK and
>>> that a stop condition has already been sent to the bus, and just
>>> notifies the i2c consumer (the eeprom driver in this case) of the
>>> failure and frees up the bus for any future user.
>>>
>>> This also matches what I see when I turn on some more traffic on the
>>> bus, that is interleaved with the eeprom traffic. AFAICT, it can be
>>> any command that gets chewed up by the eeprom if it is sent to the
>>> i2c driver during the long ACK.
>>>
>>> Are you Atmel people making any progress on this data corrupting
>>> regression? Is there anything else I can do to help?
>>>
>>> Cheers,
>>> Peter
>>>
>>
>> Hi Peter,
>>
>> I have sent a patch to Ludovic for a first internal review before publishing to
>> mainline. The patch should fix your issue since it fixes it on my sama5d36ek
>> board with an at24 eeprom.
>>
>> More details on the reason of this bug would be provided in both the commit
>> message and comments in the code provided by the reviewed patch but I you want
>> an early fix just read the Status Register (AT91_TWI_SR) at the beginning of
>> at91_do_twi_transfer(). This read clears the NACK bit in the Status Register.
>> Then the following source code can safely enable the NACK interrupt, otherwise
>> in some cases a pending NACK interrupt would rise immediately after the line:
>> at91_twi_write(dev, AT91_TWI_IER, AT91_TWI_NACK);
>> hence breaking the sequence of operations to be done because the interrupt
>> handler would call complete() too early so wait_for_completion_timeout()
>> also exits too early.
>>
>> So reading the Status Register at the beginning of at91_do_twi_transfer()
>> should be enough to fix the issue.
> 
> Yes, I see no more long ACKs after that reading the Status Register there.
> Great!
> 
>> Another mistake is in the interrupt handler itself, ie atmel_twi_interrupt():
>> we should check the TWI_TXRDY status bit before calling
>> at91_twi_write_next_byte() only if both the TWI_TXCOMP and TWI_NACK status bits
>> are clear. Otherwise, writing a new byte into the THR tells the I2C controller
>> to start a new transfer. Then the I2C slave, the at24 eeprom, is likely to
>> also reply by a second NACK. Hence the NACK bit is already set into the Status
>> Register on the next call of at91_do_twi_transfer().
>> This is what I saw on my scope for PIO transfers.
> 
> I interpret this as a proposed solution for the strange double NACKs?
> 
> Anyway, I find it unnecessarily hard to grasp exactly what you mean
> (wasteful policy you are apparently suffering from where it is OK to
> publish a patch written in English, but apparently a big no-no to
> send a diff until it passes some internal review???). I interpreted

I find your remark pretty rude as I'm reading it while just at the desk
behind me Cyrille and Ludovic are together trying hard to understand and
fix this issue.
They are making sure that this fix doesn't interfere with another SoC's
IP version with any of the PIO/DMA/FIFO transfer types.

Cyrille just wanted to keep you informed quickly as we were
progressing... Anyone can tell you that, obviously, there is no stupid
policy of not releasing patches @ Atmel!

Anyway, let's keep the good debugging session progressing well with your
valuable help...

Bye.


> your "patch" in English as:
> 
>  		at91_twi_read_next_byte(dev);
> -	else if (irqstatus & AT_TWI_TXRDY)
> +	else if ((irqstatus & (AT91_TWI_TXCOMP | AT91_TWI_TXRDY | AT91_TWI_NACK)) == AT91_TWI_TXRDY)
>  		at91_twi_write_next_byte(dev);
> 
> But still see some double NACKs. Not always though, and it doesn't wreak
> any havoc. But it still looks strange and I can't explain them when looking
> at what the eeprom driver requests. Does this mean that there are more
> races present?
> 
> Or, did I just parse your English "patch" badly?
> 
>> By the way, in my case, the first NACK occurs because the at24 driver tries to
>> perform a second write transfer too quickly and the eeprom is not ready yet,
>> then replies with a NACK.
> 
> Yes, I believe this is by design. Noone wants to encode the exact delays
> needed since they are hard to predict. So, the eeprom driver polls.
> 
>> However while debugging, placing debug traces in the I2C controller driver
>> changed the timing: this small delay was enough to make the first NACK
>> disappear because the eeprom was now ready to process the second write
>> operation.
> 
> Yes, w/o NACKs everything is going smooth regardless of kernel version.
> 
>> For DMA transfer, you have to know that when NACK error is detected, the I2C
>> controller sets the TWI_TXCOMP, TWI_NACK and TWI_TXRDY bits all together in
>> the Status Register:
>> - The TWI_NACK is cleared by reading the Status Register.
>> - The TWI_TXRDY triggers the DMA controller to write the next byte in the THR,
>>   which should also tell the I2C controller to start a new transfer.
>> - The TWI_TXCOMP bit is cleared by writing into the THR: this is why the
>>   interrupt handler may fail to detect NACK if it relies on TWI_TXCOMP.
>>
>> Best Regards,
>>
>> Cyrille
>>
> 
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
> 


-- 
Nicolas Ferre

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Regression: at24 eeprom writing
  2015-10-13 10:38           ` Peter Rosin
  2015-10-13 12:57             ` Nicolas Ferre
@ 2015-10-13 13:26             ` ludovic.desroches
  1 sibling, 0 replies; 16+ messages in thread
From: ludovic.desroches @ 2015-10-13 13:26 UTC (permalink / raw)
  To: Peter Rosin
  Cc: Cyrille Pitchen, Linux Kernel Mailing List, Wolfram Sang,
	Christian Gmainer, linux-arm-kernel, ludovic.desroches

Hi Peter,

On Tue, Oct 13, 2015 at 12:38:39PM +0200, Peter Rosin wrote:
> On 2015-10-12 18:13, Cyrille Pitchen wrote:
> > Le 12/10/2015 17:13, Peter Rosin a écrit :
> >> On 2015-10-05 17:09, Peter Rosin wrote:

[...]

> Anyway, I find it unnecessarily hard to grasp exactly what you mean
> (wasteful policy you are apparently suffering from where it is OK to
> publish a patch written in English, but apparently a big no-no to
> send a diff until it passes some internal review???). I interpreted
> your "patch" in English as:
> 
>  		at91_twi_read_next_byte(dev);
> -	else if (irqstatus & AT_TWI_TXRDY)
> +	else if ((irqstatus & (AT91_TWI_TXCOMP | AT91_TWI_TXRDY | AT91_TWI_NACK)) == AT91_TWI_TXRDY)
>  		at91_twi_write_next_byte(dev);

We don't have such kind of policy. I simply discussed about your bug
with Cyrille and he managed to fix it. Since there are many
combinations (IP version, PIO/dma, hdmac/xdmac), I wanted to test it on
other platforms that the one used by Cyrille and add my signed-off-by
before sending it.

If nothing goes wrong (one platform has not been tested yet), the patch will
be sent this afternoon.

Cyrille answered in English in order to not create confusion with a patch which
could not fit all combinations.

Regards

Ludovic

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Regression: at24 eeprom writing
  2015-10-13 12:57             ` Nicolas Ferre
@ 2015-10-13 14:35               ` Peter Rosin
  0 siblings, 0 replies; 16+ messages in thread
From: Peter Rosin @ 2015-10-13 14:35 UTC (permalink / raw)
  To: Nicolas Ferre, Cyrille Pitchen, Linux Kernel Mailing List,
	Wolfram Sang, Christian Gmainer, linux-arm-kernel,
	ludovic.desroches

On 2015-10-13 14:57, Nicolas Ferre wrote:
> Le 13/10/2015 12:38, Peter Rosin a écrit :
>> On 2015-10-12 18:13, Cyrille Pitchen wrote:
>>> Le 12/10/2015 17:13, Peter Rosin a écrit :
>>>> On 2015-10-05 17:09, Peter Rosin wrote:
>>>>> But what trouble does the i2c bus driver see? Admittedly I only
>>>>> have a simple logic level bus viewer, and not a full-blown
>>>>> oscilloscope, so there might be something analogue going on?
>>>>> I don't think so though, those signals looked fine last time we
>>>>> looked (but we obviously didn't have these issues then, and
>>>>> didn't really look that closely). I'll see if I can recheck
>>>>> with a real scope too.
>>>>
>>>> We redid the tests with a real scope, and the signal looks nice
>>>> and square, so it is not that.
>>>>
>>>> Speculating further on the cause of the long ACKs, I think that
>>>> the i2c driver gets confused by an interrupt that marks the
>>>> transfer complete, and thinks it's a NACK interrupt instead. Is that
>>>> plausible?
>>>>
>>>> If the peripheral unit is such that it generates a stop automatically
>>>> on NACKs, then this makes perfect sense. I.e. the TWI sees that the
>>>> transfer is complete, generates an interrupt, and waits for further
>>>> data or a stop command. Meanwhile the driver thinks it's a NACK and
>>>> that a stop condition has already been sent to the bus, and just
>>>> notifies the i2c consumer (the eeprom driver in this case) of the
>>>> failure and frees up the bus for any future user.
>>>>
>>>> This also matches what I see when I turn on some more traffic on the
>>>> bus, that is interleaved with the eeprom traffic. AFAICT, it can be
>>>> any command that gets chewed up by the eeprom if it is sent to the
>>>> i2c driver during the long ACK.
>>>>
>>>> Are you Atmel people making any progress on this data corrupting
>>>> regression? Is there anything else I can do to help?
>>>>
>>>> Cheers,
>>>> Peter
>>>>
>>>
>>> Hi Peter,
>>>
>>> I have sent a patch to Ludovic for a first internal review before publishing to
>>> mainline. The patch should fix your issue since it fixes it on my sama5d36ek
>>> board with an at24 eeprom.
>>>
>>> More details on the reason of this bug would be provided in both the commit
>>> message and comments in the code provided by the reviewed patch but I you want
>>> an early fix just read the Status Register (AT91_TWI_SR) at the beginning of
>>> at91_do_twi_transfer(). This read clears the NACK bit in the Status Register.
>>> Then the following source code can safely enable the NACK interrupt, otherwise
>>> in some cases a pending NACK interrupt would rise immediately after the line:
>>> at91_twi_write(dev, AT91_TWI_IER, AT91_TWI_NACK);
>>> hence breaking the sequence of operations to be done because the interrupt
>>> handler would call complete() too early so wait_for_completion_timeout()
>>> also exits too early.
>>>
>>> So reading the Status Register at the beginning of at91_do_twi_transfer()
>>> should be enough to fix the issue.
>>
>> Yes, I see no more long ACKs after that reading the Status Register there.
>> Great!
>>
>>> Another mistake is in the interrupt handler itself, ie atmel_twi_interrupt():
>>> we should check the TWI_TXRDY status bit before calling
>>> at91_twi_write_next_byte() only if both the TWI_TXCOMP and TWI_NACK status bits
>>> are clear. Otherwise, writing a new byte into the THR tells the I2C controller
>>> to start a new transfer. Then the I2C slave, the at24 eeprom, is likely to
>>> also reply by a second NACK. Hence the NACK bit is already set into the Status
>>> Register on the next call of at91_do_twi_transfer().
>>> This is what I saw on my scope for PIO transfers.
>>
>> I interpret this as a proposed solution for the strange double NACKs?
>>
>> Anyway, I find it unnecessarily hard to grasp exactly what you mean
>> (wasteful policy you are apparently suffering from where it is OK to
>> publish a patch written in English, but apparently a big no-no to
>> send a diff until it passes some internal review???). I interpreted
> 
> I find your remark pretty rude as I'm reading it while just at the desk
> behind me Cyrille and Ludovic are together trying hard to understand and
> fix this issue.
> They are making sure that this fix doesn't interfere with another SoC's
> IP version with any of the PIO/DMA/FIFO transfer types.
> 
> Cyrille just wanted to keep you informed quickly as we were
> progressing... Anyone can tell you that, obviously, there is no stupid
> policy of not releasing patches @ Atmel!
> 
> Anyway, let's keep the good debugging session progressing well with your
> valuable help...

Yeah, right, that didn't come out right. Sorry!

It was just that from over here I was pulling my hair out over this bug,
and then there appeared to have existed a patch that had been stuck in
internal review since some point last week. I realize now that this might
not have been the case and that the solution could have been found just
yesterday, but the impression I got was that it was not a fresh fix and
that made me a bit tired. So, sorry again.

I will now go test the patch instead. Knock wood.

Cheers,
Peter


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2015-10-13 14:35 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-02 23:05 Regression: at24 eeprom writing Peter Rosin
2015-10-04 19:50 ` Peter Rosin
2015-10-05  6:16   ` Christian Gmeiner
2015-10-06 16:41     ` Peter Rosin
2015-10-05  8:45 ` Peter Rosin
2015-10-05  8:59   ` kbuild test robot
2015-10-05 15:00   ` Ludovic Desroches
2015-10-05 15:09     ` Peter Rosin
2015-10-12 15:13       ` Peter Rosin
2015-10-12 16:13         ` Cyrille Pitchen
2015-10-13 10:38           ` Peter Rosin
2015-10-13 12:57             ` Nicolas Ferre
2015-10-13 14:35               ` Peter Rosin
2015-10-13 13:26             ` ludovic.desroches
2015-10-05 15:28   ` Cyrille Pitchen
2015-10-05 15:54     ` Peter Rosin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).