* Re: [sparc64] crc32c misbehave [not found] ` <20170531.115335.213691069712156732.davem@davemloft.net> @ 2017-05-31 16:19 ` Eric Sandeen 2017-05-31 16:31 ` Eric Sandeen 0 siblings, 1 reply; 10+ messages in thread From: Eric Sandeen @ 2017-05-31 16:19 UTC (permalink / raw) To: David Miller, matorola; +Cc: sparclinux, linux-xfs On 5/31/17 10:53 AM, David Miller wrote: > From: Anatoly Pugachev <matorola@gmail.com> > Date: Wed, 31 May 2017 14:56:52 +0300 > >> While debugging occasional crc32c checksum errors with xfs disk reads on >> sparc64 (T5 [sun4v] 3.6 GHz CPU ldom, debian unstable/sid), Eric have found >> that crc32c sometimes returns wrong checksum for data. Eric made a simple >> test kernel module (included), which produce the following results on my >> sparc64 machines: cc: linux-xfs, because this problem cropped up on xfs/sparc. -Eric > I don't think that crc32c() is thread safe because of the way it is > implemented with a shared TFM crypto object allocated once at boot > time. > > I think you are seeing the corruption any time an interrupt comes in > on the same cpu as your test module is running on and does a crc32c() > calculation, corrupting the context key value being used by your > invocation. > > At least that's my guess, I could have misread how the key is stored > and managed around operations. > > Can you try something like disabling cpu IRQs around the crc32c() function > in lib/libcrc32c.c? Something like: > > u32 retval; > > local_irq_disable(); > > shash->tfm = tfm; > shash->flags = 0; > *ctx = crc; > > err = crypto_shash_update(shash, address, length); > BUG_ON(err); > > retval = *ctx; > > local_irq_enable(); > > return retval; > > Thanks. > ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [sparc64] crc32c misbehave 2017-05-31 16:19 ` [sparc64] crc32c misbehave Eric Sandeen @ 2017-05-31 16:31 ` Eric Sandeen 2017-05-31 16:49 ` David Miller 0 siblings, 1 reply; 10+ messages in thread From: Eric Sandeen @ 2017-05-31 16:31 UTC (permalink / raw) To: David Miller, matorola; +Cc: sparclinux, linux-xfs On 5/31/17 11:19 AM, Eric Sandeen wrote: > On 5/31/17 10:53 AM, David Miller wrote: >> From: Anatoly Pugachev <matorola@gmail.com> >> Date: Wed, 31 May 2017 14:56:52 +0300 >> >>> While debugging occasional crc32c checksum errors with xfs disk reads on >>> sparc64 (T5 [sun4v] 3.6 GHz CPU ldom, debian unstable/sid), Eric have found >>> that crc32c sometimes returns wrong checksum for data. Eric made a simple >>> test kernel module (included), which produce the following results on my >>> sparc64 machines: > > cc: linux-xfs, because this problem cropped up on xfs/sparc. FWIW, the testcase (module which does crc = crc32c(CRC_SEED, data, 512); 1 million times in a loop on the same data, and printk's if the result ever changes) does not fail on x86_64 or ARM (well, not after a gcc bug was fixed on ARM ...) -Eric > -Eric > >> I don't think that crc32c() is thread safe because of the way it is >> implemented with a shared TFM crypto object allocated once at boot >> time. >> >> I think you are seeing the corruption any time an interrupt comes in >> on the same cpu as your test module is running on and does a crc32c() >> calculation, corrupting the context key value being used by your >> invocation. >> >> At least that's my guess, I could have misread how the key is stored >> and managed around operations. >> >> Can you try something like disabling cpu IRQs around the crc32c() function >> in lib/libcrc32c.c? Something like: >> >> u32 retval; >> >> local_irq_disable(); >> >> shash->tfm = tfm; >> shash->flags = 0; >> *ctx = crc; >> >> err = crypto_shash_update(shash, address, length); >> BUG_ON(err); >> >> retval = *ctx; >> >> local_irq_enable(); >> >> return retval; >> >> Thanks. >> > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [sparc64] crc32c misbehave 2017-05-31 16:31 ` Eric Sandeen @ 2017-05-31 16:49 ` David Miller 2017-06-01 21:44 ` David Miller 2017-06-06 19:05 ` David Miller 0 siblings, 2 replies; 10+ messages in thread From: David Miller @ 2017-05-31 16:49 UTC (permalink / raw) To: sandeen; +Cc: matorola, sparclinux, linux-xfs From: Eric Sandeen <sandeen@sandeen.net> Date: Wed, 31 May 2017 11:31:10 -0500 > On 5/31/17 11:19 AM, Eric Sandeen wrote: >> On 5/31/17 10:53 AM, David Miller wrote: >>> From: Anatoly Pugachev <matorola@gmail.com> >>> Date: Wed, 31 May 2017 14:56:52 +0300 >>> >>>> While debugging occasional crc32c checksum errors with xfs disk reads on >>>> sparc64 (T5 [sun4v] 3.6 GHz CPU ldom, debian unstable/sid), Eric have found >>>> that crc32c sometimes returns wrong checksum for data. Eric made a simple >>>> test kernel module (included), which produce the following results on my >>>> sparc64 machines: >> >> cc: linux-xfs, because this problem cropped up on xfs/sparc. > > FWIW, the testcase (module which does > > crc = crc32c(CRC_SEED, data, 512); > > 1 million times in a loop on the same data, and printk's if > the result ever changes) does not fail on x86_64 or ARM > (well, not after a gcc bug was fixed on ARM ...) Is the machine doing things that would cause crc32c() operations in interrupts (SCTP protocol traffic) or on other cpus? That's the danger in comparing other machines, the context and what's running on them is different. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [sparc64] crc32c misbehave 2017-05-31 16:49 ` David Miller @ 2017-06-01 21:44 ` David Miller 2017-06-02 1:57 ` David Miller 2017-06-06 19:05 ` David Miller 1 sibling, 1 reply; 10+ messages in thread From: David Miller @ 2017-06-01 21:44 UTC (permalink / raw) To: sandeen; +Cc: matorola, sparclinux, linux-xfs From: David Miller <davem@davemloft.net> Date: Wed, 31 May 2017 12:49:16 -0400 (EDT) > From: Eric Sandeen <sandeen@sandeen.net> > Date: Wed, 31 May 2017 11:31:10 -0500 > >> On 5/31/17 11:19 AM, Eric Sandeen wrote: >>> On 5/31/17 10:53 AM, David Miller wrote: >>>> From: Anatoly Pugachev <matorola@gmail.com> >>>> Date: Wed, 31 May 2017 14:56:52 +0300 >>>> >>>>> While debugging occasional crc32c checksum errors with xfs disk reads on >>>>> sparc64 (T5 [sun4v] 3.6 GHz CPU ldom, debian unstable/sid), Eric have found >>>>> that crc32c sometimes returns wrong checksum for data. Eric made a simple >>>>> test kernel module (included), which produce the following results on my >>>>> sparc64 machines: >>> >>> cc: linux-xfs, because this problem cropped up on xfs/sparc. >> >> FWIW, the testcase (module which does >> >> crc = crc32c(CRC_SEED, data, 512); >> >> 1 million times in a loop on the same data, and printk's if >> the result ever changes) does not fail on x86_64 or ARM >> (well, not after a gcc bug was fixed on ARM ...) > > Is the machine doing things that would cause crc32c() operations in > interrupts (SCTP protocol traffic) or on other cpus? > > That's the danger in comparing other machines, the context and what's > running on them is different. Ok, I can reproduce this bug on my systems. I'll see if I can figure out what is going on. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [sparc64] crc32c misbehave 2017-06-01 21:44 ` David Miller @ 2017-06-02 1:57 ` David Miller 2017-06-02 2:10 ` Eric Sandeen 0 siblings, 1 reply; 10+ messages in thread From: David Miller @ 2017-06-02 1:57 UTC (permalink / raw) To: sandeen; +Cc: matorola, sparclinux, linux-xfs From: David Miller <davem@davemloft.net> Date: Thu, 01 Jun 2017 17:44:19 -0400 (EDT) > Ok, I can reproduce this bug on my systems. I'll see if I can figure out > what is going on. So I've done several tests to try and narrow down the cause. First, I implemented crc32c() inside of the test module, doing exactly the same thing that lib/libcrc32c.c is doing. So this make it use a separate tfm. This never fails. Then, I implemented a separate module "davem_crc32c.ko" that is identical to lib/libcrc32.c except it uses it's own 'tfm' and it exports the symbol davem_crc32c() instead of crc32c(). And finally I adjust the test case to call davem_crc32c() instead of crc32c(). This also never fails. So it only fails if we use the lib/libcrc32.c shared with the rest of the kernel. I really can't figure out yet why this sharing can even matter. The per-computation state is all in the on-stack 'shash': SHASH_DESC_ON_STACK(shash, tfm); So invocations of crc32c() should not be able to corrupt the state of other parallel invocations. I'll keep digging, but that is where I am right now. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [sparc64] crc32c misbehave 2017-06-02 1:57 ` David Miller @ 2017-06-02 2:10 ` Eric Sandeen 2017-06-02 3:33 ` David Miller 0 siblings, 1 reply; 10+ messages in thread From: Eric Sandeen @ 2017-06-02 2:10 UTC (permalink / raw) To: David Miller; +Cc: matorola, sparclinux, linux-xfs On 6/1/17 8:57 PM, David Miller wrote: > From: David Miller <davem@davemloft.net> > Date: Thu, 01 Jun 2017 17:44:19 -0400 (EDT) > >> Ok, I can reproduce this bug on my systems. I'll see if I can figure out >> what is going on. > > So I've done several tests to try and narrow down the cause. > > First, I implemented crc32c() inside of the test module, doing > exactly the same thing that lib/libcrc32c.c is doing. So this > make it use a separate tfm. > > This never fails. > > Then, I implemented a separate module "davem_crc32c.ko" that is > identical to lib/libcrc32.c except it uses it's own 'tfm' and it > exports the symbol davem_crc32c() instead of crc32c(). And finally I > adjust the test case to call davem_crc32c() instead of crc32c(). > > This also never fails. > > So it only fails if we use the lib/libcrc32.c shared with the rest of > the kernel. > > I really can't figure out yet why this sharing can even matter. The > per-computation state is all in the on-stack 'shash': > > SHASH_DESC_ON_STACK(shash, tfm); > > So invocations of crc32c() should not be able to corrupt the state of > other parallel invocations. > > I'll keep digging, but that is where I am right now. Thanks for digging. On ARM, there was a gcc bug causing similar results - I /think/ it was https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63293 "programs could fail sporadically with this if an interrupt happens at the wrong instant in time and data was written onto the current stack." https://gcc.gnu.org/ml/gcc-patches/2014-09/msg02292.html Maybe totally unrelated; if not, hope it helps. :) -Eric ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [sparc64] crc32c misbehave 2017-06-02 2:10 ` Eric Sandeen @ 2017-06-02 3:33 ` David Miller 2017-06-02 3:34 ` Eric Sandeen 0 siblings, 1 reply; 10+ messages in thread From: David Miller @ 2017-06-02 3:33 UTC (permalink / raw) To: sandeen; +Cc: matorola, sparclinux, linux-xfs From: Eric Sandeen <sandeen@sandeen.net> Date: Thu, 1 Jun 2017 21:10:50 -0500 > On ARM, there was a gcc bug causing similar results - I /think/ > it was https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63293 > > "programs could fail sporadically with this if an interrupt happens at > the wrong instant in time and data was written onto the current stack." > > https://gcc.gnu.org/ml/gcc-patches/2014-09/msg02292.html > > Maybe totally unrelated; if not, hope it helps. :) Wow, that looks exactly like what the bug is: crc32c: .register %g2, #scratch save %sp, -176, %sp ! sethi %hi(tfm), %g1 !, tmp121 mov %i2, %o2 ! length, ldx [%g1+%lo(tfm)], %g2 ! tfm, tfm.0_4 mov %i1, %o1 ! address, lduw [%g2], %g1 ! tfm.0_4->descsize, tfm.0_4->descsize add %g1, 38, %g1 ! tfm.0_4->descsize,, tmp126 srlx %g1, 4, %g1 ! tmp126,, tmp127 sllx %g1, 4, %g1 ! tmp127,, tmp128 sub %sp, %g1, %sp !, tmp128, add %sp, 2230, %i5 !,, tmp130 Ok, %i5 holds the stack address of the shash context: ... return %i7+8 lduw [%o5+16], %o0 ! MEM[(u32 *)__shash_desc.1_10 + 16B], 'return' deallocates the stack frame plus the register window, and at the same time does a delayed control transfer to "%i7 + 8". So in the branch delay slot instruction %i5 becomes %o5. And here we are accessing deallocated stack memory in the delay slot. I'm using gcc-6.3.0 here. And indeed the following patch makes the problem go away: diff --git a/lib/libcrc32c.c b/lib/libcrc32c.c index 74a54b7..bf831e2 100644 --- a/lib/libcrc32c.c +++ b/lib/libcrc32c.c @@ -43,7 +43,7 @@ static struct crypto_shash *tfm; u32 crc32c(u32 crc, const void *address, unsigned int length) { SHASH_DESC_ON_STACK(shash, tfm); - u32 *ctx = (u32 *)shash_desc_ctx(shash); + u32 ret, *ctx = (u32 *)shash_desc_ctx(shash); int err; shash->tfm = tfm; @@ -53,7 +53,9 @@ u32 crc32c(u32 crc, const void *address, unsigned int length) err = crypto_shash_update(shash, address, length); BUG_ON(err); - return *ctx; + ret = *ctx; + barrier(); + return ret; } EXPORT_SYMBOL(crc32c); ^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [sparc64] crc32c misbehave 2017-06-02 3:33 ` David Miller @ 2017-06-02 3:34 ` Eric Sandeen 0 siblings, 0 replies; 10+ messages in thread From: Eric Sandeen @ 2017-06-02 3:34 UTC (permalink / raw) To: David Miller; +Cc: matorola, sparclinux, linux-xfs On 6/1/17 10:33 PM, David Miller wrote: > From: Eric Sandeen <sandeen@sandeen.net> > Date: Thu, 1 Jun 2017 21:10:50 -0500 > >> On ARM, there was a gcc bug causing similar results - I /think/ >> it was https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63293 >> >> "programs could fail sporadically with this if an interrupt happens at >> the wrong instant in time and data was written onto the current stack." >> >> https://gcc.gnu.org/ml/gcc-patches/2014-09/msg02292.html >> >> Maybe totally unrelated; if not, hope it helps. :) > > Wow, that looks exactly like what the bug is: Sweet. \o/ -Eric ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [sparc64] crc32c misbehave 2017-05-31 16:49 ` David Miller 2017-06-01 21:44 ` David Miller @ 2017-06-06 19:05 ` David Miller 2017-06-06 19:09 ` Eric Sandeen 1 sibling, 1 reply; 10+ messages in thread From: David Miller @ 2017-06-06 19:05 UTC (permalink / raw) To: sandeen; +Cc: matorola, sparclinux, linux-xfs Just FYI, I've meanwhile pushed a fix for the GCC bug that created this mess to all active branches. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [sparc64] crc32c misbehave 2017-06-06 19:05 ` David Miller @ 2017-06-06 19:09 ` Eric Sandeen 0 siblings, 0 replies; 10+ messages in thread From: Eric Sandeen @ 2017-06-06 19:09 UTC (permalink / raw) To: David Miller; +Cc: matorola, sparclinux, linux-xfs On 6/6/17 2:05 PM, David Miller wrote: > > Just FYI, I've meanwhile pushed a fix for the GCC bug that created > this mess to all active branches. David - Awesome. Thanks for digging into it. -Eric ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2017-06-06 19:09 UTC | newest] Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- [not found] <CADxRZqzgaL4ew6PVek3WBsdwo6GcT0ORx=7h+6p0V3NAr8qF+w@mail.gmail.com> [not found] ` <CADxRZqzNTB1n0g1r_2rLsjifHpJ+_x1WPxNOSqEcHbi74jHi5w@mail.gmail.com> [not found] ` <20170531.115335.213691069712156732.davem@davemloft.net> 2017-05-31 16:19 ` [sparc64] crc32c misbehave Eric Sandeen 2017-05-31 16:31 ` Eric Sandeen 2017-05-31 16:49 ` David Miller 2017-06-01 21:44 ` David Miller 2017-06-02 1:57 ` David Miller 2017-06-02 2:10 ` Eric Sandeen 2017-06-02 3:33 ` David Miller 2017-06-02 3:34 ` Eric Sandeen 2017-06-06 19:05 ` David Miller 2017-06-06 19:09 ` Eric Sandeen
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).