linux-xfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [sparc64] crc32c misbehave
       [not found]   ` <20170531.115335.213691069712156732.davem@davemloft.net>
@ 2017-05-31 16:19     ` Eric Sandeen
  2017-05-31 16:31       ` Eric Sandeen
  0 siblings, 1 reply; 10+ messages in thread
From: Eric Sandeen @ 2017-05-31 16:19 UTC (permalink / raw)
  To: David Miller, matorola; +Cc: sparclinux, linux-xfs

On 5/31/17 10:53 AM, David Miller wrote:
> From: Anatoly Pugachev <matorola@gmail.com>
> Date: Wed, 31 May 2017 14:56:52 +0300
> 
>> While debugging occasional crc32c checksum errors with xfs disk reads on
>> sparc64 (T5 [sun4v] 3.6 GHz CPU ldom, debian unstable/sid), Eric have found
>> that crc32c sometimes returns wrong checksum for data. Eric made a simple
>> test kernel module (included), which produce the following results on my
>> sparc64 machines:

cc: linux-xfs, because this problem cropped up on xfs/sparc.

-Eric

> I don't think that crc32c() is thread safe because of the way it is
> implemented with a shared TFM crypto object allocated once at boot
> time.
> 
> I think you are seeing the corruption any time an interrupt comes in
> on the same cpu as your test module is running on and does a crc32c()
> calculation, corrupting the context key value being used by your
> invocation.
> 
> At least that's my guess, I could have misread how the key is stored
> and managed around operations.
> 
> Can you try something like disabling cpu IRQs around the crc32c() function
> in lib/libcrc32c.c?  Something like:
> 
> 	u32 retval;
> 
> 	local_irq_disable();
> 
> 	shash->tfm = tfm;
> 	shash->flags = 0;
> 	*ctx = crc;
> 
> 	err = crypto_shash_update(shash, address, length);
> 	BUG_ON(err);
> 
> 	retval = *ctx;
> 
> 	local_irq_enable();
> 
> 	return retval;
> 
> Thanks.
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [sparc64] crc32c misbehave
  2017-05-31 16:19     ` [sparc64] crc32c misbehave Eric Sandeen
@ 2017-05-31 16:31       ` Eric Sandeen
  2017-05-31 16:49         ` David Miller
  0 siblings, 1 reply; 10+ messages in thread
From: Eric Sandeen @ 2017-05-31 16:31 UTC (permalink / raw)
  To: David Miller, matorola; +Cc: sparclinux, linux-xfs

On 5/31/17 11:19 AM, Eric Sandeen wrote:
> On 5/31/17 10:53 AM, David Miller wrote:
>> From: Anatoly Pugachev <matorola@gmail.com>
>> Date: Wed, 31 May 2017 14:56:52 +0300
>>
>>> While debugging occasional crc32c checksum errors with xfs disk reads on
>>> sparc64 (T5 [sun4v] 3.6 GHz CPU ldom, debian unstable/sid), Eric have found
>>> that crc32c sometimes returns wrong checksum for data. Eric made a simple
>>> test kernel module (included), which produce the following results on my
>>> sparc64 machines:
> 
> cc: linux-xfs, because this problem cropped up on xfs/sparc.

FWIW, the testcase (module which does

	crc = crc32c(CRC_SEED, data, 512);

1 million times in a loop on the same data, and printk's if
the result ever changes) does not fail on x86_64 or ARM
(well, not after a gcc bug was fixed on ARM ...)

-Eric

> -Eric
> 
>> I don't think that crc32c() is thread safe because of the way it is
>> implemented with a shared TFM crypto object allocated once at boot
>> time.
>>
>> I think you are seeing the corruption any time an interrupt comes in
>> on the same cpu as your test module is running on and does a crc32c()
>> calculation, corrupting the context key value being used by your
>> invocation.
>>
>> At least that's my guess, I could have misread how the key is stored
>> and managed around operations.
>>
>> Can you try something like disabling cpu IRQs around the crc32c() function
>> in lib/libcrc32c.c?  Something like:
>>
>> 	u32 retval;
>>
>> 	local_irq_disable();
>>
>> 	shash->tfm = tfm;
>> 	shash->flags = 0;
>> 	*ctx = crc;
>>
>> 	err = crypto_shash_update(shash, address, length);
>> 	BUG_ON(err);
>>
>> 	retval = *ctx;
>>
>> 	local_irq_enable();
>>
>> 	return retval;
>>
>> Thanks.
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [sparc64] crc32c misbehave
  2017-05-31 16:31       ` Eric Sandeen
@ 2017-05-31 16:49         ` David Miller
  2017-06-01 21:44           ` David Miller
  2017-06-06 19:05           ` David Miller
  0 siblings, 2 replies; 10+ messages in thread
From: David Miller @ 2017-05-31 16:49 UTC (permalink / raw)
  To: sandeen; +Cc: matorola, sparclinux, linux-xfs

From: Eric Sandeen <sandeen@sandeen.net>
Date: Wed, 31 May 2017 11:31:10 -0500

> On 5/31/17 11:19 AM, Eric Sandeen wrote:
>> On 5/31/17 10:53 AM, David Miller wrote:
>>> From: Anatoly Pugachev <matorola@gmail.com>
>>> Date: Wed, 31 May 2017 14:56:52 +0300
>>>
>>>> While debugging occasional crc32c checksum errors with xfs disk reads on
>>>> sparc64 (T5 [sun4v] 3.6 GHz CPU ldom, debian unstable/sid), Eric have found
>>>> that crc32c sometimes returns wrong checksum for data. Eric made a simple
>>>> test kernel module (included), which produce the following results on my
>>>> sparc64 machines:
>> 
>> cc: linux-xfs, because this problem cropped up on xfs/sparc.
> 
> FWIW, the testcase (module which does
> 
> 	crc = crc32c(CRC_SEED, data, 512);
> 
> 1 million times in a loop on the same data, and printk's if
> the result ever changes) does not fail on x86_64 or ARM
> (well, not after a gcc bug was fixed on ARM ...)

Is the machine doing things that would cause crc32c() operations in
interrupts (SCTP protocol traffic) or on other cpus?

That's the danger in comparing other machines, the context and what's
running on them is different.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [sparc64] crc32c misbehave
  2017-05-31 16:49         ` David Miller
@ 2017-06-01 21:44           ` David Miller
  2017-06-02  1:57             ` David Miller
  2017-06-06 19:05           ` David Miller
  1 sibling, 1 reply; 10+ messages in thread
From: David Miller @ 2017-06-01 21:44 UTC (permalink / raw)
  To: sandeen; +Cc: matorola, sparclinux, linux-xfs

From: David Miller <davem@davemloft.net>
Date: Wed, 31 May 2017 12:49:16 -0400 (EDT)

> From: Eric Sandeen <sandeen@sandeen.net>
> Date: Wed, 31 May 2017 11:31:10 -0500
> 
>> On 5/31/17 11:19 AM, Eric Sandeen wrote:
>>> On 5/31/17 10:53 AM, David Miller wrote:
>>>> From: Anatoly Pugachev <matorola@gmail.com>
>>>> Date: Wed, 31 May 2017 14:56:52 +0300
>>>>
>>>>> While debugging occasional crc32c checksum errors with xfs disk reads on
>>>>> sparc64 (T5 [sun4v] 3.6 GHz CPU ldom, debian unstable/sid), Eric have found
>>>>> that crc32c sometimes returns wrong checksum for data. Eric made a simple
>>>>> test kernel module (included), which produce the following results on my
>>>>> sparc64 machines:
>>> 
>>> cc: linux-xfs, because this problem cropped up on xfs/sparc.
>> 
>> FWIW, the testcase (module which does
>> 
>> 	crc = crc32c(CRC_SEED, data, 512);
>> 
>> 1 million times in a loop on the same data, and printk's if
>> the result ever changes) does not fail on x86_64 or ARM
>> (well, not after a gcc bug was fixed on ARM ...)
> 
> Is the machine doing things that would cause crc32c() operations in
> interrupts (SCTP protocol traffic) or on other cpus?
> 
> That's the danger in comparing other machines, the context and what's
> running on them is different.

Ok, I can reproduce this bug on my systems.  I'll see if I can figure out
what is going on.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [sparc64] crc32c misbehave
  2017-06-01 21:44           ` David Miller
@ 2017-06-02  1:57             ` David Miller
  2017-06-02  2:10               ` Eric Sandeen
  0 siblings, 1 reply; 10+ messages in thread
From: David Miller @ 2017-06-02  1:57 UTC (permalink / raw)
  To: sandeen; +Cc: matorola, sparclinux, linux-xfs

From: David Miller <davem@davemloft.net>
Date: Thu, 01 Jun 2017 17:44:19 -0400 (EDT)

> Ok, I can reproduce this bug on my systems.  I'll see if I can figure out
> what is going on.

So I've done several tests to try and narrow down the cause.

First, I implemented crc32c() inside of the test module, doing
exactly the same thing that lib/libcrc32c.c is doing.  So this
make it use a separate tfm.

This never fails.

Then, I implemented a separate module "davem_crc32c.ko" that is
identical to lib/libcrc32.c except it uses it's own 'tfm' and it
exports the symbol davem_crc32c() instead of crc32c(). And finally I
adjust the test case to call davem_crc32c() instead of crc32c().

This also never fails.

So it only fails if we use the lib/libcrc32.c shared with the rest of
the kernel.

I really can't figure out yet why this sharing can even matter.  The
per-computation state is all in the on-stack 'shash':

	SHASH_DESC_ON_STACK(shash, tfm);

So invocations of crc32c() should not be able to corrupt the state of
other parallel invocations.

I'll keep digging, but that is where I am right now.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [sparc64] crc32c misbehave
  2017-06-02  1:57             ` David Miller
@ 2017-06-02  2:10               ` Eric Sandeen
  2017-06-02  3:33                 ` David Miller
  0 siblings, 1 reply; 10+ messages in thread
From: Eric Sandeen @ 2017-06-02  2:10 UTC (permalink / raw)
  To: David Miller; +Cc: matorola, sparclinux, linux-xfs

On 6/1/17 8:57 PM, David Miller wrote:
> From: David Miller <davem@davemloft.net>
> Date: Thu, 01 Jun 2017 17:44:19 -0400 (EDT)
> 
>> Ok, I can reproduce this bug on my systems.  I'll see if I can figure out
>> what is going on.
> 
> So I've done several tests to try and narrow down the cause.
> 
> First, I implemented crc32c() inside of the test module, doing
> exactly the same thing that lib/libcrc32c.c is doing.  So this
> make it use a separate tfm.
> 
> This never fails.
> 
> Then, I implemented a separate module "davem_crc32c.ko" that is
> identical to lib/libcrc32.c except it uses it's own 'tfm' and it
> exports the symbol davem_crc32c() instead of crc32c(). And finally I
> adjust the test case to call davem_crc32c() instead of crc32c().
> 
> This also never fails.
> 
> So it only fails if we use the lib/libcrc32.c shared with the rest of
> the kernel.
> 
> I really can't figure out yet why this sharing can even matter.  The
> per-computation state is all in the on-stack 'shash':
> 
> 	SHASH_DESC_ON_STACK(shash, tfm);
> 
> So invocations of crc32c() should not be able to corrupt the state of
> other parallel invocations.
> 
> I'll keep digging, but that is where I am right now.

Thanks for digging.

On ARM, there was a gcc bug causing similar results - I /think/
it was https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63293

"programs could fail sporadically with this if an interrupt happens at
the wrong instant in time and data was written onto the current stack."

https://gcc.gnu.org/ml/gcc-patches/2014-09/msg02292.html

Maybe totally unrelated; if not, hope it helps.  :)

-Eric

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [sparc64] crc32c misbehave
  2017-06-02  2:10               ` Eric Sandeen
@ 2017-06-02  3:33                 ` David Miller
  2017-06-02  3:34                   ` Eric Sandeen
  0 siblings, 1 reply; 10+ messages in thread
From: David Miller @ 2017-06-02  3:33 UTC (permalink / raw)
  To: sandeen; +Cc: matorola, sparclinux, linux-xfs

From: Eric Sandeen <sandeen@sandeen.net>
Date: Thu, 1 Jun 2017 21:10:50 -0500

> On ARM, there was a gcc bug causing similar results - I /think/
> it was https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63293
> 
> "programs could fail sporadically with this if an interrupt happens at
> the wrong instant in time and data was written onto the current stack."
> 
> https://gcc.gnu.org/ml/gcc-patches/2014-09/msg02292.html
> 
> Maybe totally unrelated; if not, hope it helps.  :)

Wow, that looks exactly like what the bug is:

crc32c:
        .register       %g2, #scratch
        save    %sp, -176, %sp  !
        sethi   %hi(tfm), %g1   !, tmp121
        mov     %i2, %o2        ! length,
        ldx     [%g1+%lo(tfm)], %g2     ! tfm, tfm.0_4
        mov     %i1, %o1        ! address,
        lduw    [%g2], %g1      ! tfm.0_4->descsize, tfm.0_4->descsize
        add     %g1, 38, %g1    ! tfm.0_4->descsize,, tmp126
        srlx    %g1, 4, %g1     ! tmp126,, tmp127
        sllx    %g1, 4, %g1     ! tmp127,, tmp128
        sub     %sp, %g1, %sp   !, tmp128,
        add     %sp, 2230, %i5  !,, tmp130

Ok, %i5 holds the stack address of the shash context:

 ...
        return  %i7+8
         lduw   [%o5+16], %o0   ! MEM[(u32 *)__shash_desc.1_10 + 16B],

'return' deallocates the stack frame plus the register window, and at
the same time does a delayed control transfer to "%i7 + 8".  So in the
branch delay slot instruction %i5 becomes %o5.

And here we are accessing deallocated stack memory in the delay slot.

I'm using gcc-6.3.0 here.

And indeed the following patch makes the problem go away:

diff --git a/lib/libcrc32c.c b/lib/libcrc32c.c
index 74a54b7..bf831e2 100644
--- a/lib/libcrc32c.c
+++ b/lib/libcrc32c.c
@@ -43,7 +43,7 @@ static struct crypto_shash *tfm;
 u32 crc32c(u32 crc, const void *address, unsigned int length)
 {
 	SHASH_DESC_ON_STACK(shash, tfm);
-	u32 *ctx = (u32 *)shash_desc_ctx(shash);
+	u32 ret, *ctx = (u32 *)shash_desc_ctx(shash);
 	int err;
 
 	shash->tfm = tfm;
@@ -53,7 +53,9 @@ u32 crc32c(u32 crc, const void *address, unsigned int length)
 	err = crypto_shash_update(shash, address, length);
 	BUG_ON(err);
 
-	return *ctx;
+	ret = *ctx;
+	barrier();
+	return ret;
 }
 
 EXPORT_SYMBOL(crc32c);

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [sparc64] crc32c misbehave
  2017-06-02  3:33                 ` David Miller
@ 2017-06-02  3:34                   ` Eric Sandeen
  0 siblings, 0 replies; 10+ messages in thread
From: Eric Sandeen @ 2017-06-02  3:34 UTC (permalink / raw)
  To: David Miller; +Cc: matorola, sparclinux, linux-xfs



On 6/1/17 10:33 PM, David Miller wrote:
> From: Eric Sandeen <sandeen@sandeen.net>
> Date: Thu, 1 Jun 2017 21:10:50 -0500
> 
>> On ARM, there was a gcc bug causing similar results - I /think/
>> it was https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63293
>>
>> "programs could fail sporadically with this if an interrupt happens at
>> the wrong instant in time and data was written onto the current stack."
>>
>> https://gcc.gnu.org/ml/gcc-patches/2014-09/msg02292.html
>>
>> Maybe totally unrelated; if not, hope it helps.  :)
> 
> Wow, that looks exactly like what the bug is:

Sweet.

\o/

-Eric

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [sparc64] crc32c misbehave
  2017-05-31 16:49         ` David Miller
  2017-06-01 21:44           ` David Miller
@ 2017-06-06 19:05           ` David Miller
  2017-06-06 19:09             ` Eric Sandeen
  1 sibling, 1 reply; 10+ messages in thread
From: David Miller @ 2017-06-06 19:05 UTC (permalink / raw)
  To: sandeen; +Cc: matorola, sparclinux, linux-xfs


Just FYI, I've meanwhile pushed a fix for the GCC bug that created
this mess to all active branches.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [sparc64] crc32c misbehave
  2017-06-06 19:05           ` David Miller
@ 2017-06-06 19:09             ` Eric Sandeen
  0 siblings, 0 replies; 10+ messages in thread
From: Eric Sandeen @ 2017-06-06 19:09 UTC (permalink / raw)
  To: David Miller; +Cc: matorola, sparclinux, linux-xfs

On 6/6/17 2:05 PM, David Miller wrote:
> 
> Just FYI, I've meanwhile pushed a fix for the GCC bug that created
> this mess to all active branches.

David -

Awesome.  Thanks for digging into it.

-Eric

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2017-06-06 19:09 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CADxRZqzgaL4ew6PVek3WBsdwo6GcT0ORx=7h+6p0V3NAr8qF+w@mail.gmail.com>
     [not found] ` <CADxRZqzNTB1n0g1r_2rLsjifHpJ+_x1WPxNOSqEcHbi74jHi5w@mail.gmail.com>
     [not found]   ` <20170531.115335.213691069712156732.davem@davemloft.net>
2017-05-31 16:19     ` [sparc64] crc32c misbehave Eric Sandeen
2017-05-31 16:31       ` Eric Sandeen
2017-05-31 16:49         ` David Miller
2017-06-01 21:44           ` David Miller
2017-06-02  1:57             ` David Miller
2017-06-02  2:10               ` Eric Sandeen
2017-06-02  3:33                 ` David Miller
2017-06-02  3:34                   ` Eric Sandeen
2017-06-06 19:05           ` David Miller
2017-06-06 19:09             ` Eric Sandeen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).