From: Darren Hart <dvhltc@us.ibm.com>
To: svaidy@linux.vnet.ibm.com
Cc: Stephen Rothwell <sfr@canb.auug.org.au>,
	Gautham R Shenoy <ego@in.ibm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	linuxppc-dev@ozlabs.org, Will Schmidt <willschm@us.ibm.com>,
	Paul Mackerras <paulus@samba.org>,
	Thomas Gleixner <tglx@linutronix.de>
Subject: Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries
Date: Fri, 06 Aug 2010 00:13:23 -0700
Message-ID: <4C5BB613.6050002@us.ibm.com>
In-Reply-To: <20100806050936.GI3282@dirshya.in.ibm.com>

On 08/05/2010 10:09 PM, Vaidyanathan Srinivasan wrote:
> * Darren Hart <dvhltc@us.ibm.com>  [2010-08-05 19:19:00]:
> 
>> On 07/22/2010 10:09 PM, Benjamin Herrenschmidt wrote:
>>> On Thu, 2010-07-22 at 21:44 -0700, Darren Hart wrote:
>>>
>>>>    suggestion I updated the instrumentation to display the
>>>> local_save_flags and irqs_disabled_flags:
>>>
>>>> Jul 22 23:36:58 igoort1 kernel: local flags: 0, irqs disabled: 1
>>>> Jul 22 23:36:58 igoort1 kernel: before H_CEDE current->stack: c00000010e9e3ce0, pcnt: 1
>>>> Jul 22 23:36:58 igoort1 kernel: after H_CEDE current->stack: c00000010e9e3ce0, pcnt: 1
>>>>
>>>> I'm not sure if I'm reading that right, but I believe interrupts are
>>>> intended to be disabled here. If accomplished via the
>>>> spin_lock_irqsave() this would behave differently on RT. However, this
>>>> path disables the interrupts handled by xics, all but the IPIs anyway.
>>>> On RT I disabled the decrementer as well.
>>>>
>>>> Is it possible for RT to be receiving other interrupts here?
>>>
>>> Also you may want to call hard_irq_disable() to -really- disable
>>> interrupts ... since we do lazy-disable on powerpc
>>>
>>
>> A quick update and request for direction wrt the cede hcall, interrupt handling, and the stack pointer.
>>
>> I've added patches one at a time, eliminating bugs seen with preempt_rt
>> (tip/rt/head). With my current patchset I have no crashes over a single run
>> of random_online.sh (stress testing to hit the hang Will sees is on my todo
>> list for tomorrow).
>>
>> Current patchset includes:
>> patches/0001-wms-fix01.patch # P7 lazy flushing thing
>>
>> # next four are sent to / queued for mainline
>> patches/powerpc-increase-pseries_cpu_die-delay.patch
>> patches/powerpc-enable-preemption-before-cpu_die.patch
>> patches/powerpc-silence-__cpu_up-under-normal-operation.patch
>> patches/powerpc-silence-xics_migrate_irqs_away-during-cpu-offline.patch
>>
>> # this one needs to be cleaned up and sent to mainline
>> patches/powerpc-wait-for-cpu-to-go-inactive.patch
>>
>> patches/powerpc-disable-decrementer-on-offline.patch
>> patches/powerpc-cpu_die-preempt-hack.patch # reset preempt_count to 0 after cede
>> patches/powerpc-cede-processor-inst.patch # instrumentation
>> patches/powerpc-hard_irq_disable.patch # hard_irq_disable before cede
>> patches/powerpc-pad-thread_info.patch
>>
>> I didn't attach all of the patches, since the relevant bits are reproduced
>> below in code form.
>>
>> With the instrumentation, it's clear the corruption targets preempt_count()
>> specifically: I added padding before and after preempt_count in the
>> thread_info struct, the padding is untouched, and yet preempt_count still
>> changes. It is also clear that it isn't just a stray decrement, since I set
>> preempt_count() to 4 prior to calling cede and it is still 0xffffffff after
>> the hcall. Also note that the stack pointer doesn't change across the cede
>> call, and neither does any other part of the thread_info structure.
>>
>> Adding hard_irq_disable() did seem to affect things somewhat. Rather than a
>> serialized list of before/after thread_info prints around cede, I see
>> several befores then several afters. But the preempt_count is still modified.
>>
>> The relevant code looks like this (from pseries_mach_cpu_die())
>>
>>                          /* hard-disable: doesn't fix things, but does
>>                           * change behavior a bit */
>>                          hard_irq_disable();
>>                          /* sentinel value, so a blind reset or stray
>>                           * decrement is easy to spot after the hcall */
>>                          preempt_count() = 0x4;
>>                          asm("mr %0,1" : "=r" (sp));  /* stack pointer is in R1 */
>>                          printk("before cede: sp=%lx pcnt=%x\n", sp, preempt_count());
>>                          print_thread_info();
>>                          extended_cede_processor(cede_latency_hint);
>>                          asm("mr %0,1" : "=r" (sp));  /* re-read R1 after the hcall */
>>                          printk("after cede: sp=%lx pcnt=%x\n", sp, preempt_count());
>>                          print_thread_info();
>>                          preempt_count() = 0;
>>
>>
>> With these patches applied, the output across cede looks like:
>>
>> before cede: sp=c000000000b57150 pcnt=4
>> *** current->thread_info ***
>> ti->task: c000000000aa1410
>> ti->exec_domain: c000000000a59958
>> ti->cpu: 0
>> ti->stomp_on_me: 57005 /* 0xDEAD - forgot to print in hex */
>> ti->preempt_count: 4
>> ti->stomp_on_me_too: 48879 /* 0xBEEF - forgot to print in hex */
>> ti->local_flags: 0
>> *** end thread_info ***
>> after cede: sp=c000000000b57150 pcnt=ffffffff
>> *** current->thread_info ***
>> ti->task: c000000000aa1410
>> ti->exec_domain: c000000000a59958
>> ti->cpu: 0
> 
> Is this print sufficient to prove that we did not jump CPU?

Probably not, but the following should be sufficient. Note that the
dump occurs on "CPU: 6", which matches the value of ti->cpu printed
before and after the cede call.

*** current->thread_info ***
ti->task: c00000008e7b2240
ti->exec_domain: c000000000a59958
ti->cpu: 6
ti->stomp_on_me: 57005
ti->preempt_count: 4
ti->stomp_on_me_too: 48879
ti->local_flags: 0
*** end thread_info ***
------------[ cut here ]------------
Badness at kernel/sched.c:3698
NIP: c0000000006987e4 LR: c0000000006987c8 CTR: c00000000005c668
REGS: c00000010e713800 TRAP: 0700   Not tainted  (2.6.33-rt-dvhrt05-02358-g61223dd-dirty)
MSR: 8000000000021032 <ME,CE,IR,DR>  CR: 24000082  XER: 20000000
TASK = c00000008e7b2240[0] 'swapper' THREAD: c00000010e710000 CPU: 6
GPR00: 0000000000000000 c00000010e713a80 c000000000b54f88 0000000000000001 
GPR04: c00000010e713d08 ffffffffffffffff 00000000000000e0 800000000d049138 
GPR08: 0000000000000000 c000000000ca5420 0000000000000001 c000000000bc22e8 
GPR12: 0000000000000002 c000000000ba5080 0000000000000000 000000000f394b6c 
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
GPR20: c000000000ba50c0 0000000000000004 0000000000000002 0000000000000000 
GPR24: 0000000000000004 c00000010e713cd0 c0000000009f1ecc c00000000088aaff 
GPR28: 0000000000000000 0000000000000001 c000000000ad7610 c00000010e713a80 
NIP [c0000000006987e4] .add_preempt_count+0x6c/0xe0
LR [c0000000006987c8] .add_preempt_count+0x50/0xe0
Call Trace:
[c00000010e713a80] [c00000010e713b30] 0xc00000010e713b30 (unreliable)
[c00000010e713b10] [c0000000000824b4] .vprintk+0xac/0x480
[c00000010e713c40] [c00000000069bcbc] .printk+0x48/0x5c
[c00000010e713cd0] [c00000000005c190] .pseries_mach_cpu_die+0x1c4/0x39c
[c00000010e713db0] [c00000000001e7e0] .cpu_die+0x4c/0x68
[c00000010e713e30] [c000000000017de0] .cpu_idle+0x1f8/0x220
[c00000010e713ed0] [c0000000006a17c4] .start_secondary+0x394/0x3d4
[c00000010e713f90] [c000000000008264] .start_secondary_prolog+0x10/0x14
Instruction dump:
78290464 80090018 2f800000 40fc002c 4bd089c1 60000000 2fa30000 419e006c 
e93e87e0 80090000 2f800000 409e005c <0fe00000> 48000054 e97e8398 78290464 
after cede: sp=c00000010e713cd0 pcnt=ffffffff
*** current->thread_info ***
ti->task: c00000008e7b2240
ti->exec_domain: c000000000a59958
ti->cpu: 6
ti->stomp_on_me: 57005
ti->preempt_count: fffffffe
ti->stomp_on_me_too: 48879
ti->local_flags: 0
*** end thread_info ***
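
For reference, the padding and the dump helper used above amount to roughly
the following. This is a sketch reconstructed from the output (the actual
powerpc-pad-thread_info.patch and the instrumentation patch may differ in
detail); the 0xDEAD/0xBEEF canaries sit immediately around preempt_count so
that an over- or under-run of a neighbouring field would show up in the
prints:

	/* arch/powerpc/include/asm/thread_info.h -- padded layout (sketch) */
	struct thread_info {
		struct task_struct	*task;		/* main task structure */
		struct exec_domain	*exec_domain;	/* execution domain */
		int			cpu;		/* cpu we're on */
		int			stomp_on_me;	/* canary, set to 0xDEAD */
		int			preempt_count;	/* 0 => preemptable */
		int			stomp_on_me_too; /* canary, set to 0xBEEF */
		unsigned long		local_flags;	/* private flags */
		/* ... remainder unchanged ... */
	};

	/* dump helper (sketch) */
	static void print_thread_info(void)
	{
		struct thread_info *ti = current_thread_info();

		printk("*** current->thread_info ***\n");
		printk("ti->task: %p\n", ti->task);
		printk("ti->exec_domain: %p\n", ti->exec_domain);
		printk("ti->cpu: %d\n", ti->cpu);
		printk("ti->stomp_on_me: %d\n", ti->stomp_on_me);
		printk("ti->preempt_count: %x\n", ti->preempt_count);
		printk("ti->stomp_on_me_too: %d\n", ti->stomp_on_me_too);
		printk("ti->local_flags: %lx\n", ti->local_flags);
		printk("*** end thread_info ***\n");
	}

The Badness above appears to be one of the CONFIG_DEBUG_PREEMPT sanity checks
in add_preempt_count() (most likely the underflow check, given pcnt is -1)
tripping when the printk path bumps the already-corrupted count; note the
.vprintk -> .add_preempt_count frames in the call trace.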


> We agree that the decrementer can possibly expire and that we could have gone
> to the handler and come back, just like we would in an idle loop.

I disable the decrementer in the stop_cpu path (roughly as sketched below)...
that may not be sufficient for the pseries_mach_cpu_die() path; I'll have to
experiment.

However, even if the decrementer did fire here, it shouldn't simply be
changing the value, and especially not to something illegal.
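
For reference, quieting the decrementer on the offline path can be as simple
as programming it to its maximum count. The sketch below assumes that is all
the powerpc-disable-decrementer-on-offline.patch does, which may not match
the actual patch, and quiesce_decrementer() is just a name made up for the
sketch:

	#include <asm/time.h>

	/* Sketch only: program the decrementer to its largest positive
	 * 32-bit count so it won't fire for a very long time. */
	static void quiesce_decrementer(void)
	{
		set_dec(0x7fffffff);
	}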

> This will not always happen, but I could see a window of time during
> which it may.  But that should not change the preempt_count
> unconditionally to -1 with the same stack pointer.
> 
>> ti->stomp_on_me: 57005
>> ti->preempt_count: ffffffff
>> ti->stomp_on_me_too: 48879
>> ti->local_flags: 0
> 
> Is there a method to identify whether we executed any interrupt
> handler while in this code?

I'd like to know as well. If anyone knows, please share. Otherwise
I'll be trying to sort that out tomorrow.
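
One straightforward way to check (a sketch only; cede_irq_hits is a
hypothetical per-cpu counter added for the experiment, not an existing
interface) would be to bump a counter at the top of the handlers of interest,
e.g. timer_interrupt() and do_IRQ(), and compare it across the hcall:

	#include <linux/percpu.h>

	/* hypothetical instrumentation, file scope */
	DEFINE_PER_CPU(unsigned long, cede_irq_hits);

	/* at the top of timer_interrupt() / do_IRQ():
	 *	__get_cpu_var(cede_irq_hits)++;
	 */

	/* around the hcall in pseries_mach_cpu_die(): */
	unsigned long hits = __get_cpu_var(cede_irq_hits);

	extended_cede_processor(cede_latency_hint);
	if (__get_cpu_var(cede_irq_hits) != hits)
		printk("took %lu interrupt(s) across cede\n",
		       __get_cpu_var(cede_irq_hits) - hits);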

> 
>> *** end thread_info ***
>>
>> Are there any additional thoughts on what might be causing preempt_count to
>> change? I was thinking that cede might somehow force it to 0 (or perhaps one
>> of the explicit preempt_count setters in irq.c does), and something then
>> decrements it - but that decrement should trigger an error in
>> dec_preempt_count(), as I have CONFIG_DEBUG_PREEMPT=y.
> 
> The decrementer is only a suspect; we have not yet proved that the dec
> handler is being executed.  Could somebody be using our stack?  The
> pointer is the same... but do we still own it?  I cannot think of how
> somebody else could be using this cpu's stack... but it's just a thought
> to explain this anomaly.
> 
> --Vaidy


Thanks for the thoughts,

-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
