Unexpected data TLB miss happens when guest OS executing a "bl" instruction


* Unexpected data TLB miss happens when guest OS executing a "bl" instruction
@ 2012-07-02 15:34 Fei K Chen
  2012-07-02 19:03 ` Jimi Xenidis
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: Fei K Chen @ 2012-07-02 15:34 UTC (permalink / raw)
  To: kvm-ppc

We are debugging KVM on the IBM PowerEN chip with the RISCWatch tool. An unexpected data TLB miss happened and we cannot explain why. Has anyone seen this before?

1. The guest OS executes a "bl" instruction at PC=0xC0000000005A49CC. According to the objdump of the guest Linux kernel, the next instruction executed should be "mflr r0" at PC=0xC000000000599CC0.

2. When single-stepping in RISCWatch, the guest OS does jump to PC=0xC000000000599CC0. At this point RISCWatch cannot display the instruction. We guess this is because the hardware TLB has no instruction entry for PC=0xC000000000599CC0, so we expect an instruction TLB miss when we press "Asmstep" to execute the next instruction.

3. Unfortunately, PowerEN instead jumps to PC=0xC000000000051FF4, which is the entry point of the data TLB miss handler in KVM. We read the SPRs SRR0 and DEAR; both contain 0xC000000000599CC0. We cannot explain why this happens.

4. Because external interrupts occur during single-step debugging, we set a hardware breakpoint at PC=0xC000000000599CC0 and let PowerEN run directly to that point.

5. When PowerEN stops at PC=0xC000000000599CC0, RISCWatch shows a "trap" instruction at that address, which differs from what the kernel objdump file shows. The only explanation we can think of is that our KVM code installed a wrong TLB entry for PC=0xC000000000599CC0 (possibly as a result of that unexpected data TLB miss).
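
The branch in step 1 can be cross-checked by hand. The following is a minimal sketch of the I-form branch arithmetic (primary opcode 18, a signed displacement in the LI field, then the AA and LK bits), not anything taken from RISCWatch or KVM. The helper names `encode_bl` and `decode_branch` are made up for illustration; only the two addresses come from the report above.

```python
MASK64 = (1 << 64) - 1


def encode_bl(cia, target):
    """Encode a relative "bl" located at address cia that branches to target."""
    disp = target - cia
    if disp % 4 or not -(1 << 25) <= disp < (1 << 25):
        raise ValueError("target not reachable by a relative I-form branch")
    # opcode 18, LI field (displacement without its two low zero bits), AA=0, LK=1
    return (18 << 26) | (disp & 0x03FFFFFC) | 1


def decode_branch(insn, cia):
    """Return (target, lk) for an I-form branch word fetched from address cia."""
    if insn >> 26 != 18:
        raise ValueError("not an I-form branch")
    disp = insn & 0x03FFFFFC           # LI with two zero bits appended
    if disp & 0x02000000:              # sign-extend the 26-bit displacement
        disp -= 1 << 26
    aa = (insn >> 1) & 1               # AA=1 means the displacement is absolute
    lk = insn & 1                      # LK=1 means the link register is updated
    base = 0 if aa else cia
    return (base + disp) & MASK64, bool(lk)


cia = 0xC0000000005A49CC               # address of the "bl" (from step 1)
target = 0xC000000000599CC0            # expected "mflr r0" (from step 1)

insn = encode_bl(cia, target)
print(hex(insn))                       # 0x4bff52f5
print([hex(decode_branch(insn, cia)[0]), decode_branch(insn, cia)[1]])
```

If the word shown in the objdump at the "bl" address decodes to the same target, the jump in step 2 is going where the kernel image says it should.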


* Re: Unexpected data TLB miss happens when guest OS executing a "bl" instruction
  2012-07-02 15:34 Unexpected data TLB miss happens when guest OS executing a "bl" instruction Fei K Chen
@ 2012-07-02 19:03 ` Jimi Xenidis
  2012-07-02 19:14 ` Alexander Graf
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Jimi Xenidis @ 2012-07-02 19:03 UTC (permalink / raw)
  To: kvm-ppc

On Jul 2, 2012, at 10:34 AM, Fei K Chen wrote:

> We are debuging kvm on IBM poweren chip by RSICWatch tool.

For those of you who don't know, the "RISCWatch tool" (RW) is a JTAG-based HW probe debugger.

> An unexpected data TLB miss happened and we can not explain why. Any one have met this before?
> 
> 1. Guest OS executes a "bl" instruction with PC=0xC0000000005A49CC. According to the guest linux kernel objdump file, the next instruction will be "mflr r0" with PC=0xC000000000599CC0.

Normally, if you were debugging the host, you could strip off the 0xC and read the location from memory to verify this.
However, since you are debugging the guest, you probably can't figure out the machine physical address.
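
As a rough sketch of the "strip off the 0xC" step for a host address: this assumes the usual 64-bit powerpc Linux layout where the kernel linear mapping starts at 0xC000000000000000. The function name is hypothetical, and it deliberately says nothing about guest addresses, which would still need the guest-physical to machine-physical translation that KVM maintains.

```python
PAGE_OFFSET = 0xC000000000000000       # assumed start of the host kernel linear mapping
REGION_MASK = 0xF000000000000000


def host_linear_to_phys(ea):
    """Translate a host kernel linear-map effective address to a physical address."""
    if ea & REGION_MASK != PAGE_OFFSET:
        raise ValueError("address is not in the 0xC linear-mapping region")
    return ea - PAGE_OFFSET


# The host-side handler address mentioned in step 3 of the report.
print(hex(host_linear_to_phys(0xC000000000051FF4)))   # 0x51ff4
```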

> 
> 2. By single-step execution in RISCWatch, guest OS does jump to an instruction with PC=0xC000000000599CC0. At this time, RISCWatch tool can not display what the instruction is. We guess this is because there is no instruction TLB entry in hardware TLB for PC=0xC000000000599CC0. Thus an instruction TLB miss is expected if we press the "Asmstep" to execute the next instruction.
> 
> 3. Unfortunately, poweren jumps an instruction with PC=0xC000000000051FF4 which is the beginning of data TLB miss entry in kvm. We read the values in spr SRR0 and DEAR. Both of them are 0xC000000000599CC0. We even can not imagine why this happens.

When RW looks at memory it uses "instruction stuffing/ramming": it inserts instructions into the thread's instruction "port", so it performs loads and stores using translation in the same way that software does.
Since you are using the ASM window, RW tries to read the instruction (before it is executed) with a load, and this causes your data fault.
So this is normal and completely expected.

Instead of using the ASM window, use the "command line window" to "step" and "read iar"; then you will get the instruction fault you are looking for.
Note: you may also be able to uncheck the "track IAR" box in the ASM window if you insist on using it.

> 
> 4. As external interrupt will happen during single-step debugging, we set a hardware breakpoint at PC=0xC000000000599CC0, and let poweren directly run to that point.

Yes, hardware probes become difficult with any interrupts active and, IMNSHO, should really only be used to debug bootstrap or exception level code.
It would be way easier to instrument the host fault handlers to help you debug this case.

> 
> 5. When poweren stops at PC=0xC000000000599CC0, from the output of RISCWatch, a "trap" instruction is placed at PC=0xC000000000599CC0. It is different with what should be according to the kernel objdump file. The only explanation we can imagine is that our kvm code set a wrong TLB entry for PC=0xC000000000599CC0 (it may be brought by that unexpected data TLB miss).

As explained above, I'm pretty sure you did not hit the data fault in the same way as before.
Does the rest of the instruction stream match? If not then you likely have a translation error.
However, there are only a handful of static "trap" instructions in vmlinux, so you should be able to track them down.
My bet is that some software (or RW) has inserted the trap instruction to facilitate some form of break point?
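
A minimal sketch of how the static traps could be listed, assuming you already have the raw big-endian bytes of the kernel text in hand (how you extract them from vmlinux is up to you) and know the address they are linked at. It only matches the unconditional "trap" encoding 0x7FE00008 (tw 31,0,0); other trap forms such as twi have different encodings and are not matched. `find_traps` is a made-up name.

```python
import struct

TRAP = 0x7FE00008                      # unconditional "trap" (tw 31,0,0)


def find_traps(text, base):
    """Return the addresses of all unconditional trap words in a big-endian image."""
    addrs = []
    usable = len(text) - len(text) % 4  # ignore a trailing partial word
    for off in range(0, usable, 4):
        (word,) = struct.unpack_from(">I", text, off)
        if word == TRAP:
            addrs.append(base + off)
    return addrs


# Tiny hand-made example: "mflr r0", "trap", "nop" starting 4 bytes below the
# address from the report. The layout is hypothetical, the encodings are real.
sample = struct.pack(">3I", 0x7C0802A6, TRAP, 0x60000000)
print([hex(a) for a in find_traps(sample, 0xC000000000599CBC)])
```

Any address that RW shows as "trap" but that is missing from this list was patched in at run time.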

-jx



* Re: Unexpected data TLB miss happens when guest OS executing a "bl" instruction
  2012-07-02 15:34 Unexpected data TLB miss happens when guest OS executing a "bl" instruction Fei K Chen
  2012-07-02 19:03 ` Jimi Xenidis
@ 2012-07-02 19:14 ` Alexander Graf
  2012-07-03 13:26 ` Fei K Chen
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Alexander Graf @ 2012-07-02 19:14 UTC (permalink / raw)
  To: kvm-ppc


On 02.07.2012, at 21:03, Jimi Xenidis wrote:

> 
> On Jul 2, 2012, at 10:34 AM, Fei K Chen wrote:
> 
>> We are debuging kvm on IBM poweren chip by RSICWatch tool.
> 
> So for those of you who don't know,  the "RISCWatch tool" (RW) is a JTAG based HW probe debugger.
> 
>> An unexpected data TLB miss happened and we can not explain why. Any one have met this before?
>> 
>> 1. Guest OS executes a "bl" instruction with PC=0xC0000000005A49CC. According to the guest linux kernel objdump file, the next instruction will be "mflr r0" with PC=0xC000000000599CC0.
> 
> Normally, if you were debugging the host, you could strip off the 0xC and read the location from memory to verify this.
> However, since you are debugging the guest, you probably can't figure out the machine physical address.
> 
>> 
>> 2. By single-step execution in RISCWatch, guest OS does jump to an instruction with PC=0xC000000000599CC0. At this time, RISCWatch tool can not display what the instruction is. We guess this is because there is no instruction TLB entry in hardware TLB for PC=0xC000000000599CC0. Thus an instruction TLB miss is expected if we press the "Asmstep" to execute the next instruction.
>> 
>> 3. Unfortunately, poweren jumps an instruction with PC=0xC000000000051FF4 which is the beginning of data TLB miss entry in kvm. We read the values in spr SRR0 and DEAR. Both of them are 0xC000000000599CC0. We even can not imagine why this happens.
> 
> So when RW decides to look at memory it uses "instruction stuffing/ramming" where it is able to insert instructions into the thread's instruction "port",  this way it can perform loads and stores using translation in the same way that software does.
> Since you are using the ASM window, it is trying to read the instruction (before it is executed) and this causes your data fault.
> So this is normal and completely expect.
> 
> Instead of using the ASM window use the "command line window" to "step" and "read iar" then you will get the correct instruction fault you are looking for.
> Note: It may also be the case that you can uncheck  the "track IAR" box in the asm windows if you insist on doing that.
> 
>> 
>> 4. As external interrupt will happen during single-step debugging, we set a hardware breakpoint at PC=0xC000000000599CC0, and let poweren directly run to that point.
> 
> Yes, hardware probes become difficult with any interrupts active and, IMNSHO, should really only be used to debug bootstrap or exception level code.
> It would be way easier to instrument the host fault handlers to help you debug this case.

Right. Please check out how we do trace points on book3s_pr. You probably want the same thing for your implementation and then just trace what exactly is going on, without using explicit debuggers that might mess up your execution flow.


Alex



* Re: Unexpected data TLB miss happens when guest OS executing a "bl" instruction
  2012-07-02 15:34 Unexpected data TLB miss happens when guest OS executing a "bl" instruction Fei K Chen
  2012-07-02 19:03 ` Jimi Xenidis
  2012-07-02 19:14 ` Alexander Graf
@ 2012-07-03 13:26 ` Fei K Chen
  2012-07-03 13:44 ` Fei K Chen
  2012-07-04 10:00 ` tiejun.chen
  4 siblings, 0 replies; 6+ messages in thread
From: Fei K Chen @ 2012-07-03 13:26 UTC (permalink / raw)
  To: kvm-ppc


>On 02.07.2012, at 21:03, Jimi Xenidis wrote:
>
>> 
>> On Jul 2, 2012, at 10:34 AM, Fei K Chen wrote:
>> 
>>> We are debuging kvm on IBM poweren chip by RSICWatch tool.
>> 
>> So for those of you who don't know,  the "RISCWatch tool" (RW) is a JTAG based HW probe debugger.
>> 
>>> An unexpected data TLB miss happened and we can not explain why. Any one have met this before?
>>> 
>>> 1. Guest OS executes a "bl" instruction with PC=0xC0000000005A49CC. According to the guest linux kernel objdump file, the next instruction will be "mflr r0" with PC=0xC000000000599CC0.
>> 
>> Normally, if you were debugging the host, you could strip off the 0xC and read the location from memory to verify this.
>> However, since you are debugging the guest, you probably can't figure out the machine physical address.
>> 
>>> 
>>> 2. By single-step execution in RISCWatch, guest OS does jump to an instruction with PC=0xC000000000599CC0. At this time, RISCWatch tool can not display what the instruction is. We guess this is because there is no instruction TLB entry in hardware TLB for PC=0xC000000000599CC0. Thus an instruction TLB miss is expected if we press the "Asmstep" to execute the next instruction.
>>> 
>>> 3. Unfortunately, poweren jumps an instruction with PC=0xC000000000051FF4 which is the beginning of data TLB miss entry in kvm. We read the values in spr SRR0 and DEAR. Both of them are 0xC000000000599CC0. We even can not imagine why this happens.
>> 
>> So when RW decides to look at memory it uses "instruction stuffing/ramming" where it is able to insert instructions into the thread's instruction "port",  this way it can perform loads and stores using translation in the same way that software does.
>> Since you are using the ASM window, it is trying to read the instruction (before it is executed) and this causes your data fault.
>> So this is normal and completely expect.
>> 
>> Instead of using the ASM window use the "command line window" to "step" and "read iar" then you will get the correct instruction fault you are looking for.

>> Note: It may also be the case that you can uncheck  the "track IAR" box in the asm windows if you insist on doing that.
>> 
>>> 
>>> 4. As external interrupt will happen during single-step debugging, we set a hardware breakpoint at PC=0xC000000000599CC0, and let poweren directly run to that point.
>> 
>> Yes, hardware probes become difficult with any interrupts active and, IMNSHO, should really only be used to debug bootstrap or exception level code.
>> It would be way easier to instrument the host fault handlers to help you debug this case.
>
>Right. Please check out how we do trace points on book3s_pr. You probably want the same thing for your implementation and then just trace what exactly is going on, without using explicit debuggers that might mess up your execution flow.
>
>
>Alex
>

For some bugs in code that handles the software TLB, adding trace points might change the hardware behavior and thus lead to nondeterministic errors. That's why we have to use RISCWatch, though it sometimes introduces extra interrupts.


* Re: Unexpected data TLB miss happens when guest OS executing a "bl" instruction
  2012-07-02 15:34 Unexpected data TLB miss happens when guest OS executing a "bl" instruction Fei K Chen
                   ` (2 preceding siblings ...)
  2012-07-03 13:26 ` Fei K Chen
@ 2012-07-03 13:44 ` Fei K Chen
  2012-07-04 10:00 ` tiejun.chen
  4 siblings, 0 replies; 6+ messages in thread
From: Fei K Chen @ 2012-07-03 13:44 UTC (permalink / raw)
  To: kvm-ppc

On Jul 2, 2012, at 10:34 AM, Fei K Chen wrote:
>
>> We are debuging kvm on IBM poweren chip by RSICWatch tool.
>
>So for those of you who don't know,  the "RISCWatch tool" (RW) is a JTAG based HW probe debugger.
>
>> An unexpected data TLB miss happened and we can not explain why. Any one have met this before?
>> 
>> 1. Guest OS executes a "bl" instruction with PC=0xC0000000005A49CC. According to the guest linux kernel objdump file, the next instruction will be "mflr r0" with PC=0xC000000000599CC0.
>
>Normally, if you were debugging the host, you could strip off the 0xC and read the location from memory to verify this.
>However, since you are debugging the guest, you probably can't figure out the machine physical address.
>
>> 
>> 2. By single-step execution in RISCWatch, guest OS does jump to an instruction with PC=0xC000000000599CC0. At this time, RISCWatch tool can not display what the instruction is. We guess this is because there is no instruction TLB entry in hardware TLB for PC=0xC000000000599CC0. Thus an instruction TLB miss is expected if we press the "Asmstep" to execute the next instruction.
>> 
>> 3. Unfortunately, poweren jumps an instruction with PC=0xC000000000051FF4 which is the beginning of data TLB miss entry in kvm. We read the values in spr SRR0 and DEAR. Both of them are 0xC000000000599CC0. We even can not imagine why this happens.
>
>So when RW decides to look at memory it uses "instruction stuffing/ramming" where it is able to insert instructions into the thread's instruction "port",  this way it can perform loads and stores using translation in the same way that software does.
>Since you are using the ASM window, it is trying to read the instruction (before it is executed) and this causes your data fault.
>So this is normal and completely expect.
>
>Instead of using the ASM window use the "command line window" to "step" and "read iar" then you will get the correct instruction fault you are looking for.

Yes, as you said, with the "command line window" the data TLB miss no longer happens.

>Note: It may also be the case that you can uncheck  the "track IAR" box in the asm windows if you insist on doing that.
>
>> 
>> 4. As external interrupt will happen during single-step debugging, we set a hardware breakpoint at PC=0xC000000000599CC0, and let poweren directly run to that point.
>
>Yes, hardware probes become difficult with any interrupts active and, IMNSHO, should really only be used to debug bootstrap or exception level code.
>It would be way easier to instrument the host fault handlers to help you debug this case.
>
>> 
>> 5. When poweren stops at PC=0xC000000000599CC0, from the output of RISCWatch, a "trap" instruction is placed at PC=0xC000000000599CC0. It is different with what should be according to the kernel objdump file. The only explanation we can imagine is that our kvm code set a wrong TLB entry for PC=0xC000000000599CC0 (it may be brought by that unexpected data TLB miss).
>
>As explained above, I'm pretty sure you did not hit the data fault in the same way as before.
>Does the rest of the instruction stream match? If not then you likely have a translation error.

The rest of the instructions before and after 0xC000000000599CC0 match the vmlinux objdump file. So it seems that someone changed the memory content at 0xC000000000599CC0.
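
The comparison itself is simple enough to script. This is a sketch that assumes the words read back through RW and the words from the objdump have already been collected into two lists; the neighbouring 0x60000000 (nop) words are placeholders, not the real contents of that function.

```python
def mismatches(base, observed, expected):
    """Return (address, observed_word, expected_word) for every differing word."""
    return [
        (base + 4 * i, got, want)
        for i, (got, want) in enumerate(zip(observed, expected))
        if got != want
    ]


base = 0xC000000000599CBC
expected = [0x60000000, 0x7C0802A6, 0x60000000]   # placeholder, "mflr r0", placeholder
observed = [0x60000000, 0x7FE00008, 0x60000000]   # same window with "trap" in the middle

for addr, got, want in mismatches(base, observed, expected):
    print(hex(addr), hex(got), hex(want))
```

A single differing word in an otherwise matching window points to a patched instruction rather than a wrong translation.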

>However, there are only a handful of static "trap" instructions in vmlinux, so you should be able to track them down.
>My bet is that some software (or RW) has inserted the trap instruction to facilitate some form of break point?

The instruction at 0xC000000000599CC0 should not be overwritten by anyone, because the "mflr r0" there saves the lr register. Without the value in lr, the function call cannot return to the caller.

I am using RW to trace any write to address 0xC000000000599CC0. Strangely, PowerEN does not stop before it stops at that unexpected "trap" instruction. It looks like the "trap" already exists before I set up the data address write breakpoint.

>
>-jx
>
>


* Re: Unexpected data TLB miss happens when guest OS executing a "bl" instruction
  2012-07-02 15:34 Unexpected data TLB miss happens when guest OS executing a "bl" instruction Fei K Chen
                   ` (3 preceding siblings ...)
  2012-07-03 13:44 ` Fei K Chen
@ 2012-07-04 10:00 ` tiejun.chen
  4 siblings, 0 replies; 6+ messages in thread
From: tiejun.chen @ 2012-07-04 10:00 UTC (permalink / raw)
  To: kvm-ppc

On 07/03/2012 09:44 PM, Fei K Chen wrote:
> On Jul 2, 2012, at 10:34 AM, Fei K Chen wrote:
>>
>>> We are debuging kvm on IBM poweren chip by RSICWatch tool.
>>
>> So for those of you who don't know,  the "RISCWatch tool" (RW) is a JTAG based HW probe debugger.
>>
>>> An unexpected data TLB miss happened and we can not explain why. Any one have met this before?
>>>
>>> 1. Guest OS executes a "bl" instruction with PC=0xC0000000005A49CC. According to the guest linux kernel objdump file, the next instruction will be "mflr r0" with PC=0xC000000000599CC0.
>>
>> Normally, if you were debugging the host, you could strip off the 0xC and read the location from memory to verify this.
>> However, since you are debugging the guest, you probably can't figure out the machine physical address.
>>
>>>
>>> 2. By single-step execution in RISCWatch, guest OS does jump to an instruction with PC=0xC000000000599CC0. At this time, RISCWatch tool can not display what the instruction is. We guess this is because there is no instruction TLB entry in hardware TLB for PC=0xC000000000599CC0. Thus an instruction TLB miss is expected if we press the "Asmstep" to execute the next instruction.
>>>
>>> 3. Unfortunately, poweren jumps an instruction with PC=0xC000000000051FF4 which is the beginning of data TLB miss entry in kvm. We read the values in spr SRR0 and DEAR. Both of them are 0xC000000000599CC0. We even can not imagine why this happens.
>>
>> So when RW decides to look at memory it uses "instruction stuffing/ramming" where it is able to insert instructions into the thread's instruction "port",  this way it can perform loads and stores using translation in the same way that software does.
>> Since you are using the ASM window, it is trying to read the instruction (before it is executed) and this causes your data fault.
>> So this is normal and completely expect.
>>
>> Instead of using the ASM window use the "command line window" to "step" and "read iar" then you will get the correct instruction fault you are looking for.
> 
> Yes, as what you said, by using the "command line window", the data TLB miss does not happen now.
> 
>> Note: It may also be the case that you can uncheck  the "track IAR" box in the asm windows if you insist on doing that.
>>
>>>
>>> 4. As external interrupt will happen during single-step debugging, we set a hardware breakpoint at PC=0xC000000000599CC0, and let poweren directly run to that point.
>>
>> Yes, hardware probes become difficult with any interrupts active and, IMNSHO, should really only be used to debug bootstrap or exception level code.
>> It would be way easier to instrument the host fault handlers to help you debug this case.
>>
>>>
>>> 5. When poweren stops at PC=0xC000000000599CC0, from the output of RISCWatch, a "trap" instruction is placed at PC=0xC000000000599CC0. It is different with what should be according to the kernel objdump file. The only explanation we can imagine is that our kvm code set a wrong TLB entry for PC=0xC000000000599CC0 (it may be brought by that unexpected data TLB miss).
>>
>> As explained above, I'm pretty sure you did not hit the data fault in the same way as before.
>> Does the rest of the instruction stream match? If not then you likely have a translation error.
> 
> The reset of the instructions before and after 0xC000000000599CC0 match the vmlinux objdump file. So it seems that someone changed the memory content at 0xC000000000599CC0.
> 
>> However, there are only a handful of static "trap" instructions in vmlinux, so you should be able to track them down.
>> My bet is that some software (or RW) has inserted the trap instruction to facilitate some form of break point?
> 
> The instruction in 0xC000000000599CC0 should not be covered by anyone because the instruction "mflr r0" is saving the lr register. Without the value in lr, function call can not return to the caller. 
> 

I think you can set a data write breakpoint at another address to check whether
that is replaced with a trap instruction as well. If not, at least this means
the trap shouldn't have been inserted by RW :)

When that stops, you can also dump 0xC000000000599CC0 to take a look at what
happened.

> I am tracing the event that someone writes to address 0xC000000000599CC0 with the help of RW. Strange, poweren dose not stop before it stops at that unexpected "trap" instruction. It looks like that the "trap" exists before I set up the data address write breakpoint.

Did you enable kprobes to probe something? Kprobes replaces the instruction at
the probed address with a trap instruction so that the CPU traps there.
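
One way to check is to look at the registered probes in debugfs (/sys/kernel/debug/kprobes/list in the guest, if debugfs is mounted there). A small sketch, assuming each line starts with the probe address in hex followed by the probe type and symbol; the sample listing below is invented for illustration and the function name is hypothetical.

```python
def probed_addresses(listing):
    """Collect the probe addresses from the text of a kprobes list file."""
    addrs = set()
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue                    # skip blank or malformed lines
        try:
            addrs.add(int(fields[0], 16))
        except ValueError:
            continue                    # first column is not a hex address
    return addrs


# Invented sample in the "address type symbol+offset" layout.
sample = """\
c000000000599cc0  k  some_function+0x0
c000000000123450  r  other_function+0x0
"""

print(0xC000000000599CC0 in probed_addresses(sample))
```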

Tiejun


Thread overview: 6+ messages
2012-07-02 15:34 Unexpected data TLB miss happens when guest OS executing a "bl" instruction Fei K Chen
2012-07-02 19:03 ` Jimi Xenidis
2012-07-02 19:14 ` Alexander Graf
2012-07-03 13:26 ` Fei K Chen
2012-07-03 13:44 ` Fei K Chen
2012-07-04 10:00 ` tiejun.chen
