* [Qemu-devel] x86 segment limits enforcement with TCG
From: Stephen Checkoway @ 2019-02-24 19:36 UTC
  To: qemu-devel

Hi all,

Sorry for the long email. The TL;DR: I'm trying to add x86 segment limit checks in TCG, and the code I've added causes qemu to abort during code generation because some registers have become dead.

I have some x86 firmware that refuses to boot unless a variety of x86 security mechanisms are enforced. In particular, it checks that the general protection fault handler runs if a memory access extends past a segment limit: it configures a segment with a limit of 3, sets ds to the corresponding selector, and then tries to load 4 bytes from address 3.

I looked through target/i386/translate.c and it looks like in most (all?) cases, gen_lea_v_seg is used, either directly or indirectly, to compute a linear address by adding the segment's base address. I attempted to modify it to check the segment limit and raise a general protection fault if the effective address plus the access size exceeds the limit. The full diff is below, but the main change adds this:

        if (s->pe && !CODE64(s)) {
            /*
             * Check that the access is within the segment limit in protected
             * mode.
             */
            TCGLabel *label = gen_new_label();
            int access_limit = (1 << (aflag & MO_SIZE)) - 1;
            tcg_gen_mov_tl(s->A1, a0);
            if (access_limit > 0) {
                tcg_gen_addi_tl(s->A1, s->A1, access_limit);
            }
            tcg_gen_brcond_tl(TCG_COND_LE, s->A1, limit, label);
            gen_exception(s, EXCP0D_GPF, s->pc_start - s->cs_base);
            gen_set_label(label);
        }

With this change, qemu aborts when trying to run tests/pxe-test. I ran the test in lldb and it's aborting during TCG code generation (full stack trace below).

frame #3: 0x000000010001eac4 qemu-system-i386`temp_load(s=0x0000000105062600, ts=0x0000000105063170, desired_regs=65535, allocated_regs=48, preferred_regs=0) at tcg.c:3211
   3208         break;
   3209     case TEMP_VAL_DEAD:
   3210     default:
-> 3211         tcg_abort();
   3212     }
   3213     ts->reg = reg;
   3214     ts->val_type = TEMP_VAL_REG;

because ts->val_type is TEMP_VAL_DEAD.

I tried removing everything except the tcg_gen_brcond_tl (using a0 rather than s->A1) and the gen_set_label, and I see the same behavior. The operation it aborts on is:

(lldb) p *op
(TCGOp) $0 = {
  opc = INDEX_op_add_i32
  param1 = 0
  param2 = 0
  life = 24
  link = {
    tqe_next = 0x0000000105055c30
    tqe_circ = {
      tql_next = 0x0000000105055c30
      tql_prev = 0x0000000105055b48
    }
  }
  args = ([0] = 4378961264, [1] = 4378961264, [2] = 4378960256, [3] = 18446744073709486096, [4] = 281470681808641, [5] = 4378961880, [6] = 4378961880, [7] = 0, [8] = 18446744069414649855, [9] = 5, [10] = 4379204752)
  output_pref = ([0] = 65343, [1] = 1)
}

I think this is coming from the add for the segment base. Changing the condition to TCG_COND_ALWAYS breaks elsewhere.

I think that something about adding the tcg_gen_brcond_tl is causing values to become dead and then qemu aborts.

My questions are:
1. Is this the right place to add segment limit checks? (I guess I don't need to check every memory access; catching the majority of them is likely fine.)

2. Am I using tcg_gen_brcond_tl (or any of the other TCG functions) incorrectly?

3. Is there a way I can see the generated TCG for this test? I've been running lldb using
$ QTEST_QEMU_BINARY='tmux split-window lldb -- i386-softmmu/qemu-system-i386' QTEST_QEMU_IMG=qemu-img tests/pxe-test -m=quick -k --tap
but when I try to add additional options like -d op, the test harness aborts: qemu/tests/boot-sector.c:119:boot_sector_init: code should not be reached

Any assistance is very much appreciated.

Thank you,

Steve

Stack trace:
  * frame #0: 0x00007fff6b45323e libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fff6b509c1c libsystem_pthread.dylib`pthread_kill + 285
    frame #2: 0x00007fff6b3bc1c9 libsystem_c.dylib`abort + 127
    frame #3: 0x000000010001eac4 qemu-system-i386`temp_load(s=0x0000000105062600, ts=0x0000000105063170, desired_regs=65535, allocated_regs=48, preferred_regs=0) at tcg.c:3211
    frame #4: 0x000000010001af09 qemu-system-i386`tcg_reg_alloc_op(s=0x0000000105062600, op=0x0000000104a78300) at tcg.c:3458
    frame #5: 0x00000001000176f2 qemu-system-i386`tcg_gen_code(s=0x0000000105062600, tb=0x00000001058f93c0) at tcg.c:3981
    frame #6: 0x00000001000d3972 qemu-system-i386`tb_gen_code(cpu=0x0000000105043c00, pc=787953, cs_base=786432, flags=192, cflags=-16252928) at translate-all.c:1751
    frame #7: 0x00000001000d1696 qemu-system-i386`tb_find(cpu=0x0000000105043c00, last_tb=0x0000000000000000, tb_exit=0, cf_mask=524288) at cpu-exec.c:407
    frame #8: 0x00000001000d0dc1 qemu-system-i386`cpu_exec(cpu=0x0000000105043c00) at cpu-exec.c:728
    frame #9: 0x000000010006bbf1 qemu-system-i386`tcg_cpu_exec(cpu=0x0000000105043c00) at cpus.c:1429
    frame #10: 0x000000010006b4b7 qemu-system-i386`qemu_tcg_cpu_thread_fn(arg=0x0000000105043c00) at cpus.c:1733
    frame #11: 0x00000001006223ec qemu-system-i386`qemu_thread_start(args=0x0000000103d0e770) at qemu-thread-posix.c:502
    frame #12: 0x00007fff6b507305 libsystem_pthread.dylib`_pthread_body + 126
    frame #13: 0x00007fff6b50a26f libsystem_pthread.dylib`_pthread_start + 70
    frame #14: 0x00007fff6b506415 libsystem_pthread.dylib`thread_start + 13

diff --git a/target/i386/translate.c b/target/i386/translate.c
index 49cd298374..6f3682a268 100644
--- a/target/i386/translate.c
+++ b/target/i386/translate.c
@@ -76,6 +76,7 @@ static TCGv cpu_cc_dst, cpu_cc_src, cpu_cc_src2;
 static TCGv_i32 cpu_cc_op;
 static TCGv cpu_regs[CPU_NB_REGS];
 static TCGv cpu_seg_base[6];
+static TCGv cpu_seg_limit[6];
 static TCGv_i64 cpu_bndl[4];
 static TCGv_i64 cpu_bndu[4];

@@ -130,6 +131,7 @@ typedef struct DisasContext {
     /* TCG local temps */
     TCGv cc_srcT;
     TCGv A0;
+    TCGv A1;
     TCGv T0;
     TCGv T1;

@@ -448,6 +450,8 @@ static inline void gen_jmp_im(DisasContext *s, target_ulong pc)
     gen_op_jmp_v(s->tmp0);
 }

+static void gen_exception(DisasContext *s, int trapno, target_ulong cur_eip);
+
 /* Compute SEG:REG into A0.  SEG is selected from the override segment
    (OVR_SEG) and the default segment (DEF_SEG).  OVR_SEG may be -1 to
    indicate no override.  */
@@ -491,6 +495,23 @@ static void gen_lea_v_seg(DisasContext *s, TCGMemOp aflag, TCGv a0,

     if (ovr_seg >= 0) {
         TCGv seg = cpu_seg_base[ovr_seg];
+        TCGv limit = cpu_seg_limit[ovr_seg];
+
+        if (s->pe && !CODE64(s)) {
+            /*
+             * Check that the access is within the segment limit in protected
+             * mode.
+             */
+            TCGLabel *label = gen_new_label();
+            int access_limit = (1 << (aflag & MO_SIZE)) - 1;
+            tcg_gen_mov_tl(s->A1, a0);
+            if (access_limit > 0) {
+                tcg_gen_addi_tl(s->A1, s->A1, access_limit);
+            }
+            tcg_gen_brcond_tl(TCG_COND_LE, s->A1, limit, label);
+            gen_exception(s, EXCP0D_GPF, s->pc_start - s->cs_base);
+            gen_set_label(label);
+        }

         if (aflag == MO_64) {
             tcg_gen_add_tl(s->A0, a0, seg);
@@ -8382,6 +8403,14 @@ void tcg_x86_init(void)
         [R_GS] = "gs_base",
         [R_SS] = "ss_base",
     };
+    static const char seg_limit_names[6][9] = {
+        [R_CS] = "cs_limit",
+        [R_DS] = "ds_limit",
+        [R_ES] = "es_limit",
+        [R_FS] = "fs_limit",
+        [R_GS] = "gs_limit",
+        [R_SS] = "ss_limit",
+    };
     static const char bnd_regl_names[4][8] = {
         "bnd0_lb", "bnd1_lb", "bnd2_lb", "bnd3_lb"
     };
@@ -8410,6 +8439,10 @@ void tcg_x86_init(void)
             = tcg_global_mem_new(cpu_env,
                                  offsetof(CPUX86State, segs[i].base),
                                  seg_base_names[i]);
+        cpu_seg_limit[i]
+            = tcg_global_mem_new(cpu_env,
+                                 offsetof(CPUX86State, segs[i].limit),
+                                 seg_limit_names[i]);
     }

     for (i = 0; i < 4; ++i) {
@@ -8482,6 +8515,7 @@ static void i386_tr_init_disas_context(DisasContextBase *dcbase, CPUState *cpu)
     dc->T0 = tcg_temp_new();
     dc->T1 = tcg_temp_new();
     dc->A0 = tcg_temp_new();
+    dc->A1 = tcg_temp_new();

     dc->tmp0 = tcg_temp_new();
     dc->tmp1_i64 = tcg_temp_new_i64();

-- 
Stephen Checkoway


* Re: [Qemu-devel] x86 segment limits enforcement with TCG
From: Peter Maydell @ 2019-02-24 19:46 UTC
  To: Stephen Checkoway; +Cc: QEMU Developers

On Sun, 24 Feb 2019 at 19:37, Stephen Checkoway
<stephen.checkoway@oberlin.edu> wrote:
> I think that something about adding the tcg_gen_brcond_tl is causing values to become dead and then qemu aborts.

Yep -- all "TCG temporaries" are dead at the end
of a basic block, and brcond ends a basic block.
Only globals and "local temporaries" stay live
across brcond. This is documented in tcg/README,
though it doesn't spell it out very explicitly.

This makes brcond pretty painful to use and
almost impossible to introduce into the middle
of some existing sequence of generated code.
I haven't looked at what the best way to do what
you're trying to do here is, though.
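
One escape hatch: anything whose value you still need after the branch
must be a global or a "local temporary" (tcg_temp_local_new), which is
memory-backed and survives brcond. An untested sketch, reusing the names
from your diff:

    TCGv addr = tcg_temp_local_new();   /* memory-backed; survives brcond */
    tcg_gen_mov_tl(addr, a0);
    tcg_gen_addi_tl(s->A1, addr, access_limit);
    tcg_gen_brcond_tl(TCG_COND_LEU, s->A1, limit, label);
    gen_exception(s, EXCP0D_GPF, s->pc_start - s->cs_base);
    gen_set_label(label);
    tcg_gen_add_tl(s->A0, addr, seg);   /* read the copy, not a0 */
    tcg_temp_free(addr);

(Note also that limit comparisons want an unsigned condition, i.e.
TCG_COND_LEU rather than TCG_COND_LE.)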

By the way, don't do this:
+    dc->A1 = tcg_temp_new();

The current use of a small number of tcg temps
in the i386 translate.c code is an antipattern
that is a relic from a very old version of the
code. It's much better to simply create new
temporaries in the code at the point where you
need them and then free them once you're done.

thanks
-- PMM


* Re: [Qemu-devel] x86 segment limits enforcement with TCG
From: Stephen Checkoway @ 2019-02-24 20:21 UTC
  To: Peter Maydell; +Cc: QEMU Developers



> On Feb 24, 2019, at 14:46, Peter Maydell <peter.maydell@linaro.org> wrote:
> 
> On Sun, 24 Feb 2019 at 19:37, Stephen Checkoway
> <stephen.checkoway@oberlin.edu> wrote:
>> I think that something about adding the tcg_gen_brcond_tl is causing values to become dead and then qemu aborts.
> 
> Yep -- all "TCG temporaries" are dead at the end
> of a basic block, and brcond ends a basic block.
> Only globals and "local temporaries" stay live
> across brcond. This is documented in tcg/README,
> though it doesn't spell it out very explicitly.

Ah yes. I see that now. I missed it on my first read through.

> This makes brcond pretty painful to use and
> almost impossible to introduce into the middle
> of some existing sequence of generated code.
> I haven't looked at what the best way to do what
> you're trying to do here is, though.

Are there other examples of straight-line code being converted to a conditional I might be able to use as an example? I thought INTO would be a good example, but it merely calls a helper. Maybe I should do that? I assume it'll be slow, but speed isn't really my primary concern.

> By the way, don't do this:
> +    dc->A1 = tcg_temp_new();
> 
> The current use of a small number of tcg temps
> in the i386 translate.c code is an antipattern
> that is a relic from a very old version of the
> code. It's much better to simply create new
> temporaries in the code at the point where you
> need them and then free them once you're done.

Great, thanks. I saw both the A0/T0/T1 and the creation of new temporaries and I wasn't sure which pattern I should follow.

-- 
Stephen Checkoway


* Re: [Qemu-devel] x86 segment limits enforcement with TCG
From: Stephen Checkoway @ 2019-02-26  0:32 UTC
  To: Peter Maydell; +Cc: QEMU Developers

FWIW, I figured out an approach to this. Essentially, I'm reusing the function which computes linear addresses to enforce not only segment limits (in everything but long mode), but also read/write access (in protected mode).

Unfortunately, that meant every call to the linear address computation has to be augmented with an access size and whether it's a store or not. The patch is pretty large so I won't include it here, but you can view it at <https://github.com/stevecheckoway/qemu/commit/ac58652efacedc53f3831301ea0326ac8f882b18>.

If this is something that qemu would like, then I think two additional things are definitely required:
1. Tests. make check passes and the firmware I have which necessitated the checks appears to work, but that change touches almost every guest-memory-accessing x86 instruction.
2. This is going to slow down the common case—where segments have a 0 base address and a limit of 0xFFFFFFFF—and there's no real need to do that. It seems like something akin to addseg could be used to decide when to insert the checks. I don't really understand how that works and in my case, segments with nonzero bases and non-0xFFFFFFFF limits are the norm so I didn't investigate that. Something similar could probably be done to omit the writable segment checks.

Finally, there are some limitations. The amount of memory touched by xsave (and the related instructions) depends on edx:eax. I didn't bother checking that at all. Similarly, there are some MPX instructions that don't do any access checks when they really should. And finally, there's one place that checks for an access size of 8 bytes when, in some cases, it should be 16.

I'm happy to work to upstream this, if the approach is broadly acceptable and the functionality is desired.

If not, thank you Peter for answering my questions which enabled me to solve my problem.

Cheers,

Steve

-- 
Stephen Checkoway


* Re: [Qemu-devel] x86 segment limits enforcement with TCG
From: Richard Henderson @ 2019-02-26 16:56 UTC
  To: Stephen Checkoway, Peter Maydell; +Cc: QEMU Developers

On 2/25/19 4:32 PM, Stephen Checkoway wrote:
> FWIW, I figured out an approach to this. Essentially, I'm reusing the function which computes linear addresses to enforce not only segment limits (in everything but long mode), but also read/write access (in protected mode).
> 
> Unfortunately, that meant every call to the linear address computation has to be augmented with an access size and whether it's a store or not. The patch is pretty large so I won't include it here, but you can view it at <https://github.com/stevecheckoway/qemu/commit/ac58652efacedc53f3831301ea0326ac8f882b18>.
> 
> If this is something that qemu would like, then I think two additional things are definitely required:
> 1. Tests. make check passes and the firmware I have which necessitated the checks appears to work, but that change touches almost every guest-memory-accessing x86 instruction.
> 2. This is going to slow down the common case—where segments have a 0 base address and a limit of 0xFFFFFFFF—and there's no real need to do that. It seems like something akin to addseg could be used to decide when to insert the checks. I don't really understand how that works and in my case, segments with nonzero bases and non-0xFFFFFFFF limits are the norm so I didn't investigate that. Something similar could probably be done to omit the writable segment checks.
> 
> Finally, there are some limitations. The amount of memory touched by xsave (and the related instructions) depends on edx:eax. I didn't bother checking that at all. Similarly, there are some MPX instructions that don't do any access checks when they really should. And finally, there's one place that checks for an access size of 8 bytes when, in some cases, it should be 16.
> 
> I'm happy to work to upstream this, if the approach is broadly acceptable and the functionality is desired.

I am happy to have proper segmentation support upstream, but having read
through your patch I think I would approach it differently: I would incorporate
segmentation into the softmmu translation process.

Having many softmmu tlbs, even if unused, used to be expensive, and managing
them difficult.  However, some (very) recent work has reduced that expense.

I would add 6 new MMU_MODES, one for each segment register.  Translation for
these modes would proceed in two stages, just like real segmentation+paging.
So your access checks happen as a part of normal memory accesses.  (We have an
example of two-level translation in target/arm, S1_ptw_translate.)
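
Concretely, in target/i386/cpu.h that might look like this (the first
three indexes are the existing ones; the per-segment ones are invented
here just to show the shape):

    #define MMU_KSMAP_IDX    0
    #define MMU_USER_IDX     1
    #define MMU_KNOSMAP_IDX  2
    #define MMU_SEG_ES_IDX   3   /* hypothetical: one per segment register */
    #define MMU_SEG_CS_IDX   4
    #define MMU_SEG_SS_IDX   5
    #define MMU_SEG_DS_IDX   6
    #define MMU_SEG_FS_IDX   7
    #define MMU_SEG_GS_IDX   8
    #define NB_MMU_MODES     9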

These new tlbs would need to be flushed on any segment register change, and
with any change to the underlying page tables.  They would need to be flushed
on privilege level changes (or we'd need to add another 6 for ring0).

I would extend the check for HF_ADDSEG_MASK to include 4GB segment limits.
With that, "normal" 32-bit operation would ignore these new tlbs and continue
to use the current flat view of the virtual address space.
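
I.e. in cpu_x86_load_seg_cache, where ADDSEG is currently derived from
the DS/ES/SS bases alone, also demand 4GB limits -- along these lines
(untested):

    new_hflags |= ((env->segs[R_DS].base |
                    env->segs[R_ES].base |
                    env->segs[R_SS].base) != 0 ||
                   env->segs[R_DS].limit != 0xffffffff ||
                   env->segs[R_ES].limit != 0xffffffff ||
                   env->segs[R_SS].limit != 0xffffffff) << HF_ADDSEG_SHIFT;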

That should all mean no slow down in the common case, not having to adjust
every single memory access in target/i386/translate.c, and fewer runtime calls
to helper functions when segmentation is in effect.


r~


* Re: [Qemu-devel] x86 segment limits enforcement with TCG
From: Stephen Checkoway @ 2019-02-28 15:01 UTC
  To: Richard Henderson; +Cc: Peter Maydell, QEMU Developers

Apologies. I started writing this response several days ago but got busy.

> On Feb 26, 2019, at 11:56, Richard Henderson <richard.henderson@linaro.org> wrote:
> 
> I am happy to have proper segmentation support upstream, but having read
> through your patch I think I would approach it differently: I would incorporate
> segmentation into the softmmu translation process.

That's an interesting idea. Before I try that, I have a few questions.

I'm very new to this part of the code base. I'm not entirely sure how the softmmu translation process works. It looks like each target defines a number of MMU_MODES and each memory access TCG instruction (I'm not sure what these are called) encodes the MMU mode index. Then, I suppose the code generation takes a list of TCG instructions and when generating loads and stores, it takes the MMU mode index into account to perform virtual to physical translation. I'm not entirely sure where the code that does this lives, but is that broadly correct?

> Having many softmmu tlbs, even if unused, used to be expensive, and managing
> them difficult.  However, some (very) recent work has reduced that expense.
> 
> I would add 6 new MMU_MODES, one for each segment register.  Translation for
> these modes would proceed in two stages, just like real segmentation+paging.
> So your access checks happen as a part of normal memory accesses.  (We have an
> example of two-level translation in target/arm, S1_ptw_translate.)

So to take an example, movs es:[edi], ds:[esi] would generate a load using the DS MMU mode and a store using the ES MMU mode. Currently, a single mode appears to be used for all memory accesses and this mode is set in the initializer for the disassembly context, i386_tr_init_disas_context.

> These new tlbs would need to be flushed on any segment register change, and
> with any change to the underlying page tables.  They would need to be flushed
> on privilege level changes (or we'd need to add another 6 for ring0).

Are you thinking that this should be modeled as independent sets of TLBs, one per mode?

It seems easier to have a linear address MMU mode and then for the MMU modes corresponding to segment registers, perform an access and limit check, adjust the address by the segment base, and then go through the linear address MMU mode translation.

In particular, code that uses segments spends a lot of time changing the values of segment registers. E.g., in the movs example above, the ds segment may be overridden but the es segment cannot be, so to use the string move instructions within ds, es needs to be saved, modified, and then restored.

> I would extend the check for HF_ADDSEG_MASK to include 4GB segment limits.
> With that, "normal" 32-bit operation would ignore these new tlbs and continue
> to use the current flat view of the virtual address space.
> 
> That should all mean no slow down in the common case, not having to adjust
> every single memory access in target/i386/translate.c, and fewer runtime calls
> to helper functions when segmentation is in effect.

Modifying HF_ADDSEG_MASK to mean not just the base, but also the limit makes a lot of sense. I don't know what happens when these hidden flags get modified though. If a basic block is translated with ADDSEG not set, then a segment register is changed such that ADDSEG becomes set, will the previously translated basic block be retranslated in light of this change? I hope so, but I'm not sure how/where this happens.

Not having to adjust every single memory access in i386/translate.c would be fantastic. But I don't see a lot of great options for implementing this that don't require changing them all. Each memory access needs to know what segment register* (and thus which MMU mode) to use. So either every access needs to be adjusted to explicitly use the appropriate mode—taking overrides into account—or the lea functions need to set the appropriate mode for a subsequent access to use. The latter option means that there's an implicit dependency in the order of operations.

Returning to the movs example, the order of operations _must_ be
1. lea ds:[esi]
2. load 4 bytes
3. lea es:[edi]
4. store 4 bytes

Swapping the order of 2 and 3 is currently fine (as long as different temporaries are used for storing the results of the lea) but if the lea code is also setting the mode, then swapping the order would lead to loading 4 bytes from the wrong segment.

This approach seems pretty brittle and errors are likely going to be difficult to catch.

Do you have an approach in mind that avoids this difficulty and also doesn't modify every memory access?

* I believe LGDT and LIDT are the only two x86 instructions that use a linear address rather than a segment-relative one.

Finally, I think there's an issue with this approach when trying to store more than 4 or 8 bytes of data in a single operation. On a 32-bit host, MOVQ (the MMX instruction) is going to store 64-bits of data. If this store happens starting 4 bytes before the end of the segment, I believe this should either cause #GP(0) or #SS(0), depending on the segment. But if the 64-bit store is broken into two 32-bit stores, the first may succeed and the second fail, leading to an inconsistent state. Do you have any thoughts on how this should be handled?

Thank you,

Steve

-- 
Stephen Checkoway


* Re: [Qemu-devel] x86 segment limits enforcement with TCG
From: Richard Henderson @ 2019-02-28 16:11 UTC
  To: Stephen Checkoway; +Cc: Peter Maydell, QEMU Developers

On 2/28/19 7:01 AM, Stephen Checkoway wrote:
> I'm very new to this part of the code base. I'm not entirely sure how the
> softmmu translation process works. It looks like each target defines a
> number of MMU_MODES and each memory access TCG instruction (I'm not sure
> what these are called) encodes the MMU mode index. Then, I suppose the code
> generation takes a list of TCG instructions and when generating loads and
> stores, it takes the MMU mode index into account to perform virtual to
> physical translation. I'm not entirely sure where the code that does this
> lives, but is that broadly correct?
Yes.

We call them "tcg opcodes".

The "mmu mode index" indexes into an array of softmmu TLBs, defined in the
CPU_COMMON_TLB macro, in include/exec/cpu-defs.h.  This macro is included
within each CPUArchState (env) structure defined by target/$guest/cpu.h.

The "env pointer" is held in TCG_AREG0 (tcg/$host/tcg-target.h; %rbp for
x86_64), and we generate code to access cached values within that structure.
If we see an address for which we do not have a cached translation, we call
into one of a set of helper functions defined in accel/tcg/cputlb.c.

At the bottom of that set of helper functions, is the function "tlb_fill".
This function is defined somewhere in target/$guest/*.c.  Its job is to take
the virtual address, as presented by the guest instruction, plus the mmu mode
index, plus the operation (read/write/execute), then (1) verify that the
operation is allowed, and raise an exception if not; (2) translate to a
physical address, "page" size, and permissions (r/w/x) for that page.

Importantly, tlb_fill can do completely different things depending on the mmu
mode that is given.  So the existing indexes would continue to interpret flat
virtual addresses to physical addresses for ring3, ring0-smm, and ring0+smm.
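
For the new segment modes, tlb_fill would then grow a first stage along
these lines (sketch only; the *_seg helpers are invented for
illustration, and limit wraparound handling is omitted):

    void tlb_fill(CPUState *cs, target_ulong addr, int size,
                  MMUAccessType access_type, int mmu_idx, uintptr_t retaddr)
    {
        CPUX86State *env = &X86_CPU(cs)->env;

        if (mmu_idx_is_seg(mmu_idx)) {
            const SegmentCache *sc = &env->segs[mmu_idx_to_seg(mmu_idx)];
            /* stage 1: segmentation -- limit and rights checks */
            if (addr + size - 1 > sc->limit || !seg_rights_ok(sc, access_type)) {
                raise_exception_err_ra(env, EXCP0D_GPF, 0, retaddr);
            }
            addr += sc->base;   /* addr is now a linear address */
        }
        /* stage 2: the existing page-table walk on the linear address */
        if (x86_cpu_handle_mmu_fault(cs, addr, size, access_type, mmu_idx)) {
            raise_exception_err_ra(env, cs->exception_index,
                                   env->error_code, retaddr);
        }
    }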

>> I would add 6 new MMU_MODES, one for each segment register.  Translation for
>> these modes would proceed in two stages, just like real segmentation+paging.
>> So your access checks happen as a part of normal memory accesses.  (We have an
>> example of two-level translation in target/arm, S1_ptw_translate.)
> 
> So to take an example, movs es:[edi], ds:[esi] would generate a load using
> the DS MMU mode and a store using the ES MMU mode. Currently, a single mode
> appears to be used for all memory accesses and this mode is set in the
> initializer for the disassembly context, i386_tr_init_disas_context.
Correct.

> Are you thinking that this should be modeled as independent sets of TLBs, one per mode?

One per segment you mean?  Yes, exactly.  Since each segment can have
independent segment base + limit + permissions.  All of which would be taken
into account by tlb_fill when populating the TLB.

> It seems easier to have a linear address MMU mode and then for the MMU modes
> corresponding to segment registers, perform an access and limit check,
> adjust the address by the segment base, and then go through the linear
> address MMU mode translation.
Except you need to generate extra calls at runtime to perform this translation,
and you are not able to cache the result of the lookup against a second access
to the same page.

> In particular, code that uses segments spends a lot of time changing the
> values of segment registers. E.g., in the movs example above, the ds segment
> may be overridden but the es segment cannot be, so to use the string move
> instructions within ds, es needs to be saved, modified, and then restored.
You are correct that this would result in two TLB flushes.

But if MOVS executes a non-trivial number of iterations, we still may win.

The work that Emilio Cota has done in this development cycle to make the size
of the softmmu TLBs dynamic will help here.  It may well be that MOVS is used
with small memcpy, and there are a fair few flushes.  But in that case the TLB
will be kept very small, and so the flush will not be expensive.

On the other hand, DS changes are rare (depending on the programming model),
and SS changes only on context switches.  Their TLBs will keep their contents,
even while ES gets flushed.  Work has been saved over adding explicit calls to
a linear address helper function.

> Modifying HF_ADDSEG_MASK to mean not just the base, but also the limit makes
> a lot of sense. I don't know what happens when these hidden flags get
> modified though. If a basic block is translated with ADDSEG not set, then a
> segment register is changed such that ADDSEG becomes set, will the
> previously translated basic block be retranslated in light of this change? I
> hope so, but I'm not sure how/where this happens.
cpu_get_tb_cpu_state returns 3 values (pc, cs_base, flags) to
tb_lookup__cpu_state (in include/exec/tb-lookup.h).

These values are used as the key to a hash table to find a previously compiled
TranslationBlock that we want to execute.  If a bit in flags changes, a TB
without that bit will not match, which may lead to retranslation.

We can, of course, hold *both* translations, with and without bit X set in
flags, in the hash table at the same time, so any one execution may or may not
lead to actual retranslation.

BTW, while the name cs_base is in fact tied to its original (and current) usage
within target/i386/, it can be thought of as a second pointer-sized
target-specific value.  E.g. for target/sparc, to implement delay slots,
cs_base holds next_pc.  For target/hppa, cs_base holds the space register plus
an offset to next_pc.
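
For concreteness, the i386 version currently reads roughly:

    static inline void cpu_get_tb_cpu_state(CPUX86State *env, target_ulong *pc,
                                            target_ulong *cs_base,
                                            uint32_t *flags)
    {
        *cs_base = env->segs[R_CS].base;
        *pc = *cs_base + env->eip;
        *flags = env->hflags |
            (env->eflags & (IOPL_MASK | TF_MASK | RF_MASK | VM_MASK | AC_MASK));
    }

so any bit folded into hflags, such as ADDSEG, automatically
participates in the TB lookup.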

> Not having to adjust every single memory access in i386/translate.c would be
> fantastic. But I don't see a lot of great options for implementing this that
> don't require changing them all. Each memory access needs to know what
> segment register* (and thus which MMU mode) to use. So either every access
> needs to be adjusted to explicitly use the appropriate mode—taking overrides
> into account—or the lea functions need to set the appropriate mode for a
> subsequent access to use. The latter option means that there's an implicit
> dependency in the order of operations.
The vast majority of x86 instructions have exactly one memory access, and it
uses the default segment (ds/ss) or the segment override.  We can set this
default mmu index as soon as we have seen any segment override.
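
E.g. (sketch, using the invented MMU_SEG_*_IDX constants from earlier in
the thread; s->override is already maintained by the prefix decoder):

    /* mmu index for a data access through the effective segment */
    static int seg_mmu_idx(DisasContext *s, int def_seg)
    {
        int seg = s->override >= 0 ? s->override : def_seg;
        return MMU_SEG_ES_IDX + seg;   /* R_ES == 0, so indexes line up */
    }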

> Returning to the movs example, the order of operations _must_ be
> 1. lea ds:[esi]
> 2. load 4 bytes
> 3. lea es:[edi]
> 4. store 4 bytes

MOVS is one of the rare examples of two memory accesses within one instruction.
Yes, we would have to special case this, and be careful to get everything right.

> This approach seems pretty brittle and errors are likely going to be difficult to catch.

Because of the rarity of dual memory accesses, I do not think it will be that
brittle.

> Finally, I think there's an issue with this approach when trying to store
> more than 4 or 8 bytes of data in a single operation. On a 32-bit host, MOVQ
> (the MMX instruction) is going to store 64-bits of data. If this store
> happens starting 4 bytes before the end of the segment, I believe this
> should either case #GP(0) or #SS(0), depending on the segment. But if the
> 64-bit store is broken into two 32-bit stores, the first may succeed and the
> second fail, leading to an inconsistent state. Do you have any thoughts on
> how this should be handled?

For those specific examples, the correct solution is to use a 64-bit store,
even with a 32-bit host.  The access will be seen by accel/tcg/cputlb.c as one
64-bit operation, and will raise the appropriate exception before actually
performing the host-level store (which will be done by gcc, probably with two
32-bit insns).
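
I.e. the translator should emit something like (operand names loose):

    /* one guest-level 64-bit store; cputlb validates the whole access
       before any byte is written, even on a 32-bit host */
    tcg_gen_qemu_st_i64(val, addr, mem_idx, MO_LEQ);

rather than splitting it into two 32-bit stores itself.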

We do have an existing issue with larger stores, for vectors and such.  This
really needs to be handled more generally within accel/tcg/cputlb.c.
Unfortunately, the easiest way for me to do this is to completely drop support
for 32-bit hosts, and there's some resistance to that.  I'm not sure how to
approach this problem in a way that allows 32-bit hosts, so I've done nothing
so far.

We do have a helper function, probe_write, which will check to see if the whole
of a write operation would succeed, and raise an exception if not.  This is
currently only used in specific situations within one or two targets.  It is
not, IMO, a good replacement for incorporating proper support for 16-byte and
32-byte memory operations into accel/tcg/.


r~


* Re: [Qemu-devel] x86 segment limits enforcement with TCG
From: Stephen Checkoway @ 2019-02-28 17:18 UTC
  To: Richard Henderson; +Cc: Peter Maydell, QEMU Developers

This is all extremely helpful! I'll dig in and try this approach soon.

> On Feb 28, 2019, at 11:11, Richard Henderson <richard.henderson@linaro.org> wrote:
> 
>> Are you thinking that this should be modeled as independent sets of TLBs, one per mode?
> 
> One per segment you mean?

Yes.

>  Yes, exactly.  Since each segment can have
> independent segment base + limit + permissions.  All of which would be taken
> into account by tlb_fill when populating the TLB.
> 
>> It seems easier to have a linear address MMU mode and then for the MMU modes
>> corresponding to segment registers, perform an access and limit check,
>> adjust the address by the segment base, and then go through the linear
>> address MMU mode translation.
> Except you need to generate extra calls at runtime to perform this translation,
> and you are not able to cache the result of the lookup against a second access
> to the same page.

I see. That makes sense. I didn't realize the results of the calls were being cached.

> 
>> In particular, code that uses segments spends a lot of time changing the
>> values of segment registers. E.g., in the movs example above, the ds segment
>> may be overridden but the es segment cannot be, so to use the string move
>> instructions within ds, es needs to be saved, modified, and then restored.
> You are correct that this would result in two TLB flushes.
> 
> But if MOVS executes a non-trivial number of iterations, we still may win.
> 
> The work that Emilio Cota has done in this development cycle to make the size
> of the softmmu TLBs dynamic will help here.  It may well be that MOVS is used
> with small memcpy, and there are a fair few flushes.  But in that case the TLB
> will be kept very small, and so the flush will not be expensive.

I wonder if it would make sense to maintain a small cache of TLBs. The majority of cases are likely to involve setting segment registers to one of a handful of segments (e.g., setting es to ds or ss). So it might be nice to avoid the flushes entirely.

> On the other hand, DS changes are rare (depending on the programming model),
> and SS changes only on context switches.  Their TLBs will keep their contents,
> even while ES gets flushed.  Work has been saved over adding explicit calls to
> a linear address helper function.

In my case, ds changes are pretty frequent—I count 75 instances of mov ds, __ and 124 instances of pop ds—in the executive (ring 0) portion of this firmware. Obviously the dynamic count is more interesting, but I don't have that off-hand.

> The vast majority of x86 instructions have exactly one memory access, and it
> uses the default segment (ds/ss) or the segment override.  We can set this
> default mmu index as soon as we have seen any segment override.
> 
>> Returning to the movs example, the order of operations _must_ be
>> 1. lea ds:[esi]
>> 2. load 4 bytes
>> 3. lea es:[edi]
>> 4. store 4 bytes
> 
> MOVS is one of the rare examples of two memory accesses within one instruction.
> Yes, we would have to special case this, and be careful to get everything right.

I agree that the vast majority of x86 instructions access at most one segment, but off-hand, I can think of a handful that access two:

- movs 
- cmps
- push r/m32
- pop r/m32
- call m32
- call m16:m32

I'm not sure if there are others.

-- 
Stephen Checkoway


* Re: [Qemu-devel] x86 segment limits enforcement with TCG
From: Richard Henderson @ 2019-02-28 18:05 UTC
  To: Stephen Checkoway; +Cc: Peter Maydell, QEMU Developers, Emilio G. Cota

On 2/28/19 9:18 AM, Stephen Checkoway wrote:
> I wonder if it would make sense to maintain a small cache of TLBs. The
> majority of cases are likely to involve setting segment registers to one
> of a handful of segments (e.g., setting es to ds or ss). So it might be nice
> to avoid the flushes entirely.
Hmm.

The straightforward approach to this would change the mapping between segment
and mmu index, which would need to force a new translation (since mmu indexes
are built into the generated code as constants).  It would be easy for this
scheme to generate too many translations and slow down the system as a whole.

However, since the change to dynamic tlbs, the actual tlb is now a pointer.  So
it might not be out of the question to simply swap TLB contents around when
changing segment registers.  All you would need is N+1 tlbs to support the
(easy?) case of es swapping.

With some additional work in cputlb, it might even be possible to have
different mmu indexes share the same backing tlb.  This would be tricky to
manage during a tlb resize, but perhaps not impossible.

Emilio, do you have any thoughts here?


> I agree that the vast majority of x86 instructions access at most one
> segment, but off-hand, I can think of a handful that access two:
> 
> - movs 
> - cmps
> - push r/m32
> - pop r/m32
> - call m32
> - call m16:m32
> 
> I'm not sure if there are others.

Sure, but my point is that we're certainly talking about 10's not 1000's, which
is where we were when talking about every memory operation for every x86
instruction.


r~


* Re: [Qemu-devel] x86 segment limits enforcement with TCG
From: Emilio G. Cota @ 2019-03-07  1:49 UTC
  To: Richard Henderson; +Cc: Stephen Checkoway, Peter Maydell, QEMU Developers

On Thu, Feb 28, 2019 at 10:05:02 -0800, Richard Henderson wrote:
> On 2/28/19 9:18 AM, Stephen Checkoway wrote:
> > I wonder if it would make sense to maintain a small cache of TLBs. The
> > majority of cases are likely to involve setting segment registers to one
> > of a handful of segments (e.g., setting es to ds or ss). So it might be nice
> > to avoid the flushes entirely.
> Hmm.
> 
> The straightforward approach to this would change the mapping between segment
> and mmu index, which would need to force a new translation (since mmu indexes
> are built into the generated code as constants).  It would be easy for this
> scheme to generate too many translations and slow down the system as a whole.
> 
> However, since the change to dynamic tlbs, the actual tlb is now a pointer.  So
> it might not be out of the question to simply swap TLB contents around when
> changing segment registers.  All you would need is N+1 tlbs to support the
> (easy?) case of es swapping.
> 
> With some additional work in cputlb, it might even be possible to have
> different mmu indexes share the same backing tlb.  This would be tricky to
> manage during a tlb resize, but perhaps not impossible.
> 
> Emilio, do you have any thoughts here?

I like the approach you propose. If I understood correctly, we could:

- Let target code manipulate the pointer in TLB[mmu_idx]. This way the
  target can keep for each idx a set of tables, indexed by segment register
  value. The target would detect changes to segment registers and
  update the pointer in TLB[mmu_idx].

- Add a hook from cputlb to target-specific code, so that the latter
  is notified of flushes and can therefore flush the set of tables.
  Note that resizes only occur during flushes, which makes things
  simpler.

Seems like a reasonable amount of work and won't much affect the
common path (i.e. no segmentation) -- at worst it will add the overhead
of using helpers for a small number of instructions.
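
As a very rough sketch of the first point (every name below is invented;
nothing like it exists today):

    /* called when a segment register is loaded with a new selector */
    static void seg_tlb_switch(CPUX86State *env, int seg, uint16_t selector)
    {
        int mmu_idx = seg_to_mmu_idx(seg);
        CPUTLBTable *t = seg_tlb_cache_lookup(env, seg, selector);

        if (t != NULL) {
            tlb_set_table(env, mmu_idx, t);   /* reuse cached translations */
        } else {
            tlb_flush_by_mmuidx(CPU(x86_env_get_cpu(env)), 1 << mmu_idx);
        }
    }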

Thanks,

		Emilio

