From: "Alex Bennée" <alex.bennee@linaro.org>
To: "Emilio G. Cota" <cota@braap.org>
Cc: QEMU Developers <qemu-devel@nongnu.org>,
	MTTCG Devel <mttcg@listserver.greensocs.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Peter Crosthwaite <crosthwaite.peter@gmail.com>,
	Richard Henderson <rth@twiddle.net>,
	Sergey Fedorov <serge.fdrv@gmail.com>
Subject: Re: [Qemu-devel] [RFC v3] translate-all: protect code_gen_buffer with RCU
Date: Tue, 26 Apr 2016 07:32:39 +0100	[thread overview]
Message-ID: <87shy8ev7c.fsf@linaro.org> (raw)
In-Reply-To: <1461627983-32563-1-git-send-email-cota@braap.org>


Emilio G. Cota <cota@braap.org> writes:

> Context:
>   https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg04658.html
>   https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg06942.html
>
> This seems to half-work[*] although I'm uneasy about the whole idea.
> I see two major hurdles:
>
> * If the TB size is too small, this breaks badly, because we're not
>   out of the RCU read critical section when another call to tb_flush
>   is made. For instance, when booting ARM with this patch applied,
>   I need to at least pass -tb-size 10 for it to fully boot debian
>   jessie.
> * We have different tb_flush callers that should be audited:
>  $ git grep '\stb_flush(' | grep -v '//' | grep -v 'CPUState'
>  exec.c:            tb_flush(cpu);
>  gdbstub.c:        tb_flush(cpu);
>  gdbstub.c:    tb_flush(cpu);
>  hw/ppc/spapr_hcall.c:            tb_flush(CPU(cpu));
>  target-alpha/sys_helper.c:    tb_flush(CPU(alpha_env_get_cpu(env)));
>  target-i386/translate.c:        tb_flush(CPU(x86_env_get_cpu(env)));
>  translate-all.c:        tb_flush(cpu);
>
> With two code_gen "halves", if two tb_flush calls are done in the same
> RCU read critical section, we're screwed. I added a cpu_exit at the end
> of tb_flush to try to mitigate this, but I haven't audited all the callers
> (for instance, what does the gdbstub do?).

I'm not sure we are going to get much from this approach. A tb_flush is
a fairly rare occurrence; it's not as though it sits on the critical
performance path (although of course pathological cases are possible).

I still think there are possibilities in a smaller TranslationRegion
approach, but that is more aimed at solving problems like bulk
invalidation of a page of translations at a time and safer inter-block
patching. It doesn't do much to make tb_flush easier, though.

>
> If we end up having a mechanism to "stop all CPUs to do something", as
> I think we'll end up needing for correct LL/SC emulation, we'll probably
> be better off using that mechanism for tb_flush as well -- plus, we'll avoid
> wasting memory.

I'm fairly certain there will need to be a "stop everything" mode for
some things - I'm less certain of the best way of doing it. Did you get
a chance to look at my version of the async_safe_work mechanism?

>
> Other issues:
> - This could be split into at least 2 patches, one that touches
>   tcg/ and another to deal with translate-all.
>   Note that in translate-all, the initial allocation of code_gen doesn't
>   allocate extra space for the guard page; reserving guard page space is
>   done instead by the added split_code_gen_buffer function.
> - Windows: not even compile-tested.
>
> [*] "Seems to half-work". At least it boots ARM OK with MTTCG and -smp 4.
>     Alex' tests, however, sometimes fail with:
>
> Unhandled exception 3 (pabt)
> Exception frame registers:
> pc : [<fffffb44>]    lr : [<00000001>]    psr: 20000173
> sp : 4004f528  ip : 40012048  fp : 40032ca8
> r10: 40032ca8  r9 : 00000000  r8 : 00000000
> r7 : 0000000e  r6 : 40030000  r5 : 40032ca8  r4 : 00001ac6
> r3 : 40012030  r2 : 40012030  r1 : d5ffffe7  r0 : 00000028
> Flags: nzCv  IRQs on  FIQs off  Mode SVC_32
> Control: 00c5107d  Table: 40060000  DAC: 00000000
> IFAR: fffffb44    IFSR: 00000205
>
> or with:
>
> CPU0: 16986 irqs (0 races, 11 slow,  1322 ticks avg latency)
> FAIL: smc: irq: 17295 IRQs sent, 16986 received
>
> Unhandled exception 6 (irq)
> Exception frame registers:
> pc : [<00000020>]    lr : [<40010800>]    psr: 60000153
> sp : 400b45c0  ip : 400b34e8  fp : 40032ca8
> r10: 00000000  r9 : 00000000  r8 : 00000000
> r7 : 00000000  r6 : 00000000  r5 : 00000000  r4 : 00000000
> r3 : 00000000  r2 : 00000000  r1 : 000000ff  r0 : 00000000
> Flags: nZCv  IRQs on  FIQs off  Mode SVC_32
> Control: 00c5107d  Table: 40060000  DAC: 00000000
>
> I built with --enable-tcg-debug.

I'll have another go at reproducing. I could only get the asserts to
fire, which was odd because AFAICT the prologue generation that
triggered them was serialised with tb_lock.

>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  tcg/tcg.c       |  22 +++++---
>  tcg/tcg.h       |   1 +
>  translate-all.c | 155 ++++++++++++++++++++++++++++++++++++++++++++++----------
>  3 files changed, 144 insertions(+), 34 deletions(-)
>
> diff --git a/tcg/tcg.c b/tcg/tcg.c
> index b46bf1a..7db8ce9 100644
> --- a/tcg/tcg.c
> +++ b/tcg/tcg.c
> @@ -380,6 +380,20 @@ void tcg_context_init(TCGContext *s)
>      }
>  }
>
> +void tcg_code_gen_init(TCGContext *s, void *buf, size_t prologue_size)
> +{
> +    size_t size = s->code_gen_buffer_size - prologue_size;
> +
> +    s->code_gen_ptr = buf;
> +    s->code_gen_buffer = buf;
> +    s->code_buf = buf;
> +
> +    /* Compute a high-water mark, at which we voluntarily flush the buffer
> +       and start over.  The size here is arbitrary, significantly larger
> +       than we expect the code generation for any one opcode to require.  */
> +    s->code_gen_highwater = s->code_buf + (size - 1024);
> +}
> +
>  void tcg_prologue_init(TCGContext *s)
>  {
>      size_t prologue_size, total_size;
> @@ -398,16 +412,10 @@ void tcg_prologue_init(TCGContext *s)
>
>      /* Deduct the prologue from the buffer.  */
>      prologue_size = tcg_current_code_size(s);
> -    s->code_gen_ptr = buf1;
> -    s->code_gen_buffer = buf1;
> -    s->code_buf = buf1;
>      total_size = s->code_gen_buffer_size - prologue_size;
>      s->code_gen_buffer_size = total_size;
>
> -    /* Compute a high-water mark, at which we voluntarily flush the buffer
> -       and start over.  The size here is arbitrary, significantly larger
> -       than we expect the code generation for any one opcode to require.  */
> -    s->code_gen_highwater = s->code_gen_buffer + (total_size - 1024);
> +    tcg_code_gen_init(s, buf1, prologue_size);
>
>      tcg_register_jit(s->code_gen_buffer, total_size);
>
> diff --git a/tcg/tcg.h b/tcg/tcg.h
> index 40c8fbe..e849d3e 100644
> --- a/tcg/tcg.h
> +++ b/tcg/tcg.h
> @@ -634,6 +634,7 @@ static inline void *tcg_malloc(int size)
>
>  void tcg_context_init(TCGContext *s);
>  void tcg_prologue_init(TCGContext *s);
> +void tcg_code_gen_init(TCGContext *s, void *buf, size_t prologue_size);
>  void tcg_func_start(TCGContext *s);
>
>  int tcg_gen_code(TCGContext *s, TranslationBlock *tb);
> diff --git a/translate-all.c b/translate-all.c
> index bba9b62..7a83176 100644
> --- a/translate-all.c
> +++ b/translate-all.c
> @@ -485,6 +485,11 @@ static inline PageDesc *page_find(tb_page_addr_t index)
>    (DEFAULT_CODE_GEN_BUFFER_SIZE_1 < MAX_CODE_GEN_BUFFER_SIZE \
>     ? DEFAULT_CODE_GEN_BUFFER_SIZE_1 : MAX_CODE_GEN_BUFFER_SIZE)
>
> +static int code_gen_buf_mask;
> +static void *code_gen_buf1;
> +static void *code_gen_buf2;
> +static size_t code_gen_prologue_size;
> +
>  static inline size_t size_code_gen_buffer(size_t tb_size)
>  {
>      /* Size the buffer.  */
> @@ -508,6 +513,43 @@ static inline size_t size_code_gen_buffer(size_t tb_size)
>      return tb_size;
>  }
>
> +static void *split_code_gen_buffer(void *buf, size_t size)
> +{
> +    /* enforce alignment of the buffer to a page boundary */
> +    if (unlikely((uintptr_t)buf & ~qemu_real_host_page_mask)) {
> +        uintptr_t b;
> +
> +        b = QEMU_ALIGN_UP((uintptr_t)buf, qemu_real_host_page_size);
> +        size -= b - (uintptr_t)buf;
> +        buf = (void *)b;
> +    }
> +    /*
> +     * Make sure the size is an even number of pages so that both halves will
> +     * end on a page boundary.
> +     */
> +    size = QEMU_ALIGN_DOWN(size, 2 * qemu_real_host_page_size);
> +
> +    /* split in half, making room for 2 guard pages */
> +    size -= 2 * qemu_real_host_page_size;
> +    size /= 2;
> +    code_gen_buf1 = buf;
> +    code_gen_buf2 = buf + size + qemu_real_host_page_size;
> +
> +    /*
> +     * write the prologue into buf2. This is safe because we'll later call
> +     * tcg_prologue_init on buf1, from which we'll start execution.
> +     */
> +    tcg_ctx.code_gen_buffer = code_gen_buf2;
> +    tcg_prologue_init(&tcg_ctx);
> +    code_gen_prologue_size = (void *)tcg_ctx.code_ptr - code_gen_buf2;
> +
> +    tcg_ctx.code_gen_buffer_size = size;
> +
> +    /* start execution from buf1 */
> +    code_gen_buf_mask = 1;
> +    return code_gen_buf1;
> +}
> +
>  #ifdef __mips__
>  /* In order to use J and JAL within the code_gen_buffer, we require
>     that the buffer not cross a 256MB boundary.  */
> @@ -583,21 +625,17 @@ static inline void map_none(void *addr, long size)
>  static inline void *alloc_code_gen_buffer(void)
>  {
>      void *buf = static_code_gen_buffer;
> -    size_t full_size, size;
> +    size_t size;
>
>      /* The size of the buffer, rounded down to end on a page boundary.  */
> -    full_size = (((uintptr_t)buf + sizeof(static_code_gen_buffer))
> -                 & qemu_real_host_page_mask) - (uintptr_t)buf;
> -
> -    /* Reserve a guard page.  */
> -    size = full_size - qemu_real_host_page_size;
> +    size = (((uintptr_t)buf + sizeof(static_code_gen_buffer))
> +            & qemu_real_host_page_mask) - (uintptr_t)buf;
>
>      /* Honor a command-line option limiting the size of the buffer.  */
>      if (size > tcg_ctx.code_gen_buffer_size) {
>          size = (((uintptr_t)buf + tcg_ctx.code_gen_buffer_size)
>                  & qemu_real_host_page_mask) - (uintptr_t)buf;
>      }
> -    tcg_ctx.code_gen_buffer_size = size;
>
>  #ifdef __mips__
>      if (cross_256mb(buf, size)) {
> @@ -607,27 +645,40 @@ static inline void *alloc_code_gen_buffer(void)
>  #endif
>
>      map_exec(buf, size);
> -    map_none(buf + size, qemu_real_host_page_size);
>      qemu_madvise(buf, size, QEMU_MADV_HUGEPAGE);
>
> +    buf = split_code_gen_buffer(buf, size);
> +    size = tcg_ctx.code_gen_buffer_size;
> +
> +    /* page guards */
> +    map_none(code_gen_buf1 + size, qemu_real_host_page_size);
> +    map_none(code_gen_buf2 + size, qemu_real_host_page_size);
> +
>      return buf;
>  }
>  #elif defined(_WIN32)
>  static inline void *alloc_code_gen_buffer(void)
>  {
>      size_t size = tcg_ctx.code_gen_buffer_size;
> -    void *buf1, *buf2;
> +    void *buf;
> +    void *ret;
>
> -    /* Perform the allocation in two steps, so that the guard page
> -       is reserved but uncommitted.  */
> -    buf1 = VirtualAlloc(NULL, size + qemu_real_host_page_size,
> -                        MEM_RESERVE, PAGE_NOACCESS);
> -    if (buf1 != NULL) {
> -        buf2 = VirtualAlloc(buf1, size, MEM_COMMIT, PAGE_EXECUTE_READWRITE);
> -        assert(buf1 == buf2);
> +    /* Perform the allocation in two steps, so that the guard pages
> +       are reserved but uncommitted.  */
> +    ret = VirtualAlloc(NULL, size, MEM_RESERVE, PAGE_NOACCESS);
> +    if (ret == NULL) {
> +        return NULL;
>      }
>
> -    return buf1;
> +    ret = split_code_gen_buffer(ret, size);
> +    size = tcg_ctx.code_gen_buffer_size;
> +
> +    buf = VirtualAlloc(code_gen_buf1, size, MEM_COMMIT, PAGE_EXECUTE_READWRITE);
> +    assert(buf);
> +    buf = VirtualAlloc(code_gen_buf2, size, MEM_COMMIT, PAGE_EXECUTE_READWRITE);
> +    assert(buf);
> +
> +    return ret;
>  }
>  #else
>  static inline void *alloc_code_gen_buffer(void)
> @@ -665,8 +716,7 @@ static inline void *alloc_code_gen_buffer(void)
>  #  endif
>  # endif
>
> -    buf = mmap((void *)start, size + qemu_real_host_page_size,
> -               PROT_NONE, flags, -1, 0);
> +    buf = mmap((void *)start, size, PROT_NONE, flags, -1, 0);
>      if (buf == MAP_FAILED) {
>          return NULL;
>      }
> @@ -676,24 +726,24 @@ static inline void *alloc_code_gen_buffer(void)
>          /* Try again, with the original still mapped, to avoid re-acquiring
>             that 256mb crossing.  This time don't specify an address.  */
>          size_t size2;
> -        void *buf2 = mmap(NULL, size + qemu_real_host_page_size,
> -                          PROT_NONE, flags, -1, 0);
> +        void *buf2 = mmap(NULL, size, PROT_NONE, flags, -1, 0);
> +
>          switch (buf2 != MAP_FAILED) {
>          case 1:
>              if (!cross_256mb(buf2, size)) {
>                  /* Success!  Use the new buffer.  */
> -                munmap(buf, size + qemu_real_host_page_size);
> +                munmap(buf, size);
>                  break;
>              }
>              /* Failure.  Work with what we had.  */
> -            munmap(buf2, size + qemu_real_host_page_size);
> +            munmap(buf2, size);
>              /* fallthru */
>          default:
>              /* Split the original buffer.  Free the smaller half.  */
>              buf2 = split_cross_256mb(buf, size);
>              size2 = tcg_ctx.code_gen_buffer_size;
>              if (buf == buf2) {
> -                munmap(buf + size2 + qemu_real_host_page_size, size - size2);
> +                munmap(buf + size2, size - size2);
>              } else {
>                  munmap(buf, size - size2);
>              }
> @@ -704,13 +754,19 @@ static inline void *alloc_code_gen_buffer(void)
>      }
>  #endif
>
> -    /* Make the final buffer accessible.  The guard page at the end
> -       will remain inaccessible with PROT_NONE.  */
> +    /* Make the final buffer accessible. */
>      mprotect(buf, size, PROT_WRITE | PROT_READ | PROT_EXEC);
>
>      /* Request large pages for the buffer.  */
>      qemu_madvise(buf, size, QEMU_MADV_HUGEPAGE);
>
> +    buf = split_code_gen_buffer(buf + 1, size);
> +    size = tcg_ctx.code_gen_buffer_size;
> +
> +    /* page guards */
> +    mprotect(code_gen_buf1 + size, qemu_real_host_page_size, PROT_NONE);
> +    mprotect(code_gen_buf2 + size, qemu_real_host_page_size, PROT_NONE);
> +
>      return buf;
>  }
>  #endif /* USE_STATIC_CODE_GEN_BUFFER, WIN32, POSIX */
> @@ -829,10 +885,48 @@ static void page_flush_tb(void)
>      }
>  }
>
> +struct code_gen_desc {
> +    struct rcu_head rcu;
> +    int clear_bit;
> +};
> +
> +static void clear_code_gen_buffer(struct rcu_head *rcu)
> +{
> +    struct code_gen_desc *desc = container_of(rcu, struct code_gen_desc, rcu);
> +
> +    tb_lock();
> +    code_gen_buf_mask &= ~desc->clear_bit;
> +    tb_unlock();
> +    g_free(desc);
> +}
> +
> +static void *replace_code_gen_buffer(void)
> +{
> +    struct code_gen_desc *desc = g_malloc0(sizeof(*desc));
> +
> +    /*
> +     * If both bits are set, we're having two concurrent flushes. This
> +     * can easily happen if the buffers are heavily undersized.
> +     */
> +    assert(code_gen_buf_mask == 1 || code_gen_buf_mask == 2);
> +
> +    desc->clear_bit = code_gen_buf_mask;
> +    call_rcu1(&desc->rcu, clear_code_gen_buffer);
> +
> +    if (code_gen_buf_mask == 1) {
> +        code_gen_buf_mask |= 2;
> +        return code_gen_buf2 + code_gen_prologue_size;
> +    }
> +    code_gen_buf_mask |= 1;
> +    return code_gen_buf1 + code_gen_prologue_size;
> +}
> +
>  /* flush all the translation blocks */
>  /* XXX: tb_flush is currently not thread safe */
>  void tb_flush(CPUState *cpu)
>  {
> +    void *buf;
> +
>  #if defined(DEBUG_FLUSH)
>      printf("qemu: flush code_size=%ld nb_tbs=%d avg_tb_size=%ld\n",
>             (unsigned long)(tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer),
> @@ -853,10 +947,17 @@ void tb_flush(CPUState *cpu)
>      qht_reset_size(&tcg_ctx.tb_ctx.htable, CODE_GEN_HTABLE_SIZE);
>      page_flush_tb();
>
> -    tcg_ctx.code_gen_ptr = tcg_ctx.code_gen_buffer;
> +    buf = replace_code_gen_buffer();
> +    tcg_code_gen_init(&tcg_ctx, buf, code_gen_prologue_size);
> +
>      /* XXX: flush processor icache at this point if cache flush is
>         expensive */
>      tcg_ctx.tb_ctx.tb_flush_count++;
> +
> +    /* exit all CPUs so that the old buffer is quickly cleared */
> +    CPU_FOREACH(cpu) {
> +        cpu_exit(cpu);
> +    }
>  }
>
>  #ifdef DEBUG_TB_CHECK


--
Alex Bennée
