xen-devel.lists.xenproject.org archive mirror
* Event delivery and "domain blocking" on PVHv2
@ 2020-06-15 14:25 Martin Lucina
  2020-06-15 15:03 ` Roger Pau Monné
  2020-06-15 16:58 ` Andrew Cooper
  0 siblings, 2 replies; 24+ messages in thread
From: Martin Lucina @ 2020-06-15 14:25 UTC (permalink / raw)
  To: xen-devel, mirageos-devel

Hi,

puzzle time: In my continuing explorations of the PVHv2 ABIs for the new 
MirageOS Xen stack, I've run into some issues with what looks like 
missed deliveries of events on event channels.

While a simple unikernel that only uses the Xen console and effectively 
does for (1..5) { printf("foo"); sleep(1); } works fine, once I plug in 
the existing OCaml Xenstore and Netfront code, the behaviour I see is 
that the unikernel hangs in random places, blocking as if an event that 
should have been delivered has been missed.

Multiple runs of the unikernel have it block at different places during 
Netfront setup, and sometimes it will run as far as a fully setup 
Netfront, and then wait for network packets. However, even if it gets 
that far, packets are not actually being delivered:

Solo5: Xen console: port 0x2, ring @0x00000000FEFFF000
             |      ___|
   __|  _ \  |  _ \ __ \
\__ \ (   | | (   |  ) |
____/\___/ _|\___/____/
Solo5: Bindings version v0.6.5-6-gf4b47d11
Solo5: Memory map: 256 MB addressable:
Solo5:   reserved @ (0x0 - 0xfffff)
Solo5:       text @ (0x100000 - 0x28ffff)
Solo5:     rodata @ (0x290000 - 0x2e0fff)
Solo5:       data @ (0x2e1000 - 0x3fafff)
Solo5:       heap >= 0x3fb000 < stack < 0x10000000
gnttab_init(): pages=1 entries=256
2020-06-15 13:42:08 -00:00: INF [net-xen frontend] connect 0
>>>> Sometimes we hang here
2020-06-15 13:42:08 -00:00: INF [net-xen frontend] create: id=0 domid=0
2020-06-15 13:42:08 -00:00: INF [net-xen frontend]  sg:true 
gso_tcpv4:true rx_copy:true rx_flip:false smart_poll:false
2020-06-15 13:42:08 -00:00: INF [net-xen frontend] MAC: 
00:16:3e:30:49:52
>>>> Or here
gnttab_grant_access(): ref=0x8, domid=0x0, addr=0x8f9000, readonly=0
gnttab_grant_access(): ref=0x9, domid=0x0, addr=0x8fb000, readonly=0
evtchn_alloc_unbound(remote=0x0) = 0x4
2020-06-15 13:42:08 -00:00: INF [ethernet] Connected Ethernet interface 
00:16:3e:30:49:52
2020-06-15 13:42:08 -00:00: INF [ARP] Sending gratuitous ARP for 
10.0.0.2 (00:16:3e:30:49:52)
gnttab_grant_access(): ref=0xa, domid=0x0, addr=0x8fd000, readonly=1
2020-06-15 13:42:08 -00:00: INF [udp] UDP interface connected on 
10.0.0.2
2020-06-15 13:42:08 -00:00: INF [tcpip-stack-direct] stack assembled: 
mac=00:16:3e:30:49:52,ip=10.0.0.2
Gntref.get(): Waiting for free grant
Gntref.get(): Waiting for free grant
>>>> The above are also rather odd, but not related to event channel 
>>>> delivery, so one problem at a time...
>>>> Once we get this far, packets should be flowing but aren't (either 
>>>> way). However, Xenstore is obviously working, as we wouldn't get 
>>>> through Netfront setup without it.

Given that I've essentially re-written the low-level event channel C 
code, I'd like to verify that the mechanisms I'm using for event 
delivery are indeed the right thing to do on PVHv2.

For event delivery, I'm registering the upcall with Xen as follows:

     uint64_t val = 32ULL;
     val |= (uint64_t)HVM_PARAM_CALLBACK_TYPE_VECTOR << 56;
     int rc = hypercall_hvm_set_param(HVM_PARAM_CALLBACK_IRQ, val);
     assert(rc == 0);

i.e. upcalls are to be delivered via IDT vector.

Questions:

1. Being based on the Solo5 virtio code, the low-level setup code is 
doing the "usual" i8259 PIC setup, to remap the PIC IRQs to vectors 32 
and above. Should I be doing this initialisation for Xen PVH at all? I'm 
not interested in using the PIC for anything, and all interrupts will be 
delivered via Xen event channels.

2. Related to the above, the IRQ handler code is ACKing the interrupt 
after the handler runs. Should I be doing that? Does ACKing "IRQ" 0 on 
the PIC have any interactions with Xen's view of event channels/pending 
upcalls?

Next, for a PVHv2, uniprocessor only guest, is the following flow 
sufficient to unmask an event channel?

     struct shared_info *s = SHARED_INFO();
     int pending = 0;

     atomic_sync_btc(port, &s->evtchn_mask[0]);
     pending = sync_bt(port, &s->evtchn_mask[0]);
     if (pending) {
         /*
          * Slow path:
          *
          * If pending is set here, then there was a race, and we lost the
          * upcall.  Mask the port again and force an upcall via a call to
          * hyperspace.
          *
          * This should be sufficient for HVM/PVHv2 based on my understanding
          * of Linux drivers/xen/events/events_2l.c.
          */
         atomic_sync_bts(port, &s->evtchn_mask[0]);
         hypercall_evtchn_unmask(port);
     }

Lastly, the old PV-only Mini-OS based stack would do delays ("block the 
domain") by doing a HYPERVISOR_set_timer_op(deadline) followed by a 
HYPERVISOR_sched_op(SCHEDOP_block, 0). In the new code, I'm doing the 
following (based on what Mini-OS seems to be doing for HVM):

     solo5_time_t deadline = Int64_val(v_deadline);

     if (solo5_clock_monotonic() < deadline) {
         hypercall_set_timer_op(deadline);
         __asm__ __volatile__ ("hlt" : : : "memory");
         /* XXX: cancel timer_op here if woken up early? */
     }

Again, is this the right thing to do for PVH?

As the comment says, do I need to cancel the timer_op? I understood the 
semantics to be "fire once at/after the time deadline is reached", if 
that is indeed the case then with my current VIRQ_TIMER handler which 
does nothing in the interrupt context and has no side effects I should 
be fine.

I can also post the code that does the actual demuxing of events from 
Xen (i.e. the upcall handler), but I've read through that several times 
now and I don't think the problem is there; adding diagnostic prints to 
both the low-level C event channel code and higher-level OCaml 
Activations code confirms that received events are being mapped to their 
ports correctly.

Any advice much appreciated,

Thanks,

Martin



* Re: Event delivery and "domain blocking" on PVHv2
  2020-06-15 14:25 Event delivery and "domain blocking" on PVHv2 Martin Lucina
@ 2020-06-15 15:03 ` Roger Pau Monné
  2020-06-15 15:51   ` Martin Lucina
  2020-06-15 16:58 ` Andrew Cooper
  1 sibling, 1 reply; 24+ messages in thread
From: Roger Pau Monné @ 2020-06-15 15:03 UTC (permalink / raw)
  To: xen-devel, mirageos-devel, martin

On Mon, Jun 15, 2020 at 04:25:57PM +0200, Martin Lucina wrote:
> Hi,
> 
> puzzle time: In my continuing explorations of the PVHv2 ABIs for the new
> MirageOS Xen stack, I've run into some issues with what looks like missed
> deliveries of events on event channels.

That would be weird, as other OSes using PVHv2 seem to be fine.

> While a simple unikernel that only uses the Xen console and effectively does
> for (1..5) { printf("foo"); sleep(1); } works fine, once I plug in the
> existing OCaml Xenstore and Netfront code, the behaviour I see is that the
> unikernel hangs in random places, blocking as if an event that should have
> been delivered has been missed.
> 
> Multiple runs of the unikernel have it block at different places during
> Netfront setup, and sometimes it will run as far as a fully setup Netfront,
> and then wait for network packets. However, even if it gets that far,
> packets are not actually being delivered:
> 
> Solo5: Xen console: port 0x2, ring @0x00000000FEFFF000
>             |      ___|
>   __|  _ \  |  _ \ __ \
> \__ \ (   | | (   |  ) |
> ____/\___/ _|\___/____/
> Solo5: Bindings version v0.6.5-6-gf4b47d11
> Solo5: Memory map: 256 MB addressable:
> Solo5:   reserved @ (0x0 - 0xfffff)
> Solo5:       text @ (0x100000 - 0x28ffff)
> Solo5:     rodata @ (0x290000 - 0x2e0fff)
> Solo5:       data @ (0x2e1000 - 0x3fafff)
> Solo5:       heap >= 0x3fb000 < stack < 0x10000000
> gnttab_init(): pages=1 entries=256
> 2020-06-15 13:42:08 -00:00: INF [net-xen frontend] connect 0
> > > > > Sometimes we hang here
> 2020-06-15 13:42:08 -00:00: INF [net-xen frontend] create: id=0 domid=0
> 2020-06-15 13:42:08 -00:00: INF [net-xen frontend]  sg:true gso_tcpv4:true
> rx_copy:true rx_flip:false smart_poll:false
> 2020-06-15 13:42:08 -00:00: INF [net-xen frontend] MAC: 00:16:3e:30:49:52
> > > > > Or here
> gnttab_grant_access(): ref=0x8, domid=0x0, addr=0x8f9000, readonly=0
> gnttab_grant_access(): ref=0x9, domid=0x0, addr=0x8fb000, readonly=0
> evtchn_alloc_unbound(remote=0x0) = 0x4
> 2020-06-15 13:42:08 -00:00: INF [ethernet] Connected Ethernet interface
> 00:16:3e:30:49:52
> 2020-06-15 13:42:08 -00:00: INF [ARP] Sending gratuitous ARP for 10.0.0.2
> (00:16:3e:30:49:52)
> gnttab_grant_access(): ref=0xa, domid=0x0, addr=0x8fd000, readonly=1
> 2020-06-15 13:42:08 -00:00: INF [udp] UDP interface connected on 10.0.0.2
> 2020-06-15 13:42:08 -00:00: INF [tcpip-stack-direct] stack assembled:
> mac=00:16:3e:30:49:52,ip=10.0.0.2
> Gntref.get(): Waiting for free grant
> Gntref.get(): Waiting for free grant
> > > > > The above are also rather odd, but not related to event
> > > > > channel delivery, so one problem at a time...
> > > > > Once we get this far, packets should be flowing but aren't
> > > > > (either way). However, Xenstore is obviously working, as we
> > > > > wouldn't get through Netfront setup without it.
> 
> Given that I've essentially re-written the low-level event channel C code,
> I'd like to verify that the mechanisms I'm using for event delivery are
> indeed the right thing to do on PVHv2.
> 
> For event delivery, I'm registering the upcall with Xen as follows:
> 
>     uint64_t val = 32ULL;
>     val |= (uint64_t)HVM_PARAM_CALLBACK_TYPE_VECTOR << 56;
>     int rc = hypercall_hvm_set_param(HVM_PARAM_CALLBACK_IRQ, val);
>     assert(rc == 0);
> 
> i.e. upcalls are to be delivered via IDT vector.

This way of event channel injection is slightly hackish, and I would
recommend using HVMOP_set_evtchn_upcall_vector, that way vectors will
be properly routed using the lapic.

Using HVM_PARAM_CALLBACK_TYPE_VECTOR, vectors are injected without
setting the IRR/ISR bit in the lapic registers.

> Questions:
> 
> 1. Being based on the Solo5 virtio code, the low-level setup code is doing
> the "usual" i8259 PIC setup, to remap the PIC IRQs to vectors 32 and above.
> Should I be doing this initialisation for Xen PVH at all?

Hm, there are no IO-APICs (or legacy PICs) on a PVH domU, so there's
not much to route. If Solo5 is thinking it's somehow configuring them
it's likely writing to some hole in memory, or to some RAM.

IO-APIC presence is signaled on the ACPI MADT table on PVH domU.

> I'm not interested
> in using the PIC for anything, and all interrupts will be delivered via Xen
> event channels.
> 
> 2. Related to the above, the IRQ handler code is ACKing the interrupt after
> the handler runs. Should I be doing that? Does ACKing "IRQ" 0 on the PIC
> have any interactions with Xen's view of event channels/pending upcalls?

Which kind of ACking it's doing? Is it writing to the lapic EOI
register? If so that would be OK when using
HVMOP_set_evtchn_upcall_vector. If using
HVM_PARAM_CALLBACK_TYPE_VECTOR there's nothing to Ack I think.

> Next, for a PVHv2, uniprocessor only guest, is the following flow sufficient
> to unmask an event channel?
> 
>     struct shared_info *s = SHARED_INFO();
>     int pending = 0;
> 
>     atomic_sync_btc(port, &s->evtchn_mask[0]);
>     pending = sync_bt(port, &s->evtchn_mask[0]);

You should check for pending interrupts on evtchn_pending, not
evtchn_mask.

>     if (pending) {
>         /*
>          * Slow path:
>          *
>          * If pending is set here, then there was a race, and we lost the
>          * upcall.  Mask the port again and force an upcall via a call to
>          * hyperspace.
>          *
>          * This should be sufficient for HVM/PVHv2 based on my understanding
>          * of Linux drivers/xen/events/events_2l.c.
>          */
>         atomic_sync_bts(port, &s->evtchn_mask[0]);
>         hypercall_evtchn_unmask(port);
>     }

FWIW, I use the hypercall unconditionally on FreeBSD because I didn't
see a performance difference when compared to this method.
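
Put together, a corrected unmask along those lines could look like this
(just a sketch reusing the helpers from your snippet, with the pending
check fixed; I haven't tried it):

    static void evtchn_unmask(evtchn_port_t port)
    {
        struct shared_info *s = SHARED_INFO();

        /* Clear our mask bit first... */
        atomic_sync_btc(port, &s->evtchn_mask[0]);
        /* ...then look at evtchn_pending, not evtchn_mask. */
        if (sync_bt(port, &s->evtchn_pending[0])) {
            /*
             * An event arrived while the port was masked, so the upcall
             * was lost.  Re-mask and ask Xen to re-inject it.
             */
            atomic_sync_bts(port, &s->evtchn_mask[0]);
            hypercall_evtchn_unmask(port);
        }
    }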

> Lastly, the old PV-only Mini-OS based stack would do delays ("block the
> domain") by doing a HYPERVISOR_set_timer_op(deadline) followed by a
> HYPERVISOR_sched_op(SCHEDOP_block, 0). In the new code, I'm doing the
> following (based on what Mini-OS seems to be doing for HVM):
> 
>     solo5_time_t deadline = Int64_val(v_deadline);
> 
>     if (solo5_clock_monotonic() < deadline) {
>         hypercall_set_timer_op(deadline);
>         __asm__ __volatile__ ("hlt" : : : "memory");
>         /* XXX: cancel timer_op here if woken up early? */
>     }
> 
> Again, is this the right thing to do for PVH?

hlt will trap into the hypervisor, so it's fine to use.

> As the comment says, do I need to cancel the timer_op?

I'm not sure. Keep in mind that a new call to hypercall_set_timer_op
will overwrite the previous timer, and hence should be fine I think as
long as you are using the one-shot timer.

> I understood the
> semantics to be "fire once at/after the time deadline is reached", if that
> is indeed the case then with my current VIRQ_TIMER handler which does
> nothing in the interrupt context and has no side effects I should be fine.

I have no idea how Solo5 works, maybe you should re-set the timer to
the next deadline in the handler?

Or that's fine because the timer is always set before blocking.

> I can also post the code that does the actual demuxing of events from Xen
> (i.e. the upcall handler), but I've read through that several times now and
> I don't think the problem is there; adding diagnostic prints to both the
> low-level C event channel code and higher-level OCaml Activations code
> confirms that received events are being mapped to their ports correctly.

I can take a look at the event channel handler if you want, as you
wish. The only weird thing I've noticed is the error in the unmask
where you seem to use evtchn_mask instead of evtchn_pending.

I would also recommend using HVMOP_set_evtchn_upcall_vector instead of
HVM_PARAM_CALLBACK_TYPE_VECTOR. In order to make some tools happy just
set HVM_PARAM_CALLBACK_IRQ to 1. You can take a look at how
Xen does it when running as a guest:

http://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/x86/guest/xen/xen.c;h=2ff63d370a8a12fef166677e2ded7ed82a628ce8;hb=HEAD#l205
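
For reference, the registration itself could look roughly like this (only
a sketch; hypercall_hvm_op() stands for whatever wrapper you have around
HYPERVISOR_hvm_op, and the argument struct comes from the public
xen/hvm/hvm_op.h header):

    struct xen_hvm_evtchn_upcall_vector upcall = {
        .vcpu = 0,      /* VCPU 0 on a uniprocessor guest */
        .vector = 32,   /* the IDT vector you want the upcall delivered on */
    };
    int rc = hypercall_hvm_op(HVMOP_set_evtchn_upcall_vector, &upcall);
    assert(rc == 0);
    /* Make the toolstack believe a callback is set up, as the code above does. */
    rc = hypercall_hvm_set_param(HVM_PARAM_CALLBACK_IRQ, 1);
    assert(rc == 0);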

Roger.



* Re: Event delivery and "domain blocking" on PVHv2
  2020-06-15 15:03 ` Roger Pau Monné
@ 2020-06-15 15:51   ` Martin Lucina
  2020-06-15 16:52     ` Roger Pau Monné
  0 siblings, 1 reply; 24+ messages in thread
From: Martin Lucina @ 2020-06-15 15:51 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, mirageos-devel

On 2020-06-15 17:03, Roger Pau Monné wrote:
> This way of event channel injection is slightly hackish, and I would
> recommend using HVMOP_set_evtchn_upcall_vector, that way vectors will
> be properly routed using the lapic.
> 
> Using HVM_PARAM_CALLBACK_TYPE_VECTOR, vectors are injected without
> setting the IRR/ISR bit in the lapic registers.

I picked HVM_PARAM_CALLBACK_TYPE_VECTOR since that seemed like the 
easiest option for a uniprocessor, PVH-only guest.

What does <vector> mean in the context of 
HVMOP_set_evtchn_upcall_vector? If it's an APIC vector number (sorry, 
not too familiar with the post-legacy i8259 world), does that imply that 
if I use HVMOP_set_evtchn_upcall_vector I need to do some APIC 
initialisation / programming?

All I need for Solo5/Mirage purposes is to have the upcall land on IDT 
vector #32 or above.

> 
>> Questions:
>> 
>> 1. Being based on the Solo5 virtio code, the low-level setup code is 
>> doing
>> the "usual" i8259 PIC setup, to remap the PIC IRQs to vectors 32 and 
>> above.
>> Should I be doing this initialisation for Xen PVH at all?
> 
> Hm, there are no IO-APICs (or legacy PICs) on a PVH domU, so there's
> not much to route. If Solo5 is thinking it's somehow configuring them
> it's likely writing to some hole in memory, or to some RAM.

Solo5 only has a very primitive understanding of hardware interrupts, so 
it's only configuring the legacy PICs via port IO.

> 
> IO-APIC presence is signaled on the ACPI MADT table on PVH domU.
> 

Hmm, it'd be very unfortunate if I had to deal with ACPI, so here's 
hoping that I don't actually need to touch the APIC.

>> I'm not interested
>> in using the PIC for anything, and all interrupts will be delivered 
>> via Xen
>> event channels.
>> 
>> 2. Related to the above, the IRQ handler code is ACKing the interrupt 
>> after
>> the handler runs. Should I be doing that? Does ACKing "IRQ" 0 on the 
>> PIC
>> have any interactions with Xen's view of event channels/pending 
>> upcalls?
> 
> Which kind of ACking it's doing? Is it writing to the lapic EOI
> register? If so that would be OK when using
> HVMOP_set_evtchn_upcall_vector. If using
> HVM_PARAM_CALLBACK_TYPE_VECTOR there's nothing to Ack I think.

Legacy PIC EOI via port IO.

> 
>> Next, for a PVHv2, uniprocessor only guest, is the following flow 
>> sufficient
>> to unmask an event channel?
>> 
>>     struct shared_info *s = SHARED_INFO();
>>     int pending = 0;
>> 
>>     atomic_sync_btc(port, &s->evtchn_mask[0]);
>>     pending = sync_bt(port, &s->evtchn_mask[0]);
> 
> You should check for pending interrupts on evtchn_pending, not
> evtchn_mask.

Ah, thanks for spotting that! Fixed, but just that change by itself does 
not seem to change the observed behaviour in any way.

> 
>>     if (pending) {
>>         /*
>>          * Slow path:
>>          *
>>          * If pending is set here, then there was a race, and we lost the
>>          * upcall.  Mask the port again and force an upcall via a call to
>>          * hyperspace.
>>          *
>>          * This should be sufficient for HVM/PVHv2 based on my understanding
>>          * of Linux drivers/xen/events/events_2l.c.
>>          */
>>         atomic_sync_bts(port, &s->evtchn_mask[0]);
>>         hypercall_evtchn_unmask(port);
>>     }
> 
> FWIW, I use the hypercall unconditionally on FreeBSD because I didn't
> see a performance difference when compared to this method.
> 
>> Lastly, the old PV-only Mini-OS based stack would do delays ("block 
>> the
>> domain") by doing a HYPERVISOR_set_timer_op(deadline) followed by a
>> HYPERVISOR_sched_op(SCHEDOP_block, 0). In the new code, I'm doing the
>> following (based on what Mini-OS seems to be doing for HVM):
>> 
>>     solo5_time_t deadline = Int64_val(v_deadline);
>> 
>>     if (solo5_clock_monotonic() < deadline) {
>>         hypercall_set_timer_op(deadline);
>>         __asm__ __volatile__ ("hlt" : : : "memory");
>>         /* XXX: cancel timer_op here if woken up early? */
>>     }
>> 
>> Again, is this the right thing to do for PVH?
> 
> hlt will trap into the hypervisor, so it's fine to use.
> 

Thanks for confirming.

>> As the comment says, do I need to cancel the timer_op?
> 
> I'm not sure. Keep in mind that a new call to hypercall_set_timer_op
> will overwrite the previous timer, and hence should be fine I think as
> long as you are using the one-shot timer.

Is there something other than a one-shot timer? hypercall_set_timer_op 
is just a single-argument hypercall with a uint64_t deadline, right? 
(And not documented in xen.h either ...)

> 
>> I understood the
>> semantics to be "fire once at/after the time deadline is reached", if 
>> that
>> is indeed the case then with my current VIRQ_TIMER handler which does
>> nothing in the interrupt context and has no side effects I should be 
>> fine.
> 
> I have no idea how Solo5 works, maybe you should re-set the timer to
> the next deadline in the handler?
> 
> Or that's fine because the timer is always set before blocking.

Yes, it's always set before blocking, and we always unblock after the 
first interrupt (i.e. some event) is received, so should be fine.

> 
>> I can also post the code that does the actual demuxing of events from 
>> Xen
>> (i.e. the upcall handler), but I've read through that several times 
>> now and
>> I don't think the problem is there; adding diagnostic prints to both 
>> the
>> low-level C event channel code and higher-level OCaml Activations code
>> confirms that received events are being mapped to their ports 
>> correctly.
> 
> I can take a look at the event channel handler if you want, as you
> wish. The only weird thing I've noticed is the error in the unmask
> where you seem to use evtchn_mask instead of evtchn_pending.

Thanks for the offer, this stuff is fairly subtle.

In the Mirage/Xen scenario, there are two parts to the upcall handler. 
The top half is executed in interrupt context and basically does nothing 
except acknowledge the upcall:

int solo5__xen_evtchn_vector_handler(void *arg __attribute__((unused)))
{
     struct vcpu_info *vi = VCPU0_INFO();

     /*
      * All work is done outside of interrupt context by
      * evtchn_demux_pending(), so all we need to do here is acknowledge
      * the upcall from Xen.
      */
     vi->evtchn_upcall_pending = 0;
     return 1;
}

The bottom half is then called periodically (and always before blocking) 
by the OCaml code:

static bool evtchn_demux_pending(void)
{
     struct shared_info *s = SHARED_INFO();
     struct vcpu_info *vi = VCPU0_INFO();
     bool some_pending = false;

     vi->evtchn_upcall_pending = 0;

     /*
      * Demux events received from Xen.
      *
      * pending_l1 is the "outer" per-VCPU selector (evtchn_pending_sel).
      * pending_l2 is the "inner" system-wide word (evtchn_pending[l1i]).
      */
     xen_ulong_t pending_l1, pending_l2;
     atomic_sync_xchg(&vi->evtchn_pending_sel, 0, &pending_l1);
     while (pending_l1 != 0) {
         xen_ulong_t l1i = ffs(pending_l1);

         /*
          * Masking pending_l2 with ~evtchn_mask[l1i] is necessary to get the
          * *current* masked events; otherwise races like the following
          * can occur:
          *
          *     1. X is generated, upcall scheduled by Xen.
          *     2. X is masked.
          *     3. Upcall is delivered.
          *     4. X fires despite now being masked.
          */
         while ((pending_l2 =
                     (s->evtchn_pending[l1i] & ~s->evtchn_mask[l1i])) != 0) {
             xen_ulong_t l2i = ffs(pending_l2);

             evtchn_port_t port = (l1i * (sizeof(xen_ulong_t) * 8)) + l2i;
             /*
              * Mark as pending on the OCaml side.
              */
             evtchn_callback_ml[port] = 1;
             some_pending = true;

             atomic_sync_btc(l2i, &s->evtchn_pending[l1i]);
         }

         pending_l1 &= ~(1UL << l1i);
     }

     return some_pending;
}

> 
> I would also recommend using HVMOP_set_evtchn_upcall_vector instead of
> HVM_PARAM_CALLBACK_TYPE_VECTOR. In order to make some tools happy just
> set HVM_PARAM_CALLBACK_IRQ to 1. You can take a look at how
> Xen does it when running as a guest:
> 
> http://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/x86/guest/xen/xen.c;h=2ff63d370a8a12fef166677e2ded7ed82a628ce8;hb=HEAD#l205

Thanks for the pointer. As I write above, if I can use 
HVMOP_set_evtchn_upcall_vector w/o doing too much "extra work" on the 
guest side then I will go with that.

> 
> Roger.

-mato



* Re: Event delivery and "domain blocking" on PVHv2
  2020-06-15 15:51   ` Martin Lucina
@ 2020-06-15 16:52     ` Roger Pau Monné
  0 siblings, 0 replies; 24+ messages in thread
From: Roger Pau Monné @ 2020-06-15 16:52 UTC (permalink / raw)
  To: Martin Lucina; +Cc: xen-devel, mirageos-devel

On Mon, Jun 15, 2020 at 05:51:51PM +0200, Martin Lucina wrote:
> On 2020-06-15 17:03, Roger Pau Monné wrote:
> > This way of event channel injection is slightly hackish, and I would
> > recommend using HVMOP_set_evtchn_upcall_vector, that way vectors will
> > be properly routed using the lapic.
> > 
> > Using HVM_PARAM_CALLBACK_TYPE_VECTOR, vectors are injected without
> > setting the IRR/ISR bit in the lapic registers.
> 
> I picked HVM_PARAM_CALLBACK_TYPE_VECTOR since that seemed like the easiest
> option for a uniprocessor, PVH-only guest.
> 
> What does <vector> mean in the context of HVMOP_set_evtchn_upcall_vector? If
> it's an APIC vector number (sorry, not too familiar with the post-legacy
> i8259 world), does that imply that if I use HVMOP_set_evtchn_upcall_vector I
> need to do some APIC initialisation / programming?
> 
> All I need for Solo5/Mirage purposes is to have the upcall land on IDT
> vector #32 or above.

Oh, OK. It was reported that HVM_PARAM_CALLBACK_TYPE_VECTOR doesn't
work well with migration and with hardware APIC virtualization IIRC,
because interrupts are not signaled in the lapic and are instead
injected directly.

> > > Questions:
> > > 
> > > 1. Being based on the Solo5 virtio code, the low-level setup code is
> > > doing
> > > the "usual" i8259 PIC setup, to remap the PIC IRQs to vectors 32 and
> > > above.
> > > Should I be doing this initialisation for Xen PVH at all?
> > 
> > Hm, there are no IO-APICs (or legacy PICs) on a PVH domU, so there's
> > not much to route. If Solo5 is thinking it's somehow configuring them
> > it's likely writing to some hole in memory, or to some RAM.
> 
> Solo5 only has a very primitive understanding of hardware interrupts, so
> it's only configuring the legacy PICs via port IO.

Oh, then there's definitely no legacy PIC at all. Writes to the PIC IO
ports will be dropped/ignored, and reads will return ~0. I guess you
could implement some check based on that in order to avoid doing any
initialization?

I don't think it should be harmful in any way, but you are just likely
spending a bunch of time trapping into the hypervisor to handle those
reads/writes for no good reason.
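
In case you do want such a probe, something along these lines is what I
had in mind (untested sketch, assuming outb(port, value)/inb(port) style
helpers):

    static bool i8259_present(void)
    {
        /* Write a distinctive mask to the master PIC's IMR (port 0x21). */
        outb(0x21, 0xfb);
        /*
         * A real PIC latches and returns the mask; on PVH there's nothing
         * behind the port and the read comes back as all-ones.
         */
        return inb(0x21) == 0xfb;
    }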

> > 
> > IO-APIC presence is signaled on the ACPI MADT table on PVH domU.
> > 
> 
> Hmm, it'd be very unfortunate if I had to deal with ACPI, so here's hoping
> that I don't actually need to touch the APIC.

If you don't do any IO-APIC configuration at all then that's
completely fine.

> > > I'm not interested
> > > in using the PIC for anything, and all interrupts will be delivered
> > > via Xen
> > > event channels.
> > > 
> > > 2. Related to the above, the IRQ handler code is ACKing the
> > > interrupt after
> > > the handler runs. Should I be doing that? Does ACKing "IRQ" 0 on the
> > > PIC
> > > have any interactions with Xen's view of event channels/pending
> > > upcalls?
> > 
> > Which kind of ACking it's doing? Is it writing to the lapic EOI
> > register? If so that would be OK when using
> > HVMOP_set_evtchn_upcall_vector. If using
> > HVM_PARAM_CALLBACK_TYPE_VECTOR there's nothing to Ack I think.
> 
> Legacy PIC EOI via port IO.

No need to do that. There's no legacy PIC on PVH, and then event
channel interrupts are not routed through the PIC when using
HVM_PARAM_CALLBACK_TYPE_VECTOR.

> > I'm not sure. Keep in mind that a new call to hypercall_set_timer_op
> > will overwrite the previous timer, and hence should be fine I think as
> > long as you are using the one-shot timer.
> 
> Is there something other than a one-shot timer? hypercall_set_timer_op is
> just a single-argument hypercall with a uint64_t deadline, right? (And not
> documented in xen.h either ...)

There's also a periodic timer (see VCPUOP_{set/stop}_periodic_timer),
which was enabled by default at 100Hz for PV guests. This is not the
case for PVH.
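
For completeness, if you ever did want the periodic timer on PVH you
would have to enable it explicitly, roughly like this (sketch, assuming
a hypercall_vcpu_op() wrapper around HYPERVISOR_vcpu_op; the struct is
in the public xen/vcpu.h header):

    struct vcpu_set_periodic_timer t = {
        .period_ns = 10000000ULL,   /* 100Hz, i.e. a tick every 10ms */
    };
    int rc = hypercall_vcpu_op(VCPUOP_set_periodic_timer, 0 /* vcpu */, &t);
    /* VCPUOP_stop_periodic_timer turns it back off. */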

> > > I can also post the code that does the actual demuxing of events
> > > from Xen
> > > (i.e. the upcall handler), but I've read through that several times
> > > now and
> > > I don't think the problem is there; adding diagnostic prints to both
> > > the
> > > low-level C event channel code and higher-level OCaml Activations code
> > > confirms that received events are being mapped to their ports
> > > correctly.
> > 
> > I can take a look at the event channel handler if you want, as you
> > wish. The only weird thing I've noticed is the error in the unmask
> > where you seem to use evtchn_mask instead of evtchn_pending.
> 
> Thanks for the offer, this stuff is fairly subtle.
> 
> In the Mirage/Xen scenario, there are two parts to the upcall handler. The
> top half is executed in interrupt context and basically does nothing except
> acknowledge the upcall:
> 
> int solo5__xen_evtchn_vector_handler(void *arg __attribute__((unused)))
> {
>     struct vcpu_info *vi = VCPU0_INFO();
> 
>     /*
>      * All work is done outside of interrupt context by
>      * evtchn_demux_pending(), so all we need to do here is acknowledge
>      * the upcall from Xen.
>      */
>     vi->evtchn_upcall_pending = 0;

I think you should best avoid clearing evtchn_upcall_pending here, as
Xen will then inject more vector callbacks if a new event is signaled,
even when you have not processed the previous one?

>     return 1;
> }
> 
> The bottom half is then called periodically (and always before blocking) by
> the OCaml code:
> 
> static bool evtchn_demux_pending(void)
> {
>     struct shared_info *s = SHARED_INFO();
>     struct vcpu_info *vi = VCPU0_INFO();
>     bool some_pending = false;
> 
>     vi->evtchn_upcall_pending = 0;
> 
>     /*
>      * Demux events received from Xen.
>      *
>      * pending_l1 is the "outer" per-VCPU selector (evtchn_pending_sel).
>      * pending_l2 is the "inner" system-wide word (evtchn_pending[l1i]).
>      */
>     xen_ulong_t pending_l1, pending_l2;
>     atomic_sync_xchg(&vi->evtchn_pending_sel, 0, &pending_l1);
>     while (pending_l1 != 0) {
>         xen_ulong_t l1i = ffs(pending_l1);
> 
>         /*
>          * Masking pending_l2 with ~evtchn_mask[l1i] is necessary to get the
>          * *current* masked events; otherwise races like the following
>          * can occur:
>          *
>          *     1. X is generated, upcall scheduled by Xen.
>          *     2. X is masked.
>          *     3. Upcall is delivered.
>          *     4. X fires despite now being masked.
>          */
>         while ((pending_l2 =
>                     (s->evtchn_pending[l1i] & ~s->evtchn_mask[l1i])) != 0) {
>             xen_ulong_t l2i = ffs(pending_l2);
> 
>             evtchn_port_t port = (l1i * (sizeof(xen_ulong_t) * 8)) + l2i;
>             /*
>              * Mark as pending on the OCaml side.
>              */
>             evtchn_callback_ml[port] = 1;

How is this cleared? It must be cleared before the handler is run, or
else you will likely miss interrupts.

Also, I think you could mask the port before setting
evtchn_callback_ml and unmask it when evtchn_callback_ml is cleared?

>             some_pending = true;
> 
>             atomic_sync_btc(l2i, &s->evtchn_pending[l1i]);
>         }
> 
>         pending_l1 &= ~(1UL << l1i);
>     }
> 
>     return some_pending;
> }

If you can dump the event channel numbers and their usage from Solo5
then you can check against the 'e' debug key from Xen in order to
check if there are indeed pending events to be delivered on some of
them.

Just checking the output from the 'e' debug key will tell you if
there's anything pending and if there are any channels masked.

Roger.



* Re: Event delivery and "domain blocking" on PVHv2
  2020-06-15 14:25 Event delivery and "domain blocking" on PVHv2 Martin Lucina
  2020-06-15 15:03 ` Roger Pau Monné
@ 2020-06-15 16:58 ` Andrew Cooper
  2020-06-18 10:13   ` Martin Lucina
  1 sibling, 1 reply; 24+ messages in thread
From: Andrew Cooper @ 2020-06-15 16:58 UTC (permalink / raw)
  To: xen-devel, mirageos-devel, martin

On 15/06/2020 15:25, Martin Lucina wrote:
> Hi,
>
> puzzle time: In my continuing explorations of the PVHv2 ABIs for the
> new MirageOS Xen stack, I've run into some issues with what looks like
> missed deliveries of events on event channels.
>
> While a simple unikernel that only uses the Xen console and
> effectively does for (1..5) { printf("foo"); sleep(1); } works fine,
> once I plug in the existing OCaml Xenstore and Netfront code, the
> behaviour I see is that the unikernel hangs in random places, blocking
> as if an event that should have been delivered has been missed.

You can see what is going on, event channel wise, with the 'e'
debug-key.  This will highlight cases such as the event channel being
masked and pending, which is a common guest bug ending up in this state.

>
> <snip>
> Given that I've essentially re-written the low-level event channel C
> code, I'd like to verify that the mechanisms I'm using for event
> delivery are indeed the right thing to do on PVHv2.
>
> For event delivery, I'm registering the upcall with Xen as follows:
>
>     uint64_t val = 32ULL;
>     val |= (uint64_t)HVM_PARAM_CALLBACK_TYPE_VECTOR << 56;
>     int rc = hypercall_hvm_set_param(HVM_PARAM_CALLBACK_IRQ, val);
>     assert(rc == 0);
>
> i.e. upcalls are to be delivered via IDT vector.

Don't use HVM_PARAM_CALLBACK_TYPE_VECTOR.  It is conceptually broken, as
it bypasses all queueing and IRR logic in the LAPIC.

At some point, I'm going to have to figure out a compatible way to deal
with all the guests still using this mechanism, because it is
incompatible with using hardware accelerated APIC support in
IvyBridge/Zen+ hardware.

Use HVMOP_set_evtchn_upcall_vector instead, which does the same thing,
but actually behaves like a real vector as far as the rest of the LAPIC
is concerned.

>
> Questions:
>
> 1. Being based on the Solo5 virtio code, the low-level setup code is
> doing the "usual" i8259 PIC setup, to remap the PIC IRQs to vectors 32
> and above. Should I be doing this initialisation for Xen PVH at all?
> I'm not interested in using the PIC for anything, and all interrupts
> will be delivered via Xen event channels.

PVH guests don't get a PIC by default.  Xen will just be swallowing all
your setup and doing nothing with it.

"plain" PVH guests also don't get an IO-APIC by default.  Unless you're
wanting to get PVH dom0 support working, (or PCI Passthrough in the
future), don't worry about the IO-APIC either.

>
> 2. Related to the above, the IRQ handler code is ACKing the interrupt
> after the handler runs. Should I be doing that? Does ACKing "IRQ" 0 on
> the PIC have any interactions with Xen's view of event
> channels/pending upcalls?

There's no PIC to begin with, but even then, talking to the PIC/IO-APIC
would only be correct for type INTX/GSI.

TYPE_VECTOR shouldn't have an ack at the LAPIC (it is this property
which makes it incompatible with hardware acceleration), while
HVMOP_set_evtchn_upcall_vector should be acked at the LAPIC.
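
The ack itself is just a write to the LAPIC EOI register, e.g. (sketch,
xAPIC MMIO mode at the architectural default base):

    #define LAPIC_MMIO_BASE 0xfee00000UL    /* architectural default */
    #define LAPIC_EOI       0xb0            /* EOI register offset */

    static inline void lapic_eoi(void)
    {
        /* Writing 0 to EOI acks the highest-priority in-service vector. */
        *(volatile uint32_t *)(LAPIC_MMIO_BASE + LAPIC_EOI) = 0;
    }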

~Andrew



* Re: Event delivery and "domain blocking" on PVHv2
  2020-06-15 16:58 ` Andrew Cooper
@ 2020-06-18 10:13   ` Martin Lucina
  2020-06-18 11:46     ` Roger Pau Monné
  2020-06-18 23:43     ` Andrew Cooper
  0 siblings, 2 replies; 24+ messages in thread
From: Martin Lucina @ 2020-06-18 10:13 UTC (permalink / raw)
  To: Andrew Cooper, Roger Pau Monné; +Cc: xen-devel, mirageos-devel

On Monday, 15.06.2020 at 17:58, Andrew Cooper wrote:
> On 15/06/2020 15:25, Martin Lucina wrote:
> > Hi,
> >
> > puzzle time: In my continuing explorations of the PVHv2 ABIs for the
> > new MirageOS Xen stack, I've run into some issues with what looks like
> > missed deliveries of events on event channels.
> >
> > While a simple unikernel that only uses the Xen console and
> > effectively does for (1..5) { printf("foo"); sleep(1); } works fine,
> > once I plug in the existing OCaml Xenstore and Netfront code, the
> > behaviour I see is that the unikernel hangs in random places, blocking
> > as if an event that should have been delivered has been missed.
> 
> You can see what is going on, event channel wise, with the 'e'
> debug-key.  This will highlight cases such as the event channel being
> masked and pending, which is a common guest bug ending up in this state.

Ok, based on your and Roger's suggestions I've made some changes:

1. I've dropped all the legacy PIC initialisation code from the Solo5
parts, written some basic APIC initialisation code and switched to using
HVMOP_set_evtchn_upcall_vector for upcall registration, along with setting
HVM_PARAM_CALLBACK_IRQ to 1 as suggested by Roger and done by Xen when
running as a guest. Commit at [1], nothing controversial there.

2. I've re-worked the "bottom half" of the event channel upcall handler to
mask the event when marking it as pending in the OCaml-side "shadow"
structure, and unmask it immediately before an OCaml-side handler would be
run, i.e. when doing a "test and clear port" operation on the OCaml-side
structure.

However, none of this seems to have fundamentally changed the behaviour.
The unikernel still gets "stuck" at random points during netfront set-up.
Now that I've extended the grant table size, I *sometimes* get as far as a
fully functioning netfront and packets will flow, but things will
eventually freeze up (pretty much immediately if I do something like ping
-f).

When the unikernel is in the "frozen" state, the domain is blocked (so we
think we're waiting for events), and the "e" debug key shows the relevant
event channels (Xenstore or netfront) as pending but unmasked:

Example - when hung during netfront bring-up:

(XEN) Event channel information for domain 27:
(XEN) Polling vCPUs: {}
(XEN)     port [p/m/s]
(XEN)        1 [1/0/1]: s=3 n=0 x=0 d=0 p=33
(XEN)        2 [1/1/1]: s=3 n=0 x=0 d=0 p=34
(XEN)        3 [1/0/1]: s=5 n=0 x=0 v=0

(1 is Xenstore, 2 is console which we don't care about, 3 is VIRQ_TIMER).

When hung after hammering with "ping -f":

(XEN) Event channel information for domain 28:
(XEN) Polling vCPUs: {}
(XEN)     port [p/m/s]
(XEN)        1 [0/0/1]: s=3 n=0 x=0 d=0 p=33
(XEN)        2 [1/1/1]: s=3 n=0 x=0 d=0 p=34
(XEN)        3 [1/0/1]: s=5 n=0 x=0 v=0
(XEN)        4 [1/0/1]: s=3 n=0 x=0 d=0 p=35

(as above, 4 is netfront)

Some more questions about the "bottom half" of the event channel upcall,
called periodically by OCaml outside of interrupt context:

static bool evtchn_demux_pending(void)
{
    struct shared_info *s = SHARED_INFO();
    struct vcpu_info *vi = VCPU0_INFO();
    bool some_pending = false;

    vi->evtchn_upcall_pending = 0;
>>>> Based on Roger's suggestion, this is now only done here and no longer
>>>> in the "top half" in interrupt context, which now only ACKs the vector
>>>> on the APIC.

    /*
     * Demux events received from Xen.
     *
     * pending_l1 is the "outer" per-VCPU selector (evtchn_pending_sel).
     * pending_l2 is the "inner" system-wide word (evtchn_pending[l1i]).
     */
    xen_ulong_t pending_l1, pending_l2;
    atomic_sync_xchg(&vi->evtchn_pending_sel, 0, &pending_l1);
    while (pending_l1 != 0) {
        xen_ulong_t l1i = ffs(pending_l1);
        pending_l1 &= ~(1UL << l1i);

        /*
         * Masking pending_l2 with ~evtchn_mask[l1i] is necessary to get the
         * *current* masked events; otherwise races like the following
         * can occur:
         *
         *     1. X is generated, upcall scheduled by Xen.
         *     2. X is masked.
         *     3. Upcall is delivered.
         *     4. X fires despite now being masked.
         */
        while ((pending_l2 =
                    (s->evtchn_pending[l1i] & ~s->evtchn_mask[l1i])) != 0) {
            xen_ulong_t l2i = ffs(pending_l2);
            pending_l2 &= ~(1UL << l2i);
>>>> Do I need the above? It doesn't make a difference and seems redundant
>>>> since pending_l2 is always reassigned in the loop, but Mini-OS and
>>>> others are doing it...?

            evtchn_port_t port = (l1i * (sizeof(xen_ulong_t) * 8)) + l2i;
            /*
             * Mark as pending on the OCaml side and mask the event until
             * just before OCaml gets around to handling it. Also, clear
             * the pending bit on the Xen side.
             */
            evtchn_callback_ml[port] = 1;
            atomic_sync_bts(l2i, &s->evtchn_mask[l1i]);
>>>> Mask added at Roger's suggestion, not in the original Mini-OS PV-based
>>>> implementation.
            atomic_sync_btc(l2i, &s->evtchn_pending[l1i]);
            some_pending = true;
        }
    }
    return some_pending;
}

The OCaml code essentially calls the above periodically, and, if
it returns true, then calls into the following "test and clear" operation
for all ports:

static inline bool evtchn_test_and_clear(evtchn_port_t port)
{
    assert(port < EVTCHN_2L_NR_CHANNELS);
    if (evtchn_callback_ml[port] > 0) {
        evtchn_callback_ml[port] = 0;
        evtchn_unmask(port);
>>>> Unmask added at Roger's suggestion, not in the original Mini-OS
>>>> PV-based implementation. I'm not entirely happy about this, since
>>>> it'll probably result in a lot more hypercalls when the event channel
>>>> is busy?
        return true;
    }
    else {
        return false;
    }
}

If this returns true, then the event handler will get run immediately after
returning from here, and before any further call to evtchn_demux_pending().

At this point I don't really have a clear idea of how to progress,
comparing my implementation side-by-side with the original PV Mini-OS-based
implementation doesn't show up any differences I can see.

AFAICT the OCaml code I've also not changed in any material way, and that
has been running in production on PV for years, so I'd be inclined to think
the problem is in my reimplementation of the C parts, but where...?

Martin

[1] https://github.com/mato/solo5/commit/42afe3a2177b9caf63d0df435391ae03a6684ac8



* Re: Event delivery and "domain blocking" on PVHv2
  2020-06-18 10:13   ` Martin Lucina
@ 2020-06-18 11:46     ` Roger Pau Monné
  2020-06-19 10:28       ` Martin Lucina
  2020-06-18 23:43     ` Andrew Cooper
  1 sibling, 1 reply; 24+ messages in thread
From: Roger Pau Monné @ 2020-06-18 11:46 UTC (permalink / raw)
  To: Martin Lucina; +Cc: Andrew Cooper, mirageos-devel, xen-devel

On Thu, Jun 18, 2020 at 12:13:30PM +0200, Martin Lucina wrote:
> On Monday, 15.06.2020 at 17:58, Andrew Cooper wrote:
> > On 15/06/2020 15:25, Martin Lucina wrote:
> > > Hi,
> > >
> > > puzzle time: In my continuing explorations of the PVHv2 ABIs for the
> > > new MirageOS Xen stack, I've run into some issues with what looks like
> > > missed deliveries of events on event channels.
> > >
> > > While a simple unikernel that only uses the Xen console and
> > > effectively does for (1..5) { printf("foo"); sleep(1); } works fine,
> > > once I plug in the existing OCaml Xenstore and Netfront code, the
> > > behaviour I see is that the unikernel hangs in random places, blocking
> > > as if an event that should have been delivered has been missed.
> > 
> > You can see what is going on, event channel wise, with the 'e'
> > debug-key.  This will highlight cases such as the event channel being
> > masked and pending, which is a common guest bug ending up in this state.
> 
> Ok, based on your and Roger's suggestions I've made some changes:
> 
> 1. I've dropped all the legacy PIC initialisation code from the Solo5
> parts, written some basic APIC initialisation code and switched to using
> HVMOP_set_evtchn_upcall_vector for upcall registration, along with setting
> HVM_PARAM_CALLBACK_IRQ to 1 as suggested by Roger and done by Xen when
> running as a guest. Commit at [1], nothing controversial there.
> 
> 2. I've re-worked the "bottom half" of the event channel upcall handler to
> mask the event when marking it as pending in the OCaml-side "shadow"
> structure, and unmask it immediately before an OCaml-side handler would be
> run, i.e. when doing a "test and clear port" operation on the OCaml-side
> structure.

As long as the unmask happens after you have cleared the bit on the
ocaml-side structure I think it should be fine; however, I would unmask
the event channel only once you have finished running the ocaml handler
for it.
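
Roughly what I mean, as a sketch on top of your evtchn_test_and_clear()
(the handler callback here is made up; in reality it's the OCaml side
that runs it):

    static void evtchn_dispatch(evtchn_port_t port,
                                void (*handler)(evtchn_port_t))
    {
        if (evtchn_callback_ml[port] > 0) {
            evtchn_callback_ml[port] = 0;
            handler(port);
            /*
             * Unmask only after the handler has run, so the port stays
             * masked while its handler is still outstanding.
             */
            evtchn_unmask(port);
        }
    }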

> 
> However, none of this seems to have fundamentally changed the behaviour.
> The unikernel still gets "stuck" at random points during netfront set-up.
> Now that I've extended the grant table size, I *sometimes* get as far as a
> fully functioning netfront and packets will flow, but things will
> eventually freeze up (pretty much immediately if I do something like ping
> -f).
> 
> When the unikernel is in the "frozen" state, the domain is blocked (so we
> think we're waiting for events), and the "e" debug key shows the relevant
> event channels (Xenstore or netfront) as pending but unmasked:
> 
> Example - when hung during netfront bring-up:
> 
> (XEN) Event channel information for domain 27:
> (XEN) Polling vCPUs: {}
> (XEN)     port [p/m/s]
> (XEN)        1 [1/0/1]: s=3 n=0 x=0 d=0 p=33
> (XEN)        2 [1/1/1]: s=3 n=0 x=0 d=0 p=34
> (XEN)        3 [1/0/1]: s=5 n=0 x=0 v=0
> 
> (1 is Xenstore, 2 is console which we don't care about, 3 is VIRQ_TIMER).
> 
> When hung after hammering with "ping -f":
> 
> (XEN) Event channel information for domain 28:
> (XEN) Polling vCPUs: {}
> (XEN)     port [p/m/s]
> (XEN)        1 [0/0/1]: s=3 n=0 x=0 d=0 p=33
> (XEN)        2 [1/1/1]: s=3 n=0 x=0 d=0 p=34
> (XEN)        3 [1/0/1]: s=5 n=0 x=0 v=0
> (XEN)        4 [1/0/1]: s=3 n=0 x=0 d=0 p=35
> 
> (as above, 4 is netfront)

So events are pending but somehow not injected.

> 
> Some more questions about the "bottom half" of the event channel upcall,
> called periodically by OCaml outside of interrupt context:
> 
> static bool evtchn_demux_pending(void)
> {
>     struct shared_info *s = SHARED_INFO();
>     struct vcpu_info *vi = VCPU0_INFO();
>     bool some_pending = false;
> 
>     vi->evtchn_upcall_pending = 0;
> >>>> Based on Roger's suggestion, this is now only done here and no longer
> >>>> in the "top half" in interrupt context, which now only ACKs the vector
> >>>> on the APIC.
> 
>     /*
>      * Demux events received from Xen.
>      *
>      * pending_l1 is the "outer" per-VCPU selector (evtchn_pending_sel).
>      * pending_l2 is the "inner" system-wide word (evtchn_pending[l1i]).
>      */
>     xen_ulong_t pending_l1, pending_l2;
>     atomic_sync_xchg(&vi->evtchn_pending_sel, 0, &pending_l1);
>     while (pending_l1 != 0) {
>         xen_ulong_t l1i = ffs(pending_l1);
>         pending_l1 &= ~(1UL << l1i);
> 
>         /*
>          * Masking pending_l2 with ~evtchn_mask[l1i] is necessary to get the
>          * *current* masked events; otherwise races like the following
>          * can occur:
>          *
>          *     1. X is generated, upcall scheduled by Xen.
>          *     2. X is masked.
>          *     3. Upcall is delivered.
>          *     4. X fires despite now being masked.
>          */
>         while ((pending_l2 =
>                     (s->evtchn_pending[l1i] & ~s->evtchn_mask[l1i])) != 0) {
>             xen_ulong_t l2i = ffs(pending_l2);
>             pending_l2 &= ~(1UL << l2i);
> >>>> Do I need the above? It doesn't make a difference and seems redundant
> >>>> since pending_l2 is always reassigned in the loop, but Mini-OS and
> >>>> others are doing it...?

No, AFAICT pending_l2 is only used to get the event channel to
process, so there's no point in clearing it in the local variable.
What you care about is clearing it in evtchn_pending[l1i].

> 
>             evtchn_port_t port = (l1i * (sizeof(xen_ulong_t) * 8)) + l2i;
>             /*
>              * Mark as pending on the OCaml side and mask the event until
>              * just before OCaml gets around to handling it. Also, clear
>              * the pending bit on the Xen side.
>              */
>             evtchn_callback_ml[port] = 1;
>             atomic_sync_bts(l2i, &s->evtchn_mask[l1i]);
> >>>> Mask added at Roger's suggestion, not in the original Mini-OS PV-based
> >>>> implementation.
>             atomic_sync_btc(l2i, &s->evtchn_pending[l1i]);
>             some_pending = true;
>         }
>     }
>     return some_pending;
> }
> 
> The OCaml code essentially calls the above periodically, and, if
> it returns true, then calls into the following "test and clear" operation
> for all ports:
> 
> static inline bool evtchn_test_and_clear(evtchn_port_t port)
> {
>     assert(port < EVTCHN_2L_NR_CHANNELS);
>     if (evtchn_callback_ml[port] > 0) {
>         evtchn_callback_ml[port] = 0;
>         evtchn_unmask(port);
> >>>> Unmask added at Roger's suggestion, not in the original Mini-OS
> >>>> PV-based implementation. I'm not entirely happy about this, since
> >>>> it'll probably result in a lot more hypercalls when the event channel
> >>>> is busy?

OTOH you will likely continue to get interrupts from it if you don't
mask it, so you might receive several interrupts (because you clear
the pending bit) without having executed the handler.

IMO it would be better to do the unmask after the handler has run.

>         return true;
>     }
>     else {
>         return false;
>     }
> }
> 
> If this returns true, then the event handler will get run immediately after
> returning from here, and before any further call to evtchn_demux_pending().
> 
> At this point I don't really have a clear idea of how to progress,
> comparing my implementation side-by-side with the original PV Mini-OS-based
> implementation doesn't show up any differences I can see.
> 
> AFAICT the OCaml code I've also not changed in any material way, and that
> has been running in production on PV for years, so I'd be inclined to think
> the problem is in my reimplementation of the C parts, but where...?

A good start would be to print the ISR and IRR lapic registers when
blocked, to assert there are no pending vectors there.

Can you apply the following patch to your Xen, rebuild and check the
output of the 'l' debug key?

Also add the output of the 'v' key.

Roger.
---8<---
diff --git a/xen/arch/x86/hvm/vlapic.c b/xen/arch/x86/hvm/vlapic.c
index 7b5c633033..45b195cc05 100644
--- a/xen/arch/x86/hvm/vlapic.c
+++ b/xen/arch/x86/hvm/vlapic.c
@@ -23,6 +23,7 @@
 #include <xen/domain.h>
 #include <xen/domain_page.h>
 #include <xen/event.h>
+#include <xen/keyhandler.h>
 #include <xen/nospec.h>
 #include <xen/trace.h>
 #include <xen/lib.h>
@@ -1662,6 +1663,34 @@ void vlapic_destroy(struct vcpu *v)
     free_domheap_page(vlapic->regs_page);
 }
 
+static void print_lapic(unsigned char key)
+{
+    const struct domain *d;
+
+    for_each_domain ( d )
+    {
+        const struct vcpu *v;
+
+        if ( !has_vlapic(d) )
+            continue;
+
+        for_each_vcpu ( d, v )
+        {
+            const struct vlapic *vlapic = vcpu_vlapic(v);
+
+            printk("%pv IRR: %*pb\n", v, 256, &vlapic->regs->data[APIC_IRR]);
+            printk("%pv ISR: %*pb\n", v, 256, &vlapic->regs->data[APIC_ISR]);
+        }
+    }
+}
+
+static int __init print_lapic_init(void)
+{
+    register_keyhandler('l', print_lapic, "dump local APIC info", 1);
+    return 0;
+}
+__initcall(print_lapic_init);
+
 /*
  * Local variables:
  * mode: C




* Re: Event delivery and "domain blocking" on PVHv2
  2020-06-18 10:13   ` Martin Lucina
  2020-06-18 11:46     ` Roger Pau Monné
@ 2020-06-18 23:43     ` Andrew Cooper
  2020-06-19  8:41       ` Martin Lucina
  1 sibling, 1 reply; 24+ messages in thread
From: Andrew Cooper @ 2020-06-18 23:43 UTC (permalink / raw)
  To: Martin Lucina, Roger Pau Monné, xen-devel, mirageos-devel

On 18/06/2020 11:13, Martin Lucina wrote:
> On Monday, 15.06.2020 at 17:58, Andrew Cooper wrote:
>> On 15/06/2020 15:25, Martin Lucina wrote:
>>> Hi,
>>>
>>> puzzle time: In my continuing explorations of the PVHv2 ABIs for the
>>> new MirageOS Xen stack, I've run into some issues with what looks like
>>> missed deliveries of events on event channels.
>>>
>>> While a simple unikernel that only uses the Xen console and
>>> effectively does for (1..5) { printf("foo"); sleep(1); } works fine,
>>> once I plug in the existing OCaml Xenstore and Netfront code, the
>>> behaviour I see is that the unikernel hangs in random places, blocking
>>> as if an event that should have been delivered has been missed.
>> You can see what is going on, event channel wise, with the 'e'
>> debug-key.  This will highlight cases such as the event channel being
>> masked and pending, which is a common guest bug ending up in this state.
> Ok, based on your and Roger's suggestions I've made some changes:
>
> 1. I've dropped all the legacy PIC initialisation code from the Solo5
> parts, written some basic APIC initialisation code and switched to using
> HVMOP_set_evtchn_upcall_vector for upcall registration, along with setting
> HVM_PARAM_CALLBACK_IRQ to 1 as suggested by Roger and done by Xen when
> running as a guest. Commit at [1], nothing controversial there.

Well...

    uint64_t apic_base = rdmsrq(MSR_IA32_APIC_BASE);
    wrmsrq(MSR_IA32_APIC_BASE,
            apic_base | (APIC_BASE << 4) | MSR_IA32_APIC_BASE_ENABLE);
    apic_base = rdmsrq(MSR_IA32_APIC_BASE);
    if (!(apic_base & MSR_IA32_APIC_BASE_ENABLE)) {
        log(ERROR, "Solo5: Could not enable APIC or not present\n");
        assert(false);
    }

The only reason Xen doesn't crash your guest on that WRMSR is because
0xfee00080ull | (0xfee00000u << 4) == 0xfee00080ull, due to truncation
and 0xfe | 0xee == 0xfe.

Either way, the logic isn't correct.

Xen doesn't support moving the APIC MMIO window (and almost certainly
never will, because the only thing which changes it is malware).  You
can rely on the default state being correct, because it is
architecturally specified.

~Andrew



* Re: Event delivery and "domain blocking" on PVHv2
  2020-06-18 23:43     ` Andrew Cooper
@ 2020-06-19  8:41       ` Martin Lucina
  0 siblings, 0 replies; 24+ messages in thread
From: Martin Lucina @ 2020-06-19  8:41 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, mirageos-devel, Roger Pau Monné

On 2020-06-19 01:43, Andrew Cooper wrote:
> On 18/06/2020 11:13, Martin Lucina wrote:
>> On Monday, 15.06.2020 at 17:58, Andrew Cooper wrote:
>>> On 15/06/2020 15:25, Martin Lucina wrote:
>>>> Hi,
>>>> 
>>>> puzzle time: In my continuing explorations of the PVHv2 ABIs for the
>>>> new MirageOS Xen stack, I've run into some issues with what looks 
>>>> like
>>>> missed deliveries of events on event channels.
>>>> 
>>>> While a simple unikernel that only uses the Xen console and
>>>> effectively does for (1..5) { printf("foo"); sleep(1); } works fine,
>>>> once I plug in the existing OCaml Xenstore and Netfront code, the
>>>> behaviour I see is that the unikernel hangs in random places, 
>>>> blocking
>>>> as if an event that should have been delivered has been missed.
>>> You can see what is going on, event channel wise, with the 'e'
>>> debug-key.  This will highlight cases such as the event channel being
>>> masked and pending, which is a common guest bug ending up in this 
>>> state.
>> Ok, based on your and Roger's suggestions I've made some changes:
>> 
>> 1. I've dropped all the legacy PIC initialisation code from the Solo5
>> parts, written some basic APIC initialisation code and switched to 
>> using
>> HVMOP_set_evtchn_upcall_vector for upcall registration, along with 
>> setting
>> HVM_PARAM_CALLBACK_IRQ to 1 as suggested by Roger and done by Xen when
>> running as a guest. Commit at [1], nothing controversial there.
> 
> Well...
> 
>     uint64_t apic_base = rdmsrq(MSR_IA32_APIC_BASE);
>     wrmsrq(MSR_IA32_APIC_BASE,
>             apic_base | (APIC_BASE << 4) | MSR_IA32_APIC_BASE_ENABLE);
>     apic_base = rdmsrq(MSR_IA32_APIC_BASE);
>     if (!(apic_base & MSR_IA32_APIC_BASE_ENABLE)) {
>         log(ERROR, "Solo5: Could not enable APIC or not present\n");
>         assert(false);
>     }
> 
> The only reason Xen doesn't crash your guest on that WRMSR is because
> 0xfee00080ull | (0xfee00000u << 4) == 0xfee00080ull, due to truncation
> and 0xfe | 0xee == 0xfe.
> 
> Either way, the logic isn't correct.

Oh, thanks. Don't you wish C had a "strict" mode where you could 
disable/warn
on implicit type promotion? I certainly do.

> 
> Xen doesn't support moving the APIC MMIO window (and almost certainly
> never will, because the only thing which changes it is malware).  You
> can rely on the default state being correct, because it is
> architecturally specified.

Noted. I'll change the code to just verify that APIC_BASE is indeed 
FEE00000
at start of day and that the enable operation succeeded -- I like to 
keep the
code robust, e.g. against cut-n-pasting to somewhere else that might be 
used
in a non-Xen context later where the precondition may not hold.
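
I.e. something like this (sketch of the intended check, reusing the
rdmsrq()/wrmsrq() helpers from the snippet you quoted; untested):

    #define APIC_BASE_DEFAULT 0xfee00000ULL     /* architectural default */

    uint64_t apic_base = rdmsrq(MSR_IA32_APIC_BASE);
    /* Xen never moves the xAPIC MMIO window, so just verify it. */
    assert((apic_base & ~0xfffULL) == APIC_BASE_DEFAULT);
    if (!(apic_base & MSR_IA32_APIC_BASE_ENABLE)) {
        wrmsrq(MSR_IA32_APIC_BASE, apic_base | MSR_IA32_APIC_BASE_ENABLE);
        apic_base = rdmsrq(MSR_IA32_APIC_BASE);
    }
    assert(apic_base & MSR_IA32_APIC_BASE_ENABLE);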

Martin

> 
> ~Andrew


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Event delivery and "domain blocking" on PVHv2
  2020-06-18 11:46     ` Roger Pau Monné
@ 2020-06-19 10:28       ` Martin Lucina
  2020-06-19 11:21         ` Roger Pau Monné
  2020-06-19 11:21         ` Andrew Cooper
  0 siblings, 2 replies; 24+ messages in thread
From: Martin Lucina @ 2020-06-19 10:28 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: Andrew Cooper, mirageos-devel, xen-devel

On 2020-06-18 13:46, Roger Pau Monné wrote:
> On Thu, Jun 18, 2020 at 12:13:30PM +0200, Martin Lucina wrote:
>> At this point I don't really have a clear idea of how to progress,
>> comparing my implementation side-by-side with the original PV 
>> Mini-OS-based
>> implementation doesn't show up any differences I can see.
>> 
>> AFAICT the OCaml code I've also not changed in any material way, and 
>> that
>> has been running in production on PV for years, so I'd be inclined to 
>> think
>> the problem is in my reimplementation of the C parts, but where...?
> 
> A good start would be to print the ISR and IRR lapic registers when
> blocked, to assert there are no pending vectors there.
> 
> Can you apply the following patch to your Xen, rebuild and check the
> output of the 'l' debug key?
> 
> Also add the output of the 'v' key.

Had to fight the Xen Debian packages a bit as I wanted to patch the 
exact same Xen (there are some failures when building on a system that 
has Xen installed due to following symlinks when fixing shebangs).

Here you go, when stuck during netfront setup, after allocating its 
event channel, presumably waiting on Xenstore:

'e':

(XEN) Event channel information for domain 3:
(XEN) Polling vCPUs: {}
(XEN)     port [p/m/s]
(XEN)        1 [1/0/1]: s=3 n=0 x=0 d=0 p=33
(XEN)        2 [1/1/1]: s=3 n=0 x=0 d=0 p=34
(XEN)        3 [1/0/1]: s=5 n=0 x=0 v=0
(XEN)        4 [0/1/1]: s=2 n=0 x=0 d=0

'l':

(XEN) d3v0 IRR: ffff8301732dc200b
(XEN) d3v0 ISR: ffff8301732dc100b

'v':

(XEN) *********** VMCS Areas **************
(XEN)
(XEN) >>> Domain 3 <<<
(XEN)   VCPU 0
(XEN) *** Guest State ***
(XEN) CR0: actual=0x0000000080000033, shadow=0x0000000080000033, 
gh_mask=ffffffffffffffff
(XEN) CR4: actual=0x0000000000002660, shadow=0x0000000000000020, 
gh_mask=ffffffffffc8f860
(XEN) CR3 = 0x00000000002e9000
(XEN) RSP = 0x000000000ffffec0 (0x000000000ffffec0)  RIP = 
0x0000000000209997 (0x0000000000209998)
(XEN) RFLAGS=0x00000297 (0x00000297)  DR7 = 0x0000000000000400
(XEN) Sysenter RSP=0000000000000000 CS:RIP=0000:0000000000000000
(XEN)        sel  attr  limit   base
(XEN)   CS: 0008 0a099 ffffffff 0000000000000000
(XEN)   DS: 0010 0c093 ffffffff 0000000000000000
(XEN)   SS: 0010 0c093 ffffffff 0000000000000000
(XEN)   ES: 0010 0c093 ffffffff 0000000000000000
(XEN)   FS: 0000 1c000 ffffffff 0000000000000000
(XEN)   GS: 0000 1c000 ffffffff 0000000000000000
(XEN) GDTR:            00000027 00000000003e13c0
(XEN) LDTR: 0000 10000 00000000 0000000000000000
(XEN) IDTR:            000002ff 00000000003e10a8
(XEN)   TR: 0018 0008b 00000068 00000000003e1040
(XEN) EFER = 0x0000000000000000  PAT = 0x0007040600070406
(XEN) PreemptionTimer = 0x00000000  SM Base = 0x00000000
(XEN) DebugCtl = 0x0000000000000000  DebugExceptions = 
0x0000000000000000
(XEN) Interruptibility = 00000000  ActivityState = 00000000
(XEN) *** Host State ***
(XEN) RIP = 0xffff82d08031dd30 (vmx_asm_vmexit_handler)  RSP = 
0xffff83009c057f70
(XEN) CS=e008 SS=0000 DS=0000 ES=0000 FS=0000 GS=0000 TR=e040
(XEN) FSBase=0000000000000000 GSBase=0000000000000000 
TRBase=ffff82d08055e000
(XEN) GDTBase=ffff82d08042e000 IDTBase=ffff82d080559000
(XEN) CR0=0000000080050033 CR3=0000000172a67000 CR4=00000000003526e0
(XEN) Sysenter RSP=ffff83009c057fa0 CS:RIP=e008:ffff82d0803654b0
(XEN) EFER = 0x0000000000000000  PAT = 0x0000050100070406
(XEN) *** Control State ***
(XEN) PinBased=0000003f CPUBased=b6a065fa SecondaryExec=000014eb
(XEN) EntryControls=000053ff ExitControls=000fefff
(XEN) ExceptionBitmap=00060002 PFECmask=00000000 PFECmatch=00000000
(XEN) VMEntry: intr_info=00000020 errcode=00000000 ilen=00000000
(XEN) VMExit: intr_info=00000000 errcode=00000000 ilen=00000001
(XEN)         reason=0000000c qualification=0000000000000000
(XEN) IDTVectoring: info=00000000 errcode=00000000
(XEN) TSC Offset = 0xffffffcf8453199b  TSC Multiplier = 
0x0000000000000000
(XEN) TPR Threshold = 0x00  PostedIntrVec = 0x00
(XEN) EPT pointer = 0x0000000172a0b01e  EPTP index = 0x0000
(XEN) PLE Gap=00000080 Window=00001000
(XEN) Virtual processor ID = 0x000e VMfunc controls = 0000000000000000
(XEN) **************************************

RIP 0x209997 is the 'hlt' instruction in 
mirage_xen_evtchn_block_domain() so we are indeed blocking waiting for 
events to show up.

Martin


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Event delivery and "domain blocking" on PVHv2
  2020-06-19 10:28       ` Martin Lucina
@ 2020-06-19 11:21         ` Roger Pau Monné
  2020-06-19 16:41           ` Martin Lucina
  2020-06-19 11:21         ` Andrew Cooper
  1 sibling, 1 reply; 24+ messages in thread
From: Roger Pau Monné @ 2020-06-19 11:21 UTC (permalink / raw)
  To: Martin Lucina; +Cc: Andrew Cooper, mirageos-devel, xen-devel

On Fri, Jun 19, 2020 at 12:28:50PM +0200, Martin Lucina wrote:
> On 2020-06-18 13:46, Roger Pau Monné wrote:
> > On Thu, Jun 18, 2020 at 12:13:30PM +0200, Martin Lucina wrote:
> > > At this point I don't really have a clear idea of how to progress,
> > > comparing my implementation side-by-side with the original PV
> > > Mini-OS-based
> > > implementation doesn't show up any differences I can see.
> > > 
> > > AFAICT the OCaml code I've also not changed in any material way, and
> > > that
> > > has been running in production on PV for years, so I'd be inclined
> > > to think
> > > the problem is in my reimplementation of the C parts, but where...?
> > 
> > A good start would be to print the ISR and IRR lapic registers when
> > blocked, to assert there are no pending vectors there.
> > 
> > Can you apply the following patch to your Xen, rebuild and check the
> > output of the 'l' debug key?
> > 
> > Also add the output of the 'v' key.
> 
> Had to fight the Xen Debian packages a bit as I wanted to patch the exact
> same Xen (there are some failures when building on a system that has Xen
> installed due to following symlinks when fixing shebangs).
> 
> Here you go, when stuck during netfront setup, after allocating its event
> channel, presumably waiting on Xenstore:
> 
> 'e':
> 
> (XEN) Event channel information for domain 3:
> (XEN) Polling vCPUs: {}
> (XEN)     port [p/m/s]
> (XEN)        1 [1/0/1]: s=3 n=0 x=0 d=0 p=33
> (XEN)        2 [1/1/1]: s=3 n=0 x=0 d=0 p=34
> (XEN)        3 [1/0/1]: s=5 n=0 x=0 v=0
> (XEN)        4 [0/1/1]: s=2 n=0 x=0 d=0
> 
> 'l':
> 
> (XEN) d3v0 IRR:
> ffff8301732dc200b
> (XEN) d3v0 ISR:
> ffff8301732dc100b

Which version of Xen is this? AFAICT it doesn't have the support to
print a bitmap.

Do you think you could also pick commit
8cd9500958d818e3deabdd0d4164ea6fe1623d7c [0] and rebuild? (and print
the info again).

Thanks, Roger.

[0] http://xenbits.xen.org/gitweb/?p=xen.git;a=patch;h=8cd9500958d818e3deabdd0d4164ea6fe1623d7c


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Event delivery and "domain blocking" on PVHv2
  2020-06-19 10:28       ` Martin Lucina
  2020-06-19 11:21         ` Roger Pau Monné
@ 2020-06-19 11:21         ` Andrew Cooper
  2020-06-19 16:46           ` Martin Lucina
  1 sibling, 1 reply; 24+ messages in thread
From: Andrew Cooper @ 2020-06-19 11:21 UTC (permalink / raw)
  To: Martin Lucina, Roger Pau Monné; +Cc: xen-devel, mirageos-devel

On 19/06/2020 11:28, Martin Lucina wrote:
> On 2020-06-18 13:46, Roger Pau Monné wrote:
>> On Thu, Jun 18, 2020 at 12:13:30PM +0200, Martin Lucina wrote:
>>> At this point I don't really have a clear idea of how to progress,
>>> comparing my implementation side-by-side with the original PV
>>> Mini-OS-based
>>> implementation doesn't show up any differences I can see.
>>>
>>> AFAICT the OCaml code I've also not changed in any material way, and
>>> that
>>> has been running in production on PV for years, so I'd be inclined
>>> to think
>>> the problem is in my reimplementation of the C parts, but where...?
>>
>> A good start would be to print the ISR and IRR lapic registers when
>> blocked, to assert there are no pending vectors there.
>>
>> Can you apply the following patch to your Xen, rebuild and check the
>> output of the 'l' debug key?
>>
>> Also add the output of the 'v' key.
>
> Had to fight the Xen Debian packages a bit as I wanted to patch the
> exact same Xen (there are some failures when building on a system that
> has Xen installed due to following symlinks when fixing shebangs).
>
> Here you go, when stuck during netfront setup, after allocating its
> event channel, presumably waiting on Xenstore:
>
> 'e':
>
> (XEN) Event channel information for domain 3:
> (XEN) Polling vCPUs: {}
> (XEN)     port [p/m/s]
> (XEN)        1 [1/0/1]: s=3 n=0 x=0 d=0 p=33
> (XEN)        2 [1/1/1]: s=3 n=0 x=0 d=0 p=34
> (XEN)        3 [1/0/1]: s=5 n=0 x=0 v=0
> (XEN)        4 [0/1/1]: s=2 n=0 x=0 d=0
>
> 'l':
>
> (XEN) d3v0
> IRR:                                                         
>                                                                         
>                                                                         
>                                      ffff8301732dc200b
> (XEN) d3v0
> ISR:                                                         
>                                                                         
>                                                                         
>                                      ffff8301732dc100b
>
> 'v':
>
> (XEN) *********** VMCS Areas **************
> (XEN)
> (XEN) >>> Domain 3 <<<
> (XEN)   VCPU 0
> (XEN) *** Guest State ***
> (XEN) CR0: actual=0x0000000080000033, shadow=0x0000000080000033,
> gh_mask=ffffffffffffffff
> (XEN) CR4: actual=0x0000000000002660, shadow=0x0000000000000020,
> gh_mask=ffffffffffc8f860
> (XEN) CR3 = 0x00000000002e9000
> (XEN) RSP = 0x000000000ffffec0 (0x000000000ffffec0)  RIP =
> 0x0000000000209997 (0x0000000000209998)
> (XEN) RFLAGS=0x00000297 (0x00000297)  DR7 = 0x0000000000000400
> (XEN) Sysenter RSP=0000000000000000 CS:RIP=0000:0000000000000000
> (XEN)        sel  attr  limit   base
> (XEN)   CS: 0008 0a099 ffffffff 0000000000000000
> (XEN)   DS: 0010 0c093 ffffffff 0000000000000000
> (XEN)   SS: 0010 0c093 ffffffff 0000000000000000
> (XEN)   ES: 0010 0c093 ffffffff 0000000000000000
> (XEN)   FS: 0000 1c000 ffffffff 0000000000000000
> (XEN)   GS: 0000 1c000 ffffffff 0000000000000000
> (XEN) GDTR:            00000027 00000000003e13c0
> (XEN) LDTR: 0000 10000 00000000 0000000000000000
> (XEN) IDTR:            000002ff 00000000003e10a8
> (XEN)   TR: 0018 0008b 00000068 00000000003e1040
> (XEN) EFER = 0x0000000000000000  PAT = 0x0007040600070406
> (XEN) PreemptionTimer = 0x00000000  SM Base = 0x00000000
> (XEN) DebugCtl = 0x0000000000000000  DebugExceptions = 0x0000000000000000
> (XEN) Interruptibility = 00000000  ActivityState = 00000000
> (XEN) *** Host State ***
> (XEN) RIP = 0xffff82d08031dd30 (vmx_asm_vmexit_handler)  RSP =
> 0xffff83009c057f70
> (XEN) CS=e008 SS=0000 DS=0000 ES=0000 FS=0000 GS=0000 TR=e040
> (XEN) FSBase=0000000000000000 GSBase=0000000000000000
> TRBase=ffff82d08055e000
> (XEN) GDTBase=ffff82d08042e000 IDTBase=ffff82d080559000
> (XEN) CR0=0000000080050033 CR3=0000000172a67000 CR4=00000000003526e0
> (XEN) Sysenter RSP=ffff83009c057fa0 CS:RIP=e008:ffff82d0803654b0
> (XEN) EFER = 0x0000000000000000  PAT = 0x0000050100070406
> (XEN) *** Control State ***
> (XEN) PinBased=0000003f CPUBased=b6a065fa SecondaryExec=000014eb
> (XEN) EntryControls=000053ff ExitControls=000fefff
> (XEN) ExceptionBitmap=00060002 PFECmask=00000000 PFECmatch=00000000
> (XEN) VMEntry: intr_info=00000020 errcode=00000000 ilen=00000000
> (XEN) VMExit: intr_info=00000000 errcode=00000000 ilen=00000001
> (XEN)         reason=0000000c qualification=0000000000000000
> (XEN) IDTVectoring: info=00000000 errcode=00000000
> (XEN) TSC Offset = 0xffffffcf8453199b  TSC Multiplier =
> 0x0000000000000000
> (XEN) TPR Threshold = 0x00  PostedIntrVec = 0x00
> (XEN) EPT pointer = 0x0000000172a0b01e  EPTP index = 0x0000
> (XEN) PLE Gap=00000080 Window=00001000
> (XEN) Virtual processor ID = 0x000e VMfunc controls = 0000000000000000
> (XEN) **************************************
>
> RIP 0x209997 is the 'hlt' instruction in
> mirage_xen_evtchn_block_domain() so we are indeed blocking waiting for
> events to show up.

I can't find this in the code, but it might be an x86-ism you're falling
over here.

It's not safe to use hlt with interrupts enabled, unless it is exactly
`sti; hlt` where the STI instruction transitions the interrupt flag from
clear to set (i.e. you had interrupts disabled beforehand).

Otherwise you can take the interrupt intended to wake you before the
hlt is executed.
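
In other words, the only safe idle pattern looks something like the
following (sketch only; work_pending() stands in for whatever "is there
anything to do" check you have):

    __asm__ __volatile__ ("cli" : : : "memory");
    if (work_pending()) {
        /* Something arrived while we were deciding; don't sleep. */
        __asm__ __volatile__ ("sti" : : : "memory");
    } else {
        /* The STI interrupt shadow delays delivery until after the next
         * instruction, so nothing can be taken between the sti and the
         * hlt; the wakeup interrupt will instead wake the hlt. */
        __asm__ __volatile__ ("sti; hlt" : : : "memory");
    }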

~Andrew


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Event delivery and "domain blocking" on PVHv2
  2020-06-19 11:21         ` Roger Pau Monné
@ 2020-06-19 16:41           ` Martin Lucina
  2020-06-19 16:54             ` Roger Pau Monné
  0 siblings, 1 reply; 24+ messages in thread
From: Martin Lucina @ 2020-06-19 16:41 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: Andrew Cooper, mirageos-devel, xen-devel

On 2020-06-19 13:21, Roger Pau Monné wrote:
> On Fri, Jun 19, 2020 at 12:28:50PM +0200, Martin Lucina wrote:
>> On 2020-06-18 13:46, Roger Pau Monné wrote:
>> > On Thu, Jun 18, 2020 at 12:13:30PM +0200, Martin Lucina wrote:
>> > > At this point I don't really have a clear idea of how to progress,
>> > > comparing my implementation side-by-side with the original PV
>> > > Mini-OS-based
>> > > implementation doesn't show up any differences I can see.
>> > >
>> > > AFAICT the OCaml code I've also not changed in any material way, and
>> > > that
>> > > has been running in production on PV for years, so I'd be inclined
>> > > to think
>> > > the problem is in my reimplementation of the C parts, but where...?
>> >
>> > A good start would be to print the ISR and IRR lapic registers when
>> > blocked, to assert there are no pending vectors there.
>> >
>> > Can you apply the following patch to your Xen, rebuild and check the
>> > output of the 'l' debug key?
>> >
>> > Also add the output of the 'v' key.
>> 
>> Had to fight the Xen Debian packages a bit as I wanted to patch the 
>> exact
>> same Xen (there are some failures when building on a system that has 
>> Xen
>> installed due to following symlinks when fixing shebangs).
>> 
>> Here you go, when stuck during netfront setup, after allocating its 
>> event
>> channel, presumably waiting on Xenstore:
>> 
>> 'e':
>> 
>> (XEN) Event channel information for domain 3:
>> (XEN) Polling vCPUs: {}
>> (XEN)     port [p/m/s]
>> (XEN)        1 [1/0/1]: s=3 n=0 x=0 d=0 p=33
>> (XEN)        2 [1/1/1]: s=3 n=0 x=0 d=0 p=34
>> (XEN)        3 [1/0/1]: s=5 n=0 x=0 v=0
>> (XEN)        4 [0/1/1]: s=2 n=0 x=0 d=0
>> 
>> 'l':
>> 
>> (XEN) d3v0 IRR:
>> ffff8301732dc200b
>> (XEN) d3v0 ISR:
>> ffff8301732dc100b
> 
> Which version of Xen is this? AFAICT it doesn't have the support to
> print a bitmap.

That in Debian 10 (stable):

ii  xen-hypervisor-4.11-amd64            
4.11.3+24-g14b62ab3e5-1~deb10u1.2 amd64        Xen Hypervisor on AMD64

xen_major              : 4
xen_minor              : 11
xen_extra              : .4-pre
xen_version            : 4.11.4-pre

> 
> Do you think you could also pick commit
> 8cd9500958d818e3deabdd0d4164ea6fe1623d7c [0] and rebuild? (and print
> the info again).

Done, here you go:

(XEN) Event channel information for domain 3:
(XEN) Polling vCPUs: {}
(XEN)     port [p/m/s]
(XEN)        1 [1/0/1]: s=3 n=0 x=0 d=0 p=33
(XEN)        2 [1/1/1]: s=3 n=0 x=0 d=0 p=34
(XEN)        3 [1/0/1]: s=5 n=0 x=0 v=0
(XEN)        4 [0/1/1]: s=3 n=0 x=0 d=0 p=35


(XEN) d3v0 IRR: 
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
(XEN) d3v0 ISR: 
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000


(XEN) *********** VMCS Areas **************
(XEN)
(XEN) >>> Domain 3 <<<
(XEN)   VCPU 0
(XEN) *** Guest State ***
(XEN) CR0: actual=0x0000000080000033, shadow=0x0000000080000033, 
gh_mask=ffffffffffffffff
(XEN) CR4: actual=0x0000000000002660, shadow=0x0000000000000020, 
gh_mask=ffffffffffc8f860
(XEN) CR3 = 0x00000000002e9000
(XEN) RSP = 0x000000000ffffec0 (0x000000000ffffec0)  RIP = 
0x0000000000209997 (0x0000000000209998)
(XEN) RFLAGS=0x00000297 (0x00000297)  DR7 = 0x0000000000000400
(XEN) Sysenter RSP=0000000000000000 CS:RIP=0000:0000000000000000
(XEN)        sel  attr  limit   base
(XEN)   CS: 0008 0a099 ffffffff 0000000000000000
(XEN)   DS: 0010 0c093 ffffffff 0000000000000000
(XEN)   SS: 0010 0c093 ffffffff 0000000000000000
(XEN)   ES: 0010 0c093 ffffffff 0000000000000000
(XEN)   FS: 0000 1c000 ffffffff 0000000000000000
(XEN)   GS: 0000 1c000 ffffffff 0000000000000000
(XEN) GDTR:            00000027 00000000003e13c0
(XEN) LDTR: 0000 10000 00000000 0000000000000000
(XEN) IDTR:            000002ff 00000000003e10a8
(XEN)   TR: 0018 0008b 00000068 00000000003e1040
(XEN) EFER = 0x0000000000000000  PAT = 0x0007040600070406
(XEN) PreemptionTimer = 0x00000000  SM Base = 0x00000000
(XEN) DebugCtl = 0x0000000000000000  DebugExceptions = 
0x0000000000000000
(XEN) Interruptibility = 00000000  ActivityState = 00000000
(XEN) *** Host State ***
(XEN) RIP = 0xffff82d08031df40 (vmx_asm_vmexit_handler)  RSP = 
0xffff83009c057f70
(XEN) CS=e008 SS=0000 DS=0000 ES=0000 FS=0000 GS=0000 TR=e040
(XEN) FSBase=0000000000000000 GSBase=0000000000000000 
TRBase=ffff82d08055e000
(XEN) GDTBase=ffff82d08042f000 IDTBase=ffff82d080559000
(XEN) CR0=0000000080050033 CR3=0000000171d9b000 CR4=00000000003526e0
(XEN) Sysenter RSP=ffff83009c057fa0 CS:RIP=e008:ffff82d0803664b0
(XEN) EFER = 0x0000000000000000  PAT = 0x0000050100070406
(XEN) *** Control State ***
(XEN) PinBased=0000003f CPUBased=b6a065fa SecondaryExec=000014eb
(XEN) EntryControls=000053ff ExitControls=000fefff
(XEN) ExceptionBitmap=00060002 PFECmask=00000000 PFECmatch=00000000
(XEN) VMEntry: intr_info=00000020 errcode=00000000 ilen=00000000
(XEN) VMExit: intr_info=00000000 errcode=00000000 ilen=00000001
(XEN)         reason=0000000c qualification=0000000000000000
(XEN) IDTVectoring: info=00000000 errcode=00000000
(XEN) TSC Offset = 0xffffffd2984c98e6  TSC Multiplier = 
0x0000000000000000
(XEN) TPR Threshold = 0x00  PostedIntrVec = 0x00
(XEN) EPT pointer = 0x000000016e5ac01e  EPTP index = 0x0000
(XEN) PLE Gap=00000080 Window=00001000
(XEN) Virtual processor ID = 0x000e VMfunc controls = 0000000000000000
(XEN) **************************************

Martin

> 
> Thanks, Roger.
> 
> [0]
> http://xenbits.xen.org/gitweb/?p=xen.git;a=patch;h=8cd9500958d818e3deabdd0d4164ea6fe1623d7c


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Event delivery and "domain blocking" on PVHv2
  2020-06-19 11:21         ` Andrew Cooper
@ 2020-06-19 16:46           ` Martin Lucina
  0 siblings, 0 replies; 24+ messages in thread
From: Martin Lucina @ 2020-06-19 16:46 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, mirageos-devel, Roger Pau Monné

On 2020-06-19 13:21, Andrew Cooper wrote:
> On 19/06/2020 11:28, Martin Lucina wrote:
>> RIP 0x209997 is the 'hlt' instruction in
>> mirage_xen_evtchn_block_domain() so we are indeed blocking waiting for
>> events to show up.
> 
> I can't find this in the code, but it might be an x86-ism you're 
> falling
> over here.

Solo5 only contains the lowest-level bits, and then only those parts
that "fit" the responsibility of that layer. The rest is here (WIP, not
cleaned up yet):

https://github.com/mato/mirage-xen/tree/xen-pvh-via-solo5

The event channel code, including the function that blocks the domain,
is in lib/bindings/evtchn_stubs.c.

> 
> Its not safe to use hlt with interrupts enabled, unless it is exactly
> `sti; hlt` where the STI instruction transitions the interrupt flag 
> from
> clear to set (i.e. you had interrupts disabled beforehand).
> 
> Otherwise you can take the interrupt intended to wake you on the before
> the hlt is executed.

Hmm, so what is the right thing to do when blocking on PVHv2 in this
scenario? Just use SCHEDOP_block? "cli; sti; hlt"? (Tried the latter,
doesn't help).

Martin


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Event delivery and "domain blocking" on PVHv2
  2020-06-19 16:41           ` Martin Lucina
@ 2020-06-19 16:54             ` Roger Pau Monné
  2020-06-19 17:42               ` Roger Pau Monné
  0 siblings, 1 reply; 24+ messages in thread
From: Roger Pau Monné @ 2020-06-19 16:54 UTC (permalink / raw)
  To: Martin Lucina; +Cc: Andrew Cooper, mirageos-devel, xen-devel

On Fri, Jun 19, 2020 at 06:41:21PM +0200, Martin Lucina wrote:
> On 2020-06-19 13:21, Roger Pau Monné wrote:
> > On Fri, Jun 19, 2020 at 12:28:50PM +0200, Martin Lucina wrote:
> > > On 2020-06-18 13:46, Roger Pau Monné wrote:
> > > > On Thu, Jun 18, 2020 at 12:13:30PM +0200, Martin Lucina wrote:
> > > > > At this point I don't really have a clear idea of how to progress,
> > > > > comparing my implementation side-by-side with the original PV
> > > > > Mini-OS-based
> > > > > implementation doesn't show up any differences I can see.
> > > > >
> > > > > AFAICT the OCaml code I've also not changed in any material way, and
> > > > > that
> > > > > has been running in production on PV for years, so I'd be inclined
> > > > > to think
> > > > > the problem is in my reimplementation of the C parts, but where...?
> > > >
> > > > A good start would be to print the ISR and IRR lapic registers when
> > > > blocked, to assert there are no pending vectors there.
> > > >
> > > > Can you apply the following patch to your Xen, rebuild and check the
> > > > output of the 'l' debug key?
> > > >
> > > > Also add the output of the 'v' key.
> > > 
> > > Had to fight the Xen Debian packages a bit as I wanted to patch the
> > > exact
> > > same Xen (there are some failures when building on a system that has
> > > Xen
> > > installed due to following symlinks when fixing shebangs).
> > > 
> > > Here you go, when stuck during netfront setup, after allocating its
> > > event
> > > channel, presumably waiting on Xenstore:
> > > 
> > > 'e':
> > > 
> > > (XEN) Event channel information for domain 3:
> > > (XEN) Polling vCPUs: {}
> > > (XEN)     port [p/m/s]
> > > (XEN)        1 [1/0/1]: s=3 n=0 x=0 d=0 p=33
> > > (XEN)        2 [1/1/1]: s=3 n=0 x=0 d=0 p=34
> > > (XEN)        3 [1/0/1]: s=5 n=0 x=0 v=0
> > > (XEN)        4 [0/1/1]: s=2 n=0 x=0 d=0
> > > 
> > > 'l':
> > > 
> > > (XEN) d3v0 IRR:
> > > ffff8301732dc200b
> > > (XEN) d3v0 ISR:
> > > ffff8301732dc100b
> > 
> > Which version of Xen is this? AFAICT it doesn't have the support to
> > print a bitmap.
> 
> That in Debian 10 (stable):
> 
> ii  xen-hypervisor-4.11-amd64            4.11.3+24-g14b62ab3e5-1~deb10u1.2
> amd64        Xen Hypervisor on AMD64
> 
> xen_major              : 4
> xen_minor              : 11
> xen_extra              : .4-pre
> xen_version            : 4.11.4-pre
> 
> > 
> > Do you think you could also pick commit
> > 8cd9500958d818e3deabdd0d4164ea6fe1623d7c [0] and rebuild? (and print
> > the info again).
> 
> Done, here you go:
> 
> (XEN) Event channel information for domain 3:
> (XEN) Polling vCPUs: {}
> (XEN)     port [p/m/s]
> (XEN)        1 [1/0/1]: s=3 n=0 x=0 d=0 p=33
> (XEN)        2 [1/1/1]: s=3 n=0 x=0 d=0 p=34
> (XEN)        3 [1/0/1]: s=5 n=0 x=0 v=0
> (XEN)        4 [0/1/1]: s=3 n=0 x=0 d=0 p=35
> 
> 
> (XEN) d3v0 IRR:
> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
> (XEN) d3v0 ISR:
> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000

So there's nothing pending on the lapic. Can you assert that you will
always execute evtchn_demux_pending after you have received an event
channel interrupt (ie: executed solo5__xen_evtchn_vector_handler)?

I think this would be simpler if you moved evtchn_demux_pending into
solo5__xen_evtchn_vector_handler? As there would be less asynchronous
processing, and thus likely fewer races?
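
Something along these lines (sketch only; I'm assuming the handler and
demux function names from your tree, and lapic_eoi() is a placeholder
for however you signal EOI):

    void solo5__xen_evtchn_vector_handler(void)
    {
        /* Demux directly in interrupt context: latch all pending event
         * channel ports (clearing evtchn_upcall_pending and the per-port
         * bits) before the vector is EOI'd. */
        evtchn_demux_pending();
        lapic_eoi();
    }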

Roger.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Event delivery and "domain blocking" on PVHv2
  2020-06-19 16:54             ` Roger Pau Monné
@ 2020-06-19 17:42               ` Roger Pau Monné
  2020-06-22 10:58                 ` Martin Lucina
  0 siblings, 1 reply; 24+ messages in thread
From: Roger Pau Monné @ 2020-06-19 17:42 UTC (permalink / raw)
  To: Martin Lucina; +Cc: Andrew Cooper, mirageos-devel, xen-devel

On Fri, Jun 19, 2020 at 06:54:26PM +0200, Roger Pau Monné wrote:
> On Fri, Jun 19, 2020 at 06:41:21PM +0200, Martin Lucina wrote:
> > On 2020-06-19 13:21, Roger Pau Monné wrote:
> > > On Fri, Jun 19, 2020 at 12:28:50PM +0200, Martin Lucina wrote:
> > > > On 2020-06-18 13:46, Roger Pau Monné wrote:
> > > > > On Thu, Jun 18, 2020 at 12:13:30PM +0200, Martin Lucina wrote:
> > > > > > At this point I don't really have a clear idea of how to progress,
> > > > > > comparing my implementation side-by-side with the original PV
> > > > > > Mini-OS-based
> > > > > > implementation doesn't show up any differences I can see.
> > > > > >
> > > > > > AFAICT the OCaml code I've also not changed in any material way, and
> > > > > > that
> > > > > > has been running in production on PV for years, so I'd be inclined
> > > > > > to think
> > > > > > the problem is in my reimplementation of the C parts, but where...?
> > > > >
> > > > > A good start would be to print the ISR and IRR lapic registers when
> > > > > blocked, to assert there are no pending vectors there.
> > > > >
> > > > > Can you apply the following patch to your Xen, rebuild and check the
> > > > > output of the 'l' debug key?
> > > > >
> > > > > Also add the output of the 'v' key.
> > > > 
> > > > Had to fight the Xen Debian packages a bit as I wanted to patch the
> > > > exact
> > > > same Xen (there are some failures when building on a system that has
> > > > Xen
> > > > installed due to following symlinks when fixing shebangs).
> > > > 
> > > > Here you go, when stuck during netfront setup, after allocating its
> > > > event
> > > > channel, presumably waiting on Xenstore:
> > > > 
> > > > 'e':
> > > > 
> > > > (XEN) Event channel information for domain 3:
> > > > (XEN) Polling vCPUs: {}
> > > > (XEN)     port [p/m/s]
> > > > (XEN)        1 [1/0/1]: s=3 n=0 x=0 d=0 p=33
> > > > (XEN)        2 [1/1/1]: s=3 n=0 x=0 d=0 p=34
> > > > (XEN)        3 [1/0/1]: s=5 n=0 x=0 v=0
> > > > (XEN)        4 [0/1/1]: s=2 n=0 x=0 d=0
> > > > 
> > > > 'l':
> > > > 
> > > > (XEN) d3v0 IRR:
> > > > ffff8301732dc200b
> > > > (XEN) d3v0 ISR:
> > > > ffff8301732dc100b
> > > 
> > > Which version of Xen is this? AFAICT it doesn't have the support to
> > > print a bitmap.
> > 
> > That in Debian 10 (stable):
> > 
> > ii  xen-hypervisor-4.11-amd64            4.11.3+24-g14b62ab3e5-1~deb10u1.2
> > amd64        Xen Hypervisor on AMD64
> > 
> > xen_major              : 4
> > xen_minor              : 11
> > xen_extra              : .4-pre
> > xen_version            : 4.11.4-pre
> > 
> > > 
> > > Do you think you could also pick commit
> > > 8cd9500958d818e3deabdd0d4164ea6fe1623d7c [0] and rebuild? (and print
> > > the info again).
> > 
> > Done, here you go:
> > 
> > (XEN) Event channel information for domain 3:
> > (XEN) Polling vCPUs: {}
> > (XEN)     port [p/m/s]
> > (XEN)        1 [1/0/1]: s=3 n=0 x=0 d=0 p=33
> > (XEN)        2 [1/1/1]: s=3 n=0 x=0 d=0 p=34
> > (XEN)        3 [1/0/1]: s=5 n=0 x=0 v=0
> > (XEN)        4 [0/1/1]: s=3 n=0 x=0 d=0 p=35
> > 
> > 
> > (XEN) d3v0 IRR:
> > 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
> > (XEN) d3v0 ISR:
> > 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
> 
> So there's nothing pending on the lapic. Can you assert that you will
> always execute evtchn_demux_pending after you have received an event
> channel interrupt (ie: executed solo5__xen_evtchn_vector_handler)?
> 
> I think this would be simpler if you moved evtchn_demux_pending into
> solo5__xen_evtchn_vector_handler? As there would be less asynchronous
> processing, and thus likely less races?

Having thought about this, I think this model of not demuxing in
solo5__xen_evtchn_vector_handler is always racy, as it's not possible
to assert that you would always call evtchn_demux_pending after
solo5__xen_evtchn_vector_handler?

Ie: if you receive an interrupt just before going to sleep (after the
sti and before the hlt) you will execute
solo5__xen_evtchn_vector_handler and EOI the vector, but then
evtchn_demux_pending will never get called, and thus the interrupts
will stay indefinitely pending?
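
Roughly, the problematic window is (illustration only):

    __asm__ __volatile__ ("sti");
    /* <-- event fires here: the vector is injected, the (no-op) handler
     *     runs and EOIs, but evtchn_demux_pending() is never called, so
     *     evtchn_upcall_pending stays set and AFAICT no further vector
     *     will be injected for it. */
    __asm__ __volatile__ ("hlt");   /* sleeps with events pending */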

Roger.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Event delivery and "domain blocking" on PVHv2
  2020-06-19 17:42               ` Roger Pau Monné
@ 2020-06-22 10:58                 ` Martin Lucina
  2020-06-22 13:58                   ` Roger Pau Monné
  0 siblings, 1 reply; 24+ messages in thread
From: Martin Lucina @ 2020-06-22 10:58 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: Andrew Cooper, mirageos-devel, xen-devel

On 2020-06-19 19:42, Roger Pau Monné wrote:
> On Fri, Jun 19, 2020 at 06:54:26PM +0200, Roger Pau Monné wrote:
>> On Fri, Jun 19, 2020 at 06:41:21PM +0200, Martin Lucina wrote:
>> > On 2020-06-19 13:21, Roger Pau Monné wrote:
>> > > On Fri, Jun 19, 2020 at 12:28:50PM +0200, Martin Lucina wrote:
>> > > > On 2020-06-18 13:46, Roger Pau Monné wrote:
>> > > > > On Thu, Jun 18, 2020 at 12:13:30PM +0200, Martin Lucina wrote:
>> > > > > > At this point I don't really have a clear idea of how to progress,
>> > > > > > comparing my implementation side-by-side with the original PV
>> > > > > > Mini-OS-based
>> > > > > > implementation doesn't show up any differences I can see.
>> > > > > >
>> > > > > > AFAICT the OCaml code I've also not changed in any material way, and
>> > > > > > that
>> > > > > > has been running in production on PV for years, so I'd be inclined
>> > > > > > to think
>> > > > > > the problem is in my reimplementation of the C parts, but where...?
>> > > > >
>> > > > > A good start would be to print the ISR and IRR lapic registers when
>> > > > > blocked, to assert there are no pending vectors there.
>> > > > >
>> > > > > Can you apply the following patch to your Xen, rebuild and check the
>> > > > > output of the 'l' debug key?
>> > > > >
>> > > > > Also add the output of the 'v' key.
>> > > >
>> > > > Had to fight the Xen Debian packages a bit as I wanted to patch the
>> > > > exact
>> > > > same Xen (there are some failures when building on a system that has
>> > > > Xen
>> > > > installed due to following symlinks when fixing shebangs).
>> > > >
>> > > > Here you go, when stuck during netfront setup, after allocating its
>> > > > event
>> > > > channel, presumably waiting on Xenstore:
>> > > >
>> > > > 'e':
>> > > >
>> > > > (XEN) Event channel information for domain 3:
>> > > > (XEN) Polling vCPUs: {}
>> > > > (XEN)     port [p/m/s]
>> > > > (XEN)        1 [1/0/1]: s=3 n=0 x=0 d=0 p=33
>> > > > (XEN)        2 [1/1/1]: s=3 n=0 x=0 d=0 p=34
>> > > > (XEN)        3 [1/0/1]: s=5 n=0 x=0 v=0
>> > > > (XEN)        4 [0/1/1]: s=2 n=0 x=0 d=0
>> > > >
>> > > > 'l':
>> > > >
>> > > > (XEN) d3v0 IRR:
>> > > > ffff8301732dc200b
>> > > > (XEN) d3v0 ISR:
>> > > > ffff8301732dc100b
>> > >
>> > > Which version of Xen is this? AFAICT it doesn't have the support to
>> > > print a bitmap.
>> >
>> > That in Debian 10 (stable):
>> >
>> > ii  xen-hypervisor-4.11-amd64            4.11.3+24-g14b62ab3e5-1~deb10u1.2
>> > amd64        Xen Hypervisor on AMD64
>> >
>> > xen_major              : 4
>> > xen_minor              : 11
>> > xen_extra              : .4-pre
>> > xen_version            : 4.11.4-pre
>> >
>> > >
>> > > Do you think you could also pick commit
>> > > 8cd9500958d818e3deabdd0d4164ea6fe1623d7c [0] and rebuild? (and print
>> > > the info again).
>> >
>> > Done, here you go:
>> >
>> > (XEN) Event channel information for domain 3:
>> > (XEN) Polling vCPUs: {}
>> > (XEN)     port [p/m/s]
>> > (XEN)        1 [1/0/1]: s=3 n=0 x=0 d=0 p=33
>> > (XEN)        2 [1/1/1]: s=3 n=0 x=0 d=0 p=34
>> > (XEN)        3 [1/0/1]: s=5 n=0 x=0 v=0
>> > (XEN)        4 [0/1/1]: s=3 n=0 x=0 d=0 p=35
>> >
>> >
>> > (XEN) d3v0 IRR:
>> > 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
>> > (XEN) d3v0 ISR:
>> > 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
>> 
>> So there's nothing pending on the lapic. Can you assert that you will
>> always execute evtchn_demux_pending after you have received an event
>> channel interrupt (ie: executed solo5__xen_evtchn_vector_handler)?
>> 
>> I think this would be simpler if you moved evtchn_demux_pending into
>> solo5__xen_evtchn_vector_handler? As there would be less asynchronous
>> processing, and thus likely less races?
> 
> Having though about this, I think this model of not demuxing in
> solo5__xen_evtchn_vector_handler is always racy, as it's not possible
> to assert that you would always call evtchn_demux_pending after
> solo5__xen_evtchn_vector_handler?
> 
> Ie: if you receive an interrupt just before going to sleep (after the
> sti and before the hlt) you will execute
> solo5__xen_evtchn_vector_handler and EOI the vector, but then
> evtchn_demux_pending will never get called, and thus the interrupts
> will stay indefinitely pending?

Aha! Thank you for pointing this out. I think you may be right, but this 
should be possible without doing the demuxing in interrupt context.

How about this arrangement, which appears to work for me; no hangs I can 
see so far and domU survives ping -f fine with no packet loss:

CAMLprim value
mirage_xen_evtchn_block_domain(value v_deadline)
{
     struct vcpu_info *vi = VCPU0_INFO();
     solo5_time_t deadline = Int64_val(v_deadline);

     if (solo5_clock_monotonic() < deadline) {
         __asm__ __volatile__ ("cli" : : : "memory");
         if (vi->evtchn_upcall_pending) {
             __asm__ __volatile__ ("sti");
         }
         else {
             hypercall_set_timer_op(deadline);
             __asm__ __volatile__ ("sti; hlt");
         }
     }
     return Val_unit;
}

i.e. Always go to sleep with interrupts disabled, but before doing so 
re-check that no events have become pending since the last time 
evtchn_demux_pending() was called. This holds, since the only thing that 
sets vi->evtchn_upcall_pending is Xen, and the only thing that clears it 
is evtchn_demux_pending().

Right?

In an attempt to understand why the original PV code worked I re-read 
the PV Mini-OS block_domain code again and realised that I had entirely 
missed one part of its behaviour, which is that it intends[*] to run 
with interrupts/upcalls disabled *all* the time and relies on 
SCHEDOP_block atomically re-enabling them and triggering an upcall 
before returning (PV) or "briefly enabling interrupts to allow handlers 
to run" (HVM). We're doing the inverse, but our behaviour matches my 
mental model of how things should work.

[*] AFAICT there's a bug in Mini-OS as ASSERT(irqs_disabled) is a no-op, 
and block_domain is called with upcalls/interrupts enabled the first 
time round. But I'm not 100% sure, and that code is a twisty little maze 
of #ifdefs all alike.

Martin

> Roger.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Event delivery and "domain blocking" on PVHv2
  2020-06-22 10:58                 ` Martin Lucina
@ 2020-06-22 13:58                   ` Roger Pau Monné
  2020-06-22 15:31                     ` Martin Lucina
  0 siblings, 1 reply; 24+ messages in thread
From: Roger Pau Monné @ 2020-06-22 13:58 UTC (permalink / raw)
  To: Martin Lucina; +Cc: Andrew Cooper, mirageos-devel, xen-devel

On Mon, Jun 22, 2020 at 12:58:37PM +0200, Martin Lucina wrote:
> On 2020-06-19 19:42, Roger Pau Monné wrote:
> > On Fri, Jun 19, 2020 at 06:54:26PM +0200, Roger Pau Monné wrote:
> > > On Fri, Jun 19, 2020 at 06:41:21PM +0200, Martin Lucina wrote:
> > > > On 2020-06-19 13:21, Roger Pau Monné wrote:
> > > > > On Fri, Jun 19, 2020 at 12:28:50PM +0200, Martin Lucina wrote:
> > > > > > On 2020-06-18 13:46, Roger Pau Monné wrote:
> > > > > > > On Thu, Jun 18, 2020 at 12:13:30PM +0200, Martin Lucina wrote:
> > > > > > > > At this point I don't really have a clear idea of how to progress,
> > > > > > > > comparing my implementation side-by-side with the original PV
> > > > > > > > Mini-OS-based
> > > > > > > > implementation doesn't show up any differences I can see.
> > > > > > > >
> > > > > > > > AFAICT the OCaml code I've also not changed in any material way, and
> > > > > > > > that
> > > > > > > > has been running in production on PV for years, so I'd be inclined
> > > > > > > > to think
> > > > > > > > the problem is in my reimplementation of the C parts, but where...?
> > > > > > >
> > > > > > > A good start would be to print the ISR and IRR lapic registers when
> > > > > > > blocked, to assert there are no pending vectors there.
> > > > > > >
> > > > > > > Can you apply the following patch to your Xen, rebuild and check the
> > > > > > > output of the 'l' debug key?
> > > > > > >
> > > > > > > Also add the output of the 'v' key.
> > > > > >
> > > > > > Had to fight the Xen Debian packages a bit as I wanted to patch the
> > > > > > exact
> > > > > > same Xen (there are some failures when building on a system that has
> > > > > > Xen
> > > > > > installed due to following symlinks when fixing shebangs).
> > > > > >
> > > > > > Here you go, when stuck during netfront setup, after allocating its
> > > > > > event
> > > > > > channel, presumably waiting on Xenstore:
> > > > > >
> > > > > > 'e':
> > > > > >
> > > > > > (XEN) Event channel information for domain 3:
> > > > > > (XEN) Polling vCPUs: {}
> > > > > > (XEN)     port [p/m/s]
> > > > > > (XEN)        1 [1/0/1]: s=3 n=0 x=0 d=0 p=33
> > > > > > (XEN)        2 [1/1/1]: s=3 n=0 x=0 d=0 p=34
> > > > > > (XEN)        3 [1/0/1]: s=5 n=0 x=0 v=0
> > > > > > (XEN)        4 [0/1/1]: s=2 n=0 x=0 d=0
> > > > > >
> > > > > > 'l':
> > > > > >
> > > > > > (XEN) d3v0 IRR:
> > > > > > ffff8301732dc200b
> > > > > > (XEN) d3v0 ISR:
> > > > > > ffff8301732dc100b
> > > > >
> > > > > Which version of Xen is this? AFAICT it doesn't have the support to
> > > > > print a bitmap.
> > > >
> > > > That in Debian 10 (stable):
> > > >
> > > > ii  xen-hypervisor-4.11-amd64            4.11.3+24-g14b62ab3e5-1~deb10u1.2
> > > > amd64        Xen Hypervisor on AMD64
> > > >
> > > > xen_major              : 4
> > > > xen_minor              : 11
> > > > xen_extra              : .4-pre
> > > > xen_version            : 4.11.4-pre
> > > >
> > > > >
> > > > > Do you think you could also pick commit
> > > > > 8cd9500958d818e3deabdd0d4164ea6fe1623d7c [0] and rebuild? (and print
> > > > > the info again).
> > > >
> > > > Done, here you go:
> > > >
> > > > (XEN) Event channel information for domain 3:
> > > > (XEN) Polling vCPUs: {}
> > > > (XEN)     port [p/m/s]
> > > > (XEN)        1 [1/0/1]: s=3 n=0 x=0 d=0 p=33
> > > > (XEN)        2 [1/1/1]: s=3 n=0 x=0 d=0 p=34
> > > > (XEN)        3 [1/0/1]: s=5 n=0 x=0 v=0
> > > > (XEN)        4 [0/1/1]: s=3 n=0 x=0 d=0 p=35
> > > >
> > > >
> > > > (XEN) d3v0 IRR:
> > > > 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
> > > > (XEN) d3v0 ISR:
> > > > 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
> > > 
> > > So there's nothing pending on the lapic. Can you assert that you will
> > > always execute evtchn_demux_pending after you have received an event
> > > channel interrupt (ie: executed solo5__xen_evtchn_vector_handler)?
> > > 
> > > I think this would be simpler if you moved evtchn_demux_pending into
> > > solo5__xen_evtchn_vector_handler? As there would be less asynchronous
> > > processing, and thus likely less races?
> > 
> > Having though about this, I think this model of not demuxing in
> > solo5__xen_evtchn_vector_handler is always racy, as it's not possible
> > to assert that you would always call evtchn_demux_pending after
> > solo5__xen_evtchn_vector_handler?
> > 
> > Ie: if you receive an interrupt just before going to sleep (after the
> > sti and before the hlt) you will execute
> > solo5__xen_evtchn_vector_handler and EOI the vector, but then
> > evtchn_demux_pending will never get called, and thus the interrupts
> > will stay indefinitely pending?
> 
> Aha! Thank you for pointing this out. I think you may be right, but this
> should be possible without doing the demuxing in interrupt context.

If you don't do the demuxing in the interrupt context (ie: making the
interrupt handler a noop), then you likely don't need such an interrupt
anyway?

> How about this arrangement, which appears to work for me; no hangs I can see
> so far and domU survives ping -f fine with no packet loss:
> 
> CAMLprim value
> mirage_xen_evtchn_block_domain(value v_deadline)
> {
>     struct vcpu_info *vi = VCPU0_INFO();
>     solo5_time_t deadline = Int64_val(v_deadline);
> 
>     if (solo5_clock_monotonic() < deadline) {
>         __asm__ __volatile__ ("cli" : : : "memory");
>         if (vi->evtchn_upcall_pending) {
>             __asm__ __volatile__ ("sti");
>         }
>         else {
>             hypercall_set_timer_op(deadline);

What if you set a deadline so close that evtchn_upcall_pending gets
set by Xen here and the interrupt is injected? You would execute the
noop handler and just hlt, and could likely end up in the same blocked
situation as before?

>             __asm__ __volatile__ ("sti; hlt");
>         }
>     }
>     return Val_unit;
> }
> 
> i.e. Always go to sleep with interrupts disabled, but before doing so
> re-check that no events have become pending since the last time
> evtchn_demux_pending() was called. This holds, since the only thing that
> sets vi->evtchn_upcall_pending is Xen, and the only thing that clears it is
> evtchn_demux_pending().
> 
> Right?

TBH this is a hard model to get right, I think your best bet at
attempting something along this lines is to forget about using the
event channel interrupt and instead use SCHEDOP_poll. You could do
something like (written in pure C as I have no idea of the ocaml
bindings):

int
mirage_xen_evtchn_block_domain(uint64_t timeout)
{
    evtchn_port_t ports[MAX_PORTS];
    struct sched_poll poll = {
        .timeout = timeout,
        .nr_ports = 0,
    };

    set_xen_guest_handle(poll.ports, ports);

    /* Fill ports you care about (ie: all event channel ports in use) */
    ports[poll.nr_ports++] = port_1;
    ports[poll.nr_ports++] = port_2;
    [...] /* Check that you don't overrun MAX_PORTS */

    /* On return demux events and call timer handler if timeout expired. */
    return hypercall_sched_op(SCHEDOP_poll, &poll);
}

Doing something like the above you could forget about setting up the
event channel interrupt and the timer.

>             __asm__ __volatile__ ("sti; hlt");
>         }
>     }
>     return Val_unit;
> }
> 
> In an attempt to understand why the original PV code worked I re-read the PV
> Mini-OS block_domain code again and realised that I had entirely missed one
> part of its behaviour, which is that it intends[*] to run with
> interrupts/upcalls disabled *all* the time and relies on SCHEDOP_block
> atomically re-enabling them and triggering an upcall before returning (PV)
> or "briefly enabling interrupts to allow handlers to run" (HVM). We're doing
> the inverse, but our behaviour matches my mental model of how things should
> work.

Not really IMO, as SCHEDOP_block is a single 'instruction' from a
guest PoV that does the enabling of interrupts and returns if there
are pending ones.

Also SCHEDOP_block is not exactly the same on HVM, as it just checks
for pending vectors to inject, but not for pending event channels. On
HVM you cannot call hlt with interrupts disabled, or the vCPU will be
taken down.

There are quite a lot of subtle differences between PV and HVM in this
regard, and I think the best approach would be to use SCHEDOP_poll in
order to implement the kind of model you describe.

Roger.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Event delivery and "domain blocking" on PVHv2
  2020-06-22 13:58                   ` Roger Pau Monné
@ 2020-06-22 15:31                     ` Martin Lucina
  2020-06-22 15:36                       ` Jan Beulich
  2020-06-22 16:09                       ` Roger Pau Monné
  0 siblings, 2 replies; 24+ messages in thread
From: Martin Lucina @ 2020-06-22 15:31 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: Andrew Cooper, mirageos-devel, xen-devel

On 2020-06-22 15:58, Roger Pau Monné wrote:
> On Mon, Jun 22, 2020 at 12:58:37PM +0200, Martin Lucina wrote:
>> Aha! Thank you for pointing this out. I think you may be right, but 
>> this
>> should be possible without doing the demuxing in interrupt context.
> 
> If you don't do the demuxing in the interrupt context (ie: making the
> interrupt handler a noop), then you don't likely need such interrupt
> anyway?

I need the/an interrupt to wake the VCPU from HLT state if we went to 
sleep.

> 
>> How about this arrangement, which appears to work for me; no hangs I 
>> can see
>> so far and domU survives ping -f fine with no packet loss:
>> 
>> CAMLprim value
>> mirage_xen_evtchn_block_domain(value v_deadline)
>> {
>>     struct vcpu_info *vi = VCPU0_INFO();
>>     solo5_time_t deadline = Int64_val(v_deadline);
>> 
>>     if (solo5_clock_monotonic() < deadline) {
>>         __asm__ __volatile__ ("cli" : : : "memory");
>>         if (vi->evtchn_upcall_pending) {
>>             __asm__ __volatile__ ("sti");
>>         }
>>         else {
>>             hypercall_set_timer_op(deadline);
> 
> What if you set a deadline so close that evtchn_upcall_pending gets
> set by Xen here and the interrupt is injected? You would execute the
> noop handler and just hlt, and could likely end up in the same blocked
> situation as before?

Why would an interrupt be injected here? Doesn't the immediately 
preceding
"cli" disable that?

Or perhaps I need to do a PV/HVM hybrid and set vi->evtchn_upcall_mask 
just
before the cli, and clear it after the sti?

>> i.e. Always go to sleep with interrupts disabled, but before doing so
>> re-check that no events have become pending since the last time
>> evtchn_demux_pending() was called. This holds, since the only thing 
>> that
>> sets vi->evtchn_upcall_pending is Xen, and the only thing that clears 
>> it is
>> evtchn_demux_pending().
>> 
>> Right?
> 
> TBH this is a hard model to get right, I think your best bet at
> attempting something along this lines is to forget about using the
> event channel interrupt and instead use SCHEDOP_poll. You could do
> something like (written in pure C as I have no idea of the ocaml
> bindings):
> [SCHEDOP_poll code snipped]

Thanks for the suggestion. This brings us full-circle -- I found [1] and
[2] way back from 2013 when Mirage/Xen was initially using SCHEDOP_poll
and then switched to the current interrupts + SCHEDOP_block approach.

Part of the motivation for the change at the time was to allow waiting
on/servicing more than 128 ports (the limit for SCHEDOP_poll). I doubt
anyone wants to do that these days, but it still makes me a bit 
reluctant
to change back to SCHEDOP_poll.

>> In an attempt to understand why the original PV code worked I re-read 
>> the PV
>> Mini-OS block_domain code again and realised that I had entirely 
>> missed one
>> part of its behaviour, which is that it intends[*] to run with
>> interrupts/upcalls disabled *all* the time and relies on SCHEDOP_block
>> atomically re-enabling them and triggering an upcall before returning 
>> (PV)
>> or "briefly enabling interrupts to allow handlers to run" (HVM). We're 
>> doing
>> the inverse, but our behaviour matches my mental model of how things 
>> should
>> work.
> 
> Not really IMO, as SCHEDOP_block is a single 'instruction' from a
> guest PoV that does the enabling of interrupts and returns if there
> are pending ones.

Indeed and this is exactly the operation I need in the HVM world with 
the
current model.

> Also SCHEDOP_block is not exactly the same on HVM, as it just checks
> for pending vectors to inject, but not for pending event channels.

... well, I don't want to use SCHEDOP_block anyway since that is not 
possible
on ARM, and I would like to keep the code at least "portable with not 
too
much effort". So there should be a non-racy "HVM way" to do this?

> HVM you cannot call hlt with interrupts disabled, or the vCPU will be
> taken down.
> 
> There are quite a lot of subtle differences between PV and HVM in this
> regard, and I think the best approach would be to use SCHEDOP_poll in
> order to implement the kind of model you describe.

I can see that now. The interactions between the "virtual hardware"
(interrupt delivery, VCPU IF) and "PV" parts (event delivery, masking) 
are
hard to understand for me, yet the two are intertwined in the way HVM
works :-(

Coming back to your earlier suggestion of moving the event demuxing (but
not the handling) into the upcall interrupt handler itself, perhaps that
approach is still worth investigating, in combination with re-working
the OCaml-side array view of pending events, or at least ensuring that
atomic operations are used, since it would now be updated
asynchronously.
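
For the atomic part I have in mind something like this (sketch only;
'pending_ports' is a placeholder for whatever bitmap the OCaml side
currently inspects, not an existing name in the tree):

    /* Shared with the OCaml side; placeholder name and layout: */
    static unsigned long pending_ports[4096 / (8 * sizeof(unsigned long))];

    /* In the demux path, now running in interrupt context: */
    __atomic_fetch_or(&pending_ports[port / (8 * sizeof(unsigned long))],
                      1UL << (port % (8 * sizeof(unsigned long))),
                      __ATOMIC_SEQ_CST);

    /* In the main loop, consume and clear a word atomically: */
    unsigned long pending =
        __atomic_exchange_n(&pending_ports[i], 0UL, __ATOMIC_SEQ_CST);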

Martin

[1] 
https://lists.cam.ac.uk/pipermail/cl-mirage/2013-September/msg00053.html
[2] https://github.com/mirage/mirage-platform/pull/58


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Event delivery and "domain blocking" on PVHv2
  2020-06-22 15:31                     ` Martin Lucina
@ 2020-06-22 15:36                       ` Jan Beulich
  2020-06-22 16:09                       ` Roger Pau Monné
  1 sibling, 0 replies; 24+ messages in thread
From: Jan Beulich @ 2020-06-22 15:36 UTC (permalink / raw)
  To: Martin Lucina
  Cc: Andrew Cooper, xen-devel, mirageos-devel, Roger Pau Monné

On 22.06.2020 17:31, Martin Lucina wrote:
> On 2020-06-22 15:58, Roger Pau Monné wrote:
>> On Mon, Jun 22, 2020 at 12:58:37PM +0200, Martin Lucina wrote:
>>> How about this arrangement, which appears to work for me; no hangs I 
>>> can see
>>> so far and domU survives ping -f fine with no packet loss:
>>>
>>> CAMLprim value
>>> mirage_xen_evtchn_block_domain(value v_deadline)
>>> {
>>>     struct vcpu_info *vi = VCPU0_INFO();
>>>     solo5_time_t deadline = Int64_val(v_deadline);
>>>
>>>     if (solo5_clock_monotonic() < deadline) {
>>>         __asm__ __volatile__ ("cli" : : : "memory");
>>>         if (vi->evtchn_upcall_pending) {
>>>             __asm__ __volatile__ ("sti");
>>>         }
>>>         else {
>>>             hypercall_set_timer_op(deadline);
>>
>> What if you set a deadline so close that evtchn_upcall_pending gets
>> set by Xen here and the interrupt is injected? You would execute the
>> noop handler and just hlt, and could likely end up in the same blocked
>> situation as before?
> 
> Why would an interrupt be injected here? Doesn't the immediately 
> preceding
> "cli" disable that?
> 
> Or perhaps I need to do a PV/HVM hybrid and set vi->evtchn_upcall_mask 
> just
> before the cli, and clear it after the sti?

evtchn_upcall_mask is a strictly PV-only thing. See e.g. the code
comment in hvm_set_callback_irq_level().

Jan


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Event delivery and "domain blocking" on PVHv2
  2020-06-22 15:31                     ` Martin Lucina
  2020-06-22 15:36                       ` Jan Beulich
@ 2020-06-22 16:09                       ` Roger Pau Monné
  2020-06-22 16:20                         ` Jan Beulich
  1 sibling, 1 reply; 24+ messages in thread
From: Roger Pau Monné @ 2020-06-22 16:09 UTC (permalink / raw)
  To: Martin Lucina; +Cc: Andrew Cooper, mirageos-devel, xen-devel

On Mon, Jun 22, 2020 at 05:31:00PM +0200, Martin Lucina wrote:
> On 2020-06-22 15:58, Roger Pau Monné wrote:
> > On Mon, Jun 22, 2020 at 12:58:37PM +0200, Martin Lucina wrote:
> > > Aha! Thank you for pointing this out. I think you may be right, but
> > > this
> > > should be possible without doing the demuxing in interrupt context.
> > 
> > If you don't do the demuxing in the interrupt context (ie: making the
> > interrupt handler a noop), then you don't likely need such interrupt
> > anyway?
> 
> I need the/an interrupt to wake the VCPU from HLT state if we went to sleep.
> 
> > 
> > > How about this arrangement, which appears to work for me; no hangs I
> > > can see
> > > so far and domU survives ping -f fine with no packet loss:
> > > 
> > > CAMLprim value
> > > mirage_xen_evtchn_block_domain(value v_deadline)
> > > {
> > >     struct vcpu_info *vi = VCPU0_INFO();
> > >     solo5_time_t deadline = Int64_val(v_deadline);
> > > 
> > >     if (solo5_clock_monotonic() < deadline) {
> > >         __asm__ __volatile__ ("cli" : : : "memory");
> > >         if (vi->evtchn_upcall_pending) {
> > >             __asm__ __volatile__ ("sti");
> > >         }
> > >         else {
> > >             hypercall_set_timer_op(deadline);
> > 
> > What if you set a deadline so close that evtchn_upcall_pending gets
> > set by Xen here and the interrupt is injected? You would execute the
> > noop handler and just hlt, and could likely end up in the same blocked
> > situation as before?
> 
> Why would an interrupt be injected here? Doesn't the immediately preceding
> "cli" disable that?

Well, I mean between the sti and the hlt instruction. I think there's
always a window for a race here, and that's the reason for having
SCHEDOP_block (see comment referring to avoiding a "wakeup waiting"
race).

> Or perhaps I need to do a PV/HVM hybrid and set vi->evtchn_upcall_mask just
> before the cli, and clear it after the sti?

I think SCHEDOP_block is broken on HVM since, as Jan points out,
evtchn_upcall_mask is not used on HVM (we should really have a comment
about this in xen.h). So that hypercall is no longer useful to avoid
wakeup waiting races.

> > > i.e. Always go to sleep with interrupts disabled, but before doing so
> > > re-check that no events have become pending since the last time
> > > evtchn_demux_pending() was called. This holds, since the only thing
> > > that
> > > sets vi->evtchn_upcall_pending is Xen, and the only thing that
> > > clears it is
> > > evtchn_demux_pending().
> > > 
> > > Right?
> > 
> > TBH this is a hard model to get right, I think your best bet at
> > attempting something along this lines is to forget about using the
> > event channel interrupt and instead use SCHEDOP_poll. You could do
> > something like (written in pure C as I have no idea of the ocaml
> > bindings):
> > [SCHEDOP_poll code snipped]
> 
> Thanks for the suggestion. This brings us full-circle -- I found [1] and
> [2] way back from 2013 when Mirage/Xen was initially using SCHEDOP_poll
> and then switched to the current interrupts + SCHEDOP_block approach.
> 
> Part of the motivation for the change at the time was to allow waiting
> on/servicing more than 128 ports (the limit for SCHEDOP_poll). I doubt
> anyone wants to do that these days, but it still makes me a bit reluctant
> to change back to SCHEDOP_poll.

Right, Mirage/Xen being single-processor, it's not likely to use more
than 128 event channels, but I can understand the desire to not be
limited by that amount.

> > > In an attempt to understand why the original PV code worked I
> > > re-read the PV
> > > Mini-OS block_domain code again and realised that I had entirely
> > > missed one
> > > part of its behaviour, which is that it intends[*] to run with
> > > interrupts/upcalls disabled *all* the time and relies on SCHEDOP_block
> > > atomically re-enabling them and triggering an upcall before
> > > returning (PV)
> > > or "briefly enabling interrupts to allow handlers to run" (HVM).
> > > We're doing
> > > the inverse, but our behaviour matches my mental model of how things
> > > should
> > > work.
> > 
> > Not really IMO, as SCHEDOP_block is a single 'instruction' from a
> > guest PoV that does the enabling of interrupts and returns if there
> > are pending ones.
> 
> Indeed and this is exactly the operation I need in the HVM world with the
> current model.

Another option would be to consider re-purposing SCHEDOP_block for HVM
so it does a sti on behalf of the guest and then checks for pending
interrupts to inject.

> > Also SCHEDOP_block is not exactly the same on HVM, as it just checks
> > for pending vectors to inject, but not for pending event channels.
> 
> ... well, I don't want to use SCHEDOP_block anyway since that is not
> possible on ARM, and I would like to keep the code at least "portable with
> not too much effort". So there should be a non-racy "HVM way" to do this?

One solution (albeit a slightly crappy one IMO) is to make sure the timer
is always fully handled in interrupt context, so that you _never_ call hlt
with a timer event pending. That way you are always guaranteed to be woken
up.

> > HVM you cannot call hlt with interrupts disabled, or the vCPU will be
> > taken down.
> > 
> > There are quite a lot of subtle differences between PV and HVM in this
> > regard, and I think the best approach would be to use SCHEDOP_poll in
> > order to implement the kind of model you describe.
> 
> I can see that now. The interactions between the "virtual hardware"
> (interrupt delivery, VCPU IF) and "PV" parts (event delivery, masking) are
> hard for me to understand, yet the two are intertwined in the way HVM
> works :-(
> 
> Coming back to your earlier suggestion of moving the event demuxing (but
> not the handling) into the upcall interrupt handler itself, perhaps that
> approach is still worth investigating in combination with re-working the
> OCaml-side array view of pending events, or at least ensuring that atomic
> operations are used, since it would now be updated asynchronously.

I assume you must be doing something similar for KVM in Solo5, where
you handle (at least certain) interrupts in interrupt context, or else
the same issues would arise?
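
FWIW, if you do go down that route, the "updated asynchronously" part could
be as small as this kind of sketch (C11 atomics, hypothetical names and
sizing):

    #include <stdatomic.h>

    #define EVTCHN_WORDS  4                     /* hypothetical sizing */
    #define BITS_PER_WORD (8 * sizeof(unsigned long))

    static _Atomic unsigned long pending_words[EVTCHN_WORDS];

    /* Called from the upcall handler: just record the port and return. */
    static void record_pending(unsigned int port)
    {
        atomic_fetch_or_explicit(&pending_words[port / BITS_PER_WORD],
                                 1UL << (port % BITS_PER_WORD),
                                 memory_order_release);
    }

    /* Called from the main loop: take and clear a whole word in one go. */
    static unsigned long take_pending(unsigned int word)
    {
        return atomic_exchange_explicit(&pending_words[word], 0,
                                        memory_order_acquire);
    }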

Roger.



* Re: Event delivery and "domain blocking" on PVHv2
  2020-06-22 16:09                       ` Roger Pau Monné
@ 2020-06-22 16:20                         ` Jan Beulich
  2020-06-22 16:26                           ` Martin Lucina
  2020-06-22 16:36                           ` Roger Pau Monné
  0 siblings, 2 replies; 24+ messages in thread
From: Jan Beulich @ 2020-06-22 16:20 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Andrew Cooper, Martin Lucina, mirageos-devel, xen-devel

On 22.06.2020 18:09, Roger Pau Monné wrote:
> On Mon, Jun 22, 2020 at 05:31:00PM +0200, Martin Lucina wrote:
>> On 2020-06-22 15:58, Roger Pau Monné wrote:
>>> On Mon, Jun 22, 2020 at 12:58:37PM +0200, Martin Lucina wrote:
>>>> Aha! Thank you for pointing this out. I think you may be right, but
>>>> this
>>>> should be possible without doing the demuxing in interrupt context.
>>>
>>> If you don't do the demuxing in the interrupt context (ie: making the
>>> interrupt handler a noop), then you don't likely need such interrupt
>>> anyway?
>>
>> I need the/an interrupt to wake the VCPU from HLT state if we went to sleep.
>>
>>>
>>>> How about this arrangement, which appears to work for me; no hangs I
>>>> can see
>>>> so far and domU survives ping -f fine with no packet loss:
>>>>
>>>> CAMLprim value
>>>> mirage_xen_evtchn_block_domain(value v_deadline)
>>>> {
>>>>     struct vcpu_info *vi = VCPU0_INFO();
>>>>     solo5_time_t deadline = Int64_val(v_deadline);
>>>>
>>>>     if (solo5_clock_monotonic() < deadline) {
>>>>         __asm__ __volatile__ ("cli" : : : "memory");
>>>>         if (vi->evtchn_upcall_pending) {
>>>>             __asm__ __volatile__ ("sti");
>>>>         }
>>>>         else {
>>>>             hypercall_set_timer_op(deadline);
>>>
>>> What if you set a deadline so close that evtchn_upcall_pending gets
>>> set by Xen here and the interrupt is injected? You would execute the
>>> noop handler and just hlt, and could likely end up in the same blocked
>>> situation as before?
>>
>> Why would an interrupt be injected here? Doesn't the immediately preceding
>> "cli" disable that?
> 
> Well, I mean between the sti and the hlt instruction.

When EFLAGS.IF was clear before STI, then the first point at which
an interrupt can get injected is when HLT is already executed (i.e.
to wake from this HLT). That's the so-called "STI shadow".
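
Hence the usual way to use this: keep the STI and the HLT adjacent, e.g. (a
sketch re-using the names from the snippet quoted above, not the exact
bindings code):

    __asm__ __volatile__ ("cli" : : : "memory");
    if (vi->evtchn_upcall_pending) {
        __asm__ __volatile__ ("sti");
    } else {
        hypercall_set_timer_op(deadline);
        /* The STI takes effect only after the following instruction, so no
         * interrupt can be accepted between it and the HLT. */
        __asm__ __volatile__ ("sti; hlt" : : : "memory");
    }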

Jan



* Re: Event delivery and "domain blocking" on PVHv2
  2020-06-22 16:20                         ` Jan Beulich
@ 2020-06-22 16:26                           ` Martin Lucina
  2020-06-22 16:36                           ` Roger Pau Monné
  1 sibling, 0 replies; 24+ messages in thread
From: Martin Lucina @ 2020-06-22 16:26 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, xen-devel, mirageos-devel, Roger Pau Monné

On 2020-06-22 18:20, Jan Beulich wrote:
> On 22.06.2020 18:09, Roger Pau Monné wrote:
>> On Mon, Jun 22, 2020 at 05:31:00PM +0200, Martin Lucina wrote:
>>> On 2020-06-22 15:58, Roger Pau Monné wrote:
>>>> On Mon, Jun 22, 2020 at 12:58:37PM +0200, Martin Lucina wrote:
>>>>> Aha! Thank you for pointing this out. I think you may be right, but
>>>>> this
>>>>> should be possible without doing the demuxing in interrupt context.
>>>> 
>>>> If you don't do the demuxing in the interrupt context (ie: making 
>>>> the
>>>> interrupt handler a noop), then you don't likely need such interrupt
>>>> anyway?
>>> 
>>> I need the/an interrupt to wake the VCPU from HLT state if we went to 
>>> sleep.
>>> 
>>>> 
>>>>> How about this arrangement, which appears to work for me; no hangs 
>>>>> I
>>>>> can see
>>>>> so far and domU survives ping -f fine with no packet loss:
>>>>> 
>>>>> CAMLprim value
>>>>> mirage_xen_evtchn_block_domain(value v_deadline)
>>>>> {
>>>>>     struct vcpu_info *vi = VCPU0_INFO();
>>>>>     solo5_time_t deadline = Int64_val(v_deadline);
>>>>> 
>>>>>     if (solo5_clock_monotonic() < deadline) {
>>>>>         __asm__ __volatile__ ("cli" : : : "memory");
>>>>>         if (vi->evtchn_upcall_pending) {
>>>>>             __asm__ __volatile__ ("sti");
>>>>>         }
>>>>>         else {
>>>>>             hypercall_set_timer_op(deadline);
>>>> 
>>>> What if you set a deadline so close that evtchn_upcall_pending gets
>>>> set by Xen here and the interrupt is injected? You would execute the
>>>> noop handler and just hlt, and could likely end up in the same 
>>>> blocked
>>>> situation as before?
>>> 
>>> Why would an interrupt be injected here? Doesn't the immediately 
>>> preceding
>>> "cli" disable that?
>> 
>> Well, I mean between the sti and the hlt instruction.
> 
> When EFLAGS.IF was clear before STI, then the first point at which
> an interrupt can get injected is when HLT is already executed (i.e.
> to wake from this HLT). That's the so-called "STI shadow".

Indeed, that's what the Intel SDM says, and Andrew already mentioned it
earlier in this thread in a different context, here:
https://lists.xenproject.org/archives/html/mirageos-devel/2020-06/msg00021.html

So it would seem that my latest approach is race-free?

Martin



* Re: Event delivery and "domain blocking" on PVHv2
  2020-06-22 16:20                         ` Jan Beulich
  2020-06-22 16:26                           ` Martin Lucina
@ 2020-06-22 16:36                           ` Roger Pau Monné
  1 sibling, 0 replies; 24+ messages in thread
From: Roger Pau Monné @ 2020-06-22 16:36 UTC (permalink / raw)
  To: Jan Beulich, Martin Lucina; +Cc: Andrew Cooper, mirageos-devel, xen-devel

On Mon, Jun 22, 2020 at 06:20:34PM +0200, Jan Beulich wrote:
> On 22.06.2020 18:09, Roger Pau Monné wrote:
> > On Mon, Jun 22, 2020 at 05:31:00PM +0200, Martin Lucina wrote:
> >> On 2020-06-22 15:58, Roger Pau Monné wrote:
> >>> On Mon, Jun 22, 2020 at 12:58:37PM +0200, Martin Lucina wrote:
> >>>> Aha! Thank you for pointing this out. I think you may be right, but
> >>>> this
> >>>> should be possible without doing the demuxing in interrupt context.
> >>>
> >>> If you don't do the demuxing in the interrupt context (ie: making the
> >>> interrupt handler a noop), then you don't likely need such interrupt
> >>> anyway?
> >>
> >> I need the/an interrupt to wake the VCPU from HLT state if we went to sleep.
> >>
> >>>
> >>>> How about this arrangement, which appears to work for me; no hangs I
> >>>> can see
> >>>> so far and domU survives ping -f fine with no packet loss:
> >>>>
> >>>> CAMLprim value
> >>>> mirage_xen_evtchn_block_domain(value v_deadline)
> >>>> {
> >>>>     struct vcpu_info *vi = VCPU0_INFO();
> >>>>     solo5_time_t deadline = Int64_val(v_deadline);
> >>>>
> >>>>     if (solo5_clock_monotonic() < deadline) {
> >>>>         __asm__ __volatile__ ("cli" : : : "memory");
> >>>>         if (vi->evtchn_upcall_pending) {
> >>>>             __asm__ __volatile__ ("sti");
> >>>>         }
> >>>>         else {
> >>>>             hypercall_set_timer_op(deadline);
> >>>
> >>> What if you set a deadline so close that evtchn_upcall_pending gets
> >>> set by Xen here and the interrupt is injected? You would execute the
> >>> noop handler and just hlt, and could likely end up in the same blocked
> >>> situation as before?
> >>
> >> Why would an interrupt be injected here? Doesn't the immediately preceding
> >> "cli" disable that?
> > 
> > Well, I mean between the sti and the hlt instruction.
> 
> When EFLAGS.IF was clear before STI, then the first point at which
> an interrupt can get injected is when HLT is already executed (i.e.
> > to wake from this HLT). That's the so-called "STI shadow".

Oh, so then what Martin does seems to be fine: there's no race, given that
evtchn_upcall_pending is checked with interrupts disabled.

Thanks, Roger.



end of thread, other threads:[~2020-06-22 16:36 UTC | newest]

Thread overview: 24+ messages
2020-06-15 14:25 Event delivery and "domain blocking" on PVHv2 Martin Lucina
2020-06-15 15:03 ` Roger Pau Monné
2020-06-15 15:51   ` Martin Lucina
2020-06-15 16:52     ` Roger Pau Monné
2020-06-15 16:58 ` Andrew Cooper
2020-06-18 10:13   ` Martin Lucina
2020-06-18 11:46     ` Roger Pau Monné
2020-06-19 10:28       ` Martin Lucina
2020-06-19 11:21         ` Roger Pau Monné
2020-06-19 16:41           ` Martin Lucina
2020-06-19 16:54             ` Roger Pau Monné
2020-06-19 17:42               ` Roger Pau Monné
2020-06-22 10:58                 ` Martin Lucina
2020-06-22 13:58                   ` Roger Pau Monné
2020-06-22 15:31                     ` Martin Lucina
2020-06-22 15:36                       ` Jan Beulich
2020-06-22 16:09                       ` Roger Pau Monné
2020-06-22 16:20                         ` Jan Beulich
2020-06-22 16:26                           ` Martin Lucina
2020-06-22 16:36                           ` Roger Pau Monné
2020-06-19 11:21         ` Andrew Cooper
2020-06-19 16:46           ` Martin Lucina
2020-06-18 23:43     ` Andrew Cooper
2020-06-19  8:41       ` Martin Lucina
