Re: [PATCH] e1000e: using bottom half to send packets

From: Jason Wang <jasowang@redhat.com>
To: Li Qiang <liq3ea@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	Dmitry Fleytman <dmitry.fleytman@gmail.com>,
	Li Qiang <liq3ea@163.com>,
	Qemu Developers <qemu-devel@nongnu.org>,
	"Michael S. Tsirkin" <mst@redhat.com>
Subject: Re: [PATCH] e1000e: using bottom half to send packets
Date: Mon, 20 Jul 2020 11:59:48 +0800	[thread overview]
Message-ID: <9cc9842a-975d-417f-de80-d0a57619b94f@redhat.com> (raw)
In-Reply-To: <CAKXe6S+HAcWFEE5ZyLaRkoi+j-v4OE3DCM9aL=6sF1B_2d5-ag@mail.gmail.com>

On 2020/7/17 下午11:46, Li Qiang wrote:
> Jason Wang <jasowang@redhat.com> 于2020年7月17日周五 下午1:39写道：
>>
>> On 2020/7/17 下午12:46, Li Qiang wrote:
>>> Jason Wang <jasowang@redhat.com> 于2020年7月17日周五 上午11:10写道：
>>>> On 2020/7/17 上午12:14, Li Qiang wrote:
>>>>> Alexander Bulekov reported a UAF bug related e1000e packets send.
>>>>>
>>>>> -->https://bugs.launchpad.net/qemu/+bug/1886362
>>>>>
>>>>> This is because the guest trigger a e1000e packet send and set the
>>>>> data's address to e1000e's MMIO address. So when the e1000e do DMA
>>>>> it will write the MMIO again and trigger re-entrancy and finally
>>>>> causes this UAF.
>>>>>
>>>>> Paolo suggested to use a bottom half whenever MMIO is doing complicate
>>>>> things in here:
>>>>> -->https://lists.nongnu.org/archive/html/qemu-devel/2020-07/msg03342.html
>>>>>
>>>>> Reference here:
>>>>> 'The easiest solution is to delay processing of descriptors to a bottom
>>>>> half whenever MMIO is doing something complicated.  This is also better
>>>>> for latency because it will free the vCPU thread more quickly and leave
>>>>> the work to the I/O thread.'
>>>> I think several things were missed in this patch (take virtio-net as a
>>>> reference), do we need the following things:
>>>>
>>> Thanks Jason,
>>> In fact I know this, I'm scared for touching this but I want to try.
>>> Thanks for your advice.
>>>
>>>> - Cancel the bh when VM is stopped.
>>> Ok. I think add a vm state change notifier for e1000e can address this.
>>>
>>>> - A throttle to prevent bh from executing too much timer?
>>> Ok, I think add a config timeout and add a timer in e1000e can address this.
>>
>> Sorry, a typo. I meant we probably need a tx_burst as what virtio-net did.
>>
>>
>>>> - A flag to record whether or not this a pending tx (and migrate it?)
>>> Is just a flag enough? Could you explain more about the idea behind
>>> processing the virtio-net/e1000e using bh like this?
>>
>> Virtio-net use a tx_waiting variable to record whether or not there's a
>> pending bh. (E.g bh is cancelled due to vmstop, we need reschedule it
>> after vmresume). Maybe we can do something simpler by just schecule bh
>> unconditionally during vm resuming.
>>
>>
>>> For example, if the guest trigger a lot of packets send and if the bh
>>> is scheduled in IO thread. So will we lost packets?
>>
>> We don't since we don't populate virtqueue which means packets are
>> queued there.
>>
> This remind of me a question:
> If we use tx_burst like in virtion-net. For detail:
> If we sent out  'tx_burst' packets per bh. Then we set 'tx_waiting' and
> then schedule another bh. However if between two bh schedule, the guest change
> the e1000e register such 'r->dh' 'r->dlen'. The data is fully corrupted.
> In fact this issue does exist in my origin patch.  That's
> What if following happend:
>
> vcpu thread: guest write e1000e MMIO to trigger packets send
> vcpu thread: schedule a bh
> vcpu thread: return
> IO thread: begin to run the bh and start send packets
> vcpu thread: write register again such as  'r->dh' 'r->dlen'..
>
> So here the IO thread and vcpu thread will race the register?
>
> If I remember correctly, the virtio net has no such problem because it
> uses ring buffer
> and the backedn(virtio device) uses the shadow index to index the ring
> buffer data.
>
> What's your idea here?

I think we serialize them through bql? (qemu_mutex_lock_iothread())

Thanks

>
> Thanks,
> Li Qiang
>
>
>> Thanks
>>
>>
>>> How we avoid this in virtio-net.
>>>
>>> Thanks,
>>> Li Qiang
>>>
>>>
>>>
>>>> Thanks
>>>>
>>>>
>>>>> This patch fixes this UAF.
>>>>>
>>>>> Signed-off-by: Li Qiang <liq3ea@163.com>
>>>>> ---
>>>>>     hw/net/e1000e_core.c | 25 +++++++++++++++++--------
>>>>>     hw/net/e1000e_core.h |  2 ++
>>>>>     2 files changed, 19 insertions(+), 8 deletions(-)
>>>>>
>>>>> diff --git a/hw/net/e1000e_core.c b/hw/net/e1000e_core.c
>>>>> index bcd186cac5..6165b04b68 100644
>>>>> --- a/hw/net/e1000e_core.c
>>>>> +++ b/hw/net/e1000e_core.c
>>>>> @@ -2423,32 +2423,27 @@ e1000e_set_dbal(E1000ECore *core, int index, uint32_t val)
>>>>>     static void
>>>>>     e1000e_set_tctl(E1000ECore *core, int index, uint32_t val)
>>>>>     {
>>>>> -    E1000E_TxRing txr;
>>>>>         core->mac[index] = val;
>>>>>
>>>>>         if (core->mac[TARC0] & E1000_TARC_ENABLE) {
>>>>> -        e1000e_tx_ring_init(core, &txr, 0);
>>>>> -        e1000e_start_xmit(core, &txr);
>>>>> +        qemu_bh_schedule(core->tx[0].tx_bh);
>>>>>         }
>>>>>
>>>>>         if (core->mac[TARC1] & E1000_TARC_ENABLE) {
>>>>> -        e1000e_tx_ring_init(core, &txr, 1);
>>>>> -        e1000e_start_xmit(core, &txr);
>>>>> +        qemu_bh_schedule(core->tx[1].tx_bh);
>>>>>         }
>>>>>     }
>>>>>
>>>>>     static void
>>>>>     e1000e_set_tdt(E1000ECore *core, int index, uint32_t val)
>>>>>     {
>>>>> -    E1000E_TxRing txr;
>>>>>         int qidx = e1000e_mq_queue_idx(TDT, index);
>>>>>         uint32_t tarc_reg = (qidx == 0) ? TARC0 : TARC1;
>>>>>
>>>>>         core->mac[index] = val & 0xffff;
>>>>>
>>>>>         if (core->mac[tarc_reg] & E1000_TARC_ENABLE) {
>>>>> -        e1000e_tx_ring_init(core, &txr, qidx);
>>>>> -        e1000e_start_xmit(core, &txr);
>>>>> +        qemu_bh_schedule(core->tx[qidx].tx_bh);
>>>>>         }
>>>>>     }
>>>>>
>>>>> @@ -3322,6 +3317,16 @@ e1000e_vm_state_change(void *opaque, int running, RunState state)
>>>>>         }
>>>>>     }
>>>>>
>>>>> +static void e1000e_core_tx_bh(void *opaque)
>>>>> +{
>>>>> +    struct e1000e_tx *tx = opaque;
>>>>> +    E1000ECore *core = tx->core;
>>>>> +    E1000E_TxRing txr;
>>>>> +
>>>>> +    e1000e_tx_ring_init(core, &txr, tx - &core->tx[0]);
>>>>> +    e1000e_start_xmit(core, &txr);
>>>>> +}
>>>>> +
>>>>>     void
>>>>>     e1000e_core_pci_realize(E1000ECore     *core,
>>>>>                             const uint16_t *eeprom_templ,
>>>>> @@ -3340,6 +3345,8 @@ e1000e_core_pci_realize(E1000ECore     *core,
>>>>>         for (i = 0; i < E1000E_NUM_QUEUES; i++) {
>>>>>             net_tx_pkt_init(&core->tx[i].tx_pkt, core->owner,
>>>>>                             E1000E_MAX_TX_FRAGS, core->has_vnet);
>>>>> +        core->tx[i].core = core;
>>>>> +        core->tx[i].tx_bh = qemu_bh_new(e1000e_core_tx_bh, &core->tx[i]);
>>>>>         }
>>>>>
>>>>>         net_rx_pkt_init(&core->rx_pkt, core->has_vnet);
>>>>> @@ -3367,6 +3374,8 @@ e1000e_core_pci_uninit(E1000ECore *core)
>>>>>         for (i = 0; i < E1000E_NUM_QUEUES; i++) {
>>>>>             net_tx_pkt_reset(core->tx[i].tx_pkt);
>>>>>             net_tx_pkt_uninit(core->tx[i].tx_pkt);
>>>>> +        qemu_bh_delete(core->tx[i].tx_bh);
>>>>> +        core->tx[i].tx_bh = NULL;
>>>>>         }
>>>>>
>>>>>         net_rx_pkt_uninit(core->rx_pkt);
>>>>> diff --git a/hw/net/e1000e_core.h b/hw/net/e1000e_core.h
>>>>> index aee32f7e48..94ddc6afc2 100644
>>>>> --- a/hw/net/e1000e_core.h
>>>>> +++ b/hw/net/e1000e_core.h
>>>>> @@ -77,6 +77,8 @@ struct E1000Core {
>>>>>             unsigned char sum_needed;
>>>>>             bool cptse;
>>>>>             struct NetTxPkt *tx_pkt;
>>>>> +        QEMUBH *tx_bh;
>>>>> +        E1000ECore *core;
>>>>>         } tx[E1000E_NUM_QUEUES];
>>>>>
>>>>>         struct NetRxPkt *rx_pkt;