Re: [Qemu-devel] [V11 1/4] hw/i386: Introduce AMD IOMMU

From: David Kiarie <davidkiarie4@gmail.com>
To: Peter Xu <peterx@redhat.com>
Cc: QEMU Developers <qemu-devel@nongnu.org>,
	Jan Kiszka <jan.kiszka@web.de>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	Marcel Apfelbaum <marcel@redhat.com>,
	Valentine Sinitsyn <valentine.sinitsyn@gmail.com>
Subject: Re: [Qemu-devel] [V11 1/4] hw/i386: Introduce AMD IOMMU
Date: Tue, 24 May 2016 16:11:33 +0300	[thread overview]
Message-ID: <CABdVeABXKmG7Ot63NeYu6E6Unitj+f70phe85sObi64-DRKrrw@mail.gmail.com> (raw)
In-Reply-To: <20160524123544.GH8247@pxdev.xzpeter.org>

On Tue, May 24, 2016 at 3:35 PM, Peter Xu <peterx@redhat.com> wrote:
> On Sun, May 22, 2016 at 01:21:51PM +0300, David Kiarie wrote:
>
> [...]
>
>> +#define DEBUG_AMD_AMDVI
>> +#ifdef DEBUG_AMD_AMDVI
>> +enum {
>> +    DEBUG_GENERAL, DEBUG_CAPAB, DEBUG_MMIO, DEBUG_ELOG,
>> +    DEBUG_CACHE, DEBUG_COMMAND, DEBUG_MMU, DEBUG_CUSTOM
>> +};
>> +
>> +#define AMDVI_DBGBIT(x)   (1 << DEBUG_##x)
>> +static int iommu_dbgflags = AMDVI_DBGBIT(CUSTOM) | AMDVI_DBGBIT(MMU);
>> +
>> +#define AMDVI_DPRINTF(what, fmt, ...) do { \
>> +    if (iommu_dbgflags & AMDVI_DBGBIT(what)) { \
>> +        fprintf(stderr, "(amd-iommu)%s: " fmt "\n", __func__, \
>> +                ## __VA_ARGS__); } \
>> +    } while (0)
>> +#else
>> +#define AMDVI_DPRINTF(what, fmt, ...) do {} while (0)
>> +#endif
>
> (actually I was considering whether it would be cool that both Intel
>  and AMD IOMMU codes start to leverage trace utilities. Re-compiling
>  for debugging every time is not convenient, and also not aligned
>  with other part of QEMU. However I guess this is in-all-cases too
>  late for a v11 patchset... So just raise this question up in the
>  brackets)
>
> [...]
>
>> +static void amdvi_log_event(AMDVIState *s, uint16_t *evt)
>> +{
>> +    /* event logging not enabled */
>> +    if (!s->evtlog_enabled || *(uint64_t *)&s->mmior[AMDVI_MMIO_STATUS] |
>> +        AMDVI_MMIO_STATUS_EVT_OVF) {
>
> I see that there are lots of places in this patch that used
> something like:
>
>  *(uint64_t *)s->mmior[X] | Y
>
> Or:
>
>  *(uint64_t *)s->mmior[X] |= Y
>
> So... would it be a good idea that we provide several more helpers,
> like amdvi_orq() and amdvi_readq()?

Yes, that could be a good idea.

>
>> +        return;
>> +    }
>> +
>> +    /* event log buffer full */
>> +    if (s->evtlog_tail >= s->evtlog_len) {
>> +        *(uint64_t *)&s->mmior[AMDVI_MMIO_STATUS] |= AMDVI_MMIO_STATUS_EVT_OVF;
>
> Yet another example...
>
>> +        /* generate interrupt */
>> +        amdvi_generate_msi_interrupt(s);
>> +        return;
>> +    }
>> +
>> +    if (dma_memory_write(&address_space_memory, s->evtlog_len + s->evtlog_tail,
>> +       &evt, AMDVI_EVENT_LEN)) {
>> +        AMDVI_DPRINTF(ELOG, "error: fail to write at address 0x%"PRIx64
>> +                      " + offset 0x%"PRIx32, s->evtlog, s->evtlog_tail);
>> +    }
>> +
>> +     s->evtlog_tail += AMDVI_EVENT_LEN;
>> +     *(uint64_t *)&s->mmior[AMDVI_MMIO_STATUS] |= AMDVI_MMIO_STATUS_COMP_INT;
>
> Another one. (will stop finding examples)
>
>> +     amdvi_generate_msi_interrupt(s);
>> +}
>
> [...]
>
>> +/* extract device id */
>> +static inline uint16_t devid_extract(uint8_t *cmd)
>> +{
>> +    return (uint16_t)cmd[2] & AMDVI_INVAL_DEV_ID_MASK;
>
> Here the mask is defined as:
>
> #define AMDVI_INVAL_DEV_ID_MASK   (~((1UL << AMDVI_INVAL_DEV_ID_SHIFT) - 1))
>
> I think it should be 0xffffffff00000000 for 64 bit systems. However
> here cmd[2] is type uint8_t. Is there anything wrong?...
>
> Also, I see many places that we manipulate arbitary elements in
> cmd[] directly with some masks. Not sure whether we can make it more
> readable (which is optional though)

Well, this the code checking for reserved bits is admittedly, not very
legible. Most of the commands(if not all) vary in the position of
reserved bits which means if I decided for instance to have structs
representing the commands I'd probably have to define one for each
command. If I get support for this idea, I don't have a problem
implementing it.

>
> [...]
>
>> +static void amdvi_update_iotlb(AMDVIState *s, uint16_t devid,
>> +                               uint64_t gpa, IOMMUTLBEntry to_cache,
>> +                               uint16_t domid)
>> +{
>> +    AMDVIIOTLBEntry *entry = g_malloc(sizeof(*entry));
>> +    uint64_t *key = g_malloc(sizeof(key));
>> +    uint64_t gfn = gpa >> AMDVI_PAGE_SHIFT_4K;
>> +
>> +    /* don't cache erroneous translations */
>> +    if (to_cache.perm != IOMMU_NONE) {
>> +        AMDVI_DPRINTF(CACHE, " update iotlb domid 0x%"PRIx16" devid: "
>> +                      "%02x:%02x.%xgpa 0x%"PRIx64 " hpa 0x%"PRIx64, domid,
>> +                      PCI_BUS_NUM(devid), PCI_SLOT(devid), PCI_FUNC(devid),
>> +                      gpa, to_cache.translated_addr);
>> +
>> +        if (g_hash_table_size(s->iotlb) >= AMDVI_IOTLB_MAX_SIZE) {
>> +            AMDVI_DPRINTF(CACHE, "iotlb exceeds size limit - reset");
>> +            amdvi_iotlb_reset(s);
>
> I remembered that you have mentioned about one of your work that
> re-implemented another iotlb?

I wouldn't really call it an iotlb. I basically save all translated
entries into a hashtable and retrieve them later. Michael mentioned
that this could have some limitations when it comes to VFIO.

>
> [...]
>
>> +/* log error without aborting since linux seems to be using reserved bits */
>> +static void amdvi_inval_devtab_entry(AMDVIState *s, uint8_t *cmd,
>> +                                     uint8_t type)
>> +{
>> +    /* This command should invalidate internal caches of which there isn't */
>> +    if (*(uint64_t *)&cmd[0] & AMDVI_CMD_INVAL_DEV_RSVD ||
>> +            *(uint64_t *)&cmd[1]) {
>
> Indentation not aligned with the rest of the codes.
>
>> +        amdvi_log_illegalcom_error(s, type, s->cmdbuf + s->cmdbuf_head);
>> +    }
>> +#ifdef DEBUG_AMD_AMDVI
>> +    uint16_t devid = devid_extract(cmd);
>> +#endif
>> +    AMDVI_DPRINTF(COMMAND, "device table entry for devid: %02x:%02x.%x"
>> +                  " invalidated", PCI_BUS_NUM(devid), PCI_SLOT(devid),
>> +                  PCI_FUNC(devid));
>> +}
>> +
>> +static void amdvi_completion_wait(AMDVIState *s, uint8_t *cmd, uint8_t type)
>> +{
>> +    if (*(uint32_t *)&cmd[1] & AMDVI_COMPLETION_WAIT_RSVD) {
>> +        amdvi_log_illegalcom_error(s, type, s->cmdbuf + s->cmdbuf_head);
>> +    }
>> +    /* pretend to wait for command execution to complete */
>> +    AMDVI_DPRINTF(COMMAND, "completion wait requested with store address 0x%"
>> +                  PRIx64 " and store data 0x%"PRIx64, (cmd[0] &
>> +                  AMDVI_COM_STORE_ADDRESS_MASK), *(uint64_t *)(cmd + 8));
>> +    amdvi_completion_wait_exec(s, cmd);
>> +}
>> +
>> +static void amdvi_complete_ppr(AMDVIState *s, uint8_t *cmd, uint8_t type)
>> +{
>> +    if ((*(uint64_t *)&cmd[0] & AMDVI_COMPLETE_PPR_RQ_RSVD) ||
>> +       *(uint64_t *)&cmd[1] & AMDVI_COMPLETE_PPR_HIGH_RSVD) {
>> +        amdvi_log_illegalcom_error(s, type, s->cmdbuf + s->cmdbuf_head);
>> +    }
>
> Could you help explain why this command can be skipped with no-op?
> Thanks.

None of the Qemu devices have PPR support but if the aim is to pass
devices from the host->l1->l2 guest it would have to be implemented.

>
>> +
>> +    AMDVI_DPRINTF(COMMAND, "Execution of PPR queue requested");
>> +}
>
> [...]
>
>> +/* not honouring reserved bits is regarded as an illegal command */
>> +static void amdvi_cmdbuf_exec(AMDVIState *s)
>> +{
>> +    uint8_t type;
>> +    uint8_t cmd[AMDVI_COMMAND_SIZE];
>> +
>> +    memset(cmd, 0, AMDVI_COMMAND_SIZE);
>> +
>> +    if (dma_memory_read(&address_space_memory, s->cmdbuf + s->cmdbuf_head, cmd,
>> +       AMDVI_COMMAND_SIZE)) {
>> +        AMDVI_DPRINTF(COMMAND, "error: fail to access memory at 0x%"PRIx64
>> +                      " + 0x%"PRIu8, s->cmdbuf, s->cmdbuf_head);
>> +        amdvi_log_command_error(s, s->cmdbuf + s->cmdbuf_head);
>> +        return;
>> +    }
>> +
>> +    *cmd = le64_to_cpu(*(uint64_t *)cmd);
>
> Here we are assigning uint64_t to an uint8_t? Is this what we want?

Yes we don't want this. Should be fixed.

Other comments that I didn't respond on should be fixed.

David.

>
> Thanks,
>
> -- peterx