All of lore.kernel.org
 help / color / mirror / Atom feed
From: Cong Wang <xiyou.wangcong@gmail.com>
To: Robin Murphy <robin.murphy@arm.com>
Cc: iommu@lists.linux-foundation.org,
	LKML <linux-kernel@vger.kernel.org>,
	Joerg Roedel <joro@8bytes.org>,
	John Garry <john.garry@huawei.com>
Subject: Re: [Patch v3 1/3] iommu: avoid unnecessary magazine allocations
Date: Wed, 22 Jan 2020 09:54:46 -0800	[thread overview]
Message-ID: <CAM_iQpXxbu+bwsW1MMjeG5feDAjhYjeuAwo6epDi22LJPo6X+Q@mail.gmail.com> (raw)
In-Reply-To: <9033456d-1f17-44a3-2640-24de55421e79@arm.com>

On Wed, Jan 22, 2020 at 9:07 AM Robin Murphy <robin.murphy@arm.com> wrote:
>
> On 21/01/2020 5:21 pm, Cong Wang wrote:
> > On Tue, Jan 21, 2020 at 3:11 AM Robin Murphy <robin.murphy@arm.com> wrote:
> >>
> >> On 18/12/2019 4:39 am, Cong Wang wrote:
> >>> The IOVA cache algorithm implemented in IOMMU code does not
> >>> exactly match the original algorithm described in the paper
> >>> "Magazines and Vmem: Extending the Slab Allocator to Many
> >>> CPUs and Arbitrary Resources".
> >>>
> >>> Particularly, it doesn't need to free the loaded empty magazine
> >>> when trying to put it back to global depot. To make it work, we
> >>> have to pre-allocate magazines in the depot and only recycle them
> >>> when all of them are full.
> >>>
> >>> Before this patch, rcache->depot[] contains either full or
> >>> freed entries, after this patch, it contains either full or
> >>> empty (but allocated) entries.
> >>
> >> How much additional memory overhead does this impose (particularly on
> >> systems that may have many domains mostly used for large, long-term
> >> mappings)? I'm wary that trying to micro-optimise for the "churn network
> >> packets as fast as possible" case may penalise every other case,
> >> potentially quite badly. Lower-end embedded systems are using IOMMUs in
> >> front of their GPUs, video codecs, etc. precisely because they *don't*
> >> have much memory to spare (and thus need to scrape together large
> >> buffers out of whatever pages they can find).
> >
> > The calculation is not complicated: 32 * 6 * 129 * 8 = 198144 bytes,
> > which is roughly 192K, per domain.
>
> Theoretically. On many architectures, kmalloc(1032,...) is going to
> consume rather more than 1032 bytes. Either way, it's rather a lot of
> memory to waste in the many cases where it will never be used at all.

If this is a concern, we can make IOVA_MAG_SIZE tunable in Kconfig.
I myself want a larger IOVA_MAG_SIZE at least for experiments.
You know, servers now have 100G+ memory, 192k is nearly nothing...


>
> >> But on the other hand, if we were to go down this route, then why is
> >> there any dynamic allocation/freeing left at all? Once both the depot
> >> and the rcaches are preallocated, then AFAICS it would make more sense
> >> to rework the overflow case in __iova_rcache_insert() to just free the
> >> IOVAs and swap the empty mag around rather than destroying and
> >> recreating it entirely.
> >
> > It's due to the algorithm requires a swap(), which can't be done with
> > statically allocated magzine. I had the same thought initially but gave it
> > up quickly when realized this.
>
> I'm not sure I follow... we're replacing a "full magazine" pointer with
> an "empty magazine" pointer regardless of where that empty magazine came
> from. It would be trivial to preallocate an 'overflow' magazine for the
> one remaining case of handling a full depot, although to be honest, at
> that point it's probably most efficient to just free the pfns directly
> from cpu_rcache->loaded while still under the percpu lock and be done
> with it.

I don't follow you either. I thought you are suggesting to completely
get rid of dynamic memory allocations like:

@@ -31,7 +31,7 @@ struct iova_cpu_rcache;
 struct iova_rcache {
        spinlock_t lock;
        unsigned long depot_size;
-       struct iova_magazine *depot[MAX_GLOBAL_MAGS];
+       struct iova_magazine depot[MAX_GLOBAL_MAGS];
        struct iova_cpu_rcache __percpu *cpu_rcaches;
 };

If it is so, I don't see how I can do swap() with pointers like
cpu_rcache->prev.

More importantly, this doesn't save any memory either for your embedded
case. So I don't know why you want to bring it up.

>
> > If you are suggesting to change the algorithm, it is not a goal of this
> > patchset. I do have plan to search for a better algorithm as the IOMMU
> > performance still sucks (comparing to no IOMMU) after this patchset,
> > but once again, I do not want to change it in this patchset.
>
> "Still sucks" is probably the most interesting thing here - the headline
> number for the original patch series was that it reached about 98% of
> bypass performance on Intel VT-d[1]. Sounds like it would be well worth
> digging in to what's different about your system and/or workload.

Just FYI: The latency is 10x/20x worse with IOMMU enabled on AMD
servers here. (mlx5 driver for ethernet, if matters.) The throughput
is roughly same. The patchset you linked only measures throughput.


>
> > (My ultimate goal is to find a spinlock-free algorithm, otherwise there is
> > no way to make it close to no-IOMMU performance.)
> >
> >>
> >> Perhaps there's a reasonable compromise wherein we don't preallocate,
> >> but still 'free' empty magazines back to the depot, such that busy
> >> domains will quickly reach a steady-state. In fact, having now dug up
> >> the paper at this point of writing this reply, that appears to be what
> >> fig. 3.1b describes anyway - I don't see any mention of preallocating
> >> the depot.
> >
> > That paper missed a lot of things, it doesn't even recommend a size
> > of a depot or percpu cache. For implementation, we still have to
> > think about those details, including whether to preallocate memory.
>
> Heh, "missed"... To my reading, the original design actually describes a
> depot consisting of two unbounded (but garbage-collected) lists and a
> dynamically-adjusted magazine size - I'd hardly blame the authors for

I must miss the dynamic size part, as I tried to tune IOVA_MAG_SIZE
manually when I initially thought it is overcached.

> not discussing an implementation from 15 years in the future of a
> fixed-size design *based on* their concept ;)

Are you saying fixed-size implementation is wrong? I'd like to hear
more! :) I am curious how to dynamically adjust the magzine size
too as I still don't believe IOVA_MAG_SIZE fits all, also how to
balance the percpu cache. Can you be more elaborate?

Thanks.

WARNING: multiple messages have this Message-ID (diff)
From: Cong Wang <xiyou.wangcong@gmail.com>
To: Robin Murphy <robin.murphy@arm.com>
Cc: iommu@lists.linux-foundation.org, LKML <linux-kernel@vger.kernel.org>
Subject: Re: [Patch v3 1/3] iommu: avoid unnecessary magazine allocations
Date: Wed, 22 Jan 2020 09:54:46 -0800	[thread overview]
Message-ID: <CAM_iQpXxbu+bwsW1MMjeG5feDAjhYjeuAwo6epDi22LJPo6X+Q@mail.gmail.com> (raw)
In-Reply-To: <9033456d-1f17-44a3-2640-24de55421e79@arm.com>

On Wed, Jan 22, 2020 at 9:07 AM Robin Murphy <robin.murphy@arm.com> wrote:
>
> On 21/01/2020 5:21 pm, Cong Wang wrote:
> > On Tue, Jan 21, 2020 at 3:11 AM Robin Murphy <robin.murphy@arm.com> wrote:
> >>
> >> On 18/12/2019 4:39 am, Cong Wang wrote:
> >>> The IOVA cache algorithm implemented in IOMMU code does not
> >>> exactly match the original algorithm described in the paper
> >>> "Magazines and Vmem: Extending the Slab Allocator to Many
> >>> CPUs and Arbitrary Resources".
> >>>
> >>> Particularly, it doesn't need to free the loaded empty magazine
> >>> when trying to put it back to global depot. To make it work, we
> >>> have to pre-allocate magazines in the depot and only recycle them
> >>> when all of them are full.
> >>>
> >>> Before this patch, rcache->depot[] contains either full or
> >>> freed entries, after this patch, it contains either full or
> >>> empty (but allocated) entries.
> >>
> >> How much additional memory overhead does this impose (particularly on
> >> systems that may have many domains mostly used for large, long-term
> >> mappings)? I'm wary that trying to micro-optimise for the "churn network
> >> packets as fast as possible" case may penalise every other case,
> >> potentially quite badly. Lower-end embedded systems are using IOMMUs in
> >> front of their GPUs, video codecs, etc. precisely because they *don't*
> >> have much memory to spare (and thus need to scrape together large
> >> buffers out of whatever pages they can find).
> >
> > The calculation is not complicated: 32 * 6 * 129 * 8 = 198144 bytes,
> > which is roughly 192K, per domain.
>
> Theoretically. On many architectures, kmalloc(1032,...) is going to
> consume rather more than 1032 bytes. Either way, it's rather a lot of
> memory to waste in the many cases where it will never be used at all.

If this is a concern, we can make IOVA_MAG_SIZE tunable in Kconfig.
I myself want a larger IOVA_MAG_SIZE at least for experiments.
You know, servers now have 100G+ memory, 192k is nearly nothing...


>
> >> But on the other hand, if we were to go down this route, then why is
> >> there any dynamic allocation/freeing left at all? Once both the depot
> >> and the rcaches are preallocated, then AFAICS it would make more sense
> >> to rework the overflow case in __iova_rcache_insert() to just free the
> >> IOVAs and swap the empty mag around rather than destroying and
> >> recreating it entirely.
> >
> > It's due to the algorithm requires a swap(), which can't be done with
> > statically allocated magzine. I had the same thought initially but gave it
> > up quickly when realized this.
>
> I'm not sure I follow... we're replacing a "full magazine" pointer with
> an "empty magazine" pointer regardless of where that empty magazine came
> from. It would be trivial to preallocate an 'overflow' magazine for the
> one remaining case of handling a full depot, although to be honest, at
> that point it's probably most efficient to just free the pfns directly
> from cpu_rcache->loaded while still under the percpu lock and be done
> with it.

I don't follow you either. I thought you are suggesting to completely
get rid of dynamic memory allocations like:

@@ -31,7 +31,7 @@ struct iova_cpu_rcache;
 struct iova_rcache {
        spinlock_t lock;
        unsigned long depot_size;
-       struct iova_magazine *depot[MAX_GLOBAL_MAGS];
+       struct iova_magazine depot[MAX_GLOBAL_MAGS];
        struct iova_cpu_rcache __percpu *cpu_rcaches;
 };

If it is so, I don't see how I can do swap() with pointers like
cpu_rcache->prev.

More importantly, this doesn't save any memory either for your embedded
case. So I don't know why you want to bring it up.

>
> > If you are suggesting to change the algorithm, it is not a goal of this
> > patchset. I do have plan to search for a better algorithm as the IOMMU
> > performance still sucks (comparing to no IOMMU) after this patchset,
> > but once again, I do not want to change it in this patchset.
>
> "Still sucks" is probably the most interesting thing here - the headline
> number for the original patch series was that it reached about 98% of
> bypass performance on Intel VT-d[1]. Sounds like it would be well worth
> digging in to what's different about your system and/or workload.

Just FYI: The latency is 10x/20x worse with IOMMU enabled on AMD
servers here. (mlx5 driver for ethernet, if matters.) The throughput
is roughly same. The patchset you linked only measures throughput.


>
> > (My ultimate goal is to find a spinlock-free algorithm, otherwise there is
> > no way to make it close to no-IOMMU performance.)
> >
> >>
> >> Perhaps there's a reasonable compromise wherein we don't preallocate,
> >> but still 'free' empty magazines back to the depot, such that busy
> >> domains will quickly reach a steady-state. In fact, having now dug up
> >> the paper at this point of writing this reply, that appears to be what
> >> fig. 3.1b describes anyway - I don't see any mention of preallocating
> >> the depot.
> >
> > That paper missed a lot of things, it doesn't even recommend a size
> > of a depot or percpu cache. For implementation, we still have to
> > think about those details, including whether to preallocate memory.
>
> Heh, "missed"... To my reading, the original design actually describes a
> depot consisting of two unbounded (but garbage-collected) lists and a
> dynamically-adjusted magazine size - I'd hardly blame the authors for

I must miss the dynamic size part, as I tried to tune IOVA_MAG_SIZE
manually when I initially thought it is overcached.

> not discussing an implementation from 15 years in the future of a
> fixed-size design *based on* their concept ;)

Are you saying fixed-size implementation is wrong? I'd like to hear
more! :) I am curious how to dynamically adjust the magzine size
too as I still don't believe IOVA_MAG_SIZE fits all, also how to
balance the percpu cache. Can you be more elaborate?

Thanks.
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

  reply	other threads:[~2020-01-22 17:54 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-12-18  4:39 [Patch v3 0/3] iommu: reduce spinlock contention on fast path Cong Wang
2019-12-18  4:39 ` Cong Wang
2019-12-18  4:39 ` [Patch v3 1/3] iommu: avoid unnecessary magazine allocations Cong Wang
2019-12-18  4:39   ` Cong Wang
2020-01-21 11:11   ` Robin Murphy
2020-01-21 11:11     ` Robin Murphy
2020-01-21 17:21     ` Cong Wang
2020-01-21 17:21       ` Cong Wang
2020-01-22 17:07       ` Robin Murphy
2020-01-22 17:07         ` Robin Murphy
2020-01-22 17:54         ` Cong Wang [this message]
2020-01-22 17:54           ` Cong Wang
2019-12-18  4:39 ` [Patch v3 2/3] iommu: optimize iova_magazine_free_pfns() Cong Wang
2019-12-18  4:39   ` Cong Wang
2020-01-21  9:52   ` Robin Murphy
2020-01-21  9:52     ` Robin Murphy
2020-01-21 17:29     ` Cong Wang
2020-01-21 17:29       ` Cong Wang
2020-01-22 17:34       ` Robin Murphy
2020-01-22 17:34         ` Robin Murphy
2020-01-22 17:45         ` Cong Wang
2020-01-22 17:45           ` Cong Wang
2019-12-18  4:39 ` [Patch v3 3/3] iommu: avoid taking iova_rbtree_lock twice Cong Wang
2019-12-18  4:39   ` Cong Wang
2019-12-19  9:51   ` John Garry
2019-12-19  9:51     ` John Garry
2020-01-21  9:56   ` Robin Murphy
2020-01-21  9:56     ` Robin Murphy
2020-03-03 11:33     ` John Garry
2020-03-03 11:33       ` John Garry
2020-01-20 23:10 ` [Patch v3 0/3] iommu: reduce spinlock contention on fast path Cong Wang
2020-01-20 23:10   ` Cong Wang
  -- strict thread matches above, loose matches on Subject: below --
2019-12-06 21:38 Cong Wang
2019-12-06 21:38 ` [Patch v3 1/3] iommu: avoid unnecessary magazine allocations Cong Wang
2019-12-06 21:38   ` Cong Wang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAM_iQpXxbu+bwsW1MMjeG5feDAjhYjeuAwo6epDi22LJPo6X+Q@mail.gmail.com \
    --to=xiyou.wangcong@gmail.com \
    --cc=iommu@lists.linux-foundation.org \
    --cc=john.garry@huawei.com \
    --cc=joro@8bytes.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=robin.murphy@arm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.