From: Cong Wang <xiyou.wangcong@gmail.com>
To: Robin Murphy <robin.murphy@arm.com>
Cc: iommu@lists.linux-foundation.org,
	LKML <linux-kernel@vger.kernel.org>,
	Joerg Roedel <joro@8bytes.org>,
	John Garry <john.garry@huawei.com>
Subject: Re: [Patch v3 1/3] iommu: avoid unnecessary magazine allocations
Date: Wed, 22 Jan 2020 09:54:46 -0800
Message-ID: <CAM_iQpXxbu+bwsW1MMjeG5feDAjhYjeuAwo6epDi22LJPo6X+Q@mail.gmail.com>
In-Reply-To: <9033456d-1f17-44a3-2640-24de55421e79@arm.com>

On Wed, Jan 22, 2020 at 9:07 AM Robin Murphy <robin.murphy@arm.com> wrote:
>
> On 21/01/2020 5:21 pm, Cong Wang wrote:
> > On Tue, Jan 21, 2020 at 3:11 AM Robin Murphy <robin.murphy@arm.com> wrote:
> >>
> >> On 18/12/2019 4:39 am, Cong Wang wrote:
> >>> The IOVA cache algorithm implemented in IOMMU code does not
> >>> exactly match the original algorithm described in the paper
> >>> "Magazines and Vmem: Extending the Slab Allocator to Many
> >>> CPUs and Arbitrary Resources".
> >>>
> >>> Particularly, it doesn't need to free the loaded empty magazine
> >>> when trying to put it back into the global depot. To make this work,
> >>> we have to pre-allocate magazines in the depot and only recycle them
> >>> when all of them are full.
> >>>
> >>> Before this patch, rcache->depot[] contains either full or
> >>> freed entries; after this patch, it contains either full or
> >>> empty (but allocated) entries.
> >>
> >> How much additional memory overhead does this impose (particularly on
> >> systems that may have many domains mostly used for large, long-term
> >> mappings)? I'm wary that trying to micro-optimise for the "churn network
> >> packets as fast as possible" case may penalise every other case,
> >> potentially quite badly. Lower-end embedded systems are using IOMMUs in
> >> front of their GPUs, video codecs, etc. precisely because they *don't*
> >> have much memory to spare (and thus need to scrape together large
> >> buffers out of whatever pages they can find).
> >
> > The calculation is not complicated: 32 * 6 * 129 * 8 = 198144 bytes,
> > which is roughly 192KB per domain.
>
> Theoretically. On many architectures, kmalloc(1032,...) is going to
> consume rather more than 1032 bytes. Either way, it's rather a lot of
> memory to waste in the many cases where it will never be used at all.

If this is a concern, we can make IOVA_MAG_SIZE tunable in Kconfig.
I myself want a larger IOVA_MAG_SIZE, at least for experiments.
You know, servers now have 100GB+ of memory, so 192KB is nearly nothing...
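
For reference, here is the arithmetic above in code form -- a
standalone userspace sketch (not kernel code) that mirrors the
magazine layout in iova.c at the time and assumes an 8-byte
unsigned long:

#include <stdio.h>

#define IOVA_MAG_SIZE			128	/* pfns per magazine */
#define IOVA_RANGE_CACHE_MAX_SIZE	6	/* per-size rcaches */
#define MAX_GLOBAL_MAGS			32	/* depot slots per rcache */

struct iova_magazine {
	unsigned long size;
	unsigned long pfns[IOVA_MAG_SIZE];
};

int main(void)
{
	/* 129 unsigned longs * 8 bytes = 1032 bytes per magazine */
	size_t per_mag = sizeof(struct iova_magazine);
	/* 6 rcaches * 32 depot slots * 1032 = 198144 bytes per domain */
	size_t per_domain = IOVA_RANGE_CACHE_MAX_SIZE * MAX_GLOBAL_MAGS *
			    per_mag;

	printf("%zu bytes per magazine, %zu bytes per domain\n",
	       per_mag, per_domain);
	return 0;
}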


>
> >> But on the other hand, if we were to go down this route, then why is
> >> there any dynamic allocation/freeing left at all? Once both the depot
> >> and the rcaches are preallocated, then AFAICS it would make more sense
> >> to rework the overflow case in __iova_rcache_insert() to just free the
> >> IOVAs and swap the empty mag around rather than destroying and
> >> recreating it entirely.
> >
> > It's because the algorithm requires a swap(), which can't be done with
> > statically allocated magazines. I had the same thought initially but gave
> > it up quickly when I realized this.
>
> I'm not sure I follow... we're replacing a "full magazine" pointer with
> an "empty magazine" pointer regardless of where that empty magazine came
> from. It would be trivial to preallocate an 'overflow' magazine for the
> one remaining case of handling a full depot, although to be honest, at
> that point it's probably most efficient to just free the pfns directly
> from cpu_rcache->loaded while still under the percpu lock and be done
> with it.

I don't follow you either. I thought you were suggesting getting rid
of dynamic memory allocations completely, like this:

@@ -31,7 +31,7 @@ struct iova_cpu_rcache;
 struct iova_rcache {
        spinlock_t lock;
        unsigned long depot_size;
-       struct iova_magazine *depot[MAX_GLOBAL_MAGS];
+       struct iova_magazine depot[MAX_GLOBAL_MAGS];
        struct iova_cpu_rcache __percpu *cpu_rcaches;
 };

If that is the case, I don't see how I can swap() pointers like
cpu_rcache->prev.
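
To make this concrete: below is a trimmed sketch of the existing fast
path in __iova_rcache_insert() (quoting mainline iova.c from memory;
depot handling and error paths are elided). The swap() exchanges the
two *pointers*, which would no longer be possible if depot[] held the
magazines by value:

static bool __iova_rcache_insert(struct iova_domain *iovad,
				 struct iova_rcache *rcache,
				 unsigned long iova_pfn)
{
	struct iova_cpu_rcache *cpu_rcache;
	bool can_insert = false;
	unsigned long flags;

	cpu_rcache = raw_cpu_ptr(rcache->cpu_rcaches);
	spin_lock_irqsave(&cpu_rcache->lock, flags);

	if (!iova_magazine_full(cpu_rcache->loaded)) {
		can_insert = true;
	} else if (!iova_magazine_full(cpu_rcache->prev)) {
		/* loaded is full, prev is not: just exchange the
		 * two per-CPU magazine pointers */
		swap(cpu_rcache->prev, cpu_rcache->loaded);
		can_insert = true;
	} else {
		/* both full: allocate a fresh magazine, push the
		 * full one into the depot (or free it when the
		 * depot is full) -- the path this series changes */
	}

	if (can_insert)
		iova_magazine_push(cpu_rcache->loaded, iova_pfn);

	spin_unlock_irqrestore(&cpu_rcache->lock, flags);
	return can_insert;
}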

More importantly, this doesn't save any memory for your embedded case
either, so I don't know why you brought it up.

>
> > If you are suggesting changing the algorithm, that is not a goal of this
> > patchset. I do plan to search for a better algorithm, as the IOMMU
> > performance still sucks (compared to no IOMMU) after this patchset, but
> > once again, I do not want to change it in this patchset.
>
> "Still sucks" is probably the most interesting thing here - the headline
> number for the original patch series was that it reached about 98% of
> bypass performance on Intel VT-d[1]. Sounds like it would be well worth
> digging in to what's different about your system and/or workload.

Just FYI: the latency is 10x to 20x worse with the IOMMU enabled on the
AMD servers here (mlx5 Ethernet driver, if it matters). The throughput
is roughly the same. The patchset you linked only measured throughput.


>
> > (My ultimate goal is to find a spinlock-free algorithm; otherwise there
> > is no way to get close to no-IOMMU performance.)
> >
> >>
> >> Perhaps there's a reasonable compromise wherein we don't preallocate,
> >> but still 'free' empty magazines back to the depot, such that busy
> >> domains will quickly reach a steady-state. In fact, having now dug up
> >> the paper at this point of writing this reply, that appears to be what
> >> fig. 3.1b describes anyway - I don't see any mention of preallocating
> >> the depot.
> >
> > That paper missed a lot of things; it doesn't even recommend a size for
> > the depot or the percpu cache. For an implementation, we still have to
> > think about those details, including whether to preallocate memory.
>
> Heh, "missed"... To my reading, the original design actually describes a
> depot consisting of two unbounded (but garbage-collected) lists and a
> dynamically-adjusted magazine size - I'd hardly blame the authors for

I must have missed the dynamic size part, as I tried to tune
IOVA_MAG_SIZE manually when I initially thought it was overcaching.

> not discussing an implementation from 15 years in the future of a
> fixed-size design *based on* their concept ;)

Are you saying the fixed-size implementation is wrong? I'd like to
hear more! :) I am also curious how to dynamically adjust the magazine
size, as I still don't believe one IOVA_MAG_SIZE fits all, and how to
balance the percpu caches. Could you elaborate?
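
For reference, my rough understanding of the paper's contention-driven
resizing is something like the sketch below. This is purely
hypothetical code with invented names (toy_depot, the threshold, etc.),
just to make the question concrete:

#define DEPOT_CONTENTION_THRESHOLD	8	/* invented value */
#define TOY_MAG_SIZE_MAX		1024	/* invented value */

struct toy_depot {
	spinlock_t lock;
	unsigned int contention;	/* failed trylock count */
	unsigned int mag_size;		/* capacity for newly built mags */
};

static void toy_depot_lock(struct toy_depot *d)
{
	if (!spin_trylock(&d->lock)) {
		d->contention++;	/* racy, fine for a heuristic */
		spin_lock(&d->lock);
	}
}

/* Run occasionally (the paper uses a periodic reaper), lock held. */
static void toy_depot_maybe_grow(struct toy_depot *d)
{
	if (d->contention > DEPOT_CONTENTION_THRESHOLD) {
		/* Bigger magazines amortize each depot visit over
		 * more IOVAs, trading memory for less contention. */
		d->mag_size = min(2 * d->mag_size,
				  (unsigned int)TOY_MAG_SIZE_MAX);
		d->contention = 0;
	}
}

Old magazines would keep their size and retire naturally as they
drain; only newly allocated ones would use the larger mag_size. Is
that roughly what you have in mind?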

Thanks.
