linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Ingo Molnar <mingo@kernel.org>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>,
	Arnd Bergmann <arnd@arndb.de>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Linux List Kernel Mailing <linux-kernel@vger.kernel.org>,
	syzbot <syzbot+8ab2d0f39fb79fe6ca40@syzkaller.appspotmail.com>
Subject: Re: [PATCH] /dev/mem: Bail out upon SIGKILL when reading memory.
Date: Sat, 24 Aug 2019 22:22:24 +0200	[thread overview]
Message-ID: <20190824202224.GA5286@gmail.com> (raw)
In-Reply-To: <CAHk-=whFQNkqPJ5zA1xAyvgtCPLN2C4xeJ181rU3k6bG+2zugg@mail.gmail.com>


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sat, Aug 24, 2019 at 9:14 AM Ingo Molnar <mingo@kernel.org> wrote:
> >
> > What I noticed is that while reading regular RAM is reasonably fast even
> > in very large chunks, accesses become very slow once they hit iomem - and
> > delays even longer than the reported 145 seconds become possible if
> > bs=100M is increased to say bs=1000M.
> 
> Ok, that makes sense.
> 
> So for IOMEM, we actually have two independent issues:
> 
>  (a) for normal unmapped IOMEM (ie no device), which is probably the
> case your test triggers anyway, it quite possibly ends up having ISA
> timings (ie around 1us per read)

That makes sense: I measured 17 seconds per 100 MB of data, which is is 
0.16 usecs per byte. The instruction used by 
copy_user_enhanced_fast_string() is REP MOVSB - which supposedly goes as 
high as cacheline size accesses - but perhaps those get broken down for 
physical memory that has no device claiming it?

>  (b) we use "probe_kernel_read()" to a temporary buffer avoid page
> faults, which uses __copy_from_user_inatomic(). So far so good. But on
> modern x86, because we treat it as just a regular memcpy, we probably
> have ERMS and do "rep movsb".
> 
> So (a) means that each access is slow to begin with, and then (b)
> means that we don't use "copy_from_io()" but a slow byte-by-byte
> access.
> 
> > With Tetsuo's approach the delays are fixed, because the fatal signal is
> > noticed within the 4K chunks of read_mem() immediately:
> 
> Yeah. that's the size of the temp buffer.
> 
> > Note how the first 100MB chunk took 17 seconds, the second chunk was
> > interrupted within ~2 seconds after I sent a SIGKILL.
> 
> It looks closer to 1 second from that trace, but that actually is
> still within the basic noise above: a 4kB buffer being copied one byte
> at a time would take about 4s, but maybe it's not *quite* as bad as
> traditional ISA PIO timings.

Yeah - and note that I phrased it imprecisely: the real in-kernel signal 
interruption delay was probably below 1 msec: in my test the SIGKILL was 
manually triggered, with an about 1 second delay caused by the human 
brain, not by the kernel. ;-)

The in-kernel interruption is almost instantaneous - the 4K max chunking 
is good I think.

> > Except that I think the regular pattern here is not 'break' but 'goto
> > failed' - 'break' would just result in a partial read and 'err' gets
> > ignored.
> 
> That's actually the correct signal handling pattern: either a partial
> read, or -EINTR.
> 
> In the case of SIGKILL, the return value doesn't matter, of course,
> but *if* we were to decide that we can make it interruptible, then it
> would.

Yeah. So while error cases and signals are not equivalent, I also went by 
existing precedent within read_kmem(): the 2 other error cases return an 
error (-ENXIO and -EFAULT), while they could also conceivably return a 
partial read of the previously completed read and only return an error if 
the *first* chunk generates an error without making any progress?

Interruption is arguably *not* an 'error', so preserving partial reads 
sounds like the higher quality solution, nevertheless one could argue 
that particual read *could* be returned by read_kmem() if progress was 
made.

> > I also agree that we should probably allow regular interrupts to
> > interrupt /dev/mem accesses - while the 'dd' process is killable, it's
> > not Ctrl-C-able.
> 
> I'm a bit nervous about it, but there probably aren't all that many
> actual /dev/mem users.
> 
> The main ones I can see are things like "dmidecode" etc.
> 
> > If I change the fatal_signal_pending() to signal_pending() it all behaves
> > much better IMHO - assuming that no utility learned rely on
> > non-interruptibility ...
> 
> So I cloned the dmidecode git tree, and it does "myread()" on the
> /dev/mem file as far as I can tell. And that one correctly hanles
> partial reads.
> 
> It still makes me a bit nervous, but just plain "signal_pending()" is
> not just nicer, it's technically the right thing to do (it's not a
> regular file, so partial reads are permissible, and other files like
> it - eg /dev/zero - already does exactly that).
> 
> I also wonder if we should just not use that crazy
> "probe_kernel_read()" to begin with. The /dev/mem code uses it to
> avoid faults - and that's the intent of it - but it's not meant for
> IO-memory at all.
> 
> But "read()" on /dev/mem does that off "xlate_dev_mem_ptr()", which is
> a bastardized "kernel address or ioremap" thing. It works, but it's
> all kinds of stupid. We'd be a *lot* better off using kmap(), I think.

Hm, I think xlate_dev_mem_ptr() and thus ioremap() would also perform a 
cache aliasing conflict check - which kmap() wouldn't do?

I.e. if for example an iomem area is already mapped by a driver with some 
conflicting cache attribute, xlate_dev_mem_ptr() AFAICS will not 
ioremap_cache() it to cached? IIRC some CPUs would triple fault or 
completely misbehave on certain cache attribute conflicts.

This check in mremap() might also trigger:

        if (is_ram == REGION_MIXED) {
                WARN_ONCE(1, "memremap attempted on mixed range %pa size: %#lx\n",
                                &offset, (unsigned long) size);
                return NULL;
        }

So I'd say xlate_dev_mem_ptr() looks messy, but is possibly a bit more 
robust in this regard?

> So my gut feel is that this is something worth trying to do bigger and
> more fundamental changes to.
> 
> But changing it to use "signal_pending()" at least as a trial looks
> ok. The only user *should* be things like dmidecode that apparently
> already do the right thing.
> 
> And if we changed the bounce buffering to do things at least a 32-bit
> word at a time, that would help latency for IO by a factor of 4.
> 
> And if the signal_pending() is at the end of the copy, then we'd
> guarantee that at least _small_ reads still work the way they did
> before, so people who do things like "lspci()" with direct hardware
> accesses wouldn't be impacted - if they exist.

Yeah.

> So I'd be willing to try that (and then if somebody reports a
> regression we can make it use "fatal_signal_pending()" instead)

Ok, will post a changelogged patch (unless Tetsuo beats me to it?).

Thanks,

	Ingo

  reply	other threads:[~2019-08-24 20:22 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-08-20 22:06 [PATCH] /dev/mem: Bail out upon SIGKILL when reading memory Tetsuo Handa
2019-08-20 22:24 ` Greg Kroah-Hartman
2019-08-21  0:07   ` Tetsuo Handa
2019-08-22  9:59   ` Tetsuo Handa
2019-08-22 13:35     ` Greg Kroah-Hartman
2019-08-22 14:00       ` Tetsuo Handa
2019-08-22 16:42         ` Greg Kroah-Hartman
2019-08-22 17:11           ` Dmitry Vyukov
2019-08-22 21:17             ` Tetsuo Handa
2019-08-22 23:59               ` Dmitry Vyukov
2019-08-23  8:17                 ` Tetsuo Handa
2019-08-23 16:47                   ` Dmitry Vyukov
2019-10-08  9:57                   ` Kernel config for fuzz testing Tetsuo Handa
2019-08-22 21:29 ` [PATCH] /dev/mem: Bail out upon SIGKILL when reading memory Linus Torvalds
2019-08-22 22:08 ` Linus Torvalds
2019-08-23  9:16   ` Ingo Molnar
2019-08-23 16:39     ` Linus Torvalds
2019-08-24 16:14       ` Ingo Molnar
2019-08-24 17:40         ` Linus Torvalds
2019-08-24 20:22           ` Ingo Molnar [this message]
2019-08-24 20:56             ` Linus Torvalds
2019-08-30  9:56               ` David Laight
2019-08-25  5:49             ` Tetsuo Handa
2019-08-25  9:59               ` Ingo Molnar
2019-08-25 10:35                 ` Tetsuo Handa
2019-08-25 10:48                   ` Ingo Molnar
2019-08-25 16:54               ` Linus Torvalds
2019-08-26 10:40                 ` Tetsuo Handa
2019-08-26 11:19                   ` Ingo Molnar
2019-08-26 15:02                     ` Rewriting read_kmem()/write_kmem() ? Tetsuo Handa
2019-08-23 11:46   ` [PATCH] /dev/mem: Bail out upon SIGKILL when reading memory Tetsuo Handa

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190824202224.GA5286@gmail.com \
    --to=mingo@kernel.org \
    --cc=arnd@arndb.de \
    --cc=gregkh@linuxfoundation.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=penguin-kernel@i-love.sakura.ne.jp \
    --cc=syzbot+8ab2d0f39fb79fe6ca40@syzkaller.appspotmail.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).