From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
To: Dave Hansen <dave.hansen@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Konstantin Khlebnikov <koct9i@gmail.com>,
	Wu Fengguang <fengguang.wu@intel.com>,
	Arnaldo Carvalho de Melo <acme@redhat.com>,
	Borislav Petkov <bp@alien8.de>,
	"Kirill A. Shutemov" <kirill@shutemov.name>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Rusty Russell <rusty@rustcorp.com.au>,
	David Miller <davem@davemloft.net>,
	Andres Freund <andres@2ndquadrant.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Christoph Hellwig <hch@infradead.org>,
	Dave Chinner <david@fromorbit.com>,
	Michael Kerrisk <mtk.manpages@gmail.com>,
	Linux API <linux-api@vger.kernel.org>,
	Naoya Horiguchi <nao.horiguchi@gmail.com>,
	Kees Cook <kees@outflux.net>
Subject: Re: [PATCH v3 1/3] mm: introduce fincore()
Date: Mon, 7 Jul 2014 16:21:08 -0400
Message-ID: <20140707202108.GA5031@nhori.bos.redhat.com>
In-Reply-To: <53BAEE95.50807@intel.com>

Hi Dave,

Thank you for the comments.

On Mon, Jul 07, 2014 at 12:01:41PM -0700, Dave Hansen wrote:
> > +/*
> > + * You can control how the buffer in userspace is filled with this mode
> > + * parameters:
> 
> I agree that we don't have any good mechanisms for looking at the page
> cache from userspace.  I've hacked some things up using mincore() and
> they weren't pretty, so I welcome _something_ like this.
> 
> But, is this trying to do too many things at once?  Do we have solid use
> cases spelled out for each of these modes?  Have we thought out how they
> will be used in practice?

tools/vm/page-types.c will be an in-tree user once this base code is
accepted. The idea for fincore() came up during a discussion with
Konstantin about a file cache mode for this tool.
It needs the pfn and page flags, so I think that's one clear use case.

> The biggest question for me, though, is whether we want to start
> designing these per-page interfaces to consider different page sizes, or
> whether we're going to just continue to pretend that the entire world is
> 4k pages.  Using FINCORE_BMAP on 1GB hugetlbfs files would be a bit
> silly, for instance.
> 
> > + * - FINCORE_BMAP:
> > + *     the page status is returned in a vector of bytes.
> > + *     The least significant bit of each byte is 1 if the referenced page
> > + *     is in memory, otherwise it is zero.
> 
> I know this is consistent with mincore(), but it did always bother me
> that mincore() was so sparse.  Seems like it is wasting 7/8 of its bits.

Yes, I got the same comment in the previous round. Quite a few people
seem to think that space efficiency is more important than consistency
with mincore(), so I'm OK with using one bit per page.

We have an idea of making fincore() cover all of mincore()'s functionality
by letting fincore() handle /proc/pid/mem. mincore() would then be obsolete,
and no one would have to care about consistency between mincore() and
fincore(). That might be another reason to justify the change above.
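
A minimal sketch of the space trade-off discussed above, not code from the
patch: it converts a mincore()-style vector (one byte per page, residency
in the least significant bit) into a packed bitmap with one bit per page.
The helper names bitmap_bytes(), pack_residency() and page_is_resident()
are made up for illustration, and the bit order within a byte (page 0 in
bit 0) is an assumption.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Bytes needed to hold one residency bit per page. */
static size_t bitmap_bytes(size_t nr_pages)
{
	return (nr_pages + CHAR_BIT - 1) / CHAR_BIT;
}

/*
 * Pack a byte-per-page vector (bit 0 set if the page is in memory)
 * into a bit-per-page bitmap. Page i lands in bit (i % 8) of byte
 * (i / 8); this layout is an assumption, not taken from the patch.
 */
static void pack_residency(const unsigned char *vec, size_t nr_pages,
			   unsigned char *bitmap)
{
	size_t i;

	memset(bitmap, 0, bitmap_bytes(nr_pages));
	for (i = 0; i < nr_pages; i++)
		if (vec[i] & 1)
			bitmap[i / CHAR_BIT] |=
				(unsigned char)(1u << (i % CHAR_BIT));
}

/* Test the residency bit of a single page in the packed bitmap. */
static int page_is_resident(const unsigned char *bitmap, size_t pgoff)
{
	return (bitmap[pgoff / CHAR_BIT] >> (pgoff % CHAR_BIT)) & 1;
}
```

With this layout, a range of N pages needs (N + 7) / 8 bytes instead of
N bytes.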

> > + * - FINCORE_PGOFF:
> > + *     if this flag is set, fincore() doesn't store any information about
> > + *     holes. Instead each records per page has the entry of page offset,
> > + *     using 8 bytes. This mode is useful if we handle a large file and
> > + *     only few pages are on memory.
> 
> This bothers me a bit.  How would someone know how sparse file was
> before calling this?  If it's not sparse, and they use this, they'll end
> up using 8x the memory they would have using FINCORE_BMAP.  If it *is*
> sparse, and they use FINCORE_BMAP, they will either waste tons of memory
> on buffers, or have to make a ton of calls.

Yes, that's the hard part.
A new mode (FINCORE_SUM, for example) that returns how many pages of a file
are in memory might help the caller choose a mode, although that takes
two calls.
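
To make the trade-off concrete, here is a rough sketch of how a caller
might pick a mode once it knows the resident page count (for instance from
something like the FINCORE_SUM idea above, which does not exist yet). It
assumes the v3 sizes quoted in this thread: one byte per page in the range
for FINCORE_BMAP and 8 bytes per resident page for FINCORE_PGOFF. The enum
and function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical result type; not part of the proposed interface. */
enum fincore_choice { USE_BMAP, USE_PGOFF };

/* FINCORE_BMAP (as in v3): one byte for every page in the range. */
static uint64_t bmap_buffer_bytes(uint64_t nr_pages_in_range)
{
	return nr_pages_in_range;
}

/* FINCORE_PGOFF: one 8-byte page offset for every resident page. */
static uint64_t pgoff_buffer_bytes(uint64_t nr_resident)
{
	return nr_resident * 8;
}

/* Pick whichever mode needs the smaller buffer; ties go to BMAP. */
static enum fincore_choice choose_mode(uint64_t nr_pages_in_range,
				       uint64_t nr_resident)
{
	if (pgoff_buffer_bytes(nr_resident) <
	    bmap_buffer_bytes(nr_pages_in_range))
		return USE_PGOFF;
	return USE_BMAP;
}
```

Under these assumptions the break-even point is one resident page in
eight; a 2TB file with 4k pages (2^29 pages) and only a thousand resident
pages clearly favors FINCORE_PGOFF.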

> I guess this could also be used to do *searches*, which would let you
> search out holes.  Let's say you have a 2TB file.  You could call this
> with a buffer size of 1 entry and do searches, say 0->1TB.  If you get
> your one entry back, you know it's not completely sparse.
> 
> But, that wouldn't work with it as-designed.  The length of the buffer
> and the range of the file being checked are coupled together,

This is only correct for !FINCORE_PGOFF.

> so you
> can't say:
> 
> 	vec = malloc(sizeof(long));
> 	fincore(fd, 0, 1TB, FINCORE_PGOFF, vec, extra);
> 
> without overflowing vec.

The 3rd parameter is the number of pages whose data is passed to userspace,
so we expect userspace to set it according to the buffer size.

But yes, there is still a problem. In FINCORE_PGOFF mode we only scan until
the buffer becomes full, and userspace doesn't know where the scan stopped.
It can guess the end point from the pgoff of the last entry in the buffer,
but that is neither straightforward nor well designed.
I should also document this behavior better.
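
The guess described above could look roughly like the following sketch.
It assumes that entries come back in ascending page-offset order and that
the caller learns how many entries were filled in; neither is confirmed in
this thread, and pgoff_next_start() is a made-up helper, not part of the
proposed interface.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Decide where to resume a FINCORE_PGOFF-style scan.
 *
 * entries:   page offsets returned by the previous call (ascending order
 *            is an assumption)
 * nr_filled: number of valid entries
 * capacity:  number of entries the buffer can hold
 *
 * Returns  1 and stores the next start offset in *next if the buffer
 *            filled up, so the scan may have stopped early;
 * returns  0 if the buffer did not fill, so the whole range was scanned;
 * returns -1 if the buffer has no room at all.
 */
static int pgoff_next_start(const uint64_t *entries, size_t nr_filled,
			    size_t capacity, uint64_t *next)
{
	if (capacity == 0)
		return -1;
	if (nr_filled < capacity)
		return 0;
	*next = entries[nr_filled - 1] + 1;
	return 1;
}
```

A caller would loop, restarting at *next until the helper returns 0. The
awkward part is visible here: a full buffer does not distinguish "stopped
early" from "the last resident page happened to fill the last slot", so
one extra call is always needed.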

> Is it really right to say this is going to be 8 bytes?  Would we want it
> to share types with something else, like be an loff_t?

Could you elaborate on that?

> > + * - FINCORE_PFN:
> > + *     stores pfn, using 8 bytes.
> 
> These are all an unprivileged operations from what I can tell.  I know
> we're going to a lot of trouble to hide kernel addresses from being seen
> in userspace.  This seems like it would be undesirable for the folks
> that care about not leaking kernel addresses, especially for
> unprivileged users.
> 
> This would essentially tell userspace where in the kernel's address
> space some user-controlled data will be.

OK, so this and FINCORE_PAGEFLAGS will be limited to privileged users.

> > + * We can use multiple flags among the flags in FINCORE_LONGENTRY_MASK.
> > + * For example, when the mode is FINCORE_PFN|FINCORE_PAGEFLAGS, the per-page
> > + * information is stored like this:
> 
> Instead of specifying the ordering in the manpages alone, would it be
> smarter to just say that the ordering of the items is dependent on the
> ordering of the flags?  In other words if FINCORE_PFN <
> FINCORE_PAGEFLAGS, then its field comes first?

Ah, right. I should've mentioned the ordering here as well.
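
One way to read the "ordering follows the flag values" suggestion, as a
sketch only: the position of a field inside a per-page entry is the number
of selected flags with a smaller value. The EX_* flag values below are
invented for illustration and are not the values from the patch; it is
also an assumption that every selected field takes one 8-byte word.

```c
#include <stdint.h>

/* Hypothetical flag values, chosen only for this example. */
#define EX_PGOFF		0x02u
#define EX_PFN			0x04u
#define EX_PAGEFLAGS		0x08u
#define EX_LONGENTRY_MASK	(EX_PGOFF | EX_PFN | EX_PAGEFLAGS)

/* Count the set bits in v. */
static unsigned count_bits(unsigned v)
{
	unsigned n = 0;

	while (v) {
		v &= v - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}

/* Number of 8-byte words in each per-page entry for this mode. */
static unsigned entry_words(unsigned mode)
{
	return count_bits(mode & EX_LONGENTRY_MASK);
}

/*
 * Index of a field within an entry: the number of selected flags whose
 * value is lower than 'flag'. 'flag' must be a single bit. Returns -1
 * if the flag is not selected in 'mode'.
 */
static int field_index(unsigned mode, unsigned flag)
{
	if (!(mode & flag & EX_LONGENTRY_MASK))
		return -1;
	return (int)count_bits(mode & EX_LONGENTRY_MASK & (flag - 1));
}
```

With these example values, a mode of EX_PFN | EX_PAGEFLAGS gives two
words per entry, with the pfn at index 0 and the page flags at index 1.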

Thanks,
Naoya Horiguchi
