From: Joel Fernandes <joel@joelfernandes.org>
To: Jann Horn <jannh@google.com>
Cc: kernel list <linux-kernel@vger.kernel.org>,
Alexey Dobriyan <adobriyan@gmail.com>,
Andrew Morton <akpm@linux-foundation.org>,
Borislav Petkov <bp@alien8.de>,
Brendan Gregg <bgregg@netflix.com>,
Catalin Marinas <catalin.marinas@arm.com>,
Christian Hansen <chansen3@cisco.com>,
Daniel Colascione <dancol@google.com>,
fmayer@google.com, "H. Peter Anvin" <hpa@zytor.com>,
Ingo Molnar <mingo@redhat.com>, Jonathan Corbet <corbet@lwn.net>,
Kees Cook <keescook@chromium.org>,
kernel-team <kernel-team@android.com>,
Linux API <linux-api@vger.kernel.org>,
linux-doc@vger.kernel.org,
linux-fsdevel <linux-fsdevel@vger.kernel.org>,
Linux-MM <linux-mm@kvack.org>, Michal Hocko <mhocko@suse.com>,
Mike Rapoport <rppt@linux.ibm.com>,
Minchan Kim <minchan@kernel.org>,
namhyung@google.com, "Paul E. McKenney" <paulmck@linux.ibm.com>,
Robin Murphy <robin.murphy@arm.com>, Roman Gushchin <guro@fb.com>,
Stephen Rothwell <sfr@canb.auug.org.au>,
Suren Baghdasaryan <surenb@google.com>,
Thomas Gleixner <tglx@linutronix.de>,
Todd Kjos <tkjos@google.com>,
Vladimir Davydov <vdavydov.dev@gmail.com>,
Vlastimil Babka <vbabka@suse.cz>, Will Deacon <will@kernel.org>
Subject: Re: [PATCH v5 1/6] mm/page_idle: Add per-pid idle page tracking using virtual index
Date: Tue, 13 Aug 2019 11:30:20 -0400 [thread overview]
Message-ID: <20190813153020.GC14622@google.com> (raw)
In-Reply-To: <CAG48ez0ysprvRiENhBkLeV9YPTN_MB18rbu2HDa2jsWo5FYR8g@mail.gmail.com>
On Mon, Aug 12, 2019 at 08:14:38PM +0200, Jann Horn wrote:
[snip]
> > +/* Helper to get the start and end frame given a pos and count */
> > +static int page_idle_get_frames(loff_t pos, size_t count, struct mm_struct *mm,
> > + unsigned long *start, unsigned long *end)
> > +{
> > + unsigned long max_frame;
> > +
> > + /* If an mm is not given, assume we want physical frames */
> > + max_frame = mm ? (mm->task_size >> PAGE_SHIFT) : max_pfn;
> > +
> > + if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
> > + return -EINVAL;
> > +
> > + *start = pos * BITS_PER_BYTE;
> > + if (*start >= max_frame)
> > + return -ENXIO;
> > +
> > + *end = *start + count * BITS_PER_BYTE;
> > + if (*end > max_frame)
> > + *end = max_frame;
> > + return 0;
> > +}
>
> You could add some overflow checks for the multiplications. I haven't
> seen any place where it actually matters, but it seems unclean; and in
> particular, on a 32-bit architecture where the maximum user address is
> very high (like with a 4G:4G split), it looks like this function might
> theoretically return with `*start > *end`, which could be confusing to
> callers.
I could store the multiplication result in unsigned long long (since we are
bounds checking with max_frame, start > end would not occur). Something like
the following (with extraneous casts). But I'll think some more about the
point you are raising.
diff --git a/mm/page_idle.c b/mm/page_idle.c
index 1ea21bbb36cb..8fd7a5559986 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -135,6 +135,7 @@ static int page_idle_get_frames(loff_t pos, size_t count, struct mm_struct *mm,
unsigned long *start, unsigned long *end)
{
unsigned long max_frame;
+ unsigned long long tmp;
/* If an mm is not given, assume we want physical frames */
max_frame = mm ? (mm->task_size >> PAGE_SHIFT) : max_pfn;
@@ -142,13 +143,16 @@ static int page_idle_get_frames(loff_t pos, size_t count, struct mm_struct *mm,
if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
return -EINVAL;
- *start = pos * BITS_PER_BYTE;
- if (*start >= max_frame)
+ tmp = pos * BITS_PER_BYTE;
+ if (tmp >= (unsigned long long)max_frame)
return -ENXIO;
+ *start = (unsigned long)tmp;
- *end = *start + count * BITS_PER_BYTE;
- if (*end > max_frame)
+ tmp = *start + count * BITS_PER_BYTE;
+ if (tmp > (unsigned long long)max_frame)
*end = max_frame;
+ else
+ *end = (unsigned long)tmp;
return 0;
}
> [...]
> > for (; pfn < end_pfn; pfn++) {
> > bit = pfn % BITMAP_CHUNK_BITS;
> > if (!bit)
> > *out = 0ULL;
> > - page = page_idle_get_page(pfn);
> > - if (page) {
> > - if (page_is_idle(page)) {
> > - /*
> > - * The page might have been referenced via a
> > - * pte, in which case it is not idle. Clear
> > - * refs and recheck.
> > - */
> > - page_idle_clear_pte_refs(page);
> > - if (page_is_idle(page))
> > - *out |= 1ULL << bit;
> > - }
> > + page = page_idle_get_page_pfn(pfn);
> > + if (page && page_idle_pte_check(page)) {
> > + *out |= 1ULL << bit;
> > put_page(page);
> > }
>
> The `page && !page_idle_pte_check(page)` case looks like it's missing
> a put_page(); you probably intended to write something like this?
>
> page = page_idle_get_page_pfn(pfn);
> if (page) {
> if (page_idle_pte_check(page))
> *out |= 1ULL << bit;
> put_page(page);
> }
Ah, you're right. Will fix, good eyes and thank you!
> [...]
> > +/* page_idle tracking for /proc/<pid>/page_idle */
> > +
> > +struct page_node {
> > + struct page *page;
> > + unsigned long addr;
> > + struct list_head list;
> > +};
> > +
> > +struct page_idle_proc_priv {
> > + unsigned long start_addr;
> > + char *buffer;
> > + int write;
> > +
> > + /* Pre-allocate and provide nodes to pte_page_idle_proc_add() */
> > + struct page_node *page_nodes;
> > + int cur_page_node;
> > + struct list_head *idle_page_list;
> > +};
>
> A linked list is a weird data structure to use if the list elements
> are just consecutive array elements.
Not all of the pages will be considered in the later parts of the code, some
pages are skipped.
However, in v4 I added an array to allocate all the page_node structures,
since Andrew did not want GFP_ATOMIC allocations of individual list nodes.
So I think I could get rid of the linked list and leave the unused
page_node(s) as NULL, but I don't know I have to check if that is possible. I
could be missing a corner case, regardless I'll make a note of this and try!
> > +/*
> > + * Add page to list to be set as idle later.
> > + */
> > +static void pte_page_idle_proc_add(struct page *page,
> > + unsigned long addr, struct mm_walk *walk)
> > +{
> > + struct page *page_get = NULL;
> > + struct page_node *pn;
> > + int bit;
> > + unsigned long frames;
> > + struct page_idle_proc_priv *priv = walk->private;
> > + u64 *chunk = (u64 *)priv->buffer;
> > +
> > + if (priv->write) {
> > + VM_BUG_ON(!page);
> > +
> > + /* Find whether this page was asked to be marked */
> > + frames = (addr - priv->start_addr) >> PAGE_SHIFT;
> > + bit = frames % BITMAP_CHUNK_BITS;
> > + chunk = &chunk[frames / BITMAP_CHUNK_BITS];
> > + if (((*chunk >> bit) & 1) == 0)
> > + return;
>
> This means that BITMAP_CHUNK_SIZE is UAPI on big-endian systems,
> right? My opinion is that it would be slightly nicer to design the
> UAPI such that incrementing virtual addresses are mapped to
> incrementing offsets in the buffer (iow, either use bytewise access or
> use little-endian), but I'm not going to ask you to redesign the UAPI
> this late.
That would also be slow and consume more memory in userspace buffers.
Currently, a 64-bit (8 byte) chunk accounts for 64 pages worth or 256KB.
Also I wanted to keep the interface consistent with the global
/sys/kernel/mm/page_idle interface.
> [...]
> > +ssize_t page_idle_proc_generic(struct file *file, char __user *ubuff,
> > + size_t count, loff_t *pos, int write)
> > +{
> [...]
> > + down_read(&mm->mmap_sem);
> [...]
> > +
> > + if (!write && !walk_error)
> > + ret = copy_to_user(ubuff, buffer, count);
> > +
> > + up_read(&mm->mmap_sem);
>
> I'd move the up_read() above the copy_to_user(); copy_to_user() can
> block, and there's no reason to hold the mmap_sem across
> copy_to_user().
Will do.
> Sorry about only chiming in at v5 with all this.
No problem, I'm glad you did!
thanks,
- Joel
next prev parent reply other threads:[~2019-08-13 15:30 UTC|newest]
Thread overview: 33+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-08-07 17:15 [PATCH v5 1/6] mm/page_idle: Add per-pid idle page tracking using virtual index Joel Fernandes (Google)
2019-08-07 17:15 ` [PATCH v5 2/6] mm/page_idle: Add support for handling swapped PG_Idle pages Joel Fernandes (Google)
2019-08-13 15:04 ` Michal Hocko
2019-08-13 15:36 ` Joel Fernandes
2019-08-13 19:24 ` Konstantin Khlebnikov
2019-08-14 8:05 ` Michal Hocko
2019-08-14 16:32 ` Joel Fernandes
2019-08-14 18:36 ` Michal Hocko
2019-08-07 17:15 ` [PATCH v5 3/6] [RFC] x86: Add support for idle bit in swap PTE Joel Fernandes (Google)
2019-08-07 17:15 ` [PATCH v5 4/6] [RFC] arm64: " Joel Fernandes (Google)
2019-08-07 17:15 ` [PATCH v5 5/6] page_idle: Drain all LRU pagevec before idle tracking Joel Fernandes (Google)
2019-08-07 17:15 ` [PATCH v5 6/6] doc: Update documentation for page_idle virtual address indexing Joel Fernandes (Google)
2019-08-07 20:04 ` [PATCH v5 1/6] mm/page_idle: Add per-pid idle page tracking using virtual index Andrew Morton
2019-08-07 20:45 ` Joel Fernandes
2019-08-07 20:58 ` Andrew Morton
2019-08-07 21:31 ` Joel Fernandes
2019-08-07 21:55 ` Joel Fernandes
2019-08-08 8:00 ` Michal Hocko
2019-08-12 14:56 ` Joel Fernandes
2019-08-13 9:14 ` Michal Hocko
2019-08-13 13:51 ` Joel Fernandes
2019-08-13 14:14 ` Michal Hocko
2019-08-12 18:14 ` Jann Horn
2019-08-13 10:08 ` Michal Hocko
2019-08-13 14:25 ` Joel Fernandes
2019-08-13 15:19 ` Jann Horn
2019-08-13 15:29 ` Jann Horn
2019-08-13 15:34 ` Daniel Gruss
2019-08-13 19:18 ` Joel Fernandes
2019-08-14 7:56 ` Michal Hocko
2019-08-19 21:52 ` Joel Fernandes
2019-08-13 15:30 ` Joel Fernandes [this message]
2019-08-13 15:40 ` Jann Horn
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20190813153020.GC14622@google.com \
--to=joel@joelfernandes.org \
--cc=adobriyan@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=bgregg@netflix.com \
--cc=bp@alien8.de \
--cc=catalin.marinas@arm.com \
--cc=chansen3@cisco.com \
--cc=corbet@lwn.net \
--cc=dancol@google.com \
--cc=fmayer@google.com \
--cc=guro@fb.com \
--cc=hpa@zytor.com \
--cc=jannh@google.com \
--cc=keescook@chromium.org \
--cc=kernel-team@android.com \
--cc=linux-api@vger.kernel.org \
--cc=linux-doc@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mhocko@suse.com \
--cc=minchan@kernel.org \
--cc=mingo@redhat.com \
--cc=namhyung@google.com \
--cc=paulmck@linux.ibm.com \
--cc=robin.murphy@arm.com \
--cc=rppt@linux.ibm.com \
--cc=sfr@canb.auug.org.au \
--cc=surenb@google.com \
--cc=tglx@linutronix.de \
--cc=tkjos@google.com \
--cc=vbabka@suse.cz \
--cc=vdavydov.dev@gmail.com \
--cc=will@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).