From: John Hubbard <jhubbard@nvidia.com>
To: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>,
	Matthew Wilcox <willy@infradead.org>,
	Michal Hocko <mhocko@kernel.org>,
	Christopher Lameter <cl@linux.com>,
	Jason Gunthorpe <jgg@ziepe.ca>,
	Dan Williams <dan.j.williams@intel.com>, Jan Kara <jack@suse.cz>,
	linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>,
	LKML <linux-kernel@vger.kernel.org>,
	linux-rdma <linux-rdma@vger.kernel.org>,
	linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
Date: Sun, 4 Nov 2018 23:10:12 -0800
Message-ID: <84811b54-60bf-2bc3-a58d-6a7925c24aad@nvidia.com>
In-Reply-To: <20181013164740.GA6593@infradead.org>

On 10/13/18 9:47 AM, Christoph Hellwig wrote:
> On Sat, Oct 13, 2018 at 12:34:12AM -0700, John Hubbard wrote:
>> In patch 6/6, pin_page_for_dma(), which is called at the end of get_user_pages(),
>> unceremoniously rips the pages out of the LRU, as a prerequisite to using
>> either of the page->dma_pinned_* fields. 
>>
>> The idea is that LRU is not especially useful for this situation anyway,
>> so we'll just make it one or the other: either a page is dma-pinned, and
>> just hanging out doing RDMA most likely (and LRU is less meaningful during that
>> time), or it's possibly on an LRU list.
> 
> Have you done any benchmarking what this does to direct I/O performance,
> especially for small I/O directly to a (fast) block device?
> 

Hi Christoph,

I'm seeing about a 20% slowdown in one case: lots of reads and writes of size 8192 B,
on a fast NVMe device. My put_page() --> put_user_page() conversions are still
incomplete and buggy, but I've got enough of them done to briefly run the test.

One thing that occurs to me is that jumping on and off the LRU takes time, and
if we limited this to 64-bit platforms, maybe we could use a real page flag? I 
know that leaves 32-bit out in the cold, but...maybe use this slower approach
for 32-bit, and the pure page flag for 64-bit? Ugh, we shouldn't slow down anything
by 20%.
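
To make the "real page flag" idea concrete, here is a toy user-space model of
it. This is only a sketch under stated assumptions, not kernel code: struct
model_page, the model_*() helpers, the bit position, and the pinned_slow field
are all hypothetical names invented for illustration. On a 64-bit build the
pinned state is one bit in the flags word; on a 32-bit build a separate field
stands in for the slower tracking approach.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Assumption: only a 64-bit flags word has a bit to spare for this. */
#if ULONG_MAX > 0xffffffffUL
#define MODEL_HAVE_PINNED_FLAG 1
#define MODEL_PG_DMA_PINNED_MASK (1UL << 40)	/* bit 40 is arbitrary */
#else
#define MODEL_HAVE_PINNED_FLAG 0
#endif

/* Toy stand-in for a page descriptor; not the kernel's struct page. */
struct model_page {
	atomic_ulong flags;
#if !MODEL_HAVE_PINNED_FLAG
	atomic_bool pinned_slow;	/* stand-in for the slower 32-bit approach */
#endif
};

void model_set_dma_pinned(struct model_page *page)
{
#if MODEL_HAVE_PINNED_FLAG
	atomic_fetch_or(&page->flags, MODEL_PG_DMA_PINNED_MASK);
#else
	atomic_store(&page->pinned_slow, true);
#endif
}

void model_clear_dma_pinned(struct model_page *page)
{
#if MODEL_HAVE_PINNED_FLAG
	atomic_fetch_and(&page->flags, ~MODEL_PG_DMA_PINNED_MASK);
#else
	atomic_store(&page->pinned_slow, false);
#endif
}

bool model_is_dma_pinned(struct model_page *page)
{
#if MODEL_HAVE_PINNED_FLAG
	return (atomic_load(&page->flags) & MODEL_PG_DMA_PINNED_MASK) != 0;
#else
	return atomic_load(&page->pinned_slow);
#endif
}
```

The point of the model is only that the 64-bit path is a single atomic bit
operation on a word the page already has, with no list manipulation at all.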

Test program is below. I hope I didn't overlook something obvious, but it's 
definitely possible, given my lack of experience with direct IO. 

I'm preparing to send an updated RFC this week that incorporates the feedback to date,
along with many converted call sites, so that everyone can see what the whole
(proposed) story would look like in its latest incarnation.

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>

/* Macros rather than const objects, so both are true constant expressions. */
#define BUF_SIZE       4096U
#define FULL_DATA_SIZE (2 * BUF_SIZE)

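void read_from_file(int fd, size_t how_much, char *buf)
{
	ssize_t bytes_read;	/* read() returns ssize_t; -1 signals an error */

	/* Each pass reads one BUF_SIZE chunk into the same buffer. */
	for (size_t index = 0; index < how_much; index += BUF_SIZE) {
		bytes_read = read(fd, buf, BUF_SIZE);
		if (bytes_read < 0) {
			printf("reading file failed: %m\n");
			exit(3);
		}
		/* A short read does not set errno, so report the count instead. */
		if ((size_t)bytes_read != BUF_SIZE) {
			printf("reading file failed: short read of %zd bytes\n",
			       bytes_read);
			exit(3);
		}
	}
}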

void seek_to_start(int fd, char *caller)
{
	off_t result = lseek(fd, 0, SEEK_SET);
	if (result == -1) {
		printf("%s: lseek failed: %m\n", caller);
		exit(4);
	}
}

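void write_to_file(int fd, size_t how_much, char *buf)
{
	ssize_t result;		/* write() returns ssize_t; -1 signals an error */

	for (size_t index = 0; index < how_much; index += BUF_SIZE) {
		result = write(fd, buf, BUF_SIZE);
		if (result < 0) {
			printf("writing file failed: %m\n");
			exit(3);
		}
		/* Treat a short write as a failure too, matching the read path. */
		if ((size_t)result != BUF_SIZE) {
			printf("writing file failed: short write of %zd bytes\n",
			       result);
			exit(3);
		}
	}
}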

void read_and_write(int fd, size_t how_much, char * buf)
{
	seek_to_start(fd, "About to read");
	read_from_file(fd, how_much, buf);

	memset(buf, 'a', BUF_SIZE);

	seek_to_start(fd, "About to write");
	write_to_file(fd, how_much, buf);
}

int main(int argc, char *argv[])
{
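	void *buf;
	/*
	 * O_DIRECT requires at least 512 B alignment, but runs faster
	 * (2.8 sec, vs. 3.5 sec) with 4096 B alignment.
	 */
	unsigned align = 4096;
	/* posix_memalign() returns an error number directly; errno is not set. */
	int err = posix_memalign(&buf, align, BUF_SIZE);
	if (err != 0) {
		printf("posix_memalign failed: %s\n", strerror(err));
		return 5;
	}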

	if (argc < 3) {
		printf("Usage: %s <filename> <iterations>\n", argv[0]);
		return 1;
	}
	char *filename = argv[1];
	unsigned iterations = strtoul(argv[2], 0, 0);

	/* Not using O_SYNC for now, anyway. */
	int fd = open(filename, O_DIRECT | O_RDWR);
	if (fd < 0) {
		printf("Failed to open %s: %m\n", filename);
		return 2;
	}

	printf("File: %s, data size: %u, iterations: %u\n",
		       filename, FULL_DATA_SIZE, iterations);

	/* Unsigned counter, to match the type of iterations. */
	for (unsigned count = 0; count < iterations; count++) {
		read_and_write(fd, FULL_DATA_SIZE, buf);
	}

	close(fd);
	return 0;
}
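
For turning two runs of a program like this into a percentage, a small timing
helper may be handy. This is a minimal sketch, assuming clock_gettime() with
CLOCK_MONOTONIC is available; elapsed_seconds(), percent_slower(), and
time_call() are hypothetical helper names and are not part of the test
program above.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Seconds elapsed between two timestamps taken from the same clock. */
double elapsed_seconds(const struct timespec *start,
		       const struct timespec *end)
{
	return (double)(end->tv_sec - start->tv_sec) +
	       (double)(end->tv_nsec - start->tv_nsec) / 1e9;
}

/* How much slower 'other' is than 'baseline', as a percentage of baseline. */
double percent_slower(double baseline_sec, double other_sec)
{
	return (other_sec - baseline_sec) / baseline_sec * 100.0;
}

/* Run fn(arg) once and return how long it took, in seconds. */
double time_call(void (*fn)(void *), void *arg)
{
	struct timespec start, end;

	clock_gettime(CLOCK_MONOTONIC, &start);
	fn(arg);
	clock_gettime(CLOCK_MONOTONIC, &end);

	return elapsed_seconds(&start, &end);
}
```

As a worked example, the two alignment timings quoted in the comment in main()
give percent_slower(2.8, 3.5), which is (3.5 - 2.8) / 2.8 * 100, or about 25%.
Wrapping the read_and_write() loop in a function passed to time_call() would
give the per-kernel numbers to compare in the same way.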


thanks,
-- 
John Hubbard
NVIDIA
