linux-fsdevel.vger.kernel.org archive mirror
From: Michael Stapelberg <michael+lkml@stapelberg.ch>
To: Miklos Szeredi <miklos@szeredi.hu>
Cc: Tejun Heo <tj@kernel.org>,
	Jack Smith <smith.jack.sidman@gmail.com>,
	fuse-devel <fuse-devel@lists.sourceforge.net>,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [fuse-devel] Writing to FUSE via mmap extremely slow (sometimes) on some machines?
Date: Mon, 9 Mar 2020 16:11:41 +0100
Message-ID: <CANnVG6n=ySfe1gOr=0ituQidp56idGARDKHzP0hv=ERedeMrMA@mail.gmail.com>
In-Reply-To: <CAJfpegsWwsmzWb6C61NXKh=TEGsc=TaSSEAsixbBvw_qF4R6YQ@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 2541 bytes --]

Thanks for clarifying. I have modified the mmap test program (see
attached) to optionally read in the entire output file before writing
to it when the WORKAROUND environment variable is set, thereby
preventing the FUSE reads in the write phase. I can now see a batch of
reads, followed by a batch of writes.

What’s interesting: when polling using “while :; do grep ^Bdi
/sys/kernel/debug/bdi/0:93/stats; sleep 0.1; done” and running the
mmap test program, I see:

BdiDirtied:            3566304 kB
BdiWritten:            3563616 kB
BdiWriteBandwidth:       13596 kBps

BdiDirtied:            3566304 kB
BdiWritten:            3563616 kB
BdiWriteBandwidth:       13596 kBps

BdiDirtied:            3566528 kB (+224 kB) <-- starting to dirty pages
BdiWritten:            3564064 kB (+448 kB) <-- starting to write
BdiWriteBandwidth:       10700 kBps <-- only bandwidth update!

BdiDirtied:            3668224 kB (+101696 kB) <-- all pages dirtied
BdiWritten:            3565632 kB (+1568 kB)
BdiWriteBandwidth:       10700 kBps

BdiDirtied:            3668224 kB
BdiWritten:            3665536 kB (+99904 kB) <-- all pages written
BdiWriteBandwidth:       10700 kBps

BdiDirtied:            3668224 kB
BdiWritten:            3665536 kB
BdiWriteBandwidth:       10700 kBps
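
The (+N kB) annotations above were worked out by hand. A minimal sketch
of computing them in C, assuming each sample of the stats file has
already been read into a string; bdi_field() and bdi_delta() are
hypothetical helper names, not part of any kernel or libc interface:

```c
#include <stdio.h>
#include <string.h>

/*
 * Look up a "Name:   <value> kB" style field in a text snapshot of the
 * bdi stats file. Returns the value, or -1 if the field is missing.
 * Hypothetical helper for illustration only.
 */
long long bdi_field(const char *snapshot, const char *name)
{
    size_t len = strlen(name);
    const char *line = snapshot;

    while (line != NULL && *line != '\0') {
        /* Match only at the start of a line, followed by a colon. */
        if (strncmp(line, name, len) == 0 && line[len] == ':') {
            long long value;
            if (sscanf(line + len + 1, "%lld", &value) == 1)
                return value;
            return -1;
        }
        /* Advance to the character after the next newline, if any. */
        line = strchr(line, '\n');
        if (line != NULL)
            line++;
    }
    return -1;
}

/*
 * Growth of one field between two snapshots, in the file's own unit.
 * The result is only meaningful if the field exists in both snapshots.
 */
long long bdi_delta(const char *before, const char *after, const char *name)
{
    return bdi_field(after, name) - bdi_field(before, name);
}
```

Requiring the colon right after the name keeps a lookup of one field
from matching a longer field name that merely starts the same way.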

This seems to suggest that the bandwidth measurement only captures the
rising slope of the transfer, but not the bulk of the transfer itself,
and is therefore inaccurate. The effect gets worse when the test
program doesn’t pre-read the output file, because the kernel then gets
fewer FUSE write requests out.
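
To put a rough number on the mismatch, here is a small sketch that
turns a counter delta into an average rate. It assumes the two samples
were about 100 ms apart, which is only approximate: the loop sleeps
0.1 s and additionally spends time in grep, and the write may have
occupied only part of the interval. implied_kbps() is a hypothetical
helper name:

```c
/*
 * Average rate in kB/s implied by a counter growing by delta_kb over
 * interval_ms milliseconds. Returns -1 for a non-positive interval.
 * Hypothetical helper for illustration only.
 */
long long implied_kbps(long long delta_kb, long long interval_ms)
{
    if (interval_ms <= 0)
        return -1;
    return delta_kb * 1000 / interval_ms;
}
```

Under that assumption, implied_kbps(99904, 100) gives 999040 kB/s for
the interval in which BdiWritten grew by 99904 kB, far above the
10700 kBps that BdiWriteBandwidth reported throughout.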

On Mon, Mar 9, 2020 at 3:36 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Mon, Mar 9, 2020 at 3:32 PM Michael Stapelberg
> <michael+lkml@stapelberg.ch> wrote:
> >
> > Here’s one more thing I noticed: when polling
> > /sys/kernel/debug/bdi/0:93/stats, I see that BdiDirtied and BdiWritten
> > remain at their original values while the kernel sends FUSE read
> > requests, and only goes up when the kernel transitions into sending
> > FUSE write requests. Notably, the page dirtying throttling happens in
> > the read phase, which is most likely why the write bandwidth is
> > (correctly) measured as 0.
> >
> > Do we have any ideas on why the kernel sends FUSE reads at all?
>
> Memory writes (stores) need the memory page to be up-to-date wrt. the
> backing file before proceeding.   This means that if the page hasn't
> yet been cached by the kernel, it needs to be read first.
>
> Thanks,
> Miklos
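
Since a store needs the page to be up to date first, one load per page
should be enough to trigger the read ahead of the write phase. A
minimal sketch of that idea, assuming the mapping is readable;
prefault_pages() is a hypothetical helper name, and the attached
program reads every byte instead:

```c
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Read one byte from every page of a mapping so that each page is
 * faulted in (and read from the backing file) before any store hits
 * it. Returns the number of pages touched.
 * Hypothetical helper for illustration only.
 */
size_t prefault_pages(const void *map, size_t length)
{
    long page = sysconf(_SC_PAGESIZE);
    const volatile uint8_t *p = map;
    size_t touched = 0;

    if (page <= 0)
        return 0;
    for (size_t off = 0; off < length; off += (size_t)page) {
        (void)p[off];   /* volatile load, so it is not optimized away */
        touched++;
    }
    return touched;
}
```

In the attached program this would be called as
prefault_pages(dst, fileSize) before the memcpy.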

[-- Attachment #2: mmap.c --]
[-- Type: text/x-csrc, Size: 2495 bytes --]

#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h> 
#include <fcntl.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdint.h>

/*
 * An implementation of copy ("cp") that uses memory maps.  Various
 * error checking has been removed to promote readability
 */

// Where we want the source file's memory map to live in virtual memory
// The destination file resides immediately after the source file
#define MAP_LOCATION 0x6100

int main (int argc, char *argv[]) {
 int fdin, fdout;
 char *src, *dst;
 struct stat statbuf;
 off_t fileSize = 0;

 if (argc != 3) {
   printf ("usage: a.out <fromfile> <tofile>\n");
   exit(1);
 }

 /* open the input file */
 if ((fdin = open (argv[1], O_RDONLY)) < 0) {
   printf ("can't open %s for reading\n", argv[1]);
   exit(1);
 }

 /* open/create the output file */
 if ((fdout = open (argv[2], O_RDWR | O_CREAT | O_TRUNC, 0600)) < 0) {
   printf ("can't create %s for writing\n", argv[2]);
   exit(1);
 }
 
 /* find size of input file */
 fstat (fdin,&statbuf) ;
 fileSize = statbuf.st_size;
 
 /* go to the location corresponding to the last byte */
 if (lseek (fdout, fileSize - 1, SEEK_SET) == -1) {
   printf ("lseek error\n");
   exit(1);
 }
 
 /* write a dummy byte at the last location */
 write (fdout, "", 1);
 
 /* 
  * memory map the input file.  Only the first two arguments are
  * interesting: 1) the location and 2) the size of the memory map 
  * in virtual memory space. Note that the location is only a "hint";
  * the OS can choose to return a different virtual memory address.
  * This is illustrated by the printf command below.
 */

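 src = mmap ((void*) MAP_LOCATION, fileSize,
	     PROT_READ, MAP_SHARED | MAP_POPULATE, fdin, 0);

 /* memory map the output file after the input file */
 /* (cast to char* because arithmetic on void* is a GNU extension) */
 dst = mmap ((char*) MAP_LOCATION + fileSize, fileSize,
	     PROT_READ | PROT_WRITE, MAP_SHARED, fdout, 0);

 /* mmap reports failure as MAP_FAILED, not NULL */
 if (src == MAP_FAILED || dst == MAP_FAILED) {
   printf ("mmap error\n");
   exit(1);
 }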


 printf("pid: %d\n", getpid());
 /* %p takes a void* and already prints the 0x prefix on glibc */
 printf("Mapped src: %p  and dst: %p\n", (void*)src, (void*)dst);

 if (getenv("WORKAROUND") != NULL) {
   printf("workaround: reading output file before dirtying its pages\n");
   uint8_t sum = 0;
   uint8_t *ptr = (uint8_t*)dst;
   for (off_t i = 0; i < fileSize; i++) {
     sum += *ptr;
     ptr++;
   }
   printf("sum: %d\n", sum);
   sleep(1);
   printf("writing\n");
 }

 /* Copy the input file to the output file */
 memcpy (dst, src, fileSize);

 printf("memcpy done\n");

 // we should probably unmap memory and close the files
} /* main */


Thread overview: 11+ messages
2020-02-24 13:29 Writing to FUSE via mmap extremely slow (sometimes) on some machines? Michael Stapelberg
     [not found] ` <CACQJH27s4HKzPgUkVT+FXWLGqJAAMYEkeKe7cidcesaYdE2Vog@mail.gmail.com>
     [not found]   ` <CANnVG6=Ghu5r44mTkr0uXx_ZrrWo2N5C_UEfM59110Zx+HApzw@mail.gmail.com>
     [not found]     ` <CAJfpegvzhfO7hg1sb_ttQF=dmBeg80WVkV8srF3VVYHw9ybV0w@mail.gmail.com>
     [not found]       ` <CANnVG6kSJJw-+jtjh-ate7CC3CsB2=ugnQpA9ACGFdMex8sftg@mail.gmail.com>
     [not found]         ` <CAJfpegtkEU9=3cvy8VNr4SnojErYFOTaCzUZLYvMuQMi050bPQ@mail.gmail.com>
2020-03-03 10:34           ` [fuse-devel] " Michael Stapelberg
2020-03-03 13:04           ` Tejun Heo
2020-03-03 14:03             ` Michael Stapelberg
2020-03-03 14:13               ` Tejun Heo
2020-03-03 14:21                 ` Michael Stapelberg
2020-03-03 14:25                   ` Tejun Heo
     [not found]                     ` <CANnVG6=yf82CcwmdmawmjTP2CskD-WhcvkLnkZs7hs0OG7KcTg@mail.gmail.com>
2020-03-09 14:32                       ` Michael Stapelberg
2020-03-09 14:36                         ` Miklos Szeredi
2020-03-09 15:11                           ` Michael Stapelberg [this message]
2020-03-12 15:45                             ` Michael Stapelberg
