* Re: fadvise syscall?
2002-03-17 13:41 ` Anton Altaparmakov
@ 2002-03-17 14:31 ` Simon Richter
2002-03-17 14:56 ` Jan Hudec
2002-03-17 15:00 ` Anton Altaparmakov
` (3 subsequent siblings)
4 siblings, 1 reply; 41+ messages in thread
From: Simon Richter @ 2002-03-17 14:31 UTC (permalink / raw)
To: Anton Altaparmakov
Cc: Jeff Garzik, Andrew Morton, linux-kernel, linux-fsdevel
On Sun, 17 Mar 2002, Anton Altaparmakov wrote:
> All of what you are asking for exists in Windows and all the semantics are
> implemented through a very powerful open(2) equivalent. I don't see why we
> shouldn't do the same. It makes more sense to me than inventing yet another
> system call...
It is easier for application writers to code:
[...]
#ifdef HAVE_FADVISE
(void)fadvise(fd, FADV_STREAMING);
#endif
[...]
Than to have a forest of #ifdefs to determine which O_* flags are
supported. After all, we still want our programs to run under Solaris. :-)
Simon
--
GPG public key available from http://phobos.fs.tum.de/pgp/Simon.Richter.asc
Fingerprint: 040E B5F7 84F1 4FBC CEAD ADC6 18A0 CC8D 5706 A4B4
Hi! I'm a .signature virus! Copy me into your ~/.signature to help me spread!
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: fadvise syscall?
2002-03-17 14:31 ` Simon Richter
@ 2002-03-17 14:56 ` Jan Hudec
0 siblings, 0 replies; 41+ messages in thread
From: Jan Hudec @ 2002-03-17 14:56 UTC (permalink / raw)
To: linux-kernel, linux-fsdevel
> It is easier for application writers to code:
>
> [...]
> #ifdef HAVE_FADVISE
> (void)fadvise(fd, FADV_STREAMING);
> #endif
> [...]
>
> Than to have a forest of #ifdefs to determine which O_* flags are
> supported. After all, we still want our programs to run under Solaris. :-)
#ifndef O_STREAMING
#define O_STREAMING 0
#endif
(and then just use the flag in open)
is still better - it can be done in a header somewhere, once for all opens.
--------------------------------------------------------------------------------
- Jan Hudec `Bulb' <bulb@ucw.cz>
* Re: fadvise syscall?
2002-03-17 13:41 ` Anton Altaparmakov
2002-03-17 14:31 ` Simon Richter
@ 2002-03-17 15:00 ` Anton Altaparmakov
2002-03-17 19:20 ` Joel Becker
` (2 subsequent siblings)
4 siblings, 0 replies; 41+ messages in thread
From: Anton Altaparmakov @ 2002-03-17 15:00 UTC (permalink / raw)
To: Simon Richter; +Cc: Jeff Garzik, Andrew Morton, linux-kernel, linux-fsdevel
At 14:31 17/03/02, Simon Richter wrote:
>On Sun, 17 Mar 2002, Anton Altaparmakov wrote:
>
> > All of what you are asking for exists in Windows and all the semantics are
> > implemented through a very powerful open(2) equivalent. I don't see why we
> > shouldn't do the same. It makes more sense to me than inventing yet another
> > system call...
>
>It is easier for application writers to code:
>
>[...]
>#ifdef HAVE_FADVISE
> (void)fadvise(fd, FADV_STREAMING);
>#endif
>[...]
>
>Than to have a forest of #ifdefs to determine which O_* flags are
>supported. After all, we still want our programs to run under Solaris. :-)
Ugh. Both of your suggestions look ugly. Using the O_* flags, you just need
to have a compatibility header file which contains:
#ifndef HAVE_O_SEQUENTIAL
# define O_SEQUENTIAL 0
#endif
Then in the code you just use O_SEQUENTIAL and if the system doesn't know
about it it is optimised away at compile time.
Note how nicely this fits in with autoconf/automake where the ./configure
script can test for O_SEQUENTIAL and if it is not there automatically
define it to 0. That then means your code is completely free from these
ugly #ifdefs.
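A minimal sketch of the compatibility-header approach described above, assuming the ./configure script defines HAVE_O_SEQUENTIAL when the flag is available. O_SEQUENTIAL itself is hypothetical (it does not exist yet), and open_for_streaming() is only an illustrative helper name:

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Compat header part. O_SEQUENTIAL is a hypothetical flag; when configure
 * did not find it, define it to 0 so that OR-ing it in is a no-op. */
#ifndef HAVE_O_SEQUENTIAL
# define O_SEQUENTIAL 0
#endif

/* Application part: no #ifdef needed at the call site. */
static int open_for_streaming(const char *path)
{
    assert(path != NULL);
    return open(path, O_RDONLY | O_SEQUENTIAL);
}
```

On a system without the flag, the call is just open(path, O_RDONLY).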
Thanks for making your point as that is ANOTHER argument for using open(2)
instead of fadvise() [1]. (-;
Cheers,
Anton
[1] Yeah, I know, one could also define fadvise() to nothing in the compat
header file...
--
"I've not lost my mind. It's backed up on tape somewhere." - Unknown
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Linux NTFS Maintainer / WWW: http://linux-ntfs.sf.net/
ICQ: 8561279 / WWW: http://www-stu.christs.cam.ac.uk/~aia21/
* Re: fadvise syscall?
2002-03-17 13:41 ` Anton Altaparmakov
2002-03-17 14:31 ` Simon Richter
2002-03-17 15:00 ` Anton Altaparmakov
@ 2002-03-17 19:20 ` Joel Becker
2002-03-18 7:28 ` Jeff Garzik
2002-03-18 8:05 ` Joel Becker
4 siblings, 0 replies; 41+ messages in thread
From: Joel Becker @ 2002-03-17 19:20 UTC (permalink / raw)
To: Anton Altaparmakov
Cc: Jeff Garzik, Andrew Morton, linux-kernel, linux-fsdevel
On Sun, Mar 17, 2002 at 01:41:37PM +0000, Anton Altaparmakov wrote:
> When you want large data streaming, i.e. you start getting worried about
> memory pressure, then you want open(2) + O_DIRECT. No caching done. Perfect
> for large data streams and we have that already. I agree that you may want
> some form of asynchronous read ahead with passed pages being dropped from
> the cache but that could be just a open(2) + O_SEQUENTIAL (doesn't exist yet).
O_DIRECT isn't the right thing for large streaming. You want
readahead and dropbehind. O_DIRECT takes substantial penalties for its
lack of copy/caching. This works fine in certain circumstances
(applications that keep their own caching), but for something like a
video or mp3, you'll win with working dropbehind easily.
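To make "readahead and dropbehind" concrete, here is a rough sketch using posix_fadvise(), the interface proposed elsewhere in this thread. It assumes a system where posix_fadvise() is available; stream_file() is a made-up helper name, and whether the kernel really drops the pages is up to the implementation, since the calls are only hints:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Read a file front to back. Returns the number of bytes read, or -1. */
static long long stream_file(const char *path)
{
    char buf[64 * 1024];
    long long total = 0;
    ssize_t n;
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* Hint: sequential access, so aggressive readahead is welcome. */
    (void)posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    while ((n = read(fd, buf, sizeof buf)) > 0) {
        /* Dropbehind hint: the range just consumed will not be reused. */
        (void)posix_fadvise(fd, (off_t)total, (off_t)n, POSIX_FADV_DONTNEED);
        total += n;
    }
    close(fd);
    return n < 0 ? -1 : total;
}
```

Unlike O_DIRECT, the reads still go through the page cache, so readahead keeps working.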
Joel
--
Life's Little Instruction Book #444
"Never underestimate the power of a kind word or deed."
http://www.jlbec.org/
jlbec@evilplan.org
* Re: fadvise syscall?
2002-03-17 13:41 ` Anton Altaparmakov
` (2 preceding siblings ...)
2002-03-17 19:20 ` Joel Becker
@ 2002-03-18 7:28 ` Jeff Garzik
2002-03-18 7:55 ` Andrew Morton
2002-03-22 16:05 ` Pavel Machek
2002-03-18 8:05 ` Joel Becker
4 siblings, 2 replies; 41+ messages in thread
From: Jeff Garzik @ 2002-03-18 7:28 UTC (permalink / raw)
To: Anton Altaparmakov; +Cc: Andrew Morton, linux-kernel, linux-fsdevel
Anton Altaparmakov wrote:
> We don't need fadvise IMHO. That is what open(2) is for. The streaming
> request you are asking for is just a normal open(2). It will do read
> ahead which is perfect for streaming (of data size << RAM size in its
> current form).
>
> When you want large data streaming, i.e. you start getting worried
> about memory pressure, then you want open(2) + O_DIRECT. No caching
> done. Perfect for large data streams and we have that already. I agree
> that you may want some form of asynchronous read ahead with passed
> pages being dropped from the cache but that could be just a open(2) +
> O_SEQUENTIAL (doesn't exist yet).
>
> All of what you are asking for exists in Windows and all the semantics
> are implemented through a very powerful open(2) equivalent. I don't
> see why we shouldn't do the same. It makes more sense to me than
> inventing yet another system call...
I disagree, and here's the main reasons:
* fadvise(2) usefulness extends past open(2). It may be useful to call
it at various points during runtime.
* I think putting hints in open(2) is the wrong direction to go. Hints
have a potential to be very flexible. open(2) O_xxx bits are not to be
squandered lightly, while I see a lot more value in being a little more
loose and free with the bit assignment for an "fadvise mask" (just a
list of hint bits). IMO it should be easier to introduce and retire
hints, far easier than O_xxx flags.
Jeff
* Re: fadvise syscall?
2002-03-18 7:28 ` Jeff Garzik
@ 2002-03-18 7:55 ` Andrew Morton
2002-03-18 8:07 ` Jeff Garzik
2002-03-18 16:41 ` Richard Gooch
2002-03-22 16:05 ` Pavel Machek
1 sibling, 2 replies; 41+ messages in thread
From: Andrew Morton @ 2002-03-18 7:55 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Anton Altaparmakov, linux-kernel, linux-fsdevel
Jeff Garzik wrote:
>
> * fadvise(2) usefulness extends past open(2). It may be useful to call
> it at various points during runtime.
>
> * I think putting hints in open(2) is the wrong direction to go. Hints
> have a potential to be very flexible. open(2) O_xxx bits are not to be
> squandered lightly, while I see a lot more value in being a little more
> loose and free with the bit assignment for an "fadvise mask" (just a
> list of hint bits). IMO it should be easier to introduce and retire
> hints, far easier than O_xxx flags.
>
Yup.
posix_fadvise() looks to be a fine interface:
int posix_fadvise(int fd, off_t offset, size_t len, int advice);
DESCRIPTION
The posix_fadvise() function shall advise the implementation on
the expected behavior of the application with respect to the data in
the file associated with the open file descriptor, fd, starting at offset
and continuing for len bytes. The specified range need not currently
exist in the file. If len is zero, all data following offset is specified.
The implementation may use this information to optimize handling
of the specified data. The posix_fadvise() function shall have no
effect on the semantics of other operations on the specified data,
although it may affect the performance of other operations.
The advice to be applied to the data is specified by the advice
parameter and may be one of the following values:
POSIX_FADV_NORMAL
Specifies that the application has no advice to give on its
behavior with respect to the specified data. It is the default
characteristic if no advice is given for an open file.
POSIX_FADV_SEQUENTIAL
Specifies that the application expects to access the specified
data sequentially from lower offsets to higher offsets.
POSIX_FADV_RANDOM
Specifies that the application expects to access the specified
data in a random order.
POSIX_FADV_WILLNEED
Specifies that the application expects to access the specified
data in the near future.
POSIX_FADV_DONTNEED
Specifies that the application expects that it will not access
the specified data in the near future.
POSIX_FADV_NOREUSE
Specifies that the application expects to access the specified
data once and then not reuse it thereafter.
We can usefully implement all of these. FADV_WILLNEED obsoletes
sys_readahead().
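As an illustration of the WILLNEED case, a minimal sketch of a prefetch helper, assuming a libc that already provides posix_fadvise(). The name prefetch_range() is hypothetical. Note that posix_fadvise() returns an error number directly rather than setting errno:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Ask for [offset, offset + len) to be read in ahead of use.
 * Returns 0 on success or an error number; len == 0 means "to end of file". */
static int prefetch_range(int fd, off_t offset, off_t len)
{
    assert(fd >= 0);
    return posix_fadvise(fd, offset, len, POSIX_FADV_WILLNEED);
}
```

The call is advisory: the implementation may start the I/O asynchronously or ignore the hint.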
We'll need to cheat a bit on the offset/len thing for NORMAL and
SEQUENTIAL - just apply it to the whole file - we don't want to have to
attach an arbitrary number of silly range objects to each file for this.
(We already cheat a bit this way with msync).
Note that it applies to a file descriptor. If posix_fadvise(FADV_DONTNEED) is
called against a file descriptor, and someone else has an fd open
against the same file, that other user gets their foot shot off. That's
OK.
Given this, I don't see a persuasive need to implement a non-standard
interface. It takes an off_t, so posix_fadvise64() is also needed.
The presence of this interface doesn't imply that we don't need
good dropbehind heuristics for streaming reads and writes. We
do need those.
I wouldn't suggest that anyone rush out and implement this stuff for 2.5.
There's some decrudding needed in filemap.c first, and many of these
hints need to interact with the 2.6 VM. Whatever that will be.
A 2.4 implementation could be done any time. If anyone decides to
do this, please let me know...
-
* Re: fadvise syscall?
2002-03-18 7:55 ` Andrew Morton
@ 2002-03-18 8:07 ` Jeff Garzik
2002-03-18 8:17 ` Andrew Morton
2002-03-18 16:41 ` Richard Gooch
1 sibling, 1 reply; 41+ messages in thread
From: Jeff Garzik @ 2002-03-18 8:07 UTC (permalink / raw)
To: Andrew Morton; +Cc: Anton Altaparmakov, linux-kernel, linux-fsdevel
Andrew Morton wrote:
>posix_fadvise() looks to be a fine interface:
>
>We'll need to cheat a bit on the offset/len thing for NORMAL and
>SEQUENTIAL - just apply it to the whole file - we don't want to have to
>attach an arbitrary number of silly range objects to each file for this.
>(We already cheat a bit this way with msync).
>
yep
>Given this, I don't see a persuasive need to implement a non-standard
>interface. It takes an off_t, so posix_fadvise64() is also needed.
>
agreed WRT non-standard.
Are we required to have both foo and foo64 variants? If I had my
druthers, I would just do the foo64 version.
>
>A 2.4 implementation could be done any time. If anyone decides to
>do this, please let me know...
>
count me down as interested after my current project... If someone else
does it, more power to them...
Jeff
* Re: fadvise syscall?
2002-03-18 8:07 ` Jeff Garzik
@ 2002-03-18 8:17 ` Andrew Morton
0 siblings, 0 replies; 41+ messages in thread
From: Andrew Morton @ 2002-03-18 8:17 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Anton Altaparmakov, linux-kernel, linux-fsdevel
Jeff Garzik wrote:
>
> ...
> >Given this, I don't see a persuasive need to implement a non-standard
> >interface. It takes an off_t, so posix_fadvise64() is also needed.
> >
> agreed WRT non-standard.
>
> Are we required to have both foo and foo64 variants? If I had my
> druthers, I would just do the foo64 version.
That would be good. I can't see a reason why
#define posix_fadvise posix_fadvise64
would not suffice. That doesn't mean there isn't one :)
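For comparison, a sketch of how userspace usually gets a 64-bit off_t, which is one way the 64-bit variant can end up as the only one callers see. It assumes a libc that honours _FILE_OFFSET_BITS (glibc does); advise_past_4g() is an illustrative name:

```c
/* Both macros must come before any system header to take effect. */
#ifndef _FILE_OFFSET_BITS
#define _FILE_OFFSET_BITS 64
#endif
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Fails at compile time if off_t is still 32 bits wide. */
_Static_assert(sizeof(off_t) == 8, "off_t must be 64 bits");

/* Advise sequential access for everything past the 4 GiB mark. */
static int advise_past_4g(int fd)
{
    off_t start = (off_t)1 << 32;   /* not representable in a 32-bit off_t */

    assert(fd >= 0);
    return posix_fadvise(fd, start, 0, POSIX_FADV_SEQUENTIAL);
}
```

The range need not exist in the file, so advising past the end of a small file is not an error.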
-
* Re: fadvise syscall?
2002-03-18 7:55 ` Andrew Morton
2002-03-18 8:07 ` Jeff Garzik
@ 2002-03-18 16:41 ` Richard Gooch
2002-03-18 19:00 ` Andrew Morton
1 sibling, 1 reply; 41+ messages in thread
From: Richard Gooch @ 2002-03-18 16:41 UTC (permalink / raw)
To: Andrew Morton
Cc: Jeff Garzik, Anton Altaparmakov, linux-kernel, linux-fsdevel
Andrew Morton writes:
> Note that it applies to a file descriptor. If
> posix_fadvise(FADV_DONTNEED) is called against a file descriptor,
> and someone else has an fd open against the same file, that other
> user gets their foot shot off. That's OK.
Let me verify that I understand what you're saying. Process A and B
independently open the file. The file is already in the cache (because
other processes regularly read this file). Process A is slowly reading
stuff. Process B does FADV_DONTNEED on the whole file. The pages are
dropped.
You're saying this is OK? How about this DoS attack:
  int fd = open ("/lib/libc.so", O_RDONLY, 0);
  while (1) {
      posix_fadvise (fd, 0, 0, POSIX_FADV_DONTNEED);
      sleep (1);
  }
Let me see that disc head move! Wheeee!
Regards,
Richard....
Permanent: rgooch@atnf.csiro.au
Current: rgooch@ras.ucalgary.ca
* Re: fadvise syscall?
2002-03-18 16:41 ` Richard Gooch
@ 2002-03-18 19:00 ` Andrew Morton
2002-03-18 19:15 ` Richard Gooch
0 siblings, 1 reply; 41+ messages in thread
From: Andrew Morton @ 2002-03-18 19:00 UTC (permalink / raw)
To: Richard Gooch
Cc: Jeff Garzik, Anton Altaparmakov, linux-kernel, linux-fsdevel
Richard Gooch wrote:
>
> Andrew Morton writes:
> > Note that it applies to a file descriptor. If
> > posix_fadvise(FADV_DONTNEED) is called against a file descriptor,
> > and someone else has an fd open against the same file, that other
> > user gets their foot shot off. That's OK.
>
> Let me verify that I understand what you're saying. Process A and B
> independently open the file. The file is already in the cache (because
> other processes regularly read this file). Process A is slowly reading
> stuff. Process B does FADV_DONTNEED on the whole file. The pages are
> dropped.
>
> You're saying this is OK? How about this DoS attack:
>   int fd = open ("/lib/libc.so", O_RDONLY, 0);
>   while (1) {
>       posix_fadvise (fd, 0, 0, POSIX_FADV_DONTNEED);
>       sleep (1);
>   }
>
> Let me see that disc head move! Wheeee!
>
POSIX_FADV_DONTNEED could only unmap pages from the caller's
VMA's, so the problem would only affect other processes which
share the same mm - CLONE_MM threads.
If some other process has a reference on the pages then they
wouldn't get unmapped as a result of this. It's the same
as madvise(MADV_DONTNEED).
-
* Re: fadvise syscall?
2002-03-18 19:00 ` Andrew Morton
@ 2002-03-18 19:15 ` Richard Gooch
0 siblings, 0 replies; 41+ messages in thread
From: Richard Gooch @ 2002-03-18 19:15 UTC (permalink / raw)
To: Andrew Morton
Cc: Jeff Garzik, Anton Altaparmakov, linux-kernel, linux-fsdevel
Andrew Morton writes:
> Richard Gooch wrote:
> >
> > Andrew Morton writes:
> > > Note that it applies to a file descriptor. If
> > > posix_fadvise(FADV_DONTNEED) is called against a file descriptor,
> > > and someone else has an fd open against the same file, that other
> > > user gets their foot shot off. That's OK.
> >
> > Let me verify that I understand what you're saying. Process A and B
> > independently open the file. The file is already in the cache (because
> > other processes regularly read this file). Process A is slowly reading
> > stuff. Process B does FADV_DONTNEED on the whole file. The pages are
> > dropped.
> >
> > You're saying this is OK? How about this DoS attack:
> >   int fd = open ("/lib/libc.so", O_RDONLY, 0);
> >   while (1) {
> >       posix_fadvise (fd, 0, 0, POSIX_FADV_DONTNEED);
> >       sleep (1);
> >   }
> >
> > Let me see that disc head move! Wheeee!
> >
>
> POSIX_FADV_DONTNEED could only unmap pages from the caller's
> VMA's, so the problem would only affect other processes which
> share the same mm - CLONE_MM threads.
>
> If some other process has a reference on the pages then they
> wouldn't get unmapped as a result of this. It's the same
> as madvise(MADV_DONTNEED).
OK, I misparsed what you had said. Good.
Regards,
Richard....
Permanent: rgooch@atnf.csiro.au
Current: rgooch@ras.ucalgary.ca
* Re: fadvise syscall?
2002-03-18 7:28 ` Jeff Garzik
2002-03-18 7:55 ` Andrew Morton
@ 2002-03-22 16:05 ` Pavel Machek
2002-03-24 6:38 ` Stevie O
1 sibling, 1 reply; 41+ messages in thread
From: Pavel Machek @ 2002-03-22 16:05 UTC (permalink / raw)
To: Jeff Garzik
Cc: Anton Altaparmakov, Andrew Morton, linux-kernel, linux-fsdevel
Hi!
> > We don't need fadvise IMHO. That is what open(2) is for. The streaming
> > request you are asking for is just a normal open(2). It will do read
> > ahead which is perfect for streaming (of data size << RAM size in its
> > current form).
> >
> > When you want large data streaming, i.e. you start getting worried
> > about memory pressure, then you want open(2) + O_DIRECT. No caching
> > done. Perfect for large data streams and we have that already. I agree
> > that you may want some form of asynchronous read ahead with passed
> > pages being dropped from the cache but that could be just a open(2) +
> > O_SEQUENTIAL (doesn't exist yet).
> >
> > All of what you are asking for exists in Windows and all the semantics
> > are implemented through a very powerful open(2) equivalent. I don't
> > see why we shouldn't do the same. It makes more sense to me than
> > inventing yet another system call...
>
>
>
> I disagree, and here's the main reasons:
>
> * fadvise(2) usefulness extends past open(2). It may be useful to call
> it at various points during runtime.
open(/proc/self/fd/0, O_NEW_FLAGS)?
> * I think putting hints in open(2) is the wrong direction to go. Hints
> have a potential to be very flexible. open(2) O_xxx bits are not to be
> squandered lightly, while I see a lot more value in being a little more
> loose and free with the bit assignment for an "fadvise mask" (just a
> list of hint bits). IMO it should be easier to introduce and retire
> hints, far easier than O_xxx flags.
I don't like idea of new syscall when open works just fine. First prove
O_X hints are useful, then extend them.
Pavel
--
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.
* Re: fadvise syscall?
2002-03-22 16:05 ` Pavel Machek
@ 2002-03-24 6:38 ` Stevie O
2002-03-24 11:24 ` Pavel Machek
0 siblings, 1 reply; 41+ messages in thread
From: Stevie O @ 2002-03-24 6:38 UTC (permalink / raw)
To: Pavel Machek, Jeff Garzik
Cc: Anton Altaparmakov, Andrew Morton, linux-kernel, linux-fsdevel
At 04:05 PM 3/22/2002 +0000, Pavel Machek wrote:
>>
>>
>> I disagree, and here's the main reasons:
>>
>> * fadvise(2) usefulness extends past open(2). It may be useful to call
>> it at various points during runtime.
>
>open(/proc/self/fd/0, O_NEW_FLAGS)?
So to use fadvise(), the system must have /proc mounted?
Not everybody mounts /proc -- it provides a lot of potential information to anybody who can access it ("hmm... they have a QZ48257 ethernet chipset [cat /proc/pci] -- let's see, sending this specific sequence of bytes in a TCP packet will lock up the receiver...").
--
Stevie-O
Real programmers use COPY CON PROGRAM.EXE
* Re: fadvise syscall?
2002-03-24 6:38 ` Stevie O
@ 2002-03-24 11:24 ` Pavel Machek
2002-03-24 12:52 ` Anton Altaparmakov
0 siblings, 1 reply; 41+ messages in thread
From: Pavel Machek @ 2002-03-24 11:24 UTC (permalink / raw)
To: Stevie O
Cc: Pavel Machek, Jeff Garzik, Anton Altaparmakov, Andrew Morton,
linux-kernel, linux-fsdevel
Hi!
> >> I disagree, and here's the main reasons:
> >>
> >> * fadvise(2) usefulness extends past open(2). It may be useful to call
> >> it at various points during runtime.
> >
> >open(/proc/self/fd/0, O_NEW_FLAGS)?
>
> So to use fadvise(), the system must have /proc mounted?
I think it is way more feasible than adding new syscall.
Pavel
--
Casualties in World Trade Center: ~3k dead inside the building,
cryptography in U.S.A. and free speech in Czech Republic.
* Re: fadvise syscall?
2002-03-24 11:24 ` Pavel Machek
@ 2002-03-24 12:52 ` Anton Altaparmakov
2002-03-25 11:12 ` Pavel Machek
0 siblings, 1 reply; 41+ messages in thread
From: Anton Altaparmakov @ 2002-03-24 12:52 UTC (permalink / raw)
To: Pavel Machek
Cc: Stevie O, Pavel Machek, Jeff Garzik, Andrew Morton, linux-kernel,
linux-fsdevel
At 11:24 24/03/02, Pavel Machek wrote:
>Hi!
>
> > >> I disagree, and here's the main reasons:
> > >>
> > >> * fadvise(2) usefulness extends past open(2). It may be useful to call
> > >> it at various points during runtime.
> > >
> > >open(/proc/self/fd/0, O_NEW_FLAGS)?
> >
> > So to use fadvise(), the system must have /proc mounted?
>
>I think it is way more feasible than adding new syscall.
Sorry but it is silly. (-; What's wrong with open("filename", O_FLAGS);
followed by fcntl(); if you want to modify them after opening. That is a
lot cleaner than going via proc in such a way...
posix_fadvise() can then be implemented in userspace and that can go via
fcntl(). That way we have the best of both worlds.
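A rough sketch of the fcntl() route for changing flags after open(2). No hint flags exist yet, so the usage below assumes O_APPEND as a stand-in for a real, changeable status flag; set_status_flag() is a hypothetical helper name:

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Turn one file status flag on or off for an already-open descriptor.
 * Returns 0 on success, -1 on error (errno is set by fcntl). */
static int set_status_flag(int fd, int flag, int on)
{
    int flags;

    assert(fd >= 0);
    flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    flags = on ? (flags | flag) : (flags & ~flag);
    return fcntl(fd, F_SETFL, flags) == -1 ? -1 : 0;
}
```

For example, set_status_flag(fd, O_APPEND, 1) followed later by set_status_flag(fd, O_APPEND, 0). A userspace posix_fadvise() wrapper along these lines would only cover whole-file hints, because F_SETFL carries no offset or length.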
Best regards,
Anton
--
"I've not lost my mind. It's backed up on tape somewhere." - Unknown
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Linux NTFS Maintainer / WWW: http://linux-ntfs.sf.net/
ICQ: 8561279 / WWW: http://www-stu.christs.cam.ac.uk/~aia21/
* Re: fadvise syscall?
2002-03-24 12:52 ` Anton Altaparmakov
@ 2002-03-25 11:12 ` Pavel Machek
0 siblings, 0 replies; 41+ messages in thread
From: Pavel Machek @ 2002-03-25 11:12 UTC (permalink / raw)
To: Anton Altaparmakov
Cc: Stevie O, Jeff Garzik, Andrew Morton, linux-kernel, linux-fsdevel
Hi!
> >> >> I disagree, and here's the main reasons:
> >> >>
> >> >> * fadvise(2) usefulness extends past open(2). It may be useful to call
> >> >> it at various points during runtime.
> >> >
> >> >open(/proc/self/fd/0, O_NEW_FLAGS)?
> >>
> >> So to use fadvise(), the system must have /proc mounted?
> >
> >I think it is way more feasible than adding new syscall.
>
> Sorry but it is silly. (-; What's wrong with open("filename", O_FLAGS);
> followed by fcntl(); if you want to modify them after opening. That is a
> lot cleaner than going via proc in such a way...
>
> posix_fadvise() can then be implemented in userspace and that can go via
> fcntl(). That way we have the best of both worlds.
Agreed, this is better than my proposal.
Pavel
--
Casualties in World Trade Center: ~3k dead inside the building,
cryptography in U.S.A. and free speech in Czech Republic.
* Re: fadvise syscall?
2002-03-17 13:41 ` Anton Altaparmakov
` (3 preceding siblings ...)
2002-03-18 7:28 ` Jeff Garzik
@ 2002-03-18 8:05 ` Joel Becker
2002-03-18 8:10 ` Jeff Garzik
2002-03-18 8:14 ` Andrew Morton
4 siblings, 2 replies; 41+ messages in thread
From: Joel Becker @ 2002-03-18 8:05 UTC (permalink / raw)
To: Anton Altaparmakov
Cc: Jeff Garzik, Andrew Morton, linux-kernel, linux-fsdevel
On Sun, Mar 17, 2002 at 01:41:37PM +0000, Anton Altaparmakov wrote:
> We don't need fadvise IMHO. That is what open(2) is for. The streaming
> request you are asking for is just a normal open(2). It will do read ahead
> which is perfect for streaming (of data size << RAM size in its current form).
A quick real world example of where fadvise can work well.
Imagine a database application that doesn't use O_DIRECT (for whatever
reason, could even be that they don't trust the linux implementation yet
:-). So, this database gets a query. That query requires a full table
scan, so it calls fadvise(fd, F_SEQUENTIAL). Then another query does
row-specific access, and caching helps. So it wants to turn off
F_SEQUENTIAL.
Other applications can use this sort of stuff. DBM could, for
instance. So might GIMP. Etc. Dynamic hints have real world
applications.
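The database case can be sketched as follows. This assumes the posix_fadvise() interface rather than the fadvise(fd, F_SEQUENTIAL) spelling used above, and the two helper names are made up for illustration:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Before a full table scan: expect sequential access on this descriptor. */
static int begin_table_scan(int fd)
{
    assert(fd >= 0);
    return posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
}

/* Before row-specific queries: turn the sequential hint off again. */
static int end_table_scan(int fd)
{
    assert(fd >= 0);
    return posix_fadvise(fd, 0, 0, POSIX_FADV_NORMAL);
}
```

Both calls take effect on a descriptor that is already open, which is the point of a dynamic hint.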
Joel
--
print STDOUT q
Just another Perl hacker,
unless $spring
-Larry Wall
http://www.jlbec.org/
jlbec@evilplan.org
* Re: fadvise syscall?
2002-03-18 8:05 ` Joel Becker
@ 2002-03-18 8:10 ` Jeff Garzik
2002-03-18 8:20 ` Joel Becker
2002-03-18 8:14 ` Andrew Morton
1 sibling, 1 reply; 41+ messages in thread
From: Jeff Garzik @ 2002-03-18 8:10 UTC (permalink / raw)
To: Joel Becker
Cc: Anton Altaparmakov, Andrew Morton, linux-kernel, linux-fsdevel
Joel Becker wrote:
>Other applications can use this sort of stuff. DBM could, for
>instance. So might GIMP. Etc. Dynamic hints have real world
>applications.
>
to be fair, fcntl(2) could be used in conjunction with open(2), to do
dynamic hints.
I prefer to separate the hints from other O_xxx flags, though, so
posix_fadvise seems to be applicable...
Jeff
* Re: fadvise syscall?
2002-03-18 8:10 ` Jeff Garzik
@ 2002-03-18 8:20 ` Joel Becker
0 siblings, 0 replies; 41+ messages in thread
From: Joel Becker @ 2002-03-18 8:20 UTC (permalink / raw)
To: Jeff Garzik
Cc: Joel Becker, Anton Altaparmakov, Andrew Morton, linux-kernel,
linux-fsdevel
On Mon, Mar 18, 2002 at 03:10:03AM -0500, Jeff Garzik wrote:
> to be fair, fcntl(2) could be used in conjunction with open(2), to do
> dynamic hints.
I wasn't speaking to the exact interface, just to the real world
usefulness of hints after open(2). But yes, surely :-)
Joel
--
"Baby, even the losers
Get lucky sometimes.
Even the losers
Keep a little bit of pride."
http://www.jlbec.org/
jlbec@evilplan.org
* Re: fadvise syscall?
2002-03-18 8:05 ` Joel Becker
2002-03-18 8:10 ` Jeff Garzik
@ 2002-03-18 8:14 ` Andrew Morton
2002-03-18 14:39 ` Martin K. Petersen
1 sibling, 1 reply; 41+ messages in thread
From: Andrew Morton @ 2002-03-18 8:14 UTC (permalink / raw)
To: Joel Becker; +Cc: Anton Altaparmakov, Jeff Garzik, linux-kernel, linux-fsdevel
Joel Becker wrote:
>
> On Sun, Mar 17, 2002 at 01:41:37PM +0000, Anton Altaparmakov wrote:
> > We don't need fadvise IMHO. That is what open(2) is for. The streaming
> > request you are asking for is just a normal open(2). It will do read ahead
> > which is perfect for streaming (of data size << RAM size in its current form).
>
> A quick real world example of where fadvise can work well.
> Imagine a database application that doesn't use O_DIRECT (for whatever
> reason, could even be that they don't trust the linux implementation yet
> :-).
O_DIRECT is broken against RAID0 (at least) in 2.5 at present. The
RAID driver gets sent BIOs which straddle two or more chunks and RAID
spits out lots of unpleasant warnings. Neil has been informed...
> So, this database gets a query. That query requires a full table
> scan, so it calls fadvise(fd, F_SEQUENTIAL). Then another query does
> row-specific access, and caching helps. So it wants to turn off
> F_SEQUENTIAL.
It'd probably be smarter for the application to hold two fds against
the same file for this sort of access pattern.
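A minimal sketch of the two-descriptor idea, assuming posix_fadvise() is available and that the sequential/random hints are tracked per open file rather than per inode (an assumption about the implementation, not something the interface guarantees). struct table_fds and open_table() are hypothetical names:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

struct table_fds {
    int scan_fd;     /* used for full table scans */
    int lookup_fd;   /* used for row-specific access */
};

/* Open the same file twice and give each descriptor its own hint.
 * Returns 0 on success, -1 if either open fails. */
static int open_table(const char *path, struct table_fds *t)
{
    assert(path != NULL && t != NULL);
    t->scan_fd = open(path, O_RDONLY);
    if (t->scan_fd < 0)
        return -1;
    t->lookup_fd = open(path, O_RDONLY);
    if (t->lookup_fd < 0) {
        close(t->scan_fd);
        return -1;
    }
    /* Hints are advisory, so failures here are deliberately ignored. */
    (void)posix_fadvise(t->scan_fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    (void)posix_fadvise(t->lookup_fd, 0, 0, POSIX_FADV_RANDOM);
    return 0;
}
```

Each query then picks the descriptor that matches its access pattern instead of toggling a hint back and forth.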
-
* Re: fadvise syscall?
2002-03-18 8:14 ` Andrew Morton
@ 2002-03-18 14:39 ` Martin K. Petersen
2002-03-18 19:15 ` Andrew Morton
0 siblings, 1 reply; 41+ messages in thread
From: Martin K. Petersen @ 2002-03-18 14:39 UTC (permalink / raw)
To: Andrew Morton
Cc: Joel Becker, Anton Altaparmakov, Jeff Garzik, linux-kernel,
linux-fsdevel
>>>>> "Andrew" == Andrew Morton <akpm@zip.com.au> writes:
Andrew> O_DIRECT is broken against RAID0 (at least) in 2.5 at present.
Andrew> The RAID driver gets sent BIOs which straddle two or more
Andrew> chunks and RAID spits out lots of unpleasant warnings. Neil
Andrew> has been informed...
Yep. I've been porting my original kiobuf based request splitter to
biobufs. It's almost there, I've just been extremely busy with
something else for a while.
It's not only when you straddle chunks. The current code does not
handle requests straddling RAID zones either.
--
Martin K. Petersen, Principal Linux Consultant, Linuxcare, Inc.
mkp@linuxcare.com, http://www.linuxcare.com/
SGI XFS for Linux Developer, http://oss.sgi.com/projects/xfs/
* Re: fadvise syscall?
2002-03-18 14:39 ` Martin K. Petersen
@ 2002-03-18 19:15 ` Andrew Morton
2002-03-18 19:42 ` Martin K. Petersen
0 siblings, 1 reply; 41+ messages in thread
From: Andrew Morton @ 2002-03-18 19:15 UTC (permalink / raw)
To: Martin K. Petersen
Cc: Joel Becker, Anton Altaparmakov, Jeff Garzik, linux-kernel,
linux-fsdevel
"Martin K. Petersen" wrote:
>
> >>>>> "Andrew" == Andrew Morton <akpm@zip.com.au> writes:
>
> Andrew> O_DIRECT is broken against RAID0 (at least) in 2.5 at present.
> Andrew> The RAID driver gets sent BIOs which straddle two or more
> Andrew> chunks and RAID spits out lots of unpleasant warnings. Neil
> Andrew> has been informed...
>
> Yep. I've been porting my original kiobuf based request splitter to
> biobufs. It's almost there, I've just been extremely busy with
> something else for a while.
>
> It's not only when you straddle chunks. The current code does not
> handle requests straddling RAID zones either.
google fails me - where does your kiobuf-based splitter live?
I'm curious to know how this will all work. Will it take a
large BIO and split it into a number of smaller, newly allocated
BIOs? That would be kinda sad, IMO - the current bio-per-bh
allocations in the normal I/O path are really expensive, and
it seems wrong to take a large BIO, split it into lots of
teeny ones and then reassemble all the way down at the driver
level.
If that's really the only way in which we can solve this problem,
would it not be better to pass information up to the higher layer,
telling it when the BIO which is currently under assembly cannot
be grown further? Say, blk_can_i_add_more_stuff_to_this_bio()?
Anyway. I'm interested. O_DIRECT is a bit of a weird curiosity,
but I'm working on making these big-BIO code paths *the* way in which
data gets to and from disk. It needs to be efficient ;)
-
* Re: fadvise syscall?
2002-03-18 19:15 ` Andrew Morton
@ 2002-03-18 19:42 ` Martin K. Petersen
2002-03-19 20:08 ` Eric W. Biederman
0 siblings, 1 reply; 41+ messages in thread
From: Martin K. Petersen @ 2002-03-18 19:42 UTC (permalink / raw)
To: Andrew Morton
Cc: Joel Becker, Anton Altaparmakov, Jeff Garzik, linux-kernel,
linux-fsdevel
>>>>> "Andrew" == Andrew Morton <akpm@zip.com.au> writes:
Andrew> google fails me - where does your kiobuf-based splitter live?
It's in the kiobuf XFS patches.
Andrew> I'm curious to know how this will all work. Will it take a
Andrew> large BIO and split it into a number of smaller, newly
Andrew> allocated BIOs?
For kiobufs I walked the request, cloned a new one every time I crossed a
stripe/device boundary and sent it off. I had my own completion
function with an atomic counter that would call the parent kiobuf's
end_io function when all clones had completed.
So I didn't chop the request into page sized chunks or something like
that.
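A rough user-space sketch of that scheme, assuming a single fixed chunk size and ignoring device mapping entirely. The struct and function names are hypothetical and are not the kiobuf or bio API; the clones here only carry an offset/length pair and a pointer back to the parent.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical parent request: counts clones that are still in flight. */
struct parent_io {
    atomic_int pending;   /* clones not yet completed */
    int completed;        /* set when the parent's end_io would run */
};

/* Hypothetical clone: one piece that lies entirely within one chunk. */
struct clone_io {
    struct parent_io *parent;
    uint64_t offset;
    uint64_t length;
};

/*
 * Walk [offset, offset + length) and emit one clone per chunk crossed.
 * Returns the number of clones written to 'out'. The walk stops early
 * if 'max' is too small, so callers must size 'out' generously.
 */
static size_t split_at_boundaries(struct parent_io *p, uint64_t offset,
                                  uint64_t length, uint64_t chunk,
                                  struct clone_io *out, size_t max)
{
    assert(chunk > 0);
    size_t n = 0;

    while (length > 0 && n < max) {
        uint64_t room = chunk - offset % chunk;   /* bytes left in chunk */
        uint64_t piece = length < room ? length : room;

        out[n].parent = p;
        out[n].offset = offset;
        out[n].length = piece;
        offset += piece;
        length -= piece;
        n++;
    }
    atomic_store(&p->pending, (int)n);
    p->completed = 0;
    return n;
}

/* Per-clone completion: the last clone to finish completes the parent. */
static void clone_end_io(struct clone_io *c)
{
    if (atomic_fetch_sub(&c->parent->pending, 1) == 1)
        c->parent->completed = 1;
}
```

A 136K request starting 4K before a 64K boundary comes out as four pieces (4K, 64K, 64K, 4K) rather than 34 page-sized ones.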
Andrew> If that's really the only way in which we can solve this
Andrew> problem, would it not be better to pass information up to the
Andrew> higher layer, telling it when the BIO which is currently under
Andrew> assembly cannot be grown further? Say,
Andrew> blk_can_i_add_more_stuff_to_this_bio()?
We tried different approaches. One of them was to be able to signal
to upper layers that your I/O was too big and please submit smaller
chunks. Running with that, however, the I/O size converged toward
small requests because you'd often start an I/O - say 4K - from a
stripe boundary. And that would kill it right off.
So unless the filesystem knows about stripe/device boundaries it's
really hard to get the size signalling right. And then what happens
when you stack LVM and MD?
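A toy model of that effect, under two assumptions that are not from the original experiments: a single fixed chunk size, and a submitter that shrinks its request size whenever it is told "too big" and never grows it back. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Largest request that can start at 'offset' without crossing a chunk. */
static uint64_t max_len_at(uint64_t offset, uint64_t chunk)
{
    assert(chunk > 0);
    return chunk - offset % chunk;
}

/*
 * Hypothetical submitter: starts out wanting 'want' bytes per I/O and
 * writes 'total' bytes sequentially from 'start'. Each time the lower
 * layer says the request is too big, it drops to the allowed size and
 * keeps using that size. Returns the request size it ends up with.
 */
static uint64_t converged_size(uint64_t start, uint64_t total,
                               uint64_t want, uint64_t chunk)
{
    uint64_t offset = start;
    uint64_t size = want;

    while (offset < start + total) {
        uint64_t limit = max_len_at(offset, chunk);

        if (size > limit)
            size = limit;   /* "too big, please submit smaller chunks" */
        offset += size;
    }
    return size;
}
```

In this model a stream that starts 4K before a 64K boundary is stuck at 4K requests from then on, while a chunk-aligned stream keeps its 64K requests.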
In the end, cloning the kiobuf from the above and adjusting
offset/length in the children turned out to be the best approach.
And I suspect that's why Jens kept the clone facility around for bio
bufs :)
Andrew> Anyway. I'm interested. O_DIRECT is a bit of a weird
Andrew> curiosity, but I'm working on making these big-BIO code paths
Andrew> *the* way in which data gets to and from disk. It needs to be
Andrew> efficient ;)
*nod*
I'll try and poke at this again tonight. Will shoot you the patch
once I get the zoning evil sorted out.
--
Martin K. Petersen, Principal Linux Consultant, Linuxcare, Inc.
mkp@linuxcare.com, http://www.linuxcare.com/
SGI XFS for Linux Developer, http://oss.sgi.com/projects/xfs/
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: fadvise syscall?
2002-03-18 19:42 ` Martin K. Petersen
@ 2002-03-19 20:08 ` Eric W. Biederman
2002-03-19 23:38 ` Martin K. Petersen
0 siblings, 1 reply; 41+ messages in thread
From: Eric W. Biederman @ 2002-03-19 20:08 UTC (permalink / raw)
To: Martin K. Petersen
Cc: Andrew Morton, Joel Becker, Anton Altaparmakov, Jeff Garzik,
linux-kernel, linux-fsdevel
"Martin K. Petersen" <mkp@mkp.net> writes:
> >>>>> "Andrew" == Andrew Morton <akpm@zip.com.au> writes:
>
> Andrew> If that's really the only way in which we can solve this
> Andrew> problem, would it not be better to pass information up to the
> Andrew> higher layer, telling it when the BIO which is currently under
> Andrew> assembly cannot be grown further? Say,
> Andrew> blk_can_i_add_more_stuff_to_this_bio()?
Please let's extend BIOs and not break them up.
> We tried different approaches. One of them was to be able to signal
> to upper layers that your I/O was too big and please submit smaller
> chunks. Running with that, however, the I/O size converged toward
> small requests because you'd often start an I/O - say 4K - from a
> stripe boundary. And that would kill it right off.
>
> So unless the filesystem knows about stripe/device boundaries it's
> really hard to get the size signalling right. And then what happens
> when you stack LVM and MD?
>
> In the end, cloning the kiobuf from the above and adjusting
> offset/length in the children turned out to be the best approach.
Unless I am mistaken this interacts very badly with writing data
out to disk to free up memory, because you must allocate memory to
split the bio, and that is the last place you want to allocate memory
if you can avoid it.
It's been a while but I believe there was a similar thread about
splitting requests to disk and the idea was shot down for similar
reasons.
Eric
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: fadvise syscall?
2002-03-19 20:08 ` Eric W. Biederman
@ 2002-03-19 23:38 ` Martin K. Petersen
0 siblings, 0 replies; 41+ messages in thread
From: Martin K. Petersen @ 2002-03-19 23:38 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Andrew Morton, Joel Becker, Anton Altaparmakov, Jeff Garzik,
linux-kernel, linux-fsdevel
>>>>> "Eric" == Eric W Biederman <ebiederm@xmission.com> writes:
>> In the end, cloning the kiobuf from the above and adjusting
>> offset/length in the children turned out to be the best approach.
Eric> Unless I am mistaken this interacts very badly with the writing
Eric> data out to disk to free up memory, because you must allocate
Eric> memory to split the bio. Which is the last place you want to
Eric> allocate memory if you can avoid it.
Well. We have several places in the I/O path already where we need to
allocate memory in order to fulfill an I/O.
Think RAID1 where you need to turn one request from the filesystem
into several - one for each mirror. Or RAID5 where a write may cause
several reads/writes so you can mask and write the checksum out.
Also, with journaling filesystems you may very well be in a situation
where pushing a file to disk involves writing transactions to the log
before you can actually free up buffers.
In this case the clones come from the bio slab cache and are thus no
different from any other I/Os. Furthermore, the clones share the bulk
of their data with the parent, so the overhead isn't that big.
--
Martin K. Petersen, Principal Linux Consultant, Linuxcare, Inc.
mkp@linuxcare.com, http://www.linuxcare.com/
SGI XFS for Linux Developer, http://oss.sgi.com/projects/xfs/
^ permalink raw reply [flat|nested] 41+ messages in thread