linux-kernel.vger.kernel.org archive mirror
* Is there a "make hole" (truncate in middle) syscall?
@ 2003-12-04 20:32 Rob Landley
  2003-12-04 20:55 ` Måns Rullgård
                   ` (5 more replies)
  0 siblings, 6 replies; 59+ messages in thread
From: Rob Landley @ 2003-12-04 20:32 UTC (permalink / raw)
  To: linux-kernel

You can make a file with a hole by seeking past it and never writing to that 
bit, but is there any way to punch a hole in a file after the fact?  (I mean 
other than with lseek and write.  Having a sparse file as the result....)

What are the downsides of holes?  (How big do they have to be to actually save 
space, is there a performance penalty to having a file with 1000 4k holes in 
it, etc...)

Rob

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-04 20:32 Is there a "make hole" (truncate in middle) syscall? Rob Landley
@ 2003-12-04 20:55 ` Måns Rullgård
  2003-12-04 21:10 ` Szakacsits Szabolcs
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 59+ messages in thread
From: Måns Rullgård @ 2003-12-04 20:55 UTC (permalink / raw)
  To: linux-kernel

Rob Landley <rob@landley.net> writes:

> You can make a file with a hole by seeking past it and never writing to that 
> bit, but is there any way to punch a hole in a file after the fact?  (I mean 
> other than with lseek and write.  Having a sparse file as the result....)

I've never heard of one.

> What are the downsides of holes?  (How big do they have to be to
> actually save space, is there a performance penalty to having a file
> with 1000 4k holes in it, etc...)

A hole has to be at least the size of one block in the filesystem,
typically 4k, to save any space.  Regarding performance, I would
expect it to improve for reads.
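
Whether a hole actually saves space on a given filesystem can be checked
by comparing a file's apparent size with the space allocated to it.  What
follows is only a minimal sketch: make_holey_file() and file_usage() are
hypothetical helper names, and it assumes the Linux convention that
st_blocks counts 512-byte units.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create `path` holding a hole of hole_len bytes followed by one data
 * byte.  Returns 0 on success, -1 on error. */
int make_holey_file(const char *path, off_t hole_len)
{
	int fd;

	assert(hole_len >= 0);
	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd == -1)
		return -1;
	/* Seek past the region that is never written; it becomes the hole. */
	if (lseek(fd, hole_len, SEEK_SET) == (off_t)-1 ||
	    write(fd, "x", 1) != 1) {
		close(fd);
		return -1;
	}
	return close(fd);
}

/* Report the apparent size and the bytes actually allocated.
 * Assumes st_blocks is in 512-byte units, as on Linux.
 * Returns 0 on success, -1 on error. */
int file_usage(const char *path, off_t *apparent, off_t *allocated)
{
	struct stat st;

	if (stat(path, &st) == -1)
		return -1;
	*apparent = st.st_size;
	*allocated = (off_t)st.st_blocks * 512;
	return 0;
}
```

After make_holey_file("holey", 1 << 20), the apparent size is 1048577
bytes.  On a filesystem with sparse file support the allocated figure
should normally be far smaller; where holes are not supported the two
numbers end up roughly equal.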

-- 
Måns Rullgård
mru@kth.se



* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-04 20:32 Is there a "make hole" (truncate in middle) syscall? Rob Landley
  2003-12-04 20:55 ` Måns Rullgård
@ 2003-12-04 21:10 ` Szakacsits Szabolcs
  2003-12-05  0:02   ` Rob Landley
  2003-12-05 12:11   ` Måns Rullgård
  2003-12-04 21:48 ` Mike Fedyk
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 59+ messages in thread
From: Szakacsits Szabolcs @ 2003-12-04 21:10 UTC (permalink / raw)
  To: Rob Landley; +Cc: linux-kernel


On Thu, 4 Dec 2003, Rob Landley wrote:

> What are the downsides of holes?  [...] is there a performance penalty to
> having a file with 1000 4k holes in it, etc...)

Depends what you do, what fs you use. Using XFS XFS_IOC_GETBMAPX you might
get a huge improvement, see e.g. some numbers,

	http://marc.theaimsgroup.com/?l=reiserfs&m=105827549109079&w=2

The problem is, no general-purpose utility (cp, tar, cat, etc.) supports
it; you have to code your app accordingly.

	Szaka


* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-04 20:32 Is there a "make hole" (truncate in middle) syscall? Rob Landley
  2003-12-04 20:55 ` Måns Rullgård
  2003-12-04 21:10 ` Szakacsits Szabolcs
@ 2003-12-04 21:48 ` Mike Fedyk
  2003-12-04 23:59   ` Rob Landley
  2003-12-04 22:53 ` Peter Chubb
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 59+ messages in thread
From: Mike Fedyk @ 2003-12-04 21:48 UTC (permalink / raw)
  To: Rob Landley; +Cc: linux-kernel

On Thu, Dec 04, 2003 at 02:32:23PM -0600, Rob Landley wrote:
> You can make a file with a hole by seeking past it and never writing to that 
> bit, but is there any way to punch a hole in a file after the fact?  (I mean 
> other than with lseek and write.  Having a sparse file as the result....)
> 

No, Linux doesn't have this feature.

> What are the downsides of holes?  (How big do they have to be to actually save 
> space, is there a performance penalty to having a file with 1000 4k holes in 
> it, etc...)

When you copy them, you need to use tools that know about sparse files and
how to deal with them.  Also, you will only save space on block aligned
contiguous zeros at least the length of one block.
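
The alignment requirement can be made concrete with a small sketch that
counts how many whole, block-aligned, all-zero blocks a buffer contains.
The name count_hole_candidates() and the idea of passing the block size
as a parameter are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Nonzero if the len bytes at p are all zero. */
static int all_zero(const unsigned char *p, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if (p[i] != 0)
			return 0;
	return 1;
}

/* Count the blocks of `data` that could be stored as holes: each must
 * start on a multiple of block_size and be zero for its whole length.
 * A trailing partial block is ignored. */
size_t count_hole_candidates(const unsigned char *data, size_t len,
			     size_t block_size)
{
	size_t off, n = 0;

	assert(block_size > 0);
	for (off = 0; off + block_size <= len; off += block_size)
		if (all_zero(data + off, block_size))
			n++;
	return n;
}
```

Note that a run of zeros two blocks long that starts in the middle of a
block covers only one whole aligned block, so only one block's worth of
space can be saved there.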


* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-05  0:02   ` Rob Landley
@ 2003-12-04 22:33     ` Szakacsits Szabolcs
  2003-12-05 11:22     ` Helge Hafting
  1 sibling, 0 replies; 59+ messages in thread
From: Szakacsits Szabolcs @ 2003-12-04 22:33 UTC (permalink / raw)
  To: Rob Landley; +Cc: linux-kernel


On Thu, 4 Dec 2003, Rob Landley wrote:
> > Depends what you do, what fs you use. Using XFS XFS_IOC_GETBMAPX you might
> > get a huge improvement, see e.g. some numbers,
> >
> > 	http://marc.theaimsgroup.com/?l=reiserfs&m=105827549109079&w=2
> >
> > The problem is, 0 general purpose (like cp, tar, cat, etc) util supports
> > it, you have to code your app accordingly.
> 
> Okay, I'll bite.  How would one go about adding hole support to cat? :)

As I wrote above, for XFS use XFS_IOC_GETBMAPX and read only the blocks in
use from the disk and dump a preallocated buffer filled with zeros for the
holes.

For other filesystems you could use FIBMAP, but it's so inefficient and
limited that it's probably not worth doing: in general you would end up
being slower.
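
For illustration, a FIBMAP-based hole scan might look roughly like the
sketch below.  count_fibmap_holes() is a hypothetical name.  The sketch
assumes Linux's <linux/fs.h>, and FIBMAP needs CAP_SYS_RAWIO (in practice
root), so for ordinary users the function simply fails.  The one ioctl per
block is the inefficiency in question.

```c
#include <assert.h>
#include <linux/fs.h>	/* FIBMAP, FIGETBSZ */
#include <sys/ioctl.h>
#include <sys/stat.h>

/* Count the unmapped (hole) blocks of the file open on fd.
 * Returns the count, or -1 on error (including lack of privilege or a
 * filesystem without FIBMAP support). */
long count_fibmap_holes(int fd)
{
	struct stat st;
	int bsz, blk;
	long nblocks, i, holes = 0;

	if (fstat(fd, &st) == -1 || ioctl(fd, FIGETBSZ, &bsz) == -1)
		return -1;
	assert(bsz > 0);
	nblocks = (long)((st.st_size + bsz - 1) / bsz);
	for (i = 0; i < nblocks; i++) {
		blk = (int)i;	/* in: logical block, out: physical block */
		if (ioctl(fd, FIBMAP, &blk) == -1)
			return -1;
		if (blk == 0)	/* 0 means no block is mapped here */
			holes++;
	}
	return holes;
}
```

Because the block number is passed as an int, this interface is also
limited in the file sizes it can describe.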

	Szaka


* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-04 20:32 Is there a "make hole" (truncate in middle) syscall? Rob Landley
                   ` (2 preceding siblings ...)
  2003-12-04 21:48 ` Mike Fedyk
@ 2003-12-04 22:53 ` Peter Chubb
  2003-12-05  1:04   ` Philippe Troin
  2003-12-04 23:23 ` Andy Isaacson
  2003-12-11  5:13 ` Is there a "make hole" (truncate in middle) syscall? Hua Zhong
  5 siblings, 1 reply; 59+ messages in thread
From: Peter Chubb @ 2003-12-04 22:53 UTC (permalink / raw)
  To: rob; +Cc: linux-kernel

>>>>> "Rob" == Rob Landley <rob@landley.net> writes:

Rob> You can make a file with a hole by seeking past it and never
Rob> writing to that bit, but is there any way to punch a hole in a
Rob> file after the fact?  (I mean other than with lseek and write.  Having
Rob> a sparse file as the result....)


SVr4 has fcntl(fd, F_FREESP, flock) that frees the space covered by
the struct flock in the file.  Linux doesn't have this, at least in
the baseline kernels.

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-04 20:32 Is there a "make hole" (truncate in middle) syscall? Rob Landley
                   ` (3 preceding siblings ...)
  2003-12-04 22:53 ` Peter Chubb
@ 2003-12-04 23:23 ` Andy Isaacson
  2003-12-04 23:42   ` Szakacsits Szabolcs
                     ` (2 more replies)
  2003-12-11  5:13 ` Is there a "make hole" (truncate in middle) syscall? Hua Zhong
  5 siblings, 3 replies; 59+ messages in thread
From: Andy Isaacson @ 2003-12-04 23:23 UTC (permalink / raw)
  To: Rob Landley; +Cc: linux-kernel

On Thu, Dec 04, 2003 at 02:32:23PM -0600, Rob Landley wrote:
> You can make a file with a hole by seeking past it and never writing to that 
> bit, but is there any way to punch a hole in a file after the fact?  (I mean 
> other than with lseek and write.  Having a sparse file as the result....)

No, the only way to add a hole to a file is ftruncate(), lseek(),
write() (at least, that's the case at the UNIX API level).

> What are the downsides of holes?  (How big do they have to be to
> actually save space, is there a performance penalty to having a file
> with 1000 4k holes in it, etc...)

It's filesystem-dependent; some filesystems don't implement sparse
files.  The lower bound is one block; on extents-based filesystems like
XFS it might be bigger.  (If you've got 1GB of data, then a 1MB block of
zeros, then another GB of data, you're probably better off allocating a
single 2GB extent rather than two smaller extents with a hole.)

There's no inherent downside to holey files; in fact they can be a
straight-up performance win -- that's a block that doesn't need to be
read from disk, just hand the user a COW pointer to your zero page.  And
if you're lucky and the preceding and following blocks are allocated
adjacent on disk, you can do it all as a single streaming IO.

That said, having holes might make some pessimal behaviors more likely.

I'm curious -- does NTFS implement sparse files?  Does the Win32 API
provide any way to manipulate them?  Does the NT kernel have any sparse
file handling?

-andy

(This post is an exercise in "post possibly-inaccurate information in an
attempt to elicit corrections from people who know better", so take what
I say with a grain of salt.)


* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-04 23:23 ` Andy Isaacson
@ 2003-12-04 23:42   ` Szakacsits Szabolcs
  2003-12-05  2:03     ` Mike Fedyk
  2003-12-05 11:22   ` Anton Altaparmakov
  2003-12-05 21:00   ` sparse file performance (was Re: Is there a "make hole" (truncate in middle) syscall?) Andy Isaacson
  2 siblings, 1 reply; 59+ messages in thread
From: Szakacsits Szabolcs @ 2003-12-04 23:42 UTC (permalink / raw)
  To: Andy Isaacson; +Cc: Rob Landley, linux-kernel


On Thu, 4 Dec 2003, Andy Isaacson wrote:

> I'm curious -- does NTFS implement sparse files?  

Since Win2000 (NTFS 3.0+). Also many recently discussed features like
file/directory/volume level compression/encryption, undelete, power of 2
block sizes between 512-64kB, etc.

> Does the Win32 API provide any way to manipulate them?  

More than what Linux provides in general, e.g. "make hole" is also
possible.

	Szaka


* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-04 21:48 ` Mike Fedyk
@ 2003-12-04 23:59   ` Rob Landley
  2003-12-05 22:42     ` Olaf Titz
  0 siblings, 1 reply; 59+ messages in thread
From: Rob Landley @ 2003-12-04 23:59 UTC (permalink / raw)
  To: Mike Fedyk; +Cc: linux-kernel

On Thursday 04 December 2003 15:48, Mike Fedyk wrote:
> On Thu, Dec 04, 2003 at 02:32:23PM -0600, Rob Landley wrote:
> > You can make a file with a hole by seeking past it and never writing to
> > that bit, but is there any way to punch a hole in a file after the fact? 
> > (I mean other than with lseek and write.  Having a sparse file as the
> > result....)
>
> No, Linux doesn't have this feature.
>
> > What are the downsides of holes?  (How big do they have to be to actually
> > save space, is there a performance penalty to having a file with 1000 4k
> > holes in it, etc...)
>
> When you copy them, you need to use tools that know about sparse files and
> how to deal with them.  Also, you will only save space on block aligned
> contiguous zeros at least the length of one block.

I knew that bit.

I was thinking of making a toy that would run periodically against a 
seldom-changed filesystem, find runs of zeroes of a certain minimum size, and 
turn 'em into holes.  The fragmentation might not be worth it, though...

Rob


* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-04 21:10 ` Szakacsits Szabolcs
@ 2003-12-05  0:02   ` Rob Landley
  2003-12-04 22:33     ` Szakacsits Szabolcs
  2003-12-05 11:22     ` Helge Hafting
  2003-12-05 12:11   ` Måns Rullgård
  1 sibling, 2 replies; 59+ messages in thread
From: Rob Landley @ 2003-12-05  0:02 UTC (permalink / raw)
  To: Szakacsits Szabolcs; +Cc: linux-kernel

On Thursday 04 December 2003 15:10, Szakacsits Szabolcs wrote:
> On Thu, 4 Dec 2003, Rob Landley wrote:
> > What are the downsides of holes?  [...] is there a performance penalty to
> > having a file with 1000 4k holes in it, etc...)
>
> Depends what you do, what fs you use. Using XFS XFS_IOC_GETBMAPX you might
> get a huge improvement, see e.g. some numbers,
>
> 	http://marc.theaimsgroup.com/?l=reiserfs&m=105827549109079&w=2
>
> The problem is, 0 general purpose (like cp, tar, cat, etc) util supports
> it, you have to code your app accordingly.

Okay, I'll bite.  How would one go about adding hole support to cat? :)

Adding hole support to busybox's cp and tar is on my to-do list.  (Pretty far 
down on the list, but still...)

> 	Szaka

Rob


* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-04 22:53 ` Peter Chubb
@ 2003-12-05  1:04   ` Philippe Troin
  2003-12-05  2:39     ` Peter Chubb
  2003-12-08  4:03     ` bill davidsen
  0 siblings, 2 replies; 59+ messages in thread
From: Philippe Troin @ 2003-12-05  1:04 UTC (permalink / raw)
  To: Peter Chubb; +Cc: rob, linux-kernel

Peter Chubb <peter@chubb.wattle.id.au> writes:

> >>>>> "Rob" == Rob Landley <rob@landley.net> writes:
> 
> Rob> You can make a file with a hole by seeking past it and never
> Rob> writing to that bit, but is there any way to punch a hole in a
> Rob> file after the fact?  (I mean other than with lseek and write.  Having
> Rob> a sparse file as the result....)
> 
> SVr4 has fcntl(fd, F_FREESP, flock) that frees the space covered by
> the struct flock in the file.  Linux doesn't have this, at least in
> the baseline kernels.

However most SVr4 (at least Solaris and HP-UX) only implement FREESP
when the freed space is at the file's tail. In other words, FREESP can
only be used to implement ftruncate().

Phil.


* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-04 23:42   ` Szakacsits Szabolcs
@ 2003-12-05  2:03     ` Mike Fedyk
  2003-12-05  7:09       ` Ville Herva
  0 siblings, 1 reply; 59+ messages in thread
From: Mike Fedyk @ 2003-12-05  2:03 UTC (permalink / raw)
  To: Szakacsits Szabolcs; +Cc: Andy Isaacson, Rob Landley, linux-kernel

On Fri, Dec 05, 2003 at 01:42:13AM +0200, Szakacsits Szabolcs wrote:
> 
> On Thu, 4 Dec 2003, Andy Isaacson wrote:
> 
> > I'm curious -- does NTFS implement sparse files?  
> 
> Since Win2000 (NTFS 3.0+). Also many recently discussed features like
> file/directory/volume level compression/encryption, undelete, power of 2
> block sizes between 512-64kB, etc.

That gives us some possibilities for 2.7:
 o undelete
 
 Ext2 has undelete support, but that information is overwritten on unlink by
 ext3, so  ext3 won't work with the undelete utilities.  How do the other 
 filesystems fare in this regard?
 
 o per file compression
 
 Ext2/3 has a flag for it, but support hasn't been implemented.
 
 o per file encryption (can use a user space helper for policy)
 
 Reiser4 plans to have a plugin that will support this.
 What about the others?
 
 o make hole support
 
And my personal favorite:
 o pagecache coherent defrag (on live filesystems)
 
Andrew Morton wrote a patch for this a while back but since there were no
userspace utilities, and it was ext3 specific, it wasn't merged, and nothing
AFAIK has happened with it since.


* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-05  1:04   ` Philippe Troin
@ 2003-12-05  2:39     ` Peter Chubb
  2003-12-08  4:03     ` bill davidsen
  1 sibling, 0 replies; 59+ messages in thread
From: Peter Chubb @ 2003-12-05  2:39 UTC (permalink / raw)
  To: Philippe Troin; +Cc: Peter Chubb, rob, linux-kernel

>>>>> "Philippe" == Philippe Troin <phil@fifi.org> writes:

Philippe> Peter Chubb <peter@chubb.wattle.id.au> writes:
>> >>>>> "Rob" == Rob Landley <rob@landley.net> writes:
>> 
Rob> You can make a file with a hole by seeking past it and never
Rob> writing to that bit, but is there any way to punch a hole in a
Rob> file after the fact?  (I mean other than with lseek and write.  Having
Rob> a sparse file as the result....)
>> SVr4 has fcntl(fd, F_FREESP, flock) that frees the space covered by
>> the struct flock in the file.  Linux doesn't have this, at least in
>> the baseline kernels.

Philippe> However most SVr4 (at least Solaris and HP-UX) only
Philippe> implement FREESP when the freed space is at the file's
Philippe> tail. In other words, FREESP can only be used to implement
Philippe> ftruncate().

The original SVr4 worked in the middle of files too.



* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-05  2:03     ` Mike Fedyk
@ 2003-12-05  7:09       ` Ville Herva
  0 siblings, 0 replies; 59+ messages in thread
From: Ville Herva @ 2003-12-05  7:09 UTC (permalink / raw)
  To: Mike Fedyk; +Cc: linux-kernel

On Thu, Dec 04, 2003 at 06:03:12PM -0800, you [Mike Fedyk] wrote:
>  
>  o per file compression
>  
>  Ext2/3 has a flag for it, but support hasn't been implemented.

It has (for 2.0, 2.2, 2.4 ext2) - it just was never merged into baseline.

2.4:
http://sourceforge.net/projects/e2compr/
2.2:
http://his.luky.org/ftp/mirrors/e2compr/www.netspace.net.au/%257Ereiter/e2compr/

FWIW, I have a 2.2 server keeping >20 workstations' daily backups on
compressed ext2:

/dev/md2              441G  154G  287G  35% /backup-versioned
/dev/md4              144G  141G  3.5G  98% /backup-versioned2

and it's really solid.
  
>  o make hole support

According to Andreas Dilger, Peter Braam has implemented this (sys_punch):
http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=utf-8&threadm=linux.kernel.200106291838.f5TIcbAM015809%40webber.adilger.int&rnum=2&prev=/groups%3Fhl%3Den%26lr%3D%26ie%3DUTF-8%26oe%3Dutf-8%26q%3Dhole%2Bpunch%2Bgroup%253Amlist.linux.kernel%26btnG%3DGoogle%2BSearch
  


-- v --

v@iki.fi


* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-04 23:23 ` Andy Isaacson
  2003-12-04 23:42   ` Szakacsits Szabolcs
@ 2003-12-05 11:22   ` Anton Altaparmakov
  2003-12-05 11:44     ` viro
  2003-12-05 21:00   ` sparse file performance (was Re: Is there a "make hole" (truncate in middle) syscall?) Andy Isaacson
  2 siblings, 1 reply; 59+ messages in thread
From: Anton Altaparmakov @ 2003-12-05 11:22 UTC (permalink / raw)
  To: Andy Isaacson; +Cc: Rob Landley, linux-kernel

On Thu, 4 Dec 2003, Andy Isaacson wrote:
> On Thu, Dec 04, 2003 at 02:32:23PM -0600, Rob Landley wrote:
> I'm curious -- does NTFS implement sparse files?  Does the Win32 API
> provide any way to manipulate them?  Does the NT kernel have any sparse
> file handling?

Yes it does.  The new NTFS Linux driver has full support for sparse files
as does Windows of course.

Windows does provide a function which is just "make hole".  It takes
starting offset and length (or was it ending offset instead of length,
can't remember) and makes this sparse (obviously aligning to cluster
boundaries, etc).

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/


* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-05  0:02   ` Rob Landley
  2003-12-04 22:33     ` Szakacsits Szabolcs
@ 2003-12-05 11:22     ` Helge Hafting
  1 sibling, 0 replies; 59+ messages in thread
From: Helge Hafting @ 2003-12-05 11:22 UTC (permalink / raw)
  To: rob; +Cc: Szakacsits Szabolcs, linux-kernel

Rob Landley wrote:
> On Thursday 04 December 2003 15:10, Szakacsits Szabolcs wrote:
> 
>>On Thu, 4 Dec 2003, Rob Landley wrote:
>>
>>>What are the downsides of holes?  [...] is there a performance penalty to
>>>having a file with 1000 4k holes in it, etc...)
>>
>>Depends what you do, what fs you use. Using XFS XFS_IOC_GETBMAPX you might
>>get a huge improvement, see e.g. some numbers,
>>
>>	http://marc.theaimsgroup.com/?l=reiserfs&m=105827549109079&w=2
>>
>>The problem is, 0 general purpose (like cp, tar, cat, etc) util supports
>>it, you have to code your app accordingly.
> 
> 
> Okay, I'll bite.  How would one go about adding hole support to cat? :)

Easy.  Look at the data you're writing. Don't ever write zeroes, seek
ahead in the file being written instead.  The filesystem will create a hole
if possible.  You may want to optimize this a bit by not seeking past
very small runs of zeroes.

Of course cat is sometimes used to write to things that aren't regular
files, so make sure seek is supported and fall back to ordinary writing
when it isn't.
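
A rough sketch of that approach follows.  The names sparse_copy(),
can_make_holes() and CHUNK are made up for illustration, and the 4096-byte
chunk size is an assumption rather than something queried from the
filesystem.  It also assumes the output file starts out empty; seeking
over existing data would leave that data in place instead of zeros.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK 4096	/* assumed block size; st_blksize may be better */

/* Nonzero if the first n bytes of p (n <= CHUNK) are all zero. */
static int is_zero(const char *p, size_t n)
{
	static const char zeros[CHUNK];

	assert(n <= CHUNK);
	return memcmp(p, zeros, n) == 0;
}

/* Call write() until all n bytes are out.  Returns -1 on error. */
static int write_all(int fd, const char *p, size_t n)
{
	while (n > 0) {
		ssize_t w = write(fd, p, n);

		if (w == -1)
			return -1;
		p += w;
		n -= (size_t)w;
	}
	return 0;
}

/* Nonzero if zero runs may be skipped with lseek() on fd: it must be a
 * regular file and not opened with O_APPEND, which ignores the offset. */
static int can_make_holes(int fd)
{
	struct stat st;
	int flags = fcntl(fd, F_GETFL);

	return flags != -1 && !(flags & O_APPEND) &&
	       fstat(fd, &st) == 0 && S_ISREG(st.st_mode);
}

/* Copy in to out, seeking over all-zero chunks instead of writing them.
 * Returns 0 on success, -1 on error. */
int sparse_copy(int in, int out)
{
	char buf[CHUNK];
	ssize_t n;
	int holes_ok = can_make_holes(out);
	int ends_in_hole = 0;

	while ((n = read(in, buf, sizeof(buf))) > 0) {
		if (holes_ok && is_zero(buf, (size_t)n)) {
			if (lseek(out, n, SEEK_CUR) == (off_t)-1)
				return -1;
			ends_in_hole = 1;
		} else {
			if (write_all(out, buf, (size_t)n) == -1)
				return -1;
			ends_in_hole = 0;
		}
	}
	if (n == -1)
		return -1;
	/* Seeking alone does not extend a file, so fix up the length. */
	if (ends_in_hole) {
		off_t end = lseek(out, 0, SEEK_CUR);

		if (end == (off_t)-1 || ftruncate(out, end) == -1)
			return -1;
	}
	return 0;
}
```

Checking for a regular file is a more conservative test than "seek
works"; some seekable devices would accept the lseek() but not the final
ftruncate().  Pipes, terminals and append-mode output fall back to plain
writes.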

Helge Hafting



* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-05 11:22   ` Anton Altaparmakov
@ 2003-12-05 11:44     ` viro
  2003-12-05 14:27       ` Anton Altaparmakov
  0 siblings, 1 reply; 59+ messages in thread
From: viro @ 2003-12-05 11:44 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: Andy Isaacson, Rob Landley, linux-kernel

On Fri, Dec 05, 2003 at 11:22:01AM +0000, Anton Altaparmakov wrote:
> On Thu, 4 Dec 2003, Andy Isaacson wrote:
> > On Thu, Dec 04, 2003 at 02:32:23PM -0600, Rob Landley wrote:
> > I'm curious -- does NTFS implement sparse files?  Does the Win32 API
> > provide any way to manipulate them?  Does the NT kernel have any sparse
> > file handling?
> 
> Yes it does.  The new NTFS Linux driver has full support for sparse files
> as does Windows of course.
> 
> Windows does provide a function which is just "make hole".  It takes
> starting offset and length (or was it ending offset instead of length,
> can't remember) and makes this sparse (obviously aligning to cluster
> boundaries, etc).

Have fun getting it to play nice with mmap()...


* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-04 21:10 ` Szakacsits Szabolcs
  2003-12-05  0:02   ` Rob Landley
@ 2003-12-05 12:11   ` Måns Rullgård
  2003-12-05 22:41     ` Mike Fedyk
  2003-12-05 23:25     ` Szakacsits Szabolcs
  1 sibling, 2 replies; 59+ messages in thread
From: Måns Rullgård @ 2003-12-05 12:11 UTC (permalink / raw)
  To: linux-kernel

Szakacsits Szabolcs <szaka@sienet.hu> writes:

>> What are the downsides of holes?  [...] is there a performance penalty to
>> having a file with 1000 4k holes in it, etc...)
>
> Depends what you do, what fs you use. Using XFS XFS_IOC_GETBMAPX you might
> get a huge improvement, see e.g. some numbers,
>
> 	http://marc.theaimsgroup.com/?l=reiserfs&m=105827549109079&w=2
>
> The problem is, 0 general purpose (like cp, tar, cat, etc) util supports
> it, you have to code your app accordingly.

I found this paragraph in the man page of GNU cp:

       --sparse=WHEN
              A `sparse file' contains  `holes'  -  sequences  of
              zero  bytes  that  do  not occupy any physical disk
              blocks; the  `read'  system  call  reads  these  as
              zeroes.  This can both save considerable disk space
              and increase speed, since many binary files contain
              lots  of  consecutive  zero  bytes.  By default, cp
              detects holes in input source  files  via  a  crude
              heuristic  and  makes the corresponding output file
              sparse as well.

              The WHEN value can be one of the following:

              auto   The default behavior:  the  output  file  is
                     sparse if the input file is sparse.

              always Always make the output file sparse.  This is
                     useful when the  input  file  resides  on  a
                     filesystem  that  does  not  support  sparse
                     files, but the output file is on a  filesys-
                     tem that does.

              never  Never  make  the output file sparse.  If you
                     find an application for this option, let  us
                     know.


-- 
Måns Rullgård
mru@kth.se



* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-05 11:44     ` viro
@ 2003-12-05 14:27       ` Anton Altaparmakov
  0 siblings, 0 replies; 59+ messages in thread
From: Anton Altaparmakov @ 2003-12-05 14:27 UTC (permalink / raw)
  To: viro; +Cc: Andy Isaacson, Rob Landley, LKML

On Fri, 2003-12-05 at 11:44, viro@parcelfarce.linux.theplanet.co.uk
wrote:
> On Fri, Dec 05, 2003 at 11:22:01AM +0000, Anton Altaparmakov wrote:
> > On Thu, 4 Dec 2003, Andy Isaacson wrote:
> > > On Thu, Dec 04, 2003 at 02:32:23PM -0600, Rob Landley wrote:
> > > I'm curious -- does NTFS implement sparse files?  Does the Win32 API
> > > provide any way to manipulate them?  Does the NT kernel have any sparse
> > > file handling?
> > 
> > Yes it does.  The new NTFS Linux driver has full support for sparse files
> > as does Windows of course.
> > 
> > Windows does provide a function which is just "make hole".  It takes
> > starting offset and length (or was it ending offset instead of length,
> > can't remember) and makes this sparse (obviously aligning to cluster
> > boundaries, etc).
> 
> Have fun getting it to play nice with mmap()...

I have no intention to provide such "make hole" functionality in the
Linux kernel NTFS driver so I don't need to...  (-;

Cheers,

	Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ &
http://www-stu.christs.cam.ac.uk/~aia21/



* sparse file performance (was Re: Is there a "make hole" (truncate in middle) syscall?)
  2003-12-04 23:23 ` Andy Isaacson
  2003-12-04 23:42   ` Szakacsits Szabolcs
  2003-12-05 11:22   ` Anton Altaparmakov
@ 2003-12-05 21:00   ` Andy Isaacson
  2003-12-05 21:12     ` Linus Torvalds
  2 siblings, 1 reply; 59+ messages in thread
From: Andy Isaacson @ 2003-12-05 21:00 UTC (permalink / raw)
  To: linux-kernel

On Thu, Dec 04, 2003 at 05:23:48PM -0600, Andy Isaacson wrote:
> On Thu, Dec 04, 2003 at 02:32:23PM -0600, Rob Landley wrote:
> > What are the downsides of holes?  (How big do they have to be to
> > actually save space, is there a performance penalty to having a file
> > with 1000 4k holes in it, etc...)
> 
> It's filesystem-dependent; some filesystems don't implement sparse
> files.  The lower bound is one block; on extents-based filesystems like
> XFS it might be bigger.  (If you've got 1GB of data, then a 1MB block of
> zeros, then another GB of data, you're probably better off allocating a
> single 2GB extent rather than two smaller extents with a hole.)
> 
> There's no inherent downside to holey files; in fact they can be a
> straight-up performance win -- that's a block that doesn't need to be
> read from disk, just hand the user a COW pointer to your zero page.  And
> if you're lucky and the preceding and following blocks are allocated
> adjacent on disk, you can do it all as a single streaming IO.

I got curious enough to run some tests, and was surprised at the results.
My machine (Athlon XP 2400+, 2030 MHz, 512 MB, KT400, 2.4.22) can read
out of buffer cache at 234 MB/s, and off of its IDE disk at 40 MB/s.
I'd assumed that read(2)ing a holey file would go faster than reading
out of buffer cache; in theory you could do it completely in L1 cache
(with a 4KB buffer, it's just a ton of syscalls, some page table
manipulation, and a bunch of memcpy() out of a single zero page).  But
it turns out that reading a hole is *slower* than reading data from
buffer cache, just 195 MB/s.

200 MB file       234 MB/s  (with warm caches)
1 GB file          40 MB/s  (exceeds physical memory)
1 GB sparse file  195 MB/s

the 1GB sparse file was created with "dd if=file of=1gsparse bs=1M
count=1 seek=1023"; the filesystem is ext3.

Here's 'vmstat 5' while reading the 200MB file in a loop:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa

 1  0  50968   4468   4872 410424    0    0     0     9  102    46 62 38  0  0
 1  0  50968   4448   4892 410424    0    0     0     6  101    41 62 38  0  0
 1  0  50968   4428   4912 410424    0    0     0     6  101    40 62 38  0  0
 1  0  50968   4404   4936 410424    0    0     0     6  101    37 61 39  0  0
 1  0  50968   4384   4956 410424    0    0     0     8  105   117 60 40  0  0
 1  0  50968   4484   4984 410296    0    0     0     9  103    81 62 38  0  0

here's 'vmstat 5' while reading the 1GB sparse file in a loop:

 1  0  55448   4460   2464 417320    0    0   217     6  144  3117 45 49  6  0
 1  0  55448   4444   2480 417304    0    0   219     6  204  3237 50 44  6  0
 1  0  55448   4444   2488 417288    0    0   218     9  181  3200 49 45  6  0
 1  0  55460   4456   2468 417140   30    0   249     6  182  3193 46 48  6  0
 1  0  55460   4396   2484 417300    0    2   220    12  140  3084 46 48  6  0
 1  0  55460   4356   2464 417360    0    0   216     2  145  3101 47 48  6  0

The code is simply doing

        while((n = read(fd, buf, sizeof(buf))) > 0) {
                c += n;
                for(i=0; i < n; i++) {
                        hist[(unsigned char)buf[i]]++;
                }
        }

compiled with gcc 3.3.2 -O2.

Code appended.

-andy

#include <stdio.h>
#include <stdarg.h>
#include <stdlib.h>

#include <unistd.h>
#include <errno.h>
#include <string.h>
#include <sys/time.h>
#include <fcntl.h>
#include <ctype.h>

static void
die(char *fmt, ...)
{
	va_list a;
	va_start(a, fmt);
	vfprintf(stderr, fmt, a);
	va_end(a);
	exit(1);
}

double tod(void)
{
	static struct timeval tv1;
	struct timeval tv2;
	double r;

	if(tv1.tv_sec == 0) {
		gettimeofday(&tv1, 0);
		return 0;
	}
	gettimeofday(&tv2, 0);
	r = (tv2.tv_sec - tv1.tv_sec) + (tv2.tv_usec - tv1.tv_usec) / 1e6;
	memcpy(&tv1, &tv2, sizeof(tv1));
	return r;
}

int main(int argc, char **argv)
{
	char buf[4096];
	int fd, i, n, m;
	long long c = 0;
	double t1, t2;
	int hist[256] = { 0 };
	unsigned char *p = (unsigned char *)buf;	/* index hist[] with 0..255 */

	if(argc != 2) die("usage: %s file\n", argv[0]);

	if((fd = open(argv[1], O_RDONLY)) == -1)
		die("%s: %s\n", argv[1], strerror(errno));

	t1 = tod();
	while((n = read(fd, buf, sizeof(buf))) > 0) {
		c += n;
		for(i=0; i < n; i++) {
			hist[p[i]]++;
		}
	}
	t2 = tod();
	if(n == -1) die("read: %s\n", strerror(errno));

	m = 0;
	for(i=1; i<256; i++)
		if(hist[i] > hist[m]) m = i;
	printf("%lld characters read, mode at %d '%c' with %d\n",
			c, m, isprint(m) ? m : '?', hist[m]);
	printf("%f seconds, %f MB/sec\n", t2-t1, c / (t2-t1) / 1e6);
	return 0;
}


* Re: sparse file performance (was Re: Is there a "make hole" (truncate in middle) syscall?)
  2003-12-05 21:00   ` sparse file performance (was Re: Is there a "make hole" (truncate in middle) syscall?) Andy Isaacson
@ 2003-12-05 21:12     ` Linus Torvalds
  2003-12-08 20:43       ` Andy Isaacson
  0 siblings, 1 reply; 59+ messages in thread
From: Linus Torvalds @ 2003-12-05 21:12 UTC (permalink / raw)
  To: Andy Isaacson; +Cc: linux-kernel



On Fri, 5 Dec 2003, Andy Isaacson wrote:
>
> I got curious enough to run some tests, and was surprised at the results.
> My machine (Athlon XP 2400+, 2030 MHz, 512 MB, KT400, 2.4.22) can read
> out of buffer cache at 234 MB/s, and off of its IDE disk at 40 MB/s.
> I'd assumed that read(2)ing a holey file would go faster than reading
> out of buffer cache; in theory you could do it completely in L1 cache
> (with a 4KB buffer, it's just a ton of syscalls, some page table
> manipulation, and a bunch of memcpy() out of a single zero page).  But
> it turns out that reading a hole is *slower* than reading data from
> buffer cache, just 195 MB/s.

That's because we actually instantiate the page cache pages even for
holes. We have to, or we'd have to special-case them no end (and quite
frankly, "hole read performance" is not something worth special casing,
since it just isn't done under any real load).

So reading a hole implies creating the page cache entry and _clearing_ it.
For each page. So while you may read from the L1, you also have to do
writeback of the _previous_ pages from the L1 into the L2 and eventually
out to memory.

(And eventually the VM also has to get rid of the pages etc, of course).

			Linus

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-05 12:11   ` Måns Rullgård
@ 2003-12-05 22:41     ` Mike Fedyk
  2003-12-05 23:25       ` Måns Rullgård
  2003-12-05 23:33       ` Szakacsits Szabolcs
  2003-12-05 23:25     ` Szakacsits Szabolcs
  1 sibling, 2 replies; 59+ messages in thread
From: Mike Fedyk @ 2003-12-05 22:41 UTC (permalink / raw)
  To: Måns Rullgård; +Cc: linux-kernel

On Fri, Dec 05, 2003 at 01:11:03PM +0100, Måns Rullgård wrote:
> I found this paragraph in the man page of GNU cp:
> 
>        --sparse=WHEN
>               always Always make the output file sparse.  This is
>                      useful when the  input  file  resides  on  a
>                      filesystem  that  does  not  support  sparse
>                      files, but the output file is on a  filesys-
>                      tem that does.

So with this, you can create sparse files for an entire set of files by just
cping them? :)

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-04 23:59   ` Rob Landley
@ 2003-12-05 22:42     ` Olaf Titz
  0 siblings, 0 replies; 59+ messages in thread
From: Olaf Titz @ 2003-12-05 22:42 UTC (permalink / raw)
  To: linux-kernel

> I was thinking of making a toy that would run periodically against a
> seldom-changed filesystem, find runs of zeroes of a certain minimum
> size, and turn 'em into holes. The fragmentation might not be worth
> it, though...

"cp" does that already. Hunt for sparse files, copy them and move the
new file to the old location.
There used to be an installer (Slackware, I think) back in very old
days which did this to every binary after untarring...

Olaf


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-05 12:11   ` Måns Rullgård
  2003-12-05 22:41     ` Mike Fedyk
@ 2003-12-05 23:25     ` Szakacsits Szabolcs
  1 sibling, 0 replies; 59+ messages in thread
From: Szakacsits Szabolcs @ 2003-12-05 23:25 UTC (permalink / raw)
  To: Måns Rullgård; +Cc: linux-kernel


On Fri, 5 Dec 2003, Måns Rullgård wrote:
> Szakacsits Szabolcs <szaka@sienet.hu> writes:
> 
> >> What are the downsides of holes?  [...] is there a performance penalty to
> >> having a file with 1000 4k holes in it, etc...)
> >
> > Depends what you do, what fs you use. Using XFS XFS_IOC_GETBMAPX you might
> > get a huge improvement, see e.g. some numbers,
> >
> > 	http://marc.theaimsgroup.com/?l=reiserfs&m=105827549109079&w=2
> >
> > The problem is that no general-purpose utility (cp, tar, cat, etc.)
> > supports it; you have to code your app accordingly.
> 
> I found this paragraph in the man page of GNU cp:
> 
>        --sparse=WHEN

I meant using XFS_IOC_GETBMAPX. The sparse support in tar and cp is
extremely inefficient (Helge also misunderstood what I meant). Only XFS
provides support for doing it the fastest way, without reading and
analysing the data.

	Szaka

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-05 22:41     ` Mike Fedyk
@ 2003-12-05 23:25       ` Måns Rullgård
  2003-12-05 23:33       ` Szakacsits Szabolcs
  1 sibling, 0 replies; 59+ messages in thread
From: Måns Rullgård @ 2003-12-05 23:25 UTC (permalink / raw)
  To: linux-kernel

Mike Fedyk <mfedyk@matchmail.com> writes:

>> I found this paragraph in the man page of GNU cp:
>> 
>>        --sparse=WHEN
>>               always Always make the output file sparse.  This is
>>                      useful when the  input  file  resides  on  a
>>                      filesystem  that  does  not  support  sparse
>>                      files, but the output file is on a  filesys-
>>                      tem that does.
>
> So with this, you can create sparse files for an entire set of files
> by just cping them? :)

Yes.  It won't query the system for where any potential holes in the
source files might be, though, so if there are large holes, cp will
spend unnecessary time crunching through them.
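
For illustration, the zero-scanning approach can be sketched roughly as
below.  This is a minimal sketch and not GNU cp's actual code:
sparse_copy(), all_zero() and the 4096-byte CHUNK are made-up names and
values, and the output is assumed to be an empty, seekable file.  Every
input byte is still read and inspected, which is the cost described above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK 4096	/* assumed block size; hypothetical value */

/* Return 1 if the first len bytes of buf are all zero. */
static int all_zero(const char *buf, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if (buf[i] != 0)
			return 0;
	return 1;
}

/*
 * Copy in_fd to out_fd, seeking over all-zero chunks instead of
 * writing them.  out_fd must be seekable, empty and at offset 0.
 * Returns the number of bytes copied, or -1 on error.
 */
static long long sparse_copy(int in_fd, int out_fd)
{
	char buf[CHUNK];
	long long total = 0;
	ssize_t n;

	while ((n = read(in_fd, buf, sizeof(buf))) > 0) {
		if (all_zero(buf, (size_t)n)) {
			/* Leave a hole: move the offset without writing. */
			if (lseek(out_fd, (off_t)n, SEEK_CUR) == (off_t)-1)
				return -1;
		} else {
			ssize_t off = 0;

			while (off < n) {
				ssize_t w = write(out_fd, buf + off, (size_t)(n - off));

				if (w <= 0)
					return -1;
				off += w;
			}
		}
		total += n;
	}
	if (n == -1)
		return -1;
	/* A trailing hole only exists once the size is set explicitly. */
	if (ftruncate(out_fd, (off_t)total) == -1)
		return -1;
	return total;
}
```

Because the input is only ever read sequentially, the same loop works when
in_fd is a pipe or stdin.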

-- 
Måns Rullgård
mru@kth.se


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-05 22:41     ` Mike Fedyk
  2003-12-05 23:25       ` Måns Rullgård
@ 2003-12-05 23:33       ` Szakacsits Szabolcs
  1 sibling, 0 replies; 59+ messages in thread
From: Szakacsits Szabolcs @ 2003-12-05 23:33 UTC (permalink / raw)
  To: Mike Fedyk; +Cc: Måns Rullgård, linux-kernel


On Fri, 5 Dec 2003, Mike Fedyk wrote:

> So with this, you can create sparse files for an entire set of files by just
> cping them? :)

You can even create a sparse file from stdin with cp. I wrote about it
here (see the section on handling sparse files):
http://linux-ntfs.sourceforge.net/man/ntfsclone.html

	Szaka

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-05  1:04   ` Philippe Troin
  2003-12-05  2:39     ` Peter Chubb
@ 2003-12-08  4:03     ` bill davidsen
  1 sibling, 0 replies; 59+ messages in thread
From: bill davidsen @ 2003-12-08  4:03 UTC (permalink / raw)
  To: linux-kernel

In article <873cc0nkgf.fsf@ceramic.fifi.org>,
Philippe Troin  <phil@fifi.org> wrote:
| Peter Chubb <peter@chubb.wattle.id.au> writes:
| 
| > >>>>> "Rob" == Rob Landley <rob@landley.net> writes:
| > 
| > Rob> You can make a file with a hole by seeking past it and never
| > Rob> writing to that bit, but is there any way to punch a hole in a
| > Rob> file after the fact?  (I mean other with lseek and write.  Having
| > Rob> a sparse file as the result....)
| > 
| > SVr4 has fcntl(fd, F_FREESP, flock) that frees the space covered by
| > the struct flock in the file.  Linux doesn't have this, at least in
| > the baseline kernels.
| 
| However most SVr4 (at least Solaris and HP-UX) only implement FREESP
| when the freed space is at the file's tail. In other words, FREESP can
| only be used to implement ftruncate().

Actually, I would think that you *don't* want to do this at end of
file; turning zeros into holes is not the same as truncate, since
truncate will change the file size, and that may not be what you
want at all.
-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: sparse file performance (was Re: Is there a "make hole" (truncate in middle) syscall?)
  2003-12-05 21:12     ` Linus Torvalds
@ 2003-12-08 20:43       ` Andy Isaacson
  0 siblings, 0 replies; 59+ messages in thread
From: Andy Isaacson @ 2003-12-08 20:43 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

On Fri, Dec 05, 2003 at 01:12:21PM -0800, Linus Torvalds wrote:
> On Fri, 5 Dec 2003, Andy Isaacson wrote:
> > I got curious enough to run some tests, and was surprised at the results.
> > My machine (Athlon XP 2400+, 2030 MHz, 512 MB, KT400, 2.4.22) can read
> > out of buffer cache at 234 MB/s, and off of its IDE disk at 40 MB/s.
> > I'd assumed that read(2)ing a holey file would go faster than reading
> > out of buffer cache; in theory you could do it completely in L1 cache
> > (with a 4KB buffer, it's just a ton of syscalls, some page table
> > manipulation, and a bunch of memcpy() out of a single zero page).  But
> > it turns out that reading a hole is *slower* than reading data from
> > buffer cache, just 195 MB/s.
> 
> That's because we actually instantiate the page cache pages even for
> holes. We have to, or we'd have to special-case them no end (and quite
> frankly, "hole read performance" is not something worth special casing,
> since it just isn't done under any real load).
> 
> So reading a hole implies creating the page cache entry and _clearing_ it.
> For each page. So while you may read from the L1, you also have to do
> writeback of the _previous_ pages from the L1 into the L2 and eventually
> out to memory.
> 
> (And eventually the VM also has to get rid of the pages etc, of course).

Thanks for the explanation, Linus.

I modified my benchmark to use mmap(2) instead of read(2) and the
results are broadly comparable.  With a 10MB window, I get 331 MB/s
reading out of buffer cache and 185 MB/s reading a hole.  Reading a file
too large to cache is about the same (disk-limited) speed, 43 MB/s.
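
For readers who want to reproduce this, an mmap(2)-based read loop might
look roughly like the following.  It is a guess at the shape of such a
benchmark, not the program actually used: count_byte_mmap() is a
hypothetical name, only the 10MB window comes from the description above,
and the timing code is left out.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define WINDOW (10L * 1024 * 1024)	/* 10MB, a multiple of the page size */

/*
 * Map the file one window at a time and count how many bytes equal
 * `byte`, so that every page is actually touched.
 * Returns the count, or -1 on error.
 */
static long long count_byte_mmap(int fd, unsigned char byte)
{
	struct stat st;
	off_t off;
	long long count = 0;

	if (fstat(fd, &st) == -1)
		return -1;

	for (off = 0; off < st.st_size; off += WINDOW) {
		size_t len = WINDOW;
		size_t i;
		unsigned char *p;

		/* The last window may be shorter than WINDOW. */
		if ((off_t)len > st.st_size - off)
			len = (size_t)(st.st_size - off);

		p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, off);
		if (p == MAP_FAILED)
			return -1;
		for (i = 0; i < len; i++)
			if (p[i] == byte)
				count++;
		munmap(p, len);
	}
	return count;
}
```

Running it over a file created with ftruncate() alone exercises the hole
case; running it twice over a small ordinary file exercises the cached case.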

-andy

^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: Is there a "make hole" (truncate in middle) syscall?
  2003-12-04 20:32 Is there a "make hole" (truncate in middle) syscall? Rob Landley
                   ` (4 preceding siblings ...)
  2003-12-04 23:23 ` Andy Isaacson
@ 2003-12-11  5:13 ` Hua Zhong
  2003-12-11  6:19   ` Rob Landley
  2003-12-11 18:58   ` Andy Isaacson
  5 siblings, 2 replies; 59+ messages in thread
From: Hua Zhong @ 2003-12-11  5:13 UTC (permalink / raw)
  To: rob, linux-kernel

Sorry for digging out this old discussion.

This would be a tremendous enhancement to Linux filesystems, and one of
my current projects actually needs this capability badly.

The project is a lightweight user-space library which implements a
file-based database. Each database has several files. The files are all
block-based, and each block is always a multiple of 512 bytes (and we
could make it a multiple of 4K, in case this feature existed).

Blocks are organized as a B+ tree, so we have a root block, which points
to its child blocks, and in turn they point to the next level. There is
a free block list too.

The problem is that with a lot of adds and deletes, a lot of free blocks
accumulate inside the file. So essentially we'd have to manually shrink
these files when they grow too big and eat up too much space. If we could
just "dig a hole", it would be trivial to return those blocks to the
filesystem without doing an expensive defragmentation.

> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org 
> [mailto:linux-kernel-owner@vger.kernel.org] On Behalf Of Rob Landley
> Sent: Thursday, December 04, 2003 12:32 PM
> To: linux-kernel@vger.kernel.org
> Subject: Is there a "make hole" (truncate in middle) syscall?
> 
> 
> You can make a file with a hole by seeking past it and never 
> writing to that 
> bit, but is there any way to punch a hole in a file after the 
> fact?  (I mean 
> other with lseek and write.  Having a sparse file as the result....)
> 
> What are the downsides of holes?  (How big do they have to be 
> to actually save 
> space, is there a performance penalty to having a file with 
> 1000 4k holes in 
> it, etc...)
> 
> Rob


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-11  5:13 ` Is there a "make hole" (truncate in middle) syscall? Hua Zhong
@ 2003-12-11  6:19   ` Rob Landley
  2003-12-11 18:58   ` Andy Isaacson
  1 sibling, 0 replies; 59+ messages in thread
From: Rob Landley @ 2003-12-11  6:19 UTC (permalink / raw)
  To: hzhong, linux-kernel

On Wednesday 10 December 2003 23:13, Hua Zhong wrote:
> Sorry for digging out this old discussion.
>
> This would be a tremendous enhancement to Linux filesystems, and one of
> my current projects actually needs this capability badly.
>
> The project is a lightweight user-space library which implements a
> file-based database. Each database has several files. The files are all
> block-based, and each block is always a multiple of 512 bytes (and we
> could make it a multiple of 4K, in case this feature existed).
>
> Blocks are organized as a B+ tree, so we have a root block, which points
> to its child blocks, and in turn they point to the next level. There is
> a free block list too.
>
> The problem is that with a lot of adds and deletes, a lot of free blocks
> accumulate inside the file. So essentially we'd have to manually shrink
> these files when they grow too big and eat up too much space. If we could
> just "dig a hole", it would be trivial to return those blocks to the
> filesystem without doing an expensive defragmentation.

It could be worse.  Java didn't have a "truncate file" command at all until I 
yelled at Sun about it.  (It was too late to get it into 1.1, but they added 
it to 1.2.  Of course, that was back when I cared about Java... :)

Al Viro mentioned that making hole creation play nice with mmap would be evil.  
I suspect that having the "punch hole" call simply fail if any part of the 
range you're trying to zap is currently mmaped is probably good enough for a 
first pass.  (Maybe some fallback code could write zeroes into the range so 
the behavior was sort of similar in the failure case...)  But I haven't 
looked at the code enough to even know what the issues are, and I certainly 
won't have time this week...

Rob

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-11  5:13 ` Is there a "make hole" (truncate in middle) syscall? Hua Zhong
  2003-12-11  6:19   ` Rob Landley
@ 2003-12-11 18:58   ` Andy Isaacson
  2003-12-11 19:15     ` Hua Zhong
  1 sibling, 1 reply; 59+ messages in thread
From: Andy Isaacson @ 2003-12-11 18:58 UTC (permalink / raw)
  To: Hua Zhong; +Cc: linux-kernel

On Wed, Dec 10, 2003 at 09:13:49PM -0800, Hua Zhong wrote:
> This would be a tremendous enhancement to Linux filesystems, and one of
> my current projects actually needs this capability badly.
> 
> The project is a lightweight user-space library which implements a
> file-based database. Each database has several files. The files are all
> block-based, and each block is always a multiple of 512 bytes (and we
> could make it a multiple of 4K, in case this feature existed).
> 
> Blocks are organized as a B+ tree, so we have a root block, which points
> to its child blocks, and in turn they point to the next level. There is
> a free block list too.
> 
> The problem is that with a lot of adds and deletes, a lot of free blocks
> accumulate inside the file. So essentially we'd have to manually shrink
> these files when they grow too big and eat up too much space. If we could
> just "dig a hole", it would be trivial to return those blocks to the
> filesystem without doing an expensive defragmentation.

The abstract interface for make_hole() is simple, but it turns into a
pretty expensive filesystem operation, I think.  After many cycles of
free/allocate, your file would be badly fragmented across the
filesystem.  You'll probably get better overall performance by keeping
track of how "sparse" your file is (you could compare st_blocks versus
how many blocks you have allocated in your tree structure) and re-write
it when you're wasting more than, say, 20% of the allocated space.
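
A rough sketch of that bookkeeping, under some assumptions:
wasted_fraction_of(), wasted_fraction() and should_rewrite() are
hypothetical names, live_bytes stands for whatever the tree structure
believes it is using, and st_blocks is taken to count 512-byte units, as
it does on Linux.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Fraction of the allocated bytes not covered by live data,
 * in the range [0, 1].
 */
static double wasted_fraction_of(long long allocated, long long live_bytes)
{
	if (allocated <= 0 || live_bytes >= allocated)
		return 0.0;
	return (double)(allocated - live_bytes) / (double)allocated;
}

/*
 * Same, with the allocated size taken from st_blocks.
 * Returns -1.0 if fstat() fails.
 */
static double wasted_fraction(int fd, long long live_bytes)
{
	struct stat st;

	if (fstat(fd, &st) == -1)
		return -1.0;
	/* Assumption: st_blocks is in 512-byte units (true on Linux). */
	return wasted_fraction_of((long long)st.st_blocks * 512, live_bytes);
}

/* Rewrite once more than 20% of the allocated space is wasted. */
static int should_rewrite(int fd, long long live_bytes)
{
	return wasted_fraction(fd, live_bytes) > 0.20;
}
```

The 20% threshold is just the figure suggested above; it would need tuning
against how expensive a rewrite is for the file sizes involved.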

It turns into an interesting problem if you don't want to double your
space requirements during the re-write process.  You could write the
new file "backwards", one MB at a time, truncating the previous file at
each step to free up the blocks.  You'd end up with contiguous 1MB
chunks, which given your tree organization is probably good enough.  If
you wanted really good streaming performance you'd want to do bigger
chunks (or just write the file from the beginning, or use the
pre-allocation APIs that I think XFS provides).

-andy

^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: Is there a "make hole" (truncate in middle) syscall?
  2003-12-11 18:58   ` Andy Isaacson
@ 2003-12-11 19:15     ` Hua Zhong
  2003-12-11 19:43       ` Andreas Dilger
  2003-12-11 19:48       ` Jörn Engel
  0 siblings, 2 replies; 59+ messages in thread
From: Hua Zhong @ 2003-12-11 19:15 UTC (permalink / raw)
  To: 'Andy Isaacson'; +Cc: linux-kernel

> The abstract interface for make_hole() is simple, but it turns into a
> pretty expensive filesystem operation, I think.  After many cycles of
> free/allocate, your file would be badly fragmented across the
> filesystem.  

Understood. Two filesystems we are using: tmpfs and ext3. For the
former, fragmentation doesn't matter.

Hey, I think when I get some cycles I can try to implement this for
tmpfs (since it's simpler) myself, and post a patch. :-) But before
that, I want to make sure it's doable.

> You'll probably get better overall performance by keeping
> track of how "sparse" your file is (you could compare st_blocks versus
> how many blocks you have allocated in your tree structure) and re-write
> it when you're wasting more than, say, 20% of the allocated space.
> 
> It turns into an interesting problem if you don't want to double your
> space requirements during the re-write process.  You could write the
> new file "backwards", one MB at a time, truncating the previous file at
> each step to free up the blocks.  You'd end up with contiguous 1MB
> chunks, which given your tree organization is probably good enough.  If
> you wanted really good streaming performance you'd want to do bigger
> chunks (or just write the file from the beginning, or use the
> pre-allocation APIs that I think XFS provides).
> 
> -andy
> 


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-11 19:15     ` Hua Zhong
@ 2003-12-11 19:43       ` Andreas Dilger
  2003-12-12 21:37         ` Daniel Phillips
  2003-12-11 19:48       ` Jörn Engel
  1 sibling, 1 reply; 59+ messages in thread
From: Andreas Dilger @ 2003-12-11 19:43 UTC (permalink / raw)
  To: Hua Zhong; +Cc: 'Andy Isaacson', linux-kernel

On Dec 11, 2003  11:15 -0800, Hua Zhong wrote:
> > The abstract interface for make_hole() is simple, but it turns into a
> > pretty expensive filesystem operation, I think.  After many cycles of
> > free/allocate, your file would be badly fragmented across the
> > filesystem.  
> 
> Understood. Two filesystems we are using: tmpfs and ext3. For the
> former, fragmentation doesn't matter.
> 
> Hey, I think when I get some cycles I can try to implement this for
> tmpfs (since it's simpler) myself, and post a patch. :-) But before
> that, I want to make sure it's doable.

At distant times in the past (i.e. 2.2 days), we had implemented a "punch"
syscall which did what you wanted for ext3.  I've looked at this for 2.4
at least, and it should be relatively straightforward to implement vmpunch
from vmtruncate, since most of the VM routines that vmtruncate calls
allow both a start and end parameter.  For tmpfs that should be enough.
Then, vmtruncate could just be a special case of vmpunch which punches
from start = i_size to end = -1ULL.

Presumably, if a filesystem didn't support a punch filesystem method
(either because it is unimplemented or because the filesystem doesn't
support holes) it would be implemented as either a truncate (if end is
beyond i_size) or a series of zero writes instead.
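
The fallback behaviour described here can be sketched in userspace as
follows.  This is only an illustration of the semantics, not kernel code:
punch_fallback() is a hypothetical name, the range is taken as
[start, end), and the truncate branch shrinks the file exactly as a plain
truncate would.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Emulate "punch [start, end)" without filesystem support:
 * truncate if the range reaches the end of the file, otherwise
 * overwrite the range with zeros.  Returns 0 on success, -1 on error.
 */
static int punch_fallback(int fd, off_t start, off_t end)
{
	static const char zeros[4096];
	struct stat st;
	off_t pos;

	if (start < 0 || end < start)
		return -1;
	if (fstat(fd, &st) == -1)
		return -1;
	if (start >= st.st_size)
		return 0;			/* nothing stored there */
	if (end >= st.st_size)
		return ftruncate(fd, start);	/* range runs to EOF */

	if (lseek(fd, start, SEEK_SET) == (off_t)-1)
		return -1;
	for (pos = start; pos < end; ) {
		size_t want = sizeof(zeros);
		ssize_t w;

		if ((off_t)want > end - pos)
			want = (size_t)(end - pos);
		w = write(fd, zeros, want);
		if (w <= 0)
			return -1;
		pos += w;
	}
	return 0;
}
```

The zero-write branch keeps the file contents equivalent to a hole but
frees no space, so callers cannot use it to reclaim blocks.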

At some point we may want to update punch for ext3 again, because it is
useful for various things (e.g. cache or hierarchical filesystems, etc).

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-11 19:15     ` Hua Zhong
  2003-12-11 19:43       ` Andreas Dilger
@ 2003-12-11 19:48       ` Jörn Engel
  2003-12-11 19:55         ` Hua Zhong
                           ` (2 more replies)
  1 sibling, 3 replies; 59+ messages in thread
From: Jörn Engel @ 2003-12-11 19:48 UTC (permalink / raw)
  To: Hua Zhong; +Cc: 'Andy Isaacson', linux-kernel

On Thu, 11 December 2003 11:15:28 -0800, Hua Zhong wrote:
> 
> > The abstract interface for make_hole() is simple, but it turns into a
> > pretty expensive filesystem operation, I think.  After many cycles of
> > free/allocate, your file would be badly fragmented across the
> > filesystem.  
> 
> Understood. Two filesystems we are using: tmpfs and ext3. For the
> former, fragmentation doesn't matter.
> 
> Hey, I think when I get some cycles I can try to implement this for
> tmpfs (since it's simpler) myself, and post a patch. :-) But before
> that, I want to make sure it's doable.

If you really do it, please don't add a syscall for it.  Simply check
each written page if it is completely filled with zero.  (This will be
a very quick check for most pages, as they will contain something
nonzero in the first couple of words)
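
The check itself is cheap to write down.  A minimal userspace sketch,
assuming a 4096-byte page (page_is_zero() and PAGE_BYTES are illustrative
names, not kernel symbols); it shows only the test, not what the
filesystem would do with the answer:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_BYTES 4096	/* assumed page size */

/* Return 1 if the page holds only zero bytes, 0 otherwise. */
static int page_is_zero(const void *page)
{
	const unsigned char *p = page;
	unsigned long word;
	size_t i;

	/* Compare a machine word at a time; memcpy avoids alignment traps. */
	for (i = 0; i < PAGE_BYTES; i += sizeof(word)) {
		memcpy(&word, p + i, sizeof(word));
		if (word != 0)
			return 0;	/* typical data exits within the first few words */
	}
	return 1;
}
```

The early exit is what makes the common case fast; an all-zero page still
costs a full scan of the page.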

Jörn

-- 
The story so far:
In the beginning the Universe was created.  This has made a lot
of people very angry and been widely regarded as a bad move.
-- Douglas Adams?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: Is there a "make hole" (truncate in middle) syscall?
  2003-12-11 19:48       ` Jörn Engel
@ 2003-12-11 19:55         ` Hua Zhong
  2003-12-11 19:58         ` Andy Isaacson
  2003-12-11 20:32         ` Rob Landley
  2 siblings, 0 replies; 59+ messages in thread
From: Hua Zhong @ 2003-12-11 19:55 UTC (permalink / raw)
  To: 'Jörn Engel'; +Cc: 'Andy Isaacson', linux-kernel

> > Understood. Two filesystems we are using: tmpfs and ext3. For the
> > former, fragmentation doesn't matter.
> > 
> > Hey, I think when I get some cycles I can try to implement this for
> > tmpfs (since it's simpler) myself, and post a patch. :-) But before
> > that, I want to make sure it's doable.
> 
> If you really do it, please don't add a syscall for it.  Simply check
> each written page if it is completely filled with zero.  (This will be
> a very quick check for most pages, as they will contain something
> nonzero in the first couple of words)

You mean automatically punch it?

I don't think this is desirable. As someone else pointed out, "punch"
might be an expensive operation and cause fragmentation (since you return
the block in the middle to the fs).

I think this operation should be performed only when the application
requires it.
 
> Jörn


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-11 19:48       ` Jörn Engel
  2003-12-11 19:55         ` Hua Zhong
@ 2003-12-11 19:58         ` Andy Isaacson
  2003-12-12 12:18           ` Jörn Engel
  2003-12-11 20:32         ` Rob Landley
  2 siblings, 1 reply; 59+ messages in thread
From: Andy Isaacson @ 2003-12-11 19:58 UTC (permalink / raw)
  To: Jörn Engel; +Cc: linux-kernel

On Thu, Dec 11, 2003 at 08:48:15PM +0100, Jörn Engel wrote:
> On Thu, 11 December 2003 11:15:28 -0800, Hua Zhong wrote:
> > Hey, I think when I get some cycles I can try to implement this for
> > tmpfs (since it's simpler) myself, and post a patch. :-) But before
> > that, I want to make sure it's doable.
> 
> If you really do it, please don't add a syscall for it.  Simply check
> each written page if it is completely filled with zero.  (This will be
> a very quick check for most pages, as they will contain something
> nonzero in the first couple of words)

Um, no.

That is a very bad idea.  Your suggestion would make it impossible to
actually write a block of all-zeros to the disk.  That makes it
impossible to pre-allocate disk space.

Another syscall is precisely the correct thing to do.  (I don't think
make_hole() is a special case of any extant syscall.)

-andy

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-11 19:48       ` Jörn Engel
  2003-12-11 19:55         ` Hua Zhong
  2003-12-11 19:58         ` Andy Isaacson
@ 2003-12-11 20:32         ` Rob Landley
  2003-12-12 12:55           ` Jörn Engel
  2 siblings, 1 reply; 59+ messages in thread
From: Rob Landley @ 2003-12-11 20:32 UTC (permalink / raw)
  To: Jörn Engel, Hua Zhong; +Cc: 'Andy Isaacson', linux-kernel

On Thursday 11 December 2003 13:48, Jörn Engel wrote:
> On Thu, 11 December 2003 11:15:28 -0800, Hua Zhong wrote:
> > > The abstract interface for make_hole() is simple, but it turns into a
> > > pretty expensive filesystem operation, I think.  After many cycles of
> > > free/allocate, your file would be badly fragmented across the
> > > filesystem.
> >
> > Understood. Two filesystems we are using: tmpfs and ext3. For the
> > former, fragmentation doesn't matter.
> >
> > Hey, I think when I get some cycles I can try to implement this for
> > tmpfs (since it's simpler) myself, and post a patch. :-) But before
> > that, I want to make sure it's doable.
>
> If you really do it, please don't add a syscall for it.  Simply check
> each written page if it is completely filled with zero.  (This will be
> a very quick check for most pages, as they will contain something
> nonzero in the first couple of words)

Cache poisoning, streaming writes to large RAID arrays...  There are about 8 
zillion reasons not to do this.  Really.  (It defeats the whole purpose of 
DMA, doesn't it?)

> Jörn

Rob

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-11 19:58         ` Andy Isaacson
@ 2003-12-12 12:18           ` Jörn Engel
  2003-12-12 15:40             ` Andy Isaacson
  0 siblings, 1 reply; 59+ messages in thread
From: Jörn Engel @ 2003-12-12 12:18 UTC (permalink / raw)
  To: Andy Isaacson; +Cc: linux-kernel

On Thu, 11 December 2003 13:58:54 -0600, Andy Isaacson wrote:
> On Thu, Dec 11, 2003 at 08:48:15PM +0100, Jörn Engel wrote:
> > 
> > If you really do it, please don't add a syscall for it.  Simply check
> > each written page if it is completely filled with zero.  (This will be
> > a very quick check for most pages, as they will contain something
> > nonzero in the first couple of words)
> 
> Um, no.
> 
> That is a very bad idea.  Your suggestion would make it impossible to
> actually write a block of all-zeros to the disk.  That makes it
> impossible to pre-allocate disk space.

How about implementing it inside the individual filesystems?  Then
each filesystem can figure out a logic that suits its special needs.

What I would sometimes like to have goes even beyond this.  Create a
simple hash for each size-x chunk of a file, and check against those
hashes whenever writing.  If hashes are identical, compare the chunks
and if those are identical as well, just create another link to that
chunk.  Kinda like rsync on a filesystem level, only without the
rolling checksum thing.

Yes, you can do a lot of this from userspace, but hard links don't
have a copy-on-write semantic, so this often breaks things, unless
*all* userspace programs break hard links before modifying files.

Also, this effectively compresses your data, so you need less
bandwidth to the cache and less cache size.  Whatever applies to code
size and L1 cache should apply here as well.  Sure, the disk access
pattern may be worse, but who cares, if data sets suddenly fit into
memory?

Oh yes, this would also give you my proposed zero-block detection for
free.

> Another syscall is precisely the correct thing to do.  (I don't think
> make_hole() is a special case of any extant syscall.)

Depends.  My proposal has a bunch of problems.  We won't have it
implemented by next year.  I buy all that.  Maybe we can do it with
10% kernel code and 90% userspace, maybe not.  Most likely the first
couple of implementations create more problems than they solve, yes.

But we should get there some day.  Having 15 nearly identical copies
of the kernel on my notebook is a pain and hard links simply have the
wrong semantics.

Jörn

-- 
Anything that can go wrong, will.
-- Finagle's Law

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-11 20:32         ` Rob Landley
@ 2003-12-12 12:55           ` Jörn Engel
  2003-12-12 13:28             ` Vladimir Saveliev
  2003-12-12 13:39             ` Rob Landley
  0 siblings, 2 replies; 59+ messages in thread
From: Jörn Engel @ 2003-12-12 12:55 UTC (permalink / raw)
  To: Rob Landley; +Cc: Hua Zhong, 'Andy Isaacson', linux-kernel

On Thu, 11 December 2003 14:32:12 -0600, Rob Landley wrote:
> On Thursday 11 December 2003 13:48, Jörn Engel wrote:
> >
> > If you really do it, please don't add a syscall for it.  Simply check
> > each written page if it is completely filled with zero.  (This will be
> > a very quick check for most pages, as they will contain something
> > nonzero in the first couple of words)
> 
> Cache poisoning, streaming writes to large RAID arrays...  There are about 8 
> zillion reasons not to do this.  Really.  (It defeats the whole purpose of 
> DMA, doesn't it?)

Yes, the obvious and stupid implementation has a ton of problems.
Most likely the right approach is some sort of background daemon
(garbage collector, defragmenter, journald, whatever you may call it)
that does exactly this even after the fact for the last unchecked
writes.  Asynchronous under load, possibly even synchronous when almost
idle.

A stupid implementation would still help for some workload (few, while
hurting many) and already get the code tested, so even a stupid
implementation helps.

Jörn

-- 
There's nothing better for promoting creativity in a medium than
making an audience feel "Hmm - I could do better than that!"
-- Douglas Adams in a slashdot interview

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-12 12:55           ` Jörn Engel
@ 2003-12-12 13:28             ` Vladimir Saveliev
  2003-12-12 13:43               ` Jörn Engel
  2003-12-12 13:53               ` Rob Landley
  2003-12-12 13:39             ` Rob Landley
  1 sibling, 2 replies; 59+ messages in thread
From: Vladimir Saveliev @ 2003-12-12 13:28 UTC (permalink / raw)
  To: Rob Landley; +Cc: linux-kernel

Hi

On Fri, 2003-12-12 at 15:55, Jörn Engel wrote:
> On Thu, 11 December 2003 14:32:12 -0600, Rob Landley wrote:
> > On Thursday 11 December 2003 13:48, Jörn Engel wrote:
> > >
> > > If you really do it, please don't add a syscall for it.  Simply check
> > > each written page if it is completely filled with zero.  (This will be
> > > a very quick check for most pages, as they will contain something
> > > nonzero in the first couple of words)
> > 
> > Cache poisoning, streaming writes to large RAID arrays...  There are about 8 
> > zillion reasons not to do this.  Really.  (It defeats the whole purpose of 
> > DMA, doesn't it?)
> 

Sorry,
but doesn't truncate do almost exactly what "make hole" is supposed to
do?

> Yes, the obvious and stupid implementation has a ton of problems.
> Most likely the right approach is some sort of background daemon
> (garbage collector, defragmenter, journald, whatever you may call it)
> that does exactly this even after the fact for the last unchecked
> writes.  Asynchronous under load, possibly even synchronous when almost
> idle.
> 
> A stupid implementation would still help for some workload (few, while
> hurting many) and already get the code tested, so even a stupid
> implementation helps.
> 
> Jörn


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-12 12:55           ` Jörn Engel
  2003-12-12 13:28             ` Vladimir Saveliev
@ 2003-12-12 13:39             ` Rob Landley
  2003-12-12 13:56               ` Jörn Engel
  1 sibling, 1 reply; 59+ messages in thread
From: Rob Landley @ 2003-12-12 13:39 UTC (permalink / raw)
  To: Jörn Engel; +Cc: Hua Zhong, 'Andy Isaacson', linux-kernel

On Friday 12 December 2003 06:55, Jörn Engel wrote:
> On Thu, 11 December 2003 14:32:12 -0600, Rob Landley wrote:
> > On Thursday 11 December 2003 13:48, Jörn Engel wrote:
> > > If you really do it, please don't add a syscall for it.  Simply check
> > > each written page if it is completely filled with zero.  (This will be
> > > a very quick check for most pages, as they will contain something
> > > nonzero in the first couple of words)
> >
> > Cache poisoning, streaming writes to large RAID arrays...  There are
> > > about 8 zillion reasons not to do this.  Really.  (It defeats the whole
> > purpose of DMA, doesn't it?)
>
> Yes, the obvious and stupid implementation has a ton of problems.
> Most likely the right approach is some sort of background daemon
> (garbage collector, defragmenter, journald, whatever you may call it)
> that does exactly this even after the fact for the last unchecked
> writes.  Asynchronous under load, possibly even synchronous when almost
> idle.

Actually, I'd planned on implementing a cron job that could do it.  We're 
talking a dozen lines of Python code (which can be optimized to only look at 
files with timestamps since the last time it ran).  And doesn't need anything 
from the kernel but the syscall...
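
A minimal sketch of the scanning half of such a cron job, assuming a 4 KiB
filesystem block size.  The names zero_runs and BLOCK are illustrative, not
from any existing tool, and the punch step is left out because it depends on
the syscall being asked for here:

```python
BLOCK = 4096  # assumed filesystem block size


def zero_runs(path, block=BLOCK):
    """Yield (offset, length) for each run of whole, all-zero blocks in path."""
    zero = bytes(block)
    start = None   # offset where the current zero run began, if any
    offset = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(block)
            if not chunk:
                break
            if chunk == zero:            # only full blocks can become holes
                if start is None:
                    start = offset
            elif start is not None:      # a non-zero block ends the run
                yield (start, offset - start)
                start = None
            offset += len(chunk)
    if start is not None:                # file ended inside a zero run
        yield (start, offset - start)
```

Each (offset, length) pair is a candidate for punching.  Note that this reads
every byte of the file, so it is only cheap if the set of files is small.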

Rob

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-12 13:28             ` Vladimir Saveliev
@ 2003-12-12 13:43               ` Jörn Engel
  2003-12-12 13:52                 ` Vladimir Saveliev
  2003-12-12 13:53               ` Rob Landley
  1 sibling, 1 reply; 59+ messages in thread
From: Jörn Engel @ 2003-12-12 13:43 UTC (permalink / raw)
  To: Vladimir Saveliev; +Cc: Rob Landley, linux-kernel

On Fri, 12 December 2003 16:28:18 +0300, Vladimir Saveliev wrote:
> 
> Sorry,
> but doesn't truncate do almost exactly what "make hole" is supposed to
> do?

Yeah, *almost* exactly.  Some people happen to care about the almost.

Jörn

-- 
Write programs that do one thing and do it well. Write programs to work
together. Write programs to handle text streams, because that is a
universal interface. 
-- Doug McIlroy

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-12 13:43               ` Jörn Engel
@ 2003-12-12 13:52                 ` Vladimir Saveliev
  2003-12-12 14:04                   ` Jörn Engel
  0 siblings, 1 reply; 59+ messages in thread
From: Vladimir Saveliev @ 2003-12-12 13:52 UTC (permalink / raw)
  To: Jörn Engel; +Cc: linux-kernel

On Fri, 2003-12-12 at 16:43, Jörn Engel wrote:
> On Fri, 12 December 2003 16:28:18 +0300, Vladimir Saveliev wrote:
> > 
> > Sorry,
> > but doesn't truncate do almost exactly what "make hole" is supposed to
> > do?
> 
> Yeah, *almost* exactly.  Some people happen to care about the almost.
> 

I meant: where are those tons of problems (except for the fact that
"make hole" is obviously something without which one can live just
fine)? 

> Jörn


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-12 13:28             ` Vladimir Saveliev
  2003-12-12 13:43               ` Jörn Engel
@ 2003-12-12 13:53               ` Rob Landley
  2003-12-12 14:01                 ` Vladimir Saveliev
  1 sibling, 1 reply; 59+ messages in thread
From: Rob Landley @ 2003-12-12 13:53 UTC (permalink / raw)
  To: Vladimir Saveliev; +Cc: linux-kernel

On Friday 12 December 2003 07:28, Vladimir Saveliev wrote:
> Hi
>
> On Fri, 2003-12-12 at 15:55, Jörn Engel wrote:
> > On Thu, 11 December 2003 14:32:12 -0600, Rob Landley wrote:
> > > On Thursday 11 December 2003 13:48, Jörn Engel wrote:
> > > > If you really do it, please don't add a syscall for it.  Simply check
> > > > each written page if it is completely filled with zero.  (This will
> > > > be a very quick check for most pages, as they will contain something
> > > > nonzero in the first couple of words)
> > >
> > > Cache poisoning, streaming writes to large RAID arrays...  There are
> > > > about 8 zillion reasons not to do this.  Really.  (It defeats the whole
> > > purpose of DMA, doesn't it?)
>
> Sorry,
> but doesn't truncate do almost exactly what "make hole" is supposed to
> do?

I have a 2 gigabyte file.  I want to punch a hole from 257 megabytes to 364 
megabytes, saving over 100 megs of disk space.  I do NOT want to have to copy 
off and rewrite 1.6 gigabytes of data from the end of the file.  (There may 
not even be enough room on the disk, and it would take a long time anyway.)

No, it doesn't do the same thing.  Truncate is always to end of file.  Punch 
hole is like a write, it takes a length.  lseek(pos), punch(length).
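
The figures above work out as follows.  This is only a sketch of the
arithmetic, assuming "2 gigabyte" means 2048 MB and that both ends of the hole
are block aligned; the variable names are illustrative:

```python
MB = 1024 * 1024

file_size = 2048 * MB    # the 2 gigabyte file (assumed to be exactly 2048 MB)
hole_start = 257 * MB
hole_end = 364 * MB

freed_by_punch = hole_end - hole_start      # space a punch would release
tail_after_hole = file_size - hole_end      # data that must survive the hole
lost_by_truncate = file_size - hole_start   # what truncate(hole_start) discards

print(freed_by_punch // MB, "MB freed by punching the hole")
print(tail_after_hole // MB, "MB of tail to copy off and rewrite without punch")
print(lost_by_truncate // MB, "MB removed by truncating at the hole start")
```

The 1684 MB tail is the "1.6 gigabytes" that would have to be copied off and
rewritten if truncate were the only tool available.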

Rob

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-12 13:39             ` Rob Landley
@ 2003-12-12 13:56               ` Jörn Engel
  2003-12-12 14:24                 ` Jörn Engel
  0 siblings, 1 reply; 59+ messages in thread
From: Jörn Engel @ 2003-12-12 13:56 UTC (permalink / raw)
  To: Rob Landley; +Cc: Hua Zhong, 'Andy Isaacson', linux-kernel

On Fri, 12 December 2003 07:39:25 -0600, Rob Landley wrote:
> On Friday 12 December 2003 06:55, Jörn Engel wrote:
> >
> > Yes, the obvious and stupid implementation has a ton of problems.
> > Most likely the right approach is some sort of background daemon
> > (garbage collector, defragmenter, journald, whatever you may call it)
> > that does exactly this even after the fact for the last unchecked
> > writes.  Asynchronous under load, possibly even synchronous when almost
> > idle.
> 
> Actually, I'd planned on implementing a cron job that could do it.  We're 
> talking a dozen lines of Python code (which can be optimized to only look at 
> files with timestamps since the last time it ran).  And doesn't need anything 
> from the kernel but the syscall...

...and it sucks.  Same problem as with updatedb - 99% of all work is
bogus, but you don't know which 99%, because the one knowing about it,
the kernel, doesn't tell you a thing.

Maybe a simple notification mechanism would sufficiently solve this as
well, so all the rest can be done in userspace.  Basically a file with
a simple format like this:
#path	offset	len
/tmp/foo	0	12

Meaning that bytes 0-11 of /tmp/foo have changed in whatever way.

Something like that, the details don't matter too much.
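
A minimal sketch of a userspace reader for such a file, assuming the fields
are tab separated and that lines starting with '#' are comments.  Change and
parse_changes are hypothetical names; no such kernel interface exists:

```python
from typing import Iterable, Iterator, NamedTuple


class Change(NamedTuple):
    path: str
    offset: int
    length: int


def parse_changes(lines: Iterable[str]) -> Iterator[Change]:
    """Parse 'path<TAB>offset<TAB>len' records, skipping comments and blanks."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        # Split from the right so a path containing tabs still parses.
        path, offset, length = line.rsplit("\t", 2)
        yield Change(path, int(offset), int(length))


for change in parse_changes(["#path\toffset\tlen\n", "/tmp/foo\t0\t12\n"]):
    last = change.offset + change.length - 1
    print(f"{change.path}: bytes {change.offset}-{last} changed")
```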

Jörn

-- 
When you close your hand, you own nothing. When you open it up, you
own the whole world.
-- Li Mu Bai in Tiger & Dragon

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-12 13:53               ` Rob Landley
@ 2003-12-12 14:01                 ` Vladimir Saveliev
  2003-12-12 21:35                   ` Rob Landley
  0 siblings, 1 reply; 59+ messages in thread
From: Vladimir Saveliev @ 2003-12-12 14:01 UTC (permalink / raw)
  To: Rob Landley; +Cc: linux-kernel

On Fri, 2003-12-12 at 16:53, Rob Landley wrote:
> On Friday 12 December 2003 07:28, Vladimir Saveliev wrote:
> > Hi
> >
> > On Fri, 2003-12-12 at 15:55, Jörn Engel wrote:
> > > On Thu, 11 December 2003 14:32:12 -0600, Rob Landley wrote:
> > > > On Thursday 11 December 2003 13:48, Jörn Engel wrote:
> > > > > If you really do it, please don't add a syscall for it.  Simply check
> > > > > each written page if it is completely filled with zero.  (This will
> > > > > be a very quick check for most pages, as they will contain something
> > > > > nonzero in the first couple of words)
> > > >
> > > > Cache poisoning, streaming writes to large RAID arrays...  There are
> > > > > about 8 zillion reasons not to do this.  Really.  (It defeats the whole
> > > > purpose of DMA, doesn't it?)
> >
> > Sorry,
> > but doesn't truncate do almost exactly what "make hole" is supposed to
> > do?
> 
> I have a 2 gigabyte file.  I want to punch a hole from 257 megabytes to 364 
> megabytes, saving over 100 megs of disk space.  I do NOT want to have to copy 
> off and rewrite 1.6 gigabytes of data from the end of the file.  (There may 
> not even be enough room on the disk, and it would take a long time anyway.)
> 
ok.
But I asked why "make hole" would have the problems you list (8 zillion)
while truncate would not have them?

> No, it doesn't do the same thing.  Truncate is always to end of file.  Punch 
> hole is like a write, it takes a length.  lseek(pos), punch(length).
> 
> Rob
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-12 13:52                 ` Vladimir Saveliev
@ 2003-12-12 14:04                   ` Jörn Engel
  0 siblings, 0 replies; 59+ messages in thread
From: Jörn Engel @ 2003-12-12 14:04 UTC (permalink / raw)
  To: Vladimir Saveliev; +Cc: linux-kernel

On Fri, 12 December 2003 16:52:43 +0300, Vladimir Saveliev wrote:
> On Fri, 2003-12-12 at 16:43, Jörn Engel wrote:
> > On Fri, 12 December 2003 16:28:18 +0300, Vladimir Saveliev wrote:
> > > 
> > > Sorry,
> > > but doesn't truncate do almost exactly what "make hole" is supposed to
> > > do?
> > 
> > Yeah, *almost* exactly.  Some people happen to care about the almost.
> > 
> 
> I meant: where are those tons of problems (except for the fact that
> "make hole" is obviously something without which one can live just
> fine)? 

Pretty much, yes.  As hinted at before, holes are just special cases
of a more general problem, block pointer handling.  On my hard drive,
there are literally millions of identical blocks.  If the filesystem
knew about those, it could throw away most of them and just point to the
blocks from different files (or maybe just file positions).

A hole is simply a file offset pointing to a special and very common
shared block, but there are many others.
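
A minimal in-memory sketch of that view, where a file is a list of block
pointers and identical blocks are stored once.  The use of SHA-256, the 4 KiB
block size, and the names BlockStore and store_file are all assumptions made
for illustration, not a description of any real filesystem:

```python
import hashlib

BLOCK = 4096  # assumed block size


class BlockStore:
    """Stores each distinct block once, keyed by its hash."""

    def __init__(self):
        self.blocks = {}  # digest -> block contents

    def put(self, data):
        digest = hashlib.sha256(data).digest()
        existing = self.blocks.get(digest)
        if existing is None:
            self.blocks[digest] = data          # first copy: really store it
        elif existing != data:                  # same hash, different bytes
            raise RuntimeError("hash collision")
        return digest                           # the "block pointer"


def store_file(store, data, block=BLOCK):
    """Return the file as a list of block pointers into the store."""
    return [store.put(data[i:i + block]) for i in range(0, len(data), block)]


store = BlockStore()
a = store_file(store, b"x" * BLOCK + bytes(3 * BLOCK))
b = store_file(store, bytes(2 * BLOCK) + b"x" * BLOCK)
print(len(a) + len(b), "block pointers,", len(store.blocks), "blocks stored")
```

In this picture the all-zero block is just the most frequently shared entry,
and a hole is a pointer to it.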

Jörn

-- 
He that composes himself is wiser than he that composes a book.
-- B. Franklin

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-12 13:56               ` Jörn Engel
@ 2003-12-12 14:24                 ` Jörn Engel
  2003-12-12 21:37                   ` Rob Landley
  0 siblings, 1 reply; 59+ messages in thread
From: Jörn Engel @ 2003-12-12 14:24 UTC (permalink / raw)
  To: Rob Landley; +Cc: Hua Zhong, 'Andy Isaacson', linux-kernel

On Fri, 12 December 2003 14:56:09 +0100, Jörn Engel wrote:
> On Fri, 12 December 2003 07:39:25 -0600, Rob Landley wrote:
> > On Friday 12 December 2003 06:55, Jörn Engel wrote:
> > >
> > > Yes, the obvious and stupid implementation has a ton of problems.
> > > Most likely the right approach is some sort of background daemon
> > > (garbage collector, defragmenter, journald, whatever you may call it)
> > > that does exactly this even after the fact for the last unchecked
> > > writes.  Asynchronous under load, possibly even synchronous when almost
> > > idle.
> > 
> > Actually, I'd planned on implementing a cron job that could do it.  We're 
> > talking a dozen lines of Python code (which can be optimized to only look at 
> > files with timestamps since the last time it ran).  And doesn't need anything 
> > from the kernel but the syscall...
> 
> ...and it sucks.  Same problem as with updatedb - 99% of all work is
> bogus, but you don't know which 99%, because the one knowing about it,
> the kernel, doesn't tell you a thing.

Actually, updatedb sucks even worse.  The database is notoriously
outdated and each run of updatedb has the effect of flushing the
cache.  Because of the cache-flushing effect, you cannot even run it
with maximum niceness.  Running it still hurts you *afterwards*.

Same goes for your userland daemon without kernel support.

Jörn

-- 
To recognize individual spam features you have to try to get into the
mind of the spammer, and frankly I want to spend as little time inside
the minds of spammers as possible.
-- Paul Graham

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-12 12:18           ` Jörn Engel
@ 2003-12-12 15:40             ` Andy Isaacson
  2003-12-12 16:03               ` Jörn Engel
  0 siblings, 1 reply; 59+ messages in thread
From: Andy Isaacson @ 2003-12-12 15:40 UTC (permalink / raw)
  To: Jörn Engel; +Cc: linux-kernel

On Fri, Dec 12, 2003 at 01:18:24PM +0100, Jörn Engel wrote:
> On Thu, 11 December 2003 13:58:54 -0600, Andy Isaacson wrote:
> > On Thu, Dec 11, 2003 at 08:48:15PM +0100, Jörn Engel wrote:
> > > If you really do it, please don't add a syscall for it.  Simply check
> > > each written page if it is completely filled with zero.  (This will be
> > > a very quick check for most pages, as they will contain something
> > > nonzero in the first couple of words)
> > 
> > Um, no.
> > 
> > That is a very bad idea.  Your suggestion would make it impossible to
> > actually write a block of all-zeros to the disk.  That makes it
> > impossible to pre-allocate disk space.
> 
> How about implementing it inside the individual filesystems?  Then
> each filesystem can figure out a logic that suits its special needs.
> 
> What I would sometimes like to have goes even beyond this.  Create a
> simple hash for each size-x chunk of a file, and check against those
> hashes whenever writing.  If hashes are identical, compare the chunks
> and if those are identical as well, just create another link to that
> chunk.  Kinda like rsync on a filesystem level, only without the
> rolling checksum thing.

A related idea was reportedly used in the Venti filesystem, which was
discussed on linux-kernel back in October; look for the thread
"Transparent compression in the FS".

The downsides are pretty substantial (but the upsides are too).  You
don't know how many blocks are available on the filesystem for you to
write, because when you write you might not allocate blocks.  And you
lose disk-streaming performance, because you're going to be seeking all
over the disk picking up the blocks for your files.

> > Another syscall is precisely the correct thing to do.  (I don't think
> > make_hole() is a special case of any extant syscall.)
> 
> Depends.  My proposal has a bunch of problems.  We won't have it
> implemented by next year.  I buy all that.  Maybe we can do it with
> 10% kernel code and 90% userspace, maybe not.  Most likely the first
> couple of implementations create more problems than they solve, yes.
> 
> But we should get there some day.  Having 15 nearly identical copies
> of the kernel on my notebook is a pain and hard links simply have the
> wrong semantics.

I don't know about you, but I don't have 15 nearly identical copies of
the kernel; I have 30 copies that have almost no text in common, and
certainly have no blocks in common -- they result from independent
compilations, and the resulting bzImage files are not duplicates.

-andy

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-12 15:40             ` Andy Isaacson
@ 2003-12-12 16:03               ` Jörn Engel
  0 siblings, 0 replies; 59+ messages in thread
From: Jörn Engel @ 2003-12-12 16:03 UTC (permalink / raw)
  To: Andy Isaacson; +Cc: linux-kernel

On Fri, 12 December 2003 09:40:31 -0600, Andy Isaacson wrote:
> 
> A related idea was reportedly used in the Venti filesystem, which was
> discussed on linux-kernel back in October; look for the thread
> "Transparent compression in the FS".
> 
> The downsides are pretty substantial (but the upsides are too).  You
> don't know how many blocks are available on the filesystem for you to
> write, because when you write you might not allocate blocks.  And you
> lose disk-streaming performance, because you're going to be seeking all
> over the disk picking up the blocks for your files.

Right - to some degree.  I'm sure many problems can be dealt with, but
it takes a lot of time to sort out the details.  For streaming
performance, I guess in most cases you will get the same performance
because you won't find a single duplicate block in those files.  Two
competing readers should be a much bigger problem.

Still, there are more problems, no doubt.

> > But we should get there some day.  Having 15 nearly identical copies
> > of the kernel on my notebook is a pain and hard links simply have the
> > wrong semantics.
> 
> I don't know about you, but I don't have 15 nearly identical copies of
> the kernel; I have 30 copies that have almost no text in common, and
> certainly have no blocks in common -- they result from independent
> compilations, and the resulting bzImage files are not duplicates.

s/kernel/kernel source/

If it was just the images, I couldn't care less.  But 15x 200-300 Megs
does hurt a bit. :)
grep -r over multiple trees hurts even more, when RAM spills over.

Jörn

-- 
And spam is a useful source of entropy for /dev/random too!
-- Jasmine Strong

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-12 14:01                 ` Vladimir Saveliev
@ 2003-12-12 21:35                   ` Rob Landley
  2003-12-15 10:00                     ` Vladimir Saveliev
  0 siblings, 1 reply; 59+ messages in thread
From: Rob Landley @ 2003-12-12 21:35 UTC (permalink / raw)
  To: Vladimir Saveliev; +Cc: linux-kernel

On Friday 12 December 2003 08:01, Vladimir Saveliev wrote:
> On Fri, 2003-12-12 at 16:53, Rob Landley wrote:
> > On Friday 12 December 2003 07:28, Vladimir Saveliev wrote:
> > > Hi
> > >
> > > On Fri, 2003-12-12 at 15:55, Jörn Engel wrote:
> > > > On Thu, 11 December 2003 14:32:12 -0600, Rob Landley wrote:
> > > > > On Thursday 11 December 2003 13:48, Jörn Engel wrote:
> > > > > > If you really do it, please don't add a syscall for it.  Simply
> > > > > > check each written page if it is completely filled with zero. 
> > > > > > (This will be a very quick check for most pages, as they will
> > > > > > contain something nonzero in the first couple of words)
> > > > >
> > > > > Cache poisoning, streaming writes to large RAID arrays...  There
> > > > > > are about 8 zillion reasons not to do this.  Really.  (It defeats
> > > > > the whole purpose of DMA, doesn't it?)
> > >
> > > Sorry,
> > > but doesn't truncate do almost exactly what "make hole" is supposed to
> > > do?
> >
> > I have a 2 gigabyte file.  I want to punch a hole from 257 megabytes to
> > 364 megabytes, saving over 100 megs of disk space.  I do NOT want to have
> > to copy off and rewrite 1.6 gigabytes of data from the end of the file. 
> > (There may not even be enough room on the disk, and it would take a long
> > time anyway.)
>
> ok.
> But I asked why "make hole" would have the problems you list (8 zillion)
> while truncate would not have them?

Ah.

Truncate doesn't look at the contents of the file, it just frees the space 
regardless of what the data was.  (It doesn't have to load the contents of 
the blocks into memory and look at them in order to make the file's length 
shorter in the metadata and de-allocate those blocks.)

What was suggested a bit earlier was automatically looking at the contents of 
the data being written to disk, and not allocating actual blocks if the data 
is all zeroes.  (A bit like looking at pages of memory and copy-on-write 
aliasing them to the zero page whenever the page is entirely zeroes.)

Truncate doesn't do any of that.  Truncate only plays with metadata, and 
doesn't care about the contents of the file.
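
A userspace analogue of the automatic approach, as a sketch only: every block
is compared against zeroes before it is written, and all-zero blocks are
skipped with a seek so the filesystem can leave a hole there.  The name
write_skipping_zero_blocks and the 4 KiB block size are assumptions:

```python
import os

BLOCK = 4096  # assumed filesystem block size


def write_skipping_zero_blocks(dst_path, data, block=BLOCK):
    """Write data to dst_path, seeking over all-zero blocks instead of writing."""
    zero = bytes(block)
    with open(dst_path, "wb") as f:
        for i in range(0, len(data), block):
            chunk = data[i:i + block]
            if chunk == zero:
                f.seek(len(chunk), os.SEEK_CUR)   # leave a hole, write nothing
            else:
                f.write(chunk)
        f.truncate(len(data))  # set the final size if the data ended in a hole
```

The comparison touches every byte of every block, which is exactly the cost
that truncate, working only on metadata, never pays.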

Rob

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-11 19:43       ` Andreas Dilger
@ 2003-12-12 21:37         ` Daniel Phillips
  0 siblings, 0 replies; 59+ messages in thread
From: Daniel Phillips @ 2003-12-12 21:37 UTC (permalink / raw)
  To: Andreas Dilger, Hua Zhong; +Cc: 'Andy Isaacson', linux-kernel

Hi Andreas,

On Thursday 11 December 2003 14:43, Andreas Dilger wrote:
> Presumably, if a filesystem didn't support a punch filesystem method
> (either because it is unimplemented or because the filesystem doesn't
> support holes) it would be implemented as either a truncate (if end is
> beyond i_size) or a series of zero writes instead.

It would be more regular and less surprising to make it semantically 
equivalent to writing a string of zeros, that is, it will never truncate and 
may extend a file.
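
A sketch of that fallback behaviour, assuming a POSIX system where os.pwrite
is available.  punch_by_zero_fill is a hypothetical name; it frees no space,
it only produces the contents a punched range would read back as:

```python
import os


def punch_by_zero_fill(fd, offset, length, chunk=65536):
    """Emulate a punch by writing zeros; never truncates, may extend the file."""
    zeros = bytes(chunk)
    end = offset + length
    pos = offset
    while pos < end:
        n = min(chunk, end - pos)
        # pwrite may write fewer bytes than asked, so advance by its result.
        pos += os.pwrite(fd, zeros[:n], pos)
```

Punching 100 bytes at offset 50 of a 100-byte file leaves a 150-byte file
whose last 100 bytes are zero, the same result as an ordinary write of zeros.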

Regards,

Daniel


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-12 14:24                 ` Jörn Engel
@ 2003-12-12 21:37                   ` Rob Landley
  2003-12-15 12:47                     ` Jörn Engel
  0 siblings, 1 reply; 59+ messages in thread
From: Rob Landley @ 2003-12-12 21:37 UTC (permalink / raw)
  To: Jörn Engel; +Cc: Hua Zhong, 'Andy Isaacson', linux-kernel

On Friday 12 December 2003 08:24, Jörn Engel wrote:

> > ...and it sucks.  Same problem as with updatedb - 99% of all work is
> > bogus, but you don't know which 99%, because the one knowing about it,
> > the kernel, doesn't tell you a thing.
>
> Actually, updatedb sucks even worse.  The database is notoriously
> outdated and each run of updatedb has the effect of flushing the
> cache.  Because of the cache-flushing effect, you cannot even run it
> with maximum niceness.  Running it still hurts you *afterwards*.
>
> Same goes for your userland daemon without kernel support.
>
> Jörn

1) The date optimization, only looking at files newer than the last run, means 
you can avoid looking at 90% of the filesystem.

2) If drop-behind ever gets working, life is good for this sort of thing.  If 
not, there's always O_DIRECT or its replacement (whatever Linus and the 
oracle guy were arguing about last month)...
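
A minimal sketch of the date optimization in point 1, assuming the time of the
previous run is kept as a Unix timestamp.  files_modified_since is an
illustrative name:

```python
import os
import stat


def files_modified_since(root, last_run):
    """Yield regular files under root whose mtime is newer than last_run."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # file vanished between listing and stat
            if stat.S_ISREG(st.st_mode) and st.st_mtime > last_run:
                yield path
```

This still walks the whole tree and stats every entry; what it avoids is
reading the contents of files that have not changed.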

Rob

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-12 21:35                   ` Rob Landley
@ 2003-12-15 10:00                     ` Vladimir Saveliev
  2003-12-15 11:52                       ` Rob Landley
  0 siblings, 1 reply; 59+ messages in thread
From: Vladimir Saveliev @ 2003-12-15 10:00 UTC (permalink / raw)
  To: Rob Landley; +Cc: linux-kernel

Hello

On Sat, 2003-12-13 at 00:35, Rob Landley wrote:
> On Friday 12 December 2003 08:01, Vladimir Saveliev wrote:
> > On Fri, 2003-12-12 at 16:53, Rob Landley wrote:
> > > On Friday 12 December 2003 07:28, Vladimir Saveliev wrote:
> > > > Hi
> > > >
> > > > On Fri, 2003-12-12 at 15:55, Jörn Engel wrote:
> > > > > On Thu, 11 December 2003 14:32:12 -0600, Rob Landley wrote:
> > > > > > On Thursday 11 December 2003 13:48, Jörn Engel wrote:
> > > > > > > If you really do it, please don't add a syscall for it.  Simply
> > > > > > > check each written page if it is completely filled with zero. 
> > > > > > > (This will be a very quick check for most pages, as they will
> > > > > > > contain something nonzero in the first couple of words)
> > > > > >
> > > > > > Cache poisoning, streaming writes to large RAID arrays...  There
> > > > > > > are about 8 zillion reasons not to do this.  Really.  (It defeats
> > > > > > the whole purpose of DMA, doesn't it?)
> > > >
> > > > Sorry,
> > > > but doesn't truncate do almost exactly what "make hole" is supposed to
> > > > do?
> > >
> > > I have a 2 gigabyte file.  I want to punch a hole from 257 megabytes to
> > > 364 megabytes, saving over 100 megs of disk space.  I do NOT want to have
> > > to copy off and rewrite 1.6 gigabytes of data from the end of the file. 
> > > (There may not even be enough room on the disk, and it would take a long
> > > time anyway.)
> >
> > ok.
> > But I asked why "make hole" would have the problems you list (8 zillion)
> > while truncate would not have them?
> 
> Ah.
> 
> Truncate doesn't look at the contents of the file, it just frees the space 
> regardless of what the data was.  (It doesn't have to load the contents of 
> the blocks into memory and look at them in order to make the file's length 
> shorter in the metadata and de-allocate those blocks.)
> 
> What was suggested a bit earlier was automatically looking at the contents of 
> the data being written to disk, and not allocating actual blocks if the data 
> is all zeroes.  (A bit like looking at pages of memory and copy-on-write 
> aliasing them to the zero page whenever the page is entirely zeroes.)
> 
> Truncate doesn't do any of that.  Truncate only plays with metadata, and 
> doesn't care about the contents of the file.
> 

I thought we were talking about something that would allow creating
holes inside a non-sparse file


> Rob


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-15 10:00                     ` Vladimir Saveliev
@ 2003-12-15 11:52                       ` Rob Landley
  2003-12-15 13:26                         ` Jörn Engel
  0 siblings, 1 reply; 59+ messages in thread
From: Rob Landley @ 2003-12-15 11:52 UTC (permalink / raw)
  To: Vladimir Saveliev; +Cc: linux-kernel

On Monday 15 December 2003 04:00, Vladimir Saveliev wrote:

> > Truncate doesn't look at the contents of the file, it just frees the
> > space regardless of what the data was.  (It doesn't have to load the
> > contents of the blocks into memory and look at them in order to make the
> > file's length shorter in the metadata and de-allocate those blocks.)
> >
> > What was suggested a bit earlier was automatically looking at the
> > contents of the data being written to disk, and not allocating actual
> > blocks if the data is all zeroes.  (A bit like looking at pages of memory
> > and copy-on-write aliasing them to the zero page whenever the page is
> > entirely zeroes.)
> >
> > Truncate doesn't do any of that.  Truncate only plays with metadata, and
> > doesn't care about the contents of the file.
>
> I thought we were talking about something that would allow creating
> holes inside a non-sparse file

Yes.  With a syscall that says "from here, to here, punch hole".

The earlier suggestion I was disagreeing with would automatically create holes 
in any file that wrote a sufficiently large range of zero bytes.  Hence the 
cache poisoning and general defeating the purpose of DMA and such.  Neither 
truncate, nor a punch syscall, would mess with the normal "write" path 
(beyond locking so write and truncate/punch didn't stomp each other).

Rob

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-12 21:37                   ` Rob Landley
@ 2003-12-15 12:47                     ` Jörn Engel
  2003-12-16  5:43                       ` Rob Landley
  0 siblings, 1 reply; 59+ messages in thread
From: Jörn Engel @ 2003-12-15 12:47 UTC (permalink / raw)
  To: Rob Landley; +Cc: Hua Zhong, 'Andy Isaacson', linux-kernel

On Fri, 12 December 2003 15:37:42 -0600, Rob Landley wrote:
> On Friday 12 December 2003 08:24, Jörn Engel wrote:
> 
> > > ...and it sucks.  Same problem as with updatedb - 99% of all work is
> > > bogus, but you don't know which 99%, because the one knowing about it,
> > > the kernel, doesn't tell you a thing.
> >
> > Actually, updatedb sucks even worse.  The database is notoriously
> > outdated and each run of updatedb has the effect of flushing the
> > cache.  Because of the cache-flushing effect, you cannot even run it
> > with maximum niceness.  Running it still hurts you *afterwards*.
> >
> > Same goes for your userland daemon without kernel support.
> 
> 1) The date optimization, only looking at files newer than the last run, means
> you can avoid looking at 90% of the filesystem.

And how do you figure out the date?  ;)

> 2) If drop-behind ever gets working, life is good for this sort of thing.  If 
> not, there's always O_DIRECT or its replacement (whatever Linus and the 
> oracle guy were arguing about last month)...

Not sure what drop-behind is.  Sounds interesting.

Anyway, what updatedb, userspace defragmenters etc. need is a
notification of what has changed.  Without this notification, they have
to look at everything and figure it out themselves.  Ask the network
people why select doesn't scale too well - updatedb is even worse
because it doesn't even notice that *anything* has changed, much less
what change happened.

O_DIRECT, O_STREAMING or O_WHATEVER is also a different beast.  With
streaming media, there is no way to avoid touching the data in the
first place.  If there was, we could do even better, but there isn't.

For updatedb there is.

Jörn

-- 
When in doubt, punt.  When somebody actually complains, go back and fix it...
The 90% solution is a good thing.
-- Rob Landley

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-15 11:52                       ` Rob Landley
@ 2003-12-15 13:26                         ` Jörn Engel
  0 siblings, 0 replies; 59+ messages in thread
From: Jörn Engel @ 2003-12-15 13:26 UTC (permalink / raw)
  To: Rob Landley; +Cc: Vladimir Saveliev, linux-kernel

On Mon, 15 December 2003 05:52:22 -0600, Rob Landley wrote:
> 
> The earlier suggestion I was disagreeing with would automatically create holes 
> in any file that wrote a sufficiently large range of zero bytes.  Hence the 
> cache poisoning and general defeating the purpose of DMA and such.  Neither 
> truncate, nor a punch syscall, would mess with the normal "write" path 
> (beyond locking so write and truncate/punch didn't stomp each other).

And the suggestor remains convinced that this is a good idea.  It
would be perfectly ok to defer actually looking at the data to later,
move that functionality to a journald or gcd or so, but the principle
remains unchanged.

Jörn

-- 
And spam is a useful source of entropy for /dev/random too!
-- Jasmine Strong

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-15 12:47                     ` Jörn Engel
@ 2003-12-16  5:43                       ` Rob Landley
  2003-12-16 11:05                         ` Jörn Engel
  0 siblings, 1 reply; 59+ messages in thread
From: Rob Landley @ 2003-12-16  5:43 UTC (permalink / raw)
  To: Jörn Engel; +Cc: linux-kernel

On Monday 15 December 2003 06:47, Jörn Engel wrote:
> On Fri, 12 December 2003 15:37:42 -0600, Rob Landley wrote:
> > On Friday 12 December 2003 08:24, Jörn Engel wrote:
> > > > ...and it sucks.  Same problem as with updatedb - 99% of all work is
> > > > bogus, but you don't know which 99%, because the one knowing about
> > > > it, the kernel, doesn't tell you a thing.
> > >
> > > Actually, updatedb sucks even worse.  The database is notoriously
> > > outdated and each run of updatedb has the effect of flushing the
> > > cache.  Because of the cache-flushing effect, you cannot even run it
> > > with maximum niceness.  Running it still hurts you *afterwards*.
> > >
> > > Same goes for your userland daemon without kernel support.
> >
> > 1) The date optimization, only looking at files newer than the last run,
> > means you can avoid looking at 90% of the filesystem.
>
> And how do you figure out the date?  ;)

You look at the dentry.  Yes, you have to traverse the filesystem.  There are 
a number of things that care about traversing the filesystem, including 
scheduled backups, updatedb, and potentially this thing.  I realise that you 
believe nobody should ever do this.  I don't care.
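
A minimal sketch of the date optimization described above, assuming a
userspace scanner that records the time of its previous run.  The name
changed_since() and the use of st_mtime are illustrative assumptions, not
anything proposed in this thread.  It still has to stat every file in the
tree, but only the files it yields need their contents read:

```python
import os
import stat


def changed_since(root, last_run):
    """Yield regular files under root modified after last_run (epoch seconds).

    Hypothetical helper: the whole tree is still traversed, but only the
    metadata of each file is examined, never its contents.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # file vanished between the listing and the stat
            if stat.S_ISREG(st.st_mode) and st.st_mtime > last_run:
                yield path
```

Note that mtime can be set back by the file's owner, so a real tool might
compare st_ctime instead; that choice is left open here.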

> > 2) If drop-behind ever gets working, life is good for this sort of thing.
> >  If not, there's always O_DIRECT or its replacement (whatever Linus and
> > the oracle guy were arguing about last month)...
>
> Not sure what drop-behind is.  Sounds interesting.

Google for it.

> Anyway, what updatedb, userspace defragmenters etc. need is a
> notification, what has changed.

Most of this infrastructure is there already.  Documentation/dnotify.txt.

> Without this notification, they have
> to look at everything and figure it out themselves.  Ask the network
> people why select doesn't scale too well

Were you here for the discussion of I/O completion ports as a potential 
solution to the "thundering herd" problem a few years ago?  I haven't checked 
how similar epoll is; I haven't had a scalability bottleneck in that 
area recently...

> - updatedb is even worse
> because it doesn't even notice that *anything* has changed, much less
> what change happened.

You don't WANT it to.  You want to batch up the work so that if a file changes 
every thirty seconds you're not constantly being woken up to deal with it 
again!

> O_DIRECT, O_STREAMING or O_WHATEVER is also a different beast.  With
> streaming media, there is no way to avoid touching the data in the
> first place.  If there was, we could do even better, but there isn't.

If you need to look at the contents of the disk (to check for the runs of null 
bytes), then you need to look at the contents of the disk.  Once you've 
identified data you need to load in, if you don't want to push everything 
else out of cache, you should be able to give some kind of hint that it 
shouldn't keep this data around.  The new ionice work is at least a step in 
this direction, although it doesn't address this particular problem, and I 
believe the dentry cache is a different beast than the page cache...
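
As a rough sketch of that idea, assuming a current Linux system where
posix_fadvise() is available: scan a file for block-aligned runs of zero
bytes, then hint that the pages just read need not stay in the page cache.
The function name zero_blocks() and the 4k block size are assumptions for
illustration only.  The hint is advisory and concerns the page cache, not
the dentry cache.

```python
import os

BLOCK = 4096  # assumed filesystem block size


def zero_blocks(path, block=BLOCK):
    """Return [(offset, length), ...] for block-aligned runs of zero bytes."""
    runs = []
    zero = bytes(block)
    fd = os.open(path, os.O_RDONLY)
    try:
        offset = 0
        run_start = None
        while True:
            buf = os.read(fd, block)
            if not buf:
                break
            if len(buf) == block and buf == zero:
                if run_start is None:
                    run_start = offset  # a new run of zero blocks begins
            elif run_start is not None:
                runs.append((run_start, offset - run_start))
                run_start = None
            offset += len(buf)
        if run_start is not None:
            runs.append((run_start, offset - run_start))
        # Advisory hint: we are done with this data, no need to cache it.
        if hasattr(os, "posix_fadvise"):
            try:
                os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
            except OSError:
                pass  # the hint is optional; ignore filesystems that refuse it
    finally:
        os.close(fd)
    return runs
```

A partial block at the end of the file is never reported, since a hole
smaller than one block would not save any space.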

> For updatedb there is.

I'm not talking about updatedb.

> Jörn

Rob


* Re: Is there a "make hole" (truncate in middle) syscall?
  2003-12-16  5:43                       ` Rob Landley
@ 2003-12-16 11:05                         ` Jörn Engel
  0 siblings, 0 replies; 59+ messages in thread
From: Jörn Engel @ 2003-12-16 11:05 UTC (permalink / raw)
  To: Rob Landley; +Cc: linux-kernel

On Mon, 15 December 2003 23:43:10 -0600, Rob Landley wrote:
> On Monday 15 December 2003 06:47, Jörn Engel wrote:
> 
> > Anyway, what updatedb, userspace defragmenters etc. need is a
> > notification, what has changed.
> 
> Most of this infrastructure is there already.  Documentation/dnotify.txt.

Right, although that specific implementation only cares about things
like konqueror not constantly polling one specific directory.

> > - updatedb is even worse
> > because it doesn't even notice that *anything* has changed, much less
> > what change happened.
> 
> You don't WANT it to.  You want to batch up the work so that if a file changes
> every thirty seconds you're not constantly being woken up to deal with it 
> again!

Well, *I* sure want it to.  Batching things up is fine, just the
current implementation isn't.

Yeah, it is good enough most of the time, so no one cares enough.

> > For updatedb there is.
> 
> I'm not talking about updatedb.

Then talk about checking for zero runs.  Same thing.  Either you have
to read *everything* from disk, so you don't want to do this too
often, or you need a way to figure out what has changed since the
last run.

Or talk about updates, same thing as well.

What we do right now is basically polling: very expensive and
therefore infrequent polling.  Sure, it works, but it's far from
perfect.


Anyway, the subject has already drifted and we surely won't reach
agreement.  It doesn't matter anyway as long as no one provides the
patches, so let's just drop the thread, ok?

Jörn

-- 
/* Keep these two variables together */
int bar;


end of thread, other threads:[~2003-12-16 11:05 UTC | newest]

Thread overview: 59+ messages
2003-12-04 20:32 Is there a "make hole" (truncate in middle) syscall? Rob Landley
2003-12-04 20:55 ` Måns Rullgård
2003-12-04 21:10 ` Szakacsits Szabolcs
2003-12-05  0:02   ` Rob Landley
2003-12-04 22:33     ` Szakacsits Szabolcs
2003-12-05 11:22     ` Helge Hafting
2003-12-05 12:11   ` Måns Rullgård
2003-12-05 22:41     ` Mike Fedyk
2003-12-05 23:25       ` Måns Rullgård
2003-12-05 23:33       ` Szakacsits Szabolcs
2003-12-05 23:25     ` Szakacsits Szabolcs
2003-12-04 21:48 ` Mike Fedyk
2003-12-04 23:59   ` Rob Landley
2003-12-05 22:42     ` Olaf Titz
2003-12-04 22:53 ` Peter Chubb
2003-12-05  1:04   ` Philippe Troin
2003-12-05  2:39     ` Peter Chubb
2003-12-08  4:03     ` bill davidsen
2003-12-04 23:23 ` Andy Isaacson
2003-12-04 23:42   ` Szakacsits Szabolcs
2003-12-05  2:03     ` Mike Fedyk
2003-12-05  7:09       ` Ville Herva
2003-12-05 11:22   ` Anton Altaparmakov
2003-12-05 11:44     ` viro
2003-12-05 14:27       ` Anton Altaparmakov
2003-12-05 21:00   ` sparse file performance (was Re: Is there a "make hole" (truncate in middle) syscall?) Andy Isaacson
2003-12-05 21:12     ` Linus Torvalds
2003-12-08 20:43       ` Andy Isaacson
2003-12-11  5:13 ` Is there a "make hole" (truncate in middle) syscall? Hua Zhong
2003-12-11  6:19   ` Rob Landley
2003-12-11 18:58   ` Andy Isaacson
2003-12-11 19:15     ` Hua Zhong
2003-12-11 19:43       ` Andreas Dilger
2003-12-12 21:37         ` Daniel Phillips
2003-12-11 19:48       ` Jörn Engel
2003-12-11 19:55         ` Hua Zhong
2003-12-11 19:58         ` Andy Isaacson
2003-12-12 12:18           ` Jörn Engel
2003-12-12 15:40             ` Andy Isaacson
2003-12-12 16:03               ` Jörn Engel
2003-12-11 20:32         ` Rob Landley
2003-12-12 12:55           ` Jörn Engel
2003-12-12 13:28             ` Vladimir Saveliev
2003-12-12 13:43               ` Jörn Engel
2003-12-12 13:52                 ` Vladimir Saveliev
2003-12-12 14:04                   ` Jörn Engel
2003-12-12 13:53               ` Rob Landley
2003-12-12 14:01                 ` Vladimir Saveliev
2003-12-12 21:35                   ` Rob Landley
2003-12-15 10:00                     ` Vladimir Saveliev
2003-12-15 11:52                       ` Rob Landley
2003-12-15 13:26                         ` Jörn Engel
2003-12-12 13:39             ` Rob Landley
2003-12-12 13:56               ` Jörn Engel
2003-12-12 14:24                 ` Jörn Engel
2003-12-12 21:37                   ` Rob Landley
2003-12-15 12:47                     ` Jörn Engel
2003-12-16  5:43                       ` Rob Landley
2003-12-16 11:05                         ` Jörn Engel
