All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] git-count-objects: Fix a disk-space under-estimate on Cygwin
@ 2009-11-19 18:46 Ramsay Jones
  2009-11-20  7:00 ` Junio C Hamano
  0 siblings, 1 reply; 7+ messages in thread
From: Ramsay Jones @ 2009-11-19 18:46 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: GIT Mailing-list


Signed-off-by: Ramsay Jones <ramsay@ramsay1.demon.co.uk>
---

Whilst testing an msvc-built git on cygwin, I noticed that
git-count-objects was producing different results to the
cygwin version, viz:

    $ ./git --version
    git version 1.6.5.4.gf034d.MSVC
    $ git --version
    git version 1.6.5

    $ ./git-count-objects
    60 objects, 283 kilobytes

    $ git count-objects
    60 objects, 210 kilobytes

    $ git config core.filemode false

    $ git count-objects
    60 objects, 297 kilobytes
    $ 

Note also that the cygwin version of git gives two different
answers, in the *same* repository, depending on the setting
of core.filemode. (since the "win32 stat() functions" in
compat/cygwin.c are used when core.filemode is false)

Having looked at the msvc code-path, I also noticed something
else a bit odd; the value printed by the msvc version should
be a *lower bound* for the amount of disk-space used (since
it simply totals the actual file sizes). However, as you can
see from the above, when core.filemode is true, the cygwin
version of git is *underestimating* even this (210Kb vs 283kb).

A quick trip to the debugger confirmed that (st.st_blocks * 512)
is less than st.st_size for several files. So, it looked like
the block-size was not in units of 512 bytes; so if we look at
/usr/include/sys/param.h we find:

    #define DEV_BSIZE  1024

and in /usr/include/sys/stat.h we find:

    #define S_BLKSIZE  1024

which seems to indicate a 1K block-size instead. Also, different
filesystem types may use different block-sizes, which is why an
st.st_blksize was added to struct stat; it seems that the Cygwin
struct stat contains the st_blksize field, so lets try a quick
test program:

    $ cat junk.c

    #include <stdio.h>
    #include <sys/stat.h>
    
    int main(int argc, char *argv[])
    {
    	int i, bytes = 0, tot_5 = 0, tot_b = 0;
    
    	for (i=1; i< argc; i++) {
    		struct stat st;
    		if (lstat(argv[i], &st))
    			printf("can't lstat '%s'\n", argv[i]);
    		else {
    			int s = (int)st.st_size;
    			int b = (int)st.st_blocks;
    			int f = (int)st.st_blksize;
    			int d = (b * 512);
    			int e = (b * f);
    
    			printf("%8d, ", s);
    			printf("%2d*512=%-6d (%6d), ", b, d, d-s);
    			printf("%2d*%d=%-6d (%6d)\n", b, f, e, e-s);
    
    			bytes += s;
    			tot_5 += d;
    			tot_b += e;
    		}
    	}
    	printf("total bytes: %6d\n", bytes);
    	printf("b * 512:     %6d  (%d)\n", tot_5, tot_5 - bytes);
    	printf("b * blksize: %6d  (%d)\n", tot_b, tot_b - bytes);
    	exit(0);
    }
    
    $ ls .git/objects/??/* | xargs ./junk.exe
         148,  1*512=512    (   364),  1*1024=1024   (   876)
       10058, 12*512=6144   ( -3914), 12*1024=12288  (  2230)
        3701,  4*512=2048   ( -1653),  4*1024=4096   (   395)
         463,  4*512=2048   (  1585),  4*1024=4096   (  3633)
         148,  1*512=512    (   364),  1*1024=1024   (   876)
       10056, 12*512=6144   ( -3912), 12*1024=12288  (  2232)
         198,  1*512=512    (   314),  1*1024=1024   (   826)
        1782,  4*512=2048   (   266),  4*1024=4096   (  2314)
         192,  1*512=512    (   320),  1*1024=1024   (   832)
        3801,  4*512=2048   ( -1753),  4*1024=4096   (   295)
       14851, 16*512=8192   ( -6659), 16*1024=16384  (  1533)
        3761,  4*512=2048   ( -1713),  4*1024=4096   (   335)
          52,  1*512=512    (   460),  1*1024=1024   (   972)
         956,  4*512=2048   (  1092),  4*1024=4096   (  3140)
         956,  4*512=2048   (  1092),  4*1024=4096   (  3140)
        3703,  4*512=2048   ( -1655),  4*1024=4096   (   393)
       10055, 12*512=6144   ( -3911), 12*1024=12288  (  2233)
    total bytes:  64881
    b * 512:      45568  (-19313)
    b * blksize:  91136  (26255)
    $ 

Note that, in addition to confirming the 1K block-size, it appears
that (on NTFS) files less than 1K are allocated a single block
whereas larger files use a multiple of 4 blocks. Well, that is only
a guess but it at least sounds plausible! ;-)

This patch implements a simple solution which has the useful property
of returning a single answer, irrespective of the core.filemode setting.

ATB,
Ramsay Jones

 Makefile |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/Makefile b/Makefile
index 5d5976f..5624563 100644
--- a/Makefile
+++ b/Makefile
@@ -783,6 +783,10 @@ ifeq ($(uname_O),Cygwin)
 	NO_FAST_WORKING_DIRECTORY = UnfortunatelyYes
 	NO_TRUSTABLE_FILEMODE = UnfortunatelyYes
 	OLD_ICONV = UnfortunatelyYes
+	# The st_blocks field is not in units of 512 bytes, as the code
+	# expects, which leads to an under-estimate of the disk space used.
+	# In order to use an alternate algorithm, we claim to lack st_blocks.
+	NO_ST_BLOCKS_IN_STRUCT_STAT = YesPlease
 	# There are conflicting reports about this.
 	# On some boxes NO_MMAP is needed, and not so elsewhere.
 	# Try commenting this out if you suspect MMAP is more efficient
-- 
1.6.5

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] git-count-objects: Fix a disk-space under-estimate on Cygwin
  2009-11-19 18:46 [PATCH] git-count-objects: Fix a disk-space under-estimate on Cygwin Ramsay Jones
@ 2009-11-20  7:00 ` Junio C Hamano
  2009-11-20  7:49   ` Junio C Hamano
  0 siblings, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2009-11-20  7:00 UTC (permalink / raw)
  To: Ramsay Jones; +Cc: GIT Mailing-list

Ramsay Jones <ramsay@ramsay1.demon.co.uk> writes:

> Signed-off-by: Ramsay Jones <ramsay@ramsay1.demon.co.uk>
> ---

Could you write a concise summary in the commit log message to explain why
it is a correct fix?  Your detailed analysis after three dash lines was an
amusing read, but people who will be reading "git log" output 6 months
down the road won't have an access to it.  Something like...

    Cygwin has st_blocks in struct stat, but at least on NTFS, the field
    counts in blocks of st_blksize bytes, not in 512-byte blocks.

The Makefile already explains what NO_ST_BLOCKS_IN_STRUCT_STAT is about:

    # Define NO_ST_BLOCKS_IN_STRUCT_STAT if your platform does not have st_blocks
    # field that counts the on-disk footprint in 512-byte blocks.

so the above explanation should be enough.

Note that this is not any standard compliance.  POSIX cops out like this
in the <sys/stat.h> description:

    The unit for the st_blocks member of the stat structure is not defined
    within POSIX.1-2008. In some implementations it is 512 bytes. It may
    differ on a file system basis. There is no correlation between values
    of the st_blocks and st_blksize, and the f_bsize (from
    <sys/statvfs.h>) structure members.

IOW, a system like Cygwin that does not use 512-byte is not in violation.

> diff --git a/Makefile b/Makefile
> index 5d5976f..5624563 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -783,6 +783,10 @@ ifeq ($(uname_O),Cygwin)
>  	NO_FAST_WORKING_DIRECTORY = UnfortunatelyYes
>  	NO_TRUSTABLE_FILEMODE = UnfortunatelyYes
>  	OLD_ICONV = UnfortunatelyYes
> +	# The st_blocks field is not in units of 512 bytes, as the code
> +	# expects, which leads to an under-estimate of the disk space used.
> +	# In order to use an alternate algorithm, we claim to lack st_blocks.
> +	NO_ST_BLOCKS_IN_STRUCT_STAT = YesPlease

This comment is redundant, as the explanation of the Makefile variable
already said the same thing.

Also it is somewhat wrong to say "claim to lack".  The Makefile variable
merely means "do not use this field".  The reason may be that the platform
lacks it, or it may have it but it does not count in 512-byte blocks.  It
does not matter---either way the field is not usable.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] git-count-objects: Fix a disk-space under-estimate on Cygwin
  2009-11-20  7:00 ` Junio C Hamano
@ 2009-11-20  7:49   ` Junio C Hamano
  2009-11-21  0:00     ` Ramsay Jones
  0 siblings, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2009-11-20  7:49 UTC (permalink / raw)
  To: Ramsay Jones; +Cc: Junio C Hamano, GIT Mailing-list

When estimating the on-disk footprint of a file, we either used st_blocks
that is counted in 512-byte blocks (traditional unix behaviour), or on
platforms that do not have such st_blocks field in struct stat, simply the
file size itself.  Building with NO_ST_BLOCKS_IN_STRUCT_STAT will choose
the latter implementaion.

POSIX.1 says in its <sys/stat.h> description on this issue:

    The unit for the st_blocks member of the stat structure is not
    defined within POSIX.1-2008. In some implementations it is 512
    bytes. It may differ on a file system basis. There is no
    correlation between values of the st_blocks and st_blksize,
    and the f_bsize (from <sys/statvfs.h>) structure members.

Even though the above explicitly states st_blksize does not have any
correlation, at least on one system (Cygwin on NTFS), the st_blocks field
seems to count in blocks of st_blksize bytes.  A new Makefile variable
ST_BLOCKS_COUNTS_IN_BLKSIZE chooses to use this for the on-disk footprint.

---

Not signed-off yet, but this might be a workable alternative.  Testing on
wider platforms (especially not on "Linux with ext?") might find this
alternative useful.  I dunno.

 Makefile          |    8 +++++++-
 git-compat-util.h |    4 +++-
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/Makefile b/Makefile
index 79195c8..5f2c5de 100644
--- a/Makefile
+++ b/Makefile
@@ -148,6 +148,9 @@ all::
 # Define USE_STDEV below if you want git to care about the underlying device
 # change being considered an inode change from the update-index perspective.
 #
+# Define ST_BLOCKS_COUNTS_IN_ST_BLKSIZE if your platform has st_blocks
+# field that counts the on-disk footprint in st_blksize-byte blocks.
+#
 # Define NO_ST_BLOCKS_IN_STRUCT_STAT if your platform does not have st_blocks
 # field that counts the on-disk footprint in 512-byte blocks.
 #
@@ -776,7 +779,7 @@ ifeq ($(uname_O),Cygwin)
 	NO_FAST_WORKING_DIRECTORY = UnfortunatelyYes
 	NO_TRUSTABLE_FILEMODE = UnfortunatelyYes
 	OLD_ICONV = UnfortunatelyYes
-	NO_ST_BLOCKS_IN_STRUCT_STAT = YesPlease
+	ST_BLOCKS_COUNTS_IN_ST_BLKSIZE = YesPlease
 	# There are conflicting reports about this.
 	# On some boxes NO_MMAP is needed, and not so elsewhere.
 	# Try commenting this out if you suspect MMAP is more efficient
@@ -1126,6 +1129,9 @@ endif
 ifdef NO_D_INO_IN_DIRENT
 	BASIC_CFLAGS += -DNO_D_INO_IN_DIRENT
 endif
+ifdef ST_BLOCKS_COUNTS_IN_ST_BLKSIZE
+	BASIC_CFLAGS += -DST_BLOCKS_COUNTS_IN_ST_BLKSIZE
+endif
 ifdef NO_ST_BLOCKS_IN_STRUCT_STAT
 	BASIC_CFLAGS += -DNO_ST_BLOCKS_IN_STRUCT_STAT
 endif
diff --git a/git-compat-util.h b/git-compat-util.h
index ef60803..bba9c9e 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -240,7 +240,9 @@ extern int git_munmap(void *start, size_t length);
 
 #endif /* NO_MMAP */
 
-#ifdef NO_ST_BLOCKS_IN_STRUCT_STAT
+#ifdef ST_BLOCKS_COUNTS_IN_ST_BLKSIZE
+#define on_disk_bytes(st) ((st).st_blocks * (st).st_blksize)
+#elif defined(NO_ST_BLOCKS_IN_STRUCT_STAT) && NO_ST_BLOCKS_IN_STRUCT_STAT
 #define on_disk_bytes(st) ((st).st_size)
 #else
 #define on_disk_bytes(st) ((st).st_blocks * 512)

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] git-count-objects: Fix a disk-space under-estimate on Cygwin
  2009-11-20  7:49   ` Junio C Hamano
@ 2009-11-21  0:00     ` Ramsay Jones
  2009-11-22  1:21       ` Junio C Hamano
  0 siblings, 1 reply; 7+ messages in thread
From: Ramsay Jones @ 2009-11-21  0:00 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: GIT Mailing-list

Junio C Hamano wrote:
> When estimating the on-disk footprint of a file, we either used st_blocks
> that is counted in 512-byte blocks (traditional unix behaviour), or on
> platforms that do not have such st_blocks field in struct stat, simply the
> file size itself.  Building with NO_ST_BLOCKS_IN_STRUCT_STAT will choose
> the latter implementaion.
> 
> POSIX.1 says in its <sys/stat.h> description on this issue:
> 
>     The unit for the st_blocks member of the stat structure is not
>     defined within POSIX.1-2008. In some implementations it is 512
>     bytes. It may differ on a file system basis. There is no
>     correlation between values of the st_blocks and st_blksize,
>     and the f_bsize (from <sys/statvfs.h>) structure members.
> 
> Even though the above explicitly states st_blksize does not have any
> correlation, at least on one system (Cygwin on NTFS), the st_blocks field
> seems to count in blocks of st_blksize bytes.  A new Makefile variable
> ST_BLOCKS_COUNTS_IN_BLKSIZE chooses to use this for the on-disk footprint.

My first attempt to fix this problem was very similar to this. ;-)

BTW, I thought that st_blocks and st_blksize were both XSI/SUS extensions
and not part of POSIX, but your quote above contradicts that. Also, I don't
know that you can count on *both* fields always being present (I have not
personally used a system that didn't have st_blksize if it had st_blocks,
but I don't think it's guaranteed).

Anyway, I decided against this kind of solution because it didn't address
the problem of returning different answers depending on the setting of
core.filemode.

Having said that, maybe that's not a big deal; in everyday use I can't
imagine that anyone would change the core.filemode setting more than once,
if ever. (I *have* been doing that quite a bit while testing an msvc-built
git on cygwin; but again, that's probably *not* an everyday usage :-P )

I haven't tried this patch, but I think you may need to add something like
the following (*not tested*):

--- >8 ---
diff --git a/compat/cygwin.c b/compat/cygwin.c
index b4a51b9..7e9edec 100644
--- a/compat/cygwin.c
+++ b/compat/cygwin.c
@@ -53,6 +53,7 @@ static int do_stat(const char *file_name, struct stat *buf, stat_fn_t cygstat)
 		buf->st_size = (off_t)fdata.nFileSizeLow;
 #endif
 		buf->st_blocks = size_to_blocks(buf->st_size);
+		buf->st_blksize = 512;
 		filetime_to_timespec(&fdata.ftLastAccessTime, &buf->st_atim);
 		filetime_to_timespec(&fdata.ftLastWriteTime, &buf->st_mtim);
 		filetime_to_timespec(&fdata.ftCreationTime, &buf->st_ctim);
--- >8 ---

ATB,
Ramsay Jones

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] git-count-objects: Fix a disk-space under-estimate on Cygwin
  2009-11-21  0:00     ` Ramsay Jones
@ 2009-11-22  1:21       ` Junio C Hamano
  2009-11-24 20:07         ` Ramsay Jones
  0 siblings, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2009-11-22  1:21 UTC (permalink / raw)
  To: Ramsay Jones; +Cc: GIT Mailing-list

Ramsay Jones <ramsay@ramsay1.demon.co.uk> writes:

> I haven't tried this patch, but I think you may need to add something like
> the following (*not tested*):
>
> --- >8 ---
> diff --git a/compat/cygwin.c b/compat/cygwin.c
> index b4a51b9..7e9edec 100644
> --- a/compat/cygwin.c
> +++ b/compat/cygwin.c
> @@ -53,6 +53,7 @@ static int do_stat(const char *file_name, struct stat *buf, stat_fn_t cygstat)
>  		buf->st_size = (off_t)fdata.nFileSizeLow;
>  #endif
>  		buf->st_blocks = size_to_blocks(buf->st_size);
> +		buf->st_blksize = 512;
>  		filetime_to_timespec(&fdata.ftLastAccessTime, &buf->st_atim);
>  		filetime_to_timespec(&fdata.ftLastWriteTime, &buf->st_mtim);
>  		filetime_to_timespec(&fdata.ftCreationTime, &buf->st_ctim);
> --- >8 ---

Doesn't this contradict with everything you said?

You are forcing st_blksize to 512 but still return the same old st_blocks;
I do not understand what that would achieve.

In your experiments, st_blocks was reported in 1024-byte blocks and that
size coincided with what was reported in st_blksize, and that was the
whole point of the approach taken by the ST_BLOCKS_COUNTS_IN_BLKSIZE
patch.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] git-count-objects: Fix a disk-space under-estimate on Cygwin
  2009-11-22  1:21       ` Junio C Hamano
@ 2009-11-24 20:07         ` Ramsay Jones
  2009-11-24 22:08           ` Junio C Hamano
  0 siblings, 1 reply; 7+ messages in thread
From: Ramsay Jones @ 2009-11-24 20:07 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: GIT Mailing-list

Junio C Hamano wrote:
> Ramsay Jones <ramsay@ramsay1.demon.co.uk> writes:
> 
>> I haven't tried this patch, but I think you may need to add something like
>> the following (*not tested*):
>>
>> --- >8 ---
>> diff --git a/compat/cygwin.c b/compat/cygwin.c
>> index b4a51b9..7e9edec 100644
>> --- a/compat/cygwin.c
>> +++ b/compat/cygwin.c
>> @@ -53,6 +53,7 @@ static int do_stat(const char *file_name, struct stat *buf, stat_fn_t cygstat)
>>  		buf->st_size = (off_t)fdata.nFileSizeLow;
>>  #endif
>>  		buf->st_blocks = size_to_blocks(buf->st_size);
>> +		buf->st_blksize = 512;
>>  		filetime_to_timespec(&fdata.ftLastAccessTime, &buf->st_atim);
>>  		filetime_to_timespec(&fdata.ftLastWriteTime, &buf->st_mtim);
>>  		filetime_to_timespec(&fdata.ftCreationTime, &buf->st_ctim);
>> --- >8 ---
> 
> Doesn't this contradict with everything you said?

Err... no :-D

Note that my suggested addition to your patch is in the core.filemode == false
code path, and so does not affect the "disk-space under-estimate" problem at all.

[To be clear: the "disk-space under-estimate" problem only happens when
core.filemode == true and the regular cygwin lstat()/stat() functions are used.
When core.filemode == false, the code in compat/cygwin.c (namely cygwin_lstat()
and cygwin_stat()) will (most likely) be called instead. These functions use
WIN32 api calls to implement equivalent, but presumably faster, versions of the
stat functions]

> You are forcing st_blksize to 512 but still return the same old st_blocks;
> I do not understand what that would achieve.

Well, as I said, I haven't tested your patch, or my suggested addition, so I could
well be wrong... but what I aimed to achieve was to:

    - avoid "undefined behaviour" in on_disk_bytes(), since the value in
      st_blksize would otherwise be undefined (ie whatever happened to be
      on the stack-frame of the count_objects() function).
    - initialize the st_blksize field with a value consistent with the
      st_blocks field, which is derived from the st_size field, as the
      number of 512-byte blocks. (see the context line just before the
      + line in the above diff, along with the size_to_blocks macro)
    - return the same answer from this code as before.

Note that the answer returned from this core.filemode == false code path
is different to the core.filemode == true code path. Which is why I *slightly*
prefer my original patch.

HTH.

ATB,
Ramsay Jones

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] git-count-objects: Fix a disk-space under-estimate on Cygwin
  2009-11-24 20:07         ` Ramsay Jones
@ 2009-11-24 22:08           ` Junio C Hamano
  0 siblings, 0 replies; 7+ messages in thread
From: Junio C Hamano @ 2009-11-24 22:08 UTC (permalink / raw)
  To: Ramsay Jones; +Cc: Junio C Hamano, GIT Mailing-list

Ramsay Jones <ramsay@ramsay1.demon.co.uk> writes:

> Err... no :-D
>
> Note that my suggested addition to your patch is in the core.filemode == false
> code path, and so does not affect the "disk-space under-estimate" problem at all.
>
> [To be clear: the "disk-space under-estimate" problem only happens when
> core.filemode == true and the regular cygwin lstat()/stat() functions are used.
> When core.filemode == false, the code in compat/cygwin.c (namely cygwin_lstat()
> and cygwin_stat()) will (most likely) be called instead. These functions use
> WIN32 api calls to implement equivalent, but presumably faster, versions of the
> stat functions]
>
>> You are forcing st_blksize to 512 but still return the same old st_blocks;
>> I do not understand what that would achieve.
>
> Well, as I said, I haven't tested your patch, or my suggested addition, so I could
> well be wrong... but what I aimed to achieve was to:
>
>     - avoid "undefined behaviour" in on_disk_bytes(), since the value in
>       st_blksize would otherwise be undefined (ie whatever happened to be
>       on the stack-frame of the count_objects() function).
>     - initialize the st_blksize field with a value consistent with the
>       st_blocks field, which is derived from the st_size field, as the
>       number of 512-byte blocks. (see the context line just before the
>       + line in the above diff, along with the size_to_blocks macro)
>     - return the same answer from this code as before.

Ah, sorry, so then I misread your comment.  size_to_blocks() in
compat/cygwin.c counts blocks in 512 (I just checked) and you are applying
the reverse.

But you are right.  If the emulation used on cygwin is _not_ doing the
"blocks * blksize is close to size" thing that is not POSIX but you saw in
your experiment on NTFS, and if we need your follow-up patch to make it do
so, there is no point in using my patch.

> Note that the answer returned from this core.filemode == false code path
> is different to the core.filemode == true code path. Which is why I *slightly*
> prefer my original patch.

Thanks for a clarification.  I've already queued both to 'next', but I
will revert mine and make your patch graduate to 'master'.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2009-11-24 22:09 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-11-19 18:46 [PATCH] git-count-objects: Fix a disk-space under-estimate on Cygwin Ramsay Jones
2009-11-20  7:00 ` Junio C Hamano
2009-11-20  7:49   ` Junio C Hamano
2009-11-21  0:00     ` Ramsay Jones
2009-11-22  1:21       ` Junio C Hamano
2009-11-24 20:07         ` Ramsay Jones
2009-11-24 22:08           ` Junio C Hamano

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.