* [GIT PULL] ext4 update for 2.6.37 @ 2010-10-28 4:52 ` Theodore Ts'o 0 siblings, 0 replies; 35+ messages in thread From: Theodore Ts'o @ 2010-10-28 4:52 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-ext4, linux-kernel Hi Linus, Please pull from: git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git for_linus or git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git upstream-merge the merge is somewhat complex, since there's been a lot of work going on in parallel with discard and zeroout changes, as well as the change block_prepare_write/__block_write_begin. The changes pass the xfstests regression test suite, using both 1k and 4k block sizes --- both before the upstream merge (the for_linus branch) and after doing a trial merge with the head of your tree as of Wednesday evening (the upstream-merge branch). The changes this time around have two major features, which are responsible for most of the new lines of code. One is lazy inode table initialization, which allows ext4 file systems to be mkfs'ed very quickly. The second is changing the I/O submission path so that it uses the block I/O layer directly. This makes blktraces much smaller, and makes ext4 far more scalable. On the boxacle "large file create" workload, run with 48 and 192 threads on a 48-core AMD box, ext4 now has a 3x increase in write throughput, and CPU usage has been reduced by a factor of 3-4. Most of this was achieved by reducing spinlock contention on the block queue submission locks. We also added support for run-time discard of unused blocks using the new FITRIM ioctl (which has been run by the linux-fs mailing list as a generic file system-independent interface). And, of course, a lot of bug fixes and clean ups. - Ted Brian King (1): jbd2: Fix I/O hang in jbd2_journal_release_jbd_inode Curt Wohlgemuth (1): ext4: use dedicated slab caches for group_info structures Dmitry Monakhov (1): ext4: optimize orphan_list handling for ext4_setattr Eric Sandeen (10): ext4: stop looping in ext4_num_dirty_pages when max_pages reached ext4: don't bump up LONG_MAX nr_to_write by a factor of 8 ext4: fix oops in trace_ext4_mb_release_group_pa ext4: don't use ext4_allocation_contexts for tracing ext4: queue conversion after adding to inode's completed IO list ext4: remove unused ext4_sb_info members ext4: tidy up a void argument in inode.c ext4: implement writeback livelock avoidance using page tagging ext4: update writeback_index based on last page scanned ext4: move ext4_mb_{get,put}_buddy_cache_lock and make them static Kazuya Mio (1): ext4: fix compile error in ext4_fallocate() Lukas Czerner (11): ext4: check for negative error code from sb_issue_discard ext4: don't hold spinlock while calling ext4_issue_discard() Add helper function for blkdev_issue_zeroout (sb_issue_discard) ext4: add support for lazy inode table initialization ext4: add interface to advertise ext4 features in sysfs ext4: use sb_issue_zeroout in setup_new_group_blocks ext4: use sb_issue_zeroout in ext4_ext_zeroout ext4: Use return value from sb_issue_discard() fs: Add FITRIM ioctl ext4: Add batched discard support for ext4 ext4: add batched_discard into ext4 feature list Maciej Żenczykowski (1): ext4: don't update sb journal_devnum when RO dev Namhyung Kim (1): ext4: Check return value of sb_getblk() and friends Nicolas Kaiser (1): ext4: fix unbalanced mutex unlock in error path of ext4_li_request_new Sergey Senozhatsky (1): ext4: fix NULL pointer dereference in print_daily_error_info() Theodore Ts'o (18): ext4: fix EOFBLOCKS_FL handling jbd2: Add sanity check for attempts to start handle during umount ext4: avoid uninitialized memory references in ext3_htree_next_block() ext4: use search_dirblock() in ext4_dx_find_entry() ext4: use KMEM_CACHE instead of kmem_cache_create ext4: call mpage_da_submit_io() from mpage_da_map_blocks() ext4: simplify ext4_writepage() ext4: inline ext4_writepage() into mpage_da_submit_io() ext4: inline walk_page_buffers() into mpage_da_submit_io ext4: move mpage_put_bnr_to_bhs()'s functionality to mpage_da_submit_io() ext4: use bio layer instead of buffer layer in mpage_da_submit_io ext4: fix kernel oops if the journal superblock has a non-zero j_errno ext4: rename {exit,init}_ext4_*() to ext4_{exit,init}_*() ext4: make various ext4 functions be static ext4: rename {ext,idx}_pblock and inline small extent functions ext4: move flush_completed_IO to fs/ext4/fsync.c and make it static ext4: rename mark_bitmap_end() to ext4_mark_bitmap_end() ext4,jbd2: convert tracepoints to use major/minor numbers Toshiyuki Okajima (2): ext4: improve llseek error handling for overly large seek offsets ext4: fix potential infinite loop in ext4_da_writepages() Wen Congyang (1): ext4: avoid null dereference in trace_ext4_mballoc_discard Documentation/filesystems/ext4.txt | 14 + fs/ext4/Makefile | 2 +- fs/ext4/balloc.c | 5 +- fs/ext4/block_validity.c | 7 +- fs/ext4/dir.c | 2 +- fs/ext4/ext4.h | 110 ++++++-- fs/ext4/ext4_extents.h | 65 ++++- fs/ext4/extents.c | 369 ++++++++++------------- fs/ext4/file.c | 44 +++- fs/ext4/fsync.c | 83 +++++ fs/ext4/ialloc.c | 136 ++++++++- fs/ext4/inode.c | 587 +++++++++++++----------------------- fs/ext4/mballoc.c | 554 ++++++++++++++++++++++------------ fs/ext4/migrate.c | 2 +- fs/ext4/move_extent.c | 22 +- fs/ext4/namei.c | 63 ++--- fs/ext4/page-io.c | 430 ++++++++++++++++++++++++++ fs/ext4/resize.c | 53 +--- fs/ext4/super.c | 531 +++++++++++++++++++++++++++++++-- fs/ext4/xattr.c | 4 +- fs/ext4/xattr.h | 8 +- fs/ioctl.c | 39 +++ fs/jbd2/checkpoint.c | 10 + fs/jbd2/commit.c | 12 +- fs/jbd2/journal.c | 4 +- fs/jbd2/transaction.c | 1 + include/linux/blkdev.h | 8 + include/linux/fs.h | 8 + include/linux/jbd2.h | 2 +- include/linux/percpu_counter.h | 10 + include/linux/writeback.h | 2 + include/trace/events/ext4.h | 378 ++++++++++++++--------- include/trace/events/jbd2.h | 78 +++-- 33 files changed, 2513 insertions(+), 1130 deletions(-) ^ permalink raw reply [flat|nested] 35+ messages in thread
* [GIT PULL] ext4 update for 2.6.37 @ 2010-10-28 4:52 ` Theodore Ts'o 0 siblings, 0 replies; 35+ messages in thread From: Theodore Ts'o @ 2010-10-28 4:52 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-ext4, linux-kernel Hi Linus, Please pull from: git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git for_linus or git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git upstream-merge the merge is somewhat complex, since there's been a lot of work going on in parallel with discard and zeroout changes, as well as the change block_prepare_write/__block_write_begin. The changes pass the xfstests regression test suite, using both 1k and 4k block sizes --- both before the upstream merge (the for_linus branch) and after doing a trial merge with the head of your tree as of Wednesday evening (the upstream-merge branch). The changes this time around have two major features, which are responsible for most of the new lines of code. One is lazy inode table initialization, which allows ext4 file systems to be mkfs'ed very quickly. The second is changing the I/O submission path so that it uses the block I/O layer directly. This makes blktraces much smaller, and makes ext4 far more scalable. On the boxacle "large file create" workload, run with 48 and 192 threads on a 48-core AMD box, ext4 now has a 3x increase in write throughput, and CPU usage has been reduced by a factor of 3-4. Most of this was achieved by reducing spinlock contention on the block queue submission locks. We also added support for run-time discard of unused blocks using the new FITRIM ioctl (which has been run by the linux-fs mailing list as a generic file system-independent interface). And, of course, a lot of bug fixes and clean ups. - Ted Brian King (1): jbd2: Fix I/O hang in jbd2_journal_release_jbd_inode Curt Wohlgemuth (1): ext4: use dedicated slab caches for group_info structures Dmitry Monakhov (1): ext4: optimize orphan_list handling for ext4_setattr Eric Sandeen (10): ext4: stop looping in ext4_num_dirty_pages when max_pages reached ext4: don't bump up LONG_MAX nr_to_write by a factor of 8 ext4: fix oops in trace_ext4_mb_release_group_pa ext4: don't use ext4_allocation_contexts for tracing ext4: queue conversion after adding to inode's completed IO list ext4: remove unused ext4_sb_info members ext4: tidy up a void argument in inode.c ext4: implement writeback livelock avoidance using page tagging ext4: update writeback_index based on last page scanned ext4: move ext4_mb_{get,put}_buddy_cache_lock and make them static Kazuya Mio (1): ext4: fix compile error in ext4_fallocate() Lukas Czerner (11): ext4: check for negative error code from sb_issue_discard ext4: don't hold spinlock while calling ext4_issue_discard() Add helper function for blkdev_issue_zeroout (sb_issue_discard) ext4: add support for lazy inode table initialization ext4: add interface to advertise ext4 features in sysfs ext4: use sb_issue_zeroout in setup_new_group_blocks ext4: use sb_issue_zeroout in ext4_ext_zeroout ext4: Use return value from sb_issue_discard() fs: Add FITRIM ioctl ext4: Add batched discard support for ext4 ext4: add batched_discard into ext4 feature list Maciej Żenczykowski (1): ext4: don't update sb journal_devnum when RO dev Namhyung Kim (1): ext4: Check return value of sb_getblk() and friends Nicolas Kaiser (1): ext4: fix unbalanced mutex unlock in error path of ext4_li_request_new Sergey Senozhatsky (1): ext4: fix NULL pointer dereference in print_daily_error_info() Theodore Ts'o (18): ext4: fix EOFBLOCKS_FL handling jbd2: Add sanity check for attempts to start handle during umount ext4: avoid uninitialized memory references in ext3_htree_next_block() ext4: use search_dirblock() in ext4_dx_find_entry() ext4: use KMEM_CACHE instead of kmem_cache_create ext4: call mpage_da_submit_io() from mpage_da_map_blocks() ext4: simplify ext4_writepage() ext4: inline ext4_writepage() into mpage_da_submit_io() ext4: inline walk_page_buffers() into mpage_da_submit_io ext4: move mpage_put_bnr_to_bhs()'s functionality to mpage_da_submit_io() ext4: use bio layer instead of buffer layer in mpage_da_submit_io ext4: fix kernel oops if the journal superblock has a non-zero j_errno ext4: rename {exit,init}_ext4_*() to ext4_{exit,init}_*() ext4: make various ext4 functions be static ext4: rename {ext,idx}_pblock and inline small extent functions ext4: move flush_completed_IO to fs/ext4/fsync.c and make it static ext4: rename mark_bitmap_end() to ext4_mark_bitmap_end() ext4,jbd2: convert tracepoints to use major/minor numbers Toshiyuki Okajima (2): ext4: improve llseek error handling for overly large seek offsets ext4: fix potential infinite loop in ext4_da_writepages() Wen Congyang (1): ext4: avoid null dereference in trace_ext4_mballoc_discard Documentation/filesystems/ext4.txt | 14 + fs/ext4/Makefile | 2 +- fs/ext4/balloc.c | 5 +- fs/ext4/block_validity.c | 7 +- fs/ext4/dir.c | 2 +- fs/ext4/ext4.h | 110 ++++++-- fs/ext4/ext4_extents.h | 65 ++++- fs/ext4/extents.c | 369 ++++++++++------------- fs/ext4/file.c | 44 +++- fs/ext4/fsync.c | 83 +++++ fs/ext4/ialloc.c | 136 ++++++++- fs/ext4/inode.c | 587 +++++++++++++----------------------- fs/ext4/mballoc.c | 554 ++++++++++++++++++++++------------ fs/ext4/migrate.c | 2 +- fs/ext4/move_extent.c | 22 +- fs/ext4/namei.c | 63 ++--- fs/ext4/page-io.c | 430 ++++++++++++++++++++++++++ fs/ext4/resize.c | 53 +--- fs/ext4/super.c | 531 +++++++++++++++++++++++++++++++-- fs/ext4/xattr.c | 4 +- fs/ext4/xattr.h | 8 +- fs/ioctl.c | 39 +++ fs/jbd2/checkpoint.c | 10 + fs/jbd2/commit.c | 12 +- fs/jbd2/journal.c | 4 +- fs/jbd2/transaction.c | 1 + include/linux/blkdev.h | 8 + include/linux/fs.h | 8 + include/linux/jbd2.h | 2 +- include/linux/percpu_counter.h | 10 + include/linux/writeback.h | 2 + include/trace/events/ext4.h | 378 ++++++++++++++--------- include/trace/events/jbd2.h | 78 +++-- 33 files changed, 2513 insertions(+), 1130 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [GIT PULL] ext4 update for 2.6.37 2010-10-28 4:52 ` Theodore Ts'o (?) @ 2010-10-28 7:50 ` Markus Trippelsdorf -1 siblings, 0 replies; 35+ messages in thread From: Markus Trippelsdorf @ 2010-10-28 7:50 UTC (permalink / raw) To: Theodore Ts'o; +Cc: Linus Torvalds, linux-ext4, linux-kernel On Thu, Oct 28, 2010 at 12:52:07AM -0400, Theodore Ts'o wrote: > Hi Linus, > > Please pull from: > > git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git for_linus > or > git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git upstream-merge > > the merge is somewhat complex, since there's been a lot of work going on > in parallel with discard and zeroout changes, as well as the change > block_prepare_write/__block_write_begin. The changes pass the xfstests > regression test suite, using both 1k and 4k block sizes --- both before > the upstream merge (the for_linus branch) and after doing a trial merge > with the head of your tree as of Wednesday evening (the upstream-merge > branch). This misses a trivial ifdef wrapper in fs/ext4/super.c for !CONFIG_EXT4_FS_XATTR configurations. Something like this: diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 0348ce0..5f82585 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -4773,9 +4773,12 @@ static int __init ext4_init_fs(void) if (err) goto out3; +#ifdef CONFIG_EXT4_FS_XATTR err = ext4_init_xattr(); +#endif if (err) goto out2; + err = init_inodecache(); if (err) goto out1; -- Markus ^ permalink raw reply related [flat|nested] 35+ messages in thread
* -tip: origin tree build failure (was: [GIT PULL] ext4 update for 2.6.37) 2010-10-28 4:52 ` Theodore Ts'o @ 2010-10-28 7:56 ` Ingo Molnar -1 siblings, 0 replies; 35+ messages in thread From: Ingo Molnar @ 2010-10-28 7:56 UTC (permalink / raw) To: Theodore Ts'o; +Cc: Linus Torvalds, linux-ext4, linux-kernel hi Ted, > Theodore Ts'o (18): > ext4: rename {exit,init}_ext4_*() to ext4_{exit,init}_*() Today's -tip fails to build due to upstream commit 5dabfc7 ("ext4: rename {exit,init}_ext4_*() to ext4_{exit,init}_*()"), on all[yes/mod]config with CONFIG_EXT4_FS_XATTR disabled: fs/ext4/super.c:4776: error: implicit declaration of function ‘ext4_init_xattr’ Commit 5dabfc7 renamed init_ext4_xattr to ext4_init_xattr but forgot to update the definition in fs/ext4/xattr.h. The patch below fixes it. Thanks, Ingo Signed-off-by: Ingo Molnar <mingo@elte.hu> --- linux.orig/fs/ext4/xattr.h +++ linux/fs/ext4/xattr.h @@ -122,7 +122,7 @@ ext4_xattr_put_super(struct super_block } static __init inline int -init_ext4_xattr(void) +ext4_init_xattr(void) { return 0; } ^ permalink raw reply [flat|nested] 35+ messages in thread
* -tip: origin tree build failure (was: [GIT PULL] ext4 update for 2.6.37) @ 2010-10-28 7:56 ` Ingo Molnar 0 siblings, 0 replies; 35+ messages in thread From: Ingo Molnar @ 2010-10-28 7:56 UTC (permalink / raw) To: Theodore Ts'o; +Cc: Linus Torvalds, linux-ext4, linux-kernel hi Ted, > Theodore Ts'o (18): > ext4: rename {exit,init}_ext4_*() to ext4_{exit,init}_*() Today's -tip fails to build due to upstream commit 5dabfc7 ("ext4: rename {exit,init}_ext4_*() to ext4_{exit,init}_*()"), on all[yes/mod]config with CONFIG_EXT4_FS_XATTR disabled: fs/ext4/super.c:4776: error: implicit declaration of function ‘ext4_init_xattr’ Commit 5dabfc7 renamed init_ext4_xattr to ext4_init_xattr but forgot to update the definition in fs/ext4/xattr.h. The patch below fixes it. Thanks, Ingo Signed-off-by: Ingo Molnar <mingo@elte.hu> --- linux.orig/fs/ext4/xattr.h +++ linux/fs/ext4/xattr.h @@ -122,7 +122,7 @@ ext4_xattr_put_super(struct super_block } static __init inline int -init_ext4_xattr(void) +ext4_init_xattr(void) { return 0; } -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: -tip: origin tree build failure (was: [GIT PULL] ext4 update for 2.6.37) 2010-10-28 7:56 ` Ingo Molnar @ 2010-10-28 12:12 ` Theodore Tso -1 siblings, 0 replies; 35+ messages in thread From: Theodore Tso @ 2010-10-28 12:12 UTC (permalink / raw) To: Ingo Molnar; +Cc: Linus Torvalds, linux-ext4, linux-kernel On Oct 28, 2010, at 3:56 AM, Ingo Molnar wrote: > > hi Ted, > >> Theodore Ts'o (18): >> ext4: rename {exit,init}_ext4_*() to ext4_{exit,init}_*() > > Today's -tip fails to build due to upstream commit 5dabfc7 ("ext4: rename > {exit,init}_ext4_*() to ext4_{exit,init}_*()"), on all[yes/mod]config with > CONFIG_EXT4_FS_XATTR disabled: > > fs/ext4/super.c:4776: error: implicit declaration of function ‘ext4_init_xattr’ > > Commit 5dabfc7 renamed init_ext4_xattr to ext4_init_xattr but forgot to update the > definition in fs/ext4/xattr.h. The patch below fixes it. Acked-by: "Theodore Ts'o" <tytso@mit.edu> Thanks for catching this, my bad. -- Ted ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: -tip: origin tree build failure (was: [GIT PULL] ext4 update for 2.6.37) @ 2010-10-28 12:12 ` Theodore Tso 0 siblings, 0 replies; 35+ messages in thread From: Theodore Tso @ 2010-10-28 12:12 UTC (permalink / raw) To: Ingo Molnar; +Cc: Linus Torvalds, linux-ext4, linux-kernel On Oct 28, 2010, at 3:56 AM, Ingo Molnar wrote: > > hi Ted, > >> Theodore Ts'o (18): >> ext4: rename {exit,init}_ext4_*() to ext4_{exit,init}_*() > > Today's -tip fails to build due to upstream commit 5dabfc7 ("ext4: rename > {exit,init}_ext4_*() to ext4_{exit,init}_*()"), on all[yes/mod]config with > CONFIG_EXT4_FS_XATTR disabled: > > fs/ext4/super.c:4776: error: implicit declaration of function ‘ext4_init_xattr’ > > Commit 5dabfc7 renamed init_ext4_xattr to ext4_init_xattr but forgot to update the > definition in fs/ext4/xattr.h. The patch below fixes it. Acked-by: "Theodore Ts'o" <tytso@mit.edu> Thanks for catching this, my bad. -- Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: -tip: origin tree build failure (was: [GIT PULL] ext4 update for 2.6.37) 2010-10-28 7:56 ` Ingo Molnar (?) (?) @ 2010-10-28 16:30 ` Linus Torvalds 2010-10-28 16:38 ` Ingo Molnar -1 siblings, 1 reply; 35+ messages in thread From: Linus Torvalds @ 2010-10-28 16:30 UTC (permalink / raw) To: Ingo Molnar; +Cc: Theodore Ts'o, linux-ext4, linux-kernel On Thu, Oct 28, 2010 at 12:56 AM, Ingo Molnar <mingo@elte.hu> wrote: > > Today's -tip fails to build due to upstream commit 5dabfc7 ("ext4: rename > {exit,init}_ext4_*() to ext4_{exit,init}_*()"), on all[yes/mod]config with > CONFIG_EXT4_FS_XATTR disabled: Btw, could you try to write these things so that the changelog doesn't have to be totally rewritten? Now I always end up having to move things around and edit them to be useful from a long-term perspective, which is kind of silly. Linus ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: -tip: origin tree build failure (was: [GIT PULL] ext4 update for 2.6.37) 2010-10-28 16:30 ` Linus Torvalds @ 2010-10-28 16:38 ` Ingo Molnar 2010-10-28 16:55 ` Ted Ts'o 2010-10-28 17:00 ` Linus Torvalds 0 siblings, 2 replies; 35+ messages in thread From: Ingo Molnar @ 2010-10-28 16:38 UTC (permalink / raw) To: Linus Torvalds; +Cc: Theodore Ts'o, linux-ext4, linux-kernel * Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Thu, Oct 28, 2010 at 12:56 AM, Ingo Molnar <mingo@elte.hu> wrote: > > > > Today's -tip fails to build due to upstream commit 5dabfc7 ("ext4: rename > > {exit,init}_ext4_*() to ext4_{exit,init}_*()"), on all[yes/mod]config with > > CONFIG_EXT4_FS_XATTR disabled: > > Btw, could you try to write these things so that the changelog doesn't have to be > totally rewritten? Now I always end up having to move things around and edit them > to be useful from a long-term perspective, which is kind of silly. Heh, i was just lazy and used sfr's linux-next build bug reporting mails as a template ;-) Point taken in any case, i'll read your edited changelog and will change the template accordingly. Would this: Upstream commit 5dabfc7 ("ext4: rename {exit,init}_ext4_*() to ext4_{exit,init}_*()"), breaks the build on all[yes/mod]config with CONFIG_EXT4_FS_XATTR disabled: ... have been better? Thanks, Ingo ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: -tip: origin tree build failure (was: [GIT PULL] ext4 update for 2.6.37) 2010-10-28 16:38 ` Ingo Molnar @ 2010-10-28 16:55 ` Ted Ts'o 2010-10-28 17:00 ` Linus Torvalds 1 sibling, 0 replies; 35+ messages in thread From: Ted Ts'o @ 2010-10-28 16:55 UTC (permalink / raw) To: Ingo Molnar; +Cc: Linus Torvalds, linux-ext4, linux-kernel Linus, if it's helpful, I have a pull request with correctly worded changelogs here: git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git for_linus My apologies again for screwing up these configs. The following has been tested with allnoconfig, and I'm currently in the process of testing allyesconfig (which I'm fairly confident will compile, or at least if it fails, it won't be due to changes that came in via my tree). - Ted Ingo Molnar (2): ext4: Fix build when !CONFIG_EXT4_FS_XATTR fs: build fix when !CONFIG_BLOCK fs/ext4/xattr.h | 2 +- include/linux/fs.h | 1 + 2 files changed, 2 insertions(+), 1 deletions(-) ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: -tip: origin tree build failure (was: [GIT PULL] ext4 update for 2.6.37) 2010-10-28 16:38 ` Ingo Molnar 2010-10-28 16:55 ` Ted Ts'o @ 2010-10-28 17:00 ` Linus Torvalds 2010-10-28 17:17 ` Ingo Molnar 2010-10-28 21:39 ` Junio C Hamano 1 sibling, 2 replies; 35+ messages in thread From: Linus Torvalds @ 2010-10-28 17:00 UTC (permalink / raw) To: Ingo Molnar; +Cc: Theodore Ts'o, linux-ext4, linux-kernel On Thu, Oct 28, 2010 at 9:38 AM, Ingo Molnar <mingo@elte.hu> wrote: > > Point taken in any case, i'll read your edited changelog and will change the > template accordingly. > > Would this: > > Upstream commit 5dabfc7 ("ext4: rename {exit,init}_ext4_*() to > ext4_{exit,init}_*()"), breaks the build on all[yes/mod]config with > CONFIG_EXT4_FS_XATTR disabled: > > ... > > have been better? Yes. Except for the kernel the default git commit abbreviation is borderline too short. Seven hex-chars can easily alias with a few more pulls from me: git will not give aliases at the time it gives a shorthand, but a month or two later the abbreviated commit may no longer be unique. So I suggest using --abbrev=12 or similar. What I ended up writing your commit as was this: ext4: fix compile with CONFIG_EXT4_FS_XATTR disabled Commit 5dabfc78dced ("ext4: rename {exit,init}_ext4_*() to ext4_{exit,init}_*()") causes fs/ext4/super.c:4776: error: implicit declaration of function ‘ext4_init_xattr’ when CONFIG_EXT4_FS_XATTR is disabled. It renamed init_ext4_xattr to ext4_init_xattr but forgot to update the dummy definition in fs/ext4/xattr.h. Signed-off-by: Ingo Molnar <mingo@elte.hu> Acked-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> but that's just me. Linus ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: -tip: origin tree build failure (was: [GIT PULL] ext4 update for 2.6.37) 2010-10-28 17:00 ` Linus Torvalds @ 2010-10-28 17:17 ` Ingo Molnar 2010-10-28 21:39 ` Junio C Hamano 1 sibling, 0 replies; 35+ messages in thread From: Ingo Molnar @ 2010-10-28 17:17 UTC (permalink / raw) To: Linus Torvalds; +Cc: Theodore Ts'o, linux-ext4, linux-kernel * Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Thu, Oct 28, 2010 at 9:38 AM, Ingo Molnar <mingo@elte.hu> wrote: > > > > Point taken in any case, i'll read your edited changelog and will change the > > template accordingly. > > > > Would this: > > > > Upstream commit 5dabfc7 ("ext4: rename {exit,init}_ext4_*() to > > ext4_{exit,init}_*()"), breaks the build on all[yes/mod]config with > > CONFIG_EXT4_FS_XATTR disabled: > > > > ... > > > > have been better? > > Yes. Except for the kernel the default git commit abbreviation is borderline too > short. Seven hex-chars can easily alias with a few more pulls from me: git will > not give aliases at the time it gives a shorthand, but a month or two later the > abbreviated commit may no longer be unique. > > So I suggest using --abbrev=12 or similar. ok. A helper script i use does this: git log --pretty=format:"%h: %s" $@ I have added --abbrev=12. Might make sense to lengthen the %h default in upstream Git as well? > What I ended up writing your commit as was this: > > ext4: fix compile with CONFIG_EXT4_FS_XATTR disabled > > Commit 5dabfc78dced ("ext4: rename {exit,init}_ext4_*() to > ext4_{exit,init}_*()") causes > > fs/ext4/super.c:4776: error: implicit declaration of function ‘ext4_init_xattr’ > > when CONFIG_EXT4_FS_XATTR is disabled. > > It renamed init_ext4_xattr to ext4_init_xattr but forgot to update the > dummy definition in fs/ext4/xattr.h. > > Signed-off-by: Ingo Molnar <mingo@elte.hu> > Acked-by: "Theodore Ts'o" <tytso@mit.edu> > Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> > > but that's just me. Ok, this indeed is much nicer to read. Thanks, Ingo ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: -tip: origin tree build failure (was: [GIT PULL] ext4 update for 2.6.37) @ 2010-10-28 17:17 ` Ingo Molnar 0 siblings, 0 replies; 35+ messages in thread From: Ingo Molnar @ 2010-10-28 17:17 UTC (permalink / raw) To: Linus Torvalds; +Cc: Theodore Ts'o, linux-ext4, linux-kernel * Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Thu, Oct 28, 2010 at 9:38 AM, Ingo Molnar <mingo@elte.hu> wrote: > > > > Point taken in any case, i'll read your edited changelog and will change the > > template accordingly. > > > > Would this: > > > > Upstream commit 5dabfc7 ("ext4: rename {exit,init}_ext4_*() to > > ext4_{exit,init}_*()"), breaks the build on all[yes/mod]config with > > CONFIG_EXT4_FS_XATTR disabled: > > > > ... > > > > have been better? > > Yes. Except for the kernel the default git commit abbreviation is borderline too > short. Seven hex-chars can easily alias with a few more pulls from me: git will > not give aliases at the time it gives a shorthand, but a month or two later the > abbreviated commit may no longer be unique. > > So I suggest using --abbrev=12 or similar. ok. A helper script i use does this: git log --pretty=format:"%h: %s" $@ I have added --abbrev=12. Might make sense to lengthen the %h default in upstream Git as well? > What I ended up writing your commit as was this: > > ext4: fix compile with CONFIG_EXT4_FS_XATTR disabled > > Commit 5dabfc78dced ("ext4: rename {exit,init}_ext4_*() to > ext4_{exit,init}_*()") causes > > fs/ext4/super.c:4776: error: implicit declaration of function ‘ext4_init_xattr’ > > when CONFIG_EXT4_FS_XATTR is disabled. > > It renamed init_ext4_xattr to ext4_init_xattr but forgot to update the > dummy definition in fs/ext4/xattr.h. > > Signed-off-by: Ingo Molnar <mingo@elte.hu> > Acked-by: "Theodore Ts'o" <tytso@mit.edu> > Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> > > but that's just me. Ok, this indeed is much nicer to read. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 35+ messages in thread
* Minimum git commit abbrev length (Was Re: -tip: origin tree build failure (was: [GIT PULL] ext4 update) for 2.6.37) 2010-10-28 17:17 ` Ingo Molnar (?) @ 2010-10-28 17:27 ` Ted Ts'o 2010-10-28 18:28 ` Linus Torvalds -1 siblings, 1 reply; 35+ messages in thread From: Ted Ts'o @ 2010-10-28 17:27 UTC (permalink / raw) To: Ingo Molnar; +Cc: Linus Torvalds, git, linux-kernel On Thu, Oct 28, 2010 at 07:17:01PM +0200, Ingo Molnar wrote: > > Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Yes. Except for the kernel the default git commit abbreviation is > > borderline too short. Seven hex-chars can easily alias with a few > > more pulls from me: git will not give aliases at the time it gives > > a shorthand, but a month or two later the abbreviated commit may > > no longer be unique. > > > > So I suggest using --abbrev=12 or similar. > > ok. A helper script i use does this: > > git log --pretty=format:"%h: %s" $@ > > I have added --abbrev=12. Might make sense to lengthen the %h > default in upstream Git as well? Maybe the right thing to do is add a git config option which allows for a configurable minimum git commit abbreviation length? - Ted ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Minimum git commit abbrev length (Was Re: -tip: origin tree build failure (was: [GIT PULL] ext4 update) for 2.6.37) 2010-10-28 17:27 ` Minimum git commit abbrev length (Was Re: -tip: origin tree build failure (was: [GIT PULL] ext4 update) " Ted Ts'o @ 2010-10-28 18:28 ` Linus Torvalds 2010-10-28 18:54 ` Linus Torvalds 0 siblings, 1 reply; 35+ messages in thread From: Linus Torvalds @ 2010-10-28 18:28 UTC (permalink / raw) To: Ted Ts'o, Ingo Molnar, Linus Torvalds, git, linux-kernel [-- Attachment #1: Type: text/plain, Size: 2132 bytes --] On Thu, Oct 28, 2010 at 10:27 AM, Ted Ts'o <tytso@mit.edu> wrote: > On Thu, Oct 28, 2010 at 07:17:01PM +0200, Ingo Molnar wrote: >> > Linus Torvalds <torvalds@linux-foundation.org> wrote: >> > Yes. Except for the kernel the default git commit abbreviation is >> > borderline too short. Seven hex-chars can easily alias with a few >> > more pulls from me: git will not give aliases at the time it gives >> > a shorthand, but a month or two later the abbreviated commit may >> > no longer be unique. >> > >> > So I suggest using --abbrev=12 or similar. >> >> ok. A helper script i use does this: >> >> git log --pretty=format:"%h: %s" $@ >> >> I have added --abbrev=12. Might make sense to lengthen the %h >> default in upstream Git as well? > > Maybe the right thing to do is add a git config option which allows > for a configurable minimum git commit abbreviation length? Yes. The default of 7 (I think) comes from fairly early in git development, when seven hex digits was a lot (it covers about 250+ million hash values). Back then I thought that 65k revisions was a lot (it was what we were about to hit in BK), and each revision tends to be about 5-10 new objects or so, so a million objects was a big number. These days, the kernel isn't even the largest git project, and even the kernel has about 220k revisions (_much_ bigger than the BK tree ever was) and we are approaching two million objects. At that point, seven hex digits is still unique for a lot of them, but when we're talking about just two orders of magnitude difference between number of objects and the hash size, there _will_ be hash collisions. It's no longer even close to unrealistic - it happens all the time. So I suspect we should both increase the default abbrev that was unrealistically small, _and_ add a way for people to set their own default per-project in the git config file. Maybe something like the attached (not necessarily well-thought-out or well-tested: I also didn't actually change the default, although I suspect we should up it from 7 to at least 10). Linus [-- Attachment #2: patch.diff --] [-- Type: text/x-patch, Size: 2200 bytes --] builtin/describe.c | 2 +- cache.h | 5 +++-- config.c | 8 ++++++++ environment.c | 1 + 4 files changed, 13 insertions(+), 3 deletions(-) diff --git a/builtin/describe.c b/builtin/describe.c index 43caff2..2d98702 100644 --- a/builtin/describe.c +++ b/builtin/describe.c @@ -20,7 +20,7 @@ static int debug; /* Display lots of verbose info */ static int all; /* Any valid ref can be used */ static int tags; /* Allow lightweight tags */ static int longformat; -static int abbrev = DEFAULT_ABBREV; +static int abbrev = 7; /* NOTE! Not DEFAULT_ABBREV */ static int max_candidates = 10; static int found_names; static const char *pattern; diff --git a/cache.h b/cache.h index 33decd9..6c28a81 100644 --- a/cache.h +++ b/cache.h @@ -540,6 +540,7 @@ extern int trust_executable_bit; extern int trust_ctime; extern int quote_path_fully; extern int has_symlinks; +extern int minimum_abbrev, default_abbrev; extern int ignore_case; extern int assume_unchanged; extern int prefer_symlink_refs; @@ -757,8 +758,8 @@ static inline unsigned int hexval(unsigned char c) } /* Convert to/from hex/sha1 representation */ -#define MINIMUM_ABBREV 4 -#define DEFAULT_ABBREV 7 +#define MINIMUM_ABBREV minimum_abbrev +#define DEFAULT_ABBREV default_abbrev struct object_context { unsigned char tree[20]; diff --git a/config.c b/config.c index 4b0a820..474361c 100644 --- a/config.c +++ b/config.c @@ -514,6 +514,14 @@ static int git_default_core_config(const char *var, const char *value) return 0; } + if (!strcmp(var, "core.abbrev")) { + int abbrev = git_config_int(var, value); + if (abbrev < minimum_abbrev || abbrev > 40) + return -1; + default_abbrev = abbrev; + return 0; + } + if (!strcmp(var, "core.loosecompression")) { int level = git_config_int(var, value); if (level == -1) diff --git a/environment.c b/environment.c index de5581f..b98003c 100644 --- a/environment.c +++ b/environment.c @@ -15,6 +15,7 @@ int user_ident_explicitly_given; int trust_executable_bit = 1; int trust_ctime = 1; int has_symlinks = 1; +int minimum_abbrev = 4, default_abbrev = 7; int ignore_case; int assume_unchanged; int prefer_symlink_refs; ^ permalink raw reply related [flat|nested] 35+ messages in thread
* Re: Minimum git commit abbrev length (Was Re: -tip: origin tree build failure (was: [GIT PULL] ext4 update) for 2.6.37) 2010-10-28 18:28 ` Linus Torvalds @ 2010-10-28 18:54 ` Linus Torvalds 2010-10-29 0:14 ` Minimum git commit abbrev length (Was Re: -tip: origin tree build failure Brandon Casey 0 siblings, 1 reply; 35+ messages in thread From: Linus Torvalds @ 2010-10-28 18:54 UTC (permalink / raw) To: Ted Ts'o, Ingo Molnar, Linus Torvalds, git, linux-kernel On Thu, Oct 28, 2010 at 11:28 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Yes. The default of 7 (I think) comes from fairly early in git > development, when seven hex digits was a lot (it covers about 250+ > million hash values). Back then I thought that 65k revisions was a lot > (it was what we were about to hit in BK), and each revision tends to > be about 5-10 new objects or so, so a million objects was a big > number. > > These days, the kernel isn't even the largest git project, and even > the kernel has about 220k revisions (_much_ bigger than the BK tree > ever was) and we are approaching two million objects. At that point, > seven hex digits is still unique for a lot of them, but when we're > talking about just two orders of magnitude difference between number > of objects and the hash size, there _will_ be hash collisions. It's no > longer even close to unrealistic - it happens all the time. Hmm. In fact, in the kernel, we currently have about twelve thousand objects that end up having collisions in 7 hex digits. Even in the old historical BK kernel tree, we have over a thousand objects that collide (each bucket in both cases gets just two objects, there are as of yet no multiple collisions, which is what you'd expect with a good hash). See with git rev-list --objects --all | cut -c1-7 | sort | uniq -dc and in fact git itself has a few collisions (but currently just 44 objects ending up sharing 22 SHA1 buckets in 7 digits). With each digit, you'd expect the collisions to decrease by a factor of 16, and that is indeed exactly what happens. For my current kernel tree I get: - 7 digits: 5823 buckets with duplicates (ie 11646 objects that aren't unique) - 8: 406 - 9: 30 - 10: 1 - 11: 0 so 12 hex digits is indeed pretty safe for the kernel, and is likely to remain so until the kernel history grows by a factor of 16. Linus ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Minimum git commit abbrev length (Was Re: -tip: origin tree build failure 2010-10-28 18:54 ` Linus Torvalds @ 2010-10-29 0:14 ` Brandon Casey 0 siblings, 0 replies; 35+ messages in thread From: Brandon Casey @ 2010-10-29 0:14 UTC (permalink / raw) To: torvalds; +Cc: tytso, mingo, git, Brandon Casey From: Brandon Casey <drafnel@gmail.com> [dropped linux-kernel list from discussion] On 10/28/2010 01:54 PM, Linus Torvalds wrote: > and in fact git itself has a few collisions (but currently just 44 > objects ending up sharing 22 SHA1 buckets in 7 digits). > > With each digit, you'd expect the collisions to decrease by a factor > of 16, and that is indeed exactly what happens. For my current kernel > tree I get: > > - 7 digits: 5823 buckets with duplicates (ie 11646 objects that aren't unique) > - 8: 406 > - 9: 30 > - 10: 1 > - 11: 0 > > so 12 hex digits is indeed pretty safe for the kernel, and is likely > to remain so until the kernel history grows by a factor of 16. Perhaps we should calculate DEFAULT_ABBREV dynamically? Something like the code down below would use 11 digits for the current kernel repo, and would bump up to 12 digits once it hit 4M objects. It would use 9 digits for git.git. The digits of abbrev are bumped up for each factor of 4 objects. This seems to produce a desirable number for the git.git and linux-2.6.git repos. estimate_loose_objects() quickly estimates the number of loose objects using Junio's code extracted from gc.c. default_abbrev() estimates the loose objects, counts the number of packed objects, and calculates a default abbreviation width. Subsequent calls to default_abbrev() return the previously calculated value. This is obviously not a patch meant for inclusion, just an example. --- builtin/describe.c | 4 ++- cache.h | 3 +- sha1_name.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 66 insertions(+), 2 deletions(-) diff --git a/builtin/describe.c b/builtin/describe.c index 43caff2..551b491 100644 --- a/builtin/describe.c +++ b/builtin/describe.c @@ -20,7 +20,7 @@ static int debug; /* Display lots of verbose info */ static int all; /* Any valid ref can be used */ static int tags; /* Allow lightweight tags */ static int longformat; -static int abbrev = DEFAULT_ABBREV; +static int abbrev; static int max_candidates = 10; static int found_names; static const char *pattern; @@ -386,6 +386,8 @@ int cmd_describe(int argc, const char **argv, const char *prefix) OPT_END(), }; + abbrev = DEFAULT_ABBREV; + argc = parse_options(argc, argv, prefix, options, describe_usage, 0); if (max_candidates < 0) max_candidates = 0; diff --git a/cache.h b/cache.h index 33decd9..049ef2b 100644 --- a/cache.h +++ b/cache.h @@ -758,7 +758,7 @@ static inline unsigned int hexval(unsigned char c) /* Convert to/from hex/sha1 representation */ #define MINIMUM_ABBREV 4 -#define DEFAULT_ABBREV 7 +#define DEFAULT_ABBREV default_abbrev() struct object_context { unsigned char tree[20]; @@ -766,6 +766,7 @@ struct object_context { unsigned mode; }; +extern int default_abbrev(void); extern int get_sha1(const char *str, unsigned char *sha1); extern int get_sha1_with_mode_1(const char *str, unsigned char *sha1, unsigned *mode, int gently, const char *prefix); static inline int get_sha1_with_mode(const char *str, unsigned char *sha1, unsigned *mode) diff --git a/sha1_name.c b/sha1_name.c index 484081d..8d622fe 100644 --- a/sha1_name.c +++ b/sha1_name.c @@ -7,6 +7,67 @@ #include "refs.h" #include "remote.h" +static unsigned long estimate_loose_objects(unsigned threshold) +{ + /* + * Quickly estimate how many loose objects there are. + * Because SHA-1 is evenly distributed, we can check + * only one and get a reasonable estimate. + */ + char path[PATH_MAX]; + const char *objdir = get_object_directory(); + DIR *dir; + struct dirent *ent; + unsigned long num_loose = 0; + + if (sizeof(path) <= snprintf(path, sizeof(path), "%s/17", objdir)) { + warning("insanely long object directory %.*s", 50, objdir); + return 0; + } + + dir = opendir(path); + if (!dir) + return 0; + + threshold = (threshold + 255) / 256; + while ((ent = readdir(dir)) != NULL) { + if (strspn(ent->d_name, "0123456789abcdef") != 38 || + ent->d_name[38] != '\0') + continue; + num_loose++; + if (threshold && num_loose > threshold) + break; + } + closedir(dir); + return num_loose * 256; +} + +int default_abbrev(void) +{ + static int default_abbrev; + unsigned long count; + struct packed_git *p; + + if (default_abbrev) + return default_abbrev; + + count = estimate_loose_objects(0); + + prepare_packed_git(); + for (p = packed_git; p; p = p->next) + if (!open_pack_index(p)) + count += p->num_objects; + + default_abbrev = 1; + while ((count >>= 2)) + default_abbrev++; + + if (default_abbrev < MINIMUM_ABBREV) + default_abbrev = MINIMUM_ABBREV; + + return default_abbrev; +} + static int find_short_object_filename(int len, const char *name, unsigned char *sha1) { struct alternate_object_database *alt; -- 1.7.3.1 ^ permalink raw reply related [flat|nested] 35+ messages in thread
* Re: -tip: origin tree build failure 2010-10-28 17:00 ` Linus Torvalds 2010-10-28 17:17 ` Ingo Molnar @ 2010-10-28 21:39 ` Junio C Hamano 2010-10-28 21:50 ` Linus Torvalds 2011-03-10 22:37 ` [PATCH] find_unique_abbrev(): honor caller-supplied "len" better Junio C Hamano 1 sibling, 2 replies; 35+ messages in thread From: Junio C Hamano @ 2010-10-28 21:39 UTC (permalink / raw) To: Linus Torvalds; +Cc: Ingo Molnar, Theodore Ts'o, linux-ext4, linux-kernel Linus Torvalds <torvalds@linux-foundation.org> writes: > Yes. Except for the kernel the default git commit abbreviation is > borderline too short. Seven hex-chars can easily alias with a few more > pulls from me: git will not give aliases at the time it gives a > shorthand, but a month or two later the abbreviated commit may no > longer be unique. > > So I suggest using --abbrev=12 or similar. Would a new configuration to specify how many more letters to ensure the uniqueness at the time of generation make sense? By the way, I noticed that you started sending patches as attachments lately. What made you change your mind? Documentation/config.txt | 9 +++++++++ cache.h | 1 + config.c | 7 +++++++ environment.c | 1 + sha1_name.c | 4 +++- 5 files changed, 21 insertions(+), 1 deletions(-) diff --git a/Documentation/config.txt b/Documentation/config.txt index 538ebb5..6994338 100644 --- a/Documentation/config.txt +++ b/Documentation/config.txt @@ -374,6 +374,15 @@ core.warnAmbiguousRefs:: If true, git will warn you if the ref name you passed it is ambiguous and might match multiple refs in the .git/refs/ tree. True by default. +core.abbrevguard:: + Even though git makes sure that it uses enough hexdigits to show + an abbreviated object name unambiguously, as more objects are + added to the repository over time, a short name that used to be + unique will stop being unique. Git uses this many extra hexdigits + that are more than necessary to make the object name currently + unique, in the hope that its output will stay unique a bit longer. + Defaults to 0. + core.compression:: An integer -1..9, indicating a default compression level. -1 is the zlib default. 0 means no compression, diff --git a/cache.h b/cache.h index 33decd9..931fb59 100644 --- a/cache.h +++ b/cache.h @@ -545,6 +545,7 @@ extern int assume_unchanged; extern int prefer_symlink_refs; extern int log_all_ref_updates; extern int warn_ambiguous_refs; +extern int unique_abbrev_extra_length; extern int shared_repository; extern const char *apply_default_whitespace; extern const char *apply_default_ignorewhitespace; diff --git a/config.c b/config.c index 4b0a820..1aa72c2 100644 --- a/config.c +++ b/config.c @@ -489,6 +489,13 @@ static int git_default_core_config(const char *var, const char *value) return 0; } + if (!strcmp(var, "core.abbrevguard")) { + unique_abbrev_extra_length = git_config_int(var, value); + if (unique_abbrev_extra_length < 0) + unique_abbrev_extra_length = 0; + return 0; + } + if (!strcmp(var, "core.bare")) { is_bare_repository_cfg = git_config_bool(var, value); return 0; diff --git a/environment.c b/environment.c index de5581f..92e16b1 100644 --- a/environment.c +++ b/environment.c @@ -21,6 +21,7 @@ int prefer_symlink_refs; int is_bare_repository_cfg = -1; /* unspecified */ int log_all_ref_updates = -1; /* unspecified */ int warn_ambiguous_refs = 1; +int unique_abbrev_extra_length; int repository_format_version; const char *git_commit_encoding; const char *git_log_output_encoding; diff --git a/sha1_name.c b/sha1_name.c index 484081d..4a226ad 100644 --- a/sha1_name.c +++ b/sha1_name.c @@ -206,7 +206,9 @@ const char *find_unique_abbrev(const unsigned char *sha1, int len) if (exists ? !status : status == SHORT_NAME_NOT_FOUND) { - hex[len] = 0; + int cut_at = len + unique_abbrev_extra_length; + cut_at = (cut_at < 40) ? cut_at : 40; + hex[cut_at] = 0; return hex; } len++; ^ permalink raw reply related [flat|nested] 35+ messages in thread
* Re: -tip: origin tree build failure 2010-10-28 21:39 ` Junio C Hamano @ 2010-10-28 21:50 ` Linus Torvalds 2011-03-10 22:37 ` [PATCH] find_unique_abbrev(): honor caller-supplied "len" better Junio C Hamano 1 sibling, 0 replies; 35+ messages in thread From: Linus Torvalds @ 2010-10-28 21:50 UTC (permalink / raw) To: Junio C Hamano; +Cc: Ingo Molnar, Theodore Ts'o, linux-ext4, linux-kernel On Thu, Oct 28, 2010 at 2:39 PM, Junio C Hamano <gitster@pobox.com> wrote: > > By the way, I noticed that you started sending patches as attachments > lately. What made you change your mind? Nothing. I still hate them. But the tools I use (web interface to gmail) are broken in this respect. There's no way to include a file, or specify that an attachement should be inlined. And don't tell me about IMAP - if I wanted to use IMAP, I'd be living in a padded cell. I have a deep love/hate relationship with gmail. Many things make it wonderful, and I'm not regretting the switch (which was initially just a trial while traveling). But it has two issues that I absolutely detest: (a) the idiotic inability to inline attachements and (b) the android gmail app is a total piece of shit and cannot even do simple text messages. Crazy. (there are other small annoyances, but they are smallish in comparison to the above big honking bugs) Does anybody know anybody who works on the google mail clients and could raise these as bugs inside google? Linus ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: -tip: origin tree build failure @ 2010-10-28 21:50 ` Linus Torvalds 0 siblings, 0 replies; 35+ messages in thread From: Linus Torvalds @ 2010-10-28 21:50 UTC (permalink / raw) To: Junio C Hamano; +Cc: Ingo Molnar, Theodore Ts'o, linux-ext4, linux-kernel On Thu, Oct 28, 2010 at 2:39 PM, Junio C Hamano <gitster@pobox.com> wrote: > > By the way, I noticed that you started sending patches as attachments > lately. What made you change your mind? Nothing. I still hate them. But the tools I use (web interface to gmail) are broken in this respect. There's no way to include a file, or specify that an attachement should be inlined. And don't tell me about IMAP - if I wanted to use IMAP, I'd be living in a padded cell. I have a deep love/hate relationship with gmail. Many things make it wonderful, and I'm not regretting the switch (which was initially just a trial while traveling). But it has two issues that I absolutely detest: (a) the idiotic inability to inline attachements and (b) the android gmail app is a total piece of shit and cannot even do simple text messages. Crazy. (there are other small annoyances, but they are smallish in comparison to the above big honking bugs) Does anybody know anybody who works on the google mail clients and could raise these as bugs inside google? Linus -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH] find_unique_abbrev(): honor caller-supplied "len" better 2010-10-28 21:39 ` Junio C Hamano 2010-10-28 21:50 ` Linus Torvalds @ 2011-03-10 22:37 ` Junio C Hamano 2011-03-10 23:07 ` Linus Torvalds 1 sibling, 1 reply; 35+ messages in thread From: Junio C Hamano @ 2011-03-10 22:37 UTC (permalink / raw) To: git; +Cc: Linus Torvalds, Namhyung Kim The caller of this function wants to ensure that the returned string is a unique abbreviation of the object name, and at least "len" characters long. When "len" is too short to ensure uniqueness with only that many characters, it returns minimally unique prefix (i.e. if you dropped the last character, there would be two or more objects that share that same prefix as their names in the repository). An earlier change introduced core.abbrevguard configuration with a realization that a short prefix that is unique today may not stay unique forever, as new objects are added to the repository. When "len" is shorter than the length necessary to ensure uniqueness today, instead of returning a string that is only one character longer than the longest ambiguous prefix, we want to add that many extra characters (in addition to the "one character that is absolutely needed to make it unique today") to ensure uniqueness for longer time. The code however forgot that the function may be called with a "len" that is long enough. If an object is uniquely identifiable with only 4 leading characters today, and if the caller gives 7 as len and the guard is set to 3, it returned 10 hexdigits, which was 3 characters longer than necessary. We should instead return 7 leading characters in such a case, as that is in line with the original intention of using 3 characters more than absolutely necessary to give the disambiguation we find today a better chance to survive. Signed-off-by: Junio C Hamano <gitster@pobox.com> --- * A three-line patch, with 35 new comment lines and four paragraph commit log message, to fix a stupid thinko that I noticed during a separate discussion with Namhyung-ssi, who wanted to add an option to ensure uniqueness of the truncated commit name in the human-readable blame output. sha1_name.c | 38 ++++++++++++++++++++++++++++++++++++++ 1 files changed, 38 insertions(+), 0 deletions(-) diff --git a/sha1_name.c b/sha1_name.c index 4a226ad..62950aa 100644 --- a/sha1_name.c +++ b/sha1_name.c @@ -191,6 +191,26 @@ static int get_short_sha1(const char *name, int len, unsigned char *sha1, return status; } +/* + * The caller wants a unique abbreviation of the full object name in + * "sha1" that is at least "len" characters long. A string is a unique + * abbrevation of the full object name when: + * + * (1) there is no other object that shares the returned string as the + * prefix of its name, if sha1 identifies an existing object; or + * + * (2) there is no object that has the returned string as the prefix + * of its name, if sha1 does not identify any existing object. + * + * If there exist two or more objects that share the same N characters + * at the beginning of their object names, N+1 leading characters is + * sufficient to make the abbreviation unique today. As the number of + * objects in the repository grows, however, such an abbreviation + * won't stay unique forever. core.abbrevguard configuration variable + * can be used to add extra G characters when ensuring the uniqueness + * of the abbreviation, i.e. making N+1+G characters the minimum in + * order to keep the result unique a bit longer, instead of just N+1. + */ const char *find_unique_abbrev(const unsigned char *sha1, int len) { int status, exists; @@ -200,6 +220,24 @@ const char *find_unique_abbrev(const unsigned char *sha1, int len) memcpy(hex, sha1_to_hex(sha1), 40); if (len == 40 || !len) return hex; + /* + * Try to see how short a prefix we can feed to get a unique + * hit. When len is sufficiently long, we may find the + * absolute minimum abbreviation that is a lot shorter than + * len, so we try from (len - unique_abbrev_extra_length), to + * avoid the final addition of u_a_e_l to the result getting + * longer than necessary. E.g. an object that is ambiguous + * today with only 3 but is unique with 4 leading characters, + * it should yield 7 character result if the caller called us + * with len=7 with u_a_e_l=3 (or shorter), so we start from 4 + * in such a case. When len is shorter than the minimum + * required to make the result unique, the loop counts up and + * finds the absolute minimum (just one character longer than + * ambiguous truncation) and then we add u_a_e_l to it. + */ + len -= unique_abbrev_extra_length; + if (len <= 0) + len = 1; while (len < 40) { unsigned char sha1_ret[20]; status = get_short_sha1(hex, len, sha1_ret, 1); -- 1.7.4.1.373.g37629 ^ permalink raw reply related [flat|nested] 35+ messages in thread
* Re: [PATCH] find_unique_abbrev(): honor caller-supplied "len" better 2011-03-10 22:37 ` [PATCH] find_unique_abbrev(): honor caller-supplied "len" better Junio C Hamano @ 2011-03-10 23:07 ` Linus Torvalds 2011-03-11 0:40 ` Junio C Hamano 0 siblings, 1 reply; 35+ messages in thread From: Linus Torvalds @ 2011-03-10 23:07 UTC (permalink / raw) To: Junio C Hamano; +Cc: git, Namhyung Kim On Thu, Mar 10, 2011 at 2:37 PM, Junio C Hamano <gitster@pobox.com> wrote: > > The code however forgot that the function may be called with a "len" that > is long enough. If an object is uniquely identifiable with only 4 leading > characters today, and if the caller gives 7 as len and the guard is set to > 3, it returned 10 hexdigits, which was 3 characters longer than necessary. > We should instead return 7 leading characters in such a case, as that is > in line with the original intention of using 3 characters more than > absolutely necessary to give the disambiguation we find today a better > chance to survive. The thing is, that just makes the notion of "abbrevguard" pointless. Why have it? When you pass in 6 as a len, and that isn't sufficient, it expands it to (say) 10. And then you pass in 7 as a length, and now it's sufficient, so it keeps it at 7. That's just stupid. You gave it a bigger length suggestion, and you got a smaller end result. That's crazy. However, I think the _real_ problem is not whether that behavior is really stupid or not. I think the real problem is that abbrevguard really isn't a well-defined, and you get this kind of crazy semantics. So I think the REAL problem is different: (a) DEFAULT_ABBREV is just too damn small. 7 made sense as a random number back when we did this, but we're talking over 5 years ago. The seven comes from commit 47dd0d595d04e. Back then, a million objects was a really almost inconceivably big number. Even the BK tree (that I was going by as a target) was just 65k revisions for Linux, so with most changes only touching a few files, "million" was "long time in the future". Now we're close to 2 million. It turns out 640kB isn't enough for everybody. For the kernel, we have several objects that need 10 digits just for uniqeness right NOW.. 12 digits is a _somewhat_ reasonable safe value for the forseeable future. But 11 would be too short. And I don't think the kernel is the biggest repo. (b) You can't change DEFAULT_ABBREV except with the command line option. (c) Even there it's unnecessarily hard. Want to see your commit numbers abbreviated appropriately too? Oh, you have to use "--abbrev=12 --abbrev-commit". We didn't think the interface through. (d) some places don't even take the command line option. Grep for DEFAULT_ABBREV, and notice how often it's just used as-is. So I would suggest ditching 'unique_abbrev_extra_length' entirely. I doubt anybody uses it, and the whole concept is simply badly designed with crazy semantics as per your patch. Linus ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH] find_unique_abbrev(): honor caller-supplied "len" better 2011-03-10 23:07 ` Linus Torvalds @ 2011-03-11 0:40 ` Junio C Hamano 2011-03-11 1:16 ` Linus Torvalds 0 siblings, 1 reply; 35+ messages in thread From: Junio C Hamano @ 2011-03-11 0:40 UTC (permalink / raw) To: Linus Torvalds; +Cc: git, Namhyung Kim Linus Torvalds <torvalds@linux-foundation.org> writes: > When you pass in 6 as a len, and that isn't sufficient, it expands it > to (say) 10. And then you pass in 7 as a length, and now it's > sufficient, so it keeps it at 7. Hmph, that is not the way I wanted to do things. If len is 6 and the result needs to be 10 to be sufficient, it should still yield 10 if the len you give is 7 (or 8 or 9 or 10). Of course it would be stupid if you don't do it that way. I thought the code should work that way already (unless of course there is an implementation bug). How many characters do we need to name master uniquely today? $ ./git -c core.abbrevguard=0 rev-parse --short=1 master 83c3c Ok, so there are more than one object with 83c3 and 83c3c is the absolute minimum. We can ask for extra for futureproofing. $ ./git -c core.abbrevguard=3 rev-parse --short=1 master 83c3c622 And we get three extra and the result is 8 characters long. What if we give len=7 that is still short? $ ./git -c core.abbrevguard=3 rev-parse --short=7 master 83c3c622 Of course we allow the limit 7 to be busted to keep the futureproofing. $ ./git -c core.abbrevguard=3 rev-parse --short=8 master 83c3c622 Obeying 8 doesn't hurt the futureproofing so the output remains the same. But if you say len=9, you get 9 $ ./git -c core.abbrevguard=3 rev-parse --short=9 master 83c3c6222 even though we know 8 is plenty futureproof, but you get 9 because you asked for 9. What am I missing to be called "just stupid" and "crazy"? The earlier code was adding the three extra over what was asked for. With guard=3 and len=7, you would have got 10 for this object that only needs 5 as absolute minimum today, and giving 8 would have been enough to stay unique with room for 3 extra characters to grow. That is what I was fixing. > (b) You can't change DEFAULT_ABBREV except with the command line option. I would agree that would need changing, but I think that is more or less an independent issue. I thought I took your patch to introduce the configuration when I sent the abbrevguard as a companion patch, but apparently I didn't. That would fix the "grep DEFAULT_ABBREV" issue. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH] find_unique_abbrev(): honor caller-supplied "len" better 2011-03-11 0:40 ` Junio C Hamano @ 2011-03-11 1:16 ` Linus Torvalds 2011-03-11 1:33 ` Junio C Hamano 0 siblings, 1 reply; 35+ messages in thread From: Linus Torvalds @ 2011-03-11 1:16 UTC (permalink / raw) To: Junio C Hamano; +Cc: git, Namhyung Kim On Thu, Mar 10, 2011 at 4:40 PM, Junio C Hamano <gitster@pobox.com> wrote: > > How many characters do we need to name master uniquely today? > > $ ./git -c core.abbrevguard=0 rev-parse --short=1 master > 83c3c > > Ok, so there are more than one object with 83c3 and 83c3c is the absolute > minimum. Actually, that's just lucky. Do this: git rev-list --abbrev=4 --abbrev-commit --all --objects | grep '^.........*$' and you'll see several commits in the git tree that need eight characters to be unique. The fact that your current 'master' isn't one of them, and in fact happens to be one that only needs five, is just pure luck. So even in the (much smaller) git repo, 7 is not a sufficient unique minimum even today. Never mind any future guarding. And your argument fails for exactly that reason: yes, for an _individual_ SHA1, you may think that "five characters is sufficient", and when you then use that random number (5) as a basis for forming your "max 8 character" logic, it fails miserably for other cases. Now, if the "abbrevguard" was based not one one random data-point, but on the _whole_ current state of the repository, things would be different. Then it would actually become a "let's pick a good default abbreviation length for this repo". But that's now that it does. The abbrevguard depends on the one particular SHA1 you're looking at, so you can actually be asking for a longer abbreviation, but still get a _shorter_ end result than when you asked for a shorter abbreviation. IOW, try your example thing not just with "master", but with two extreme commits. For example, try git -c core.abbrevguard=2 rev-parse --short=5 83c3c622 git -c core.abbrevguard=2 rev-parse --short=4 979f7929 and tell me what you get. I _think_ you should get 7 digits for the first case, and 8 digits for the second one. Even though you "asked" for a longer name in the first case, and you had the same abbrevguard. See what I'm saying? I think that's just insane. And it comes exactly from the fact that abbreviation ends up being a "local" thing. Linus ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH] find_unique_abbrev(): honor caller-supplied "len" better 2011-03-11 1:16 ` Linus Torvalds @ 2011-03-11 1:33 ` Junio C Hamano 2011-03-11 2:21 ` Linus Torvalds 2011-03-11 3:03 ` Junio C Hamano 0 siblings, 2 replies; 35+ messages in thread From: Junio C Hamano @ 2011-03-11 1:33 UTC (permalink / raw) To: Linus Torvalds; +Cc: git, Namhyung Kim Linus Torvalds <torvalds@linux-foundation.org> writes: > IOW, try your example thing not just with "master", but with two > extreme commits. For example, try > > git -c core.abbrevguard=2 rev-parse --short=5 83c3c622 > git -c core.abbrevguard=2 rev-parse --short=4 979f7929 > > and tell me what you get. I _think_ you should get 7 digits for the > first case, and 8 digits for the second one. Even though you "asked" > for a longer name in the first case, and you had the same abbrevguard. > > See what I'm saying? I think that's just insane. Hmph, why? That 979f79 one already have enough other objects with similar names, so compared to 83c3c that doesn't, it is natural that you would need more digits to protect its uniqueness, no? The result shouldn't be affected by the value of "short" as long as it is not long enough, as that is merely specifying "at least this many letters". Another thing to realize is that without abbrevguard with the code before the configuration was introduced, you would get very similar results. With --short=5, you would get 5 letters for 83c3c622 and with --short=4, you would still need 8 letters for 979f7929. Because either of these values are not long enough and it is merely a way to specify "at least this many letters". If we don't care about uniqueness, we can just truncate at the number of characters specified by the "len" without doing anything else. That would give us an output of uniform length without the uniqueness guarantee, and "git describe" output cannot even be fed back to git running in the same repository immediately after it is obtained. I don't think that is what we want. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH] find_unique_abbrev(): honor caller-supplied "len" better 2011-03-11 1:33 ` Junio C Hamano @ 2011-03-11 2:21 ` Linus Torvalds 2011-03-11 3:09 ` Junio C Hamano 2011-03-11 3:03 ` Junio C Hamano 1 sibling, 1 reply; 35+ messages in thread From: Linus Torvalds @ 2011-03-11 2:21 UTC (permalink / raw) To: Junio C Hamano; +Cc: git, Namhyung Kim On Thu, Mar 10, 2011 at 5:33 PM, Junio C Hamano <gitster@pobox.com> wrote: > > That 979f79 one already have enough other objects with similar names, so > compared to 83c3c that doesn't, it is natural that you would need more > digits to protect its uniqueness, no? The result shouldn't be affected by > the value of "short" as long as it is not long enough, as that is merely > specifying "at least this many letters" Yes, uniqueness in that sense is sane and has a good definition. But that's _not_ the case when you then randomly add extra <n> digits to it. Why? Because that <n> is meaningless, because what <n> means depends on what the base number was. And the base number is different for different objects. The case of n=0 is special, because it is the "current state". But what does "n=1" mean? Let me make it more explicit by making an extreme example about this. It's extreme just because I'm going to assume that the shortest abbreviation is 1 (in reality it's 4) but that doesn't really change the math, it just makes the numbers smaller/easier. So let's say that we have a repository with just 100 objects. What does that mean? In practical terms, it means that it is not impossible that you will have some object that is unique in a single digit (yes, it may be a bit unlikely, but it's not unreasonable), while you'll have other objects that need three digits. And most will be unique in two. Shortening the numbers that way has a _meaning_: the notion of "unique" is clearly meaningful. Sure, different objects get different lengths, but the different lengths have a very real reason for them. So then (again, to make the numbers small, and the math simple), let's assume abbrevguard=1. What does that MEAN? And I claim that it means something totally _different_ for the different objects, and that's the crazy thing. Because now we're talking about possible _future_ objects, but the likely _future_ uniqueness of "unique in one digit" is TOTALLY DIFFERENT from the future uniqueness of "unique in three digits"! The single-digit uniqueness is going to be gone _long long_ before the three-digit uniqueness is gone. Adding a single digit to the object that currently happens to need only a single-digit unique id will _not_ do a whole lot of future-proofing - if you add another one hundred objects, that object may well need three digits to be unique. But if you add a single digit to the one that currently already needed three digits, you're likely to need to add an order of magnitude more objects to need to extend the three digits to four. See what I'm trying to say? This is why I think abbrevguard is a broken concept, when it is relative to "how unique is the object now". If the abbrevguard was related to the maximum number of digits required for _any_ object in the current repository, it would be meaningful - it would actually be about the _size_ of the current repository, and thus indirectly about the size of a future one. But it isn't. It's always relative to the "local uniqueness", which is only valid for the *current* state, and has very little to do with future growth. Now, to put things in terms of a real repository ("git" itself), and two extreme cases from it: - commit 1dae(0d38b8119de2b67f87e800c860ed958bbed6) currently unique in four digits - commit 979f7929(51913d75f992f87022b75610303a614f) currently unique in eight digits and think about what "abbrevguard" means for those two commits. Let's pick an abbrev-guard of two digits. For the first one, it means that you use six digits total, and for the second one, you'd use 10 digits total. What does that mean for future work? How many objects do we need to add for clashes to start happening? For the first commit, it's _almost certain_ that if you double the size of the repository, those six digits will no longer be unique. For the second case? I can pretty much guarantee that EVEN IF you didn't have any abbrev-guard at all, and you doubled the size of the git repository, the thing would still be unique in eight digits. Why? It's simply *much*much* less likely that you'll get future clashes from new objects in the eight digits. The likelihood of a clash with the currently unique 4-digit object is 16^4=65536 times higher than a clash with the currently 8-digit unique shortening. So it's senseless to add an equal number of digits to the two objects. They simply don't have the same likelihood of future collisions. So what is mathematically the sensible thing? It's actually to extend both objects to the _same_ number of digits. It's _more_ sensible to extend the current four-digit number to eight digits, than it is to extend the currently unique in eight digits by even a single digit. It would future-proof things a fair amount, exactly because the likelihood of future objects clashing with the two objects are totally different. That's why I said that it would be sensible to make the abbrevguard be relative to the current worst-case uniqueness. Because THAT actually is what is probable. If we currently have a maximum uniqueness requirement of 8 characters, it is _probable_ that by the time the project has grown by a factor of 4, we'll need 9 characters (I think, I may have gotten the math wrong). But it is somewhat expensive to calculate "max current uniqueness", so I would suggest ditching the whole notion of "how many extra numbers do I need for futureproofing", and go for just setting the absolute value of DEFAULT_ABBREV. Linus ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH] find_unique_abbrev(): honor caller-supplied "len" better 2011-03-11 2:21 ` Linus Torvalds @ 2011-03-11 3:09 ` Junio C Hamano 0 siblings, 0 replies; 35+ messages in thread From: Junio C Hamano @ 2011-03-11 3:09 UTC (permalink / raw) To: Linus Torvalds; +Cc: Junio C Hamano, git, Namhyung Kim Linus Torvalds <torvalds@linux-foundation.org> writes: > But it is somewhat expensive to calculate "max current uniqueness", so > I would suggest ditching the whole notion of "how many extra numbers > do I need for futureproofing", and go for just setting the absolute > value of DEFAULT_ABBREV. Thanks; apparently our mails crossed... I suspect that in a sane repository with set of reasonably packed objects, we can cheaply get a rough estimate of the number of objects to about an order of magnitude (e.g. "likely to have 10k to 100k objects"), and we might able to come up with a way to automatically raise the default length. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH] find_unique_abbrev(): honor caller-supplied "len" better 2011-03-11 1:33 ` Junio C Hamano 2011-03-11 2:21 ` Linus Torvalds @ 2011-03-11 3:03 ` Junio C Hamano 2011-03-11 5:22 ` Jeff King 2011-03-11 22:45 ` Junio C Hamano 1 sibling, 2 replies; 35+ messages in thread From: Junio C Hamano @ 2011-03-11 3:03 UTC (permalink / raw) To: Junio C Hamano; +Cc: Linus Torvalds, git, Namhyung Kim Junio C Hamano <gitster@pobox.com> writes: > Hmph, why? > > That 979f79 one already have enough other objects with similar names, so > compared to 83c3c that doesn't, it is natural that you would need more > digits to protect its uniqueness, no? Yuck, this is showing my total non-understanding of statistics and secure hashing. The example does not mean the next object we will create in git.git project somehow magically is more likely to have 979 prefix than 83c prefix. In other words, 979f7929 is much less likely to collide with new objects than 83c3c, simply because it has 4 more digits in it (making the likelyhood of collision with the next one by four orders of magnitude in base-16), the likelyhood does not depend on what other objects happen to share the same prefix right now, and adding one digits to each would make it 1/16 less likely to have collision relative to the likelyhood with their current length. In any case, if we want to protect abbreviations we generate today from future collisions, the only sane thing to do is to give sufficiently long "len" that is computed globally by taking the total number of objects into account. We should consider the loop in find_unique_abbrev that makes sure that prefix truncated to len is unique, and increments until the result becomes unique, merely a last-ditch thing, not contributing to futureproofing. Even though adding 1 to make it minimally unique as we have always did before the abbrevguard patch has the same "that 1 extra has different effectiveness for protecting against future collisions, depending on how long the base string is" issue, we cannot get away by adding less than 1 character, and adding more doesn't make it better. So let's scrap the abbrevguard thing. I somehow thought that I already took your "make DEFAULT_ABBREV tweakable" patch, but apparently I didn't. That one is the real fix to the issue of futureproofing. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH] find_unique_abbrev(): honor caller-supplied "len" better 2011-03-11 3:03 ` Junio C Hamano @ 2011-03-11 5:22 ` Jeff King 2011-03-11 5:33 ` Jeff King 2011-03-11 22:45 ` Junio C Hamano 1 sibling, 1 reply; 35+ messages in thread From: Jeff King @ 2011-03-11 5:22 UTC (permalink / raw) To: Junio C Hamano; +Cc: Linus Torvalds, git, Namhyung Kim On Thu, Mar 10, 2011 at 07:03:39PM -0800, Junio C Hamano wrote: > Junio C Hamano <gitster@pobox.com> writes: > > > Hmph, why? > > > > That 979f79 one already have enough other objects with similar names, so > > compared to 83c3c that doesn't, it is natural that you would need more > > digits to protect its uniqueness, no? > > Yuck, this is showing my total non-understanding of statistics and secure > hashing. The example does not mean the next object we will create in > git.git project somehow magically is more likely to have 979 prefix than > 83c prefix. In other words, 979f7929 is much less likely to collide with > new objects than 83c3c, simply because it has 4 more digits in it (making > the likelyhood of collision with the next one by four orders of magnitude > in base-16), the likelyhood does not depend on what other objects happen > to share the same prefix right now, and adding one digits to each would > make it 1/16 less likely to have collision relative to the likelyhood with > their current length. This is basically the birthday problem used in hash collision analysis. Which means that if we decide on some acceptable probability of collision, we can figure out the hash-length required to keep the probability of collision below that acceptable level for a repository with objects. You can find the formulas here: http://en.wikipedia.org/wiki/Birthday_problem The closest version to what we want is: n(p,d) ~ sqrt(2d * ln(1/(1-p))) where "n" is the number of objects in the repository, "p" is the probability of a collision, and "d" is the size of the distribution space. In our case, for sha1 subset with "l" hex characters, it is 2^(4l). And note that this is an approximation, but it should be close enough for our purposes. We can arrange the formula into: d(n,p) = n^2 / 2 / ln(1/(1-p)) and rearrange our length relationship into: l(d) = log2(d) / 4 Then we just pick an acceptable probability of collision. Let's say one in a million, 10^-6. So for a repository with 175,000 objects (close to git.git), to keep the probability of collision at 10^-6, we get d(n,p) = 1.5e16, or 53 bits, or 13 hex characters. Which is obviously a lot more than we're doing now. But remember that this is for _any_ collision. So it's not likely to be for one of the few numbers you write down for reference later. So we could probably get by with a much higher collision probability. At one-in-a-thousand, it's 43 bits or 10 characters. Or we could just take it all the way down to 50%: we are likely to have ae collision, but that's probably OK since it's unlikely to be one of the n out of 175,000 that we actually remember and care about. That's 34 bits, or 8.5 hex characters. For the kernel repository, there are 1.8 million objects. So to hit 50%, we're talking about 41 bits or about 10 characters. Which, as a quick sanity check, matches what Linus said earlier: they are seeing collisions around the 10-character mark now in the kernel repo. So all of this is more or less what Linus has been saying, but with numbers and formulas to back it up. One observation I should note are that these probabilities are what we would guess knowing _nothing_ about what is in the repo. That is, if you gave me an empty repo and told me you were going to populate it with N objects, these are the probabilities I would calculate. But of course the problem we actually have is that we already have N objects, and we expect to have N+M at some point in the future; we already know which collisions we have in the first N, and we want to future proof against the next M. So in theory we could say "we currently are unique at 8 characters; we anticipate 50,000 new objects per year and want to future proof for 4 years (so M=200,000). What is the length we need to keep the probability of collision for just this _one_ sha1 below some acceptable level?" Which I think is just: p(M) = 1 - d^-M but when I tried solving for d, I ended up with nonsensical numbers. Which means I'm either wrong or I'm way too sleepy to be doing algebra. Anyway, point being if we wanted to make a table of reasonable lengths for repos with N objects, it would look like this (for p=0.5): N | bits | hex ------------------ 1e2 | 13 | 3 1e3 | 19 | 5 1e4 | 26 | 7 1e5 | 33 | 8 1e6 | 39 | 10 2e6 | 41 | 10 5e6 | 44 | 11 1e7 | 46 | 12 So '11' would take the kernel repo to 5 million objects, which is more than doubling. 12 would take it to 10 million. Most smaller projects are probably fine at 8 or 9. But you can see that it takes quite a lot of objects to require a bump in the hex length, meaning we don't need to re-check it very often. If we really wanted to tweak it automatically, we could probably just set core.defaultAbbrev (or whatever) during git-gc based on the number of objects in the repository. We could do the formula, or even just have a precomputed table (it would be the inverse of the table above, mapping the switch points for hex characters to numbers of objects). -Peff ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH] find_unique_abbrev(): honor caller-supplied "len" better 2011-03-11 5:22 ` Jeff King @ 2011-03-11 5:33 ` Jeff King 0 siblings, 0 replies; 35+ messages in thread From: Jeff King @ 2011-03-11 5:33 UTC (permalink / raw) To: Junio C Hamano; +Cc: Linus Torvalds, git, Namhyung Kim On Fri, Mar 11, 2011 at 12:22:35AM -0500, Jeff King wrote: > This is basically the birthday problem used in hash collision analysis. > Which means that if we decide on some acceptable probability of > collision, we can figure out the hash-length required to keep the > probability of collision below that acceptable level for a repository > with objects. Urgh, that was supposed to be "...for a repository with _n_ objects". But hopefully that became clear as you read on. -Peff ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH] find_unique_abbrev(): honor caller-supplied "len" better 2011-03-11 3:03 ` Junio C Hamano 2011-03-11 5:22 ` Jeff King @ 2011-03-11 22:45 ` Junio C Hamano 2011-03-13 13:30 ` Namhyung Kim 2011-03-19 1:22 ` Jay Soffian 1 sibling, 2 replies; 35+ messages in thread From: Junio C Hamano @ 2011-03-11 22:45 UTC (permalink / raw) To: Linus Torvalds; +Cc: git, Namhyung Kim Junio C Hamano <gitster@pobox.com> writes: > So let's scrap the abbrevguard thing. I somehow thought that I already > took your "make DEFAULT_ABBREV tweakable" patch, but apparently I didn't. > That one is the real fix to the issue of futureproofing. In return for a free education last night, I now owe you your patch from October last year, resurrected from the list archive, and here it is. With a forged sign-off from you, as I know that everything you write is supposed to be open source. - I do not think making minimum configurable is worth it, but I left it in (there is no UI to tweak it anyway). - I somewhat tweaked "describe" and "describe --abbrev" implementation. OPT__ABBREV() uses DEFAULT_ABBREV in its callback, so we need to read from the configuration before calling parse_options(). As it won't make any sense to call "git describe" outside repository where you cannot get to your configuration, I think it is safe to add a call to git_config() in this codepath. Other users of OPT__ABBREV() may need to be audited. By the way, I've already reverted the abbrevguard thing away. -- >8 -- From: Linus Torvalds <torvalds@linux-foundation.org> Date: Thu, 28 Oct 2010 11:28:04 -0700 Subject: [PATCH] Make the default abbrev length configurable The default of 7 comes from fairly early in git development, when seven hex digits was a lot (it covers about 250+ million hash values). Back then I thought that 65k revisions was a lot (it was what we were about to hit in BK), and each revision tends to be about 5-10 new objects or so, so a million objects was a big number. These days, the kernel isn't even the largest git project, and even the kernel has about 220k revisions (_much_ bigger than the BK tree ever was) and we are approaching two million objects. At that point, seven hex digits is still unique for a lot of them, but when we're talking about just two orders of magnitude difference between number of objects and the hash size, there _will_ be collisions in truncated hash values. It's no longer even close to unrealistic - it happens all the time. We should both increase the default abbrev that was unrealistically small, _and_ add a way for people to set their own default per-project in the git config file. This is the first step to first make it configurable; the default of 7 is not raised yet. Signed-off-by: Junio C Hamano <gitster@pobox.com> --- Documentation/config.txt | 6 ++++++ builtin/describe.c | 6 +++++- cache.h | 5 +++-- config.c | 8 ++++++++ environment.c | 1 + 5 files changed, 23 insertions(+), 3 deletions(-) diff --git a/Documentation/config.txt b/Documentation/config.txt index c5e1835..30afde9 100644 --- a/Documentation/config.txt +++ b/Documentation/config.txt @@ -567,6 +567,12 @@ core.sparseCheckout:: Enable "sparse checkout" feature. See section "Sparse checkout" in linkgit:git-read-tree[1] for more information. +core.abbrevLength:: + Set the length object names are abbreviated to. If unspecified, + many commands abbreviate to 7 hexdigits, which may not be enough + for abbreviated object names to stay unique for sufficiently long + time. + add.ignore-errors:: add.ignoreErrors:: Tells 'git add' to continue adding files when some files cannot be diff --git a/builtin/describe.c b/builtin/describe.c index 342129f..9591596 100644 --- a/builtin/describe.c +++ b/builtin/describe.c @@ -21,7 +21,7 @@ static int debug; /* Display lots of verbose info */ static int all; /* Any valid ref can be used */ static int tags; /* Allow lightweight tags */ static int longformat; -static int abbrev = DEFAULT_ABBREV; +static int abbrev = -1; /* unspecified */ static int max_candidates = 10; static struct hash_table names; static int have_util; @@ -420,7 +420,11 @@ int cmd_describe(int argc, const char **argv, const char *prefix) OPT_END(), }; + git_config(git_default_config, NULL); argc = parse_options(argc, argv, prefix, options, describe_usage, 0); + if (abbrev < 0) + abbrev = DEFAULT_ABBREV; + if (max_candidates < 0) max_candidates = 0; else if (max_candidates > MAX_TAGS) diff --git a/cache.h b/cache.h index d83d68c..8d73d88 100644 --- a/cache.h +++ b/cache.h @@ -540,6 +540,7 @@ extern int trust_executable_bit; extern int trust_ctime; extern int quote_path_fully; extern int has_symlinks; +extern int minimum_abbrev, default_abbrev; extern int ignore_case; extern int assume_unchanged; extern int prefer_symlink_refs; @@ -758,8 +759,8 @@ static inline unsigned int hexval(unsigned char c) } /* Convert to/from hex/sha1 representation */ -#define MINIMUM_ABBREV 4 -#define DEFAULT_ABBREV 7 +#define MINIMUM_ABBREV minimum_abbrev +#define DEFAULT_ABBREV default_abbrev struct object_context { unsigned char tree[20]; diff --git a/config.c b/config.c index 625e051..e8a6e56 100644 --- a/config.c +++ b/config.c @@ -531,6 +531,14 @@ static int git_default_core_config(const char *var, const char *value) return 0; } + if (!strcmp(var, "core.abbrevlength")) { + int abbrev = git_config_int(var, value); + if (abbrev < minimum_abbrev || abbrev > 40) + return -1; + default_abbrev = abbrev; + return 0; + } + if (!strcmp(var, "core.loosecompression")) { int level = git_config_int(var, value); if (level == -1) diff --git a/environment.c b/environment.c index 9564475..f2d90a8 100644 --- a/environment.c +++ b/environment.c @@ -15,6 +15,7 @@ int user_ident_explicitly_given; int trust_executable_bit = 1; int trust_ctime = 1; int has_symlinks = 1; +int minimum_abbrev = 4, default_abbrev = 7; int ignore_case; int assume_unchanged; int prefer_symlink_refs; -- 1.7.4.1.404.g62d316 ^ permalink raw reply related [flat|nested] 35+ messages in thread
* Re: [PATCH] find_unique_abbrev(): honor caller-supplied "len" better 2011-03-11 22:45 ` Junio C Hamano @ 2011-03-13 13:30 ` Namhyung Kim 2011-03-19 1:22 ` Jay Soffian 1 sibling, 0 replies; 35+ messages in thread From: Namhyung Kim @ 2011-03-13 13:30 UTC (permalink / raw) To: Junio C Hamano; +Cc: Linus Torvalds, git 2011-03-11 (금), 14:45 -0800, Junio C Hamano: > Junio C Hamano <gitster@pobox.com> writes: > > > So let's scrap the abbrevguard thing. I somehow thought that I already > > took your "make DEFAULT_ABBREV tweakable" patch, but apparently I didn't. > > That one is the real fix to the issue of futureproofing. > > In return for a free education last night, I now owe you your patch from > October last year, resurrected from the list archive, and here it is. > > With a forged sign-off from you, as I know that everything you write is > supposed to be open source. > > - I do not think making minimum configurable is worth it, but I left it in > (there is no UI to tweak it anyway). > > - I somewhat tweaked "describe" and "describe --abbrev" implementation. > OPT__ABBREV() uses DEFAULT_ABBREV in its callback, so we need to read > from the configuration before calling parse_options(). As it won't > make any sense to call "git describe" outside repository where you > cannot get to your configuration, I think it is safe to add a call to > git_config() in this codepath. Other users of OPT__ABBREV() may need to > be audited. > > By the way, I've already reverted the abbrevguard thing away. > > -- >8 -- > From: Linus Torvalds <torvalds@linux-foundation.org> > Date: Thu, 28 Oct 2010 11:28:04 -0700 > Subject: [PATCH] Make the default abbrev length configurable > > The default of 7 comes from fairly early in git development, when > seven hex digits was a lot (it covers about 250+ million hash > values). Back then I thought that 65k revisions was a lot (it was what > we were about to hit in BK), and each revision tends to be about 5-10 > new objects or so, so a million objects was a big number. > > These days, the kernel isn't even the largest git project, and even > the kernel has about 220k revisions (_much_ bigger than the BK tree > ever was) and we are approaching two million objects. At that point, > seven hex digits is still unique for a lot of them, but when we're > talking about just two orders of magnitude difference between number > of objects and the hash size, there _will_ be collisions in truncated > hash values. It's no longer even close to unrealistic - it happens all > the time. > > We should both increase the default abbrev that was unrealistically > small, _and_ add a way for people to set their own default per-project > in the git config file. > > This is the first step to first make it configurable; the default of 7 > is not raised yet. > > Signed-off-by: Junio C Hamano <gitster@pobox.com> > --- Reviewed-by: Namhyung Kim <namhyung@gmail.com> Thanks. -- Regards, Namhyung Kim ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH] find_unique_abbrev(): honor caller-supplied "len" better 2011-03-11 22:45 ` Junio C Hamano 2011-03-13 13:30 ` Namhyung Kim @ 2011-03-19 1:22 ` Jay Soffian 2011-03-19 16:24 ` Namhyung Kim 1 sibling, 1 reply; 35+ messages in thread From: Jay Soffian @ 2011-03-19 1:22 UTC (permalink / raw) To: Junio C Hamano; +Cc: Linus Torvalds, git, Namhyung Kim On Fri, Mar 11, 2011 at 5:45 PM, Junio C Hamano <gitster@pobox.com> wrote: > +core.abbrevLength:: > + Set the length object names are abbreviated to. If unspecified, > + many commands abbreviate to 7 hexdigits, which may not be enough > + for abbreviated object names to stay unique for sufficiently long > + time. > + Isn't this the minimum length though? i.e. a longer length is used as needed for uniqueness. If so, at least the description is misleading, and I'd argue the option name too. Perhaps core.abbrevMinLength? j. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH] find_unique_abbrev(): honor caller-supplied "len" better 2011-03-19 1:22 ` Jay Soffian @ 2011-03-19 16:24 ` Namhyung Kim 0 siblings, 0 replies; 35+ messages in thread From: Namhyung Kim @ 2011-03-19 16:24 UTC (permalink / raw) To: Jay Soffian; +Cc: Junio C Hamano, Linus Torvalds, git 2011-03-18 (금), 21:22 -0400, Jay Soffian: > On Fri, Mar 11, 2011 at 5:45 PM, Junio C Hamano <gitster@pobox.com> wrote: > > +core.abbrevLength:: > > + Set the length object names are abbreviated to. If unspecified, > > + many commands abbreviate to 7 hexdigits, which may not be enough > > + for abbreviated object names to stay unique for sufficiently long > > + time. > > + > > Isn't this the minimum length though? i.e. a longer length is used as > needed for uniqueness. If so, at least the description is misleading, > and I'd argue the option name too. Perhaps core.abbrevMinLength? > > j. It could confuse what it points is the default length or the minimum. I'd like to suggest core.defaultAbbrev or core.abbrevDefault. -- Regards, Namhyung Kim ^ permalink raw reply [flat|nested] 35+ messages in thread
* ext3 patches related to uninitialized memory references 2010-10-28 4:52 ` Theodore Ts'o ` (2 preceding siblings ...) (?) @ 2010-11-01 16:34 ` Roman Borisov -1 siblings, 0 replies; 35+ messages in thread From: Roman Borisov @ 2010-11-01 16:34 UTC (permalink / raw) To: ext Theodore Ts'o; +Cc: linux-ext4 Hello, Do you have a plans for next patches commitment? [PATCH 1/2] ext3: Avoid uninitialized memory references with a corrupted htree directory http://www.spinics.net/lists/linux-ext4/msg21215.html [PATCH 2/2] ext3: Use search_dirblock() in ext3_dx_find_entry() http://www.spinics.net/lists/linux-ext4/msg21216.html I've found only the same patches for ext4 but not for ext3. Thanks, Roman ^ permalink raw reply [flat|nested] 35+ messages in thread
end of thread, other threads:[~2011-03-19 16:24 UTC | newest] Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2010-10-28 4:52 [GIT PULL] ext4 update for 2.6.37 Theodore Ts'o 2010-10-28 4:52 ` Theodore Ts'o 2010-10-28 7:50 ` Markus Trippelsdorf 2010-10-28 7:56 ` -tip: origin tree build failure (was: [GIT PULL] ext4 update for 2.6.37) Ingo Molnar 2010-10-28 7:56 ` Ingo Molnar 2010-10-28 12:12 ` Theodore Tso 2010-10-28 12:12 ` Theodore Tso 2010-10-28 16:30 ` Linus Torvalds 2010-10-28 16:38 ` Ingo Molnar 2010-10-28 16:55 ` Ted Ts'o 2010-10-28 17:00 ` Linus Torvalds 2010-10-28 17:17 ` Ingo Molnar 2010-10-28 17:17 ` Ingo Molnar 2010-10-28 17:27 ` Minimum git commit abbrev length (Was Re: -tip: origin tree build failure (was: [GIT PULL] ext4 update) " Ted Ts'o 2010-10-28 18:28 ` Linus Torvalds 2010-10-28 18:54 ` Linus Torvalds 2010-10-29 0:14 ` Minimum git commit abbrev length (Was Re: -tip: origin tree build failure Brandon Casey 2010-10-28 21:39 ` Junio C Hamano 2010-10-28 21:50 ` Linus Torvalds 2010-10-28 21:50 ` Linus Torvalds 2011-03-10 22:37 ` [PATCH] find_unique_abbrev(): honor caller-supplied "len" better Junio C Hamano 2011-03-10 23:07 ` Linus Torvalds 2011-03-11 0:40 ` Junio C Hamano 2011-03-11 1:16 ` Linus Torvalds 2011-03-11 1:33 ` Junio C Hamano 2011-03-11 2:21 ` Linus Torvalds 2011-03-11 3:09 ` Junio C Hamano 2011-03-11 3:03 ` Junio C Hamano 2011-03-11 5:22 ` Jeff King 2011-03-11 5:33 ` Jeff King 2011-03-11 22:45 ` Junio C Hamano 2011-03-13 13:30 ` Namhyung Kim 2011-03-19 1:22 ` Jay Soffian 2011-03-19 16:24 ` Namhyung Kim 2010-11-01 16:34 ` ext3 patches related to uninitialized memory references Roman Borisov
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.