linux

korg/linux

mirror of https://mirrors.bfsu.edu.cn/git/linux.git synced 2024-12-04 01:24:12 +08:00

Author	SHA1	Message	Date
Theodore Ts'o	b3881f74b3	ext4: Add mount option to set kjournald's I/O priority Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Jens Axboe <jens.axboe@oracle.com>	2009-01-05 22:46:26 -05:00
Jan Kara	a5b5ee3201	ext4: Add default allocation routines for quota structures Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:26 -08:00
Jan Kara	17bd13b31c	ext4: Use sb_any_quota_loaded() instead of sb_any_quota_enabled() Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:36:56 -08:00
Nick Piggin	54566b2c15	fs: symlink write_begin allocation context fix With the write_begin/write_end aops, page_symlink was broken because it could no longer pass a GFP_NOFS type mask into the point where the allocations happened. They are done in write_begin, which would always assume that the filesystem can be entered from reclaim. This bug could cause filesystem deadlocks. The funny thing with having a gfp_t mask there is that it doesn't really allow the caller to arbitrarily tinker with the context in which it can be called. It couldn't ever be GFP_ATOMIC, for example, because it needs to take the page lock. The only thing any callers care about is __GFP_FS anyway, so turn that into a single flag. Add a new flag for write_begin, AOP_FLAG_NOFS. Filesystems can now act on this flag in their write_begin function. Change __grab_cache_page to accept a nofs argument as well, to honour that flag (while we're there, change the name to grab_cache_page_write_begin which is more instructive and does away with random leading underscores). This is really a more flexible way to go in the end anyway -- if a filesystem happens to want any extra allocations aside from the pagecache ones in ints write_begin function, it may now use GFP_KERNEL (rather than GFP_NOFS) for common case allocations (eg. ocfs2_alloc_write_ctxt, for a random example). [kosaki.motohiro@jp.fujitsu.com: fix ubifs] [kosaki.motohiro@jp.fujitsu.com: fix fuse] Signed-off-by: Nick Piggin <npiggin@suse.de> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: <stable@kernel.org> [2.6.28.x] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [ Cleaned up the calling convention: just pass in the AOP flags untouched to the grab_cache_page_write_begin() function. That just simplifies everybody, and may even allow future expansion of the logic. - Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-04 13:33:20 -08:00
Pekka Enberg	c644f0e4b5	fs: introduce bgl_lock_ptr() As suggested by Andreas Dilger, introduce a bgl_lock_ptr() helper in <linux/blockgroup_lock.h> and add separate sb_bgl_lock() helpers to filesystem specific header files to break the hidden dependency to struct ext[234]_sb_info. Also, while at it, convert the macros to static inlines to try make up for all the times I broke Andrew Morton's tree. Acked-by: Andreas Dilger <adilger@sun.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-04 13:33:20 -08:00
Theodore Ts'o	ba80b1019a	ext4: Add markers for better debuggability Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-01-03 20:03:21 -05:00
Theodore Ts'o	c319106723	ext4: Remove code to create the journal inode This code has been obsolete in quite some time, since the supported method for adding a journal inode is to use tune2fs (or to creating new filesystem with a journal via mke2fs or mkfs.ext4). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-01-06 11:14:25 -05:00
Toshiyuki Okajima	c39a7f84d7	ext4: provide function to release metadata pages under memory pressure Pages in the page cache belonging to ext4 data files are released via the ext4_releasepage() function specified in the ext4 inode's address_space_ops. However, metadata blocks (such as indirect blocks, directory blocks, etc) are managed via the block device address_space_ops, and they can not be released by try_to_free_buffers() if they have a journal head attached to them. To address this, we supply a release_metadata function which calls jbd2_journal_try_to_free_buffers() function to free the metadata, and which is called by the block device's blkdev_releasepage() function. Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: linux-fsdevel@vger.kernel.org	2009-01-05 22:38:48 -05:00
Aneesh Kumar K.V	0087d9fb3f	ext4: Fix s_dirty_blocks_counter if block allocation failed with nodelalloc With nodelalloc option we need to update the dirty block counter on block allocation failure. This is needed because we increment the dirty block counter early in the block allocation phase. Without the patch s_dirty_blocks_counter goes wrong so that filesystem's free blocks decreases incorrectly. Tested-by: Akira Fujita <a-fujita@rs.jp.nec.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2009-01-05 21:49:12 -05:00
Aneesh Kumar K.V	29eaf02498	ext4: Init the complete page while building buddy cache We need to init the complete page during buddy cache init by setting the contents to '1'. Otherwise we can see the following errors after doing an online resize of the filesystem: EXT4-fs error (device sdb1): ext4_mb_mark_diskspace_used: Allocating block 1040385 in system zone of 127 group Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2009-01-05 21:48:56 -05:00
Aneesh Kumar K.V	8556e8f3b6	ext4: Don't allow new groups to be added during block allocation After we mark the blocks in the buddy cache as allocated, we need to ensure that we don't reinit the buddy cache until the block bitmap is updated. This commit achieves this by holding the group_info alloc_semaphore till ext4_mb_release_context Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2009-01-05 21:46:55 -05:00
Aneesh Kumar K.V	648f5879f5	ext4: mark the blocks/inode bitmap beyond end of group as used We need to mark the block/inode bitmap beyond the end of the group with '1'. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2009-01-05 21:46:04 -05:00
Aneesh Kumar K.V	2ccb5fb9f1	ext4: Use new buffer_head flag to check uninit group bitmaps initialization For uninit block group, the on-disk bitmap is not initialized. That implies we cannot depend on the uptodate flag on the bitmap buffer_head to find bitmap validity. Use a new buffer_head flag which would be set after we properly initialize the bitmap. This also prevents (re-)initializing the uninit group bitmap every time we call ext4_read_block_bitmap(). Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2009-01-05 21:49:55 -05:00
Aneesh Kumar K.V	393418676a	ext4: Fix the race between read_inode_bitmap() and ext4_new_inode() We need to make sure we update the inode bitmap and clear EXT4_BG_INODE_UNINIT flag with sb_bgl_lock held, since ext4_read_inode_bitmap() looks at EXT4_BG_INODE_UNINIT to decide whether to initialize the inode bitmap each time it is called. (introduced by commit c806e68f.) ext4_read_inode_bitmap does: spin_lock(sb_bgl_lock(EXT4_SB(sb), block_group)); if (desc->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)) { ext4_init_inode_bitmap(sb, bh, block_group, desc); and ext4_new_inode does if (!ext4_set_bit_atomic(sb_bgl_lock(sbi, group), ino, inode_bitmap_bh->b_data)) ...... ... spin_lock(sb_bgl_lock(sbi, group)); gdp->bg_flags &= cpu_to_le16(~EXT4_BG_INODE_UNINIT); i.e., on allocation we update the bitmap then we take the sb_bgl_lock and clear the EXT4_BG_INODE_UNINIT flag. What can happen is a parallel ext4_read_inode_bitmap can zero out the bitmap in between the above ext4_set_bit_atomic and spin_lock(sb_bg_lock..) The race results in below user visible errors EXT4-fs error (device sdb1): ext4_free_inode: bit already cleared for inode 168449 EXT4-fs warning (device sdb1): ext4_unlink: Deleting nonexistent file ... EXT4-fs warning (device sdb1): ext4_rmdir: empty directory has too many links ... # ls -al /mnt/tmp/f/p369/d3/d6/d39/db2/dee/d10f/d3f/l71 ls: /mnt/tmp/f/p369/d3/d6/d39/db2/dee/d10f/d3f/l71: Stale NFS file handle Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2009-01-05 21:38:14 -05:00
Aneesh Kumar K.V	3300beda52	ext4: code cleanup Rename some variables. We also unlock locks in the reverse order we acquired as a part of cleanup. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-01-03 22:33:39 -05:00
Aneesh Kumar K.V	560671a0d3	ext4: Use high 16 bits of the block group descriptor's free counts fields Rename the lower bits with suffix _lo and add helper to access the values. Also rename bg_itable_unused_hi to bg_pad as in e2fsprogs. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-01-05 22:20:24 -05:00
Aneesh Kumar K.V	e8134b27e3	ext4: Fix race between read_block_bitmap() and mark_diskspace_used() We need to make sure we update the block bitmap and clear EXT4_BG_BLOCK_UNINIT flag with sb_bgl_lock held, since ext4_read_block_bitmap() looks at EXT4_BG_BLOCK_UNINIT to decide whether to initialize the block bitmap each time it is called (introduced by commit `c806e68f`), and this can race with block allocations in ext4_mb_mark_diskspace_used(). ext4_read_block_bitmap does: spin_lock(sb_bgl_lock(EXT4_SB(sb), block_group)); if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) { ext4_init_block_bitmap(sb, bh, block_group, desc); Now on the block allocation side we do mb_set_bits(sb_bgl_lock(sbi, ac->ac_b_ex.fe_group), bitmap_bh->b_data, ac->ac_b_ex.fe_start, ac->ac_b_ex.fe_len); .... spin_lock(sb_bgl_lock(sbi, ac->ac_b_ex.fe_group)); if (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) { gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT); ie on allocation we update the bitmap then we take the sb_bgl_lock and clear the EXT4_BG_BLOCK_UNINIT flag. What can happen is a parallel ext4_read_block_bitmap can zero out the bitmap in between the above mb_set_bits and spin_lock(sb_bg_lock..) The race results in below user visible errors EXT4-fs error (device sdb1): ext4_mb_release_inode_pa: free 100, pa_free 105 EXT4-fs error (device sdb1): mb_free_blocks: double-free of inode 0's block .. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2009-01-05 21:38:26 -05:00
Aneesh Kumar K.V	5d1b1b3f49	ext4: fix BUG when calling ext4_error with locked block group The mballoc code likes to call ext4_error while it is holding locked block groups. This can causes a scheduling in atomic context BUG. We can't just unlock the block group and relock it after/if ext4_error returns since that might result in race conditions in the case where the filesystem is set to continue after finding errors. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-01-05 22:19:52 -05:00
Al Viro	6b38e842bb	nfsd race fixes: ext4 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-12-31 18:07:44 -05:00
Duane Griffin	e83c1397ca	ext4: ensure fast symlinks are NUL-terminated Ensure fast symlink targets are NUL-terminated, even if corrupted on-disk. Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Theodore Ts'o <tytso@mit.edu> Cc: adilger@sun.com Cc: linux-ext4@vger.kernel.org Signed-off-by: Duane Griffin <duaneg@dghda.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-12-31 18:07:39 -05:00
Jens Axboe	b3a6ffe16b	Get rid of CONFIG_LSF We have two seperate config entries for large devices/files. One is CONFIG_LBD that guards just the devices, the other is CONFIG_LSF that handles large files. This doesn't make a lot of sense, you typically want both or none. So get rid of CONFIG_LSF and change CONFIG_LBD wording to indicate that it covers both. Acked-by: Jean Delvare <khali@linux-fr.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-12-29 08:29:51 +01:00
James Morris	cbacc2c7f0	Merge branch 'next' into for-linus	2008-12-25 11:40:09 +11:00
Andrew Morton	02d2116887	revert "percpu_counter: new function percpu_counter_sum_and_set" Revert commit `e8ced39d5e` Author: Mingming Cao <cmm@us.ibm.com> Date: Fri Jul 11 19:27:31 2008 -0400 percpu_counter: new function percpu_counter_sum_and_set As described in revert "percpu counter: clean up percpu_counter_sum_and_set()" the new percpu_counter_sum_and_set() is racy against updates to the cpu-local accumulators on other CPUs. Revert that change. This means that ext4 will be slow again. But correct. Reported-by: Eric Dumazet <dada1@cosmosbay.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mingming Cao <cmm@us.ibm.com> Cc: <linux-ext4@vger.kernel.org> Cc: <stable@kernel.org> [2.6.27.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-12-10 08:01:52 -08:00
Andrew Morton	71c5576fbd	revert "percpu counter: clean up percpu_counter_sum_and_set()" Revert commit `1f7c14c62c` Author: Mingming Cao <cmm@us.ibm.com> Date: Thu Oct 9 12:50:59 2008 -0400 percpu counter: clean up percpu_counter_sum_and_set() Before this patch we had the following: percpu_counter_sum(): return the percpu_counter's value percpu_counter_sum_and_set(): return the percpu_counter's value, copying that value into the central value and zeroing the per-cpu counters before returning. After this patch, percpu_counter_sum_and_set() has gone, and percpu_counter_sum() gets the old percpu_counter_sum_and_set() functionality. Problem is, as Eric points out, the old percpu_counter_sum_and_set() functionality was racy and wrong. It zeroes out counters on "other" cpus, without holding any locks which will prevent races agaist updates from those other CPUS. This patch reverts `1f7c14c62c`. This means that percpu_counter_sum_and_set() still has the race, but percpu_counter_sum() does not. Note that this is not a simple revert - ext4 has since started using percpu_counter_sum() for its dirty_blocks counter as well. Note that this revert patch changes percpu_counter_sum() semantics. Before the patch, a call to percpu_counter_sum() will bring the counter's central counter mostly up-to-date, so a following percpu_counter_read() will return a close value. After this patch, a call to percpu_counter_sum() will leave the counter's central accumulator unaltered, so a subsequent call to percpu_counter_read() can now return a significantly inaccurate result. If there is any code in the tree which was introduced after `e8ced39d5e` was merged, and which depends upon the new percpu_counter_sum() semantics, that code will break. Reported-by: Eric Dumazet <dada1@cosmosbay.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mingming Cao <cmm@us.ibm.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-12-10 08:01:52 -08:00
Aneesh Kumar K.V	b7be019e80	ext4: Fix lockdep recursive locking warning In ext4_mb_init_group(), if the filesystem block size is less than PAGE_SIZE/2, the code tries to grab alloc_sem for multiple block groups in a loop. We need to allow for this by using down_write_nested() and passing in the loop index as a lock subclass number. This works because no other code path needs to take multiple alloc_sem's. Note that lockdep will fail for filesystem blocksize smaller than to PAGE_SIZE/16k. (e.g., a 1k filesystem blocksize with a 32k page size, or a 2k filesystem blocksize with a 64k blocksize, etc.) Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-11-23 23:51:53 -05:00
Aneesh Kumar K.V	7a2fcbf7f8	ext4: don't use blocks freed but not yet committed in buddy cache init When we generate buddy cache (especially during resize) we need to make sure we don't use the blocks freed but not yet comitted. This makes sure we have the right value of free blocks count in the group info and also in the bitmap. This also ensures the ordered mode consistency Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2009-01-05 21:36:55 -05:00
James Morris	2b82892565	Merge branch 'master' into next Conflicts: security/keys/internal.h security/keys/process_keys.c security/keys/request_key.c Fixed conflicts above by using the non 'tsk' versions. Signed-off-by: James Morris <jmorris@namei.org>	2008-11-14 11:29:12 +11:00
David Howells	4c9c544e49	CRED: Wrap task credential accesses in the Ext4 filesystem Wrap access to task credentials so that they can be separated more easily from the task_struct during the introduction of COW creds. Change most current->(\|e\|s\|fs)[ug]id to current_(\|e\|s\|fs)[ug]id(). Change some task->e?[ug]id to task_e?[ug]id(). In some places it makes more sense to use RCU directly rather than a convenient wrapper; these will be addressed by later patches. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: James Morris <jmorris@namei.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Stephen Tweedie <sct@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: adilger@sun.com Cc: linux-ext4@vger.kernel.org Signed-off-by: James Morris <jmorris@namei.org>	2008-11-14 10:38:51 +11:00
Frederic Bohe	23712a9c28	ext4: add checksum calculation when clearing UNINIT flag in ext4_new_inode When initializing an uninitialized block group in ext4_new_inode(), its block group checksum must be re-calculated. This fixes a race when several threads try to allocate a new inode in an UNINIT'd group. There is some question whether we need to be initializing the block bitmap in ext4_new_inode() at all, but for now, if we are going to init the block group, let's eliminate the race. Signed-off-by: Frederic Bohe <frederic.bohe@bull.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-11-07 09:21:01 -05:00
Aneesh Kumar K.V	ed9b3e3379	ext4: Mark the buffer_heads as dirty and uptodate after prepare_write We need to make sure we mark the buffer_heads as dirty and uptodate so that block_write_full_page write them correctly. This fixes mmap corruptions that can occur in low memory situations. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-11-07 09:06:45 -05:00
Aneesh Kumar K.V	c3a326a657	ext4: cleanup mballoc header files Move some of the forward declaration of the static functions to mballoc.c where they are used. This enables us to include mballoc.h in other .c files. Also correct the buddy cache documentation. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-11-25 15:11:52 -05:00
Aneesh Kumar K.V	920313a726	ext4: Use EXT4_GROUP_INFO_NEED_INIT_BIT during resize The new groups added during resize are flagged as need_init group. Make sure we properly initialize these groups. When we have block size < page size and we are adding new groups the page may still be marked uptodate even though we haven't initialized the group. While forcing the init of buddy cache we need to make sure other groups part of the same page of buddy cache is not using the cache. group_info->alloc_sem is added to ensure the same. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> cc: stable@kernel.org	2009-01-05 21:36:19 -05:00
Aneesh Kumar K.V	e21675d4b6	ext4: Add blocks added during resize to bitmap With this change new blocks added during resize are marked as free in the block bitmap and the group is flagged with EXT4_GROUP_INFO_NEED_INIT_BIT flag. This makes sure when mballoc tries to allocate blocks from the new group we would reload the buddy information using the bitmap present in the disk. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2009-01-05 21:36:02 -05:00
Aneesh Kumar K.V	3a06d778df	ext4: sparse fixes * Change EXT4_HAS__FEATURE to return a boolean Add a function prototype for ext4_fiemap() in ext4.h * Make ext4_ext_fiemap_cb() and ext4_xattr_fiemap() be static functions * Add lock annotations to mb_free_blocks() Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-11-22 15:04:59 -05:00
Theodore Ts'o	ac51d83705	ext4: calculate journal credits correctly This fixes a 2.6.27 regression which was introduced in commit `a02908f1`. We weren't passing the chunk parameter down to the two subections, ext4_indirect_trans_blocks() and ext4_ext_index_trans_blocks(), with the result that massively overestimate the amount of credits needed by ext4_da_writepages, especially in the non-extents case. This causes failures especially on /boot partitions, which tend to be small and non-extent using since GRUB doesn't handle extents. This patch fixes the bug reported by Joseph Fannin at: http://bugzilla.kernel.org/show_bug.cgi?id=11964 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-11-06 16:49:36 -05:00
Theodore Ts'o	498e5f2415	ext4: Change unsigned long to unsigned int Convert the unsigned longs that are most responsible for bloating the stack usage on 64-bit systems. Nearly all places in the ext3/4 code which uses "unsigned long" is probably a bug, since on 32-bit systems a ulong a 32-bits, which means we are wasting stack space on 64-bit systems. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-11-05 00:14:04 -05:00
Theodore Ts'o	a9df9a4910	ext4: Make ext4_group_t be an unsigned int Nearly all places in the ext3/4 code which uses "unsigned long" is probably a bug, since on 32-bit systems a ulong a 32-bits, which means we are wasting stack space on 64-bit systems. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-01-05 22:18:16 -05:00
Theodore Ts'o	cde6436004	ext4: Remove i_ext_generation from ext4_inode_info structure The i_ext_generation was incremented, but never used. Remove it to slim down the ext4_inode_info structure. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-11-04 18:46:03 -05:00
Theodore Ts'o	30773840c1	ext4: add fsync batch tuning knobs Add new mount options, min_batch_time and max_batch_time, which controls how long the jbd2 layer should wait for additional filesystem operations to get batched with a synchronous write transaction. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-01-03 20:27:38 -05:00
Aneesh Kumar K.V	032115fcef	ext4: Don't overwrite allocation_context ac_status We can call ext4_mb_check_limits even after successfully allocating the requested blocks. In that case, make sure we don't overwrite ac_status if it already has the status AC_STATUS_FOUND. This fixes the lockdep warning: ============================================= [ INFO: possible recursive locking detected ] 2.6.28-rc6-autokern1 #1 --------------------------------------------- fsstress/11948 is trying to acquire lock: (&meta_group_info[i]->alloc_sem){----}, at: [<c04d9a49>] ext4_mb_load_buddy+0x9f/0x278 ..... stack backtrace: ..... [<c04db974>] ext4_mb_regular_allocator+0xbb5/0xd44 ..... but task is already holding lock: (&meta_group_info[i]->alloc_sem){----}, at: [<c04d9a49>] ext4_mb_load_buddy+0x9f/0x278 Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2009-01-05 21:34:30 -05:00
Theodore Ts'o	fde4d95ad8	ext4: remove extraneous newlines from calls to ext4_error() and ext4_warning() This removes annoying blank syslog entries emitted by ext4_error() or ext4_warning(), since these functions add their own newline. Signed-off-by: Nick Warne <nick@ukfsn.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-01-05 22:17:35 -05:00
Frank Mayhar	0390131ba8	ext4: Allow ext4 to run without a journal A few weeks ago I posted a patch for discussion that allowed ext4 to run without a journal. Since that time I've integrated the excellent comments from Andreas and fixed several serious bugs. We're currently running with this patch and generating some performance numbers against both ext2 (with backported reservations code) and ext4 with and without a journal. It just so happens that running without a journal is slightly faster for most everything. We did iozone -T -t 4 s 2g -r 256k -T -I -i0 -i1 -i2 which creates 4 threads, each of which create and do reads and writes on a 2G file, with a buffer size of 256K, using O_DIRECT for all file opens to bypass the page cache. Results: ext2 ext4, default ext4, no journal initial writes 13.0 MB/s 15.4 MB/s 15.7 MB/s rewrites 13.1 MB/s 15.6 MB/s 15.9 MB/s reads 15.2 MB/s 16.9 MB/s 17.2 MB/s re-reads 15.3 MB/s 16.9 MB/s 17.2 MB/s random readers 5.6 MB/s 5.6 MB/s 5.7 MB/s random writers 5.1 MB/s 5.3 MB/s 5.4 MB/s So it seems that, so far, this was a useful exercise. Signed-off-by: Frank Mayhar <fmayhar@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-01-07 00:06:22 -05:00
Yasunori Goto	ff7ef329b2	ext4: Widen type of ext4_sb_info.s_mb_maxs[] I chased the cause of following ext4 oops report which is tested on ia64 box. http://bugzilla.kernel.org/show_bug.cgi?id=12018 The cause is the size of s_mb_maxs array that is defined as "unsigned short" in ext4_sb_info structure. If the file system's block size is 8k or greater, an unsigned short is not wide enough to contain the value fs->blocksize << 3. Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Cc: stable@kernel.org	2008-12-17 00:48:39 -05:00
Solofo.Ramangalahy@bull.net	93c0d86371	ext4: When resizing set the EXT4_BG_INODE_ZEROED flag for new block groups The inode table has been zeroed in setup_new_group_blocks(). Mark it as such in ext4_group_add(). Since we are currently clearing inode table for the new block group, we should set the EXT4_BG_INODE_ZEROED flag. If at some point in the future we don't immediately zero out the inode table as part of the resize operation, then obviously we shouldn't do this. Signed-off-by: Solofo.Ramangalahy@bull.net Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-11-26 23:44:10 -05:00
Roel Kluin	23475e264c	ext4: Use simple_strtol() instead of simple_strtoul() in ext4_ui_proc_open Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-11-26 02:23:19 -05:00
Wu Fengguang	25f1ee3aba	ext4: fix build warning Replace `if' with `goto' to assure gcc that ix has been initialized. Signed-off-by: Wu Fengguang <wfg@linux.intel.com>	2008-11-25 17:24:23 -05:00
Aneesh Kumar K.V	565a9617b2	ext4: avoid ext4_error when mounting a fs with a single bg Remove some completely unneeded code which which caused an ext4_error to be generated when mounting a file system with only a single block group. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2009-01-05 21:51:07 -05:00
Aneesh Kumar K.V	791b7f0895	ext4: Fix the delalloc writepages to allocate blocks at the right offset. When iterating through the pages which have mapped buffer_heads, we failed to update the b_state value. This results in allocating blocks at logical offset 0. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2009-01-05 21:50:43 -05:00
Theodore Ts'o	2a21e37e48	ext4: tone down ext4_da_writepages warnings If the filesystem has errors, ext4_da_writepages() will return a lot of errors, including lots and lots of stack dumps. While it's true that we are dropping user data on the floor, which is unfortunate, the stack dumps aren't helpful, and they tend to obscure the true original root cause of the problem. So in the case where the filesystem has aborted, return an EROFS right away. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-11-05 09:22:24 -05:00
Theodore Ts'o	97df5d155d	ext4: remove do_blk_alloc() The convenience function do_blk_alloc() is a static function with only one caller, so fold it into ext4_new_meta_blocks() to simplify the code and to make it easier to understand. To save more stack space, if count is a null pointer in ext4_new_meta_blocks() assume that caller wanted a single block (and if there is an error, no blocks were allocated). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-12-12 12:41:28 -05:00
Theodore Ts'o	cfe82c8567	ext4: remove ext4_new_meta_block() There were only two one callers of the function ext4_new_meta_block(), which just a very simpler wrapper function around ext4_new_meta_blocks(). Change those two functions to call ext4_new_meta_blocks() directly, to save code and stack space usage. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-12-07 14:10:54 -05:00
Theodore Ts'o	815a113068	ext4: remove ext4_new_blocks() and call ext4_mb_new_blocks() directly There was only one caller of the compatibility function ext4_new_blocks(), in balloc.c's ext4_alloc_blocks(). Change it to call ext4_mb_new_blocks() directly, and remove ext4_new_blocks() altogether. This cleans up the code, by removing two extra functions from the call chain, and hopefully saving some stack usage. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-01-01 23:59:43 -05:00
Theodore Ts'o	59e315b4c4	ext3/4: Fix loop index in do_split() so it is signed This fixes a gcc warning but it doesn't appear able to result in a failure, since the primary way the loop is exited is the first conditional in the for loop, and at least for a consistent filesystem, the signed/unsigned should in practice never be exposed. Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-12-06 16:58:39 -05:00
Theodore Ts'o	14ce0cb411	ext4: wait on all pending commits in ext4_sync_fs() In ext4_sync_fs, we only wait for a commit to finish if we started it, but there may be one already in progress which will not be synced. In the case of a data=ordered umount with pending long symlinks which are delayed due to a long list of other I/O on the backing block device, this causes the buffer associated with the long symlinks to not be moved to the inode dirty list in the second phase of fsync_super. Then, before they can be dirtied again, kjournald exits, seeing the UMOUNT flag and the dirty pages are never written to the backing block device, causing long symlink corruption and exposing new or previously freed block data to userspace. To ensure all commits are synced, we flush all journal commits now when sync_fs'ing ext4. Signed-off-by: Arthur Jones <ajones@riverbed.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Eric Sandeen <sandeen@redhat.com> Cc: <linux-ext4@vger.kernel.org>	2008-11-03 18:10:55 -05:00
Aneesh Kumar K.V	d94e99a64c	ext4: Convert to host order before using the values. Use le16_to_cpu to read the s_reserved_gdt_blocks values from super block. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-11-04 09:11:26 -05:00
Aneesh Kumar K.V	ae2d9fb18e	ext4: fix missing ext4_unlock_group in error path If we try to free a block which is already freed, the code was returning without first unlocking the group. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-11-04 09:10:50 -05:00
Theodore Ts'o	f99b25897a	ext4: Add support for non-native signed/unsigned htree hash algorithms The original ext3 hash algorithms assumed that variables of type char were signed, as God and K&R intended. Unfortunately, this assumption is not true on some architectures. Userspace support for marking filesystems with non-native signed/unsigned chars was added two years ago, but the kernel-side support was never added (until now). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-10-28 13:21:44 -04:00
Alexander Beregalov	8f72fbdf0d	ext4: fix printk format warning fs/ext4/balloc.c:607: warning: format '%lld' expects type 'long long int', but argument 2 has type 's64' fs/ext4/inode.c:1822: warning: format '%lld' expects type 'long long int', but argument 2 has type 's64' fs/ext4/inode.c:1824: warning: format '%lld' expects type 'long long int', but argument 2 has type 's64' Signed-off-by: Alexander Beregalov <a.beregalov@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-10-29 17:13:08 -04:00
Eric Sandeen	a996031c87	delay capable() check in ext4_has_free_blocks() As reported by Eric Paris, the capable() check in ext4_has_free_blocks() sometimes causes SELinux denials. We can rearrange the logic so that we only try to use the root-reserved blocks when necessary, and even then we can move the capable() test to last, to avoid the check most of the time. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-10-28 00:08:17 -04:00
Eric Sandeen	8c3bf8a01c	merge ext4_claim_free_blocks & ext4_has_free_blocks Mingming pointed out that ext4_claim_free_blocks & ext4_has_free_blocks are largely cut & pasted; they can be collapsed/merged as follows. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-10-28 00:08:12 -04:00
Hidehiro Kawai	ef2cabf7c6	ext4: fix a bug accessing freed memory in ext4_abort Vegard Nossum reported a bug which accesses freed memory (found via kmemcheck). When journal has been aborted, ext4_put_super() calls ext4_abort() after freeing the journal_t object, and then ext4_abort() accesses it. This patch fix it. Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-10-27 22:53:05 -04:00
Theodore Ts'o	3c37fc86d2	ext4: Fix duplicate entries returned from getdents() system call Fix a regression caused by commit `d0156417`, "ext4: fix ext4_dx_readdir hash collision handling", where deleting files in a large directory (requiring more than one getdents system call), results in some filenames being returned twice. This was caused by a failure to update info->curr_hash and info->curr_minor_hash, so that if the directory had gotten modified since the last getdents() system call (as would be the case if the user is running "rm -r" or "git clean"), a directory entry would get returned twice to the userspace. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> This patch fixes the bug reported by Markus Trippelsdorf at: http://bugzilla.kernel.org/show_bug.cgi?id=11844 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Tested-by: Markus Trippelsdorf <markus@trippelsdorf.de>	2008-10-25 22:37:55 -04:00
Christoph Hellwig	3856d30ded	ext4: remove unused variable in ext4_get_parent Signed-off-by: Christoph Hellwig <hch@lst.de> [ All users removed in "switch all filesystems over to d_obtain_alias", aka commit `440037287c` ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-23 12:03:23 -07:00
Linus Torvalds	2248485640	Merge git://git.kernel.org/pub/scm/linux/kernel/git/viro/bdev * git://git.kernel.org/pub/scm/linux/kernel/git/viro/bdev: (66 commits) [PATCH] kill the rest of struct file propagation in block ioctls [PATCH] get rid of struct file use in blkdev_ioctl() BLKBSZSET [PATCH] get rid of blkdev_locked_ioctl() [PATCH] get rid of blkdev_driver_ioctl() [PATCH] sanitize blkdev_get() and friends [PATCH] remember mode of reiserfs journal [PATCH] propagate mode through swsusp_close() [PATCH] propagate mode through open_bdev_excl/close_bdev_excl [PATCH] pass fmode_t to blkdev_put() [PATCH] kill the unused bsize on the send side of /dev/loop [PATCH] trim file propagation in block/compat_ioctl.c [PATCH] end of methods switch: remove the old ones [PATCH] switch sr [PATCH] switch sd [PATCH] switch ide-scsi [PATCH] switch tape_block [PATCH] switch dcssblk [PATCH] switch dasd [PATCH] switch mtd_blkdevs [PATCH] switch mmc ...	2008-10-23 10:23:07 -07:00
Christoph Hellwig	440037287c	[PATCH] switch all filesystems over to d_obtain_alias Switch all users of d_alloc_anon to d_obtain_alias. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-10-23 05:13:01 -04:00
Al Viro	8264613def	[PATCH] switch quota_on-related stuff to kern_path() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-10-23 05:12:44 -04:00
Al Viro	9a1c354276	[PATCH] pass fmode_t to blkdev_put() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-10-21 07:48:58 -04:00
Alexey Dobriyan	6da0b38f44	fs/Kconfig: move ext2, ext3, ext4, JBD, JBD2 out Use fs/*/Kconfig more, which is good because everything related to one filesystem is in one place and fs/Kconfig is quite fat. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 11:43:59 -07:00
Theodore Ts'o	f287a1a561	ext4: Remove automatic enabling of the HUGE_FILE feature flag If the HUGE_FILE feature flag is not set, don't allow the creation of large files, instead of automatically enabling the feature flag. Recent versions of mke2fs will set the HUGE_FILE flag automatically anyway for ext4 filesystems. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-10-16 22:50:48 -04:00
Theodore Ts'o	3e624fc72f	ext4: Replace hackish ext4_mb_poll_new_transaction with commit callback The multiblock allocator needs to be able to release blocks (and issue a blkdev discard request) when the transaction which freed those blocks is committed. Previously this was done via a polling mechanism when blocks are allocated or freed. A much better way of doing things is to create a jbd2 callback function and attaching the list of blocks to be freed directly to the transaction structure. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-10-16 20:00:24 -04:00
Theodore Ts'o	01436ef2e4	ext4: Remove unused mount options: nomballoc, mballoc, nocheck These mount options don't actually do anything any more, so remove them. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-10-17 07:22:35 -04:00
Manish Katiyar	0b09923eab	ext4: Remove compile warnings when building w/o CONFIG_PROC_FS Signed-off-by: Manish Katiyar <mkatiyar@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-10-17 14:58:45 -04:00
Eric Sesterhenn	5128273a32	ext4: Add missing newlines to printk messages There are some newlines missing in ext4_check_descriptors, which cause the printk level to be printed out when the next printk call is made: [ 778.847265] EXT4-fs: ext4_check_descriptors: Block bitmap for group 0 not in group (block 1509949442)!<3>EXT4-fs: group descriptors corrupted! [ 802.646630] EXT4-fs: ext4_check_descriptors: Inode bitmap for group 0 not in group (block 9043971)!<3>EXT4-fs: group descriptors corrupted! Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-10-17 09:16:19 -04:00
Aneesh Kumar K.V	22208dedbd	ext4: Fix file fragmentation during large file write. The range_cyclic writeback mode uses the address_space writeback_index as the start index for writeback. With delayed allocation we were updating writeback_index wrongly resulting in highly fragmented file. This patch reduces the number of extents reduced from 4000 to 27 for a 3GB file. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-10-16 10:10:36 -04:00
Aneesh Kumar K.V	af6f029d38	ext4: Use tag dirty lookup during mpage_da_submit_io This enables us to drop the range_cont writeback mode use from ext4_da_writepages. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>	2008-10-14 09:20:19 -04:00
Theodore Ts'o	8a0aba733d	ext4: let the block device know when unused blocks can be discarded Let the block device know when unused blocks can be discarded, using the new sb_issue_discard() interface. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-10-16 10:06:27 -04:00
Aneesh Kumar K.V	a1aebc1e2d	ext4: Don't reuse released data blocks until transaction commits We need to make sure we don't reuse the data blocks released during the transaction untill the transaction commits. We force this mode only for ordered and journalled mode. Writeback mode already don't provided data consistency. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-10-10 20:13:31 -04:00
Aneesh Kumar K.V	c894058d66	ext4: Use an rbtree for tracking blocks freed during transaction. With this patch we track the block freed during a transaction using red-black tree. We also make sure contiguous blocks freed are collected in one node in the tree. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-10-16 10:14:27 -04:00
Aneesh Kumar K.V	c2774d84fd	ext4: Do mballoc init before doing filesystem recovery During filesystem recovery we may be doing a truncate which expects some of the mballoc data structures to be initialized. So do ext4_mb_init before recovery. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-10-10 20:07:20 -04:00
Aneesh Kumar K.V	688f05a019	ext4: Free ext4_prealloc_space using kmem_cache_free We should use kmem_cache_free to free memory allocated via kmem_cache_alloc Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-10-13 12:14:14 -04:00
Steven Whitehouse	a447c09324	vfs: Use const for kernel parser table This is a much better version of a previous patch to make the parser tables constant. Rather than changing the typedef, we put the "const" in all the various places where its required, allowing the __initconst exception for nfsroot which was the cause of the previous trouble. This was posted for review some time ago and I believe its been in -mm since then. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Cc: Alexander Viro <aviro@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-13 10:10:37 -07:00
Alexander Beregalov	3244fcb1ae	ext4: fix build failure without procfs fs/ext4/super.c: In function 'ext4_fill_super': fs/ext4/super.c:2226: error: 'ext4_ui_proc_fops' undeclared (first use in this function) fs/ext4/super.c:2226: error: (Each undeclared identifier is reported only once fs/ext4/super.c:2226: error: for each function it appears in.) Signed-off-by: Alexander Beregalov <a.beregalov@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-10-12 17:27:49 -04:00
Hidehiro Kawai	5bf5683a33	ext4: add an option to control error handling on file data If the journal doesn't abort when it gets an IO error in file data blocks, the file data corruption will spread silently. Because most of applications and commands do buffered writes without fsync(), they don't notice the IO error. It's scary for mission critical systems. On the other hand, if the journal aborts whenever it gets an IO error in file data blocks, the system will easily become inoperable. So this patch introduces a filesystem option to determine whether it aborts the journal or just call printk() when it gets an IO error in file data. If you mount an ext4 fs with data_err=abort option, it aborts on file data write error. If you mount it with data_err=ignore, it doesn't abort, just call printk(). data_err=ignore is the default. Here is the corresponding patch of the ext3 version: http://kerneltrap.org/mailarchive/linux-kernel/2008/9/9/3239374 Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-10-10 22:12:43 -04:00
Hidehiro Kawai	7ffe1ea894	ext4: add checks for errors from jbd2 If the journal has aborted due to a checkpointing failure, we have to keep the contents of the journal space. Otherwise, the filesystem will lose uncheckpointed metadata completely and become inconsistent. To avoid this, we need to keep needs_recovery flag if checkpoint has failed. With this patch, ext4_put_super() detects a checkpointing failure from the return value of journal_destroy(), then it invokes ext4_abort() to make the filesystem read only and keep needs_recovery flag. Errors from jbd2_journal_flush() are also handled by this patch in some places. Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-10-10 20:29:21 -04:00
Theodore Ts'o	03010a3350	ext4: Rename ext4dev to ext4 The ext4 filesystem is getting stable enough that it's time to drop the "dev" prefix. Also remove the requirement for the TEST_FILESYS flag. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-10-10 20:02:48 -04:00
Andi Kleen	39d80c33a0	ext4: Avoid double dirtying of super block in ext4_put_super() While reading code I noticed that ext4_put_super() dirties the superblock bh twice. It is always done in ext4_commit_super() too. Remove the redundant dirty operation. Should be a nop semantically. Signed-off-by: Andi Kleen <ak@linux.intel.com>	2008-10-06 21:37:44 -04:00
Eric Sandeen	6873fa0de1	Hook ext4 to the vfs fiemap interface. ext4_ext_walk_space() was reinstated to be used for iterating over file extents with a callback; it is used by the ext4 fiemap implementation. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: linux-ext4@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org	2008-10-07 00:46:36 -04:00
Kalpak Shah	4d20c685fa	ext4: fix xattr deadlock ext4_xattr_set_handle() eventually ends up calling ext4_mark_inode_dirty() which tries to expand the inode by shifting the EAs. This leads to the xattr_sem being downed again and leading to a deadlock. This patch makes sure that if ext4_xattr_set_handle() is in the call-chain, ext4_mark_inode_dirty() will not expand the inode. Signed-off-by: Kalpak Shah <kalpak.shah@sun.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-10-08 23:21:54 -04:00
Theodore Ts'o	ede86cc473	ext4: Add debugging markers that can be used by systemtap This debugging markers are designed to debug problems such as the random filesystem latency problems reported by Arjan. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-10-05 20:50:06 -04:00
Frederic Bohe	c806e68f56	ext4: fix initialization of UNINIT bitmap blocks This fixes a bug which caused on-line resizing of filesystems with a 1k blocksize to fail. The root cause of this bug was the fact that if an uninitalized bitmap block gets read in by userspace (which e2fsprogs does try to avoid, but can happen when the blocksize is less than the pagesize and an adjacent blocks is read into memory) ext4_read_block_bitmap() was erroneously depending on the buffer uptodate flag to decide whether it needed to initialize the bitmap block in memory --- i.e., to set the standard set of blocks in use by a block group (superblock, bitmaps, inode table, etc.). Essentially, ext4_read_block_bitmap() assumed it was the only routine that might try to read a block containing a block bitmap, which is simply not true. To fix this, ext4_read_block_bitmap() and ext4_read_inode_bitmap() must always initialize uninitialized bitmap blocks. Once a block or inode is allocated out of that bitmap, it will be marked as initialized in the block group descriptor, so in general this won't result any extra unnecessary work. Signed-off-by: Frederic Bohe <frederic.bohe@bull.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-10-10 08:09:18 -04:00
Theodore Ts'o	c2ea3fde61	ext4: Remove old legacy block allocator Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-10-10 09:40:52 -04:00
Theodore Ts'o	240799cdf2	ext4: Use readahead when reading an inode from the inode table With modern hard drives, reading 64k takes roughly the same time as reading a 4k block. So request readahead for adjacent inode table blocks to reduce the time it takes when iterating over directories (especially when doing this in htree sort order) in a cold cache case. With this patch, the time it takes to run "git status" on a kernel tree after flushing the caches via "echo 3 > /proc/sys/vm/drop_caches" is reduced by 21%. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-10-09 23:53:47 -04:00
Theodore Ts'o	5e8814f2f7	ext4: Combine proc file handling into a single set of functions Previously mballoc created a separate set of functions for each proc file. This combines the tunables into a single set of functions which gets used for all of the per-superblock proc files, saving approximately 2k of compiled object code. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-09-23 18:07:35 -04:00
Theodore Ts'o	9f6200bbfc	ext4: move /proc setup and teardown out of mballoc.c ...and into the core setup/teardown code in fs/ext4/super.c so that other parts of ext4 can define tuning parameters. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-09-23 09:18:24 -04:00
Theodore Ts'o	f702ba0fd7	ext4: Don't use 'struct dentry' for internal lookups This is a port of a patch from Linus which fixes a 200+ byte stack usage problem in ext4_get_parent(). It's more efficient to pass down only the actual parts of the dentry that matter: the parent inode and the name, instead of allocating a struct dentry on the stack. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-09-22 15:21:01 -04:00
Theodore Ts'o	914258bf2c	ext4/jbd2: Avoid WARN() messages when failing to write to the superblock This fixes some very common warnings reported by kerneloops.org Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-10-06 21:35:40 -04:00
Eric Sandeen	730c213c79	ext4: use percpu data structures for lg_prealloc_list lg_prealloc_list seems to cry out for a per-cpu data structure; on a large smp system I think this should be better. I've lightly tested this change on a 4-cpu system. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Acked-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-09-13 15:23:29 -04:00
Theodore Ts'o	8eea80d52b	ext4: Renumber EXT4_IOC_MIGRATE Pick an ioctl number for EXT4_IOC_MIGRATE that won't conflict with other ext4 ioctl's. Since there haven't been any major userspace users of this ioctl, we can afford to change this now, to avoid potential problems later. Also, reorder the ioctl numbers in ext4.h to avoid this sort of mistake in the future. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-09-13 19:54:35 -04:00
Aneesh Kumar K.V	4db46fc266	ext4: hook the ext3 migration interface to the EXT4_IOC_SETFLAGS ioctl This patch hooks the ext3 to ext4 migrate interface to EXT4_IOC_SETFLAGS ioctl. The userspace interface is via chattr +e. We only allow setting extent flags. Clearing extent flag (migrating from ext4 to ext3) is not supported. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-10-08 23:34:06 -04:00
Aneesh Kumar K.V	2a43a87800	ext4: elevate write count for migrate ioctl The migrate ioctl writes to the filsystem, so we need to elevate the write count. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-09-13 12:52:26 -04:00
Li Zefan	7ee1ec4ca3	ext4: add missing unlock in ext4_check_descriptors() on error path If there group descriptors are corrupted we need unlock the block group lock before returning from the function; else we will oops when freeing a spinlock which is still being held. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-09-08 10:47:19 -04:00
Theodore Ts'o	05496769e5	jbd2: clean up how the journal device name is printed Calculate the journal device name once and stash it away in the journal_s structure. This avoids needing to call bdevname() everywhere and reduces stack usage by not needing to allocate an on-stack buffer. In addition, we eliminate the '/' that can appear in device names (e.g. "cciss/c0d0p9" --- see kernel bugzilla #11321) that can cause problems when creating proc directory names, and include the inode number to support ocfs2 which creates multiple journals with different inode numbers. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-09-16 14:36:17 -04:00
Alexey Dobriyan	899fc1a4cf	ext4: fix #11321 : create /proc/ext4/*/stats more carefully ext4 creates per-suberblock directory in /proc/ext4/ . Name used as basis is taken from bdevname, which, surprise, can contain slash. However, proc while allowing to use proc_create("a/b", parent) form of PDE creation, assumes that parent/a was already created. bdevname in question is 'cciss/c0d0p9', directory is not created and all this stuff goes directly into /proc (which is real bug). Warning comes when _second_ partition is mounted. http://bugzilla.kernel.org/show_bug.cgi?id=11321 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-09-14 10:21:33 -04:00
Frederic Bohe	c62a11fd95	Update flex_bg free blocks and free inodes counters when resizing. This fixes a bug which prevented the newly created inodes after a resize from being used on filesystems with flex_bg. Signed-off-by: Frederic Bohe <frederic.bohe@bull.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-09-08 10:20:24 -04:00
Eric Sandeen	9d9f177572	ext4: Avoid printk floods in the face of directory corruption Note: some people thinks this represents a security bug, since it might make the system go away while it is printing a large number of console messages, especially if a serial console is involved. Hence, it has been assigned CVE-2008-3528, but it requires that the attacker either has physical access to your machine to insert a USB disk with a corrupted filesystem image (at which point why not just hit the power button), or is otherwise able to convince the system administrator to mount an arbitrary filesystem image (at which point why not just include a setuid shell or world-writable hard disk device file or some such). Me, I think they're just being silly. --tytso Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: linux-ext4@vger.kernel.org Cc: Eugene Teo <eugeneteo@kernel.sg>	2008-10-09 11:15:52 -04:00
Aneesh Kumar K.V	cf17fea657	ext4: Properly update i_disksize. With delayed allocation we use i_data_sem to update i_disksize. We need to update i_disksize only if the new size specified is greater than the current value and we need to make sure we don't race with other i_disksize update. With delayed allocation we will switch to the write_begin function for non-delayed allocation if we are low on free blocks. This means the write_begin function for non-delayed allocation also needs to use the same locking. We also need to check and update i_disksize even if the new size is less that inode.i_size because of delayed allocation. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-09-13 13:06:18 -04:00
Aneesh Kumar K.V	ae4d537211	ext4: truncate block allocated on a failed ext4_write_begin For blocksize < pagesize we need to remove blocks that got allocated in block_write_begin() if we fail with ENOSPC for later blocks. block_write_begin() internally does this if it allocated pages locally. This makes sure we don't have blocks outside inode.i_size during ENOSPC. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-09-13 13:10:25 -04:00
Aneesh Kumar K.V	df22291ff0	ext4: Retry block allocation if we have free blocks left When we truncate files, the meta-data blocks released are not reused untill we commit the truncate transaction. That means delayed get_block request will return ENOSPC even if we have free blocks left. Force a journal commit and retry block allocation if we get ENOSPC with free blocks left. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-09-08 23:05:34 -04:00
Aneesh Kumar K.V	166348dd37	ext4: Don't add the inode to journal handle until after the block is allocated Make sure we don't add the inode to the journal handle until after the block allocation, so that a journal commit will not include the inode in case of block allocation failure. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-09-08 23:08:40 -04:00
Aneesh Kumar K.V	68629f29c6	ext4: Fix ext4 nomballoc allocator for ENOSPC We run into ENOSPC error on nonmballoc ext4, even when there is free blocks on the filesystem. The patch includes two changes: a) Set reservation to NULL if we trying to allocate near group_target_block from the goal group if the free block in the group is less than windows. This should give us a better chance to allocate near group_target_block. This also ensures that if we are not allocating near group_target_block then we don't trun off reservation. This should enable us to allocate with reservation from other groups that have large free blocks count. b) we don't need to check the window size if the block reservation is off. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-09-08 23:09:17 -04:00
Aneesh Kumar K.V	5c79161689	ext4: Signed arithmetic fix This patch converts some usage of ext4_fsblk_t to s64. This is needed so that some of the sign conversion works as expected in if loops. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-10-08 23:12:24 -04:00
Aneesh Kumar K.V	79f0be8d2e	ext4: Switch to non delalloc mode when we are low on free blocks count. The delayed allocation code allocates blocks during writepages(), which can not handle block allocation failures. To deal with this, we switch away from delayed allocation mode when we are running low on free blocks. This also allows us to avoid needing to reserve a large number of meta-data blocks in case all of the requested blocks are discontiguous. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-10-08 23:13:30 -04:00
Aneesh Kumar K.V	6bc6e63fcd	ext4: Add percpu dirty block accounting. This patch adds dirty block accounting using percpu_counters. Delayed allocation block reservation is now done by updating dirty block counter. In a later patch we switch to non delalloc mode if the filesystem free blocks is greater than 150% of total filesystem dirty blocks Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao<cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-10-10 09:39:00 -04:00
Aneesh Kumar K.V	030ba6bc67	ext4: Retry block reservation During block reservation if we don't have enough blocks left, retry block reservation with smaller block counts. This makes sure we try fallocate and DIO with smaller request size and don't fail early. The delayed allocation reservation cannot try with smaller block count. So retry block reservation to handle temporary disk full conditions. Also print free blocks details if we fail block allocation during writepages. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-09-08 23:14:50 -04:00
Aneesh Kumar K.V	a30d542a00	ext4: Make sure all the block allocation paths reserve blocks With delayed allocation we need to make sure block are reserved before we attempt to allocate them. Otherwise we get block allocation failure (ENOSPC) during writepages which cannot be handled. This would mean silent data loss (We do a printk stating data will be lost). This patch updates the DIO and fallocate code path to do block reservation before block allocation. This is needed to make sure parallel DIO and fallocate request doesn't take block out of delayed reserve space. When free blocks count go below a threshold we switch to a slow patch which looks at other CPU's accumulated percpu counter values. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-10-09 10:56:23 -04:00
Aneesh Kumar K.V	c4a0c46ec9	ext4: invalidate pages if delalloc block allocation fails. We are a bit agressive in invalidating all the pages. But it is ok because we really don't know why the block allocation failed and it is better to come of the writeback path so that user can look for more info. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>	2008-08-19 21:08:18 -04:00
Theodore Ts'o	af5bc92dde	ext4: Fix whitespace checkpatch warnings/errors Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-09-08 22:25:24 -04:00
Theodore Ts'o	e5f8eab885	ext4: Fix long long checkpatch warnings Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-09-08 22:25:04 -04:00
Theodore Ts'o	4776004f54	ext4: Add printk priority levels to clean up checkpatch warnings Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-09-08 23:00:52 -04:00
Mingming Cao	1f7c14c62c	percpu counter: clean up percpu_counter_sum_and_set() percpu_counter_sum_and_set() and percpu_counter_sum() is the same except the former updates the global counter after accounting. Since we are taking the fbc->lock to calculate the precise value of the counter in percpu_counter_sum() anyway, it should simply set fbc->count too, as the percpu_counter_sum_and_set() does. This patch merges these two interfaces into one. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-10-09 12:50:59 -04:00
Aneesh Kumar K.V	5e745b041f	ext4: Fix small file fragmentation For small file block allocations, mballoc uses per cpu prealloc space. Use goal block when searching for the right prealloc space. Also make sure ext4_da_writepages tries to write all the pages for small files in single attempt Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-08-18 18:00:57 -04:00
Aneesh Kumar K.V	91246c0090	ext4: Initialize writeback_index to 0 when allocating a new inode The write_cache_pages() function uses the mapping->writeback_index as the starting index to write out when range_cyclic is set. Properly initialize writeback_index so that we start the writeout at index 0. This was found when debugging the small file fragmentation on ext4. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-08-19 21:14:52 -04:00
Aneesh Kumar K.V	16eb729564	ext4: make sure ext4_has_free_blocks returns 0 for ENOSPC Fix ext4_has_free_blocks() to return 0 when we don't have enough space. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-08-19 21:16:54 -04:00
Mingming Cao	525f4ed8dc	ext4: journal credit fix for the delayed allocation's writepages() function Previous delalloc writepages implementation started a new transaction outside of a loop which called get_block() to do the block allocation. Since we didn't know exactly how many blocks would need to be allocated, the estimated journal credits required was very conservative and caused many issues. With the reworked delayed allocation, a new transaction is created for each get_block(), thus we don't need to guess how many credits for the multiple chunk of allocation. We start every transaction with enough credits for inserting a single exent. When estimate the credits for indirect blocks to allocate a chunk of blocks, we need to know the number of data blocks to allocate. We use the total number of reserved delalloc datablocks; if that is too big, for non-extent files, we need to limit the number of blocks to EXT4_MAX_TRANS_BLOCKS. Code cleanup from Aneesh. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Reviewed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-08-19 22:15:58 -04:00
Aneesh Kumar K.V	a1d6cc563b	ext4: Rework the ext4_da_writepages() function With the below changes we reserve credit needed to insert only one extent resulting from a call to single get_block. This makes sure we don't take too much journal credits during writeout. We also don't limit the pages to write. That means we loop through the dirty pages building largest possible contiguous block request. Then we issue a single get_block request. We may get less block that we requested. If so we would end up not mapping some of the buffer_heads. That means those buffer_heads are still marked delay. Later in the writepage callback via __mpage_writepage we redirty those pages. We should also not limit/throttle wbc->nr_to_write in the filesystem writepages callback. That cause wrong behaviour in generic_sync_sb_inodes caused by wbc->nr_to_write being <= 0 Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-08-19 21:55:02 -04:00
Mingming Cao	f3bd1f3fa8	ext4: journal credits reservation fixes for DIO, fallocate DIO and fallocate credit calculation is different than writepage, as they do start a new journal right for each call to ext4_get_blocks_wrap(). This patch uses the helper function in DIO and fallocate case, passing a flag indicating that the modified data are contigous thus could account less indirect/index blocks. This patch also fixed the journal credit reservation for direct I/O (DIO). Previously the estimated credits for DIO only was calculated for non-extent files, which was not enough if the file is extent-based. Also fixed was fallocate double-counting credits for modifying the the superblock. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-08-19 22:16:03 -04:00
Mingming Cao	ee12b63068	ext4: journal credits reservation fixes for extent file writepage This patch modified the writepage/write_begin credit calculation for extent files, to use the credits caculation helper function. The current calculation of how many index/leaf blocks should be accounted is too conservetive, it always considered the worse case, where the tree level is 5, and in the case of multiple chunk allocations, it always assumed no blocks were dirtied in common across the allocations. This path uses the accurate depth of the inode with some extras to calculate the index blocks, and also less conservative in the case of multiple allocation accounting. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-08-19 22:16:05 -04:00
Mingming Cao	a02908f19c	ext4: journal credits calulation cleanup and fix for non-extent writepage When considering how many journal credits are needed for modifying a chunk of data, we need to account for the super block, inode block, quota blocks and xattr block, indirect/index blocks, also, group bitmap and group descriptor blocks for new allocation (including data and indirect/index blocks). There are many places in ext4 do the calculation on their own and often missed one or two meta blocks, and often they assume single block allocation, and did not considering the multile chunk of allocation case. This patch is trying to cleanup current journal credit code, provides some common helper funtion to calculate the journal credits, to be used for writepage, writepages, DIO, fallocate, migration, defrag, and for both nonextent and extent files. This patch modified the writepage/write_begin credit caculation for nonextent files, to use the new helper function. It also fixed the problem that writepage on nonextent files did not consider the case blocksize <pagesize, thus could possibelly need multiple block allocation in a single transaction. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-08-19 22:16:07 -04:00
Eric Sandeen	c001077f40	ext4: Fix bug where we return ENOSPC even though we have plenty of inodes The find_group_flex() function starts with best_flex as the parent_fbg_group, which happens to have 0 inodes free. Some of the flex groups searched have free blocks and free inodes, but the flex_freeb_ratio is < 10, so they're skipped. Then when a group is compared to the current "best" flex group, it does not have more free blocks than "best", so it is skipped as well. This continues until no flex group with free inodes is found which has a proper ratio or which has more free blocks than the "best" group, and we're left with a "best" group that has 0 inodes free, and we return -ENOSPC. We fix this by changing the logic so that if the current "best" flex group has no inodes free, and the current one does have room, it is promoted to the next "best." Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-08-19 22:19:50 -04:00
Josef Bacik	37609fd5ae	ext4: don't try to resize if there are no reserved gdt blocks left When trying to resize an ext4 fs and you run out of reserved gdt blocks, you get an error that doesn't actually tell you what went wrong, it just says that the gdb it picked is not correct, which is the case since you don't have any reserved gdt blocks left. This patch adds a check to make sure you have reserved gdt blocks to use, and if not prints out a more relevant error. Signed-off-by: Josef Bacik <jbacik@redhat.com> Cc: <linux-ext4@vger.kernel.org> Cc: Andreas Dilger <adilger@sun.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-08-19 22:13:41 -04:00
Theodore Ts'o	88aa3cff4e	ext4: Use ext4_discard_reservations instead of mballoc-specific call In ext4_ext_truncate(), we should use the more generic ext4_discard_reservations() call so we do the right thing when the filesystem is mounted with the nomballoc option. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Mingming Cao <cmm@us.ibm.com>	2008-08-16 07:57:35 -04:00
Theodore Ts'o	d015641734	ext4: Fix ext4_dx_readdir hash collision handling This fixes a bug where readdir() would return a directory entry twice if there was a hash collision in an hash tree indexed directory. Signed-off-by: Eugene Dashevsky <eugene@ibrix.com> Signed-off-by: Mike Snitzer <msnitzer@ibrix.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-08-19 21:57:43 -04:00
Mingming Cao	cd21322616	ext4: Fix delalloc release block reservation for truncate Ext4 will release the reserved blocks for delayed allocations when inode is truncated/unlinked. If there is no reserved block at all, we shouldn't need to do so. But current code still tries to release the reserved blocks regardless whether the counters's value is 0. Continue to do that causes the later calculation to go wrong and a kernel BUG_ON() caught that. This doesn't happen for extent-based files, as the calculation for 0 reserved blocks was right for extent based file. This patch fixed the kernel BUG() due to above reason. It adds checks for 0 to avoid unnecessary release and fix calculation for non-extent files. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-08-19 22:16:59 -04:00
Theodore Ts'o	b4df203085	ext4: Fix potential truncate BUG due to i_prealloc_list being non-empty We need to call ext4_discard_reservation() earlier in ext4_truncate(), to avoid a BUG() in ext4_mb_return_to_preallocation(), which is called (ultimately) by ext4_free_blocks(). So we must ditch the blocks on i_prealloc_list before we start freeing the data blocks. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-08-13 21:44:34 -04:00
Aneesh Kumar K.V	bf068ee266	ext4: Handle unwritten extent properly with delayed allocation When using fallocate the buffer_heads are marked unwritten and unmapped. We need to map them in the writepages after a get_block. Otherwise we split the uninit extents, but never write the content to disk. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-08-19 22:16:43 -04:00
Linus Torvalds	8f616cd524	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: remove write-only variables from ext4_ordered_write_end ext4: unexport jbd2_journal_update_superblock ext4: Cleanup whitespace and other miscellaneous style issues ext4: improve ext4_fill_flex_info() a bit ext4: Cleanup the block reservation code path ext4: don't assume extents can't cross block groups when truncating ext4: Fix lack of credits BUG() when deleting a badly fragmented inode ext4: Fix ext4_ext_journal_restart() ext4: fix ext4_da_write_begin error path jbd2: don't abort if flushing file data failed ext4: don't read inode block if the buffer has a write error ext4: Don't allow lg prealloc list to be grow large. ext4: Convert the usage of NR_CPUS to nr_cpu_ids. ext4: Improve error handling in mballoc ext4: lock block groups when initializing ext4: sync up block and inode bitmap reading functions ext4: Allow read/only mounts with corrupted block group checksums ext4: Fix data corruption when writing to prealloc area	2008-08-03 10:50:44 -07:00
Eric Sandeen	7d55992d60	ext4: remove write-only variables from ext4_ordered_write_end The variables 'from' and 'to' are not used anywhere. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Acked-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-08-02 21:22:18 -04:00
Al Viro	77e69dac3c	[PATCH] fix races and leaks in vfs_quota_on() users * new helper: vfs_quota_on_path(); equivalent of vfs_quota_on() sans the pathname resolution. * callers of vfs_quota_on() that do their own pathname resolution and checks based on it are switched to vfs_quota_on_path(); that way we avoid the races. * reiserfs leaked dentry/vfsmount references on several failure exits. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-08-01 11:25:25 -04:00
Hisashi Hifumi	8ab22b9abb	vfs: pagecache usage optimization for pagesize!=blocksize When we read some part of a file through pagecache, if there is a pagecache of corresponding index but this page is not uptodate, read IO is issued and this page will be uptodate. I think this is good for pagesize == blocksize environment but there is room for improvement on pagesize != blocksize environment. Because in this case a page can have multiple buffers and even if a page is not uptodate, some buffers can be uptodate. So I suggest that when all buffers which correspond to a part of a file that we want to read are uptodate, use this pagecache and copy data from this pagecache to user buffer even if a page is not uptodate. This can reduce read IO and improve system throughput. I wrote a benchmark program and got result number with this program. This benchmark do: 1: mount and open a test file. 2: create a 512MB file. 3: close a file and umount. 4: mount and again open a test file. 5: pwrite randomly 300000 times on a test file. offset is aligned by IO size(1024bytes). 6: measure time of preading randomly 100000 times on a test file. The result was: 2.6.26 330 sec 2.6.26-patched 226 sec Arch:i386 Filesystem:ext3 Blocksize:1024 bytes Memory: 1GB On ext3/4, a file is written through buffer/block. So random read/write mixed workloads or random read after random write workloads are optimized with this patch under pagesize != blocksize environment. This test result showed this. The benchmark program is as follows: #include <stdio.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <unistd.h> #include <time.h> #include <stdlib.h> #include <string.h> #include <sys/mount.h> #define LEN 1024 #define LOOP 1024512 / 512MB / main(void) { unsigned long i, offset, filesize; int fd; char buf[LEN]; time_t t1, t2; if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) { perror("cannot mount\n"); exit(1); } memset(buf, 0, LEN); fd = open("/root/test1/testfile", O_CREAT\|O_RDWR\|O_TRUNC); if (fd < 0) { perror("cannot open file\n"); exit(1); } for (i = 0; i < LOOP; i++) write(fd, buf, LEN); close(fd); if (umount("/root/test1/") < 0) { perror("cannot umount\n"); exit(1); } if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) { perror("cannot mount\n"); exit(1); } fd = open("/root/test1/testfile", O_RDWR); if (fd < 0) { perror("cannot open file\n"); exit(1); } filesize = LEN LOOP; for (i = 0; i < 300000; i++){ offset = (random() % filesize) & (~(LEN - 1)); pwrite(fd, buf, LEN, offset); } printf("start test\n"); time(&t1); for (i = 0; i < 100000; i++){ offset = (random() % filesize) & (~(LEN - 1)); pread(fd, buf, LEN, offset); } time(&t2); printf("%ld sec\n", t2-t1); close(fd); if (umount("/root/test1/") < 0) { perror("cannot umount\n"); exit(1); } } Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Hellwig <hch@infradead.org> Cc: Jan Kara <jack@ucw.cz> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-28 16:30:21 -07:00
Al Viro	e6305c43ed	[PATCH] sanitize ->permission() prototype * kill nameidata * argument; map the 3 bits in ->flags anybody cares about to new MAY_... ones and pass with the mask. * kill redundant gfs2_iop_permission() * sanitize ecryptfs_permission() * fix remaining places where ->permission() instances might barf on new MAY_... found in mask. The obvious next target in that direction is permission(9) folded fix for nfs_permission() breakage from Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-07-26 20:53:14 -04:00
Theodore Ts'o	2b2d6d0197	ext4: Cleanup whitespace and other miscellaneous style issues Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-26 16:15:44 -04:00
Alexey Dobriyan	51cc50685a	SL*B: drop kmem cache argument from constructor Kmem cache passed to constructor is only needed for constructors that are themselves multiplexeres. Nobody uses this "feature", nor does anybody uses passed kmem cache in non-trivial way, so pass only pointer to object. Non-trivial places are: arch/powerpc/mm/init_64.c arch/powerpc/mm/hugetlbpage.c This is flag day, yes. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Christoph Lameter <cl@linux-foundation.org> Cc: Jon Tollefson <kniht@linux.vnet.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Matt Mackall <mpm@selenic.com> [akpm@linux-foundation.org: fix arch/powerpc/mm/hugetlbpage.c] [akpm@linux-foundation.org: fix mm/slab.c] [akpm@linux-foundation.org: fix ubifs] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-26 12:00:07 -07:00
Li Zefan	ec05e868ac	ext4: improve ext4_fill_flex_info() a bit - use kzalloc() instead of kmalloc() + memset() - improve a printk info Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-07-24 12:49:59 -04:00
Aneesh Kumar K.V	12219aea6b	ext4: Cleanup the block reservation code path The truncate patch should not use the i_allocated_meta_blocks value. So add seperate functions to be used in the truncate and alloc path. We also need to release the meta-data block that we reserved for the blocks that we are truncating. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-07-17 16:12:08 -04:00
Theodore Ts'o	34071da71a	ext4: don't assume extents can't cross block groups when truncating With the FLEX_BG layout, there is no reason why extents can't cross block groups, so make the truncate code reserve enough credits so we don't BUG if we come across such an extent. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-08-01 21:59:19 -04:00
Theodore Ts'o	bc965ab3f2	ext4: Fix lack of credits BUG() when deleting a badly fragmented inode The extents codepath for ext4_truncate() requests journal transaction credits in very small chunks, requesting only what is needed. This means there may not be enough credits left on the transaction handle after ext4_truncate() returns and then when ext4_delete_inode() tries finish up its work, it may not have enough transaction credits, causing a BUG() oops in the jbd2 core. Also, reserve an extra 2 blocks when starting an ext4_delete_inode() since we need to update the inode bitmap, as well as update the orphaned inode linked list. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-08-02 21:10:38 -04:00
Theodore Ts'o	0123c93998	ext4: Fix ext4_ext_journal_restart() The ext4_ext_journal_restart() is a convenience function which checks to see if the requested number of credits is present, and if so it closes the current transaction and attaches the current handle to the new transaction. Unfortunately, it wasn't proprely checking the return value from ext4_journal_extend(), so it was starting a new transaction when one was not necessary, and returning an error when all that was necessary was to restart the handle with a new transaction. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-08-01 20:57:54 -04:00
Eric Sandeen	d5a0d4f732	ext4: fix ext4_da_write_begin error path ext4_da_write_begin needs to call journal_stop before returning, if the page allocation fails. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Acked-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-08-02 18:51:06 -04:00
Hidehiro Kawai	9c83a923c6	ext4: don't read inode block if the buffer has a write error A transient I/O error can corrupt inode data. Here is the scenario: (1) update inode_A at the block_B (2) pdflush writes out new inode_A to the filesystem, but it results in write I/O error, at this point, BH_Uptodate flag of the buffer for block_B is cleared and BH_Write_EIO is set (3) create new inode_C which located at block_B, and __ext4_get_inode_loc() tries to read on-disk block_B because the buffer is not uptodate (4) if it can read on-disk block_B successfully, inode_A is overwritten by old data This patch makes __ext4_get_inode_loc() not read the inode block if the buffer has BH_Write_EIO flag. In this case, the buffer should have the latest information, so setting the uptodate flag to the buffer (this avoids WARN_ON_ONCE() in mark_buffer_dirty().) According to this change, we would need to test BH_Write_EIO flag for the error checking. Currently nobody checks write I/O errors on metadata buffers, but it will be done in other patches I'm working on. Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Cc: sugita <yumiko.sugita.yf@hitachi.com> Cc: Satoshi OSHIMA <satoshi.oshima.fk@hitachi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Jan Kara <jack@ucw.cz> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-07-26 16:39:26 -04:00
Aneesh Kumar K.V	6be2ded1d7	ext4: Don't allow lg prealloc list to be grow large. Currently, the locality group prealloc list is freed only when there is a block allocation failure. This can result in large number of entries in the preallocation list making ext4_mb_use_preallocated() expensive. To fix this, we convert the locality group prealloc list to a hash list. The hash index is the order of number of blocks in the prealloc space with a max order of 9. When adding prealloc space to the list we make sure total entries for each order does not exceed 8. If it is more than 8 we discard few entries and make sure the we have only <= 5 entries. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-07-23 14:14:05 -04:00
Aneesh Kumar K.V	1320cbcf77	ext4: Convert the usage of NR_CPUS to nr_cpu_ids. NR_CPUS can be really large. We should be using nr_cpu_ids instead. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-07-23 14:09:26 -04:00
Aneesh Kumar K.V	ce89f46cb8	ext4: Improve error handling in mballoc Don't call BUG_ON on file system failures. Instead use ext4_error and also handle the continue case properly. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-07-23 14:09:29 -04:00
Eric Sandeen	b5f10eed81	ext4: lock block groups when initializing I noticed when filling a 1T filesystem with 4 threads using the fs_mark benchmark: fs_mark -d /mnt/test -D 256 -n 100000 -t 4 -s 20480 -F -S 0 that I occasionally got checksum mismatch errors: EXT4-fs error (device sdb): ext4_init_inode_bitmap: Checksum bad for group 6935 etc. I'd reliably get 4-5 of them during the run. It appears that the problem is likely a race to init the bg's when the uninit_bg feature is enabled. With the patch below, which adds sb_bgl_locking around initialization, I was able to complete several runs with no errors or warnings. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-08-02 21:21:08 -04:00
Eric Sandeen	e29d1cde63	ext4: sync up block and inode bitmap reading functions ext4_read_block_bitmap and read_inode_bitmap do essentially the same thing, and yet they are structured quite differently. I came across this difference while looking at doing bg locking during bg initialization. This patch: * removes unnecessary casts in the error messages * renames read_inode_bitmap to ext4_read_inode_bitmap * and more substantially, restructures the inode bitmap reading function to be more like the block bitmap counterpart. The change to the inode bitmap reader simplifies the locking to be applied in the next patch. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-08-02 21:21:02 -04:00
Theodore Ts'o	8a266467b8	ext4: Allow read/only mounts with corrupted block group checksums If the block group checksums are corrupted, still allow the mount to succeed, so e2fsck can have a chance to try to fix things up. Add code in the remount r/w path to make sure the block group checksums are valid before allowing the filesystem to be remounted read/write. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-26 14:34:21 -04:00
Aneesh Kumar K.V	d03856bd5e	ext4: Fix data corruption when writing to prealloc area Inserting an extent can cause a new entry in the already existing index block. That doesn't increase the depth of the instead. Instead it adds a new leaf block. Now with the new leaf block the path information corresponding to the logical block should be fetched from the new block. The old path will be pointing to the old leaf block. We need to recalucate the path information on extent insert even if depth doesn't change. Without this change, the extent merge after converting an unwritten extent to initialized extent takes the wrong extent and cause data corruption. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-08-02 18:51:32 -04:00
Eric Sandeen	e4079a11f5	ext4: do not set extents feature from the kernel We've talked for a while about getting rid of any feature- setting from the kernel; this gets rid of the code which would set the INCOMPAT_EXTENTS flag on the first file write when mounted as ext4[dev]. With this patch, if the extents feature is not already set on disk, then mounting as ext4 will fall back to noextents with a warning, and if -o extents is explicitly requested, the mount will fail, also with warning. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V	c07651b556	ext4: Don't allow nonextenst mount option for large filesystem The block mapped inode format can address only blocks within 232. This causes a number of issues, the biggest of which is that the block allocator needs to be taught that certain inodes can not utilize block numbers > 232. So until this is fixed, it is simplest to fail mounting of file systems with more than 2**32 blocks if the -o noextents option is given. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V	dd919b9822	ext4: Enable delalloc by default. Enable delalloc by default to ensure it gets sufficient testing and because it makes the filesystem much more efficient. Add a nodealalloc option to disable delayed allocation, and update ext4_show_options to show delayed allocation off if it is disabled. If the data=journal mount option is used, disable delayed allocation since the delalloc code doesn't support data=journal yet. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Mingming Cao <cmm@us.ibm.com>	2008-07-11 19:27:31 -04:00
Mingming Cao	3e3398a08d	ext4: delayed allocation i_blocks fix for stat Right now i_blocks is not getting updated until the blocks are actually allocaed on disk. This means with delayed allocation, right after files are copied, "ls -sF" shoes the file as taking 0 blocks on disk. "du" also shows the files taking zero space, which is highly confusing to the user. Since delayed allocation already keeps track of per-inode total number of blocks that are subject to delayed allocation, this patch fix this by using that to adjust the value returned by stat(2). When real block allocation is done, the i_blocks will get updated. Since the reserved blocks for delayed allocation will be decreased, this will be keep value returned by stat(2) consistent. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Mingming Cao	632eaeab1f	ext4: fix delalloc i_disksize early update issue Ext4_da_write_end() used walk_page_buffers() with a callback function of ext4_bh_unmapped_or_delay() to check if it extended the file size without allocating any blocks (since in this case i_disksize needs to be updated). However, this is didn't work proprely because the buffer head has not been marked dirty yet --- this is done later in block_commit_write() --- which caused ext4_bh_unmapped_or_delay() to always return false. In addition, walk_page_buffers() checks all of the buffer heads covering the page, and the only buffer_head that should be checked is the one covering the end of the write. Otherwise, given a 1k blocksize filesystem and a 4k page size, the buffer head covering the first 1k stripe of the file could be unmapped (because it was a sparse file), and the second or third buffer_head covering that page could be mapped, and using walk_page_buffers() would fail in this case since it would stop at the first unmapped buffer_head and return true. The core problem is that walk_page_buffers() was intended to do work in a callback function, and a non-zero return value indicated a failure, which termined the walk of the buffer heads covering the page. It was not intended to be used with a boolean function, such as ext4_bh_unmapped_or_delay(). Add addtional fix from Aneesh to protect i_disksize update rave with truncate. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V	f0e6c98593	ext4: Handle page without buffers in ext4__writepage() It can happen that buffers are removed from the page before it gets marked dirty and then is passed to writepage(). In writepage() we just initialize the buffers and check whether they are mapped and non delay. If they are mapped and non delay we write the page. Otherwise we mark them dirty. With this change we don't do block allocation at all in ext4__write_page. writepage() can get called under many condition and with a locking order of journal_start -> lock_page, we should not try to allocate blocks in writepage() which get called after taking page lock. writepage() can get called via shrink_page_list even with a journal handle which was created for doing inode update. For example when doing ext4_da_write_begin we create a journal handle with credit 1 expecting a i_disksize update for the inode. But ext4_da_write_begin can cause shrink_page_list via _grab_page_cache. So having a valid handle via ext4_journal_current_handle is not a guarantee that we can use the handle for block allocation in writepage, since we shouldn't be using credits that had been reserved for other updates. That it could result in we running out of credits when we update inodes. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V	cd1aac3292	ext4: Add ordered mode support for delalloc This provides a new ordered mode implementation which gets rid of using buffer heads to enforce the ordering between metadata change with the related data chage. Instead, in the new ordering mode, it keeps track of all of the inodes touched by each transaction on a list, and when that transaction is committed, it flushes all of the dirty pages for those inodes. In addition, the new ordered mode reverses the lock ordering of the page lock and transaction lock, which provides easier support for delayed allocation. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Mingming Cao	61628a3f3a	ext4: Invert lock ordering of page_lock and transaction start in delalloc With the reverse locking, we need to start a transation before taking the page lock, so in ext4_da_writepages() we need to break the write-out into chunks, and restart the journal for each chunck to ensure the write-out fits in a single transaction. Updated patch from Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> which fixes delalloc sync hang with journal lock inversion, and address the performance regression issue. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Mingming Cao	d2a1763791	ext4: delayed allocation ENOSPC handling This patch does block reservation for delayed allocation, to avoid ENOSPC later at page flush time. Blocks(data and metadata) are reserved at da_write_begin() time, the freeblocks counter is updated by then, and the number of reserved blocks is store in per inode counter. At the writepage time, the unused reserved meta blocks are returned back. At unlink/truncate time, reserved blocks are properly released. Updated fix from Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> to fix the oldallocator block reservation accounting with delalloc, added lock to guard the counters and also fix the reservation for meta blocks. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-07-14 17:52:37 -04:00
Mingming Cao	e8ced39d5e	percpu_counter: new function percpu_counter_sum_and_set Delayed allocation need to check free blocks at every write time. percpu_counter_read_positive() is not quit accurate. delayed allocation need a more accurate accounting, but using percpu_counter_sum_positive() is frequently is quite expensive. This patch added a new function to update center counter when sum per-cpu counter, to increase the accurate rate for next percpu_counter_read() and require less calling expensive percpu_counter_sum(). Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Alex Tomas	64769240bd	ext4: Add delayed allocation support in data=writeback mode Updated with fixes from Mingming Cao <cmm@us.ibm.com> to unlock and release the page from page cache if the delalloc write_begin failed, and properly handle preallocated blocks. Also added a fix to clear buffer_delay in block_write_full_page() after allocating a delayed buffer. Updated with fixes from Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> to update i_disksize properly and to add bmap support for delayed allocation. Updated with a fix from Valerie Clement <valerie.clement@bull.net> to avoid filesystem corruption when the filesystem is mounted with the delalloc option and blocksize < pagesize. Signed-off-by: Alex Tomas <alex@clusterfs.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>	2008-07-11 19:27:31 -04:00
Jan Kara	678aaf4814	ext4: Use new framework for data=ordered mode in JBD2 This patch makes ext4 use inode-based implementation of data=ordered mode in JBD2. It allows us to unify some data=ordered and data=writeback paths (especially writepage since we don't have to start a transaction anymore) and remove some buffer walking. Updated fix from Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> to fix file system hang due to corrupt jinode values. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Jan Kara	9ddfc3dc75	ext4: Fix lock inversion in ext4_ext_truncate() We cannot call ext4_orphan_add() from under i_data_sem because that causes a lock ordering violation between i_data_sem and and the superblock lock. Updated with Aneesh's locking order fix Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Jan Kara	cf108bca46	ext4: Invert the locking order of page_lock and transaction start This changes are needed to support data=ordered mode handling via inodes. This enables us to get rid of the journal heads and buffer heads for data buffers in the ordered mode. With the changes, during tranasaction commit we writeout the inode pages using the writepages()/writepage(). That implies we take page lock during transaction commit. This can cause a deadlock with the locking order page_lock -> jbd2_journal_start, since the jbd2_journal_start can wait for the journal_commit to happen and the journal_commit now needs to take the page lock. To avoid this dead lock reverse the locking order. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V	2e9ee85035	ext4: Use page_mkwrite vma_operations to get mmap write notification. We would like to get notified when we are doing a write on mmap section. This is needed with respect to preallocated area. We split the preallocated area into initialzed extent and uninitialzed extent in the call back. This let us handle ENOSPC better. Otherwise we get ENOSPC in the writepage and that would result in data loss. The changes are also needed to handle ENOSPC when writing to an mmap section of files with holes. Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Frederic Bohe	5f21b0e642	ext4: fix online resize with mballoc Update group infos when updating a group's descriptor. Add group infos when adding a group's descriptor. Refresh cache pages used by mb_alloc when changes occur. This will probably need modifications when META_BG resizing will be allowed. Signed-off-by: Frederic Bohe <frederic.bohe@bull.net> Signed-off-by: Mingming Cao <cmm@us.ibm.com>	2008-07-11 19:27:31 -04:00
Eric Sandeen	953e622b60	ext4: use atomic functions to set bh_state Use the BUFFER_FNS functions (set_buffer_foo) to set buffer head state atomically instead of nonatomic __set_bit(). Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Jan Kara	47b4a50beb	ext4: Set journal pointer to NULL when journal is released Set sbi->s_journal to NULL after we call journal_destroy(). This will be later needed because after journal_destroy() is called, ext4_clear_inode() can still be called for some inodes (e.g. root inode) and we'll need to detect there that journal doesn't exists anymore. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Mingming Cao	0703143107	ext4: mballoc avoid use root reserved blocks for non root allocation mballoc allocation missed check for blocks reserved for root users. Add ext4_has_free_blocks() check before allocation. Also modified ext4_has_free_blocks() to support multiple block allocation request. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Eric Sandeen	d755fb3842	ext4: call blkdev_issue_flush on fsync To ensure that bits are truly on-disk after an fsync, we should call blkdev_issue_flush if barriers are supported. Inspired by an old thread on barriers, by reiserfs & xfs which do the same, and by a patch SuSE ships with their kernel Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V	654b4908bc	ext4: cleanup block allocator Move the code related to block allocation to a single function and add helper funtions to differient allocation for data and meta data blocks Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V	7061eba75c	ext4: Use inode preallocation with -o noextents When mballoc is enabled, block allocation for old block-based files are allocated using mballoc allocator instead of old block-based allocator. The old ext3 block reservation is turned off when mballoc is turned on. However, the in-core preallocation is not enabled for block-based/ non-extent based file block allocation. This result in performance regression, as now we don't have "reservation" ore in-core preallocation to prevent interleaved fragmentation in multiple writes workload. This patch fix this by enable per inode in-core preallocation for non extent files when mballoc is used. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Akinobu Mita	6afd670713	ext4: fix ext4_init_block_bitmap() for metablock block group When meta_bg feature is enabled and s_first_meta_bg != 0, ext4_init_block_bitmap() miscalculates the number of block used by the group descriptor table (0 or 1 for metablock block group) This patch fixes this by using ext4_bg_num_gdb() Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Stephen Tweedie <sct@redhat.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Acked-by: Andreas Dilger <adilger@sun.com>	2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V	7477827f66	ext4: Fix sparse warning Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Li Zefan	d9c769b769	ext4: cleanup never-used magic numbers from htree code dx_root_limit() will had some dead code which forced it to always return 20, and dx_node_limit to always return 22 for debugging purposes. Remove it. Acked-by: Andreas Dilger <adilger@sun.com> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Shen Feng	9102e4fa80	ext4: Fix ext4_ext_journal_restart() to reflect errors up to the caller Fix ext4_ext_journal_restart() so it returns any errors reported by ext4_journal_extend() and ext4_journal_restart(). Signed-off-by: Shen Feng <shen@cn.fujitsu.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Shen Feng	1973adcba5	ext4: Make ext4_ext_find_extent fills ext_path completely When pos=0 or depth, the fields of ext4_ext_path is are not completely filled. This patch also removes some unnecessary code. Signed-off-by: Shen Feng <shen@cn.fujitsu.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Shen Feng	787e0981fa	ext4: return error when calling ext4_ext_split failed ext4_ext_create_new_leaf must return error when its calling to ext4_ext_split failed. Signed-off-by: Shen Feng <shen@cn.fujitsu.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V	a379cd1d6b	ext4: Update i_disksize properly when allocating from fallocate area. When allocating unitialized space at the end of file which had been preallocated with the FALLOC_FL_KEEP_SIZE option, the file size is not updated at that time. But the later we are not updating the file size when writing to that preallocated space. These changes are for code correctness. This patch allows us to update the i_disksize at the write_end() callback of filesystem properly. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Shen Feng	363d4251d4	ext4: remove quota allocation when ext4_mb_new_blocks fails Quota allocation is not removed when ext4_mb_new_blocks calls kmem_cache_alloc failed. Also make sure the allocation context is freed on the error path. Signed-off-by: Shen Feng <shen@cn.fujitsu.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Li Zefan	f9a8ac99fd	ext4: remove redundant code in ext4_fill_super() The previous sb_min_blocksize() has already set the block size. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Jose R. Santos	772cb7c83b	ext4: New inode allocation for FLEX_BG meta-data groups. This patch mostly controls the way inode are allocated in order to make ialloc aware of flex_bg block group grouping. It achieves this by bypassing the Orlov allocator when block group meta-data are packed toghether through mke2fs. Since the impact on the block allocator is minimal, this patch should have little or no effect on other block allocation algorithms. By controlling the inode allocation, it can basically control where the initial search for new block begins and thus indirectly manipulate the block allocator. This allocator favors data and meta-data locality so the disk will gradually be filled from block group zero upward. This helps improve performance by reducing seek time. Since the group of inode tables within one flex_bg are treated as one giant inode table, uninitialized block groups would not need to partially initialize as many inode table as with Orlov which would help fsck time as the filesystem usage goes up. Signed-off-by: Jose R. Santos <jrs@us.ibm.com> Signed-off-by: Valerie Clement <valerie.clement@bull.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Stoyan Gaydarov	4db9c54a53	ext4: replace __FUNCTION__ occurrences __FUNCTION__ is gcc-specific, use __func__ instead Signed-off-by: Stoyan Gaydarov <stoyboyker@gmail.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-07-13 21:03:29 -04:00
Shen Feng	7e5a8cdd84	ext4: fix error processing in mb_free_blocks The error processing of the return value of mb_free_blocks is meanless because it only returns 0. This fix includes - make mb_free_blocks return void - remove the error processing part in callers - unlock group before calling ext4_error in mb_free_blocks Signed-off-by: Shen Feng <shen@cn.fujitsu.com> Cc: Mingming Cao <cmm@us.ibm.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-07-13 21:03:31 -04:00
Shen Feng	cfbe7e4f5e	ext4: error proc entry creation when the fs/ext4 is not correctly created When the directory fs/ext4 is not correctly created under proc, the entry under this directory should not be created. Signed-off-by: Shen Feng <shen@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-07-13 21:03:31 -04:00
Li Zefan	f795e14073	ext4: fix build failure if DX_DEBUG is enabled ext4_next_entry() is used by the debugging function dx_show_leaf(), so it must be defined before that function. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Theodore Ts'o	7ad72ca60b	ext4: Remove unused variable from ext4_show_options Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Theodore Ts'o	574ca174c9	ext4: Rename read_block_bitmap() to ext4_read_block_bitmap() Since this a non-static function, make it be ext4 specific to avoid conflicts with potentially other filesystems. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Shen Feng	3537576a70	ext4: remove double definitions of xattr macros remove the definitions of macros XATTR_TRUSTED_PREFIX and XATTR_USER_PREFIX since they are defined in linux/xattr.h Signed-off-by: Shen Feng <shen@cn.fujitsu.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Shen Feng	74767c5a2d	ext4: miscellaneous error checks and coding cleanups for mballoc ext4_mb_seq_history_open(): check if sbi->s_mb_history is NULL ext4_mb_history_init(): replace kmalloc and memset with kzalloc ext4_mb_init_backend(): remove memset since kzalloc is used ext4_mb_init(): the return value of ext4_mb_init_backend is int, but i is unsigned, replace it with a new int variable. Signed-off-by: Shen Feng <shen@cn.fujitsu.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Shen Feng	fdf6c7a768	ext4: add error processing when calling ext4_mb_init_cache in mballoc Add error processing for ext4_mb_load_buddy when it calls ext4_mb_init_cache. Signed-off-by: Shen Feng <shen@cn.fujitsu.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Mingming Cao	31b481dc7c	ext4: Fix ext4_mb_init_cache return error ext4_mb_init_cache() incorrectly always return EIO on success. This causes the caller of ext4_mb_init_cache() fail when it checks the return value. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Shen Feng	69baee062a	ext4: improve some code in rb tree part of dir.c * remove unnecessary code in free_rb_tree_fname * rename free_rb_tree_fname to ext4_htree_create_dir_info since it and ext4_htree_free_dir_info are a pair * replace kmalloc with kzalloc in ext4_htree_free_dir_info All these make the code more readable and simple. PS: this patch is also suitable for ext3. Signed-off-by: Shen Feng <shen@cn.fujitsu.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Alexey Dobriyan	91d9982779	ext4: switch to seq_files Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2008-07-11 19:27:31 -04:00

... 2 3 4 5 6 ...

615 Commits