linux

korg/linux

mirror of https://mirrors.bfsu.edu.cn/git/linux.git synced 2024-11-19 10:14:23 +08:00

Author	SHA1	Message	Date
Dave Chinner	4497f28750	Merge branch 'xfs-misc-fixes-for-4.2-2' into for-next	2015-06-04 13:31:13 +10:00
Brian Foster	46fc58dacf	xfs: check min blks for random debug mode sparse allocations The inode allocator enables random sparse inode chunk allocations in DEBUG mode to facilitate testing. Sparse inode allocations are not always possible, however, depending on the fs geometry. For example, there is no possibility for a sparse inode allocation on filesystems where the block size is large enough to fit one or more inode chunks within a single block. Fix up the DEBUG mode sparse inode allocation logic to trigger random sparse allocations only when the geometry of the fs allows it. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-06-04 13:03:34 +10:00
Brian Foster	3cdaa1898f	xfs: fix sparse inodes 32-bit compile failure The kbuild test robot reports the following compilation failure with a 32-bit kernel configuration: fs/built-in.o: In function `xfs_ifree_cluster': >> xfs_inode.c:(.text+0x17ac84): undefined reference to `__umoddi3' This is due to the use of the modulus operator on a 64-bit variable in the ASSERT() added as part of the following commit: xfs: skip unallocated regions of inode chunks in xfs_ifree_cluster() This ASSERT() simply checks that the offset of the inode in a sparse cluster is appropriately aligned. Since the maximum inode record offset is 63 (for a 64 inode record) and the calculated offset here should be something less than that, just use a 32-bit variable to store the offset and call the do_mod() helper. Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-06-04 13:03:34 +10:00
Dave Chinner	66e8ac7bfa	Merge branch 'xfs-dax-support' into for-next	2015-06-04 13:01:49 +10:00
Dave Chinner	cbe4dab119	xfs: add initial DAX support Add initial DAX support to XFS. To do this we need a new mount option to turn DAX on filesystem, and we need to propagate this into the inode flags whenever an inode is instantiated so that the per-inode checks throughout the code Do The Right Thing. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-06-04 09:19:18 +10:00
Dave Chinner	6e1ba0bcb8	xfs: add DAX IO path support DAX does not do buffered IO (can't buffer direct access!) and hence all read/write IO is vectored through the direct IO path. Hence we need to add the DAX IO path callouts to the direct IO infrastructure. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-06-04 09:19:15 +10:00
Dave Chinner	9969441f9f	xfs: add DAX truncate support When we truncate a DAX file, we need to call through the DAX page truncation path rather than through block_truncate_page() so that mappings and block zeroing are all handled correctly. Otherwise, truncate does not need to change. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-06-04 09:19:10 +10:00
Dave Chinner	4f69f578a8	xfs: add DAX block zeroing support Add initial support for DAX block zeroing operations to XFS. DAX cannot use buffered IO through the page cache for zeroing, nor do we need to issue IO for uncached block zeroing. In both cases, we can simply call out to the dax block zeroing function. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-06-04 09:19:08 +10:00
Dave Chinner	6b698edeee	xfs: add DAX file operations support Add the initial support for DAX file operations to XFS. This includes the necessary block allocation and mmap page fault hooks for DAX to function. Note that there are changes to the splice interfaces to ensure that for DAX splice avoids direct page cache manipulations and instead takes the DAX IO paths for read/write operations. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-06-04 09:18:53 +10:00
Dave Chinner	ce5c5d554d	dax: expose __dax_fault for filesystems with locking constraints Some filesystems cannot call dax_fault() directly because they have different locking and/or allocation constraints in the page fault IO path. To handle this, we need to follow the same model as the generic block_page_mkwrite code, where the internals are exposed via __block_page_mkwrite() so that filesystems can wrap the correct locking and operations around the outside. This is loosely based on a patch originally from Matthew Willcox. Unlike the original patch, it does not change ext4 code, error returns or unwritten extent conversion handling. It also adds a __dax_mkwrite() wrapper for .page_mkwrite implementations to do the right thing, too. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-06-04 09:18:18 +10:00
Dave Chinner	e842f29039	dax: don't abuse get_block mapping for endio callbacks dax_fault() currently relies on the get_block callback to attach an io completion callback to the mapping buffer head so that it can run unwritten extent conversion after zeroing allocated blocks. Instead of this hack, pass the conversion callback directly into dax_fault() similar to the get_block callback. When the filesystem allocates unwritten extents, it will set the buffer_unwritten() flag, and hence the dax_fault code can call the completion function in the contexts where it is necessary without overloading the mapping buffer head. Note: The changes to ext4 to use this interface are suspect at best. In fact, the way ext4 did this end_io assignment in the first place looks suspect because it only set a completion callback when there wasn't already some other write() call taking place on the same inode. The ext4 end_io code looks rather intricate and fragile with all it's reference counting and passing to different contexts for modification via inode private pointers that aren't protected by locks... Signed-off-by: Dave Chinner <dchinner@redhat.com> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-06-04 09:18:18 +10:00
Dave Chinner	ec56b1f1fd	xfs: mmap lock needs to be inside freeze protection Lock ordering for the new mmap lock needs to be: mmap_sem sb_start_pagefault i_mmap_lock page lock <fault processsing> Right now xfs_vm_page_mkwrite gets this the wrong way around, While technically it cannot deadlock due to the current freeze ordering, it's still a landmine that might explode if we change anything in future. Hence we need to nest the locks correctly. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-06-04 09:18:18 +10:00
Dave Chinner	b9a350a118	Merge branch 'xfs-sparse-inode' into for-next	2015-06-01 10:51:38 +10:00
Dave Chinner	e01c025fbd	Merge branch 'xfs-misc-fixes-for-4.2' into for-next	2015-06-01 10:50:18 +10:00
Nan Jia	339e4f66d1	xfs: Clean up xfs_trans_dup_dqinfo Fixed two missing spaces. Signed-off-by: Nan Jia <jiananmail@gmail.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-06-01 10:50:00 +10:00
Eric Sandeen	39e56d9219	xfs: don't cast string literals The commit: `a9273ca5` xfs: convert attr to use unsigned names added these (unsigned char ) casts, but then the _SIZE macros return "7" - size of a pointer minus one - not the length of the string. This is harmless in the kernel, because the _SIZE macros are not used, but as we sync up with userspace, this will matter. I don't think the cast is necessary; i.e. assigning the string literal to an unsigned char , or passing it to a function expecting an unsigned char *, should be ok, right? Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-06-01 07:15:38 +10:00
Brian Foster	7f884dc198	xfs: fix quota block reservation leak when tp allocates and frees blocks Al Viro reports that generic/231 fails frequently on XFS and bisected the problem to the following commit: `5d11fb4b` xfs: rework zero range to prevent invalid i_size updates ... which is just the first commit that happens to cause fsx to reproduce the problem. fsx reproduces via zero range calls. The aforementioned commit overhauls zero range to use hole punch and fallocate. As it turns out, the problem is reproducible on demand using basic hole punch as follows: $ mkfs.xfs -f -m crc=1,finobt=1 <dev> $ mount <dev> /mnt -o uquota $ xfs_io -f -c "falloc 0 50m" /mnt/file $ for i in $(seq 1 20); do xfs_io -c "fpunch ${i}m 32k" /mnt/file; done $ rm -f /mnt/file $ repquota -us /mnt ... User used soft hard grace used soft hard grace ---------------------------------------------------------------------- root -- 32K 0K 0K 3 0 0 A file is allocated with a single 50m extent. The extent count increases via hole punches until the bmap converts to btree format. The file is removed but quota reports 32k of space usage for the user. This reservation is effectively leaked for the lifetime of the mount. The reason this occurs is because the quota block reservation tracking is confused when a transaction happens to free and allocate blocks at the same time. Consider the following sequence of events: - tp is allocated from xfs_free_file_space() and reserves several blocks for btree management. Blocks are reserved against the dquot and marked as such in the transaction (qtrx->qt_blk_res). - 8 blocks are accounted free when the 32k range is punched out. xfs_trans_mod_dquot() is called with XFS_TRANS_DQ_BCOUNT and sets ->qt_bcount_delta to -8. - Subsequently, a block is allocated against the same transaction by xfs_bmap_extents_to_btree() for btree conversion. A call to xfs_trans_mod_dquot() increases qt_blk_res_used to 1 and qt_bcount_delta to -7. - The transaction is dup'd and committed by xfs_bmap_finish(). xfs_trans_dup_dqinfo() sets the first transaction up such that it has a matching qt_blk_res and qt_blk_res_used of 1. The remaining unused reservation is transferred to the duplicate tp. When the transactions are committed, the dquots are fixed up in xfs_trans_apply_dquot_deltas() according to one of two methods: 1.) If the transaction holds a block reservation (->qt_blk_res != 0), _only_ the unused portion reservation is unaccounted from the dquot. Note that the tp duplication behavior of xfs_bmap_finish() makes it such that qt_blk_res is typically 0 for tp's with unused reservation. 2.) Otherwise, the dquot is fixed up based on the block delta (->qt_bcount_delta) created by the transaction. Therefore, if a transaction has a negative qt_bcount_delta and positive qt_blk_res_used, the former set of blocks that have been removed from the file are never factored out of the in-core dquot reservation. Instead, *_apply_dquot_deltas() sees 1 block used out of a 1 block reservation and believes there is nothing to fix up. The on-disk d_bcount is updated independently from qt_bcount_delta, and thus is correct (and allows the quota usage to correct on remount). To deal with this situation, we effectively want the "used reservation" part of the transaction to be consistent with any freed blocks with respect to quota tracking. For example, if 8 blocks are freed, the subsequent single block allocation does not need to consume the initial reservation made by the tp. Instead, it simply borrows one from the previously freed. One possible implementation of such borrowing is to avoid the blks_res_used increment when bcount_delta is negative. This alone is flawed logic in that it only handles the case where blocks are freed before allocated, however. Rather than add more complexity to manage synchronization between bcount_delta and blks_res_used, kill the latter entirely. blk_res_used is only updated in one place and always in sync with delta_bcount. Therefore, the net block reservation consumption of the transaction is always available from bcount_delta. Calculate the reservation consumption on the fly where necessary based on whether the tp has a reservation and results in a positive net block delta on the inode. Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-06-01 07:15:37 +10:00
Brian Foster	2e588a46aa	xfs: always log the inode on unwritten extent conversion The fsync() requirements for crash consistency on XFS are to flush file data and force any in-core inode updates to the log. We currently check whether the inode is pinned to identify whether the log needs to be forced, since a non-zero pin count generally represents an inode that has transactions awaiting a flush to the on-disk log. This is not sufficient in all cases, however. Reports of xfstests test generic/311 failures on ppc64/s390x hosts have identified failures to fsync outstanding inode modifications due to the inode not being pinned at the time of the fsync. This occurs because certain bmap updates can complete by logging bmapbt buffers but without ever dirtying (and thus pinning) the core inode. The following is a specific incarnation of this problem: $ mount $dev /mnt -o noatime,nobarrier $ for i in $(seq 0 2 31); do \ xfs_io -f -c "falloc $((i * 32768)) 32k" -c fsync /mnt/file; \ done $ xfs_io -c "pwrite -S 0 80k 16k" -c fsync -c "pwrite 76k 4k" -c fsync /mnt/file; \ hexdump /mnt/file; \ ./xfstests-dev/src/godown /mnt ... `0000000` 0000 0000 0000 0000 0000 0000 0000 0000 * 0013000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd * 0014000 0000 0000 0000 0000 0000 0000 0000 0000 * 00f8000 $ umount /mnt; mount ... $ hexdump /mnt/file `0000000` 0000 0000 0000 0000 0000 0000 0000 0000 * 00f8000 In short, the unwritten extent conversion for the last write is lost despite the fact that an fsync executed before the filesystem was shutdown. Note that this is impossible to reproduce on v5 supers due to unconditional time callbacks for di_changecount and highly difficult to reproduce on CONFIG_HZ=1000 kernels due to those same callbacks frequently updating cmtime prior to the bmap update. CONFIG_HZ=100 reduces timer granularity enough to increase the odds that time updates are skipped and allows this to reproduce within a handful of attempts. To deal with this problem, unconditionally log the core in the unwritten extent conversion path. Fix up logflags after the extent conversion to keep the extent update code consistent with the other extent update helpers. This fixup is not necessary for the other (hole, delay) extent helpers because they execute in the block allocation codepath, which already logs the inode for other reasons (e.g., for di_nblocks). Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-06-01 07:15:23 +10:00
Brian Foster	22ce1e1472	xfs: enable sparse inode chunks for v5 superblocks Enable mounting of filesystems with sparse inode support enabled. Add the incompat. feature bit to the *_ALL mask. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-05-29 09:26:33 +10:00
Brian Foster	09b5660413	xfs: skip unallocated regions of inode chunks in xfs_ifree_cluster() xfs_ifree_cluster() is called to mark all in-memory inodes and inode buffers as stale. This occurs after we've removed the inobt records and dropped any references of inobt data. xfs_ifree_cluster() uses the starting inode number to walk the namespace of inodes expected for a single chunk a cluster buffer at a time. The cluster buffer disk addresses are calculated by decoding the sequential inode numbers expected from the chunk. The problem with this approach is that if the inode chunk being removed is a sparse chunk, not all of the buffer addresses that are calculated as part of this sequence may be inode clusters. Attempting to acquire the buffer based on expected inode characterstics (i.e., cluster length) can lead to errors and is generally incorrect. We already use a couple variables to carry requisite state from xfs_difree() to xfs_ifree_cluster(). Rather than add a third, define a new internal structure to carry the existing parameters through these functions. Add an alloc field that represents the physical allocation bitmap of inodes in the chunk being removed. Modify xfs_ifree_cluster() to check each inode against the bitmap and skip the clusters that were never allocated as real inodes on disk. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-05-29 09:26:03 +10:00
Brian Foster	10ae3dc7f2	xfs: only free allocated regions of inode chunks An inode chunk is currently added to the transaction free list based on a simple fsb conversion and hardcoded chunk length. The nature of sparse chunks is such that the physical chunk of inodes on disk may consist of one or more discontiguous parts. Blocks that reside in the holes of the inode chunk are not inodes and could be allocated to any other use or not allocated at all. Refactor the existing xfs_bmap_add_free() call into the xfs_difree_inode_chunk() helper. The new helper uses the existing calculation if a chunk is not sparse. Otherwise, use the inobt record holemask to free the contiguous regions of the chunk. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-05-29 09:22:52 +10:00
Brian Foster	26dd5217de	xfs: filter out sparse regions from individual inode allocation Inode allocation from an existing record with free inodes traditionally selects the first inode available according to the ir_free mask. With sparse inode chunks, the ir_free mask could refer to an unallocated region. We must mask the unallocated regions out of ir_free before using it to select a free inode in the chunk. Update the xfs_inobt_first_free_inode() helper to find the first free inode available of the allocated regions of the inode chunk. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-05-29 09:20:10 +10:00
Brian Foster	1cdadee11f	xfs: randomly do sparse inode allocations in DEBUG mode Sparse inode allocations generally only occur when full inode chunk allocation fails. This requires some level of filesystem space usage and fragmentation. For filesystems formatted with sparse inode chunks enabled, do random sparse inode chunk allocs when compiled in DEBUG mode to increase test coverage. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-05-29 09:19:29 +10:00
Brian Foster	56d1115c9b	xfs: allocate sparse inode chunks on full chunk allocation failure xfs_ialloc_ag_alloc() makes several attempts to allocate a full inode chunk. If all else fails, reduce the allocation to the sparse length and alignment and attempt to allocate a sparse inode chunk. If sparse chunk allocation succeeds, check whether an inobt record already exists that can track the chunk. If so, inherit and update the existing record. Otherwise, insert a new record for the sparse chunk. Create helpers to align sparse chunk inode records and insert or update existing records in the inode btrees. The xfs_inobt_insert_sprec() helper implements the merge or update semantics required for sparse inode records with respect to both the inobt and finobt. To update the inobt, either insert a new record or merge with an existing record. To update the finobt, use the updated inobt record to either insert or replace an existing record. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-05-29 09:18:32 +10:00
Brian Foster	4148c347a4	xfs: helper to convert holemask to inode alloc. bitmap The inobt record holemask field is a condensed data type designed to fit into the existing on-disk record and is zero based (allocated regions are set to 0, sparse regions are set to 1) to provide backwards compatibility. This makes the type somewhat complex for use in higher level inode manipulations such as individual inode allocation, etc. Rather than foist the complexity of dealing with this field to every bit of logic that requires inode granular information, create a helper to convert the holemask to an inode allocation bitmap. The inode allocation bitmap is inode granularity similar to the inobt record free mask and indicates which inodes of the chunk are physically allocated on disk, irrespective of whether the inode is considered allocated or free by the filesystem. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-05-29 09:09:05 +10:00
Brian Foster	7f43c907ad	xfs: handle sparse inode chunks in icreate log recovery Recovery of icreate transactions assumes hardcoded values for the inode count and chunk length. Sparse inode chunks are allocated in units of m_ialloc_min_blks. Update the icreate validity checks to allow for appropriately sized inode chunks and verify the inode count matches what is expected based on the extent length rather than assuming a hardcoded count. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-05-29 09:06:30 +10:00
Brian Foster	463958af5c	xfs: pass inode count through ordered icreate log item v5 superblocks use an ordered log item for logging the initialization of inode chunks. The icreate log item is currently hardcoded to an inode count of 64 inodes. The agbno and extent length are used to initialize the inode chunk from log recovery. While an incorrect inode count does not lead to bad inode chunk initialization, we should pass the correct inode count such that log recovery has enough data to perform meaningful validity checks on the chunk. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-05-29 09:05:49 +10:00
Brian Foster	12d0714d4b	xfs: use actual inode count for sparse records in bulkstat/inumbers The bulkstat and inumbers mechanisms make the assumption that inode records consist of a full 64 inode chunk in several places. For example, this is used to track how many inodes have been processed overall as well as to determine whether a record has allocated inodes that must be handled. This assumption is invalid for sparse inode records. While sparse inodes will be marked as free in the ir_free mask, they are not accounted as free in ir_freecount because they cannot be allocated. Therefore, ir_freecount may be less than 64 inodes in an inode record for which all physically allocated inodes are free (and in turn ir_freecount < 64 does not signify that the record has allocated inodes). The new in-core inobt record format includes the ir_count field. This holds the number of true, physical inodes tracked by the record. The in-core ir_count field is always valid as it is hardcoded to XFS_INODES_PER_CHUNK when sparse inodes is not enabled. Use ir_count to handle inode records correctly in bulkstat in a generic manner. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-05-29 09:04:19 +10:00
Brian Foster	5419040fc0	xfs: introduce inode record hole mask for sparse inode chunks The inode btrees track 64 inodes per record regardless of inode size. Thus, inode chunks on disk vary in size depending on the size of the inodes. This creates a contiguous allocation requirement for new inode chunks that can be difficult to satisfy on an aged and fragmented (free space) filesystems. The inode record freecount currently uses 4 bytes on disk to track the free inode count. With a maximum freecount value of 64, only one byte is required. Convert the freecount field to a single byte and use two of the remaining 3 higher order bytes left for the hole mask field. Use the final leftover byte for the total count field. The hole mask field tracks holes in the chunks of physical space that the inode record refers to. This facilitates the sparse allocation of inode chunks when contiguous chunks are not available and allows the inode btrees to identify what portions of the chunk contain valid inodes. The total count field contains the total number of valid inodes referred to by the record. This can also be deduced from the hole mask. The count field provides clarity and redundancy for internal record verification. Note that neither of the new fields can be written to disk on fs' without sparse inode support. Doing so writes to the high-order bytes of freecount and causes corruption from the perspective of older kernels. The on-disk inobt record data structure is updated with a union to distinguish between the original, "full" format and the new, "sparse" format. The conversion routines to get, insert and update records are updated to translate to and from the on-disk record accordingly such that freecount remains a 4-byte value on non-supported fs, yet the new fields of the in-core record are always valid with respect to the record. This means that higher level code can refer to the current in-core record format unconditionally and lower level code ensures that records are translated to/from disk according to the capabilities of the fs. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-05-29 09:03:04 +10:00
Brian Foster	502a4e72b8	xfs: add fs geometry bit for sparse inode chunks Define an fs geometry bit for sparse inode chunks such that the characteristic of the fs can be identified by userspace. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-05-29 08:58:32 +10:00
Brian Foster	e5376fc15a	xfs: sparse inode chunks feature helpers and mount requirements The sparse inode chunks feature uses the helper function to enable the allocation of sparse inode chunks. The incompatible feature bit is set on disk at mkfs time to prevent mount from unsupported kernels. Also, enforce the inode alignment requirements required for sparse inode chunks at mount time. When enabled, full inode chunks (and all inode record) alignment is increased from cluster size to inode chunk size. Sparse inode alignment must match the cluster size of the fs. Both superblock alignment fields are set as such by mkfs when sparse inode support is enabled. Finally, warn that sparse inode chunks is an experimental feature until further notice. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-05-29 08:57:27 +10:00
Brian Foster	066a18845f	xfs: use sparse chunk alignment for min. inode allocation requirement xfs_ialloc_ag_select() iterates through the allocation groups looking for free inodes or free space to determine whether to allow an inode allocation to proceed. If no free inodes are available, it assumes that an AG must have an extent longer than mp->m_ialloc_blks. Sparse inode chunk support currently allows for allocations smaller than the traditional inode chunk size specified in m_ialloc_blks. The current minimum sparse allocation is set in the superblock sb_spino_align field at mkfs time. Create a new m_ialloc_min_blks field in xfs_mount and use this to represent the minimum supported allocation size for inode chunks. Initialize m_ialloc_min_blks at mount time based on whether sparse inodes are supported. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-05-29 08:55:20 +10:00
Brian Foster	fb4f2b4e5a	xfs: add sparse inode chunk alignment superblock field Add sb_spino_align to the superblock to specify sparse inode chunk alignment. This also currently represents the minimum allowable sparse chunk allocation size. Signed-off-by: Brian Foster <bfoster@redhat.com>	2015-05-29 08:54:03 +10:00
Brian Foster	bfe46d4eb9	xfs: support min/max agbno args in block allocator The block allocator supports various arguments to tweak block allocation behavior and set allocation requirements. The sparse inode chunk feature introduces a new requirement not supported by the current arguments. Sparse inode allocations must convert or merge into an inode record that describes a fixed length chunk (64 inodes x inodesize). Full inode chunk allocations by definition always result in valid inode records. Sparse chunk allocations are smaller and the associated records can refer to blocks not owned by the inode chunk. This model can result in invalid inode records in certain cases. For example, if a sparse allocation occurs near the start of an AG, the aligned inode record for that chunk might refer to agbno 0. If an allocation occurs towards the end of the AG and the AG size is not aligned, the inode record could refer to blocks beyond the end of the AG. While neither of these scenarios directly result in corruption, they both insert invalid inode records and at minimum cause repair to complain, are unlikely to merge into full chunks over time and set land mines for other areas of code. To guarantee sparse inode chunk allocation creates valid inode records, support the ability to specify an agbno range limit for XFS_ALLOCTYPE_NEAR_BNO block allocations. The min/max agbno's are specified in the allocation arguments and limit the block allocation algorithms to that range. The starting 'agbno' hint is clamped to the range if the specified agbno is out of range. If no sufficient extent is available within the range, the allocation fails. For backwards compatibility, the min/max fields can be initialized to 0 to disable range limiting (e.g., equivalent to min=0,max=agsize). Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-05-29 08:53:00 +10:00
Brian Foster	999633d304	xfs: update free inode record logic to support sparse inode records xfs_difree_inobt() uses logic in a couple places that assume inobt records refer to fully allocated chunks. Specifically, the use of mp->m_ialloc_inos can cause problems for inode chunks that are sparsely allocated. Sparse inode chunks can, by definition, define a smaller number of inodes than a full inode chunk. Fix the logic that determines whether an inode record should be removed from the inobt to use the ir_free mask rather than ir_freecount. Fix the agi counters modification to use ir_freecount to add the actual number of inodes freed rather than assuming a full inode chunk. Also make sure that we preserve the behavior to not remove inode chunks if the block size is large enough for multiple inode chunks (e.g., bsize=64k, isize=512). This behavior was previously implicit in that in such configurations, ir.freecount of a single record never matches m_ialloc_inos. Hence, add some comments as well. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-05-29 08:51:37 +10:00
Brian Foster	d4cc540b08	xfs: create individual inode alloc. helper Inode allocation from sparse inode records must filter the ir_free mask against ir_holemask. In preparation for this requirement, create a helper to allocate an individual inode from an inode record. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-05-29 08:50:21 +10:00
Brian Foster	22419ac9fe	xfs: fix broken i_nlink accounting for whiteout tmpfile inode XFS uses the internal tmpfile() infrastructure for the whiteout inode used for RENAME_WHITEOUT operations. For tmpfile inodes, XFS allocates the inode, drops di_nlink, adds the inode to the agi unlinked list, calls d_tmpfile() which correspondingly drops i_nlink of the vfs inode, and then finishes the common inode setup (e.g., clear I_NEW and unlock). The d_tmpfile() call was originally made inxfs_create_tmpfile(), but was pulled up out of that function as part of the following commit to resolve a deadlock issue: `330033d6` xfs: fix tmpfile/selinux deadlock and initialize security As a result, callers of xfs_create_tmpfile() are responsible for either calling d_tmpfile() or fixing up i_nlink appropriately. The whiteout tmpfile allocation helper does neither. As a result, the vfs ->i_nlink becomes inconsistent with the on-disk ->di_nlink once xfs_rename() links it back into the source dentry and calls xfs_bumplink(). Update the assert in xfs_rename() to help detect this problem in the future and update xfs_rename_alloc_whiteout() to decrement the link count as part of the manual tmpfile inode setup. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-05-29 08:14:55 +10:00
Dave Chinner	cddc116228	xfs: xfs_iozero can return positive errno It was missed when we converted everything in XFs to use negative error numbers, so fix it now. Bug introduced in 3.17 by commit `2451337` ("xfs: global error sign conversion"), and should go back to stable kernels. Thanks to Brian Foster for noticing it. cc: <stable@vger.kernel.org> # 3.17, 3.18, 3.19, 4.0 Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-05-29 07:40:32 +10:00
Dave Chinner	6dfe5a049f	xfs: xfs_attr_inactive leaves inconsistent attr fork state behind xfs_attr_inactive() is supposed to clean up the attribute fork when the inode is being freed. While it removes attribute fork extents, it completely ignores attributes in local format, which means that there can still be active attributes on the inode after xfs_attr_inactive() has run. This leads to problems with concurrent inode writeback - the in-core inode attribute fork is removed without locking on the assumption that nothing will be attempting to access the attribute fork after a call to xfs_attr_inactive() because it isn't supposed to exist on disk any more. To fix this, make xfs_attr_inactive() completely remove all traces of the attribute fork from the inode, regardless of it's state. Further, also remove the in-core attribute fork structure safely so that there is nothing further that needs to be done by callers to clean up the attribute fork. This means we can remove the in-core and on-disk attribute forks atomically. Also, on error simply remove the in-memory attribute fork. There's nothing that can be done with it once we have failed to remove the on-disk attribute fork, so we may as well just blow it away here anyway. cc: <stable@vger.kernel.org> # 3.12 to 4.0 Reported-by: Waiman Long <waiman.long@hp.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-05-29 07:40:08 +10:00
Dave Chinner	6dea405eee	xfs: extent size hints can round up extents past MAXEXTLEN This results in BMBT corruption, as seen by this test: # mkfs.xfs -f -d size=40051712b,agcount=4 /dev/vdc .... # mount /dev/vdc /mnt/scratch # xfs_io -ft -c "extsize 16m" -c "falloc 0 30g" -c "bmap -vp" /mnt/scratch/foo which results in this failure on a debug kernel: XFS: Assertion failed: (blockcount & xfs_mask64hi(64-BMBT_BLOCKCOUNT_BITLEN)) == 0, file: fs/xfs/libxfs/xfs_bmap_btree.c, line: 211 .... Call Trace: [<ffffffff814cf0ff>] xfs_bmbt_set_allf+0x8f/0x100 [<ffffffff814cf18d>] xfs_bmbt_set_all+0x1d/0x20 [<ffffffff814f2efe>] xfs_iext_insert+0x9e/0x120 [<ffffffff814c7956>] ? xfs_bmap_add_extent_hole_real+0x1c6/0xc70 [<ffffffff814c7956>] xfs_bmap_add_extent_hole_real+0x1c6/0xc70 [<ffffffff814caaab>] xfs_bmapi_write+0x72b/0xed0 [<ffffffff811c72ac>] ? kmem_cache_alloc+0x15c/0x170 [<ffffffff814fe070>] xfs_alloc_file_space+0x160/0x400 [<ffffffff81ddcc29>] ? down_write+0x29/0x60 [<ffffffff815063eb>] xfs_file_fallocate+0x29b/0x310 [<ffffffff811d2bc8>] ? __sb_start_write+0x58/0x120 [<ffffffff811e3e18>] ? do_vfs_ioctl+0x318/0x570 [<ffffffff811cd680>] vfs_fallocate+0x140/0x260 [<ffffffff811ce6f8>] SyS_fallocate+0x48/0x80 [<ffffffff81ddec09>] system_call_fastpath+0x12/0x17 The tracepoint that indicates the extent that triggered the assert failure is: xfs_iext_insert: idx 0 offset 0 block 16777224 count 2097152 flag 1 Clearly indicating that the extent length is greater than MAXEXTLEN, which is 2097151. A prior trace point shows the allocation was an exact size match and that a length greater than MAXEXTLEN was asked for: xfs_alloc_size_done: agno 1 agbno 8 minlen 2097152 maxlen 2097152 ^^^^^^^ ^^^^^^^ We don't see this problem with extent size hints through the IO path because we can't do single IOs large enough to trigger MAXEXTLEN allocation. fallocate(), OTOH, is not limited in it's allocation sizes and so needs help here. The issue is that the extent size hint alignment is rounding up the extent size past MAXEXTLEN, because xfs_bmapi_write() is not taking into account extent size hints when calculating the maximum extent length to allocate. xfs_bmapi_reserve_delalloc() is already doing this, but direct extent allocation is not. Unfortunately, the calculation in xfs_bmapi_reserve_delalloc() is wrong, and it works only because delayed allocation extents are not limited in size to MAXEXTLEN in the in-core extent tree. hence this calculation does not work for direct allocation, and the delalloc code needs fixing. This may, in fact be the underlying bug that occassionally causes transaction overruns in delayed allocation extent conversion, so now we know it's wrong we should fix it, too. Many thanks to Brian Foster for finding this problem during review of this patch. Hence the fix, after much code reading, is to allow xfs_bmap_extsize_align() to align partial extents when full alignment would extend the alignment past MAXEXTLEN. We can safely do this because all callers have higher layer allocation loops that already handle short allocations, and so will simply run another allocation to cover the remainder of the requested allocation range that we ignored during alignment. The advantage of this approach is that it also removes the need for callers to do anything other than limit their requests to MAXEXTLEN - they don't really need to be aware of extent size hints at all. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-05-29 07:40:06 +10:00
Dave Chinner	8c1903d308	xfs: inode and free block counters need to use __percpu_counter_compare Because the counters use a custom batch size, the comparison functions need to be aware of that batch size otherwise the comparison does not work correctly. This leads to ASSERT failures on generic/027 like this: XFS: Assertion failed: 0, file: fs/xfs/xfs_mount.c, line: 1099 ------------[ cut here ]------------ .... Call Trace: [<ffffffff81522a39>] xfs_mod_icount+0x99/0xc0 [<ffffffff815285cb>] xfs_trans_unreserve_and_mod_sb+0x28b/0x5b0 [<ffffffff8152f941>] xfs_log_commit_cil+0x321/0x580 [<ffffffff81528e17>] xfs_trans_commit+0xb7/0x260 [<ffffffff81503d4d>] xfs_bmap_finish+0xcd/0x1b0 [<ffffffff8151da41>] xfs_inactive_ifree+0x1e1/0x250 [<ffffffff8151dbe0>] xfs_inactive+0x130/0x200 [<ffffffff81523a21>] xfs_fs_evict_inode+0x91/0xf0 [<ffffffff811f3958>] evict+0xb8/0x190 [<ffffffff811f433b>] iput+0x18b/0x1f0 [<ffffffff811e8853>] do_unlinkat+0x1f3/0x320 [<ffffffff811d548a>] ? filp_close+0x5a/0x80 [<ffffffff811e999b>] SyS_unlinkat+0x1b/0x40 [<ffffffff81e0892e>] system_call_fastpath+0x12/0x71 This is a regression introduced by commit `501ab32` ("xfs: use generic percpu counters for inode counter"). This patch fixes the same problem for both the inode counter and the free block counter in the superblocks. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-05-29 07:39:34 +10:00
George Wang	74f9ce1cf2	xfs: use percpu_counter_read_positive for mp->m_icount Function percpu_counter_read just return the current counter, which can be negative. This will cause the checking of "allocated inode counts <= m_maxicount" false positive. Use percpu_counter_read_positive can solve this problem, and be consistent with the purpose to introduce percpu mechanism to xfs. Signed-off-by: George Wang <xuw2015@gmail.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-05-29 07:39:34 +10:00
Linus Torvalds	8663da2c09	Some miscellaneous bug fixes and some final on-disk and ABI changes for ext4 encryption which provide better security and performance. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQEcBAABCAAGBQJVRsVDAAoJEPL5WVaVDYGj/UUIAI6zLGhq3I8uQLZQC22Ew2Ph TPj6eABDuTrB/7QpAu21Dk59N70MQpsBTES6yLWWLf/eHp0gsH7gCNY/C9185vOh tQjzw18hRH2IfPftOBrjDlPGbbBD8Gu9jAmpm5kKKOtBuSVbKQ4GeN6BTECkgwlg U5EJHJJ5Ahl4MalODFreOE5ZrVC7FWGEpc1y/MquQ0qcGSGlNd35leK5FE2bfHWZ M1IJfXH5RRVPUBp26uNvzEg0TtpqkigmCJUT6gOVLfSYBw+lYEbGl4lCflrJmbgt 8EZh3Q0plsDbNhMzqSvOE4RvsOZ28oMjRNbzxkAaoz/FlatWX2hrfAoI2nqRrKg= =Unbp -----END PGP SIGNATURE----- Merge tag 'for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Some miscellaneous bug fixes and some final on-disk and ABI changes for ext4 encryption which provide better security and performance" * tag 'for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix growing of tiny filesystems ext4: move check under lock scope to close a race. ext4: fix data corruption caused by unwritten and delayed extents ext4 crypto: remove duplicated encryption mode definitions ext4 crypto: do not select from EXT4_FS_ENCRYPTION ext4 crypto: add padding to filenames before encrypting ext4 crypto: simplify and speed up filename encryption	2015-05-03 18:23:53 -07:00
Jan Kara	2c869b262a	ext4: fix growing of tiny filesystems The estimate of necessary transaction credits in ext4_flex_group_add() is too pessimistic. It reserves credit for sb, resize inode, and resize inode dindirect block for each group added in a flex group although they are always the same block and thus it is enough to account them only once. Also the number of modified GDT block is overestimated since we fit EXT4_DESC_PER_BLOCK(sb) descriptors in one block. Make the estimation more precise. That reduces number of requested credits enough that we can grow 20 MB filesystem (which has 1 MB journal, 79 reserved GDT blocks, and flex group size 16 by default). Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Eric Sandeen <sandeen@redhat.com>	2015-05-02 23:58:32 -04:00
Davide Italiano	280227a75b	ext4: move check under lock scope to close a race. fallocate() checks that the file is extent-based and returns EOPNOTSUPP in case is not. Other tasks can convert from and to indirect and extent so it's safe to check only after grabbing the inode mutex. Signed-off-by: Davide Italiano <dccitaliano@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org	2015-05-02 23:21:15 -04:00
Lukas Czerner	d2dc317d56	ext4: fix data corruption caused by unwritten and delayed extents Currently it is possible to lose whole file system block worth of data when we hit the specific interaction with unwritten and delayed extents in status extent tree. The problem is that when we insert delayed extent into extent status tree the only way to get rid of it is when we write out delayed buffer. However there is a limitation in the extent status tree implementation so that when inserting unwritten extent should there be even a single delayed block the whole unwritten extent would be marked as delayed. At this point, there is no way to get rid of the delayed extents, because there are no delayed buffers to write out. So when a we write into said unwritten extent we will convert it to written, but it still remains delayed. When we try to write into that block later ext4_da_map_blocks() will set the buffer new and delayed and map it to invalid block which causes the rest of the block to be zeroed loosing already written data. For now we can fix this by simply not allowing to set delayed status on written extent in the extent status tree. Also add WARN_ON() to make sure that we notice if this happens in the future. This problem can be easily reproduced by running the following xfs_io. xfs_io -f -c "pwrite -S 0xaa 4096 2048" \ -c "falloc 0 131072" \ -c "pwrite -S 0xbb 65536 2048" \ -c "fsync" /mnt/test/fff echo 3 > /proc/sys/vm/drop_caches xfs_io -c "pwrite -S 0xdd 67584 2048" /mnt/test/fff This can be theoretically also reproduced by at random by running fsx, but it's not very reliable, though on machines with bigger page size (like ppc) this can be seen more often (especially xfstest generic/127) Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org	2015-05-02 21:36:55 -04:00
Chanho Park	9402bdcacd	ext4 crypto: remove duplicated encryption mode definitions This patch removes duplicated encryption modes which were already in ext4.h. They were duplicated from commit `3edc18d` and commit f542fb. Cc: Theodore Ts'o <tytso@mit.edu> Cc: Michael Halcrow <mhalcrow@google.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Signed-off-by: Chanho Park <chanho61.park@samsung.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2015-05-02 10:29:22 -04:00
Herbert Xu	fb63e5489f	ext4 crypto: do not select from EXT4_FS_ENCRYPTION This patch adds a tristate EXT4_ENCRYPTION to do the selections for EXT4_FS_ENCRYPTION because selecting from a bool causes all the selected options to be built-in, even if EXT4 itself is a module. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2015-05-02 10:29:19 -04:00
Theodore Ts'o	a44cd7a054	ext4 crypto: add padding to filenames before encrypting This obscures the length of the filenames, to decrease the amount of information leakage. By default, we pad the filenames to the next 4 byte boundaries. This costs nothing, since the directory entries are aligned to 4 byte boundaries anyway. Filenames can also be padded to 8, 16, or 32 bytes, which will consume more directory space. Change-Id: Ibb7a0fb76d2c48e2061240a709358ff40b14f322 Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2015-05-01 16:56:50 -04:00
Theodore Ts'o	5de0b4d0cd	ext4 crypto: simplify and speed up filename encryption Avoid using SHA-1 when calculating the user-visible filename when the encryption key is available, and avoid decrypting lots of filenames when searching for a directory entry in a directory block. Change-Id: If4655f144784978ba0305b597bfa1c8d7bb69e63 Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2015-05-01 16:56:45 -04:00

1 2 3 4 5 ...

40417 Commits