Note that it adds a little overhead to account for the inodes moved/enqueued
from b_dirty to b_io. The "moved" accounting may later be used to
limit the number of inodes that can be moved in one shot, in order to
keep spinlock hold time under control.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
It is valuable to know how the dirty inodes are iterated and their IO size.
"writeback_single_inode: bdi 8:0: ino=134246746 state=I_DIRTY_SYNC|I_SYNC age=414 index=0 to_write=1024 wrote=0"
- "state" reflects inode->i_state at the end of writeback_single_inode()
- "index" reflects mapping->writeback_index after the ->writepages() call
- "to_write" is the wbc->nr_to_write at entrance of writeback_single_inode()
- "wrote" is the number of pages actually written
v2: add trace event writeback_single_inode_requeue as proposed by Dave.
CC: Dave Chinner <david@fromorbit.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Remove two unused struct writeback_control fields:
.encountered_congestion (completely unused)
.nonblocking (never set, checked/shown in XFS, NFS and btrfs)
The .for_background check in nfs_write_inode() is also removed btw,
as .for_background implies WB_SYNC_NONE.
Reviewed-by: Jan Kara <jack@suse.cz>
Proposed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
When wbc.more_io was first introduced, it indicated whether there was
at least one superblock whose s_more_io list contained more IO work. Now with
the per-bdi writeback, it can be replaced with a simple b_more_io test.
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
This removes writeback_control.wb_start and does more straightforward
sync livelock prevention by setting .older_than_this to prevent extra
inodes from being enqueued in the first place.
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Code refactor for more logical code layout.
No behavior change.
- remove the mis-named __writeback_inodes_sb()
- wb_writeback()/writeback_inodes_wb() will decide when to queue_io()
before calling __writeback_inodes_wb()
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Split the global inode_wb_list_lock into a per-bdi_writeback list_lock,
as it's currently the most contended lock in the system for metadata-heavy
workloads. It won't help for single-filesystem workloads, for which
we'll need the I/O-less balance_dirty_pages, but at least we
can now dedicate a CPU to spinning on each bdi on larger systems.
Based on earlier patches from Nick Piggin and Dave Chinner.
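For illustration, here is a minimal userspace model of the change (pthread
mutexes stand in for the kernel spinlock; names and fields are simplified,
this is not the kernel code):

/*
 * Toy model: each backing device's writeback state carries its own
 * list lock instead of sharing one global inode_wb_list_lock, so
 * flushers for different bdis no longer contend.
 */
#include <pthread.h>
#include <stdio.h>

struct bdi_writeback_model {
	pthread_mutex_t list_lock;	/* protects b_dirty/b_io/b_more_io */
	int b_dirty;			/* stand-in for the inode lists */
};

static struct bdi_writeback_model bdis[2] = {
	{ PTHREAD_MUTEX_INITIALIZER, 0 },
	{ PTHREAD_MUTEX_INITIALIZER, 0 },
};

static void *flusher(void *arg)
{
	struct bdi_writeback_model *wb = arg;

	pthread_mutex_lock(&wb->list_lock);	/* per-bdi, not global */
	wb->b_dirty++;
	pthread_mutex_unlock(&wb->list_lock);
	return NULL;
}

int main(void)
{
	pthread_t t[2];
	int i;

	for (i = 0; i < 2; i++)
		pthread_create(&t[i], NULL, flusher, &bdis[i]);
	for (i = 0; i < 2; i++)
		pthread_join(&t[i], NULL);
	printf("%d %d\n", bdis[0].b_dirty, bdis[1].b_dirty);
	return 0;
}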
It reduces lock contentions to 1/4 in this test case:
10 HDD JBOD, 100 dd on each disk, XFS, 6GB ram
lock_stat version 0.3
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name con-bounces contentions waittime-min waittime-max waittime-total acq-bounces acquisitions holdtime-min holdtime-max holdtime-total
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
vanilla 2.6.39-rc3:
inode_wb_list_lock: 42590 44433 0.12 147.74 144127.35 252274 886792 0.08 121.34 917211.23
------------------
inode_wb_list_lock 2 [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
inode_wb_list_lock 34 [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
inode_wb_list_lock 12893 [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
inode_wb_list_lock 10702 [<ffffffff8115afef>] writeback_single_inode+0x16d/0x20a
------------------
inode_wb_list_lock 2 [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
inode_wb_list_lock 19 [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
inode_wb_list_lock 5550 [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
inode_wb_list_lock 8511 [<ffffffff8115b4ad>] writeback_sb_inodes+0x10f/0x157
2.6.39-rc3 + patch:
&(&wb->list_lock)->rlock: 11383 11657 0.14 151.69 40429.51 90825 527918 0.11 145.90 556843.37
------------------------
&(&wb->list_lock)->rlock 10 [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
&(&wb->list_lock)->rlock 1493 [<ffffffff8115b1ed>] writeback_inodes_wb+0x3d/0x150
&(&wb->list_lock)->rlock 3652 [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f
&(&wb->list_lock)->rlock 1412 [<ffffffff8115a38e>] writeback_single_inode+0x17f/0x223
------------------------
&(&wb->list_lock)->rlock 3 [<ffffffff8110b5af>] bdi_lock_two+0x46/0x4b
&(&wb->list_lock)->rlock 6 [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
&(&wb->list_lock)->rlock 2061 [<ffffffff8115af97>] __mark_inode_dirty+0x173/0x1cf
&(&wb->list_lock)->rlock 2629 [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f
hughd@google.com: fix recursive lock when bdi_lock_two() is called with new the same as old
akpm@linux-foundation.org: cleanup bdev_inode_switch_bdi() comment
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
There is no point in carrying different refill policies between for_kupdate
and other types of work. Use a consistent "refill b_io iff empty" policy,
which can guarantee fairness in an easy-to-understand way.
A b_io refill will set up a _fixed_ work set with all currently eligible
inodes and start a new round of walk through b_io. The "fixed" work set
means no new inodes will be added to the work set during the walk.
Only when a complete walk over b_io is done will the inodes that are
eligible at that time be enqueued and the walk started over.
This procedure provides fairness among the inodes because it guarantees
that each inode is synced once and only once in each round, so no inode
can be starved.
This change relies on wb_writeback() to keep retrying as long as we made
some progress on cleaning pages and/or inodes. Without that ability,
the old logic for background works relied on aggressively queueing all
eligible inodes into b_io every time, but that is not a guarantee.
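As a rough sketch of this policy (a standalone model with made-up names
and arrays standing in for the lists; not the kernel code):

/*
 * Toy model of "refill b_io iff empty": the work set is fixed for the
 * whole walk, and nothing is added to it until it is exhausted.
 */
#include <stdio.h>

#define NR 6

static int dirty[NR] = { 1, 1, 1, 1, 1, 1 };	/* b_dirty membership */
static int b_io[NR], b_io_len;

static int refill_b_io(void)	/* queue_io(): take all eligible inodes */
{
	int i;

	b_io_len = 0;
	for (i = 0; i < NR; i++)
		if (dirty[i])
			b_io[b_io_len++] = i;
	return b_io_len;
}

int main(void)
{
	for (;;) {
		if (b_io_len == 0) {		/* refill only when empty */
			if (refill_b_io() == 0)
				break;		/* nothing eligible: done */
			printf("refilled b_io with %d inodes\n", b_io_len);
		}
		int ino = b_io[--b_io_len];	/* walk the fixed work set */
		printf("  writeback inode i%d\n", ino);
		dirty[ino] = 0;			/* synced once this round */
	}
	return 0;
}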
The below test script completes slightly faster now:
                  2.6.39-rc3    2.6.39-rc3-dyn-expire+
------------------------------------------------------
all elapsed          256.043                   252.367
stddev                24.381                    12.530
tar elapsed           30.097                    28.808
dd elapsed            13.214                    11.782
#!/bin/zsh
cp /c/linux-2.6.38.3.tar.bz2 /dev/shm/
umount /dev/sda7
mkfs.xfs -f /dev/sda7
mount /dev/sda7 /fs
echo 3 > /proc/sys/vm/drop_caches
tic=$(cat /proc/uptime|cut -d' ' -f2)
cd /fs
time tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 &
time dd if=/dev/zero of=/fs/zero bs=1M count=1000 &
wait
sync
tac=$(cat /proc/uptime|cut -d' ' -f2)
echo elapsed: $((tac - tic))
It maintains roughly the same small vs. large file writeout shares, and
offers large files better chances to be written in nice 4M chunks.
Detailed analysis from Dave Chinner:
Let's say we have lots of inodes with 100 dirty pages being created,
and one large writeback going on. We expire 8 new inodes for every
1024 pages we write back.
With the old code, we do:
b_more_io (large inode) -> b_io (1l)
8 newly expired inodes -> b_io (1l, 8s)
writeback large inode 1024 pages -> b_more_io
b_more_io (large inode) -> b_io (8s, 1l)
8 newly expired inodes -> b_io (8s, 1l, 8s)
writeback 8 small inodes 800 pages
1 large inode 224 pages -> b_more_io
b_more_io (large inode) -> b_io (8s, 1l)
8 newly expired inodes -> b_io (8s, 1l, 8s)
.....
Your new code:
b_more_io (large inode) -> b_io (1l)
8 newly expired inodes -> b_io (1l, 8s)
writeback large inode 1024 pages -> b_more_io
(b_io == 8s)
writeback 8 small inodes 800 pages
b_io empty: (1800 pages written)
b_more_io (large inode) -> b_io (1l)
14 newly expired inodes -> b_io (1l, 14s)
writeback large inode 1024 pages -> b_more_io
(b_io == 14s)
writeback 10 small inodes 1000 pages
1 small inode 24 pages -> b_more_io (1l, 1s(24))
writeback 5 small inodes 500 pages
b_io empty: (2548 pages written)
b_more_io (large inode) -> b_io (1l, 1s(24))
20 newly expired inodes -> b_io (1l, 1s(24), 20s)
......
Rough progression of pages written at b_io refill:
Old code:
      total    large file    % of writeback
       1024           224    21.9% (fixed)
New code:
      total    large file    % of writeback
       1800          1024    ~55%
       2550          1024    ~40%
       3050          1024    ~33%
       3500          1024    ~29%
       3950          1024    ~26%
       4250          1024    ~24%
       4500          1024    ~22.7%
       4700          1024    ~21.7%
       4800          1024    ~21.3%
       4800          1024    ~21.3%
       (pretty much steady state from here)
Ok, so the steady state is reached with a similar percentage of
writeback to the large file as the existing code. Ok, that's good,
but providing some evidence that it doesn't change the share of
writeback to the large file should be in the commit message ;)
The other advantage to this is that we always write 1024 page chunks
to the large file, rather than smaller "whatever remains" chunks.
CC: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Dynamically compute the dirty expire timestamp at queue_io() time.
writeback_control.older_than_this used to be determined at entrance to
the kupdate writeback work. This _static_ timestamp may go stale if the
kupdate work runs on and on. The flusher may then get stuck with some old
busy inodes, never considering newly expired inodes thereafter.
This has two possible problems:
- It is unfair for a large dirty inode to delay (for a long time) the
writeback of small dirty inodes.
- As time goes by, the large and busy dirty inode may contain only
_freshly_ dirtied pages. Ignoring newly expired dirty inodes risks
delaying the expired dirty pages to the end of LRU lists, triggering
the evil pageout(). Nevertheless this patch merely addresses part
of the problem.
v2: keep policy changes inside wb_writeback() and keep the
wbc.older_than_this visibility as suggested by Dave.
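A tiny standalone illustration of why the cutoff must be recomputed at
refill time (made-up clock values and names; not the kernel code):

/*
 * The kupdate expire cutoff is recomputed at each queue_io()-time
 * refill; a cutoff computed once at work start would go stale.
 */
#include <stdio.h>

#define EXPIRE_SECS 30

static int expired(long dirtied_when, long older_than_this)
{
	return dirtied_when <= older_than_this;
}

int main(void)
{
	long work_start = 1000;			/* made-up clock, seconds */
	long dirtied_when = work_start - 10;	/* dirtied 10s before start */

	long static_cutoff = work_start - EXPIRE_SECS;

	/* The work runs on; 60 seconds later this inode is 70s old. */
	long now = work_start + 60;
	long dynamic_cutoff = now - EXPIRE_SECS;

	printf("static cutoff picks it up:  %d\n",
	       expired(dirtied_when, static_cutoff));	/* 0: never expires */
	printf("dynamic cutoff picks it up: %d\n",
	       expired(dirtied_when, dynamic_cutoff));	/* 1: now eligible */
	return 0;
}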
CC: Dave Chinner <david@fromorbit.com>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
writeback_inodes_wb()/__writeback_inodes_sb() are not aggressive in that
they only populate possibly a subset of eligible inodes into b_io at
entrance time. When the queued set of inodes is all synced, they just
return, possibly with all queued inode pages written but
wbc.nr_to_write still > 0.
For kupdate and background writeback, there may be more eligible inodes
sitting in b_dirty when the current set of b_io inodes is completed. So
it is necessary to try another round of writeback as long as we made some
progress in this round. When there are no more eligible inodes, no more
inodes will be enqueued in queue_io(), hence nothing could/will be
synced and we may safely bail.
For example, imagine 100 inodes
i0, i1, i2, ..., i90, i91, ..., i99
At queue_io() time, i90-i99 happen to be expired and moved to s_io for
IO. When finished successfully, if their total size is less than
MAX_WRITEBACK_PAGES, nr_to_write will be > 0. Then wb_writeback() will
quit the background work (w/o this patch) while it's still over the
background threshold. This will be a fairly normal/frequent case, I guess.
Now that we do tagged sync and update inode->dirtied_when after the sync,
this change won't livelock sync(1). I actually tried to write 1 page
per 1ms with this command:
write-and-fsync -n10000 -S 1000 -c 4096 /fs/test
while running sync(1) at the same time. The sync completes quickly on ext4,
xfs, btrfs.
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
The flusher works on dirty inodes in batches, and may quit prematurely
if the batch of inodes happens to be metadata-only dirtied: in this case
wbc->nr_to_write won't be decreased at all, which stands for "no pages
written" but is also misinterpreted as "no progress".
So introduce writeback_control.inodes_written to count the inodes that get
cleaned from the VFS point of view. A non-zero value means some progress was
made on writeback, in which case more writeback can be tried.
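A minimal sketch of the resulting progress test (hypothetical names and
simplified types; not the kernel structures):

/*
 * Treat both "pages written" and "inodes cleaned" as progress, so a
 * batch of metadata-only dirty inodes does not end the loop early.
 */
#include <stdio.h>
#include <stdbool.h>

struct wbc_model {
	long nr_to_write;	/* remaining page budget */
	long inodes_written;	/* inodes cleaned from the VFS POV */
};

static bool made_progress(const struct wbc_model *wbc, long budget)
{
	return wbc->nr_to_write < budget || wbc->inodes_written > 0;
}

int main(void)
{
	/* Metadata-only dirty batch: no pages written, 8 inodes cleaned. */
	struct wbc_model wbc = { .nr_to_write = 1024, .inodes_written = 8 };

	printf("try more writeback: %d\n", made_progress(&wbc, 1024)); /* 1 */
	return 0;
}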
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Explicitly update .dirtied_when on synced inodes, so that they are no
longer considered for writeback in the next round.
It can prevent both of the following livelock schemes:
- while true; do echo data >> f; done
- while true; do touch f; done (in theory)
The exact livelock condition is, during sync(1):
(1) no new inodes are dirtied
(2) an inode is being actively dirtied
On (2), the inode will be tagged and synced with .nr_to_write=LONG_MAX.
When finished, it will be redirty_tail()ed because it's still dirty
and (.nr_to_write > 0). redirty_tail() won't update its ->dirtied_when
on condition (1). The sync work will then revisit it on the next
queue_io() and find it eligible again because its old ->dirtied_when
predates the sync work start time.
We'll do more aggressive "keep writeback as long as we wrote something"
logic in wb_writeback(). The "use LONG_MAX .nr_to_write" trick in commit
b9543dac5b ("writeback: avoid livelocking WB_SYNC_ALL writeback") will
no longer be enough to stop sync livelock.
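As a sketch of the idea, with a made-up clock and simplified names (not
the kernel code):

/*
 * After syncing an inode that was redirtied meanwhile, push its
 * dirtied_when past the sync start time so the sync work will not
 * pick it up again.
 */
#include <stdio.h>

struct inode_model {
	long dirtied_when;
	int redirtied_during_sync;
};

static void writeback_single_inode(struct inode_model *inode, long now)
{
	/* ... everything dirtied before now has been written out ... */
	if (inode->redirtied_during_sync)
		inode->dirtied_when = now;	/* only fresh data remains */
}

int main(void)
{
	long sync_start = 1000;			/* made-up clock, seconds */
	struct inode_model busy = { .dirtied_when = 900,
				    .redirtied_during_sync = 1 };

	writeback_single_inode(&busy, sync_start + 5);

	/* Sync only considers inodes dirtied before it started. */
	printf("eligible for this sync again: %d\n",
	       busy.dirtied_when <= sync_start);	/* 0: no livelock */
	return 0;
}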
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
sync(2) is performed in two stages: the WB_SYNC_NONE sync and the
WB_SYNC_ALL sync. Identify the first stage with .tagged_writepages and
do livelock prevention for it, too.
Jan's commit f446daaea9 ("mm: implement writeback livelock avoidance
using page tagging") is a partial fix in that it only fixed the
WB_SYNC_ALL phase livelock.
Although ext4 is tested to no longer livelock with commit f446daaea9,
that may be due to some "redirty_tail() after pages_skipped" effect,
which is by no means a guarantee for _all_ the filesystems.
Note that writeback_inodes_sb() is called not only by sync(); all callers
are treated the same because the other callers also need livelock prevention.
Impact: It changes the order in which pages/inodes are synced to disk.
Now in the WB_SYNC_NONE stage, it won't proceed to write the next inode
until finished with the current inode.
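A toy model of the tag-then-write behaviour behind .tagged_writepages
(plain arrays stand in for the page-cache tags; this is not the real
write_cache_pages()):

/*
 * Stage 1 tags the pages dirty at sync start; stage 2 writes only the
 * tagged pages, so pages dirtied concurrently cannot extend the pass.
 */
#include <stdio.h>

#define NR_PAGES 8

static int dirty[NR_PAGES]   = { 1, 1, 1, 0, 0, 0, 0, 0 };
static int towrite[NR_PAGES];

int main(void)
{
	int i;

	for (i = 0; i < NR_PAGES; i++)		/* tag what is dirty now */
		towrite[i] = dirty[i];

	dirty[5] = 1;				/* dirtied during the sync */

	for (i = 0; i < NR_PAGES; i++)		/* write tagged pages only */
		if (towrite[i]) {
			printf("writing page %d\n", i);
			dirty[i] = towrite[i] = 0;
		}
	/* page 5 is left for the next pass instead of livelocking this one */
	return 0;
}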
Acked-by: Jan Kara <jack@suse.cz>
CC: Dave Chinner <david@fromorbit.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (25 commits)
btrfs: fix uninitialized variable warning
btrfs: add helper for fs_info->closing
Btrfs: add mount -o inode_cache
btrfs: scrub: add explicit plugging
btrfs: use btrfs_ino to access inode number
Btrfs: don't save the inode cache if we are deleting this root
btrfs: false BUG_ON when degraded
Btrfs: don't save the inode cache in non-FS roots
Btrfs: make sure we don't overflow the free space cache crc page
Btrfs: fix uninit variable in the delayed inode code
btrfs: scrub: don't reuse bios and pages
Btrfs: leave spinning on lookup and map the leaf
Btrfs: check for duplicate entries in the free space cache
Btrfs: don't try to allocate from a block group that doesn't have enough space
Btrfs: don't always do readahead
Btrfs: try not to sleep as much when doing slow caching
Btrfs: kill BTRFS_I(inode)->block_group
Btrfs: don't look at the extent buffer level 3 times in a row
Btrfs: map the node block when looking for readahead targets
Btrfs: set range_start to the right start in count_range_bits
...
With Linus' tree, today's linux-next build (powerpc ppc64_defconfig)
produced this warning:
fs/btrfs/delayed-inode.c: In function 'btrfs_delayed_update_inode':
fs/btrfs/delayed-inode.c:1598:6: warning: 'ret' may be used
uninitialized in this function
Introduced by commit 16cdcec736 ("btrfs: implement delayed inode items
operation").
This fixes a bug in btrfs_update_inode(): if the value returned from
btrfs_delayed_update_inode() is nonzero garbage, the inode stat data are not
updated and several call paths may hit a BUG_ON or fail with a strange
error code.
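A minimal, generic illustration of the hazard (not the actual btrfs
functions):

/* If only one branch sets 'ret', the caller may see stack garbage. */
#include <stdio.h>

static int delayed_update(int supported)
{
	int ret;		/* not initialized */

	if (supported)
		ret = 0;	/* the other path leaves ret untouched */
	return ret;		/* garbage when !supported */
}

int main(void)
{
	/* gcc -Wall flags this, which is exactly the warning above. */
	printf("nonzero: %d\n", delayed_update(0) != 0);
	return 0;
}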
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David Sterba <dsterba@suse.cz>
This makes the inode map cache default to off until we
fix the overflow problem when the free space crcs don't fit
inside a single page.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
With the removal of the implicit plugging, scrub ends up doing more and
smaller I/O than necessary. This patch adds explicit plugging per chunk.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
commit 4cb5300bc ("Btrfs: add mount -o auto_defrag") accesses the inode
number directly while it should use the btrfs_ino() helper now that the
new inode number allocator is in place.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
With xfstest 254 I can panic the box every time with the inode number caching
stuff on. This is because we clean the inodes out when we delete the subvolume,
but then we write out the inode cache which adds an inode to the subvolume inode
tree, and then when it gets evicted again the root gets added back on the dead
roots list and is deleted again, so we have a double free. To stop this from
happening, just return 0 if refs is 0 (and we're not the tree root, since the
tree root always has refs of 0). With this fix, 254 no longer panics. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In degraded mode, the struct btrfs_device of missing devices doesn't have
device->name set. A kstrdup() of NULL correctly returns NULL. Don't
BUG in this case.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This adds extra checks to make sure the inode map we are caching really
belongs to a FS root instead of a special relocation tree. It
prevents crashes during balancing operations.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The free space cache uses only one page for crcs right now,
which means we can't have a cache file bigger than the
crcs we can fit in the first page. This adds a check to
enforce that restriction.
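Roughly, assuming 4 KiB pages and 32-bit crcs (assumed constants, for
illustration only; the real check lives in the free space cache code):

/* One page of u32 crcs can describe at most PAGE_SIZE/4 cache pages. */
#include <stdint.h>
#include <stdio.h>

#define MODEL_PAGE_SIZE 4096u

int main(void)
{
	uint64_t max_pages = MODEL_PAGE_SIZE / sizeof(uint32_t);  /* 1024 */
	uint64_t max_bytes = max_pages * MODEL_PAGE_SIZE;	   /* 4 MiB */
	uint64_t cache_size = 8u << 20;				   /* 8 MiB */

	printf("cache file fits: %d\n", cache_size <= max_bytes); /* 0 */
	return 0;
}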
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The current scrub implementation reuses bios and pages as often as possible,
allocating them only on start and releasing them when finished. This leads
to more problems with the block layer than it's worth. The elevator gets
confused when there are more pages added to the bio than bi_size suggests.
This patch completely rips out the reuse of bios and pages and allocates
them freshly for each submit.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'for-linus' of git://git.kernel.dk/linux-block:
block: Use hlist_entry() for io_context.cic_list.first
cfq-iosched: Remove bogus check in queue_fail path
xen/blkback: potential null dereference in error handling
xen/blkback: don't call vbd_size() if bd_disk is NULL
block: blkdev_get() should access ->bd_disk only after success
CFQ: Fix typo and remove unnecessary semicolon
block: remove unwanted semicolons
Revert "block: Remove extra discard_alignment from hd_struct."
nbd: adjust 'max_part' according to part_shift
nbd: limit module parameters to a sane value
nbd: pass MSG_* flags to kernel_recvmsg()
block: improve the bio_add_page() and bio_add_pc_page() descriptions
The free space fixup is currently initiated during mount after the call to
ubifs_write_master() which results in a write to PEBs; this has been observed
with the patch 'assert no fixup when writing a node' applied:
Move the free space fixup on mount to before the calls to
ubifs_recover_inl_heads() and ubifs_write_master(). This results in no
assertions with the previously mentioned patch applied.
Artem: tweaked the patch a bit
Signed-off-by: Ben Gardiner <bengardiner@nanometrics.ca>
Reviewed-by: Matthew L. Creech <mlcreech@gmail.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
The current 'mount_ubifs()' implementation does not initialize the LPT until
the master node is marked dirty. Move the LPT initialization to before marking
the master node dirty. This is a preparation for the next patch which will move
the free-space-fixup check to before marking the master node dirty, because we
have to fix-up the free space before doing any writes.
Artem: massaged the patch and commit message.
Signed-off-by: Ben Gardiner <bengardiner@nanometrics.ca>
Reviewed-by: Matthew L. Creech <mlcreech@gmail.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
The current free space fixup can result in some writing to the UBI volume
when the space_fixup flag is set.
To catch instances where UBIFS is writing to the NAND while the space_fixup
flag is set, add an assert to ubifs_write_node().
Artem: tweaked the patch, added similar assertion to the write buffer
write path.
Signed-off-by: Ben Gardiner <bengardiner@nanometrics.ca>
Reviewed-by: Matthew L. Creech <mlcreech@gmail.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
UBIFS maintains per-filesystem and global clean znode counters
('c->clean_zn_cnt' and 'ubifs_clean_zn_cnt'). It is important to maintain
correct values there since the shrinker relies on 'ubifs_clean_zn_cnt'.
However, in case of failures during commit the counters were corrupted. E.g.,
if a failure happens in the middle of 'write_index()', then some nodes in the
commit list ('c->cnext') are marked as clean, and some are marked as dirty.
And 'ubifs_destroy_tnc_subtree()' does not return the correct count when
freeing them, so we end up with a non-zero 'c->clean_zn_cnt' when unmounting.
This means that if we have two file-systems and one of them fails and is
unmounted, 'ubifs_clean_zn_cnt' stays incorrect and confuses the shrinker.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
UBIFS leaks memory on the error path in 'ubifs_jnl_update()' in case of write
failure, because it forgets to free the 'struct ubifs_dent_node *dent' object.
Although the object is small, the alignment can make it large - e.g., 2KiB
if the min. I/O unit is 2KiB.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Cc: stable@kernel.org
Sometimes the VM asks the shrinker to return the number of objects it can
shrink, and we return ubifs_clean_zn_cnt in that case. However, it is possible
for this counter to be negative for a short period of time, due to the way the
UBIFS TNC code updates it. And I can observe the following warnings sometimes:
shrink_slab: ubifs_shrinker+0x0/0x2b7 [ubifs] negative objects to delete nr=-8541616642706119788
This patch makes sure UBIFS never returns negative count of objects.
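The fix amounts to clamping the reported count; roughly (simplified, with
assumed names, not the actual shrinker callback):

/* Never report a transiently negative clean-znode count to the VM. */
#include <stdio.h>

static long ubifs_clean_zn_cnt = -3;	/* can dip below zero briefly */

static long shrinker_count(void)
{
	long nr = ubifs_clean_zn_cnt;

	return nr > 0 ? nr : 0;		/* clamp: report zero instead */
}

int main(void)
{
	printf("objects reported: %ld\n", shrinker_count());	/* 0 */
	return 0;
}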
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Cc: stable@kernel.org
Unfortunately, the recovery fix d1606a59b6be4ea392eabd40d1250aa1eeb19efb
(UBIFS: fix extremely rare mount failure) broke recovery. That commit makes
UBIFS drop the last min. I/O unit in all journal heads, but this is needed only
for the GC head, and it does not work for non-GC heads. For example, suppose
we have min. I/O units A and B, and A contains a valid node X, which
was fsynced, and then a group of nodes Y which spans the rest of A and B. In
this case we'll drop not only Y, but also X, which is obviously incorrect.
This patch fixes the issue by making recovery drop the last min. I/O unit only
for the GC head, leaving things as they have been for ages for
the other heads - this is safer.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Instead of passing "grouped" parameter to 'ubifs_recover_leb()' which tells
whether the nodes are grouped in the LEB to recover, pass the journal head
number and let 'ubifs_recover_leb()' look at the journal head's 'grouped' flag.
This patch is a preparation to a further fix where we'll need to know the
journal head number for other purposes.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Journal heads differ in how UBIFS writes nodes to them: all normal journal
heads receive grouped nodes, while the GC journal head receives ungrouped
nodes. This patch adds a 'grouped' flag to 'struct ubifs_jhead' which
describes this property.
This patch is a preparation to a further recovery fix.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Commit ab51afe05273741f72383529ef488aa1ea598ec6 was a good clean-up, but
it introduced a regression - now UBIFS prints scary error messages during
recovery on all corrupted nodes, even though the corruptions are expected
(due to a power cut). This patch fixes the issue.
Additionally fix a typo in a commentary introduced by the same commit.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
d4dc210f69 (block: don't block events on excl write for non-optical
devices) added dereferencing of bdev->bd_disk to test
GENHD_FL_BLOCK_EVENTS_ON_EXCL_WRITE; however, bdev->bd_disk can be
%NULL if the open failed, which can lead to an oops.
Test the flag after checking that the open was successful, not before.
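The shape of the fix, as a generic sketch (stand-in types and flag bit;
not the actual blkdev_get() code):

/* Only dereference bd_disk after the open has succeeded. */
#include <stdio.h>

struct disk_model { unsigned int flags; };
struct bdev_model { struct disk_model *bd_disk; };

#define MODEL_FL_BLOCK_EVENTS_ON_EXCL_WRITE 0x1	/* stand-in flag bit */

static int model_open(struct bdev_model *bdev, int fail)
{
	static struct disk_model disk = { MODEL_FL_BLOCK_EVENTS_ON_EXCL_WRITE };

	if (fail)
		return -1;		/* bd_disk stays NULL on failure */
	bdev->bd_disk = &disk;
	return 0;
}

int main(void)
{
	struct bdev_model bdev = { 0 };
	int ret = model_open(&bdev, 1);

	/* Check the flag only on the success path, never before. */
	if (ret == 0 &&
	    (bdev.bd_disk->flags & MODEL_FL_BLOCK_EVENTS_ON_EXCL_WRITE))
		printf("block events for exclusive write\n");
	else
		printf("open failed or flag not set\n");
	return 0;
}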
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: David Miller <davem@davemloft.net>
Tested-by: David Miller <davem@davemloft.net>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The dentry_unhash push-down series missed that shrink_dcache_parent() needs to
be called prior to rmdir or dir rename to clear DCACHE_REFERENCED and
allow efficient dentry reclaim.
Reported-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
It was not a good idea to start dereferencing disk->queue from
the fs sysfs strategy for displaying discard alignment. We first ran
into a NULL pointer deref, and after fixing that we sometimes
see invalid disk->queue pointer values.
Since discard is the only one of the bunch actually looking into
the queue, just revert the change.
This reverts commit 23ceb5b771.
Conflicts:
fs/partitions/check.c
Now that ecryptfs_lookup_interpose() is no longer using
ecryptfs_header_cache_2 to read in metadata, the kmem_cache can be
removed and the ecryptfs_header_cache_1 kmem_cache can be renamed to
ecryptfs_header_cache.
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
ecryptfs_lookup_interpose() has turned into spaghetti code over the
years. This is an effort to clean it up.
- Shorten overly descriptive variable names such as ecryptfs_dentry
- Simplify gotos and error paths
- Create helper function for reading plaintext i_size from metadata
It also includes an optimization when reading i_size from the metadata.
A complete page-sized kmem_cache_alloc() was being done to read in 16
bytes of metadata. The buffer for that is now statically declared.
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Instead of having the calling functions translate the true/false return
code to either 0 or -EINVAL, have contains_ecryptfs_marker() return 0 or
-EINVAL so that the calling functions can just reuse the return code.
Also, rename the function to ecryptfs_validate_marker() to avoid callers
mistakenly thinking that it returns true/false codes.
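A sketch of the new return convention (the marker check body here is just
a placeholder, not the real validation logic):

/* Return 0 or -EINVAL directly so callers can propagate the value. */
#include <errno.h>
#include <stdio.h>
#include <string.h>

static int ecryptfs_validate_marker(const unsigned char *data)
{
	/* placeholder check standing in for the real marker validation */
	return memcmp(data, "MAGIC", 5) == 0 ? 0 : -EINVAL;
}

int main(void)
{
	int rc = ecryptfs_validate_marker((const unsigned char *)"MAGIC...");

	if (rc)
		return rc;	/* caller just reuses the return code */
	printf("valid marker\n");
	return 0;
}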
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Only unlock and d_add() new inodes after the plaintext inode size has
been read from the lower filesystem. This fixes a race condition that
was sometimes seen during a multi-job kernel build in an eCryptfs mount.
https://bugzilla.kernel.org/show_bug.cgi?id=36002
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Reported-by: David <david@unsolicited.net>
Tested-by: David <david@unsolicited.net>
* 'for-2.6.40' of git://linux-nfs.org/~bfields/linux: (22 commits)
nfsd: make local functions static
NFSD: Remove unused variable from nfsd4_decode_bind_conn_to_session()
NFSD: Check status from nfsd4_map_bcts_dir()
NFSD: Remove setting unused variable in nfsd_vfs_read()
nfsd41: error out on repeated RECLAIM_COMPLETE
nfsd41: compare request's opcnt with session's maxops at nfsd4_sequence
nfsd v4.1 lOCKT clientid field must be ignored
nfsd41: add flag checking for create_session
nfsd41: make sure nfs server process OPEN with EXCLUSIVE4_1 correctly
nfsd4: fix wrongsec handling for PUTFH + op cases
nfsd4: make fh_verify responsibility of nfsd_lookup_dentry caller
nfsd4: introduce OPDESC helper
nfsd4: allow fh_verify caller to skip pseudoflavor checks
nfsd: distinguish functions of NFSD_MAY_* flags
svcrpc: complete svsk processing on cb receive failure
svcrpc: take advantage of tcp autotuning
SUNRPC: Don't wait for full record to receive tcp data
svcrpc: copy cb reply instead of pages
svcrpc: close connection if client sends short packet
svcrpc: note network-order types in svc_process_calldir
...
* 'nfs-for-2.6.40' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
SUNRPC: Support for RPC over AF_LOCAL transports
SUNRPC: Remove obsolete comment
SUNRPC: Use AF_LOCAL for rpcbind upcalls
SUNRPC: Clean up use of curly braces in switch cases
NFS: Revert NFSROOT default mount options
SUNRPC: Rename xs_encode_tcp_fragment_header()
nfs,rcu: convert call_rcu(nfs_free_delegation_callback) to kfree_rcu()
nfs41: Correct offset for LAYOUTCOMMIT
NFS: nfs_update_inode: print current and new inode size in debug output
NFSv4.1: Fix the handling of NFS4ERR_SEQ_MISORDERED errors
NFSv4: Handle expired stateids when the lease is still valid
SUNRPC: Deal with the lack of a SYN_SENT sk->sk_state_change callback...